Ruby bindings for LiteParse — an open-source document parser that extracts text with spatial layout information, bounding boxes, OCR support, and more.
 Dependencies

Development

~> 0.9

Runtime

~> 0.9
 Project Readme

liteparse-rb

Ruby bindings for LiteParse — a fast, open-source document parser that extracts text with spatial layout information, bounding boxes, and OCR support.

Built with magnus for native Rust→Ruby FFI.

Installation

Add liteparse-rb to your application with Bundler:

bundle add liteparse-rb

This will add the gem to your Gemfile and install it.

You can also install it directly with:

gem install liteparse-rb

Runtime Dependencies

The gem bundles PDFium inside platform gems and shells out to external tools for certain input formats.

| Dependency | Required for | When needed | Config / Env var |
| --- | --- | --- | --- |
| Tesseract (C++ library + traineddata) | OCR on scanned pages and images | Always (feature default, can disable with tesseract: false) | TESSDATA_PREFIX or tessdata_path: config option; traineddata auto-downloads if missing |
| ImageMagick (magick or convert) | Image file input (.jpg, .png, .gif, .bmp, .tiff, .webp, .svg) | Only when parsing image files | |
| LibreOffice (libreoffice or soffice) | Office document input (.docx, .pptx, .xlsx, .odt, etc.) | Only when parsing office files | |
| Ghostscript (gs) | Vector format conversion (.svg, .eps, .ps, .ai) | Only when parsing vector files (used by ImageMagick) | |

macOS (Homebrew)

brew install tesseract imagemagick libreoffice ghostscript

macOS (Nix)

nix shell nixpkgs#tesseract nixpkgs#imagemagick nixpkgs#libreoffice nixpkgs#ghostscript

Or add to shell.nix / flake.nix:

{pkgs}: pkgs.mkShell {
  buildInputs = [
    pkgs.tesseract
    pkgs.imagemagick
    pkgs.libreoffice
    pkgs.ghostscript
  ];
}

Linux (Debian/Ubuntu)

sudo apt-get install -y cmake libtesseract-dev tesseract-ocr-eng imagemagick libreoffice ghostscript

Prebuilt platform gems bundle PDFium. If you install the source gem (the fallback on platforms without a prebuilt gem), PDFium is downloaded and cached at build time by the Rust build script, so no manual setup is needed.
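
Because the gem shells out to ImageMagick, LibreOffice, and Ghostscript, it can help to check for them before parsing non-PDF input. The following is a minimal sketch, not part of liteparse-rb: the helper names find_executable and missing_tools are hypothetical, the command names are taken from the table above, and it assumes a Unix-like PATH layout. Tesseract is left out because the table describes it as a linked library rather than a command.

```ruby
# Preflight sketch: report which external tools are missing from PATH.
# find_executable and missing_tools are hypothetical helpers, not gem API.

# Command names from the dependency table; any one command per tool suffices.
TOOL_COMMANDS = {
  "ImageMagick" => %w[magick convert],
  "LibreOffice" => %w[libreoffice soffice],
  "Ghostscript" => %w[gs],
}.freeze

# Returns the full path of the first matching executable on PATH, or nil.
def find_executable(name, path: ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Returns the names of tools for which none of the commands were found.
def missing_tools(tools = TOOL_COMMANDS, path: ENV.fetch("PATH", ""))
  tools.reject { |_tool, cmds| cmds.any? { |cmd| find_executable(cmd, path: path) } }.keys
end

missing = missing_tools
puts missing.empty? ? "All external tools found" : "Missing: #{missing.join(', ')}"
```

A missing tool only matters for the input formats listed against it, so a check like this is better suited to a warning than a hard failure.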

Usage

Basic parsing

require "liteparse"

parser = LiteParse::LiteParse.new
result = parser.parse("document.pdf")

puts result.text          # full document text
puts result.num_pages     # page count

result.pages.each do |page|
  puts "Page #{page.page_num}: #{page.width}x#{page.height}"
  puts page.text
end

Configuration

All keyword args match the Python API:

parser = LiteParse::LiteParse.new(
  ocr_enabled: false,
  output_format: "markdown",
  max_pages: 10,
  dpi: 200,
  password: "secret",
  quiet: true,
  image_mode: "embed",
  extract_links: true,
)
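
Since the options are plain keyword arguments, they can be assembled as a hash and splatted into the constructor. The sketch below builds such a hash from environment variables. The LITEPARSE_* variable names and the liteparse_options helper are assumptions made for illustration; only the option names (ocr_enabled, output_format, max_pages, dpi) come from the example above.

```ruby
# Sketch: build LiteParse keyword arguments from environment variables.
# The LITEPARSE_* names are hypothetical; the gem does not define them.
def liteparse_options(env = ENV)
  opts = {}
  # Any value other than "0" turns OCR on.
  opts[:ocr_enabled]   = env["LITEPARSE_OCR"] != "0"         if env.key?("LITEPARSE_OCR")
  opts[:output_format] = env["LITEPARSE_FORMAT"]             if env.key?("LITEPARSE_FORMAT")
  # Integer() raises ArgumentError on malformed input instead of returning 0.
  opts[:max_pages]     = Integer(env["LITEPARSE_MAX_PAGES"]) if env.key?("LITEPARSE_MAX_PAGES")
  opts[:dpi]           = Integer(env["LITEPARSE_DPI"])       if env.key?("LITEPARSE_DPI")
  opts
end

opts = liteparse_options({ "LITEPARSE_OCR" => "0", "LITEPARSE_DPI" => "200" })
p opts
# With the gem installed, the hash can be passed straight through:
#   parser = LiteParse::LiteParse.new(**opts)
```

Options that are not set in the environment are omitted from the hash, so the parser's own defaults still apply.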

Parse from bytes

data = File.binread("document.pdf")
result = parser.parse_bytes(data)

Text items with bounding boxes

Each page exposes text_items — individual word/phrase runs with position data:

page = result.pages.first  # any page from a parse result

page.text_items.each do |item|
  puts "#{item.text} at (#{item.x}, #{item.y}) #{item.width}x#{item.height}"
  puts "  font: #{item.font_name}, size: #{item.font_size}"
  puts "  confidence: #{item.confidence}"  # OCR confidence (nil for native text)
end
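
Position data makes it possible to reassemble reading order yourself. As a rough sketch, the code below groups items into lines by their y coordinate. It uses a Struct stand-in instead of LiteParse::TextItem so that it runs without the gem, and it assumes that y grows downward and that items on the same line have nearly equal y values. Neither assumption is confirmed by the README, and group_into_lines is a hypothetical helper.

```ruby
# Stand-in for LiteParse::TextItem, carrying only the fields used here.
Item = Struct.new(:text, :x, :y, :width, :height, keyword_init: true)

# Groups items into lines: an item joins the current line when its y is
# within `tolerance` of that line's first item. Returns one string per line.
def group_into_lines(items, tolerance: 2.0)
  lines = []
  items.sort_by { |i| [i.y, i.x] }.each do |item|
    if lines.any? && (item.y - lines.last.first.y).abs <= tolerance
      lines.last << item
    else
      lines << [item]
    end
  end
  # Order each line left to right before joining the text.
  lines.map { |line| line.sort_by(&:x).map(&:text).join(" ") }
end

items = [
  Item.new(text: "World",  x: 60.0, y: 10.5, width: 30.0, height: 12.0),
  Item.new(text: "Hello",  x: 10.0, y: 10.0, width: 40.0, height: 12.0),
  Item.new(text: "Second", x: 10.0, y: 30.0, width: 45.0, height: 12.0),
]
puts group_into_lines(items)
```

With real data, page.text_items could be passed in place of the sample array, provided the coordinate assumptions hold.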

Searching text items

items = page.text_items
matches = LiteParse.search_items(items, "swimmer name", case_sensitive: false)

matches.each do |m|
  puts "Found '#{m.text}' at x=#{m.x} y=#{m.y}"
end
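
When a phrase spans several items, a single enclosing rectangle is often more useful than the individual boxes, for example to highlight the hit. Here is a minimal sketch that computes the union of the boxes. The Box struct stands in for the x, y, width, and height fields of a matched item, and union_box is a hypothetical helper rather than part of the gem.

```ruby
# Stand-in for the geometry fields of a matched text item.
Box = Struct.new(:x, :y, :width, :height)

# Returns the smallest Box covering every given box, or nil if none given.
def union_box(boxes)
  return nil if boxes.empty?

  left   = boxes.map(&:x).min
  top    = boxes.map(&:y).min
  right  = boxes.map { |b| b.x + b.width }.max
  bottom = boxes.map { |b| b.y + b.height }.max
  Box.new(left, top, right - left, bottom - top)
end

matches = [Box.new(10.0, 20.0, 30.0, 12.0), Box.new(45.0, 22.0, 25.0, 12.0)]
region = union_box(matches)
puts "x=#{region.x} y=#{region.y} w=#{region.width} h=#{region.height}"
# prints "x=10.0 y=20.0 w=60.0 h=14.0"
```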

Screenshots

screenshots = parser.screenshot("document.pdf", page_numbers: [1, 3])
screenshots.each do |s|
  File.binwrite("page_#{s.page_num}.png", s.image_bytes)
end

Inspecting config

cfg = parser.config
puts cfg.ocr_enabled       # true/false
puts cfg.output_format     # "json", "text", "markdown"
puts cfg.dpi               # 150.0

Types

| Ruby Class | Description |
| --- | --- |
| LiteParse::LiteParse | Main parser |
| LiteParse::ParseResult | Parsed document with pages/text/images |
| LiteParse::ParsedPage | Single page with text items |
| LiteParse::TextItem | Word/phrase with bounding box |
| LiteParse::ExtractedImage | Embedded raster image (in embed mode) |
| LiteParse::ScreenshotResult | Page screenshot PNG bytes |
| LiteParse::Config | Resolved configuration |
| LiteParse::ParseError | Raised on parse failures |
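
Since parse failures raise LiteParse::ParseError, a batch job can rescue it per file instead of aborting on the first bad document. The sketch below shows one way to do that. parse_all and FakeParser are hypothetical and exist only so the example runs without the gem; it assumes nothing about the parser beyond a parse(path) method.

```ruby
# Sketch: parse several files, collecting failures instead of aborting.
# With the gem loaded, pass error_class: LiteParse::ParseError.
def parse_all(parser, paths, error_class: StandardError)
  results = {}
  failures = {}
  paths.each do |path|
    results[path] = parser.parse(path)
  rescue error_class => e
    failures[path] = e.message
  end
  [results, failures]
end

# Stand-in parser so the sketch runs without liteparse-rb installed.
class FakeParser
  def parse(path)
    raise ArgumentError, "cannot parse #{path}" unless path.end_with?(".pdf")

    "parsed #{path}"
  end
end

results, failures = parse_all(FakeParser.new, ["a.pdf", "b.xyz"], error_class: ArgumentError)
puts "ok: #{results.keys.join(', ')}"
puts "failed: #{failures.keys.join(', ')}"
```

Rescuing only the specific error class keeps unrelated bugs, such as a NoMethodError, from being silently recorded as parse failures.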

Development

Prerequisites: Rust toolchain, Ruby 3.3+, cmake.

bundle install
rake compile
ruby -I lib -e "require 'liteparse'; puts LiteParse::VERSION"

Release

# Bump version in lib/liteparse/version.rb and crates/*/Cargo.toml
# Add entry to CHANGELOG.md
bundle install    # syncs Gemfile.lock with new version
git add -A && git commit -m "Release v0.1.0"
git tag v0.1.0
git push && git push --tags

Pushing the tag triggers CI, which then publishes the prebuilt platform gems to rubygems.org.

License

Apache-2.0 — same as LiteParse upstream.