liteparse-rb
Ruby bindings for LiteParse — a fast, open-source document parser that extracts text with spatial layout information, bounding boxes, and OCR support.
Built with magnus for native Rust→Ruby FFI.
Installation
Add liteparse-rb to your application with Bundler:
bundle add liteparse-rb
This will add the gem to your Gemfile and install it.
You can also install it directly with:
gem install liteparse-rb
Runtime Dependencies
The gem bundles PDFium inside platform gems and shells out to external tools for certain input formats.
| Dependency | Required for | When needed | Config / Env var |
|---|---|---|---|
| Tesseract (C++ library + traineddata) | OCR on scanned pages and images | Always (feature default, can disable with tesseract: false) | TESSDATA_PREFIX or tessdata_path: config option; traineddata auto-downloads if missing |
| ImageMagick (magick or convert) | Image file input (.jpg, .png, .gif, .bmp, .tiff, .webp, .svg) | Only when parsing image files | none |
| LibreOffice (libreoffice or soffice) | Office document input (.docx, .pptx, .xlsx, .odt, etc.) | Only when parsing office files | none |
| Ghostscript (gs) | Vector format conversion (.svg, .eps, .ps, .ai) | Only when parsing vector files (used by ImageMagick) | none |
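Since the optional tools in the table are only needed for certain input formats, it can help to check which ones are reachable before parsing. The following is a minimal sketch, not part of liteparse-rb: the helper names `find_executable` and `detect_tools` are hypothetical, and the executable names are taken from the table. Tesseract is omitted because the table lists it as a library rather than a command, and the lookup does not account for Windows `.exe` suffixes.

```ruby
# Executable names per tool, as listed in the dependency table.
TOOLS = {
  "ImageMagick" => %w[magick convert],
  "LibreOffice" => %w[libreoffice soffice],
  "Ghostscript" => %w[gs]
}.freeze

# Return the full path of the first matching executable on PATH, or nil.
def find_executable(names, path_env = ENV.fetch("PATH", ""))
  dirs = path_env.split(File::PATH_SEPARATOR).reject(&:empty?)
  names.each do |name|
    dirs.each do |dir|
      candidate = File.join(dir, name)
      return candidate if File.file?(candidate) && File.executable?(candidate)
    end
  end
  nil
end

# Map each tool to the executable found for it (nil when missing).
def detect_tools(path_env = ENV.fetch("PATH", ""))
  TOOLS.transform_values { |names| find_executable(names, path_env) }
end

if __FILE__ == $PROGRAM_NAME
  detect_tools.each do |tool, path|
    puts format("%-12s %s", tool, path || "not found")
  end
end
```

A missing entry only matters if you plan to parse the corresponding file types; PDF input does not depend on any of these three tools.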
macOS (Homebrew)
brew install tesseract imagemagick libreoffice ghostscript
macOS (Nix)
nix shell nixpkgs#tesseract nixpkgs#imagemagick nixpkgs#libreoffice nixpkgs#ghostscript
Or add to shell.nix / flake.nix:
{pkgs}: pkgs.mkShell {
  buildInputs = [
    pkgs.tesseract
    pkgs.imagemagick
    pkgs.libreoffice
    pkgs.ghostscript
  ];
}
Linux (Debian/Ubuntu)
sudo apt-get install -y cmake libtesseract-dev tesseract-ocr-eng imagemagick libreoffice ghostscript
The prebuilt platform gems include PDFium. If you install the source gem (used on platforms without a prebuilt gem), PDFium is downloaded and cached at build time by the Rust build script, so no manual setup is needed.
Usage
Basic parsing
require "liteparse"
parser = LiteParse::LiteParse.new
result = parser.parse("document.pdf")
puts result.text # full document text
puts result.num_pages # page count
result.pages.each do |page|
  puts "Page #{page.page_num}: #{page.width}x#{page.height}"
  puts page.text
end
Configuration
All keyword args match the Python API:
parser = LiteParse::LiteParse.new(
  ocr_enabled: false,
  output_format: "markdown",
  max_pages: 10,
  dpi: 200,
  password: "secret",
  quiet: true,
  image_mode: "embed",
  extract_links: true,
)
Parse from bytes
data = File.binread("document.pdf")
result = parser.parse_bytes(data)
Text items with bounding boxes
Each page exposes text_items — individual word/phrase runs with position data:
page.text_items.each do |item|
  puts "#{item.text} at (#{item.x}, #{item.y}) #{item.width}x#{item.height}"
  puts "  font: #{item.font_name}, size: #{item.font_size}"
  puts "  confidence: #{item.confidence}" # OCR confidence (nil for native text)
end
Searching text items
items = page.text_items
matches = LiteParse.search_items(items, "swimmer name", case_sensitive: false)
matches.each do |m|
  puts "Found '#{m.text}' at x=#{m.x} y=#{m.y}"
end
Screenshots
screenshots = parser.screenshot("document.pdf", page_numbers: [1, 3])
screenshots.each do |s|
  File.binwrite("page_#{s.page_num}.png", s.image_bytes)
end
Inspecting config
cfg = parser.config
puts cfg.ocr_enabled # true/false
puts cfg.output_format # "json", "text", "markdown"
puts cfg.dpi # 150.0
Types
| Ruby Class | Description |
|---|---|
| LiteParse::LiteParse | Main parser |
| LiteParse::ParseResult | Parsed document with pages/text/images |
| LiteParse::ParsedPage | Single page with text items |
| LiteParse::TextItem | Word/phrase with bounding box |
| LiteParse::ExtractedImage | Embedded raster image (in embed mode) |
| LiteParse::ScreenshotResult | Page screenshot PNG bytes |
| LiteParse::Config | Resolved configuration |
| LiteParse::ParseError | Raised on parse failures |
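The positional fields on LiteParse::TextItem (text, x, y, width, height, as used in the usage examples) are enough to rebuild reading-order lines yourself. The sketch below illustrates one way to do that and is not part of the gem: `Item` is a hypothetical Struct standing in for LiteParse::TextItem so the example runs without the native extension, `group_into_lines` is an illustrative helper, and it assumes numeric coordinates with y growing down the page, which you should verify against real output.

```ruby
# Stand-in for LiteParse::TextItem with only the fields used here (hypothetical).
Item = Struct.new(:text, :x, :y, :width, :height)

# Group items whose y lies within `tolerance` of the first item on the
# current line, then order each line left to right and join the words.
def group_into_lines(items, tolerance: 2.0)
  lines = []
  items.sort_by { |i| [i.y, i.x] }.each do |item|
    line = lines.last
    if line && (item.y - line.first.y).abs <= tolerance
      line << item
    else
      lines << [item]
    end
  end
  lines.map { |line| line.sort_by(&:x).map(&:text).join(" ") }
end

items = [
  Item.new("world", 60.0, 100.4, 30.0, 12.0),
  Item.new("Hello", 20.0, 100.0, 35.0, 12.0),
  Item.new("Second", 20.0, 120.0, 40.0, 12.0),
  Item.new("line", 65.0, 119.5, 20.0, 12.0)
]

puts group_into_lines(items)
# prints:
# Hello world
# Second line
```

With real results you would pass `page.text_items` instead of the sample array; the tolerance likely needs tuning per document, since OCR output tends to have noisier baselines than native text.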
Development
Prerequisites: Rust toolchain, Ruby 3.3+, cmake.
bundle install
rake compile
ruby -I lib -e "require 'liteparse'; puts LiteParse::VERSION"
Release
# Bump version in lib/liteparse/version.rb and crates/*/Cargo.toml
# Add entry to CHANGELOG.md
bundle install # syncs Gemfile.lock with new version
git add -A && git commit -m "Release v0.1.0"
git tag v0.1.0
git push && git push --tags
Pushing the tag triggers CI, which then publishes the prebuilt platform gems to rubygems.org.
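Because the release bumps the version in two places, a quick consistency check before tagging can catch a missed file. This is a hedged sketch rather than part of the project's tooling: the helper names are hypothetical, and it assumes version.rb contains a line of the form `VERSION = "x.y.z"` and that each crate's Cargo.toml declares a literal `version = "x.y.z"` under `[package]`. A crate that inherits its version from a workspace would be reported as a mismatch.

```ruby
# Read the gem version from a version.rb file (assumed form: VERSION = "x.y.z").
def gem_version(path)
  File.read(path)[/VERSION\s*=\s*["']([^"']+)["']/, 1]
end

# Read the literal version declared in the [package] table of a Cargo.toml.
def crate_version(path)
  in_package = false
  File.foreach(path) do |line|
    stripped = line.strip
    if stripped.start_with?("[")
      # Track whether we are inside the [package] table.
      in_package = (stripped == "[package]")
    elsif in_package && (m = stripped.match(/\Aversion\s*=\s*"([^"]+)"/))
      return m[1]
    end
  end
  nil
end

# Return the Cargo.toml paths whose version differs from the gem version.
def mismatched_crates(root)
  expected = gem_version(File.join(root, "lib/liteparse/version.rb"))
  Dir.glob(File.join(root, "crates/*/Cargo.toml")).sort.reject do |toml|
    crate_version(toml) == expected
  end
end
```

Calling `mismatched_crates(Dir.pwd)` from the repository root returns an empty array when every crate matches the gem version.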
License
Apache-2.0 — same as LiteParse upstream.