Native Ruby bindings for the parser-core Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.
Features
- Document Parsing: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
- OCR Support: Extract text from images using Tesseract OCR
- High Performance: Native Rust performance with Ruby convenience
- Unified API: Single interface for multiple document formats
- Cross-Platform: Works on Linux, macOS, and Windows
- Well Tested: Comprehensive test suite with RSpec
Installation
Add this line to your application's Gemfile:
gem 'parsekit'
And then execute:
$ bundle install
Or install it yourself as:
$ gem install parsekit
Requirements
- Ruby >= 3.0.0
- Rust toolchain (stable)
- C compiler (for linking)
- System libraries for document parsing:
  - macOS:
    brew install leptonica tesseract poppler
  - Ubuntu/Debian:
    sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev
  - Fedora/RHEL:
    sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel
  - Windows: See DEPENDENCIES.md for MSYS2 instructions
For detailed installation instructions and troubleshooting, see DEPENDENCIES.md.
Usage
Basic Usage
require 'parsekit'
# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text # Extracted text from the PDF
# Parse an Office document
text = ParseKit.parse_file("presentation.pptx")
puts text # Extracted text from all slides
# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text # Extracted text from all sheets
# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text
# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text
Module-Level Convenience Methods
# Parse files directly
content = ParseKit.parse_file('document.pdf')
# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)
# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]
# Check if a file is supported
ParseKit.supports_file?('document.pdf') # => true
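When processing a whole directory, it can help to separate supported files from unsupported ones before parsing anything. The sketch below is a minimal illustration, not part of the gem: `partition_by_support` is a hypothetical helper, and it assumes the format list has the same shape as the `ParseKit.supported_formats` output shown above (lowercase extensions without the leading dot). It runs without the gem installed because the list is passed in explicitly.

```ruby
# Hypothetical helper (not provided by parsekit): split paths into those whose
# extension appears in `formats` and those whose extension does not.
# `formats` is assumed to look like ParseKit.supported_formats.
def partition_by_support(paths, formats)
  paths.partition do |path|
    # File.extname returns ".pdf" (or "" when there is no extension).
    ext = File.extname(path).delete_prefix(".").downcase
    formats.include?(ext)
  end
end

# Format list copied from the supported_formats example above.
formats = %w[txt json xml html docx xlsx xls csv pdf png jpg jpeg tiff bmp]

supported, skipped = partition_by_support(
  ["report.PDF", "notes.docx", "archive.zip", "README"],
  formats
)

puts "supported: #{supported.inspect}"
puts "skipped:   #{skipped.inspect}"
```

With the gem loaded, the same helper could be called with `ParseKit.supported_formats` as the second argument, and each entry in `supported` passed to `ParseKit.parse_file`.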
Configuration Options
# Create parser with options
parser = ParseKit::Parser.new(
strict_mode: true,
max_size: 50 * 1024 * 1024, # 50MB limit
encoding: 'UTF-8'
)
# Or use the strict convenience method
parser = ParseKit::Parser.strict
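This README does not describe what the parser does when an input exceeds `max_size`, so callers who want a predictable outcome may prefer to check the file size themselves first. The following is a rough sketch under that assumption: `within_size_limit?` and `MAX_SIZE` are hypothetical names, not part of the gem, and the check uses only Ruby's standard library.

```ruby
require "tempfile"

# Same 50MB figure used in the configuration example above.
MAX_SIZE = 50 * 1024 * 1024

# Hypothetical pre-check (not provided by parsekit): true when the file at
# `path` is no larger than `max_size` bytes.
def within_size_limit?(path, max_size = MAX_SIZE)
  File.size(path) <= max_size
end

# Demonstrate the check on a small temporary file.
Tempfile.create("sample") do |file|
  file.write("hello")
  file.flush
  puts within_size_limit?(file.path)     # 5 bytes against the 50MB default
  puts within_size_limit?(file.path, 3)  # 5 bytes against a 3-byte limit
end
```

A caller could run this check before `parser.parse_file(path)` and skip or log oversized files instead of handing them to the parser.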
Format-Specific Parsing
parser = ParseKit::Parser.new
# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)
image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)
excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)
Supported Formats
Format | Extensions | Method | Notes |
---|---|---|---|
PDF | .pdf | parse_pdf | Text extraction via MuPDF |
Word | .docx | parse_docx | Office Open XML format |
Excel | .xlsx, .xls | parse_xlsx | Both modern and legacy formats |
Images | .png, .jpg, .jpeg, .tiff, .bmp | ocr_image | OCR via embedded Tesseract |
JSON | .json | parse_json | Pretty-printed output |
XML/HTML | .xml, .html | parse_xml | Extracts text content |
Text | .txt, .csv, .md | parse_text | With encoding detection |
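The table above can also be read as a lookup from file extension to parser method. The sketch below writes that lookup out in plain Ruby. It only mirrors the table and is not the gem's own format detection; `METHOD_FOR_EXTENSION` and `parser_method_for` are hypothetical names introduced for illustration.

```ruby
# Illustrative mapping from extension to the method name listed in the table.
# This is an assumption-level sketch, not parsekit's internal code.
METHOD_FOR_EXTENSION = {
  "pdf"  => :parse_pdf,
  "docx" => :parse_docx,
  "xlsx" => :parse_xlsx,
  "xls"  => :parse_xlsx,
  "png"  => :ocr_image,
  "jpg"  => :ocr_image,
  "jpeg" => :ocr_image,
  "tiff" => :ocr_image,
  "bmp"  => :ocr_image,
  "json" => :parse_json,
  "xml"  => :parse_xml,
  "html" => :parse_xml,
  "txt"  => :parse_text,
  "csv"  => :parse_text,
  "md"   => :parse_text
}.freeze

# Return the method name for a path, or nil when the extension is not listed.
def parser_method_for(path)
  ext = File.extname(path).delete_prefix(".").downcase
  METHOD_FOR_EXTENSION[ext]
end

%w[report.pdf scan.JPG data.xls archive.zip].each do |name|
  puts "#{name} -> #{parser_method_for(name).inspect}"
end
```

In most cases `ParseKit.parse_file` is the simpler choice; a lookup like this is mainly useful when the bytes are already in memory and the format-specific method has to be chosen by hand.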
Performance
ParseKit is built with performance in mind:
- Native Rust implementation for speed
- Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
- Efficient memory usage with streaming where possible
- Configurable size limits to prevent memory issues
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests.
To compile the Rust extension:
rake compile
To run tests with coverage:
rake dev:coverage
Architecture
ParseKit uses a hybrid Ruby/Rust architecture:
- Ruby Layer: Provides convenient API and format detection
- Rust Layer: Implements high-performance parsing using:
  - MuPDF for PDF text extraction (statically linked)
  - rusty-tesseract for OCR (with embedded Tesseract)
  - Pure Rust libraries for DOCX/XLSX parsing
  - Magnus for Ruby-Rust FFI bindings
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/parsekit.
License
The gem is available as open source under the terms of the MIT License.
Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.