Project

parsekit

Native Ruby gem for parsing documents (PDF, DOCX, XLSX, images with OCR) with zero runtime dependencies. Statically links MuPDF for PDF extraction and Tesseract for OCR.
 Dependencies

Development

~> 13.0
~> 3.0
~> 0.22

Runtime

~> 0.9
 Project Readme

parsekit

Native Ruby bindings for the parser-core Rust crate, providing high-performance document parsing and text extraction capabilities through Magnus. This gem wraps parser-core to extract text from PDFs, Office documents (DOCX, XLSX, PPTX), images (with OCR), and more. Part of the ruby-nlp ecosystem.

Features

  • 📄 Document Parsing: Extract text from PDFs, Office documents (DOCX, XLSX, PPTX)
  • 🖼️ OCR Support: Extract text from images using Tesseract OCR
  • 🚀 High Performance: Native Rust performance with Ruby convenience
  • 🔧 Unified API: Single interface for multiple document formats
  • 📦 Cross-Platform: Works on Linux, macOS, and Windows
  • 🧪 Well Tested: Comprehensive test suite with RSpec

Installation

Add this line to your application's Gemfile:

gem 'parsekit'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install parsekit

Requirements

  • Ruby >= 3.0.0
  • Rust toolchain (stable)
  • C compiler (for linking)
  • System libraries for document parsing:
    • macOS: brew install leptonica tesseract poppler
    • Ubuntu/Debian: sudo apt-get install libleptonica-dev libtesseract-dev libpoppler-cpp-dev
    • Fedora/RHEL: sudo dnf install leptonica-devel tesseract-devel poppler-cpp-devel
    • Windows: See DEPENDENCIES.md for MSYS2 instructions

For detailed installation instructions and troubleshooting, see DEPENDENCIES.md.

Usage

Basic Usage

require 'parsekit'

# Parse a PDF file
text = ParseKit.parse_file("document.pdf")
puts text  # Extracted text from the PDF

# Parse an Office document
text = ParseKit.parse_file("presentation.pptx")
puts text  # Extracted text from all slides

# Parse an Excel file
text = ParseKit.parse_file("spreadsheet.xlsx")
puts text  # Extracted text from all sheets

# Parse binary data directly
file_data = File.binread("document.pdf")
text = ParseKit.parse_bytes(file_data)
puts text

# Parse with a Parser instance
parser = ParseKit::Parser.new
text = parser.parse_file("report.docx")
puts text

Module-Level Convenience Methods

# Parse files directly
content = ParseKit.parse_file('document.pdf')

# Parse bytes
data = File.read('document.pdf', mode: 'rb')
content = ParseKit.parse_bytes(data.bytes)

# Check supported formats
formats = ParseKit.supported_formats
# => ["txt", "json", "xml", "html", "docx", "xlsx", "xls", "csv", "pdf", "png", "jpg", "jpeg", "tiff", "bmp"]

# Check if a file is supported
ParseKit.supports_file?('document.pdf')  # => true
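
When a directory mixes parseable and unparseable files, the format list can be used to filter paths before any parsing starts. The following is a minimal sketch that assumes extension matching is sufficient. `parseable_paths` is a hypothetical helper, not part of ParseKit, and the format list is passed in as an argument so the snippet runs without the gem installed. With the gem loaded, `ParseKit.supported_formats` would supply that list.

```ruby
# Hypothetical helper (not part of ParseKit): keep only the paths whose
# extension appears in a list of supported formats.
def parseable_paths(paths, formats)
  allowed = formats.map(&:downcase)
  paths.select do |path|
    ext = File.extname(path).delete_prefix('.').downcase
    allowed.include?(ext)
  end
end

# Format list copied from the example output above; with the gem loaded,
# pass ParseKit.supported_formats instead.
formats = %w[txt json xml html docx xlsx xls csv pdf png jpg jpeg tiff bmp]
paths = ['report.PDF', 'notes.txt', 'archive.tar.gz', 'README']

puts parseable_paths(paths, formats).inspect
# prints ["report.PDF", "notes.txt"]
```

Matching is case-insensitive here as a design choice of the sketch. Whether `ParseKit.supports_file?` treats upper-case extensions the same way is not stated in this README.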

Configuration Options

# Create parser with options
parser = ParseKit::Parser.new(
  strict_mode: true,
  max_size: 50 * 1024 * 1024,  # 50MB limit
  encoding: 'UTF-8'
)

# Or use the strict convenience method
parser = ParseKit::Parser.strict
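
The `max_size` option caps input size inside the parser. The same limit can also be checked before a file is read into memory. The sketch below works under that assumption. `within_size_limit?` is a hypothetical helper, not a ParseKit method, and the 50 MB default only mirrors the value used above. How the parser itself reports an oversized input is not documented here.

```ruby
require 'tempfile'

MAX_SIZE = 50 * 1024 * 1024 # 50 MB, mirroring the max_size value above

# Hypothetical pre-check (not part of ParseKit): true when the path is a
# regular file no larger than the limit.
def within_size_limit?(path, limit = MAX_SIZE)
  File.file?(path) && File.size(path) <= limit
end

# Demonstration with a small temporary file (5 bytes).
Tempfile.create('sample') do |f|
  f.write('hello')
  f.flush
  puts within_size_limit?(f.path)     # under the default limit
  puts within_size_limit?(f.path, 4)  # over a 4-byte limit
end
```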

Format-Specific Parsing

parser = ParseKit::Parser.new

# Direct access to format-specific parsers
pdf_data = File.read('document.pdf', mode: 'rb').bytes
pdf_text = parser.parse_pdf(pdf_data)

image_data = File.read('image.png', mode: 'rb').bytes
ocr_text = parser.ocr_image(image_data)

excel_data = File.read('data.xlsx', mode: 'rb').bytes
excel_text = parser.parse_xlsx(excel_data)

Supported Formats

Format   | Extensions                         | Method     | Notes
PDF      | .pdf                               | parse_pdf  | Text extraction via MuPDF
Word     | .docx                              | parse_docx | Office Open XML format
Excel    | .xlsx, .xls                        | parse_xlsx | Both modern and legacy formats
Images   | .png, .jpg, .jpeg, .tiff, .bmp     | ocr_image  | OCR via embedded Tesseract
JSON     | .json                              | parse_json | Pretty-printed output
XML/HTML | .xml, .html                        | parse_xml  | Extracts text content
Text     | .txt, .csv, .md                    | parse_text | With encoding detection
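
The extension-to-method mapping in the table can be expressed as a lookup for code that wants to call a format-specific method directly. This is only a sketch: `FORMAT_METHODS`, `method_for`, and `parse_with` are hypothetical names, not part of ParseKit. It also assumes every format-specific method accepts the same byte input, which the examples above show only for `parse_pdf`, `ocr_image`, and `parse_xlsx`.

```ruby
# Extension-to-method mapping taken from the table above.
FORMAT_METHODS = {
  '.pdf'  => :parse_pdf,
  '.docx' => :parse_docx,
  '.xlsx' => :parse_xlsx, '.xls'  => :parse_xlsx,
  '.png'  => :ocr_image,  '.jpg'  => :ocr_image, '.jpeg' => :ocr_image,
  '.tiff' => :ocr_image,  '.bmp'  => :ocr_image,
  '.json' => :parse_json,
  '.xml'  => :parse_xml,  '.html' => :parse_xml,
  '.txt'  => :parse_text, '.csv'  => :parse_text, '.md'  => :parse_text
}.freeze

# Hypothetical helper: method name for a path, or nil if the extension
# is not in the table.
def method_for(path)
  FORMAT_METHODS[File.extname(path).downcase]
end

# Hypothetical helper: call the matching method on any parser-like object.
# Assumption: the method takes the file's bytes as its only argument.
def parse_with(parser, path, bytes)
  name = method_for(path)
  raise ArgumentError, "unsupported format: #{path}" if name.nil?

  parser.public_send(name, bytes)
end

puts method_for('scan.JPG').inspect     # prints :ocr_image
puts method_for('archive.zip').inspect  # prints nil
```

With the gem loaded, a call might look like `parse_with(ParseKit::Parser.new, 'document.pdf', File.binread('document.pdf').bytes)`. For most uses `parse_file` already covers this, so the lookup is mainly useful when the bytes do not come from a file on disk.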

Performance

ParseKit is built with performance in mind:

  • Native Rust implementation for speed
  • Statically linked C libraries (MuPDF, Tesseract) compiled with optimizations
  • Efficient memory usage with streaming where possible
  • Configurable size limits to prevent memory issues

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests.

To compile the Rust extension:

rake compile

To run tests with coverage:

rake dev:coverage

Architecture

ParseKit uses a hybrid Ruby/Rust architecture:

  • Ruby Layer: Provides convenient API and format detection
  • Rust Layer: Implements high-performance parsing using:
    • MuPDF for PDF text extraction (statically linked)
    • rusty-tesseract for OCR (with embedded Tesseract)
    • Pure Rust libraries for DOCX/XLSX parsing
    • Magnus for Ruby-Rust FFI bindings
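
This README does not say how the Ruby layer detects formats. One common approach, shown here purely as an illustration and not as ParseKit's implementation, is to inspect the leading bytes of the input. `SIGNATURES` and `sniff_format` are hypothetical names. The signatures themselves are the standard file headers for PDF, PNG, JPEG, and ZIP (the container used by DOCX, XLSX, and PPTX).

```ruby
# Well-known leading byte sequences, stored as binary strings.
SIGNATURES = {
  pdf:  '%PDF-'.b,
  png:  "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  zip:  "PK\x03\x04".b # DOCX, XLSX, and PPTX are ZIP containers
}.freeze

# Hypothetical helper (not ParseKit's API): return a symbol naming the
# container format, or nil when no signature matches.
def sniff_format(data)
  head = data.byteslice(0, 8).b
  SIGNATURES.each do |name, magic|
    return name if head.start_with?(magic)
  end
  nil
end

puts sniff_format("%PDF-1.7\n").inspect        # prints :pdf
puts sniff_format("PK\x03\x04rest").inspect    # prints :zip
puts sniff_format('plain text').inspect        # prints nil
```

A ZIP signature alone cannot distinguish DOCX from XLSX, so a real detector would still need the file extension or a look inside the archive.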

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/cpetersen/parsekit.

License

The gem is available as open source under the terms of the MIT License.

Note: This gem includes statically linked versions of MuPDF (AGPL/Commercial) and Tesseract (Apache 2.0). Please review their respective licenses for compliance with your use case.