Project

pdfsink

0.0
The project is in a healthy, maintained state
A Ruby gem that wraps the pdfsink-rs CLI, a fast pure-Rust PDF extraction tool, providing text, word, object, table, link, and regex-search extraction from PDFs for use in Rails applications.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
 Dependencies
 Project Readme

Pdfsink

Ruby wrapper for pdfsink-rs, a fast pure-Rust PDF extraction tool. Extract text, words, page objects, tables, hyperlinks, and regex matches from PDFs — without a Python runtime.

The gem shells out to the bundled pdfsink-rs binary and parses its JSON output, so there is nothing to load into your Ruby process and no FFI.

Requirements

  • Ruby >= 3.2
  • For source installs only: a Rust toolchain (cargo) to compile the binary. Precompiled platform gems bundle the binary and need no toolchain.

Installation

# Gemfile
gem "pdfsink"
bundle install
# or
gem install pdfsink

On supported platforms (x86_64-linux, aarch64-linux, x86_64-darwin, arm64-darwin) a precompiled gem ships the binary. Elsewhere the source gem compiles it on install via cargo install pdfsink-rs.

Building from source

git clone https://github.com/AccountAim/pdfsink-ruby
cd pdfsink-ruby
bundle install
bundle exec rake cargo:build   # compiles the binary into lib/pdfsink/
bundle exec rspec

To point the gem at a binary you built elsewhere, set PDFSINK_BIN:

export PDFSINK_BIN=/path/to/pdfsink-rs

Usage

Open a document

doc = Pdfsink.open("report.pdf")
doc.page_count        # => 12
doc.pages             # => [#<Pdfsink::Page number=1 612.0x792.0>, ...]
doc.each_page { |page| puts page.extract_text }

Extract text

page = doc.page(1)
page.extract_text     # => "Quarterly Report\n..."

# One-shot, without holding a Document:
Pdfsink.extract_text("report.pdf", page: 1)

Words with positions

page.extract_words
# => [{"text" => "Quarterly", "x0" => 72.0, "top" => 90.1, ...}, ...]

Page objects

page.object_counts    # => {"chars" => 812, "lines" => 14, "rects" => 3, ...}
page.objects          # => {"chars" => [...], "lines" => [...], ...}

Tables

page.tables                      # default strategy ("lines")
page.tables(strategy: :text)     # infer from text alignment
# => [["Name", "Age", "City"], ["Alice", "31", "Oakland"], ...]

Strategies: :lines, :lines_strict, :text, :explicit (see Pdfsink::TableStrategy).

Hyperlinks

page.links
# => [{"uri" => "https://example.com", "x0" => 72.0, ...}]

Search

page.search(/total:\s*\$\d+/i)
# => [{"text" => "Total: $420", "x0" => ..., ...}]

Binary version

Pdfsink.version   # => "0.2.8"

Rails Integration

# config/application.rb (or an initializer)
config.pdfsink.default_table_strategy = :text
config.pdfsink.binary_path = Rails.root.join("bin/pdfsink-rs").to_s

Configuration

Pdfsink.configure do |config|
  config.default_table_strategy = :text   # used when Page#tables gets no strategy
end

The binary is located in this order:

  1. PDFSINK_BIN environment variable
  2. lib/pdfsink/pdfsink-rs inside the gem (bundled / built)
  3. ext/pdfsink/bin/pdfsink-rs (dev location)
  4. pdfsink-rs on PATH

Development

bundle install
bundle exec rake cargo:build
bundle exec rspec

License

MIT — see LICENSE.