PDF OCR
A lightweight Ruby gem for extracting text from PDFs, including scanned PDFs using OCR.
This gem supports:
- PDFs with readable text
- Scanned PDFs using Tesseract OCR
- File objects, file paths, StringIO, and Rails/ActiveStorage uploads
- Fully Rails-independent
๐ Features
- Detect if PDF is scanned or text-based
- Extract text from normal PDFs using
PDF::Reader - Extract text from scanned PDFs using
RTesseractandMiniMagick - Automatic cleanup of temporary images
๐ป Installation
Add this line to your application's Gemfile:
gem 'pdf_ocr'Or install directly:
gem install pdf_ocrDependencies
-
PDF::Reader
-
RTesseract
-
MiniMagick
-
Tesseract OCR (system-level executable)
-
pdftoppm from Poppler utils (for converting PDF pages to images)
โ๏ธ Usage
require 'pdf_ocr'
require 'stringio'
# From a File object
file = File.open("path/to/document.pdf")
result = Ocr::DataExtractor.new(file).call
puts result["raw_text"] if result["success"]
# From a file path string
result = Ocr::DataExtractor.new("path/to/document.pdf").call
# From a StringIO object (in-memory PDF)
pdf_data = StringIO.new(File.read("path/to/document.pdf"))
result = Ocr::DataExtractor.new(pdf_data).callExample Result
{
"success" => true,
"raw_text" => "Extracted text content from PDF ..."
}- If OCR fails for a scanned PDF:
{
"success" => false,
"message" => "Unable to extract text using OCR"
}๐ง Notes
- Ensure Tesseract OCR is installed on your system:
# Ubuntu/Debian
sudo apt install tesseract-ocr
# MacOS (with Homebrew)
brew install tesseract
- Ensure pdftoppm is installed (for PDF-to-image conversion):
# Ubuntu/Debian
sudo apt install poppler-utils
# MacOS (with Homebrew)
brew install poppler
- This gem does not require Rails, but it will work with Rails ActiveStorage objects that respond to .open.
๐งช Running Tests
bundle install
bundle exec rspec
-
PDFs with selectable text
-
Scanned PDFs
-
Malformed PDFs (fallback to OCR)
๐ Contributing
-
Fork the repository
-
Create your feature branch (git checkout -b your-feature)
-
Commit your changes (git commit -am 'Add new feature')
-
Push to the branch (git push origin your-feature)
-
Open a Pull Request
๐งโ๐ผ Author
Ravi Shankar Singhal
Senior Backend Developer โ Ruby on Rails
๐ง ravi.singhal2308@gmail.com
๐ https://github.com/RaviShankarSinghal
๐ License
MIT License ยฉ RaviShankarSinghal
This version includes:
- Version and build badges (replace with your repo info)
- Clear installation instructions
- Usage examples for File, path, and StringIO
- System dependencies
- Test instructions
- Contributing guidelines
- The gem is available as open source under the terms of the MIT License.