pdf_ocr

OCR is a Ruby gem that lets you extract text from image files (JPG, PNG, PDF) using the Tesseract OCR engine. It provides a simple interface for integrating OCR into your Ruby or Rails applications.

PDF OCR

A lightweight Ruby gem for extracting text from PDFs, including scanned PDFs using OCR.

This gem supports:

  • PDFs with readable text
  • Scanned PDFs using Tesseract OCR
  • File objects, file paths, StringIO, and Rails/ActiveStorage uploads as input
  • Use without Rails (the gem has no Rails dependency)

🚀 Features

  • Detect if PDF is scanned or text-based
  • Extract text from normal PDFs using PDF::Reader
  • Extract text from scanned PDFs using RTesseract and MiniMagick
  • Automatic cleanup of temporary images
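
The scanned-versus-text detection listed above can be pictured as a simple character-count heuristic. The following is only a sketch under stated assumptions: `scanned_pdf?` and `MIN_CHARS_PER_PAGE` are hypothetical names, not part of the pdf_ocr API, and the gem's real detection logic may differ. The helper takes the text already extracted from each page, which in practice could come from `PDF::Reader.new(path).pages.map(&:text)`.

```ruby
# Hypothetical helper, NOT part of the pdf_ocr API: guesses whether a PDF is
# scanned from the text that was extracted from each of its pages.
MIN_CHARS_PER_PAGE = 20 # assumed threshold; tune it for your documents

def scanned_pdf?(page_texts, min_chars: MIN_CHARS_PER_PAGE)
  # No pages with extractable text at all: treat the document as scanned.
  return true if page_texts.empty?

  # Count only non-whitespace characters so layout padding is ignored.
  total_chars = page_texts.sum { |text| text.to_s.gsub(/\s+/, "").length }
  total_chars < min_chars * page_texts.length
end

puts scanned_pdf?(["", "  \n"])
puts scanned_pdf?(["Invoice 2024-001, total due: 120.00 EUR"])
```

A document that falls below the threshold would then be routed to the OCR path instead of being returned as near-empty text.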

💻 Installation

Add this line to your application's Gemfile:

gem 'pdf_ocr'

Or install directly:

gem install pdf_ocr

Dependencies

  • PDF::Reader

  • RTesseract

  • MiniMagick

  • Tesseract OCR (system-level executable)

  • pdftoppm from Poppler utils (for converting PDF pages to images)
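
Because Tesseract and pdftoppm are system executables rather than gems, Bundler cannot install them. A small preflight check can report a missing tool before any OCR is attempted. This is a minimal sketch that uses only the Ruby standard library; `executable_on_path?` and `missing_executables` are hypothetical helper names, not something pdf_ocr provides.

```ruby
# Hypothetical preflight check, NOT part of the pdf_ocr API.
# Returns true when an executable file called `name` exists in a PATH directory.
def executable_on_path?(name)
  dirs = ENV.fetch("PATH", "").split(File::PATH_SEPARATOR).reject(&:empty?)
  dirs.any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Returns the subset of `names` that could not be found on PATH.
def missing_executables(names = %w[tesseract pdftoppm])
  names.reject { |name| executable_on_path?(name) }
end

missing = missing_executables
warn "Missing system tools: #{missing.join(', ')}" unless missing.empty?
```

Running a check like this at application boot tends to give a clearer error than a failed OCR call later on.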

โš™๏ธ Usage

require 'pdf_ocr'
require 'stringio'

# From a File object (the block form closes the file automatically)
File.open("path/to/document.pdf", "rb") do |file|
  result = Ocr::DataExtractor.new(file).call
  puts result["raw_text"] if result["success"]
end

# From a file path string
result = Ocr::DataExtractor.new("path/to/document.pdf").call

# From a StringIO object (in-memory PDF); binread avoids encoding conversion
pdf_data = StringIO.new(File.binread("path/to/document.pdf"))
result = Ocr::DataExtractor.new(pdf_data).call

Example Result

{
  "success" => true,
  "raw_text" => "Extracted text content from PDF ..."
}

If OCR fails for a scanned PDF:

{
  "success" => false,
  "message" => "Unable to extract text using OCR"
}
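
Since the result is a plain hash with string keys, callers may prefer to convert a failed result into an exception. The wrapper below is a sketch only: `text_from_result` and `ExtractionError` are hypothetical names introduced for illustration, and it relies solely on the "success", "raw_text", and "message" keys shown above.

```ruby
# Hypothetical wrapper, NOT part of the pdf_ocr API.
class ExtractionError < StandardError; end

# Returns the extracted text, or raises ExtractionError for a failed result.
def text_from_result(result)
  return result["raw_text"].to_s if result["success"]

  raise ExtractionError, result["message"] || "Text extraction failed"
end

ok     = { "success" => true,  "raw_text" => "Extracted text content" }
failed = { "success" => false, "message"  => "Unable to extract text using OCR" }

puts text_from_result(ok)

begin
  text_from_result(failed)
rescue ExtractionError => e
  warn "Extraction failed: #{e.message}"
end
```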

🔧 Notes

  1. Ensure Tesseract OCR is installed on your system:
# Ubuntu/Debian
sudo apt install tesseract-ocr

# macOS (with Homebrew)
brew install tesseract
  2. Ensure pdftoppm is installed (for PDF-to-image conversion):
# Ubuntu/Debian
sudo apt install poppler-utils

# macOS (with Homebrew)
brew install poppler
  3. This gem does not require Rails, but it works with Rails ActiveStorage objects that respond to .open.
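
To illustrate how paths, File objects, StringIO, and objects that respond to .open might all be reduced to a filesystem path without depending on Rails, here is a duck-typed sketch. It is an assumption about one possible approach, not the gem's actual implementation: `with_pdf_path` is a hypothetical helper and `FakeUpload` is a stand-in used only for demonstration.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper, NOT the gem's implementation: yields a filesystem path
# for each supported kind of input.
def with_pdf_path(source)
  case source
  when String
    yield source # already a path
  when File
    yield source.path
  else
    if source.respond_to?(:read) # StringIO and other IO-like objects
      # Tempfile.create with a block deletes the file when the block returns.
      Tempfile.create(["pdf_ocr", ".pdf"]) do |tmp|
        tmp.binmode
        tmp.write(source.read)
        tmp.flush
        yield tmp.path
      end
    elsif source.respond_to?(:open) # for example, an ActiveStorage blob
      source.open { |file| yield file.path }
    else
      raise ArgumentError, "Unsupported PDF source: #{source.class}"
    end
  end
end

# Stand-in for an upload object that responds to .open (demonstration only).
class FakeUpload
  def initialize(path)
    @path = path
  end

  def open
    File.open(@path, "rb") { |file| yield file }
  end
end

with_pdf_path(StringIO.new("%PDF-1.4 demo")) do |path|
  puts "temporary copy holds #{File.size(path)} bytes"
end
```

Checking for behavior (`respond_to?`) rather than for Rails classes is what would keep such a helper usable outside Rails.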

🧪 Running Tests

bundle install
bundle exec rspec

The test suite covers:

  • PDFs with selectable text

  • Scanned PDFs

  • Malformed PDFs (fallback to OCR)

๐Ÿ“ Contributing

  • Fork the repository

  • Create your feature branch (git checkout -b your-feature)

  • Commit your changes (git commit -am 'Add new feature')

  • Push to the branch (git push origin your-feature)

  • Open a Pull Request

๐Ÿง‘โ€๐Ÿ’ผ Author

Ravi Shankar Singhal
Senior Backend Developer โ€” Ruby on Rails
๐Ÿ“ง ravi.singhal2308@gmail.com

๐ŸŒ https://github.com/RaviShankarSinghal

๐Ÿ“ License

MIT License ยฉ RaviShankarSinghal


The gem is available as open source under the terms of the MIT License.