A library (Docsplit wrapper) for extracting text from pdf, doc/x, and txt files using OpenOffice.

act_as_page_extractor

A library that extracts plain text from documents for subsequent processing, such as indexing and search.

Installation

Install all dependencies before use:

sh Aptfile.sh

Add this to your Gemfile:

gem 'act_as_page_extractor'

Usage

Generate a migration, for example for a Document model:

rails g act_as_page_extractor:migration Document category_id user_id

This will generate two migration files:

class AddPageExtractorFields < ActiveRecord::Migration
  def change
    add_column :documents, :page_extraction_state, :string, default: ''
    add_column :documents, :page_extraction_pages, :integer, default: 0
    add_column :documents, :page_extraction_doctype, :string, default: ''
    add_column :documents, :page_extraction_filesize, :string, default: ''
    add_column :documents, :pages_extraction_errors, :string, default: ''
  end
end

class CreateExtractedPages < ActiveRecord::Migration
  def change
    create_table :extracted_pages do |t|
      t.text :page
      t.integer :document_id
      t.integer :category_id
      t.integer :user_id
      t.integer :page_number

      t.timestamps null: false
    end

    add_index :extracted_pages, :document_id
    add_index :extracted_pages, :category_id
    add_index :extracted_pages, [:document_id, :category_id]
    add_index :extracted_pages, [:document_id, :page_number]
  end
end

The Document model must have a field that contains the path to the file. Archives of various types are also supported, as long as they contain txt, pdf, doc/x, html, rtf, ... files.

Add the following initialization parameters to the model:

class Document < ActiveRecord::Base
  include ActAsPageExtractor

  act_as_page_extractor options: {
    document_class:    'Document',
    save_as_pdf:       true, # store the converted document as PDF
    filename:          :filename,
    document_id:       :document_id,
    additional_fields: [:category_id, :user_id], # copy the values of these fields from the document to each extracted_page
    root_folder:       Rails.root.to_s, # or "/full/path/to/project"; this folder must be shared between deployments
    # file_storage:      "/full/path/to/project/public/uploads/documents/storage" # optional
    # pdf_storage:       "/full/path/to/project/public/uploads/extracted/pdf/storage" # optional
  }

  has_many :extracted_pages, dependent: :destroy
end

The instance now provides several new methods:

document = Document.first
document.page_extract!
document.extracted_pages
document.pdf_path # if option 'save_as_pdf' is 'true'

# Access the extracted pages
ExtractedPage.count

# Import a whole directory of documents
ActAsPageExtractor.import_files('/path/to/folder/with/documents')

# Run this from cron to process all new documents
ActAsPageExtractor.start_extraction

# Get statistics for all documents
ActAsPageExtractor.statistics
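
Since the extracted pages are meant for indexing and search, here is a minimal sketch of a word index built over page records. It is not part of the gem: PageStub, build_page_index and find_pages are hypothetical names, and PageStub only stands in for ExtractedPage rows, using the document_id, page_number and page columns from the migration above.

```ruby
# Hypothetical stand-in for an ExtractedPage record. In a Rails app the
# records would come from document.extracted_pages instead.
PageStub = Struct.new(:document_id, :page_number, :page)

# Builds an inverted index: lowercase word => sorted list of
# [document_id, page_number] pairs for the pages containing that word.
def build_page_index(pages)
  index = Hash.new { |hash, word| hash[word] = [] }
  pages.each do |record|
    # Split the page text into unique lowercase words.
    record.page.to_s.downcase.scan(/[[:alnum:]]+/).uniq.each do |word|
      index[word] << [record.document_id, record.page_number]
    end
  end
  index.each_value(&:sort!)
  index
end

# Returns the [document_id, page_number] pairs for a word, or an empty
# array when the word does not occur on any page.
def find_pages(index, word)
  index.fetch(word.downcase, [])
end
```

With real records, something like build_page_index(document.extracted_pages) should work the same way, provided each record responds to the three methods used here. For larger collections a database full-text index or a dedicated search engine is likely a better fit than an in-memory hash.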

Initialization parameters for act_as_page_extractor:

  • document_class — The name of the model (e.g., Document).
  • save_as_pdf — Boolean (true/false). Indicates whether to save a temporary PDF.
  • filename — The field that gives access to the file. It should hold an object with a url method that returns the file path (e.g., a CarrierWave object, called as filename.url).
  • document_id — The field name for storing the document ID.
  • additional_fields — Extra fields to be added to the extracted page (useful for indexing, etc.).
  • root_folder — The root folder to be shared across deployments (e.g., Rails.root.to_s).
  • file_storage — Path for saving temporary files (default: "public").
  • pdf_storage — Path for saving PDFs (default: "public/uploads/extracted/pdf").
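
The parameters above can also be assembled in one place, so that both storage paths are derived from a single shared root folder. This is only a sketch: page_extractor_options is a hypothetical helper, not part of the gem. The paths under the root follow the commented-out examples in the model configuration, on the assumption that overridden storage paths are given as absolute paths.

```ruby
# Hypothetical helper (not provided by the gem): builds the options hash
# for act_as_page_extractor from one shared root folder.
def page_extractor_options(root_folder)
  {
    document_class:    'Document',
    save_as_pdf:       true,
    filename:          :filename,
    document_id:       :document_id,
    additional_fields: [:category_id, :user_id],
    root_folder:       root_folder,
    # Assumption: absolute paths, as in the commented-out model examples.
    file_storage:      File.join(root_folder, 'public/uploads/documents/storage'),
    pdf_storage:       File.join(root_folder, 'public/uploads/extracted/pdf/storage')
  }
end
```

In the model this could be used as act_as_page_extractor options: page_extractor_options(Rails.root.to_s), as long as the helper is defined before the model class is loaded.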

Run tests

bundle
rspec

Contacts

https://github.com/phlowerteam / phlowerteam[A]gmail.com

License

MIT License © 2025 PhlowerTeam