PDFBox text extraction
This gem extracts plain text from PDF documents. It is a JRuby wrapper for the Apache PDFBox library, so it needs to run under JRuby.
Installation
Add this line to your application's Gemfile:
gem 'pdfbox_text_extraction'
And then execute:
$ bundle
Or install it yourself as:
$ gem install pdfbox_text_extraction
Usage
To extract all text on every page:
extracted_text = PdfboxTextExtraction.run(path_to_pdf)
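If you need the text of several PDFs, the call above can be wrapped in a small loop. The following is only a sketch: `extract_texts` and its `extractor:` keyword are hypothetical names introduced here, not part of the gem. The extractor is injectable so the helper can be tried without JRuby; by default it calls `PdfboxTextExtraction.run` as shown above.

```ruby
# Hypothetical helper (not part of the gem): extract text from several PDFs.
# Returns a hash mapping each path to its extracted text.
# The default extractor assumes the gem is already loaded under JRuby.
def extract_texts(pdf_paths, extractor: ->(path) { PdfboxTextExtraction.run(path) })
  pdf_paths.each_with_object({}) do |path, texts|
    texts[path] = extractor.call(path)
  end
end
```

With the gem loaded, `extract_texts(Dir.glob('docs/*.pdf'))` would return one entry per file; the `docs/` directory is just an example path.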
To extract text inside a crop area:
extracted_text = PdfboxTextExtraction.run(
  path_to_pdf,
  {
    crop_x: 0, # crop area top left corner x-coordinate
    crop_y: 1.0, # crop area top left corner y-coordinate
    crop_width: 8.5, # crop area width
    crop_height: 9.4, # crop area height
  }
)
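The crop options are a plain hash, so they can be built and checked before calling the gem. This is a minimal sketch: `crop_options` is a hypothetical helper, and the rule that width and height must be positive is an assumption made here, not something this README states. The units of the values are not specified here either, so the helper passes them through unchanged.

```ruby
# Hypothetical helper (not part of the gem): build the crop options hash
# using the keys from the example above.
def crop_options(x:, y:, width:, height:)
  unless [x, y, width, height].all? { |value| value.is_a?(Numeric) }
    raise ArgumentError, 'crop values must be numeric'
  end
  # Assumption: a usable crop area has a positive width and height.
  unless width > 0 && height > 0
    raise ArgumentError, 'crop_width and crop_height must be positive'
  end

  { crop_x: x, crop_y: y, crop_width: width, crop_height: height }
end

options = crop_options(x: 0, y: 1.0, width: 8.5, height: 9.4)
```

The resulting hash can then be passed as the second argument: `PdfboxTextExtraction.run(path_to_pdf, options)`.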
Contributing
- Fork it ( https://github.com/jhund/pdfbox_text_extraction/fork )
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
License
Copyright
Copyright (c) 2016 Jo Hund. Released under the MIT License; see LICENSE for details.