PDFBox text extraction

This gem lets you extract plain text from PDF documents. It is a JRuby wrapper for the Apache PDFBox library.

Installation

Add this line to your application's Gemfile:

gem 'pdfbox_text_extraction'

And then execute:

$ bundle

Or install it yourself as:

$ gem install pdfbox_text_extraction

Usage

To extract all text on every page:

extracted_text = PdfboxTextExtraction.run(path_to_pdf)
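A common follow-up is to save the extracted text next to the PDF. The sketch below is one way to do that; `extract_to_txt` and its `extractor:` parameter are hypothetical names that are not part of the gem, and the `require` name is assumed to match the gem name. The extractor is injectable so the helper can be tried without JRuby.

```ruby
# Hypothetical helper, not part of the gem: extracts the text of a PDF and
# writes it to a .txt file with the same base name.
#
# `extractor` is any callable that takes a path and returns a String.
# When omitted, it falls back to PdfboxTextExtraction.run as shown above.
def extract_to_txt(path_to_pdf, extractor: nil)
  raise ArgumentError, "no such file: #{path_to_pdf}" unless File.file?(path_to_pdf)

  extractor ||= lambda do |path|
    # Assumption: the require name matches the gem name. Needs JRuby.
    require 'pdfbox_text_extraction'
    PdfboxTextExtraction.run(path)
  end

  # Replace a trailing ".pdf" (any case) with ".txt".
  txt_path = path_to_pdf.sub(/\.pdf\z/i, '') + '.txt'
  File.write(txt_path, extractor.call(path_to_pdf))
  txt_path
end
```

Under these assumptions, `extract_to_txt('report.pdf')` run on JRuby would write the extracted text to `report.txt` and return that path.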

To extract text inside a crop area:

extracted_text = PdfboxTextExtraction.run(
  path_to_pdf,
  {
    crop_x: 0, # crop area top left corner x-coordinate
    crop_y: 1.0, # crop area top left corner y-coordinate
    crop_width: 8.5, # crop area width
    crop_height: 9.4, # crop area height
  }
)
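If you think in terms of page margins rather than a crop rectangle, the options hash can be derived from a page size and the margins to skip. This is a minimal sketch: `crop_options` is a hypothetical helper that is not part of the gem, and it assumes that coordinates are measured from the top left of the page (as the comments above suggest) and that all values use the same units the gem expects, which the README does not state.

```ruby
# Hypothetical helper, not part of the gem: builds the crop options hash
# from a page size and the margins to exclude on each side.
# Assumption: the origin is the top left corner of the page and all values
# are in the units PdfboxTextExtraction.run expects.
def crop_options(page_width:, page_height:, top: 0, right: 0, bottom: 0, left: 0)
  width  = page_width - left - right
  height = page_height - top - bottom
  raise ArgumentError, 'margins leave no crop area' unless width > 0 && height > 0

  {
    crop_x: left,        # crop area top left corner x-coordinate
    crop_y: top,         # crop area top left corner y-coordinate
    crop_width: width,   # crop area width
    crop_height: height, # crop area height
  }
end
```

The result can be passed as the second argument, for example `PdfboxTextExtraction.run(path_to_pdf, crop_options(page_width: 8.5, page_height: 11, top: 1.0, bottom: 0.6))`, which skips a header and footer band while keeping the full page width.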

Contributing

  1. Fork it ( https://github.com/jhund/pdfbox_text_extraction/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Resources

License

MIT licensed.

Copyright

Copyright (c) 2016 Jo Hund. See (MIT) LICENSE for details.