PDFBox text extraction

This gem lets you extract plain text from PDF documents. It is a JRuby wrapper for the Apache PDFBox library.

Installation

Add this line to your application's Gemfile:

gem 'pdfbox_text_extraction'

And then execute:

$ bundle

Or install it yourself as:

$ gem install pdfbox_text_extraction

Usage

To extract all text on every page:

extracted_text = PdfboxTextExtraction.run(path_to_pdf)
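
Because the gem wraps a Java library, the call above is only available when the gem has been loaded under JRuby. The following is a minimal sketch of a defensive wrapper, not part of the gem: the helper name `extract_text_or_nil` and its `extractor:` parameter are hypothetical, introduced here so the logic can also be exercised without PDFBox.

```ruby
# Hypothetical helper, not part of the gem: returns the extracted text,
# or nil when the file is missing or no extractor is available.
def extract_text_or_nil(path_to_pdf, extractor: nil)
  return nil unless File.file?(path_to_pdf)

  # Fall back to the gem's entry point only if the gem has been loaded.
  extractor ||= PdfboxTextExtraction.method(:run) if defined?(PdfboxTextExtraction)
  return nil if extractor.nil?

  extractor.call(path_to_pdf)
end
```

With the gem loaded, `extract_text_or_nil(path_to_pdf)` would delegate to `PdfboxTextExtraction.run`. Passing a lambda as `extractor:` makes it possible to try the wrapper on plain Ruby.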

To extract text inside a crop area:

extracted_text = PdfboxTextExtraction.run(
  path_to_pdf,
  {
    crop_x: 0, # crop area top left corner x-coordinate
    crop_y: 1.0, # crop area top left corner y-coordinate
    crop_width: 8.5, # crop area width
    crop_height: 9.4, # crop area height
  }
)
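
The crop options form a plain hash, so they can be built and checked before the PDF is touched. The sketch below assumes a hypothetical helper `crop_options`, which is not part of the gem. The README does not state the units of the crop values, so the helper passes them through unchanged and only rejects non-positive dimensions.

```ruby
# Hypothetical helper, not part of the gem: builds the options hash
# for a crop area and rejects non-positive dimensions.
def crop_options(x:, y:, width:, height:)
  unless width.positive? && height.positive?
    raise ArgumentError, 'crop width and height must be positive'
  end

  { crop_x: x, crop_y: y, crop_width: width, crop_height: height }
end

# Same values as in the example above.
options = crop_options(x: 0, y: 1.0, width: 8.5, height: 9.4)

# With the gem loaded under JRuby, the hash is passed as the second argument:
# extracted_text = PdfboxTextExtraction.run(path_to_pdf, options)
```

Keeping the hash construction in one place means the four keys are spelled consistently wherever a crop area is used.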

Contributing

  1. Fork it ( https://github.com/jhund/pdfbox_text_extraction/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Resources

License

MIT licensed.

Copyright

Copyright (c) 2016 Jo Hund. See (MIT) LICENSE for details.