Project

doc_ripper

0.05
No commit activity in last 3 years
No release in over 3 years
Scrape text from common file formats (.pdf, .doc, .docx, .sketch, .txt) with a single convenient command.
 Dependencies

Development

~> 1.6
>= 0
~> 10.0
>= 0

Runtime

 Project Readme

DocRipper


Grab the text from common document formats with one command. DocRipper is an extremely lightweight Ruby wrapper that parses the text contents of common file formats (currently .doc, .docx, .pdf, .sketch and .txt) without requiring heavy dependencies such as an OCR library or OpenOffice/LibreOffice.

For simple parsing, you'll likely see a large performance improvement with DocRipper over solutions that rely on OpenOffice/LibreOffice for .doc/.docx conversion.

Need OCR support or in-image text parsing? Take a look at Docsplit.

Supported File Formats

.doc
.docx
.pdf
.txt
.sketch
File format   Supported?   Dependencies
.doc          x            Antiword
.docx         x
.pdf          x            Poppler-utils
.txt          x
.sketch       x            Sqlite3
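
Because three of the formats rely on external tools, it can help to check that those tools are installed before ripping. The following is a minimal sketch, not part of DocRipper. The helper names are hypothetical, and the executable names (`antiword`, `pdftotext` from Poppler-utils, `sqlite3`) are assumptions about what is on your system.

```ruby
# Assumed executable names for each format's dependency; adjust for your system.
REQUIRED_TOOLS = {
  '.doc'    => 'antiword',
  '.pdf'    => 'pdftotext',
  '.sketch' => 'sqlite3'
}.freeze

# Hypothetical helper: true if an executable file called `name` exists in one
# of the directories listed in env_path.
def executable_on_path?(name, env_path: ENV.fetch('PATH', ''))
  env_path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

# Hypothetical helper: returns the extension => tool pairs whose tool was not found.
def missing_tools(tools: REQUIRED_TOOLS, env_path: ENV.fetch('PATH', ''))
  tools.reject { |_ext, tool| executable_on_path?(tool, env_path: env_path) }
end
```

Calling `missing_tools` returns a hash of the formats that would likely fail, which you can print before processing a batch of documents.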

Quickstart

  gem install doc_ripper

Pass the path of the file you want to read:

  require 'doc_ripper'

  DocRipper::rip('/path/to/file')

If the file cannot be read, nil will be returned.

  DocRipper::rip('/path/to/missing/file')
  => nil
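
Since rip returns nil for an unreadable file, callers usually need a nil check before using the result. The sketch below shows one way to do that. `rip_or_default` and its `ripper:` parameter are hypothetical and not part of DocRipper; the parameter exists only so the helper can be tried without the gem installed.

```ruby
# Hypothetical helper (not part of DocRipper): returns the ripped text,
# or `default` when the ripper returns nil.
def rip_or_default(path, default: '', ripper: nil)
  # By default, load the gem lazily and delegate to DocRipper::rip.
  ripper ||= lambda do |p|
    require 'doc_ripper'
    DocRipper::rip(p)
  end

  text = ripper.call(path)
  text.nil? ? default : text
end
```

With the gem installed, `rip_or_default('/path/to/missing/file', default: '(no text)')` would be expected to return the default string rather than nil.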

Want to raise an exception? Use #rip!

#rip! raises an exception when the file type isn't supported or when rip would return nil (for example, because the file is missing).

  # invalid file type
  DocRipper::rip!('/path/to/invalid/file.type')
  => DocRipper::UnsupportedFileType

  # missing file
  DocRipper::rip!('/path/to/missing/file.doc')
  => DocRipper::FileNotFound
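
The two exceptions above can be rescued separately when each case needs its own message. This is a minimal sketch: `safe_rip` and its `ripper:` parameter are hypothetical names, and only `DocRipper::rip!`, `DocRipper::UnsupportedFileType` and `DocRipper::FileNotFound` come from the README.

```ruby
# Hypothetical helper (not part of DocRipper): returns [text, error_message].
# The default ripper loads the gem lazily; if you inject your own ripper,
# require 'doc_ripper' first so the exception classes are defined.
def safe_rip(path, ripper: nil)
  ripper ||= lambda do |p|
    require 'doc_ripper'
    DocRipper::rip!(p)
  end

  [ripper.call(path), nil]
rescue DocRipper::UnsupportedFileType
  [nil, "unsupported file type: #{File.extname(path)}"]
rescue DocRipper::FileNotFound
  [nil, "file not found: #{path}"]
end
```

A caller can then destructure the result, for example `text, error = safe_rip('/path/to/file.doc')`, and log `error` when it is not nil.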

Dependencies