A JRuby wrapper for Apache Tika to extract text and metadata from various file formats, slightly modified.

Rika

A JRuby wrapper for Apache Tika to extract text and metadata from various file formats.

More information about Apache Tika can be found here: http://tika.apache.org/

Jeremy's modifications

Basically, this fork just uses my own version of Tika, which has special email parsing fixes, adds an X-Attachments metadata key (listing attachment filenames from emails), and removes bouncycastle from Tika-parsers' requirements because everything is awful.

For instance, Tika by itself detects an .eml only if the file has "Received: " as its first string of bytes. I've made it detect an email when that string appears within the first 300 bytes, to cope with leaked emails that put non-standard headers first and the Received: line after them.
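
The difference between the two detection rules can be illustrated in plain Ruby. This is only a sketch of the idea: the real check lives inside the modified Tika build, not in Ruby, and the method names, the constants, and the assumption that the whole marker must fit inside the 300-byte window are all hypothetical.

```ruby
# Illustration only: these helpers are hypothetical and are not part of
# Rika or Tika. They mimic the two detection rules described above.
RECEIVED_MARKER  = "Received: ".b # compared as raw bytes
DETECTION_WINDOW = 300            # bytes inspected by the relaxed rule

# Strict rule: the file must begin with the marker.
def starts_with_received?(bytes)
  bytes.b.start_with?(RECEIVED_MARKER)
end

# Relaxed rule: the marker may appear anywhere in the first `window` bytes
# (assumption: the whole marker has to fit inside the window).
def received_within_window?(bytes, window = DETECTION_WINDOW)
  head = bytes.b.byteslice(0, window) || ""
  head.include?(RECEIVED_MARKER)
end
```

An email that opens with a non-standard header such as "X-Custom: 1" followed by a Received: line fails the strict rule but passes the relaxed one.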

Installation

Add this line to your application's Gemfile:

gem 'rika'

Remember that this gem only works on JRuby.

And then execute:

$ bundle

Or install it yourself as:

$ gem install rika
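
Because the gem only works on JRuby, code that may also run on other Ruby engines could check the engine before loading it. The following is a minimal sketch; `jruby?` and `require_rika!` are hypothetical helper names, not part of Rika.

```ruby
# Hypothetical helpers, not part of Rika.

# True when the given engine name is JRuby. RUBY_ENGINE is "jruby" on JRuby
# and "ruby" on the reference implementation (MRI).
def jruby?(engine = RUBY_ENGINE)
  engine == "jruby"
end

# Loads Rika on JRuby; on any other engine, raises LoadError with a clear
# message before attempting the require.
def require_rika!(engine = RUBY_ENGINE)
  raise LoadError, "rika only works on JRuby (current engine: #{engine})" unless jruby?(engine)

  require "rika"
end
```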

Usage

For a quick start with the simplest use cases, the following functions each return what you need in a single call:

require 'rika'

content           = Rika.parse_content('document.pdf')    # string containing all content text
metadata          = Rika.parse_metadata('document.pdf')   # hash containing the document metadata
content, metadata = Rika.parse_content_and_metadata('document.pdf')   # both of the above

For other use cases and finer control, you can work directly with the Rika::Parser object:

require 'rika'

parser = Rika::Parser.new('document.pdf')

# Return the content of the document:
parser.content 

# Return the media type for the document:
parser.media_type
# => "application/pdf"

# Return the metadata field title if it exists:
parser.metadata["title"] if parser.metadata_exists?("title") 

# Return all the available metadata keys that can be read from the document
parser.available_metadata

# Return only the first 10000 chars of the content:
parser = Rika::Parser.new('document.pdf', 10000)
parser.content # returns the first 10000 chars

# Return content from URL
parser = Rika::Parser.new('http://riakhandbook.com/sample.pdf', 200)
parser.content

# Return the language for the content
parser = Rika::Parser.new('german document.pdf')
parser.language
# => "de"

# Check whether the language identification is certain enough to be trusted
parser.language_is_reasonably_certain?
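
The parser calls above can be combined into one small helper. This is a sketch only: `document_summary` is a hypothetical name, not part of Rika, and it relies solely on the parser methods shown above (`media_type`, `metadata`, `metadata_exists?`, `language`, `language_is_reasonably_certain?`).

```ruby
# Hypothetical helper, not part of Rika. Accepts any object that responds to
# the Rika::Parser methods used above and returns a small summary hash.
def document_summary(parser)
  {
    media_type: parser.media_type,
    # Read the title only if the document actually has one.
    title: parser.metadata_exists?("title") ? parser.metadata["title"] : nil,
    # Report the language only when the identification can be trusted.
    language: parser.language_is_reasonably_certain? ? parser.language : nil
  }
end

# On JRuby with the gem installed, usage would look like:
#   require 'rika'
#   document_summary(Rika::Parser.new('document.pdf'))
```

Leaving `title` and `language` as `nil` when they are missing or uncertain keeps callers from mistaking a guess for a real value.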
	

Credits

The following people have contributed ideas, documentation, or code to Rika:

  • Keith Bennett
  • Richard Nyström

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request