Project

open-nlp

0.04
No commit activity in last 3 years
No release in over 3 years
Ruby bindings to the OpenNLP tools, a Java machine learning toolkit for natural language processing (NLP).
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

>= 0

Runtime

~> 0.2.5
 Project Readme

Build Status

###About

This library provides high-level Ruby bindings to the Open NLP package, a Java machine learning toolkit for natural language processing (NLP). This gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1. It is tested on both Java 6 and Java 7.

###Installing

First, install the gem: gem install open-nlp. Then, download the JARs and English language models in one package (80 MB).

Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).

Alternatively, from a terminal window, cd to the gem's folder and run:

wget http://www.louismullie.com/treat/open-nlp-english.zip
unzip -o open-nlp-english.zip -d bin/

Afterwards, you may individually download the appropriate models for other languages from the open-nlp website.

###Configuring

After installing and requiring the gem (require 'open-nlp'), you may want to set some of the following configuration options.

# Set an alternative path to look for the JAR files.
# Default is gem's bin folder.
OpenNLP.jar_path = '/path_to_jars/'

# Set an alternative path to look for the model files.
# Default is gem's bin folder.
OpenNLP.model_path = '/path_to_models/'

# Pass some alternative arguments to the Java VM.
# Default is ['-Xms512M', '-Xmx1024M'].
OpenNLP.jvm_args = ['-option1', '-option2']

# Redirect VM output to log.txt
OpenNLP.log_file = 'log.txt'

# Set default models for a language.
OpenNLP.use :language

###Examples

Simple tokenizer

OpenNLP.load

sent = "The death of the poet was kept from his poems."
tokenizer = OpenNLP::SimpleTokenizer.new

tokens = tokenizer.tokenize(sent).to_a
# => %w[The death of the poet was kept from his poems .]

Maximum entropy tokenizer, chunker and POS tagger

OpenNLP.load

chunker   = OpenNLP::ChunkerME.new
tokenizer = OpenNLP::TokenizerME.new
tagger    = OpenNLP::POSTaggerME.new

sent   = "The death of the poet was kept from his poems."

tokens = tokenizer.tokenize(sent).to_a
# => %w[The death of the poet was kept from his poems .]

tags   = tagger.tag(tokens).to_a
# => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]

chunks = chunker.chunk(tokens, tags).to_a
# => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]

Abstract Bottom-Up Parser

OpenNLP.load

sent      = "The death of the poet was kept from his poems."
parser = OpenNLP::Parser.new
parse = parser.parse(sent)

parse.get_text.should eql sent

parse.get_span.get_start.should eql 0
parse.get_span.get_end.should eql 46
parse.get_child_count.should eql 1

child = parse.get_children[0]

child.text # => "The death of the poet was kept from his poems."
child.get_child_count # => 3
child.get_head_index #=> 5
child.get_type # => "S"

Maximum Entropy Name Finder*

OpenNLP.load

text = File.read('./spec/sample.txt').gsub!("\n", "")

tokenizer   = OpenNLP::TokenizerME.new
segmenter   = OpenNLP::SentenceDetectorME.new
ner_models  = ['person', 'time', 'money']

ner_finders = ner_models.map do |model|
  OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
end

sentences = segmenter.sent_detect(text)
named_entities = []

sentences.each do |sentence|

  tokens = tokenizer.tokenize(sentence)
  
  ner_models.each_with_index do |model,i|
    finder = ner_finders[i]
    name_spans = finder.find(tokens)
    name_probs = finder.probs()
    name_spans.each_with_index do |name_span,j|
      start = name_span.get_start
      stop  = name_span.get_end-1
      slice = tokens[start..stop].to_a
      prob  = name_probs[j]
      named_entities << [slice, model, prob]
    end
  end

end

Loading specific models

Just pass the name of the model file to the constructor. The gem will search for the file in the OpenNLP.model_path folder.

OpenNLP.load

tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
# etc.

Loading specific classes

You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:

# Default base class is opennlp.tools.
OpenNLP.load_class('SomeClassName')  
# => OpenNLP::SomeClassName

# Here, we specify another base class.
OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
# => OpenNLP::SomeOtherClass

Contributing

Fork the project and send me a pull request! Config updates for other languages are welcome.