uea-stemmer
Ruby implementation of the UEA-Lite stemmer for conservative stemming in search and indexing workloads.
UEA-Lite uses a rule set to normalize suffixes while avoiding aggressive stemming.
Behavior Notes
The stemmer operates on a single token at a time and returns a stemmed token.
Notable behavior of this implementation:
- possessive apostrophes are removed
- contractions are expanded by default (for example, don't becomes do not)
- tokens beginning with uppercase letters are preserved, and pluralized acronyms ending in a lowercase s are singularized
- pure numbers, and tokens containing hyphens/underscores, are passed through unchanged
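The behavior notes above can be pictured as a pre-check that runs before any suffix rule. The following is only a rough sketch of that idea, not the gem's code: classify_token is a hypothetical helper, and the regular expressions and their order are assumptions (for instance, "pure numbers" is taken to mean digits only).

```ruby
# Hypothetical helper, NOT part of uea-stemmer. It sorts a token into the
# categories described in the behavior notes. Patterns and their order are
# assumptions made for illustration.
def classify_token(token)
  case token
  when /\A\d+\z/     then :number          # pure number: passed through
  when /[-_]/        then :joined          # hyphen or underscore: passed through
  when /\A[A-Z]+s\z/ then :plural_acronym  # e.g. "CDs": trailing "s" dropped
  when /\A[A-Z]/     then :capitalized     # leading uppercase: preserved
  else                    :stemmable       # everything else goes to the rules
  end
end

%w[2005 well-known CDs London helpers].each do |token|
  puts "#{token}: #{classify_token(token)}"
end
```

Only tokens in the last category would be expected to change shape under the suffix rules; the real stemmer may draw these boundaries differently.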
This is a port to Ruby from the Java port of the original Perl script by Marie-Claire Jenkins and Dr. Dan J. Smith at the University of East Anglia.
Installation
Install the gem:
gem install uea-stemmer
Install from source:
git clone https://github.com/ealdent/uea-stemmer.git
cd uea-stemmer
bundle install
bundle exec rake test
bundle exec rake install
Example Usage
Basic usage:
require "uea-stemmer"

stemmer = UEAStemmer.new
stemmer.stem("helpers") # => "helper"
stemmer.stem("dying")   # => "die"
stemmer.stem("scarred") # => "scar"
You can extract the matching rule with stem_with_rule:
result = stemmer.stem_with_rule("invited")
result.word     # => "invite"
result.rule_num # => 22.3
result.rule     # => #<UEAStemmer::Rule ...>
Disable contraction expansion:
UEAStemmer.new(nil, nil, skip_contractions: true).stem("don't") # => "don't"
Use the singleton instance:
DefaultUEAStemmer.instance.stem("running") # => "run"
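Because the stemmer works on one token at a time, stemming a phrase means splitting it first. The sketch below shows one way to do that. stem_tokens and StubStemmer are hypothetical names that are not part of the gem; the stub only strips a trailing "s" so that the example runs without the gem installed.

```ruby
# Stand-in with the same one-method interface as the stemmer. Hypothetical:
# it only strips a trailing "s" and is not the UEA-Lite rule set.
class StubStemmer
  def stem(token)
    token.sub(/s\z/, "")
  end
end

# Hypothetical helper: splits text on whitespace and stems each token
# independently with any object that responds to #stem.
def stem_tokens(text, stemmer)
  text.split.map { |token| stemmer.stem(token) }
end

stemmed = stem_tokens("helpers index documents", StubStemmer.new)
puts stemmed.inspect # prints ["helper", "index", "document"]
```

With the gem loaded, UEAStemmer.new can be passed in place of the stub. Since contractions are expanded by default, a single input token such as don't may come back as two words, so callers that need one word per entry may want to split the results again or pass skip_contractions: true.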
Contributing
- Fork the project.
- Make your feature addition or bug fix.
- Add or update tests.
- Run bundle exec rake test.
- Send me a pull request. Bonus points for topic branches.
Copyright
Copyright © 2005 by the University of East Anglia. Authored by Marie-Claire Jenkins and Dr. Dan J. Smith. This Ruby port was written by Jason Adams, based on the Java port by Richard Churchill.
This project is distributed under the Apache 2.0 License. See LICENSE for details.