A maintained fork of the sastrawi gem. Stems words in Bahasa Indonesia using the Nazief & Adriani algorithm with Enhanced Confix Stripping. Based on the original work by Andrias Meisyal (sastrawi gem) and the PHP Sastrawi project (github.com/sastrawi/sastrawi).

sastrawi-ruby

Indonesian language stemmer for Ruby. Stems words in Bahasa Indonesia using the Nazief & Adriani algorithm with Enhanced Confix Stripping (ECS).

This is an actively maintained fork of meisyal/sastrawi-ruby.

What's New in v0.2.0

  • Bug fixes: Fixed 3 stemming bugs (menerangi, berimanlah, kesepersepuluhnya)
  • Dictionary: Added missing words (sepuluh)
  • Modernized: Ruby 3.0+ required, updated dependencies, GitHub Actions CI
  • Fixed regex warning in disambiguator prefix rule 16

Installation

Add the gem to your Gemfile:

# Gemfile
gem "sastrawi"

Or install it directly:

gem install sastrawi

Requires Ruby 3.0+.

Usage

Stemming

require "sastrawi"

factory = Sastrawi::Stemmer::StemmerFactory.new
stemmer = factory.create_stemmer

stemmer.stem("Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan")
# => "ekonomi indonesia sedang dalam tumbuh yang bangga"

stemmer.stem("membangunkan")  # => "bangun"
stemmer.stem("bersembunyi")   # => "sembunyi"
stemmer.stem("menerangi")     # => "terang"
stemmer.stem("kesepersepuluhnya") # => "sepuluh"

Stop Word Removal

require "sastrawi"

factory = Sastrawi::StopWordRemover::StopWordRemoverFactory.new
stop_words = factory.get_stop_word
# => ["a", "ada", "adalah", "agar", "akan", ...]
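The factory call above returns the stop-word list as an array; one way to apply it is to filter a sentence's tokens against that list. The sketch below is self-contained, so it uses a small hard-coded sample rather than the gem's actual list, and the `remove_stop_words` helper is hypothetical (it is not part of the gem). In real use, the array from `factory.get_stop_word` would presumably take the place of the sample.

```ruby
require "set"

# Illustrative sample only, not the gem's actual list.
# In real use, build this set from factory.get_stop_word.
SAMPLE_STOP_WORDS = %w[ada adalah agar akan yang dalam sedang].to_set

# Hypothetical helper (not part of the gem): lowercases the text,
# splits it on whitespace, and drops every token found in stop_words.
def remove_stop_words(text, stop_words)
  text.downcase.split.reject { |word| stop_words.include?(word) }.join(" ")
end

puts remove_stop_words(
  "Perekonomian Indonesia sedang dalam pertumbuhan yang membanggakan",
  SAMPLE_STOP_WORDS
)
# prints "perekonomian indonesia pertumbuhan membanggakan"
```

Converting the array to a `Set` keeps each lookup constant-time, which matters once the full stop-word list is used on longer texts.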

Custom Dictionary

require "sastrawi"

factory = Sastrawi::Stemmer::StemmerFactory.new
dictionary = factory.create_default_dictionary

# Add words from file
dictionary.add_words_from_text_file("my-dictionary.txt")

# Add/remove individual words
dictionary.add("internet")
dictionary.remove("desa")

stemmer = Sastrawi::Stemmer::Stemmer.new(dictionary)
stemmer.stem("internetan")  # => "internet"

How It Works

Indonesian stemming removes affixes (prefixes, suffixes, infixes) to find base words:

| Affix Type | Examples | Algorithm Step |
| --- | --- | --- |
| Inflectional Particle | -lah, -kah, -pun | Removed first |
| Possessive Pronoun | -ku, -mu, -nya | Removed second |
| Derivational Suffix | -i, -kan, -an | Removed third |
| Derivational Prefix | me-, ber-, ter-, pe-, di-, ke-, se- | Removed last (up to 3 layers) |

The algorithm uses Confix Stripping (CS) and Enhanced Confix Stripping (ECS) for handling complex prefix-suffix combinations, plus a dictionary lookup at each step to validate results.
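The removal order can be illustrated with a deliberately simplified sketch. It is not the gem's implementation: the `toy_stem` method, the regular expressions, and the four-word dictionary are all assumptions made for illustration, and the sketch has none of the disambiguation rules, recoding, or backtracking that CS and ECS provide.

```ruby
require "set"

# Toy dictionary for illustration only; the gem ships a much larger word list.
TOY_DICTIONARY = %w[buku baca main terang].to_set

PARTICLES   = /(lah|kah|pun)\z/
POSSESSIVES = /(ku|mu|nya)\z/
SUFFIXES    = /(kan|an|i)\z/
PREFIXES    = /\A(mem|men|me|ber|ter|pe|di|ke|se)/

# Simplified sketch: strip one affix class at a time, in the documented
# order, and check the dictionary after every step.
def toy_stem(word, dictionary = TOY_DICTIONARY)
  current = word.downcase
  return current if dictionary.include?(current)

  # Steps 1-3: particle, possessive pronoun, derivational suffix.
  [PARTICLES, POSSESSIVES, SUFFIXES].each do |pattern|
    current = current.sub(pattern, "")
    return current if dictionary.include?(current)
  end

  # Step 4: derivational prefixes, at most three layers.
  3.times do
    stripped = current.sub(PREFIXES, "")
    break if stripped == current

    current = stripped
    return current if dictionary.include?(current)
  end

  word.downcase # no dictionary match: return the input unchanged
end

puts toy_stem("bukunya")    # prints "buku"
puts toy_stem("membacakan") # prints "baca"
puts toy_stem("bermainlah") # prints "main"
puts toy_stem("menerangi")  # prints "menerangi"
```

The last call shows why plain stripping is not enough: removing "men" from "menerang" leaves "erang", because the initial "t" of "terang" is dropped when the prefix attaches. Recovering it requires the kind of prefix disambiguation rules the real algorithm applies.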

Known Limitations

  • memuaskan stems to muas instead of puas. Both are valid dictionary words, and the algorithm returns the first candidate that matches (Rule13a). This is an inherent ambiguity in the Nazief-Adriani algorithm.
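The ambiguity can be reproduced in miniature. The sketch below assumes that, for a word starting with "mem" followed by a vowel, the first candidate keeps the "m" (the Rule13a reading named above) and the second restores a dropped "p". The `mem_candidates` and `first_match` helpers and the two-word dictionary are hypothetical and are not taken from the gem.

```ruby
require "set"

# Toy dictionary containing both competing base words.
AMBIGUOUS_DICTIONARY = %w[muas puas].to_set

# Hypothetical candidate generator for words starting with "mem" + vowel:
# first keep the "m", then try restoring a dropped "p".
def mem_candidates(word)
  match = word.match(/\Amem([aiueo].*)\z/)
  return [] unless match

  ["m#{match[1]}", "p#{match[1]}"]
end

# Returns the first candidate found in the dictionary, or nil.
def first_match(word, dictionary)
  mem_candidates(word).find { |candidate| dictionary.include?(candidate) }
end

# "memuas" is what remains of "memuaskan" once the suffix -kan is removed.
puts first_match("memuas", AMBIGUOUS_DICTIONARY)            # prints "muas"
puts first_match("memuas", AMBIGUOUS_DICTIONARY - ["muas"]) # prints "puas"
```

The second call suggests that removing the unwanted word from a custom dictionary might change the result, though this has not been verified against the gem itself.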

Contributing

Bug reports and pull requests are welcome on GitHub.

License

MIT License. Contains base words from Kateglo licensed under CC BY-NC-SA 3.0.

Credits