No commit activity in last 3 years
No release in over 3 years
An efficient Wagner Fischer levenshtein edit distance calculation in C Ruby
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies
 Project Readme

StringDotLevenshtein

Preview

Levenstein algorithm helps calculate the distance between two words. You can make a spell checker using this algorithm.

This gem attempts to solve this as fast as possible but at the expense of disallowing big string and non ASCII characters. Any string more than 4096 characters will raise RuntimeError.

You can also use the pure Ruby method as a workaround, which doesn't have any limitation, but it's way slower.

Read the usage for more info.

Dependencies (Build)

You need to have the C compiler installed to be able to compile the C extension. On Arch Linux:

# pacman -S gcc make

On Debian based systems:

# apt install gcc build-essential

You might also require ruby-dev in Debian based systems which provides support for ruby.h header file:

# apt install ruby-dev

Once you are done, and you can compile the C file, you can follow the installation!

Installation

Add this line to your application's Gemfile:

gem 'string_dot_levenshtein'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install string_dot_levenshtein

Usage

String#levenshtein(str)

To get word rank, just use String#levenshtein(str). For example:

irb(main):001:0> require 'string_dot_levenshtein'
=> true
irb(main):002:0> "sunday".levenshtein("saturday")
=> 3

Cons:

  • The length of the string is limited to 4096 characters.
  • Any non-ASCII character can return an incorrect result.

Pros:

  • It's the fastest.

String#levenshtein_rb(str)

To overcome the cons led by using String#levenshtein(str), use String#levenshtein_rb(str):

irb(main):001:0> require 'string_dot_levenshtein'
=> true
irb(main):002:0> "sunday".levenshtein_rb("saturday")
=> 3

# This is the C Ruby method
irb(main):003:0> "a".levenshtein("🐻")
=> 4

# This is the pure Ruby method
irb(main):004:0> "a".levenshtein_rb("🐻")
=> 1

# This is the C Ruby method
irb(main):005:0> "cómo".levenshtein("como")
=> 2

# This is the pure Ruby method
irb(main):006:0> "cómo".levenshtein_rb("como")
=> 1

Cons:

  • It's 9 - 10x slower than the String#levenstein(arg) as it's written in pure Ruby

Pros:

  • It can work with arbitrary-length string
  • It can work with emojis
  • It can work with various other languages

  • Note that words are case-sensitive, unstripped. 3 and 4-byte emojis are counted as 3 and 4-byte string.

  • If you are building an application where the user is guaranteed to use only 1 byte ASCII characters, use the String#levenshtein(str) method, it's fast!

  • For supporting various languages and emoji (although they don't calculate the distance), use String#levenshtein_rb(str)


If you want to search a dictionary against matches, you can get a dictionary.com verified (scraped) wordlist here: https://github.com/Souravgoswami/wordlists/blob/master/dictionary.com_verified_words

Development

After checking out the repo, run bin/setup to install dependencies. To install this gem onto your local machine, run rake install.

Contributing

Our main aim is to make the calculation as fast as possible without losing any accuracy. This is a small gem, we don't care a bit about code-readability and maintainability. So any suggestion or contribution that makes the gem faster are greatly welcomed.

Bug reports and pull requests are welcome on GitHub at https://github.com/Souravgoswami/string_dot_levenshtein.

License

The gem is available as open-source under the terms of the MIT License.