Match strings approximately using multiple algorithms: Levenshtein edit distance, Damerau-Levenshtein with transpositions, Jaro-Winkler similarity, Dice coefficient, and Longest Common Subsequence. Includes Soundex and Metaphone phonetic matching, ranked search, and deduplication.

philiprehberger-fuzzy_match


Fuzzy string matching with Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Hamming, LCS, token-based, and phonetic algorithms

Requirements

  • Ruby >= 3.1

Installation

Add to your Gemfile:

gem "philiprehberger-fuzzy_match"

Or install directly:

gem install philiprehberger-fuzzy_match

Usage

require "philiprehberger/fuzzy_match"

# Individual algorithms
Philiprehberger::FuzzyMatch.levenshtein('kitten', 'sitting')   # => 3
Philiprehberger::FuzzyMatch.jaro_winkler('martha', 'marhta')   # => ~0.96
Philiprehberger::FuzzyMatch.dice_coefficient('night', 'nacht') # => 0.25

# Normalized ratio (0.0 to 1.0)
Philiprehberger::FuzzyMatch.ratio('kitten', 'sitting')  # => ~0.57
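
To make the numbers above easier to follow, here is a minimal, standalone sketch of Levenshtein distance, a normalized ratio, and a bigram Dice coefficient. It is not the gem's source: the `sketch_*` method names are hypothetical, and the normalization (1 minus distance divided by the longer length) is an assumption that happens to agree with the `~0.57` shown for 'kitten' and 'sitting'.

```ruby
# Illustrative sketch only; not the gem's implementation.

# Dynamic-programming Levenshtein distance, keeping two rows in memory.
def sketch_levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed normalization: 1 - distance / length of the longer string.
def sketch_ratio(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - sketch_levenshtein(a, b).to_f / longest
end

# Dice coefficient over character bigrams, counting repeated bigrams once per occurrence.
def sketch_dice(a, b)
  x = a.downcase.each_char.each_cons(2).map(&:join)
  y = b.downcase.each_char.each_cons(2).map(&:join)
  return 0.0 if x.empty? || y.empty?

  remaining = y.tally
  overlap = x.count do |bigram|
    next false unless remaining.fetch(bigram, 0).positive?

    remaining[bigram] -= 1
    true
  end
  2.0 * overlap / (x.size + y.size)
end

puts sketch_levenshtein('kitten', 'sitting')     # 3
puts sketch_ratio('kitten', 'sitting').round(2)  # 0.57
puts sketch_dice('night', 'nacht')               # 0.25
```

'night' and 'nacht' share only the bigram "ht" out of four bigrams each, which gives 2 * 1 / 8 = 0.25.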

Damerau-Levenshtein (Transposition-Aware)

# Counts adjacent transpositions as 1 edit (Levenshtein counts them as 2)
Philiprehberger::FuzzyMatch.damerau_levenshtein('teh', 'the')   # => 1
Philiprehberger::FuzzyMatch.damerau_ratio('teh', 'the')         # => ~0.667
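
The transposition rule can be sketched as one extra case on top of the Levenshtein recurrence. The example below implements the restricted "optimal string alignment" variant; whether the gem uses this variant or the unrestricted Damerau-Levenshtein algorithm is not stated here, and `sketch_osa_distance` is a hypothetical name.

```ruby
# Illustrative sketch only; not the gem's implementation.
# Optimal string alignment distance: Levenshtein plus adjacent transpositions.
def sketch_osa_distance(a, b)
  a = a.downcase.chars
  b = b.downcase.chars
  # d[i][j] = edits needed to turn the first i chars of a into the first j chars of b
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      best = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Two adjacent characters swapped: count the swap as a single edit.
      if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
        best = [best, d[i - 2][j - 2] + 1].min
      end
      d[i][j] = best
    end
  end
  d[a.length][b.length]
end

puts sketch_osa_distance('teh', 'the') # 1
```

With a distance of 1 over a length of 3, a ratio of 1 - 1/3 is consistent with the `~0.667` shown above.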

Longest Common Subsequence

Philiprehberger::FuzzyMatch.lcs('kitten', 'sitting')       # => 4
Philiprehberger::FuzzyMatch.lcs_ratio('kitten', 'sitting')  # => ~0.615
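
As a rough illustration of where these two numbers come from, the sketch below computes the LCS length with the standard dynamic-programming table. The ratio formula (twice the LCS length divided by the combined length) is an assumption chosen because it reproduces the `~0.615` above; the `sketch_*` names are hypothetical.

```ruby
# Illustrative sketch only; not the gem's implementation.
# Length of the longest common subsequence, keeping two rows in memory.
def sketch_lcs_length(a, b)
  a = a.downcase
  b = b.downcase
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
    end
    prev = curr
  end
  prev.last
end

# Assumed normalization: 2 * LCS length / (length of a + length of b).
def sketch_lcs_ratio(a, b)
  total = a.length + b.length
  return 1.0 if total.zero?

  2.0 * sketch_lcs_length(a, b) / total
end

puts sketch_lcs_length('kitten', 'sitting')          # 4
puts sketch_lcs_ratio('kitten', 'sitting').round(3)  # 0.615
```

The common subsequence here is "ittn", so the ratio is 8 / 13.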

Best Match

candidates = %w[Ruby Python Rust JavaScript]
result = Philiprehberger::FuzzyMatch.best('rubyy', candidates)
result[:match]  # => "Ruby"
result[:score]  # => 0.8

Ranked Search

candidates = %w[commit comment command compare]
results = Philiprehberger::FuzzyMatch.search('comit', candidates, threshold: 0.5)
# => [{ match: "commit", score: 0.8333 }, { match: "comment", score: 0.7143 }, ...]

Did-You-Mean Suggestions

Philiprehberger::FuzzyMatch.suggest('comit', %w[commit comment zebra], threshold: 0.6, max: 3)
# => ["commit", "comment"]
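
Ranked search and suggestions can both be viewed as "score every candidate, filter by threshold, sort descending". The sketch below shows that pipeline with a deliberately crude stand-in scorer (shared-prefix length divided by the longer length), so its scores will not match the gem's. All `sketch_*` names are hypothetical, and treating the threshold as inclusive is an assumption.

```ruby
# Illustrative sketch only; not the gem's implementation.

# Stand-in scorer: shared prefix length / length of the longer string.
def sketch_prefix_score(a, b)
  shared = a.chars.zip(b.chars).take_while { |x, y| x == y }.size
  shared.to_f / [a.length, b.length].max
end

# Score each candidate, keep those at or above the threshold, best first.
# Ties keep their original order.
def sketch_search(query, candidates, threshold: 0.3, &scorer)
  candidates
    .each_with_index
    .map { |candidate, index| { match: candidate, score: scorer.call(query, candidate), index: index } }
    .select { |row| row[:score] >= threshold }
    .sort_by { |row| [-row[:score], row[:index]] }
    .map { |row| row.slice(:match, :score) }
end

# Suggestions are the top matches with the scores dropped.
def sketch_suggest(query, candidates, threshold: 0.6, max: 5, &scorer)
  sketch_search(query, candidates, threshold: threshold, &scorer).first(max).map { |row| row[:match] }
end

words = %w[commit comment command compare]
top = sketch_suggest('comit', words, threshold: 0.4, max: 2) { |a, b| sketch_prefix_score(a, b) }
p top # ["commit", "comment"]
```

Passing the scorer as a block keeps the pipeline independent of any one similarity measure.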

Phonetic Matching

Philiprehberger::FuzzyMatch.soundex('Robert')    # => "R163"
Philiprehberger::FuzzyMatch.metaphone('Smith')    # => "SM0"
Philiprehberger::FuzzyMatch.phonetic_match?('Robert', 'Rupert')  # => true
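
For readers unfamiliar with Soundex, the following is a minimal sketch of the standard American Soundex rules: keep the first letter, map consonants to digits, skip vowels, collapse adjacent identical codes, and pad to four characters. It is an independent illustration with hypothetical `sketch_*` names. Whether the gem's `phonetic_match?` compares Soundex codes, Metaphone codes, or both is an assumption left open here.

```ruby
# Illustrative sketch only; not the gem's implementation.

# Digit for a consonant under American Soundex, or nil for other letters.
def sketch_soundex_digit(ch)
  case ch
  when 'b', 'f', 'p', 'v' then '1'
  when 'c', 'g', 'j', 'k', 'q', 's', 'x', 'z' then '2'
  when 'd', 't' then '3'
  when 'l' then '4'
  when 'm', 'n' then '5'
  when 'r' then '6'
  end
end

def sketch_soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, '')
  return '' if letters.empty?

  code = letters[0].upcase
  last = sketch_soundex_digit(letters[0])
  letters[1..].each_char do |ch|
    digit = sketch_soundex_digit(ch)
    if digit
      code << digit unless digit == last
      last = digit
    elsif ch != 'h' && ch != 'w'
      last = nil # a vowel separates codes, so a repeated digit after it is kept
    end
    break if code.length == 4
  end
  code.ljust(4, '0')
end

puts sketch_soundex('Robert')                             # R163
puts sketch_soundex('Robert') == sketch_soundex('Rupert') # true
```

'Robert' and 'Rupert' both reduce to R163 because b and p share a digit and the vowels are skipped.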

Deduplication

Philiprehberger::FuzzyMatch.deduplicate(%w[hello helo world wrld], threshold: 0.8)
# => ["hello", "world"]
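
One plausible way to get this result is a greedy pass that keeps a string only if it is not similar enough to anything already kept. The sketch below assumes that strategy and a Levenshtein-based similarity. The gem's actual grouping strategy and its `algorithm:` option are not modelled, and the `sketch_*` names are hypothetical.

```ruby
# Illustrative sketch only; not the gem's implementation.

# Levenshtein similarity: (longer length - distance) / longer length.
def sketch_similarity(a, b)
  a = a.downcase
  b = b.downcase
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  row = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    next_row = [i]
    b.each_char.with_index(1) do |cb, j|
      next_row << [row[j] + 1, next_row[j - 1] + 1, row[j - 1] + (ca == cb ? 0 : 1)].min
    end
    row = next_row
  end
  (longest - row.last).to_f / longest
end

# Greedy deduplication: the first member of each group of similar strings wins.
def sketch_deduplicate(items, threshold:)
  items.each_with_object([]) do |item, kept|
    kept << item unless kept.any? { |k| sketch_similarity(k, item) >= threshold }
  end
end

p sketch_deduplicate(%w[hello helo world wrld], threshold: 0.8) # ["hello", "world"]
```

Because the pass is order-dependent, putting the preferred spelling first in the input decides which variant survives.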

Hamming Distance

Philiprehberger::FuzzyMatch.hamming('karolin', 'kathrin')  # => 3
Philiprehberger::FuzzyMatch.hamming('abc', 'abc')          # => 0
# Raises an error when the strings differ in length
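
Hamming distance is simply the number of positions at which two equal-length strings differ, which a short sketch makes concrete. The method name is hypothetical, and raising `ArgumentError` is an assumption; the gem's own error class is not shown here.

```ruby
# Illustrative sketch only; not the gem's implementation.
def sketch_hamming(a, b)
  raise ArgumentError, 'strings must have the same length' unless a.length == b.length

  # Pair up characters by position and count the mismatches.
  a.downcase.chars.zip(b.downcase.chars).count { |x, y| x != y }
end

puts sketch_hamming('karolin', 'kathrin') # 3
```

'karolin' and 'kathrin' differ at the third, fourth, and fifth characters.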

Token-Based Matching

# Token sort: reorder tokens alphabetically before comparing
Philiprehberger::FuzzyMatch.token_sort_ratio('john smith jr', 'jr john smith')  # => 1.0

# Token set: compare based on token set intersection/union
Philiprehberger::FuzzyMatch.token_set_ratio('new york mets', 'new york mets vs atlanta braves')
# => high score (shared tokens boost similarity)
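
The preprocessing behind both token-based ratios can be sketched without committing to a particular scoring formula. The helpers below (hypothetical names) show only the normalization step: sorting tokens so that word order stops mattering, and splitting two token sets into shared and unshared parts. How the gem turns these pieces into a final score is not covered.

```ruby
# Illustrative sketch only; not the gem's implementation.

# Token sort: lowercase, split on whitespace, sort, and rejoin.
def sketch_sorted_tokens(text)
  text.downcase.split.sort.join(' ')
end

# Token set: unique tokens split into shared and one-sided groups.
def sketch_token_sets(a, b)
  ta = a.downcase.split.uniq
  tb = b.downcase.split.uniq
  { shared: ta & tb, only_a: ta - tb, only_b: tb - ta }
end

puts sketch_sorted_tokens('john smith jr') # john jr smith
puts sketch_sorted_tokens('jr john smith') # john jr smith
p sketch_token_sets('new york mets', 'new york mets vs atlanta braves')
```

Once both inputs normalize to the same string, any similarity measure returns 1.0, which is why the reordered names above match exactly.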

Weighted Scoring

Philiprehberger::FuzzyMatch.weighted_score('kitten', 'sitting',
  weights: { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 })
# => weighted combination of algorithm scores
# Supported keys: :jaro_winkler, :dice, :levenshtein_ratio, :lcs_ratio, :damerau_ratio
# Weights must sum to 1.0
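
The combination step is a plain weighted sum, sketched below. The method name is hypothetical, the per-algorithm scores passed in are placeholder numbers rather than real outputs of any algorithm, and the floating-point tolerance on the sum check is an assumption.

```ruby
# Illustrative sketch only; not the gem's implementation.
# scores and weights are hashes keyed by algorithm name.
def sketch_weighted_score(scores, weights)
  total = weights.values.sum
  raise ArgumentError, 'weights must sum to 1.0' unless (total - 1.0).abs < 1e-9

  weights.sum { |name, weight| weight * scores.fetch(name) }
end

# Placeholder scores, chosen only to make the arithmetic easy to follow.
scores  = { jaro_winkler: 0.8, dice: 0.5, levenshtein_ratio: 0.6 }
weights = { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 }
puts sketch_weighted_score(scores, weights).round(2) # 0.67
```

Here the result is 0.5 * 0.8 + 0.3 * 0.5 + 0.2 * 0.6 = 0.67.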

API

Philiprehberger::FuzzyMatch

| Method | Description |
|--------|-------------|
| .levenshtein(a, b) | Levenshtein edit distance (integer) |
| .jaro_winkler(a, b) | Jaro-Winkler similarity (0.0 to 1.0) |
| .dice_coefficient(a, b) | Dice coefficient from bigram overlap (0.0 to 1.0) |
| .damerau_levenshtein(a, b) | Damerau-Levenshtein distance with transpositions (integer) |
| .damerau_ratio(a, b) | Normalized Damerau-Levenshtein similarity (0.0 to 1.0) |
| .lcs(a, b) | Longest common subsequence length (integer) |
| .lcs_ratio(a, b) | Normalized LCS similarity (0.0 to 1.0) |
| .ratio(a, b) | Normalized Levenshtein ratio (0.0 to 1.0) |
| .best(query, candidates, threshold: 0.0) | Best match as { match:, score: } |
| .search(query, candidates, threshold: 0.3) | Ranked array of { match:, score: } |
| .suggest(query, candidates, threshold: 0.6, max: 5) | Array of match strings |
| .soundex(string) | Generate 4-character Soundex code |
| .metaphone(string) | Generate Metaphone phonetic code |
| .phonetic_match?(a, b) | Check if two strings match phonetically |
| .hamming(a, b) | Hamming distance for equal-length strings (integer) |
| .token_sort_ratio(a, b) | Token-sorted Jaro-Winkler similarity (0.0 to 1.0) |
| .token_set_ratio(a, b) | Token-set-based similarity (0.0 to 1.0) |
| .weighted_score(a, b, weights:) | Weighted multi-algorithm score (0.0 to 1.0) |
| .deduplicate(array, threshold:, algorithm:) | Group and deduplicate similar strings |

All methods are case-insensitive by default.

Development

bundle install
bundle exec rspec
bundle exec rubocop

Support

If you find this project useful:

Star the repo

🐛 Report issues

💡 Suggest features

❤️ Sponsor development

🌐 All Open Source Projects

💻 GitHub Profile

🔗 LinkedIn Profile

License

MIT