Match strings approximately using multiple algorithms: Levenshtein edit distance, Damerau-Levenshtein with transpositions, Jaro-Winkler similarity, Dice coefficient, Hamming distance, and Longest Common Subsequence. Includes token-based matching, weighted scoring, Soundex and Metaphone phonetic matching, ranked search, and deduplication.

philiprehberger-fuzzy_match


Fuzzy string matching with Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Hamming, LCS, token-based, and phonetic algorithms

Requirements

  • Ruby >= 3.1

Installation

Add to your Gemfile:

gem "philiprehberger-fuzzy_match"

Or install directly:

gem install philiprehberger-fuzzy_match

Usage

require "philiprehberger/fuzzy_match"

# Individual algorithms
Philiprehberger::FuzzyMatch.levenshtein('kitten', 'sitting')   # => 3
Philiprehberger::FuzzyMatch.jaro_winkler('martha', 'marhta')   # => ~0.96
Philiprehberger::FuzzyMatch.dice_coefficient('night', 'nacht') # => 0.25

# Normalized ratio (0.0 to 1.0)
Philiprehberger::FuzzyMatch.ratio('kitten', 'sitting')  # => ~0.57
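
To make the relationship between the edit distance and the normalized ratio concrete, here is a minimal pure-Ruby sketch. It is not the gem's implementation: `sketch_levenshtein` and `sketch_ratio` are hypothetical names, and the normalization (one minus distance over the longer length) is an assumption that happens to agree with the `~0.57` shown above.

```ruby
# Illustrative sketch only; these helpers are not part of the gem.
def sketch_levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  # prev[j] is the distance between the processed prefix of a and b[0, j].
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match)
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Assumed normalization: share of the longer string that did not need editing.
def sketch_ratio(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  (longest - sketch_levenshtein(a, b)).to_f / longest
end

puts sketch_levenshtein('kitten', 'sitting')    # prints 3
puts sketch_ratio('kitten', 'sitting').round(2) # prints 0.57
```

Keeping only two rows of the table at a time brings memory use down to the length of the second string.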

Damerau-Levenshtein (Transposition-Aware)

# Counts adjacent transpositions as 1 edit (Levenshtein counts them as 2)
Philiprehberger::FuzzyMatch.damerau_levenshtein('teh', 'the')   # => 1
Philiprehberger::FuzzyMatch.damerau_ratio('teh', 'the')         # => ~0.667
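
The following sketch shows one common way to make edit distance transposition-aware: the "optimal string alignment" variant, which adds a single extra case to the Levenshtein recurrence. Whether the gem uses this variant or the unrestricted Damerau-Levenshtein algorithm is not stated here, so treat `sketch_damerau` as a hypothetical illustration.

```ruby
# Illustrative sketch only (optimal string alignment variant); not the gem's code.
def sketch_damerau(a, b)
  a = a.downcase.chars
  b = b.downcase.chars
  # d[i][j] = edits needed to turn the first i chars of a into the first j chars of b.
  d = Array.new(a.length + 1) do |i|
    Array.new(b.length + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) }
  end
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
      # Extra case: two adjacent characters swapped count as one edit.
      if i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
        d[i][j] = [d[i][j], d[i - 2][j - 2] + 1].min
      end
    end
  end
  d[a.length][b.length]
end

puts sketch_damerau('teh', 'the') # prints 1
```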

Longest Common Subsequence

Philiprehberger::FuzzyMatch.lcs('kitten', 'sitting')       # => 4
Philiprehberger::FuzzyMatch.lcs_ratio('kitten', 'sitting')  # => ~0.615
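
As a rough illustration of how these two numbers relate, the sketch below computes the LCS length with a two-row dynamic-programming table and normalizes it as twice the LCS length over the combined string length. That formula is an assumption, chosen because it reproduces the `~0.615` above (8 / 13); the helper names are hypothetical.

```ruby
# Illustrative sketch only; not the gem's implementation.
def sketch_lcs(a, b)
  a = a.downcase
  b = b.downcase
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index(1) do |cb, j|
      # Extend the diagonal on a match, otherwise carry the best neighbor.
      curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
    end
    prev = curr
  end
  prev.last
end

# Assumed normalization: 2 * LCS / (len(a) + len(b)).
def sketch_lcs_ratio(a, b)
  total = a.length + b.length
  return 1.0 if total.zero?

  2.0 * sketch_lcs(a, b) / total
end

puts sketch_lcs('kitten', 'sitting')                # prints 4
puts sketch_lcs_ratio('kitten', 'sitting').round(3) # prints 0.615
```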

Best Match

candidates = %w[Ruby Python Rust JavaScript]
result = Philiprehberger::FuzzyMatch.best('rubyy', candidates)
result[:match]  # => "Ruby"
result[:score]  # => 0.8

Ranked Search

candidates = %w[commit comment command compare]
results = Philiprehberger::FuzzyMatch.search('comit', candidates, threshold: 0.5)
# => [{ match: "commit", score: 0.8333 }, { match: "comment", score: 0.7143 }, ...]

Rank Candidates

candidates = %w[commit comment command compare]
Philiprehberger::FuzzyMatch.rank('comit', candidates)
# => [{ value: "commit", score: ... }, { value: "comment", score: ... }, ...]

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.rank('comit', candidates, algorithm: :levenshtein)

Closest N Matches

candidates = %w[commit comment command compare zebra]
results = Philiprehberger::FuzzyMatch.closest_n('comit', candidates, n: 3)
# => [{ match: "commit", score: ... }, { match: "comment", score: ... }, { match: "command", score: ... }]

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.closest_n('comit', candidates, n: 2, algorithm: :levenshtein)

Top-N Matches

candidates = %w[commit comment command compare zebra]
results = Philiprehberger::FuzzyMatch.top_n('comit', candidates, n: 3)
# => [{ value: "commit", similarity: ... }, { value: "comment", similarity: ... }, { value: "command", similarity: ... }]

# Filter out low-similarity entries with min_similarity (default 0.0)
Philiprehberger::FuzzyMatch.top_n('comit', candidates, n: 5, min_similarity: 0.7)
# => only entries whose similarity >= 0.7

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.top_n('comit', candidates, n: 2, algorithm: :levenshtein)
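
The score-filter-sort-take pipeline behind a top-N lookup can be sketched in a few lines. This is a hypothetical stand-in, not the gem's code: it scores with a simple Levenshtein ratio rather than the gem's default Jaro-Winkler, so the similarity values will differ from what `top_n` returns.

```ruby
# Illustrative sketch only; helper names are hypothetical.
def sketch_levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

def sketch_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  (longest - sketch_levenshtein(a.downcase, b.downcase)).to_f / longest
end

def sketch_top_n(query, candidates, n:, min_similarity: 0.0)
  scored = candidates.each_with_index.map do |value, index|
    { value: value, similarity: sketch_similarity(query, value), index: index }
  end
  scored
    .select { |entry| entry[:similarity] >= min_similarity }
    .sort_by { |entry| [-entry[:similarity], entry[:index]] } # ties keep input order
    .first(n)
    .map { |entry| entry.slice(:value, :similarity) }
end

words = %w[commit comment command compare zebra]
p sketch_top_n('comit', words, n: 2).map { |entry| entry[:value] }
# prints ["commit", "comment"]
```

Applying the similarity floor before sorting means `n` caps the result size rather than guaranteeing it, so a strict floor can return fewer than `n` entries.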

Did-You-Mean Suggestions

Philiprehberger::FuzzyMatch.suggest('comit', %w[commit comment zebra], threshold: 0.6, max: 3)
# => ["commit", "comment"]

Phonetic Matching

Philiprehberger::FuzzyMatch.soundex('Robert')    # => "R163"
Philiprehberger::FuzzyMatch.metaphone('Smith')    # => "SM0"
Philiprehberger::FuzzyMatch.phonetic_match?('Robert', 'Rupert')  # => true
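
For readers unfamiliar with Soundex, the sketch below follows the standard American Soundex rules: keep the first letter, map consonants to digits, collapse repeated codes, and pad to four characters. It is a hypothetical illustration (`sketch_soundex` is not a gem method), and the gem may handle edge cases differently.

```ruby
# Illustrative sketch of standard American Soundex; not the gem's code.
def sketch_soundex(word)
  codes = {}
  { 'BFPV' => '1', 'CGJKQSXZ' => '2', 'DT' => '3', 'L' => '4', 'MN' => '5', 'R' => '6' }
    .each { |group, digit| group.each_char { |c| codes[c] = digit } }

  letters = word.upcase.gsub(/[^A-Z]/, '')
  return '' if letters.empty?

  code = letters[0].dup
  last = codes[letters[0]]
  letters[1..].each_char do |ch|
    break if code.length == 4

    digit = codes[ch]
    if digit
      code << digit unless digit == last # collapse runs of the same code
      last = digit
    elsif !'HW'.include?(ch)
      last = nil # vowels break a run; H and W do not
    end
  end
  code.ljust(4, '0')
end

puts sketch_soundex('Robert') # prints R163
puts sketch_soundex('Rupert') # prints R163
```

Two names match phonetically under this scheme when their codes are equal, which is why "Robert" and "Rupert" compare as a match.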

Similarity Matrix

strings = %w[hello helo world]
matrix = Philiprehberger::FuzzyMatch.similarity_matrix(strings)
# => { "hello" => { "hello" => 1.0, "helo" => 0.9333, "world" => 0.4667 }, ... }

# Filter to only high-similarity pairs
matrix = Philiprehberger::FuzzyMatch.similarity_matrix(strings, threshold: 0.8)
# => { "hello" => { "hello" => 1.0, "helo" => 0.9333 }, ... }

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.similarity_matrix(strings, algorithm: :dice)
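
A pairwise matrix is just a nested loop over the input with a scoring function plugged in. The sketch below uses a set-based bigram Dice coefficient as the scorer. The helper names are hypothetical, and treating bigrams as a set rather than a multiset is an assumption about how the gem counts overlaps.

```ruby
# Illustrative sketch only; not the gem's implementation.
def sketch_bigrams(text)
  text.downcase.chars.each_cons(2).map(&:join).uniq
end

# Dice coefficient: 2 * shared bigrams / total bigrams.
def sketch_dice(a, b)
  x = sketch_bigrams(a)
  y = sketch_bigrams(b)
  if x.empty? || y.empty?
    # Strings shorter than two characters have no bigrams to compare.
    return a.casecmp?(b) ? 1.0 : 0.0
  end

  2.0 * (x & y).length / (x.length + y.length)
end

def sketch_similarity_matrix(strings, threshold: 0.0)
  strings.to_h do |row|
    scores = strings.to_h { |col| [col, sketch_dice(row, col)] }
    [row, scores.select { |_, score| score >= threshold }]
  end
end

matrix = sketch_similarity_matrix(%w[hello helo world], threshold: 0.8)
p matrix['hello'].keys # prints ["hello", "helo"]
```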

Deduplication

Philiprehberger::FuzzyMatch.deduplicate(%w[hello helo world wrld], threshold: 0.8)
# => ["hello", "world"]
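
One simple way to deduplicate by similarity is a greedy pass that keeps a string only if it is not too close to anything already kept. The sketch below takes that approach with a Levenshtein ratio as the scorer. Both choices are assumptions for illustration, and the gem's grouping strategy and default algorithm may differ.

```ruby
# Illustrative sketch only; helper names are hypothetical.
def sketch_levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

def sketch_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  (longest - sketch_levenshtein(a.downcase, b.downcase)).to_f / longest
end

# Greedy pass: the first member of each group of near-duplicates is kept.
def sketch_deduplicate(strings, threshold:)
  strings.each_with_object([]) do |candidate, kept|
    duplicate = kept.any? { |seen| sketch_similarity(candidate, seen) >= threshold }
    kept << candidate unless duplicate
  end
end

p sketch_deduplicate(%w[hello helo world wrld], threshold: 0.8)
# prints ["hello", "world"]
```

Because the pass is order-dependent, reordering the input can change which representative of each group survives.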

Hamming Distance

Philiprehberger::FuzzyMatch.hamming('karolin', 'kathrin')  # => 3
Philiprehberger::FuzzyMatch.hamming('abc', 'abc')          # => 0
# Raises Error for different-length strings
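
Hamming distance counts the positions at which two equal-length strings differ, which is why unequal lengths have to be rejected. A minimal sketch follows; `sketch_hamming` is a hypothetical name, and it raises `ArgumentError` as a stand-in because the gem's exact error class is not shown here.

```ruby
# Illustrative sketch only; not the gem's implementation.
def sketch_hamming(a, b)
  # ArgumentError is an assumption; the gem raises its own error type.
  raise ArgumentError, 'strings must have equal length' unless a.length == b.length

  a.downcase.chars.zip(b.downcase.chars).count { |x, y| x != y }
end

puts sketch_hamming('karolin', 'kathrin') # prints 3
```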

Token-Based Matching

# Token sort: reorder tokens alphabetically before comparing
Philiprehberger::FuzzyMatch.token_sort_ratio('john smith jr', 'jr john smith')  # => 1.0

# Token set: compare based on token set intersection/union
Philiprehberger::FuzzyMatch.token_set_ratio('new york mets', 'new york mets vs atlanta braves')
# => high score (shared tokens boost similarity)
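
The token-sort idea reduces to a normalization step applied before any ordinary similarity measure. The sketch below shows only that step: `sketch_token_sort_key` is a hypothetical helper, and splitting on whitespace is an assumption about how the gem tokenizes.

```ruby
# Illustrative sketch only; not the gem's implementation.
def sketch_token_sort_key(text)
  # Lowercase, split on whitespace, sort tokens, and rejoin with single spaces.
  text.downcase.split.sort.join(' ')
end

puts sketch_token_sort_key('john smith jr') # prints john jr smith
puts sketch_token_sort_key('jr john smith') # prints john jr smith
```

Since both inputs normalize to the same key, any similarity measure applied afterwards returns a perfect score, which is consistent with the `1.0` shown for `token_sort_ratio`.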

Partial (Substring) Ratio

# "Does string A appear approximately inside string B?"
Philiprehberger::FuzzyMatch.partial_ratio('the cat', 'the cat sat on the mat')
# => 1.0

Philiprehberger::FuzzyMatch.partial_ratio('cat', 'a black cat sat on a mat')
# => 1.0

Philiprehberger::FuzzyMatch.partial_ratio('helo', 'why hello there')
# => ~0.75

partial_ratio slides the shorter string across every same-length window of the longer one and returns the maximum Levenshtein-based ratio. The aim is parity with FuzzyWuzzy's partial_ratio.
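
The sliding-window description can be sketched directly. This is a hypothetical illustration rather than the gem's code: the helper names are made up, and returning 1.0 for an empty string is an assumption.

```ruby
# Illustrative sketch only; helper names are hypothetical.
def sketch_levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

def sketch_partial_ratio(a, b)
  short, long = [a.downcase, b.downcase].sort_by(&:length)
  return 1.0 if short.empty? # assumption: an empty string matches anywhere

  # Score every window of the longer string that is as long as the shorter one.
  (0..long.length - short.length).map do |start|
    window = long[start, short.length]
    (short.length - sketch_levenshtein(short, window)).to_f / short.length
  end.max
end

puts sketch_partial_ratio('helo', 'why hello there')           # prints 0.75
puts sketch_partial_ratio('the cat', 'the cat sat on the mat') # prints 1.0
```

The best window for "helo" is "hell", one substitution away, which gives 3 / 4 = 0.75.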

Weighted Scoring

Philiprehberger::FuzzyMatch.weighted_score('kitten', 'sitting',
  weights: { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 })
# => weighted combination of algorithm scores
# Supported keys: :jaro_winkler, :dice, :levenshtein_ratio, :lcs_ratio, :damerau_ratio
# Weights must sum to 1.0
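
The combination step itself is a weighted sum. The sketch below shows it in isolation, taking per-algorithm scores as input instead of computing them. `sketch_weighted_score` is a hypothetical helper, and the score values in the example are made-up placeholders, not real outputs of the gem.

```ruby
# Illustrative sketch only; not the gem's implementation.
def sketch_weighted_score(scores, weights)
  total = weights.values.sum
  raise ArgumentError, 'weights must sum to 1.0' unless (total - 1.0).abs < 1e-9

  # Each algorithm contributes its score scaled by its weight.
  weights.sum { |name, weight| weight * scores.fetch(name) }
end

# Placeholder scores for illustration only.
scores  = { jaro_winkler: 0.75, dice: 0.5, levenshtein_ratio: 0.5 }
weights = { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 }
puts sketch_weighted_score(scores, weights).round(3) # prints 0.625
```

Requiring the weights to sum to 1.0 keeps the combined score in the same 0.0 to 1.0 range as its inputs.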

API

Philiprehberger::FuzzyMatch

| Method | Description |
| --- | --- |
| .levenshtein(a, b) | Levenshtein edit distance (integer) |
| .jaro_winkler(a, b) | Jaro-Winkler similarity (0.0 to 1.0) |
| .dice_coefficient(a, b) | Dice coefficient from bigram overlap (0.0 to 1.0) |
| .damerau_levenshtein(a, b) | Damerau-Levenshtein distance with transpositions (integer) |
| .damerau_ratio(a, b) | Normalized Damerau-Levenshtein similarity (0.0 to 1.0) |
| .lcs(a, b) | Longest common subsequence length (integer) |
| .lcs_ratio(a, b) | Normalized LCS similarity (0.0 to 1.0) |
| .ratio(a, b) | Normalized Levenshtein ratio (0.0 to 1.0) |
| .partial_ratio(a, b) | Substring-style similarity: max Levenshtein ratio over windows of the longer string (0.0 to 1.0) |
| .best(query, candidates, threshold: 0.0) | Best match as { match:, score: } |
| .search(query, candidates, threshold: 0.3) | Ranked array of { match:, score: } |
| .suggest(query, candidates, threshold: 0.6, max: 5) | Array of match strings |
| .rank(query, candidates, algorithm: :jaro_winkler) | All candidates sorted desc as { value:, score: } (stable) |
| .closest_n(query, candidates, n:, algorithm: :jaro_winkler) | Top N matches as { match:, score: } sorted by score descending |
| .top_n(query, candidates, n:, algorithm: :jaro_winkler, min_similarity: 0.0) | Top-N matches as { value:, similarity: } sorted by similarity descending, with optional similarity floor |
| .soundex(string) | Generate 4-character Soundex code |
| .metaphone(string) | Generate Metaphone phonetic code |
| .phonetic_match?(a, b) | Check if two strings match phonetically |
| .hamming(a, b) | Hamming distance for equal-length strings (integer) |
| .token_sort_ratio(a, b) | Token-sorted Jaro-Winkler similarity (0.0 to 1.0) |
| .token_set_ratio(a, b) | Token-set-based similarity (0.0 to 1.0) |
| .weighted_score(a, b, weights:) | Weighted multi-algorithm score (0.0 to 1.0) |
| .similarity_matrix(strings, algorithm:, threshold:) | Pairwise similarity hash-of-hashes for batch comparison |
| .deduplicate(array, threshold:, algorithm:) | Group and deduplicate similar strings |

All methods are case-insensitive by default.

Development

bundle install
bundle exec rspec
bundle exec rubocop

Support

If you find this project useful:

  • Star the repo
  • 🐛 Report issues
  • 💡 Suggest features
  • ❤️ Sponsor development
  • 🌐 All Open Source Projects
  • 💻 GitHub Profile
  • 🔗 LinkedIn Profile

License

MIT