philiprehberger-fuzzy_match

Fuzzy string matching with Levenshtein, Damerau-Levenshtein, Jaro-Winkler, Hamming, LCS, token-based, and phonetic algorithms

Requirements

Ruby >= 3.1

Installation

Add to your Gemfile:

gem "philiprehberger-fuzzy_match"

Or install directly:

gem install philiprehberger-fuzzy_match

Usage

require "philiprehberger/fuzzy_match"

# Individual algorithms
Philiprehberger::FuzzyMatch.levenshtein('kitten', 'sitting')   # => 3
Philiprehberger::FuzzyMatch.jaro_winkler('martha', 'marhta')   # => ~0.96
Philiprehberger::FuzzyMatch.dice_coefficient('night', 'nacht') # => 0.25

# Normalized ratio (0.0 to 1.0)
Philiprehberger::FuzzyMatch.ratio('kitten', 'sitting')  # => ~0.57
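
To make the numbers above concrete, here is a minimal, self-contained sketch of a textbook Levenshtein distance and a normalized ratio. It is not the gem's implementation: the `levenshtein_sketch` and `ratio_sketch` names are hypothetical, and the normalization (1 minus distance divided by the longer length) is an assumption that happens to reproduce the `~0.57` shown for `kitten`/`sitting`.

```ruby
# Illustrative only: a two-row dynamic-programming Levenshtein distance.
def levenshtein_sketch(a, b)
  a = a.downcase
  b = b.downcase
  return b.length if a.empty?
  return a.length if b.empty?

  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution or match
      ].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1 - distance / length of the longer string.
def ratio_sketch(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein_sketch(a, b).to_f / longest
end

levenshtein_sketch('kitten', 'sitting')    # => 3
ratio_sketch('kitten', 'sitting').round(2) # => 0.57
```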

Damerau-Levenshtein (Transposition-Aware)

# Counts adjacent transpositions as 1 edit (Levenshtein counts them as 2)
Philiprehberger::FuzzyMatch.damerau_levenshtein('teh', 'the')   # => 1
Philiprehberger::FuzzyMatch.damerau_ratio('teh', 'the')         # => ~0.667
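
The transposition rule can be sketched with the "optimal string alignment" variant of Damerau-Levenshtein, shown below. This is an assumption for illustration: the gem may implement the unrestricted variant instead, and `osa_distance_sketch` is a hypothetical name.

```ruby
# Illustrative only: optimal string alignment (restricted Damerau-Levenshtein).
def osa_distance_sketch(a, b)
  a = a.downcase.chars
  b = b.downcase.chars

  # d[i][j] = edits needed to turn the first i chars of a into the first j of b
  d = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  (0..a.length).each { |i| d[i][0] = i }
  (0..b.length).each { |j| d[0][j] = j }

  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min

      # An adjacent swap ("eh" <-> "he") counts as a single edit.
      next unless i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]

      d[i][j] = [d[i][j], d[i - 2][j - 2] + 1].min
    end
  end
  d[a.length][b.length]
end

osa_distance_sketch('teh', 'the') # => 1
```

Dividing that distance by the longer length and subtracting from 1 gives 1 - 1/3, which matches the `~0.667` shown for `damerau_ratio`.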

Longest Common Subsequence

Philiprehberger::FuzzyMatch.lcs('kitten', 'sitting')       # => 4
Philiprehberger::FuzzyMatch.lcs_ratio('kitten', 'sitting')  # => ~0.615
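
A minimal sketch of how these two values could be computed. The method names are hypothetical and the ratio formula (twice the LCS length divided by the combined length) is an assumption, chosen because it reproduces the `~0.615` above.

```ruby
# Illustrative only: LCS length with two rolling rows.
def lcs_length_sketch(a, b)
  a = a.downcase
  b = b.downcase
  previous = Array.new(b.length + 1, 0)
  a.each_char do |char_a|
    current = [0]
    b.each_char.with_index(1) do |char_b, j|
      current << if char_a == char_b
                   previous[j - 1] + 1
                 else
                   [previous[j], current[j - 1]].max
                 end
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 2 * LCS / (length of a + length of b).
def lcs_ratio_sketch(a, b)
  total = a.length + b.length
  return 1.0 if total.zero?

  2.0 * lcs_length_sketch(a, b) / total
end

lcs_length_sketch('kitten', 'sitting')         # => 4 ("ittn")
lcs_ratio_sketch('kitten', 'sitting').round(3) # => 0.615
```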

Best Match

candidates = %w[Ruby Python Rust JavaScript]
result = Philiprehberger::FuzzyMatch.best('rubyy', candidates)
result[:match]  # => "Ruby"
result[:score]  # => 0.8

Ranked Search

candidates = %w[commit comment command compare]
results = Philiprehberger::FuzzyMatch.search('comit', candidates, threshold: 0.5)
# => [{ match: "commit", score: 0.8333 }, { match: "comment", score: 0.7143 }, ...]
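
The score-filter-sort pattern behind a ranked search can be sketched as follows. Everything here is hypothetical: `search_sketch` is not the gem's code, and a simple bigram Dice coefficient stands in as the scorer, so the scores differ from the ones listed above.

```ruby
# Illustrative only: unique lowercase bigrams of a string.
def bigrams_sketch(string)
  string.downcase.each_char.each_cons(2).map(&:join).uniq
end

# Stand-in scorer: Dice coefficient over unique bigrams.
def dice_sketch(a, b)
  x = bigrams_sketch(a)
  y = bigrams_sketch(b)
  return 0.0 if x.empty? || y.empty?

  2.0 * (x & y).size / (x.size + y.size)
end

# Score every candidate, drop those below the threshold, sort best first.
def search_sketch(query, candidates, threshold: 0.3)
  candidates
    .map { |candidate| { match: candidate, score: dice_sketch(query, candidate) } }
    .select { |entry| entry[:score] >= threshold }
    .sort_by { |entry| -entry[:score] }
end

dice_sketch('night', 'nacht') # => 0.25
search_sketch('comit', %w[commit comment command compare], threshold: 0.5)
# keeps only "commit" with this stand-in scorer
```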

Rank Candidates

candidates = %w[commit comment command compare]
Philiprehberger::FuzzyMatch.rank('comit', candidates)
# => [{ value: "commit", score: ... }, { value: "comment", score: ... }, ...]

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.rank('comit', candidates, algorithm: :levenshtein)

Top N Matches (closest_n)

candidates = %w[commit comment command compare zebra]
results = Philiprehberger::FuzzyMatch.closest_n('comit', candidates, n: 3)
# => [{ match: "commit", score: ... }, { match: "comment", score: ... }, { match: "command", score: ... }]

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.closest_n('comit', candidates, n: 2, algorithm: :levenshtein)

Top-N Matches with a Similarity Floor (top_n)

candidates = %w[commit comment command compare zebra]
results = Philiprehberger::FuzzyMatch.top_n('comit', candidates, n: 3)
# => [{ value: "commit", similarity: ... }, { value: "comment", similarity: ... }, { value: "command", similarity: ... }]

# Filter out low-similarity entries with min_similarity (default 0.0)
Philiprehberger::FuzzyMatch.top_n('comit', candidates, n: 5, min_similarity: 0.7)
# => only entries whose similarity >= 0.7

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.top_n('comit', candidates, n: 2, algorithm: :levenshtein)

Did-You-Mean Suggestions

Philiprehberger::FuzzyMatch.suggest('comit', %w[commit comment zebra], threshold: 0.6, max: 3)
# => ["commit", "comment"]

Phonetic Matching

Philiprehberger::FuzzyMatch.soundex('Robert')    # => "R163"
Philiprehberger::FuzzyMatch.metaphone('Smith')    # => "SM0"
Philiprehberger::FuzzyMatch.phonetic_match?('Robert', 'Rupert')  # => true
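
For readers unfamiliar with Soundex, here is a minimal sketch of the standard American Soundex rules. It is an illustration under that assumption, not the gem's code; `SOUNDEX_CODES_SKETCH` and `soundex_sketch` are hypothetical names, and how `phonetic_match?` combines Soundex and Metaphone is not shown here.

```ruby
# Illustrative only: consonant groups used by American Soundex.
SOUNDEX_CODES_SKETCH = {
  'b' => '1', 'f' => '1', 'p' => '1', 'v' => '1',
  'c' => '2', 'g' => '2', 'j' => '2', 'k' => '2',
  'q' => '2', 's' => '2', 'x' => '2', 'z' => '2',
  'd' => '3', 't' => '3',
  'l' => '4',
  'm' => '5', 'n' => '5',
  'r' => '6'
}.freeze

def soundex_sketch(word)
  letters = word.downcase.gsub(/[^a-z]/, '')
  return '' if letters.empty?

  code = letters[0].upcase
  last_digit = SOUNDEX_CODES_SKETCH[letters[0]]
  letters[1..].each_char do |char|
    # h and w are skipped without resetting the previous code.
    next if %w[h w].include?(char)

    digit = SOUNDEX_CODES_SKETCH[char]
    code << digit if digit && digit != last_digit
    # Vowels reset last_digit to nil, so a repeated code after a vowel is kept.
    last_digit = digit
  end
  code[0, 4].ljust(4, '0')
end

soundex_sketch('Robert') # => "R163"
soundex_sketch('Rupert') # => "R163"
```

Both names reduce to the same code, which is why they compare as a phonetic match.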

Similarity Matrix

strings = %w[hello helo world]
matrix = Philiprehberger::FuzzyMatch.similarity_matrix(strings)
# => { "hello" => { "hello" => 1.0, "helo" => 0.9333, "world" => 0.4667 }, ... }

# Filter to only high-similarity pairs
matrix = Philiprehberger::FuzzyMatch.similarity_matrix(strings, threshold: 0.8)
# => { "hello" => { "hello" => 1.0, "helo" => 0.9333 }, ... }

# Choose algorithm (:jaro_winkler default, :dice, or :levenshtein)
Philiprehberger::FuzzyMatch.similarity_matrix(strings, algorithm: :dice)

Deduplication

Philiprehberger::FuzzyMatch.deduplicate(%w[hello helo world wrld], threshold: 0.8)
# => ["hello", "world"]
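
One plausible way to get this result is a greedy pass that keeps the first string of each similar group. The sketch below assumes that strategy; `deduplicate_sketch` and the character-set Jaccard scorer are hypothetical stand-ins, not the gem's algorithm or its default scorer.

```ruby
# Stand-in scorer: Jaccard similarity over the sets of characters.
def char_jaccard_sketch(a, b)
  x = a.downcase.chars.uniq
  y = b.downcase.chars.uniq
  return 1.0 if x.empty? && y.empty?

  (x & y).size.to_f / (x | y).size
end

# Greedy pass: keep a string only if it is not similar to one already kept.
def deduplicate_sketch(strings, threshold: 0.8)
  strings.each_with_object([]) do |string, kept|
    duplicate = kept.any? { |existing| yield(existing, string) >= threshold }
    kept << string unless duplicate
  end
end

deduplicate_sketch(%w[hello helo world wrld], threshold: 0.8) do |a, b|
  char_jaccard_sketch(a, b)
end
# => ["hello", "world"]
```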

Hamming Distance

Philiprehberger::FuzzyMatch.hamming('karolin', 'kathrin')  # => 3
Philiprehberger::FuzzyMatch.hamming('abc', 'abc')          # => 0
# Raises Error for different-length strings
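
Hamming distance is simply the count of positions at which two equal-length strings differ. A minimal sketch follows; `hamming_sketch` is a hypothetical name, and raising `ArgumentError` is an assumption, since the gem's own error class is not shown here.

```ruby
# Illustrative only: count differing positions, ignoring case.
def hamming_sketch(a, b)
  raise ArgumentError, 'strings must have equal length' unless a.length == b.length

  a.downcase.chars.zip(b.downcase.chars).count { |x, y| x != y }
end

hamming_sketch('karolin', 'kathrin') # => 3 (r/t, o/h, l/r)
```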

Token-Based Matching

# Token sort: reorder tokens alphabetically before comparing
Philiprehberger::FuzzyMatch.token_sort_ratio('john smith jr', 'jr john smith')  # => 1.0

# Token set: compare based on token set intersection/union
Philiprehberger::FuzzyMatch.token_set_ratio('new york mets', 'new york mets vs atlanta braves')
# => high score (shared tokens boost similarity)
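
The preprocessing behind both ideas can be sketched without committing to a scoring formula. The helpers below are hypothetical: one normalizes token order, the other splits two strings into shared and unshared tokens. How the gem turns these into a final score is not shown.

```ruby
# Token sort: lowercase, split on whitespace, sort, and rejoin.
def sorted_tokens_sketch(string)
  string.downcase.split.sort.join(' ')
end

# Token set: partition unique tokens into shared and one-sided groups.
def token_partition_sketch(a, b)
  tokens_a = a.downcase.split.uniq
  tokens_b = b.downcase.split.uniq
  {
    shared: tokens_a & tokens_b,
    only_a: tokens_a - tokens_b,
    only_b: tokens_b - tokens_a
  }
end

sorted_tokens_sketch('john smith jr') # => "john jr smith"
sorted_tokens_sketch('jr john smith') # => "john jr smith"

token_partition_sketch('new york mets', 'new york mets vs atlanta braves')
# => { shared: ["new", "york", "mets"], only_a: [], only_b: ["vs", "atlanta", "braves"] }
```

Because both inputs sort to the same string, any similarity measure applied afterwards scores them as identical.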

Partial (Substring) Ratio

# "Does string A appear approximately inside string B?"
Philiprehberger::FuzzyMatch.partial_ratio('the cat', 'the cat sat on the mat')
# => 1.0

Philiprehberger::FuzzyMatch.partial_ratio('cat', 'a black cat sat on a mat')
# => 1.0

Philiprehberger::FuzzyMatch.partial_ratio('helo', 'why hello there')
# => ~0.75

`partial_ratio` slides the shorter string across every same-length window of the longer one and returns the maximum Levenshtein-based ratio. The behavior is intended to match FuzzyWuzzy's `partial_ratio`.
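
The sliding-window idea can be sketched as below. This is an illustration, not the gem's code: the method names are hypothetical, and returning 0.0 when the shorter string is empty is an assumption.

```ruby
# Illustrative only: single-row Levenshtein distance.
def edit_distance_sketch(a, b)
  row = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    previous_diagonal = row[0]
    row[0] = i
    b.each_char.with_index(1) do |char_b, j|
      substitution = previous_diagonal + (char_a == char_b ? 0 : 1)
      previous_diagonal = row[j]
      row[j] = [row[j] + 1, row[j - 1] + 1, substitution].min
    end
  end
  row.last
end

# Best Levenshtein ratio of the shorter string against each window
# of the same length in the longer string.
def partial_ratio_sketch(a, b)
  a = a.downcase
  b = b.downcase
  shorter, longer = a.length <= b.length ? [a, b] : [b, a]
  return 0.0 if shorter.empty? # assumption: nothing to match

  (0..longer.length - shorter.length).map do |offset|
    window = longer[offset, shorter.length]
    1.0 - edit_distance_sketch(shorter, window).to_f / shorter.length
  end.max
end

partial_ratio_sketch('the cat', 'the cat sat on the mat') # => 1.0
partial_ratio_sketch('helo', 'why hello there')           # => 0.75
```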

Weighted Scoring

Philiprehberger::FuzzyMatch.weighted_score('kitten', 'sitting',
  weights: { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 })
# => weighted combination of algorithm scores
# Supported keys: :jaro_winkler, :dice, :levenshtein_ratio, :lcs_ratio, :damerau_ratio
# Weights must sum to 1.0
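
The combination step itself is a weighted sum. The sketch below takes precomputed per-algorithm scores so the arithmetic is visible; `weighted_score_sketch` is a hypothetical name, the example scores are made up for illustration, and raising `ArgumentError` for bad weights is an assumption about error handling.

```ruby
# Illustrative only: combine named scores using weights that sum to 1.0.
def weighted_score_sketch(scores, weights)
  total = weights.values.sum
  raise ArgumentError, 'weights must sum to 1.0' unless (total - 1.0).abs < 1e-9

  weights.sum { |name, weight| weight * scores.fetch(name) }
end

# Hypothetical per-algorithm scores, not real output from the gem.
example_scores  = { jaro_winkler: 0.75, dice: 0.5, levenshtein_ratio: 0.5 }
example_weights = { jaro_winkler: 0.5, dice: 0.3, levenshtein_ratio: 0.2 }

weighted_score_sketch(example_scores, example_weights).round(3)
# => 0.625 (0.375 + 0.15 + 0.1)
```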

API

`Philiprehberger::FuzzyMatch`

Method	Description
`.levenshtein(a, b)`	Levenshtein edit distance (integer)
`.jaro_winkler(a, b)`	Jaro-Winkler similarity (0.0 to 1.0)
`.dice_coefficient(a, b)`	Dice coefficient from bigram overlap (0.0 to 1.0)
`.damerau_levenshtein(a, b)`	Damerau-Levenshtein distance with transpositions (integer)
`.damerau_ratio(a, b)`	Normalized Damerau-Levenshtein similarity (0.0 to 1.0)
`.lcs(a, b)`	Longest common subsequence length (integer)
`.lcs_ratio(a, b)`	Normalized LCS similarity (0.0 to 1.0)
`.ratio(a, b)`	Normalized Levenshtein ratio (0.0 to 1.0)
`.partial_ratio(a, b)`	Substring-style similarity: max Levenshtein ratio over windows of the longer string (0.0 to 1.0)
`.best(query, candidates, threshold: 0.0)`	Best match as `{ match:, score: }`
`.search(query, candidates, threshold: 0.3)`	Ranked array of `{ match:, score: }`
`.suggest(query, candidates, threshold: 0.6, max: 5)`	Array of match strings
`.rank(query, candidates, algorithm: :jaro_winkler)`	All candidates sorted desc as `{ value:, score: }` (stable)
`.closest_n(query, candidates, n:, algorithm: :jaro_winkler)`	Top N matches as `{ match:, score: }` sorted by score descending
`.top_n(query, candidates, n:, algorithm: :jaro_winkler, min_similarity: 0.0)`	Top-N matches as `{ value:, similarity: }` sorted by similarity descending, with optional similarity floor
`.soundex(string)`	Generate 4-character Soundex code
`.metaphone(string)`	Generate Metaphone phonetic code
`.phonetic_match?(a, b)`	Check if two strings match phonetically
`.hamming(a, b)`	Hamming distance for equal-length strings (integer)
`.token_sort_ratio(a, b)`	Token-sorted Jaro-Winkler similarity (0.0 to 1.0)
`.token_set_ratio(a, b)`	Token-set-based similarity (0.0 to 1.0)
`.weighted_score(a, b, weights:)`	Weighted multi-algorithm score (0.0 to 1.0)
`.similarity_matrix(strings, algorithm:, threshold:)`	Pairwise similarity hash-of-hashes for batch comparison
`.deduplicate(array, threshold:, algorithm:)`	Group and deduplicate similar strings

All methods are case-insensitive by default.

Development

bundle install
bundle exec rspec
bundle exec rubocop

License

MIT