Create dictionaries that link rows between two tables (left and right) using loose matching (string similarity) by default and tight matching (regexp) by request.

loose_tight_dictionary

DEPRECATED: use fuzzy_match instead. All further development will happen there.

FuzzyMatch 1.0.5 is identical to LooseTightDictionary 1.0.5 (except for the name).

Find a needle in a haystack based on string similarity (using the Pair Distance algorithm and Levenshtein distance) and regular expressions.

Quickstart

>> require 'loose_tight_dictionary'
=> true 
>> LooseTightDictionary.new(%w{seamus andy ben}).find('Shamus')
=> "seamus"

String similarity matching

Uses Dice’s Coefficient algorithm (aka Pair Distance).

If pair distance judges two strings to be equally similar to a third string, then Levenshtein distance is used as a tie-breaker. For example, pair distance considers “RATZ” and “CATZ” to be equally similar to “RITZ”, so we invoke Levenshtein.

>> require 'amatch'
=> true 
>> 'RITZ'.pair_distance_similar 'RATZ'
=> 0.3333333333333333 
>> 'RITZ'.pair_distance_similar 'CATZ'  # <-- pair distance can't tell the difference, so we fall back to levenshtein...
=> 0.3333333333333333 
>> 'RITZ'.levenshtein_similar 'RATZ'
=> 0.75 
>> 'RITZ'.levenshtein_similar 'CATZ'    # <-- which properly shows that RATZ should win
=> 0.5
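
The same two-stage comparison can be sketched in pure Ruby, without amatch. This is only an illustration of the technique described above, not the gem's internal code; the helper names (bigrams, dice_similarity, levenshtein_similarity) are made up for this example.

```ruby
# Hypothetical helpers illustrating the scoring order; not the gem's internals.

# Adjacent character pairs, e.g. "ritz" -> ["ri", "it", "tz"].
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join)
end

# Dice's coefficient: 2 * shared pairs / total pairs.
def dice_similarity(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?
  remaining = pairs_b.dup
  shared = pairs_a.count do |pair|
    idx = remaining.index(pair)
    idx && remaining.delete_at(idx) # consume each matched pair only once
  end
  2.0 * shared / total
end

# Classic dynamic-programming edit distance, keeping one row at a time.
def levenshtein_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Edit distance normalized to 0..1 by the longer string's length.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_distance(a, b).to_f / longest
end

# Rank by pair distance first; Levenshtein only matters when that ties.
winner = %w{RATZ CATZ}.max_by do |candidate|
  [dice_similarity('RITZ', candidate), levenshtein_similarity('RITZ', candidate)]
end
puts winner # prints RATZ
```

Comparing score arrays element by element means the second score is consulted only when the first is equal, which mirrors the fallback described above.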

Production use

Used in production for over 2 years in Brighter Planet’s environmental impact API and reference data service.

Haystacks and how to read them

The (admittedly imperfect) metaphor is “look for a needle in a haystack”:

  • needle - the search term

  • haystack - the records you are searching (your result will be an object from here)

So, what if your needle is a string like youruguay and your haystack is full of Country objects like <Country name:"Uruguay">?

>> LooseTightDictionary.new(countries, :read => :name).find('youruguay')
=> <Country name:"Uruguay">
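
To make the role of :read concrete, here is a rough stand-alone sketch. It assumes that the option names a method called on each haystack record to obtain the string that is scored, and that the record itself is what gets returned. Country, find_record and the simplified set-based similarity are all hypothetical stand-ins, not the gem's code.

```ruby
# Hypothetical record type standing in for your own model class.
Country = Struct.new(:name)

# Adjacent character pairs of the downcased string.
def bigrams(str)
  str.downcase.each_char.each_cons(2).map(&:join)
end

# Simplified, set-based Dice coefficient (repeated pairs are counted once).
def dice(a, b)
  pa = bigrams(a).uniq
  pb = bigrams(b).uniq
  return 0.0 if pa.empty? && pb.empty?
  2.0 * (pa & pb).size / (pa.size + pb.size)
end

# Stand-in for find with a :read option: score the string that `read`
# extracts from each record, but return the record object itself.
def find_record(haystack, needle, read:)
  haystack.max_by { |record| dice(needle, record.public_send(read)) }
end

countries = %w{Uruguay Paraguay Uganda}.map { |name| Country.new(name) }
best = find_record(countries, 'youruguay', read: :name)
puts best.name # prints Uruguay
```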

Regular expressions

You can improve the default matchings with regular expressions.

  • Emphasize important words using blockings and tighteners

  • Filter out stop words with tighteners

  • Prevent impossible matches with blockings and identities

  • Ignore words with stop words

Blockings

Setting a blocking of /Airbus/ ensures that strings containing “Airbus” will only be scored against other strings containing “Airbus”. A better blocking in this case would probably be /airbus/i, which also catches “AIRBUS” and “airbus”.
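
One way to picture a blocking is as a pre-filter on the haystack. The sketch below assumes a symmetric reading of the rule: a needle and a record are compared only if they agree on every blocking (both match it, or neither does). The method names are hypothetical and the gem's actual behavior may differ in edge cases.

```ruby
# Hypothetical helper: a pair is comparable only if both strings match
# each blocking or neither does (an assumed reading of the rule above).
def comparable?(needle, record, blockings)
  blockings.all? { |blocking| blocking.match?(needle) == blocking.match?(record) }
end

# Narrow the haystack to records that may be scored against the needle.
def candidates(needle, haystack, blockings)
  haystack.select { |record| comparable?(needle, record, blockings) }
end

blockings = [/airbus/i]
haystack  = ['Airbus A320', 'AIRBUS A380', 'Boeing 737']

p candidates('airbus a-320', haystack, blockings)   # => ["Airbus A320", "AIRBUS A380"]
p candidates('Boeing 737-800', haystack, blockings) # => ["Boeing 737"]
```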

Tighteners

Adding a tightener like /(boeing).*(7\d\d)/i will cause “BOEING COMPANY 747” and “boeing747” to be scored as if they were “BOEING 747” and “boeing 747”, respectively. See also “Case sensitivity” below.
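
The effect of a tightener can be approximated by keeping only the regexp's capture groups and joining them with a space. This is a sketch under that assumption; tighten is a hypothetical helper, not part of the gem's API.

```ruby
# Hypothetical helper: if a tightener matches, score only its captures
# (joined by a space); otherwise leave the string untouched.
def tighten(str, tighteners)
  tighteners.each do |tightener|
    match = tightener.match(str)
    return match.captures.compact.join(' ') if match
  end
  str
end

tighteners = [/(boeing).*(7\d\d)/i]

puts tighten('BOEING COMPANY 747', tighteners) # prints BOEING 747
puts tighten('boeing747', tighteners)          # prints boeing 747
puts tighten('Airbus A320', tighteners)        # prints Airbus A320 (no tightener applies)
```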

Identities

Adding an identity like /(F)\-?(\d50)/ ensures that “Ford F-150” and “Ford F-250” never match.
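
An identity can be thought of as a veto: when the regexp matches both strings, their captures must be equal or the pair is rejected. The following sketch assumes exactly that; identity_verdict is a hypothetical name, and returning nil for "no opinion" is a choice made for this example.

```ruby
# Hypothetical helper returning:
#   true  - the identity matches both strings and the captures agree
#   false - the identity matches both strings but the captures differ (veto)
#   nil   - the identity does not apply to this pair
def identity_verdict(a, b, identity)
  match_a = identity.match(a)
  match_b = identity.match(b)
  return nil unless match_a && match_b
  match_a.captures == match_b.captures
end

identity = /(F)\-?(\d50)/

p identity_verdict('Ford F-150', 'Ford F-250', identity) # => false
p identity_verdict('Ford F-150', 'Ford F150', identity)  # => true
p identity_verdict('Ford F-150', 'Ford Focus', identity) # => nil
```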

Stop words

Adding a stop word like THE ensures that it is not taken into account when comparing “THE CAT”, “THE DAT”, and “THE CATT”.
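
In effect, stop words are removed before the strings are scored. A minimal sketch, assuming stop words are given as regexps (the word-boundary pattern below is an illustrative choice, and strip_stop_words is a hypothetical helper):

```ruby
# Hypothetical helper: delete every stop word match, then tidy whitespace.
def strip_stop_words(str, stop_words)
  stripped = stop_words.inject(str) { |memo, stop_word| memo.gsub(stop_word, '') }
  stripped.strip.squeeze(' ')
end

stop_words = [/\bthe\b/i]

['THE CAT', 'THE DAT', 'THE CATT'].each do |phrase|
  puts strip_stop_words(phrase, stop_words)
end
# prints CAT, DAT and CATT on separate lines
```

Only “CAT”, “DAT” and “CATT” are then compared, so the shared “THE” no longer inflates every score.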

Case sensitivity

Scoring is case-insensitive. Everything is downcased before scoring. This is a change from previous versions. Your regexps may still be case-sensitive, though.

Examples

Check out the tests.

Speed (and who to thank for the algorithms)

If you add the amatch gem to your Gemfile, loose_tight_dictionary will use it. It is much faster than the pure Ruby fallback (but segfaults have been seen in the wild). Thanks Flori!

Otherwise, pure ruby versions of the string similarity algorithms derived from the answer to a StackOverflow question and the text gem are used. Thanks marzagao and threedaymonk!
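
The optional-dependency pattern described here usually looks something like the sketch below: try to load the C extension and fall back to pure Ruby if it is missing. This is a generic illustration, not the gem's loading code. The names USING_AMATCH, pure_ruby_pair_similarity and pair_similarity are invented for the example, and the fallback is a simplified set-based Dice coefficient.

```ruby
# Try the fast C extension; fall back quietly if it is not installed.
begin
  require 'amatch'
  USING_AMATCH = true
rescue LoadError
  USING_AMATCH = false
end

# Simplified pure Ruby fallback (set-based Dice coefficient).
def pure_ruby_pair_similarity(a, b)
  pairs_a = a.each_char.each_cons(2).map(&:join).uniq
  pairs_b = b.each_char.each_cons(2).map(&:join).uniq
  return 0.0 if pairs_a.empty? && pairs_b.empty?
  2.0 * (pairs_a & pairs_b).size / (pairs_a.size + pairs_b.size)
end

# Use amatch's String extension when it is present, else the fallback.
def pair_similarity(a, b)
  if USING_AMATCH && a.respond_to?(:pair_distance_similar)
    a.pair_distance_similar(b)
  else
    pure_ruby_pair_similarity(a, b)
  end
end

puts pair_similarity('RITZ', 'RATZ')
```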

Authors

  • Seamus Abshere <seamus@abshere.net>

  • Ian Hough <ijhough@gmail.com>

  • Andy Rossmeissl <andy@rossmeissl.net>

Copyright 2011 Brighter Planet, Inc.