Twins
Twins sorts through the small differences between multiple objects and smartly consolidates them into one.
Usage
Let's say you have a collection of objects that represent the same book but come from different sources, so each object may differ slightly from the others.
books = [{
title: "Shantaram: A Novel",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
},
{
title: "Shantaram",
author: "Gregory David Roberts & Alejandro Palomas",
published: 2012,
details: {
paperback: false
}
},
{
title: "Shantaram",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
},
{
title: "Shantaram",
author: "Gregory D. Roberts",
published: 2005,
details: {
paperback: true
}
}]
Consolidate
Assembles a new Hash based on every element in the collection. By default, Twins.consolidate picks the candidate value for each key by taking the most frequent value present for that key, also known as the mode.
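The mode selection described above can be sketched as follows. This is a minimal illustration, not the gem's implementation: `mode_of` and `consolidate_by_mode` are hypothetical names, and recursing into nested Hashes is an assumption based on the `details` key in the example.

```ruby
# Hypothetical helper, not part of the Twins API:
# returns the most frequent value in a list.
def mode_of(values)
  counts = Hash.new(0)
  values.each { |value| counts[value] += 1 }
  counts.max_by { |_value, count| count }.first
end

# Hypothetical sketch of mode-based consolidation.
# Assumption: nested Hashes are consolidated key by key.
def consolidate_by_mode(collection)
  keys = collection.flat_map(&:keys).uniq
  keys.each_with_object({}) do |key, result|
    values = collection.select { |element| element.key?(key) }
                       .map { |element| element[key] }
    result[key] =
      if values.all? { |value| value.is_a?(Hash) }
        consolidate_by_mode(values)
      else
        mode_of(values)
      end
  end
end

sample = [
  { title: "Shantaram: A Novel", published: 2012, details: { paperback: true } },
  { title: "Shantaram",          published: 2012, details: { paperback: false } },
  { title: "Shantaram",          published: 2005, details: { paperback: true } }
]

consolidated = consolidate_by_mode(sample)
# consolidated[:title]     is "Shantaram"
# consolidated[:published] is 2012
# consolidated[:details]   is { paperback: true }
```
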
Twins.consolidate(books)
{
title: "Shantaram",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
You may also provide Twins.consolidate with priorities for String and Numeric attributes, which take precedence over the mode when determining the candidate value.
options = {
priority: {
title: "Novel"
}
}
Twins.consolidate(books, options)
{
title: "Shantaram: A Novel",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
Pick
Selects the collection's most representative element. By default, Twins.pick chooses the element that contains the highest number of modes, that is, the element whose values most often match the most frequent value for each key.
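A rough sketch of that selection rule, under stated assumptions: `mode_of` and `pick_by_mode_count` are hypothetical names rather than Twins methods, and nested Hashes are compared as whole values for simplicity.

```ruby
# Hypothetical helper, not part of the Twins API:
# returns the most frequent value in a list.
def mode_of(values)
  counts = Hash.new(0)
  values.each { |value| counts[value] += 1 }
  counts.max_by { |_value, count| count }.first
end

# Hypothetical sketch: pick the element whose values match the
# per-key mode most often. Nested Hashes are compared as a whole.
def pick_by_mode_count(collection)
  keys = collection.flat_map(&:keys).uniq
  modes = keys.each_with_object({}) do |key, result|
    result[key] = mode_of(collection.map { |element| element[key] })
  end
  collection.max_by do |element|
    modes.count { |key, mode| element[key] == mode }
  end
end

sample = [
  { title: "Shantaram: A Novel", author: "Gregory David Roberts", published: 2012 },
  { title: "Shantaram",          author: "Gregory David Roberts", published: 2012 },
  { title: "Shantaram",          author: "Gregory D. Roberts",    published: 2005 }
]

picked = pick_by_mode_count(sample)
# picked is the second element: all three of its values are modes.
```
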
Twins.pick(books)
{
title: "Shantaram",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
You may also provide Twins.pick with priorities for String and Numeric attributes, which are used to compute each element's overall distance when determining the candidate element.
options = {
priority: {
title: "Novel"
}
}
Twins.pick(books, options)
{
title: "Shantaram: A Novel",
author: "Gregory David Roberts",
published: 2012,
details: {
paperback: true
}
}
Internals
Distance
String distances are calculated using a longest common subsequence algorithm, and Numeric distances are calculated as the difference between the two values.
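To make the two distance measures concrete, here is a small sketch. The function names are hypothetical, and the way the subsequence length is turned into a distance (longer string length minus the shared subsequence length) is an assumption for illustration; the gem may scale or normalize it differently.

```ruby
# Length of the longest common subsequence of two strings,
# computed with the classic row-by-row dynamic programming table.
def lcs_length(a, b)
  previous = Array.new(b.length + 1, 0)
  a.each_char do |char_a|
    current = [0]
    b.each_char.with_index(1) do |char_b, j|
      value = if char_a == char_b
                previous[j - 1] + 1
              else
                [previous[j], current[j - 1]].max
              end
      current << value
    end
    previous = current
  end
  previous.last
end

# Assumption: distance is the number of characters of the longer
# string that are not covered by the common subsequence.
def string_distance(a, b)
  [a.length, b.length].max - lcs_length(a, b)
end

# Numeric distance as the absolute difference between the values.
def numeric_distance(a, b)
  (a - b).abs
end

string_distance("Shantaram", "Shantaram: A Novel") # => 9
numeric_distance(2012, 2005)                       # => 7
```
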
Contributing
1. Fork it
2. Create a topic branch
3. Add specs for your unimplemented modifications
4. Run bundle exec rspec. If specs pass, return to step 3.
5. Implement your modifications
6. Run bundle exec rspec. If specs fail, return to step 5.
7. Commit your changes and push
8. Submit a pull request
9. Thank you!
TODO
- Think about using the Jaccard index to weight items
Author
License
See LICENSE