No commit activity in last 3 years
No release in over 3 years
Correct strings based on remote errata files
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies
 Project Readme

errata

Define an errata in table format (CSV) and then apply it to an arbitrary source. Inspired by RFC Errata, lets you keep your own errata in a transparent way.

Tested in MRI 1.8.7+, MRI 1.9.2+, and JRuby 1.6.7+. Thread safe.

Inspiration

There's a process for reporting errata on RFC:

screenshot of the RFC Editor

Example

Every errata has a table structure based on the IETF RFC Editor's "How to Report Errata".

date name email type section action x y condition notes
2011-03-22 Ian Hough ian@brighterplanet.com meta Intended use http://example.com/original-data-with-errors.xls A hypothetical document that uses non-ISO country names
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /ANTIGUA & BARBUDA/ ANTIGUA AND BARBUDA
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /BOLIVIA/ BOLIVIA, PLURINATIONAL STATE OF
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /BOSNIA & HERZEGOVINA/ BOSNIA AND HERZEGOVINA
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /BRITISH VIRGIN ISLANDS/ VIRGIN ISLANDS, BRITISH
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /COTE D'IVOIRE/ CÔTE D'IVOIRE
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /DEM\. PEOPLE'S REP\. OF KOREA/ KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /DEM\. REP\. OF THE CONGO/ CONGO, THE DEMOCRATIC REPUBLIC OF THE
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /HONG KONG SAR/ HONG KONG
2011-03-22 Ian Hough ian@brighterplanet.com technical Country Name replace /IRAN \(ISLAMIC REPUBLIC OF\)/ IRAN, ISLAMIC REPUBLIC OF

Which would be saved as a CSV:

date,name,email,type,section,action,x,y,condition,notes
2011-03-22,Ian Hough,ian@brighterplanet.com,meta,Intended use,,http://example.com/original-data-with-errors.xls,,A hypothetical document that uses non-ISO country names
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/ANTIGUA & BARBUDA/,ANTIGUA AND BARBUDA,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOLIVIA/,"BOLIVIA, PLURINATIONAL STATE OF",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BOSNIA & HERZEGOVINA/,BOSNIA AND HERZEGOVINA,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/BRITISH VIRGIN ISLANDS/,"VIRGIN ISLANDS, BRITISH",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/COTE D'IVOIRE/,CÔTE D'IVOIRE,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/DEM\.  PEOPLE'S REP\. OF KOREA/,"KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/DEM\. REP\. OF THE CONGO/,"CONGO, THE DEMOCRATIC REPUBLIC OF THE",,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/HONG KONG SAR/,HONG KONG,,
2011-03-22,Ian Hough,ian@brighterplanet.com,technical,Country Name,replace,/IRAN \(ISLAMIC REPUBLIC OF\)/,"IRAN, ISLAMIC REPUBLIC OF",,

And then used

errata = Errata.new(:url => 'http://example.com/errata.csv')
original = RemoteTable.new(:url => 'http://example.com/original-data-with-errors.xls')
original.each do |row|
  errata.correct! row # destructively correct each row
end

UTF-8

Assumes all input strings are UTF-8. Otherwise there can be problems with Ruby 1.9 and Regexp::FIXEDENCODING. Specifically, ASCII-8BIT regexps might be applied to UTF-8 strings (or vice-versa), resulting in Encoding::CompatibilityError.

More advanced usage

The earth library has dozens of real-life examples showing errata in action:

Model Reference Errata file
Country data_miner.rb wri_errata.csv
Aircraft data_miner.rb faa_errata.csv
Airports data_miner.rb openflights_errata.csv
Automobile model variants data_miner.rb feg_errata.csv

Real-world usage

Brighter Planet logo

We use errata for data science at Brighter Planet and in production at

The killer combination:

  1. active_record_inline_schema - define table structure
  2. remote_table - download data and parse it
  3. errata (this library!) - apply corrections in a transparent way
  4. data_miner - import data idempotently

Authors

Copyright

Copyright (c) 2012 Brighter Planet. See LICENSE for details.