0.0
No release in over a year
CSV Indexer makes it simple the indexation and searching in lasge CSV files. It is not as robust as Lucence, but it is simple and cost-effective. May index files with millions of rows and find specific rows in matter of seconds. Find documentation here: https://github.com/leandrosardi/csv-indexer.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Runtime

~> 3.2.2, >= 3.2.2
~> 1.2.2, >= 1.2.2
 Project Readme

GitHub issues GitHub GitHub tag (latest by date) GitHub last commit

CSV-Indexer

CSV-Indexer makes it simple the indexation and searching in large CSV files.

CSV-Indexer is not as robust as Lucene, but it is simple and cost-effective. May index files with millions of rows and find specific rows in matter of seconds.

1. Installation

gem install csv-indexer

2. Quick Start

Step 1. Download a sample CSV file in the same directory where you are running your Ruby script:

wget https://raw.githubusercontent.com/leandrosardi/csv-indexer/main/examples/example.csv

Step 2. In your Ruby script, require the csv-indexer gem.

require 'csv-indexer'

Step 3. Setup the index for that CSV file.

# define the indexation of for `example.csv`
source = BlackStack::CSVIndexer.add_indexation({
    # Assign a unique name for this indexation.
    #
    # This parameter is mandatory.
    #
    # Each `.csv` file indexed will be stored in a file with the same name but replaciing `.csv` with the name of this index.
    # For example, if you use `:name => 'my_index'` and you index a file called `my_file.csv`, the index will be stored in a file called `my_file.my_index`.
    #
    # This name must have filename safe characters only. No spaces, no special characters.
    # 
    :name => 'ix_example01',
    # Write a brief description of what you are indexing and why.
    # This parameter is optional.
    # Default: nil.
    :description => 'Find the email address and other insights of any LinkedIn user from his/her LinkedIn URL.',
    # The path to the `.csv` file(s) to be indexed.
    # This parameter is optional.
    # Default: './*.csv'
    :input => './example.csv',
    # The path to the directory where the index will be stored.
    # This parameter is optional.
    # Default: './'
    :output => './',
    # The path to the directory where the log files will be stored.
    # This parameter is optional.
    # Default: './'
    :log => './',
    # The mapping of the columns in the `.csv` file to be index.
    # This parameter is mandatory.
    :mapping => {
        :first_name => 0,
        :last_name => 1,
        :linkedin_url => 2,
        :email => 5,
    },
    # List column mapped to the index who are used to build the key of the index.
    # This parameter is mandatory.
    :keys => [:linkedin_url],
})

Step 4. Run the indexation

Add this line to build the index.

BlackStack::CSVIndexer.index('ix_example01')
# => 2022-11-09 15:37:46: Indexing example.csv... done

Note:

For better performance, the index method loads the whole file to memory. So, if you have csv files higher than 500MB, it is advisable you split then in chunks using the split command.

E.g.:

split -C 500m --numeric-suffixes input_filename

Step 5. Searching for a specific LinkedIn URL in your index.

ret = BlackStack::CSVIndexer.find('ix_example01', 'linkedin.com/in/almu-dan-9808753a')
puts "#{ret[:matches].size.to_s} results found."
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => Enlapsed seconds: 0.001595287

3. Indexing Many Files

You can define the indexation of many files.

E.g.: Replacing './example.csv' by './*.csv'.

source = BlackStack::CSVIndexer.add_indexation({
    :name => 'ix_example01',
    :input => './*.csv',
    :mapping => {
        :first_name => 0,
        :last_name => 1,
        :linkedin_url => 2,
        :email => 5,
    },
    :keys => [:linkedin_url],
})

4. Indexing by Many Columns

You can index by many columns.

E.g.: Replacing [:linkedin_url] by [:first_name, :last_name].

source = BlackStack::CSVIndexer.add_indexation({
    :name => 'ix_example02',
    :description => 'Find the email address and other insights of any LinkedIn user from his/her name.',
    :input => './example.csv',
    :output => './',
    :log => './',
    :mapping => {
        :first_name => 0,
        :last_name => 1,
        :linkedin_url => 2,
        :email => 5,
    },
    :keys => [:first_name, :last_name],
})

BlackStack::CSVIndexer.index('ix_example02')
# => 2022-11-09 16:43:52: Indexing example.csv... done

5. Searching by Many Columns

If you indexed by more than one column, you can choose one or more of those columns for search.

E.g.: Replacing 'linkedin.com/in/almu-dan-9808753a' by ['alan', 'armstrong'].

ret = BlackStack::CSVIndexer.find('ix_example02', ['alan', 'armstrong'])
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
    puts "First Name: #{ret[:matches].first[2]}" 
    puts "Last Name: #{ret[:matches].first[3]}" 
    puts "Email: #{ret[:matches].first[5]}" 
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: armstrong
# => Email: razorback1@plansandmorellp.com
# => Enlapsed seconds: 0.001613454

6. Key Must Be Unique

At this moment, CSV-Indexer returns no more than 1 result.

If there are two or more rows in your index who match with the criteria, CSV-Indexer will return the first that it founds.

E.g.: If you remove the 'armstrong', you get another Alan.

ret = BlackStack::CSVIndexer.find('ix_example02', ['alan'])
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
    puts "First Name: #{ret[:matches].first[2]}" 
    puts "Last Name: #{ret[:matches].first[3]}" 
    puts "Email: #{ret[:matches].first[5]}" 
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: kane
# => Email: akane@myalexandertoyota.com
# => Enlapsed seconds: 0.001480246

7. Case Insensitive

CSV-Indexer is case-insensitive.

E.g.: ['alan', 'armstrong'] is the same than ['Alan', 'Armstrong'].

ret = BlackStack::CSVIndexer.find('ix_example02', ['Alan', 'Armstrong'])
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
    puts "First Name: #{ret[:matches].first[2]}" 
    puts "Last Name: #{ret[:matches].first[3]}" 
    puts "Email: #{ret[:matches].first[5]}" 
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: armstrong
# => Email: razorback1@plansandmorellp.com
# => Enlapsed seconds: 0.001613454

8. Matching Criteria

You can find values who match partially with the key.

E.g: ['Ala', 'Armstrong'] works the same than ['Alan', 'Armstrong'] if you add a thirth parameter exact_match=false

ret = BlackStack::CSVIndexer.find('ix_example02', ['Ala', 'Armstrong'], exact_match=false)
puts "#{ret[:matches].size.to_s} results found."
if ret[:matches].size > 0
    puts "First Name: #{ret[:matches].first[2]}" 
    puts "Last Name: #{ret[:matches].first[3]}" 
    puts "Email: #{ret[:matches].first[5]}" 
end
puts "Enlapsed seconds: #{ret[:enlapsed_seconds].to_s}"
# => 1 results found.
# => First Name: alan
# => Last Name: armstrong
# => Email: razorback1@plansandmorellp.com
# => Enlapsed seconds: 0.001595377