Mini ETL

A basic toolkit for Extract/Transform/Load operations. It abstracts the details of sourcing data, generating intermediate structures, and persisting the results.

Usage

Sourcing

A MiniEtl process is kicked off by configuring it with a block. For a basic CSV deserialization and bulk load:

process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/small.csv'
end

process.bootstrap

TODO: Write a strategy for HTTP, use JSON server

process = MiniEtl.create_process do |process|
  process.source.type = :http
  process.source.location = 'localhost:8080/sample'
end

process.bootstrap

Strategies are available for CSV and JSON. If you need something else entirely, a manual source can be used instead:

process = MiniEtl.create_process do |process|
  process.source.type = :manual
  process.source.method = Proc.new do
    ...
  end
end
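
For example (a hypothetical sketch; the shape of the rows returned by the Proc is an assumption, not a documented contract), a manual source can wrap any in-memory collection:

rows = [
  { name: 'Ada',  last_name: 'Lovelace' },
  { name: 'Alan', last_name: 'Turing' }
]

process = MiniEtl.create_process do |process|
  process.source.type = :manual
  # Hypothetical: return the raw rows for the generator to consume
  process.source.method = Proc.new { rows }
end

process.bootstrap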

Structure generation

Once data sourcing is complete, the sourced data can be turned into intermediate structures and read in place:

process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/small.csv'
end

process.bootstrap
process.generate

process.generator.structures # intermediate structures for bulk import

If the data source is too large to process in memory, the generator can run lazily and yield structures in batches instead:

process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/large.csv'
  process.generator.lazy = true
end

process.bootstrap
process.generator.start do |structures|
  ...
end

Data persistence

Finally, once the data is shaped the way you need, it can be persisted to any kind of store. The receiver class is expected to respond to .create(args):

process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/large.csv'
  process.store.type = Person # An ActiveRecord model
end

process.bootstrap
process.generate
process.persist

In this way, an arbitrary store can be plugged in:

class Payroll
  Struct.new('Target', :name, :last_name, keyword_init: true) # defines Struct::Target; add further fields as needed
  @@data = []

  # The store only needs to respond to .create, so a class method suffices
  def self.create(params = {})
    @@data << Struct::Target.new(name: params[:name], last_name: params[:last_name])
  end
end

process = MiniEtl.create_process do |process|
  process.source.type = :csv
  process.source.location = 'samples/small.csv'
  process.store.type = Payroll
end

process.bootstrap
process.generate
process.persist

Development

TODO: Test stuff

$ rake

Runs RSpec and RuboCop, and generates a coverage report.
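
A minimal sketch of a Rakefile that wires these together (the task layout is an assumption; check the repository's actual Rakefile):

require 'rspec/core/rake_task'
require 'rubocop/rake_task'

RSpec::Core::RakeTask.new(:spec)  # `rake spec` runs the test suite
RuboCop::RakeTask.new(:rubocop)   # `rake rubocop` runs the linter

# Coverage is typically wired up by SimpleCov inside spec_helper.rb
task default: %i[spec rubocop]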

TODO: Explain how to generate the CSV files and the rest of the samples.

NOTE: This takes ~5.5 minutes; it is very slow and would need a parallel version.

$ rake sample:csv:all

TODO: Explain how to use JSON Server to provide a fake API

$ npm install -g json-server
$ rake sample:json:small
$ json-server --watch samples/small.json --port 3001

The API is now available at localhost:3001/payroll
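
As a quick smoke test (assuming json-server is running as above; the record shape is illustrative), the fake API can be queried from Ruby:

require 'net/http'
require 'json'

# Fetch the payroll collection from the fake API and print the first record
payroll = JSON.parse(Net::HTTP.get(URI('http://localhost:3001/payroll')))
puts payroll.first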

...

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/[USERNAME]/etl.