Note
This gem is being renamed to Safra so it can be published in rubygems.
Name change
- Replace
gem "extractor", github: "felipedmesquita/extractor"
withgem "safra"
- Done
Extractor
A Ruby Gem to extract data from apis with mininal configuration.
This is currently in a working, but messy state. Watch the Adding Tests to a Messy Ruby Library episode of the Code with Jason Meetup for a quick introduction, demo, and Jason graciously pairing with me to figure out where we could start adding tests.
If you somehow found this gem and want to try it out, I can pair with you to walk you through setting everything up and runing your first tap. Just send me an email to the address in the gemspec and we'll schedule it.
Installation
Add this to your Gemfile:
gem "extractor", github: "felipedmesquita/extractor"
Create the requests model:
rails generate extractor:install
rails db:migrate
To get newer changes from the main branch run bundle update extractor
Usage
Create a tap:
rails generate extractor:tap example
# => create app/extractors/example_tap.rb
# => create app/sql/example.sql
Taps inherit the initilizer and perform methods from Extractor::Tap. To run our example tap, we can simply call:
ExampleTap.new.perform
This will download all posts from jsonplaceholder as requests. To clean up and analize the response contents, check out dbt
How it works
The perform method takes no arguments and just runs until reached_end? returns true for any response received. It also handles retries when responses don't pass validation. Here's a flowchart with the specifcs. Please open a pr if you can untangle this mermaid chart.
flowchart TD
A["ExampleTap.new.perform"] --> B["@current_value = first_value || 1"]
B --> C["request_for(@current_value)"]
C --> D{"validate(response)"}
D --true--> E(save request)
E --> L{"reached_end?(res)"}
L --false--> J["@current_value = next_value"]
L --true--> M[END]
J --> C
D --false or error--> H{"reached_end?(res)"}
H --false--> F{retries count > MAX_RETRIES}
F --true--> G[raise 'Maximum number of retries reached']
H --true----> I(END)
F --false--> C
Working with an array parameter
When the value needed to build requests comes from a list (like ids or product codes), you can pass an array to the initializer as the first argument as in ProductsTap.new(skus).perform
. This changes the behavior of the perform function so that instead of calling resquest_for
with values counting up from 1, it passes each item of the array one by one. This also makes defining a reached_end?
method unnecessary, perform
ends when it has successfully saved a request for each array item.
Handling multiple array items per request
Define a method like request_for_20_values
to better utilize APIs that suport retrieving multiple items per request, then build the request using the array received:
def request_for_20_values values
Typhoeus.get "imaginaryapi.com/products?ids={values.join ','}"
end
Working with a custom parameter
The first argument of the tap's initializer is saved to @parameter and can be any type.
Authentication
The named parameter auth:
is saved to instance variable @auth:
ExampleTap.new(auth: {seller_id: "23899283", access_token: "jf943u0923nd"}).perform
def request_for value
Typhoeus.get "imaginaryapi.com/sellers/#{@auth['seller_id']}/products?access_token=#{@auth['access_token']}"
end
Options
Define these constants to configure retry behavior.
Defaults
MAX_RETRIES = 4
ON_MAX_RETRIES = :fail
ON_MAX_RETRIES
accepts three possible options:
-
:fail
Default. Raises 'Maximum number of retries reached' -
:save_to_errors
Saves the last invalid request to the requests table with the extractor_class column set to ExampleTap_errors -
:skip_silently
Don't use this one.
Instance variables
-
@parameter
The postional first argument to.new
. InExampleTap.new(Date.yesterday)
@parameter would beDate.yesterday
. This is not used by the Extractor::Tap class. -
@auth
The named argumentauth:
to.new
. InExampleTap.new(auth: {api_key: '328490284209'})
@auth would be'328490284209'
. This is not used by the Extractor::Tap class. -
@last_response
Is set automaticaly to the last valid response received fromrequest_for
.