Extraloop Redis Storage
Description
Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library that makes it easy to store structured data in Redis. It includes a convenient command line tool for listing, filtering, and deleting harvested datasets, as well as exporting them to local files or remote data stores (e.g. Google Fusion Tables).
Installation
gem install extraloop-redis-storage
Usage
ExtraLoop's Redis storage module decorates ExtraLoop::ScraperBase and ExtraLoop::IterativeScraper instances with the set_storage method, a helper that lets you specify how the scraped data should be stored.
require "extraloop/redis-storage"
class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  # Reject records whose rank falls outside the 0..5 scale.
  def validate
    assert((0..5).include?(rank.to_i), "Rank not in range")
  end
end

scraper = AmazonReviewScraper.new("0262560992")
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")
  .run()
At each scraper run, the ExtraLoop storage module internally instantiates a session (see ExtraLoop::Storage::ScrapingSession) and associates the extracted records with it. The `AmazonReview` records just created can now be accessed by calling the `#records` method on the scraper's session object.
reviews = scraper.session.records
#set_storage
The set_storage method accepts the following arguments:
- model: A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist, and the storage module dynamically generates one by subclassing ExtraLoop::Storage::Record.
- session_title: A human-readable title for the extracted dataset (optional).
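The symbol case described above can be illustrated with a minimal sketch of dynamic subclassing in plain Ruby. This is not the gem's actual implementation: `SketchRecord` is a stand-in for ExtraLoop::Storage::Record so the example runs without the gem or a Redis server, and `resolve_model` is a hypothetical helper name.

```ruby
# Stand-in for ExtraLoop::Storage::Record (hypothetical, not the gem's class).
class SketchRecord
  def self.attribute(name)
    attr_accessor name
  end
end

# Hypothetical helper: returns a class as-is, or builds a new subclass of
# `base` named after the symbol and registers it as a top-level constant.
def resolve_model(model, base = SketchRecord)
  return model if model.is_a?(Class)

  Object.const_set(model.to_s, Class.new(base))
end

klass = resolve_model(:SketchReview)
klass.attribute :title

review = klass.new
review.title = "The Little Schemer"
```

Passing a constant simply returns it, so both argument styles end up with a class that inherits from the same base.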
Command line interface
Once installed, the gem also adds the extraloop executable to your system path: a command line interface to the datasets harvested through ExtraLoop. A list of datasets can be obtained by running:
extraloop datastore list
This will generate a table like the following one:
id | title                              | model           | records
---------------------------------------------------------------------
48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110
49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0
51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110
52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10
Datasets can be removed using the delete subcommand:
extraloop datastore delete [id]
Here, id is either a single scraping session id or a range of session ids (e.g. 48..52).
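To make the id argument concrete, here is a rough sketch of how a single id or a range such as 48..52 could be expanded into a list of session ids. The helper name `parse_session_ids` is hypothetical, and treating a reversed range as its sorted equivalent is an assumption of this sketch, not documented behaviour of the extraloop executable.

```ruby
# Hypothetical helper: expands "48" into [48] and "48..52" into
# [48, 49, 50, 51, 52]. Anything else raises ArgumentError.
def parse_session_ids(arg)
  case arg
  when /\A(\d+)\.\.(\d+)\z/
    # Assumption: bounds are order-insensitive, so "52..48" equals "48..52".
    low, high = [$1.to_i, $2.to_i].minmax
    (low..high).to_a
  when /\A\d+\z/
    [arg.to_i]
  else
    raise ArgumentError, "expected an id or an id range, got #{arg.inspect}"
  end
end

single = parse_session_ids("48")
range  = parse_session_ids("48..52")
```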
From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:
extraloop datastore export 51..52 -f csv
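As an illustration of what the three export formats look like, the following sketch serializes a small list of records with Ruby's standard csv, json, and yaml libraries. The sample records and the `serialize` helper are made up for this example; the tool's real export code and file layout may differ.

```ruby
require "csv"
require "json"
require "yaml"

# Made-up sample records; a real dataset would be read from Redis.
RECORDS = [
  { "title" => "Great book", "rank" => 5, "date" => "2012-02-24" },
  { "title" => "Too terse",  "rank" => 3, "date" => "2012-02-25" }
].freeze

# Hypothetical helper: renders an array of hashes in the requested format.
def serialize(records, format)
  case format
  when "csv"
    return "" if records.empty?

    CSV.generate do |csv|
      csv << records.first.keys            # header row
      records.each { |r| csv << r.values } # one row per record
    end
  when "json" then JSON.pretty_generate(records)
  when "yaml" then records.to_yaml
  else raise ArgumentError, "unsupported format: #{format}"
  end
end

puts serialize(RECORDS, "csv")
```

CSV flattens every value to text, whereas JSON and YAML keep the integer rank as a number, which matters if the exported file is read back by another program.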
Similarly, stored datasets can be uploaded to a remote datastore:
extraloop datastore push 51..48 fusion_tables -c google_username:password
Google Fusion Tables is currently the only remote datastore implemented; support for pushing datasets to other remote datastores (e.g. CouchDB, CartoDB, and CKAN Webstore) will be added soon.