Extraloop Redis Storage

Description

Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library that makes it easy to store structured data in Redis. It includes a convenient command line tool for listing, filtering, and deleting harvested datasets, as well as exporting them to local files or remote data stores (e.g. Google Fusion Tables).

Installation

gem install extraloop-redis-storage

Usage

ExtraLoop's Redis storage module decorates ExtraLoop::ScraperBase and ExtraLoop::IterativeScraper instances with the set_storage method, a helper that specifies how the scraped data should be stored.

require "extraloop/redis-storage"

class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  # Reject reviews whose rank falls outside the 0..5 range.
  def validate
    assert (0..5).include?(rank.to_i), "Rank not in range"
  end
end

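# AmazonReviewScraper is a placeholder name for your own ExtraLoop::ScraperBase subclass.
scraper = AmazonReviewScraper.new("0262560992")
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")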

On each scraper run, the ExtraLoop storage module internally instantiates a session (see ExtraLoop::Storage::ScrapingSession) and associates the extracted records with it. The AmazonReview records just created can then be accessed by calling the #records method on the scraper's session object.

reviews = scraper.session.records
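
The relationship between a session and its records can be pictured with a small in-memory sketch. It uses plain Ruby objects rather than Ohm or Redis. SketchSession, SketchReview, valid?, and add are hypothetical names chosen for illustration and are not part of the gem's API; dropping invalid records is likewise an assumption of this sketch.

```ruby
# In-memory stand-ins (hypothetical, not the gem's classes): a session
# owns the records extracted during one scraper run.
SketchReview = Struct.new(:title, :rank, :date) do
  # Same range check as AmazonReview#validate.
  def valid?
    (0..5).include?(rank.to_i)
  end
end

class SketchSession
  attr_reader :title, :records

  def initialize(title)
    @title = title
    @records = []
  end

  # Store the record only if it passes validation (an assumption of
  # this sketch); return whether it was stored.
  def add(record)
    return false unless record.valid?
    @records << record
    true
  end
end

session = SketchSession.new("Amazon reviews of 'The Little Schemer'")
session.add(SketchReview.new("A classic", "5", "24 Feb 2012"))
session.add(SketchReview.new("Bad rank", "9", "24 Feb 2012"))
puts session.records.length
```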

#set_storage

The set_storage method accepts the following arguments:

  • model: a Ruby constant or a symbol specifying the model used to store the extracted data. If a symbol is passed, the storage module assumes that no such model exists yet and dynamically generates one by subclassing ExtraLoop::Storage::Record.

  • session_title: a human-readable title for the extracted dataset (optional).
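
The constant-or-symbol behaviour of the model argument can be sketched in plain Ruby as follows. This is not the gem's implementation: SketchRecord stands in for ExtraLoop::Storage::Record, and SketchModels and resolve_model are hypothetical names used only to show how a class might be generated from a symbol.

```ruby
# Hypothetical stand-in for ExtraLoop::Storage::Record.
class SketchRecord; end

# Namespace that holds the dynamically generated models in this sketch.
module SketchModels; end

# Return the class itself when given a constant; when given a symbol,
# generate (once) a SketchRecord subclass named after it.
def resolve_model(model)
  return model if model.is_a?(Class)

  name = model.to_s
  return SketchModels.const_get(name) if SketchModels.const_defined?(name, false)

  SketchModels.const_set(name, Class.new(SketchRecord))
end

class AmazonReview < SketchRecord; end

existing  = resolve_model(AmazonReview)       # an existing class is used as-is
generated = resolve_model(:GoogleNewsStory)   # a new subclass is created
puts existing
puts generated.superclass
```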

Command line interface

Once installed, the gem also adds the extraloop executable to your system path: a command line interface to the datasets harvested through ExtraLoop. A list of datasets can be obtained by running:

extraloop datastore list

This will generate a table like the following one:

 id | title                              | model           | records
 48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110    
 49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0      
 51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110    
 52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10

Datasets can be removed using the delete subcommand:

extraloop datastore delete [id]

Where id is either a single scraping session id, or a session id range (e.g. 48..52).
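
As an illustration of the two accepted forms, the following sketch turns an id argument into a list of session ids. The parse_session_ids helper is hypothetical and only shows one way such an argument could be interpreted; it is not taken from the gem's source.

```ruby
# Hypothetical helper: turn "48" or "48..52" into a list of integer ids.
def parse_session_ids(arg)
  case arg
  when /\A(\d+)\.\.(\d+)\z/
    # Range form: both bounds are inclusive.
    ($1.to_i..$2.to_i).to_a
  when /\A\d+\z/
    # Single session id.
    [arg.to_i]
  else
    raise ArgumentError, "expected an id or an id range, got #{arg.inspect}"
  end
end

p parse_session_ids("48")
p parse_session_ids("48..52")
```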

ExtraLoop datasets can be exported from the Redis datastore to disk as CSV, JSON, or YAML documents:

extraloop datastore export 51..52 -f csv
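
To give a rough idea of what the three formats look like for the same records, here is a sketch that relies only on Ruby's csv, json, and yaml libraries. The sample records and the serialize helper are hypothetical; the files actually written by the export subcommand may be laid out differently.

```ruby
require "csv"
require "json"
require "yaml"

# Hypothetical records, shaped like the AmazonReview attributes.
records = [
  { "title" => "A classic", "rank" => 5, "date" => "24 Feb 2012" },
  { "title" => "Dense but rewarding", "rank" => 4, "date" => "25 Feb 2012" }
]

# Serialize a list of flat hashes in one of the three formats.
def serialize(records, format)
  case format
  when "csv"
    headers = records.flat_map(&:keys).uniq
    CSV.generate do |csv|
      csv << headers
      records.each { |record| csv << record.values_at(*headers) }
    end
  when "json" then JSON.pretty_generate(records)
  when "yaml" then records.to_yaml
  else raise ArgumentError, "unsupported format: #{format}"
  end
end

puts serialize(records, "csv")
```

Calling serialize(records, "json") or serialize(records, "yaml") returns the same records in the other two formats.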

Similarly, stored datasets can be uploaded to a remote datastore:

extraloop datastore push 51..48 fusion_tables -c google_username:password

Google Fusion Tables is currently the only remote datastore implemented; support for pushing datasets to other remote datastores (e.g. CouchDB, CartoDB, and CKAN Webstore) is planned.