Redis-based persistence layer for the ExtraLoop data extraction toolkit. Includes a convenient command line tool for listing, filtering, deleting, and exporting harvested datasets.

ExtraLoop Redis Storage

Description

Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library that makes it easy to store structured data in Redis. It includes a convenient command line tool for listing, filtering, and deleting harvested datasets, as well as exporting them to local files or remote data stores (e.g. Google Fusion Tables).

Installation

gem install extraloop-redis-storage

Usage

ExtraLoop's Redis storage module decorates ExtraLoop::ScraperBase and ExtraLoop::IterativeScraper instances with the set_storage method, a helper that lets you specify how the scraped data should be stored.

require "extraloop/redis-storage"

class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  def validate
    assert((0..5).include?(rank.to_i), "Rank not in range")
  end
end

# AmazonReviewScraper is assumed to be an ExtraLoop scraper defined elsewhere.
scraper = AmazonReviewScraper.new("0262560992")
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")
  .run

At each scraper run, the ExtraLoop storage module internally instantiates a session (see ExtraLoop::Storage::ScrapingSession) and associates the extracted records with it. The AmazonReview records just created can now be accessed by calling the #records method on the scraper's session object.

reviews = scraper.session.records
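
The relationship between a session, its records, and the validate hook can be illustrated without Redis. The following is only an in-memory sketch: SketchRecord and SketchSession are hypothetical stand-ins rather than classes from this gem, and silently skipping invalid records is an assumption about the behaviour, not something the README states.

```ruby
# In-memory stand-ins; the real gem persists records through Ohm and Redis.
class SketchRecord
  attr_reader :attributes, :errors

  def initialize(attributes)
    @attributes = attributes
    @errors = []
  end

  # Assert-style check, loosely modelled on the README example.
  def assert(condition, message)
    @errors << message unless condition
    condition
  end

  def validate
    assert((0..5).include?(attributes[:rank].to_i), "Rank not in range")
  end

  def valid?
    @errors.clear
    validate
    @errors.empty?
  end
end

class SketchSession
  attr_reader :title, :records

  def initialize(title)
    @title = title
    @records = []
  end

  # Assumption: only records that pass validation are kept.
  def add(record)
    @records << record if record.valid?
    self
  end
end

session = SketchSession.new("Amazon reviews of 'The Little Schemer'")
session.add(SketchRecord.new(title: "Great", rank: "5"))
session.add(SketchRecord.new(title: "Bogus", rank: "9"))
puts session.records.length
```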

#set_storage

The set_storage method accepts the following arguments:

  • model A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist and the storage module dynamically generates one by subclassing ExtraLoop::Storage::Record.

  • session_title A human readable title for the extracted dataset (optional).
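
How a model might be resolved from either a constant or a symbol can be sketched in plain Ruby. This is not the gem's actual implementation: Record below is a stand-in for ExtraLoop::Storage::Record, resolve_model is a hypothetical helper, and the symbol is assumed to be a valid constant name (e.g. :GoogleNewsStory).

```ruby
# Stand-in for ExtraLoop::Storage::Record, so the sketch runs without the gem or Redis.
class Record
  def self.attribute(name)
    attr_accessor name
  end
end

# Hypothetical helper: returns the class itself when given a constant,
# otherwise looks up or defines a Record subclass named after the symbol.
def resolve_model(model, namespace = Object)
  return model if model.is_a?(Class)

  name = model.to_s
  if namespace.const_defined?(name, false)
    namespace.const_get(name, false)
  else
    namespace.const_set(name, Class.new(Record))
  end
end

klass = resolve_model(:GoogleNewsStory)
puts klass.name
puts klass.superclass
```

Calling resolve_model a second time with the same symbol returns the class defined on the first call instead of creating a new one.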

Command line interface

Once installed, the gem also adds the extraloop executable to your system path: a command line interface to the datasets harvested through ExtraLoop. To list the stored datasets, run:

extraloop datastore list

This will generate a table like the following one:

 id | title                              | model           | records
--------------------------------------------------------------------
 48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110    
 49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0      
 51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110    
 52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10

Datasets can be removed using the delete subcommand:

extraloop datastore delete [id]

Here, id is either a single scraping session id or a range of session ids (e.g. 48..52).
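
As an illustration of the two accepted forms, an id argument could be expanded into a list of session ids roughly as follows. parse_session_ids is a hypothetical helper, not part of the gem, and accepting a reversed range such as 51..48 is an assumption made for this sketch.

```ruby
# Hypothetical parser for the [id] argument:
# a single id yields a one-element list, a range yields every id it covers.
def parse_session_ids(arg)
  case arg
  when /\A(\d+)\.\.(\d+)\z/
    first, last = $1.to_i, $2.to_i
    first, last = last, first if first > last # assumption: tolerate reversed ranges
    (first..last).to_a
  when /\A\d+\z/
    [arg.to_i]
  else
    raise ArgumentError, "expected an id or an id range, got #{arg.inspect}"
  end
end

p parse_session_ids("48")
p parse_session_ids("48..52")
```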

From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:

extraloop datastore export 51..52 -f csv
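
For readers curious what the three formats look like, the sketch below serializes a small dataset with Ruby's standard csv, json, and yaml libraries. It does not reproduce the gem's exporter: the export helper and the sample records are hypothetical, and records are modelled as plain hashes rather than stored models.

```ruby
require "csv"
require "json"
require "yaml"

# Hypothetical records, shaped like plain hashes for illustration only.
records = [
  { "title" => "Great book", "rank" => "5" },
  { "title" => "Too terse",  "rank" => "3" }
]

# Hypothetical helper: turns a list of hashes into a CSV, JSON, or YAML string.
def export(records, format)
  case format
  when "csv"
    headers = records.flat_map(&:keys).uniq
    CSV.generate do |csv|
      csv << headers
      records.each { |record| csv << record.values_at(*headers) }
    end
  when "json" then JSON.generate(records)
  when "yaml" then records.to_yaml
  else raise ArgumentError, "unsupported format: #{format}"
  end
end

puts export(records, "csv")
```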

Similarly, stored datasets can be uploaded to a remote datastore:

extraloop datastore push 51..48 fusion_tables -c google_username:password

Google Fusion Tables is currently the only remote datastore implemented; support for pushing datasets to others (e.g. CouchDB, CartoDB, and CKAN Webstore) is planned.