About
Reid and its sister class Requester are simple classes that help structure web scraping tasks.
Typical usage for a single-page scrape involves defining an array of xpath or css Nokogiri selectors and one or more first-class functions (lambdas and/or procs). Reid then iterates the array, selects the elements from the Nokogiri document, and passes them to your function. Your function should take two arguments: the first is the element selected by the Nokogiri selector; the second is a hash where you can save whatever data you're scraping from that particular selection (this record is returned by the method).
Multipage scrapes require you to build a page iterator and allow you to pass a method to persist your record (see the specifications and examples below).
Reid uses Requester, which can throttle requests, back off, and log request errors. See the Requester documentation below for details.
Example usage
require 'reid'
requester_options = {
  :min_request_interval => 1,
  :max_backoff_time => 60,
  #...
}
reid = Reid.new(requester_options)
##############################
# Single page scrape
##############################
operations = [
  [
    '//head',
    Proc.new { |element, record| record[:title] = element.xpath('//title').text },
    :xpath
  ],
  [
    'body',
    Proc.new { |element, record| record[:paragraph] = element.css('p').text },
    :css
  ]
]
record = reid.scrape_page('http://example.iana.org/', operations)
p record[:title] #=> "Example Domain"
##############################
# Using crawl method
##############################
class UrlIter
  def initialize
    @urls = ['http://example.iana.org/',
             'http://www.iana.org/domains/special']
    @current = 0
  end

  # This method should return urls until all urls are
  # processed, then it should return nil.
  #
  # If you are iterating through multiple pages, etc.,
  # you can check the Nokogiri document from your previous
  # request to determine whether you've reached the last
  # page. nil is passed in place of doc on the first call.
  def next(doc)
    if @current == @urls.length
      @current = 0
      nil
    else
      @current += 1
      @urls[@current - 1]
    end
  end
end
persist_method = lambda do |record|
  # This is where you handle checking and persisting your record.
  p 'I should be storing ' + record[:title]
end
reid.crawl(UrlIter.new, operations, persist_method)
#=> "I should be storing Example Domain"
#=> "I should be storing Iana..."
Installation
gem install reid
Initialization
Reid takes an options hash for initializing a Requester object. See the Requester documentation below for details. Requester handles backing off if there are request errors. Requester has default options, so you aren't required to specify anything.
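For example (assuming the options argument can be omitted entirely, which the defaults imply):
reid = Reid.new                               # rely on Requester's defaults
reid = Reid.new(:min_request_interval => 2.0) # or throttle to one request every 2 seconds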
Method reference
scrape_page(url, operations)
Takes the url you want to scrape and a 2D array.
Returns a hash.
Each array within the 2D array should have three items (the index x below selects one operation within operations; you can execute multiple operations on the same page):
- operations[x][0] is an xpath or css selector.
- operations[x][1] is a proc or lambda. It should accept two arguments: the first is the element(s) returned when operations[x][0] is applied to the Nokogiri document for the passed url; the second is a hash. This is the hash that is ultimately returned by the scrape_page method, and it should be used to store any data from the selection that you want returned from the method.
- operations[x][2] is a symbol flag, either :css or :xpath, depending on whether operations[x][0] is a css or xpath selector.
If there are multiple arrays contained in the 2D array, they are all evaluated in order.
The second argument passed to each proc or lambda is the same hash, so everything added to it by all of the procs/lambdas will be returned by the scrape_page method.
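To make this contract concrete, here is a rough sketch of what scrape_page does conceptually (illustrative only, not the gem's actual internals):
# Conceptual sketch, not Reid's real implementation; 'requester'
# stands in for Reid's internal Requester instance.
def scrape_page_sketch(requester, url, operations)
  doc = requester.request(url)   # fetch and parse the page
  record = {}                    # the hash returned at the end
  operations.each do |selector, op, flag|
    elements = (flag == :css) ? doc.css(selector) : doc.xpath(selector)
    op.call(elements, record)    # your proc/lambda fills in record
  end
  record
end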
scrape_doc(doc, operations)
Same as scrape_page except it takes a Nokogiri doc instead of a url (in case you want to handle the page request outside of Reid).
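For example, with a document you fetched or built yourself (the HTML below is made up for illustration):
require 'nokogiri'

html = '<html><head><title>Example Domain</title></head>' \
       '<body><p>Hello</p></body></html>'
doc = Nokogiri::HTML(html)
record = reid.scrape_doc(doc, operations) # reid and operations as defined above
p record[:title] #=> "Example Domain"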
crawl(url_crawler, operations, store_function)
Takes an object, a 2D array, and a proc/lambda.
url_crawler must be an object with a next method that accepts one argument. .next should return the next url to crawl, or nil once the crawl is complete. .next is passed the Nokogiri document from the previous request, or nil if it is the first request. This allows you to check the document from your previous request when determining the next url to return, or when deciding whether the crawl has reached its end.
operations is a 2D array following the specification defined in the scrape_page method documentation. These operations are applied to each page.
store_function is a proc or lambda which receives the hash generated by the operations. This is typically used for checking and storing scraped data.
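As a sketch of a more involved crawler, the object below follows a hypothetical rel="next" pagination link until none remains (the start url and the selector are assumptions for illustration):
class NextLinkCrawler
  def initialize(start_url)
    @start_url = start_url
    @started = false
  end

  # doc is nil on the first call, and the previous page's
  # Nokogiri document on every call after that.
  def next(doc)
    unless @started
      @started = true
      return @start_url
    end
    link = doc && doc.at_css('a[rel="next"]') # assumed pagination markup
    link && link['href']                      # nil ends the crawl
  end
end

reid.crawl(NextLinkCrawler.new('http://example.iana.org/'), operations, persist_method)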
Requester documentation
################################################################################
ABOUT
################################################################################
Requester is a small class used in conjunction with Nokogiri to throttle requests and perform exponential backoff when Nokogiri runs into an error while making a request.
################################################################################
Method/usage
################################################################################
request(url) Takes a string url and returns a Nokogiri document, or raises an error if max_backoff_time was set during initialization and is exceeded.
Usage:
  require 'open-uri'

  options = {
    :min_request_interval => 5.0,
    :max_backoff_time => 60,
    #...
  }
  r = Requester.new(options)
  doc = r.request('http://www.example.com')
  puts doc.css('title').text
################################################################################
Initialize / options hash
################################################################################
Requester takes an optional options hash. Below are the options and their defaults.
:error_log
Set to either a MongoDB collection or false. Default is false.
If set to a collection, the time and error message for the Nokogiri
request error will be saved to this collection. (You may want to
use a capped collection.)
:initial_delay
Set to a number. Default is 1.0.
When a request first errors, Requester will wait this number of
seconds before making the next request.
:max_backoff_time
Set to either a number of seconds or false. Default is false.
If false, there is no maximum backoff time; Requester will keep
retrying at an exponentially decreasing rate.
If set to a number, Requester will raise an error once the delay
reaches that number of seconds. Note that this is the number of
seconds since the last request, not the cumulative seconds. So, for
instance, if you set this to 600 (10 minutes) and started with 2
seconds between requests, Requester would not stop making requests
until there were 600 seconds between two consecutive requests.
:min_request_interval
Set to either a number of seconds or false. Default is false.
If false, no throttling of requests will occur.
If set to a number, Requester will wait at least min_request_interval
seconds between requests.
:multiplicand
Set to a number. Default is 1.3.
When a request errors, Requester will wait the previous backoff
time multiplied by this amount before trying again.
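As a rough illustration of how :initial_delay, :multiplicand, and :max_backoff_time interact: each failed request multiplies the previous delay by the multiplicand, so the delay after n consecutive failures is roughly initial_delay * multiplicand**n. The sketch below (not part of the gem) reproduces the 600-second example above:
# How many consecutive failures until a 600-second delay is reached,
# starting from 2 seconds with the default multiplicand of 1.3?
delay, failures = 2.0, 0
while delay < 600
  delay *= 1.3
  failures += 1
end
puts failures #=> 22 (2.0 * 1.3**22 is roughly 642 seconds)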
################################################################################