Scraptacular

Organized web-scraping.

Installation

Add this line to your application's Gemfile:

gem 'scraptacular'

And then execute:

$ bundle

Or install it yourself as:

$ gem install scraptacular

Usage

Defining Scrapers

The scraper describes what content should be plucked from the page and returned in the result.

Example 1 : Basic Usage

 scraper :yahoo_front_page do
   result do
     highest_trending_url { page.search("ol.trending_now_trend_list li a").first.attributes["href"].value }
     anything { "My returned value" }
   end
 end

Example 2 : Multiple Level Scraping

 scraper :event_index_page do
   # Find URLs, scrape the contents of those pages using the :event_detail_page scraper
   scrape_links("a.css_selector_for_links", with: :event_detail_page).each do |link|
     result do

       # Provide partial result from the index page
       event_title { page.search("h4").first.text }

       # Merge results from the detail page
       merge(link)
     end
   end
 end

 scraper :event_detail_page do
   result do
      date { ... }
      price { ... }
   end
 end

Scraping a page returns a Scraptacular::Result object :

result = results.first  # See section below on running a scraping session

result.class  # Scraptacular::Result
result.to_h   # {:highest_trending_url => "http://www.harlemshakevideos.com", :anything => "My returned value" }

Setting Up Scraping Sessions

Scraping sessions are divided into groups and suites. The group is a logical separation by content topic. The suite generally refers to a set of urls which should be scraped using the same scraper

scrape_group "Ruby Sites" do
  suite "Google", with: :google_result_index do

    # The url will be scraped using the :google_result_index scraper
    url "https://www.google.com/search?q=Ruby"

    # Tell Scraptacular to use a different scraper for an individual URL
    url "https://www.google.com/search?q=Ruby+On+Rails", with: :google_alternate_index
  end
end

Running From The Command Line

Scraptacular comes with its own command line utility. Currently the only supported output format is JSON: See scraptacular --help for more info.

$ scraptacular -d /path/to/scraper_definitions.rb -s /path/to/sessions.rb -o /path/to/outout.json

Inside Your Project

require 'scraptacular'

# Set up the definitions and sessions
scraper :my_scraper do
end

scrape_group "My Group" do
  suite "My Suite", with: :my_scraper do
    url "http..."
    url "http..."
  end
end

# Run all groups
results = Scraptacular.run

# Run a single group
results = Scraptacular.run({group: "My Group"})

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

scraptacular

Development

Runtime