grubby

Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See API listing below, or browse the full documentation.

Examples

The following code scrapes stories from the Hacker News front page:

require "grubby"

class HackerNews < Grubby::PageScraper
  scrapes(:items) do
    page.search!(".athing").map{|element| Item.new(element) }
  end

  class Item < Grubby::Scraper
    scrapes(:story_link){ source.at!("a.storylink") }

    scrapes(:story_url){ expand_url(story_link["href"]) }

    scrapes(:title){ story_link.text }

    scrapes(:comments_link, optional: true) do
      source.next_sibling.search!(".subtext a").find do |link|
        link.text.match?(/comment|discuss/)
      end
    end

    scrapes(:comments_url, if: :comments_link) do
      expand_url(comments_link["href"])
    end

    scrapes(:comment_count, if: :comments_link) do
      comments_link.text.to_i
    end

    def expand_url(url)
      url.include?("://") ? url : source.document.uri.merge(url).to_s
    end
  end
end

# The following line will raise an exception if anything goes wrong
# during the scraping process.  For example, if the structure of the
# HTML does not match expectations due to a site change, the script will
# terminate immediately with a helpful error message.  This prevents bad
# data from propagating and causing hard-to-trace errors.
hn = HackerNews.scrape("https://news.ycombinator.com/news")

# Your processing logic goes here:
hn.items.take(10).each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end
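
To make the fail-fast behavior described in the comments above more concrete, here is a minimal, stdlib-only sketch of the idea. It is not grubby's implementation: `MiniScraper`, `FieldError`, and `Story` are hypothetical names invented for illustration, and the sketch assumes (as the `optional: true` usage above suggests) that a required field evaluating to `nil` should raise immediately, while an optional field may be `nil`.

```ruby
# Hypothetical stand-in for illustration only; not grubby's actual code.
class MiniScraper
  class FieldError < StandardError; end

  # Per-class registry of declared fields, in declaration order.
  def self.fields
    @fields ||= {}
  end

  # Declares a field. A nil result raises unless `optional: true` is given.
  def self.scrapes(name, optional: false, &block)
    fields[name] = { optional: optional, block: block }
    attr_reader name
  end

  attr_reader :source

  # Evaluates every field eagerly, so a bad source fails at construction
  # time rather than later, when the value is first used.
  def initialize(source)
    @source = source
    self.class.fields.each do |name, spec|
      value = instance_exec(&spec[:block])
      if value.nil? && !spec[:optional]
        raise FieldError, "`#{name}` is nil for #{self.class}"
      end
      instance_variable_set("@#{name}", value)
    end
  end
end

class Story < MiniScraper
  scrapes(:title) { source["title"] }
  scrapes(:url, optional: true) { source["url"] }
end

ok = Story.new({ "title" => "Hello" })
puts ok.title      # required field is present
puts ok.url.inspect # optional field may be nil

begin
  Story.new({ "url" => "https://example.com" }) # required title is missing
rescue MiniScraper::FieldError => e
  puts "Failed fast: #{e.message}"
end
```

Evaluating every field in the constructor is the design choice that matters here: a missing value surfaces where the data is scraped, not deep inside the processing logic.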

Hacker News also offers a JSON API, which may be more robust for scraping purposes. grubby can scrape JSON just as well:

require "grubby"

class HackerNews < Grubby::JsonScraper
  scrapes(:items) do
    # API returns array of top 500 item IDs, so limit as necessary
    json.take(10).map do |item_id|
      Item.scrape("https://hacker-news.firebaseio.com/v0/item/#{item_id}.json")
    end
  end

  class Item < Grubby::JsonScraper
    scrapes(:story_url){ json["url"] || hn_url }

    scrapes(:title){ json["title"] }

    scrapes(:comments_url, optional: true) do
      hn_url if json["descendants"]
    end

    scrapes(:comment_count, optional: true) do
      json["descendants"]&.to_i
    end

    def hn_url
      "https://news.ycombinator.com/item?id=#{json["id"]}"
    end
  end
end

hn = HackerNews.scrape("https://hacker-news.firebaseio.com/v0/topstories.json")

# Your processing logic goes here:
hn.items.each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end
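
The `Item` scraper above falls back to the Hacker News item page when `url` is absent and leaves the comment fields empty when `descendants` is absent. The following stdlib-only sketch reproduces that fallback logic without grubby or network access. The two payloads are hypothetical samples shaped like the JSON used above, not real API responses, and `summarize` is a helper invented for this illustration.

```ruby
require "json"

# Hypothetical sample payloads; real API responses may contain more keys.
story = JSON.parse('{"id": 1, "title": "A story", "url": "https://example.com/a", "descendants": 12}')
bare  = JSON.parse('{"id": 2, "title": "An item without url or descendants"}')

# Same URL pattern as the hn_url method in the scraper above.
def hn_url(json)
  "https://news.ycombinator.com/item?id=#{json["id"]}"
end

# Mirrors the field logic of the Item scraper using plain hashes.
def summarize(json)
  {
    story_url: json["url"] || hn_url(json),             # fall back to the item page
    comments_url: (hn_url(json) if json["descendants"]), # nil when there is no count
    comment_count: json["descendants"]&.to_i,            # safe navigation keeps nil as nil
  }
end

p summarize(story)
p summarize(bare)
```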

Core API

Auxiliary API

grubby loads several gems that extend Ruby objects with utility methods. See each gem's documentation for a complete API listing.

Installation

Install the grubby gem.
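
For a project managed with Bundler, one common way to do this is a Gemfile entry. The fragment below is a sketch that assumes the gem is fetched from rubygems.org and leaves the version unconstrained; adjust both to suit your project.

```ruby
# Gemfile (assumed setup: Bundler with the default RubyGems source)
source "https://rubygems.org"

gem "grubby"
```

Running `bundle install` afterwards fetches the gem along with its dependencies.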

Contributing

Run rake test to run the tests.

License

MIT License