Note: this repository is archived, with no commit activity and no release in over 3 years.
Sagrone scraper


A simple library to scrape web pages. Below you will find information on how to use it.

Table of Contents

  • Installation
  • Basic Usage
  • Modules
    • SagroneScraper::Agent
    • SagroneScraper::Base
      • Create a scraper class
      • Instantiate the scraper
      • Scrape the page
      • Extract the data
    • SagroneScraper::Collection

Installation

Add this line to your application's Gemfile:

gem 'sagrone_scraper'

And then execute:

$ bundle

Or install it yourself as:

$ gem install sagrone_scraper

Basic Usage

In order to scrape a web page you will need to:

  1. create a new scraper class by inheriting from SagroneScraper::Base,
  2. instantiate it with a url or page, and
  3. use the scraper instance to scrape the page and extract structured data.

More information can be found in the SagroneScraper::Base module section.

Modules

SagroneScraper::Agent

The agent is responsible for obtaining a page (a Mechanize::Page) from a URL. Here is how you can create an agent:

require 'sagrone_scraper'

agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
agent.page
# => Mechanize::Page

agent.page.at('.ProfileHeaderCard-bio').text
# => "Javascript User Group Milano #milanojs"

SagroneScraper::Base

The scraper is responsible for extracting structured data from a page or a url. The page can be obtained by the agent. Below we define a TwitterScraper by inheriting from the SagroneScraper::Base class.

Public instance methods are used to extract data, whereas private instance methods are ignored (they are treated as helper methods). Most importantly, the self.can_scrape?(url) class method ensures that only a known subset of pages can be scraped for data.
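The public/private distinction can be illustrated with plain Ruby reflection. This is an assumed sketch of the idea, not the gem's actual implementation; DemoScraper is a hypothetical class used only for illustration:

```ruby
# Hypothetical class showing which methods a scraper would expose.
class DemoScraper
  def bio
    'Javascript User Group Milano #milanojs'
  end

  def location
    'Milan, Italy'
  end

  private

  # Helper methods below `private` are not treated as data attributes.
  def text_at(selector); end
end

# Only the public instance methods defined on the class itself
# are candidates for data extraction.
DemoScraper.public_instance_methods(false).sort
# => [:bio, :location]

# Building an attributes hash from those methods:
attributes = DemoScraper.public_instance_methods(false).each_with_object({}) do |name, attrs|
  attrs[name] = DemoScraper.new.send(name)
end
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
```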

Create a scraper class

require 'sagrone_scraper'

class TwitterScraper < SagroneScraper::Base
  TWITTER_PROFILE_URL = /^https?:\/\/twitter\.com\/(\w)+\/?$/i

  def self.can_scrape?(url)
    url.match(TWITTER_PROFILE_URL) ? true : false
  end

  # Public instance methods are used for data extraction.

  def bio
    text_at('.ProfileHeaderCard-bio')
  end

  def location
    text_at('.ProfileHeaderCard-locationText')
  end

  private

  # Private instance methods are not used for data extraction.

  def text_at(selector)
    page.at(selector).text if page.at(selector)
  end
end
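For illustration, here is how the can_scrape? guard behaves against the profile-URL pattern. This is a standalone sketch (note the escaped dot in twitter\.com; an unescaped `.` would match any character):

```ruby
# Standalone version of the can_scrape? guard from TwitterScraper above.
TWITTER_PROFILE_URL = /^https?:\/\/twitter\.com\/(\w)+\/?$/i

def can_scrape?(url)
  url.match(TWITTER_PROFILE_URL) ? true : false
end

can_scrape?('https://twitter.com/Milano_JS')            # => true
can_scrape?('http://twitter.com/milano_js/')            # => true
can_scrape?('https://twitter.com/Milano_JS/status/42')  # => false
can_scrape?('https://example.com/Milano_JS')            # => false
```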

Instantiate the scraper

# Instantiate the scraper with a "url".
scraper = TwitterScraper.new(url: 'https://twitter.com/Milano_JS')

# Instantiate the scraper with a "page" (Mechanize::Page).
agent = SagroneScraper::Agent.new(url: 'https://twitter.com/Milano_JS')
scraper = TwitterScraper.new(page: agent.page)

Scrape the page

scraper.scrape_page!

Extract the data

scraper.attributes
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}

SagroneScraper::Collection

This is the simplest way to scrape a web page:

require 'sagrone_scraper'

# 1) Define a scraper. For example, the TwitterScraper above.

# 2) Newly created scrapers are registered automatically.
SagroneScraper::Collection.registered_scrapers
# => ['TwitterScraper']

# 3) Here we use the collection to scrape data at a URL.
SagroneScraper::Collection.scrape(url: 'https://twitter.com/Milano_JS')
# => {bio: "Javascript User Group Milano #milanojs", location: "Milan, Italy"}
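Under the hood, the collection presumably keeps a registry of scraper classes and dispatches a URL to the first one whose can_scrape? accepts it. A minimal standalone sketch of that registry pattern (assumed, not the gem's actual code):

```ruby
# A registry of scraper-like objects; each knows which URLs it can handle.
scrapers = []

# Hypothetical stand-in for a registered TwitterScraper.
twitter_scraper = Struct.new(:name) do
  def can_scrape?(url)
    url.match(/^https?:\/\/twitter\.com\/\w+\/?$/i) ? true : false
  end
end.new('TwitterScraper')

scrapers << twitter_scraper

# Dispatch: pick the first registered scraper that accepts the URL.
url = 'https://twitter.com/Milano_JS'
handler = scrapers.find { |s| s.can_scrape?(url) }
handler.name
# => "TwitterScraper"

# No registered scraper matches this URL.
scrapers.find { |s| s.can_scrape?('https://example.com') }
# => nil
```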

Contributing

  1. Fork it ( https://github.com/[my-github-username]/sagrone_scraper/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request