Welcome to Sinew
Sinew is a Ruby library for collecting data from web sites (scraping). Though small, this project is the culmination of years of effort based on crawling systems built at several different companies. Sinew has been used to crawl millions of websites.
Key Features
- Robust crawling with the Faraday HTTP client
- Aggressive caching with httpdisk
- Easy parsing with HTML cleanup, Nokogiri, JSON, etc.
- CSV generation for crawled data
Installation
# install gem
$ gem install sinew
# or add to your Gemfile:
gem 'sinew'Table of Contents
- Sinew 4
- Quick Example
- How it Works
- Reference
- Hints
- Limitations
- Changelog
- License
Sinew 4 (June 2021)
Breaking change
We are pleased to announce the release of Sinew 4. The Sinew DSL exposes a single sinew method in lieu of the many methods exposed in Sinew 3. Because of this single entry point, Sinew is now much easier to embed in other applications. Also, each Sinew 4 request returns a full Response object to faciliate parallelism.
Sinew uses the Faraday HTTP client with the httpdisk middleware for aggressive caching of responses.
Quick Example
Here's an example for collecting the links from httpbingo.org. Paste this into a file called sample.sinew and run sinew sample.sinew. It will create a sample.csv file containing the href and text for each link:
# get the url
response = sinew.get "https://httpbingo.org"
# use nokogiri to collect links
response.noko.css("ul li a").each do |a|
  row = { }
  row[:url] = a[:href]
  row[:title] = a.text
  # append a row to the csv
  sinew.csv_emit(row)
endHow it Works
There are three main features provided by Sinew.
Recipes
Sinew uses recipe files to crawl web sites. Recipes have the .sinew extension, but they are plain old Ruby. Here's a trivial example that calls get to make an HTTP GET request:
response = sinew.get "https://www.google.com/search?q=darwin"
response = sinew.get "https://www.google.com/search", q: "charles darwin"Once you've done a get, you can access the document in a few different formats. In general, it's easiest to use noko to automatically parse and interact with HTML results. If Nokogiri isn't appropriate, fall back to regular expressions run against body or html. Use json if you are expecting a JSON response.
response = sinew.get "https://www.google.com/search?q=darwin"
# pull out the links with nokogiri
links = response.noko.css("a").map { _1[:href] }
puts links.inspect
# or, use a regex
links = response.html[/<a[^>]+href="([^"]+)/, 1]
puts links.inspectCSV Output
Recipes output CSV files. To continue the example above:
response = sinew.get "https://www.google.com/search?q=darwin"
response.noko.css("a").each do |i|
  row = { }
  row[:href] = i[:href]
  row[:text] = i.text
  sinew.csv_emit row
endSinew creates a CSV file with the same name as the recipe, and csv_emit(hash) appends a row. The values of your hash are cleaned up and converted to strings:
- Nokogiri nodes are converted to text
- Arrays are joined with "|", so you can separate them later
- HTML tags, entities and non-ascii chars are removed
- Whitespace is squished
Caching
Sinew uses httpdisk to aggressively cache all HTTP responses to disk in ~/.sinew. Error responses are cached as well. Each URL will be hit exactly once, and requests are rate limited to one per second. Sinew tries to be polite.
Sinew never deletes files from the cache - that's up to you! Sinew has various command line options to refresh the cache. See --expires, --force and --force-errors.
Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you work on your recipe.
Running Sinew
The sinew command line has many useful options. You will be using this command many times as you iterate on your recipe:
$ bin/sinew --help
Usage: sinew [options] [recipe]
    -l, --limit     quit after emitting this many rows
    --proxy         use host[:port] as HTTP proxy
    --timeout       maximum time allowed for the transfer
    -s, --silent    suppress some output
    -v, --verbose   dump emitted rows while running
From httpdisk:
    --dir           set custom cache directory
    --expires       when to expire cached requests (ex: 1h, 2d, 3w)
    --force         don't read anything from cache (but still write)
    --force-errors  don't read errors from cache (but still write)Sinew also has many runtime options that can be set by in your recipe. For example:
sinew.options[:headers] = { 'User-Agent' => 'xyz' }
...Here is the list of available options for Sinew:
- headers - default HTTP headers to use on every request
- ignore_params - ignore these query params when generating httpdisk cache keys
- insecure - ignore SSL errors
- params - default query parameters to use on every request
- rate_limit - minimum time between network requests
- retries - number of times to retry each failed request
- url_prefix - deafult URL base to use on every request
Reference
Making HTTP requests
- 
sinew.get(url, params = nil, headers = nil)- fetch a url with GET
- 
sinew.post(url, body = nil, headers = nil)- fetch a url with POST, usingformas the URL encoded POST body.
- 
sinew.post_json(url, body = nil, headers = nil)- fetch a url with POST, usingjsonas the POST body.
Parsing the response
Each request method returns a Sinew::Response. The response has several helpers to make parsing easier:
- 
body- the raw body
- 
html- likebody, but with a handful of HTML-specific whitespace cleanups
- 
noko- parse as HTML and return a Nokogiri document
- 
xml- parse as XML and return a Nokogiri document
- 
json- parse as JSON, with symbolized keys
- 
mash- parse as JSON and return a Hashie::Mash
- 
url- the url of the request. If the request goes through a redirect,urlwill reflect the final url.
Writing CSV
- 
sinew.csv_header(columns)- specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call tosinew.csv_emit.
- 
sinew.csv_emit(hash)- append a row to the CSV file
Advanced: Cache
Sinew has some advanced helpers for checking the httpdisk cache. For the following methods, body hashes default to form body type.
- 
sinew.cached?(method, url, params = nil, body = nil)- check if request is cached
- 
sinew.uncache(method, url, params = nil, body = nil)- remove cache file, if any
- 
sinew.status(method, url, params = nil, body = nil)- get httpdisk status
Plus some caching helpers in Sinew::Response:
- 
diskpath- the location on disk for the cached httpdisk response
- 
uncache- remove cache file for this response
Hints
Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:
- Sinew doesn't (yet) check robots.txt - please check it manually.
- Prefer Nokogiri over regular expressions wherever possible. Learn CSS selectors.
- In Chrome, $in the console is your friend.
- Fallback to regular expressions if you're desperate. Depending on the site, use either bodyorhtml.htmlis probably your best bet.bodyis good for crawling Javascript, but it's fragile if the site changes.
- Learn to love String#[regexp], which is an obscure operator but incredibly handy for Sinew.
- Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
- Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
noko.css("table")[4].css("td").select do
  _1[:width].to_i > 80
end.map(&:text)- Debug your recipes using plain old puts, or better yet useapfrom amazing_print.
- Run sinew -vto get a report on everycsv_emit. Very handy.
- Add the CSV files to your git repo. That way you can version them and get diffs!
Limitations
- Caching is based on URL, so use caution with cookies and other forms of authentication
- Almost no support for international (non-english) characters
Changelog
4.0.1 (Aug 2023)
- Updated dependencies, added justfile
4.0.0 (Jul 2021)
- Rewritten to use simpler DSL
- Upgraded to httpdisk 0.5 to take advantage of the new encoding support
3.0.0 (May 2021)
- Major rewrite of network and caching layer. See above.
- Use Faraday HTTP client with sinew middleware for caching.
- Supports multiple proxies (--proxy host1,host2,...)
2.0.4 (May 2018)
- Handle and cache more errors (too many redirects, connection failures, etc.)
- Support for adding uri.scheme in generate_cache_key
- Added status code, a peer touri,raw, etc.
2.0.3 (May 2018)
- & now normalizes to & (not and)
2.0.2 (May 2018)
- Support for --limit,--proxyand thexmlvariable
- Dedup - warn and ignore if row[:url] has already been emitted
- Auto gunzip if contents are compressed
2.0.1 (May 2018)
- Support for legacy cached headfiles from Sinew 1
2.0.0 (May 2018)
- Complete rewrite. See above.
1.0.3 (June 2012)
...
License
This extension is licensed under the MIT License.