Project

sq

0.0
No commit activity in last 3 years
No release in over 3 years
Download all PDFs linked in a Web page

 Project Readme

sq

sq is a web scraping tool for PDFs. Give it a URL and an optional regex, and it’ll download all the PDFs linked from that page.

Install

gem install sq

Usage

From the command-line:

$ sq [-o <directory>] [-F <format>] <url> [<regex>]

Available options:

  • -F: output format (see below), default is %s.pdf
  • -o: choose the output directory
  • -V: be more verbose
  • --formats: list available formats

The regex is case-sensitive and is matched against the whole URL.
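
To illustrate that rule, here is a minimal sketch in plain Ruby. It assumes that "matched against the whole URL" means the pattern is searched anywhere in the full URL string (not only the file name), as the examples below suggest. The `filter_links` helper and the example.com links are hypothetical and are not part of sq.

```ruby
# Hypothetical helper, not part of sq: keep only the links whose full URL
# matches the pattern. The match is case-sensitive and unanchored (assumption).
def filter_links(links, regex)
  pattern = Regexp.new(regex)
  links.select { |url| url.match?(pattern) }
end

# Made-up links, for illustration only.
links = [
  'http://example.com/cours/fiches/1.pdf',
  'http://example.com/cours/fiches/12.pdf',
  'http://example.com/cours/Fiches/3.pdf',    # capital F: not matched
  'http://example.com/cours/slides/intro.pdf' # no "fiches/<digits>": not matched
]

puts filter_links(links, 'fiches/\d+')
```

Because the match is case-sensitive, `Fiches/3.pdf` is skipped in this sketch.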

Examples

# Get all PDFs from a Web page
sq http://liafa.fr/~yunes/cours/interfaces/

# Use a regexp to get only those you want
sq http://liafa.fr/~yunes/cours/interfaces/ 'fiches/\d+'

# Be more verbose
sq -V http://liafa.fr/~yunes/cours/interfaces/ 'fiches/\d+'

# Add a filename format
sq -V http://liafa.fr/~yunes/cours/interfaces/ 'fiches/\d+' -F 'class-%Z.pdf'

Formats

The output format is applied to each PDF filename. It’s a string containing zero or more placeholders, each of which is replaced by the corresponding value for that PDF:

%n - PDF number, starting at 0
%N - PDF number, starting at 1
%z - same as %n, but zero-padded
%Z - same as %N, but zero-padded
%c - total number of PDFs
%s - name of the PDF, extracted from its URI, without `.pdf`
%S - name of the PDF, extracted from the link text
%_ - same as %S, but spaces are replaced with underscores
%- - same as %S, but spaces are replaced with hyphens
%% - literal %
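
The placeholder table above can be sketched as a small Ruby function. This is only an illustration of how such a format could be expanded, not sq's actual implementation: the `expand_format` name and its keyword arguments are hypothetical, and the zero-padding width (the number of digits in the total count) is an assumption.

```ruby
# Hypothetical helper, not part of sq's API: expand the documented
# placeholders for a single PDF.
def expand_format(format, index:, count:, uri_name:, link_text:)
  # Assumption: pad to the number of digits in the total count.
  width = count.to_s.length
  values = {
    'n' => index.to_s,                         # PDF number, starting at 0
    'N' => (index + 1).to_s,                   # PDF number, starting at 1
    'z' => index.to_s.rjust(width, '0'),       # %n, zero-padded
    'Z' => (index + 1).to_s.rjust(width, '0'), # %N, zero-padded
    'c' => count.to_s,                         # total number of PDFs
    's' => uri_name,                           # name taken from the URI
    'S' => link_text,                          # name taken from the link text
    '_' => link_text.tr(' ', '_'),             # %S with underscores
    '-' => link_text.tr(' ', '-'),             # %S with hyphens
    '%' => '%'                                 # literal %
  }
  # Replace every %X in a single pass, so the output of "%%" is never
  # scanned again; unknown placeholders are left untouched.
  format.gsub(/%(.)/) { |m| values.fetch(m[1], m) }
end

puts expand_format('class-%Z.pdf', index: 0, count: 12,
                   uri_name: 'fiche1', link_text: 'Fiche 1')
```

Under these assumptions, the first of 12 PDFs with the format `class-%Z.pdf` is named `class-01.pdf`.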

API

In a Ruby file:

require 'sq'

urls = SQ.query('http://example.com', /important/i)

Tests

$ git clone https://github.com/bfontaine/sq.git
$ cd sq
$ bundle install
$ rake test

Running the tests generates a coverage report at coverage/index.html, which you can open in a Web browser.