Project

spidr_cli

No release in over 3 years
Low commit activity in last 3 years
Command Line Interface (CLI) for the spidr gem.
Dependencies

Development

~> 1.16
~> 9.0
~> 10.0
~> 3.0

Runtime

~> 0.6
Project Readme

SpidrCLI

Command Line Interface (CLI) for the excellent spidr gem.

Installation

Install with

$ gem install spidr_cli

Usage

Print all found pages on site

$ spidr https://jacoburenstam.com/

Print all HTML/JS/CSS pages

$ spidr --content-types=html,javascript,css https://jacoburenstam.com/

Max 10 pages

$ spidr --limit=10 https://jacoburenstam.com/

Spidr host

$ spidr host jacoburenstam.com

Spidr a single site (this is the default)

$ spidr site https://jacoburenstam.com

Start spidr from URL

$ spidr start_at https://jacoburenstam.com
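The three invocations above differ only in the optional first argument, and site is used when it is left out. A minimal Ruby sketch of how such an argument could be resolved follows. The helper name resolve_method and the constant KNOWN_METHODS are hypothetical and not taken from the spidr_cli source.

```ruby
# Hypothetical helper, not part of spidr_cli: shows how an optional
# leading <method> argument can fall back to "site".
KNOWN_METHODS = %w[site host start_at].freeze

def resolve_method(argv)
  args = argv.dup
  # Consume the first argument only if it names a known method.
  method = KNOWN_METHODS.include?(args.first) ? args.shift : 'site'
  [method, args]
end

p resolve_method(%w[host jacoburenstam.com])
p resolve_method(%w[https://jacoburenstam.com/])
```

Copying argv before shifting keeps the caller's array untouched, so the same arguments can still be handed to an option parser afterwards.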

You can output any method that Spidr::Page responds to as a column. You can also include a header row in the output (which is valid CSV)

$ spidr --columns=code,content_type,url \
        --header                        \
        https://jacoburenstam.com/
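To illustrate how column names can map onto page methods, here is a small Ruby sketch that calls each column as a method and writes the result as CSV. It is an assumption about the general technique, not the spidr_cli implementation: FakePage is a stand-in for Spidr::Page, and pages_to_csv is a hypothetical helper.

```ruby
require 'csv'

# Stand-in for Spidr::Page; only the three attributes used below are modeled.
FakePage = Struct.new(:code, :content_type, :url)

# Hypothetical helper: builds CSV text by calling each column name
# as a method on every page.
def pages_to_csv(pages, columns, header: false)
  CSV.generate do |csv|
    csv << columns if header
    pages.each do |page|
      csv << columns.map { |column| page.public_send(column) }
    end
  end
end

pages = [
  FakePage.new(200, 'text/html', 'https://jacoburenstam.com/'),
  FakePage.new(404, 'text/html', 'https://jacoburenstam.com/missing')
]
puts pages_to_csv(pages, %w[code content_type url], header: true)
```

Using public_send rather than send limits the columns to public methods, which is a reasonable guard when the names come from the command line.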

Full usage instructions

Usage: spidr [<method>] [options] <url>
        --columns=[val1,val2]        Columns in output
        --content-types=[val1,val2]  Formats to output (html, javascript, css, json, ..)
        --[no-]header                Include the header
        --[no-]strip-fragments       Specifies whether the Agent will strip URI fragments (default: true)
        --[no-]strip-query           Specifies whether the Agent will strip URI query (default: false)
        --schemes=[http,https]       Only spider links with certain scheme
        --host=[example]             Only spider links on certain host
        --hosts=[example.com]        Only spider links on certain hosts (ignored unless method is "start_at" or "site")
        --ignore-hosts=[www.example.com]
                                     Do not spider links on certain hosts (ignored unless method is "start_at" or "site")
        --ports=[80, 443]            Only spider links on certain ports
        --ignore-ports=[8000, 8080, 3000]
                                     Do not spider links on certain ports
        --links=[/blog/]             Only spider links on certain link patterns
        --ignore-links=[/blog/]      Do not spider links on certain link patterns
        --urls=[/blog/]              Only spider links on certain urls
        --ignore-urls=[/blog/]       Do not spider links on certain urls
        --exts=[htm]                 Only spider links on certain extensions
        --ignore-exts=[cfm]          Do not spider links on certain extensions
        --open-timeout=val           Open timeout
        --read-timeout=val           Read timeout
        --ssl-timeout=val            SSL timeout
        --continue-timeout=val       Continue timeout
        --keep-alive-timeout=val     Keep alive timeout
        --proxy-host=val             The host the proxy is running on
        --proxy-port=val             The port the proxy is running on
        --proxy-user=val             The user to authenticate with the proxy
        --proxy-password=val         The password to authenticate with the proxy
        --default-headers=[key1=val1,key2=val2]
                                     Default headers to set for every request
        --host-header=val            The HTTP Host header to use with each request
        --host-headers=[key1=val1,key2=val2]
                                     The HTTP Host headers to use for specific hosts
        --user-agent=val             The User-Agent string to send with each request
        --referer=val                The Referer URL to send with each request
        --delay=val                  The number of seconds to pause between each request
        --queue=[val1,val2]          The initial queue of URLs to visit
        --history=[val1,val2]        The initial list of visited URLs
        --limit=val                  The maximum number of pages to visit
        --max-depth=val              The maximum link depth to follow
        --[no-]robots                Respect Robots.txt
    -h, --help                       How to use
        --version                    Show version
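The option syntax above matches what Ruby's standard OptionParser prints. As a rough sketch of how a few of these options could be declared, the following parses only --columns, --header and --limit. The helper parse_options is hypothetical, no defaults are assumed, and the real spidr_cli code may be organized differently.

```ruby
require 'optparse'

# Hypothetical helper: parses a small subset of the options listed above
# and returns the parsed options plus the remaining positional arguments.
def parse_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: spidr [<method>] [options] <url>'
    # Array coercion splits a comma-separated value into a list.
    opts.on('--columns=[val1,val2]', Array, 'Columns in output') { |v| options[:columns] = v }
    # The [no-] form yields true for --header and false for --no-header.
    opts.on('--[no-]header', 'Include the header') { |v| options[:header] = v }
    opts.on('--limit=val', Integer, 'The maximum number of pages to visit') { |v| options[:limit] = v }
  end
  # parse (without !) leaves argv untouched and returns the non-option arguments.
  rest = parser.parse(argv)
  [options, rest]
end

options, rest = parse_options(%w[site --columns=code,url --header --limit=10 https://jacoburenstam.com/])
p options
p rest
```

The positional arguments returned in rest still contain the optional method name and the URL, so they can be resolved separately from the flags.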

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/buren/spidr_cli.

License

The gem is available as open source under the terms of the MIT License.

Thanks

Huge thanks to @postmodern for creating spidr