ronin-web-spider

Description

ronin-web-spider is a collection of common web spidering routines using the spidr gem.

Features

  • Built on top of the battle-tested and versatile spidr gem.
  • Provides additional callback methods: every_host, every_cert, every_favicon, every_html_comment, every_javascript, every_javascript_string, every_javascript_comment, and every_comment (each demonstrated in the Examples below).
  • Supports archiving spidered pages to a directory or git repository.
  • Has 97% documentation coverage.
  • Has 94% test coverage.

Examples

Start spidering from a given URL:

require 'ronin/web/spider'

Ronin::Web::Spider.start_at('http://tenderlovemaking.com/') do |agent|
  # ...
end
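
The yielded agent is built on spidr's Spidr::Agent, so spidr's standard callbacks can also be registered inside the block (a minimal sketch, reusing the URL above):

Ronin::Web::Spider.start_at('http://tenderlovemaking.com/') do |agent|
  # every_page is inherited from Spidr::Agent:
  agent.every_page do |page|
    puts "Visited #{page.url}"
  end
end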

Spider a host:

Ronin::Web::Spider.host('solnic.eu') do |agent|
  # ...
end

Spider a domain (and any sub-domains):

Ronin::Web::Spider.domain('ruby-lang.org') do |agent|
  # ...
end

Spider a site:

Ronin::Web::Spider.site('http://www.rubyflow.com/') do |agent|
  # ...
end

Spider multiple hosts:

Ronin::Web::Spider.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
  # ...
end

Do not spider certain links:

Ronin::Web::Spider.site('http://company.com/', ignore_links: [%r{^/blog/}]) do |agent|
  # ...
end

Do not spider links on certain ports:

Ronin::Web::Spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
  # ...
end

Do not spider links blacklisted in robots.txt:

Ronin::Web::Spider.site('http://company.com/', robots: true) do |agent|
  # ...
end

Print out visited URLs:

Ronin::Web::Spider.site('http://www.rubyinside.com/') do |spider|
  spider.every_url { |url| puts url }
end

Build a URL map of a site:

url_map = Hash.new { |hash,key| hash[key] = [] }

Ronin::Web::Spider.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end
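
Once the spider has finished, the map can be inspected with plain Ruby. For example (a usage sketch), to list the ten most heavily linked-to URLs:

url_map.sort_by { |url,origins| -origins.length }.first(10).each do |url,origins|
  puts "#{url} (linked from #{origins.length} pages)"
end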

Print out the URLs that could not be requested:

Ronin::Web::Spider.site('http://company.com/') do |spider|
  spider.every_failed_url { |url| puts url }
end

Find all pages which have broken links:

url_map = Hash.new { |hash,key| hash[key] = [] }

agent = Ronin::Web::Spider.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

agent.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end

Search HTML and XML pages:

Ronin::Web::Spider.site('http://company.com/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end
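
Since spidr parses pages with Nokogiri, page.search also accepts CSS selectors in addition to the XPath shown above (a small illustrative sketch):

Ronin::Web::Spider.site('http://company.com/') do |spider|
  spider.every_html_page do |page|
    # CSS selector equivalent of the XPath //a[@href]:
    page.search('a[href]').each do |a|
      puts a['href']
    end
  end
end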

Print out the titles from every page:

Ronin::Web::Spider.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end

Print out every HTTP redirect:

Ronin::Web::Spider.host('company.com') do |spider|
  spider.every_redirect_page do |page|
    puts "#{page.url} -> #{page.headers['Location']}"
  end
end

Find what kinds of web servers a host is using, by accessing the headers:

require 'set'

servers = Set[]

Ronin::Web::Spider.host('company.com') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end
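
After the spider completes, servers holds one entry per unique Server header value. Note that headers['server'] may be nil for responses that omit the header, so it is worth dropping nils before printing (a trivial usage sketch):

servers.delete(nil)
servers.each { |server| puts server }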

Pause the spider on a forbidden page:

Ronin::Web::Spider.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end

Skip the processing of a page:

Ronin::Web::Spider.host('company.com') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end

Skip the processing of links:

Ronin::Web::Spider.host('company.com') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end

Detect when a new host name is spidered:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_host do |host|
    puts "Spidering #{host} ..."
  end
end

Detect when a new SSL/TLS certificate is encountered:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_cert do |cert|
    puts "Discovered new cert for #{cert.subject.common_name}, #{cert.subject_alt_name}"
  end
end

Print the MD5 checksum of every favicon.ico file:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_favicon do |page|
    puts "#{page.url}: #{page.body.md5}"
  end
end
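
Note that String#md5 above is a core extension added by ronin-support, one of this gem's dependencies. A stdlib-only equivalent using Digest::MD5 would be:

require 'digest'

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_favicon do |page|
    puts "#{page.url}: #{Digest::MD5.hexdigest(page.body)}"
  end
end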

Print every HTML comment:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_html_comment do |comment|
    puts comment
  end
end

Print all JavaScript source code:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_javascript do |js|
    puts js
  end
end
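
The yielded js is a plain String, so it can be scanned with ordinary regexps. For example, to pull out absolute URLs embedded in scripts (the pattern here is deliberately naive and purely illustrative):

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_javascript do |js|
    # with no capture groups, scan yields each full match:
    js.scan(%r{https?://[^\s'"]+}) do |url|
      puts url
    end
  end
end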

Print every JavaScript string literal:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_javascript_string do |str|
    puts str
  end
end

Print every JavaScript comment:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_javascript_comment do |comment|
    puts comment
  end
end

Print every HTML and JavaScript comment:

Ronin::Web::Spider.domain('example.com') do |spider|
  spider.every_comment do |comment|
    puts comment
  end
end

Spider a host and archive every web page:

require 'ronin/web/spider'
require 'ronin/web/spider/archive'

Ronin::Web::Spider::Archive.open('path/to/root') do |archive|
  Ronin::Web::Spider.every_page(host: 'example.com') do |page|
    archive.write(page.url,page.body)
  end
end

Spider a host and archive every web page to a Git repository:

require 'ronin/web/spider/git_archive'
require 'ronin/web/spider'
require 'date'

Ronin::Web::Spider::GitArchive.open('path/to/root') do |archive|
  archive.commit("Updated #{Date.today}") do
    Ronin::Web::Spider.every_page(host: 'example.com') do |page|
      archive.write(page.url,page.body)
    end
  end
end

Requirements

  • Ruby >= 3.0.0
  • spidr ~> 0.7
  • ronin-support ~> 1.0

Install

$ gem install ronin-web-spider

Gemfile

gem 'ronin-web-spider', '~> 0.1'

gemspec

gem.add_dependency 'ronin-web-spider', '~> 0.1'

Development

  1. Fork It!
  2. Clone It!
  3. cd ronin-web-spider/
  4. bundle install
  5. git checkout -b my_feature
  6. Code It!
  7. bundle exec rake spec
  8. git push origin my_feature

License

Copyright (c) 2006-2023 Hal Brodigan (postmodern.mod3 at gmail.com)

ronin-web-spider is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

ronin-web-spider is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with ronin-web-spider. If not, see https://www.gnu.org/licenses/.