Kudzu

A simple web crawler for Ruby.

Features

  • Run single-threaded or multi-threaded.
  • Pool HTTP connections.
  • Restrict links by URL-based patterns.
  • Respect robots.txt.
  • Store page contents via an adapter.
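
To illustrate what restricting links by URL-based patterns can look like, here is a minimal standalone sketch in plain Ruby. It does not use Kudzu and does not describe how Kudzu implements filtering; the `LinkFilter` class and its `allow`/`deny` options are hypothetical names chosen for this example only.

```ruby
require 'uri'

# Hypothetical helper, NOT part of Kudzu: decides whether a discovered
# link should be followed, based only on its URL.
class LinkFilter
  def initialize(seed_url, allow: [], deny: [])
    @host  = URI.parse(seed_url).host
    @allow = allow # path patterns that are permitted (empty means "any")
    @deny  = deny  # path patterns that are always rejected
  end

  def follow?(url)
    uri = URI.parse(url)
    return false unless uri.is_a?(URI::HTTP) # http and https only
    return false unless uri.host == @host    # stay on the seed host

    path = uri.path.empty? ? '/' : uri.path
    return false if @deny.any? { |pattern| pattern.match?(path) }

    @allow.empty? || @allow.any? { |pattern| pattern.match?(path) }
  rescue URI::InvalidURIError
    false # unparsable links are never followed
  end
end

filter = LinkFilter.new('http://example.com/',
                        allow: [%r{\A/docs/}],
                        deny: [/\.pdf\z/])

puts filter.follow?('http://example.com/docs/intro.html')
puts filter.follow?('http://example.com/docs/manual.pdf')
puts filter.follow?('http://other.example.org/docs/')
```

In this sketch deny patterns take precedence over allow patterns, so a PDF under `/docs/` is still skipped. Kudzu's own filter options are shown in the Usage section.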

Dependencies

  • Ruby 2.5+
  • libicu

Installation

Add to your application's Gemfile:

gem 'kudzu'

Then run:

$ bundle install

Usage

Crawl HTML files on example.com:

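require 'kudzu'

# Configure the crawler: set the User-Agent header and a link filter.
crawler = Kudzu::Crawler.new do
  user_agent 'YOUR_AWESOME_APP'
  add_filter do
    focus_host true                # stay on the host of the seed URL
    allow_mime_type %w(text/html)  # accept HTML pages only
  end
end

# Start from the seed URL and print the URL of each successfully fetched page.
crawler.run('http://example.com/') do
  on_success do |page, link|
    puts page.url
  end
end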

Adapters

By default, this gem supports only in-memory crawling. Use the following adapter to save page contents persistently:

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/kanety/kudzu. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.