No commit activity in last 3 years
No release in over 3 years
Anemone web-spider framework
 Project Readme

Anemone

Anemone is a web spider framework that can spider a domain and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized spider tasks quickly and easily.

See anemone.rubyforge.org for more information.

Features

  • Multi-threaded design for high performance

  • Tracks 301 HTTP redirects

  • Built-in breadth-first search (BFS) algorithm for determining page depth

  • Allows exclusion of URLs based on regular expressions

  • Choose the links to follow on each page with focus_crawl()

  • HTTPS support

  • Records response time for each page

  • CLI program can list all pages in a domain, calculate page depths, and more

  • Obeys robots.txt

  • In-memory or persistent storage of pages during crawl, using TokyoCabinet, SQLite3, MongoDB, or Redis
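To make the page-depth and URL-exclusion features above concrete, here is a minimal pure-Ruby sketch of a breadth-first walk over a link graph. It is not Anemone's implementation: the SITE hash, the page_depths method, and the skip: parameter are hypothetical names invented for this illustration, and no HTTP requests are made. A page's depth is the smallest number of links that must be followed from the root to reach it.

```ruby
# Hypothetical in-memory "site": each URL maps to the URLs it links to.
SITE = {
  "/"            => ["/about", "/blog", "/login"],
  "/about"       => ["/", "/team"],
  "/blog"        => ["/blog/post-1", "/about"],
  "/blog/post-1" => ["/team"],
  "/team"        => [],
  "/login"       => ["/admin"],
  "/admin"       => []
}.freeze

# Breadth-first walk from root. Returns a hash of url => depth, where depth
# is the minimum number of links followed from the root to reach that URL.
# URLs matching any regular expression in skip are never queued.
def page_depths(site, root, skip: [])
  depths = { root => 0 }
  queue = [root]
  until queue.empty?
    url = queue.shift
    site.fetch(url, []).each do |link|
      next if depths.key?(link)                       # already reached by a shorter or equal path
      next if skip.any? { |pattern| link =~ pattern } # excluded by pattern
      depths[link] = depths[url] + 1
      queue << link
    end
  end
  depths
end

DEPTHS = page_depths(SITE, "/", skip: [/\A\/login/])
DEPTHS.each { |url, depth| puts "#{depth} #{url}" }
```

Because the queue is processed in first-in, first-out order, the first time a URL is seen is also the shortest route to it. Note that "/admin" never appears in the result here, since its only inbound link comes from the excluded "/login" page.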

Examples

See the scripts under the lib/anemone/cli directory for examples of several useful Anemone tasks.
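As a rough illustration of what such a task can look like, the sketch below collects the depth, response time, and URL of each page it visits. It is written from memory of the Anemone 0.7 API (Anemone.crawl, skip_links_like, on_every_page, and the :depth_limit and :obey_robots_txt options), so check it against the version you have installed. The crawl_report method name and the skip patterns are assumptions made for this example, and running it needs the anemone gem and network access.

```ruby
# Sketch only: calling this requires the anemone gem and network access.
def crawl_report(root_url)
  require 'anemone' # loaded lazily so this file loads without the gem installed

  report = []
  Anemone.crawl(root_url, :depth_limit => 2, :obey_robots_txt => true) do |anemone|
    # Never follow links whose path matches one of these patterns.
    anemone.skip_links_like(/\.pdf\z/i, %r{/login})

    # Called once for every page fetched during the crawl.
    anemone.on_every_page do |page|
      report << [page.depth, page.response_time, page.url.to_s]
    end
  end
  report
end
```

Calling crawl_report("http://www.example.com/") should return one [depth, response_time, url] entry per page reached within two links of the root.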

Requirements

  • nokogiri

  • robots

Development

To test and develop this gem, additional requirements are:

  • rspec

  • fakeweb

  • tokyocabinet

  • kyotocabinet-ruby

  • mongo

  • redis

  • sqlite3

You will need Kyoto Cabinet, Tokyo Cabinet, MongoDB, and Redis installed on your system, with the MongoDB and Redis servers running.