No commit activity in last 3 years
No release in over 3 years
Web crawler that generates a sitemap in a neo4j database. It also stores broken links and the total number of pages on the site.

 Project Readme

# creepy-crawler

Ruby web crawler that takes a URL as input and produces a sitemap using a neo4j graph database - nothing creepy about it.


## Installation

#### Clone

    git clone https://github.com/udryan10/creepy-crawler.git && cd creepy-crawler

#### Install Required Gems

    bundle install

#### Install graph database

    rake neo4j:install

#### Start graph database

    rake neo4j:start

#### Requirements

  1. Gems listed in the Gemfile
  2. Ruby 1.9+
  3. neo4j
  4. Oracle JDK 7 (for the neo4j graph database)
  5. lsof (for the neo4j graph database)

## Usage

### Code

#### Require

    require './creepy-crawler'

#### Start a crawl

    Creepycrawler.crawl("http://example.com")

#### Limit number of pages to crawl

    Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)

#### Extract some (potentially) useful statistics

    crawler = Creepycrawler.crawl("http://example.com", :max_page_crawl => 500)
    # list of broken links
    puts crawler.broken_links
    # list of sites that were visited
    puts crawler.visited_queue
    # count of crawled pages
    puts crawler.page_crawl_count
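
A rough sketch of how those attributes might be used together (illustrative only; it assumes `broken_links` returns an enumerable of URL strings):

    # illustrative sketch, not from the project docs
    require './creepy-crawler'

    crawler = Creepycrawler.crawl("http://example.com", :max_page_crawl => 100)

    if crawler.broken_links.empty?
      puts "no broken links found across #{crawler.page_crawl_count} crawled pages"
    else
      # assumes broken_links yields URL strings
      crawler.broken_links.each { |url| puts "broken link: #{url}" }
    end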

#### Options

    DEFAULT_OPTIONS = {
      # whether to print crawling information
      :verbose => true,
      # whether to obey robots.txt
      :obey_robots => true,
      # maximum number of pages to crawl, value of nil will attempt to crawl all pages
      :max_page_crawl => nil,
      # should pages be written to the database. Likely only used for testing, but may be used if you only wanted to get at the broken_links data
      :graph_to_neo4j => true
    }
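
These defaults can presumably be overridden by passing the corresponding keys to `crawl`, the same way `:max_page_crawl` is passed above. A sketch (the option keys are taken from `DEFAULT_OPTIONS`; the combination shown is just an example):

    # crawl quietly, ignore robots.txt, and skip writing to neo4j
    crawler = Creepycrawler.crawl("http://example.com",
                                  :verbose => false,
                                  :obey_robots => false,
                                  :graph_to_neo4j => false)
    puts crawler.broken_links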

#### Example

Examples are located in the examples/ directory.

### Command line

    # Crawl site
    ruby creepy-crawler.rb --site "http://google.com"
    # Get command options
    ruby creepy-crawler.rb --help

Note: if behind a proxy, export your proxy environment variables:

    export http_proxy=<proxy_host>; export https_proxy=<proxy_host>

### Docker

For testing, I have included the ability to run the environment and a crawl inside of a Docker container.
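
The exact commands are not listed here, but a typical invocation would look something like the following (a hypothetical sketch: it assumes a Dockerfile at the repository root, and the image name is arbitrary):

    # hypothetical example -- assumes the repository ships a Dockerfile at its root
    docker build -t creepy-crawler .
    # run the image's default command inside a throwaway container
    docker run --rm creepy-crawler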

## Output

creepy-crawler uses the neo4j graph database to store and display the site map.

### Web interface

neo4j has a web interface for viewing and interacting with the graph data. When running on localhost, visit: http://localhost:7474/webadmin/

  1. Click the Data Browser tab

  2. Enter a query to search for nodes (the query below returns all nodes):

         START root=node(*) RETURN root

  3. Click into a node

  4. Click switch view mode at the top right to view a graphical map

Note: to have the map display URL names instead of node numbers, you must create a style.

### REST interface

neo4j also has a full REST API for programmatic access to the data.
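
As an illustration, the node count from a crawl can be read back over HTTP from Ruby. This is only a sketch: it assumes the legacy `/db/data/cypher` endpoint that shipped with neo4j releases of this era (newer neo4j versions expose different endpoints), and it assumes the crawler creates one node per crawled page.

    # illustrative sketch of querying the neo4j REST API from Ruby
    require 'net/http'
    require 'json'
    require 'uri'

    # legacy Cypher-over-REST endpoint; adjust for your neo4j version
    uri = URI("http://localhost:7474/db/data/cypher")

    request = Net::HTTP::Post.new(uri.request_uri)
    request["Content-Type"] = "application/json"
    request["Accept"]       = "application/json"
    # count all nodes -- a rough cross-check against crawler.page_crawl_count,
    # assuming one node is created per crawled page
    request.body = { "query" => "START n=node(*) RETURN count(n)" }.to_json

    response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
    puts JSON.parse(response.body)["data"].inspect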

### Example Output Map

## TODO

  1. convert to gem
  2. make the crawler multi-threaded to increase crawl performance