Search results for 'crawler' - The Ruby Toolbox

50%

57%

2016-04-07

voight_kampff (biola/voight-kampff)

0.26

Low commit activity in last 3 years

No release in over a year

Voight-Kampff detects bots, spiders, crawlers and replicants

Popularity

6,494,491

178

Releases

2.0.0

2011-05-11

2023-03-12

Activity

94%

58%

2018-09-03

polipus (taganaka/polipus)

0.07

No commit activity in last 3 years

No release in over 3 years

There are a lot of open issues

An easy to use distributed web-crawler framework based on Redis

Popularity

51,543

Releases

0.5.1

2014-01-05

2015-07-17

Activity

65%

84%

2015-03-02

rubyretriever (joenorton/rubyretriever)

0.08

No release in over 3 years

Low commit activity in last 3 years

There are a lot of open issues

Asynchronous web crawler, scraper and file harvester

Popularity

67,263

141

Releases

1.4.6

2014-05-25

2016-04-11

Activity

69%

76%

2015-02-18

instagram-crawler (mgleon08/instagram-crawler)

0.08

No release in over 3 years

Low commit activity in last 3 years

There are a lot of open issues

Crawl Instagram photos, posts and videos for download.

Popularity

7,362

197

Releases

0.3.0

2018-11-23

2019-04-14

Activity

16%

50%

2018-12-12

arachnid (dchuk/arachnid)

0.03

No commit activity in last 3 years

No release in over 3 years

Arachnid is a web crawler that relies on Bloom filters to efficiently store visited URLs and on Typhoeus to avoid the overhead of Mechanize when crawling every page on a domain.
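
The Bloom filter idea mentioned in this description can be sketched roughly as follows. This is not Arachnid's code: the `BloomFilter` class, the bit-array size, the number of hashes, and the SHA-256 based hashing are all assumptions chosen for illustration. The point is only that a crawler can remember visited URLs in a fixed amount of memory, at the cost of occasional false positives.

```ruby
require 'digest'

# Hypothetical minimal Bloom filter: a fixed-size bit array plus k hash positions.
class BloomFilter
  def initialize(size: 1024, hashes: 3)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, false)
  end

  # Mark every hash position for the item as set.
  def add(item)
    positions(item).each { |i| @bits[i] = true }
  end

  # True if every position is set. May report false positives, never false negatives.
  def include?(item)
    positions(item).all? { |i| @bits[i] }
  end

  private

  # Derive k positions by salting the item with a seed before hashing.
  def positions(item)
    (0...@hashes).map do |seed|
      Digest::SHA256.hexdigest("#{seed}:#{item}").to_i(16) % @size
    end
  end
end

visited = BloomFilter.new
visited.add('https://www.example.com/')
puts visited.include?('https://www.example.com/') # => true
```

A crawler built this way would call `include?` before fetching a URL and `add` afterwards. A false positive means a page is skipped, which is usually an acceptable trade for bounded memory.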

Popularity

48,791

Releases

0.4.1

2011-11-11

2014-01-16

Activity

66%

2012-04-10

crawler_detect (loadkpi/crawler_detect)

User Agent Detection

0.07

Low commit activity in last 3 years

A long-lived project that still receives updates

CrawlerDetect is a library to detect bots/crawlers via the user agent

Popularity

1,028,425

111

Releases

1.2.4

2018-08-05

2024-03-20

Activity

87%

73%

2021-01-07

google_ajax_crawler (benkitzelman/google-ajax-crawler)

0.03

No commit activity in last 3 years

No release in over 3 years

Rack middleware adhering to the Google Ajax Crawling Scheme. It uses a headless browser to render JS-heavy pages and serves a DOM snapshot of the rendered state to a requesting search engine.

Popularity

15,855

Releases

0.2.0

2013-03-16

2013-07-13

Activity

50%

100%

2013-07-10

wayback_archiver (buren/wayback_archiver)

0.03

No release in over 3 years

Low commit activity in last 3 years

Post URLs to Wayback Machine (Internet Archive), using a crawler, from Sitemap(s) or a list of URLs.

Popularity

46,170

Releases

1.4.0

2014-07-17

2021-04-23

Activity

77%

68%

2019-08-15

validate-website (spk/validate-website)

0.03

Low commit activity in last 3 years

No release in over a year

validate-website is a web crawler that checks markup validity against XML Schema / DTD and reports not-found URLs.

Popularity

125,592

Releases

1.12.0

2009-10-24

2022-11-15

Activity

100%

83%

2021-01-02

grell (mdsol/grell)

0.02

No commit activity in last 3 years

No release in over 3 years

Ruby web crawler using PhantomJS

Popularity

86,603

481

Releases

2.1.2

2015-05-07

2021-02-17

Activity

87%

2017-01-26

daimon_skycrawlers (bm-sms/daimon_skycrawlers)

0.01

Repository is archived

No commit activity in last 3 years

No release in over 3 years

This is a crawler framework.

Popularity

41,121

Releases

1.0.0

2016-01-27

2017-02-15

Activity

100%

91%

2017-02-12

render_static (herval/render_static)

0.01

No commit activity in last 3 years

No release in over 3 years

render_static makes single-page apps (Backbone, Angular, etc.) built on Rails SEO-friendly. It works by injecting a small Rack middleware that renders pages as plain HTML when the requester is one of the most common crawlers or bots (Google, Yahoo, Baidu and Bing).

Popularity

4,145

Releases

0.0.0

2013-05-08

2013-05-08

Activity

2013-05-07

rdig (jkraemer/rdig)

0.01

No commit activity in last 3 years

No release in over 3 years

Website crawler and fulltext indexer.

Popularity

48,189

Releases

0.3.12

2006-03-25

2009-04-25

Activity

2008-07-29

is_crawler (ccashwell/is_crawler)

0.02

No commit activity in last 3 years

No release in over 3 years

is_crawler does exactly what you might think it does: determine if the supplied string matches a known crawler or bot.
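
The kind of check described here, matching a supplied string against known crawler signatures, might look roughly like the sketch below. It does not use the is_crawler gem or reproduce its API: the `crawler?` method and the `CRAWLER_PATTERNS` list are hypothetical names, and the handful of patterns is illustrative rather than a complete bot list.

```ruby
# Hypothetical, deliberately short list of crawler signatures (assumption for illustration).
CRAWLER_PATTERNS = [
  /googlebot/i,
  /bingbot/i,
  /baiduspider/i,
  /yahoo! slurp/i
].freeze

# Returns true when the user agent string matches any known crawler pattern.
def crawler?(user_agent)
  return false if user_agent.nil? || user_agent.empty?

  CRAWLER_PATTERNS.any? { |pattern| pattern.match?(user_agent) }
end

puts crawler?('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)') # => true
puts crawler?('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0') # => false
```

User agent strings are supplied by the client and can be spoofed, so a check like this is best treated as a hint rather than proof.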

Popularity

160,932

Releases

0.1.5

2013-02-27

2013-05-23

Activity

60%

2013-12-05

driller (shashikant86/driller)

0.01

No commit activity in last 3 years

No release in over 3 years

Driller is a command-line, Ruby-based web crawler built on Anemone. It crawls a website, reports error pages and slow pages, and generates HTML reports.

Popularity

33,787

Releases

0.1.4

2015-05-10

2015-05-18

Activity

2015-05-14

marmiton_crawler (madeindjs/marmiton_crawler)

0.01

Repository is archived

No commit activity in last 3 years

No release in over 3 years

A web crawler to get a Marmiton recipe

Popularity

4,669

Releases

1.0.3

2016-10-09

2016-11-28

Activity

75%

100%

2017-09-23

medusa-crawler (brutuscat/medusa-crawler)

0.01

No commit activity in last 3 years

No release in over 3 years

== Medusa: a ruby crawler framework

Medusa is a framework for the ruby language to crawl and collect useful information about the pages it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.

=== Features

* Choose the links to follow on each page with +focus_crawl+
* Multi-threaded design for high performance
* Tracks +301+ HTTP redirects
* Allows exclusion of URLs based on regular expressions
* Records response time for each page
* Obey _robots.txt_ directives (optional, but recommended)
* In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
* Inherits OpenURI behavior (redirects, automatic charset and encoding detection, proxy configuration options).

<b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>

=== Examples

Medusa is versatile and meant to be used programmatically. You can start with one or multiple URIs:

  require 'medusa'

  Medusa.crawl('https://www.example.com', depth_limit: 2)

Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:

  require 'medusa'

  Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
    crawler.discard_page_bodies = some_flag

    # Persist all the pages state across crawl-runs.
    crawler.clear_on_startup = false
    crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')

    crawler.skip_links_like(/private/)

    crawler.on_pages_like(/public/) do |page|
      logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
    end

    # Use an arbitrary logic, page by page, to continue customize the crawling.
    crawler.focus_crawl(/public/) do |page|
      page.links.first
    end
  end

Popularity

4,216

Releases

1.0.0

2020-08-06

2020-08-17

Activity

80%

2020-05-23