Project

legitbot

Does a Web request come from a real search engine or from an impersonating agent?
 Dependencies

Runtime

~> 0.2, >= 0.2.1
~> 0.2, >= 0.2.2
 Project Readme

Legitbot

Ruby gem to make sure that an IP really belongs to a bot, typically a search engine.

Usage

Suppose you have a Web request and you would like to check that it is not disguised:

bot = Legitbot.bot(userAgent, ip)

bot will be nil if no bot signature was found in the User-Agent. Otherwise, it will be an object with the following methods:

bot.detected_as # => :google
bot.valid? # => true
bot.fake? # => false
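
The three possible outcomes (no bot signature, a validated bot, an impersonator) can be handled in one place. The following is a minimal sketch that assumes only the methods listed above; `classify_bot` is a hypothetical helper, not part of the gem, and it accepts any object that responds to `detected_as` and `fake?`:

```ruby
# Hypothetical helper (not part of legitbot): map the result of
# Legitbot.bot(user_agent, ip) to a symbol the application can act on.
def classify_bot(bot)
  return :no_bot_signature if bot.nil? # User-Agent did not look like a bot
  return :impersonator if bot.fake?    # bot signature present, validation failed

  bot.detected_as                      # e.g. :google for a validated bot
end
```

In an application it might be called as `classify_bot(Legitbot.bot(request.user_agent, request.ip))`.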

Sometimes you already know which search engine to expect. For example, you might be using rack-attack:

Rack::Attack.blocklist("fake Googlebot") do |req|
  req.user_agent =~ %r(Googlebot) && Legitbot::Google.fake?(req.ip)
end

Or if you do not like all those ghoulish crawlers stealing your content, evaluating it and getting ready to invade your site with spammers, then block them all:

Rack::Attack.blocklist 'fake search engines' do |request|
  Legitbot.bot(request.user_agent, request.ip)&.fake?
end
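
For intuition about what validating a crawler's IP can involve, one widely documented technique is forward-confirmed reverse DNS: look up the host name for the IP, check that it belongs to the search engine's domain, then resolve that host name and confirm it leads back to the same IP. The sketch below illustrates that technique with Ruby's standard `resolv` library. It is not taken from legitbot's source and may differ from what the gem does for any given bot; `forward_confirmed?`, its parameters, and the `googlebot.com` domain in the usage line are assumptions made for illustration.

```ruby
require 'resolv'

# Sketch of forward-confirmed reverse DNS. Not taken from legitbot's source.
# `resolver` is anything with getname(ip) and getaddresses(host), as
# Resolv::DNS provides; injecting it keeps the check testable offline.
def forward_confirmed?(ip, allowed_domains, resolver: Resolv::DNS.new)
  host = resolver.getname(ip).to_s # reverse (PTR) lookup
  return false unless allowed_domains.any? { |d| host.end_with?(".#{d}") }

  # The forward lookup must lead back to the original IP.
  # This compares textual forms, so IPv6 input should be normalized first.
  resolver.getaddresses(host).map(&:to_s).include?(ip)
rescue Resolv::ResolvError
  false # no PTR record or lookup failure: treat as unverified
end
```

A call such as `forward_confirmed?(request.ip, ['googlebot.com'])` performs live DNS queries, so results are worth caching if the check runs on every request.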

Versioning

Semantic versioning with the following clarifications:

  • MINOR version is incremented when support for new bots is added.
  • PATCH version is incremented when validation logic for a bot changes (IP list updated, for example).
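
Under this scheme, PATCH releases carry validation updates such as refreshed IP lists, so a dependency constraint that admits them is probably what most applications want. The sketch below uses RubyGems' own `Gem::Requirement` to show which releases a pessimistic constraint accepts. The version numbers are made up for illustration, and `accepts?` is a hypothetical helper.

```ruby
require 'rubygems'

# Hypothetical helper: does `constraint` admit `version`?
def accepts?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

accepts?('~> 1.2', '1.2.5')   # => true  (PATCH: validation logic or IP lists changed)
accepts?('~> 1.2', '1.3.0')   # => true  (MINOR: support for new bots)
accepts?('~> 1.2', '2.0.0')   # => false (MAJOR)
accepts?('~> 1.2.0', '1.3.0') # => false (a three-part constraint admits PATCH releases only)
```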

Supported

License

Apache 2.0

Other projects

  • Play Framework variant in Scala: play-legitbot
  • Article "When (Fake) Googlebots Attack Your Rails App"
  • Voight-Kampff is a Ruby gem that detects bots by User-Agent
  • crawler_detect is a Ruby gem and Rack middleware that detects crawlers by a few different request headers, including User-Agent
  • Project Honeypot's http:BL can not only classify an IP as a search engine, but also label it as suspicious and report the number of days since its last activity. My implementation of the protocol in Scala is here.
  • CIDRAM is a PHP routing manager with built-in support to validate bots.