
Robotx

Robotx (pronounced "robotex") is a simple but powerful parser for robots.txt files. It offers a number of features that allow you to check whether a URL is allowed or disallowed to be visited by a crawler.

Features

  • Maintains lists of allowed and disallowed URLs
  • Offers a simple method to check whether a URL or just a path may be visited
  • Lists all user agents covered by the robots.txt
  • Reads the 'Crawl-Delay' for a website
  • Supports sitemap(s)

Installation

With Bundler

Just add this line to your Gemfile:

gem 'robotx'
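
Then install it by running Bundler:

$ bundle install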

Without Bundler

If you're not using Bundler, just run the following on your command line:

$ gem install robotx

Usage

Support for different user agents

Robotx can be initialized with a specific user agent. The default user agent is '*'. Please note: all method results depend on the user agent Robotx was initialized with.

require 'robotx'

# Initialize with the default user agent '*'
robots_txt = Robotx.new('https://github.com')
robots_txt.allowed  # => ["/humans.txt"]

# Initialize with 'googlebot' as user agent
robots_txt = Robotx.new('https://github.com', 'googlebot')
robots_txt.allowed # => ["/*/*/tree/master", "/*/*/blob/master"]

Check whether a URL is allowed to be indexed

require 'robotx'

robots_txt = Robotx.new('https://github.com')
robots_txt.allowed?('/humans.txt')  # => true
robots_txt.allowed?('/')  # => false

# The allowed? method can also handle arrays of URIs/paths
robots_txt.allowed?(['/', '/humans.txt'])  # => {"/"=>false, "/humans.txt"=>true}
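
Because the array form returns a hash mapping each path to a boolean, you can filter a list of paths down to the visitable ones. A minimal sketch building on the example above:

paths = ['/', '/humans.txt']
allowed_paths = robots_txt.allowed?(paths).select { |_, allowed| allowed }.keys
# => ["/humans.txt"]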

Get all allowed/disallowed URLs

require 'robotx'

robots_txt = Robotx.new('https://github.com')
robots_txt.allowed  # => ["/humans.txt"]
robots_txt.disallowed  # => ["/"]

Get additional information

require 'robotx'

robots_txt = Robotx.new('https://github.com')
robots_txt.sitemap  # => []
robots_txt.crawl_delay  # => 0
robots_txt.user_agents  # => ["googlebot", "baiduspider", ...]
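
Putting these pieces together is enough for a simple "polite" fetch loop. A minimal sketch, assuming the site is reachable and that the allowed entries are plain paths (real robots.txt rules may contain wildcards like /*/*/tree/master, which would need expanding before fetching):

require 'robotx'
require 'net/http'

robots_txt = Robotx.new('https://github.com')

robots_txt.allowed.each do |path|
  # Respect the site's crawl delay between requests (0 means no delay)
  sleep(robots_txt.crawl_delay)
  response = Net::HTTP.get_response(URI.join('https://github.com', path))
  puts "#{path} => #{response.code}"
end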

Todo

  • Add tests
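
A minimal sketch of what such a test could look like, using Minitest and the live GitHub robots.txt from the examples above (a real test suite would stub the HTTP request instead of hitting the network):

require 'minitest/autorun'
require 'robotx'

class RobotxTest < Minitest::Test
  def setup
    # Uses the default user agent '*'
    @robots_txt = Robotx.new('https://github.com')
  end

  def test_humans_txt_is_allowed_for_default_user_agent
    assert @robots_txt.allowed?('/humans.txt')
  end

  def test_root_is_disallowed_for_default_user_agent
    refute @robots_txt.allowed?('/')
  end
end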