
robotstxt-rb


A Ruby gem providing native bindings to Google's official C++ robots.txt parser and matcher. Enables fast, standards-compliant robots.txt parsing and URL access checking directly from Ruby.

Features

  • Fast Performance: Native C++ implementation via FFI bindings
  • Standards Compliant: Wraps Google's official robots.txt C++ parser
  • Cross-Platform: Supports macOS and Linux (ARM64 and x86_64)
  • Simple API: Easy-to-use Ruby interface
  • RFC 9309 Compliant: Follows the latest robots.txt specification

Installation

From RubyGems (Recommended)

gem install robotstxt-rb

From GitHub

Add this line to your application's Gemfile:

gem 'robotstxt-rb', git: 'https://github.com/jacksontrieu/robotstxt-rb.git'

And then execute:

bundle install

Quick Start

require 'robotstxt-rb'

# Define some robots.txt rules
robots_txt = <<~ROBOTS
  User-agent: *
  Disallow: /admin
  Allow: /public
ROBOTS

# Check whether a URL is allowed for a specific user agent
RobotstxtRb.allowed?(
  robots_txt: robots_txt,
  user_agent: "MyBot",
  url: "https://example.com/public"
)
# => true

RobotstxtRb.allowed?(
  robots_txt: robots_txt,
  user_agent: "MyBot",
  url: "https://example.com/admin"
)
# => false

# Validate robots.txt content
RobotstxtRb.valid?(robots_txt: robots_txt)
# => true

API Documentation

RobotstxtRb.allowed?(robots_txt:, user_agent:, url:)

Checks if a specific URL is allowed to be crawled by a given user agent according to the robots.txt rules.

Parameters:

  • robots_txt (String): The robots.txt content to parse
  • user_agent (String): The user agent string to check
  • url (String): The URL to check (can be full URL or path)

Returns: Boolean - true if the URL is allowed, false if disallowed

Raises: ArgumentError if any required parameter is nil
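
A minimal sketch of both accepted URL forms and the nil check (behaviour follows the description above; the exact error message depends on the library):

robots_txt = "User-agent: *\nDisallow: /admin"

# Full URLs and bare paths are both accepted
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "MyBot", url: "https://example.com/admin")  # => false
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "MyBot", url: "/admin")                     # => false

# Passing nil for any required argument raises ArgumentError
begin
  RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: nil, url: "/admin")
rescue ArgumentError => e
  puts "Rejected: #{e.message}"
end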

RobotstxtRb.valid?(robots_txt:)

Validates whether the given robots.txt content is well-formed.

Parameters:

  • robots_txt (String): The robots.txt content to validate

Returns: Boolean - true if valid, false if invalid or nil
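
A small sketch of the documented return values:

RobotstxtRb.valid?(robots_txt: "User-agent: *\nDisallow: /admin")  # => true
RobotstxtRb.valid?(robots_txt: nil)                                # => false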

Supported Platforms

  • macOS: ARM64 (Apple Silicon) and x86_64 (Intel)
  • Linux: ARM64 and x86_64

Usage Examples

Basic URL Checking

require 'robotstxt-rb'

# Simple robots.txt
robots_txt = "User-agent: *\nDisallow: /private"

# Check various URLs
puts RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/public")     # true
puts RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/private")    # false
puts RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/private/")   # false

User Agent Specific Rules

robots_txt = <<~ROBOTS
  User-agent: Googlebot
  Disallow: /search
  Allow: /

  User-agent: *
  Disallow: /
ROBOTS

# Googlebot can access most URLs
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Googlebot", url: "/")           # true
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Googlebot", url: "/search")     # false

# Other bots are blocked
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "OtherBot", url: "/")            # false

Wildcard Patterns

robots_txt = <<~ROBOTS
  User-agent: *
  Disallow: /*.pdf$
  Disallow: /temp*
  Allow: /temp/public
ROBOTS

# PDF files are blocked
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/document.pdf")     # false

# Temp files are blocked
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/temp/file")       # false

# Exception to temp rule
RobotstxtRb.allowed?(robots_txt: robots_txt, user_agent: "Bot", url: "/temp/public")     # true

Validation

# Valid robots.txt
RobotstxtRb.valid?(robots_txt: "User-agent: *\nDisallow: /admin")  # true

# Invalid robots.txt
RobotstxtRb.valid?(robots_txt: "Invalid-directive: value")         # false

# Empty robots.txt is valid
RobotstxtRb.valid?(robots_txt: "")                                 # true

Development

Setup

  1. Clone the repository:
git clone https://github.com/jacksontrieu/robotstxt-rb.git
cd robotstxt-rb
  2. Install dependencies:
bundle install

Running Tests

# Run all tests
bundle exec rspec

# Run with detailed, per-example output
bundle exec rspec --format documentation

Code Style

This project uses RuboCop for code style enforcement. To check and fix style issues:

# Check for style violations
bundle exec rubocop

# Auto-fix violations where possible
bundle exec rubocop -a

# Auto-fix all violations (including unsafe ones)
bundle exec rubocop -A

Building the Gem

gem build robotstxt-rb.gemspec

Contributing

We welcome contributions! Please see our Contributing Guide for details on how to get started.

Code of Conduct

This project adheres to a Code of Conduct. By participating, you are expected to uphold this code.

Security

Please see our Security Policy for information on reporting security vulnerabilities.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Related Resources

Changelog

See CHANGELOG.md for a list of changes and version history.