RubyCrawl 🎭


Production-ready web crawler for Ruby powered by Ferrum — full JavaScript rendering via the Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.

RubyCrawl provides accurate, JavaScript-enabled web scraping using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.

Why RubyCrawl?

  • Real browser — Handles JavaScript, AJAX, and SPAs correctly
  • Pure Ruby — No Node.js, no npm, no external processes to manage
  • Zero config — Works out of the box, no Ferrum knowledge needed
  • Production-ready — Auto-retry, error handling, resource optimization
  • Multi-page crawling — BFS algorithm with smart URL deduplication
  • Rails-friendly — Generators, initializers, and ActiveJob integration
  • Readability-powered — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages

# One line to crawl any JavaScript-heavy site
result = RubyCrawl.crawl("https://docs.example.com")

result.html           # Full HTML with JS rendered
result.clean_text     # Noise-stripped plain text (no nav/footer/ads)
result.clean_markdown # Markdown ready for RAG pipelines
result.links          # All links with url, text, title, rel
result.metadata       # Title, description, OG tags, etc.

Features

  • Pure Ruby: Ferrum drives Chromium directly via CDP — no Node.js or npm required
  • Production-ready: Designed for Rails apps with auto-retry and exponential backoff
  • Simple API: Clean Ruby interface — zero Ferrum or CDP knowledge required
  • Resource optimization: Built-in resource blocking for 2-3x faster crawls
  • Auto-managed browsers: Lazy Chrome singleton, isolated page per crawl
  • Content extraction: Mozilla Readability.js (primary) + link-density heuristic (fallback) — article-quality clean_html, clean_text, clean_markdown, links, metadata
  • Multi-page crawling: BFS crawler with configurable depth limits and URL deduplication
  • Smart URL handling: Automatic normalization, tracking parameter removal, same-host filtering
  • Rails integration: First-class Rails support with generators and initializers
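The "smart URL handling" described above can be illustrated with a small sketch. This is an illustration of the idea, not RubyCrawl's internal code: drop the fragment and any utm_ tracking parameters before deduplicating.

```ruby
require "uri"

# Illustrative normalizer: strip fragments and utm_* tracking parameters
def normalize(url)
  uri = URI(url)
  uri.fragment = nil
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| k.start_with?("utm_") }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.to_s
end

normalize("https://example.com/page?utm_source=x&id=7#top")
# => "https://example.com/page?id=7"
```

Normalizing before deduplication is what lets a crawler treat `/page#top` and `/page?utm_source=x` as the same page.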

Table of Contents

  • Installation
  • Quick Start
  • Use Cases
  • Usage
    • Basic Crawling
    • Multi-Page Crawling
    • Configuration
    • Result Object
    • Error Handling
  • Rails Integration
  • Production Deployment
  • Architecture
  • Performance
  • Development
  • Contributing
  • License

Installation

Requirements

  • Ruby >= 3.0
  • Chrome or Chromium — managed automatically by Ferrum (downloaded on first use)

Add to Gemfile

gem "rubycrawl"

Then install:

bundle install

Install Chrome

Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:

bundle exec rake rubycrawl:install

This command:

  • ✅ Checks for Chrome/Chromium in your PATH
  • ✅ Creates a Rails initializer (if using Rails)

Note: If Chrome is not in your PATH, install it via your system package manager or download from google.com/chrome.

Quick Start

require "rubycrawl"

# Simple crawl
result = RubyCrawl.crawl("https://example.com")

# Access extracted content
result.final_url                   # Final URL after redirects
result.clean_text                  # Noise-stripped plain text (no nav/footer/ads)
result.clean_html                  # Noise-stripped HTML (same noise removed as clean_text)
result.raw_text                    # Full body.innerText (unfiltered)
result.html                        # Full raw HTML content
result.links                       # Extracted links with url, text, title, rel
result.metadata                    # Title, description, OG tags, etc.
result.metadata['extractor']       # "readability" or "heuristic" — which extractor ran
result.clean_markdown              # Markdown converted from clean_html (lazy — first access only)

Use Cases

RubyCrawl is perfect for:

  • RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
  • Data aggregation: Crawl product catalogs, job listings, or news articles
  • SEO analysis: Extract metadata, links, and content structure
  • Content migration: Convert existing sites to Markdown for static site generators
  • Documentation scraping: Create local copies of documentation with preserved links
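For the content-migration case, each crawled page needs an output path for the static site generator. A hypothetical helper like the following (`markdown_path_for` is illustrative, not part of the gem) could be paired with `crawl_site`, writing each `page.clean_markdown` to the derived path:

```ruby
require "uri"

# Hypothetical helper: map a crawled URL to a Markdown file path
def markdown_path_for(url)
  path = URI(url).path
  path = "/index" if path.empty? || path == "/"
  "site#{path.chomp('/')}.md"
end

markdown_path_for("https://docs.example.com/guides/install")
# => "site/guides/install.md"
```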

Usage

Basic Crawling

result = RubyCrawl.crawl("https://example.com")

result.html           # => "<html>...</html>"
result.clean_text     # => "Example Domain\n\nThis domain is..." (no nav/ads)
result.raw_text       # => "Example Domain\nThis domain is..." (full body text)
result.metadata       # => { "final_url" => "https://example.com", "title" => "..." }

Multi-Page Crawling

Crawl an entire site following links with BFS (breadth-first search):

# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
  # Each page is yielded as it's crawled (streaming)
  puts "Crawled: #{page.url} (depth: #{page.depth})"

  # Save to database
  Page.create!(
    url:      page.url,
    html:     page.html,
    markdown: page.clean_markdown,
    depth:    page.depth
  )
end
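Under the hood, the traversal is plain breadth-first search with a visited set. A simplified, self-contained sketch of that algorithm (here `fetch_links` stands in for rendering a page and extracting its links; the real crawler does far more per page):

```ruby
require "set"

# Conceptual BFS with URL deduplication and a depth limit
def bfs(start, max_depth:, fetch_links:)
  seen  = Set[start]
  queue = [[start, 0]]
  order = []
  until queue.empty?
    url, depth = queue.shift
    order << [url, depth]
    next if depth == max_depth # don't expand links beyond the depth limit
    fetch_links.call(url).each do |link|
      queue << [link, depth + 1] if seen.add?(link) # add? is nil if already seen
    end
  end
  order
end

graph = { "a" => ["b", "c"], "b" => ["a", "d"], "c" => [], "d" => [] }
bfs("a", max_depth: 2, fetch_links: ->(u) { graph[u] })
# => [["a", 0], ["b", 1], ["c", 1], ["d", 2]]
```

BFS is what guarantees pages are yielded shallowest-first, so `max_depth` bounds how far the crawl strays from the start URL.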

Real-world example: Building a RAG knowledge base

require "rubycrawl"

RubyCrawl.configure(
  wait_until: "networkidle",  # Ensure JS content loads
  block_resources: true       # Skip images/fonts for speed
)

pages_crawled = RubyCrawl.crawl_site(
  "https://docs.example.com",
  max_pages: 500,
  max_depth: 5,
  same_host_only: true
) do |page|
  VectorDB.upsert(
    id:       Digest::SHA256.hexdigest(page.url),
    content:  page.clean_markdown,
    metadata: {
      url:   page.url,
      title: page.metadata["title"],
      depth: page.depth
    }
  )
end

puts "Indexed #{pages_crawled} pages"

Multi-Page Options

| Option | Default | Description |
| --- | --- | --- |
| max_pages | 50 | Maximum number of pages to crawl |
| max_depth | 3 | Maximum link depth from start URL |
| same_host_only | true | Only follow links on the same domain |
| wait_until | inherited | Page load strategy |
| block_resources | inherited | Block images/fonts/CSS |
| respect_robots_txt | false | Honour robots.txt rules and auto-sleep Crawl-delay |

robots.txt Support

When respect_robots_txt: true, RubyCrawl fetches robots.txt once at the start of the crawl and:

  • Skips any URL disallowed for User-agent: *
  • Automatically sleeps the Crawl-delay specified in robots.txt between pages

RubyCrawl.crawl_site("https://example.com",
  respect_robots_txt: true,
  max_pages: 100
) do |page|
  puts page.url
end

Or enable globally:

RubyCrawl.configure(respect_robots_txt: true)

If robots.txt is unreachable or missing, crawling proceeds normally (fail open).

Page Result Object

The block receives a PageResult with:

page.url            # String: Final URL after redirects
page.html           # String: Full raw HTML content
page.clean_html     # String: Noise-stripped HTML (no nav/header/footer/ads)
page.clean_text     # String: Noise-stripped plain text (derived from clean_html)
page.raw_text       # String: Full body.innerText (unfiltered)
page.clean_markdown # String: Lazy-converted Markdown from clean_html
page.links          # Array: URLs extracted from page
page.metadata       # Hash: final_url, title, OG tags, etc.
page.depth          # Integer: Link depth from start URL

Configuration

Global Configuration

RubyCrawl.configure(
  wait_until:      "networkidle",
  block_resources: true,
  timeout:         60,
  headless:        true
)

# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")

Per-Request Options

# Use global defaults
result = RubyCrawl.crawl("https://example.com")

# Override for this request only
result = RubyCrawl.crawl(
  "https://example.com",
  wait_until:      "domcontentloaded",
  block_resources: false
)

Configuration Options

| Option | Values | Default | Description |
| --- | --- | --- | --- |
| wait_until | "load", "domcontentloaded", "networkidle", "commit" | nil | When to consider page loaded (nil = Ferrum default) |
| block_resources | true, false | nil | Block images, fonts, CSS, media for faster crawls |
| max_attempts | Integer | 3 | Total number of attempts (including the first) |
| timeout | Integer (seconds) | 30 | Browser navigation timeout |
| headless | true, false | true | Run Chrome headlessly |
| respect_robots_txt | true, false | false | Honour robots.txt rules and auto-sleep Crawl-delay |

Wait strategies explained:

  • load — Wait for the load event (good for static sites)
  • domcontentloaded — Wait for DOM ready (faster)
  • networkidle — Wait until no network requests for 500ms (best for SPAs)
  • commit — Wait until the first response bytes are received (fastest)

Result Object

result = RubyCrawl.crawl("https://example.com")

result.html           # String: Full raw HTML
result.clean_html     # String: Noise-stripped HTML (nav/header/footer/ads removed)
result.clean_text     # String: Plain text derived from clean_html — ideal for RAG
result.raw_text       # String: Full body.innerText (unfiltered)
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
result.links          # Array: Extracted links with url/text/title/rel
result.metadata       # Hash: See below
result.final_url      # String: Shortcut for metadata['final_url']

Links Format

result.links
# => [
#   { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
#   { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
# ]

URLs are automatically resolved to absolute form by the browser.
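Because links arrive as plain hashes, post-processing is ordinary Ruby. A small sketch, using sample data in the same shape as `result.links`, of keeping only same-host, followable links:

```ruby
require "uri"

# Sample data shaped like result.links
links = [
  { "url" => "https://example.com/about", "text" => "About",   "rel" => nil },
  { "url" => "https://example.com/jobs",  "text" => "Jobs",    "rel" => "nofollow" },
  { "url" => "https://other.com/partner", "text" => "Partner", "rel" => nil }
]

# Keep same-host links that are not marked nofollow
followable = links.select do |l|
  URI(l["url"]).host == "example.com" && l["rel"] != "nofollow"
end

followable.map { |l| l["text"] }
# => ["About"]
```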

Markdown Conversion

Markdown is lazy — conversion only happens on first access of .clean_markdown:

result.clean_html     # ✅ Already available, no overhead
result.clean_markdown # Converts clean_html → Markdown here (first call only)
result.clean_markdown # ✅ Cached, instant on subsequent calls

Uses reverse_markdown with GitHub-flavored output.

Metadata Fields

result.metadata
# => {
#   "final_url"           => "https://example.com",
#   "title"               => "Page Title",
#   "description"         => "...",
#   "keywords"            => "ruby, web",
#   "author"              => "Author Name",
#   "og_title"            => "...",
#   "og_description"      => "...",
#   "og_image"            => "https://...",
#   "og_url"              => "https://...",
#   "og_type"             => "website",
#   "twitter_card"        => "summary",
#   "twitter_title"       => "...",
#   "twitter_description" => "...",
#   "twitter_image"       => "https://...",
#   "canonical"           => "https://...",
#   "lang"                => "en",
#   "charset"             => "UTF-8",
#   "extractor"           => "readability"  # or "heuristic"
# }

Error Handling

begin
  result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
  # Invalid URL or option value
rescue RubyCrawl::TimeoutError => e
  # Page load timed out
rescue RubyCrawl::NavigationError => e
  # Navigation failed (404, DNS error, SSL error)
rescue RubyCrawl::ServiceError => e
  # Browser failed to start or crashed
rescue RubyCrawl::Error => e
  # Catch-all for any RubyCrawl error
end

Exception Hierarchy:

RubyCrawl::Error
  ├── ConfigurationError  — invalid URL or option value
  ├── TimeoutError        — page load timed out
  ├── NavigationError     — navigation failed (HTTP error, DNS, SSL)
  └── ServiceError        — browser failed to start or crashed

Automatic Retry: ServiceError and TimeoutError are retried with exponential backoff. NavigationError and ConfigurationError are not retried (they won't succeed on retry).

RubyCrawl.configure(max_attempts: 5)     # 5 total attempts
RubyCrawl.crawl(url, max_attempts: 1)    # Disable retries
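The actual delay constants are internal to RubyCrawl, but as a sketch, exponential backoff with an assumed base delay of 1 second doubling per retry produces this schedule:

```ruby
# Illustrative backoff schedule: base * 2^(attempt - 1)
# (the base delay and growth factor here are assumptions, not the gem's constants)
base = 1.0
delays = (1..4).map { |attempt| base * 2**(attempt - 1) }
# => [1.0, 2.0, 4.0, 8.0]
```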

Rails Integration

Installation

bundle exec rake rubycrawl:install

This creates config/initializers/rubycrawl.rb:

RubyCrawl.configure(
  wait_until:      "load",
  block_resources: true
)

Usage in Rails

Background Jobs with ActiveJob

class CrawlPageJob < ApplicationJob
  queue_as :crawlers

  retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
  retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
  discard_on RubyCrawl::ConfigurationError

  def perform(url)
    result = RubyCrawl.crawl(url)

    Page.create!(
      url:        result.final_url,
      title:      result.metadata['title'],
      content:    result.clean_text,
      markdown:   result.clean_markdown,
      crawled_at: Time.current
    )
  end
end

Multi-page RAG knowledge base:

class BuildKnowledgeBaseJob < ApplicationJob
  queue_as :crawlers

  def perform(documentation_url)
    RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
      embedding = OpenAI.embed(page.clean_markdown)

      Document.create!(
        url:       page.url,
        title:     page.metadata['title'],
        content:   page.clean_markdown,
        embedding: embedding,
        depth:     page.depth
      )
    end
  end
end

Best Practices

  1. Use background jobs to avoid blocking web requests
  2. Configure retry logic based on error type
  3. Store clean_markdown for RAG applications (preserves heading structure for chunking)
  4. Rate limit external crawling to be respectful
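Practice 3 works because `clean_markdown` preserves heading structure. A minimal sketch of heading-based chunking, in pure Ruby and operating on a sample string rather than a live crawl result:

```ruby
# Sample stand-in for page.clean_markdown
markdown = <<~MD
  # Getting Started
  Install the gem.
  ## Configuration
  Set your options.
MD

# Split before each H1/H2 heading, keeping the heading with its body
chunks = markdown.split(/(?=^\#{1,2} )/).reject(&:empty?)

chunks.map { |c| c.lines.first.strip }
# => ["# Getting Started", "## Configuration"]
```

Each chunk carries its own heading, so embeddings stay anchored to a meaningful section title.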

Production Deployment

Pre-deployment Checklist

  1. Ensure Chrome is installed on your production servers
  2. Run installer during deployment:
    bundle exec rake rubycrawl:install

Docker Example

FROM ruby:3.2

# Install Chrome
RUN apt-get update && apt-get install -y \
    chromium \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY Gemfile* ./
RUN bundle install

COPY . .
CMD ["rails", "server"]

Ferrum will detect chromium automatically. To specify a custom path:

RubyCrawl.configure(
  browser_options: { "browser-path": "/usr/bin/chromium" }
)

Architecture

RubyCrawl uses a single-process architecture:

RubyCrawl (public API)
  ↓
Browser (lib/rubycrawl/browser.rb)       ← Ferrum wrapper
  ↓
Ferrum::Browser                          ← Chrome DevTools Protocol (pure Ruby)
  ↓
Chromium                                 ← headless browser
  ↓
Readability.js → heuristic fallback      ← content extraction (inside browser)

  • Chrome launches once lazily and is reused across all crawls
  • Each crawl gets an isolated page context (own cookies/storage)
  • Content extraction runs inside the browser via page.evaluate():
    • Primary: Mozilla Readability.js — article-quality extraction for blogs, docs, news
    • Fallback: link-density heuristic — covers marketing pages, homepages, SPAs
  • result.metadata['extractor'] tells you which path was used ("readability" or "heuristic")
  • No separate processes, no HTTP boundary, no Node.js

Performance

  • Resource blocking: Set block_resources: true (the default is nil, meaning nothing is blocked) to skip images/fonts/CSS for 2-3x faster crawls
  • Wait strategy: Use wait_until: "load" for static sites, "networkidle" for SPAs
  • Browser reuse: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)

Parallelism

RubyCrawl does not support parallel page loading within a single process — Ferrum uses one Chrome instance and concurrent access is not thread-safe.

The recommended pattern is job-level parallelism: each background job gets its own RubyCrawl instance and Chrome process, with natural rate limiting via your job queue's concurrency setting:

# Enqueue independent crawls — each job runs its own Chrome
urls.each { |url| CrawlJob.perform_later(url) }

# Control concurrency via your queue worker config (Sidekiq, GoodJob, etc.)
# e.g. Sidekiq concurrency: 3 → 3 Chrome processes crawling in parallel

This also works naturally with respect_robots_txt: true — each job respects Crawl-delay independently.

Development

git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup

# Run all tests (Chrome required — Ferrum locates or downloads it automatically)
bundle exec rspec

# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
> RubyCrawl.crawl("https://example.com").clean_text
> RubyCrawl.crawl("https://example.com").clean_markdown

Contributing

Contributions are welcome! Please read our contribution guidelines first.

  • Simplicity over cleverness: Prefer clear, explicit code
  • Stability over speed: Correctness first, optimization second
  • Hide complexity: Users should never need to know Ferrum exists

License

The gem is available as open source under the terms of the MIT License.

Credits

Built with Ferrum — pure Ruby Chrome DevTools Protocol client.

Content extraction powered by Mozilla Readability.js — the algorithm behind Firefox Reader View.

Markdown conversion powered by reverse_markdown for GitHub-flavored output.

Support