RubyCrawl 🎭
Production-ready web crawler for Ruby powered by Ferrum — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
RubyCrawl provides accurate, JavaScript-enabled web scraping using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
Why RubyCrawl?
- ✅ Real browser — Handles JavaScript, AJAX, and SPAs correctly
- ✅ Pure Ruby — No Node.js, no npm, no external processes to manage
- ✅ Zero config — Works out of the box, no Ferrum knowledge needed
- ✅ Production-ready — Auto-retry, error handling, resource optimization
- ✅ Multi-page crawling — BFS algorithm with smart URL deduplication
- ✅ Rails-friendly — Generators, initializers, and ActiveJob integration
- ✅ Readability-powered — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages
```ruby
# One line to crawl any JavaScript-heavy site
result = RubyCrawl.crawl("https://docs.example.com")

result.html           # Full HTML with JS rendered
result.clean_text     # Noise-stripped plain text (no nav/footer/ads)
result.clean_markdown # Markdown ready for RAG pipelines
result.links          # All links with url, text, title, rel
result.metadata       # Title, description, OG tags, etc.
```
Features
- Pure Ruby: Ferrum drives Chromium directly via CDP — no Node.js or npm required
- Production-ready: Designed for Rails apps with auto-retry and exponential backoff
- Simple API: Clean Ruby interface — zero Ferrum or CDP knowledge required
- Resource optimization: Built-in resource blocking for 2-3x faster crawls
- Auto-managed browsers: Lazy Chrome singleton, isolated page per crawl
- Content extraction: Mozilla Readability.js (primary) + link-density heuristic (fallback) — article-quality `clean_html`, `clean_text`, and `clean_markdown`, plus links and metadata
- Multi-page crawling: BFS crawler with configurable depth limits and URL deduplication
- Smart URL handling: Automatic normalization, tracking parameter removal, same-host filtering
- Rails integration: First-class Rails support with generators and initializers
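The "smart URL handling" above comes down to a few normalization steps: lowercase the host, drop the fragment, strip tracking parameters, and canonicalize the trailing slash. A minimal sketch of the idea in plain Ruby, using only the standard library (`normalize_url` and the parameter list are illustrative, not the gem's API):

```ruby
require "uri"

# Tracking parameters commonly stripped during URL normalization
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content gclid fbclid].freeze

# Normalize a URL for deduplication: lowercase the host, drop the
# fragment, remove tracking parameters, and trim a trailing slash.
def normalize_url(raw)
  uri = URI.parse(raw)
  uri.host = uri.host&.downcase
  uri.fragment = nil
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.path = uri.path.chomp("/") unless uri.path == "/"
  uri.to_s
end
```

Running two superficially different URLs for the same page through a function like this lets a crawler recognize them as duplicates.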
Table of Contents
- Installation
- Quick Start
- Use Cases
- Usage
- Basic Crawling
- Multi-Page Crawling
- Configuration
- Result Object
- Error Handling
- Rails Integration
- Production Deployment
- Architecture
- Performance
- Development
- Contributing
- License
Installation
Requirements
- Ruby >= 3.0
- Chrome or Chromium — managed automatically by Ferrum (downloaded on first use)
Add to Gemfile
```ruby
gem "rubycrawl"
```
Then install:
```shell
bundle install
```
Install Chrome
Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
```shell
bundle exec rake rubycrawl:install
```
This command:
- ✅ Checks for Chrome/Chromium in your PATH
- ✅ Creates a Rails initializer (if using Rails)
Note: If Chrome is not in your PATH, install it via your system package manager or download from google.com/chrome.
Quick Start
```ruby
require "rubycrawl"

# Simple crawl
result = RubyCrawl.crawl("https://example.com")

# Access extracted content
result.final_url             # Final URL after redirects
result.clean_text            # Noise-stripped plain text (no nav/footer/ads)
result.clean_html            # Noise-stripped HTML (same noise removed as clean_text)
result.raw_text              # Full body.innerText (unfiltered)
result.html                  # Full raw HTML content
result.links                 # Extracted links with url, text, title, rel
result.metadata              # Title, description, OG tags, etc.
result.metadata['extractor'] # "readability" or "heuristic" — which extractor ran
result.clean_markdown        # Markdown converted from clean_html (lazy — first access only)
```
Use Cases
RubyCrawl is perfect for:
- RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
- Data aggregation: Crawl product catalogs, job listings, or news articles
- SEO analysis: Extract metadata, links, and content structure
- Content migration: Convert existing sites to Markdown for static site generators
- Documentation scraping: Create local copies of documentation with preserved links
Usage
Basic Crawling
```ruby
result = RubyCrawl.crawl("https://example.com")

result.html       # => "<html>...</html>"
result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
result.raw_text   # => "Example Domain\nThis domain is..." (full body text)
result.metadata   # => { "final_url" => "https://example.com", "title" => "..." }
```
Multi-Page Crawling
Crawl an entire site following links with BFS (breadth-first search):
```ruby
# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
  # Each page is yielded as it's crawled (streaming)
  puts "Crawled: #{page.url} (depth: #{page.depth})"

  # Save to database
  Page.create!(
    url: page.url,
    html: page.html,
    markdown: page.clean_markdown,
    depth: page.depth
  )
end
```
Real-world example: Building a RAG knowledge base
```ruby
require "rubycrawl"

RubyCrawl.configure(
  wait_until: "networkidle", # Ensure JS content loads
  block_resources: true      # Skip images/fonts for speed
)

pages_crawled = RubyCrawl.crawl_site(
  "https://docs.example.com",
  max_pages: 500,
  max_depth: 5,
  same_host_only: true
) do |page|
  VectorDB.upsert(
    id: Digest::SHA256.hexdigest(page.url),
    content: page.clean_markdown,
    metadata: {
      url: page.url,
      title: page.metadata["title"],
      depth: page.depth
    }
  )
end

puts "Indexed #{pages_crawled} pages"
```
Multi-Page Options
| Option | Default | Description |
|---|---|---|
| max_pages | 50 | Maximum number of pages to crawl |
| max_depth | 3 | Maximum link depth from start URL |
| same_host_only | true | Only follow links on the same domain |
| wait_until | inherited | Page load strategy |
| block_resources | inherited | Block images/fonts/CSS |
| respect_robots_txt | false | Honour robots.txt rules and auto-sleep Crawl-delay |
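The BFS traversal with a page budget, depth limit, and URL deduplication can be sketched in a few lines of plain Ruby. This toy version walks an in-memory link map instead of live pages, so the names and structure are illustrative, not the gem's internals:

```ruby
require "set"

# Breadth-first traversal with depth limit, page budget, and URL dedup.
# `links` maps each URL to the URLs it links to, standing in for real pages.
def bfs_crawl(start_url, links, max_pages: 50, max_depth: 3)
  visited = Set.new([start_url])
  queue   = [[start_url, 0]]
  crawled = []

  until queue.empty? || crawled.size >= max_pages
    url, depth = queue.shift
    crawled << url
    yield url, depth if block_given?
    next if depth >= max_depth

    (links[url] || []).each do |link|
      # Set#add? returns nil when the URL was already seen
      queue << [link, depth + 1] if visited.add?(link)
    end
  end
  crawled
end
```

Depth 0 is the start URL, and a page's links are only enqueued while its depth is below max_depth, which is why max_depth: 3 bounds how far the crawl wanders from the entry point.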
robots.txt Support
When respect_robots_txt: true, RubyCrawl fetches robots.txt once at the start of the crawl and:
- Skips any URL disallowed for `User-agent: *`
- Automatically sleeps the `Crawl-delay` specified in robots.txt between pages
```ruby
RubyCrawl.crawl_site("https://example.com",
  respect_robots_txt: true,
  max_pages: 100
) do |page|
  puts page.url
end
```
Or enable globally:
```ruby
RubyCrawl.configure(respect_robots_txt: true)
```
If robots.txt is unreachable or missing, crawling proceeds normally (fail open).
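Honouring robots.txt means extracting the Disallow prefixes and Crawl-delay for the wildcard agent group. A minimal sketch of such a parser in plain Ruby (illustrative only; the gem's actual parsing may differ):

```ruby
# Parse the `User-agent: *` group of a robots.txt body into
# disallowed path prefixes and an optional crawl delay (seconds).
def parse_robots(body)
  rules  = { disallow: [], crawl_delay: nil }
  active = false
  body.each_line do |line|
    field, _, value = line.split("#", 2).first.to_s.strip.partition(":")
    value = value.strip
    case field.downcase
    when "user-agent"  then active = (value == "*")
    when "disallow"    then rules[:disallow] << value if active && !value.empty?
    when "crawl-delay" then rules[:crawl_delay] = value.to_f if active
    end
  end
  rules
end

# A path is allowed when it matches no disallowed prefix
def allowed?(rules, path)
  rules[:disallow].none? { |prefix| path.start_with?(prefix) }
end
```

A crawler would then skip URLs failing `allowed?` and `sleep(rules[:crawl_delay])` between pages when a delay is present, which is the fail-open behavior described above when no robots.txt exists (empty rules allow everything).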
Page Result Object
The block receives a PageResult with:
```ruby
page.url            # String: Final URL after redirects
page.html           # String: Full raw HTML content
page.clean_html     # String: Noise-stripped HTML (no nav/header/footer/ads)
page.clean_text     # String: Noise-stripped plain text (derived from clean_html)
page.raw_text       # String: Full body.innerText (unfiltered)
page.clean_markdown # String: Lazy-converted Markdown from clean_html
page.links          # Array: URLs extracted from page
page.metadata       # Hash: final_url, title, OG tags, etc.
page.depth          # Integer: Link depth from start URL
```
Configuration
Global Configuration
```ruby
RubyCrawl.configure(
  wait_until: "networkidle",
  block_resources: true,
  timeout: 60,
  headless: true
)

# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")
```
Per-Request Options
```ruby
# Use global defaults
result = RubyCrawl.crawl("https://example.com")

# Override for this request only
result = RubyCrawl.crawl(
  "https://example.com",
  wait_until: "domcontentloaded",
  block_resources: false
)
```
Configuration Options
| Option | Values | Default | Description |
|---|---|---|---|
| wait_until | "load", "domcontentloaded", "networkidle", "commit" | nil | When to consider page loaded (nil = Ferrum default) |
| block_resources | true, false | nil | Block images, fonts, CSS, media for faster crawls |
| max_attempts | Integer | 3 | Total number of attempts (including the first) |
| timeout | Integer (seconds) | 30 | Browser navigation timeout |
| headless | true, false | true | Run Chrome headlessly |
| respect_robots_txt | true, false | false | Honour robots.txt rules and auto-sleep Crawl-delay |
Wait strategies explained:
- `load` — Wait for the load event (good for static sites)
- `domcontentloaded` — Wait for DOM ready (faster)
- `networkidle` — Wait until no network requests for 500ms (best for SPAs)
- `commit` — Wait until the first response bytes are received (fastest)
Result Object
```ruby
result = RubyCrawl.crawl("https://example.com")

result.html           # String: Full raw HTML
result.clean_html     # String: Noise-stripped HTML (nav/header/footer/ads removed)
result.clean_text     # String: Plain text derived from clean_html — ideal for RAG
result.raw_text       # String: Full body.innerText (unfiltered)
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
result.links          # Array: Extracted links with url/text/title/rel
result.metadata       # Hash: See below
result.final_url      # String: Shortcut for metadata['final_url']
```
Links Format
```ruby
result.links
# => [
#   { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
#   { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
# ]
```
URLs are automatically resolved to absolute form by the browser.
Markdown Conversion
Markdown is lazy — conversion only happens on first access of .clean_markdown:
```ruby
result.clean_html     # ✅ Already available, no overhead
result.clean_markdown # Converts clean_html → Markdown here (first call only)
result.clean_markdown # ✅ Cached, instant on subsequent calls
```
Uses reverse_markdown with GitHub-flavored output.
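This lazy behavior is the classic Ruby memoization pattern: compute on first access, cache thereafter. A sketch of the idea (the `LazyResult` class and its trivial conversion are illustrative, not the gem's source):

```ruby
# A tiny stand-in result object demonstrating lazy, memoized conversion.
class LazyResult
  attr_reader :conversions

  def initialize(clean_html)
    @clean_html  = clean_html
    @conversions = 0
  end

  def clean_markdown
    # ||= runs the conversion only on the first call;
    # later calls return the cached value
    @clean_markdown ||= convert(@clean_html)
  end

  private

  def convert(html)
    @conversions += 1
    html.gsub(%r{</?p>}, "").strip # placeholder for a real HTML→Markdown pass
  end
end
```

One caveat of `||=` memoization: it re-runs the computation if the result is legitimately nil or false; a `defined?(@clean_markdown)` guard avoids that when empty results are possible.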
Metadata Fields
```ruby
result.metadata
# => {
#   "final_url" => "https://example.com",
#   "title" => "Page Title",
#   "description" => "...",
#   "keywords" => "ruby, web",
#   "author" => "Author Name",
#   "og_title" => "...",
#   "og_description" => "...",
#   "og_image" => "https://...",
#   "og_url" => "https://...",
#   "og_type" => "website",
#   "twitter_card" => "summary",
#   "twitter_title" => "...",
#   "twitter_description" => "...",
#   "twitter_image" => "https://...",
#   "canonical" => "https://...",
#   "lang" => "en",
#   "charset" => "UTF-8",
#   "extractor" => "readability" # or "heuristic"
# }
```
Error Handling
```ruby
begin
  result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
  # Invalid URL or option value
rescue RubyCrawl::TimeoutError => e
  # Page load timed out
rescue RubyCrawl::NavigationError => e
  # Navigation failed (404, DNS error, SSL error)
rescue RubyCrawl::ServiceError => e
  # Browser failed to start or crashed
rescue RubyCrawl::Error => e
  # Catch-all for any RubyCrawl error
end
```
Exception Hierarchy:
```
RubyCrawl::Error
├── ConfigurationError — invalid URL or option value
├── TimeoutError — page load timed out
├── NavigationError — navigation failed (HTTP error, DNS, SSL)
└── ServiceError — browser failed to start or crashed
```
Automatic Retry: ServiceError and TimeoutError are retried with exponential backoff. NavigationError and ConfigurationError are not retried (they won't succeed on retry).
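That policy, retrying transient failures with exponential backoff while failing fast on permanent ones, can be sketched in plain Ruby (illustrative only; `TransientError`, the helper name, and the delay schedule are assumptions, not the gem's internals):

```ruby
# A stand-in for a transient failure class such as a timeout
class TransientError < StandardError; end

# Retry the block up to max_attempts times, sleeping base * 2**n between
# tries. Only errors in `retryable` are retried; anything else raises at once.
def with_retries(max_attempts: 3, base: 0.5, retryable: [TransientError])
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *retryable
    raise if attempt >= max_attempts
    sleep(base * 2**(attempt - 1)) # 0.5s, 1s, 2s, ...
    retry
  end
end
```

The `retryable` whitelist is the key design choice: a misconfigured URL fails identically on every attempt, so retrying it only wastes time, while a crashed browser or timeout often succeeds on the next try.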
```ruby
RubyCrawl.configure(max_attempts: 5)  # 5 total attempts
RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
```
Rails Integration
Installation
```shell
bundle exec rake rubycrawl:install
```
This creates config/initializers/rubycrawl.rb:
```ruby
RubyCrawl.configure(
  wait_until: "load",
  block_resources: true
)
```
Usage in Rails
Background Jobs with ActiveJob
```ruby
class CrawlPageJob < ApplicationJob
  queue_as :crawlers

  retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
  retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
  discard_on RubyCrawl::ConfigurationError

  def perform(url)
    result = RubyCrawl.crawl(url)
    Page.create!(
      url: result.final_url,
      title: result.metadata['title'],
      content: result.clean_text,
      markdown: result.clean_markdown,
      crawled_at: Time.current
    )
  end
end
```
Multi-page RAG knowledge base:
```ruby
class BuildKnowledgeBaseJob < ApplicationJob
  queue_as :crawlers

  def perform(documentation_url)
    RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
      embedding = OpenAI.embed(page.clean_markdown)
      Document.create!(
        url: page.url,
        title: page.metadata['title'],
        content: page.clean_markdown,
        embedding: embedding,
        depth: page.depth
      )
    end
  end
end
```
Best Practices
- Use background jobs to avoid blocking web requests
- Configure retry logic based on error type
- Store `clean_markdown` for RAG applications (preserves heading structure for chunking)
- Rate limit external crawling to be respectful
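Rate limiting can be as simple as enforcing a minimum interval between requests to the same host. A small sketch (illustrative, not part of the gem):

```ruby
# Enforce a minimum interval between calls per host by sleeping
# for whatever remains of the interval since the last call.
class RateLimiter
  def initialize(min_interval)
    @min_interval = min_interval
    @last_call    = {}
  end

  def throttle(host)
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if (last = @last_call[host])
      wait = @min_interval - (now - last)
      sleep(wait) if wait.positive?
    end
    @last_call[host] = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

Calling something like `limiter.throttle(URI(url).host)` before each crawl in a loop keeps traffic to any one site polite, while different hosts are never delayed by each other. A monotonic clock is used so wall-clock adjustments cannot skew the interval.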
Production Deployment
Pre-deployment Checklist
- Ensure Chrome is installed on your production servers
- Run the installer during deployment: `bundle exec rake rubycrawl:install`
Docker Example
```dockerfile
FROM ruby:3.2

# Install Chrome
RUN apt-get update && apt-get install -y \
    chromium \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY Gemfile* ./
RUN bundle install
COPY . .

CMD ["rails", "server"]
```
Ferrum will detect chromium automatically. To specify a custom path:
```ruby
RubyCrawl.configure(
  browser_options: { "browser-path": "/usr/bin/chromium" }
)
```
Architecture
RubyCrawl uses a single-process architecture:
```
RubyCrawl (public API)
        ↓
Browser (lib/rubycrawl/browser.rb)  ← Ferrum wrapper
        ↓
Ferrum::Browser                     ← Chrome DevTools Protocol (pure Ruby)
        ↓
Chromium                            ← headless browser
        ↓
Readability.js → heuristic fallback ← content extraction (inside browser)
```
- Chrome launches once lazily and is reused across all crawls
- Each crawl gets an isolated page context (own cookies/storage)
- Content extraction runs inside the browser via `page.evaluate()`:
  - Primary: Mozilla Readability.js — article-quality extraction for blogs, docs, news
  - Fallback: link-density heuristic — covers marketing pages, homepages, SPAs
- `result.metadata['extractor']` tells you which path was used (`"readability"` or `"heuristic"`)
- No separate processes, no HTTP boundary, no Node.js
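The core idea behind a link-density heuristic is that blocks whose visible text is mostly link text are probably navigation, not content. A toy approximation in plain Ruby, using regexes instead of a real DOM (this is a sketch of the general technique, not the gem's actual extractor):

```ruby
# Score an HTML block by link density: the share of its visible text
# that sits inside <a> tags. High-density blocks read as navigation.
def link_density(block_html)
  text      = block_html.gsub(/<[^>]+>/, " ").squeeze(" ").strip
  link_text = block_html.scan(%r{<a\b[^>]*>(.*?)</a>}mi).flatten.join(" ")
  return 1.0 if text.empty?
  link_text.length.to_f / text.length
end

# Blocks below the density threshold are treated as real content
def content_block?(block_html, threshold: 0.5)
  link_density(block_html) < threshold
end
```

A nav bar made entirely of links scores near 1.0 and gets dropped, while a paragraph containing one inline link scores low and survives, which is how nav, footers, and link farms are filtered out without site-specific rules.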
Performance
- Resource blocking: Set `block_resources: true` to skip images/fonts/CSS for 2-3x faster crawls (the default is nil, i.e. nothing blocked)
- Wait strategy: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
- Browser reuse: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
Parallelism
RubyCrawl does not support parallel page loading within a single process — Ferrum uses one Chrome instance and concurrent access is not thread-safe.
The recommended pattern is job-level parallelism: each background job gets its own RubyCrawl instance and Chrome process, with natural rate limiting via your job queue's concurrency setting:
```ruby
# Enqueue independent crawls — each job runs its own Chrome
urls.each { |url| CrawlJob.perform_later(url) }

# Control concurrency via your queue worker config (Sidekiq, GoodJob, etc.)
# e.g. Sidekiq concurrency: 3 → 3 Chrome processes crawling in parallel
```
This also works naturally with respect_robots_txt: true — each job respects Crawl-delay independently.
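The queue-worker pattern can be simulated in-process with a small pool in which every worker owns its own crawler instance, mirroring N queue workers each running their own Chrome. A sketch with Ruby threads and a Queue (the crawler stand-in is a placeholder, not the gem):

```ruby
# Queue-driven parallelism: N workers, each with its own crawler instance,
# drain a shared URL queue — mirroring N queue workers each running Chrome.
def crawl_in_parallel(urls, workers: 3)
  queue = Queue.new
  urls.each { |u| queue << u }
  workers.times { queue << :done } # one stop signal per worker
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      crawler = Object.new # stand-in for this worker's own crawler/Chrome pair
      while (url = queue.pop) != :done
        results << url.upcase # stand-in for crawling the page
      end
    end
  end
  threads.each(&:join)
  Array.new(results.size) { results.pop }
end
```

The point of the structure is that no crawler instance is ever shared across threads; concurrency is bounded by the worker count, exactly as a Sidekiq concurrency setting bounds the number of simultaneous Chrome processes.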
Development
```shell
git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup

# Run all tests (Chrome required)
bundle exec rspec

# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
> RubyCrawl.crawl("https://example.com").clean_text
> RubyCrawl.crawl("https://example.com").clean_markdown
```
Contributing
Contributions are welcome! Please read our contribution guidelines first.
- Simplicity over cleverness: Prefer clear, explicit code
- Stability over speed: Correctness first, optimization second
- Hide complexity: Users should never need to know Ferrum exists
License
The gem is available as open source under the terms of the MIT License.
Credits
Built with Ferrum — pure Ruby Chrome DevTools Protocol client.
Content extraction powered by Mozilla Readability.js — the algorithm behind Firefox Reader View.
Markdown conversion powered by reverse_markdown for GitHub-flavored output.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: ganesh.navale@zohomail.in