RubyCrawl
Production-ready web crawler for Ruby powered by Playwright – bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.
RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
Why RubyCrawl?
- Real browser – Handles JavaScript, AJAX, and SPAs correctly
- Zero config – Works out of the box, no Playwright knowledge needed
- Production-ready – Auto-retry, error handling, resource optimization
- Multi-page crawling – BFS algorithm with smart URL deduplication
- Rails-friendly – Generators, initializers, and ActiveJob integration
- Modular architecture – Clean, testable, maintainable codebase
Features
- Playwright-powered: Real browser automation for JavaScript-heavy sites and SPAs
- Production-ready: Designed for Rails apps and production environments with auto-retry and error handling
- Simple API: Clean, minimal Ruby interface – zero Playwright or Node.js knowledge required
- Resource optimization: Built-in resource blocking for 2-3x faster crawls
- Auto-managed browsers: Browser process reuse and automatic lifecycle management
- Content extraction: HTML, links (with metadata), and lazy-loaded Markdown conversion
- Multi-page crawling: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
- Smart URL handling: Automatic normalization, tracking parameter removal, and same-host filtering
- Rails integration: First-class Rails support with generators and initializers
- Modular design: Clean separation of concerns with focused, testable modules
Table of Contents
- Features
- Installation
- Quick Start
- Use Cases
- Usage
- Basic Crawling
- Multi-Page Crawling
- Configuration
- Result Object
- Error Handling
- Rails Integration
- Production Deployment
- Architecture
- Performance
- Development
- Project Structure
- Roadmap
- Contributing
- Why Choose RubyCrawl?
- License
- Support
Installation
Requirements
- Ruby >= 3.0
- Node.js LTS (v18+ recommended) – required for the bundled Playwright service
Add to Gemfile
gem "rubycrawl"Then install:
bundle installInstall Playwright browsers
After bundling, install the Playwright browsers:
bundle exec rake rubycrawl:installThis command:
- โ
Installs Node.js dependencies in the bundled
node/directory - โ Downloads Playwright browsers (Chromium, Firefox, WebKit) โ ~300MB download
- โ Creates a Rails initializer (if using Rails)
Note: You only need to run this once. The installation task is idempotent and safe to run multiple times.
Troubleshooting installation:
# If installation fails, check Node.js version
node --version # Should be v18+ LTS
# Enable verbose logging
RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
# Check installation status
cd node && npm list
Quick Start
require "rubycrawl"
# Simple crawl
result = RubyCrawl.crawl("https://example.com")
# Access extracted content
puts result.html # Raw HTML content
puts result.markdown # Converted to Markdown
puts result.links # Extracted links from the page
puts result.metadata # Status code, final URL, etc.
Use Cases
RubyCrawl is perfect for:
- Data aggregation: Crawl product catalogs, job listings, or news articles
- RAG applications: Build knowledge bases for LLM/AI applications by crawling documentation sites
- SEO analysis: Extract metadata, links, and content structure
- Content migration: Convert existing sites to Markdown for static site generators
- Testing: Verify deployed site structure and content
- Documentation scraping: Create local copies of documentation with preserved links
Usage
Basic Crawling
The simplest way to crawl a URL:
result = RubyCrawl.crawl("https://example.com")
# Access the results
result.html # => "<html>...</html>"
result.markdown # => "# Example Domain\n\nThis domain is..." (lazy-loaded)
result.links # => [{ "url" => "https://...", "text" => "More info" }, ...]
result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
result.text # => "" (coming soon)
Multi-Page Crawling
Crawl an entire site following links with BFS (breadth-first search):
# Crawl up to 100 pages, max 3 links deep
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
# Each page is yielded as it's crawled (streaming)
puts "Crawled: #{page.url} (depth: #{page.depth})"
# Save to database
Page.create!(
url: page.url,
html: page.html,
markdown: page.markdown,
depth: page.depth
)
end
Real-world example: Building a RAG knowledge base
# Crawl documentation site for AI/RAG application
require "rubycrawl"
RubyCrawl.configure(
wait_until: "networkidle", # Ensure JS content loads
block_resources: true # Skip images/fonts for speed
)
pages_crawled = RubyCrawl.crawl_site(
"https://docs.example.com",
max_pages: 500,
max_depth: 5,
same_host_only: true
) do |page|
# Store in vector database for RAG
VectorDB.upsert(
id: Digest::SHA256.hexdigest(page.url),
content: page.markdown, # Clean markdown for better embeddings
metadata: {
url: page.url,
title: page.metadata["title"],
depth: page.depth
}
)
puts "โ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
end
puts "Crawled #{pages_crawled} pages into knowledge base"Multi-Page Options
| Option | Default | Description |
|---|---|---|
max_pages |
50 | Maximum number of pages to crawl |
max_depth |
3 | Maximum link depth from start URL |
same_host_only |
true | Only follow links on the same domain |
wait_until |
inherited | Page load strategy |
block_resources |
inherited | Block images/fonts/CSS |
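For instance, these options can be combined to keep a crawl small and on-domain; a minimal sketch using only the options documented above (the URL is a placeholder):

```ruby
require "rubycrawl"

# Shallow, same-domain snapshot: at most 25 pages, two links deep
pages = []
RubyCrawl.crawl_site("https://example.com", max_pages: 25, max_depth: 2, same_host_only: true) do |page|
  pages << { url: page.url, depth: page.depth }
end

puts "Visited #{pages.size} pages (deepest level: #{pages.map { |p| p[:depth] }.max})"
```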
Page Result Object
The block receives a PageResult with:
page.url # String: Final URL after redirects
page.html # String: Full HTML content
page.markdown # String: Lazy-converted Markdown
page.links # Array: URLs extracted from page
page.metadata # Hash: HTTP status, final URL, etc.
page.depth # Integer: Link depth from start URL
Configuration
Global Configuration
Set default options that apply to all crawls:
RubyCrawl.configure(
wait_until: "networkidle", # Wait until network is idle
block_resources: true # Block images, fonts, CSS for speed
)
# All subsequent crawls use these defaults
result = RubyCrawl.crawl("https://example.com")Per-Request Options
Override defaults for specific requests:
# Use global defaults
result = RubyCrawl.crawl("https://example.com")
# Override for this request only
result = RubyCrawl.crawl(
"https://example.com",
wait_until: "domcontentloaded",
block_resources: false
)
Configuration Options
| Option | Values | Default | Description |
|---|---|---|---|
| wait_until | "load", "domcontentloaded", "networkidle" | "load" | When to consider the page loaded |
| block_resources | true, false | true | Block images, fonts, CSS, and media for faster crawls |
Wait strategies explained:
- load – Wait for the load event (good default for static sites)
- domcontentloaded – Wait for DOM ready only (fastest; scripts and images may still be loading)
- networkidle – Wait until there are no network requests for 500ms (slowest, best for SPAs)
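As a rough guide, pick the strategy per request based on how the site renders; a small sketch (the hosts are placeholders):

```ruby
# Server-rendered documentation page – DOM ready is usually enough
doc_page = RubyCrawl.crawl("https://example.com/docs", wait_until: "domcontentloaded")

# JavaScript-heavy SPA – wait until the network goes idle so client-rendered content is present
spa_page = RubyCrawl.crawl("https://app.example.com", wait_until: "networkidle")
```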
Advanced Usage
Session-Based Crawling
Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by crawl_site, but you can manage them manually for advanced use cases:
# Create a session (reusable browser context)
session_id = RubyCrawl.create_session
begin
# All crawls with this session_id share the same browser context
result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
# Browser state (cookies, localStorage) persists between crawls
ensure
# Always destroy session when done
RubyCrawl.destroy_session(session_id)
end
When to use sessions:
- Multiple sequential crawls to the same domain (better performance)
- Preserving cookies/state set by the site between page visits
- Avoiding browser context creation overhead
Important: Sessions are for performance optimization only. RubyCrawl is designed for crawling public websites. It does not provide authentication or login functionality for protected content.
Note: crawl_site automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
Session lifecycle:
- Sessions automatically expire after 30 minutes of inactivity
- Sessions are cleaned up every 5 minutes
- Always call destroy_session when done to free resources immediately
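When managing sessions by hand, a small wrapper keeps the create/destroy pairing in one place; a sketch (with_crawl_session is not part of the gem, just an illustration):

```ruby
# Hypothetical helper: guarantees the session is destroyed even if a crawl raises
def with_crawl_session
  session_id = RubyCrawl.create_session
  yield session_id
ensure
  RubyCrawl.destroy_session(session_id) if session_id
end

with_crawl_session do |session_id|
  home    = RubyCrawl.crawl("https://example.com/", session_id: session_id)
  pricing = RubyCrawl.crawl("https://example.com/pricing", session_id: session_id)
  puts [home.metadata["title"], pricing.metadata["title"]].inspect
end
```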
Result Object
The crawl result is a RubyCrawl::Result object with these attributes:
result = RubyCrawl.crawl("https://example.com")
result.html # String: Raw HTML content from page
result.markdown # String: Markdown conversion (lazy-loaded on first access)
result.links # Array: Extracted links with url and text
result.text # String: Plain text (coming soon)
result.metadata # Hash: Comprehensive metadata (see below)
Links Format
Links are extracted with full metadata:
result.links
# => [
# {
# "url" => "https://example.com/about",
# "text" => "About Us",
# "title" => "Learn more about us", # <a title="...">
# "rel" => nil # <a rel="nofollow">
# },
# {
# "url" => "https://example.com/contact",
# "text" => "Contact",
# "title" => null,
# "rel" => "nofollow"
# },
# ...
# ]
Note: URLs are automatically converted to absolute URLs by the browser, so relative links like /about become https://example.com/about.
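Because each link is a plain hash, ordinary Enumerable calls are enough for filtering; for example, to keep only same-host links that aren't marked nofollow (a sketch, not a gem API):

```ruby
require "uri"

host = URI.parse(result.metadata["final_url"]).host

followable_urls = result.links
  .reject { |link| link["rel"].to_s.include?("nofollow") }  # drop rel="nofollow" links
  .select { |link| URI.parse(link["url"]).host == host }     # stay on the same domain
  .map    { |link| link["url"] }
  .uniq
```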
Markdown Conversion
Markdown is lazy-loaded – conversion only happens when you access .markdown:
result = RubyCrawl.crawl(url)
result.html      # No overhead
result.markdown  # Conversion happens here (first call only)
result.markdown  # Cached, instant
Uses reverse_markdown with GitHub-flavored output.
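Because conversion is lazy and cached, it fits the content-migration use case well; a sketch that writes each crawled page's Markdown to disk for a static site generator (the URL and file layout are illustrative):

```ruby
require "fileutils"
require "uri"

RubyCrawl.crawl_site("https://docs.example.com", max_pages: 100) do |page|
  # Derive a filesystem-friendly slug from the URL path
  slug = URI.parse(page.url).path.gsub(%r{[^a-z0-9/_-]}i, "-").delete_prefix("/").chomp("/")
  slug = "index" if slug.empty?
  path = File.join("content", "#{slug}.md")

  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, "# #{page.metadata['title']}\n\n#{page.markdown}")
end
```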
Metadata Fields
The metadata hash includes HTTP and HTML metadata:
result.metadata
# => {
# "status" => 200, # HTTP status code
# "final_url" => "https://...", # Final URL after redirects
# "title" => "Page Title", # <title> tag
# "description" => "...", # Meta description
# "keywords" => "ruby, web", # Meta keywords
# "author" => "Author Name", # Meta author
# "og_title" => "...", # Open Graph title
# "og_description" => "...", # Open Graph description
# "og_image" => "https://...", # Open Graph image
# "og_url" => "https://...", # Open Graph URL
# "og_type" => "website", # Open Graph type
# "twitter_card" => "summary", # Twitter card type
# "twitter_title" => "...", # Twitter title
# "twitter_description" => "...", # Twitter description
# "twitter_image" => "https://...",# Twitter image
# "canonical" => "https://...", # Canonical URL
# "lang" => "en", # Page language
# "charset" => "UTF-8" # Character encoding
# }
Note: All HTML metadata fields may be nil if not present on the page.
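The metadata hash also makes quick SEO-style checks straightforward; a small sketch (the specific checks are just examples):

```ruby
result = RubyCrawl.crawl("https://example.com")
meta = result.metadata

issues = []
issues << "missing <title>"          if meta["title"].to_s.empty?
issues << "missing meta description" if meta["description"].to_s.empty?
issues << "missing canonical URL"    if meta["canonical"].to_s.empty?
issues << "missing og:image"         if meta["og_image"].to_s.empty?

puts issues.empty? ? "No obvious SEO issues" : "Issues found: #{issues.join(', ')}"
```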
Error Handling
RubyCrawl provides specific exception classes for different error scenarios:
begin
result = RubyCrawl.crawl(url)
rescue RubyCrawl::ConfigurationError => e
# Invalid URL or configuration
puts "Configuration error: #{e.message}"
rescue RubyCrawl::TimeoutError => e
# Page load timeout or network timeout
puts "Timeout: #{e.message}"
rescue RubyCrawl::NavigationError => e
# Page navigation failed (404, DNS error, SSL error, etc.)
puts "Navigation failed: #{e.message}"
rescue RubyCrawl::ServiceError => e
# Node service unavailable or crashed
puts "Service error: #{e.message}"
rescue RubyCrawl::Error => e
# Catch-all for any RubyCrawl error
puts "Crawl error: #{e.message}"
end
Exception Hierarchy:
- RubyCrawl::Error (base class)
  - RubyCrawl::ConfigurationError – Invalid URL or configuration
  - RubyCrawl::TimeoutError – Timeout during crawl
  - RubyCrawl::NavigationError – Page navigation failed
  - RubyCrawl::ServiceError – Node service issues
Automatic Retry: RubyCrawl automatically retries transient failures (service errors, timeouts) up to 3 times with exponential backoff (2s, 4s, 8s). Configure with:
RubyCrawl.configure(max_retries: 5)
# or per-request
RubyCrawl.crawl(url, retries: 1) # Disable retry
Rails Integration
Installation
Run the installer in your Rails app:
bundle exec rake rubycrawl:install
This creates config/initializers/rubycrawl.rb:
# frozen_string_literal: true
# rubycrawl default configuration
RubyCrawl.configure(
wait_until: "load",
block_resources: true
)
Usage in Rails
# In a controller, service, or background job
class ContentScraperJob < ApplicationJob
def perform(url)
result = RubyCrawl.crawl(url)
# Save to database
ScrapedContent.create!(
url: url,
html: result.html,
      status: result.metadata["status"]
)
end
end
Production Deployment
Pre-deployment Checklist
- Install Node.js on your production servers (LTS version recommended)
- Run the installer during deployment: bundle exec rake rubycrawl:install
- Set environment variables (optional):
  export RUBYCRAWL_NODE_BIN=/usr/bin/node          # Custom Node.js path
  export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
Docker Example
FROM ruby:3.2
# Install Node.js LTS
RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
&& apt-get install -y nodejs
# Install system dependencies for Playwright
RUN npx playwright install-deps
WORKDIR /app
COPY Gemfile* ./
RUN bundle install
# Install Playwright browsers
RUN bundle exec rake rubycrawl:install
COPY . .
CMD ["rails", "server"]Heroku Deployment
Add the Node.js buildpack:
heroku buildpacks:add heroku/nodejs
heroku buildpacks:add heroku/ruby
Add to package.json in your Rails root:
{
"engines": {
"node": "18.x"
}
}
How It Works
RubyCrawl uses a simple architecture:
- Ruby Gem provides the public API and handles orchestration
- Node.js Service (bundled, auto-started) manages Playwright browsers
- Communication via HTTP/JSON on localhost
This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.
Performance Tips
- Resource blocking: Keep block_resources: true (the default) for 2-3x faster crawls when you don't need images/CSS
- Wait strategy: Use wait_until: "load" for static sites, "networkidle" for SPAs
- Concurrency: Use background jobs (Sidekiq, etc.) for parallel crawling
- Browser reuse: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)
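For the concurrency tip above, the usual pattern is one URL per background job so your queue provides the parallelism; a sketch using ActiveJob (the job and model names are illustrative):

```ruby
class CrawlPageJob < ApplicationJob
  queue_as :crawling

  def perform(url)
    result = RubyCrawl.crawl(url)
    CrawledPage.create!(url: url, markdown: result.markdown, status: result.metadata["status"])
  rescue RubyCrawl::NavigationError => e
    Rails.logger.warn("Skipping #{url}: #{e.message}")
  end
end

# Fan out a batch of URLs; each crawl runs in its own worker
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
urls.each { |url| CrawlPageJob.perform_later(url) }
```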
Development
Want to contribute? Check out the contributor guidelines.
# Setup
git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup
# Run tests
bundle exec rspec
# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
Roadmap
Long-term (v1.0.0)
Maturity Goals:
- Production battle-tested (1000+ stars, real-world usage)
- Full documentation with video tutorials
- Performance benchmarks vs. alternatives
- Migration guides from Nokogiri, Mechanize, etc.
Contributing
Contributions are welcome! Please read our contribution guidelines first.
Development Philosophy
- Simplicity over cleverness: Prefer clear, explicit code
- Stability over speed: Correctness first, optimization second
- Ruby-first: Hide Node.js/Playwright complexity from users
- No vendor lock-in: Pure open source, no SaaS dependencies
Why Choose RubyCrawl?
RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
Built for Ruby Developers
- Idiomatic Ruby API – Feels natural to Rubyists, no need to learn Playwright
- Rails-first design – Generators, initializers, and ActiveJob integration out of the box
- Modular architecture – Clean, testable code following Ruby best practices
Production-Grade Reliability
- Automatic retry with exponential backoff for transient failures
- Smart error handling with custom exception hierarchy
- Process isolation – Browser crashes don't affect your Ruby application
- Battle-tested – Built on Playwright's proven browser automation
Developer Experience
- Zero configuration – Works immediately after installation
- Lazy loading – Markdown conversion only when you need it
- Smart URL handling – Automatic normalization and deduplication
- Comprehensive docs – Clear examples for common use cases
Rich Feature Set
- JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
- Multi-page crawling with BFS algorithm
- Link extraction with metadata (url, text, title, rel)
- Markdown conversion (GitHub-flavored)
- Metadata extraction (OG tags, Twitter cards, etc.)
- Resource blocking for 2-3x performance boost
Perfect for Modern Use Cases
- RAG applications – Build AI knowledge bases from documentation
- Data aggregation – Extract structured data from multiple pages
- Content migration – Convert sites to Markdown for static generators
- SEO analysis – Extract metadata and link structures
- Testing – Verify deployed site content and structure
License
The gem is available as open source under the terms of the MIT License.
Credits
Built with Playwright by Microsoft – the industry-standard browser automation framework.
Powered by reverse_markdown for GitHub-flavored Markdown conversion.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: ganesh.navale@zohomail.in
Acknowledgments
Special thanks to:
- Microsoft Playwright team for the robust, production-grade browser automation framework
- The Ruby community for building an ecosystem that values developer happiness and code clarity
- The Node.js community for excellent tooling and libraries that make cross-language integration seamless
- Open source contributors worldwide who make projects like this possible