# crawlr
A powerful, async Ruby web scraping framework designed for respectful and efficient data extraction. Built with modern Ruby practices, crawlr provides a clean API for scraping websites while respecting robots.txt, managing cookies, rotating proxies, and handling complex scraping scenarios.
## Features
- Async HTTP requests with configurable concurrency
- Robots.txt compliance with automatic parsing and rule enforcement
- Cookie management with automatic persistence across requests
- Proxy rotation with round-robin and random strategies
- Flexible selectors supporting both CSS and XPath
- Extensible hooks for request/response lifecycle events
- Built-in statistics and monitoring capabilities
- Respectful crawling with delays, depth limits, and visit tracking
- Thread-safe operations for parallel scraping
- Comprehensive logging with configurable levels
## Installation
Add this line to your application's Gemfile:

```ruby
gem 'crawlr'
```

And then execute:

```sh
$ bundle install
```

Or install it yourself as:

```sh
$ gem install crawlr
```
## Quick Start
```ruby
require 'crawlr'

# Create a collector with configuration
collector = Crawlr::Collector.new(
  max_depth: 3,
  max_parallelism: 5,
  random_delay: 1.0,
  timeout: 15
)

# Register callbacks for data extraction
collector.on_html(:css, '.article-title') do |node, context|
  puts "Found title: #{node.text.strip}"
end

collector.on_html(:css, 'a[href]') do |link, context|
  href = link['href']
  puts "Found link: #{href}" if href.start_with?('http')
end

# Start scraping
collector.visit('https://example.com')
```
## Usage Examples
### Basic Web Scraping
```ruby
collector = Crawlr::Collector.new
products = []

# Extract product information
collector.visit('https://shop.example.com/products') do |c|
  c.on_html(:css, '.product') do |product, ctx|
    data = {
      name: product.css('.product-name').text.strip,
      price: product.css('.price').text.strip,
      image: product.css('img')&.first&.[]('src')
    }
    products << data
  end
end

# do something with the collected data
```
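The collected hashes are plain Ruby data, so persisting them needs nothing from crawlr itself. A minimal follow-up sketch (the output filename is arbitrary):

```ruby
require 'json'

# Write the scraped products to disk; 'products.json' is just an example path.
File.write('products.json', JSON.pretty_generate(products))
puts "Saved #{products.size} products"
```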
### API Scraping with Pagination
```ruby
collector = Crawlr::Collector.new(
  max_parallelism: 10,
  timeout: 30
)

mu = Mutex.new
items = []

collector.on_xml(:css, 'item') do |item, _ctx|
  data = {
    id: item.css('id').text,
    title: item.css('title').text,
    published: item.css('published').text
  }
  mu.synchronize { items << data }
end

# Automatically handles pagination with ?page=1, ?page=2, etc.
collector.paginated_visit(
  'https://api.example.com/feed',
  batch_size: 5,
  start_page: 1
)
```
### Advanced Configuration
```ruby
collector = Crawlr::Collector.new(
  # Network settings
  timeout: 20,
  max_parallelism: 8,
  random_delay: 2.0,

  # Crawling behavior
  max_depth: 5,
  allow_url_revisit: false,
  max_visited: 50_000,

  # Proxy rotation
  proxies: ['proxy1.com:8080', 'proxy2.com:8080'],
  proxy_strategy: :round_robin,

  # Respectful crawling
  ignore_robots_txt: false,
  allow_cookies: true,

  # Error handling
  max_retries: 3,
  retry_delay: 1.0,
  retry_backoff: 2.0
)
```
### Domain Filtering
```ruby
# Allow specific domains
collector = Crawlr::Collector.new(
  allowed_domains: ['example.com', 'api.example.com']
)

# Or use glob patterns
collector = Crawlr::Collector.new(
  domain_glob: ['*.example.com', '*.trusted-site.*']
)
```
### Hooks for Custom Behavior
```ruby
# Add custom headers before each request
collector.hook(:before_visit) do |url, headers|
  headers['Authorization'] = "Bearer #{get_auth_token()}"
  headers['X-Custom-Header'] = 'MyBot/1.0'
  puts "Visiting: #{url}"
end

# Process responses after each request
collector.hook(:after_visit) do |url, response|
  puts "Got #{response.status} from #{url}"
end

# Handle errors gracefully
collector.hook(:on_error) do |url, error|
  puts "Failed to scrape #{url}: #{error.message}"
end
```
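The same `before_visit`/`after_visit` pair can double as lightweight instrumentation. The sketch below assumes only the hook signatures shown above and that more than one hook may be registered per event; if crawlr allows only one hook per event, fold the timing into your existing blocks:

```ruby
# Rough per-request timing built from the documented hook signatures.
timings = {}
timing_mutex = Mutex.new

collector.hook(:before_visit) do |url, _headers|
  timing_mutex.synchronize { timings[url] = Process.clock_gettime(Process::CLOCK_MONOTONIC) }
end

collector.hook(:after_visit) do |url, response|
  started = timing_mutex.synchronize { timings.delete(url) }
  elapsed = started && Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts "#{url} -> #{response.status}#{elapsed ? format(' (%.2fs)', elapsed) : ''}"
end
```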
### XPath Selectors
```ruby
# First three paragraphs inside the content div
collector.on_html(:xpath, '//div[@class="content"]//p[position() <= 3]') do |paragraph, ctx|
  puts paragraph.text.strip
end

# Titles of items priced above 100
collector.on_xml(:xpath, '//item[price > 100]/title') do |title, ctx|
  puts title.text
end
```
### Session Management with Cookies
```ruby
collector = Crawlr::Collector.new(allow_cookies: true)

# The first visit sets cookies for subsequent requests
collector.visit('https://site.com/login')
collector.visit('https://site.com/protected-content') # Uses login cookies
```
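Extraction from the authenticated pages then goes through the same callbacks as everywhere else; a small sketch in which the `.account-name` selector is made up for illustration:

```ruby
collector = Crawlr::Collector.new(allow_cookies: true)

# Hypothetical selector for a page behind the login.
collector.on_html(:css, '.account-name') do |node, _ctx|
  puts "Logged in as #{node.text.strip}"
end

collector.visit('https://site.com/login')              # establishes the session cookie
collector.visit('https://site.com/protected-content')  # sent with the cookie jar
```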
### Stats
```ruby
collector = Crawlr::Collector.new

# Get comprehensive statistics
stats = collector.stats
puts "Visited #{stats[:total_visits]} pages"
puts "Active callbacks: #{stats[:callbacks_count]}"
puts "Memory usage: #{stats[:visited_count]}/#{stats[:max_visited]} URLs tracked"
```
### Collector Cloning

Clone collectors for different tasks while sharing HTTP connections:

```ruby
product_scraper = collector.clone
product_scraper.on_html(:css, '.product') { |node, ctx| extract_product(node, ctx) }

review_scraper = collector.clone
review_scraper.on_html(:css, '.review') { |node, ctx| extract_review(node, ctx) }
```
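Each clone can then be pointed at its own part of the site; the URLs below are placeholders:

```ruby
# Run the specialised clones against separate entry points (example URLs).
product_scraper.visit('https://shop.example.com/products')
review_scraper.visit('https://shop.example.com/reviews')

puts "Product pages visited: #{product_scraper.stats[:total_visits]}"
puts "Review pages visited:  #{review_scraper.stats[:total_visits]}"
```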
## Architecture
crawlr is built with a modular architecture:
- Collector: Main orchestrator managing the scraping workflow
- HTTPInterface: Async HTTP client with proxy and cookie support
- Parser: Document parsing engine using Nokogiri
- Callbacks: Flexible callback system for data extraction
- Hooks: Event system for request/response lifecycle customization
- Config: Centralized configuration management
- Visits: Thread-safe URL deduplication and visit tracking
- Domains: Domain filtering and allowlist management
- Robots: Robots.txt parsing and compliance checking
## Respectful Scraping
crawlr is designed to be a responsible scraping framework:
- Robots.txt compliance: Automatically fetches and respects robots.txt rules
- Rate limiting: Built-in delays and concurrency controls
- User-Agent identification: Clear identification in requests
- Error handling: Graceful handling of failures without overwhelming servers
- Memory management: Automatic cleanup to prevent resource exhaustion
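In practice this mostly comes down to a few options from the table below; a conservative starting point might look like the following (the specific values are illustrative, not recommendations from the gem):

```ruby
# A deliberately gentle configuration for shared or rate-limited sites.
polite = Crawlr::Collector.new(
  max_parallelism: 2,       # keep concurrency low
  random_delay: 3.0,        # spread requests out
  max_depth: 2,             # stay close to the entry page
  ignore_robots_txt: false, # leave robots.txt enforcement on (the default)
  timeout: 10
)
```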
## Configuration Options
| Option | Default | Description |
|---|---|---|
| `timeout` | 10 | HTTP request timeout in seconds |
| `max_parallelism` | 1 | Maximum concurrent requests |
| `max_depth` | 0 | Maximum crawling depth (0 = unlimited) |
| `random_delay` | 0 | Maximum random delay between requests |
| `allow_url_revisit` | false | Allow revisiting previously scraped URLs |
| `max_visited` | 10,000 | Maximum URLs to track before cache reset |
| `allow_cookies` | false | Enable cookie jar management |
| `ignore_robots_txt` | false | Skip robots.txt checking |
| `max_retries` | nil | Maximum retry attempts (nil = disabled) |
| `retry_delay` | 1.0 | Base delay between retries |
| `retry_backoff` | 2.0 | Exponential backoff multiplier |
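With `max_retries`, `retry_delay`, and `retry_backoff` set, the usual reading of an exponential backoff multiplier is `retry_delay * retry_backoff**(attempt - 1)`. Whether crawlr computes the wait exactly this way is an assumption; the sketch only illustrates the growth implied by the defaults:

```ruby
retry_delay   = 1.0
retry_backoff = 2.0
max_retries   = 3

# Assumed schedule: 1.0s, 2.0s, 4.0s before retries 1..3.
(1..max_retries).each do |attempt|
  puts "retry #{attempt}: wait #{retry_delay * retry_backoff**(attempt - 1)}s"
end
```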
## Testing
Run the test suite:

```sh
bundle exec rspec
```

Run with coverage:

```sh
COVERAGE=true bundle exec rspec
```
## Documentation
Generate API documentation:

```sh
yard doc
```

View documentation:

```sh
yard server
```
## Contributing
- Fork it (https://github.com/aristorap/crawlr/fork)
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes with tests
- Ensure all tests pass (`bundle exec rspec`)
- Commit your changes (`git commit -am 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Create a new Pull Request
## License
This gem is available as open source under the terms of the MIT License.
## Acknowledgments
- Built with Nokogiri for HTML/XML parsing
- Uses Async for high-performance concurrency
- Inspired by Golang's Colly framework and modern Ruby practices
## Support
- Documentation: TBD
- Issue Tracker
Happy Scraping!