Kreuzcrawl
High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 14 languages — same engine, identical results across every runtime.
Key Features
- Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
- Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
- Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
- 14 language bindings — Rust, Python, Node.js, TypeScript, Ruby, Go, Java, Kotlin (Android), C#, PHP, Elixir, Dart, Swift, Zig, and WebAssembly
- Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
- Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
- Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
- Streaming — Real-time crawl events via async streams for progress tracking
- Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
- Rate limiting — Per-domain request throttling with configurable delays
-
SSRF-safe by default — Refuses requests to loopback, private, link-local, and cloud-metadata addresses; opt out via env var or
CrawlConfig - Asset download — Download, deduplicate, and filter images, documents, and other linked assets
- MCP server — Model Context Protocol integration for AI agents
- REST API — HTTP server with OpenAPI spec
Substrate vs operational
kreuzcrawl ships the substrate: everything a developer needs to crawl a site end-to-end with no external service. Productization concerns (managed proxy pools, tuned WAF fingerprints, authenticated-session injection, scheduling/retry orchestration, billing) live in kreuzberg-cloud, the reference operational implementation.
In the substrate
HTML→Markdown engine, headless-Chrome fallback, kreuzcrawl::robots and kreuzcrawl::sitemap parsers (public — usable standalone), per-domain throttling, SSRF policy, baseline WAF classifier (TomlClassifier), crawl event stream, URL canonicalization, citations.
Extension points (inject your own impls)
| Trait | Baseline (kreuzcrawl) | Reference cloud impl |
|---|---|---|
traits::Frontier |
defaults::InMemoryFrontier |
NATS-backed frontier in kreuzberg-cloud/crates/crawl-traits/src/frontier.rs
|
traits::RateLimiter |
defaults::PerDomainThrottle |
AdaptiveRateLimiter (cloud) |
traits::CrawlStore |
defaults::NoopStore |
CloudCrawlStore (cloud) |
traits::EventEmitter |
defaults::NoopEmitter |
NatsEventEmitter (cloud) |
traits::ContentFilter |
defaults::NoopFilter |
LlmContentFilter (cloud) |
traits::CrawlCache |
defaults::NoopCache |
CloudCrawlCache (cloud) |
WafClassifier |
TomlClassifier (loads rules from disk) |
Tuned-fingerprint database (cloud) |
BypassProvider |
Built-in baseline | Managed bypass cluster (cloud) |
AntibotStrategy |
DefaultAntibotStrategy |
Custom strategies (cloud) |
Inject any impl via CrawlEngineBuilder::with_<trait>(Arc::new(MyImpl)).
Out of the box (cloud-only)
Managed paid IP rotation pool, premium per-site scraper presets (LinkedIn, Amazon, eBay, GitHub, …), authenticated-crawl session injection, visual-diff change detection, scheduled crawls + per-customer budgets, cost analytics.
Installation
| Language | Package | Install |
|---|---|---|
| Python | kreuzcrawl | pip install kreuzcrawl |
| Node.js | @kreuzberg/kreuzcrawl | npm install @kreuzberg/kreuzcrawl |
| Rust | kreuzcrawl | cargo add kreuzcrawl |
| Go | pkg.go.dev | go get github.com/kreuzberg-dev/kreuzcrawl/packages/go |
| Java | Maven Central | See README |
| C# | NuGet | dotnet add package Kreuzcrawl |
| Ruby | kreuzcrawl | gem install kreuzcrawl |
| PHP | kreuzberg-dev/kreuzcrawl | composer require kreuzberg-dev/kreuzcrawl |
| Elixir | kreuzcrawl | {:kreuzcrawl, "~> 0.3"} |
| WASM | @kreuzberg/kreuzcrawl-wasm | npm install @kreuzberg/kreuzcrawl-wasm |
| C FFI | GitHub Releases | C header + shared library |
| CLI | crates.io | cargo install kreuzcrawl-cli |
| CLI (Homebrew) | kreuzberg-dev/tap | brew install kreuzberg-dev/tap/kreuzcrawl |
Homebrew 6.0+ requires explicit trust for third-party taps. Run brew trust kreuzberg-dev/tap once before
installing from kreuzberg-dev/tap.
Quick Start
from kreuzcrawl import create_engine, scrape
engine = create_engine()
result = scrape(engine, "https://example.com")
print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";
const engine = createEngine();
const result = await scrape(engine, "https://example.com");
console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;
println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());engine, _ := kcrawl.CreateEngine()
result, _ := kcrawl.Scrape(engine, "https://example.com")
fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");
System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");
Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")
puts result.metadata.title
puts result.markdown.content
puts result.links.length$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");
echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")
IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))Platform Support
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | — |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
Guides
- Observability — Integrate with Jaeger, Prometheus, or your observability backend. Emit and analyze crawl spans, metrics, and traces with W3C TraceContext propagation.
- Antibot Strategy & Stealth — Detect and bypass WAF systems. Customize antibot behavior per vendor, enable stealth surfaces, and integrate third-party bypass providers.
Architecture
Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
│
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
│
Rust Core Engine (async, concurrent, SIMD-optimized)
│
├── HTTP Client (reqwest + tower middleware stack)
├── HTML Parser (html5ever + lol_html)
├── Markdown Converter (html-to-markdown-rs)
├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
├── Link Discovery (robots.txt, sitemaps, anchor analysis)
└── Browser Rendering (optional headless Chrome/Firefox)
Contributing
Contributions are welcome! See our Contributing Guide.
Part of Kreuzberg.dev
- Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
- Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
- kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
- Discord — community, roadmap, announcements.