Kreuzcrawl

High-performance Rust web crawling engine for structured data extraction. Scrape, crawl, and map websites with native bindings for 14 languages — same engine, identical results across every runtime.

Key Features

Structured extraction — Text, metadata, links, images, assets, JSON-LD, Open Graph, hreflang, favicons, headings, and response headers
Markdown conversion — Clean Markdown output with citations, document structure, and fit-content mode
Concurrent crawling — Depth-first, breadth-first, or best-first traversal with configurable depth, page limits, and concurrency
14 language bindings — Rust, Python, Node.js/TypeScript, Ruby, Go, Java, Kotlin (Android), C#, PHP, Elixir, Dart, Swift, Zig, and WebAssembly
Smart filtering — BM25 relevance scoring, URL include/exclude patterns, robots.txt compliance, and sitemap discovery
Browser rendering — Optional headless browser for JavaScript-heavy SPAs with WAF detection and bypass
Batch operations — Scrape or crawl hundreds of URLs concurrently with partial failure handling
Streaming — Real-time crawl events via async streams for progress tracking
Authentication — HTTP Basic, Bearer token, and custom header auth with persistent cookie jars
Rate limiting — Per-domain request throttling with configurable delays
SSRF-safe by default — Refuses requests to loopback, private, link-local, and cloud-metadata addresses; opt out via env var or CrawlConfig
Asset download — Download, deduplicate, and filter images, documents, and other linked assets
MCP server — Model Context Protocol integration for AI agents
REST API — HTTP server with OpenAPI spec
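
The SSRF-safe default in the list above amounts to an address check before a request is sent. The Rust sketch below shows one way such a check could look using only the standard library. `is_blocked_ip` is a hypothetical helper written for illustration, not part of the kreuzcrawl API, and the exact ranges kreuzcrawl refuses may differ.

```rust
use std::net::IpAddr;

/// Returns true when `ip` is in a range an SSRF-safe crawler would refuse.
/// Hypothetical helper for illustration; not the kreuzcrawl implementation.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback()           // 127.0.0.0/8
                || v4.is_private()     // 10/8, 172.16/12, 192.168/16
                || v4.is_link_local()  // 169.254/16, covers 169.254.169.254 (cloud metadata)
                || v4.is_unspecified() // 0.0.0.0
        }
        IpAddr::V6(v6) => {
            // An IPv4-mapped address (::ffff:a.b.c.d) is judged by its IPv4 form.
            if let Some(v4) = v6.to_ipv4_mapped() {
                return is_blocked_ip(IpAddr::V4(v4));
            }
            let first = v6.segments()[0];
            v6.is_loopback()                  // ::1
                || v6.is_unspecified()        // ::
                || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
                || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
        }
    }
}

fn main() {
    let samples = ["127.0.0.1", "10.0.0.5", "169.254.169.254", "93.184.216.34", "::1", "fd00::1"];
    for addr in samples {
        let ip: IpAddr = addr.parse().expect("valid IP literal");
        println!("{addr}: blocked = {}", is_blocked_ip(ip));
    }
}
```

A check of this kind is usually applied to every address a hostname resolves to, and again after each redirect, since a public hostname can point at a private address.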

Documentation | API Reference

Substrate vs operational

kreuzcrawl ships the substrate: everything a developer needs to crawl a site end-to-end with no external service. Productization concerns (managed proxy pools, tuned WAF fingerprints, authenticated-session injection, scheduling/retry orchestration, billing) live in kreuzberg-cloud, the reference operational implementation.

In the substrate

HTML→Markdown engine, headless-Chrome fallback, `kreuzcrawl::robots` and `kreuzcrawl::sitemap` parsers (public, usable standalone), per-domain throttling, SSRF policy, baseline WAF classifier (`TomlClassifier`), crawl event stream, URL canonicalization, citations.
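
To make per-domain throttling concrete, here is a minimal Rust sketch of the idea: each host gets its own "next allowed" timestamp, so a slow host never delays requests to another. `DomainThrottle` and `reserve` are hypothetical names chosen for this example and say nothing about how `defaults::PerDomainThrottle` is actually implemented.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-domain throttle: at most one request per `delay` per host.
struct DomainThrottle {
    delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl DomainThrottle {
    fn new(delay: Duration) -> Self {
        Self { delay, next_allowed: HashMap::new() }
    }

    /// Returns how long the caller should wait before requesting `host`
    /// and reserves the following slot for that host.
    fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        let slot = self.next_allowed.entry(host.to_string()).or_insert(now);
        // The request goes out at the later of "now" and the reserved slot.
        let start = (*slot).max(now);
        *slot = start + self.delay;
        start.duration_since(now)
    }
}

fn main() {
    let mut throttle = DomainThrottle::new(Duration::from_millis(500));
    let now = Instant::now();
    // Two back-to-back requests to the same host, then one to a different host.
    for host in ["a.example", "a.example", "b.example"] {
        let wait = throttle.reserve(host, now);
        println!("{host}: wait {wait:?}");
    }
}
```

Passing `now` in explicitly keeps the logic deterministic and easy to test; an async caller would sleep for the returned duration before sending the request.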

Extension points (inject your own impls)

Trait	Baseline (kreuzcrawl)	Reference cloud impl
`traits::Frontier`	`defaults::InMemoryFrontier`	NATS-backed frontier in `kreuzberg-cloud/crates/crawl-traits/src/frontier.rs`
`traits::RateLimiter`	`defaults::PerDomainThrottle`	`AdaptiveRateLimiter` (cloud)
`traits::CrawlStore`	`defaults::NoopStore`	`CloudCrawlStore` (cloud)
`traits::EventEmitter`	`defaults::NoopEmitter`	`NatsEventEmitter` (cloud)
`traits::ContentFilter`	`defaults::NoopFilter`	`LlmContentFilter` (cloud)
`traits::CrawlCache`	`defaults::NoopCache`	`CloudCrawlCache` (cloud)
`WafClassifier`	`TomlClassifier` (loads rules from disk)	Tuned-fingerprint database (cloud)
`BypassProvider`	Built-in baseline	Managed bypass cluster (cloud)
`AntibotStrategy`	`DefaultAntibotStrategy`	Custom strategies (cloud)

Inject any impl via `CrawlEngineBuilder::with_<trait>(Arc::new(MyImpl))`.
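
The injection pattern can be illustrated without depending on the crate itself. The sketch below is self-contained and uses stand-in types: `PageFilter`, `KeepAll`, `MinWords`, `Engine`, `EngineBuilder`, and `with_filter` are all hypothetical, and the trait method is invented for the example. The real trait signatures and builder methods should be taken from the API reference.

```rust
use std::sync::Arc;

/// Stand-in for an extension-point trait such as `traits::ContentFilter`.
/// The method is invented for this sketch; the real trait may differ.
trait PageFilter: Send + Sync {
    fn keep(&self, text: &str) -> bool;
}

/// Baseline that accepts everything, in the spirit of a no-op default.
struct KeepAll;

impl PageFilter for KeepAll {
    fn keep(&self, _text: &str) -> bool {
        true
    }
}

/// Custom implementation: keep pages with at least `min_words` words.
struct MinWords {
    min_words: usize,
}

impl PageFilter for MinWords {
    fn keep(&self, text: &str) -> bool {
        text.split_whitespace().count() >= self.min_words
    }
}

/// Hypothetical engine that holds the filter behind a shared trait object.
struct Engine {
    filter: Arc<dyn PageFilter>,
}

impl Engine {
    fn accepts(&self, text: &str) -> bool {
        self.filter.keep(text)
    }
}

/// Hypothetical builder: starts from the baseline and lets the caller swap it.
struct EngineBuilder {
    filter: Arc<dyn PageFilter>,
}

impl EngineBuilder {
    fn new() -> Self {
        Self { filter: Arc::new(KeepAll) }
    }

    fn with_filter(mut self, filter: Arc<dyn PageFilter>) -> Self {
        self.filter = filter;
        self
    }

    fn build(self) -> Engine {
        Engine { filter: self.filter }
    }
}

fn main() {
    let default_engine = EngineBuilder::new().build();
    let strict_engine = EngineBuilder::new()
        .with_filter(Arc::new(MinWords { min_words: 3 }))
        .build();

    println!("default keeps 'hi': {}", default_engine.accepts("hi"));
    println!("strict keeps 'hi': {}", strict_engine.accepts("hi"));
    println!("strict keeps 'one two three': {}", strict_engine.accepts("one two three"));
}
```

Holding the implementation as `Arc<dyn Trait>` lets concurrent crawl tasks share one instance, which is presumably why the builder takes an `Arc`.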

Outside the substrate (cloud-only)

Managed paid IP rotation pool, premium per-site scraper presets (LinkedIn, Amazon, eBay, GitHub, …), authenticated-crawl session injection, visual-diff change detection, scheduled crawls + per-customer budgets, cost analytics.

Installation

Language	Package	Install
Python	kreuzcrawl	`pip install kreuzcrawl`
Node.js	@kreuzberg/kreuzcrawl	`npm install @kreuzberg/kreuzcrawl`
Rust	kreuzcrawl	`cargo add kreuzcrawl`
Go	pkg.go.dev	`go get github.com/kreuzberg-dev/kreuzcrawl/packages/go`
Java	Maven Central	See README
C#	NuGet	`dotnet add package Kreuzcrawl`
Ruby	kreuzcrawl	`gem install kreuzcrawl`
PHP	kreuzberg-dev/kreuzcrawl	`composer require kreuzberg-dev/kreuzcrawl`
Elixir	kreuzcrawl	`{:kreuzcrawl, "~> 0.3"}`
WASM	@kreuzberg/kreuzcrawl-wasm	`npm install @kreuzberg/kreuzcrawl-wasm`
C FFI	GitHub Releases	C header + shared library
CLI	crates.io	`cargo install kreuzcrawl-cli`
CLI (Homebrew)	kreuzberg-dev/tap	`brew install kreuzberg-dev/tap/kreuzcrawl`

Homebrew 6.0+ requires explicit trust for third-party taps. Run `brew trust kreuzberg-dev/tap` once before installing from that tap.

Quick Start

Python — Full docs

from kreuzcrawl import create_engine, scrape

engine = create_engine()
result = scrape(engine, "https://example.com")

print(result.metadata.title)
print(result.markdown.content)
print(len(result.links))

Node.js / TypeScript — Full docs

import { createEngine, scrape } from "@kreuzberg/kreuzcrawl";

const engine = createEngine();
const result = await scrape(engine, "https://example.com");

console.log(result.metadata.title);
console.log(result.markdown.content);
console.log(result.links.length);

Rust — Full docs

let engine = kreuzcrawl::create_engine(None)?;
let result = kreuzcrawl::scrape(&engine, "https://example.com").await?;

println!("{}", result.metadata.title);
println!("{}", result.markdown.content);
println!("{}", result.links.len());

Go — Full docs

engine, err := kcrawl.CreateEngine()
if err != nil {
	log.Fatal(err)
}
result, err := kcrawl.Scrape(engine, "https://example.com")
if err != nil {
	log.Fatal(err)
}

fmt.Println(result.Metadata.Title)
fmt.Println(result.Markdown.Content)
fmt.Println(len(result.Links))

Java — Full docs

var engine = Kreuzcrawl.createEngine(null);
var result = Kreuzcrawl.scrape(engine, "https://example.com");

System.out.println(result.metadata().title());
System.out.println(result.markdown().content());
System.out.println(result.links().size());

C# — Full docs

var engine = KreuzcrawlLib.CreateEngine(null);
var result = await KreuzcrawlLib.Scrape(engine, "https://example.com");

Console.WriteLine(result.Metadata.Title);
Console.WriteLine(result.Markdown.Content);
Console.WriteLine(result.Links.Count);

Ruby — Full docs

engine = Kreuzcrawl.create_engine(nil)
result = Kreuzcrawl.scrape(engine, "https://example.com")

puts result.metadata.title
puts result.markdown.content
puts result.links.length

PHP — Full docs

$engine = Kreuzcrawl::createEngine(null);
$result = Kreuzcrawl::scrape($engine, "https://example.com");

echo $result->metadata->title;
echo $result->markdown->content;
echo count($result->links);

Elixir — Full docs

{:ok, engine} = Kreuzcrawl.create_engine(nil)
{:ok, result} = Kreuzcrawl.scrape(engine, "https://example.com")

IO.puts(result.metadata.title)
IO.puts(result.markdown.content)
IO.puts(length(result.links))

Platform Support

Language	Linux x86_64	Linux aarch64	macOS ARM64	Windows x64
Python	✅	✅	✅	✅
Node.js	✅	✅	✅	✅
WASM	✅	✅	✅	✅
Ruby	✅	✅	✅	—
Elixir	✅	✅	✅	✅
Go	✅	✅	✅	✅
Java	✅	✅	✅	✅
C#	✅	✅	✅	✅
PHP	✅	✅	✅	✅
Rust	✅	✅	✅	✅
C (FFI)	✅	✅	✅	✅
CLI	✅	✅	✅	✅

Guides

Observability — Integrate with Jaeger, Prometheus, or your observability backend. Emit and analyze crawl spans, metrics, and traces with W3C TraceContext propagation.
Antibot Strategy & Stealth — Detect and bypass WAF systems. Customize antibot behavior per vendor, enable stealth surfaces, and integrate third-party bypass providers.

Architecture

Your Application (Python, Node.js, Ruby, Java, Go, C#, PHP, Elixir, ...)
    │
Language Bindings (PyO3, NAPI-RS, Magnus, ext-php-rs, Rustler, cgo, Panama, P/Invoke)
    │
Rust Core Engine (async, concurrent, SIMD-optimized)
    │
    ├── HTTP Client (reqwest + tower middleware stack)
    ├── HTML Parser (html5ever + lol_html)
    ├── Markdown Converter (html-to-markdown-rs)
    ├── Content Extraction (metadata, JSON-LD, Open Graph, readability)
    ├── Link Discovery (robots.txt, sitemaps, anchor analysis)
    └── Browser Rendering (optional headless Chrome/Firefox)

Contributing

Contributions are welcome! See our Contributing Guide.

Part of Kreuzberg.dev

Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
Kreuzberg Cloud — managed extraction API with SDKs, dashboards, and observability.
kreuzcrawl — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
html-to-markdown — fast, lossless HTML→Markdown engine.
liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
Discord — community, roadmap, announcements.

License

Elastic License 2.0
