Kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js, or use it via CLI, REST API, or MCP server.
Version 4.0.0 Release Candidate: This is a pre-release version. We invite you to test the library and report any issues you encounter. Help us make the stable release better!
Why Kreuzberg
- Rust-powered core: High-performance native code for text extraction
- Truly polyglot: Native bindings for Rust, Python, Ruby, Go, and TypeScript/Node.js
- Production-ready: Battle-tested with comprehensive error handling and validation
- 50+ file format families: PDF, Office documents, images, HTML, XML, emails, archives, and more
- OCR built-in: Multiple backends (Tesseract, EasyOCR, PaddleOCR) with table extraction support
- Flexible deployment: Use as a library, CLI tool, REST API server, or MCP server
- Memory efficient: Streaming parsers handle multi-GB files with constant memory usage
Installation
Python
pip install kreuzberg
Ruby
gem install kreuzberg
TypeScript/Node.js
npm install @goldziher/kreuzberg
TypeScript/Node.js Documentation
Go
go get github.com/Goldziher/kreuzberg/packages/go/kreuzberg@latest
Build the FFI crate (cargo build -p kreuzberg-ffi --release) and point LD_LIBRARY_PATH (Linux) or DYLD_FALLBACK_LIBRARY_PATH (macOS) at target/release so cgo can locate libkreuzberg_ffi.
Rust
[dependencies]
# Use git dependency for full feature support (including embeddings)
kreuzberg = { git = "https://github.com/Goldziher/kreuzberg", tag = "v4.0.0" }
# Or use a specific branch
# kreuzberg = { git = "https://github.com/Goldziher/kreuzberg", branch = "main" }
Note: Kreuzberg is not currently published to crates.io due to git dependencies (fastembed-rs, ort). Use the git dependency above for full functionality.
CLI
brew install goldziher/tap/kreuzberg
cargo install kreuzberg-cli
Quick Start
Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:
- Python Quick Start: Installation, basic usage, async/sync APIs
- Ruby Quick Start: Installation, basic usage, configuration
- TypeScript/Node.js Quick Start: Installation, types, promises
- Go Quick Start: Installation, native library setup, sync/async extraction, and batch APIs
- Rust Quick Start: Crate usage, features, async/sync APIs
- CLI Quick Start: Command-line usage, batch processing, options
Supported Formats
Documents & Productivity
| Format | Extensions | Metadata | Tables | Images |
|---|---|---|---|---|
| PDF | .pdf | β | β | β |
| Word | .docx, .doc | β | β | β |
| Excel | .xlsx, .xls, .ods | β | β | β |
| PowerPoint | .pptx, .ppt | β | β | β |
| Rich Text | .rtf | β | β | β |
| EPUB | .epub | β | β | β |
Images
All image formats support OCR: .jpg, .jpeg, .png, .tiff, .tif, .bmp, .gif, .webp, .jp2
Web & Structured Data
| Format | Extensions | Features |
|---|---|---|
| HTML | .html, .htm | Metadata extraction, link preservation |
| XML | .xml | Streaming parser for multi-GB files |
| JSON | .json | Intelligent field detection |
| YAML | .yaml | Structure preservation |
| TOML | .toml | Configuration parsing |
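To illustrate what a streaming XML parser means in practice, here is a minimal sketch using Python's standard library. It is not Kreuzberg's parser (which lives in the Rust core); the function name `iter_text` and the sample document are made up for illustration. The idea is to handle each element as soon as it closes and then discard it, so memory use does not grow with file size.

```python
import io
import xml.etree.ElementTree as ET


def iter_text(source, tag):
    """Yield the text of every <tag> element without keeping the whole tree."""
    # iterparse reads the input incrementally instead of loading it at once.
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the root's "start"
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield (elem.text or "").strip()
            # Drop everything parsed so far so it can be garbage collected.
            root.clear()


# Hypothetical sample input; a real file path or file object works the same way.
sample = io.BytesIO(b"<doc><p>first</p><p>second</p><note>skip</note></doc>")
paragraphs = list(iter_text(sample, "p"))
print(paragraphs)
```

Clearing the root after each match is what keeps memory bounded; without it, the processed elements stay attached to the tree.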
Email & Archives
| Format | Extensions | Features |
|---|---|---|
| Email | .eml, .msg | Full metadata, attachment extraction |
| Archives | .zip, .tar, .gz, .7z | File listing, metadata |
Academic & Technical
LaTeX (.tex), BibTeX (.bib), Jupyter (.ipynb), reStructuredText (.rst), Org Mode (.org), Markdown (.md)
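The extension lists above can be turned into a simple lookup, for example to pre-filter files before handing them to an extractor. The sketch below is an illustration only: the names `FORMAT_FAMILIES` and `family_for` are hypothetical, only a subset of the listed extensions is included, and this is not a description of how Kreuzberg itself detects formats.

```python
from pathlib import Path

# Extensions copied from the lists above (subset, for illustration).
FORMAT_FAMILIES = {
    "PDF": {".pdf"},
    "Word": {".docx", ".doc"},
    "Excel": {".xlsx", ".xls", ".ods"},
    "PowerPoint": {".pptx", ".ppt"},
    "Image": {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif", ".webp", ".jp2"},
    "Email": {".eml", ".msg"},
    "Archive": {".zip", ".tar", ".gz", ".7z"},
}

# Invert the mapping once so each lookup is a single dictionary access.
EXTENSION_TO_FAMILY = {
    ext: family for family, exts in FORMAT_FAMILIES.items() for ext in exts
}


def family_for(path):
    """Return the format family for a path, or None if the extension is unknown."""
    return EXTENSION_TO_FAMILY.get(Path(path).suffix.lower())


print(family_for("report.PDF"))
print(family_for("backup.tar.gz"))
```

Note that only the final suffix is considered, so `backup.tar.gz` is matched through `.gz`.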
Key Features
OCR with Table Extraction
Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.
Batch Processing
Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
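As a rough picture of what configurable parallelism looks like, the sketch below runs an extraction function over several paths with a bounded worker pool. It uses only the Python standard library; `extract_stub` and `batch_extract` are hypothetical stand-ins, not Kreuzberg's batch API, so consult the language-specific Quick Start for the real calls.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_stub(path):
    """Hypothetical stand-in for a real extraction call; returns placeholder text."""
    return f"text from {path}"


def batch_extract(paths, extract=extract_stub, max_workers=4):
    """Run `extract` over `paths` with at most `max_workers` tasks in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, even if tasks finish out of order.
        return list(pool.map(extract, paths))


results = batch_extract(["a.pdf", "b.docx", "c.png"], max_workers=2)
print(results)
```

Raising `max_workers` trades memory and CPU pressure for throughput, which is the knob a parallelism setting typically exposes.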
Password-Protected PDFs
Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
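The multiple-attempt behavior can be pictured as a loop over candidate passwords that stops at the first success. This is a generic sketch under stated assumptions: `WrongPassword`, `open_with_passwords`, and `fake_open` are invented for illustration and do not correspond to Kreuzberg functions or its configuration options.

```python
class WrongPassword(Exception):
    """Raised by the hypothetical opener when a password is rejected."""


def open_with_passwords(open_fn, passwords):
    """Try each candidate in order; return (document, password) for the first hit."""
    for password in passwords:
        try:
            return open_fn(password), password
        except WrongPassword:
            continue  # move on to the next candidate
    raise WrongPassword("none of the supplied passwords unlocked the document")


def fake_open(password):
    # Stand-in for a real PDF opener; only "s3cret" is accepted here.
    if password != "s3cret":
        raise WrongPassword(password)
    return "decrypted contents"


document, used = open_with_passwords(fake_open, ["guess", "s3cret", "unused"])
print(document, used)
```

Candidates after the first successful one are never tried, so ordering the list from most to least likely keeps the number of attempts low.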
Language Detection
Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
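One way to read "confidence thresholds" and "per-language statistics" is sketched below: detections under a minimum confidence are discarded, and the rest are summarized as a share per language. The input shape (a list of `(language, confidence)` pairs), the function name `summarize_languages`, and the sample values are all assumptions made for illustration, not the output format of Kreuzberg or fast-langdetect.

```python
from collections import Counter


def summarize_languages(detections, min_confidence=0.8):
    """Return each language's share among detections at or above the threshold."""
    kept = [lang for lang, confidence in detections if confidence >= min_confidence]
    if not kept:
        return {}
    counts = Counter(kept)
    return {lang: count / len(kept) for lang, count in counts.items()}


# Hypothetical per-chunk detections: (language code, confidence).
detections = [("en", 0.98), ("de", 0.91), ("en", 0.95), ("fr", 0.42)]
stats = summarize_languages(detections, min_confidence=0.8)
print(stats)
```

Here the low-confidence `fr` detection is dropped before the shares are computed, so it does not appear in the summary.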
Metadata Extraction
Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.
Deployment Options
REST API Server
Production-ready API server with OpenAPI documentation, health checks, and telemetry support. Deploy standalone or in containers with automatic format detection and streaming support.
MCP Server (AI Integration)
Model Context Protocol server for Claude and other AI assistants. Enables AI agents to extract and process documents directly with full configuration support.
Docker
Official Docker images available in multiple variants:
- Core (~1.0-1.3GB): Tesseract OCR, Pandoc, modern Office formats
- Full (~1.5-2.1GB): Adds LibreOffice for legacy Office formats (.doc, .ppt)
All images support API server, CLI, and MCP server modes with automatic platform detection for linux/amd64 and linux/arm64.
Architecture
Kreuzberg is built with a Rust core for efficient document extraction and processing.
Design Principles
- Rust core: Native code for text extraction and processing
- Async throughout: Asynchronous processing with Tokio runtime
- Memory efficient: Streaming parsers for large files
- Parallel batch processing: Configurable concurrency for multiple documents
- Zero-copy operations: Efficient data handling where possible
Documentation
- Installation Guide: Setup and dependencies
- User Guide: Comprehensive usage guide
- API Reference: Complete API documentation
- Format Support: Supported file formats
- OCR Backends: OCR engine setup
- CLI Guide: Command-line usage
- Migration Guide: Upgrading from v3
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.