Kreuzberg
Extract text and metadata from 56+ file formats, generate embeddings, and post-process the results at native speed, with no GPU required.
Key Features
- Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, and document extractors
- Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, and Elixir
- 56 file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
- OCR support – Tesseract (all languages via native binding), EasyOCR/PaddleOCR (Python), Guten (Node.js), extensible via plugin API
- High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
- Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
- Memory efficient – Streaming parsers for multi-GB files
Complete Documentation | Installation Guides
Installation
Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:
Scripting Languages:
- Python – PyPI package, async/sync APIs, OCR backends (Tesseract, EasyOCR, PaddleOCR)
- Ruby – RubyGems package, idiomatic Ruby API, native bindings
- PHP – Composer package, modern PHP 8.2+ support, type-safe API
- Elixir – Hex package, OTP integration, concurrent processing
JavaScript/TypeScript:
- @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
- @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers
Compiled Languages:
- Go – Go module with FFI bindings, context-aware async
- Java – Maven Central, Foreign Function & Memory API
- C# – NuGet package, .NET 6.0+, full async/await support
Native:
- Rust – Core library, flexible feature flags, zero-copy APIs
Containers:
- Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.5-2.1GB with LibreOffice)
Command-Line:
- CLI – Cross-platform binary, batch processing, MCP server mode
All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for ARM64 (Apple Silicon) on macOS.
Docker:
- Docker – Two images, `core` and `full`, available for both x86 and ARM
Platform Support
Architecture coverage for each language binding:
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |
Note: ✅ = Precompiled binaries available with instant installation. All platforms are tested in CI. macOS support is Apple Silicon only.
Embeddings Support (Optional)
To use embeddings functionality:
- Install ONNX Runtime 1.22.x:
  - Linux: Download from ONNX Runtime releases (Debian packages may have older versions)
  - macOS: `brew install onnxruntime`
  - Windows: Download from ONNX Runtime releases
- Use embeddings in your code - see Embeddings Guide
Note: Kreuzberg requires ONNX Runtime version 1.22.x for embeddings. All other Kreuzberg features work without ONNX Runtime.
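The 1.22.x requirement can be checked before enabling embeddings. The sketch below is a minimal illustration, not part of Kreuzberg: `is_supported_onnxruntime` is a hypothetical helper, and how you obtain the version string depends on how ONNX Runtime was installed on your system.

```python
def is_supported_onnxruntime(version: str) -> bool:
    """Return True when the version string belongs to the 1.22.x series."""
    parts = version.strip().split(".")
    # Compare major and minor components exactly; any patch level is accepted.
    return len(parts) >= 2 and parts[0] == "1" and parts[1] == "22"


# Example version strings (illustrative values only).
checks = {v: is_supported_onnxruntime(v) for v in ["1.22.0", "1.22.3", "1.21.1", "1.23.0"]}
```

Here `checks` marks the two 1.22 versions as supported and the other two as unsupported.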
Supported Formats
56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
Office Documents
| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .odt | Full text, tables, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
Images (OCR-Enabled)
| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .pnm, .pbm, .pgm, .ppm | OCR, table detection, format-specific metadata |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
Web & Data
| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .rst, .org, .rtf | CommonMark, GFM, reStructuredText, Org Mode |
Email & Archives
| Category | Formats | Features |
|---|---|---|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading |
| Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata |
Academic & Scientific
| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .biblatex, .ris, .enw, .csl | Bibliography parsing, citation extraction |
| Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS |
| Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats |
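The grouping in the tables above can be pictured as a lookup from file extension to category. The sketch below is only an illustration: it covers a subset of the listed extensions, `CATEGORIES` and `category_for` are hypothetical names, and it matches on the extension alone, which may differ from how Kreuzberg actually detects formats.

```python
from pathlib import Path
from typing import Optional

# A subset of the extensions from the tables above, grouped by section.
# The category keys are illustrative labels, not Kreuzberg identifiers.
CATEGORIES = {
    "office": {".docx", ".odt", ".xlsx", ".xls", ".ods", ".pptx", ".ppt", ".pdf", ".epub"},
    "image": {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff", ".tif"},
    "web_data": {".html", ".htm", ".xml", ".json", ".yaml", ".yml", ".toml", ".csv", ".md", ".txt"},
    "email_archive": {".eml", ".msg", ".zip", ".tar", ".tgz", ".gz", ".7z"},
    "academic": {".bib", ".ris", ".tex", ".ipynb", ".typst"},
}


def category_for(path: str) -> Optional[str]:
    """Return the category whose extension set contains the file's suffix, if any."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the last suffix
    for name, extensions in CATEGORIES.items():
        if suffix in extensions:
            return name
    return None


pdf_category = category_for("Report.PDF")
unknown_category = category_for("notes.xyz")
```

A lookup like this is handy for routing files in a batch job before extraction, for example to send only image files through OCR.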
Key Features
Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.
Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
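As a rough picture of concurrent processing with a configurable degree of parallelism, the sketch below fans a list of paths out over a thread pool. It does not use Kreuzberg's batch API: `extract_many` is a hypothetical helper, and `fake_extract` is a stand-in for whatever extraction call you use.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def extract_many(
    paths: Iterable[str],
    extract_one: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run extract_one over all paths with at most max_workers running at once."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with path_list.
        results = list(pool.map(extract_one, path_list))
    return dict(zip(path_list, results))


def fake_extract(path: str) -> str:
    # Hypothetical stand-in for a real extraction call; it only echoes the file name.
    return f"text of {Path(path).name}"


batch = extract_many(["a.pdf", "b.docx", "c.png"], fake_extract, max_workers=2)
```

Raising `max_workers` increases throughput only until CPU, memory, or I/O becomes the bottleneck, so it is worth tuning per workload.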
Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
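The multiple-password behavior can be thought of as trying candidates in order until one succeeds. The sketch below shows only that loop and makes several assumptions: `open_with_passwords`, `WrongPassword`, and `fake_open` are hypothetical names, trying an empty password first is a choice made for this example, and decryption itself (RC4 or AES) is left to the underlying PDF library.

```python
from typing import Callable, Optional, Sequence


class WrongPassword(Exception):
    """Raised by the hypothetical opener when a password is rejected."""


def open_with_passwords(
    open_pdf: Callable[[str, Optional[str]], str],
    path: str,
    passwords: Sequence[str],
) -> str:
    """Try no password first, then each candidate in order; return the first success."""
    for candidate in [None, *passwords]:
        try:
            return open_pdf(path, candidate)
        except WrongPassword:
            continue  # fall back to the next candidate
    raise WrongPassword(f"none of {len(passwords)} passwords opened {path}")


def fake_open(path: str, password: Optional[str]) -> str:
    # Hypothetical opener used for illustration: accepts only "s3cret".
    if password != "s3cret":
        raise WrongPassword(path)
    return f"opened {path}"


result = open_with_passwords(fake_open, "report.pdf", ["guess", "s3cret"])
```

If every candidate is rejected, the helper raises `WrongPassword` so the caller can report the failure instead of silently returning empty text.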
Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
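To illustrate how a confidence threshold and per-language statistics might fit together, the sketch below aggregates detection results that have already been computed. It does not call fast-langdetect or Kreuzberg: `language_stats` is a hypothetical helper and the `(language, confidence)` pairs are made-up sample data.

```python
from collections import Counter
from typing import Iterable, Tuple


def language_stats(
    detections: Iterable[Tuple[str, float]],
    min_confidence: float = 0.8,
) -> dict[str, float]:
    """Return each language's share of chunks, ignoring low-confidence detections."""
    counts = Counter(lang for lang, score in detections if score >= min_confidence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in counts.items()}


# Made-up (language, confidence) pairs, one per text chunk.
sample = [("en", 0.98), ("en", 0.91), ("de", 0.95), ("fr", 0.42)]
stats = language_stats(sample, min_confidence=0.8)
```

With this sample, the low-confidence `fr` detection is dropped, and the remaining three chunks split two to one between `en` and `de`.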
Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.
Documentation
- Installation Guide – Setup and dependencies
- User Guide – Comprehensive usage guide
- API Reference – Complete API documentation
- Format Support – Supported file formats
- OCR Backends – OCR engine setup
- CLI Guide – Command-line usage
- Migration Guide – Upgrading from v3
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details. You can use Kreuzberg freely in both commercial and closed-source products. The only condition is keeping the copyright and license notice; there are no copyleft ("viral") terms.