Project

kreuzberg

Kreuzberg is a multi-language document intelligence framework with a high-performance Rust core. Supports extraction, OCR, chunking, and language detection for 30+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
 Project Readme

Kreuzberg


Extract text and metadata from a wide range of file formats (56+), generate embeddings and post-process at native speeds without needing a GPU.

Key Features

  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, and document extractors
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, and Elixir
  • 56 file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • OCR support – Tesseract (all languages via native binding), EasyOCR/PaddleOCR (Python), Guten (Node.js), extensible via plugin API
  • High performance – Rust core with native PDFium, SIMD optimizations and full parallelism
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, EasyOCR, PaddleOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.2+ support, type-safe API
  • Elixir – Hex package, OTP integration, concurrent processing

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.5-2.1GB with LibreOffice)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for x86_64 and aarch64 on Linux and for ARM64 (Apple Silicon) on macOS.

Docker:

  • Docker – Two images, core and full, both available for x86_64 and aarch64

Platform Support

Complete architecture coverage across all language bindings:

Language Linux x86_64 Linux aarch64 macOS ARM64 Windows x64
Python
Node.js
Ruby -
Elixir
Go
Java
C#
PHP
Rust
CLI
Docker -

Note: ✅ = Precompiled binaries available with instant installation. All platforms are tested in CI. macOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.22.x:

  2. Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.22.x for embeddings. All other Kreuzberg features work without ONNX Runtime.
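
A quick way to confirm the version requirement before enabling embeddings might look like the sketch below. It assumes ONNX Runtime was installed as the `onnxruntime` Python package; a system-level shared-library install would need a different check. The helper names are illustrative and not part of Kreuzberg.

```python
from importlib import metadata
from typing import Optional


def is_supported_onnxruntime(version: str) -> bool:
    """Return True when the version string belongs to the 1.22.x series."""
    return version.split(".")[:2] == ["1", "22"]


def installed_onnxruntime_version() -> Optional[str]:
    """Return the installed onnxruntime package version, or None if absent.

    Assumption: ONNX Runtime is installed as the `onnxruntime` Python package.
    """
    try:
        return metadata.version("onnxruntime")
    except metadata.PackageNotFoundError:
        return None


version = installed_onnxruntime_version()
if version is None:
    print("onnxruntime is not installed; embeddings are unavailable")
elif not is_supported_onnxruntime(version):
    print(f"onnxruntime {version} found, but 1.22.x is required")
else:
    print(f"onnxruntime {version} is in the supported series")
```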

Supported Formats

56 file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

Category | Formats | Capabilities
Word Processing | .docx, .odt | Full text, tables, images, metadata, styles
Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .ods | Sheet data, formulas, cell metadata, charts
Presentations | .pptx, .ppt, .ppsx | Slides, speaker notes, images, metadata
PDF | .pdf | Text, tables, images, metadata, OCR support
eBooks | .epub, .fb2 | Chapters, metadata, embedded resources

Images (OCR-Enabled)

Category | Formats | Features
Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space
Advanced | .jp2, .jpx, .jpm, .mj2, .pnm, .pbm, .pgm, .ppm | OCR, table detection, format-specific metadata
Vector | .svg | DOM parsing, embedded text, graphics metadata

Web & Data

Category | Formats | Features
Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation
Text & Markdown | .txt, .md, .markdown, .rst, .org, .rtf | CommonMark, GFM, reStructuredText, Org Mode

Email & Archives

Category | Formats | Features
Email | .eml, .msg | Headers, body (HTML/plain), attachments, threading
Archives | .zip, .tar, .tgz, .gz, .7z | File listing, nested archives, metadata

Academic & Scientific

Category | Formats | Features
Citations | .bib, .biblatex, .ris, .enw, .csl | Bibliography parsing, citation extraction
Scientific | .tex, .latex, .typst, .jats, .ipynb, .docbook | LaTeX, Jupyter notebooks, PubMed JATS
Documentation | .opml, .pod, .mdoc, .troff | Technical documentation formats

Complete Format Reference →
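
To make the idea of format detection concrete, here is a minimal sketch that maps a file to one of the categories above, first by extension and then by well-known file signatures. It is not Kreuzberg's detection logic: the abridged extension map, the signature list, and `detect_category` are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extension-to-category map, abridged from the tables above.
CATEGORY_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Word Processing",
    ".odt": "Word Processing",
    ".xlsx": "Spreadsheets",
    ".pptx": "Presentations",
    ".png": "Raster",
    ".jpg": "Raster",
    ".html": "Markup",
    ".json": "Structured Data",
    ".md": "Text & Markdown",
    ".eml": "Email",
    ".zip": "Archives",
    ".bib": "Citations",
    ".ipynb": "Scientific",
}

# Standard file signatures, consulted when the extension is missing or unknown.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "Raster"),  # PNG
    (b"\xff\xd8\xff", "Raster"),       # JPEG
    (b"PK\x03\x04", "Archives"),       # ZIP container
]


def detect_category(filename: str, head: bytes = b"") -> Optional[str]:
    """Guess a category from the extension, falling back to leading bytes."""
    suffix = Path(filename).suffix.lower()
    if suffix in CATEGORY_BY_EXTENSION:
        return CATEGORY_BY_EXTENSION[suffix]
    for signature, category in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return category
    return None


print(detect_category("Report.PDF"))
print(detect_category("scan", b"%PDF-1.7\n"))
print(detect_category("unknown.bin", b"\x00\x01"))
```

Note that a ZIP signature alone cannot distinguish a plain archive from .docx, .xlsx, or .pptx files, which are also ZIP containers; a real detector would need to inspect the archive contents.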

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
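
The general pattern of concurrent processing with a configurable worker count can be sketched with the standard library alone. This does not use Kreuzberg's batch API, which is covered in the guide; `process_batch` and the placeholder `fake_extract` are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def process_batch(
    paths: Sequence[str],
    extract: Callable[[str], str],
    max_workers: int = 4,
) -> List[str]:
    """Run `extract` over all paths concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, paths))


def fake_extract(path: str) -> str:
    """Placeholder extractor (hypothetical): returns a label, not real text."""
    return f"text from {path}"


results = process_batch(["a.pdf", "b.docx", "c.png"], fake_extract, max_workers=2)
print(results)
```

`max_workers` is the parallelism knob here: raising it increases throughput until CPU, memory, or I/O becomes the bottleneck.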

Batch Processing Guide →

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
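
Trying several candidate passwords in order amounts to a simple fallback loop. The sketch below shows that loop in isolation, assuming an opener that raises an error when a password is rejected. `WrongPassword`, `open_with_passwords`, and `fake_open` are hypothetical and are not Kreuzberg functions.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


class WrongPassword(Exception):
    """Raised by the hypothetical opener when a password is rejected."""


def open_with_passwords(
    open_pdf: Callable[[Optional[str]], T],
    passwords: Iterable[str],
) -> T:
    """Try opening without a password first, then each candidate in order."""
    candidates: List[Optional[str]] = [None, *passwords]
    for password in candidates:
        try:
            return open_pdf(password)
        except WrongPassword:
            continue
    raise WrongPassword("none of the supplied passwords opened the document")


def fake_open(password: Optional[str]) -> str:
    """Stand-in opener: accepts only the password 'secret'."""
    if password != "secret":
        raise WrongPassword(str(password))
    return "document opened"


status = open_with_passwords(fake_open, ["guess", "secret", "unused"])
print(status)
```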

PDF Configuration →

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
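
A confidence threshold is essentially a filter over per-language scores. The sketch below assumes the detector yields a mapping from language code to a confidence between 0 and 1. That output shape and the `filter_languages` helper are assumptions for illustration, not the fast-langdetect or Kreuzberg API.

```python
from typing import Dict, List, Tuple


def filter_languages(
    scores: Dict[str, float],
    min_confidence: float = 0.8,
) -> List[Tuple[str, float]]:
    """Keep languages at or above the threshold, most confident first."""
    kept = [(lang, score) for lang, score in scores.items() if score >= min_confidence]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical detector output: language code -> confidence in [0, 1].
detected = {"de": 0.85, "en": 0.93, "fr": 0.12}
confident = filter_languages(detected, min_confidence=0.8)
print(confident)
```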

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

Documentation

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details. You can use Kreuzberg freely in both commercial and closed-source products with no copyleft (viral) effects; the only condition is that the copyright and license notice be retained in copies of the software.