Project

kreuzberg

0.63
The project is in a healthy, maintained state
Kreuzberg is a multi-language document intelligence framework with a high-performance Rust core. Supports extraction, OCR, chunking, and language detection for 30+ file formats including PDF, DOCX, PPTX, XLSX, images, and more.
 Project Readme

Kreuzberg


A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js, or use it via the CLI, REST API, or MCP server.

🚀 Version 4.0.0 Release Candidate: This is a pre-release version. We invite you to test the library and report any issues you encounter. Help us make the stable release better!

Why Kreuzberg

  • Rust-powered core – High-performance native code for text extraction
  • Truly polyglot – Native bindings for Rust, Python, Ruby, Go, and TypeScript/Node.js
  • Production-ready – Battle-tested with comprehensive error handling and validation
  • 50+ file format families – PDF, Office documents, images, HTML, XML, emails, archives, and more
  • OCR built-in – Multiple backends (Tesseract, EasyOCR, PaddleOCR) with table extraction support
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • Memory efficient – Streaming parsers handle multi-GB files with constant memory usage

📖 Complete Documentation • 🚀 Installation Guides

Installation

Python

pip install kreuzberg

Python Documentation →

Ruby

gem install kreuzberg

Ruby Documentation →

TypeScript/Node.js

npm install @goldziher/kreuzberg

TypeScript/Node.js Documentation →

Go

go get github.com/Goldziher/kreuzberg/packages/go/kreuzberg@latest

The Go binding links against the native library through cgo. Build the FFI crate (cargo build -p kreuzberg-ffi --release) and point LD_LIBRARY_PATH (Linux) or DYLD_FALLBACK_LIBRARY_PATH (macOS) at target/release so cgo can locate libkreuzberg_ffi.

Go Documentation →

Rust

[dependencies]
# Use git dependency for full feature support (including embeddings)
kreuzberg = { git = "https://github.com/Goldziher/kreuzberg", tag = "v4.0.0" }

# Or use a specific branch
# kreuzberg = { git = "https://github.com/Goldziher/kreuzberg", branch = "main" }

Note: Kreuzberg is not currently published to crates.io due to git dependencies (fastembed-rs, ort). Use the git dependency above for full functionality.

Rust Documentation →

CLI

brew install goldziher/tap/kreuzberg
cargo install kreuzberg-cli

CLI Documentation →

Quick Start

Each language binding provides comprehensive documentation with examples and best practices. To get started, follow the documentation link for your platform in the Installation section.

Supported Formats

Documents & Productivity

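| Format     | Extensions        | Metadata | Tables | Images |
|------------|-------------------|----------|--------|--------|
| PDF        | .pdf              | ✅       | ✅     | ✅     |
| Word       | .docx, .doc       | ✅       | ✅     | ✅     |
| Excel      | .xlsx, .xls, .ods | ✅       | ✅     | ❌     |
| PowerPoint | .pptx, .ppt       | ✅       | ✅     | ✅     |
| Rich Text  | .rtf              | ✅       | ❌     | ❌     |
| EPUB       | .epub             | ✅       | ❌     | ❌     |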

Images

All image formats support OCR: .jpg, .jpeg, .png, .tiff, .tif, .bmp, .gif, .webp, .jp2

Web & Structured Data

| Format | Extensions  | Features                               |
|--------|-------------|----------------------------------------|
| HTML   | .html, .htm | Metadata extraction, link preservation |
| XML    | .xml        | Streaming parser for multi-GB files    |
| JSON   | .json       | Intelligent field detection            |
| YAML   | .yaml       | Structure preservation                 |
| TOML   | .toml       | Configuration parsing                  |

Email & Archives

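| Format   | Extensions             | Features                            |
|----------|------------------------|-------------------------------------|
| Email    | .eml, .msg             | Full metadata, attachment extraction |
| Archives | .zip, .tar, .gz, .7z   | File listing, metadata              |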

Academic & Technical

LaTeX (.tex), BibTeX (.bib), Jupyter (.ipynb), reStructuredText (.rst), Org Mode (.org), Markdown (.md)

Complete Format Documentation
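
As a rough illustration of the format lists above, the sketch below maps file extensions to the format families named in this README, which can be handy for filtering a directory before extraction. The names FORMAT_FAMILIES and format_family are made up for this example and are not part of Kreuzberg's API; the library performs its own format detection.

```python
from pathlib import Path

# Extension lists transcribed from the "Supported Formats" section above.
# FORMAT_FAMILIES and format_family are illustrative names, not Kreuzberg API.
FORMAT_FAMILIES = {
    "PDF": {".pdf"},
    "Word": {".docx", ".doc"},
    "Excel": {".xlsx", ".xls", ".ods"},
    "PowerPoint": {".pptx", ".ppt"},
    "Rich Text": {".rtf"},
    "EPUB": {".epub"},
    "Image": {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif", ".webp", ".jp2"},
    "HTML": {".html", ".htm"},
    "XML": {".xml"},
    "JSON": {".json"},
    "YAML": {".yaml"},
    "TOML": {".toml"},
    "Email": {".eml", ".msg"},
    "Archive": {".zip", ".tar", ".gz", ".7z"},
    "LaTeX": {".tex"},
    "BibTeX": {".bib"},
    "Jupyter": {".ipynb"},
    "reStructuredText": {".rst"},
    "Org Mode": {".org"},
    "Markdown": {".md"},
}

# Invert the table once so each lookup is a single dict access.
EXTENSION_TO_FAMILY = {
    ext: family for family, exts in FORMAT_FAMILIES.items() for ext in exts
}


def format_family(path):
    """Return the format family for a path, or None if its extension is not listed."""
    return EXTENSION_TO_FAMILY.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ["report.PDF", "scan.tif", "backup.tar.gz", "notes.txt"]:
        print(name, "->", format_family(name))
```

Only the final suffix is inspected, so a name like backup.tar.gz is classified by .gz. Extension matching is a convenience filter only; it says nothing about whether the file content is valid.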

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →
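
To make "configurable parallelism" concrete, here is a minimal sketch of bounded-concurrency batch processing in plain asyncio. It does not call Kreuzberg: process_batch, fake_extract, and max_concurrency are hypothetical names, and fake_extract stands in for whatever extraction call your binding provides. See the Batch Processing Guide for the library's actual batch functions.

```python
import asyncio


async def process_batch(paths, extract, max_concurrency=4):
    """Run `extract` over `paths` with at most `max_concurrency` calls in flight.

    Returns one (path, result, error) tuple per input, in input order.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(path):
        async with semaphore:
            try:
                return path, await extract(path), None
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return path, None, exc

    return await asyncio.gather(*(run_one(p) for p in paths))


async def fake_extract(path):
    """Stand-in extractor (hypothetical): pretends to read a document."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return f"text of {path}"


if __name__ == "__main__":
    batch = ["a.pdf", "b.docx", "c.bad"]
    for path, text, error in asyncio.run(process_batch(batch, fake_extract, 2)):
        print(path, text if error is None else f"failed: {error}")
```

The semaphore caps how many documents are processed at once, and collecting errors per document means one unreadable file does not discard the results of the others.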

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →
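
The "multiple password attempts" behaviour can be pictured as a simple ordered fallback loop. The sketch below shows that pattern only: WrongPassword, open_with_passwords, and fake_open_pdf are hypothetical names invented for this example, and the real option names live in the PDF configuration docs.

```python
class WrongPassword(Exception):
    """Raised by the (hypothetical) opener when a password is rejected."""


def open_with_passwords(open_pdf, path, passwords):
    """Try each candidate password in order; return (document, password_used).

    Raises WrongPassword if none of the candidates opens the file.
    """
    for password in passwords:
        try:
            return open_pdf(path, password), password
        except WrongPassword:
            continue  # fall through to the next candidate
    raise WrongPassword(f"none of {len(passwords)} passwords opened {path}")


def fake_open_pdf(path, password):
    """Stand-in opener (hypothetical): accepts exactly one password."""
    if password != "s3cret":
        raise WrongPassword(path)
    return f"<document {path}>"


if __name__ == "__main__":
    doc, used = open_with_passwords(fake_open_pdf, "contract.pdf", ["", "guess", "s3cret"])
    print(doc, "opened with", repr(used))
```

Listing the empty string first is a common choice, since many "protected" PDFs only restrict permissions and open without a user password.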

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

Deployment Options

REST API Server

Production-ready API server with OpenAPI documentation, health checks, and telemetry support. Deploy standalone or in containers with automatic format detection and streaming support.

API Server Documentation →

MCP Server (AI Integration)

Model Context Protocol server for Claude and other AI assistants. Enables AI agents to extract and process documents directly with full configuration support.

MCP Server Documentation →

Docker

Official Docker images available in multiple variants:

  • Core (~1.0-1.3GB): Tesseract OCR, Pandoc, modern Office formats
  • Full (~1.5-2.1GB): Adds LibreOffice for legacy Office formats (.doc, .ppt)

All images support API server, CLI, and MCP server modes with automatic platform detection for linux/amd64 and linux/arm64.

Docker Deployment Guide →

Architecture

Kreuzberg is built with a Rust core for efficient document extraction and processing.

Design Principles

  • Rust core – Native code for text extraction and processing
  • Async throughout – Asynchronous processing with Tokio runtime
  • Memory efficient – Streaming parsers for large files
  • Parallel batch processing – Configurable concurrency for multiple documents
  • Zero-copy operations – Efficient data handling where possible
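
For readers unfamiliar with the streaming-parser idea in the principles above, here is a small analogue using Python's standard library. It is not Kreuzberg's Rust implementation, and iter_record_text and the "record" tag are assumptions made for the example. It shows only the general technique: handle each element as it completes, then release it so memory does not grow with file size.

```python
import io
import xml.etree.ElementTree as ET


def iter_record_text(source, tag="record"):
    """Yield the text of each <tag> element without building the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield "".join(elem.itertext()).strip()
            # Drop already-processed children so memory stays roughly flat.
            root.clear()


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<root><record>alpha</record><other>x</other>"
        b"<record>beta <b>bold</b></record></root>"
    )
    for text in iter_record_text(sample):
        print(text)
```

Because the function is a generator, callers can start consuming records before the input has been fully read, and the same loop works on a file path or any binary file object.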

Documentation

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.