html-to-markdown
Fast, robust HTML → Markdown for 16 languages. A tiered converter that picks the safest, fastest path per input without losing content.
What and Why?
html-to-markdown converts real-world HTML — unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — into clean CommonMark (or Djot) without losing content, from one Rust core with native bindings for 16 languages.
It routes each input through three tiers: a single-pass byte scanner for clean HTML, a tolerant DOM walker for complex inputs, and an html5ever repair pass for malformed HTML — with byte-identical output across tiers, enforced by a 116-snapshot oracle and per-group performance gates in CI. The dispatcher is invisible: the same convert() call works regardless of which tier runs.
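To make the idea of tiered dispatch concrete, here is a minimal toy sketch. The `Tier` enum, the `pick_tier` function, and both heuristics are hypothetical illustrations; this README does not describe the real dispatcher's rules, which live in the Rust core.

```python
from enum import Enum


class Tier(Enum):
    BYTE_SCANNER = 1  # single pass over clean HTML
    DOM_WALKER = 2    # tolerant walk for complex inputs
    REPAIR = 3        # full repair pass for malformed HTML


def pick_tier(html: str) -> Tier:
    """Hypothetical heuristic, not the library's actual dispatch logic."""
    lowered = html.lower()
    # Toy signal for malformed input: <p> opens and closes do not balance.
    if lowered.count("<p>") != lowered.count("</p>"):
        return Tier.REPAIR
    # Toy signal for complex input: tables or CDATA sections.
    if "<table" in lowered or "<![cdata[" in lowered:
        return Tier.DOM_WALKER
    # Everything else takes the cheapest path.
    return Tier.BYTE_SCANNER
```

The caller never sees the tier. Under the byte-identical guarantee described above, the choice only affects speed, not the output.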
Features
| Feature | Description |
|---|---|
| 16 languages, one Rust core | Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, and a C ABI |
| Tiered dispatch | Byte scanner → DOM walker → html5ever repair, with byte-equal output across tiers |
| Real-HTML robust | Unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — handled without losing content |
| GFM tables | Padded cells, alignment, and pipe escaping |
| Djot output | Set output_format = "djot" to emit Djot instead of Markdown |
| Metadata extraction | Parse <head> into structured metadata (Open Graph, Twitter, JSON-LD, microdata, RDFa, header hierarchy) |
| Inline images | Opt-in mirroring of data URIs and remote image references |
| Visitor API | Feature-gated traversal to transform the converted Markdown AST |
| Configurable preprocessing | Standard, strict, and lenient presets — or build your own |
| Fast | 19–116 MB/s on the Wikipedia/mdream corpus; per-group regression thresholds enforced on every PR |
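The GFM tables row mentions padded cells, alignment, and pipe escaping. The sketch below shows roughly what those three steps involve. `render_gfm_table` is a hypothetical helper written for illustration, not part of the html-to-markdown API.

```python
def render_gfm_table(header, rows, align=None):
    """Render rows as a GFM table with padded cells and escaped pipes."""
    align = align or ["left"] * len(header)
    # Escape literal pipes so they are not read as column separators.
    clean = [[str(c).replace("|", "\\|") for c in row] for row in [header, *rows]]
    # Pad each column to its widest cell; minimum 3 so a centered
    # delimiter (:-:) always fits.
    widths = [max(3, *(len(r[i]) for r in clean)) for i in range(len(header))]

    def line(cells):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    def delim(a, w):
        # Colons in the delimiter row carry the column alignment.
        if a == "center":
            return ":" + "-" * (w - 2) + ":"
        if a == "right":
            return "-" * (w - 1) + ":"
        return "-" * w

    rule = line([delim(a, w) for a, w in zip(align, widths)])
    return "\n".join([line(clean[0]), rule, *(line(r) for r in clean[1:])])
```

Calling `render_gfm_table(["Name", "Op"], [["or", "a|b"]], ["left", "right"])` produces a three-line table whose delimiter row is `| ---- | ---: |` and whose data cell reads `a\|b`.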
⭐ Star this repo to show your support — it helps others discover html-to-markdown.
Quick Start
convert() is the single entry point — it returns a structured result with content, warnings, and optional metadata.
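As a rough model of that result shape, the sketch below uses a stand-in dataclass. Only the three field names (content, warnings, metadata) come from the description above. `ResultSketch`, its field types, and `summarize` are assumptions made for illustration; check the README for your binding for the real types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResultSketch:
    """Hypothetical stand-in for the structured result; not the library's type."""
    content: str
    warnings: list[str] = field(default_factory=list)
    metadata: Optional[dict] = None  # assumed to be None unless extraction is requested


def summarize(result: ResultSketch) -> str:
    """Show how a caller might branch on the three parts of the result."""
    parts = [f"{len(result.content)} chars"]
    if result.warnings:
        parts.append(f"{len(result.warnings)} warning(s)")
    if result.metadata is not None:
        parts.append("metadata: " + ", ".join(sorted(result.metadata)))
    return "; ".join(parts)
```

The point of a structured result is that a caller can use the Markdown content directly while still checking warnings and metadata when they matter.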
Language Packages
- Rust: `cargo add html-to-markdown-rs`. See Rust README for full documentation.
- Python: `pip install html-to-markdown`. See Python README for full documentation.
- Node.js: `npm install @kreuzberg/html-to-markdown`. See Node.js README for full documentation.
- Go: `go get github.com/xberg-io/html-to-markdown/packages/go/v3`. See Go README for full documentation.
- Java: Available on Maven Central as `dev.kreuzberg:html-to-markdown`. See Java README for the dependency snippet and current version.
- C#: `dotnet add package KreuzbergDev.HtmlToMarkdown`. See C# README for full documentation.
- Ruby: `gem install html-to-markdown`. See Ruby README for full documentation.
- PHP: This is a native PHP extension (Rust ext-php-rs), so install it with PIE, not `composer require`: `pie install xberg-io/html-to-markdown`. See PHP README for full documentation.
- Elixir: Add `{:html_to_markdown, "~> 3.6"}` to your mix.exs dependencies. See Elixir README for full documentation.
- R: `install.packages("htmltomarkdown", repos = "https://xberg-io.r-universe.dev")`. See R README for full documentation.
- Dart: `dart pub add h2m`. See Dart README for full documentation.
- Kotlin (Android): Available on Maven Central as `dev.kreuzberg:html-to-markdown-android`. See Kotlin README for the dependency snippet and current version.
- Swift: Add via Swift Package Manager. See Swift README for full documentation.
- Zig: See Zig README for installation and usage.
- WebAssembly: `npm install @kreuzberg/html-to-markdown-wasm`. See WebAssembly README for full documentation.
- C ABI (FFI): Pre-built .so / .dll / .dylib from GitHub Releases. See FFI crate for full documentation.
- CLI: `cargo install html-to-markdown-cli` or `brew install xberg-io/tap/html-to-markdown`. See CLI usage for full documentation.
AI Coding Assistants
Install the html-to-markdown plugin from the xberg-io/plugins marketplace. It ships the html-to-markdown agent skills and works with every major coding agent; use the commands for your harness below.
/plugin marketplace add xberg-io/plugins
/plugin install html-to-markdown@kreuzberg
/plugins add https://github.com/xberg-io/plugins
Then search for html-to-markdown and select Install Plugin.
Settings → Plugins → Add from URL → https://github.com/xberg-io/plugins, then select html-to-markdown.
gemini extensions install https://github.com/xberg-io/plugins
droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install html-to-markdown@kreuzberg
copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install html-to-markdown@kreuzberg
Add the package to opencode.json:
{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@kreuzberg/opencode-html-to-markdown"]
}
Documentation
Full guides, the convert() API for every binding, tier architecture, the metadata and visitor APIs, and performance benchmarks live at docs.html-to-markdown.xberg.io.
Part of Kreuzberg.dev
- Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
- Xberg Enterprise — managed extraction API with SDKs, dashboards, and observability.
- crawlberg — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
- html-to-markdown — fast, lossless HTML→Markdown engine.
- liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
- tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
- alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.
Contributing
Contributions welcome! See CONTRIBUTING.md for setup instructions and guidelines.
License
MIT License — see LICENSE for details.