# html-to-markdown-rb

Blazing-fast HTML → Markdown conversion for Ruby, powered by the same Rust engine used by our Python, Node.js, and WebAssembly packages. Ship identical Markdown across every runtime while enjoying native extension performance.

[![Crates.io](https://img.shields.io/crates/v/html-to-markdown-rs.svg)](https://crates.io/crates/html-to-markdown-rs)
[![npm version](https://badge.fury.io/js/html-to-markdown-node.svg)](https://www.npmjs.com/package/html-to-markdown-node)
[![PyPI version](https://badge.fury.io/py/html-to-markdown.svg)](https://pypi.org/project/html-to-markdown/)
[![Gem Version](https://badge.fury.io/rb/html-to-markdown.svg)](https://rubygems.org/gems/html-to-markdown)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Goldziher/html-to-markdown/blob/main/LICENSE)

## Features

- ⚡ **Rust-fast**: Ruby bindings around a highly optimised Rust core (60–80× faster than BeautifulSoup-based converters).
- 🔁 **Identical output**: Shares logic with the Python wheels, npm bindings, WASM package, and CLI, for consistent Markdown everywhere.
- ⚙️ **Rich configuration**: Control heading styles, list indentation, whitespace handling, HTML preprocessing, and more.
- 🖼️ **Inline image extraction**: Pull out embedded images (PNG/JPEG/SVG/data URIs) alongside Markdown.
- 🧰 **Bundled CLI proxy**: Call the Rust CLI straight from Ruby or shell scripts.
- 🛠️ **First-class Rails support**: Works with `Gem.win_platform?` builds, supports Trusted Publishing, and compiles on install if no native gem matches.

## Documentation & Support

- [GitHub repository](https://github.com/Goldziher/html-to-markdown)
- [Issue tracker](https://github.com/Goldziher/html-to-markdown/issues)
- [Changelog](https://github.com/Goldziher/html-to-markdown/blob/main/CHANGELOG.md)
- [Live demo (WASM)](https://goldziher.github.io/html-to-markdown/)

## Installation

```bash
bundle add html-to-markdown
# or
gem install html-to-markdown
```

Add the gem to your project and Bundler will compile the native Rust extension on first install.

### Requirements

- Ruby **3.2+** (Magnus relies on the fiber scheduler APIs added in 3.2)
- Rust toolchain **1.85+** with Cargo available on your `$PATH`
- Ruby development headers (`ruby-dev`, `ruby-devel`, or the platform equivalent)

**Windows**: install [RubyInstaller with MSYS2](https://rubyinstaller.org/) (UCRT64). Run once:

```powershell
ridk exec pacman -S --needed --noconfirm base-devel mingw-w64-ucrt-x86_64-toolchain
```

This provides the standard headers (including `strings.h`) required for the bindgen step.

## Performance Snapshot

Apple M4 • Real Wikipedia documents • `HtmlToMarkdown.convert` (Ruby)

| Document            | Size  | Latency | Throughput | Docs/sec |
| ------------------- | ----- | ------- | ---------- | -------- |
| Lists (Timeline)    | 129KB | 0.69ms  | 187 MB/s   | 1,450    |
| Tables (Countries)  | 360KB | 2.19ms  | 164 MB/s   | 456      |
| Mixed (Python wiki) | 656KB | 4.88ms  | 134 MB/s   | 205      |

> Same core, same benchmarks: the Ruby extension stays within single-digit % of the Rust CLI and mirrors the Python/Node numbers.

## Quick Start

```ruby
require 'html_to_markdown'

html = <<~HTML
  <h1>Welcome</h1>
  <p>This is <strong>Rust-fast</strong> conversion!</p>
  <ul>
    <li>Native extension</li>
    <li>Identical output across languages</li>
  </ul>
HTML

markdown = HtmlToMarkdown.convert(html)
puts markdown
# # Welcome
#
# This is **Rust-fast** conversion!
#
# - Native extension
# - Identical output across languages
```

## API

### Conversion Options

Pass a Ruby hash (string or symbol keys) to tweak rendering. Every option maps one-for-one with the Rust/Python/Node APIs.

```ruby
require 'html_to_markdown'

markdown = HtmlToMarkdown.convert(
  '<pre><code class="language-ruby">puts "hi"</code></pre>',
  heading_style: :atx,
  code_block_style: :fenced,
  bullets: '*+-',
  list_indent_type: :spaces,
  list_indent_width: 2,
  whitespace_mode: :normalized,
  highlight_style: :double_equal
)

puts markdown
```

### HTML Preprocessing

Clean up scraped HTML (navigation, forms, malformed markup) before conversion:

```ruby
require 'html_to_markdown'

markdown = HtmlToMarkdown.convert(
  html,
  preprocessing: {
    enabled: true,
    preset: :aggressive, # :minimal, :standard, :aggressive
    remove_navigation: true,
    remove_forms: true
  }
)
```

### Inline Images

Extract inline binary data (data URIs, SVG) together with the converted Markdown.

```ruby
require 'html_to_markdown'

result = HtmlToMarkdown.convert_with_inline_images(
  '<img src="..." alt="Pixel">',
  image_config: {
    max_decoded_size_bytes: 1 * 1024 * 1024,
    infer_dimensions: true,
    filename_prefix: 'img_',
    capture_svg: true
  }
)

puts result.markdown
result.inline_images.each do |img|
  puts "#{img.filename} -> #{img.format} (#{img.data.bytesize} bytes)"
end
```

## CLI

The gem bundles a small proxy for the Rust CLI binary. Use it when you need parity with the standalone `html-to-markdown` executable.

```ruby
require 'html_to_markdown/cli'

HtmlToMarkdown::CLI.run(%w[--heading-style atx input.html], stdout: $stdout)
# => writes converted Markdown to STDOUT
```

You can also call the CLI binary directly for scripting:

```ruby
HtmlToMarkdown::CLIProxy.call(['--version'])
# => "html-to-markdown 2.5.6"
```

Rebuild the CLI locally if you see `CLI binary not built` during tests:

```bash
bundle exec rake compile                      # builds the extension
bundle exec ruby scripts/prepare_ruby_gem.rb  # copies the CLI into lib/bin/
```

## Error Handling

Conversion errors raise `HtmlToMarkdown::Error` (wrapping the Rust error context). CLI invocations use specialised subclasses:

- `HtmlToMarkdown::CLIProxy::MissingBinaryError`
- `HtmlToMarkdown::CLIProxy::CLIExecutionError`

Rescue them to provide clearer feedback in your application.

## Consistent Across Languages

The Ruby gem shares the exact Rust core with:

- [Python wheels](https://pypi.org/project/html-to-markdown/)
- [Node.js / Bun bindings](https://www.npmjs.com/package/html-to-markdown-node)
- [WebAssembly package](https://www.npmjs.com/package/html-to-markdown-wasm)
- The Rust crate and CLI

Use whichever runtime fits your stack while keeping formatting behaviour identical.

## Development

```bash
bundle exec rake compile   # build the native extension
bundle exec rspec          # run test suite
```

The extension uses [Magnus](https://github.com/matsadler/magnus) plus `rb-sys` for bindgen. When editing the Rust code under `src/`, rerun `rake compile`.

## License

MIT © Na'aman Hirschfeld
 Dependencies

Runtime

>= 0.9, < 1.0
 Project Readme

html-to-markdown

High-performance HTML → Markdown conversion powered by Rust. Ships as a Rust crate, Python package, Ruby gem, Node.js bindings, WebAssembly package, and standalone CLI, all with identical rendering behaviour.



Experience WebAssembly-powered HTML to Markdown conversion instantly in your browser. No installation needed!


Why html-to-markdown?

  • Blazing Fast: Rust-powered core delivers 10-80× faster conversion than pure Python alternatives
  • Universal: Runs on Node.js, Bun, Deno, browsers, Python, Ruby, Rust, and as a standalone CLI
  • Smart Conversion: Handles complex documents including nested tables, code blocks, task lists, and hOCR OCR output
  • Highly Configurable: Control heading styles, code block fences, list formatting, whitespace handling, and HTML sanitization
  • Tag Preservation: Keep specific HTML tags unconverted when markdown isn't expressive enough
  • Secure by Default: Built-in HTML sanitization prevents malicious content
  • Consistent Output: Identical markdown rendering across all language bindings

Documentation

Installation

| Target                  | Command                                                    |
| ----------------------- | ---------------------------------------------------------- |
| Node.js/Bun (native)    | `npm install html-to-markdown-node`                        |
| WebAssembly (universal) | `npm install html-to-markdown-wasm`                        |
| Deno                    | `import { convert } from "npm:html-to-markdown-wasm"`      |
| Python (bindings + CLI) | `pip install html-to-markdown`                             |
| Ruby gem                | `bundle add html-to-markdown` or `gem install html-to-markdown` |
| Rust crate              | `cargo add html-to-markdown-rs`                            |
| Rust CLI                | `cargo install html-to-markdown-cli`                       |
| Homebrew CLI            | `brew tap goldziher/tap`, then `brew install html-to-markdown` |
| Releases                | GitHub Releases                                            |

Quick Start

JavaScript/TypeScript

Node.js / Bun (Native - Fastest):

import { convert } from 'html-to-markdown-node';

const html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>';
const markdown = convert(html, {
  headingStyle: 'Atx',
  codeBlockStyle: 'Backticks',
  wrap: true,
  preserveTags: ['table'], // NEW in v2.5: Keep complex HTML as-is
});

Deno / Browsers / Edge (Universal):

import { convert } from "npm:html-to-markdown-wasm"; // Deno
// or: import { convert } from 'html-to-markdown-wasm'; // Bundlers

const markdown = convert(html, {
  headingStyle: 'atx',
  listIndentWidth: 2,
});

Performance: Native bindings average ~18k ops/sec, WASM averages ~16k ops/sec (benchmarked on complex real-world documents).

See the JavaScript guides for full API documentation.

CLI

# Convert a file
html-to-markdown input.html > output.md

# Stream from stdin
curl https://example.com | html-to-markdown > output.md

# Apply options
html-to-markdown --heading-style atx --list-indent-width 2 input.html
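
The stdin streaming form above can also be driven from a script. The following is a minimal Ruby sketch, assuming the `html-to-markdown` executable is on your `PATH`. The helper name `convert_via_cli` and its `command:` parameter are hypothetical and not part of any package. It uses only the standard library's `Open3`.

```ruby
require 'open3'

# Hypothetical helper: pipes HTML into a command's stdin and returns its stdout.
# The default command assumes the standalone `html-to-markdown` binary is installed.
def convert_via_cli(html, command: ['html-to-markdown'])
  stdout, status = Open3.capture2(*command, stdin_data: html)
  raise "#{command.first} exited with status #{status.exitstatus}" unless status.success?

  stdout
end
```

With the binary installed, `convert_via_cli(html, command: %w[html-to-markdown --heading-style atx])` would pass the same flags shown in the shell examples. A missing binary surfaces as `Errno::ENOENT` from `Open3`.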

Python (v2 API)

from html_to_markdown import convert, convert_with_inline_images, InlineImageConfig

html = "<h1>Hello</h1><p>Rust ❤️ Markdown</p>"
markdown = convert(html)

markdown, inline_images, warnings = convert_with_inline_images(
    '<img src="data:image/png;base64,...==" alt="Pixel">',
    image_config=InlineImageConfig(max_decoded_size_bytes=1024, infer_dimensions=True),
)

Rust

use html_to_markdown_rs::{convert, ConversionOptions, HeadingStyle};

let html = "<h1>Welcome</h1><p>Fast conversion</p>";
let markdown = convert(html, None)?;

let options = ConversionOptions {
    heading_style: HeadingStyle::Atx,
    ..Default::default()
};
let markdown = convert(html, Some(options))?;

See the language-specific READMEs for complete configuration, hOCR workflows, and inline image extraction.

Performance

Benchmarked on Apple M4 with complex real-world documents (Wikipedia articles, tables, lists):

Operations per Second (higher is better)

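| Document Type          | Node.js (NAPI) | WASM   | Python (PyO3) | Speedup (Node vs Python) |
| ---------------------- | -------------- | ------ | ------------- | ------------------------ |
| Small (5 paragraphs)   | 86,233         | 70,300 | 8,443         | 10.2×                    |
| Medium (25 paragraphs) | 18,979         | 15,282 | 1,846         | 10.3×                    |
| Large (100 paragraphs) | 4,907          | 3,836  | 438           | 11.2×                    |
| Tables (complex)       | 5,003          | 3,748  | 4,829         | 1.0×                     |
| Lists (nested)         | 1,819          | 1,391  | 1,165         | 1.6×                     |
| Wikipedia (129KB)      | 1,125          | 1,022  | -             | -                        |
| Wikipedia (653KB)      | 156            | 147    | -             | -                        |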
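
The speedup column is the ratio of Node.js to Python operations per second, rounded to one decimal place. As a quick check, this Ruby sketch recomputes it from the figures in the table. The constant and method names are illustrative only.

```ruby
# Ops/sec figures copied from the table above (Node.js NAPI vs Python PyO3).
OPS_PER_SEC = {
  'Small (5 paragraphs)' => { node: 86_233, python: 8_443 },
  'Medium (25 paragraphs)' => { node: 18_979, python: 1_846 },
  'Large (100 paragraphs)' => { node: 4_907, python: 438 },
  'Tables (complex)' => { node: 5_003, python: 4_829 },
  'Lists (nested)' => { node: 1_819, python: 1_165 }
}.freeze

# Speedup is the ratio of the two throughputs, rounded to one decimal place.
def speedup(fast_ops, slow_ops)
  (fast_ops.to_f / slow_ops).round(1)
end

SPEEDUPS = OPS_PER_SEC.transform_values { |ops| speedup(ops[:node], ops[:python]) }

SPEEDUPS.each { |doc, ratio| puts format('%-24s %.1fx', doc, ratio) }
```

The two Wikipedia rows are left out because the table reports no Python figure for them.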

Average Performance Summary

| Implementation    | Avg ops/sec  | vs WASM      | vs Python   | Best For                          |
| ----------------- | ------------ | ------------ | ----------- | --------------------------------- |
| Node.js (NAPI-RS) | 18,162       | 1.17× faster | 7.4× faster | Maximum throughput in Node.js/Bun |
| WebAssembly       | 15,536       | baseline     | 6.3× faster | Universal (Deno, browsers, edge)  |
| Python (PyO3)     | 2,465        | 6.3× slower  | baseline    | Python ecosystem integration      |
| Rust CLI/Binary   | 150-210 MB/s | -            | -           | Standalone processing             |
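
To relate ops/sec figures to the MB/s figure quoted for the CLI, you can time any converter yourself. The Ruby sketch below is not the harness behind the published numbers. `measure_throughput` is a hypothetical helper, and it assumes 1 MB = 1,000,000 bytes and a single-threaded loop timed with the monotonic clock.

```ruby
# Hypothetical helper: times a block over `iterations` runs and derives
# ops/sec, mean latency, and MB/s (assuming 1 MB = 1,000,000 bytes).
def measure_throughput(html, iterations: 100)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield html }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start

  ops_per_sec = iterations / elapsed
  {
    ops_per_sec: ops_per_sec,
    latency_ms: elapsed / iterations * 1000.0,
    mb_per_sec: ops_per_sec * html.bytesize / 1_000_000.0
  }
end

# Stand-in converter (a tag-stripping regex) so the sketch runs without the gem;
# with the gem installed, the block body would call HtmlToMarkdown.convert(html).
sample = '<p>Hello <strong>world</strong></p>' * 1_000
stats = measure_throughput(sample, iterations: 50) { |html| html.gsub(/<[^>]+>/, '') }
puts format('%.0f ops/sec, %.3f ms/op, %.1f MB/s', *stats.values_at(:ops_per_sec, :latency_ms, :mb_per_sec))
```

Results from a loop like this vary with warm-up, document mix, and hardware, so treat them as rough comparisons rather than a reproduction of the tables above.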

Key Insights

  • JavaScript bindings are fastest: Native Node.js bindings achieve ~18k ops/sec average, with WASM close behind at ~16k ops/sec
  • Python is 6-10× slower: Despite using the same Rust core, PyO3 FFI overhead significantly impacts Python performance
  • Small documents: Both JS implementations reach 70-90k ops/sec on simple HTML
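  • Document size and structure: The Node.js advantage over Python grows slightly with size (10.2× on small documents, 11.2× on large ones) but nearly disappears on complex tables (1.0×) and nested lists (1.6×)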

Note on Python performance: The current Python bindings have optimization opportunities. The v2 API with direct convert() calls performs best; avoid the v1 compatibility layer for performance-critical applications.

Compatibility (v1 → v2)

  • V2’s Rust core sustains 150–210 MB/s throughput, whereas V1 averaged ≈ 2.5 MB/s in its Python/BeautifulSoup implementation, which makes V2 roughly 60–80× faster.
  • The Python package offers a compatibility shim in html_to_markdown.v1_compat (convert_to_markdown, convert_to_markdown_stream, markdownify). Details and keyword mappings live in the Python README.
  • CLI flag changes, option renames, and other breaking updates are summarised in the CHANGELOG.

Community

Ruby

require 'html_to_markdown'

html = '<h1>Hello</h1><p>Rust ❤️ Markdown</p>'
markdown = HtmlToMarkdown.convert(html, heading_style: :atx, wrap: true)

puts markdown
# # Hello
#
# Rust ❤️ Markdown
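
When converting many documents, it can help to collect failures rather than abort on the first one. The sketch below is one possible pattern, not part of the gem. `convert_all` is a hypothetical helper that accepts any callable as `converter:`, and it rescues `StandardError` broadly for illustration. With the gem installed you would presumably pass `HtmlToMarkdown.method(:convert)` and rescue the gem's own error class instead.

```ruby
# Hypothetical helper: converts each document with a shared option set and
# records per-document failures instead of stopping at the first error.
def convert_all(documents, converter:, **options)
  results = {}
  failures = {}
  documents.each do |name, html|
    results[name] = converter.call(html, **options)
  rescue StandardError => e
    failures[name] = e.message
  end
  [results, failures]
end

# Stand-in converter so the sketch runs without the gem installed.
stub_converter = lambda do |html, **_options|
  raise ArgumentError, 'empty document' if html.empty?

  html.gsub(%r{</?p>}, '')
end

results, failures = convert_all(
  { 'a.html' => '<p>Hello</p>', 'b.html' => '' },
  converter: stub_converter,
  heading_style: :atx
)
puts results.inspect
puts failures.inspect
```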
