trafilatura

Extract readable content, comments, and metadata from web pages.

A Rust port of go-trafilatura, which itself ports the Python trafilatura library by Adrien Barbaresi.

Usage

Add to your Cargo.toml:

[dependencies]
trafilatura = "0.3"

Library

use trafilatura::{extract, Options};

let html = r#"<html><body>
  <nav>Menu items</nav>
  <article><p>This is the main article content.</p></article>
  <footer>Copyright 2024</footer>
</body></html>"#;

let result = extract(html, &Options::default()).unwrap();
println!("{}", result.content_text);   // "This is the main article content."
println!("{}", result.metadata.title); // extracted <title> or og:title

With options

use trafilatura::{extract, Options, ExtractionFocus};

let opts = Options::default()
    .with_fallback(true)              // use readability fallback
    .with_links(true)                 // preserve <a> tags in HTML output
    .with_focus(ExtractionFocus::FavorRecall); // extract more content

let result = extract(html, &opts).unwrap();

Markdown output

Enable the markdown feature to convert extracted content to Markdown:

[dependencies]
trafilatura = { version = "0.3", features = ["markdown"] }

use trafilatura::{extract, create_markdown_document, Options};

let result = extract(html, &Options::default()).unwrap();

// Just the content as markdown:
let md = result.content_markdown();

// Full document with YAML front matter + content + comments:
let doc = create_markdown_document(&result);

CLI

# Extract from a URL
trafilatura https://example.com/article

# Extract as Markdown (with front matter), keeping links
trafilatura --format md --links https://example.com/article

# Extract from a file
trafilatura path/to/page.html

# Include links in output
trafilatura --links https://example.com/article

What it extracts

Content — main article body as plain text, cleaned HTML, or Markdown
Comments — user comments, separately from article content
Metadata — title, author, date, description, site name, categories, tags, license, language, and image URL (from meta tags, OpenGraph, JSON-LD)

How it works

1. Parse HTML and extract metadata from <meta>, OpenGraph, and JSON-LD
2. Clean the DOM (remove scripts, styles, hidden elements, boilerplate)
3. Score and select content using CSS selector rules and paragraph heuristics
4. If primary extraction yields too little, fall back to readability-based extraction or baseline (last-resort) extraction
5. Filter duplicates and check language constraints
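
The fallback step can be pictured as a cascade: try each extractor in order and keep the first result that is long enough, otherwise settle for whatever the last-resort stage produced. The sketch below is only an illustration of that idea, not the crate's code. Every name in it (Stage, MIN_CHARS, extract_with_fallback, the toy_* extractors) is hypothetical, the 25-character threshold is an assumption, and the toy extractors use naive string matching rather than a real DOM.

```rust
// Hypothetical sketch of a fallback cascade. None of these names come from the crate.

/// Assumed threshold: results shorter than this count as "too little".
const MIN_CHARS: usize = 25;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Stage {
    Primary,
    Readability,
    Baseline,
}

type Extractor = fn(&str) -> String;

/// Drop everything between '<' and '>' and collapse whitespace.
fn strip_tags(html: &str) -> String {
    let mut out = String::new();
    let mut in_tag = false;
    for c in html.chars() {
        match c {
            '<' => in_tag = true,
            '>' => {
                in_tag = false;
                out.push(' ');
            }
            _ if !in_tag => out.push(c),
            _ => {}
        }
    }
    out.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Text between the first `<tag>` and the first `</tag>`, or "" if absent.
fn inner<'a>(html: &'a str, tag: &str) -> &'a str {
    let open = format!("<{}>", tag);
    let close = format!("</{}>", tag);
    match (html.find(open.as_str()), html.find(close.as_str())) {
        (Some(s), Some(e)) if s + open.len() <= e => &html[s + open.len()..e],
        _ => "",
    }
}

// Toy stand-ins for the three extraction strategies.
fn toy_primary(html: &str) -> String {
    strip_tags(inner(html, "article"))
}
fn toy_readability(html: &str) -> String {
    strip_tags(inner(html, "main"))
}
fn toy_baseline(html: &str) -> String {
    strip_tags(html)
}

/// Return the first stage whose output has at least `min_chars` characters.
/// If none qualifies, return the output of the last stage that ran.
fn extract_with_fallback(
    html: &str,
    stages: &[(Stage, Extractor)],
    min_chars: usize,
) -> Option<(Stage, String)> {
    let mut last = None;
    for &(stage, run) in stages {
        let text = run(html);
        if text.trim().chars().count() >= min_chars {
            return Some((stage, text));
        }
        last = Some((stage, text));
    }
    last
}

fn extract_toy(html: &str) -> (Stage, String) {
    let stages = [
        (Stage::Primary, toy_primary as Extractor),
        (Stage::Readability, toy_readability as Extractor),
        (Stage::Baseline, toy_baseline as Extractor),
    ];
    extract_with_fallback(html, &stages, MIN_CHARS).expect("stage list is not empty")
}

fn main() {
    let page = "<body><nav>Menu</nav><article><p>This is the main article content.</p></article></body>";
    let (stage, text) = extract_toy(page);
    println!("{:?}: {}", stage, text);
}
```

Keeping the stages in an ordered list makes the priority explicit and keeps the "good enough" test in one place, so a stricter or looser threshold only changes one constant.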

Language bindings

Native bindings are available via UniFFI for Swift, Kotlin, Ruby, Dart, C#, and JavaScript/TypeScript.

make test-bindings   # run all binding test suites
make test-swift      # Swift only (XCTest)
make test-kotlin     # Kotlin only (JUnit 5)
make test-ruby       # Ruby only (Minitest)
make test-dart       # Dart only
make test-cs         # C# only (xUnit)
make test-js         # JS/TS only (Vitest, WASM)

Swift (full API docs)

// Package.swift
.package(url: "https://github.com/nchapman/trafilatura-swift", from: "0.3.5"),

import Trafilatura

let result = try extractSimple(html: html)
print(result.contentText, result.metadata.title)

Kotlin / Android (full API docs)

// build.gradle.kts
implementation("io.github.nchapman:trafilatura:0.3.5")

import trafilatura.*

val result = extractSimple(html)
println("${result.contentText} ${result.metadata.title}")

Ruby (full API docs)

gem install trafilatura

require "trafilatura"

result = Trafilatura.extract_simple(html)
puts result.content_text, result.metadata.title

Dart

import 'package:trafilatura/trafilatura.dart';

final result = extractSimple(html: html);
print('${result.contentText} ${result.metadata.title}');

C# (.NET) (full API docs)

dotnet add package Trafilatura

using Trafilatura;

var result = Extractor.ExtractSimple(html);
Console.WriteLine($"{result.contentText} {result.metadata.title}");

JavaScript / TypeScript (WASM)

import { Trafilatura } from "./trafilatura.js";

const { extractSimple } = Trafilatura;
const result = extractSimple(html);
console.log(result.contentText, result.metadata.title);

Benchmarks

Speed

Extraction time per document, Rust vs Go (go-trafilatura) vs Python (trafilatura):

Document	Rust	Go	Python
small (6 KB)	793 µs	1.19 ms	1.1 ms
medium (85 KB)	5.7 ms	5.6 ms	6.2 ms
large (382 KB)	3.6 ms	4.9 ms	4.7 ms
xlarge (906 KB)	10.4 ms	13.9 ms	13.9 ms
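
To read the table as relative speed rather than absolute time, divide the other implementation's time by the Rust time. The snippet below does that arithmetic on the figures above, converted to microseconds. The names TIMINGS_US and speedup are made up for this example and are not part of the crate or its benchmark suite.

```rust
/// Timings copied from the table above, in microseconds:
/// (document, Rust, Go, Python).
const TIMINGS_US: [(&str, f64, f64, f64); 4] = [
    ("small", 793.0, 1190.0, 1100.0),
    ("medium", 5700.0, 5600.0, 6200.0),
    ("large", 3600.0, 4900.0, 4700.0),
    ("xlarge", 10400.0, 13900.0, 13900.0),
];

/// How many times faster `candidate` is than `baseline`.
/// Values above 1.0 mean the candidate is faster.
fn speedup(baseline_us: f64, candidate_us: f64) -> f64 {
    baseline_us / candidate_us
}

fn main() {
    for &(doc, rust, go, python) in TIMINGS_US.iter() {
        println!(
            "{:<7} Rust vs Go: {:.2}x, Rust vs Python: {:.2}x",
            doc,
            speedup(go, rust),
            speedup(python, rust)
        );
    }
}
```

On these numbers Rust is about 1.5x faster than Go on the small document and slightly slower (about 0.98x) on the medium one.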

Extraction quality

Evaluated on a 960-entry dataset (strings expected to be present/absent in extracted text):

Implementation	Precision	Recall	Accuracy	F-score
Rust (balanced + fallback)	0.908	0.919	0.913	0.913
Python trafilatura	0.920	0.909	0.915	0.914
Go go-trafilatura	0.909	0.921	0.914	0.915

All three implementations produce near-identical quality scores. Minor differences stem from HTML parser handling and Unicode normalization.
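
As a sanity check on the table, the F-score column can be recomputed from the precision and recall columns, assuming it is the usual F1 score (the harmonic mean of the two). That assumption is not stated in the benchmark itself. The helper names f_score and round3 are illustrative only.

```rust
/// F1 score: harmonic mean of precision and recall.
fn f_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        return 0.0; // avoid dividing by zero when both inputs are zero
    }
    2.0 * precision * recall / (precision + recall)
}

/// Round to three decimals, the precision used in the table.
fn round3(x: f64) -> f64 {
    (x * 1000.0).round() / 1000.0
}

fn main() {
    // (implementation, precision, recall) copied from the table above.
    let rows = [
        ("Rust (balanced + fallback)", 0.908, 0.919),
        ("Python trafilatura", 0.920, 0.909),
        ("Go go-trafilatura", 0.909, 0.921),
    ];
    for &(name, p, r) in rows.iter() {
        println!("{:<28} F = {:.3}", name, round3(f_score(p, r)));
    }
}
```

Under that assumption the recomputed values round to 0.913, 0.914, and 0.915, matching the F-score column.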

Measured on Apple M4 Max, Rust 1.93, macOS 15.7.

Reproduce:

cargo bench                                            # speed benchmarks
cargo test --test comparison_test -- --nocapture       # Rust quality scores
python3 scripts/compare_python.py > /dev/null          # Python quality (stderr)

License

Apache-2.0
