Readability

Ruby port of Mozilla Readability.js -- extract readable article content from HTML pages, like Firefox Reader View.

Passes all 130 Mozilla test fixtures.

Installation

Add this line to your application's Gemfile:

gem "readability-rb"

Quick Start

result = Readability.parse(html, url: "https://example.com/article")

result.title          # article title
result.byline         # author name
result.content        # cleaned HTML content
result.text_content   # plain text content
result.excerpt        # short summary
result.length         # text content length
result.site_name      # site name
result.published_time # publication date
result.dir            # text direction
result.lang           # language

Usage

Parse an article

require "net/http"

# Fetch the raw HTML of the page as a string.
html = Net::HTTP.get(URI("https://example.com/article"))
result = Readability.parse(html, url: "https://example.com/article")

puts result.title
puts result.content

Readability.parse returns a Readability::Result, or nil if parsing fails.
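
Because the return value may be nil, callers usually want a guard before reading fields. The following is a minimal sketch, not part of the gem: `summarize` is a hypothetical helper, and `FakeResult` is a stand-in Struct that mimics only the `title` and `text_content` readers so the example runs without the gem installed.

```ruby
# Stand-in for Readability::Result so this sketch runs without the gem;
# only the title and text_content readers from the Quick Start are mimicked.
FakeResult = Struct.new(:title, :text_content, keyword_init: true)

# Hypothetical helper: returns a one-line summary, or a fallback string
# when parsing failed and the result is nil.
def summarize(result, fallback: "(no readable content)")
  return fallback if result.nil?

  title = result.title.to_s.strip
  title = "(untitled)" if title.empty?
  words = result.text_content.to_s.split.size
  "#{title} (#{words} words)"
end

puts summarize(FakeResult.new(title: "Hello", text_content: "one two three"))
puts summarize(nil)
```

In real code the argument would be whatever `Readability.parse` returned.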

Check if a page is readable

if Readability.readerable?(html)
  result = Readability.parse(html)
end

Readability.readerable? also accepts min_score and min_content_length options:

Readability.readerable?(html, min_score: 30, min_content_length: 200)
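
The check and the parse can be combined into one call. This is a rough sketch under stated assumptions: `extract_if_readable` is a hypothetical helper, and the `reader` parameter exists only so a stub can be swapped in. It relies solely on the `readerable?` and `parse` calls shown above.

```ruby
# Hypothetical helper: parse only pages that pass the readerable? check.
# `reader` defaults to the Readability module. Ruby evaluates that default
# only when no reader is passed, so a stub can be injected for testing.
# Extra keyword arguments (min_score, min_content_length) go to readerable?.
def extract_if_readable(html, url: nil, reader: Readability, **check_opts)
  return nil unless reader.readerable?(html, **check_opts)

  reader.parse(html, url: url)
end

# Intended usage once the gem is loaded (not executed here):
#   extract_if_readable(html, url: "https://example.com/article", min_score: 30)
```

The helper returns nil both when the page is judged unreadable and when parsing itself fails, so one nil check covers both cases.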

Use the lower-level API

Pass a Nokogiri document directly.

require "nokogiri"

doc = Nokogiri::HTML5(html)
result = Readability::Document.new(doc, url: "https://example.com").parse

Custom serializer

Replace the default HTML serializer by passing a lambda that receives the content element.

result = Readability.parse(html, serializer: ->(el) { el.to_html })
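
A serializer can also post-process the markup. The sketch below assumes the element handed to the lambda is a Nokogiri node that responds to `inner_html`, which matches the documented default. `ARTICLE_WRAPPER` and `FakeElement` are illustrative names, not part of the gem.

```ruby
# Hypothetical serializer: wraps the element's inner HTML in an <article> tag.
# It only assumes the element responds to inner_html, as Nokogiri nodes do.
ARTICLE_WRAPPER = ->(el) { "<article>#{el.inner_html}</article>" }

# Stand-in element so the lambda can be exercised without Nokogiri or the gem.
FakeElement = Struct.new(:inner_html)

puts ARTICLE_WRAPPER.call(FakeElement.new("<p>Hello</p>"))
# prints: <article><p>Hello</p></article>
```

It would be passed the same way as the lambda above, as `serializer: ARTICLE_WRAPPER`.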

Options

Option	Description	Default
`url`	Base URL for resolving relative links	`nil`
`max_elems_to_parse`	Max elements before aborting (0 = no limit)	`0`
`nb_top_candidates`	Number of top candidates to consider	`5`
`char_threshold`	Min characters for a successful parse	`500`
`classes_to_preserve`	CSS classes to keep on elements	`[]`
`keep_classes`	Preserve all CSS classes	`false`
`disable_json_ld`	Skip JSON-LD metadata extraction	`false`
`allowed_video_regex`	Regex for allowed video embed URLs	built-in
`link_density_modifier`	Adjust link density calculation	`0`
`serializer`	Lambda to serialize the content element	`inner_html`
`max_attributes`	Max attributes per element at parse time	`1000`
`max_tree_depth`	Max document tree depth at parse time (`-1` to disable)	`1000`
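
Several options can be bundled into a reusable preset. This sketch assumes every option in the table is passed as a keyword argument to `Readability.parse`, as `url` and `serializer` are in the examples above. The values, `STRICT_OPTIONS`, and `parse_strict` are illustrative, not recommendations from the project.

```ruby
# Illustrative values only; the option names come from the table above.
STRICT_OPTIONS = {
  max_elems_to_parse: 5_000, # abort on very large documents
  char_threshold: 1_000,     # require a longer article body
  disable_json_ld: true      # skip JSON-LD metadata extraction
}.freeze

# Hypothetical helper: applies the preset on every call. `parser` defaults
# to the Readability module and is injectable so the call can be inspected.
def parse_strict(html, url:, parser: Readability)
  parser.parse(html, url: url, **STRICT_OPTIONS)
end

# Intended usage once the gem is loaded (not executed here):
#   result = parse_strict(html, url: "https://example.com/article")
```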

Contributing

Everyone is encouraged to help improve this project. Here are a few ways you can help:

Report bugs
Fix bugs and submit pull requests
Write, clarify, or fix documentation
Suggest or add new features

License

Apache 2.0
