# Nous
Crawl websites and extract readable Markdown, optimized for LLM consumption. Inspired by sitefetch.
Nous fetches same-host pages starting from a seed URL, extracts readable content, and outputs clean Markdown as XML-tagged text or JSON. It supports concurrent crawling, glob-based URL filtering, and two extraction backends: a local parser (ruby-readability) and the Jina Reader API for JS-rendered sites.
## Installation
Add to your Gemfile:
```ruby
gem "nous"
```

Or install directly:

```sh
gem install nous
```

## CLI Usage
```sh
# Crawl a site and print extracted content to stdout
nous https://example.com

# Output as JSON
nous https://example.com -f json

# Write to a file
nous https://example.com -o site.md

# Limit pages and increase concurrency
nous https://example.com -l 20 -c 5

# Only crawl pages matching a glob pattern
nous https://example.com -m "/blog/*"

# Scope extraction to a CSS selector
nous https://example.com -s "article.post"

# Use Jina Reader API for JS-rendered sites (Next.js, SPAs)
nous https://example.com --jina

# Debug logging
nous https://example.com -d
```

### Options
| Flag | Description | Default |
|---|---|---|
| `-o, --output PATH` | Write output to file | stdout |
| `-f, --format FORMAT` | Output format: `text` or `json` | `text` |
| `-c, --concurrency N` | Concurrent requests | 3 |
| `-m, --match PATTERN` | Glob filter for URLs (repeatable) | none |
| `-s, --selector SELECTOR` | CSS selector to scope extraction | none |
| `-l, --limit N` | Maximum pages to fetch | 100 |
| `--timeout N` | Per-request timeout in seconds | 15 |
| `--jina` | Use Jina Reader API for extraction | off |
| `-v, --version` | Print version and exit | off |
| `-h, --help` | Print usage and exit | off |
| `-d, --debug` | Debug logging to stderr | off |
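The `-m` glob patterns filter crawled URLs by path. A rough sketch of that kind of matching, assuming shell-style glob semantics along the lines of Ruby's `File.fnmatch` (the gem's exact matcher may differ):

```ruby
require "uri"

# Hypothetical helper: does a URL's path match a CLI-style glob pattern?
def path_matches?(url, pattern)
  path = URI(url).path
  # FNM_PATHNAME keeps "*" from crossing "/" boundaries
  File.fnmatch(pattern, path, File::FNM_PATHNAME)
end

path_matches?("https://example.com/blog/post-1", "/blog/*")  # => true
path_matches?("https://example.com/about", "/blog/*")        # => false
```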
## Ruby API
### Basic Usage
```ruby
require "nous"

# Fetch pages with the default extractor
pages = Nous.fetch("https://example.com", limit: 10, concurrency: 3)

# Each page is a Nous::Page with title, url, pathname, content, metadata
pages.each do |page|
  puts "#{page.title} (#{page.url})"
  puts page.content
end

# Serialize to XML-tagged text
text = Nous.serialize(pages, format: :text)

# Serialize to JSON
json = Nous.serialize(pages, format: :json)

# Use the Jina extractor for JS-heavy sites
pages = Nous.fetch("https://spa-site.com",
  extractor: Nous::Extractor::Jina.new,
  limit: 5
)
```

### Detailed Results
Use the `details: true` option to receive full fetch results, including failures:
```ruby
result = Nous.fetch("https://example.com", details: true)

result.pages           # Array<Nous::Page> - successfully extracted pages
result.failures        # Array<{requested_url:, error:}> - failed fetches
result.total_requested # Integer - total URLs attempted
result.all_succeeded?  # Boolean - true if no failures
result.any_succeeded?  # Boolean - true if at least one page extracted
```

This is useful when you need to handle failures explicitly:
```ruby
result = Nous.fetch("https://example.com/api-docs", details: true)

if result.failures.any?
  puts "Failed to fetch:"
  result.failures.each do |failure|
    puts "  #{failure[:requested_url]}: #{failure[:error]}"
  end
end

result.pages.each do |page|
  puts "Successfully extracted: #{page.title}"
end
```

### Page Structure
Each extracted page contains:
| Field | Type | Description |
|---|---|---|
| `title` | String | Page title (fallback chain: readability → `<title>` tag → `<h1>`) |
| `url` | String | Final URL after redirects |
| `pathname` | String | URL path component |
| `content` | String | Extracted content as Markdown |
| `metadata` | Hash | Provenance information (see below) |
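As a rough mental model, a page behaves like a plain value object with those five fields. The `Struct` below is a stand-in for illustration only, not Nous's actual class definition:

```ruby
# Illustrative stand-in for the page shape described above
Page = Struct.new(:title, :url, :pathname, :content, :metadata, keyword_init: true)

page = Page.new(
  title:    "Page Title",
  url:      "https://example.com/page",
  pathname: "/page",
  content:  "# Heading\n\nExtracted markdown content...",
  metadata: {extractor: "Nous::Extractor::Default", redirected: false}
)

puts "#{page.title} (#{page.pathname})"  # => Page Title (/page)
```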
### Page Metadata
```ruby
page.metadata # => {
#   extractor: "Nous::Extractor::Default",     # Which extractor was used
#   requested_url: "https://example.com/blog", # Original URL before redirects
#   content_type: "text/html; charset=utf-8",  # HTTP Content-Type header
#   redirected: true                           # Whether redirects occurred
# }
```

## Extraction Backends
### Default (ruby-readability)
Parses static HTML using ruby-readability, strips noisy elements (`script`, `style`, `nav`, `footer`), and converts to Markdown via reverse_markdown. Fast and requires no external services, but cannot extract content from JS-rendered pages.
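The stages (strip noisy tags, then map HTML to Markdown) can be sketched with stdlib regexes alone. This is a conceptual illustration only; the gem itself uses proper HTML parsing via ruby-readability and reverse_markdown:

```ruby
# Tags the pipeline discards before conversion
NOISY = %w[script style nav footer]

# Remove each noisy element along with its contents (regex sketch, not a real HTML parser)
def strip_noise(html)
  NOISY.reduce(html) do |doc, tag|
    doc.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}>}mi, "")
  end
end

html  = "<nav>menu</nav><h1>Title</h1><p>Body.</p><script>x()</script>"
clean = strip_noise(html)

# Toy HTML-to-Markdown step: headings become "#", paragraphs become blank lines
markdown = clean.gsub(%r{<h1>(.*?)</h1>}, "# \\1\n").gsub(%r{</?p>}, "\n").strip
puts markdown
# => # Title
#
#    Body.
```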
Title extraction uses a fallback chain:

1. Readability's extracted title
2. The original `<title>` tag from the HTML
3. The first `<h1>` in the extracted content
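That chain amounts to a simple `||` cascade. The helper below is hypothetical and uses stdlib regexes where the gem would consult readability's parsed document:

```ruby
# Hypothetical sketch of the title fallback chain described above
def title_for(readability_title, html, content_md)
  return readability_title if readability_title && !readability_title.empty?

  html[%r{<title>(.*?)</title>}m, 1] ||  # 2. original <title> tag
    content_md[/^#\s+(.+)$/, 1]          # 3. first heading in extracted Markdown
end

title_for("Ready", "", "")                                 # => "Ready"
title_for(nil, "<title>From Tag</title>", "# From H1")     # => "From Tag"
title_for(nil, "", "# From H1")                            # => "From H1"
```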
### Jina Reader API
Uses the Jina Reader API, which renders pages with headless Chrome. It handles Next.js App Router, React Server Components, SPAs, and other JS-heavy sites. The free tier allows 20 requests per minute without a key, or 500 requests per minute with a `JINA_API_KEY` environment variable set.
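For reference, the Reader API is addressed by prefixing the target URL with the `https://r.jina.ai/` endpoint, and an API key, when present, travels as a bearer token. A sketch with hypothetical helper names (not Nous's internals; consult Jina's documentation for current details):

```ruby
# Build the Reader URL: the page to fetch is appended to the endpoint verbatim
def jina_url(target)
  "https://r.jina.ai/#{target}"
end

# Attach the key as a bearer token only when one is available
def jina_headers(api_key = ENV["JINA_API_KEY"])
  api_key ? {"Authorization" => "Bearer #{api_key}"} : {}
end

jina_url("https://example.com/page")
# => "https://r.jina.ai/https://example.com/page"
```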
## Output Formats
### Text (default)
XML-tagged output designed for LLM context windows:
```xml
<page>
  <title>Page Title</title>
  <url>https://example.com/page</url>
  <pathname>/page</pathname>
  <extractor>Nous::Extractor::Default</extractor>
  <content>
# Heading

Extracted markdown content...
  </content>
</page>
```

### JSON
```json
[
  {
    "title": "Page Title",
    "url": "https://example.com/page",
    "pathname": "/page",
    "content": "# Heading\n\nExtracted markdown content...",
    "metadata": {
      "extractor": "Nous::Extractor::Default",
      "requested_url": "https://example.com/page",
      "content_type": "text/html; charset=utf-8",
      "redirected": false
    }
  }
]
```

## Development
```sh
bin/setup              # Install dependencies
bundle exec rspec      # Run tests
bundle exec standardrb # Lint
bundle exec exe/nous   # Run the CLI in development
```

## License
MIT License. See LICENSE.txt.