Nous

Crawl websites and extract readable Markdown, optimized for LLM consumption. Inspired by sitefetch.

Nous fetches same-host pages starting from a seed URL, extracts readable content, and outputs clean Markdown as XML-tagged text or JSON. It supports concurrent crawling, glob-based URL filtering, and two extraction backends: a local parser (ruby-readability) and the Jina Reader API for JS-rendered sites.

Installation

Add to your Gemfile:

gem "nous"

Or install directly:

gem install nous

CLI Usage

# Crawl a site and print extracted content to stdout
nous https://example.com

# Output as JSON
nous https://example.com -f json

# Write to a file
nous https://example.com -o site.md

# Limit pages and increase concurrency
nous https://example.com -l 20 -c 5

# Only crawl pages matching a glob pattern
nous https://example.com -m "/blog/*"

# Scope extraction to a CSS selector
nous https://example.com -s "article.post"

# Use Jina Reader API for JS-rendered sites (Next.js, SPAs)
nous https://example.com --jina

# Debug logging
nous https://example.com -d

Options

| Flag | Description | Default |
| --- | --- | --- |
| -o, --output PATH | Write output to file | stdout |
| -f, --format FORMAT | Output format: text or json | text |
| -c, --concurrency N | Concurrent requests | 3 |
| -m, --match PATTERN | Glob filter for URLs (repeatable) | none |
| -s, --selector SELECTOR | CSS selector to scope extraction | none |
| -l, --limit N | Maximum pages to fetch | 100 |
| --timeout N | Per-request timeout in seconds | 15 |
| --jina | Use Jina Reader API for extraction | off |
| -v, --version | Print version and exit | off |
| -h, --help | Print usage and exit | off |
| -d, --debug | Debug logging to stderr | off |

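The -m glob patterns filter crawled URLs by path. The gem's internal matcher isn't shown here, but Ruby's stdlib File.fnmatch illustrates the idea; this is a hedged sketch (path_matches? is a hypothetical helper, and the gem's actual glob semantics may differ):

```ruby
require "uri"

# Illustrative only: approximates how a glob like "/blog/*" could be
# matched against a URL's path component.
def path_matches?(url, pattern)
  path = URI.parse(url).path
  # FNM_PATHNAME makes "*" stop at "/" separators, like shell globs
  File.fnmatch(pattern, path, File::FNM_PATHNAME)
end

path_matches?("https://example.com/blog/hello", "/blog/*")  # matches
path_matches?("https://example.com/about", "/blog/*")       # does not match
```

With FNM_PATHNAME, "/blog/*" matches direct children of /blog but not deeper paths like /blog/2024/post; dropping the flag would make "*" span slashes as well.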
Ruby API

Basic Usage

require "nous"

# Fetch pages with the default extractor
pages = Nous.fetch("https://example.com", limit: 10, concurrency: 3)

# Each page is a Nous::Page with title, url, pathname, content, metadata
pages.each do |page|
  puts "#{page.title} (#{page.url})"
  puts page.content
end

# Serialize to XML-tagged text
text = Nous.serialize(pages, format: :text)

# Serialize to JSON
json = Nous.serialize(pages, format: :json)

# Use the Jina extractor for JS-heavy sites
pages = Nous.fetch("https://spa-site.com",
  extractor: Nous::Extractor::Jina.new,
  limit: 5
)

Detailed Results

Use the details: true option to receive full fetch results including failures:

result = Nous.fetch("https://example.com", details: true)

result.pages       # Array<Nous::Page> - successfully extracted pages
result.failures    # Array<{requested_url:, error:}> - failed fetches
result.total_requested  # Integer - total URLs attempted
result.all_succeeded?   # Boolean - true if no failures
result.any_succeeded?   # Boolean - true if at least one page extracted

This is useful when you need to handle failures explicitly:

result = Nous.fetch("https://example.com/api-docs", details: true)

if result.failures.any?
  puts "Failed to fetch:"
  result.failures.each do |failure|
    puts "  #{failure[:requested_url]}: #{failure[:error]}"
  end
end

result.pages.each do |page|
  puts "Successfully extracted: #{page.title}"
end
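The predicates on the result object follow directly from its pages and failures arrays. A minimal sketch of how such a result type could behave (FetchResult is a hypothetical stand-in, not the gem's actual class):

```ruby
# Hypothetical stand-in for the fetch result; illustrates the relationship
# between pages, failures, and the convenience predicates.
FetchResult = Struct.new(:pages, :failures) do
  def total_requested
    pages.length + failures.length
  end

  def all_succeeded?
    failures.empty?
  end

  def any_succeeded?
    pages.any?
  end
end

result = FetchResult.new(
  [{title: "Home"}],
  [{requested_url: "https://example.com/404", error: "404"}]
)
result.total_requested  # => 2
result.all_succeeded?   # => false
result.any_succeeded?   # => true
```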

Page Structure

Each extracted page contains:

| Field | Type | Description |
| --- | --- | --- |
| title | String | Page title (fallback chain: readability → <title> tag → <h1>) |
| url | String | Final URL after redirects |
| pathname | String | URL path component |
| content | String | Extracted content as Markdown |
| metadata | Hash | Provenance information (see below) |

Page Metadata

page.metadata  # => {
  #   extractor: "Nous::Extractor::Default",  # Which extractor was used
  #   requested_url: "https://example.com/blog", # Original URL before redirects
  #   content_type: "text/html; charset=utf-8",  # HTTP Content-Type header
  #   redirected: true                           # Whether redirects occurred
  # }

Extraction Backends

Default (ruby-readability)

Parses static HTML using ruby-readability, strips noisy elements (script, style, nav, footer), and converts to Markdown via reverse_markdown. Fast and requires no external services, but cannot extract content from JS-rendered pages.

Title extraction uses a fallback chain:

  1. Readability's extracted title
  2. Original <title> tag from HTML
  3. First <h1> from extracted content
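In spirit, the chain resolves to the first non-empty candidate. A simplified sketch (the real extractor uses a proper HTML parser, not regexes; fallback_title is an illustrative helper, not the gem's API):

```ruby
# Simplified illustration of a title fallback chain. Each candidate is
# tried in order; blank or missing values fall through to the next.
def fallback_title(readability_title, html)
  candidates = [
    readability_title,
    html[%r{<title>(.*?)</title>}mi, 1],
    html[%r{<h1[^>]*>(.*?)</h1>}mi, 1]
  ]
  candidates.find { |t| t && !t.strip.empty? }&.strip
end

html = "<html><head><title></title></head>" \
       "<body><h1>Fallback Heading</h1></body></html>"
fallback_title(nil, html)  # empty <title>, so the <h1> wins
```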

Jina Reader API

Uses the Jina Reader API, which renders pages with headless Chrome. Handles Next.js App Router, React Server Components, SPAs, and other JS-heavy sites. The free tier allows 20 requests per minute without a key, or 500 requests per minute with a JINA_API_KEY environment variable.

Output Formats

Text (default)

XML-tagged output designed for LLM context windows:

<page>
  <title>Page Title</title>
  <url>https://example.com/page</url>
  <pathname>/page</pathname>
  <extractor>Nous::Extractor::Default</extractor>
  <content>
# Heading

Extracted markdown content...
  </content>
</page>
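This shape is plain string interpolation over page fields. A hedged sketch of a serializer producing it (page_to_text is illustrative; the gem's real serializer may differ in escaping and whitespace):

```ruby
# Illustrative serializer for the XML-tagged text format shown above.
def page_to_text(page)
  <<~PAGE
    <page>
      <title>#{page[:title]}</title>
      <url>#{page[:url]}</url>
      <pathname>#{page[:pathname]}</pathname>
      <extractor>#{page[:metadata][:extractor]}</extractor>
      <content>
    #{page[:content]}
      </content>
    </page>
  PAGE
end

page = {
  title: "Page Title",
  url: "https://example.com/page",
  pathname: "/page",
  content: "# Heading\n\nExtracted markdown content...",
  metadata: {extractor: "Nous::Extractor::Default"}
}
puts page_to_text(page)
```

Note the content sits at column 0 inside the <content> tags, so Markdown headings and code blocks survive without extra indentation.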

JSON

[
  {
    "title": "Page Title",
    "url": "https://example.com/page",
    "pathname": "/page",
    "content": "# Heading\n\nExtracted markdown content...",
    "metadata": {
      "extractor": "Nous::Extractor::Default",
      "requested_url": "https://example.com/page",
      "content_type": "text/html; charset=utf-8",
      "redirected": false
    }
  }
]
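The JSON output round-trips cleanly through Ruby's stdlib JSON module. A small sketch parsing a captured output string (sample data inlined for illustration):

```ruby
require "json"

# Sample of the JSON output shape shown above, inlined for illustration.
raw = <<~DOC
  [
    {
      "title": "Page Title",
      "url": "https://example.com/page",
      "pathname": "/page",
      "content": "# Heading",
      "metadata": {"extractor": "Nous::Extractor::Default", "redirected": false}
    }
  ]
DOC

pages = JSON.parse(raw, symbolize_names: true)
pages.map { |p| p[:title] }                       # => ["Page Title"]
pages.reject { |p| p.dig(:metadata, :redirected) }  # pages fetched without redirects
```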

Development

bin/setup               # Install dependencies
bundle exec rspec       # Run tests
bundle exec standardrb  # Lint
bundle exec exe/nous    # Run the in-development CLI

License

MIT License. See LICENSE.txt.