No release in over 3 years
There are a lot of open issues
Document extraction for RAG pipelines. Loads PDF, DOCX, CSV, HTML, and web pages into a normalized Document format for chunking and embedding.
 Dependencies

Development

~> 5.0
~> 13.0
~> 3.0

Runtime

>= 0
 Project Readme

loader-ruby

Document loader library for Ruby RAG pipelines. Load text from PDF, HTML, CSV, DOCX, and web URLs.

Installation

gem "loader-ruby"

Usage

require "loader_ruby"

# Auto-detect format from file extension
doc = LoaderRuby.load("report.pdf")
doc = LoaderRuby.load("data.csv")
doc = LoaderRuby.load("page.html")

# Web loader with redirect handling
doc = LoaderRuby.load("https://example.com/article")

# PDF with password
loader = LoaderRuby::Loaders::Pdf.new("encrypted.pdf", password: "secret")
doc = loader.load

# Access content
doc.content   # => extracted text
doc.metadata  # => { source: "report.pdf", ... }
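
The web loader's redirect handling can be pictured with a small standalone sketch. This is not the gem's implementation: the `RedirectSketch` module, its `follow` method, and the injectable `fetch` callable are hypothetical names made up for illustration, and the error hierarchy shown is only assumed to resemble the one the README lists. The default limit of 5 comes from the feature list.

```ruby
require "uri"

# Illustrative sketch only; none of these names are confirmed parts of loader-ruby.
module RedirectSketch
  # Assumed hierarchy: one base error, with specific errors beneath it.
  class Error < StandardError; end
  class TooManyRedirectsError < Error; end

  DEFAULT_MAX_REDIRECTS = 5

  # fetch is any callable that takes a URL string and returns
  # [status_code, headers_hash, body]. Injecting it keeps the sketch
  # independent of a particular HTTP client.
  def self.follow(url, max_redirects: DEFAULT_MAX_REDIRECTS, fetch:)
    current = url
    redirects = 0

    loop do
      status, headers, body = fetch.call(current)
      location = headers["location"]

      # Anything that is not a 3xx response with a Location header is final.
      return { url: current, body: body } unless (300..399).cover?(status) && location

      redirects += 1
      if redirects > max_redirects
        raise TooManyRedirectsError, "more than #{max_redirects} redirects for #{url}"
      end

      # Location may be relative, so resolve it against the current URL.
      current = URI.join(current, location).to_s
    end
  end
end
```

With Ruby's standard library, a `fetch` callable could be built on `Net::HTTP.get_response(URI(url))`, returning `[response.code.to_i, { "location" => response["location"] }, response.body]`. Callers that want to treat a redirect loop differently from other failures can rescue the specific error class, while rescuing the base class catches everything the sketch raises.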

Features

  • PDF, HTML, CSV, DOCX, and plain text loaders
  • Web loader with configurable max redirects (default: 5)
  • Encoding auto-detection (BOM, Content-Type charset)
  • Graceful transcoding to UTF-8
  • Shared HTML extraction module
  • Error hierarchy (FileNotFoundError, TooManyRedirectsError, etc.)
  • Input validation for paths and URLs
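
The encoding features above (BOM and Content-Type charset detection, then transcoding to UTF-8) might work roughly as in the following sketch. It is an assumption-laden illustration, not the gem's code: the `EncodingSketch` module and its methods are hypothetical, and both the precedence (BOM first, then charset, then a UTF-8 fallback) and the choice to replace invalid bytes with U+FFFD are guesses at what "graceful" means here.

```ruby
# Illustrative sketch only; not the loader-ruby implementation.
module EncodingSketch
  # Byte-order marks checked in this sketch (UTF-32 is left out for brevity).
  BOMS = {
    "\xEF\xBB\xBF".b => Encoding::UTF_8,
    "\xFF\xFE".b     => Encoding::UTF_16LE,
    "\xFE\xFF".b     => Encoding::UTF_16BE
  }.freeze

  # Returns [encoding, number_of_bom_bytes_to_skip].
  def self.detect(bytes, content_type: nil)
    raw = bytes.b
    BOMS.each do |bom, encoding|
      return [encoding, bom.bytesize] if raw.start_with?(bom)
    end

    if content_type && (match = content_type.match(/charset\s*=\s*"?([^\s;"]+)/i))
      begin
        return [Encoding.find(match[1]), 0]
      rescue ArgumentError
        # Unknown charset label: fall through to the default below.
      end
    end

    [Encoding::UTF_8, 0]
  end

  # Transcodes to UTF-8, replacing invalid or unmappable bytes with U+FFFD
  # instead of raising.
  def self.to_utf8(bytes, content_type: nil)
    encoding, skip = detect(bytes, content_type: content_type)
    text = bytes.b.byteslice(skip..).force_encoding(encoding)

    if encoding == Encoding::UTF_8
      text.scrub("\uFFFD")
    else
      text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "\uFFFD")
    end
  end
end
```

For example, `EncodingSketch.to_utf8(body, content_type: "text/html; charset=ISO-8859-1")` would decode a Latin-1 response body, while a file that starts with a UTF-8 BOM would have the marker stripped before the text is returned.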

License

MIT