Sitemap

A robust, featureful and efficient streaming sitemap parser.

Gem name: streaming-sitemap-parser.

Note: the gem name sitemap is already taken, so I named this gem streaming-sitemap-parser. The gem's top-level namespace is the Sitemap module, so you require "sitemap" to load it.

Features

  • Parses sitemaps and sitemap index files.
  • Input can be a string or a file/IO object that responds to #read.
  • Low parsing memory use: when passed a file/IO object that responds to #read, it uses a streaming parser that doesn't read all the content into memory. Parsing memory use stays at 16 KB even for huge inputs.
  • Low output memory use.
  • Lax parsing by default: it gets you as much data as possible from the input. This way you can decide what is used and what is ignored.
  • Per-field validations help you easily filter data. Example: Sitemap.entries(input).select(&:valid_loc?).
  • URL length limiting, default 2048.
  • Input length limiting (file, IO or string), default 50 MB.
  • URL number limit (sitemap or sitemap index entries), default 50,000.
  • Validates sitemap or sitemap index URLs by checking, via the scope parameter, that they belong to a valid site or directory.
  • Full support for video sitemap extension.
  • Proper XML namespace support.
  • Alternate pages (xhtml:link).

Usage

With a File or IO input (recommended to keep memory usage low)

require "sitemap"

# The block form closes the file once all entries have been processed.
File.open("sitemap.xml") do |file|
  Sitemap.entries(file).each do |entry|
    puts entry.loc
  end
end
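
Since the input only needs to respond to #read, a gzip-compressed sitemap file can presumably be streamed the same way through Ruby's standard Zlib::GzipReader. The sketch below is a minimal illustration of that idea: with_sitemap_io is a hypothetical helper that is not part of the gem, and the assumption that the gem accepts a Zlib::GzipReader follows only from the "responds to #read" rule above.

```ruby
require "zlib"

# Hypothetical helper, not part of the gem: yields an object that responds to
# #read for a plain or a gzip-compressed file, and cleans up afterwards.
def with_sitemap_io(path)
  File.open(path, "rb") do |file|
    if path.end_with?(".gz")
      gz = Zlib::GzipReader.new(file)
      begin
        yield gz
      ensure
        gz.finish # stops decompression; File.open closes the underlying file
      end
    else
      yield file
    end
  end
end

# Assumed usage with the gem (requires "sitemap"):
#
#   with_sitemap_io("sitemap.xml.gz") do |io|
#     Sitemap.entries(io).each { |entry| puts entry.loc }
#   end
```

The decompressed data is produced as the reader is consumed, so the compressed file is never expanded into a single string first.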

With string input

require "sitemap"

sitemap_string = fetch_sitemap_xml
Sitemap.entries(sitemap_string).each do |entry|
  puts entry.loc if entry.valid?
end

Regular sitemap vs sitemap index input

When you're fetching a sitemap from the internet you may get a regular sitemap or a sitemap index file. This gem handles both.

Sitemap.entries(input).each do |entry|
  case entry
  when Sitemap::Entry::URL # regular sitemap entry
    url_work(entry)
  when Sitemap::Entry::Sitemap # sitemap index entry
    sitemap_work(entry)
  end
end
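
One possible way to build on the case statement above is to follow sitemap index entries and process only the regular URL entries they lead to. The sketch below is an illustration under stated assumptions, not something the gem provides: each_url_entry, its keyword arguments and my_http_get are hypothetical names, and it assumes that index entries expose the sitemap address through #loc. Parsing and fetching are passed in as callables so the traversal itself stays independent of any HTTP client.

```ruby
# Hypothetical helper, not part of the gem. Walks a sitemap or a sitemap index
# and yields only regular URL entries, following index entries recursively.
#
#   parse       - callable: input -> enumerable of entries
#   fetch       - callable: URL string -> String or IO holding that sitemap
#   index_entry - callable: entry -> true when the entry points at a sitemap
#   max_depth   - number of index levels to follow before giving up
def each_url_entry(input, parse:, fetch:, index_entry:, max_depth: 2, &block)
  raise ArgumentError, "a block is required" unless block

  parse.call(input).each do |entry|
    if index_entry.call(entry)
      next if max_depth <= 0 # do not follow index entries any deeper

      nested = fetch.call(entry.loc)
      each_url_entry(nested, parse: parse, fetch: fetch,
                             index_entry: index_entry,
                             max_depth: max_depth - 1, &block)
    else
      block.call(entry)
    end
  end
end

# Assumed usage with the gem (my_http_get is a hypothetical fetch function):
#
#   each_url_entry(input,
#                  parse: ->(io) { Sitemap.entries(io) },
#                  fetch: ->(url) { my_http_get(url) },
#                  index_entry: ->(e) { e.is_a?(Sitemap::Entry::Sitemap) }) do |entry|
#     url_work(entry)
#   end
```

The max_depth guard is a defensive choice so that a malformed index that points back at itself cannot cause unbounded recursion.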

Parse regular sitemap only

Sitemap::Document::Sitemap.entries(input).each { |entry| url_work(entry) }

Parse sitemap index only

Sitemap::Document::SitemapIndex.entries(input).each { |entry| sitemap_work(entry) }

Keeping memory use low

Following the simple guidelines below keeps memory use very low.

  1. Use an IO object that responds to #read as the input. Sitemaps can get quite big: the spec limit is 50 MB, so reading a whole file into memory can use up to 50 MB.

  2. Don't keep result entries in memory. Do this:

Sitemap.entries(input).each do |entry|
  do_work(entry) # Don't reference 'entry' outside the block scope.
end

Don't do this:

# All results are kept in an array (in memory). Sitemap files contain many
# entries so memory use can be high.
Sitemap.entries(input).to_a

Also, don't do this:

# All valid entries (probably a lot of them) are buffered in memory.
Sitemap.entries(input).select(&:valid?)
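
If you still want to filter, one option is to make the filter lazy so that entries are handed over one at a time instead of being collected into an array. The sketch below uses Ruby's standard Enumerable#lazy. It assumes the object returned by Sitemap.entries is an Enumerable, which the to_a and select calls above suggest. valid_entries and StubEntry are hypothetical names used only to keep the example runnable without the gem.

```ruby
# Hypothetical helper, not part of the gem: wraps any Enumerable of entries in
# a lazy filter, so valid entries are yielded one by one and never buffered.
def valid_entries(entries)
  entries.lazy.select(&:valid?)
end

# Stand-in entry type, used only to make this sketch runnable on its own.
StubEntry = Struct.new(:loc, :ok) do
  def valid?
    ok
  end
end

stub_entries = [
  StubEntry.new("https://example.com/a", true),
  StubEntry.new("", false),
  StubEntry.new("https://example.com/b", true)
]

# Prints the two valid locations, one per line.
valid_entries(stub_entries).each do |entry|
  puts entry.loc
end

# Assumed usage with the gem:
#
#   valid_entries(Sitemap.entries(input)).each { |entry| do_work(entry) }
```

An equivalent approach without lazy is to keep the plain each loop and skip unwanted entries inside the block with next unless entry.valid?.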

License

MIT