Sitemap
A robust, featureful and efficient streaming sitemap parser.
Gem name: streaming-sitemap-parser.
Note: the sitemap gem name is already taken, so I named this gem streaming-sitemap-parser. The gem's top-level namespace is the Sitemap module, so you require "sitemap" to load it.
Features
- Parses sitemaps and sitemap index files.
- Input can be a string or a file/IO object that responds to #read.
- Low parsing memory use: if passed a file/IO object that responds to #read, it uses a streaming parser that doesn't read all the content into memory. Parsing memory use is kept at 16 KB even for huge inputs.
- Low output memory use.
- Lax parsing by default: it gets you as much data as possible from the input, so you can decide what is used and what is ignored.
- Per-field validations help you easily filter data. Example: Sitemap.entries(input).select(&:valid_loc?).
- URL length limiting, default 2048.
- Input length limiting (file, IO or string), default 50 MB.
- URL number limit (sitemap or sitemap index entries), default 50,000.
- Validate sitemap or sitemap index URLs by checking if they belong to a valid site / directory via the scope parameter.
- Full support for the video sitemap extension.
- Proper XML namespace support.
- Alternate pages (xhtml:link).
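To make the scope idea concrete, here is a minimal sketch of a "does this URL belong to this site / directory" check written with Ruby's standard URI library. It is not the gem's implementation, and the helper name in_scope? and its arguments are hypothetical; how the gem's own scope parameter is passed may differ.

```ruby
require "uri"

# Hypothetical helper, not part of the gem: returns true when `url` shares
# the scheme, host, and port of `scope` and its path sits under the scope's
# directory.
def in_scope?(url, scope)
  u = URI.parse(url)
  s = URI.parse(scope)
  return false unless u.scheme == s.scheme && u.port == s.port
  return false unless u.host && s.host && u.host.downcase == s.host.downcase

  # Treat the scope path as a directory prefix ("" and "/blog" become "/" and "/blog/").
  dir = s.path.end_with?("/") ? s.path : "#{s.path}/"
  path = u.path.empty? ? "/" : u.path
  path.start_with?(dir)
rescue URI::InvalidURIError
  # Unparseable input is simply treated as out of scope.
  false
end
```

Matching on a directory prefix rather than a bare string prefix avoids accepting /blogger/... when the scope is /blog/.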
Usage
With a File or IO input (recommended to keep memory usage low)
require "sitemap"
# The block form closes the file once parsing is done.
File.open("sitemap.xml") do |file|
  Sitemap.entries(file).each do |entry|
    puts entry.loc
  end
end
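Why a #read-able input helps can be sketched without the gem at all. The snippet below assumes a consumer that pulls fixed-size chunks through #read(length), so only one chunk is held at a time. The each_chunk helper and the CHUNK_SIZE constant are hypothetical names for illustration, not the gem's internals.

```ruby
require "stringio"

CHUNK_SIZE = 16 * 1024 # bytes per read; illustrative value

# Hypothetical helper: yields successive chunks from any object that
# responds to #read(length). #read returns nil at end of input.
def each_chunk(io, chunk_size = CHUNK_SIZE)
  return enum_for(:each_chunk, io, chunk_size) unless block_given?

  while (chunk = io.read(chunk_size))
    yield chunk
  end
end

# StringIO stands in for a File or socket here.
io = StringIO.new("<urlset>" + ("x" * 40_000) + "</urlset>")
sizes = each_chunk(io).map(&:bytesize)
```

No chunk is larger than CHUNK_SIZE, regardless of how large the underlying input is.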
With string input
require "sitemap"
sitemap_string = fetch_sitemap_xml
Sitemap.entries(sitemap_string).each do |entry|
  puts entry.loc if entry.valid?
end
Regular sitemap vs sitemap index input
When you're fetching a sitemap from the internet, you may get a regular sitemap or a sitemap index file. This gem handles both.
Sitemap.entries(input).each do |entry|
  case entry
  when Sitemap::Entry::URL # regular sitemap entry
    url_work(entry)
  when Sitemap::Entry::Sitemap # sitemap index entry
    sitemap_work(entry)
  end
end
Parse regular sitemap only
Sitemap::Document::Sitemap.entries(input).each
Parse sitemap index only
Sitemap::Document::SitemapIndex.entries(input).each
Keeping memory use low
Following these simple guidelines keeps memory use very low.
- Use a #read-able IO object as input. Sitemaps can get quite big; the spec limit is 50 MB. Reading the whole file into memory uses, well, up to 50 MB.
- Don't keep result entries in memory. Do this:
Sitemap.entries(input).each do |entry|
  do_work(entry) # Don't reference 'entry' outside the block scope.
end
Don't do this:
# All results are kept in an array (in memory). Sitemap files contain many
# entries so memory use can be high.
Sitemap.entries(input).to_a
Also, don't do this:
# All valid entries (probably a lot of them) are buffered in memory.
Sitemap.entries(input).select(&:valid?)
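If you need filtering without buffering, one option is Ruby's Enumerable#lazy. The sketch below uses a stand-in enumerator instead of the gem, on the assumption that the object returned by Sitemap.entries is Enumerable. FakeEntry is hypothetical and only mimics the #loc and #valid? methods shown above.

```ruby
# Stand-in for parsed entries; FakeEntry is hypothetical.
FakeEntry = Struct.new(:loc, :ok) do
  def valid?
    ok
  end
end

produced = 0 # counts how many entries the source actually generated
entries = Enumerator.new do |y|
  1.upto(1_000) do |i|
    produced += 1
    y << FakeEntry.new("https://example.com/page-#{i}", i.even?)
  end
end

# Lazy pipeline: entries are pulled one at a time and no intermediate
# array of all valid entries is built.
first_valid = entries.lazy.select(&:valid?).map(&:loc).first(3)
```

The pipeline stops pulling from the source once three matches are found. A plain each block with `next unless entry.valid?` at the top gives the same one-at-a-time behaviour.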