XML Node Stream

This gem provides an easy-to-use XML parser that combines the benefits of stream parsing (i.e. SAX) and document parsing (i.e. DOM). In addition, it provides a unified parsing interface for the major Ruby XML parsers (REXML, Nokogiri, and LibXML) so that your code doesn't have to be bound to a particular XML library.

Usage

The primary purpose of this gem is to facilitate parsing large XML files (i.e. several megabytes in size). Often, reading these files into a document structure is not feasible because the whole document must be read into memory. Stream/SAX parsing solves this issue by reading in the file incrementally and providing callbacks for various events. This method can be quite painful to deal with for any sort of complex document structure.

This gem attempts to solve both of these issues by combining the best features of both approaches. Parsing is performed by a stream parser that constructs document-style nodes and calls back to the application code with these nodes. When your application is done with a node, it can release it to free up memory and keep your heap from bloating.

In order to keep the interface simple and universal, only XML elements and text nodes are supported. XML processing instructions and comments will be ignored.

Examples

Suppose we have a file with every book in the world in it:

<books>
  <book isbn="123456">
    <title>Moby Dick</title>
    <author>Herman Melville</author>
    <categories>
      <category>Fiction</category>
      <category>Adventure</category>
    </categories>
  </book>
  <book isbn="98765643">
    <title>The Decline and Fall of the Roman Empire</title>
    <author>Edward Gibbon</author>
    <categories>
      <category>History</category>
      <category>Ancient</category>
    </categories>
  </book>
  ...
</books>

Reading the whole file into memory will cause problems as it bloats the heap with potentially gigabytes of data. This can be solved by using a streaming parser, but that code can be a pain to write and maintain.

With XmlNodeStream we get the best of both worlds. The file is streamed into memory for processing and released when we are done with it, but we still get node data structures that let us interact with the document in a much simpler manner.

XmlNodeStream.parse('/tmp/books.xml') do |node|
  if node.path == '/books/book'
    book = Book.new
    book.isbn = node['isbn']
    book.title = node.find('title').value
    book.author = node.find('author/text()')
    book.categories = node.select('categories/category/text()')
    book.save
    node.release!
  end
end

Releasing Nodes

In the above example, what prevents memory bloat when parsing a large document is the call to node.release!. This call removes the node from the node tree. The general practice is to look for the higher-level nodes you are interested in and release them as soon as you are done with them. If there are nodes you don't care about at all, release those immediately as well.

For example, if the XML document for the books also contained a large list of authors that we aren't using in our processing, we should still release the author nodes immediately to keep from bloating memory:

<library>
  <authors>
    <author id="1">
      <name>Herman Melville</name>
    </author>
    <author id="2">
      <name>Edward Gibbon</name>
    </author>
    ...
  </authors>
  <books>
    <book isbn="123456">
      ...
    </book>
    ...
  </books>
</library>
XmlNodeStream.parse('/tmp/books.xml') do |node|
  if node.path == '/library/books/book'
    process_book(node)
    node.release!
  elsif node.path == '/library/authors/author'
    # we don't care about authors so release the nodes immediately
    node.release!
  end
end

A sample 77MB XML document parsed into Nokogiri consumes over 800MB of memory. Parsing the same document with XmlNodeStream and releasing top-level nodes as they're processed uses less than 1MB.
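
If you want to reproduce this kind of comparison on your own documents, here is a rough sketch. It reads the process's resident set size via ps, so it assumes a Unix-like system; the file path and node path are placeholders for your own data, and each strategy should be measured in a fresh process since Ruby does not reliably return freed memory to the operating system.

require 'nokogiri'
require 'xml_node_stream' # assuming the require name matches the gem name

# Resident set size of this process in megabytes (ps reports kilobytes).
def rss_mb
  `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
end

baseline = rss_mb

if ARGV.first == 'dom'
  # Full document parse; keep a reference so the DOM stays live while we measure.
  doc = Nokogiri::XML(File.read('/tmp/books.xml'))
else
  # Streamed parse, releasing each book node as soon as it has been handled.
  XmlNodeStream.parse('/tmp/books.xml') do |node|
    node.release! if node.path == '/books/book'
  end
end

puts "memory growth: ~#{(rss_mb - baseline).round} MB"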

XPath

You can use a subset of the XPath language to navigate nodes. The only parts of XPath implemented are the paths themselves and the text() function. The text() function is useful for getting the value of a node directly from the find or select methods without having to do a nil check on the nodes. For instance, in the above example we can get the name of an author with node.find('author/text()') instead of node.find('author')&.value or checking if the node exists before accessing its value.
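
To make that concrete, here is a short sketch using the book node for Moby Dick from the example above (the return values shown are assumptions based on the usage in this README):

node.find('title')                         # => the <title> node, or nil if there is none
node.find('title').value                   # => "Moby Dick", but raises if <title> is missing
node.find('title/text()')                  # => "Moby Dick", or nil if <title> is missing
node.select('categories/category/text()')  # => ["Fiction", "Adventure"]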

The rest of the XPath language is not implemented; XPath is a programming language in its own right, and there is little need for it when Ruby, which is far more powerful, is already at our disposal. See the Selector class for details.
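
For example, where full XPath would use a predicate, you can simply filter in Ruby (a sketch reusing the book node from the examples above):

# An XPath predicate like categories/category[. = "Fiction"] is not supported,
# but the same check is easy in plain Ruby:
fiction = node.select('categories/category/text()').include?('Fiction')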

Performance

The performance of XmlNodeStream depends on which underlying XML parser is used. Generally, the native extension based parsers (Nokogiri and LibXML) will perform much better without the added overhead of XmlNodeStream, while the pure Ruby REXML parser will perform much better with XmlNodeStream.
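
If raw speed matters for your workload, it is easy to measure the overhead yourself. A minimal sketch comparing a full Nokogiri DOM parse with an XmlNodeStream pass over the same hypothetical /tmp/books.xml file used earlier:

require 'benchmark'
require 'nokogiri'
require 'xml_node_stream' # assuming the require name matches the gem name

path = '/tmp/books.xml'

# Parse the whole document into a DOM.
dom_seconds = Benchmark.realtime { Nokogiri::XML(File.read(path)) }

# Stream the document, releasing each book node as it is processed.
stream_seconds = Benchmark.realtime do
  XmlNodeStream.parse(path) do |node|
    node.release! if node.path == '/books/book'
  end
end

puts format('Nokogiri DOM:  %.2fs', dom_seconds)
puts format('XmlNodeStream: %.2fs', stream_seconds)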

The main benefit of XmlNodeStream is memory efficiency when parsing large documents. By releasing nodes as they are processed, memory usage can be kept low even for very large documents. This reduces memory bloat and keeps your application's process size consistent regardless of the size of the XML documents being processed, which can be important in a long-running server process.

Installation

Add this line to your application's Gemfile:

gem "xml_node_stream"

Then execute:

$ bundle

Or install it yourself as:

$ gem install xml_node_stream

Contributing

Open a pull request on GitHub.

Please follow the standardrb style and lint your code with standardrb --fix before submitting.

License

The gem is available as open source under the terms of the MIT License.