RXerces

A Ruby XML library with a Nokogiri-compatible API, powered by Apache Xerces-C instead of libxml2.

Overview

RXerces provides a familiar Nokogiri-like interface for XML parsing and manipulation, but uses the robust Apache Xerces-C XML parser under the hood. This allows Ruby developers to leverage Xerces-C's performance and standards compliance while maintaining compatibility with existing Nokogiri-based code.

Features

✅ Nokogiri-compatible API
✅ Powered by Apache Xerces-C
✅ Parse XML documents
✅ Navigate and manipulate DOM trees
✅ Read and write node attributes
✅ Query nodes with XPath (basic support)
✅ Serialize documents back to XML strings

Installation

Prerequisites

You need to have Xerces-C installed on your system:

macOS (Homebrew):

brew install xerces-c

Ubuntu/Debian:

sudo apt-get install libxerces-c-dev

Fedora/RHEL:

sudo yum install xerces-c-devel

Xalan

For full XPath 1.0 compliance, install the Xalan library. This is optional: without it, RXerces falls back to the more limited XPath support built into Xerces-C.

Ubuntu/Debian:

sudo apt-get install libxalan-c-dev

Fedora/RHEL:

sudo yum install xalan-c-devel

Note that macOS, contrary to what the documentation currently says, does not have a Homebrew package for Xalan. You will either need to use MacPorts or clone and build the code manually. I found that it required some tweaking to work:

apache/xalan-c#44

Install the Gem

Add this line to your application's Gemfile:

gem 'rxerces'

And then execute:

bundle install

Or install it yourself as:

gem install rxerces

Usage

Basic Parsing

require 'rxerces'

# Parse XML string
xml = '<root><person name="Alice">Hello</person></root>'
doc = RXerces.XML(xml)

# Access root element
root = doc.root
puts root.name  # => "root"

Nokogiri Compatibility

RXerces provides optional Nokogiri compatibility. Require rxerces/nokogiri to use it as a drop-in replacement:

require 'rxerces/nokogiri'

# Parse XML with Nokogiri syntax
doc = Nokogiri.XML('<root><child>text</child></root>')
puts doc.root.name  # => "root"

# Parse HTML with Nokogiri syntax
html_doc = Nokogiri.HTML('<html><body><h1>Hello</h1></body></html>')
puts html_doc.root.name  # => "html"

# Alternative syntax
xml_doc = Nokogiri::XML.parse('<root>text</root>')
html_doc = Nokogiri::HTML.parse('<html>...</html>')

# Classes are aliased for both XML and HTML
Nokogiri::XML::Document == RXerces::XML::Document   # => true
Nokogiri::HTML::Document == RXerces::XML::Document  # => true

Note: If you don't need Nokogiri compatibility, just require 'rxerces' and use the RXerces module directly.

HTML Parsing Note: Since RXerces uses Xerces-C (an XML parser), Nokogiri::HTML parses HTML as XML. This means it won't perform HTML-specific error correction or tag fixing as Nokogiri does with libxml2's HTML parser. Well-formed HTML/XHTML documents work fine; markup that is not well-formed XML (for example, unclosed tags) should not be expected to parse.

Working with Nodes

# Parse XML
xml = <<-XML
  <library>
    <book id="1" title="1984">
      <author>George Orwell</author>
      <year>1949</year>
    </book>
    <book id="2" title="Brave New World">
      <author>Aldous Huxley</author>
      <year>1932</year>
    </book>
  </library>
XML

doc = RXerces.XML(xml)
root = doc.root

# Get attributes
book = root.children.find { |n| n.is_a?(RXerces::XML::Element) }
puts book['id']     # => "1"
puts book['title']  # => "1984"

# Set attributes
book['isbn'] = '978-0451524935'
puts book['isbn']   # => "978-0451524935"

# Get text content
author = book.children.find { |n| n.name == 'author' }
puts author.text    # => "George Orwell"

# Set text content
author.text = "Eric Arthur Blair"
puts author.text    # => "Eric Arthur Blair"

Navigating the DOM

# Get all children
root.children.each do |child|
  puts "#{child.name}: #{child.class}"
end

# Find specific elements
books = root.children.select { |n| n.is_a?(RXerces::XML::Element) && n.name == 'book' }
books.each do |book|
  puts "Book ID: #{book['id']}"
end
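
The children-based filtering above only looks one level down. A minimal sketch of a recursive search follows. Note that descendants_named is a hypothetical helper, not part of RXerces, and it assumes that #children returns an empty array (rather than nil) for leaf nodes such as text nodes.

```ruby
# Depth-first search for descendants with a given name.
# Hypothetical helper: works on any object that responds to #name and
# #children, which is the Node interface listed in the API reference.
# Assumes #children returns an empty array for leaf nodes.
def descendants_named(node, name)
  node.children.flat_map do |child|
    matches = child.name == name ? [child] : []
    # Recurse so that nested elements are found as well.
    matches + descendants_named(child, name)
  end
end
```

With a parsed document, a call such as descendants_named(doc.root, 'author') should collect every author element in document order, without relying on XPath at all.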

Serialization

# Convert document back to XML string
xml_string = doc.to_xml
puts xml_string

# or use to_s
puts doc.to_s

XPath Queries

RXerces supports XPath queries using Xerces-C's XPath implementation by default:

xml = <<-XML
  <library>
    <book>
      <title>1984</title>
      <author>George Orwell</author>
    </book>
    <book>
      <title>Brave New World</title>
      <author>Aldous Huxley</author>
    </book>
  </library>
XML

doc = RXerces.XML(xml)

# Find all book elements
books = doc.xpath('//book')
puts books.length  # => 2

# Find all titles
titles = doc.xpath('//title')
titles.each do |title|
  puts title.text.strip
end

# Use path expressions
authors = doc.xpath('/library/book/author')
puts authors.length  # => 2

# Query from a specific node
first_book = books[0]
title = first_book.xpath('.//title').first
puts title.text  # => "1984"

Note on XPath Support: Xerces-C implements the XML Schema XPath subset, not full XPath 1.0. Supported features include:

Basic path expressions (/, //, ., ..)
Element selection by name
Descendant and child axes

Not supported:

Attribute predicates ([@attribute="value"])
XPath functions and node tests (last(), position(), text())
Comparison operators in predicates

For more complex queries, you can combine basic XPath with Ruby's select and find methods.
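
As a rough illustration of that approach, the sketch below does the work of an attribute predicate in plain Ruby. Both helper names are hypothetical and not part of RXerces; they only assume that the nodes respond to #[] and #text, as listed in the API reference.

```ruby
# Filter a collection of nodes by attribute value in plain Ruby,
# standing in for an XPath predicate such as //book[@id="2"].
# `nodes` may be any Enumerable whose items respond to #[].
def with_attribute(nodes, attribute, value)
  nodes.select { |node| node[attribute] == value }
end

# Return the first node whose stripped text equals `value`, or nil.
def first_with_text(nodes, value)
  nodes.find { |node| node.text.strip == value }
end
```

For example, with_attribute(doc.xpath('//book'), 'id', '2') would run the basic query through Xerces-C and apply the attribute test afterwards in Ruby.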

For full XPath 1.0 support, install the Xalan library.
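
Code that has to run both with and without Xalan can branch on the result of RXerces.xalan_enabled?. The following is only a sketch under that assumption: find_by_attribute is a hypothetical helper, `scope` is anything that responds to #xpath (a Document or Node), and `path` and `attribute` are assumed to come from trusted code rather than user input.

```ruby
# Hypothetical helper: use an attribute predicate when full XPath 1.0 is
# available, otherwise fall back to a basic path plus Ruby filtering.
def find_by_attribute(scope, path, attribute, value, xalan:)
  if xalan
    # Reject quotes rather than trying to escape them inside the literal.
    raise ArgumentError, 'value must not contain quotes' if value.match?(/['"]/)

    scope.xpath("#{path}[@#{attribute}='#{value}']").to_a
  else
    # Basic path only; the attribute test happens in Ruby.
    scope.xpath(path).select { |node| node[attribute] == value }
  end
end
```

A caller would typically pass xalan: RXerces.xalan_enabled? so that the same call site works on either kind of installation.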

API Reference

RXerces Module

RXerces.XML(string) - Parse XML string and return Document
RXerces.parse(string) - Alias for XML
RXerces.xalan_enabled? - Check if Xalan XPath 1.0 support is available

XPath Validation Cache Configuration

RXerces validates XPath expressions for security (preventing injection attacks). For high-volume applications, validated expressions are cached to avoid redundant validation overhead.

# Check if caching is enabled (default: true)
RXerces.cache_xpath_validation?  # => true

# Disable caching (re-validates every query)
RXerces.cache_xpath_validation = false

# Re-enable caching
RXerces.cache_xpath_validation = true

# Get current cache size
RXerces.xpath_validation_cache_size  # => 42

# Get/set maximum cache size (default: 10,000)
RXerces.xpath_validation_cache_max_size       # => 10000
RXerces.xpath_validation_cache_max_size = 5000

# Clear the cache
RXerces.clear_xpath_validation_cache

Performance note: Caching provides ~7-9% speedup for repeated XPath queries by avoiding redundant validation. The cache is thread-safe.
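
If the cache only needs to be off for a short stretch (a benchmark, say), the setting can be restored afterwards. This is a minimal sketch: without_xpath_validation_cache is a hypothetical helper, and `config` is assumed to be the RXerces module or any object exposing the same two methods shown above.

```ruby
# Hypothetical helper: run a block with XPath validation caching turned
# off, then restore the previous setting even if the block raises.
# `config` must respond to cache_xpath_validation? and
# cache_xpath_validation= (the RXerces module, per the calls above).
def without_xpath_validation_cache(config)
  previous = config.cache_xpath_validation?
  begin
    config.cache_xpath_validation = false
    yield
  ensure
    config.cache_xpath_validation = previous
  end
end
```

Usage would look like without_xpath_validation_cache(RXerces) { doc.xpath('//book') }. The setting appears to be process-wide, so other threads would presumably see the cache disabled for the duration of the block.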

RXerces::XML::Document

.parse(string) - Parse XML string (class method)
#root - Get root element
#to_s / #to_xml - Serialize to XML string
#xpath(path) - Query with XPath (returns NodeSet)

RXerces::XML::Node

#name - Get node name
#text / #content - Get text content
#text= / #content= - Set text content
#[attribute] - Get attribute value
#[attribute]= - Set attribute value
#children - Get array of child nodes
#xpath(path) - Query descendants with XPath

RXerces::XML::Element

Inherits all methods from Node. Represents element nodes.

RXerces::XML::Text

Inherits all methods from Node. Represents text nodes.

RXerces::XML::NodeSet

#length / #size - Get number of nodes
#[] - Access node by index
#each - Iterate over nodes (Enumerable)
#to_a - Convert to array

Development

Building the Extension

bundle install
bundle exec rake compile

Running Tests

bundle exec rspec

Running Tests with Compilation

bundle exec rake

Implementation Notes

Uses Apache Xerces-C 3.x for XML parsing
C++ extension compiled with Ruby's native extension API
XPath support is basic by default (full XPath requires Xalan)
Memory management handled by Ruby's GC and Xerces-C's DOM

Differences from Nokogiri

While RXerces aims for API compatibility with Nokogiri, there are some differences:

Parser Backend: Uses Xerces-C instead of libxml2
XPath: Basic XPath support by default (the XML Schema XPath subset); full XPath 1.0 requires Xalan
Features: Subset of Nokogiri's full feature set
Performance: Different performance characteristics due to Xerces-C

Contributing

Fork it
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create new Pull Request

License

MIT License - see LICENSE file for details

Credits

Built with Apache Xerces-C
API inspired by Nokogiri

Misc

This library was almost entirely written using AI (Claude Sonnet 4.5). It was mainly a reaction to the lack of maintainers for libxml2 and the generally sorry state of that library. Since Nokogiri uses it under the hood, I thought it best to create an alternative.

Copyright

Author

Daniel J. Berger