CompareXML
CompareXML is a fast, lightweight and feature-rich tool that will solve your XML/HTML comparison or diffing needs. Its purpose is to compare two instances of Nokogiri::XML::Node or Nokogiri::XML::NodeSet for equality or equivalency.
Features
- Fast, lightweight and highly customizable
- Compares XML/HTML documents and document fragments
- Can either report detailed discrepancies or run silently
- Can exclude specific nodes or attributes from all comparisons
Installation
Add this line to your application's Gemfile:
gem 'compare-xml'
And then execute:
bundle
Or install it yourself as:
gem install compare-xml
Usage
Using CompareXML is as simple as
CompareXML.equivalent?(doc1, doc2)
where doc1 and doc2 are instances of Nokogiri::XML::Node or Nokogiri::XML::NodeSet.
Example
Suppose you have two files 1.html and 2.html that you would like to compare. You could do it as follows:
doc1 = Nokogiri::HTML(File.read('1.html'))
doc2 = Nokogiri::HTML(File.read('2.html'))
puts CompareXML.equivalent?(doc1, doc2)
The above code will print true or false depending on the result of the comparison.
If you are using CompareXML in a script, then you need to require it manually with:
require 'compare-xml'
Options at a Glance
CompareXML has a variety of options that can be invoked as an optional argument, e.g.:
CompareXML.equivalent?(doc1, doc2, {collapse_whitespace: false, verbose: true, ...})
- collapse_whitespace: {true|false} default: true
  - when true, trims and collapses whitespace
- ignore_attr_order: {true|false} default: true
  - when true, ignores attribute order within tags
- ignore_attr_content: [string1, string2, ...] default: []
  - when provided, ignores all attributes that contain substrings string1, string2, etc.
- ignore_attrs: [css_selector1, css_selector2, ...] default: []
  - when provided, ignores specific attributes using CSS selectors
- ignore_attrs_by_name: [string1, string2, ...] default: []
  - when provided, ignores attributes with the given names
- ignore_comments: {true|false} default: true
  - when true, ignores comments, such as <!-- comment -->
- ignore_nodes: [css_selector1, css_selector2, ...] default: []
  - when provided, ignores specific nodes using CSS selectors
- ignore_text_nodes: {true|false} default: false
  - when true, ignores all text content within a document
- verbose: {true|false} default: false
  - when true, instead of a boolean, CompareXML.equivalent? returns an array of discrepancies
- ignore_children: {true|false} default: false
  - when true, the subnodes of a node in the XML are ignored
- force_children: {true|false} default: false
  - when true, the subnodes of a node are checked independently of the status of the parent node
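The defaults listed above can be collected into a single hash and overlaid with per-call overrides. The following is a minimal sketch in plain Ruby; DEFAULT_OPTS and build_opts are hypothetical names used for illustration and are not part of the gem, which may handle its defaults differently.

```ruby
# Defaults copied from the option summary above.
# DEFAULT_OPTS is a hypothetical constant, not something the gem exposes.
DEFAULT_OPTS = {
  collapse_whitespace: true,
  ignore_attr_order: true,
  ignore_attr_content: [],
  ignore_attrs: [],
  ignore_attrs_by_name: [],
  ignore_comments: true,
  ignore_nodes: [],
  ignore_text_nodes: false,
  verbose: false,
  ignore_children: false,
  force_children: false
}.freeze

# Hypothetical helper: overlay caller options on the defaults and
# reject unknown keys so that typos surface early.
def build_opts(overrides = {})
  unknown = overrides.keys - DEFAULT_OPTS.keys
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  DEFAULT_OPTS.merge(overrides)
end

opts = build_opts(collapse_whitespace: false, verbose: true)
puts opts[:collapse_whitespace]  # false
puts opts[:ignore_comments]      # true
```

The resulting hash could then be passed as the third argument, e.g. CompareXML.equivalent?(doc1, doc2, opts).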
Options in Depth
- collapse_whitespace: {true|false} default: true
  When true, all text content within the document is trimmed (i.e. whitespace is removed from both ends) and whitespace is collapsed (i.e. tabs, new lines and runs of whitespace characters are replaced by a single space).
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {collapse_whitespace: true})
  Example: When true, the following HTML strings are considered equal:
  <a href="/admin">   SOME   TEXT   CONTENT   </a>
  <a href="/admin">SOME TEXT CONTENT</a>
  Example: When true, the following HTML strings are considered equal:
  <html> <title> This is my title </title> </html>
  <html><title>This is my title</title></html>
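To make the trimming and collapsing rule concrete, here is a small plain-Ruby sketch. The collapse_whitespace helper is hypothetical and only mirrors the behaviour described above; it is not taken from the gem's source.

```ruby
# Hypothetical helper, not the gem's implementation: trim both ends,
# then replace every run of whitespace with a single space.
def collapse_whitespace(text)
  text.strip.gsub(/\s+/, ' ')
end

a = "  This is\n\tmy   title  "
b = 'This is my title'

puts collapse_whitespace(a) == collapse_whitespace(b)  # true
puts a == b                                            # false
```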
- ignore_attr_order: {true|false} default: true
  When true, all attributes are sorted by name before comparison, so only attributes with the same name are compared with each other.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_attr_order: true})
  Example: When true, the following HTML strings are considered equal:
  <a href="/admin" class="button" target="_blank">Link</a>
  <a class="button" target="_blank" href="/admin">Link</a>
  Example: When false, the above HTML strings are compared as follows:
  href="/admin" != class="button"
  The comparison of the <a> element will stop at this point, since a discrepancy is found.
  Example: When true, the following HTML strings are compared as follows:
  <a href="/admin" class="button" target="_blank">Link</a>
  <a class="button" target="_blank" href="/admin" rel="nofollow">Link</a>
  class="button"  == class="button"
  href="/admin"   == href="/admin"
                  != rel="nofollow"
  target="_blank" == target="_blank"
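The effect of sorting attributes before comparing them can be sketched in plain Ruby. Attributes are modelled here as ordered [name, value] pairs, and attrs_equal? and attr_diffs are hypothetical helpers; the gem itself works on Nokogiri attributes and may be implemented differently.

```ruby
# Attributes modelled as ordered [name, value] pairs (an assumption made
# for illustration; the gem works on Nokogiri nodes).
left  = [['href', '/admin'], ['class', 'button'], ['target', '_blank']]
right = [['class', 'button'], ['target', '_blank'], ['href', '/admin']]

# Hypothetical comparator: with ignore_order, sort both sides by name first.
def attrs_equal?(attrs1, attrs2, ignore_order: true)
  if ignore_order
    attrs1 = attrs1.sort_by(&:first)
    attrs2 = attrs2.sort_by(&:first)
  end
  attrs1 == attrs2
end

# Hypothetical helper: pair attributes by name and list the names whose
# values differ or that exist on one side only.
def attr_diffs(attrs1, attrs2)
  h1 = attrs1.to_h
  h2 = attrs2.to_h
  (h1.keys | h2.keys).sort.reject { |name| h1[name] == h2[name] }
end

puts attrs_equal?(left, right)                       # true
puts attrs_equal?(left, right, ignore_order: false)  # false
p attr_diffs(left, right + [['rel', 'nofollow']])    # ["rel"]
```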
- ignore_attr_content: [string1, string2, ...] default: []
  When provided, ignores all attributes that contain any of the given substrings. Note: types of attributes still have to match (i.e. <p> = <p>, <div> = <div>, etc).
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_attr_content: ['button']})
  Example: With ignore_attr_content: ['button'] the following HTML strings are considered equal:
  <a href="/admin" id="button_1" class="blue button">Link</a>
  <a href="/admin" id="button_2" class="info button">Link</a>
  Example: With ignore_attr_content: ['menu'] the following HTML strings are considered equal:
  <a class="menu left" data-scope="abrth$menu" role="side-menu">Link</a>
  <a class="main menu" data-scope="ergeh$menu" role="main-menu">Link</a>
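One possible reading of this rule, sketched in plain Ruby: attribute names must still line up on both sides, but a value is skipped when it contains one of the ignored substrings. The attrs_equivalent? helper and the choice to skip a pair when either value matches are assumptions made for illustration, not the gem's documented behaviour.

```ruby
# Hypothetical helper: attributes are plain name => value hashes.
# Names must match on both sides; values are compared only when neither
# side contains one of the ignored substrings (an assumed interpretation).
def attrs_equivalent?(attrs1, attrs2, ignored)
  return false unless attrs1.keys.sort == attrs2.keys.sort

  attrs1.all? do |name, value|
    other = attrs2[name]
    skip = ignored.any? { |s| value.include?(s) || other.include?(s) }
    skip || value == other
  end
end

left  = { 'href' => '/admin', 'id' => 'button_1', 'class' => 'blue button' }
right = { 'href' => '/admin', 'id' => 'button_2', 'class' => 'info button' }

puts attrs_equivalent?(left, right, ['button'])  # true
puts attrs_equivalent?(left, right, [])          # false
```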
- ignore_attrs: [css_selector1, css_selector2, ...] default: []
  When provided, ignores all attributes that satisfy a particular rule using CSS selectors.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_attrs: ['a[rel="nofollow"]', 'input[type="hidden"]']})
  Example: With ignore_attrs: ['a[rel="nofollow"]', 'a[target]'] the following HTML strings are considered equal:
  <a href="/admin" class="button" target="_blank">Link</a>
  <a href="/admin" class="button" target="_self" rel="nofollow">Link</a>
  Example: With ignore_attrs: ['a[href^="http"]', 'a[class*="button"]'] the following HTML strings are considered equal:
  <a href="http://google.ca" class="primary button">Link</a>
  <a href="https://google.com" class="primary button rounded">Link</a>
- ignore_attrs_by_name: [string1, string2, ...] default: []
  When provided, ignores all attributes whose names are listed in the string array.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_attrs_by_name: ['target']})
  Example: With ignore_attrs_by_name: ['target', 'rel'] the following HTML strings are considered equal:
  <a href="/admin" class="button" target="_blank">Link</a>
  <a href="/admin" class="button" target="_self" rel="nofollow">Link</a>
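Ignoring attributes by name amounts to dropping those names from both sides before comparing. A minimal plain-Ruby sketch, assuming attributes are modelled as name => value hashes and using a hypothetical without_names helper:

```ruby
# Hypothetical helper: remove attributes whose name is in the ignore list.
def without_names(attrs, names)
  attrs.reject { |name, _value| names.include?(name) }
end

left  = { 'href' => '/admin', 'class' => 'button', 'target' => '_blank' }
right = { 'href' => '/admin', 'class' => 'button', 'target' => '_self', 'rel' => 'nofollow' }

both = %w[target rel]
puts without_names(left, both) == without_names(right, both)  # true

# Ignoring only "target" leaves rel="nofollow" on the right-hand side.
only = %w[target]
puts without_names(left, only) == without_names(right, only)  # false
```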
- ignore_comments: {true|false} default: true
  When true, ignores comments, such as <!-- This is a comment -->.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_comments: true})
  Example: When true, the following HTML strings are considered equal:
  <!-- This is a comment -->
  <!-- This is another comment -->
  Example: When true, the following HTML strings are considered equal:
  <a href="/admin"><!-- This is a comment -->Link</a>
  <a href="/admin">Link</a>
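The idea behind ignoring comments can be shown on raw strings. This is a deliberately simplified sketch: strip_comments is a hypothetical helper that uses a regular expression, whereas a real comparison should operate on a parsed document rather than on text.

```ruby
# Hypothetical helper for illustration only: removes <!-- ... --> spans
# from a markup string (the /m flag lets a comment span several lines).
def strip_comments(markup)
  markup.gsub(/<!--.*?-->/m, '')
end

with_comment    = '<a href="/admin"><!-- This is a comment -->Link</a>'
without_comment = '<a href="/admin">Link</a>'

puts strip_comments(with_comment) == without_comment  # true
puts with_comment == without_comment                  # false
```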
- ignore_nodes: [css_selector1, css_selector2, ...] default: []
  When provided, ignores all nodes that satisfy a particular rule using CSS selectors.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_nodes: ['script', 'object']})
  Example: With ignore_nodes: ['a[rel="nofollow"]', 'a[target]'] the following HTML strings are considered equal:
  <a href="/admin" class="icon" target="_blank">Link 1</a>
  <a href="/index" class="button" target="_self" rel="nofollow">Link 2</a>
  Example: With ignore_nodes: ['b', 'i'] the following HTML strings are considered equal:
  <a href="/admin"><i class="icon bulb"></i><b>Warning:</b> Link</a>
  <a href="/admin"><i class="icon info"></i><b>Message:</b> Link</a>
- ignore_text_nodes: {true|false} default: false
  When true, ignores all text content. Text content is anything that is included between an opening and a closing tag, e.g. <tag>THIS IS TEXT CONTENT</tag>.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_text_nodes: true})
  Example: When true, the following HTML strings are considered equal:
  <a href="/admin">SOME TEXT CONTENT</a>
  <a href="/admin">DIFFERENT TEXT CONTENT</a>
  Example: When true, the following HTML strings are considered equal:
  <i class="icon"></i> <b>Warning:</b>
  <i class="icon"> </i> <b>Message:</b>
- verbose: {true|false} default: false
  When true, instead of returning a boolean value, CompareXML.equivalent? returns an array of all discrepancies encountered when performing a comparison.
  Warning: When true, the comparison takes longer, both because more processing is required to produce meaningful differences and because in this mode the comparison does NOT stop at the first difference: the goal is to capture as many differences as possible.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {verbose: true})
  Example: When true, for two HTML documents doc1 and doc2 that differ in several places, CompareXML.equivalent?(doc1, doc2, {verbose: true}) will produce an array like the one shown below.
  [
    {
      node1: '<title>TITLE</title>',
      node2: '<title>ANOTHER TITLE</title>',
      diff1: 'TITLE',
      diff2: 'ANOTHER TITLE',
    },
    {
      node1: '<h1>SOME HEADING</h1>',
      node2: '<h1 id="main">SOME HEADING</h1>',
      diff1: nil,
      diff2: 'id="main"',
    },
    {
      node1: '<a href="/admin" rel="icon">Link</a>',
      node2: '<a rel="button" href="/admin">Link</a>',
      diff1: 'rel="icon"',
      diff2: 'rel="button"',
    },
    {
      node1: '<cite>Author Name</cite>',
      node2: nil,
      diff1: '<cite>Author Name</cite>',
      diff2: nil,
    },
    {
      node1: '<p class="footer">FOOTER</p>',
      node2: '<div class="footer">FOOTER</div>',
      diff1: 'p',
      diff2: 'div',
    }
  ]
  The structure of each hash inside the array is:
  node1: [Nokogiri::XML::Node] left node that contains the difference
  node2: [Nokogiri::XML::Node] right node that contains the difference
  diff1: [Nokogiri::XML::Node|String] left difference
  diff2: [Nokogiri::XML::Node|String] right difference
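An array with this structure is easy to turn into a readable report. The sketch below uses sample data with plain strings standing in for Nokogiri nodes so that it runs on its own; format_diffs is a hypothetical helper, not part of the gem.

```ruby
# Sample discrepancies shaped like the structure described above.
# Plain strings stand in for Nokogiri nodes (an assumption for this demo).
diffs = [
  { node1: '<title>TITLE</title>', node2: '<title>ANOTHER TITLE</title>',
    diff1: 'TITLE', diff2: 'ANOTHER TITLE' },
  { node1: '<cite>Author Name</cite>', node2: nil,
    diff1: '<cite>Author Name</cite>', diff2: nil }
]

# Hypothetical formatter: one line per discrepancy, nil shown as "(missing)".
def format_diffs(diffs)
  diffs.each_with_index.map do |d, i|
    left  = d[:diff1].nil? ? '(missing)' : d[:diff1].to_s
    right = d[:diff2].nil? ? '(missing)' : d[:diff2].to_s
    "#{i + 1}. #{left} != #{right}"
  end
end

puts format_diffs(diffs)
# 1. TITLE != ANOTHER TITLE
# 2. <cite>Author Name</cite> != (missing)
```

With the gem, the array returned by CompareXML.equivalent?(doc1, doc2, {verbose: true}) would take the place of the sample data.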
- ignore_children: {true|false} default: false
  When true, ignores all subnodes of any node.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {ignore_children: true})
  Example: With ignore_children: true the following HTML strings are considered equal:
  <body><a href="/admin" class="icon" target="_blank">Link 1</a></body>
  <body><a href="/index" class="button" target="_self" rel="nofollow">Link 2</a></body>
- force_children: {true|false} default: false
  When true, compares the subnodes of every node, regardless of the result of comparing the parent node.
  Usage Example:
  CompareXML.equivalent?(doc1, doc2, {force_children: true})
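How ignore_children and force_children might change a traversal can be illustrated with a toy comparator. Everything here is an assumption made for illustration: Node is a stand-in for a Nokogiri node, compare is a hypothetical function, and the gem's real traversal may differ in detail.

```ruby
# Toy node type for illustration; the gem operates on Nokogiri nodes.
Node = Struct.new(:name, :children)

# Hypothetical comparator returning a list of mismatch descriptions.
#   ignore_children: never descend below the nodes being compared
#   force_children:  descend even when the nodes themselves differ
def compare(n1, n2, ignore_children: false, force_children: false, diffs: [])
  same = n1.name == n2.name
  diffs << "#{n1.name} != #{n2.name}" unless same
  return diffs if ignore_children
  return diffs unless same || force_children

  count = [n1.children.size, n2.children.size].max
  count.times do |i|
    c1 = n1.children[i]
    c2 = n2.children[i]
    if c1 && c2
      compare(c1, c2, ignore_children: ignore_children,
                      force_children: force_children, diffs: diffs)
    else
      diffs << "#{(c1 || c2).name} has no counterpart"
    end
  end
  diffs
end

doc1 = Node.new('body', [Node.new('a', [Node.new('b', [])])])
doc2 = Node.new('body', [Node.new('p', [Node.new('i', [])])])

p compare(doc1, doc2)                         # ["a != p"]
p compare(doc1, doc2, force_children: true)   # ["a != p", "b != i"]
p compare(doc1, doc2, ignore_children: true)  # []
```

In this sketch the default mode stops descending at the first mismatching parent, force_children keeps going into its subnodes, and ignore_children compares the top-level nodes only.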
Contributing
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
Credits
This gem was inspired by Michael B. Klein's gem equivalent-xml - another excellent tool for XML comparison.
License
The gem is available as open source under the terms of the MIT License.
