Canon: Semantic comparison for serialization formats
- Purpose
- Installation
- Quick start
- Format documents
- Compare documents
- Use in tests
- Command-line interface
- Documentation
- Using Canon
- Understanding Canon
- Features
- Advanced topics
- Features
- Canonicalization
- Semantic comparison
- Algorithm choice
- Size limits for large files
- Smart diff output
- Enhanced diff features
- Input validation
- Examples
- Ruby API example
- CLI example
- RSpec example
- Architecture
- CompareProfile architecture
- CompareProfile architecture
- Development
- Contributing
- Copyright and license
Purpose
Canon provides canonicalization, pretty-printing, and semantic comparison for serialization formats (XML, HTML, JSON, YAML). It produces standardized forms suitable for comparison, testing, digital signatures, and human-readable output.
Key features:
- Format support: XML, HTML, JSON, YAML
- Canonicalization: W3C XML C14N 1.1, sorted JSON/YAML keys
- Semantic comparison: Compare meaning, not formatting
- Multiple interfaces: Ruby API, CLI, RSpec matchers
- Smart diff output: By-line or by-object modes with syntax highlighting
Installation
Add to your application’s Gemfile:
gem 'canon'
Then execute:
$ bundle install
Or install directly:
$ gem install canon
Quick start
Format documents
require 'canon'
# Format with the default behavior
Canon.format('<root><b>2</b><a>1</a></root>', :xml)
# => Pretty-printed XML (default behavior)
# Compact canonical form
require 'canon/xml/c14n'
Canon::Xml::C14n.canonicalize('<root><b>2</b><a>1</a></root>', with_comments: false)
# => "<root><b>2</b><a>1</a></root>"
# Pretty-print (human-readable with custom indent)
require 'canon/pretty_printer/xml'
xml_input = '<root><b>2</b><a>1</a></root>'
Canon::PrettyPrinter::Xml.new(indent: 2).format(xml_input)
Compare documents
require 'canon/comparison'
xml1 = '<root><a>1</a><b>2</b></root>'
xml2 = '<root> <b>2</b> <a>1</a> </root>'
Canon::Comparison.equivalent?(xml1, xml2)
# => true (semantically equivalent despite formatting differences)
# Use semantic tree diff for operation-level analysis
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  diff_algorithm: :semantic
)
result.operations # => [INSERT, DELETE, UPDATE, MOVE operations]
Use in tests
require 'canon/rspec_matchers'
RSpec.describe 'XML generation' do
  it 'generates correct XML' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end
Command-line interface
# Format a file
$ canon format input.xml --mode pretty
# Compare files
$ canon diff file1.xml file2.xml --verbose
# Get help
$ canon help
Documentation
Using Canon
- Ruby API - Using Canon from Ruby code
- Command-line interface - CLI commands and options
- RSpec matchers - Testing with Canon
Understanding Canon
- Match architecture - How comparison works
- Format support - XML, HTML, JSON, YAML details
- Diff modes - By-line vs by-object comparison
Features
- Preprocessing - Document normalization options
- Match options - Match dimensions and profiles
- Semantic tree diff - Operation-level tree comparison
- Semantic tree diff algorithm - Comprehensive guide to semantic diff
- Environment configuration - Configure via ENV variables including size limits
- Diff formatting - Customizing diff output
- Character visualization - Whitespace and special characters
- Input validation - Error handling
Advanced topics
- Verbose mode - Two-tier diff architecture
- Semantic diff report - Diff report format
- Normative vs informative diffs - Diff classification
- Diff architecture - Technical pipeline details
- CompareProfile architecture - Format-specific policies
Features
Canonicalization
XML: W3C Canonical XML Version 1.1 specification with namespace declaration ordering, attribute ordering, character encoding normalization, and proper handling of xml:base, xml:lang, xml:space, and xml:id attributes.
HTML: Consistent formatting for HTML 4/5 and XHTML with automatic detection and appropriate formatting rules.
JSON/YAML: Alphabetically sorted keys at all levels with consistent formatting.
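To make "sorted keys at all levels" concrete, here is a minimal sketch using only Ruby's standard JSON library. It is not Canon's implementation; the helper names sort_keys and canonical_json are hypothetical and exist only for this illustration.

```ruby
require 'json'

# Recursively rebuild hashes with their keys in alphabetical order.
# Arrays keep their element order, because order is significant in JSON arrays.
def sort_keys(value)
  case value
  when Hash
    value.keys.sort.each_with_object({}) do |key, sorted|
      sorted[key] = sort_keys(value[key])
    end
  when Array
    value.map { |item| sort_keys(item) }
  else
    value
  end
end

# Hypothetical helper: parse, sort every level, and emit compact JSON.
def canonical_json(text)
  JSON.generate(sort_keys(JSON.parse(text)))
end

puts canonical_json('{"b": 2, "a": {"d": [3, {"z": 1, "y": 2}], "c": 1}}')
# prints {"a":{"c":1,"d":[3,{"y":2,"z":1}]},"b":2}
```

Two inputs that differ only in key order or whitespace produce the same string, which is what makes a canonical form usable for comparison.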
Semantic comparison
Compare documents based on meaning, not formatting:
- Whitespace normalization options
- Attribute/key order handling
- Comment handling with display control
- Multiple match dimensions with behaviors
- Predefined match profiles (strict, rendered, spec_friendly, content_only)
See Match options for details.
Comment display control
Control which differences are displayed in diff output:
# Show all differences (default)
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :all
)
# Show only normative differences (affect equivalence)
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :normative
)
# Show only informative differences
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore },
  show_diffs: :informative
)
CLI usage:
# Show all differences
$ canon diff file1.xml file2.xml --show-diffs all
# Show only normative differences
$ canon diff file1.xml file2.xml --show-diffs normative
# Show only informative differences
$ canon diff file1.xml file2.xml --show-diffs informative
RSpec usage:
expect(actual).to be_xml_equivalent_to(expected)
  .show_diffs(:normative)
Algorithm choice
Canon provides two diff algorithms:
- DOM diff (default): Stable, position-based comparison for traditional line-by-line output
- Semantic tree diff (experimental): Advanced operation detection (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE)
# Use DOM diff (default, stable)
result = Canon::Comparison.equivalent?(doc1, doc2,
  verbose: true,
  diff_algorithm: :dom
)
# Use semantic tree diff (experimental, more intelligent)
result = Canon::Comparison.equivalent?(doc1, doc2,
  verbose: true,
  diff_algorithm: :semantic
)
When to use semantic tree diff:
- Need to detect high-level operations (moves, merges, splits)
- Documents have significant rearrangement
- Want statistical analysis of changes
- Need operation-level transformation analysis
When to use DOM diff:
- Need stable, well-tested comparison
- Want traditional line-by-line output
- Documents are similar in structure
- Maximum performance for large files
See Semantic tree diff algorithm for comprehensive guide.
Size limits for large files
Canon provides configurable size limits to prevent hangs on pathologically large files:
- File size limit: Default 5MB (configurable)
- Node count limit: Default 10,000 nodes (configurable)
- Diff output limit: Default 10,000 lines (configurable)
# Configure via environment variables
export CANON_MAX_FILE_SIZE=10485760 # 10MB
export CANON_MAX_NODE_COUNT=50000 # 50,000 nodes
export CANON_MAX_DIFF_LINES=20000 # 20,000 lines
bundle exec rspec
# Or programmatically
Canon::Config.instance.xml.diff.max_file_size = 10_485_760
Canon::Config.instance.xml.diff.max_node_count = 50_000
Canon::Config.instance.xml.diff.max_diff_lines = 20_000
See ENV_CONFIG for details on size limit configuration.
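The following sketch illustrates the general pattern of an environment-driven limit with a built-in default. It is not Canon's code: the environment variable names and default values come from the list above, while DEFAULT_LIMITS, limit_from_env, and within_file_size_limit? are hypothetical names introduced for this example.

```ruby
# Defaults as listed above (5MB file size, 10,000 nodes, 10,000 diff lines).
# The constant and method names are hypothetical, not part of Canon.
DEFAULT_LIMITS = {
  'CANON_MAX_FILE_SIZE'  => 5 * 1024 * 1024,
  'CANON_MAX_NODE_COUNT' => 10_000,
  'CANON_MAX_DIFF_LINES' => 10_000
}.freeze

# Read a limit from the environment, falling back to the default when the
# variable is unset or blank. Integer() raises on non-numeric input, so a
# typo in the variable is reported instead of silently ignored.
def limit_from_env(name, env = ENV)
  raw = env[name]
  return DEFAULT_LIMITS.fetch(name) if raw.nil? || raw.strip.empty?

  Integer(raw, 10)
end

# Check a document against the file size limit before doing expensive work.
def within_file_size_limit?(text, env = ENV)
  text.bytesize <= limit_from_env('CANON_MAX_FILE_SIZE', env)
end

puts limit_from_env('CANON_MAX_NODE_COUNT', { 'CANON_MAX_NODE_COUNT' => '50000' })
# prints 50000
```

Checking the size up front lets a comparison refuse an oversized input early rather than start a diff that may never finish.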
Smart diff output
By-line mode: Traditional line-by-line diff with:
- DOM-guided semantic matching for XML
- Syntax-aware token highlighting
- Context lines around changes
- Whitespace visualization
By-object mode: Tree-based semantic diff with:
- Visual tree structure using box-drawing characters
- Shows only what changed (additions, removals, modifications)
- Color-coded output
See Diff modes for details.
Enhanced diff features
- Three-tier diff classification: Formatting-only ([ dark gray / ] light gray), informative (< blue / > cyan), and normative (- red / + green) differences with directional colors
- Directional color coding: Removals and additions use different colors within each tier (red/green for normative, blue/cyan for informative, dark gray/light gray for formatting)
- Namespace declaration tracking: Separate dimension for tracking xmlns and xmlns:* attribute changes, reported independently from regular data attributes
- Namespace rendering: Explicit namespace display in XML diffs using ns:[uri] or ns:[] format
- Informative diff visualization: Visually distinct blue/cyan markers for differences that don’t affect equivalence
- Formatting diff detection: Automatically detects and highlights purely cosmetic whitespace/line break differences
- Whitespace visualization: Make invisible characters visible with CJK-safe Unicode symbols
- Non-ASCII detection: Warnings for unexpected Unicode characters
- Customizable: Character maps, context lines, grouping options
See Diff formatting and Character visualization for details.
Input validation
Comprehensive validation with clear error messages showing exact line and column numbers for syntax errors in XML, HTML, JSON, and YAML.
See Input validation for details.
Examples
Ruby API example
require 'canon/comparison'
# Compare with custom options
Canon::Comparison.equivalent?(doc1, doc2,
  match: {
    text_content: :normalize,
    structural_whitespace: :ignore,
    comments: :ignore
  },
  verbose: true
)
CLI example
# Compare with semantic diff
$ canon diff file1.xml file2.xml \
  --verbose \
  --text-content normalize \
  --structural-whitespace ignore
See CLI documentation.
RSpec example
# Configure globally
Canon::Config.configure do |config|
  config.xml.match.profile = :spec_friendly
  config.xml.diff.use_color = true
end
# Use in tests
RSpec.describe 'XML generation' do
  it 'generates correct structure' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end
See RSpec documentation.
Architecture
Canon follows an orchestrator pattern with MECE (Mutually Exclusive, Collectively Exhaustive) principles:
Comparison module (Canon::Comparison): Format detection, validation, and
delegation to format-specific comparators (XML, HTML, JSON, YAML).
DiffFormatter module (Canon::DiffFormatter): Diff mode detection and
delegation to mode-specific formatters (by-line, by-object).
Three-phase comparison:
- Preprocessing: Optional document normalization (c14n, normalize, format)
- Semantic matching: Configurable match dimensions with behaviors
- Diff rendering: Formatted output with visualization
See Match architecture for details.
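The three phases above can be pictured with a deliberately small JSON-only sketch. It uses only Ruby's standard library and is not how Canon is implemented; preprocess, match, render, and compare are hypothetical names chosen to mirror the phase names.

```ruby
require 'json'

# Phase 1, preprocessing: parse the text so formatting and key order drop out.
def preprocess(text)
  JSON.parse(text)
end

# Phase 2, matching: walk both trees and collect the paths whose values differ.
# In this sketch a missing key and an explicit null are treated alike, and
# arrays are compared as whole values.
def match(left, right, path = '$')
  if left.is_a?(Hash) && right.is_a?(Hash)
    (left.keys | right.keys).sort.flat_map do |key|
      match(left[key], right[key], "#{path}.#{key}")
    end
  elsif left == right
    []
  else
    [{ path: path, left: left, right: right }]
  end
end

# Phase 3, rendering: turn the collected differences into readable output.
def render(differences)
  return 'equivalent' if differences.empty?

  differences.map { |d| "#{d[:path]}: #{d[:left].inspect} -> #{d[:right].inspect}" }.join("\n")
end

def compare(text_a, text_b)
  render(match(preprocess(text_a), preprocess(text_b)))
end

puts compare('{"a": 1, "b": {"c": 2}}', '{"b": {"c": 3}, "a": 1}')
# prints $.b.c: 2 -> 3
```

Keeping the phases as separate functions means each one can be swapped independently, for example a different renderer over the same list of differences.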
CompareProfile architecture
Canon uses the CompareProfile class to encapsulate policy decisions about how differences in various dimensions should be handled during comparison. This provides clean separation of concerns between policy decisions, comparison logic, and difference classification.
Separation of concerns
The comparison system is divided into four distinct components:
- CompareProfile: Policy decisions (what to track, what affects equivalence)
- XmlComparator/HtmlComparator: Comparison logic (detect differences)
- DiffNode: Data representation (represents a difference)
- DiffClassifier: Classification logic (normative vs informative vs formatting)
Each component has ONE responsibility with no overlapping concerns:
- CompareProfile does NOT classify differences
- XmlComparator does NOT make policy decisions
- DiffClassifier does NOT compare documents
Policy methods
CompareProfile provides four key policy methods:
- track_dimension?(dimension): Should DiffNodes be created for this dimension? Returns true in verbose mode to track all differences for reporting.
- affects_equivalence?(dimension): Should differences affect equivalence? Determines the return value of the comparison. Returns false for dimensions with :ignore behavior.
- normative_dimension?(dimension): Is this dimension normative (affects equivalence) or informative (display only)? Used by DiffClassifier to set the normative flag on DiffNodes.
- supports_formatting_detection?(dimension): Can FormattingDetector apply to this dimension? Returns true only for text/content dimensions (:text_content, :structural_whitespace, :comments).
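As a rough illustration of how these four methods relate to one another, here is a toy class. It is a sketch under stated assumptions, not Canon's CompareProfile: the class name ToyProfile, its constructor, the :strict default, and the non-verbose behavior of track_dimension? are all assumptions made for this example. Only the method names and the documented return values come from the description above.

```ruby
# Toy stand-in for a policy object. Not Canon's implementation.
class ToyProfile
  FORMATTING_DIMENSIONS = %i[text_content structural_whitespace comments].freeze

  def initialize(match: {}, verbose: false)
    @match = match
    @verbose = verbose
  end

  # Assumption: dimensions without an explicit behavior default to :strict.
  def behavior_for(dimension)
    @match.fetch(dimension, :strict)
  end

  # Documented: true in verbose mode. Assumption: outside verbose mode,
  # track only the dimensions that can change the comparison result.
  def track_dimension?(dimension)
    @verbose || affects_equivalence?(dimension)
  end

  # Documented: false for dimensions with :ignore behavior.
  def affects_equivalence?(dimension)
    behavior_for(dimension) != :ignore
  end

  # A dimension is normative exactly when it affects equivalence.
  def normative_dimension?(dimension)
    affects_equivalence?(dimension)
  end

  # Documented: true only for the text/content dimensions.
  def supports_formatting_detection?(dimension)
    FORMATTING_DIMENSIONS.include?(dimension)
  end
end

profile = ToyProfile.new(match: { comments: :ignore }, verbose: true)
puts profile.track_dimension?(:comments)     # prints true
puts profile.affects_equivalence?(:comments) # prints false
```

Because the profile only answers policy questions, the comparator and classifier can stay free of configuration logic.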
CompareProfile architecture
Canon uses a CompareProfile system to define format-specific comparison policies.
This allows different formats (HTML, XML, JSON, YAML) to have their own default
behaviors while maintaining a consistent architecture.
How CompareProfile works
The CompareProfile class provides the foundation for policy-based comparison:
Normative policy: Determines what differences matter for equivalence. Each
dimension (:text_content, :structural_whitespace, :comments, etc.) has a
behavior (:strict, :normalize, :ignore) that determines whether differences
in that dimension affect equivalence.
Dimension-based classification: Each difference has a dimension and the profile determines if that dimension is:
- Normative: Affects equivalence (documents not equivalent if different)
- Informative: Tracked but doesn’t affect equivalence
- Formatting-only: Pure whitespace differences when normalized content matches
Classification hierarchy:
- Normative (highest priority): Differences that make documents non-equivalent
- Informative (medium priority): Differences that are tracked but don’t affect equivalence
- Formatting-only (lowest priority): Pure whitespace/formatting differences
Dimension behaviors
Each dimension can have one of three behaviors:
- :strict: Differences in this dimension are normative (affect equivalence)
- :normalize: Differences are normalized; only semantic changes are normative
- :ignore: Differences are informative only (don’t affect equivalence)
# Default (strict mode): whitespace differences are normative
xml1 = '<root><p>Hello world</p></root>'
# Double quotes so that \n is a real line break, not a backslash followed by "n"
xml2 = "<root><p>Hello\nworld</p></root>"
Canon::Comparison.equivalent?(xml1, xml2) # => false
# Normalize mode: whitespace-only differences are formatting-only
Canon::Comparison.equivalent?(xml1, xml2,
  match: { text_content: :normalize, structural_whitespace: :normalize }
) # => true
In normalize mode, the line break is detected as formatting-only because the normalized content ("Hello world") is the same.
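The interaction between a dimension's behavior and the three classification tiers can be sketched in a few lines of plain Ruby. This is an illustration of the rules described above, not Canon's DiffClassifier; normalize_whitespace and classify_text_difference are hypothetical helpers, and collapsing whitespace runs to a single space is an assumed normalization rule.

```ruby
# Assumed normalization: collapse every run of whitespace to one space.
def normalize_whitespace(text)
  text.gsub(/\s+/, ' ').strip
end

# Classify the difference between two text values under a given behavior.
# Returns nil when the values are identical.
def classify_text_difference(before, after, behavior)
  return nil if before == after
  return :informative if behavior == :ignore

  if behavior == :normalize &&
     normalize_whitespace(before) == normalize_whitespace(after)
    return :formatting
  end

  :normative
end

p classify_text_difference("Hello world", "Hello\nworld", :strict)    # prints :normative
p classify_text_difference("Hello world", "Hello\nworld", :normalize) # prints :formatting
p classify_text_difference("Hello world", "Hello there", :normalize)  # prints :normative
p classify_text_difference("Hello world", "Hello there", :ignore)     # prints :informative
```

Only the :normative result would make two documents non-equivalent; the other two tiers are reported without changing the outcome.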
Format-specific profiles
Different formats can extend CompareProfile with format-specific policies:
- XML (base): Strict policies for all dimensions
- HTML (HtmlCompareProfile): Comments ignored by default, whitespace preserved in certain elements
- JSON/YAML (future): Key order policies, type handling
See lib/canon/comparison/compare_profile.rb for the base implementation and
lib/canon/comparison/html_compare_profile.rb for HTML-specific policies.
Format-specific policies for HTML
Canon provides a format-specific CompareProfile implementation called HtmlCompareProfile that encapsulates policies specific to HTML comparison. This profile is automatically used by HtmlComparator based on detected HTML version.
Comments: Default behavior is :ignore (presentational content in HTML),
unless explicitly set to :strict. When comments are set to :strict,
they will affect equivalence.
Whitespace preservation: HtmlCompareProfile automatically preserves
whitespace in elements where it’s semantically significant (e.g., <pre>,
<code>, <textarea>, <script>, <style>). In other elements, whitespace
is normalized.
Case sensitivity: HTML5 is case-sensitive for element names, while HTML4 is case-insensitive. HtmlCompareProfile uses HTML5 case-sensitivity by default.
Usage example
When using match: { comments: :ignore }:
- track_dimension?(:comments) returns true (track in verbose mode)
- affects_equivalence?(:comments) returns false (doesn’t affect equivalence)
- normative_dimension?(:comments) returns false (informative only)
This ensures that comment differences are tracked and displayed in verbose mode but don’t make documents non-equivalent.
xml1 = '<root><!-- comment 1 --><data>value</data></root>'
xml2 = '<root><!-- comment 2 --><data>value</data></root>'
result = Canon::Comparison.equivalent?(xml1, xml2,
  verbose: true,
  match: { comments: :ignore }
)
result.differences # => [#<DiffNode @dimension=:comments>]
result.differences[0].normative? # => false (informative)
result.equivalent? # => true (doesn't affect equivalence)
The comment difference is tracked and displayed, but the documents are still
considered equivalent because comments are set to :ignore.
html1 = '<div><!-- comment --><p>Text</p></div>'
html2 = '<div><p>Text</p></div>'
# HTML defaults: comments are ignored (presentational)
result = Canon::Comparison.equivalent?(html1, html2)
# => true (comments don't affect HTML equivalence by default)
# Explicit strict matching
result = Canon::Comparison.equivalent?(html1, html2,
  match: { comments: :strict }
)
# => false (comments now affect equivalence)
Comments in HTML are considered presentational content (like CSS styles) and
don’t affect the semantic meaning unless explicitly configured to :strict.
# Double quotes so that \n is a real line break
html1 = "<pre>Line 1\n  Line 2</pre>"
html2 = "<pre>Line 1\nLine 2</pre>"
# Whitespace is preserved in <pre> elements
result = Canon::Comparison.equivalent?(html1, html2)
# => false (whitespace differs in pre element)
# But normalized in other elements
html3 = '<div>Text  with   spaces</div>'
html4 = '<div>Text with spaces</div>'
result = Canon::Comparison.equivalent?(html3, html4)
# => true (whitespace normalized in regular elements)
HtmlCompareProfile automatically preserves whitespace in elements where it’s
semantically significant (<pre>, <code>, <textarea>, <script>,
<style>), while normalizing it in other elements.
Future format profiles: The architecture supports additional format-specific profiles for JSON, YAML, and other formats as needed.
Development
After checking out the repo, run bin/setup to install dependencies. Then run
rake spec to run the tests. You can also run bin/console for an interactive
prompt.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/lutaml/canon.
Copyright and license
Copyright Ribose. BSD-2-Clause License.