Canon: Semantic comparison for serialization formats
- Purpose
- Installation
- Quick start
- Format documents
- Compare documents
- Use in tests
- Command-line interface
- Documentation
- Using Canon
- Understanding Canon
- Features
- Advanced topics
- Features
- Canonicalization
- Semantic comparison
- Smart diff output
- Enhanced diff features
- Input validation
- Examples
- Ruby API example
- CLI example
- RSpec example
- Architecture
- Development
- Contributing
- Copyright and license
Purpose
Canon provides canonicalization, pretty-printing, and semantic comparison for serialization formats (XML, HTML, JSON, YAML). It produces standardized forms suitable for comparison, testing, digital signatures, and human-readable output.
Key features:
- Format support: XML, HTML, JSON, YAML
- Canonicalization: W3C XML C14N 1.1, sorted JSON/YAML keys
- Semantic comparison: Compare meaning, not formatting
- Multiple interfaces: Ruby API, CLI, RSpec matchers
- Smart diff output: By-line or by-object modes with syntax highlighting
Installation
Add to your application’s Gemfile:
gem 'canon'
Then execute:
$ bundle install
Or install directly:
$ gem install canon
Quick start
Format documents
require 'canon'
# Canonical form (compact)
Canon.format('<root><b>2</b><a>1</a></root>', :xml)
# => "<root><a>1</a><b>2</b></root>"
# Pretty-print (human-readable)
require 'canon/pretty_printer/xml'
Canon::Xml::PrettyPrinter.new(indent: 2).format(xml_input)
Compare documents
require 'canon/comparison'
xml1 = '<root><a>1</a><b>2</b></root>'
xml2 = '<root> <b>2</b> <a>1</a> </root>'
Canon::Comparison.equivalent?(xml1, xml2)
# => true (semantically equivalent despite formatting differences)
Use in tests
require 'canon/rspec_matchers'
RSpec.describe 'XML generation' do
  it 'generates correct XML' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end
Command-line interface
# Format a file
$ canon format input.xml --mode pretty
# Compare files
$ canon diff file1.xml file2.xml --verbose
# Get help
$ canon help
Documentation
Using Canon
- Ruby API - Using Canon from Ruby code
- Command-line interface - CLI commands and options
- RSpec matchers - Testing with Canon
Understanding Canon
- Match architecture - How comparison works
- Format support - XML, HTML, JSON, YAML details
- Diff modes - By-line vs by-object comparison
Features
- Preprocessing - Document normalization options
- Match options - Match dimensions and profiles
- Diff formatting - Customizing diff output
- Character visualization - Whitespace and special characters
- Input validation - Error handling
Advanced topics
- Verbose mode - Two-tier diff architecture
- Semantic diff report - Diff report format
- Normative vs informative diffs - Diff classification
- Diff architecture - Technical pipeline details
Features
Canonicalization
XML: W3C Canonical XML Version 1.1 specification with namespace declaration ordering, attribute ordering, character encoding normalization, and proper handling of xml:base, xml:lang, xml:space, and xml:id attributes.
HTML: Consistent formatting for HTML 4/5 and XHTML with automatic detection and appropriate formatting rules.
JSON/YAML: Alphabetically sorted keys at all levels with consistent formatting.
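The key-sorting idea for JSON can be illustrated with a minimal sketch that uses only Ruby's standard json library. This is not Canon's implementation; the helper name sort_keys_deep is hypothetical and exists only for this example.

```ruby
require 'json'

# Hypothetical helper (not part of Canon): recursively sort hash keys so
# that two documents with the same content serialize to the same string.
def sort_keys_deep(value)
  case value
  when Hash
    value.sort_by { |key, _| key.to_s }
         .to_h { |key, child| [key, sort_keys_deep(child)] }
  when Array
    # Array order is significant, so only the elements are processed.
    value.map { |item| sort_keys_deep(item) }
  else
    value
  end
end

doc_a = JSON.parse('{"b": 2, "a": {"d": 4, "c": 3}}')
doc_b = JSON.parse('{"a": {"c": 3, "d": 4}, "b": 2}')

canonical_a = JSON.generate(sort_keys_deep(doc_a))
canonical_b = JSON.generate(sort_keys_deep(doc_b))

puts canonical_a                 # {"a":{"c":3,"d":4},"b":2}
puts canonical_a == canonical_b  # true
```

Once keys are sorted at every level, a plain string comparison of the serialized output is enough to detect whether two documents carry the same data.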
Semantic comparison
Compare documents based on meaning, not formatting:
- Whitespace normalization options
- Attribute/key order handling
- Comment handling
- Multiple match dimensions with behaviors
- Predefined match profiles (strict, rendered, spec_friendly, content_only)
See Match options for details.
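To make "meaning, not formatting" concrete, here is a minimal sketch for JSON built on the standard library only. It does not use Canon; the helpers normalize_text and json_equivalent?, and the normalize_whitespace keyword, are assumptions made for illustration.

```ruby
require 'json'

# Hypothetical helper: collapse runs of whitespace in every string value.
def normalize_text(value)
  case value
  when Hash   then value.transform_values { |child| normalize_text(child) }
  when Array  then value.map { |child| normalize_text(child) }
  when String then value.strip.gsub(/\s+/, ' ')
  else value
  end
end

# Hypothetical comparison: parse first, then compare the data structures
# instead of the raw text.
def json_equivalent?(left, right, normalize_whitespace: false)
  a = JSON.parse(left)
  b = JSON.parse(right)
  if normalize_whitespace
    a = normalize_text(a)
    b = normalize_text(b)
  end
  a == b # Hash#== ignores key order
end

left  = '{"title": "Hello   world", "tags": ["x", "y"]}'
right = %({\n  "tags": ["x", "y"],\n  "title": " Hello world "\n})

strict  = json_equivalent?(left, right)
relaxed = json_equivalent?(left, right, normalize_whitespace: true)

puts strict   # false: the title strings differ in whitespace
puts relaxed  # true: whitespace differences are normalized away
```

The two documents differ in key order, indentation, and whitespace inside a string. Only the last difference affects the strict result, which is the kind of choice a match option expresses.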
Smart diff output
By-line mode: Traditional line-by-line diff with:
- DOM-guided semantic matching for XML
- Syntax-aware token highlighting
- Context lines around changes
- Whitespace visualization
By-object mode: Tree-based semantic diff with:
- Visual tree structure using box-drawing characters
- Shows only what changed (additions, removals, modifications)
- Color-coded output
See Diff modes for details.
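The by-object idea of reporting only what changed can be sketched for nested hashes in plain Ruby. This is a simplified illustration, not Canon's algorithm: object_diff is a hypothetical helper, and the box-drawing tree rendering and colors are left out.

```ruby
# Hypothetical tree diff: walk two nested hashes and report only the
# paths that were added, removed, or changed.
def object_diff(old, new, path = [])
  if old.is_a?(Hash) && new.is_a?(Hash)
    (old.keys | new.keys).flat_map do |key|
      here = path + [key]
      if !old.key?(key)
        [[:added, here, new[key]]]
      elsif !new.key?(key)
        [[:removed, here, old[key]]]
      else
        object_diff(old[key], new[key], here) # recurse into shared keys
      end
    end
  elsif old == new
    [] # identical leaves produce no output
  else
    [[:changed, path, [old, new]]]
  end
end

old_doc = { 'name' => 'canon', 'opts' => { 'color' => true, 'indent' => 2 } }
new_doc = { 'name' => 'canon', 'opts' => { 'indent' => 4, 'mode' => 'pretty' } }

changes = object_diff(old_doc, new_doc)
changes.each do |kind, at, value|
  puts "#{kind} #{at.join('.')}: #{value.inspect}"
end
```

The unchanged name key produces no output at all, while each difference under opts is reported with its full path.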
Enhanced diff features
- Color-coded output: Red (normative deletions), green (normative additions), yellow (normative structure), cyan (informative diffs)
- Whitespace visualization: Make invisible characters visible with CJK-safe Unicode symbols
- Non-ASCII detection: Warnings for unexpected Unicode characters
- Customizable: Character maps, context lines, grouping options
See Diff formatting and Character visualization for details.
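Whitespace visualization and non-ASCII detection can be approximated in a few lines of plain Ruby. The character map below is hypothetical and chosen arbitrarily for the example; the symbols Canon uses by default may differ, and the helper names are not part of Canon.

```ruby
# Hypothetical character map (an assumption, not Canon's defaults).
VISIBLE = {
  ' '      => '·',   # space
  "\t"     => '→',   # tab
  "\n"     => "↵\n", # mark the line break, then keep it
  "\u00A0" => '␣'    # non-breaking space
}.freeze

# Replace each mapped invisible character with its visible symbol.
def visualize_whitespace(text, map = VISIBLE)
  text.gsub(Regexp.union(map.keys)) { |char| map[char] }
end

# List the distinct characters that fall outside the ASCII range.
def non_ascii_chars(text)
  text.each_char.reject(&:ascii_only?).uniq
end

puts visualize_whitespace("a b\tc\n")
p non_ascii_chars("na\u00EFve caf\u00E9")
```

Passing a different map as the second argument mirrors the idea of a customizable character map.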
Input validation
Comprehensive validation with clear error messages showing exact line and column numbers for syntax errors in XML, HTML, JSON, and YAML.
See Input validation for details.
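As a rough illustration of position-aware validation, the sketch below relies on the parsers in Ruby's standard library rather than on Canon. The helper validation_error is hypothetical. Psych::SyntaxError exposes line and column directly, while for JSON the sketch simply passes the parser's own message through.

```ruby
require 'json'
require 'yaml'

# Hypothetical validator: returns nil when the input parses, otherwise a
# message with whatever position information the parser exposes.
def validation_error(input, format)
  case format
  when :yaml
    YAML.safe_load(input)
  when :json
    JSON.parse(input)
  else
    raise ArgumentError, "unsupported format: #{format.inspect}"
  end
  nil
rescue Psych::SyntaxError => e
  "YAML syntax error at line #{e.line}, column #{e.column}: #{e.problem}"
rescue JSON::ParserError => e
  "JSON syntax error: #{e.message}"
end

puts validation_error("key: [1, 2", :yaml) # unterminated flow sequence
puts validation_error('{"a": }', :json)    # missing value
p validation_error("key: value", :yaml)    # nil, the input is valid
```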
Examples
Ruby API example
require 'canon/comparison'
# Compare with custom options
Canon::Comparison.equivalent?(doc1, doc2,
  match: {
    text_content: :normalize,
    structural_whitespace: :ignore,
    comments: :ignore
  },
  verbose: true
)
CLI example
# Compare with semantic diff
$ canon diff file1.xml file2.xml \
  --verbose \
  --text-content normalize \
  --structural-whitespace ignore
See CLI documentation.
RSpec example
# Configure globally
Canon::RSpecMatchers.configure do |config|
  config.xml.match.profile = :spec_friendly
  config.xml.diff.use_color = true
end
# Use in tests
RSpec.describe 'XML generation' do
  it 'generates correct structure' do
    expect(actual_xml).to be_xml_equivalent_to(expected_xml)
  end
end
See RSpec documentation.
Architecture
Canon follows an orchestrator pattern with MECE (Mutually Exclusive, Collectively Exhaustive) principles:
Comparison module (Canon::Comparison): Format detection, validation, and
delegation to format-specific comparators (XML, HTML, JSON, YAML).
DiffFormatter module (Canon::DiffFormatter): Diff mode detection and
delegation to mode-specific formatters (by-line, by-object).
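The orchestrator idea described above (detect, then delegate) can be sketched generically as follows. Nothing here reflects Canon's internal classes: COMPARATORS, detect_format, and the top-level equivalent? are hypothetical names, the format detection is deliberately naive, and only JSON and YAML are covered because they are available in Ruby's standard library.

```ruby
require 'json'
require 'yaml'

# Hypothetical format-specific comparators; each one knows a single format.
COMPARATORS = {
  json: ->(a, b) { JSON.parse(a) == JSON.parse(b) },
  yaml: ->(a, b) { YAML.safe_load(a) == YAML.safe_load(b) }
}.freeze

# Hypothetical, deliberately naive detection based on the first character.
def detect_format(text)
  text.lstrip.start_with?('{', '[') ? :json : :yaml
end

# Hypothetical orchestrator: it holds no format logic of its own and only
# picks a comparator and delegates to it.
def equivalent?(left, right, format: detect_format(left))
  comparator = COMPARATORS.fetch(format) do
    raise ArgumentError, "unsupported format: #{format.inspect}"
  end
  comparator.call(left, right)
end

puts equivalent?('{"a": 1, "b": 2}', '{"b": 2, "a": 1}')
puts equivalent?("a: 1\nb: 2\n", "b: 2\na: 1\n")
```

Keeping each comparator responsible for exactly one format is what lets a new format be added without touching the orchestrator.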
Three-phase comparison:
- Preprocessing: Optional document normalization (c14n, normalize, format)
- Semantic matching: Configurable match dimensions with behaviors
- Diff rendering: Formatted output with visualization
See Match architecture for details.
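The three phases listed above can be pictured as a small pipeline. The sketch below is a toy version for flat JSON objects with hypothetical method names (preprocess, match, render, compare); it shows the separation of phases only and says nothing about how Canon implements each one.

```ruby
require 'json'

# Phase 1 (hypothetical): preprocessing, here reduced to parsing the text.
def preprocess(text)
  JSON.parse(text)
end

# Phase 2 (hypothetical): matching, returns the keys whose values differ.
def match(left, right)
  (left.keys | right.keys).reject { |key| left[key] == right[key] }
end

# Phase 3 (hypothetical): rendering, turns the differences into text.
def render(left, right, keys)
  return 'documents are equivalent' if keys.empty?

  keys.map { |key| "#{key}: #{left[key].inspect} -> #{right[key].inspect}" }
      .join("\n")
end

# Run the phases in order; each phase only sees the previous phase's output.
def compare(left_text, right_text)
  left  = preprocess(left_text)
  right = preprocess(right_text)
  render(left, right, match(left, right))
end

puts compare('{"a": 1, "b": 2}', '{"b": 3, "a": 1}')
puts compare('{"a": 1}', '{ "a" : 1 }')
```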
Development
After checking out the repo, run bin/setup to install dependencies. Then run
rake spec to run the tests. You can also run bin/console for an interactive
prompt.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/lutaml/canon.
Copyright and license
Copyright Ribose. BSD-2-Clause License.