NITFr
A Ruby gem for parsing NITF (News Industry Text Format) XML files.
NITF is a standard XML format developed by the IPTC (International Press Telecommunications Council) for marking up news articles. NITFr makes it easy for Ruby applications to parse and extract content from NITF documents.
Requirements
- Ruby 3.0 or higher
- No native extensions or external dependencies (pure Ruby using REXML)
Security
NITFr is designed with security in mind:
- XXE Protection: REXML does not expand external entities by default, protecting against XML External Entity (XXE) attacks
- Entity Expansion Limits: Configured to prevent "Billion Laughs" and similar entity expansion attacks
- No Code Execution: The parser never evaluates or executes content from XML documents
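The entity expansion limit can be observed with REXML on its own. The following is a minimal sketch, not NITFr's actual configuration: it assumes REXML's global `REXML::Security.entity_expansion_limit` setting, and the helper name `expansion_blocked?` and the limits used are chosen only for illustration.

```ruby
require 'rexml/document'

# A small nested-entity document in the style of a "Billion Laughs" payload.
# Expanding &c; takes 111 entity expansions in total (1 + 10 + 100).
NESTED_ENTITIES = <<~XML
  <?xml version="1.0"?>
  <!DOCTYPE nitf [
    <!ENTITY a "aaaaaaaaaa">
    <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;&a;&a;">
    <!ENTITY c "&b;&b;&b;&b;&b;&b;&b;&b;&b;&b;">
  ]>
  <nitf><head><title>&c;</title></head></nitf>
XML

# Hypothetical helper: returns true when REXML refuses to expand the
# entities in +xml+ under the given limit, false when expansion succeeds.
def expansion_blocked?(xml, limit:)
  previous = REXML::Security.entity_expansion_limit
  REXML::Security.entity_expansion_limit = limit
  doc = REXML::Document.new(xml)
  doc.root.elements['head/title'].text # reading the text forces expansion
  false
rescue RuntimeError
  # REXML signals an exceeded limit with a RuntimeError (or a subclass).
  true
ensure
  # Restore the process-wide setting so other parsing is unaffected.
  REXML::Security.entity_expansion_limit = previous
end

puts expansion_blocked?(NESTED_ENTITIES, limit: 10)
```

Because the setting is global to the process, the sketch restores the previous value in `ensure` rather than leaving the lowered limit in place.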
Installation
Add this line to your application's Gemfile:
gem 'nitfr'
And then execute:
bundle install
Or install it yourself:
gem install nitfr
Usage
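The examples in this section read a file named `article.xml`. For trying them out, a small document along the following lines may be enough. It is a sketch built only from the elements listed under NITF Structure; the constant name `SAMPLE_NITF` and the attribute values are illustrative, and real feeds usually carry far more markup.

```ruby
require 'rexml/document'

# Illustrative sample only: a minimal NITF-shaped document.
SAMPLE_NITF = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <nitf>
    <head>
      <title>Sample News Article Title</title>
      <docdata>
        <doc-id id-string="article-2024-001"/>
        <date.issue norm="20241215"/>
      </docdata>
    </head>
    <body>
      <body.head>
        <headline>
          <hl1>Revolutionary Technology Changes Industry</hl1>
          <hl2>Experts predict widespread adoption</hl2>
        </headline>
        <byline>By Jane Smith, Senior Reporter</byline>
        <dateline>SAN FRANCISCO, Dec 15</dateline>
      </body.head>
      <body.content>
        <p>A new technology platform was announced today.</p>
      </body.content>
      <body.end>
        <tagline>Contact: press@example.com</tagline>
      </body.end>
    </body>
  </nitf>
XML

# Sanity check with REXML alone: the sample is well-formed and rooted at <nitf>.
sample = REXML::Document.new(SAMPLE_NITF)
puts sample.root.name
puts REXML::XPath.first(sample, '//hl1').text
```

Writing this string to `article.xml` (for example with `File.write('article.xml', SAMPLE_NITF)`) gives the snippets in this section something to parse. The return values shown in those snippets come from a fuller document, so not every accessor will have data for this sample.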
Basic Parsing
require 'nitfr'
# Parse from a string
xml = File.read('article.xml')
doc = NITFr.parse(xml)
# Or parse directly from a file
doc = NITFr.parse_file('article.xml')
# With explicit encoding
doc = NITFr.parse_file('article.xml', encoding: 'ISO-8859-1')
Accessing Content
# Get the headline
doc.headline # => "Revolutionary Technology Changes Industry"
doc.headlines.primary # => "Revolutionary Technology Changes Industry"
doc.headlines.secondary # => "Experts predict widespread adoption"
# Get byline information
doc.byline.text # => "By Jane Smith, Senior Reporter"
doc.byline.person # => "Jane Smith"
doc.byline.title # => "Senior Reporter"
# Get the article text
doc.paragraphs.each do |para|
  puts para.text
end
# Or get all text at once
puts doc.text
Working with Metadata
# Document metadata
doc.title # => "Sample News Article Title"
doc.doc_id # => "article-2024-001"
doc.issue_date # => #<Date: 2024-12-15>
# Copyright info
doc.docdata.copyright_holder # => "Example News Corp"
doc.docdata.copyright_year # => "2024"
# Urgency (1-8, 1 being most urgent)
doc.docdata.urgency # => 4
# Identified content
doc.docdata.subjects # => ["Technology", "Business"]
doc.docdata.organizations # => ["TechCorp Inc"]
doc.docdata.people # => ["John Doe"]
doc.docdata.locations # => ["San Francisco"]
Working with Body Content
# Access the body section
body = doc.body
# Dateline and abstract
body.dateline # => "SAN FRANCISCO, Dec 15"
body.abstract # => "A new technology platform..."
# Block quotes
body.block_quotes # => ["Innovation distinguishes..."]
# Tagline from body.end
body.tagline # => "Contact: press@example.com"
Working with Paragraphs
doc.paragraphs.each do |para|
  # Check if it's the lead paragraph
  puts "LEAD: " if para.lead?
  # Get plain text
  puts para.text
  # Get entities mentioned in this paragraph
  puts "People: #{para.people.join(', ')}"
  puts "Organizations: #{para.organizations.join(', ')}"
  puts "Locations: #{para.locations.join(', ')}"
  # Get emphasized text
  puts "Emphasized: #{para.emphasis.join(', ')}"
  # Get links
  para.links.each do |link|
    puts "Link: #{link[:text]} -> #{link[:href]}"
  end
  # Word count
  puts "Words: #{para.word_count}"
end
Working with Media
doc.media.each do |media|
  puts "Caption: #{media.caption}"
  puts "Credit: #{media.credit}"
  puts "MIME type: #{media.mime_type}"
  if media.image?
    puts "Image: #{media.source}"
    puts "Size: #{media.width}x#{media.height}"
    puts "Alt text: #{media.alt_text}"
  elsif media.video?
    puts "Video: #{media.source}"
  elsif media.audio?
    puts "Audio: #{media.source}"
  end
  # Access all references (different sizes/formats)
  media.references.each do |ref|
    puts "  #{ref[:source]} (#{ref[:mime_type]})"
  end
end
Error Handling
begin
  doc = NITFr.parse(xml)
rescue NITFr::ParseError => e
  puts "Invalid XML: #{e.message}"
rescue NITFr::InvalidDocumentError => e
  puts "Not a valid NITF document: #{e.message}"
end
Document Attributes
# NITF version and change information
doc.version # => "-//IPTC//DTD NITF 3.5//EN"
doc.change_date # => "October 18, 2007"
doc.change_time # => "19:30"
# Check validity
doc.valid? # => true
# Get raw XML
doc.to_xml # => "<?xml version..."
Advanced Usage
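Since `doc.to_xml` returns the raw document, anything the accessors do not cover can be read with REXML directly. The following is a minimal sketch, assuming HTML-style `<table>`, `<tr>`, `<th>` and `<td>` markup inside `body.content`; the `RAW_XML` string stands in for `doc.to_xml` output, and `table_rows` is a hypothetical helper, not part of NITFr.

```ruby
require 'rexml/document'

# Stand-in for the string returned by doc.to_xml (illustrative data).
RAW_XML = <<~XML
  <nitf>
    <body>
      <body.content>
        <table>
          <tr><th>Company</th><th>Shares</th></tr>
          <tr><td>TechCorp Inc</td><td>1200</td></tr>
        </table>
      </body.content>
    </body>
  </nitf>
XML

# Hypothetical helper: converts a REXML <table> element into an array of
# rows, each row being an array of cell strings (<th> and <td> alike).
def table_rows(table)
  REXML::XPath.match(table, './/tr').map do |row|
    cells = row.elements.to_a.select { |cell| %w[th td].include?(cell.name) }
    cells.map { |cell| cell.text.to_s.strip }
  end
end

raw = REXML::Document.new(RAW_XML)
REXML::XPath.each(raw, '//table') do |table|
  p table_rows(table)
end
# prints [["Company", "Shares"], ["TechCorp Inc", "1200"]]
```

The same helper could be applied to each element yielded by `body.tables`, which returns raw REXML elements (see Body Section Extras). Cells with nested markup would need more than `cell.text`, which only reads the first text node.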
Head Section Details
head = doc.head
# Meta tags as a hash
head.meta # => {"keywords" => "tech, news", "author" => "Jane"}
head.keywords # => ["tech, news"]
# Publication data
head.pubdata[:type] # => "print"
head.pubdata[:name] # => "Example Times"
head.pubdata[:edition] # => "Morning"
head.pubdata[:volume] # => "42"
# Revision history
head.revision_history.each do |rev|
  puts "#{rev[:name]} (#{rev[:function]}): #{rev[:comment]}"
end
Extended Docdata
docdata = doc.docdata
# Additional dates
docdata.release_date # => #<Date: 2024-12-15>
docdata.expire_date # => #<Date: 2024-12-31>
# Document scope and fixture
docdata.doc_scope # => "national"
docdata.fixture # => "fixture-123"
# Series information
docdata.series[:name] # => "Investigation"
docdata.series[:part] # => 2
docdata.series[:total] # => 5
# Editorial status
docdata.management_status[:info] # => "Approved"
docdata.management_status[:message_type] # => "advisory"
Body Section Extras
body = doc.body
# Distributor and series
body.distributor # => "Wire Service"
body.series[:name] # => "Special Report"
body.series[:part] # => "1"
body.series[:totalpart] # => "3"
# Lists in the content
body.lists.each do |list|
  puts "#{list[:type]}: #{list[:items].join(', ')}"
end
# Tables (returns raw REXML elements)
body.tables.each do |table|
  # Process table XML as needed
end
# Notes from body.end
body.notes # => ["Editor's note: ...", "Correction: ..."]
# Bibliography
body.body_end_content[:bibliography] # => ["Source 1", "Source 2"]
NITF Structure
A typical NITF document has this structure:
<nitf>
  <head>
    <title>...</title>
    <docdata>
      <doc-id id-string="..."/>
      <date.issue norm="YYYYMMDD"/>
      ...
    </docdata>
  </head>
  <body>
    <body.head>
      <headline>
        <hl1>Primary Headline</hl1>
        <hl2>Secondary Headline</hl2>
      </headline>
      <byline>By Author Name</byline>
      <dateline>CITY, Date</dateline>
    </body.head>
    <body.content>
      <p>Paragraph content...</p>
      <media media-type="image">...</media>
    </body.content>
    <body.end>
      <tagline>...</tagline>
    </body.end>
  </body>
</nitf>
Development
After checking out the repo, install dependencies and run the tests:
bundle install
bundle exec rake test
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/amerine/nitfr.
License
The gem is available as open source under the terms of the MIT License.