Project

xml-simple

3.1
No commit activity in last 3 years
No release in over 3 years
A simple API for XML processing.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Runtime

>= 0
 Project Readme

XmlSimple - XML made easy

Introduction

The XmlSimple class offers an easy API to read and write XML. It is a Ruby translation of Grant McLean’s Perl module XML::Simple. Please note, that this tutorial was originally written by Grant McLean. I have only converted it to Asciidoc and adjusted it for the Ruby version.

Installation

XmlSimple is available as a Gem, so you can install it as follows:

  gem install xml-simple

Quick Start

Say you have a script called foo and a file of configuration options called foo.xml containing this:

  <config logdir="/var/log/foo/" debugfile="/tmp/foo.debug">
    <server name="sahara" osname="solaris" osversion="2.6">
      <address>10.0.0.101</address>
      <address>10.0.1.101</address>
    </server>
    <server name="gobi" osname="irix" osversion="6.5">
      <address>10.0.0.102</address>
    </server>
    <server name="kalahari" osname="linux" osversion="2.0.34">
      <address>10.0.0.103</address>
      <address>10.0.1.103</address>
    </server>
  </config>

The following lines of code in foo:

  require 'xmlsimple'
  config = XmlSimple.xml_in('foo.xml', { 'KeyAttr' => 'name' })

will 'slurp' the configuration options into the Hash config (if no arguments are passed to xml_in the name and location of the XML file will be inferred from name and location of the script). You can dump out the contents of the Hash using p config, which will produce something like this (formatting has been adjusted for brevity):

{
  'logdir'        => '/var/log/foo/',
  'debugfile'     => '/tmp/foo.debug',
  'server'        => {
    'sahara'        => {
      'osversion'     => '2.6',
      'osname'        => 'solaris',
      'address'       => [ '10.0.0.101', '10.0.1.101' ]
    },
    'gobi'          => {
      'osversion'     => '6.5',
      'osname'        => 'irix',
      'address'       => [ '10.0.0.102' ]
    },
    'kalahari'      => {
      'osversion'     => '2.0.34',
      'osname'        => 'linux',
      'address'       => [ '10.0.0.103', '10.0.1.103' ]
    }
  }
}

Your script could then access the name of the log directory like this:

  puts config['logdir']

Similarly, the second address on the server 'kalahari' could be referenced as:

  puts config['server']['kalahari']['address'][1]

What could be simpler? (Rhetorical).

For simple requirements, that’s really all there is to it. If you want to store your XML in a different directory or file, or pass it in as a string, you’ll need to check out the section on options. If you want to turn off or tweak the array folding feature (that neat little transformation that produced config['server']) you’ll find options for that as well.

If you want to generate XML (for example to write a modified version of config back out as XML), check out xml_out.

Description

The XmlSimple class provides a simple API layer on top of the REXML parser. Additionally, two functions are exported: xml_in and xml_out.

The simplest approach is to call these two functions directly, but an optional object-oriented interface (see the section on Optional OO Interface below) allows them to be called as methods of an XmlSimple object.

xml_in

Parses XML formatted data and returns a reference to a data structure which contains the same information in a more readily accessible form. (Skip down to the section on examples below, for more sample code).

xml_in accepts an optional XML specifier followed by a Hash containing 'name ⇒ value' option pairs. The XML specifier can be a filename, nil, or an IO object.

A Filename

If the filename contains no directory components xml_in will look for the file in each directory in the searchpath (see the section on options below). For example:

    ref = XmlSimple.xml_in('/etc/params.xml')

nil

If there is no XML specifier, xml_in will check the script directory and each of the searchpath directories for a file with the same name as the script but with the extension '.xml'. Note: if you wish to specify options, you must specify the value nil:

    ref = XmlSimple.xml_in(nil, { 'ForceArray' => false })

A String of XML

A string containing XML (recognised by the presence of '<' and '>' characters) will be parsed directly. For example:

    ref = XmlSimple.xml_in('<opt username="bob" password="flurp" />')

An IO object

An IO object will be read to EOF and its contents parsed. For example:

    file = File.open('/etc/params.xml')
    ref  = XmlSimple.xml_in(file)

xml_out

Takes a data structure (generally a Hash) and returns an XML encoding of that structure. If the resulting XML is parsed using xml_in, it will return a data structure equivalent to the original.

When translating hashes to XML, hash keys which have a leading '-' will be silently skipped. This is the approved method for marking elements of a data structure which should be ignored by xml_out. (Note: If these items were not skipped the key names would be emitted as element or attribute names with a leading '-' which would not be valid XML).

Caveats

Some care is required in creating data structures which will be passed to xml_out. Hash keys from the data structure will be encoded as either XML element names or attribute names. Therefore, you should use hash key names which conform to the relatively strict XML naming rules:

Names in XML must begin with a letter. The remaining characters may be letters, digits, hyphens (-), underscores () or full stops (.). It is also allowable to include one colon (:) in an element name but this should only be used when working with namespaces - a facility well beyond the scope of _XmlSimple.

You can use other punctuation characters in hash values (just not in hash keys) however XmlSimple does not support dumping binary data.

If you break these rules, the current implementation of xml_out will simply emit non-compliant XML which will be rejected if you try to read it back in. (A later version of XmlSimple might take a more proactive approach).

Note also that although you can nest hashes and arrays to arbitrary levels, recursive data structures are not supported and will cause xml_out to raise an exception.

Options

IMPORTANT NOTE FOR USERS OF THE PERL VERSION! The default values of some options have changed, some options are not supported and I have added new options, too:

  • 'ForceArray' is true by default.

  • 'KeyAttr' defaults to [] and not to ['name', 'key', 'id'].

  • The SAX parts of XML::Simple are currently not supported.

  • Namespaces are currently not supported.

  • Currently, there is no 'strict mode'.

  • 'AnonymousTag' is not available in the current Perl version.

  • 'Indent' is not available in the current Perl version.

  • The Perl version does not support so called blessed references and raises an exception ("can’t encode value of type"), if one is used. The Ruby version supports all object types, because every object in Ruby has a to_s method.

XmlSimple supports a number of options. If you find yourself repeatedly having to specify the same options, you might like to investigate the section on "Optional OO Interface" below.

Because there are so many options, it’s hard for new users to know which ones are important, so here are the two you really need to know about:

  • Check out 'ForceArray' because you’ll almost certainly want to leave it on.

  • Make sure you know what the 'KeyAttr' option does and what its default value is because it may surprise you otherwise.

Both xml_in and xml_out expect a single argument followed by a Hash containing options. So, an option takes the form of a 'name ⇒ value' pair. The options listed below are marked with 'in' if they are recognised by xml_in and 'out' if they are recognised by xml_out.

Each option is also flagged to indicate whether it is:

  • 'important' - don’t use the module until you understand this

  • 'handy' - you can skip this on the first time through

  • 'advanced' - you can skip this on the second time through

  • 'seldom used' - you’ll probably never use this unless you were the person that requested the feature

The options are listed alphabetically.

Note: Option names are not case-sensitive, so you can use the mixed case versions shown here. Additionally, you can put underscores between the words (for example 'key_attr').

AnonymousTag ⇒ 'tag name' (in + out) (seldom used)

By default, the tag to declare an anonymous value is 'anon'. Using option 'AnonymousTag' you can set it to an arbitrary string (that must obey to the XML naming rules, of course).

Cache ⇒ [ cache scheme(s) ] (in) (advanced)

Because loading the REXML parser module and parsing an XML file can consume a significant number of CPU cycles, it is often desirable to cache the output of xml_in for later reuse.

When parsing from a named file, XmlSimple supports a number of caching schemes. The 'Cache' option may be used to specify one or more schemes (using an anonymous array). Each scheme will be tried in turn in the hope of finding a cached pre-parsed representation of the XML file. If no cached copy is found, the file will be parsed and the first cache scheme in the list will be used to save a copy of the results. The following cache schemes have been implemented:

storable

Utilises Marshal to read/write a cache file with the same name as the XML file but with the extension .stor.

mem_share

When a file is first parsed, a copy of the resulting data structure is retained in memory in XmlSimple's namespace. Subsequent calls to parse the same file will return a reference to this structure. This cached version will persist only for the life of the Ruby interpreter (which in the case of mod_ruby for example, may be some significant time).

Because each caller receives a reference to the same data structure, a change made by one caller will be visible to all. For this reason, the reference returned should be treated as read-only.

mem_copy

This scheme works identically to 'mem_share' (above) except that each caller receives a reference to a new data structure which is a copy of the cached version. Copying the data structure will add a little processing overhead, therefore this scheme should only be used where the caller intends to modify the data structure (or wishes to protect itself from others who might). This scheme uses the Marshal module to perform the copy.

Warning! The memory-based caching schemes compare the timestamp on the file to the time when it was last parsed. If the file is stored on an NFS filesystem (or other network share) and the clock on the file server is not exactly synchronised with the clock where your script is run, updates to the source XML file may appear to be ignored.

ContentKey ⇒ 'keyname' (in + out) (seldom used)

When text content is parsed to a hash value, this option let’s you specify a name for the hash key to override the default 'content'. So for example:

    XmlSimple.xml_in('<opt one="1">Text</opt>', { 'ContentKey' => 'text' })

will parse to:

    { 'one' => '1', 'text' => 'Text' }

instead of:

    { 'one' => '1', 'content' => 'Text' }

xml_out will also honour the value of this option when converting a hash to XML.

You can also prefix your selected key name with a '-' character to have xml_in try a little harder to eliminate unnecessary 'content' keys after array folding. For example:

  XmlSimple.xml_in(%q(
    <opt>
      <item name="one">First<item>
      <item name="two">Second<item>
    <opt>), {
  'KeyAttr'    => { 'item' => 'name' },
  'ForceArray' => [ 'item' ],
  'ContentKey' => '-content'
})

will parse to:

{
  'item' => {
    'one'  =>  'First',
    'two'  =>  'Second'
  }
}

rather than this (without the '-'):

{
  'item' => {
    'one'  => { 'content' => 'First' },
    'two'  => { 'content' => 'Second' }
  }
}

ForceArray ⇒ true | false (in) (IMPORTANT!)

This option should be set to true to force nested elements to be represented as arrays even when there is only one. For example, with 'ForceArray' enabled, this XML:

<opt>
  <name>value</name>
</opt>

would parse to this:

{
  'name' => [ 'value' ]
}

instead of this (the default):

{
  'name' => 'value'
}

This option is especially useful if the data structure is likely to be written back out as XML and the default behaviour of rolling single nested elements up into attributes is not desirable.

If you are using the array folding feature, you should almost certainly enable this option. If you do not, single nested elements will not be parsed to arrays and therefore will not be candidates for folding to a hash.

The option is true by default.

ForceArray ⇒ [ name(s) ] (in) (IMPORTANT!)

This alternative form of the 'ForceArray' option allows you to specify a list of element names which should always be forced into an array representation, rather than the 'all or nothing' approach above.

It is also possible to include compiled regular expressions in the list - any element names which match the pattern will be forced to arrays. If the list contains only a single regex, then it is not necessary to enclose it in an Array. For example,

'ForceArray' => %r(_list$)

ForceContent (in) (seldom used)

When xml_in parses elements which have text content as well as attributes, the text content must be represented as a hash value rather than a simple scalar. This option allows you to force text content to always parse to a hash value even when there are no attributes. So, for example:

  xml =%q(
    <opt>
      <x>text1</x>
      <y a="2">text2</y>
    </opt>)
  XmlSimple.xml_in(xml, { 'ForceContent' => true })

will parse to:

    {
      'x' => {             'content' => 'text1' },
      'y' => { 'a' => '2', 'content' => 'text2' }
    }

instead of:

    {
      'x' => 'text1',
      'y' => { 'a' => '2', 'content' => 'text2' }
    }

GroupTags ⇒ { grouping tag ⇒ grouped tag } (in + out) (handy)

You can use this option to eliminate extra levels of indirection in your Ruby data structure. For example this XML:

  xml = %q(
  <opt>
    <searchpath>
      <dir>usr/bin<dir>
      <dir>usr/local/bin<dir>
      <dir>usr/X11/bin<dir>
    <searchpath>
  <opt>)

Would normally be read into a structure like this:

 {
   'searchpath' => {
     'dir' => [ '/usr/bin', '/usr/local/bin', '/usr/X11/bin' ]
   }
 }

But when read in with the appropriate value for 'GroupTags':

    opt = XmlSimple.xml_in(xml, { 'GroupTags' => { 'searchpath' => 'dir' })

It will return this simpler structure:

    {
      'searchpath' => [ '/usr/bin', '/usr/local/bin', '/usr/X11/bin' ]
    }

You can specify multiple 'grouping element' to 'grouped element' mappings in the same Hash. If this option is combined with 'KeyAttr', the array folding will occur first and then the grouped element names will be eliminated.

xml_out will also use the grouptag mappings to re-introduce the tags around the grouped elements. Beware though that this will occur in all places that the 'grouping tag' name occurs - you probably don’t want to use the same name for elements as well as attributes.

Indent ⇒ 'string' (out) (seldom used)

By default, xml_out's pretty printing mode indents the output document using two blanks. 'Indent' allows you to use an arbitrary string for indentation.

If the 'NoIndent' option is set, 'Indent' will be ignored.

KeepRoot ⇒ true | false (in + out) (handy)

In its attempt to return a data structure free of superfluous detail and unnecessary levels of indirection, xml_in normally discards the root element name. Setting the 'KeepRoot' option to true will cause the root element name to be retained. So after executing this code:

    config = XmlSimple.xml_in('<config tempdir="/tmp" />', { 'KeepRoot' => true })

you’ll be able to reference the tempdir as config['config']['tempdir'] instead of the default config['tempdir'].

Similarly, setting the 'KeepRoot' option to true will tell xml_out that the data structure already contains a root element name and it is not necessary to add another.

KeyAttr ⇒ [ list ] (in + out) (IMPORTANT!)

This option controls the 'array folding' feature which translates nested elements from an array to a hash. For example, this XML:

        <opt>
          <user login="grep" fullname="Gary R Epstein" />
          <user login="stty" fullname="Simon T Tyson" />
        </opt>

would, by default, parse to this:

{
  'user' => [
     {
       'login'    => 'grep',
       'fullname' => 'Gary R Epstein'
     },
     {
       'login'    => 'stty',
       'fullname' => 'Simon T Tyson'
     }
   ]
}

If the option 'KeyAttr ⇒ "login"' were used to specify that the 'login' attribute is a key, the same XML would parse to:

{
  'user' => {
    'stty' => {
      'fullname' => 'Simon T Tyson'
    },
    'grep' => {
      'fullname' => 'Gary R Epstein'
    }
  }
}

The key attribute names should be supplied in an array if there is more than one. xml_in will attempt to match attribute names in the order supplied. xml_out will use the first attribute name supplied when 'unfolding' a hash into an array.

Note: the 'KeyAttr' option controls the folding of arrays. By default, a single nested element will be rolled up into a scalar rather than an array and therefore will not be folded. Use the 'ForceArray' option to force nested elements to be parsed into arrays and therefore candidates for folding into hashes.

The default value for 'KeyAttr' is [], that is, the array folding feature is disabled.

KeyAttr ⇒ { list } (in + out) (IMPORTANT!)

This alternative method of specifying the key attributes allows more fine grained control over which elements are folded and on which attributes. For example the option 'KeyAttr ⇒ { 'package' ⇒ 'id' } will cause any package elements to be folded on the 'id' attribute. No other elements which have an 'id' attribute will be folded at all.

Note: xml_in will generate a warning if this syntax is used and an element which does not have the specified key attribute is encountered (for example: a 'package' element without an 'id' attribute, to use the example above).

Two further variations are made possible by prefixing a '+' or a '-' character to the attribute name:

The option

  'KeyAttr' => { 'user' => "+login" }'

will cause this XML:

<opt>
  <user login="grep" fullname="Gary R Epstein" />
  <user login="stty" fullname="Simon T Tyson" />
</opt>

to parse to this data structure:

{
  'user' => {
    'stty' => {
      'fullname' => 'Simon T Tyson',
      'login'    => 'stty'
    },
    'grep' => {
      'fullname' => 'Gary R Epstein',
      'login'    => 'grep'
    }
  }
}

The '+' indicates that the value of the key attribute should be copied rather than moved to the folded hash key.

A '-' prefix would produce this result:

{
  'user' => {
    'stty' => {
      'fullname' => 'Simon T Tyson',
      '-login'   => 'stty'
    },
    'grep' => {
      'fullname' => 'Gary R Epstein',
      '-login'   => 'grep'
    }
  }
}

As described earlier, xml_out will ignore hash keys starting with a '-'.

AttrPrefix ⇒ true | false (in + out) (handy)

XmlSimple treats attributes and elements equally and there is no way to determine, if a certain hash key has been derived from an element name or from an attribute name. Sometimes you need this information and that’s when you use the AttrPrefix option:

xml_str = <<XML_STR
<Customer id="12253">
  <first_name>Joe</first_name>
  <last_name>Joe</last_name>
  <Address type="home">
    <line1>211 Over There</line1>
    <city>Jacksonville</city>
    <state>FL</state>
    <zip_code>11234</zip_code>
  </Address>
  <Address type="postal">
    <line1>3535 Head Office</line1>
    <city>Jacksonville</city>
    <state>FL</state>
    <zip_code>11234</zip_code>
  </Address>
</Customer>
XML_STR

result = XmlSimple.xml_in xml_str, { 'ForceArray' => false, 'AttrPrefix' => true }
p result

produces:

{
  "@id" => "12253",
  "first_name" => "Joe",
  "Address" => [
    {
      "city" => "Jacksonville",
      "line1" => "211 Over There",
      "zip_code" => "11234",
      "@type" => "home",
      "state" => "FL"
    },
    {
      "city" => "Jacksonville",
      "line1" => "3535 Head Office",
      "zip_code" => "11234",
      "@type" => "postal",
      "state" => "FL"
    }
  ],
  "last_name" => "Joe"
}

As you can see all hash keys that have been derived from attributes are prefixed by an @ character, so now you know if they have been elements or attributes before. Of course, xml_out knows how to correctly transform hash keys prefixed by an @ character, too:

    doc = REXML::Document.new XmlSimple.xml_out(result, 'AttrPrefix' => true)
    d = ''
    doc.write(d)
    puts d

produces:

<opt id="12253">
  <first_name>Joe</first_name>
  <last_name>Joe</last_name>
  <Address type="home">
    <line1>211 Over There</line1>
    <city>Jacksonville</city>
    <state>FL</state>
    <zip_code>11234</zip_code>
  </Address>
  <Address type="postal">
    <line1>3535 Head Office</line1>
    <city>Jacksonville</city>
    <state>FL</state>
    <zip_code>11234</zip_code>
  </Address>
</opt>

NoAttr ⇒ true | false (in + out) (handy)

When used with xml_out, the generated XML will contain no attributes. All hash key/values will be represented as nested elements instead.

When used with xml_in, any attributes in the XML will be ignored.

NormaliseSpace ⇒ 0 | 1 | 2 (in) (handy)

This option controls how whitespace in text content is handled. Recognised values for the option are:

  • 0 - The default behaviour is for whitespace to be passed through unaltered (except of course for the normalisation of whitespace in attribute values which is mandated by the XML recommendation).

  • 1 - Whitespace is normalised in any value used as a hash key (normalising means removing leading and trailing whitespace and collapsing sequences of whitespace characters to a single space).

  • 2 - Whitespace is normalised in all text content.

Note: you can spell this option with a 'z' if that is more natural for you.

NoEscape ⇒ true | false (out) (seldom used)

By default, xml_out will translate the characters <, >, &, ', and " to '<', '>', '&amp', '&apos', and '&quot' respectively. Use this option to suppress escaping (presumably because you’ve already escaped the data in some more sophisticated manner).

NoIndent ⇒ true | false (out) (seldom used)

Set this option to true to disable xml_out's default 'pretty printing' mode. With this option enabled, the XML output will all be on one line (unless there are newlines in the data) - this may be easier for downstream processing.

KeyToSymbol ⇒ true | false (in) (handy)

If set to true (default is false) all keys are turned into symbols, that is, the following snippet

  doc = <<-DOC
  <atts>
    <x>Hello</x>
    <y>world</y>
    <z>
      <inner>Inner</inner>
    </z>
  </atts>
  DOC
  p XmlSimple.xml_in(doc, 'KeyToSymbol' => true)

produces:

  {
    :x => ["Hello"],
    :y => ["World"],
    :z => [ { :inner => ["Inner"] } ]
  }

AttrToSymbol ⇒ true | false (in) (handy)

If set to true (default is false) all keys are turned into symbols, that is, the following snippet

  doc = <<-DOC
  <atts>
    <msg text="Hello, world!" />
  </atts>
  DOC
  p XmlSimple.xml_in(doc, 'AttrToSymbol' => true)

produces:

  {
    "msg" => [ { :text => "Hello, world!" } ]
  }

OutputFile ⇒ <file specifier> (out) (handy)

The default behavior of xml_out is to return the XML as a string. If you wish to write the XML to a file, simply supply the filename using the 'OutputFile' option. Alternatively, you can supply an object derived from IO instead of a filename.

RootName ⇒ 'string' (out) (handy)

By default, when xml_out generates XML, the root element will be named 'opt'. This option allows you to specify an alternative name.

Specifying either nil or the empty string for the 'RootName' option will produce XML with no root elements. In most cases the resulting XML fragment will not be 'well formed' and therefore could not be read back in by xml_in. Nevertheless, the option has been found to be useful in certain circumstances.

SearchPath ⇒ [ list ] (in) (handy)

Where the XML is being read from a file, and no path to the file is specified, this attribute allows you to specify which directories should be searched.

If the first parameter to xml_in is undefined, the default searchpath will contain only the directory in which the script itself is located. Otherwise the default searchpath will be empty.

Note: the current directory ('.') is not searched unless it is the directory containing the script.

SelfClose ⇒ true | false (out)

If set, xml_out will use self-closing tags for empty elements. For example:

<element />

instead of

<element></element>

SuppressEmpty ⇒ true | '' | nil (in + out) (handy)

This option controls what xml_in should do with empty elements (no attributes and no content). The default behaviour is to represent them as empty hashes. Setting this option to true will cause empty elements to be skipped altogether. Setting the option to nil or the empty string will cause empty elements to be represented as nil or the empty string respectively. The latter two alternatives are a little easier to test for in your code than a hash with no keys.

Variables ⇒ { name ⇒ value } (in) (handy)

This option allows variables in the XML to be expanded when the file is read. (there is no facility for putting the variable names back if you regenerate XML using xml_out).

A 'variable' is any text of the form "${name}" which occurs in an attribute value or in the text content of an element. If 'name' matches a key in the supplied Hash, "${name}" will be replaced with the corresponding value from the Hash. If no matching key is found, the variable will not be replaced.

VarAttr ⇒ 'attr_name' (in) (handy)

In addition to the variables defined using 'Variables', this option allows variables to be defined in the XML. A variable definition consists of an element with an attribute called 'attr_name' (the value of the 'VarAttr' option). The value of the attribute will be used as the variable name and the text content of the element will be used as the value. A variable defined in this way will override a variable defined using the 'Variables' option. For example:

    XmlSimple.xml_in(%q(<opt>
        <dir name="prefix">usr/local/apache</dir>
        <dir name="exec_prefix">${prefix}</dir>
        <dir name="bindir">${exec_prefix}/bin</dir>
        </opt>), {
     'VarAttr' => 'name', 'ContentKey' => '-content'
     })

produces the following data structure:

{
  'dir' => {
           'prefix'      => '/usr/local/apache',
           'exec_prefix' => '/usr/local/apache',
           'bindir'      => '/usr/local/apache/bin',
      }
}

XmlDeclaration ⇒ true | 'string' (out) (handy)

If you want the output from xml_out to start with the optional XML declaration, simply set the option to true. The default XML declaration is:

    <?xml version='1.0' standalone='yes'?>

If you want some other string (for example to declare an encoding value), set the value of this option to the complete string you require.

conversions ⇒ { regex ⇒ lambda } (in) (handy)

When importing XML documents it’s often necessary to filter or transform certain elements or attributes. The conversions option helps you to do this. It expects a Hash object where the keys are regular expressions identifying element or attribute names. The values are lambda functions that will be applied to the matching elements.

Let’s say we have a file named status.xml containing the following document:

<result>
  <status>OK</status>
  <total>10</total>
  <failed>2</failed>
</result>

The following statement

  conversions = {
    /^total|failed$/ => lambda { |v| v.to_i },
    /^status$/       => lambda { |v| v.downcase }
  }

  p XmlSimple.xml_in(
    'status.xml',
    :conversions => conversions,
    :forcearray  => false
  )

produces the following output:

{
  'status' => 'ok',
  'total'  => 10,
  'failed' => 2
}

Optional OO Interface

The procedural interface is both simple and convenient, but if you have to define a set of default values which should be used on all subsequent calls to xml_in or xml_out, you might prefer to use the object-oriented (OO) interface.

The default values for the options described above are unlikely to suit everyone. The OO interface allows you to effectively override XmlSimple's defaults with your preferred values. It works like this:

First create an XmlSimple parser object with your preferred defaults:

    xs = XmlSimple.new({ 'ForceArray' => false, 'KeepRoot' => true)

then call xml_in or xml_out as a method of that object:

    ref = xs.xml_in(xml)
    xml = xs.xml_out(ref)

You can also specify options when you make the method calls and these values will be merged with the values specified when the object was created. Values specified in a method call take precedence.

Error Handling

The XML standard is very clear on the issue of non-compliant documents. An error in parsing any single element (for example a missing end tag) must cause the whole document to be rejected. XmlSimple will raise an appropriate exception if it encounters a parsing error.

Examples

When xml_in reads the following very simple piece of XML:

    <opt username="testuser" password="frodo"></opt>

it returns the following data structure:

    {
      'username' => 'testuser',
      'password' => 'frodo'
    }

The identical result could have been produced with this alternative XML:

    <opt username="testuser" password="frodo" />

Or this (although see 'ForceArray' option for variations):

    <opt>
      <username>testuser</username>
      <password>frodo</password>
    </opt>

Repeated nested elements are represented as anonymous arrays:

    <opt>
      <person firstname="Joe" lastname="Smith">
        <email>joe@smith.com</email>
        <email>jsmith@yahoo.com</email>
      </person>
      <person firstname="Bob" lastname="Smith">
        <email>bob@smith.com</email>
      </person>
    </opt>

    {
      'person' => [
        {
          'email' => [
            'joe@smith.com',
            'jsmith@yahoo.com'
          ],
          'firstname' => 'Joe',
          'lastname' => 'Smith'
        },
        {
          'email' => ['bob@smith.com'],
          'firstname' => 'Bob',
          'lastname' => 'Smith'
        }
      ]
    }

Nested elements with a recognised key attribute are transformed (folded) from an array into a hash keyed on the value of that attribute, that is, calling xml_in with the 'KeyAttr' set to [key] will transform

    <opt>
      <person key="jsmith" firstname="Joe" lastname="Smith" />
      <person key="tsmith" firstname="Tom" lastname="Smith" />
      <person key="jbloggs" firstname="Joe" lastname="Bloggs" />
    </opt>

into

    {
      'person' => {
        'jbloggs' => {
          'firstname' => 'Joe',
          'lastname' => 'Bloggs'
        },
        'tsmith' => {
          'firstname' => 'Tom',
          'lastname' => 'Smith'
        },
        'jsmith' => {
          'firstname' => 'Joe',
          'lastname' => 'Smith'
        }
      }
    }

The <anon> tag can be used to form anonymous arrays:

    <opt>
      <head><anon>Col 1</anon><anon>Col 2</anon><anon>Col 3</anon></head>
      <data><anon>R1C1</anon><anon>R1C2</anon><anon>R1C3</anon></data>
      <data><anon>R2C1</anon><anon>R2C2</anon><anon>R2C3</anon></data>
      <data><anon>R3C1</anon><anon>R3C2</anon><anon>R3C3</anon></data>
    </opt>

    {
      'head' => [
        [ 'Col 1', 'Col 2', 'Col 3' ]
      ],
      'data' => [
        [ 'R1C1', 'R1C2', 'R1C3' ],
        [ 'R2C1', 'R2C2', 'R2C3' ],
        [ 'R3C1', 'R3C2', 'R3C3' ]
      ]
    }

Anonymous arrays can be nested to arbitrary levels and as a special case, if the surrounding tags for an XML document contain only an anonymous array the array will be returned directly rather than the usual hash:

    <opt>
      <anon><anon>Col 1</anon><anon>Col 2</anon></anon>
      <anon><anon>R1C1</anon><anon>R1C2</anon></anon>
      <anon><anon>R2C1</anon><anon>R2C2</anon></anon>
    </opt>

    [
      [ 'Col 1', 'Col 2' ],
      [ 'R1C1', 'R1C2' ],
      [ 'R2C1', 'R2C2' ]
    ]

Elements which only contain text content will simply be represented as a scalar. Where an element has both attributes and text content, the element will be represented as a hash with the text content in the 'content' key:

    <opt>
      <one>first</one>
      <two attr="value">second</two>
    </opt>

    {
      'one' => 'first',
      'two' => { 'attr' => 'value', 'content' => 'second' }
    }

Mixed content (elements which contain both text content and nested elements) will be not be represented in a useful way - element order and significant whitespace will be lost. If you need to work with mixed content, then XmlSimple is not the right tool for your job - check out the next section.

Where to from here?

XmlSimple is by nature very simple.

  • The parsing process liberally disposes of 'surplus' whitespace - some applications will be sensitive to this.

  • Slurping data into a hash will implicitly discard information about attribute order. Normally this would not be a problem because any items for which order is important would typically be encoded as elements rather than attributes. However, XmlSimple's aggressive slurping and folding algorithms can defeat even these techniques.

  • The API offers little control over the output of xml_out. In particular, it is not especially likely that feeding the output from xml_in into xml_out will reproduce the original XML (although passing the output from xml_out into xml_in should reproduce the original data structure).

  • xml_out cannot produce well-formed HTML unless you feed it with care - hash keys must conform to XML element naming rules and nil values should be avoided.

  • xml_out does not currently support encodings (although it shouldn’t stand in your way if you feed it encoded data).

  • If you’re attempting to get the output from xml_out to conform to a specific DTD, you’re almost certainly using the wrong tool for the job.

If any of these points are a problem for you, then XmlSimple is probably not the right class for your application.

FAQ

Question: if I include XmlSimple in a rails app and run for example 'rake' in the root of the app, I always get the following warnings:

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:275:
  warning: already initialized constant KNOWN_OPTIONS

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:280:
  warning: already initialized constant DEF_KEY_ATTRIBUTES

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:281:
  warning: already initialized constant DEF_ROOT_NAME

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:282:
  warning: already initialized constant DEF_CONTENT_KEY

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:283:
  warning: already initialized constant DEF_XML_DECLARATION

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:284:
  warning: already initialized constant DEF_ANONYMOUS_TAG

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:285:
  warning: already initialized constant DEF_FORCE_ARRAY

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:286:
  warning: already initialized constant DEF_INDENTATION

  /usr/local/lib/ruby/gems/1.8/gems/xml-simple-1.0.10/lib/xmlsimple.rb:287:
  warning: already initialized constant DEF_KEY_TO_SYMBOL

Answer: The reason for this is, that you’re using XmlSimple explicitly in a rails app. XmlSimple is part of rails (you can find it in ./actionpack-1.12.5/lib/action_controller/vendor/xml_simple.rb). Unfortunately, the library is named "xml_simple.rb" and not "xmlsimple.rb". Ruby’s "require" prevents you from loading a library two times and it does so by checking if a file name occurs more than once. In your case somewhere in the rails framework "require 'xml_simple'" is performed and you run "require 'xmlsimple'" afterwards. Hence, the library is loaded twice and all constants are redefined.

A solution is to only require xml-simple unless XmlSimple has not been defined already.

Acknowledgements

A big "Thank you!" goes to

  • Grant McLean for Perl’s XML::Simple

  • Yukihiro Matsumoto for Ruby.

  • Sean Russell for REXML.

  • Dave Thomas for Rdoc.

  • Nathaniel Talbott for Test::Unit.

  • Minero Aoki for his setup package.

Contact

If you have any suggestions or want to report bugs, please contact me.