Accessing XML attributes and nodes with different namespaces.

Problem

I need to access data in the attributes and elements of a very large and complex XML file. E4X works well for more basic XML structures, but the namespaces make it much more challenging to get at the data I need without parsing the text of the file. There has to be an easier way.

Solution

It is not as easy as working with simpler XML, but E4X keeps it straightforward: the top-level QName class lets us handle elements and attributes by their qualified names, taking namespaces into account rather than removing them.

Detailed explanation

The Background.

I've been tasked with creating an AIR application allowing users to browse an ontology (basically, a thesaurus on steroids). I need to extract, translate, and load the data into a SQLite database to be deployed with the application. Trouble is, the source files are pretty complicated, and the data I need is tucked away in any number of the zillion elements and attributes each contains.

The Particulars.

The source content is written in a particular (and peculiar) language known as OWL (W3C's Web Ontology Language) and is commonly stored using RDF/XML syntax, often with as many as a dozen namespaces sprinkled throughout. Attributes often have different namespaces than the elements that contain them. The basic E4X techniques no longer seem to work, and most online discussions are either too basic or deal with smallish chunks of data coming over the network.

The XML Content.

Here is some (dramatically) pared-down XML content from the header of an OWL file, in this case drawn from the OpenCyc Ontology:

<mx:XML id="rawContent" xmlns="">
    <rdf:RDF xml:base="http://sw.opencyc.org/concept/"
        xmlns="http://sw.opencyc.org/concept/"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
        <owl:Ontology rdf:about="Opencyc">
            <owl:versionInfo>Version 2.0.0</owl:versionInfo>
            <rdfs:comment xml:lang="en">OpenCyc Knowledge Base...</rdfs:comment>
            <memo>Opencyc as OWL</memo>
        </owl:Ontology>
    </rdf:RDF>
</mx:XML>

The objective is to populate a simple class with data from this raw content.

public class OntologyData {
    public var about:String;
    public var version:String;
    public var comment:String;
    public var commentLanguage:String;
    public var memo:String;
}

The Namespaces.

Success lies in dealing with each element and attribute based on its qualified name (both tag and namespace), and the namespaces can actually help get at the desired parts of each node. But first we have to define them. There is more than one way to do this, but the one that looks the most familiar (in an ActionScript 3 context) is to declare instances of the top-level Namespace class.

public const owl:Namespace = new Namespace("http://www.w3.org/2002/07/owl#");
public const rdf:Namespace = new Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#");
public const rdfs:Namespace = new Namespace("http://www.w3.org/2000/01/rdf-schema#");

We copy these directly from the XML we're going to access. It's also possible to grab them dynamically as we begin to process the file, but that is beyond the scope of this recipe.
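For readers curious about the dynamic route, here is a minimal sketch, assuming rawContent is the rdf:RDF root shown above and that the prefixes are declared on that root element. The variable names owlNs, rdfNs, rdfsNs, and declared are ours, not part of the recipe.

```actionscript
// Sketch: look up each namespace by the prefix declared in the document
// instead of hard-coding its URI. namespace(prefix) yields undefined
// (null once stored in a Namespace variable) when the prefix is not in scope.
var owlNs:Namespace = rawContent.namespace("owl");
var rdfNs:Namespace = rawContent.namespace("rdf");
var rdfsNs:Namespace = rawContent.namespace("rdfs");

// List every prefix/URI pair declared on the root element itself.
for each (var declared:Namespace in rawContent.namespaceDeclarations()) {
    trace(declared.prefix + " = " + declared.uri);
}
```

Hard-coded constants remain the simpler choice when the set of namespaces is known in advance, which is why the rest of the recipe sticks with them.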

The Elements.

The first one is a little tricky. If it weren't for the owl namespace, we'd simply write rawContent.Ontology[0]. The zero accessor, [0], is necessary because rawContent.Ontology returns an XMLList (with only one element in this case) of all rawContent child nodes named Ontology.

In ActionScript 3 code, we use the period to separate the levels of a package name. E4X, however, uses double colons (like many other languages) because periods already indicate hierarchical levels within the XML. Just as we would write org.slocity.owl.OwlParser for the fully qualified name of a class called OwlParser within its namespace, or source code hierarchy, we write owl::Ontology to refer to the Ontology node within the owl namespace.

var baseNode:XML = rawContent.owl::Ontology[0];
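The same double-colon syntax reaches individual children of baseNode. The sketch below assumes the owl and rdfs constants defined earlier; the cyc constant and the three String variables are names introduced here for illustration. It also shows a common stumbling block: an element without a prefix still belongs to the document's default namespace.

```actionscript
// Select individual children of baseNode by qualified name.
var versionText:String = baseNode.owl::versionInfo.toString();
var commentText:String = baseNode.rdfs::comment.toString();

// <memo> has no prefix, but it inherits the default namespace declared on
// <rdf:RDF>, so an unqualified baseNode.memo would not match it.
// "cyc" is a hypothetical constant name chosen for this example.
const cyc:Namespace = new Namespace("http://sw.opencyc.org/concept/");
var memoText:String = baseNode.cyc::memo.toString();
```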

Things get easier when we get to the children of baseNode: we grab all of them.

var nodes:XMLList = baseNode.*;

We can then loop through them and do interesting things, or in this case, report the type, name, namespace, and value of each one.

for each (var node:XML in nodes)
    reportAndStore(node);
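The recipe calls reportAndStore() without defining it. A minimal, hypothetical version that only does the reporting half might look like the following; it leans on the QName returned by name(), which is covered in the QName section of this recipe.

```actionscript
// Hypothetical helper: reports the type, name, namespace, and value of an
// element or attribute. Storing the value is left out of this sketch.
function reportAndStore(item:XML):void {
    var qname:QName = item.name() as QName;
    if (qname == null) {
        return;   // text and comment nodes have no name
    }
    trace("type:      " + item.nodeKind());    // e.g. "element" or "attribute"
    trace("name:      " + qname.localName);
    trace("namespace: " + qname.uri);
    trace("value:     " + item.toString());
}
```

Inside a class this would be declared as a method (for example, private function reportAndStore) rather than as a standalone function.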

The Attributes.

Grabbing the attributes isn't difficult. In fact, it's the same thing we'd do if there weren't any namespace issues: we use the attributes() method of the top-level XML class.

var baseAttributes:XMLList = baseNode.attributes();
var attributes:XMLList = node.attributes();

And we take advantage of a for each loop to access each one.

for each (var baseAttribute:XML in baseAttributes)
    reportAndStore(baseAttribute);

for each (var attribute:XML in attributes)
    reportAndStore(attribute);

If we wanted to grab a particular attribute rather than all of them, we would use the same approach we used to select the nodes. The namespace designator goes after the @ accessor because the namespace describes the attribute, not the means to access it.

var about:String = baseNode.@rdf::about;
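The xml:lang attribute on the comment element follows the same pattern, with one wrinkle: the reserved xml prefix is never declared in the document, so we have to supply its namespace ourselves. This sketch assumes the rdfs constant and baseNode from above; xmlNs, commentNode, and commentLanguage are names introduced for the example.

```actionscript
// Assumption: the reserved "xml" prefix is bound to this W3C-defined URI.
// It is not declared in the sample, so we define the constant by hand.
const xmlNs:Namespace = new Namespace("http://www.w3.org/XML/1998/namespace");

// Select the rdfs:comment child, then read its xml:lang attribute.
var commentNode:XML = baseNode.rdfs::comment[0];
var commentLanguage:String = commentNode.@xmlNs::lang;
```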

The inline E4X accessor syntax is tricky. Hopefully, at some point the documentation will include examples of basic E4X tasks that take the namespaces into account (in a way other than removing them).

The QName Goodness.

Just as the qualified name of a class includes the namespace as described in its package designation (e.g. org.slocity.owl.OwlParser), the qualified name of an XML tag includes its namespace as well (e.g. owl::Ontology or rdf::about).

The ActionScript 3 parser can't tell whether a namespace designator is meant to describe a variable (which it would flag as a probable error) or an element within an XML variable, so we would have to write out the full E4X expression every time we access the element (e.g. rawContent.owl::Ontology[0]). It is simpler to assign the element to an XML variable once and work with that variable's own properties.

The first step is to determine the namespace and the name of the element, or in HTML parlance, its tag. If a typical XML element has the structure

<namespace:tag attribute="value">text</namespace:tag>

we can capture it by explicitly stating the namespace in an E4X expression

var baseNode:XML = rawContent.owl::Ontology[0];

but we still need some way to separate and identify the namespace and the name, or tag, and we shouldn't have to jump through too many hoops to do it. This is where the QName class comes in. 

var qname:QName = baseNode.name() as QName;

Our qname variable gives easy access to both the tag and the URI, the part of the namespace we'll use to decide whether we want the element or not.

var tag:String = qname.localName;
var uri:String = qname.uri;

By the way, localName is also the name of a method available directly from an XML variable, and delivers an identical result.

baseNode.localName() == qname.localName

Divide and Conquer.

Now we can test both to determine whether or not it is one of the values we want to access. We'll branch off the namespace first. We can test against a String, or we can make our code less prone to error by using the uri portion of each Namespace constant we defined at the start.

switch (uri) {
    case "http://www.w3.org/2002/07/owl#":
        // branch by owl tag
        break;
    
    case rdf.uri:
        // branch by rdf tag
        break;

    case rdfs.uri:
        // branch by rdfs tag
        break;

    default:
        // branch by default tag
        break;
}

For each namespace, we determine whether the tag is one that we wish to collect data from.

// branch by owl tag
switch (tag) {
    case "versionInfo":
        // keep
        break;

    case "intersectionOf":
        // ignore
        break;

    default:
        // ignore
}

Then collect it and store it somewhere, such as the simple data structure mentioned above. We simply assign the value of an attribute or the text of a node to the corresponding member variable.

var oData:OntologyData = new OntologyData();
oData.about = attribute.toString();
oData.version = node.text().toString();
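Putting the pieces together, one possible end-to-end sketch follows. It assumes the OntologyData class and the owl, rdf, and rdfs constants defined earlier are in scope; the function names parseOntology and store are hypothetical, and the xml namespace URI is tested as a String literal because the sample never declares it.

```actionscript
// Sketch only: walks owl:Ontology, its children, and their attributes,
// and copies the values we care about into an OntologyData instance.
function parseOntology(rawContent:XML):OntologyData {
    var oData:OntologyData = new OntologyData();
    var baseNode:XML = rawContent.owl::Ontology[0];

    // Attributes of the owl:Ontology element itself (rdf:about).
    for each (var baseAttribute:XML in baseNode.attributes()) {
        store(baseAttribute, oData);
    }

    // Each child element, followed by its own attributes (xml:lang).
    for each (var node:XML in baseNode.*) {
        store(node, oData);
        for each (var attribute:XML in node.attributes()) {
            store(attribute, oData);
        }
    }
    return oData;
}

function store(item:XML, oData:OntologyData):void {
    var qname:QName = item.name() as QName;
    if (qname == null) {
        return;   // text and comment nodes have no name
    }
    var value:String = item.toString();

    // Branch on the namespace URI first, then on the local name.
    switch (qname.uri) {
        case owl.uri:
            if (qname.localName == "versionInfo") oData.version = value;
            break;
        case rdf.uri:
            if (qname.localName == "about") oData.about = value;
            break;
        case rdfs.uri:
            if (qname.localName == "comment") oData.comment = value;
            break;
        case "http://www.w3.org/XML/1998/namespace":
            if (qname.localName == "lang") oData.commentLanguage = value;
            break;
        default:
            // The unprefixed <memo> element is in the default namespace.
            if (qname.localName == "memo") oData.memo = value;
            break;
    }
}
```

Calling parseOntology(rawContent) would then return the populated object. The sketch does no error handling: if the file has no owl:Ontology element, baseNode is not set and the first loop fails.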

Conclusion.

Dealing with namespaces in XML can be challenging, but that is precisely why they exist: the alternative is usually a chaos far, far worse. The code and ideas presented here are verbose and optimized for clarity rather than efficiency or elegance. Hopefully, some much-more-informed folks will chime in with some whiz-bang tricksy approach that does it all automatically (a la the XML decoding and XML proxy classes out there). In the meantime, premature optimization is the root of all evil.


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License. Permissions beyond the scope of this license, pertaining to the examples of code included within this work are available at Adobe.
