I need to access data in the attributes and elements of a very large and complex XML file. E4X works well for more basic XML structures, but the namespaces make it much more challenging to get at the data I need without parsing the text of the file. There has to be an easier way.
It is not as easy as working with simpler XML, but E4X makes it straightforward: the top-level QName class lets you handle elements and attributes by their qualified names, taking namespaces into account rather than removing them.
I've been tasked with creating an AIR application allowing users to browse an ontology (basically, a thesaurus on steroids). I need to extract, translate, and load the data into a SQLite database to be deployed with the application. Trouble is, the source files are pretty complicated, and the data I need is tucked away in any number of the zillion elements and attributes each contains.
The source content is written in a particular (and peculiar) language known as OWL (the W3C's Web Ontology Language) and is commonly stored using RDF/XML syntax, often with as many as a dozen namespaces sprinkled throughout. Attributes often have different namespaces than the elements that contain them. The basic E4X techniques don't seem to work anymore, and most online discussions still tend to be too basic, or deal with smallish chunks of data coming over the network.
Here is some (dramatically) pared-down XML content in the header of an OWL file, in this case, drawn from the OpenCyc Ontology.
<mx:XML id="rawContent" xmlns="">
    <rdf:RDF xml:base="http://sw.opencyc.org/concept/"
        xmlns="http://sw.opencyc.org/concept/"
        xmlns:owl="http://www.w3.org/2002/07/owl#"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
        <owl:Ontology rdf:about="Opencyc">
            <owl:versionInfo>Version 2.0.0</owl:versionInfo>
            <rdfs:comment xml:lang="en">OpenCyc Knowledge Base...</rdfs:comment>
            <memo>Opencyc as OWL</memo>
        </owl:Ontology>
    </rdf:RDF>
</mx:XML>
The objective is to populate a simple class with data from this raw content.
public class OntologyData {
    public var about:String;
    public var version:String;
    public var comment:String;
    public var commentLanguage:String;
    public var memo:String;
}
Success lies in dealing with each element and attribute based on its qualified name (both tag and namespace), and the namespaces can actually help get at the desired parts of each node. But first we have to define them. There is more than one way to do this, but the one that looks the most familiar (in an ActionScript 3 context) is to declare instances of the top-level Namespace class.
public const owl:Namespace = new Namespace("http://www.w3.org/2002/07/owl#");
public const rdf:Namespace = new Namespace("http://www.w3.org/1999/02/22-rdf-syntax-ns#");
public const rdfs:Namespace = new Namespace("http://www.w3.org/2000/01/rdf-schema#");
We copy these directly from the XML we're going to access. It's also possible to grab them dynamically as we begin to process the file, but that is beyond the scope of this recipe.
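For the curious, here is a minimal sketch of the dynamic route, assuming rawContent holds the rdf:RDF root shown above. It relies on the namespaceDeclarations() and namespace() methods of the top-level XML class. The variable names declared and owlFromFile are made up for illustration.

```actionscript
// List every namespace declared on the root element.
for each (var declared:Namespace in rawContent.namespaceDeclarations()) {
    // prefix is an empty string for the default namespace
    trace(declared.prefix + " -> " + declared.uri);
}

// Look up one namespace by the prefix this particular file happens to use.
// This only works if the file really declares an "owl" prefix.
var owlFromFile:Namespace = rawContent.namespace("owl");
```

Looking up by prefix ties the code to the prefixes a given file chose, which is one reason the rest of this recipe sticks to constants keyed on the URI.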
The first step, getting hold of the Ontology node, is a little tricky. If it weren't for the owl namespace, we'd simply write rawContent.Ontology[0]. The zero accessor, [0], is necessary because rawContent.Ontology returns an XMLList (with only one element in this case) of all rawContent child nodes named Ontology.
In ActionScript 3 code, we use the period to separate the levels of a package name. E4X, however, uses double colons (like many other languages) because periods are already used to indicate hierarchical levels within the XML. And just as we would say org.slocity.owl.OwlParser to indicate the fully qualified name of a class called OwlParser within its namespace, or source code hierarchy, we say owl::Ontology to refer to the Ontology node within the owl namespace.
var baseNode:XML = rawContent.owl::Ontology[0];
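To see why the qualifier matters, here is a small sketch comparing the two forms. It assumes the rawContent variable and the owl constant defined above, and that no default xml namespace directive is in effect.

```actionscript
// An unqualified name only matches elements that have no namespace,
// so this list is empty for the sample content.
trace(rawContent.Ontology.length());       // 0

// The qualified name matches the single owl:Ontology element.
trace(rawContent.owl::Ontology.length());  // 1
```

The unqualified expression does not throw an error. It simply returns an empty XMLList, which is why this kind of mistake is easy to miss.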
Things get easier when we get to the children of baseNode: we grab all of them.
var nodes:XMLList = baseNode.*;
We can then loop through them and do interesting things, or in this case, report the type, name, namespace, and value of each one.
for each (var node:XML in nodes)
    reportAndStore(node);
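The recipe never spells out reportAndStore, so the following is only a guess at its reporting half. It is a sketch that assumes we simply want to trace the kind, name, namespace, and value of whatever node it receives.

```actionscript
// Hypothetical body for reportAndStore; only the reporting half is sketched.
private function reportAndStore(node:XML):void {
    var kind:String = node.nodeKind();        // e.g. "element" or "attribute"
    var qname:QName = node.name() as QName;   // null for text nodes
    var tag:String = (qname != null) ? qname.localName : "(none)";
    var uri:String = (qname != null) ? qname.uri : "(none)";

    // For attributes and simple elements, toString() returns the text value.
    trace(kind + " | " + tag + " | " + uri + " | " + node.toString());
}
```

Because attributes are also XML objects, the same function can be handed either an element or an attribute.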
Grabbing the attributes isn't difficult. In fact, it's the same thing we'd do if there weren't any namespace issues: we use the attributes() method of the top-level XML class.
var baseAttributes:XMLList = baseNode.attributes();
var attributes:XMLList = node.attributes();
And we take advantage of a for each loop to access each one.
for each (var baseAttribute:XML in baseAttributes)
    reportAndStore(baseAttribute);
for each (var attribute:XML in attributes)
    reportAndStore(attribute);
If we wanted to grab a particular attribute rather than all of them, we would use the same approach we used to select the nodes. The namespace designator goes after the @ accessor because the namespace describes the attribute, not the means to access it.
var about:String = baseNode.@rdf::about;
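The same pattern should reach the two values in the sample that do not use the owl, rdf, or rdfs prefixes: the xml:lang attribute and the memo element, which sits in the file's default namespace. This is a sketch under assumptions. The xmlNs and cyc constants are not part of the original recipe, and the behavior of the reserved xml prefix has not been verified against every Flash Player version.

```actionscript
// Assumption: these two constants are additions for illustration.
// The reserved xml prefix is bound to this URI by the XML Namespaces spec.
public const xmlNs:Namespace = new Namespace("http://www.w3.org/XML/1998/namespace");
// The sample's default namespace, taken from its xmlns="..." declaration.
public const cyc:Namespace = new Namespace("http://sw.opencyc.org/concept/");

// Later, wherever baseNode is in scope:
var commentLanguage:String = baseNode.rdfs::comment.@xmlNs::lang;
var memo:String = baseNode.cyc::memo;
```

Note that memo has no prefix in the file, yet it still needs a Namespace in the expression because the default namespace applies to it.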
The inline E4X accessor syntax is tricky; at some point the documentation will include examples of basic E4X tasks that take into account the namespaces involved (in a way other than removing them).
Just as the qualified name of a class includes the namespace described in its package designation (e.g., org.slocity.owl.OwlParser), the qualified name of an XML tag includes its namespace as well (e.g., owl::Ontology or rdf::about).
The ActionScript 3 parser can't tell whether a namespace designator is meant to qualify a variable (which it flags as a probable error) or an element within an XML variable, so we'd have to use the full E4X expression each time we wish to access the element (e.g., rawContent.owl::Ontology[0]) unless we assign it to an XML variable and work with that variable's own methods and properties.
The first step is to determine the namespace and the name of the element, or in HTML parlance, its tag. If a typical XML element has the structure
<namespace:tag attribute="value">text</namespace:tag>
we can capture it by explicitly stating the namespace in an E4X expression
var baseNode:XML = rawContent.owl::Ontology[0];
but we still need some way to separate and identify the namespace and the name, or tag, and we shouldn't have to jump through too many hoops to do it. This is where the QName class comes in.
var qname:QName = baseNode.name() as QName;
Our qname variable provides easy access to both the tag and the URI, the part of the namespace we'll use to decide whether or not we want to access the node.
var tag:String = qname.localName;
var uri:String = qname.uri;
By the way, localName is also the name of a method available directly from an XML variable, and it delivers an identical result.
baseNode.localName() == qname.localName
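A QName can also be built by hand and used to select a node, which is handy when the tag to look for is only known at run time. The sketch below assumes the owl constant and baseNode from above. The names versionName and versionNodes are illustrative, and it relies on child() accepting a QName as its argument.

```actionscript
// Build a qualified name from a Namespace constant and a local name.
var versionName:QName = new QName(owl, "versionInfo");

// child() takes the name as an Object, so a QName can stand in for
// the inline owl::versionInfo syntax.
var versionNodes:XMLList = baseNode.child(versionName);
var version:String = versionNodes.toString();
```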
Now we can test both to determine whether or not it is one of the values we want to access. We'll branch off the namespace first. We can test against a String, or we can make our code less prone to error by using the uri property of each Namespace constant we defined at the start.
switch (uri) {
    case "http://www.w3.org/2002/07/owl#":
        // branch by owl tag
        break;
    case rdf.uri:
        // branch by rdf tag
        break;
    case rdfs.uri:
        // branch by rdfs tag
        break;
    default:
        // branch by default tag
        break;
}
For each namespace, we determine whether the tag is one that we wish to collect data from.
// branch by owl tag
switch (tag) {
    case "versionInfo":
        // keep
        break;
    case "intersectionOf":
        // ignore
        break;
    default:
        // ignore
}
Then collect it and store it somewhere, such as the simple data structure mentioned above. We simply assign the value of an attribute or the text of a node to the corresponding member variable.
var oData:OntologyData = new OntologyData();
oData.about = attribute.toString();
oData.version = node.text().toString();
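Putting the pieces together, one possible shape for the whole extraction is sketched below. The function name extract is hypothetical. It assumes the owl, rdf, and rdfs constants and the OntologyData class defined earlier, and it treats anything outside those namespaces as belonging to the file's default namespace.

```actionscript
// Hypothetical helper tying the steps together.
private function extract(baseNode:XML):OntologyData {
    var oData:OntologyData = new OntologyData();

    // Attributes of the base node, such as rdf:about.
    for each (var attribute:XML in baseNode.attributes()) {
        var attrName:QName = attribute.name() as QName;
        if (attrName.uri == rdf.uri && attrName.localName == "about") {
            oData.about = attribute.toString();
        }
    }

    // Child elements, branched by namespace first and then by tag.
    for each (var node:XML in baseNode.elements()) {
        var qname:QName = node.name() as QName;
        switch (qname.uri) {
            case owl.uri:
                if (qname.localName == "versionInfo") {
                    oData.version = node.text().toString();
                }
                break;
            case rdfs.uri:
                if (qname.localName == "comment") {
                    oData.comment = node.text().toString();
                }
                break;
            default:
                // memo lives in the file's default namespace
                if (qname.localName == "memo") {
                    oData.memo = node.text().toString();
                }
                break;
        }
    }
    return oData;
}
```

It would be called with the node selected earlier, for example extract(rawContent.owl::Ontology[0]). The commentLanguage field is left unset here to keep the sketch short.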
Dealing with namespaces in XML can be challenging, but that is precisely why they exist: the alternative is usually a chaos far, far worse. The code and ideas presented here are verbose and optimized for neither efficiency nor elegance, but for clarity. Hopefully, some much-more-informed folks will chime in with some whiz-bang tricksy approach that does it all automatically (a la the XML decoding and XML proxy classes out there). In the meantime, premature optimization is the root of all evil.