A Brief Introduction to XML
Like HTML, XML is a plain text data format with structural tagging. Unlike HTML, which has a predefined set of tags that represent the structuring and rendering facilities of modern web browsers, XML has no pre-defined document tags. Instead, XML allows applications to define their own sets of tags for use according to the XML syntactical rules. Using XML can be very simple or very complex, depending on how many of the fancy features you try to use. Ganymede uses fairly simple XML, and this page is intended to present the most basic facts about XML that you will need to know to write syntactically valid XML files that Ganymede can handle.
Facts about XML:
Tags are case-sensitive. <tag>, <Tag> and <TAG> are all separate tags, and will not be treated as equivalent by an XML parser.
All tags must come in pairs. HTML browsers can tolerate tags that stand alone, like <br> and <p>, but XML is more strict, and all elements must have an open tag (<element>) and a close tag (</element>).
The structure of XML files must be strictly tree-like. That is, XML structural elements can contain other XML structural elements, but one XML structural element may not be partially contained by another.
In other words,
<element1> <element2> </element1> </element2>
is an invalid XML sequence, as the element2 and element1 structural elements are intermingled.
In contrast,
<element1> <element2></element2> </element1>
is perfectly valid, because the entirety of element2 is contained within element1.
In cases like the above, where the element2 element's open and close tag are immediately adjacent, XML supports a special syntax for an empty element. The same legal XML structure shown above could be written as:
<element1> <element2/> </element1>
where the trailing slash in the element2 element indicates that there will be no matching close tag to come along. At any other time an element open tag (<element>) is seen, a compliant XML parser will expect and demand to see the matching close tag (</element>) for that element before it sees the close tag for any elements higher up in the document structure.
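The nesting rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the is_well_formed helper is a name invented for this illustration and is not part of Ganymede.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Interleaved elements are rejected by the parser.
print(is_well_formed("<element1> <element2> </element1> </element2>"))  # False

# Properly nested elements are accepted.
print(is_well_formed("<element1> <element2></element2> </element1>"))   # True

# The empty-element shorthand is accepted as well.
print(is_well_formed("<element1> <element2/> </element1>"))             # True

# Tag names are case-sensitive, so </Tag> does not close <tag>.
print(is_well_formed("<tag></Tag>"))                                    # False
```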
Every XML document contains a single top-level element (the Document Element, in XML lingo), which in turn contains any other elements and character data. The start tag for the document element will generally be the first thing in an XML file's content proper, and the matching close tag for the document element will be the last thing in the XML file. For this reason, the following XML fragment could not be an entire XML file:
<element1> <element2/> </element1> <element3> <element4> <element5/> </element4> </element3>
This is because element1 does not contain elements 3, 4, and 5, so the fragment has two top-level elements rather than one.
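A quick way to see the single document element rule in action is to hand the fragment to a parser, first on its own and then wrapped in one enclosing element. This is only a sketch using Python's standard library; the parses helper and the <document> wrapper element are made up for the example.

```python
import xml.etree.ElementTree as ET

fragment = (
    "<element1> <element2/> </element1> "
    "<element3> <element4> <element5/> </element4> </element3>"
)

def parses(text):
    """Return True if text is accepted as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Two top-level elements: not a complete XML document.
print(parses(fragment))                                 # False

# Wrapped in a single (hypothetical) document element, it parses.
print(parses("<document>" + fragment + "</document>"))  # True
```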
In XML, all attribute data must be quoted, as in
<object label="labeldata">
The following is illegal:
<object label=labeldata>
Most characters are legal within the double quotes surrounding an attribute's value. The exceptions are the '&' and '<' characters and the double quotation mark itself. If you need to include a double quotation mark in an attribute's data field, you have to use &quot; and if you need an ampersand, you have to use &amp;.
For example, if you wanted the string 'He said "hi" & I waved back.' in an XML tag attribute, you would have to do it this way:
<sentence text="He said &quot;hi&quot; &amp; I waved back."/>
Newlines and other whitespace are explicitly acceptable within the quoted value of an attribute.
Likewise, if you want to include the '<', '>', or '&' characters anywhere in the body of an XML document, for anything other than tag or special character definitions, you need to use '&lt;', '&gt;', and '&amp;' instead, just as with HTML.
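If you generate XML from a program, it is usually safer to let a library do this escaping than to do it by hand. The sketch below uses escape() from Python's standard xml.sax.saxutils module, which replaces '&', '<', and '>' by default and takes an optional mapping for extra characters such as the double quote. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

sentence = 'He said "hi" & I waved back.'

# Escape &, < and >, plus the double quote, for use inside an attribute.
attr_value = escape(sentence, {'"': "&quot;"})
document = '<sentence text="' + attr_value + '"/>'
print(document)
# <sentence text="He said &quot;hi&quot; &amp; I waved back."/>

# A parser turns the escapes back into the original characters.
assert ET.fromstring(document).get("text") == sentence

# The same function covers text in the body of a document.
body = escape("a < b & c > d")
print(body)
# a &lt; b &amp; c &gt; d
```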
XML files typically use Unicode with the UTF-8 encoding. American 7-bit ASCII is a proper subset of Unicode and requires no special handling in the UTF-8 encoding. International characters may be used in XML files, but you must do so in a manner compliant with UTF-8. Ganymede will always emit XML files using the standard UNIX end-of-line character, but as with any XML parser, Ganymede can also handle DOS/Windows style line termination when reading XML files.
Ganymede allows any string that is valid as an XML 1.1 / XML 1.0 (Fifth Edition) element name to be used as an object type or field name, except that spaces in object type and field names are represented in XML as underscore ('_') characters. Because the Ganymede XML layer uses the underscore as a stand-in for the space character, you are not allowed to use underscores in Ganymede object type and field names.
This means that a field named 'Home Directory' will be represented in XML as an element named <Home_Directory>.
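The mapping between names is a simple character substitution. The two functions below are hypothetical helpers written for this page to illustrate the rule; they are not Ganymede's own code, and Ganymede's actual implementation may differ.

```python
def to_xml_name(ganymede_name):
    """Map a Ganymede object type or field name to its XML element name.

    Hypothetical helper: underscores are rejected because they are
    reserved as the stand-in for spaces.
    """
    if "_" in ganymede_name:
        raise ValueError("underscores are not allowed in Ganymede names")
    return ganymede_name.replace(" ", "_")

def from_xml_name(xml_name):
    """Map an XML element name back to the Ganymede name."""
    return xml_name.replace("_", " ")

print(to_xml_name("Home Directory"))    # Home_Directory
print(from_xml_name("Home_Directory"))  # Home Directory
```

Because underscores are never allowed in the Ganymede name, the mapping can always be reversed without ambiguity.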
Unlike the HTML standard, the XML standard does not specify that whitespace can be ignored or contracted. For Ganymede's purposes, however, whitespace between tags is generally ignored, and newlines and indentation are nice for human readability but not necessary for Ganymede's input parsing. More on this as we discuss what Ganymede does with XML, below.
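To make the distinction concrete, the sketch below shows that a generic XML parser (here Python's xml.etree.ElementTree, not Ganymede's own parser) hands whitespace between tags to the application, which is then free to ignore it. The structure function is a made-up helper that compares documents by element nesting alone.

```python
import xml.etree.ElementTree as ET

compact = ET.fromstring("<element1><element2/></element1>")
indented = ET.fromstring("<element1>\n  <element2/>\n</element1>")

# The parser keeps the whitespace it saw between the tags.
print(repr(compact.text))   # None
print(repr(indented.text))  # '\n  '

def structure(element):
    """Return nested (tag, children) tuples, ignoring all text between tags."""
    return (element.tag, [structure(child) for child in element])

# Once text between tags is ignored, the two documents are equivalent.
print(structure(compact) == structure(indented))  # True
```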
That's about all you should need to know about XML, at least for the purposes of discussing Ganymede. If you want to read about things like external reference entities, Document Type Definitions (DTDs), or the precise Backus-Naur style specification for what characters are allowed to go where, you can probably find your way to the original XML standards documents over at www.XML.com.