XML

XML (eXtensible Markup Language) is a markup language that became a "Recommendation" of the World Wide Web Consortium (W3C) in early 1998. It was developed by a working group consisting of members from the computer industry, W3C, and academia. In general, a document written in a markup language contains human-readable text in which important terms are marked by special tags. XML, a simplified subset of the older Standard Generalized Markup Language (SGML), follows the SGML convention of marking noteworthy terms by angle-bracketed tags: <tag_name> ... </tag_name>. For instance, an XML document about the science and culture of the Interbellum (the period between WWI and WWII) could contain a sentence in which two terms are marked up. Such a sentence contains two elements, each of which, by definition, consists of a start tag and an end tag plus the content in between. The name of the element is the name between < and > in the start tag, which must be exactly the same as the name between </ and > in the end tag. The tagged pieces of text can be extracted by a computer program (often referred to as a "user agent"), which can, for example, prepare a typeset version of the document. The user agent might set the tagged contents in a special font and remove the tags.
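How a user agent might extract tagged text and strip the tags can be sketched in Python with the standard-library module xml.etree.ElementTree. The sentence itself, the element names physicist and mathematician, and the wrapping sentence element are assumptions made for illustration only; they are not taken from any particular markup language.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up sentence; all element names are illustrative.
marked_up = (
    "<sentence>The physicist <physicist>Niels Bohr</physicist> and the "
    "mathematician <mathematician>Emmy Noether</mathematician> "
    "were active in the 1920s.</sentence>"
)

root = ET.fromstring(marked_up)

# Extract the tagged terms as (element name, content) pairs.
tagged_terms = [(child.tag, child.text) for child in root]

# Remove the tags to obtain the plain sentence, as a typesetting agent might.
plain_text = "".join(root.itertext())
```

A real user agent would go on to apply fonts or layout to the extracted terms; here they are only collected in a list.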

An important application of XML is in creating legible databases. A public library, for instance, could build a computer-readable catalog as follows (note that, in contrast to HTML, element names in XML are case-sensitive, so that names differing only in case denote different elements):

    <Library>
      <Fiction>
        <Book>
          Adams, Douglas
          The Hitchhiker's Guide to the Galaxy
          Crown
          1400052920
        </Book>
        <Book> ... </Book>
      </Fiction>
      <Non-Fiction>
        <Languages>
          <Book> ... </Book>
        </Languages>
        <Science> ... </Science>
      </Non-Fiction>
    </Library>

A user agent, knowing the tags appearing in this XML document, can parse it, extract the book titles, and, after some formatting, publish them on the internet, so that the titles become easily accessible to members of the library. At the same time the program (i.e., the user agent) could store the information about the books in a proprietary database.
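The extraction of book titles from such a catalog can be sketched as follows. The child element names Author, Title, Publisher, and ISBN are hypothetical (the field names of the catalog are not fixed by XML), and the empty Languages and Science elements merely stand in for further content.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; the field names inside <Book> are assumptions.
catalog = """
<Library>
  <Fiction>
    <Book>
      <Author>Adams, Douglas</Author>
      <Title>The Hitchhiker's Guide to the Galaxy</Title>
      <Publisher>Crown</Publisher>
      <ISBN>1400052920</ISBN>
    </Book>
  </Fiction>
  <Non-Fiction>
    <Languages/>
    <Science/>
  </Non-Fiction>
</Library>
""".strip()

root = ET.fromstring(catalog)

# Collect every title, regardless of the section in which the book is shelved.
titles = [title.text for title in root.iter("Title")]

# Turn each book into a flat record, e.g. for storage in a database.
records = [{field.tag: field.text for field in book} for book in root.iter("Book")]
```

The same parsed tree serves both purposes mentioned in the text: the list of titles could be published, while the records could be stored elsewhere.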

XML is a meta-language, which means that it prescribes the syntax (grammar) of elements. However, XML does not fix the actual names or semantics (meanings) of elements. One could say that XML is infinitely large: practically any element name can be chosen by the author of an XML document. Only relatively few characters are forbidden in element names, and many scripts besides Latin are allowed. A markup language with an infinite number of different element names is unworkable, so a finite, well-defined set of permitted names and their content types must be chosen. The chosen elements can be recorded in an XML schema or in a Document Type Definition (DTD), either of which in fact defines a valid XML-conformant markup language. For example, a markup language describing the cultural history of the Interbellum could have an associated DTD in which <physicist>, <cinematographer>, <mathematician>, <sculptor>, etc., are recorded as valid elements.

When a document is to be published (on screen, in print, in braille, or as an audio file), the appearance of the elements is dictated by a stylesheet. For instance, the stylesheet may prescribe that the contents of <author> ... </author> are printed in bold. XML itself does not do anything; a computer program (a user agent) written in a language like C++ or Java is needed to perform the layout, storage, or transfer of XML documents.

The best-known XML-compliant language is XHTML, a language for web pages that differs only in minor details from the older HTML 4.01. The definitions of XHTML are contained in a DTD file accessible to the world as a web document. The stylesheets most commonly used with XHTML are written according to the CSS 2.1 standard (Cascading Style Sheets, version 2.1). CSS 2.1 also applies to XML documents. User agents such as the web browsers Internet Explorer, Firefox, and Chrome perform the parsing and further processing of XHTML documents. XML has found many applications besides XHTML; see Ref. for an extensive list of XML-based languages.

The XML document
An XML document may be text-based, or contain data to be transferred from one electronic application to another. A document might also contain abstract structures, such as graphic shapes described by vectors, as in SVG (Scalable Vector Graphics) documents. A single XML document can span multiple computer files.

As was discussed in the introduction, content in an XML document is marked with two surrounding tags. The markup is opened with an element name between angle brackets (the start tag) and closed by the same name between </ and > (the end tag). Empty elements (without content) are also allowed; they are denoted by a single tag closed by />. Often empty elements are used to convey information through their attributes, like the src attribute in the img element of XHTML.

Most of the letters in the Unicode character set are allowed in names of elements, meaning that letters in many different scripts may be employed. However, names cannot begin with a digit ([0-9]), a dot (.), a middle dot (·) or a hyphen (-). The characters commonly used in names are: [A-Z], [a-z], [0-9], - (hyphen), _ (underscore), and . (dot). Colons are used in flagging namespaces (see below) and hence are better avoided. Names containing one or more slashes or backslashes (as used in file paths and URLs) are forbidden, as are names containing spaces. Any of the white space characters (tab, line feed, carriage return, and space) terminates a name. As stated earlier, in contrast to HTML, element names are case-sensitive: <head> is not the same as <HEAD>, although both are allowed.
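The naming rules can be tried out by letting a parser decide. The sketch below uses Python's standard xml.etree.ElementTree; the helper name parses and the candidate names are made up for illustration, and only ASCII names are tested here.

```python
import xml.etree.ElementTree as ET

def parses(text: str) -> bool:
    """Return True if the parser accepts `text` as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical candidate names, each wrapped in an empty-element tag.
candidates = ["book", "_book", "book-1.a", "1book", "-book", ".book", "my book", "a/b"]
accepted = {name: parses(f"<{name}/>") for name in candidates}

# Case sensitivity: the end tag must repeat the start tag exactly.
same_case = parses("<head></head>")
mixed_case = parses("<head></HEAD>")
```

Names beginning with a digit, hyphen, or dot are rejected, as are names containing a space or a slash, while a start tag and end tag that differ only in case do not match.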

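There are five predefined entity references in XML:

    Name    Reference   Character
    amp     &amp;       &  (ampersand)
    lt      &lt;        <  (less-than sign)
    gt      &gt;        >  (greater-than sign)
    apos    &apos;      '  (apostrophe)
    quot    &quot;      "  (double quotation mark)
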
The names given in the first column are part of XML and do not have to be defined by the user. The two characters & and < have special meaning for XML parsers and should be referred to by &amp; and &lt;, respectively. The other three characters in the table may be written either literally or as references, whichever is more convenient. When the characters & and < are needed several times in a piece of text, it is often more convenient to enclose the text between <![CDATA[ and ]]>. The text thus enclosed is not parsed and may contain any characters (except the string consisting of the three closing characters).

A web browser will show text enclosed in this way (including & and <), but not the bracketing strings.
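That the two ways of writing & and < are equivalent can be checked with a short sketch. It uses Python's standard xml.etree.ElementTree and xml.sax.saxutils.escape; the element name code and the sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Sample text containing both special characters several times.
raw_text = "if a < b && b < c"

# Variant 1: replace & and < by the predefined references.
escaped_text = escape(raw_text)
with_references = f"<code>{escaped_text}</code>"

# Variant 2: leave the text untouched inside a CDATA section.
with_cdata = f"<code><![CDATA[{raw_text}]]></code>"

# After parsing, both variants yield the same character data.
text_from_references = ET.fromstring(with_references).text
text_from_cdata = ET.fromstring(with_cdata).text
```

In both cases the parser hands the original text to the application; the references and the CDATA brackets are only a matter of notation in the document.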

The elements in an XML document form a tree. The tree starts at the root (in the example below the element <Library> is the root of the document) and branches to the lower levels of the tree. The presence of a root element is mandatory in an XML document (in contrast to HTML 4). All XML elements may have children. In the example, <Fiction> is a child of <Library>, and the sibling <Book> elements are children of <Fiction> and descendants of the common ancestor <Library>:

    <Library>
      <Fiction>
        <Book> ... </Book>
        <Book> ... </Book>
      </Fiction>
    </Library>

Thus, the terms parent, child, sibling, ancestor, and descendant are used to describe the relationships between elements. Children of the same parent are called siblings, which is a generic term for brothers and sisters. A document tree is only properly defined when its elements are properly nested; that is, an overlapping construct such as <b><tt> ... </b></tt> is forbidden. All elements must be closed by an end tag (this is optional in HTML 4, but obligatory in XHTML 1.0, because the latter is XML-conformant).
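The family relationships in a document tree can be explored with a small sketch based on Python's standard xml.etree.ElementTree. The book contents A and B are placeholders, and the helper name is_properly_nested is made up for this illustration.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring(
    "<Library><Fiction><Book>A</Book><Book>B</Book></Fiction></Library>"
)

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in library.iter() for child in parent}

fiction = library.find("Fiction")
siblings = list(fiction)                               # the two Book elements
descendants = [el.tag for el in library.iter()][1:]    # everything below the root

def is_properly_nested(text: str) -> bool:
    """Return False if the parser rejects the text, e.g. for overlapping tags."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

For instance, is_properly_nested("<a><b></b></a>") holds, whereas the overlapping form "<a><b></a></b>" is rejected by the parser.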

Additional information about elements may be provided by attributes, which have the general syntax attribute_name="value" and are placed in the start tag after the element name. An element may have any number of attributes, with attribute names that obey the same rules as element names (almost any Unicode letter is allowed, but names cannot start with a digit, hyphen, or (middle) dot, etc.). The value of an attribute must be surrounded by a pair of single or double quotes. Spaces around the assignment symbol (=) are optional. Just as in HTML 4 documents, an XML document can contain comments, opened by <!-- and closed by -->.
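The attribute rules can be illustrated with a few lines of Python (standard xml.etree.ElementTree). The element and attribute names are borrowed from the family example further on in this article; the file name logo.png in the empty img element is purely hypothetical.

```python
import xml.etree.ElementTree as ET

# Double and single quotes may be mixed; spaces around "=" are optional.
father = ET.fromstring(
    """<Father born = "March 6, 1964" profession='Carpenter'><!-- a comment --></Father>"""
)
father_attributes = dict(father.attrib)

# An empty element that carries all of its information in attributes.
image = ET.fromstring('<img src="logo.png" alt="Logo"/>')
image_source = image.get("src")

def accepts(text: str) -> bool:
    """Return True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Unlike in HTML 4, an unquoted attribute value is a syntax error.
unquoted_ok = accepts("<Father born=1964/>")
```

The comment inside the Father element is skipped by the parser and does not show up among the attributes.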

The two-character markup string "<?" has special meaning. It may mark a processing instruction, as in <?xml-stylesheet href="" ?>, or an XML declaration, as in <?xml version="1.0" encoding="ISO-8859-1"?>. The latter line may optionally precede an XML document (it then comes first, before the root element). This line is advisable if the encoding of the document is not the default UTF-8.

The DTD
An XML Schema Definition (XSD) is a set of definitions of XML elements and their attributes. XSDs provide a means for defining the structure, content and semantics (meaning) of XML documents. They are updated versions of the older DTDs that go back to SGML. XSDs are more powerful, but also more complex, than DTDs. See Ref. for more details about XSDs. Here we will consider only some rudimentary notions of DTDs, which to date (2012) are still part of both XHTML 1 and HTML 4. A DTD puts on record elements and their attributes, but it may also define entity references, indicated in a document by a beginning & and a closing semicolon. For instance, &pi; refers to the entity (Greek letter) π.

A document is valid (in the terminology of XML) when all elements and attributes are defined in an accompanying DTD and, obviously, if all elements and attributes satisfy their definitions. A valid document must be well-formed (syntactically correct), but the converse is not necessarily true (in fact, a well-formed document may not even have a DTD). Most XHTML browsers check for well-formedness (and are obliged to stop execution as soon as they detect an error). This is not true for invalidity. Usually browsers try to fix the invalid expressions that they encounter. Special validators exist for XHTML and XML.

To give a taste of a DTD declaration, the following simple example presents an XML document that has an internal DTD (that is, the declarations are included in the XML document):

    <!-- <!DOCTYPE Courtman_family SYSTEM "..."> -->
    <!DOCTYPE Courtman_family [
    <!ELEMENT Courtman_family  (#PCDATA|Mother|Father|Son|Daughter)*>
    <!ELEMENT Mother           (First_names+, Last_name)>
    <!ELEMENT Father           (First_names*, Son*, Last_name+)>
    <!ELEMENT First_names      (#PCDATA)>
    <!ELEMENT Last_name        (#PCDATA)>
    <!ELEMENT Son              (#PCDATA)>
    <!ELEMENT Daughter         (#PCDATA)>
    <!ATTLIST Father           born CDATA #IMPLIED profession CDATA #IMPLIED>
    <!ENTITY  Address          "Hillsdale Blvd">
    ]>
    <?xml-stylesheet href="" ?>
    <Courtman_family>
    Ms. <Mother><First_names>Corina S. </First_names><Last_name>Robinson</Last_name></Mother>
    has two children from her previous marriage: son <Son>Theo</Son> and
    daughter <Daughter>Lizz</Daughter>. She has a son:
    <Father><Son>Mike</Son> <Last_name>Courtman</Last_name></Father>
    with her present husband, mr.
    <Father born="March 6, 1964" profession="Carpenter"><Last_name>Courtman</Last_name></Father>.
    The family lives on &Address;
    </Courtman_family>

A browser outputs this document simply as:

 * Ms. Corina S. Robinson has two children from her previous marriage: son Theo and daughter Lizz. She has a son: Mike Courtman with her present husband, mr. Courtman. The family lives on Hillsdale Blvd

Explanation


 * The first line (commented out) in the XML document gives the DTD declaration for the case that the DTD is external (in a separate file).
 * The second line starts the DTD declaration (anything between the matching square brackets). The element Courtman_family is the root element of the document. This line associates the DTD with the document. Recall that the root element is unique within the document.
 * The third line (the first actual DTD declaration) gives the syntax of the root element Courtman_family. Any element declaration starts with <!ELEMENT and is followed by the name of the element. Then the element syntax is described by a string between parentheses.
 * The string defining the syntax of Courtman_family contains the primitive (atomic) value #PCDATA (Parsed Character DATA), which is a sequence of characters that does not have any children, and hence is not allowed to contain < or & (an entity reference is also seen as a child).
 * The syntax of Courtman_family also allows for the non-primitive children Mother, Father, Son, and Daughter. They may be defined in the DTD (which they are further down in the example). Any element that appears in the document as a child of the root must be listed here. Lower descendants (grandchildren, etc.) must be defined separately. If a child of the root is not used, it may be left here without further definition.
 * Note that the element Son appears as a child of the root and as a child of Father (i.e., as a grandchild of the root). This is legitimate, but Son must be listed in both relationships.
 * The line <?xml-stylesheet href="" ?> specifies a null stylesheet and triggers a browser to apply its default style rules. If this line is omitted, a browser outputs an XML document as a tree. If the line refers to a separate CSS file, that file dictates the style.
 * The symbol * appearing in the third line is borrowed from regular expressions (as used in, for example, the Unix editor ed). It says that the elements between the parentheses may occur zero or more times; that is, any of the children #PCDATA, Mother, Father, Son, and Daughter of the root may appear an indeterminate number of times. The vertical bar (|) indicates a logical "or", i.e., a choice between the elements separated by |.
 * The fourth line defines Mother as a sequence, recognized by members separated by commas. The first child must be First_names (and cannot be text). The + sign originates from regular expressions: the child First_names must appear at least once and possibly more than once. Because the comma defines a sequence, all appearances of First_names must come before Last_name. This last child of Mother must always appear, and just once.
 * Father has three children. The order of their appearance in the document is fixed. The first two are optional, while the last child must appear at least once.
 * The following four lines define elements as childless pieces of text.
 * The last line but one defines the attributes of Father. They are "IMPLIED", meaning that they are optional, and consist of CDATA (simple character data not containing & or <).
 * The last line in the DTD defines an entity, referred to as &Address;.
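The meaning of the comma and of the occurrence symbols * and + in a sequence can be made concrete by translating a content model into a regular expression over the names of the children. The Python sketch below is not a DTD validator; it handles only flat sequences such as (First_names+, Last_name), and the helper names model_to_regex and children_match are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model: str) -> re.Pattern:
    """Translate a flat sequence model, e.g. '(First_names+, Last_name)'."""
    pieces = []
    for part in model.strip("() ").split(","):
        match = re.fullmatch(r"([\w.-]+)([*+?]?)", part.strip())
        if match is None:
            raise ValueError(f"unsupported content particle: {part!r}")
        name, occurrence = match.groups()
        # Each child name is followed by one space in the string built below.
        pieces.append(f"(?:{re.escape(name)} ){occurrence}")
    return re.compile("".join(pieces))

def children_match(element: ET.Element, model: str) -> bool:
    """Check the order and number of the children against the model."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(names) is not None

mother = ET.fromstring(
    "<Mother><First_names>Corina S.</First_names>"
    "<Last_name>Robinson</Last_name></Mother>"
)
father = ET.fromstring("<Father><Last_name>Courtman</Last_name></Father>")
swapped = ET.fromstring(
    "<Mother><Last_name>Robinson</Last_name>"
    "<First_names>Corina S.</First_names></Mother>"
)

mother_ok = children_match(mother, "(First_names+, Last_name)")
father_ok = children_match(father, "(First_names*, Son*, Last_name+)")
swapped_ok = children_match(swapped, "(First_names+, Last_name)")
```

The first two elements satisfy their declarations from the example, whereas the Mother element with its children in the wrong order does not.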

For use internal to the DTD, parameter entities may be defined, as in:

    <!ENTITY % URI "CDATA">

Note the white space around %; it must be there. A parameter entity may be referenced elsewhere in the DTD (inside other declarations only when the DTD is external), for instance in the attribute href of the empty element base:

    <!ELEMENT base EMPTY>
    <!ATTLIST base href %URI; #REQUIRED>

In the XML document this would be used as, e.g., <base href="..." />, and inside the DTD the string %URI; gets the value CDATA, consisting of simple character data.

The syntax of XML and DTD is formally (and precisely) declared in Extended Backus-Naur Form (EBNF), see Ref. To give the flavor of it, an incomplete definition of element is given and explained:

    S       ::= (#x20 | #x9 | #xD | #xA)+
    STag    ::= '<' Name (S Attribute)* S? '>'
    ETag    ::= '</' Name S? '>'
    Name    ::= (Letter | '_' | ':') (NameChar)*
    element ::= STag content ETag

First, white space S is defined as: space, tab, carriage return, or line feed, with the respective code points in hexadecimal (decimal) x20 (32), x9 (9), xD (13), and xA (10). Note that S contains at least one of these characters, and possibly more; this is indicated by the terminating regular expression symbol +. The start tag (STag) is opened with the literal <, then follows, with no white space in between, the name of the element, followed by zero or more attributes. The end tag (ETag) begins with the literal </, followed without space by the name of the element. Optional white space may precede the literal >. The question mark is borrowed from regular expressions and indicates an option (zero or one). A Name starts with a Letter, underscore, or colon. The characters allowed as Letter cover almost all of Unicode, but from the ASCII set (first 128 code points) only [A-Z] and [a-z] can be used, i.e., a name cannot start with a digit or punctuation mark (other than the underscore and colon). The exact definition of Letter and NameChar can be found in Ref. Finally, an element consists of an STag, content, and an ETag (the definition of empty elements is omitted here). The definition of content is fairly involved, because recursive nesting occurs, but basically content consists of character data interleaved with child elements, references, CDATA sections, processing instructions, and comments.
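The productions for S, Name, start tag, and end tag translate almost literally into regular expressions. The Python sketch below is a simplification under stated assumptions: Letter and NameChar are restricted to ASCII, and an attribute value is taken to be any quoted string without <, & or the enclosing quote character. The constant names are chosen for this illustration only.

```python
import re

S = r"[ \t\r\n]+"                        # one or more white space characters
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"     # ASCII-only stand-in for Name
ATT_VALUE = r"""(?:"[^<&"]*"|'[^<&']*')"""
ATTRIBUTE = f"{NAME}(?:{S})?=(?:{S})?{ATT_VALUE}"

# Start tag: '<' Name (S Attribute)* S? '>'
STAG = re.compile(f"<({NAME})(?:{S}{ATTRIBUTE})*(?:{S})?>")
# End tag: '</' Name S? '>'
ETAG = re.compile(f"</({NAME})(?:{S})?>")

start = STAG.fullmatch("""<Father born="March 6, 1964" profession='Carpenter'>""")
end = ETAG.fullmatch("</Father >")

# White space directly after '<' or '</' is not allowed, nor is a leading digit.
space_after_bracket = STAG.fullmatch("< Father>")
leading_digit = STAG.fullmatch("<1st>")
space_in_end_tag = ETAG.fullmatch("</ Father>")
```

The captured group of each match is the element name, so that a program could compare start.group(1) with end.group(1) to check that the tags belong together.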

The style sheet
A markup language proper only defines the syntax and semantics of terms. A markup language is not concerned with appearance (indentations, vertical spacings, choice of fonts, etc.). For many applications, such as database applications, appearance is irrelevant. For humans, however, the presentation (printed or on screen) of a document improves readability and hence is important. The appearance of certain tagged terms may be defined separately in a style sheet. A style sheet may have different sources that together form a cascade of sources. Priority in a cascade is well-defined: if two sources of a style sheet give contradictory definitions, then the definition lowest in the cascade takes priority. In this manner Cascading Style Sheets (CSS) are defined, which can be used by XML documents to define their appearance on screen, in print, in braille, or as an audio file. The latest standard of style definitions is CSS 2.1. As of this writing (spring 2012) the definition of CSS 3 is nearing completion.

The definition of CSS 2.1 is such that it is fully interoperable with XML, HTML 4, and XHTML 1.0, see Ref. for a quick tutorial of CSS with XML. Because CSS 2.1 is described elsewhere, no more details are given here.

The XML parser
The software module that reads and checks the information in XML documents is known as an XML parser or processor. It usually serves as the front end of a user agent. That is, parsers are part of XML-compliant applications (such as web browsers or database servers). One of the tasks of a parser is to ascertain that the XML document follows all the rules of the XML markup syntax. If that is the case, the document is said to be well-formed. Some parsers go a step further and check the document against the DTD or XML schema; these are validating parsers. If a document passes both checks, it is said to be valid. Once it has been established that a document is well-formed, the parsed document is passed on to the engine (core part) of the user agent. In the case of web browsers, for example, the engine takes care of the layout of the page on the user's screen.
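The well-formedness check of a non-validating parser can be sketched with Python's standard xml.etree.ElementTree, which checks the syntax but does not validate against a DTD or schema. The function name check_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text: str):
    """Return (True, None), or (False, (line, column)) for the first syntax error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        # The parser stops at the first violation and reports its position.
        return False, error.position
    return True, None

good = check_well_formed("<a><b/></a>")
bad_ok, bad_position = check_well_formed("<a><b></a>")
```

A validating parser would perform the same check first and only then compare the elements and attributes with their declarations.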

XML namespaces
Every XML application has its own markup vocabulary consisting of element names and attributes that the application understands. A single XML document may be input to multiple XML applications with different vocabularies. For instance, an XHTML 1.0 document may contain SVG diagrams that illustrate the XHTML text. SVG is an XML-conformant language for vector-based drawings. In this case a parser must distinguish SVG elements from XHTML elements, as it must decide to which application each element must be sent. A problem here is that vocabularies may overlap: two element names may "collide", meaning that they are the same. For example, the parser needs to differentiate between two meanings of <title>: a tooltip of a drawing (in the vocabulary of SVG) and a title of a document (in the vocabulary of XHTML).

To resolve this possible conflict the W3C created the namespace convention. A declaration provides a long and unique namespace name for a particular XML vocabulary that is conveniently represented by a shorthand consisting of a unique (within the document) prefix. Prefixes attached to local names of elements distinguish the names originating from different vocabularies. It is important to emphasize that namespaces are independent of DTDs: a DTD does not know whether a document contains elements with names from a namespace. A document is valid only if all element names (prefixed or not) of the document are properly recorded in its DTD. For a DTD a prefixed name is a name like any other. Conversely, a document with different namespaces does not need to have an associated DTD (it will be non-valid in that case, but can still be well-formed).

As said, a namespace obviously must have a unique name. It is common to give it a name that is a URL. This choice kills two birds with one stone: it gives information about the organization that maintains the application with its vocabulary, and a URL is unique.

Namespaces are declared in an XML document by the xmlns attribute. One can establish a namespace for an element and all its descendants. Because all elements are descendants of the root, a declaration as an attribute of the root applies to the whole document. A strict XHTML 1.0 document (which must have <html> as root) is required to have its (default) namespace declared. This is done by:

    <html xmlns="http://www.w3.org/1999/xhtml">

This declaration associates the namespace with an empty prefix; all unprefixed names are associated with this (default) namespace.

If one plans to invoke a vocabulary repeatedly inside an XHTML document (for instance SVG), an xmlns:prefix attribute may be added to the root element, as in:

    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:s="http://www.w3.org/2000/svg">

This declaration associates elements whose names start with s: with the namespace http://www.w3.org/2000/svg. If a namespace declaration does not specify a prefix, it acts as the default namespace and its elements are referred to without a prefix. The name of a prefix is free, but for obvious reasons it cannot contain a colon.

An ellipse within a blue rectangle can now be drawn as follows (all elements, except <html> and <p>, are from the SVG namespace):

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:s="http://www.w3.org/2000/svg">
      <p> Here comes some SVG </p>
      <s:svg width="15cm" height="15cm">
        <s:rect x="1cm" y="1cm" width="12cm" height="12cm"
                fill="none" stroke="blue" stroke-width="2" />
        <s:g transform="translate(250 275)">
          <s:ellipse rx="125" ry="90"
                     fill="cyan" />
        </s:g>
      </s:svg>
    </html>
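How a namespace-aware parser keeps the two vocabularies apart can be sketched with Python's standard xml.etree.ElementTree, which replaces every prefix by the full namespace name in braces. The document below is a shortened version of the example; the prefix svg used in the search is chosen freely and deliberately differs from the prefix s in the document.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

document = f"""
<html xmlns="{XHTML_NS}" xmlns:s="{SVG_NS}">
  <p>Here comes some SVG</p>
  <s:svg width="15cm" height="15cm">
    <s:rect x="1cm" y="1cm" width="12cm" height="12cm"
            fill="none" stroke="blue" stroke-width="2"/>
  </s:svg>
</html>
""".strip()

root = ET.fromstring(document)

# Each element name is reported as {namespace name}local name.
qualified_names = [element.tag for element in root.iter()]

# The prefix is only a shorthand: searching with another prefix works as well.
rect = root.find("svg:svg/svg:rect", {"svg": SVG_NS})
stroke = rect.get("stroke")   # unprefixed attributes keep their plain names
```

Because the comparison is made on the namespace name rather than on the prefix, two documents that use different prefixes for the same vocabulary are treated alike.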

If one applies a namespace to part of the document only, one can do it by defining a prefix on a parent element.

In general attributes are not prefixed and keep the meaning defined by the element to which they belong, as is shown by the example above. In exceptional cases one can associate an attribute with a namespace that differs from its element and then a prefix attached to the attribute is required.