XML Tutorial: The Document Type Definition – aka DTD

Author: Jaidev

Why we need a DTD

XML is a language specification. Based on this specification, individuals and organizations develop their own markup languages, which they then use to communicate information. When this information is transferred from source to destination, the destination:

* Needs to know how the document is structured and

* Needs to check if the content is indeed compliant with the structure

The Document Type Definition, also known as the DTD, holds information about the structure of an XML document. In this chapter we will look at the important aspects of DTDs. The concept of a DTD is not new. It has its origins in SGML (remember the good old SGML?!) and has of course evolved since.

Any human or computer reader can “read” the DTD and understand how the document content will be made available. As a corollary, if the document content is not present in the way that the DTD has specified, the human or computer reader can reasonably assume that the content is not properly structured and throw an error or request a resend.

Placing the DTD in an XML document

Before we understand how to write a Document Type Definition, let us see where its content is placed in order to provide the structure of the document.

This complete DTD content can have two placements:

 * Internal – Refers to the placement of the DTD content directly within the XML document itself.

Every internal DTD is enclosed within the following two statements to differentiate it from the rest of the XML document:

"<!DOCTYPE root_element [" and "]>"

where root_element is the name of the root element of the XML document (remember every XML document must have a root element since XML is a tree structure).

For instance, consider the following XML file with an internal DTD:
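As a minimal sketch, such a file might look like the one below. The ARTICLES root name is the one used later in this chapter; the ARTICLE, TITLE and AUTHOR elements and their content are assumptions chosen only for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE ARTICLES [
  <!-- Internal DTD: everything between "[" and "]>" describes the structure -->
  <!ELEMENT ARTICLES (ARTICLE+)>
  <!ELEMENT ARTICLE (TITLE, AUTHOR)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT AUTHOR (#PCDATA)>
]>
<!-- The actual document content starts here, with the root element -->
<ARTICLES>
  <ARTICLE>
    <TITLE>The Document Type Definition</TITLE>
    <AUTHOR>Jaidev</AUTHOR>
  </ARTICLE>
</ARTICLES>
```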

Note that the Document Type Definition is placed within the XML document itself.

Also, the DTD specification is always placed after the XML declaration <?xml version … ?> (which, when present, must be the very first line) and before the actual document content starts. This holds for External References too, as we shall see in a moment.

 * External Reference – Refers to saving the DTD as a file with extension .dtd and then referencing the DTD file within the XML document. For instance consider the two files shown below – the standalone DTD file articles.dtd and the actual XML file articles.xml based on this DTD.
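A minimal sketch of such a pair of files follows. The element names other than ARTICLES are assumptions made for illustration, and the file path is the one quoted in this chapter; adjust it to wherever the DTD file actually lives.

The standalone DTD file, articles.dtd, holds only the declarations:

```xml
<!-- articles.dtd: declarations only, no DOCTYPE wrapper -->
<!ELEMENT ARTICLES (ARTICLE+)>
<!ELEMENT ARTICLE (TITLE, AUTHOR)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
```

The XML file, articles.xml, references it in the same position where an internal DTD would sit:

```xml
<?xml version="1.0"?>
<!-- The DTD is referenced here instead of being written out in full -->
<!DOCTYPE ARTICLES SYSTEM "D:\articles.dtd">
<ARTICLES>
  <ARTICLE>
    <TITLE>The Document Type Definition</TITLE>
    <AUTHOR>Jaidev</AUTHOR>
  </ARTICLE>
</ARTICLES>
```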

Notice the similarity in positioning. The only difference from the Internal DTD is that in this case the DTD is an external file and is only referenced in the XML document, articles.xml.

An External DTD can feature as a file on the local file system or as a URL reference and each case is specified slightly differently.

For instance, in the above example, the DTD file was on my file system as “D:\articles.dtd”. In such a case, the keyword SYSTEM is used to indicate a private external reference, located by a system identifier (here, a file path).

<!DOCTYPE ARTICLES SYSTEM "D:\articles.dtd">

Had the same DTD been available off my website (it is not, so don’t bother looking for it!!), it might have been specified with the keyword PUBLIC as:

<!DOCTYPE ARTICLES PUBLIC "-//JAIDEV//DTD ARTICLES XML V1.0//EN" "http://www.mydomain.com/dtd/articles.dtd">

Here the formal public identifier of the DTD is specified, followed by the URL of the DTD location.



Writing a DTD

By now we know the syntax of an XML document (from the previous chapter). To recall, every XML document markup is made up of

* Elements and

* Attribute-Values

The structure of these elements and attributes is exactly what a DTD seeks to formally provide. So, let us take a look at each.

Elements

An element in a DTD is defined as:

 <!ELEMENT element_name (child_content)>

 Here element_name is the name of the element and the child_content is the content of the child (or children) of this element.

The child_content can be any of the following:

* A single element name: Implies only one child

* More than one element name separated by commas: Implies a sequence of children in that order

* The keyword EMPTY (written without the round brackets): Implies no children and no text

* The keyword ANY (written without the round brackets): Implies any combination of text and declared elements

* The keyword (#PCDATA) including the round brackets: Implies any text that will be parsed. Since this text will be parsed, markup characters such as < and & cannot appear in it directly and must be written as &lt; and &amp;.

* Choices: Implies that the children can be selected from among choices separated by the “|” symbol.

* Instance Quantities: Implies that the element can appear as many times as specified. Can be one of three types:

    o “*” implying “any number of times”,

    o “+” implying “at least once” and

    o “?” implying “at most once”.

Let us consider examples for each of these cases:
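The declarations below sketch one example per case. All element names (LIBRARY, CATALOG, ARTICLE and so on) are hypothetical and chosen only for illustration.

```xml
<!-- A single element name: LIBRARY has exactly one CATALOG child -->
<!ELEMENT LIBRARY (CATALOG)>

<!-- Instance quantity "*": CATALOG holds any number of ARTICLE children, including none -->
<!ELEMENT CATALOG (ARTICLE*)>

<!-- Sequence: ARTICLE has TITLE, AUTHORS and BODY, in exactly this order -->
<!ELEMENT ARTICLE (TITLE, AUTHORS, BODY)>

<!-- (#PCDATA): TITLE contains parsed text only -->
<!ELEMENT TITLE (#PCDATA)>

<!-- Instance quantity "+": AUTHORS has at least one AUTHOR -->
<!ELEMENT AUTHORS (AUTHOR+)>
<!ELEMENT AUTHOR (#PCDATA)>

<!-- Instance quantity "?": BODY has at most one SUMMARY, then one or more PARAGRAPHs -->
<!ELEMENT BODY (SUMMARY?, PARAGRAPH+)>
<!ELEMENT SUMMARY (#PCDATA)>
<!ELEMENT PARAGRAPH (#PCDATA)>

<!-- Choices: CONTACT has either an EMAIL child or a PHONE child -->
<!ELEMENT CONTACT (EMAIL | PHONE)>
<!ELEMENT EMAIL (#PCDATA)>
<!ELEMENT PHONE (#PCDATA)>

<!-- EMPTY: PAGEBREAK has no children and no text, e.g. <PAGEBREAK/> -->
<!ELEMENT PAGEBREAK EMPTY>

<!-- ANY: NOTES may contain any mix of text and declared elements -->
<!ELEMENT NOTES ANY>
```

Note that EMPTY and ANY are written without round brackets, while #PCDATA, sequences and choices are enclosed in them.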

Attributes

Attributes, as we have seen in the last chapter, are additional pieces of information about an element. There has been a long-standing debate about when to use attributes and when to break down an element into child elements. I shall not attempt to provide my own theory or rule of thumb. Please see the following site (one of many) to guide you if you are lucky, or to spark your own imagination if you are not:

http://www-106.ibm.com/developerworks/xml/library/x-eleatt.html

Let us instead proceed to see how attributes are specified in a DTD. The following syntax holds:

<!ATTLIST element attribute_name attribute_type additional_characteristic>

 where:

    o element is the name of the tag for which this attribute is being specified

    o attribute_name is the name of the attribute

    o attribute_type can be one of

        * “CDATA” for character data but no markup

        * “ID” implying that the value must be a valid XML name and cannot be repeated as an ID value anywhere else in the document, i.e. it is unique

        * “NMTOKEN” implying that the value must be a single name token, made up only of characters allowed in XML names (letters, digits, “.”, “-”, “_” and “:”)

        * choice list as (choice1 | choice2 |……| choiceN)

    o additional_characteristic can be one of

        * “default_value” where default_value is the default value of the attribute

        * #FIXED “default_value” where default_value is the default value of the attribute and this is the only value you can specify

        * #REQUIRED implying that a value is mandatory

        * #IMPLIED implying that the attribute is optional and no default value is supplied

Once again let us see some examples:
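The declarations below are a sketch covering each attribute type and each additional characteristic. The ARTICLE element and all attribute names and values are hypothetical, chosen only for illustration.

```xml
<!ELEMENT ARTICLE (#PCDATA)>

<!-- CDATA with #REQUIRED: free text that must always be supplied -->
<!ATTLIST ARTICLE TITLE CDATA #REQUIRED>

<!-- ID: the value must be unique among all ID values in the document -->
<!ATTLIST ARTICLE ARTICLE_ID ID #REQUIRED>

<!-- NMTOKEN with #IMPLIED: a single name token, and the attribute may be left out -->
<!ATTLIST ARTICLE CATEGORY NMTOKEN #IMPLIED>

<!-- Choice list with a default value: "draft" is assumed when STATUS is omitted -->
<!ATTLIST ARTICLE STATUS (draft | review | published) "draft">

<!-- #FIXED: if the attribute is given at all, its value must be "en" -->
<!ATTLIST ARTICLE LANG CDATA #FIXED "en">
```

An element that conforms to these declarations could then be written as:

```xml
<ARTICLE TITLE="Writing a DTD" ARTICLE_ID="a1" CATEGORY="xml" STATUS="published">Some text</ARTICLE>
```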

A Frequently Asked Question on Binary Data

Before we close this chapter on DTDs, let us revisit an FAQ that we looked at in the last chapter.

Q. Can XML files hold binary data as part of an element?

A. No. XML files can only carry text data.

Q. Is that not restrictive? What do I do for binary data then? All my image files, sound files etc?

A. Well, yes, it is restrictive, but that is what gives XML its power. Being plain text, XML can be transported and understood easily. Instead of transporting binary data as part of an XML document, one can simply reference the binary file through an attribute of an element or through another element. For instance, consider the following element and attribute definition in the DTD:

<!ELEMENT MYPHOTO EMPTY>
<!ATTLIST MYPHOTO FILENAME CDATA #REQUIRED>

The corresponding XML element would be:

<MYPHOTO FILENAME="jaidev.jpg"/>

Now, any program using this XML data can acquire the binary data separately using native mechanisms. For instance, a browser encountering an image reference tag could make a request over HTTP, obtain the binary information and display the JPEG file.
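A DTD can also declare the binary file itself as an unparsed entity and let an attribute of type ENTITY refer to it. In that case the attribute value must be the name of an entity declared in the DTD, not the file name itself. The following is only a sketch: the notation name JPEG, its identifier "image/jpeg" and the entity name jaidev_photo are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE MYPHOTO [
  <!-- A notation names the format of the non-XML data -->
  <!NOTATION JPEG SYSTEM "image/jpeg">
  <!-- An unparsed entity ties a name to the external binary file -->
  <!ENTITY jaidev_photo SYSTEM "jaidev.jpg" NDATA JPEG>
  <!ELEMENT MYPHOTO EMPTY>
  <!ATTLIST MYPHOTO FILENAME ENTITY #REQUIRED>
]>
<!-- The attribute value is the entity name declared above -->
<MYPHOTO FILENAME="jaidev_photo"/>
```

Either way, the XML document carries only a textual reference, and the program reading it fetches the binary data on its own.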

DTDs rock but….

… they do have some drawbacks. Perhaps the two most significant ones are:

* There is no formal datatyping scheme

* All definitions have a global scope

To solve some of these drawbacks, the concept of XML Schemas was introduced. In the next chapter we will consider the basics of Schemas and we will also discuss a tad bit more on validation of XML documents that are based on DTDs or Schemas.

I do hope you try out some of the examples given above using an XML editor. You can also practice writing your own DTD for any common type of document that you deal with – perhaps for personal data, or account information or maybe ticketing data. Take your pick but do it. Practice makes perfect, at least from this chapter onwards!

Copyright© 2012 C# Computing, LLC