Chapter 3: Validating XML with the Document Type Definition (DTD)
XML is a meta-markup language that is fully extensible. As long as it is well formed, XML authors can create any XML structure they desire in order to describe their data. However, an XML author cannot be sure that a structure that took so much time and effort to create won't be changed by another XML author or, for that matter, by an application. There needs to be a way to ensure that the XML structure cannot be changed at random. This type of assurance for XML document structure is vital for e-commerce applications and business-to-business processing, among other things.

This is where the Document Type Definition (DTD) steps in. A DTD provides a roadmap for describing and documenting the structure that makes up an XML document. A DTD can be used to determine the validity of an XML document.

In this chapter we will start with several examples and a brief overview of the DTD and what it does. Then we will break down the different items that make up the structure of the DTD. The coverage of the DTD structure will begin with a discussion of the Document Type Declaration. Then we will move on to the functional items that make up the DTD: element declarations, attribute declarations, and entity declarations, including parameter entities. Finally, before closing the chapter, we will explore some of the drawbacks of DTDs and emerging alternatives for validation. Now, let's start by defining the Document Type Definition.

Document Type Definitions

DTD stands for Document Type Definition. A Document Type Definition allows the XML author to define a set of rules that an XML document must follow in order to be valid. An XML document is considered "well formed" if that document is syntactically correct according to the syntax rules of XML 1.0. However, that does not mean the document is necessarily valid. In order to be considered valid, an XML document must be validated, or verified, against a DTD.
The DTD will define the elements required by an XML document, the elements that are optional, the number of times an element should (could) occur, and the order in which elements should be nested. DTD markup also defines the type of data that will occur in an XML element and the attributes that may be associated with those elements. A document, even if well formed, is not considered valid if it does not follow the rules defined in the DTD.
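To make the difference between "well formed" and "valid" concrete, here is a minimal sketch in Python (our choice of language for illustration; nothing in this chapter depends on it). The standard-library parser used here is non-validating: it checks well-formedness only and never enforces the DTD. The two sample documents and the helper name is_well_formed are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Well formed, but not valid: the internal DTD declares only <message>,
# yet the document also uses an undeclared <text> element.
WELL_FORMED_NOT_VALID = """<?xml version="1.0"?>
<!DOCTYPE message [
  <!ELEMENT message (#PCDATA)>
]>
<message><text>Let the good times roll!</text></message>"""

# Not even well formed: the <message> start tag is never closed.
NOT_WELL_FORMED = "<message>Let the good times roll!"


def is_well_formed(xml_text):
    """Return True if a non-validating parser can parse xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED_NOT_VALID))
print(is_well_formed(NOT_WELL_FORMED))
```

The first document is accepted even though it breaks its own DTD. Catching that kind of problem is the job of a validating parser, which the Python standard library does not provide.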
When an XML document is validated against a DTD by a validating XML parser, the XML document will be checked to ensure that all required elements are present and that no undeclared elements have been added. The hierarchical structure of elements defined in the DTD must be maintained. The values of all attributes will be checked to ensure that they fall within defined guidelines. No undeclared attributes will be allowed and no required attributes may be omitted. In short, every last detail of the XML document from top to bottom will be defined and validated by the DTD. Although validation is optional, if an XML author is publishing an XML document for which maintaining the structure is vital, the author can reference a DTD from the XML document and use a validating XML parser during processing. Requiring that an XML document be validated against a DTD ensures the integrity of the data structure. XML documents may be parsed and validated before they are ever loaded by an application. That way, XML data that is not valid can be flagged as "invalid" before it ever gets processed by the application (thus saving a lot of the headaches that corrupt or incomplete data can cause). Imagine a scenario where data is being exchanged in an XML format between multiple organizations. The integrity of business-to-business data is vital for the smooth functioning of commerce. There needs to be a way to ensure that the structure of the XML data does not change from organization to organization (thus rendering the data corrupt and useless). A DTD can ensure this. An extra advantage of using DTDs in this situation is that a single DTD could be referenced by all the organization's applications. The defined structure of the data would be in a centralized resource, which means that any changes to the data structure definition would only need to be implemented in one place. All the applications that referenced the DTD would automatically use the new, updated structure. 
A DTD can be internal, residing within the body of a single XML document. It can also be external, referenced by the XML document. A single XML document could even have both a portion (or subset) of its DTD that is internal and a portion that is external. As mentioned in the previous paragraph, a single external DTD can be referenced by many XML documents. Because an external DTD may be referenced by many documents, it is a good repository for global types of definitions (definitions that apply to all documents). An internal DTD is good to use for rules that only apply to that specific document. If a document has both internal and external DTD subsets, the internal declarations take precedence where the same entity or attribute is declared in both subsets.

Given this brief overview, you can quickly see why a DTD would be important to applications that exchange data in an XML format. Before diving into the actual coverage of the structure of DTDs, take a look at a couple of quick examples. This will give you a better impression of what we are talking about as we go forward.

Some Simple DTD Examples

Let's take a quick look at two DTDs: one internal and one external. Listing 3.1 shows an internal DTD.

Listing 3.1 An Internal DTD

    <?xml version="1.0"?>
    <!DOCTYPE message [
      <!ELEMENT message (#PCDATA)>
    ]>
    <message>
      Let the good times roll!
    </message>

In Listing 3.1, the internal DTD is contained within the Document Type Declaration, which begins with <!DOCTYPE and ends with ]>. The Document Type Declaration appears between the XML declaration and the start of the document itself (the document or root element) and identifies that section of the XML document as containing a Document Type Definition. Immediately after the DOCTYPE keyword, the declaration names the root element of the XML document (in this case, message). The DTD tells us that this document will have a single element, message, that will contain parsed character data (#PCDATA).
Now, let's take a look at Listing 3.2 and see how this same DTD and XML document would be joined if the DTD were external.

Listing 3.2 An External DTD

    <?xml version="1.0"?>
    <!DOCTYPE message SYSTEM "message.dtd">
    <message>
      Let the good times roll!
    </message>

In Listing 3.2 the DTD is contained in a separate file, message.dtd. The contents of message.dtd are assumed to be the same as the contents of the DTD in Listing 3.1. The keyword SYSTEM in the Document Type Declaration lets us know that the DTD is going to be found in a separate file. A URL could have been used to define the location of the DTD. For example, rather than message.dtd, the Document Type Declaration could have specified something like ../DTD/message.dtd.
Both of these examples show us a well-formed XML document. Additionally, because both XML documents contain a single element, message, which contains only parsed character data, both adhere to the DTD. Therefore, they are both also valid XML documents. A document that looks like what's shown in Listing 3.3 would not be valid according to the DTD in these examples.

Listing 3.3 Document Not Valid According to Defined DTD

    <?xml version="1.0"?>
    <!DOCTYPE message SYSTEM "message.dtd">
    <message>
      <text>
        Let the good times roll!
      </text>
    </message>

Even though this is a well-formed XML document, it is not valid. When this document is validated against message.dtd, the validating parser will report an error because message.dtd does not define an element named text.

Don't worry if you do not completely understand what is going on at this point. As long as you get the gist, everything will become very clear in the sections that follow.

Structure of a Document Type Definition

The structure of a DTD consists of a Document Type Declaration, elements, attributes, entities, and several other minor keywords. We will take a look at each of these topics, in that order. As we progress from topic to topic, we will follow a mini case study about the use of XML to store employee records by the Human Resources department of a fictitious company. Our coverage of the DTD structure shall begin with the Document Type Declaration.

The Document Type Declaration

In order to reference a DTD from an XML document, a Document Type Declaration must be included in the XML document. Listings 3.1, 3.2, and 3.3 gave some examples and brief explanations of using a Document Type Declaration to reference a DTD. There may be at most one Document Type Declaration per XML document. The syntax is as follows:

    <!DOCTYPE rootelement SYSTEM | PUBLIC DTDlocation [ internalDTDelements ] >
It is possible for a Document Type Declaration to contain both an external DTD subset and an internal DTD subset. In this situation, the internal declarations take precedence over the external ones, because the internal subset is read first and the first declaration of an entity or attribute is the one that counts. In other words, if both the external and internal subsets declare the same entity or attribute, the internal declaration will be the one used. (An element type, by contrast, may be declared only once; declaring it in both subsets makes the document invalid.) Consider the Document Type Declaration fragment shown in Listing 3.4.

Listing 3.4 Internal and External DTDs

    <!DOCTYPE rootelement SYSTEM "http://www.myserver.com/mydtd.dtd" [
      <!ELEMENT element1 (element2,element3)>
      <!ELEMENT element2 (#PCDATA)>
      <!ELEMENT element3 (#PCDATA)>
    ]>

Here in Listing 3.4, we see that the Document Type Declaration references an external DTD. There is also an internal subset of the DTD contained in the Document Type Declaration. Any entity or attribute declarations in the external DTD that duplicate declarations in the internal subset are overridden by the internal ones. Because an element type may be declared only once, element1, element2, and element3 must not also be declared in mydtd.dtd.
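The "first declaration wins" behavior behind this precedence can be observed with a short Python sketch. Python's standard-library parser does not fetch external DTDs, so this example is only an approximation: it declares the same entity twice inside one internal subset, which triggers the same rule. The entity name owner and the note element are hypothetical.

```python
import xml.etree.ElementTree as ET

# The same entity is declared twice. XML 1.0 binds the first declaration
# and ignores later ones; because the internal subset is read before the
# external subset, this is why internal entity declarations take precedence.
DOC = """<!DOCTYPE note [
  <!ENTITY owner "first declaration">
  <!ENTITY owner "later declaration, ignored">
  <!ELEMENT note (#PCDATA)>
]>
<note>&owner;</note>"""

note = ET.fromstring(DOC)
print(note.text)
```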
Now that you have seen how to reference a DTD from an XML document, we will begin our coverage of the items that make up the declarations in DTDs.

DTD Elements

All elements in a valid XML document are defined with an element declaration in the DTD. An element declaration defines the name and all allowed contents of an element. Element names must start with a letter or an underscore and may contain any combination of letters, numbers, underscores, dashes, and periods. Element names must never start with the string "xml", which is reserved. Colons should not be used in element names because they are normally used to reference namespaces. Each element in the DTD should be defined with the following syntax:

    <!ELEMENT elementname rule>
The order in which element declarations appear in a DTD is not significant. What a validating XML parser checks is the order of the child elements listed in each element's rule: the child elements in the XML document must appear in the same order in which they are listed in the parent element's declaration. If the child elements in an XML document do not match that order, the XML document will not be considered valid by a validating parser. Listing 3.5 demonstrates a DTD, contactlist.dtd, that defines the ordering of elements for referencing XML documents.

Listing 3.5 contactlist.dtd

    <!ELEMENT contactlist (fullname, address, phone, email) >
    <!ELEMENT fullname (#PCDATA)>
    <!ELEMENT address (addressline1, addressline2)>
    <!ELEMENT addressline1 (#PCDATA)>
    <!ELEMENT addressline2 (#PCDATA)>
    <!ELEMENT phone (#PCDATA)>
    <!ELEMENT email (#PCDATA)>

The first element in the DTD, contactlist, is the document element. The rule for this element is that it contains (is the parent element of) the fullname, address, phone, and email elements, in that order. The rule for the fullname element, the phone element, and the email element is that each contains parsed character data (#PCDATA). This means that the elements contain text that the XML parser examines for markup such as entity references. The address element has two child elements: addressline1 and addressline2. These two child elements contain #PCDATA.

This DTD defines an XML structure that is nested two levels deep. The root element, contactlist, has four child elements. The address element is, in turn, parent to two more elements. In order for an XML document that references this DTD to be valid, it must be laid out in the same order, and it must have the same depth of nesting. The XML document in Listing 3.6 is a valid document because it follows the rules laid out in Listing 3.5 for contactlist.dtd.
Listing 3.6 contactlist.xml

    <?xml version="1.0"?>
    <!DOCTYPE contactlist SYSTEM "contactlist.dtd">
    <contactlist>
      <fullname>Bobby Soninlaw</fullname>
      <address>
        <addressline1>101 South Street</addressline1>
        <addressline2>Apartment #2</addressline2>
      </address>
      <phone>(405) 555-1234</phone>
      <email>bs@mail.com</email>
    </contactlist>

The second line of this XML document is the Document Type Declaration that references contactlist.dtd. This is a valid XML document because it is well formed and complies with the structural definition laid out in the DTD.
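The ordering requirement can be imitated with a few lines of Python. This is only a sketch of the idea, not how a real validating parser works: the sequences are copied by hand from Listing 3.5, only plain comma-separated sequences are handled, and the names SEQUENCES and follows_sequences are invented for this example.

```python
import xml.etree.ElementTree as ET

# Child sequences copied by hand from Listing 3.5 (plain sequences only).
SEQUENCES = {
    "contactlist": ["fullname", "address", "phone", "email"],
    "address": ["addressline1", "addressline2"],
}


def follows_sequences(element):
    """Recursively check that child elements appear in the declared order."""
    # Elements without an entry are treated as #PCDATA: no children allowed.
    expected = SEQUENCES.get(element.tag, [])
    actual = [child.tag for child in element]
    return actual == expected and all(follows_sequences(c) for c in element)


VALID = ET.fromstring(
    "<contactlist><fullname>Bobby Soninlaw</fullname>"
    "<address><addressline1>101 South Street</addressline1>"
    "<addressline2>Apartment #2</addressline2></address>"
    "<phone>(405) 555-1234</phone><email>bs@mail.com</email></contactlist>"
)

# Same data, but <phone> now comes before <address>.
SWAPPED = ET.fromstring(
    "<contactlist><fullname>Bobby Soninlaw</fullname>"
    "<phone>(405) 555-1234</phone>"
    "<address><addressline1>101 South Street</addressline1>"
    "<addressline2>Apartment #2</addressline2></address>"
    "<email>bs@mail.com</email></contactlist>"
)

print(follows_sequences(VALID))
print(follows_sequences(SWAPPED))
```

Both documents are well formed, but only the first follows the order given in the contactlist rule.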
The element rules govern the types of data that may appear in an element.

DTD Element Rules

All data contained in an element must follow a set rule. As stated previously, the rule is the definition to which the element's data content must conform. There are two basic types of rules that elements must fall into. The first type of rule deals with content. The second type of rule deals with structure. First, we will look at element rules that deal with content.

Content Rules

The content rules for elements deal with the actual data that defined elements may contain. These rules include the ANY rule, the EMPTY rule, and the #PCDATA rule.

The ANY Rule

An element may be defined using the ANY rule. This rule is just what it sounds like: The element may contain other elements and/or normal character data (just about anything, as long as it is well formed and any child elements are themselves declared in the DTD). An element using the ANY rule would appear as follows:

    <!ELEMENT elementname ANY>

The drawback to this rule is that it is so wide open that it defeats the purpose of validation. A document validated against a DTD that defines all its elements using the ANY rule will always be valid as long as the XML is well formed and uses only declared elements. This really precludes any effective validation. The XML fragments shown in Listing 3.7 are all valid given the definition of elementname.

Listing 3.7 XML Fragments Using the ANY Rule

    <elementname>
      This is valid content
    </elementname>

    <elementname>
      <anotherelement>
        This is more valid content
      </anotherelement>
      This is still valid content
    </elementname>

    <elementname>
      <emptyelement />
      <yetanotherelement>
        This is still valid content!
      </yetanotherelement>
      Here is more valid content
    </elementname>

You should see from this listing why it is not always a great idea to use the ANY rule. All three fragments containing the element elementname are valid. There is, in effect, no validation for this element.
Use of the ANY rule should probably be limited to instances where the XML data will be freeform text or other types of data that will be highly variable and have difficulty conforming to a set structure.

The EMPTY Rule

This rule is the exact opposite of the ANY rule. An element that is defined with this rule will contain no data. However, an element with the EMPTY rule could still contain attributes (more on attributes in a bit). The following element is an example of the EMPTY rule:

    <!ELEMENT elementname EMPTY>

This concept is seen a lot in HTML. There are tags such as the break tag (<br />) and the horizontal rule tag (<hr />) that follow this rule. Neither one of these tags contains any data, but both are very important in HTML documents. The best example of an empty tag used in HTML is the image tag (<img>). Even though the image tag does not contain any data, it does have attributes that describe the location and display of an image for a Web browser.

In XML, the EMPTY rule might be used to define empty elements that contain diagnostic information for the processing of data. Empty elements could also be created to hold metadata describing the contents of the XML document for indexing purposes. Empty elements could even be used to provide clues for applications that will render the data for viewing (such as an empty "gender" tag, which designates an XML record as "male" or "female"; male records could be rendered in blue, and female records could be rendered in pink).

The #PCDATA Rule

The #PCDATA rule indicates that parsed character data will be contained in the element. Parsed character data is text that the XML parser examines for markup: entity references in it are expanded, and characters such as < and & must be escaped. An element declared with (#PCDATA) alone may not contain child elements.
The following element demonstrates the #PCDATA rule:

    <!ELEMENT elementname (#PCDATA)>

An element in an XML document that adheres to the #PCDATA rule might appear as follows:

    <data>
      This is some parsed character data
    </data>

It is possible in an element using the #PCDATA rule to use a CDATA section to prevent the character data from being parsed. You can see an example of this in Listing 3.8.

Listing 3.8 CDATA

    <sample>
      <data>
        <![CDATA[<tag>This will not be parsed</tag>]]>
      </data>
    </sample>

All the data between <![CDATA[ and ]]> will not be parsed for markup; the parser treats it as normal characters (markup ignored).

Structure Rules

Whereas the content rules deal with the actual content of the data contained in defined elements, structure rules deal with how that data may be organized. There are two types of structure rules we will look at here. The first is the "element only" rule. The second rule is the "mixed" rule.

The "Element Only" Rule

The "element only" rule specifies that only elements may appear as children of the current element. The child element sequences should be separated by commas and listed in the order they should appear. If there are to be options for which elements will appear, the listed elements should be separated by the pipe symbol (|). The following element definition demonstrates the "element only" rule:

    <!ELEMENT elementname (element1, element2, element3)>

You can see here that a list of elements is expected to appear as child elements of elementname when the referencing XML document is parsed. All these child elements must be present and in the specified order. Here is how an element that is listing a series of options will appear:

    <!ELEMENT elementname (element1 | element2)>

The element defined here will have a single child element: either element1 or element2.

The "Mixed" Rule

The "mixed" rule is used to help define elements that may have both character data (#PCDATA) and child elements in the data they contain.
A list of options or a sequential list will be enclosed by parentheses. Options will be separated by the pipe symbol (|), whereas sequential lists will be separated by commas. In a "mixed" rule, XML 1.0 allows only a list of options, with #PCDATA listed first and an asterisk after the closing parenthesis. The following element is an example of the "mixed" rule:

    <!ELEMENT elementname (#PCDATA | childelement1 | childelement2)*>

In this example, the element may contain a mixture of character data and child elements. The pipe symbol is used here to indicate that there is a choice between #PCDATA and each of the child elements. The asterisk symbol (*) is added here to indicate that each of the items within the parentheses may appear zero or more times (we will cover the use of element symbols in the next section). This can be useful for describing data sets that have optional values. Consider the following element definition:
    <!ELEMENT Son (#PCDATA | Name | Age)*>

This definition defines an element, Son, for which there may be character data, elements, or both. A man might have a son, but he might not. If there is no son, then normal character data (such as "N/A") could be used to describe this condition. Alternatively, the man might have an adopted son and would like to indicate this. Consider the XML fragments shown in Listing 3.9 in relation to the definition for the element Son.

Listing 3.9 The "Mixed" Rule

    <Son>
      N/A
    </Son>

    <Son>
      Adopted Son
      <Name>Bobby</Name>
      <Age>12</Age>
    </Son>

The first fragment contains only character data. The second fragment contains a mixture of character data and the two defined child elements. Both fragments conform to the definition and are valid.

Element Symbols

In addition to the normal rules that apply to element definitions, element symbols can be used to control the occurrence of data. Table 3.1 shows the symbols that are available for use in DTDs.

Table 3.1 Element Symbols
Element symbols can be added to element definitions for another level of control over the XML documents that are being validated against it. Consider the DTD in Listing 3.10, which makes very limited use of element symbols.

Listing 3.10 Limited Use of Symbols

    <!ELEMENT contactlist (contact) >
    <!ELEMENT contact (name, age, sex, address, city, state, zip, children) >
    <!ELEMENT name (#PCDATA) >
    <!ELEMENT age (#PCDATA) >
    <!ELEMENT sex (#PCDATA) >
    <!ELEMENT address (#PCDATA) >
    <!ELEMENT city (#PCDATA) >
    <!ELEMENT state (#PCDATA) >
    <!ELEMENT zip (#PCDATA) >
    <!ELEMENT children (child) >
    <!ELEMENT child (childname, childage, childsex) >
    <!ELEMENT childname (#PCDATA) >
    <!ELEMENT childage (#PCDATA) >
    <!ELEMENT childsex (#PCDATA) >

You can see in Listing 3.10 that a contact record for a contactlist file is being laid out. It is very straightforward and includes the basic address information you would expect to see in this type of file. Information on the contact's children is also included. This looks like a well-laid-out, easy-to-use file format. However, there are several problems. What if you are not sure about a contact's address? What if the contact does not have children? What if you do not know the contact's age? The way that this DTD is laid out, a referencing XML document cannot be valid if any of this information is left out.

Using element symbols, you can create a more flexible DTD that will take into account the possibility that you might not always know all of a contact's personal information. Take a look at a similar DTD laid out in Listing 3.11.

Listing 3.11 Broader Use of Symbols

    <!ELEMENT contactlist (contact+) >
    <!ELEMENT contact (name, age?, sex, address?, city?, state?, zip?, children?) >
    <!ELEMENT name (#PCDATA) >
    <!ELEMENT age (#PCDATA) >
    <!ELEMENT sex (#PCDATA) >
    <!ELEMENT address (#PCDATA) >
    <!ELEMENT city (#PCDATA) >
    <!ELEMENT state (#PCDATA) >
    <!ELEMENT zip (#PCDATA) >
    <!ELEMENT children (child*) >
    <!ELEMENT child (childname, childage?, childsex) >
    <!ELEMENT childname (#PCDATA) >
    <!ELEMENT childage (#PCDATA) >
    <!ELEMENT childsex (#PCDATA) >

Listing 3.11 is much more flexible than Listing 3.10. There is still a single root element, contactlist, which will contain one or more instances (+) of the element contact. Under each contact element is a list of child elements that make up the description of the contact record. It is assumed here that the name and sex of the contact will be known. However, the definition indicates that there will be zero or one occurrence (?) of the age, address, city, state, zip, and children elements. These elements are set for zero or one occurrence because the definition is taking into account that this information might not be known.

Looking further down the listing, you see that the children element is marked to have zero or more instances (*) of the child element. This is because a person might have no children or many children (or we might not know how many children the person has). Under the child element, it is assumed that childname and childsex information will be known (if there is at least one child element). However, the childage element is marked as zero or one (?), just in case it is unknown how old the child is.

You can easily see how Listing 3.11 is more flexible than Listing 3.10. Listing 3.11 takes into account that much of the contact data could be missing or unknown. An XML document being validated against the DTD in Listing 3.11 could still be validated and accepted by a validating parser even though it might not have all the contact's personal data. However, an XML document being validated against the DTD in Listing 3.10 would be rejected as invalid if it did not include the children element.
Now that you have seen how DTDs define element declarations, let's take a look at how attributes are used in a mini case study.
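Before moving on, the occurrence symbols used in Listing 3.11 can be tried out with a small Python sketch. It translates an "element only" rule into a regular expression over child element names, which is a convenient approximation and not the algorithm any particular validating parser uses. The function names model_to_regex and children_match, and the sample records, are assumptions made up for this example; #PCDATA and mixed rules are deliberately not handled.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(content_model):
    """Translate an element-only content model into a regex over child names."""
    pattern = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", content_model):
        if token == "(":
            pattern.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation
        elif token in ")|?*+":
            pattern.append(token)  # same meaning in DTDs and in regexes
        else:
            # One child element name, followed by a separating space.
            pattern.append("(?:%s )" % re.escape(token))
    return re.compile("".join(pattern))


# Element-only rules copied from Listing 3.11.
MODELS = {
    "contactlist": "(contact+)",
    "contact": "(name, age?, sex, address?, city?, state?, zip?, children?)",
    "children": "(child*)",
    "child": "(childname, childage?, childsex)",
}


def children_match(element):
    """Check one element's direct children against its declared rule."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(MODELS[element.tag]).fullmatch(names) is not None


minimal = ET.fromstring("<contact><name>Ann</name><sex>female</sex></contact>")
no_name = ET.fromstring("<contact><sex>female</sex></contact>")
empty_list = ET.fromstring("<contactlist/>")
no_children = ET.fromstring("<children/>")

print(children_match(minimal))      # optional (?) elements may be left out
print(children_match(no_name))      # name has no symbol, so it is required
print(children_match(empty_list))   # + demands at least one contact
print(children_match(no_children))  # * allows zero child elements
```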
DTD Attributes

So far you have seen that it is possible to use intricate combinations of elements and symbols to create complex element definitions. Now let's take a look at how XML attribute definitions can be added into this mix. XML attributes are name/value pairs that are used as metadata to describe XML elements. XML attributes are very similar to HTML attributes. In HTML, src is an attribute of the img tag, as shown in the following example:

    <img src="images/imagename.gif" width="10" height="20">

In this example, width and height are also attributes of the img tag. This is very similar to the markup in Listing 3.12, which demonstrates how an image element might be structured in XML.

Listing 3.12 Attribute Use in XML

    <image src="images/" width="10" height="20">
      imagename.gif
    </image>

In Listing 3.12, src, width, and height are presented as attributes of the XML element image. This is very similar to the way that these attributes are used in HTML. The only difference is that the src attribute merely contains the relative path of the image's directory and not the actual name of the image file.

In Listing 3.12, the attributes width, height, and src are used as metadata to describe certain aspects of the content of the image element. This is consistent with the normal use of attributes. Attributes can also be used to provide additional information to further identify or index an element or even give formatting information.

Attributes are also defined in DTDs. Attribute definitions are declared using the ATTLIST declaration. An ATTLIST declaration will define one or more attributes for the element that it is referencing.
Attribute list declarations in a DTD will have the following syntax: <!ATTLIST elementname attributename type defaultbehavior defaultvalue>
Take a look at Listing 3.13 for an example of how this declaration may be used.

Listing 3.13 ATTLIST Declaration

    <!ATTLIST name
      sex  CDATA #REQUIRED
      age  CDATA #IMPLIED
      race CDATA #IMPLIED >

In Listing 3.13, an attribute list is declared. The name element is being referenced by the declaration. Three attributes are defined: sex, age, and race. The three attributes are character data (CDATA). Only one of the attributes, sex, is required (#REQUIRED). The other two attributes, age and race, are optional (#IMPLIED). An XML element using the attribute list declared here would appear as follows:

    <name sex="male" age="30" race="Caucasian">Michael Qualls</name>

The name element contains the value "Michael Qualls". It also has three attributes of Michael Qualls: sex, age, and race. The attributes in Listing 3.13 are all character data (CDATA). However, attributes actually have 10 possible data types.

Attribute Types

Before going over a more detailed example of using attributes in your DTDs, let's first review Table 3.2, which presents the 10 valid types of attributes that may be used in a DTD. Then we will look at Table 3.3, which shows the default values for attributes.

Table 3.2 Attribute Types
You saw during the coverage of the 10 valid attribute types that we used two preset default behavior settings: #REQUIRED and #IMPLIED. There are four different default types that may be used in an attribute definition, as detailed in Table 3.3.

Table 3.3 Default Value Types
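The four default types can be observed with Python's standard-library expat bindings, as in the hedged sketch below. The staff document, the status and company attributes, and their values are hypothetical and exist only to show one attribute of each kind. The parser is non-validating, so a missing #REQUIRED attribute would not be reported here; the sketch only shows how the declarations are read and how declared default values are filled in.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

DOC = """<!DOCTYPE staff [
  <!ELEMENT staff (name+)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST name
      sex     (male | female) #REQUIRED
      age     CDATA           #IMPLIED
      status  CDATA           "active"
      company CDATA           #FIXED "Acme">
]>
<staff><name sex="male">Michael Qualls</name></staff>"""

declared = {}


def on_attlist(elname, attname, type_, default, required):
    # default is None when the declaration carries no default value;
    # required is true for #REQUIRED and also for #FIXED attributes.
    declared[attname] = (default, bool(required))


decl_parser = expat.ParserCreate()
decl_parser.AttlistDeclHandler = on_attlist
decl_parser.Parse(DOC, True)

# ElementTree sees the attribute written in the document plus the two
# attributes that have declared values (the plain default and the #FIXED one).
name = ET.fromstring(DOC).find("name")

print(declared)
print(name.attrib)
```

Note that age does not show up on the element at all: an #IMPLIED attribute that is left out simply has no value.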
So far you have element (ELEMENT) declarations and attribute (ATTLIST) declarations under your belt. You have seen that you can create some very complex hierarchical structures using elements and attributes. Next, we will take a look at a way to save some time when building DTDs. DTD entities offer a way to store repetitive or large chunks of data for quick reference. First, however, we are going to revisit our mini case study.
DTD Entities

Entities in DTDs are storage units. They can also be considered placeholders. Entities are special markup constructs that contain content for insertion into the XML document. Usually this will be some type of information that is bulky or repetitive. Entities make this type of information more easily handled because the DTD author can use them to indicate where the information should be inserted in the XML document. This is much better than having to retype the same information over and over.

An entity's content could be well-formed XML, normal text, binary data, a database record, and so on. The main purpose of an entity is to hold content, and there is virtually no limit on the type of content an entity can hold. The general syntax of an entity is as follows:

    <!ENTITY entityname [SYSTEM | PUBLIC] entitycontent>
Entities may either point to internal data or external data. Internal entities represent data that is contained completely within the DTD. External entities point to content in another location via a URL. External data could be anything from normal parsed text in another file, to a graphics or audio file, to an Excel spreadsheet. The type of data to which an external entity can refer is virtually unlimited.

An entity is referenced in an XML document by inserting the name of the entity prefixed by & and suffixed by ;. When referenced in this manner, the content of the entity will be placed into the XML document when the document is parsed and validated. Let's take a look at an example of how this works (see Listing 3.14).

Listing 3.14 Using Internal Entities

    <?xml version="1.0"?>
    <!DOCTYPE library [
      <!ENTITY cpy "Copyright 2000">
      <!ELEMENT library (book+)>
      <!ELEMENT book (title,author,copyright)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT copyright (#PCDATA)>
    ]>
    <library>
      <book>
        <title>How to Win Friends</title>
        <author>Joe Charisma</author>
        <copyright>&cpy;</copyright>
      </book>
      <book>
        <title>Make Money Fast</title>
        <author>Jimmy QuickBuck</author>
        <copyright>&cpy;</copyright>
      </book>
    </library>

Listing 3.14 uses an internal DTD. In the DTD, an entity called cpy is declared that contains the content "Copyright 2000". In the copyright element of the XML document, this entity is referenced by using &cpy;. When this document is parsed, &cpy; will be replaced with "Copyright 2000" in each instance in which it is used. Using the entity &cpy; saves the XML document author from having to type in "Copyright 2000" over and over.

This is a fairly simple example, but imagine if the entity contained a string of data that was several hundred characters long. It is much more convenient (and easier on the fingers) to be able to reference a three- or four-character entity in an XML document than to type in all that content.
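The replacement of &cpy; can be watched directly. The sketch below feeds the document from Listing 3.14 to Python's standard-library parser, which is non-validating but does expand entities declared in the internal DTD. The variable names are our own.

```python
import xml.etree.ElementTree as ET

# The document from Listing 3.14, with its internal DTD.
LIBRARY = """<?xml version="1.0"?>
<!DOCTYPE library [
  <!ENTITY cpy "Copyright 2000">
  <!ELEMENT library (book+)>
  <!ELEMENT book (title,author,copyright)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT copyright (#PCDATA)>
]>
<library>
  <book>
    <title>How to Win Friends</title>
    <author>Joe Charisma</author>
    <copyright>&cpy;</copyright>
  </book>
  <book>
    <title>Make Money Fast</title>
    <author>Jimmy QuickBuck</author>
    <copyright>&cpy;</copyright>
  </book>
</library>"""

library = ET.fromstring(LIBRARY)

# By the time the application sees the data, each &cpy; reference
# has already been replaced with the entity's content.
notices = [book.findtext("copyright") for book in library.findall("book")]
print(notices)
```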
Predefined Entities

There are five predefined entities, as shown in Table 3.4. These entities do not have to be declared in the DTD. When an XML parser encounters these entities (unless they are contained in a CDATA section), they will automatically be replaced with the content they represent.

Table 3.4 Predefined Entities
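As a quick illustration, the following Python sketch lists the five predefined entities defined by XML 1.0 and shows a parser replacing them without any DTD. The element names t and vendor and the sample text are assumptions chosen for the example; escape is a standard-library helper that escapes &, <, and > when text is written into XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The five predefined entities and the characters they stand for.
PREDEFINED = {
    "&amp;": "&",
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&apos;": "'",
}

# No DTD is needed: the parser replaces all five automatically.
sample = ET.fromstring("<t>&amp; &lt; &gt; &quot; &apos;</t>")
print(sample.text)

# Going the other way, escape() turns a raw ampersand into &amp; so that
# the text can be placed inside an element safely.
markup = "<vendor>%s</vendor>" % escape("Ben & Jerry's")
vendor = ET.fromstring(markup)
print(markup)
print(vendor.text)
```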
The XML fragment in Listing 3.15 demonstrates the use of a predefined entity.

Listing 3.15 Using Predefined Entities

    <icecream>
      <flavor>Cherry Garcia</flavor>
      <vendor>Ben &amp; Jerry's</vendor>
    </icecream>

In this listing, the ampersand in "Ben & Jerry's" is replaced with the predefined entity for an ampersand (&amp;).

External Entities

External entities are used to reference external content. As stated previously, external entities get their content by referencing it via a URL placed in the entitycontent portion of the entity declaration. Either the SYSTEM keyword or the PUBLIC keyword is used here to let the XML parser know that the content is external.

XML is incredibly flexible. External entities can contain references to almost any type of data, even other XML documents. One well-formed XML document can contain another well-formed XML document through the use of an external entity reference. Taking this a step further, it can be easily extrapolated that a single XML document can be made up of references to many small XML documents. When the document is parsed, the XML parser will gather all the small XML documents, merging them into a whole. The end-user application will only see one document and never know the difference.

One useful way to apply the principle of combining XML documents through the use of external entities would be in an employee-tracking application, like the one shown in Listing 3.16.

Listing 3.16 Using External Entities

    <?xml version="1.0"?>
    <!DOCTYPE employees [
      <!ENTITY bob SYSTEM "http://srvr/emps/bob.xml">
      <!ENTITY nancy SYSTEM "http://srvr/emps/nancy.xml">
      <!ELEMENT employees (clerk+)>
      <!ELEMENT clerk (#PCDATA)>
    ]>
    <employees>
      <clerk>&bob;</clerk>
      <clerk>&nancy;</clerk>
    </employees>

In this listing, two external entity references are used to refer to XML documents outside the current document that contain the employee data on "bob" (bob.xml) and "nancy" (nancy.xml).
The SYSTEM keyword is used here to let the XML parser know that this is external content. In order to insert the external content into the XML document, the entities &bob; and &nancy; are used. It is useful to be able to contain the employee information in a separate file and "import" it using an entity reference. This is because this same information could be easily referenced by other XML documents, such as an employee directory and a payroll application. Defining logical units of data and separating them into multiple documents, as in this example, makes the data more extensible and reduces the need to reproduce redundant data from document to document.
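The merging step described above can be sketched with Python's standard xml.parsers.expat module. This is only an illustration under stated assumptions: the FAKE_FILES dictionary and its contents are hypothetical stand-ins for the two URLs (nothing is downloaded), the helper name read_clerks is our own, and the element declarations are left out because expat does not validate.

```python
import xml.parsers.expat

# Hypothetical file contents standing in for the two URLs; nothing is downloaded.
FAKE_FILES = {
    "http://srvr/emps/bob.xml": "Bob Smith",
    "http://srvr/emps/nancy.xml": "Nancy Jones",
}

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE employees [
  <!ENTITY bob SYSTEM "http://srvr/emps/bob.xml">
  <!ENTITY nancy SYSTEM "http://srvr/emps/nancy.xml">
]>
<employees>
  <clerk>&bob;</clerk>
  <clerk>&nancy;</clerk>
</employees>"""


def read_clerks(document, files):
    """Return the text of every <clerk>, with external entities merged in."""
    clerks = []
    state = {"in_clerk": False}
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == "clerk":
            state["in_clerk"] = True
            clerks.append("")

    def on_end(name):
        if name == "clerk":
            state["in_clerk"] = False

    def on_text(data):
        if state["in_clerk"]:
            clerks[-1] += data

    def on_external_entity(context, base, system_id, public_id):
        # Parse the referenced content with a child parser; it inherits
        # the handlers that are set on the parent parser.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(files[system_id], True)
        return 1  # tell expat the entity was handled

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return clerks


clerk_names = read_clerks(DOCUMENT, FAKE_FILES)
```

The calling code only ever sees the merged result in clerk_names; it has no way to tell that the text came from separate sources. Note that many parsers, including Python's ElementTree, refuse to resolve external entities by default for security reasons, which is why the resolver here has to be supplied explicitly.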
Non-Text External Entities and Notations

Some external entities will contain non-text data, such as an image file. We do not want the XML parser to attempt to parse these types of files. In order to stop the XML parser, we use the NDATA keyword. Take a look at the following declaration:

<!ENTITY myimage SYSTEM "myimage.gif" NDATA gif>

The NDATA keyword is used to alert the parser that the entity content should be sent unparsed to the output document. The final part of the declaration, gif, is a reference to a notation. A notation is a special declaration that identifies the format of non-text external data so that the XML application will know how to handle the data. Any time an external reference to non-text data is used, a notation identifying the data must be included and referenced. Notations are declared in the body of the DTD and have the following syntax:

<!NOTATION notationname [SYSTEM | PUBLIC ] dataformat>
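As a rough sketch of how an application might consume these declarations, the Python example below uses the standard xml.parsers.expat module to collect the notation and the unparsed entity, then resolves an attribute that names the entity. The employee element, its pic attribute, the sample content, and the helper name find_pictures are assumptions chosen for illustration. The parser never opens myimage.gif; it simply reports what was declared and leaves the rest to the application.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE employee [
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY myimage SYSTEM "myimage.gif" NDATA gif>
  <!ELEMENT employee (#PCDATA)>
  <!ATTLIST employee pic ENTITY #IMPLIED>
]>
<employee pic="myimage">Michael Q</employee>"""


def find_pictures(document):
    """Map each pic attribute to (file, data format) using the DTD declarations."""
    notations = {}  # notation name -> declared data format
    unparsed = {}   # entity name -> (system identifier, notation name)
    pictures = []
    parser = xml.parsers.expat.ParserCreate()

    def on_notation(name, base, system_id, public_id):
        notations[name] = system_id

    def on_entity(name, is_parameter, value, base, system_id, public_id, notation):
        if notation is not None:  # only NDATA entities carry a notation name
            unparsed[name] = (system_id, notation)

    def on_start(tag, attrs):
        if "pic" in attrs:
            system_id, notation = unparsed[attrs["pic"]]
            pictures.append((system_id, notations[notation]))

    parser.NotationDeclHandler = on_notation
    parser.EntityDeclHandler = on_entity
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return pictures


pictures = find_pictures(DOCUMENT)
```

The application ends up with the file name and the data format declared by the notation, which is all it needs to decide how to load and display the image.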
Listing 3.17 is an example of using notation declarations for non-text external entities.

Listing 3.17 Using External Non-Text Entities

<!NOTATION gif SYSTEM "image/gif" >
<!ENTITY employeephoto SYSTEM "images/employees/MichaelQ.gif" NDATA gif >
<!ELEMENT employee (name, sex, title, years) >
<!ATTLIST employee pic ENTITY #IMPLIED >
...
<employee pic="employeephoto">
  ...
</employee>

In this example, an ENTITY type of attribute, pic, is defined for the element employee. In the XML document, the pic attribute is given the value employeephoto, which is an external entity that serves as a placeholder for the GIF file MichaelQ.gif. To help the application process and display the GIF file, the external entity (using the NDATA keyword) references the notation gif, which points to the MIME type for GIF files.

Parameter Entities

The final type of entity we will look at is the parameter entity, which is very similar to the internal entity. The main difference between an internal entity and a parameter entity is that a parameter entity may only be referenced inside the DTD. Parameter entities are in effect entities specifically for DTDs. Parameter entities can be useful when you have to use a lot of repetitive or lengthy text in a DTD. Use the following syntax for parameter entities:

<!ENTITY % entityname entitycontent>

The syntax for a parameter entity is almost identical to the syntax for a normal, internal entity. However, notice that in the syntax, after the ENTITY keyword, there is a space, a percent sign, and another space before entityname. This alerts the XML parser that this is a parameter entity and will be used only in the DTD. These types of entities, when referenced, should begin with % and end with ;. Listing 3.18 shows an example of this.

Listing 3.18 Using Parameter Entities

<!ENTITY % pc "(#PCDATA)">
<!ELEMENT name %pc;>
<!ELEMENT age %pc;>
<!ELEMENT weight %pc;>

In this listing, pc is used as a parameter entity to reference (#PCDATA).
All elements in the DTD that hold parsed character data use the entity reference %pc;. This saves the DTD author from having to type #PCDATA over and over. This particular example is somewhat trivial, but you can see how it extends to a situation where you have a long character string that you do not want to have to retype. Note that a parameter entity reference may appear inside a declaration, as it does here, only in the external DTD subset; in the internal subset, parameter entity references may occur only between declarations.

We are almost finished. Having covered the use of element, attribute, and entity declarations in DTDs, we have just a few more loose ends to tie up. In the next section, we will look at the use of the IGNORE and INCLUDE directives. Then we will discuss the use of comments in DTDs. In the final part of the chapter, we will look at the future of DTDs, some possible shortcomings of DTDs, and a possible alternative for DTD validation. Before moving on though, let's pay one more quick visit to the Zippy Human Resources department in our mini case study.
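The expansion of %pc; can be observed with a parser that reads the DTD. The sketch below is one way to do so with Python's standard xml.parsers.expat module, under a few assumptions: the DTD is treated as an external file named person.dtd (a hypothetical name, served from memory rather than disk), a person root element is added so that the DTD has a document to belong to, and the helper name declared_elements is our own.

```python
import xml.parsers.expat
from xml.parsers.expat import model

# Hypothetical external DTD file, served from memory instead of disk.
DTD_FILES = {
    "person.dtd": """
<!ENTITY % pc "(#PCDATA)">
<!ELEMENT person (name, age, weight)>
<!ELEMENT name %pc;>
<!ELEMENT age %pc;>
<!ELEMENT weight %pc;>
""",
}

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE person SYSTEM "person.dtd">
<person><name>Pat</name><age>40</age><weight>150</weight></person>"""


def declared_elements(document, files):
    """Return {element name: content-model type} as reported by expat."""
    declared = {}
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to read the external DTD subset (it skips it by default).
    parser.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def on_element_decl(name, content_model):
        declared[name] = content_model[0]  # first item is the model type

    def on_external_entity(context, base, system_id, public_id):
        # The child parser inherits the handlers set on the parent.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(files[system_id], True)
        return 1

    parser.ElementDeclHandler = on_element_decl
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return declared


declared = declared_elements(DOCUMENT, DTD_FILES)
pcdata_elements = sorted(
    name for name, kind in declared.items() if kind == model.XML_CTYPE_MIXED
)
```

By the time the parser reports each element declaration, the %pc; reference has already been replaced, so name, age, and weight all come back with the same (#PCDATA) content model.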
More DTD Directives

Just a few more DTD keywords are left to cover. These are keywords that do not neatly fit into any particular topic, so they're lumped together here. These keywords are INCLUDE and IGNORE, and they do just what their names suggest: they indicate pieces of markup that should either be included in the validation process or ignored. Note that these directives, formally called conditional sections, are allowed only in the external DTD subset.

The IGNORE Keyword

When developing or updating a DTD, you may need to comment out parts of the DTD that are not yet reflected in the XML documents that use the DTD. You could use a normal comment directive (which will be covered in the next section), or you can use an IGNORE directive. The syntax for IGNORE is shown in Listing 3.19.

Listing 3.19 Using IGNORE Directives

<![IGNORE[
  This is the part of the DTD ignored
]]>

You can choose to ignore elements, entities, or attributes. However, you must ignore entire declarations. You may not attempt to ignore a part of a declaration. For example, the following would be invalid:

<!ELEMENT Employee <![IGNORE[ (#PCDATA) ]]> (Name, Address, Phone) >

In this example, the DTD author has attempted to ignore the rule #PCDATA in the middle of an element declaration. This is invalid and would trigger an error.

The INCLUDE Keyword

The INCLUDE directive marks declarations to be included in the document. It might seem interesting that this keyword exists at all because not using an INCLUDE directive is the same as using it! In the absence of the INCLUDE directive, all declarations (unless they are commented out or enclosed in an IGNORE directive) will be included anyway. The syntax for INCLUDE, as shown in Listing 3.20, is very similar to the syntax for the IGNORE directive.

Listing 3.20 Using INCLUDE Directives

<![INCLUDE[
  This is the part of the DTD included
]]>

The INCLUDE directive follows the same basic rules as the IGNORE directive. It may enclose entire declarations but not pieces of declarations.
The INCLUDE directive can be useful when you're in the process of developing a new DTD or adding to an existing DTD. Sections of the DTD can be toggled between the INCLUDE directive and the IGNORE directive in order to make it clear which sections are currently being used and which are not. This can make the process of developing a new DTD easier, because you are able to quickly "turn on" or "turn off" different sections of the DTD.
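The "turn on, turn off" effect can be checked with a short Python sketch that builds the same DTD twice, once with each keyword, and asks the standard xml.parsers.expat parser which elements it sees declared. The employee.dtd file name, the nickname element, and the helper name declared_elements are hypothetical and exist only for this illustration; the DTD is served from memory as an external subset, since that is where these directives are permitted.

```python
import xml.parsers.expat

# Hypothetical external DTD; {keyword} is filled in with INCLUDE or IGNORE.
DTD_TEMPLATE = """
<!ELEMENT employee (name)>
<!ELEMENT name (#PCDATA)>
<![{keyword}[
  <!ELEMENT nickname (#PCDATA)>
]]>
"""

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE employee SYSTEM "employee.dtd">
<employee><name>Bob</name></employee>"""


def declared_elements(keyword):
    """Return the element names expat sees when the section uses this keyword."""
    dtd_text = DTD_TEMPLATE.format(keyword=keyword)
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to read the external DTD subset (it skips it by default).
    parser.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def on_element_decl(name, content_model):
        names.append(name)

    def on_external_entity(context, base, system_id, public_id):
        # The child parser inherits the handlers set on the parent.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(dtd_text, True)
        return 1

    parser.ElementDeclHandler = on_element_decl
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(DOCUMENT, True)
    return names


with_section = declared_elements("INCLUDE")
without_section = declared_elements("IGNORE")
```

With INCLUDE the nickname declaration is reported along with the others; with IGNORE it disappears, exactly as if it had been commented out.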
Comments Within a DTD

Comments can also be added to DTDs. Comments within a DTD are just like comments in HTML and take the following syntax:

<!-- Everything between the opening tag and closing tag is a comment -->

As in HTML, comments in a DTD may not be nested. Comments may, however, span multiple lines. Generally comments in a DTD are used to demarcate different sections of the DTD or to help human readers understand different abbreviations used in the declarations. Comments will be ignored by the XML parser during processing. Listing 3.21 shows how to insert comments into a DTD.

Listing 3.21 Using Comments

<!-- This is a comment -->
<!ELEMENT rootelement (element1, element2)>
<!ELEMENT element1 (#PCDATA)>
<!-- This is another comment -->
<!ELEMENT element2 (#PCDATA)>
<!-- This is a comment
     that spans
     multiple lines -->

Comments provide a useful way to explain the meaning of different elements, attribute lists, and entities within the DTD. They can also be used to demarcate the beginning and end of different sections in the DTD.

The DTD is a powerful tool for defining rules for XML documents to follow. DTDs have had and will continue to have an important place in the XML world for some time to come. However, DTDs are not perfect. As XML has expanded beyond a simple document markup language, these limitations have become more apparent. XML is quickly becoming the language of choice for describing more abstract types of data. DTDs are hard-pressed to keep up. We will now take a look at some of the drawbacks to DTDs and what future alternatives will be available.

DTD Drawbacks and Alternatives

Throughout this book, we will continue to document new growths, changes, and permutations to XML as a technology to enhance data exchange, data structuring, e-commerce, the Internet, and so on. As newer uses for XML come into being, the needs for validation expand.
XML is being used to describe the data structure of video files, audio files, and Braille devices, among other things, not to mention the ever-growing plethora of alternative data devices such as cellular phones, handheld computers, televisions, and even appliances. There are several drawbacks that limit the ability of DTDs to meet these growing and changing validation needs.

First and foremost, DTDs are composed of non-XML syntax. Given that one of the central tenets of XML is that it be totally extensible, it may not seem to make a lot of sense that this is the case for DTDs. However, you must consider that XML is a child of SGML, and in SGML, DTDs are the method used to validate documents. Therefore, XML inherited DTDs from its parent. Although DTDs are effective at defining the structure for document markup, as XML evolves, the fact that DTDs are not formed of XML syntax and are nonextensible becomes constraining.

Additionally, there can only be a single DTD per document. It is true that there can be internal and external subsets of DTDs, but there can only be a single DTD referenced. In the modern programming world, we are used to being able to draw the programming constructs we use from different modules or classes. If we applied this idea to DTDs, we might expect to be able to use a DTD for customers, a separate DTD for inventory, and a separate DTD for orders. However, this is not the case. All aspects of an XML document must be within a single DTD. This limitation is similar to what programmers faced back in the days of monolithic applications, before object-oriented programming became standard practice for application development.

This leads into the next limitation. DTDs are not object oriented. There is no inheritance in DTDs. As programmers, we have gotten used to describing new objects based on the characteristics of existing objects.
One classic example is having Porsche, Ford, and Chevrolet classes that inherit some characteristics from a base car class. DTDs have no capability to do this.

DTDs do not support namespaces very well. For a namespace to be used, the entire namespace must be defined within the DTD. If there is more than one namespace, each of them must be defined within the DTD. This totally defeats the purpose of namespaces, which is being able to define multiple namespaces from many different external sources.

Additionally, DTDs have weak data typing and no support for the XML DOM. DTDs basically have one data type: the text string. There are a few constraints, such as the element rules and attribute types covered in this chapter, but these are pretty weak considering the types of applications for which XML is now being used (especially in e-commerce). The Document Object Model has become a powerful tool to manipulate XML data; however, the DTD is totally cut off from the reach of the DOM.

Finally, and possibly most important from a security standpoint, is the ability of the internal DTD subset to override the external DTD subset. A company could spend a great deal of time and effort crafting a DTD to validate the XML data in its e-commerce transactions only to have the settings in the DTD overridden by the internally defined elements of a DTD. The implications of this from a transaction security standpoint should be fairly clear.

So, what is to be done about the DTD? The DTD is still an effective mechanism for validating XML documents and will be so for a long time to come. It just does not "scale" to meet the needs of the expanding XML world. At the time of this writing, the W3C has just recently put the final touches on the recommendation for the XML Schema, which is a new validation mechanism for XML that addresses many of the shortcomings of DTDs. XML Schema is a powerful and important technology for the future of XML.
The next chapter of this book will be devoted to covering the XML Schema.
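The override problem mentioned among the drawbacks can be reproduced in a few lines. The Python sketch below, again using the standard xml.parsers.expat module, is a minimal illustration under assumed names: a shared external DTD (order.dtd, served from memory) declares a currency entity, and the document's internal subset declares the same entity with a different value. The entity, its values, and the helper name read_order are all hypothetical. Because the internal subset is read first and the first declaration of an entity is the one that binds, the document's own value wins.

```python
import xml.parsers.expat

# Hypothetical shared DTD that the company maintains, served from memory.
EXTERNAL_DTD = '<!ENTITY currency "USD">'

# The document author redeclares the same entity in the internal subset.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE order SYSTEM "order.dtd" [
  <!ENTITY currency "FAKE">
]>
<order>&currency;</order>"""


def read_order(document, dtd_text):
    """Return the text content of the document after entity expansion."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to read the external DTD subset (it skips it by default).
    parser.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def on_text(data):
        chunks.append(data)

    def on_external_entity(context, base, system_id, public_id):
        # The child parser inherits the handlers set on the parent.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(dtd_text, True)
        return 1

    parser.CharacterDataHandler = on_text
    parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(document, True)
    return "".join(chunks)


order_text = read_order(DOCUMENT, EXTERNAL_DTD)
```

The value from the shared DTD never reaches the application, which is why a receiving system cannot rely on the external DTD alone when the sender controls the document.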
Summary

In this chapter, we have covered the Document Type Definition (DTD) and how it is used to validate XML documents. Well-formed XML documents are documents that are syntactically correct according to the syntax rules of XML. However, in order to be valid, an XML document must be validated against a DTD using a validating XML parser. A DTD serves as a roadmap for defining what structure a valid XML document should have. We covered the following items in relation to using DTDs:
Throughout the chapter, we followed a mini case study in which the Human Resources department for Zippy Delivery Service used XML to store employee records. The Human Resources department required a DTD to ensure that all XML records were of a uniform structure. To start, they built a simple DTD that was functional and worked. However, they were able to expand upon and improve their DTD to coincide with the introduction of new DTD topics in this chapter. Ultimately, they produced a DTD that effectively defined all the needs of the Human Resources department and enabled them to build a good roadmap for a valid XML document containing employee records.
Copyright © 2000-2003 ASPAlliance.com