Introduction: XML and Data
XML stands for eXtensible Markup Language. XML
is a meta-markup language developed by the World Wide
Web Consortium (W3C) to deal with a number of the shortcomings of
HTML. As more and more functionality was
added to HTML to account for the diverse needs of users of the Web, the language
grew increasingly complex and unwieldy. A way to create domain-specific markup
languages free of the accumulated cruft of HTML became increasingly necessary,
and XML was born.
The main difference between HTML and XML is that whereas in HTML the semantics and
syntax of tags are fixed, in XML the author of the document is free to create tags
whose syntax and semantics are specific to the target application. The
semantics of a tag are also not tied down but instead depend on the context of the
application that processes the document. The other significant difference between
HTML and XML is that an XML document must be well-formed.
Although the original purpose of XML was as a way to mark up content, it became
clear that XML also provided a way to describe structured data thus making it important
as a data storage and interchange format. XML provides many advantages as a data format
over others, including:
- Built-in support for internationalization, since it utilizes Unicode.
- Platform independence (for instance, no need to worry about endianness).
- A human-readable format that makes it easier for developers to locate and fix errors
than with previous data storage formats.
- Extensibility in a manner that allows developers to add extra information to
a format without breaking applications that were based on older versions of
the format.
- A large number of off-the-shelf tools for processing XML documents already
exist.
Structuring XML: DTDs and XML Schemas
Since XML is a way to describe structured data, there should be a means to specify the
structure of an XML document. Document Type Definitions (DTDs) and XML Schemas are
different mechanisms for specifying which elements can occur in a
document and the order in which they can occur, and for constraining certain aspects of these
elements. An XML document that conforms to a DTD or schema is considered to be
valid. Below is a listing of the different means of constraining the contents of
an XML document.
SAMPLE XML FRAGMENT
- Document Type Definitions (DTD): DTDs were the original means of
specifying the structure of an XML document and are a holdover from XML's roots as a
subset of the Standard Generalized
Markup Language (SGML). DTDs have a different syntax from XML and are used to
specify the order and occurrence of elements in an XML document. Below is a DTD
for the above XML fragment.
DTD FOR SAMPLE XML FRAGMENT
- XML-Data Reduced (XDR): DTDs proved to be inadequate for the
needs of users of XML for a number of reasons. The main
criticisms of DTDs were that they use a different syntax from XML and that they
have no support for datatypes.
XDR, a proposal
for XML schemas, was submitted to the W3C by the Microsoft Corporation as a
potential XML schema standard but was eventually rejected. XDR tackled some of
the problems of DTDs by being XML-based and by supporting a number of datatypes
analogous to those used in relational database management systems and popular
programming languages. Below is an XML schema, using XDR, for the above XML
fragment.
XDR FOR SAMPLE XML FRAGMENT
- XML Schema Definitions (XSD): The W3C XML Schema
recommendation provides a
sophisticated means of describing the structure and constraints on the content model
of XML documents. W3C XML Schema supports more datatypes than XDR, allows for the
creation of custom datatypes, and supports object-oriented programming concepts like
inheritance and polymorphism. Currently XDR is used more widely than W3C XML
Schema, but this is primarily because the XML Schema recommendation is fairly new and
will thus take time to become accepted by the software industry.
XSD FOR SAMPLE XML FRAGMENT
The linked examples show that DTDs give the least control over how one can constrain and
structure data within an XML document while W3C XML schemas give the most.
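To make the difference in control concrete, here is a minimal Python sketch rather than a real DTD or schema validator. It assumes a hypothetical `book` fragment and a content model of `(title, author+, price?)`, both invented for illustration. The first check covers only order and occurrence of child elements, which is roughly what a DTD can express; the second adds a datatype constraint of the kind XDR and W3C XML Schema introduce.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment; the element names are illustrative only.
SAMPLE = """
<book>
  <title>XML Basics</title>
  <author>A. Writer</author>
  <author>B. Editor</author>
  <price>29.95</price>
</book>
"""

# DTD-style content model (title, author+, price?) written as a regular
# expression over the space-terminated sequence of child element names.
CONTENT_MODEL = re.compile(r"(title )(author )+(price )?")


def matches_content_model(element):
    """Check order and occurrence of children, as a content model would."""
    names = "".join(child.tag + " " for child in element)
    return CONTENT_MODEL.fullmatch(names) is not None


def price_is_decimal(element):
    """Datatype check of the kind schema languages add: price must be numeric."""
    price = element.find("price")
    if price is None:
        return True  # price is optional in the assumed content model
    try:
        float(price.text)
        return True
    except (TypeError, ValueError):
        return False


book = ET.fromstring(SAMPLE)
structure_ok = matches_content_model(book)
datatype_ok = price_is_decimal(book)

# Same structure, but the price is not numeric: the structural check still
# passes, and only the datatype check catches the problem.
bad = ET.fromstring(
    "<book><title>T</title><author>A</author><price>cheap</price></book>")
bad_structure_ok = matches_content_model(bad)
bad_datatype_ok = price_is_decimal(bad)
```

The second document is the interesting case: it satisfies the structural rules, so a purely structural description would accept it, while a typed schema would not.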
XML Querying: XPath and XQuery
It is sometimes necessary to extract subsets of the data stored within an XML
document. A number of languages have been created for querying XML documents,
including YaTL. Since XPath is already
a W3C recommendation while XQuery is on its way to becoming one, the focus of this
section will be on these two languages. Both languages can be used to retrieve and
manipulate data from an XML document.
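As a small illustration of path-based retrieval, the sketch below runs a few queries with Python's standard `xml.etree.ElementTree` module, which implements only a subset of XPath. The `catalog` document and its element names are hypothetical, chosen just for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
CATALOG = """
<catalog>
  <book year="1999"><title>First</title><author>Ann</author></book>
  <book year="2001"><title>Second</title><author>Bob</author></book>
  <book year="2001"><title>Third</title><author>Ann</author></book>
</catalog>
"""

root = ET.fromstring(CATALOG)

# Path steps select child elements, much like directories in a file path.
all_titles = [t.text for t in root.findall("./book/title")]

# A predicate on an attribute narrows the selection.
titles_2001 = [t.text for t in root.findall("./book[@year='2001']/title")]

# A predicate on the text of a child element.
ann_books = [b.find("title").text for b in root.findall("./book[author='Ann']")]
```

Each query returns elements from the document tree, from which the text is then read; a full XPath engine would additionally offer the string, number and boolean functions.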
- XML Path Language (XPath): XPath is a language for addressing parts of an
XML document that utilizes a syntax resembling the hierarchical paths used to
address parts of a filesystem or URL. XPath also supports the use of functions for
interacting with the data selected from the document. It provides functions for
accessing information about document nodes as well as for the manipulation of
strings, numbers and booleans. XPath is extensible with regard to functions, which
allows developers to add functions that manipulate
the data retrieved by an XPath query to the library of functions available by
default. XPath uses a compact, non-XML syntax in order to facilitate the use of
XPath within URIs and XML attribute values (this is important for other W3C
recommendations like XML Schema and XSLT that use XPath within attributes).
XPath operates on the
abstract, logical structure of an XML document, rather than its surface syntax. XPath
is designed to operate on a single XML document, which it views as a tree of nodes,
and the values returned by an XPath query are considered conceptually to be nodes.
The types of nodes that exist in the XPath data model of a document are text nodes,
element nodes, attribute nodes, root nodes, namespace nodes, processing instruction
nodes, and comment nodes.
Sample XPath Queries Against Sample XML Fragment
- XML Query Language (XQuery): XQuery is an attempt to provide a query
language that offers the same breadth of functionality and underlying formalism as
SQL does for relational databases. XQuery is a
language where each query is an expression. XQuery expressions fall into seven
broad types: path expressions, element constructors, FLWR expressions, expressions
involving operators and functions, conditional expressions, quantified expressions,
and expressions that test or modify datatypes. The syntax and semantics of the
different kinds of XQuery expressions vary significantly, which is a testament to the
numerous influences on the design of XQuery.
XQuery has a sophisticated type system based on
XML Schema datatypes and, unlike XPath, supports
the manipulation of document nodes. Also, the data model of XQuery
is designed to operate not only on a single XML document but also on a well-formed
fragment of a document, a sequence of documents, or a sequence of document fragments.
The W3C is also working towards creating an alternate version of XQuery that has the
same semantics but uses an XML-based syntax instead.
Sample XQuery Queries and Expressions Taken From W3C Working Draft
XML and Databases
As was mentioned in the introduction, there is a dichotomy in how XML is used in industry. On one
hand there is the document-centric model of XML, where XML is typically used as a means of creating
semi-structured documents with irregular content that are meant for human consumption. An example
of document-centric usage of XML is XHTML, which is the XML-based successor to HTML.
SAMPLE XHTML DOCUMENT
The other primary usage of XML is in a data-centric model. In a data-centric model, XML is used as
a storage or interchange format for data that is structured, appears in a regular order and is most
likely to be machine processed instead of read by a human. In a data-centric model, the fact that
the data is stored or transferred as XML is typically incidental since it could be stored or
transferred in a number of other formats which may or may not be better suited for the task
depending on the data and how it is used. An example of a data-centric usage of XML is SOAP.
SOAP is an XML-based protocol used for exchanging information in a decentralized, distributed
environment. A SOAP message consists of three parts: an envelope that defines a framework for
describing what is in a message and how to process it, a set of encoding rules for expressing
instances of application-defined datatypes, and a convention for representing remote procedure
calls and responses.
SAMPLE SOAP MESSAGE TAKEN FROM W3C SOAP RECOMMENDATION
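The envelope structure described above can be sketched with Python's standard library. This is only an illustration of the message shape: the namespace URI is the SOAP 1.1 envelope namespace, while the application namespace `urn:example:quotes`, the `GetQuote` call and its `symbol` parameter are hypothetical names made up for this example.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
APP_NS = "urn:example:quotes"  # hypothetical application namespace

# Ask the serializer to use readable prefixes instead of generated ones.
ET.register_namespace("SOAP-ENV", SOAP_ENV)
ET.register_namespace("m", APP_NS)


def build_request(symbol):
    """Wrap a remote procedure call in an Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetQuote")  # hypothetical method
    ET.SubElement(call, "symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")


def read_request(message):
    """Recover the method name and its arguments from the Body."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # first child of Body is the call element
    method = call.tag.split("}", 1)[1]  # strip the {namespace} part
    args = {child.tag: child.text for child in call}
    return method, args


message = build_request("XYZ")
method, args = read_request(message)
```

The receiving side never needs to know how the sender stored the data; it only relies on the agreed envelope and call conventions, which is what makes this a data-centric use of XML.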
In both models where XML is used, it is sometimes necessary to store the XML in some sort of
repository or database that allows for more sophisticated storage and retrieval of the data
especially if the XML is to be accessed by multiple users. Below is a description of storage
options based on what model of XML usage is required.
- Data-centric model: In a data-centric model where data is stored in a relational
database or similar repository, one may want to extract data from a
database as XML, store XML into a database, or both. For situations where one only needs to
extract XML from the database, one may use a middleware application or component that retrieves
data from the database and returns it as XML. Middleware components that transform relational
data to XML and back vary widely in the functionality they provide and how they provide it.
For instance, Microsoft's offering provides
XML integration to such a degree that results from queries on XML documents or SQL databases
can be accessed identically via the same API. Some, like Merant's product,
require the user to specify how the results of a SQL query should be converted to XML via a
custom query, while others, like IBM's
Database DOM, require the user
to create a template file that contains the SQL to XML mappings for the query to be performed.
Another approach is the one taken by
DB2XML, where a
default mapping of SQL results to XML data exists that cannot be altered by the user. Middleware
components also vary in the sophistication of their user interface, which may range from
practically non-existent (interaction is done programmatically via APIs) to
a sophisticated graphical user interface.
The alternative to using middleware components to retrieve or store XML in a database is to
use an XML-enabled database that understands how to convert relational data to XML and back.
Currently, the Big 3 relational database products all support retrieving and storing XML in one
form or another. IBM's DB2 uses the
DB2 XML Extender. The DB2 extender gives one the option to store an entire XML document and
its DTD as a user-defined column (of type XMLCLOB, XMLVARCHAR or XMLFile) or to shred the document
into multiple tables and columns. XML documents can then be queried with syntax that is compliant
with the W3C XPath recommendation. Updating of XML data is also possible using stored procedures.
SAMPLE DB2 XML EXTENDER TABLE AND QUERY
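The idea of shredding a document into tables and columns can be sketched independently of any product. The example below uses Python's built-in `sqlite3` as a stand-in database and does not use the DB2 XML Extender API; the `order` document and the `orders` and `items` tables are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order document used only for illustration.
ORDER_XML = """
<order id="17" customer="ACME">
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="5"/>
</order>
"""


def shred(conn, xml_text):
    """Decompose one order document into rows of two related tables."""
    order = ET.fromstring(xml_text)
    order_id = int(order.get("id"))
    conn.execute(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        (order_id, order.get("customer")),
    )
    conn.executemany(
        "INSERT INTO items (order_id, sku, qty) VALUES (?, ?, ?)",
        [(order_id, item.get("sku"), int(item.get("qty")))
         for item in order.findall("item")],
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE items (order_id INTEGER, sku TEXT, qty INTEGER)")
shred(conn, ORDER_XML)

# Once shredded, the data can be queried with ordinary SQL.
total_qty = conn.execute(
    "SELECT SUM(qty) FROM items WHERE order_id = 17").fetchone()[0]
```

The trade-off against storing the whole document in one column is that shredded data is easy to query relationally, but the original document has to be reassembled if it is needed again.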
Oracle has completely integrated XML into
its Oracle 9i database as well as the rest of its family of products. XML documents can
be stored as whole documents in user-defined columns (of type XMLType or CLOB/BLOB), where they can be
extracted using XMLType functions such as Extract(), or they can be stored as decomposed XML
documents in object-relational form, which can be reconstituted using the XML SQL
Utility (XSU) or SQL functions and packages. For searching XML, Oracle provides Oracle Text,
which can be used to index and search XML
stored in VARCHAR2 or BLOB variables within a table via the CONTAINS and WITHIN operators used in
conjunction with SQL SELECT queries. XMLType columns can be queried by selecting them through a
programming interface (e.g. SQL, PL/SQL, C, or Java), by querying them directly and using
extract() and/or existsNode(), or by using Oracle Text operators to query the XML content. The
extract() and existsNode() functions use XPath expressions for querying XML data.
Oracle 9i also allows one to create relational views on XML documents stored in XMLType columns
which can then be queried using SQL. The columns in the table are mapped to XPath expressions
that query the document in the XMLType column.
SAMPLE ORACLE 9i TABLE AND QUERY
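The notion of a relational view whose columns are mapped to XPath expressions can be illustrated without Oracle at all. The Python sketch below is not Oracle syntax; it assumes a hypothetical list of stored `po` documents and a hypothetical column-to-path mapping, and uses the XPath subset in `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored documents, standing in for an XML-typed column.
DOCS = [
    "<po><ref>PO-1</ref><buyer><name>Ann</name></buyer><total>40</total></po>",
    "<po><ref>PO-2</ref><buyer><name>Bob</name></buyer><total>15</total></po>",
]

# Each view column is defined by a path expression into the document.
VIEW_COLUMNS = {
    "ref": "./ref",
    "buyer": "./buyer/name",
    "total": "./total",
}


def view_rows(documents, columns):
    """Produce one row per document by evaluating each column's path."""
    rows = []
    for text in documents:
        doc = ET.fromstring(text)
        rows.append({col: doc.findtext(path) for col, path in columns.items()})
    return rows


rows = view_rows(DOCS, VIEW_COLUMNS)

# The resulting rows can be filtered like any relational result.
big_orders = [r["ref"] for r in rows if float(r["total"]) > 20]
```

The documents themselves stay intact; only the view decides which parts of each document are exposed as columns.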
Microsoft's SQL Server 2000 also supports XML operations being performed on relational data.
XML data can be retrieved from relational tables using the FOR XML clause.
The FOR XML clause has three modes: RAW, AUTO and EXPLICIT. RAW mode sends each row of data in
the resultset back as an XML element named "row", with each column being an attribute of
the "row" element. AUTO mode returns query results in a nested XML tree where each element
returned is named after the table it was extracted from and each column is an attribute of the
returned elements. The hierarchy is determined based on the order of the tables identified by
the columns of the SELECT statement. With EXPLICIT mode the hierarchy of the XML returned is
completely controlled by the query, which can be rather complex. SQL Server also provides the
OPENXML clause, which provides a relational view on XML data. OPENXML allows XML documents
placed in memory to be used as parameters to SQL statements or stored procedures. Thus
OPENXML is used to query data from XML, join XML data with existing relational tables, and
insert XML data into the database by "shredding" it into tables. Also, W3C XML
Schema can be used to provide mappings between XML and relational structures. These
mappings are called XML views and allow relational data in tables to be viewed as XML which
can be queried using XPath.
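The shape of RAW mode output, one "row" element per result row with one attribute per column, can be imitated in a few lines. The sketch below uses Python's `sqlite3` rather than SQL Server, so it only mimics the output format; the `customers` table is hypothetical, and leaving NULL columns out of the element is an assumption of this sketch.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_as_raw_xml(conn, query):
    """Return each result row as a <row> element with one attribute per column."""
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    elements = []
    for row in cursor:
        # NULL columns are simply left out (an assumption of this sketch).
        attrs = {n: str(v) for n, v in zip(names, row) if v is not None}
        elements.append(ET.tostring(ET.Element("row", attrs), encoding="unicode"))
    return elements


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")  # hypothetical table
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

raw = rows_as_raw_xml(conn, "SELECT id, name FROM customers ORDER BY id")
first = ET.fromstring(raw[0])
```

AUTO and EXPLICIT modes differ from this only in how the element names and nesting are chosen, which is why EXPLICIT queries can grow complex.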
As can be seen from the above descriptions, there is currently no standard way to access XML
from relational databases. This may change with the
SQL/XML standard currently being developed by the SQLX group.
- Document-centric model: Content management systems are typically the tool of
choice when considering storing, updating and retrieving various XML documents in a shared
repository. A content management system typically consists of a repository that stores a variety
of XML documents, an editor and an engine that provides one or more of the following features:
- version, revision and access control
- ability to reuse documents in different formats
- web publishing facilities
- support for a variety of text editors (e.g. Microsoft Word, Adobe FrameMaker, etc.)
- indexing and search capabilities
Content management systems have been primarily of benefit for workflow management in corporate
environments where information sharing is vital and as a way to manage the creation of web
content in a modular fashion allowing web developers and content creators to perform their tasks
with less interdependence than exists in a traditional web authoring environment. Examples of XML
based content management systems are
- Hybrid model: In situations where both document-centric and data-centric
models of XML usage will occur, the best data storage choice is usually a native XML
database. What actually constitutes a native XML database has been a topic of some debate in
various fora, which has been compounded by the blurred lines that many see between XML-enabled
databases, XML query engines, XML servers and native XML databases. The most coherent definition
so far is one that was reached by consensus amongst members of the
XML:DB mailing list, which
defines a native XML database as a database that has an XML document as its fundamental unit of
(logical) storage and defines a (logical) model for an XML document, as opposed to the data in
that document, and stores and retrieves documents according to that model. At a minimum, the
model must include elements, attributes, PCDATA, and document order. Described below are two
examples of native XML databases with the intent of showing the breadth of functionality and
variety that can be expected in the native XML database arena.
Tamino is a native XML database management
system developed by Software AG. Tamino is a relatively mature application, currently at version
2.3.1, that provides the means to store and retrieve XML documents, store and retrieve
relational data, as well as interface with external applications and data sources. Tamino has
a web-based administration interface similar to that used by the major relational database
management systems and includes GUI tools for interacting with the database and editing schemas.
Schemas in Tamino are DTD-based and are used primarily as a way to describe how the XML
data should be indexed. When storing XML documents in Tamino, one can specify
a pre-existing DTD which is then converted to a Tamino schema, store a well-formed XML document
without a schema (in which case default indexing ensues), or create a schema from scratch
for the XML document being stored. A secondary usage of schemas is for specifying the datatypes
in XML documents. The main advantage of using datatypes in Tamino is to enable type-based
operations within queries (e.g. numeric comparisons). The query language used by Tamino is based
on XPath and is called X-Query (not to be confused with the W3C XQuery).
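Why datatype information matters for comparisons can be shown with a short Python sketch. It does not use Tamino or X-Query; the `stock` document is hypothetical, and the two lambdas simply contrast treating a value as text with treating it as a number.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
DOC = """
<stock>
  <item><name>bolt</name><qty>9</qty></item>
  <item><name>nut</name><qty>10</qty></item>
  <item><name>gear</name><qty>100</qty></item>
</stock>
"""

root = ET.fromstring(DOC)


def names_where(predicate):
    """Names of the items whose qty text satisfies the predicate."""
    return [item.findtext("name") for item in root.findall("item")
            if predicate(item.findtext("qty"))]


# Without datatype information, qty is text and compares character by character.
as_text = names_where(lambda qty: qty > "9")

# With qty known to be numeric, the comparison is made on numbers.
as_number = names_where(lambda qty: int(qty) > 9)
```

The text comparison finds nothing, because "10" and "100" both sort before "9", whereas the numeric comparison finds the two items one would expect.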
Tamino also ships with a relational database management system which is called the SQL Engine.
Schemas can be used to create mappings from SQL to XML, which then allow for the storage or
retrieval of XML data from relational database sources either internal (i.e. the SQL Engine) or
external. Schemas can also be used to represent joins across different document types. Joins allow
for queries to be performed on XML documents with differing schemas. Future versions of Tamino
are supposed to eliminate the need to specify joins up front in a schema and instead should allow
for such joins to be done dynamically from a query.
Tamino provides APIs for accessing the XML store in both Java and Microsoft's
JScript. C programmers can interact with the SQL Engine using the SQL precompiler that ships
with Tamino. Interfaces that allow ODBC, OLE DB and JDBC clients to communicate with the
Tamino SQL Engine are also available. Finally, Tamino ships with the X-Tensions framework, which
allows developers to extend the functionality of Tamino by using C++ COM objects or Java objects.
Tamino operations have ACID properties
(Atomicity, Consistency, Isolation and Durability) through its support for transactions.
dbXML is an Open Source native XML database management system
which is sponsored by the dbXML Group. dbXML is designed for managing collections of XML
documents, which are arranged hierarchically within the system in a manner similar to that of
a file system. Querying the XML documents within the system is done using XPath, and the documents
can be indexed to improve query performance.
dbXML is written in Java but supports access from other languages by exposing a CORBA API
thus allowing interaction with any language that supports a CORBA binding. It also ships with a
Java implementation of the XML:DB XML Database API which
is designed to be a vendor neutral API for XML databases. A number of command line tools for
managing documents and collections are also provided.
dbXML is mostly still in development (version at time of writing was 1.0 beta 2) and does not
currently support transactions or the use of schemas but these features are currently being
developed for future versions.
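The collection model described for dbXML, documents grouped in file-system-like collections and queried with XPath, can be approximated with a toy in-memory store. The sketch below is not the dbXML or XML:DB API; the `CollectionStore` class, the collection paths and the documents are all hypothetical.

```python
import xml.etree.ElementTree as ET


class CollectionStore:
    """Toy in-memory store: collections form a hierarchy like directories."""

    def __init__(self):
        self._docs = {}  # (collection path, document name) -> parsed root

    def put(self, collection, name, xml_text):
        """Store a document under a collection path such as /db/people."""
        self._docs[(collection.rstrip("/"), name)] = ET.fromstring(xml_text)

    def query(self, collection, path):
        """Run a path query over a collection and all of its sub-collections."""
        prefix = collection.rstrip("/")
        hits = []
        for (coll, name), root in sorted(self._docs.items(), key=lambda kv: kv[0]):
            if coll == prefix or coll.startswith(prefix + "/"):
                hits.extend((name, el.text) for el in root.findall(path))
        return hits


store = CollectionStore()
store.put("/db/people/staff", "ann.xml", "<person><name>Ann</name></person>")
store.put("/db/people/guests", "bob.xml", "<person><name>Bob</name></person>")
store.put("/db/orders", "o1.xml", "<order><name>Widget</name></order>")

# The query is scoped to /db/people, so the order document is not searched.
people = store.query("/db/people", "./name")
```

A real native XML database would add indexing on top of this so that queries do not have to visit every document in the collection.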
References
- Chamberlin, Don et al. XQuery 1.0: An XML Query Language (Working Draft). 7 June 2001.
- Clark, James and Steve DeRose. XPath: XML Path Language (Version 1.0). 16 November 1999.
- Bourret, Ronald. XML and Databases. June 2001.
- Bourret, Ronald. XML Database Products. 22 October 2001.
- Turau, Volker. Making Legacy Data Accessible For XML Applications. 1999.
- Cheng, Josephine and Jane Xu. IBM DB2 Extender. From ICDE '00 Conference, San Diego. February 2000.
Acknowledgements
The following people helped in reviewing and proofreading this paper: Dr. Sham Navathe, Kimbro Staken,
Dmitri Alperovitch, Sam Collins, and Dennis Lu.