Kuro5hin.org: technology and culture, from the trenches
What do you know about XML and Databases?

By Carnage4Life in Technology
Mon Oct 29, 2001 at 12:15:58 AM EST
Tags: Software

The worlds of traditional data storage and XML have never been closer together. To better understand how data storage and retrieval work in an XML world, this article first discusses the past, present, and future of structuring XML documents. It then delves into the languages that make it possible to query an XML document much like a traditional data store. Next comes an exploration of how the most popular RDBMSs have recognized the importance of this new data storage format and integrated XML into their latest releases. Finally, it shows the rise of new data storage and retrieval systems designed specifically for handling XML.


Introduction: XML and Data

XML stands for eXtensible Markup Language. XML is a meta-markup language developed by the World Wide Web Consortium (W3C) to deal with a number of the shortcomings of HTML. As more and more functionality was added to HTML to account for the diverse needs of users of the Web, the language grew increasingly complex and unwieldy. The need for a way to create domain-specific markup languages that did not contain all the cruft of HTML became pressing, and XML was born.

The main difference between HTML and XML is that whereas in HTML the semantics and syntax of tags are fixed, in XML the author of the document is free to create tags whose syntax and semantics are specific to the target application. The semantics of a tag are not tied down either, but depend on the context of the application that processes the document. The other significant difference between HTML and XML is that an XML document must be well-formed: for example, every start tag needs a matching end tag, elements must nest properly, and attribute values must be quoted.
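As a rough illustration of the well-formedness requirement, the sketch below uses the parser in Python's standard library to accept or reject two small inputs. The helper name `is_well_formed` and both sample strings are assumptions made for this example and are not part of the article's linked samples.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup: the <br> tag is never closed, which HTML tolerates.
html_style = "<p>first line<br>second line</p>"
# The same content with the empty element closed, as XML requires.
xml_style = "<p>first line<br/>second line</p>"

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

Note that this checks well-formedness only; whether a document is valid against a DTD or schema is a separate question.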

Although the original purpose of XML was as a way to mark up content, it became clear that XML also provided a way to describe structured data thus making it important as a data storage and interchange format. XML provides many advantages as a data format over others, including:

  1. Built-in support for internationalization, because it uses Unicode.
  2. Platform independence (for instance, no need to worry about endianness).
  3. A human-readable format, which makes it easier for developers to locate and fix errors than with previous data storage formats.
  4. Extensibility in a manner that allows developers to add extra information to a format without breaking applications that were based on older versions of the format.
  5. A large number of off-the-shelf tools for processing XML documents already exist.
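The extensibility point in the list above can be sketched as follows. The contact format, its element names and the `read_contact` function are all hypothetical, invented only for this illustration; the idea is that a reader which looks up elements by name keeps working when a newer version of the format adds information.

```python
import xml.etree.ElementTree as ET

# Version 1 of a hypothetical address-book format.
V1 = "<contact><name>Ada</name><phone>555-0100</phone></contact>"

# Version 2 adds an <email> element and a "type" attribute on <phone>.
V2 = ("<contact><name>Ada</name><phone type='work'>555-0100</phone>"
      "<email>ada@example.org</email></contact>")

def read_contact(xml_text: str) -> dict:
    """A 'version 1' reader: it looks up only the elements it knows by name."""
    root = ET.fromstring(xml_text)
    return {"name": root.findtext("name"), "phone": root.findtext("phone")}

# The old reader extracts the same data from both versions of the format.
print(read_contact(V1))
print(read_contact(V2))
```

This only holds when readers are written to ignore what they do not recognize; a reader that depends on exact child positions would not be as forgiving.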

Structuring XML: DTDs and XML Schemas

Since XML is a way to describe structured data, there should be a means to specify the structure of an XML document. Document Type Definitions (DTDs) and XML Schemas are different mechanisms for specifying which elements can occur in a document and the order in which they can occur, and for constraining certain aspects of these elements. An XML document that conforms to a DTD or schema is considered to be valid. Below is a listing of the different means of constraining the contents of an XML document.

SAMPLE XML FRAGMENT
  1. Document Type Definitions (DTD): DTDs were the original means of specifying the structure of an XML document and are a holdover from XML's roots as a subset of the Standard Generalized Markup Language (SGML). DTDs have a different syntax from XML and are used to specify the order and occurrence of elements in an XML document. Below is a DTD for the above XML fragment.

    DTD FOR SAMPLE XML FRAGMENT

  2. XML-Data Reduced (XDR): DTDs proved to be inadequate for the needs of XML users for a number of reasons. The main criticisms of DTDs were that they use a different syntax from XML and that they have no support for datatypes. XDR, a proposal for XML schemas, was submitted to the W3C by the Microsoft Corporation as a potential XML schema standard but was eventually rejected. XDR tackled some of the problems of DTDs by being XML based and by supporting a number of datatypes analogous to those used in relational database management systems and popular programming languages. Below is an XML schema, using XDR, for the above XML fragment.

    XDR FOR SAMPLE XML FRAGMENT

  3. XML Schema Definitions (XSD): The W3C XML Schema recommendation provides a sophisticated means of describing the structure of, and constraints on, the content model of XML documents. W3C XML Schema supports more datatypes than XDR, allows for the creation of custom datatypes, and supports object-oriented programming concepts like inheritance and polymorphism. Currently XDR is used more widely than W3C XML Schema, but this is primarily because the XML Schema recommendation is fairly new and will thus take time to become accepted by the software industry.

    XSD FOR SAMPLE XML FRAGMENT

The linked examples show that DTDs give the least control over how one can constrain and structure data within an XML document while W3C XML schemas give the most.
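To give a feel for the two kinds of constraint discussed in this section, here is a hand-written sketch in Python. It is not a DTD or schema processor (the standard library parser used here does not validate against either); it only imitates, for one hypothetical `<book>` element, an order-and-occurrence rule of the kind a DTD can state and a datatype rule of the kind that needs XDR or W3C XML Schema. All element and function names are assumptions.

```python
import xml.etree.ElementTree as ET

def check_structure(book: ET.Element) -> bool:
    """Order and occurrence: exactly one <title>, then one or more <author>,
    then exactly one <price>. A DTD could express this content model as
    <!ELEMENT book (title, author+, price)>."""
    tags = [child.tag for child in book]
    if len(tags) < 3:
        return False
    return (tags[0] == "title"
            and tags[-1] == "price"
            and all(tag == "author" for tag in tags[1:-1]))

def check_price_type(book: ET.Element) -> bool:
    """Datatype: the text of <price> must parse as a number. This is the
    kind of rule a DTD cannot state but a schema language can."""
    try:
        float(book.findtext("price", default=""))
        return True
    except ValueError:
        return False

good = ET.fromstring(
    "<book><title>T</title><author>A</author><author>B</author>"
    "<price>9.99</price></book>")
wrong_order = ET.fromstring(
    "<book><author>A</author><title>T</title><price>9.99</price></book>")
bad_type = ET.fromstring(
    "<book><title>T</title><author>A</author><price>cheap</price></book>")

for doc in (good, wrong_order, bad_type):
    print(check_structure(doc), check_price_type(doc))
```

The third document passes the structural check and fails only the datatype check, which is the gap between DTDs and the schema languages described above.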

XML Querying: XPath and XQuery

It is sometimes necessary to extract subsets of the data stored within an XML document. A number of languages have been created for querying XML documents including Lorel, Quilt, UnQL, XDuce, XML-QL, XPath, XQL, XQuery and YaTL. Since XPath is already a W3C recommendation while XQuery is on its way to becoming one, the focus of this section will be on both these languages. Both languages can be used to retrieve and manipulate data from an XML document.

  1. XML Path Language (XPath): XPath is a language for addressing parts of an XML document; its syntax resembles the hierarchical paths used to address parts of a filesystem or URL. XPath also supports the use of functions for interacting with the data selected from the document. It provides functions for accessing information about document nodes as well as for manipulating strings, numbers and booleans. XPath is extensible with regard to functions, which allows developers to add functions that manipulate the data retrieved by an XPath query to the library of functions available by default. XPath uses a compact, non-XML syntax in order to facilitate its use within URIs and XML attribute values (this is important for other W3C recommendations like XML Schema and XSLT that use XPath within attributes).

    XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath is designed to operate on a single XML document which it views as a tree of nodes and the values returned by an XPath query are considered conceptually to be nodes. The types of nodes that exist in the XPath data model of a document are text nodes, element nodes, attribute nodes, root nodes, namespace nodes, processing instruction nodes, and comment nodes.

    Sample XPath Queries Against Sample XML Fragment

  2. XML Query Language (XQuery): XQuery is an attempt to provide a query language with the same breadth of functionality and underlying formalism that SQL provides for relational databases. XQuery is a functional language in which each query is an expression. XQuery expressions fall into seven broad types: path expressions, element constructors, FLWR expressions, expressions involving operators and functions, conditional expressions, quantified expressions, and expressions that test or modify datatypes. The syntax and semantics of the different kinds of XQuery expressions vary significantly, which is a testament to the numerous influences on the design of XQuery.

    XQuery has a sophisticated type system based on XML Schema datatypes and, unlike XPath, supports the manipulation of document nodes. The data model of XQuery is also designed to operate not only on a single XML document but also on a well-formed fragment of a document, a sequence of documents, or a sequence of document fragments.

    The W3C is also working on an alternate version of XQuery, called XQueryX, that has the same semantics but uses an XML-based syntax.

    Sample XQuery Queries and Expressions Taken From W3C Working Draft
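The two query styles described in this section can be approximated in Python. The sketch below assumes a small hypothetical bibliography document. The path expressions use the limited XPath subset that the standard library's ElementTree understands, and the `cheap_books` function only imitates the for/let/where/return shape of a FLWR expression with ordinary Python; it is not XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, made up for this example.
BIB = ET.fromstring("""
<bib>
  <book year="1994"><title>TCP/IP Illustrated</title><price>65.95</price></book>
  <book year="2000"><title>Data on the Web</title><price>39.95</price></book>
  <book year="1999"><title>Economics of Technology</title><price>129.95</price></book>
</bib>
""")

# Path expressions: child steps, descendant steps, attribute predicates.
titles = [t.text for t in BIB.findall("book/title")]
all_prices = [p.text for p in BIB.findall(".//price")]
from_2000 = [b.findtext("title") for b in BIB.findall("book[@year='2000']")]

def cheap_books(bib: ET.Element, limit: float) -> ET.Element:
    """FLWR-shaped query: build a new element from the books under a price limit."""
    results = ET.Element("results")
    for book in bib.findall("book"):                  # FOR each book
        price = float(book.findtext("price"))         # LET price be its price
        if price < limit:                             # WHERE price < limit
            item = ET.SubElement(results, "cheap")    # RETURN a constructed element
            item.text = book.findtext("title")
    return results

print(titles)
print(from_2000)
print(ET.tostring(cheap_books(BIB, 100.0), encoding="unicode"))
```

The path expressions return existing nodes, while the FLWR-shaped function constructs new ones, which mirrors the difference in scope between XPath and XQuery noted above.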

XML and Databases

As was mentioned in the introduction, there is a dichotomy in how XML is used in industry. On one hand there is the document-centric model, where XML is typically used as a means of creating semi-structured documents with irregular content that are meant for human consumption. An example of document-centric usage of XML is XHTML, the XML-based successor to HTML.

SAMPLE XHTML DOCUMENT

The other primary usage of XML is in a data-centric model. In a data-centric model, XML is used as a storage or interchange format for data that is structured, appears in a regular order and is most likely to be machine processed instead of read by a human. In a data-centric model, the fact that the data is stored or transferred as XML is typically incidental since it could be stored or transferred in a number of other formats which may or may not be better suited for the task depending on the data and how it is used. An example of a data-centric usage of XML is SOAP. SOAP is an XML based protocol used for exchanging information in a decentralized, distributed environment. A SOAP message consists of three parts: an envelope that defines a framework for describing what is in a message and how to process it, a set of encoding rules for expressing instances of application-defined datatypes, and a convention for representing remote procedure calls and responses.

SAMPLE SOAP MESSAGE TAKEN FROM W3C SOAP RECOMMENDATION
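As a minimal sketch of the envelope part of a SOAP message, the Python below builds a request and reads it back. The envelope namespace is the SOAP 1.1 one; the application namespace `urn:example:quotes`, the `GetLastTradePrice` call and its `symbol` parameter are assumptions chosen for illustration, and no encoding rules or transport are shown.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:quotes"                           # hypothetical application namespace

def build_request(symbol: str) -> bytes:
    """Wrap a remote procedure call in an Envelope/Body pair."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetLastTradePrice")
    ET.SubElement(call, f"{{{APP_NS}}}symbol").text = symbol
    return ET.tostring(envelope, encoding="utf-8")

def read_symbol(message: bytes) -> str:
    """Pull the call's parameter back out of the message body."""
    root = ET.fromstring(message)
    ns = {"env": SOAP_ENV, "q": APP_NS}
    return root.findtext("env:Body/q:GetLastTradePrice/q:symbol", namespaces=ns)

request = build_request("DIS")
print(request.decode("utf-8"))
print(read_symbol(request))
```

The receiver here treats the message purely as data to be machine processed, which is what makes SOAP a data-centric use of XML.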

In both models where XML is used, it is sometimes necessary to store the XML in some sort of repository or database that allows for more sophisticated storage and retrieval of the data especially if the XML is to be accessed by multiple users. Below is a description of storage options based on what model of XML usage is required.

  1. Data-centric model: In a data-centric model where data is stored in a relational database or similar repository, one may want to extract data from a database as XML, store XML in a database, or both. For situations where one only needs to extract XML from the database, one may use a middleware application or component that retrieves data from the database and returns it as XML. Middleware components that transform relational data to XML and back vary widely in the functionality they provide and how they provide it. For instance, Microsoft's ADO.NET provides XML integration to such a degree that results from queries on XML documents or SQL databases can be accessed identically via the same API. Some, like Merant's jxTransformer, require the user to specify how the results of a SQL query should be converted to XML via a custom query, while others, like IBM's Database DOM, require the user to create a template file that contains the SQL to XML mappings for the query to be performed. Another approach is the one taken by DB2XML, where a default mapping of SQL results to XML data exists that cannot be altered by the user. Middleware components also vary in the sophistication of their user interfaces, which range from practically non-existent (interaction is done programmatically via APIs) to sophisticated graphical user interfaces.

    The alternative to using middleware components to retrieve or store XML in a database is to use an XML-enabled database that understands how to convert relational data to XML and back. Currently, the Big 3 relational database products all support retrieving and storing XML in one form or another. IBM's DB2 uses the DB2 XML Extender. The extender gives one the option to store an entire XML document and its DTD as a user-defined column [of type XMLCLOB, XMLVARCHAR or XMLFile] or to shred the document into multiple tables and columns. XML documents can then be queried with syntax that is compliant with the W3C XPath recommendation. Updating XML data is also possible using stored procedures.

    SAMPLE DB2 XML EXTENDER TABLE AND QUERY

    Oracle has completely integrated XML into its Oracle 9i database as well as the rest of its family of products. XML documents can be stored as whole documents in user-defined columns [of type XMLType or CLOB/BLOB], from which they can be extracted using XMLType functions such as Extract(), or they can be decomposed and stored in object-relational form, from which they can be reconstituted using the XML SQL Utility (XSU) or SQL functions and packages. For searching XML, Oracle provides Oracle Text, which can be used to index and search XML stored in VARCHAR2 or BLOB variables within a table via the CONTAINS and WITHIN operators used in conjunction with SQL SELECT queries. XMLType columns can be queried by selecting them through a programming interface (e.g. SQL, PL/SQL, C, or Java), by querying them directly using extract() and/or existsNode(), or by using Oracle Text operators to query the XML content. The extract() and existsNode() functions use XPath expressions for querying XML data. Oracle 9i also allows one to create relational views on XML documents stored in XMLType columns, which can then be queried using SQL. The columns in the table are mapped to XPath expressions that query the document in the XMLType column.

    SAMPLE ORACLE 9i TABLE AND QUERY

    Microsoft's SQL Server 2000 also supports XML operations being performed on relational data. XML data can be retrieved from relational tables using the FOR XML clause. The FOR XML clause has three modes: RAW, AUTO and EXPLICIT. RAW mode sends each row of data in the resultset back as an XML element named "row", with each column being an attribute of the "row" element. AUTO mode returns query results in a nested XML tree where each element returned is named after the table it was extracted from and each column is an attribute of the returned elements. The hierarchy is determined based on the order of the tables identified by the columns of the SELECT statement. With EXPLICIT mode the hierarchy of the XML returned is completely controlled by the query, which can be rather complex. SQL Server also provides the OPENXML clause, which provides a relational view on XML data. OPENXML allows XML documents placed in memory to be used as parameters to SQL statements or stored procedures. Thus OPENXML is used to query data from XML, join XML data with existing relational tables, and insert XML data into the database by "shredding" it into tables. W3C XML Schema can also be used to provide mappings between XML and relational structures. These mappings are called XML views and allow relational data in tables to be viewed as XML which can be queried using XPath.

    As can be seen from the above descriptions, there is currently no standard way to access XML from relational databases. This may change with the SQL/XML standard currently being developed by the SQLX group.



  2. Document-centric model: Content management systems are typically the tool of choice when considering storing, updating and retrieving various XML documents in a shared repository. A content management system typically consists of a repository that stores a variety of XML documents, an editor and an engine that provides one or more of the following features:

    • version, revision and access control
    • ability to reuse documents in different formats
    • collaboration
    • web publishing facilities
    • support for a variety of text editors (e.g. Microsoft Word, Adobe FrameMaker, etc.)
    • indexing and search capabilities

    Content management systems have been primarily of benefit for workflow management in corporate environments where information sharing is vital and as a way to manage the creation of web content in a modular fashion allowing web developers and content creators to perform their tasks with less interdependence than exists in a traditional web authoring environment. Examples of XML based content management systems are SyCOMAX, Content@, Frontier, Entrepid, XDisect, and SiberSafe.

  3. Hybrid model: In situations where both document-centric and data-centric models of XML usage will occur, the best data storage choice is usually a native XML database. What actually constitutes a native XML database has been a topic of some debate in various fora, compounded by the blurred lines that many see between XML-enabled databases, XML query engines, XML servers and native XML databases. The most coherent definition so far is one reached by consensus amongst members of the XML:DB mailing list, which defines a native XML database as a database that has an XML document as its fundamental unit of (logical) storage, defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Described below are two examples of native XML databases, intended to show the breadth of functionality and variety that can be expected in the native XML database arena.

    Tamino is a native XML database management system developed by Software AG. Tamino is a relatively mature application, currently at version 2.3.1, that provides the means to store & retrieve XML documents, store & retrieve relational data, as well as interface with external applications and data sources. Tamino has a web based administration interface similar to that used by the major relational database management systems and includes GUI tools for interacting with the database and editing schemas.

    Schemas in Tamino are DTD-based and are used primarily as a way to describe how the XML data should be indexed. When storing XML documents in Tamino, one can specify a pre-existing DTD, which is then converted to a Tamino schema; store a well-formed XML document without a schema, in which case default indexing ensues; or create a schema from scratch for the XML document being stored. A secondary usage of schemas is for specifying the datatypes in XML documents. The main advantage of using datatypes in Tamino is to enable type based operations within queries (e.g. numeric comparisons). The query language used by Tamino is based on XPath and is called X-Query (not to be confused with the W3C XQuery).

    Tamino also ships with a relational database management system which is called the SQL Engine. Schemas can be used to create mappings from SQL to XML which then allow for the storage or retrieval of XML data from relational database sources either internal (i.e. the SQL Engine) or external. Schemas can also be used to represent joins across different document types. Joins allow for queries to be performed on XML documents with differing schemas. Future versions of Tamino are supposed to eliminate the need to specify joins up front in a schema and instead should allow for such joins to be done dynamically from a query.

    Tamino provides APIs for accessing the XML store in both Java and Microsoft's JScript. C programmers can interact with the SQL engine using the SQL precompiler that ships with Tamino. Interfaces that allow ODBC, OLE DB and JDBC clients to communicate with the Tamino SQL Engine are also available. Finally, Tamino ships with the X-Tensions framework which allows developers to extend the functionality of Tamino by using C++ COM objects or Java objects. Tamino operations have ACID properties (Atomicity, Consistency, Isolation and Durability) via the support of transactions in its programming interfaces.

    dbXML is an Open Source native XML database management system which is sponsored by the dbXML Group. dbXML is designed for managing collections of XML documents which are arranged hierarchically within the system in a manner similar to that of a file system. Querying the XML documents within the system is done using XPath and the documents can be indexed to improve query performance.

    dbXML is written in Java but supports access from other languages by exposing a CORBA API thus allowing interaction with any language that supports a CORBA binding. It also ships with a Java implementation of the XML:DB XML Database API which is designed to be a vendor neutral API for XML databases. A number of command line tools for managing documents and collections are also provided.

    dbXML is mostly still in development (version at time of writing was 1.0 beta 2) and does not currently support transactions or the use of schemas but these features are currently being developed for future versions.
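Returning to the data-centric options described at the start of this section, the round trip between relational rows and XML can be sketched with Python's built-in sqlite3 module standing in for the RDBMS. The sketch assumes a fixed default mapping (one `row` element per row, one attribute per column, similar in spirit to a RAW-style result) and a simple form of shredding. The `employee` table, the `resultset` wrapper element and both function names are assumptions made for this example, not any product's API.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(conn: sqlite3.Connection, query: str) -> ET.Element:
    """Default mapping: one <row> element per row, one attribute per column."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    root = ET.Element("resultset")
    for row in cursor:
        attributes = {col: str(val) for col, val in zip(columns, row)}
        ET.SubElement(root, "row", attributes)
    return root

def shred_xml(conn: sqlite3.Connection, root: ET.Element) -> None:
    """Shredding: copy each <row> element's attributes into table columns."""
    conn.executemany(
        "INSERT INTO employee (id, name) VALUES (?, ?)",
        [(int(r.get("id")), r.get("name")) for r in root.iter("row")],
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")

# Store XML in the database by shredding it into the table...
incoming = ET.fromstring(
    '<resultset><row id="1" name="Ada"/><row id="2" name="Grace"/></resultset>')
shred_xml(conn, incoming)

# ...then extract relational data back out as XML.
outgoing = rows_to_xml(conn, "SELECT id, name FROM employee ORDER BY id")
print(ET.tostring(outgoing, encoding="unicode"))
```

Once shredded, the data can be searched and joined with plain SQL, at the cost of losing anything in the original document that the mapping does not cover.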


Bibliography

  1. Chamberlin, Don et al. XQuery 1.0: An XML Query Language (Working Draft). 7 June 2001. <http://www.w3.org/TR/2001/WD-xquery-20010607>

  2. Clark, James and Steve DeRose. XPath: XML Path Language (Version 1.0). 16 November 1999. <http://www.w3.org/TR/1999/REC-xpath-19991116>

  3. Bourret, Ronald. XML and Databases. June 2001. <http://www.rpbourret.com/xml/XMLAndDatabases.htm>

  4. Bourret, Ronald. XML Database Products. 22 October 2001. <http://www.rpbourret.com/xml/XMLAndDatabases.htm>

  5. Turau, Volker. Making Legacy Data Accessible For XML Applications. 1999. <http://www.informatik.fh-wiesbaden.de/~turau/ps/legacy.pdf>

  6. Cheng, Josephine and Xu, Jane. IBM DB2 Extender. From ICDE '00 Conference, San Diego. February 2000. <http://www-4.ibm.com/software/data/db2/extenders/xmlext/xmlextbroch.pdf>

Acknowledgements

The following people helped in reviewing and proofreading this paper: Dr. Sham Navathe, Kimbro Staken, Dmitri Alperovitch, Sam Collins, and Dennis Lu.





What do you know about XML and Databases? | 34 comments (26 topical, 8 editorial, 0 hidden)
We use Oracle & C++ (3.50 / 2) (#3)
by wiredog on Sun Oct 28, 2001 at 08:54:44 PM EST

Where I work, we use Oracle & C++. The data is stored as regular Oracle data. We use XML Dom to parse the document, with the data stored in the program in vectors, maps, and other C++ structures. Once the document is parsed, we use OCI to load it into the Oracle DB. This allows us to do searches, joins, etc, on the various documents we've stored.

If there's a choice between performance and ease of use, Linux will go for performance every time. -- Jerry Pournelle
what about ldap? (2.50 / 2) (#9)
by alejo on Mon Oct 29, 2001 at 01:15:10 AM EST

Aren't ldap services more similar to what xml is about?

Has anybody here used those?
I'd really like some feedback on that.


Huh? (4.50 / 2) (#10)
by vr on Mon Oct 29, 2001 at 08:42:46 AM EST

I'm using both LDAP and XML.

Why do you think they have anything in common? XML is a format for encoding data, and LDAP is a protocol for accessing a directory.



[ Parent ]
blobs suck (none / 0) (#33)
by alejo on Wed Oct 31, 2001 at 06:45:34 PM EST

most sqls do xml storage in blobs. and custom indexing in other tables.

don't know a lot about ldap, but the way it is explained the storage is more similar to the way of SQLs (static tables mainly).

maybe it is a stupid assumption, but that is what I wanted to ask, instead of wasting a month and code reviewing to get it.

thanks (in advance) for your meaningful responses.

[ Parent ]
Triple stores (4.50 / 4) (#11)
by macpeep on Mon Oct 29, 2001 at 08:59:48 AM EST

At my previous job, I implemented an experimental app that was inspired by RDF (Resource Description Framework) and triple stores.

In a triple store, you have objects that are defined by a set of properties. The word "triple" comes from the fact that you have triples of objects, properties and property values. For example, you could have a person, John Q, who has an age 37, a phone number 1234 and an employer Foo Ltd. Foo Ltd. in turn has a phone number 5678 and any number of other properties. This forms the following triples: John Q --age--> 37, John Q --phone number--> 1234, John Q --employer--> Foo Ltd, Foo Ltd --phone number--> 5678.

When you look at these, you can see that Foo Ltd. is both the employer of John Q (a property value) but also an object in itself that is described by a set of properties. In RDF, the triples form a graph that describes your data. The graph is typically serialized as XML.

At first, it would seem that this lends itself very well for relational databases. A row in a table would be the object to be described and columns are the properties. The intersection is the value. However, the problem - and strength of RDF - is that you can have any number of properties for an object. Basically, you could have any number of columns and sometimes, the property value is not just a value - it can be a database row in itself or even a set of rows.. or a set of values.

The app I wrote mapped arbitrary RDF files to relational databases and back as well as provided an API to perform queries on the data. The result of the queries were RDF graphs in themselves.

While this was quite cool, it turned out to be quite difficult to turn the query result graphs into meaningful stuff in a user interface. Also, queries on the RDF graphs could turn out to be extremely complex SQL queries... Most of these problems were eventually solved but the code wasn't used directly for any real world app, except heavily modified as a metadata database for a web publishing system.
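For readers who want to see the shape of the idea, here is a minimal sketch of a triple table in a relational database, using Python's sqlite3 module. It is not the app described in this comment; the table layout and function names are assumptions, and the data is the John Q example from above. Following a chain of properties takes one self-join per step, which hints at how graph queries turn into complex SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per triple instead of one column per property, so an object
# can carry any number of properties.
conn.execute("CREATE TABLE triple (subject TEXT, property TEXT, value TEXT)")
conn.executemany("INSERT INTO triple VALUES (?, ?, ?)", [
    ("John Q", "age", "37"),
    ("John Q", "phone number", "1234"),
    ("John Q", "employer", "Foo Ltd"),
    ("Foo Ltd", "phone number", "5678"),
])

def properties(subject: str) -> dict:
    """All properties of one object, as a property -> value mapping."""
    rows = conn.execute(
        "SELECT property, value FROM triple WHERE subject = ?", (subject,))
    return dict(rows)

def employer_phone(person: str):
    """Follow two edges: person --employer--> X --phone number--> value."""
    row = conn.execute(
        """SELECT t2.value
           FROM triple AS t1
           JOIN triple AS t2 ON t2.subject = t1.value
           WHERE t1.subject = ?
             AND t1.property = 'employer'
             AND t2.property = 'phone number'""",
        (person,)).fetchone()
    return row[0] if row else None

print(properties("John Q"))
print(employer_phone("John Q"))
```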

Pet theory (4.00 / 1) (#25)
by Herring on Tue Oct 30, 2001 at 05:39:28 AM EST

The nature of XML is "navigational". - ie you navigate from parent to child and back. Sometimes you need to find stuff, but not that often. RDBMSs are superb for finding stuff (IANADBA), but for navigation ... well, doing shedloads of individual queries to go from one thing to another... it's gotta be crap.

Point 2 is polymorphism. There are some solutions to this for RDBMSs, but, deep down, it all comes down to having a "type" field, then either storing the type specific data in a BLOB and decoding it in the code (shit but fast), or having specific tables for different types and going and getting the data from that with another query (less shit, but slow).

My preferred solution - ODBMS. OK, so the querying tends to be rubbish on ODBMSs and you can't run a nice report off using Crystal (or whatever) but the navigation is damn fast and the polymorphism is no problem. Oh, and another great thing about using an ODBMs is that a "power user" (aka "meddling twat") can't come along and screw it up with a bit of VB.

Of course, XML on its own wouldn't be using nearly all the power of a decent ODBMS (proper stuff like Versant handles all the endian problems etc.) but it should work pretty well.


Say lol what again motherfucker, say lol what again, I dare you, no I double dare you
[ Parent ]
XML and LISP (4.66 / 3) (#12)
by DGolden on Mon Oct 29, 2001 at 09:32:06 AM EST

One thing annoys me about XML and all the hype surrounding it:

They've just reinvented S-Expressions, but with really annoying syntax. People moan about parens everywhere in Lisp, but XML tags are far worse, IMO!

The best way to process XML data that I've seen is to convert it into Scheme, and then use all the usual LISPy tricks to munge it anyway you want.

See ssax.sourceforge.net for the coolest way to work with XML.
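The XML-as-S-expression correspondence can be sketched outside Scheme too. The Python below is not SSAX and does not follow its output format; it simply turns an element into a nested list of the form [tag, attributes, children...], with the function name `to_sexp` chosen for this example. Whitespace-only text is dropped and other text is stripped, which is a simplification.

```python
import xml.etree.ElementTree as ET

def to_sexp(elem: ET.Element) -> list:
    """Convert an element into a nested list: [tag, attributes, *children]."""
    node = [elem.tag, dict(elem.attrib)]
    if elem.text and elem.text.strip():
        node.append(elem.text.strip())
    for child in elem:
        node.append(to_sexp(child))
        # Text that follows a child element belongs to the parent's content.
        if child.tail and child.tail.strip():
            node.append(child.tail.strip())
    return node

doc = ET.fromstring('<p class="x">Hello <b>world</b>!</p>')
print(to_sexp(doc))
```

Once the document is plain nested lists, the usual list-processing tools (recursion, mapping, filtering) apply directly.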
Don't eat yellow snow

That's right! (5.00 / 1) (#17)
by epepke on Mon Oct 29, 2001 at 02:57:04 PM EST

You're bloody well right! You've got a bloody right to say!

About two years ago, when I was writing an XML parser in C, the first thing I did was to write cons(), car(), cdr(), and gc(). It was easy and made sense.

I have to laugh at this. IBM develops this thing called SGML. Then, somebody uses the ideas to make HTML, which is more specific and has looser syntax. Then, the idea gets around that there should be something less specific with tighter syntax, and wham, it's a hot new marketing tool

I've had people tell me for the past two years how everything is going to be built in XML. I've always said, "It's a tree. It's just a tree. It's fine that there's an agreed-upon syntax, but it's still just a tree. Everything depends on how you use it, and we've known for a long time what trees are good for and what they're not good for. It's not a hot new business data concept. It's a tree."

DTD's are pretty OK, but still, it's just a type of limited generative grammar.


The truth may be out there, but lies are inside your head.--Terry Pratchett


[ Parent ]
Clarification: HTML *is* SGML (3.00 / 1) (#19)
by tmoertel on Mon Oct 29, 2001 at 03:28:47 PM EST

"I have to laugh at this. IBM develops this thing called SGML. Then, somebody uses the ideas to make HTML, which is more specific and has looser syntax..."
Actually, HTML is an SGML application and is in every way legitimate SGML. From the HTML 4 recommendation:
"HTML 4 is an SGML application conforming to International Standard ISO 8879 -- Standard Generalized Markup Language [ISO8879]."
In other words, HTML is not a distorted version of SGML ideas but the real thing. SGML, like XML, lets you define syntax via DTDs and then associate semantics with them to create "applications." HTML is merely one such application. For more information, see the W3C's HTML recommendation, which defines the application in detail.

--
My blog | LectroTest

[ Disagree? Reply. ]


[ Parent ]
No, sorry (4.00 / 1) (#30)
by epepke on Tue Oct 30, 2001 at 01:20:50 PM EST

I'm talking primarily about the days before W3C, when HTML was a de facto entity, in effect only defined by conventions and implementation (much as the syntax of FORTRAN 66, to this day, is defined by a program, not a document).

This caused problems, which is why W3C got together and put together the standard. The standard is fine and dandy, and I conform to it when I can (although it still makes more sense to use some of the deprecated tags than to use CSS style sheets), but it isn't HTML in the real world, and it certainly doesn't retroactively change what the original designers were actually doing. Of course, we have always been at war with Eastasia.

Even now, it is only the XHTML standard which effectively brings HTML into total compliance with SGML.

SGML has always been defined by a document and never had that Wild West stage. I can remember 20 years ago when Gooch and Newcome were trying to encode all music written before 1900 into SGML. One still had to deal with thick books of standards in those days. I can remember thinking that SGML was one of those great things that IBM does every once in a while but totally fails to push in the real world.

There were discussions back then about how the sloppiness of HTML may have contributed to the explosion of the web, back when amateurs had to write sloppy code by hand and didn't have Front Page to write sloppy code for them.


The truth may be out there, but lies are inside your head.--Terry Pratchett


[ Parent ]
Reinventing the wheel (3.00 / 1) (#32)
by jolly st nick on Wed Oct 31, 2001 at 09:43:50 AM EST

I think by and large XML is a good thing. However, I often wonder about people on the XML bandwagon reinventing the wheel. After all, we are talking languages and parsing here -- techniques exist that are over forty years old to handle these tasks. With lex and yacc, I could neatly handle many tasks people are doing by struggling with the XML alphabet soup.

XML is a great thing for creating marked up documents that have to be processed in a wide variety of ways by many different parties. As a side effect, it produces somewhat human readable files. But it is not a very good general purpose tool for producing human readable languages.



[ Parent ]

XML and IMS (none / 0) (#34)
by KWillets on Fri Nov 02, 2001 at 04:03:02 PM EST

IMS is "Information Management System", a hierarchical database created by IBM a long, long time ago. When relational databases came out (in 1983 or earlier), there was a debate over whether relational would win over "network" databases, which stored things in hierarchies and links between objects, like C++ gone bad (or, er, normally). I have no idea how IMS syntax works, but apparently querying it was an exercise in confusion, e.g. trying to remember "was that employees/california/operations/sandiego, or /california/sandiego/employees/operations, or (22 more variants)?".
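The same "which order was that path again?" problem is easy to reproduce with XML. The sketch below is not IMS syntax; it assumes Python's xml.etree.ElementTree and its limited XPath support, and the company document and element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: state, then city, then department.
doc = """
<company>
  <california>
    <sandiego>
      <operations>
        <employee name="Ann"/>
        <employee name="Bob"/>
      </operations>
    </sandiego>
  </california>
</company>
"""
root = ET.fromstring(doc)

# The query only works if you remember the exact nesting order.
right_path = root.findall("california/sandiego/operations/employee")
wrong_path = root.findall("california/operations/sandiego/employee")

# A descendant search sidesteps the ordering, at the cost of scanning the tree.
anywhere = root.findall(".//employee")

print(len(right_path), len(wrong_path), len(anywhere))  # 2 0 2
```

A wrong guess at the nesting order does not raise an error. It just quietly returns nothing, which is arguably the confusing part.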

Now it looks like XML has blessed us with half a dozen new query languages. I'll raise a malt liquor to that.





[ Parent ]
Good article (2.00 / 2) (#13)
by puelly on Mon Oct 29, 2001 at 01:38:41 PM EST

Thanks for a great article. I have been looking for one that adequately explains XML. Kudos.

How does this hype differ from all other hype? (1.66 / 3) (#14)
by jmeltzer on Mon Oct 29, 2001 at 01:42:00 PM EST

Seriously. What's so great about XML? Why should anyone take this as anything more than yet another e-business, e-commerce, distributed computing, hype scam? Does XML actually help a company to make money?

Make money? (5.00 / 3) (#16)
by Merc on Mon Oct 29, 2001 at 02:34:40 PM EST

Does HTML help a company make money? Does "comma delimited text"?

XML is just a way of representing data. It's nothing magic, nothing too special, and it's very similar to things that have been done before.

The only thing that makes XML special is that there are very few truly standard data formats, and they are mostly limited to very specialized applications. On the other hand, a lot of different groups from a lot of different areas have decided XML is a good way of representing data. XML also strikes a critical balance between human readability, strict data type integrity, and computer parseability. That's what makes it special.

A simple XML document is easily readable by a person and is easy to understand and edit. A tab-delimited file, on the other hand, is only readable as long as the columns align; after that it's just a big mess.

XML is also easy for a program to both write and parse. It can hold any type of data, and by using a specific schema, the data contained can be validated.

Combine that with XSLT and you have a means of transforming data from one domain into data from another domain. This means you could do something like automatically transform data from a database into a spreadsheet or even into an XML-based SVG graphic.
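As a rough illustration of that kind of transformation, here is a small sketch that turns an XML document into spreadsheet-friendly CSV. It is a plain Python stand-in for what an XSLT stylesheet would do, not XSLT itself, and it assumes only the standard library; the people document, its field names, and the xml_to_csv helper are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical source data.
doc = """
<people>
  <person><name>Ann</name><city>San Diego</city></person>
  <person><name>Bob</name><city>Boston</city></person>
</people>
"""


def xml_to_csv(xml_text, fields):
    """Write one CSV row per <person>, one column per requested child tag."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)
    for person in root.findall("person"):
        writer.writerow([person.findtext(f, default="") for f in fields])
    return out.getvalue()


csv_text = xml_to_csv(doc, ["name", "city"])
print(csv_text)
```

The same parsed tree could just as well be written out as SVG elements or database rows; only the output half of the function would change.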



[ Parent ]
Awesome article!!! (1.00 / 3) (#15)
by j0nkatz on Mon Oct 29, 2001 at 02:16:26 PM EST

EXACTLY like the one at slashdot!!!


w()()p
Yeah! (5.00 / 1) (#20)
by ucblockhead on Mon Oct 29, 2001 at 07:57:10 PM EST

I bet Dare Obasanjo is gonna sue the crap out of this "Carnage4Life" guy.

I mean, you'd think he could write an original article, but no, he's got to plagiarize slashdot.

:-)


-----------------------
This is k5. We're all tools - duxup
[ Parent ]

XML in the industry (3.00 / 1) (#21)
by ucblockhead on Mon Oct 29, 2001 at 11:28:12 PM EST

It is my experience that a lot of the XML that is in the industry is not true XML, but stuff using an XML-like syntax but perhaps not conforming completely to the various standards. It is hard to say whether or not this is a bad thing. It is a nice syntax for a configuration file format, but being rigid there is perhaps overkill.
-----------------------
This is k5. We're all tools - duxup
What do you mean by true XML? (4.00 / 1) (#22)
by Carnage4Life on Mon Oct 29, 2001 at 11:45:17 PM EST

It is my experience that a lot of the XML that is in the industry is not true XML, but stuff using an XML-like syntax but perhaps not conforming completely to the various standards.

I'm confused as to what exactly you are stating. An XML document can be any text that is well formed (the XML version declaration at the start is recommended but, in XML 1.0, optional). Are you claiming there are a lot of people in industry using XML that isn't well formed in some way or that a lot of people aren't using the big name XML-based standards (SOAP, ebXML, XHTML, XSLT, etc)?

[ Parent ]
Bad wording (3.00 / 1) (#27)
by ucblockhead on Tue Oct 30, 2001 at 11:53:27 AM EST

Perhaps "doesn't meet the standard" is bad wording. I mean that a lot of people create systems that use a basic XML syntax (tags, attributes) without messing about with DTDs or anything similar.

(And by "in industry", I'm not really talking about the big players, but various little projects I've been on or am aware of.)
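For what it's worth, the "basic syntax, no DTD" style of XML can still be checked mechanically, because well-formedness does not depend on a DTD. A minimal sketch, assuming Python's xml.etree.ElementTree (a non-validating parser); the is_well_formed helper and the config snippets are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical configuration snippets.
good = '<?xml version="1.0"?><config><server port="8080">example</server></config>'
bad = '<config><server port="8080">example</config>'  # <server> is never closed

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

Whether the tags and attributes mean anything sensible is a separate question, and that is the part DTDs and schemas are for.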
-----------------------
This is k5. We're all tools - duxup
[ Parent ]

That's how I used to use it. (3.00 / 1) (#28)
by Carnage4Life on Tue Oct 30, 2001 at 12:26:08 PM EST

Perhaps "doesn't meet the standard" is bad wording. I mean that a lot of people create systems that use a basic XML syntax (tags, attributes) without messing about with DTDs or anything similar.

Same here. Before this summer I'd used XML in three different projects (consulting gig, internship and school project) but never needed anything more than DOM or a SAX parser.
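For anyone who hasn't used both, the difference between the two styles looks roughly like this. The sketch assumes Python's standard xml.dom.minidom and xml.sax modules rather than any particular parser from the projects mentioned; the document and the ItemHandler class are hypothetical.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical sample document.
doc = "<order><item>widget</item><item>gadget</item></order>"

# DOM: load the whole document into memory, then walk the tree.
dom = xml.dom.minidom.parseString(doc)
dom_items = [node.firstChild.data for node in dom.getElementsByTagName("item")]


# SAX: the parser streams events and the handler keeps only what it needs.
class ItemHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "item":
            self._in_item = True
            self._buffer = []

    def characters(self, content):
        if self._in_item:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "item":
            self.items.append("".join(self._buffer))
            self._in_item = False


handler = ItemHandler()
xml.sax.parseString(doc.encode("utf-8"), handler)
sax_items = handler.items

print(dom_items)  # ['widget', 'gadget']
print(sax_items)  # ['widget', 'gadget']
```

DOM is the more convenient of the two for small files; SAX takes more code but never holds the whole document in memory.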

Of course, all this changed when I joined the .NET XML team and ended up having to learn XPath, schemas, DTDs, XDR, and XSLT.

[ Parent ]
nice article... (3.00 / 1) (#23)
by pb on Tue Oct 30, 2001 at 12:47:14 AM EST

but, of course, I've got a few questions that peripherally relate to the article...

1) Has anyone settled on a binary format for XML? I'm sure people are working on it. Heck, technically you could just use a filesystem for it, (or for that matter, tar+gzip, or a compressed directory listing...) or parse it and stuff it in your favorite tree-like structure...

Yeah, yeah, I suppose you'd just stuff it into a database, or write a Java object to parse and unparse it, and then serialize *that*... and maybe stuff *THAT* into a database... But I'm not that sick. I'd gzip it and uuencode it instead. :)

2) Will we eventually see the horrendous goop that is the Windows Registry replaced by XML? Or would that be too much to hope for?? One would think that if Microsoft took an interest in something like XML, they'd eventually standardize around it for text databases, exporting stuff, etc., etc. But that would be so useful that it wouldn't be like them to do it...

Binary format? (4.00 / 1) (#24)
by Carnage4Life on Tue Oct 30, 2001 at 02:55:04 AM EST

1) Has anyone settled on a binary format for XML? I'm sure people are working on it. Heck, technically you could just use a filesystem for it, (or for that matter, tar+gzip, or a compressed directory listing...) or parse it and stuff it in your favorite tree-like structure...

I'm unclear as to what you mean by a binary format. If you mean a way to encode XML that reduces the redundant nature of the tags, then simply gzipping is a good general solution (especially if you are sending it to a browser, since HTTP has gzip support). On the other hand, if you're looking for a more application-specific binary encoding scheme, I saw a post on Slashdot where someone suggested replacing each tag with a byte value. Doing this can significantly shrink the size of an XML document, which is especially useful if it is being stored in a DB or sent across the network. I wouldn't be surprised if some of the native XML or XML-enabled databases do something like this.
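Both ideas are easy to try out. The sketch below assumes Python's standard gzip module; the tag-to-byte table is a toy scheme invented for illustration (not any database's actual format), and it only works if the document text never contains the control characters used as codes.

```python
import gzip

# Hypothetical, highly repetitive document.
record = "<employee><name>Ann</name><city>San Diego</city></employee>"
doc = "<employees>" + record * 200 + "</employees>"
raw = doc.encode("utf-8")

# Option 1: general-purpose compression.
packed = gzip.compress(raw)

# Option 2: a toy tag table. Code n stands for <tag>, n + 16 for </tag>.
TAG_CODES = {"employees": 1, "employee": 2, "name": 3, "city": 4}


def shrink_tags(text):
    for tag, code in TAG_CODES.items():
        text = text.replace(f"<{tag}>", chr(code))
        text = text.replace(f"</{tag}>", chr(code + 16))
    return text


def restore_tags(text):
    for tag, code in TAG_CODES.items():
        text = text.replace(chr(code), f"<{tag}>")
        text = text.replace(chr(code + 16), f"</{tag}>")
    return text


shrunk = shrink_tags(doc)

print("raw bytes:   ", len(raw))
print("gzip bytes:  ", len(packed))
print("tagged bytes:", len(shrunk.encode("utf-8")))
```

The tag table needs both sides to agree on the codes ahead of time, whereas gzip needs no shared knowledge at all.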

Of course, since your suggestions seem nothing like mine we may be talking about completely different things.

Will we eventually see the horrendous goop that is the Windows Registry replaced by XML? Or would that be too much to hope for??

I doubt it, unless they can come up with a way to do it that's backwards compatible with all the current APIs. Since I don't do Windows programming I have no idea how easy or difficult such a task would be.

[ Parent ]
ASN.1 (4.00 / 1) (#26)
by thecabinet on Tue Oct 30, 2001 at 10:57:05 AM EST

... replacing each tag with a byte value.

That already exists and is called ASN.1, for Abstract Syntax Notation One. It's an incredibly powerful system allowing arbitrary length data encapsulation. And I mean seriously arbitrary here. Nothing in ASN.1 is fixed. Even the fields specifying the length of the following data are variable length, through their usage of byte-stuffing (sort of).
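The variable-length length field is probably the easiest piece to show. Strictly speaking ASN.1 is the notation, and the bytes on the wire come from encoding rules such as BER and DER. Here is a minimal sketch of their definite-length form in Python, with hypothetical encode_length and decode_length helpers; the indefinite-length form and the tag and value fields are left out.

```python
def encode_length(n):
    """Encode a content length using the BER/DER definite form."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 0x80:
        # Short form: a single byte holds the length itself.
        return bytes([n])
    # Long form: first byte is 0x80 | (number of length bytes that follow).
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def decode_length(data):
    """Return (length, number of bytes consumed) from the start of data."""
    first = data[0]
    if first < 0x80:
        return first, 1
    count = first & 0x7F
    return int.from_bytes(data[1:1 + count], "big"), 1 + count


print(encode_length(5).hex())    # 05
print(encode_length(300).hex())  # 82012c
print(decode_length(bytes.fromhex("82012c")))  # (300, 3)
```

So small values cost one byte and larger ones spend an extra byte to say how many length bytes follow.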

Unfortunately, it seems so terrible when you're first introduced to it that you won't be able to sleep for a week.

Check out this page for some ASN.1 whitepapers.

[ Parent ]

Registry (3.00 / 1) (#31)
by zephiros on Tue Oct 30, 2001 at 09:00:00 PM EST

Will we eventually see the horrendous goop that is the Windows Registry replaced by XML?

To some degree, this is showing up in .Net. There's a push to put application settings in a programName.config file, which is XML-based. App options can then be fished out at runtime via the System.Configuration.ConfigurationSettings.AppSettings collection. Depending on how you squint, the value of this varies. I come from a *nix background, so I'm generally of the opinion that standardized text-based configuration files are a winning choice. That said, the flat structure of the *.config files means they're essentially *.ini files with racing stripes.
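To show what "flat" means here, this is roughly what the appSettings section of such a file looks like, read with Python's xml.etree.ElementTree instead of the .Net API so the sketch stays self-contained. The keys, values, and the load_app_settings helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings in the appSettings key/value layout.
config_text = """
<configuration>
  <appSettings>
    <add key="connectionTimeout" value="30" />
    <add key="logFile" value="app.log" />
  </appSettings>
</configuration>
"""


def load_app_settings(text):
    """Collect every <add key="..." value="..."/> pair into a flat dict."""
    root = ET.fromstring(text)
    return {add.get("key"): add.get("value")
            for add in root.findall("appSettings/add")}


settings = load_app_settings(config_text)
print(settings)  # {'connectionTimeout': '30', 'logFile': 'app.log'}
```

The result is a single string-to-string map with no nesting, which is about what one section of an *.ini file gives you.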
 
Kuro5hin is full of mostly freaks and hostile lunatics - KTB
[ Parent ]

What do you know about XML and Databases? | 34 comments (26 topical, 8 editorial, 0 hidden)