Kuro5hin.org: technology and culture, from the trenches
XML and verbosity

By marx in Technology
Mon Dec 02, 2002 at 09:07:07 PM EST
Tags: Technology

One of the most common criticisms of XML is that it is too verbose. Implicit in this is the claim that using XML will waste storage and bandwidth, and add bloat and inefficiency. We will show with a few simple arguments and examples that this is not true in practice.


Introduction

Say that you are working on a software system and you need to store your custom data on disk. To do this you need some kind of file format. What file format should you use? Almost every programmer encounters this question at some point.

Some people start designing their own file format. They come up with a structure, data encoding, parser, writer, etc. They discover that they need to make sure it interoperates correctly with different platforms, that it needs to be easily extensible, that it needs to be rigorously specified for other systems to be able to process it. Soon the file format has become one of the largest tasks in the development process.

XML is a standard for data representation and structuring that, in this context, has already solved most of the above problems. XML specifies how to encode and structure data. It guarantees that the resulting file will be extensible and processable on other platforms and systems. Perhaps most importantly, there exist many libraries and tools, both free and non-free, which parse and write XML. The task has been reduced to what it should have been from the beginning: deciding which data should be stored, and what the meaning of that data should be.

This sounds great, but before becoming dependent on standards or tools designed by someone else (especially committees), people need to be assured that it will work essentially as well as if they had done all the work themselves.

One of the most common concerns is that XML is too verbose: it is all ASCII (plain text), where binary would seem to be more efficient, and a document typically ends up with more "tags", or structure notation, than actual data.

These are valid concerns which need to be resolved. Fortunately, it's not very complex to do so.

ASCII vs. binary

The question of the difference in storage space between ASCII and binary representations is not tied to XML alone; it applies equally to all file formats.

This issue can be split into two parts: ASCII representation of data and ASCII representation of structure notation. We will look at structure notation in the next section, and thus focus only on data representation here.

A piece of data in an XML document (or any type of document) is typically a string, an integer or a real number. A string has the same, or a very similar, representation in ASCII and in most binary formats, so only numbers remain to be considered.

Integers

An integer in ASCII representation is a string of character digits, like so: "12345". An integer in binary representation is simply the binary number. Binary integers typically have a limitation of 8, 16, 32 or 64 binary digits. We can assume 32-bit (4 byte) unsigned integers for simplicity.

The problem with ASCII is that we are using a byte (one character) for every digit. So as soon as we have a very large number, such as "123456789012345", we are using many bytes; 15 in this case. However, this is flawed reasoning. The highest representable integer in 32 bits is 2^32 - 1 = 4294967295, which has only 10 decimal digits. So only when we have numbers between 10000 and 4294967295 are we actually wasting any space. The maximally wasteful ASCII integers take up only 2.5 times as much space as their binary counterparts (10 bytes versus 4), while the minimally wasteful take up 0.25 times as much space (1 byte versus 4), i.e. they save space.
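These ratios are easy to verify, for instance with a short Python sketch (the sample values and the `ascii_size` helper are just illustrations):

```python
import struct

def ascii_size(n):
    """Bytes used by the decimal ASCII representation of an integer."""
    return len(str(n))

BINARY_SIZE = struct.calcsize("<I")  # a 32-bit unsigned integer: 4 bytes

for n in (7, 9999, 10000, 4294967295):
    print(n, ascii_size(n), ascii_size(n) / BINARY_SIZE)
# 4294967295 needs 10 ASCII bytes, 2.5 times the binary size;
# 7 needs 1 ASCII byte, a quarter of the binary size.
```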

A similar argument can be applied for other binary integer sizes.

So on average, the ASCII representation is not more wasteful than the binary representation. In extreme cases it is about 2.5 times as wasteful.

Reals

A real number in ASCII representation is simply the digits again, but with extra punctuation, like so: "123.45". A real number in binary representation is almost always "floating point", specifically the IEEE 754 standard (the name comes from using a scientific notation, where the decimal point "floats" as the powers change). The size of a binary floating point number is either 4 or 8 bytes ("float" or "double" respectively), depending on required precision. We will assume floats for simplicity.

We have the same problem here, only magnified. A large real in ASCII, with many decimals, such as "12345.6789012345", uses many bytes; 16 in this case. Again, this is flawed reasoning. There is virtually no limit to how large floats can be, but what is limited is the precision. The above number has 15 decimal digits of precision, while a float can only represent up to about 7 digits[1]. So the ASCII number "12345.6789012345" (16 bytes) is actually equivalent to "12345.678" (9 bytes) when represented as a float. The maximally wasteful ASCII reals thus take up roughly twice as much space as floats, while the minimally wasteful take up 0.25 times as much space.
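This loss of precision can be observed directly in Python with the standard struct module, which packs a number into the 4 bytes of an IEEE 754 single-precision float (a sketch; the sample value is the one from the text):

```python
import struct

original = 12345.6789012345            # 16 ASCII bytes of digits and point
packed = struct.pack("<f", original)   # 4 bytes as a single-precision float
recovered = struct.unpack("<f", packed)[0]

print(len(packed))   # 4
print(recovered)     # agrees with the original only to about 7 significant digits
```

The round trip loses everything beyond the 7th digit, so the extra ASCII digits carried no information that a float could preserve.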

(We have a potentially more wasteful situation when we have trailing zeros, such as "12345000000", where the precision is low but the number of bytes is high. This is solved by using or requiring scientific notation however, i.e. "1.2345e10".)

A similar argument can be applied for doubles.

So again, on average, the ASCII representation is not more wasteful than the binary representation. In extreme cases it is about twice as wasteful. Perhaps the extreme case is more common for reals than for integers.

Structure notation

We might think all is done when our data representation is reasonably space conservative, but that's not the case. A typical XML data element looks like this: "<element>12.23</element>". The data part of the element takes up 5 bytes and the structure notation part takes up 19 bytes. In a binary format, the structure notation per element would likely be 0 bytes, at most 1 or 2.

The answer lies in observing that while there exist many structure elements in an XML document, there exist very few different structure elements. In other words, there is very high redundancy in the structure notation. Whenever we have high redundancy, compression, or some other appropriate encoding, becomes highly efficient.
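A quick way to see this redundancy is to compress a pure-notation document in Python with zlib (a sketch; the synthetic document mirrors the element used above):

```python
import zlib

# 100,000 identical structure elements, no data at all: 2,000,000 bytes
notation = b"<element></element>\n" * 100000

compressed = zlib.compress(notation)
print(len(notation), len(compressed))
# The repeated tags compress to a tiny fraction of their original size.
```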

This is where we must stray a bit from the ASCII nature of XML. If we allow or enforce compression (say by using gzip), then the result is no longer an XML document; specifically, it's not even ASCII anymore.

However, this is not as drastic as it sounds. We can put the (de)compression completely outside the editing or processing stage. Instead of letting the XML parser or editor directly access the XML document from a file, it accesses it through a (de)compressor. This can be implemented with streams, so that the uncompressed XML document never needs to be stored anywhere.
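In Python, for example, this streaming arrangement is nearly a one-liner with gzip.open (a sketch; the file name is made up):

```python
import gzip
import xml.etree.ElementTree as ET

# Write a small compressed document, then parse it back through a
# decompressing stream; the uncompressed XML is never stored on disk.
with gzip.open("data.xml.gz", "wt") as f:
    f.write("<root><element>12.23</element></root>")

with gzip.open("data.xml.gz", "rb") as stream:
    tree = ET.parse(stream)   # the parser just sees plain XML

print(tree.getroot()[0].text)  # 12.23
```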

Similarly, when we want to send XML as part of a network protocol, we can employ the compression of some layer of the protocol.

We can think of the (de)compression stage as analogous to compression in modem communication: virtually transparent.

To verify the efficiency of compression of the structure notation in XML, we can make a simple test. We create a typical XML document, as well as a similar document with only the structure notation, and a document with only the data. We then compress all three separately and compare file sizes. The compressed full XML document should be roughly the same size as the compressed data-only document. The compressed notation-only document should be very small.

We can do this test simply in a bash shell (gzip is also required). A full document should strictly have a root element; we skip it for simplicity:

for a in $( seq 1 100000 ) ; do echo "<element>${a}</element>" >> full.txt ; done
for a in $( seq 1 100000 ) ; do echo "${a}" >> data-only.txt ; done
for a in $( seq 1 100000 ) ; do echo "<element></element>" >> notation-only.txt ; done

File sizes:

2.4M full.txt
580K data-only.txt
2.0M notation-only.txt

Compress (lowest/fastest compression):

gzip -1 full.txt
gzip -1 notation-only.txt
gzip -1 data-only.txt

File sizes:

268K full.txt.gz
212K data-only.txt.gz
16K notation-only.txt.gz

The structure notation takes up very little space when compressed. The compressed full document is 1.25 times larger than the compressed pure data. The uncompressed full document is 9 times larger than the compressed full document.

Parsing speed and memory usage are outside the scope of this article, but we can make a simple experiment to see that decompression incurs a very small overhead.

For this test you need to have root access on a Linux machine:

Compression from memory:

time dd if=/dev/mem bs=1024 count=10000 |gzip -1 -c > /dev/null

Time:

3.285s

Compression + decompression from memory:

time dd if=/dev/mem bs=1024 count=10000 |gzip -1 -c |gunzip -c > /dev/null

Time:

3.603s

Time for decompression:

3.603s - 3.285s = 0.318s

This means that on a 450 MHz PII processor it is possible to decompress over 30 MB/s, probably faster than any file can be read from disk. A modern processor should be able to decompress over 100 MB/s.

Conclusion

We have examined the verbosity aspect of XML. We have shown that the verbosity arising from representing data as ASCII is marginal. We have also shown that the verbosity arising from the structure notation in XML is very high, but can be reduced to virtually nothing by using a compression layer.

If we can guarantee low to no waste in storage space, and little waste in processing speed (which we have not covered) when using XML, then we can have a file format which is just as efficient as one hand-crafted for each application, but which is general across applications. We also get the benefit that many more people can construct and review file formats, and write processors for them.

[1] - ε for floats is 2^-24 ≈ 10^-7. This means that changes in the 8th or higher digit make no difference. (See also "Scientific Computing", Heath.)

XML and verbosity | 137 comments (132 topical, 5 editorial, 0 hidden)
Question? (3.00 / 3) (#1)
by omegadan on Mon Dec 02, 2002 at 12:48:27 PM EST

I know the idea of XML is to make grammars standard -- but we have had standard grammar parsers for 20 years (bison/yacc + flex/lex). So why do we need XML? (Obviously someone thinks we do, because it's freakin everywhere!) Can XML parse grammars like computer languages? Or just languages in its (horrid) HTML-like format?

Religion is a gateway psychosis. - Dave Foley

we still need a standard metagrammar (4.75 / 4) (#76)
by The Shrubber on Tue Dec 03, 2002 at 08:32:22 AM EST

XML doesn't do anything; it's just a meta-format for structured data.  In other words, it's a format for writing formats, and it's quite adequate if your data can be tree-structured.  

For example, HTML is an example of a format which is created from a meta-format (SGML).  HTML is a kind of SGML. Likewise, MathML, SVG, OpenOffice, etc are kinds of XML.

What's special about XML is just that it is standard.  Standardising makes a bunch of people's lives easier.  For example, i could write an XML parsing library with lex/yacc, but then it'd be done, and i could just give it to the community to be reused over and over again.  Making people's lives easier on the stupid stuff opens up avenues for progress on the smart stuff like scientific research.  

Instead of being bogged down in designing your data format, you stop making unnecessary decisions, and just use XML.

The only problem is that you have to design the format itself: decide what tags to use, what the tags mean, etc., which means you have to get agreement, which means you either 1) do it yourself and hope it fits or 2) form a committee. But such is the way of standardisation.

We don't need XML because it's special.  We need XML because it's standard, because it's a decision we don't need to keep making over and over again.

[ Parent ]

XML Yumminess (3.00 / 1) (#2)
by rdskutter on Mon Dec 02, 2002 at 12:53:27 PM EST

I've been using XML as a programming language of sorts (not for data storage) for about a year now.

It's extremely nice to use. I should mention that the language is by its very nature declarative and not imperative, and XML is very well suited to this.

You can't write hacky looking code in XML. It's always clear what is going on.

Verbosity is not a problem. Readable, understandable code is better than terse hacks.


Yanks are like ICBMs: Good to have on your side, but dangerous to have nearby. - OzJuggler
History will be kind to me for I intend to write it.

Explain please? (4.66 / 3) (#9)
by czth on Mon Dec 02, 2002 at 01:48:55 PM EST

I've been using XML as a programming language of sorts (not for data storage) for about a year now.

I'm curious. How do you use XML as a programming language, even "of sorts"?

You can't write hacky looking code in XML. It's always clear what is going on.

"Real programmers can write assembly code in any language" - Larry Wall.

Speaking of Larry Wall, if I had to use XML as a programming language I'd probably factor it down to the minimum and then write a perl script to convert my minimal version to XML when needed.

czth

[ Parent ]

Perhaps (4.00 / 2) (#25)
by Holloway on Mon Dec 02, 2002 at 03:06:36 PM EST

I'm curious. How do you use XML as a programming language, even "of sorts"?
Maybe they mean XSLT.


== Human's wear pants, if they don't wear pants they stand out in a crowd. But if a monkey didn't wear pants it would be anonymous

[ Parent ]
Examples (none / 0) (#65)
by rdskutter on Tue Dec 03, 2002 at 02:23:06 AM EST

Well you can have a look at the example code on arcos.kusala.com to see exactly what I'm talking about.

It's basically an abstraction layer. The re-usable components are written in PHP and the XML language is used to include the components and pass parameters to them.


Yanks are like ICBMs: Good to have on your side, but dangerous to have nearby. - OzJuggler
History will be kind to me for I intend to write it.
[ Parent ]

Eek (none / 0) (#81)
by czth on Tue Dec 03, 2002 at 10:22:48 AM EST

The demo on that site scares the poo out of me. Real programmers, avert your eyes, and pray that ye never tread here, for here be dragonnes invincible.

PHP is scary too. C-like syntax but denying the power thereof. Hm. Maybe I need to write an article on why PHP is crap (in much the same way VB is crap).

czth

[ Parent ]

Look at the sample code (none / 0) (#83)
by rdskutter on Tue Dec 03, 2002 at 11:20:31 AM EST

The demo you looked at was for the high level user interface and has nothing to do with programming, or XML.

If you want to see how we use XML then look here instead. I recommend that you look at the website in an hour and the blog first. The layout demo shows much more than XML and is intended to show how we integrate graphics into web-sites.


Yanks are like ICBMs: Good to have on your side, but dangerous to have nearby. - OzJuggler
History will be kind to me for I intend to write it.
[ Parent ]

Please do (5.00 / 2) (#86)
by Josh A on Tue Dec 03, 2002 at 01:51:09 PM EST

PHP is scary too. C-like syntax but denying the power thereof. Hm. Maybe I need to write an article on why PHP is crap (in much the same way VB is crap).

I would appreciate it, since it seems to me that PHP and C serve completely different purposes. Do tell me what you suggest we use for server-side web scripting instead.

---
Thank God for Canada, if only because they annoy the Republicans so much. – Blarney


[ Parent ]
Perl or Ruby -nt (none / 0) (#125)
by czth on Mon Dec 09, 2002 at 02:29:20 PM EST



[ Parent ]
Time (3.80 / 5) (#3)
by Bad Harmony on Mon Dec 02, 2002 at 01:21:52 PM EST

I'm sure that XML has its advantages, but any kind of text based file format is going to incur a major speed penalty over a binary format with fixed-length records. This is important when you have to process large volumes of data.

5440' or Fight!

Sure (none / 0) (#75)
by rdskutter on Tue Dec 03, 2002 at 08:29:41 AM EST

But XML beats the crap out of CSV for transferring and viewing data.


Yanks are like ICBMs: Good to have on your side, but dangerous to have nearby. - OzJuggler
History will be kind to me for I intend to write it.
[ Parent ]

speed is of no importance (none / 0) (#120)
by nex on Sun Dec 08, 2002 at 11:00:34 AM EST

> This is important when you have
> to process large volumes of data.

You don't store large volumes of data in XML. You don't need fixed-length records for data that's stored in XML. If you have to store large volumes of data and performance is an issue, of course you use a database. But what if you want to map the tables in that database to, say, C++ objects, and you need to specify which table column corresponds to which member of which object? You need a configuration file there, which stores the configuration data in an XML structure or in any other way. The advantage of using XML would be that you don't have to write a parser. And speed isn't of concern when you only have to parse a few kilobytes at program startup.
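Such a column-to-member configuration might look like the following Python sketch (the table, column, and member names are all invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from database columns to object members,
# read once at program startup.
config = ET.fromstring("""
<mapping table="customers" class="Customer">
  <field column="cust_name" member="name"/>
  <field column="cust_phone" member="phone"/>
</mapping>""")

fields = {f.get("column"): f.get("member") for f in config.findall("field")}
print(fields)  # {'cust_name': 'name', 'cust_phone': 'phone'}
```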

And when you have to parse hundreds of megabytes in fractions of a second, using XML is silly anyway.

[ Parent ]

Verbosity is not the problem (3.45 / 11) (#4)
by localroger on Mon Dec 02, 2002 at 01:25:40 PM EST

Verbosity is not the problem with XML, as the author points out. The problem with XML is that it must be parsed. This makes it completely useless for real-world databases of any significant size.

Fixed-width records are even more wasteful than the verbosity of ASCII XML, but all real databases use them as a starting point because you can go directly to record 34,892 by multiplying 34,892 by the width of the record and positioning the disk pointer right there. With XML you must scan through the file every time you do a query, or build an index. It's not noticeable on the programming exercise level but it's massively wasteful when one begins dealing with the kind of real data businesses need -- say, 3,000 customers and 40,000 part numbers and 100,000 transactions a year. Those figures are typical for the small business I work for, and XML is utterly inadequate to deal with them.
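The random access being described is simple arithmetic; here is a Python sketch with a hypothetical record layout (a 4-byte id plus a 20-byte name, file name invented):

```python
import struct

RECORD = struct.Struct("<I20s")  # fixed-width record: uint32 id + 20-byte name

with open("records.bin", "wb") as f:
    for i in range(100):
        f.write(RECORD.pack(i, b"name%d" % i))

# Jump straight to record 42: offset = record number * record width.
with open("records.bin", "rb") as f:
    f.seek(42 * RECORD.size)
    rec_id, name = RECORD.unpack(f.read(RECORD.size))

print(rec_id, name.rstrip(b"\x00"))  # 42 b'name42'
```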

I can haz blog!

What's the problem? (5.00 / 4) (#5)
by i on Mon Dec 02, 2002 at 01:38:24 PM EST

If you need a database, go buy a DBMS. XML doesn't replace databases. XML replaces comma-separated flat files. You don't use a comma-separated flat file as a database, do you?

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]
what's wrong with comma separated files? (3.00 / 2) (#7)
by speek on Mon Dec 02, 2002 at 01:40:32 PM EST

They're easier to deal with, faster, and can get the job done much of the time. Plus, non-programming people can manipulate them easily.

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

Nothing wrong with them. (5.00 / 3) (#10)
by i on Mon Dec 02, 2002 at 01:50:35 PM EST

Except they are flat, and don't contain any metadata, and there's no standard way to put a comma inside a field.

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]
You mean besides... (none / 0) (#74)
by hummassa on Tue Dec 03, 2002 at 07:44:34 AM EST

","

and

,

??

[ Parent ]

You mean that (5.00 / 1) (#78)
by i on Tue Dec 03, 2002 at 08:43:30 AM EST

a standard way to put a quote inside a field is \"? Or is it ""? Or maybe %"? Or  \042? &quot; maybe? =22? %22 anyone?

Oh, and EBCDIC has no backslash character.

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]

CSV Files (none / 0) (#88)
by Argel on Tue Dec 03, 2002 at 02:31:48 PM EST

I believe the correct way to indicate double quotes is to double them (and double-quote the whole field), and if the data contains a comma then double-quote it. E.g.:

"""dblquotesindata""","datawitha,",nothingspecial

This is why e.g. Access exports to CSV with quotes around everything. You can learn more by reading up on some of the CSV Perl modules: CSV_XS.
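Python's standard csv module implements exactly this convention; a quick check, using the field values from the example above:

```python
import csv
import io

buf = io.StringIO()
# Default dialect: quote fields containing the delimiter or quote
# character, and double any embedded quotes.
csv.writer(buf).writerow(['"dblquotesindata"', 'datawitha,', 'nothingspecial'])
print(buf.getvalue().strip())
# """dblquotesindata""","datawitha,",nothingspecial
```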

[ Parent ]

What abount hierarchy? (5.00 / 1) (#90)
by ttfkam on Tue Dec 03, 2002 at 06:58:08 PM EST

For example, in a list of part numbers by client:

<sales xmlns="http://mycompany.com/accounts/1.0">
  <customer id="567">
    <name>Acme Inc.</name>
    <address type="shipping">
      <street>642 1st Ave.</street>
      <city>New York</city>
      <state>NY</state>
      <zipcode>10101</zipcode>
    </address>
    <address type="billing">
      <street>644 1st Ave.</street>
      <city>New York</city>
      <state>NY</state>
      <zipcode>10101</zipcode>
    </address>
    <part id="6" quantity="5"/>
    <part id="76" quantity="1"/>
    <part id="86" quantity="7"/>
    <part id="43" quantity="4"/>
    <part id="22" quantity="1"/>
  </customer>
  <customer id="8724">
    <name>Widgets Intl.</name>
    <address type="shipping billing">
      <street>543 Mowtika Blvd.</street>
      <city>Minneapolis</city>
      <state>MN</state>
      <zipcode>74653</zipcode>
    </address>
    <part id="5" quantity="500"/>
    <part id="99" quantity="12"/>
  </customer>
</sales>

What happens when we want to add a contact name?  What about files that contain both Russian and Chinese characters?  What about error-handling?  Do all of your CSV parsers know what your data is supposed to look like...for all versions?  When someone is looking at the data file, how does the reader know which field means what when many are integers (ids)?

How would you handle this dataset -- a perfectly reasonable item for B2B operations and for the production of invoices -- quickly in both developer time and processing time without using something like XML?

Where's your comma-separated value file now?
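For what it's worth, pulling structured answers out of the document above takes only a few lines with a standard XML library. A Python sketch against a trimmed copy of the data (note the namespace handling):

```python
import xml.etree.ElementTree as ET

doc = """<sales xmlns="http://mycompany.com/accounts/1.0">
  <customer id="8724">
    <name>Widgets Intl.</name>
    <part id="5" quantity="500"/>
    <part id="99" quantity="12"/>
  </customer>
</sales>"""

ns = {"a": "http://mycompany.com/accounts/1.0"}
root = ET.fromstring(doc)
for customer in root.findall("a:customer", ns):
    name = customer.find("a:name", ns).text
    parts = [(p.get("id"), int(p.get("quantity")))
             for p in customer.findall("a:part", ns)]
    print(name, parts)
# Widgets Intl. [('5', 500), ('99', 12)]
```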

If I'm made in God's image then God needs to lay off the corn chips and onion dip. Get some exercise, God! - Tatarigami
[ Parent ]

yeah yeah yeah (none / 0) (#93)
by ttfkam on Tue Dec 03, 2002 at 07:54:19 PM EST

misspelled "about"...

sue me, it's an old laptop keyboard.

If I'm made in God's image then God needs to lay off the corn chips and onion dip. Get some exercise, God! - Tatarigami
[ Parent ]

XML doesn't need to be parsed (3.28 / 7) (#12)
by Holloway on Mon Dec 02, 2002 at 02:07:58 PM EST

Fixed-width records are even more wasteful than the verbosity of ASCII XML, but all real databases use them as a starting point because you can go directly to record 34,892 by multiplying 34,892 by the width of the record and positioning the disk pointer right there. With XML you must scan through the file every time you do a query, or build an index. It's not noticeable on the programming exercise level but it's massively wasteful when one begins dealing with the kind of real data businesses need -- say, 3,000 customers and 40,000 part numbers and 100,000 transactions a year. Those figures are typical for the small business I work for, and XML is utterly inadequate to deal with them.
Oh my god that's funny. Look, you've just proven that you don't actually pay any attention to XML. That's fine, that's OK, but please don't waste people's time posturing about ideas that you don't know (like, say, the XML databases Xindice, Tamino, or Excelon).

I'm not trying to be rude, but what you wrote was really very funny :)


== Human's wear pants, if they don't wear pants they stand out in a crowd. But if a monkey didn't wear pants it would be anonymous

[ Parent ]

XML does need to be parsed (3.50 / 2) (#46)
by czth on Mon Dec 02, 2002 at 08:19:41 PM EST

Oh my god that's funny. Look, you've just proven that you don't actually pay any attention to XML.

...

I'm not trying to be rude, but what you wrote was really very funny :)

  1. Why was it so funny?
  2. Why are people voting him up if he doesn't give a reason for his fit of hilarity?
If you're using XML as an internal database format you're an idiot, for precisely the reason the top-level parent gave: you have to parse it to extract data.

So, you start keeping an index of pointers to the start of each XML "record." But then you still have to delve into each record to do the equivalent of SELECT * FROM table WHERE x > 15 ORDER BY y, which involves parsing. Alright, so we'll store indexes to each "column" in each record too.

Duh, you just reinvented the database. Badly.

XML is an interface language, not a storage format. If we went looking for a guest using an "XML database" instead of in a highly-tuned (by expert DBAs) and indexed (hint: not MySQL) RDBMS, it would literally take weeks to find someone (3,000 customers? try 15,000,000, and to some I'm sure that's peanuts). And nobody wants to be on hold or wait for a web page that long.

czth

[ Parent ]

Titter! (4.25 / 4) (#48)
by Holloway on Mon Dec 02, 2002 at 09:37:42 PM EST

Why was it so funny?
Because it was every XML cliche in one brilliant bite size chunk. Beautiful, elegant, two thumbs up!
Why are people voting him up if he doesn't give a reason for his fit of hilarity?
What?
If you're using XML as an internal database format you're an idiot, for precisely the reason the top-level parent gave: you have to parse it to extract data.
czth, this is simply not true, and this is what's funny to me. You wouldn't use CSV and expect stunning performance, nor should you use XML text files and parse the whole thing trying to find a node. So yes - you start keeping an index of pointers, but if you wouldn't write your own RDBMS then you probably shouldn't write your own XML database. They have software for that, so you just need to ask for what you want (XQuery, Xpath), and it will get the appropriate segment.

People don't regard database SELECT statements as something that involves "parsing", or what goes on inside a RDBMS in finding an entry as "parsing". As XML databases involve pointers and other internal mechanisms, there's no good reason for an XML database to involve "parsing" (unless you have a rather lenient definition of parsing that would also encompass what a RDBMS does).

Duh, you just reinvented the database. Badly.
Nice. You just sorta went on for a while about pointers and indexes... and how XML databases use them too. You do realise there wasn't actually any conclusion saying anything bad or good about XML databases in that, right?
XML is an interface language, not a storage format. If we went looking for a guest using an "XML database" instead of in a highly-tuned (by expert DBAs) and indexed (hint: not MySQL) RDBMS, it would literally take weeks to find someone (3,000 customers? try 15,000,000, and to some I'm sure that's peanuts). And nobody wants to be on hold or wait for a web page that long.
Well, that's your whole argument, isn't it. That XML databases are inefficient. That they'll "literally take weeks" to find a result.

My experiences have shown otherwise. XML databases can be much faster, or much slower, depending on the type of data. Complex table joins to achieve hierarchy in a RDBMS aren't cheap, and if that's weighing down a server then consider a different technique. If your data/content is many short, highly structured entries, then a RDBMS would suit.

Mostly though, I find it funny that people don't have the knowledge to understand when to choose either one. That they comically FUD their way around the topic by saying that XML databases are slow, that they require parsing the whole file, and that these problems weren't solved years ago when CSV databases were popular.

These people obviously want a rule that says XML databases don't work, when it's more complex than that. Watching people FUD themselves is funny stuff!


== Human's wear pants, if they don't wear pants they stand out in a crowd. But if a monkey didn't wear pants it would be anonymous

[ Parent ]

*Markup*, not DB! (5.00 / 3) (#17)
by David McCabe on Mon Dec 02, 2002 at 02:19:48 PM EST

Databases are not what XML is for. XML is for markup. Take XHTML and DocBook for example.

[ Parent ]
I think that's a bit blunt (4.00 / 1) (#38)
by Holloway on Mon Dec 02, 2002 at 05:06:07 PM EST

Sort of. XML the markup language isn't for databases, but XML-related software can be, and whether it's suitable depends on the type of XML you have.

XML databases are getting quite efficient now. My data records are hierarchical, and since moving to an XML database it's been much faster. There are no costly table joins; you just pull out a node. But whether there'll be any benefit really depends on the type of queries you're doing.


== Human's wear pants, if they don't wear pants they stand out in a crowd. But if a monkey didn't wear pants it would be anonymous

[ Parent ]

Volatile Data Structures (4.00 / 1) (#51)
by cam on Mon Dec 02, 2002 at 11:06:57 PM EST

The problem with XML is that it must be parsed. This makes it completely useless for real-world databases of any significant size.

We had a problem in one of our systems with volatile data structures rather than volatile data. The data, once in, tended to stay there without modification, but the data structures were slightly different for each client. The system also collects a lot of reports, which can change weekly. We tried it as part of a database schema but the schema got in the way of supporting the business process changes.

We solved it by collapsing the common fields and indexed fields into the database schema, and then put the volatile data structures into XML and stored them in XML repositories. It is all on the filesystem, so it adds processing overhead, but the data is accessed infrequently enough that it isn't an issue. All of the common searches are done through the database schema, and the more static XML data is added to the database data to be displayed to screen. It works well.

cam
Freedom, Liberty, Equity and an Australian Republic
[ Parent ]

you tackled the least of XML's problems (4.40 / 10) (#6)
by speek on Mon Dec 02, 2002 at 01:38:43 PM EST

The problem isn't size of the file created. the problem is time spent diddling with ascii. And now you want to compress it too? Too slow for many problems.

And, it's simply not true that using XML makes you interoperable. So you have some help with parsing it - you still have to write your own code to read and understand it. It's marginally easier than any reasonably documented, non-obtuse file format that joe developer invents on the spot, but that invented-on-the-spot file format is likely faster, simpler, smaller, and better adapted to the problem it's solving.

Not to mention that to do your simple XML parsing requires that you include some mega-bloat library that can tinker with XML The Way God Intended, but I don't need any of it, really.

Standardization is over-hyped. There's a big difference between designing a file format to be proprietary, secret, and hard to reverse-engineer, and designing one to be binary, simple, and useful. There's no need to wedge everything into the XML mindset.

That said, I do think that when a certain set of data becomes universal enough, XML is a decent solution, and that marginally-easier-to-deal-with aspect does start to reap rewards. System config, address books, etc come to mind.

--
al queda is kicking themsleves for not knowing about the levees

A question or two. (4.14 / 7) (#11)
by i on Mon Dec 02, 2002 at 02:07:40 PM EST

Joe R. Coder invents a binary format on the spot.

What are the chances that Joe R. Coder will design upward compatibility in right from the start? What are the chances he will do it right? And how the hell will he do it?

Okay, that's three. Does anybody know how to answer them?

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]

who cares? (3.60 / 5) (#33)
by speek on Mon Dec 02, 2002 at 04:05:56 PM EST

What is this "upward compatibility" of which you speak? What are the chances E. Leet Coder, who spends time and energy and creates beautiful abstractions to design "upward compatibility", predicts the future correctly? What are the chances he creates a big fragile structure that gets little use?

Simplicity is a goal, not a failing.

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

Mibs (4.00 / 1) (#52)
by cam on Mon Dec 02, 2002 at 11:22:13 PM EST

Simplicity is a goal, not a failing.

In an ITS project we stored the MIBs that were to be used for a protocol in an XML data structure. As we had to communicate with a remote device, we emulated the remote device's MIBs in an XML structure locally for performance reasons, doing queries on the device as needed. The original MIB XML was loaded as a singleton and kept in memory for the life of the communication application. Since the MIBs were stored in an XML structure, we used XSLT for the lookups and processing of the DOM in memory.

Excellent design: we took any reference to the data structure totally out of our code. Our interface to the data we wanted was through XSLT. It gave a nice separation.

cam
Freedom, Liberty, Equity and an Australian Republic
[ Parent ]

Upward compatibility and simplicity. (4.00 / 1) (#63)
by i on Tue Dec 03, 2002 at 01:47:22 AM EST

Joe R. Coder wants to design a phone book application.

Version 1. The app stores (name, phone-number) pairs.
Version 2. The app stores (name, phone-numbers-list) pairs.
Version 3. The app stores (name, phone-number-list, email-address-list) tuples.
Version 4. The app stores (name, phone-number-list, email-address-list, snail-mail-address-list, other-contact-list) tuples.
Version 5. There's calling tariff information associated with each phone number so the app can automatically select the cheapest possible rate to contact a given person.

Joe R. Coder's app now needs to read five different binary formats. Tomorrow it will need to read six different binary formats. In three years... you get the idea.

E. Leet Coder's app needs to read XML. Yesterday, today, and forever.

Of course phone books are trivial. The software I'm working on isn't. While evolving at an alarming rate, it supports files with extremely complex structure that were created twenty releases (and who knows how many internal versions) ago. It wouldn't be possible without designing upward compatibility in from the start. Incidentally, the file format is somewhat XML-like. XML wasn't around back then; if it had been, the founding fathers would probably have used it, because it fits the bill perfectly.
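The upward-compatible reading strategy being argued about here can be sketched in a few lines of Python's standard library; the element names below are made up for illustration, not taken from any real phone book format:

```python
import xml.etree.ElementTree as ET

# A version-1 document: just (name, phone) pairs.
V1_DOC = """<phonebook>
  <entry><name>Fred Smith</name><phone>800-555-1234</phone></entry>
</phonebook>"""

# A later version adds elements. The v1-era reader below never needs to
# change, because it only looks for elements it knows and ignores the rest.
V4_DOC = """<phonebook>
  <entry>
    <name>Sue Clark</name>
    <phone>888-555-4321</phone>
    <email>sue@example.com</email>
    <address>12 Example St</address>
  </entry>
</phonebook>"""

def read_entries(doc):
    """Read (name, phone) pairs, ignoring any elements added by later versions."""
    root = ET.fromstring(doc)
    return [(e.findtext("name"), e.findtext("phone")) for e in root.findall("entry")]

print(read_entries(V1_DOC))  # [('Fred Smith', '800-555-1234')]
print(read_entries(V4_DOC))  # [('Sue Clark', '888-555-4321')]
```

The point is that one reader handles every version: unknown elements are simply skipped, which is exactly the property a hand-rolled binary format only gets if someone designs it in.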

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]

and (none / 0) (#77)
by speek on Tue Dec 03, 2002 at 08:37:54 AM EST

I could give an example of an application that was not right for XML, or one that isn't already understood by everyone, where it thus wouldn't be clear whether XML would be ideal. My point wasn't that XML never works (I actually gave an address book as a good use for XML in my first post); my point was that it's not always necessary or best.

I have an application wherein one file format uses XML, and another uses comma separated. It needs to be that way because XML was too slow for the second file format (it used to be XML, and it was unusable).

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

If it's too slow for an XML parser (5.00 / 1) (#79)
by i on Tue Dec 03, 2002 at 08:52:26 AM EST

use a friggin' database, dammit!

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]
Really? (4.25 / 4) (#15)
by Otto Surly on Mon Dec 02, 2002 at 02:15:06 PM EST

The problem isn't size of the file created. the problem is time spent diddling with ascii.

But you only make the translations to and from ASCII when reading or writing from disk, and it takes thousands of CPU cycles to approach the cost of reading or writing a single disk block. Are you sure this is really a meaningful problem? I would think any added CPU cycles would just be lost in the noise.
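As a rough sanity check of that claim, the conversion cost is easy to measure; the numbers are machine-dependent, so this Python sketch is only meant to show the order of magnitude relative to a disk access (typically milliseconds):

```python
import time

# Rough sketch: how long does an int <-> decimal-string round trip take?
N = 100_000
values = list(range(N))

start = time.perf_counter()
texts = [str(v) for v in values]   # binary -> ASCII
parsed = [int(t) for t in texts]   # ASCII -> binary
elapsed = time.perf_counter() - start

assert parsed == values            # the round trip is lossless for integers
print(f"{N} round trips in {elapsed * 1000:.1f} ms")
```

On typical hardware this is on the order of tens of milliseconds for a hundred thousand values, i.e. a fraction of a microsecond per value, though a full XML parse does more work than this per token.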



--
I can't wait to see The Two Towers. Man, that Legolas chick is hot.
[ Parent ]
ya, really (4.00 / 2) (#32)
by speek on Mon Dec 02, 2002 at 04:01:08 PM EST

A lot of time is spent doing XML, a lot of string manipulation is done, and the abstraction of many XML packages adds to the slowness. Is it really a problem? It depends on your app. For mine, yes it made a huge difference.

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

Joe Developer... (3.66 / 3) (#31)
by CtrlBR on Mon Dec 02, 2002 at 03:39:38 PM EST

When Joe Developer invents his own binary format he usually ends up just writing his C structures to disk (the BMP format is almost like that for example).

You end up with something that usually isn't cross-platform (because of padding and endianness problems) and has no provision at all for backward and forward compatibility. Well, they usually try to solve this one by "reserving" fields in the structure for future use, which means that in-memory objects take up too much space for the sake of an evolution that may never come. Disk is cheap, memory less so.

If Joe Developer wasn't so damn fucking dumb, XML wouldn't be needed; but since there are way too many people who have outlasted their "best used before" date (this is not an attack on older programmers; some people would be better off if they had never entered the IT industry in the first place), XML is sorely needed.

[ Parent ]

XML won't save the idiots (4.40 / 5) (#34)
by speek on Mon Dec 02, 2002 at 04:07:42 PM EST

If your premise is that Joe Coder does it wrong, then the conclusion is surely that the result is bad. However, even XML won't help in that case.

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

His premise should be... (5.00 / 3) (#36)
by LobsterGun on Mon Dec 02, 2002 at 04:18:40 PM EST

...that Joe Developer has better things to do than fussing with padding and endian issues.

You're certainly right that XML can not save us from poor developers. The best we can hope for is that it will allow developers to focus their time in ways that will produce better software.

[ Parent ]

Don't you think... (4.00 / 1) (#40)
by CtrlBR on Mon Dec 02, 2002 at 05:30:53 PM EST

...that nobody can be good at everything? Chances are that the guy writing that nice accounting package with a really foolproof UI isn't the guy who likes to tinker with bit-level stuff, sizeof(int), and little- vs. big-endian.

That guy just wants to write records to a file in a portable manner so that another guy doing the same job in another language can read them easily. XML is a tool to achieve that.

Using XML is not really optimal, not really fast, and without compression certainly not compact, but it works and does not tie you to a type of computer, language, or compiler.

Reverse engineering is easier than with binary format, ad hoc parsing is often possible. It won't cure the hunger problems in the world by itself, but it'll help.

[ Parent ]

sure (none / 0) (#44)
by speek on Mon Dec 02, 2002 at 06:57:44 PM EST

Anything simple and straightforward would serve the same purpose. XML is nice for some stuff. But for most things people do, it's overkill. That's all I'm saying. I use XML. I actually like it a lot. But, once you're over the honeymoon stage, you realize, it's just another tool, and quite often, it's a bit too much for the task at hand. Things don't magically interoperate just cause you used XML. It doesn't save you that much work.

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

mega-bloat libraries? i think not... (5.00 / 1) (#82)
by The Shrubber on Tue Dec 03, 2002 at 11:13:49 AM EST


Not to mention that doing your simple XML parsing requires that you include some mega-bloat library that can tinker with XML The Way God Intended, but I don't need any of it, really.

I don't think event-driven parsers are what I'd call mega-bloat (cf. SAX, expat).  And there's no reason you couldn't write your own lean and mean library to deal with XML The Way Speek Intended.

Look at JDOM, for example: they looked at DOM, said "no way", and pared it down to something just for Java.

[ Parent ]

X Window System is overrated (3.50 / 2) (#91)
by ttfkam on Tue Dec 03, 2002 at 07:15:57 PM EST

Before it came along, people on UNIX would just roll their own widgets and display managers.  X just limits your choices, and most people don't need that bloat anyway when all they want is a dropdown list, an entry box, a button, and a little text.  All video cards are VESA-compatible, right?

I personally don't see the advantage of network transparency.  I just don't need it.  But if I ever do, I can just add it to my program without all of the hack and mash that X makes me deal with.

This whole GUI standardization trend is just hype.  It's just like scripting languages -- which only serve to hide what the computer's doing from me.  There's no need to wedge all UNIX GUI apps into the X mindset.

That said, when all of the widgets and network transparency are really needed and used, X is a decent solution.  Editors, web browsers, etc. come to mind.

If I'm made in God's image then God needs to lay off the corn chips and onion dip. Get some exercise, God! - Tatarigami
[ Parent ]

XML primer (4.57 / 7) (#8)
by czth on Mon Dec 02, 2002 at 01:42:13 PM EST

Long version: go to the W3C site and read the XML spec (current version is XML 1.0 second edition).

Short version (pedants, cover your eyes): it's like HTML, but you can pick your own tags.

Sample XML:

<?xml version="1.0"?>
<Contacts>
  <Contact Type="Home">
    <FirstName>Fred</FirstName>
    <LastName>Smith</LastName>
    <Phone>800-555-1234</Phone>
  </Contact>
  <Contact Type="Home">
    <FirstName>Sue</FirstName>
    <LastName>Clark</LastName>
    <Phone>888-555-4321</Phone>
  </Contact>
  ...
</Contacts>

That's it. Formatting is not required, but extraneous whitespace is allowed. There are a lot of complicated things like schemas and DTDs. Most people don't need them. You can tell from the above that <Contacts> is a list of <Contact> elements, each of which may contain a first name, last name, and telephone number.

There are also a lot of good parser libraries for XML - libxml, expat, etc.

And of course, there are a whole set of perl XML-parsing modules in CPAN. XML::Parser is popular, and uses libexpat as a backend; it's quite flexible, allowing parsing via callbacks, or it can return a tree, for example. Or if that's too complicated for you there's XML::Simpler :).
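To give a concrete sense of how little code the parsing side takes, here is a sketch of reading the sample document above with Python's standard-library ElementTree; any of the libraries mentioned (libxml, expat, XML::Parser) would look broadly similar:

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<Contacts>
  <Contact Type="Home">
    <FirstName>Fred</FirstName>
    <LastName>Smith</LastName>
    <Phone>800-555-1234</Phone>
  </Contact>
  <Contact Type="Home">
    <FirstName>Sue</FirstName>
    <LastName>Clark</LastName>
    <Phone>888-555-4321</Phone>
  </Contact>
</Contacts>"""

root = ET.fromstring(DOC)
# Pull out each contact: the Type attribute plus the three child elements.
rows = [(c.get("Type"), c.findtext("FirstName"),
         c.findtext("LastName"), c.findtext("Phone"))
        for c in root.findall("Contact")]
for row in rows:
    print(*row)
# Home Fred Smith 800-555-1234
# Home Sue Clark 888-555-4321
```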

Some servers I wrote at work receive transactions in XML format. Even though we get a pretty large volume of transactions, it probably isn't worth compressing them - it adds complication, CPU usage, etc. Using XML wasn't my choice (not that I mind); the company loves it, so I just ask how high... :-).

Good article, though; it's nice to actually have some numbers regarding the size of XML and binary, compression times, etc.

czth

But that's so ugly (3.00 / 3) (#13)
by Psycho Les on Mon Dec 02, 2002 at 02:08:09 PM EST

<Contacts
  <Contact
     <Type "Home">
     <FirstName "Fred">
     <LastName "Smith">
     <Phone "800-555-1234">>
  <Contact
    <Type "Home">
    <FirstName "Sue">
    <LastName "Clark">
    <Phone "888-555-4321">>>

See how much neater it could be.

[ Parent ]

Heh, looks like LISP :-) [nt] (5.00 / 2) (#14)
by David McCabe on Mon Dec 02, 2002 at 02:11:51 PM EST



[ Parent ]
A thing to note. (4.50 / 2) (#16)
by i on Mon Dec 02, 2002 at 02:18:04 PM EST

Your syntax (which is called s-expressions, IIRC) doesn't make a distinction between data and attributes.

Which brings us to a question: why are attributes needed in XML? How can I determine whether I should express something with an attribute rather than with tags?

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]

You don't need attributes (none / 0) (#19)
by Psycho Les on Mon Dec 02, 2002 at 02:24:39 PM EST

There is nothing an attribute does that can't be expressed without them.

<foo bar="zoot">monkey</foo>

Doesn't hold any more information than

<foo><bar>zoot</bar>monkey</foo>

[ Parent ]

Nevertheless (5.00 / 1) (#21)
by i on Mon Dec 02, 2002 at 02:36:45 PM EST

attributes are in XML, and I pretty much want to find out why. If anybody wants to give a usual "XML is bloated and doesn't make sense" answer, please refrain. I've already heard that.

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]
XML is bloated and doesn't make sense. (4.00 / 1) (#23)
by Otto Surly on Mon Dec 02, 2002 at 02:50:53 PM EST

Honest XML textbooks will tell you that nobody really agrees on when to use attributes. They're for "metadata" as opposed to "data", but why is metadata intrinsically unique? And uniqueness as the only restriction is lame; you might just as well use a DTD to require uniqueness instead.

--
I can't wait to see The Two Towers. Man, that Legolas chick is hot.
[ Parent ]
Nah. (5.00 / 6) (#26)
by i on Mon Dec 02, 2002 at 03:06:50 PM EST

The original idea is that elements are "content" and attributes are "markup". Compare:

<page background="white">lame content here</page>
<page><background>white</background>lame content here</page>

Of course if you use XML to represent structured data (as opposed to marked-up text) the distinction doesn't make sense.

and we have a contradicton according to our assumptions and the factor theorem

[ Parent ]

attributes [contains scary code] (5.00 / 1) (#24)
by dr k on Mon Dec 02, 2002 at 03:05:45 PM EST

Attributes are there so that bad programmers can write code like this:

<items>
 <item id="1" children="3,4">
  ...
 <item id="2" children="4">
  ...
 <item id="3" children="none">
  ...
 <item id="4" children="none">
  ...
</items>

Note how cleverly the children attribute is "encrypted" so that the referred items cannot be accessed in one step.


Destroy all trusted users!
[ Parent ]

Markup (as opposed to data storage) (4.00 / 1) (#66)
by roiem on Tue Dec 03, 2002 at 02:47:08 AM EST

AFAIK, XML was originally meant for text markup (like HTML before it), not for data storage. When you're marking up text, the idea is that removing all the markup leaves you with the original text, which you can then read. For instance,

<b>Hello,</b> <font type="ugly">world</font>

would become simply "Hello, world", but

<b>Hello,</b> <font><type>ugly</type>world</font>

would become "Hello, uglyworld".

When using XML as a file-format for storing data, this distinction doesn't really exist anymore.
90% of all projects out there are basically glorified interfaces to relational databases.
[ Parent ]

Attributes are meta-data (5.00 / 1) (#87)
by istevens on Tue Dec 03, 2002 at 02:03:46 PM EST

attributes are in XML, and I pretty much want to find out why

Attributes are meta-data - they describe the data contained within the element they are part of.  They are not necessarily important to the end-user, but might be important to the program parsing the elements.

Attributes cannot be displayed when viewing the XML through a stylesheet, but they might be able to influence how the data is displayed.  For instance, suppose I had a list element that might be used for records, CDs, and tapes, and I wanted each type of list to be displayed differently while still being a list.  I could add an attribute called "medium", or possibly "class", to describe the contents of the list:


<list class="records">
     <item>Sgt. Pepper's Lonely Hearts Club Band</item>
     <item>Dark Side of the Moon</item>
</list>
<list class="cds">
     <item>The White Album</item>
     <item>A Love Supreme</item>
</list>

I could then create a stylesheet to display the list of records with a record icon next to each item, and the list of CDs with a CD icon next to each item.  If I were programming, I could configure my parser to only pick out lists of records.
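That last step (picking out only the lists of records) is a one-liner in most XML libraries. A sketch with Python's stdlib, wrapping the two lists in a root element since a well-formed document needs exactly one:

```python
import xml.etree.ElementTree as ET

DOC = """<collection>
<list class="records">
     <item>Sgt. Pepper's Lonely Hearts Club Band</item>
     <item>Dark Side of the Moon</item>
</list>
<list class="cds">
     <item>The White Album</item>
     <item>A Love Supreme</item>
</list>
</collection>"""

root = ET.fromstring(DOC)
# Select only the lists whose "class" attribute (the metadata) says "records".
records = [item.text
           for lst in root.findall('list[@class="records"]')
           for item in lst.findall("item")]
print(records)
```

The attribute never shows up in the extracted data; it only steered the selection, which is the metadata-vs-content distinction in action.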
--
ian
Weblog archives
[ Parent ]

A thought. (4.00 / 1) (#96)
by xriso on Wed Dec 04, 2002 at 02:21:51 AM EST

Say we have some red text "Foo", couldn't we also describe this as red in the shape of "Foo"?

In one case the content is Foo, the other case the content is red.

Markup-style: [red]Foo[/red], or [shape="Foo"]red[/shape]

Structured style: [text color=red content="Foo"]
--
*** Quits: xriso:#kuro5hin (Forever)
[ Parent ]

Design vs. implementation. (none / 0) (#126)
by istevens on Mon Dec 09, 2002 at 09:26:29 PM EST

Say we have some red text "Foo", couldn't we also describe this as red in the shape of "Foo"?

Sure you could, but that's a design issue not an XML implementation issue.

ian



--
ian
Weblog archives
[ Parent ]
Why (5.00 / 1) (#92)
by ttfkam on Tue Dec 03, 2002 at 07:36:01 PM EST

Your second example has a mixed-content schema and is harder to validate and to build editors for.

If an item is intended to be an identifier (usually very short and at most one instance), an attribute serves the purpose quite well where an element would be more clumsy.

For example:

<quote author="Anonymous Coward" source="Slashdot">
  God Bless America, where laws are passed to protect people from the legal system.
</quote>

In this case, metadata and content are kept separate.  The quote text is what is important in this case.  It is the content.  Everything else is metadata: information about the content.

For another example:

<author id="7">
  <firstname>Psycho</firstname>
  <surname>Les</surname>
</author>

In this case, "author" is nothing but the organization of metadata.  There is no content to "author."  I made "id" an attribute to denote metadata of this metadata (likely a reference to a database id).

There of course are no hard and fast rules, but this one works well for me.  It's not like CSV or anything else has any hard and fast rules.  At least with XML, you can codify your decisions.

If I'm made in God's image then God needs to lay off the corn chips and onion dip. Get some exercise, God! - Tatarigami
[ Parent ]

When to use attributes or elements (4.00 / 2) (#22)
by Holloway on Mon Dec 02, 2002 at 02:41:39 PM EST

In XML (and by this I mean XML specifically, not general markup) attributes are unique within an element, whereas child elements are not. Using schemas/DTDs you can impose rules to achieve the same, but the point is that if XML is only well-formed (and not validated), attributes are guaranteed to be unique. E.g.,

<foo id="zoot">monkey</foo>

and

<foo><id>zoot</id>monkey</foo>

Are the same, but

<foo id="zoot" id="frunsh">monkey</foo>

...would fail any parser, whereas

<foo><id>zoot</id><id>frunsh</id>monkey</foo>

Is still well-formed.

Obviously elements allow you to nest other elements within them, so generally I use attributes for strings, and elements for hierarchical data or where you want more than one value.
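That uniqueness guarantee is easy to check against a real parser. A quick sketch with Python's stdlib (which wraps expat), using the two documents above:

```python
import xml.etree.ElementTree as ET

# Duplicate child elements are perfectly well-formed:
ok = ET.fromstring("<foo><id>zoot</id><id>frunsh</id>monkey</foo>")
print([e.text for e in ok.findall("id")])  # ['zoot', 'frunsh']

# A duplicate attribute, however, is rejected by any conforming parser:
try:
    ET.fromstring('<foo id="zoot" id="frunsh">monkey</foo>')
    dup_rejected = False
except ET.ParseError as err:
    dup_rejected = True
    print("rejected:", err)
```

So the attribute form gives you the one-value-per-key rule for free, without a schema, which is exactly the point being made here.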


== Human's wear pants, if they don't wear pants they stand out in a crowd. But if a monkey didn't wear pants it would be anonymous

[ Parent ]

I find the difference between attributes ... (none / 0) (#132)
by paulgrant999 on Sun Dec 15, 2002 at 10:28:14 AM EST

and elements to be extremely useful;

when I want to quickly (sans schema/dtd) establish a 1:1 mapping, I use attributes.

When I might need more than one, I use elements.

The benefit is that anyone writing (say an XML editor), can take advantage of the fact that attributes are represented 1-to-1, and represent that node accordingly.  This line of reasoning also functions for any human who is reading the XML data; anyone who works extensively with XML realizes this quickly, and processes that fact quickly when browsing an XML file.

Similar reasoning explains the existence (and acceptance) of credit vs. debit cards; namely, the two separate interfaces into one data store (in this case, the consumer's wallet) provide a quick way to establish a fundamental relationship: whether or not the cash is actually in the merchant's account at the time of sale.

Also, I'd like to point out that the tag data itself is irrelevant as a source of bloat, in that a run through a slightly modified Huffman encoder would collapse all the tags down into the most efficient binary form ANYWAY. This compressed form (notation, really) is no doubt integral to the development of any efficient parser, which no doubt builds a lookup table and encodes said Huffman tree in memory (on parse).
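The Huffman point can be approximated with an off-the-shelf DEFLATE stream (LZ77 plus Huffman coding, as used by zlib/gzip); the tag-heavy document below is made up purely for illustration:

```python
import zlib

# Tag-heavy XML: the tag names repeat constantly, so DEFLATE collapses
# them to a few bits each, leaving mostly the actual payload.
doc = "<items>" + "".join(
    f'<item id="{i}"><name>thing</name></item>' for i in range(200)
) + "</items>"

raw = doc.encode()
packed = zlib.compress(raw, 9)
print(len(raw), "->", len(packed))
```

The repeated tags all but vanish in the compressed stream, which is why measured-on-disk verbosity arguments against XML tend to evaporate once any compression layer is present.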

As to representing image data, why not?  Have the metadata encoded, and the image data Base64-encoded.  This would also allow you some flexibility, as you could ADD or ANNOTATE the XML frame information with your own custom calculations, e.g.

<image><meta><frame_info>...</frame_info></meta><data>...</data></image>

could have arbitrary (irregular) data shoved into it by external programs...

<image><meta><my_calc heuristic_fitness_value="12"><whatever />
</my_calc>
<frame_info>...</frame_info></meta>
<data>...</data></image>

Now you can argue that binary data can do this as well; but I would argue that in most cases, changing the binary data format by insertion or appending would probably break the custom parser some person wrote.

As to the LISP syntax, I would hate that; it makes reading large documents (many pages) a pain in the butt, trying to figure out what scope a block is in (sans a LISP mode editor).

As to the Markup vs Tree argument; it is a tree that can be easily used to generate a markup syntax.  Think of a Venn Diagram; where the application that generates the XML based on overlapping selections marks the overlap as an intersection of the marks.  Trivial to implement.

As to the gent in question's demonstration (re: article), eh.  Anyone representing 32+ bit integers is going to use a library designed specifically for big integers; otherwise, what computation could you use them in?  And if that is the case (which I would hazard to say is likely), then the implementation is left to the implementors of the library, who no doubt addressed such an issue (and wrote their library to accept scientific notation).

Paul

[ Parent ]

Lisp conventions == pain (5.00 / 1) (#18)
by Otto Surly on Mon Dec 02, 2002 at 02:21:28 PM EST

If you want to remove or comment out one of those lines with all the kets on it, you have to grab just the right amount of kets off the end and put them somewhere else before you do so. Lisp's ";"-to-end-of-line comments make this even worse, since there's no way to comment out just part of a line.

I adore Scheme as a language for expressing algorithms succinctly, but I could do without the excruciating paren juggling.

--
I can't wait to see The Two Towers. Man, that Legolas chick is hot.
[ Parent ]

The solution is to (none / 0) (#20)
by Psycho Les on Mon Dec 02, 2002 at 02:35:55 PM EST

use a proper tool like Emacs that moves the parens for you automatically.

In Common Lisp, the lisp reader ignores anything between #| and |#.

[ Parent ]

XML for most things (1.42 / 7) (#28)
by dvchaos on Mon Dec 02, 2002 at 03:17:17 PM EST

is basically overly complicated, unnecessary bloat. You really *don't* need to XMLify absolutely everything. Personally, IMHO, XML sucks (mainly for the bloat reason), so I try to avoid using it as much as I can.

--
RAR.to - anonymous proxy server!
about floats (4.00 / 3) (#30)
by RelliK on Mon Dec 02, 2002 at 03:31:20 PM EST

The thing about real numbers is that they are not exact. Essentially, a real number is a scaled, infinite-length integer. For instance, π is 3.1415926... Of course you can't store an infinite number of digits in a finite amount of space, so at some point the number is chopped. Just by assigning a real number to a floating-point variable you introduce error; every operation on the floating-point number adds to that error.

What happens when you convert a floating point number from its binary representation into decimal? You introduce more errors! Once you print a floating point number to an ASCII file, you no longer have the same number. When you convert the decimal representation back into binary, you compound the conversion errors.

I was bitten by this in a project I did a while ago. For any application where accuracy is important, the decimal representation described in the article is not an option. At best you can use hex representation so your floating point numbers will look like 0xDEADBEEF. Of course that doesn't save any space; it, in fact, doubles it. But then space is hardly an issue -- convenience and ease of development is far more important.
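The decimal-vs-hex distinction is easy to see from a REPL; a small Python sketch (the exact hex digits assume IEEE 754 doubles, which is what CPython uses):

```python
# Decimal text truncated to a fixed number of places loses information...
x = 1.0 / 3.0
truncated = f"{x:.6f}"             # '0.333333'
assert float(truncated) != x       # the round trip is lossy

# ...but a long-enough decimal form round-trips exactly (Python's repr
# emits the shortest string that reads back to the same double), as does
# the hex form mentioned above.
assert float(repr(x)) == x
assert float.fromhex(x.hex()) == x
print(x.hex())  # 0x1.5555555555555p-2
```

So "decimal text in XML loses precision" is only true of naive fixed-precision formatting; printing enough significant digits (or hex floats) makes the round trip exact, at some cost in file size.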
---
Under capitalism man exploits man, under communism it's just the opposite.

Floating Point Text I/O (4.66 / 3) (#37)
by Bad Harmony on Mon Dec 02, 2002 at 04:50:56 PM EST

I think this is already a solved problem.

Clinger, William D. How to read floating point numbers accurately.
In [ACM PLDI, 1990], pp. 92--101.
http://citeseer.nj.nec.com/clinger90how.html

G. L. Steele Jr. and J. L. White. How to print floating-point numbers accurately.
In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 112--126, White Plains, New York, June 1990.

5440' or Fight!
[ Parent ]

Protocol elements are not documents (3.66 / 3) (#39)
by aminorex on Mon Dec 02, 2002 at 05:10:51 PM EST

One fundamental flaw in this article is the failure to clearly and explicitly distinguish between the useful constraints on protocol elements and structured documents. Can the author indicate *one* asynchronous bi-directional protocol which uses an XML-conformant (rather than XML-like) document as the protocol block envelope?

Another basic flaw is the failure to address the issue of BLOBs. XML and BLOBs are mutually repugnant. BLOBs are of high and increasing importance, as there are no XML standards for the majority of multimedia document types, nor would they be used if there were. (Can you seriously contemplate MPEG-4 in XML?)

It is also a bit unrealistic to suggest that the bloat of XML can be resolved by compression when there is no prevailing standard for compression -- the interoperability value of XML would be eliminated by such a usage.

The complaints about the inappropriateness of XML as a general framework for protocol design are based in large part on these points, and failing to address them adequately means that the thesis is not effectively supported by its argument.

I certainly admit the value of XML as a framework for certain classes of protocol, if you admit frankly that you are not implementing XML per se, but rather a skew subset of XML, and I am very glad that the XML standard exists, to provide a framework for interoperation of protocols which rely on application-specific structured documents or data structure serialization -- it's one more tool in the box, and a very handy one at that. But the objections to the misapplication of XML as a panacea for the complexity and cost of protocol design is detrimental to the success of many projects, so that an unbalanced reply to those valid criticisms which have been propounded on the basis of real-world experience is not a very constructive contribution to the public discourse on this topic.

heh (3.00 / 2) (#41)
by Psycho Les on Mon Dec 02, 2002 at 05:40:34 PM EST

Can you seriously contemplate MPEG-4 in XML?

I bet the XML-freaks can.

[ Parent ]

Let me give a try at XML-freak-dom: (5.00 / 3) (#49)
by R343L on Mon Dec 02, 2002 at 10:10:45 PM EST

<mpeg version="4" xmlns="http://some-freakishly-long-url.that.no.one.can.remember.">
   <header>
       <!-- a bunch of crap...I don't know what is in an MPEG-4 header -->
   </header>
   <data encoding="base64">
       <!-- a bunch of letters and numbers -->
   </data>
</mpeg>

How's that?

Rachael
"Like cheese spread over too much cantelope, the people I spoke with liked their shoes." Ctrl-Alt-Del
[ Parent ]

either I stood up too fast or you wrote too fast (none / 0) (#47)
by speek on Mon Dec 02, 2002 at 09:25:05 PM EST

But the objections to the misapplication of XML as a panacea for the complexity and cost of protocol design is detrimental to the success of many projects, so that an unbalanced reply to those valid criticisms which have been propounded on the basis of real-world experience is not a very constructive contribution to the public discourse on this topic.

And now I need to go lie down...

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

Protocols (5.00 / 2) (#54)
by marx on Mon Dec 02, 2002 at 11:46:31 PM EST

One fundamental flaw in this article is the failure to clearly and explicitly distinguish between the useful constraints on protocol elements and structured documents. Can the author indicate *one* asynchronous bi-directional protocol which uses an XML-conformant (rather than XML-like) document as the protocol block envelope?

No, I don't know any such protocol. I don't get this attitude of "XML can't build spaceships" -> "XML sucks". Why do you want to represent the envelope with XML? It seems much more logical to represent the payload with XML.

(Can you seriously contemplate MPEG-4 in XML?)

XML does not magically solve every problem. Still, I think an interesting solution would be to represent every group of frames using XML, then put a more suitable container format around it.

It's worth sacrificing a bit of performance to get the strengths that XML provides. You talk about MPEG-4, but how can you possibly excuse the fact that the dominant container format for MPEG-4 today is AVI?

The complaints about the inappropriateness of XML as a general framework for protocol design

Again, why do you insist on designing protocols with XML? This seems to be a typical strawman approach.

But the objections to the misapplication of XML as a panacea for the complexity and cost of protocol design is detrimental to the success of many projects

If you had actually read the article, you would have seen that it addressed concerns over verbosity from those who want to design their own file formats.

Here are the only two instances of the word "protocol" in the article:

Similarly, when we want to send XML as part of a network protocol, we can employ the compression of some layer of the protocol.

How you could interpret that to mean "XML as a panacea for [...] protocol design" is a bit beyond me.

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

Eh (none / 0) (#106)
by hstink on Thu Dec 05, 2002 at 01:08:28 AM EST

It's worth sacrificing a bit of performance to get the strengths that XML provides. You talk about MPEG-4, but how can you possibly excuse that the dominant format for MPEG-4 today is AVI?

What strengths would XML provide to MPEG-4?

All I can see are reduced bitrate and the ability to edit 600 MB files in human-readable form.  I don't find either particularly attractive.

-h

[ Parent ]

Asynchronous bi-dir protocol? Jabber... [n/t] (5.00 / 1) (#61)
by gusnz on Tue Dec 03, 2002 at 01:43:00 AM EST




[ JavaScript / DHTML menu, popup tooltip, scrollbar scripts... ]

[ Parent ]
Wake up y'all. (2.66 / 9) (#42)
by tkatchev on Mon Dec 02, 2002 at 06:06:39 PM EST

XML is for marked-up text, not a way to represent arbitrary structured data.

XML is nothing but a "text-file-with-tags". Think HTML or Latex or RTF except clean, simple, efficient and readable.

XML achieves what it was designed for perfectly. Use the right tool for the job.

   -- Signed, Lev Andropoff, cosmonaut.

Off-topic a bit (4.00 / 3) (#50)
by bugmaster on Mon Dec 02, 2002 at 10:55:14 PM EST

This is a bit off-topic, but: I am actually somewhat mystified by the latest "XML is the answer to everything" craze. As far as I understand, XML is just a common standard for representing structured text data. Another common standard like that is the Windows INI file; it's a bit easier to read but not as good. Anyway, since XML is so common, there are parsers/renderers available for it in many programming environments.

From the paragraph above, how does it follow that putting your data into XML format automatically makes it cross-platform and super-compatible? If the program that is trying to read your data does not know what the <name> or <salary> elements mean, it won't be able to process your data at all. The only thing XML buys you in this case is some saved time on the parser/renderer implementations.

Am I missing something? Is there really some automatic way for importing XML data from any application into any other application?
>|<*:=

You're right (5.00 / 2) (#53)
by epepke on Mon Dec 02, 2002 at 11:33:27 PM EST

XML is a Hot New Buzzword. Well, it isn't all that new, but still.

That having been said, it isn't a terribly bad thing. A lot of people from a lot of different companies have built stuff on top of XML. There's no magic bullet, but sheer obviousness results in many different XML formats for similar applications (such as scheduling programs and directories) looking almost the same, and therefore requiring only minimal adaptation. It's also pretty easily Perl-mungeable by the somewhat competent, in a way that highly proprietary data formats aren't.

That having been said, it's still basically just S-expressions with ugly syntax.


The truth may be out there, but lies are inside your head.--Terry Pratchett


[ Parent ]
No (4.00 / 4) (#55)
by marx on Tue Dec 03, 2002 at 12:10:57 AM EST

Semantics have not been standardized, and probably never will be. Just because you and I both have a "dog" datatype doesn't mean it means the same thing to each of us.

This is what I wrote:

The task has been reduced to what it should have been from the beginning: to decide which data should be stored, and what the meaning of this data should be.
This is what XML provides. To be able to process the <salary> element of a file, all you have to know is the meaning, you don't have to worry about the representation.

You say that all this buys you is some saved time, but come on, it's a bit bigger than that. It's the difference between talking to someone in a language you can understand or not. If you speak the same language, then all that's left is to make sure that you agree on the semantics, that "dog" means the same thing to both of you. If you talk to someone in a language you don't understand, then you have to learn characters, syntax, grammar, etc. before you can even get to the point of worrying about semantics.

Also, part of the semantics of an XML file is typically specified in a DTD or Schema (a bigger part in the latter).

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

ontologies and metadata (5.00 / 1) (#80)
by The Shrubber on Tue Dec 03, 2002 at 09:35:38 AM EST

warning! i don't really understand this stuff

you are right, but there might be two answers (or these might really be the same answer)


  1. Ontologies - formally defined relationships between terms.  I use some kind of logic, like description logic, to formally express that Foo "is-a-child-of" Bar, and that Blech "makes-use-of" Foo.  If i'm talking in Foos, Bars and Blechs, and you're talking in Arrghs, Pttpts, and THpts, it doesn't matter as long as the formal relations between them are comparable.  Now there is even an XML format for expressing ontologies, OWL (http://www.w3.org/TR/owl-ref/)

  2. Metadata - data about data.  Learn about RDF (http://www.w3.org/RDF/) and DAML (http://www.daml.org/)

We're not there yet, but we're chipping away at the problem.  Believe it or not, standardisation people aren't all idiots, and there are people in those three-letter organisations (ISO) that know what's up.

Also, since i don't really understand this stuff, i'd appreciate it if someone could give a better explanation.


[ Parent ]

Exactly (none / 0) (#133)
by sandro on Sat Dec 28, 2002 at 01:08:07 AM EST

This is the area being tackled by the W3C's Semantic Web Activity (which pays my salary).

The grandparent to this comment seems right to me; XML helps settle some syntax wars (just as S-Expressions would), but it certainly doesn't tell you what <employer> means.  XML Namespaces go a step closer, especially if they get extended to say one should actually look at the namespace document. RDF also gets a step closer in telling you which tags name relationships between things and which tags name classes of things.  OWL (the successor to DAML) lets you know a lot more about those classes and relationships and how they relate to each other.

But yeah, it's a hard problem, and XML addresses much less of it than some people seem to think it does.  (The human-readable tags trick people a bit, I think.)

[ Parent ]

"On average" is miscalculated to be misl (4.28 / 7) (#56)
by derobert on Tue Dec 03, 2002 at 12:29:07 AM EST

So on average, the ASCII representation is not more wasteful than the binary representation. In extreme cases it is about twice as wasteful.
That is quite an abuse of statistics! Let's do the math for real.

  Number                   ASCII Sz  Binary Sz
  0          - 9           1           4
  10         - 99          2           4
  100        - 999         3           4
  1000       - 9999        4           4
  10000      - 99999       5           4
  100000     - 999999      6           4
  1000000    - 9999999     7           4
  10000000   - 99999999    8           4
  100000000  - 999999999   9           4
  1000000000 - 4294967295  10          4

Now, to actually calculate the average size, we need to add up the size of each number and divide by the number of numbers. That is the definition, after all. The binary case is trivial: it is 4 bytes. Now, for the ASCII case, let's first count how many numbers are of each size. Also, let's go ahead and multiply to find the size of that group in total ("Tot. Size").

  Size       Count          Tot. Size
  1          9-0+1 = 10     10
  2          99-10+1 = 90   180
  3          900            2700
  4          9000           36000
  5          90000          450000
  6          900000         5400000
  7          9000000        63000000
  8          90000000       720000000
  9          900000000      8100000000
  10         3294967296     32949672960
  -------------------------------------
  Sum        4294967296     41838561850

Now dividing the final total size sum by the count yields the average size: approximately 9.74 bytes, about 2.4 times the size of the binary representation. XML has its uses, but compact representation of numbers is not one of them.

[Assumption: Numbers are uniformly distributed. Non-uniform distributions have different average sizes in ASCII.]
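The arithmetic above is easy to check mechanically. A minimal sketch in Python (variable names are mine) that reproduces the table and the average:

```python
# Average ASCII length of a uniformly distributed unsigned 32-bit integer,
# versus a fixed 4-byte binary representation.
count = 2 ** 32
total_bytes = 0
lo = 0
for digits in range(1, 11):
    hi = min(10 ** digits - 1, 2 ** 32 - 1)
    total_bytes += (hi - lo + 1) * digits  # how many numbers have this many digits
    lo = hi + 1

average = total_bytes / count
print(total_bytes)          # 41838561850, matching the table's sum
print(round(average, 2))    # about 9.74 bytes, versus 4 for binary
```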

small numbers are in fact more common (none / 0) (#58)
by xriso on Tue Dec 03, 2002 at 12:58:38 AM EST

Look at the source of this page, and note how many numbers are over 4 digits. Colors, and some miscellany, but most of the rest is indeed only a few digits.
--
*** Quits: xriso:#kuro5hin (Forever)
[ Parent ]
You wouldn't use 4 octets for those numbers (4.33 / 3) (#62)
by derobert on Tue Dec 03, 2002 at 01:43:00 AM EST

If you were going for compact binary representation, you wouldn't use longs for a lot of things in the page source. Widths and heights would be two octets, so ASCII starts losing out at 100. Borders, percentages, padding, etc. are probably a single octet; ASCII never wins. Also, ASCII requires some sort of terminator, which I didn't count in my calculations (otherwise, you don't know when one number ends and the next begins: 112 could be 1, 1, 2; 11, 2; or 1, 12).

The only time ASCII turns out to be a more compact representation than binary is when most numbers are small, but there is an occasional huge number (forcing you to use 4, 8, or more octets to represent it in binary).

Of course, if that were the case, a length-value binary representation would be shorter:

   -------------------------------
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
  |---+---+---+---+---+---+---+---|
  | SIZE  | First six bits        |
   -------------------------------

This way, numbers up to 63 can be represented in one octet. Size can be 0 (no additional octets), or 1, 2, or 3 additional octets, allowing numbers up to 1,073,741,823 to be compactly represented. For really compact binary number storage, arithmetic coding can be used (though it hurts CPU-wise).
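A rough sketch of this scheme in Python (function names are mine, not from any standard): the top two bits of the first octet hold the count of additional octets, and the remaining six bits hold the high bits of the value.

```python
def encode(n):
    """Pack n (0 <= n < 2**30) as a 2-bit length prefix plus 6 value bits."""
    assert 0 <= n < 1 << 30
    extra = 0
    while n >= 1 << (6 + 8 * extra):   # additional octets needed beyond the first
        extra += 1
    out = bytearray(extra + 1)
    v = n
    for i in range(extra, 0, -1):      # fill low-order octets, last first
        out[i] = v & 0xFF
        v >>= 8
    out[0] = (extra << 6) | v          # SIZE in the top 2 bits, 6 value bits below
    return bytes(out)

def decode(data):
    """Return (value, number of octets consumed)."""
    extra = data[0] >> 6
    n = data[0] & 0x3F
    for b in data[1:1 + extra]:
        n = (n << 8) | b
    return n, 1 + extra
```

With this, 63 fits in a single octet, 64 needs two, and anything up to 1,073,741,823 fits in four.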

ASCII can't beat a proper binary representation of numbers for a simple reason: In ASCII, each digit, of which there are 10, eats 8 bits. 8 bits actually provides 256 possibilities. ASCII numbers waste 246 possibilities, or ~96% of them. When you're ignoring all but ~4% of your data, you can't win.

[ Parent ]

Fragility (4.25 / 4) (#64)
by marx on Tue Dec 03, 2002 at 02:11:31 AM EST

I think you're showing the negative side of binary representation here though. You impose restrictions to save a byte here and there, because it feels neat in a sense, but it makes your format fragile. Suddenly it's impossible to specify a percentage over 255%? Also, it's very hard to fix that problem once the format has started being used.

You're right in that it seems strange to use just 10 of 256 possibilities, but it's not as bad as you make it sound. To represent 256 possibilities in binary requires 8 bits, but to represent 10 possibilities does not require a proportional 10/256 of those bits; it requires 4 bits, the smallest whole number of bits that can distinguish 10 values. So basically decimal digits in ASCII ignore half the byte, not 96% of it.
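For the curious, the two ways of counting can be sketched quickly (my arithmetic, not from the thread): rounding up to whole bits gives the "half the byte" figure, while the stricter information-theoretic count puts the waste a bit higher than half.

```python
import math

bits_needed = math.ceil(math.log2(10))    # 4 whole bits to distinguish 10 digit values
entropy_bits = math.log2(10)              # about 3.32 bits of information per digit
wasted_whole_bits = 1 - bits_needed / 8   # 0.5: half the byte
wasted_entropy = 1 - entropy_bits / 8     # about 0.58 by the entropy count
```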

An ASCII representation is not that wasteful, but it's really flexible. And I think that was one of the main points of XML, to put a stop to the binary jungle which basically made it impossible to parse and write other people's formats, or to change formats once they started being used.

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

Error (4.00 / 1) (#57)
by marx on Tue Dec 03, 2002 at 12:29:31 AM EST

The decompression speed test result has an error. Since the data is compressed first, it doesn't actually decompress 30 MB of compressed data per second; it decompresses to 30 MB of uncompressed data per second. So that particular result should be about 10 MB/s instead.
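A one-line sanity check of the correction, assuming the roughly 3:1 compression ratio implied by the figures:

```python
uncompressed_output_rate = 30.0   # MB/s of decompressed data produced
assumed_ratio = 3.0               # uncompressed : compressed (assumption)
compressed_input_rate = uncompressed_output_rate / assumed_ratio
print(compressed_input_rate)      # 10.0 MB/s of compressed input consumed
```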

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.

For certain applications, good. (2.00 / 2) (#59)
by xriso on Tue Dec 03, 2002 at 01:04:00 AM EST

Surely you wouldn't suggest doing an image format in XML?
--
*** Quits: xriso:#kuro5hin (Forever)
Image format (4.20 / 5) (#60)
by marx on Tue Dec 03, 2002 at 01:34:09 AM EST

The Scalable Vector Graphics format has actually become kind of a poster child for XML. I don't see why a raster format couldn't be done similarly.

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

Uh-oh. (5.00 / 4) (#67)
by i on Tue Dec 03, 2002 at 03:31:03 AM EST

<pixel>
 <red>
  <value>
   127
  </value>
 </red>
 <green>
  <value>
   0
  </value>
 </green>
 <blue>
  <value>
    255
  </value>
 </blue>
</pixel>

and we have a contradiction according to our assumptions and the factor theorem

[ Parent ]
Pixels (5.00 / 2) (#69)
by marx on Tue Dec 03, 2002 at 05:19:20 AM EST

In SVG they seem to allow this kind of color: "#ABCDEF". I don't think I agree with those kinds of constructs though. If you start defining further arbitrary structure on your values, then you lose some of the power of XML with validation etc.

Maybe it's just better to say that XML is not that good for these kinds of extremely simple, regular formats.

As soon as you start adding complexity, I think XML should be used though. Maybe the raw data could simply be stored as an XML element, and the rest of the document describes all the related information.

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

XML all the way down (none / 0) (#99)
by zakalwe on Wed Dec 04, 2002 at 09:37:30 AM EST

If you start defining further arbitrary structure on your values, then you lose some of the power of XML with validation etc.
I don't think it's a bad choice. From an ideological-purity standpoint it may seem very nice to have the same rules all the way down, but this is an illusion. At some point you have to define atomic elements that go no further. For example, numbers aren't represented as <number><tens>4</tens><digits>2</digits></number>, though you are representing more of their structure through XML rather than that arbitrary decimal layout we use (even here you're still left with a "digit" atomic type though.) For a graphics format, "colour" seems a reasonable choice for an atom.

[ Parent ]
Not quite (3.00 / 1) (#89)
by jabber on Tue Dec 03, 2002 at 02:35:54 PM EST

Note, it's Scalable Vector Graphics, not Absolute Raster Graphics.
Try <circle cx="250" cy="250" r="35" style="stroke:none; fill:red">

[TINK5C] |"Is K5 my kapusta intellectual teddy bear?"| "Yes"
[ Parent ]

If not SVG, then what? (none / 0) (#122)
by pin0cchio on Sun Dec 08, 2002 at 09:46:45 PM EST

Note, it's Scalable Vector Graphics, not Absolute Raster Graphics.

So how would you do raster imaging in XML? A camera currently can't just recognize an ellipse and output its coordinates.


lj65
[ Parent ]
XML image format (4.50 / 4) (#71)
by Robosmurf on Tue Dec 03, 2002 at 06:05:00 AM EST

Actually, I think XML would be quite useful for an image format.

Certainly you wouldn't want to mark up the actual pixel data. However, an XML structure for information about the image would be very useful.

Images can have a huge amount of metadata relating to them. For instance, pixel format descriptions, gamma levels, layer definitions, thumbnails, author information etc.

This is exactly the kind of thing that XML is for.

[ Parent ]

Hmmm, it seems like your tests (2.33 / 3) (#68)
by S1ack3rThanThou on Tue Dec 03, 2002 at 03:49:10 AM EST

Are very XML-favoured; as the comment below states, it would be inappropriate for an image format.

The problem I hear with the XML zealots is that it can be used for anything. In fact it is a good standard that is appropriate for some, though not ALL, uses. As ever it is a case for "right tool for the job"...

"Remember what the dormouse said, feed your head..."

Well, actually ... (none / 0) (#95)
by jefu on Wed Dec 04, 2002 at 01:02:44 AM EST

XSLT turns out to be Turing complete. And I can't find the reference right now, but I think it is even a fairly restricted subset of XSLT. You need to use recursion or repeated invocations of XSLT to make it work, but it does work. (I keep toying with the idea of building an XSLT interpreter for SK combinators - I have an experiment I'd like to run - but keep putting it off.)

Therefore, XSLT is a universal programming language and since it is just another dialect of XML, so is XML.

So XML can compute (modulo Church/Turing) anything computable. (I know, finding that infinite tape is likely to take a while.)

So XML can do anything (more or less).

I'll make no claims about any associated time/space behavior.

[ Parent ]

Compute anything != use for anything (none / 0) (#98)
by zakalwe on Wed Dec 04, 2002 at 09:23:49 AM EST

Therefore, XSLT is a universal programming language and since it is just another dialect of XML, so is XML.
And C is an arrangement of ASCII characters, therefore ASCII is a programming language? This is a bit of a dubious statement, since it's confusing the representation with the interpretation. XSLT defines a way to interpret XML elements, but saying "XML is Turing complete" is pretty meaningless. You don't even need XSLT - you could just as easily represent a C syntax tree in XML, or just define <XMLCode>int main(){ printf("hello world")}</XMLCode> as some standard that's interpreted appropriately - both Turing complete "dialects" of XML if interpreted in the right way.

In any case, Turing completeness does not mean you can do anything, because in the real world you can define specs in relative terms, like "efficiently represent an image", which XML, or even other Turing complete languages like INTERCAL, can't meet. (In fact Turing completeness is neither necessary nor desirable for this application.)

For some tasks, Turing completeness may even be a disadvantage, or an outright bar. Requirements like "an untrusted user input query format" are not met by accepting arbitrary C code.

[ Parent ]

I am working on a project (5.00 / 3) (#70)
by daragh on Tue Dec 03, 2002 at 06:03:29 AM EST

That integrates binary data and XML. We have defined a schema for describing binary files using XML, down to the level of byte order; data structures, arrays, and primitive types can be defined, as well as metadata about variable names and meaning. The level of description allowed is quite flexible, and we are just working on an access library for files described in this way. It is intended for scientific use, or any situation where file sizes run into the tens of gigabytes and marking the data up directly is definitely a no-go.

One of the interesting things we hope to be able to do is transform the structure of binary files described using this technology using XSLT with (we think this is viable) Cocoon. For example, you could convert database table formats using just XSLT and no fancy parsers.

For more info:
here

No work.

Like I care (4.00 / 4) (#72)
by DGolden on Tue Dec 03, 2002 at 07:19:52 AM EST

I just use Lisp sexps.  Same expressivity as XML, half as verbose.  Easily parsed in C, Lisp, Perl, and a slew of other languages.

See <a href="http://ssax.sourceforge.net/">SSAX</a> if you don't believe me.

XML:  Letting the masses use Lisp without admitting it to themselves.

Don't eat yellow snow

Sigh. (2.00 / 1) (#73)
by DGolden on Tue Dec 03, 2002 at 07:21:42 AM EST

Oops. Guess who forgot "HTML Formatted" Link is here
Don't eat yellow snow
ASCII Numbers / Fixed vs Variable Width Records (5.00 / 1) (#84)
by codemonkey_uk on Tue Dec 03, 2002 at 11:46:58 AM EST

The problem with using ASCII for numbers, and the reason that the author of the article's comparison is flawed, is more to do with fixed-width vs variable-width records (an issue also touched on in this comment, which addresses the speed element of the problem).

One of the problems with variable-width records is that you need some way of knowing where the end of the record is. An ASCII number is a variable-width record. Typically each "number" is a sequence of digits, terminated with a non-digit character. So "1 2" is treated as two records, "1" and "2", and "12" is treated as one record, "12". So while "1" only occupies 1 byte, as opposed to, say, 4 bytes for a 32-bit integer, to store two numbers as ASCII they must be separated, which at best costs 1 byte/char (doubling the example's best-case size), and in XML typically takes at least 7 chars of separation (ie <i>1</i>).
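The separator cost is easy to see concretely; a small sketch (the sample values are mine, chosen for illustration):

```python
import struct

nums = [1, 7, 42, 3]
ascii_form = " ".join(str(n) for n in nums)        # 1-byte separator per number
xml_form = "".join("<i>%d</i>" % n for n in nums)  # 7 bytes of tags per number
binary_form = struct.pack("<4i", *nums)            # fixed 4 bytes each, no separator

print(len(ascii_form), len(xml_form), len(binary_form))  # 8 33 16
```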
---
Thad
"The most savage controversies are those about matters as to which there is no good evidence either way." - Bertrand Russell

Excess 'em All (4.50 / 4) (#85)
by jefu on Tue Dec 03, 2002 at 12:48:42 PM EST

<rant topic="xml" type="text" language="US English" dialect="technogeek" organization="not overly" thoroughness="not much">

First off, I should say that I like XML. Rather a lot actually.

Second, someone says "lisp sexprs" are the same, but this is not quite the case. XML elements can have attributes and constraints on the elements that they contain. If you leave out attributes, DTDs (to express such constraints) and some of the other fancy stuff, you can end up with expressions that are only "<foo> ... </foo>" bracketed and this is just about as easy to parse as sexprs. And a few days back I wrote and posted a small XSLT program to slashdot that would do a big chunk of conversion of XML to lisp - it took only a few minutes to write - though was far from complete.
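As a toy illustration of how thin that bracketing difference is, here is a sketch (function name mine) that flattens attribute-free XML into an S-expression-like string:

```python
import xml.etree.ElementTree as ET

def to_sexpr(el):
    # element name, then any text content, then children, all parenthesized
    parts = [el.tag]
    if el.text and el.text.strip():
        parts.append('"%s"' % el.text.strip())
    parts.extend(to_sexpr(child) for child in el)
    return "(" + " ".join(parts) + ")"

print(to_sexpr(ET.fromstring("<foo><bar>1</bar><baz/></foo>")))
# (foo (bar "1") (baz))
```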

The parallel with lisp is a very good one, though, and indeed XSL can be considered both a transformation language and a quasi-macro language for XML - like lisp, XSL and XML share their syntax.

Third, this comment rambles a bit, as I'd like to address several points in the other comments and they themselves ramble a bit.

XML in its full glory, though, demands a complicated parser, as the parser must be able to cope not just with the XML data but also with DTD information, including element definitions, entity definitions and attributes (including defaults).

It's not as bad as it could be, though; a full SGML parser is really tough (see James Clark's SP), as it not only needs to cope with all of XML, but also with fun stuff like missing element close indicators (ie omitted </foo>).

Certainly XML is not perfect for everything. I'd not use it to replace an RDBMS - RDBMSs have very nice clean semantics and a query language based on a more or less simple model. OODBMSs, XML, XML Query and the like are nowhere near that.

In many circumstances an RDBMS is by far the preferred solution : large data sets with fairly regular data.

But RDBMS's have problems too. One interesting one is transferring data from a database with one data model to another - even slight differences in the SQL declarations can lead to quite wonderful problems. (And I struggled once with someone who had written everything into an RDBMS with a single row definition consisting of an ID and a string. Each such string had to be parsed separately and might contain more IDs. He wanted to represent objects (of course). That was a spectacular mess.)

What XML does provide is a rich, relatively simple framework for marking up data. And, personally, I don't care much if the parser is big as long as it will deliver a DOM tree, or run XSLT or whatever - if it is a shared library (and it will be eventually on most platforms), the overhead should not be incurred by a single program.

XML does have some serious warts (whitespace handling is an interesting one) - but I've followed XML development enough to understand that there are excellent reasons for most of these and that many of them are easily enough avoided.

One of the common statements is "I can store it better in binary." Probably so. So can I - I can (have, probably will again) build ad hoc binary representations of data in a program.

I've also had to cope with binary files that have been corrupted. This usually means either building a one off repair program or tossing everything. As one annoying example, if the windows registry goes bad you may have no recourse but to reformat your drive and reinstall windows. This has happened to me several times - I now keep several backups as well as a textual backup.

This isn't just windows, of course, but most unix systems have text versions of binary files around somewhere (sigh, not always) - for example sendmail used to digest its configuration file from text to binary when things changed - so the human user could always use the text version.

While unix configuration files are text - making them available even when fancy GUI tools are not available, or may not work, they are very variable in format. Consider the following, each of which has a different format (though sometimes only slightly different), each requiring a different parser : /etc/passwd, /etc/termcap, /etc/hosts, apache, sendmail (though sendmail.cf is a beast entirely unto itself), wine, /etc/fstab, rpm --- the list goes ever on and on ....

Don't forget mail (messages as in RFC 822, the protocol as in RFC 821), HTTP (not HTML, the protocol), FTP....

All of these could be conveniently and easily written in XML.

I recently reverse engineered (DMCA notwithstanding) a set of binary files from a vendor. I needed to use their hardware for something, the hardware read these binary files, and the only thing that wrote the files was their program, which ran only under DOS and required tedious data entry by hand. I had the data (thousands of records) in machine-readable format (in an RDBMS) and wanted to write it in their format - but they wouldn't say how. So I broke the thing and wrote a program to write the data into their format and thence to their hardware. If it had been XML formatted there would have been no problem at all.

A while back, (in part to test an RDBMS in fact, but mostly because I wanted to see the output) I wrote a program to decode DEM (Digital Elevation Model) data from the USGS. The data was written using fortran formatting codes, many with fixed field lengths - so I needed to build a one off parser to read this data - and many languages do not provide easily for this kind of parsing. If the data had been marked up with XML and if I'd had an XML parser, it would have been trivial.

It's also worth noting that there have been several data description languages defined that provide specific ways to represent data in program/hardware-independent(?) forms - HDF is one example; I've run across probably a dozen more. Each of these requires Yet Another Parser and Yet Another Support Library.

Sure, I know YACC and LEX and the like. But I find writing parsers BORING and would rather be doing something else.

And I've not even touched on why XML is good for marking up text.

</rant>

Why I don't like XML (5.00 / 3) (#97)
by DGolden on Wed Dec 04, 2002 at 08:07:49 AM EST

I would point out that the "SSAX" Scheme<->XML parser I linked to is a complete expression of the XML Infoset.

That is to say the sexps (called "SXML") it produces ARE in fact exactly equivalent to XML, including attributes, namespaces and entities. The toolkit also includes Scheme constructs analogous to XSLT and XPATH (quite complete, called SXPATH).

Admittedly, it doesn't validate against a schema, but the core framework for constraint-checking is there.

So nyah. :-)

Also, I would contend that XML SUCKS for marking up text. I reproduce my stock rant below:

XML Sucks. Its Not A Markup Language
by DGolden

No really, it's a tree language. If it was MARKUP - i.e. layers of "virtual highlighter pen", it would allow overlapping tags, and wouldn't shoehorn weakly structured data into rigid trees*. As it is, XML corresponds closely to Lisp [lisp.org] sexps, but reimplemented badly with shitty redundant syntax.

XHTML is a particularly bad application of XML, because HTML text is intended to be authored by humans, not autogenerated by and for some bloated SAX parser/DOM tacked onto a bloated Java/CLR VM.

People liked HTML before XHTML because it was forgiving. One could forget a few close tags, one could <b>overlap <i>tag</b> runs </i> and the browser would muddle through.

There's no particularly good reason to burden people with maintaining rigid tree structure if it doesn't make sense. One of the major problems I have with people using XML is the weeks/months they spend agonising over their Schemas, on the correct way to shoehorn their transient data into pretty trees - for god's sake, people! If you're using tools so inflexible that you can't just change your mind halfway through, maybe it's time you stopped using the buzzword-laden marketware of XML/Java/C++/C# and moved to a more flexible platform, like Perl, let alone Lisp! 90% of the time, the stuff I see could just be an ASCII CSV dump of an array, or just a stream of bytes! At least Lisp sexps don't force you to bother with close tags that redundantly echo the open tags - and they have identical expressivity, since XML is a tree, not a markup, language.

Bring back real Markup languages! The XMLers have lost their way. They're busily reinventing lisp, badly (yet again) - they've just come from the other side (data-side) to all those scripting languages (code-side) that are slowly mutating into Lisp, where data is code and code is data.

* (And yes, I know that you can eventually make most everything look like a very broad tree-structure by placing a virtual root before an arbitrary collection - witness the UNIX filesystem! - but I hope the reader can see that that's not really my point)
Don't eat yellow snow
[ Parent ]

He's baa-aa-aa-aack! (5.00 / 1) (#100)
by jefu on Wed Dec 04, 2002 at 11:26:25 AM EST

Rather than rambling incoherently about two topics in one post - something I'm far too liable to do - I'm pulling this note, on a different topic, into another response. (I know, this way I can Ramble Incoherently about a single topic.)

You mention the definition of schemas (and by extension DTDs and the like) and speak of inflexibility. And you do so rather dismissively, as though these problems are trivial and can be fixed with the addition of an ad hoc tag here and an ad hoc tag there.

I'd like to analogize with a programming model (a bad analogy - but sometimes the bad analogies are the most illuminating).

Suppose there's this programming project. Say 50 or more coders, designers, hangers on (managers, camp followers) and so on. They're given a spec, high level design and build a lower level design with nice interface specifications (probably UML diagrams these days).

A programmer on it sees what he thinks is a hole in the specifications and, on his own, finds a fix. (Perhaps in Perl - though then it might be a Phix.) He tells a few people, but it is not reviewed or discussed at any other level.

Small scale tests work fine. The thing is released for beta testing and suddenly it starts - not crashing - but doing the wrong thing. Why? Because that hole wasn't a hole; it was a deliberate decision on the designers' part, and his fix just changed the workings of the whole program.

So, easy enough, lets back off on that change - but it permeates his code and affects the code of the people he talked to, and perhaps even has influenced unit testing of parts of the code written by other people who've never heard of his bright idea.

Oops.

Similarly with data definition.

Anyone who has ever designed a database of any size knows that it pays Big Time to get the data organization right before it goes into production use. Any change to a running database is usually done carefully and with a bit of trepidation.

I have been looking at defining XML for what I thought would be a relatively simple application. After a preliminary pass through things I ended up with almost 20 pages of data items (in a rough tree structure). I could do without all of this detail, but ideally I'd like to get other potential users of the same kind of data together, agree on a schema and adopt it as a standard. Then the same tools will run on everything, the data could be imported to an RDBMS - regardless of the source (well, as long as we agree on what it all means) and so on.

I also recently spent some time looking at some XML marked-up data. It was clear that the DTD had been defined without much thought. (One part was more or less like putting a person's name and address into a single string field in an RDBMS, and copying the whole thing into every record that had anything to do with that person. Have fun changing that person's address.) As that DTD and information marked up in it spreads, it will become harder and harder to change (though with tools like XSLT, it may not be so hard to morph data from the original schema into data in another schema).

There are lots of places where quick, local (in time, space and organizational terms), ad hoc markup (data description) is the Right Thing To Do. Just as I write little languages for "programming", so do I write little languages for data organization. And really, that's what XML is -- a language for defining little languages. But if you want to write non-local (time, space, organization) data descriptions, it pays to do it carefully, precisely and with lots of forethought.

Ok, just a little ramble. XML is (as far as I'm concerned) a tree language. That it is called "markup" is just one of those things. Trip to the moon on gossamer wings. One of those things. XSLT is a tree rewriting language. That these can be used to do markup is a good thing.

[ Parent ]

XML Images (4.00 / 1) (#94)
by Imperfect on Wed Dec 04, 2002 at 12:50:38 AM EST

Was just thinking about how you could include images w/in XML documents and came up with an idea...  UUEncode!

I mean, Usenet's been doing that basically since its inception.  It's not really good for huge relational databases, since most UUEncode algorithms are inefficient at best, but it works for absolute raster graphics.

The funny thing is, I came upon this little problem when designing the data files for a PBEM game this past summer, and totally forgot about UUEncode for this purpose.  My solution was to have a sort of html "link" in the XML file (<image>blue42.jpg</image>), then stick the XML file in a zip with the image, then send that.
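A minimal Python sketch of that zip-plus-reference scheme (filenames and content are hypothetical, not from the actual game):

```python
import io
import zipfile

# The XML file carries only the name of the image it needs; the image
# itself travels alongside it in one zip archive.
xml_doc = b"<game><board><image>blue42.jpg</image></board></game>"
image_bytes = b"fake image bytes for the sketch"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("game.xml", xml_doc)
    zf.writestr("blue42.jpg", image_bytes)

# The receiver unpacks the archive and resolves the <image> reference
# by filename.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as archive:
    extracted_xml = archive.read("game.xml")
```

The archive keeps the XML free of binary data while still shipping everything in a single file.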

The final irony?  I went and worked out an algorithm for encoding and decoding UUEncoded files so I could send the .zip from inside the game!

One day, I wish to learn from my mistakes when it's still actually /relevant/ to me.

Not perfect, not quite.

binary data is a problem (none / 0) (#107)
by gps on Thu Dec 05, 2002 at 03:30:25 PM EST

XML does not have a nice way to encode binary data.  Base64 encodings (don't use uuencode; use the base64 MIME standard) are what is usually used.  They waste CPU time (and space, if you care about that), since the often-large chunk of data has to be loaded and decoded back into the binary it represents.

This is one of XML's biggest failings (if you can call it that; it's not what XML is intended for).  Other data can pretty much be streamed inline to/from its internal data structure source.  Binary data (images, large chunks of encrypted data, etc.) has to be encoded and decoded (using CPU and memory, and involving many more data copies) if it is to be represented within an XML document.

That is one reason binary data is often "encoded" as a reference to an external URI pointing at the binary data file.
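The size overhead in question is easy to measure; base64 emits 4 output bytes for every 3 input bytes, a fixed expansion of about one third:

```python
import base64

# 10240 bytes of arbitrary binary data (every byte value, repeated).
raw = bytes(range(256)) * 40
encoded = base64.b64encode(raw)

# Base64 output length is 4 * ceil(n / 3), about a 33% expansion,
# and decoding it back costs CPU and an extra data copy.
print(len(raw), len(encoded))  # prints: 10240 13656
decoded = base64.b64decode(encoded)
```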


[ Parent ]

Your tests... (2.00 / 1) (#101)
by curunir on Wed Dec 04, 2002 at 02:50:49 PM EST

First off...I like XML. I think as computers get faster, things like XML will, while adding some bloat, make the lives of programmers much easier.

That said, I'm not sure your tests are particularly useful. I should preface this by saying that I don't know gzip as well as I should, but I have a decent understanding of how some compression technologies work. I believe you've come up with a particular situation where gzip is extremely good at reducing file size, one which might not translate quite as well to the real world. Like many other compression technologies, gzip compresses many copies of the same string very well. So, in your tests, the notation text compresses down much better than it would in a real-world application where the strings are at least somewhat different.

While I think it is fair to assume that the notation text would compress better than the data since it is much more likely to be repeated throughout the file, I don't think real-world testing would show as dramatic a difference in compression ratios as your tests do.

I thought that, too (none / 0) (#103)
by epepke on Wed Dec 04, 2002 at 07:19:17 PM EST

But then I used gzip to compress a real, live XML file that I produced with a moderately real, live program that I am writing. Lots of structure and not very predictable--just what you'd expect to do poorly. It went from 41571 bytes to 1735 bytes. Not bad.

Of course, there was some extraneous whitespace in the original file, so I used a Perl script to get rid of it. So, that became 30347 bytes to 1461 bytes. A bit of all right, that.
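For anyone without a comparable file at hand, the same effect shows up on a synthetic example (element names made up for the sketch): the tag names recur on every record, so gzip's dictionary coding removes most of the markup overhead.

```python
import gzip

# A file with heavily repeated markup and varying data values.
records = "".join(
    f"<point><x>{i}</x><y>{i * i}</y></point>" for i in range(1000)
)
xml = f"<data>{records}</data>".encode()

packed = gzip.compress(xml)
print(len(xml), len(packed))  # compressed size is a small fraction
```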


The truth may be out there, but lies are inside your head.--Terry Pratchett


[ Parent ]
Gzip is good at text... (none / 0) (#104)
by curunir on Wed Dec 04, 2002 at 07:59:15 PM EST

Gzip is really good at compressing text, so the point still stands that much of XML's bloat with respect to file size can be eliminated with compression.

My only point was that the test in the article created a situation where the notation was much easier to compress than the data. In a real-world test, the notation would compress at a rate much closer to the rate at which the data compresses.

In the above test, the full XML file compressed down to about 10% of its original size, the data compressed down to about 40%, and the notation compressed down to about 1%. I would think that real-world tests would find compressed sizes for both data and notation somewhere between the 10% and 40% found in the above test.

[ Parent ]
A factor of 2 can be a huge problem .... (none / 0) (#102)
by alternatist on Wed Dec 04, 2002 at 05:16:31 PM EST

Try telling an organisation storing 20 TB of data that a factor of two in storage used is no problem. Or tell them that they can 'just' compress/uncompress the 2 TB of data accessed daily, since it is only a 'minor' increase in the need for CPU power ..... Not to mention the cost of configuring and maintaining all that extra hardware. Somehow I doubt your ideas would be seriously considered ..... XML is good for things where size does not matter.

CPU time is not important (none / 0) (#128)
by jonathanclark on Thu Dec 12, 2002 at 10:11:12 AM EST

> since it is only a 'minor' increase in the need for CPU power

The article fails to mention that the slowest part of any computer is the hard drive, so reducing the size of your data will speed things up. The CPU time needed to decompress is insignificant compared with disk speed. Data can be compressed/decompressed in blocks to prevent large reads or rewrites. http://thinstall.com is a program I made that automatically adds block-based compression to any Win32 program. It doesn't do compressed writes because it's designed for ship-time compression rather than runtime.

Jonathan
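Block-based compression of this kind can be sketched in a few lines of Python with zlib (block size and helper names are illustrative, not Thinstall's actual scheme):

```python
import zlib

BLOCK = 4096  # fixed block size, chosen arbitrarily for the sketch

def compress_blocks(data: bytes) -> list:
    """Compress fixed-size blocks independently, so any single block
    can be decompressed later without touching the rest of the file."""
    return [zlib.compress(data[i:i + BLOCK])
            for i in range(0, len(data), BLOCK)]

def read_block(blocks: list, index: int) -> bytes:
    # Only the requested block is decompressed per read.
    return zlib.decompress(blocks[index])

data = b"<row>some repetitive xml content</row>" * 2000
blocks = compress_blocks(data)
assert read_block(blocks, 1) == data[BLOCK:2 * BLOCK]
```

Compressing per block trades a little ratio for random access: a reader can pull one 4 KB block off disk instead of decompressing the whole stream.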

[ Parent ]
The world is not INTEL 32-bit machines ...:) (none / 0) (#131)
by alternatist on Sat Dec 14, 2002 at 08:13:52 AM EST

On some architectures the CPU is balanced with the rest of the machine, and no surplus CPU cycles are available without an upgrade ... But your statement would have been correct on a PC.

[ Parent ]
Don't use XML for storage! (none / 0) (#135)
by ajm on Sat Jan 11, 2003 at 10:43:52 AM EST

No one in their right mind would use XML for storing data. XML is good for exchanging data, even if size does matter, and not for storing data in general, even if size doesn't matter.

[ Parent ]
Do some real research next time. (none / 0) (#105)
by pb on Wed Dec 04, 2002 at 08:48:59 PM EST

I did a quick test for myself in StarOffice, exporting to text, html, rtf, doc, and its native format (gee, compressed XML...)

      Compressed  Uncompressed
TXT           70*           38 
HTML         443           859
RTF          721          2323
DOC         1299          8192**
SXW         5170***      16536

*   Larger from gzip's overhead
**  I could get it down to 6656 using Word 6 format instead
*** I could get it down to 3379 using tar/gz instead of zip (native format)

Here's my test in RTF format (since it's small enough to post conveniently here...)

begin 644 example.rtf.gz
M'XL(`.ZH[CT"`]552U/;,!"^]U?HS*$C.0_"<.AT:$OI0`^%OF;V(EOK6"!+
M1E(@F8S_>U=R2>@$>FD.]!#KV]UO]:U>FS7X6`N0-FA06-<<)`U&VKG@Q>35
M&FIG8RP-`0ZU=ZVT4'?^MH"Z:J0/&#F[TBT&]AGOV9<47\,!U-)$=M4X;Y4T
MV!_WE"_^,9]2K#8[V1]/V86V5>/8N9XWD7W*Y-'3Y+=WTD9V*KU"=NIBHZO,
M'C_#]EH:]M7JRA'_XO*X[VE'*F><IRTY!H^*P]PC6@ZE62#?=8EBEITT[C(W
MS@0>V$DBQ)7!T"!&P@*J6J1-:59=@[;(HT&I!A2]U&:`K5SR'F3HI.D:22=K
MJ@8DK4[6H1A#.M5B,@$3/?E5F8/%HV"-.>PHD$[[P2WX:`3!XC(*]@YKN3!Q
MJ+(`HR?30_":V/1[$766,J!R5@P5%W0K0F0G=(O1QC#432PIBA=9[XA=T9>5
M3JURK0=0A4DJ=*^J[(-ST;J([(3NNJPB^K"1F^Y?[KU5SZD=DEH!"[-?P3,Z
M;F\QTNG;FXW6C+1&>]?ZIH..J'8TDZJVM:..X5%&W<+*%YP7T#I1@%J-H2$;
M6FV%H";D\>ZOG((XG==V((DI%XE$'$&<3.%]:DYM2Q>=74;IOWNJRY/SCNJ:
MC@7O^]3CHRQ%,>%Y3[JY"A7U,C(RY,.P""B.)H3OPY*J.^()-PE/9Z-#:*6?
MFV3-.,^&?VS$9(S'@U$^&'E>NZ2NNFDA_2OH9(>^&6;-^'Y0RPK;Z;=S;R?.
M*&`5%;VA&^LL[K7>.MKR.GT#[604"?D0*^HC"5KI0:)5RE5T2[8AN4U(T'I3
MI45Z!9V1VK*7TLUOJ]]Y0YC^+<-H"IKE8I_X_C=K6/_IHQ?(SD+/?ER<,WH%
BI0OXAK$UE.RG6_2L1!8;9-<+-<?7_;#4_A=F`GRM$PD`````
`
end

---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
The parsers don't really help. (none / 0) (#108)
by dark on Thu Dec 05, 2002 at 03:47:42 PM EST

There might be standard XML parser libraries available, but they just provide a DOM tree or a SAX stream. Then you have to write your own code to deal with that, which often ends up being just as much code as parsing a dedicated file format would have been in the first place. (And just as many people get it wrong: XML might be "extensible", but not if you have insufficient flexibility in the code that parses the DOM tree.)
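A small Python sketch of the point: even with a standard DOM parser, mapping the generic tree onto application data structures is still hand-written, element by element (the element names here are hypothetical):

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    "<addressbook><person><name>Ada</name><age>36</age></person></addressbook>"
)

# The library parses the syntax for free, but this loop -- the part
# that knows what the elements mean -- is still custom code.
people = []
for node in doc.getElementsByTagName("person"):
    name = node.getElementsByTagName("name")[0].firstChild.data
    age = int(node.getElementsByTagName("age")[0].firstChild.data)
    people.append({"name": name, "age": age})

print(people)  # [{'name': 'Ada', 'age': 36}]
```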

You are right (none / 0) (#115)
by Timwit on Sat Dec 07, 2002 at 02:08:55 AM EST

Writing the code to process the DOM tree is a pain in the ass. A standard API to populate C structures from DOM trees would be very handy.


[ Parent ]
True, but not a problem in Java or .NET (none / 0) (#136)
by ajm on Sat Jan 11, 2003 at 10:46:57 AM EST

To be provocative, this isn't an issue for "enterprise" development languages. Java and the .NET languages all have extensive facilities for mapping between XML structures and object graphs. The problem, in fact, is choosing an implementation from among the many available. If you're using C with XML, you may need to ask yourself why.

[ Parent ]
XML to Objects is a solved problem (none / 0) (#137)
by ajm on Sat Jan 11, 2003 at 10:52:35 AM EST

Java and the .NET languages, at least, provide extensive facilities for mapping between XML and objects. Sure, you can knock up a quick parser for a simple dedicated file format, and that may work for you. However, what if you're exchanging data with someone else, or another company? Are you going to give them your code for parsing the file? What if they are using a different language from you? Are you prepared to validate their parser to make sure it works? Are they prepared to even deal with a format where they have to code their own parser?

Using XML removes a whole bunch of syntactic issues from consideration (especially in combination with XML Schema). Sure, you still have to deal with the semantic stuff, but at least you know the syntax is right. If you control all usage of the file and are prepared to maintain all of the code to read and write it, go ahead and define your own format; otherwise XML removes a whole bunch of pretty boring code that does nothing to differentiate your app from anyone else's, unless you consider a fancy file format a selling point!

[ Parent ]
what is this supposed to mean? (4.00 / 1) (#109)
by ogre on Thu Dec 05, 2002 at 11:02:56 PM EST

As someone who has done a lot of work in taking data from one system and moving it to another, I can assure you that a compression step is a significant addition to the complexity and the maintenance overhead. Also, none of the advantages you give for XML (even if I agreed that they are true) rely on it being a text format. So the only thing you have shown (assuming that I accept your numbers) is that the enormous bloat of a text representation can be somewhat mitigated by adding a complicated extra step in your processing. This is a pretty unconvincing defense of the text representation.

Everybody relax, I'm here.

Data compression is complicated? (none / 0) (#116)
by p3d0 on Sat Dec 07, 2002 at 11:08:10 AM EST

Aren't there any number of general-purpose data compression packages that just work?
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
[ Parent ]
I didn't say data compression is complicated (none / 0) (#118)
by ogre on Sun Dec 08, 2002 at 12:17:21 AM EST

It is complicated, but as you point out there are plenty of pre-canned mechanisms to do it. What I said is that adding a compression step adds to the complexity of the system. This isn't because of the inherent complexity of compression but because every processing step makes the data pipeline more complex. It is one more package to choose, acquire, and maintain, one more interface to set up and keep running, one more place for things to go wrong.

Everybody relax, I'm here.
[ Parent ]

Gotcha (none / 0) (#119)
by p3d0 on Sun Dec 08, 2002 at 10:12:10 AM EST

I was thinking of the wrong kind of complexity.
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
[ Parent ]
Transparent data compression/decompression (none / 0) (#127)
by jonathanclark on Thu Dec 12, 2002 at 10:02:42 AM EST

If you are programming for Win32, then you can use this to add transparent compression to any program without any source code changes. I have some people using it with XML. It's nice because you can work with text in development, and then when you are ready to ship, apply Thinstall and everything gets 10 times smaller with no work. http://thinstall.com (I'm the author)

[ Parent ]
Average case? (none / 0) (#110)
by LegionDaMany on Fri Dec 06, 2002 at 01:15:01 PM EST

Your reasoning on the "average case" of integers vs. ASCII representations is flawed. You state:

So on average, the ASCII representation is not more wasteful than the binary representation. In extreme cases it is about twice as wasteful.

You're forgetting that the numbers which are wasteful in their ASCII representations are far more plentiful than the numbers which are equal or smaller in their ASCII representations. To see the true average case, one would need to encode every commonly representable number in both binary and ASCII format and compare the waste or savings. As the table below shows, for any real application the savings are very quickly outweighed by the cost.

0-9 ... 10 * -3 bytes = -30
10-99 ... 90 * -2 bytes = -180
100-999 ... 900 * -1 byte = -900
1,000-9,999 ... 9,000 * 0 bytes = 0
10,000-99,999 ... 90,000 * 1 byte = 90,000
100,000-999,999 ... 900,000 * 2 bytes = 1,800,000
1,000,000-9,999,999 ... 9,000,000 * 3 bytes = 27,000,000
10,000,000-99,999,999 ... 90,000,000 * 4 bytes = 360,000,000
100,000,000-999,999,999 ... 900,000,000 * 5 bytes = 4,500,000,000
1,000,000,000-4,294,967,295 ... 3,294,967,296 * 6 bytes = 19,769,803,776

So, the binary representation of every single possible unsigned 32-bit integer would require about 17.18 GB of storage. The ASCII representation of the same data would require about 41.84 GB (this does not include the necessary overhead of separating the numbers so they can be parsed correctly, which is not a problem in the binary representation). By this measure, in the average case the ASCII representation is about 2.44 times as large as the binary representation of 32-bit unsigned numbers. In extreme cases, you reach the limit of the ASCII representation being 2.5 times larger than the binary representation. I leave 16-bit and 64-bit numbers as an exercise for the reader.
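The totals can be recomputed directly by summing the digit-length bands of the table:

```python
# Total bytes to store every unsigned 32-bit integer: binary uses a
# fixed 4 bytes per value; decimal ASCII uses one byte per digit.
binary_total = 2**32 * 4

ascii_total = 0
for digits in range(1, 11):
    lo = 10 ** (digits - 1) if digits > 1 else 0
    hi = min(10 ** digits - 1, 2**32 - 1)  # clamp the top band at 2^32 - 1
    ascii_total += (hi - lo + 1) * digits

print(binary_total, ascii_total, round(ascii_total / binary_total, 2))
# prints: 17179869184 41838561850 2.44
```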

Everything is a tradeoff in programming. Speed for size ... development speed for wages ... Pepsi for time spent to go get it. I think XML is a good standard for what it is intended for ... representation of structured data meant for portability between systems. It was not designed with speed or efficiency in mind. It has some definite advantages in portability and standardization, but those come at a cost ... and the major cost is size of the data file.



Call me Legion for I am Many ...
Already covered (none / 0) (#111)
by marx on Fri Dec 06, 2002 at 05:24:07 PM EST

It was not designed with speed or efficiency in mind. It has some definite advantages in portability and standardization, but those come at a cost ... and the major cost is size of the data file.
This was discussed in a previous thread, with a very similar argument to the one you constructed. Still, it's interesting to see different ways of reaching the 2.5 figure.

What you're not addressing is that while the speed and efficiency of XML might be less than for a binary representation, it's sufficient for the absolute majority of applications. The binary representation you're promoting is not the optimal representation, there are binary representations which store more compactly. So why are you not promoting them instead? It's because they're too complex. Today, dealing with binary at all is too complex. It's not necessary and is basically just a waste of time to satisfy some kind of neatness fetish.

There are applications which require binary representations, and for those I do not promote XML.

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

I apologize ... (none / 0) (#114)
by LegionDaMany on Sat Dec 07, 2002 at 12:58:24 AM EST

... for being redundant.

With that said, I did not say anything as to whether or not XML was sufficient for any particular task. I was stating that your reasoning about XML's relative efficiency misrepresented the average case.


Call me Legion for I am Many ...
[ Parent ]
PNG images aren't XML for a reason (none / 0) (#121)
by pin0cchio on Sun Dec 08, 2002 at 09:29:44 PM EST

Today, dealing with binary at all is too complex. It's not necessary and is basically just a waste of time to satisfy some kind of neatness fetish.

I hope you weren't trying to imply an "always" in those sentences. There is a reason that PNG images don't look like the following (majority of data omitted):

<png>
<row><pixel>165</pixel><pixel>67</pixel> <pixel>230</pixel><pixel>229</pixel></row>
<row><pixel>15</pixel><pixel>255</pixel> <pixel>230</pixel><pixel>252</pixel></row>
</png>

It's that XML isn't really all that nice for storing two-dimensional arrays.

There are lots of other things XML isn't good for, like storage of data structures on small embedded systems and video game consoles.


lj65
[ Parent ]
umm... (none / 0) (#123)
by Rot 26 on Sun Dec 08, 2002 at 10:41:01 PM EST

notice: he said at the end of his post "There are applications which require binary representations, and for those I do not promote XML."
1: OPERATION: HAMMERTIME!
2: A website affiliate program that doesn't suck!
[ Parent ]
Actually (none / 0) (#124)
by marx on Mon Dec 09, 2002 at 03:30:24 AM EST

You could do it like this:
<png>
<pixelData>
[base64-encoded binary data goes here]
</pixelData>
</png>
In the case of PNG, the binary data would also be compressed. An alternative for other formats would be to use compressed XML instead.

The advantage of doing it this way is that all the metadata can be represented as XML. You would then have all the strengths of XML, while keeping the efficiency of binary for the raw bulk data.

If the 30% waste of base64 is too much, I'm sure an alternative encoding could be created, for example based on Unicode.

Join me in the War on Torture: help eradicate torture from the world by holding torturers accountable.
[ Parent ]

Bzzzt (none / 0) (#112)
by p3d0 on Fri Dec 06, 2002 at 09:19:20 PM EST

Your logic only applies if all numbers appear with equal frequencies in the data being stored. This never, ever happens in practice, unless you're storing random numbers like crypto keys.

The computation for "average" must account for distribution.
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
[ Parent ]

What is "distribution" in the average case? (none / 0) (#113)
by LegionDaMany on Sat Dec 07, 2002 at 12:55:06 AM EST

Distribution in the average case, except in the presence of a specific set of data, which was not given, is random. For sets approaching infinite size, random distribution becomes even distribution. Thus, my point.

Or would you submit that in the domain of all possible applications (seeing as the author holds XML forth as a general solution) that some numbers occur with greater frequency? If so, please back up your claim as to what the standard distribution is.


Call me Legion for I am Many ...
[ Parent ]
Bzzt again (none / 0) (#117)
by p3d0 on Sat Dec 07, 2002 at 11:17:52 AM EST

For sets approaching infinite size, random distribution becomes even distribution.
This is as absurd as it is meaningless. If you don't know the distribution, then you don't know it. A uniform distribution is no more likely than any other to be the right one. So why would you choose a uniform distribution out of the universe of possibilities?
Or would you submit that in the domain of all possible applications (seeing as the author holds XML forth as a general solution) that some numbers occur with greater frequency? If so, please back up your claim as to what the standard distribution is.
Are you claiming that the number 2,039,472 appears just as often in normal usage as the number 5? If so, then I think it's safe to say the burden of proof lies with you.
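A quick simulation makes the point concrete (the skewed distribution is an arbitrary illustration, not a claim about any particular application):

```python
import random

random.seed(1)

# Average decimal-ASCII length of a value under two distributions:
# uniform over all 32-bit values, versus a small-number-heavy one.
uniform = [random.randrange(2**32) for _ in range(100_000)]
skewed = [int(random.expovariate(1 / 1000)) for _ in range(100_000)]

def avg_digits(values):
    return sum(len(str(v)) for v in values) / len(values)

print(avg_digits(uniform))  # nearly 10 digits: ASCII loses badly vs 4 bytes
print(avg_digits(skewed))   # around 3 digits: ASCII beats 4-byte binary
```

Under the uniform assumption ASCII is over twice the size of binary, but if small numbers dominate, as they often do in practice, the average ASCII value is actually shorter than a fixed 4-byte integer.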
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
[ Parent ]
This assumes even distribution (none / 0) (#130)
by Alhazred on Fri Dec 13, 2002 at 08:36:46 AM EST

If you actually start examining real-world applications, my bet is you'll discover that most programmers are not really using the full range of their 32-bit numbers most of the time.

I recall this same sort of discussion WRT FORTH, which has no float or double at all, only integers. As it turns out, in practice floats and doubles are useless (merely a convenience). Most programmers have no concept of number systems, really...
That is not dead which may eternal lie And with strange aeons death itself may die.
[ Parent ]

A Real Test (none / 0) (#129)
by epepke on Fri Dec 13, 2002 at 03:17:24 AM EST

OK, this isn't much, but I now have some test data from a real application. I would have posted it earlier, but I hadn't yet written the printer for one of the cases.

The file in question is a small apartment, with four rooms and a closet, a few lights, and only a little bit of furniture (a desk and some kitchen counters). It's essentially a snapshot of an internal object-oriented format. For the purposes of comparison, I've removed leading spaces, which I had put in to make it easier to debug but in any event could be recovered with a simple pretty-printer, so they will probably not be in the final release. The content contains fairly long ASCII representations of numbers in both cases, which are mostly decimals near a multiple of one-tenth. Yeah, I'll probably eventually replace it with a better way but that's how it is now.

I'm comparing an XML representation with a representation as a LISP S-expression (Scheme plus extensions for circular references, objects, and bindings). The results are as follows:

ULR7test4.xml is 30754 bytes
ULR7test4.ss is 11045 bytes


The truth may be out there, but lies are inside your head.--Terry Pratchett


XML-ASN.1 provides good info on this subject (none / 0) (#134)
by jrst on Sun Jan 05, 2003 at 03:39:07 PM EST

Efficiency has to be considered a relative measure in this case; there are too many potential measures: encoded size?  encode/decode speed?  encoder/decoder size?  programmer effort?

But relative to what?  ASN.1 provides a good comparison, since it is the only other mechanism with similar capabilities that is in widespread use.

Both XML and ASN.1 provide for encoding of typed/structured data.  Both provide for separation of encoding and specification.

ASN.1 has been around for many years.  Many ASN.1 proponents will tell you there's no reason for XML--ASN.1 can do everything XML does, and do it more efficiently.

There are proof points that can be used on either side of the debate, depending on how you define "efficiency".

A good place to start is a Google search:
    xml efficiency "asn.1"

A synopsis using a very limited set of data can be found at: http://www.dsv.su.se/jpalme/abook/comparison.pdf


XML and verbosity | 137 comments (132 topical, 5 editorial, 0 hidden)