
XSLT, Perl, Haskell, & a word on language design
By tmoertel in Technology, Tue Jan 15, 2002 at 06:05:37 AM EST
Tags: Software

Here's a simple problem that I encountered a year or so ago when working on a project that required the conversion of XML documents into LaTeX. The problem is simple, but the solution is surprisingly difficult and reveals something about the tools we use. Sound interesting? Well, then, come along for the journey. I can't promise a pot of gold at the end, but you might have some fun along the way.

Here's the problem:

There are XML documents that contain information that must be published as LaTeX documents. Not only must the structure of this information be transformed into the structure of LaTeX markup, but also, since XML and LaTeX have different text encodings, a proper solution must translate between the encodings. For example, in LaTeX the ampersand (&) and dollar sign ($) characters have special meanings, and so when translated from XML these characters must be escaped as \& and \$. Similarly, the text portions of XML documents often contain text-markup idioms (like the three-character sequence "(C)" for copyright) which should be translated into the proper LaTeX representations ("\copyright{}"). The problem is how best to perform these character and idiom substitutions.

The solution is surprisingly difficult.

One important tidbit: When I originally encountered this problem, the requirement was for a solution that was 100%-pure according to the XSLT 1.0 specification. No extensions such as EXSLT or vendor enhancements to XSLT processing engines were allowed, nor were escape-hatch calls to other languages. The solution had to be 100% XSLT 1.0, nothing more, and that constraint will apply for the remainder of the story. (This is an important thing to remember when you find yourself asking why on earth I was willing to go so far out of my way to find an XSLT-only solution.)

With these things in mind, let's turn to XSLT itself.

A brief look at XSLT

XSLT was designed as a language for transforming XML documents into other documents (perhaps also XML). In essence, XSLT "transformations" are defined as collections of templates that match portions of input documents and correspondingly emit portions of output documents. An XSLT processing engine reads in a bunch of these templates and then applies them (usually recursively) to input documents to yield output documents. Thus, XSLT seems perfect for our task (and it ought to because that's what it was designed to do). We can simply add a new "text-processing" template to our existing transformations and define it to match the text portions of our input documents, perform our substitutions, and emit the results.

But, alas, the "perform our substitutions" part is more difficult than it would appear. XSLT suffers from fundamental weaknesses that make many common tasks exceedingly difficult, and this task is one of them.

Like most standards, the XSLT 1.0 standard gets some things right and other things wrong. The designers of XSLT made a single decision that accounts for much of the language's strengths and weaknesses. They decided that XSLT transformations (i.e., "programs" in the XSLT language) should themselves be well-formed XML documents.

From this decision the language gains the ability to play alongside other XML documents, to be manipulated with common XML tools, and to benefit from advances in related XML standards. In short, it easily integrates into the XML-programmer's world and gains all benefits thereof.

However, this integration comes at a cost: Verbosity. Terrible verbosity. The signal-to-noise ratio of XSLT transformations is shameful, easily among the worst of all computer languages in widespread use. Non-trivial XSLT transformations almost appear obfuscated.

But verbosity isn't what makes our problem so difficult to solve, although it does get in the way.

Another decision that the XSLT designers made (and one that I support) is that XSLT transformations should be free of side effects. Like purely functional programming languages, XSLT does not allow programmers to modify state. Expressions like $a=$a+1 (which changes the state of the variable $a when executed) are verboten.

This decision means that XSLT templates behave like pure functions. They take input, and they return output, all without changing the state of the outside world. This means that templates are predictable: given the same inputs, a template will always produce the exact same output. Every time. That's a good thing. (There are other benefits, like ease of garbage collection, parallelism, re-entrancy, etc., but I won't get into them here.)

The really bad thing is that the designers of XSLT made the language free of side effects but then failed to include fundamental support for basic functional programming idioms. Without such support, many trivial tasks become hell. Our text-substitution problem is a good example.

Welcome to the trivial text-substitution problem

Our problem can be restated like so: Create a function (or XSLT template) that takes some text and returns its converted equivalent, in which we have substituted the appropriate replacement for each occurrence of a target phrase. Thus, given the following targets and replacements,

    TARGET    REPLACEMENT
    &         \&
    $         \$
    (C)       \copyright{}

we can imagine a function doSubstitutions that when called with the string

    & (C) $

it will return

    \& \copyright{} \$

Since the list of targets and their replacements is large and likely to change, a smart programmer is apt to factor it out of the function that implements the substitution process. This makes the list easier to manage and also makes the substitution process easier to understand and reuse. Thus, in Perl, one might create a hash of substitutions to hold the list of target-replacement pairs:

my $substitutions =
{
    #   TARGET-REGEX    REPLACEMENT-STRING
    qw|
        &               \&
        \$              \$
        \(C\)           \copyright{}
    |
    # a bunch more
};

The corresponding perform-the-substitutions function would then be straightforward. An implementation could simply iterate through the substitution pairs in the hash, performing each substitution globally across the input text (changing the text in the process, but that's okay because Perl allows side effects). After the last iteration, the function could simply return the last state of the text:

sub doSubstitutions($) {
    my $text = $_[0];
    while (my ($target, $replacement) = each %$substitutions) {
        $text =~ s/$target/$replacement/g;
    }
    return $text;
}

Easy as pie.

Okay, but can you do it without side effects?

Remember when I said that XSLT doesn't allow for side effects? That constraint might seem to make our task more difficult, but really it doesn't. Consider, for example, a Haskell version of our solution.

Like XSLT, Haskell doesn't allow side effects, but (unlike XSLT) it has no problem with our task. First, let's define our list of substitutions again, Haskell style:

substitutions =

--     TARGET    , REPLACEMENT

[ (    "&"       , "\\&"           )
, (    "$"       , "\\$"           )
, (    "(C)"     , "\\copyright{}" )

-- a bunch more substitution rules

]

Now, let's build a function that performs the substitutions:

doSubstitutions =
foldr1 (.) (map (uncurry substitute) substitutions)

That's it.

(Here substitute target replacement str is an external function that replaces each occurrence of target with replacement in str.)

Without going into too much detail -- and there's a lot going on behind the scenes -- the above line of Haskell code does two things. First, the (map ...) part converts each target-replacement pair in the list of substitutions into a function that performs the substitution. For instance, the pair ("&","\\&") becomes a function that replaces each occurrence of "&" in an input string with "\\&" and returns the resulting string. (Remember that in Haskell, like in other functional programming languages, functions are first-class values that can be created and manipulated by the programmer, just like data can in other languages.)

Second, the foldr1 (.) part takes the list of functions generated in the first part and glues them together to yield a single function that performs all of the substitutions at once. (The Haskell composition operator (.) behaves much like Unix shells' pipe operator (|). The foldr1 function "folds" the composition operator through the list of substitution functions to yield the all-in-one function.)

Easy as pie. Easier, even.
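For the curious, here is one way the whole Haskell solution could look end to end. The article leaves substitute as an external function, so the definition below is a minimal illustrative sketch (a naive left-to-right string replacer, assuming non-empty targets), not the original code:

```haskell
import Data.List (isPrefixOf)

-- A naive string-for-string replacer: replaces every occurrence of
-- target in the input, scanning left to right. (Illustrative sketch;
-- assumes target is non-empty.)
substitute :: String -> String -> String -> String
substitute target replacement = go
  where
    go [] = []
    go s@(c:rest)
      | target `isPrefixOf` s = replacement ++ go (drop (length target) s)
      | otherwise             = c : go rest

substitutions :: [(String, String)]
substitutions =
  [ ("&",   "\\&")
  , ("$",   "\\$")
  , ("(C)", "\\copyright{}")
  ]

-- Turn each pair into a substitution function, then compose them all
-- into one function, exactly as in the article.
doSubstitutions :: String -> String
doSubstitutions = foldr1 (.) (map (uncurry substitute) substitutions)
```

With these definitions, doSubstitutions "& (C) $" evaluates to "\& \copyright{} \$". One subtlety worth noting: foldr1 (.) applies the last rule in the list first, so with this rule list the order happens not to matter, but in general a replacement that contains another rule's target would get escaped a second time.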

Okay, then, what's it look like in XSLT?

First, let's declare our substitution list in XSLT:

<xsl:variable name="substitutions">
<substitutions>
<sub-rule>
<target>&amp;</target>
<replacement>\&amp;</replacement>
</sub-rule>
<sub-rule>
<target>$</target>
<replacement>\$</replacement>
</sub-rule>
<sub-rule>
<target>(C)</target>
<replacement>\copyright{}</replacement>
</sub-rule>
</substitutions>
</xsl:variable>

So far, so good. (Yes, it's bloated, but this is XML. Verbosity comes with the territory. In any case, this is bloat I can live with. The truly nasty stuff is yet to come.) Now, let's build a template from the substitution list, just like we did in Haskell.

Oops. XSLT doesn't let you create templates programmatically. Templates aren't first-class citizens.

Okay, then, let's create a single template that "walks" through the substitution list, applying each in turn, just like we did in the Perl version. After all, there's more than one way to skin a cat, right?

Maybe.

When we create a document snippet, like we did inside the definition of the substitutions variable, the snippet is presented to us as a "result tree fragment." The important thing to understand about result tree fragments is that they are atomic and opaque. Meaning, we can't iterate over their contents, peek inside, or do anything useful at all -- except emit them as output. Now, this is a whopping big limitation, and while it was intended to make the job of implementing an XSLT processing engine easier, it's just about universally despised within the XSLT programming community. (I guess that's why the XSLT 2.0 working draft does away with result tree fragments and also why EXSLT's "common" module includes the core function exslt:node-set(), which converts a result tree fragment back into a useful set of nodes.)

That's strike two. What next? Well, it's not fun, but maybe we really must build the do-the-substitutions template by hand, manually expanding the substitution list into a series of chained templates, each of which performs one of the substitutions:

<xsl:stylesheet>

<xsl:template name="doSubstitutions">
<xsl:param name="input" select="."/>
<xsl:param name="output">
<xsl:call-template name="substitute">
<xsl:with-param name="input" select="$input"/>
<xsl:with-param name="target">&amp;</xsl:with-param>
<xsl:with-param name="replacement">\&amp;</xsl:with-param>
</xsl:call-template>
</xsl:param>
<xsl:call-template name="doSubstitutions2">
<xsl:with-param name="input" select="$output"/>
</xsl:call-template>
</xsl:template>

<xsl:template name="doSubstitutions2">
<xsl:param name="input" select="."/>
<xsl:param name="output">
<xsl:call-template name="substitute">
<xsl:with-param name="input" select="$input"/>
<xsl:with-param name="target">$</xsl:with-param>
<xsl:with-param name="replacement">\$</xsl:with-param>
</xsl:call-template>
</xsl:param>
<xsl:call-template name="doSubstitutions3">
<xsl:with-param name="input" select="$output"/>
</xsl:call-template>
</xsl:template>

<xsl:template name="doSubstitutions3">
<xsl:param name="input" select="."/>
<xsl:param name="output">
<xsl:call-template name="substitute">
<xsl:with-param name="input" select="$input"/>
<xsl:with-param name="target">(C)</xsl:with-param>
<xsl:with-param name="replacement">\copyright{}</xsl:with-param>
</xsl:call-template>
</xsl:param>
<xsl:value-of select="$output"/>
</xsl:template>

</xsl:stylesheet>

(Again, we call an external template named "substitute" to perform the individual string-for-string substitutions.)

Let's take a closer look.

• First, the template named doSubstitutions escapes ampersands in the input text and then calls doSubstitutions2 on the result.
• Second, doSubstitutions2 escapes dollar signs in its input text and calls doSubstitutions3 on its results.
• Finally, doSubstitutions3 converts occurrences of "(C)" in its input text into "\copyright{}" and returns its results, as is, without further chaining.

It works. But, man, that code is not pretty. And it is not maintainable. Just think, the substitution list I've been using as an example is only three elements long. Real lists can easily contain a hundred elements. Imagine what the chain of code looks like for a real list! Scary.

So what's a diligent coder to do?

Well, here's what I did:

• I broke out my substitution rules into a separate XML file, in much the same format as in my result-tree-fragment example earlier.
• I wrote a separate XSLT transformation to "compile" the substitution rules into a chained series of XSLT templates. That is, I wrote some code that wrote the really nasty chained code for me. (Because the substitution rules were part of the compilation transformation's input document -- in fact, they were the input document -- the individual rules were input nodes, not result tree fragments, and thus available for manipulation.)
• I included the automatically generated code into my XML-to-LaTeX XSLT transformations and used it as needed.
• And I used a simple Makefile to keep everything up to date.

Problem solved.
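My actual rule compiler was itself an XSLT transformation, but the idea is easy to sketch in a few lines of Haskell (an illustrative reconstruction, not the code I used): given the rule list, emit one chained template per rule, with the last template in the chain emitting $output instead of calling a successor.

```haskell
-- Illustrative code generator: expands a rule list into the chained
-- XSLT templates shown earlier. Targets/replacements are assumed to
-- be pre-escaped XML text (e.g. "&amp;").
templateName :: Int -> String
templateName 1 = "doSubstitutions"
templateName i = "doSubstitutions" ++ show i

emitTemplate :: Int -> Int -> (String, String) -> String
emitTemplate total i (target, replacement) = unlines $
  [ "<xsl:template name=\"" ++ templateName i ++ "\">"
  , "<xsl:param name=\"input\" select=\".\"/>"
  , "<xsl:param name=\"output\">"
  , "<xsl:call-template name=\"substitute\">"
  , "<xsl:with-param name=\"input\" select=\"$input\"/>"
  , "<xsl:with-param name=\"target\">" ++ target ++ "</xsl:with-param>"
  , "<xsl:with-param name=\"replacement\">" ++ replacement ++ "</xsl:with-param>"
  , "</xsl:call-template>"
  , "</xsl:param>"
  ] ++
  ( if i < total  -- chain to the next template, or finish
      then [ "<xsl:call-template name=\"" ++ templateName (i + 1) ++ "\">"
           , "<xsl:with-param name=\"input\" select=\"$output\"/>"
           , "</xsl:call-template>" ]
      else [ "<xsl:value-of select=\"$output\"/>" ] ) ++
  [ "</xsl:template>" ]

compileRules :: [(String, String)] -> String
compileRules rules = concat (zipWith (emitTemplate (length rules)) [1 ..] rules)
```

Calling compileRules on the three example rules reproduces the chained templates from the previous section; a hundred-rule list is no more work than a three-rule one, which is the whole point of generating the code.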
That's the story.

Well, almost. In a comment regarding an earlier draft of this story, jperret pointed out that XSLT's document() function can be used to load external XML documents as node sets, not as opaque result tree fragments, and thus could be used to load my substitution list.

Using this tip, another solution became possible. I could have stored the substitution list in an external XML file and then loaded it into a variable, like so:

<xsl:variable
name="substitutions"
select="document('substitutions.xml')"/>

Then, having access to the elements in the list, I could "walk" through the substitutions, like I did in the Perl example (but without using side effects), yielding the following (untested) XSLT implementation of doSubstitutions:

<xsl:template name="doSubstitutions">

<xsl:param name="input"/>
<xsl:param name="index" select="1"/>

<xsl:choose>

<!-- see if we're done -->

<xsl:when test="$index &gt; count($substitutions/sub-rule)">
<xsl:value-of select="$input"/>
</xsl:when>

<!-- otherwise, perform the next substitution -->

<xsl:otherwise>

<!-- get next target and replacement -->

<xsl:variable name="sub"         select="$substitutions/sub-rule[$index]"/>
<xsl:variable name="target"      select="$sub/target"/>
<xsl:variable name="replacement" select="$sub/replacement"/>

<!-- recurse, passing the results of the substitution as the new input -->

<xsl:call-template name="doSubstitutions">
<xsl:with-param name="input">
<xsl:call-template name="substitute">
<xsl:with-param name="input"       select="$input"/>
<xsl:with-param name="target"      select="$target"/>
<xsl:with-param name="replacement" select="$replacement"/>
</xsl:call-template>
</xsl:with-param>
<xsl:with-param name="index" select="$index + 1"/>
</xsl:call-template>

</xsl:otherwise>

</xsl:choose>

</xsl:template>

Well, it's different, and it saves the separate compilation step, but it's still hard to understand. Compare it to the Perl version, at seven straightforward lines. Or compare it to the Haskell version, a single line of pure, wonderful code. Either way, XSLT seems bloated. In particular, I had to roll my own support for the common recursion-with-an-accumulator idiom. Since side-effect-free programming makes frequent use of this idiom, it's a shame that XSLT doesn't support it more directly.

The moral of the story

If there is a lesson to be learned, it's that the design of domain-specific languages is hard. The XSLT designers created a language that, while rich in domain-specific functionality, lacked much of the basic functionality necessary to make it genuinely suited for its intended purpose. While the designers probably left "generic" functionality out of the spec on the grounds that XSLT was never intended to be a general-purpose programming language, they failed to realize that even simple document transformations often require a little nuts-and-bolts programming. Leaving out the nuts and bolts made XSLT a half-broken language. So, if you ever design a domain-specific language, don't forget the nuts and bolts.

XSLT, Perl, Haskell, & a word on language design | 129 comments (116 topical, 13 editorial, 0 hidden)

Discussion of this article on XML-DEV (4.33 / 6) (#5) by Carnage4Life on Tue Jan 15, 2002 at 03:27:04 AM EST

I posted an earlier draft of this article on the XML-DEV mailing list and some discussion has ensued.

With open arms . . .
(4.00 / 4) (#27) by tmoertel on Tue Jan 15, 2002 at 10:45:50 AM EST

I had originally posted this story as a diary (which I must admit was pretty rough in places), and that's what is presently being discussed. Some of my favorite responses, so far:

One comment reads:

    This [the diary's conclusion] is a "broken" conclusion, and I don't have time yet to read it, but I would guess therefrom that it is a "broken" article. [...] And speaking of extensibility, the well-considered features for extending XSLT are another key part of its success. The flourishing of projects such as EXSLT so soon after the advent of XSLT 1.0 is quite telling.

And what are they telling you? ;-)

Another post from the same author, this time in response to a post that questioned XSLT's "phenomenally well-designed"-ness:

    I think we'd all be better off if there were far fewer and far better programmers. Call me a snot, but IMHO A programmer who cannot understand the basic divide-and-conquer algorithmic imperatives that are the foundation of computer science, and that are properly enforced by functional programming, should be quarantined from *any* computer language.

(I think he's referring to me.)

Another response includes:

    This article is the rough equivalent of a review of giving a toaster oven a bad review because it won't wash dishes.

Touche!

Finally, another author's post includes the following:

    I personally don't care whether or not Moertl uses a standards compliant tool to perform this task. Use what works. If I were tasked with his problem, I'd use XSLT to transform to TeXML and then the IBM alphaWorks tools to transform that to TeX. Of course then I wouldn't be able to publish whining articles about how a hammer's a horrible tool based on my experiment using it to scale fish.

All in all, a warm reception, wouldn't you say?

Maybe they have a point?
(4.25 / 4) (#29) by BehTong on Tue Jan 15, 2002 at 11:26:03 AM EST

OK, so people aren't being very polite and everything, but I think they do have a point. As other comments have said, XSLT isn't exactly designed to do transformations into LaTeX. If *I* were given the same task (and if I had the choice), I would not use XSLT to do this -- I'd write a SAX parser or something and output LaTeX cleanly. That way, I can use whatever language is most convenient for the problem -- probably Perl or C++, since I would have more flexibility in doing text transformations.

I think people are using XSLT way beyond what it's designed for. IMHO, XSLT in its current form is suitable only for simple formatting XML transformations (aka XML-->HTML type processing). If you want to do more involved stuff, I believe that's what SAX is intended for. :-)

(OTOH, I do agree with you that XSLT could be better, though... after trying to do some serious stuff with it, I find myself running into brick walls frequently, and I think I'll just resort to SAX. Same problems as you faced: verbosity, extreme verbosity, and the lack of built-in functions to do simple, fundamental text-processing stuff.)

Agreed, sort of . . . (4.33 / 3) (#35) by tmoertel on Tue Jan 15, 2002 at 12:29:35 PM EST

Certainly, XSLT was not intended for XML-to-LaTeX transformations. However, the problem I describe -- how to transform the text inside of the tags -- is a general problem which also occurs in XML-to-XML and XML-to-HTML transformations. (I elaborate in another comment, in which I explain that I had to do the exact same substitutions for XML-to-HTML work.) When transforming between document models, the difference between models is not limited to element-level structure but often runs into the intra-element structure, where XSLT is blind.

In any case, I think it was a fun programming exercise and hope you didn't mind my re-telling of it.

Cheers,
Tom

Text transformations (4.33 / 3) (#41) by BehTong on Tue Jan 15, 2002 at 01:09:11 PM EST

    When transforming between document models, the difference between models is not limited to element-level structure but often runs into the intra-element structure, where XSLT is blind.

Exactly. This is what I was getting at -- XSLT basically is designed only to handle those cases where the only difference between document models is the tag structure. Text manipulation? Forget it.

In my book, a language is NOT qualified to be labelled "text-processing" unless it has built-in support for regular expressions. Emphatically built-in. Using external libraries only means text-processing is secondary and therefore not fundamentally supported. You can't do text-processing effectively and cleanly without this. There's just no way around it. Readability is a poor excuse -- regexes don't have to use the traditional UNIX syntax -- you can easily make "readable" regexes using more verbose representations of the basic operations of concatenation, alternation, and Kleene star, and the various syntactic sugar constructions thereof. (Although the pervasiveness of the UNIX syntax surely testifies to the usefulness of having a concise, compact syntax.)

Built-In? (4.00 / 1) (#53) by Matrix on Tue Jan 15, 2002 at 04:21:09 PM EST

What exactly do you mean by built-in? Yes, Perl has support for them as an operator... But is there really much practical difference between that and regex functionality that's part of the core APIs? (Like Python.) I've used both for text processing, and found them to be roughly equivalent. Sure, one has to use a few more statements to use the regex with Python, but that also resulted in a regex object that you could re-use in different places in your code.

So, to get back on topic, what do you mean by built-in? Do Python's regex classes count?
If not, would a C++ regex library that implemented Perl-style regex operators count? Or does only a language that includes them in its default operator set, like Perl, count? (And isn't Perl really just doing the same thing as the C++ library?)

Oh, and I agree that XSLT isn't really designed for generic text processing. Though for transforming between different tag-based languages (like XML and HTML), even when the structure of the tags is quite different, it's great. Far faster than writing regular expressions to do the same thing, just like yacc and lex (or their derivatives) are better for the tasks they're designed for than (say) a solution using Perl's regex support.

"Built-in"-ness (3.00 / 1) (#60) by BehTong on Tue Jan 15, 2002 at 05:18:47 PM EST

Well OK, I'm biased towards Perl. I'd say the language in question should build regexes into its core syntax. (After all, we're talking about languages meant for text-processing here.) The main reason being that having regexes as part of the core syntax makes them a lot more convenient to use -- you don't have to declare regex variables and then compile/initialize them and whatnot. You want to concentrate on the text-processing algorithms, not on the mechanics of it. Just like you want to concentrate more on the algorithms by using a high-level language instead of worrying about machine instructions and register allocations in assembly language.

And since Perl 5.005 (or was it 5.6?), it's not just having the syntactic sugar to support regexes -- they are (almost) treated like first-class citizens. The qr// operator rocks. Oh yes, and the old trick of blessing a regex into an object.

Finally, you can't dispute the convenience of being able to write stuff like:

SWITCH: {
    m/[0-9]+/     && do { ... ; last; };
    m/[a-z_A-Z]+/ && do { ... ; last; };
    m{[-+*/]}     && do { ... ; last; };
    ...
}

You just gotta love the guarded-expression-like semantics of this. Just like Haskell's case blocks. (Sorry for the bad formatting, Scoop isn't cooperating here.)

Now, if only Perl would let you mathematically compose regexes... *evil grin*

bias obvious (none / 0) (#91) by kubalaa on Wed Jan 16, 2002 at 05:23:02 AM EST

What you're saying is that the language should support extra syntax just for handling regexps. Which doesn't make sense, since regexps aren't special; they're just a short way of expressing functions on strings, and the "expressing" part is easily written in a string. (Of course the language supports strings, right?) And since a regexp is a function, it should be applied like any other function; no need for special syntax there.

My theory is that if you can express something just as concisely and clearly without "builtin support," then the builtin support is just adding extra clutter. Not to mention confusing people about what actually goes on. That is, theoretically x =~ foo should be the same as foo(x), but it's not, because for some mysterious reason regexps are "special" just because they're defined more compactly. Of course the first version looks confusing, which is why I don't miss it in other languages which only have one way of applying functions.

Extra clutter? (none / 0) (#97) by BehTong on Wed Jan 16, 2002 at 10:53:06 AM EST

Sorry, I don't think I understand what you're trying to say here. Your example of x =~ foo being the same as foo(x) doesn't back up your point, because the whole point of having regexes in core syntax means that "foo" can be a regex literal. I.e., you can write x =~ /some-regex/. If another language allows you to write /some-regex/(x) instead, then regexes are equally a part of the core syntax as in Perl. But if not, then it's the other language that's cluttered, since you would have to explicitly define "foo" before you can apply it to the string x, e.g. regex foo = new regex("some-regex"); foo(x).

My beef is not with the possibility of using regexes in a language -- heck, I can write regexes in assembly language too. With enough time and effort. But for practical purposes, if I'm supposed to be doing a lot of text-processing, then I prefer a language that lets me express regexes as directly as possible, rather than a language that requires circumlocutions to say what I want, even though both languages may be functionally equivalent in terms of regex support. And to me, Perl's syntax is closer to allowing me to express string manipulations concisely than, say, Java. But this is just my preference; I don't claim that Perl syntax is the absolute best. (It's not -- the weird defaulting to $_ when you don't bind a regex with =~ or !~ is just awkward. But good luck convincing Larry to change that :-P) I'm just saying that a language that allows you to write regex literals with minimal overhead is more conducive to writing text-processing applications.
 Does that mean.... (none / 0) (#100) by joto on Wed Jan 16, 2002 at 12:10:57 PM EST

...that you would be happy with a C++ library that allowed you to do:

foos = RegExp("\bfo*\b").count_matches(line);

...since you would then be allowed to create anonymous regexps? Or in Scheme:

(let ((foos (count-matches (regexp "\bfo*\b") line))) ...)

etc?
 I suppose (none / 0) (#104) by BehTong on Wed Jan 16, 2002 at 02:13:03 PM EST

I suppose that would be better. But that still begs the question. It's still a circumlocution. I'm not saying that's bad; you can't blame C++ for having syntax like that, because it wasn't designed to be a text-processing language. But can you imagine writing a heavily mathematical program and having to write "a+2*sqrt(b+c^2)" as "Math.add(a, Math.product(2, Math.squareRoot(Math.Add(b, Math.exponent(c,2)))))"?

I'm not saying C++ (or any other language) is bad because you have to write regexes in such a verbose way -- but don't call it a text-processing language if it requires you to jump through hoops to write basic text-processing operations. Regexes ought to be a core, primitive text-processing operation. Anything that calls itself a text-processing language ought to have regexes and any other text-processing primitive as a core part of its syntax, since that's what it's supposed to do best. It's just like you can write pure OO programs in C, but people don't call C an OO language for a reason.
 Operators (none / 0) (#107) by Matrix on Wed Jan 16, 2002 at 03:09:34 PM EST

One of my points, if you'll remember, was that the issue of what is "builtin" in a language with operator overloading gets a lot less clear. What if the regex library has some operator syntax to let me easily apply a regex to a string? (Say, string >> regex.) Would that be enough to have turned it into a text-processing language?

I think calling something a text-processing language based on its features isn't really a good way to go about things. Rather, if the language was designed primarily to process text (not tags, as XSLT was), then it's a text-processing language, right?
 circumlocution (5.00 / 2) (#110) by joto on Wed Jan 16, 2002 at 04:28:11 PM EST

 I actually had to look that word up (I am not a native english speaker) :-) Merriam Webster says the use of an unnecessarily large number of words to express an idea evasion in speech I assume you mean 1 ;-) Well, I guess you can tell I am in the "adapt your general purpose programming language to suit your needs", rather than the "build a special-purpose language with ad-hoc mechanisms insufficient for general-purpose programming"- camp, but I just thought I'd tell you anyway. On the other hand, I really enjoy the word adapt, as I'd much rather enjoy programming in languages that has flexible syntax through the use of a macro-system such as Common Lisp's or Scheme's, than to write horribly contorted code. can you imagine writing a heavily mathematical program and having to write "a+2*sqrt(b+c^2)" as "Math.add(a, Math.product(2, Math.squareRoot(Math.Add(b, Math.exponent(c,2)))))"? Well, if you wrote a new numeric class in Java, such as one for large floating point numbers, interval arithmetic or something else, you would have to write it exactly as above (with the provision of having to replace Math with BigFloat or IntervArith, or something else). I would tend to say that that would be going overboard, but you can get away with it. e.g in Scheme you would write:  (+ a (* 2 (sqrt (+ b (expt c 2)))))  which is not that different (although a lot more readable, due to the lack of package identifier, and the fact that parenthesises match a whole expression). Another option would be Forth, which uses postfix syntax:  c 2 expt b + sqrt 2 * a +  The fact is that Lisp users and Forth users actually view this generality in syntax as an advantage, and not a shortcoming. But not everyone agrees (certainly not everyone are using a lisp-dialect or forth-dialect). Another option would be to define new operators, or overload existing ones. 
Modern functional languages such as Haskell or ML allow you to define new operators, so you can write new number classes (or regexp classes) with convenient syntax. C++ also allows operator overloading (although it doesn't allow you to create new operators), and preprocessor macros, so if it's the particular C++ variant you were commenting on (I noticed you didn't comment on the Scheme variant), then you could easily transform it into something less readable with two simple definitions:

  #define _(rx) RegExp(#rx)
  int operator += (RegExp r, string s) { return r.count_matches(s); }

which would allow you to rewrite

  foos = RegExp("\bfo*\b").count_matches(line);

as

  foos = _(\bfo*\b) += line;

which strikes me as almost as unreadable as Perl, but possibly useful if one is very scared of typing. On the other hand, I would also like to see handling of regexps much more general than Perl's. How about:

  r1 . r2   concatenation of regexps
  r1 | r2   union (anything that matches r1 or r2)
  r1 & r2   intersection (anything matching both r1 and r2)
  ! r1      negation (anything that doesn't match r1)
  r1 \ r2   difference (anything that matches r1 but not r2)
  r1 *      Kleene star
  r1 ?      well, you get the idea...
  r1 {1,4}  you still know what I mean...
  r1 *?     non-greedy Kleene star
  r1 > r2   positive lookahead assertion

and so on. Something like this (with some renaming of operators, and making postfix operators prefix instead) could easily be added to C++, Haskell, or ML (and would fit right in without any special support in Lisp or Forth). [ Parent ]
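The operator set proposed above is more than wishful thinking: with Brzozowski derivatives, intersection and negation are as easy to implement as union, and difference falls out by combination. A minimal, unoptimized sketch in Haskell — the type and function names here are mine, not from any library:

```haskell
-- A regex algebra with union, intersection, and complement,
-- matched via Brzozowski derivatives.
data Re = Empty        -- matches nothing
        | Eps          -- matches only the empty string
        | Chr Char
        | Cat Re Re    -- concatenation  (r1 . r2)
        | Alt Re Re    -- union          (r1 | r2)
        | And Re Re    -- intersection   (r1 & r2)
        | Not Re       -- complement     (! r1)
        | Star Re      -- Kleene star
  deriving (Eq, Show)

-- Does the regex accept the empty string?
nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Cat a b) = nullable a && nullable b
nullable (Alt a b) = nullable a || nullable b
nullable (And a b) = nullable a && nullable b
nullable (Not a)   = not (nullable a)
nullable (Star _)  = True

-- Brzozowski derivative: the regex matching the rest of the
-- word after consuming the character c.
deriv :: Char -> Re -> Re
deriv _ Empty      = Empty
deriv _ Eps        = Empty
deriv c (Chr x)    = if c == x then Eps else Empty
deriv c (Cat a b)
  | nullable a     = Alt (Cat (deriv c a) b) (deriv c b)
  | otherwise      = Cat (deriv c a) b
deriv c (Alt a b)  = Alt (deriv c a) (deriv c b)
deriv c (And a b)  = And (deriv c a) (deriv c b)
deriv c (Not a)    = Not (deriv c a)
deriv c r@(Star a) = Cat (deriv c a) r

-- A word matches iff the derivative by the whole word is nullable.
matches :: Re -> String -> Bool
matches r = nullable . foldl (flip deriv) r
```

Difference is then just And r1 (Not r2). A production version would add simplification rules so the derivative terms don't grow with each character, which is what makes this approach practical.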
 note (none / 0) (#118) by kubalaa on Thu Jan 17, 2002 at 09:19:56 AM EST

 "almost as unreadable as Perl, but possibly useful if one is very scared of typing" <- I love this line. :)

Shouldn't difference be "a / b" ? (that is, the slash going the other way) [ Parent ]

 / vs \ (none / 0) (#127) by joto on Sat Jan 19, 2002 at 09:45:25 PM EST

 No, I was thinking of set theory, where \ is commonly used for difference. / is usually used for division, so that would be unfortunate; most people would probably be happier with -. [ Parent ]
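Whichever glyph wins, difference need not be a primitive operator at all: it is derivable from intersection and complement of the matched languages (standard set notation, not from the thread):

```latex
L(r_1 \setminus r_2) \;=\; L(r_1) \cap \overline{L(r_2)}
```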
 More "general" regexes (none / 0) (#121) by BehTong on Thu Jan 17, 2002 at 11:34:06 AM EST

 On the other hand, I would also like to see handling of regexps much more general than Perl's. Actually, Perl's regexes are quite general already, more so than, say, egrep's or flex's. How about: r1 . r2 concatenation of regexps; r1 | r2 union (anything that matches r1 or r2) Erm, isn't this already supported by Perl? Unless you're talking about something like:

  $a = qr/[0-9]+/;
  $b = qr/[a-zA-Z]+/;
  $c = qr/$a$b/;
  $input =~ /$c/;

This has been supported since Perl 5.005. (OK, I'm not sure if you can actually use a qr// variable inside another qr//; if not, that could be room for improvement.) r1 & r2 intersection (anything matching both r1 and r2); ! r1 negation (anything that doesn't match r1); r1 \ r2 difference (anything that matches r1 but not r2) Yikes. Supporting these operators is desirable (hey, I've wished for them too!) but very difficult, and possibly very inefficient. The main problem is in (mathematically) negating regular expressions, which is necessary if you want to allow arbitrary combination of these operators with everything else. In general, the negation of a regex is a LOT more complex than the regex itself (ref. any text on computational theory), which in turn means that runtime speed will suffer. You're probably better off manually using !~ instead of =~. But of course, it would be very nice if Perl (or any other language) automatically did this behind the scenes. Nevertheless, if you combine a negated regex with a non-negated regex, things could get quite ugly regardless of how many optimizing tricks you use in the implementation. But, as Larry says, "easy things should be easy; hard things should be possible", so somebody could try to come up with a feasible implementation of these operators :-) r1 * Kleene star; r1 ? well, you get the idea...; r1 {1,4} you still know what I mean...; r1 *? non-greedy Kleene star; r1 > r2 positive lookahead assertion Again, already in Perl, unless you're talking about the qr// literals, which I *think* can be combined like this. 
If not, that'll be something cool that can be added to Perl. (Someone should send this list to Larry for Perl 6 :-P) Beh Tong Kah Beh Si! [ Parent ]  does it really need to be inefficient? (none / 0) (#128) by joto on Sat Jan 19, 2002 at 11:02:17 PM EST  I wasn't aware of the qr// syntax. It's good to know Perl already has it! It uses variable interpolation instead of funky operators/functions, but that's OK, since that's how it would be implemented anyway. As for efficiency, note that Perl regexps are already far from being the DFA regular expressions you see in an automata textbook. Regexp matching in Perl is already NP-complete, so if someone wants to make them into a whole Turing-complete sub-language, I am not going to protest (as long as they still look like regexps, and keep the common case fast). I am not sure why non-matching should be that hard. Sure, if you are going to create a new regexp string from the old one, you are going to run into trouble, but why can't you use the regexp engine that's already there, and simply invert the output? (Maybe I'm saying something stupid here; let me know if I am...) Non-greedy regexps already need to look at two regexps at the same time, the currently matching one and the next. With negative regexps, you will also need to look behind in some cases (unless the matching is anchored somewhere), but I don't think there should be any problem doing that and still keeping the common case fast (hmm, it could be an advantage to be able to match backwards...). But then again, I'm certainly not an expert on Perl regexps, so maybe I'm totally off base about how this is implemented. Intersection/difference could be implemented by matching one of the regexps first, and then checking whether the next one matches. Sure, it would be a little bit slower, but it should be tolerable if you really want it. (PS: I hardly think Perl 6 needs more suggestions at this time... 
Stuff like this can be added as libraries later without too much trouble.) [ Parent ]  Real world example (none / 0) (#123) by ucblockhead on Thu Jan 17, 2002 at 03:03:54 PM EST  For what it's worth, the boost regex library for C++ allows you to do this:  if( regex_match(String, Results, regex("\\bfo*\\b") ) ) { ...stuff  ----------------------- This is k5. We're all tools - duxup[ Parent ]  cluttered (none / 0) (#117) by kubalaa on Thu Jan 17, 2002 at 09:08:18 AM EST  We disagree on the definition of cluttered. Personally I find it more readable when regexps aren't defined every place they're used. I still see your point that, because they're usually so short, it's handy. I wouldn't call it necessary by any means, and if you're doing something besides regexps, the decreased use of special syntax can greatly improve readability. That was, I guess, my point: syntax should be kept to an absolute minimum. [ Parent ]  Absolutely minimal syntax (none / 0) (#122) by BehTong on Thu Jan 17, 2002 at 12:17:24 PM EST  Well, I don't agree that syntax should be absolutely minimal. It shouldn't be overly redundant, yes, and things such as the number of reserved keywords or the total number of syntactic constructs should be kept in check (*cough* Certain Languages *cough* freely reserve just about every word in your vocabulary... and people wonder why programmers are bad at naming variables :-P). But sometimes some overlap is good, because some things are better expressed a different way than normal, although both mean the same thing. For example, theoretically there's no difference between an IF statement and a WHILE loop that runs exactly zero or one (and not more) times. And every FOR loop can be written as a WHILE loop. Every switch statement can be written as a series of IF statements, which in turn can be recast as WHILE loops. So the only control structure you'll ever need, theoretically, is just the WHILE loop. 
Except for recursion, you can throw out function calls too, since you can inline everything anyway. (You can even simulate recursion with, guess what, a WHILE loop, if you really want to.) Why have all these other redundant constructs, when a single WHILE loop construct is sufficient to express all the others anyway? Nevertheless, we still want the other, "redundant" constructs in spite of the overlap between them, because the logic of our code is better expressed that way. Otherwise we might end up with a beast like ... this one. And I don't think that's a good thing :-) Beh Tong Kah Beh Si! [ Parent ]  minimal required to express intent (none / 0) (#125) by kubalaa on Fri Jan 18, 2002 at 04:43:45 AM EST  It's like Occam's razor; you want the minimal syntax required without sacrificing semantic clarity. "if", "while", "for", and "switch" are all semantically different (when used intelligently, at least), and this is relevant because it reflects the kinds of potential changes that can occur. (It should be noted that I don't miss "switch" in Python, for example, but I don't think it's bad to have.) This is contrasted with regular expressions, which are not semantically different. So in this case, the different syntax obscures semantic similarity in addition to adding syntactic complexity, whereas in the case of different control structures, the different syntax clarifies a genuine semantic difference and is therefore justifiable. [ Parent ]  Regular expressions (none / 0) (#102) by jacob on Wed Jan 16, 2002 at 01:20:03 PM EST  By that do you mean text-processing support is only good if it uses non-standard syntax and strange semantics with respect to the rest of the language? In PLT Scheme, all regular expression support is in a library. You say (regexp-match "star-squiggle-ampersand-yuck" my-string) and it produces a list of the matches. It's clear, clean, and doesn't involve you having to know any more syntax or semantics rules. 
I fail to see why that's a disadvantage. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  Wow! (3.71 / 7) (#6) by Weezul on Tue Jan 15, 2002 at 03:37:24 AM EST  A tech-related story that I gave +1. Impressive. Allow me to say: Haskell rocks hardcore! Personally, I feel that anyone creating a new domain-specific language today without first trying to build it as a Haskell extension is a fool. 1. The language is amazingly powerful and flexible due to monads. As an example, you can easily write a Haskell library that lets you write parsers in Haskell. Ultimately, Haskell is more suitable to extension via libraries than almost any other language. 2. The existing Haskell compilers are amazingly powerful and produce fast code. That fast code is still bottlenecked by compiling through C, but you could (probably) change the compiler to emit your own byte code without losing too much efficiency. 3. Haskell code is amazingly quick to write once the pieces are in place, as in a domain-specific setting. When (not if) the user needs to do something you did not anticipate, then point 1 comes into play in full force and they can extend it themselves to do anything they need. 4. Finally, Haskell's monads are such a powerful idea that they can easily simulate an imperative programming environment. A useful tool for programmers who are new to functional languages. Haskell is really the only language I have ever programmed in where it felt like every bit of code was necessary.. and nothing was just wasted garbage for the compiler. btw> Actually, the Haskell language itself does seem to change a bit, but these modifications seem to be more powerful ways of defining types. btw2> Unfortunately, profiling your code is more important in Haskell than in other languages, since the super-smart compiler might not be quite as clever as you had hoped in some situations. 
"Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini  monads = useful tool (4.50 / 4) (#14) by kubalaa on Tue Jan 15, 2002 at 06:43:07 AM EST  I don't consider myself extremely stupid, and I'm not completely foreign to functional programming, but it took me a couple days of reading technical papers before I felt I understood monads really well enough to use them and know exactly what was going on. To say they're a useful tool for procedural programmers is wrong; you can't understand monads if you don't already have a good grasp of functional programming. Now, the IO monad does look and act procedural, but you shouldn't be using it if you don't understand how and why it works; it's certainly not there to help people write Haskell without knowing fp. [ Parent ]  Why I don't like pure functional programming (much (3.25 / 4) (#25) by greenrd on Tue Jan 15, 2002 at 09:48:13 AM EST  OK first off, I'll admit, I use XSLT for small tasks, and the elegance of assignment-free programming is somewhat attractive. (Although it's a bit of a stretch to call XSLT "elegant" - let alone a "functional language", since, as the article noted, it doesn't even support composition of XML-to-XML functions!) But I research object databases (when I'm not wasting time on the net, which means, almost never ;). Now just think of all the business and scientific apps that need to store significant amounts of persistent data - and need to store, crunch and search it fast. We're talking a lot of mission-critical apps here. The problem I have with assignment-free languages like Haskell is that, as I see it, you have two choices to deal with non-trivial amounts of persistent data: Create a horrible imperative kludge with monads or similar which you need a PhD to understand, and is no longer assignment-free at the application level. 
So that's cheating, really - you've no longer got the purported benefits of assignment-free programming at the application level. OK, so, off the top of my head - everything is transparently persistable, and use lots of virtual memory - with a hack to say that the master disk image (containing the value representing the entire database) only needs to be updated on transaction commit (which would be a side-effect). Problem 1: performance. I would be interested to hear if anyone has tried this for non-trivial apps, because I doubt it would be at all scalable. Basically, as far as I can see, assignment-free languages give you too little control for apps which need lots of persistent storage. Therefore I think most comp.sci students should be taught imperative languages as a priority, with assignment-free languages a minor part of the course, if they're mentioned at all (even if Haskell might be more intellectually challenging than Java, Java is more generally applicable, I believe). "Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes[ Parent ]  Don't get me started! (4.00 / 3) (#33) by jacob on Tue Jan 15, 2002 at 12:02:08 PM EST  Persistent data in functional languages is a big challenge, absolutely. But that doesn't mean that functional programming is a terrible idea, just that it doesn't work for DB stuff. Even in a program that manages lots and lots of database stuff, most of the programming logic isn't the imperative task of updating the database. It seems to me that Haskell's downfall in this area is that its paradigm is so inflexible that it won't let you drop the FP mode when you need to interact with the database: as you mention, you have to use some strange monadic beast with journal articles for a manual with no benefit except that you preserve somebody's idea of mathematical chastity. 
That's why I think impure functional languages are the best fit for database programs. Scheme, for instance, is as functional as you wanna be in regular code, but has the capacity to be totally imperative when that's appropriate. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  Wolfram's "New Science" (3.50 / 2) (#36) by greenrd on Tue Jan 15, 2002 at 12:36:37 PM EST  Interestingly, Stephen Wolfram, the creator of Mathematica, claims to have created - and this is actually the title of his book - "a new kind of science" based on algorithms. He thinks that algorithmic processes are much better at modelling the real world than the equations that most scientists use today. Sounds like a false dichotomy, but still - perhaps in future we'll develop far more powerful mathematical proof techniques for imperative programming, and the idea of assignment-free functional programming will seem rather quaint. Just speculating here. "Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes[ Parent ]  Have you read (4.00 / 3) (#38) by wiredog on Tue Jan 15, 2002 at 12:48:07 PM EST  "From Dawn To Decadence" by Barzun? The algorithm vs equation argument sounds like a modern day Pascal vs Descartes argument. Fascinating book. Peoples Front To Reunite Gondwanaland: "Stop the Laurasian Separatist Movement!"[ Parent ]  Amen to that. (4.00 / 1) (#75) by Jacques Chester on Tue Jan 15, 2002 at 08:24:02 PM EST  Dawn to Decadence is an absolute gem. It helped me to appreciate the incredible depth of the dreadfully unfashionable "Western" culture. Hardly easy reading though. 
Jacques Barzun has about 20 IQ points and 60 years of reading on me :) -- In a world where an Idea can get you killed, Thinking is the most dangerous act of all.[ Parent ]  Wolfram (4.00 / 2) (#43) by Weezul on Tue Jan 15, 2002 at 01:43:44 PM EST  I don't think Wolfram knows what he is talking about. Science has to do with people thinking about things. Scientists should use whatever tools they are good with. Anyway, using a language like Haskell is hardly "thinking about equations" in the sense of science. Haskell is just one tiny baby step towards scientific thought. It seems to me that if Wolfram is right and scientists change how they think, programmers would change how they think too, and the meeting point would be far closer to science than Haskell. Anyway, I don't think Wolfram knows what he is talking about. I consider myself pretty good at thinking about imperative processes, and I can tell you there are massive cultural (or perhaps even biological) advantages to thinking about "equations." Understand, if languages want to reach the level of cleanness of thought that equations enjoy, then you will start seeing the possibility of writing a large program and compiling it the first time without any bugs. I'm not going to hold my breath for this to happen in any language, Haskell, Java, etc. The bottom line is that equations allow you to know what you're talking about and are fundamentally a tool for communication. Programming languages allow you to carry out tasks and are fundamentally a tool for doing. You are talking real leaps in AI or real sacrifices in scientific advancement to make these closer. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." 
- Benito Mussolini[ Parent ]  Bug-free programs (5.00 / 2) (#66) by BehTong on Tue Jan 15, 2002 at 06:19:11 PM EST  Understand, if languages want to reach the level of cleanness of thought that equations enjoy, then you will start seeing the possibility of writing a large program and compiling it the first time without any bugs. Uhm, I'm sorry, but unless you're talking about solving the problem of the human factor in programming, this will never happen. Expressing all the little details required for a working software program as an equation is no mean feat, and I seriously doubt anyone can come up with the precise, correct equation without any testing/debugging, let alone get it right on the first try. Cleaner/purer/better languages put a cap on how stupid programming mistakes can get. They do not, and cannot, increase software quality directly. The machine does what it's told, not what you mean; as long as humans make mistakes, there will be bugs in software. You can limit the amount of damage possible -- in assembly language, you can really screw up the machine badly; in C, at least you don't hit stuff like cold reboots unless you deliberately try to; in Java, you can't leak memory that easily unless you try real hard -- but nevertheless, no matter how hard you try, you can still write a program that accidentally deletes a bank customer's account when he tries to deposit something. You still get people who misunderstand the specs, or interpret them in a way you didn't expect. Or maybe it's just a plain ole slipup, where that equation should have a "+ f(x)" instead of a "- f(x)". No programming language is going to help you prevent that. It's the programmer(s) that determine(s) the software quality; the programming language merely reduces (or increases) the chance of silly (or malicious) mistakes and helps the programmer express his/her intentions. 
The problem of buggy programs is not a problem with the programming language (although some languages are easier to slip up in, such as assembly language). The problem is the human factor, and until you address that, bugs will still be around. Beh Tong Kah Beh Si! [ Parent ]  I agree (2.00 / 1) (#69) by Weezul on Tue Jan 15, 2002 at 07:01:51 PM EST  Programs are doing something significantly different from equations. Scientific thought and the equations it uses are simple enough to be bug-free and easy to communicate, while engineering and programs must deal with all the real-world complexities, be they approximations, bugs, etc. The paragraph you quoted was supposed to make Wolfram's idea sound silly. :) Still, I do not consider his idea totally impossible. I cannot guess at what another million years of evolution (or much less, with brain implants) could do to our abilities to think clearly about complex things. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini[ Parent ]  Impossible? (none / 0) (#94) by BehTong on Wed Jan 16, 2002 at 10:29:56 AM EST  Still, I do not consider his idea totally impossible. I cannot guess at what another million years of evolution (or much less, with brain implants) could do to our abilities to think clearly about complex things. Well, in that case, we have solved the human factor I was talking about. Personally I have doubts whether this will actually happen -- but if it does, then certainly we will not have software problems like we do today. (But will software even exist then?) Nevertheless, this still has nothing to do with the programming language; it has everything to do with how humans handle programming, which was my original point. Beh Tong Kah Beh Si! [ Parent ]  Programming in eQuations. (none / 0) (#124) by DGolden on Thu Jan 17, 2002 at 06:24:46 PM EST  Programs are doing something significantly different from equations. hmm... yes and no... 
Have a play with Q sometime... From the Q docs: Q stands for "equational", so Q, in a nutshell, is a programming language which allows you to "program by equations". You specify an arbitrary system of equations which the interpreter uses as rewrite rules to reduce expressions to normal form. It's quite an interesting language. Don't eat yellow snow[ Parent ]  Databases (3.66 / 3) (#42) by Weezul on Tue Jan 15, 2002 at 01:21:09 PM EST  Haskell's database access works great without wasting lots of memory or dropping the functional aspects. Database access uses a monad like anything else external (perhaps they keep it in the IO monad to keep things synchronized with user IO). Indeed, Haskell is theoretically a far better language for database access, since you can actually implement an SQL equivalent in Haskell and have the Haskell compiler pre-optimize your queries. None of this idiotic screwing around with SQL query strings, and none of the security holes it produces. Unfortunately, the current implementations may all use strings to keep traditional programmers happy. Anyway, I find that I often need to simulate functional stuff in other languages when writing database applications, but this may be a result of the fact that I'm never using a good database like Oracle, so any convoluted SQL queries to do it are impossible or amazingly slow. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini[ Parent ]  Monads.. (4.33 / 3) (#45) by Weezul on Tue Jan 15, 2002 at 01:45:04 PM EST  ..can give you persistent data easily.. or more interesting types of data. There are some complexities regarding composition of persistent data via monads of monads, but these are not a big deal for most programmers. You don't seem to know much about Haskell. Monads *are* assignment-free and hence not a kludge. That is the whole point. There are other functional languages with kludges, but Haskell does it cleanly. 
Monads really are not that hard to understand. Your type is tracking what assignments you are capable of making. This is no philosophical sacrifice if you were restricting yourself to object-oriented code anyway. Sure, monads are hard to understand if you do not like functional languages, but object-oriented programming is hard to understand if you're a hardcore assembly-language programmer who does not like abstraction anyway. I know I once tried to explain C++ to an assembly-only guy (a very smart guy, I might add). He refused to understand it my way, so he had to compile several C++ programs to assembler and read the results. Monads only seem difficult for cultural reasons.. and like everything in programming, you can just start working with them and you will be converted. Your virtual-memory objection is totally absurd. Yes, monads often do simulate the value of the entire world, entire database, etc., but only an idiot would assume that this has anything to do with their memory size. The monad just makes the database update, writes to the screen, or returns the keystroke. The monad type information ensures that nothing gets skipped due to laziness and keeps these accesses synchronized. You would not expect your disk-access object in an object-oriented language to copy the whole disk into memory, would you? It's also foolish to attack Haskell for speed or memory and then go use a poorly optimized domain-specific language. I'm not saying Haskell is the right language for the Linux kernel. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini[ Parent ]  Actually (3.50 / 2) (#49) by greenrd on Tue Jan 15, 2002 at 02:49:14 PM EST  You're right, I don't know anything about Haskell. I really don't know what I'm talking about, I must admit. Anyway, I think I have a simpler objection. Anything which explicitly writes to disk (e.g. 
a transaction commit) something that can be read back again by the same process is, in effect, an assignment statement. Therefore you no longer have an assignment-free language. Which is not necessarily a bad thing - but it's not clear to me how referential transparency is then maintained. The only way I could see it working (thinking in terms of ACID transactions here) is if each transaction is in effect a separate process, and you have referential transparency within a single transaction, but not between transactions. Or perhaps I've misunderstood referential transparency. "Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes[ Parent ]  Try this (4.00 / 1) (#56) by Weezul on Tue Jan 15, 2002 at 04:46:24 PM EST  OK, the Haskell IO monad handles all IO activities. You can think about Haskell programming as writing a program which returns a value in the IO monad, and you can think of this "value" as a procedural or machine-code program which actually does all the work. Haskell is almost a meta-language at this point, and you can more easily see where all the power comes from. This way of thinking about Haskell code has caused me some trouble, say when the compiler did not unroll recursions that I wanted unrolled, but it's still fun to think about things this way, and you can fix those kinds of problems by using the profiler when you care about speed. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini[ Parent ]  The frustration of the IO monad (4.00 / 3) (#62) by blamario on Tue Jan 15, 2002 at 05:36:16 PM EST  The problem with the IO monad is that the only place it can be used is at the top level of your application. 
 A typical IO-interactive Haskell program reads some data through an IO monad, then processes this data in a purely functional way, and finally writes the results back into the environment. Loop back to the reading phase. Many applications easily fit into this framework, but not all. One example, already mentioned, is the class of database applications. Reading the needed data from your database when you need it is very often the most natural thing to do. Imagine that, at some point in the purely functional part of your program, you need the value of (fn + db), where fn is a pure function and db is a result of a database query. Well, you can't get the value of db just like that; it must be prepared in advance. But sometimes you can't tell in advance what kind of data you'll need for your computations. The most frustrating thing is that db might be a side-effect-free database query, but Haskell can't know that. What I'd like to have in Haskell is an "I monad", i.e. an input-only IO monad that can return values and can be called from purely functional expressions. [ Parent ]  the problem with an I monad: (4.50 / 4) (#63) by jacob on Tue Jan 15, 2002 at 05:48:36 PM EST  When you're dealing with the outside world, just because you don't change something doesn't mean it doesn't change. In particular, it isn't necessarily true that f = f if f is a function that, for example, executes and returns the answer to a SQL query, because someone might update the database in between calls. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  Shure (3.00 / 1) (#73) by Weezul on Tue Jan 15, 2002 at 07:35:13 PM EST  Procedural programs do not handle that kind of thing very well either. That's why you have locking and shit. In Haskell the monad keeps IO and locking in order. The only difference is that you specify the order of execution by working with the monad instead of by the order of lines of code. 
Actually, there is some reasonable chance that Haskell could be adjusted to handle locking better, i.e. you could after the fact remind it that you wanted all the locking to be done at the right times. I donno.. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini[ Parent ]  That's not what I meant (none / 0) (#78) by jacob on Tue Jan 15, 2002 at 11:46:03 PM EST  The original poster wanted an "I-only" monad that could be used within pure functions on the assumption that input-only shouldn't pollute the types. I was just demonstrating why even monads restricted to input only still introduce state and thus still should preclude a function from being pure. That's all. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  The IO monad isn't that bad (5.00 / 3) (#65) by tmoertel on Tue Jan 15, 2002 at 06:01:49 PM EST  The problem with the IO monad is that the only place it can be used is on the top level of your application. This isn't strictly true. IO actions (and other monadic values) can be constructed anywhere. IO actions, in particular, are threaded down from the top level during execution, but I have never observed this to be a limiting factor. One example, already mentioned, is the class of database applications. Reading the needed data from your database when you need it is very often the most natural thing to do. Imagine that, at some point in your purely-functional part of your program, you need the value of (fn + db) where fn is a pure function and db is a result of a database query. Well you can't get the value of db just like that, it must be prepared in advance. But sometimes you can't tell in advance what kind of data you'll need for your computations. I don't think Haskell has any fundamental limitation in this area. People have written web servers, interactive animation systems, and robot-control systems in Haskell. 
If you want to read information from a database, just do it. The result will, naturally, be a value of type IO a, and so the containing function will have to return an IO action. But that's not a big deal, especially since the type system is there to keep you honest. The most frustrating thing is that db might be a side-effect-free database query, but Haskell can't know that. What I'd like to have in Haskell is an "I monad", i.e. an input-only IO monad that can return values and can be called from purely functional expressions. You're missing something here. Even if your database query is truly side-effect-free, you would still want to place the query action in the IO monad in order to have guarantees on the order of execution w.r.t. all other IO actions. If your query is really free of side effects and free of order-of-execution dependencies (and you wish to take the burden of proof for these statements upon yourself), you can pierce the monad with (in Haskell) unsafePerformIO. --My blog | LectroTest [ Disagree? Reply. ][ Parent ]  An example (none / 0) (#77) by blamario on Tue Jan 15, 2002 at 10:51:35 PM EST  Thanks for telling me about unsafePerformIO; that's something I needed before but didn't know existed. I don't think Haskell has any fundamental limitation in this area. People have written web servers, interactive animation systems, and robot-control systems in Haskell. If you want to read information from a database, just do it. The result will, naturally, be a value of type IO a, and so the containing function will have to return an IO action. But that's not a big deal, especially since the type system is there to keep you honest. A web server is essentially a request-response loop. It fits the model easily. Same for the other two examples you mention. Maybe my example of a more difficult problem was too abstract, so here's another one: a compiler.
A compiler might seem an ideal task for a functional language: a single text file on input, a single value on output, no interactivity. In the case of ISO Pascal, that would be true. However, any modern language allows imports or includes of other source files, and that requires some more IO. And the problem here is that you (generally speaking) can't tell which files should be imported until you parse the importing file. A natural solution could be to read and parse the first file first, and during the later tree traversal to recursively read and parse the imported modules. The IO monad won't allow this. So in order to write the program the Haskell way, after the parse you must create an IO action that reads the imported files and return the (tree, action) pair to the top program level to execute the action. Then you can go back to the parse tree and the newly read imported files, parse them and so on. It's not that hard to find a proper solution, but I don't like tweaking a sensible algorithm to make it fit the language I'm using. Writing a compiler is usually hard enough without annoyances like these. [ Parent ]  Shell (none / 0) (#83) by Weezul on Wed Jan 16, 2002 at 01:07:59 AM EST  The Haskell compiler is written in Haskell... file IO is just not that big a deal. Unfortunately, it's a language which has evolved over time, so the compiler does not use all the newest coolest stuff. I know the compiler uses a YACC clone for Haskell instead of a parser monad. Anyway, I can do you one better: there is a shell written in Haskell. There are also very nice GUI toolkits, database libraries, and COM stuff (CORBA too?). Monads allow you to make fully object oriented code if you want; hence the COM stuff. The truth is that functional algorithms are just as "sensible" (or more so) as their procedural counterparts once you learn to think that way. "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." 
- Benito Mussolini[ Parent ]  Yes (2.00 / 1) (#64) by greenrd on Tue Jan 15, 2002 at 05:54:53 PM EST  That's exactly how I thought IO monads worked, thanks. :-) "Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes[ Parent ]  As someone who's only heard of Haskell in passing: (3.66 / 6) (#10) by Inoshiro on Tue Jan 15, 2002 at 04:44:17 AM EST  And knowing that Haskell, not being included in every Linux distro and BSD in the base set, is not as widespread or as widely known as Perl or Python, what can you tell me about it? I'd like to see some info on this language. It looks efficient :) -- [ イノシロ ]  haskell.org [nt] (2.75 / 4) (#12) by boxed on Tue Jan 15, 2002 at 06:27:05 AM EST  [ Parent ]  Haskell (4.50 / 4) (#13) by vrai on Tue Jan 15, 2002 at 06:40:15 AM EST  I had the 'joy' of being taught Haskell at university (using a cut-down version called Gofer). It is a purely functional language (i.e. it's based on lambda calculus) and requires a different approach to programming than imperative languages (i.e. 'normal' ones like C/C++, Java, Python etc ...). Everything is recursive and the whole thing ends up looking like a functional specification (which is one of the language's selling points, apparently). On the upside it has polymorphic typing and uses lazy evaluation. Basically, if you pass the result of one operation into a second operation, the first operation is only evaluated as far as the second one needs its data. This allows you to work on 'infinite' data sets of which a subset is calculated as and when you need it. If you like functional spec'ing then you'll love Haskell. But if, like me, the thought brings on near fatal flashbacks then you'd be best off limiting your use of Haskell to the odd imported module in C. For more info see http://www.haskell.org/. 
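The lazy evaluation vrai describes can be illustrated by analogy with Python generators (Python standing in for behavior Haskell gives you by default; the generator below is an invented example, not anything from the thread):

```python
from itertools import islice

def naturals():
    """An 'infinite' data set: 0, 1, 2, ... produced only on demand."""
    n = 0
    while True:
        yield n
        n += 1

# Nothing is computed yet: this just wires producer to consumer.
squares = (n * n for n in naturals())

# The producer runs only as far as the consumer needs -- here, 5 values.
first_five_squares = list(islice(squares, 5))
```

In Haskell every expression behaves this way without asking; in Python you have to opt in with generators, but the "compute a subset of an infinite set as and when you need it" effect is the same.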
[ Parent ]  OCaml (4.00 / 4) (#17) by caine on Tue Jan 15, 2002 at 06:56:37 AM EST  As vrai says above, Haskell isn't the most practical or nice language if you're used to more "normal" languages. However if you're interested in this kind of solution, I would recommend OCaml (http://caml.inria.fr/), a functional, object-oriented and VERY fast language. Its speed can match that of C and C++, which really isn't bad. --[ Parent ]  In the typical programmer's phrase... (3.50 / 6) (#11) by TheophileEscargot on Tue Jan 15, 2002 at 04:45:35 AM EST  "Why do you want to do it that way?" From talking to various XSLT evangelists, I get the impression that they're getting carried away with the fact that it's a Turing-complete language. OK, but that doesn't mean it's the best way to do things. It seems to me that for most purposes it's difficult to use, a nightmare to maintain, and not particularly efficient. XSLT? Just say no, kids! ---- Support the nascent Mad Open Science movement... when we talk about "hundreds of eyeballs," we really mean it. Lagged2Death  comment from the first draft (4.40 / 5) (#15) by kubalaa on Tue Jan 15, 2002 at 06:48:14 AM EST  I said something similar in the first draft. The author countered that text processing is often necessary along with XML processing. I agree it's handy; I think the coolest thing would be a way to pipe text streams through external programs and get back XML, within an XSLT document. But that wouldn't be terribly portable. My solution (having had this problem myself) is to instead embed the XSLT processing in a more usable language like Python. Python handles the text processing first and any transformations easily done in a SAX stream, then it can hand the result through XSLT for more interesting node transformations, then gets the results and does whatever it wants (e.g. writes to a file). 
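kubalaa's embedding pipeline -- fix up the text first, do the node-level transformation second, then write the output -- can be sketched with just the Python standard library. (In a real pipeline the middle step would invoke an XSLT engine such as lxml's etree.XSLT; renaming an element via ElementTree stands in for it here so the sketch stays self-contained, and the document and tag names are invented.)

```python
import xml.etree.ElementTree as ET

raw = "<doc>Copyright (C) 2002 Joe Q. Public &amp; friends</doc>"

# Phase 1: text-level fixes that are painful in pure XSLT 1.0,
# e.g. promoting the "(C)" typewriter idiom to the real character.
cleaned = raw.replace("(C)", "\u00a9")

# Phase 2: structural transformation. A real pipeline would run an
# XSLT stylesheet here; a trivial root-element rename stands in.
root = ET.fromstring(cleaned)
root.tag = "article"

# Phase 3: do whatever we want with the result, e.g. write to a file.
result = ET.tostring(root, encoding="unicode")
```

The point is the division of labor: string munging happens where string munging is easy, and XSLT only ever sees well-formed XML.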
[ Parent ]  Coolness and usefulness (4.60 / 5) (#18) by TheophileEscargot on Tue Jan 15, 2002 at 07:34:28 AM EST  One of the great physicists (forget which) used to make fun of the "equations should be beautiful" school by saying in lectures: "This equation would be more elegant if this minus was a plus. However, that would be wrong." One of the criteria of coolness, I think, is that one little thing does a huge amount of work; but I think that can easily lead to functionality being munged together where it's not really appropriate. Better to decompose something into two easy parts than have one monster kludge. So, I think you've got the right idea with the embedding. XSLT is all very well in its place... but bloody well leave it there! ---- Support the nascent Mad Open Science movement... when we talk about "hundreds of eyeballs," we really mean it. Lagged2Death[ Parent ]  Heh (4.60 / 5) (#23) by DesiredUsername on Tue Jan 15, 2002 at 09:32:08 AM EST  Along those lines, there's an apocryphal tale about a junior engineer. He thinks a certain problem can be solved quickly and all the senior engineers tell him they tried and it can't be. He goes off and comes back with something. Junior: This solution covers most of the cases and only runs in 10 seconds. Senior: If there's no requirement that it cover all the cases, I can make it run in zero seconds. Play 囲碁[ Parent ]  "Turing complete" (4.25 / 4) (#32) by ucblockhead on Tue Jan 15, 2002 at 11:54:15 AM EST  I get the impression that they're getting carried away with the fact that it's a Turing-complete language. The proper response is to point out that Brainfuck is also a Turing-complete language. ----------------------- This is k5. We're all tools - duxup[ Parent ]  And also: (4.00 / 1) (#50) by mikera7 on Tue Jan 15, 2002 at 02:55:18 PM EST  Smetana is Turing-complete, although the lack of IO functions tends to make it even less practical. 
Interesting challenge to prove this BTW :-) You have to use the initial and final state of the program as input/output and assume that an infinitely recursive "Go to" statement counts as "halt". [ Parent ]  Turing-completeness (4.66 / 3) (#61) by jacob on Tue Jan 15, 2002 at 05:28:12 PM EST  Anyone who ever defends any language by saying it's Turing-complete should be laughed at. Ideally, pointing should also be involved. They should feel socially awkward, as though they've just inadvertently made everyone aware of what morons they are. Because they have. Remember: every language you're ever likely to use is Turing-complete, and a great many more languages you'd never want to use for anything are, too. C is Turing-complete. Assembly language is Turing-complete. Hell, Turing-machine language is Turing-complete. That doesn't mean you want to write a data transformation in it. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  Heh (3.00 / 1) (#72) by Weezul on Tue Jan 15, 2002 at 07:25:17 PM EST  That is one of the things I love about Haskell. It's very easy to extend the language by only adding libraries. A nice example of this was the parser monad. Your Haskell parsers looked almost exactly like a YACC parser, except that you can add regexes or more powerful (Turingish) constructs. YACC's lack of regexes in the grammar is similar to your Turing complaint. Sure, context-free grammars can simulate regexes, but it's a pain in the ass (and not all the things you want to make into regexes make sense as tokens). "Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini[ Parent ]  CFGs (none / 0) (#81) by Robert Uhl on Wed Jan 16, 2002 at 01:01:05 AM EST  There's no inherent reason why a context-free grammar may not support regexps. Remember, the language is just shorthand--and any useful shorthand is, well, useful. 
It'd actually be a pretty cool hack to try adding something like this to bison... I was on quite the CFG kick a few months ago when writing the file-parsing bits of travlib. My kind of CS! [ Parent ]  Turing-complete (none / 0) (#79) by wiml on Wed Jan 16, 2002 at 12:37:39 AM EST  I didn't know XSLT was Turing-complete. That suggests that it should be possible to translate other languages to XSLT. Perhaps, given the similarity noted by dgolden, implementing a Scheme interpreter in XSLT would be the way to go. Or you could take the well-traveled path of implementing a small bytecode-style machine. In any event, once that's done, you can write a program to mechanically translate from your language of choice -- Perl, Haskell, whatever -- into XSLT. And then you are freed forever from this bizarre management requirement to use pure XSLT. For extra credit, write the Perl-to-XSLT translator in XSLT, and/or write a gcc backend that emits XSLT. (The game of Sokoban is also Turing-complete, by the way. The ASICs for my "Sokoban Machine" all-Sokoban-based computer should be coming back from the fab any day now.) [ Parent ]  Haskell is mind-boggling... (3.42 / 7) (#16) by Estanislao Martínez on Tue Jan 15, 2002 at 06:50:01 AM EST  ...well, at least to the uninitiated like me. While I'm not scared at all of the semantics (it's just a plain typed lambda calculus + an operand-first evaluation regime, and some syntactic sugar for monads which, at least in one of their uses, are just functional manipulation of computation states), the practical issue of putting ideas to code in that language baffles me absolutely. I would love to have pointers, save for the fact that I have very little time to devote to following them... :-(. --em  Doh! (4.00 / 3) (#19) by greenrd on Tue Jan 15, 2002 at 08:35:54 AM EST  Heh, I read the first clause as "I would love to have C-style pointers in Haskell", and then thought you were being surreal, and then realised what you meant. 
"Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes[ Parent ]  Learning Haskell (none / 0) (#111) by mahlen on Wed Jan 16, 2002 at 04:53:13 PM EST  For me, Haskell is like a gorgeous foreign lover who speaks no English; captivating and yet baffling. Here is my Haskell Wiki link to how I went about learning Haskell: http://haskell.org/wiki/wiki?LearningHaskell If you do pursue it further, do add a note there on your own experiences. Since i wrote that page, I've not been doing any further things in Haskell (curse my short attention span!), although it still intrigues me. mahlen To doubt everything or to believe everything are two equally convenient solutions; both dispense with the necessity of reflection. --H. Poincaré [ Parent ]  Wow (3.71 / 7) (#21) by wiredog on Tue Jan 15, 2002 at 08:50:38 AM EST  I'm glad my customers don't insist on XSLT. They just insist on results, and don't give a damn how the results are achieved. The main problem we run into is that one of the results that is required is speed. We convert documents from various formats (FDP-KAT, JSIMS FDP, MS Word, etc) to XML, and then load the data from that XML document into an Oracle database. We also have to extract the data from the db into an XML document. Needless to say, but I'll say it anyway, the extracted document has to match the loaded document. Because of the speed requirement we use C++ rather than Perl or Python (although we are keeping our eyes on those) because it's compiled. Yes, I know Python is "compiled" the first time it's run, but the C++ code runs faster. Sometimes, shaving 30 seconds off of a process's run time can be important. Shaving a few minutes can be worth a couple months of optimization. For an idea of what we are doing, check out (WARNING: these are US Department of Defense websites! They are audited, monitored, etc.!) DMSO and the FDMS. 
Some of the information, particularly the FDMS Library, requires a logon. Peoples Front To Reunite Gondwanaland: "Stop the Laurasian Separatist Movement!"  Perhaps a misapplication of XSLT (4.33 / 9) (#24) by edAqa on Tue Jan 15, 2002 at 09:34:56 AM EST  I can't help but feel that this difficulty arose from a misapplication of XSLT. It was stated very clearly that the limitation of being pure XSLT was imposed, not chosen by this implementor -- and it appears that tmoertel does not agree with this decision. There are a couple of things that make me concerned about the conclusion and this perceived problem: 1. The XSLT Abstract states clearly "This specification defines the syntax and semantics of XSLT, which is a language for transforming XML documents into other XML documents." This doesn't mean it can't be used for other things, but it should immediately send up warning flags should you desire to do something else with it -- that is, while it can do other things, this statement shows that the focus was for XML to XML. 2. The author wishes to perform substitutions on things like $, & and (C). The source documents are XML in nature. XML may freely use and define entities, such as &copy;, that could have been used to eliminate (or reduce) this substitution burden from the XSLT. Perhaps a modification of the input language, or an examination of its design, may have yielded an easier solution. There are several limiting factors at work which have produced the described difficulties. If all XSLT experiences were like this one, then I would support the strength of the conclusion. It seems however that the choice of pure XSLT, and the peculiar XML input, and the choice of LaTeX as output, is the source of the difficulty. Perhaps we can learn from this that if the input or output is not of a clean, well-expressed, XML or XML-like (HTML) nature, then either XSLT should not be chosen, or the extension mechanism should be used. -- edA-qa
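The entity approach edA-qa suggests looks roughly like this: declare the entity once in the DTD and authors can then write &copy; directly. (This internal-subset sketch is generic; the document and entity set shown are invented, not the project's actual DTD.)

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- Declared once, usable everywhere in the document. -->
  <!ENTITY copy  "&#169;">
  <!ENTITY trade "&#8482;">
]>
<doc>Copyright &copy; 2002 Joe Q. Public&trade;</doc>
```

The parser expands the entities before the stylesheet ever runs, so the XSLT never has to recognize typewriter idioms at all.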
 Allow me to clarify (4.66 / 6) (#30) by tmoertel on Tue Jan 15, 2002 at 11:31:49 AM EST

 Good observations. Regarding: The XSLT Abstract states clearly "This specification defines the syntax and semantics of XSLT, which is a language for transforming XML documents into other XML documents." This doesn't mean it can't be used for other things, but it should immediately send up warning flags should you desire to do something else with it -- that is, while it can do other things, this statement shows that the focus was for XML to XML. I should make it clear that the substitution problem I discussed is not exclusive to the XML-to-LaTeX transformation process. The problem can occur in XML-to-XML transformations as well as in XML-to-HTML transformations. (In fact, the same project that required the XML-to-LaTeX transformation also required XML-to-HTML transformation, and the latter transformation also required the use of a substitution list and the same XSLT gymnastics.) The problem, stated more generally, is that XML documents comprise tags and text; while XSLT provides facilities for transforming the "tags" part (i.e., elements), it provides little support for what's inside the tags -- the text. Moreover, XSLT makes it difficult to roll your own support. That's the bigger problem. The author wishes to perform substitutions on things like $, & and (C). The source documents are XML in nature. XML may freely use and define entities, such as &copy;, that could have been used to eliminate (or reduce) this substitution burden from the XSLT. Perhaps a modification of the input language, or an examination of its design, may have yielded an easier solution. Unfortunately this wasn't a realistic option. Regardless of how many times you tell people to avoid text and typewriter idioms when marking up XML -- to use "&copy;" for "(C)", etc. -- the reality is that some people can't do it or won't take the time to learn how to do it. If you've worked on large projects where you get your content from "other people," you probably know what I mean. 
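For contrast with the XSLT gymnastics, the substitution list itself is tiny in a general-purpose language. A hypothetical Python sketch (the table below is illustrative, not the project's actual substitution list):

```python
import re

# Order matters: escape LaTeX specials first, then expand idioms,
# so the backslashes the idioms introduce don't get re-escaped.
SUBSTITUTIONS = [
    (re.compile(r"([&$%#_])"), r"\\\1"),           # & $ % # _ -> \& \$ ...
    (re.compile(r"\(C\)"),     r"\\copyright{}"),  # typewriter idiom
]

def xml_text_to_latex(text: str) -> str:
    """Translate an XML text node into LaTeX-safe text."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text
```

A handful of lines in any scripting language, which is exactly why the pure-XSLT-1.0 constraint made the problem interesting.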
My options were simple: I could build a piece of software that required perfect markup to function properly. Or I could build a piece of software that accommodated the most common imperfections. I went the latter route, and I still think that's the better choice. Thanks for your comments. Cheers, Tom --My blog | LectroTest [ Disagree? Reply. ][ Parent ]  XML to LaTeX (3.60 / 5) (#31) by wiredog on Tue Jan 15, 2002 at 11:44:43 AM EST  What about a multistage process? XML to DocBook (which is, I think, XML) and then DocBook to LaTeX? I think there are tools for the latter as well as for LaTeX to HTML. Peoples Front To Reunite Gondwanaland: "Stop the Laurasian Separatist Movement!"[ Parent ]  The problem is that . . . (4.60 / 5) (#48) by tmoertel on Tue Jan 15, 2002 at 02:36:45 PM EST  The problem with using XML-to-DocBook-to-LaTeX is that something would be lost by shoehorning rich semantics-preserving XML into DocBook format. Now, if you're just using DocBook as a glorified device-independent page description language, then you wouldn't be concerned about that loss of semantic information. But in that case, DocBook would leave you with a rather limited degree of control over the back-end page rendering, which (at least for my projects) has always been of paramount importance. We want semantic information to flow through to the very end so that it can influence the on-page rendering. Cheers, Tom --My blog | LectroTest [ Disagree? Reply. ][ Parent ]  String transformations (4.00 / 1) (#55) by jacob on Tue Jan 15, 2002 at 04:44:49 PM EST  Unfortunately this wasn't a realistic option. Regardless of how many times you tell people to avoid text and typewriter idioms when marking up XML -- to use "&copy;" for "(C)", etc. -- the reality is that some people can't do it or won't take the time to learn how to do it. If you've worked on large projects where you get your content from "other people," you probably know what I mean. Yep, I sure do. 
But at the same time, the fact that you're trying to deal with structure in your documents that isn't represented in XML seems to be the root of your problem. And it ought to be, right? XSLT is supposed to be a language for transforming input XML expressions only, so when your language isn't adhering exactly to the spirit of XML expressions, transformations that seem natural to you are going to be like pulling teeth. For that reason, I think a better solution than implementing a string-muncher in XSLT is to make a thin parser that converts your input language (conceptually a "not quite XML" language) into a pure XML representation and go from there. For example, you could descend over the XML document and replace every string with a sequence of XML elements, possibly including singleton tags like <copyright/>. From there, the XSLT transformation would be easier. Perhaps you couldn't do that because you were restricted to an XSLT-only solution? By the way, why was that? Ultimately, if you're using XSLT, your input domain had better be pure XML or you're going to have problems. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  That's going a bit too far, isn't it? (5.00 / 3) (#67) by tmoertel on Tue Jan 15, 2002 at 06:22:23 PM EST  Yep, I sure do. But at the same time, the fact that you're trying to deal with structure in your documents that isn't represented in XML seems to be the root of your problem. And it ought to be, right? I don't think we agree on what structure is. What you appear to be proposing is to promote troublesome text values into elements for the sake of making them easier to process via XSLT. In addition to seeming like a lot of effort for little benefit, it fails the content-preservation rule of thumb: If you strip away a document's markup, you should still be left with all the content of your original document. 
(This rule of thumb is for document-centric uses of XML, not data-centric uses, where element representations are often preferred.) For example, if you strip away all of the markup in an HTML document, you still have the original text. Under your promotion scheme, the same can't be said, and this suggests that your proposal goes too far. Ultimately, if you're using XSLT, your input domain had better be pure XML or you're going to have problems. While I agree in principle, I must disagree on your definition of "pure XML." My definition of pure is such that all of the structure of a "pure" document should be properly represented as elements but none of the content itself. --My blog | LectroTest [ Disagree? Reply. ][ Parent ]  I don't think so (3.00 / 1) (#76) by jacob on Tue Jan 15, 2002 at 09:27:25 PM EST  I say that the copyright symbols and various other entities that show up inside strings are part of the structure of your document because they get handled differently from other data that can show up in that string; if you were to write out a grammar for your input language you'd have to have special production rules for them as opposed to other kinds of string. Therefore, in my book, they're part of the structure of your document. Any program you write is going to have to somehow, at some phase, notice that your data is a copyright symbol rather than anything else and take some special action accordingly; it seems to me that in that situation promotion is absolutely the right way to go. And I'm deeply suspicious of your rule of thumb; for one thing, while it's true that you can strip away all the markup from HTML and still have all the text, (a) that text isn't the only content of the page -- images spring to mind -- and (b) the text will likely lose most of its meaning. On the form I'm filling out to post this reply, for instance, you'd see "[ parent ] Post Comment Spamming is not tolerated here. Any comment ... [etc] ... 
Subject: Comment: Signature Behavior: Retroactive Sticky Never Apply Sig HTML Formatted Plain Text Allowed HTML: ..." and so on. It certainly suffers dramatically from the lack of structure. And besides, programs don't know the difference between "markup" and "structured data" (and neither do I, for that matter). I have an alternate rule of thumb for you: to the extent that your document has structure that you want programs to see and manipulate, you ought to mark it up with XML tags, because that's the only way any XML-related mechanisms are ever going to get at it. No more, no less. -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  Maybe I can change your mind (5.00 / 2) (#84) by tmoertel on Wed Jan 16, 2002 at 01:53:38 AM EST  I say that the copyright symbols and various other entities that show up inside strings are part of the structure of your document because they get handled differently from other data that can show up in that string; ... No way. The semantic characteristics of the information I store in my SGML/XML documents are solely determined by the information itself, not by how it might be processed. In other words, a piece of information is what it is, not how it's processed. Under your system, you're changing the information to fit your publishing scheme. That's backwards. What if you have six publishing schemes, some of which prefer copyright as (C), others as ©, and others as something else entirely? What if ten years from now you bring a new publishing scheme online? Should you re-tag all your old documents to make the new scheme more convenient, too? Represent your information once, as it truly is, and you won't need to do it again. Let the publishing schemes accommodate their own predilections. And I'm deeply suspicious of your rule of thumb ... I suspect that's because you have a deeply suspicious nature. ;-) But seriously, you know the "ML" in SGML and XML? It stands for "markup language." 
In the old days, that used to mean something. You see, there were all these typewritten documents that had to be stored electronically (they were filling up warehouses, it would seem, and they were hard to find in a hurry). So the text of these documents was typed into computers and then "marked up" to preserve the semantic information that is usually conveyed visually on the page. The important thing to note about this mark-up process was that it was a purely additive process -- tags were added, but the original text was never altered. Nothing removed, nothing changed. You see, the document preservation folks were keen on not losing the original text. If you marked up a document in such a way that you changed the original text, you screwed up. Hence the rule of thumb: When you strip away the tags, you should see the original text. And if you're still suspicious, take a look at the DTDs for established document-markup systems (you might want to start with DocBook and TEI) and see how much of the actual text is stored as tags instead of between the tags. I think you'll observe an almost exhaustive adherence to the rule of thumb. --My blog | LectroTest [ Disagree? Reply. ][ Parent ]  Strings are an encryption method (none / 0) (#99) by jacob on Wed Jan 16, 2002 at 12:06:12 PM EST  The semantic characteristics of the information I store in my SGML/XML documents are solely determined by the information itself, not by how it might be processed. I agree with you more than I think you suspect. In fact, in my view your problem stems exactly from the fact that you're not following that very advice as well as you should. The problem is that you haven't exposed enough of the way your data really is to be able to manipulate it properly. 
Let's take an alternate example: instead of transforming "(c)" to "\copyright{}", let's say you want to transform a document into a list of all the noun phrases it contains. Whether or not a particular fragment of your string is a noun phrase is a part of the way your data really is, isn't it? It's absolutely a legitimate part of the structure, right? So this ought to be a trivial XSLT program, right? Wrong. Though English grammar is a totally legitimate part of the structure of your data, your representation doesn't expose it in XML. Why? Presumably because it would be a royal pain to have to mark up your text that way, which is perfectly reasonable. But now that you need access to those noun phrases, you need that structure that's locked away in strings. In essence, the verbosity of XML was impractical for a portion of your document, so you used a more compact non-XML representation. But now that you want to manipulate it, the fact that it isn't XML is a big problem. So what's the solution? Simple: write a program that automatically converts that non-XML stuff into XML. You don't need to make a full English-language parser, just one that's good enough to be able to distinguish between noun phrases and anything else, because that's good enough for your purposes, but you should be aware that there's more structure in the document that you're continuing to treat as opaque because you can. So, you write your noun-phrase identifier, run it on your input data fragments, and presto changeo, you can write a simple little XSLT program to produce the noun-phrase list. The day is saved. In that alternate example and in your actual situation, the root problem is the same: strings hide the structure of the data you're manipulating. The structure of the data contained in your strings is a real and legitimate part of the structure of your data, but it isn't represented in an accessible way. 
So the solution in both cases is the same: take the part of the data that doesn't have its relevant structure exposed, and write a program that exposes that structure when you need it. In your case, it's a whole lot easier than parsing full English; the only relevant part of the structure for your purposes is something like:

Text-fragment ::= Copyright | Opaque-text
Copyright     ::= '(c)'
Opaque-text   ::= [any other sequence of characters]

and while you're still leaving most of the string structure unexposed, you've got enough to work with. This approach also scales (with work, admittedly) to solve problems of context: '(c)' means "copyright" sometimes, "item c in a list" sometimes, "the parenthetical statement 'c'" sometimes, "the variable c grouped by itself" sometimes, and probably a half-dozen other things other times. Your approach can't be scaled up to handle that, while mine can, because mine reflects the nature of the data I'm manipulating while yours doesn't. But you don't have to be so speculative to see the advantage of the way I'm describing. Consider: if you do this my way, your program becomes much easier. The processor I'm talking about is approximately eight easy lines of Scheme code (I know, I wrote it last night in about twenty minutes) and essentially eliminates the problem you've discussed in your article. If that simplification isn't coming from the fact that I'm better exploiting the structure of the data, where is it coming from? -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  Meaning, not use, dictates representation (5.00 / 1) (#103) by tmoertel on Wed Jan 16, 2002 at 02:05:33 PM EST  I think you misunderstood the point of my previous post. I'll try to clarify in context below. Whether or not a particular fragment of your string is a noun phrase is a part of the way your data really is, isn't it? It's absolutely a legitimate part of the structure, right? 
If you're given some information to represent as an XML document, it's your job to study the information, understand its semantics, and create an XML representation that captures the information, true to itself. If, in the information you're given, it's important that noun phrases are distinct from other types of phrases, as it would be if you were given an English textbook, it's your job to capture the noun-phrase structure as part of the textbook's markup. On the other hand, if you're given a sci-fi novel, and noun phrases aren't an essential part of its semantic makeup, you shouldn't include noun phrases in its markup, because you would be obfuscating the true meaning of the novel. So, in answer to your question, "It's absolutely a legitimate part of the structure, right?" -- for the English textbook, yes, and for the sci-fi novel, no.

In essence, the verbosity of XML was impractical for a portion of your document, so you used a more compact non-XML representation. ...

The verbosity of XML was not a factor. There was no semantic distinction to represent, and thus no markup was merited.

The structure of the data contained in your strings is a real and legitimate part of the structure of your data, but it isn't represented in an accessible way.

No, there is no semantic distinction between the "(C)" and the rest of the text in "(C) 2002 Joe Q. Public." They are both text. The fact that during processing the "(C)" is processed differently is an artifact of the processing scheme, nothing more. If XSLT has difficulty processing those bits of "(C)" text, you may be tempted to adulterate the text into something more palatable to XSLT, but you would be wise to resist the temptation. Otherwise, you might corrupt the original meaning of your information, and you would be heading down the slippery slope of representation based upon use rather than meaning.
--My blog | LectroTest
[ Disagree? Reply. ][ Parent ]

hmmm ...
(none / 0) (#106) by jacob on Wed Jan 16, 2002 at 02:52:41 PM EST  I don't mean to sound adversarial -- that's truly not my intent -- but I have to disagree with you strenuously here. First of all, in absolutely any English text, sentence structure is critically important information that's inherent to the data, regardless of whether the work is sci-fi or an English textbook (try taking a complex sentence and alphabetizing all the words, and then see how much sense it makes). When you're writing a markup language, though, you might say to yourself, "Though English has an inherent structure, that's not going to be important for any of my applications and it would be a lot of work to expose all that structure, so I'll just treat it as opaque data that will get processed only by human brains." If that turns out not to be true, though, you need some way to reclaim that structure. That's what parsers do for you. But the important insight is that regardless of whether you represent it or not, English-language data is inherently highly structural. This is a common problem in representing content that's going to get delivered straight to humans. Where do you draw the line between what's structure and what's content? A lot of people draw the line between "formatting" and "text," and that's good enough for most tasks. But there's nothing fundamental about that as being the place where we stop caring about structure, and it's not good enough for every task. That's what I'd hoped my noun-phrase example would demonstrate, though apparently it didn't. And when you say "there was no semantic distinction to represent," I just have to wonder whether you're entirely sure what that means. Of course there's a semantic distinction to represent. If there were no semantic distinction between "(C)" and other text, you wouldn't be writing a program that treated it specially -- they're semantically equivalent and thus by definition mean the same thing as far as you're concerned. 
Your argument is, "In my data, the string '(C)' is no different from the string 'fgahsjdfh' or any other string because it's just content like them. However, in my data, the string '(C)' is different from the string 'fgahsjdfh' and every other string because it means 'a copyright symbol' whereas nothing else does." Well, which is it? I should also note that what I'm suggesting is not to store your data in this format, but to create an intermediate representation that contains all the relevant data in accessible ways. You seem strongly opposed to even putting your data in a form where you can manipulate it, which is strange to me. Of course, that's probably because I see it as clean and elegant while you see it as a hack ... But then, I'm right and you're not. =P -- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up[ Parent ]  We're getting close (none / 0) (#109) by tmoertel on Wed Jan 16, 2002 at 04:15:30 PM EST  I don't mean to sound adversarial -- that's truly not my intent -- but I have to disagree with you strenuously here. I don't think that you're being adversarial, I just think that we're using slightly different vocabularies, and we use the same words but mean different things. I can only hope that some degree of explanation will align our vocabularies sufficiently for consensus to form. "Though English has an inherent structure, that's not going to be important for any of my applications and it would be a lot of work to expose all that structure, so I'll just treat it as opaque data that will get processed only by human brains." I should point out that this is a use-centric interpretation. First of all, in absolutely any English text, sentence structure is critically important information that's inherent to the data, regardless of whether the work is sci-fi or an English textbook... But the important insight is that regardless of whether you represent it or not, English-language data is inherently highly structural. 
No kidding. The key is recognizing whether the text itself sufficiently represents the structure to fully capture the original meaning. In the case of the English textbook, no, there is meaning beyond what's implied by the face value of the text itself, and to capture it additional markup is necessary. In the case of the sci-fi book, on the other hand, the text is sufficient by itself to convey the original meaning. And when you say "there was no semantic distinction to represent," I just have to wonder whether you're entirely sure what that means. Of course there's a semantic distinction to represent. Of course I know what it means. And, no, there is nothing to represent if the distinction wasn't important in the original. If I have a sci-fi book, and there is a "© 2002 Joe Q. Public" on the title page, is there an important distinction in this original source material between the "©" and the rest of the text? Does the "©" serve some role on the page other than what's implied by its participation in the text? If the answer is no, then no additional markup is merited around the "(C)" in the XML version. If there were no semantic distinction between "(C)" and other text, you wouldn't be writing a program that treated it specially ... So if I write a stylesheet that italicizes all of the letters "s" during a publishing pass of some books, do all the "s"s in the original printed books (from which my XML was derived) suddenly become more important? Nope. Or what if I typeset the "s"s as "\ess{}" in LaTeX? Still, nope. The significance of the "s"s is an artifact of my publishing scheme, not inherent in the original meaning of my books. Your argument is, "In my data, the string '(C)' is no different from the string 'fgahsjdfh' or any other string because it's just content like them. However, in my data, the string '(C)' is different from the string 'fgahsjdfh' and every other string because it means 'a copyright symbol' whereas nothing else does." Well, which is it? 
I'm arguing neither. What I am arguing is that the string "©" in the context of "© 2002 Joe Q. Public" has no meaning other than its participation in the text, regardless of whether I choose to represent the copyright symbol as "(C)", "&copy;", "&#169;", or "©" in markup. Do you get what I'm saying?
--My blog | LectroTest
[ Disagree? Reply. ][ Parent ]

I see the difference! (none / 0) (#112) by jacob on Wed Jan 16, 2002 at 06:04:40 PM EST

Based on your most recent comment, I think I see the difference between our perspectives: You favor the minimal intrusion philosophy (making the term up on the spot, by the way) -- you think that an XML markup language should explicitly denote only that information which isn't implied by the original content. If it's already implied by what's on the page, there's no reason to explicitly state it with an XML tag. As I see it, the advantages of the minimal intrusion philosophy come from the fact that it meshes easily with pre-existing ways of understanding documents. I see the advantages as:

- Documents are easier to produce by hand and are often more human-legible
- Marked-up documents vary as little as possible from the originals in document-conversion scenarios
- Probably others; any thoughts?

I favor the minimal ambiguity philosophy -- I think that to the extent that it's feasible, all the structure of your data should be explicitly exposed using the uniform XML mechanisms. The advantages of the minimal ambiguity philosophy stem from the fact that uniform representations are easy to manipulate programmatically. They are:

- Standard tools (XSLT, etc.) can easily transform all relevant pieces of the document
- No need to write ad hoc converters for different aspects of the same datum
- Easier to tell which aspects of a document are potentially important
- Probably others

What's most interesting to me is that neither of these can exist as a purist strategy.
Consider the following examples:

Example 1:

  A NEWS STORY
  by John Smith

  ATLANTA, GA -- Blah blah blah ... Blah blah ... Blah blah blah ...

  John Smith is a member of the Associated Press.

Example 2:

  J...[etc]...

Neither is a reasonable way to mark up a news article, but both are epitomes of their respective design philosophies. So it seems to me that in any situation when you want to design a markup language to represent something, you need to decide how "deep" you want to go in exposing structure, and that definitely involves how you plan to use the data. Note, though, that I don't mean "the structure of your data depends on how you use it," which is false, but "the extent to which you represent the inherent structure of your data using uniform XML constructs depends on what you want to do with it," which is a different notion.

Anyway, to get back to your problem, I'm still going to have to stick to my guns: when you get to a situation where you left implicit some bit of meaning that you need to programmatically manipulate, the easiest path is going to be to make that bit of meaning explicit in a uniform way. Doesn't mean that your original markup was wrong, just that your current task needs to see more of the structure than you anticipated. The only alternative is to make ad hoc parsing and transformation programs for your non-uniformly-represented data, and that's what XML was supposed to do away with.
-- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up
[ Parent ]

Entities are not structure (5.00 / 2) (#88) by edAqa on Wed Jan 16, 2002 at 03:28:47 AM EST

The current standards and working groups from ISO and W3C contend that symbols like copyright are pure content, 100% distinct from the structure of the document. The reason people use writings like (c) is not because they regard it as a special construct, but merely because they can't see the copyright symbol on their keyboard.
Entities are a convenience to introduce the same content at various locations, or to produce those symbols which our keyboard doesn't have, since it doesn't have 65k (Unicode) or 4mil (ISO) keys on it. If all the replacement strings were marked up as XML entities, then the substitution would have been rather trivial. I do, however, also agree with you that the content is not independent of the structure. The structure is a clear indication of what kind of content is present, and this markup is often essential to the nature of that content. One need only look at MathML to see a good example.
-- edA-qa
[ Parent ]

Yes, I agree with your clarification (5.00 / 1) (#87) by edAqa on Wed Jan 16, 2002 at 03:15:13 AM EST

I've done a lot of transformations of HTML, XML, CSS, and other web languages. I clearly agree that XSLT has difficulties when it comes to working with the content between the tags -- short of extension functions and heroic efforts within the language itself, XSLT simply provides no useful mechanism for changing the text content. I've found the most success using XSLT for doing structural transformation only; essentially, for it to work effectively, the input has to be a pure and appropriate XML use. Since your input had human-entered data, that likely precludes this option. I think as XSLT evolves it will better support real-world translations. In the sense of standards, I think it makes sense to have released this structural-only transformation language now, allow people to use it, and then elicit ideas for textual change at a later time.
-- edA-qa
[ Parent ]

Docbook? (2.00 / 3) (#28) by enry on Tue Jan 15, 2002 at 11:07:23 AM EST

Docbook already can convert to *TeX (and PDF and HTML and...), but I'm not sure how it works with XSLT.

Verbosity (4.14 / 7) (#34) by ucblockhead on Tue Jan 15, 2002 at 12:09:06 PM EST

I admit I've never used XSLT, but some of the comments here confirm my biggest complaint about XML.
Its greatest strength is its readability, but its writability sucks. In my mind, XML is a good thing to use for machines to talk to people, or for things that require minor modification of machine-generated files. But as something for human beings to sit down and actually write, XML sucks rocks. It is far too verbose. COBOL for the new millennium. Unfortunately, I don't think that the distinction between readability and writability is well recognized.
-----------------------
This is k5. We're all tools - duxup

Well... (4.00 / 4) (#37) by wiredog on Tue Jan 15, 2002 at 12:37:57 PM EST

XML isn't really meant to be human writeable. Or readable, for that matter. It's intended to allow systems to share data. The programmer implementing the human/XML interface (me, for instance) has to be able to read a DTD or schema, and an XML file, but there's no reason that the end user would ever see the raw XML. He'd see a nicely rendered page, or a form, or something like that.
Peoples Front To Reunite Gondwanaland: "Stop the Laurasian Separatist Movement!"
[ Parent ]

programmer == human (4.75 / 4) (#44) by ucblockhead on Tue Jan 15, 2002 at 01:44:15 PM EST

I'm not talking about end users, but programmers. If this is a language, then writability is a huge concern.
-----------------------
This is k5. We're all tools - duxup
[ Parent ]

XML *is* meant to be human readable (4.50 / 2) (#47) by Carnage4Life on Tue Jan 15, 2002 at 02:34:36 PM EST

XML isn't really meant to be human writeable. Or readable for that matter. It's intended to allow systems to share data.

XML is a markup language, or better yet a mechanism for defining markup languages. Markup languages are traditionally meant to be human readable. Heck, my webpage is written in XML (XHTML) and I'm very sure the source is human readable. In fact, taking a look at the design goals for XML, I notice:

6. XML documents should be human-legible and reasonably clear.
10. Terseness in XML markup is of minimal importance.
Simply because people now use XML outside its original design intentions, as a replacement for HTML, does not mean that it wasn't originally intended to be human readable (even by end users).
[ Parent ]

Human-legible, not human-readable (none / 0) (#86) by Samrobb on Wed Jan 16, 2002 at 03:04:45 AM EST

In fact taking a look at the design goals for XML I notice 6. XML documents should be human-legible and reasonably clear.

IMHO, while the term "human-legible" does imply that XML should be "human-readable", it also seems to imply that it does not necessarily need to be easy for a human to read.
"Great men are not always wise: neither do the aged understand judgment." Job 32:9
[ Parent ]

SVG compression (none / 0) (#93) by jolly st nick on Wed Jan 16, 2002 at 10:09:27 AM EST

OK, this is lame, replying to my own post, but I just did a check on the SVG example. The SVG browser saves gzipped SVG files. An example file used by Adobe (theatre.svg) is 87K, and uncompresses to 184K, which is about 53% compression -- indicative of considerable redundancy. I took a screenshot and saved the file as a comparable-quality PNG, and it came out at 77K, slightly smaller than the compressed vector format! I'm not saying SVG isn't useful. However, I do believe that its XML nature adds a great deal of redundancy.
[ Parent ]

Only 53%? (none / 0) (#96) by joto on Wed Jan 16, 2002 at 10:39:54 AM EST

That would certainly not be as much as I'd expected, given the verbosity of XML. I would expect XML to at least compress 1:4 with a standard algorithm, but certainly 1:10 should be possible for most data (especially machine-generated). Maybe someone should create a smarter compression algorithm geared at XML data?
[ Parent ]

Missing the point of SVG (none / 0) (#105) by ptemple on Wed Jan 16, 2002 at 02:25:36 PM EST

The SVG browser saves gzipped SVG files.
An example file used by Adobe (theatre.svg) is 87K, and uncompresses to 184K, which is about 53% compression -- indicative of considerable redundancy. I took a screenshot and saved the file as a comparable quality PNG and it came out at 77K, slightly smaller than the compressed vector format! I'm not saying SVG isn't useful. However, I do believe that its XML nature adds a great deal of redundancy.

The format of XML is easily tokenisable and hence should lead to very good compression. The SVG browser may save gzipped SVG, but gzip has a number of settings (from fastest to max compression). What setting is it using? It may be 184K uncompressed, but it is unlikely any application will store the raw XML; it will convert on the fly to its own internal representation, and hence will not consume that in system resources.

You compare the SVG to a PNG and imply the SVG has redundancy because the PNG is smaller. Untrue. By converting the image to PNG and compressing you are using lossy compression. E.g. you can no longer resize or zoom in/out and have all the curves perfectly rendered and anti-aliased. If you only want a fixed-size bitmap of fixed resolution and can get a smaller image size using PNG, then keep your source artwork in SVG and export to PNG to upload to your web site, much like you used to keep your source artwork in Photoshop (or whatever) format and export GIFs to your web site.

One very important advantage of SVG being human readable is that it allows web designers to easily generate images on the fly. For example, I could create a button with the label "{BUTTONTEXT}" and save it in SVG format. It would then only take a couple of lines of PHP using FastTemplate to output a smart-looking button with any label I like, possibly using a svg2png function to support non-SVG-aware browsers.
Phillip.
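The claim above -- that XML's regular, tokenisable text should gzip very well, and that the gzip level matters -- is easy to check with a quick sketch (Python here, purely illustrative; the exact ratios depend on the data and settings):

```python
import gzip

# Highly repetitive machine-generated XML, in the spirit of much SVG output.
xml = "".join(
    f'<circle cx="{i}" cy="{i}" r="5" fill="#c0ffee"/>' for i in range(2000)
)
raw = xml.encode()

fast = gzip.compress(raw, compresslevel=1)  # the "fastest" gzip setting
best = gzip.compress(raw, compresslevel=9)  # the "max compression" setting

print(len(raw), len(fast), len(best))
# The markup's redundancy yields far better than the 1:4 figure discussed
# above, and the chosen gzip level changes the result.
```

Real SVG files with varied path data compress less dramatically than this toy input, which is consistent with the 53% figure reported for theatre.svg.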
[ Parent ]

Vector formats (none / 0) (#108) by DGolden on Wed Jan 16, 2002 at 03:43:48 PM EST

But by converting the image to PNG and compressing you are using lossy compression.

Yes, but normally vector formats have a reputation for being much more space-efficient than bitmap formats. SVG somehow manages to be less efficient...
Don't eat yellow snow
[ Parent ]

Effect of dithering (none / 0) (#113) by ptemple on Wed Jan 16, 2002 at 10:35:58 PM EST

Yes, but normally vector formats have a reputation for being much more space-efficient than bitmap formats. SVG somehow manages to be less efficient...

I wonder. With simple shapes in solid plain colour, I can imagine PNG competing with SVG on size. Can someone run an identical test using an SVG with numerous gradient fills? I predict that once dithered to the screen resolution it will kill the PNG compression somewhat.
Phillip.
[ Parent ]

Missed point? (none / 0) (#119) by jolly st nick on Thu Jan 17, 2002 at 09:30:15 AM EST

Of course I know there's a difference between lossy raster representations and scalable vector/object representations. My point is that XML adds nothing as a vehicle for representing graphic objects in a human-readable format, but it does inflate the file with lexical fluff. A vector file format is nothing more than a program that creates the desired graphic output. I think it would be possible to create a programming language (i.e. human readable) that describes graphic objects without the bloat of SVG. Of course, then again, maybe what I'm describing is PostScript ;-)

XML is not optimal for human readability. It's not that easy to parse either -- there are all kinds of gotchas in making document handlers. Lex- and yacc-style parsing is both more powerful and easier to use than XML for describing programming languages. I'm not against SVG. I think it's great that we will have a non-proprietary format for representing object-oriented graphics.
However, XML gets people and organizations on the SVG bandwagon simply because that bandwagon is hitched to the XML juggernaut. Really, the requirements are simply that the file format be human readable and machine processable. The universe of applications that describes is much larger than the optimal space for XML.
[ Parent ]

Human readable IN A PINCH (none / 0) (#92) by jolly st nick on Wed Jan 16, 2002 at 09:42:22 AM EST

It is supposed to be human readable and producible, but this doesn't mean it is optimal for human manipulations. So, you have a document that has to be processed in a variety of ways, some of which you haven't anticipated. I have the task of processing that document, but my program doesn't work the way I expect. I can look at the file and try to get insight into it with more convenience than od-ing a binary file. I have a DTD or schema, so I can understand the universe of what my program might be expected to process. I can gin up test cases with an XML editor or, in a pinch, a text editor. These are all good things.

However, does this mean I want to program in XML? Heavens no. It's one step removed from saying that because it is useful to have a binary editor, I want to program by setting byte values in a file. The thing that struck me after digesting several books on XML is that they are written for people with no knowledge of language design or parsing. The XML phenomenon seems to exist in its own world apart from the normal world of lexing, parsing and semantic analysis. XML seems to be used for lots of things I would have developed with a BNF grammar and handled with lex and yacc. Take SVG, for example. In the end, it's just a way of describing graphic objects, one that incurs a size penalty. In the end what matters is the object model and to have one or more representations you can transform it into with a simple filter (say bytecode, and pseudocode).
This would mean that the end of many objects could be inferred from the grammar or marked with a single byte. Part of what makes XML human readable is a relatively low entropy (high redundancy); when you send it over the Internet, your information payload is going to be bogged down with volumes of lexical fluff. Fortunately, vector graphics tend to be svelte in the first place, but if you had an elaborate file such as a highly detailed map, I would be willing to bet it would LZW-compress to a fare-thee-well.

As I began to have to process XML documents, it became clear to me that some people developing DTDs hadn't really studied language design, and often didn't build features into their grammar that would be useful. This is why people say things like "XML is comma-delimited files for the twenty-first century." This is not the fault of XML per se, but it shows that a lot of what is driving XML is sheer momentum. I think it has a lot of potential for good, if looked at critically. The best thing about XML? Well, so far it's that people will open their wallets for work done in it, whether it makes sense or not. It's like when object orientation took off in a big way in the early nineties. People didn't quite know what it was, except they heard it was going to solve all their problems.

Which brings me to XSL. I think it works well for transformations between XML and other XML-ish formats. However, I think the problem with XSL is that it tries to do too much outside of its primary strength, which is transformation by example. The problem of enabling the transformation of any particular XML format into the universe of all other representations can be seen as isomorphic to the general problem of computation itself. We have decades of experience in developing languages for various subsets of this problem, and whole schools of thought with highly developed implementations. XSL turns its back on this experience and tries to go at it from first principles.
The transform-by-template kind of approach is one that clearly doesn't serve the general problem. This means that XSLT is a niche player -- like awk without the charming simplicity -- that has capabilities which allow it to be used for things it is highly inconvenient for. People have done some amazing things with awk, for example, but unless they are simple kinds of "if A, substitute B" transformations, they are more senseless acts of beauty than practical examples.

The hack, of course, is that XSL transforms XML and is XML. This means that you can manipulate it with XML-ish tools. You could transform an XSL file using XSL. The question is, why would you want to, if, in the first place, it isn't the best way to represent an algorithm? I think XML has potential as a container for software logic written in other languages (to mark them up with metadata). However, programming languages exist to assist programmers, not to serve some kind of movement. They are tools. When they work, they help the programmer wrap his mind around the problem. When they don't work, they cause the programmer to stumble over their structure. Thus, XML to XML using XML is natural and helpful. XML to other kinds of things using XML is highly questionable.
[ Parent ]

XML Scheme (4.40 / 5) (#51) by DGolden on Tue Jan 15, 2002 at 03:53:51 PM EST

I've mentioned this on kuro5hin before, but it is pertinent: As any Lisp hacker will tell you, XML is essentially a thoroughly baroque way of expressing data that is easily expressed as Lisp s-expressions. (XML has the verbosely redundant <tag>...</tag> syntax, which I find much more annoying than lots of ()s.) If you really have to sully your hands with XML, then personally, I find that Scheme is simply the nicest way to handle it -- and manipulating and transforming trees composed of Lisp lists has decades of research behind it.
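The point about trees of Lisp lists can be made concrete even outside Scheme. Here is a hedged sketch -- Python nested lists standing in for s-expressions, not the SSAX representation itself -- of how naturally a recursive transform falls out once the document is an ordinary tree:

```python
# An element is ["tag", {attrs}, child, ...]; children are elements or strings.
doc = ["doc",
       {},
       ["para", {}, "Copyright (C) 2002 ", ["em", {}, "Joe Q. Public"]]]

def transform(node, rewrite):
    """Walk the tree, applying `rewrite` to every text node."""
    if isinstance(node, str):
        return rewrite(node)
    tag, attrs, *children = node
    return [tag, attrs, *(transform(c, rewrite) for c in children)]

latexed = transform(doc, lambda s: s.replace("(C)", r"\copyright{}"))
print(latexed)
# ['doc', {}, ['para', {}, 'Copyright \\copyright{} 2002 ', ['em', {}, 'Joe Q. Public']]]
```

Because text nodes are just leaves of the tree, the string-rewriting problem from the article reduces to one recursive walk plus one leaf function -- the "decades of research" on list manipulation doing the heavy lifting.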
So, if you're able to make your own decisions about what to use (and if you already know a bit of Scheme), rather than doing your transformation with XSLT, why not just use a transform written in Scheme, and the SSAX functional XML parser? SSAX converts your XML document to "SXML", which is best explained by example:

XML:

  <WEIGHT unit="pound">
    <NET certified="certified">67</NET>
    <GROSS>95</GROSS>
  </WEIGHT>

SXML:

  (WEIGHT (@ (unit "pound"))
    (NET (@ (certified)) 67)
    (GROSS 95)
  )

The TeX version of the SXML specification on the site is in fact generated from SXML into TeX by a short Scheme "stylesheet"... Scheme implementations like guile or bigloo have comprehensive standard libraries, so you'll have all the power of a "real", general-purpose language's library support (and I suppose you could rewrite a similar parser in Common Lisp if you think Scheme's still not "general purpose" enough), plus you'll be using a language that's turned out to be pretty much tailor-made for the problem space.

P.S. As XSLT is to XML, the Scheme language is to s-exprs. ;-) An interesting side-effect (groan), of course, is that you could make your SXML document itself be a Scheme program that is "run", e.g. for the example above, by defining functions called "WEIGHT", "NET", etc. that do stuff.
Don't eat yellow snow

*snorts* (1.00 / 1) (#54) by core10k on Tue Jan 15, 2002 at 04:33:49 PM EST

The day functional programming is taken seriously is the day it delivers the goods. If what you say is true (and enough people have claimed the same thing, so I have no reason to believe it isn't), XML is its chance to prove its worth. Given Lisp's pathetic history*, I'm not holding my breath.

*Theoretically being 'better' in some vague, never truly defined way, doesn't count.
[ Parent ]

Functional programming / Lisp (4.66 / 3) (#57) by DGolden on Tue Jan 15, 2002 at 05:08:31 PM EST

I don't think functional programming needs people to "take it seriously". I already take it seriously. Ericsson (Erlang) already takes it seriously.
Lots of computer scientists already take it seriously. So what if the millions of corporate C++/Java/VB drones don't understand it and thus pooh-pooh it? My weighting of their opinion is rather low. They don't particularly _need_ it for their class of programming problems. (Note: By day, I am just a corporate-drone programmer myself -- I've first-hand experience of the sort of programming "problems" encountered by the average Java-head or VBer...)

Just in case you're doing so: don't confuse functional programming and Lisp. Scheme implementations and Lisp are not purely functional languages. Yes, they've got a reputation for functional programming, but that's mainly because the languages were early on the scene and allowed one to write functional programs, not because they require one to. You can relatively easily write functional programs in Perl 5.x, too. The SSAX parser is written in a functional style, since, as it happens, it's a very clean, compact way to write it -- but there's no particular reason why you couldn't treat it pretty much as a black box, and the rest of your code couldn't be OO, say. The major Schemes tend to have CLOS-like object systems built in, and, of course, Common Lisp actually has CLOS.
Don't eat yellow snow
[ Parent ]

Dominant paradigm (none / 0) (#82) by Pseudonym on Wed Jan 16, 2002 at 01:06:23 AM EST

I think when trying to characterise a language, you have to look at the "dominant paradigm". C++'s features, for example, include genericity, object-orientedness and so on, but if you had to pick the paradigm that it owes the most to, it's a procedural language. Erlang's features include robustness, scalability, concurrency and so on, but if you had to pick the "dominant paradigm", it's functional. A warning, though: Erlang isn't purely functional. Indeed, most functional languages are not, including ML and its variants. That's why I see no reason to exclude the Lisp/Scheme family of languages from the functional language family.
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]

Re: Dominant paradigm (none / 0) (#95) by joto on Wed Jan 16, 2002 at 10:32:15 AM EST

While I would mostly agree with putting Scheme, ML and Erlang in the functional category, and C++ in the procedural (or OO (depending on the programmer, obviously)), I have to disagree with putting Lisp there. By Lisp, I assume you mean, ehh, well, Common Lisp (but Emacs Lisp would do just as well). I've done a bit of programming in Common Lisp, and I usually end up with only about 50% pure functional code (the rest being OO, somewhat impure functional, or procedural). Putting Common Lisp in the category of functional languages is certainly no better than putting Perl or Python there (while they can be used for that purpose, they usually aren't).
[ Parent ]

Functional Lisp (none / 0) (#114) by Pseudonym on Thu Jan 17, 2002 at 12:27:29 AM EST

The way I think of Lisp is as a functional language which lets you drop back to imperative if you need to. I'm willing to concede that this may be a reflection of how I use it, having come from purely functional languages first. (The first functional languages that I used were, in order: Orwell, Miracula and Miranda. Does that show my age or what?)
sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]

Self-fulfilling prophecy (4.75 / 4) (#58) by jacob on Tue Jan 15, 2002 at 05:10:41 PM EST

I find that when most people say things like that, 'deliver the goods' ends up meaning some variation on 'has lots of large applications written in it.' Well, obviously functional programming languages don't have that -- nobody takes them seriously. On any more objective metric than a popularity contest, though, I'm convinced that functional programming languages already 'deliver the goods' in spades[1].
For example, I recently developed a component for a web CGI library that took an arbitrary XHTML form and some information about what bindings it was supposed to provide, sent the form to the user, got the results, and matched up the actual transmitted bindings with the expectation. If the bindings were correct, it bound them to variables and continued executing the program; otherwise, it automatically printed out an error message and resent the page with all the valid bindings filled in. The whole library was around five or six hundred lines of Scheme (best guess), with most of the complication being HTML's terribly non-standardized ways of handling form elements.

[1] I originally typoed 'in space' here, and thought that was sufficiently funny to share.

-- "it's not rocket science" right right insofar as rocket science is boring --Iced_Up
[ Parent ]

Better analogy: Haskell/ML types (none / 0) (#129) by Estanislao Martínez on Sun Jan 20, 2002 at 06:08:36 PM EST

As any Lisp hacker will tell you, XML is essentially a thoroughly baroque way of expressing data that is easily expressed as Lisp s-expressions. (XML has the verbosely redundant ... syntax I find much more annoying than lots of ()s.) A better analogy is the type systems of functional languages like Haskell or ML. DTDs can (in principle) be mapped to type declarations, documents to objects of these types, and functional programs using those types can guarantee valid output if given valid input (as in "will fail to compile unless all possible outcome documents meet the DTD"). Check out this article and the references within.

--em
[ Parent ]

Hammers for fishing, et al. (3.71 / 7) (#52) by Treach on Tue Jan 15, 2002 at 03:55:06 PM EST

It would seem to me that the real problem here has nothing to do with the capabilities of XSLT, Perl, Haskell, or Atari BASIC, or with the poster's inability to understand (insert whatever mathematical/CS concept you think is just slightly over his head here).
The problem is that he was told to use XSLT. He had no choice. So he has the right to bitch about XSLT's not working. And although he probably should be bitching to his management, that may not have been possible at the time (i.e., "Okay, of the 30 of you in this department, 28 will be laid off and the other 2 of you will write an XSLT tool").

I didn't vote on this article, but I would have said, editorially speaking, that it would be more interesting if it focused on the politics and reasons that put the square-peg-for-round-hole into his hand, rather than a long discussion of why the peg is square. He could have explained that in a paragraph and then moved on. Those of us who are programmers would understand the problem from a paragraph; those who are not probably still do not understand it - if you have never written anything to iterate through a hash, you probably don't care that it isn't possible.

One reason why I beg to differ. (4.00 / 3) (#59) by Apuleius on Tue Jan 15, 2002 at 05:14:13 PM EST

Posting details of your workplace's internal politics to K5 can be a very career-limiting move. Bitching to K5 about what you have to do as a result of such politics, on the other hand, at least gives other people the chance to print out and be ready to hand over to the boss an article on the limitations of XSLT. I know I'm keeping a copy.

There is a time and a place for everything, and it's called college. (The South Park chef)
[ Parent ]

What shape peg? (4.80 / 5) (#68) by jmzero on Tue Jan 15, 2002 at 06:46:26 PM EST

I think one of the author's points is that this job was the job XSLT is intended to do. This is supposed to be the perfect peg for the hole, and it isn't.

"Let's not stir that bag of worms." - my lovely wife
[ Parent ]

No it isn't... (none / 0) (#98) by scarhill on Wed Jan 16, 2002 at 11:31:54 AM EST

As many have commented, XSLT is designed to do XML tree transformation, not general-purpose text manipulation.
You can certainly argue that enhancing XSLT with better text-processing functionality would be a good thing, but the fundamental problem here is the mandate that he use a tool that is not designed for the job.
[ Parent ]

Sure looks like it is . . . (5.00 / 1) (#101) by tmoertel on Wed Jan 16, 2002 at 12:32:14 PM EST

XSLT is designed to do XML tree transformation, not general purpose text manipulation.

The problem described in the article was tree transformation. You know that stuff in the leaf nodes of the XML tree? That's text. When transforming the tree, XSLT fell down on the leaf text. That was Problem 1. (Problem 2 was that XSLT didn't lend itself to the kinds of simple nuts-and-bolts programming that could have easily bridged the gap caused by Problem 1.) Square peg, square hole.

--My blog | LectroTest [ Disagree? Reply. ]
[ Parent ]

XML Script (2.00 / 3) (#70) by Echo5ive on Tue Jan 15, 2002 at 07:09:07 PM EST

I never liked XSLT at all. Big and clunky and tough on the eyes. No, give me XML Script any day of the week! It's slow in development, and I haven't tried it for anything other than personal use, but I really liked it. And it's fully XML-compliant - you write the language in XML. It's much easier than XSLT, and as far as I could see, gave the same functionality. And more.

--Frozen Skies: mental masturbation.

what a bunch of wimps... (2.40 / 5) (#71) by fanatic on Tue Jan 15, 2002 at 07:24:11 PM EST

Everyone knows that Intercal is the one true language. ;-P

What if repl. text contains a token? And Haskell (3.00 / 5) (#74) by phliar on Tue Jan 15, 2002 at 08:21:19 PM EST

I'm not a Perl (ugh!) adept...

    sub doSubstitutions($) {
        my $text = $_[0];
        while (my ($target, $replacement) = each %$substitutions) {
            $text =~ s/$target/$replacement/g;
        }
        return $text;
    }

What if a replacement string contains a substring that matches a pattern? Then the results are dependent on exactly how an iterator in Perl works over the hash.
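The order dependence phliar describes is easy to make concrete with a short sketch (Python rather than Perl, purely for illustration; the rule pairs echo the copyright example from the article):

```python
# Iterated plain-string substitution, where text produced by one rule
# IS visible to rules applied later -- so rule order changes the result.
# (Illustrative rules only: "foo" -> "C" and "(C)" -> "\copyright{}".)

def sequential_sub(text, pairs):
    for target, replacement in pairs:
        text = text.replace(target, replacement)
    return text

rules_a = [("foo", "C"), ("(C)", r"\copyright{}")]   # "foo" rule first
rules_b = [("(C)", r"\copyright{}"), ("foo", "C")]   # "(C)" rule first

print(sequential_sub("(foo)", rules_a))  # -> \copyright{}
print(sequential_sub("(foo)", rules_b))  # -> (C)
```

Since Perl's `each` walks a hash in an unspecified order, the snippet above can yield either result for overlapping rules like these, which is exactly the point being raised.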
The problem as you've stated it doesn't say what's to be done in that case; the most reasonable, i.e., well-defined, statement is that it's a one-pass substitution, i.e., the replaced text is not scanned for search patterns. In the absence of side effects this is nice and well-defined. Simple case: what do you want to do with

    (C) -> \copyright{}
    foo -> C

applied to "(foo)"? If the search patterns are regexps (and not plain strings), things get even more interesting. Search-and-replace is not a trivial thing.

Haskell is a very cool language. As beautiful as Perl is ugly and putrid. I want to hear more about your experience with and views on Haskell.

Faster, faster, until the thrill of...

Perl, Haskell etc (4.00 / 1) (#80) by Pseudonym on Wed Jan 16, 2002 at 12:54:02 AM EST

I must be one of the few programmers in the world who likes both Haskell and Perl.

Quick Haskell rant: Much as I like Haskell, I always find myself wanting more standard library than it has. It constantly bothers me that Haskell doesn't come with FiniteMap, or that it has no standard priority queue type. Yes, they're extremely easy to write, or easy to copy over into your application, but you shouldn't have to. At least Perl has hash tables as part of the language, and CPAN. This may not be good language design, but it's great in practice.

sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]

I like both, too (4.00 / 1) (#85) by tmoertel on Wed Jan 16, 2002 at 02:15:36 AM EST

I must be one of the few programmers in the world who likes both Haskell and Perl.

I think the Haskell / Perl combo is actually quite popular. I like 'em both. And I know that the GHC build system relies upon Perl. They are both fun languages.

Quick Haskell rant: Much as I like Haskell, I always find myself wanting more standard library than it has. It constantly bothers me that Haskell doesn't come with FiniteMap, or that it has no standard priority queue type.
Yes, they're extremely easy to write, or easy to copy over into your application, but you shouldn't have to.

Do you use GHC? If so, you have Edison, which gives you what you want.

--My blog | LectroTest [ Disagree? Reply. ]
[ Parent ]

Hopeful (none / 0) (#90) by Pseudonym on Wed Jan 16, 2002 at 04:08:27 AM EST

I think the Haskell / Perl combo is actually quite popular. I like 'em both.

I'd like to think that most professional programmers know that sometimes it's good to be quick and dirty (e.g. Perl) and sometimes it's good to be careful and robust (e.g. Haskell), but realistically, I know that most don't think that way.

As for Edison, that's very interesting. The web page says that not all signatures are filled in. Is that still the case? I realise that it's important to get it right, otherwise we end up with something like the C++ STL. The STL was designed in a previous era of C++. Now the language is much more developed, but the STL is fixed, and so can't take advantage of many of the developments that have happened since (e.g. template metaprogramming). Edison's approach of providing signatures first rather than diving into implementations seems exactly the right thing to do. Regardless of completeness, now appears to be the time to upgrade my GHC.

sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]

Nope (none / 0) (#89) by Weezul on Wed Jan 16, 2002 at 03:59:18 AM EST

You're not the only one. I love both Haskell and Perl. Actually, people like Perl because it's so very practical. You might like Haskell for being "potentially" very practical. You can often write Haskell code to make many things easy to do in Haskell where you would need to change Perl itself to make these same things easy to do in Perl. Alternatively, Perl programmers like different ways of doing things, and Haskell is often one of the more interesting ways of doing things.
One day someone will apply the Perl philosophy to a functional language, hopefully Haskell... and it's gonna rock. This would mean adding some clever syntactic sugar, adding more powerful type overloading, and making the type system a bit more fluid.

"Fascism should more appropriately be called Corporatism because it is a merger of state and corporate power." - Benito Mussolini
[ Parent ]

Bad Perl Code (none / 0) (#116) by mlvanbie on Thu Jan 17, 2002 at 02:43:00 AM EST

As mentioned above, the Perl code has problems with the order in which mappings are applied. It also has the problem that for every string of text that is converted, every pattern will be compiled as a regular expression, which is slow. I present a better way to do it:

    my %subs = ( '&' => '\&', '$' => '\$', '(C)' => '\copyright{}' );
    my $pat = join( "|", map { s/\W/\\$&/g; $_ } (keys %subs) );
    sub doSubstitutions($) { $_[0] =~ s/$pat/$subs{$&}/oeg; }

Note the essentially functional definition of $pat.
[ Parent ]
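The single-alternation trick in the comment above carries over to other languages. Here is a sketch of the same idea in Python (the substitution table is the illustrative one from the thread; sorting keys longest-first is my own addition, to keep overlapping patterns from shadowing each other):

```python
import re

# One compiled alternation replaces everything in a single pass,
# so text produced by a replacement is never rescanned.
subs = {"&": r"\&", "$": r"\$", "(C)": r"\copyright{}"}

# Escape each literal key; longest keys first so "(C)" can't be
# shadowed by a shorter overlapping pattern.
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(subs, key=len, reverse=True))
)

def do_substitutions(text):
    return pattern.sub(lambda m: subs[m.group(0)], text)

print(do_substitutions("Fee: $5 & tax (C)"))  # -> Fee: \$5 \& tax \copyright{}
```

Because the replacement is computed by a function, the returned strings are inserted literally, sidestepping the backslash-interpretation issues that `$&`-style templates can raise.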
 Is it bad, or just right? (5.00 / 1) (#120) by tmoertel on Thu Jan 17, 2002 at 11:28:03 AM EST

As mentioned above, the Perl code has problems with the order in which mappings are applied. [emphasis mine]

Now, before you go declaring things to be "problems" or "bad," you might want to know that in the actual application, substitutions are allowed to depend upon one another and are applied in topologically sorted order. Thus, the order in which mappings are applied is not a "problem" but a requirement. I didn't mention this in the article because it wasn't material to the discussion, but it's apropos here if only to show that you didn't have enough information to declare what was "bad Perl code" and what wasn't. Next time, instead of saying "This is bad, I've got a better way," consider saying, "Here's another way, which is often better." With the latter approach you'll look good regardless of whether you fully understand the underlying circumstances.

It also has the problem that for every string of text that is converted, every pattern will be compiled as a regular expression, which is slow. I present a better way to do it. [emphasis mine]

Did you ever stop to think what my goal in writing that little Perl snippet was? Did you think it's possible that, rather than exploiting every Perl idiom (which might be unfamiliar to readers) and every Perl speed optimization (which might obfuscate the code), I may have wanted to present code that was accessible to most readers while at the same time exhibiting a high degree of parallelism with the ultimate XSLT implementation? You know, for the sake of the overall discussion?

Incidentally, since you seem to be concerned with the speed of regex matches, you might want to know that using $& anywhere in your code imposes a performance penalty on regex matches everywhere. If you care about speed, stick to parens.

Cheers, Tom

--My blog | LectroTest [ Disagree? Reply. ]
[ Parent ]

Sample code (5.00 / 1) (#115) by coleslaw on Thu Jan 17, 2002 at 01:13:19 AM EST

This is the solution I came up with to translate XML into LaTeX.
Notice how the call templates resemble a select-replace construct. The only unfortunate part is that it's very deeply nested, and takes forever to process a long document.

    \  ->  \ensuremath{\backslash}
    $  ->  \$
    &  ->  \&
    %  ->  \%
    #  ->  \#
    _  ->  \_
    {  ->  \{
    }  ->  \}
    ^  ->  \ensuremath{\hat}
    ~  ->  \ensuremath{\tilde}
    <  ->  \ensuremath{<}
    >  ->  \ensuremath{>}

What's the Point Here? (3.50 / 2) (#126) by tny on Fri Jan 18, 2002 at 05:01:46 PM EST

Similarly, the text portions of XML documents often contain text-markup idioms (like the three-letter sequence "(C)" for copyright) which should be translated into the proper LaTeX representations ("\copyright{}").

This is a problem with the data, not the structure of XML; properly speaking, one should use an entity for the copyright sign (or use the correct UTF-8 code point), NOT (C). It's not really a "text-markup idiom," it's lazy data entry. At any rate, whatever idiot wanted you to use purely XSLT for this job should have his head extracted. Do

    s/\$/\\\$/g; s/\(C\)/\\copyright/g; s/\&/\\\&/g;

then run the rest of the XML- (you don't say what vocabulary) to-LaTeX XSLT transform, and go home happy. XSLT is a STYLESHEET LANGUAGE (that's what the SL stands for), operating on structured document data, not a character-data processing language. It assumes that the character data is in a form that can be used by the target application. Making changes to the character data is not part of its purview, any more than semantic document markup is part of Perl's.
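The character map in the sample code above (and the quick substitutions tny suggests) can be packaged as a one-pass escape function. A sketch in Python, not the original XSLT, with the mapping taken from the table above:

```python
import re

# LaTeX special-character map from the sample above, re-expressed in
# Python (the original was a chain of XSLT call-templates).
LATEX_ESCAPES = {
    "\\": r"\ensuremath{\backslash}",
    "$": r"\$", "&": r"\&", "%": r"\%", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "^": r"\ensuremath{\hat}", "~": r"\ensuremath{\tilde}",
    "<": r"\ensuremath{<}", ">": r"\ensuremath{>}",
}

_special = re.compile("|".join(re.escape(c) for c in LATEX_ESCAPES))

def latex_escape(text):
    # One pass over the input: backslashes introduced by a replacement
    # are never themselves rescanned and re-escaped.
    return _special.sub(lambda m: LATEX_ESCAPES[m.group(0)], text)

print(latex_escape("50% of $10 & #1"))  # -> 50\% of \$10 \& \#1
```

Doing all the escapes in one pass is what keeps this correct: a naive sequence of per-character replacements would re-escape the backslashes that earlier replacements introduce.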
 XSLT, Perl, Haskell, & a word on language design | 129 comments (116 topical, 13 editorial, 0 hidden)