Kuro5hin.org: technology and culture, from the trenches

The Joy of Text

By gregholmes in Technology
Mon Oct 30, 2000 at 06:50:57 PM EST
Tags: Software (all tags)

Most of us are familiar with the Unix Philosophy, but it is a great feeling, and very instructive, to see it in action. I recently experienced a small but powerful example.

All (numerical) data should be stored as ASCII text.


The reasons are these:

  • ASCII text is a common interchange format -- it is not the best or most efficient, but it is the most likely to be understood on a variety of machines. Data which cannot be readily moved to another machine has limited value, and the act of converting data can be very expensive in time and money.
  • ASCII is easily read and edited -- specialized tools to manipulate the data are not required.
  • There already exists a wide array of text manipulation tools -- awk, cut, diff, grep, head, lex, more, perl, sed, sort, tail, test, wc.
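
As a small illustration of that last point, here is a minimal sketch in Python (not part of the original article; the sample data and the name `column_sum` are made up) that totals one column of whitespace-separated numbers, roughly what a one-line awk script would do. Because the numbers are stored as plain text, a few lines of generic code are enough.

```python
def column_sum(text, column):
    """Sum one whitespace-separated column (0-based) of a plain-text table."""
    total = 0.0
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and lines too short to have the requested column.
        if len(fields) <= column:
            continue
        total += float(fields[column])
    return total


# Hypothetical sample data: a label followed by a numeric reading.
SAMPLE = """\
monday 12.5
tuesday 7.25
wednesday 3
"""

if __name__ == "__main__":
    print(column_sum(SAMPLE, 1))
```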

OK, it is not numerical, but...

Thank goodness HTML is text.

If it weren't, my job would have been much more difficult. I needed to add spell checking to a search engine - if no results were found for a 'typo', I wanted to add a nice little table to the "no results" page, listing the original search words and the suggested alternatives. Even better, the suggested words should be links, triggering a new search.

How to do it? I have the source to htsearch, of course, but I'm more of a scripter than a programmer, and hard-coding this in seems like using a bomb to kill a rabbit.

Then light dawns: HTML is text. It didn't have to be; it could have been some proprietary (or not) binary format. But it is. My old friends ispell and sed can play. Scripting languages like python and perl are right at home!

The wrapper script was simple; nothing but search and replaces, finds, calls to ispell (filtered through sed to get only the suggestions). Just text manipulation. The tools were there; they're good, fast, and better than I could possibly code them.
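
The original wrapper is not shown here, so what follows is only a sketch of the idea in Python, under stated assumptions: it parses the line format that `ispell -a` prints and turns the suggestions into a small HTML table of links. The function names, the canned sample output, and the search URL (`/cgi-bin/htsearch?words=`) are all hypothetical and would need adjusting for a real installation.

```python
from html import escape
from urllib.parse import quote_plus


def parse_ispell(output):
    """Map each misspelled word to its suggestions, given `ispell -a` output."""
    suggestions = {}
    for line in output.splitlines():
        # "& word count offset: s1, s2" (near misses) or "? word ..." (guesses)
        if line.startswith(("& ", "? ")):
            head, _, tail = line.partition(": ")
            word = head.split()[1]
            suggestions[word] = [s.strip() for s in tail.split(",")]
        # "# word offset" means ispell found nothing to suggest.
        elif line.startswith("# "):
            suggestions[line.split()[1]] = []
    return suggestions


def suggestion_table(suggestions, search_url="/cgi-bin/htsearch?words="):
    """Render an HTML table whose suggested words link to a new search."""
    rows = []
    for word, alternatives in suggestions.items():
        links = ", ".join(
            '<a href="%s%s">%s</a>' % (escape(search_url), quote_plus(alt), escape(alt))
            for alt in alternatives
        ) or "(no suggestions)"
        rows.append("<tr><td>%s</td><td>%s</td></tr>" % (escape(word), links))
    return "<table>\n%s\n</table>" % "\n".join(rows)


# Canned output in the format `ispell -a` prints; a real wrapper would
# obtain it by running ispell, for example through subprocess.run.
SAMPLE_OUTPUT = """\
@(#) International Ispell Version 3.1.20
& serch 3 0: search, perch, serge
*
# xyzzyq 6
"""

if __name__ == "__main__":
    print(suggestion_table(parse_ispell(SAMPLE_OUTPUT)))
```

Everything here is string handling, which is the point: the spell checker's output, the search page, and the links are all text, so no special tooling is needed to glue them together.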

So I offer my thanks to the Unix philosophy: small tools, working together, manipulating text, solved a problem easily that could have been much more difficult. It is very nice to see a philosophy succeed in the real world.

The Joy of Text | 53 comments (46 topical, 7 editorial, 0 hidden)
I Suggest You Resubmit This Under Op-Ed (1.66 / 9) (#1)
by greyrat on Mon Oct 30, 2000 at 12:57:38 PM EST

Otherwise, it's a good article.
~ ~ ~
Did I actually read the article? No. No I didn't.
"Watch out for me nobbystyles, Gromit!"

Hell. I Haven't Had enough Coffee Yet Today (1.80 / 5) (#2)
by greyrat on Mon Oct 30, 2000 at 01:00:23 PM EST

I meant under 'Columns'. What bugs me more is that I inadvertently voted this article down, when in reality it fits just fine under 'Technology' too.
~ ~ ~
Did I actually read the article? No. No I didn't.
"Watch out for me nobbystyles, Gromit!"

[ Parent ]
how? (2.50 / 4) (#4)
by gregholmes on Mon Oct 30, 2000 at 01:08:16 PM EST

Hate to be ignorant, but do I just resubmit (with more appropriate selection) as if it were new, or is there a way to just change it? Afraid I'm at work and don't have the time to figure out (quick faq perusal does not explain it).

[ Parent ]
#groan# (2.00 / 4) (#6)
by greyrat on Mon Oct 30, 2000 at 01:17:43 PM EST

I believe that you just submit it all over again. Ask Rusty, I just lurk here. #:^/
~ ~ ~
Did I actually read the article? No. No I didn't.
"Watch out for me nobbystyles, Gromit!"

[ Parent ]
editing functions (2.16 / 12) (#7)
by dreamfish on Mon Oct 30, 2000 at 01:30:15 PM EST

Kuro5hin needs user features to:
  • allow a user to withdraw a submission (but not edit it - that would get too complicated)
  • re-assign the submission to a different category (this would solve the issue of people voting down a story just because it's in the wrong place)


Unix philosophy vs. the web (3.50 / 10) (#8)
by pete on Mon Oct 30, 2000 at 01:34:33 PM EST

This reminds me of a pretty interesting speech given by Tim O'Reilly at this year's JavaOne conference. He brought up the point that in Unix, you can almost always take the output from one program and run it through one or more additional programs using a pipe. What if web content were the same way? Think about distributed web applications chained together to access data.


--pete


I can almost see it (2.75 / 4) (#24)
by titivillus on Mon Oct 30, 2000 at 03:54:03 PM EST

http://foo.com/app/?url=http://bar.com/app/?url=http://quux.com/app/?url=http://fmep.com/app/?url=http://bleemer.com/input_page

But don't you think that looks slightly silly?
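
For what it's worth, a chained URL like the one above only survives if each inner URL is percent-encoded before it is handed to the next stage. A minimal Python sketch of that, treating the `url` parameter and the host names from the example as purely illustrative (the helper names `chain` and `unwrap` are made up):

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def chain(stages, source):
    """Wrap `source` in each stage; the first stage ends up outermost."""
    url = source
    for stage in reversed(stages):
        # urlencode percent-encodes the inner URL so its own "?" and "&"
        # cannot be confused with the outer query string.
        url = stage + "?" + urlencode({"url": url})
    return url


def unwrap(url):
    """Return the inner URL a stage was asked to fetch, or None."""
    values = parse_qs(urlsplit(url).query).get("url")
    return values[0] if values else None


STAGES = ["http://foo.com/app/", "http://bar.com/app/"]
PIPELINE = chain(STAGES, "http://bleemer.com/input_page")

if __name__ == "__main__":
    print(PIPELINE)
    print(unwrap(PIPELINE))
```

Each stage peels off exactly one layer with `unwrap`, fetches the result, and transforms it, which is about as close to a shell pipe as a query string gets.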



[ Parent ]
? (4.00 / 2) (#25)
by pete on Mon Oct 30, 2000 at 04:35:31 PM EST

You've automatically ruled it out because you don't like the first implementation of it that popped into your head? I'm sure you can come up with more clever ways of doing it than that. (That's what I'd say to one of my developers, anyway.)


--pete


[ Parent ]
I thought I had replied to this (none / 0) (#52)
by titivillus on Sun Nov 05, 2000 at 07:29:45 AM EST

It wasn't the only implementation I thought of, it was an implementation that I thought was funny, and I thought others would be amused by it.

It gets to the Zero-One-Infinity rule from the Jargon File. First we thought that the answer was zero (you couldn't access one web page from another), then we all realized that there was such a thing as LWP and we can use it along with CGI and then the answer was one. Now we're just trying to work out the scaling to make more than one work.



[ Parent ]
Check out perl... (none / 0) (#50)
by Mr. Neutron on Fri Nov 03, 2000 at 06:42:09 AM EST

Specifically, the libwww-perl bundle (Bundle::LWP). The LWP::UserAgent and LWP::ParallelUA modules act as a "client" (or, actually, a user agent). Then (assuming you know what the html looks like) you can use HTML::TableExtract, HTML::Element, HTML::Parse, HTML::LinkExtor, HTML::TokeParser, HTML::Entities... etc. etc. More than one paradigm for HTML parsing, for sure. But, of course, you're still trying to extract data from the layout... hence, as has been much mentioned, you have XML. Or you can already play with all sorts of "web piping" or whatever if you go check out some of the things going on at apache.org, like Cocoon, Jakarta/Tomcat, and Jetspeed. And don't forget mod_perl. It really is pretty cool stuff.

[ Parent ]
Oh yeah (none / 0) (#51)
by pete on Fri Nov 03, 2000 at 07:36:55 AM EST

Played with all of it; I use LWP every day. :-) (Well, except Cocoon, but I'm getting to that.)

To me it's not really a low level technical issue -- sure it's possible, but why isn't anyone doing it? There are a bunch of issues involved:

  • Conceptually, what kinds of things do you want to do?
  • How do you present it to the user?
  • Are sites going to allow you to extract data without showing their ads?

I still think about this from time to time, but haven't come up with anything great so far.


    --pete


    [ Parent ]
    So what about images? You gonna store images... (2.80 / 10) (#10)
    by marlowe on Mon Oct 30, 2000 at 02:01:19 PM EST

    as text?

    Oh, I get it! ASCII art!

    Now, about those audio files... maybe we can represent the waveform with asterisks.


    -- The Americans are the Jews of the 21st century. Only we won't go as quietly to the gas chambers. --
    Well ... (3.60 / 5) (#15)
    by gregholmes on Mon Oct 30, 2000 at 02:54:43 PM EST

    How about SVG graphics?

    ;)

    But seriously, they can be analyzed, parsed, and modified.



    [ Parent ]
    Re: So what about images? You gonna store images (2.00 / 2) (#28)
    by Fyndalf on Mon Oct 30, 2000 at 06:09:31 PM EST

    "You gonna store images .. as text?"

    Yeah, why not, haven't you ever looked at an XPM?

    [ Parent ]
    uuencode? (1.00 / 2) (#43)
    by fvw on Tue Oct 31, 2000 at 09:21:22 AM EST

    What do you think uuencode is for? :-)

    [ Parent ]
    NetPBM of course (2.00 / 2) (#44)
    by interiot on Tue Oct 31, 2000 at 10:09:31 AM EST

    There's the NetPBM library (Portable Bitmap). It's a good way to convert any format to any other format on Linux. The PPM format is outlined here.
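
To make the "images as text" idea concrete, here is a minimal Python sketch that writes a tiny image in the plain (ASCII, "P3") variant of PPM: a magic number, the width and height, the maximum sample value, then the red, green, and blue samples as decimal numbers. The function name and the 2x2 image are made up for illustration, and the sketch ignores the format's request that lines stay under 70 characters, which only matters for wider images.

```python
def to_plain_ppm(pixels, maxval=255):
    """Serialize rows of (r, g, b) tuples as a plain (ASCII, "P3") PPM image."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    lines = ["P3", "%d %d" % (width, height), str(maxval)]
    for row in pixels:
        # One text line per image row, samples separated by spaces.
        lines.append(" ".join("%d %d %d" % rgb for rgb in row))
    return "\n".join(lines) + "\n"


# A hypothetical 2x2 image: red, green / blue, white.
IMAGE = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

if __name__ == "__main__":
    print(to_plain_ppm(IMAGE), end="")
```

The result can be opened in a text editor or fed through sed and awk like any other text, at the cost of being several times larger than the binary form.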

    [ Parent ]
    Not All Data Should Be Stored as Text Files (3.28 / 7) (#13)
    by iCEBaLM on Mon Oct 30, 2000 at 02:17:44 PM EST

    Your example works for one very important reason: HTML is a markup language which needs to interact with any platform. However, it doesn't work for other examples, where the data does not need to be exchanged with the world (or can easily be converted for exchange) and the benefits of the new system outweigh the benefits of using ASCII text files.

    Case in point: Databases, Word processing files, Spreadsheet files.

    SQL Databases offer a wide variety of "value added" functions, relational and conditional lookup, on the fly sorting, etc, and are already available free. Why reinvent the wheel and code all of these features to work with text files which just aren't suited for it?

    Word processing and spreadsheet files need to store not only words but formatting, fonts, typefaces (bold, underline and italics) and sometimes even images. How will you effectively store these in an ASCII text file? You really can't.

    There is a reason for different file formats, if there were no reason for them they wouldn't exist.

    -- iCEBaLM

    i don't like your examples (2.25 / 4) (#27)
    by mikpos on Mon Oct 30, 2000 at 05:35:00 PM EST

    I don't have much experience with databases, so maybe you have a point there (though you might want to take that up with the person who posted about Pick). Literary documents (the kinds made by word processors) and spreadsheets are wonderful examples of things that should be text, though! Plus, they compress really well :)

    From straight ASCII to UTF-8 to TeX and PostScript through to XML (like AbiWord does), ASCII text still does a great job of dealing with literary documents. And as a bonus, it forces you to make inlined images external references (though I suppose some might consider that a bad thing). And spreadsheets! Who could forget everyone's favourite format: comma-delimited values?
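
On the comma-delimited point, a short Python sketch of how little it takes to work with a spreadsheet kept as text. The sheet contents, column names, and the function `total_cost` are invented for the example; the quoted field containing a comma is there to show why a real CSV parser is safer than splitting on commas by hand.

```python
import csv
import io

# Hypothetical spreadsheet contents, stored as comma-delimited text.
SHEET = """\
item,quantity,unit_price
widget,4,2.50
"gadget, large",1,10.00
"""


def total_cost(text):
    """Sum quantity * unit_price over every data row."""
    reader = csv.DictReader(io.StringIO(text))
    return sum(int(row["quantity"]) * float(row["unit_price"]) for row in reader)


if __name__ == "__main__":
    print(total_cost(SHEET))
```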

    There are some things which would be silly to put into text. Bitmap graphics come to mind, though XPM I think makes a good compromise (or at least it did until true-colour bitmaps started becoming common). Other assorted PCM signals (such as .wav's) wouldn't fare too well as text either, I wouldn't think.

    [ Parent ]

    point? (1.50 / 10) (#14)
    by lucid on Mon Oct 30, 2000 at 02:23:42 PM EST

    yeah, text is neat. but so what? i'm glad that text worked out for you, but this is about as thought provoking as hearing you thank Unix Almighty that that new car you bought was an automatic instead of a stick.
    -1

    I use Unix and drive a stick. (2.75 / 4) (#21)
    by Luke Scharf on Mon Oct 30, 2000 at 03:25:59 PM EST

    i'm glad that text worked out for you, but this is about as thought provoking as hearing you thank Unix Almighty that that new car you bought was an automatic instead of a stick

    I use Linux, drive a stick, and do pointer arithmetic. Any questions?



    [ Parent ]
    So do I (2.66 / 3) (#23)
    by titivillus on Mon Oct 30, 2000 at 03:50:18 PM EST

    I use Linux, drive a stick, and do pointer arithmetic. Any questions?

    I use Linux and drive a stick, for much the same reason. Control. I don't do pointer arithmetic, however, because I tend not to use languages where that works. I use perl.



    [ Parent ]
    Re: I use Unix and drive a stick. (2.00 / 1) (#35)
    by Freshmkr on Mon Oct 30, 2000 at 09:22:56 PM EST

    I use Linux, drive a stick, and do pointer arithmetic. Any questions?

    Yes, me too. But do you use an RPN calculator? ;-)

    --Tom

    [ Parent ]

    RPN calculator (3.00 / 1) (#41)
    by dabadab on Tue Oct 31, 2000 at 05:56:00 AM EST

    I do, for some reasons unbeknownst even to me I like dc (an RPN calculator for *nix) (though I happen to have a real-world Sharp "calculator" too (32k, z80, BASIC - just like a home computer from the 80's :) ))
    Stick, Linux: also applies.

    And yeah, text IS beautiful.

    We had to make lots of (potentially auto-generatable from other documents) documents for FrameMaker (a word processor) and I am glad that FM has an xml-like text-based description language (mostly for import/export purposes), so all the stuff could be generated with a shell script.
    --
    Real life is overrated.
    [ Parent ]
    Surely that's not enough (3.54 / 11) (#17)
    by General_Corto on Mon Oct 30, 2000 at 03:08:31 PM EST

    Storing information as text is all very well and good, and something which I would advocate to some extent. However, the reason I wouldn't advocate this too much is the simple fact that 'storing as text' just isn't enough.

    We live in a world where storage space and processing power are, to all intents and purposes (on a single-user level, at any rate), ubiquitous. We don't really need to buy the P4, as the P3 serves all our non-SETI needs. In fact, this has been the case for some time. I'm not going to get into a commercialism rant here, as that wouldn't achieve very much at all.

    Given the fact that we can store all we create, and can manipulate it with impunity, it makes far more sense to store data in a way which gives it inherent meaning. This is, in an acronym, XML.

    The whole point of XML is data-with-form. It has a really simple architecture. It has a well-defined format. It has a way to describe a format in order to do some checking on it. It also has a lot of effort behind it, both in academia and in the real world.

    Plain Text is dead. Long live XML.


    I'm spying on... you!
    RTFA (4.25 / 4) (#19)
    by itsbruce on Mon Oct 30, 2000 at 03:16:25 PM EST

    Plain Text is dead. Long live XML.
    Read the Fine Article, that is. If he classes HTML as text, I'm pretty sure he also includes XML.

    I eagerly await a command line toolkit that gives the same power over XML as the traditional tools offer over simple text.

    --

    It is impolite to tell a man who is carrying you on his shoulders that his head smells.
    [ Parent ]
    Inherent meaning (3.00 / 1) (#48)
    by Skeevy on Tue Oct 31, 2000 at 11:51:44 PM EST

    A common fallacy is that XML somehow magically imparts meaning to a chunk of bytes. An XML document contains no more inherent meaning than any other format, text or otherwise. You need additional information that can only be stored outside the document (and I'm not talking DTD's here).

    What is the meaning of the following XML fragment? How should it be presented to someone?

    <foo>blah blah <qux>blah blah</qux> blah blah</foo>

    XML is more convenient in some circumstances, but is no panacea. Remember, they said the same thing about ASCII thirty years ago.
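
A quick way to see the point is to parse the fragment above and look at what a parser actually recovers. This Python sketch (the helper name `outline` is made up) lists each element with its nesting depth and the text that precedes its first child. That is structure, and nothing in it says what a `foo` or a `qux` means or how either should be displayed.

```python
import xml.etree.ElementTree as ET

FRAGMENT = "<foo>blah blah <qux>blah blah</qux> blah blah</foo>"


def outline(element, depth=0):
    """List each element as (depth, tag, text before its first child)."""
    rows = [(depth, element.tag, (element.text or "").strip())]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


TREE = outline(ET.fromstring(FRAGMENT))

if __name__ == "__main__":
    for depth, tag, text in TREE:
        print("  " * depth + tag + ": " + text)
```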

    [ Parent ]
    I do worry... (3.88 / 9) (#18)
    by itsbruce on Mon Oct 30, 2000 at 03:10:58 PM EST

    about the number of new apps written for Linux that aren't aware of standard Unix practice. This is mostly a problem with X apps - not using the standard mouse-key cut-and-paste etc. The traditional X methods (single click, sloppy focus etc) have a lot to offer, but more and more we are just offered Windows clone apps by people who just don't seem to have considered anything else.

    A good example of an app not using text when it should is Pan, a Gnome newsreader. Not only does it store all the posts in db databases but it also keeps all the user settings in them. It's always been a convention with Unix apps to keep config data in text files - apart from anything else it makes understanding/troubleshooting the app much, much easier.

    Pan is a good app in itself, the best GUI newsreader for Linux IMO but that config decision was unnecessary and just one of the reasons why it will never tempt me away from Tin.

    --

    It is impolite to tell a man who is carrying you on his shoulders that his head smells.
    if it makes you feel better... (3.57 / 7) (#20)
    by enterfornone on Mon Oct 30, 2000 at 03:23:08 PM EST

    all of gnome will move to a windows like registry as of the next version (using gconf)

    --
    efn 26/m/syd
    Will sponsor new accounts for porn.
    [ Parent ]
    Erk (3.00 / 1) (#26)
    by itsbruce on Mon Oct 30, 2000 at 04:50:06 PM EST

    all of gnome will move to a windows like registry as of the next version (using gconf)
    I hope that's a joke...

    --

    It is impolite to tell a man who is carrying you on his shoulders that his head smells.
    [ Parent ]
    no joke (4.00 / 2) (#40)
    by enterfornone on Tue Oct 31, 2000 at 04:41:57 AM EST

    from http://cvs.gnome.org/lxr/source/gconf/README

    GConf is a configuration database system, functionally similar to the Windows registry but lots better. :-) It's being written for the GNOME desktop but does not require GNOME; configure should notice if GNOME is not installed and compile the basic GConf library anyway.

    It currently uses an XML backend, but I believe Sun recently contributed a Berkeley DB backend. So it is currently plaintext, not like Windows.

    --
    efn 26/m/syd
    Will sponsor new accounts for porn.
    [ Parent ]

    Tell it, Brother! (2.40 / 5) (#29)
    by hardburn on Mon Oct 30, 2000 at 06:22:28 PM EST

    No kidding. I also fear lots of the non-Free apps that are coming out. Games are OK to be non-Free because games are simply different, but why do we need CodeWarrior for GNU/Linux? GCC and vi (or Emacs, whatever) work just fine if you're willing to read a few HOWTOs.

    But I digress. The *nix philosophy has served us well for 20+ years, at least. It need not be destroyed by some proprietary manufacturers that can't tell Window$ from X.


    ----
    while($story = K5::Story->new()) { $story->vote(-1) if($story->section() == $POLITICS); }


    [ Parent ]
    UNIX standards.. (3.00 / 2) (#32)
    by Girf on Mon Oct 30, 2000 at 07:53:31 PM EST

    the number of new apps written for Linux that aren't aware of standard Unix practice.

    It is a problem, and I fear that I might be one of the people creating this problem as I have done the DOS->Windows->Linux route instead of the binary, 'no-HDD thingie'->UNIX->more UNIX->Linux route. Taking the Windows route I am unaware of many 'UNIX standards'.

    The short of it is: Where do people learn these things? Is there a basic UNIX tutorial out there?



    [ Parent ]

    The book: The UNIX Philosophy (3.50 / 2) (#36)
    by Laz on Mon Oct 30, 2000 at 10:00:15 PM EST

    One answer is: The UNIX Philosophy by Mike Gancarz (ISBN 1555581234): http://www1.fatbrain.com/shop/quicksearch.cl?SearchFunction=key&qtext=1555581234

    I highly recommend it. It's pretty light reading and solidifies a lot of the UNIX concepts you kinda know but didn't ever see written down.

    If you're not up for buying a book (though I highly recommend buying this one), search for "Tenets of the UNIX Philosophy"... people have the tenets from this book listed all over the inet.

    .laz

    [ Parent ]

    UNIX standards.. (3.50 / 2) (#33)
    by Girf on Mon Oct 30, 2000 at 07:54:52 PM EST

    the number of new apps written for Linux that aren't aware of standard Unix practice.

    It is a problem, and I fear that I might be one of the people creating this problem as I have done the DOS->Windows->Linux route instead of the binary, 'no-HDD thingie'->UNIX->more UNIX->Linux route. Taking the Windows route I am unaware of many 'UNIX standards'.

    The short of it is: Where do people learn these things? Is there a basic UNIX tutorial out there?



    [ Parent ]

    Yes. (4.00 / 1) (#37)
    by h2odragon on Tue Oct 31, 2000 at 02:01:38 AM EST

    The Jargon File is the most useful single source primer for unix practice and culture. It's not the only source by any means, but having read it you'll either at least begin to Get It, or have a good hard confirmation that you shouldn't be allowed near digital devices of any sort.

    ...just my opinion, of course...

    [ Parent ]

    Pick and friends (3.50 / 6) (#22)
    by djkimmel on Mon Oct 30, 2000 at 03:47:45 PM EST

    The idea of using plain old ascii text to store numeric data is not a new one. In fact, old database systems like Pick always used plain old text to store numeric values.

    Of course, since Pick is a very free-form database, this approach is pretty much required. Field typing? What's that? It makes for some great possibilities, but enforcing data integrity is not one of them.

    Anyways, that's a bit of a tangent... Maybe I'll write an article on Pick and friends later.

    -- Dave
    Isn't that the principle behind XML? (4.28 / 7) (#31)
    by Broco on Mon Oct 30, 2000 at 07:19:32 PM EST

    From what I understand, this is the attitude behind XML: storing as much data as possible as easily parsable text. XML actually takes your idea one step further, by enforcing a standard of organization on the text (using tags). I think this is closer to what you want than just "ASCII text", since technically, ASCII text is just any data whose bytes all fall in the 7-bit range (32-126 for the printable characters), and isn't necessarily more parsable than binary data. "DF@%#^SDG?%!^ADGdafg" is ASCII text for example :).
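
The "it's ASCII, so what?" point is easy to check mechanically. In this Python sketch (function names made up for the example), the line-noise string from the comment passes both tests, which confirms that being ASCII says nothing about being parsable or meaningful.

```python
def is_ascii(data):
    """True if every byte is in the 7-bit ASCII range (0-127)."""
    return all(byte < 128 for byte in data)


def is_printable_ascii(data):
    """True if every byte is a printable ASCII character (32-126)."""
    return all(32 <= byte <= 126 for byte in data)


NOISE = b"DF@%#^SDG?%!^ADGdafg"

if __name__ == "__main__":
    print(is_ascii(NOISE), is_printable_ascii(NOISE))
```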

    Later on, simple XML tools much like the Unix text parsing tools will probably get popular, and it seems like everything is turning into XML, so problems like you described will get even easier to solve. I can't wait :).

    Still, I'd have to say that text isn't entirely a panacea. Bitmaps and sound, for example, I don't really see the advantage of being able to edit those with perl :). But indeed, it sure would be nice to be able to edit Excel documents with vi ...

    Klingon function calls do not have "parameters" - they have "arguments" - and they ALWAYS WIN THEM.

    good point. (3.50 / 2) (#39)
    by h2odragon on Tue Oct 31, 2000 at 02:10:06 AM EST

    However, your sendmail.cf snippet there doesn't look right. It should be more like:
    R$* M $* M $* C $* T $* ! $* $@ $1 M $2 M $3 C $4 T $1 ! $6

    source: Peter van Heusden's Turing Machine in sendmail

    [ Parent ]

    Yes, but... (4.00 / 2) (#45)
    by excalibor on Tue Oct 31, 2000 at 10:58:11 AM EST

    Yes, but

    <?xml version="1.0"?>
    <!DOCTYPE ghj SYSTEM "-//ghj/DTD ghj 1.9"EN">
    <ghj jhgj="ghj jhg gh gj">
    <ui>kjh hjk lkjh h hlkjhlk lkjh</ui>
    <yu yuy="rtyytr" />
    </ghj>

    isn't very understandable, either :-)

    There's always context. XML just adds structure, not semantics... Those are added by context and the human brain (or any other kind of brain, even Klingon ones :)

    greetings,

    [ Parent ]
    XML tools (none / 0) (#53)
    by interrupt on Tue Nov 07, 2000 at 11:55:58 AM EST

    I'm not particularly well versed in the universe of XML, but do such "XML aware" counterparts to unix text processing utilities exist? Is anyone aware of projects underway to that end? It seems to me that one of the greatest barriers to acceptance of XML for all configuration files is the fact that you would actually be making some configuration files (think of the simpler denizens of /etc) more difficult to deal with...


    "Morality is your agreement with yourself to abide by your own rules." -Jubal Harshaw in Robert Heinlein's Stranger in a Strange Land
    [ Parent ]

    Text is good but... (3.50 / 8) (#34)
    by jesterzog on Mon Oct 30, 2000 at 09:20:23 PM EST

    ...it wouldn't be anywhere near as good if it was just text.

    This is why markup languages have been so revolutionary. Text moves between platforms easily, but it's often useless (or at least still difficult to manipulate) without a context wrapper that machines can easily understand. I'm even using one as I type this so all the web browsers in the world can understand the paragraphs I'm writing properly.

    Personally I think that markup languages are at least as important as text itself, and deserve a lot of credit. Without them, text on its own would be much less useful.


    jesterzog Fight the light


    agreed (3.50 / 2) (#42)
    by gregholmes on Tue Oct 31, 2000 at 06:22:55 AM EST

    Absolutely. What's nice is that you get all the benefits of markup languages (which are huge) and it's still text. Some tools can parse and interpret the markup, others don't give a rip, except now you have more choices (or more refined choices) for pattern recognition.

    After all, it could have been like a Word doc --shudder-- where paragraphs are "tagged" with structure and formatting information. Yeah, you can kind of parse that ... with something like HTML Transit, if you have a few extra thousand $ lying around. And it still isn't easy, and it doesn't work very well.



    [ Parent ]
    DOWN WITH ASCII !!!!! (1.33 / 3) (#38)
    by delmoi on Tue Oct 31, 2000 at 02:02:06 AM EST

    8bit text sucks, Unicode is the future!
    --
    "'argumentation' is not a word, idiot." -- thelizman
    NO information should be stored in ASCII text (4.00 / 3) (#46)
    by delmoi on Tue Oct 31, 2000 at 01:17:59 PM EST

    None at all. I'm not saying I have a problem with things like XML or even plaintext config files, quite the opposite. But we should really be getting out of the habit of using 8bit text.

    Text is good, 8bit ASCII text is bad.

    We should all be using Unicode, and the sooner we all do, the better. Actually, you can't really be a fan of XML and ASCII at the same time, since the XML standard specifically requires Unicode.

    When you say that all info should be stored as Text, I'll agree with you. When you say it should all be ASCII, I won't.
    --
    "'argumentation' is not a word, idiot." -- thelizman
    Unicode (3.00 / 1) (#49)
    by Skeevy on Tue Oct 31, 2000 at 11:56:41 PM EST

    That would be nice, if Unicode were nearly as universally supported as ASCII is.

    When you have a full set of Unicode-savvy processing tools, a wide range of fonts that can render _every_ Unicode glyph, and you have simple input methods for most of those glyphs, then yes, everything should be Unicode.

    And even Unicode is not enough. The committee that decides where glyphs go conveniently left out us chemistry types, who like to have subscripts for all the roman and greek characters.

    I'll stick with ASCII, thank you.

    [ Parent ]
    Sort of (4.00 / 3) (#47)
    by Simon Kinahan on Tue Oct 31, 2000 at 04:51:49 PM EST

    The main benefit of text is that it's a good lowest common denominator. If you have no tools to handle something, but it's text, you can get out an editor, or fire up the Unix text processing utilities, and mess with it that way. Of course, the same argument can be made for binary, but to a lesser extent, as it's harder for humans to understand.

    The argument has limits though. Saying "use text" is not an adequate description of a file format. It says nothing about structure or semantics. To pick a couple of random examples, tail and more are only really useful on files that have a 'natural' sequential order for their contents (logs, for instance). Similarly you cannot use ispell on plain HTML, as it expects pure English text, which HTML is not. This is an obstacle to the "small tools" philosophy in its most naive form. Not all tasks can be seen as simple manipulations of text.
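
The ispell example shows what the extra step looks like in practice: before the words can go to a spell checker, something has to separate them from the markup. A minimal Python sketch using the standard library's HTML parser follows. The class and function names and the sample page are made up, and the sketch does not try to skip script or style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for runs of text; tag names and attributes never arrive here.
        self.chunks.append(data)


def visible_words(html):
    """Return the words a spell checker should see, without tags or attributes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


PAGE = '<p class="intro">No results for <b>serch</b>.</p>'

if __name__ == "__main__":
    print(visible_words(PAGE))
```

The small tools still compose, but only once one of them knows the structure of the format, which is the limit being described here.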

    Also, the advantages of text are outweighed by other considerations in many cases. Bitmapped images and sound, for instance, where the data being stored (samples of something, in essence) is just a bunch of numbers, and cannot sensibly be manipulated or inspected as text, and storing it as text would introduce excessive space requirements.

    This may sound a bit curmudgeonly, but I think it's important to recognise these limitations. I've seen people use "a text file" as if it adequately described all I needed to know about the syntax of some (quite complex) file format. Similarly, people have used "it's not text" as a counterargument against quite sensible proposals about how to store what was in essence a database.




    Simon

    If you disagree, post, don't moderate