Kuro5hin.org: technology and culture, from the trenches

Content Creation and Text processing: Work smart

By barefootliam in Internet
Fri Dec 30, 2005 at 11:55:17 PM EST
Tags: Internet (all tags)

I've been involved in transcribing texts and in converting documents into different formats for more than twenty years. These days it's mostly conversion into HTML or XML. I've seen a lot of wasted effort, and also learnt a lot of lessons that are useful to share.

It might seem odd that I'd take the time to write about this, but as XML Activity Lead at W3C I'm concerned about what you might think of as the grass roots and origins of XML every bit as much as XML in the big companies. So what follows are some notes on converting textual documents into XML and HTML.

The Web, of course, is all about content; that is, information. OK, these days it's also about entertainment, but that comes from having entertaining content. A lot of textual Web content started out printed on paper, and has been typed in (re-keyed, as they say) from the printed page, or has been converted from some word processing or typesetting format into (X)HTML. If you are doing this, you have probably discovered that typing in page after page of a book can quickly get pretty boring. Any task that requires concentration but is boring becomes very error-prone. I've seen transcriptions of books that have whole pages missing, as well (of course) as being peppered with little mistakes. This article is about some ways to help you work in this area more effectively. The same techniques apply to content that you have written yourself, except that if you wrote it recently you probably have an electronic copy in XML or plain text or HTML that you can start with.

With a little care there are a lot of ways to avoid most of the mistakes. One way to reduce the tedium is to use optical character recognition (OCR) to read the text. I often find myself working with old books, printed before (say) 1800, that use letter forms that no OCR program I have found can handle. But for newer texts, the commercial Abbyy FineReader seems to do a good job - perhaps 100 times better than the GNU OCR program in the tests I tried, although you should do your own tests before settling on any single piece of software.

OCR software is not error-free. Common mistakes include confusing letters and digits (0/O, 1/l/!/i/I seem to be the most common), missing the first or last character off every line on a page, reading marks on the page as quotes or letters, and losing formatting that may have been of use (I shall return to formatting later in this article).
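
A quick way to surface some of these letter/digit confusions is to flag "words" that mix letters and digits, since genuine English words rarely do. A minimal sketch in Perl; the function name is mine, not from any library:

```perl
# Flag tokens that mix letters and digits -- often OCR confusions
# such as 0/O or 1/l.  A rough first pass, not a polished tool.
sub suspicious_tokens {
    my ($text) = @_;
    return grep { /\d/ && /[[:alpha:]]/ } split ' ', $text;
}
```

Run it over the OCR output and review the resulting list by hand: "31" passes (all digits), but "b0y" gets flagged.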

The boring way to check the output of the OCR program is first to run the built-in spell checker, and then to read word by word for typos. It's tedious, but professional proofreaders know techniques for making it less error-prone. For example, have one person read out loud from the original and the other follow along in the OCR output; swap places frequently. If you do this, make sure you use a font in which the letters ell, I, digit one, and exclamation mark are all very clearly distinct, and the same for other easily-confused pairs such as S/5 and 0/O.

A more interesting approach is to write some programs or simple scripts to check for oddities. I have often found that an hour writing a script can save days of manual work.

Before embarking on writing scripts, you need one of two things: the right frame of mind to write a script, or someone else to do it for you. In either case, the frame of mind is very different from what's needed for making a product, so not all professional programmers are good at doing this until they understand the differences. Similarly, if you have never programmed before, realise that for at least the first few scripts you'll need to have someone to help you, whether locally or via electronic mail, Internet Relay Chat (IRC) or instant messaging (IM). IRC and IM work well for this because you can easily paste fragments of code; the telephone is just frustrating.

Some differences between one-off scripts for checking texts and code for a programming project:

  1. It's not worth taking time to find a general solution. If a script can save two days of work, it's not worth spending ten days writing it, especially since it might never be used again.
  2. Conversion scripts tend to be run hundreds of times in a single project, perhaps with slight changes, and then copied and modified for the next project. It's important to preserve the old script with the project that used it, so that you can get the same results if you run it a year from now. If you later find a bug in the script, you might or might not want to go back and retrofit a fix.
  3. Ten minutes of fiddling can sometimes produce more than you get out of a day of design.

I've known programmers who can work on contracts, on conversion scripts and on product code with seemingly equal ease, but it's rare, and in all cases the programmer needs to have a clear understanding of which is wanted. Don't assume that she knows what you want!

If you do a lot of document conversion, you might want to go to a professional data conversion house, or buy tools for the purpose. People like DCL and Exegenix are likely to have booths at XTech in Europe or XML 2006 (or whatever) in North America. I'll focus on projects that are probably below 20,000 pages of printed or text content and assume that you are going to do the work yourself, but you should also consider that you can often send work to keying bureaus in developing countries, whose low-paid workers will gladly (?) type in a book three times and compare the results.

OK, so you have got a megabyte (say) of text (the King James Bible is about five megabytes), perhaps from an OCR program or maybe you took it from Project Gutenberg and carefully removed the attribution to Project Gutenberg, as required by their rather odd licence.

To be even more specific, I've recently been working on The Notebooks of Leonardo da Vinci which gives me lots of good examples. I have been doing this because the Project Gutenberg edition did not have any of the diagrams or illustrations, and without them it seemed to me a lot of the text made no sense: it's full of sentences like this: "For although the lines m n and f g may be seen by the pupil they are not perfectly taken in, because they do not coincide with the line a b." which, I'm sure you will agree, makes more sense when you can see the diagram. So I have been working on an edition of this book (it has two volumes) for fromoldbooks.org to help make it more widely available in a more useful format.

The main body of the Notebooks is composed of 1,566 articles of varying length, each starting with the article number. A Perl script that recognised the numbers and looked for ones out of sequence found several errors, and helped me to identify two missing pages from the transcription!

#! /usr/bin/perl -w
use strict;

my $n = 0; # article number
while (<>) {
    # look for lines like
    # 306.
    # and make sure all the item numbers are there.
    if (m/^(\d+)\.$/) {
        if ($1 != $n + 1) {
            die "out of sequence after $n: $1";
        }
        $n = $1;
    }
}

When I ran something like this on the Project Gutenberg etext, I found several errors. Some of them, as I mentioned earlier, were because of transcription errors. I fixed most of these by re-keying the missing one and a half pages or so, and also by indenting with spaces a line that was indented in the text and flush left in the transcription. I then had a couple of false matches, and this leads to an important decision.

If I find errors in the transcription, I can fix them in my copy (and perhaps in this case send the fixed copy back to Project Gutenberg, although my texts more often come from OCR or from hand typing). But what if the problem is that my script isn't very smart? For example, consider

The boy wore black socks, as you can see in the illustration on page
31.

Here simple line wrap has put the "31." on a line by itself and triggered a false match for an article number. The easy fix is to join the lines back together, and if this isn't poetry and we are not trying to preserve all the exact line-breaks that's probably just fine. But now you've modified the original. The other choice is to make the script smarter, whether by building in a list of errors (or line numbers) to ignore, or by making it look for a blank line followed by an article number.
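
A sketch of that smarter script, in the spirit of the one above: only accept a number as an article number when the previous line was blank. The function form and names are mine, for illustration:

```perl
# Accept "31." as an article number only when the previous line was
# blank, so a page reference wrapped onto its own line is not a
# false match.  Returns a list of problems rather than dying.
sub check_article_numbers {
    my @lines = @_;
    my ($n, $prev, @errors) = (0, "");
    for my $line (@lines) {
        if ($line =~ /^(\d+)\.$/ && $prev eq "") {
            push @errors, "out of sequence after $n: $1" if $1 != $n + 1;
            $n = $1;
        }
        $prev = $line;
    }
    return @errors;
}
```

The wrapped "31." is preceded by a line of prose, so it no longer triggers; a genuinely missing article still does.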

If there are a lot of false positives, you should make the script smarter. If there is only one, teach the script to ignore it, or fix the input.

Once you have modified a file at any stage, keep a careful note of what you did, perhaps in a README file. If you already ran several scripts whose output is the file you are editing, then when you run the scripts again you'll lose your hand editing! So you need to be able to repeat the editing.

I have mentioned one check: article numbers. Other checks for this file included matching double quotes so that I can turn them into typographic "double quotes" like these, and matching up the markup for italics that was used in the transcription (_italics_). I found quite a few errors in the transcription by doing this.
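
Both of those checks reduce to counting: within a paragraph, double quotes and underscores should pair up. A minimal sketch, again with a function name of my own invention:

```perl
# Report paragraphs in which " or _ characters do not pair up --
# a cheap way to find unterminated quotes and italics.
sub unbalanced_markup {
    my ($para) = @_;
    my @problems;
    my $quotes = () = $para =~ /"/g;
    my $unders = () = $para =~ /_/g;
    push @problems, 'unmatched "' if $quotes % 2;
    push @problems, 'unmatched _' if $unders % 2;
    return @problems;
}
```

Run it paragraph by paragraph and inspect each report; most will be real transcription errors, a few will be quotations that legitimately span paragraphs.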

If you are working on a dictionary, you can consider a completeness test: is every word used in a definition also itself defined? Usually the answer is No, but you'll also get a list of all typos or OCR errors that happen not to be real words. Another useful approach is to look for words that only occur once. Unfortunately, as Zipf observed, word frequency in English follows a power law, so that a very few words account for most of the text and a very large number of legitimate words occur only once or twice; but in practice most writers have a vocabulary of under 30,000 words, so for a very large text this test can be extremely useful. Even for a smaller one, look up all words that occur only once in a dictionary or spell checker and you may well find a lot of typos. I use my text retrieval package for that, or sometimes a simple Perl script.
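
Such a "simple Perl script" can be little more than a frequency hash. A sketch (the word pattern is deliberately crude):

```perl
# List words that occur exactly once in a text -- candidates for
# typos and OCR errors.  Case-folded; apostrophes kept for words
# like "don't".
sub hapaxes {
    my ($text) = @_;
    my %count;
    $count{lc $_}++ for $text =~ /[A-Za-z']+/g;
    return sort grep { $count{$_} == 1 } keys %count;
}
```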

Section numbers can also be checked for, and can often be matched up to a table of contents. Footnote numbers should correspond to footnotes, although a single footnote can sometimes be referenced more than once on a page.

Although I gave a link above to my converted text, it's still very much an ongoing project. You may notice, for example, that sometimes a section header has fallen into the preceding item by mistake. I'll work on improving it (and on formatting the footnotes better!), and also on adding more of the images, as time permits over the next few months. Splitting it into items, instead of one huge file, already seems to me to make it easier to read online and navigate around.

This leads me to the next point I wanted to make. Generate XML as early as you can, and validate it against a DTD or Schema. This will really help to make sure that your scripts are producing what you expect. When we were designing XML itself, we were careful to keep in mind the Perl programmer who needed to make a small change to XML documents by treating them as text files, and that's often a useful strategy. But you can also use XML Query or XSLT. I used Mike Kay's Saxon 8, a Java-based implementation of the draft version 2 of the XSLT transformation language. Hmm, long words. XSLT is a standardised scripting language for changing (transforming) XML documents, whether it's into different XML, into HTML or XHTML, or into text. Saxon supports regular expression substitution (standard in XSLT 2) and also can easily generate multiple output documents, in this case one per article in the input.
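
That "XML as text file" strategy can be as small as one substitution. A hypothetical example: renaming an element everywhere in a document. This is fine for quick conversion work on files you control, though a real XML parser is safer on arbitrary input:

```perl
# Rename an XML element throughout a document with a plain text
# substitution -- the "treat XML as a text file" approach.
# Handles both start and end tags, and tags with attributes.
sub rename_element {
    my ($xml, $old, $new) = @_;
    $xml =~ s/<(\/?)\Q$old\E(?=[\s>])/<$1$new/g;
    return $xml;
}
```

Validating the result immediately afterwards, as suggested above, catches the cases where a substitution like this was too blunt.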

Once I made all those directories, another Perl script copies the images into the appropriate directories. I could have put all the images in a single directory, but with over five hundred of them, and two or three versions of each image, that would quickly get unmanageable. The script reads the generated HTML files, finds which images are needed, makes image subdirectories and copies them in place.

The upshot of all this is that there are lots of small scripts in multiple languages (XSLT, Perl, XML Query, Unix shell, whatever is convenient), each with a single, clear purpose that is described in comments near the start of the script. But how do I know how to run them all?

One approach is to make a shell script called runme that will run everything in the right order. This is pretty simple to do and understand.

A more sophisticated approach is to use the Unix make facility. You make a file called Makefile that describes how to make each intermediate file by running commands. Make examines the Makefile and then looks to see which files need to be rebuilt at any time (don't forget to tell it that the output of each script depends on both the input to the script and the script itself, so that the script is re-run if you change it). This works much better, especially on larger projects, but make uses an arcane and difficult syntax in which leading tabs are different from leading spaces.
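
A minimal sketch of such a Makefile (the filenames and script names are hypothetical). Note that each target depends on its script as well as its input, and that the indented command lines must begin with a tab, not spaces:

```make
# Hypothetical conversion Makefile: each output depends on both
# its input and the script that produces it, so editing either
# one causes a re-run.
notebooks.xml: notebooks.txt txt2xml.pl
	perl txt2xml.pl notebooks.txt > notebooks.xml

html: notebooks.xml split.xsl
	java -jar saxon8.jar notebooks.xml split.xsl
```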

Whichever way you go, if you need to do hand editing at any point, you can't automate the entire process. If you decide you do need to run the whole conversion automatically, perhaps to run nightly tests, you could investigate Larry Wall's patch utility. You use the normal Unix diff program to make a file that contains the differences between the input and your hand-edited version. Then later you can use patch to recreate the hand-edited version from the original, or from a file that's pretty similar to the original. That way you can replicate the changes. I don't bother with this on small projects, but instead copy the original file and call it something like input-handedited.txt instead.

The way of looking at conversion I am trying to give you is this: always ask yourself, what can I test by program, and what can I change automatically? Don't try to do the whole conversion in one go, but instead write lots of small scripts that each do a little of the work on the whole file, and validate (or test) the file at each stage.

Each script should log errors or implausible data, such as item numbers out of sequence, very long lines, italics that cross paragraph boundaries, chapters without titles, and so forth. If you have recorded page breaks in a transcription (I usually put them in XML comments,
<!--* page 49 *-->
so that they don't affect XML processing but I can easily find them) then check for missing pages, duplicate pages, or pages in the wrong order. If most of the time it's because you made a typo in the comment, that's OK: sometimes it will be because you missed a page. So, test, test, test.
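
The page check is another counting script. A sketch, assuming the comment format shown above:

```perl
# Check that <!--* page N *--> comments increase by exactly one,
# reporting missing, duplicated, or out-of-order pages.
sub check_pages {
    my ($text) = @_;
    my @pages = $text =~ /<!--\*\s*page\s+(\d+)\s*\*-->/g;
    my @problems;
    for my $i (1 .. $#pages) {
        push @problems, "page $pages[$i] follows page $pages[$i-1]"
            unless $pages[$i] == $pages[$i-1] + 1;
    }
    return @problems;
}
```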

I could write more about making valid XHTML + CSS (and why), about navigation, and about how all this interacts with Web search engines, but that's getting off-topic.

I have also assumed that you are using a Unix-like operating environment. Unix was developed with text processing as a major goal early on, and it shows. I once watched a junior consultant (on approximately $600/day I think) spend a whole day editing one file. I wasn't his manager, but I saw that what he spent all day doing could have been done with a single fairly easy regular expression substitution in under a minute. If you are intimidated by regular expressions, don't be. Take the time to learn them. Regular expressions are the single most powerful text processing tool ever invented.
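
To make that concrete, here is the kind of whole-file edit that can swallow a day by hand and takes a minute as a substitution: turning every "page N" reference into a link. The pattern and the output filenames are hypothetical:

```perl
# Turn "page 31" style references into links, across the whole
# text in one pass -- a typical one-substitution job.
sub link_pages {
    my ($text) = @_;
    $text =~ s{page (\d+)}{<a href="page$1.html">page $1</a>}g;
    return $text;
}
```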

If you are unlucky enough to be trying to do a lot of text conversion on a non-Unix system such as Microsoft Windows (which has different strengths, but text processing with scripting isn't its strongest point) install Cygwin, or, better, a good solid Linux distribution. Again, good tools are worth learning.

This is a pretty short introduction; I'd welcome feedback on which parts to expand, or on possible future articles.




Content Creation and Text processing: Work smart | 73 comments (52 topical, 21 editorial, 0 hidden)
I would like to compliment you (1.11 / 9) (#6)
by AlwaysAnonyminated on Thu Dec 29, 2005 at 10:16:56 AM EST

on your beautiful troll.
Posted from my Droid 2.
Re: I would like to compliment you (3.00 / 2) (#12)
by barefootliam on Thu Dec 29, 2005 at 01:51:56 PM EST

Er, thank you, I didn't know I was trolling. Maybe the article comes across as criticising Project Gutenberg? I didn't mean for that to be the focus of it. Apart from that, I don't know why you think it a troll.

Thanks for taking the time to comment.

Liam; you might also like Words and Pictures From Old Books
If winning at buzzword bingo was winning at life (2.00 / 2) (#9)
by So Very Tired on Thu Dec 29, 2005 at 01:18:42 PM EST

Then you'd be president, son.

If winning at buzzword bingo was winning at life (none / 1) (#14)
by barefootliam on Thu Dec 29, 2005 at 02:31:15 PM EST

>Then you'd be president, son.

Thanks, grandma! :-)

I've expanded more of the acronyms (IRC, IM) and replaced a few jargon words that slipped through. On the other hand I think that introducing some of the names of technologies like XSLT and Perl (for example) is useful for people. I wish Kuro5hin allowed the HTML abbr element, or even span with a title, so that the full expansion of the abbreviations and jargon terms would appear when you hovered a mouse pointer over them, and would also be available for people using text readers.

Thank you for the comment: it was a useful reminder that I might be using too many technical terms.


Tool recommendation: XMLStarlet (none / 1) (#16)
by tmoertel on Thu Dec 29, 2005 at 04:57:02 PM EST

If you're doing any kind of ad-hoc work with XML documents, take a look at XMLStarlet, an MIT-licensed command-line tool that handles almost all common XML-processing tasks, including validating, transforming, querying, reformatting, and escaping. It is built upon the popular libxml2 and libxslt libraries and supports XInclude, XSLT, and XML c14n canonicalization. For validation, you can use DTD-, XSD-, or RelaxNG-based content models. And it's fast.

Do check it out.

My blog | LectroTest


Re: Tool recommendation: XMLStarlet (none / 0) (#17)
by barefootliam on Thu Dec 29, 2005 at 05:22:10 PM EST

Yes, I've seen it, although it's not entirely clear to me what advantage it has over libxml's xsltproc and xmllint.

I didn't really think that Kuro5hin was the best place for writing about XML tools - I also have to be a bit careful about what I endorse, because people read too much significance into what I say about XML sometimes :-) which is an occupational hazard I suppose. But do you think it would be worth doing a summary/survey of open source XML tools, for example?

I do use xsltproc heavily, but the advantage of Saxon 8 is the XSLT 2 support, and in particular its grouping, output character maps and regular expression substitution. I also use saxon for XML Query, although the FromOldBooks Search Page uses Qizx/open, which for my particular application worked out faster.

Thanks for commenting!


+1FP, long $ (none / 1) (#23)
by alevin on Fri Dec 30, 2005 at 03:05:00 AM EST

Is that all it takes these days? /nt (none / 0) (#24)
by Ignore Amos on Fri Dec 30, 2005 at 12:21:24 PM EST

And that explains why airplanes carry cargo on small boats floating in their cargo aquarium. - jmzero

Some of us (3.00 / 2) (#28)
by SoupIsGoodFood on Fri Dec 30, 2005 at 04:09:47 PM EST

are easily impressed.

XML transformations to pre-include navigation? (none / 1) (#26)
by MichaelCrawford on Fri Dec 30, 2005 at 03:20:33 PM EST

I have been advised that it would be easier to use PHP includes to add site navigation to my web pages, but it seems silly to me to have to run PHP every time a page is fetched to add the navigation to it.

What I would like is to mark where the navigation is to go in my XHTML documents, then run some kind of processor on them that would transform them into the pages that I actually upload to my site.

That way, I don't have to hand-code my navigation for each page, but the processing happens only once.

I think I can use stylesheet transformations for this, but having never done them, I don't have the first clue of where to begin. Can you Refer me to The Fine Manual page that tells me what to do?


Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy

Re: XML transformations to pre-include navigation? (none / 0) (#29)
by barefootliam on Fri Dec 30, 2005 at 05:21:31 PM EST

I'm not sure this is very on-topic - maybe I should write an article about script-driven Web sites where you use scripts (e.g. XSLT) to make static XHTML Web pages. At any rate, it's not silly to run PHP each time if your computer can do it: the navigation will always be up to date! An alternative I have used is to use server-side includes (with the Apache Web server) to include HTML files that are generated when needed. I do this for the "recent images" on the From Old Books main page: the HTML fragment is generated by the same script that makes the RSS feed, when I add a new image.

If your Web pages are well-formed XHTML, you can use XSLT on them, yes. You can check them with xmllint (part of libxml) or with our W3C HTML validator. Make sure your pages are at least well-formed, and preferably valid, before trying to use XSLT on them, as otherwise you'll just get lots of cryptic error messages.

For the Notebooks of Leonardo that I mentioned in the article, all the text is in a single XML file (actually a text file that's converted to XML by a fairly simple Perl script), and I run an XSLT transformation to generate all of the Web pages, including the navigation.

Another way is to have XSLT that runs on the Web browser and generates extra links, but you have to be aware that not all browsers support XSLT (e.g. old versions of Opera) and also that you can't easily (or efficiently) check to see if a file exists on the server when you're in the client.

Which is the best approach for you? It depends on your workflow. XSLT is useful if you work with XML files a lot, and certainly worth using, but if you don't, and you can already write PHP code, I'm not going to suggest learning a new technology just for the sake of it. Of course, you might want to learn XSLT and be using this as an excuse... :-)

Hope this helps!


XSLT > PHP (none / 0) (#30)
by MichaelCrawford on Fri Dec 30, 2005 at 10:55:37 PM EST

Thanks for your reply. Actually I was hesitant to use PHP because I've never used it, but I've done quite a bit with XML.

I'll look into using XSLT further.



Re: XSLT > PHP (none / 0) (#35)
by barefootliam on Sat Dec 31, 2005 at 12:41:05 AM EST

I use both XML Query and XSLT 2 for my own Web site, and they are both very useful.

Liam (I am Ankh on Advogato, by the way -- hi! happy Blue Ear, etc etc!)

SSI (none / 0) (#47)
by horny smurf on Sun Jan 01, 2006 at 12:35:16 AM EST

I've used server side includes (.shtml) for navigation menus, for times when php is overkill.

Apache's SSI support includes conditional expressions:

<!--#if expr="'$DOCUMENT_NAME' = 'index.shtml'" -->
<!--#else -->
<li><a href="/">Home</a></li>
<!--#endif -->
<!--#if expr="'$DOCUMENT_NAME' = 'pictures.shtml'" -->
<!--#else -->
<li><a href="pictures.shtml">Pictures</a></li>
<!--#endif -->

That can be a lot of typing if you have a lot of articles. (You could write a script to generate a new navbar every time you add an article, or whatever.) Personally, if I had as many articles as you, I'd put them in a database, blog, CMS, or something.

Another option would be to write them as PHP, then run commandline php to generate a static HTML page. I've done that to cache db-intensive pages.

The nice part about that is that you can set up a make file to re-generate all your pages whenever your navigation (or other dependency files) are updated.



Conditional expressions! (none / 1) (#48)
by MichaelCrawford on Sun Jan 01, 2006 at 04:48:26 PM EST

Yes, that's just what I wanted, but didn't know what to call them.

So many sites have navigation where the link to a given page is still live on that page. I've taken care to make my pages the exception, and I think it adds a lot to the quality of a site, but it's a real pain in the neck to implement.



Re: Conditional expressions (none / 0) (#51)
by barefootliam on Mon Jan 02, 2006 at 02:11:38 AM EST

I agree with you about making the current page not be a link. I do this in the Leonardo pages, too, for example, and in my gallery script.

The gallery stuff (e.g. Gotch on the English House) is written in Perl and reads a static text file with the metadata. The Search page uses XML Query to read and query the metadata, and also a relational database with image sizes. So there are lots of ways to do it.

The down side of server side includes is that they can be fragile, especially on a large site. For example, server configuration changes can make them stop working, and you have to be careful to avoid including the same fragment from more than one place if it might have relative links in it.

Having said that, I use them too, and just test carefully and often :-)


I think precalculating is best (none / 0) (#52)
by MichaelCrawford on Mon Jan 02, 2006 at 02:37:01 AM EST

I know it would probably be easiest to use either server-side includes or PHP includes, but it just seems wasteful to me to force my server to include the navigation on the fly. At the very least, it would involve opening another file, and running some code that wouldn't have to be run if I were to pre-include the navigation.

I know it's probably only a tiny fraction of my server's total load, but it just offends my sensibilities to waste cycles when I don't have to.

What I think I'm going to do is set up a system on my home PC where the navigation is kept in a separate file, then included on the fly, but then also have a program that copies my whole site out to a mirror directory tree with the navigation preincluded.

I just added a couple new pages to the new site I'm working on, and have many more planned, so I expect I'll need to figure this out pretty soon.



Re: precalculating (none / 1) (#53)
by barefootliam on Mon Jan 02, 2006 at 03:43:37 AM EST

I share your feelings about precalculating (and it's what I almost always do), but these feelings also need to be balanced by the fact that the computer is there to work for us! If it's not too loaded, do whatever is easiest to maintain and understand.

The hardest engineering problem to get right in computing is almost always the cache. Premature optimisation (whether a cache or something else), that is, optimisation not based on measurements, is also one of the most common sources of bugs and errors.

I use ssh, scp and rsync for mirroring, by the way. On Mandriva Linux there is also draksync which makes rsync a little easier to cope with. But then, I have over 40,000 HTML files and over 13,000 image files. Which is a tiny amount compared to the Web server at work (the World Wide Web Consortium) but you'd expect that :-)

You could also look at using CVS or RCS or subversion or arch or some other version control system. But that's for managing files rather than for navigation.

For the navigation, opening an extra file is really not a big deal, especially on a Unix-based Web server (including Linux, Solaris, BSD, etc.). But it's somehow more satisfying to generate a whole Web site with a single command!

Watch out for Google's relatively new policy that it down-rates Web sites who routinely update their Last Modified timestamps without changing the content. My mkgallery script, for example, only replaces files that have actually changed.


Come on, man! (none / 1) (#71)
by mcrbids on Fri Jan 06, 2006 at 06:50:06 AM EST

How many people do you intend to serve in a 24 hour period?

Chances are, a well-written PHP script, 128 MB of RAM, and an Intel P2 server will handle that easily, with 400% excess capacity.

Seriously, PHP performance just isn't an issue in virtually all cases. The exceptions are easy to see:

  1. Heavy dependence on a database for each individual hit,
  2. Piss-poor algorithm, like doing an individual DB query for each FIELD within a table. (yes, I've seen it!)
  3. Seriously deficient hardware,
  4. Constructing solutions that are "cool" rather than being based on realistic expectations. (See #2 above)
Meet these (simple) criteria, and you'll be fine, pretty much no matter what.
I kept looking around for somebody to solve the problem. Then I realized... I am somebody! -Anonymouse
Typographic bells and whistles (none / 1) (#31)
by caek on Fri Dec 30, 2005 at 11:07:52 PM EST

Just out of interest, how are the ellipsis, endash and emdash usually rekeyed in projects like Project Gutenberg?

How does one handle archaic typography such as long s (ſ)? Is it possible to note the position of ornamental (dropped) letters or unusual ligatures?

Re: Typographic bells and whistles (none / 1) (#33)
by barefootliam on Sat Dec 31, 2005 at 12:08:44 AM EST

I can't speak for Project Gutenberg -- I used the text as an example, not because I am involved with them. Having said that, I'll note that they convert long s (ſ) to short s (although texts using long s are hard or impossible for OCR software to cope with, and hence get rekeyed manually; see the Project Gutenberg Distributed Proofreaders Web site for details).

An ellipsis usually turns into three dots (...); dashes end up as a hyphen or minus sign (-) or sometimes as a double hyphen (--), depending on the typist; sometimes it's different on different pages, because of the way pages are assigned randomly amongst workers.

Unusual ligatures will be dropped (e.g. the ct ligature would be keyed as ct or sometimes as a * if the typist didn't understand it). Small caps are turned to regular lower case and any semantic distinction lost. Italics are either marked with underscores at the start and end or turned to ALL CAPS, again losing any distinction from sequences of capitals.

It helps to understand that Project Gutenberg's goal is not (as I understand it) to make a digital representation of a particular edition; indeed, they often don't mark which edition was used when they publish a text! Instead, it's to make an "electronic text" suitable for human readers. One has to set one's standards accordingly.

If you are looking for high-quality faithful transcriptions, look at people using the Text Encoding Initiative Guidelines, e.g. the Women Writers Project at Brown University. My interest lies somewhere between the two: I'd like to make a transcription good enough that the original could be recreated, with the exception of line breaks and hyphenation, but I am not making critical editions comparing manuscripts or printed versions, for example.

I do worry that for many of these books, in (say) 50 years, the Project Gutenberg editions might be all that remain, but that's the subject for another article :-)

Thanks for your comment! Sorry I can't give a definitive answer.


Liam; you might also like Words and Pictures From Old Books
[ Parent ]
Re: Typographic bells and whistles (none / 0) (#57)
by caek on Tue Jan 03, 2006 at 10:25:02 AM EST

Instead, it's to make an "electronic text" suitable for human readers. One has to set one's standards accordingly.
Fair enough. The Gutenberg conventions you describe (long s to short s, ellipsis to three dots, etc.) are eminently reasonable with that in mind. However, if that is the limit of its ambitions, I couldn't agree more with your concern that the Gutenberg texts may in future be all that remain, especially with things like this going on!

So, more power to your elbow!

[ Parent ]

Re: Brittle Books Project (none / 0) (#70)
by barefootliam on Thu Jan 05, 2006 at 06:08:59 PM EST

I'm all for making digital scans of fragile and rare books, so that, if the book gets lost or damaged, we still at least have something.

That does not mean that people can then go and destroy the original books, so yes, this, if true (as sounds likely) is very disturbing.

I may be slightly unfair to Project Gutenberg: I don't know if the project as a whole has ambitions beyond those of Michael Hart, who founded it. But I did exchange mail with Michael Hart around 1990 about this; his goal was very clearly to make US ASCII texts that would be readable by as many people as possible (in the USA at least). The fine print at the end of each etext mentions the total number of texts and how far ahead of schedule they are, as if it's a race.

Thanks for the link -- interesting reading.


Liam; you might also like Words and Pictures From Old Books
[ Parent ]
More on Baker's paper evangelism (none / 0) (#72)
by caek on Fri Jan 06, 2006 at 02:45:04 PM EST

The World on Sunday: Graphic Art in Joseph Pulitzer's Newspaper (1898-1911) is the first fruit of Baker's attempt to rescue various physical collections. There is a fascinating review in the new issue of the New York Review of Books.

[ Parent ]
regexps, facts, yadda .... (2.50 / 2) (#36)
by tomlord on Sat Dec 31, 2005 at 01:05:02 AM EST

Regexps are *not* the most powerful text processing tool ever invented, though one "knows what you mean". Anyway, the mind boggles that el bozo gets $600 for avoiding regexps while the author of one of the very few best regexp engines ever (ahem) is going to die for want of money (ahem). Hey, next time you hear of such a job... write me please. -t

Re: regexps, facts, yadda (none / 0) (#38)
by barefootliam on Sat Dec 31, 2005 at 01:12:39 AM EST

People die from want of food, or possibly from wearing shoes too much, but not from want of money :-) Nonetheless... yes, forgive me some hyperbole, but how else to move people past the (not uncommon) fear of regular expressions?

As for consulting jobs, I can only speak for Ontario, where there are still lots of high-tech jobs available. This particular one was in a Corporate Environment where shoes were required (although white socks were explicitly forbidden), and everyone had to Toe the Corporate Line. I couldn't stand it.

Author of the best regexp engine, that must be utzoo!henry, right? :-)



Liam; you might also like Words and Pictures From Old Books
[ Parent ]
henry (none / 0) (#42)
by tomlord on Sat Dec 31, 2005 at 01:56:22 PM EST

Good guess. No, I just stole some ideas he generously gave me. When it comes to implementing on-line fancy-fast Posix regexp matching, I think there are four or five of us in the club now.

[ Parent ]
Re: Henry (none / 0) (#43)
by barefootliam on Sat Dec 31, 2005 at 06:00:18 PM EST

I was teasing just a little :-)

Of course, the next challenge is to implement W3C XML Schema, e.g. with Brzozowski Derivatives.



Liam; you might also like Words and Pictures From Old Books
[ Parent ]
as someone who does a lot of this stuff too... (none / 1) (#37)
by urdine on Sat Dec 31, 2005 at 01:09:25 AM EST

There are some hard-earned gems here that I've learned over time as well. Specifically: running multiple "passes" to clean and verify the data in steps, keeping good documentation, and separating the input/output files into different directories. You can do some amazing work with some regex, step-by-step design, and a bunch of strategic kludges along the way.

Good article.

Re: as someone who does a lot of this stuff too... (none / 0) (#39)
by barefootliam on Sat Dec 31, 2005 at 01:37:43 AM EST



Liam; you might also like Words and Pictures From Old Books
[ Parent ]
Version control? (none / 1) (#40)
by Spider on Sat Dec 31, 2005 at 09:18:22 AM EST

Hello again, Liam : )

After a good read, I have to wonder: how many of your scripts do you tend to put into a standard "library" of sorts, and how many of them are specific to each document?

Following this up: version control. Is it worth it for the scripts (even if as simple as cp script script.`date +%F`; vim script), and do you use it?

The same for the main texts: do you use version control here too? Tracking changes (since you do mention diff/patch, and one of the areas where I've been using CVS et al. is to automate patches ; )

Re: Version control? (none / 0) (#45)
by barefootliam on Sat Dec 31, 2005 at 06:10:55 PM EST


Well, with Tom Lord posting in the thread I suppose I should say that I am using arch, but no, I don't :-)

I generally want everything self-contained so that archiving a folder archives everything. So I use RCS for the scripts.

For the texts, I use RCS if I'm editing one, but if I took a text from elsewhere (e.g. PG) there would be no point, as I don't have editorial authority.

I tend to start by copying scripts from a previous project and editing them, but there seem to be a few things in a common core:

  1. functions like XMLOpen, XMLCloseIfOpen, XMLClose, XMLCurrentElement to manage a stack in the generated output;
  2. a finite state machine to manage transitions between different parts of the document;
  3. code to handle italics, matching quotes, dashes, symbol substitution, blank lines turning to paragraphs, literal passages (no word wrap), and also warning on italics spanning paragraphs.
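The stack-management helpers in item 1 can be sketched like this (a minimal Python illustration of the idea only; the actual scripts are Perl and unpublished, and the method names are simply taken from the list above):

```python
class XMLWriter:
    """Keep a stack of open elements so generated output stays well-formed."""

    def __init__(self):
        self.stack = []   # names of currently open elements
        self.out = []     # emitted markup fragments

    def XMLOpen(self, name):
        self.stack.append(name)
        self.out.append(f"<{name}>")

    def XMLCurrentElement(self):
        return self.stack[-1] if self.stack else None

    def XMLClose(self):
        self.out.append(f"</{self.stack.pop()}>")

    def XMLCloseIfOpen(self, name):
        # Close intervening elements (e.g. an italic span still open
        # at the end of a paragraph) until `name` itself is closed.
        if name in self.stack:
            while self.XMLCurrentElement() != name:
                self.XMLClose()
            self.XMLClose()
```

The point of XMLCloseIfOpen is exactly the italics-spanning-paragraphs problem in item 3: the converter can close a paragraph without tracking what inline markup happens to be open.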

Similarly, the XSLT that splits the XML into multiple XHTML files looks pretty similar each time.

I've considered making the conversion scripts I use for each project available (under the Barefoot Licence of course!) for others to use, but I'm not sure if there's much point.


Liam; you might also like Words and Pictures From Old Books
[ Parent ]
Version control with XSLT and CVS (none / 1) (#63)
by cpghost on Wed Jan 04, 2006 at 06:38:32 AM EST

The interesting question is what you put in version control. In an XML/XSL to [X]HTML scenario for example, you'd normally put the XML and XSL files themselves e.g. into some CVS repository, because you can regenerate the HTML stuff from the sources anytime you need them.

But is it always that simple? Sometimes you want to regenerate an exact past snapshot of the output files. But tools change over time; and even though the output should in theory be identical across a wide range of standards-compliant processors, you may still not get the same results every time.

So what do you do then? Right, you source-control the output files as well (disk space is cheap).

This is easy enough... unless you use CVS-generated timestamps in the source and output files (a la "last modified: $Date$"). Now, every time you recompile the sources and check the output files back into the repo, the two timestamps (source and output) are out of sync and will be regenerated. Every output file is touched, even if the only thing that changed is the CVS tag itself, which is kind of silly.

I've seen some systems that were set up this way: the repository for the output files is mostly filled with mere CVS tag changes, because the people who set this up didn't get it right. The solution, of course, is to avoid CVS tags in the output files and to use XSLT to copy the CVS timestamp of the input source file directly to the output file, stripping the CVS tags on the fly.
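The same idea can be sketched outside XSLT. Here is a hypothetical Python version (the function, regexes, and sample strings are all invented for illustration): copy the expanded $Date$ value out of the source file and substitute it for any live keyword in the output, so CVS has nothing left to rewrite on the next checkin:

```python
import re

# Matches an expanded CVS keyword like "$Date: 2006/01/04 06:38:32 $"
DATE_KW = re.compile(r'\$Date: ([^$]*)\$')

def strip_and_copy_date(source_text, output_text):
    """Copy the source file's expanded $Date$ stamp into the output,
    replacing any live $Date$ keyword with the literal value."""
    m = DATE_KW.search(source_text)
    stamp = m.group(1).strip() if m else ''
    return re.sub(r'\$Date[^$]*\$', stamp, output_text)
```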

cpghost at Cordula's Web
[ Parent ]
Now you just need (none / 0) (#41)
by shinshin on Sat Dec 31, 2005 at 10:46:07 AM EST

to tell us how to automatically convert the scanned text into Resource Description Framework (RDF).

We believe he has, in fact, reconstituted nuclear weapons --Dick Cheney, Meet the Press, March 16, 2003
Re: RDF (none / 0) (#44)
by barefootliam on Sat Dec 31, 2005 at 06:01:55 PM EST

RDF isn't really a good format for representing text... it's OK for interchanging limited sorts of metadata, and to do that, extract it from the XML you generated, using XSLT or XML Query.


Liam; you might also like Words and Pictures From Old Books
[ Parent ]
quilt (none / 1) (#46)
by derobert on Sat Dec 31, 2005 at 11:58:46 PM EST

You might find the "quilt" collection of programs useful for keeping track of your hand-edits. Gives you very clear documentation of each (in the form of patches), and an easy way to apply/unapply each.


Similarly, you might want to try a version control system (subversion, cvs, arch, etc.) to keep track.

Re: quilt (none / 0) (#50)
by barefootliam on Mon Jan 02, 2006 at 02:01:18 AM EST

Thanks, I've used RCS for this in the past, not least because of the way (for a one-person project or small team) everything is self-contained.

I wanted to keep the number of new tools mentioned to a minimum in this article. Maybe there is scope for a follow-up article about patches and version control, similarly about the various XML tools that are so useful.

Thanks again,


Liam; you might also like Words and Pictures From Old Books
[ Parent ]
Congrats on the promotion (none / 0) (#49)
by Chewbacca Uncircumsized on Sun Jan 01, 2006 at 04:58:37 PM EST

From Activity Lad.

OCR versus cheap labor (none / 1) (#54)
by dogeye on Mon Jan 02, 2006 at 12:30:44 PM EST

Is OCR, with all its inherent problems, really cheaper than having fluent English-speaking Filipinos manually type the data for $1 per hour?

Re: OCR versus cheap labor (none / 0) (#55)
by barefootliam on Mon Jan 02, 2006 at 02:58:40 PM EST

For me, yes, because I often choose books with a lot of images I want to share.

I think I mentioned the rekeying option in my article, though. It actually sometimes works best if the people are not fluent in English, by the way, because English speakers make mistakes like repeated words, or change the sense of a sentence (inserting a not, for example), whereas non-native speakers make typos but are less likely to change words. Or so it has been suggested to me. In practice, the people who run rekeying operations tend to be based in the West, and they find their own off-shore typists. I've had texts rekeyed with good results for a commercial project, including some simple XML markup.

Another issue for me is I sometimes work with relatively rare books that I don't want to send away.

Thanks for bringing this up.


Liam; you might also like Words and Pictures From Old Books
[ Parent ]
Non-native English speakers (none / 1) (#58)
by caek on Tue Jan 03, 2006 at 10:36:36 AM EST

On the subject of non-native English speakers working with English texts, you may find this WSJ article on the "World" Scrabble Championships interesting (scare quotes because this championship uses the US dictionary and, until recently, was entered only by North Americans). The relevant passages include:
We can't help but think about how words are used in life, and that sometimes can affect how we view them in the very different context of a Scrabble game.

Panupol has no such conflicts. To him, almost every word is the same: devoid of meaning. Panupol studies printed lists of words organized in the order of probability that they will appear in a game, a common study method for all players, because you want to know the words that are most likely to show up. But whereas I'll stop to look up an interesting-seeming word, or infer a part of speech, Panupol marches on undeterred, one word piling up after another.

It seems a similar thing can happen in rekeying work.

[ Parent ]
Re: OCR versus cheap labor (none / 0) (#73)
by 6mullet on Wed Jan 11, 2006 at 06:36:43 AM EST

It all depends on what precisely you are doing with the data. In some circumstances it is indeed cheaper to use OCR than to outsource the work to the Philippines.

I spent a year working on a project to digitise the local council archives' catalogues in order that they might be placed in a database. The project team consisted of three people - an archivist, an archives assistant, and myself, someone with experience of proofreading and digitising documents from previous work in the press.

When I arrived at the project I discovered that this was not their first attempt. The year before, a project had been attempted between multiple archives where selected catalogues were manually rekeyed in the Philippines. After seeing the results of the first few catalogues, the head of the archives had decided not to continue with this approach.

The data they had received was full of errors. This wasn't surprising and really couldn't be blamed on the workers who had rekeyed it. The original hard copy of the catalogues was mostly of a poor quality and often difficult to decipher for a native English speaker who was familiar with many of the place names contained within. Add to this the frequent changes in style and abbreviations (the catalogues dated from the 1940s through to the 1990s and were often internally inconsistent) and there was little hope of producing anything useful from the first run-through of the data.

Using local workers and OCR software produced better results but was more expensive. However, the head of the project found that factoring in the time required for a local worker to edit the output of the Philippines-produced data brought the figures much closer together. As a small team would be required either way to rework the plain-text data into modern archival catalogue standards and then into XML suitable for importing into the database, he chose to go fully with the local-workers approach. I suspect in many ways this was also less an economic decision than a political or technological one - the prospect of using underpaid foreign workers didn't sit well with the liberal leanings of the head archivist, and using local workers was most likely easier to sell within the council.

[ Parent ]
I wish this had been published 5 years ago (none / 1) (#59)
by wiredog on Tue Jan 03, 2006 at 12:49:55 PM EST

When I had the job of getting data to and from XML and an Oracle database. Parsing 5 MB XML files with DOM, because they were too complex for SAX. And hand-hacking "scripts" that did various cleanup on the incoming data. I say "scripts" because they were written in C++. This was, remember, 5+ years ago. Speed was important.

Wilford Brimley scares my chickens.
Phil the Canuck

Re: I wish this had been published 5 years ago (none / 1) (#60)
by barefootliam on Tue Jan 03, 2006 at 02:40:19 PM EST

These days Oracle handles XML directly of course.

On the subject of speed, I hope you did timings. Often most of the time is in the XML parser or in regexp handling, and in both cases Perl does pretty well, and its bytecode optimizer means that small Perl scripts often are, in fact, faster than custom C++! If you do a lot of DOM stuff, rather than SAX or XML::Twig (say), though, C++ may still win out.

At any rate, thanks for commenting!



Liam; you might also like Words and Pictures From Old Books
[ Parent ]
That was 5 years ago... (3.00 / 2) (#65)
by wiredog on Wed Jan 04, 2006 at 09:06:38 AM EST

On a lightning fast, high-end, 1GHz P3 with 512 MB ram. A real monster workstation. ;-)

We definitely did timings. Eventually we cut the running time on a test 10 MB file from 1 hour down to 10 minutes through lots of hand optimizing. Part of the problem was that some "records" in the XML file contained "pointers" to other records (all eventually keys in the db) which had to be verified, and XML didn't do that. It does now.

These days I use C# and Python for my XML work.

Wilford Brimley scares my chickens.
Phil the Canuck

[ Parent ]

Re: 5 years ago (none / 1) (#66)
by barefootliam on Wed Jan 04, 2006 at 12:54:23 PM EST

Understood; I admit my reply was also mostly directed at other people who might read it, because "we need it to be fast, so we'll use language X" is a common error made by new programmers.

One thing I do quite often is add links automatically. For example, in the 1736 Canting Slang Dictionary, every word or phrase that's got its own entry becomes a link to that entry wherever it occurs. Often the links lead to an irrelevant destination, but overall it seems to be useful.

This sort of thing can be done in (at least) two ways: (1) read it all into memory and, for each entry, go through the entire dictionary and add links; (2) use multiple passes: first make a list of all the entry titles, then sort them by number of words, then go through each entry and try to match the phrases against it, turning matches into links. The second approach is massively faster and uses much less memory. The fastest way to do it, if speed is needed, has the first pass write a script that, when executed, does the second pass.
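The multi-pass approach can be sketched as follows (a Python illustration with invented entries; a real script would also handle word boundaries, case, and overlapping matches, which this deliberately ignores). Sorting titles longest-first means a multi-word phrase is matched before any single word it contains:

```python
import re

def add_links(entries):
    """entries: dict mapping entry title -> entry body text.
    Pass 1: collect titles, most words first. Pass 2: rewrite each
    body, turning occurrences of other titles into links."""
    titles = sorted(entries, key=lambda t: len(t.split()), reverse=True)
    pattern = re.compile('|'.join(re.escape(t) for t in titles))
    linked = {}
    for title, body in entries.items():
        linked[title] = pattern.sub(
            # Don't link an entry to itself.
            lambda m: m.group(0) if m.group(0) == title
            else f'<a href="#{m.group(0)}">{m.group(0)}</a>',
            body)
    return linked
```

Compared with rescanning the whole dictionary once per entry, the single compiled alternation makes one pass over each body, which is where the "massively faster" difference comes from.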

I suppose what I'm trying to say is that the choice of algorithm matters more for speed than the choice of language in almost all cases. Although I should mention that a switch from sed to Perl once got me a factor of 100 or so speedup, with no other change. But the algorithm was still O(nm); it was only a constant-factor speedup.

You're well aware of all this stuff I'm sure, and I should get off the soapbox, sorry. But I always think about people who get to these articles and comments via google six months from now! :-)



Liam; you might also like Words and Pictures From Old Books
[ Parent ]
choice of algorithm is more important (none / 1) (#67)
by wiredog on Wed Jan 04, 2006 at 02:32:26 PM EST


In fact, you've (sort of) described the two ways we did it. Load it into memory resolving all the links as we go along, as that's the easiest to code, vs multiple passes to resolve the links. Harder to write (and design), but much faster. Mainly because it used less memory, therefore less swap.

Wilford Brimley scares my chickens.
Phil the Canuck

[ Parent ]

Why DOM is slow for large files (none / 1) (#61)
by MichaelCrawford on Tue Jan 03, 2006 at 04:18:27 PM EST

DOM dynamically builds a tree structure in memory that reflects the structure and content of the XML document. That means that there are about a bazillion calls to the memory allocator when you create a DOM tree, and a bazillion more when you delete it.

Some guy on the xerces-c mailing list complained that a high-end Sun server took an hour to construct a DOM tree from a twenty megabyte file. Their response was "don't use DOM for such large files".


Live your fucking life. Sue someone on the Internet. Write a fucking music player. Like the great man Michael David Crawford has shown us all: Hard work, a strong will to stalk, and a few fries short of a happy meal goes a long way. -- bride of spidy

[ Parent ]

Re: DOM speed (none / 0) (#62)
by barefootliam on Tue Jan 03, 2006 at 11:36:15 PM EST

That would be my reply too: don't use DOM if speed is the first consideration. But if you need to move around a tree, it becomes a tradeoff between the complexity of the resulting code vs. runtime speed vs. development time.

gdom2 (part of Gnome) is an interesting compromise: nodes are built on the fly as needed and kept in a cache. But frankly, for most tasks, Perl or Python produce faster code, partly because of the garbage collection and partly because the underlying code has been heavily optimised.

You have to leave code behind you, though, that others will understand and be able to adapt and fix...



Liam; you might also like Words and Pictures From Old Books
[ Parent ]
don't use DOM for such large files (3.00 / 2) (#64)
by wiredog on Wed Jan 04, 2006 at 09:02:19 AM EST

OTOH, when you have over 100 different tags, with arbitrary nesting, SAX is a nightmare. You end up with hundreds of flags to set/reset.

SAX is good for large, non-complex documents. DOM is good for small, complex documents. Large, complex documents give XML coders heartburn.
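The flag-per-tag pattern that makes SAX painful is visible even in the trivial case. A minimal Python xml.sax sketch (one invented tag; with 100+ tags and arbitrary nesting, imagine one such flag, or a stack, per context):

```python
import xml.sax

class ItalicTextHandler(xml.sax.ContentHandler):
    """Collect the text inside <i> elements."""

    def __init__(self):
        super().__init__()
        self.in_italic = False   # one state flag per tag you care about...
        self.collected = []

    def startElement(self, name, attrs):
        if name == "i":
            self.in_italic = True

    def endElement(self, name):
        if name == "i":
            self.in_italic = False

    def characters(self, content):
        # characters() may fire several times per text node, so accumulate.
        if self.in_italic:
            self.collected.append(content)
```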

Wilford Brimley scares my chickens.
Phil the Canuck

[ Parent ]

OMG (none / 0) (#68)
by jforan on Wed Jan 04, 2006 at 03:28:30 PM EST

This is the best article I have ever seen on K5.  Never before has this been done, and never again will it be repeated.  Y Had this never Been done before Today?



I hops to be barley workin'.

For text processing (none / 0) (#69)
by fromwithin on Thu Jan 05, 2006 at 01:31:22 AM EST

I couldn't find anything that did what I want so I wrote it myself. It's called Textbath and was written for pretty much exactly the reasons in this article. It re-paragraphs, converts entities and all that. The link is the Win32 version.

MikeC - fromwithin.com
Content Creation and Text processing: Work smart | 73 comments (52 topical, 21 editorial, 0 hidden)