Kuro5hin.org: technology and culture, from the trenches
Building "Bookster"

By ubu in Technology
Sat Oct 07, 2000 at 05:11:47 AM EST
Tags: Technology (all tags)

At least one major online media outlet has raised the specter of "Bookster" in discussions relating to IP freedom. If you want to help define the technology before the media can, it might be worthwhile to begin designing Bookster, right here and right now.


Let me begin by defining some discussion parameters. We don't need to rehash the IPR legality/rationale debate here. I'm not interested in discussing the advantages of hard copy versus eBooks. I'm definitely not interested in discussing presidential candidates.

What I am interested in discussing: document workflow, suggested limitations, feasibility, accessibility, and proposals for prototypical solutions. Especially proposals for prototypical solutions.

That said, onto the meat of this introduction. The advent of digital music has provided easy means for capture, storage, and transmission of music files. These music files share common formats, uniform presentation tools, and tremendous simplicity with regard to end-usage. I would like to contend that similar packaging might be achievable with written works, and to propose that work in this area should begin immediately.

As I see it, the workflow for "Bookster document" creation should have 4 discrete nodes: paper/legacy document, e-text, formatted e-text, and final published document. To elaborate individually:

  • Paper/Legacy: an existing (hard copy) document, such as a printed book. This node is irrelevant for original creations composed as e-texts.
  • E-text: the document itself, in a standards-based text format. Plain ASCII is not sufficient; in my opinion only XML applications are appropriate for this step (DocBook, MathML, ATA, etc.).
  • Formatted E-text: translated e-text with formatting markup, suitable for automatic processing.
  • Final Publication: an individual file suitable for online reading or printing.

My understanding of the issues has led me to a number of conclusions, and ultimately to a prototype proposal. First, creating e-texts is the most expensive part of the entire flow; Project Gutenberg is currently engaged in this kind of activity, and can testify to the difficulty of the endeavor, which requires scanning, OCR, and proofreading -- all of them labor-intensive.

Second, plain e-texts should be marked-up. I am familiar with Project Gutenberg's statement on this, but I am fully convinced that using XML formats will not injure the integrity of a given e-text, and I am equally convinced that without marked-up e-texts the rest of the workflow is prohibitively expensive. Moreover, the utility of any given e-text, in bare ASCII format, is extremely low by comparison to its potential -- for searching, editing, building indices, and thousands of other purposes. For a good example of this, see how clumsily footnotes are handled in PG e-texts (Prester John, by John Buchan, for instance).
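
To make the contrast concrete, here is a minimal sketch of how a footnote is marked up in DocBook (the sentence and note text are invented for illustration; only the footnote element is the point):

  <para>The expedition reached the river at dawn<footnote>
    <para>The dates given in the first edition differ here.</para>
  </footnote> and camped on the far bank.</para>

A formatter can then decide whether to render the note at the foot of the page, at the end of the chapter, or as a hyperlink, instead of leaving it inline in the running text as PG does.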

Third, given an XML e-text, the most appropriate format for the third node is XSL Formatting Objects. Besides being an XML application itself, XSL-FO is a robust formatting specification, and it has already crossed a major implementation hurdle with the development of FOP, which can create PDF and SVG files from XSL-FO documents.
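
As a sketch of what such a translation looks like (a deliberately incomplete fragment -- a real stylesheet also needs templates for fo:root, page masters, and the rest of the DocBook elements), a single rule mapping DocBook paragraphs to FO blocks might read:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <!-- each DocBook para becomes one formatted block -->
    <xsl:template match="para">
      <fo:block font-family="serif" font-size="11pt" space-after="6pt">
        <xsl:apply-templates/>
      </fo:block>
    </xsl:template>
  </xsl:stylesheet>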

In view of these conclusions, then, it would appear that the major issues are as follows:

  • Is it too expensive to convert legacy documents? Can we expect existing hard copies to be converted to e-text formats, or is the conversion so expensive as to prohibit timely delivery?
  • Is it too expensive to mark-up bare ASCII? I've written a few short scripts to help automate the process of turning PG e-texts into DocBook files, but it still requires extensive tuning and proofreading.
  • How to distribute? Transforming marked-up documents into FO files requires an XSL formatting translation that specifies the way in which a document should be presented. Should it be provided alongside the e-text? Or can a library of standardized translations be provided online to assist the user? This issue is complicated.

I have turned PG e-texts into DocBook files, written XSL translations that output FO files, and run FOP on the result to create nicely-formatted PDFs. I would love to be able to download hundreds of such e-texts and automatically view them, print them, search them, and index them to the fullest potential of XML. I would love to use a Napster-like interface to browse for such files and share them with others.
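
For anyone who wants to reproduce the experiment, the whole chain can be driven from the command line. The file names below are made up, and the invocations are taken from current releases of xsltproc and FOP, so the exact flags may differ on your setup:

  xsltproc docbook-to-fo.xsl mybook.xml > mybook.fo
  fop mybook.fo mybook.pdf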

Your thoughts are eagerly anticipated.

Ubu

Poll
How would you like to read and share online documents?
o As printouts 10%
o As plain text 19%
o As HTML documents 19%
o As PDFs 9%
o As XML with automated formatting workflow 41%

Votes: 136
Results | Other Polls

Related Links
o XSL Formatting Objects
o FOP
o Also by ubu


Building "Bookster" | 66 comments (60 topical, 6 editorial, 0 hidden)
Sorry... (3.78 / 19) (#3)
by trhurler on Fri Oct 06, 2000 at 06:02:38 PM EST

not only did I vote against this one, but I also won't help you. I'm all for electronic media, but the haste with which people are struggling to totally annihilate copyright leads me to believe that they don't understand what it is they're out to destroy. If you merely wanted to set up an XML version of Project Gutenberg, I'd understand and agree, but calling it "Bookster" suggests that this is not at all what you really want. This isn't a simple matter of agreement and disagreement; I'm not at all convinced that a US federal court wouldn't allow a prosecutor to push "conspiracy to commit" charges against me just for offering advice on how to construct "Bookster."

Now, if what you really meant was "an XML version of Project Gutenberg," without all the implications of copyright infringement that "Bookster" conjures up, then by all means do say so and I'll think more about that topic, but this almost deserves a rewrite, because your current working title is -designed- to cause commotion, and is quite likely to create more than you actually understand if you are successful.


--
'God dammit, your posts make me hard.' --LilDebbie

Re: Sorry... (3.28 / 7) (#5)
by ubu on Fri Oct 06, 2000 at 06:27:15 PM EST

I might as well have said "an XML version of PG", given the scope of what's under discussion. Your contribution could have been limited to that topic, and you might even have included a disclaimer -- if you're really peeking into shadows looking for spooks.

The "XML version of PG" was actually the genesis of my idea. I didn't think it was fair, however, to anyone in the vicinity to limit discussions of technology to the restrictive IP practices as followed in the United States. I can discuss any software system I like under the First Amendment to the Constitution; it is up to your own private discretion to act as you see fit.

The short version: I don't understand why you'd want to vote down an article because of personal reservations about the article scope.

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
Re: Sorry... (3.16 / 6) (#7)
by bobsquatch on Fri Oct 06, 2000 at 07:09:33 PM EST

The short version: I don't understand why you'd want to vote down an article because of personal reservations about the article scope.

I'm not trhurler, so I can't speak for him. But I can understand why he'd want to vote down an article that he didn't want to talk about. It's his vote. People have voted against stories at k5 for far more frivolous reasons. (Hi, Karsten! :)

Even though I disagree with trhurler's reasons w.r.t. IP politics (and I often disagree with him in general), he's certainly got a right to say that he doesn't want to talk about it or see it at k5.

If you don't agree, vote otherwise...



[ Parent ]

Re: Sorry... (4.16 / 6) (#9)
by trhurler on Fri Oct 06, 2000 at 08:13:25 PM EST

Look at that kid in Norway and the DeCSS mess. He didn't even write any code; all he was guilty of was possessing a copy of it and having spoken to the authors. He was never accused of breaking any Norwegian laws. Nevertheless, he was arrested. Imagine how much easier it would be to arrest me, since I'm a US citizen. If I'm arrested, my career, the future of which probably depends on security clearances and so on from the US government, is in serious danger. Yes, I'm a bit paranoid. My government can not be trusted to act in a sane manner, so anyone who is sane cannot trust my government.

Now, as for the topic at hand, since you apparently have no blatant intention of breaking laws (btw, copyrights tend to be international in scope, so the odds are they DO impact you unless you live in some truly benighted part of the world like Libya), I tend to vote down just about everything in sight unless I -really- like it, because I feel a need to counterbalance, even in a tiny way, the inane effects of hordes of newbies who vote up everything front page. However, I do like the idea of XML texts. The bright side is, it appears that your story will make it shortly anyway.

I would suggest that it is more important to develop the tools to manipulate the later stages than it is to worry about the first stage right now; PG does that already, so you have a ready supply of test materials, and the new and interesting part of this is a combination of the tools and a browser of some form (I'm going to suggest writing that as a library that can be made into plugins for various web browsers, because rewriting all that GUI and network code is just plain bad, and also because that's how users are going to want to interact with something like this anyway. I know this goes against everything I stand for (I'm a Unix small-efficient-correct tools kind of guy), but I also know that you're not going to get small-efficient-correct out of any sort of networked GUI program using any existing technologies.)


--
'God dammit, your posts make me hard.' --LilDebbie

[ Parent ]
Re: Sorry... (3.83 / 6) (#13)
by ubu on Fri Oct 06, 2000 at 10:37:26 PM EST

I think that your comments about the user agent -- Web browsers and the like -- are very interesting. One of the issues that had come to mind was how books would be read. Obviously, PDFs and SVGs could be read within the Web browser using Adobe plug-ins, but there are some functional issues as well as some issues of independence. For instance, if the user browses simply for XML texts, relying upon local translations (DocBook -> XSL/FO -> SVG) according to preference, a new user agent designed for that kind of work would be best. Do you have any thoughts on this? Given an SVG rendering library (currently in development, sssshhhh), would it be reasonable to expect users to install a "reader agent" for XML texts?

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
Re: Sorry... (5.00 / 1) (#44)
by trhurler on Mon Oct 09, 2000 at 01:58:24 PM EST

Sure, but my point is, why would that XML viewing program want to have its own network and GUI code? If you make it a library with a set of wrappers to do plugins for the common web browsers, it can avoid all that crap while simultaneously integrating into programs people already know how to use and which have universal distribution. Maybe I'm missing the point here, seeing as I'm not an XML guru, but it seems as though we agree without knowing it.

--
'God dammit, your posts make me hard.' --LilDebbie

[ Parent ]
"Reader agent" == Mozilla (3.00 / 1) (#47)
by roystgnr on Mon Oct 09, 2000 at 05:17:43 PM EST

I've joined a project recently that uses an XML language for internal data representation; our current way of making documents readable is to apply an XSL translation on the server, then feed the resulting XHTML along with a stylesheet to the web browser. For relatively static content, this works fine; we wouldn't even need Mozilla if there weren't MathML requirements to meet.

The whole point of XML is that it's easy to make it both computer and human-understandable; tools to do so already exist (even if they're not all mature yet) and don't need to be reinvented.

Of course, SVG support, client-side XSLT, and a crash-proof Mozilla, would all be nice things... but an XML Gutenberg project could get along fine 99% of the time without SVG or MathML.
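
For reference, the hook for client-side rendering already exists: an XML document can point the browser at a stylesheet with a processing instruction (the stylesheet name here is just an example):

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="docbook-to-xhtml.xsl"?>
  <book>
    ...
  </book>

A browser that honors the PI can then do the transformation itself instead of relying on the server.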

[ Parent ]
Re: "Reader agent" == Mozilla (3.00 / 1) (#49)
by ubu on Tue Oct 10, 2000 at 12:09:10 AM EST

Client-side XSLT is the one whose lack bugs me. To me, the notion of online XML resources is much less compelling if I'm constrained by the conversions available server-side.

I'm much agreed with you, however; Mozilla would be a serviceable "universal reader" if only it had client-side conversion options and a longer MTBF.

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
Re: Sorry... (2.50 / 4) (#19)
by Zanshin on Sat Oct 07, 2000 at 01:32:36 PM EST

Hmmmmm, "Let me begin by defining some discussion parameters. We don't need to rehash the IPR legality/rationale debate here."

[ Parent ]
Re: Sorry... (4.00 / 2) (#42)
by PrettyBoyTim on Mon Oct 09, 2000 at 08:29:45 AM EST

You may feel that the IPR legality/rationale does not need to be discussed, but others would disagree. Posting a story containing limits on the discussion seems odd.



[ Parent ]
Re: Sorry... (none / 0) (#58)
by fester on Wed Oct 11, 2000 at 11:02:34 AM EST

Posting a story containing limits on the discussion seems odd.

It's called "keeping a discussion on topic."

[ Parent ]
Re: Sorry... (none / 0) (#59)
by PrettyBoyTim on Wed Oct 11, 2000 at 03:59:07 PM EST

> > Posting a story containing limits on the discussion
> > seems odd.

> It's called "keeping a discussion on topic."

However, I think it's better for the readers and posters of K5 to decide what is 'on topic', rather than the person who posts the story. The poster may wish some tricky questions not to be asked, but if the readership feel that they are pertinent, then they should be asked.

[ Parent ]
Don't Panic !! (2.72 / 11) (#10)
by JB on Fri Oct 06, 2000 at 08:45:59 PM EST

Gnutella, Freenet, Blocks, Publius, etc allow for the distribution of any digital media. These exist, and are getting better. Some of these are designed to survive nuclear war and plagues of lawyers. Please remain calm. The cypherpunks have the situation under control. The Planet-Wide Hard Drive is coming soon to a terminal near you.

visit www.infoanarchy.org if you want to help

JB

Meta-data makes life easier (3.50 / 4) (#28)
by FunkyChild on Sun Oct 08, 2000 at 12:03:46 AM EST

One of the downfalls of Gnutella, Freenet etc, is that they're too general. Napster is successful because it specialises in MP3 music files, using ID3 tags etc. So you can usually tell if a song is incomplete, low quality etc by looking at the meta-data.

This is what the author is envisioning with 'Bookster' - a way to use XML or some sort of markup to give each e-text lots of meta-data, eg references, bibliographies, indexes etc. etc.

I guess the other option is to build a plugin system into Gnutella, which specialises and looks for meta-data in files. Eg. the MP3 plugin (like Napster), the MPG plugin (size, bitrate etc.) and whatever people could come up with. In the search screen you could have a drop-down menu titled 'Search for' with options like "All", "MP3 audio", "MPEG video", "e-text" etc. etc. I am not familiar with the internals of Gnutella, but that could be a nice solution.



-- Today is the tomorrow that you worried about yesterday. And now, you know why.
[ Parent ]
Re: Meta-data makes life easier (4.00 / 4) (#29)
by ubu on Sun Oct 08, 2000 at 12:39:52 AM EST

Freenet is designed to be a generic data-storage and -sharing network. The goal of the Tropus project is to create a music-specific sharing system atop Freenet. A similar system, designed for sharing XML documents, would be very desirable.

The Jabber project aspires to the creation of an XML-based messaging system. According to their project literature, the idea is to be able to send and receive any type of XML document as a standard Jabber message.

One of the innumerable uses of such a system would be the ability to send a tiny XML query document that automatically composed any amount of information culled from any number of XML data sources on a given network. Entire books (or even libraries) could be composed in this manner, customized precisely to the type and source of data desired by the sender -- yet the actual message could be a tiny scrap of text.

As a concession to the current topic, a Jabber setup could allow users to send one another complex documents -- suitable for printing, indexing, viewing online, searching, storing in databases, re-using, and editing -- as Jabber messages... in batches, as cross-referenced materials... the possibilities are quite literally endless.

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
The CiteSeer approach (3.75 / 8) (#11)
by recursive on Fri Oct 06, 2000 at 08:47:56 PM EST

I can imagine an approach that is much closer to what Napster/MP3 does: CiteSeer searches the web for PostScript documents, archives them, and extracts their bibliographies. This is closer to Napster and MP3 because it deals with documents already in a machine-readable format. MP3 does not let you extract individual notes or voices from a song. However, the XML approach aims at exactly this: allowing a document to be dissected, which I consider too ambitious for texts that contain forms, tables, and pictures.

-- My other car is a cdr.


Book digitization as a collaborative effort (4.00 / 10) (#12)
by Eloquence on Fri Oct 06, 2000 at 09:44:53 PM EST

The hard part is really digitizing the stuff. It would be no big deal if people could collaborate better. In a ShouldExist story (which I can't find right now since the search seems to be broken), I have suggested a "guerilla" approach to digitizing books, with hundreds of people all putting a single page out of a certain book on their server (possibly even typing it), putting the ISBN & page number in a specified format in the meta tags and registering the site with a search engine. Subsequently, it would be possible to let a spider grab all the pages and join them together. Then they could be put on one of the existing sharing systems.
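
As a sketch of what that convention might look like (the meta names are invented here for illustration, not an existing standard), each single-page site would carry something like:

  <meta name="book-isbn" content="0-123-45678-9">
  <meta name="book-page" content="217">

A spider could then query a search engine for the ISBN, sort the hits by page number, and stitch the book back together.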

This effort would not require direct cooperation (only a general consensus on what to do), and it seems a lot harder to shut down hundreds of sites that host an individual page than one that hosts hundreds of pages.

Of course, things could be much simpler were it not for copyright: You could have a central server where people can register their books and form workgroups to collaborate on them. However, even if the XXAA didn't shut down the service outright, they would certainly demand that the database be censored. A distributed approach would be harder, but it could be done using the distributed database concept I have suggested here on K5.

OTOH, Freenet might already have a good solution for the problem. Freenet can handle "directories" (the storage of the data and even of the list of data is distributed) where you can submit "keys" (basically names of files on the network) which have to be approved by a "gatekeeper". This way, it should be easy to organize collaborative digitization of books. However, to make it usable not only in theory but also in practice, you would need a simple GUI implementation of Freenet specifically designed for that purpose. The Tropus project is trying to do this for music.

Let me reiterate, the key to book digitization is collaboration. Individual efforts will not be successful. But if you can find enough people who have the same version of a certain book, you can cut the amount of work down from 300 pages perhaps to a single one. But since the concept is based on mass participation, it really only works with a specialized, easy-to-use client. Possibly one that has a simple editor built in, which automatically formats the text in the required manner.

On a different note, I have been desperately looking for information on automatically page-flipping scanners. I have heard that such beasts are in use for digitizing ancient books (allegedly working with a vacuum mechanism to pull the page), but I have never seen one in action. Anyone?

Last but not least, feel free to submit the story to infoAnarchy as well. The site has been created specifically to deal with questions like these, and if we're going to debate in more detail, we might want to move over there.
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!

Re: Book digitization as a collaborative effort (2.75 / 4) (#30)
by ubu on Sun Oct 08, 2000 at 12:44:46 AM EST

Your "guerilla" distribution approach is an outstanding idea. The final document could be nothing more than a search specification which assembles the final XML document from any number of searchable XML sources. The individual (document fragments) could be scattered to the wind, yet re-assembled invisibly upon request.

I'm glad you shared that; if you ever find the original story, please email it to my mailbox.

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
Digitizing is not the obstacle (3.00 / 3) (#33)
by kmself on Sun Oct 08, 2000 at 05:27:21 AM EST

I disagree with your premise. Much current valuable material is already digitized, and virtually all future textual material will be -- if not at final distribution, certainly during preparation. The means for such works to "escape" is available.

Even non-digital works can be converted fairly easily. The issue is more one of incentive (who's going to do it) than means. Scanning and OCR are already cheap enough to be household technologies, and speed and accuracy will only increase. Even graphic scans of non-OCRd material can be distributed electronically (though the textual content will be far more portable).

The question isn't "who's going to digitize all this stuff", it's "how are the people who have to work with digitized material going to keep it from leaking out?".

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Re: Digitizing is not the obstacle (4.50 / 4) (#35)
by Eloquence on Sun Oct 08, 2000 at 10:05:00 AM EST

Much current valuable material is already digitized

Most isn't. You're right, the issues are somewhat different for new material, especially when the industry starts releasing more ebooks at reasonable prices. Basically, all copy protections are worthless: you cannot stop the designated reader from copying what he reads (if necessary, he'll use screen capturing + OCR). So this is not much of an issue.

But when we're dealing with digitization of the works of the last 50 years, then digitization is the only real issue. (Even if digital versions exist for the last 10-20 years, nobody says that you can get a hold of them!)

Even non-digital works can be converted fairly easily.

Untrue. Scanning takes about two hours for an average book, proofreading takes 8 or more hours and you still don't get all the errors. Things may change as OCR software gets more proficient. Right now, digitizing books is a labor of love, very few people do it.

Scanning and OCR are already cheap enough to be household technologies

In Germany there are laws that require flat fees to the writers' organization VG Wort for all scanners sold. These have recently been increased.

Even graphic scans of non-OCRd material can be distributed electronically

Yes, this will probably become more popular in the next years as harddisk space and bandwidth grow further. But I do not see this as proper book digitization. If you scan a page at 100 dpi, most OCR software is pretty helpless (they require 200-300 dpi), so you can't easily convert later, and the unchanged pages are unsearchable, unindexable and have limited portability.

The question isn't "who's going to digitize all this stuff"

Yes it is.

it's "how are the people who have to work with digitized material going to keep it from leakin out?".

That's rather uninteresting, IMHO. You can't prevent something from "leaking out" while at the same time trying to run a business by spreading it.
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!
[ Parent ]

Repeat: Digitizing is not the obstacle (4.00 / 4) (#38)
by kmself on Sun Oct 08, 2000 at 11:32:05 PM EST

Much current valuable material is already digitized

Most isn't.

In a market in which value is ascribed to novelty, particularly in media, it's the future trend which is likely significant. The issues you raise about the difficulties of publishing old, non-digital materials apply largely equally whether the question is one of publishing them electronically or in processed tree carcass form. Typical economic life for published materials is less than five years, even if you exclude periodicals.

But when we're dealing with digitization of the works of the last 50 years, then digitization is the only real issue. (Even if digital versions exist for the last 10-20 years, nobody says that you can get a hold of them!)

I dispute this strongly for a number of reasons. The materials will be digitized by publishers, libraries, and individuals, for archival, storage, storage-reduction, and research purposes. Works which have a current value will be digitized. The act need only be done once; the virtue of electronic formats is the ease of reproduction and distribution. Granted, many of the organizations and individuals won't have an incentive to make their electronic copies available, but some will, and I suspect that future trends will be that this number will rise, not fall, over time.

Even non-digital works can be converted fairly easily.

Untrue. Scanning takes about two hours for an average book, proofreading takes 8 or more hours and you still don't get all the errors. Things may change as OCR software gets more proficient. Right now, digitizing books is a labor of love, very few people do it.

I've photocopied more than one book by hand, courtesy of a former life producing college readers at a large national photocopy chain. Ten to twenty pages a minute is a reasonable rate -- that's 600 pages an hour sustained. The process today actually does do just what I mention below: most current mid to high-level photocopiers actually make a digital, not electrostatic, image of the material copied. While OCR and manual re-editing are not trivial, the tasks are sufficiently simple that a person with access to quite mundane technology in the US, EU, or Japan, could readily convert a book over a weekend or so. Where individual incentive to do so exists, it will happen.

Multiply this by several hundred million computers in the world, and the likelihood becomes high that a given bit of desirable hardcopy will work its way to electronic format. A well-stocked megabookstore in the US might have a hundred thousand titles. This could be digitized literally overnight if 0.1% of all computer users took it into their minds to do so (assuming, of course, coordination of efforts to avoid duplication). The reality is that the practical barriers are low.

Scanning and OCR are already cheap enough to be household technologies

In Germany there are laws that require flat fees to the writers' organization VG Wort for all scanners sold. These have recently been increased.

And outside of Germany? (/me ponders turning the old US-centrist debate around and asking when Germany became the center of the Universe).

Even graphic scans of non-OCRd material can be distributed electronically

Yes, this will probably become more popular in the next years as harddisk space and bandwidth grow further. But I do not see this as proper book digitization. If you scan a page at 100 dpi, most OCR software is pretty helpless (they require 200-300 dpi), so you can't easily convert later, and the unchanged pages are unsearchable, unindexable and have limited portability.

But they are portable, and that is the sufficient precondition for the remaining processing.

The question isn't "who's going to digitize all this stuff"

Yes it is.

<match type="grudge">
No it isn't.</match>

it's "how are the people who have to work with digitized material going to keep it from leakin out?".

That's rather uninteresting, IMHO. You can't prevent something from "leaking out" while at the same time trying to run a business by spreading it.

To the contrary, this is precisely the question for the music and publishing industries. Watermarking, digital rights protections, copy protection, DeCSS, electronic paper, Napster, Gnutella, FreeNet, and whatever new buzzwords are announced tomorrow, are focused on just this question. It's what an entire industry believes its future depends on.

Mind you, I subscribe strongly to John Perry Barlow's information wants to be free (though I distinguish between "wants to be" and "must be"). But the immediate, and interesting, question is very much how the existing media industry plans to try imposing its old successful practices on a new paradigm -- preventing something from leaking out while at the same time trying to run a business by spreading it for profit.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Re: Repeat: Digitizing is not the obstacle (4.00 / 1) (#46)
by Eloquence on Mon Oct 09, 2000 at 03:06:34 PM EST

In a market which value is ascribed to novelty

Not everyone does that; I, for one, do not. It's not novelty that counts, it's quality. And I won't let the market dictate what is valuable or not. Otherwise one might argue that the new books that were released in the Dark Ages were more valuable than the old paper they were written on, often containing gems by ancient authors only rediscovered more than 1000 years later.

So I repeat: Most valuable books aren't digitized and won't be for some time to come (although of course their percentage will grow smaller as new high quality books are released digitally and the old content slowly becomes obsolete). I'm not using the term "value" in an economic sense here since this would be utterly pointless.

Typical economic life for published materials is less than five years, even if you exclude periodicals.

This has various reasons, one of them being the marketing machine that is currently used to sell books. Since we do not have proper rating mechanisms for books yet, it's no surprise that the market is dictated by the publishers' marketing. However, this doesn't say a thing about the actual books' quality. I know excellent books (fiction and nonfiction) from 1995, 1990, 1985, 1980, 1970, 1950, most of them are not digitized (to my knowledge), which is a pity.

The materials will be digitized by publishers, libraries, and individuals, for archival, storage, storage-reduction, and research purposes.

This is certainly a hope I share with you, and one should also hope that they will be accessible after digitizing. Right now, I am doubtful about it. The director of the LOC, for example, has explicitly stated that they don't want to digitize their content for copyright reasons.

Works which have a current value will be digitized.

Even in an economic sense, this is not necessarily true. I certainly hope that publishers will re-release out of print books in electronic form to make more money with them, but I wouldn't hold my breath; right now, most of them are scared shitless of ebooks and the net. And the people who scan books are usually very few dedicated individuals with their own special definition of quality, which has little to do with the current economic value of the book.

The act need only be done once;

Theoretically, yes. Practically, I am pretty sure that more books rest electronically on some people's harddrives than are circulating on the Internet. Scanned for the purpose of searching/indexing, but not distributed due to fear of prosecution. But as you said, this may change over time, and I'm hopeful that at least the number of ebooks will soon increase dramatically.

I've photocopied more than one book by hand

So have I.

courtesy of a former life producing college readers at a large national photocopy chain. Ten to twenty pages a minute is a reasonable rate -- that's 600 pages an hour sustained.

I have gotten similar rates, but scanning is much slower, about half that speed with a $1000 scanner like mine.

The process today actually does do just what I mention below: most current mid to high-level photocopiers actually make a digital, not electrostatic, image of the material copied.

Yeah, but you don't get the bytes outta there (or do you?), and you'll hardly be able to afford one for your home.

While OCR and manual re-editing are not trivial, the tasks are sufficiently simple that a person with access to quite mundane technology in the US, EU, or Japan, could readily convert a book over a weekend or so.

Make that two weekends. It's a lot of work, and most people don't do it. It's the largest hindrance, that's really not very hard to see. Unless it's done collaboratively, it's simply too much effort. Converting a CD to MP3 takes less than an hour and is completely automatic. Digitizing a book is a boring ~10 hour effort. Trust me on this, take www.pfaffenspiegel.de as an example for a book I've scanned, proofread and HTMLized. But well, my quality standards may be higher than yours.

Where individual incentive to do so exists, it will happen.

Oh, isn't that always so?

Multiply this by several hundred million computers in the world

Doesn't compute. There aren't several hundred million computers in the world with scanners. Of those who have scanners, only a small percentage has acceptable OCR software like OmniPage Pro. And the percentage of these willing to scan whole books is much lower again.

A well-stocked megabookstore in the US might have a hundred thousand titles. This could be digitized literally overnight if 0.1% of all computer users took it into their minds to do so

You will never find enough people, unless you organize the scanning & proofreading of individual books collaboratively. Even then, it will be hard to digitize little-known, high quality titles, but at least those of high "economic value" should be digitized. As I said, Freenet might be a good solution here.

Scanning and OCR are already cheap enough to be household technologies

No. Cheap scanners have terrible speeds and the bundled OCR software is crap.

And outside of Germany? (/me ponders turning the old US-centrist debate around and asking when Germany became the center of the Universe).

Well, I wouldn't be surprised if other countries passed similar laws. But I can only talk about my own situation here.

But they are portable, and that is the sufficient precondition for the remaining processing.

No, if they're too low DPI, they can only be put to the next stage through human typing, at least with current technology (things will change, sure).

I wrote: "That's rather uninteresting, IMHO. You can't prevent something from 'leaking out' while at the same time trying to run a business by spreading it."

You replied: "To the contrary, this is precisely the question for the music and publishing industries. Watermarking, digital rights protections, copy protection, DeCSS, electronic paper,"

Watermarking: wrong concept, needs software to control user, this will be easily cracked.
digital rights protections/copy protection/DeCSS: The law question is extremely important, I agree about that. Other than that, these issues are mundane, it's obvious that copy protections don't work.
Electronic paper: Yeah, but the information has to get into the "paper" somehow, so that's not an issue.

Napster, Gnutella, FreeNet

These are not important for cracking the content protection. You said "how are the people who have to work with digitized material going to keep it from leaking out?" and I think this refers to content protection mechanisms. I just think -- and you probably agree with me -- that whether or not copy-protection systems are secure is not a real question for anyone who knows what he's talking about; copy protections can never work. Of course it will be interesting to see how long the content industry takes to realize that, and how they'll fight the distribution mechanisms and the crackers. Of course new distribution mechanisms and the way the content industries deal with them are of high importance, but that's not what I was talking about.

Right now, there's simply not a lot of non-IT content for "Bookster", and that's the major problem.
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!
[ Parent ]

Last word (none / 0) (#52)
by kmself on Tue Oct 10, 2000 at 02:41:47 PM EST

We're repeating ground here, I just want to address a couple of points. My last word in the thread.

Regarding market valuation of novelty -- that means just what it says: the market, as a whole, tends to value the new over the old, and the bulk of the current valuation of the media market is in new product. This isn't true for every individual publication, and it's not true for all consumers, but it's a general rule. Personal exceptions (and I suspect our tastes are similar) don't move that average by much.

Regarding scanners, digital copiers, computers, and loose numbers within an order of magnitude. Scanner speed, cost, and quality is a classic engineering problem, fairly readily solved and likely advancing on a curve close to Moore's law, and for similar reasons: image quality is directly related to electronics density. If a typical consumer-market scanner today isn't sufficient for the purpose at hand, I'm confident that it is a matter of a few short years, or a small upmarket move, to find suitable hardware.

Likewise, the repositioning of Xerox, Ricoh, and other traditional photocopier companies as "document management companies" means that, yes, you will be able to squeeze those digitized bits out of the copier. The current Xerox flagship DocuLink "publishing system" is already designed with networking and onboard storage. While a few minutes' research through Xerox's webpage doesn't indicate whether or not network-enabled document management is already a reality, the engineering problem is trivial.

Progress with OCR and related AI technologies tends to be significantly slower than device technology. However, OCR and speech recognition are now basic consumer technology. My feeling is that progress will continue, however, "perfection" is a goal which will be asymptotically attained, and some human intervention will always be necessary.

We're also talking somewhat cross-purposes. My contentions are these.

First, as content is increasingly digitized, controls on content will become increasingly meaningless, despite heroic measures by traditional media organizations to maintain control. So long as this holds for new content, the reign of our current media giants is over. A new mode of sponsoring, publishing, and distributing creative works will emerge.

Second, the backlog of valued non-digital media can and largely will be digitized. Yes, it takes work, and the very fact that this work can likely not be remunerated through traditional sale of content leads to the paradox you are pointing to. This is a classic economic efficiency issue -- the social benefit of one person spending a day scanning in a document, and a week proofing it, is immense. The personal benefit is minimal. There will, by this argument, be an undersupply of freely scanned documents, however this is mitigated by the same dynamics that feed free software: personal incentives to digitize media vary widely, and only one individual's threshold need be exceeded for all to (potentially) share the benefits.

Which is why I argue: there are sufficient incentives for this scanning to occur for any number of people and organizations for any number of reasons. The technology is widespread enough (even at a minuscule fraction of the prevalence I postulate above) that hundreds or thousands of significant works could be digitized in short order -- certainly less than a year -- were there a desire to do so. The input issue is moot; what's required now is to work out how the materials are to be released and distributed.

Your complaint appears to boil down to "I want free access to books not currently digitized, but I don't want to digitize these books". I'm unconvinced.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Re: Last word (none / 0) (#60)
by Eloquence on Thu Oct 12, 2000 at 11:31:50 AM EST

Regarding market valuation of novelty -- that means just what it says: the market, as a whole, tends to value the new over the old

This is exactly why more collaboration from the "underground" is necessary to bring little known books or classics into cyberspace.

Scanner speed, cost, and quality is a classic engineering problem, fairly readily solved and likely advancing on a curve close to Moore's law

Not quite. The situation with scanners is more like the situation with flat panel displays or laser printers. There have been some technical improvements, especially resolution-wise, but you still pay $1000 for decent hardware. The cheap $100 scanners you find in a supermarket often take a minute to scan a page, even in binary mode. My scanner takes 7 seconds, and it's 5 years old. I've read recent comparisons: In the same price range today, scanners still have similar speeds (often even slower).

Likewise, the repositioning of Xerox, Ricoh, and other traditional photocopier companies as "document management companies" means that, yes, you will be able to squeeze those digitized bits out of the copier. The current Xerox flagship DocuLink "publishing system" is already designed with networking and onboard storage. While a few minutes' research through Xerox's webpage doesn't indicate whether or not network-enabled document management is already a reality, the engineering problem is trivial.

Now this is more interesting: Go into a copyshop and get your work digitized at reasonable prices. Perhaps even with sophisticated OCR engines already built in. That may be a real prospect.

Progress with OCR and related AI technologies tends to be significantly slower than device technology.

Hm, actually, I have witnessed the opposite. OmniPage 10 is a lot better than OmniPage 6 which was current when I bought my scanner. However, as I said, there are still no cheap high-speed scanners.

My feeling is that progress will continue, however, "perfection" is a goal which will be asymptotically attained, and some human intervention will always be necessary.

Yes.

First, as content is increasingly digitized, controls on content will become increasingly meaningless, despite heroic measures by traditional media organizations to maintain control. So long as this holds for new content, the reign of our current media giants is over. A new mode of sponsoring, publishing, and distributing creative works will emerge.

I agree.

Second, the backlog of valued non-digital media can and largely will be digitized. Yes, it takes work, and the very fact that this work can likely not be remunerated through traditional sale of content leads to the paradox you are pointing to. This is a classic economic efficiency issue -- the social benefit of one person spending a day scanning in a document, and a week proofing it, is immense. The personal benefit is minimal. There will, by this argument, be an undersupply of freely scanned documents, however this is mitigated by the same dynamics that feed free software: personal incentives to digitize media vary widely, and only one individual's threshold need be exceeded for all to (potentially) share the benefits.

If we're talking about (C) works, the personal benefit may consist of getting pulled into court for copyright infringement. The social benefit is just as limited since people have to hide their identity (even persistent pseudonyms may be dangerous). Freenet may be a solution for both problems.

The input issue is moot

I still disagree. Organizing the input is even more important and difficult than organizing the output, since it has to be an anonymous group effort to work properly.

Your complaint appears to boil down to "I want free access to books not currently digitized, but I don't want to digitize these books".

I've done my share of work. Have you?
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!
[ Parent ]

This will never work! (2.80 / 10) (#14)
by GreenCrackBaby on Sat Oct 07, 2000 at 11:42:30 AM EST

Consider what has already happened with Napster...

You download the latest song from Metallica, listen for 3 minutes, and discover that the song is incomplete. Damn! Go back to Napster and try to find the complete song.

Now consider the same with books...

You download a copy of "Chicken Soup for the Geek Soul". You read 120 pages of this book, only to find out it is incomplete. Damn! You've just wasted 5 hours.

Or worse... some malicious user has made available a version of the book, but at page 40 he has cut and pasted pieces from "Chicken Soup for the Pregnant Woman". Maybe you'll be able to tell that something is wrong... maybe not.

Or even worse...some even more devious user goes and makes up the story from page 50 onward. How would you ever know?

Digital music is one thing. It's easy to tell if you've got a dud (if you downloaded me singing a Metallica song... you'd hopefully know you've got a dud right away ;). Digital books open up a whole other can of worms.

Re: This will never work! (3.33 / 3) (#15)
by Waldo on Sat Oct 07, 2000 at 12:28:45 PM EST

Oh, I don't know about that. I can just as easily tell the difference between you and Metallica as I can the difference between you and John Irving. (Unless you have some secret writing career. :) Then, some people may not know the difference.

Again, if I got Metallica's "Pregnant Woman" instead of "Geek" (yes, of course they're not real songs), I also very well may not notice the difference. If I'm a Metallica fan, the difference would be obvious.

My point is that I don't think the situation is any different between audio (MP3s) and text (books, articles, etc.)

[ Parent ]
Re: This will never work! (2.75 / 4) (#17)
by Zanshin on Sat Oct 07, 2000 at 01:18:58 PM EST

Use an MD5 checksum, or hell, just put "----- THIS IS THE END -----" at the end of everything, scroll to the end and look for it.

[ Parent ]
Re: This will never work! (2.33 / 3) (#18)
by ghoti on Sat Oct 07, 2000 at 01:22:22 PM EST

Very interesting point. And I do think that faking a piece of text (even a chapter in a book) is a lot easier than faking music. Just think: you have to emulate the voice, instruments, general sound, sound quality, etc. *Very* hard to do.
But emulating the style of a writer you know well isn't all that hard --- even I can do that for a few writers. And this would be even easier with more technical stuff, where it could do much more damage. Just think about somebody changing data in a manual or something less technical, like a book on how to cure illnesses with certain plants.
I don't know who would want to do that, but it would be possible and could do a lot of harm. So I agree that this is an issue that should be thought about before starting this kind of service.

<><
[ Parent ]
Re: This will never work! (2.50 / 2) (#37)
by cherrypi on Sun Oct 08, 2000 at 09:59:07 PM EST

You can secure documents from (easy) alteration with PDF... anyone can change audio files, but most don't because it's a hassle; the same goes for PDF -- it would just make alteration harder.

[ Parent ]
PDF security (3.00 / 1) (#40)
by kmself on Mon Oct 09, 2000 at 05:22:53 AM EST

I know that at least some forms of PDF text protection are relatively easily bypassed with common Linux tools. If you can print a document, and it produces text-based PostScript output (most graphical utilities do), you can generally strip the text out of the PostScript. I've had good luck with this using a secured PDF document and Linux command-line utilities.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Re: This will never work! (2.00 / 1) (#43)
by deaddrunk on Mon Oct 09, 2000 at 09:54:16 AM EST

See the poster above who suggests hundreds of sites hosting single pages and a web-spider applet that pulls them all together and gives you the entire book. Sorry to karma-w***e on the wrong site, but Signal 11's not the only refugee.

[ Parent ]
Text downloads on broadband != 5 hours. (none / 0) (#53)
by kmself on Tue Oct 10, 2000 at 03:06:33 PM EST

I don't know where your 5 hour download for a text document comes from. Assuming a book of 250 pages, 40 lines per page, 12 words of 6 characters per line, and a multiplier of two for markup overhead, that's about 8.4 KB for a book. Ok, that's a bit low. Practical example: the gnuplot manual runs 116 pages. Uncompressed PostScript format, that's about 750KB. Download time on a 56Kbps modem: 12 seconds. Actual time on a rather buggy connection: about two minutes. On broadband, you blink slower.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

Re: Text downloads on broadband != 5 hours. (none / 0) (#61)
by g0del on Thu Oct 12, 2000 at 07:17:09 PM EST

I think he was referring to the time spent reading before you discover the book is incomplete, rather than the download time. Of course, then I wonder why reading 120 pages wastes 5 hours - 24 pages an hour seems really slow to me.

< pedant > Also, your math is wrong. Downloading a 750K doc on a 56k modem would take about 2.5 minutes on a good day, probably longer, never anywhere close to 12 seconds. I think you mixed up bytes (750K) and bits (56k).< /pedant >

G0del

[ Parent ]
Terminology and Tactics (3.16 / 6) (#16)
by JB on Sat Oct 07, 2000 at 01:11:52 PM EST

I have heard of "booby-trapped" warez; programs that contain malicious code to punish and deter those who don't pay. Is there a similar term for songs or text that have been altered or corrupted for the purpose of reducing the incentive for free downloading?

Has anyone done anything to create a web of trust containing the fingerprints (or hashes) of digital works?

JB

Watermarking, database seeding (3.66 / 3) (#32)
by kmself on Sun Oct 08, 2000 at 05:21:13 AM EST

For digital media, there are techniques called "watermarking" which aid in "digital rights management" -- essentially, you can trace copies of an originating document.

In database and cartography, it's common practice to seed outputs with spurious records or features which can be used to determine whether or not a particular instance of a work has been copied. The artifacts are fictitious, and would exist in work based on original research.

However, none of these techniques addresses the legal issue of use and dissemination of copyrighted materials. Fair use exceptions exist for all copyrighted works, and copyright coverage for databases and collections of facts (including, say, cooking recipes) is very thin.

On Usenet, it was common practice for some time to post plausible, but nonrewarding, content, particularly to the alt.binaries.* newsgroups for erotica and software. This can be overcome, as is indicated elsewhere, by checksums (eg: MD5), or trust metrics.

--
Karsten M. Self
SCO -- backgrounder on Caldera/SCO vs IBM
Support the EFF!!
There is no K5 cabal.
[ Parent ]

OT: Cartography? (none / 0) (#62)
by Jade E. on Sat Oct 14, 2000 at 09:52:26 AM EST

In database and cartography, it's common practice to seed outputs with spurious records or features which can be used to determine whether or not a particular instance of a work has been copied. The artifacts are fictitious, and would exist in work based on original research.

First, I assume you meant "would not exist in work based on original research."

Second, while I'm familiar with the seeding of databases with various 'plants' to ensure the data isn't used improperly, how does this apply to cartography? I can see how the misuse of cartographic work could be a problem, but I just can't think of any sort of feature that could be falsely included in a map without causing serious doubts as to the accuracy of that map. The only way I can think to protect the reputation of someone who purposefully added extra features to a map would be if they published a list of which features were fake, but that would defeat the whole purpose. What kind of features would you use that would be harmless enough not to damage your accuracy, yet important enough to be included in a plagiarism?

[ Parent ]

Re: Terminology and Tactics (2.75 / 4) (#36)
by GreenCrackBaby on Sun Oct 08, 2000 at 12:35:25 PM EST

Funny you should mention it.

Barenaked Ladies recently tried to flood (succeeded? I haven't checked.) Napster with trojan ads. They released their songs from the new CD, but after 10 seconds or so the song would be interrupted by an ad for their new CD.

You should be able to find the full story on CNN.com.

[ Parent ]

XML (2.66 / 6) (#20)
by evvk on Sat Oct 07, 2000 at 03:28:54 PM EST

Why is it that everything has to be XML nowadays? (La)TeX is after all much better suited for writing texts, especially scientific ones (TeX math vs. that horrible thing called MathML). All SGML/XML-based stuff requires so many start _and_ end tags that it is inconvenient to write with a plain old editor, while writing LaTeX is very natural. And word processors are even worse.

Re: XML (3.20 / 5) (#21)
by Eloquence on Sat Oct 07, 2000 at 03:39:34 PM EST

Actually, I think we're heading towards two editing paradigms. The plain source editing approach hardly works for HTML4+JS anymore, and I guess it won't work well with XML either. Most users will be using WYSIWYG editors that produce terrible code. For power users there'll be macro-editors that hide complex code under simple symbols. (For example, you could have an HTML macro "t{4:2}{a|b|c|d\1|2|3|4}" that produces a table with 4 columns and 2 rows.)
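
To make the shorthand concrete, that hypothetical macro would expand to ordinary HTML along these lines:

  <table>
    <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
    <tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
  </table>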

I use LaTeX for letters, and I try to use templates wherever I can. It does have problems with more complex formatting tasks, though, especially images. Try to do a "hip" layout in LaTeX. I would love to see an overhauled version of TeX, easy-to-use, with built-in macros for most important tasks. TeX/LaTeX is far more complex than HTML and probably XML as well, and macros and templates seem to be the best approach to using it. In its current form, it is unsuitable for mainstream use.
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!
[ Parent ]

Re: XML (3.40 / 5) (#22)
by vsync on Sat Oct 07, 2000 at 04:55:47 PM EST

Try to do a "hip" layout in LaTeX.

Yes, but most hip layouts suck. I like the academic/professional look of most LaTeX output.



--
"The problem I had with the story, before I even finished reading, was the copious attribution of thoughts and ideas to vsync. What made it worse was the ones attributed to him were the only ones that made any sense whatsoever."
[ Parent ]

Re: XML (2.33 / 3) (#25)
by Eloquence on Sat Oct 07, 2000 at 11:31:20 PM EST

I couldn't agree more .. however, if you browse some personal homepages, you will find that many people don't feel like that ;-). And universality seems to be a criterion for mass adoption.
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!
[ Parent ]
Re: XML (3.33 / 3) (#24)
by evvk on Sat Oct 07, 2000 at 07:57:50 PM EST

> It does have problems with more complex formatting tasks, though, especially images.

Creating .eps with an external program is difficult? And isn't LaTeX pretty much supposed to hide all the formatting? It gives the basic commands and environments to create standard-layout documents.

> TeX/LaTeX is far more complex than HTML and probably XML as well, and macros and templates seem to be the best approach to using it. In its current form, it is unsuitable for mainstream use.

But that is exactly what TeX is about: macros. LaTeX is just one collection of macros, and publishers may have their own (e.g. AMS-TeX).
And yes, it is more complex, but normal users should only need to know the basic stuff, not the primitives it is built upon. I agree that much could be done to simplify some aspects of LaTeX. Mainstream users don't edit HTML by hand either, and there's LyX, which is a splendid stepping stone for converting from, e.g., Word to LaTeX :-).



[ Parent ]
Re: XML (3.00 / 3) (#26)
by Eloquence on Sat Oct 07, 2000 at 11:39:51 PM EST

Yes, I had problems embedding bitmaps, and with resizing and positioning them. I eventually managed it, but a) it still looks somewhat crappy, and b) I couldn't get it to work with pdflatex. And it was a relatively simple task, a letterhead, with a template already there. (I had to use \usepackage{graphicx}.)

Now if I imagine the average user trying to position a photo with a caption on a page, possibly with the text flowing around it .. I for one would have to browse my 800-page reference just to get started, and I don't know if I would succeed. So LaTeX is clearly not for the mainstream.

I'll try LyX once I've reinstalled Linux. Still, I think a major overhaul of TeX would be nice to allow much more complex formatting tasks. Then we would have a real competitor for Acrobat/PDF.
--
Copyright law is bad: infoAnarchy Pleasure is good: Origins of Violence
spread the word!
[ Parent ]

Re: XML (4.20 / 5) (#27)
by ubu on Sun Oct 08, 2000 at 12:01:20 AM EST

Why XML instead of LaTeX? Oh, just a couple of minor differences...

Write your legal brief in LaTeX. Now build an automatic index of all case citations for cross-reference.

Create your aerospace engineering manuals in LaTeX. Now build a parts index by part number. No, do it by section. Aw, hell, give me a CALS table for each type.

Take a big database and automatically compose a LaTeX report. Now do a server-side translation to create formatted, printable output. Send your LaTeX report to someone else so they can decompose it back into *their* database automatically.

Embed one LaTeX document in another and still have a valid LaTeX document. Now take it back out. Embed another (completely different) type of document in your LaTeX document and still have a valid LaTeX document. Now take it back out.

Create a standardized API for deconstructing LaTeX documents into object trees in memory. Make it capable of converting both ways between object trees and LaTeX documents.

Create a linking specification for linking between LaTeX documents. Make sure your links can address individual characters, all the way up to entire documents. Also, make sure your links can be two-way, and even have multiple destinations. Allow your links to contain information on what kind of document they are linking.

I'm sure none of these are important for most information technology applications, of course.
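(For the XML side of that comparison, here is a minimal sketch of the parse-to-object-tree-and-back round trip, using Python and its standard xml.etree.ElementTree module. The tooling and the sample markup are my assumptions, not anything the Bookster design prescribes.)

# Sketch: round-trip an XML document through an in-memory object tree.
# The sample markup is invented for illustration.
import xml.etree.ElementTree as ET

source = "<brief><cite>Marbury v. Madison</cite><para>Some argument.</para></brief>"

root = ET.fromstring(source)                  # document -> object tree
extra = ET.SubElement(root, "cite")           # manipulate the tree in memory
extra.text = "DoJ v. Microsoft"
print(ET.tostring(root, encoding="unicode"))  # object tree -> document again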

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
Re: XML (4.00 / 2) (#31)
by evvk on Sun Oct 08, 2000 at 04:51:40 AM EST

Now you are mostly talking about databases, not texts (books, papers and such). And for huge amounts of data, XML (or any markup language) is too bloated anyway.

> Write your legal brief in LaTeX. Now build an automatic index of all case citations for cross-reference.

BibTeX?

> Create a linking specification for linking between LaTeX documents.

Not impossible at least.

Anyways, my primary point was that XML is clumsy to write, while LaTeX-like syntax (especially the math mode) is very natural and allows relatively easy creation of documents with high print quality.



[ Parent ]
Re: XML (4.00 / 3) (#34)
by ubu on Sun Oct 08, 2000 at 09:58:36 AM EST

Now you are mostly talking about databases, not texts (books, papers and such). And for huge amounts of data, XML (or any markup language) is too bloated anyway.

No, I'm talking about information theory and documents. As for "huge amounts of data", XML database prototypes have proven to be quite capable of storing them. For medium-sized collections, SGML documents have proven quite manageable for over a decade. The aerospace industry already builds thousand-page documents with the ATA collection of DTDs.

I have personally customized workflows to deal with ATA SGML documents. One client was building parts indices by hand; the job required around a week per person per document. When we delivered our workflow automation, they were able to build the same indices for a 5,000-page document in roughly 10 seconds.

BibTeX?

BibTeX is for bibliographies; it marks up for that specific purpose. By contrast, XML can mark up for any purpose. My case citations can be marked as <cite>Marbury vs. Madison</cite>, which distinguishes them from <findings>Recommendations in the case of <cite>DoJ versus Microsoft</cite></findings>. Get the idea?
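(To make the legal-brief example concrete, here is a hedged sketch of building a citation index automatically from markup like the above, using Python and xml.etree.ElementTree. The <cite> and <findings> elements follow the example; the surrounding <brief> and <section> names are invented for illustration.)

# Sketch: build an index of case citations from <cite> elements,
# including cites nested inside other elements such as <findings>.
import xml.etree.ElementTree as ET
from collections import defaultdict

brief = """<brief>
  <section n="1"><cite>Marbury vs. Madison</cite> established judicial review.</section>
  <section n="2"><findings>Recommendations in the case of
    <cite>DoJ versus Microsoft</cite></findings></section>
</brief>"""

index = defaultdict(list)
for section in ET.fromstring(brief).findall("section"):
    for cite in section.iter("cite"):
        index[cite.text.strip()].append(section.get("n"))

for case in sorted(index):
    print(case + ": sections " + ", ".join(index[case]))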

Anyways, my primary point was that XML is clumsy to write, while LaTeX-like syntax (especially the math mode) is very natural and allows relatively easy creation of documents with high print quality.

And if creating pretty-print documents is all you want, keep it up. On the other hand, if you want documents that don't represent theoretical dead-ends, you might consider building them in a metadata-rich environment that lets you build the pretty-print automatically as an afterthought.

Ubu
--
As good old software hats say - "You are in very safe hands, if you are using CVS !!!"
[ Parent ]
Re: XML (3.50 / 2) (#41)
by rongen on Mon Oct 09, 2000 at 07:07:10 AM EST

I agree with you totally... I love LaTeX, but it's just not designed for good information interchange, etc. You can use CVS on your LaTeX sources to achieve versioning, but a better solution for a "real" system might be to generate LaTeX from another document source (XML?). The Linux Documentation Project seems to create their stuff in SGML and then convert to whatever format is appropriate for delivery to the reader (I think). There's no getting around the fact that LaTeX offers superior math formatting, though. I am saying this having never used MathML, but having seen how you do this in Word or whatever...

The thing about XML is that it makes parsing so very straightforward... Who wrote this document? The people in the author tags did, not the names that happen to appear on line seven or line 12, or that might be listed in the second-to-last section (don't laugh, I have had to parse technical documents that were stored in PDF and converted to text for inclusion in a searchable index. The only way to get specific info was to pray that the person who made the document stuck to the format, which was very informal. On average it wasn't bad, but from an IR point of view it was a mess).
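(A hedged sketch of that XML-to-LaTeX pipeline idea, in Python with the standard xml.etree.ElementTree module; the element names and document shape are invented for illustration, not any real DTD.)

# Sketch: treat simple XML as the canonical source and generate LaTeX from it
# as a derived, print-oriented output.
import xml.etree.ElementTree as ET

xml_source = """<article>
  <title>On Bookster</title>
  <author>A. Writer</author>
  <section heading="Motivation">Digital books need structure.</section>
</article>"""

root = ET.fromstring(xml_source)
lines = [
    r"\documentclass{article}",
    r"\title{%s}" % root.findtext("title"),
    r"\author{%s}" % root.findtext("author"),
    r"\begin{document}",
    r"\maketitle",
]
for sec in root.findall("section"):
    lines.append(r"\section{%s}" % sec.get("heading"))
    lines.append(sec.text.strip())
lines.append(r"\end{document}")
print("\n".join(lines))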
read/write http://www.prosebush.com
[ Parent ]

Re: XML (2.50 / 2) (#39)
by slycer on Mon Oct 09, 2000 at 01:10:19 AM EST

Forget XML, forget LaTeX, forget HTML or plain text. Give me PDB (Palm Database?), you know, the standard for Palm Pilots? Give me a book written in PDB, with appropriate indexing/bookmarks, and I am a happy happy man. I have a really hard time reading a book while sitting at my PC. However, I don't mind reading it on a Palm.. what's the difference, you ask? I can take my Palm on the bus/train/car/out for a cigarette/riding an elevator etc etc.. same as a real book.

Case in point: I purchased the 6-book CD-ROM set from O'Reilly, all HTML, fine and dandy when sitting at a PC, but I just couldn't browse it, couldn't read through it in my spare time.. wrote a script to convert the HTML to text (OK, very simple - uses lynx ;-) ) - found a script to convert text to PDB. Loaded it on the Palm.. wonderful.
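(Presumably something along these lines; a minimal sketch in Python that shells out to lynx's standard -dump mode. The directory names and the follow-on text-to-PDB step are assumptions.)

# Sketch: convert a directory of HTML files to plain text via "lynx -dump".
import pathlib
import subprocess

src = pathlib.Path("oreilly_html")
dst = pathlib.Path("oreilly_text")
dst.mkdir(exist_ok=True)

for page in sorted(src.glob("*.htm*")):
    text = subprocess.run(
        ["lynx", "-dump", "-nolist", str(page)],
        capture_output=True, text=True, check=True,
    ).stdout
    (dst / (page.stem + ".txt")).write_text(text)
# A separate txt-to-PDB converter would then package the output for the Palm.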


[ Parent ]
Docster (3.33 / 3) (#45)
by kallisti on Mon Oct 09, 2000 at 02:16:40 PM EST

A similar project, Docster, is being created for libraries to transfer material. It differs from Bookster in that issues of copyright, security and compensation are being included in the design.

Personally, I want a system where I can get out-of-print books. There are many books, such as Berlekamp, Conway and Guy's Winning Ways for Your Mathematical Plays, which are considered essential but can't be bought anywhere (at least, nowhere I've found).

textbooks (3.25 / 4) (#48)
by persimmon on Mon Oct 09, 2000 at 07:36:29 PM EST

After paying an average of $270 for textbooks this term, a few other math/CS students and I started speculating about sharing textbooks Napster-style. Although we have no plans for implementation, we concluded we could all chip in for a copy of the textbook, slice off the binding, run it through a scanner with a document feeder, make it a PDF or something similar, and I'd stick it on my server.

It'd be a hassle to deal with the resulting file, but probably a hassle worth enduring to avoid paying $120 for a 150-page analysis book*.


*We all felt a lot worse about the author of our discrete math book not getting paid than about J. Random Pop Singer, or, for that matter, the author of the "College Math" textbooks. So we didn't try it.
--
It's funny because it's a blancmange!
why a server? (none / 0) (#63)
by anonymous cowerd on Sun Oct 15, 2000 at 09:15:42 PM EST

Just wondering: why put the scans on a server and expose yourself to easy detection and legal action for copyright infringement (yes, I understand what you want to do and why, and were I still in college, starving-broke as I was, I'd have gladly done it myself, but the fact remains that what you propose is technically illegal) when all you have to do is print out your PDFs and hand them to a few discreet friends? Not only do you have the cop angle to worry about, but a paper printout is just plain more practical to use in class than something on a server somewhere that you need a running, connected computer to access.

Yours for less unnecessary technical complexity, WDK - WKiernan@concentric.net

"This calm way of flying will suit Japan well," said Zeppelin's granddaughter, Elisabeth Veil.
[ Parent ]

Questions and Answers (3.66 / 3) (#50)
by Obiwan Kenobi on Tue Oct 10, 2000 at 02:16:43 AM EST

How to distribute: ADOBE. Absolutely no question. This company has made online documents really feasible and easy too. PDFs are widely accepted, Adobe has a reader on every platform, and it'll cook you breakfast if you call it baby.

My problem is this: is this feasible in Windows? Can you create e-texts and straight ASCII files and turn them into PDFs without XML? Or could you make a (VERY) simple Windows interface for Joe Sixpack? Since 95% of computer owners are Windows users, you might want to think about them being able to contribute. I mean, how do you think Napster got so big? People saw other people making MP3s, so they made their own and gave them to other people, either by CD-R or Napster. E-books can go the same way. You could build a community-style agreement: in exchange for downloading as many e-books as they like, users agree to make an e-book from something in their collection that isn't already available. Or some similar wording. It gets the database bigger and gets more great books out there as well.

Too expensive to mark up bare ASCII? In a word, yeah. The average joe might want to contribute, but doesn't have the time to look over his just-scanned Doom: Knee Deep In The Dead (The Novel...yeah...*shudder*). Here's an idea: have a communal proofreading network, where people could submit un-proofread material, volunteers who want to help out could check the work, and the result could then be turned into PDFs (hopefully via a very user-friendly 'MAKE PDF' button), ready to share with the world.

And lastly, on a personal note, I think what you're doing is fantastic, and if there is any way I can help, I'd be glad to.


-----------
Obiwan
misterorange.com - The 3 R's: Reading, Writing, and Rock & Roll...

Re: Questions and Answers (3.50 / 2) (#51)
by evvk on Tue Oct 10, 2000 at 03:29:12 AM EST

The problem with PDF is that all the viewers I have come across are crap:

Acrobat: bloated, buggy, slow, uses anti-aliased and "thick" printer fonts -> unreadable on screens normal people can afford -> one could just as well use plain PostScript, because one has to print the documents anyway.
gv: slow, the same anti-aliasing problem (it can be turned off, but then it just displays the gray pixels as black...)
xpdf: this one is faster and uses normal _screen_ fonts -> far more readable, but xpdf is still far from usable and very buggy.

Paper is still a much better medium for text - one can take it anywhere, it doesn't flicker, it doesn't reflect light as badly, it has far superior resolution, etc. The only thing it is really missing is grep...




[ Parent ]
What you're talking about is illegal (3.33 / 3) (#54)
by escapist on Tue Oct 10, 2000 at 03:52:36 PM EST

Copyright laws exist to ensure that the original authors of these works get paid.

If we start scanning in these works and distributing them to everybody for free, guess what? Authors don't get paid -> authors can't make a living from writing -> authors stop writing -> no more books.

I know this is a rather drastic situation I'm talking about, but think about it. The audience for some of these books, such as an analysis textbook, or a math textbook, or an upper-year computer science textbook, is relatively small. We pay a lot for the books. We expect quality and we usually get it. But, we are exactly the same audience who would adopt "Bookster" and exploit it for all it is worth. Sales drop. Small sales divided by two or three or four is very small sales indeed, driving the prices up.

So, next time you want to steal something, go steal it from the Gap, or some other evil corporation, and not an academic trying to make a living off what they do best.


Don't let the changes, get you down now.

Re: What you're talking about is illegal (3.33 / 3) (#55)
by Spinoza on Wed Oct 11, 2000 at 06:53:20 AM EST

I thought the article's author made it fairly clear that legal issues were outside the spectrum of discussion hoped for. Thank you for ignoring this sensible request. Your insightful opinion on what is and isn't legal, what copyright laws are for, and above all, what motivates authors was just the sort of intelligent wake-up call this community needs.

You must be some sort of copyright lawyer, or perhaps you hold a doctorate in some sort of ethical philosophy. Perhaps both. I can't imagine why anyone else would feel qualified to make such broad judgements of the subject.

I think, despite your apparent learned stature, you overlook a number of points in favour of free online book distribution. For instance, did you know that books published before (I believe) 1925 are not copyrighted? These can be distributed in any form by anyone. This is currently done under the auspices of Project Gutenberg; however, as has been pointed out, Project Gutenberg provides only plain text. Some people desire a more developed format for distribution, to increase the readability of the documents.

Furthermore, it may interest you to consider the injustice of the current system of non-expiring copyrights. Is it right that a descendant of a descendant of an author can continue to extract profits from work that they had no hand in? I do not consider this a reasonable application of the idea of inheritance. After a certain time, a book should be released into the public domain, to further enrich and educate, regardless of who gets paid or not.

One particularly galling example of the problem of intellectual property might be the descendants of Hemingway, who recently released his final book, many years after his death. The book was written well after his prime, at a time when he was often heavily medicated. It might be said that he would not have been pleased to see the less-than-brilliant work published. Yet his descendants were more than willing to desecrate his memory for profit. These are the people who receive additional income for the rest of his copyrighted work, as well. By what effort of their own have they come to deserve this?

Finally, I must humbly beg to differ with you on the subject of what motivates authors to write. I sincerely hope that a written work may be inspired by more than the profit motive! I suspect that many authors obtain more intangible rewards from their work: the enjoyment of writing, for instance, or the desire to express an idea, to offer a gift of knowledge, to be immortalised in ink, even. I dare say that many of the greatest written works would still have been penned whether or not their authors were paid. I would also like to suppose that many of the less stellar works might not have offended us with their mediocrity had their authors, seeing no profit to be made from writing, found more humble uses for their pens, typewriters and word processors. I doubt this, however. Even mediocre authors are eager to see their words in print, and would probably carry on writing even if you charged them money.

In fact, few authors are paid well enough for their work to support themselves by writing books. Most have other jobs, particularly those in technical fields, who write textbooks and references. Does this indicate that your scheme of ensuring that authors get paid is working, or that the failure to provide authors with adequate compensation for their effort will prompt them to cease writing?

[ Parent ]

Re: What you're talking about is illegal (4.00 / 1) (#57)
by tweed on Wed Oct 11, 2000 at 08:46:22 AM EST

Whilst the tone of the original respondent was off-topic, I don't think you can ignore IPR completely at the implementation level. Should such marked-up texts have a standard place for PGP-signing by the `electronic document creator', attesting that there is no copyright infringement in the document? That way, even if you personally don't have a problem downloading anything, I can program my browser to check this and reject anything that isn't signed, or that is signed by someone I've discovered to have lied about this in the past. Should a message digest be attached to every document, so that if someone takes the source of a work I've put up, alters it by inserting slanderous statements and then sticks it back on the net, I'm in a position to prove those alterations were done by someone else? And what's the position on putting `advisory retraction' notices in the database? E.g., suppose I take a GNU copylefted manual, remove the license, change the name to say I wrote it, and stick it into Docster. OK, I accept it's anathema to you to remove it from the system, but if there's a database of `advisory retractions' (again cryptographically signed) from someone you trust, saying you shouldn't download the document because it isn't legally redistributable under its licence, then you can decide for yourself whether you feel you ought to obey this.
I entirely accept that technology makes it impossible to stop people completely disregarding laws they personally disagree with, but please don't set it up to make it needlessly difficult for people with civic consciences to act in good faith on their own.
Incidentally, I know three people who've written textbooks, all academics who don't get paid that well in their regular jobs but for whom the book paid certain expenses (university fees for children, etc.) that they would have struggled to pay otherwise. Whilst they enjoyed working on the books, they'd have done original research/spent time with their families/had a life/etc. if there wasn't some royalty money coming in. Just because royalty money matters to someone doesn't mean that their work is rubbish, contrary to your implication.
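(A very rough sketch of just the message-digest half of that proposal, in Python with the standard hashlib and json modules. The manifest format is invented, and a real system would wrap the manifest in a PGP signature rather than trusting it bare; that step is not shown here.)

# Sketch: record a document's digest so later tampering can be detected.
import hashlib
import json
import pathlib

def make_manifest(doc_path):
    data = pathlib.Path(doc_path).read_bytes()
    return json.dumps({"file": doc_path, "sha256": hashlib.sha256(data).hexdigest()})

def verify(doc_path, manifest_json):
    manifest = json.loads(manifest_json)
    data = pathlib.Path(doc_path).read_bytes()
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]

# Usage: publish make_manifest("mydoc.xml") alongside the document (PGP-signed);
# a reader later calls verify("mydoc.xml", manifest_json) before trusting a copy.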

[ Parent ]
What about the publishing companies? (none / 0) (#56)
by The Holy Chicken on Wed Oct 11, 2000 at 08:41:37 AM EST

Last I heard, Stephen King was still running his "experiment" with online book distribution, with decent results. This would be the problem: people not paying for the book unless they have to, as well as people sharing the books among themselves once one copy has been acquired. Although I wouldn't be completely heartbroken if a CSS-like solution were made (as long as I could view the files without too much hassle), I'd much prefer an encryption-free solution. Of course, I'm one of those people who buys the CDs of the artists whose MP3s he likes off Napster.

Barring the above problems, why wouldn't the publishing companies want to offer downloadable books from a central source, just as long as they got paid? I'd have to imagine that putting a book into a nice electronic format after getting the hard copy ready couldn't be too difficult; and without the need to run expensive printing presses, they'd be able to reduce the price while still maintaining a nice profit margin. I'd buy the books just for the convenience of it. If I already know what book I want, I dislike going to the book store at all, let alone trying to find an out-of-print or difficult-to-find book. (I still haven't gotten Goethe's Faust, mainly because the only copy I could find was in a ~$50 hardcover edition of all his works.)

ubu, can you advise me? (none / 0) (#64)
by anonymous cowerd on Sun Oct 15, 2000 at 09:23:59 PM EST

I'm working on some Project Gutenberg transcriptions, and part of that is making HTML pages as well as bare-ass text. I know exactly zip about XML. What can you suggest concerning making my transcriptions more usable and accessible with XML or DocBook or whatever? You sound like you know what you're talking about; if you are willing to give me advice, please reply to my email address, wkiernan@concentric.net, which isn't munged to avoid spam. (My solution, by the way, is to subscribe to two or three busy mailing lists, so I get a hundred interesting emails a day and the spam seems as nothing.)

Yours WDK - WKiernan@concentric.net

"This calm way of flying will suit Japan well," said Zeppelin's granddaughter, Elisabeth Veil.

Bleugh! (none / 0) (#65)
by Dop on Wed Oct 18, 2000 at 08:35:08 AM EST

I think I'll stick with bound paper with a cover, thank you very much. Batteries not required, fits in pocket, can be read anywhere.
The thought of electronic books leaves me cold.

Do not burn the candle at both ends as this leads to the life of a hairdresser!
not quite the same thing but.... (none / 0) (#66)
by yugami on Wed Oct 18, 2000 at 11:14:10 AM EST

Check out a basic paper I wrote in some free time discussing a document-sharing system.

ODE

Building "Bookster" | 66 comments (60 topical, 6 editorial, 0 hidden)
Display: Sort:

kuro5hin.org

[XML]
All trademarks and copyrights on this page are owned by their respective companies. The Rest 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.
Need some help? Email help@kuro5hin.org.
My heart's the long stairs.

Powered by Scoop create account | help/FAQ | mission | links | search | IRC | YOU choose the stories!