A Modest Search Algorithm Proposal

By localroger in Technology
Mon May 03, 2004 at 06:43:49 AM EST
Tags: Software

I recently expressed an opinion as to how a robust search algorithm might be built. All right, stop laughing. You too, Tex. I will now explain why it isn't nearly as dumb an idea as it sounds.


The Problem: You have a big database of stories and comments. I'll take rmg's word that we're talking about maybe 7 gigabytes of text in 5 million comments. Your mission, should you choose to accept it, is to make a list of all the comments that contain a particular phrase.

The traditional answer to this is to use a Relational Database Management System or RDBMS, a generic tool that automatically indexes such a pile of data as it is created. The RDBMS takes whatever measures are necessary as the database is manipulated (largely transparent to all but the most well-informed users) so that you can later give it a query amounting to "give me a list of all the comments containing the phrase foo-bar," and out pops the list like magic.

When a tool like this works it is truly sweet. When such a tool is open-source it is even sweeter. When even profit-hungry industry behemoths are offering free-as-in-beer solutions, it's easy to see why programmers have gone almost universally to RDBMS engines when they need to store and retrieve data, whether they actually need the advanced features of the RDBMS or not. Programming tutorials that in days of yore would have walked new users through creating, writing, and re-reading files now introduce the neophyte to linked data controls and SQL queries instead.

Like all good solutions, though, the RDBMS can't be good at everything. As the old saying goes, the jack of all trades is master of none; and the RDBMS introduces a few problems that eventually become apparent in real-world use:

  • The indexing increases storage and processing requirements.
  • When the indexing becomes corrupted, the entire database can become unreadable instead of the damage being limited to the corrupted records. Elaborate tools are necessary to recover from this.
  • Even simple, linear queries can involve traversing widely separated parts of the database in order to follow relational pointers.
  • RDBMS code is extremely complicated and when the RDBMS engine itself has a bug or fault it can be extremely difficult to find and fix.
  • Because most programmers don't know exactly what has to be done to process their queries, latencies can be innocently introduced which harm scalability and are very difficult to eliminate.
The latency problem is at the heart of K5's nonfunctional search function, and there aren't many solutions.
  1. You can buy enough RAM to accommodate the whole DB; this will work but gets expensive.
  2. You can become an expert on your DB engine and tweak the indexing and query structure until it squeaks; this might or might not work, and it defeats the whole point of using a convenient generic tool to access your data.
  3. You can create your own simpler parallel data structure either within or outside of the RDBMS to do specialized indexing.
All the really workable suggestions fall in the last category. Before we get to my idea and the gales of laughter it inspired, let's look at something a bit more obviously practical:

Tex's version of the vector space search:

Then I would make another database, like a binary tree or something like that. Every word would be a separate record. So basically if a word was already in the word search database, the comment id, story id, diary id, whatever, would just be added to that word's record. And if not, then a new word record in the wordsearch database would be created.
Once this word database is created, doing the search becomes simplicity itself -- you just go to the word record and spit out its contents. And as rmg pointed out, you can even do phrase searches with a little extra work either at search time or by including word order in the word index.
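To make the scheme concrete, here is a minimal sketch of such a word index in C. This is a toy for illustration only -- the fixed-size hash table, the record layout, and all of the names are my own assumptions, not Scoop code:

    /* Minimal inverted-index sketch: word -> list of comment ids.
       Toy illustration only: fixed-size chained hash table, no deletion,
       no persistence. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 65536

    typedef struct entry {
        char *word;              /* the indexed word */
        int *ids;                /* comment ids containing this word */
        size_t nids, cap;
        struct entry *next;      /* chaining for hash collisions */
    } entry;

    static entry *buckets[NBUCKETS];

    static unsigned hash(const char *s) {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Add one (word, comment id) pair, creating the word's record if needed. */
    static void index_word(const char *word, int comment_id) {
        entry *e;
        unsigned h = hash(word);
        for (e = buckets[h]; e; e = e->next)
            if (strcmp(e->word, word) == 0)
                break;
        if (!e) {
            e = calloc(1, sizeof *e);
            e->word = strdup(word);
            e->next = buckets[h];
            buckets[h] = e;
        }
        if (e->nids == e->cap) {
            e->cap = e->cap ? e->cap * 2 : 4;
            e->ids = realloc(e->ids, e->cap * sizeof *e->ids);
        }
        if (e->nids == 0 || e->ids[e->nids - 1] != comment_id)
            e->ids[e->nids++] = comment_id;   /* skip repeats within one comment */
    }

    /* Split a comment into lowercase words and index each one. */
    static void index_comment(int comment_id, const char *text) {
        char word[64];
        size_t n = 0;
        for (const char *p = text; ; p++) {
            if (*p && isalnum((unsigned char)*p)) {
                if (n + 1 < sizeof word)
                    word[n++] = (char)tolower((unsigned char)*p);
            } else if (n > 0) {
                word[n] = '\0';
                index_word(word, comment_id);
                n = 0;
            }
            if (!*p)
                break;
        }
    }

    /* Searching really is "go to the word record and spit out its contents". */
    static void search(const char *word) {
        for (entry *e = buckets[hash(word)]; e; e = e->next)
            if (strcmp(e->word, word) == 0)
                for (size_t i = 0; i < e->nids; i++)
                    printf("comment %d\n", e->ids[i]);
    }

    int main(void) {
        index_comment(1, "A modest search algorithm proposal");
        index_comment(2, "the search function is broken");
        search("search");   /* prints comment 1 and comment 2 */
        return 0;
    }

The lookup is just walking one word record; the hard part, as discussed below, is keeping all of those per-word lists updated, compact, and uncorrupted while the site runs.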

This is such a good idea, you might be asking: What's my problem with it?

The answer, mainly, is that it's a pain in the ass to implement. First of all, if you don't want the scheme to bog down under its own weight you need to build a custom data structure for it; it's essentially going to be a forest of pointers with a few bits of data sprinkled throughout. You will be adding dynamically to all of them through the course of a day, and their growth rates are very unpredictable. Those word records will range from common articles like "the", which list nearly every comment, down to unusually misspelled words that appear in only a single post. You can pare the list down a bit (as Google does) by eliminating the very common words and spell-checking, but that just adds more logic.

Maintaining a database of this type and keeping it from becoming corrupted is hard. And while it will probably be smaller than the sum of all the original text, it probably won't be that much smaller; remember that the few hundred most common words are going to have hundreds of thousands of pointers each. It might still fit in RAM; really it has to, because when you add a post you're tacking new pointers onto hundreds of widely separated data structures. As those pointer lists grow you will have to move them around and garbage-collect the storage space in real time. If they're on disk you'll burn out the voice coil trying to get to them all.

The word index helps with overhead, but it also moves a lot of heavy lifting from the search itself to real-time index maintenance. And at the risk of repeating myself, let me say that writing an engine to keep all those word pointer lists updated and handy is hard. You will certainly write it in a higher-level language if you're not a complete masochist. If you get lazy and try to use the RDBMS to do it you'll find that it still slows to a crawl, only now when you post a comment instead of when you do a search. This kind of thing has to be somewhat optimized. You'll have more fun than the law allows trying to debug it (especially under real-world load conditions), and Gaia help you if it gets corrupted.

My modest proposal was to simply tack every new post onto the end of a big text file as it's created, and write a really efficient parser to scan it. At first this sounds, like, soooo 1980. But on reflection, you'll see that it's not 1980 at all -- it wouldn't have been possible, much less practical, in 1980.
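Here is a minimal sketch of that idea in C rather than assembly; the record framing (id, then length, then text) is an assumption I made up for illustration, not anything from Scoop:

    /* Flat-file search sketch: append every comment to one big file,
       then scan it linearly for a phrase. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Append one comment as: [int id][size_t len][len bytes of text]. */
    static void append_comment(const char *path, int id, const char *text) {
        FILE *f = fopen(path, "ab");
        if (!f) { perror(path); exit(1); }
        size_t len = strlen(text);
        fwrite(&id, sizeof id, 1, f);
        fwrite(&len, sizeof len, 1, f);
        fwrite(text, 1, len, f);
        fclose(f);
    }

    /* Scan the whole file, printing the id of every comment containing phrase. */
    static void scan_for_phrase(const char *path, const char *phrase) {
        FILE *f = fopen(path, "rb");
        if (!f) { perror(path); exit(1); }
        int id;
        size_t len;
        while (fread(&id, sizeof id, 1, f) == 1 &&
               fread(&len, sizeof len, 1, f) == 1) {
            char *text = malloc(len + 1);
            if (fread(text, 1, len, f) != len) { free(text); break; }
            text[len] = '\0';
            if (strstr(text, phrase))
                printf("comment %d matches\n", id);
            free(text);
        }
        fclose(f);
    }

    int main(void) {
        append_comment("comments.dat", 1, "a modest search algorithm proposal");
        append_comment("comments.dat", 2, "vector space searches are overrated");
        scan_for_phrase("comments.dat", "search");  /* matches both */
        return 0;
    }

The real thing would read the file in large fixed-size blocks and search them in place rather than allocating per record, but the shape of the solution is the same: one append for every new post, one linear scan per search.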

In 1981 I was privileged to visit a major Air Force computer installation that featured a Cray-1 processor -- the famous "C" with the bench around it and its Freon-cooled circuit boards. It had eight megabytes of RAM and sixteen vector CPUs working in parallel to achieve about 100 megaflops of performance. Disk and tape drives filled a room the size of a small warehouse. The 80-megabyte removable-platter hard disks, with media the size and shape of a cake, had access times reckoned in tenths of a second. We were told that this machine was used, among other things, to sort Exxon's 100-million-record mailing list.

You no longer need a ten million dollar supercomputer to do that. And you'd no longer bother with vector processing and the many tricks that were necessary to optimize memory and disk paging on a dataset that size. And similarly, as recently as 1990 searching a seven-gigabyte dataset on a personal computer would have been well nigh impossible without a lot of fancy indexing and careful memory and disk management.

But it's 2004, not 1990, and things have changed.

Right now it is not hard to find hard disks with sustained transfer rates of over 100 megabytes per second. This means you could load and scan seven gigabytes in about a minute. While that sounds like terrible performance, there are several reasons why it is actually acceptable.

  • Most searches don't scan the entire database; so if you follow Scoop's convention and keep the stories in a separate, smaller pile, only the comment search will have the performance problem.
  • Searches are already limited by default to a more modest "current" dataset.
  • Searches terminate and return results when a page of results has been acquired; on most searches this will not require scanning the whole DB.
  • The time required for a search is definite and predictable, so the user can be given appropriate status indications.
  • As blocks are loaded into RAM they can be used for all ongoing concurrent searches, so while one worst-case search might take 60 seconds, a thousand concurrent worst-case searches will also take 60 seconds.
  • The impact on the main database is effectively zero.
Now it's possible to argue that 60 seconds is entirely too long for a search to complete, and if I were doing a lot of arcane searches I'd probably agree. But this is on the order of how long archive comment searches were taking anyway before the plug was pulled, and it's on the order of how long searches take on many web BBS systems.
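For the record, here is the back-of-envelope arithmetic behind that worst-case figure, using the numbers assumed above (7 GB of comment text, roughly 100 MB/s of sustained sequential read):

    /* Back-of-envelope worst-case scan time, using the figures assumed above. */
    #include <stdio.h>

    int main(void) {
        double bytes = 7.0 * 1024 * 1024 * 1024;   /* 7 GB dataset       */
        double rate  = 100.0 * 1024 * 1024;        /* 100 MB/s reads     */
        printf("full scan: about %.0f seconds\n", bytes / rate);   /* ~72 */
        return 0;
    }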

The advantage here is that the search routine spends nearly all of its time waiting for the hard drive to load another block, so that no matter how many searches are requested none of them will ever take more than the predictable worst-case time and the effect on K5's other functionality will be unnoticeable. And it was the effect on K5's other functionality more than anything else that made it a problem for the site.

In the original comment I suggested that the scanner should be coded in assembly language, and that's partly to make sure the search stays invisible in this way; but mostly it's because the thing is just so simple that you can. No matter how good your compiler is, you tend to do things in a higher-level language that look stupid when translated to machine code, like recalculating intermediate results, moving strings around instead of examining them in situ, and using memory where registers are available. If you're avoiding this kind of performance hit it's probably because you are very familiar with how your compiler writes code and are keeping it in the back of your mind; and that's not high-level programming even if you are doing it in C++.
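As a tiny, made-up illustration of the "examining them in situ" point (hypothetical helper functions, not anything from Scoop): both functions below answer the same question, but the first allocates and copies on every record while the second just looks at the bytes where they already sit.

    /* Two ways to test whether a record in a buffer starts with a phrase. */
    #include <stdlib.h>
    #include <string.h>

    /* Wasteful: copy the record out just to compare it. */
    int starts_with_copy(const char *buf, size_t reclen, const char *phrase) {
        char *tmp = malloc(reclen + 1);
        memcpy(tmp, buf, reclen);
        tmp[reclen] = '\0';
        int hit = strncmp(tmp, phrase, strlen(phrase)) == 0;
        free(tmp);
        return hit;
    }

    /* In situ: compare against the buffer where the record already lies. */
    int starts_with_in_situ(const char *buf, size_t reclen, const char *phrase) {
        size_t plen = strlen(phrase);
        return reclen >= plen && memcmp(buf, phrase, plen) == 0;
    }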

A tight loop like a search is exactly the kind of thing that would benefit from being hand tuned. The maintenance of Tex's word directory would benefit too, but that's so complicated only a pure masochist would even attempt it. (I've done projects like that, but I wouldn't expect anybody else to.) The big advantage of my scheme is that it is simple. Maintaining the search database is simple and scanning it is simple. This means you can easily and completely debug it, and you don't have to worry about it getting corrupted when the power fails unexpectedly.

All of your files, including the RDBMS and search index, live on another kind of database, the hard disk file system. Unlike the RDBMS, a file system is designed to do one thing -- store and retrieve blocks of data without worrying too much about what they are -- and if it's worth the electrons it's made of, a file system will do this very well. Any good well-maintained file system will tend to group a big file like the search index all in one place, so that linear access is blazingly fast, limited mostly by the rotation of the platters rather than repositioning of the head. The flat index leverages this to its advantage.

And if it still seems unworkable and silly, keep in mind that next year's hard drives will be even faster than today's. RAID controllers are getting cheap enough for personal systems. The technology to bring the worst-case search under 10 seconds is already there for a reasonable cost, and within a shorter time than K5 has already existed I'd guess that 1 second will be achievable.

Meanwhile, RDBMS code never gets any simpler. The opposite is true; faster machines and more memory encourage feature bloat, so that the performance you actually see doesn't rise as rapidly as the raw performance of the machine does.

I have argued in the past, loudly and at great length, that modern computing suffers because lazy programmers use powerful hardware as a crutch instead of learning how to do things right. An alert reader might accuse me of violating my own principles by leaning on the fast CPU and hard drive instead of building a proper index.

It's a valid criticism and I'd answer the same way modern programmers do; it's a tradeoff. It is lazy to depend on gigahertz processors and super-fast hard drive interfaces that didn't even exist a few years ago. But you also have to ask what you get in trade for that performance hit.

I like fast code, but even more than that I like code that works. This means I like code that is simple, lean, and written with an awareness of its hardware platform. What the flat search index buys you is extreme simplicity. Simple code is code that is easily optimized and easily debugged. It also buys you short development time and low cost, which aren't enough to justify requiring a bigger computer but aren't to be laughed at, either.

In the final analysis, if time and resources are limited there is really only one choice that makes any sense. A search function that works, however laughably implemented, is surely better than one you don't have the time to write and debug.

Poll
I keep my data in...
o Oracle 5%
o DBase 0%
o FoxBase 0%
o Access 0%
o MySQL 19%
o PostGreSQL 20%
o Fixed-width random files 4%
o Text files 18%
o XML 2%
o Post-It Notes 4%
o My head 16%
o Shredder 7%

Votes: 124



A Modest Search Algorithm Proposal | 253 comments (246 topical, 7 editorial, 1 hidden)
+1, technology (1.33 / 18) (#1)
by Hide The Hamster on Sat May 01, 2004 at 04:05:23 PM EST




Free spirits are a liability.

August 8, 2004: "it certainly is" and I had engaged in a homosexual tryst.

hahahahaahaha (2.33 / 6) (#28)
by James A C Joyce on Sat May 01, 2004 at 06:26:14 PM EST

Normally, your saying this would only raise a mere chuckle. But the deadpan delivery in the face of this obvious techish troll as if nothing out of the ordinary has been posted makes "+1, technology" hilarious.

I bought this account on eBay
[ Parent ]

And I you, sir. (1.75 / 4) (#34)
by Hide The Hamster on Sat May 01, 2004 at 07:09:18 PM EST

And I you.


Free spirits are a liability.

August 8, 2004: "it certainly is" and I had engaged in a homosexual tryst.

[ Parent ]
Yah (2.33 / 6) (#2)
by melia on Sat May 01, 2004 at 04:06:45 PM EST

What the flat search index buys you is extreme simplicity. Simple code is code that is easily optimized and easily debugged. It also buys you short development time and low cost, which aren't enough to justify requiring a bigger computer but aren't to be laughed at, either.

Yah it looks pretty similar to PHP.


Disclaimer: All of the above is probably wrong

Easy fix to K5's database woes (2.41 / 12) (#3)
by Jed Smith on Sat May 01, 2004 at 04:12:55 PM EST

http://www.slashcode.com/
_____
K5 is dead. Steve Ballmer made the most insightful comment on a story. -- jw32767
Creepy (2.33 / 6) (#4)
by melia on Sat May 01, 2004 at 04:16:36 PM EST

First line of the first post on that page: Recently I was asked by a heavily trafficed website to investigate migrating their site to Slash.
Disclaimer: All of the above is probably wrong
[ Parent ]
I'd just like the record to reflect that (2.47 / 19) (#6)
by Tex Bigballs on Sat May 01, 2004 at 04:27:14 PM EST

I was just throwing my idea out there because I had to do a similar project in college. I have no idea if it would be easy or difficult to implement. In fact, the only programming I've ever done is some C in college and some visual basic for my job.

Therefore, I would defer to someone who knows more about databases to evaluate local roger's method.

Finally, I would just like to add that local roger is an egotistical windbag and here is yet another sterling example.

Why not just make comments smaller? (nt) (2.57 / 21) (#7)
by ennui on Sat May 01, 2004 at 04:33:05 PM EST



"You can get a lot more done with a kind word and a gun, than with a kind word alone." -- Al Capone
-1, moron (2.79 / 34) (#8)
by pb on Sat May 01, 2004 at 04:41:39 PM EST

Here is why your proposal is total bullshit: kuro5hin effectively uses it already, and it doesn't work.

Since you admit your ignorance about databases, you'll just have to take my word that "SELECT * FROM comments WHERE comment LIKE '%foo%'" is roughly equivalent to searching for the string 'foo' in a flat text file of comment text. You'll also have to accept the fact that searching K5 currently works sort of like that now, and that's why it's so slow.

To search it faster, what you'd want to do would involve a slightly more complicated scheme, and it doesn't really matter if you store your data on the filesystem (as I do) or in a database (as K5 does); you just have to do it right. That is to say, your search algorithm shouldn't involve sifting through 7GB of data every time.

After all, if sifting through 7GB of data is 'fast', then sifting through 7MB would be faster! Ideally, you'd want to reduce the amount you have to search for any given query to some small fraction of the actual data, and then use that to rank and look up the relevant records. This is called an index.

Now, why do people generally use databases for this instead of writing their own system in assembler with flat text files? Well, because databases are designed for these sorts of problems. They were originally written in assembler, by experts, and these experts have refined their techniques and incorporated new algorithms over the years. Do you know what a B-tree is? Well, I guarantee you that they do.

So although it is possible that for a highly specialized task, you could 'beat them' with some hand-crafted program, it isn't likely. Chances are, if your database or search program is dog-slow, it's because your database schema is badly designed, and/or your choice of algorithms in accessing your data is poor. Other people who know what they're doing could use the same database, do the same task, and do it far faster, because they know how to do it better.

In short... leave it to the professionals.
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

absurd. (2.15 / 19) (#10)
by rmg on Sat May 01, 2004 at 04:53:58 PM EST

the reason the search is slow is because it's written in sql, which everyone knows is dead slow.

you've entirely missed the point of rog's proposal. it's written in assembly language.

you've obviously never dealt with realtime latencies or down to the metal hardware programming. you simply do not appreciate the awesome power of assembly language.

----

dave dean
[ Parent ]

I didn't miss it at all. (2.50 / 6) (#15)
by pb on Sat May 01, 2004 at 05:04:00 PM EST

Read my post:
databases are designed for these sorts of problems. They were originally written in assembler, by experts
As for your other assertions... false as well.

Cheers!
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]

mysql is written in c. (2.50 / 10) (#17)
by rmg on Sat May 01, 2004 at 05:10:45 PM EST

c, motherfucker.

assembly language is up to three times faster than c. therefore, our new search will be at least three times faster than the old one, plus the improvements due to getting rid of that crufty old database!

assembly, bitch. assembly.

----

dave dean
[ Parent ]

actually... (2.85 / 7) (#22)
by pb on Sat May 01, 2004 at 05:27:20 PM EST

as usual, the truth is both more complex and more interesting than what you assert.

Mysql isn't just written in C. It is largely written in C, and some of it is written in C++ -- but it may interest you to learn that some of it is actually written in assembly, and this is done so for performance reasons. Specifically, check out strings/README for a discussion of this.

As for why, well, once you figure out how to use a profiler, and learn about how you might spend 90% of your time executing 10% of your code (or less!), I'll get back to you about it.
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]

YHBT. YHL. HAND. [n/t] (none / 0) (#224)
by braeburn on Tue May 04, 2004 at 04:03:38 PM EST



[ Parent ]
Assembly (none / 1) (#116)
by duffbeer703 on Mon May 03, 2004 at 12:04:16 AM EST

Assembly can be 3x faster because a skilled programmer with a lot of time on his hands can optimize code more effectively than a compiler.

He can do this because he has a very specific task in mind and knows what he is doing.

Since you think that search engines are an appropriate application of assembly, you have no clue. When you finish coding in 10 years, you'll have to spend another twenty years to deal with Unicode or some other such nonsense.


[ Parent ]

Listen, we need something that *works* (2.66 / 9) (#18)
by Mutually Assured Destruction on Sat May 01, 2004 at 05:12:29 PM EST

Have you ever bothered to find out just how complex the code is for your average RDBMS software? I'll give you a hint: it's generally accepted as the most complex and difficult piece of software to write, period. That's right, operating systems, word processors and 3D game engines all take a back seat to these carefully-designed multimillion-line behemoths.

Now, using one of these is all well and good for a big corporation like Microsoft that has literally billions of dollars to drop on an R&D project, but Kuro5hin isn't Microsoft. What we need right now is something that works, and hand-coding a parser in assembly to scan a flat file of comments is about as simple and error-free as you can get. In contrast, dealing with MySQL is dealing with one of the most complex and poorly-documented systems on Earth. I think it should be quite obvious which is the better solution.

[ Parent ]

not even. (2.80 / 5) (#23)
by pb on Sat May 01, 2004 at 05:30:46 PM EST

I'd like to mention that MySQL doesn't even break one million lines of code. And although it is complex, it has excellent documentation. As for hand-coding a parser in assembly being simple and error-free... well, tell me another one. :)
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]
You wrote an excellent first comment (none / 1) (#127)
by gazbo on Mon May 03, 2004 at 07:03:51 AM EST

And as such I'll put you out of your misery.  The person you replied to, and rmg, both agree with you.  They are just choosing to express it through sarcasm.

-----
Topless, revealing, nude pics and vids of Zora Suleman! Upskirt and down blouse! Cleavage!
Hardcore ZORA SULEMAN pics!

[ Parent ]

and I with them :) (none / 1) (#140)
by pb on Mon May 03, 2004 at 09:58:04 AM EST

Note that I gave rmg a 3 somewhere in there; it's all quite entertaining.
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]
Uh... (2.44 / 9) (#9)
by The Honorable Elijah Muhammad on Sat May 01, 2004 at 04:46:17 PM EST

What happened to vector search?


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
good question. (1.83 / 6) (#12)
by pb on Sat May 01, 2004 at 05:01:45 PM EST

maybe someone will get around to it, someday.
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]
So... (1.80 / 5) (#14)
by The Honorable Elijah Muhammad on Sat May 01, 2004 at 05:03:54 PM EST

do I get to blame rusty or hulver?


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
[ Parent ]
blame rusty. (2.50 / 10) (#16)
by pb on Sat May 01, 2004 at 05:05:56 PM EST

AFAIK, hulver hasn't promised anything yet.
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]
yeah (none / 3) (#53)
by regeya on Sat May 01, 2004 at 10:43:19 PM EST

and if Rusty doesn't like your implementation, he or someone else could always adapt, say, the simple Search-VectorSpace engine on CPAN. *shrug*

[ yokelpunk | kuro5hin diary ]
[ Parent ]

um. God no. (none / 0) (#141)
by pb on Mon May 03, 2004 at 10:01:33 AM EST

That's why I wrote mine in the first place; that implementation (the one in the article, that is) doesn't scale at all. It doesn't do sparse arrays. It is a toy, at best. (and really, it's supposed to be an example)

Naturally that wouldn't stop anyone from trying it, and actually that's what rusty and I started from in the first place. But as it is, no one should use it for searching anything more than a very small set of documents.
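For anyone wondering what "doing sparse arrays" buys you here, a toy sketch (my own illustration, not pb's code): each document stores only the terms it actually contains, and the similarity scoring at the heart of a vector-space search touches only those entries.

    /* Sparse document vector: only the terms that occur are stored,
       as (term id, weight) pairs sorted by term id. */
    #include <stddef.h>

    typedef struct {
        int    term;    /* id of a term that occurs in the document */
        double weight;  /* e.g. a tf-idf weight */
    } term_weight;

    /* Dot product of two sorted sparse vectors -- the core of a
       vector-space similarity score -- in one merge-style pass. */
    double sparse_dot(const term_weight *a, size_t na,
                      const term_weight *b, size_t nb) {
        double sum = 0.0;
        size_t i = 0, j = 0;
        while (i < na && j < nb) {
            if (a[i].term < b[j].term)      i++;
            else if (a[i].term > b[j].term) j++;
            else sum += a[i++].weight * b[j++].weight;
        }
        return sum;
    }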
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall
[ Parent ]

OK, cool (none / 0) (#164)
by regeya on Mon May 03, 2004 at 05:18:18 PM EST

Thanks for the quick education. :-D

[ yokelpunk | kuro5hin diary ]
[ Parent ]

Nobody wants to wait 60 seconds on search results (2.76 / 17) (#13)
by bc on Sat May 01, 2004 at 05:01:57 PM EST

The solution is to use a system with proper caching, db pooling etc. Many such solutions exist. Some are complicated, some are absurdly easy. All this work's been done already, better, by people smarter than you and me.

You might also want to think about using a proper database, and not a dreadful piece of shit.

Nonetheless, blaming a perverse, dreadful implementation of search and DB and design and so on (scoop) on the whole world of RDBMS, and advocating something that will "only take 60 seconds" (and may indeed be better and faster than perl + mysql -- let's face it, these two things should be banned) is to miss the point that we can take a step forwards, into a new, sunlit upland of lightning performance and piss-easy development, rather than a step backwards into nasm, flat files, and general agony.

♥, bc.

A site for the people (2.50 / 16) (#24)
by rmg on Sat May 01, 2004 at 05:49:13 PM EST

Must not use the tools of the workers' oppressors.

Kuro5hin's infrastructure, like its content, must be the free product of the people's labor. You offer us bourgeois solutions rooted deeply in tyranny and the corporate interests who would silence us with a monopolized media and an eight-hour workday.

If this means we cannot have a search function, it will only remind us of the sacrifice of those who came before us, whose memory we must never let go.

----

dave dean
[ Parent ]

I never had you down for a lunix user (2.36 / 11) (#26)
by bc on Sat May 01, 2004 at 06:01:50 PM EST

I think kuro5hin's infrastructure should be the product of commercial enterprise - the honest exchange of goods and services.

What has Open Source given us? Nothing more than economic recession. The OSS crowd have written themselves out of jobs. Why should a company, whether IBM or a local insurance outfit, pay the wages of programmers, when they can find fools to do it for free?

Every line of open source contributes to a nice girl's daddy being fired, and a life of penury, and - ultimately - prostitution.

Every line written in scoop and slash contributes to the demise of professional solutions and the destitution of families.

You, sir, are little better than a hip, middle class Marxist youth, happily advocating that free trade between nations be abolished because it's "exploitative", then seizing on the resulting mass poverty and humanitarian crises in the poor parts of the world to further his ideals among his chattering class friends. He doesn't do anybody any favours by whining about the problem; he is the problem.

You are the problem. You stand in the way of economic upturn. You stand in the way of employment for programmers. You stand for tyranny, poverty, and outright Stalinism.

It's no fucking use at all.

I mean seriously. Support commerce! Stand against software communism!

♥, bc.
[ Parent ]

The invisible hand has failed (2.36 / 11) (#33)
by rmg on Sat May 01, 2004 at 07:07:19 PM EST

To provide the freedom and security Smith promised. Instead, it has slapped down the worker and proved the ruin of the American working class.

The snake oil salesmen of the late 1900s have failed the workers through embezzlement and artificial inflation of profits, i.e. "cooking the books." In so doing, they have caused chaos amongst the investing class when their misdeeds were revealed. The result was defaulted loans and frozen stores -- what Marx called "crisis."

Production fell to a crawl and workers were the first to feel the effects. But man's essence is in his labor -- he is defined through action and plying his skills. He must continue his work and share with the community, for it is through expression and solidarity that man perfects his spirit.

Open source is just one expression of man's need to create and to control his labor product. Through labor and community, he realizes his potential and lives in earnest, as does the kibbutznik and sportsman.

"The economic downturn," as you call it, is merely a harbinger of the impending social transformation from corporate serfdom to a network of closely knit workers living on the bounty of the land and their own imaginations. Open source provides us the tools to face the challenges we must face as we engineer a new industrial future in which the workers themselves build and own the means of production, including the software mills.

Your foolish naysaying will not shake our resolve. We know that we will have to face the shills for the old order, but the cheers of the triumphant masses will drown out their screeching and they will fade into obscurity. You will be nothing more than a relic of a forgotten history.

----

dave dean
[ Parent ]

You said it comrade (none / 2) (#76)
by bugmaster on Sun May 02, 2004 at 02:53:18 PM EST

Users of kuro5hin, unite ! You have nothing to lose but your indexes.
>|<*:=
[ Parent ]
ob duh (none / 1) (#133)
by dke on Mon May 03, 2004 at 08:50:24 AM EST

Yes, and all OS programmers shoot themselves with Free bullets, leftovers from their favorite flamewar?
Nothing is ever easy
[ Parent ]
-1 and then some (2.28 / 14) (#19)
by toulouse on Sat May 01, 2004 at 05:21:56 PM EST

You've already exhausted this troll, and it was fucking tedious and uninspired the first time.

A perusal of your literary output demonstrates that you are not above flogging a dead horse, but kindly refrain from indulging yourself here. PlzDitchKThx.


--
'My god...it's full of blogs.' - ktakki
--


Won't work at all (2.87 / 16) (#20)
by vadim on Sat May 01, 2004 at 05:24:35 PM EST

Ok, let's take a look at your claims: Search through 7 GB of data stored on disk in 10 seconds. That is about 716MB of data parsed per second.

I suppose this approach might work if it was all in RAM, but on a hard disk? No way. A rough benchmark says my hard disk can do 55 MB/s. Okay, let's say we can get 80MB/s. That's still 9 hard drives in RAID-0 in very optimistic conditions, and a horrible MTBF. So make that 18 drives at least.

Next, we need a bus capable of handling that. I have two PCI 66MHz 64 bit slots in this machine, that's 532 MB/s. Oops. Bottleneck.

At this point, it's pretty evident that we need some hardware to support this that's more decent than my dual Athlon. AMD64 maybe, or some expensive Sun box. I don't think that Rusty can afford one of these.

Also, with the cost of 18 good (probably SCSI) hard drives and just a 7 GB file, you'd be better off just getting a machine with enough RAM to keep the whole thing in memory.

The whole idea appears to be braindead. For the money this idea would cost, you probably could get enough RAM and a decent database. For some reason you discard using indexes. You ignore that parsing 716MB of data a second will have a horrible impact on the system bus, and that the disk activity will slow down the database and reduce the available amount of cache memory. You ignore that the data already is in a database and want to duplicate it for some reason.

Your approach also loses lots of possibilities that databases give you. So the search is too slow because there's too much data. Why not restrict the search?

Force users to select a timeframe to search; encourage limiting the search to a story, user or section. Maybe even add the moderation score as a parameter. Allow searching for signatures separately.

True, scanning through the whole database with an RDBMS would be slower than doing it with a text file. But that's a completely stupid idea; you could use the database's capabilities and get better performance out of it! If the database is bad, then there are others available (postgres?)

--
<@chani> I *cannot* remember names. but I did memorize 214 digits of pi once.

Your head is clouded (2.58 / 12) (#21)
by cameldrv on Sat May 01, 2004 at 05:25:06 PM EST

Your knowledge of algorithms is stunted by your real-time experience and the need for guaranteed worst-case performance. Suppose we coded google in assembly and scanned a big text file of the whole web every time you did a search. Oh, but it's fast and easy to write because it's in assembly.

yeah (none / 0) (#134)
by dke on Mon May 03, 2004 at 08:59:12 AM EST

Reminds me of ticking the checkbox for security.
Nothing is ever easy
[ Parent ]
Roger: (2.50 / 10) (#25)
by mcc on Sat May 01, 2004 at 05:56:39 PM EST

This faux "discussion" that you are having between your main account and your "tex bigballs" dupe account is getting really grating. Please stop it.

Hey Roger. (2.64 / 14) (#29)
by scanman on Sat May 01, 2004 at 06:27:21 PM EST

If you hate vector searches so much, why don't you stop using them? No one is forcing you to use vector searches. To help you out, I'll give you the contact information for ICANN so you can ask them for the IP address of kuro5hin.org without using a vector search:

(ICANN) Internet Corporation for Assigned Names and Numbers
4676 Admiralty Way, Suite 330
Marina del Rey, CA 92092

Phone: 310-823-9358
Fax: 310-823-8649

"[You are] a narrow-minded moron [and] a complete loser." - David Quartz
"scanman: The moron." - ucblockhead
"I prefer the term 'lifeskills impaired'" - Inoshiro

OH SHIT I GAVE IT +1FP (2.26 / 23) (#30)
by James A C Joyce on Sat May 01, 2004 at 06:27:43 PM EST

I'm really sorry everybody.

On a more topical note, I bet I could come up with a better way of doing this just by picking a random techie phrase from the front page of perl.com. Ah, here we are..."Bloom Filters". There ya go, Mister Localroger Troll Sir, implement something with that.

I bought this account on eBay

I accidentally voted abstain... (none / 3) (#58)
by skyknight on Sun May 02, 2004 at 02:58:57 AM EST

At least I can plausibly claim that I was trying to vote -1. You, on the other hand, can at best claim you were trying to vote it to the section page.

It's not much fun at the top. I envy the common people, their hearty meals and Bruce Springsteen and voting. --SIGNOR SPAGHETTI
[ Parent ]
WTF Parrot Compilers are waaay better [nt] (1.75 / 4) (#65)
by driph on Sun May 02, 2004 at 08:28:35 AM EST



--
Vegas isn't a liberal stronghold. It's the place where the rich and powerful gamble away their company's pension fund and strangle call girls in their hotel rooms. - Psycho Dave
[ Parent ]
Bah. (2.84 / 13) (#32)
by regeya on Sat May 01, 2004 at 07:07:19 PM EST

As other people have said, RDBMS systems are handy because someone else has already used optimized routines for you. I'll admit that I don't know much about the subject myself, but I'm sorry to say that I think your approach is, well, stupid. I ended up getting a journalism degree because I quickly learned that CS wasn't the major for me, but not before learning something about indexing. There are far more efficient ways of indexing data than what you propose, and a decent RDBMS could handle it. The problem is that MySQL (or does kuro5hin use something different?) just isn't up to the task.

A lot of people have thrown brain power at such problems. A flat text file + parser sounds like a stupid Perl programmer's idea. This notion of reinventing the wheel and denouncing using a modular solution as lazy, though, is absurd, IMHO. Again, others have worked on problems before, and

Heh...I guess I'm a lamer, but I like the idea of using files instead of throwing everything into an RDBMS, though. With a decent filesystem, indexing scheme, and caching scheme, this could work. You wouldn't be able to rely on the RDBMS as a security device (that's fine, I suppose) and you'd have to handle things like locking on your own, but it could work. Hell, someone do it and see if it's possible to make it work. Let's go further: How 'bout someone takes some existing pieces and tries to make a Scoop-type system using, say (this is an example, and I've no experience in such things, so feel free to laugh) a PHP system that reads/writes XML, and a pre-built indexer such as SWISH-E (again, feel free to point and laugh if it wouldn't work) along with a home-grown indexing scheme (which again could be based on XML, though localroger will blast it for being lazy ;-D) and see if it could even come close to Scoop. I'd be interested to see if it could work.

[ yokelpunk | kuro5hin diary ]

Another thing to consider (2.60 / 5) (#56)
by regeya on Sun May 02, 2004 at 12:38:55 AM EST

After reading a little about PostgreSQL, it sounds like it has some more features that'd help solve some of kuro5hin's problems, maybe. I don't know. Maybe not. Fever-addled brain and all that...

[ yokelpunk | kuro5hin diary ]
[ Parent ]

uh... (2.80 / 21) (#35)
by coderlemming on Sat May 01, 2004 at 07:17:08 PM EST

Well, I'm going to take you seriously and assume you're not trolling.  Perhaps I'm wrong and you're just insanely subtle.

I'll grant you that your solution would work.  It's a quick, dirty hack, and if you throw enough hardware at it, it'll work.  Maybe.  For now.  But it's stupid.

First of all, in this writeup and in your comments, you proclaim the amazing power of modern disk IO systems, what with their caching and all of that.  Add to that the ability of any multitasking OS to schedule around your 99.9% IO-bound thread, and you take this to mean that you're home free.  Unfortunately, you've overlooked a few things (these are just the ones that come to mind):

  •  You're going to absolutely trash all of the disk caches.  In a fraction of a second, you're going to totally blow the one on the disk, even if it's huge, and a few seconds after that, you'll probably also get the kernel's cache.  By the time you're done, they'll likely have a small fraction of your database in cache, and not much else.  Your next search won't benefit from this at all, and nor will anything else.  Filesystem efficiency will suffer.
  •  Your suggestion that no other processes will suffer from your low-priority thread thrashing the disk is a bit short-sighted, as well.  As you point out, the RDBMS is storing its data on the disk as well, and every single other action that scoop does is going to go through the disk.  You're definitely going to severely bog that down by scanning your flat-file.
  •  When data is invalidated, how do you find the entry in the flatfile to mark it as invalid?  Another search?  This question was raised in some of the threads you linked to but never answered.
Beyond these points, you also have to realize that a simple implementation of a word-index search is just as trivial as implementing your flatfile, scan, and garbage-collection systems.  RDBMSs are made for huge data sets like this, and there's been hard research done on how to make them fast.  Besides, if it's inefficient, find ways to optimize it later.  That's almost an exact quote from you.

And one final point: there are already word-index packages out there, even open-source projects, that could be dropped in and used with very little coding.  It's been done, and done, and done some more.

Your suggestion is not (quite) laughable, because you're pointing out something many of today's programmers forget: if it works, and it's easy, why bother making it more complicated?  But in this case, the complication factor is not that high, and the benefit is several orders of magnitude.


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)

Hey, he IS subtle. (2.00 / 4) (#36)
by craigd on Sat May 01, 2004 at 07:44:59 PM EST

This guy is the one who gave us the AST article.

Not saying this is a troll, just that we'll never know.


A man who says little is a man who speaks two syllables.
[ Parent ]
subtle? (2.40 / 5) (#49)
by JahToasted on Sat May 01, 2004 at 10:26:50 PM EST

I thought the play on "A Modest Proposal" made it pretty damn obvious.
______
"I wanna have my kicks before the whole shithouse goes up in flames" -- Jim Morrison
[ Parent ]
It would have been more fun if it had been... (none / 0) (#187)
by Entendre Entendre on Mon May 03, 2004 at 11:59:14 PM EST

A brief introduction to search technology.

--
Reduce firearm violence: aim carefully.
[ Parent ]

Great comment (none / 1) (#68)
by tzanger on Sun May 02, 2004 at 09:11:48 AM EST

It's disputing the search method on its (lack of?) merit, rather than rmg's tiresome "but it's in ASSEMBLY, DUDE!  IT'LL BE A BAJILLION TIMES FASTER BECAUSE OF THAT!!1"

People keep forgetting that localroger's after a solution that works now and is somewhat fast, not the fastest, baddest-ass system out there.  No, it won't be the fastest.  No it's certainly not the smartest.  But it'll get us a working search now.

I still dispute the 7G dataset size.  Maybe 7G if you include all the archived stories.  I'm betting that the actual dataset is under 1500MB.  Still big, but search results will be available quickly enough to satisfy most humans.

You're absolutely right about trashing the cache though.  I think localroger's idea works great for small, less complex systems -- systems without a DB already present, and it's a smart algo for those kinds of systems.

Perhaps a little simplistic for K5, but again he's looking for a solution that works today, not in another year while we squabble about vector searches and indices.

[ Parent ]

Doesn't work (none / 1) (#75)
by bugmaster on Sun May 02, 2004 at 01:36:06 PM EST

localroger's solution is actually so bad that it does not work at all. For a site like k5, with a decent user base and 7G worth of comments, the stupid asm parser will take about 200-300 seconds per search, best case scenario. That's the same as it not working, because no one will actually wait that long.
>|<*:=
[ Parent ]
you are an idiot. (2.85 / 7) (#79)
by rmg on Sun May 02, 2004 at 03:52:03 PM EST

rmg's synopsis of localroger's argument is exactly correct.

what localroger describes is a variation on what the previous search did, except that it duplicates all of k5 in a text file (rather than leaving it in the database) and reimplements mysql's scan so as to unleash the awesome power of assembly language.

his way does not work. do you know why there is no comment search? i'll tell you: i remarked in one of my diaries back in september that you could dos k5 if you submitted a search for the word "is." someone actually did it. it became common knowledge that the reason k5 was slow over the summer (primarily) was use of the comment search. so rusty got rid of it. it did not work.

i realize i'm a troll and you're a nerd and as such cannot appreciate that there are some situations in which mockery is more appropriate than rational discussion (having likely spent your life as the target of it), you have to disagree with me because i'm a troll, etc, but if you take a hard look at localroger's argument here, it boils down to the use of assembly to do the impossible.

many people have tried to talk sense into localroger. with characteristic arrogance, he has refused to listen, so now he will endure our mockery, even if he comes out saying he was trolling later on (and he almost certainly will -- it's hard to believe anyone in his line of work could be so stupid as to believe what he is saying).

----

dave dean
[ Parent ]

yes, but... (none / 0) (#99)
by coderlemming on Sun May 02, 2004 at 06:28:07 PM EST

Yes, but mockery, in this case, got old fast :P


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)
[ Parent ]
mocking localroger (none / 2) (#100)
by rmg on Sun May 02, 2004 at 06:57:47 PM EST

never gets old.

----

dave dean
[ Parent ]

Small systems (none / 0) (#180)
by Pseudonym on Mon May 03, 2004 at 10:16:36 PM EST

I think localroger's idea works great for small, less complex systems -- systems without a DB already present, and it's a smart algo for those kinds of systems.

For small, uncomplicated systems -- say, the size of my address book -- "grep" does just as well. It's also a more general solution and doesn't require assembly hacking.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
I hadn't thought about trashing the disk caches (none / 1) (#97)
by localroger on Sun May 02, 2004 at 06:05:11 PM EST

That's a good point. Obviously it would be necessary to use a dedicated HD for the index, which would fix the problem of trashing the HD cache. Unfortunately I have no idea, either in Windows or Linux, whether separate caches are maintained for each physical device.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]
separate caches (none / 0) (#98)
by coderlemming on Sun May 02, 2004 at 06:25:33 PM EST

Short answer: I doubt it, at least in Linux.  I don't have much experience, but from what I can figure, it wouldn't even make sense to keep separate caches per device.  The cache is for the filesystem itself, not for the device.  The filesystem is supposed to be a uniform way of looking at data that is separate from the devices it unifies.


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)
[ Parent ]
oh, and mem cache will be trashed, too. (nt) (none / 0) (#185)
by coderlemming on Mon May 03, 2004 at 11:57:37 PM EST




--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)
[ Parent ]
as for cache thrashing... (none / 1) (#205)
by Ubiq on Tue May 04, 2004 at 06:59:46 AM EST

Use O_STREAM or O_DIRECT if you must.
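For reference, roughly what the O_DIRECT route looks like on Linux (a sketch only; I can't speak for O_STREAM, and the 4096-byte alignment figure is an assumption -- the real requirement depends on the filesystem and device):

    /* Minimal sketch: scan a file with O_DIRECT so the reads bypass the
       page cache.  Linux-specific; buffer, offset and size must be aligned. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)   /* 1 MB per read, a multiple of 4096 */

    int main(void) {
        int fd = open("comments.dat", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK) != 0) return 1;
        ssize_t n;
        while ((n = read(fd, buf, CHUNK)) > 0) {
            /* ... scan the n bytes in buf for the phrase here ... */
        }
        free(buf);
        close(fd);
        return 0;
    }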

[ Parent ]
interesting (none / 0) (#220)
by coderlemming on Tue May 04, 2004 at 02:50:42 PM EST

I learn something new every day.  Thanks for that tidbit.


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)
[ Parent ]
real full text search/index is easy enough (2.55 / 9) (#37)
by simul on Sat May 01, 2004 at 08:08:46 PM EST

Full text indexing is a well-researched issue, with a long history. I think, maybe, that you really don't know how to use indexes.

Sure, storing comments in a database can be slow. I built a commenting system that puts each comment in its own text file.

But if you're going to be searching text files, a full-text index is *the* only way to go. You could use WAIS, or experiment with a new system like CLucene. What takes 1 second, in your article, would take a few milliseconds.

However, full-text indexing works well with MySQL. You have to create your comment table as MyISAM, NOT InnoDB. The problem with this is that you get slow updates. The advantage is high-speed searching.

http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html

I'm surprised this article is receiving votes, though... considering its lack of technical merit. I thought kuro5hin'ers were more sophisticated than that.

Read this book - first 24 pages are free to browse - it rocks

I concur (none / 3) (#40)
by Mutually Assured Destruction on Sat May 01, 2004 at 08:14:02 PM EST

You are quite sophisticated.

[ Parent ]
heh...meme from k5 (none / 3) (#51)
by simul on Sat May 01, 2004 at 10:31:42 PM EST

good call. for some reason rusty likes to use that word when describing k5 users. i borrowed it. should have said "more tech-savvy". similar meme... but more apropos.

Read this book - first 24 pages are free to browse - it rocks
[ Parent ]
Rusty is what we call a "con artist" (1.12 / 8) (#54)
by Mutually Assured Destruction on Sat May 01, 2004 at 10:44:03 PM EST

Sucking up to your marks is just part of the process.

[ Parent ]
that's what i thought (none / 3) (#41)
by thesatirist on Sat May 01, 2004 at 08:34:31 PM EST

I had that same impression when I initially commented on this on the "rusty is GONE" thread. This is a job for indexes. This is not mystical black magic rocket science brain surgery. This is pretty basic software engineering.

[ Parent ]
well (none / 3) (#50)
by regeya on Sat May 01, 2004 at 10:31:10 PM EST

It's a good jumping-off point. Poorly thought-out article (even I can see that, and I know regrettably little about the subject) and it's garnered a few links. Makes me think back to the old days when people understood that this is how Slashdot was supposed to work. ;-D

[ yokelpunk | kuro5hin diary ]
[ Parent ]

and let me explain (none / 3) (#52)
by regeya on Sat May 01, 2004 at 10:34:35 PM EST

Before you say "what the fuck is he on about" let me collect my fever-addled thoughts...springtime flu should be illegal...

What I mean is this: part of the point of stories plus comments is that you have the readers participating. Slashdot had short blurbs, and often had people refuting whatever claim was made. It was interesting, and was a first look at this notion of collaborative media. Exciting stuff until some assholes (who also used to be kuro5hin regulars) started crapflooding Slashdot with their bitching and moaning about how Rob and Co. should think things through themselves instead of leaving it to the reader to think. You could say that Signal 11 and others were arguing that Rob Malda should think for them.

*ahem* Anyway, as I said before, this article is a good discussion starter even if the article itself sucks.

[ yokelpunk | kuro5hin diary ]
[ Parent ]

the answer (none / 0) (#104)
by xmedar on Sun May 02, 2004 at 07:37:02 PM EST

is to have two sets of tables: a table of everything bar the last few days / week (historic data, therefore not likely to have comments added), indexed with FULLTEXT in MySQL, and the last few days / week (live, changeable data) in a table that is not. Then submit two queries to do a text search, one to each table, and update the historic table with data from the live table at periods of low usage. Don't forget to pack the historic data every so often so it takes up less space, and have lots of RAM to cache the data and / or create HEAP tables for the historic data, which will be in RAM anyway. It's better to buy lots of RAM if the DB is only a few GB; if it's very large, tens or hundreds of GB, you could split it over different servers / invest in a high-end disk subsystem with 15K disks in a RAID 10 configuration with 512MB of battery-backed cache on the controller, but that is getting quite expensive in K5 terms.

[ Parent ]
linux? (1.72 / 18) (#38)
by SilentChris on Sat May 01, 2004 at 08:10:18 PM EST

might i recommend linux?

A real, ready seach engine (2.50 / 6) (#39)
by danharan on Sat May 01, 2004 at 08:10:58 PM EST

Brag all you want about seeing a Cray in 1980, but you're not lazy enough to be an outstanding programmer. Lucene exists, it's free and it works. Why reinvent the wheel?

A brief caveman story. (3.00 / 11) (#47)
by NoMoreNicksLeft on Sat May 01, 2004 at 10:24:12 PM EST

Urg: Me Urg. Me invent square wheel. Square wheel rolls bumpily. Can use to move things easier.

Blagh: Me Blagh. Me need wheel to move boulder. Me will invent round wheel, make even easier.

Urg: Use Urg wheel. Only $19.95 + shipping and handling.

Blagh: Round wheel has no bumps.

Urg: Why reinvent wheel?

--
Do not look directly into laser with remaining good eye.
[ Parent ]

close, no cigar (none / 1) (#66)
by danharan on Sun May 02, 2004 at 08:35:45 AM EST

Lucene is the round wheel in this case. ht://Dig should also be considered.

Both FREE, both WORK, both PERFORM WELL.

But if you think they're square wheels, why don't you go out and code your own, or help localroger?

[ Parent ]

Round wheels. (none / 2) (#67)
by NoMoreNicksLeft on Sun May 02, 2004 at 09:09:49 AM EST

I'm not necessarily in favor of localroger's method. I just get sick of "why re-invent the wheel" as a cliche, especially when people use it incorrectly.

You say that when someone wants to redo the *same thing*, apparently for the hell of it. You don't say that when they believe they have a superior method (which already answers the why). Whether it is superior or not doesn't really matter, if it's not, just laugh at them afterwards.

--
Do not look directly into laser with remaining good eye.
[ Parent ]

This is a case of reinventing the wheel (none / 1) (#117)
by duffbeer703 on Mon May 03, 2004 at 12:06:34 AM EST

Search for message boards has been implemented in dozens, if not thousands of websites on the internet.

[ Parent ]
Yes... (none / 0) (#168)
by Polverone on Mon May 03, 2004 at 06:26:25 PM EST

But a lot of their search functions are lousy, either being too dumb or too slow or sometimes both. What's the good "wheel" (message board search) that should be grafted onto kuro5hin with a minimal amount of re-implementation?
--
It's not a just, good idea; it's the law.
[ Parent ]
here (none / 0) (#171)
by damiam on Mon May 03, 2004 at 07:30:01 PM EST

the good wheel

[ Parent ]
A smart admin (none / 0) (#172)
by duffbeer703 on Mon May 03, 2004 at 07:44:25 PM EST

Would create static links to old stories and let Google spider it.

[ Parent ]
whoa (none / 0) (#229)
by danharan on Tue May 04, 2004 at 06:38:19 PM EST

Lazy and effective. Excellent!!

[ Parent ]
You know (2.42 / 14) (#42)
by Dr Phil on Sat May 01, 2004 at 08:40:53 PM EST

I heard some people are so good that they can actually write assembly that moves faster than the speed of light.

*** ATTENTION *** Rusty has disabled my account for anti-Jewish views. What a fucking hypocrite.
Even better.. (none / 0) (#221)
by artemb on Tue May 04, 2004 at 03:19:20 PM EST

.. I've heard one guy coded an infinite loop in asm that finishes in just under 3 seconds.

[ Parent ]
Uhm. (2.00 / 8) (#43)
by SIGNOR SPAGHETTI on Sat May 01, 2004 at 09:26:51 PM EST

You can't write better assembly language than a good C compiler. Unbelievably, the 8086 is no more. Aside from that, I'm down with flat files; they fit good on magnetic tape.

--
Stop dreaming and finish your spaghetti.

Ironically... (1.80 / 5) (#46)
by NoMoreNicksLeft on Sat May 01, 2004 at 10:20:23 PM EST

People who are poor assembly programmers often claim that no one can outperform a good C compiler.

Hmmm.

--
Do not look directly into laser with remaining good eye.
[ Parent ]

Yeah. (2.30 / 10) (#48)
by SIGNOR SPAGHETTI on Sat May 01, 2004 at 10:24:32 PM EST

It's like rain on your wedding day.

--
Stop dreaming and finish your spaghetti.
[ Parent ]

Do you have magical SPAGHETTI powers? (none / 2) (#69)
by NoMoreNicksLeft on Sun May 02, 2004 at 09:11:51 AM EST

How did you know it rained on my wedding day? I'm contemplating suing rusty for violating privacy guidelines....

--
Do not look directly into laser with remaining good eye.
[ Parent ]
Let's put it in terms you can understand: (none / 1) (#86)
by warrax on Sun May 02, 2004 at 04:47:03 PM EST

Most people can't write code that can outperform a good C compiler.

Why? Because, most likely, the compiler knows more about the architecture than the programmer does. Compilers are also usually much better at figuring out instruction orders which will avoid pipeline stalls, and all that sort of stuff.

Having said that, C is not the easiest language to optimize; something like FORTRAN (shudder) is much easier to optimize well and (on numerical applications, at least) will almost always net you a substantial performance increase over C.

But writing assembly for anything you haven't first tried to implement in at least one mid-level language and shown to be too slow is just... stupidity.

-- "Guns don't kill people. I kill people."
[ Parent ]

Fortran vs. C (none / 0) (#142)
by joib on Mon May 03, 2004 at 10:06:15 AM EST

Having said that, C is not the easiest language to optimize; something like FORTRAN (shudder) is much easier to optimize well and (on numerical applications, at least) will almost always net you a substantial performance increase over C.

Are you sure of this? Your claim was surely valid in the good ol' days, but today? Now, I know almost nothing about compiler tech, but consider:

  • FORTRAN is such a small market that there exist very few (if any?) compilers built only for Fortran; rather, Fortran compilers usually share backends with the C/C++ compiler from the same vendor. This limits the amount of Fortran-specific optimization the compiler is able to do.
  • Fortran has gotten slower. Most new Fortran development is done with Fortran 90+, which has lots of nice new features. Dynamic memory allocation and pointers (albeit somewhat limited compared to C pointers) make the life of the programmer easier, but at the same time they remove much of the reason for Fortran's performance benefit compared to C.
  • C has gotten faster. While pointer analysis apparently is an NP-hard problem, compilers have surely improved too, if nothing else through better heuristics. Additionally, C99 defines the "restrict" keyword for pointers that are not aliased (apparently C99 restricted pointers are very similar to the pointers found in Fortran 90), but I don't know if any current compilers use this information to optimize the code (a sketch of what that looks like follows below).
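For what it's worth, here is roughly what that keyword looks like in use. This is an invented illustration (the function is made up, and whether a given compiler actually exploits the qualifier is exactly the open question above), assuming a C99 compiler:

    #include <stddef.h>

    /* With "restrict" the compiler may assume dst, a and b never overlap,
       so it can keep values in registers and vectorize the loop. Drop the
       qualifiers and it has to assume any store through dst might have
       changed *a or *b, and reload them. */
    void axpy(size_t n,
              double * restrict dst,
              const double * restrict a,
              const double * restrict b)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = 2.0 * a[i] + b[i];
    }

This is the same no-aliasing promise a Fortran compiler gets implicitly for dummy arguments, which is the point made elsewhere in this thread.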



[ Parent ]
FORTRAN is more optimizable than C (none / 0) (#149)
by stalker on Mon May 03, 2004 at 12:29:39 PM EST

Two words: memory aliasing. C has it, FORTRAN doesn't. FORTRAN is more limited in expression than C (albeit Turing-equivalent, it has a very narrow domain of application). Some constructs, such as pointers, can make the life of an optimizing compiler hard, while FORTRAN, with its relatively poor repertoire of constructs, invites the programmer to be far more explicit about usage patterns and data structures and allows the compiler to be much smarter about high- and mid-level optimization. Regarding the old diatribe "hand-coded assembler is faster than compiler-generated output", I'd say: code a 1024-point FFT without pipeline bubbles on a superscalar and I'll believe you.

[ Parent ]
stalker is right... (none / 0) (#207)
by warrax on Tue May 04, 2004 at 07:41:57 AM EST

The primary reason is pointer aliasing, but the freedom compilers have to adapt the memory layout to be cache-optimal also has huge significance for numerical applications.
FORTRAN is such a small market that...
No, it isn't, actually. Almost everyone who runs serious numerical computations on clusters uses FORTRAN to some degree. We're talking about workloads that take so much processing power that a saving of even 5% in computation time saves literally weeks of expensive calendar time. These people spend millions and millions (if not billions) of dollars on hardware anyway, so they're willing to pay lots and lots for good compilers (because, in the end, it saves them money).
Fortran has gotten slower. Most new Fortran development is done with Fortran 90+, which has lots of nice new features. Dynamic memory allocation and pointers (albeit somewhat limited compared to C pointers) make the life of the programmer easier, but at the same time they remove much of the reason for Fortran's performance benefit compared to C.
No, it hasn't gotten slower. If anything it's gotten faster. People who care about performance don't use any of the new (slow) features of F90. Meanwhile optimization technology has gotten better and the (already fast) "old" subset of F90 runs even faster. There is no "interference" between features in F77 and F90 compilers; you can feed F77 source to a compiler and it will use a special compiler specifically for F77 for that source.
C has gotten faster. [...], but I don't know if any current compilers use this information to optimize the code.
I don't know how well (if at all) the various C compilers handle restricted pointers, so I can't comment on that. However...

It may be the case that C is getting faster, but FORTRAN has had the benefit of literally decades of some of the brightest compiler writers working on writing optimizing compilers for it. The language itself has evolved extremely slowly, so the compiler writers haven't had to play "language feature" catch-up with each other for years and years. This has given it a huge head start, and C is probably still a few years (at least 5 I'd say) from catching up.

Personally, I think higher-level languages (which FORTRAN is) might have a better chance of catching up, simply because optimization can be done both at a higher level and also at the low level. IOW: you can optimize the hell out of all the low-level stuff because you know the programmer isn't going to be able to do anything which can interfere with your optimization, but you can also do high-level stuff like prefetching columns/rows from a matrix because you know you'll need the data shortly.

I hate FORTRAN just as much as the next guy, but the reality is that nothing currently comes close in raw numerical performance, and that matters when CPU time (and wall-clock time) costs as much as it does on huge clusters.

-- "Guns don't kill people. I kill people."
[ Parent ]

Uh oh, the Fortran vs. C flamefest.. (none / 0) (#213)
by joib on Tue May 04, 2004 at 10:46:53 AM EST

...but the freedom the compilers have in adapting the memory layout to be cache optimal also has huge significance for numerical applications.

Uh, what do you mean by this? In what way can a Fortran compiler adapt the memory layout that a C compiler cannot?

No, it isn't, actually.

Yes it is. If you wanted absolutely maximum performance you would build a compiler for one language and one CPU architecture. OTOH, consider the architecture of gcc: there are lots of frontends which all compile the source code into an intermediate language, common to all CPUs and languages. Then you have a set of backends which compile the intermediate language down to machine code for the CPU in question. The intermediate language brings portability, but it also reduces the possibilities for optimization, since high-level information about the source code structure is lost in the translation to the intermediate language.

Now consider the commercial compilers. While they usually have higher performance than gcc, they are also (almost?) all multi-language and multi-CPU-architecture compilers, in practice meaning that their internal architecture is somewhat similar to gcc's. E.g. the Intel compiler supports C, C++, and Fortran 77/90/95 on x86 and IA-64; IBM's XL compilers support C, C++ and Fortran 77/90/95 on their POWER family of CPUs. Same goes for Portland, NAG and Absoft.

There simply isn't enough money in developing a single-architecture Fortran-only compiler to wring the last few % of performance compared to the multi-lang/multi-arch compilers.

For another example, consider that today, 14 years after the Fortran 90 standard was published, the GNU project still hasn't released a free F90 compiler. By comparison, the fairly recent C99 is already fairly well supported, and support for the enormously complicated C++98 standard is in very good shape. That says a lot about the size of the user bases (or perhaps, to be more precise, the size of the user bases capable of and willing to implement a compiler).

No, it hasn't gotten slower. If anything it's gotten faster. People who care about performance don't use any of the new (slow) features of F90.

Well, duh. That's like saying that C is faster if you don't use pointers or malloc(). People moved from F77 to F90 because they wanted to use the new features. Or perhaps many users already used these features in the form of vendor-specific F77 extensions and they wanted to get out of that particular quagmire.

There is no "interference" between features in F77 and F90 compilers; you can feed F77 source to a compiler and it will use a special compiler specifically for F77 for that source.

Yes, as I said earlier most compilers these days support multiple languages and multiple architectures. Linking object files compiled with the same compiler family is usually quite straightforward (with the exception of C++, with its name mangling and stuff).

I don't know how well (if at all) the various C compilers handle restricted pointers, so I can't comment on that.

I'd guess that if the compiler backends are able to use the fact that fortran references are not aliased, it would be quite straightforward to add a feature to the C frontend to mark restricted pointers in the same way. But as we both said, we don't know whether any compiler does that yet.

It may be the case that C is getting faster, but FORTRAN has had the benefit of literally decades of some of the brightest compiler writers working on writing optimizing compilers for it.

Considering that the smartest compiler gurus probably work for the big players, which all produce compiler backends shared between C, C++ and Fortran, does it matter?

I hate FORTRAN just as much as the next guy

Hey, I'm not saying I hate Fortran. In fact, currently I get my bread and butter partly from running an ab initio code, written in Fortran90, on #91 on the top500 list.

I'm just challenging the conventional wisdom that Fortran is considerably faster than C.

[ Parent ]

You are talking about different users... (none / 0) (#222)
by warrax on Tue May 04, 2004 at 03:22:19 PM EST

Well, duh. That's like saying that C is faster if you don't use pointers or malloc().
It is. It's just not terribly useful, while F77 is useful for HPC. It's an utterly crappy language, but it does so much more than C without pointers, so people continue to use it.
People moved from F77 to F90 because they wanted to use the new features.
Some people... Some people just want the performance and will suffer the horrors of not using "expensive" F90 features to get it (i.e. either limiting themselves to a subset or just using F77).
There simply isn't enough money in developing a single-architecture Fortran-only compiler to wring the last few % of performance compared to the multi-lang/multi-arch compilers.
You don't seem to realize just how much money there is (and has traditionally been). Granted, there is less nowadays than there used to be (relatively speaking), but there is still a lot of money in it. (Because companies doing HPC absolutely require the highest possible performance).
Considering that the smartest compiler gurus probably work for the big players, which all produce compiler backends shared between C, C++ and Fortran, does it matter?
If you think all optimization is performed at the backend, you are sadly mistaken.
I'm just challenging the conventional wisdom that Fortran is considerably faster than C.
It's conventional wisdom for a reason. Although... the Blitz++ library has made great strides in the area of numerical computation, I don't think they're quite there yet.

-- "Guns don't kill people. I kill people."
[ Parent ]
Hrm.. (none / 0) (#228)
by joib on Tue May 04, 2004 at 06:01:57 PM EST

>Well, duh. That's like saying that C is faster if you don't use pointers or malloc().

It is. It's just not terribly useful, ...

Up to a point, yes. The moment you start passing around large structs by value instead of using a pointer, your performance will of course drop. Fortran naturally gets around this problem by always using pass-by-reference semantics.
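To make the by-value cost concrete, here is a throwaway illustration (the struct and the names are invented for the example, nothing more): passing the struct itself copies the whole thing on every call, while passing a pointer costs one machine word, which is roughly the pass-by-reference behaviour Fortran gives you by default.

    #include <stdio.h>

    struct sample {
        double readings[1024];          /* 8 KB copied on every by-value call */
    };

    /* by value: the caller copies all 8 KB onto the stack */
    double sum_by_value(struct sample s) {
        double total = 0.0;
        for (int i = 0; i < 1024; i++)
            total += s.readings[i];
        return total;
    }

    /* by reference: only a pointer is passed */
    double sum_by_ref(const struct sample *s) {
        double total = 0.0;
        for (int i = 0; i < 1024; i++)
            total += s->readings[i];
        return total;
    }

    int main(void) {
        static struct sample s;         /* zero-initialized */
        printf("%f %f\n", sum_by_value(s), sum_by_ref(&s));
        return 0;
    }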

F77 is useful for HPC. It's an utterly crappy language, but it does so much more than C without pointers, so people continue to use it.

Well, if you don't use malloc or pointers, except for restricted pointers used to pass arguments to functions, you have something whose semantics are quite close to F77. Given equivalent compilers (i.e. ones that can optimize restricted pointers), such a C program should perform more or less as fast as the F77 one.

My point with that comment was more along the lines that yes, malloc() and pointers are in principle slower than the F77 approach, but still people use them because they are very useful and they don't reduce performance _that_ much. When used appropriately of course; calling malloc() inside an inner loop is almost certainly a quite stupid idea.

Some people... Some people just want the performance and will suffer the horrors of not using "expensive" F90 features to get it (i.e. either limiting themselves to a subset or just using F77).

Hmm, F77 is basically "good enough" if your problem can be broken down into operations on dense arrays. If that doesn't work, you really start wanting structs (types in F90), pointers and dynamic memory allocation, allowing you to build more advanced data structures. With F77, you're in for a world of pain in a situation like that.

In the same way that Fortran or C in practice increases performance due to the higher abstraction level (the typical HPC programmer is a scientist whose knowledge of hardware architecture is limited), more advanced data structures possible with F90 and C can allow improved performance because one can use smarter algorithms.

If you think all optimization is performed at the backend, you are sadly mistaken.

Well, no, I don't think that. As I said in a previous post, I don't know much about compiler technology, but I don't think it's too far-fetched to assume that while the use of an intermediate language allows a multi-lang/multi-arch compiler (i.e. bigger potential market), it can also cause some "least common denominator" effect.

It's conventional wisdom for a reason.

My point in my original post was that this conventional wisdom was conceived in the age of F77 and C89 (or even K & R C). Today, with F90 and C99 I don't think there is such a clear distinction performance-wise.

Although... the Blitz++ library has made great strides in the area of numerical computation, I don't think they're quite there yet.

I have to admit, I have somewhat reversed my opinion on the suitability of C++ for scientific programming. I used to think it was the coolest thing since sliced bread: combining the speed of C with high-level OO programming, it would bring world peace, etc. I actually experimented somewhat successfully with template metaprogramming myself.

The reason that I have changed my opinion basically centers around the fact that C++ is simply too complicated for any mortal to master.

  • For example, Herb Sutter's web site lists about a hundred common, more or less completely non-obvious ways in which C++ can bite your ass.
  • Due to the awesome number of features in C++, every programmer has their own conception of which features should be used. Some people can't stand exceptions, others think templates are a tool of the devil, etc. This creates problems in a team environment. Add to this various incompatibilities between compilers due to incomplete standard implementation (although most compilers are getting quite good nowadays), and you might end up with something like the Mozilla C++ coding guidelines: basically deny everything except classes.
  • High-performance C++ basically centers around heavy usage of templates. Templates are a language of their own, further adding to the considerable burden of learning C++. Additionally, templates produce very unintuitive error messages, and take forever to compile. Compile times of hours were reported by one POOMA user, IIRC (that same user reported that he was returning to F90, for much the same reasons I'm describing here).
  • Add to this the fact that most HPC programmers are not computer scientists, but rather experts in some other field. This is also partly why Fortran and C remain very popular; they are both comparatively simple languages to learn and use.
  • While C++ advocates like to mention the necessity of OOP for larger projects, there are still many very successful and very large C projects. For example, consider the Linux kernel (or most kernels, for that matter), GNOME, X. There have even been published reports claiming that the average defect density, as well as the effort to correct a defect, in a C++ program is 2-3 times higher than in a C program (search for Leslie Hutter on google). So it seems that C++ is no panacea.

I'm not saying that OOP is itself a bad idea, just that C++ is not a very good vehicle to do it with (despite having OO features lacking in C/Fortran). Personally, I favour the mixed language approach. E.g. write most of the application in a high level language such as Python (with very nice OO support), and write the number-crunching bits as C modules (or Fortran, if you prefer) which get called from the Python code.
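As a rough sketch of that mixed-language split (the file and function names here are invented, and ctypes is just one of several binding options), the inner loop goes into a small C module compiled to a shared library and the high-level language simply calls into it:

    /* crunch.c: build with something like
     *   cc -O2 -fPIC -shared -o libcrunch.so crunch.c
     *
     * and call it from Python roughly like this:
     *   import ctypes
     *   lib = ctypes.CDLL("./libcrunch.so")
     *   lib.dot.restype  = ctypes.c_double
     *   lib.dot.argtypes = [ctypes.POINTER(ctypes.c_double),
     *                       ctypes.POINTER(ctypes.c_double),
     *                       ctypes.c_long]
     */
    double dot(const double *a, const double *b, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++)     /* the hot loop stays in C */
            s += a[i] * b[i];
        return s;
    }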

Well, anyway, this post is getting long-winded and I want to go to bed, so I'll just cut it here for now.

[ Parent ]

I can. (none / 1) (#70)
by BridgeHugger on Sun May 02, 2004 at 11:00:07 AM EST

Most proficient assembler programmers can.  The reason?  Because the programmer can still use an optimiser.

[ Parent ]
Au contraire (none / 1) (#122)
by ph317 on Mon May 03, 2004 at 01:29:39 AM EST


You can always hand-code better assembler than the compiler can.  The question is usually whether it's worth the effort, and the answer is usually no.  But an intelligent assembly language programmer who fully understands the architecture he's using, the code being optimized, and the underlying computer science behind the code he's optimizing, can almost always improve on C compiler output for any algorithm sufficiently complex that it would warrant the attention.

The first step in not wasting a lot of time on optimization is to profile.  If you profile your application in typical usage patterns, you tend to find that a very small percentage of your code takes 90% of the cpu cycles in your app, and you only target those parts for optimization.  But beyond even that, most people will find that even when they can hand-code some small code fragment somewhere and speed it up by say 15-20%, it still isn't worth it.  The optimization isn't portable, you have to maintain that crap, it costs a lot more trouble and time, and virtually nobody's going to notice that 15%.  This is why virtually nobody hand-optimizes in assembler anymore.
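As an aside, that workflow is cheap to try. A minimal sketch, assuming gcc-style tooling (the program and its numbers are invented for illustration; the point is only that the profile, not intuition, names the hotspot):

    /* hotspot.c: build and profile roughly like this:
     *   cc -O2 -pg hotspot.c -o hotspot
     *   ./hotspot              (writes gmon.out)
     *   gprof hotspot gmon.out | head
     */
    #include <stdio.h>

    /* the kind of routine the profile usually points at */
    static double hot_inner_loop(int n) {
        double acc = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                acc += 1.0 / (double)(i + j);
        return acc;
    }

    /* everything else (setup, I/O, glue) is rarely worth hand-tuning */
    int main(void) {
        printf("%f\n", hot_inner_loop(3000));
        return 0;
    }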

There are notable exceptions though.  For one, pieces of software whose whole purpose in life is to run intense mathematical algorithms over data, especially when the code in question is a shared library used by many other apps for said cpu-intense operations.  Examples of this would be mp3 encoder/decoder code, or encryption/decryption code.  Another exception would be kernel work, hence the smatterings of expertly hand-tuned assembler you'll find in the Linux kernel.

[ Parent ]

OK I guess everyone should have a hobby. But. (none / 1) (#146)
by SIGNOR SPAGHETTI on Mon May 03, 2004 at 11:44:19 AM EST

Knowing the architecture very well doesn't necessarily mean that you can always, in practice, optimize a non-trivial sequence of instructions for it in your head. It merely means that you could write a good optimizing compiler for it. I'm sure you can write a fast search engine entirely in assembler for the 8086. That would be trivial. But when you start working with complex logic like the SPARC or Intel's "hyperthreaded" processors, you will find that, most of the time, for any reasonable expenditure of effort, you can knock off an even faster search engine in C and have lots of time left over in which to extract cube roots entirely in your head.

--
Stop dreaming and finish your spaghetti.
[ Parent ]

Hmm. Not sure this makes too much sense. (2.66 / 6) (#44)
by xC0000005 on Sat May 01, 2004 at 09:30:27 PM EST

I appreciate the value of something that would work.  However, let me point out a few things that, as a person who makes a living programming, stood out to me.

#1.  Don't re-invent.  There are already systems that do full text indexing, written by people who have specialized in it. Your flat file idea is interesting, but I would personally shun it because there's not too much point. (See other comments on this)

#2.  Regarding that search index tree, again as someone who has worked on one such thing, I have to say it is the best tool for the job. But why would you maintain the complete index in memory? (More on the idea of killing the disk later.) Garbage collection is only an issue if you are allowing deletes.  Yes, Kuro5hin does allow deletes, but I'll bet they are the edge case, not the norm.  Now you need some way to maintain the integrity of your tree, hash/checksum/something.  And probably an LRU/cache plan.  Does this sound familiar?  It should, because it's what already exists in a number of search implementations.  And as for killing the disk along the way, the solutions are again as simple or complex as you make them.  Simple: use a filesystem with heavy read/write caching (or a SAN, preferably one of the ones that likes to play fun tricks with the IO queue).  The downside is you are tied to heavy metal hardware or a particular FS.  Complex is writing the buffer manager yourself.  (I do not recommend this.)

Finally, the optimal approach is actually a mixture.  You have the index tree for lightning fast searches with a slow insertion time.  You have a midlevel tagged query.  You have the last level (on the newest posts) where x% of resources can be consumed doing a slow crawl.  

I think leveraging an existing search engine like Google is a better idea.

Voice of the Hive - Beekeeping and Bees for those who don't

I encourage this... (1.77 / 9) (#55)
by JahToasted on Sat May 01, 2004 at 11:11:04 PM EST

The biggest problem in Computer Science today is that we've forgotten the old KISS (Keep It Simple, Stupid) rule. Let's face it, we don't need all of the features of the (IMHO) bloated MySQL. Yeah, if you're trying to set up a mission-critical database server and need things like transactions, views, and inline functions, then maybe MySQL is the right choice for you. But for the rest of us, a flat file with an assembly parsing programme is the way to go. And a good programmer could write the assembler programme in such a way as to be portable across multiple architectures, something that MySQL can't really claim.

+1FP
______
"I wanna have my kicks before the whole shithouse goes up in flames" -- Jim Morrison

Can't tell if you're being sarcastic (none / 1) (#71)
by CaptainSuperBoy on Sun May 02, 2004 at 11:36:54 AM EST

If you are, bravo. If you actually think MySQL is bloated...

--
jimmysquid.com - I take pictures.
[ Parent ]
Well it depends (none / 1) (#72)
by JahToasted on Sun May 02, 2004 at 12:46:28 PM EST

If you're making a database for say an airline or something, then I guess you probably need most of the features that MySQL provides. But for a website it is extreme overkill.
______
"I wanna have my kicks before the whole shithouse goes up in flames" -- Jim Morrison
[ Parent ]
Be serious (none / 0) (#82)
by CaptainSuperBoy on Sun May 02, 2004 at 04:20:18 PM EST

This is not the time for your jests.

--
jimmysquid.com - I take pictures.
[ Parent ]
You forget Microsoft Access (none / 0) (#87)
by QuickFox on Sun May 02, 2004 at 04:55:53 PM EST

Microsoft Access keeps the entire database in a single file. Perfect for airlines.

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi
[ Parent ]
Microsoft Access (none / 0) (#96)
by localroger on Sun May 02, 2004 at 06:02:58 PM EST

...is what actually inspired the article at hand. I actually have experience with it. If all RDBMS's were like that, well, I'd like to think nobody really would use them. Except that an awful lot of people do use Access, including the nimrods who wrote my company's accounting system.

As for other RDBMS's, well, I don't use them at all, and I've seen every one of them called a piece of shit by somebody who couldn't get it to do what he wanted, whereas I have never had a filesystem fail to do what I expected it to do. So there.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]

Access (none / 0) (#108)
by CaptainSuperBoy on Sun May 02, 2004 at 08:44:34 PM EST

Access is a fine database engine and UI tool for very small databases, I'd say 10 users and a few hundred megs max. There are a lot of nimrods designing Access apps though, and I can't vouch for the quality of the apps - but the DB engine is solid in capable hands. There's zero reason to run a website off of Access though, when MSDE is free.

--
jimmysquid.com - I take pictures.
[ Parent ]
MSDE throttles down for >5 concurrent users (none / 0) (#147)
by rpresser on Mon May 03, 2004 at 12:09:07 PM EST

If you want to be able to handle more than 5 concurrent users, you'll need something different.  Furthermore, I don't think the MSDE license allows you to legally use it to back a public website.
------------
"In terms of both hyperbolic overreaching and eventual wrongness, the Permanent [Republican] Majority has set a new, and truly difficult to beat, standard." --rusty
[ Parent ]
Yes (none / 0) (#155)
by CaptainSuperBoy on Mon May 03, 2004 at 02:17:42 PM EST

MSDE throttles at >5 concurrent queries, not users. If your queries are simple it's conceivable to serve hundreds of users at once. Also they do allow you to use it for websites, it's actually what they recommend.

They recommend upgrading at 25 concurrent users; I'd say that number is higher, but nevertheless if you have that many users on your website at one time you might want to look into a full SQL Server.

--
jimmysquid.com - I take pictures.
[ Parent ]

thanks for the correction n/t (none / 0) (#158)
by rpresser on Mon May 03, 2004 at 03:08:59 PM EST


------------
"In terms of both hyperbolic overreaching and eventual wrongness, the Permanent [Republican] Majority has set a new, and truly difficult to beat, standard." --rusty
[ Parent ]
Ahaaaa! We've struck the motherlode!! (none / 1) (#130)
by toulouse on Mon May 03, 2004 at 07:54:24 AM EST

You've just revealed what has led to the incessant spamming of this subject in the last few days.

If this is the root of your confusion:

Microsoft Access is what actually inspired the article at hand. I actually have experience with it. If all RDBMS's were like that, well, I'd like to think nobody really would use them.

then I can see why you're totally out of whack on everything else.

Microsoft Access is ok for small, internal, single-business applications. It's a basic all-rounder RDBMS designed for immediacy, integration with other 'Office' products and ease-of-use. However, nobody, but nobody, should be using it for any heavyweight purposes whatsoever. The reason Oracle, PostgreSQL, MySQL and other *hardcore* DBMSs exist is because frameworks like Access are nowhere near good enough.

If this is the source of your misperception you should desist immediately. Your argument is akin to saying "Personal Web Server sucks and therefore Apache sucks too", or similarly, "Photoshop is crap because Paint sucks donkey balls". If Access is the only DBMS you have experience of then you are in no way qualified to talk on the subject. What you're actually proposing is implementing a heavyweight RDBMS yourself rather than using something else: there's no point - it's been done.

I can see how you could have come to this conclusion. Many businesses use Access because it's cheap, convenient and easy to learn; but they often deploy it for completely inappropriate purposes (Like you, I've had to deal with this myself and couldn't agree more). Access isn't even Microsoft's main database server product; SQL Server is.

What you're doing is looking at Macaulay Culkin's performance as a boxer and then proclaiming that, because of this, Joe Frazier must have been shit too.

If all RDBMS's were like that...

They're not. In fact, damn near none of them are. HTH. Yeah, yeah...IHBT...


--
'My god...it's full of blogs.' - ktakki
--


[ Parent ]
Heh (none / 0) (#163)
by QuickFox on Mon May 03, 2004 at 04:25:04 PM EST

Microsoft Access is suitable for things like a personal list of friends, with names, addresses, phone numbers and a few other details. It's suitable for short tables that are accessed just a few times a day. A tiny company (and I mean tiny) might use it for their customer list. They might even use it for their product list, if they don't have too many products.

That was the intent from the very beginning, Microsoft Access was never intended for anything heavier.

I know, some people use it for far more demanding tasks. But they shouldn't. They really, really shouldn't.

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi
[ Parent ]

LOOOOOOOOOOOOL [nt] (none / 2) (#84)
by warrax on Sun May 02, 2004 at 04:35:48 PM EST



-- "Guns don't kill people. I kill people."
[ Parent ]
With /that/ kind of talent... (none / 2) (#90)
by QuickFox on Sun May 02, 2004 at 05:09:59 PM EST

And a good programmer could write the assembler programme in such a way as to be portable across multiple architectures

If your programmer has that kind of talent, then don't have him waste his time on ordinary search algorithms. Just tell him to make his program wait faster. Simply tell him to optimize his assembly program's waiting periods so tightly that the entire five-minute wait for the hard disk takes only five microseconds.

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi
[ Parent ]

I call bullshit (2.72 / 11) (#57)
by skyknight on Sun May 02, 2004 at 02:52:16 AM EST

I have argued in the past, loudly and at great length, that modern computing suffers because lazy programmers use powerful hardware as a crutch instead of learning how to do things right.

This is an archaic mentality. You make it sound like writing in anything other than assembly is laziness. Yes, there are lazy programmers who do stupid shit and rely on blazing fast hardware to mask their incompetence. No, not everyone who uses high level languages is a dolt. You may have heard of a guy by the name of Donald Knuth. He asserts that "premature optimization is the root of all evil." Writing code in assembly in the 21st century is almost always premature optimization.

You seem to harbor a righteous sentiment that the ultimate metric for goodness of code is speed of execution. That mentality was correct in the past when machines were ridiculously expensive such as the Cray you mentioned. Today, there are much better things for which to optimize, e.g. discoverability, ease of debugging, portability, and rapidity of development.

A smart engineer determines the bottlenecks on a project and works to reduce them in a way that optimizes the translation of human labor to system value. Worshiping arbitrary and Puritanical metrics may make you feel good, but it is a waste of time.



It's not much fun at the top. I envy the common people, their hearty meals and Bruce Springsteen and voting. --SIGNOR SPAGHETTI
Bravo (none / 2) (#115)
by duffbeer703 on Sun May 02, 2004 at 11:57:58 PM EST

"premature optimization is the root of all evil."

I don't know how many times I've seen some dolt "optimize" a system without understanding what he was doing.

[ Parent ]

Bad attribution (none / 0) (#153)
by rustv on Mon May 03, 2004 at 01:37:15 PM EST

Your quote did not come from the Don Knuth, Professor Emeritus of The Art of Computer Programming at Stanford University.  It came from Professor Sir Charles Anthony Richard Hoare, Microsoft Flunkie.

____
"Don't tase me, bro." --Andrew Meyer
[ Parent ]
Ha-ha! (none / 0) (#165)
by regeya on Mon May 03, 2004 at 05:22:19 PM EST

Your appeals to authority have no affect on me! BWAHAHAHAHA!!!!

[ yokelpunk | kuro5hin diary ]
[ Parent ]

ZOUNDS!!! (none / 1) (#166)
by regeya on Mon May 03, 2004 at 05:23:23 PM EST

I fell victim to a simple English error! Your foolish argument seems to have had an effect on me! Damnation!!!!

[ yokelpunk | kuro5hin diary ]
[ Parent ]

5 million comments? (1.42 / 7) (#59)
by omghax on Sun May 02, 2004 at 03:50:58 AM EST

so K5 has been around for, what, 4 years? 4 * 365 = 1460. And 5,000,000 / 1460 ≈ 3,400 comments PER DAY!?!?! Hell no (and remember, this is talking about the early years, when K5 saw minimal traffic).

YOU STUPID FUCKING IMBECILE. Look at http://www.kuro5hin.org/pages/stats and see how many hits per day there are. And then find the ratio between the number of posts per day (3,400) versus the number of page hits.

You, sir, are necessarily a moron.

I put the "LOL" in phiLOLigcal leadership - vote for OMGHAX for CMF president!

Yes, but no, and also, no (2.83 / 6) (#60)
by ZorbaTHut on Sun May 02, 2004 at 04:34:57 AM EST

Yes, what you've described would work.

However, it would be ludicrously slow, for no good reason. We don't expect our database software to be bug-ridden. It isn't. It works. It might not be utterly bugfree, but chances are, for the kind of simple thing you need on a full-text search, it will work flawlessly.

Second, it would be *ludicrously* slow. Linear-time. We can do better, and linear-time will bring even modern boxes to their knees on 7gb of data unless you want to spend thousands on RAM (totally unnecessarily).

So: We can do better, more easily, and without any expected major problems. Why exactly is yours even a good idea?

Third, incidentally, full-text searches suck. Honestly. We've known that for years. It's a horrible way of searching.

A full-text search, implemented through a DB, won't take much longer than what you've described here. It will likely be as bugless, or even more bugless, and it will be far faster. And it will still suck.

but wait! (none / 1) (#184)
by coderlemming on Mon May 03, 2004 at 11:41:13 PM EST

Don't forget, all it has to do is finish in 8 seconds ;)

His algo reminds me of a lot of solutions I've seen from you, yes, you Zorba, on Topcoder.  Then again, those data sets are closed, well-defined, and linear time doesn't matter for shit since you know the worst-case.

Folks, I've seen this guy's code, he knows his stuff.  If he says it's a bad idea, it's a bad idea.  Then again, I say it's a bad idea, too.


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)
[ Parent ]

Assembler Too Slow (2.40 / 5) (#61)
by OldCoder on Sun May 02, 2004 at 06:01:25 AM EST

The naive idea of
a really efficient parser to scan it.
was superseded by string-search algorithms decades ago. If you are going to make a huge text file, then rip off a fast algorithm coded in C and let it rip. It will easily beat your assembler-coded naive string search. If it is still too slow, translating it to hand-coded assembler will probably not help. Choose a different algorithm or buy a faster computer.
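For reference, this is the sort of thing meant by a fast string search. The sketch below is a simplified Boyer-Moore-Horspool, written here purely as an illustration rather than lifted from any particular library:

    #include <string.h>
    #include <stddef.h>

    /* Horspool: on a mismatch, skip ahead by up to the pattern length using
       a table built from the pattern. Most of the text is never examined
       byte by byte, which is why this beats a naive scan no matter how
       tight the naive inner loop is coded. */
    const char *horspool(const char *text, size_t tlen,
                         const char *pat,  size_t plen)
    {
        size_t skip[256];

        if (plen == 0 || plen > tlen)
            return NULL;

        for (size_t i = 0; i < 256; i++)
            skip[i] = plen;
        for (size_t i = 0; i + 1 < plen; i++)
            skip[(unsigned char)pat[i]] = plen - 1 - i;

        for (size_t pos = 0; pos + plen <= tlen;
             pos += skip[(unsigned char)text[pos + plen - 1]]) {
            if (memcmp(text + pos, pat, plen) == 0)
                return text + pos;
        }
        return NULL;
    }

(Worst case it is still linear, but on ordinary text and queries the average behaviour is sublinear.)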

Of course finding the string inside the huge text file isn't the same as finding out what article the phrase was used in. That's because lots of algorithms skip around in the text and will ignore article headings. Once you've found the string you might have to search backwards to find the header. And so on.

Compressing the text may enable you to keep it all in RAM, and people usually don't know exactly what they are looking for, so it may be faster and more practical to use one of the algorithms for approximate string searching of compressed text. However, be aware that these methods are still new and research is ongoing.

Given the problem statement it might be better to search uncompressed text for more than one pattern at a time using Multiple Approximate String Matching.

--
By reading this signature, you have agreed.
Copyright © 2003 OldCoder

OR... (2.00 / 4) (#62)
by Farq Q. Fenderson on Sun May 02, 2004 at 06:37:18 AM EST

You could use an RDBMS that has the option of fine-tuned indexing. Then, you could use it. MySQL's indexing is for people who don't care about how the indexing actually behaves.

I don't think you're going to get much out of writing a bunch of assembly for the purpose of doing string searches. Tight code is nice, but tight algorithms are where it's at. Besides, most worthwhile compilers can write better assembly than you can, and do.

farq will not be coming back

This problem was discussed recently (2.66 / 6) (#63)
by liquidcrystal3 on Sun May 02, 2004 at 06:54:57 AM EST

Building an open source search engine. At 7GB of searchable data, I'm pretty sure whatever is the best method for them is the best method here; a good search engine can search the entire web in a couple of seconds.

well written article but stupid idea: -1 (2.00 / 6) (#64)
by ant0n on Sun May 02, 2004 at 07:32:41 AM EST

Coding in assembly can be an interesting and rewarding activity in itself, and an understanding of assembly will improve your programming skills in any high-level language, too. But coding in assembly as a means of speeding up a program is a very bad idea. If you want to speed up a program, you should speed up the underlying algorithm, not the actual implementation. Your proposal sounds quite naive: you have a very slow algorithm (brute-force, linear text search in a huge, flat textfile) and you want to speed this up by using fast hardware and assembly in 'a tight loop'... that's the way script kiddies think. Your argumentation is along the lines of those morons who believe that every programming problem can be solved as soon as we have 7-gigahertz Athlons with 10 terabytes of RAM.


-- Does the shortest thing the tallest pyramid's support supports support anything green?
Patrick H. Winston, Artificial Intelligence
Aaarrgh (2.85 / 20) (#73)
by bugmaster on Sun May 02, 2004 at 01:11:18 PM EST

Good god man, not again !

I don't know how I can make this any clearer. Should I use sock puppets or something ? Look. You seem to have two main problems with the vector search and similar approaches:

  1. RDBMS is not an ideal solution to search
  2. Smart search algorithms are hard to write
I agree with #1; for pure search, an RDBMS doesn't really fit the bill. Note that the vector search proposal is not an RDBMS, but a separate engine. In the case of Scoop, however, the rest of the site is already built on an RDBMS (AFAIK), so it would avoid some code duplication to write search that way, too.

Your #2 reason is just stupid. Yes, search is hard to write -- but that's because it's a hard problem. Period. You write,

Simple code is code that is easily optimized and easily debugged. It also buys you short development time and low cost...
Ok. Here's a simple challenge for you, then: write a bubble-sort program in pure Assembly. Optimize the crap out of it. Now, write a quicksort program in C or Java (actually, they come with quicksort IIRC, not sure how good their implementations are). According to you, the simple bubble-sort will outperform the complex quicksort every time, right ? Well, this is something that's trivially easy to test: just keep increasing the data set in a loop, and graph the resulting time for both algorithms. One graph will look like O(n^2), the other one will look like O(N log N); I am going to let you figure out what that means.
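That experiment takes about five minutes to set up. A throwaway version of it (all names and sizes invented for the illustration), timing bubble sort against the C library's qsort() on growing arrays:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void bubble_sort(int *a, size_t n) {
        for (size_t i = 0; i + 1 < n; i++)
            for (size_t j = 0; j + 1 < n - i; j++)
                if (a[j] > a[j + 1]) {
                    int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t;
                }
    }

    static int cmp_int(const void *x, const void *y) {
        int a = *(const int *)x, b = *(const int *)y;
        return (a > b) - (a < b);
    }

    int main(void) {
        for (size_t n = 2000; n <= 32000; n *= 2) {
            int *a = malloc(n * sizeof *a);
            int *b = malloc(n * sizeof *b);
            for (size_t i = 0; i < n; i++)
                a[i] = b[i] = rand();

            clock_t t0 = clock();
            bubble_sort(a, n);               /* "simple", O(n^2) */
            clock_t t1 = clock();
            qsort(b, n, sizeof *b, cmp_int); /* "complex", O(n log n) */
            clock_t t2 = clock();

            printf("n=%6zu  bubble=%.3fs  qsort=%.3fs\n", n,
                   (double)(t1 - t0) / CLOCKS_PER_SEC,
                   (double)(t2 - t1) / CLOCKS_PER_SEC);
            free(a);
            free(b);
        }
        return 0;
    }

One curve blows up quadratically as n doubles, the other barely moves; no amount of assembly tuning on the first closes that gap.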

The problem with your approach is not that you can't code (I'm sure you can) or that I'm employed by some evil commercial-code corporation (I'm not). The problem is that it's just so damn stupid. Your approach is a linear search through records on disk; this ensures that your search is only as fast as the disk. Disks are slow, localroger. Very, very slow. Even RAID arrays. Your search is O(N) on disk with a very small constant factor. Every other search engine is something like O(log N), with a large constant factor. Pull out your trusty TI-85 to see what I mean.

When you need something to take less time, the tradeoff is usually to have it take more space. For example, this is how caching works. RDBMS, vector search, other search engines... They all utilize this approach: they create some alternate data structure that takes space, and use it to save some (or, actually a lot of) time. Is this more complex ? Yes. But it's essential. Yes, a bicycle is much simpler than a car -- but you wouldn't go from San Francisco to Toronto with just a bicycle. It's simply too slow. We are talking about seven gigabytes of text here, man. Modern drives have the throughput (and that's without seek time, assuming only one user, etc.) of about 30-60 MB/s (I admit, I just pulled that number off of Google). That's on the order of 200 seconds to scan the entire file, assuming there's just one k5 user doing it.

Your statements on speed ("a thousand concurrent worst-case searches will also take 60 seconds", "The technology to bring the worst-case search under 10 seconds is already there for a reasonable cost", "and within a shorter time than K5 has already existed I'd guess that 1 second will be achievable") tell me that you are not only ignorant of big-O notation, but of UI design as well. 60 seconds per search is not just a bit slow; it's unacceptable. 10 seconds is unacceptable. 1 second would be barely acceptable -- but I want comment search to work today, not in some bright future when hard drive manufacturers will do your thinking for you. You see, if any operation takes more than 50 ms (again, I pulled this off of Google, but I've seen similar figures before), users assume that the application has locked up (or that k5 is timing out). This includes drduck, duxup and rusty. This includes me and you, yes, you. We human beings are just wired that way. No one will sit there waiting for your best-case-scenario 60s search to terminate (especially since it's more like 200s in reality). Acceptable latency for user interaction is measured in milliseconds.

It sucks, but it looks like you might actually have to think about things before launching your search engine project -- as opposed to coding linear search in Assembly right away. Life is hard.
>|<*:=

yeah (none / 0) (#135)
by speek on Mon May 03, 2004 at 09:00:36 AM EST

100ms after I click a link, I'm, like, halfway through a monkey spank.

--
al queda is kicking themsleves for not knowing about the levees
[ Parent ]

You should see a doctor about that. [nt] (none / 0) (#186)
by Entendre Entendre on Mon May 03, 2004 at 11:58:06 PM EST


--
Reduce firearm violence: aim carefully.
[ Parent ]

+1FP, but (none / 2) (#77)
by nlscb on Sun May 02, 2004 at 03:42:31 PM EST

Careful with A Modest ... Proposal. Only you localroger can get away with that cliche. It should be retired.

Seriously, great article as usual.

Comment Search has returned - Like a beaten wife, I am pathetically grateful. - mr strange

Well... (none / 1) (#83)
by localroger on Sun May 02, 2004 at 04:32:24 PM EST

...after the number of people who didn't get Yet Another Effort I figured I needed, as one of the characters says in the sequel to my novel, a much bigger tire iron.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]
Effort ? (none / 0) (#120)
by bugmaster on Mon May 03, 2004 at 12:37:10 AM EST

Where is that Yet Another Effort thing ? I don't recall seeing it before.

Anyway, I did take this "modest proposal" article at face value despite the title -- you were just so earnest about it all. IHBT.
>|<*:=
[ Parent ]

Yet Another Effort (none / 0) (#132)
by localroger on Mon May 03, 2004 at 08:18:29 AM EST

here. Did for Libertarianism what this article does for computer science.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]
Meh (none / 0) (#167)
by bugmaster on Mon May 03, 2004 at 05:36:07 PM EST

I like this CS article better. The YAE article is just too ridiculous (sort of like the original A Modest Proposal), and it looks like few people took it at face value. Anyway, I'm looking forward to the next article in the series :-)
>|<*:=
[ Parent ]
The Adequacy Style Troll (AST): A Brief Refresher (none / 0) (#201)
by JonathanJ on Tue May 04, 2004 at 04:48:01 AM EST

I thought that it was fairly audacious of localroger to try it on, after his previous article - The Adequacy Style Troll (AST): A Brief Refresher

Then again, maybe that is the point.


** JJ **
[ Parent ]
How does a novice begin w/Assembly? (none / 1) (#78)
by nlscb on Sun May 02, 2004 at 03:48:23 PM EST

dear localroger, your complaints on programmers not knowing assembly are long and many. fine. how does an aspiring programmer correct this horrible oversight in our programming education system? i would currently describe myself as a beginning intermediate in my skills. what do you suggest i do to understand the magic of assembly, besides 20 years working on programming industrial machine tools? sincerely, nlscb

Comment Search has returned - Like a beaten wife, I am pathetically grateful. - mr strange

My recommendation (none / 2) (#80)
by localroger on Sun May 02, 2004 at 04:13:46 PM EST

As things stand, probably the most accessible platform where assembly language is necessary and useful would be the PIC microcontrollers. You can do a lot of little neat personal projects with them, and they will get you used to thinking of machines that are limited in memory and clock time the way all computers used to be.

If you're handy with a soldering iron for simple (not advanced) projects, you can probably download the software, build a programmer, and get going for $50 or less, and it's a fun hobby. You can buy the PIC controllers from Digi-Key.

If that's too much work for a bit more cash you can get one of the BASIC stamp development kits from Parallax (also sold by Digi-Key). StampBASIC isn't .asm but using it will get you used to thinking in the same way, since the math is .asm math and the limitations are PIC limitations, and the learning curve is much easier. They have a nifty "board of education" that brings all the stamp II I/O pins out to a mini-breadboard. When you get something working the way you want, you can solder it together or re-develop it with a raw PIC. Moving from StampBASIC to .asm is very natural (in fact, moving from any BASIC to .asm is more natural than moving from most other languages).

I don't use raw PIC's at work but I use a lot of BASIC stamps and Blue Earth Controllers, which are 80c51 based, a similar (but slightly more expensive) architecture. You can get BASIC and C development environments for most of these little CPU's but they really show you the advantage of knowing how the machine works. Even in BASIC or C you have to be aware of register and memory usage in ways you simply ignore on a PC.

Finally, there's the old PC route, which has the advantage of usually being free ("You really want this old XT?") and that you can usually port your creations right over to your P6 to see how much faster PC's are nowadays, but the disadvantage that it's too tempting just to move over and do anything really challenging on the Pentium.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]

PIC16 (none / 0) (#119)
by bugmaster on Mon May 03, 2004 at 12:35:41 AM EST

Best thing about PIC16... Only 31 commands to memorize ! And about 15 of them are the same ! It's so easy to learn !

Ahahaha that marketing brochure always cracked me up. Anyway yeah, localroger is (surprisingly) right about this one. Whatever you do, don't start with X86 like I did, because their architecture is just too crazy.
>|<*:=
[ Parent ]

MSP430 (none / 0) (#240)
by dn on Fri May 07, 2004 at 08:20:09 PM EST

The MSP430 microcontroller is also friendly, with a fairly simple instruction set, and a von Neumann architecture (code and data in the same address space). Power consumption is low, speed is moderate, and a variety of models with different peripherals are available. Models with a goodly chunk of RAM are available.

PICs are fine for learning, but as I designed them into systems the attraction has worn off some. The Harvard architecture is restrictive, especially since the program space is read-only in operation. They also tend to be crippled when it comes to RAM. That says more about the things I tend to do than about the processor, but there you go. The flip side is that PICs are dead simple, top notch for bit banging, and speedy.

A counter-recommendation: nobody should start out with the 8051 family. Hell, nobody should use the 8051. I know it's popular and venerable, but so are cigarettes. It has the Harvard architecture but doesn't use that to optimize the instruction set like the PICs, the instruction set is complex and baroque and snail-slow, and the memory map is an abomination. Writing/porting compilers is hideously painful thanks to the profusion of memory access modes (including bit-addressable! but only for certain bytes!), and the Harvard architecture makes externally segmenting memory a PITA. It flounders at bit bashing, I/O, signal processing, and just about everything interesting. 8051: just don't do it.

    I ♥
TOXIC
WASTE

[ Parent ]

Noted, thanks (none / 0) (#242)
by localroger on Sat May 08, 2004 at 08:12:58 PM EST

I'll look into it.

In all fairness to Intel the 804x / 805x CPU's have longevity and hardening going for them more than anything else. They are yea hard to program but they are also built like brickbats, and overall quite a bit more powerful in some ways (though much slower) than PICs. The main thing is that they are available in industrial hardened versions that withstand heat, electrical pulse, and radiation better than nearly anything else that costs less than ten times as much.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]

Fair enough (none / 0) (#243)
by dn on Sun May 09, 2004 at 01:01:01 AM EST

Pin-compatible (or nearly) 804x/805x devices are also available from multiple vendors, which is almost unheard of for microcontrollers. The in-circuit emulator and compilers aren't bad either.

    I ♥
TOXIC
WASTE

[ Parent ]

don't bother (none / 0) (#128)
by boxed on Mon May 03, 2004 at 07:09:09 AM EST

Learn to use the correct algorithm for the correct job instead, something localroger still hasn't learnt, and that's why he spends his time hand-optimizing some random assembler code instead of making the code fast by fixing the algorithm.

[ Parent ]
DOS Debug utility (none / 0) (#214)
by Fon2d2 on Tue May 04, 2004 at 11:22:25 AM EST

Pick up a quick and dirty guide to assembly programming on the 8086. Something that gives you the basic explanation from the ground up and then gives you enough reference to go at it. Then get yourself a guide to all the DOS functions (all on INTs 20h and 21h I think). And a guide to the debug utility. Texts from the early 80s should be modern enough. They're possibly available at your library. You can also find it all on the web with a minimum of search effort. Once you understand the basics of debug, you can start assembling and running very short programs. No need to understand a macro language at all. Basically set up processor registers for one DOS or BIOS function, call that function, and terminate. Then run your little script and see if it worked. Once you get comfortable with that, start writing programs using the macro assembler MASM.

Eventually, once you've got the basics of assembly under your belt, you will be competent in pretty much any assembler language with a minimum of effort. This is how I taught myself before my assembler classes in college. Once in college we used PIC controllers on microboards, which is also very good for learning, but requires more of an investment (assuming you're not taking classes).

[ Parent ]

How to learn assembly (none / 0) (#253)
by cerberusti on Sat Aug 07, 2004 at 01:37:22 PM EST

I assume you mean x86 assembly.  This is one of the more difficult ones, but the most useful one for the PC.

Go to: http://www.intel.com/design/pentium4/manuals/index_new.htm

And order all of the IA-32 Architecture manuals.  Intel will ship them to you free of charge.  They are also available for download as PDFs, but you do not want to be reading these on a screen.  It might take a month or two for you to receive them.

Read the books, understand them.

Choose an assembler.  There are many choices.  Read up on it and choose one you think you will like.

If you need help, look for "The Art of Assembly", you can find this by using google.  This should not be your sole reference, use it only as an introduction.

If you have a specific question, e-mail me at k5@culsu.net

If you e-mail me I will expect that you have already read the architecture manuals.


[ Parent ]

Why is this being voted up? (3.00 / 11) (#85)
by The Honorable Elijah Muhammad on Sun May 02, 2004 at 04:39:27 PM EST

Dear localroger,

Being a competent programmer may give you license to bitch and moan about how the kids these days aren't taught asm at birth, but it's no excuse for not realizing why a flatfile O(n) search is a very bad idea for such a large amount of data. Yes, optimization is good, but the minor increase you can wring out of writing proper asm is minuscule compared to the improvement you can get out of choosing a proper algorithm in the first place. Yes, your idea works in theory, but in practice it's so expensive, and such a pain in the ass, as to be infeasible.

If you can fit 7 GB of data (that's growing every day) in RAM, using your expensive disk arrays and memory, and then search through it flat... just think about how much better you could do it if you had chosen a proper method of searching in the first place.

Options:
(1) Go ahead with localroger's plan, rusty takes several thousand dollars out of the yacht fund to pay for ram, servers and disks.
(2) Go with the correct solution and implement pb's vector search code.

Sincerely,
The Honorable Elijah Muhammad


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
How the fuck is this getting voted up? [nt] (none / 2) (#88)
by tlhf on Sun May 02, 2004 at 05:00:17 PM EST



Because it's hilarious by being so ridiculous?[nt] (none / 2) (#91)
by QuickFox on Sun May 02, 2004 at 05:19:23 PM EST



Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi
[ Parent ]
Hey jackass, (1.00 / 10) (#102)
by Mutually Assured Destruction on Sun May 02, 2004 at 07:15:27 PM EST

Learn to fucking spell, will you?

r - e - d - i - c - u - l - o - u - s ... it is NOT that fucking hard, OK?

[ Parent ]

You missed one (none / 1) (#131)
by ulric on Mon May 03, 2004 at 08:06:34 AM EST

h - e - l - a - r - i - o - u - s

[ Parent ]
/You/ learn to spell (none / 0) (#161)
by QuickFox on Mon May 03, 2004 at 04:04:07 PM EST

Mutually Assured D - i - s - t - r - u - c - t - i - o - n

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi
[ Parent ]
I know nowt about this (none / 0) (#89)
by GenerationY on Sun May 02, 2004 at 05:01:21 PM EST

but how does this solution stand up against adding a Google Site Search box to the top of the front page?

I'm tempted to think 600 USD is a small price to pay to make the problem disappear (600 USD wouldn't buy you consultancy time sufficient to implement the above, would it?). I assume, however, that I am missing the critical flaw in this fiendishly simple plan.

Google has been banned... (none / 2) (#93)
by greenrd on Sun May 02, 2004 at 05:52:22 PM EST

... because it brings scoop to its knees.


"Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes
[ Parent ]

That IS a critical flaw. Ouch. <nt> (none / 1) (#94)
by GenerationY on Sun May 02, 2004 at 05:55:28 PM EST



[ Parent ]
Google is not allowed to search the comments. (none / 2) (#95)
by localroger on Sun May 02, 2004 at 05:57:05 PM EST

As Greenrd says, though it does get the stories. Problem is, story search was never the problem.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]
Thinking laterally (none / 0) (#101)
by GenerationY on Sun May 02, 2004 at 06:58:13 PM EST

and possibly stupidly; is there any chance it might just be easier to make scoop stand up to google's bots and spiders? Make a copy for google to index and forget about writing our own search algorithm? I can't imagine google taking more than a second to retrieve results.

[ Parent ]
Excellent idea IMO. (none / 0) (#103)
by localroger on Sun May 02, 2004 at 07:30:03 PM EST

However complex the work is, if someone else is willing to do it and you're confident they'll do it well, why waste your effort?

Perhaps this is the real key. Algorithms aside, you throw all the comments maybe not into my hypothetical big text file, but into some structure that doesn't collapse on itself when Google spiders it. Then let Google do all the hard work. After all, if you're indexing the whole Web, it's worth it to you to sort out how to do the vector indexing and all that crap.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]

A Modest Improvement (2.90 / 11) (#92)
by Russell Dovey on Sun May 02, 2004 at 05:48:38 PM EST

My idea is so simple as to be obvious in hindsight, but:

If the searcher is faster when written in assembler, then why don't you code the textfile in assembler? Assembly language speeds up everything by a factor of 2 or 3, according to your earlier comment, so using both an assembler searcher and an assembler textfile would be 4 to 9 times faster! That's a total search time of either 15 seconds, or 6 and a half seconds!

I'll split the patent with you, we'll make millions selling this to Google alone!

"Blessed are the cracked, for they let in the light." - Spike Milligan

Information Retrieval (3.00 / 11) (#105)
by Pseudonym on Sun May 02, 2004 at 08:11:04 PM EST

Disclaimer: I write highly scalable text database servers for a living. Therefore anything I say may be biased.

It's a dumb idea. It's signature files all over again, only this time, we're not bothering with the actual signature files. (For the record, using relational databases for text is also a dumb idea. Text does not fit in columns. Unless your DBMS has a specialised text index built-in, performance will always be poor and space usage, especially on 7Gb of data, will be worse even than localroger's proposal. Just don't go there.)

The limiting factor these days is not disk at all, but cache. Multiple concurrent searches will not run at the same speed as a single search, because every memory access will be a cache miss.

Recent research (though there is some evidence that the "big" search engines have known this for a while) has found that compressed inverted indexes are faster than uncompressed indexes (e.g. "vector space searches") even when the vector space index fits in RAM, so long as you use the right compression algorithm. Plus, it's not nearly as hard to implement as you might think.
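A minimal sketch of that compression idea, assuming Python and an invented posting list (nothing k5 actually runs): store each word's comment ids as gaps and encode the gaps as variable-byte integers, so a posting list stays small enough to be cache- and disk-friendly.

def vbyte_encode(numbers):
    """Encode a sorted list of ints as gaps in variable-byte form."""
    out = bytearray()
    prev = 0
    for n in numbers:
        gap = n - prev
        prev = n
        while True:
            byte = gap & 0x7F
            gap >>= 7
            if gap:
                out.append(byte)          # more bytes follow
            else:
                out.append(byte | 0x80)   # high bit marks the last byte
                break
    return bytes(out)

def vbyte_decode(data):
    """Decode back to the original sorted list of ints."""
    numbers, n, shift, prev = [], 0, 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            prev += n
            numbers.append(prev)
            n, shift = 0, 0
        else:
            shift += 7
    return numbers

postings = [3, 7, 154, 1599, 160000]      # invented comment ids containing some word
blob = vbyte_encode(postings)
assert vbyte_decode(blob) == postings
print(len(blob), "bytes instead of", 4 * len(postings))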

Oh, one more thing. A search engine that takes 60 seconds to finish, even if it's predictable, is useless. People like feedback, and the ability to refine then re-run their searches.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Compressed indexes (none / 0) (#138)
by joib on Mon May 03, 2004 at 09:42:57 AM EST

Not related to text searching per se, but the judy library claims to achieve higher performance than the "everyday" data structures in most situations, by using compression. It sure looks interesting; the strange thing is that I don't know of any projects that use it.

[ Parent ]
Seen that (none / 0) (#181)
by Pseudonym on Mon May 03, 2004 at 10:16:48 PM EST

I've seen Judy, but I've never used it. The main reason is that it looks like a real pain to use.

If someone would create, say, a nice STL-looking (though not necessarily fully STL-compliant) interface on top of it, I'd be far more tempted.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
CLOB is now an ISO standard (none / 0) (#190)
by vyruss on Tue May 04, 2004 at 12:31:59 AM EST

most DBMSs should support it by now (?)

  • PRINT CHR$(147)

[ Parent ]
CLOB and indexes? (none / 0) (#191)
by Pseudonym on Tue May 04, 2004 at 01:46:35 AM EST

I don't know much about CLOB. Is it just a data type, or does it also specify standards for indexes on CLOBs? What about queries?

Part of the problem with text indexing is that you have to decide what a word is, and for certain types of query you also want to decide what a sentence is and what a paragraph is. This isn't an issue for k5 search, but it is for full text databases.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
It's really nothing of the sort (none / 0) (#195)
by vyruss on Tue May 04, 2004 at 03:25:56 AM EST

it stands for Character Large OBject, just a (potentially) limitless array.

  • PRINT CHR$(147)

[ Parent ]
OK... (none / 0) (#233)
by Pseudonym on Wed May 05, 2004 at 04:48:39 AM EST

It addresses storage, but it says nothing about indexing and queries. So it's actually not that much more use than a BLOB. Right?


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
Right, except if your DBMS (none / 0) (#234)
by vyruss on Wed May 05, 2004 at 01:17:56 PM EST

has integrated clob search functions etc. :)

  • PRINT CHR$(147)

[ Parent ]
Why not just keep a hash in memory? (none / 2) (#106)
by big fat idiot on Sun May 02, 2004 at 08:30:49 PM EST

How many words are in the English language? Would adding all the non words in common use on the internet even manage to double that number?

So why not just build a hash of words together with a pointer to a list of all the comments that contain those words?

Then have a search engine that runs as a separate process or as a daemon that scoop could hand off the code to?

Oh wait! It's been done already
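As an illustration only (Python, with invented comments and ids), the shape of that structure: building it is one pass over the comments, and a multi-word query is a dict lookup per word plus a set intersection.

from collections import defaultdict
import re

comments = {
    101: "RDBMS engines index everything for you",
    102: "a flat text file and a linear scan",
    103: "linear scan of a flat file in assembler",
}

index = defaultdict(set)
for cid, text in comments.items():
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        index[word].add(cid)          # word -> ids of comments containing it

def search(*words):
    """Return ids of comments containing every query word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(search("flat", "scan"))         # -> {102, 103}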

Yes it would, unless... (none / 1) (#109)
by Elendale on Sun May 02, 2004 at 08:49:47 PM EST

"Would adding all the non words in common use on the internet even manage to double that number?"

You seriously under-estimate the amount of "creative" spelling people on the internet can come up with.

Of course, that's not a problem if we just don't care about stupid people's comments.


---

When free speech is outlawed, only criminals will complain.


[ Parent ]
Statistics (none / 0) (#179)
by Pseudonym on Mon May 03, 2004 at 10:09:20 PM EST

How many words are in the English language?

In 45Gb or so of 1997-era web data, you find 4-5 million distinct words, depending on how you define a word. Scanning the text from start to finish, once you pass the learning phase, you find that about one word in 4,000 is new.

Now web data is not the same as k5 data. The web data is bigger, and hence contains more unique words than you'll find in k5 data. On the other hand, web data contains fewer words per Mb than you'll find in k5 data, which is mostly text.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
In other words (none / 0) (#192)
by big fat idiot on Tue May 04, 2004 at 02:05:23 AM EST

A hash containing all the unique words found in 45GB of webdata can fit into a couple of megs of memory.

[ Parent ]
No (none / 0) (#193)
by Pseudonym on Tue May 04, 2004 at 02:55:21 AM EST

4-5 million distinct words do not fit in a couple of megs of memory unless you compress them.

It's still not a huge amount. Even so, it's probably easier and cheaper to stick it in a Berkeley DB file and let the OS disk cache work out the rest.

Oh, and as I previously noted, "a list of all the comments" is not very cache-efficient. At the very least you want to compress the list. Even the newsrc-style integer set representation is better than nothing.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
Okay, a few megs (none / 0) (#212)
by big fat idiot on Tue May 04, 2004 at 10:30:34 AM EST

Most words are not very long. You might need as many as ten or twenty megabytes to store a hash of 4 to 5 million distinct words. RAM, on the level that we're talking about here, is cheap.

Bringing up Berkeley DB is inconsequential. Even if the hash were far larger, it would get paged out to disk eventually. In my experience, paging mechanisms are typically superior to disk caching, but YMMV.

Also, note that I never referred to the need to store a list of comments but only a pointer to a list of comments. How the list of comments is stored is irrelevant. You can use a linked list, another hash or whatever else floats your boat.

[ Parent ]

Go do it (2.90 / 10) (#107)
by lukme on Sun May 02, 2004 at 08:37:51 PM EST

Since you seem to have so much conviction, and assembler experience, I think it would be a wonderful learning experience for you.




-----------------------------------
It's awfully hard to fly with eagles when you're a turkey.
Yes (none / 0) (#118)
by JayGarner on Mon May 03, 2004 at 12:30:19 AM EST

The Scoop code is freely available; have at it with the search doohickey, and get back to us.

[ Parent ]
*yawn* (2.66 / 9) (#110)
by Estanislao Martínez on Sun May 02, 2004 at 10:13:28 PM EST

Time complexity of scanning a text for a match: O(n), where n is the size of the text; for the sake of argument, we'll measure it in words.

Time complexity of balanced binary tree lookup: O(log2 n), where n is the number of keys.  (A key in a keyword search will be a word.)

Constant speedup factor that yokelroger claims from using assembly over C (in comment here): 3

By the time n becomes as small as 9, the C-coded balanced binary tree is finding hits faster than the assembler-coded sequential scan.  (Based on solving n = 3*log2 n)  Sure, there's additional complexities to be considered (the hit you get from searching a tree for a single keyword is presumably a list of documents containing the word being searched), but even for medium-sized values of n, the balanced binary tree is devastating.  (And using a trie might be faster and more memory efficient than a binary tree.)
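A quick numeric check of that crossover, under the same simplification (assembler scan costs n/3, C-coded tree costs log2 n, all other constants ignored):

import math

n = 2
while n / 3 <= math.log2(n):
    n += 1
print(n)   # 10: from about here on the tree wins, close to the ballpark of 9 above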

--em

some extra detail (none / 3) (#112)
by Estanislao Martínez on Sun May 02, 2004 at 10:56:52 PM EST

The word "word" is used in two different senses in the analysis: as word token and word type.  The same word type will typically be instantiated by many word tokens in the same collection of text (e.g. the word type "word" has appeared as 8 word tokens in the text preceeding the numeral "8").

The sequential scan is O(n) over the number of word tokens in the collection of texts.  The balanced binary tree is O(log2 n) over the number of word types in the collection.

A trie will be worst-case O(k), where k is the length of the longest key.  The longest word in my /usr/share/dict/web2 is "formaldehydesulphoxylate", at 24 chars.

--em
[ Parent ]

You're ignoring disk I/O... (none / 0) (#199)
by warrax on Tue May 04, 2004 at 03:51:52 AM EST

which is usually a huge factor when searching anything over a few GB in size. It may have a huge impact on the practicality of any search strategy (such as using tries). That's not to say that you aren't correct; it's just that you have to factor in I/O (specifically how and what parts of your index to keep in memory) when talking about searches on huge datasets. AFAIK almost all databases still use B+-tree (or similar) indexes and hash table indexes because they're more I/O efficient than other data structures.

localroger is still a moron, though (albeit an eloquent one).

-- "Guns don't kill people. I kill people."
[ Parent ]

I can ignore disk IO. (none / 1) (#226)
by Estanislao Martínez on Tue May 04, 2004 at 04:47:10 PM EST


$ ls -l /usr/share/dict/web2
-r--r--r--  1 root  wheel  2493082 Feb 18 11:50 /usr/share/dict/web2

The keys can fit in RAM easily.  (Note that that file stores each word on a separate line; storing it as a hash or trie will take significantly less memory.)

--em
[ Parent ]

PS (1.60 / 5) (#111)
by Estanislao Martínez on Sun May 02, 2004 at 10:26:42 PM EST

Your article is stupid *and* unfunny.

--em

+1 Fiction [nt] (3.00 / 12) (#113)
by nooper on Sun May 02, 2004 at 11:18:52 PM EST



+1 FP, Made me realize something STUPENDOUS! (none / 2) (#114)
by Empedocles on Sun May 02, 2004 at 11:39:00 PM EST

That localroger's brain fossilized shortly after 1981.

---
And I think it's gonna be a long long time
'Till touch down brings me 'round again to find
I'm not the man they think I am at home

Best of both worlds? (none / 0) (#121)
by For Whom The Bells Troll on Mon May 03, 2004 at 12:38:02 AM EST

Indexes and parsing text-files => NXDB.

Been tinkering around with quite a few for some time now, but admittedly, most of them (including the non-OSS ones) have some distance to go before you want to take a serious look at them.

---
The Big F Word.

Not smart, better use reiserfs (none / 1) (#123)
by Highlander on Mon May 03, 2004 at 02:27:24 AM EST

A time of 60 seconds to scan a text file is much too long, and it also takes away full seconds of raw processing power.

Not to use a binary tree seems stupid, since the binary tree scales so much better. One can still use your idea of using the filesystem and keep the binary tree somehow on the file system.

Actually, I think there are file systems around that already use a binary tree (reiserfs, I think), so you really need only put your keys into filenames and hits into the files. If it doesn't perform, complain to the reiserfs team and make it work, because it should.
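A rough sketch of the keys-as-filenames idea (Python; the directory name and comment ids are invented): one file per word, each holding the ids of the comments that mention it, with the filesystem's own tree doing the lookup.

import os, re

INDEX_DIR = "word_index"
os.makedirs(INDEX_DIR, exist_ok=True)

def index_comment(comment_id, text):
    for word in set(re.findall(r"[a-z0-9]+", text.lower())):
        with open(os.path.join(INDEX_DIR, word), "a") as fh:
            fh.write(f"{comment_id}\n")       # append this comment's id to the word's file

def lookup(word):
    try:
        with open(os.path.join(INDEX_DIR, word.lower())) as fh:
            return [int(line) for line in fh]
    except FileNotFoundError:
        return []

index_comment(42, "keep the binary tree on the file system")
print(lookup("binary"))                       # -> [42]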

You propose ending the search after the first hit, but we really want all stories matching a query. I'd rather suggest that only stories be searchable, since adding every comment to the database might cost performance, but maybe I am overestimating the power of a cluster of monkeys here.

Moderation in moderation is a good thing.

I think (none / 1) (#139)
by DDS3 on Mon May 03, 2004 at 09:57:25 AM EST

I believe reiserfs and XFS both use B-trees.

I'm not sure about the others.


[ Parent ]

HAHAHAHA (none / 2) (#124)
by the77x42 on Mon May 03, 2004 at 04:43:41 AM EST

Since rusty has so much money, he can get one of these.

Random binary from good ol' DBASE is my favorite search method. It would pick a random number and then see if your query was higher or lower. It can then get rid of half the database results. Repeat. Wash. Rinse.


"We're not here to educate. We're here to point and laugh." - creature
"You have some pretty stupid ideas." - indubitable ‮

you sound like alta vista (none / 1) (#126)
by dimaq on Mon May 03, 2004 at 06:58:48 AM EST

(IMHO)

anyway I propose you make a simple test - wget -r a large enough archive of messages of some sort, be it a newsgroup or k5, then try your linear search (which can be done in a database too, btw), and do the same with some reasonably dumb database like mysql, using its native b-trees to search for the words first and then parsing the result to check whether a "phrase", however you define it, actually occurs there.

btw you don't have to store words as such - you could store some derivative, be it a truncated hash or some sort of normalization (like plural->singular and past->present tense conversion for English). You would arrive at a smaller but less precise index.
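As a rough illustration of that (my own crude normalization rule, nothing more): fold words toward a singular form and keep only a truncated hash as the index key, trading some precision for a smaller index.

import hashlib

def normalize(word):
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]                      # very crude plural -> singular folding
    return word

def index_key(word, width=4):
    # truncated hash: a smaller key, at the price of occasional collisions
    # that the final check against the real text has to weed out
    return hashlib.sha1(normalize(word).encode()).digest()[:width]

print(index_key("Comments") == index_key("comment"))   # True: folded to the same key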

In the end the point against grepping through a large file is that it's a linear operation, whereas selecting a subset of messages is O(log N).

And finally, if I recall correctly, there was an article of some sort published by the folks who created the AltaVista search engine which discussed tradeoffs and implementations of full-text search.

Oh yeah I forgot - you could let google do the job instead :)

A question (none / 0) (#239)
by RegularFry on Thu May 06, 2004 at 09:08:19 PM EST

Because I'm temporarily too lazy to look it up.

While the time-complexity of the internal database implementations may be O(log N), what's the memory complexity when you're inserting an entry and lookup vector for every word in every article, in addition to maintaining the original database? And is that not also relevant to this discussion? If not, why not?

If we're looking for the best speed/cost tradeoff, we don't just need to be concerned with the price of the processor, we need to know how much RAM/drive space we're going to need, and for a database like this, all O(N)'s are not created equal.

There may be troubles ahead, But while there's moonlight and music...
[ Parent ]

that depends (none / 0) (#246)
by dimaq on Fri May 14, 2004 at 08:01:50 AM EST

most importantly the choice depends on ratio of articles to searches.

still, inserting an article and all of its word occurrences should be O(log N) too (not considering transactions, locking, et al.), even though databases are rarely optimal.

in the end, the suggestion in the article might be usable for smaller data sets because databases tend to blow up the storage space (in my case some 40 times or so), though the factor hardly depends on the data set, so the choice is basically Factor*log N vs. N.

[ Parent ]

I beg of you rusty, delete this shit (2.00 / 7) (#129)
by boxed on Mon May 03, 2004 at 07:16:34 AM EST

Dear Rusty,

this article makes all of kuro5hin look stupid. Not stupid as in "a little naive", but stupid as in "can't fucking read, search google or think at all". Delete this piece of shit article PLEASE, before hundreds of ignorant newbies actually believe localroger's statements and decide to code everything in assembler with bogosort. Yes, bogosort is EXACTLY as intelligent as this tripe.

hubris (none / 2) (#136)
by dlec on Mon May 03, 2004 at 09:16:39 AM EST

What nonsense is this? I just can't believe what I'm reading. You're arguing that we should just search the entire text each time. Unbelievable! A modern hard drive will do about 25MB/second, so your query over this 7GB file will take about 5 minutes. How is that possibly an improvement?

Let's get this straight: relational databases aren't designed for textual indexing, but for queries over structured data. What do you think the "relational" means, anyway? They usually have basic text searching tacked on, but it's not a core competency.

If you're finding that MySQL isn't working as well as you'd like, that is because you're using the wrong tool. Did you never consider using an indexing server? Why didn't you just ask someone before coming up with this nonsense?

The only indexing server I have ever used is Lucene, but you can find a compilation of other open source Java indexers here, or open source indexing servers in general here.



Ignore this article (none / 3) (#137)
by DDS3 on Mon May 03, 2004 at 09:35:50 AM EST

...this article is horrible and plagued with misinformation.

The simple fact that he thinks MySQL hangs the moon is reason enough to ignore it, yet it gets factually worse.

The indexing ... processing requirements.

Not if they are done properly.  That's the whole friggen point of having indexes.  Since we are talking about searches, once the data is inserted, either the index is being used and reducing processing, or it's worthless (ignored).

Searches are already limited by default to a more modest "current" dataset.

What?  What is he trying to say here?  That databases have a non-current dataset?  WTH?

As blocks are loaded into RAM they can be used for all ongoing concurrent searches, so while one worst-case search might take 60 seconds, a thousand concurrent worst-case searches will also take 60 seconds.

What?  This guy is on crack!  This makes sooo many assumptions.  Chances are much higher that a thousand worst-case concurrent searches will take somewhere between 60 and 60,000 seconds. Since when can a linear increase in work be completed with zero overhead?  Especially if we are talking about MySQL.  Plus, this assumes that the entire dataset nicely fits into RAM.  If it doesn't, expect performance to fall quickly with each additional search.  As it relates to MySQL, MySQL has serious scaling issues, so I'm absolutely sure that you will not see 60 seconds for 1000 concurrent 60-second queries.  Chances are, it will be time for a quick nap.

The effect on the main database is effectively zero

Main database?  Text search isn't considered the main database?  Does this assume that the text search DB is on another box?  This seems to rest on more bad assumptions.

This is just putting my toe in the water.  Basically, we can ignore anything this article offers.

well.. (none / 0) (#162)
by aluminumaloi on Mon May 03, 2004 at 04:18:19 PM EST

I agree that the author of this article is smoking crack, but you missed a key point: this wouldn't be in the database; it would be a flat text file.

[ Parent ]
You're more clueless than localroger (none / 0) (#170)
by damiam on Mon May 03, 2004 at 06:51:53 PM EST

What? What is he trying to say here? That databases have a non-current dataset?

K5 keeps posts over a certain age in a separate, static archive.

Since when can a linear increase in work be completed with zero overhead?

When you're reading everything from RAM. For that matter, it might be smarter to just get an XServe/Opteron and keep the whole thing in RAM.

[ Parent ]

I'm sorry... (none / 1) (#175)
by DDS3 on Mon May 03, 2004 at 08:39:07 PM EST

...but RAM or not, a 60-fold linear increase in work will result in a minimum of 60x more work.  Period.  Even ignoring that, all other things considered, it will never be zero.


[ Parent ]
Tautology (none / 0) (#198)
by warrax on Tue May 04, 2004 at 03:38:17 AM EST

"Work x 60" = "Work x 60". Duh.

But if you mean what I think you mean, you're still wrong. You see, the bottleneck in this case is disk I/O, so the CPU is just idling most of the time while a search is going on. That means you can add more concurrent searches that are basically free, yielding faster results for two users than running the two searches one after the other.

Even if the bottleneck were CPU, you could still do the total work faster (total wall-clock time) than otherwise; you see, you can (more often than not) construct more efficient matching automata (or whatever it is they do these days, I haven't kept up with the literature) if you know that you have to match several different strings. Even if you don't know this in advance (i.e. searches are being added while other searches are going on), any decent on-line search algorithm can adapt to new search strings being added while running. So even in that case, you're wrong.
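A hedged sketch of that point (Python; the file name and query strings are placeholders): several pending queries share a single pass over the data, so the dominant I/O cost is paid once.

import re

queries = ["flat file", "inverted index", "assembler"]
pattern = re.compile("|".join(re.escape(q) for q in queries))

hits = {q: 0 for q in queries}
with open("comments.txt", encoding="utf-8", errors="replace") as fh:
    for line in fh:                           # the expensive part: reading the data once
        for match in pattern.finditer(line.lower()):
            hits[match.group(0)] += 1         # every query is served by the same pass
print(hits)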

You are right about one thing though: Localroger has no fuckin' idea what he's talking about.

-- "Guns don't kill people. I kill people."
[ Parent ]

That's still wrong.... (none / 0) (#202)
by DDS3 on Tue May 04, 2004 at 05:59:48 AM EST

First the assertion was that it's in RAM and that doing 60 times more work is zero cost.  Now you're saying it's in RAM but disk (I/O) bound and that, despite doing 60 times more work, it still has zero cost.

I surely hope you guys don't do coding for a living because this is one horribly poor mistake after another.


[ Parent ]

Couldn't resist (none / 0) (#203)
by DDS3 on Tue May 04, 2004 at 06:15:24 AM EST

...a second posting...

For the sake of argument, if it's linear, a search is, on average, going to take n/2.  If you add another search criterion to the inner loop, you're going to have, on average, (n/2)*c, where c = the number of criteria (versus c*(n/2) for separate searches).  And this ignores tons and tons of items and makes many poor assumptions just to greatly simplify the model.  Long story short, even with absolute best-case algorithms, the wall-clock time is going to go way up, period.  If you really insist on following these poor assumptions, I'll have to crack open a book and actually work out the times, but I shouldn't have to, because this really is common sense.  Seriously, just think about it for a second.  More work is always going to mean more work, and more work is always going to be reflected in more wall-clock time.  More work is never going to equal zero cost.  Period.

[ Parent ]

arg...third even... (none / 0) (#204)
by DDS3 on Tue May 04, 2004 at 06:38:42 AM EST

I'm sorry, but there is just so many things wrong with everything you've stated, I can't stand it.

You see the bottleneck in this case is disk I/O. So the CPU is just idling most of the time while a search is going on.

Ignoring the fact that others have stated that it's in RAM, we'll run with the notion that it's on disk.  If you add a second search, that results, on average, in twice the amount of I/O.  If you add 60 more searches, then, on average, that's going to be 60 times more I/O.  Thus, it will be 60 times the average I/O duration.  Without knowing any other details, that's the assumption that you should build on.

Now then, we can make some assumptions and improve things, but the improvement will NEVER be free (zero cost).  Using your assumption, every RDBMS out there should be mucho faster than it is in reality.  After all, according to you, if I assume 60 concurrent queries with the same criteria, the duration should be the same as it is with a single query.  That, of course, is false and never happens, even if everything fits into RAM.

Even if we assume much more intelligent search capabilities, at best, you're going to have an inner-loop iteration for each additional search criterion, which will result in a linear time increase for the inner loop.  And frankly, I have a hard time imagining any search algorithm that is going to add additional criteria to a single search when they are unrelated.  The reason?  Because, on average, each additional criterion added is going to increase the best-case search duration and negatively affect every other search in progress.  In other words, your super-smart search algorithm would result in absolutely horrible systemic durations.

That means that you can add more concurrent searches which are basically free; thus yielding faster results for two users than doing the two searches one after the other.

That's just completely false.  The simple fact is, the second linear search is probably going to have a smaller wall-clock time than the first.  Likewise, the two wall-clock times added together, making some reasonable cache assumptions, are going to be less than the total wall-clock time for two concurrent searches.

So even in that case, you're wrong.

I think computer science proves me right.  Care to provide anything which proves otherwise?

After it's all said and done, I'm hoping you'll apply some common sense and really think about this.  Come on, adding more work will always result in more work.  More work will never equal zero cost.  Beyond that, if we run with your first I/O assumption (it applies to a RAM model as well), common sense tells you that I/O is very costly and that ANY additional I/O load is going to result in higher wall-clock times.  Period.  Seriously, think about it.


[ Parent ]

poor form. (none / 1) (#209)
by rmg on Tue May 04, 2004 at 09:56:36 AM EST

this thread smacks of effort. replying three times to a biter is just pathetic.

----

dave dean
[ Parent ]

Poor form? (none / 2) (#216)
by DDS3 on Tue May 04, 2004 at 01:33:22 PM EST

Why is educating someone poor form?  Communicating is poor form?  Isn't this the whole point of this site?

Poor form is replying to a thread without adding any real content.  You replied simply to complain, doing nothing for the overall topic or thread.  Sheesh.

Nuff said.

[ Parent ]

the comment was educational. (none / 1) (#217)
by rmg on Tue May 04, 2004 at 01:37:54 PM EST

you might have learned something about form, though you seem set against doing so.

communication is bad form in a variety of circumstances. in the present context, we have a troll begging a biter to respond by posting to him three times. it's an affront to the style and panache of trolling.

----

dave dean
[ Parent ]

megid? (none / 2) (#218)
by DDS3 on Tue May 04, 2004 at 01:53:14 PM EST

Seems like him trolling... which, oddly enough, is exactly what you're doing.  If begging is what you read into that, then I can't explain why you want to read that into it.  On the other hand, one might argue that since they were three posts back to back in a small timeframe, it doesn't really change anything.  Worse, why would it matter to you unless you're trolling?  Which, obviously, you are.

Megid guy is a serious loser.  Don't be like him, if you are, in fact, a different person.

[ Parent ]

me? troll? (none / 3) (#219)
by rmg on Tue May 04, 2004 at 02:14:05 PM EST

well i never~!

----

dave dean
[ Parent ]

Try this: (none / 1) (#238)
by RegularFry on Thu May 06, 2004 at 08:57:22 PM EST

I may have misinterpreted the original article, but here's what I understood (and read between the lines a little):

Reading a chunk of text to be searched from disk into memory takes 60 units of time.
Running a search for a single term over that chunk takes 1 unit of time.
Assume that reporting matches is free for simplicity.
Having read the first chunk in from disk, we request the second, and start our search.
1 time unit later, we have finished our search for the first term, leaving us 59 time units left until we need to start searching the second chunk.
We now pop the next search term off our in-memory stack of requested search terms, and search the first chunk for it while we're waiting. We can repeat this step 59 more times before the second chunk of text arrives.
This gives us, in wall-clock terms, 60 concurrent searches in the same time as 1 would take.

With a little work, this could be applied to a RAM model, as well, and the beauty is, the wall-clock time is (above 60 searches) not I/O bound, but CPU bound. There's no claim that we're getting something for nothing - the CPU is going to be using 60 times as much energy, or thereabouts, for the search, because now it's doing something where before it was having a rest. However, by being just that little bit cunning, we can ensure that additional searches don't result in additional disk I/O work.
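A rough sketch of that schedule (Python; the file name, chunk size, and terms are placeholders, and matches straddling a chunk boundary are ignored for brevity): a background thread loads the next chunk while the main thread runs every queued search over the current one.

import threading, queue

CHUNK = 8 * 1024 * 1024
terms = [b"flat file", b"inverted index", b"assembler"]

def reader(path, q):
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(CHUNK)
            q.put(chunk)                      # hand the chunk to the searcher
            if not chunk:
                return                        # an empty read marks end of file

q = queue.Queue(maxsize=2)                    # at most one chunk prefetched ahead
threading.Thread(target=reader, args=("comments.txt", q), daemon=True).start()

counts = {t: 0 for t in terms}
while True:
    chunk = q.get()                           # the disk work already happened in the background
    if not chunk:
        break
    for t in terms:                           # CPU work: all pending searches share the read
        counts[t] += chunk.count(t)
print(counts)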

IRMC :-)

There may be troubles ahead, But while there's moonlight and music...
[ Parent ]

This proposal is absurd (none / 2) (#143)
by p3d0 on Mon May 03, 2004 at 10:07:41 AM EST

Tell you what... You write your whole-file-search in assembler, and I'll write a Python script in one day that outperforms it by three orders of magnitude on a 7GB dataset. I'll achieve this by using "weird hash algorithms" and "strange lookup files" because that's what it takes to do the job properly.
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
Of course... (none / 0) (#251)
by trimethyl on Thu Jun 10, 2004 at 03:44:04 PM EST

You could never implement those "weird hash algorithms" or "strange lookup files" in assembly.

Here's a hint from an old assembler programmer: hash tables and lookup files were implemented in mainframe assembly long before Python was even thought of. And contrary to the ignorance of the masses, these algorithms are often easier to implement in assembly, because a hash table index can be implicitly converted into an address by a machine instruction; there's no need for a rat's nest of confusing explicit casts and weird assignments.

What he's proposing is ideal for assembly. The search algorithm is well-known, well-understood, and simple. It would easily fit into a 2 kB binary module, disk drivers included. Your Python script and interpreter would most certainly result in cache misses on every record accessed - after all, you've only got 256k of cache for your interpreter, the disk buffer, the OS's disk driver and task manager, and your script. Chances of fitting all of that in 256k? Not good. Of course, I'm talking about the processor's cache, not main memory. But then, it's not as if your Python script would know the difference. I'd be willing to bet that if your hash table exceeded the size of available RAM, your script would go into perpetual swap, lock out the rest of the users, and ultimately get shut down. Pity the Python programmer can't know at runtime how much of his program is actually in memory and how much is swapped to disk...

I'm not so ignorant to suggest that assembly is the ideal language for everything; you've got to pick the right tool for the job. It just so happens that in this case, assembly _is_ the right tool - it needs to be fast, and it only needs to be written once.

Yes, your Python script could run faster... until he discovers how easily a hash table is implemented in assembly, or the processor's string-handling instructions like movsd and scasd, or that base+index+offset addressing modes will calculate his hash key for him...



[ Parent ]
Conclusion: a large number of kurobots (none / 0) (#144)
by Undesirable Username on Mon May 03, 2004 at 11:27:02 AM EST

have a really sick sense of humor.

Although... reading through all the disparaging comments one experiences a level of meta-amusement hardly seen before. Almost orgasmic. Maybe not so sick after all =)

Actually, the reaction reminds me of (none / 1) (#145)
by Undesirable Username on Mon May 03, 2004 at 11:28:34 AM EST

http://adequacy.org/public/stories/2001.12.2.42056.2147.html
Congrats, localroger! You're on the path to immortality!

[ Parent ]
check out lurker for high-performance indexing (none / 2) (#148)
by klash on Mon May 03, 2004 at 12:15:04 PM EST

I highly recommend looking at the software package lurker. It's a mailing list archiver that creates keyword indexes using a highly specialized on-disk database. The database code is very cleanly separated from the rest of the package in the directory "libesort." The code is also very clear and very well-documented. Here is what the code comments have to say about this database (sorry about the bad formatting of the table, for some reason scoop doesn't support <pre>):

/* What is ESort? -- An Online External Sort
*
* ESort provides a very primitive database API: An insert-only set of keys.
*
* There are very different trade-offs when compared with a standard database.
*
* N = number of keys in the set, M = number of operations
* The seeks needed for each repeated operation in one transaction:
*
*                     ESort             BTree           Hash
* Insertion           O(1)              O(M*log(N))     O(M)
* Read in sequence    O(1)              O(M)            impossible
* Seek to key         O(M*log^2(N))     O(M*log(N))     O(M)
* Delete              impossible        O(M*log(N))     O(M)
*
* From this table, one can conclude that ESort allows blindingly fast insertion
* and sequential read, but pays in the cost to seek. This is what makes it
* highly appropriate for keyword indexes.
*
* An additional restriction is that ESort only supports a single-writer.
* An additional advantage is that ESort readers get snap-shots.
*/

Lurker is designed to scale extremely well, so it seems likely that it could handle the demands of keyword-indexing all of kuro5hin's stories.



Dear localroger, (2.75 / 8) (#150)
by razumiking on Mon May 03, 2004 at 12:34:30 PM EST

I've enjoyed reading your stories for about two years now. Your writing has been a credit to K5. Since I got my account, I've always reflexively voted up your articles, but now I don't know if I can trust what comes out of your account anymore.

As I read this article, I couldn't believe what I was reading. It was misinformed, arrogant, even deceptive. It seemed as if it was designed to provoke people. I felt like, and I don't use this word lightly, I was reading a troll. It was my old usenet days all over again.

I've had bad experiences with trolls, localroger. I've had two of my favorite newsgroups and my favorite website overrun. Trolls have posted my personal information online in places where I thought my anonymity was assured. Trolling is not a game, localroger. When you troll, real people get hurt. It is as you said in a diary a while back: Trolls are the bullies of the internet.

I know you're a smart guy. I think everyone here knows it. You don't need to show us how clever you are by tripping up less canny readers with your trickery. That's the sort of thing RobotSlave and rmg get off on. It's not you.

People say K5 is dying, but I didn't believe it until today. Today, one of my favorite posters, a central figure in this community, and surely the most respected, has turned to trolling. I'm stunned and disillusioned.

From now on, I'll have to read your stories very carefully before I vote on them. I'll read all your posts with a suspicious eye. I'm sorry it has come to this, but I don't think we really have any choice anymore. You've broken the seal of trust. It will take a long time to heal.

Sincerely,

Graeme.

OH MY GOD! (2.60 / 5) (#154)
by Mutually Assured Destruction on Mon May 03, 2004 at 02:03:34 PM EST

HE'S THREATENING TO READ THINGS WITH A CRITICAL EYE! DUCK AND COVER, PEOPLE!

[ Parent ]
Well... (none / 3) (#157)
by localroger on Mon May 03, 2004 at 02:52:28 PM EST

It's not the first time, but I don't do it very often.

I never asked to be rated uncritically. Everyone has an off day, and I'm no exception. If I've written something that isn't good, I'd expect it to be voted down. I am frankly surprised that this got voted up (and if I'm not mistaken it didn't really; it got autoposted).

And you know, while I'm not quite as dumb as the article makes it sound, I really don't work with DB's very much and I learned a few things from the comments. So it wasn't the complete waste of time most trolls are.

Anyway, all I ask is an open mind, and if you honestly think I wrote a dog turd, don't think it's disloyal or anything to slap me with a -1.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]

So it was a troll... (none / 0) (#247)
by KrispyKringle on Sun May 16, 2004 at 07:43:12 PM EST

right? Or did you write this with sincerity?

I'm not sure which way I'll respect you less. If you wrote it with sincerity, not working with DBs much is no excuse; you have no clue how algorithm design works, as this article makes abundantly clear. If you did this as a troll, then I suppose I have to at least give you credit for getting it posted (how the fuck did this get posted, anyway? It's complete drivel!).

Well, whatever. Better luck next time.

[ Parent ]

Yes, basically (none / 0) (#249)
by localroger on Thu May 20, 2004 at 11:02:29 PM EST

It was originally a mistake I made in a comment and the pile-on replies were so vicious and numerous that I decided to see if they'd follow it into something clearly labelled A MODEST PROPOSAL. Which they did.

Meanwhile, I realize I did waste a lot of honest peoples' time, and I'm a bit sorry about that. But it was worth it. I think.

What will people of the future think of us? Will they say, as Roger Williams said of some of the Massachusetts Indians, that we were wolves with the min
[ Parent ]

Hmm. (none / 0) (#250)
by KrispyKringle on Sat May 29, 2004 at 03:11:43 AM EST

Well, I can forgive you somewhat for the subtle reference. But I didn't know at the time whether that was intentional, or whether you were simply being, shall we say, faux-intellectual. You know, trying to sound intelligent when you may, in fact, have been a 13 year old kid who heard assembly was cool. So hard to tell these days.

Still, I believe I ought to give you credit for being, if a troll, at least a sensitive troll, apologetic for wasting others' time.

Cheers.

[ Parent ]

1 word article replacement: (none / 2) (#151)
by phred on Mon May 03, 2004 at 12:35:58 PM EST

grep

assembly's fast! (none / 3) (#152)
by rmg on Mon May 03, 2004 at 01:14:21 PM EST

yeah yeah yeah!!

it's not slow!!

no no no !!

----

dave dean

I'm not going to enter this debate (none / 0) (#156)
by Haunting Koan on Mon May 03, 2004 at 02:41:00 PM EST

But isn't there some way to use Google to search k5?

Yes, but Google places high load on Scoop. (none / 0) (#174)
by topynate on Mon May 03, 2004 at 08:01:50 PM EST

Modifying Scoop such that Google could spider it has been suggested somewhere else on here.


"...identifying authors with their works is a feckless game. Simply to go by their books, Agatha Christie is a mass murderess, while William Buckley is a practicing Christian." --Gore Vidal
[ Parent ]
Off topic - (none / 1) (#159)
by mami on Mon May 03, 2004 at 03:41:10 PM EST

I wonder if my impressions are right. Every time something very controversial and upsetting is going on in politics, some oldtimers like localroger and irmdkl regularly write very uncontroversial and more or less boring, but valuable, "technical or cultural" articles.

ooh, now that I read a couple of comments (none / 0) (#160)
by mami on Mon May 03, 2004 at 03:46:06 PM EST

I guess I don't have to feel guilty for not having read the article. If the article was really "stupid", as some have claimed, my claim that this article was posted just to distract from evil politics on K5 makes even more sense.

ahemm, instincts, instincts ...

[ Parent ]

Two words. (none / 0) (#169)
by hummassa on Mon May 03, 2004 at 06:45:48 PM EST

Radix search.

Separation of concerns (none / 0) (#178)
by Pseudonym on Mon May 03, 2004 at 09:52:06 PM EST

If you think about the way an inverted file, signature file or vector-space index search is implemented, there are basically two parts: First, map a word to some ordinal. Second, map the ordinal to the set of record numbers (possibly plus statistics). The ordinal may be bypassed in some implementations, but the two phases are still there.

Radix search, hashing (including possibly perfect hashing), binary search trees and so on all help with the first part of the problem, and I agree that if all other things are equal (e.g. if highly concurrent modifications aren't a concern) radix search is usually superior these days. However, it doesn't help with the second phase of the problem.
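A bare-bones illustration of those two phases (Python, with invented record numbers): the term dictionary is one structure and the postings another, so either can be swapped out independently (radix tree, perfect hash, compressed postings, and so on).

word_to_ordinal = {}              # phase 1: term -> ordinal
postings = []                     # phase 2: ordinal -> set of record numbers

def add(word, record_id):
    ordinal = word_to_ordinal.setdefault(word, len(postings))
    if ordinal == len(postings):
        postings.append(set())    # first time we have seen this word
    postings[ordinal].add(record_id)

def lookup(word):
    ordinal = word_to_ordinal.get(word)
    return postings[ordinal] if ordinal is not None else set()

add("radix", 1); add("radix", 7); add("hash", 7)
print(lookup("radix"))            # -> {1, 7}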


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
I don't get it. (none / 0) (#206)
by hummassa on Tue May 04, 2004 at 07:40:32 AM EST

push the stats into the leaves of the radix tree. it works.

[ Parent ]
And... (none / 0) (#232)
by Pseudonym on Wed May 05, 2004 at 04:47:29 AM EST

...those stats are what exactly?


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
[ Parent ]
It's called AST (none / 3) (#173)
by KWillets on Mon May 03, 2004 at 07:57:15 PM EST

Assembler

Search

Technology

Google appliance (none / 0) (#176)
by jabber on Mon May 03, 2004 at 08:46:43 PM EST

How many of these could Rusty buy for the $30k he collected in his CMF stunt?

[TINK5C] |"Is K5 my kapusta intellectual teddy bear?"| "Yes"

One (none / 1) (#177)
by KWillets on Mon May 03, 2004 at 09:26:42 PM EST

And it would cause the same problems as the Googlebot does, since it's a fairly unimaginative port of the same program.

[ Parent ]
You apparently ignore basic facts (none / 2) (#182)
by vyruss on Mon May 03, 2004 at 10:47:02 PM EST

of computer science. One of them is, even if the hard disk does nothing but max out its sustained transfer rate on your (must be non-fragmented) single text file, you need A BUFFER to read the text into. After having skimmed your article and preceding comment and found nothing of the sort, I refuse to even read any more of it.

This is not a personal attack; I admire you for your writing, but this is like reinventing air, not the wheel. And badly, might I say.

  • PRINT CHR$(147)

but... (none / 0) (#183)
by coderlemming on Mon May 03, 2004 at 11:36:38 PM EST

Wait, are you implying he needs a 7GB buffer to read the entire file into?  This isn't Perl.  He won't have to do

@db=<DATABASE_FLATFILE>;

or anything.  I dunno, read the file in chunks, implement a ring-buffer, switch back and forth between two small buffers so one's being filled while one's being scanned?  There are very obvious ways to solve this problem, unless I'm misunderstanding you.

Note, I don't actually think his idea is a good one...


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)
[ Parent ]

I'm just saying (none / 0) (#188)
by vyruss on Tue May 04, 2004 at 12:21:24 AM EST

that reading data straight from the device, without breaking it up in chunks, is a very bad idea. I meant the buffers you read data streams into, not to buffer the whole file!

  • PRINT CHR$(147)

[ Parent ]
On another note, (none / 0) (#189)
by vyruss on Tue May 04, 2004 at 12:23:57 AM EST

If someone really needed to do it with a single textfile (the way Charles Manson really needed to kill those people), it would be much preferable to stuff the whole thing into an Oracle CLOB or something.

  • PRINT CHR$(147)

[ Parent ]
Eh. (none / 0) (#208)
by Znork on Tue May 04, 2004 at 08:57:36 AM EST

man mmap

[ Parent ]
This isn't about memory management (none / 0) (#227)
by vyruss on Tue May 04, 2004 at 05:41:56 PM EST

it's about basic I/O procedures.

  • PRINT CHR$(147)

[ Parent ]
Buffers (none / 0) (#231)
by Znork on Wed May 05, 2004 at 04:14:53 AM EST

You claim you need a buffer to read text into. You don't. You mmap the file into address space directly and let the OS demand-page it as you access various regions of the mmapped file.
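A tiny illustration of the same point (Python; the file name and phrase are placeholders): the mapping is handed to the kernel's page cache and pages are faulted in only as the scan touches them.

import mmap

with open("comments.txt", "rb") as fh:
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    pos = mm.find(b"flat file")       # scans the mapping, demand-paging as it goes
    print("first hit at byte", pos)   # -1 if the phrase never appears
    mm.close()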

[ Parent ]
Yuck! :) [nt] (none / 0) (#235)
by vyruss on Wed May 05, 2004 at 01:18:30 PM EST



  • PRINT CHR$(147)

[ Parent ]
et tu, roger? (none / 1) (#194)
by Entendre Entendre on Tue May 04, 2004 at 03:18:07 AM EST

Now I REALLY wish adequacy was still around.

--
Reduce firearm violence: aim carefully.

wtf are you smoking? (none / 3) (#196)
by polyglot on Tue May 04, 2004 at 03:26:26 AM EST

gaining a few percent by writing in asm gets you nowhere if you use an O(n) or worse algorithm instead of O(log n). We don't let people out of first year CS until they understand that.

Seriously, go back to writing fiction. That was good. This is an unmitigated disaster.
--
"There is no God and Dirac is his prophet"
     -- Wolfgang Pauli
‮־

sometimes the gain is not a few percent (none / 0) (#252)
by cerberusti on Sat Aug 07, 2004 at 12:41:54 PM EST

There are many cases when an asm implementation can shave a few orders of magnitude off of the time it takes to process a dataset.  Algorithmic improvements are only one area to look at when trying to improve performance.  It is not always the best thing to do.

We do not let people with CS degrees do important programming work in the real world until they understand this.

Too much focus on theory and not enough on how things work in practice will leave some gaping holes in your knowledge.

[ Parent ]

Use a relational database (none / 0) (#197)
by Ward57 on Tue May 04, 2004 at 03:37:28 AM EST

to hold the index system, and then prune the index so that it fits into main memory. Say, a "prune" that triggers every so often, and when it can't allocate more space (for a carefully defined meaning of space: 1,000,000 entries or a sensible number).

I haven't followed this discussion (none / 1) (#200)
by cyberdruid on Tue May 04, 2004 at 03:56:48 AM EST

But why not use one of the many good open source search engines? Swish-e is wicked fast.

our new k5 motto (none / 1) (#210)
by auraslip on Tue May 04, 2004 at 10:24:02 AM EST

"we hate this site"
124
how does google do it so fast? (none / 0) (#211)
by auraslip on Tue May 04, 2004 at 10:24:42 AM EST

with their "4,285,199,774 web pages"
124
Google's a tad larger than K5 (none / 0) (#230)
by Kal on Tue May 04, 2004 at 10:20:58 PM EST

From here, the first link from a Google search on how many machines Google uses.

According to calculations by the IEE, in a paper about the Google cluster, a rack with 88 dual-CPU machines used to cost about $278,000. If you divide the $250 million figure from the S-1 filing by $278,000, you end up with a bit over 899 racks. Assuming that each rack holds 88 machines, you end up with 79,000 machines.

[ Parent ]
Let google do it? (none / 1) (#215)
by univgeek on Tue May 04, 2004 at 01:24:47 PM EST

I know, I know google clobbers the db engine. However, would the following work -
  1. Create a separate 'archive directory'. This is basically a directory tree where each file is a comment in a flat text file, with only one link.
  2. This whole tree is cordoned off from the rest of the site, and ONLY this tree is allowed for search by google. I guess the robots.txt would allow this.
  3. The link would be to the db version of the comment. This way, anyone who is interested in following up can go to the 'live' comment. Google wouldn't poke at this link because it would be part of a 'banned' tree in the robots.txt. Perhaps this may be too painful for users? Or it could even be a 20-second refresh, leading to the db 'live' version.
  4. Add new comments to this tree, as they are created.
The tree could be
a) year -> month -> day -> comment, or
b) year -> month -> day -> story -> comments

Disclaimer: I have no clue on how mySQL will handle this, or any clue of whether the space occupied by this archive will be too large. Please poke holes in this. Sorry if this is too dumb.

Arguing with an Electrical Engineer is liking wrestling with a pig in mud, after a while you realise the pig is enjoying it!

The Poll (none / 0) (#223)
by theboz on Tue May 04, 2004 at 03:48:07 PM EST

I wish K5 had the multiple poll options like HuSi does, because I keep my data in most of the options.

Stuff.

Are you a programmer? (none / 2) (#225)
by trhurler on Tue May 04, 2004 at 04:26:30 PM EST

I could write Tex's word based solution and debug it to a very high level of reliability in a matter of hours, even in C. In less than a week, I could have an assembler solution for a single chosen target, but I wouldn't want to. The time required to do this in a high level language is probably stupidly small, but performance would suffer in most of them (and yes, you stupid language bigots, in this case, O(N) != 2(O(N)), so shut up already!), and it isn't hard to do this in C anyway.

Also, database corruption is easy to fix. Once every n time units (you pick based on practicality and your need for data integrity), you back up the whole shebang by doing an export. You keep the last m exports just in case, with an automated system to shuffle them as needed (this will take maybe an hour to create, maybe two hours, given that you include debugging and documentation).

In conclusion, your idea is bad. It is bad for one simple reason: if today we are manipulating seven gigs of data, then tomorrow it will be seventy. Are you willing to wait ten minutes for your data? Even if disks become twice as fast (this takes a LONG time, historically), you'll still be screwed; five minutes isn't good enough.

The correct solution probably IS to learn how to tweak the RDBMS, if that can be done. It may "defeat the purpose," but it is probably the minimal effort solution that provides good performance, and that's what counts. The problem is, I doubt MySQL can be made to do what is needed. Frankly, it sucks goat nads. Tex's idea is a close second; my main concern is whether it would actually perform as well as one would think. I'd have to give that more thought and maybe some trial runs before I was sure.

--
'God dammit, your posts make me hard.' --LilDebbie

You can't be serious... (none / 1) (#236)
by ucblockhead on Wed May 05, 2004 at 11:32:12 PM EST

Your proposal just makes no sense. There's a reason databases exist. Storing the data in one big chunk only makes sense if the data can only be used in one big chunk.

One way to make the comment search faster is to create a word index as comments/stories/etc are posted.

Roughly speaking, you have a table where the primary index is a single word, and the other columns are the comment id and perhaps the post date. Then to find all comments with that word, you do a select on that index, getting a set of rows in date order, each one representing a different post.
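A self-contained sketch of that table shape, using Python's sqlite3 module rather than whatever the site actually runs (column names and sample rows are invented): one row per (word, comment), an index on the word column, and a query is a select on that index.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE word_index (word TEXT, comment_id INTEGER, posted TEXT)")
db.execute("CREATE INDEX idx_word ON word_index (word)")

rows = [("assembler", 103, "2004-05-02"), ("scan", 102, "2004-05-02"),
        ("scan", 103, "2004-05-03")]
db.executemany("INSERT INTO word_index VALUES (?, ?, ?)", rows)

# "a select on that index": every comment containing a given word, newest first
for comment_id, posted in db.execute(
        "SELECT comment_id, posted FROM word_index WHERE word = ? ORDER BY posted DESC",
        ("scan",)):
    print(comment_id, posted)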

Of course, this will be a bit disk-hungry. But that's just an instance of trading memory (or disk space) for speed. (K5 may have 7 GB of data, but a 250 GB disk costs $200.)

It's the difference between a Unix find and a Unix locate. (Roughly...the index for locate isn't created on the fly.)

(Having a real database like Oracle, and having a separate database server, obviously helps.)

See, the thing is that searching the amount of data k5 has is not at all a "hard" problem. Real, professional systems generally have larger datasets and stricter requirements.
-----------------------
This is k5. We're all tools - duxup

localroger, (none / 0) (#237)
by ethereal on Thu May 06, 2004 at 11:02:05 AM EST

You ARE the Brute Squad!

--

Stand up for your right to not believe: Americans United for Separation of Church and State

if you had money... (none / 0) (#241)
by metalotus on Fri May 07, 2004 at 11:04:00 PM EST

could you pay Google to index your site? I'm not sure how most web sites store their data, but if you were inclined, it could save a lot of time.

I've found that if you want to sell your algorithm, just show someone how well it works! You could take a sample of K5 data, say 20 stories, and work with a reduced data set. You could write a simplified version of your algorithm and show how much work it does. Then you could hypothesize how the optimizations would help, and explain how well your program will scale from 20 stories to 7GB of stories. Without sample data & code, it is not convincing.



Ha! (none / 0) (#244)
by BOredAtWork on Sun May 09, 2004 at 02:42:47 PM EST

Well done, localroger. Some very interesting conversations have started because of this; Jonathan Swift would be proud.

What the fuck? (none / 0) (#245)
by ksandstr on Tue May 11, 2004 at 12:12:31 PM EST

Tex's suggestion is almost exactly the way that the typical full-text search is implemented[2]! If you were to have knowledge of a proper relational database system (such as PostgreSQL), you'd know that such things can be implemented as database extensions (OpenFTS springs to mind) while being entirely transparent to the application[1]. Hell, from what I've gathered, a simple combination of stored procedures (defined as operators if you really need the syntactic sugar), table rules and a secondary schema for the full-text searching system would be sufficient for such an implementation, even without resorting to a shared object (i.e. C linkage) extension!

Then again, migrating an application from MySQL to PostgreSQL is nowhere as painful as changing your mindset with regard to how RDBMS-oriented applications are supposed to work.

[1] Though I suppose this would introduce some confusion among MySQL-heads who're already used to tossing all their database logic into the application, the one place where it simply does not belong.
[2] Even MySQL does this behind the scenes, although in a characteristically immature and inflexible manner. PostgreSQL supports a special-case form of full text search (really a column prefix search) when a b-tree index is created on a varchar column and a LIKE operator is applied as the primary search rule with exactly one percent sign at the end.

--
Gegen kommunismus und bolschewismus und terrorismus, jawohl!

I'm missing a detail (none / 1) (#248)
by gstover on Tue May 18, 2004 at 09:12:08 PM EST

If the search algorithm does its search through the big file, & it finds the target string, how do you map the target string's location back to the file that the user wants?

gene



A Modest Search Algorithm Proposal | 253 comments (246 topical, 7 editorial, 1 hidden)