Building a Search Engine.

By byronm in Internet
Sat May 01, 2004 at 12:24:05 AM EST
Tags: Internet

Without a doubt, this has been one of the strangest and most absurd projects I have started so far. Not long ago, the idea that I could build a search engine capable of indexing the Internet as a whole seemed far away. Now it is becoming a reality. Without further ado, I wish to announce the early release of mozdex.com, an open search engine.


Mozdex.com was dreamed up from the belief that searching should be a science and a factual process rather than a proprietary and secretive one. Through the beauty of open source and the hard work of the Nutch team, we have been able to use Nutch to build a beta test index of nearly 50 million pages.

What we want to do is provide a search system where you can see how the algorithm ranks pages. The ability to see incoming anchors and references to a page gives more insight into the results. We feel that by working with an open API and algorithm, the mass of great minds on the Internet can work together to come up with an algorithm that doesn't lend itself so much to being cheated by "spammy" sites. The premise is that a well-thought-out algorithm can understand the basic tricks of the trade and react more quickly to new hacks and cheats used to "spam" indexes.
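
To make the idea concrete, here is a minimal sketch of what an inspectable ranking function could look like. This is not mozdex's actual scorer; the factors and weights are hypothetical. The point is only that every factor behind a score gets printed alongside it, so a searcher can see why a page ranked where it did.

// Hypothetical illustration of an "inspectable" ranking function -- not the
// real Nutch/mozdex scorer. Every factor is reported next to the score.
public class TransparentScore {
    public static double score(int termFreq, int docLength,
                               int matchingAnchors, double hostTrust) {
        double tf = termFreq / (double) Math.max(docLength, 1); // length-normalised term frequency
        double anchorBoost = Math.log(1 + matchingAnchors);     // incoming anchor-text matches
        double raw = (0.6 * tf + 0.4 * anchorBoost) * hostTrust;

        System.out.printf("tf=%.4f anchorBoost=%.4f hostTrust=%.2f -> %.4f%n",
                          tf, anchorBoost, hostTrust, raw);
        return raw;
    }

    public static void main(String[] args) {
        score(12, 800, 35, 1.0);   // a page with many matching inbound anchors
        score(40, 200, 0, 0.5);    // a keyword-stuffed page with no inbound anchors
    }
}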

Mozdex was initially seeded from the Dmoz.org directory. We imported the RDF dump and spidered out from those links to create our beta index; this is how we arrived at the name "mozdex," short for "the dmoz.org index." Over the next few days we will be spidering out and referencing a link throughput of at least 100 outbound and 100 inbound URLs, increasing the anchors and ranking of pages and creating a more balanced ranking index. Interestingly, because the data is limited to the subset of sites spidered out from dmoz.org, the results are "true" to that specific slice of the web. Oddly enough, in our smaller index subset, non-English sites often had more rank than a lot of their English counterparts.
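
For illustration, a rough sketch of how seed URLs could be pulled out of the dmoz.org RDF dump. The ExternalPage-based regex is an assumption about the dump format, and a real deployment would more likely use Nutch's own import tooling; this only shows the idea of turning the directory into a crawl seed list.

// Rough sketch: extract one seed URL per line from the dmoz content.rdf.u8 dump.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DmozSeeds {
    public static void main(String[] args) throws Exception {
        Pattern p = Pattern.compile("<ExternalPage about=\"([^\"]+)\"");
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = p.matcher(line);
                if (m.find()) {
                    System.out.println(m.group(1)); // one seed URL per line
                }
            }
        }
    }
}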

Through early May our goal is to hit the magical 250-million-page mark. At 250 million pages we have sufficient data to start ranking results based on anchors and analyzer algorithms. To reach this goal we have a network of two database servers using the Lucene index system (a Jakarta Project), with two terabytes of disk space on each server, as it generally takes about 10 KB per page to store the data and index segments. Our query farm is five P4s, soon to be joined by five more AMD Opterons with 16 GB of memory each. Through some early testing we realized that our biggest cost was rack and facility space, and that the performance and memory capacity of the Opterons offered us the best value. When thinking about query servers and indexes, the goal is to keep as much of the index segments in memory as possible for the quickest retrieval. The memory capacity and throughput of the Opterons is a great advantage in this arena.
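
A quick back-of-envelope check of those figures, assuming the quoted ~10 KB/page covers stored content plus index segments:

// Sanity-check of the sizing numbers quoted above (assumed ~10 KB/page).
public class CapacityCheck {
    public static void main(String[] args) {
        long bytesPerPage = 10 * 1024L;          // ~10 KB per page

        long pages = 250_000_000L;               // early-May target
        double totalTB = pages * (double) bytesPerPage / Math.pow(1024, 4);
        System.out.printf("250M pages  -> ~%.1f TB of data + index%n", totalTB);   // ~2.3 TB, fits the 2x2 TB DB servers

        long endOfYearPages = 2_500_000_000L;    // end-of-year goal
        double eoyTB = endOfYearPages * (double) bytesPerPage / Math.pow(1024, 4);
        System.out.printf("2.5B pages -> ~%.1f TB%n", eoyTB);                      // ~23 TB, roughly ten times the current disk
    }
}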

Obviously, data availability as well as query performance is crucial when building a large index, with not only the overhead of index maintenance but also query processing and day-to-day searches. Each query server has a master and a replication server that are load balanced and failed over. On our web tier we use Jakarta Tomcat JSP servers load balanced behind Squid. Squid offers us a highly efficient way to load balance, cache and tune the throughput of our server farm. Many hardware-based systems are built from Squid, so we are taking it a step further and using Squid as an integral part of providing a high-availability, high-performance web farm.
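
As a rough illustration of that setup, a hypothetical squid.conf fragment (Squid 2.6-style accelerator syntax; hostnames and the site name are invented) with Squid caching and round-robin balancing two Tomcat back ends:

# Hypothetical squid.conf fragment -- Squid as accelerator/load balancer in
# front of two Tomcat instances. Hostnames and defaultsite are made up.
http_port 80 accel defaultsite=mozdex.com

cache_peer tomcat1.internal parent 8080 0 no-query originserver round-robin name=web1
cache_peer tomcat2.internal parent 8080 0 no-query originserver round-robin name=web2

acl our_site dstdomain mozdex.com
http_access allow our_site
cache_peer_access web1 allow our_site
cache_peer_access web2 allow our_site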

I will be putting up a daily blog of activities, events and issues as we run into them, and this will be made available as a link from the search page. Open search isn't just the technology we run but the process and concepts we use to achieve our planned result of 2.5 billion pages by the end of the year.

We ask the kuro5hin community: what are your opinions and thoughts on such an index? Are webmasters, publishers and searchers generally interested in how they get the information that they are presented when they search? Do you think it is actually feasible to work on a process that stays ahead of the cheaters, or do you believe it is doomed to fail just because it's competing against the likes of Yahoo or Google?

If there is any interest in this subject, we would be more than happy to publish white papers on our network, our servers and what it takes to build an index capable of indexing and searching millions of pages. Through open technologies and the minds of the Internet we feel we can provide an invaluable search tool. Let us know what your opinion is.

Poll
Is there demand for open search?
o Yes 43%
o No 26%
o I don't Care 30%

Votes: 88

Related Links
o Kuro5hin
o Yahoo
o Google
o mozdex.com
o Nutch
o Dmoz.org
o Jakarta Project


Building a Search Engine. | 99 comments (68 topical, 31 editorial, 0 hidden)
General thoughts. (1.16 / 18) (#1)
by The Honorable Elijah Muhammad on Wed Apr 28, 2004 at 05:53:26 PM EST

(1) -1, buy an ad.
(2) You have no hope of competing on the same level as Google, Microsoft etc.
(3) ... I was trying to work in something about how K5 has no comment search, but I'm lazy.


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
my own thoughts (3.00 / 6) (#3)
by thankyougustad on Wed Apr 28, 2004 at 06:19:36 PM EST

I think because it pertains to the open source community, and it's also an explanation of the project, not just "hey come and use this sweet new search engine I made," the article has merit.
Also, who gives a shit if you can't compete with Google. Most people on the internet don't even know how to use Google correctly, and I don't think anyone is going to try to make a search engine to compete directly with them. I'll vote for this article.

No no thanks no
Je n'aime que le bourbon
no no thanks no
c'est une affaire de goût.

[ Parent ]
You miss the point (2.66 / 6) (#43)
by Wiggy on Thu Apr 29, 2004 at 11:27:29 AM EST

Most people on the internet don't even know how to use Google correctly

You missed it didn't you? The main reason Google is so successful is because it returns the result you want 99% of the time without you having to know how to use it. The majority of people are not using search technology for data mining.

Unless you can appeal to the 99% of users, you will fail. Catering for the 1% just isn't good enough.

Mini-me
[ Parent ]
It's not a business. (none / 2) (#86)
by scanman on Sat May 01, 2004 at 02:58:20 PM EST

Since he's doing this as a hobby, the only way it can "fail" is if he decides it isn't fun anymore, no matter how many people use it.

"[You are] a narrow-minded moron [and] a complete loser." - David Quartz
"scanman: The moron." - ucblockhead
"I prefer the term 'lifeskills impaired'" - Inoshiro

[ Parent ]

He already has (3.00 / 10) (#14)
by it certainly is on Wed Apr 28, 2004 at 09:10:26 PM EST

See here. You must mean "buy another ad".

kur0shin.org -- it certainly is

Godwin's law [...] is impossible to violate except with an infinitely long thread that doesn't mention nazis.
[ Parent ]

he probably should (2.66 / 6) (#16)
by j1mmy on Wed Apr 28, 2004 at 09:14:22 PM EST

the link in his ad is broken.


[ Parent ]
wow. (none / 3) (#22)
by ZorbaTHut on Wed Apr 28, 2004 at 10:54:20 PM EST

Competence, meet Byronm. I see you haven't been introduced.

[ Parent ]
Infrastructure (3.00 / 7) (#5)
by dennis on Wed Apr 28, 2004 at 06:34:04 PM EST

If you can make it a peer-to-peer algorithm so nobody has to buy 100,000 servers, then maybe this has potential. Don't ask me how to do it, though!

I've been wondering about this (none / 3) (#10)
by whazat on Wed Apr 28, 2004 at 07:38:03 PM EST

We would basically need the following.
  1. An automatic method of identifying rogue members of the peer group, so that they can be distrusted by the web portal.
  2. Public-key encryption (or signing) of messages, so that the web pages that query the peer group can be sure they are not talking to spam-central peers (a minimal sketch follows at the end of this comment).
The first is the killer. Without it, the p2p system will be gamed by all the advertisers, bloggers and spammers. Google has enough trouble as it is, without distributing its system across nodes that can be physically accessed.

There seem to be two alternatives: real AI, or some system where user feedback is used to determine which servers are crap. How easy it is to stop servers posting, and how easy it is to get a new server used by the p2p system, will determine how easy it is to game the system.

Of course I may be missing some simple solution.
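
For what it's worth, point 2 is the easy half. A minimal sketch using the standard java.security API, with keys and payload purely illustrative:

// Minimal illustration of point 2 above: a peer signs its result set so the
// portal can verify it came from a key it already trusts. Key distribution,
// trust and revocation (point 1) are the hard part and are not shown.
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignedResults {
    public static void main(String[] args) throws Exception {
        KeyPair peerKey = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        byte[] results = "url1\nurl2\nurl3".getBytes(StandardCharsets.UTF_8);

        // Peer side: sign the serialized result list.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(peerKey.getPrivate());
        signer.update(results);
        byte[] sig = signer.sign();

        // Portal side: verify against the peer's registered public key.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(peerKey.getPublic());
        verifier.update(results);
        System.out.println("results authentic: " + verifier.verify(sig));
    }
}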

[ Parent ]

Another way (none / 3) (#33)
by liquidcrystal3 on Thu Apr 29, 2004 at 08:36:14 AM EST

"I may be missing some simple solution"

How about just increasing the trust/position of sites that users don't come straight back from, and reducing it for those that they do. Making users log in, or using cookies, would mean that you could build up trust in users too.

[ Parent ]

Hmm (none / 3) (#36)
by whazat on Thu Apr 29, 2004 at 08:59:42 AM EST

What do you mean build up trust of users?

Whatever you mean, it got me thinking. Any spammer could set up an automatic system of Perl scripts to use the service and leave positive feedback for their servers and negative feedback for legitimate servers. Unless you have some form of verification that users are human (image recognition), I don't think it is worth it. Otherwise, how do you tell legitimate users from spam users?

[ Parent ]

Easy (none / 0) (#89)
by greenrd on Sun May 02, 2004 at 05:32:08 PM EST

Unless you have some form of verification of users being humans (image recognition) I don't think it is worth it. Else how do you tell legitimate users from spam users?

Biometric ID cards with digital signature capabilities, whose private key is only known to the Gubbmint!

David Blunkett is a frickin' genius!!!


"Capitalism is the absurd belief that the worst of men, for the worst of reasons, will somehow work for the benefit of us all." -- John Maynard Keynes
[ Parent ]

The paradox of an open server (2.80 / 5) (#11)
by whazat on Wed Apr 28, 2004 at 07:57:01 PM EST

Say I decide to set up a commercial server using Nutch. I want to make my server different from any other Nutch server, so it will stand out from the crowd.

I am under no obligation to give these changes back to the community, because I am not distributing any software. So the user still won't know the bias put in by the search engine, despite it being based on open source.

What I would like to see is an open source search engine that could be specialised by subject, so that I could run a search engine on plasma fusion, or a specialist subject of my choice. Also give it a non-graphical interface, so that other people could collate my results. That I might find useful.

Just wondering... (2.80 / 5) (#12)
by JahToasted on Wed Apr 28, 2004 at 08:48:30 PM EST

why doesn't google ban the sites that abuse it? I mean if I create a bunch of bogus sites and link back to my main site to increase my ranking, why doesn't google simply delete my site, and all of the sites that link to it (ie. the bogus sites)? Yeah, a few legit sites might get delisted, and people will bitch and moan, but it would be better than having every second result going to some crap site trying to sell junk.

So why doesn't google just ban the bad sites?
______
"I wanna have my kicks before the whole shithouse goes up in flames" -- Jim Morrison

It does ban abusive sites. (none / 3) (#13)
by it certainly is on Wed Apr 28, 2004 at 09:07:10 PM EST

But it's kinda difficult to spot immediately in the whatever billion pages they index. That's why they try to avoid doing it manually.

They also don't want to kill innocent sites accidentally, as that would be terrible press. So they can't just press the 'delete this and everything it links to' button.

kur0shin.org -- it certainly is

Godwin's law [...] is impossible to violate except with an infinitely long thread that doesn't mention nazis.
[ Parent ]

it scores them (none / 3) (#15)
by j1mmy on Wed Apr 28, 2004 at 09:11:42 PM EST

poorly.

PageRank isn't all about who links to whom. Google says as much on their own pages. Google penalizes link-farm-style pages, and the pages linked to from them. I think it was SearchKing that sued them for implementing anti-link-farm scoring, since link farms were SearchKing's business model.

[ Parent ]

So, by linking to searchfarm (none / 2) (#20)
by JackStraw on Wed Apr 28, 2004 at 10:37:56 PM EST

You've just listed kuro5hin as a spam page?

Good job ;-)


-The bus came by, I got on... that's when it all began.
[ Parent ]

Don't worry (3.00 / 7) (#38)
by CaptainSuperBoy on Thu Apr 29, 2004 at 09:32:08 AM EST

Rusty doesn't allow Google to index comments, because the comment search works so great here.

--
jimmysquid.com - I take pictures.
[ Parent ]
then explain this (none / 1) (#58)
by danharan on Thu Apr 29, 2004 at 09:00:26 PM EST

Google search for link:jimmysquid.com

[ Parent ]
Those are from /story (none / 3) (#60)
by CaptainSuperBoy on Thu Apr 29, 2004 at 09:55:28 PM EST

You n00b.

--
jimmysquid.com - I take pictures.
[ Parent ]
Indiscriminate listing (none / 0) (#99)
by Pkchukiss on Sat Sep 04, 2004 at 07:41:34 AM EST

I think Google feels strongly about listing all the webpages that it can. After all, as a search engine, it would have failed had it delisted any webpage which misused it. Somebody would have a use even for that offending website. One solution would be to bury the website at the bottom of the ranking board, to discourage cheating.

________________
Ignorant no more
My blog
[ Parent ]
p2p search engine (3.00 / 4) (#17)
by speek on Wed Apr 28, 2004 at 09:26:29 PM EST

I'd like to see a search engine that trades speed for brains. Something like a p2p app that sends out search requests, and nodes return information that particular node has indexed. A node might index sites based on the browsing behavior of the node owner (I say "based on" because it has to be careful not to violate the owner's privacy/security). Basically, I'm thinking of an implementation of the kind of search you read about in sci-fi stories where the person sends out "agents" that return info. Well, sending out agents won't work because who's going to agree to run your agent? Hmm, maybe a p2p app could accomplish the same goal? You send out search requests (complicated requests) and results filter back in over the next few hours/days.

--
al queda is kicking themsleves for not knowing about the levees

www.grub.org [nt] (none / 2) (#25)
by Truffle Hunter on Wed Apr 28, 2004 at 11:38:26 PM EST



[ Parent ]
It's not p2p (none / 1) (#37)
by whazat on Thu Apr 29, 2004 at 09:05:31 AM EST

I can't find where to download the server. As such it is still a client-server architecture.

[ Parent ]
YouSearch (none / 2) (#61)
by KWillets on Thu Apr 29, 2004 at 10:42:11 PM EST

YouSearch is a distributed index for personal webservers. It uses a Bloom filter (basically a bitmap of word hashes) to summarize which keywords are where, and stores them on a hub.

The words in each query are mapped to their corresponding bits in the Bloom filter and routed to the peers which have matching content. It's one way to reduce queries to non-matching peers.

Still, no one has cracked the problem of searching millions of peers.
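
A toy sketch of the Bloom-filter idea described above, with arbitrary sizes and hashing: each peer publishes a bitmap of word hashes, and a hub only forwards a query to peers whose bitmaps match every term.

// Toy Bloom filter summarizing a peer's vocabulary. May return false
// positives (query routed to a peer with no real match), never false negatives.
import java.util.BitSet;

public class WordBloom {
    private static final int SIZE = 1 << 16;
    private final BitSet bits = new BitSet(SIZE);

    private int[] positions(String word) {
        int h1 = word.hashCode() & 0x7fffffff;
        int h2 = (word + "#salt").hashCode() & 0x7fffffff;
        return new int[] { h1 % SIZE, h2 % SIZE };
    }

    public void add(String word) {
        for (int p : positions(word)) bits.set(p);
    }

    public boolean mightContain(String word) {
        for (int p : positions(word)) if (!bits.get(p)) return false;
        return true;
    }

    public static void main(String[] args) {
        WordBloom peerSummary = new WordBloom();
        for (String w : "plasma fusion tokamak".split(" ")) peerSummary.add(w);
        System.out.println(peerSummary.mightContain("fusion"));   // true
        System.out.println(peerSummary.mightContain("knitting")); // false (almost certainly)
    }
}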

[ Parent ]

The search is terrible (3.00 / 12) (#21)
by JackStraw on Wed Apr 28, 2004 at 10:40:23 PM EST

Try searching for kuro5hin, or even Google! You get random pages within those sites; not the pages that you'd actually want to go to.

Not that I don't love the idea, but it certainly is far from being useful.


-The bus came by, I got on... that's when it all began.

I must agree... (none / 3) (#52)
by Elendale on Thu Apr 29, 2004 at 03:49:03 PM EST

Maybe I'm being biased, but if I type "google" or "google search" into your search engine and the first result is not only not www.google.com but not even a site that begins with www.google.com, I find your search engine of questionable use. The question I, as someone who might like to use another search engine, might ask is: what is your search engine good at?
---

When free speech is outlawed, only criminals will complain.


[ Parent ]
From what I understand (2.47 / 21) (#23)
by Dr Phil on Wed Apr 28, 2004 at 10:55:48 PM EST

The fastest text search method is to make a flat text file of the entire web and use highly optimised assembly algorithms to search it in a matter of seconds. At least that's what localroger told me.

*** ATTENTION *** Rusty has disabled my account for anti-Jewish views. What a fucking hypocrite.
Encourage: localroger's blunder is still funny! /n (none / 3) (#51)
by gilrain on Thu Apr 29, 2004 at 03:30:19 PM EST



[ Parent ]
You asked (2.55 / 9) (#24)
by tricknology2002 on Wed Apr 28, 2004 at 11:00:52 PM EST

[D]o you believe it will be something doomed to fail just because its competing against the likes of Yahoo or Google?

No. I'm sure there will be reasons for failure other than Google and Yahoo. But those two will probably be at the top of a long list.

1, discourage, uses the word "Yahoo" (1.00 / 6) (#50)
by kpaul on Thu Apr 29, 2004 at 03:09:34 PM EST

no text
[ Parent ]
Scaling (2.75 / 4) (#30)
by helianthi on Thu Apr 29, 2004 at 05:31:05 AM EST

Great idea, but before spending too much energy on this, think twice about the scaling factor. Building a 50-million-page index doesn't mean at all that your engine will handle 5 or 50 billion. Think big right from the beginning.

More Scaling details... (none / 1) (#40)
by byronm on Thu Apr 29, 2004 at 10:33:14 AM EST

I've added more details on the scaling and our testing of scalability. The software contains the resources to scale; however, it's really dependent on the architecture and servers used as well. Google goes with many cheap servers with less memory because they have more access to facilities to manage them. We are aiming at a less power-hungry, lower-density but higher-performing network through denser memory systems and, of course, the faster CPUs, switches and technology available today.

[ Parent ]
um (2.75 / 4) (#49)
by reklaw on Thu Apr 29, 2004 at 03:01:40 PM EST

is there some reason why that search engine gives me a results screen in something that looks like Spanish ["Resultados 1-10 (de un total de 6.531 documentos)" and "Search" changes to "Buscar"] no matter what I search for?
-
My take on why (none / 1) (#68)
by Highlander on Fri Apr 30, 2004 at 03:15:08 AM EST

I would guess that your browser language preferences are set to include spanish.

Other reasons could be weird browser cookies or extra arguments included in the URL.

Or maybe the server detected you were located in the south of the US :-P

Moderation in moderation is a good thing.
[ Parent ]

You hit a bug we just fixed (none / 2) (#79)
by byronm on Sat May 01, 2004 at 11:02:52 AM EST

What happened is that the Jakarta server was running on a standard Fedora kernel that apparently locks user processes to a ulimit of 2048 open files (unless run by root). We have since compiled our own kernel and will be rebooting shortly. Apparently the JVM couldn't open any more files, so it just started reading from the first language in the language set that it could open, which happened to be Spanish, I believe :)
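
For anyone hitting the same wall: the usual fix is to raise the per-process open-file limit for the user running the JVM rather than rebuild the kernel. The values below are illustrative.

# check the per-process open-file limit for the current shell
ulimit -n

# raise it in the script that launches Tomcat, before the JVM starts
ulimit -n 8192

# or make it permanent for the tomcat user in /etc/security/limits.conf:
#   tomcat  soft  nofile  8192
#   tomcat  hard  nofile  8192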

[ Parent ]
We need alternatives (2.83 / 6) (#53)
by QuickFox on Thu Apr 29, 2004 at 05:02:31 PM EST

Our dependence on Google makes us very vulnerable. Suppose Google starts charging users five dollars for each search. Or something. There should be alternatives. So this is a great idea.

But to make it useful you need to make it much faster than it is now. Why is it so slow, even though you seem to describe powerful hardware? And of course it must return much better search results.

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi

charging users (none / 2) (#62)
by adimovk5 on Thu Apr 29, 2004 at 11:04:55 PM EST

If Google starts charging common users, it will stop being used and die. Most people will not pay for a fantastic search engine if an adequate engine is free.

[ Parent ]
Exactly (none / 1) (#67)
by QuickFox on Fri Apr 30, 2004 at 02:02:50 AM EST

That's my point. That's why we need alternatives.

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi
[ Parent ]
alternative search engines (none / 1) (#69)
by adimovk5 on Fri Apr 30, 2004 at 09:56:58 AM EST

Alternative search engines already exist. Go to searchenginewatch.com and you will find plenty of competition.

Google dominates at the moment because it has the most successful idea. It rates sites according to the number of links to a site. Theoretically this makes the most popular sites the highest rated. The entire internet votes on searches.

About a dozen or so major players are competing against Google. Niche search engines are also starting to become popular. They are useful for only certain areas but are very useful in their specialty. Also, the next technology is waiting around the corner. Google is the top dog now but its position is precarious rather than solid.

[ Parent ]

Slow because it's beta :) (none / 2) (#82)
by byronm on Sat May 01, 2004 at 12:10:09 PM EST

You guys aren't hitting the hardware that is on our "production" network yet. We are also busy working on the code, bouncing the servers and testing out different parameters. I expect by tonight we will have the newer index running on the new servers (at least 3 of them), so that should give you a lot better performance.

[ Parent ]
nice search engine, but needs some work.. (none / 3) (#54)
by Suppafly on Thu Apr 29, 2004 at 05:08:56 PM EST

You need to work on the algorithm that orders the results. If I search for openbeos, the main openbeos website should be first or 2nd, not like 8th.
---
Playstation Sucks.
True (none / 1) (#78)
by byronm on Sat May 01, 2004 at 11:00:53 AM EST

If you search for OpenBeOS right now, we are only sampling about 1-2% of the internet, so based on what was pulled from dmoz.org as our seed set, OpenBeOS's main website shows up as #8. I'm sure that as we reach out to more sites that raise the weight of that domain, it will naturally rise in the SERPs.

[ Parent ]
On gaming and finances (3.00 / 4) (#59)
by danharan on Thu Apr 29, 2004 at 09:29:46 PM EST

There is a lot of money to be made in search engines with text ads, and it would be rather cool if a search engine's revenues were to finance OSS projects like those you use.

Second, about gaming. Jakob Nielsen suggested an extremely simple system that could get rid of most of it: give people a toolbar that lets them give a page a thumbs up or thumbs down. As long as my votes are not personally identifiable, this might also be a marketing gold mine as well as a way to give people some ownership over their search engine.

Where the big G was revolutionary was in using metadata. Rather than simply take a document -or god forbid, its meta-tags- at face value, it used links. The only way we are going to have another dramatic increase in search quality is by tapping another type of metadata. The Open Directory is a good start, but it too has flaws. Editors can introduce bias, old sites are rarely updated, and new sites take forever to be listed.

As an aside, there is one meta-tag that I would like to see taken at face value: the ICBM address.

---

Oh, I really like the idea, and think it has economic potential too. I also like the fact that I'll be able to figure out why fine art greeting cards does not bring up a site I'm redesigning although the same expression in quotes does.

Heh (none / 0) (#91)
by ZorbaTHut on Mon May 03, 2004 at 02:07:35 AM EST

give people a toolbar that lets them give a page a thumbs up or thumbs down

Oh, yes, that's a great idea. And then when you get seven million "down" votes for your own home page, at the same instant, from computers spread all over the world, maybe it won't seem like such a good idea anymore. :P

Remember: unless you take steps to verify that it's a human, *any* "poll the user" system can be easily corrupted by botnets. And I doubt people are going to want to decipher messed-up images with text in them just to vote on a website.

[ Parent ]

Not likely (none / 0) (#92)
by danharan on Tue May 04, 2004 at 06:43:13 PM EST

a) that I would get that much traffic
b) that it would matter

One nice thing to do with such a database is to see what other people with a voting record similar to yours liked and disliked- only bot-like voters will know what bots felt about a certain page.

Several other counter-measures could be taken, and together there would be no point in using bots- save perhaps as a crude DOS.

[ Parent ]

Missing the point (none / 0) (#93)
by ZorbaTHut on Tue May 04, 2004 at 08:10:27 PM EST

My point is that the existence of this toolbar would give companies another way to influence page ratings artificially. Whether your particular site is popular or not, you can't say to me that's a good thing - that's what's currently causing Google's trouble to begin with.

And I don't believe your vague "other counter-measures" comment either. If it were so easy to do countermeasures, why is Google having the trouble that it is?

Countermeasures aren't easy - any test that a computer can perform, a computer can bypass. The reverse is not true.

(Proof: Given the code for a "computer-vs-human" tester, and assuming it is possible for it to return "human", it is possible for a human to reverse-engineer the code and figure out the behavior that will trigger "human" - worst-case, by simply performing a series of actions himself, then letting the computer duplicate it with minor variations.

Alternatively, given an environment where the sole input is a single boolean bit and the user is told to choose randomly, it is clearly impossible to detect reliably whether the user is a computer or not.)

[ Parent ]

not missed.... (none / 0) (#94)
by danharan on Tue May 04, 2004 at 08:34:20 PM EST

How on earth could you effectively game Amazon's system that lets you know that "people that liked this book also liked XYZ"?

Add to that the fact that you would have data for several sites, and it's hard to fake. "People who liked the same site A, B, and C that you liked also enjoyed D."

Use a unique identifier in the toolbar, one that requires a download. Discount ratings from people or IP addresses that are sending disproportionate amounts of ratings. Occasionally give people an opportunity to do more than rate as good/bad, e.g. ask them which of two sites they prefer.

Methinks that would be near impossible to game. If you think it is, why don't you just say how?

[ Parent ]

Well (none / 0) (#95)
by ZorbaTHut on Tue May 04, 2004 at 11:12:24 PM EST

I don't know how amazon's system works. It could be that it only works based on what people have actually bought, which would be smart. Or it could be that it also works based on what people say they're interested in, or visit in close consecutive order, or add to their shopping cart, in which case you could - as a rather dumb naive solution - simply create a metric ton of accounts and add the items to them that you're trying to associate with each other. Any extra complexity would just be to get around the bot detection.

So here's how to game it. Write worm, or piggyback on existing worm backdoor. Add a payload that downloads the toolbar and extracts the unique identifier. Send occasional ratings at about the same frequency as a normal person (say, half a dozen a day), picked mostly randomly from their IE cache, and occasionally putting in one of the "interesting" sites that we're trying to modify. When asked on site preference, pick randomly if it doesn't involve anything we're interested in, otherwise pick our site (either 100% of the time, or simply at a higher weighting.)

Doesn't trigger the "disproportionate-amount-of-rating" sensor because, hey, you're not getting disproportionate ratings from anyone in particular. Since it's effectively a gutted "real" install of the product (if you wanted to be truly insane, you could set up some sort of a VM, but this is probably safe until they change the protocol on you, which they'll try to avoid since it would kill any existing installations) there's no other way to detect it.

Methinks that would be quite effective. If you can think of a foolproof way to block that, tell me how, and I'll show you how I'd get around it.

[ Parent ]

well, yeah.... (none / 0) (#96)
by danharan on Thu May 06, 2004 at 05:45:00 PM EST

Technically, what you say is possible. Practically, though, it seems dubious.

If a person is talented enough to hack a worm that goes undetected by the major AV software makers, undetected by our software.... well, maybe they could affect something that counts for maybe 1/3 of the ranking of a page.

But such a person would be sufficiently talented to succeed in any number of legal programming pursuits so I don't really see the point, though your suggestion is clever.

[ Parent ]

Who said (none / 0) (#97)
by ZorbaTHut on Fri May 07, 2004 at 03:29:16 PM EST

undetected?

I don't see any reason a worm has to be undetectable to spread. And you wouldn't be able to detect it without having software installed on the client, so you'd have no way to detect whether incoming votes are valid or invalid.

And I don't see the point either, personally - but look at the quantity of spyware, adware, and worms out there. And look at how many worms don't do anything financially useful - they're not even making money off it!

Now I'm going to go and say something that might come as a surprise - I think a toolbar with up/down votes *would* be useful. At worst, it wouldn't be less useful than nothing (and probably more useful, in the sense that it would give people a sense of contributing to the results.) But saying "oh, I've got a way to get rid of abuse! Use a voting system!" seems, to me, the height of naivety :)

[ Parent ]

+1FP, but... (3.00 / 6) (#63)
by pb on Thu Apr 29, 2004 at 11:45:04 PM EST

Although this is a topic that I am interested in, (and we could always use some more tech around here) I have some serious questions.

What is your relation to the Nutch project?

I see that you use the same icons, but I don't see any visible links to them, or indeed any mention of them. Are you guys one and the same, or do you just like to rip off people's websites and freely available source code?

...which brings me to my second question.

Where is the source?

If you profess to have or want some sort of 'open' API or algorithm, then you might--at the least--want to provide a download for it. I understand that it may take time to fully document your (or The Nutch Project's) algorithms and API's; it shouldn't take nearly as much time to put up a tarball of the source. In fact, I managed to get one from the (unmentioned on your site) Nutch project's website.

Whoa, why do your search results suck so badly?

Now, I understand that this is a "BETA" project, but... some simple suggestions. First, cluster results from the same site, and list the base URL first. A simple search for 'slashdot' yields all of their various servers as separate results.

Meanwhile, a search for 'mozdex' seems to favor logs of your spidering on other sites. I guess some tweaking is in order here.

What made you pick 'Lucene' and 'Nutch' for indexing and searching, respectively?

There are lots of search engines already out there, open ones in fact. Do these scale better? Do they do something radically new? Are all the other ones just meant for searching a local site? etc.

What exactly are your algorithms?

A simple summary will do. I guess this is in the same place that the source is.

Directions for the future?

When I was playing around with searching, I stumbled upon latent semantic indexing and vector-space searching; in fact, I wrote a little search engine based around it. I rather like the idea of it. You could add a button or a link to every page that said "find pages like this", and it would actually work... maybe.

I know Google uses something like this for Google News; I don't know what they do for their "similar pages" links.
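
A bare-bones sketch of the vector-space half of that idea (real LSI would first reduce the term space with an SVD, which is skipped here): documents become term-frequency vectors, and "similar pages" are the ones with the highest cosine between vectors.

// Toy vector-space similarity of the sort a "find pages like this" link could use.
import java.util.HashMap;
import java.util.Map;

public class SimilarPages {
    static Map<String, Integer> vector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) v.merge(w, 1, Integer::sum);
        return v;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int f : b.values()) nb += f * f;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> query = vector("open source search engine index");
        System.out.println(cosine(query, vector("an open source engine that indexes the web")));
        System.out.println(cosine(query, vector("fine art greeting cards for sale")));
    }
}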

Good luck, and remember... if you're going to open yourself up to public scrutiny, expect people to start asking you questions. :)
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

New features to come (none / 1) (#77)
by byronm on Sat May 01, 2004 at 10:59:35 AM EST

Spell checking, "stemming" and "clustering" (what you describe as "more pages on xyz...") are coming. The clustering is part of the Carrot2 project that we are incorporating. Carrot2 defaulted to an interface that is only compatible with IE and Netscape, so we are doing some work to fit it into a standard search engine result page. As for the links and credits to the applications we use, those are being posted in our next build. Thanks for the feedback!

[ Parent ]
Interesting, but poorly explained (2.66 / 6) (#64)
by coryking on Fri Apr 30, 2004 at 12:50:14 AM EST

It seems as though you've failed at one of the primary goals of your site: to make it public why the engine is ranking stuff the way it does. And you've failed that goal miserably.

Sure you say "cory, read the fucking FAQ (idiot)" but that is not good enough. In fact, I am sure if I read your FAQ, you'd explain it. But you know what? I don't care about your FAQ, I haven't read the FAQ and I don't care to. Neither does 99.99999999999999999999999% of your future, hypothetically large audience. Your "explanation" link doesn't explain anything to anybody except the people who wrote your engine.

If you want to meet your (very interesting) goal of informing your users why they got the results they did, you had better clean this up. I don't think ANYBODY but somebody closely involved in your project can understand that page. To everybody else, it's Greek. Think of all the people who use Google. Your grandma, your mom, your girlfriend, your teacher, co-worker, garbage man. EVERYBODY uses Google. Could your garbage man understand that? NO. Your grandma? No. You either need to rethink your goal, or revisit that page and make it easy to understand.

A few links wouldn't hurt (none / 1) (#71)
by KWillets on Fri Apr 30, 2004 at 02:17:10 PM EST

The page doesn't have to be readable at a grade 6 level, but it wouldn't hurt to put a link behind each function named in the explain page. For instance, "tf" is apparently related to term frequency, but I don't know how.

[ Parent ]
The results are based on the subset spidered (none / 2) (#76)
by byronm on Sat May 01, 2004 at 10:57:59 AM EST

The results that you link to are based on the 40-50 million pages currently in the beta index. I even stated that I thought it was weird that, by sampling the dmoz.org pages, a lot of what YOU consider to be highly ranked pages (based on a specific algorithm) aren't ranked that way here. There is some tweaking that Google does to keep Google at the top; if you go to each country-specific Google, they show their own search as the most relevant result. Our index is growing "naturally" and we haven't applied any "rules" to it yet, nor have we isolated languages from each other yet. When we do apply rules, these will be made public. However, we won't have a good base to work from until we reach at least 250 million pages, as that will give us a system with enough inbound/outbound links and anchors to work with. Thanks for your comments though :)

[ Parent ]
(-1), buy an ad (1.20 / 10) (#66)
by guyjin on Fri Apr 30, 2004 at 01:17:22 AM EST

[nt]
-- 散弾銃でおうがいして ください
Al Gorithm (2.75 / 4) (#70)
by adimovk5 on Fri Apr 30, 2004 at 10:08:56 AM EST

The premise being that a well thought out algorithm can understand the basic tricks of the trade and more quickly react to new hacks & cheats used to "spam" indexes.

Spammers and cheats have to guess at the inner workings of search algorithms. They stay one step behind sites like Google who can discover their fakery and change the rules of the game accordingly.

A visible algorithm will allow spammers and cheats to more efficiently exploit your algorithm. It will be rendered useless in a short time and so will your search engine. The defenders will be acting out of altruism and good will. The attackers will be acting out of greed and self interest. Who do you think will win?



We will win (none / 1) (#81)
by byronm on Sat May 01, 2004 at 11:15:43 AM EST

We already know the most common cheating methods, and usually the only new ones that get exploited come through new technologies and processes that proprietary systems can't adapt to. The biggest cheats today are invisible keywords (setting text and background to the same color), keyword stuffing in URLs, hidden links in pages, and stuffing these across networks of sites that are all on the same topics. There are also CSS, JavaScript and other cheats that came about as those technologies were adopted by publishers. We have the added bonus that now that Google is going for an IPO, they will have to do what their investors say, not what the community demands. So with the added "brain power" and the extra agility we have access to, we can adapt and react to new spamming techniques much more quickly. Our risk is that we piss off a few people, but at least we don't have an investor dictating that; instead we have a community of people looking for relevant searches and providing feedback.
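
As one concrete example of the kind of rule an open ranker could publish, a toy check for the first cheat mentioned (text styled the same color as its background). Real detection is much messier (external CSS, near-matching colors, off-screen positioning), so treat this as illustration only.

// Toy heuristic: flag inline styles whose text color equals the background color.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiddenTextCheck {
    private static final Pattern STYLE = Pattern.compile(
        "color\\s*:\\s*(#[0-9a-fA-F]{3,6}).*?background(?:-color)?\\s*:\\s*(#[0-9a-fA-F]{3,6})",
        Pattern.CASE_INSENSITIVE);

    public static boolean looksHidden(String styleAttribute) {
        Matcher m = STYLE.matcher(styleAttribute);
        return m.find() && m.group(1).equalsIgnoreCase(m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(looksHidden("color:#ffffff; background-color:#ffffff")); // true
        System.out.println(looksHidden("color:#000000; background-color:#ffffff")); // false
    }
}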

[ Parent ]
You will lose (none / 2) (#83)
by adimovk5 on Sat May 01, 2004 at 01:29:54 PM EST

You have no more a chance of defeating the spammers and other cheats than the anti-virus people have of defeating the virus makers.

[ Parent ]
Wow... (none / 3) (#85)
by bigchris on Sat May 01, 2004 at 02:06:21 PM EST

... that sounds remarkably similar to the Security through Obscurity argument. You know, the one that goes: if we make our code secret, nobody will ever be able to hack us! We'll be perfectly safe.

Except that argument doesn't work (how many flaws have we found in IIS that got exploited?)

Right now, we have a search engine (Google) which does a remarkable job of getting around cheats who try to game the system. But how do we know that, and how do we know we're getting a reasonably unbiased result when we do a search of the Internet? We don't. With an open system we can see what's going on behind the scenes, and then implement anti-gaming measures when necessary.

---
I Hate Jesus: -1: Bible thumper
kpaul: YAAT. YHL. HAND. btw, YAHWEH wins ;) [mt]
[ Parent ]

secret code (none / 1) (#87)
by adimovk5 on Sat May 01, 2004 at 05:55:31 PM EST

nobody will be ever able to hack us

He is claiming that making the code public will give him a better chance of defeating the enemy. I am claiming that keeping the code secret gives you a better chance. I am not claiming that making the code secret locks out the enemy completely.

With an open system we can see what's going on behind the scenes, and then implement anti-gaming measures when necessary.

In an open system, exploiters of the code will see immediately that they have been countered and how it was done. They can then immediately counter your move secretly.

how do we know we're getting a pretty unbiased result when we do a search of the Internet? We don't.

The best way to judge the bias of a system is with reason and intellect. Don't blindly accept the search results on the first page as the best for you. Dig a few pages into the search. Use more than one search engine and compare results. In other words, do some research.

The internet has brought the world's libraries, newsrooms, and salons to our homes and offices. Would you stop researching upon obtaining the first book or newspaper in a real world library?

[ Parent ]

google won't change much (none / 0) (#88)
by emmons on Sat May 01, 2004 at 10:11:39 PM EST

You'd better rethink your premise.

Google employs the best and the brightest in the world when it comes to search, and even as a public company and even if it were set up to be at the mercy of market control, it would still be in the best interest of the company to beat those who would try to cheat the engine. Google makes money because it delivers the best results. How will its being a public company change that?

That's not to say that I won't wish you luck and hope you succeed, you should just have a more realistic idea of who you're competing with.

---
In the beginning the universe was created. This has made a lot of people angry and been widely regarded as a bad move.
-Douglas Adams

[ Parent ]

Why This is Good (2.80 / 5) (#75)
by KWillets on Sat May 01, 2004 at 12:29:16 AM EST

(I posted this earlier during voting, but mistakenly left it as an editorial comment.  So, I'll just repost it.)

This is, as the author admits, an incomplete project, but the idea is well worth exploring.

The Internet was conceived as an open system, where almost everything is accessible to all users, over open protocols. While the early Web was built on link navigation or "web surfing" to find information, search rapidly became the dominant access method (in fact, I think hyperlinking is actually declining, in usage if not in volume, but that's just a guess so far).

As search has become dominant, we have seen the rise of the private search engine, with proprietary crawling, indexing, and ranking technology. The obvious implication is that the basic web framework is simply broken; almost all web users must go through centralized search engines to find content, and these engines must guard their ranking methods to keep from being manipulated.

So, the current situation is that we rely on trusted intermediaries to interpret the web for us, using occult algorithms. In short, it's the Middle Ages.

Given the effectiveness of search as a basic access method, the only way out is to remove the mystery from the search and ranking algorithms. Mozdex may be on the right track, but the ultimate result is going to have even more user control over ranking factors. The idea that one ranking method should work for all searches should be discarded, and replaced with a diverse range of user-driven query methods.


Thanks for the better speech :) (none / 2) (#80)
by byronm on Sat May 01, 2004 at 11:07:57 AM EST

I admit I'm horrible at writing down my ideas and documenting what I do, but you did put it into better terms than I did :) There isn't truly one algorithm that will work, and we believe that through open source we will be able to adapt and work toward some sort of "process" that works. We are working with Carrot2 to add search clustering, so related and topical searches will be included and people will be able to use more natural English phrases to get better results. This will hopefully be in by tomorrow, after some initial testing. The website is also being updated to reflect all of the software we run and the websites that offer it, as well as access to the mailing lists and SourceForge repositories that we use. Thanks for all the feedback, everyone!

[ Parent ]
Ummm... cool idea... however... (none / 0) (#84)
by bigchris on Sat May 01, 2004 at 01:58:02 PM EST

currently it's spitting out Apache Tomcat/4.1.30 errors.

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: Lock obtain timed out
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:254)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:199)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:700)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:584)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:534)

root cause

java.io.IOException: Lock obtain timed out
    at org.apache.lucene.store.Lock.obtain(Lock.java:97)
    at org.apache.lucene.store.Lock$With.run(Lock.java:147)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:99)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
    at net.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:43)
    at net.nutch.searcher.NutchBean.init(NutchBean.java:71)
    at net.nutch.searcher.NutchBean.<init>(NutchBean.java:60)
    at net.nutch.searcher.NutchBean.<init>(NutchBean.java:50)
    at net.nutch.searcher.NutchBean.get(NutchBean.java:42)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:65)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:210)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:199)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:700)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:584)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:534)


---
I Hate Jesus: -1: Bible thumper
kpaul: YAAT. YHL. HAND. btw, YAHWEH wins ;) [mt]

doesn't work atm (none / 1) (#90)
by neetij on Sun May 02, 2004 at 07:09:46 PM EST

Well, it doesn't seem to be working... a search query for 'mozdex' (in IE/Opera) hasn't shown results for a few minutes. Something sure is lacking.

Been working for a while now (none / 0) (#98)
by byronm on Mon May 10, 2004 at 09:36:20 PM EST

you probably hit it when we were adding the new servers :)

[ Parent ]