Kuro5hin.org: technology and culture, from the trenches

Page Ranking, Spamdexing and Open Source

By vryl in Internet
Fri Dec 29, 2000 at 07:19:00 AM EST
Tags: Technology (all tags)

I love Google: it has great results but a few cretinisms in its search functionality, so occasionally I have to use Raging, which has better search functionality but a generally poorer ranking system.

What I really want to do is roll my own search engine and ranking system.


So I got to thinking, what with WebQL and more, why don't the search engines open up their back end with some sort of API and scripting language, and then sponsor a site with Open Sourced front ends and ranking systems?

Think about it, the best ranking systems could be voted on by use, and people could contribute to making them more effective.

A ranking system is a constantly moving target, as spamdexers are constantly reverse engineering the so-far hidden and proprietary systems to push their (usually porn) sites up the rankings.

At the moment, we have to rely on the hidden programmers to look after our best interests, and this is not the Free Software philosophy.

There is nothing to lose for the search engine companies, as they have "more eyes on the code" etc, and can incorporate the best strategies into their own systems if they feel like doing so.

Poll
Best Search Engine
o Google 92%
o Raging 0%
o Hotbot 1%
o Webcrawler 0%
o Excite 0%
o Northern Light 1%
o Lycos 0%
o Other 4%

Votes: 82

Related Links
o Google
o cretinisms
o Raging
o WebQL
o spamdexers
o Also by vryl


Page Ranking, Spamdexing and Open Source | 18 comments (18 topical, 0 editorial, 0 hidden)
This is good (4.20 / 5) (#1)
by Arkady on Thu Dec 28, 2000 at 08:51:58 PM EST

I'd like to see this happen.

Aside from the obvious benefits mentioned in the article, it also dovetails nicely into the article before it in the queue about filtering software.

The biggest problem that gets people uptight and wanting filtering systems is the good old cry of unwanted porn showing up on their screens. With a separable front-end like this, however, the good ol' Rev. Falwell could publish a search engine front that looked for porn in the index and dropped it in the ranking phase.
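To make the "separable front-end" idea concrete, here is a minimal sketch of a pluggable ranker that drops unwanted results in the ranking phase. Everything in it is an assumption for illustration: no search engine exposes this API, and the `Result` type, the `make_blocklist_ranker` helper and the sample data are all hypothetical.

```python
from typing import Callable, NamedTuple


class Result(NamedTuple):
    url: str
    snippet: str
    score: float  # relevance score as returned by the (hypothetical) back end


# A "front end" here is just a function that re-ranks back-end results.
Ranker = Callable[[list[Result]], list[Result]]


def make_blocklist_ranker(blocked_terms: set[str]) -> Ranker:
    """Build a ranker that drops results whose snippet contains a blocked term."""
    def rank(results: list[Result]) -> list[Result]:
        kept = [
            r for r in results
            if not (set(r.snippet.lower().split()) & blocked_terms)
        ]
        # Keep the back end's own ordering (highest score first) for what is left.
        return sorted(kept, key=lambda r: r.score, reverse=True)
    return rank


# Made-up back-end output for a search on "FOO".
backend_results = [
    Result("http://example.com/a", "FOO reference manual", 0.7),
    Result("http://example.com/b", "hot FOO spam page", 0.9),
    Result("http://example.com/c", "FOO tutorial for beginners", 0.8),
]

family_ranker = make_blocklist_ranker({"hot", "spam"})
filtered = family_ranker(backend_results)
```

Anyone who wants a different policy would publish a different `Ranker`; the back-end index stays the same.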

This would let the more uptight users more closely control their experience, hopefully getting some of them off the rest of our backs. Not that it'd solve everything, of course, but it might very well help take some steam out of the censors.

Cheers,
-robin

Turning and turning in the widening gyre
The falcon cannot hear the falconer;
Things fall apart; the centre cannot hold;
Mere Anarchy is loosed upon the world.


Rant and paranoia ahead (3.33 / 3) (#11)
by Wiglaf on Thu Dec 28, 2000 at 11:52:58 PM EST

Sounds like a good idea but <paranoia> *Maybe the censorware people push the search engines to not allow this idea to bloom* </paranoia> But, it would be cool to have this. I am tired of searching for "FOO" and finding "hot FOO sex with donkeys that have 15 inch d***'s". And don't get me started on the 10 billion pages that pop up if you "accidentally" go to a porn site.

Paul: I DOMINATE you to throw rock on our next physical challenge.
Trevor: You can't do that! Do you really think Vampires go around playing rock paper scissors to decide who gets to overpower one another?
[ Parent ]
But how? (2.80 / 5) (#2)
by enterfornone on Thu Dec 28, 2000 at 08:54:32 PM EST

Spamdexers' whole business revolves around getting the top place in search engines. Someone with a good understanding of how to do this can make thousands (I knew a guy who was making $40k a month doing just that).

How do you make a system that is immune to this? Systems like Dmoz are somewhat immune, but only due to the fact that they are edited by humans. Having a computer sort out the difference between a relevant web page and something that has deliberately been made to look like a relevant web page is no easy feat.

I can't see voting working. It's too easy to abuse: it makes it easier to push your page to the top by voting for yourself continuously. It's really just inviting a DoS.



--
efn 26/m/syd
Will sponsor new accounts for porn.

Voting (3.66 / 3) (#5)
by vryl on Thu Dec 28, 2000 at 09:14:07 PM EST

Would be basically by use, and you could trim down obvious 'spamming' of the results by the usual methods (IP addresses etc) enough to get fairly reliable results. Nothing is perfect, but if it takes off, and millions of users are 'voting with their feet' by using the better front ends, then it should sort itself out.

Hopefully ...
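As a rough illustration of the "IP addresses etc" trimming, the sketch below counts at most one usage vote per IP address per front end. The event format and the `count_usage_votes` name are made up for the example, and real abuse filtering would need far more than this (shared proxies and dial-up pools make IPs a blunt instrument).

```python
from collections import Counter


def count_usage_votes(events):
    """Count one vote per (ip, front_end) pair, however often that IP repeats."""
    unique = {(ip, front_end) for ip, front_end in events}
    return Counter(front_end for _, front_end in unique)


# Hypothetical usage log: (client IP, front end used).
events = [
    ("10.0.0.1", "ranker-a"),
    ("10.0.0.1", "ranker-a"),  # same IP hammering the same front end
    ("10.0.0.1", "ranker-a"),
    ("10.0.0.2", "ranker-b"),
    ("10.0.0.3", "ranker-b"),
]

votes = count_usage_votes(events)
```

Here the three repeat hits from one address collapse into a single vote, so `ranker-b` comes out ahead with two distinct users.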

[ Parent ]

tricking users (3.50 / 2) (#6)
by Delirium on Thu Dec 28, 2000 at 09:37:03 PM EST

Wouldn't this cause pages which tricked users into thinking they were relevant results to get voted up? Plus, it would seem that once a page gets voted up the situation is somewhat self-perpetuating: highly-voted pages come up first in the search results, so thus get more clicks, so thus more votes, so stay highly-voted. How would a new page work its way up the list?
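The self-perpetuating effect can be shown with a toy simulation. The assumptions are loud ones: every click counts as a vote, and the share of clicks each result position gets (`clicks_by_rank`) is invented for the example, not measured from any real engine.

```python
def simulate(votes, rounds, clicks_by_rank=(60, 30, 10)):
    """Each round, rank pages by votes, then add clicks according to position."""
    votes = dict(votes)  # work on a copy
    for _ in range(rounds):
        ranking = sorted(votes, key=votes.get, reverse=True)
        for page, clicks in zip(ranking, clicks_by_rank):
            votes[page] += clicks  # a click counts as a vote
    return votes


start = {"old-page": 5, "older-page": 3, "new-page": 0}
final = simulate(start, rounds=10)
```

Because the top slot always collects the most clicks, the gap only grows: in this model the new page never moves off the bottom, whatever its actual quality.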

[ Parent ]
wtf? (4.00 / 1) (#8)
by vryl on Thu Dec 28, 2000 at 10:52:35 PM EST

Dood, how do you trick a user into thinking they are getting relevant results? Almost by definition, if someone thinks the result is relevant then it is relevant? Or am I missing something here?

[ Parent ]
relevance (3.00 / 1) (#12)
by Delirium on Fri Dec 29, 2000 at 01:13:42 AM EST

I meant tricking them into thinking it's a relevant result from the little summary results list that the search engine brings up. Obviously if they still think it's relevant after clicking on it it is in fact relevant, but the scenario I was describing was one in which it looks relevant until the user actually goes there. Then they realize it's really just a spam, but they've already "voted with their feet" by clicking on the link.

[ Parent ]
There should be a "This Link Sucks" butt (none / 0) (#18)
by gauntlet on Fri Dec 29, 2000 at 11:21:27 AM EST

So that if the search engine is fooled by the content, and the user is fooled by the blurb, they can go back to google's search page and click "This Link Sucks", which would indicate that it wasn't reflective of what they were looking for.
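One hedged way to picture it: treat each click as a vote and each "This Link Sucks" report as a weighted vote against. The `net_votes` function, the penalty weight and the numbers are all made up for illustration.

```python
def net_votes(clicks: int, sucks_reports: int, penalty: float = 2.0) -> float:
    """Clicks count for a page; each 'This Link Sucks' report counts against it."""
    return clicks - penalty * sucks_reports


pages = {
    # Fewer clicks, but almost nobody complained afterwards.
    "honest-page": net_votes(clicks=40, sucks_reports=2),
    # A tempting blurb draws more clicks, but most visitors report it.
    "misleading-blurb": net_votes(clicks=50, sucks_reports=30),
}

ranking = sorted(pages, key=pages.get, reverse=True)
```

With clicks alone the misleading page would rank first; with the reports factored in, it drops below the honest one.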

Into Canadian Politics?
[ Parent ]

terminology (2.42 / 7) (#3)
by 31: on Thu Dec 28, 2000 at 08:54:47 PM EST

when you put a link on a word, i assume there's gonna be more information... not a definition... really, I think anyone at k5 that wants to know what something means has the resources to find it out on their own, but when dissing google, i'm thinking it would be good to say what's so poorly designed about it, and maybe how you could fix them. After all, unless you identify the problems, how can you hope to do better?

-Patrick
fair enough (3.60 / 5) (#4)
by vryl on Thu Dec 28, 2000 at 09:09:12 PM EST

But I just love the jargon file. Basically Google does not allow "Quoted Text", which is a major downer. Also, it seems to atomise compound booleans, which is a pisser as well. I have spoken to them about this, and they say they are onto it, but I don't know what is going on. It seems pretty easy to fix, I just wonder if some arsehole has a dodgy patent on some obvious tech or something.

[ Parent ]
Definition == More Information (2.00 / 2) (#10)
by iCEBaLM on Thu Dec 28, 2000 at 11:18:45 PM EST

A definition is more information on a word, especially if the audience you're talking to wouldn't normally know the meaning. I don't see how you can say it isn't. And I also don't see how you can rag on someone for linking to definitions when most people wouldn't even link *at all*.

-- iCEBaLM

[ Parent ]
"why don't they?" (4.75 / 4) (#7)
by Speare on Thu Dec 28, 2000 at 10:47:08 PM EST

why don't the search engines open up their back end with some sort of API and scripting language, and then sponsor a site with Open Sourced front ends and ranking systems?

It's all about page hits. Ad revenue. Value added. Portals.

Altavista even changed their search results recently, so that it clicks THROUGH Altavista when you want to visit a found site. This, I'm sure, is to get more revenue through providing statistics on who clicks what.


[ e d @ h a l l e y . c c ]
Google is My Favorite (3.50 / 4) (#9)
by espo812 on Thu Dec 28, 2000 at 11:07:09 PM EST

Google is my favorite search engine. It seems to come up with better results than any other engine I've seen.

One thing I don't understand is their mentality towards the searching algorithm they use. There was an article somewhere else (slashdot I think) about some porn sites exploiting Google's algorithm to artificially increase their status on searches. As a result Google had to go back and rework some things so this didn't happen.

Why don't these engines take a page from the crypto community? Develop an algorithm, and publish it. Flaws will be found quickly. If you're lucky, you get fixes. If not, at least you know what to work on. Sometimes I think I will never understand businesses.

Just my US$0.02 - feel free to ask for change.

espo
--
Censorship is un-American.
security through obscurity (3.66 / 3) (#14)
by Delirium on Fri Dec 29, 2000 at 01:38:51 AM EST

The problem is that in the crypto community they come up with mathematically sound crypto algorithms. The algorithm is secure even if full source is available - while hiding the algorithm used may add some security, it's not the main source of security, so dispensing with it is ok. With search engines, the "security through obscurity" of hidden algorithms is really all they have. I'm not sure it's possible, even in theory, to come up with a set of criteria for search engines which can't be spammed. Someone will just create pages matching your criteria (people will go to great lengths to do this, creating many real-looking pages on many different servers if necessary), so the only way to prevent it is to keep them guessing at what your criteria really are.
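A deliberately naive example of a published criterion getting spammed: suppose the engine scored pages by how often the query word appears. This `keyword_score` function is a strawman invented for the illustration, not any real engine's method, and the page texts are made up.

```python
def keyword_score(page_text: str, query: str) -> float:
    """Naive published criterion: fraction of words on the page matching the query."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    return words.count(query.lower()) / len(words)


honest = "a short guide to growing tomatoes at home"
# A spammer who has read the criterion simply repeats the keyword.
stuffed = "tomatoes tomatoes tomatoes cheap pills tomatoes tomatoes"

honest_score = keyword_score(honest, "tomatoes")
stuffed_score = keyword_score(stuffed, "tomatoes")
```

The stuffed page wins easily, and any fix to the formula that is also published just tells the spammer what to build next.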

[ Parent ]
Understanding Google (4.33 / 3) (#15)
by fremen on Fri Dec 29, 2000 at 02:32:03 AM EST

Actually, the generalities of Google are fairly well understood. Scientific American did an article in their June, 1999 issue on Clever, a similar system to Google. Clever was developed as a research project at IBM, while Google was done at Stanford. The article talks mostly about Clever, but there are a couple of paragraphs towards the end about the similarities and differences between the two search engines. Well worth the read, particularly if you are into futuristic Internet searching heuristics.
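For the curious, the general idea behind link-based ranking of the kind Google's founders described publicly as PageRank can be sketched in a few lines. This is a simplified textbook power iteration over an invented four-page link graph, assuming the usual damping factor of 0.85; it is not Google's production algorithm and says nothing about how Clever works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Everyone gets a small baseline share ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy web: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

Page `c` ends up on top because three pages point at it, and `a` comes second because the highly ranked `c` links to it.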

[ Parent ]
Google's cretinisms. (4.40 / 10) (#13)
by elenchos on Fri Dec 29, 2000 at 01:24:56 AM EST

Rather than a definition of cretinism, which I can easily look up myself without you having to bother making a link, it would help me more if you put in more specific detail about what Google does wrong. For example, "Go type in Foo and Google will give you Bar result, when it should really return Baz." If you really went all out you could give an example of the correct result appearing when you use some other search engine.

Adequacy.org

Psychological reasons (2.75 / 4) (#16)
by Robby on Fri Dec 29, 2000 at 09:12:43 AM EST

why don't the search engines open up their back end with some sort of API and scripting language, and then sponsor a site with Open Sourced front ends and ranking systems.

Well, there are several problems with this sort of thing. Open sourcing the ranking algorithm here isn't a good idea because it's a simple 'human behaviour' based algorithm. Consider a Psychology experiment: if you've ever been involved in one, you never get told what the topic is, or what's being tested, until the experiment is over. Why? Well, the simple fact is that being aware of what social aspect is being measured will change your behaviour.

In the same way, open sourcing a ranking algorithm will make people think "ah, ok, if I do this (goes to fiddle with HTML) I'll be up the top of the search! great!" Obviously, that renders the entire ranking system obsolete and useless. Back to square 1, oh wait, but all the search engines suck now :)

If you want to roll your own ranking algorithm, cool - but it's not easy (i'd say - anyone know?) - Just don't open source a good algorithm because as soon as you do, you make it obsolete.

Look at this from a purely logistical POV (4.33 / 3) (#17)
by ozone on Fri Dec 29, 2000 at 09:45:26 AM EST

Basically, what you're asking for is access to the raw data that a search engine collects, which you would then index using an external system. Right?

From my meager ADSL connection and running a webcrawler I've written, I got a couple of GB (uncompressed) from about 100k URLs. Now, scale that up to a billion, which I believe is the number of pages Google has indexed, and you've got 20TB (uncompressed) of information that you either have to transfer, store, and then index, or manipulate directly on their systems.
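The back-of-the-envelope arithmetic, spelled out. It assumes "a couple of GB" means roughly 2 GB, uses decimal units, and takes the billion-page figure as given above.

```python
sample_bytes = 2 * 10**9   # "a couple of GB", taken here as 2 GB
sample_urls = 100_000      # about 100k URLs crawled
total_pages = 10**9        # roughly a billion pages in the index

bytes_per_page = sample_bytes / sample_urls   # average page size in the sample
total_bytes = bytes_per_page * total_pages    # scaled up to the full index
total_tb = total_bytes / 10**12               # decimal terabytes

print(f"{bytes_per_page / 1000:.0f} KB per page, about {total_tb:.0f} TB in total")
```

That works out to about 20 KB per page on average, which is where the 20 TB figure comes from.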

Assuming you move the data around, I can't see sponsored open-source sites having enough bandwidth, processing power or disk space. Manipulating the data through an API on Google's systems doesn't fix the problem, as you still have to have large portions of the data in order to create the indices, which again raises the issue of bandwidth, storage, etc.

What might work is for Google to make available some servers on the Google network, which can be used to directly access their data. But then you gotta ask who they would give access to? Multiple algorithms and front-ends necessarily means multiple teams, so they would be giving semi-random groups of people access to their core systems, thus exposing themselves to all sorts of horrible scenarios.

My feeling is that the system proposed could be accomplished by starting a search engine from scratch, with this kind of development in mind, sorta a permanent work-in-progress.

All trademarks and copyrights on this page are owned by their respective companies. The Rest 2000 - Present Kuro5hin.org Inc.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.