
Google's Page Rank - Great for Searching the Internet but not Single Sites

By petersu in Technology
Sat Apr 19, 2003 at 10:26:44 PM EST
Tags: Software

One of the main reasons that Google is the most popular search engine on the Internet is its page-ranking system. The algorithm it uses has become so famous that it is now known simply as "PageRank". PageRank has been so widely hailed that any search system without it seems to be deemed immature, behind the times or just plain useless.

Brilliant as Google is, the funny thing about PageRank is that unless you are writing an Internet search engine (come on, are you really going to be doing that?), it is probably the worst possible way to sort search results. In fact you should never use the PageRank algorithm when returning results from a single site.




Before Google became a private company, its founders, Sergey Brin and Larry Page, were both working on doctorates on Internet search engines at Stanford University. Luckily for us, the details of Google's engine were published so anyone could see how it worked. In The Anatomy of a Large-Scale Hypertextual Web Search Engine, the authors detailed the now famous PageRank algorithm. It is very simple, and can be stated as follows:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where PR(A) is the PageRank of Page A (the one we want to work out).
d is a damping factor, nominally set to 0.85.
PR(T1) is the PageRank of a page pointing to Page A.
C(T1) is the number of outbound links on that page.
PR(Tn)/C(Tn) means we repeat that term for each page pointing to Page A.

You employ the PageRank algorithm by first guessing a PageRank for all the pages you have indexed and then iterating until the PageRank values converge. This process is described in detail in PageRank Uncovered by Chris Ridings and Mike Shishigin. PageRank Uncovered is a very thorough and clearly written examination of PageRank: what it is, what it is not and how to exploit it. It makes great bedtime reading.
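To make that iteration concrete, here is a minimal sketch of the computation in Python, run over a tiny made-up link graph. It only illustrates the published algorithm; it is not Google's implementation, and the three-page site in the example is entirely hypothetical.

    # Minimal sketch of the iterative PageRank computation described above.
    # The link graph is hypothetical; a real engine indexes billions of pages.
    def pagerank(links, d=0.85, sweeps=50):
        """links maps each page to the list of pages it links out to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        pr = {page: 1.0 for page in pages}   # initial guess for every page
        for _ in range(sweeps):              # a fixed number of sweeps converges on a small graph
            new_pr = {}
            for page in pages:
                # sum PR(T)/C(T) over every page T that links to this page
                incoming = sum(pr[src] / len(targets)
                               for src, targets in links.items()
                               if targets and page in targets)
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    # Hypothetical three-page site: the home page links to two subpages,
    # each of which links back to it.
    links = {"home": ["lola", "about"], "lola": ["home"], "about": ["home"]}
    print(pagerank(links))   # the home page ends up with by far the highest PageRank

Even in this toy example the home page collects the most votes, which is exactly the behaviour that causes trouble for single-site search later on.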

Although it's not obvious from the algorithm, what PageRank in effect says is that pages "vote" for other pages on the Internet. So if Page A links to Page B, Page A is saying that B is an important page. In addition, if lots of pages link to a page, then it has more votes and its worth should be higher. These assumptions have been widely criticised but, perhaps because nobody has been able to come up with a better system that can be tested on a live search engine, PageRank has evolved to become the de facto standard for rating search results.

Most search systems are not written to index the Internet. They are written merely to index a particular web site (for instance, the search box at the bottom of this page). If you searched for a term that was present on the home page as well as on other pages, PageRank would almost always rank the home page as the first result. This is not a good thing. Let's look at a practical example to see why.

Say you had a web site that was meant to provide information on models and you called it www.bikini.com. Let's say that you knew there was a model on the site called Lola Corwin and that she had a personal page. If you looked for Lola with a search system that employed PageRank, and Lola happened to be on the home page as featured model of the month, the home page would come up first and Lola's page second. I actually tried this on Google at the time of writing this article and indeed, the search term:

Corwin site:www.bikini.com

brought the home page up first and her personal page second. Clearly, you really wanted Lola's home page to be first, not the site's home page. PageRank has failed you.

I am being a little unfair to Google as PageRank is only one of the factors Google employs to rank pages. The others include word position, font, capitalisation and search term appearance in title tags. These 'others', however, are the only ones you should use when ranking search results within a given site. PageRank is entirely meaningless because it places such undue importance on the home page where detailed information is almost never found. Unfortunately, a detailed description of how these other factors should be calculated is not documented anywhere and it's difficult to know how Google uses them.

Since Google became a private company, these 'other' factors and the PageRank algorithm itself have apparently undergone modifications, but Google no longer documents its technology. If you want to use these techniques in your software, I guess you will have to wait until a smart person somewhere creates a better system in a publicly released Ph.D. thesis before they run off to become a billionaire.

These thoughts evolved as I was trying to write a search result ranking system for a search engine I have developed called the Yider. It is a free product that is designed for Windows servers. I had to have some kind of ranking system, so I decided to use the word count and position as my page rank. It works like this:

a) Assume a web page contains the following text between the <body> tags:

"Ph.D.'s on search engines should be banned because their final findings only become workable when a company is established to produce a practical result from an incomplete research paper. This denies everyone who funded them a chance to see the benefits of their research. It's one of the reason I hate search engines in general. Haven't you got anything better to do than search the Internet all day anyway? Why not fiddle with real engines like the one in your car?"

This text consists of 467 characters.

b) Let's say we were searching for the phrase "search engines". This phrase occurs at characters 11 and 303.

c) Rank phrase matches with a score of 1 but penalise them linearly depending on their distance from the beginning of the text:

Phrase rank = 1 x (467 - 11) / 467 + 1 x (467 - 303) / 467
= 0.9764 + 0.3512 = 1.3276

d) We now need to take account of partial phrase matches. I do this as follows. The two words in the phrase can be found at the following locations:

search - 11, 303, 373

engines - 18, 310, 434

Some of these positions correspond to the full phrase match and have already been taken into account, so we should only consider the following locations:

search - 373
engines - 434

e) We rate these in the same way as the phrase matches, but with an importance of only 0.5:

Word rank = 0.5 x (467 - 373) / 467 + 0.5 x (467 - 434) / 467
= 0.1006 + 0.0353 = 0.1359

f) The total page rank = phrase rank + word rank = 1.3276 + 0.1359 = 1.4635
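For anyone who prefers code to arithmetic, steps (a) to (f) can be sketched roughly as follows. This is only an illustration of the idea in Python, not the Yider's actual source: the function names are mine, and the only figures carried over from the description above are the score of 1 for full phrase matches and 0.5 for partial matches.

    import re

    # Rough sketch of the position-based ranking in steps (a)-(f). Offsets are
    # zero-based here, so the numbers differ by one from the worked example,
    # which counts characters from 1.
    def positions(text, term):
        """Return the character offsets of whole-word occurrences of term."""
        return [m.start() for m in re.finditer(r"\b" + re.escape(term) + r"\b", text)]

    def rank_page(text, phrase, partial_weight=0.5):
        n = len(text)
        phrase_hits = positions(text, phrase)
        # (c) full phrase matches score 1, penalised linearly by their distance
        # from the beginning of the text
        score = sum((n - pos) / n for pos in phrase_hits)
        # (d)-(e) individual words, skipping occurrences already counted as part
        # of a full phrase match, at half the importance
        for word in phrase.split():
            offset = phrase.find(word)
            covered = {hit + offset for hit in phrase_hits}
            for pos in positions(text, word):
                if pos not in covered:
                    score += partial_weight * (n - pos) / n
        return score

    body = "Ph.D.'s on search engines should be banned because ..."  # paste the full text from (a) here
    print(rank_page(body, "search engines"))  # with the full text this lands close to the 1.4635 above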

Note that my page rank is not an absolute measure of a page's worth. It is simply a measure of the relative relevance of this page compared to other pages on the same site.

Before presenting search results to users, I rank every page that contains a full or partial phrase match using this algorithm. This is a common sense and simple approach that seems to provide good page ranking for me and is certainly better than using Google's PageRank.

Google's Page Rank - Great for Searching the Internet but not Single Sites | 76 comments (43 topical, 33 editorial, 0 hidden)
Swoooooooooosh!!! (3.00 / 2) (#3)
by Kasreyn on Sat Apr 19, 2003 at 12:46:18 AM EST

Hear that? That was the sound of this article going WAY over my head. ^_^;; Abstaining due to ignorance. But I'm sure you did a good job, all the same. ^_^


-Kasreyn


"Extenuating circumstance to be mentioned on Judgement Day:
We never asked to be born in the first place."

R.I.P. Kurt. You will be missed.
Nah, read it again (none / 0) (#31)
by A Proud American on Sat Apr 19, 2003 at 11:41:00 AM EST

Searching and algorithm analysis is no more difficult than your Ivy League calculus class.  Sure, there's a lot of foreign-looking equations at first, but once you memorize a few variables and their values, etc., it's much easier.

I mean, compared to things like cryptography and shit, Web search is elementary.

____________________________
The weak are killed and eaten...


[ Parent ]

Well, no, (none / 0) (#52)
by it certainly is on Sat Apr 19, 2003 at 11:59:49 PM EST

knowing the existing methods of web search is elementary. Coming up with entirely new and innovative methods of web search that are machine solvable is quite hard.

kur0shin.org -- it certainly is

Godwin's law [...] is impossible to violate except with an infinitely long thread that doesn't mention nazis.
[ Parent ]

OT: Calc classes (none / 0) (#66)
by gzt on Mon Apr 21, 2003 at 02:15:19 PM EST

Though this school isn't "Ivy League", the calc classes (except for the most basic possible sequence) generally involve proving various theorems in real calculus rather than doing the plug-and-chug calculations one finds in engineering calculus courses and high school calc. I suspect you'd find proving e is irrational [1] is more difficult than anything this article discusses, and that Ivy League calculus courses cover approximately the same material as the ones at this Prestigious College.

You should probably amend it to "college calculus class" or something. Unless the Ivy League is as decadent as I hear.

Cheers,
GZT
[1] a standard third-quarter calc problem, it involves doing a power series expansion.

[ Parent ]

good on you (5.00 / 4) (#13)
by martingale on Sat Apr 19, 2003 at 03:07:01 AM EST

You've understood one of the most important aspects of searching: good heuristics are entirely domain dependent.

Google's ranking is necessarily a one size fits all approach, which means it will never perform better than "average", for suitably chosen definitions of "average". Right now they don't have any appreciable competition, but they will certainly be beaten by specialized engines which can take advantage of extra domain knowledge for well defined subtopics.

You have one major advantage when building a "small" search engine, because you don't have to be efficient. What you'll need to do at some point I think is construct a testing framework to measure improvements when you tweak the parameters.

Sorry (2.33 / 3) (#19)
by A Proud American on Sat Apr 19, 2003 at 09:14:31 AM EST

But PageRank has changed a lot since its early days.  Back then, the Google guys kept promising how they'd never sell out and would keep Google and its accompanying code in the public domain.  Well, they lied.

Anyway, since they went private, the algorithm has changed quite significantly, although I can't authoritatively state it with 100% accuracy.

So, the formula you have there and all of your analysis is based on ~10 year old data, really.  That's not fair to Google IMHO.

____________________________
The weak are killed and eaten...


It's not fair, but what else? (none / 0) (#24)
by petersu on Sat Apr 19, 2003 at 10:45:51 AM EST

Since Google is so popular, and one of the few published parts of their technology is PageRank, I found everyone wanted my little itsy bitsy search engine to use it. When I did, I found it wasn't appropriate for the reasons outlined in my article. I think that's worth pointing out. If I had the time to engage in full scale research, I'm sure that most of my ideas would go out the window. In addition, I'd never build anything because all of my time would be spent in research.

Geez, this sounds a bit defensive, huh, but it's worth pointing out that not everything you read about great ideas will necessarily apply to a particular software problem.

get yarted
[ Parent ]
sure it's fair (4.00 / 1) (#44)
by tealeaf on Sat Apr 19, 2003 at 06:15:50 PM EST

It's fair for two reasons:

1. Withholding information about search is unfair when the public depends on said search results.  The secret formula excuse can be used to effectively censor results, and because no one knows the formula, Google can just say, "uh, we don't censor anyone, that's just how our secret formula works, tough noogies..."  So if it's ok for Google to throw this little pile of crap in our faces, it's ok for us to throw some of it back in the form of reviews based on whatever public data we do have.  An eye for an eye.


2. The discussed formula is still a fundamental ingredient, as is quite apparent from looking at search results.  Sure, they are always tweaking this and that, adding weights and other heuristics, but it doesn't really matter, because the attacks based on the above formula still work and Google produces crap results in some areas where people have begun to attack Google's ranking system.

http://www.google.com/search?q=google+ranking+attack

Anyway, I feel uneasy that Google's system is not good enough to work in the open and that it absolutely requires "security through obscurity" in order to work.  Shame.


[ Parent ]

What? (3.87 / 8) (#32)
by DarkZero on Sat Apr 19, 2003 at 11:45:06 AM EST

Corwin site:www.bikini.com

brought the home page up first and her personal page second. Clearly, you really wanted Lola's home page to be first, not the site's home page. PageRank has failed you.

...

PageRank is entirely meaningless because it places such undue importance on the home page where detailed information is almost never found.

The PageRank system has "failed" and is "meaningless" because it put the home page one inch above a more specific page in a search on a specific site? That's a pretty ridiculous exaggeration, especially since it relies on the search term being featured on the home page at the time. If the information is still there, just one inch below where you would like it to be, that's not a failure, except a failure to meet your specific idea of how the site should work and not someone else's.

Your information also depends on how the site is laid out. If the domain goes to just a splash screen and the main page is at domain.com/index/ or domain.com/main.html, which is the case with more than half of the corporate sites that I visit, specific searches will bring you to the specific page that you're looking for before the main page. For instance, try "svc site:snkneogeo.co.jp" in Google. Because the main page is just a splash page and the actual index is at /top/top.html, the SNK Vs. Capcom (SVC) English page comes up first, the SNK Vs. Capcom Japanese page comes up second, and the main page for snkneogeo.co.jp comes up third.

So basically, Google fails nothing but your expectations for how it should work, and not mine, and it only does so some of the time. I don't know about anyone else, but I don't really see that as a problem worthy of an article. Would anyone like to see my mathematical analysis of how ShopRite failed me with their meaningless decision to put oranges at the front of their fruit stands instead of the apples that I'm looking for?

Could you please rework the article (2.80 / 5) (#34)
by tokugawa on Sat Apr 19, 2003 at 01:13:59 PM EST

into a "HOW TO BEAT THE PAGERANK SYSTEM" type article, so that I can make millions of dollars at the expense of the internet community's wellbeing?

Much obliged,
Tokugawa

Google has become worse lately (5.00 / 4) (#35)
by Mister Pmosh on Sat Apr 19, 2003 at 01:39:20 PM EST

I've started looking into other search engines that would act more like Google used to act. It's not so much the advertising that bothers me as the way that message boards are almost always the first links. For example, I was searching for a certain piece of PHP-based software last night, and the link to the actual software was perhaps number ten. The other stuff was all message boards that talked about the software. This is actually one of the better cases I've run into; others that I can't think of off the top of my head have been worse.

Google's image search, news search, and newsgroup search are still invaluable and above all the competition, but their web search has been lacking, so I'll keep looking until I find something that is as good as Google used to be.
"I don't need no instructions to know how to rock!" -- Carl

Games (5.00 / 2) (#40)
by DarkZero on Sat Apr 19, 2003 at 04:23:11 PM EST

I've noticed something else wrong with Google lately. Looking for sites about specific video games on Google sucks. Absolutely sucks. The first thirty links or so are cheat sites that either serve up very small pages with generic tips on them or stick you with a "Sorry, game not found" response from their in-site search engine. Searching for rare games is even harder. I didn't find any sites for any of Gust's Atelier games until about the fifth or sixth search page. All of the ones before that were blank or mostly blank pages from cheat sites and message boards, and the average message board response was a thread with two or three posts in it that amounted to little more than "Ever heard of this game?", "Nope", "Me neither, sorry", and that's it. Once I got to the sixtieth and seventieth responses though, the returns were excellent. Fan sites, art sites, lengthy reviews... basically everything that I came to Google for.

That's actually why I find this article pretty ironic. I've gone to Google tons of times for searches within sites and come up with exactly what I've wanted, but searching the web itself with Google has become harder and harder in the last few months. The thing with the cheat sites is mostly a product of people learning how to screw with Google, which is something I've seen when searching a few other popular subjects too; I have no idea how this problem with the message boards got started.

[ Parent ]

I agree (5.00 / 1) (#43)
by Mister Pmosh on Sat Apr 19, 2003 at 05:54:48 PM EST

I've not searched for old video games too often, but when I've tried to find rumors and additional information on upcoming video games it is a pain in the ass. What is worse is when you are looking for fan sites such as the ones you mentioned, and you get tons of "cheat code" sites that tell you how to go into the options menu and put the difficulty level on easy. Of course, they try to pop up ads and give you doubleclick cookies while they're at it.

Personally, when I'm wanting to check out a game, I'll just go to IGN or a few other sites that are gaming magazines. If I'm wanting to buy games, movies, or music, I just go to Amazon and check out what they have since their selection is better than anything within a three hour drive of my home.

The way Google is becoming has made me search for less things online. I've relegated myself to just a few sites to search for specific things now and rarely make use of internet search engines unless it's something that is very hard to find.
"I don't need no instructions to know how to rock!" -- Carl
[ Parent ]

I've noticed similar problems with Google (none / 0) (#65)
by samiam on Mon Apr 21, 2003 at 01:21:50 PM EST

I've noticed similar problems with Google. In my case, it is trying to get information about how to use a given Linux program; often, I will get a zillion mirrors of the HOWTO (which didn't help me), and finding documentation a little more readable than the HOWTO means going to the fourth or fifth page.

I think the problem here is that a lot of people put up a link to some HOWTO mirror, but few people put a link up to the truly useful documentation.

- Sam

[ Parent ]

How are you looking at searching? (3.50 / 2) (#36)
by Elkor on Sat Apr 19, 2003 at 02:04:57 PM EST

To my mind, searching the internet is like scooping a net through a pond. You use a different net (search terms) depending on the type of fish (pages) you want to retrieve.

When you pull up a net of fish, you have to sort through them for the ones you are looking for.

This is exactly what Google (and other search engines) do. To my mind, there is nothing wrong with the homepage being first, and the one you're looking for being second. Or even in the top 10.

It's better than "Old School" where you had to randomly try pages and find links to other pages to find the details you were looking for.

In conclusion, promote your method as a better site search technique, and point out that Google is better for internet searching. But don't try to trash Google because it doesn't work the way you want it to.

Regards,
Elkor


"I won't tell you how to love God if you don't tell me how to love myself."
-Margo Eve
there is a fundamental problem with Google (4.00 / 1) (#42)
by tealeaf on Sat Apr 19, 2003 at 05:34:27 PM EST

...and other search engines.  The problem is this: as long as the webmaster is the only force that determines page rank, people can and will mess with Google's ranks.

At this point Google's algorithm is very, very well known and there are very easy and very effective attacks against it (like creating thousands of junk sites that point to your site, etc.).  The problem is that I sincerely doubt it's possible to defend against these attacks with more clever algorithms, because of the fundamental relationship between data suppliers and engine.

Webmasters, right now, are the sole data suppliers, and thus, if they figure out (and they eventually will figure out anything Google does) how Google ranks pages, they will come up with just the right set of inputs to screw with Google.

So adjusting the values and tweaking page weights based on length, time, etc...all this is garbage that Google is involved with that will NOT SOLVE the problem because Google is blind to the threat model.

There is only one solution to this.  It is not perfect, but it's the only thing that has a chance of working.  Expand data suppliers.  Let people vote on the quality of links.  Of course this is likely to fall to social engineering attacks, but it is always better to have more diverse data inputs than less.  Google should take a page out of Slashdot, etc. and learn a thing or two about moderations, metamoderation, random moderation points, etc.  There are ways to make moderation hard to attack.  Taken together with page rank and other data inputs, it can correct the problem.

I think the days of good search engines that do not allow (an enlightened form of, and not a naive form of) moderation are over.  Anything that passively ranks pages based on pages themselves is doomed to fail long term.

So Google, are you ready?


...ok (none / 0) (#46)
by duffbeer703 on Sat Apr 19, 2003 at 07:31:15 PM EST

So webmasters determining pagerank is fundamentally flawed. But you propose that the unwashed masses will do better?

All I know is that google's "moderation" system works.

[ Parent ]

you didn't get it (none / 0) (#49)
by tealeaf on Sat Apr 19, 2003 at 09:11:25 PM EST

The problem is that a webmaster can vote for him/herself many times (unfairly) by creating phony sites that link to the main site, or by spamming existing sites with links to their site.  There is a fine line between advertising and bullshit ranking manipulation.

So it's not that the "unwashed" masses are better.  There needs to be a system of checks and balances that can thwart such attacks.

As far as I'm concerned, webmasters are just as unwashed as anyone else.  There is nothing special about them as a group.  So they're neither better nor worse as far as ability to judge is concerned.  But their ability to spam Google with fake sites is incredible and Google's customers have no way to deal with it in a formal way.


[ Parent ]

And people (none / 0) (#62)
by duffbeer703 on Mon Apr 21, 2003 at 08:21:32 AM EST

can vote for themselves multiple times by creating multiple accounts and scripting the voting process.

Just look at Kuro5hin during the prewar days or "the other site" every day. People manipulate the system.

[ Parent ]

sorry, doesn't scale well (4.00 / 2) (#53)
by martingale on Sun Apr 20, 2003 at 12:49:48 AM EST

Interactive moderation is not going to help Google, not because it wouldn't be a neat idea in principle, but because it's not something that can be incorporated in a database with three billion entries, and full of redundancies.

You are effectively suggesting that Google should maintain one or more variables for each page, but be flexible enough that any one of these can be changed pretty much at any time. Just try and think for a moment what kind of submission protocols would be necessary, and then realize that even if that side of the system were up and running, each new vote would have repercussions on the relative rankings of three billion pages, stored on at best hundreds of computers. Your search time would go from 1 second to several hours at least.

The only reasonable thing that Google can do is precompute the rankings and other bells and whistles statically, and hope that the majority of users will find them acceptable.

Moderation is for microdatabases like k5.

[ Parent ]

highly distributed task, non-realtime (none / 0) (#54)
by tealeaf on Sun Apr 20, 2003 at 03:55:44 PM EST

Just like today, Google is highly distributed; this can be too.  Each box takes care of a relatively small domain, so each vote only affects a certain box (or significantly fewer than the total number of boxes).

Secondly, we don't need votes to take effect in real time.  I don't think Google's ranking works in real time either, does it?  I doubt it.  A saner solution is to run the ranking algorithm every 24 or 48 hours.  That's plenty good enough.  So basically it's like doing what they already do for ranking, just with a couple of extra factors thrown into the equation.  Not too bad, if you ask me.

So all that Google has to do when you moderate is record your vote, and that's it.  Same thing it does when it encounters a web page for the first time: it just records its presence.  It doesn't rerun PageRank for each new page; that would be insane.

;)


[ Parent ]

major difficulties left (5.00 / 1) (#58)
by martingale on Sun Apr 20, 2003 at 10:33:07 PM EST

I don't think you realize the size of the dataset Google deals with. It's just not practical to keep a variable on each indexed page. As it is, Google recomputes its weights each month over a period of several days. We could at least expect the same with your scheme. But I'm not sure you'd be too happy with a voting system with a one month lag.

Here's what Google does in a nutshell, (ok, it's speculation ;-): the crawlers produce unordered compressed repositories of web pages. If you take a look at the Google contest web page, you'll see what I mean. The pages can't be ordered according to domain when they come in, since crawling a whole site at once is bad form, and often triggers retaliation by web admins.

Next, the unordered repositories are distributed over lots of computers and iteration begins. Now, the PR equation is a sparse matrix equation, but nevertheless, at several billion web pages you're looking at a lot of computer communication to synchronize the calculations. Since the repositories are initially unordered, I'd expect n^2 communication between the number-crunching computers, although this _might_ be reducible by reordering the web pages in the repositories. However, that would amount to sorting the data according to the shape of the web graph, which isn't so easy either. Bottom line is the calculations are a major pain in the ass, and not as highly localized as one would expect.

After the pagerank and other weights have been calculated satisfactorily, the servers receive little chunks of it. The sorted list of documents is associated with each valid keyword, at least to within the first few results (ever noticed that you can never read the 87,000th result in a Google search? They'd be inane to put it in their list). In this way, the servers can quickly look up the sorted lists and present you with results quasi instantaneously.

Of course, some web queries are a lot more popular than others. I expect Google have a mechanism for replicating the servers which hold popular indexes, to reduce the load on the hardware. This is something that's going to take a lot of time to propagate, probably a couple of weeks.

So why is this incompatible with user defined weights? In principle, a vote on one page can change the relative rankings of a whole lot of them. The only safe way is to recompute the whole pagerank and redistribute sorted lists to the servers. Given the difficulties outlined above, I'd expect this sort of thing about once a month, for the same reason that Google already recompute once a month and not more frequently.

If you're happy with a lag of a whole month, then I guess it's doable...

[ Parent ]

indeed, lag is acceptable (none / 0) (#59)
by tealeaf on Sun Apr 20, 2003 at 11:15:46 PM EST

Of course I'm happy with one month lag!  Think about it, how is it worse than what we already have?  Not much.  I'd expect a negligible time difference with the introduction of an additional weight.  There would be a large difference in the quality of ranking, in my opinion.  Lag would be the same.  

Sounds like a win to me.  Only a tiny percentage of people need to participate in the system to make it effective, and it should (if it works how I expect it to work) cut down drastically on link/page spam.

Only two things prevent Google from going to the dogs.  One is that not everyone knows how Google works (mostly because they don't care), and two, not everyone who does know cares to abuse the system.  But I really don't expect this situation to remain the same.  I just don't believe that webmasters can continue to resist the temptation.  I'm a cynic.


[ Parent ]

cynically speaking... (5.00 / 1) (#60)
by martingale on Mon Apr 21, 2003 at 01:03:32 AM EST

One month lag could be a problem. For things like news and current affairs, this would be completely useless. Remember that we must compare the voting system with the "current" system, which depends not on history but only on what information is actually readable on the pages in the database at the time of computation. So in January we'd have a huge bias towards Christmas shopping, but in December Google would think it was only November. Of course this criticism also applies to Google now, but it isn't as visible because the weights are computed from relatively fixed quantities, i.e. the geometry of the web. Explicitly incorporating people's behaviour means making things like the slashdot effect a variable.

Only two things prevent Google from going to the dogs. One is that not everyone knows how Google
There's little doubt that Google *will* go to the dogs, if for no other reason than that people's expectations change as they see what the search engine is capable of. If they don't innovate against themselves, the perception will be that the results are getting worse.

The real headstart they have, I think, is the incredible size of their index. I believe that there really hasn't been a real comparison between Google's search heuristics and those of its competitors, simply because the competitors do not have as many pages indexed.

Take, for example, a competitor. Suppose that their algorithm is better, but they have half the number of pages Google has. That means Google has twice the chance of finding exactly what you want in its first page of returned results, even if the site is way down on the page. With the same number of pages in their index, the competitor's algorithm might have put that page higher up.

[ Parent ]

also web search is NOT ad-hoc (none / 0) (#55)
by tealeaf on Sun Apr 20, 2003 at 04:01:16 PM EST

That's why it's so nicely distributable to begin with.  Web search is a highly specialized search.  Web ranking is a highly specialized ranking algorithm too.  In other words, Google doesn't allow user-driven ad-hoc, realtime ranking.  Imagine if as part of your search you could type your own weights and even factors into the ranking formula?  That's not allowed.  All this narrows the problem a lot.  It's not the same problem as generic, ad-hoc, realtime, SQL-based querying to a single DB server.  Not at all the same thing.


[ Parent ]
Google does something like that (5.00 / 2) (#64)
by Gromit on Mon Apr 21, 2003 at 01:13:25 PM EST

...with their toolbar for IE: Voting buttons

--
"The noble art of losing face will one day save the human race." - Hans Blix

[ Parent ]
That is why... (4.00 / 4) (#45)
by thelizman on Sat Apr 19, 2003 at 06:28:27 PM EST

...pagerank is not the only determining factor for site placement. I have yet to see an example of a crap site that has managed to spike its ratings into communities of sites outside of its league for a given search criterion. All the uproar over slamming Google is overblown. It always involves some arcane search term that merely elevates a site above its peers. The page isn't necessarily that good, it's just more relevant than most of the pages it beats.
--

"Our language is sufficiently clumsy enough to allow us to believe foolish things." - George Orwell
well I liked it anyway (none / 0) (#47)
by livus on Sat Apr 19, 2003 at 08:36:12 PM EST

it may be subjective as all hell but it provoked a lot of informative comments. Just the kind of thing vegetables like me need to see more of. +1FP

---
HIREZ substitute.
be concrete asshole, or shut up. - CTS
I guess I skipped school or something to drink on the internet? - lonelyhobo
I'd like to hope that any impression you got about us from internet forums was incorrect. - debillitatus
I consider myself trolled more or less just by visiting the site. HollyHopDrive

PS (4.50 / 2) (#50)
by rev ine on Sat Apr 19, 2003 at 09:17:02 PM EST

Bring back news.google.com's cache!

Depends on the site (4.60 / 5) (#51)
by Captain Trips on Sat Apr 19, 2003 at 10:56:17 PM EST

On smaller sites with a centralized design, PageRank doesn't do much good, since there's only one person (the webmaster) voting on the pages. It's more useful for sites that are bigger and less organized. That is, for sites that look a lot like the web as a whole. University websites are a good example, and Google even has a hardcoded URL for mine. They're also pushing their technology to corporations to use on their intranets for the same reason.

--
The fact that cigarette advertising works, makes me feel like maybe, just maybe, Santa Claus is real.—Sloppy
Clearly Shmearly (5.00 / 2) (#56)
by synaesthesia on Sun Apr 20, 2003 at 04:28:57 PM EST

Clearly, you really wanted Lola's home page to be first, not the site's home page. PageRank has failed you.

Not at all, it might be very pertinent information to me to know that Lola Corwin is featured model of the month.

As you have noticed, searching for pages on a single site is a completely different kettle of fish from searching for pages across the whole internet (primarily because no-one is trying to artificially boost their ranking within a single site).

What if www.bikini.com had a (textual) link to Lola's page (in a sidebar) from every page on the site except for Lola's page - in which the text "Corwin" doesn't appear at all, except for within a GIF?

Sausages or cheese?

it's simple, really. (none / 0) (#57)
by somasonic on Sun Apr 20, 2003 at 04:43:37 PM EST

lola just needs a k5 account and to have her personal site linked to in every single post, then she can sit back and watch her site zoom to #1!

take that, bikini.com, you bastion of beach party!

Smart Comment (none / 0) (#76)
by muirhead on Tue Aug 26, 2003 at 08:12:08 AM EST

So who needs a smart comment when you've got links?
Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, very grumpy!

[ Parent ]
this is why... (5.00 / 1) (#61)
by blisspix on Mon Apr 21, 2003 at 01:29:00 AM EST

again and again, we see that people who are highly experienced in non-Internet forms of indexing and retrieval need to develop a new search system. Some of those people are librarians, experts in organising information using standardised methods such as Dewey Decimal, Library of Congress Subject Headings, and MARC. They are also working heavily on Dublin Core metadata.

To me, the problem is that data needs to be organised and arranged before it is indexed. Whether people add metadata to their pages when they create them or it is automatically generated somehow once a page is available on the Internet, it needs to be done. The data that would need to be organised would include keywords, authorship, date produced, language, subject, etc.

After this, then pages should be available to be indexed by search engines using this criteria.

There needs to be some sort of authority system built into search engines and into the structure of the Internet generally. It's just a big ol' mess. Convincing 14-year-old Geocities users to use XML and metadata isn't going to work, so there needs to be some kind of post-creation indexing system using standardised headings and data.

14-year-old Geocities users (none / 0) (#63)
by jt on Mon Apr 21, 2003 at 11:01:31 AM EST

Rarely have anything of value on their pages anyway; I say we just don't index them.

[ Parent ]
Problem with Metadata (none / 0) (#73)
by xria on Wed Apr 23, 2003 at 10:57:06 AM EST

As people own the pages that are being indexed, can change them at will, and can put anything on them they like, I can't see that any form of pre-indexing system is going to be all that successful. I'm sure that if you indexed today, a search on the Iraq War would come up with many pages that, a few months ago or a few months from now, won't have any content relating to that subject any more (which is why Google's cache is very important, of course). Trusting the people who create content to correctly apply the metadata that applies to their site/page has proved a failure; basically, if search engines pay any attention to metadata at all, they weight it very low down, as it is abused so often. For example, I know that the company I work for has a website that contains meta tags for every competitor and their products that they have ever heard of, which obviously is intended to give a marginal competitive advantage by making us appear when a competitor is searched on.

[ Parent ]
wait... that's not what happened (2.33 / 3) (#68)
by joschi on Mon Apr 21, 2003 at 03:51:03 PM EST

using your search of "Corwin site:www.bikini.com" I was in fact returned this page:

http://www.bikini.com/supermodels/lola.html

which is decidedly not the front page.  looks like pagerank worked.

Yeah but... (none / 0) (#69)
by petersu on Mon Apr 21, 2003 at 07:11:14 PM EST

I wrote the article four weeks ago, when she was on the home page.
get yarted
[ Parent ]
Your algorithm has flaws (5.00 / 1) (#70)
by scheme on Mon Apr 21, 2003 at 11:27:02 PM EST

Just off the top of my head, I can see at least one quick problem with your ranking algorithm. Two equivalent sentences end up having different ranks based on word order. Consider the following two sentences:

Cyanoacrylate bonds to objects using anionic polymerization.

Using anionic polymerization, cyanoacrylate bonds to objects.

The two sentences rank differently on a search for anionic polymerization even though they are equivalent.

Now consider a more realistic example, where page A has half a page of other information (like a title, authors, funding institutions, dates, etc.) and then discusses search engine algorithms, while page B consists entirely of sentences like 'Search engines are the coolest thing in the world. Search engines rule.' Your algorithm would rank the second page higher, since it rewards pages for mentioning terms more often and earlier in the text.

The ideal situation would be an analysis of pages within a site using a perfect NLP technique to categorize the content of the page. However, barring that, depending entirely on the frequency and position of search terms seems simplistic and prone to abuse. Although Google's PageRank is not ideal, it seems like one of the better solutions for getting an 'objective' rating of a page's content.

Also, I have a quibble about some of your links. Your criticism link goes to Google Watch, which is run by a gentleman who believes that his site (NameBase) should have top or very high listings for the biographical information it stores, i.e. searching for Richard Cheney on Google should bring up NameBase's biography of Cheney in one of the first few positions.


"Put your hand on a hot stove for a minute, and it seems like an hour. Sit with a pretty girl for an hour, and it seems like a minute. THAT'S relativity." --Albert Einstein


Very interesting (none / 0) (#71)
by petersu on Tue Apr 22, 2003 at 07:40:09 PM EST

A perfect NLP technique has me thinking, but at the end of the day my simple little search engine probably doesn't need it. I am going to look into this stuff, but at the moment I am implementing my system, which brings up a general point. Most site searches are implemented for small web sites, because most web sites are really quite small, e.g. less than 10 MB of text. The Yider caters for these wonderfully but might be stretched even for kuro5hin's site. It's the same with the ranking system: I need a reasonable system, not the best, but PageRank is not even reasonable when searching a single site.

get yarted
[ Parent ]
question: using link phrases (none / 0) (#72)
by kubalaa on Wed Apr 23, 2003 at 12:23:28 AM EST

Here's an idea I thought of way back when I first heard about Google, and I assumed this was how it worked: what if we use a basic vector-IR representation for each document, then adjust each document with the vectors of all the documents which reference it until it all converges? Or maybe we could adjust each document based on the words appearing around the link itself?

Serious Question (1.33 / 3) (#74)
by Talez on Thu Apr 24, 2003 at 10:10:20 AM EST

Do you people ever stop fucking complaining about Google?

If you think you can do better I advise you to either apply for a job there or make your own, better search engine and stop annoying us about your grand scheme to break something that works just fine.

Si in Googlis non est, ergo non est

Can you read? (none / 0) (#75)
by petersu on Thu Apr 24, 2003 at 11:26:27 PM EST

This article is not a complaint about Google. It is a comment about the unsuitability of Internet search technology when it's applied to a single site. Internet searching and single-site searching are two different things that people like you seem to confuse. Dear me.

get yarted
[ Parent ]