
Google's Page Rank - Great for Searching the Internet but not Single Sites
By petersu in Technology, Sat Apr 19, 2003 at 10:26:44 PM EST
Tags: Software

One of the main reasons that Google is the most popular search engine on the Internet is its page ranking system. The algorithm it uses has become so famous that it is now known simply as "PageRank". PageRank has been so widely hailed that any search system without it seems to be deemed immature, behind the times or just plain useless. Brilliant as Google is, the funny thing about PageRank is that unless you are writing an Internet search engine (come on, are you really going to be doing that?), it is probably the worst possible way to sort search results. In fact, you should never use the PageRank algorithm when returning results from a single site.

Before Google became a company, its founders, Sergey Brin and Larry Page, were both working on doctorates on the topic of Internet search engines at Stanford University. Luckily for us, the details of Google's engine were published so anyone could see how it worked. In The Anatomy of a Large-Scale Hypertextual Web Search Engine, the authors detailed the now famous PageRank algorithm. It is very simple, and can be stated as follows:

PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where:
PR(A) is the PageRank of page A (the one we want to work out).
d is the damping factor, nominally set to 0.85.
PR(T1) is the PageRank of a page pointing to page A.
C(T1) is the number of links off that page.
PR(Tn)/C(Tn) means we repeat this for each page Tn pointing to page A.

You employ the PageRank algorithm by first guessing a PageRank for all the pages you have indexed and then iterating until the PageRank values converge. This process is described in detail in PageRank Uncovered by Chris Ridings and Mike Shishigin, a thorough and clearly written examination of PageRank: what it is, what it is not and how to exploit it. It makes great bedtime reading.
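As a rough sketch of that guess-and-iterate process (my own illustration, not Google's implementation; the tiny link graph and fixed iteration count are invented for the example):

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank over a link graph.

    links maps each page to the list of pages it links to.
    """
    # Collect every page mentioned anywhere in the graph.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # Step 1: guess an initial PageRank for every page.
    pr = {page: 1.0 for page in pages}
    # Step 2: apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) repeatedly
    # until the values settle down.
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            for page in pages
        }
    return pr

# Two pages that only link to each other settle at the neutral score 1.0;
# a page with more in-links ends up ranked higher.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

In practice you would stop when successive iterations differ by less than some tolerance rather than after a fixed count, but the fixed loop keeps the sketch short.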

Although it's not obvious from the algorithm, what PageRank in effect says is that pages "vote" for other pages on the Internet. So if page A links to page B, A is saying that B is an important page. In addition, if lots of pages link to a page, it has more votes and its worth should be higher. These assumptions have been widely criticised but, perhaps because nobody has been able to come up with a better system that can be tested on a live search engine, PageRank has become the de facto standard for rating search results.

Most search systems are not written to index the Internet; they are written merely to index a particular web site (for instance, the search box at the bottom of this page). If you searched for a term that was present on the home page and on some other pages as well, PageRank would always rank the home page as the first result. This is not a good thing. Let's look at a practical example to see why.

Say you had a web site that was meant to provide information on models, and you called it www.bikini.com. Let's say that you knew there was a model on the site called Lola Corwin and that she had a personal page. If you were looking for Lola with a search system that employed PageRank, and Lola happened to be on the home page as featured model of the month, the home page would come up first and Lola's page second. I actually tried this on Google at the time of writing this article and indeed, the search term:

Corwin site:www.bikini.com

brought the home page up first and her personal page second. Clearly, you really wanted Lola's personal page to be first, not the site's home page. PageRank has failed you.

I am being a little unfair to Google as PageRank is only one of the factors Google employs to rank pages. The others include word position, font, capitalisation and search term appearance in title tags. These 'others', however, are the only ones you should use when ranking search results within a given site. PageRank is entirely meaningless because it places such undue importance on the home page where detailed information is almost never found. Unfortunately, a detailed description of how these other factors should be calculated is not documented anywhere and it's difficult to know how Google uses them.

Since Google became a company, these 'other' factors and the PageRank algorithm itself have apparently been modified, but Google no longer documents its technology. If you want to use them in your software, I guess you will have to wait until a smart person somewhere publishes a better system in a Ph.D. thesis before running off to become a billionaire.

These thoughts evolved as I was trying to write a search result ranking system for a search engine I have developed called the Yider. It is a free product that is designed for Windows servers. I had to have some kind of ranking system, so I decided to use the word count and position as my page rank. It works like this:

a) Assume a web page contains the following text between the <body> tags:

"Ph.D.'s on search engines should be banned because their final findings only become workable when a company is established to produce a practical result from an incomplete research paper. This denies everyone who funded them a chance to see the benefits of their research. It's one of the reason I hate search engines in general. Haven't you got anything better to do than search the Internet all day anyway? Why not fiddle with real engines like the one in your car?"

This text consists of 467 characters

b) Let's say we were searching for the phrase "search engines". This phrase occurs at characters 11 and 303.

c) Rank phrase matches with a score of 1 but penalise them linearly depending on their distance from the beginning of the text:

Phrase rank = 1 x (467 - 11) / 467 + 1 x (467 - 303) / 467
= 0.9764 + 0.3512 = 1.3276

d) We now need to take account of partial phrase matches. I do this as follows. The two words in the phrase can be found at the following locations:

search - 11, 303, 373

engines - 18, 310, 434

Some of these positions correspond to the full phrase match. These should not be considered as they have already been taken into account. Thus we should only consider the following locations.

search - 373
engines - 434

e) We will rate these as for the phrase match, but only with an importance of 0.5:

Word rank = 0.5 x (467 - 373) / 467 + 0.5 x (467 - 434) / 467
= 0.1006 + 0.0353 = 0.1359

f) The total page rank = phrase rank + word rank = 1.3276 + 0.1359 = 1.4635
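Steps (c) through (f) can be reproduced with a short sketch (my own reconstruction, not the Yider's actual code; the function name is invented):

```python
def position_rank(text_len, phrase_hits, word_hits, word_weight=0.5):
    """Score a page from match positions, penalising each match
    linearly by its distance from the start of the text."""
    # Full phrase matches count 1 each, scaled by closeness to the start.
    phrase_rank = sum((text_len - p) / text_len for p in phrase_hits)
    # Leftover single-word matches (those not already part of a full
    # phrase match) count at reduced importance.
    word_rank = sum(word_weight * (text_len - p) / text_len
                    for p in word_hits)
    return phrase_rank + word_rank

# Figures from the worked example: phrase matches at characters 11 and
# 303, leftover word matches at 373 and 434, in a 467-character body.
score = position_rank(467, [11, 303], [373, 434])
```

Computed at full precision this gives 1.4636; the 1.4635 above differs in the last digit only because the intermediate sums were rounded to four places.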

Note that my page rank is not an absolute measure of a page's worth. It is simply a measure of the relevance of this page relative to other pages in the same site.

Before presenting search results to users, I rank every page that contains a full or partial phrase match using this algorithm. This is a common sense and simple approach that seems to provide good page ranking for me and is certainly better than using Google's PageRank.

 Google's Page Rank - Great for Searching the Internet but not Single Sites | 76 comments (43 topical, 33 editorial, 0 hidden)
 Swoooooooooosh!!! (3.00 / 2) (#3) by Kasreyn on Sat Apr 19, 2003 at 12:46:18 AM EST

Hear that? That was the sound of this article going WAY over my head. ^_^;; Abstaining due to ignorance. But I'm sure you did a good job, all the same. ^_^ -Kasreyn "Extenuating circumstance to be mentioned on Judgement Day: We never asked to be born in the first place." R.I.P. Kurt. You will be missed.
 Nah, read it again (none / 0) (#31) by A Proud American on Sat Apr 19, 2003 at 11:41:00 AM EST

Searching and algorithm analysis are no more difficult than your Ivy League calculus class.  Sure, there are a lot of foreign-looking equations at first, but once you memorize a few variables and their values, etc., it's much easier. I mean, compared to things like cryptography and shit, Web search is elementary. ____________________________The weak are killed and eaten...[ Parent ]
 Well, no, (none / 0) (#52) by it certainly is on Sat Apr 19, 2003 at 11:59:49 PM EST

 knowing the existing methods of web search is elementary. Coming up with entirely new and innovative methods of web search that are machine solvable is quite hard. kur0shin.org -- it certainly is
 OT: Calc classes (none / 0) (#66) by gzt on Mon Apr 21, 2003 at 02:15:19 PM EST

 Though this school isn't "Ivy League", the calc classes (except for the most basic possible sequence) generally involve proving various theorems in real calculus rather than doing the plug-and-chug calculations one finds in engineering calculus courses and high school calc. I suspect you'd find proving e is irrational [1] is more difficult than anything this article discusses, and that Ivy League calculus courses cover approximately the same material as the ones at this Prestigious College. You should probably amend it to "college calculus class" or something. Unless the Ivy League is as decadent as I hear. Cheers, GZT [1] a standard third-quarter calc problem, it involves doing a power series expansion. [ Parent ]
 good on you (5.00 / 4) (#13) by martingale on Sat Apr 19, 2003 at 03:07:01 AM EST

 You've understood one of the most important aspects of searching: good heuristics are entirely domain dependent. Google's ranking is necessarily a one size fits all approach, which means it will never perform better than "average", for suitably chosen definitions of "average". Right now they don't have any appreciable competition, but they will certainly be beaten by specialized engines which can take advantage of extra domain knowledge for well defined subtopics. You have one major advantage when building a "small" search engine, because you don't have to be efficient. What you'll need to do at some point I think is construct a testing framework to measure improvements when you tweak the parameters.
 Sorry (2.33 / 3) (#19) by A Proud American on Sat Apr 19, 2003 at 09:14:31 AM EST

But PageRank has changed a lot since its early days.  Back then, the Google guys kept promising how they'd never sell out and would keep Google and its accompanying code in the public domain.  Well, they lied. Anyway, since they went private, the algorithm has changed quite significantly, although I can't authoritatively state it with 100% accuracy. So, the formula you have there and all of your analysis is based on ~10 year old data, really.  That's not fair to Google IMHO. ____________________________The weak are killed and eaten...
 It's not fair, but what else? (none / 0) (#24) by petersu on Sat Apr 19, 2003 at 10:45:51 AM EST

 Since Google is so popular, and one of the few published parts of their technology is PageRank, I found everyone wanted my little itsy bitsy search engine to use it. When I did, I found it wasn't appropriate for the reasons outlined in my article. I think that's worth pointing out. If I had the time to engage in full scale research, I'm sure that most of my ideas would go out the window. In addition, I'd never build anything because all of my time would be spent in research. Geez, this sounds a bit defensive, huh, but it's worth pointing out that not everything you read about great ideas will necessarily apply to a particular software problem. get yarted[ Parent ]
 sure it's fair (4.00 / 1) (#44) by tealeaf on Sat Apr 19, 2003 at 06:15:50 PM EST

It's fair for two reasons: 1. Withholding information about search is unfair when the public depends on said search results.  The secret formula excuse can be used to effectively censor results, and because no one knows the formula, Google can just say, "uh, we don't censor anyone, that's just how our secret formula works, tough noogies..."  So if it's ok for Google to throw this little pile of crap in our faces, it's ok for us to throw some of it back in the form of reviews based on whatever public data we do have.  An eye for an eye. 2. The discussed formula is still a fundamental ingredient, as is quite apparent from looking at search results.  Sure, they are always tweaking this and that, adding weights and other heuristics, but it doesn't really matter, because the attacks based on the above formula still work and Google produces crap results in some areas where people have begun to attack Google's ranking system. Anyway, I feel uneasy that Google's system is not good enough to work in the open and that it absolutely requires "security through obscurity" in order to work.  Shame. [ Parent ]
 What? (3.87 / 8) (#32) by DarkZero on Sat Apr 19, 2003 at 11:45:06 AM EST

 Could you please rework the article (2.80 / 5) (#34) by tokugawa on Sat Apr 19, 2003 at 01:13:59 PM EST

 into a "HOW TO BEAT THE PAGERANK SYSTEM" type article, so that I can make millions of dollars at the expense of the internet community's wellbeing? Much obliged, Tokugawa
 Google has become worse lately (5.00 / 4) (#35) by Mister Pmosh on Sat Apr 19, 2003 at 01:39:20 PM EST

 I've started looking into other search engines that would act more like google used to act. It's not so much the advertising that bothers me, but the way that message boards are almost always the first links. For example, I was searching for some certain php based software last night, and the link to the actual software was perhaps number ten. The other stuff was all message boards that talked about the software. This is actually one of the better ones I've run into, others that I can't think of off the top of my head have been worse. Google's image search, news search, and newsgroup search are still invaluable and above all the competition, but their web search has been lacking, so I'll keep looking until I find something that is as good as Google used to be. "I don't need no instructions to know how to rock!" -- Carl
 Games (5.00 / 2) (#40) by DarkZero on Sat Apr 19, 2003 at 04:23:11 PM EST

 I agree (5.00 / 1) (#43) by Mister Pmosh on Sat Apr 19, 2003 at 05:54:48 PM EST

 I've not searched for old video games too often, but when I've tried to find rumors and additional information on upcoming video games it is a pain in the ass. What is worse is when you are looking for fan sites such as the ones you mentioned, and you get tons of "cheat code" sites that tell you how to go into the options menu and put the difficulty level on easy. Of course, they try to pop up ads and give you doubleclick cookies while they're at it. Personally, when I'm wanting to check out a game, I'll just go to IGN or a few other sites that are gaming magazines. If I'm wanting to buy games, movies, or music, I just go to Amazon and check out what they have since their selection is better than anything within a three hour drive of my home. The way Google is becoming has made me search for less things online. I've relegated myself to just a few sites to search for specific things now and rarely make use of internet search engines unless it's something that is very hard to find. "I don't need no instructions to know how to rock!" -- Carl[ Parent ]
I've noticed similar problems with Google (none / 0) (#65) by samiam on Mon Apr 21, 2003 at 01:21:50 PM EST

I've noticed similar problems with Google. In my case, it is trying to get information about how to use a given Linux program; often, I will get a zillion mirrors of the HOWTO (which didn't help me), and finding documentation a little more readable than the HOWTO means going to the fourth or fifth page. I think the problem here is that a lot of people put up a link to some HOWTO mirror, but few people put a link up to the truly useful documentation. - Sam [ Parent ]
 How are you looking at searching? (3.50 / 2) (#36) by Elkor on Sat Apr 19, 2003 at 02:04:57 PM EST

To my mind, searching the internet is like scooping a net through a pond. You use a different net (search terms) depending on the type of fish (pages) you want to retrieve. When you pull up a net of fish, you have to sort through them for the ones you are looking for. This is exactly what Google (and other search engines) do. To my mind, there is nothing wrong with the homepage being first, and the one you're looking for being second. Or even in the top 10. It's better than "Old School" where you had to randomly try pages and find links to other pages to find the details you were looking for. In conclusion, promote your method as a better site search technique, and point out that Google is better for internet searching. But don't try to trash Google because it doesn't work the way you want it to. Regards, Elkor "I won't tell you how to love God if you don't tell me how to love myself." -Margo Eve
 there is a fundamental problem with Google (4.00 / 1) (#42) by tealeaf on Sat Apr 19, 2003 at 05:34:27 PM EST

 ...ok (none / 0) (#46) by duffbeer703 on Sat Apr 19, 2003 at 07:31:15 PM EST

 So webmasters determining pagerank is fundamentally flawed. But you propose that the unwashed masses will do better? All I know is that google's "moderation" system works. [ Parent ]
 you didn't get it (none / 0) (#49) by tealeaf on Sat Apr 19, 2003 at 09:11:25 PM EST

The problem is that a webmaster can vote for him/herself many times (unfairly) by creating phony sites that link to the main site, or by spamming existing sites with links to it.  There is a fine line between advertising and bullshit ranking manipulation. So it's not that the "unwashed" masses are better.  There needs to be a system of checks and balances that can thwart such attacks. As far as I'm concerned, webmasters are just as unwashed as anyone else.  There is nothing special about them as a group.  So they're neither better nor worse as far as ability to judge is concerned.  But their ability to spam Google with fake sites is incredible and Google's customers have no way to deal with it in a formal way. [ Parent ]
 And people (none / 0) (#62) by duffbeer703 on Mon Apr 21, 2003 at 08:21:32 AM EST

 can vote for themselves multiple times by creating multiple accounts and scripting the voting process. Just look at Kuro5hin during the prewar days or "the other site" every day. People manipulate the system. [ Parent ]
 sorry, doesn't scale well (4.00 / 2) (#53) by martingale on Sun Apr 20, 2003 at 12:49:48 AM EST

Interactive moderation is not going to help Google, not because it wouldn't be a neat idea in principle, but because it's not something that can be incorporated in a database with three billion entries, and full of redundancies. You are effectively suggesting that Google should maintain one or more variables for each page, but be flexible enough that any one of these can be changed pretty much at any time. Just try and think for a moment what kind of submission protocols would be necessary, and then realize that even if that side of the system was up and running, each new vote would have to have repercussions on the relative rankings of three billion pages, stored on at best hundreds of computers. Your search time would go from 1 second to several hours at least. The only reasonable thing that Google can do is precompute the rankings and other bells and whistles statically, and hope that the majority of users will find them acceptable. Moderation is for microdatabases like k5. [ Parent ]
 highly distributed task, non-realtime (none / 0) (#54) by tealeaf on Sun Apr 20, 2003 at 03:55:44 PM EST

Just as Google is highly distributed today, this can be too.  Each box takes care of a relatively small domain.  So each vote only affects a certain box (or significantly less than the total number of boxes). Secondly, we don't need votes to take effect in real time.  I don't think Google's ranking works in real time either, does it?  I doubt it.  A saner solution is to run the ranking algorithm every 24 hours or every 48 hours...  That's plenty good enough.  So basically it's like doing what they already do for ranking, just with a couple of extra factors thrown into the equation.  Not too bad, if you ask me. So all that Google has to do when you moderate is to record your vote and that's it.  Same thing it does when it encounters a web page for the first time.  It just records its presence.  It doesn't rerun page rank for each new page, that would be insane. ;) [ Parent ]
 major difficulties left (5.00 / 1) (#58) by martingale on Sun Apr 20, 2003 at 10:33:07 PM EST

 indeed, lag is acceptable (none / 0) (#59) by tealeaf on Sun Apr 20, 2003 at 11:15:46 PM EST

 Of course I'm happy with one month lag!  Think about it, how is it worse than what we already have?  Not much.  I'd expect a negligible time difference with the introduction of an additional weight.  There would be a large difference in the quality of ranking, in my opinion.  Lag would be the same.   Sounds like a win to me.  Only a tiny percentage of people need to participate in the system to make it effective, and it should (if it works how I expect it to work) cut down drastically on link/page spam. Only two things prevent Google from going to the dogs.  One is that not everyone knows how Google works (mostly because they don't care), and two, not everyone who does know cares to abuse the system.  But I really don't expect this situation to remain the same.  I just don't believe that webmasters can continue to resist the temptation.  I'm a cynic. [ Parent ]
 cynically speaking... (5.00 / 1) (#60) by martingale on Mon Apr 21, 2003 at 01:03:32 AM EST

One month lag could be a problem. For things like news and current affairs, this would be completely useless. Remember that we must compare the voting system with the "current" system, which depends not on history but only on what information is actually readable on the pages in the database at the time of computation. So in January, we'd have a huge bias towards Christmas shopping, but in December Google would think it was only November. Of course this criticism also applies to Google now, but it isn't as visible because the weights are computed from relatively fixed quantities, ie the geometry of the web. Explicitly incorporating people's behaviour means making things like the slashdot effect a variable. "Only two things prevent Google from going to the dogs. One is that not everyone knows how Google..." There's little doubt that Google *will* go to the dogs, if for no other reason than that people's expectations change as they see what the search engine is capable of. If they don't innovate against themselves, the perception will be that the results are getting worse. The real headstart they have, I think, is the incredible size of their index. I believe that there really hasn't been a real comparison between Google's search heuristics and those of its competitors, simply because the competitors do not have as many pages indexed. Take for example a competitor. Suppose that their algorithm is better, but they have half the number of pages Google has. That means Google has twice the chance of finding exactly what you want in their first page of returned results, even if the site is way down on the page. With the same number of pages in their index, the competitor's algorithm might have put that page higher up. [ Parent ]
 also web search is NOT ad-hoc (none / 0) (#55) by tealeaf on Sun Apr 20, 2003 at 04:01:16 PM EST

 That's why it's so nicely distributable to begin with.  Web search is a highly specialized search.  Web ranking is a highly specialized ranking algorithm too.  In other words, Google doesn't allow user-driven ad-hoc, realtime ranking.  Imagine if as part of your search you could type your own weights and even factors into the ranking formula?  That's not allowed.  All this narrows the problem a lot.  It's not the same problem as generic, ad-hoc, realtime, SQL-based querying to a single DB server.  Not at all the same thing. [ Parent ]
 Google does something like that (5.00 / 2) (#64) by Gromit on Mon Apr 21, 2003 at 01:13:25 PM EST

 ...with their toolbar for IE: Voting buttons -- "The noble art of losing face will one day save the human race." - Hans Blix[ Parent ]
 That is why... (4.00 / 4) (#45) by thelizman on Sat Apr 19, 2003 at 06:28:27 PM EST

...pagerank is not the only determining factor for site placement. I have yet to see an example of a crap site that has managed to spike its ratings into communities of sites outside of its league for a given search criterion. All the uproar over slamming google is overblown. It always involves some arcane search term that merely elevates a site above its peers. The page isn't necessarily that good, it's just more relevant than most of the pages it beats. -- "Our language is sufficiently clumsy enough to allow us to believe foolish things." - George Orwell
 well I liked it anyway (none / 0) (#47) by livus on Sat Apr 19, 2003 at 08:36:12 PM EST

 it may be subjective as all hell but it provoked a lot of informative comments. Just the kind of thing vegetables like me need to see more of. +1FP --- HIREZ substitute. be concrete asshole, or shut up. - CTS I guess I skipped school or something to drink on the internet? - lonelyhobo I'd like to hope that any impression you got about us from internet forums was incorrect. - debillitatus I consider myself trolled more or less just by visiting the site. HollyHopDrive
 PS (4.50 / 2) (#50) by rev ine on Sat Apr 19, 2003 at 09:17:02 PM EST

 Bring back news.google.com's cache!
 Depends on the site (4.60 / 5) (#51) by Captain Trips on Sat Apr 19, 2003 at 10:56:17 PM EST

 On smaller sites with a centralized design, PageRank doesn't do much good, since there's only one person (the webmaster) voting on the pages. It's more useful for sites that are bigger and less organized. That is, for sites that look a lot like the web as a whole. University websites are a good example, and Google even has a hardcoded URL for mine. They're also pushing their technology to corporations to use on their intranets for the same reason. -- The fact that cigarette advertising works, makes me feel like maybe, just maybe, Santa Claus is real.—Sloppy
 Clearly Shmearly (5.00 / 2) (#56) by synaesthesia on Sun Apr 20, 2003 at 04:28:57 PM EST

 Clearly, you really wanted Lola's home page to be first, not the site's home page. PageRank has failed you. Not at all, it might be very pertinent information to me to know that Lola Corwin is featured model of the month. As you have noticed, searching for pages on a single site is a completely different kettle of fish from searching for pages across the whole internet (primarily because no-one is trying to artificially boost their ranking within a single site). What if www.bikini.com had a (textual) link to Lola's page (in a sidebar) from every page on the site except for Lola's page - in which the text "Corwin" doesn't appear at all, except for within a GIF? Sausages or cheese?
 it's simple, really. (none / 0) (#57) by somasonic on Sun Apr 20, 2003 at 04:43:37 PM EST

 lola just needs a k5 account and to have her personal site linked to in every single post, then she can sit back and watch her site zoom to #1! take that, bikini.com, you bastion of beach party!
 Smart Comment (none / 0) (#76) by muirhead on Tue Aug 26, 2003 at 08:12:08 AM EST

 So who needs a smart comment when you've got links? Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, Grumpy Watkins, very grumpy! [ Parent ]
 this is why... (5.00 / 1) (#61) by blisspix on Mon Apr 21, 2003 at 01:29:00 AM EST

again and again, we see the need for people who are highly experienced in non-Internet forms of indexing and retrieval to develop a new search system. Some of those people are librarians, experts in organising information using standardised methods such as Dewey Decimal, Library of Congress Subject Headings, and MARC. They are also working heavily on Dublin Core Metadata. To me, the problem is that data needs to be organised and arranged before it is indexed. Whether people add metadata to their pages when they create them or it is automatically generated somehow once a page is available on the Internet, it needs to be done. The data that would need to be organised would include keywords, authorship, date produced, language, subject, etc. After this, pages should be available to be indexed by search engines using these criteria. There needs to be some sort of authority system built into search engines and into the structure of the Internet generally. It's just a big ol' mess. Convincing 14-year-old Geocities users to use XML and metadata isn't going to work, so there needs to be some kind of post-creation indexing system using standardised headings and data.
 14 yearold Geocities users (none / 0) (#63) by jt on Mon Apr 21, 2003 at 11:01:31 AM EST

 Rarely have anything of value on their pages anyway; I say we just don't index them. [ Parent ]
 Problem with Metadata (none / 0) (#73) by xria on Wed Apr 23, 2003 at 10:57:06 AM EST

As people own the pages that are being indexed, can change them at will, and can put anything on them they like, I can't see that any form of pre-indexing system is going to be all that successful. I'm sure if you indexed today, a search on Iraq War would come up with many pages that a few months ago, or a few months from now, won't have any content relating to that subject any more (which is why Google's cache is very important, of course). Trusting the people who create content to correctly apply the metadata that applies to their site/page has proved a failure; basically, if search engines pay any attention to metadata at all, they weight it very low down, as it is abused so often. For example, I know that the company I work for has a website that contains metatags for every competitor and their products that they have ever heard of, which obviously is intended to give a marginal competitive advantage by making us appear when a competitor is searched on. [ Parent ]
wait... that's not what happened (2.33 / 3) (#68) by joschi on Mon Apr 21, 2003 at 03:51:03 PM EST

using your search of "Corwin site:www.bikini.com" I was in fact returned this page: http://www.bikini.com/supermodels/lola.html which is decidedly not the front page.  Looks like pagerank worked.
 Yeah but... (none / 0) (#69) by petersu on Mon Apr 21, 2003 at 07:11:14 PM EST

I wrote the article four weeks ago, when she was on the home page. get yarted[ Parent ]
 Your algorithm has flaws (5.00 / 1) (#70) by scheme on Mon Apr 21, 2003 at 11:27:02 PM EST

 Very interesting (none / 0) (#71) by petersu on Tue Apr 22, 2003 at 07:40:09 PM EST

Perfect nlp technique has me thinking, but at the end of the day, my simple little search engine probably doesn't need it. I am going to look into this stuff but at the moment I am implementing my system, which brings up a general point. Most site searches that are implemented are for small web sites, because most web sites are really quite small, e.g. less than 10 MB of text. The Yider caters for these wonderfully but might be stretched even for kuro5hin's site. It's the same with the ranking system. I need a reasonable system, not the best, but PageRank is not even reasonable when searching a single site. get yarted[ Parent ]
 question: using link phrases (none / 0) (#72) by kubalaa on Wed Apr 23, 2003 at 12:23:28 AM EST

 Here's an idea I thought of way back when I first heard about google, and I assumed this was how it worked -- what if we use a basic vector-IR representation for each document, then adjust each document with the vectors of all the documents which reference it until it all converges? Or, maybe we could adjust each document based on the words appearing around the link itself?
 Serious Question (1.33 / 3) (#74) by Talez on Thu Apr 24, 2003 at 10:10:20 AM EST

 Do you people ever stop fucking complaining about Google? If you think you can do better I advise you to either apply for a job there or make your own, better search engine and stop annoying us about your grand scheme to break something that works just fine. Si in Googlis non est, ergo non est
 Can you read? (none / 0) (#75) by petersu on Thu Apr 24, 2003 at 11:26:27 PM EST

This article is not a complaint about Google. It is a comment about the unsuitability of Internet search technology when it's applied to a single site. Internet searching and single site searching are two different things that people like you seem to confuse. Dear me. get yarted[ Parent ]
