Kuro5hin.org: technology and culture, from the trenches
Google and Recursion

By ikarus in Technology
Mon Apr 15, 2002 at 08:58:55 AM EST
Tags: Internet (all tags)
Internet

We all know more or less how Google works: links act as votes, and the more votes a page has, the higher its PageRank(tm). This system allows Google to do a decent job of locating the authoritative source for any particular topic. The links are the key to this system: they determine relevance. In theory one can exploit this system by creating "fake" or "pointer" sites that serve only to drive up the rankings of another site. Such a practice is known as Google bombing, and while possible, it involves a good deal of coordination among a large group of people, or a single person with lots of free time. There has been some debate about the degree to which weblogs can affect Google's rankings by making it easy for large numbers of people to simultaneously link to the latest trends, fads, memes, or news bites on the Internet.
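The "links as votes" idea can be sketched as a toy version of the published PageRank calculation. This is a simplification for illustration only; Google's production ranking is proprietary and layers many more signals on top:

```python
# Toy PageRank via power iteration over a tiny link graph.
# Simplified from the published algorithm; Google's real ranking
# is proprietary and far more elaborate.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new[target] += share  # each link acts as a "vote"
        rank = new
    return rank

# Three pages: A and B both "vote" for C, so C ranks highest.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

Pages A and B both link to C, so C ends up with the highest rank; A in turn outranks B because C votes for it.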

This past Thursday, Google released the Google API, a SOAP interface that allows developers to query Google and retrieve results without having to use the normal HTML form interface. It is also enabling a new form of Google bombing.


The Google API has been well received and numerous applications and scripts have sprung up to take advantage of it. One of the most popular uses of the API has been to create "Google Boxes" that show the top ten results for a query on a particular topic. For example, scripting.com shows the top ten results for the term "scripting" or the top ten linkers to scripting.com. Scripting.com, like many other sites that display similar query results, is a weblog, and Google knows that weblogs change frequently, so it indexes them frequently.

Remember that Google uses links as votes, which in turn determine relevancy. By embedding these "Google Boxes," sites are creating recursive link structures. In other words, sites query Google to find the most authoritative links on a topic and then create links to the results, which in turn makes those results more authoritative. The rating system is essentially feeding itself.
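The feedback loop can be simulated with a toy model: give one page a small head start, let every "Google Box" link to whichever page currently ranks first, and recount. Plain in-link counting stands in for PageRank here, purely for illustration:

```python
# Toy simulation of the "Google Box" feedback loop: every box links
# to whichever page currently ranks first, so an early leader keeps
# accumulating votes. Simple vote counting stands in for PageRank.

def run_boxes(initial_votes, n_boxes, rounds):
    votes = dict(initial_votes)
    for _ in range(rounds):
        leader = max(votes, key=votes.get)
        votes[leader] += n_boxes  # every box "votes" for the current #1
    return votes

# A one-vote head start snowballs into a large, self-reinforcing gap.
votes = run_boxes({"early_leader": 11, "challenger": 10}, n_boxes=5, rounds=3)
```

After three indexing rounds the challenger hasn't moved, while the early leader has pulled far ahead on nothing but recycled results.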

This is an issue that blogdex and Daypop, two sites that track popular, timely links, have both dealt with. Both distribute their results via RSS so that others can display the information. To prevent a feedback loop of links, where the top results would always stay on top (because they are being linked to by the sites that display the results), a redirector is employed. Instead of providing direct links, Daypop and blogdex link back to their own sites, which in turn redirect to the real destination. This allows them to ignore these links when determining ratings.
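The redirector trick amounts to publishing links that point at your own domain with the real destination tucked into the query string, then bouncing visitors onward. A minimal sketch of the idea (hypothetical URL scheme, not Daypop's or blogdex's actual code):

```python
# Sketch of the redirector idea described above (hypothetical URL
# scheme, not Daypop's or blogdex's actual code). Published links
# point at the aggregator's own domain; the real destination rides
# along in the query string, so crawlers never see a direct link.
from urllib.parse import urlencode, parse_qs, urlparse

def to_redirect_link(real_url, base="http://example.com/redirect"):
    """Publish this instead of real_url."""
    return base + "?" + urlencode({"url": real_url})

def resolve(redirect_link):
    """What the redirect handler would put in its Location header."""
    return parse_qs(urlparse(redirect_link).query)["url"][0]

link = to_redirect_link("http://scripting.com/")
# Crawlers see only example.com; humans end up at scripting.com,
# and example.com can discount these links when computing ratings.
```

Since every published link resolves through the aggregator, the aggregator can recognize and ignore its own redirects when it tallies votes.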

The problem with the "Google Boxes" is that they don't redirect; they link directly to the sources Google provides. This means Google has no way to know that the links are a product of Google itself rather than a "normal" page vote. The impact this new form of Google bombing will have on the Google index is a matter of speculation. It may have no effect at all. It may affect only a small population of terms. Only time, and the popularity of "Google Boxes," will tell. Either way, it's kind of fun to think about.

The moral of the story: Use a redirect.

Google and Recursion | 27 comments (22 topical, 5 editorial, 0 hidden)
Wrecking the net for everyone (2.60 / 5) (#2)
by esun on Mon Apr 15, 2002 at 07:55:56 AM EST

Is it a matter of time before spammers wreck everything on the net, from email to icq to the only decent search engine to come along in years?

I yearn for the days of old: black-text-on-grey-background, and right clicking to load images.

(o/t) try a text mode browser (3.00 / 1) (#4)
by martingale on Mon Apr 15, 2002 at 08:20:46 AM EST



[ Parent ]
Argh! Cruel, cruel logic! Damn you! (n/t) (3.00 / 1) (#10)
by tenpo on Mon Apr 15, 2002 at 08:53:49 AM EST



[ Parent ]
erm... (3.33 / 3) (#17)
by delmoi on Mon Apr 15, 2002 at 06:18:04 PM EST

I yearn for the days of old: black-text-on-grey-background, and right clicking to load images.

So, why don't you just change your preferences to indicate that? Both NS and IE support it..
--
"'argumentation' is not a word, idiot." -- thelizman
[ Parent ]
there is not much of a problem (4.16 / 6) (#3)
by martingale on Mon Apr 15, 2002 at 08:19:32 AM EST

The "Google Boxes" you mention won't allow you to increase your site's ranking, unless it's already up there in the top ten. Very different from Google bombing, which can bring an unknown site to the top.

As you pointed out and can be read about originally here, each site A which links to a site B increases B's ranking by a small amount, but doesn't affect A directly.

A's own ranking can only be affected positively if there are loops of the form A->B->some other sites->A, and the smaller the loop, the higher the effect on A. That's because all sites within the loop are affected, with decreasing benefits. The most affected site will be B, followed by B's successor, followed by B's successor's successor, etc, up to A, who gets a very small boost if the loop is large.

The total increase in A's ranking is a result of adding the small increases for all possible loops.
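The loop argument can be checked with a toy PageRank calculation (a simplification; the real algorithm is proprietary): give B some outside authority, then compare A's rank when the loop back to A is short (A->B->A) versus long (A->B->C->A).

```python
# Toy check of the loop argument above: A benefits more from a short
# loop (A->B->A) than from a longer one (A->B->C->A), because each
# extra hop attenuates the rank flowing back. Simplified PageRank.

def pagerank(links, damping=0.85, iterations=100):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                new[target] += damping * rank[page] / len(outgoing)
        rank = new
    return rank

# E is an outside page lending B some authority in both graphs.
short_loop = pagerank({"A": ["B"], "B": ["A"], "E": ["B"]})
long_loop = pagerank({"A": ["B"], "B": ["C"], "C": ["A"], "E": ["B"]})
```

A's rank comes out noticeably higher in the two-node loop, matching the claim that the smaller the loop, the higher the effect on A.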

Now suppose you're an unknown site. Nobody links to you, but you decide to link to the top ten. Since these don't link back to you, or in a very very roundabout way, you'll get zero benefit from your Google Box.

Now suppose you're a top ten site, and half the top ten sites link back to you directly. If you link to each of them, you'll get a relatively large boost to your own ranking back from each of them, hence a noticeable increase. But if you had already linked to all of them before, you won't get any benefit, since I believe Google counts multiple links to the same page as one.

You'll note that the bit which allows you to increase your site's ranking requires lots of links, just like in Google bombing. If you've read so far, I think you'll agree that Google bombing is the better technique (which, for the record, I don't condone).



well put (3.66 / 3) (#6)
by ikarus on Mon Apr 15, 2002 at 08:33:21 AM EST

for the reasons you explain above, this is not entirely the same as Google Bombing, and may not be all that big of an issue. as you said, this only benefits sites that are already in the "top ten." my only concern (being ignorant of the details of the PageRank system) is that these "super nodes" continue to increase their status, and that maybe the sites they link to have their ratings affected as well. this is just speculation though.

[ Parent ]
super nodes (3.00 / 1) (#19)
by martingale on Mon Apr 15, 2002 at 08:29:10 PM EST

You've got a valid concern, but I think (based on the original PageRank description, I'm sure Google have improved/modified it since, but that's proprietary information we don't have access to) the amount of skewing won't be very great, compared to what can be achieved with something like Google Bombs.

Also, if you link consistently to the top ten, rather than to specific web sites, the top ten might change over time in ways beyond your control, so it's hard to guarantee a benefit to yourself. That means that people will probably not abuse Google Boxes voluntarily. But who knows, a variation on the Google Box might be more dangerous/powerful.



[ Parent ]
Still matters (4.71 / 7) (#7)
by dennis on Mon Apr 15, 2002 at 08:37:48 AM EST

It may not be exploitable, but it still distorts the rankings. If lots of people post links taken from Google's top results, the rankings of those sites will be magnified.

If people find sites on Google, evaluate them independently, and post links to the ones they like, that's great. Sites will still be ranked by the size of the human audience that approves of them. But if people just blindly post Google search results, then the ranking is no longer based on real aggregate opinion, and sites that get an early lead will tend to hang onto it automatically.

[ Parent ]

correct (4.00 / 1) (#18)
by martingale on Mon Apr 15, 2002 at 08:20:21 PM EST

Any tampering with the web link structure skews Google's results for the users, obviously. In principle, that's bad, but in practical terms you can't stop the people from trying.

My argument suggests that the Google Box as described in the article is mainly useful to prop up support for the most popular sites, so it's unlikely to be used eg by pr0n sites to spam the result listings. Basically, with the Google Box, you can vote for others, but not yourself. You can however get boosted as a side effect.

Of course, if you're smart about it, you can experiment to see which other sites will help you the most back when you help them by linking, but that's essentially Google bombing again, which people already know how to use. Moreover, there's a finite limit to the benefit you can derive for yourself, as the links you create are only counted once. So it should be fairly hard to get an early lead and hang on to it this way.

Note that the analysis only applies to the Google Box described in the article. Variations and mutations might be more "evil".



[ Parent ]
Sorta OT (4.00 / 1) (#8)
by Skippy on Mon Apr 15, 2002 at 08:43:08 AM EST

Are links the only thing Google uses to determine page ranking? I've been more curious as of late because I get a lot of hits to a technical article on my site from Google but to the best of my knowledge it isn't linked anywhere.

# I am now finished talking out my ass about things that I am not qualified to discuss. #
Not exactly (3.50 / 2) (#11)
by rusty on Mon Apr 15, 2002 at 10:20:53 AM EST

Google doesn't just use links alone, but a fairly complicated weighting system as well. Links from some places are more authoritative than links from other places, basically.

____
Not the real rusty
[ Parent ]
And, surely... (4.00 / 1) (#12)
by jsled on Mon Apr 15, 2002 at 11:32:43 AM EST

They also use more traditional semantic document parsing techniques in combination with the link weighting.

[ Parent ]
Authority based on PageRank? (none / 0) (#25)
by Aquarius on Wed Apr 17, 2002 at 02:52:50 AM EST

As I understand it, I think that your PageRank dictates how authoritative your vote is, so highly visible sites have their links weighted correspondingly more heavily.

But I suspect that the PageRank algorithm in all its glory is weird black magic. :)

Aq.

"The grand plan that is Aquarius proceeds apace" -- Ronin, Frank Miller
[ Parent ]
speculations on proprietary Google technologies (4.00 / 1) (#20)
by martingale on Mon Apr 15, 2002 at 08:51:50 PM EST

I think it's fairly obvious that Google are tampering with the basic PageRank framework, though we (I?) don't know for sure. For example they allow you to search by language, or with more or less pr0n. The fact that they can identify/classify documents this way means they have a framework in place.

So it should be relatively easy for them to include document weights that take these classifications into account, as rusty and jsled already proposed. Simple document weighting is really easy to do, but the trick of course is to end up with useful weights.

Having said all this, your particular example can be explained without resorting to advanced PageRank modifications.

Other people have reported similar phenomena (too lazy to find the links), with the following explanation: sometimes, people publish web server access logs (maybe inadvertently) which the search engine crawlers find. On those logs, there's a whole lot of information, including the IP of the client's machine, and perhaps the referral address (ie the last web address visited before the server was queried). Google may use these addresses as if they were a direct link to your site. This might have occurred if you have ever fired up your browser, browsed your own site, and then browsed some other site in the same session.



[ Parent ]
The PageRank algorithm is described (4.00 / 1) (#23)
by fhbehr on Tue Apr 16, 2002 at 06:47:52 PM EST

here.

[ Parent ]
Theoretically (3.33 / 3) (#14)
by nutate on Mon Apr 15, 2002 at 12:41:15 PM EST

In theory people could use the web API to show a ten-result window at any offset up to 1000, so you could show results 420-429 for your name. (Is this the only application so far, searching for your name? :) ) This would reduce the tyranny of the top ten into a tyranny of the top 1010.

I think ya just have to adopt a wait and see attitude on this one. It's interesting to think of the little link vortexes that could arise from this, but I doubt it will happen due to lack of adoption of googleboxes for anything more than vanity.

peace

Feedback, not recursion (4.90 / 11) (#15)
by vectro on Mon Apr 15, 2002 at 01:38:48 PM EST

This is an example of feedback, not recursion. Feedback is when the output of a system is also an input to that same system. Many voltage regulators, for example, work on the principle of feedback, by "looking" at the output voltage and adjusting it if it's not right.

Feedback is also the thing that gives you high-pitched whines when the microphone gets too close to the speaker. That's positive feedback, which is probably what's going on here.

Recursion, on the other hand, is when something is defined in terms of itself. This is most commonly used in mathematics, where functions are defined recursively. f(n)=n*f(n-1) is a recursive definition of the factorial function.

I think the key distinction between feedback and recursion is that the result of recursion is abstract, whereas feedback is dependent on the time domain.

“The problem with that definition is just that it's bullshit.” -- localroger
Unfortunately (3.66 / 3) (#16)
by CaptainSuperBoy on Mon Apr 15, 2002 at 03:56:55 PM EST

It's great that some sites are being responsible about this by using redirects to prevent this recursive indexing. Unfortunately, there's nothing requiring a site to do this redirection. What happens when someone passes around their own GoogleBox code that doesn't do redirects? It's up to the site owner to be responsible for the quality of Google's data. Maybe Google should modify their license agreement to protect against this sort of thing.. require that people mark their Google links with some meta data, like <A google=noindex> or something of that ilk.

Also, does the redirection even help? I would guess that Google, being the #1 search engine, would make their bot smart enough to follow a redirect to the final page..

OT.. I am thoroughly impressed with Google's API. People will find many innovative uses for it.. we haven't even seen the beginning of the new ways people will use this web service. It's also an excellent example of XML web services for someone who doesn't quite grasp the concept. WSDL is also neat.. it's basically a SOAP type library that makes it extremely easy for people to plug their app into your web service's interface.

--
jimmysquid.com - I take pictures.

For Google and Recursion... (none / 0) (#21)
by salsaman on Tue Apr 16, 2002 at 11:47:06 AM EST

...see 'Recursion and Google'

For Recursion and Google... (none / 0) (#22)
by cgray4 on Tue Apr 16, 2002 at 04:14:08 PM EST

...see 'Google and Recursion'.

There, now we have some recursion here.

[ Parent ]
splitting hairs (none / 0) (#24)
by martingale on Tue Apr 16, 2002 at 09:23:07 PM EST

Actually, that's circularity. For recursion, you need to be able to point to a well defined starting point.



[ Parent ]
Link to those who link to me (none / 0) (#27)
by Rasman on Thu May 02, 2002 at 07:28:57 AM EST

Google has the feature that lets you see the sites that link to you, right? So what if you set up a site that linked to the sites that linked to it? For example, Site A is a page that uses the Google API to provide a list of links to the sites that link to it. Site B wants to improve its ranking, so it links to Site A. After a few Google crawls, Site A will be linking to Site B, boosting B's rating.

By creating Site A, I'm both helping other sites boost their ratings and slowly becoming an "authority" on lots of different subjects. The more "authoritative" Site A becomes, the higher the rankings of the sites I'm boosting!

Can someone tell me why this is impossible or stupid or already been done? Seems pretty simple to me...

