Kuro5hin.org: technology and culture, from the trenches
create account | help/FAQ | contact | links | search | IRC | site news
[ Everything | Diaries | Technology | Science | Culture | Politics | Media | News | Internet | Op-Ed | Fiction | Meta | MLP ]
We need your support: buy an ad | premium membership

[P]
Study on the 'structure' of the World Wide Web

By Marcin in News
Mon May 15, 2000 at 10:47:32 PM EST
Tags: Internet (all tags)
Internet

A group of researchers from IBM, Compaq and Altavista have done a study into the 'structure' of the web based on linkage. It's the biggest study of its kind done and used about 200 million pages and 1.5 billion hyperlinks.


They basically took 200 million pages and 1.5 billion links (i'd hazard a guess and say that they used Altavista's database) and did a whole bunch of interesting stuff to do with graphs (as in logic type graphs, not necessarily pretty pictures).

Their findings were, well they're best explained by a quote from the paper:

Our analysis reveals an interesting picture (Figure 9) of the web's macroscopic structure. Most (over 90%) of the approximately 203 million nodes in our crawl form a single connected component if hyperlinks are treated as undirected edges. This connected web breaks naturally into four pieces. The first piece is a central core, all of whose pages can reach one another along directed hyperlinks -- this "giant strongly connected component" (SCC) is at the heart of the web. The second and third pieces are called IN and OUT. IN consists of pages that can reach the SCC, but cannot be reached from it - possibly new sites that people have not yet discovered and linked to. OUT consists of pages that are accessible from the SCC, but do not link back to it, such as corporate websites that contain only internal links. Finally, the TENDRILS contain pages that cannot reach the SCC, and cannot be reached from the SCC. Perhaps the most surprising fact is that the size of the SCC is relatively small -- it comprises about 56M pages. Each of the other three sets contain about 44M pages -- thus, all four sets have roughly the same size.

In the article itself there is a bit more detail on their findings and some graphs of the other sort.

I think it's interesting that the world wide web is now so huge that it can be analysed like this and 'patterns' can be found in it. It really is like an 'organism' (yes, that's spelt right ;) ). I'd like to see more research into what types of sites make up the various portions.. ie. what's in the heart, what are the seperate clusters (pr0n?) and so forth. Anyway, i'm rambling now.. I want to know what other people think :)

Sponsors

Voxel dot net
o Managed Hosting
o VoxCAST Content Delivery
o Raw Infrastructure

Login

Related Links
o a study into the 'structure' of the web
o logic type graphs
o Also by Marcin


Display: Sort:
Study on the 'structure' of the World Wide Web | 24 comments (24 topical, editorial, 0 hidden)
Good for quickes.... (none / 0) (#7)
by duxup on Mon May 15, 2000 at 08:48:08 PM EST

duxup voted 0 on this story.

Good for quickes.

Wow, for once a "structure of the w... (none / 0) (#2)
by Inoshiro on Mon May 15, 2000 at 08:51:58 PM EST

Inoshiro voted 1 on this story.

Wow, for once a "structure of the web" which doesn't look like some colourful sealife mushed into splotches :-)



--
[ イノシロ ]
Re: Wow, for once a (3.00 / 1) (#19)
by duxup on Tue May 16, 2000 at 08:05:11 AM EST

Personally I thought it would end up looking like a Penguin, but oh well :-)


[ Parent ]
Interesting figures.... the web is ... (none / 0) (#5)
by warpeightbot on Mon May 15, 2000 at 08:57:10 PM EST

warpeightbot voted 1 on this story.

Interesting figures.... the web is really that small? And only a quarter of it is core? Whee. One wonders how close that is to reality...

Re: Interesting figures.... the web is ... (3.00 / 1) (#11)
by Marcin on Mon May 15, 2000 at 11:14:23 PM EST

Interesting figures.... the web is really that small? And only a quarter of it is core? Whee. One wonders how close that is to reality...

Well, when you think about it there are probably a _lot_ of sites that didn't make it into this analysis at all. There are a couple of 'interconnected islands' they found, but you'll probably find there are a lot more like this that they didn't find. Like if both a friend and myself had our own websites on our own servers and we were linked to each other (and maybe had outlinks but no in links apart from our own) then there's no way our sites would be found and show up on a study like this. I hope that makes some sense :)

I think if you somehow got absolutely every HTML page out there you'd end up with the current 'picture', ie. a big clump, lots of things pointing into and out of it, lots of small islands, and then basically like a 'star field', just lots of dots of 1 or two pages/sites all over the place.
M.
[ Parent ]

Re: Interesting figures.... the web is ... (3.00 / 1) (#16)
by rongen on Tue May 16, 2000 at 05:32:55 AM EST

think if you somehow got absolutely every HTML page out there you'd end up with the current 'picture',

Well, you'd end up with an old picture, and it wouldn't necassarily represent a single time-slice. The web grows so fast that no image will ever be current. Also, there is no way to simultaneously map the whole thing... There will inevitably be some lag between the first and last nodes being mapped so the best you can ever do is show the state of the web (or some portion of it) at some time in the past. What will be REALLY cool is to see how this data stacks up in a few years... the patterns of growth and change are the cool thing! :)
read/write http://www.prosebush.com
[ Parent ]

Re: Interesting figures.... the web is ... (none / 0) (#25)
by Anonymous Hero on Tue May 16, 2000 at 09:48:03 PM EST

According to the latest figures (Spring '99), the web is approximately 800 million pages, and is growing at a rate of one million pages per day [1]. This study was only conducted on 200 million pages. As Marcin observes, this sounds suspiciously like the Altavista search database. It is also one of the main reasons why searching sucks so much -- less than 25% of the web is indexed by any one search engine.

I think it'd be interesting to compare this study to the earlier one [2] which found the web to be, on average, 19 clicks "wide" -- that is, only 19 clicks separate one given page from any other.

Luke Francl
(first K5 post!)

[1] Steve Lawrence and Lee Giles. Accessibility of information on the web, Nature, Vol. 400, pp. 107-109, 1999. See http://wwwmetrics.com/ for a good summary.

[2] . Albert, H. Jeong and A.L. Barabasi. The diameter of the world wide web, Nature Volume 401, pp. 130 - 131, 1999.

[ Parent ]
Yummy, I love new and cool applicat... (5.00 / 1) (#8)
by madams on Mon May 15, 2000 at 09:15:21 PM EST

madams voted 1 on this story.

Yummy, I love new and cool applications of graph theory. It is interesting to note that, according to the study, any two web pages are usually seperated by less than 20 links, and they expect this number to grow logarithmically[1].

Also, for anyone unfamiliar with graph theory, a strongly connected component (SCC) is a set mutually reachable equivalence classes of vertices in a graph[2]. The vertices (or nodes) can be broken into groups such that all the vertices within each group are reachable from each other. Each of these groups is called a Strongly Connected Component. There is an elegant linear time[3] algorithm for determining the SCCs in a graph.

It would be nice if someone could make some pretty pictures out of this data like CAIDA has done (see their AS topology poster, for instance), but oh well.

[1] They don't say what base though. 2? 10? e? Maybe one of the researchers was a computer scientist, one was an engineer, and one was a mathematician, so they couldn't decide.

[2] Okay, so that explanation didn't help much. :-)

[3] O( V + E), where V and E are the number of vertices and edges, respectively.

--
Mark Adams
"But pay no attention to anonymous charges, for they are a bad precedent and are not worthy of our age." - Trajan's reply to Pliny the Younger, 112 A.D.

The base of the logarithm is unknown. (4.00 / 1) (#15)
by your_desired_username on Tue May 16, 2000 at 02:45:41 AM EST

[1] They don't say what base though. 2? 10? e? Maybe one of the researchers was a computer scientist, one was an engineer, and one was a mathematician, so they couldn't decide.
     
First, an algebra review:

  log(base,a)==log(other_base,a)/log(other_base,b) [Equation 1]

  Meaning that you can convert a logarithm from one base to any other
    base by dividing by a constant value (log(other_base,b)), or
    multiplying by the inverse of the same constant.

Now, suppose you have found that the minimum number of links between
  any 2 web pages (just an example ;-) varies according to this
  formula:

    links==x*log(2,number_of_public_web_pages)

  where x is an unknown constant, and 2 is the base of the log.

  Suppose you want to represent that in base 10 logs. This means you
    need to divide by log(2,10) (to change the base), and, to keep the
    equation balanced, multiply by log(2,10) :

    links==x*log(2,number_of_public_web_pages)

    links==x*log(2,10)*log(2,number_of_public_web_pages)/log(2,10)

      Since:

      log(2,number_of_public_web_pages)/log(2,10)==log(10,number_of_public_web_pages)

      [See Eq 1]

    links==x*log(2,10)*log(10,number_of_public_web_pages)

  Still a logarithmic equation. Furthermore, since x is an unknown
    constant, x*log(2,10) is also an unknown constant. Let us call it
    x2:

    x2==x*log(2,10)

    links==x2*log(10,number_of_public_web_pages)

  Since you know niether x nor x2, does it matter whether this log is
    base 2 or base 10?

  Wait, there's more. First, let us go back to the original equation:

    links==x*log(2,number_of_public_web_pages)

  Now, imagine an unknown constant y such that 1/log(y,2) == x :

    x==1/log(y,2)

  and plug it in place of x:
        
    links==log(2,number_of_public_web_pages)/log(y,2)

    By equation 1, we have:

      log(2,number_of_public_web_pages)/log(y,2)==log(y,number_of_public_web_pages)

    So

    links==log(y,number_of_public_web_pages)

  Which means you do not know the base in the first place ... So why
    pretend you do?
    


[ Parent ]
Re: The base of the logarithm is unknown. (none / 0) (#20)
by madams on Tue May 16, 2000 at 12:34:18 PM EST

First, an algebra review:

[snip]

Which means you do not know the base in the first place ... So why pretend you do?

All your math is right, but a couple of things. X is not necessarily an unknown constant (it just is in your example). But, more importantly, my comment about the base of the logarithm is more of a standard joke than anything else. When we just say "log", computer scientists, engineers, and mathematicians will likely assume different things. Mathematicians, for instance, immediately think natural log (ln), while CS types are thinking in binary, base 2.

But I'll just be quiet.

--
Mark Adams
"But pay no attention to anonymous charges, for they are a bad precedent and are not worthy of our age." - Trajan's reply to Pliny the Younger, 112 A.D.
[ Parent ]

Re: The base of the logarithm is unknown. (2.00 / 1) (#21)
by rusty on Tue May 16, 2000 at 01:29:12 PM EST

When we just say "log", computer scientists, engineers, and mathematicians will likely assume different things.

And here I am, thinking "access log, or error log?" :-)

____
Not the real rusty
[ Parent ]

log(log(log(what?))) (none / 0) (#22)
by madams on Tue May 16, 2000 at 03:39:28 PM EST

And here I am, thinking "access log, or error log?" :-)

I should have included Web admins and beavers in my list of dissenters.

--
Mark Adams
"But pay no attention to anonymous charges, for they are a bad precedent and are not worthy of our age." - Trajan's reply to Pliny the Younger, 112 A.D.
[ Parent ]

Um, the pullquote you used discusse... (none / 0) (#9)
by genehack on Mon May 15, 2000 at 09:31:41 PM EST

genehack voted 1 on this story.

Um, the pullquote you used discusses what types of pages are found in the different sections. Also, graph theory (the study of nodes and the edges that connect them) has been around for a long time.

However, this is an interesting application of it.

Re: Um, the pullquote you used discusse... (none / 0) (#10)
by Marcin on Mon May 15, 2000 at 11:09:31 PM EST

Um, the pullquote you used discusses what types of pages are found in the different sections.

What I meant was more specific data like "All portals are in the 'heart'", "Porn sites are all off on their own" that kinda thing. A bit more detail than what they've mentioned. Although obviously that kinda info is a bit hard to gather without actually looking into detail at the content.

Also, graph theory (the study of nodes and the edges that connect them) has been around for a long time.

I wasn't trying to imply that it was anything new, just that not everybody may have heard of it and assumed the other sort of graph :)
M.
[ Parent ]

Wintermute is the SCC... ... (none / 0) (#1)
by joeyo on Mon May 15, 2000 at 09:38:16 PM EST

joeyo voted 1 on this story.

Wintermute is the SCC...

--
"Give me enough variables to work with, and I can probably do away with the notion of human free will." -- demi

hrm...four clusters of www...... (none / 0) (#4)
by davidu on Mon May 15, 2000 at 10:03:32 PM EST

davidu voted 1 on this story.

hrm...four clusters of www...

cool.... (none / 0) (#6)
by deimos on Mon May 15, 2000 at 10:08:04 PM EST

deimos voted 1 on this story.

cool.
irc.kuro5hin.org: Good Monkeys, Great Typewriters.

nifty.... (1.00 / 1) (#3)
by mr. creep on Mon May 15, 2000 at 10:15:17 PM EST

mr. creep voted 1 on this story.

nifty.
--
brian - geeknik.net

fast server (2.00 / 1) (#12)
by thelaw on Mon May 15, 2000 at 11:24:35 PM EST

that was a really fast load for the page... i'm impressed. i wonder what it feels like to be kuro5hinned? (you know, slashdotted, huro5hinned, shut up.) jon

Re: fast server (none / 0) (#13)
by madams on Tue May 16, 2000 at 01:11:53 AM EST

It's nice to have pages that load fast these days. However, since the page is being served from an AIX box at IBM, it's unlikely that traffic being redirected from Kuro5hin would overload the server.

But maybe you're on to something: being Kuro5hinned (K5'd?) is more important than being Slashdotted. It means a site is relevent to the discerning readers of Kuro5hin.

--
Mark Adams
"But pay no attention to anonymous charges, for they are a bad precedent and are not worthy of our age." - Trajan's reply to Pliny the Younger, 112 A.D.
[ Parent ]

Further information online at Nature (3.50 / 2) (#17)
by robin on Tue May 16, 2000 at 07:33:19 AM EST

Nature has an interesting review (`The Web Is A Bow Tie'), apparently based on this study. There are some interesting sidebar links as well. This is in the 11 May 2000 edition if a newsstand near you carries it.
--
W.A.S.T.E. (do not antagonise the Horn)
Connections (4.00 / 2) (#18)
by duxup on Tue May 16, 2000 at 08:02:19 AM EST

The study seems to say that any two web pages are usually separated by less than 20 links.

I wonder how that is figured though. When people talk about "6 degrees of separation" (man I hope I got the # right:-)) with people, at least person A knows person B and person B of course knows person A as well. However, websites don't work that way. I have a link on my page to many sites that don't have a link to me (not that they should). For this study websites I'm assuming that for 2 sites to be connected they would not both have to have links to one another. I wonder how the stats would look if they needed to have links to one another?

With community sites like Kuro5hin and /. I wonder if you couldn't actually see the internet social connections. Considering when I post my URL is posted is on here too, and I have a link to here as well. It would be interesting.

I remember a long time ago a Broadcasting magazine put together a map of all large communications companies and who has partnerships, majority, minority ownership, or are a subsidiary of another company. Amazing how they all were only 2 connections away from one another, and some had some interesting ties. Wish I could find that info again.

Isn't this obvious? (none / 0) (#23)
by BlaisePascal on Tue May 16, 2000 at 05:03:13 PM EST

So the web consists of:

1. A strongly connected core
(if A,B are pages in this core, then
there is a path (series of links) from
A to B, and from B to A)
2. Nodes which have a path from the core,
but no path to the core.
3. Nodes which have a path to the core,
but no paths from the core
4. "Tendrils" off of 2 and 3,
and
5. Disconnected components.

So... How does this differ from any other large connected graph?


The eternel question remains: where is the free po (none / 0) (#24)
by Anonymous Hero on Tue May 16, 2000 at 09:19:00 PM EST

In other news, its amusing that Alta Vista, a company so desperately in need of a reason to live, would waste the manpower to reach such obvious common sense conjectures.

[ Parent ]
Study on the 'structure' of the World Wide Web | 24 comments (24 topical, 0 editorial, 0 hidden)
Display: Sort:

kuro5hin.org

[XML]
All trademarks and copyrights on this page are owned by their respective companies. The Rest 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.
Need some help? Email help@kuro5hin.org.
My heart's the long stairs.

Powered by Scoop create account | help/FAQ | mission | links | search | IRC | YOU choose the stories!