Kuro5hin.org: technology and culture, from the trenches

Record Levels of Crap—and Counting

By aether in News
Mon Jul 10, 2000 at 10:19:47 PM EST
Tags: Internet (all tags)

In seven months, the Internet has increased in size from 1 billion pages to 2.1 billion pages. That is a lot of crap. There are two main schools of thought when it comes to finding information: search engines and directories. I ask, how can we sift through the junk and find the best 10% of content?

Directories have a structure that people enjoy because it is created by humans. Yahoo! and the Open Directory are the most successful examples. Their biggest limitation is the speed at which humans can categorize sites. Even with an editor base of 26,000, the Open Directory can't keep up with the Internet's growth rate of 7 million pages a day.

Search engines, on the other hand, have speed on their side. Even so, they lag in the number of pages indexed: the largest full-text archive is Google, with 560 million pages, well short of 2.1 billion.

Whoopee, you say. Does it really matter how much is indexed if it is already the best 10%? How do we know when we have the best 10%—and when to quit?

I've found that the most successful search engine is Google, which is a popularity contest of sorts. Netscape's What's Related feature is another way of surfacing popular sites. Based on technology from Alexa Internet, it follows people as they surf (if you have Alexa's 45k software installed). The choices made by Alexa users are aggregated and served up as related sites.
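A What's Related-style feature can be sketched as simple co-occurrence counting over users' surf trails. This is purely illustrative: Alexa's actual aggregation method isn't described here, and the site names and data model below are invented.

```python
from collections import Counter

def related_sites(trails, site):
    """Rank sites that co-occur with `site` across users' surf trails.

    `trails` is a list of per-user visited-site lists, the kind of data
    Alexa-style client software might report. Entirely made up here.
    """
    counts = Counter()
    for trail in trails:
        visited = set(trail)
        if site in visited:
            # Every other site this user visited counts as "related".
            counts.update(visited - {site})
    return [s for s, _ in counts.most_common()]

trails = [
    ["yahoo.com", "dmoz.org", "google.com"],
    ["google.com", "dmoz.org"],
    ["yahoo.com", "kuro5hin.org"],
]
```

With these trails, `related_sites(trails, "dmoz.org")` ranks google.com first, since it co-occurs with dmoz.org in two trails and yahoo.com in only one.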

Even then, the results are not necessarily accurate. For the most part, intuition serves as the best guide for discerning quality information. Sometimes intuition is even wrong (with due reason). Is there a way a search engine can check accuracy and/or validity? Do the Google/Alexa popularity methods help achieve this?

And finally, it is all compounded by the fact that the average URL has a lifetime of 44 days. But that's another story.

In summary:

  • Are we going about finding information the best way?
  • Is it imperative that we index all of the web?
  • Have directories outlived their usefulness? (That is, if we must index everything.)
  • Do Google/Alexa methods aid in finding accurate information?





Record Levels of Crap—and Counting | 22 comments (22 topical, 0 editorial, 0 hidden)
Search engines and directories are overrated (3.00 / 3) (#1)
by Imperator on Mon Jul 10, 2000 at 10:48:30 PM EST

Search engines are good for finding specific pieces of information. Directories are good for finding quality sites on the surface. But to go in depth, you really need to follow links from the sites you reach. I've seen too many newbies of late who were taught to use search engines and directories and never venture more than a few links away from them. They miss most of the potential of the web.

Re: Search engines and directories are overrated (none / 0) (#2)
by cvisors on Tue Jul 11, 2000 at 01:02:44 AM EST

What's even worse: I work for an ISP, and I see how many newbies type a site's actual URL into a search engine and expect the site to come up. I do push people toward Google, which I think is one of the better search engines out there at the moment. But the sheer amount of stuff out there makes things somewhat problematic for new users of the net.

[ Parent ]
Re: Search engines and directories are overrated (none / 0) (#4)
by Imperator on Tue Jul 11, 2000 at 01:41:24 AM EST

There are also newbies that figure out where to type the address... and then press the "Search" button on the toolbar. I'm glad mass-market browsers are starting to include "Go" buttons next to the address bar for people not used to using the keyboard to confirm input.

[ Parent ]
Relativity theory: (4.00 / 2) (#3)
by current on Tue Jul 11, 2000 at 01:14:32 AM EST

Personally, "I do not trust anything found on the net": I see it, I read it, and I think about it.

Articles found on the net are always subjective, that is, written from the author's point of view. This is not necessarily a flaw, but a merit. It gives a point of view, and every point of view you see helps you understand what is really happening.

So when it comes to information on a given subject, the words "What's Related" sound good to me. That is the situation where we should use a search engine: when we seek related data.

Of course, it is different when you seek services...
... I'd use directory services to find services on the net. If the company offering them hasn't listed itself, it's not on the net (for me, at least). Weren't directory services built because of commercial pressure? If you want it, use it.

The Eternal Meta-Discussion

Search engines lack effective basic filtering (4.50 / 2) (#5)
by Anonymous Hero on Tue Jul 11, 2000 at 01:50:10 AM EST

The biggest problem with most search engines is the lack of filtering of identical data. Look up anything about Linux on most search engines and you get around 2000 hits. 1999 of them are the same HOWTO page, replicated on 1999 different servers around the world. Some will be the latest edition and some will be out of date; try to guess which is which from the search engine. The last remaining entry is probably the valuable information you are searching for, but it's buried in some arbitrary position in the list of hits. Surely if they just MD5'd all these pages they could at least gather the identical ones into a single listed entry. It might not then be so hard to find the most up-to-date version, or the few unique hits.
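The MD5 idea above amounts to grouping byte-identical pages under one digest. Here is a minimal sketch; the URLs and page bodies are invented, and no real engine is claimed to work this way.

```python
import hashlib
from collections import defaultdict

def group_exact_duplicates(pages):
    """Group byte-identical page bodies under their MD5 digest.

    `pages` maps URL -> page body (str); returns digest -> list of URLs.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[hashlib.md5(body.encode("utf-8")).hexdigest()].append(url)
    return dict(groups)

# Invented example: two mirrors carry the same edition of a HOWTO,
# a third carries an older one, and one page is unique.
pages = {
    "http://mirror-a.example/howto.html": "Linux Sound HOWTO v2.0 ...",
    "http://mirror-b.example/howto.html": "Linux Sound HOWTO v2.0 ...",
    "http://mirror-c.example/howto.html": "Linux Sound HOWTO v1.9 ...",
    "http://unique.example/notes.html": "original notes nobody mirrors",
}
groups = group_exact_duplicates(pages)
```

The two identical mirrors collapse into one group; the older edition and the unique page each stand alone, so a results page could list three entries instead of four.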

Re: Search engines lack effective basic filtering (1.00 / 1) (#8)
by Anonymous Hero on Tue Jul 11, 2000 at 06:55:17 AM EST

MD5 sums won't tell you anything. Add a space to the beginning of the page, change the title, heck, even a changed CVS string will change the MD5 sum. And anyway, different pages could possibly have the same MD5 sum. Normally not an issue, but when you index a billion pages it happens.

It would probably be better to do a diff-style operation on a page: if 90% of it is the same, it probably is the same. But how do you do this on a billion pages? That's on the order of a billion billion comparisons, each requiring a non-trivial computation.

That's why you get the same site, and most HOWTOs should have a link to the most up-to-date copy anyway.
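The "diff-style" comparison described above is essentially a textual-overlap measure. One rough sketch is word shingling with Jaccard similarity; the 90% threshold and the documents below are only illustrative, not how any particular engine does it.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    # Fraction of shingles the two documents share.
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river edge"
doc3 = "completely different text about search engines and web directories"
```

Two near-identical documents score high (doc1 vs. doc2 differ by one word), unrelated ones score zero. Comparing every pair this way is exactly the quadratic blowup the comment warns about, which is why real deduplication needs something smarter than all-pairs diffs.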

(paranoidfish, can't be arsed to login)

[ Parent ]
Re: Search engines lack effective basic filtering (5.00 / 1) (#16)
by fuzzygroup on Tue Jul 11, 2000 at 10:00:00 PM EST

Given the very real problems with content comparison and performance, this is actually a real application for distributed computing approaches -- www.centrata.com, distributed.net, etc.

[ Parent ]
Re: Search engines lack effective basic filtering (5.00 / 1) (#19)
by Anonymous Hero on Wed Jul 12, 2000 at 02:29:43 AM EST

Very hard problem to distribute. Broken into chunks small enough to be communicable across real-world bandwidth, you've got waayyy too many chunks, and too little result from each unit of computation. The overhead eats you alive.

...and that's all I can really say right now, dammit. I wanna tell more about what I'm doing sooo bad right now I can taste it...

[ Parent ]

How many are pages? (4.00 / 2) (#6)
by Metrol on Tue Jul 11, 2000 at 02:49:19 AM EST

As I took over working on my company's web site, all the pages were static HTML. Every major product had a page to itself, plus all the ancillary stuff like a contact page and such. All told, it was something like 500 pages of static content. Oh yeah, lots of fun to keep up to date.

I then re-wrote the entire thing using PHP with a database back end. The bulk of the navigation process is now down to 4 files which create the various page views on the fly. In essence, by utilizing a database backend I've created many more "pages" from what a user would see, even though it's only the same data presented in a variety of ways.

Another project I worked on involved putting together a news site for a friend of mine, again using PHP and a database. We're only talking about maybe 6 files making up the bulk of the site, with 1000's of pages generated from the stories he has entered.

Yeah, there may be 7 million pages of content created each day, but what is the nature of that content? If it's being dynamically fed through a scripting engine then of course the numbers look ludicrously inflated. Heck, how many "pages" are generated each day on sites such as this one?

What might be a more interesting approach to measuring growth is to see how many new domains are actually being put to use, not just parked. I think you'd get a much clearer picture than from trying to count pages, which are increasingly dynamically generated.
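The inflation the parent describes can be made concrete with a toy model (every number below is invented): a handful of script files serving pages out of a database looks like thousands of distinct pages to a crawler.

```python
# Toy model of a dynamic site: four PHP-style scripts render every page
# from database rows, so the "page" count a crawler sees is inflated.
scripts = ["view.php", "section.php", "contact.php", "search.php"]
story_ids = range(1, 1001)  # 1,000 stories sitting in the database

# A crawler counts each query-string variant as a distinct page.
crawlable_urls = [f"view.php?id={i}" for i in story_ids]

pages_per_file = len(crawlable_urls) / len(scripts)
print(f"{len(scripts)} source files -> {len(crawlable_urls)} crawlable "
      f"pages ({pages_per_file:.0f} per file)")
```

Counting domains in active use, or bytes of genuinely new content, sidesteps this multiplier; counting URLs does not.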

Re: How many are pages? (none / 0) (#7)
by Imperator on Tue Jul 11, 2000 at 04:18:25 AM EST

Domains aren't an accurate measure of content any more than "pages". I believe you're correct that dynamic pages account for most of that 7m number, but there's a figure hiding behind that that's difficult to quantify. How much new content is appearing? Pages and domains can't measure that.

[ Parent ]
Re: How many are pages? (none / 0) (#10)
by genehack on Tue Jul 11, 2000 at 09:23:45 AM EST

I believe you're correct that dynamic pages account for most of that 7m number, but there's a figure hiding behind that that's difficult to quantify. How much new content is appearing? Pages and domains can't measure that.

Well, we could use bytes to quantify it -- that's not too hard. Getting a count of new bytes, and telling new bytes from old, is left as an exercise for the motivated reader.


[ Parent ]

Re: How many are pages? (none / 0) (#11)
by dblslash on Tue Jul 11, 2000 at 09:46:16 AM EST

I'm not sure that was the original purpose of the question. If you haven't noticed, most of the content out there is taken from existing sites. The same information is simply copied from site to site. It is a rare site, indeed, which creates its own content. Looked at in the appropriate light, k5 itself could be said to provide minimal content. Stories are submitted which are simply links to other sites. The content obviously comes from the discussion generated, but very rarely is new information put forth.

What about sites like memepool where the site is simply a collection of links? Is it new content? While it's nice to have those links assembled in the same place, nothing is preventing me from finding the original sites.

sorry.. just random thoughts. :-)

[ Parent ]

Re: How many are pages? (none / 0) (#12)
by genehack on Tue Jul 11, 2000 at 10:17:40 AM EST

If you haven't noticed, most of what content is out there is taken off of existing sites. The same information is simply copied from site to site.

Those two sentences say different things. Content != information, at least not necessarily. Sure, a weblog posting a link to a story on another site is generating minimal new information. It is, however, generating new content -- a stream of bytes novel for that particular location on the 'net.

And, even then, new information can be generated from the new content -- suppose, to use your example, that a memepool editor chooses to juxtapose two links in a novel fashion -- isn't new information being generated there? (Consider the "support group for everything" item.)

sorry.. just random thoughts. :-)

Indeed. 8^)= I think what we're providing is content (or is it information?) demonstrating that the definitions of same in the original article were perhaps a bit vague.

thinking he should take another crack at Shannon...

[ Parent ]

Re: How many are pages? (none / 0) (#18)
by Anonymous Hero on Tue Jul 11, 2000 at 11:17:25 PM EST

That's true. Domains aren't accurate, take Everything2.net and Everything2.com - both of which are the same site.

Also, following the previous example, simple byte counts don't work to measure content either, as many sites repeat the same template on every page (usually a menu, site logo, etc., a la Kuro5hin's template). Is that content? How about the fourth time you download it?

I think the only accurate measure would count bytes/words, not pages, but remove:

  • the menu/logo/~everything that is The Template[tm]
  • duplicate content. Duplicate content could mean the number of sites that reprint Reuters News Service, or the number of URLs that just show the same Slashdot post (albeit further down the page).

[ Parent ]
Re: How many are pages? (none / 0) (#20)
by Metrol on Wed Jul 12, 2000 at 03:54:04 AM EST

Domains aren't an accurate measure of content any more than "pages".

You are quite correct, and I almost feel silly for suggesting it at this point. Done caught me grabbing at straws for something quantifiable.

[ Parent ]
Relative (2.20 / 5) (#9)
by Photon Ghoul on Tue Jul 11, 2000 at 08:38:32 AM EST

"crap" and "junk" are in the eyes of the beholder
no sig
Re: Relative (5.00 / 1) (#13)
by aether on Tue Jul 11, 2000 at 04:13:32 PM EST

Very true. But can we sift through it to find the relative 10% that is gold? If so, it would make using the Internet even more valuable than it already is.

[ Parent ]
Re: Relative (5.00 / 1) (#17)
by Ranger Rick on Tue Jul 11, 2000 at 10:22:38 PM EST

I guess it depends on "training". I've been on the net for about 7 years, and I've learned all the tricks it takes to get the most out of search engines.

I can usually find pretty much anything I'm looking for within five minutes, if it's indexed. I know there are many people that are the same way.

Perhaps we should have search engine classes.

Of course, making it so search engines can do this for us in the future is left as an exercise for the programmer. :)


[ Parent ]
How do we find the non-crap? (5.00 / 1) (#14)
by torpor on Tue Jul 11, 2000 at 04:57:15 PM EST

Simple. Don't follow the 90% rule. It's a lie.

j. -- boink! i have no sig!
Well almost all of it is crap (5.00 / 1) (#15)
by gelfling on Tue Jul 11, 2000 at 08:13:59 PM EST

It would be easier to find the gold among the dross and ignore the vast, overwhelming majority of non-content. Perhaps we need more domains, e.g. .crap, .per (personal web page), .adv (advertising), etc. This would at least categorize search engines. OK, OK, maybe that won't work...

More important is the way search engines work: by keywords. There is no context. There is no practical way to search by the relationships of snippets of content to one another. In the end, a vast database of everything on the web would be no more successful than capturing only 25% or 50%, since there is a very low probability of anything you missed actually being of any benefit to your search. That is, even if you had a card catalog of everything in the library, you would still have to look into each source to determine whether you could use it. In the library example you can probably manually search every relevant source and quickly decide if it is relevant by flipping through the index or the table of contents, or simply looking at the title. But when you scale this up a hundred or so orders of magnitude, the problem has to be attacked differently.

What you need is a way to build complex context relationships for every 'atomic' content value. A content atom is the shortest possible string of contiguous information that can be context-linked to any other atomic unit. Even if you could actually parse content into that many strings and assign a contextual value, indicator, and index to each, you would then have the additional problem of performing the same process on non-textual information like pictures, sounds, equations, etc.

Sounds to me like you would bump up against a classic data-warehouse problem for meta-indexes on very large DBs: a meta-index that is larger than the source data. Apply a meta-index to the meta-index... and so on. OK, that's it. My brain hurts.

sorting out the best sites. (5.00 / 1) (#21)
by Anonymous Hero on Wed Jul 12, 2000 at 03:58:05 PM EST

There are plenty of interesting and useful sites out there buried in a mass of white noise. Right now, the best way I know of to find them is Google, and it's not great at all. Don't get me wrong: it's okay if you're looking for something specific, but if you just want to find a good site on some general topic, it's lousy. Why? It looks at how often a site is 'hit', which means sites that have been around for a long time and sites that spend a lot on ads get returned more often, even if there are much more useful sites out there.

Dmoz.org (Netscape's Open Directory site) seems like a good solution, but it's not. I poked around it and found plenty of garbage. I didn't get into it enough to understand exactly what is wrong there, but my first guess is that every listed site needs a 'vote' button next to it that lets you give it a mark from 1 to 5.

What, no one's mentioned Sturgeon's Law yet? (5.00 / 1) (#22)
by marlowe on Fri Jul 14, 2000 at 10:08:59 AM EST

I guess it went without saying.
-- The Americans are the Jews of the 21st century. Only we won't go as quietly to the gas chambers. --

