What if search engines were cheap?

By phinance in Internet
Sat Feb 17, 2001 at 08:56:05 PM EST
Tags: Culture

Most search engines on the web today suffer from two problems: (i) they cover so many web pages that queries can yield many irrelevant results, and (ii) updating their databases takes so long that indexed pages are often out of date. One solution to these problems is to have individuals create many small search engines, each covering a niche topic. Because each would cover only a small number of related pages, these search engines would return relevant results and could be updated more frequently. If the cost of creating and maintaining such a search engine were low, these sites could generate the revenue (from advertisements or "tip jar" funding) needed to be self-sustaining.

To make this work, we need a cheap, easy search engine, like LiSEn.


A Problem With Search Engines Today

Search engines today are centralized behemoths. They collect and index hundreds of millions of web pages and hope to find information about any topic from a query that is just a word -- or maybe a few words. While my favorite search engine, Google, usually seems up to the task, sometimes it's a little difficult to find what I want. Sometimes I want to find out what a collection of my favorite Linux/Open Source sites has said about, say, Open Content or network audio, or my book (go on, laugh, but I like to know how it's doing sometimes... ;) Then what? Search for, say, "network audio" on Google and it reports 1,600,000 hits, most of which, no doubt, are not within my favorite set of Linux/Open Source sites.

Updating a huge database of web pages (Google currently says "Search 1,326,920,000 web pages" on its home page) can take a long time, even with many expensive crawlers/indexers, since data has to be transferred from sites all over the world to a single location. Thus, we often find out-of-date pages on search engines, and fail to find recent ones.

My Search Engine

To combat this problem I built myself a (quick) little search engine that indexes and searches K5, Slashdot, Newsforge, etc., and even delivers cached HTML pages to me, all from my home Linux box. (Aside: yes, there's a huge amount of overlap between those sites :) but each one provides something original.) Woo-hoo! I'm set. I've been preparing to move it to a publicly accessible server to share it with anyone else who might be interested.

Then I started thinking.

As I wrote the search engine I kept in mind the resources of a typical cheap web hosting site: Apache, standard Perl, limited bandwidth, CPU shared with lots of other sites, possibly no shell access, and maybe 100MB of disk space. MySQL will cost you extra. I wanted to be able to run the search engine from this type of web hosting account, so the search engine, which I've dubbed LiSEn, for "Little Search Engine", was designed accordingly.

First, LiSEn's crawler/indexer is run on a separate machine from the web server. This allows me to use my home computer's CPU (which is dedicated to me) and connection (which is flat rate) to do the hard work (crawling and indexing). I can then ship the results off to the web server via scp or ftp. This system keeps me from interfering with other web sites running on the same server, avoids the need for shell access on the server, and saves me some money in bandwidth.
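To make that concrete, the home-machine side boils down to something you can run from cron. This is only a sketch of the idea; the script names and paths below are placeholders, not the ones that ship in the lisen-0.9 tarball:

    #!/usr/bin/perl
    # Illustrative cron job for the home box: crawl and index locally,
    # then push the finished index to the cheap web host over scp.
    # "lisen-crawl" and the paths here are made-up placeholders.
    use strict;
    use warnings;

    my $build_dir = "/home/dave/lisen/index.new";    # hypothetical local build area
    my $remote    = 'account@cheaphost.example.com:public_html/lisen/index';

    # 1. Crawl and index at home, where the CPU and bandwidth are ours alone.
    system("lisen-crawl --config /home/dave/lisen/sites.conf --out $build_dir") == 0
        or die "crawl/index failed: $?";

    # 2. Ship the finished, sorted index files to the web host.
    #    Only the read-only search CGI runs on the host, so no shell access is needed there.
    system("scp -r $build_dir/* $remote") == 0
        or die "upload failed: $?";

    print "index rebuilt and uploaded\n";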

Second, LiSEn uses regular files to store the keywords. The keywords are sorted and indexed, so searches are still fast (tested with up to ~500,000 unique keywords). Since the database is rewritten in a single operation on the web server by the update from the crawler, and the web search scripts only read the database, there is no need to worry about file locking or transactions. Thus, we do not need MySQL or any other database software (that might cost you extra on the web server) for this system to function.
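For the curious, the reason plain files are fast enough is that a sorted keyword file can be searched with a binary search over byte offsets at query time. The snippet below is only a sketch of that idea, not the actual lisen-0.9 code, and the file layout (one sorted line per keyword, "keyword<TAB>url1 url2 ...") is an assumption:

    #!/usr/bin/perl
    # Sketch of a read-only keyword lookup on a sorted flat file.
    # Assumed layout: one line per keyword, sorted, "keyword<TAB>url1 url2 ...".
    use strict;
    use warnings;

    sub lookup {
        my ($file, $word) = @_;
        open my $fh, '<', $file or die "can't open $file: $!";
        my ($lo, $hi) = (0, -s $fh);

        # Binary search over byte offsets for the first line whose key >= $word.
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            seek $fh, $mid, 0;
            <$fh> if $mid > 0;                 # discard the partial line we landed in
            my $line = <$fh>;
            my ($key) = defined $line ? split(/\t/, $line, 2) : (undef);
            if (defined $key && $key lt $word) { $lo = $mid + 1 }
            else                               { $hi = $mid }
        }

        # Re-read at the converged offset and check for an exact match.
        seek $fh, $lo, 0;
        <$fh> if $lo > 0;
        my $line = <$fh>;
        close $fh;
        return [] unless defined $line;
        chomp $line;
        my ($key, $urls) = split /\t/, $line, 2;
        $urls = '' unless defined $urls;
        return $key eq $word ? [ split ' ', $urls ] : [];
    }

    print "$_\n" for @{ lookup('keywords.idx', lc($ARGV[0] // 'linux')) };

Because the crawler's update replaces the whole file in one shot and the CGI only ever reads it, no locking is needed even on a busy shared host.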

Here's where the thinking came in: since this search engine was designed to be cheap, many people can afford to run it on a web site. All that is needed for wider acceptance is ease of use. The current version of LiSEn, lisen-0.9, is not difficult to use, but it can be made easier. I'm working on that and will only release version 1.0 when I'm satisfied that requirement is met. I hope you'll try it out and send me your comments and suggestions (and, if you're really nice, patches :).

Decentralized Searching: A social solution to a technological problem

Now if lots of people create niche-topic, LiSEn-based search engines, there may come a time when we'll be able to easily search for what we really want. Imagine, if you will, search engines covering the home pages of the people from your school or town, or covering all of the good quilting sites, or the tourist sites for every country in the world -- or just the countries deemed most fun by a group of friends who travel a lot.

Not only will these search engines give more relevant results within their topic, but they may be kept more up to date. A centralized search engine covering the whole web faces the problem of transferring data from tens of millions of sites all over the world to a central location, so it is not uncommon to find outdated pages. A small search engine covering, say, 10 to 100 sites could be updated more often than once a week. Additionally, many small crawlers distributed throughout the world would not compete with each other for bandwidth the way a centralized cluster of crawlers might. (This comment about competition is just a guess and is unsubstantiated, but it may be one of the reasons, aside from serving many visitors, that Google uses Exodus for their bandwidth and not, say, DSL from their local provider ;)

What do you think?

  • Would you use LiSEn?
  • What set of web sites would you search?
  • How do you think the LiSEn concept could be improved?

Related Links
o Slashdot
o Google
o LiSEn
o my book
o current version of LiSEn
o Google uses Exodus


What if search engines were cheap? | 23 comments (13 topical, 10 editorial, 0 hidden)
In addition (3.00 / 1) (#2)
by mind21_98 on Sat Feb 17, 2001 at 03:49:04 PM EST

This idea sounds good. However I do have a few questions:

  1. Would it be possible to create a search interface that would automatically route your search to one of these smaller search engines? This would allow the user to go to one place to search but at the same time provide the benefits of smaller search engines.
  2. How would the crawler determine whether a site is related to the topic the search engine covers? Will there be an automatic method for doing this, or will the search engine administrator have to check each URL by hand? And what happens if a site referenced by a page that is deemed relevant is itself unrelated to what the search engine is designed to search for?
Just my two cents about this idea :)

--
mind21_98 - http://www.translator.cx/
"Ask not if the article is utter BS, but what BS can be exposed in said article."

Distributed Searching (3.33 / 3) (#4)
by interiot on Sat Feb 17, 2001 at 04:21:30 PM EST

Distributed searching is one of the more complicated distributed applications. And people have been thinking about it for a while... a search on google turns up 1730 hits.

One of the many problems that I see with it is the Gnutella problem: if every query gets sent to several thousand mini-servers, then the bandwidth requirements are multiplied to unmanageable levels.

isn't there also a trust problem? (3.50 / 2) (#11)
by Estanislao Martínez on Sat Feb 17, 2001 at 07:51:49 PM EST

Isn't there also the problem that in a distributed search system, one of the nodes could abuse the system to make a particular "result" rank unmeritedly high?

--em
[ Parent ]

Centralized Trust Server? (3.00 / 1) (#15)
by zephiros on Sat Feb 17, 2001 at 09:03:40 PM EST

Well, a simple fix for trust issues would be having a centralized server that rated installations of the software. When a peer installation of the software returned results, it could return the target URL in the form:

http://mothership.lisen.org?url=www.example.com

Clicking the link would send the user to the central server, which would then 302 them on to the correct site. The mothership could use the HTTP referer to determine which site sent the user. Referer checking could also be used to protect against some types of cheating; if a client came from a URL that didn't look Lisen-like, the sender would get no trust points. Referer checking could allow the central server to determine how many pages of search results the user had to dig through before they found a hit (presumably results are returned by page, and the page URL includes a variable which indicates the current page number). Finally, assuming the referring URL included search terms, referer checking would allow the central server to determine what servers returned the most useful results for which search terms.

I suppose there would also need to be some heuristic in place to protect against one IP (or subnet, in the case of a dial-up pool) spamming the central server. This sounds like a job for a perl script.
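As a toy sketch of the redirector (every name, path, and the "looks Lisen-like" test below are made up for illustration; no such mothership exists), it could be as small as:

    #!/usr/bin/perl
    # Toy "mothership" redirector: credit the peer that sent the click
    # (via the Referer header), then 302 the user on to the target page.
    # All names and the referer heuristic are illustrative only.
    use strict;
    use warnings;
    use CGI ();

    my $q       = CGI->new;
    my $target  = $q->param('url') || 'www.example.com';   # e.g. ?url=www.example.com
    my $referer = $q->referer      || '';                   # empty if the browser strips it

    # Credit the sending installation only if the referer looks Lisen-like.
    if ($referer =~ m{^https?://([^/]+)/.*lisen}i) {
        my $peer = $1;
        open my $log, '>>', '/tmp/lisen-clicks.log' or die $!;
        print {$log} join("\t", time, $peer, $target), "\n";
        close $log;
    }

    # Send the user on to the real page.
    print $q->redirect(-uri => "http://$target", -status => 302);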
 
Kuro5hin is full of mostly freaks and hostile lunatics - KTB
[ Parent ]

still... (none / 0) (#17)
by Estanislao Martínez on Sun Feb 18, 2001 at 12:01:51 AM EST

Well, a simple fix for trust issues would be having a centralized server that rated installations of the software.

This replaces one trust issue with another, so it's not a fix in and of itself; although it does minimize the trust required, so it is indeed an improvement.

The problem then becomes how to rate installations of the software, which I believe could be, at the least, hairy. How can the server evaluate if a node is giving the answers it should, without duplicating the work that the node does? This would be the ultimate benchmark, but it would make the whole distributed system pointless. It's hard to imagine that any cheaper method would not be abusable in some way.

When a peer installation of the software returned results, it could return the target URL in the form:

http://mothership.lisen.org?url=www.example.com

Clicking the link would send the user to the central server, which would then 302 them on to the correct site. The mothership could use the HTTP referer to determine which site sent the user. Referer checking could also be used to protect against some types of cheating; if a client came from a URL that didn't look Lisen-like, the sender would get no trust points.

I don't understand which kind of cheating you have in mind here. I'm assuming that we are dealing with a site which is registered as a member of a hypothetical Lisen network. Said site, when given a search, returns results which favor some pre-chosen sites instead of the ones that should come out on top. This can be made quite subtle -- e.g. a search for "Linux" could be made to return links from some particular company's site first.

Also, I run Internet Junkbuster, which blocks referrer lines. What would you do with my case? This is crucial, since if you decide not to require a referrer line to allow me to get through, no security mechanism depending on them can work.

Referer checking could allow the central server to determine how many pages of search results the user had to dig through before they found a hit (presumably results are returned by page, and the page URL includes a variable which indicates the current page number). Finally, assuming the referring URL included search terms, referer checking would allow the central server to determine what servers returned the most useful results for which search terms.

You are trusting the referring URL, which I can fake in order to fool your central server.

Also, there's another possible problem-- how good of a benchmark is the result number of the relevant link for judging specialized search nodes? Perhaps there is something about the nature of the data that a particular node specializes in that makes the problem of identifying the best page tougher.

Anyway, an abuser could set up a node to favor sites which did have a lot of relevant data, and would get clicked on a lot anyway-- they just happen to get even more links because of the node putting them at #1. How do you distinguish this kind of attack from the normal case?

--em
[ Parent ]

Re: Still (none / 0) (#20)
by zephiros on Sun Feb 18, 2001 at 02:13:15 AM EST

How can the server evaluate if a node is giving the answers it should, without duplicating the work that the node does?

It doesn't. Evaluation is based on utility to the users, not legitimacy of the results. As stated, the intended use is to spider small portions of the web. Short of giving each person a list of sites which they are required to spider, you're going to end up with skewed content. IMO, the only measuring stick you can apply to every site is "do most users find what they were looking for?"

Also, I run Internet Junkbuster, which blocks referrer lines. What would you do with my case? This is crucial, since if you decide not to require a referrer line to allow me to get through, no security mechanism depending on them can work.

Er. Security? Where? The intent is to prevent cheating on the part of site operators. The worst that can happen is that the site will not get credited for your clicks, and the central server will send you on to the content. I look at a lot of web logs. 99.99% of users are not using referer-blocking software.

You are trusting the referring URL, which I can fake in order to fool your central server.

True, but not often enough to substantially skew results. Unless you change your IP address to a brand-new network each time. Of course, even if you do, if the central server admins figure out that you're cheating, they can purge your site stats. The idea is not to make it impossible to cheat. The idea is to make it expensive.

how good of a benchmark is the result number of the relevant link for judging specialized search nodes? Perhaps there is something about the nature of the data that a particular node specializes in that makes the problem of identifying the best page tougher.

Let's say site X produces precisely the results that any given user wanted on the first page. As such, Site X would only produce one call to the central server per search. Site Y provides piles of partial matches, but requires the user to dig through page after page of garbage. Site Y would produce many central server hits per search. My intent was to recognize that, even though Site Y registered more hits, Site X was delivering as much utility.

Anyway, an abuser could set up a node to favor sites which did have a lot of relevant data, and would get clicked on a lot anyway-- they just happen to get even more links because of the node putting them at #1. How do you distinguish this kind of attack from the normal case?

This isn't an attack. This is good design. If a site indexed lots of really useful documents that lots of people were looking for, then by all means the central server should send people there (providing the documents pertained to the user's search). I don't see how indexing relevant data is a bad thing.
 
Kuro5hin is full of mostly freaks and hostile lunatics - KTB
[ Parent ]

LiSEn in action? (4.66 / 3) (#5)
by Dries on Sat Feb 17, 2001 at 04:52:45 PM EST

This all sounds nice and dandy, but I would like to see LiSEn in action first! All I can find is the tarball with the source code, and I'd like to try LiSEn before installing it myself. So here's my question: is there a demo site set up where we can see LiSEn in action?

-- Dries
Demo LiSEn site (4.00 / 1) (#19)
by phinance on Sun Feb 18, 2001 at 12:29:11 AM EST

Take a look at the demo site. It's a search engine for some of my favorite tech-related sites like K5, Slashdot, Wired, etc.

Dave
Read, annotate, and discuss open source documentation.
Andamooka: Open support for open content.
[ Parent ]

I am not an expert, but... (3.50 / 2) (#10)
by ponos on Sat Feb 17, 2001 at 07:32:52 PM EST

I voted +1 to section, but I do have some technical concerns.

A successful search engine should return hits that actually mean something relevant to the keywords, and this requires the ability to selectively rank the relevant pages and present the most appropriate ones.

I do not think you can do this with a fragmented knowledge base, because each "topical" search engine cannot compare the relevance of its results with other "topical" engines and decide which subset of the total (all engines included) number of hits should be presented. Theoretically it can be done, but it would probably take a great number of transactions between engines or with the client.

What I am trying to say is that this sort of thing will return, say, 1000 results (that's not too many, really, and refining your search beyond that point might be tiresome or inaccurate). The problem is that you cannot rank them according to their "relevance" (simplistic approach: number of occurrences of the keyword), because in order to do that you actually need to >sort< the 1000 results, and this must be done either by the client or by the engines.

If the client needs to receive 1000 results >before< it can present them to the user (because the user will not want the list in random order), it might take an extended amount of time, say one minute, which is plainly not good. Google will send you the first 10 or so in a few seconds and will usually get it right. Sorting while the results arrive is a possible workaround, but there is no way to know that the last result will not be the one you want.

Further difficulties arise from the fact that you can't just compile a list of "keywords" if you want to assess the query accurately. You also need to understand interactions between terms that make an article about computers (mentioning "music") much different from an article about music (mentioning "computers"). This means that the database has to be analyzed statistically to find important patterns that allow the search engine to "guess" what the user really wants.

Such technology already exists, but if you want to have a valid "ranking" system (assuming that you find a way to do it) you will have to install the same compiled "keyword file" on all engines, otherwise different engines will give different rankings to the same page. This is obviously time-consuming, because this "keyword file" would have to be updated across all hosts every time any one of the databases changes significantly. If you fail to address this problem, and assuming you somehow get a ranking system to work, you will still need a meta-search engine to find which topical engine is most appropriate.

Anyway, it sounds like an honest effort, but my understanding is that handling extremely large datasets (10 GB across 100 hosts is still a large dataset!) is tough, especially when interaction with a >human< user through a restrictive medium (modem/internet) is desired.
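To make the merge step concrete, here is a toy sketch with made-up data: nothing can be shown to the user until every engine has answered and the combined list has been sorted, and the per-engine scores are not even directly comparable with each other.

    #!/usr/bin/perl
    # Toy illustration of the client-side merge: results from several
    # topical engines only reach global relevance order after *all* of
    # them have arrived. Engines, URLs, and scores here are invented.
    use strict;
    use warnings;

    # Pretend each topical engine returned (url, score) pairs, where "score"
    # is something simplistic like keyword occurrence count.
    my %per_engine = (
        'linux-engine'    => [ [ 'http://example.org/a', 12 ], [ 'http://example.org/b', 3 ] ],
        'audio-engine'    => [ [ 'http://example.net/c',  9 ] ],
        'hardware-engine' => [ [ 'http://example.com/d',  7 ], [ 'http://example.com/e', 1 ] ],
    );

    # The merge: flatten everything, then sort by score. The user sees
    # nothing until the slowest engine has answered, and scores computed
    # by different engines are not guaranteed to be on the same scale.
    my @all    = map { @{ $per_engine{$_} } } keys %per_engine;
    my @ranked = sort { $b->[1] <=> $a->[1] } @all;

    printf "%3d  %s\n", $_->[1], $_->[0] for @ranked;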

I will ignore total bullshit in others' posts. Please be kind enough to ignore total bullshit in my posts, too.

I will only respond to reasonable arguments. My time is extremely limited.

Petros


-- Sum of Intelligence constant. Population increasing.
Clarification (none / 0) (#13)
by phinance on Sat Feb 17, 2001 at 08:37:01 PM EST

The goal is to have a user visit a single topical search engine, query it, and get results from that search engine. In effect, the first step in the search process is to choose the right search engine. My hope is that niche search engines will be promoted within their niche so that if you visit, say, any popular sites on knitting, you'll find a link to a knitting search engine which will let you search all of the "good" (as determined by the owner of the search engine) sites on knitting.

The idea of submitting a query about any topic to a central location and getting a result is, well, not at all what I meant to suggest. Getting good results from a small, well-constructed database is easier than getting them from a single large one. Getting results from a single large DB that is sitting on computers all over the world would be, as you said, even harder. :)

This is not to say that having a "one master, many slaves" system like you talked about is not an interesting idea, it's just not what LiSEn is about. It's about putting search engine technology in the hands of any enthusiast who thinks he/she can pick a good set of sites and wants to help out his/her community.

Dave
Read, annotate, and discuss open source documentation.
Andamooka: Open support for open content.
[ Parent ]

gonesilent? (3.00 / 1) (#14)
by heighting on Sat Feb 17, 2001 at 08:52:40 PM EST

I believe this is what gonesilent (formerly infrasearch) is doing. Try searching for either of the two words in google.

improvements (none / 0) (#22)
by nickp on Mon Feb 19, 2001 at 04:58:21 PM EST

You may want to add the following:
  • a search engine of search engines, so users can be directed to appropriate local engines
  • natural language queries. There's a lot of potential in this. At least it's not too hard to be better than AskJeeves.
BTW, it seems that you are unaware of the fact that going to http://www.google.com/linux/ will restrict your searches to linux themes. I just tried "network audio" on it and got 64,000 hits instead of 1,600,000.

"Gravitation cannot be held responsible for people falling in love." -- Albert Einstein

re: improvements (none / 0) (#23)
by phinance on Tue Feb 20, 2001 at 02:43:21 PM EST

BTW, it seems that you are unaware of the fact that going to http://www.google.com/linux/ will restrict your searches to linux themes. I just tried "network audio" on it and got 64,000 hits instead of 1,600,000.

I wasn't aware of that. That's a nice feature. In general it would be nice if enthusiasts in any subject community could create their own sub-Google by going to, say, my.google.com, and entering a list of URLs and a description of the list to create a niche search engine. Then Google could post these descriptions and let people search just these sub-engines.

I do realize that ranking, clustering, and other techniques help to do these things automatically, but letting people create customized search sets could greatly increase the signal/noise ratio in search results on the given topic.

Dave
Read, annotate, and discuss open source documentation.
Andamooka: Open support for open content.
[ Parent ]
