A Problem With Search Engines Today
Search engines today are centralized behemoths. They collect and index hundreds of millions of web pages and hope to find information about any topic from a query that is just a word -- or maybe a few words. While my favorite search engine, Google, usually seems up to the task, sometimes it's a little difficult to find what I want. Sometimes I want to find out what a collection of my favorite Linux/Open Source sites has said about, say, Open Content or network audio, or my book (go on, laugh, but I like to know how it's doing sometimes... ;) Then what? Search for, say, "network audio" on Google and it reports 1,600,000 hits, most of which, no doubt, are not within my favorite set of Linux/Open Source sites.
Updating a huge database of web pages (Google currently says, "Search 1,326,920,000 web pages" on its home page) can take a long time, even with many expensive crawlers/indexers, since data has to be transferred from sites all over the world to a single location. Thus, we often find out-of-date pages on search engines, and do not find recent pages.
My Search Engine
To combat this problem I built myself a (quick) little search engine that indexes and searches K5, Slashdot, Newsforge, etc. (Aside: Yes, there's a huge amount of overlap here :) but each site provides something original.) and even delivers cached HTML pages to me, all from my home Linux box. Woo-hoo! I'm set. I've been preparing to move it to a publicly accessible server to share it with anyone else who might be interested.
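To give a rough idea of the crawler side, the fetch-and-cache step boils down to something like this (heavily simplified, with a made-up site list and cache layout):

    #!/usr/bin/perl -w
    # Sketch only: fetch each site's front page and keep a cached copy on disk.
    use strict;
    use LWP::Simple qw(get);

    # Placeholder site list -- adjust to taste.
    my @sites = (
        'http://www.kuro5hin.org/',
        'http://slashdot.org/',
        'http://www.newsforge.com/',
    );

    mkdir 'cache' unless -d 'cache';

    foreach my $url (@sites) {
        my $html = get($url);          # fetch the page over HTTP
        next unless defined $html;     # skip sites that didn't answer

        # Turn the URL into a filename that is safe for the cache directory.
        (my $file = $url) =~ s![^\w.-]+!_!g;
        open my $fh, '>', "cache/$file.html" or die "cache/$file.html: $!";
        print $fh $html;
        close $fh;
    }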
Then I started thinking.
As I wrote the search engine I kept in mind the resources of a typical cheap web hosting site: Apache, standard Perl, limited bandwidth, CPU shared with lots of other sites, possibly no shell access, and maybe 100MB of disk space. MySQL will cost you extra. I wanted to be able to run the search engine from this type of web hosting account, so the search engine, which I've dubbed LiSEn, for "Little Search Engine", was designed accordingly.
LiSEn's crawler/indexer is run on a separate machine from the web server. This allows me to use my home computer's CPU (which is dedicated to me) and connection (which is flat rate) to do the hard work (crawling and indexing). I can then ship the results off to the web server via scp or ftp. This system keeps me from interfering with other web sites running on the same server, avoids the need for shell access on the server, and saves me some money in bandwidth.
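The update step itself amounts to something like this (simplified; the hostname, paths, and file names are placeholders, not my real setup):

    #!/usr/bin/perl -w
    # Simplified update step, run on the home machine after a crawl:
    # push the finished index files to the web host in one go.
    use strict;

    # Placeholder destination -- substitute your own account and path.
    my $remote = 'user@cheap-webhost.example.com:public_html/lisen/data/';
    my @files  = ('keywords.idx', 'pages.dat');

    # The web-side CGI scripts only ever read these files, so a plain copy
    # is good enough here; a fancier setup could upload under temporary
    # names and rename them into place.
    system('scp', @files, $remote) == 0
        or die "scp failed: $?";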
Second, LiSEn uses regular files to store the keywords. The keywords are sorted and indexed, so searches are still fast (tested with up to ~500,000 unique keywords). Since the database is rewritten in a single operation on the web server by the update from the crawler, and the web search scripts only read the database, there is no need to worry about file locking or transactions. Thus, we do not need MySQL or any other database software (that might cost you extra on the web server) for this system to function.
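To see why searching a sorted flat file can be fast, here is a rough sketch of one way to do the lookup -- a binary search on byte offsets, assuming a made-up "keyword<TAB>page-ids" format with one entry per line:

    #!/usr/bin/perl -w
    # Simplified lookup in a sorted "keyword<TAB>page-ids" file (one entry per line).
    use strict;

    sub lookup {
        my ($path, $word) = @_;
        open my $fh, '<', $path or die "$path: $!";
        my ($lo, $hi) = (0, -s $fh);

        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            seek $fh, $mid, 0;
            <$fh> if $mid > 0;            # throw away the partial line we landed in
            my $line = <$fh>;
            if (!defined $line) {         # ran off the end of the file
                $hi = $mid;
                next;
            }
            my ($key, $pages) = split /\t/, $line, 2;
            if    ($key lt $word) { $lo = $mid + 1 }
            elsif ($key gt $word) { $hi = $mid }
            else  { chomp $pages; close $fh; return $pages }
        }
        close $fh;
        return undef;                     # keyword not in the index
    }

    my $hits = lookup('keywords.idx', 'audio');
    print defined $hits ? "pages: $hits\n" : "no match\n";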
Here's where the thinking came in: Since this search engine was designed to be cheap, many people can afford to run it on a web site. All that is needed for wider acceptance is ease of use. The current version of LiSEn, lisen-0.9, is not difficult to use, but it can be made easier. I'm working on that and will only release version 1.0 when I am satisfied that that requirement is met. I hope you'll try it out and send me your comments and suggestions (and, if you're really nice, patches :).
Decentralized Searching: A social solution to a technological problem
Now if lots of people create niche-topic, LiSEn-based search engines, there may come a time when we'll be able to easily search for what we really want. Imagine, if you will, search engines covering the home pages of the people from your school or town, or covering all of the good quilting sites, or the tourist sites for every country in the world -- or just the countries deemed most fun by a group of friends who travel a lot.
Not only will these search engines give more relevant results within their topic, but they may be kept more up-to-date. A centralized search engine covering the whole web faces the problem of transferring data from tens of millions of sites all over the world to a central location, and so it is not uncommon to find outdated pages. A small search engine covering, say, 10-100 sites could be updated more often than once a week. Additionally, many small crawlers distributed throughout the world would not be competing with each other for bandwidth as a centralized cluster of crawlers might. (This comment about competition is just a guess and is unsubstantiated, but it may be one of the reasons (aside from serving many visitors) Google uses Exodus for their bandwidth and not, say, DSL from their local provider ;).
What do you think?
- Would you use LiSEn?
- What set of web sites would you search?
- How do you think the LiSEn concept could be improved?