Basically, what you're asking for is access to the raw data that a search engine collects, which you would then index using an external system. Right?
Running a web crawler I wrote over my meager ADSL connection, I collected a couple of GB (uncompressed) from about 100k URLs. Now scale that up to a billion pages, which I believe is roughly the size of Google's index, and you've got around 20 TB (uncompressed) of information that you either have to transfer, store, and then index, or manipulate directly on their systems.
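Just to make the back-of-envelope math explicit, here's a minimal sketch of that scaling estimate. The 2 GB / 100k URLs figure is from my own crawl; the 1 billion page count for Google's index is an assumption, not an official number.

```python
# Back-of-envelope scaling of the crawl-size estimate above.
# Assumed inputs: ~2 GB uncompressed per 100k URLs (my own crawl),
# ~1 billion pages as a rough guess at the size of Google's index.

sample_urls = 100_000
sample_size_gb = 2                   # uncompressed, from my crawl

index_urls = 1_000_000_000           # assumed index size
scale = index_urls / sample_urls     # 10,000x

total_gb = sample_size_gb * scale
print(f"Estimated raw crawl size: {total_gb:,.0f} GB (~{total_gb / 1000:,.0f} TB)")
# -> Estimated raw crawl size: 20,000 GB (~20 TB)
```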
Assuming you move the data around, I can't see sponsored open-source sites having enough bandwidth, processing power, or disk space. Manipulating the data through an API on Google's systems doesn't fix the problem either, because you still need large portions of the data to build the indices, which raises the same issues of bandwidth, storage, and so on.
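To give a feel for the bandwidth problem: a rough transfer-time estimate for 20 TB, assuming a hypothetical 100 Mbit/s link for a sponsored mirror (swap in whatever bandwidth you actually have).

```python
# Rough transfer-time estimate for moving the raw crawl data.
# The 100 Mbit/s figure is an assumption for a well-connected
# sponsored site, not a real measurement.

data_tb = 20
data_bits = data_tb * 1e12 * 8       # terabytes -> bits

bandwidth_bps = 100e6                # 100 Mbit/s, assumed
seconds = data_bits / bandwidth_bps
print(f"~{seconds / 86_400:.0f} days of continuous transfer")
# -> ~19 days, and that's before storing or indexing any of it
```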
What might work is for Google to make some servers on their network available for directly accessing the data. But then you have to ask who they would give access to. Multiple algorithms and front-ends necessarily mean multiple teams, so Google would be granting semi-random groups of people access to its core systems, exposing itself to all sorts of ugly scenarios.
My feeling is that the system you're proposing could only really be accomplished by starting a search engine from scratch, with this kind of development in mind from day one: sort of a permanent work in progress.