First, the thank yous
In the first days of the disaster, Jason Camp of VHosting worked tirelessly trying to get bubba to come back from the dead. To no avail, unfortunately, but both for that, and for all the great support he's given us for the past year-and-a-few-months, I am eternally in his debt.
Voxel.net stepped in with an offer of new hardware and redundant cluster hosting, so that future single-machine failures won't have this kind of effect. It's a scheme we'd been wanting to pursue for a while, and they have a lot of experience with highly-available server clusters, so we jumped on that. They have worked tirelessly, and provided probably the best support I've ever seen anywhere in getting this new setup going. And it has not been an easy road.
Specifically, I have to thank Raj, Voxel's President, who went so far as to call me from Singapore, where he's on vacation, to let me know what the status of the project was; Andres, who wrestled with LVS last night until 6AM, finally getting it to work properly; and Jim, who put in similarly long hours and great effort overcoming what has surely been the most problem-ridden installation in history.
Anybody can do something and have it go perfectly right off. What really shows you what people are made of is how they act when everything goes wrong, and these guys are a class act.
Finally, I have to thank our no longer anonymous (!) friend, el_guapo, at Compaq, who also went to great lengths to rescue our data from bubba's arrays.
So what's going on now?
Right now, the site is running on a total of three machines. One lightweight box in front is running LVS, which is a Linux load-balancing system. It's splitting requests off to two dual P700s, one of which is running Apache and Scoop, and the other of which is running Apache, Scoop, and MySQL. Both are using the one database, and LVS is configured to favor the Scoop-only machine for web connections, to compensate for the extra work the other box is doing as the database server.
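For the curious, here is a rough sketch of what "favoring" one machine means in a weighted scheduler. This is not LVS's own code or our actual configuration: it is a small Python illustration of the "smooth" weighted round-robin technique, and the server names and the 3:1 weights are made up for the example, since the real weights aren't given here.

```python
from collections import Counter
from itertools import islice


def weighted_round_robin(weights):
    """Yield server names forever, in proportion to their integer weights.

    Smooth weighted round-robin: on every turn each server gains its
    weight in credit, the server with the most credit is chosen, and the
    chosen server pays back the sum of all weights.
    """
    total = sum(weights.values())
    credit = {name: 0 for name in weights}
    while True:
        for name, weight in weights.items():
            credit[name] += weight
        # Ties go to whichever server was listed first.
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        yield chosen


# Hypothetical names and weights, for illustration only.
WEIGHTS = {"scoop-only": 3, "scoop-plus-mysql": 1}


def share_of_requests(n_requests):
    """Count how many of the first n_requests land on each server."""
    return Counter(islice(weighted_round_robin(WEIGHTS), n_requests))


if __name__ == "__main__":
    print(share_of_requests(400))
```

With these assumed weights, three out of every four web requests go to the Scoop-only box, which leaves the other machine more room to serve the database.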
All this is because the database machine (formerly known as "hex" for those of you who keep up) is having kernel trouble. Very soon, it will come back online, and there will be a brief outage while we move the database back onto its own box. Really, it'll be minutes. I promise. ;-)
Once the four-machine cluster is running smoothly, we will be swinging into action to build it up and ensure full redundancy. Promicro Systems, who built the two new Scoop machines we're running on now, is going to become a hardware sponsor, and hook us up with some more new machines, to fill in wherever the current system isn't fault-tolerant. We hope to end up with:
- Two quad Xeon database servers, one live, one mirroring the live one, as a hot spare
- Four Scoop servers, dual P3 class
- Two LVS frontends, one hot-spare which will take over if something happens to the live one
This should bring us to a point where the cluster can expand as needed, and won't be so fragile with respect to hardware failure.
And yes, from now on, we're going to keep backups.