On Saturday, June 23, the primary controller in the router that controls access to all OSDN servers hosted at the Exodus facility in Waltham, MA, suffered a catastrophic failure. The sites affected were Slashdot, freshmeat, NewsForge, and Mediabuilder, among others.
The secondary controller did not automatically take over as it should have. It did not work when activated manually, either. The first Cisco support people contacted professed to be "amazed" at the situation, saying it was the first time they had seen a failure of this kind.
OSDN and Cisco people, working through Saturday night, were unable to cure the problem. On Sunday afternoon, OSDN employee Kurt Gray and Cisco rep Scott were stepping through the router's configuration by telephone. As they worked to undo other changes that had been made, says Kurt, "on one reset everything came back."
OSDN network operations staff were already in the process of rebuilding the company's network to eliminate the router as a potential single point of failure.
As of 7 p.m. US EDT Sunday, most of the sites were available at least part of the time, but full service was not yet restored. There may still be slowdowns or intermittent failures until a permanent fix is made.
We'll have a more complete story within a few days. Right now, OSDN network operations staff members are too busy working to talk.