Blackout

Posted by pbrittan on August 19, 2003 at 7:05 AM PDT

Single points of failure can be entire systems. Prevention may lie in "fencing in".

For those of you on the West Coast, I can assure you that it was pretty dark here in New York last Thursday evening. A little after 4pm, all the lights, air-conditioners, phones, etc., in our office suddenly shut down. The UPS alarms started ringing, letting us know we were operating on battery power. We soon realized that the power was going to take a long time to come back on, hours if not days, and we didn’t have enough battery life for that, so all we could do was execute an orderly shutdown of our servers and wait.

The effects of a loss of power are devastating, especially in an overcrowded city. No light, no A/C, thousands of people trapped in subways and on commuter railways, no ventilation in roadway tunnels, no ATMs, no credit card processing, no cell towers to relay our phone calls, no phone systems in our offices, no answering machines, no PCs, no refrigeration, very limited cooking, etc. Life as we know it completely on hold. We are utterly dependent on electricity and the systems that deliver it to us.

The power system, something I admittedly don’t know a lot about, seems pretty well distributed. There are a multitude of power plants, operating independently but interconnected through a single power grid. This grid is supposed to be able to handle changes in local supply and demand, routing extra energy to a hot region where too many people are cooling off in front of the A/C, and seamlessly covering for a downed plant.

But apparently, these independent plants are also vulnerable to one another through the grid. I read that 21 major power plants spread out over 9,000 square miles all shut down within 3 minutes of each other, as a defensive response to some type of surge in the grid, leaving roughly 50 million people without power. The grid is supposed to handle failures at any particular power plant gracefully, which I assume it does all the time and of which I am thankfully oblivious. Yet it apparently can be a single point of failure itself, leading to a catastrophic shutdown of the entire system.

So how can we prevent systems from being single points of failure? The answer may lie in the concept of "fencing in" instead of "fencing out". In my home state of Montana, the law of the land is "fence out". That means that it is incumbent upon any rancher to keep his neighbors’ livestock out of his fields; he is not responsible for keeping his own livestock in. This rule dates from the time when most of Montana was open range land, and the occasional farms were responsible for keeping free-range livestock out of their fields. Today, when all the land has been claimed and cut up into contiguous ranches, the "fence out" rule seems a little anachronistic, but it remains the rule.

I think the same rule is at work in numerous distributed systems. In our recent blackout, each power station acted entirely in its own self-interest and shut down to protect itself from the surge running across the grid; that is, each one fenced the menace out. If instead there had been cooperation across the grid to isolate the surge, that is, to fence it in, then catastrophic system failure could have been avoided.

Currently, viruses are handled by a “fence out” methodology. Every individual system attempts to fence viruses out, which leaves the viruses free to run around the network looking for just one system that fails to fence them out so that they can propagate. If they were fenced in, they would not have the ability to look for a weak node to attack.
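To make the difference concrete, here is a toy sketch of the two policies. It is only an illustration under simplifying assumptions, not a model of any real network or anti-virus product: the host names, the `links` map, the `patched` set, and the `outbound_limit` parameter are all made up for the example. "Fence out" is modeled as some hosts successfully rejecting the virus; "fence in" is modeled as capping how many peers a compromised host is allowed to contact.

```python
from collections import deque

def outbreak(links, patched, start, outbound_limit=None):
    """Return the set of hosts infected after an outbreak beginning at `start`.

    links:          dict mapping each host to the list of hosts it can reach
    patched:        hosts whose own defenses hold (they "fence out" the virus)
    outbound_limit: hypothetical "fence in" control; the maximum number of
                    peers an infected host may contact (None = unlimited)
    """
    infected = {start}
    queue = deque([start])
    while queue:
        host = queue.popleft()
        peers = links.get(host, [])
        if outbound_limit is not None:
            # Fence in: the infected host only gets to reach a capped
            # number of peers (0 means it is fully isolated).
            peers = peers[:outbound_limit]
        for peer in peers:
            if peer in infected or peer in patched:
                continue  # already infected, or it fenced the virus out
            infected.add(peer)
            queue.append(peer)
    return infected

# A made-up six-host network; "b" and "c" are well defended.
links = {
    "a": ["b", "c", "d"],
    "b": ["a", "e"],
    "c": ["a", "e"],
    "d": ["a", "f"],
    "e": ["b", "c", "f"],
    "f": ["d", "e"],
}
patched = {"b", "c"}

fence_out_result = outbreak(links, patched, "a")
fence_in_result = outbreak(links, patched, "a", outbound_limit=0)
```

In this toy network, fencing out leaves `a`, `d`, `e`, and `f` infected even though `b` and `c` defended themselves perfectly: the virus only needed one weak neighbor. With the infected host's outbound traffic cut off, the outbreak stays on `a`.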

The only way to fence viruses in is to make sure that they have no medium for transmission. As we move towards a model of “utility computing” in which compute resources are served up like electricity from a distributed grid, the risks of a systemic failure become greater. However, since servers tend to operate in strictly managed environments, it should be easier to isolate destructive code (viruses and bugs) and its effects. And by connecting to desktop environments without sending executable content to them, we can make sure that destructive code never gets to leave that rigorously managed server grid. Mark Williamson at HP in the UK has been experimenting with using innovative "fence in" techniques to combat viruses.

It is consistent with game theory that the benefit of the whole is maximized when each constituent both pursues its own self-interest and cooperates in pursuing the interest of the group. That is what fencing in requires.
