What does 99.999% reliability really mean?
Posted by johnm on April 24, 2005 at 11:46 AM | Comments (6)
Michael Levin posted a Java coding challenge this morning: write a Java program that shows what terms such as 99.999% reliability actually mean.
Here's my quick and dirty take (written while watching F1 :-):
/*
 * What does X nines reliability really mean?
 */
public class Nines
{
    // Seconds per minute, hour, day and (365-day) year.
    private static final int SPM = 60;
    private static final int SPH = SPM * 60;
    private static final int SPD = SPH * 24;
    private static final int SPY = SPD * 365;

    public static void main (String[] args)
    {
        spit ("Nines of Reliability: (Hours / Minutes / Seconds)\n");
        spit (2, SPY / 100.0);
        spit (3, SPY / 1000.0);
        spit (4, SPY / 10000.0);
        spit (5, SPY / 100000.0);
        spit (6, SPY / 1000000.0);
        spit (7, SPY / 10000000.0);
    }

    // Print one line: the percentage for numNines and the allowed downtime.
    private static void spit (int numNines, double seconds)
    {
        spit (numNines + " 9's (");
        for (int i = 0; i < numNines; i++)
        {
            // The decimal point goes after the first two nines (99.9...).
            if (2 == i)
                spit (".");
            spit ("9");
        }
        spit ("%) = up to ");
        spit ((seconds / SPH) + "h / ");
        spit ((seconds / SPM) + "m / ");
        spit (seconds + " seconds of downtime per year.\n");
    }

    private static void spit (String str)
    {
        System.out.print (str);
    }
}
Which tells us:
Nines of Reliability: (Hours / Minutes / Seconds)
2 9's (99%) = up to 87.6h / 5256.0m / 315360.0 seconds of downtime per year.
3 9's (99.9%) = up to 8.76h / 525.6m / 31536.0 seconds of downtime per year.
4 9's (99.99%) = up to 0.876h / 52.559999999999995m / 3153.6 seconds of downtime per year.
5 9's (99.999%) = up to 0.0876h / 5.256m / 315.36 seconds of downtime per year.
6 9's (99.9999%) = up to 0.00876h / 0.5256000000000001m / 31.536 seconds of downtime per year.
7 9's (99.99999%) = up to 8.76E-4h / 0.05256m / 3.1536 seconds of downtime per year.
So, 5 9's means less than 5 1/2 minutes of downtime per year. Hmm... Now, how long does it take your server just to boot once?
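To put the boot-time question into numbers, here is a rough sketch that turns an availability percentage into a yearly downtime budget and counts how many restarts of a given length would fit into it. The class name DowntimeBudget, both helper methods and the 180-second restart time are made up for illustration; they are not part of the challenge, and the budget assumes the same 365-day year as the program above.

```java
public class DowntimeBudget
{
    // Seconds in a 365-day year, the same value as SPY in the Nines program.
    static final int SECONDS_PER_YEAR = 60 * 60 * 24 * 365;

    // Allowed downtime per year, in seconds, for an availability given in percent.
    static double allowedSecondsPerYear (double availabilityPercent)
    {
        return SECONDS_PER_YEAR * (100.0 - availabilityPercent) / 100.0;
    }

    // Number of whole restarts of the given length that fit into the yearly budget.
    static long restartsThatFit (double availabilityPercent, double restartSeconds)
    {
        return (long) Math.floor (allowedSecondsPerYear (availabilityPercent) / restartSeconds);
    }

    public static void main (String[] args)
    {
        // Hypothetical restart time; substitute whatever your own server needs.
        double restartSeconds = 180.0;
        double[] levels = { 99.0, 99.9, 99.99, 99.999 };
        for (double level : levels)
        {
            System.out.println (level + "% leaves room for "
                + restartsThatFit (level, restartSeconds)
                + " restart(s) of " + restartSeconds + " seconds per year");
        }
    }
}
```

With a three-minute restart, the five nines budget of roughly 315 seconds covers a single restart and nothing else, assuming the restart is the only downtime all year.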
Comments
-
Psst! Hey, Michael! FYI, jRoller didn't let me add a comment to your blog no matter how I formatted the damn thing.
I shall refrain from making any comment about what that says about reliability. :-)
Posted by: johnm on April 24, 2005 at 12:10 PM
-
"Now, how long does it take your server just to boot once?"
Warning: straw man alert.
If you need that kind of uptime, let's be real: you'd be stupid to use a SINGLE SERVER whose uptime could destroy that figure. You'd have failover, with multiple servers. Perhaps clusters of clusters, in fact, if it's really important.
Don't cloud the issues with bad points.
BTW: none of my servers take five minutes to boot - and I use multiple J2EE application servers. Why do yours take so long?
Posted by: epesh on April 25, 2005 at 05:44 AM
-
epesh: I find it fascinating that you read so much into my simple comparison of boot time. Boot time was used since it's something that most computer folks deal with on a regular basis and so, as per the whole point of this "challenge", it's something that can help people relate to the time scales involved.
Posted by: johnm on April 25, 2005 at 09:51 AM
-
If none of your servers take 5 minutes to boot, you've never used a high-end Sun box... I've got servers which take longer than that doing self-test before the PROM is loaded, let alone the operating system.
Of course, they seem to have a knack of taking longer when the reboot is an unplanned, middle-of-the-business-day affair... in the middle of the night it's practically instantaneous :-).
And yes, of course they are in reality part of a replicated failover cluster. The trouble is, full 'live/live' configuration isn't always possible, giving you 'live/standby'. The equation with a 'live/standby' configuration normally boils down to, which will take longer: fixing the supposedly live machine, or getting everything switched on to the standby... Experience suggests that whichever you decide, someone else will disagree and could've fixed things faster ;-).
Posted by: tim_walls on April 26, 2005 at 02:43 AM
-
Well, when I was referring to "servers," I was referring to J2EE application servers, specifically. Even WebSphere takes only three minutes for me, a time limit I find intolerable, in all honesty.
But that's why you have clusters, for high availability. When one dependency goes down, the others don't. (Presumably, that is; you can always organize your architecture such that dependencies kill the whole cluster, or you can have them all on the same power outlet, etc.)
However, it's fairly easy to leverage J2EE such that your servers can stay up and survive such things, I think. (At least, I find it so.) So five nines is easily attainable if you're actually going to commit to it.
Posted by: epesh on April 26, 2005 at 08:04 AM
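For readers following the clustering argument, a back-of-the-envelope sketch of the arithmetic: if the service counts as up whenever at least one server is up, and server failures are independent, the combined availability of n servers is 1 - (1 - a)^n. The class name RedundantAvailability and the 99% per-server figure are hypothetical, and the independence assumption is a big one: it ignores failover time and shared causes such as the same power outlet mentioned above.

```java
public class RedundantAvailability
{
    // Seconds in a 365-day year, as in the Nines program in the post.
    static final int SECONDS_PER_YEAR = 60 * 60 * 24 * 365;

    // Combined availability (as a fraction, 0..1) of a group of replicas,
    // ASSUMING independent failures and that one live replica is enough.
    static double combined (double singleAvailability, int replicas)
    {
        return 1.0 - Math.pow (1.0 - singleAvailability, replicas);
    }

    public static void main (String[] args)
    {
        // Hypothetical availability of one server: two nines.
        double single = 0.99;
        for (int n = 1; n <= 3; n++)
        {
            double a = combined (single, n);
            System.out.println (n + " server(s): " + (a * 100.0) + "% available, about "
                + ((1.0 - a) * SECONDS_PER_YEAR) + " seconds of downtime per year");
        }
    }
}
```

Under those assumptions each extra two-nines server adds two nines on paper. Correlated failures and switch-over time are exactly what the formula leaves out, so real numbers will be worse.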
-
Hold ya horses guys! Did someone say five nines availability of systems is easily attainable? It's actually much harder to achieve than it might at first seem. When talking about availability, you really need to distinguish between scheduled and unscheduled downtime. But even in the easier case of five nines excluding scheduled downtime, it's not a guarantee to undertake lightly.
A few things to consider...
If it's important to a business to achieve five nines availability on their computer systems, it's unlikely that the systems are simple ones. By that, I mean it's unlikely that they trivially scale horizontally across the whole software stack. So that means you're into expensive fault-tolerant hardware.
And, again, if five nines is important, it's likely you won't get away with the system running at a single data center. There could be a massive power failure (including back-up generators), or the data center could go off-line for other reasons, e.g. fire. So you'll be needing even more expensive hardware and redundant, wide pipes between data centers that allow for near real-time replication of data between geographically dispersed regions (different continents, probably).
And it's not about rebooting your servers. It's about failing over to servers that are already running somewhere else, in the space of a few seconds, with the ability to compensate for transactions during the disaster, possibly by manual intervention. And if you need five nines, that means there's probably a big financial penalty for not resolving transactions in a timely manner. And there are probably lots of transactions. So, in the case of something going wrong, you're likely to need a large team of people (tens to hundreds of people) to ensure you can put right the things that went wrong in any reasonable time frame.
Bottom line is that hardly any businesses need five nines availability of their systems. Good job too, 'cos it's massively expensive, for anything but the simplest of scenarios. Three nines is difficult enough to achieve in any moderately complex system. A picture of a little server room with a cluster of cheap boxes doesn't match the reality of what it takes to keep heavily-used, mission-critical systems running smoothly for a business.
Posted by: psynixis on April 26, 2005 at 02:37 PM