Are You Prepared for the Small Data Center Outages?

August 7, 2012

In some sense, there is no “small outage” if your company relies heavily on its data center for major business functions. Even a small outage, whether you define it by a short duration or a limited scope, can affect your income and your reputation. You may be prepared for a major downtime event, but are you ready for smaller events as well?

Data Center Downtime: It Doesn’t Take Much

Data centers are sophisticated, highly interconnected systems: many different subsystems must all function properly for the facility to provide services. Unfortunately, this all too often means that a seemingly minor mistake, accident or event can bring the entire system to a grinding halt. Consider the EPO (emergency power off) button: all it takes is one employee mistaking this button for a door opener, and you have a full-fledged data center outage on your hands. Recently, Wikipedia suffered an outage as a result of a cut fiber optic cable in a data center (“Wikipedia outage caused by a data center cable cut”), and Twitter was silenced for a time during the Olympics owing to a system failure (and, interestingly, a nearly simultaneous failure of a backup system) in its own data centers (“Twitter: Data center problems caused outage”).

The lesson here is that it can take much less than a hurricane, earthquake, utility outage or malicious attack to bring down your data center. And if your company depends on its data center to enable basic business functions (for instance, retail sales over the Internet), then every moment of downtime equals lost revenue. Furthermore, customers who go to your site or otherwise try to access your services and receive an error message may well simply go to a competing provider or retailer. That is more than one lost business transaction: it is a lost customer and all of his or her future business. And your customers probably won’t care whether it’s a big or small outage: most will have very little patience if your services are inaccessible when they need them. Nevertheless, a range of effects exists, as Bob Baird notes (“The Service Disruption Continuum”): “Disruptive events don’t have to be a major disaster to wipe out your business. They can be anything from a relatively minor malfunctioning network card to a devastating event such as a sudden regional disaster that not only destroys your data center but also shuts down surrounding roads, bridges, and other infrastructure.”
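To make the two kinds of loss described above concrete, here is a minimal sketch in Python that adds the sales lost during an outage to the future business lost from customers who never come back. Every name and figure in it (revenue per minute, churn fraction, future value per customer) is a hypothetical placeholder chosen for illustration, not data from any real facility.

```python
from dataclasses import dataclass


@dataclass
class OutageCostInputs:
    """Inputs for a rough outage cost estimate. All values are hypothetical."""
    revenue_per_minute: float         # average online revenue per minute
    outage_minutes: float             # how long services were unavailable
    customers_affected: int           # customers who saw an error during the outage
    churn_fraction: float             # share of those customers who never return (0 to 1)
    future_value_per_customer: float  # expected future revenue from one customer


def estimate_outage_cost(inputs: OutageCostInputs) -> dict:
    """Split the estimated cost into immediate and long-term parts."""
    if not 0.0 <= inputs.churn_fraction <= 1.0:
        raise ValueError("churn_fraction must be between 0 and 1")
    # Immediate loss: sales that did not happen while the site was down.
    lost_sales = inputs.revenue_per_minute * inputs.outage_minutes
    # Long-term loss: customers who left for a competitor, times their future value.
    lost_customers = inputs.customers_affected * inputs.churn_fraction
    lost_future_business = lost_customers * inputs.future_value_per_customer
    return {
        "lost_sales": lost_sales,
        "lost_future_business": lost_future_business,
        "total": lost_sales + lost_future_business,
    }


if __name__ == "__main__":
    # A "small" 15-minute outage with made-up numbers.
    small_outage = OutageCostInputs(
        revenue_per_minute=200.0,
        outage_minutes=15,
        customers_affected=400,
        churn_fraction=0.05,
        future_value_per_customer=300.0,
    )
    print(estimate_outage_cost(small_outage))
```

With these made-up inputs, the long-term term is larger than the immediate lost sales, which is the point of the paragraph above: a short outage can cost more in departed customers than in missed transactions.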

Preparing for Small Outages

No system is perfectly reliable: everything has a chance of failure. Given enough time, then, your data center (regardless of how many nines you boast) will suffer an outage. You, of course, want to do everything you can to avoid outages, such as installing backup and redundant systems to eliminate single points of failure, but you also need an action plan for when one of those inevitable downtime events occurs. In some cases, a small outage is clearly less damaging than a large one. In many others, the difference may be negligible. For instance, a system failure that leaves your services available but painfully slow for customers can be just as bad as, if not worse than, a full-fledged outage. (You probably know the aggravation of a slow-loading website: you waste a bunch of time and then close the window in fury anyway.) Thus, your procedures for dealing with small outages will probably be similar to those for dealing with larger outages. In either event, the key is preparation to minimize damage to your business. Here are a few tips.

  • Make safety a top priority. Often a data center outage is caused by an event that is annoying but far from a danger to personnel. But in those cases where dangerous situations arise (such as exposed electrical conductors), ensure that safety is the top priority. Don’t endanger your employees in the name of dollars. On the other hand, of course, know when you’re going overboard: some safety measures can simply be wasteful. The key is to find the right balance to minimize the probability of harm to personnel relative to the costs of safety measures.
  • Plan ahead. This is perhaps the most important step to recovering quickly from an outage—small or large. If you wait until the outage occurs to formulate an action plan, you’re already at a disadvantage. Determine ahead of time who should be contacted (and who should be present in the facility) should an event occur—and this may even depend on the scope of the failure. Develop procedures for identifying and fixing the problem. Have a ready list of service providers that you might need to contact for help, should some system (like a cooling unit) fail. And, perhaps most importantly, keep all this information neat, organized and located in a place where it is easily accessible to those who need it. By planning ahead, you can more quickly get your data center—and your business—running again.
  • Back up your data. For most folks, most of the time, insurance policies are annoying expenses that nickel and dime you with no return. But when disaster strikes, those policies pay in spades. The same applies to backing up your critical data: it seems like an annoying waste of time—until you lose data. Then, backing up pays for itself, often many times over. Backups are something that you must do regularly during normal operating times, however: it’s a useless (or nearly so) exercise when downtime has already struck.
  • Deploy a data center infrastructure management/monitoring (DCIM) solution. The key to quickly resolving a downtime event is identification of the problem. Wandering around with a flashlight and a multimeter probably won’t do the trick: you need (preferably) central access to information and status about your systems so that you can spot trouble areas. A DCIM solution can also aid in identifying these trouble areas before they cause downtime—another tremendous benefit.
  • Track usage of your data center services. Times of peak usage put a strain on your systems, so they may be the best times to look for potential problems before they cause downtime. These are also the times when you should be most prepared for an outage: this is when a circuit breaker is most likely to trip or a cooling unit to fail, and when customers depend on you most heavily.
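As a rough illustration of the last tip, the following Python sketch takes usage samples and reports the hours of the day when average load is highest. The sample data, the function names and the 0.8 threshold are all assumptions made for this example; a real DCIM product would supply its own data and its own interface.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (hour of day, load as a fraction of rated capacity).
SAMPLES = [
    (9, 0.50), (9, 0.70),
    (14, 0.90), (14, 0.85),
    (20, 0.95), (20, 0.90),
]


def average_load_by_hour(samples):
    """Group samples by hour of day and average the load for each hour."""
    by_hour = defaultdict(list)
    for hour, load in samples:
        by_hour[hour].append(load)
    return {hour: mean(loads) for hour, loads in by_hour.items()}


def peak_hours(samples, threshold=0.8):
    """Return (hour, average load) pairs at or above the threshold, busiest first.

    The default threshold of 0.8 is an arbitrary assumption, not a standard.
    """
    averages = average_load_by_hour(samples)
    flagged = [(hour, avg) for hour, avg in averages.items() if avg >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for hour, avg in peak_hours(SAMPLES):
        print(f"{hour:02d}:00 average load {avg:.0%} of capacity")
```

The hours this returns are the ones to watch most closely, and the ones for which staff and service contacts should be easiest to reach.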

Conclusions

In many ways, preparation for small outages is the same as preparation for large ones. A small outage may have less of an effect on your business, but it must still be resolved, lest it snowball into a larger problem. Small outages can also be signals of an existing larger problem that could result in a major outage down the road. In either case, however, you should take steps now to prepare for outages: they’re going to happen to your data center, but preparing now can save you revenue and save your company’s reputation in the eyes of customers.
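To put the "nines" mentioned earlier into numbers, this minimal Python sketch converts an availability percentage into the downtime it still allows per year. The function name and the 365-day year are assumptions made for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year that a given availability level still permits."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(percent)
        print(f"{percent}% availability allows about {minutes:.1f} minutes of downtime per year")
```

Even "three nines" (99.9%) leaves room for about 8.76 hours of downtime a year, so an outage plan is worth having no matter how reliable the facility is.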

Photo courtesy of clayirving

About Jeff Clark

Jeff Clark is editor for the Data Center Journal. He holds a bachelor’s degree in physics from the University of Richmond, as well as master’s and doctorate degrees in electrical engineering from Virginia Tech. An author and aspiring renaissance man, his interests range from quantum mechanics and processor technology to drawing and philosophy.

5 Comments

  1. David Laurello August 8, 2012 at 6:55 pm -

    Excellent advice. More people would follow it if they knew their cost of downtime. In our experience and research, fewer than 50% say they know what an hour of downtime costs them. And those who do will underestimate by a wide margin. They look at obvious hard costs … cost per affected employee, for example. You mention reputation and future lost business … hard to calculate but no less critical in determining actual downtime cost. You also mention system interconnectivity. The impact of downtime can ripple up and down the value chain. Understanding the end-to-end calamity caused by a downtime incident, large or small, is the only informed way to invest in proper protection and, preferably, prevention. – Dave Laurello, CEO & Chairman, Stratus Technologies.
