Although a data center network is designed not to fail, failures do happen. When one does, it puts data owners in a precarious situation, especially when it’s a colocation facility (colo) that goes down.
As recent situations have illustrated, the ramifications of a colo outage can be devastating. Case in point: two outages in July at U.K. data centers operated by one of the world’s largest communications and colo providers reportedly took down 10 percent of voice and data traffic in and around London for more than four hours. Unfortunately for the businesses that operated out of those data centers, the assumption that they had secured their data in a stable environment proved mistaken, and they bore the consequences.
Although colocation providers go to great lengths to design and operate data centers that avoid outages, their facilities aren’t immune to problems. Unplanned outages are costly failures for colos in both the short and long term. A colo may face not only one-time financial penalties for failing to meet its SLAs but also long-term damage to its reputation and recurring revenue if a customer chooses to leave or uses the incident as leverage to stay at a lower rate.
From a colo’s perspective, it’s a fairly straightforward discussion of what should (or shouldn’t) have been done to prevent these outages. It’s a different discussion, however, if you’re the data owner and your colo solution goes down. If you’ve made the strategic decision to colocate your data off site, you’ve gone through the risk analysis and justified the decision. But have you prepared yourself for the unthinkable? What will you do if you find yourself in this situation?
Preparation doesn’t begin and end with the colo-selection process. The best way to prepare for a worst-case colo failure is to continually address the possibility. In the event of a colo failure, your diligent preparedness and awareness of the processes will give you resources and tools to mitigate the situation. If you haven’t thought about it already or haven’t done so recently, I recommend evaluating your situation in the following areas.
Spread It Out
First and foremost, when you’re developing a data center strategy, you should avoid putting everything in one place. Doing so multiplies what I call the risk factor. It may seem obvious, but it’s just as important to avoid putting all the critical applications in the same location. Consider putting production in one location and your backup in another. Then walk through each scenario and identify how a failure at any level will affect production and operations. Repeat this process annually.
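The scenario walk-through can be partly automated once application placements are written down. The following is a minimal sketch, not a prescribed tool: the application names, site names and impact labels are all hypothetical, and a real inventory would need more detail than a single production site and a single backup site per application.

```python
# Hypothetical inventory: where each application's production and backup live.
placements = {
    "billing": {"production": "colo-east", "backup": "colo-west"},
    "crm": {"production": "colo-east", "backup": "colo-east"},
    "email": {"production": "on-prem", "backup": "colo-east"},
}


def colocated_backups(placements):
    """Return the apps whose production and backup share a single site."""
    return sorted(
        app for app, p in placements.items() if p["production"] == p["backup"]
    )


def impact_of_site_failure(placements, site):
    """Classify what happens to each app if the given site goes down."""
    impact = {}
    for app, p in placements.items():
        prod_down = p["production"] == site
        backup_down = p["backup"] == site
        if prod_down and backup_down:
            impact[app] = "total loss"
        elif prod_down:
            impact[app] = "failover required"
        elif backup_down:
            impact[app] = "running without backup"
        else:
            impact[app] = "unaffected"
    return impact


if __name__ == "__main__":
    print("Production and backup in the same place:", colocated_backups(placements))
    for site in ("colo-east", "colo-west", "on-prem"):
        print(site, "fails ->", impact_of_site_failure(placements, site))
```

Running the walk-through for every site, rather than only the primary one, surfaces cases such as an application that keeps running but silently loses its backup.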
Trust but Verify
Obtain your provider’s audit records and, more importantly, review them. In many cases, colos are audited to be compliant with regulations such as HIPAA, SOX and PCI. Sometimes, however, boxes are simply checked by people who don’t fully understand IT or how data centers must operate. Have an audit done by industry professionals who understand how a reliable data center should operate. These third-party audits are typically inexpensive compared with the risks they identify and the wealth of information they can provide. In most cases, the capex and opex required to mitigate these risks are minimal relative to the overall opportunity cost of suffering an outage.
Get It in Writing
You need to know what the colocation provider will do to fix the situation. When developing the contract with a provider, insist on written language that spells out what both parties agree constitutes an outage. Having a common understanding of the language and what it means is critical. I’ve heard more than one story of how, after the fact, data owners found out the language didn’t encompass what they thought it did. Additionally, have in writing the services the provider will deliver during a failure and its commitment to rectifying the situation in an acceptable timeframe.
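To make “acceptable timeframe” concrete, it can help to translate an availability percentage into the downtime it actually permits. The sketch below is illustrative arithmetic only: the availability targets are hypothetical and not taken from any particular contract, and real SLAs define measurement windows and exclusions in their own ways.

```python
def downtime_budget_minutes(availability_pct, period_days=365):
    """Minutes of downtime permitted over the period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


def exceeds_budget(outage_minutes, availability_pct, period_days=365):
    """True if a single outage alone uses up more than the period's budget."""
    return outage_minutes > downtime_budget_minutes(availability_pct, period_days)


if __name__ == "__main__":
    four_hours = 4 * 60  # an outage of the length described earlier in this article
    for target in (99.9, 99.99):  # hypothetical availability targets
        budget = downtime_budget_minutes(target)
        print(f"{target}% allows {budget:.1f} min/year; "
              f"a four-hour outage exceeds it: {exceeds_budget(four_hours, target)}")
```

Under these assumptions, a four-hour outage fits inside a 99.9 percent annual target but is several times the budget of a 99.99 percent target, which is why the definition of an outage and the measurement period belong in the written agreement.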
Be sure to know your business risk and plan for worst-case scenarios. Most colos have an alternate site that can handle basic disaster recovery so that their customers experience little or no impact on operations. Most companies are still chasing the elusive “active-active” database in data centers (colo, cloud or on premises). Although some are close and claim successful “active-active” capability, an interruption almost always causes pain when the disaster-recovery backup is put to use. Databases are less complete than you would like, and lost data or application impacts during the transition are likely. I recommend setting proper expectations instead of promising the world.
Understand (and Document) the Process
During a failure, everyone goes into crisis mode. It’s important to understand (and document) how your colo provider handles events such as natural disasters and faulty components. What steps does it take and in what order? An important question to ask is who gets access in the event of a failure. Just like you, your server neighbors will be clamoring for access to their servers after a failure. Know precisely whether you’ll get access, who has access, when you have access and what you’ll be allowed to do if you gain access. Additionally, know exactly what extra security measures will be taken to protect your data during the repair period.
A vital element of the process is the communication protocol. Open communication is essential to managing the situation effectively and providing your superiors with updates. Know who will be your main point of contact, whom to call for updates and how often those updates will come. Additionally, verify the contact names and numbers regularly. Nothing is worse than having an outdated number or former employee on the call list when it matters the most.
Documentation applies not only to the colo side but to all data centers related to a company’s operations. We find time and time again that our customers haven’t documented their processes and procedures for day-to-day operations. And if they have, they haven’t updated them as often as they should. Documentation is critical to being prepared in the event of a disaster, from knowing where applications are running to knowing who is most affected by outages and who needs to know about changes.
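One lightweight way to keep the call list honest is to record when each entry was last confirmed and flag anything overdue. This is a minimal sketch under stated assumptions: the contact entries, the field names and the 90-day review interval are all hypothetical, so pick whatever interval “regularly” means for your organization.

```python
from datetime import date, timedelta

# Hypothetical call list; names, numbers and dates are placeholders.
contacts = [
    {"name": "Colo NOC", "phone": "555-0100", "last_verified": date(2024, 5, 20)},
    {"name": "Account manager", "phone": "555-0101", "last_verified": date(2023, 11, 2)},
    {"name": "Facility security", "phone": "555-0102", "last_verified": date(2024, 1, 10)},
]


def stale_contacts(contacts, today, max_age_days=90):
    """Return names of contacts not verified within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [c["name"] for c in contacts if c["last_verified"] < cutoff]


if __name__ == "__main__":
    # In practice pass date.today(); a fixed date keeps this example repeatable.
    overdue = stale_contacts(contacts, today=date(2024, 6, 1))
    print("Re-verify before the next incident:", overdue)
```

The same record could also note who performed the last verification, so that responsibility for the list does not lapse when staff change.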
Ask About Dirty Laundry
During the evaluation process, most colos tell you about how systems are put in place to prevent a service interruption. They also give you testimonials and references from satisfied customers. What they don’t often tell you about is their “dirty laundry”—specifically, instances of failure. As we all know, “oops” happen. First, tell them their answers won’t disqualify them. Then ask them directly whether they suffered a failure in the last year and, if so, the details of the failure, how it was rectified and what steps have been taken to prevent it from happening again. You can learn much about colos from their honesty as well as how they handled the situation. Crises are when good partners shine.
Know Your Escape Clause
In the event that you lose confidence in your colo partner, it’s important that you know about any “escape clauses” in your contract. Make sure the contract avoids vague language that can be construed in such a way that you are locked into the relationship.
Know Your Options
Most colo contracts span several years, during which time the colo market will expand and new players will enter it. Although you might not currently be looking for a new colo, you should continually evaluate other providers or use a consultant or broker to review your options with you. And in the event of a failure, you must know your options for moving to a new solution, should the situation warrant. In some cases, if the failure is significant or long enough, the ramifications could force the colo out of business and leave you scrambling.
Become a Data Center Nerd
In the recent U.K. colo failure, the cause of the problem was a single faulty breaker. Although one would think that critical facilities would avoid single points of failure, the evidence shows this one didn’t. Today, we’re all in the data business, and in your role you must become a “data center nerd.” Be on a continual quest for knowledge about not only your data center but also the market trends. Be a sponge at all levels.
Ask questions. Read reports. Be intimately familiar with all aspects of your data center solution. Most importantly, know the potential points of failure and understand what situations might trigger an outage. Let’s all hope that situation never arises. But if it does, you must be prepared to address those affected and direct your team. The best recommendation is to have a plan for those failure scenarios and to follow it. Communication is critical to the plan’s success: as impatient as people may be during an outage, they must follow the plan, and they will know the protocol only if it has been communicated to them before the situation arises. By regularly reviewing these important areas, you’ll have the knowledge to move through a failure effectively.
About the Author
Tim Kittila, PE, is Parallel Technologies’ Director of Data Center Strategy. In this role, he oversees the company’s data center consulting and services to help companies with their operation, whether it’s a privately owned data center, colocation facility or a combination of the two. Earlier in his career at Parallel Technologies, Tim served as Director of Data Center Infrastructure Strategy and was responsible for data center design/build solutions and led the mechanical and electrical data center practice, including engineering assessments, design-build, construction-project management and environmental monitoring. Before joining Parallel Technologies in 2010, he was vice president at Hypertect, a data center infrastructure company. Tim earned his Bachelor of Science in mechanical engineering from Virginia Tech and holds a master’s degree in business from the University of Delaware’s Lerner School of Business.