Regardless of a data center design’s intended level of availability, the facility won’t meet its stated goals in the absence of proper maintenance. To keep services available, data center operators must ensure that each of the facility’s systems is regularly and properly maintained.
Imagine your car is a data center, and your car’s ability to get you from here to there is its availability. Now, imagine you never change the oil and filter, you completely ignore the amount of tread on your tires, and you disregard any warning lights that illuminate on the dashboard. Chances are it won’t be long before your car can no longer be considered “available.” On the other hand, if you properly maintain it by changing the oil and other fluids regularly, checking the tires for wear and proper inflation, and getting needed service when appropriate, your car will last longer and will not suffer the kinds of unplanned breakdowns that plague a car that receives no regular maintenance. And yes, sometimes taking your car to the shop means you won’t be able to use it—meaning it is unavailable. But planned maintenance is much different from unplanned maintenance: if you schedule maintenance, you can plan for times when the lack of availability least harms your schedule. Unplanned maintenance—that is, a breakdown—has no regard for your schedule, and it can pile on additional costs beyond those of planned maintenance (e.g., towing, in the context of a car).
The same logic that applies to automobiles (and almost everything else in life) also applies to data centers: if you want it to work when you need it, you must make an effort to properly maintain it. Because data centers are complex facilities with many interworking components and systems, maintenance can be a daunting task. But in a difficult economy, the “run to failure” model is too expensive a proposition—both in added costs associated with unplanned downtime (an equipment failure of some kind) and in reduced customer satisfaction or even lost customers. Do you want your data center to live up to its availability potential and promises? Then maintenance is a requirement.
What Is Availability?
Availability is many things to many people. One data center operator might consider certain circumstances to represent a facility that is “available,” whereas another operator might consider those same circumstances to represent an “unavailable facility.” Clearly, however, data center availability is some measure of a user’s ability to access services offered by a data center. The particular details of what that means may vary, but the general concept of availability is fairly clear. According to a March 2011 whitepaper from Fujitsu (“Frequently Asked Questions on High Availability”), “IEEE defines availability as the degree to which a system or a component is operational and accessible when required for use by an authorized user.” A simple formula for expressing availability is the ratio of uptime to total operating time, where uptime is the difference between total operating time and downtime. In addition, the Fujitsu whitepaper notes that “in practice, a distinction is made between planned and unplanned downtimes.”
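As a quick illustration of that formula, the calculation is simple arithmetic; the hours below are hypothetical and are not drawn from the whitepaper:

```python
# Availability as the ratio of uptime to total operating time.
# The figures here are hypothetical and purely illustrative.

total_operating_hours = 24 * 365   # one year of intended operation
downtime_hours = 8.0               # planned plus unplanned downtime

uptime_hours = total_operating_hours - downtime_hours
availability = uptime_hours / total_operating_hours

print(f"Availability: {availability:.5%}")  # about 99.909% for 8 hours of downtime
```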
An Emerson Network Power whitepaper entitled “Maximizing Data Center Efficiency, Capacity and Availability through Integrated Infrastructure” cites the 2011 Data Center Users’ Group survey regarding the importance of availability. According to this survey, 53% of responding IT professionals listed availability as a top-three data center challenge; availability ranked first on the list, followed closely by infrastructure monitoring at 52% and heat density at 47%.
But what exactly is availability? Availability must be carefully defined in the context of provider-customer relationships. Consider, for instance, a data center facility that provides some service and a remote user attempting to access that service through the cellular network. In addition to just the equipment and facilities on site at the data center location, another factor influencing availability (from the user’s perspective) is the intervening network. This includes cell towers, fiber for transmitting communications over potentially long distances and any other equipment that switches, processes, conditions or otherwise manipulates the signal between the user and the data center. A failure at any point along this route—regardless of the fact that it may not be under the data center provider’s control—disrupts availability of the service to the user. On the basis of the IEEE definition, the data center’s service is unavailable to the user in this scenario, even though everything at the data center site is working perfectly.
Thus, a large part of availability, at least as far as understanding what a customer is getting from a service provider, depends on what is in the applicable service-level agreement. But in the context of just the availability of services to the edge of the data center operator’s property, availability is often measured in terms of “nines”: for instance, 99.999% (“five nines”) availability, which corresponds to about five minutes of downtime per year. As the Data Center Journal discussed in a previous article (“What Do All Those Nines Mean?” http://www.datacenterjournal.com/index.php?option=com_k2&view=item&id=2285:what-do-all-those-nines-mean?&Itemid=503), however, these statements of availability can be misleading or even incorrect. This single number doesn’t, for example, take into account when downtime occurs (scheduled downtime during off hours is much better than an equal—or even lesser—amount of unscheduled downtime at a time when most users rely on the service).
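A rough back-of-the-envelope conversion (simple arithmetic, not a statement about any particular SLA) shows how each additional “nine” shrinks the allowable downtime per year by a factor of ten:

```python
# Converting "nines" of availability into allowable downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("two nines", 0.99),
                            ("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    downtime_minutes = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label:>11}: about {downtime_minutes:,.1f} minutes of downtime per year")
```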
But availability is not necessarily limited to just the ability to use a service. For instance, if the service is too slow, it can be effectively useless. Rudy Millian, Product Manager for Anue Systems, notes that availability can be more than just the basic definition above: “Our customers define data center availability in terms of key performance metrics including:
• Downtime – The available data center network is outage-free
• Performance – The available data center network delivers on the expected QoS and SLAs
• Security – The available data center network is secure
• Regulatory compliance – The available data center network meets relevant regulatory compliance, such as SOX, HIPAA and PCI compliance”
What’s Maintenance Got to Do With It?
Every component or system has some finite probability of failure—whether that component is in your car or in your data center. That means the system or component will eventually fail. Thus, even the most robustly designed data center will eventually experience downtime if left alone. The best way to mitigate this probability of failure is to maintain the facility: maintenance is essentially a process of replacing, repairing or adjusting certain components to decrease their chances of failure. Some components, however, have a greater chance of failure than others, and a data center operator could easily devote the entire staff to maintenance and still experience a failure. Proper maintenance is therefore a selective process: choose the areas that are most critical and/or most prone to failure, inspect them periodically, and implement appropriate remedies to minimize the overall chance of downtime. Millian notes that “poor network maintenance leads to a compromised data center and negatively impacts availability. Downtime is more frequent and [mean time to repair—MTTR] is high, QoS targets and SLAs are not met, security breaches go undetected and the business fails network audits.”
Note that redundancy, although important to maintaining data center availability, is not sufficient. Adding more backup systems is beneficial up to a point, but in real-world systems, beyond a certain level of redundancy the system actually becomes less reliable as additional redundant systems are added. (This is a consequence of factors such as an imperfect ability to detect and “cover” failures and the inherent—albeit perhaps small—unreliability of the systems that detect failures and transfer functions to redundant systems.) Thus, no design, however robust, can maintain availability over the long term without proper maintenance.
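A deliberately simplified toy model illustrates the point; the unit and switchover probabilities below are invented for illustration, not measured values. Once the chance of every redundant unit failing at once becomes negligible, the small risk added by each extra detection-and-switchover mechanism dominates, and availability starts to drop.

```python
# Toy model of redundancy with imperfect failure detection/switchover.
# All probabilities are hypothetical; the shape of the result is the point.

unit_unavailability = 0.01    # chance a single unit is down at any moment
switchover_risk = 1e-4        # outage risk contributed by each added switchover mechanism

for n_units in range(1, 7):
    all_units_down = unit_unavailability ** n_units
    switchover_outage = (n_units - 1) * switchover_risk
    availability = 1 - (all_units_down + switchover_outage)
    print(f"{n_units} unit(s): availability = {availability:.6f}")
```

In this sketch, availability peaks at two units and then declines as more are added, mirroring the effect described above.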
This type of preventive maintenance can yield tremendous returns in the long term, as the cost of unplanned downtime can quickly and greatly exceed that of the planned downtime required for preventive maintenance. Although a program of preventive maintenance can be developed at any time for any data center, the best approach is to prepare one in conjunction with the design phase of the facility. Paul Goodison, CEO of Cormant, believes that “we should consider maintenance not as the process of checking/maintaining after the event, but processes built into the [data center] from the very start.” Again, however, it’s never too late to start maintaining your data center: a good maintenance program requires an initial investment of time and money, but even this investment—to say nothing of the ongoing work of maintenance—can provide fast returns.
Keys to Data Center Maintenance
One of the best aids to data center maintenance is monitoring. If you know what’s going on in your facility—and, in particular, what’s going wrong—you can more easily correct it. Thus, monitoring infrastructure can be well worth the capital costs by way of reduced maintenance time and effort, as well as (in some cases) the ability to take precautions that prevent failures rather than merely respond to them. “Best-practice at our customers is to go beyond maintenance. Our customers rely on monitoring tools, such as application and network performance monitors and intrusion detection systems, to proactively detect and address potential data center availability issues. The Anue NTO network monitoring switch is an integral part of such monitoring solutions. The NTO enables customers to decouple the number of tools they need to deploy from the number of network monitoring points they have, thus eliminating the shortage of access points. In addition, it optimizes the traffic to individual tools, boosting tool performance and enabling customers to monitor more with less. The outcome is better tool utilization, increased visibility and higher data center availability,” said Millian.
For example, Goodison suggests that data center operators make sure that “the rack/room/space/facility is being monitored for power consumption and temperature and these are monitored against set maximums or values. So for instance, a rack may have a maximum current/power draw, and that needs to be a) agreed [upon] in advance and b) monitored (maintained). If not, then overloads/overheating and outages are going to happen (and do).” Furthermore, “equipment (servers, network equipment, etc.) needs correct labeling, dual redundant power, and data connectivity, and this all needs documenting and testing before it goes live to ensure maintainability later on.” Goodison notes several important points regarding maintenance: First, an established (agreed upon) procedure is needed before failures and other events occur. Second, the procedure should be tested, and third, it should be documented to give staff a ready reference for how to respond to various circumstances.
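A minimal sketch of that kind of check might look like the following; the rack names, limits and readings are hypothetical, and a real deployment would pull readings from a DCIM or BMS feed rather than hard-coded values.

```python
# Compare monitored rack readings against maximums agreed upon in advance.
# Rack identifiers, limits and readings are hypothetical.

AGREED_MAXIMUMS = {
    "rack-A01": {"power_kw": 5.0, "temp_c": 27.0},
    "rack-A02": {"power_kw": 8.0, "temp_c": 27.0},
}

def check_rack(rack_id, power_kw, temp_c):
    """Return alert messages for any reading that exceeds its agreed maximum."""
    limits = AGREED_MAXIMUMS[rack_id]
    alerts = []
    if power_kw > limits["power_kw"]:
        alerts.append(f"{rack_id}: power draw {power_kw} kW exceeds {limits['power_kw']} kW")
    if temp_c > limits["temp_c"]:
        alerts.append(f"{rack_id}: temperature {temp_c} C exceeds {limits['temp_c']} C")
    return alerts

# Hypothetical readings from whatever monitoring feed is in place.
for alert in check_rack("rack-A01", power_kw=5.4, temp_c=26.1):
    print("ALERT:", alert)
```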
“Depending on the area, [monitoring] can either supplement correct routine maintenance (where an issue arises) or, if we look (again) at the IT IM portion of the DC, then we use monitoring to maintain availability. For items like dual connectivity, port utilization or current power values, we can use that monitored data to plan future expansion. We might also use things like power plate rate data as well to see maximums and blend this data into an overall availability picture of a [data center],” said Goodison. “Again, from an IT DCIM point of view, we want to be monitoring power and temperature, but also equipment deployments and port utilization. It’s absolutely vital to know in near real-time that a new device has been deployed and where (and likely why) so that it can be managed. (CableSolve has a unique portable component that will enable the engineers to record everything they do including racking, patching, plugging-in and testing a piece of equipment, all providing vital data for the future.) Server data and network equipment data are also often monitored. One example might be switch port use: if a switch port reports it is in use, but the IT IM system does not know about the connection, that is a red flag that needs investigating, as it may cause issues in availability/uptime later.”
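The switch-port example lends itself to a simple reconciliation between live switch data and the IM system’s records; the port identifiers below are hypothetical:

```python
# Flag switch ports that report link as up but have no documented connection
# in the infrastructure management (IM) system. Port names are hypothetical.

ports_reporting_in_use = {"sw01:Gi1/0/1", "sw01:Gi1/0/2", "sw01:Gi1/0/7"}
ports_documented_in_im = {"sw01:Gi1/0/1", "sw01:Gi1/0/2"}

undocumented = ports_reporting_in_use - ports_documented_in_im
stale_records = ports_documented_in_im - ports_reporting_in_use

for port in sorted(undocumented):
    print(f"RED FLAG: {port} is in use but unknown to the IM system")
for port in sorted(stale_records):
    print(f"CHECK: {port} is documented as connected but reports no link")
```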
Downtime is guaranteed. Eventually, one way or the other, a data center facility will go offline for some amount of time. The data center operator can, however, control to some extent when and for how long the data center is offline—assuming proper maintenance procedures have been followed. Ultimately, however, a data center must be brought offline to perform certain maintenance tasks; the alternative is to let it go offline when something fails. The former can be scheduled, but the latter occurs randomly. The best policy is to properly schedule this required maintenance to have a minimal effect on customers. Customers will certainly have more appreciation for short, scheduled downtime during low-use times than they will for unscheduled downtime during peak usage.
Additional Steps to High Availability
Millian also cites several other measures that data center operators can implement to improve their availability. One of these measures is increased visibility: “More network visibility leads to higher availability, because it improves communication among teams while reducing operating expenses. Solutions with self-serve features boost visibility across the IT organization.” In addition, “Simplicity reduces maintenance costs and frees resources to focus on value-add activities. Look for solutions with intuitive, easy-to-use interfaces. The lower learning curve means administrators are more likely to use the solution and are less likely to make availability-impacting mistakes.” Third, Millian recommends automation and integration. “Integrated and automated solutions are more valuable to organizations, because they simplify management and add to the total value proposition of the integrated components. Look for solutions with public APIs and SNMP support.”
Failures will occur in the data center—this is a certainty. What’s less certain is how often they will occur. By implementing proper maintenance procedures, you can increase your facility’s availability. Millian summarizes the effect of a lack of maintenance as follows: “The effect of poor maintenance is as diverse as our customer base. What our customers have in common is that their network infrastructure is critical to their business. From financial institutions to government agencies, any downtime, security breach or missed audit has significant impact on both the IT organization and the business.”
In light of the need for data center availability on the part of customers—even if the “customer” is the company running the data center—maintenance is an absolute requirement. A “run-to-failure” approach to data center operations is ultimately more costly, in both money and employee time, than a planned and consistently implemented maintenance policy. One of the major keys to a good maintenance policy is monitoring, which gives data center operators the information they need to identify and correct problems before they turn into major downtime events.
Your data center is like your car: if you put in a little effort, it’ll take you where you want to go. If you ignore it, it’ll put the brakes on your progress.