Originally published in The Data Center Journal, August 2011
Everything in life requires maintenance: your wheels, your relationships, and your health, to name a few important things. Your data center is no exception. If you expect your IT facility to deliver top-notch performance that is highly reliable and available, you must put in the commensurate effort to ensure that your systems are running in a manner that supports this level of performance.
That maintenance matters is obvious, and even many data center managers who don’t put much effort into it would probably still recognize its importance. Most managers and operators in this position would likely blame a lack of time or money for their failure to apply all the necessary maintenance procedures. Almost everyone, excepting perhaps the most scrupulous among us, experiences this problem. Consider your car: have you ever put off an oil change or ignored the “check engine” light because you just didn’t have the time or money to deal with it right then? It’s easy to fall into that trap, and once you’re in it, rationalizations become much more tempting.
In the context of the data center, daily demands can cause maintenance to be put on the back burner, sometimes permanently. But a data center that goes without regular upkeep becomes increasingly likely to suffer downtime or other inefficiencies that affect performance, just as with automobiles and other machines. An obvious downtime incident, such as a failure of a cooling system or server, is just one possible difficulty, however. Peter Duffy, CTO of Sumerian, notes that inefficiency resulting from a lack of maintenance can yield a cumulative effect that’s just as bad as downtime, if not worse. “One of the key things we see is that a lack of appropriate maintenance (in which I’d include capacity and performance management) results in degradation (for example, it takes users longer to perform certain tasks, or automated systems have a drop in throughput), which cumulatively has a bigger effect on productivity than downtime. For example, if a system is degraded by 10% for 10 hours, that’s the equivalent of 1 hour downtime. The problem is that everyone spots and takes action when the system is down; very few people even notice a 10% degradation.”
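Duffy’s arithmetic can be made concrete with a short sketch. The function name is hypothetical, and the sketch assumes that lost output scales linearly with the degree of degradation, which is the simplification implied by the 10%-for-10-hours example rather than a measured relationship.

```python
def equivalent_downtime_hours(degradation_fraction: float, duration_hours: float) -> float:
    """Hours of full outage that lose as much capacity as a partial degradation.

    Assumption (for illustration only): lost output is proportional to the
    degradation fraction, e.g. 0.10 means the system runs 10% below normal.
    """
    if not 0.0 <= degradation_fraction <= 1.0:
        raise ValueError("degradation_fraction must be between 0 and 1")
    if duration_hours < 0:
        raise ValueError("duration_hours must not be negative")
    return degradation_fraction * duration_hours


# The example from the quote: 10% degradation lasting 10 hours.
hours_lost = equivalent_downtime_hours(0.10, 10)
```

Under this assumption, a degradation that nobody notices for a full working week can quietly cost more than a short outage that everyone rushes to fix.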
Maintenance problems, therefore, may not just jump out and bite your data center; they may simply sneak up on you and pilfer performance here and there, with small losses that add up to significant amounts of money for your company. Although most data centers probably don’t perform extensive root-cause analyses of downtime events (just getting the facility back up and running is often all the data center manager has time for), a lack of maintenance is often the culprit. Ben Kissell, Service Solutions Manager for Emerson Network Power’s Liebert services business, believes proper maintenance could thwart roughly a third of infrastructure-related outages: “We estimate that 30% to 40% of system outages due to infrastructure hardware failures are avoidable through proper preventive maintenance.”
What’s Data Center Maintenance Worth to You?
If maintenance isn’t part of your data center strategy, one way to estimate the cost to your operations is to calculate the annual cost of downtime for your data center and multiply it by roughly 30% to 40%. The result approximates how much a lack of maintenance costs your facility every year in downtime. Imagine putting that amount of money into maintenance instead: you would likely still get a return on your investment by way of increased efficiency (to say nothing of a less stressful environment).
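As a rough sketch of that estimate, the snippet below applies the 30% to 40% range to an annual downtime cost. The function name and the $1,000,000 input are hypothetical placeholders; substitute your own facility’s figure.

```python
def avoidable_downtime_cost(annual_downtime_cost: float,
                            low_share: float = 0.30,
                            high_share: float = 0.40) -> tuple[float, float]:
    """Return the (low, high) yearly cost attributable to skipped maintenance.

    The 30%-40% default range comes from the estimate quoted in the article;
    the annual downtime cost is something each facility must supply itself.
    """
    if annual_downtime_cost < 0:
        raise ValueError("annual_downtime_cost must not be negative")
    return annual_downtime_cost * low_share, annual_downtime_cost * high_share


# Hypothetical facility with $1,000,000 per year in downtime costs.
low, high = avoidable_downtime_cost(1_000_000)
print(f"Estimated cost of skipped maintenance: ${low:,.0f} to ${high:,.0f} per year")
```

The range gives a ceiling for what a maintenance budget could reasonably be expected to recover in avoided downtime alone.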
But if you prefer a quick, rough estimate of the cost of downtime, consider the following. “A 2011 study of data center managers, conducted by the Ponemon Institute and sponsored by Emerson Network Power, showed that on average the cost of data center downtime was approximately $5,600 per minute,” said Kissell. Thus, using “the survey’s average reported incident length of 90 minutes, the average cost of a single downtime event was approximately $505,500.” If such an event occurs a couple of times a year, that’s a fair amount of money you could be putting into maintenance instead of frantic recovery efforts.
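The survey figures can be plugged into a small calculation. This is only a sketch: the constant and function names are hypothetical, and because $5,600 per minute is itself a rounded average, multiplying it by 90 minutes gives $504,000, close to but not exactly the survey’s reported $505,500.

```python
COST_PER_MINUTE_USD = 5_600      # approximate average from the 2011 survey
AVERAGE_INCIDENT_MINUTES = 90    # average reported incident length


def incident_cost(minutes: int, cost_per_minute: int = COST_PER_MINUTE_USD) -> int:
    """Approximate cost of one downtime incident of the given length."""
    return minutes * cost_per_minute


def annual_incident_cost(incidents_per_year: int,
                         minutes_per_incident: int = AVERAGE_INCIDENT_MINUTES) -> int:
    """Approximate yearly cost for a number of average-length incidents."""
    return incidents_per_year * incident_cost(minutes_per_incident)


print(incident_cost(AVERAGE_INCIDENT_MINUTES))  # 504000
print(annual_incident_cost(2))                  # 1008000
```

Two average-length incidents a year would therefore cost on the order of $1 million, which is the sum to weigh against the price of a preventive maintenance program.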
Kissell also notes that “the size of the return [on investing in regular maintenance] depends on the client’s business, but simply avoiding the cost of unplanned downtime can be a huge financial benefit. Additionally, preventive maintenance service helps to avoid emergency maintenance, which is often very costly.” Additional costs for emergency maintenance can include off-hours service calls and expedited (and therefore more expensive) delivery of replacement parts. Although maintenance tasks may occasionally involve shutting down data center operations, the ability to plan this kind of downtime means you can perform a variety of tasks during hours (such as very late at night/early in the morning) when usage is at a minimum.
The exact numbers regarding cost of downtime, cost of maintenance, and returns on investment in maintenance will vary depending on your data center’s particular configuration and needs. For instance, data centers that rely on free-cooling methods will have less heavy equipment to maintain than data centers that rely mostly on more-traditional means of cooling (e.g., CRAC units). But with few (if any) exceptions, maintenance will be cheaper in the long term than suffering downtime events and decreased operational efficiency. And if some piece of equipment in your IT, cooling, or power delivery infrastructure costs too much to maintain, you more than likely need to replace it rather than forgo maintenance.
What Needs Maintenance?
In a word: everything. Some systems require less maintenance than others, however. An APC whitepaper (“Preventive Maintenance Strategy for Data Centers”), for instance, notes that transformers, power distribution units (PDUs) and air and water distribution systems typically require little maintenance, whereas equipment like traditional CRAC units, fire alarm systems, chillers and generators require a high level of maintenance. Other equipment, such as next-generation uninterruptible power supply (UPS) systems, may require only a moderate level of maintenance. But every aspect of the data center requires maintenance, as Duffy emphasizes: “It applies equally to all systems—servers, storage, network and power.”
Some areas commonly go without regular maintenance. Kissell identifies several, including switchgear, circuit breakers, automatic transfer switches (ATS) and PDUs, as well as critical systems like UPSs, batteries and HVAC systems. In addition, however, some not-so-obvious tactics can help identify maintenance issues before they turn into downtime events. For example, infrared (IR) scans can locate the source of a number of problems, according to Kissell. “The IR scan can locate unusually high temperatures, which represent the deterioration of components and electrical connections due to vibrations, improper torque and other hidden problems. This helps the data center manager identify and correct an issue before it becomes an IT availability problem.”
Although it may not be considered strictly a maintenance procedure, computational fluid dynamics (CFD) can be a useful tool as well. CFD allows the data center manager (either using on-site data center staff and appropriate software or by way of a third-party service provider) to model air flow and heat distribution in the facility. Using this information, proper adjustments to cooling and IT infrastructure can be made to minimize hot spots and other thermal problems that can, over time (or sometimes more immediately), damage sensitive equipment and lead to downtime. Although CFD can be expensive, service and software providers usually offer a variety of options to data center providers, and CFD need not be an “everyday” type of maintenance—it might best be considered a performance optimization step.
Small maintenance steps can prevent problems by addressing some commonly overlooked areas. For instance, “Simple issues like available disk space are easily preventable but often cause problems; depending on where this happens, it can bring applications to a halt,” said Duffy. In this case, just monitoring or periodically checking available disk capacity can be enough to prevent potentially serious problems. In other words, not every aspect of data center maintenance need be complicated and expensive; sometimes, a short, regular observation is enough.
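A periodic disk-space check of the kind Duffy describes can be very small. The following is a minimal sketch using Python’s standard `shutil.disk_usage`; the function names and the 15% free-space threshold are assumptions chosen for illustration, not recommendations from the article.

```python
import shutil


def is_low_on_space(free_bytes: int, total_bytes: int,
                    min_free_fraction: float = 0.15) -> bool:
    """Return True when the free share of a volume is below the threshold.

    The 15% default is an arbitrary illustrative value.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_volume(path: str, min_free_fraction: float = 0.15) -> dict:
    """Report free space for the volume containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return {
        "path": path,
        "free_gib": usage.free / 2**30,
        "low": is_low_on_space(usage.free, usage.total, min_free_fraction),
    }
```

Calling `check_volume("/var")` (or any path of interest) from a scheduled job and alerting when `"low"` is true would be one way to turn the short, regular observation into something that happens without anyone having to remember it.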
Duffy believes that “Maintenance in its broadest form (including capacity and performance) should be in [a data center manager’s] top three activities.” In other words, it should be a high priority. To be sure, data center managers face a number of challenges, ranging from reconciling the demands of management and facilities personnel to planning and overseeing equipment upgrades and day-to-day operations. Despite all these duties, however, maintenance is one task that should not suffer, and with a well-thought-out strategy and scheduled maintenance tasks, it need not. Kissell broadly identified the areas data center managers should focus on in planning for and performing maintenance:
2. UPS systems
3. Power generators
4. HVAC infrastructure
5. Switchgear, circuit breakers, ATS and PDUs
6. Periodic performance of IR scans
The following list of tips for data center managers offers a number of considerations for creating and implementing a data center maintenance strategy.
- Carefully define the goals of your maintenance program. The best way to determine if your strategy is succeeding is to know what you want to accomplish. Kissell notes several potential objectives, including “reduced unplanned downtime, [increased] safety, improved mean time between failure” and so on.
- Keep good documentation. Of course, one can easily go overboard on this point, but insufficient documentation tends to be more problematic than overly scrupulous documentation. By keeping a ready reference for procedures, maintenance history and data, metrics and other information, you can ensure that maintenance is conducted regularly and in accordance with established protocols. And when unexpected downtime does occur, good records can help you determine what might be (or what might not be) the problem.
- Make it regular. Although not every maintenance procedure need be conducted on a regular schedule, many should be. In those cases, make sure the procedures are actually being followed. A schedule is helpful, but only insofar as you actually stick with it. So, even though it sounds obvious, be sure to perform regular maintenance procedures regularly.
- Don’t go it alone. Data center managers need not try to implement a maintenance strategy alone. Get help when you need it, and accept input from your team. Duffy said, “Work closely with your infrastructure and application support teams—they will often have insights on capacity and performance that have a direct impact on data center maintenance.” And sometimes, certain maintenance tasks are best performed by third parties, who may have expertise in certain areas that your in-house staff lacks. For instance, do you have zinc whiskers growing on your floor tiles? Unless you have the experience and equipment needed to deal with this particular problem, you could end up making matters worse by attempting to clean the floor tiles in house.
- Keep management up to date. Budgets are tight for many companies, and although a good chunk of data centers expect an increased budget in the coming year, how much of that money is allocated to maintenance? Data center managers should explain to management—if necessary—the importance of regular maintenance, as well as the need to allocate a portion of the budget to it. Also, keep management apprised of how it’s going. If the C-suite knows how maintenance is helping, they’ll be more likely to keep the funds flowing in that direction.
- Create an inventory and keep it current. You can’t maintain what you don’t know you have. Identify all the different pieces of equipment in your facility, as well as their ages, manufacturers, locations, conditions and so on. An inventory can be useful even beyond the realm of maintenance, so strongly consider creating one, and make sure it’s up to date.
- Assign maintenance priorities. Some maintenance activities are more important than others. By assigning a priority (choose whatever system you’d like, but apply it consistently), you can more easily determine what should be done when you are in a rush. In some cases, a maintenance procedure should trump other activities; in other cases, the maintenance procedure may need to wait.
- Determine who should do what tasks. Not every maintenance task should be up for grabs. Some personnel may be more qualified than others to perform certain tasks. So, before you start maintaining, determine who is allowed (or expected) to perform each task, whether it’s in-house personnel or a third-party service provider. Also, identify what equipment should be used, what training (where appropriate) is necessary and what safety procedures must be implemented.
- Put safety first. Electricity is everywhere in the data center, and other hazards abound. Make sure data center personnel know what safety procedures should be in place when performing maintenance activities. The last thing you need is someone flipping on a circuit breaker to an area under maintenance.
- Set standards for cleanliness. Sure, your data center may seem to operate well enough even though junk is piled in aisles between racks, but unseen problems can arise. Apart from creating a less-than-optimal atmosphere, a lack of cleanliness can obstruct airflow, creating hot spots, and can introduce safety hazards. A white-glove test may not be necessary, but furniture piled in the data center should be a no-no.
- Don’t be afraid to outsource. Although this point is covered obliquely above, you shouldn’t expect every maintenance task to be performed by on-site staff. You might save a little money letting a trusted employee do the work instead of bringing in another party, but the long-term costs of this approach may, in some cases, exceed the short-term cost of hiring out the task.
- Check the specs. Although not universally the case, the manufacturer often knows quite well what a certain piece of equipment can handle and how it should be maintained. Regardless, warranties could be voided if you don’t follow manufacturer-recommended maintenance and operating procedures. So, check manufacturer recommendations and consider these as part of your maintenance plan.
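Several of the tips above (keep an inventory, make maintenance regular, assign priorities, determine who does what) can be captured in a simple record per task. The sketch below is one hypothetical way to do so; the class, field names, priority scale and sample equipment are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MaintenanceTask:
    equipment: str       # inventory identifier (hypothetical naming)
    description: str
    priority: int        # 1 = most urgent; any consistent scale works
    interval_days: int   # how often the task should recur
    last_done: date
    assigned_to: str     # in-house team or third-party provider

    def next_due(self) -> date:
        return self.last_done + timedelta(days=self.interval_days)


def due_tasks(tasks: list[MaintenanceTask], today: date) -> list[MaintenanceTask]:
    """Tasks due on or before `today`, most urgent first, then oldest first."""
    due = [t for t in tasks if t.next_due() <= today]
    return sorted(due, key=lambda t: (t.priority, t.next_due()))


# Invented sample data.
tasks = [
    MaintenanceTask("UPS-1", "battery inspection", 1, 90, date(2011, 1, 1), "vendor"),
    MaintenanceTask("CRAC-3", "filter change", 2, 30, date(2011, 6, 1), "facilities"),
    MaintenanceTask("PDU-7", "IR scan", 1, 365, date(2011, 5, 1), "vendor"),
]
for task in due_tasks(tasks, date(2011, 8, 1)):
    print(task.priority, task.equipment, task.description, task.next_due())
```

Even a spreadsheet with the same columns would serve; the point is that each task has an owner, an interval and a priority that can be checked against the calendar.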
With an estimated 30% to 40% of infrastructure-related outages avoidable through proper preventive maintenance, who can afford to ignore maintenance? Maintaining your equipment can often seem like a boring and (sometimes) pointless exercise, and doing it regularly probably won’t earn you much in the way of recognition. Unfortunately, management can’t see the potential downtime events that were thwarted by heroic maintenance efforts on the part of the data center manager; they do, however, see all too clearly when such an event does occur. Still, regular, comprehensive maintenance can also improve the efficiency of data center systems, yielding benefits beyond just the reduced stress and lower costs associated with a facility that experiences few unplanned downtime events. The returns on an investment in maintenance are therefore well worth the costs, even if the work goes largely unnoticed.