This article is the second of a three-part series examining the main challenges of acquiring, implementing and utilizing data center infrastructure management (DCIM). Part 1 presented a broad review of the different DCIM functions in light of the owner’s operational requirements in the data center, highlighting the challenges in monitoring, capacity management, analytics and reporting. This part presents a method for expanding DCIM from a data center management tool into a tool for managing IT, capacity, energy and cost.
Vendors who sell DCIM as a “cure-all” or “one-size-fits-all” solution, rather than tailoring the system to the customer’s individual business requirements, compound the problems faced by owners who want to move to DCIM. These “one-solution systems” aim to address every priority of both IT and facilities; in practice, however, they often serve one or two priorities well while failing to meet the full functional needs of data center operators. To address this issue, some DCIM vendors provide modular packages that can be tailored to the customer. For example, a colocation provider may be more interested in control of HVAC-plant and power-management subsystems, whereas a small enterprise data center may want to directly monitor IT systems in the white space and rely on facilities staff to look after control and power.
The questions that data center owners must ask themselves are “What do I need to manage my data center?” and “What visibility do I need from my data center to manage ongoing operation and plan for the future?” Consider two different cases: data center (1) is a new build with rack PDU metering, and data center (2) is a legacy facility with no rack power-monitoring capability that has been using an existing building management system (BMS) to monitor infrastructure and spreadsheets of manual readings to track sold capacity and customer power draw. One can argue that the requirements of (2) are very different from those of (1). Data center (2) may prefer to integrate the BMS with a DCIM solution that also takes inputs from the manually assembled power-utilization spreadsheets, whereas data center (1) may want DCIM to collect live rack power data and monitor customer power usage per circuit. The operator of data center (2) may have identified its limitations and have a good understanding of its capacity limits, whereas data center (1) is relying on the DCIM platform.
What the two operators have in common is that both want a sensible handle on their operating cost, efficiency, capacity management and forecasting capability.
At some point in the data center’s life, and at some level in the organization, someone will try to forecast energy consumption and cost. This effort could take the form of a spreadsheet built around a model of the facility (see Figure 1). It is a labor-intensive task that tends to evolve into an ever-growing chain of error-ridden spreadsheets ultimately understood by only one person in the organization. To add to the difficulty, the problem is not static: equipment performance characteristics vary with load, ambient temperature and operating conditions. The task is too onerous for a human to solve manually.
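To see why a static spreadsheet falls short, consider a minimal sketch (all figures hypothetical) comparing an annual energy estimate built on a single fixed PUE with one in which cooling efficiency degrades as ambient temperature rises:

```python
# Illustrative only: why a static spreadsheet figure drifts from reality.
# All numbers are hypothetical assumptions, not taken from the article.
import math

IT_LOAD_KW = 500.0
STATIC_PUE = 1.5  # the single figure a spreadsheet model might assume

def pue_at(ambient_c: float) -> float:
    """Hypothetical PUE that worsens as ambient temperature rises."""
    return 1.3 + 0.02 * max(0.0, ambient_c - 15.0)

# A real forecast would use hourly weather data; here we fake a
# simple seasonal swing across the 8,760 hours of a year.
ambient = [15.0 + 10.0 * math.sin(2 * math.pi * h / 8760) for h in range(8760)]

static_kwh = IT_LOAD_KW * STATIC_PUE * 8760
varying_kwh = sum(IT_LOAD_KW * pue_at(t) for t in ambient)
print(f"static estimate : {static_kwh:,.0f} kWh")
print(f"varying estimate: {varying_kwh:,.0f} kWh")
```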
Adding Rocket Science to DCIM
To understand cost properly, one must solve the engineering problem first. This is where predictive modeling adds some rocket science to the analytics of the DCIM workflow model presented in Part 1 of this series (see Figure 2).
One misconception to clear up is that every reference to modeling means computational fluid dynamics (CFD). Part 3 will examine the operational benefits of CFD alongside DCIM. When we reference modeling in the context of plant performance, we are referring to the function of the plant; for example, the behavior of computer-room air-handling (CRAH) units with changing IT load, or of chillers with changing ambient conditions and load.
The question is often asked, “How do we calculate or model the expected performance?” The answer is simple: the performance of every device in the data center can be described by a mathematical model. Vendors have libraries of this information, whether it is load versus efficiency for a UPS, transformer, fan or pump. Figure 3 shows the coefficient of performance (CoP) of a particular chiller against ambient temperature and load.
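As a simple illustration of such a library entry, the sketch below linearly interpolates a UPS load-versus-efficiency curve; the sample points are invented, not taken from a real datasheet:

```python
# A minimal sketch of a vendor-style performance curve: UPS efficiency
# versus load fraction, linearly interpolated between published points.
# The curve points below are hypothetical, not from a real datasheet.

UPS_CURVE = [(0.10, 0.86), (0.25, 0.92), (0.50, 0.95),
             (0.75, 0.955), (1.00, 0.95)]  # (load fraction, efficiency)

def ups_efficiency(load_fraction: float) -> float:
    """Interpolate efficiency from the curve; clamp outside its range."""
    if load_fraction <= UPS_CURVE[0][0]:
        return UPS_CURVE[0][1]
    for (x0, y0), (x1, y1) in zip(UPS_CURVE, UPS_CURVE[1:]):
        if load_fraction <= x1:
            return y0 + (y1 - y0) * (load_fraction - x0) / (x1 - x0)
    return UPS_CURVE[-1][1]

print(ups_efficiency(0.40))  # efficiency at 40% load -> 0.938
```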
The main characteristics of the curve in Figure 3 are the sharp drop in CoP as condenser temperature increases beyond 30°C and the small improvement in CoP as the cooling load approaches 100%. The chiller behavior in Figure 3 represents just one subsystem among the collective group of critical plant components in the data center. These components interact with each other in the distribution of power and heat through the facility. For example, the chiller sees a cooling load and draws power from the switchboard or panel serving it, representing the interaction between the mechanical and electrical plant.
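That interaction is easy to express in code. The sketch below pairs a hypothetical CoP surface (with the same qualitative shape as Figure 3) with the electrical draw it implies; the coefficients are invented, and a real model would be fitted to the manufacturer's performance data:

```python
# Hypothetical chiller CoP surface and the electrical draw it implies.
# Coefficients are invented for illustration; a real model would be
# fitted to the manufacturer's data (as plotted in Figure 3).

def chiller_cop(ambient_c: float, load_fraction: float) -> float:
    cop = 5.5
    if ambient_c > 30.0:            # sharp drop beyond 30 degrees C
        cop -= 0.15 * (ambient_c - 30.0)
    cop += 0.5 * load_fraction      # small gain as load approaches 100%
    return max(cop, 1.0)

def chiller_power_kw(cooling_load_kw: float, ambient_c: float,
                     capacity_kw: float) -> float:
    """The chiller 'sees' a cooling load and draws electrical power from
    the switchboard serving it: kW_electrical = kW_thermal / CoP."""
    cop = chiller_cop(ambient_c, cooling_load_kw / capacity_kw)
    return cooling_load_kw / cop

print(chiller_power_kw(400.0, ambient_c=35.0, capacity_kw=500.0))
```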
The Skeleton Behind the DCIM Model
There is no single definition of what DCIM is or what it should include. The canard is that DCIM will fix your data center. The reality, however, is that it should provide a deeper view into the consumption and performance of the infrastructure and highlight areas of concern in the facility that need to be addressed before they become major risk items.
To unlock the ability to truly forecast cost and energy in the data center, DCIM should be able to provide a glimpse into the future, give the operator an idea of how the data center will react to different load and temperature conditions, and keep track of the utilized capacity of the major plant items. One way to achieve this is through a performance model that encompasses the main mechanical and electrical infrastructure in the power and cooling chain (see Figure 4). This model replaces the onerous, manually constructed spreadsheet model of Figure 1.
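The sketch below shows, under stated assumptions, what such a chain model might look like: it walks from IT load through UPS losses, fan power and chiller power to a predicted facility total and PUE. Every efficiency figure here is a hypothetical placeholder standing in for a fitted component model of the kind described above:

```python
# A minimal power-and-heat-chain sketch in the spirit of Figure 4:
# given IT load and ambient temperature, walk the chain (UPS losses ->
# cooling load -> CRAH fans -> chiller) to predict facility power and
# PUE. All efficiency figures are hypothetical placeholders.

def predict_facility(it_kw: float, ambient_c: float) -> dict:
    ups_eff = 0.94                          # assumed UPS efficiency
    ups_in_kw = it_kw / ups_eff
    ups_loss_kw = ups_in_kw - it_kw

    cooling_load_kw = it_kw + ups_loss_kw   # heat rejected into the space
    crah_kw = 0.05 * cooling_load_kw        # assumed CRAH fan fraction
    cop = max(5.5 - 0.15 * max(0.0, ambient_c - 30.0), 1.0)
    chiller_kw = (cooling_load_kw + crah_kw) / cop

    total_kw = ups_in_kw + crah_kw + chiller_kw
    return {"total_kw": round(total_kw, 1), "pue": round(total_kw / it_kw, 3)}

print(predict_facility(it_kw=800.0, ambient_c=25.0))
```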
The purpose of the model is to warn facilities personnel whether the reading at the meter is healthy or whether it falls outside the acceptable range for that condition. Why is this capability important? How do we know the reading from the utility meter or the chiller submeter is correct? The common response from site personnel is “We calibrate our meters annually,” which misses the point entirely. The meter presents the reading as it is, but it does not provide the layer of intelligence that tells the data center owner how far the reading is from the healthy value for that load and temperature condition. Think of a car dashboard: the driver has all the actionable intelligence needed to navigate and make decisions while driving. Without this intelligence (e.g., the fuel gauge and speedometer), it would be like driving in the dark.
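A minimal sketch of that health check, assuming a simple 5% tolerance band around the model's prediction:

```python
# Sketch of the 'healthy reading' check: compare a meter reading against
# the model's expected value for the same load and temperature condition.
# The 5% tolerance is an assumed figure, not a recommendation.

def reading_is_healthy(measured_kw: float, predicted_kw: float,
                       tolerance: float = 0.05) -> bool:
    """A calibrated meter can still report an unhealthy value; the model
    supplies the missing context of what the value should be."""
    return abs(measured_kw - predicted_kw) <= tolerance * predicted_kw

# e.g. a chiller submeter reads 95 kW while the model expects 80 kW:
print(reading_is_healthy(95.0, 80.0))  # False -> investigate
```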
Of course this model must tie in with the DCIM platform. First, identify the main data center components or functions that require management. Then identify the primary metering points of the facility and the granularity at which the information needs to be recorded. The quantity could be kW, amps, volts, Hz, RPM, °C, °F or relative humidity; equally important is an appropriate recording interval, such as 15 minutes, 30 minutes or hourly. Metering everything that moves and filling up disks with data that cannot be related back to any actionable intelligence is a poor investment. As a guide, choose the 1–10% of meters that a human can realistically read and review. For example, ask what you will actually do with per-rack metering when you only need to manage power at the row or PDU level.
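One way to capture these decisions is a register of nominated meters recording the quantity, unit and logging interval for each; the entries below are hypothetical:

```python
# Hypothetical metering-point register: nominate the primary meters,
# the quantity each records, and an appropriate logging interval,
# rather than metering 'everything that moves'.

METERING_POINTS = [
    # (meter id,           quantity,          unit,  interval in minutes)
    ("utility_main",       "real power",      "kW",  15),
    ("ups_a_output",       "real power",      "kW",  15),
    ("chiller_1_submeter", "real power",      "kW",  15),
    ("crah_hall1_supply",  "air temperature", "°C",  30),
]

for meter_id, quantity, unit, interval in METERING_POINTS:
    print(f"{meter_id}: log {quantity} in {unit} every {interval} min")
```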
In the example shown in Figure 5(a), the main meters have been nominated and the data center key performance indicators (KPIs) are tracked against the expected design performance (see Figure 5b). Where the differences start to grow in Figure 5(b), a blind divergence occurs; it is visible only when the healthy expected design reading from the model is plotted alongside the measurement. The common cause of divergence is manual intervention, where someone has changed a setting on the cooling system, or the system has responded adversely to a load condition but only to a degree that is not yet a risk item. If it is detected early, as in Figure 5(b), facilities personnel can attend to it promptly, recover the divergence gap and bring the data center back in line with the design.
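A sketch of how such divergence might be flagged automatically, assuming an invented 8% drift threshold over a 24-sample window:

```python
# Sketch of 'blind divergence' detection: the gap between measured and
# healthy expected readings is tracked as a relative residual, and a
# sustained drift is flagged. Window and threshold are assumed figures.

def diverging(measured: list[float], expected: list[float],
              window: int = 24, threshold: float = 0.08) -> bool:
    """True if the mean relative gap over the last `window` samples
    exceeds `threshold` (e.g. 8% above the healthy design value)."""
    gaps = [(m - e) / e for m, e in zip(measured, expected)][-window:]
    return sum(gaps) / len(gaps) > threshold

measured = [105.0] * 24  # e.g. someone changed a cooling setpoint
expected = [95.0] * 24   # healthy design values from the model
print(diverging(measured, expected))  # True -> attend to it promptly
```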
The DCIM analytics and data-storage functions should calculate and store the hourly predicted performance of the data center by analytically relating the energy interactions between the mechanical and electrical plant, as per the power and heat-chain model in Figure 4. The inputs to this model are the measured data center IT load and the external ambient temperature. This data can then be presented alongside the actual logged readings from the utility and/or other plant meters in the facility (see Figure 5b).
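The hourly job itself can be very small, as in the sketch below; here `predict_facility_kw` is a placeholder for the Figure 4 chain model, and a plain list stands in for the DCIM time-series store:

```python
# Sketch of the hourly analytics job: feed measured IT load and ambient
# temperature into the performance model and store the prediction next
# to the logged utility reading. `predict_facility_kw` is a placeholder
# for the chain model; `history` stands in for the time-series store.

def predict_facility_kw(it_kw: float, ambient_c: float) -> float:
    return it_kw * (1.3 + 0.01 * max(0.0, ambient_c - 20.0))  # placeholder

history = []
hourly_samples = [(800.0, 18.0, 1060.0), (820.0, 24.0, 1125.0)]

for it_kw, ambient_c, utility_kw in hourly_samples:
    predicted = predict_facility_kw(it_kw, ambient_c)
    history.append({"predicted_kw": round(predicted, 1),
                    "measured_kw": utility_kw})

print(history)
```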
DCIM as a Decision-Making Tool
The previous section discussed how DCIM could be equipped to assist facilities personnel in identifying risk and divergence by tracking the performance of the facility against the design; the same capability prompts senior management and engineering to ask, “Did I get what I paid for?”
Given that the engineering problem can be solved by the performance model in Figure 4, reporting activity-based metrics becomes a trivial exercise. These metrics could include customer-level PUE, total cost and delivery cost ($/kWh): the types of metrics that provide business intelligence and decision-making capabilities to senior management. Figure 6 shows allocated metrics for each IT customer in the data center that was modeled in Figure 4. The output comes from the engineering model, which accounts for the energy overheads as well as the cost-based components of capital and maintenance. Consider Figure 6(a): the allocated PUE of customer IT (1) is much higher than that of IT (2) and IT (3). This result is an outcome of the activity-based performance model in Figure 4, whereby IT (1) is powered by a different string of UPSs and a different type of CRAH unit than IT (2) and IT (3).
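A sketch of the allocation arithmetic, with invented kWh figures: each customer is charged only the overhead of the subsystems that actually serve it, so a customer on a less efficient UPS string or CRAH type receives a higher allocated PUE:

```python
# Sketch of activity-based PUE allocation. The overhead kWh attributed
# to each customer would come from the Figure 4 performance model; the
# figures below are invented for illustration.

customers = {
    #          IT energy            overhead attributed by the model
    "IT (1)": {"it_kwh": 100_000, "overhead_kwh": 60_000},  # older UPS/CRAH
    "IT (2)": {"it_kwh": 250_000, "overhead_kwh": 75_000},
    "IT (3)": {"it_kwh": 220_000, "overhead_kwh": 70_000},
}

for name, c in customers.items():
    allocated_pue = (c["it_kwh"] + c["overhead_kwh"]) / c["it_kwh"]
    print(f"{name}: allocated PUE = {allocated_pue:.2f}")
```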
Figure 6: Enabling useful KPIs and business intelligence from your DCIM platform. The figure shows important data center metrics allocated across the three different IT customers in Data Halls 1 and 2: (a) allocation of PUE, (b) allocation of cost, (c) allocation of $/kWh and (d) the data center performance model.
The performance model therefore allocates to IT (1) only its contributing energy and cost overheads. Furthermore, the total cost in Figure 6(b) is lower for IT (1) than for IT (2) and IT (3). This result could be due to IT (2) and IT (3) carrying a higher IT load (i.e., more IT kWh but less energy overhead). Additionally, IT (2) and IT (3) may have more onerous capital and maintenance costs tied to them than IT (1), possibly from an upgrade project.
Figure 6(c) plots the allocated $/kWh for each IT customer according to its activity-based performance model. “Data Center Delivery Cost” is an honest way of reporting cost efficiency and can be more relevant than PUE to senior management. For example, in the first two to three years of a new facility’s life, the capital-cost component is dominant regardless of the PUE and operating efficiency.
As per the definition in Figure 7, the delivery-cost metric captures the amortized capital cost, maintenance cost and energy-utility cost against the kWh of IT load delivered. Using this definition of $/kWh, it is easier to distinguish the cost efficiency of customers IT (2) and IT (3) from that of IT (1), as Figure 6(c) shows. Customer IT (1) costs the business $1.6/kWh, whereas IT (2) and IT (3) come in at $0.6/kWh and $0.8/kWh. Although the total cost of IT (2) and IT (3) is higher because they draw more IT load (see Figure 6b), the cost of serving each of their kWh is significantly lower, owing to one or more of the numerator components in Figure 7. This information is, of course, very sensitive and should be only in the hands of senior management or sales teams. But it helps the business decide on potential renegotiation penalties and/or decide to release unused capacity from existing customers in order to sell space to new customers.
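Expressed as code, the Figure 7 definition reduces to a one-line ratio; the input values below are illustrative only:

```python
# Sketch of the Figure 7 delivery-cost metric: amortized capital,
# maintenance and energy-utility cost divided by the IT kWh delivered.
# The example inputs are invented for illustration.

def delivery_cost_per_kwh(amortized_capital: float, maintenance: float,
                          energy_cost: float, it_kwh: float) -> float:
    return (amortized_capital + maintenance + energy_cost) / it_kwh

# e.g. a customer drawing 100,000 IT kWh in the period:
print(delivery_cost_per_kwh(amortized_capital=100_000, maintenance=25_000,
                            energy_cost=35_000, it_kwh=100_000))  # 1.6 $/kWh
```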
The model takes performance and data inputs from all the main plant items to allow for the assessment of data center capacity utilization at a component and/or system level. It helps the data center owner plan for upcoming infrastructure maintenance and enhancement programs, and it helps evaluate facility upgrade options without the need for long, complicated, error-ridden spreadsheets. Figure 8 presents a 10-year net-present-value (NPV) calculation for a cooling-system upgrade, taking into account the effects on energy overhead, maintenance cost and required capital investment. With this capability, in-house engineering teams can assess upgrade options in a shorter time frame and rule out those that fail to present a cost-efficient business case. On the basis of Figure 8, installing a Turbocor chiller would make little difference to the cost position over 10 years, whereas an indirect cooling system presents a saving of $600k.
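The appraisal behind a figure like that reduces to a standard discounted cash-flow sum. A minimal sketch, with an assumed discount rate and invented cash flows:

```python
# Sketch of a Figure 8 style appraisal: 10-year net present value of a
# cooling upgrade, from annual energy and maintenance savings set
# against the capital outlay. Rate and cash flows are assumed figures.

def upgrade_npv(capital: float, annual_saving: float,
                rate: float = 0.06, years: int = 10) -> float:
    pv_savings = sum(annual_saving / (1 + rate) ** t
                     for t in range(1, years + 1))
    return pv_savings - capital

# e.g. a hypothetical indirect cooling option:
print(f"NPV: ${upgrade_npv(capital=900_000, annual_saving=200_000):,.0f}")
```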
This part of the series presents an approach that can complement DCIM analytics to accurately track data center energy, cost and capacity against design performance. The core intelligence of this approach is a smart model that can inform the data center owner of the impact of load, temperature and change on the facility.
This approach allows the data center owner to manage expectations of the facility realistically, with fair and achievable PUE and/or cost targets. It also enables the owner to measure the influence of a change in one subsystem component on overall facility efficiency and cost before any implementation. The output from the analytics will cater to the requirements of facilities personnel as well as senior management.
For this effort to be effective, the appropriate meters need to be nominated and their output stored at appropriate intervals. Doing so requires some integration with existing BMS or DCIM platforms; this challenge is easing, given how readily data can now be gathered from the many different hardware devices in the facility.
About the Author
Ehsaan Farsimadan is Director of Engineering at i3 Solutions Group. He previously worked as a technical consultant for the Uptime Institute, where he was responsible for data center design and facility Tier certifications. Before Uptime Institute, he served at Romonet as Head of Modelling and Customer Engineering, responsible for the company’s client engineering and consulting engagements worldwide. He is a mechanical engineer with diverse industry experience. Ehsaan also previously worked as an M&E design consultant at Cundall, where he was responsible for developing data center concepts through to scheme design in addition to leading the modeling team. He specializes in data center predictive modeling and IT services and has good knowledge of electrical systems. Ehsaan is a chartered engineer accredited by the IMechE. He obtained his doctorate (PhD) in mechanical engineering in 2008 and also holds a bachelor’s degree in mechanical engineering with aeronautics. He has made significant contributions to the field of turbulence research and to the data center industry through the publication of a number of journal papers.