A massive transition is occurring in the data center, and it all starts with how we derive value from the systems already in place. Applications remain the focal point of business success, but the systems on which they are deployed continue to influence their effectiveness. The goal of cloud computing for enterprise IT and SaaS providers is to employ existing compute, network and storage resources as effectively as the top-tier cloud providers do. But how?
Enter telemetry. Telemetry has a long history: it was introduced when the first data-transmission circuits were developed in 1845 to send data between the Russian Tsar's winter palace and the army headquarters, and it is still used today to monitor the locks and water levels of the Panama Canal. Simply put, it enables data collection from remote points for monitoring and measurement.
Telemetry, as it applies to data center infrastructure today, offers unique value because it can uncover important observational data across servers, network switches and applications, allowing administrators to observe the state of systems and make decisions in a highly automated way. For example, with data centers moving to composable, software-defined architectures, telemetry can help administrators understand why and how the increased use of API calls and microservices in modern applications affects east-west traffic. Using this information, they can optimize the system for better throughput and more fluid end-user experiences, regardless of the content or of where the data center and end user are located.
Harnessing telemetry data to make such decisions is something we call intelligent resource orchestration (IRO), and it is driving the next evolution of cloud infrastructure. Software-defined infrastructure (SDI) harnesses the power of IRO to let the cloud components that run workloads and consume hardware do so in a highly automated way. This automation reduces human interaction while expanding the analysis of patterns, achieving greater density, scale and agility along with less downtime. Top cloud providers are already using SDI to deliver cloud services faster and more efficiently. By dynamically allocating the required resources through a layer of abstraction and intelligent software, application and service delivery can be carefully orchestrated on demand across many thousands of nodes.
Now, we know that humans are natural observers. We can recognize patterns and have a highly tuned ability to identify threats. But we are also prone to mistakes and to inefficient use of time, a major disadvantage for cloud-based data centers, where consumers demand always-on, continuously available digital services. Computers previously lacked the same level of intrinsic observability, but recent strides in machine learning are changing that. "Sane automation" is the most reasonable step between the two worlds: we can automate repetitive tasks and context gathering while leaving mission-critical business decisions to humans. Meanwhile, IRO's promise is becoming a reality: it can now observe the system for us, and it's moving closer to being able to act on those observations.
The IRO model comprises four primary domains that interact with one another. Here is how it works:
Watch: Data for each resource (compute, network and storage) is virtualized and made available for review. Information is collected from hardware, software, components and services. For example, we can gain visibility into data throughout the data center using a common API and feed it into a common monitoring and learning framework.
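As a rough illustration of the Watch domain, here is a minimal Python sketch that polls a hypothetical per-node HTTP metrics endpoint and normalizes the readings into a common schema for a downstream monitoring pipeline. The endpoint path, field names and MetricSample shape are illustrative assumptions, not any specific product API.

```python
# Minimal "Watch" sketch: poll each node's telemetry endpoint and map the raw
# readings onto one common schema. Endpoint and field names are hypothetical.
import time
from dataclasses import dataclass

import requests  # any HTTP client would do


@dataclass
class MetricSample:
    node: str
    resource: str      # "compute", "network" or "storage"
    name: str          # e.g. "cpu.utilization"
    value: float
    timestamp: float


def collect(node: str) -> list[MetricSample]:
    """Pull raw metrics from one node and normalize them."""
    raw = requests.get(f"http://{node}:9100/metrics.json", timeout=5).json()
    now = time.time()
    return [
        MetricSample(node, m["resource"], m["name"], float(m["value"]), now)
        for m in raw["metrics"]
    ]


def watch(nodes: list[str]) -> list[MetricSample]:
    """One Watch pass: gather samples from every reachable node."""
    samples: list[MetricSample] = []
    for node in nodes:
        try:
            samples.extend(collect(node))
        except requests.RequestException:
            # An unreachable node is itself a signal; skip it here and let the
            # Decide step notice the gap.
            continue
    return samples
```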
Decide: In this domain, something happens (a request is made or a system dies), so what should happen as a result of this change to the system? The computer reviews observational data to decide how to react. This domain is where we find schedulers, decision engines and orchestration policies: for example, the monitoring interfaces and analytics that a system uses to consume the massive amounts of information the cloud offers. The situation is similar to the sane dashboards that an SRE might need.
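Building on the sample shape from the sketch above, a decision engine can be as small as a list of threshold policies mapped to proposed actions. The metric names, thresholds and action labels below are illustrative assumptions rather than recommended policies; the point is simply observations in, proposals out.

```python
# Minimal "Decide" sketch: orchestration policy as threshold rules over the
# samples produced by the Watch step. All rules here are hypothetical.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    metric: str                          # which sample name the rule watches
    predicate: Callable[[float], bool]   # when the rule fires
    action: str                          # proposed response, e.g. "scale_out"


POLICIES = [
    Policy("cpu.utilization", lambda v: v > 0.85, "scale_out"),
    Policy("net.east_west_bps", lambda v: v > 8e9, "rebalance_traffic"),
    Policy("disk.errors", lambda v: v > 0, "alert"),
]


def decide(samples):
    """Turn observations into proposed actions; nothing is executed yet."""
    proposals = []
    for sample in samples:
        for policy in POLICIES:
            if policy.metric == sample.name and policy.predicate(sample.value):
                proposals.append((policy.action, sample.node, sample.name, sample.value))
    return proposals
```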
Act: If the previous decisions lead to a good outcome, the approach becomes automatic, reducing human intervention and giving computers the power to change and optimize the system in real time. For example, we now automate what we can through configuration management and through practices such as continuous integration and SRE playbooks.
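A hedged sketch of the Act domain, continuing the same example: proposals from the Decide step are handed to whatever automation already exists, such as a configuration-management run, an orchestrator call or an SRE playbook. The handlers below are placeholders, not real tooling APIs, and a dry-run flag keeps a human in the loop until the outcomes have proven themselves.

```python
# Minimal "Act" sketch: dispatch proposed actions to placeholder handlers.
# In a real system these would call configuration management or a scheduler.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("iro.act")


def scale_out(node, metric, value):
    log.info("scaling out near %s (%s=%.2f)", node, metric, value)
    # e.g. trigger an orchestrator or a playbook here


def rebalance_traffic(node, metric, value):
    log.info("rebalancing east-west traffic away from %s", node)


HANDLERS = {"scale_out": scale_out, "rebalance_traffic": rebalance_traffic}


def act(proposals, dry_run=True):
    """Execute each proposal, or just log it while dry_run is True."""
    for action, node, metric, value in proposals:
        handler = HANDLERS.get(action)
        if handler is None or dry_run:
            log.info("would run %s for %s (%s=%.2f)", action, node, metric, value)
        else:
            handler(node, metric, value)
```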
Learn: This domain is changing rapidly with the onset of machine and deep learning, and it positions the computer as a tool for recognizing patterns in huge data volumes, improving the domain cycle with each pass. As the computer learns, it recognizes changes and makes recommendations for improvements, such as which APIs should be called to take action. As a result, we can further reduce human involvement and make the system run more smoothly and efficiently. This stage allows us to improve on everything we've done so far; it can be as simple as capacity planning or as complex as defining the cost per customer for a SaaS provider.
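As one deliberately simple example of the Learn domain, the sketch below fits a linear trend to historical utilization samples and projects how many days remain before a capacity limit is reached, a bare-bones version of the capacity planning mentioned above. The limit and the sample data are illustrative assumptions.

```python
# Minimal "Learn" sketch: extrapolate a utilization trend for capacity planning.
import numpy as np


def days_until_exhaustion(timestamps, utilization, limit=0.9):
    """Fit a least-squares trend and estimate days until `limit` is crossed."""
    t = np.asarray(timestamps, dtype=float)   # seconds since some epoch
    u = np.asarray(utilization, dtype=float)  # 0.0 .. 1.0
    slope, intercept = np.polyfit(t, u, 1)    # simple linear trend
    if slope <= 0:
        return None                           # flat or falling: no exhaustion
    t_limit = (limit - intercept) / slope     # when the trend crosses the limit
    return max(0.0, (t_limit - t[-1]) / 86400.0)


# Example: 30 daily samples trending upward by 0.4% per day.
days = np.arange(30) * 86400.0
util = 0.5 + 0.004 * np.arange(30)
print(days_until_exhaustion(days, util))      # roughly 71 days at this rate
```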
One instance of this cycle in practice is the telecom industry, which must contend with running, scheduling and automating millions of tiny services to offer a fast, stable overall service. This capability is especially important with the advent of 5G networks. Using IRO, telecom providers can create a highly automated system that operates at scale, greatly reduces latency and lowers costs. In fact, any latency-sensitive industry, such as health care, financial services and content-delivery networks, can benefit from a cloud-based IRO approach. A great place to start is the Snap open-source telemetry platform.
Telemetry data plays a critical role in automating cloud infrastructure and reducing total cost of ownership, and it's enabling IRO at every level of the cloud stack. To gain the most valuable insights and the most effective automation possible, we must start with the data that telemetry collection can offer. Doing so will enable a more intelligent cloud, one that will fundamentally affect every industry and create unprecedented opportunities for greater intelligence, new business models and the efficient delivery of new digital services.
About the Author
Jonathan Donaldson is Vice President for the Data Center Group as well as General Manager for the Software Defined Infrastructure Group at Intel.