Industry Outlook is a regular Data Center Journal Q&A series that presents expert views on market trends, technologies and other issues relevant to data centers and IT.
This week, Industry Outlook asks Dr. Ajay Dholakia about how organizations can implement and employ real-time analytics to help their business. Ajay is a principal engineer with the Lenovo Data Center Group (DCG), working on customer solutions in the areas of big data, analytics, AI and health care. He also drives new projects for solution development using emerging technologies such as the Internet of Things (IoT) and blockchain. In his career of over 25 years, he has led diverse projects in research, technology, product and solution development, and business/technical strategy. Ajay is currently chief architect for Lenovo DCG’s big data and AI solutions.
Ajay holds more than 50 patents and has authored over 40 technical publications, including the book Introduction to Convolutional Codes with Applications. He earned a BE (Hons.) in electrical and electronics engineering from the Birla Institute of Technology and Science in India, an MBA from the Henley Business School in the UK, and an MS and PhD in electrical and computer engineering from North Carolina State University. He is also a senior member of the Institute of Electrical and Electronics Engineers (IEEE) as well as a member of the Association for Computing Machinery (ACM).
Industry Outlook: Lots of talk focuses on the massive amounts of real-time data and the value it provides to enterprises. But all of that data requires processing to yield helpful insights. What are the most common hardware and software challenges IT organizations must overcome to tackle big data?
Ajay Dholakia: At the hardware and software level, each element must be designed to take in data differently. With both real-time and batch-mode analytics gaining popularity, the need to deliver on service-level agreements (SLAs) is driving new requirements for hardware design and software development. But just upgrading the hardware and/or software may be insufficient to realize the full value of all the data available to an enterprise.
IO: What major architecture patterns are emerging to handle the tremendous growth of data?
AD: A new class of architecture patterns can be called “data-centric” in the sense that applications are being developed to match the variety, volume and velocity of data rather than forcing the data into structures that quickly become unwieldy. The data-centric architecture must address the challenges of data ingestion, aggregation, cleaning, verification, integration, storage, analytics and, finally, usage. This is how data flows from the source: it becomes information as it’s stored, then insights through the latest analytics, and finally decisions that drive actions for the targeted operation in an enterprise. Each stage of this data-processing flow, or pipeline, calls for new patterns to be deployed.
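As a concrete illustration of that flow, the following is a minimal Python sketch of a data-centric pipeline. The stage names (ingest, clean, analyze) and the alert threshold are hypothetical placeholders chosen for the example, not part of any specific architecture; the point is that each stage is a small, replaceable element.

```python
# Minimal sketch of a data-centric pipeline: each stage is a small,
# replaceable function, so new patterns can be swapped in per stage.
# All names and values here are hypothetical.

def ingest(source):
    """Pull raw records from a data source (file, queue, sensor feed)."""
    for record in source:
        yield record

def clean(records):
    """Drop malformed records and normalize field types."""
    for r in records:
        if r.get("value") is not None:
            yield {"id": r["id"], "value": float(r["value"])}

def analyze(records, threshold=10.0):
    """Turn stored information into insights: flag values above a threshold."""
    for r in records:
        yield {**r, "alert": r["value"] > threshold}

def run_pipeline(source):
    """Chain the stages: raw data -> information -> insight -> decision."""
    for insight in analyze(clean(ingest(source))):
        if insight["alert"]:
            print(f"Act on record {insight['id']}: value={insight['value']}")

run_pipeline([{"id": 1, "value": "3.2"}, {"id": 2, "value": "14.8"}, {"id": 3, "value": None}])
```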
IO: How is the new data-centric perspective reshaping data center architectures?
AD: The shift from the application-centric to the data-centric perspective is forcing some reshaping of the data center architecture. The main shift in mindset is to access, collect, process and use data wherever it’s available. The data center architecture must therefore be flexible enough to connect with data sources and repositories that may be outside the traditional physical boundaries. It also means applications must be flexible so they can run where the data is, which in turn means the application elements need a flexible, API-driven design.
IO: Must all the processing occur in central data centers? Or will distributed, edge-centric data centers become more prevalent?
AD: Given that most traditional data centers are central, they’re a natural starting point for adding new data-centric capabilities. But this approach has critical limitations. Enterprises must fully understand the data-centric approach and grasp the notion of “data gravity”: the tendency of data to stay where it originates and accumulates. Data gravity is the driving force behind the emerging class of edge-centric data centers. SLAs involving latency, response time, security, data sovereignty and data locality will all stretch and spread the traditional central data centers into a collection of connected edge-centric pods that can process data locally while still passing it to the central repositories.
IO: What’s the impact of machine learning on data analytics? Is it necessary to stay competitive?
AD: Let’s think of machine learning (ML) as a broad collection of analytics tools. Although many ML algorithms have been around for several years, the ML toolbox is constantly expanding with new ones. In particular, a subset of ML called deep learning (DL) is undergoing lots of research activity as well as garnering interest from various industries.
IO: Is there an “easy” button for deploying storage and processing infrastructure that can handle heavy data loads? What should enterprises consider when making architecture decisions?
AD: The “easy button” in this case sits at the architecture level. Ensuring that the architectural elements for data ingestion, storage and processing are provisioned to deliver the required performance, reliability and scalability is the place to start. Depending on the data volume, variety and velocity, the data-ingestion pipeline must be able to accommodate all the data sources and feed into data storage for both batch-mode and real-time analytics. The data-storage elements must be staged for structured, semistructured and unstructured data, allowing for seamless capacity growth over time. Finally, data-processing compute capability must be available for both ML/DL model training and real-time inference using the trained models. Here, hardware accelerators and dynamic provisioning of the compute cluster for scalability are important features. Architected in this manner, the infrastructure can take on as many shapes and sizes as the analytics workload requires at a given time.
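For illustration only, one way to express such a provisioning plan is as a simple declarative structure that separates the ingestion, storage and compute tiers. The sizes, tier names and headroom check below are assumptions chosen for the example, not recommendations.

```python
# Hypothetical capacity plan illustrating the layers discussed above:
# ingestion sized for peak velocity, storage tiered by data shape, and a
# compute pool with accelerators and autoscaling for ML/DL training.
# None of these names or numbers map to a specific product.

plan = {
    "ingestion": {"peak_events_per_sec": 50_000, "buffer_hours": 24},
    "storage": {
        "structured": {"engine": "sql", "capacity_tb": 20},
        "semistructured": {"engine": "document", "capacity_tb": 80},
        "unstructured": {"engine": "object", "capacity_tb": 500},
    },
    "compute": {
        "training": {"nodes": 4, "gpus_per_node": 4, "autoscale_max": 16},
        "inference": {"nodes": 8, "latency_budget_ms": 250},
    },
}

def ingestion_headroom(plan, observed_events_per_sec):
    """Report how much ingest capacity remains before the SLA is at risk."""
    peak = plan["ingestion"]["peak_events_per_sec"]
    return (peak - observed_events_per_sec) / peak

print(f"Ingestion headroom: {ingestion_headroom(plan, 32_000):.0%}")
```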
IO: What are the basic rules for optimizing real-time data-analytics workflows? Can IT organizations take a one-size-fits-all approach?
AD: Although a one-size-fits-all approach can appear to work in the first few instances, the inherent inefficiencies and inflexibilities will limit the potential value overall. A few basic rules include modular design, API-driven elements, incorporation of acceleration at both the hardware and software levels, and the ability to monitor the SLAs using relevant metrics.
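A small sketch of the last rule, monitoring SLAs using relevant metrics, might look like the Python class below. The 250 ms target, the p95 percentile and the SlaMonitor name are illustrative assumptions, not figures from the interview.

```python
# Sketch of SLA monitoring: record per-request latencies and check a
# percentile against a target. Target and percentile are assumptions.

class SlaMonitor:
    def __init__(self, target_ms=250.0, percentile=95):
        self.target_ms = target_ms
        self.percentile = percentile
        self.samples = []

    def record(self, latency_ms):
        """Store the observed latency of one request."""
        self.samples.append(latency_ms)

    def in_compliance(self):
        """Return True if the chosen percentile is within the target."""
        if not self.samples:
            return True
        ordered = sorted(self.samples)
        # Nearest-rank percentile, clamped to the last sample.
        idx = min(len(ordered) - 1, int(len(ordered) * self.percentile / 100))
        return ordered[idx] <= self.target_ms

monitor = SlaMonitor()
for latency in (120, 180, 90, 300, 210, 150, 170, 95, 260, 140):
    monitor.record(latency)
print("Within SLA:", monitor.in_compliance())
```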
IO: What are the primary machine-learning applications for data analytics and how do they vary among industries?
AD: Applications for ML-based data analytics vary widely. If you’re in the financial sector, fraud detection is one real-time-analytics task that employs the latest ML techniques. The architecture for such an application involves a data-ingestion pipeline, data storage, batch-mode processing for training ML models and streaming-mode analytics for deploying the trained model “on the wire.” In the case of fraud detection for, say, credit-card transactions, millions of transactions must be processed, with a sub-second window for deciding whether each transaction should be flagged as fraudulent. For the transportation industry, fleet management is a real-time-analytics use case. For health care, a variety of in-clinic and at-home patient-care scenarios can involve real-time analytics.
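The batch-train, stream-score split described for fraud detection can be sketched roughly as follows, assuming scikit-learn is available. The two features, the tiny training set and the 0.5 threshold are invented for the example and stand in for a real feature pipeline and model.

```python
# Toy illustration of batch-mode training and streaming-mode scoring.
# Features (amount, foreign flag), labels and thresholds are made up.
import time
from sklearn.linear_model import LogisticRegression

# Batch mode: train on historical, labeled transactions.
history_X = [[12.0, 0], [950.0, 1], [33.0, 0], [1200.0, 1], [20.0, 0], [800.0, 1]]
history_y = [0, 1, 0, 1, 0, 1]   # 1 = fraudulent
model = LogisticRegression().fit(history_X, history_y)

def score_stream(transactions, budget_ms=500):
    """Streaming mode: score each incoming transaction 'on the wire'."""
    for txn in transactions:
        start = time.perf_counter()
        prob = model.predict_proba([txn["features"]])[0][1]
        elapsed_ms = (time.perf_counter() - start) * 1000
        txn["fraud"] = prob > 0.5
        txn["within_sla"] = elapsed_ms <= budget_ms   # sub-second response check
        yield txn

live = [{"id": "t1", "features": [15.0, 0]}, {"id": "t2", "features": [1100.0, 1]}]
for result in score_stream(live):
    print(result["id"], "fraud" if result["fraud"] else "ok", "SLA met:", result["within_sla"])
```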
IO: What are the primary metrics for evaluating the suitability of analytics techniques for real-time data processing?
AD: The analytics engines must deliver the throughput and latency that the target applications require. They must therefore achieve response times such that the insights from the analytics are deemed impactful in a timely fashion. Furthermore, the analytics engines need to be flexible and modular, and preferably API-based, so they can be added to the target applications as microservices.
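As a rough sketch of packaging an analytics engine as an API-based microservice, the snippet below uses Flask as an assumed (not prescribed) web framework. The /score endpoint and the stand-in scoring function are placeholders for a trained model.

```python
# Minimal sketch of exposing an analytics engine as a microservice.
# Flask is an assumption; the scoring rule is a placeholder.
from flask import Flask, request, jsonify

app = Flask(__name__)

def score(features):
    """Stand-in for a real analytics engine or trained model."""
    return sum(features) / max(len(features), 1)

@app.route("/score", methods=["POST"])
def score_endpoint():
    # Target applications call this over HTTP and compose it with other services.
    payload = request.get_json(force=True)
    return jsonify({"score": score(payload.get("features", []))})

if __name__ == "__main__":
    app.run(port=8080)
```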
IO: How can enterprises begin enabling machine-learning algorithms to handle real-time data processing?
AD: Enterprises must formulate a strategy that enables all the elements for real-time analytics. The main steps are setting up the data pipelines that connect the data sources to the analytics engines and that carry the analytics output to visualization and usage in the target applications. Additionally, access to data repositories and data-science sandboxes will help complete the architecture and allow for the variability needed to solve specific business problems.