Businesses in all fields are introducing artificial intelligence (AI) technologies, such as machine learning, into their processes to deliver better products for their customers and better bottom-line results for their shareholders. But implementing AI effectively requires custom machine-learning models, massive amounts of computation and almost unfathomable amounts of data. Dealing with petabytes of data, originating from sources that range from the smallest Internet of Things (IoT) devices to the largest cities in the world, can be a challenge for storage technologies designed in the era of the megabyte and the millisecond.
Providing all that data to machine learning requires a new storage-interface technology designed for memory-speed storage: NVM Express (NVMe). Unlike SATA and SAS, NVMe eliminates latency-inducing, disk-centric protocols, instead using the fastest general-purpose processor-connection technology, PCI Express (PCIe), to minimize latency and provide massive bandwidth per device. This focus on petabytes and microseconds makes NVMe a great match for machine learning.
Data and the AI Pipeline
The key to machine learning is data. Processing the amounts of data required for meaningful results demands a well-thought-out data pipeline. Every company's data pipeline is different to match its business needs, but all pipelines have the same general stages: collection, preparation, design and training (a minimal skeleton of these stages appears after the list below). The output of this four-stage data pipeline is generally a model that can then run inference on new data at the edge or in the core. Owing to the sheer volume of data normally required, however, every stage must optimize its data flow to avoid bottlenecks. The NVMe interface was designed for exactly this kind of task and can help the AI pipeline in four ways:
- Faster and more-cost-effective data collection
- Quicker data-set preparation turnaround
- Shorter turnaround time for the model-design cycle
- More-hardware-efficient model training
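As a rough, hedged illustration of how these four stages hand data to one another, the skeleton below expresses the pipeline as plain Python functions. The stage names, contents and sample data are placeholders for illustration, not any particular vendor's implementation.

```python
"""Illustrative skeleton of the four-stage AI data pipeline described above.

Every body here is a placeholder; a real pipeline would call into tools
such as Spark, a NoSQL store and a training framework.
"""
from typing import Any, Dict, Iterable, List


def collect(sources: Iterable[str]) -> List[Dict[str, Any]]:
    """Stage 1: gather raw records (IoT reports, logs, ...) into a central store."""
    return [{"source": s, "value": len(s)} for s in sources]


def prepare(raw: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Stage 2: drop spurious records and keep a training-friendly layout."""
    return [r for r in raw if r["value"] > 0]


def design(prepared: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stage 3: experiment on a small subset to pick a candidate model structure."""
    sample = prepared[:100]
    return {"structure": "hypothetical-model", "sample_size": len(sample)}


def train(model: Dict[str, Any], prepared: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stage 4: fit the chosen structure on the full prepared data set."""
    model["trained_on"] = len(prepared)
    return model


if __name__ == "__main__":
    records = collect(["iot-sensor-7", "edge-gateway-3", ""])
    ready = prepare(records)
    candidate = design(ready)
    print(train(candidate, ready))
```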
NVMe for Smarter Data Collection
The first challenge for implementing AI is collecting the raw data into a central data store. The variety of this data is almost unlimited: sensor reports from IoT devices, networking logs, manufacturing-quality reports and more. In practice, tools such as Apache Spark and commercial services handle this task and perform filtering on the incoming data stream, finally depositing the unstructured data into NoSQL database clusters. NVMe can decrease the physical footprint of these servers while at the same time increasing their responsiveness.
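As a hedged sketch of that ingest-and-filter step, the PySpark fragment below reads raw JSON sensor reports, drops malformed records and persists the result. The paths, field names and thresholds are hypothetical, and a production job would write to a NoSQL connector rather than local Parquet files.

```python
# A minimal PySpark sketch of the ingest-and-filter step described above.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

# Read raw, semi-structured sensor reports from the landing zone.
raw = spark.read.json("/data/landing/iot/*.json")

# Drop obviously malformed readings before they reach the data store.
clean = (raw.filter(F.col("device_id").isNotNull())
            .filter(F.col("temperature").between(-60, 120)))

# Stand-in for a NoSQL sink: persist to columnar files on NVMe-backed storage.
clean.write.mode("overwrite").parquet("/data/curated/iot/")

spark.stop()
```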
Traditional NoSQL clusters consist of servers with many local SATA or SAS interfaces, each attached to a hard drive. Hard drives provide an economical way to store petabytes, but achieving maximum bandwidth often requires tens of SATA or SAS hard drives per server. This architecture inflates the size of individual servers and quickly fills data-center racks with servers whose CPUs are mostly idle.
A single NVMe interface can provide the bandwidth of many individual SATA or SAS interfaces while requiring only a single add-in card or 2.5" drive. Replacing each NoSQL server's large hard-drive array with a handful of much smaller NVMe SSDs lets individual nodes shrink and reduces the cluster's total rack space.
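The difference is easy to quantify with back-of-the-envelope figures (assumed for illustration, not measurements): a SATA hard drive sustains on the order of 200MB/s, a SATA SSD is capped near 550MB/s by the 6Gb/s interface, and a PCIe Gen3 x4 NVMe SSD can deliver roughly 3.2GB/s.

```python
# Back-of-the-envelope comparison of per-device bandwidth (illustrative figures,
# not measurements): how many SATA devices it takes to match one NVMe SSD.
import math

SATA_HDD_MBPS = 200      # assumed sustained throughput of a single SATA hard drive
SATA_SSD_MBPS = 550      # assumed SATA 6Gb/s SSD, limited by the interface
NVME_SSD_MBPS = 3200     # assumed PCIe Gen3 x4 NVMe SSD

for name, rate in [("SATA HDD", SATA_HDD_MBPS), ("SATA SSD", SATA_SSD_MBPS)]:
    print(f"{math.ceil(NVME_SSD_MBPS / rate)} x {name} to match one NVMe SSD")
```

Under these assumptions it takes roughly 16 hard drives or 6 SATA SSDs to match the bandwidth of a single NVMe device, which is exactly the rack-space argument above.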
NVMe for Smarter Data Preparation
Having terabytes or petabytes of data is a necessary prerequisite for AI training, but this data is seldom in a readily usable format. Data needs to be transformed into a format that an AI pipeline can more easily process. Outliers and spurious data must be filtered out. Portions of the data that may be inappropriate or even illegal, such as protected personal information, may require filtering at this stage as well.
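A minimal pandas sketch of this preparation stage might look like the following; the column names, the three-sigma outlier rule and the PII fields are all assumptions for illustration.

```python
# A hedged pandas sketch of the preparation stage: removing outliers and
# dropping fields that may contain protected personal information.
import pandas as pd

df = pd.read_parquet("/data/curated/iot/")          # output of the collection stage

# Drop columns that could carry protected personal information.
pii_columns = [c for c in ("user_name", "email", "ip_address") if c in df.columns]
df = df.drop(columns=pii_columns)

# Filter numeric outliers beyond three standard deviations of the mean.
numeric = df.select_dtypes("number")
mask = ((numeric - numeric.mean()).abs() <= 3 * numeric.std()).all(axis=1)
df = df[mask]

df.to_parquet("/data/prepared/iot/")                # ready for model design and training
```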
This kind of processing demand can overwhelm a storage system that isn't designed for high throughput. The limited per-interface bandwidth of SAS and SATA pales in comparison with NVMe's PCIe-based bandwidth of 6.4GB/s or more per device. Bandwidth isn't the only demand on a storage system during this preparation stage: parallelism is also critical. Because the amount of data being processed is so large, this stage operates in parallel across multiple servers in the cluster and across multiple cores within an individual server. NVMe supports up to 64K command queues, each up to 64K commands deep, streamlining parallel operation inside those servers.
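The sketch below illustrates the host side of that parallelism: many outstanding reads issued at once so the device's queues stay busy. The shard layout is hypothetical, and the operating system's block layer decides how these requests map onto the drive's hardware queues.

```python
# Sketch: issue many reads in parallel so the NVMe device's deep command
# queues stay full during data preparation.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SHARD_DIR = Path("/data/prepared/iot")   # hypothetical shard directory


def read_shard(path: Path) -> int:
    """Read one shard and return its size in bytes."""
    return len(path.read_bytes())


shards = sorted(SHARD_DIR.glob("*.parquet"))
with ThreadPoolExecutor(max_workers=64) as pool:   # many outstanding I/Os at once
    total = sum(pool.map(read_shard, shards))

print(f"Read {total / 1e9:.1f} GB across {len(shards)} shards")
```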
NVMe for Smarter Model Design
Once data is clean and has a uniform, easily digestible format, the real work of the data scientist can begin. Every problem is different, so scientists usually must iteratively develop a unique machine-learning structure. Only after much trial and error on a smaller subset of the data does a candidate trainable model move to the next processing stage. As in all scientific and engineering projects, many false starts may precede the final result.
The speed of individual cycles in this trial-and-error process can have an outsize impact on the final model design and the quality of the resulting machine-learning model. Shortening the design-and-test cycle time from 10 hours to 8, for example, may let a data scientist double the effective test rate. Instead of starting a job in the morning and not seeing results until the next day (an effective rate of one test per day), scientists may be able to design and run a test in the morning, obtain the results and tweak parameters in time to start another job before they leave the office in the afternoon, yielding an effective rate of two cycles per day.
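The arithmetic behind that example is simple; the helper below assumes a roughly nine-hour office day plus one job left running overnight.

```python
# The cycle-time arithmetic from the paragraph above, under the assumption of
# a roughly nine-hour office day plus one job left running overnight.
def cycles_per_day(cycle_hours: float, office_hours: float = 9.0) -> int:
    """Design-and-test cycles a scientist can react to per working day."""
    daytime_runs = int(office_hours // cycle_hours)  # runs that finish while present
    return daytime_runs + 1                          # plus the run left overnight


print(cycles_per_day(10))  # -> 1, as in the example above
print(cycles_per_day(8))   # -> 2
```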
Just as in the prior stages, NVMe's bandwidth and parallelism can help increase data scientists' effectiveness. Their personal workstations, where they test their models in private sandboxes, can take advantage of NVMe's low latency for the operating system and test data sets, as well as providing the fastest possible scratch space for analysis and test runs.
NVMe for Smarter Model Training
Once data engineers convert the data to a machine-learning-friendly format and data scientists have designed a learning model structure, the job of training the network begins. Hundreds or thousands of machines outfitted with accelerators take the formatted data and use it to refine the model parameters until they converge on a model that real applications can use for inference.
Older acceleration technologies based on GPUs were rarely I/O bound, so storage performance was seldom a concern. The general-purpose CPU running the server had plenty of time to handle I/O operations and get the next batch of data ready for the GPU. That’s no longer true, though, with FPGAs and even custom ASICs implementing the model training.
Because machine-learning accelerators can process data orders of magnitude faster than prior technologies, the general-purpose CPU running the server needs to process I/O orders of magnitude more efficiently. Legacy I/O stacks such as SATA and SAS waste precious CPU cycles translating I/O requests into protocols designed in the last century. Doing so increases I/O-request latency, which can directly reduce accelerator utilization. These legacy I/O stacks also increase host-CPU load, potentially limiting the number of accelerators that each processor can keep fed.
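As one hedged illustration (assuming PyTorch, which the text does not specify), the loader below uses parallel worker processes and prefetching to overlap reads from NVMe storage with accelerator compute; the data path and tensor layout are hypothetical.

```python
# A PyTorch sketch of keeping an accelerator fed: worker processes read and
# decode training data from NVMe storage while the accelerator computes,
# hiding storage latency behind compute.
import torch
from torch.utils.data import Dataset, DataLoader


class ShardDataset(Dataset):
    """Loads fixed-size training examples from a preprocessed tensor on disk."""

    def __init__(self, path: str = "/data/prepared/train.pt"):  # hypothetical path
        self.examples = torch.load(path)          # one tensor of shape [N, ...]

    def __len__(self) -> int:
        return self.examples.shape[0]

    def __getitem__(self, idx: int) -> torch.Tensor:
        return self.examples[idx]


if __name__ == "__main__":
    loader = DataLoader(
        ShardDataset(),
        batch_size=256,
        num_workers=8,        # parallel readers issue many I/Os to the NVMe device
        pin_memory=True,      # enables fast, asynchronous copies to the accelerator
        prefetch_factor=4,    # keep several batches staged ahead of the compute
    )
    for batch in loader:
        pass  # forward/backward pass on the accelerator would go here
```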
Because it was designed from the start as a memory-speed storage protocol, NVMe doesn’t incur these protocol-translation overheads. It thus minimizes processor load and helps ensure the timely feeding of data to these next-generation accelerators. An exciting extension to the NVMe protocol currently under examination, Controller Memory Buffers (CMB), may reduce this load even further by allowing NVMe devices to handle these direct memory transfers without host intervention.
NVMe for AI is Simply Smarter
Machine learning and AI are built around data. Collecting that data, processing it into usable formats, exploring learning architectures and finally training a model require a storage interface that is effective at petabyte scale and optimized for microsecond latency. NVMe, a technology designed for memory-speed storage, can provide just such an interface for machine learning and many other tasks.
About the Author
Ulrich Hansen focuses on product planning, product-line management and technical marketing for Western Digital’s enterprise SSD portfolio. His role includes defining the company’s next-generation solid-state products while ensuring that new products and technologies are successfully introduced into the enterprise and data center markets. He’s also responsible for assessing market opportunities and emerging technologies, defining requirements for new products, and aligning customers and industry partners with Western Digital’s product and technology strategies.
Ulrich has more than 20 years of experience in a number of high-technology sectors, including servers, storage, and network and communications systems. Before joining Western Digital through the HGST acquisition, he served as senior director of marketing for Entorian Technologies and held senior positions in product development, marketing and corporate strategy with management consultancies and technology companies including A.T. Kearney and Dell. He holds a master's degree in business administration from the University of Texas at Austin and a master's degree in electrical engineering (Diplom-Ingenieur) from RWTH Aachen University in Germany.