Storage requirements for AI, ML and analytics in 2022
Artificial intelligence (AI) and machine learning (ML) promise to transform entire sectors of the economy and society if they are not already doing so. From driverless cars to customer service “bots,” AI and ML-based systems are driving the next wave of business automation.
They are also massive consumers of data. After about a decade of relatively steady growth, the data used by AI and ML models has grown exponentially as scientists and engineers strive to improve the accuracy of their systems. This places new and sometimes extreme demands on IT systems, including storage.
AI, ML and analytics require large amounts of data, mostly in unstructured formats. “All of these environments use massive amounts of unstructured data,” said Patrick Smith, field CTO for Europe, Middle East and Africa (EMEA) at vendor Pure Storage. “It’s a world of unstructured data, not blocks or databases.”
In particular, when training AI and ML models, larger data sets are used for more accurate predictions. As Vibin Vijay, an AI and ML specialist at OCF, points out, a basic proof-of-concept model could expect 80% accuracy on a single server.
With training on a cluster of servers, this can rise to 98% or even 99.99% accuracy. However, this places its own demands on the IT infrastructure. Almost all developers assume that more data is better, especially in the training phase. “This leads to huge data collections of at least petabytes for the company to manage,” said Scott Baker, CMO at IBM Storage.
Storage systems can become a bottleneck. The latest advanced analytics applications make heavy use of CPUs and, especially, GPU clusters connected via technologies such as Nvidia’s InfiniBand. Developers are even considering connecting storage directly to GPUs.
“AI and ML workloads typically use powerful GPUs in the learning phase that are expensive and in high demand,” said Brad King, co-founder and field CTO at vendor Scality. “They can chew through massive amounts of data and often sit idle waiting for more data due to storage limitations.
“The amounts of data are usually large. Big is of course a relative term, but in general, the more relevant data available, the better the insights that can be extracted from it.”
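King's point about GPUs waiting on storage can be made concrete with a back-of-the-envelope estimate. The sketch below is illustrative only: the function names, the GPU count, the per-GPU sample rate and the sample size are all assumptions chosen for the example, not figures from any vendor quoted here.

```python
def required_read_throughput_gbps(num_gpus, samples_per_sec_per_gpu, bytes_per_sample):
    """Aggregate read bandwidth (GB/s) storage must sustain to keep every GPU fed."""
    return num_gpus * samples_per_sec_per_gpu * bytes_per_sample / 1e9


def gpu_utilization(storage_gbps, required_gbps):
    """Upper bound on the fraction of time GPUs do useful work,
    assuming storage bandwidth is the only limiting factor."""
    if required_gbps <= 0:
        return 1.0
    return min(1.0, storage_gbps / required_gbps)


# Hypothetical training job: 8 GPUs, each consuming 2,000 samples/s of 500 KB each.
needed = required_read_throughput_gbps(
    num_gpus=8, samples_per_sec_per_gpu=2000, bytes_per_sample=500_000
)
print(f"Storage must deliver about {needed:.1f} GB/s")  # 8.0 GB/s

# If the storage system can only sustain 2 GB/s, the GPUs idle most of the time.
print(f"Best-case GPU utilization: {gpu_utilization(2.0, needed):.0%}")  # 25%
```

Under these assumed numbers, a storage system that delivers a quarter of the required bandwidth leaves expensive GPUs idle for roughly three quarters of the run.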
The challenge is to deliver high-performing storage at scale and on budget. As OCF’s Vijay points out, designers may want all storage on high-performance Tier 0 flash, but this is rarely, if ever, practical. And because of how AI and ML work, especially in the training phases, this may not be necessary.
Instead, organizations are adopting tiered storage, moving data up and down the tiers, from flash to the cloud and even to tape. “You’re looking for the right data in the right place at the right price,” says Vijay.
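A tiering policy of this kind can be sketched as a simple placement rule. The example below is a minimal illustration, not any vendor's implementation: the tier names, the age thresholds and the `choose_tier` function are hypothetical, and real systems typically weigh access frequency, object size and cost as well.

```python
# Hypothetical thresholds: (maximum days since last access, tier name).
TIERS = [
    (7, "flash"),    # hot data, recently read
    (90, "cloud"),   # warm data, read occasionally
]
ARCHIVE_TIER = "tape"  # cold data kept for future models


def choose_tier(days_since_last_access, in_active_training_set=False):
    """Pick a storage tier for a dataset based on how recently it was used.

    Data that belongs to a training run in progress is pinned to flash
    regardless of age, so the GPUs are not left waiting on slower tiers.
    """
    if in_active_training_set:
        return "flash"
    for max_age_days, tier in TIERS:
        if days_since_last_access <= max_age_days:
            return tier
    return ARCHIVE_TIER


for age in (2, 30, 400):
    print(age, "days ->", choose_tier(age))
```

The override for active training sets reflects the idea that data moves back up the tiers when a model needs it again.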
Businesses also need to think about data retention. Data scientists cannot predict what information will be needed for future models, and analytics are enhanced with access to historical data. Inexpensive and long-term data archiving remains important.
What types of storage are the best?
There is no single option that meets all storage needs for AI, ML, and analytics. The traditional notion that analytics is a high-throughput, high-I/O workload best suited to block storage must be balanced against data volumes, data types, the speed of decision making, and of course, budgets. An AI training environment places different demands on storage than a web-based recommendation engine that works in real time.
“Block storage has traditionally been well-suited for high-throughput, high-I/O workloads where low latency is important,” said Tom Christensen, Global Technology Advisor at Hitachi Vantara. “However, with the advent of modern data analytics workloads, including AI, ML, and even data lakes, it has been found that traditional block-based platforms are unable to meet the scale-out demands that the compute side of these platforms creates. Therefore, a file and object-based approach must be taken to support these modern workloads.”
Block access storage
Block-based systems keep the edge in raw performance and support data centralization and advanced features. According to IBM’s Scott Baker, block storage arrays support application programming interfaces (APIs) that AI and ML developers can use to enhance repetitive operations or even offload storage-specific processing to the array. It would be wrong to completely exclude block storage, especially when high IOPS and low latency are required.
This is offset by the need to build dedicated storage networks for block storage – typically Fibre Channel – and the overheads that come with block storage relying on an off-array (host-based) file system. As Baker points out, this becomes even more difficult when an AI system uses more than one operating system.
File and object
As a result, system architects tend to prefer file or object-based storage for AI and ML. Object storage is designed with petabyte-scale capacity in mind and is built to scale. It is also designed to support applications such as the Internet of Things (IoT).
Erasure coding provides data protection, and advanced metadata support in object systems can benefit AI and ML applications.
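Erasure coding lets a storage system rebuild lost data from the pieces that survive a failure. The sketch below shows the simplest possible form, a single XOR parity shard that tolerates the loss of any one shard. It is an illustration of the principle only: production object stores generally use stronger schemes such as Reed-Solomon codes that survive multiple failures, and the function names here are made up for the example.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data_shards):
    """Return the data shards followed by one XOR parity shard.

    All shards must have the same length.
    """
    parity = reduce(xor_bytes, data_shards)
    return list(data_shards) + [parity]


def recover(stored_shards, lost_index):
    """Rebuild the shard at lost_index by XOR-ing all surviving shards."""
    survivors = [s for i, s in enumerate(stored_shards) if i != lost_index]
    return reduce(xor_bytes, survivors)


data = [b"AAAA", b"BBBB", b"CCCC"]
stored = encode(data)            # 3 data shards + 1 parity shard

# Simulate losing the second shard (for example, a failed drive) and rebuild it.
rebuilt = recover(stored, lost_index=1)
print(rebuilt == data[1])
```

In this layout four shards are stored for every three shards of data, an overhead of about 33%, compared with 200% for keeping three full copies.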
On the other hand, object storage lags behind the performance of block systems, although the gap is closing with newer, more powerful object technologies. Application support also varies, as not all AI, ML, or analytics tools support AWS’s S3 interface, the de facto standard for objects.
Cloud storage is largely object-based, but offers other benefits for AI and ML projects. Above all, this includes flexibility and low upfront costs.
The main disadvantages of cloud storage are latency and potential data egress costs. Cloud storage is a good choice for cloud-based AI and ML systems, but it’s harder to justify where data needs to be extracted and loaded onto local servers for processing, as it increases costs. But the cloud is economical for long-term data archiving.
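The egress concern is easy to quantify once a price is known. The following sketch uses a placeholder rate and a made-up workload; the `egress_cost` function and every number in it are assumptions for illustration, and actual cloud pricing varies by provider, region and volume.

```python
def egress_cost(dataset_tb, pulls_per_month, price_per_gb):
    """Monthly cost of repeatedly downloading a dataset from cloud storage.

    dataset_tb      -- dataset size in terabytes (1 TB taken as 1,000 GB)
    pulls_per_month -- how many times the full dataset is copied out
    price_per_gb    -- egress price per gigabyte (placeholder value)
    """
    return dataset_tb * 1000 * pulls_per_month * price_per_gb


# Hypothetical case: a 50 TB training set pulled to local servers once a week
# at an assumed rate of 0.05 per GB.
monthly = egress_cost(dataset_tb=50, pulls_per_month=4, price_per_gb=0.05)
print(f"Estimated monthly egress charge: {monthly:,.0f}")
```

Archived data that is written once and rarely read incurs little of this charge, which is consistent with the cloud remaining economical for long-term retention.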
What do storage providers recommend?
Not surprisingly, vendors don’t recommend a single solution for AI, ML, or analytics – the number of applications is too large. Instead, they recommend looking at the business needs behind the project and looking to the future.
“Understanding what outcomes or business purposes you need should always be your first thought when deciding how to manage and store your data,” said Paul Brook, director of data analytics and AI for EMEA at Dell. “Sometimes the same data is needed on different occasions and for different purposes.”
Brook notes a convergence between block and file storage in single appliances, as well as systems that bridge the gap between file and object storage with a single file system. This will help AI and ML developers by providing a more common storage architecture.
HPE, for example, recommends on-premises, cloud, and hybrid options for AI and sees a convergence between AI and high-performance computing. NetApp touts its cloud-connected, all-flash ONTAP storage system for AI.
At Cloudian, CTO Gary Ogasawara anticipates a convergence between the high-performance batch processing of the data warehouse and streaming data processing architectures. This will push users towards object solutions.
“Block and file storage have architectural limitations that make scaling beyond a certain point costly,” he says. “Object storage offers unlimited, extremely cost-effective scalability. Object storage’s advanced metadata capabilities are another key benefit in supporting AI/ML workloads.”
It’s also important to plan for storage from the start, because without adequate storage, project performance suffers.
“To successfully implement advanced AI and ML workloads, having the right storage strategy is just as important as the advanced compute platform you choose,” says Christensen of Hitachi Vantara. “Underserving a complex distributed and very expensive computational platform leads to poorer results, reducing the quality of your result and ultimately lengthening the time to value.”