The question confronting companies dealing with big data is what to do with all that data. But the question really has a twofold meaning. Most often, it focuses on how a company should go about gleaning useful (and profitable) information from the masses of data that it collects. That is only part of the story, though; the other sense of the question is where the company will keep all that data. Big data is about more than just algorithms and techniques for distilling information: it’s about keeping that data in a place that is safe, accessible and affordable. But can storage keep up with the demands of big data analytics?
Trying to Hold the Deluge in a Paper Cup
The words of the popular Crowded House song “Don’t Dream It’s Over” (“try to catch the deluge in a paper cup”) illustrate a major problem that companies face with regard to big data: how can all that information be stored economically and accessibly? Solid-state drives (SSDs) offer far greater access speeds than hard disk drives (HDDs), for instance, but they are also much more expensive per gigabyte, so a wholesale move from HDDs to SSDs may help on the accessibility end but not on the affordability end. A hybrid approach that keeps frequently accessed (“hot”) data on SSDs and the rest on HDDs is an alternative, but even that fails to address the flood of data that companies face.
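A rough cost model shows why the tradeoff is so stubborn. The sketch below assumes illustrative per-gigabyte prices and a 20 percent “hot data” split; the numbers are placeholders for the relative gap between the two media, not market figures.

```python
# Back-of-the-envelope storage cost model. The per-GB prices and the
# hot-data fraction are illustrative assumptions, not market quotes.
HDD_PER_GB = 0.05   # dollars per gigabyte on spinning disk
SSD_PER_GB = 0.50   # dollars per gigabyte on flash, roughly 10x more

def storage_cost(total_gb, hot_fraction=0.2, hybrid=True):
    """Cost of storing total_gb; a hybrid tier keeps only 'hot' data on SSD."""
    if not hybrid:
        return total_gb * SSD_PER_GB            # everything on flash
    hot_gb = total_gb * hot_fraction
    cold_gb = total_gb - hot_gb
    return hot_gb * SSD_PER_GB + cold_gb * HDD_PER_GB

PETABYTE_GB = 1_000_000
print(storage_cost(PETABYTE_GB, hybrid=False))  # all-SSD:  $500,000 per PB
print(storage_cost(PETABYTE_GB))                # hybrid:   $140,000 per PB
print(PETABYTE_GB * HDD_PER_GB)                 # all-HDD:   $50,000 per PB
```

Even under these friendly assumptions, the hybrid tier costs nearly three times the all-HDD option per petabyte, which is why cost pressure keeps pulling bulk data back toward disk.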
A Forbes.com article aptly titled “Big Data And Storage: Why You Can’t Save Everything Forever” notes that “the amount of information we are generating is going up exponentially while the cost of storing it is not going down fast enough for us to be able to store all of it.” Think just of all the video cameras that are recording inside and outside buildings, along roads, in public places and elsewhere—all that video data must be stored, presumably for a long enough period to meet potential uses (ideally, forever—at least in the eyes of some). Add to that all the other sensors, computers and other gadgets recording location data, web traffic, tweets and status updates, snapshots of the web and on and on, and the result is trouble for data storage.
This problem is nothing new: companies and industry observers have been aware for some time of the difficulty of keeping pace with the amount of data created each day, to say nothing of how to use it. Another Forbes.com article says, “Forget about gigabytes and terabytes. Many corporations, banks, government agencies and scientific research institutions now handle petabytes of information.” And yes, floppy disks holding 1.44 megabytes were once actually somewhat useful.
Forget Analytics; Where Do We Put It All?
Technologies such as data deduplication, which stores only a single copy of identical chunks of data, seek to address the problem of data volume by increasing storage efficiency. But such approaches, although certainly helpful in their own ways, are the equivalent of putting a band-aid on a gushing wound. Forbes.com goes on to note that “the capacity of hard drives isn’t increasing fast enough to keep up with the explosion of digital data worldwide. Forecasts call for a 50-fold increase in global data by 2020, but hard drives may grow only by a factor of 15, even with new technology that allows more data to be crammed onto each square inch of a disk.”
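To make the idea concrete, here is a minimal sketch of fixed-size block-level deduplication (the DedupStore class, the 4 KB block size and the in-memory dictionaries are illustrative choices, not any particular product’s design; real systems typically use variable-size chunking and persist blocks to disk):

```python
import hashlib

BLOCK_SIZE = 4096  # fixed 4 KB blocks; real systems often chunk by content

class DedupStore:
    """Toy deduplicating store: each unique block is kept exactly once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 digest -> block bytes
        self.files = {}   # file name -> list of digests (recipe to rebuild it)

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only unseen content
            recipe.append(digest)
        self.files[name] = recipe

    def get(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())

store = DedupStore()
store.put("a.log", b"x" * 8192)   # two identical 4 KB blocks
store.put("b.log", b"x" * 8192)   # the same content again
print(store.stored_bytes())       # 4096 bytes kept, not 16384
assert store.get("b.log") == b"x" * 8192
```

Deduplication pays off most on highly redundant data such as backups; unique, already-compressed data, like the video streams mentioned above, gains very little, which is why it remains a band-aid.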
Individual companies, of course, may be able to meet their own data-storage needs within budget, perhaps by outsourcing some storage to the cloud or simply investing in more local infrastructure. But the worldwide rate of increase means that eventually HDDs will reach practical limits as a storage medium, even in the cloud. (And don’t forget that for every bit of data a conscientious company stores, it must also consume at least one more bit of storage space as a backup.) That leads to two alternative approaches to the problem: phase out old or useless data to close the gap, or develop a new storage technology that provides an affordable alternative to HDDs. Simply waiting for SSDs to fall in price may not cut it.
Each Approach Has Its Problems
Simply deleting unneeded data is a tempting option—until you give it some careful thought. What are the criteria by which data is to be judged useful (or not)? Obviously, the age of the data isn’t enough; some older data may be far more valuable than newer data. Furthermore, assuming some reasonable criteria can be determined, can these criteria be implemented in an algorithm? (Forget hiring people to do this laborious task.) This situation is likely a big data problem—which brings us full circle.
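To see why the criteria are so hard to pin down, consider a minimal sketch of what an automated retention policy might look like (the fields, weights and thresholds below are invented for illustration; any real policy would need legal and business input):

```python
from datetime import datetime, timedelta

def retention_score(record, now):
    """Score a record's value from usage patterns; note that raw age is
    deliberately not a factor. Weights are illustrative assumptions."""
    days_idle = (now - record["last_accessed"]).days
    recency = max(0.0, 1.0 - days_idle / 365.0)        # fades after a year idle
    frequency = min(1.0, record["access_count"] / 100.0)
    return 0.5 * recency + 0.5 * frequency

def decide(record, now, keep_threshold=0.3):
    if record["legal_hold"]:          # some data may never be deleted
        return "keep"
    return "keep" if retention_score(record, now) >= keep_threshold else "delete"

now = datetime(2014, 1, 1)
old_but_busy = {"last_accessed": now - timedelta(days=10),
                "access_count": 500, "legal_hold": False}
new_but_idle = {"last_accessed": now - timedelta(days=400),
                "access_count": 1, "legal_hold": False}
print(decide(old_but_busy, now))  # keep: heavily used, whatever its age
print(decide(new_but_idle, now))  # delete: untouched for over a year
```

Even this toy version exposes the hard questions: choosing the signals, the weights and the threshold is itself an analytics problem over usage data, which is exactly the full circle described above.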
Developing new technologies is always an appealing thought; much research has gone into, for example, biological storage (just think how much information is stored in your DNA, which occupies a microscopic volume). But the road from new concept to workable model to practical product can be long, and even then, additional time is needed to reach commodity prices.
So, will big data eventually hit a storage ceiling?
Big Data = Big Problem?
Obviously, even if companies were forced to take steps to limit data collection and even reduce storage space, big data analytics would still be needed. But if the flood of information keeps outpacing current storage technologies, companies will eventually need to consider much more carefully which data is worth storing at all. In other words, the focus must shift from just dealing with large amounts of data (“big data”) to a more “intelligent” approach in which storage limitations are a more critical factor (one might call it “smart data,” a term already being bandied about in the industry).
No, storage won’t really kill big data, but barring unforeseen revolutionary developments in storage technology, it will eventually force a shift in focus. Companies will need to decide beforehand what constitutes valuable data and how to differentiate it from useless data before it consumes precious storage space. This is no simple problem, and it is partly what big data analytics is designed to address, but eventually the solution must involve more than just more HDDs and more server processors to analyze it all.
Beyond the practical matter of cost, privacy may become more of a focus, particularly if giving in to demands for greater privacy means a less burdensome storage budget. Imagine, for instance, how the flood of data could be eased if there weren’t a dozen cameras on every street corner.
Moving data-storage resources to the cloud is an option for individual companies, but it doesn’t address the overall issue of data growth. Someone still needs to maintain storage resources to contain all the data—cloud-service providers may be able to do it a little more cheaply than individual companies, but the overall logistical problems remain.
Photo courtesy of Jeff Kubina