Many of the stated concerns regarding big data analytics center on privacy: should organizations and governments have access to such extensive information about individuals and groups, and what (if any) laws or policies should govern their collection and processing of that data? A critical but less often discussed concern is security. What security challenges specific to big data arise when companies and government agencies collect, store, analyze and distribute vast amounts of information, and what, if anything, can be done to mitigate these challenges?
More Than Just Lots of Data
In a sense, when an organization collects and stores vast amounts of data, it makes itself something of a conspicuous target. What hacker, for instance, wouldn’t love to get a peek at the Federal Data Services Hub backing the (insecure) Healthcare.gov website? More broadly, any company with gobs of unstructured data holds something valuable, but simply having that data may not present any fundamentally new threat.
Robert McGarvey, citing Brainloop’s global VP of marketing David Topping, notes that “the big-data stores of petabytes of data are largely safe from hackers because they simply are too large, and hackers—with the exception, perhaps, of those who are state sponsored—lack the analytical tools to extract meaningful information.” In other words, the same problem that companies face is also a significant deterrent for hackers: getting something of value out of all that data. Thus, when considering individual big data repositories, security measures beyond those applied to any other kind of database seem unwarranted, particularly given the often limited capabilities of private hackers relative to major organizations.
A Closer Look: Context and Granular Security
But the above considerations don’t imply that big data is necessarily more secure simply because it’s unstructured or more difficult to sift in aggregate. Big data repositories, if they are to be at all useful, cannot maintain all the context of every piece of information. As InfoWorld’s Andrew C. Oliver points out, “The more data you aggregate, the challenge of preserving granular rights and permissions grows. How do you keep all of those data ownership and data context rules in place without killing the performance that caused you to choose a big data solution in the first place?”
Granular security partitions data into a spectrum of accessibility classes; for instance, certain employees of the organization may only be able to access non-financial data, whereas employees with a higher “clearance level” may be able to access more information. In addition, certain information may be owned by another organization, or it may be limited in how it can be used. The challenge is in maintaining an organized and secure system despite potentially insufficient context: companies therefore face a security-versus-profitability question that is tempting to answer with a response like, “Well, we have standard network security, so the data is safe.”
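To make the idea of accessibility classes concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a description of any real product: the clearance names, the Record and User types, and the rule that a reader must both hold sufficient clearance and belong to the organization that owns the data. The point is only that each record has to carry its own context (required clearance, owner) for such a check to be possible at all.

```python
from dataclasses import dataclass
from enum import IntEnum


class Clearance(IntEnum):
    # Hypothetical accessibility classes, ordered from least to most sensitive.
    GENERAL = 1
    FINANCIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Record:
    payload: str
    required: Clearance  # minimum clearance needed to read this record
    owner: str           # organization that owns the data (assumed label)


@dataclass(frozen=True)
class User:
    name: str
    clearance: Clearance
    org: str


def can_read(user: User, record: Record) -> bool:
    """Allow access only if the clearance is sufficient and the record
    belongs to the user's own organization."""
    return user.clearance >= record.required and user.org == record.owner


def visible_records(user: User, records: list) -> list:
    """Filter a collection down to the records this user may read."""
    return [r for r in records if can_read(user, r)]


# Made-up sample data for illustration only.
records = [
    Record("site traffic summary", Clearance.GENERAL, "acme"),
    Record("quarterly revenue", Clearance.FINANCIAL, "acme"),
    Record("partner customer list", Clearance.GENERAL, "partner-co"),
]
analyst = User("analyst", Clearance.GENERAL, "acme")
print([r.payload for r in visible_records(analyst, records)])
```

Note that the check runs once per record. That per-record cost, multiplied across an aggregated repository, is the performance tension Oliver describes.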
Big Data Can’t Be Anonymous
The more detailed information you collect, the more easily you can tie it to an individual, raising both privacy and security concerns. CSO notes that “computer scientists have shown they can use data that is not [personally identifiable information] to reconstitute the associated person’s identity.” For instance, “If a brand or government acquired a list of GPS records covering one year, it could use that to learn a lot about a person or persons including their identities.” Finding an identity in such a case is usually simple: look for where the GPS unit is located in the early-morning hours and search the Internet for a name associated with that location. In general, the process may be a little more sophisticated, but conceptually, it’s a simple problem to solve.
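The first half of that GPS heuristic can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a real re-identification tool: the function name likely_home, the 1 a.m. to 5 a.m. window, and the rounding of coordinates to three decimal places (roughly a city block) are all choices made here for the example, and the sample pings are invented.

```python
from collections import Counter
from datetime import datetime


def likely_home(pings, start_hour=1, end_hour=5, precision=3):
    """Return the most frequent rounded (lat, lon) seen in the early morning.

    pings: iterable of (datetime, latitude, longitude) tuples.
    Returns None if no ping falls inside the time window.
    """
    counts = Counter(
        (round(lat, precision), round(lon, precision))
        for ts, lat, lon in pings
        if start_hour <= ts.hour < end_hour
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Invented sample data: three night-time pings near one spot, three daytime
# pings somewhere else.
pings = [
    (datetime(2013, 11, 1, 2, 0), 40.7128, -74.0060),
    (datetime(2013, 11, 2, 3, 30), 40.7131, -74.0058),
    (datetime(2013, 11, 3, 4, 15), 40.7127, -74.0061),
    (datetime(2013, 11, 2, 14, 0), 40.7580, -73.9855),
    (datetime(2013, 11, 3, 14, 0), 40.7580, -73.9855),
    (datetime(2013, 11, 4, 14, 0), 40.7580, -73.9855),
]
print(likely_home(pings))
```

No name or account identifier appears anywhere in the input, yet the output is a location that could then be looked up against public address records.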
Despite attempts to make big data anonymous, the best organizations can do is make the data “pseudonymous”—somewhat anonymous, but certainly still associable with a real identity. This limitation on anonymity is part of the danger of big data: hackers and other malicious parties may not be able to perform fine analysis of data, but given the right kinds of even limited information, they can glean all sorts of exploitable conclusions that enable fraud, theft or worse.
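One common way to produce pseudonymous data is to replace each identifier with a keyed hash. The Python sketch below assumes that approach purely for illustration (the source does not say which technique any organization uses), and the function name pseudonymize, the key, and the sample records are hypothetical. It shows why the result falls short of anonymity: the same person always maps to the same pseudonym, so their records stay linkable to one another.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (a stable pseudonym)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret key and invented records, for illustration only.
key = b"example-secret-key"
events = [
    ("alice@example.com", "pharmacy purchase"),
    ("bob@example.com", "fuel purchase"),
    ("alice@example.com", "clinic visit"),
]

# Group the pseudonymized events: everything one person did still clusters
# under a single token, even though the email address itself is gone.
by_pseudonym = {}
for email, action in events:
    by_pseudonym.setdefault(pseudonymize(email, key), []).append(action)

for token, actions in by_pseudonym.items():
    print(token[:12], actions)
```

Once any single record in a cluster is tied to a real person, for example through location data, every other record under that pseudonym is exposed with it.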
The Real Gold: Results
Although raw data requires protection even if it is part of an unstructured big data repository, the greater threat is posed by the results of big data: the nuggets of valuable information that companies incur tremendous costs to obtain. McGarvey again cites David Topping: “Many organizations throw too much budget at trying to safeguard the big-data stores when their real risks arise in how the output—the insights produced by analysis—travels around the enterprise, often with little monitoring or protection…[M]ost security experts agree that an organization's employees, acting in ignorance, are the most common culprits in a big-data breach.”
Protecting big data, although it involves the raw information, requires an even greater focus on the insights that analytics provide. In particular, these insights must be treated at least as carefully as the raw data, if not more so.
Addressing Big Data Security
The question, then, is how to address these concerns. One approach is the use of honeypots or “sinkholes” that offer hackers an attractive but ultimately bogus target that enables the organization to more safely study methods of attack and implement protections. This strategy is less than ideal since it only works when the system already has some vulnerabilities—known or otherwise. But it is a possibility for identifying and addressing those weaknesses.
Citing a report by Forrester Research entitled “Future of Data Security and Privacy: Controlling Big Data,” IBM notes that “security professionals apply most controls at the very edges of the network. However, if attackers penetrate your perimeter, they will have full and unrestricted access to your data.” The solution, naturally, is to provide a security layer around the data such that simply accessing the network isn’t enough to gain free rein over everything on that network.
Encryption, particularly when dealing with insights from big data analytics, is also a means of protecting information, although it is certainly not a novel concept.
Big data has rightly come under scrutiny for its privacy implications, particularly in light of NSA spying and the complicity of major IT companies. A different but closely related concern is security: in particular, how organizations should protect both the raw, unstructured data and the insights resulting from big data analytics. Unfortunately, making data entirely anonymous is impossible, as the information can (sometimes in combination with other private or even public sources) be matched with individuals and used for all manner of purposes. Although hackers may not be able to perform the same sophisticated analyses on stolen data, often a cursory look is enough to glean valuable information (as in the case of GPS data). And with mega repositories of data, such as the Federal Data Services Hub, making excellent targets, the security aspect of big data requires more scrutiny.