Big data is like an abundant, expanding natural resource emerging from the modern data landscape. IoT (sensor), mobile, social, clickstream, web and open data are important contributors to the proliferation of data we’re witnessing today. Worldwide data is expected to increase tenfold by 2025—reaching a total of 163 ZB—according to a recent IDC-Seagate study.
Data is plentiful, but not necessarily useful in its raw, unrefined form. As with any natural resource, “crude” data must be refined before it can be harnessed for productive purposes, such as equipment maintenance, product innovation, competitive intelligence, marketing, data monetization and active health care. The refinement process can incorporate data exploration, preparation, correlation and contextualization, labeling and annotating, unification and integration, and application of security and governance policies. Metadata is also an important component, as it serves a role in both the input and output stages of the overall data-refinement process.
The extent to which data analysis contributes to unbiased conclusions, accurate predictions and insightful decision-making is constrained by the veracity of that data. If it hasn’t been provisioned for analysis, the data may suffer from fragmentation, minimal labeling and missing information. Such characteristics can be evident in electronic health records (EHRs), which illustrate the challenges of data refinement. One hurdle to gathering and analyzing EHR data is the scarcity of proper labeling and consistent semantics.
EHRs are designed primarily to fulfill patient-care, administrative and financial needs. The multipurpose objectives of EHRs—which don’t take into account data analysis per se—can create data fragmentation, which requires rectification before the data can be provisioned for analyses such as clinical research. Another challenge to building data sets from shared patient health records is the lack of standardization in how EHRs are implemented among health-care organizations, and even within the same health-care system. For example, distinct departments (e.g., radiology, orthopedics and internal medicine) in the same hospital may employ EHRs differently to satisfy their unique data-entry requirements, documentation and ordering needs, and preferences, thereby creating data silos.
Data security and privacy can also be impediments to analyzing regulated data, such as that in EHRs. The best approach to surmounting this obstacle is applying proper security and governance during the refinement process. Companies such as Google are experimenting with federated learning in their effort to advance analytics while ensuring privacy.
Data refinement is crucial to achieving reliable outcomes from data analysis, including meaningful conclusions, accurate predictions and informed decisions. Ideally, the process of refining raw data to produce complete and meaningful information does the following:
- Builds relevant semantics
- Handles data exceptions
- Establishes complete, holistic, contextual views of the data
- Enriches metadata for downstream processes
- Addresses data-protection, privacy and compliance requirements
Three Advantages of Data Virtualization as a Data Refinery
1. Refinery at Scale
Modern analytics relies on data from myriad fragmented data sources. Experience tells us that big data sources aren’t always amenable to replication and relocation when the data is distributed across multiple systems. Data virtualization delivers the scale to work effectively with big data sources by offering an alternative paradigm: move processing to the data. In other words, process the data where it resides and minimize network traffic.
Data virtualization brings the speed and scale necessary for data refinement without replicating or relocating the data sources. It uses logical data architectures, making all the underlying data sources appear as a single system. It provides multiple optimization strategies (e.g., platform-specific optimizations and push-down processing), the intelligence to choose a specific optimization, and a library of prebuilt optimizations, such as MPP in-memory processing.
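To make the “move processing to the data” idea concrete, here is a minimal sketch in plain Python, with SQLite standing in for a remote source; the table, columns and function names are illustrative assumptions rather than Denodo’s actual optimizer. It contrasts pulling every raw row across the network with pushing the aggregation down to the source so that only the result travels.

```python
# Minimal push-down sketch; hypothetical table and column names.
import sqlite3

def naive_pull(conn: sqlite3.Connection) -> float:
    # Anti-pattern: fetch every row over the network, then aggregate locally.
    rows = conn.execute("SELECT amount FROM orders").fetchall()
    return sum(r[0] for r in rows)

def pushed_down(conn: sqlite3.Connection) -> float:
    # Push the aggregation to the source: only a single value crosses the wire.
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for a remote big data source
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                     [(10.0,), (25.5,), (7.25,)])
    print(naive_pull(conn), pushed_down(conn))  # same answer, very different traffic
```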
2. Responsible Data Sharing
Data Privacy by Design
Cultural and legal hurdles often impede data sharing, which has become a major component of big data analysis. Data-privacy regulations are compelling organizations to incorporate, or otherwise demonstrate adequate consideration of, data privacy at every design and implementation stage of a new project. Data virtualization employs a centralized approach that reduces the cost of complying with the growing number of active data-privacy regulations and allows data privacy to be included by design.
The core function of data virtualization is to enable distributed data to remain at the source while exposing it to consumers through a single logical layer. This approach removes the need for continual data replication. Less replication translates into fewer copies of personal and sensitive data in the organization and fewer problems with data security and governance.
Data virtualization also enables organizations to easily create aggregated, consistent views of data, such as risk data, from across the organization. These views can be selectively shared while fully adhering to an organization’s data-access and privacy policies, as Figure 1 illustrates.
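The general pattern can be sketched in a few lines of Python. Everything here, the sources, the roles and the row-level policy, is invented for illustration: a single logical view joins data that stays in two separate systems and filters what each consumer role is permitted to see.

```python
# Toy logical layer: data stays in its sources; a virtual view joins it on
# demand and applies a hypothetical row-level access policy per role.
from typing import Iterator

# Pretend these records live in two separate systems; nothing is copied.
crm_source = [{"customer_id": 1, "region": "EMEA"},
              {"customer_id": 2, "region": "APAC"}]
risk_source = [{"customer_id": 1, "exposure": 1.2e6},
               {"customer_id": 2, "exposure": 4.5e5}]

# Hypothetical policy: the regions each role is allowed to see.
ROW_POLICY = {"emea_analyst": {"EMEA"}, "global_risk": {"EMEA", "APAC"}}

def risk_view(role: str) -> Iterator[dict]:
    """Join the sources at query time and drop rows the role may not see."""
    allowed = ROW_POLICY.get(role, set())
    exposure_by_id = {r["customer_id"]: r["exposure"] for r in risk_source}
    for customer in crm_source:
        if customer["region"] in allowed:
            yield {**customer, "exposure": exposure_by_id[customer["customer_id"]]}

print(list(risk_view("emea_analyst")))  # sees only the EMEA slice
print(list(risk_view("global_risk")))   # sees the full aggregated risk view
```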
Overcoming Information-Sharing Challenges
Data virtualization overcomes the major information-sharing challenges below:
- Disparate data sources. Using data virtualization, data can be quickly and easily integrated across myriad internal and external systems.
- Differing data formats. Data virtualization can connect to data in different formats using different technologies and protocols. These complexities are hidden from users and applications.
- Differing data standards. Using lookup tables or in-memory maps, data virtualization can integrate data, even if it originates from different standards (see the sketch after this list).
- Incomplete data. Data virtualization allows data to be brought together across disparate systems for a holistic view.
- Unprocessed data. Calculations performed on aggregated data (in contrast with partial, siloed data) provide a complete view of risk across the organization.
- Sensitive data. Data virtualization provides security and privacy capabilities so that users only see the data that they’re permitted to see.
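As a rough illustration of the lookup-table technique mentioned above under “Differing data standards,” the sketch below uses an in-memory map (with made-up sources and country codes) to normalize two different conventions to one canonical code as records are served from the virtual layer.

```python
# Hedged sketch of reconciling data standards with an in-memory lookup map.
# Sources, record shapes and codes are all hypothetical.

# One source reports countries as ISO alpha-2 codes, the other as full names.
SOURCE_A = [{"id": 1, "country": "DE"}, {"id": 2, "country": "US"}]
SOURCE_B = [{"id": 3, "country": "Germany"}, {"id": 4, "country": "United States"}]

# Lookup map that folds both conventions into a single canonical code.
TO_CANONICAL = {"DE": "DEU", "Germany": "DEU",
                "US": "USA", "United States": "USA"}

def unify(*sources):
    """Merge records from every source, rewriting country values in flight."""
    for source in sources:
        for record in source:
            yield {**record, "country": TO_CANONICAL[record["country"]]}

print(list(unify(SOURCE_A, SOURCE_B)))
# [{'id': 1, 'country': 'DEU'}, ..., {'id': 4, 'country': 'USA'}]
```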
3. Universal Semantic Model
Business users come in all shapes and sizes. It’s imperative to understand who they are (e.g., data analysts, power users, executives or machines) and the data they need (e.g., pre-aggregated, precalculated, a certain granularity, role specific or domain specific). For machines in particular, properly labeled data sets are of paramount importance for effective machine learning. It’s also important to employ language that business users understand when provisioning the data for analysis. For example, “account” is appropriate for a user in finance, whereas “customer” is the preferred term for a user in customer care. It’s essential to support multiple semantics to avoid forcing users to change terminology.
A universal semantic model powered by data virtualization provides a common and consistent view of data across the organization. Because it isn’t embedded in a single business-intelligence (BI) tool, the semantic model can be shared by multiple BI tools and can access virtually any data source.
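A toy example of role-specific semantics, with hypothetical field names and roles, might look like the sketch below: the same canonical records are exposed to finance in “account” terms and to customer care in “customer” terms, with the mapping defined once rather than inside each BI tool.

```python
# Illustrative semantic layer: one canonical record set, role-specific vocabularies.
# Field names, roles and values are assumptions made up for this example.

CANONICAL = [{"party_id": 42, "name": "Acme Corp", "balance": 1800.0}]

# Each business role sees the same data under its own terminology.
SEMANTIC_ALIASES = {
    "finance":       {"party_id": "account_id", "name": "account_name"},
    "customer_care": {"party_id": "customer_id", "name": "customer_name"},
}

def view_for(role: str) -> list[dict]:
    """Rename canonical fields into the vocabulary the given role expects."""
    aliases = SEMANTIC_ALIASES[role]
    return [{aliases.get(k, k): v for k, v in row.items()} for row in CANONICAL]

print(view_for("finance"))        # speaks in terms of accounts
print(view_for("customer_care"))  # same data, customer terminology
```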
Data virtualization accomplishes the following objectives in making self-service analytics a reality:
- Enables building a flexible semantic model quickly and easily
- Provides a self-service platform, with guardrails
- Supports both “data cowboys” (with limits) and regular business users
- Accelerates self-service initiatives (eliminating analysis silos) while retaining control and governance
Provisioning complete, trusted, high-quality information is critical for decision making, as well as for predictive and prescriptive analytics. Data virtualization is an ideal technology for fulfilling this need in organizations that strive to use data as a strategic asset.
About the Author
Lakshmi Randall is Director of Product Marketing at Denodo, a leader in data-virtualization software. Previously, she was a Research Director at Gartner covering data warehousing, data integration, big data, information management and analytics practices. To learn more, visit www.denodo.com or follow the company (@denodo) or Lakshmi (@LakshmiLJ) on Twitter.