# Data Virtualization: Closing the Knowledge Gap

**Source:** [https://www.youtube.com/watch?v=2XB4UaBIvNI](https://www.youtube.com/watch?v=2XB4UaBIvNI)
**Duration:** 00:04:30

## Summary

- The amount of data has exploded (from 4.4 ZB in 2013 to 44 ZB in 2020), yet the ability to extract actionable information has not kept pace, creating a large “knowledge gap.”
- Enterprise data is scattered across countless heterogeneous sources (relational, NoSQL, cloud, on-premise, and mainframe), making analytics and model building cumbersome and expensive.
- Data virtualization solves this by providing a single, secure, logical view of all sources without physically moving or copying data, dramatically lowering engineering complexity and cost while enabling seamless collaboration among stewards, engineers, and scientists.
- Its architecture consists of three core layers (connection, virtualization, and consumer) plus a governance catalog that captures metadata, lineage, privacy, and protection rules to enforce controlled, secure access to virtual datasets.
- Built as a peer-to-peer computational mesh, IBM’s data-virtualization engine leverages advanced parallel processing and query optimization to deliver faster data exploration and superior performance compared with traditional federation approaches.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=2XB4UaBIvNI&t=0s) **Data Virtualization to Bridge Knowledge Gap** - Aishwarya Srinivasan highlights the explosive growth of unexploited data across diverse storage systems and argues that data virtualization, which allows seamless, secure querying of all sources without replication, can close the knowledge gap while reducing cost and complexity.
- [00:03:19](https://www.youtube.com/watch?v=2XB4UaBIvNI&t=199s) **ML-Driven Query Optimization Benefits** - The passage explains how a machine-learning-powered query optimizer like IBM Db2’s can learn from execution experience to dramatically speed up queries (up to tenfold), while data virtualization and compression reduce infrastructure costs and storage needs.

## Full Transcript
Hello everyone.
My name is Aishwarya Srinivasan and I am an AI and ML innovation leader here at IBM.
What we see in the current knowledge era is that the volume of data has increased tremendously
but the amount of information extracted from the data hasn’t increased as much, which leads to a
knowledge gap of unused data. The total accumulated volume of data has grown from 4.4 ZB
in 2013 to 44 ZB in 2020, a tenfold increase in just seven years, but we
haven’t been leveraging all this data. This data can be anywhere. Different industries, different
organizations, local businesses, individuals store their data in various sources - Oracle,
Db2, SQL Server, PostgreSQL, MongoDB, etc. These could be residing on multiple platforms
(cloud, on-premises, and mainframes) and in various formats: relational, non-relational, and NoSQL.
The challenge comes when we want to use all these data sources for analytics
and to build models. So, what’s the best way to move data? To not move it at all.
This is where Data Virtualization comes into play.
Data virtualization is a technique to connect to all data sources seamlessly and securely in
a single location. With these capabilities, we can query all the data sources as if they resided in
a single space, without having to copy or replicate data, regardless of its format. This would
significantly reduce costs and complexity for data engineering, simplify data management, improve
collaboration among data stewards, engineers, and data scientists, and enable centralized access.
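The idea of querying several sources as if they lived in one place can be sketched with SQLite's `ATTACH` mechanism, which lets one connection join tables held in separate database files without copying rows between them. This is only a conceptual stand-in for a virtualization layer; the table names and data are invented for illustration.

```python
# Two independent "sources" (a sales DB and a CRM DB) queried
# through one connection via ATTACH, without copying either one.
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
sales_db = os.path.join(tmp, "sales.db")
crm_db = os.path.join(tmp, "crm.db")

# Source 1: a sales database with an orders table.
con = sqlite3.connect(sales_db)
con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 99.0), (2, 11, 45.5)])
con.commit()
con.close()

# Source 2: a CRM database with a customers table.
con = sqlite3.connect(crm_db)
con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "Acme"), (11, "Globex")])
con.commit()
con.close()

# The "virtual" layer: one connection that attaches both sources
# and joins across them as if they were a single database.
con = sqlite3.connect(":memory:")
con.execute(f"ATTACH DATABASE '{sales_db}' AS sales")
con.execute(f"ATTACH DATABASE '{crm_db}' AS crm")
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM sales.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 99.0), ('Globex', 45.5)]
```

A real data virtualization engine does the same thing across heterogeneous engines (Oracle, MongoDB, mainframe files), pushing work down to each source rather than attaching local files.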
When working with such huge volumes of data,
governance also comes into play, and data virtualization addresses that as well.
As an overview, Data virtualization consists of three layers.
The bottom most is the connection layer that interacts with the databases that we need.
Next is the virtualization layer that is used to build optimized queries, virtual tables and scale
while preserving performance with parallel processing. The third layer is the consumer
layer, which provides the user interfaces through which one can build views and query them. As
a complementary layer, we have the governance catalog, which pulls metadata from across the
data sources and enriches it with business terms, data lineage, and data privacy and protection rules.
With this we can have a controlled, governed and secure access to virtual datasets.
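The three layers and the governance catalog can be sketched as a minimal object model. All class and method names below are illustrative assumptions, not IBM's actual API; the point is only how a consumer's query flows through the virtualization layer while the catalog enforces protection rules.

```python
# Illustrative sketch of the layered architecture described above.
# Names (ConnectionLayer, GovernanceCatalog, etc.) are invented.

class ConnectionLayer:
    """Bottom layer: connectors to the underlying data sources."""
    def __init__(self):
        self.sources = {}
    def register(self, name, query_fn):
        self.sources[name] = query_fn

class GovernanceCatalog:
    """Complementary layer: metadata plus privacy/protection rules."""
    def __init__(self):
        self.protected = set()
    def protect(self, column):
        self.protected.add(column)
    def mask(self, row):
        # Apply protection rules before data reaches the consumer.
        return {k: ("***" if k in self.protected else v)
                for k, v in row.items()}

class VirtualizationLayer:
    """Middle layer: exposes governed virtual tables over the sources."""
    def __init__(self, connections, catalog):
        self.connections, self.catalog = connections, catalog
    def query(self, source, **filters):
        rows = self.connections.sources[source](**filters)
        return [self.catalog.mask(r) for r in rows]

# Consumer layer: a user querying a virtual table.
conn = ConnectionLayer()
conn.register("hr", lambda **f: [{"name": "Ada", "ssn": "123-45-6789"}])
catalog = GovernanceCatalog()
catalog.protect("ssn")
dv = VirtualizationLayer(conn, catalog)
print(dv.query("hr"))  # [{'name': 'Ada', 'ssn': '***'}]
```

The key design point this captures is that governance sits beside the virtualization layer, so every virtual query passes through the same protection rules regardless of which source answers it.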
Let’s dig a little deeper into this.
With Data Virtualization, we can achieve faster Data Exploration. Data virtualization is designed
and architected as a peer-to-peer computational mesh, which offers a significant advantage over
traditional federation architectures. Using advancements from IBM Research,
the data virtualization engine can rapidly deliver query results from multiple data sources by
leveraging advanced parallel processing and optimization. Collaborative, highly
parallelized compute models provide superior query performance compared to federation,
up to 430% faster against 100TB datasets. Data virtualization has
unmatched scaling of complex queries with joins and aggregates across dozens of live systems.
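One way to picture the parallelism described here is a query being fanned out to several live sources at once, with partial results merged afterward instead of whole tables being copied. The sketch below uses a thread pool with stand-in source functions; it illustrates the general fan-out/merge pattern, not IBM's mesh implementation.

```python
# Sketch: dispatch the same predicate to every source concurrently,
# then merge the partial results. Source functions are stand-ins
# for real connectors that would push the filter down to each system.
from concurrent.futures import ThreadPoolExecutor

def query_source_a(region):
    return [("a", region, 100)]   # pretend remote call to source A

def query_source_b(region):
    return [("b", region, 250)]   # pretend remote call to source B

def federated_query(region, sources):
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        parts = pool.map(lambda s: s(region), sources)
    # Merge partial results; a real engine would also handle
    # cross-source joins and aggregates here.
    return [row for part in parts for row in part]

rows = federated_query("emea", [query_source_a, query_source_b])
print(rows)  # [('a', 'emea', 100), ('b', 'emea', 250)]
```

Because each source evaluates its part of the query concurrently, total latency is bounded by the slowest source rather than the sum of all of them.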
Second, when running queries on these datasets, we need
well-optimized execution plans. Machine learning can make database queries faster
and improve outcomes. For example, a traditional cost-based query optimizer can provide a suggested
execution strategy for a given query, but if the strategy doesn’t work as expected, the optimizer
can’t learn from the experience. However, a machine learning-powered query optimizer can
learn from experience and refine the query path with each execution. That’s how the Db2 Machine
Learning Optimizer works. It mimics neural network patterns to optimize query paths. The result is
faster insights for your team, with some queries completing
up to 8 to 10 times faster, as measured by IBM internal testing.
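The learn-from-experience idea can be illustrated with a toy optimizer that records observed runtimes per candidate plan and increasingly favors the fastest one. This is a deliberate simplification, not how the Db2 Machine Learning Optimizer is actually implemented; it only shows the feedback loop a cost-only optimizer lacks.

```python
# Toy "learning" optimizer: tries each plan once, then picks the
# plan with the best observed average runtime. Plan names invented.
from collections import defaultdict

class LearningOptimizer:
    def __init__(self, plans):
        self.plans = plans
        self.history = defaultdict(list)  # plan -> observed runtimes

    def choose(self):
        # Explore any untried plan first, then exploit the fastest.
        untried = [p for p in self.plans if not self.history[p]]
        if untried:
            return untried[0]
        return min(self.plans,
                   key=lambda p: sum(self.history[p]) / len(self.history[p]))

    def record(self, plan, runtime_ms):
        self.history[plan].append(runtime_ms)

opt = LearningOptimizer(["hash_join", "nested_loop"])
# Simulated execution feedback: nested_loop turns out to be slower.
for runtime_ms, plan in [(120, "hash_join"), (900, "nested_loop"),
                         (110, "hash_join")]:
    opt.record(plan, runtime_ms)
print(opt.choose())  # hash_join
```

A traditional cost-based optimizer stops at its initial estimate; the difference here is simply that each execution updates the statistics the next choice is based on.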
Third is Cost saving. To run a business in a data-centric environment, you must have access
to exactly what you need, when you need it. When business users have responsive and proactive
access to all data with governance, they make better business decisions. Data virtualization
leads to significantly lower costs for infrastructure and reduces time spent
managing data, which has a direct effect on an organization's bottom line.
Another cost saving of the platform comes in the form of the high-level compression techniques
within Virtual Data Pipeline. This solution helps enterprises minimize the required storage of data
copies while still maintaining continuity between the compressed and optimized versions
of the source or production data. The first copy typically sees around a 50 percent decrease in
size, while subsequent copies can see upwards of a 95 percent decrease from the original source data.
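The shape of these savings can be illustrated with a rough sketch: the first copy is compressed in full, and later copies store only what changed relative to it (a delta). The numbers in the transcript come from IBM's figures for Virtual Data Pipeline; the code below is a generic toy, and actual ratios depend entirely on the data.

```python
# Rough sketch of copy-data savings: a compressed full first copy,
# then a tiny change set for a later version. Data is invented and
# highly repetitive, so it compresses unusually well.
import zlib

source = b"order_id,amount\n" + b"12345,99.00\n" * 5000
first_copy = zlib.compress(source)       # compressed full copy

# A later version differs from the source by one appended row.
updated = source + b"12346,45.50\n"
delta = updated[len(source):]            # naive append-only delta
second_copy = delta                      # store only the change set

print(len(source), len(first_copy), len(second_copy))
```

The real product uses far more sophisticated block-level techniques, but the principle is the same: after the first copy, you mostly pay for changes, not for full replicas.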
Finally, Virtual Data Pipeline can be used as a storage backup and recovery capability
and extends to cover test data management and analytics data
pipelines. Many organizations are shifting their application strategy to a continuous approach
to ensure quality and support agile development and DevOps. For test data, this means drastically
reducing the data provisioning time while enabling automation and self-service access to data.
So, with Data Virtualization, organizations can view, access,
manipulate, and analyze data without worrying about its physical location.
To learn more about Data Virtualization, do visit our website.