# Data Virtualization: Closing the Knowledge Gap

**Source:** [https://www.youtube.com/watch?v=2XB4UaBIvNI](https://www.youtube.com/watch?v=2XB4UaBIvNI)
**Duration:** 00:04:30

## Summary

- The amount of data has exploded (from 4.4 ZB in 2013 to 44 ZB in 2020), yet the ability to extract actionable information has not kept pace, creating a large “knowledge gap.”
- Enterprise data is scattered across countless heterogeneous sources (relational, NoSQL, cloud, on-premise, and mainframe), making analytics and model building cumbersome and expensive.
- Data virtualization solves this by providing a single, secure, logical view of all sources without physically moving or copying data, dramatically lowering engineering complexity and cost while enabling seamless collaboration among stewards, engineers, and scientists.
- Its architecture consists of three core layers (connection, virtualization, and consumer) plus a governance catalog that captures metadata, lineage, privacy, and protection rules to enforce controlled, secure access to virtual datasets.
- Built as a peer-to-peer computational mesh, IBM’s data-virtualization engine leverages advanced parallel processing and query optimization to deliver faster data exploration and superior performance compared with traditional federation approaches.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=2XB4UaBIvNI&t=0s) **Data Virtualization to Bridge Knowledge Gap** - Aishwarya Srinivasan highlights the explosive growth of unexploited data across diverse storage systems and argues that data virtualization, which allows seamless, secure querying of all sources without replication, can close the knowledge gap while reducing cost and complexity.
- [00:03:19](https://www.youtube.com/watch?v=2XB4UaBIvNI&t=199s) **ML-Driven Query Optimization Benefits** - The passage explains how a machine-learning-powered query optimizer like IBM Db2’s can learn from execution experience to dramatically speed up queries (up to tenfold), while data virtualization and compression reduce infrastructure costs and storage needs.

## Full Transcript
Hello everyone.
My name is Aishwarya Srinivasan and I am an AI and ML innovation leader here at IBM.
What we see in the current knowledge era is that the volume of data has increased tremendously
but the amount of information extracted from the data hasn’t increased as much, which leads to a
knowledge gap of unused data. The total accumulated volume of data has grown from 4.4 ZB
in 2013 to 44 ZB in 2020, a tenfold increase in just seven years, but we
haven’t been leveraging all this data. This data can be anywhere. Different industries, different
organizations, local businesses, individuals store their data in various sources - Oracle,
Db2, SQL Server, PostgreSQL, MongoDB, etc. These could be residing on multiple platforms
(cloud, on-premises, and mainframes) and in various formats: relational, non-relational, and NoSQL.
The challenge comes when we want to use all these data sources for analytics
and to build models. So, what’s the best way to move data? To not move it at all.
This is where Data Virtualization comes into play.
Data virtualization is a technique to connect to all data sources seamlessly and securely in
a single location. With these capabilities, we can query all the data sources as if they resided in
a single space, without having to copy or replicate data, regardless of its format. This would
significantly reduce costs and complexity for data engineering, simplify data management, improve
collaboration among data stewards, engineers, and data scientists, and enable centralized access.
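The idea of querying several sources as if they lived in one place can be sketched with SQLite's `ATTACH` mechanism, which lets one connection join tables held in separate database files without copying rows between them. This is only a conceptual stand-in for a virtualization layer; the table names and data are invented for illustration.

```python
# Two independent "sources" (a sales DB and a CRM DB) queried
# through one connection via ATTACH, without copying either one.
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()
sales_db = os.path.join(tmp, "sales.db")
crm_db = os.path.join(tmp, "crm.db")

# Source 1: a sales database with an orders table.
con = sqlite3.connect(sales_db)
con.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 99.0), (2, 11, 45.5)])
con.commit()
con.close()

# Source 2: a CRM database with a customers table.
con = sqlite3.connect(crm_db)
con.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "Acme"), (11, "Globex")])
con.commit()
con.close()

# The "virtual" layer: one connection that attaches both sources
# and joins across them as if they were a single database.
con = sqlite3.connect(":memory:")
con.execute(f"ATTACH DATABASE '{sales_db}' AS sales")
con.execute(f"ATTACH DATABASE '{crm_db}' AS crm")
rows = con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM sales.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 99.0), ('Globex', 45.5)]
```

A real data virtualization engine does the same thing across heterogeneous engines (Oracle, MongoDB, mainframe files), pushing work down to each source rather than attaching local files.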
When working with such huge volumes of data,
governance also comes into play, and data virtualization addresses that as well.
As an overview, Data virtualization consists of three layers.
The bottom most is the connection layer that interacts with the databases that we need.
Next is the virtualization layer that is used to build optimized queries, virtual tables and scale
while preserving performance with parallel processing. The third layer is the consumer
layer, which provides the user interfaces through which one can build views and query them. As
a complementary layer, we have the governance catalog, which pulls metadata from across the
data sources and enriches it with business terms, data lineage, and data privacy and protection rules.
With this we can have a controlled, governed and secure access to virtual datasets.
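The three layers and the governance catalog can be sketched as a minimal object model. All class and method names below are illustrative assumptions, not IBM's actual API; the point is only how a consumer's query flows through the virtualization layer while the catalog enforces protection rules.

```python
# Illustrative sketch of the layered architecture described above.
# Names (ConnectionLayer, GovernanceCatalog, etc.) are invented.

class ConnectionLayer:
    """Bottom layer: connectors to the underlying data sources."""
    def __init__(self):
        self.sources = {}
    def register(self, name, query_fn):
        self.sources[name] = query_fn

class GovernanceCatalog:
    """Complementary layer: metadata plus privacy/protection rules."""
    def __init__(self):
        self.protected = set()
    def protect(self, column):
        self.protected.add(column)
    def mask(self, row):
        # Apply protection rules before data reaches the consumer.
        return {k: ("***" if k in self.protected else v)
                for k, v in row.items()}

class VirtualizationLayer:
    """Middle layer: exposes governed virtual tables over the sources."""
    def __init__(self, connections, catalog):
        self.connections, self.catalog = connections, catalog
    def query(self, source, **filters):
        rows = self.connections.sources[source](**filters)
        return [self.catalog.mask(r) for r in rows]

# Consumer layer: a user querying a virtual table.
conn = ConnectionLayer()
conn.register("hr", lambda **f: [{"name": "Ada", "ssn": "123-45-6789"}])
catalog = GovernanceCatalog()
catalog.protect("ssn")
dv = VirtualizationLayer(conn, catalog)
print(dv.query("hr"))  # [{'name': 'Ada', 'ssn': '***'}]
```

The key design point this captures is that governance sits beside the virtualization layer, so every virtual query passes through the same protection rules regardless of which source answers it.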
Let’s dig a little deeper into this.
With Data Virtualization, we can achieve faster Data Exploration. Data virtualization is designed
and architected as a peer-to-peer computational mesh, which offers a significant advantage over
traditional federation architectures. Using advancements from IBM Research,
the data virtualization engine can rapidly deliver query results from multiple data sources by
leveraging advanced parallel processing and optimization. Collaborative, highly
parallelized compute models provide superior query performance compared to federation,
up to 430% faster against 100TB datasets. Data virtualization has
unmatched scaling of complex queries with joins and aggregates across dozens of live systems.
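One way to picture the parallelism described here is a query being fanned out to several live sources at once, with partial results merged afterward instead of whole tables being copied. The sketch below uses a thread pool with stand-in source functions; it illustrates the general fan-out/merge pattern, not IBM's mesh implementation.

```python
# Sketch: dispatch the same predicate to every source concurrently,
# then merge the partial results. Source functions are stand-ins
# for real connectors that would push the filter down to each system.
from concurrent.futures import ThreadPoolExecutor

def query_source_a(region):
    return [("a", region, 100)]   # pretend remote call to source A

def query_source_b(region):
    return [("b", region, 250)]   # pretend remote call to source B

def federated_query(region, sources):
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        parts = pool.map(lambda s: s(region), sources)
    # Merge partial results; a real engine would also handle
    # cross-source joins and aggregates here.
    return [row for part in parts for row in part]

rows = federated_query("emea", [query_source_a, query_source_b])
print(rows)  # [('a', 'emea', 100), ('b', 'emea', 250)]
```

Because each source evaluates its part of the query concurrently, total latency is bounded by the slowest source rather than the sum of all of them.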
Second, when running queries on these datasets, we need
well-optimized execution plans. Machine learning can make database queries faster
and improve outcomes. For example, a traditional cost-based query optimizer can provide a suggested
execution strategy for a given query, but if the strategy doesn’t work as expected, the optimizer
can’t learn from the experience. However, a machine learning-powered query optimizer can
learn from experience and refine the query path with each execution. That’s how the Db2 Machine
Learning Optimizer works. It mimics neural network patterns to optimize query paths. The result is
faster insights for your team, with some queries completing
up to 8 to 10 times faster, as measured by IBM internal testing.
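The learn-from-experience idea can be illustrated with a toy optimizer that records observed runtimes per candidate plan and increasingly favors the fastest one. This is a deliberate simplification, not how the Db2 Machine Learning Optimizer is actually implemented; it only shows the feedback loop a cost-only optimizer lacks.

```python
# Toy "learning" optimizer: tries each plan once, then picks the
# plan with the best observed average runtime. Plan names invented.
from collections import defaultdict

class LearningOptimizer:
    def __init__(self, plans):
        self.plans = plans
        self.history = defaultdict(list)  # plan -> observed runtimes

    def choose(self):
        # Explore any untried plan first, then exploit the fastest.
        untried = [p for p in self.plans if not self.history[p]]
        if untried:
            return untried[0]
        return min(self.plans,
                   key=lambda p: sum(self.history[p]) / len(self.history[p]))

    def record(self, plan, runtime_ms):
        self.history[plan].append(runtime_ms)

opt = LearningOptimizer(["hash_join", "nested_loop"])
# Simulated execution feedback: nested_loop turns out to be slower.
for runtime_ms, plan in [(120, "hash_join"), (900, "nested_loop"),
                         (110, "hash_join")]:
    opt.record(plan, runtime_ms)
print(opt.choose())  # hash_join
```

A traditional cost-based optimizer stops at its initial estimate; the difference here is simply that each execution updates the statistics the next choice is based on.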
Third is Cost saving. To run a business in a data-centric environment, you must have access
to exactly what you need, when you need it. When business users have responsive and proactive
access to all data with governance, they make better business decisions. Data virtualization
leads to significantly lower costs for infrastructure and reduces time spent
managing data, which has a direct effect on an organization's bottom line.
Another cost saving of the platform comes in the form of the high-level compression techniques
within Virtual Data Pipeline. This solution helps enterprises minimize the required storage of data
copies while still maintaining continuity between the compressed and optimized versions
of the source or production data. The first copy typically sees around a 50 percent decrease in
size, while subsequent copies can see upwards of a 95 percent decrease from the original source data.
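The shape of these savings can be illustrated with a rough sketch: the first copy is compressed in full, and later copies store only what changed relative to it (a delta). The numbers in the transcript come from IBM's figures for Virtual Data Pipeline; the code below is a generic toy, and actual ratios depend entirely on the data.

```python
# Rough sketch of copy-data savings: a compressed full first copy,
# then a tiny change set for a later version. Data is invented and
# highly repetitive, so it compresses unusually well.
import zlib

source = b"order_id,amount\n" + b"12345,99.00\n" * 5000
first_copy = zlib.compress(source)       # compressed full copy

# A later version differs from the source by one appended row.
updated = source + b"12346,45.50\n"
delta = updated[len(source):]            # naive append-only delta
second_copy = delta                      # store only the change set

print(len(source), len(first_copy), len(second_copy))
```

The real product uses far more sophisticated block-level techniques, but the principle is the same: after the first copy, you mostly pay for changes, not for full replicas.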
Finally, Virtual Data Pipeline can be used as a storage backup and recovery capability
and extends to cover test data management and analytics data
pipelines. Many organizations are shifting their application strategy to a continuous approach
to ensure quality and support agile development and DevOps. For test data, this means drastically
reducing the data provisioning time while enabling automation and self-service access to data.
So, with Data Virtualization, organizations can view, access,
manipulate, and analyze data without worrying about its physical location.
To learn more about Data Virtualization, do visit our website.