Data Virtualization: Closing the Knowledge Gap

Key Points

  • The amount of data has exploded (from 4.4 ZB in 2013 to 44 ZB in 2020), yet the ability to extract actionable information has not kept pace, creating a large “knowledge gap.”
  • Enterprise data is scattered across countless heterogeneous sources—relational, NoSQL, cloud, on‑premise, and mainframe—making analytics and model building cumbersome and expensive.
  • Data virtualization solves this by providing a single, secure, logical view of all sources without physically moving or copying data, dramatically lowering engineering complexity and cost and enabling seamless collaboration among stewards, engineers, and scientists.
  • Its architecture consists of three core layers (connection, virtualization, and consumer) plus a governance catalog that captures metadata, lineage, privacy, and protection rules to enforce controlled, secure access to virtual datasets.
  • Built as a peer‑to‑peer computational mesh, IBM’s data‑virtualization engine leverages advanced parallel processing and query optimization to deliver faster data exploration and superior performance compared with traditional federation approaches.
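
The "single logical view" idea in the bullets above can be sketched as a toy Python model. All classes and names below are illustrative assumptions, not IBM's API; a real virtualization engine exposes SQL over live connectors rather than Python objects:

```python
# Minimal mental model of data virtualization: the data stays in each
# source; only a virtual-table definition lives in the middle layer.
# Every class and name here is illustrative, not an IBM API.

class ConnectionLayer:
    """Bottom layer: connectors to the underlying sources."""
    def __init__(self):
        self.sources = {}

    def register(self, name, rows):
        # A "connector" is just an in-memory table of dict rows here.
        self.sources[name] = rows

class VirtualizationLayer:
    """Middle layer: defines virtual tables over connected sources
    without copying any rows into a central store."""
    def __init__(self, connections):
        self.connections = connections
        self.virtual_tables = {}

    def create_virtual_table(self, name, source_names):
        # Only the definition is stored; rows stay at the sources.
        self.virtual_tables[name] = source_names

    def query(self, table, predicate):
        # Rows are pulled from each source lazily, at query time.
        for src in self.virtual_tables[table]:
            for row in self.connections.sources[src]:
                if predicate(row):
                    yield row

# Consumer layer: the user queries one logical table, unaware that
# the rows live in two different "systems".
conn = ConnectionLayer()
conn.register("oracle_sales", [{"region": "EU", "amount": 100}])
conn.register("postgres_sales", [{"region": "US", "amount": 250}])

dv = VirtualizationLayer(conn)
dv.create_virtual_table("all_sales", ["oracle_sales", "postgres_sales"])
big_deals = list(dv.query("all_sales", lambda r: r["amount"] > 150))
# big_deals == [{"region": "US", "amount": 250}]
```

The key property is that `create_virtual_table` stores only a definition; no rows move until `query` runs.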

Full Transcript

**Source:** [https://www.youtube.com/watch?v=2XB4UaBIvNI](https://www.youtube.com/watch?v=2XB4UaBIvNI)
**Duration:** 00:04:30

## Sections

- [00:00:00](https://www.youtube.com/watch?v=2XB4UaBIvNI&t=0s) **Data Virtualization to Bridge Knowledge Gap** - Aishwarya Srinivasan highlights the explosive growth of unexploited data across diverse storage systems and argues that data virtualization, which allows seamless, secure querying of all sources without replication, can close the knowledge gap while reducing cost and complexity.
- [00:03:19](https://www.youtube.com/watch?v=2XB4UaBIvNI&t=199s) **ML-Driven Query Optimization Benefits** - Explains how a machine-learning-powered query optimizer like IBM Db2's can learn from execution experience to speed up queries, up to tenfold, while data virtualization and compression reduce infrastructure costs and storage needs.

## Full Transcript
Hello everyone. My name is Aishwarya Srinivasan, and I am an AI and ML innovation leader here at IBM.

What we see in the current knowledge era is that the volume of data has increased tremendously, but the amount of information extracted from that data has not increased as much, which leads to a knowledge gap of unused data. The total accumulated volume of data has grown from 4.4 ZB in 2013 to 44 ZB in 2020, so we accumulated nine times more data in just those seven years, but we have not been leveraging all of it. This data can be anywhere. Different industries, organizations, local businesses, and individuals store their data in various sources (Oracle, Db2, SQL Server, PostgreSQL, MongoDB, etc.) that can reside on multiple platforms (cloud, on-premises, and mainframes) and in various formats (relational, non-relational, and NoSQL).

The challenge comes when we want to use all these data sources for analytics and to build models. So, what is the best way to move data? To not move it at all. This is where data virtualization comes into play.

Data virtualization is a technique to connect to all data sources seamlessly and securely in a single location. With these capabilities, we can query all the data sources as if they resided in one single space, without having to copy or replicate data, regardless of its format. This significantly reduces cost and complexity for data engineering, simplifies data management, improves collaboration from data stewards to engineers to data scientists, and enables centralized access.

When working with such huge volumes of data, governance also comes into play, and it too is addressed by data virtualization.

As an overview, data virtualization consists of three layers. The bottom-most is the connection layer, which interacts with the databases we need.
Next is the virtualization layer, which is used to build optimized queries and virtual tables and to scale while preserving performance through parallel processing. The third layer is the consumer layer, which holds the user interfaces through which one can build views and query them. As a complementary layer, we have the governance catalog, which pulls metadata from across the data sources and builds on business terms, data lineage, data privacy, and protection rules. With this, we can have controlled, governed, and secure access to virtual datasets.

Let's dig a little deeper into this.

With data virtualization, we can achieve faster data exploration. Data virtualization is designed and architected as a peer-to-peer computational mesh, which offers a significant advantage over traditional federation architecture. Using IBM Research advancements, the data virtualization engine can rapidly deliver query results from multiple data sources by leveraging advanced parallel processing and optimization. Collaborative, highly parallel compute models provide superior query performance compared to federation, up to 430% faster against 100 TB datasets. Data virtualization has unmatched scaling of complex queries with joins and aggregates across dozens of live systems.

Secondly, when running queries on these datasets, we need to build optimized versions for better modeling. Machine learning can make database queries faster and improve outcomes. For example, a traditional cost-based query optimizer can suggest an execution strategy for a given query, but if the strategy does not work as expected, the optimizer cannot learn from the experience. A machine-learning-powered query optimizer, however, can learn from experience and refine the query path with each execution. That is how the Db2 Machine Learning Optimizer works: it mimics neural network patterns to optimize query paths.
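
The learn-from-feedback loop just described can be illustrated with a toy optimizer that simply remembers observed plan runtimes and prefers the historically fastest plan. This is a drastic simplification for intuition only, not Db2's actual algorithm, and all names are made up:

```python
# Toy "learning" optimizer: remembers how long each candidate plan
# actually took and prefers the historically fastest one.
# All names are illustrative; this is not Db2's actual algorithm.

class LearningOptimizer:
    def __init__(self, plans):
        # Observed runtimes per candidate plan, empty until executed.
        self.history = {plan: [] for plan in plans}

    def choose_plan(self):
        # Try each plan at least once, then pick the lowest average runtime.
        for plan, runs in self.history.items():
            if not runs:
                return plan
        return min(self.history,
                   key=lambda p: sum(self.history[p]) / len(self.history[p]))

    def record(self, plan, runtime_s):
        # Feedback step: unlike a static cost model, observed runtimes
        # update the optimizer's future choices.
        self.history[plan].append(runtime_s)

opt = LearningOptimizer(["hash_join", "nested_loop"])
# Simulated executions: a static cost model might have guessed
# nested_loop, but measured runtimes teach the optimizer otherwise.
opt.record("hash_join", 1.2)
opt.record("nested_loop", 9.8)
assert opt.choose_plan() == "hash_join"
```

A static cost-based optimizer corresponds to `choose_plan` without `record`: the same estimate is repeated no matter how the plan actually performs.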
The result is faster insights for your team, with some queries completing up to 8 to 10 times faster, as measured by IBM internal testing.

Third is cost saving. To run a business in a data-centric environment, you must have access to exactly what you need, when you need it. When business users have responsive and proactive access to all data, with governance, they make better business decisions. Data virtualization leads to significantly lower infrastructure costs and reduces time spent managing data, which has a direct effect on an organization's bottom line.

Another cost saving of the platform comes in the form of the high-level compression techniques within Virtual Data Pipeline. This solution helps enterprises minimize the required storage for data copies while still maintaining continuity between the compressed and optimized versions of the source or production data. The first copy typically sees around a 50 percent decrease in size, while subsequent copies can see upwards of a 95 percent decrease from the original source data.

Finally, Virtual Data Pipeline can be used as a storage backup and recovery capability, and it extends to cover test data management and the analytics data pipeline. Many organizations are shifting their application strategy to a continuous approach to ensure quality and to support agile development and DevOps. For test data, this means drastically reducing data provisioning time while enabling automation and self-service access to data.

So, with data virtualization, organizations can view, access, manipulate, and analyze data without worrying about its physical location. To learn more about data virtualization, do visit our website.
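
As a back-of-the-envelope check on the compression figures quoted in the transcript (the 50 and 95 percent numbers are the video's claims; the function itself is only an illustration):

```python
# Back-of-the-envelope storage math for the compression claims above:
# the first copy is ~50% smaller than the source, and each subsequent
# copy ~95% smaller. The percentages come from the transcript; the
# function name and shape are illustrative.

def storage_needed_tb(source_tb, n_copies):
    total = 0.0
    for i in range(n_copies):
        if i == 0:
            total += source_tb * 0.50  # first copy: ~50% decrease
        else:
            total += source_tb * 0.05  # later copies: ~95% decrease
    return total

# Five copies of a 10 TB source need roughly 5 + 4 * 0.5 = 7 TB,
# versus 50 TB for five full physical copies.
print(storage_needed_tb(10, 5))
```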