AutoSQL Enables Unified Data Lakehouse Queries

5m • databases • tutorial • intermediate

Key Points

The exploding volume of data across on‑prem, cloud, and vendor environments demands a simpler way to access and manage it.
Traditional architectures with tightly‑coupled storage‑compute and heavy ETL pipelines cause scaling problems and data duplication, prompting a shift to “lakehouse” designs that layer independent compute over inexpensive object stores.
IBM Cloud Pak for Data’s new AutoSQL engine provides a unified, SQL‑based compute layer that can query structured and unstructured data directly on data lakes, integrate Spark, and leverage data virtualization for external sources.
AutoSQL also embeds end‑to‑end governance, allowing custom policies to be applied to any ingested or virtualized data set across on‑prem warehouses, S3 buckets, Azure, Snowflake, Oracle, Teradata, etc.
A live demo shows how quickly users can create connections to sources like AWS S3, virtualize multiple data assets, and combine them into a single virtual table for use in both BI and data‑science workloads.
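
The idea of combining assets from several sources into one virtual table can be sketched with a SQL view. This is only an illustration of the concept, not AutoSQL or the Cloud Pak for Data API: it uses Python's built-in `sqlite3` module as a stand-in, with two attached in-memory databases playing the role of separate sources. The names `warehouse`, `lake`, `customers`, `orders`, and `customer_orders` are hypothetical.

```python
import sqlite3

# One connection stands in for the unified query engine.
conn = sqlite3.connect(":memory:")

# Two attached in-memory databases stand in for two separate data sources
# (hypothetical names; a real deployment would use platform connections).
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute(
    "CREATE TABLE warehouse.customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
conn.execute(
    "CREATE TABLE lake.orders "
    "(order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0)],
)

# The "virtual table": a view that joins both sources without copying rows.
# SQLite only lets TEMP views reference tables in other attached databases.
conn.execute(
    """
    CREATE TEMP VIEW customer_orders AS
    SELECT c.customer_id, c.name, o.order_id, o.amount
    FROM warehouse.customers AS c
    JOIN lake.orders AS o ON o.customer_id = c.customer_id
    """
)

# Consumers query the view as if it were a single table.
totals = conn.execute(
    "SELECT name, SUM(amount) FROM customer_orders GROUP BY name ORDER BY name"
).fetchall()
print(totals)
```

Because the view stores only the query, the underlying rows stay in their original sources, which is the property the video credits with reducing ETL work and data duplication.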

Sections

00:00:00 Simplifying Data Lakes with AutoSQL - IBM’s AutoSQL engine in Cloud Pak for Data unifies compute across structured, unstructured, and external sources, allowing direct queries over cloud object stores and lakehouses while providing built‑in, end‑to‑end governance.
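
To make "one policy applied to every connected source" concrete, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not how Cloud Pak for Data implements governance: `mask_email`, `POLICY`, and `apply_policy` are hypothetical names, and the two row lists stand in for data read from an S3 bucket and an on-prem warehouse.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Policy = Dict[str, Callable[[object], object]]


def mask_email(value: object) -> str:
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = str(value).partition("@")
    return local[:1] + "***@" + domain


# Hypothetical custom policy: column name -> masking rule.
POLICY: Policy = {"email": mask_email}


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Apply the same column rules to rows from any source."""
    return [
        {col: policy[col](val) if col in policy else val for col, val in row.items()}
        for row in rows
    ]


# Stand-ins for rows coming from two different sources.
s3_rows: List[Row] = [{"id": 1, "email": "ada@example.com"}]
warehouse_rows: List[Row] = [{"id": 2, "email": "grace@example.org", "region": "EU"}]

governed = apply_policy(s3_rows + warehouse_rows, POLICY)
print(governed)
```

The point of the sketch is that the policy is defined once, keyed by column rather than by source, so newly connected or virtualized data is covered without writing a new rule per system.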

**Source:** [https://www.youtube.com/watch?v=N2CuentyYGw](https://www.youtube.com/watch?v=N2CuentyYGw)

**Duration:** 00:05:42

Full Transcript

0:00 It's no surprise that the volume of data across multiple stores, locations, clouds, and even vendors is accelerating. But how do you manage this complexity and make it simple to leverage your data?

0:13 Hi, my name is Love Agarwal, and I'm a solution engineer for IBM Data and AI. Today I'm here to talk about one of our newest capabilities: AutoSQL.

0:21 So I want to first start with how we got here. Traditionally, we have seen many architectures that have big data warehouses with storage and compute tightly coupled, as well as data lakes in multiple clouds, with a lot of ETL pipelines to move and replicate data around for different BI and data science use cases. This has led to increasingly complex data pipelines, difficulty in scaling workloads, and unnecessary data duplication.

0:50 What we have seen become more common is a new, modern architecture which utilizes separate compute engines layered over inexpensive cloud object stores and data lakes, resulting in the concept of a data lakehouse.

1:02 So now let's get back to AutoSQL. AutoSQL is our new unified compute engine in IBM Cloud Pak for Data that can query both structured and unstructured data directly over your data lakes and cloud object stores, leverage data virtualization to access other external data sources, and support Spark.

1:22 In addition, AutoSQL brings integrated governance as part of the Cloud Pak for Data platform, which allows any ingested or connected data source to be fully governed end-to-end with custom policies. Now we have a single interface and engine to support both data science and BI across any data source environment, whether that be your on-prem data warehouse, S3 buckets in AWS, a data lake in Azure, Snowflake, Oracle, or Teradata. It doesn't matter.

1:54 All right, now let me show you with a quick demo how easy it is to access data from various sources with AutoSQL and our end-to-end hybrid data platform, IBM Cloud Pak for Data.

2:06 So I'll start by logging on to Cloud Pak for Data, and once I do that I'm presented with my home page. Now I want to connect to some different data sources, so I'll go over to Platform connections under Data and click New connection. We can see there is an extensive list of both IBM sources and third-party sources. I'm going to connect to an S3 bucket that I have up in AWS. I'll put in all my credentials and click on Create connection.

2:41 So this connection will allow us to directly query our source. However, I also want to virtualize some data sources, so I'll click over into the Data virtualization tab.

2:54 Now if I look at my sources, we can see the many different instances that I have virtualized in my constellation view. And now I'm ready to actually do something that's very powerful, which is using data virtualization to combine tables from multiple sources into one virtual table for us to use. So I'll go ahead and search for the tables that I have virtualized and join them into a new virtual table in a way that allows me to pick and choose exactly how I want it to be structured, based on the available attributes.

3:27 Okay, great. So now this new table is available for us to start using to build insights. I'll hop over to the Projects tab and open one of the data science projects that I've been working on. We can see there are several data science assets in here, but I'll go in and open one of the notebooks that I've already been working on.

3:56 Okay, in here we can see that I have the ability to query that same S3 bucket that I connected to earlier, as well as that new virtual table that I created. I can now use this data to build out whatever model I want and deploy it directly in the platform to make it available for consumption by my business analysts or other data consumers in my organization.

4:19 All right, so to recap: we connected to various data sources in the Cloud Pak for Data platform, we virtualized certain sources and created new virtual tables to interact with our data in new ways, and then we were able to query those sources right from our notebook to build and deploy models. And by the way, all this was done in a governed manner, where any governance policies that were defined in Cloud Pak for Data apply to all of the data sources that we connected to.

4:46 With AutoSQL, we're reducing costs through reduced migration and significantly less data duplication. We're reducing complex ETL work, as we saw when simply creating virtual tables. We're automating security and governance for trust and data validity and quality. We're leveraging one performant and scalable query engine for both big data and warehousing that can execute distributed and virtualized queries 53 faster than the industry standard. And we're avoiding lock-in with our vendor-agnostic design that allows the same engine to work with any data source on any cloud.

5:22 If you'd like to see more videos like this in the future, please click like and subscribe. And if you want to learn more about IBM Cloud Pak for Data, make sure to check out the link in the description.