AutoSQL Enables Unified Data Lakehouse Queries
Key Points
- The exploding volume of data across on‑prem, cloud, and vendor environments demands a simpler way to access and manage it.
- Traditional architectures with tightly‑coupled storage‑compute and heavy ETL pipelines cause scaling problems and data duplication, prompting a shift to “lakehouse” designs that layer independent compute over inexpensive object stores.
- IBM Cloud Pak for Data’s new AutoSQL engine provides a unified, SQL‑based compute layer that can query structured and unstructured data directly on data lakes, integrate Spark, and leverage data virtualization for external sources.
- AutoSQL also embeds end‑to‑end governance, allowing custom policies to be applied to any ingested or virtualized data set across on‑prem warehouses, S3 buckets, Azure, Snowflake, Oracle, Teradata, etc.
- A live demo shows how quickly users can create connections to sources like AWS S3, virtualize multiple data assets, and combine them into a single virtual table for use in both BI and data‑science workloads.
**Source:** [https://www.youtube.com/watch?v=N2CuentyYGw](https://www.youtube.com/watch?v=N2CuentyYGw)
**Duration:** 00:05:42

## Sections
- [00:00:00](https://www.youtube.com/watch?v=N2CuentyYGw&t=0s) **Simplifying Data Lakes with AutoSQL** - IBM’s AutoSQL engine in Cloud Pak for Data unifies compute across structured, unstructured, and external sources, allowing direct queries over cloud object stores and lakehouses while providing built‑in, end‑to‑end governance.

## Full Transcript
It's no surprise that the volume of data across multiple stores, locations, clouds, and even vendors is accelerating. But how do you manage this complexity and make it simple to leverage your data?

Hi, my name is Love Agarwal, and I'm a solution engineer for IBM Data and AI. Today I'm here to talk about one of our newest capabilities: AutoSQL.

I want to start with how we got here. Traditionally, we have seen many architectures that have big data warehouses with storage and compute tightly coupled, as well as data lakes in multiple clouds, with a lot of ETL pipelines to move and replicate data around for different BI and data science use cases. This has led to increasingly complex data pipelines, difficulty in scaling workloads, and unnecessary data duplication.

What we have seen become more common is a modern architecture that uses separate compute engines layered over inexpensive cloud object stores and data lakes, resulting in the concept of a data lakehouse. So now let's get back to AutoSQL.

AutoSQL is our new unified compute engine in IBM Cloud Pak for Data. It can query both structured and unstructured data directly over your data lakes and cloud object stores, leverage data virtualization to access other external data sources, and support Spark. In addition, AutoSQL brings integrated governance as part of the Cloud Pak for Data platform, which allows any ingested or connected data source to be fully governed end to end with custom policies. Now we have a single interface and engine to support both data science and BI across any data source environment, whether that be your on-prem data warehouse, S3 buckets in AWS, a data lake in Azure, Snowflake, Oracle, or Teradata. It doesn't matter.

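The idea of a compute engine that queries files where they sit, with storage kept separate, can be illustrated with a minimal sketch. This is not AutoSQL or any Cloud Pak for Data API. A local directory stands in for an object-store bucket, an in-memory SQLite database stands in for the query engine, and the function names and the `sales.csv` file are hypothetical. The stand-in engine reads the files into memory for the duration of one query and keeps no durable copy.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def query_object_store(store_dir: Path, sql: str) -> list[tuple]:
    """Run one SQL query over every CSV file found in store_dir."""
    # Compute: a throwaway engine created per query; it holds no durable state.
    engine = sqlite3.connect(":memory:")
    try:
        for csv_path in sorted(store_dir.glob("*.csv")):
            with csv_path.open(newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader)
                table = csv_path.stem  # file name without extension becomes the table name
                columns = ", ".join(f'"{name}"' for name in header)
                placeholders = ", ".join("?" for _ in header)
                engine.execute(f'CREATE TABLE "{table}" ({columns})')
                engine.executemany(
                    f'INSERT INTO "{table}" VALUES ({placeholders})', reader
                )
        return engine.execute(sql).fetchall()
    finally:
        engine.close()


def demo() -> list[tuple]:
    # Storage: a temporary directory standing in for a bucket (an assumption).
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        (store / "sales.csv").write_text("region,amount\neast,10\nwest,5\neast,7\n")
        return query_object_store(
            store,
            "SELECT region, SUM(CAST(amount AS INTEGER)) "
            "FROM sales GROUP BY region ORDER BY region",
        )
```

Calling `demo()` sums the amounts per region from the file. The only persistent copy of the data remains the file in the stand-in store.
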
All right, now let me show you with a quick demo how easy it is to access data from various sources with AutoSQL and our end-to-end hybrid data platform, IBM Cloud Pak for Data.

I'll start by logging on to Cloud Pak for Data, and once I do that, I'm presented with my home page. Now I want to connect to some different data sources, so I'll go over to Platform connections under Data and click New connection. We can see there is an extensive list of both IBM sources and third-party sources. I'm going to connect to an S3 bucket that I have up in AWS. I'll put in all my credentials and click Create connection. This connection will allow us to directly query our source.

However, I also want to virtualize some data sources, so I'll click over into the Data virtualization tab. If I look at my sources, we can see the many different instances that I have virtualized in my constellation view. Now I'm ready to do something very powerful, which is using data virtualization to combine tables from multiple sources into one virtual table for us to use. I'll go ahead and search for the tables that I have virtualized and join them into a new virtual table in a way that allows me to pick and choose exactly how I want it to be structured, based on the available attributes. Okay, great.

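Combining tables from separate sources into one virtual table, as in the step above, can be sketched roughly as follows. This is an illustration under stated assumptions, not the Cloud Pak for Data interface. Two attached in-memory SQLite databases (`warehouse` and `lake`) stand in for two independent sources, and the table names, column names, and `customer_orders` view are hypothetical. A SQLite `TEMP` view is used because it can reference tables in more than one attached database.

```python
import sqlite3


def build_virtual_table() -> sqlite3.Connection:
    """Create two stand-in sources and a view that joins them without copying rows."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH of ':memory:' creates a separate, independent in-memory database.
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    conn.execute(
        "CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT, region TEXT)"
    )
    conn.execute(
        "CREATE TABLE lake.orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO warehouse.customers VALUES (?, ?, ?)",
        [(1, "Ada", "east"), (2, "Grace", "west")],
    )
    conn.executemany(
        "INSERT INTO lake.orders VALUES (?, ?, ?)",
        [(100, 1, 25.0), (101, 1, 15.0), (102, 2, 40.0)],
    )

    # The view stores only the join definition; rows are read from both
    # sources each time the view is queried.
    conn.execute(
        """
        CREATE TEMP VIEW customer_orders AS
        SELECT c.name, c.region, o.order_id, o.amount
        FROM warehouse.customers AS c
        JOIN lake.orders AS o ON o.customer_id = c.customer_id
        """
    )
    return conn


def totals_by_customer(conn: sqlite3.Connection) -> list[tuple]:
    """Query the virtual table like any ordinary table."""
    return conn.execute(
        "SELECT name, SUM(amount) FROM customer_orders GROUP BY name ORDER BY name"
    ).fetchall()
```

Because the view holds only a query definition, a row added to either source afterward appears in `customer_orders` on the next query, with no pipeline step to copy it.
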
Now this new table is available for us to start using to build insights, so I'll hop over to the Projects tab and open one of the data science projects that I've been working on. We can see there are several data science assets in here, but I'll go in and open one of the notebooks that I've already been working on. In here we can see that I have the ability to query that same S3 bucket that I connected to earlier, as well as the new virtual table that I created. I can now use this data to build out whatever model I want and deploy it directly in the platform to make it available for consumption by my business analysts or other data consumers in my organization.

All right, so to recap: we connected to various data sources in the Cloud Pak for Data platform, we virtualized certain sources and created new virtual tables to interact with our data in new ways, and then we were able to query those sources right from our notebook to build and deploy models. And by the way, all of this was done in a governed manner, where any governance policies defined in Cloud Pak for Data apply to all of the data sources that we connected to.

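One way to picture a policy that is defined once and applied to every connected source is a small masking sketch. The video does not show how Cloud Pak for Data defines or enforces policies, so everything here is an assumption for illustration: the `MaskingPolicy` class, the `redact_email` rule, the role names, and the `apply_policies` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MaskingPolicy:
    """Hypothetical policy: mask one column unless the reader holds an allowed role."""
    column: str
    allowed_roles: frozenset[str]
    mask: Callable[[str], str]


def redact_email(value: str) -> str:
    """Keep the first character of the local part and the domain; hide the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def apply_policies(
    rows: Iterable[dict], policies: Iterable[MaskingPolicy], role: str
) -> list[dict]:
    """Return copies of rows with every applicable policy enforced.

    The function never asks where the rows came from, so the same policies
    cover rows from a bucket, a warehouse, or a virtual table.
    """
    policy_list = list(policies)
    governed = []
    for row in rows:
        out = dict(row)  # copy so the source rows are left untouched
        for policy in policy_list:
            if policy.column in out and role not in policy.allowed_roles:
                out[policy.column] = policy.mask(str(out[policy.column]))
        governed.append(out)
    return governed
```

For example, with a single policy on an `email` column that allows only a `steward` role, a reader with an `analyst` role sees `a***@example.com` in place of `ada@example.com`, whichever source supplied the row.
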
With AutoSQL, we're reducing costs through reduced migration and significantly less data duplication. We're reducing complex ETL work, as we saw when simply creating virtual tables. We're automating security and governance for trust, data validity, and quality.
We're leveraging one performant and scalable query engine for both big data and warehousing that can execute distributed and virtualized queries 53% faster than the industry standard.
And we're avoiding lock-in with our vendor-agnostic design, which allows the same engine to work with any data source on any cloud.

If you'd like to see more videos like this in the future, please click like and subscribe. And if you want to learn more about IBM Cloud Pak for Data, make sure to check out the link in the description.