Data Warehouse, Lake, and Lakehouse Explained
Key Points
- Data warehouses are relational systems that ingest structured data via ETL, centralize it, and serve curated datasets for reporting and analytics.
- Data lakes collect raw data of any format (structured, semi‑structured, or unstructured) using ELT, letting users transform it later for AI/ML and exploratory workloads.
- Data lakehouses merge the scalability of lakes with the governance of warehouses by adding a metadata layer and schema, enabling both SQL‑based analytics and programmatic (e.g., Python) processing.
- The primary differences among the three lie in purpose (reporting vs AI/ML), data type support, processing flow (ETL vs ELT), governance capabilities, and performance characteristics.
Sections
- Data Warehouse vs Lake vs Lakehouse - An overview of how enterprises store and process structured, semi‑structured, and unstructured data using data warehouses, data lakes, and emerging lakehouse architectures.
- Comparing Data Warehouse, Lake, Lakehouse - The speaker outlines how a data lakehouse blends raw, low‑cost storage with hybrid analytics, contrasting purpose, data structure support, and storage costs among traditional data warehouses, data lakes, and lakehouses.
- Schema Flexibility & Scalability Comparison - The speaker explains that data lakes use schema‑on‑read, lakehouses support both schema‑on‑read and schema‑on‑write, and that lakes and lakehouses can scale cheaply whereas scaling a data warehouse is costly and difficult.
Full Transcript
# Data Warehouse, Lake, and Lakehouse Explained **Source:** [https://www.youtube.com/watch?v=PQFWQmL3fLY](https://www.youtube.com/watch?v=PQFWQmL3fLY) **Duration:** 00:07:41 ## Summary - Data warehouses are relational systems that ingest structured data via ETL, centralize it, and serve curated datasets for reporting and analytics. - Data lakes collect raw data of any format (structured, semi‑structured, or unstructured) using ELT, letting users transform it later for AI/ML and exploratory workloads. - Data lakehouses merge the scalability of lakes with the governance of warehouses by adding a metadata layer and schema, enabling both SQL‑based analytics and programmatic (e.g., Python) processing. - The primary differences among the three lie in purpose (reporting vs AI/ML), data type support, processing flow (ETL vs ELT), governance capabilities, and performance characteristics. ## Sections - [00:00:00](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=0s) **Data Warehouse vs Lake vs Lakehouse** - An overview of how enterprises store and process structured, semi‑structured, and unstructured data using data warehouses, data lakes, and emerging lakehouse architectures. - [00:03:11](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=191s) **Comparing Data Warehouse, Lake, Lakehouse** - The speaker outlines how a data lakehouse blends raw, low‑cost storage with hybrid analytics, contrasting purpose, data structure support, and storage costs among traditional data warehouses, data lakes, and lakehouses. - [00:06:26](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=386s) **Schema Flexibility & Scalability Comparison** - The speaker explains that data lakes use schema‑on‑read, lakehouses support both schema‑on‑read and schema‑on‑write, and that lakes and lakehouses can scale cheaply whereas scaling a data warehouse is costly and difficult. ## Full Transcript
All organizations generate massive amount of data
from various data sources like databases, application, social media.
For developers and engineers, understanding
how to efficiently store, manage and process this data is critical.
Concepts like data warehouse, data lake
and data lakehouse are essential
tools for modern data management workflows.
Today, I'm here to provide you with the clear insights
into these concepts and the key differences later. So,
let's break down our system into main three parts: data
warehouse, data lake and data lakehouse.
First on my list is data warehouse,
also called as EDW, which is enterprise
data warehouse is a relational database management system.
It aggregates data from various different data sources
into single central repository.
A data source can be anything like a CSV file,
sales data, CRM data, which means any structured data.
The data is processed with the technology called ETL,
which is extract, transform and load.
The data is extracted and transformed
based on the business need and then load into
the data warehouse.
And then transform and curated data
is made available for end users for different analytics
and reporting purpose.
That's the data warehouse.
Second on my list is data lake. In data lake, it
aggregates data from various different data sources
into single, central repository
in its raw and original data format.
The data source can be anything like CSV file,
any structured data,
unstructured data or semi-structured data.
Data is directly dumped into the data lake
and then the data is processed
and transformed with the technology called ELT,
which means extract, load and transform.
The data is extracted and load into the data lake,
and then transform based on the business need. And
then transform and curated data
is made available for end users for different
AI ML use cases.
That's the data lake.
Third on my list is data lakehouse. Data lakehouse ...
This is the innovative solution which brilliantly merge best of both worlds,
which is data warehouse and data lake.
It combines the flexibility and the scalability of data
lake with the data management capability of data warehouse.
Raw data from various different data
sources is added into data lakehouse.
The data type can be anything like a structured data,
unstructured data or semi-structured data.
And then metadata layer is added in data lakehouse,
which provides
structure and schema to the data,
which also enable the data management and the governance feature.
The data is processed with various tools
and technology including SQL,
Python programming language and any other programming language.
And then transformed and curated
data is made available for end users
for different analytical and ML purpose.
That's the data lakehouse.
Let's deep dive into the key differences.
There are six main key differences in terms of each concepts.
First key difference in terms of purpose:
What is the purpose of data warehouse?
Data warehouse mostly used for optimized SQL analytics.
Whereas in data lake, data lake
mostly used for storing raw data.
Whereas in
data lakehouse, data lakehouse mostly used for hybrid analytics.
Second key difference is in terms of structure of the data.
What type of data is stored in each of these three? In data warehouse, it
used only structured data.
Whereas in data lake,
we can use any type of data
like structured,
unstructured or semi-structured.
In data lakehouse as well, all
type of data is stored, so all type of data is supported.
Second key difference: in terms of storage cost.
Storing data in data warehouse is high.
The cost is high. Reason?
We have to be organized all the data
before storing the data into warehouse.
So the storage cost for data warehouse is high.
Whereas in data lake,
we can dump any type of data in its raw format,
so we don't need to worry about organizing the data in data lake.
the storage cost for the data lake is low.
Similar to data lakehouse as well. We
can like dump all type of data in data lakehouse, so ...
so the storage cost for the data lake house is low.
It use the cheaper object storage.
The fourth key difference: in terms of performance. The performance
of data warehouse is high
because everything is organized before storing the data, so
accessing the data is in real time is low.
So we can access the data very easily in data
warehouse. Whereas in data lake
we store any type of data directly into the data lake.
Everything is unorganized, so
accessing those data takes time. So,
the performance is low.
In data lakehouse,
the performance is high and it's ... which is near to the data
warehouse performance.
The fifth key difference: in terms of flexibility.
The data warehouse support fixed schema,
which is also called schema on write,
which means d ... data ... for the data,
the structure of the data and the schema has to be defined
before we store the data into the data warehouse.
Whereas in data lake
it supports flexible schema,
which is also called
schema on read. Means, schema
doesn't have to be defined before storing the data ...
before storing the data into the data lake,
and we can ... uh... provide the schema when we access or query
the data. Whereas in data lakehouse,
it supports both type of schema,
so it is flexible.
The last and the sixth key difference:
in terms of scalability. For data warehouse,
the scaling data warehouse is expensive
and difficult.
Whereas in data lake,
we can scale the data lake very easily
and at a low cost.
In data lakehouse as well,
we can scale the data lakehouse easily
and at the low cost.
So, these systems ... each have unique strengths
and tailored for different workflows.
For developers and engineers,
understanding these concepts are really essentials.
Thank you so much for joining me today.