Learning Library

← Back to Library

Data Warehouse, Lake, and Lakehouse Explained

7m • Unknown Channel • databases • tutorial • intermediate • Watch on YouTube ↗

Key Points

Data warehouses are relational systems that ingest structured data via ETL, centralize it, and serve curated datasets for reporting and analytics.
Data lakes collect raw data of any format (structured, semi‑structured, or unstructured) using ELT, letting users transform it later for AI/ML and exploratory workloads.
Data lakehouses merge the scalability of lakes with the governance of warehouses by adding a metadata layer and schema, enabling both SQL‑based analytics and programmatic (e.g., Python) processing.
The primary differences among the three lie in purpose (reporting vs AI/ML), data type support, processing flow (ETL vs ELT), governance capabilities, and performance characteristics.

Sections

Full Transcript

# Data Warehouse, Lake, and Lakehouse Explained **Source:** [https://www.youtube.com/watch?v=PQFWQmL3fLY](https://www.youtube.com/watch?v=PQFWQmL3fLY) **Duration:** 00:07:41 ## Summary - Data warehouses are relational systems that ingest structured data via ETL, centralize it, and serve curated datasets for reporting and analytics. - Data lakes collect raw data of any format (structured, semi‑structured, or unstructured) using ELT, letting users transform it later for AI/ML and exploratory workloads. - Data lakehouses merge the scalability of lakes with the governance of warehouses by adding a metadata layer and schema, enabling both SQL‑based analytics and programmatic (e.g., Python) processing. - The primary differences among the three lie in purpose (reporting vs AI/ML), data type support, processing flow (ETL vs ELT), governance capabilities, and performance characteristics. ## Sections - [00:00:00](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=0s) **Data Warehouse vs Lake vs Lakehouse** - An overview of how enterprises store and process structured, semi‑structured, and unstructured data using data warehouses, data lakes, and emerging lakehouse architectures. - [00:03:11](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=191s) **Comparing Data Warehouse, Lake, Lakehouse** - The speaker outlines how a data lakehouse blends raw, low‑cost storage with hybrid analytics, contrasting purpose, data structure support, and storage costs among traditional data warehouses, data lakes, and lakehouses. - [00:06:26](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=386s) **Schema Flexibility & Scalability Comparison** - The speaker explains that data lakes use schema‑on‑read, lakehouses support both schema‑on‑read and schema‑on‑write, and that lakes and lakehouses can scale cheaply whereas scaling a data warehouse is costly and difficult. ## Full Transcript

0:00All organizations generate massive amount of data 0:03from various data sources like databases, application, social media. 0:07For developers and engineers, understanding 0:10how to efficiently store, manage and process this data is critical. 0:14Concepts like data warehouse, data lake 0:16and data lakehouse are essential 0:18tools for modern data management workflows. 0:21Today, I'm here to provide you with the clear insights 0:24into these concepts and the key differences later. So, 0:27let's break down our system into main three parts: data 0:30warehouse, data lake and data lakehouse. 0:33First on my list is data warehouse, 0:36also called as EDW, which is enterprise 0:39data warehouse is a relational database management system. 0:42It aggregates data from various different data sources 0:47into single central repository. 0:50A data source can be anything like a CSV file, 0:53sales data, CRM data, which means any structured data. 0:58The data is processed with the technology called ETL, 1:02which is extract, transform and load. 1:04The data is extracted and transformed 1:07based on the business need and then load into 1:09the data warehouse. 1:15And then transform and curated data 1:19is made available for end users for different analytics 1:23and reporting purpose. 1:25That's the data warehouse. 1:28Second on my list is data lake. In data lake, it 1:32aggregates data from various different data sources 1:36into single, central repository 1:38in its raw and original data format. 1:41The data source can be anything like CSV file, 1:45any structured data, 1:46unstructured data or semi-structured data. 1:49Data is directly dumped into the data lake 1:53and then the data is processed 1:58and transformed with the technology called ELT, 2:01which means extract, load and transform. 2:04The data is extracted and load into the data lake, 2:08and then transform based on the business need. And 2:10then transform and curated data 2:13is made available for end users for different 2:16AI ML use cases. 2:19That's the data lake. 2:21Third on my list is data lakehouse. Data lakehouse ... 2:25This is the innovative solution which brilliantly merge best of both worlds, 2:30which is data warehouse and data lake. 2:32It combines the flexibility and the scalability of data 2:36lake with the data management capability of data warehouse. 2:39Raw data from various different data 2:42sources is added into data lakehouse. 2:47The data type can be anything like a structured data, 2:52unstructured data or semi-structured data. 2:57And then metadata layer is added in data lakehouse, 3:02which provides 3:04structure and schema to the data, 3:07which also enable the data management and the governance feature. 3:11The data is processed with various tools 3:14and technology including SQL, 3:17Python programming language and any other programming language. 3:21And then transformed and curated 3:24data is made available for end users 3:27for different analytical and ML purpose. 3:33That's the data lakehouse. 3:35Let's deep dive into the key differences. 3:38There are six main key differences in terms of each concepts. 3:42First key difference in terms of purpose: 3:44What is the purpose of data warehouse? 3:46Data warehouse mostly used for optimized SQL analytics. 3:53Whereas in data lake, data lake 3:55mostly used for storing raw data. 4:01Whereas in 4:04data lakehouse, data lakehouse mostly used for hybrid analytics. 4:11Second key difference is in terms of structure of the data. 4:16What type of data is stored in each of these three? In data warehouse, it 4:20used only structured data. 4:25Whereas in data lake, 4:27we can use any type of data 4:29like structured, 4:31unstructured or semi-structured. 4:35In data lakehouse as well, all 4:37type of data is stored, so all type of data is supported. 4:43Second key difference: in terms of storage cost. 4:46Storing data in data warehouse is high. 4:49The cost is high. Reason? 4:50We have to be organized all the data 4:53before storing the data into warehouse. 4:55So the storage cost for data warehouse is high. 4:59Whereas in data lake, 5:01we can dump any type of data in its raw format, 5:04so we don't need to worry about organizing the data in data lake. 5:08the storage cost for the data lake is low. 5:11Similar to data lakehouse as well. We 5:14can like dump all type of data in data lakehouse, so ... 5:18so the storage cost for the data lake house is low. 5:22It use the cheaper object storage. 5:25The fourth key difference: in terms of performance. The performance 5:28of data warehouse is high 5:31because everything is organized before storing the data, so 5:36accessing the data is in real time is low. 5:40So we can access the data very easily in data 5:43warehouse. Whereas in data lake 5:45we store any type of data directly into the data lake. 5:49Everything is unorganized, so 5:50accessing those data takes time. So, 5:53the performance is low. 5:56In data lakehouse, 5:58the performance is high and it's ... which is near to the data 6:01warehouse performance. 6:06The fifth key difference: in terms of flexibility. 6:10The data warehouse support fixed schema, 6:13which is also called schema on write, 6:17which means d ... data ... for the data, 6:20the structure of the data and the schema has to be defined 6:23before we store the data into the data warehouse. 6:26Whereas in data lake 6:28it supports flexible schema, 6:32which is also called 6:34schema on read. Means, schema 6:36doesn't have to be defined before storing the data ... 6:40before storing the data into the data lake, 6:42and we can ... uh... provide the schema when we access or query 6:46the data. Whereas in data lakehouse, 6:48it supports both type of schema, 6:50so it is flexible. 6:55The last and the sixth key difference: 6:57in terms of scalability. For data warehouse, 7:01the scaling data warehouse is expensive 7:05and difficult. 7:10Whereas in data lake, 7:12we can scale the data lake very easily 7:16and at a low cost. 7:18In data lakehouse as well, 7:20we can scale the data lakehouse easily 7:24and at the low cost. 7:27So, these systems ... each have unique strengths 7:30and tailored for different workflows. 7:33For developers and engineers, 7:35understanding these concepts are really essentials. 7:38Thank you so much for joining me today.