Learning Library

← Back to Library

Data Warehouse, Lake, and Lakehouse Explained

Key Points

  • Data warehouses are relational systems that ingest structured data via ETL, centralize it, and serve curated datasets for reporting and analytics.
  • Data lakes collect raw data of any format (structured, semi‑structured, or unstructured) using ELT, letting users transform it later for AI/ML and exploratory workloads.
  • Data lakehouses merge the scalability of lakes with the governance of warehouses by adding a metadata layer and schema, enabling both SQL‑based analytics and programmatic (e.g., Python) processing.
  • The primary differences among the three lie in purpose (reporting vs AI/ML), data type support, processing flow (ETL vs ELT), governance capabilities, and performance characteristics.

Full Transcript

# Data Warehouse, Lake, and Lakehouse Explained **Source:** [https://www.youtube.com/watch?v=PQFWQmL3fLY](https://www.youtube.com/watch?v=PQFWQmL3fLY) **Duration:** 00:07:41 ## Summary - Data warehouses are relational systems that ingest structured data via ETL, centralize it, and serve curated datasets for reporting and analytics. - Data lakes collect raw data of any format (structured, semi‑structured, or unstructured) using ELT, letting users transform it later for AI/ML and exploratory workloads. - Data lakehouses merge the scalability of lakes with the governance of warehouses by adding a metadata layer and schema, enabling both SQL‑based analytics and programmatic (e.g., Python) processing. - The primary differences among the three lie in purpose (reporting vs AI/ML), data type support, processing flow (ETL vs ELT), governance capabilities, and performance characteristics. ## Sections - [00:00:00](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=0s) **Data Warehouse vs Lake vs Lakehouse** - An overview of how enterprises store and process structured, semi‑structured, and unstructured data using data warehouses, data lakes, and emerging lakehouse architectures. - [00:03:11](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=191s) **Comparing Data Warehouse, Lake, Lakehouse** - The speaker outlines how a data lakehouse blends raw, low‑cost storage with hybrid analytics, contrasting purpose, data structure support, and storage costs among traditional data warehouses, data lakes, and lakehouses. - [00:06:26](https://www.youtube.com/watch?v=PQFWQmL3fLY&t=386s) **Schema Flexibility & Scalability Comparison** - The speaker explains that data lakes use schema‑on‑read, lakehouses support both schema‑on‑read and schema‑on‑write, and that lakes and lakehouses can scale cheaply whereas scaling a data warehouse is costly and difficult. ## Full Transcript
0:00All organizations generate massive amount of data 0:03from various data sources like databases, application, social media. 0:07For developers and engineers, understanding 0:10how to efficiently store, manage and process this data is critical. 0:14Concepts like data warehouse, data lake 0:16and data lakehouse are essential 0:18tools for modern data management workflows. 0:21Today, I'm here to provide you with the clear insights 0:24into these concepts and the key differences later. So, 0:27let's break down our system into main three parts: data 0:30warehouse, data lake and data lakehouse. 0:33First on my list is data warehouse, 0:36also called as EDW, which is enterprise 0:39data warehouse is a relational database management system. 0:42It aggregates data from various different data sources 0:47into single central repository. 0:50A data source can be anything like a CSV file, 0:53sales data, CRM data, which means any structured data. 0:58The data is processed with the technology called ETL, 1:02which is extract, transform and load. 1:04The data is extracted and transformed 1:07based on the business need and then load into 1:09the data warehouse. 1:15And then transform and curated data 1:19is made available for end users for different analytics 1:23and reporting purpose. 1:25That's the data warehouse. 1:28Second on my list is data lake. In data lake, it 1:32aggregates data from various different data sources 1:36into single, central repository 1:38in its raw and original data format. 1:41The data source can be anything like CSV file, 1:45any structured data, 1:46unstructured data or semi-structured data. 1:49Data is directly dumped into the data lake 1:53and then the data is processed 1:58and transformed with the technology called ELT, 2:01which means extract, load and transform. 2:04The data is extracted and load into the data lake, 2:08and then transform based on the business need. And 2:10then transform and curated data 2:13is made available for end users for different 2:16AI ML use cases. 2:19That's the data lake. 2:21Third on my list is data lakehouse. Data lakehouse ... 2:25This is the innovative solution which brilliantly merge best of both worlds, 2:30which is data warehouse and data lake. 2:32It combines the flexibility and the scalability of data 2:36lake with the data management capability of data warehouse. 2:39Raw data from various different data 2:42sources is added into data lakehouse. 2:47The data type can be anything like a structured data, 2:52unstructured data or semi-structured data. 2:57And then metadata layer is added in data lakehouse, 3:02which provides 3:04structure and schema to the data, 3:07which also enable the data management and the governance feature. 3:11The data is processed with various tools 3:14and technology including SQL, 3:17Python programming language and any other programming language. 3:21And then transformed and curated 3:24data is made available for end users 3:27for different analytical and ML purpose. 3:33That's the data lakehouse. 3:35Let's deep dive into the key differences. 3:38There are six main key differences in terms of each concepts. 3:42First key difference in terms of purpose: 3:44What is the purpose of data warehouse? 3:46Data warehouse mostly used for optimized SQL analytics. 3:53Whereas in data lake, data lake 3:55mostly used for storing raw data. 4:01Whereas in 4:04data lakehouse, data lakehouse mostly used for hybrid analytics. 4:11Second key difference is in terms of structure of the data. 4:16What type of data is stored in each of these three? In data warehouse, it 4:20used only structured data. 4:25Whereas in data lake, 4:27we can use any type of data 4:29like structured, 4:31unstructured or semi-structured. 4:35In data lakehouse as well, all 4:37type of data is stored, so all type of data is supported. 4:43Second key difference: in terms of storage cost. 4:46Storing data in data warehouse is high. 4:49The cost is high. Reason? 4:50We have to be organized all the data 4:53before storing the data into warehouse. 4:55So the storage cost for data warehouse is high. 4:59Whereas in data lake, 5:01we can dump any type of data in its raw format, 5:04so we don't need to worry about organizing the data in data lake. 5:08the storage cost for the data lake is low. 5:11Similar to data lakehouse as well. We 5:14can like dump all type of data in data lakehouse, so ... 5:18so the storage cost for the data lake house is low. 5:22It use the cheaper object storage. 5:25The fourth key difference: in terms of performance. The performance 5:28of data warehouse is high 5:31because everything is organized before storing the data, so 5:36accessing the data is in real time is low. 5:40So we can access the data very easily in data 5:43warehouse. Whereas in data lake 5:45we store any type of data directly into the data lake. 5:49Everything is unorganized, so 5:50accessing those data takes time. So, 5:53the performance is low. 5:56In data lakehouse, 5:58the performance is high and it's ... which is near to the data 6:01warehouse performance. 6:06The fifth key difference: in terms of flexibility. 6:10The data warehouse support fixed schema, 6:13which is also called schema on write, 6:17which means d ... data ... for the data, 6:20the structure of the data and the schema has to be defined 6:23before we store the data into the data warehouse. 6:26Whereas in data lake 6:28it supports flexible schema, 6:32which is also called 6:34schema on read. Means, schema 6:36doesn't have to be defined before storing the data ... 6:40before storing the data into the data lake, 6:42and we can ... uh... provide the schema when we access or query 6:46the data. Whereas in data lakehouse, 6:48it supports both type of schema, 6:50so it is flexible. 6:55The last and the sixth key difference: 6:57in terms of scalability. For data warehouse, 7:01the scaling data warehouse is expensive 7:05and difficult. 7:10Whereas in data lake, 7:12we can scale the data lake very easily 7:16and at a low cost. 7:18In data lakehouse as well, 7:20we can scale the data lakehouse easily 7:24and at the low cost. 7:27So, these systems ... each have unique strengths 7:30and tailored for different workflows. 7:33For developers and engineers, 7:35understanding these concepts are really essentials. 7:38Thank you so much for joining me today.