Learning Library

← Back to Library

Data Pipelines Explained Through Water

Key Points

  • Data pipelines move raw, “dirty” data from sources (data lakes, databases, streaming feeds) to where it can be used, much like water pipelines transport untreated water to treatment plants.
  • Like water treatment, data must be cleaned, de‑duplicated, and formatted before it becomes useful for business decision‑making.
  • The primary method for this is ETL (extract, transform, load), which extracts data, applies transformations to resolve mismatches and missing values, and loads it into a target repository such as an enterprise data warehouse.
  • Pipelines can operate in batch mode on a schedule or in continuous streaming mode to handle real‑time data feeds, and they may also employ techniques like data replication or virtualization.
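The ETL flow named in the points above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular ETL tool, and the record fields (`id`, `region`, `sales`) are made-up examples:

```python
# Minimal ETL sketch: extract raw records, transform them (de-duplicate,
# fill missing values, trim text), then load into a target repository.

RAW_SOURCE = [                                    # "dirty" source data
    {"id": 1, "region": "east ", "sales": 100},
    {"id": 1, "region": "east ", "sales": 100},   # duplicate row
    {"id": 2, "region": None,    "sales": 250},   # missing value
]

def extract(source):
    """Pull raw records from the source (here, an in-memory list)."""
    return list(source)

def transform(records):
    """Clean the data: drop duplicates, fill missing values, trim text."""
    seen, clean = set(), []
    for rec in records:
        if rec["id"] in seen:          # de-duplicate on the id column
            continue
        seen.add(rec["id"])
        clean.append({
            "id": rec["id"],
            "region": (rec["region"] or "unknown").strip(),
            "sales": rec["sales"],
        })
    return clean

def load(records, warehouse):
    """Write the cleaned records into the target repository."""
    warehouse.extend(records)

warehouse = []   # stands in for an enterprise data warehouse table
load(transform(extract(RAW_SOURCE)), warehouse)
print(warehouse)
# [{'id': 1, 'region': 'east', 'sales': 100},
#  {'id': 2, 'region': 'unknown', 'sales': 250}]
```

A real pipeline would extract from databases or APIs and load into a warehouse, but the three-stage shape is the same.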

Full Transcript

# Data Pipelines Explained Through Water

**Source:** [https://www.youtube.com/watch?v=6kEGUCrBEU0](https://www.youtube.com/watch?v=6kEGUCrBEU0)
**Duration:** 00:08:30

## Sections

- [00:00:00](https://www.youtube.com/watch?v=6kEGUCrBEU0&t=0s) **Water Pipeline Analogy for Data Pipelines** - The speaker compares water treatment and distribution pipelines to data flows from lakes, databases, and streams, illustrating how raw data is collected, processed, and delivered to where it's needed in an organization.

## Full Transcript
Let's talk about data pipelines: what they are and when and how they're used.

I want to start with a simple idea. Most of us are fortunate enough to turn on the tap whenever we like and have fresh, clean water come out. But have you ever thought about how that water actually gets to you?

Water starts out in our lakes, our oceans, and even our rivers. Most of us probably wouldn't drink straight from the lake, though. We have to treat and transform this water into something that's safe for us to use, and we do that using treatment facilities. And we get the water from where it is to where it needs to go using water pipelines.

Once that water has gotten from the source to the treatment plants, it's cleansed and made safe to use, and then it's sent out through even more pipelines to where we need it: for drinking water, for cleaning, and for agriculture.

So, as you can see, water pipelines take water from where it is to where it's needed. Now we can start to think about data in organizations in a very similar way. Data in an organization starts out in data lakes and in databases that are part of different SaaS applications, while some applications are on premises. We also have streaming data, which is kind of like our river here. This can be data that is coming in in real time; an example could be sensor data from factories, where readings are collected every second and sent back up to our repositories.

Just like our water sources, this data is dirty. It's contaminated, and it must be
cleaned and transformed before it's useful in helping us make business decisions. So how do we do this work? Not with water pipelines, but with data pipelines.

When we talk about data pipelines, we have a few different processes we can use to handle the task of transforming and cleaning this data: ETL, data replication, and data virtualization.

One of the most common processes is ETL, which stands for extract, transform, and load, and it does exactly what it sounds like. It extracts data from where it is; transforms it by cleaning up mismatched data, taking care of missing values, getting rid of duplicated data, and making sure the right columns are there; and then loads it into a landing repository for ready-to-use business data. An example of one of these repositories could be an enterprise data warehouse.

Most of the time we use something called batch processing, which means that on a given schedule we load data into our ETL tool and then load it to where it needs to be. But we could also have stream ingestion, which supports the streaming data I mentioned earlier: the pipeline continuously takes data in, transforms it, and continuously loads it to where it needs to be.

Another tool we might see is data replication. This involves continuously replicating and copying data into another repository before it's loaded or used by our use case. So we could have a repository in the middle that copies data from our source. Why would we do that? Well, one of the reasons could be that the
application or use case where we need this data requires a highly performant back end, and it's possible that our source system can't support something like that. Another reason could be backup and disaster recovery: in the situation where our source data goes offline for some reason, we still have this backup to keep running our business processes against.

The last one I want to touch on is data virtualization. All of the methods I've described so far require you to copy data from where it is and move it into another repository. But what if we want to test out a new data use case and don't want to go through a large data transformation project? In that case, we can use a technology called data virtualization to simply virtualize access to our data sources and query them in real time only when we need them, without copying them over. Once we're happy with the outcome of our test use case, we can go back and build out formal data pipelines. So data virtualization technology allows us to access all of these disparate data sources without having to build out permanent data pipelines.

Once we're satisfied with the results of our data virtualization project, we can build a formal data pipeline that can support the massive amounts of data we need in a production use case. Unfortunately, we haven't figured out a way to virtualize water, but we can definitely do it with data in our organizations.

So after we've used all these different processes to get data ready for analysis or different applications, we can start using it. What are the different ways in which we can use this data? Well, we might need it for our business intelligence platforms that
are needed for different types of reporting. We might also need it for machine learning use cases: machine learning requires tons and tons of high-quality data, so we need these data pipeline tools to feed our machine learning algorithms. This clean data can be fed into our machine learning models to help us start making better and smarter decisions in our business.

So, as we can see, data pipelines take data from data producers and give it to data consumers.

Thank you! If you have questions, please drop us a line below, and if you want to see more videos like this in the future, please like and subscribe.
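The batch-versus-streaming distinction described in the video can be sketched in a few lines of Python. The readings list and the `clean` step are hypothetical stand-ins for a real sensor feed and transform:

```python
# Batch vs. streaming ingestion, sketched with an in-memory "feed".

readings = [20.1, 20.3, None, 20.2]    # stand-in for a sensor feed

def clean(reading):
    """Transform step: drop bad readings, normalize the rest."""
    return None if reading is None else round(reading, 1)

# Batch mode: on a schedule, process everything collected so far at once.
batch_target = [clean(r) for r in readings if r is not None]

# Streaming mode: transform and load each record as it arrives.
stream_target = []
for r in readings:                     # imagine this loop never ends
    c = clean(r)
    if c is not None:
        stream_target.append(c)        # "load" step, record by record

assert batch_target == stream_target   # same result, different timing
```

The difference is when the work happens: batch runs on a schedule over accumulated data, streaming handles each record the moment it arrives.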
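Data replication, as described above, continuously copies source rows into a second repository so consumers query the replica rather than the source. A toy sketch, where the `orders` table and the id-based copy scheme are made up for illustration:

```python
# Replication sketch: keep a copy of the source in a separate replica.

source = {"orders": [{"id": 1}, {"id": 2}]}
replica = {"orders": []}

def replicate(src, dst):
    """Copy over any rows the replica doesn't have yet."""
    have = {row["id"] for row in dst["orders"]}
    for row in src["orders"]:
        if row["id"] not in have:
            dst["orders"].append(dict(row))   # copy, don't share

replicate(source, replica)
source["orders"].append({"id": 3})   # the source keeps changing...
replicate(source, replica)           # ...so replication runs continuously

print(len(replica["orders"]))  # 3
```

If the source goes offline, the replica still holds all three rows, which is the disaster-recovery benefit mentioned in the video.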
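Data virtualization, in miniature: instead of copying data anywhere, a thin query layer reads each live source on demand. The `crm` and `billing` sources and their fields are invented examples:

```python
# Virtualization sketch: query disparate sources in place, at request
# time, without copying them into a new repository.

crm = [{"customer": "Acme", "region": "east"}]       # one live source
billing = [{"customer": "Acme", "balance": 120}]     # another live source

def virtual_query(customer):
    """Join across the live sources on demand, copying nothing."""
    profile = {}
    for row in crm:
        if row["customer"] == customer:
            profile.update(row)
    for row in billing:
        if row["customer"] == customer:
            profile.update(row)
    return profile

print(virtual_query("Acme"))
# {'customer': 'Acme', 'region': 'east', 'balance': 120}
```

This is why virtualization suits quick experiments: once the test use case proves out, the same join logic can be promoted into a permanent ETL pipeline.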