Building Governed Data Lakes for AI

5m • databases • tutorial • intermediate

Key Points

Data lakes serve as centralized repositories that ingest and store diverse data sources—streaming, batch, internal, and external—to enable powerful user and business insights.
A flexible ingestion framework standardizes and copies data into the lake, allowing analysts to work on the data without affecting the original sources.
Raw data typically requires extensive cleansing, preparation, and feature extraction before it can be used for advanced analytics or machine learning.
Each processing step generates new derived datasets that remain linked to the original data, which is essential for tracing impacts and updating models when source data changes.
Embedded governance captures metadata, enforces usage policies, and maintains lineage throughout the pipeline, ensuring data is used correctly and responsibly.
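
The lineage and policy points above can be illustrated with a small sketch. It is not taken from the video: the `Catalog` class, its method names, the data set names, and the `allowed_purposes` field are all hypothetical, and a real data lake would rely on a dedicated catalog or governance service rather than an in-memory dictionary. The sketch only shows the idea that each derived data set records what it came from, so a correction to one source can be traced to everything downstream.

```python
from collections import defaultdict, deque


class Catalog:
    """Toy metadata catalog (hypothetical): records what each data set was derived from."""

    def __init__(self):
        self.metadata = {}                  # data set name -> descriptive metadata
        self.downstream = defaultdict(set)  # parent name -> names derived directly from it

    def register(self, name, derived_from=(), **metadata):
        """Store metadata for a data set and link it to its parents."""
        self.metadata[name] = dict(metadata, derived_from=tuple(derived_from))
        for parent in derived_from:
            self.downstream[parent].add(name)

    def impacted_by(self, name):
        """Return every data set or model that depends on `name`, directly or indirectly."""
        seen, queue = set(), deque([name])
        while queue:
            for child in self.downstream[queue.popleft()]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    def allowed(self, name, purpose):
        """Simple policy check: a use is allowed only if the data set lists that purpose."""
        return purpose in self.metadata[name].get("allowed_purposes", ())


# Hypothetical pipeline: two raw sources, a cleansed copy, a feature table, a model.
catalog = Catalog()
catalog.register("crm_raw", allowed_purposes=("analytics",))
catalog.register("clickstream_raw", allowed_purposes=("analytics", "marketing"))
catalog.register("customers_clean", derived_from=["crm_raw"],
                 allowed_purposes=("analytics",))
catalog.register("user_features", derived_from=["customers_clean", "clickstream_raw"],
                 allowed_purposes=("analytics",))
catalog.register("churn_model", derived_from=["user_features"],
                 allowed_purposes=("analytics",))

# If crm_raw needs a correction, these are the artifacts to rebuild.
affected = catalog.impacted_by("crm_raw")
print(sorted(affected))
```

Storing the edges from parent to child makes the "what do I need to rebuild?" question a plain breadth-first walk. Storing them the other way would suit the opposite question, "where did this model's data come from?".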

**Source:** [https://www.youtube.com/watch?v=LxcH6z8TFpI](https://www.youtube.com/watch?v=LxcH6z8TFpI)

**Duration:** 00:05:15

Sections

- [00:00:00](https://www.youtube.com/watch?v=LxcH6z8TFpI&t=0s) **Understanding Data Lakes and Ingestion** - Adam Kocoloski explains what data lakes are, how they consolidate diverse data sources via a common ingestion framework, and the preparation steps needed to enable intelligent applications.
- [00:03:11](https://www.youtube.com/watch?v=LxcH6z8TFpI&t=191s) **Data Lake Enables Intelligent Applications** - The passage outlines how a data lake supports creating dashboards, recommendation engines, and automated processes by following the AI ladder’s four stages (collect, organize, analyze, and infuse), resulting in a continuous feedback loop of new data and smarter models.

Full Transcript

0:00 Hi everyone, my name's Adam Kocoloski with IBM Cloud, and I'm here to talk to you today about data lakes: what they are, how you use one, and the kind of things you ought to be thinking about as you set one up to power your applications and create more intelligent experiences for users.

0:14 So, data lakes exist because we're all awash with data. We've got systems of record, we've got systems of engagement, we've got streaming data, we've got batch data, internal data, external data, and it's really a combination of these different kinds of data sources that leads us to get powerful insights about what our users are doing, about the way the world is working around us, and leads us to develop more intelligent applications.

0:38 Data lakes start by collecting all those different types of data sources through a common ingestion framework, and that ingestion framework is something that typically wants to be able to support a diverse array of different types of data, and it wants to kind of standardize and centralize all that stuff into a common storage repository. That's not always required, but typically you don't want to be analyzing the source data directly; you want to be able to take a copy of it, so that you've got the flexibility to do the kind of things you need to do with that data.

1:07 And speaking of that, the data typically doesn't come in a form where you can use it right out of the box. There's a lot of data cleansing and data preparation that's required. There is oftentimes the ability to, or the requirement to, create new features, something we call feature extraction: combinations of different types of data that need to be pulled together in order to create the right sort of bits of information to analyze.
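
The cleansing and feature extraction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields, the sample values, and the `cleanse` and `extract_features` functions are hypothetical and do not come from the talk. The sketch works on copies of the raw records, drops rows that cannot be repaired, and then combines two sources into one feature row per user.

```python
# Hypothetical raw records, as they might arrive from two different sources.
RAW_PURCHASES = [
    {"user_id": " u1 ", "amount": "20.00", "country": "us"},
    {"user_id": "u2", "amount": "n/a", "country": "DE"},   # unusable amount
    {"user_id": "U1", "amount": "5.50", "country": "US"},
    {"user_id": "", "amount": "3.00", "country": "US"},    # missing user id
]
RAW_SESSIONS = [
    {"user_id": "U1", "minutes": 30},
    {"user_id": "U2", "minutes": 12},
]


def cleanse(record):
    """Return a normalized copy of one purchase record, or None if it cannot be repaired."""
    user_id = record["user_id"].strip().upper()
    if not user_id:
        return None
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    # A new dict is built, so the raw record is left untouched.
    return {"user_id": user_id, "amount": amount, "country": record["country"].upper()}


def extract_features(purchases, sessions):
    """Combine cleansed purchases and session data into one feature row per user."""
    features = {}
    for p in purchases:
        row = features.setdefault(
            p["user_id"], {"total_spend": 0.0, "purchases": 0, "minutes": 0}
        )
        row["total_spend"] += p["amount"]
        row["purchases"] += 1
    for s in sessions:
        # Only users with at least one valid purchase get a feature row.
        if s["user_id"] in features:
            features[s["user_id"]]["minutes"] += s["minutes"]
    for row in features.values():
        # Derived feature that exists in neither source on its own.
        row["spend_per_minute"] = row["total_spend"] / row["minutes"] if row["minutes"] else 0.0
    return features


clean_purchases = [c for c in (cleanse(r) for r in RAW_PURCHASES) if c is not None]
features = extract_features(clean_purchases, RAW_SESSIONS)
print(features)
```

Here `spend_per_minute` is the kind of combined value the speaker calls a feature: it only exists once purchase data and engagement data have been pulled together.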
1:40 And once you cleanse that data, prep the data, model the right kind of features for your analysis, then you get to the fun part, which is actually going in and doing the machine learning model training and doing your advanced analytics.

1:56 And each of these steps is typically creating new derived data sets that tie back to the original one. And that relationship is a really important thing to capture, because, let's say, there was a problem with one of your data sources. You know, there was a correction that needed to be made. You need to understand how that flows through the entire pipeline of more refined data sets and models that you're producing, so that you can go back and correct it.

2:20 And that's where this governance stuff comes into play. This is something that's really, you know, infused at every step of the journey. It means collecting metadata, you know, data about your data, you know, the right kinds of information about the tables in your data sets and how they relate to one another. It means being able to enforce policies so that as an organization we use the data the way it's meant to be used, the way it's intended to be used, the way it's acceptable to be used to drive the business forward. That's really something that can't be bolted on after the fact; that's something that has to be present throughout the entire life cycle.

2:53 If we stop here, we haven't really changed anything. It's only by getting these insights that we're producing in this data lake back out into the real world that we're able to, you know, deliver on the business promise of these data lakes that we're all investing in, and that's where this apply step comes in.

3:11 This can take a few different forms. You might be, you know, simply building dashboards that are helping business executives make smarter decisions about where to take the business forward with new projects to invest in.
3:25 Or you might be building smarter applications that are able to make intelligent recommendations to the users of those apps based on, you know, historical purchase data. Increasingly we're also seeing a lot of process automation where an intelligent model can smooth over some typically manual business processes and create a more intelligent experience based on the sort of rich, data-driven understanding of the problem at hand.

3:58 And really this whole process iterates back, right? Those more intelligent applications, they end up generating new data and the cycle continues. And so that, in a nutshell, at a very high level, is what a data lake does.

4:12 Some of you may have heard us talk about "the ladder to AI", the "AI ladder", and we talk about that: we talk about collecting data. We talk about organizing data. We talk about analyzing. And we talk about infusing. And really those four steps on this ladder are things that you can see represented throughout this data lake environment.

4:40 Clearly over here we're doing a lot of collection of these individual sources of data. This data preparation and feature extraction step, in a governed fashion, is absolutely what we mean by the organizing of data. ML model training is a key example of data analysis. And we talk about infusing the insights from the data lake into the applications; that's really this last step here. And so, there is very much a clear linkage between climbing this AI ladder and a data lake as a vehicle that can help you make that journey.

5:11 Thanks for watching. If you have any questions or comments, please drop us a line below. If you enjoyed this content, please consider liking or subscribing. Thank you.