Protecting Data for AI Adoption

Key Points

  • AI’s power comes from data, so protecting that data is the first critical step before integrating AI into products or business processes.
  • The evolution of data storage—from ancient writings to relational databases (Codd 1970) to server farms, cloud, hybrid cloud, data lakes, and lakehouses—has continually improved how we keep and retrieve information.
  • Modern data ecosystems still rely on structured data stored in databases, but they also incorporate less‑structured data in lakes and lakehouses to support diverse AI workloads.
  • Effective AI initiatives require specialized roles: data engineers design and manage the data architecture, while data scientists transform and analyze the data to generate insights.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=LyfG7SGRiZA](https://www.youtube.com/watch?v=LyfG7SGRiZA)

**Duration:** 00:15:07

## Sections

- [00:00:00](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=0s) **Data Foundations for AI Security** - The speaker stresses that AI depends on data and must be safeguarded, while sketching the historical progression from ancient record‑keeping to relational databases as the groundwork for today’s AI-driven business initiatives.
- [00:03:02](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=182s) **Evolution of Data Roles to AI** - The speaker outlines the progression from data engineers, scientists, and admins managing and securing data, to modern business applications and AI systems that now extract, train, and operationalize data.
- [00:06:13](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=373s) **Classifying Data and Controlling Access** - The speaker stresses that identifying the sensitivity of data is the first step to protection and recommends using role‑based permissions instead of direct access to manage how users and systems interact with that data.
- [00:09:18](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=558s) **Ensuring Identity Management for Access** - The speaker explains that managing access requires authenticating and authorizing users—including privileged accounts—through robust identity management, applying least‑privilege principles, and avoiding shared or generic IDs.
- [00:12:25](https://www.youtube.com/watch?v=LyfG7SGRiZA&t=745s) **Governance, Risk, and Encryption Strategies** - The speaker links low risk to reduced monitoring, outlines the comprehensive governance umbrella covering data classification, cataloging, and identity/access controls, and emphasizes encrypting data with independently managed keys to render stolen information useless.

## Full Transcript
0:00 Howdy everyone. If you're like me, everywhere you turn now you're hearing about AI this, AI that. How do I get AI into automation? How do I leverage and use AI in my products? How do I use it in my business? The thing about AI is that AI doesn't exist without data. You have to have data. And the thing that you need to think about is, how am I going to protect that data? So what I want to talk about now is some of the fundamentals that you can use to protect your data as you start to use and build out AI in your business.

0:34 Now, information and data have been around pretty much since the beginning of human history. We wrote in hieroglyphs, we wrote scrolls, we wrote papers and books, and books went in libraries. We tried to access all that. In the 60s, when mainframes and computers started really getting into the mainstream of business, we really started formalizing how we stored data. So you had integrated data systems, you had information management systems. These all ran to store data, but they weren't very good at retrieving that data. In 1970, E.F. Codd from IBM wrote the seminal paper on relational database management. And this was the first time where we really had data that we could retrieve easily and use for business purposes. That's really the foundation that everything we're doing today is built on.

1:26 So when we started back in the 60s, it was basically structured data. We knew exactly what it looked like, we knew what the fields were. It was very organized into a database. So we had structured data and we had a database. As we started expanding and growing, those databases became based on a server so that we could access all our data off the server. Servers became overloaded, so we started distributing the data to many servers, and that evolved into cloud. And of course the evolution just keeps going, and now we're into hybrid cloud. All of these models basically provide ways that we can store data. Now, we've done certain things: we've taken a little bit of data structure here and built data lakes on top of that. We realized we still wanted to have some of the benefits of using databases and servers, so we expanded that and have the lakehouse. But at the end of the day, it's really all about data. And data is stored in some sort of a system. Whatever the system is, we store data.

2:39 Now we have a user, and users want to extract data and use that data. They're going to query it for information they need. They're going to write reports. Whatever it is, people pull information out of data to use. Now, they can't just take normal data that's been dumped in there and use it. It has to be manipulated. It has to be stored and structured in certain ways. So that brings in the need for engineers who work on that data. These are data engineers, people who can go and manipulate the structure of the data. We have data scientists, who go in and work on the data. There are a lot of people who can come in and interact. And we also have admins that operate on the data. So all of these are coming in. They're changing data, manipulating data. They're making it so that a user can get the data they need.

3:32 Now we also evolved to the point where we have business applications that we want to run and work on data, either to manipulate it, change it, or read it and use it. So we have our business applications, and they're also interacting with data.

3:49 So this is all really good. We've come through our evolution. We've built data. We know how to store it. We have all sorts of models for storing it. If we fast forward a little bit, for the last decade or a couple of decades we've also worried about the security of this data. We worry about whether somebody like David Lightman comes in, hacks into our database, and breaches it: they breach the data, they steal the data, there's ransomware put against it. All of this we have known what to do with, and we've built systems around it.

4:19 Now let's get up to where we're at today. Now we have AI systems. These have come into play, and we have AI that needs to extract data. We're training models. We're building systems out of it. This can be data that we want to train from. It can be vector databases. It can be a whole set of kinds of data that we need for our AI systems. We also may have AI that needs to interact with our business processes, because we want it to flow back into our data and manipulate data, or look at enterprise data and extract that out. This supports our RAG models of AI. All of this supports our gen AI systems. Whatever it is we're using, automation or whatever, we have now introduced AI into this.

5:05 With AI, we also have security concerns. There are things that people can do. They can do data poisoning: they poison the data that we use to train, and that manipulates how the AI works. There are lots of different things we need to be concerned about as we talk about data and how it's going to operate, not just in our normal business operations but now as we start to leverage AI more and more. How are we going to protect that? How do we build our walls around our data so that we can protect it?
5:38 So what I want to do is go through a few fundamental strategies for protecting data, so that as you're engaging AI and building out all these systems, you're at least aware of what you need to do to make sure the data you're building on is protected. So let's talk about the strategies that we can use for protecting our data.

6:06 The first one, and this probably sounds the simplest and the most fundamental, is classification of data. This is extremely important. What this means is: do you understand what kind of data you have that you're extracting? Is it sensitive personal information? Is it personally identifiable information? Do we have confidential information? What kind of sensitive information do we need to be aware of so we know how to protect it? It seems easy, but this is one thing that often gets overlooked. If you don't know what kind of data you have, you don't know how and what you should be protecting. So data classification is the first thing that you need to be aware of. What kind of data do you have?

6:54 The second strategy is really about managing access. Users access the data. The engineers are accessing it; systems are accessing data. The first thing you want to think about when you're talking about how to manage access to the data is no direct access. A user should not be entitled to go in and hit things directly. What we really want to do is put a role in there, and that role has privileges against a set of data. Users assign themselves to a role, or governance assigns them to a role, and that role determines what they can do. And you do these roles everywhere, right? AI would use roles, the business applications use roles, and everyone who's trying to manipulate the data and get it ready would have roles assigned to them. So no direct access: work through a layer of indirection, a layer of abstraction, and have roles be the things that manipulate the data. Whatever is coming in would go through that role.

8:08 The second thing that you should think about is, to the extent possible, always make the data read only. This really talks about the area of the stack up here. If we're looking at people who are reading the data or using the data, or AI, this part should be read only as much as possible. We'll talk about the other group in just a minute, because that won't be read only. But really try to make this read only.

8:34 The next thing you want to think about when we're talking about managing access is to use least privilege. What this says is that if a user or even an AI system is coming in, they shouldn't get access to everything. They should only get access to what they need to do the job they're trying to execute. So that would be a very specific role that says I can only access a few different things. And if you need to access something else, then you have another role. A person could have multiple roles, or an AI system could have multiple roles, that get them just to the narrow piece of information they need to perform some sort of task. So that's least privilege. That's the next thing we need to make sure we're doing when we manage access.

9:23 The last thing under managing access is identity management. What this really says is that we should know who a user is. There are lots of good videos on this. We authenticate them, and that tells us who they are. There's another system that tells us what they're entitled to do, and that maps to the roles. But all of this is around identity management. Whether it's a business application coming in through a cert or APIs, however that is done, all of these entities that are accessing data need to have their identities managed. So we know who it is. We trust who they are. We know they've been authenticated, we know they've been authorized, and now we can feed up the data. So that's the last piece we're talking about with managing access.

10:09 Now let's talk about this group here. These are our privileged users, and we need to think about them as well. We have to have all of this in place for them, though we may alter some of it. Obviously we're not going to do read only, but they should still have least privilege, and we should still have identity management. There are other things we need to bring in when we're thinking about privileged users, and about these business applications. Let's limit or eliminate shared IDs. An application shouldn't just be using a generic ID that many people on the other end also have access to, because then we lose control of knowing, under identity management, who is actually trying to do things. So we need to think about things like: do we have vaults? Can we rotate secrets? How do we make sure that the business application uses a one-to-one ID or credential to get in and access the data? So limit any kind of shared credentials that would provide access, even on the engineering side.

11:15 The next thing we need to do is monitor. This set of people has more access privileges than we have up on the operational side of the stack. Because of that, we need to make sure we're looking for any anomalies in their behavior. Did somebody's ID get compromised? They're coming in at two in the morning, and that's not a regular time for them. Can we look at that as an anomaly and think, okay, there may be a breach or something going on here? So basically, monitor the activity and make sure it's within the patterns of behavior we expect, so we know that everything is proceeding the way we want it to and that something else isn't happening. If you do detect something along this line, then you can take an action.

12:04 As we talk about monitoring, it's also associated with risk, and this gets back to our classification. If it's more sensitive information, there's higher risk, and therefore we want to monitor certain things to make sure nothing is happening with that sensitive information. If the risk is really low, then maybe the monitoring goes down with that. So monitoring and risk are always associated together.

12:32 Now, when we think about these sets of things right here, this is really about governance. This is about data governance: classification is about data governance, as is cataloging what the data is. All of this falls under a governance umbrella, as does identity governance. You get into IAM, you get into IGA, you get into access governance. Really, all of this falls under the umbrella of governance. There are some really good tools for providing this level of governance. If you're trying to implement these strategies, there are some really good ways of doing that, and there are some really good videos on how to do that as well.
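As a rough illustration of the monitoring idea described above (an ID coming in at two in the morning, with scrutiny scaled to how sensitive the data is), the sketch below flags access outside an ID's usual hours, but only for data classified at or above a chosen level. Everything named here is hypothetical and not from the talk: the `Sensitivity` levels, the `AccessEvent` fields, the `USUAL_HOURS` baseline, and the `MONITOR_FROM` threshold are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical classification levels; a higher value means more sensitive."""
    PUBLIC = 0
    CONFIDENTIAL = 1
    PII = 2


@dataclass(frozen=True)
class AccessEvent:
    user_id: str
    dataset: str
    sensitivity: Sensitivity
    timestamp: datetime


# Assumed baseline: the hours (24h clock) in which each privileged ID normally works.
USUAL_HOURS = {"dba-alice": range(8, 19)}

# Assumed policy: only data at or above this level is checked for odd hours.
MONITOR_FROM = Sensitivity.CONFIDENTIAL


def is_anomalous(event: AccessEvent) -> bool:
    """Return True when a monitored access falls outside the ID's usual hours."""
    if event.sensitivity < MONITOR_FROM:
        return False  # low risk, so monitoring is relaxed
    usual = USUAL_HOURS.get(event.user_id)
    if usual is None:
        return True  # an ID with no baseline touching sensitive data gets flagged
    return event.timestamp.hour not in usual


if __name__ == "__main__":
    night = AccessEvent("dba-alice", "customers", Sensitivity.PII,
                        datetime(2024, 5, 1, 2, 0))
    day = AccessEvent("dba-alice", "customers", Sensitivity.PII,
                      datetime(2024, 5, 1, 10, 0))
    print(is_anomalous(night), is_anomalous(day))
```

Under these assumptions the 2 a.m. access is flagged and the 10 a.m. access is not. A real deployment would likely learn the baseline from activity history rather than hard-code it.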
13:10 The next thing we want to do as a strategy is encrypt our data. If we encrypt everything in here, then if there's a breach and something gets stolen, there's a better likelihood that the data will be useless, because they don't have the keys to unlock it. And that's an important topic as well. Make sure that the keys you have are independently or third-party managed. In other words, your admins shouldn't have the keys to unlock the data. The admins should be manipulating the system. If you talk about a role, an admin can have a special role that says they can build structures, they can build tables, they can build whatever it is you need. If you're using an object database, whatever it is, admins build structure, but they can't see or manipulate data. If they do get to it, that data is encrypted, and they could only read it if they had the keys to unlock it, which is why we want to separate that out. So encryption is a very important topic when it comes to protecting data.

14:05 And then finally, repeat all of this. This is just good security hygiene. It's not enough to say, look, I did this, I checked my box; I did that, I checked my box. Check, check, check, I'm all good. Things are constantly changing. Data is very dynamic: what's coming in changes, and so do the kinds of systems we're using to store it. So you should be continually reassessing. Do I have the right classification? Have I cataloged this correctly? Have I set up all my access right? Did somebody keep access because they used to be over here, then moved over there, and retained that access? So continually repeat these strategies to make sure your data is properly protected.
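The reassessment step above, checking whether someone kept access after moving, can be sketched as a periodic comparison of the roles each identity holds against the roles its current job needs. This is only an illustration under assumptions: the job names, role names, `Identity` fields, and the `REQUIRED_ROLES` table are all hypothetical, and a real review would pull them from whatever identity governance tooling is in use.

```python
from dataclasses import dataclass, field

# Assumed mapping from a job to the minimal set of roles it needs (least privilege).
REQUIRED_ROLES = {
    "data-engineer": {"pipeline-writer", "catalog-reader"},
    "analyst": {"reports-reader"},
}


@dataclass
class Identity:
    user_id: str
    job: str
    roles: set = field(default_factory=set)


def excess_roles(identity: Identity) -> set:
    """Roles the identity holds that its current job does not require."""
    needed = REQUIRED_ROLES.get(identity.job, set())
    return identity.roles - needed


def review(identities: list) -> dict:
    """Map each user ID to the roles that should be revoked; skip clean identities."""
    findings = {}
    for identity in identities:
        extra = excess_roles(identity)
        if extra:
            findings[identity.user_id] = extra
    return findings


if __name__ == "__main__":
    # Bob moved from data engineering to analysis but kept his old role.
    bob = Identity("bob", "analyst", {"reports-reader", "pipeline-writer"})
    carol = Identity("carol", "data-engineer", {"pipeline-writer", "catalog-reader"})
    print(review([bob, carol]))
```

Running the review on a schedule, rather than once, is what turns it into the "repeat" strategy: the same check catches access that was correct when granted but no longer matches the person's job.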
14:50 As we look at building out AI systems, which are built off of data, this is a set of strategies that can help you make sure that the data you're using is protected. Thank you.