Netflix's Iceberg: Revolutionizing Data for AI

6m • databases • deep-dive • intermediate

Key Points

In 2017, Netflix’s catalog and viewing volume overwhelmed its traditional databases, which could not scale, lacked versioning, forced queries to scan the whole dataset, and required downtime to modify schemas.
To solve this, Netflix built an in‑house table format called Iceberg that stores data as immutable files in cloud object storage (e.g., Amazon S3), decoupling compute from storage.
Iceberg’s design introduced features like schema evolution without downtime, lazy loading of only needed data, and searchable metadata, dramatically improving performance and scalability for big‑data workloads.
Recognizing the broader industry benefit, Netflix open‑sourced Iceberg, enabling other companies to adopt the same data‑lake architecture.
Iceberg is now widely adopted (the video names Databricks, Snowflake, AWS, and Azure) and underpins many enterprises’ data lakes, making it easier to prepare massive datasets for downstream AI and analytics applications.
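
The ideas in the key points (immutable files, versioning, and metadata that lets a reader load only the data it needs) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Iceberg's actual API or file layout: the class names `DataFile`, `Snapshot`, and `Table`, the `day` column, and the `s3://bucket/...` paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """Metadata for one immutable data file (hypothetical fields)."""
    path: str       # object-store key; the file is never rewritten in place
    min_day: int    # smallest value of the `day` column stored in the file
    max_day: int    # largest value of the `day` column stored in the file


@dataclass(frozen=True)
class Snapshot:
    """One version of the table: the set of files that belong to it."""
    snapshot_id: int
    files: tuple


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def append(self, new_files):
        """Commit a new snapshot; earlier snapshots remain readable."""
        current = self.snapshots[-1].files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, current + tuple(new_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def plan_scan(self, day, snapshot_id=None):
        """Use metadata only to pick files whose day range can contain `day`."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snap = self.snapshots[-1]                 # latest version
        else:
            snap = self.snapshots[snapshot_id - 1]    # an older version
        return [f.path for f in snap.files if f.min_day <= day <= f.max_day]


table = Table()
v1 = table.append([DataFile("s3://bucket/views/a.parquet", 1, 10)])
v2 = table.append([DataFile("s3://bucket/views/b.parquet", 11, 20)])

latest_plan = table.plan_scan(day=15)               # only b.parquet qualifies
old_plan = table.plan_scan(day=15, snapshot_id=v1)  # v1 has no file covering day 15
print(latest_plan)
print(old_plan)
```

In this sketch a write never modifies an existing file; it only adds a snapshot that points at a longer file list, which is what makes going back to an earlier version cheap. The scan planner reads the per-file ranges rather than the files themselves, which is one plausible reading of what the video calls lazy loading.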

Sections

00:00:00 Netflix’s 2017 Data Overhaul - In 2017 Netflix confronted the scaling limits of traditional relational databases, spurring an innovative data architecture that now serves as a model for how large enterprises prepare and manage data for AI applications.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=B-hhzEbKbiE](https://www.youtube.com/watch?v=B-hhzEbKbiE)

**Duration:** 00:06:33

0:00 Step into the time machine with me. We're going to talk about something that happened way back in 2017 that ended up powering a lot of the way big corporations today are thinking about prepping their data for AI. So there's an AI tie-in at the end here.

0:14 So in 2017 Netflix had a problem. They had so many movies and shows, and people were watching them so much, that their traditional table structures and their traditional database were breaking down. At the time, databases work a lot like I think most people's mental models of databases operate. Just to explain that in detail: they have rows, they have tables. You look up the row, you look up the table, it sits on a file on a server somewhere, and there you go, right?

0:47 Now, those kinds of databases do match what we imagine, but they have problems at scale. You cannot update them without shutting down the database. Imagine adding a column and having to shut down the entire database; it's a problem. They don't have versioning, so you can't go back in time and see what the data was like before. They don't have the ability to overwrite or edit, necessarily, in the same way. They have performance issues because you have to look across the entire database; there's not really a way to do it only partially. I could go on. There are a lot of issues, and some of them include storage.

1:22 Netflix realized they needed to innovate. They needed to fix, they needed to make something that actually served their needs in 2017. And so what they came up with was what we now know as Iceberg. They developed it in-house at Netflix in order to serve TV and movies and shows effectively. So all of us streaming contributed to innovation. It's a nice feeling, right?

1:45 What they did was they converted the traditional model of the database and they moved it to the cloud. And so it has a core file-storage notion in the cloud; it would sit in Amazon S3, as an example (Netflix would use AWS, quite famously). That meant it was infinitely extensible. It didn't have to sit within traditional compute limitations.

2:12 It also meant that you could design it differently than a traditional database. You could update it on the go. It had metadata that you could query. It did not have downtime if you dropped a column. You could use lazy loading on the database, which meant that you could pull only the part that you cared about at the time; you didn't have to pull the whole thing, which made it more performant. There were a lot of advantages to Iceberg that essentially added up to "big data works better here."

2:39 Now, Netflix could have kept that. They could have said, "No, no, no, this is ours, we don't want to share," and our model of competition in tech suggests that they would. But our model is bad, because big tech companies both compete and cooperate, and in this case tools like this, the ones that are the foundational elements of the internet or that power our apps, tend to be open-sourced more often than not.

3:02 And so Netflix open-sourced it. They actually handed it over to the Apache Software Foundation, which is the software foundation for projects like this; they've been running since 1999. By the time 2021 rolled around, this little project that started at Netflix had been incubated by Apache and became a top-level project at Apache, which means that it was considered stable, it was maintained by a rich community of developers, etc.

3:30 Now you might wonder why. What possible gain would Netflix have to do this, other than being nice? And Netflix isn't necessarily known for that. I can think of one. If you are going to have a core part of your infrastructure that you have to maintain over time, it would be smarter if you could build it in such a way that you knew that you could get talent in the door to maintain it and upgrade it and improve it over time.

3:58 Now, you could do that by laboriously training everyone who comes into your company on your special proprietary way of doing things. But because this is a foundational part of the internet, it makes more sense to just open-source it (your competitive advantage is still your shows, it's not your database) and allow people who have learned it elsewhere to come to Netflix and practice their craft. It's a talent advantage.

4:30 Moving back: Apache makes this a top-level project. You're still wondering where the AI connection is. Well, it turns out they made it a top-level project just before ChatGPT exploded like a meteor on the scene, and this was a perfect open-source solution to major data lakes. That means that when all of these companies around the world began to ask themselves, "How do we collect our data and get it into a state where we can actually build AI models against it, build AI models on top of it?", it was right there.

5:03 And so, just like that, it began to be adopted all over the industry. Databricks has it, Snowflake has it, AWS has it, Azure has it. All of these cloud providers and data providers have figured out that they can use this open-source tool, developed by Netflix to help us with our scrolling and our movie watching, to enable large-scale data lakes that companies around the world can leverage for AI deployments.

5:34 And I think that's a really cool story. If you look at that and you say, "Wow, that's kind of neat," there are so many stories like that that have enabled the world we have today, and they're not always viciously competitive. This one actually exemplifies cooperation. Even after Netflix turned over the software to Apache, it took the work of hundreds of developers, thousands of developers, to mature the open-source software so it was actually something that was stable enough for large-scale deployments at these companies.

6:09 That's a big deal. The developer community is remarkably cooperative, and I don't think we talk about it enough. I wanted to do a story that actually shows how that kind of cooperation unlocks capabilities that we are building against to this day. So there you go: that's the story of Iceberg and how it helped power the future of AI through cooperation. Cheers.
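
The transcript's point that a column can be added or dropped without downtime can be made concrete with a small sketch. It assumes, for illustration, that columns are tracked by a stable numeric ID in table metadata and that data files store values under those IDs, so a schema change only edits metadata and never rewrites existing files. The function `read_rows`, the schemas, and the sample row are hypothetical and made up for this example; this is not Iceberg's API.

```python
def read_rows(file_rows, schema):
    """Project stored rows through a schema (column id -> column name).

    Columns the file does not contain read as None; columns the schema
    no longer lists are simply not returned.
    """
    return [
        {name: row.get(col_id) for col_id, name in schema.items()}
        for row in file_rows
    ]


# An "immutable file" written under schema v1, keyed by column id (made-up data).
old_file = [{1: "Show A", 2: 2016}]
schema_v1 = {1: "title", 2: "year"}

# Add a column: a metadata-only change, old_file is left untouched.
schema_v2 = {**schema_v1, 3: "rating"}

# Drop a column: also metadata-only.
schema_v3 = {col_id: name for col_id, name in schema_v2.items() if name != "year"}

rows_v2 = read_rows(old_file, schema_v2)
rows_v3 = read_rows(old_file, schema_v3)
print(rows_v2)
print(rows_v3)
```

Because readers resolve columns through the schema at read time, old files stay valid under every later schema in this sketch, which is why no shutdown or rewrite is needed.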