Apache Iceberg: Solving Modern Big Data

Key Points

  • Big data is essential for training, tuning, and evaluating modern AI models, but its sheer volume makes management increasingly complex.
  • A data management system can be likened to a library that needs ample storage, processing power (the “librarian”), and rich metadata to organize and retrieve content at scale.
  • Since the early 2000s, technologies like Apache Hadoop introduced distributed storage (HDFS) and parallel processing (MapReduce) to handle data that outgrows single machines.
  • MapReduce, while powerful, required Java programming and proved cumbersome for analysts accustomed to simple SQL queries, highlighting a usability gap.
  • Apache Iceberg addresses these challenges by offering an open‑source, modern data‑management layer that simplifies handling, querying, and evolving massive datasets.

**Source:** [https://www.youtube.com/watch?v=6tjSVXpHrE8](https://www.youtube.com/watch?v=6tjSVXpHrE8)

**Duration:** 00:12:42

## Sections

- [00:00:00](https://www.youtube.com/watch?v=6tjSVXpHrE8&t=0s) **Big Data Challenges and Apache Iceberg** - The speaker explains why massive data sets are vital for AI, outlines the difficulties of managing them, and introduces the open‑source Apache Iceberg as a modern solution, using a library analogy to illustrate storage, processing, and metadata management.
- [00:03:06](https://www.youtube.com/watch?v=6tjSVXpHrE8&t=186s) **Hadoop, MapReduce, and Hive Evolution** - The passage explains how Hadoop’s distributed storage and MapReduce processing introduced scalable big‑data handling but were hard for analysts, leading to Hive’s 2008 debut, which converts SQL‑like queries into MapReduce jobs and uses a metastore to optimize access.
- [00:06:15](https://www.youtube.com/watch?v=6tjSVXpHrE8&t=375s) **Bridging Storage and Query Gaps** - The speaker explains how soaring mobile/IoT data pushes firms to adopt cheap, scalable S3 storage, yet Hive’s inability to access S3 and its sluggish batch‑only performance hinder real‑time analytics, prompting the 2017 open‑source release of Apache Iceberg to unify S3 compatibility with both batch and interactive query workloads.
- [00:09:23](https://www.youtube.com/watch?v=6tjSVXpHrE8&t=563s) **Iceberg's Metadata-Driven Flexibility** - Iceberg decouples storage and compute via a rich metadata layer, enabling any engine to query any storage while offering governance features like versioning, transactions, and schema evolution through snapshot metadata.
- [00:12:32](https://www.youtube.com/watch?v=6tjSVXpHrE8&t=752s) **Encouraging Community Participation in Iceberg** - The speaker emphasizes that Iceberg's growth depends on active community involvement, thanks viewers, and invites them to join the open‑source ecosystem.

## Full Transcript
[0:00] You may have heard of the term "big data", but why is that important? The answer you get today might be something along the lines of the fact that a huge amount of data is required to train, tune and evaluate the AI models that are the future of computing. But managing all of this data can be really difficult. Luckily for us, we have the open source project Apache Iceberg to make things much easier.

[0:33] In this video, I'll be taking you through a brief history of big data and its challenges and solutions of the last two decades so that you can walk away with an understanding of why Apache Iceberg is such a great choice for modern data management. But before we get into that, let's define what a data management system is.

[0:55] We can think about it in terms of a library. A library, similar to big data, stores more content than ever before, not just in physical books, but in digital storage as well. And that's the first component of our library: we need a good amount of storage for all of these different types of content.

[1:23] The second component is some sort of processing power, so some way to satisfy the library visitors' requests. And in a library we can sort of think of the librarian as the processing power.

[1:41] We also need to keep some sort of metadata, which would be information on how the content of the library is organized. So maybe they use the Dewey Decimal System. It might also store some metadata on that metadata. And this can provide something like a historical record of the library's contents over time.

[2:13] So, of course, these components do not just apply to a library. They really apply to any data management system. The only difference is the scale at which they work. So organizations that do a lot of data processing today are doing so at a much larger scale than a library is. Hence the term "big data".
[2:34] And big data is getting even bigger all the time. So let's go back to the dawn of big data to see how the problem has evolved over time so that we can frame our discussion on why Apache Iceberg is such a great choice. So we'll start in the early 2000s. And this, of course, is the adolescence of the Internet. Thanks to the Internet, we're now processing more data than ever before. And it's, of course, much more data than a single machine is capable of handling.

[3:10] So in 2005, in order to address this, Apache Hadoop is open sourced and it provides a multi-machine architecture. It's composed of two main parts. First is a set of on-prem distributed machines called the Hadoop Distributed File System. It also has a parallel processing model called MapReduce that processes the underlying data. So this is cool because it's easier to just add a machine to our cluster whenever the volume of data that we're working with scales up.

[4:06] But there is a pain point, and that is with MapReduce. MapReduce jobs are essentially Java programs and they're much more difficult to write when compared with the simple one-line SQL statements that a data analyst would be more familiar with. So this would be like going to a library in order to find a particular book, but when you get there, you find that you and the librarian speak different languages. We clearly have a bit of a bottleneck at the processing stage.

[4:40] But a few years later, in 2008, Apache Hive comes onto the scene in order to solve this problem. Its main draw is its ability to translate SQL-like queries into MapReduce jobs. But it comes with a bonus feature as well, and that is the Hive Metastore. This is a meta database that essentially stores pointers to certain groups of files in the underlying file system.
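The two ideas just described, a map/shuffle/reduce pass and a metastore that points to groups of files, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `FILES` and `METASTORE` dictionaries, the paths, and every function name are hypothetical stand-ins, not the Hadoop or Hive APIs.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for files in a distributed file system:
# path -> rows of (country, amount). This is not real HDFS.
FILES = {
    "/sales/year=2007/part-0": [("US", 10), ("DE", 5)],
    "/sales/year=2008/part-0": [("US", 7), ("FR", 3)],
    "/sales/year=2008/part-1": [("DE", 4), ("US", 1)],
}

# Toy "metastore" (an assumption for illustration): partition value -> group of files.
METASTORE = {
    2007: ["/sales/year=2007/part-0"],
    2008: ["/sales/year=2008/part-0", "/sales/year=2008/part-1"],
}


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for country, amount in rows:
        yield country, amount


def shuffle(pairs):
    """Group values by key, as happens between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_country(year):
    # Roughly what an analyst would write as:
    #   SELECT country, SUM(amount) FROM sales WHERE year = <year> GROUP BY country
    paths = METASTORE.get(year, [])  # consult the metastore: read only matching files
    pairs = (pair for path in paths for pair in map_phase(FILES[path]))
    return reduce_phase(shuffle(pairs))


print(total_by_country(2008))
```

The one-line SQL in the comment and the three hand-written phases compute the same thing, which is the usability gap the speaker describes; the metastore lookup shows why a pointer to "certain groups of files" saves work, since the 2007 file is never read for a 2008 query.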
[5:23] So now when a query is submitted, it's done so in SQL. Hive accesses its metastore to optimize this query before it's finally sent off to MapReduce. So taking it back to our library example again, we now have a pocket translator that we can use to speak to the librarian. The librarian also has a cheat sheet that they can use to find where a particular genre of book is stored on its shelves.

[6:00] So this works very well for a while, until the 2010s. And at this point, we have another problem of scale. The reason for this is we have more mobile devices than ever before. So we have a lot of smartphones, we have a lot of Internet of Things devices, and they're all producing more data than ever. To handle this increase in the amount of data, organizations are more and more turning to cloud-based S3 storage. The reason being that S3 storage is much more affordable and even easier to scale than HDFS would be.

[6:53] Unfortunately, Hive cannot talk to S3 storage. It can only talk to HDFS. But there is another problem as well. More and more, instead of doing the traditional scheduled batch processing that was more popular, we're now doing a lot more on-demand, real-time processing, like what something like the Presto query engine can do. And Hive is just too slow for this use case.

[7:26] So we have two problems, but unfortunately there's a third as well. And organizations don't really want to start from scratch with their data management system. They still have a lot of storage of data in HDFS, and batch processing is certainly not obsolete. It has its place in the ecosystem. So perhaps they want to run some batch jobs using their existing Hive instance or a query engine like Apache Spark.

[8:05] So luckily for us, we don't have to wait too long for a solution.
[8:09] In 2017, Apache Iceberg is open sourced, and it promises not only to solve all of these problems, but also to introduce new features of its own. Iceberg is really interesting because essentially, rather than providing its own storage and compute layers, it's simply a layer of metadata in between. So like in Hive, Iceberg's metadata contains a picture of how the underlying storage is organized. But Iceberg, however, keeps a much more fine-grained picture than Hive does.

[8:56] So if we compare it to our library example, now that we're using Apache Iceberg, our library is more like one that makes use of the Dewey Decimal System and has a very organized index to keep track of all of that. As you can imagine, that means requests are processed much faster.

[9:15] But it's not just more efficient. Iceberg's metadata makes it more flexible as well. Since we're essentially decoupling the storage and the compute using this extra layer of separation of the metadata, we now have the flexibility to query using any number of processing engines and to access data in any number of underlying storage systems. The only requirement is that all the pieces of the ecosystem understand Iceberg's metadata language.

[9:50] So again, taking it back to our library example, rather than having the single librarian who does not speak our language, the library has kindly hired several more librarians that speak a variety of languages. Their key qualification is, of course, that they can understand the library's index. And as I mentioned, the index itself is a lot more detailed. So not only can we point to the physical shelves of the library, we can also point to the digital content as well.

[10:22] But Iceberg is more than just efficient and flexible. It provides several new features of its own, mostly in the realm of data governance.
[10:31] With Iceberg, you can do data versioning operations, ACID transactions, schema evolution, partition evolution, and more. And initially it sounds like that would require a lot of extra infrastructure in order to support. But in fact it is thanks to an extra layer of metadata that Iceberg keeps, and this time the metadata is meta-metadata. So Iceberg essentially takes snapshots of our data at particular points in time. And this is what allows us to have really fine-grained control over the integrity and the consistency of our data.

[11:13] So let's bring it back one last time to our library. Say, in our library, we want to add a historical record of the contents over time. Well, we already have the pretty detailed index that we keep. It's actually not that much extra information that we have to store in order to tell, for example, when a particular piece of content was added to the collection. So we now have data governance features with only needing to store one extra field in our index. And, as much as this is not a lot of extra information, it is a big-impact change.

[11:50] And this is really the theme of Iceberg overall. Due to the clever way that it organizes its metadata, Iceberg is efficient, flexible and feature rich, all with very little relative overhead. So now as we move into the mid-2020s and as data is getting even bigger thanks to this AI boom, it becomes clear why Iceberg continues to be such a popular choice for modern data management.

[12:26] So now that you know what Iceberg is, I would really encourage you to go out and get involved. Like all open source communities, Iceberg will only continue to improve the more people that participate in the discussion. So thank you for watching and I hope to see you out there in the open source world.
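As a companion to the transcript's description of fine-grained metadata and snapshots, here is a minimal sketch of the idea in plain Python. It is an illustration only, not Iceberg's file format or API: `DataFile`, `Snapshot`, `ToyTable`, the `min_id`/`max_id` statistics, and the `s3://example-bucket/...` paths are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    min_id: int  # smallest value of a hypothetical "id" column in this file
    max_id: int  # largest value of that column in this file


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # every data file visible in this version of the table


@dataclass
class ToyTable:
    snapshots: list = field(default_factory=list)

    def append(self, new_file):
        """Commit a new snapshot: previous files plus one more. Old snapshots are untouched."""
        previous = self.snapshots[-1].files if self.snapshots else ()
        snapshot = Snapshot(len(self.snapshots) + 1, previous + (new_file,))
        self.snapshots.append(snapshot)
        return snapshot.snapshot_id

    def plan_scan(self, wanted_id, snapshot_id=None):
        """List the files that may contain wanted_id, as of the given snapshot (default: latest)."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot = self.snapshots[-1]
        else:
            snapshot = self.snapshots[snapshot_id - 1]
        # File-level statistics let us skip files without opening them.
        return [f.path for f in snapshot.files if f.min_id <= wanted_id <= f.max_id]


table = ToyTable()
v1 = table.append(DataFile("s3://example-bucket/events/a.parquet", 1, 100))
v2 = table.append(DataFile("s3://example-bucket/events/b.parquet", 101, 200))

print(table.plan_scan(150))                  # latest version of the table
print(table.plan_scan(150, snapshot_id=v1))  # the table as it looked after the first commit
```

Because each snapshot is an immutable list of files, reading an older version is just a matter of picking an older entry, which mirrors the speaker's point that versioning costs very little extra metadata. The per-file value ranges stand in for the "much more fine-grained picture" that lets an engine skip files it does not need.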