
AI-Powered Document Intelligence

Key Points

  • Writing—from cave paintings to PDFs—has been humanity’s core technology for capturing and transmitting information, making documents the primary vessels of data across history.
  • In today’s data‑driven world, the biggest obstacle for developers is that most documents are unstructured, requiring conversion into highly structured, machine‑readable formats to support reliable decision‑making.
  • AI agents and document‑intelligence tools are emerging solutions that can automatically process raw, heterogeneous content—text, punctuation, and varied tabular layouts—to extract usable data at scale.
  • Real‑world documents differ dramatically in length and complexity, often containing nested tables that span multiple pages, demanding sophisticated extraction techniques to turn such chaotic inputs into consistent, actionable datasets.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=_pEEJu-2KKM](https://www.youtube.com/watch?v=_pEEJu-2KKM)
**Duration:** 00:20:08

- [00:00:00](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=0s) **Transforming Unstructured Documents with AI** - The speaker traces the evolution of written language as a technology, frames documents as unstructured data, and proposes using AI agents and document intelligence to convert that data into structured, decision‑ready information.
- [00:03:02](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=182s) **Challenges of Large Document OCR** - The speaker outlines how massive, table‑heavy documents exceed 600 pages, exposing OCR’s inability to preserve semantics across page breaks and emphasizing the need to treat documents as interrelated, unstructured data.
- [00:06:09](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=369s) **Document Hierarchies in R&D & Supply Chain** - The speaker illustrates how linked artifacts—research papers citing one another, patents, product documentation, and supply‑chain records such as bills of lading, insurance certificates, receipts, and claims—create horizontal and vertical hierarchies that trace the provenance, relationships, and accountability of ideas and physical goods.
- [00:09:11](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=551s) **Estimating Parameters for Language Models** - The speaker outlines the finite character and numeric space of English, estimates that modeling this space requires 600–700 billion parameters, and briefly introduces the input‑output and token‑embedding concepts behind generative pre‑trained transformers.
- [00:12:43](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=763s) **Attention Layer for Stylistic Output** - The passage explains that the model’s attention output layer controls stylistic preferences (e.g., New York vs. Silicon Valley tone) and then discusses the difficulty of distilling large documents—especially after OCR expansion—into a compact set of key data points.
- [00:15:47](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=947s) **Modular Document Processing Agents** - The speaker outlines a suite of specialized agents—inspection, OCR, vectorization, and splitting—that collaboratively analyze file characteristics, extract text from images, generate embeddings with LLMs, and intelligently segment documents.
- [00:19:01](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=1141s) **From Deterministic Pipelines to Autonomous Agents** - The speaker explains shifting sequential, deterministic workflows into event‑driven, autonomous agent interactions, highlighting gains in efficiency, scalability, and the emergence of a non‑deterministic design space.
Written language is one of humanity's most important and transformative technologies. We may not think about writing or alphabets as technology, but the lineage runs from cave paintings through hieroglyphics, cuneiform, Gutenberg's movable-type printing press, and Xerox copies, right up to modern digital formats like the Portable Document Format. Throughout history, people have written important things down: we tend to record important events and complicated collections of facts in written form.

Today we live in a data-driven world, and as developers and technologists we are trying to build systems that support data-driven decision making. The challenge we have is that documents are unstructured, so from an engineering or developer perspective, documents are unstructured data. To leverage document data in decision making, we have to structure that data. That means talking about how we can process raw, unstructured data into highly structured, usable data, or rather, how we can use very powerful tools to enhance and improve our ability to work directly with unstructured data. Today we're going to take a look at two of my favorite things: AI agents and document intelligence.

Before we jump into the AI conversation, let's make sure we frame the document problem as a data challenge. We all know what a document is. Take document one: it has a title and a lot of words, some short and some long. If we're lucky, it has some structure, like paragraphs, and it has punctuation throughout.
In business environments we also encounter documents that, like document one, have a lot of words, but that additionally contain tabular data: say a simple two-by-four table, and then a more complicated five-by-three table further down. So documents contain words and punctuation, and they contain tabular data. Another thing we see is that some documents are short while others get very long and quite complicated. We can get documents of 600-plus pages, where word counts really start to take off. Document four might be like document two, with words and tables everywhere, and we can run into complexities like tables that span multiple pages: lots of pages with long tables, perhaps tables in excess of 10,000 rows. This is our document space, and it can be viewed, traditionally, as an unstructured-data problem.

When we meet practitioners who have wrestled with this problem in the past, the technology we always hear about is optical character recognition (OCR). OCR uses computer vision to go through a document, recognize the different characters and words, and translate them into text. With OCR you can even have some luck recognizing tables. But we get into real trouble when we run into things like page breaks, and when we translate data with OCR we have no real semantic understanding of what we have done; we have really just produced a lot of text. Let's keep that problem in mind.

The other thing to note about documents is that any one document is not that valuable on its own, because documents tend to relate to one another. Let's talk for a minute about what we call hierarchies. There are vertical hierarchies and horizontal hierarchies, and by hierarchy we mean the way documents in a population relate to one another to create a logical whole.

Here are a few examples. Imagine a financial or legal use case in which we have a master service agreement as a contract. Underneath that master service agreement we have a statement of work, which is later amended; later still we have a second statement of work, which can also be amended, and so on. What can happen next? A statement of work may spawn a purchase order, and a purchase order may eventually produce an invoice. When we talk about vertical hierarchy, we mean having to understand this set of documents all together to understand the whole meaning of the relationship.

Then we have horizontal hierarchies, which require us to look at different document types and how they relate to one another. Two quick examples. In an R&D or engineering space, I may have a research paper, and later another important research result; call it R2. That result may cite work or results in the first paper.
Later still we may have another research result, R3, which may cite R2 and the original paper. As an example of a horizontal hierarchy, we may eventually work outward to a patent filing, say a US patent, and later still to productization and product documentation. In this example, the research work gives us the epistemology of the ideas and, through the citations, who is creating them, while in the horizontal space we have the evolution toward US patent filings and productization.

A last, very short example involves people who move physical goods around the world, supply-chain people, where we have bills of lading; when we ship something expensive, we probably also have a certificate of insurance showing that we have insured what we are shipping; and eventually we send the shipment to someone who produces a shipping receipt of what they believe they received. If they think they received something damaged, they file a claim, and the claim relates both to whom we shipped and to our certificate of insurance. So to understand the shipper's relationships, we need to understand the bills of lading and the certificates of insurance, and to understand the horizontal relationships, we need to understand how bills of lading relate to shipping receipts and how the certificate of insurance relates to a claim.

So we've introduced documents as a data problem, and we've discussed how documents are fundamentally constructed from language and data types like numbers and dates. Now let's talk about the breakthrough. The breakthrough, the big new tool we have today, is the GPT model: the foundation models that allow us to develop these large language models.
GPT stands for generative pre-trained transformer. A lot of the technology in these GPT models is borrowed from neural nets, but what we now have is the ability to apply these transformers to a language, which is finite. Take English. The English language has about 170,000 words in the active vocabulary (a little more than that, in fact). We also have the 26 characters: a, b, c, and so on through z. And of course we have numbers, which can be represented in different ways, like the decimal 5.1. The numbers we know to be infinite, but they are somewhat easy to recognize. So basically we have a mostly finite space: finite when it comes to language, still infinite when it comes to numbers, but ultimately an area we can work in. When we consider these numbers and look at some of the open-source LLMs on the market, we see that parameterizing a space like the English language ends up requiring a little over 600 billion parameters, somewhere between 600 and 700 billion.

Now let's jump briefly into what "generative pre-trained transformer" is all about. Basically, we want a machine that can take inputs and give us expected outputs; we are trying to create inputs and outputs that make sense. Our inputs are going to be language.
We will call them tokens later on, but basically we are doing embedding, which is taking vocabulary and turning it into math. These are essentially one-dimensional vectors, capturing the concept of distance between things in a one-dimensional space. From here we start using transformers, and what transformers do is create a really high-dimensional space, so that instead of just looking at the distance between two things, you start to see two-dimensional representations, the sort of pictures you may recognize if you have looked at any of the literature out there: think of moving from a 1D vector to a 2D matrix representation of a multi-dimensional space.

Attention and normalization start to deal with grouping. This idea of grouping and normalization is the fancy way to move from one-dimensional vectors up into really big matrices, and then to chunk things into smaller groups of matrices. It's all fancy math that you can look into further if you'd like. Then we get to softmax and attention output. Softmax is a probabilistic algorithm that takes the input tokens, runs a bunch of computation, and produces a set of output scores that together have probability one; it is the thing that determines the most likely output you should expect given what you put in. The attention output, as we said, is related to that most-likely output; it is the layer where you say, "I want my answers to sound like you're from New York," or "I want my answers to sound like you're from Silicon Valley."
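The softmax step mentioned above is compact enough to sketch directly. This is a generic illustration in plain Python, not the implementation inside any particular model:

```python
import math

def softmax(scores):
    """Turn a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtract the max score first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens.
probs = softmax([2.0, 1.0, 0.1])
# The probabilities sum to 1, and the highest-scoring candidate
# receives the largest share.
```

A model's final layers do essentially this over the whole vocabulary, which is how "the most likely thing you're expecting to see" gets chosen.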
So if you want stylistic outputs, that is the layer you are usually operating in. And when you project vocabulary, that is literally your output.

So we have this new, exciting tech, and we go back to our problem of having a document. Actually, we have a lot of documents, but let's look at one document with, say, a thousand words in it. What we are trying to get to is a data-model representation of this document, and the data model we care about might be small: we might really care about only 20 or 50 key data points out of this larger document. The mistake people make is to think of getting from the document to the data model as a reductionist process, as if we were whittling the document down to find those few key points we want to end up with. In reality, a huge expansion happens first. We may apply OCR, and as we do, the data expands from a thousand data points to maybe 1 million or 10 million data points. Then we apply some natural language processing and develop still more data. And in the end we get to something like an LLM, where we deal with even more data still. So really what we have is an expansion, and that expansion is what then allows us to contract back down to the data model, which is where we make everybody happy: a compact representation of the data we care about.

So that is our data layer, that is our LLM breakthrough, and those are some of the older technologies that work together in this overall pipeline. What remains is to talk about how we can develop agentic workflows, or develop agents, to do this work for us.

Now that we've covered some of the newer technologies, in particular LLMs and how LLMs can play with technologies we're more familiar with, like OCR and NLP, let's talk about how this comes together in developing agents, and what agentic workflows are versus traditional workflows. Let's start by naming a couple of agents and thinking about which would be useful.

We can have an inspection agent, which takes a file and does some deep file inspection: it may take checksums, look at word spacing within files, look at file length and file size, and look at what kinds of contents are in the file.

Another useful agent could be an OCR agent. Here we take the most performant OCR engine we really like, or, if we have a multi-modal model, some folks are starting to look at LLMs themselves as able to replace the OCR. Essentially this agent takes our image data and transforms it into text data: alphanumerics, tables, and so on.

We could also have a vectorize agent, which is almost certainly heavily involved with an LLM of our choice.
The vectorize agent chunks our document into tokens and groupings and runs it through an LLM to develop some of that vectorization magic we just spoke about.

We can also make a splitter agent: an agent that looks at everything we've done up to this point, makes determinations, and learns where documents should be split. We may find that we want to split a document into multiple documents, for instance in cases where we got sloppy and combined several documents into one before processing.

An extraction agent, or extract agent, is another agent we can create. The extract agent is key to taking all that data we described and helping us get back down to identifying the really critical data model we talked about. It will involve a lot of prompting, much of it automated, lined up with the data model to allow us to identify those key data points.

And lastly, in the little example we're working through, maybe we have a matching agent. The matching agent has access to all this metadata, and it does some of the magic we were talking about to help establish those horizontal and vertical hierarchies, so that we take our swarm of documents and start to understand them logically and holistically, and also more horizontally, from a transactional standpoint.
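To give a flavor of what an inspection agent like the one above might report, here is a minimal sketch in plain Python. The field names and the idea of returning a metadata dictionary are assumptions for illustration, not a prescribed interface:

```python
import hashlib
from pathlib import Path

def inspect_file(path):
    """Deep-inspect a file: checksum, size, and basic content signals."""
    data = Path(path).read_bytes()
    text = data.decode("utf-8", errors="replace")
    words = text.split()
    return {
        # Checksum identifies duplicates and verifies integrity.
        "checksum_sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Crude hint (hypothetical heuristic) that the file contains tabular data.
        "looks_tabular": any("\t" in line or "|" in line for line in text.splitlines()),
    }
```

Downstream agents (OCR, splitter, and so on) could read this metadata to decide how to treat the file.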
If we look at how this is arranging itself, it's tempting to think of agentic workflows in a linear framework, similar to how we would probably set up a data pipeline before taking an AI approach: we think about inputs and outputs, where the output of one stage becomes the input to the next stage, and so on. We might describe such sequential workflows as deterministic. What we're really getting to today, when we think about rearranging these sequential workflows, is that we can arrange agentic workflows that are more autonomous and that are triggered by events, such as new data arriving in the area where we've given these agents scope to operate. And perhaps we can create interactions where agents look at the work of other agents and then do their own useful piece of the work. The possibility we open up here is not only autonomy, which can lead to efficiency in computing-resource use and to scalability; we're also entering a non-deterministic space, which potentially opens up a lot of new possibilities that we haven't yet considered.
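The shift from a fixed pipeline to event-triggered agents can be sketched with a tiny in-process event bus. Everything here (event names, the three toy agents) is hypothetical and intended only to illustrate the pattern:

```python
from collections import defaultdict

class EventBus:
    """Agents subscribe to event types; emitting an event triggers them."""
    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # record of events, for observability

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def emit(self, event_type, payload):
        self.log.append(event_type)
        for handler in self.handlers[event_type]:
            handler(self, payload)

# Hypothetical agents: each reacts to an event and emits the next one.
def inspection_agent(bus, doc):
    doc["inspected"] = True
    bus.emit("file.inspected", doc)

def ocr_agent(bus, doc):
    doc["text"] = "extracted text"
    bus.emit("text.extracted", doc)

def extract_agent(bus, doc):
    doc["data_model"] = {"key_points": 20}

bus = EventBus()
bus.subscribe("file.arrived", inspection_agent)
bus.subscribe("file.inspected", ocr_agent)
bus.subscribe("text.extracted", extract_agent)

doc = {}
bus.emit("file.arrived", doc)  # new data arriving triggers the chain
```

Unlike a hard-coded pipeline, nothing here dictates a single order: several agents can subscribe to the same event, and any agent can react to the work of any other, which is where the non-deterministic design space comes from.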