AI-Powered Document Intelligence
Key Points
- Writing—from cave paintings to PDFs—has been humanity’s core technology for capturing and transmitting information, making documents the primary vessels of data across history.
- In today’s data‑driven world, the biggest obstacle for developers is that most documents are unstructured, requiring conversion into highly structured, machine‑readable formats to support reliable decision‑making.
- AI agents and document‑intelligence tools are emerging solutions that can automatically process raw, heterogeneous content—text, punctuation, and varied tabular layouts—to extract usable data at scale.
- Real‑world documents differ dramatically in length and complexity, often containing nested tables that span multiple pages, demanding sophisticated extraction techniques to turn such chaotic inputs into consistent, actionable datasets.
Sections
- Transforming Unstructured Documents with AI - The speaker traces the evolution of written language as a technology, frames documents as unstructured data, and proposes using AI agents and document intelligence to convert that data into structured, decision‑ready information.
- Challenges of Large Document OCR - The speaker outlines how massive, table‑heavy documents exceed 600 pages, exposing OCR’s inability to preserve semantics across page breaks and emphasizing the need to treat documents as interrelated, unstructured data.
- Document Hierarchies in R&D & Supply Chain - The speaker illustrates how linked artifacts—research papers citing one another, patents, product documentation, and supply‑chain records such as bills of lading, insurance certificates, receipts, and claims—create horizontal and vertical hierarchies that trace the provenance, relationships, and accountability of ideas and physical goods.
- Estimating Parameters for Language Models - The speaker outlines the finite character and numeric space of English, estimates that modeling this space requires 600–700 billion parameters, and briefly introduces the input‑output and token‑embedding concepts behind generative pre‑trained transformers.
- Attention Layer for Stylistic Output - The passage explains that the model’s attention output layer controls stylistic preferences (e.g., New York vs. Silicon Valley tone) and then discusses the difficulty of distilling large documents—especially after OCR expansion—into a compact set of key data points.
- Modular Document Processing Agents - The speaker outlines a suite of specialized agents—inspection, OCR, vectorization, and splitting—that collaboratively analyze file characteristics, extract text from images, generate embeddings with LLMs, and intelligently segment documents.
- From Deterministic Pipelines to Autonomous Agents - The speaker explains shifting sequential, deterministic workflows into event‑driven, autonomous agent interactions, highlighting gains in efficiency, scalability, and the emergence of a non‑deterministic design space.
Full Transcript
# AI-Powered Document Intelligence

**Source:** [https://www.youtube.com/watch?v=_pEEJu-2KKM](https://www.youtube.com/watch?v=_pEEJu-2KKM)
**Duration:** 00:20:08

## Sections
- [00:00:00](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=0s) **Transforming Unstructured Documents with AI**
- [00:03:02](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=182s) **Challenges of Large Document OCR**
- [00:06:09](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=369s) **Document Hierarchies in R&D & Supply Chain**
- [00:09:11](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=551s) **Estimating Parameters for Language Models**
- [00:12:43](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=763s) **Attention Layer for Stylistic Output**
- [00:15:47](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=947s) **Modular Document Processing Agents**
- [00:19:01](https://www.youtube.com/watch?v=_pEEJu-2KKM&t=1141s) **From Deterministic Pipelines to Autonomous Agents**

## Full Transcript
Written language is one of humanity's most important
and transformative technologies.
We may not think about writing or alphabets as technology,
but beginning with cave paintings
and working forward into hieroglyphics,
cuneiform, Gutenberg's movable
type printing press, Xerox copies,
and now even modern digital formats like the Portable Document Format,
people throughout history have written important things down.
We tend to record important events, and important
and complicated collections of facts, in written form.
Today we live in a data-driven world, and as developers and technologists,
we're trying to develop ways
to support decision-making that is truly data-driven.
The challenge that we have is that documents are unstructured.
And so from an engineering
or developer perspective, documents are unstructured data.
What we're trying to do to leverage document
data in decision making is to structure the data.
And to do this we need to talk about how we can process
raw, unstructured data to create highly structured, usable data.
Or rather, how we can use very powerful tools
that can enhance and improve our ability
to have humans work directly with unstructured data.
Today we're going to take a look at two of my favorite things:
AI agents and document intelligence.
Before we jump into the AI conversation, let's make sure that we frame
the document problem as a data challenge.
So we all know what a document is, right?
So here we may have a document.
We'll call it document one.
And inside this document,
documents tend to have a title and may have a lot of words.
Some of our words are short and some of our words are long.
If we're lucky, maybe this thing's got some structure, like some paragraphs.
It's going to have some punctuation throughout, etc.
Now, we know that in a variety of business environments
we're also going to come across documents that have a lot of words,
like document one,
but that are often going to have tabular data as well.
So we may have examples of tabular data,
like a two-by-four table,
and then we may have a more complicated table down here,
say a five-by-three.
Okay.
Now we know that
documents are going to contain, you know, words and punctuation.
We know they're going to contain tabular data.
But another thing that we're going to see is that
some documents are short and some documents
can get very long and quite complicated.
Right.
So we can get documents that are 600 pages plus word counts
really start to take off on us.
And then over here, document four
may be an example of something like document two,
where we've got words and tables everywhere.
And we can start getting into complexities like tables
that run across pages.
Right.
So we can get into lots and lots of pages with long tables.
Maybe we have tables with in excess
of even 10,000 rows.
So this is our document space.
And this can traditionally be viewed
as an unstructured data problem.
Okay.
And when we run into practitioners
who have wrestled with this problem in the past, the technology
we always hear about is optical character recognition.
And what optical character recognition does
is use computer vision to go through this document
and recognize
different characters and words,
which it can translate into text.
With OCR, you can even have some luck, perhaps, recognizing tables.
But we get into real trouble
when we run into things like page breaks,
and when we translate data with OCR,
we don't have any real semantic understanding of what we've done.
We've really just accomplished creating a lot of text.
Okay.
So let's keep that problem in mind.
The other thing that we want to talk about with documents
is that any one document is not that valuable,
because documents tend to relate to one another.
Okay.
So let's talk for a quick minute about what we call hierarchies.
Okay.
And with hierarchies we have vertical hierarchies
and horizontal hierarchies okay.
And what we mean by hierarchies is how documents
in a population relate to one another to create a logical whole.
Here are a few examples.
Let's imagine that we're in sort of a financial legal use case,
and we have a master service agreement as a contract.
But then underneath this master service agreement, we have a statement of work.
And then this statement of work is later amended.
And then even later still, we have a statement of work number two,
and this statement of work can also be amended, and so on.
So now as we look at this, we say, okay, well, what can happen next?
A statement of work may spawn a purchase order,
and a purchase order may eventually have an invoice.
So when we think about vertical hierarchy, we're talking about
having to understand this set of documents all together
to understand the whole of the meaning of this relationship.
And then we start to
have horizontal hierarchies that require us
to look at different document types and how they relate to one another.
We can use two more, really quick examples here.
So if I'm in an R&D, engineering space, I may have a research paper,
and then later on we may have another important research result.
We'll call that R2.
And that result may cite work or results in the first one.
And then later on we may have another research result, R3,
and it may cite R2
and the original paper.
As an example of a horizontal hierarchy,
we may eventually work outward and get to a patent filing.
So this might be a US patent.
And later still we may have productization and product documentation.
So again, in this example, we have a lot of research work
that gives us sort of the epistemology of the ideas,
and, in the citations, a record of who is creating the ideas.
And then in the horizontal space we have the evolution towards
US patent filings and productization.
A last, very short example would be people who move physical goods
around the world, supply chain type people, where we have bills of lading.
And when we ship something expensive, we probably have
a certificate of insurance that shows that we've insured what we're shipping.
And then we're going to eventually send this to someone who's going to have
a receiving record, a shipping receipt, of what they believe they received.
And if they think they received something that was damaged,
they're going to file a claim.
And the claim is going to relate to who we shipped to,
and it's going to relate to our certificate of insurance.
So again, to understand the vertical relationships
for the shipper, we need to understand the bills of lading and the certificates of insurance.
And then to understand the horizontal relationships,
we need to understand how the bills of lading relate to shipping receipts
and how the certificate of insurance relates to a claim.
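These vertical and horizontal hierarchies can be sketched as a small directed graph, where each edge records how one document relates to another. The document IDs and relation names below are illustrative assumptions, not a schema from the talk, just a minimal way to make the idea concrete:

```python
# Minimal sketch of document hierarchies as a directed graph.
# Document IDs and relation names are hypothetical examples.
from collections import defaultdict

# (source_doc, relation, target_doc) triples
edges = [
    ("SOW-1",     "under",      "MSA"),     # vertical: statement of work under master agreement
    ("SOW-1-A1",  "amends",     "SOW-1"),   # vertical: amendment to the SOW
    ("PO-100",    "spawned_by", "SOW-1"),   # vertical: purchase order from the SOW
    ("INV-100",   "bills",      "PO-100"),  # vertical: invoice against the purchase order
    ("CLAIM-7",   "cites",      "COI-7"),   # horizontal: claim references certificate of insurance
    ("RECEIPT-7", "matches",    "BOL-7"),   # horizontal: shipping receipt matches bill of lading
]

graph = defaultdict(list)
for src, rel, dst in edges:
    graph[src].append((rel, dst))

def ancestors(doc):
    """Walk 'upward' from a document to everything it depends on."""
    seen, stack = [], [doc]
    while stack:
        node = stack.pop()
        for _, parent in graph.get(node, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen

# The invoice traces back through the purchase order and SOW to the MSA.
print(ancestors("INV-100"))  # ['PO-100', 'SOW-1', 'MSA']
```

With a structure like this, understanding "the whole of the meaning" of a document becomes a graph traversal rather than a search through loose files.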
So we've introduced documents as a data problem,
and we've discussed how documents are fundamentally constructed
with language and data types like numbers and dates.
Okay, so let's talk about the breakthrough.
The breakthrough and the big new tool that we have today
are these GPT models, which are foundation
models that allow us to develop these large language models.
And GPT stands for generative
pre-trained transformer.
Okay.
So a lot of the technology that's in these GPT
models is borrowed from neural nets.
But what we have is the ability to apply
these transformers to a language, which is finite.
So let's talk about the English language here.
So in the English language we have about 170,000 words,
a little more than that actually,
that are in the active vocabulary.
Right.
We also know that we have a, b, c, and so on through z,
so we know we've got 26 characters.
And then of course we know that we have numbers.
Numbers can be represented different ways, right?
Like you can have a decimal, 5.1, or whatever.
But ultimately the numbers, we know, are infinite,
though they're somewhat easy to recognize.
So basically you've got this mostly finite space,
finite when we come to language, still infinite when we come to numbers.
But ultimately we've got an area that we can work in here.
And when we consider these numbers
and we look at some of the open-source LLMs that are on the market,
we know that this type of a space
ends up requiring greater than 600 billion
parameters.
So to parameterize
a space that looks like the English language, we end up with
somewhere between 600 and 700 billion
parameters. Okay.
Now, let's jump for a minute into this.
We talked about generative pre-trained transformers.
Let's very briefly hit on what that's all about.
So basically, you know, we're trying to have inputs.
We want a machine that can take inputs
and give us, you know, expected outputs.
Right.
So we're trying to create inputs and outputs
that make sense.
Our inputs are going to be language, right?
We're going to call them tokens later on.
But basically we're trying to do embedding,
which is taking vocabulary and turning it into math.
So this is basically one-dimensional vectors,
sort of looking at the
concept of distance between things in a one-dimensional space.
And then from here we start using transformers.
And what transformers do is start to create
a really high-dimensional space, so that instead of just looking
at the distance between two things,
you start to see two-dimensional representations
that look something more like this,
if you've ever looked at any of the literature that's out there.
So this is kind of like 1D, and this is like a 2D matrix
representation of a multi-dimensional space.
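To make the "vocabulary into math" idea concrete, here is a toy sketch: hand-made three-dimensional vectors for a few words, with cosine similarity as the distance measure. Real LLM embeddings are learned and have hundreds or thousands of dimensions; the words and numbers below are purely illustrative assumptions.

```python
# Toy embedding sketch: hand-picked vectors, cosine similarity as "distance".
# Real learned embeddings are high-dimensional; these values are illustrative only.
import math

embeddings = {
    "invoice": [0.9, 0.1, 0.0],
    "receipt": [0.8, 0.2, 0.1],
    "giraffe": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Related document words sit close together; unrelated words sit far apart.
print(cosine_similarity(embeddings["invoice"], embeddings["receipt"]))
print(cosine_similarity(embeddings["invoice"], embeddings["giraffe"]))
```

The same distance idea carries over unchanged when the vectors come from a real embedding model instead of being typed in by hand.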
Attention and normalization start to deal with grouping.
So this idea of grouping and normalization
is a fancy way to move from one-dimensional vectors
up into really big matrices
and then chunk things into smaller groups of matrices.
Okay.
This is all fancy math that you can go, look into more if you'd like to.
And then when we get to softmax and attention output,
softmax is a really fancy probabilistic algorithm
that basically looks at input tokens
and then does a bunch of computation to create an output,
a set of output tokens whose probabilities sum to one, meaning
this is the thing that's determining what is the most
likely thing
you're expecting to see based on what you put in.
The attention output, as we said, is kind of related to that most likely result.
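Softmax itself is only a few lines: it turns a vector of raw scores (logits) into a probability distribution that sums to one, with the highest-scoring token getting the highest probability. The scores below are made-up illustrative values:

```python
# Softmax: turn raw scores into a probability distribution summing to 1.
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs)        # highest logit gets the highest probability
print(sum(probs))   # ~1.0
```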
This is the layer
where you say, hey, I want my answers to sound like you're from New York.
Or I want my answers to sound like you're from Silicon Valley.
So if you're going to do stylistic types of output, this is the layer
you're usually operating in.
And then,
when you project the vocabulary, that is literally your output.
Right.
So we have this new, exciting tech,
and we go back to our problem of having a document.
Okay. So we have a document.
And actually we have a whole set of documents,
but let's look at one document here.
And this document may have a thousand words in it.
Okay.
What we're trying to get to is
a data model representation of this document,
and the data model that we care about might be small.
We might really care about 20 key data points
or 50 key data points out of this larger document.
The mistake that people make is that, to get from here to there,
we tend to want to think about a reductionist process.
We're taking this document
and whittling it down
to find those couple of key points that we want to end up with over here.
But in reality, what happens is we have this huge expansion
that's happening, right?
So first, we may apply OCR.
And as we apply OCR, the data expands
from a thousand data points to maybe 1 million or 10 million data points.
Then we apply some natural language processing
and we develop still more data.
So we're expanding.
And then in the end we get to something like an LLM,
and we deal with even more data still.
So really what we're doing is we've got this expansion,
this expansion that's happening,
and then that's going to allow us to get back
to contracting down to this data model,
which is where we're going to make everybody happy,
right? If we can really get this
representation of the data that we care about.
Okay.
So this is our data layer.
This is our LLM breakthrough.
And these are some of the older technologies that work together
in this overall pipeline.
And so what remains for us is to talk about how we can develop
agentic workflows, or develop agents,
to do this work for us.
Now that we've talked about some of the newer technologies, in particular LLMs
and how LLMs can play with
other technologies we're more familiar with, like OCR and NLP,
let's talk about how this comes together
in developing agents, and what agentic
workflows are versus traditional workflows.
So let's start by naming a couple of agents and thinking about
what would be some useful agents.
So we can have an inspection agent.
And an inspection agent could take a file and do some deep file inspection.
Right.
So an inspection agent may take checksums.
It may look at word spacing within files.
It may look at file length.
It may look at file size.
It may look at aspects of what kind of contents are in the file okay.
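A minimal sketch of what such an inspection agent might compute, using only the Python standard library, could look like this. The particular fields (a checksum, the size, a rough word count) are my own choices for illustration, not a specification from the talk:

```python
# Sketch of an inspection agent: checksum, size, and rough content stats for a file.
import hashlib
import os

def inspect_file(path):
    with open(path, "rb") as f:
        data = f.read()
    try:
        text = data.decode("utf-8")
        word_count = len(text.split())
    except UnicodeDecodeError:
        text, word_count = None, None  # binary file: no word stats
    return {
        "path": path,
        "size_bytes": os.path.getsize(path),
        "sha256": hashlib.sha256(data).hexdigest(),
        "word_count": word_count,
        "looks_binary": text is None,
    }

# Example: inspect a small file we create on the fly.
with open("doc1.txt", "w", encoding="utf-8") as f:
    f.write("Master Service Agreement between parties A and B.")
print(inspect_file("doc1.txt"))
```

Downstream agents can then use this metadata, for instance routing scanned binaries to OCR while sending clean text straight to vectorization.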
Another agent that could be useful is an OCR agent.
So we go out and we take the most performant
OCR engine that we really like.
Or in some cases, if you have a multi-modal model,
some folks are starting to look at LLMs
themselves as being able to replace the OCR.
But essentially this agent can take our image data and
transform it into
text data, alphanumeric data, tables, etc.
We could have a
vectorization agent. A vectorization agent is probably, almost
certainly, pretty heavily involved with an LLM of our choice.
The vectorization agent is going to chunk
up our document into tokens,
in groupings, and run it through an LLM to develop
some of that vectorized magic that we just spoke about.
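The chunking step such an agent might perform can be sketched very simply: split a document into overlapping word-based chunks before sending each chunk to an embedding model. Real systems chunk by model tokens rather than words, and the sizes below are arbitrary illustrative choices:

```python
# Sketch: split a document into overlapping word-based chunks for embedding.
# Real pipelines count model tokens, not words; sizes here are illustrative.

def chunk_words(text, chunk_size=8, overlap=2):
    words = text.split()
    chunks, start = [], 0
    step = chunk_size - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += step
    return chunks

doc = " ".join(f"word{i}" for i in range(20))  # a stand-in 20-word document
for c in chunk_words(doc):
    print(c)
```

The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk, which helps the embeddings stay meaningful.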
The splitter agent. We can make a splitter agent here.
Here it is.
The splitter agent could be an agent
that takes a look at everything we've done up to this point
and makes determinations and learns where documents should be split.
Right.
So we may find that we want to split
documents into multiple documents,
in cases where maybe we got sloppy and combined
multiple documents into one before we processed them.
The extraction, or extract,
agent can be another agent that we create.
And the extract agent could be key in taking all that data
that we described and helping us get back down to identifying
that really critical data model that we talked about.
So the extract agent is going to have a lot of prompting.
It's going to have a lot of automated prompting
that's going to line up with the data model
to allow us to identify those key data points.
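One way to make that "data model" concrete is a small typed record that the extract agent's prompts are required to fill in. The fields below are hypothetical examples for a contract-style document; an actual data model would be defined by the use case:

```python
# Sketch: a compact target data model that an extract agent fills in.
# Field names are hypothetical examples for a contract-style document.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ContractRecord:
    document_id: str
    document_type: str                        # e.g. "MSA", "SOW", "PO", "invoice"
    counterparty: Optional[str] = None
    effective_date: Optional[str] = None      # ISO date string
    total_value: Optional[float] = None
    parent_document_id: Optional[str] = None  # supports the vertical hierarchy

# Pretend the extract agent's prompted LLM returned these values:
record = ContractRecord(
    document_id="SOW-1",
    document_type="SOW",
    counterparty="Acme Corp",
    effective_date="2024-03-01",
    total_value=250000.0,
    parent_document_id="MSA",
)
print(asdict(record))
```

Pinning the extraction prompts to a fixed schema like this is what lets the thousand-word document contract back down to the 20 or 50 key data points we actually care about.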
And then, lastly, with our little example
we're working through here, maybe we have a match, or matching, agent.
And the matching agent has access to all this metadata,
and it's doing some of that magic
that we were talking about
to help us establish those horizontal and vertical hierarchies,
so that we take our swarm of documents and start to understand
the documents holistically, from a logical standpoint,
or more horizontally, from a transactional standpoint.
So if we kind of look at how this is arranging itself,
it's tempting to think of agentic workflows
in sort of a linear framework,
which is similar to how we would probably set up a data pipeline prior
to taking an AI approach, right, where we tend to think about
inputs and outputs:
the output from one stage becomes the input to the next stage,
and so on.
And so what we're really getting to today,
when we think about rearranging these types of sequential workflows,
which we might describe as deterministic,
is that maybe we can arrange
agentic workflows, which are more autonomous
and which are triggered by events, such as new data arriving into
the area where we're giving these agents scope to operate.
And perhaps we can create interactions where these agents
are looking at the work of the other agents,
and then they're doing their useful piece of the work.
Right.
And so the possibility that we open up here
is not only autonomy, which could lead to efficiency
in computing resource use, and scalability,
but we're also kind of entering this non-deterministic
space, which potentially
opens up a lot of new possibilities that we haven't yet considered.
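As a closing sketch, here is one way that event-driven arrangement might look, using a plain in-process event bus: each agent subscribes to an event type, does its piece of work, and emits a new event, so agents react to each other's output instead of being wired into a fixed sequence. The event names and agent behaviors below are illustrative assumptions, not an implementation from the talk:

```python
# Sketch: agents as event subscribers on a tiny in-process event bus.
# Event names and agent behaviors are illustrative, not from the talk.
from collections import defaultdict

subscribers = defaultdict(list)
log = []

def on(event_type):
    def register(handler):
        subscribers[event_type].append(handler)
        return handler
    return register

def emit(event_type, payload):
    log.append(event_type)
    for handler in subscribers[event_type]:
        handler(payload)

@on("file.arrived")
def inspection_agent(payload):
    payload["inspected"] = True
    emit("file.inspected", payload)

@on("file.inspected")
def ocr_agent(payload):
    payload["text"] = "extracted text"
    emit("text.extracted", payload)

@on("text.extracted")
def vectorization_agent(payload):
    payload["chunks"] = 3  # pretend we chunked and embedded the text
    # downstream agents (splitter, extract, match) would subscribe similarly

# New data arriving triggers the whole chain, with no fixed pipeline wiring.
doc = {"name": "doc1.pdf"}
emit("file.arrived", doc)
print(log)  # ['file.arrived', 'file.inspected', 'text.extracted']
print(doc)
```

Because any agent can subscribe to any event, adding or reordering agents means registering a handler rather than rewriring a pipeline, which is where the autonomy and scalability gains come from.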