AI Inference: From Training to Real-Time
Key Points
- Inferencing is the phase where an AI model applies the knowledge encoded in its trained weights to make predictions or solve tasks on new, real‑time data.
- Model development consists of two stages: training, during which the model learns relationships in the data and stores them as neural‑network weights, and inference, where those weights are used to interpret unseen inputs.
- During training, the model adjusts its weights to capture patterns (e.g., keywords, punctuation) that differentiate classes such as spam versus non‑spam emails.
- In inference, the model compares a user’s query or new data against the learned weight patterns, generalizing from its training to produce an output in real time.
- The ultimate goal of inference is to generate an actionable result—such as flagging an incoming email as spam—by leveraging the model’s stored knowledge to make accurate, rapid decisions.
Sections
- Understanding AI Model Inference - The passage explains how inference follows training by applying learned model weights to new real‑time data, enabling the model to generalize and make predictions while highlighting concerns of cost and speed.
- Spam Detection Model Workflow - The speaker explains how a trained model learns spam‑related patterns, encodes them in its weights, evaluates new emails in real time to produce a spam probability, and uses business rules to decide whether to move or flag the message.
- Scaling, Speed, and Cost of AI Inference - The passage outlines how massive, real‑time inference demands for large language models drive high energy, hardware, and operational expenses, and explains that optimizing every layer of the AI stack—including specialized chips—is essential to achieve faster, more efficient performance.
- Graph Fusion for Multi‑GPU Inference - The speaker explains how graph fusion trims communication overhead and how splitting a model’s computational graph into strategic chunks lets massive models exceed a single GPU’s memory by parallelizing operations across multiple GPUs.
Full Transcript
Source: [https://www.youtube.com/watch?v=XtT5i0ZeHHE](https://www.youtube.com/watch?v=XtT5i0ZeHHE)
Duration: 00:10:40
Section timestamps:
- [00:00:00](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=0s) Understanding AI Model Inference
- [00:03:09](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=189s) Spam Detection Model Workflow
- [00:06:12](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=372s) Scaling, Speed, and Cost of AI Inference
- [00:09:21](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=561s) Graph Fusion for Multi-GPU Inference
What is inferencing?
It's an AI model's time to shine, its moment of truth: a test of
how well the model can apply information learned during training to make a prediction or solve a task.
And with it comes a focus on cost and speed.
Let's get into it.
So an AI model,
it goes through two primary stages.
What are those?
The first of those is the training stage where the model learns how to do stuff.
And then we have the inferencing stage that comes after training.
Now, we can think of this as the difference between learning something and then putting what we've learned into practice.
So during training, a deep learning model computes how the examples in its training set are related.
What it's doing effectively here is it's figuring out relationships between all of the data in its training set.
And it encodes these relationships into a series of what are called model weights.
These are the weights that connect artificial neurons.
So that's training.
Now, during inference, a model goes to work on what we provide it, which is real time data.
So this is the actual data that we are inputting into the model.
What happens in inferencing is the model
compares the user's query with the information processed during training and all of those stored weights.
And what the model effectively does is it generalizes based on everything that it has learned during training.
So it generalizes from this stored representation to be able to interpret this new unseen data
in much the same way that you and I can draw on prior knowledge to infer
the meaning of a new word or make sense of a new situation.
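To make the idea of stored weights concrete, here is a minimal sketch; the weights, bias, and feature values are hypothetical, standing in for whatever a real model learned during training:

```python
import math

# Hypothetical weights learned during training; inference only reads them.
WEIGHTS = [1.2, -0.4, 0.9]
BIAS = -0.5

def infer(features):
    """Apply the stored weights to a new, unseen input (the inference step)."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1 / (1 + math.exp(-z))  # squash the raw score into a 0-1 range

score = infer([1.0, 0.0, 1.0])  # a new real-time data point
```

Training would adjust `WEIGHTS` and `BIAS`; inference just runs this forward pass against fresh input.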
And what's the goal of this?
Well, the goal of AI inference is to calculate an output, basically a result, an actionable result.
So what sort of result are we talking about?
Well, let's consider a model that attempts to accurately flag incoming email,
and it's going to flag it based on whether or not it thinks it is spam.
We are going to build a spam detector model.
Right.
So during the training stage, this model would be fed a large labeled data set.
So we get into a whole load of data here, and this contains a bunch of emails that have been labeled.
Specifically, the labels are spam or not spam for each email.
And what happens here is the model learns to recognize patterns and features commonly associated with spam emails.
So these might include the presence of certain keywords.
Yeah.
So unusual sender email addresses, excessive use of exclamation marks, all that sort of thing.
Now the model encodes these learned patterns into its weights here, creating a complex set of rules to identify spam.
Now, during inference, this model is put to the test.
It's put to the test with new unseen data in real time, like when a new email arrives in a user's inbox.
The model analyzes the incoming email, comparing its characteristics
to the patterns it's learned during training and then makes a prediction.
Is this new unseen email spam or not spam?
Now, the actionable result here might be a probability score
indicating how likely the email is to be spam, which is then tied into a business rule.
So, for example, if the model assigns a 90% probability
that what we're looking at here is spam, well we should move that email directly to the spam folder.
That's what the business rule would say.
But if the probability the model comes back with is just 50%,
the business rule might say to leave the email in the inbox, but flag it for the user to decide what to do.
So what's happening here is the model is generalizing.
It can identify spam emails even if they don't exactly match any specific example from its training data,
as long as they share similar characteristics with the spam patterns it's learned.
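The business rule described above can be sketched in a few lines. The thresholds (90% and 50%) come from the example in the transcript; the function and action names are hypothetical:

```python
def route_email(spam_probability):
    """Map the model's spam probability score to an action via business rules."""
    if spam_probability >= 0.9:
        return "move_to_spam_folder"  # confident enough to file it automatically
    if spam_probability >= 0.5:
        return "flag_for_user"        # uncertain, so let the user decide
    return "leave_in_inbox"           # likely legitimate
```

The model's job ends at the probability score; turning that score into an action is a product decision, which is why the thresholds live in a plain rule like this rather than in the model.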
Okay.
Now, when the topic of inferencing comes up, it is often accompanied with four preceding words.
Let's cover those next.
"The high cost of": those are the words often added before inferencing.
Training AI models, particularly large language models, can cost millions of dollars in computing processing time.
But as expensive as training an AI model can be, it is dwarfed by the expense of inferencing.
Each time someone runs an AI model, there's a cost, a cost in kilowatt hours, a cost in dollars, a cost in carbon emissions.
On average, something like 90% of an AI model's life is spent in inferencing mode.
And therefore, most of the AI's carbon footprint comes from serving models to the world, not in training them.
In fact, by some estimates, running a large model puts more carbon into the atmosphere over its lifetime
than the average American car.
Now, the high costs of inferencing, they stem from a number of different factors.
So let's take a look at some of those.
First of all, there's just the sheer scale, the scale of operations.
While training happens just once, inferencing happens millions or even billions of times over a model's lifetime.
A chatbot might field millions of queries every day, each requiring a separate inference.
Second, there's the need for speed.
We want fast AI models.
We're working with real-time data here, requiring near-instantaneous responses,
which often necessitates powerful, energy-hungry hardware like GPUs.
Third, we have to consider also just the general complexity of these AI models.
As models grow larger and more sophisticated to handle more complex tasks,
they require more computational resources for each inference.
This is particularly true for LLMs with billions of parameters.
And then finally there are the infrastructure costs:
data centers to maintain and cool, low-latency network connections to power.
All these factors contribute to significant ongoing costs in terms of energy consumption,
hardware wear and tear, and operational expenses.
Which brings up the question of if there's a better way to do this faster and more efficiently.
How fast an AI model runs depends on the stack.
What's the stack?
Well, improvements made to each layer can speed up inferencing. At the top of the stack is hardware.
At the hardware level, engineers are developing specialized chips.
These are chips made for AI,
and they're optimized for the types of mathematical operations that dominate deep learning, particularly matrix multiplication.
These AI accelerators can significantly speed up inferencing tasks compared to traditional CPUs and even to GPUs,
and to do so in a more energy efficient way.
Now, at the bottom of the stack, I put software.
And on the software side, there are several approaches to accelerate inferencing.
One is model compression.
Now that involves techniques like pruning and quantization.
So what do we mean by those?
Well, first of all, pruning: that removes unnecessary weights from the model,
reducing its size without significantly impacting accuracy.
And then for quantization, what that is talking about is reducing the precision of the model's weights,
such as from 32-bit floating-point numbers to 8-bit integers,
and that can really speed up computations and reduce memory requirements.
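As a rough illustration of both techniques (toy weight values and a simple symmetric quantization scheme, not a production method), pruning and float32-to-int8 quantization might look like this:

```python
import numpy as np

# Toy 32-bit weights standing in for a trained model's parameters.
weights = np.array([0.12, -0.5, 0.87, -0.03, 0.4], dtype=np.float32)

# Pruning: zero out weights whose magnitude is negligible.
threshold = 0.1
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0).astype(np.float32)

# Quantization: map the float range onto signed 8-bit integers
# using one scale factor for the whole tensor.
scale = np.abs(pruned).max() / 127
quantized = np.round(pruned / scale).astype(np.int8)  # 4 bytes -> 1 byte each

# At inference time, dequantize (or compute directly in int8 on supported hardware).
restored = quantized.astype(np.float32) * scale
```

The memory saving is the 4x drop from float32 to int8; the cost is the small rounding error visible in `restored`, which is why accuracy is checked after quantizing.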
Okay, so we've got hardware and software.
What's in the middle?
Middleware of course, middleware bridges the gap between the hardware and the software,
and middleware frameworks can perform a bunch of things to help here.
One of those things is called graph fusion.
And graph fusion reduces the number of nodes in the computational graph,
and that minimizes the round trips between CPUs and GPUs.
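A minimal way to picture fusion, with plain NumPy expressions standing in for graph nodes (an illustration of the idea, not a real compiler pass): two elementwise operations that would otherwise be separate nodes, each writing out an intermediate result, collapse into one:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Unfused: two graph nodes, with an intermediate result stored between them.
t = x * 2.0          # node 1
y_unfused = t + 1.0  # node 2

# Fused: one node computes the same expression in a single pass.
y_fused = x * 2.0 + 1.0

assert np.allclose(y_unfused, y_fused)  # identical math, fewer round trips
```

The fused form does the same arithmetic but avoids materializing `t`, which is where the saved memory traffic and CPU-GPU round trips come from.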
And they can also implement parallel tensors as well,
strategically splitting the model's computational graph into chunks,
and those chunks can be spread across multiple GPUs and run at the same time.
So consider running a 17 billion parameter model
that requires something like 150GB of memory, which is nearly twice as much as an NVIDIA A100 GPU holds.
But if the compiler can split the AI model's computational graph into strategic chunks,
those operations can be spread across GPUs and run at the same time.
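The chunking idea can be sketched with a single layer's matrix multiply; here ordinary arrays stand in for per-GPU shards (an illustration of the splitting, not actual multi-GPU code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))   # activations entering a layer
W = rng.standard_normal((8, 6))   # a weight matrix too big for one "device"

# Split W column-wise into chunks, one per GPU; each computes its own slice.
shards = np.split(W, 2, axis=1)             # two (8, 3) shards
partials = [x @ shard for shard in shards]  # these run in parallel across devices
y = np.concatenate(partials, axis=1)        # gather the slices back together

assert np.allclose(y, x @ W)  # same answer as the single-device multiply
```

Because each shard holds only part of `W`, no single device ever needs the full weight matrix in memory, which is what lets a model exceed one GPU's capacity.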
So that's inferencing.
It's a game, a game of pattern matching that turns complex training into rapid-fire problem solving.
One spammy email at a time.
If you have any questions, please drop us a line below.
And if you want to see more videos like this in the future, please like and subscribe.
Thanks for watching.