Learning Library


AI Inference: From Training to Real-Time

Key Points

  • Inferencing is the phase where an AI model applies the knowledge encoded in its trained weights to make predictions or solve tasks on new, real‑time data.
  • Model development consists of two stages: training, during which the model learns relationships in the data and stores them as neural‑network weights, and inference, where those weights are used to interpret unseen inputs.
  • During training, the model adjusts its weights to capture patterns (e.g., keywords, punctuation) that differentiate classes such as spam versus non‑spam emails.
  • In inference, the model compares a user’s query or new data against the learned weight patterns, generalizing from its training to produce an output in real time.
  • The ultimate goal of inference is to generate an actionable result—such as flagging an incoming email as spam—by leveraging the model’s stored knowledge to make accurate, rapid decisions.
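The training/inference split in the points above can be illustrated with a toy sketch: a hypothetical keyword-weight classifier (an assumption for illustration, not anything from the video) whose "training" stage counts word-label associations and whose "inference" stage scores unseen text against those stored weights.

```python
# Toy illustration of the two stages: "training" stores weights,
# "inference" applies them to new, unseen input.

def train(examples):
    """Learn a weight per word: how strongly it signals the positive class."""
    weights = {}
    for text, label in examples:
        for word in text.lower().split():
            weights[word] = weights.get(word, 0) + (1 if label else -1)
    return weights

def infer(weights, text):
    """Generalize: score unseen text against the stored weights."""
    score = sum(weights.get(w, 0) for w in text.lower().split())
    return score > 0

# Training happens once...
weights = train([("win free money", True), ("meeting notes attached", False)])
# ...inference then runs on every new input, even ones never seen in training.
print(infer(weights, "free money inside"))  # True
```

The new input "free money inside" appears nowhere in the training set, yet the stored weights let the model generalize, which is the core idea of inference.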

Full Transcript

**Source:** [https://www.youtube.com/watch?v=XtT5i0ZeHHE](https://www.youtube.com/watch?v=XtT5i0ZeHHE)
**Duration:** 00:10:40

Sections

  • [00:00:00](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=0s) **Understanding AI Model Inference** - How inference follows training by applying learned model weights to new real-time data, enabling the model to generalize and make predictions, with attention to cost and speed.
  • [00:03:09](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=189s) **Spam Detection Model Workflow** - How a trained model learns spam-related patterns, encodes them in its weights, evaluates new emails in real time to produce a spam probability, and applies business rules to decide whether to move or flag the message.
  • [00:06:12](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=372s) **Scaling, Speed, and Cost of AI Inference** - How massive, real-time inference demand for large language models drives high energy, hardware, and operational expenses, and why optimizing every layer of the AI stack, including specialized chips, is essential for faster, more efficient performance.
  • [00:09:21](https://www.youtube.com/watch?v=XtT5i0ZeHHE&t=561s) **Graph Fusion for Multi-GPU Inference** - How graph fusion trims communication overhead, and how splitting a model's computational graph into strategic chunks lets massive models exceed a single GPU's memory by parallelizing operations across multiple GPUs.
0:00 What is inferencing? It's an AI model's time to shine, its moment of truth: a test of how well the model can apply information learned during training to make a prediction or solve a task. And with it comes a focus on cost and speed. Let's get into it.

0:19 So, an AI model goes through two primary stages. The first of those is the training stage, where the model learns how to do stuff. And then we have the inferencing stage that comes after training. Now, we can think of this as the difference between learning something and then putting what we've learned into practice.

0:56 During training, a deep learning model computes how the examples in its training set are related. What it's doing, effectively, is figuring out relationships between all of the data in its training set, and it encodes these relationships into what are called a series of model weights. These are the weights that connect artificial neurons. So that's training.

1:26 Now, during inference, a model goes to work on what we provide it, which is real-time data. This is the actual data that we are inputting into the model. What happens in inferencing is the model compares the user's query with the information processed during training and all of those stored weights. What the model effectively does is generalize based on everything it has learned during training. It generalizes from this stored representation to interpret new, unseen data, in much the same way that you and I can draw on prior knowledge to infer the meaning of a new word or make sense of a new situation.

2:11 And what's the goal of this? Well, the goal of AI inference is to calculate an output: basically a result, an actionable result. So what sort of result are we talking about?
2:28 Well, let's consider a model that attempts to accurately flag incoming email, and it's going to flag it based on whether or not it thinks it is spam. We are going to build a spam detector model.

2:45 So, during the training stage, this model would be fed a large labeled data set. We get into a whole load of data here: a bunch of emails that have each been labeled. Specifically, the labels are spam or not spam for each email. What happens here is the model learns to recognize patterns and features commonly associated with spam emails. These might include the presence of certain keywords, unusual sender email addresses, excessive use of exclamation marks, all that sort of thing. Now, the model encodes these learned patterns into its weights, creating a complex set of rules to identify spam.

3:38 During inference, this model is put to the test with new, unseen data in real time, like when a new email arrives in a user's inbox. The model analyzes the incoming email, comparing its characteristics to the patterns it learned during training, and then makes a prediction: is this new, unseen email spam or not spam?

4:08 Now, the actionable result here might be a probability score indicating how likely the email is to be spam, which is then tied into a business rule. So, for example, if the model assigns a 90% probability that what we're looking at is spam, we should move that email directly to the spam folder. That's what the business rule would say. But if the probability the model comes back with is just 50%, the business rule might say to leave the email in the inbox, but flag it for the user to decide what to do.

4:45 So what's happening here is the model is generalizing. It can identify spam emails even if they don't exactly match any specific example from its training data.
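That business rule can be sketched as a simple post-processing step on the model's probability score. The thresholds and action names below are illustrative assumptions based on the examples in the transcript, not a real system's API.

```python
def route_email(spam_probability):
    """Turn the model's probability score into an actionable result
    via a business rule (thresholds here are illustrative)."""
    if spam_probability >= 0.9:
        return "move_to_spam_folder"      # high confidence: act automatically
    elif spam_probability >= 0.5:
        return "flag_for_user_review"     # uncertain: let the user decide
    else:
        return "leave_in_inbox"           # likely legitimate

print(route_email(0.92))  # move_to_spam_folder
print(route_email(0.50))  # flag_for_user_review
print(route_email(0.10))  # leave_in_inbox
```

The point of the separation is that the model only produces a score; the business rule, which is cheap to change, decides what action that score triggers.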
4:56 As long as they share similar characteristics with the spam patterns it's learned, the model can still catch them.

5:02 Now, when the topic of inferencing comes up, it is often accompanied by four preceding words. Let's cover those next. "The high cost of": those are the words often added before inferencing. Training AI models, particularly large language models, can cost millions of dollars in computing processing time. But as expensive as training an AI model can be, it is dwarfed by the expense of inferencing. Each time someone runs an AI model, there's a cost: a cost in kilowatt-hours, a cost in dollars, a cost in carbon emissions. On average, something like 90% of an AI model's life is spent in inferencing mode, and therefore most of AI's carbon footprint comes from serving models to the world, not from training them. In fact, by some estimates, running a large model puts more carbon into the atmosphere over its lifetime than the average American car.

6:04 Now, the high costs of inferencing stem from a number of different factors, so let's take a look at some of those. First of all, there's just the sheer scale of operations. While training happens just once, inferencing happens millions or even billions of times over a model's lifetime; a chatbot might field millions of queries every day, each requiring a separate inference. Second, there's the need for speed. We want fast AI models. We're working with real-time data here, requiring near-instantaneous responses, which often necessitates powerful, energy-hungry hardware like GPUs. Third, we also have to consider the general complexity of these AI models. As models grow larger and more sophisticated to handle more complex tasks, they require more computational resources for each inference. This is particularly true for LLMs with billions of parameters.
7:10 And then finally, there are the infrastructure costs: data centers to maintain and cool, low-latency network connections to power. All these factors contribute to significant ongoing costs in terms of energy consumption, hardware wear and tear, and operational expenses. Which brings up the question: is there a better way to do this, faster and more efficiently?

7:39 How fast an AI model runs depends on the stack. What's the stack? Well, improvements made to each layer can speed up inferencing. At the top of the stack is hardware. At the hardware level, engineers are developing specialized chips made for AI, optimized for the types of mathematical operations that dominate deep learning, particularly matrix multiplication. These AI accelerators can significantly speed up inferencing tasks compared to traditional CPUs and even GPUs, and do so in a more energy-efficient way.

8:21 Now, at the bottom of the stack, I put software. On the software side, there are several approaches to accelerate inferencing. One is model compression, which involves techniques like pruning and quantization. So what do we mean by those? Well, first of all, pruning removes unnecessary weights from the model, reducing its size without significantly impacting accuracy. And quantization means reducing the precision of the model's weights, such as from 32-bit floating-point numbers to 8-bit integers, which can really speed up computations and reduce memory requirements.

9:07 Okay, so we've got hardware and software. What's in the middle? Middleware, of course. Middleware bridges the gap between the hardware and the software, and middleware frameworks can perform a bunch of things to help here. One of those things is called graph fusion.
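The quantization idea just described (32-bit floats mapped to 8-bit integers) can be sketched in a few lines. This is a minimal symmetric scheme written for illustration; real toolchains use more refined variants with zero-points, per-channel scales, and calibration.

```python
# Minimal symmetric int8 quantization sketch (real frameworks are more refined).

def quantize(weights):
    """Map floats into the signed 8-bit range [-127, 127] with one scale factor."""
    scale = max(abs(w) for w in weights) / 127   # largest weight maps to 127
    q = [round(w / scale) for w in weights]      # each value now fits in 1 byte
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats for use during computation."""
    return [v * scale for v in q]

weights = [0.52, -1.27, 0.03, 0.89]
q, scale = quantize(weights)
approx = dequantize(q, scale)

print(q)                                 # integer weights, 1 byte each vs. 4
print([round(w, 2) for w in approx])     # close to the originals
```

Each weight now costs one byte instead of four, a 4x memory reduction, at a small and usually tolerable cost in precision; integer arithmetic is also faster on most hardware.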
9:27 Graph fusion reduces the number of nodes in the computational graph, and that minimizes the round trips between CPUs and GPUs. Middleware frameworks can also implement parallel tensor operations: strategically splitting the model's computational graph into chunks that can be spread across multiple GPUs and run at the same time. So, running a 17-billion-parameter model might require something like 150 GB of memory, which is nearly twice as much as an NVIDIA A100 GPU holds. But if the compiler can split the AI model's computational graph into strategic chunks, those operations can be spread across GPUs and run at the same time.

10:18 So that's inferencing. It's a game of pattern matching that turns complex training into rapid-fire problem solving, one spammy email at a time.

10:31 If you have any questions, please drop us a line below. And if you want to see more videos like this in the future, please like and subscribe. Thanks for watching.
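As a footnote, the memory arithmetic behind that multi-GPU split can be sketched directly. The 80 GB capacity below is the published figure for the largest A100 variant, and the even, zero-overhead split is a simplifying assumption; real tensor-parallel deployments carry communication and activation overhead.

```python
import math

# Idealized sketch: can the model fit on one GPU, and if not, how many
# are needed when its computational graph is split evenly across devices?
model_memory_gb = 150   # memory the model needs, per the transcript
gpu_memory_gb = 80      # single NVIDIA A100 capacity (80 GB variant)

fits_on_one_gpu = model_memory_gb <= gpu_memory_gb
gpus_needed = math.ceil(model_memory_gb / gpu_memory_gb)

print(f"Fits on a single GPU: {fits_on_one_gpu}")        # False
print(f"GPUs needed with an even split: {gpus_needed}")  # 2
```

This is why the compiler's split matters: without it, a model needing 150 GB simply cannot be loaded on one 80 GB device, no matter how fast that device is.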