Evaluating Forecast Accuracy with Loss Functions
Key Points
- A loss function quantifies the error between an AI model’s predicted output and the actual value, with larger differences indicating higher loss.
- In a real‑world case, a colleague’s model that forecasted YouTube video views performed poorly, illustrating the need to assess and improve predictions using loss metrics.
- By calculating loss, we can iteratively adjust model parameters: decreasing loss means the model improves, while increasing loss signals deterioration, guiding the training process toward a predefined error threshold.
- Loss functions fall into regression (for continuous targets like video views, house prices, temperature) and classification categories, with common regression losses including Mean Squared Error (MSE) that heavily penalizes large mistakes and Mean Absolute Error (MAE).
Sections
- Loss Functions in Forecasting Models - The speaker explains how loss functions quantify prediction errors and illustrates their use with a YouTube view‑forecasting AI model that performed poorly, underscoring the need for model adjustments.
- Choosing Between MSE, MAE, Huber - The passage explains the characteristics of mean squared error, mean absolute error, and Huber loss—how they handle outliers—and offers guidance on selecting the appropriate regression loss based on the presence and impact of extreme values.
- Cross‑Entropy vs Hinge Loss - The passage explains entropy, describes how cross‑entropy loss quantifies the uncertainty of model predictions against certain ground‑truth labels, and contrasts this with hinge loss, which enforces confident, margin‑based correctness especially in binary classification.
- Loss Function Guides Model Training - The loss function measures model performance and, via its gradient, directs optimization algorithms to update weights and biases until the loss is minimized.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=v_ueBW_5dLg](https://www.youtube.com/watch?v=v_ueBW_5dLg)

**Duration:** 00:10:09

Section timestamps:
- [00:00:00](https://www.youtube.com/watch?v=v_ueBW_5dLg&t=0s) Loss Functions in Forecasting Models
- [00:03:05](https://www.youtube.com/watch?v=v_ueBW_5dLg&t=185s) Choosing Between MSE, MAE, Huber
- [00:06:15](https://www.youtube.com/watch?v=v_ueBW_5dLg&t=375s) Cross‑Entropy vs Hinge Loss
- [00:09:27](https://www.youtube.com/watch?v=v_ueBW_5dLg&t=567s) Loss Function Guides Model Training
How good is an AI model at forecasting?
We can put an actual number on it.
In machine learning a loss function tracks the degree of error in the output from an AI model,
and it does this by quantifying the difference, or the loss, between a predicted value and the actual value.
So let's say the model gave us five as the output, and we compare that to the actual value.
Maybe the actual value is ten, and we call that the ground truth.
Now, if the model's predictions are accurate, then the difference between these two numbers,
the loss, in effect, is comparatively small.
If its predictions are inaccurate, let's say it came back with an output of one instead of five, then the loss is larger.
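
As a minimal sketch of the idea above, the loss for a single prediction can be taken as the absolute gap between the model's output and the ground truth. The helper name `absolute_error` is hypothetical and chosen only for illustration; the values five, one and ten follow the spoken example.

```python
def absolute_error(predicted, actual):
    """Loss for one prediction: the size of the gap to the ground truth."""
    return abs(predicted - actual)

ground_truth = 10

closer_loss = absolute_error(5, ground_truth)   # the model output five
farther_loss = absolute_error(1, ground_truth)  # the model output one

# The prediction that sits further from the ground truth has the larger loss.
print(closer_loss, farther_loss)
```
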
So let me give you an example of how we can use this.
Now, I have a colleague who built an AI model to forecast how many views his videos would receive on YouTube.
He fed the model YouTube titles and then the model forecast how many views that video would receive in its first week.
Here they are.
Little bit vain, if you ask me.
But it wasn't me.
It was my colleague.
Now, how well did the model do?
Well, when comparing the model forecasts to the actual number of real YouTube views,
the model wasn't getting too close.
The model predicted that the cold brew video would bomb, and that the pour-over guide video would be a big hit.
That just wasn't the case, though.
Now, this is a hard problem to solve and clearly this model needs some adjustments
and that's where loss functions can help.
Loss functions let us define how well a model is doing mathematically.
And if we can calculate loss, we can then adjust model parameters and see if that increases loss,
meaning it's made it worse, or if it decreases loss, meaning it's made it better.
And at some point we can say that a machine learning model has been sufficiently trained,
when loss has been minimized below some predefined threshold.
Now at a high level, we can divide loss functions into two types, regression loss functions and then classification loss functions.
And let's start with regression, which measures errors in predictions involving continuous values:
predictions like the price of a house, or the temperature for a given day, or, well, the views for a YouTube video.
Now, in these cases
the loss function measures how far off the model's predictions are from the actual continuous target values.
Now, regression loss must be sensitive to two things: not just whether the forecast is correct or not,
but also the degree to which it diverges from the ground truth.
And there are multiple ways to calculate regression loss functions.
Now, the most common of those is called MSE or mean squared error.
Now, as its name suggests,
MSE is calculated as the average of the squared difference
between the predicted value and the true value across all training examples.
And squaring the error means the MSE gives large mistakes a disproportionately heavy impact on overall loss,
which strongly punishes outliers.
So that's MSE.
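
The MSE calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the video: the function name `mean_squared_error` and the sample numbers are hypothetical.

```python
def mean_squared_error(predicted, actual):
    """Average of the squared differences between predictions and true values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical values for illustration, not the figures from the video.
predicted = [5.0, 8.0, 12.0]
actual = [10.0, 8.0, 11.0]

# Errors are -5, 0 and 1, so the squared errors are 25, 0 and 1.
mse = mean_squared_error(predicted, actual)
print(mse)
```

Note how the single error of 5 contributes 25 of the 26 total squared error, which is the disproportionate weighting the speaker describes.
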
MAE, or mean absolute error, measures the average absolute difference between the predicted value
and the true value, and it is less sensitive to outliers compared to MSE, as it doesn't square the errors.
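
A comparable sketch for MAE, again with a hypothetical function name and made-up sample values:

```python
def mean_absolute_error(predicted, actual):
    """Average of the absolute differences between predictions and true values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    absolute_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return sum(absolute_errors) / len(absolute_errors)

# Hypothetical values for illustration, not the figures from the video.
predicted = [5.0, 8.0, 12.0]
actual = [10.0, 8.0, 11.0]

# Absolute errors are 5, 0 and 1, so each error counts in proportion to its size.
mae = mean_absolute_error(predicted, actual)
print(mae)
```
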
So how do you decide which regression loss function to pick?
Well, if your ground truth data has relatively few extreme outliers, with minimal deviation,
like, I don't know, the temperature ranges in the month of July in the southern US, which, trust me, is basically always hot,
well, then MSE is a particularly useful option for you,
as you want to heavily penalize predictions that are far off from the actual values.
MAE is a better option when the data does contain more outliers
and we don't want those outliers to overly influence the model.
Forecasting demand for a product is a good example,
where occasional surges in sales shouldn't overly skew the model.
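
To make the trade-off concrete, the sketch below applies both losses to a list of prediction errors with and without one surge. The error values and the helper names `mse` and `mae` are hypothetical and chosen only to show the effect of a single outlier.

```python
def mse(errors):
    """Mean squared error computed directly from a list of errors."""
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error computed directly from a list of errors."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors (predicted minus actual).
typical_errors = [2.0, -1.0, 3.0, -2.0]
with_surge = typical_errors + [50.0]  # one occasional sales surge

# How many times larger each loss becomes once the outlier is included.
mae_growth = mae(with_surge) / mae(typical_errors)
mse_growth = mse(with_surge) / mse(typical_errors)
print(mae_growth, mse_growth)
```

With these numbers the MAE grows by a factor of a little under six, while the MSE grows by a factor of more than one hundred, which is why a model trained on MSE is pulled much harder toward the outlier.
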
But there is a third choice.
The third choice is called Huber loss.
Now, Huber loss is a compromise.
It's a compromise between MSE and MAE.
It behaves like MSE for small errors and like MAE for large errors,
which makes it useful when you want the benefits of penalizing large errors but not too harshly.
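
A minimal sketch of Huber loss follows, using its standard piecewise definition: quadratic while the absolute error is at most a cutoff, linear beyond it. The cutoff parameter `delta` and its default of 1.0 are assumptions; the video does not say which value was used.

```python
def huber_loss(predicted, actual, delta=1.0):
    """Average Huber loss: quadratic for small errors, linear for large ones."""
    losses = []
    for p, a in zip(predicted, actual):
        error = abs(p - a)
        if error <= delta:
            # Small error: behaves like (half of) the squared error.
            losses.append(0.5 * error ** 2)
        else:
            # Large error: grows linearly, like the absolute error.
            losses.append(delta * (error - 0.5 * delta))
    return sum(losses) / len(losses)

# Hypothetical single-prediction examples.
small_error_loss = huber_loss([0.5], [0.0])   # error of 0.5, quadratic branch
large_error_loss = huber_loss([10.0], [0.0])  # error of 10, linear branch
print(small_error_loss, large_error_loss)
```

For the error of 10, half the squared error would be 50, while the linear branch gives 9.5, so the large error is still penalized but far less harshly.
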
Now I've calculated the loss functions for the YouTube example.
This is the MAE value, which averages the absolute differences,
meaning on average the predictions were off by about 16,000 views per video.
The MSE loss function, that's over 400 million.
It skyrockets, and that's due to the squaring of large errors. And then the Huber loss:
that also indicates poor predictions, but provides a more balanced perspective,
penalizing large errors less severely than MSE.
But look, these numbers don't mean a whole lot on their own.
We want to adjust the model's parameters, generate new forecasts and see where we move the needle on loss.
But before we get to how to do that, let's talk about the other type of loss function: classification.
Unlike regression loss functions which deal with predicting continuous numerical values,
classification loss functions, well, they're focused on determining the accuracy of categorical predictions.
Is an email spam or not spam?
Are these plants classified into their correct species based on their features?
So the loss function in classification tasks measures how well the predicted
probabilities or labels match the actual categories.
Now cross entropy loss is one way of doing this, and it's the most widely used loss function for classification tasks.
Now, what is entropy?
It's a measure of uncertainty within a system.
So if you're flipping a coin, there are only two possible outcomes: heads or tails.
The uncertainty is pretty low,
so low entropy.
Rolling a six-sided die means there's more uncertainty about which of the six possible numbers will come up.
The entropy is higher.
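
The coin and die comparison can be checked with Shannon entropy, the usual formal measure of this kind of uncertainty. The sketch assumes a fair coin and a fair die, and the function name `shannon_entropy` is hypothetical.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]
fair_die = [1 / 6] * 6

coin_entropy = shannon_entropy(fair_coin)  # two equally likely outcomes
die_entropy = shannon_entropy(fair_die)    # six equally likely outcomes
print(coin_entropy, die_entropy)
```

The coin comes out at 1 bit and the die at about 2.58 bits, matching the intuition that more equally likely outcomes means more uncertainty.
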
Now cross entropy loss measures how uncertain the model's predictions are compared to the actual outcomes.
In supervised learning, model predictions are compared to the ground truth classifications provided by data labels.
Those ground truth labels are certain, and so they have low or in fact no entropy.
As such, we can measure the loss in terms of the difference between the certainty we'd have using the ground truth labels
and the certainty of the labels predicted by the model.
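
As a sketch of how this is usually computed, the cross-entropy for one example sums the negative log of the predicted probabilities, weighted by the ground truth label. With a one-hot label this reduces to the negative log of the probability given to the correct class. The function name, the `eps` guard against taking the log of zero, and the sample probabilities are all assumptions for illustration.

```python
import math

def cross_entropy(true_one_hot, predicted_probs, eps=1e-12):
    """Cross-entropy (natural log) between a one-hot label and predicted probabilities."""
    return -sum(
        t * math.log(max(p, eps))  # eps avoids log(0) for a zero probability
        for t, p in zip(true_one_hot, predicted_probs)
    )

# Hypothetical spam example: class 0 is "spam", class 1 is "not spam".
label = [1, 0]  # the email really is spam

confident_correct = cross_entropy(label, [0.9, 0.1])
unsure = cross_entropy(label, [0.5, 0.5])
confident_wrong = cross_entropy(label, [0.1, 0.9])
print(confident_correct, unsure, confident_wrong)
```

The loss is smallest when the model is confident and right, larger when it hedges, and largest when it is confident and wrong.
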
Now, an alternative to this is called hinge loss.
Now, this is commonly used in support vector machines,
and hinge loss encourages the model to make both correct predictions and to do so with a certain level of confidence.
It's all about measuring that level of confidence,
and it focuses on maximizing the margin between classes with the goal that the model is not just correct,
but it's confidently correct by a specified margin.
And this makes the hinge loss particularly useful in binary classification tasks
where the distinction between classes needs to be as clear and as far apart as possible.
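
The margin idea can be sketched with the standard binary hinge loss, `max(0, 1 - y * score)`. The sketch assumes labels encoded as -1 or +1, raw (unbounded) model scores, and a margin of 1; the function name and sample scores are hypothetical.

```python
def hinge_loss(labels, scores):
    """Average hinge loss; labels are -1 or +1, scores are raw model outputs."""
    losses = [max(0.0, 1.0 - y * s) for y, s in zip(labels, scores)]
    return sum(losses) / len(losses)

# Hypothetical scores for a single example whose true label is +1.
confidently_correct = hinge_loss([1], [2.0])   # beyond the margin
barely_correct = hinge_loss([1], [0.3])        # right side, but inside the margin
wrong = hinge_loss([1], [-1.0])                # wrong side of the boundary
print(confidently_correct, barely_correct, wrong)
```

A prediction that is correct but inside the margin still incurs some loss, which is how hinge loss pushes the model to be confidently correct rather than merely correct.
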
So we've calculated our loss function.
Great, but what can we do with that information?
Now remember that the primary reason for calculating the loss function is to guide the model's learning process.
The loss function provides a numeric value that indicates how far off the model's predictions are from the actual results.
And by analyzing this loss, we can adjust the model's parameters typically through a process called optimization.
In essence, the loss function acts as a feedback mechanism,
telling the model how well it's performing and where it needs to improve.
The lower the loss, the better the model's predictions align with the true outcomes.
Now, after adjusting the YouTube prediction model,
we get a new set of forecasts and we can now compare the loss functions between the two models,
and in all three cases, the loss function is now lower,
indicating less loss, with the greatest effect on MSE, mean squared error,
as the model reduced the large prediction error for the pour-over video.
Now that's the loss function as an evaluation metric,
but it can also be used as an input into an algorithm that actually adjusts the model parameters
to minimize loss, for example, by using gradient descent.
And that works by calculating the gradient or the slope of a loss function
with respect to each parameter.
Using the gradient of the loss function,
optimization algorithms determine which direction to step the model's parameters in order to move down the gradient
and therefore reduce loss.
The model learns by updating the weight and bias terms until the loss function has been sufficiently minimized.
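
The training loop described above can be sketched with plain gradient descent on a one-weight, one-bias linear model using MSE as the loss. Everything specific here is an assumption for illustration: the toy data (generated from y = 2x + 1), the learning rate, the loss threshold, and the function names. It is not the YouTube model from the video.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b, xs, ys):
    """Slope of the MSE loss with respect to the weight w and the bias b."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b

def train(xs, ys, learning_rate=0.05, threshold=1e-6, max_steps=10_000):
    """Step against the gradient until the loss falls below the threshold."""
    w, b = 0.0, 0.0
    for _ in range(max_steps):
        if mse_loss(w, b, xs, ys) < threshold:
            break  # sufficiently trained
        grad_w, grad_b = gradients(w, b, xs, ys)
        w -= learning_rate * grad_w  # move down the slope for the weight
        b -= learning_rate * grad_b  # move down the slope for the bias
    return w, b, mse_loss(w, b, xs, ys)

# Hypothetical toy data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b, final_loss = train(xs, ys)
print(w, b, final_loss)
```

The learned weight and bias end up close to 2 and 1, and the loop stops once the loss drops below the predefined threshold rather than running for a fixed number of steps.
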
So that's the loss function.
It's both a scorekeeper that measures how well your model is performing,
and a guide that directs the model's learning process.
And thanks to the loss function,
my colleague can keep tweaking his YouTube AI model
to minimize the loss and teach that model to make better predictions.