Understanding Confusion Matrices with Scikit-learn
Key Points
- Diarra Bell introduces confusion matrices as a tool to evaluate classification model performance, noting common classifiers like logistic regression, Naive Bayes, SVMs, and decision trees.
- She demonstrates building a binary classifier in a Jupyter notebook using scikit‑learn’s breast‑cancer dataset, importing the necessary libraries (metrics, train‑test split, scaler, pandas, Matplotlib).
- After loading the dataset, she creates a pandas DataFrame to inspect the feature columns and target labels that indicate malignant versus benign samples.
- The workflow proceeds with preprocessing (train‑test split and scaling), fitting a logistic‑regression model, and then generating a confusion matrix to visualize true/false positives and negatives.
- The video explains how to interpret the confusion matrix and related metrics (accuracy, precision, recall, F1) to assess the model’s classification results.
Sections
- Introducing Confusion Matrices with Scikit‑learn - Diarra Bell explains what a confusion matrix is and walks through building a binary logistic‑regression classifier on the breast‑cancer dataset in a Jupyter notebook, importing the necessary scikit‑learn tools and visualizing results.
- Adding Target Labels and Splitting Data - The speaker shows how to append a target column indicating malignant (0) or benign (1) samples to the feature DataFrame, displays it, and then separates the dataset into X (features) and Y (labels).
- Scaling Data and Training Logistic Regression - The speaker explains how to standardize features, fit a logistic regression model to the scaled training data, and quickly generate a confusion matrix for evaluation.
- Visualizing and Interpreting Confusion Matrix - The speaker demonstrates using sklearn’s `ConfusionMatrixDisplay` with matplotlib to plot a confusion matrix and explains how each cell (true positives, true negatives, etc.) reflects the model’s prediction accuracy for malignant and benign samples.
- Calculating Classification Metrics - The speaker demonstrates how to use scikit-learn’s accuracy, precision, and recall functions on test predictions derived from a confusion matrix, interprets the resulting scores, and notes that these metrics guide further model fine‑tuning.
Source: https://www.youtube.com/watch?v=PoqGrCscJ7k
Duration: 00:15:13
Section timestamps:
- 00:00:00 Introducing Confusion Matrices with Scikit‑learn
- 00:03:05 Adding Target Labels and Splitting Data
- 00:06:07 Scaling Data and Training Logistic Regression
- 00:09:17 Visualizing and Interpreting Confusion Matrix
- 00:12:27 Calculating Classification Metrics
Full Transcript
Hi, I'm Diarra Bell and I'm an AI engineer at IBM.
Today, we're going to be talking about confusion matrices.
A confusion matrix is a way to summarize the performance of a classification model.
Classification models can be applied to a variety of different use cases to sort data into different categories.
A few examples of models that can be used for classification include logistic regression, Naive Bayes, support vector machines, and decision trees.
In this video, we'll build a quick binary classifier model with scikit-learn,
and we'll analyze a confusion matrix to assess the results.
You don't have to be confused.
We'll explain everything in the video.
All right.
Let's get started with actually writing some code.
So as you can see here, I have a Jupyter notebook open and I have some libraries that we're going to import.
Some are from scikit-learn.
I'm importing the load breast cancer default dataset for our logistic regression.
I'm also importing the metrics library and logistic regression.
And then I'm also importing some functions for metric scores for our model, and we'll talk about those later.
I'm also importing Matplotlib, and I'm also importing train test split so that we can split our data into training and test sets.
I'm importing a scaler for us to do some preprocessing on the data, and then I'm importing pandas for us to visualize the data as a data frame.
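The imports described above can be sketched as follows; the exact aliases are my own choices, not shown on screen:

```python
# Imports for the tutorial: dataset, model, splitting, scaling, metrics, plotting
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs headless
import matplotlib.pyplot as plt
import pandas as pd
```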
Now that we have all of our libraries imported, let's get started with actually loading a dataset into Jupyter Notebook.
So we're using a default dataset called the breast cancer dataset,
which has information about different cell samples and labels that determine whether they are cancerous or non cancerous, that is, malignant or benign.
This is a really common dataset that's used in machine learning because it's very simple and easy to understand.
So let's get started.
So we're going to create a variable called data.
And we're going to have that equal to the load breast cancer function that we got from scikit-learn.
And now the next thing we're going to do is we're going to create a data frame just so we can see what that data looks like.
So let's create a data frame called DF,
and we'll set that equal to a pandas.dataframe,
and then we'll say we're going to get that data from the load breast cancer function.
Actually, the loaded object has its own data attribute to get the feature matrix from it.
And then we'll say that the columns are equal to data.feature_names.
Let's hold on.
Okay.
And so now we have a data frame there.
And let's just see what that looks like.
So hold on.
Let me just display the head of that data so you'll see what it looks like.
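A minimal, self-contained sketch of the DataFrame step above (the column names come from the dataset's feature_names attribute):

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
# Wrap the raw feature matrix in a DataFrame, labelled with the feature names
df = pd.DataFrame(data.data, columns=data.feature_names)
print(df.head())
```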
We'll see that we have several rows and each row is a different sample of cells.
And then these are all of the features that can affect the cell.
And we are using these features to determine whether the cell sample is cancerous or benign.
So now what we need to do is actually add the target, the target labels to this data set.
So the target labels are the classes that we are predicting.
We have the class zero, which is malignant, which means it's cancer.
And we have the class one, which means it's benign, which means it's not cancer.
So let's display those target labels.
So we're going to have a new column that we're going to add to our data frame and we're going to call it target,
and we'll say that is equal to data.target, which gets all of the target labels.
And if we display the head of our data frame now, this is what we'll get.
So you'll see it's exactly the same.
Only now we have our target labels.
As you can see, these are all zero, which means that they are malignant samples.
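The target-column step above, as a self-contained sketch:

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Append the class labels as a new column: 0 = malignant, 1 = benign
df["target"] = data.target
print(df.head())
```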
Okay.
So the next thing that we're going to do is we are going to split the data into X and Y variables.
X is a variable that contains all of the features, and Y is a variable that contains all of the target labels.
So the classes that we're predicting: either 0 or 1, either malignant or benign.
So we're going to get started by actually loading the data as X and Y variables.
And we can do this with a built-in parameter called return_X_y, which we're going to set to true.
And so now it's going to return all the features and all of the target labels as separate variables from this dataset.
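That loading step can be sketched in one call:

```python
from sklearn.datasets import load_breast_cancer

# return_X_y=True gives the features and labels as two separate arrays
X, y = load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)  # (569, 30) (569,)
```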
The next thing that we're going to do is we're going to split that data into training data and test data.
We're going to use a default split, which is about 25% of the data is going to be used for testing, and 75% is going to be used for training.
We do this so that we don't have any overlap between the testing and the training data.
And it's randomized so we can get a better idea of how well the model performs on new and unseen data.
So let's get started with that.
So we'll use the train test split function, and let's create some variables: x train, x test, y train, and y test.
And we'll set those variables to the result of train test split on x and y.
Let's run that.
All right.
Good to go.
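The split above can be sketched like this; random_state is my own addition so the run is reproducible:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Default split: 25% of the samples go to the test set, 75% to training
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))  # 426 143
```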
The next step is preprocessing the data.
So we actually do need to scale the data because we're using a logistic regression.
The model's output is compressed into a value between zero and one using the sigmoid function, so features on very different scales can distort the fit.
So we'll have to create a scaler.
And we've imported the library earlier.
So let's just create an instance of the standard scaler class.
And we're going to scale the x data for both the training and the test sets.
So let's create a variable called x train scaled.
And we'll set that equal to the scaler's fit transform on x train, and we'll do the same transform for the test dataset, reusing the scaling fitted on the training data.
All right. So now we have our scale versions of both the training data and the testing data.
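A sketch of the scaling step; note that the scaler is fitted on the training data only and then reused to transform the test data, so no test information leaks into training:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
# Fit on the training data, then apply the same transform to the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```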
The next thing that we're going to do is start training the model, which is the super exciting part, and it only takes about one line of code.
So now we're going to build a logistic regression,
and let's create a variable called model,
and we'll set that equal to logistic regression,
which is the class that we already imported before.
And we're going to fit this data to our training set.
So we're going to give it our scaled version of the x variable,
and then we're also going to give it the training set for y.
So now we're going to just fit our model to that data and our model will be trained.
All right.
Pretty quick because we have a relatively small dataset.
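Putting the preprocessing and training together, a minimal sketch (random_state is my own addition for reproducibility):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fitting the model is a single call on the scaled training data
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(model.score(X_test_scaled, y_test))
```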
So now let's get started by building our confusion matrix, which once again only takes one line of code.
We're going to start by creating a variable called confusion matrix.
Pretty simple.
You can call it whatever you want.
And we're going to use the metrics library.
And there's actually a convenient confusion matrix
function that we can use and we're going to put in our actual labels and then we're going to put in our predicted labels to make the confusion matrix.
Let's print it out.
So it's not going to be a graphical representation yet.
It's going to be just an array.
So as you can see, it printed out, just an array of numbers.
And we're going to explain what those numbers mean in a second.
But I think it'll be easier if we have an actual graph so that we can see what the true labels are and what the predicted labels are.
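The numeric matrix above can be reproduced end-to-end like this (random_state is pinned so the run repeats; the exact counts printed will differ from the video's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression().fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
# Rows are the true labels, columns the predicted labels
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
```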
So now that we have this confusion matrix, which is our numerical display of our confusion matrix,
let's do a graphical display so we can actually see more clearly what each part is referring to.
So let's create a variable called confusion matrix display.
And we're going to set that equal to a class in the metrics library called ConfusionMatrixDisplay.
And this gives you a graphical representation of a confusion matrix.
And we're going to set our confusion matrix equal to the one that we just did.
And let's print that out and display it here.
So.
And now let's use matplotlib to show.
Okay.
There you go.
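A minimal sketch of the plotting step, using a hypothetical 2x2 matrix so it stands alone:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so plt.show() is a no-op here
import matplotlib.pyplot as plt
from sklearn import metrics

# Hypothetical counts, just to demonstrate the plotting call
cm = np.array([[48, 5],
               [1, 89]])
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot()
plt.show()
```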
So now we have our actual confusion matrix.
So let's explain what we're looking at here, because we see a bunch of numbers.
So each square in this confusion matrix represents how many samples the model predicted as a certain class,
and what the actual class was for those samples.
So up here in our top left are our true positives.
These are great.
This means that the actual label was positive and the label that the model predicted was also positive.
So as you can see here, true positives means that it was correctly identified to be cancerous.
So zero is, as we know, stands for a malignant sample and one is a benign sample.
So we'll see here that these are all correctly identified.
Now, if we go to our bottom right, we'll see that this is also a positive thing.
These are true negatives.
So that means that the actual label was negative and the model was correctly able to identify it as negative as well.
So you'll see here that there were 90 samples that were actually negative.
And the model correctly identified them as being negative.
Now, when we get to the purple areas here, this is kind of a danger zone, because these are things that we don't want to see,
and we want as few of these as possible.
So right down here, we have our false positives.
This means that the model thought it was a positive sample, but it's actually a negative sample.
So right here, this means that the model thought that it was cancerous, but it actually wasn't cancerous.
And up here on the very top, right, these are our false negatives.
Now, especially working with health care, these are really dangerous, because we basically have a model that thinks that this isn't cancerous, but it actually is.
And when it comes to cancer screening, we want to make sure that it's recognized as soon as possible.
So we always want to keep these to a minimum.
We notice there were five examples where the model thought the sample was the one label, but it actually was a zero.
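The video reads the top-left cell as true positives because it treats malignant (class 0) as the positive class; note that scikit-learn's own convention treats class 1 as positive, so unpacking the matrix with `ravel()` names the cells the other way around. A small sketch with hypothetical counts:

```python
import numpy as np

# Hypothetical 2x2 matrix in scikit-learn's layout (rows = true, cols = predicted)
cm = np.array([[48, 5],
               [1, 89]])
# ravel() unpacks row-major; with labels [0, 1], scikit-learn calls class 1 "positive"
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 48 5 1 89
```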
So now let's calculate some metrics that we can extrapolate from this confusion matrix.
So we can start with accuracy.
So accuracy is a score that measures the overall share of correct predictions.
So let's just start by printing out accuracy.
And we have a very convenient function called accuracy score in scikit-learn that computes this for us.
And all we need is the test labels and then the prediction data for the test set.
Okay.
The next thing that we can also look at is precision, which basically is a metric to determine how often the model was correct when predicting positive.
So.
That one is also available on scikit-learn.
You can see that it was a function that we imported earlier.
Another metric that we can use is recall.
Recall is a metric that determines how often the model was able to correctly identify true positives out of all the positives there are.
So let me just put in my test and our y predictions.
And let's just print out what those look like.
All right.
So we can see that our accuracy was 95%.
That's pretty good.
Our precision is 94% and recall is 97%.
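The three metric calls above, end-to-end in one self-contained sketch (random_state is my addition; the exact scores will differ slightly from the video's):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Each metric compares the true test labels against the model's predictions
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
print(accuracy, precision, recall)
```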
So at this point, this is the point where we can start either fine tuning the model or making some changes to the model to improve its performance,
but now that we have these metrics, we can see how well the model was able to learn new data and what we were able to extrapolate from that data.
Overall, a confusion matrix is a simple and easy way to determine how well the classification model is performing.
From the results of this confusion matrix, as we saw before, we can decide to either fine tune the model further or leave it as is.
By effectively analyzing the results of the model,
we can create models that have higher performance metrics, which is especially helpful for machine learning models used in health care.
Overall, I hope that this quick tutorial is helpful for you and if it is, leave a comment.
Happy coding.