Understanding Confusion Matrices with Scikit-learn

Key Points

  • Diarra Bell introduces confusion matrices as a tool to evaluate classification model performance, noting common classifiers like logistic regression, Naive Bayes, SVMs, and decision trees.
  • She demonstrates building a binary classifier in a Jupyter notebook using scikit‑learn’s breast‑cancer dataset, importing the necessary libraries (metrics, train‑test split, scaler, pandas, Matplotlib).
  • After loading the dataset, she creates a pandas DataFrame to inspect the feature columns and target labels that indicate malignant versus benign samples.
  • The workflow proceeds with preprocessing (train‑test split and scaling), fitting a logistic‑regression model, and then generating a confusion matrix to visualize true/false positives and negatives.
  • The video explains how to interpret the confusion matrix and related metrics (accuracy, precision, recall, F1) to assess the model’s classification results.
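The workflow in the bullets above can be condensed into a short script. This is a minimal sketch assuming current scikit-learn APIs; the `random_state` value and variable names are illustrative, not taken from the video:

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load features X and labels y (0 = malignant, 1 = benign).
X, y = load_breast_cancer(return_X_y=True)

# Default split: about 75% train / 25% test, shuffled.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on training data only, then reuse it for the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the classifier and summarize its test predictions.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Rows are true labels, columns are predicted labels.
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
```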

Full Transcript

# Understanding Confusion Matrices with Scikit-learn

**Source:** [https://www.youtube.com/watch?v=PoqGrCscJ7k](https://www.youtube.com/watch?v=PoqGrCscJ7k)
**Duration:** 00:15:13

## Sections

- [00:00:00](https://www.youtube.com/watch?v=PoqGrCscJ7k&t=0s) **Introducing Confusion Matrices with Scikit‑learn** - Diarra Bell explains what a confusion matrix is and walks through building a binary logistic‑regression classifier on the breast‑cancer dataset in a Jupyter notebook, importing the necessary scikit‑learn tools and visualizing results.
- [00:03:05](https://www.youtube.com/watch?v=PoqGrCscJ7k&t=185s) **Adding Target Labels and Splitting Data** - The speaker shows how to append a target column indicating malignant (0) or benign (1) samples to the feature DataFrame, displays it, and then separates the dataset into X (features) and Y (labels).
- [00:06:07](https://www.youtube.com/watch?v=PoqGrCscJ7k&t=367s) **Scaling Data and Training Logistic Regression** - The speaker explains how to standardize features, fit a logistic regression model to the scaled training data, and quickly generate a confusion matrix for evaluation.
- [00:09:17](https://www.youtube.com/watch?v=PoqGrCscJ7k&t=557s) **Visualizing and Interpreting Confusion Matrix** - The speaker demonstrates using sklearn's `ConfusionMatrixDisplay` with matplotlib to plot a confusion matrix and explains how each cell (true positives, true negatives, etc.) reflects the model's prediction accuracy for malignant and benign samples.
- [00:12:27](https://www.youtube.com/watch?v=PoqGrCscJ7k&t=747s) **Calculating Classification Metrics** - The speaker demonstrates how to use scikit-learn's accuracy, precision, and recall functions on test predictions derived from a confusion matrix, interprets the resulting scores, and notes that these metrics guide further model fine‑tuning.

## Full Transcript
0:00 Hi, I'm Diarra Bell and I'm an AI engineer at IBM. 0:04 Today, we're going to be talking about confusion matrices. 0:07 A confusion matrix is a way to summarize the performance of a classification model. 0:12 Classification models can be applied to a variety of different use cases to sort data into different categories. 0:18 A few examples of models that can be used for classification include logistic regression, Naive Bayes, support vector machines, and decision trees. 0:28 In this video, we'll build a quick binary classifier model with scikit-learn, 0:32 and we'll analyze a confusion matrix to assess the results. 0:35 You don't have to be confused. 0:36 We'll explain everything in the video. 0:38 All right. 0:39 Let's get started with actually writing some code. 0:42 So as you can see here, I have a Jupyter notebook open and I have some libraries that we're going to import. 0:49 Some are from scikit-learn. 0:52 I'm importing the load breast cancer default dataset for our logistic regression. 0:58 I'm also importing the metrics library and logistic regression. 1:02 And then I'm also importing some functions for metric scores for our model, and we'll talk about those later. 1:10 I'm also importing Matplotlib, and I'm also importing train test split so that we can split our data into training and test sets. 1:18 I'm importing a scaler for us to do some preprocessing on the data, and then I'm importing pandas for us to visualize the data as a data frame. 1:27 Now that we have all of our libraries imported, let's get started with actually loading a dataset into Jupyter Notebook. 1:36 So we're using the default dataset called the breast cancer dataset, 1:42 that has information about different cells and labels that determine whether they are cancerous or non-cancerous, that is, malignant or benign. 1:53 So this is a really common dataset that's used in machine learning because it's very simple and easy to understand, 2:01 and... so let's get started.
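The imports the speaker describes might look like the following (a sketch; the exact module paths assume a current version of scikit-learn):

```python
# Plotting, data inspection, and the scikit-learn pieces used in the video.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```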
2:04 So we're going to create a variable called data. 2:07 And we're going to have that equal to the load breast cancer function that we got from scikit-learn. 2:18 And now the next thing we're going to do is we're going to create a data frame just so we can see what that data looks like. 2:25 So let's create a data frame called DF, 2:28 and we'll set that equal to a pandas.DataFrame, 2:35 and then we'll say we're going to get that data from the load breast cancer function. 2:42 And actually, the class actually has its own function to get the data from it. 2:47 And then we'll say that the columns are equal to data.feature_names. 2:56 Let's hold on. 3:04 Okay. 3:05 And so now we have a data frame there. 3:08 And let's just see what that looks like. 3:11 So hold on. 3:12 Let me just display the head of that data so you'll see what it looks like. 3:17 We'll see that we have several rows, and each row is a different sample of cells. 3:22 And then these are all of the features that can affect the cell. 3:26 And we are using these features to determine whether the cell is cancerous or it's a benign sample of cells. 3:34 So now what we need to do is actually add the target labels to this data set. 3:44 So the target labels are the classes that we are predicting. 3:47 We have the class zero, which is malignant, which means it's cancer. 3:51 And we have the class one, which means it's benign, which means it's not cancer. 3:55 So let's display those target labels. 3:59 So we're going to have a new column that we're going to add to our data frame, and we're going to call it target, 4:11 and we'll say that is equal to data.target, which gets all of the target labels. 4:17 And if we display the head of our data frame now, this is what we'll get. 4:22 So you'll see it's exactly the same. 4:23 Only now we have our target labels. 4:25 As you can see, these are all zero, which means that they are malignant samples. 4:30 Okay.
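The DataFrame steps just described could be written as follows (a sketch; `df` mirrors the video's naming, and `feature_names` and `target` are the standard attributes of the loaded dataset object):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled breast cancer dataset.
data = load_breast_cancer()

# One row per cell sample, one column per feature.
df = pd.DataFrame(data.data, columns=data.feature_names)

# Append the target labels: 0 = malignant, 1 = benign.
df["target"] = data.target

print(df.head())
```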
4:31 So the next thing that we're going to do is we are going to split the data into X and Y variables. 4:36 X is a variable that contains all of the features, and Y is a variable that contains all of the target labels. 4:43 So the classes that we're predicting: either 0 or 1, either malignant or benign. 4:49 So we're going to get started by actually loading the data as X and Y variables. 4:54 And we can do this with an inbuilt parameter called return_X_y, which we're going to set to true. 5:02 And so now it's going to return all the features and all of the target labels as separate variables from this dataset. 5:12 The next thing that we're going to do is we're going to split that data into training data and test data. 5:20 We're going to use a default split, which is about 25% of the data is going to be used for testing, and 75% is going to be used for training. 5:28 We do this so that we don't have any overlap between the testing and the training data. 5:33 And it's randomized so we can get a better idea of how well the model performs on new and unseen data. 5:39 So let's get started with that. 5:41 So we'll use the train test split function, and let's create some variables: x train, x test, y train, and y test. 5:57 And we'll set those as the variables that are the result of train test split on x and y. 6:06 Let's run that. 6:07 All right. 6:07 Good to go. 6:08 The next step is preprocessing the data. 6:12 So we actually do need to scale the data because we're using a logistic regression. 6:16 And all of the features are going to be compressed into zero and one using the sigmoid function. 6:23 So we'll have to create a scaler. 6:25 And we've imported the library earlier. 6:29 So let's just create an instance of the standard scaler class. 6:34 And we're going to scale the x data for both the training and the test sets. 6:39 So let's create a variable called x train scaled. 6:43 And we'll set that equal to scaler.fit_transform on x train, and we'll do the same for the test dataset.
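A sketch of the split-and-scale step described above (the `random_state` value is illustrative; the video relies on the default random shuffle):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# return_X_y=True returns the features and labels as separate arrays.
X, y = load_breast_cancer(return_X_y=True)

# Default split: roughly 25% test / 75% train, randomized.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features: fit on the training set only, then apply the
# same transformation to the test set to avoid leaking test statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```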
7:09 All right. So now we have our scaled versions of both the training data and the testing data. 7:14 The next thing that we're going to do is start training the model, which is the super exciting part, and it only takes about one line of code. 7:22 So now we're going to build a logistic regression, 7:26 and let's create a variable called model, 7:29 and we'll set that equal to logistic regression, 7:32 which is the class that we already imported before. 7:35 And we're going to fit this data to our training set. 7:39 So we're going to give it our scaled version of the x variable, 7:46 and then we're also going to give it the training set for y. 7:51 So now we're going to just fit our model to that data, and our model will be trained. 7:57 All right. 7:58 Pretty quick, because we have a relatively small dataset. 8:02 So now let's get started by building our confusion matrix, which once again only takes one line of code. 8:08 We're going to start by creating a variable called confusion matrix. 8:11 Pretty simple. 8:12 You can call it whatever you want. 8:14 And we're going to use the metrics library. 8:19 And there's actually a convenient confusion matrix 8:25 function that we can use, and we're going to put in our actual labels and then we're going to put in our predicted labels to make the confusion matrix. 8:36 Let's print it out. 8:38 So it's not going to be a graphical representation yet. 8:41 It's going to be just an array. 8:43 So as you can see, it printed out just an array of numbers. 8:47 And we're going to explain what those numbers mean in a second. 8:49 But I think it'll be easier if we have an actual graph so that we can see what the true labels are and what the predicted labels are. 8:57 So now that we have this confusion matrix, which is our numerical display of our confusion matrix, 9:04 let's do a graphical display so we can actually see more clearly what each part is referring to. 9:12 So let's create a variable called confusion matrix display.
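The training and confusion-matrix steps might look like this (a sketch that repeats the earlier preprocessing so it runs standalone; `random_state` is illustrative):

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocessing from the earlier steps.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training the model takes one line of code.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Building the confusion matrix also takes one line: pass the actual
# labels first, then the predicted labels.
y_pred = model.predict(X_test_scaled)
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print(confusion_matrix)  # a plain 2x2 array, not yet a graphical display
```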
9:19 And we're going to set that equal to a function in the metrics library called confusion matrix display. 9:30 And this gives you a graphical representation of a confusion matrix. 9:34 And we're going to set our confusion matrix equal to the one that we just did. 9:39 And let's print that out and display it here. 9:46 So. 9:54 And now let's use matplotlib to show it. 10:03 Okay. 10:04 There you go. 10:05 So now we have our actual confusion matrix. 10:09 So let's explain what we're looking at here, because we see a bunch of numbers. 10:15 So each square in this confusion matrix represents how many samples the model predicted as a certain class, 10:26 and what the actual class was for those samples. 10:30 So up here in our top left are our true positives. 10:35 These are great. 10:36 This means that the actual label was positive and the predicted label, the one the model produced, was also positive. 10:47 So as you can see here, true positives means that it was correctly identified to be cancerous. 10:53 So zero, as we know, stands for a malignant sample, and one is a benign sample. 11:00 So we'll see here that these are all correctly identified. 11:05 Now, if we go to our bottom right, we'll see that this is also a positive thing. 11:09 These are true negatives. 11:11 So that means that the actual label was negative and the model was correctly able to identify it as negative as well. 11:18 So you'll see here that there were 90 samples that were actually negative. 11:24 And the model correctly identified them as being negative. 11:29 Now, when we get to the purple areas here, this is kind of a danger zone, because these are things that we don't want to see, 11:35 and we want as few of these as possible. 11:38 So right down here, we have our false positives. 11:42 This means that the model thought it was a positive sample, but it's actually a negative sample.
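The graphical display step uses `metrics.ConfusionMatrixDisplay`. Below is a sketch that repeats the full pipeline so it runs standalone; the headless-backend line is an editor addition, not from the video, so the script also runs without a display attached:

```python
import matplotlib
matplotlib.use("Agg")  # editor addition: lets the script run headless
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the trained model and its confusion matrix.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
model = LogisticRegression().fit(scaler.fit_transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))
cm = metrics.confusion_matrix(y_test, y_pred)

# Wrap the raw array in a display object and draw it with matplotlib.
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
```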
11:47 So right here, this means that the model thought that it was cancerous, but it actually wasn't cancerous. 11:55 And up here on the very top right, these are our false negatives. 12:00 Now, especially working with health care, these are really dangerous, because we basically have a model that thinks that this isn't cancerous, but it actually is. 12:10 And when it comes to cancer screening, we want to make sure that it's recognized as soon as possible. 12:16 So we always want to make sure. 12:18 So we notice there were five examples where the model thought that it was the one label, but it actually was a zero. 12:27 So now let's calculate some metrics that we can extrapolate from this confusion matrix. 12:34 So we can start with accuracy. 12:36 So accuracy is a score that measures the overall proportion of correct predictions. 12:46 So let's just start by printing out accuracy. 12:51 And we have a very convenient function called accuracy score in scikit-learn that computes this for us. 12:58 And all we need is the test data and then the prediction data for the test set. 13:07 Okay. 13:09 The next thing that we can also look at is precision, which basically is a metric to determine how often the model was correct when predicting positive. 13:21 So. 13:23 That one is also available in scikit-learn. 13:27 You can see that it was a function that we imported earlier. 13:43 Another metric that we can use is recall. 13:46 Recall is a metric that determines how often the model was able to correctly identify true positives out of all the positives there are. 14:07 So let me just put in my test and our y predictions. 14:13 And let's just print out what those look like. 14:17 All right. 14:17 So we can see that our accuracy was 95%. 14:20 That's pretty good. 14:22 Our precision is 94% and recall is 97%.
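The three metric calls might look like this (a sketch; note that scikit-learn's precision and recall functions default to treating label 1, benign in this dataset, as the positive class):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the trained model and its test predictions.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler()
model = LogisticRegression().fit(scaler.fit_transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Accuracy: overall proportion of correct predictions.
accuracy = accuracy_score(y_test, y_pred)
# Precision: of the samples predicted positive, how many really were.
precision = precision_score(y_test, y_pred)
# Recall: of the samples that really are positive, how many were found.
recall = recall_score(y_test, y_pred)
print(accuracy, precision, recall)
```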
14:27 So at this point, this is the point where we can start either fine tuning the model or making some changes to the model to improve its performance, 14:35 but now that we have these metrics, we can see how well the model was able to learn new data and what we were able to extrapolate from that data. 14:46 Overall, a confusion matrix is a simple and easy way to determine how well a classification model is performing. 14:52 From the results of this confusion matrix, as we saw before, 14:55 we can decide to either fine tune the model further or leave it as is. By effectively analyzing the results of the model, 15:02 we can create models that have higher performance metrics, which is especially helpful for machine learning models used in health care. 15:08 Overall, I hope that this quick tutorial is helpful for you, and if it is, leave a comment. 15:12 Happy coding.