Text Classification: Types and Techniques

Key Points

  • Text classification transforms raw text—like emails or Netflix movie descriptions—into automated categories such as spam vs. not‑spam or comedy vs. drama, reducing the need for manual labeling.
  • The three main classification tasks are binary (two classes), multiclass (one of many exclusive classes), and multi‑label (assigning multiple categories to a single item, e.g., an action‑adventure film).
  • The workflow centers on heavy preprocessing of raw text (cleaning punctuation, tokenization, etc.) before converting it into numerical vectors via word embeddings.
  • A pre‑trained language model (e.g., BERT, ChatGPT, Granite) suited to the specific classification problem is then selected, leveraging its learned representations to predict the appropriate labels.
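
As a rough illustration of two of the points above, the sketch below shows one naive way to clean and tokenize raw text, and how the label attached to a text differs across the three task types. It uses only the Python standard library. The function names (`clean_and_tokenize`, `to_indicator`), the class names, the genre list, and all example texts are assumptions made for illustration; real preprocessing depends on the use case.

```python
import re


def clean_and_tokenize(text):
    """Naive cleaning: lowercase, drop 's, turn periods/hyphens into spaces."""
    text = text.lower()
    text = re.sub(r"['’]s\b", "", text)  # remove apostrophe-s endings
    text = re.sub(r"[.\-]", " ", text)   # periods and hyphens become spaces
    return text.split()


print(clean_and_tokenize("The customer's order-number is 42."))
# ['the', 'customer', 'order', 'number', 'is', '42']

# Binary: one label per text, drawn from two classes (1 = spam, 0 = not-spam).
binary_example = ("You won a free cruise, click now", 1)

# Multiclass: one label per text, drawn from several exclusive classes.
EMAIL_CLASSES = ["business", "customer", "order"]  # hypothetical class names
multiclass_example = ("Where is my package?", EMAIL_CLASSES.index("order"))

# Multi-label: any number of labels per text, stored as a 0/1 indicator list.
GENRES = ["action", "adventure", "comedy", "drama"]  # hypothetical genre list


def to_indicator(labels, vocabulary):
    """Return a 0/1 list marking which vocabulary entries appear in labels."""
    unknown = set(labels) - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in vocabulary]


multilabel_example = (
    "A treasure hunter races across the jungle",
    to_indicator({"action", "adventure"}, GENRES),
)

print(binary_example[1])      # 1
print(multiclass_example[1])  # 2
print(multilabel_example[1])  # [1, 1, 0, 0]
```

The cleaning rule is deliberately crude: it also strips the ending of contractions such as "it's", which may or may not be acceptable for a given use case.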

Full Transcript

# Text Classification: Types and Techniques

**Source:** [https://www.youtube.com/watch?v=hHiPs_wICsE](https://www.youtube.com/watch?v=hHiPs_wICsE)

**Duration:** 00:13:52

## Sections

- [00:00:00](https://www.youtube.com/watch?v=hHiPs_wICsE&t=0s) **Understanding Text Classification Types** - A brief overview of text classification, illustrating binary, multiclass, and multilabel approaches using spam email and Netflix movie genre examples.
- [00:03:05](https://www.youtube.com/watch?v=hHiPs_wICsE&t=185s) **Text Classification Pipeline Steps** - The speaker outlines the end‑to‑end workflow for text classification, covering raw‑text preprocessing, feature extraction via word embeddings, choosing an appropriate language model, and producing labeled outputs.
- [00:06:17](https://www.youtube.com/watch?v=hHiPs_wICsE&t=377s) **AI Email Sorting & Sentiment** - The speaker outlines how AI models can automatically filter spam, gauge sentiment, categorize topics, and interpret customer feedback from email and social‑media messages.
- [00:09:28](https://www.youtube.com/watch?v=hHiPs_wICsE&t=568s) **Balancing Data and Handling Ambiguity** - The speaker explains how to ensure a well‑balanced model by maintaining proper class ratios, clarifying ambiguous terms like “bank,” and providing a diverse spread of examples across sentiment subcategories.
- [00:12:30](https://www.youtube.com/watch?v=hHiPs_wICsE&t=750s) **Model Validation and Drift Management** - It explains the need to continuously validate text classification models against data drift caused by real‑world changes, ensuring they keep correctly categorizing incoming business communications.

## Full Transcript

0:00 So let's jump in with a quick question. How many of you have come across spam in your email or, while on Netflix, the different categories of a movie? Well, that's text classification.

0:18 Text classification takes raw text, like these documents, and funnels it into a computational engine that then outputs different classifications. So it could be, in the two examples mentioned, a spam email or simply not-spam. Or, in the Netflix example, a comedy, a drama, etc.

0:58 So in today's world, we're constantly bombarded with tons of information. And what text classification can provide us is a means to simplify and automate the classification of different types of text without human input.

1:17 Types of text classification. There are three major types.

1:24 Starting with the least complex: binary classification. That can be expressed as either a one or a zero. Or, in the email example, spam versus not-spam.

1:39 The second is multiclass classification. And that's one of several values: a 2, a 1, or a 0. Or, using the email example, a business-related email, a customer-related email, or an order email.

2:11 The third, and the most complex, is what's called multi-label classification. This is the most complex because you can assign a specific email, or a specific piece of text, multiple classifications. So switching over to the Netflix example, a movie can be classified as an action-adventure, and it has those two classifications as just that one entity.

2:44 So depending on the business use case and text complexity, you'll go through and determine which of these three major types you need to use.

2:57 Key techniques of text classification. There are four key techniques. So the first one is how do you handle the raw text?
3:10 Most of your time is spent preparing and preprocessing the text, and you usually do that in a scripting language such as Python. So you take the raw text, you extract it from the document, and, depending on your use case, you remove periods, hyphens, apostrophe-S's, that sort of thing. It all depends on the use case. But, again, this is where most of your time is spent: working through and preprocessing that text.

3:37 Next comes the step called feature extraction; I'll just put F.E. for short. This is where you take the text and you send it into what's called word embeddings. This is a bit of a black box, and the details of it are outside the scope of today's discussion, but you're essentially taking the raw text and converting it into a long list of numbers.

4:10 The third is the model. When I say model, I mean a large language model like ChatGPT, or a Granite model, or a BERT model. And depending on what you're trying to classify, different models have been pre-trained on their own types of text. In other words, there could be a model that's built just for classifying spam versus non-spam emails, or classifying different types of movies, or different types of news documents. This is where you would select the type of model that's specific to your use case.

5:00 Then the fourth step is the labeled output. So I'll just write "output". This part you need to work through iteratively. These are just the classifications that you're receiving from each of these steps. Depending on your output, you might have to go all the way back to your text and rework it. You might have to go back to your feature extraction and adjust it.
5:32 Or, as mentioned, you might have the wrong model selected, so you might have to go back to that model and select a different one.

5:41 So these four key techniques give you an idea of what iterative steps are required in order to take the raw text, turn it into features, pass it through the model, and then get your labeled output.

6:00 So what are real-world applications of text classification? I'm going to go through just a couple here, but the first one is, as mentioned previously, spam detection. So you get a bunch of emails. You're not sure if they're relevant to you or if someone is sending you something inappropriate. Well, you can add an AI text classification model onto your inbox and classify those emails as spam or not-spam.

6:39 The next one is what's called sentiment analysis. The classic examples are positive, negative, or neutral: whether a string of text is happy, sad, or neutral. And you can use that in the business world to determine how customers feel about something, how they feel about a product. Whether it's how they post about it on Twitter or X, or how they post about it on Instagram, you can determine how they're feeling about something like that through sentiment analysis.

7:18 The next one is a more business-specific and internal type of application, and it's what's called topic categorization. So let's say, for example, a business is receiving emails from customers, and instead of having an administrator go through and manually classify those emails as, let's say, an order, a technical request, or a customer service request, you can have an AI model go through and classify each of those automatically into those categories.

8:02 The fourth is what's called customer feedback.
8:07 So this ties into the others, as mentioned, such as sentiment. But if you're trying to determine how a customer is feeling about something, let's say, for example, they email you and they say, "This product is terrible, I want to return it and never buy something again." Well, from a business standpoint, you want to make sure that you speak to that customer immediately to try to rectify the situation. Whereas on the flip side of that, if a customer is happy with the product and just wants to send a thank-you, you don't need to prioritize that as immediately as you would something a little more negative.

8:48 So these are, I feel, the four major real-world categories of text classification. Obviously there is a lot out there that you can do with this, and the applications are almost limitless.

9:04 So, challenges and best practices when it comes to classifying text. The first one is what's termed imbalanced data sets. You need to make sure that you have the right number of examples for each type of thing that you're trying to classify. If you have too many of one type, or too few of another, your output and your model won't be as balanced as you want them to be. So you need to make sure that you have the right number relative to the output that you're expecting.

9:46 The second is what's called ambiguous text. This one's a little bit gray, and it's relative to each use case. But the example I like to give is the word "bank". The word bank can have a couple of meanings, like the physical location where you store money or the side of a river. The model might not necessarily know what you want it to mean. So leading into it, and in the text that you're using, you need to make sure that you have that specified.

10:34 The third one is diversity.
10:40 Diverse, meaning you have a wide spread of different types of examples. So using sentiment analysis as an example here: positive, negative, and neutral. You need to make sure that the training examples you have within each category span the range from "extremely positive" to "kind of positive", and then, on the negative side, from "extremely negative" to "kind of negative", with everything else in the middle as neutral. So it's important that you have that spread within each. Because if you don't, you might only be receiving classifications on the extremely positive or extremely negative, while you're going to want to capture the sentiments within the spread of each category, each subcategory.

11:42 So what can we do to fix each of these components? One of the things that we can do is what's called proper labeling. This can be really time-intensive. But what I mean by that is you go through each of your training examples and manually read and discern, using the sentiment example: is this positive? Is this negative? And then manually label that yourself. Don't rely on somebody else who might not be versed in the task that you're trying to perform. Do it yourself. Do it by hand.

12:35 And then the last one within this is validation. What I mean by validation is making sure, once you train that first model, that the data you're then receiving or sending out in the real world is still being classified in the way that you want. So there's this thing called "drift", where if a world event comes along and changes what the sentiment of a particular idea or topic would be, the model would perform differently.
13:18 So you need to be constantly going back and reviewing it to make sure that the model is classifying what you want it to classify.

13:28 So to wrap things up, let's revisit why text classification is so powerful. Businesses are getting flooded by tons of information daily: thousands of emails, phone calls, etc. These text classification models are able to classify these things without human intervention, quickly, efficiently, and repeatedly.
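
Two of the practices described in the transcript, checking class balance and validating for drift, can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not a method given in the talk: the function names, the 10% minimum-share threshold, the 0.05 accuracy-drop tolerance, and all of the data are hypothetical values picked for illustration.

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the training labels as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def underrepresented(labels, min_share=0.10):
    """List classes whose share falls below min_share (assumed threshold)."""
    return sorted(n for n, s in class_shares(labels).items() if s < min_share)


def accuracy(predicted, actual):
    """Fraction of predictions that match hand-checked labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def drift_suspected(baseline_acc, recent_acc, tolerance=0.05):
    """Flag a drop in accuracy larger than tolerance (assumed value)."""
    return (baseline_acc - recent_acc) > tolerance


# Made-up training labels: neutral examples are scarce.
train_labels = ["positive"] * 60 + ["negative"] * 35 + ["neutral"] * 5
print(underrepresented(train_labels))  # ['neutral']

# Made-up validation: accuracy at launch vs. on a recent hand-labeled sample.
launch = accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "pos", "neg"])
recent = accuracy(["pos", "pos", "pos", "neg"], ["pos", "neg", "neg", "neg"])
print(launch, recent, drift_suspected(launch, recent))  # 1.0 0.5 True
```

A drop in accuracy on its own does not prove drift; it only signals that a fresh hand-labeled sample is worth reviewing.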