
Bag of Words: Concept & Applications

Key Points

  • Bag‑of‑Words is a feature‑extraction method that transforms text into numerical vectors by counting word occurrences, enabling machine‑learning models to process language data.
  • A common application is email spam detection, where word frequency patterns help classify messages as legitimate or spam.
  • The concept extends to visuals as “bag of visual words,” breaking images into key features (e.g., ears, whiskers) for tasks like object detection.
  • Typical downstream uses include text classification, document similarity, and search‑query matching, though the approach has trade‑offs such as loss of word order and context.

# Bag of Words: Concept & Applications

**Source:** [https://www.youtube.com/watch?v=pF9wCgUbRtc](https://www.youtube.com/watch?v=pF9wCgUbRtc)
**Duration:** 00:21:07

## Sections

- [00:00:00](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=0s) **Bag‑of‑Words Feature Extraction Overview** - The speaker introduces the Bag‑of‑Words technique, explaining how it converts text into numeric vectors for machine‑learning tasks such as spam detection, and outlines its advantages, limitations, applications, and possible improvements.
- [00:03:06](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=186s) **Bag-of-Words Feature Construction** - The speaker explains text classification and document similarity, then demonstrates building a vocabulary from two example sentences and converting those words into numeric bag‑of‑words features for a machine‑learning model.
- [00:06:12](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=372s) **Bag-of-Words Feature Overview** - The speaker explains how sentences are transformed into word‑count vectors, highlights the method’s simplicity and interpretability, and briefly points out its inherent drawbacks.
- [00:09:19](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=559s) **Bag‑of‑Words Model Drawbacks** - The speaker outlines how bag‑of‑words cannot capture word correlations, context, or order and suffers from sparsity, causing ambiguous and ineffective representations for machine‑learning tasks.
- [00:12:23](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=743s) **N‑grams and Text Normalization** - The speaker explains how to enhance a bag‑of‑words model by using n‑grams to capture consecutive word sequences and applying text normalization techniques like stemming to reduce words to their base forms.
- [00:15:32](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=932s) **IDF Role in TF‑IDF** - The speaker explains how inverse document frequency reduces the TF‑IDF weight of words that appear in many documents—preventing common stop‑words from dominating—and illustrates its application in tasks like classifying support tickets and other document‑classification problems.
- [00:18:38](https://www.youtube.com/watch?v=pF9wCgUbRtc&t=1118s) **Word Embeddings and Sentiment Analysis** - The speaker describes representing words as vectors in an n‑dimensional space to gauge semantic similarity, explains that such embeddings commonly use bag‑of‑words, and shows how bag‑of‑words models can identify sentiment and flag negative or discriminatory language.

## Full Transcript
0:00We are going shopping for a new concept to learn. 0:03Keep your hands free because we are going to have a lot of bags to deal with. 0:08You guessed it. 0:09The topic for today is Bag of Words. 0:14Bag of Words is a feature extraction 0:16technique to convert text into numbers, 0:20and it's exactly what it sounds like. 0:23A collection of different words. 0:27A great use case for Bag of Words is spam filters in your emails. 0:32For example, you might be receiving different emails 0:36about the latest news, 0:39maybe some interesting messages from your friends, 0:43and perhaps some spammy content 0:46saying that you have won a lottery and you're about to become a millionaire. 0:51Bag of words looks at the different words present 0:54and their frequency in each of these emails and tries to tell 0:59which of these would be spam. 1:03So today we are going to be looking at 1:05what bag of words means, as well as some examples. 1:09We will be looking at the pros and cons of bag of words, 1:14certain applications, 1:17and also modifications that we can use 1:20to improve our bag of words algorithm. 1:26Like I said, 1:27bag of words is a feature extraction technique, which means that 1:32all of your different texts or different words 1:36are converted into numbers. 1:41After all, 1:41numbers are what our machine learning models understand. 1:45I like to think of Bag of Words as a bag of popcorn. 1:51Let's think of the different words as kernels of popcorn. 1:56And each word represents a kernel. 1:58Or rather, each kernel represents a different word. 2:02The cool thing about Bag of Words is that it's not just limited to words, but 2:06it can also be applied to visual elements, 2:09which is bag of visual words. 2:13Let's say, for instance, you have an image of a cat. 2:20And yes, this is how I draw a cat, 2:23but you can break down this image of a cat 2:25into multiple different key features. 2:29You could have an ear, you could have 2:32whiskers, a body, 2:35legs and a tail. 
2:38And each of these different elements helps in multiple 2:41computer vision techniques, such as object detection. 2:45So you can use bag of words not just on words, 2:48but also on visual words, which is images. 2:52Next, let's take a look at what bag of words looks like 2:55for different sentences, and see the pros and cons for it. 2:59A common NLP task where bag of words comes in handy is 3:04text classification. 3:06Let's say, for example, spam or not spam. 3:11You could have your email, 3:14and depending on what words are in it, you could classify it. 3:18So this is an example of text classification. 3:23Another example could be 3:25that of document similarity, 3:28where perhaps you want to compare two different documents 3:32and check how similar they are to each other. 3:36Or maybe you have a particular query, 3:40like the type you put in a search engine, 3:43and you want to find the most relevant 3:45documents. 3:49Both of these examples, text classification and document similarity, 3:53use bag of words in the back end. 3:57Now let's take an example of two sentences 3:59and see how we can convert the text, or the words, 4:04into features or numbers for our machine learning model 4:07to understand. 4:08Consider two sentences. 4:11Sentence number one: 4:13I think, 4:17therefore, I am. 4:23And sentence number two: 4:26I love learning 4:31Python. 4:37Now that we have our two example sentences, 4:40what we are going to begin with is creating our vocabulary 4:45or a dictionary, which is the set of unique words 4:47present in all of the given documents. 4:51In our case, there are only two sentences that we are looking at. 4:54But let's take a look at all the unique words present in here. 4:58So we have 5:00I as a unique word. 5:03Think. 5:06Therefore. 5:09I has already been covered over here, so we move on to the next one, 5:14am. Moving to the next sentence, 5:17I is also included here. 5:20Love, learning, 5:25Python. 5:27That's 1, 2, 3, 4, 5, 6, 7. 5:31Seven words. 
5:33These seven unique words are what make up our dictionary or 5:36our vocabulary based on these two sentences. 5:39Let's look at what the bag 5:42of words representation for each of those sentences would be, 5:46and what we are constructing over here is called a document term matrix. 5:51So here are our documents. 5:52We consider our first document. 5:55And these are the different terms or the vocabulary present in here. 5:59So going over the first sentence, I occurs a total of two times. 6:05We look at the count of each particular word 6:09and try to see how many times it occurs in that particular sentence. 6:13So I is used a total of two times. 6:16Think once. 6:18Therefore once. 6:20Am once. 6:22And in our first sentence, love, learning and Python do not appear, 6:26which is why they get a score of zero. 6:30Doing the same technique for our second sentence, 6:33I appears a total of one time. 6:36Think, therefore and am are absent in that sentence, which is why 6:41they get zero, and love, learning and Python each occur once, 6:46which is why they get one. 6:48So what you're seeing over here is a vector of numbers 6:53that represents the first sentence. 6:56So we have now taken words and converted them into 7:00a feature representation. 7:02That is, we have numbers over here, which is what our machine 7:05learning models understand. 7:08And similarly, 7:10this is the feature representation for our second sentence. 7:16Now that we've seen 7:17what bag of words looks like, or how to calculate it, 7:21the pros are kind of obvious. 7:25It's simple, as you saw. 7:28You count the number of times a particular word occurs, and you denote that count 7:33at that particular position for that sentence. 7:37It's easy, which is what we did over here. 7:40And it's explainable, 7:43as opposed to certain other algorithms 7:46that maybe are not as intuitive. 7:49Unfortunately, as with all things in life. 
7:52There are going to be pros and there are going to be cons. 7:55Next, we'll take a look at the cons of this simplistic algorithm 7:59and see if we can modify it to make it work better for us. 8:03Let's look at some of the drawbacks associated with bag of words. 8:07The first one being compound words. 8:10Think about words like AI, 8:13artificial intelligence, or New York. 8:18In a simplistic bag of words approach, 8:20you break down artificial and intelligence, and now they are treated 8:25as two separate words with no correlation or no meaning between the two. 8:30That would apply to New York as well, where New is 8:33one word and York is another word. 8:36In this case, we are losing the semantics or the meaning 8:40that exists between the two words, which is a drawback. 8:46Let's look at another example. 8:49Perhaps cake. 8:53And baking. 8:57Maybe racing as well. 9:01Given these three words, 9:03cake, baking and racing, cake and baking are more likely to co-occur, that is, to occur 9:08in the same context, in the same documents, 9:12as opposed to cake and racing. 9:14Well, of course, if tomorrow somebody invents 9:16a new sport called cake racing, that's going to change. 9:20But let's hope it doesn't. 9:22In this case, our bag of words model is not able 9:26to associate the correlations that exist among the words, 9:29which might pose a problem to our machine learning models. 9:36Let's look at another 9:37drawback: polysemous words. 9:41Consider the word Python. 9:45Looking at just this word, it's hard to tell 9:48if I'm talking about Python the programming language or Python 9:51the animal. 9:54Maybe there's another word 9:56that's content (satisfied) or content (material). 10:00It could mean either of the two, but just looking at the spelling, it's 10:04hard to tell which is which. 10:08Another drawback that exists 10:10is that we lose the order associated between the words. 
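The document-term matrix walked through earlier, and the word-order loss just mentioned, can both be seen in a short sketch (a minimal pure-Python illustration; the video shows no code, and the tokenizer here is deliberately naive):

```python
from collections import Counter

def bag_of_words(sentences):
    """Build a vocabulary and a document-term matrix of raw counts."""
    # Naive tokenizer: lowercase, strip basic punctuation, split on spaces.
    tokenized = [s.lower().replace(",", "").replace(".", "").split()
                 for s in sentences]
    vocab = []                      # unique words, in first-seen order
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    # One count vector per sentence, aligned with the vocabulary.
    matrix = [[Counter(tokens)[w] for w in vocab] for tokens in tokenized]
    return vocab, matrix

vocab, matrix = bag_of_words(["I think, therefore I am.",
                              "I love learning Python."])
print(vocab)   # ['i', 'think', 'therefore', 'am', 'love', 'learning', 'python']
print(matrix)  # [[2, 1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1]]

# Word order is lost: these two sentences get identical vectors.
_, m = bag_of_words(["the dog bit the man", "the man bit the dog"])
print(m[0] == m[1])  # True
```

The seven-word vocabulary and the two count vectors match the walkthrough in the transcript, and the last pair shows why order-dependent meaning cannot survive the counting step.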
10:14Like I mentioned, bag of words is nothing but a bag of popcorn, 10:19with each of the kernels being a specific word. 10:22And when you shake that bag, you lose all of the relationships that exist 10:27as far as the order of the words is concerned. 10:30Let's say, for example, 10:33I have a sentence that says flight, 10:37San Francisco, 10:40Mumbai, from, 10:44to. 10:48What does this mean? 10:49Am I trying to fly from 10:52San Francisco to Mumbai? 10:56Am I trying to fly the other way around, 10:59from Mumbai to San Francisco? 11:02It's hard to tell when we have only 11:04the bag of words available. 11:08Last but not least 11:10is the problem of sparsity 11:13in our bag of words approach. 11:15We look at each of the unique words which make up our vocabulary, 11:19and denote the presence of that particular word in a sentence. 11:24Given a large number of documents, 11:26you could have a very, very high number of vocabulary words. 11:31Yet in each of the sentences, 11:34there could be maybe only three words, or a very, very small proportion of words, 11:40that actually are present, with most of the other spaces being zeros. 11:50This leads to the problem of sparsity, 11:52since our matrix, or our vectors over here, is sparse, 11:57in the sense that most of these elements are unoccupied 12:00because they're denoted by zeros, and very few of them are actually present. 12:05This could also pose a challenge with our models. 12:09Fear not, though. 12:11Despite these drawbacks, we do have certain modifications in mind. 12:16Let's take a look at some of the modifications that can help 12:19improve our bag of words 12:20approach. 12:24Our first modification is n-grams. 12:28Instead of looking at each individual word, 12:32you can now look at a combination of words that occur together. 
12:36For example, with artificial intelligence 12:40being the phrase, we don't break it into artificial and intelligence, 12:44but now we look at the presence of artificial and intelligence together 12:50and denote how many times it occurred in a particular document. 12:54Similarly, for New York, we look at the presence 12:56of New and York right after each other 13:00and denote the number of counts, or the times it occurs in that document. 13:06In this case, since our phrases 13:08are made up of two words, 13:12n is equal to two. You could extend this with n 13:16equal to three, n equal to five, and so on and so forth. 13:19In which case you would look at, for example, if n is equal to three, 13:24you would look at three words that occur right next to each other. 13:29So maybe it is Python 13:32artificial intelligence. 13:34And any time these three words occur in your document, 13:39you would count the number of times that happens 13:42and denote the occurrence here. 13:46Another modification 13:48that we can do is text normalization. 13:52Text normalization refers to certain preprocessing activities 13:55that you can do before you pass on the text to your bag of words 13:59model. 14:00A good example of this is the process of stemming, 14:05in which case 14:05you're trying to remove the ends of the words 14:09in the hope of getting back to the base word or the base stem. 14:14Consider the words coding, 14:17coded, 14:19codes and code. 14:23When you start removing the ends of the words, 14:27you can try to get to the base word, 14:31which is code in this case. 14:34This is a way to reduce the size of your vocabulary 14:37or reduce your dictionary words, 14:39and hopefully that will help with the sparsity issue. 14:43An important concept that builds upon bag of words 14:46is TF-IDF, or term frequency 14:50inverse document frequency. 
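The n-gram and stemming modifications just described can be sketched as follows (an illustrative pure-Python sketch; `ngrams` and `crude_stem` are hypothetical helpers, not from the video, and the suffix-stripping stemmer is far cruder than real algorithms such as Porter's):

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined into single features."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is in the state of new york".split()
print(ngrams(tokens, 2))
# ['new york', 'york is', 'is in', 'in the', 'the state',
#  'state of', 'of new', 'new york']  -- 'new york' is now one feature

# A deliberately naive stemmer: strip a few common English suffixes.
def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# coding/coded/codes all collapse to the same stem, shrinking the
# vocabulary; note the stem ('cod') need not be a dictionary word.
print([crude_stem(w) for w in ["coding", "coded", "codes", "code"]])
# ['cod', 'cod', 'cod', 'code']
```

Counting the bigram features instead of single words preserves phrases like "new york", and collapsing inflected forms onto one stem directly reduces the sparsity discussed earlier.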
14:54You can think of TF-IDF 14:56as a weight or a score associated with words, 14:59or perhaps even a feature scaling technique. 15:04TF is the term frequency, 15:06or the number of times a particular word occurs in your document. 15:11Let's say the words votes, 15:14president, 15:15government occur a lot of times in your document. 15:20It probably has something to do with 15:22maybe elections or some other government matter. 15:26So the higher the term frequency, 15:29the higher the score or the weight associated with that word. 15:33That makes sense. With inverse document frequency, 15:36however, you look at the number of documents 15:42that that particular word occurs in. 15:47And if that word occurs in multiple documents, 15:51or a huge proportion of documents, you actually give it a lower score. 15:58So the more documents the word occurs in, the lower 16:01the IDF score, and the lower the whole TF-IDF score becomes. 16:07This may seem a little counterintuitive, right? 16:10It's the opposite of the term frequency. 16:14But consider the example of words like the, 16:18an and some. 16:23Words like these 16:25don't really have any meaning on their own, 16:28but they're used to create grammatically correct sentences. 16:32As you can imagine, in the English language, or in a lot of documents 16:36with English language in them, these words would occur a lot of times, 16:41perhaps even the most frequently occurring words. 16:45In that case, we do not want 16:47these words to have a high TF-IDF score, 16:51which is where the IDF component lowers their score, 16:56as these scores are not representative of the topics 16:59or the content of the documents. 17:03Let's take a look 17:04at some applications of TF-IDF. 17:12Let's consider 17:13document classification as an example. 
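The TF-IDF weighting just explained can be made concrete with a small sketch (this uses one common variant, normalized term frequency times a smoothed log IDF; the exact formula is an assumption, since the video gives no equation):

```python
import math

def tf_idf(docs):
    """Per-document TF-IDF scores (normalized TF x log-scaled IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    vocab = sorted({w for toks in tokenized for w in toks})
    # df(w): how many documents contain the word w at least once.
    df = {w: sum(w in toks for toks in tokenized) for w in vocab}
    scores = []
    for toks in tokenized:
        row = {}
        for w in vocab:
            tf = toks.count(w) / len(toks)      # term frequency
            idf = math.log(n / df[w]) + 1       # lower for widespread words
            row[w] = tf * idf
        scores.append(row)
    return scores

docs = ["the president counts the votes",
        "the election results are in",
        "the cake is in the oven"]
scores = tf_idf(docs)
# 'the' appears in every document, so its IDF is log(3/3) + 1 = 1,
# while 'votes' appears in only one: log(3/1) + 1, about 2.10. As a
# result 'votes' outscores 'the' in the first document even though
# 'the' has twice the raw count.
print(round(scores[0]["the"], 3), round(scores[0]["votes"], 3))
```

This is exactly the effect described above: the IDF component dampens stop-word-like terms that occur everywhere, letting topical words dominate the representation.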
17:18Perhaps you have a company and a product that you're selling to your customers, 17:23and you have a support channel for them to come and raise 17:26certain concerns, complaints or questions about your product. 17:33Maybe you have a chat associated 17:35with your customers or some support tickets, 17:40and you could use the bag of words approach 17:43to understand which of the teams 17:46is associated with the problem that is there in the ticket. 17:50Maybe you have a billing team 17:53or an onboarding team, 17:57or a trial team, 18:00or maybe it's a documentation issue. 18:05Looking at the vocabulary 18:07that is present, that is, looking at the bag of words representation 18:11of what is entailed in the customer chat or the support ticket, 18:14you will then be able to identify which of these teams is the right 18:19and appropriate team to deal with and resolve the customer's issue. 18:25Another example of bag of words 18:26is word2vec. 18:31You might have heard of word2vec. 18:33These are word embeddings 18:35that exist in an n-dimensional space. 18:40Your words are represented 18:41as vectors in this n-dimensional space. 18:44For example, king and queen are two words, 18:50and the closer the words are in this n-dimensional space, 18:53the more related they are to each other. 18:56In this case, king and queen would be fairly close to each other, 19:00as you would find documents or sentences where king and queen 19:04appear together. 19:07Maybe you have another word, 19:08swim, that comes up in those documents as well, 19:12but you wouldn't really associate swim with king or queen 19:15as much as you would associate king and queen with each other. 19:19So swim would be further away from the vectors of king and queen. 19:25This is called word2vec, or word embeddings, 19:28and it does use bag of words in the back end to create this n-dimensional space. 19:37Another example where 19:38bag of words comes in handy is for sentiment analysis. 
19:44You could look at the collection of words in a given text 19:47and understand if a lot of those words are positive, 19:51maybe words like happy, joy, 19:54excited, or words that are negative: 19:59frustrated, angry, hate, terrible. 20:04And depending on the bag of words representation, you would be able 20:08to identify whether the sentiment is positive or negative. 20:14You could even take this further and try to create a model 20:18that helps detect hate speech. 20:21So you would look 20:22at the negative sentiments or the negative words present in there, 20:26and maybe extend it with other words, for example, racism 20:31or other forms of discrimination, and try to create a model that helps 20:36you identify these unwanted or unexpected texts on the internet. 20:43Now that you have this concept in the bag, 20:47I hope this helps you understand a little more about natural language 20:50processing and encourages you 20:52to continue your journey into the field of artificial intelligence. 20:56If you like this video and want to see more like it, please like and subscribe. 21:02If you have any questions or want to share your thoughts about this topic, 21:06please leave a comment below.
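As a closing illustration, the lexicon-based sentiment idea described near the end of the video can be sketched as a simple word count (the `POSITIVE` and `NEGATIVE` sets below are tiny toy lists for illustration, not a real sentiment lexicon):

```python
POSITIVE = {"happy", "joy", "excited", "great", "love"}
NEGATIVE = {"frustrated", "angry", "hate", "terrible", "awful"}

def sentiment(text):
    """Classify by comparing counts of positive vs. negative lexicon words."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(sentiment("I am so happy and excited about this"))  # positive
print(sentiment("this is terrible and I hate it"))        # negative
```

Because only word presence matters, this is a pure bag-of-words classifier; extending the negative list with discriminatory terms is the same mechanism the video suggests for flagging harmful text.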