NLP Basics: Translating Unstructured to Structured
Key Points
- Natural language processing (NLP) is the technology that enables computers to understand and generate human language by converting unstructured text (like spoken sentences) into structured data that machines can process.
- The transformation from unstructured to structured data is called natural language understanding (NLU), while the reverse conversion from structured data back to natural language is known as natural language generation (NLG).
- A primary NLP application is machine translation, which requires grasping context and sentence structure to avoid mistranslations such as the “spirit is willing, but the flesh is weak” → “vodka is good, but the meat is rotten” example.
- NLP also powers virtual assistants (e.g., Siri, Alexa) and chatbots, interpreting spoken or written user inputs to execute commands or carry on conversational interactions.
Sections
- Understanding Natural Language Processing - Martin Keen explains NLP as the technology that translates unstructured human language—like spoken or written sentences—into structured data that computers can process, highlighting its role in AI and its relationship to natural language understanding.
- Sentiment, Spam Detection, and Tokenization - The speaker outlines NLP applications such as sentiment analysis and spam detection and explains that processing begins with tokenizing unstructured text into individual word tokens.
- Part‑of‑Speech Tagging and NER Overview - The speaker explains how part‑of‑speech tagging identifies a word’s grammatical role from sentence context and how named‑entity recognition assigns real‑world categories to tokens, converting unstructured speech into structured data for AI applications.
Full Transcript
# NLP Basics: Translating Unstructured to Structured **Source:** [https://www.youtube.com/watch?v=fLvJ8VdHLA0](https://www.youtube.com/watch?v=fLvJ8VdHLA0) **Duration:** 00:09:35 ## Summary - Natural language processing (NLP) is the technology that enables computers to understand and generate human language by converting unstructured text (like spoken sentences) into structured data that machines can process. - The transformation from unstructured to structured data is called natural language understanding (NLU), while the reverse conversion from structured data back to natural language is known as natural language generation (NLG). - A primary NLP application is machine translation, which requires grasping context and sentence structure to avoid mistranslations such as the “spirit is willing, but the flesh is weak” → “vodka is good, but the meat is rotten” example. - NLP also powers virtual assistants (e.g., Siri, Alexa) and chatbots, interpreting spoken or written user inputs to execute commands or carry on conversational interactions. ## Sections - [00:00:00](https://www.youtube.com/watch?v=fLvJ8VdHLA0&t=0s) **Understanding Natural Language Processing** - Martin Keen explains NLP as the technology that translates unstructured human language—like spoken or written sentences—into structured data that computers can process, highlighting its role in AI and its relationship to natural language understanding. - [00:04:13](https://www.youtube.com/watch?v=fLvJ8VdHLA0&t=253s) **Sentiment, Spam Detection, and Tokenization** - The speaker outlines NLP applications such as sentiment analysis and spam detection and explains that processing begins with tokenizing unstructured text into individual word tokens. - [00:07:41](https://www.youtube.com/watch?v=fLvJ8VdHLA0&t=461s) **Part‑of‑Speech Tagging and NER Overview** - The speaker explains how part‑of‑speech tagging identifies a word’s grammatical role from sentence context and how named‑entity recognition assigns real‑world categories to tokens, converting unstructured speech into structured data for AI applications. ## Full Transcript
What is natural language processing? Well, you're doing it right now, you're listening
to the words and the sentences that I'm forming and you are forming some sort of comprehension
from it. And when we ask a computer to do that that is NLP, or natural language processing.
My name is Martin Keen, I'm a Master Inventor at IBM,
and I've utilized NLP in a good number of my invention disclosures. NLP really has a
really high utility value in all sorts of AI applications. Now NLP starts with something called
unstructured text. What is that? Well, that's just what you and I say, that's how we speak.
So, for example, some unstructured text is "add eggs and milk to my shopping list."
Now you and I understand exactly what that means, but it is unstructured at least to a computer.
So what we need to do, is to have a structured representation of that same information that
a computer can process. Now that might look something a bit more like this where we have a
shopping list element. And then it has sub elements within it like an item for eggs,
and an item for milk.
That is an example of something that is structured.
Now the job of natural language processing is to translate between these two things.
So NLP sits right in the middle here translating between unstructured and structured data. And when
we go from structure from unstructured here to structured this way, that's called NLU, or
natural language understanding. And when we go this way from structured to unstructured,
that's called natural language generation, or NLG. We're going to focus today primarily
on going from unstructured to structured in natural language processing now let's think of
some use cases where nlp might be quite handy. First of all, we've got machine translation.
Now when we translate from one language to another we need to understand the context of
that sentence. It's not just a case of taking each individual word from say English and
then translating it into another language. We need to understand the overall structure
and context of what's being said. And my favorite example of this going horribly wrong
is if you take the phrase the "spirit is willing, but the flesh is weak" and you translate that from
English to Russian and then you translate that Russian translation back into English
you're going to go from the "spirit is willing, but the flesh is weak" to something a bit more
like the "vodka is good, but the meat is rotten" which is really not the intended
context of that sentence whatsoever. So NLP can help with situations like that. Now
the the second kind of use case that I like to mention relates to virtual assistants,
and also to things like chatbots. Now a virtual assistant that's something like Siri, or Alexa
on your phone that is taking human utterances and deriving a command to execute based upon that. And
a chatbot is something similar except in written language and that's taking written language and
then using it to traverse a decision tree in order to take an action. NLP is very helpful there.
Another use case is for sentiment analysis. Now this is taking some text perhaps an email message
or a product review and trying to derive the sentiment that's expressed within it.
So for example, is this product review a positive sentiment or a negative sentiment, is it written
as a serious statement or is it being sarcastic? We can use NLP to tell us. And then finally,
another good example is spam detection so this is a case of looking at a given email message
and trying to drive is this a real email message or is it spam and we can look for
pointers within the content of the message. So things like overused words or poor grammar or an
inappropriate claim of urgency can all indicate that this is actually perhaps spam. So those are
some of the things that NLP can provide but how does it work well the thing with NLP is it's
not like one algorithm, it's actually more like a bag of tools and you can apply these bag of tools
to be able to resolve some of these use cases. Now the input to NLP is some unstructured text
so either some written text or spoken text that has been converted to written text through a
speech to text algorithm. Once we've got that, the first stage of NLP is called tokenization
This is about taking a string and breaking it down into chunks so if we consider the
unstructured text we've got here "add eggs and milk to my shopping list"
that's eight words that can be eight tokens.
And from here on in we are going to work one token at a time as we traverse through this. Now
the first stage once we've got things down into tokens that we can perform is called stemming.
And this is all about deriving the word stem for a given token. So for example, running,
runs, and ran, the word stem for all three of those is run. We're just kind of removing the
prefix and the suffixes and normalizing the tense and we're getting to the word stem.
But stemming doesn't work well for every token. For example, universal and university,
well they don't really stem down to universe. For situations like that,
there is another tool that we have available and that is called lemmatization.
And lemmatization takes a given token and learns its meaning through a dictionary definition
and from there it can derive its root, or its lem. So take better for example, better is derived from
good so the root, or the lem, of better is good. The stem of better would be bet. So you can see
that it is significant whether we use stemming, or we use lemmatization for a given token.
Now next thing we can do is we can do a process called part of speech tagging.
And what this is doing is for a given token it's looking where that token is used within the
context of a sentence. So take the word make for example, if I say "I'm going to make dinner", make
is a verb. But if I ask you "what make is your laptop?", well make is now a noun. So where that
token is used in the sentence matters, part of speech tagging can help us derive that context.
And then finally, another stage is named entity recognition.
And what this is asking is for a given token is there an entity associated with it. So
for example, a token of Arizona has an entity of a U.S. state whereas a token of Ralph has an entity
of a person's name. And these are some of the tools that we can apply in this big bag of tools
that we have for NLP in order to get from this unstructured human speech through to something
structured that a computer can understand. And once we've done that then we can apply that
structured data to all sorts of AI applications. Now there's obviously a lot more to it than this
and I've included some links in the description if you'd like to know more, but hopefully this made
some sense and that you were able to process some of the natural language that I've shared today.
Thanks for watching. If you have questions, please drop us a line below. And if you want
to see more videos like this in the future, please like and subscribe.