Preprocessing Techniques in NLP

How do we make it easier for machines to understand human language?

This article explores common preprocessing techniques in NLP and why they are necessary. We will apply each technique, in turn, to a single block of text so the effect of every step is visible. Here is our text:

“The cat was playing in the garden. It was a sunny day, and the cat was having a great time chasing butterflies and playing with a ball of yarn. Suddenly, the cat saw a mouse and started to chase it. The mouse ran into its hole, and the cat was left alone in the garden.”

1. Punctuation Removal

This is exactly what you would expect: we remove all commas, full stops, quotation marks, and other punctuation from the text. Humans parse punctuation effortlessly, but to a model it is usually just noise, so removing it can improve accuracy. After removing punctuation, the text looks like this:

The cat was playing in the garden It was a sunny day and the cat was having a great time chasing butterflies and playing with a ball of yarn Suddenly the cat saw a mouse and started to chase it The mouse ran into its hole and the cat was left alone in the garden
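In Python, a minimal sketch of this step uses `str.translate` with the standard library's `string.punctuation` table (note that this covers ASCII punctuation only; curly typographic quotes would need separate handling):

```python
import string

text = ("The cat was playing in the garden. It was a sunny day, and the cat "
        "was having a great time chasing butterflies and playing with a ball "
        "of yarn. Suddenly, the cat saw a mouse and started to chase it. The "
        "mouse ran into its hole, and the cat was left alone in the garden.")

# Build a translation table that deletes every ASCII punctuation character.
cleaned = text.translate(str.maketrans("", "", string.punctuation))
```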

2. Converting to Lower Case

Again, this is exactly what it says: we convert all words to lower case. If we don’t, the model treats the same word with different casing as different words; for example, ‘The’ and ‘the’ would be counted separately. This inflates the vocabulary for no benefit. After conversion to lower case, our text looks like this:

the cat was playing in the garden it was a sunny day and the cat was having a great time chasing butterflies and playing with a ball of yarn suddenly the cat saw a mouse and started to chase it the mouse ran into its hole and the cat was left alone in the garden
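In code this step is a one-liner:

```python
text = ("The cat was playing in the garden It was a sunny day and "
        "the cat was having a great time")

# Lower-casing collapses 'The' and 'the' into one vocabulary entry.
lowered = text.lower()
```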

3. Tokenization

Tokenization is the process of breaking text into individual words or phrases, called tokens. Splitting a large text into smaller units lets the model operate on its parts and comprehend its meaning better. Our tokenized text looks like this:

[‘the’, ‘cat’, ‘was’, ‘playing’, ‘in’, ‘the’, ‘garden’, ‘it’, ‘was’, ‘a’, ‘sunny’, ‘day’, ‘and’, ‘the’, ‘cat’, ‘was’, ‘having’, ‘a’, ‘great’, ‘time’, ‘chasing’, ‘butterflies’, ‘and’, ‘playing’, ‘with’, ‘a’, ‘ball’, ‘of’, ‘yarn’, ‘suddenly’, ‘the’, ‘cat’, ‘saw’, ‘a’, ‘mouse’, ‘and’, ‘started’, ‘to’, ‘chase’, ‘it’, ‘the’, ‘mouse’, ‘ran’, ‘into’, ‘its’, ‘hole’, ‘and’, ‘the’, ‘cat’, ‘was’, ‘left’, ‘alone’, ‘in’, ‘the’, ‘garden’]
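A minimal whitespace tokenizer is just `str.split`; libraries such as NLTK provide more robust tokenizers (e.g. `nltk.word_tokenize`) that also handle punctuation and contractions:

```python
text = "the cat was playing in the garden"

# Simplest possible tokenizer: split on whitespace.
tokens = text.split()
# ['the', 'cat', 'was', 'playing', 'in', 'the', 'garden']
```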

4. Stop-word Removal

In this step, we remove words that contribute little to the meaning of the sentence. Words such as ‘the’, ‘to’, and ‘it’ occur frequently but carry little of the sentence’s essence; removing them changes the meaning very little, and for many models they are just noise. After stop-word removal, our text looks like:

[‘cat’, ‘playing’, ‘garden’, ‘sunny’, ‘day’, ‘cat’, ‘great’, ‘time’, ‘chasing’, ‘butterflies’, ‘playing’, ‘ball’, ‘yarn’, ‘suddenly’, ‘cat’, ‘saw’, ‘mouse’, ‘started’, ‘chase’, ‘mouse’, ‘ran’, ‘hole’, ‘cat’, ‘left’, ‘alone’, ‘garden’]
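A sketch with a small hand-picked stop-word set (real lists, such as NLTK's `stopwords.words('english')`, are much longer):

```python
# Illustrative stop-word set; production lists contain far more entries.
STOP_WORDS = {"the", "was", "in", "it", "a", "and", "of", "with", "to", "its"}

tokens = ["the", "cat", "was", "playing", "in", "the", "garden"]

# Keep only the tokens that are not stop-words.
content_tokens = [t for t in tokens if t not in STOP_WORDS]
# ['cat', 'playing', 'garden']
```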

5. Stemming

Stemming refers to reducing a word to its root form. For example —

‘playing’ — ‘play’

‘played’ — ‘play’

‘chasing’ — ‘chase’

‘sunny’ — ‘sun’

If we skip stemming, ‘played’ and ‘play’ would be treated as different words, inflating the vocabulary. Note that rule-based stemmers such as Porter often produce stems that are not dictionary words (for example, ‘butterflies’ becomes ‘butterfli’); the idealized forms below are closer to what lemmatization, a related technique, would return. After stemming, our text looks like:

[‘cat’, ‘play’, ‘garden’, ‘sun’, ‘day’, ‘cat’, ‘great’, ‘time’, ‘chase’, ‘butterfly’, ‘play’, ‘ball’, ‘yarn’, ‘sudden’, ‘cat’, ‘saw’, ‘mouse’, ‘start’, ‘chase’, ‘mouse’, ‘ran’, ‘hole’, ‘cat’, ‘left’, ‘alone’, ‘garden’]
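To illustrate the idea, here is a deliberately crude suffix-stripping stemmer; in practice you would use a tested implementation such as NLTK's `PorterStemmer`:

```python
def crude_stem(word):
    """Toy suffix-stripping stemmer for illustration only. Real stemmers
    such as Porter apply ordered rule sets with measure conditions; this
    sketch just drops a few common suffixes and will mis-stem many words."""
    for suffix in ("ing", "ed", "ly", "s"):
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

crude_stem("playing")   # 'play'
crude_stem("suddenly")  # 'sudden'
```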

6. Part-of-Speech Tagging

This means labelling each word as a noun, adjective, verb, adverb, and so on. Knowing each word’s grammatical role helps the model understand the structure and meaning of the text. From our example,

cat — noun

play — verb

sunny — adjective
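A toy tagger can be sketched as a dictionary lookup. The lexicon below is hand-made for this example; real taggers (such as NLTK's averaged-perceptron tagger) are statistical models trained on annotated corpora and use context to resolve ambiguous words like ‘saw’:

```python
# Hand-made lexicon for illustration only.
LEXICON = {
    "cat": "noun", "mouse": "noun", "garden": "noun", "day": "noun",
    "play": "verb", "chase": "verb", "ran": "verb",
    "sunny": "adjective", "great": "adjective",
}

def tag(tokens):
    """Look each token up in the lexicon; unknown words get 'unknown'."""
    return [(t, LEXICON.get(t, "unknown")) for t in tokens]

tag(["cat", "play", "sunny"])
# [('cat', 'noun'), ('play', 'verb'), ('sunny', 'adjective')]
```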

7. Word Sense Disambiguation

This refers to identifying the meaning of a word from its context. In our text, the model needs to work out whether ‘mouse’ means the rodent or the computer peripheral, and whether ‘left’ refers to the direction or the act of leaving. Resolving such ambiguities helps the model understand the text more deeply.
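A simplified Lesk-style approach scores each candidate sense by how much its signature words overlap the surrounding context. The sense inventory below is hand-made for illustration, not taken from WordNet:

```python
# Hand-made sense inventory: each sense of 'mouse' is described by a
# small set of signature words (illustrative, not from a real lexicon).
SENSES = {
    "mouse": {
        "rodent": {"cat", "chase", "hole", "ran", "garden"},
        "device": {"keyboard", "click", "computer", "screen"},
    },
}

def disambiguate(word, context_tokens):
    """Simplified Lesk: pick the sense whose signature words overlap
    the surrounding context the most."""
    context = set(context_tokens)
    senses = SENSES[word]
    return max(senses, key=lambda s: len(senses[s] & context))

disambiguate("mouse", ["the", "cat", "saw", "a", "mouse", "and", "chase", "it"])
# 'rodent'
```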

8. N-Grams

An n-gram is a contiguous sequence of n tokens. N-gram models can capture frequently co-occurring words such as ‘New York’ or ‘Thank You’ and treat them as single units. ‘New York’ is a 2-gram (bigram), ‘The Powerpuff Girls’ is a 3-gram (trigram), and so on.
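Extracting n-grams from a token list is a short sliding-window operation:

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams(["new", "york", "city"], 2)
# [('new', 'york'), ('york', 'city')]
```

Bigrams like `('new', 'york')` can then be counted across a corpus and, when frequent enough, merged and treated as single tokens.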