NLP Techniques: Text Preprocessing

Sanchita Paul
Published in AlmaBetter · 6 min read · Apr 29, 2021

Natural Language Processing (NLP) is becoming one of the most widely used techniques today. It is used to program computers to understand, process, and analyze huge amounts of data in natural human language, such as text and speech.

For example, suppose we are reading reviews for a book. As humans, we can tell just by looking at a review whether it is positive or negative, right? But how do machines understand these sentiments?

This is where Natural Language Processing comes into picture.

Libraries one can use for NLP include spaCy, NLTK, etc.

Examples of NLP in use are WhatsApp text messaging, the spam classifier in Gmail, and Siri.

Some common uses of NLP

STEP 1: Text Preprocessing

This is the first and foremost step in processing text data.

1. Tokenization:

Tokenization means taking a single unit of text, such as a document, paragraph, phrase, or sentence, and splitting it into smaller units.

For example, a paragraph can be split into sentences, sentences can be split into words, and so on.

Below is an illustration of how this splitting works:

Source: KDNuggets

Let us look at a small piece of code to understand this better:

# import the necessary library
import nltk
nltk.download("punkt")

paragraph = "A paragraph is a series of related sentences developing a central idea, called the topic. Try to think about paragraphs in terms of thematic unity: a paragraph is a sentence or a group of sentences that supports one central, unified idea. Paragraphs add one idea at a time to your broader argument"

sentence = nltk.sent_tokenize(paragraph)  # list of sentences
for sentences in sentence:
    print(sentences)
Image source: Writer

Each element in the sentence list is a sentence which was split by the tokenizer.

We can also use word_tokenize to split text into words.
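For instance, word-level tokenization of the first sentence looks roughly like this (a small sketch reusing the nltk setup from above):

# word-level tokenization: splits a sentence into words and punctuation
words = nltk.word_tokenize("A paragraph is a series of related sentences developing a central idea, called the topic.")
print(words)
# ['A', 'paragraph', 'is', 'a', 'series', 'of', 'related', 'sentences', ...]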

2. Stemming:

Stemming, by definition, means reducing words to their word stems.

Let us understand what it actually means:

Let us look at this example of original words and how they are stemmed:

Image source: KDNuggets

Ok so why does this happen?

Stemming removes affixes (mostly suffixes) and keeps the resulting base form, which may or may not be a meaningful word. Stemming isn’t concerned with human readability, only with extracting the base word.

(Note: There are three common types of stemmers. Porter is the most commonly used and the gentlest of the three, but computationally expensive. Porter2 (Snowball) is an improvement on Porter with lower computation time. Lancaster is a very aggressive stemming algorithm, sometimes to a fault: with Porter and Snowball the stemmed representations are usually fairly intuitive to a reader, but with Lancaster many shorter words become totally obfuscated. Lancaster is the fastest of the three and will reduce your working set of words hugely, but if you want more distinction it is not the tool you would want.)
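As a quick sketch of how the three algorithms compare, you can run them side by side in NLTK (the example words here are arbitrary):

# Compare the three stemmers on the same words
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["running", "generously", "university", "happiness"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))
# Simple words like "running" come out as "run" from all three,
# while Lancaster tends to cut longer words down far more aggressively.

Now, applying the Snowball stemmer (with stop-word removal) to the sentences from our paragraph: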

# Using the same paragraph
# Importing libraries
from nltk.stem.snowball import SnowballStemmer
# Stop words are words that do not add any meaning to a sentence
from nltk.corpus import stopwords
nltk.download('stopwords')

# create an object of the stemmer
stemmer = SnowballStemmer("english")

for i in range(len(sentence)):
    text = nltk.word_tokenize(sentence[i])
    text = [stemmer.stem(word) for word in text if word not in set(stopwords.words('english'))]
    sentence[i] = " ".join(text)

print(sentence)
Image source: Writer

Stop words have been removed and the remaining words have been stemmed.

3. Lemmatization:

Stemmed words often cannot be understood by humans, and that is where Lemmatization comes in.

Basically, Lemmatization removes the affixes (mostly suffixes) and reduces each word to a base form that actually has meaning.

Image source: KDNuggets

Let us look at lemmatization with the same example:

# importing the library
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # the lemmatizer needs the WordNet data
# create an object of the lemmatizing function
lemma = WordNetLemmatizer()
# re-create the sentence list so we lemmatize the original words, not the stems
sentence = nltk.sent_tokenize(paragraph)
for i in range(len(sentence)):
    text = nltk.word_tokenize(sentence[i])
    text = [lemma.lemmatize(word) for word in text if word not in set(stopwords.words('english'))]
    sentence[i] = " ".join(text)
print(sentence)
Image source: Writer

Each word has meaning and stop words have been removed.

OK, so if Lemmatization is so much better than Stemming, why even do stemming? Because in cases like Sentiment Analysis, the stemmed words alone are often enough to identify the sentiment.

Also, since lemmas are real words with meaning, lemmatization takes a lot more processing time than stemming!
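To make the difference concrete, here is a small sketch reusing the stemmer and lemma objects created above (the example words are arbitrary):

# Stemming chops off affixes; lemmatization maps to a real dictionary word
for word in ["studies", "feet", "corpora"]:
    print(word, "->", stemmer.stem(word), "|", lemma.lemmatize(word))
# e.g. "studies" stems to "studi" but lemmatizes to "study",
# and "feet" stays "feet" under stemming but lemmatizes to "foot".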

Lemmatization is used in Q&A applications, assistants like Siri, and chatbots.

STEP 2: Text Vectorization

In this step we focus on converting words into vectors.

1. Bag of Words:

We can use the CountVectorizer class from the scikit-learn library to do this.

We cannot feed raw text directly to our model; we need to convert it into numeric values, known as vectors, for the machine to understand.

Suppose we take 4 sentences as an example:

Source: Ronald James Group

The CountVectorizer creates a variable (column) for each distinct word, similar to one-hot encoding. Every time a word is seen, a 1 is entered under that word’s variable. So if there are 4 sentences and 6 distinct words, the matrix will be 4x6, i.e. number of sentences x number of distinct words.

What we can do next with this matrix is make a bar chart of the most frequently occurring words by taking their frequencies (the count of each word). For example, ‘dog’ is present 3 times in this matrix.

If we remove stop words before building the matrix, they will not appear in it; otherwise most of the highest-frequency words will be stop words, which do not give any relevant information about the sentence.

(Binary Bag of Words: if a word is present in a sentence, no matter how many times, it is marked as ‘1’, and if it is not present it is marked as ‘0’. Please note that 1 is entered only once, however many times that word occurs.)

The count Bag of Words differs from this in that it takes the count of words into consideration: every time a word occurs it is counted, so if ‘dog’ is present twice in a sentence, BOW records 2.
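Here is a minimal sketch of both variants using scikit-learn’s CountVectorizer. The four sentences below are hypothetical placeholders standing in for the example above, and the binary=True flag gives the Binary Bag of Words:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences
docs = ["the dog is on the table",
        "the dog and the cat play",
        "my dog likes my cat",
        "the cat sleeps"]

# Count Bag of Words: each cell holds how many times a word occurs in a sentence
bow = CountVectorizer(stop_words='english')
matrix = bow.fit_transform(docs)
print(bow.get_feature_names_out())   # the distinct words (columns)
print(matrix.toarray())              # shape: number of sentences x number of distinct words

# Binary Bag of Words: 1 if the word is present at all, 0 otherwise
binary_bow = CountVectorizer(stop_words='english', binary=True)
print(binary_bow.fit_transform(docs).toarray())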

The disadvantage of BOW is that the semantics, or importance, of all words end up almost the same; we cannot tell which word is more important than another, and that cannot help us understand the sentiment of the document or the sentence at all, right?

Example: This is a good book

Here ‘good’ is the most important word to understand the sentiment and should be given more value.

So do we have a better method to overcome this? Definitely! I will discuss that in the next part: TF-IDF.

2. TF-IDF: (Term Frequency-Inverse Document Frequency)

The TF-IDF weight is the product of two terms. The first term is the normalized Term Frequency (TF): the number of times a word appears in a document, divided by the total number of words in that document. The second term is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

For example:
Sentence 1: This is a wonderful book.
Sentence 2: I liked this book very much.
Sentence 3: I didn’t find it interesting.

After removing stop words they become-

Sentence 1: wonderful book
Sentence 2: like book very much
Sentence 3: didn’t find interesting

TF(word) = (Number of times word appears in a sentence) / (Total number of words in the sentence).
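For example, in Sentence 2 after stop-word removal (“like book very much”), the word “book” appears once out of 4 words, so TF(book) = 1/4 = 0.25.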

The TF table would look something like:

Image Source: Writer

IDF(word) = log_e(Total number of sentences / Number of sentences with the word in it)
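For example, “book” appears in 2 of the 3 sentences, so IDF(book) = log_e(3/2) ≈ 0.41, while “wonderful” appears in only 1 sentence, so IDF(wonderful) = log_e(3/1) ≈ 1.10.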

The IDF table would look like:

Image source: Writer

The final TF-IDF score is given by TF * IDF.
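In practice you rarely build these tables by hand. Here is a minimal sketch with scikit-learn’s TfidfVectorizer on the same three sentences (note that scikit-learn uses a smoothed IDF and normalizes the rows, so its numbers will differ slightly from the hand computation above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["This is a wonderful book",
        "I liked this book very much",
        "I didn't find it interesting"]

tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())  # the distinct (non stop-word) terms
print(matrix.toarray())               # one row per sentence, one column per term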

Hi, I am Sanchita: an engineer, a math enthusiast, an AlmaBetter Data Science trainee, and a writer at Analytics Vidhya.