Mastering Text Preprocessing: A Comprehensive Guide

Pratik Kumar Roy
4 min read · Dec 10, 2023


Introduction:

Text preprocessing is a crucial step in natural language processing (NLP) that involves cleaning and transforming raw text data into a format suitable for analysis and modeling. In this blog, we’ll explore some of the most common text preprocessing techniques using popular NLP libraries such as NLTK, TextBlob, and spaCy. We’ll also use regular expressions to remove HTML tags and demonstrate how to expand contractions.

This blog is written as a quick revision reference. At the end, you will find a preprocessing function that you can use as a template and adapt to your own use case.

NLTK (Natural Language Toolkit):

NLTK is a powerful library in Python for working with human language data. Let’s explore a few common preprocessing techniques using NLTK.

Tokenization:

Tokenization involves breaking down a text into smaller units, typically words or phrases, known as tokens. NLTK, TextBlob, and spaCy provide tools for effective tokenization.

from nltk.tokenize import word_tokenize, sent_tokenize

# run nltk.download('punkt') once if the tokenizer models are missing
text = "NLTK is a powerful library for text processing. It makes NLP easy!"
words = word_tokenize(text)
sentences = sent_tokenize(text)

print("Tokenized Words:", words)
print("Tokenized Sentences:", sentences)

Stopword Removal:

Stopwords are common words like “the,” “is,” and “and” that often carry little meaning. NLTK provides a list of stopwords that can be removed from the text.

from nltk.corpus import stopwords

# run nltk.download('stopwords') once if the corpus is missing
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Without Stopwords:", filtered_words)

Lemmatization:

Lemmatization is the process of reducing words to their base or root form. For example, “running” becomes “run.” Lemmatization helps standardize words, making them easier to analyze. The example below uses spaCy’s pipeline, which produces lemmas out of the box.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Lemmatization is crucial for text analysis."
lemmas = [token.lemma_ for token in nlp(text)]
print(lemmas)
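
If you want to stay within NLTK, a minimal alternative sketch uses NLTK’s WordNetLemmatizer (it needs the wordnet corpus and works best with a part-of-speech hint):

from nltk.stem import WordNetLemmatizer

# run nltk.download('wordnet') once if the corpus is missing
lemmatizer = WordNetLemmatizer()
words = ["running", "jumps", "played"]
# pos="v" treats each word as a verb; the default part of speech is noun
print([lemmatizer.lemmatize(word, pos="v") for word in words])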

Stemming

Stemming involves reducing words to their root form by removing suffixes. NLTK provides various stemmers like Porter and Lancaster.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "jumps", "played"]
stemmed_words = [ps.stem(word) for word in words]
print(stemmed_words)
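
For comparison, the Lancaster stemmer is more aggressive than Porter and often produces shorter, less readable stems:

from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
print([ls.stem(word) for word in ["running", "jumps", "played"]])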

Text Preprocessing with TextBlob

Part-of-Speech Tagging

TextBlob allows for part-of-speech tagging, which identifies the grammatical parts of words (e.g., noun, verb, adjective).

from textblob import TextBlob

text = "TextBlob is a powerful library for NLP."
blob = TextBlob(text)
pos_tags = blob.tags
print(pos_tags)

Sentiment Analysis

TextBlob simplifies sentiment analysis by providing a sentiment polarity score that ranges from -1 (negative) to 1 (positive).

sentiment = blob.sentiment.polarity
print(sentiment)
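
The same sentiment property also exposes a subjectivity score between 0 (objective) and 1 (subjective):

print(blob.sentiment)  # namedtuple containing both polarity and subjectivity
print(blob.sentiment.subjectivity)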

Text Preprocessing with spaCy

Named Entity Recognition (NER)

spaCy excels in named entity recognition, identifying entities like people, organizations, and locations in the text.

doc = nlp("Apple Inc. is headquartered in Cupertino.")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)

Dependency Parsing

Dependency parsing in spaCy reveals the syntactic relationships between words in a sentence.

dep_tree = [(token.text, token.dep_, token.head.text) for token in doc]
print(dep_tree)

Regular Expressions for HTML Tag Removal

Text data scraped from the web often contains HTML tags that need to be removed before analysis.

import re

html_text = "<p>This is <b>HTML</b> text.</p>"
clean_text = re.sub('<.*?>', '', html_text)
print(clean_text)
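
A regex like this works for simple snippets, but an HTML parser such as BeautifulSoup (also used in the template at the end of this post) handles nested or malformed markup more reliably:

from bs4 import BeautifulSoup

clean_text = BeautifulSoup(html_text, "html.parser").get_text()
print(clean_text)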

Expanding Contractions

Expanding contractions involves converting abbreviated forms like “don’t” to “do not” for consistency.

from contractions import contractions_dict

def expand_contractions(text):
    for contraction, expansion in contractions_dict.items():
        text = text.replace(contraction, expansion)
    return text

text = "I don't know if I can make it."
expanded_text = expand_contractions(text)
print(expanded_text)
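
If you only need the expanded text and not the mapping itself, the same contractions package also ships a fix() helper that does this in one call:

import contractions

print(contractions.fix("I don't know if I can make it."))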

The text processing template that I commonly use

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords as nltk_stopwords

# any stopword collection works here; this version uses NLTK's English list
stopwords = set(nltk_stopwords.words('english'))


def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


def clean_text(sentence):
    sentence = re.sub(r"http\S+", "", sentence)            # drop URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()  # strip HTML tags
    sentence = decontracted(sentence)                       # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()    # drop tokens containing digits
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)          # keep letters only
    # stopword list reference: https://gist.github.com/sebleier/554280
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    return sentence.strip()
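
As a quick sanity check, you can run the template on a noisy sentence (the exact output depends on the stopword list you plug in):

sample = "<p>Check out https://example.com before the 2023 sale, I won't miss it!</p>"
print(clean_text(sample))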


Conclusion

Mastering text preprocessing is essential for effective NLP applications. By leveraging the capabilities of NLTK, TextBlob, spaCy, and regular expressions, you can enhance the quality of your text data and pave the way for more accurate analyses and models. Experiment with these techniques and tailor them to your specific needs for optimal results in your natural language processing endeavors.
