Easy NLP with TextBlob
Kuang Hao, Research Computing, NUS IT
Introduction
Natural Language Processing (NLP) is becoming increasingly popular. Captivating as it is, NLP has quite a steep learning curve because it stretches across multiple research areas, from computational linguistics to artificial intelligence. For beginners, apart from NLTK, the powerful NLP toolkit I introduced last time, there is another extremely user-friendly tool: TextBlob.
In this article, we will explore what TextBlob can do to make learning NLP easier. I have already shown how common tasks such as tokenization, lexicon normalization and POS (part-of-speech) tagging can be done with NLTK, so I will not repeat them here.
TextBlob can be installed with pip by running "pip install -U textblob". Remember to download the language corpora before using it for the first time by running "python -m textblob.download_corpora".
The sample notebook used to explore TextBlob can be found here. First, we need to create a TextBlob object. Python’s string operations work on a TextBlob object: [1]
String operations such as slicing, find, upper, comparison and concatenation are all available on a TextBlob object, which is convenient for experienced Python users. We can even compare a TextBlob object with a string:
Noun Phrase Extraction
Noun Phrase Extraction refers to the extraction of noun phrases. Sometimes we are more interested in the “who” from the text and would like to extract them in order to assign more focus in the downstream analysis. Let’s see the example below:
It works well in most cases, but it can fail on unusual input:
As TextBlob is built on traditional language models, not based on any SOTA (state-of-the-art) framework, it does not perform well for complex cases. But more often than not, it works just fine unless we bully it with tongue twisters. 🙂
Spelling Correction
As the name suggests, TextBlob can be used to auto-correct misspelled words. We can run the correction on a blob object (a sentence or paragraph), or use the spellcheck function on a single word. See the example below:
Please note that TextBlob’s spell check relies on a corpus of known words, so when a word is misspelt as another known word, TextBlob can’t correct it. Likewise, it won’t work for words it has never seen:
Machine Translation
TextBlob makes use of the Google Translate API for its translation and language detection functions. It can also auto-detect the source language when we don’t specify the “from_lang” parameter. Note that these functions were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so they are only available in older releases. See this example to translate from French to English:
The Google Translate API can also be used in Python through other packages like “googletrans”. Nonetheless, it is convenient to have an inbuilt translation feature when we are already exploring our textual data with TextBlob.
Sentiment Analysis
Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. [2]
TextBlob offers lexicon-based sentiment analysis. A lexicon-based approach assigns scores to bags of words, based on a pre-defined dictionary of negative and positive words. [3] It then takes an average to calculate the overall sentiment score for a sentence.
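To make the idea concrete, here is a toy sketch of lexicon-based scoring in plain Python. It is not TextBlob's implementation: the lexicon `TOY_LEXICON`, its scores and the function `lexicon_polarity` are invented for illustration, and a real analyzer is considerably more elaborate.

```python
import re

# Hypothetical toy lexicon mapping words to a polarity in [-1, 1].
# These scores are made up and are not taken from TextBlob.
TOY_LEXICON = {
    "great": 0.8,
    "good": 0.7,
    "fun": 0.3,
    "bad": -0.7,
    "terrible": -1.0,
}


def lexicon_polarity(sentence, lexicon=TOY_LEXICON):
    """Average the scores of the words that appear in the lexicon."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    # A sentence with no known words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


print(lexicon_polarity("The food was great and the service was good"))
print(lexicon_polarity("A terrible movie"))
print(lexicon_polarity("The train leaves at noon"))
```

Because each word is scored in isolation, a sketch like this has no way to notice irony or context, which is the weakness of the lexicon-based approach in general.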
The sentiment function returns two properties: polarity and subjectivity. The value of polarity lies between -1 and 1, where 1 means a positive statement and -1 means a negative statement. The value of subjectivity lies between 0 and 1, where 0 means objective (factual information) and 1 means subjective (personal judgement). See the example below:
Since it opts for a comparatively simple lexicon-based approach, it naturally performs poorly with ambiguous words and more complex cases like irony and sarcasm. Let’s see an example:
TextBlob mistakenly classifies the above sentence as positive. With the rapid development of NLP, these more complicated cases can now be handled pretty well by modern language models. See the example below using RoBERTa, one of today’s SOTA models. You can try it for yourself here.
When we just want to peek at the overall sentiment within our textual data during EDA (exploratory data analysis), TextBlob is a fairly good tool to use.
Text Classification
We can also build a simple text classification model using TextBlob. It provides some machine learning models like Naive Bayes, Decision Tree, and Maximum Entropy. For demonstration, I trained a simple Naive Bayes text classifier on the IMDB movie review dataset using TextBlob.
We can compare results with the inbuilt sentiment function:
As we can see, the sentiment function doesn’t work well here. As I mentioned earlier, it only works for simple sentences where positive or negative words are stated clearly. For domain-specific data like movie reviews, a model trained specifically on the task works better.
Conclusion
This article aims to introduce a commonly used tool for NLP: TextBlob. Throughout our simple hands-on with the tool, we found it very convenient to use for noun phrase extraction, spelling correction, machine translation, sentiment analysis, and text classification.
We also saw that it sometimes doesn’t work well, especially for domain-specific text and complex cases. For sentiment analysis and spelling correction, it is advisable to use TextBlob only during the EDA phase, or to get an approximate result that helps generate assumptions. After that, we will need to try more sophisticated solutions.
In all, TextBlob is user-friendly and simple. For beginners, it can help build interest in NLP, as well as lay a good foundation for deeper studies.
On HPC
To use TextBlob on HPC clusters, simply install it as a user package. The manuals for using Python on HPC clusters can be found here (CPU) and here (GPU).
Contact us if you have any issues using our HPC resources, via nTouch.
References
[1]. TextBlob. (2020). TextBlob Documentation. [online] Available at: https://textblob.readthedocs.io/en/dev/quickstart.html#textblobs-are-like-python-strings
[2]. Wikipedia. (2020). Sentiment analysis. [online] Available at: https://en.wikipedia.org/wiki/Sentiment_analysis
[3]. Parthvi Shah. (2020). Sentiment Analysis using TextBlob. [online] Available at: https://towardsdatascience.com/my-absolute-go-to-for-sentiment-analysis-textblob-3ac3a11d524