
Applied Text Mining in Python

Text Mining

Text mining, also known as text data mining, refers to extracting useful information by transforming unstructured data into a structured format. The resulting structured data set might include fields such as names, phone numbers, or addresses. Organizations can analyze this data using various techniques, for instance Naïve Bayes, Support Vector Machines, or deep learning algorithms. An organization's data may be stored as structured, semi-structured, or unstructured data. Industry estimates put roughly 80 percent of the digital world's data in an unstructured format, so text mining is valuable for turning that data into findings that inform future business decisions.
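To make the idea concrete, the sketch below turns a handful of unstructured toy documents into a structured per-class word-count table and classifies new text with a from-scratch multinomial Naive Bayes. The documents and labels are invented for illustration, and in practice one would use a library implementation such as scikit-learn's MultinomialNB:

```python
import math
from collections import Counter, defaultdict

# Toy training documents with invented labels, for illustration only
docs = [
    ("free prize claim your reward now", "spam"),
    ("meeting rescheduled to next monday", "ham"),
    ("win cash instantly limited offer", "spam"),
    ("please review the attached quarterly report", "ham"),
]

# Structured representation: word counts per class
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in docs:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def predict(text):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / len(docs))  # log prior
        total = sum(word_counts[label].values())
        for w in text.split():
            # log likelihood of each word under this class
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_score, best_label = score, label
    return best_label

print(predict("claim your free cash prize"))
```

The per-class word counts are exactly the "structured format" the paragraph above describes: once text is reduced to counts, standard statistical models apply directly.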

Advantages of using Python for Text Mining

Users can choose from many programming languages for text mining, including Python, R, SAS, SQL, Java, Perl, and C++. Among these, developers often prefer Python as the most suitable choice. The main reasons for choosing Python over other languages are:

  • Python has the pandas library, which provides straightforward, flexible data structures and high-performance data analysis tools that work well for text mining.
  • Common text-analysis tasks are quick to implement in Python because mature built-in libraries already provide the relevant code.
  • Numerous open-source tools for text mining are available in Python; for example, scikit-learn and TensorFlow are readily usable.
  • Moreover, SaaS tools built on Python let users apply ready-made text mining pipelines without going through an installation process. These services typically prepare text data for analysis using Natural Language Processing (NLP) techniques such as word tokenization, stemming, and lemmatization.

Text Mining in Python

There are many ways to get started with text mining in Python using different libraries. Here is an example using the NLTK library, the Natural Language Toolkit, which is used to build Python scripts that work with human language data. NLTK provides a user-friendly interface, is open source, and is well documented; it supports tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. Below are examples of some of these. First, users need to install the library:

pip install nltk

After that, they need to import nltk and download the punkt tokenizer models that the tokenizers below rely on:

import nltk
nltk.download('punkt')

Tokenization

Tokenization is the process that breaks the strings into tokens which are small structures. It breaks a complex sentence into words, understands each word’s importance, and produces a structural description of the input sentence. Following is an example of tokenization code:

import nltk
from nltk.tokenize import word_tokenize

text = "In the United Kingdom, they drive on the left-hand side of the road. The Queen lives in Buckingham Palace in the United Kingdom."
token = word_tokenize(text)
token

This splits the text into a list of word and punctuation tokens. After that, users can compute a frequency distribution.
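To see why a dedicated tokenizer matters, compare naive whitespace splitting with a simple regex-based tokenizer. This is a stdlib-only sketch; the regex is a rough stand-in for illustration, not NLTK's actual tokenization rules:

```python
import re

text = "The Queen lives in Buckingham Palace. It is in the United Kingdom."

# Naive whitespace splitting leaves punctuation glued to the words
naive = text.split()
print(naive)  # contains tokens like 'Palace.' and 'Kingdom.'

# A rough regex tokenizer separates words from punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

With punctuation split off as its own tokens, 'Palace' and 'Palace.' no longer count as different words in later frequency analysis.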

Frequency Distribution

Users can use FreqDist to find the frequency of every single word, as well as the most frequently repeated words:

from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist
fdist1 = fdist.most_common(5)
fdist1

The above code will show every word’s frequency and the frequency of the top five repeated words.

Stemming

Stemming refers to the linguistic normalization process that removes the affixes and reduces words to their root.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# filtered_sent holds the tokens to stem; here, the earlier tokens
# with punctuation filtered out
filtered_sent = [w for w in token if w.isalpha()]
stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

The above code gives the root form of every derived word in the sentence. Beyond stemming, further analysis such as lemmatization, POS tagging, and sentiment analysis, along with many other techniques, falls under the umbrella of text mining with NLTK.
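To make the stemming/lemmatization distinction concrete: a stemmer chops suffixes mechanically and may produce non-words, while a lemmatizer returns a real dictionary form (in NLTK, nltk.stem.WordNetLemmatizer, which requires downloading the wordnet corpus). The sketch below contrasts a deliberately crude suffix stripper with a hand-written lemma table standing in for WordNet; both are simplifications for illustration:

```python
def crude_stem(word):
    # Mechanically chop common suffixes, in the spirit of a stemmer;
    # deliberately simplistic, not the actual Porter algorithm
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lemma table standing in for the WordNet corpus
lemmas = {"studies": "study", "cries": "cry", "running": "run"}

for w in ("studies", "cries", "running"):
    print(w, "-> stem:", crude_stem(w), "| lemma:", lemmas.get(w, w))
```

Note how the stems ('stud', 'runn') are not dictionary words, while the lemmas ('study', 'run') are; that is the practical trade-off between the two normalization approaches.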


© Learn Python 101 — All Rights Reserved