Text Munging Using Python

Shashank Shanu

3 years ago

Social media has become enormously popular, and people of all ages, from young children to the elderly, now have a presence on it. As a result, we generate enormous volumes of text data every day. Large companies mine this text for hidden insights so that they can stay ahead of their competitors. Facebook, YouTube, and Netflix are some of the companies that use such data to improve the service they offer their customers.
Before moving ahead, let’s first understand:

What is Text Munging?

Munging is a technique for cleaning messy data by applying transformations to it.
It is an important technique because most of the data generated today is unstructured. Such data needs some treatment or cleaning before we can extract correct insights from it.
To understand it better, let’s take an example.
In our example, we will see how applying a transformation to a piece of text changes it in a useful, controlled way.
In simple terms, text munging is just about transforming the text we are dealing with.
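Before the main example, here is a minimal sketch of what a cleaning transformation can look like. The function name clean_text, the sample string, and the choice of rules (lowercasing, dropping symbols, collapsing whitespace) are assumptions made for illustration; real projects usually need rules tailored to their own data.

```python
import re

def clean_text(raw):
    """Normalize a messy string: lowercase, drop symbols, collapse whitespace."""
    text = raw.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical messy input, e.g. a social media comment.
messy = "  LOVED the new phone!!!   Battery-life = great :)  "
print(clean_text(messy))  # loved the new phone battery life great
```

Each step is a small, independent transformation, so steps can be added, removed, or reordered depending on what the later analysis needs.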

Example

Here, we shuffle the letters inside each word of the given sentence while keeping the first and last letters of every word in place. This produces alternate spellings of the kind a person might type by mistake.
Let’s see the code:
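# import libraries

import random
import re

# shuffle the inner letters of a matched word, keeping its first and last letters

def replace(a):
    word = list(a.group(2))   # inner letters captured by (\w+)
    random.shuffle(word)      # shuffle them in place
    return a.group(1) + "".join(word) + a.group(3)

text = "Hello, You should reach the finish line."

# the pattern only matches words of three or more letters;
# the substitution runs twice to show that each run shuffles differently
print(re.sub(r"(\w)(\w+)(\w)", replace, text))

print(re.sub(r"(\w)(\w+)(\w)", replace, text))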

Output:
When we execute the above program, we get output like the following (the exact words vary from run to run because the shuffle is random):

Hlleo, You slouhd raech the fsiinh lnie.
Hlleo, You suolhd raceh the fniish line.
From the above results, we can see that the letters of each word are jumbled except for the first and the last. Short words such as "You" and "the" are unchanged because they have only one inner letter. Collecting such variants and analysing them statistically helps us identify which words are commonly misspelt, so that we can supply the correct spelling for them.
I hope you enjoyed reading this article and now know the basics of text munging using Python.
For more blogs and courses on data science, machine learning, artificial intelligence, and emerging technologies, do visit us at InsideAIML.
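As a rough illustration of supplying the correct spelling, the sketch below maps each jumbled word back to a known word. It is a simple lookup rather than a statistical model, and the names signature, vocabulary, lookup, and restore are hypothetical, as is the tiny vocabulary itself. The idea is that a jumbled word and its correct form share the same first letter, last letter, and set of inner letters.

```python
import re

def signature(word):
    """Key that ignores the order of a word's inner letters."""
    w = word.lower()
    if len(w) < 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]

# Hypothetical vocabulary of correctly spelt words.
vocabulary = ["hello", "you", "should", "reach", "the", "finish", "line"]
lookup = {signature(w): w for w in vocabulary}

def restore(match):
    word = match.group(0)
    fixed = lookup.get(signature(word))
    if fixed is None:
        return word  # unknown word: leave it untouched
    # Keep a leading capital if the jumbled word had one.
    return fixed.capitalize() if word[0].isupper() else fixed

jumbled = "Hlleo, You slouhd raech the fsiinh lnie."
print(re.sub(r"\w+", restore, jumbled))  # Hello, You should reach the finish line.
```

One limitation of this sketch is that different words can share a signature (for example "form" and "from"), and in that case the dictionary keeps only one of them. Choosing between such candidates is where word-frequency statistics would come in.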
Thanks for reading…
Happy Programming…
