
Processing Unstructured Data in Python

Shashank Shanu

2 years ago

Table of Contents
  • Reading Data from a text file
  • How to Count Word Frequency in a file?
  • Example
  • Output
A huge amount of data is created every day, and more than 90% of it is unstructured. Structured data is neatly organized, typically in relational databases, whereas unstructured data has no predefined schema and does not come in a specified format, which makes it very difficult to use directly. To deal with unstructured data, we need to apply some preprocessing to it so that we can use it and get the required results.
Some examples of unstructured data are HTML, image, and PDF files. An HTML file can be handled by processing its tags, but a feed from Twitter or a plain text document from a news feed has no delimiters and no tags to work with.
For this kind of scenario, we use built-in functions from various Python libraries to process the data.
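As a rough illustration of what "processing the HTML tags" can mean, here is a minimal sketch that pulls the plain text out of an HTML string using the standard library's html.parser module. The TextExtractor class, the html_to_text helper, and the sample markup are made up for this illustration; real pages usually need more care (scripts, styles, entities).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between HTML tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for every run of text between tags
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


# Made-up sample markup, only for demonstration
sample = "<html><body><h1>Python</h1><p>Python is an <b>interpreted</b> language.</p></body></html>"
print(html_to_text(sample))
```

Once the tags are gone, the remaining text can be processed like any other plain text document.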
Let’s try to understand it with the help of an example:

Reading Data from a text file

In this example, we take a text file that contains a few paragraphs describing the Python language.
Let’s see how we can read data from a text file and separate each of its lines. Next, we will try to divide the output further into lines and words.
Example:
file_name = 'python.txt'

with open(file_name) as fn:
    # Read the first line
    line = fn.readline()

    # Keep count of lines
    line_count = 1
    while line:
        print("Line {}: {}".format(line_count, line.strip()))
        # Read the next line; readline() returns '' at end of file
        line = fn.readline()
        line_count += 1
Output:
Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
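The example above stops at the line level. Dividing each line further into words could look roughly like the sketch below. The split_into_words function and the sample lines are hypothetical and only stand in for a real file; with an actual file you could pass the open file object instead of the list.

```python
def split_into_words(lines):
    """Return a list of word lists, one per non-empty line."""
    result = []
    for line in lines:
        words = line.split()  # split on any run of whitespace
        if words:             # skip blank lines
            result.append(words)
    return result


# Stand-in for the lines of a text file (made up for this sketch)
sample_lines = [
    "Python features a dynamic type system.\n",
    "\n",
    "It supports multiple programming paradigms.\n",
]

for number, words in enumerate(split_into_words(sample_lines), start=1):
    print("Line {}: {} words -> {}".format(number, len(words), words))
```

Note that str.split() with no arguments only separates on whitespace, so punctuation such as the trailing full stop stays attached to the word.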

How to Count Word Frequency in a file?

Python provides the Counter class, which can be used to count the frequency of the words in a given file. Counter lives in the collections module of the Python standard library, so we first have to import it before we can use it.
Let’s see an example:
Example:
from collections import Counter

with open(r'my_file.txt') as f:
    freq = Counter(f.read().split())
    print(freq)
Output:
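When we execute the above code, it produces output similar to the following; the exact counts depend on the contents of the file. Because the text is split on whitespace only, punctuation stays attached to the words, so 'programming.' is counted separately from 'programming'.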
 
Counter({'and': 4, 'Python': 3, 'that': 2, 'a': 2, 'programming': 3, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 3, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 5, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 5, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 2})
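If entries such as 'programming.' and 'programming' should be counted as the same word, one option is to lowercase the text and strip punctuation before counting. The following is only a sketch under that assumption: the word_frequencies function, the regular expression, and the sample sentence are illustrative choices, not part of the original example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count words case-insensitively, ignoring surrounding punctuation."""
    # The pattern keeps hyphenated or apostrophised words such as
    # "high-level" or "isn't" together as a single word
    words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())
    return Counter(words)


# Made-up sample text, only for demonstration
sample = ("Python is a programming language for general-purpose "
          "programming. Python is high-level.")
freq = word_frequencies(sample)

# most_common(n) returns the n most frequent (word, count) pairs
print(freq.most_common(3))
```

With a real file, the same function could be called as word_frequencies(f.read()) inside the with block.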
I hope you enjoyed reading this article and now know more about processing unstructured data in Python.
For more such blogs/courses on data science, machine learning, artificial intelligence, and emerging new technologies, do visit us at InsideAIML.
Thanks for reading…
Happy Programming…
