
Data Cleansing in Python

Shivani Upare

2 years ago

Table of Contents
  • Introduction
  • When and Why Is Data Missed?
  • How to handle missing values (say NA or NaN) using Pandas?
  • How to check Missing values using pandas?
  • How to Clean / Fill Missing Data in pandas?
  • Fill NA Forward and Backward
  • How to Drop Missing values?
  • How to Replace Missing (or) Generic Values?

Introduction

Most of us already know how important it is to have a clean dataset before a machine learning model can make good predictions. Roughly 60-80% of the time in a machine learning project goes into preparing a clean, usable dataset. If the data is not cleaned properly, the model is likely to perform poorly. One of the most important parts of this work is the treatment of missing values, which is a major factor in making models more accurate and valid for prediction.
According to an IBM Data Analytics report, you can expect to spend up to 80% of your time cleaning data.

When and Why Is Data Missed?

Before we get into the coding part, it’s important to understand where missing data comes from. Some typical reasons why data is missing:
  • A user forgot to fill in a field.
  • A user did not want to share personal details.
  • Data was lost while being transferred manually from one source to another.
  • A programming error occurred.
Let’s take an example to make this concrete.
Consider an online survey about a company’s product. People often do not provide all of the requested information, especially personal information. Some share their experience but not how long they have been using the product; others share how long they have used it and their experience, but not their contact information. In one way or another, part of the data is always missing, and this is very common in real-world data.
By now you should have an idea of how important it is to treat the missing values in our data. So, let’s see how to do it.

How to handle missing values (say NA or NaN) using Pandas?

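# import the pandas and NumPy libraries
import pandas as pd
import numpy as np

# build a 5x3 frame of random numbers with row labels a, c, e, f, h
data = pd.DataFrame(np.random.randn(5, 3),
                    index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

# reindexing adds rows b, d and g, which have no data and are filled with NaN
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data)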
Output (the values are random, so they will differ on each run; rows g and h are not shown):
    Column1   Column2   Column3
a -0.067397 -1.570255 -0.898418
b       NaN       NaN       NaN
c  1.311982  1.972563  0.743876
d       NaN       NaN       NaN
e  0.516474 -0.436298 -0.336320
f  0.587955  0.928367  1.014634
In the above example, we created a DataFrame with missing values, which are represented as NaN (Not a Number).
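As a small aside on what NaN actually is, the sketch below (using hand-picked illustrative values, not data from this article) shows two properties worth knowing: NaN never compares equal to itself, and introducing NaN into an integer Series converts it to a floating-point dtype.

```python
import numpy as np
import pandas as pd

# NaN is a special floating-point value and never compares equal to itself.
nan_equals_itself = (np.nan == np.nan)

# Because of that, test for missing values with pd.isna() rather than ==.
nan_detected = pd.isna(np.nan)

# Illustrative integer Series; reindexing adds label 'd' with no data.
ints = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
with_gap = ints.reindex(['a', 'b', 'c', 'd'])

print(nan_equals_itself, nan_detected)
print(ints.dtype, '->', with_gap.dtype)
```

This is why the detection functions in the next section are needed: an ordinary equality check cannot find NaN.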

How to check Missing values using pandas?

Pandas provides functions such as isnull() and notnull() to detect missing values in a dataset, which makes our life much easier. These methods can be applied to both Series and DataFrame objects.
Let’s take an example:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3),
                    index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True marks the rows where Column1 is missing
print(data['Column1'].isnull())
Output
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: Column1, dtype: bool
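In practice it is often more useful to count the missing values than to list them row by row. The following is a minimal sketch, assuming a small hand-made frame (the values and the variable names are illustrative only), that combines isnull() and notnull() with sum() and all():

```python
import numpy as np
import pandas as pd

# Small hand-made frame (illustrative values, not from the article)
df = pd.DataFrame({
    'Column1': [1.0, np.nan, 3.0, np.nan],
    'Column2': [np.nan, 2.0, 3.0, 4.0],
    'Column3': [1.0, 2.0, 3.0, 4.0],
})

# True counts as 1, so summing the boolean mask counts missing values per column
missing_per_column = df.isnull().sum()

# Summing again gives the total number of missing cells in the frame
total_missing = int(missing_per_column.sum())

# Keep only the rows in which every column is present
complete_rows = df[df.notnull().all(axis=1)]

print(missing_per_column)
print("Total missing:", total_missing)
print(complete_rows)
```

Here only the row with index 2 has a value in every column, so it is the only row left in complete_rows.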

How to Clean / Fill Missing Data in pandas?

There are different ways to fill or clean missing values. Which one to use depends entirely on the problem statement and the column type. Here, I will show the simple fillna() function.
The fillna() function can “fill in” NA values with non-null data in a couple of ways.
Let’s see them one by one.
Replacing NaN with a Scalar Value
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(3, 3),
                    index=['a', 'c', 'e'],
                    columns=['Column1', 'Column2', 'Column3'])

# row b does not exist yet, so reindexing fills it with NaN
data = data.reindex(['a', 'b', 'c'])
print(data)

print("NaN replaced with '0':")
print(data.fillna(0))
Output
    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b       NaN       NaN       NaN
c  1.054585  0.450895 -1.765849
NaN replaced with '0':
    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b  0.000000  0.000000  0.000000
c  1.054585  0.450895 -1.765849
Here, we filled the NaN values with zero; we could fill them with any other value instead.
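A single scalar is not always the best choice for every column. As a rough sketch (with made-up values, and with the fill values chosen only for illustration), fillna() also accepts a dict that maps column names to fill values, or a Series such as the column means:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one gap in each column
df = pd.DataFrame({
    'Column1': [1.0, np.nan, 3.0],
    'Column2': [10.0, 20.0, np.nan],
})

# A dict fills each column with its own value
filled_fixed = df.fillna({'Column1': 0.0, 'Column2': -1.0})

# df.mean() is a Series indexed by column name, so each gap
# is filled with the mean of its own column
filled_mean = df.fillna(df.mean())

print(filled_fixed)
print(filled_mean)
```

Whether a constant, a mean, or some other statistic is appropriate depends on the column and the problem, as noted above.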

Fill NA Forward and Backward

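We can also fill missing values by propagating neighboring values forward or backward with the method argument of fillna(). Recent pandas releases deprecate this argument in favor of the equivalent ffill() and bfill() methods.
Method              Action
pad / ffill         Forward fill
bfill / backfill    Backward fill
Example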
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3),
                    index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# forward fill: each NaN takes the last valid value above it
# (older pandas versions wrote this as data.fillna(method='pad'))
print(data.ffill())
Output
    Column1   Column2   Column3
a  0.863373  0.113220  0.167150
b  0.863373  0.113220  0.167150
c  0.175815  0.526849  0.074818
d  0.175815  0.526849  0.074818
e -0.203824 -0.921412  1.200571
f  0.864100  1.263429 -0.200021
g  0.864100  1.263429 -0.200021
h  1.774977 -0.118278  0.415756
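The example above only shows the forward direction. The sketch below, which uses a small Series with fixed illustrative values so the result is reproducible, compares forward fill, backward fill, and a forward fill restricted by the limit parameter:

```python
import numpy as np
import pandas as pd

# Illustrative Series with gaps at b, c and e
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list('abcde'))

# Forward fill: each gap takes the last valid value before it
forward = s.ffill()

# Backward fill: each gap takes the next valid value after it;
# 'e' stays NaN because nothing follows it
backward = s.bfill()

# limit=1 fills at most one consecutive NaN, so 'c' stays NaN
limited = s.ffill(limit=1)

print(pd.DataFrame({'original': s, 'ffill': forward,
                    'bfill': backward, 'ffill_limit_1': limited}))
```

Note that neither direction can fill a gap at the edge it fills toward: a leading NaN survives a forward fill, and a trailing NaN survives a backward fill.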

How to Drop Missing values?

The pandas package also provides a dropna() function to drop missing values. It is used along with the axis argument. By default, axis=0, i.e., along rows, which means that if any value within a row is NA, the whole row is dropped.
Example
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3),
                    index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# rows b, d and g contain NaN, so they are dropped
print(data.dropna())
Output
    Column1   Column2   Column3
a -0.316294  0.890039  0.349166
c -1.297559  0.113461  0.884424
e -2.175159  0.379806  2.231736
f -2.385318  1.803276 -0.342873
h  1.372849  1.482879 -0.349323
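Dropping every row that has any missing value can discard a lot of data. As a hedged sketch on a hand-made frame (values and variable names are illustrative), the how, subset, and axis parameters of dropna() give finer control:

```python
import numpy as np
import pandas as pd

# Illustrative frame: row b is entirely missing, rows c and d partly
df = pd.DataFrame({
    'Column1': [1.0, np.nan, 3.0, np.nan],
    'Column2': [1.0, np.nan, np.nan, 4.0],
    'Column3': [1.0, np.nan, 3.0, 4.0],
}, index=['a', 'b', 'c', 'd'])

# Default (how='any'): drop a row if any value in it is missing
any_missing = df.dropna()

# how='all': drop a row only if every value in it is missing
all_missing = df.dropna(how='all')

# subset: look only at the listed columns when deciding
by_subset = df.dropna(subset=['Column2'])

# axis=1: drop columns instead of rows (after removing the empty row)
complete_columns = df.dropna(how='all').dropna(axis=1)

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(by_subset.index.tolist())
print(complete_columns.columns.tolist())
```

With these values the default keeps only row a, how='all' keeps a, c and d, the subset variant keeps a and d, and only Column3 survives the column-wise drop.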

How to Replace Missing (or) Generic Values?

Sometimes we need to replace a generic value with a specific value. We can do this with the replace() method.
Replacing NaN with a scalar value using replace() is equivalent to using fillna().
Example
import pandas as pd

data = pd.DataFrame({'Column1': [10, 20, 30, 40, 50, 2000],
                     'Column2': [1000, 0, 30, 40, 50, 60]})

# the dict maps each old value to its replacement
print(data.replace({1000: 10, 2000: 60}))
Output
   Column1  Column2
0       10       10
1       20        0
2       30       30
3       40       40
4       50       50
5       60       60
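A common use of replace() in data cleansing is turning placeholder values into real missing values. The sketch below assumes, purely for illustration, that -999 was used as a "no data" marker; it converts the marker to NaN and then shows that replace() and fillna() give the same result when NaN is swapped for a scalar:

```python
import numpy as np
import pandas as pd

# Illustrative frame; -999 is a hypothetical "no data" marker
df = pd.DataFrame({'Column1': [10, -999, 30],
                   'Column2': [-999, 20, 30]})

# Turn the marker into NaN so pandas treats it as missing
with_nan = df.replace(-999, np.nan)

# Replacing NaN with a scalar: both calls produce the same values
via_fillna = with_nan.fillna(0)
via_replace = with_nan.replace(np.nan, 0)

print(with_nan)
print(via_fillna)
print(via_replace)
```

Once the markers are NaN, all of the tools shown earlier (isnull(), fillna(), dropna()) apply to them.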
I hope you enjoyed reading this article and now have a working knowledge of data cleansing in Python.
     
Keep Learning. Keep Growing. 
