Data Cleansing in Python

Shivani Upare

8 months ago

Most of us already know how important it is to have a clean dataset for a machine learning model to make good predictions. Preparing a clean, usable dataset often takes roughly 60-80% of the time in a machine learning project. If the data is not cleaned properly, the model's accuracy is likely to suffer. One of the most important parts of data cleaning is missing value treatment, which is a major point of focus for making models more accurate and valid for prediction.
According to an IBM Data Analytics report, you can expect to spend up to 80% of your time cleaning data.
IBM Report | Insideaiml

When and Why Is Data Missing?

Before we get into the coding part, it’s important to understand the different sources of missing data. Some typical reasons why data is missing:
  • A user forgot to fill in a field.
  • A user did not want to share personal details.
  • Data was lost while being transferred manually from one source to another.
  • A programming error occurred.
Let’s take an example to understand this better.
Consider an online survey about a company’s product. Respondents often do not provide all the requested information, especially personal details. Some share their experiences but not how long they have been using the product; others share how long they have used the product and their experiences, but not their contact information. In one way or another, some part of the data is always missing, and this is very common in real-world data.
By now you should have an idea of how important it is to treat missing values in our data. So, let’s see how to do it.

How to handle missing values (say NA or NaN) using Pandas?

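# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np

# Create a 5x3 DataFrame of random values with labeled rows and columns
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

# Reindexing adds rows 'b', 'd' and 'g', which have no data and become NaN
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data)

    Column1   Column2   Column3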
a -0.067397 -1.570255 -0.898418
b       NaN       NaN       NaN
c  1.311982  1.972563  0.743876
d       NaN       NaN       NaN
e  0.516474 -0.436298 -0.336320
f  0.587955  0.928367  1.014634
In the above example, we created a DataFrame with missing values, which are represented as NaN (Not a Number).

How to check Missing values using pandas?

Pandas provides functions such as isnull() and notnull() to detect missing values in a dataset, which makes our life much easier. These methods can be applied to both Series and DataFrame objects.
Let’s take an example:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# True marks the rows where Column1 is missing
print(data['Column1'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: Column1, dtype: bool
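Beyond the row-by-row Boolean mask, it often helps to count the missing values in each column, or to keep only the rows where a value is present. A minimal sketch, assuming a small hand-made DataFrame (the values are illustrative and not taken from the example above):

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch): two columns with a few gaps.
data = pd.DataFrame(
    {"Column1": [1.0, np.nan, 3.0, np.nan], "Column2": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() returns a Boolean DataFrame; summing it counts the True values per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# notnull() is the inverse mask; use it to keep rows where Column1 is present.
rows_with_column1 = data[data["Column1"].notnull()]
print(rows_with_column1)
```

Counting first is a cheap way to decide whether a column is worth filling or is better dropped.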

How to Clean / Fill Missing Data in pandas?

There are different ways to fill or clean missing values. The right choice depends entirely on the problem statement and the column type. Here, I will show a simple function, fillna(), for filling missing values.
The fillna() function can “fill in” NA values with non-null data in a couple of ways.
Let’s look at them one by one.
Replacing NaN with a Scalar Value
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                    columns=['Column1', 'Column2', 'Column3'])

# Row 'b' is not in the original index, so it is filled with NaN
data = data.reindex(['a', 'b', 'c'])
print(data)
print("NaN replaced with '0':")
print(data.fillna(0))

    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b       NaN       NaN       NaN
c  1.054585  0.450895 -1.765849
NaN replaced with '0':
    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b  0.000000  0.000000  0.000000
c  1.054585  0.450895 -1.765849
Here, we filled the NaN values with zero; we could fill them with any other value instead.
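Zero is not always a sensible replacement. A minimal sketch, assuming a small made-up DataFrame (the values are illustrative), that fills each gap with the mean of its own column, or with a different value per column:

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch).
data = pd.DataFrame(
    {"Column1": [1.0, np.nan, 3.0], "Column2": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# data.mean() is a Series indexed by column name, so each NaN is
# filled with the mean of the column it belongs to.
filled_with_mean = data.fillna(data.mean())
print(filled_with_mean)

# A dict lets us choose a separate fill value for each column.
filled_per_column = data.fillna({"Column1": 0.0, "Column2": -1.0})
print(filled_per_column)
```

Whether the mean, the median, or a constant is appropriate depends on the column and the problem, so treat this as one option rather than a default.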

Fill NA Forward and Backward

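We can also fill missing values by propagating the nearest valid value forward or backward. Older pandas versions expose this through the method argument of fillna(); recent versions deprecate that argument in favor of the ffill() and bfill() methods.
Method              Action
pad / ffill         Forward fill
bfill / backfill    Backward fill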
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Forward fill: each NaN row takes the values of the nearest row above it
# (same result as fillna(method='pad') in older pandas versions)
print(data.ffill())

    Column1   Column2   Column3
a  0.863373  0.113220  0.167150
b  0.863373  0.113220  0.167150
c  0.175815  0.526849  0.074818
d  0.175815  0.526849  0.074818
e -0.203824 -0.921412  1.200571
f  0.864100  1.263429 -0.200021
g  0.864100  1.263429 -0.200021
h  1.774977 -0.118278  0.415756
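The output above shows only the forward direction. As a minimal sketch of how the two directions differ, assuming a tiny single-column DataFrame (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch): gaps at 'b' and at the last row 'd'.
data = pd.DataFrame({"Column1": [1.0, np.nan, 3.0, np.nan]},
                    index=["a", "b", "c", "d"])

# Forward fill: 'b' copies the value from 'a', and 'd' copies the value from 'c'.
forward = data.ffill()
print(forward)

# Backward fill: 'b' copies the value from 'c'; 'd' stays NaN
# because there is no later row to copy from.
backward = data.bfill()
print(backward)
```

Note that a fill in either direction can leave NaN values at the edge of the data, so it is worth checking the result with isnull() afterwards.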

How to Drop Missing values?

The pandas package also provides a dropna() function to drop missing values. This function is used along with the axis argument. By default, axis=0, i.e., along rows, which means that if any value within a row is NA, the whole row is dropped.
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Drop every row that contains at least one NaN
print(data.dropna())

    Column1   Column2   Column3
a -0.316294  0.890039  0.349166
c -1.297559  0.113461  0.884424
e -2.175159  0.379806  2.231736
f -2.385318  1.803276 -0.342873
h  1.372849  1.482879 -0.349323
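Dropping every row with a single NaN can throw away a lot of data. A minimal sketch of some gentler options, assuming a small made-up DataFrame (the values are illustrative) and using the axis, how, and subset parameters of dropna():

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch): Column1 is complete,
# Column2 and Column3 have gaps.
data = pd.DataFrame(
    {
        "Column1": [1.0, 2.0, 3.0, 4.0],
        "Column2": [5.0, np.nan, np.nan, 8.0],
        "Column3": [9.0, np.nan, 11.0, 12.0],
    },
    index=["a", "b", "c", "d"],
)

# Default (axis=0, how='any'): drop rows containing any NaN -> keeps 'a' and 'd'.
rows_any = data.dropna()

# Only drop rows where BOTH Column2 and Column3 are missing -> drops 'b' only.
rows_all = data.dropna(subset=["Column2", "Column3"], how="all")

# axis=1: drop columns containing any NaN -> keeps only Column1.
cols_any = data.dropna(axis=1)

print(rows_any, rows_all, cols_any, sep="\n\n")
```

Which variant fits depends on how much data you can afford to lose and on which columns the model actually needs.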

How to Replace Missing (or) Generic Values?

Sometimes we need to replace a generic value with some specific value. We can do this with the replace() method.
Replacing NaN with a scalar value using replace() is equivalent to using the fillna() function.
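import pandas as pd
import numpy as np

# Column2 and the replacement mapping are assumed values for illustration,
# chosen to match the output below.
data = pd.DataFrame({'Column1': [10, 20, 30, 40, 50, 2000],
                     'Column2': [1000, 0, 30, 40, 50, 60]})
print(data.replace({1000: 10, 2000: 60}))

   Column1  Column2
0       10       10
1       20        0
2       30       30
3       40       40
4       50       50
5       60       60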
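To check the claim that replacing NaN with a scalar behaves like fillna(), here is a minimal sketch, assuming a one-column DataFrame with a single gap (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed for this sketch).
data = pd.DataFrame({"Column1": [10.0, np.nan, 30.0]})

# Replace NaN with 0 in two ways.
via_replace = data.replace(np.nan, 0)
via_fillna = data.fillna(0)

print(via_replace)
print(via_fillna)
```

Both calls leave the existing values untouched and put 0 in the missing position; fillna() is usually the clearer choice when NaN is the only thing being replaced.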
I hope you enjoyed reading this article and now have a better understanding of data cleansing in Python.
For more blogs and courses on data science, machine learning, artificial intelligence, and emerging technologies, visit us at InsideAIML.
Thanks for reading…
Happy Learning…
