Data Cleansing in Python

Shivani Upare

2 years ago

Introduction
When and Why Is Data Missed?

How to handle missing values (say NA or NaN) using Pandas?

How to check Missing values using pandas?

How to Clean / Fill Missing Data in pandas?

Fill NA Forward and Backward

How to Drop Missing values?

How to Replace Missing (or) Generic Values?

Introduction

Most of us already know how important a clean dataset is for a machine learning model to make good predictions. Roughly 60-80% of the time in a machine learning project goes into preparing a clean, usable dataset. If the data is not cleaned properly, the model's accuracy will suffer. One of the most important parts of this work is missing value treatment, which is a major point of focus in making models more accurate and valid for prediction.

According to an IBM Data Analytics report, you can expect to spend up to 80% of your time cleaning data.

When and Why Is Data Missed?

Before we get into the coding part, it’s important to understand the different sources of missing data. Some typical reasons why data is missing:

A user forgot to fill in a field.
A user did not want to share personal details.
Data was lost while being transferred manually from one source to another.
A programming error occurred.

Let’s take an example to make this more concrete.

Consider an online survey for a company’s product. People often do not share all the requested information. Some describe their experience but not how long they have used the product; others share how long they have used it and their experience, but not their contact information. In one way or another, part of the data is always missing, and this is very common in real-world datasets.

By now you should have an idea of how important it is to treat the missing values in our data. So, let’s see how.

How to handle missing values (say NA or NaN) using Pandas?

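# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np

# Build a 5x3 frame of random values with labelled rows
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

# Reindexing with the extra labels 'b', 'd' and 'g' introduces rows of NaN
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data)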

Output:

    Column1   Column2   Column3
a -0.067397 -1.570255 -0.898418
b       NaN       NaN       NaN
c  1.311982  1.972563  0.743876
d       NaN       NaN       NaN
e  0.516474 -0.436298 -0.336320
f  0.587955  0.928367  1.014634

In the above example, we created a DataFrame with missing values. Reindexing added the labels 'b', 'd' and 'g', which had no data, so pandas filled those rows with NaN (Not a Number), its marker for missing values.

How to check Missing values using pandas?

Pandas provides functions such as isnull() and notnull() (also available as isna() and notna()) to detect missing values in a dataset, which makes our life much easier. These methods can be applied to both Series and DataFrame objects.

Let’s take an example

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# isnull() returns True wherever a value is missing
print(data['Column1'].isnull())

Output

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: Column1, dtype: bool
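Beyond a per-value mask, it is often handy to count missing values per column or pull out the affected rows. The following is a minimal sketch of that idea; the small frame and its values are made up for illustration and are not taken from the example above.

```python
import numpy as np
import pandas as pd

# Small deterministic frame; names and values are illustrative only.
data = pd.DataFrame(
    {
        "Column1": [1.0, np.nan, 3.0, np.nan],
        "Column2": [np.nan, 2.0, 3.0, 4.0],
        "Column3": [1.0, 2.0, 3.0, 4.0],
    },
    index=["a", "b", "c", "d"],
)

# isna() is an alias of isnull(); summing the boolean mask counts missing cells.
missing_per_column = data.isna().sum()
print(missing_per_column)

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = data[data.isna().any(axis=1)]
print(rows_with_missing)
```

Counting first usually helps decide whether a column is worth filling or should be dropped altogether.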

How to Clean / Fill Missing Data in pandas?

There are different methods to fill or clean missing values. Which one to use depends on the problem statement and the column type. Here, I will use the fillna() function as a simple example.

The fillna() function can “fill in” NA values with non-null data in a couple of ways.

Let’s see them one by one.

Replacing NaN with a Scalar Value

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                    columns=['Column1', 'Column2', 'Column3'])

# Label 'b' has no data, so its row is filled with NaN
data = data.reindex(['a', 'b', 'c'])
print(data)

print("NaN replaced with '0':")
# fillna() returns a new DataFrame; data itself is unchanged
print(data.fillna(0))

Output

    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b       NaN       NaN       NaN
c  1.054585  0.450895 -1.765849
NaN replaced with '0':
    Column1   Column2   Column3
a -1.145282 -1.204689 -0.011520
b  0.000000  0.000000  0.000000
c  1.054585  0.450895 -1.765849

Here, we filled the NaN values with zero; we could fill them with any other value instead.
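Zero is not always a sensible replacement. One common alternative, sketched here under the assumption that the columns are numeric, is to fill each column with its own mean. The frame below is a made-up example.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing value in each column.
data = pd.DataFrame(
    {"Column1": [1.0, np.nan, 3.0], "Column2": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# mean() skips NaN by default, so each mean uses only the observed values.
column_means = data.mean()

# Passing a Series to fillna() fills each column with its matching entry.
filled = data.fillna(column_means)
print(filled)
```

Whether the mean, the median or a constant is the right choice depends on the column and on the problem being solved.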

Fill NA Forward and Backward

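We can also fill missing values by propagating neighbouring values forward or backward. The fillna() function has historically accepted these options through its method argument; recent pandas releases deprecate that argument in favour of the equivalent ffill() and bfill() methods.

Method            Action
pad / ffill       Forward fill
bfill / backfill  Backward fill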

Example

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# Forward fill: equivalent to fillna(method='pad') in older pandas versions
print(data.ffill())

Output

    Column1   Column2   Column3
a  0.863373  0.113220  0.167150
b  0.863373  0.113220  0.167150
c  0.175815  0.526849  0.074818
d  0.175815  0.526849  0.074818
e -0.203824 -0.921412  1.200571
f  0.864100  1.263429 -0.200021
g  0.864100  1.263429 -0.200021
h  1.774977 -0.118278  0.415756
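To make the difference between the two directions visible, here is a small sketch on a hand-written Series (the values are illustrative). It also uses the limit parameter, which caps how many consecutive missing values get filled.

```python
import numpy as np
import pandas as pd

# Illustrative series with a two-value gap and a trailing gap.
series = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=list("abcde"))

# Forward fill copies the last seen value into each gap.
forward = series.ffill()

# Backward fill copies the next seen value; the trailing gap has
# nothing after it, so it stays NaN.
backward = series.bfill()

# limit=1 fills at most one consecutive NaN per gap.
limited = series.ffill(limit=1)

print(forward, backward, limited, sep="\n\n")
```

Forward fill suits ordered data such as time series, where carrying the last observation forward is a reasonable assumption; for unordered rows it can silently copy unrelated values.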

How to Drop Missing values?

The pandas package also provides a dropna() function to drop missing values. This function is used along with the axis argument. By default, axis=0, i.e., along rows, which means that if any value within a row is NA, the whole row is dropped.

Example

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                    columns=['Column1', 'Column2', 'Column3'])

data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data.dropna())

Output

    Column1   Column2   Column3
a -0.316294  0.890039  0.349166
c -1.297559  0.113461  0.884424
e -2.175159  0.379806  2.231736
f -2.385318  1.803276 -0.342873
h  1.372849  1.482879 -0.349323
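Dropping every row that has any NA can discard a lot of data. As a sketch of the gentler options, the example below uses the how, thresh and axis parameters of dropna() on a made-up frame.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 'b' is entirely missing, 'c' and 'd' partly.
data = pd.DataFrame(
    {
        "Column1": [1.0, np.nan, 3.0, np.nan],
        "Column2": [1.0, np.nan, np.nan, 4.0],
        "Column3": [1.0, np.nan, 3.0, 4.0],
    },
    index=["a", "b", "c", "d"],
)

# how="all" drops a row only when every value in it is missing.
no_empty_rows = data.dropna(how="all")

# thresh=3 keeps rows that have at least 3 non-missing values.
mostly_complete = data.dropna(thresh=3)

# axis=1 drops columns instead of rows; applied after removing the
# empty row, only the fully populated column remains.
complete_columns = no_empty_rows.dropna(axis=1)

print(no_empty_rows, mostly_complete, complete_columns, sep="\n\n")
```

Like fillna(), dropna() returns a new DataFrame and leaves the original untouched unless the result is assigned back.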

How to Replace Missing (or) Generic Values?

Sometimes we need to replace a generic value with some specific value. We can do this with the replace() method.

Replacing NaN with a scalar value using replace() is equivalent to using fillna().

Example

import pandas as pd

data = pd.DataFrame({'Column1': [10, 20, 30, 40, 50, 2000],
                     'Column2': [1000, 0, 30, 40, 50, 60]})

# Map each old value to its replacement: 1000 -> 10 and 2000 -> 60
print(data.replace({1000: 10, 2000: 60}))

Output

   Column1  Column2
0       10       10
1       20        0
2       30       30
3       40       40
4       50       50
5       60       60
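The replace() method also works in the other direction: datasets sometimes encode missing entries with placeholder values, and those can be turned into NaN so that the tools above apply. The sketch below assumes hypothetical placeholders (-999 and "N/A") and made-up column names.

```python
import numpy as np
import pandas as pd

# Illustrative data where -999 and "N/A" stand in for missing entries.
data = pd.DataFrame(
    {
        "age": [25, -999, 31, -999],
        "city": ["Pune", "N/A", "Delhi", "Mumbai"],
    }
)

# A nested dict limits each replacement to the named column.
cleaned = data.replace({"age": {-999: np.nan}, "city": {"N/A": np.nan}})

# Once the placeholders are NaN, the usual tools apply, e.g. a median fill.
filled_age = cleaned["age"].fillna(cleaned["age"].median())

print(cleaned)
print(filled_age)
```

Converting placeholders first matters because a value like -999 would otherwise distort statistics such as the mean or median.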

I hope you enjoyed reading this article and now have a working idea of data cleansing in Python.

Liked what you read? Then don’t break the spree. Visit our insideAIML blog page to read more awesome articles.

Or if you are into videos, then we have an amazing Youtube channel as well. Visit our InsideAIML Youtube Page to learn all about Artificial Intelligence, Deep Learning, Data Science and Machine Learning.

Keep Learning. Keep Growing.
