
Shivani Upare

8 months ago

Most of us already know how important a clean dataset is for a machine
learning model to make good predictions. Roughly 60-80% of the time in a
machine learning project goes into preparing a clean, usable dataset. If the
data is not cleaned properly, the model's accuracy will suffer. One of the
most important parts of this work is missing value treatment, which is a
major point of focus for making models more accurate and valid for
prediction.

According to an IBM Data Analytics report, you can expect to spend up to 80% of
your time cleaning data.

IBM Report | Insideaiml

Some of the sources of Missing Values are as follows:

Before we get into the coding part, it’s important to
understand the different sources of missing data. Some typical reasons why data
is missing:

- A user forgot to fill in a field.
- A user did not want to share personal details.
- Data was lost while being transferred manually from one source to another.
- A programming error occurred.

Let’s take an example to understand this better.

Take an online survey for a company’s product. Respondents often do not
share all the requested information, especially personal details. Some
share their experience but not how long they have been using the product;
others share how long they have used the product and their experience, but
not their contact information. So, one way or another, some part of the data
is always missing, and this is very common in practice.

By now you should have an idea of how important it is to treat the missing
values in our data. So, let’s see how to do it.

```
#import the pandas library
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5,3), index=['a', 'c', 'e', 'f','h'],columns=['Column1',
'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data)
```

```
Column1 Column2 Column3
a -0.067397 -1.570255 -0.898418
b NaN NaN NaN
c 1.311982 1.972563 0.743876
d NaN NaN NaN
e 0.516474 -0.436298 -0.336320
f 0.587955 0.928367 1.014634
```

In the above example, we created a DataFrame with missing values.
Reindexing with labels that were not in the original index (b, d and g)
produces rows of missing values, which are represented as **NaN**
(**Not a Number**).

Pandas provides functions such as isnull() and notnull() to detect missing
values in a dataset, which makes our life much easier. These methods can be
applied to both Series and DataFrame objects.

```
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['Column1','Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# isnull() returns True where a value is missing
print(data['Column1'].isnull())
```

```
a False
b True
c False
d True
e False
f False
g True
h False
Name: Column1, dtype: bool
```
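As a small complement to the check above, the sketch below shows notnull() and a common way to count missing values per column. The frame and its values are made up for illustration; only isnull(), notnull() and sum() are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Small frame with known gaps (values are made up for illustration)
data = pd.DataFrame(
    {"Column1": [1.0, np.nan, 3.0, np.nan],
     "Column2": [0.5, np.nan, 2.5, 4.0]},
    index=["a", "b", "c", "d"],
)

# notnull() is the inverse of isnull(): True where a value is present
present = data["Column1"].notnull()
print(present)

# Summing the boolean mask counts the missing values in each column
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

Counting missing values per column like this is usually a quick first step before deciding whether to fill or drop them.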

There are different ways to fill or clean missing values. Which one to use
depends entirely on the problem statement and the column type. Here, I will
show a simple function, **fillna()**, to fill the missing values.

The fillna() function can “fill in” NA values with non-null data in a couple
of ways.

Let’s see them one by one.

```
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['Column1',
'Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c'])
print(data)
print("NaN replaced with '0':")
# fillna(0) returns a new DataFrame with every NaN replaced by 0
print(data.fillna(0))
```

```
Column1 Column2 Column3
a -1.145282 -1.204689 -0.011520
b NaN NaN NaN
c 1.054585 0.450895 -1.765849
NaN replaced with '0':
Column1 Column2 Column3
a -1.145282 -1.204689 -0.011520
b 0.000000 0.000000 0.000000
c 1.054585 0.450895 -1.765849
```

Here, we filled the **NaN** values with zero; we could fill them with any
other value instead.
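One common alternative to a constant is filling each column with its own mean. The following is a minimal sketch under that assumption; the sample values are made up, and whether the mean is a sensible fill depends on the column and the problem.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one gap per column
data = pd.DataFrame(
    {"Column1": [1.0, np.nan, 3.0],
     "Column2": [10.0, 20.0, np.nan]},
    index=["a", "b", "c"],
)

# data.mean() skips NaN by default and returns one mean per column;
# fillna() accepts that Series and fills each column with its own mean
filled = data.fillna(data.mean())
print(filled)
```

Passing a dict such as `{"Column1": 0, "Column2": 15}` to fillna() works the same way when each column needs a different hand-picked value.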

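We can also fill missing values by propagating neighbouring values forward
or backward. Older pandas versions expose this as fillna(method='pad') and
fillna(method='bfill'); recent versions deprecate the method argument in
favour of ffill() and bfill().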

```
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5,3), index=['a', 'c', 'e', 'f',
'h'],columns=['Column1','Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
# Forward fill; ffill() replaces the deprecated fillna(method='pad')
print(data.ffill())
```

```
Column1 Column2 Column3
a 0.863373 0.113220 0.167150
b 0.863373 0.113220 0.167150
c 0.175815 0.526849 0.074818
d 0.175815 0.526849 0.074818
e -0.203824 -0.921412 1.200571
f 0.864100 1.263429 -0.200021
g 0.864100 1.263429 -0.200021
h 1.774977 -0.118278 0.415756
```
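The output above shows forward filling. For comparison, here is a short sketch of the backward direction, plus the limit argument that caps how many consecutive gaps get filled. The single-column frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# One column with two consecutive gaps between known values
data = pd.DataFrame({"Column1": [1.0, np.nan, np.nan, 4.0]},
                    index=["a", "b", "c", "d"])

# Backward fill: each gap takes the next valid value below it
back_filled = data.bfill()
print(back_filled)

# Forward fill, but fill at most one consecutive NaN per gap
limited = data.ffill(limit=1)
print(limited)
```

With limit=1, row b is filled from row a while row c stays NaN, which can be useful when stale values should not be carried too far.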

The pandas package also provides a function, **dropna()**, to drop missing
values. This function is used along with the axis argument. By default,
**axis=0**, i.e., along rows, which means that if any value within a row is
NA then the whole row is dropped.

```
import pandas as pd
import numpy as np
data = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['Column1','Column2', 'Column3'])
data = data.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(data.dropna())
```

```
Column1 Column2 Column3
a -0.316294 0.890039 0.349166
c -1.297559 0.113461 0.884424
e -2.175159 0.379806 2.231736
f -2.385318 1.803276 -0.342873
h 1.372849 1.482879 -0.349323
```
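Dropping every row that has any NaN can be too aggressive. The sketch below illustrates a few other dropna() options (how, axis and thresh) on a made-up frame; treat it as an illustration of the arguments rather than a recommended recipe.

```python
import numpy as np
import pandas as pd

# Row b is entirely missing; row a is missing only Column2
data = pd.DataFrame(
    {"Column1": [1.0, np.nan, 3.0],
     "Column2": [np.nan, np.nan, 6.0],
     "Column3": [7.0, np.nan, 9.0]},
    index=["a", "b", "c"],
)

# how='all' drops a row only when every value in it is NaN
rows_all = data.dropna(how="all")
print(rows_all)

# axis=1 drops columns instead of rows (here: any column still holding a NaN)
cols_any = rows_all.dropna(axis=1)
print(cols_any)

# thresh=2 keeps rows that have at least two non-NaN values
rows_thresh = data.dropna(thresh=2)
print(rows_thresh)
```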

Sometimes we need to replace a generic value with a specific one. We can do
this with the replace() method.

Replacing NaN with a scalar value is equivalent to what fillna() does.

```
import pandas as pd
import numpy as np
data = pd.DataFrame({'Column1':[10,20,30,40,50,2000],
'Column2':[1000,0,30,40,50,60]})
print(data.replace({1000:10,2000:60}))
```

```
Column1 Column2
0 10 10
1 20 0
2 30 30
3 40 40
4 50 50
5 60 60
```
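The equivalence between replacing NaN with a scalar and calling fillna() can be checked directly. This is a minimal sketch on a made-up column.

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing value
data = pd.DataFrame({"Column1": [10.0, np.nan, 30.0]})

# Replace NaN with the scalar 0 in two different ways
via_replace = data.replace(np.nan, 0)
via_fillna = data.fillna(0)

# equals() compares values and dtypes of the two results
print(via_replace.equals(via_fillna))  # True
```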

I hope you enjoyed reading this article and learned about
**Data Cleansing in Python.**

For more such blogs/courses on data science, machine
learning, artificial intelligence and emerging new technologies do visit us at InsideAIML.

Thanks for reading…

Happy Learning…