Python Pandas - Missing Data

Mohit Sharma

3 years ago

When and Why Is Data Missed?

Check for Missing Values

Calculations with Missing Data

Cleaning / Filling Missing Data

1. Replace NaN with a Scalar Value

2. Fill NA Forward and Backward

3. Drop Missing Values

4. Replace Missing (or) Generic Values

Missing data is a common problem in real-life scenarios. In areas such as machine learning and data mining, poor data quality caused by missing values can noticeably reduce the accuracy of model predictions. Treating missing values is therefore a major point of focus when trying to make models more accurate and valid.

When and Why Is Data Missed?

Let us consider an online survey for a product. Often, people do not share all the information related to them. Some people share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. Thus, in one way or another, part of the data is always missing, and this is very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.


# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)

Its output is as follows −


         one        two      three
a   0.077988   0.476149   0.965836
b        NaN        NaN        NaN
c  -0.390208  -0.551605  -2.301950
d        NaN        NaN        NaN
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g        NaN        NaN        NaN
h   0.085100   0.532791   0.887415

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

Check for Missing Values

To make detecting missing values easier (and consistent across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects.

Example 1


import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# isnull() marks each missing entry with True
print(df['one'].isnull())

Its output is as follows −


a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

Example 2


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# notnull() marks each present (non-missing) entry with True
print(df['one'].notnull())

Its output is as follows −


a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
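Because the Boolean masks above are ordinary Series and DataFrame objects, they can be aggregated. The following is a minimal sketch, not part of the original examples, that counts missing values per column and lists the rows containing any; the fixed data and variable names are illustrative assumptions chosen so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random data so the counts are reproducible.
df = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan],
     "two": [np.nan, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# isnull() returns a Boolean frame; summing it counts the True values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing.index.tolist())
```

A per-column count like this is often a useful first check before deciding whether to fill or drop.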

Calculations with Missing Data

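When summing data, NA is treated as zero.
If the data are all NA, the sum is 0 in pandas 0.22 and later (earlier versions returned NaN). Pass min_count=1 to sum() to get NaN when there are no valid values.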

Example 1


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].sum())

Its output is as follows −

2.02357685917

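Example 2

import pandas as pd
import numpy as np

# dtype=float makes every cell a floating-point NaN
df = pd.DataFrame(index=[0,1,2,3,4,5], columns=['one','two'], dtype=float)

# An all-NA column sums to 0.0 by default (pandas 0.22 and later)
print(df['one'].sum())
# min_count=1 requires at least one valid value; otherwise the result is NaN
print(df['one'].sum(min_count=1))

Its output is as follows −

0.0
nan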
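The rule that NA is skipped during a calculation can be switched off. As a small sketch on an illustrative Series (the values and variable names are assumptions, not from the original article), the skipna argument controls whether NaN is ignored or propagated:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Default behaviour: NaN is ignored, so only 1.0 and 3.0 are added.
total_skipping_na = s.sum()

# skipna=False lets the NaN propagate into the result.
total_with_na = s.sum(skipna=False)

# mean() also skips NaN by default and averages the two present values.
mean_skipping_na = s.mean()

print(total_skipping_na, total_with_na, mean_skipping_na)
```

Note that the mean is taken over the valid values only, which is not the same as treating NaN as zero.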

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which are illustrated in the following sections.

1. Replace NaN with a Scalar Value

The following program shows how you can replace "NaN" with "0".


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])

df = df.reindex(['a', 'b', 'c'])

print(df)
print ("NaN replaced with '0':")
print(df.fillna(0))

Its output is as follows −


         one        two     three
a  -0.576991  -0.741695  0.553172
b        NaN        NaN       NaN
c   0.744328  -1.735166  1.749580

NaN replaced with '0':
         one        two     three
a  -0.576991  -0.741695  0.553172
b   0.000000   0.000000  0.000000
c   0.744328  -1.735166  1.749580

Here, we are filling with value zero; instead we can also fill with any other value.
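As one possible variation on filling with a single scalar, fillna also accepts a dict or a Series keyed by column name, so each column can receive its own fill value. The sketch below uses illustrative data and fill values that are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                   "two": [10.0, 20.0, np.nan]})

# A dict maps each column name to the value used for its missing entries.
filled_by_dict = df.fillna({"one": 0.0, "two": -1.0})

# df.mean() is a Series indexed by column name, so each column's NaN
# is replaced by that column's own mean.
filled_by_mean = df.fillna(df.mean())

print(filled_by_dict)
print(filled_by_mean)
```

Whether a constant, a mean, or some other value is appropriate depends on the data; this only shows the mechanics.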

2. Fill NA Forward and Backward

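Missing values can also be filled by propagating neighbouring valid values, using the same fill methods that reindexing accepts.

pad / ffill: fill values forward (carry the last valid value onward)
bfill / backfill: fill values backward (carry the next valid value back)

Example 1: Using the pad/ffill method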


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# ffill() does the same as fillna(method='pad'), which pandas 2.1+ deprecates
print(df.ffill())

Its output is as follows −


         one        two      three
a   0.077988   0.476149   0.965836
b   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
d  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415

Example 2: Using the backfill method


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

# bfill() does the same as fillna(method='backfill'), which pandas 2.1+ deprecates
print(df.bfill())

Its output is as follows −


         one        two      three
a   0.077988   0.476149   0.965836
b  -0.390208  -0.551605  -2.301950
c  -0.390208  -0.551605  -2.301950
d  -2.000303  -0.788201   1.510072
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
g   0.085100   0.532791   0.887415
h   0.085100   0.532791   0.887415
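Forward and backward filling can also be capped so that a long run of missing values is not filled entirely from one neighbour. The following sketch, using an illustrative Series that is not from the original article, compares an unrestricted fill with one limited to a single consecutive value through the limit argument.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Unrestricted forward fill: both gaps take the last valid value, 1.0.
forward = s.ffill()

# limit=1 fills at most one consecutive NaN; the second gap stays missing.
forward_limited = s.ffill(limit=1)

# Backward fill: both gaps take the next valid value, 4.0.
backward = s.bfill()

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```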

3. Drop Missing Values

If you simply want to exclude the missing values, use the dropna function along with the axis argument. By default, axis=0, which works on rows: if any value within a row is NA, the whole row is excluded. With axis=1, any column that contains an NA value is excluded instead.

Example 1


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Its output is as follows −


         one        two      three
a   0.077988   0.476149   0.965836
c  -0.390208  -0.551605  -2.301950
e  -2.000303  -0.788201   1.510072
f  -0.930230  -0.670473   1.146615
h   0.085100   0.532791   0.887415

Example 2

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna(axis=1))

Its output is as follows −

Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
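Dropping every row or column that has any NA can discard more data than intended, as the empty result above shows. As a hedged sketch on illustrative data (the frame and variable names are assumptions), the how, thresh, and subset arguments of dropna give finer control:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one":   [1.0, np.nan, np.nan, 4.0],
     "two":   [1.0, np.nan, 3.0, 4.0],
     "three": [1.0, np.nan, 3.0, np.nan]},
    index=["a", "b", "c", "d"],
)

# how="all" drops a row only when every value in it is missing (row b).
only_all_missing_dropped = df.dropna(how="all")

# thresh=3 keeps rows with at least three non-missing values (row a only).
at_least_three_values = df.dropna(thresh=3)

# subset restricts the check to the named columns (rows a and d have "one").
required_column = df.dropna(subset=["one"])

print(only_all_missing_dropped.index.tolist())
print(at_least_three_values.index.tolist())
print(required_column.index.tolist())
```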

4. Replace Missing (or) Generic Values

Many times, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value through replace() is equivalent to calling the fillna() function.

Example 1

import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})

print(df.replace({1000:10,2000:60}))

Its output is as follows −


   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
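The point that replacing NA with a scalar behaves like fillna() can be checked directly. This is a minimal sketch on illustrative data (an assumption, not from the original examples) comparing the two calls:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [10.0, np.nan, 30.0],
                   "two": [np.nan, 0.0, 60.0]})

# replace() treats NaN like any other value to be swapped out.
via_replace = df.replace(np.nan, 0)

# fillna() is the dedicated method for the same operation.
via_fillna = df.fillna(0)

print(via_replace)
print(via_replace.equals(via_fillna))
```

For missing values alone, fillna() states the intent more clearly; replace() is the more general tool when ordinary values also need to be swapped.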


I hope you enjoyed reading this article and now have a working understanding of how to handle missing data in Python Pandas.

