A small demo of some Statistical Functions in Python Pandas

Nimish Khurana

5 months ago

Table of Contents
  • Percent_change
  • Covariance
  • Cov Series
  • Correlation
  • Data Ranking
Pandas is a well-known and widely used Python library for data manipulation and analysis. It provides numerous methods and functions that speed up data analysis and preprocessing. On top of that, pandas also provides statistical functions that help you understand the data further.
Statistical methods help in understanding and analyzing the behavior of data.
This article walks through a few statistical functions that are commonly used on pandas objects.

Percent_change

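The pct_change() method is available on Series and DataFrame objects (the Panel type it once also supported has been removed from pandas). It compares every element with the element before it and returns the change as a fraction, so 1.0 means a 100% increase.
import pandas as pd
import numpy as np

# Series: each value is compared with the one before it
s = pd.Series([1, 2, 3, 4, 5, 4])
print(s.pct_change())

# DataFrame: the comparison runs down each column
df = pd.DataFrame(np.random.randn(5, 2))
print(df.pct_change())
Its output is as follows (the DataFrame values are random and will differ on each run):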
0        NaN
1   1.000000
2   0.500000
3   0.333333
4   0.250000
5  -0.200000
dtype: float64

           0         1
0        NaN       NaN
1 -15.151902  0.174730
2  -0.746374 -1.449088
3  -3.582229 -3.165836
4  15.601150 -1.860434
By default, pct_change() works down each column, comparing every row with the row above it. To compare each value with the one to its left in the same row instead, pass the axis=1 argument.
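To make the two points above concrete, here is a minimal sketch on fixed data rather than random numbers. The frame, its index labels, and its column names are made up for illustration. It first checks that pct_change() matches dividing each value by the previous one and subtracting 1, then applies the method row-wise.

```python
import pandas as pd

# Small deterministic series so the result is easy to verify by hand.
s = pd.Series([1, 2, 3, 4, 5, 4], dtype="float64")

# pct_change() matches "current / previous - 1"; the first entry is NaN
# because it has no previous value.
manual = s / s.shift(1) - 1
auto = s.pct_change()
print(auto)

# Hypothetical frame: one row per product, one column per quarter.
df = pd.DataFrame(
    {"q1": [100.0, 50.0], "q2": [110.0, 40.0], "q3": [121.0, 50.0]},
    index=["widgets", "gadgets"],
)

# axis=1 compares each value with the one to its left in the same row.
row_wise = df.pct_change(axis=1)
print(row_wise)
```

With axis=1 the first column is all NaN, since nothing sits to its left.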

Covariance

Covariance measures how two series vary together. The Series object has a cov method that computes the covariance between two Series objects. Missing values (NA) are excluded automatically.

Cov Series

import pandas as pd
import numpy as np

# Two independent random series of the same length
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print(s1.cov(s2))
Its output is as follows (the value is random and will differ on each run):
-0.12978405324
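The automatic exclusion of missing values can be checked on a small fixed example. The sketch below uses made-up values and compares Series.cov with NumPy's sample covariance computed only on the positions where both series have a value.

```python
import numpy as np
import pandas as pd

# Illustrative data: each series has one missing value, at different positions.
s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
s2 = pd.Series([2.0, 4.0, 6.0, np.nan, 10.0])

# cov() uses only the positions where both series have a value (0, 1 and 4).
result = s1.cov(s2)

# The same number computed by hand: drop incomplete pairs, then take the
# sample covariance (np.cov divides by N - 1 by default).
pairs = pd.concat([s1, s2], axis=1, keys=["s1", "s2"]).dropna()
manual = np.cov(pairs["s1"], pairs["s2"])[0, 1]

print(result, manual)
```

Both values should agree, which suggests the incomplete pairs are dropped rather than filled.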
When applied to a DataFrame, the cov method computes the covariance between every pair of columns.
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

# Covariance between two columns, then the full covariance matrix
print(frame['a'].cov(frame['b']))
print(frame.cov())
Its output is as follows (the values are random and will differ on each run):

-0.58312921152741437

           a           b           c           d            e
a   1.780628   -0.583129   -0.185575    0.003679    -0.136558
b  -0.583129    1.297011    0.136530   -0.523719     0.251064
c  -0.185575    0.136530    0.915227   -0.053881    -0.058926
d   0.003679   -0.523719   -0.053881    1.521426    -0.487694
e  -0.136558    0.251064   -0.058926   -0.487694     0.960761
Note: the covariance between columns a and b printed by the first statement is the same value that appears at row a, column b of the matrix returned by cov on the DataFrame.
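The observation in the note can also be verified programmatically. The following sketch uses a seeded random generator (the seed value is arbitrary) and checks three properties of the matrix: the a/b entry matches the pairwise result, the diagonal holds each column's variance, and the matrix is symmetric.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible; the seed is arbitrary.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((10, 3)), columns=["a", "b", "c"])

cov_matrix = frame.cov()
pairwise = frame["a"].cov(frame["b"])

# The (a, b) entry of the matrix is the pairwise covariance of a and b.
same_as_pairwise = bool(np.isclose(cov_matrix.loc["a", "b"], pairwise))

# The diagonal holds each column's sample variance.
diag_is_variance = bool(np.allclose(np.diag(cov_matrix), frame.var()))

# cov(a, b) == cov(b, a), so the matrix equals its transpose.
is_symmetric = bool(np.allclose(cov_matrix, cov_matrix.T))

print(same_as_pairwise, diag_is_variance, is_symmetric)
```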

Correlation

Correlation measures the strength of the relationship between two arrays of values (Series). Several methods are available: Pearson (the default), which measures linear association, and the rank-based Spearman and Kendall methods.
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

# Correlation between two columns, then the full correlation matrix
print(frame['a'].corr(frame['b']))
print(frame.corr())
Its output is as follows (the values are random and will differ on each run):

-0.383712785514

           a          b          c          d           e
a   1.000000  -0.383713  -0.145368   0.002235   -0.104405
b  -0.383713   1.000000   0.125311  -0.372821    0.224908
c  -0.145368   0.125311   1.000000  -0.045661   -0.062840
d   0.002235  -0.372821  -0.045661   1.000000   -0.403380
e  -0.104405   0.224908  -0.062840  -0.403380    1.000000
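How non-numeric columns are handled depends on the pandas version: older releases exclude them automatically, while pandas 2.0 and later expect you to pass numeric_only=True to exclude them.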
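The difference between the methods is easiest to see on data that is monotonic but not linear. The sketch below uses a made-up frame with a text column and assumes a pandas version that accepts the numeric_only argument (1.5 or later). Kendall is called the same way with method="kendall"; it is left out here because, as far as I know, pandas relies on SciPy for it.

```python
import pandas as pd

# Hypothetical data: y is a monotonic but non-linear function of x.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [1.0, 8.0, 27.0, 64.0, 125.0],  # x cubed
        "label": ["a", "b", "c", "d", "e"],  # non-numeric column
    }
)

# numeric_only=True drops the "label" column explicitly.
pearson = df.corr(method="pearson", numeric_only=True)
spearman = df.corr(method="spearman", numeric_only=True)

# Pearson stays below 1 because the relationship is not linear;
# Spearman is 1 (up to rounding) because the ranks of x and y agree.
print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```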

Data Ranking

Data ranking assigns a rank to each element of a Series. By default, tied values receive the mean of the ranks they span.
import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
print(s.rank())
Its output is as follows (the values are random, so the ranks will differ on each run):
a  1.0
b  3.5
c  2.0
d  3.5
e  5.0
dtype: float64
The rank method takes an optional ascending parameter, which is True by default. When it is set to False, the data is ranked in reverse, so larger values are assigned smaller ranks.
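A short sketch of the ascending parameter on fixed, made-up values, so the ranks can be checked by hand:

```python
import pandas as pd

# Illustrative values with one tie (b and d are both 30).
s = pd.Series([10, 30, 20, 30, 50], index=list("abcde"))

asc = s.rank()                  # smallest value gets rank 1
desc = s.rank(ascending=False)  # largest value gets rank 1

print(pd.DataFrame({"value": s, "ascending": asc, "descending": desc}))
```

In both directions the tied values b and d share the mean of the ranks they occupy: 3.5 when ascending, 2.5 when descending.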
Rank supports different tie-breaking methods, specified with the method parameter:
  • average − average rank of the tied group (the default)
  • min − lowest rank in the group
  • max − highest rank in the group
  • first − ranks assigned in the order they appear in the array
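The four methods listed above can be compared side by side. This is a minimal sketch on made-up values in which b and d tie for third place:

```python
import pandas as pd

# Illustrative values: b and d are tied and occupy ranks 3 and 4.
s = pd.Series([10, 30, 20, 30, 50], index=list("abcde"))

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame(
    {method: s.rank(method=method) for method in ["average", "min", "max", "first"]}
)
print(ranks)
```

Only the tied rows differ between columns: average gives both 3.5, min gives both 3, max gives both 4, and first gives b rank 3 and d rank 4 because b appears earlier.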
To learn more about Python pandas, visit InsideAIML.