Data Preprocessing, Analysis, and Visualization for building a Machine learning model

Anmol Sharma


Table of Contents
  • Introduction
  • Data Preprocessing
  • Data Analysis
  • Data Visualization
  • Conclusion

Introduction

          Machine learning means learning from data. Data plays a major role in building an accurate, efficient machine learning model: the better the data, the better the model. To make our data better, we need to perform data preprocessing, followed by data analysis, and then data visualization. So, let us begin with a famous quote.
[Image: Bill Gates quote (source: QuoteFancy)]

Data Preprocessing

          Real-world data often contains noise, has null values, and might not be in a suitable format, so we can't train our model on it directly. To deal with this problem, we need to preprocess the data.
Data preprocessing is a technique to prepare the raw data such that a Machine learning model can learn from it.
The following data preprocessing techniques make our data fit for training the model.
  • Dealing with Null Values
  • Dealing with Categorical Variables
  • Standardize Data
Note: We will implement data preprocessing in Python.

Dealing with Null Values

          It is the first phase of data preprocessing. Real-world data contains null values, and no machine learning algorithm can handle these null values on its own. So, we have to deal with null values before training the model.
There are two ways to deal with null values:

1. Deleting Rows or Columns 

          We can delete those rows or columns that contain null values. But this approach is not preferred because we will lose information. For example, suppose our dataset has 5k rows and 10 columns, out of which 2k rows contain null values in a few columns (1, 2, or 3). If we delete these rows, we will lose 40% of our data just because those rows have a few missing values. This method is not efficient, and it is advised to use it only when a small number of rows or columns contain missing values.

2.  Imputation of Null Values

          This method is used for rows or columns that contain numeric data like age, salary, etc. We can calculate the mean, median, or mode of the column that contains null values and replace the null values with that statistic. In this method, we don't lose any information, and it generally gives better results than the previous method. Look at the demonstration below to see how to use imputation in Python.
In Python, NaN represents a null/missing value.
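As a minimal sketch, here is mean imputation on a hypothetical toy DataFrame with a single Age column (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; np.nan marks the missing values.
df = pd.DataFrame({"Age": [25.0, np.nan, 30.0, np.nan, 35.0]})

# Replace NaN with the column mean: (25 + 30 + 35) / 3 = 30.
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df["Age"].tolist())  # [25.0, 30.0, 30.0, 30.0, 35.0]
```

scikit-learn's SimpleImputer(strategy="mean") achieves the same result and fits more naturally into a preprocessing pipeline; fillna() is the quickest option for a single column.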

Dealing with Categorical Variables

          Categorical variables are discrete variables, for example gender, religion, address, etc. Most machine learning algorithms can only deal with continuous/numeric variables like age, salary, etc. So, we have to encode the categorical variables before training the model. It is a crucial step in data preprocessing.
There are two types of categorical variables.
  • Ordinal
  • Non-ordinal
          Ordinal categorical variables are those which can be sorted in an order. Example: sizes of bags, S-M-L (ascending) or L-M-S (descending). On the other hand, non-ordinal categorical variables can't be sorted. Example: colours of bags.
          We have different techniques to deal with both ordinal and non-ordinal categorical variables. For ordinal, we use Label encoding, and for non-ordinal, we use One-Hot encoding.

Label Encoding 

          This technique converts categorical data into numeric form so that it becomes machine-readable. Suppose we have 3 sizes of bags, i.e., small, medium, and large; after applying Label encoding, they are labelled as 0, 1, and 2.
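A minimal sketch of label encoding the bag-size example, using an explicit mapping so the integer order matches the natural small < medium < large order (scikit-learn's LabelEncoder would instead assign labels alphabetically):

```python
import pandas as pd

# Ordinal categories: small < medium < large.
sizes = pd.Series(["small", "large", "medium", "small"])

# An explicit mapping keeps the encoding consistent with the natural order.
order = {"small": 0, "medium": 1, "large": 2}
encoded = sizes.map(order)
print(encoded.tolist())  # [0, 2, 1, 0]
```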

One-Hot Encoding 

          It generates new columns, specifying each possible value from the parent dataset. It is the most effective way to encode non-ordinal categorical variables so that machine learning algorithms can work on them.
Take a look at the picture below to understand the difference between Label encoding and One-Hot encoding.
[Image: Difference between Label encoding and One-Hot encoding (source: https://www.pi.exchange)]
          Here, in the above picture, we can see that in Label Encoding, the categories are labelled as integers. On the other hand, One-Hot Encoding creates one new column per distinct category value and marks each row with 0s and 1s.
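As a minimal sketch, one-hot encoding the non-ordinal bag-colour example with pandas (the column name and values are invented for illustration; scikit-learn's OneHotEncoder is the usual choice inside a pipeline):

```python
import pandas as pd

# Non-ordinal category: bag colour has no natural order.
df = pd.DataFrame({"colour": ["red", "blue", "red", "green"]})

# One new 0/1 column per distinct value.
onehot = pd.get_dummies(df["colour"], prefix="colour")
print(list(onehot.columns))  # ['colour_blue', 'colour_green', 'colour_red']
```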

Standardize Data

          Standardization is the last stage in data preprocessing. Generally, the features in our data are not on the same scale, and this variance can affect the predictions made by our model. For example, let us say we have a housing dataset for a particular area, and we have to predict the price of houses. The most important feature is house size; if most houses are smaller than 300 sq. ft. but a few are larger than 1000 sq. ft., the unscaled feature will skew the predictions. To get rid of this problem, we use standardization.
          It is a technique in which values are centered around the mean with unit standard deviation, i.e., the mean of the values is zero and the standard deviation is one. Look at the demonstration below to understand how to use it in Python.
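A minimal sketch of standardization on hypothetical house sizes (made-up values echoing the example above); scikit-learn's StandardScaler performs the same transformation:

```python
import numpy as np

# Hypothetical house sizes in sq. ft., with one outlier on a much larger scale.
sizes = np.array([250.0, 280.0, 300.0, 1200.0])

# Standardize: subtract the mean, divide by the standard deviation.
standardized = (sizes - sizes.mean()) / sizes.std()
print(round(standardized.mean(), 10), round(standardized.std(), 10))  # 0.0 1.0
```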
Now, we know how to do data preprocessing in Python. Let’s move on to data analysis.

Data Analysis

          It is a technique to gain important information from data by manipulating, transforming, and visualizing it. The goal is to find patterns in the data.
Now, the question is why do we need it?
Below are a few reasons to answer this question.
  • Helps in selecting the right algorithm.
  • Helps in selecting the right features.
  • For evaluation and presentation of our model.
We will discuss Exploratory Data Analysis (EDA) in this article as it is one of the most used techniques for data analysis.
EDA is a method of analyzing data to outline principal characteristics, often using graphs and other visualization techniques.
Steps to perform EDA:
  • Dataset Features
  • Variable Identification
  • Data Type Identification
  • Numeric Variable Statistical Summary
  • Non-Graphic Analysis
  • Graphical Analysis

1. Dataset Features 

          Start with understanding your dataset, i.e., its size and the number of rows and columns. Use the following code to inspect the shape of your dataset.
dataset_name.shape

2. Variable Identification 

          It is the most important step in EDA. There are two types of variables: numerical and categorical. Identify the type of each variable and store their names in separate lists. It is largely a manual process.
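Although the final call is manual, pandas can produce a first draft of the two lists by dtype. A minimal sketch on a hypothetical mixed-type dataset (all column names and values are invented):

```python
import pandas as pd

# Hypothetical dataset with numeric and categorical columns.
df = pd.DataFrame({
    "age": [25, 32, 41],
    "salary": [50000.0, 64000.0, 71000.0],
    "gender": ["F", "M", "M"],
    "city": ["Pune", "Delhi", "Mumbai"],
})

# Split column names by dtype as a starting point; review the lists by hand.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)  # ['age', 'salary'] ['gender', 'city']
```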

3. Datatype identification

          Once you have categorized the variables, the next step is to identify their data type. You can use the following python code for this step.
dataset_name.dtypes

4. Numeric Variable Statistical Summary 

          To describe statistical features like count, mean, min, percentiles, etc. of your dataset, use the following Pandas function.
dataset_name.describe()

5. Non-Graphic Analysis

          Use Pandas methods such as value_counts() and corr() to summarize your data numerically, without plots.
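A minimal sketch of non-graphic analysis on a hypothetical dataset (names and values invented): frequency counts for a categorical column and correlation between numeric columns.

```python
import pandas as pd

# Hypothetical toy dataset.
df = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M"],
    "age": [25, 32, 41, 29, 35],
    "salary": [50, 64, 71, 58, 66],
})

# Frequency counts for a categorical column.
counts = df["gender"].value_counts()

# Pairwise correlation between the numeric columns.
corr = df[["age", "salary"]].corr()
print(counts["M"], round(corr.loc["age", "salary"], 2))  # 3 0.97
```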

6. Graphic Analysis

         The last step of EDA is graphic analysis i.e., visualizing your data.

Data Visualization

          It is a method to present data in graphical format. It makes data easily understandable as the data is in summarized form. Even a large amount of data can be easily understood just by looking at a graph or plot. In Python, we mostly use the Matplotlib library for data visualization.
Some of the basic plotting techniques are described below.

1. Scatter Plot 

          It is a mathematical diagram to plot the values of two variables. Take a look at the image below.
[Image: Scatter plot example]

2.  Line Plot 

          It shows information as a series of data points connected by straight line segments. Take a look at the image below.
[Image: Line plot example]

3. Bar Charts  

        It represents categorical data with rectangular bars whose lengths are proportional to the values they represent. Take a look at the image below.
[Image: Bar chart example]
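The three plot types above can be sketched with Matplotlib in a few lines; the data here is invented for illustration, and the Agg backend is selected so the figure renders off-screen:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
categories = ["red", "blue", "green"]
counts = [3, 1, 2]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.scatter(x, y)            # scatter plot: values of two variables
ax1.set_title("Scatter Plot")
ax2.plot(x, y)               # line plot: points joined by straight lines
ax2.set_title("Line Plot")
ax3.bar(categories, counts)  # bar chart: bar length equals category value
ax3.set_title("Bar Chart")
fig.savefig("basic_plots.png")
```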

Conclusion

        We learned about data preprocessing, analysis, and visualization in Python, which are major steps in building a machine learning model. We discovered how to make raw data suitable for a machine learning algorithm to learn from using data preprocessing techniques. Also, we learned how to analyze data for valuable insights and, finally, how to present/visualize the data. Now, we are ready to build our machine learning model.
We hope you learned data preprocessing, analysis and visualization. Do reach out to us for queries on our AI dedicated discussion forum and get your query resolved within 30 minutes.
   
Liked what you read? Then don’t break the spree. Visit our blog page to read more awesome articles. 
Or if you are into videos, then we have an amazing YouTube channel as well. Visit our InsideAIML YouTube Channel to learn all about Artificial Intelligence, Deep Learning, Data Science and Machine Learning.
Keep Learning. Keep Growing. 
    
