Data Preprocessing, Analysis, and Visualization for building a Machine learning model
Machine learning means learning from data, and data plays a major role in building an accurate, efficient machine learning model: the better the data, the better the model. To make our data better, we need to perform data preprocessing, followed by data analysis, and then data visualization.
Real-world data often contains noise, has null values, and may not be in a suitable format, so we can't train our model on it directly. To deal with this problem, we need to preprocess the data.
Data preprocessing is the process of preparing raw data so that a machine learning model can learn from it.
The following data preprocessing techniques make our data fit for training the model.
Dealing with Null Values
Dealing with Categorical Variables
Note- We will implement data preprocessing in Python.
Dealing with Null Values
It is the first phase of data preprocessing. The real-world data contains null values. No machine learning algorithm can handle these null values on its own. So, we have to deal with null values before training the model.
There are two ways to deal with null values:
1. Deleting Rows or Columns
We can delete rows or columns that contain null values, but this approach is not preferred because we lose information. For example, suppose our dataset has 5,000 rows and 10 columns, and 2,000 of those rows contain null values in just a few columns (one, two, or three). If we delete these rows, we lose 40% of our data merely because a few values are missing. This method is inefficient, so it is advisable only when very few rows or columns contain missing values.
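As a minimal sketch of this approach (using a small made-up dataset), pandas' `dropna` deletes every row that has at least one null value, which illustrates how quickly data can be lost:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset: two of the three rows have a missing value.
df = pd.DataFrame({
    "Age": [25, np.nan, 30],
    "Salary": [50000, 60000, np.nan],
})

# Drop every row that contains at least one null value.
cleaned = df.dropna()
print(len(cleaned))  # only one fully populated row remains out of three
```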
2. Imputation of Null Values
This method is used for rows or columns that contain numeric data like age, salary, etc. We can calculate the mean, median, or mode of the column that contains null values and replace the null values with that statistic. With this method, we don't lose any information, and it generally gives better results than the previous method. Look at the demonstration below to see how to use imputation in Python.
In Python, NaN (not a number) represents a null/missing value.
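A minimal imputation sketch with pandas, on a hypothetical dataset: each NaN is replaced by a statistic computed from the same column (mean for Age, median for Salary here; mode works the same way).

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values in both columns.
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 35],
    "Salary": [50000, 60000, np.nan, 80000],
})

# Replace each NaN with a statistic of its own column.
df["Age"] = df["Age"].fillna(df["Age"].mean())        # mean imputation
df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # median imputation

print(df.isnull().sum().sum())  # 0 -> no missing values remain
```

scikit-learn's `SimpleImputer` offers the same idea with a fit/transform interface, which is convenient when the same statistics must later be applied to test data.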
Dealing with Categorical Variables
Categorical variables are discrete variables, for example gender, religion, address, etc. Most machine learning algorithms can only deal with continuous/numeric variables like age, salary, etc. So, we have to encode the categorical variables before training the model. It is a crucial step in data preprocessing.
There are two types of categorical variables.
Ordinal categorical variables are those which can be sorted in an order, for example the size of bags: S-M-L (ascending) or L-M-S (descending). On the other hand, non-ordinal categorical variables can't be sorted, for example the colour of bags.
We have different techniques to deal with both ordinal and non-ordinal categorical variables. For ordinal, we use Label encoding, and for non-ordinal, we use One-Hot encoding.
Label encoding converts categorical data into numeric form so that it becomes machine-readable. Suppose we have three sizes of bags, i.e., small, medium, and large; after applying Label encoding, they are labelled as 0, 1, and 2.
One-Hot encoding generates new columns, one for each possible value in the parent dataset. It is a standard, effective way to encode non-ordinal categorical variables so that machine learning algorithms can work on them.
In short, Label Encoding maps each category directly to an integer, while One-Hot Encoding creates one column per category and marks, for each row, the column that matches its value.
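Both encodings can be sketched with pandas on a hypothetical bags dataset (the size order and colour values here are assumptions for illustration): an explicit mapping handles the ordinal column, and `get_dummies` performs One-Hot encoding on the non-ordinal one.

```python
import pandas as pd

bags = pd.DataFrame({
    "Size": ["S", "M", "L", "M"],
    "Colour": ["Red", "Blue", "Red", "Green"],
})

# Label encoding for the ordinal variable: map each size to its rank.
size_order = {"S": 0, "M": 1, "L": 2}
bags["Size_encoded"] = bags["Size"].map(size_order)

# One-Hot encoding for the non-ordinal variable: one new column per colour.
bags = pd.get_dummies(bags, columns=["Colour"])

print(bags.columns.tolist())
```

scikit-learn's `LabelEncoder`/`OrdinalEncoder` and `OneHotEncoder` provide the same transformations in pipeline form.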
Standardization is the last stage of data preprocessing. Generally, the features in our data are not on the same scale, and this variance can affect the predictions made by our model. For example, say we have a housing dataset for a particular area and we have to predict house prices. The most important feature is house size; if most houses are under 300 sq. ft. but a few are over 1,000 sq. ft., the unscaled data will skew the predictions. To get rid of this problem, we use standardization.
It is a technique in which values are centered around the mean with unit standard deviation, i.e., the mean of the values becomes zero and the standard deviation becomes one. Look at the demonstration below to understand how to use it in Python.
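A minimal sketch with NumPy, using made-up house sizes: subtracting the mean and dividing by the standard deviation yields values with mean zero and standard deviation one.

```python
import numpy as np

# Hypothetical house sizes in sq. ft.; one house is far larger than the rest.
sizes = np.array([250.0, 280.0, 300.0, 1200.0])

# Standardize: center around the mean, scale to unit standard deviation.
standardized = (sizes - sizes.mean()) / sizes.std()

print(standardized.mean())  # ~0
print(standardized.std())   # 1.0
```

scikit-learn's `StandardScaler` applies the same formula feature by feature and remembers the training-set mean and deviation for later use on new data.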
Now, we know how to do data preprocessing in Python. Let’s move on to data analysis.
Data analysis is a technique for gaining important information from data by manipulating, transforming, and visualizing it. The goal is to find patterns in the data.
Now, the question is why do we need it?
Below are a few reasons to answer this question.
Helps in selecting the right algorithm.
Helps in selecting the right features.
For evaluation and presentation of our model.
We will discuss Exploratory Data Analysis (EDA) in this article as it is one of the most used techniques for data analysis.
EDA is a method of analyzing data to outline principal characteristics, often using graphs and other visualization techniques.
Steps to perform EDA:
Dataset Features
Variable Identification
Datatype Identification
Numeric Variable Statistical Summary
Non-Graphic Analysis
Graphic Analysis
1. Dataset Features
Start by understanding your dataset, i.e., its size and the number of rows and columns. Use the following code to inspect your dataset's features.
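A sketch of this first look, using a small made-up dataset in place of a real file (in practice you would load one, e.g. with `pd.read_csv`):

```python
import pandas as pd

# Hypothetical dataset; in practice: df = pd.read_csv("your_data.csv")
df = pd.DataFrame({
    "Size": [250, 300, 1200],
    "Price": [50, 65, 210],
})

print(df.shape)            # (number of rows, number of columns)
print(df.columns.tolist()) # column names
print(df.head())           # first few rows
```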
2. Variable Identification
It is one of the most important steps in EDA. There are two types of variables: numerical and categorical. Identify the type of each variable and store their names in separate lists. This is largely a manual process.
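Although the step is largely manual, pandas' `select_dtypes` can provide a first cut to review by hand (for instance, a numeric zip-code column may really be categorical). A sketch on a hypothetical dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30],
    "Gender": ["M", "F"],
    "Salary": [50000.0, 60000.0],
})

# First cut by dtype; review the lists manually afterwards.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(numerical)    # ['Age', 'Salary']
print(categorical)  # ['Gender']
```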
3. Datatype identification
Once you have categorized the variables, the next step is to identify their data types. You can use the following Python code for this step.
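The `dtypes` attribute reports the data type pandas inferred for each column; a minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30],
    "Gender": ["M", "F"],
})

print(df.dtypes)  # Age: int64, Gender: object
```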
4. Numeric Variable Statistical Summary
To describe statistical features such as count, mean, min, and percentiles of your dataset, use the following Pandas function.
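`DataFrame.describe` produces this summary for every numeric column; a sketch on a hypothetical salary column:

```python
import pandas as pd

df = pd.DataFrame({"Salary": [50000, 60000, 80000]})

# count, mean, std, min, 25%/50%/75% percentiles, max per numeric column
summary = df.describe()
print(summary)
```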
5. Non-Graphic Analysis
Use Pandas methods to analyze your data.
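For example (on a hypothetical column), `value_counts` tallies categories, `info` summarizes columns and nulls, and `isnull().sum()` counts missing values:

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M", "M"]})

counts = df["Gender"].value_counts()  # frequency of each category
print(counts)

df.info()                   # column types and non-null counts
print(df.isnull().sum())    # missing values per column
```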
6. Graphic Analysis
The last step of EDA is graphic analysis i.e., visualizing your data.
Data visualization is a method of presenting data in a graphical format. It makes data easily understandable because the data is in summarized form; even a large amount of data can be grasped just by looking at a graph or plot. In Python, we mostly use the Matplotlib library for data visualization.
Some of the basic plotting techniques are described below.
1. Scatter Plot
It is a mathematical diagram that plots the values of two variables as individual points.
2. Line Plot
It shows information as a series of data points connected by straight line segments.
3. Bar Charts
It represents categorical data with rectangular bars whose lengths are proportional to the values they represent.
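The three plot types above can be sketched in one Matplotlib figure, using made-up data (the non-interactive `Agg` backend is set so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; saves to file instead of a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

ax1.scatter(x, y)                       # scatter plot: individual (x, y) points
ax1.set_title("Scatter Plot")

ax2.plot(x, y)                          # line plot: points joined by line segments
ax2.set_title("Line Plot")

ax3.bar(["S", "M", "L"], [10, 25, 15])  # bar chart: one bar per category
ax3.set_title("Bar Chart")

fig.savefig("basic_plots.png")
```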
We learned about data preprocessing, analysis, and visualization in Python, which are major steps in building a machine learning model. We saw how to make raw data suitable for a machine learning algorithm to learn from using preprocessing techniques, how to analyze data for valuable insights, and finally how to present and visualize the data. Now, we are ready to build our machine learning model.
We hope you learned something about data preprocessing, analysis, and visualization. Do reach out to us with queries on our AI-dedicated discussion forum and get your query resolved within 30 minutes.