
Complete Guide to Data Preprocessing

Anmol Sharma

2 years ago

Table of Contents
  • Introduction
  • What is Data Preprocessing?
  • Data Preprocessing Techniques
  • Data Preprocessing Cycle
  • Summary

Introduction

          There is a popular saying these days: "Data is the new oil." Data is one of the main reasons for machine learning's popularity. Every day, we generate enormous amounts of data that machine learning models use to learn and make predictions. Without data, machine learning models are useless. However, the data we generate is usually raw: it is unstructured and contains outliers and null values. A machine learning model cannot make sense of such data directly. To make raw data fit for a machine learning model, we use data preprocessing. In this article, we will explore data preprocessing in depth. Let's get started.

What is Data Preprocessing?

          Real-world data is usually noisy, contains null values, and isn't always in a format that machine learning models can use. As a result, we can't train our model on such data directly. To address this, we must first preprocess it. Data preprocessing is the method of preparing raw data so that a machine learning model can learn from it.
Now you might be wondering how to convert raw data into a machine-readable form. Let's move on to data preprocessing techniques and understand how data is processed.

Data Preprocessing Techniques

          Several steps are involved in making raw data fit for a machine learning model. The techniques we commonly use to process data are described below.
  • Identifying and handling null values
  • Encoding categorical data
  • Rescaling data

Identifying and Handling Null Values

         The first step is to find out whether or not our dataset contains null values (a quick pandas check is sketched after the list below). If it does, we need to handle those null values.
To handle null values we have two techniques:
  • Deleting rows or columns
  • Imputing Null values
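
Before choosing a strategy, it helps to count how many nulls each column holds. Here is a minimal sketch using pandas; the DataFrame and its values are hypothetical, purely for illustration.

```python
import pandas as pd

# Hypothetical toy dataset with missing values
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "salary": [50000, 62000, None, 58000],
})

# Count the null values in each column
print(df.isnull().sum())
```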

Deleting rows or columns

          We can simply remove any rows or columns that contain null values. However, this method is not recommended, because we lose information. Suppose our dataset has 8k rows and 8 columns, and 2k of those rows contain null values in only a few columns (one, two, or three). Deleting those rows costs us 25% of the data, even though each of them is missing just a few values. This approach is inefficient and should be used only when a small number of rows or columns have missing values.
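
Here is a quick sketch of both kinds of deletion with pandas, reusing the same kind of hypothetical DataFrame as above.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "salary": [50000, 62000, None, 58000],
})

# Drop every row that contains at least one null value
print(df.dropna(axis=0))

# Drop every column that contains at least one null value
print(df.dropna(axis=1))
```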

Imputing Null values

          This approach is used for rows or columns with numeric data, such as age or salary. We can compute the mean, median, or mode of a column that contains null values and use that statistic to replace them. We don't lose any data with this procedure, and it's quick. This strategy generally produces better results than the previous one.
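
A minimal sketch of imputation with pandas, again on a hypothetical DataFrame; `median()` or `mode()` can be substituted for `mean()` in the same way.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "salary": [50000, 62000, None, 58000],
})

# Replace nulls with a per-column statistic
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())
print(df)
```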

Encoding Categorical Data

          Categorical variables are discrete variables such as gender, religion, and address. Most machine learning algorithms can only work with continuous/numeric variables such as age and salary. As a result, before training the model, we must encode the categorical variables. It's an important stage in the data preparation process.
There are two types of categorical data.
  • Ordinal
  • Non-ordinal
1. Ordinal categorical variables
          These can be sorted in a definite order. Bag sizes, for example: S-M-L (ascending) or L-M-S (descending).
2. Non-ordinal categorical variables
          These, on the other hand, cannot be sorted in any meaningful order. Bag colors, for example.
Ordinal and non-ordinal categorical variables are handled with different strategies: label encoding is used for ordinal data, whereas one-hot encoding is used for non-ordinal data.

Label Encoding

          This method transforms categorical data into numeric data, making it machine-readable. If we have three bag sizes, small, medium, and large, label encoding lets us represent them as 0, 1, and 2.
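
A minimal sketch with pandas; the `size` column is hypothetical, and an explicit mapping is used so the integers preserve the small-to-large order (scikit-learn's LabelEncoder would instead assign integers alphabetically).

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "medium", "large", "small"]})

# Map each ordinal category to an integer that preserves its order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)
print(df)
```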

One-Hot Encoding

          One-hot encoding creates one new column for each possible value of a categorical feature, marking each row with a 1 in the column of its value and a 0 everywhere else. It is a standard way of encoding non-ordinal categorical data for machine learning algorithms.
To see the difference between label encoding and one-hot encoding, look at the image below.
[Image: Label Encoding vs. One-Hot Encoding]
          In the image above, label encoding represents the categories as integers, while one-hot encoding produces one new 0/1 column per category.
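
A minimal one-hot encoding sketch with pandas; the `color` column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One new 0/1 column per category
print(pd.get_dummies(df, columns=["color"]))
```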

Rescaling Data

          In general, the features in our data are not on the same scale, and their ranges can vary widely; both issues can hurt our model's predictions. Imagine we have a housing dataset for a specific location and need to forecast house prices. The most important attribute is house size; if most houses are smaller than 300 square feet and only a few are larger than 1000 square feet, the unscaled feature will distort the predictions.
To deal with this problem, we use normalization or standardization.

Normalization

          Normalization is a scaling technique that shifts and rescales values so that they fall between 0 and 1, using x' = (x - min) / (max - min). It is also known as Min-Max scaling.
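
A minimal sketch with scikit-learn's MinMaxScaler; the house sizes are hypothetical.

```python
from sklearn.preprocessing import MinMaxScaler

# Hypothetical house sizes in square feet
sizes = [[120], [300], [450], [1000]]

scaler = MinMaxScaler()
print(scaler.fit_transform(sizes))  # every value now lies in [0, 1]
```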

Standardization

          Standardization is a technique in which values are centered around the mean with unit standard deviation, i.e. z = (x - mean) / std, so that the transformed values have a mean of zero and a standard deviation of one.
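
A minimal sketch with scikit-learn's StandardScaler, on the same hypothetical house sizes.

```python
from sklearn.preprocessing import StandardScaler

# Hypothetical house sizes in square feet
sizes = [[120], [300], [450], [1000]]

scaled = StandardScaler().fit_transform(sizes)
print(scaled)                       # centered values
print(scaled.mean(), scaled.std()) # approximately 0 and 1
```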
Now that we have learned how to process data, let's move on to the data preprocessing cycle.

Data Preprocessing Cycle

          The data preprocessing cycle is a set of procedures for extracting valuable information from unstructured data. It is cyclic, meaning the procedures must be carried out in a fixed order, after which the cycle repeats.
Below is the list of procedures involved in the data preprocessing cycle.
  • Collection
  • Cleaning
  • Input
  • Processing
  • Output 
  • Storage

Collection

          The first phase in the cycle is collecting the data, which is critical since the quality of the data collected affects the final result. We must ensure that the data obtained is both accurate and useful, as this stage establishes the foundation for everything that follows.

Cleaning

         Data cleaning is the act of sorting and filtering raw data to remove unneeded and erroneous entries. The raw data is checked for errors, duplicates, miscalculations, and missing values before being translated into a format suitable for further analysis and processing. This ensures that the processing unit receives only the highest-quality data.
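
As a small illustration, here is a hypothetical pandas sketch that drops duplicate and incomplete records.

```python
import pandas as pd

# Hypothetical raw records with a duplicate row and a missing value
raw = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meera"],
    "age": [29, 35, 35, None],
})

# Remove exact duplicates, then rows with missing values
cleaned = raw.drop_duplicates().dropna()
print(cleaned)
```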

Input

          In this stage, the raw data is converted into a machine-readable format and fed into the processing unit. This can take the form of data entry via a keyboard, scanner, or any other input device.

Processing

         This is the step at which the data is subjected to a variety of technical manipulations, including artificial intelligence algorithms, in order to extract meaningful information from it. Depending on the nature of the data, the process may consist of several stages that execute instructions in sequential order.

Output

          It's also known as interpretation, and it's the process of transmitting and displaying the refined data to the user. Users can view the output in a variety of report formats, including audio, video, graphics, and documents.

Storage

          It's the final stage of the data preprocessing cycle, where data and metadata are stored for later use. This stage is important because it allows quick access and retrieval of the processed data, which can be passed directly to the next stage when needed.

Summary

          Data preprocessing is a crucial stage of building a machine learning model. In this article, we covered what data preprocessing is, the common data preprocessing techniques, and the data preprocessing cycle. We learned why data preprocessing is important when building a machine learning model and how to transform raw data into machine-readable data.
We hope you gained an understanding of data preprocessing. Do reach out with queries on our AI-dedicated discussion forum and get them resolved within half an hour.
     
Enjoyed reading this blog? Then why not share it with others and help us make this AI community stronger.
To learn more about such concepts related to Artificial Intelligence, visit our insideAIML blog page.
You can also ask direct queries related to Artificial Intelligence, Deep Learning, Data Science and Machine Learning on our live insideAIML discussion forum.
Keep Learning. Keep Growing.
