
Everything you need to know about Splitting Dataset

Anmol Sharma

10 months ago

Table of Contents
  • Introduction
  • Splitting of Dataset
  • Why do we need to Split Dataset?
  • Train-Test Split in Scikit-learn
  • Repeatable Train-Test Split
  • Conclusion

Introduction

          In the modern world, there is a famous saying: ‘Data is the new oil’. Today we generate large datasets daily, and gathering data is no longer a major problem. But using this data effectively to build an efficient Machine learning model is still a considerable challenge. Transforming raw data into a form a Machine learning model can consume requires advanced skills and domain knowledge. Another major challenge is splitting the dataset so that our model can make the best possible predictions. But before that, we need to understand what ‘splitting a dataset’ means. So, let's take a deep dive and learn how to split a dataset in the best possible way.

Splitting of Dataset

          Splitting the dataset is one of the major parts of building a Machine learning model. Generally, for Machine learning problems, we split the dataset into two parts: a Training set and a Test set. For Deep learning problems, we usually split the dataset into three parts: Training data, Validation data, and Test data. Take a look at the picture below:
[Figure: a dataset split into training, validation, and test sets]
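As a side note, the three-way split used for Deep learning problems can be obtained with two successive calls to Scikit-learn's train_test_split. The sketch below uses a small hypothetical dataset, and the 60/20/20 ratio is just an illustration, not a prescribed choice:

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus matching labels
X = [[i, i + 1] for i in range(10)]
y = [i % 2 for i in range(10)]

# First split off the test set (20% of the data), then carve a
# validation set out of the remainder (25% of 80% = 20% of the
# original), leaving a 60/20/20 train/validation/test split.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2
```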
          In this article, we will focus on the Train-Test split, i.e., splitting the dataset into training and testing sets. Now, you might be wondering why we need to split the dataset at all when we could just train the model on the whole dataset. Let's find out.

Why do we need to Split Dataset?

          The reason for splitting the dataset into training and test sets is to assess the Machine learning model's performance. The Training set is used to train the model, i.e., the model learns from it, while the Test set is used to evaluate the model. Once the model is trained on the Training set, we feed it the Test set; the model makes predictions for those rows, and the predicted values are compared with the actual values to measure how well the model performs on unseen data.
          We usually use a 70:30 split, i.e., 70% of the data is assigned to the Training set and 30% to the Test set. Choosing the ratio matters: too little training data can leave the model underfit, while too little test data makes the performance estimate unreliable. Other common splits are:
  • Train Dataset: 80%, Test Dataset: 20%
  • Train Dataset: 67%, Test Dataset: 33%
Now, you have learned about testing and training data. Let’s see how to implement it using Python. 

Train-Test Split in Scikit-learn

          Python’s Scikit-learn library has a built-in function, train_test_split, that splits a dataset into training and testing data, with several parameters that make splitting easy. Take a look at the demonstration below.
[Figure: example of a train-test split with Scikit-learn]
Here, the test_size parameter sets the split ratio: a value of 0.3 gives a 70:30 train-test split.
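A minimal sketch of such a split, using a small hypothetical dataset in place of real features and labels:

```python
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 samples with one feature and one label each
X = [[x] for x in range(10)]
y = list(range(10))

# test_size=0.3 assigns 30% of the rows to the test set (a 70:30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print(len(X_train), len(X_test))  # 7 3
```

Note that train_test_split keeps each feature row paired with its label, so X_test and y_test always line up.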

Repeatable Train-Test Split

          It is very important to assign random rows to the Training and Test sets, because different parts of our data may follow slightly different distributions. If we don’t shuffle the data, we might end up assigning systematically different kinds of rows to the Training and Test sets, which will distort our evaluation of the model.
          Example: suppose we run a survey to estimate how much money each age group makes annually, and we collect the data of teens first, followed by adults and then seniors. If we don’t shuffle the data, the model will be trained on one group and tested on a different one, which will lead to poor measured performance.
          Scikit-learn shuffles the data before splitting by default (controlled by the shuffle parameter), and the random_state parameter seeds that shuffle. Passing the same integer value reproduces the same shuffle pattern, and therefore the same split, on every run.
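The repeatability described above can be sketched as follows, again on a small hypothetical dataset:

```python
from sklearn.model_selection import train_test_split

# A small hypothetical dataset of 10 samples
X = list(range(10))

# The same random_state seeds the shuffle identically,
# so repeated calls produce exactly the same split.
a_train, a_test = train_test_split(X, test_size=0.3, random_state=7)
b_train, b_test = train_test_split(X, test_size=0.3, random_state=7)

print(a_train == b_train and a_test == b_test)  # True
```

This is why fixing random_state is recommended whenever you need results you can reproduce, for example when comparing models on the same split.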

Conclusion

          Splitting a dataset is one of the most crucial steps in building a Machine learning model. In this article, we learned what data splitting is, why we need it, what training and test sets are, how to split a dataset with Scikit-learn, and how to make the split repeatable. To master data splitting, you need to build projects. You can also try machine learning and data science courses on InsideAIML to gain a deep understanding of the concepts along with hands-on experience.
We hope you learned what you were looking for. Do reach out to us for queries on our AI dedicated discussion forum and get your query resolved within 30 minutes.
       
Enjoyed reading this blog? Then why not share it with others? Help us make this AI community stronger.
To learn more about such concepts related to Artificial Intelligence, visit our blog page.
You can also ask direct queries related to Artificial Intelligence, Deep Learning, Data Science and Machine Learning on our live discussion forum.
Keep Learning. Keep Growing.
   
