How to handle imbalanced data and achieve good performance?
Accuracy is not a good metric for evaluating a model trained on an imbalanced dataset. In credit card fraud detection, we have a large number of normal transactions and very few fraud data points. Accuracy only counts how many predictions are correct overall (true positives plus true negatives), so a model that predicts every data point as non-fraud still scores a very high accuracy while detecting no fraud at all. Our main aim is to detect the fraud data points. For this task, precision, recall, and the ROC curve with its AUC are better-suited evaluation metrics.
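To make the accuracy problem concrete, here is a minimal sketch in plain Python. The numbers are invented for illustration (10,000 transactions, 20 of them fraud) and the "model" is simply a rule that labels everything as non-fraud. The helper names `confusion_counts` and `safe_div` are hypothetical, not from any library.

```python
# Hypothetical data: 10,000 transactions, 20 of which are fraud (label 1).
y_true = [1] * 20 + [0] * 9980
# A trivial "model" that predicts non-fraud (label 0) for every transaction.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn), treating label 1 (fraud) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a common convention)."""
    return numerator / denominator if denominator else 0.0


tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = safe_div(tp + tn, tp + tn + fp + fn)
precision = safe_div(tp, tp + fp)    # of the predicted frauds, how many were real
recall = safe_div(tp, tp + fn)       # of the real frauds, how many were caught
specificity = safe_div(tn, tn + fp)  # of the normal transactions, how many were kept

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

With these made-up counts the accuracy comes out at 0.998 even though recall is 0, which is why the minority-class metrics are the ones worth watching.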
The credit card fraud detection data set is imbalanced. In such a data set, the accuracy score cannot be the measure of performance, because a model may predict only the majority class label correctly, whereas our point of interest is predicting the minority label. Minority samples are often treated as noise and ignored, so the minority label is much more likely to be misclassified than the majority label. To evaluate model performance on imbalanced data sets, we should use Sensitivity (True Positive Rate) and Specificity (True Negative Rate) to determine the class-wise performance of the classification model. If the performance on the minority class label is not good enough, we could do the following: