How to handle imbalanced data and achieve good performance?

By Jennifer, 7 months ago

A data set is given to you about credit card fraud detection. You have built a classifier model and achieved a performance score of 98.3%. Is this a good model? If yes, justify. If not, what can you do about it?


Interview question
Credit card fraud detection
2 Answers
B.thusharmarvel97

Accuracy is not the best metric for evaluating a model trained on an imbalanced dataset. In credit card fraud detection, we have a large number of normal transactions and very few fraud data points. Accuracy only measures the share of correct predictions (true positives plus true negatives) out of all predictions, so a model that predicts every data point as non-fraud would still get a very high accuracy while catching no fraud at all. Our main aim is to detect the fraud data points. For this task, precision and recall, the precision-recall curve, and the ROC curve with its AUC are better suited evaluation metrics.
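To make this concrete, here is a minimal sketch in plain Python. The numbers are an assumption chosen only to match the 98.3% in the question (1000 transactions, 17 of them fraud); the real data set may look different. The "model" simply labels everything as non-fraud.

```python
# Hypothetical data: 1000 transactions, 17 of them fraud (label 1).
y_true = [1] * 17 + [0] * 983
# A trivial "model" that labels every transaction as non-fraud (label 0).
y_pred = [0] * len(y_true)

# Confusion matrix counts, with fraud as the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
# Guard against division by zero when there are no positive predictions.
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

Under these assumptions the accuracy is 98.3% but the recall on the fraud class is zero, which is why a high accuracy alone does not tell you the model is good.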


A credit card fraud detection data set is imbalanced: the fraud class is far smaller than the normal class. On such a data set, accuracy cannot be the measure of performance, because a model can score highly just by predicting the majority class label correctly, whereas our point of interest is predicting the minority label. Minority samples are often treated as noise and ignored by the model, so the minority label is much more likely to be misclassified than the majority label. To evaluate a model on imbalanced data, we should use sensitivity (true positive rate) and specificity (true negative rate), which show the performance of the classifier on each class label separately. If the performance on the minority class label is not good enough, we could do the following:


  • We can use undersampling or oversampling to balance the data.
  • We can change the prediction threshold value.
  • We can assign weights to labels such that the minority class labels get larger weights.
  • We could treat the problem as anomaly detection, with fraud as the anomaly.
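
Two of the options above, changing the threshold and weighting the labels, can be sketched in plain Python. The labels and scores below are made up for illustration, and the helper names `precision_recall` and `balanced_class_weights` are hypothetical. The weight formula n_samples / (n_classes * class_count) is one common heuristic for giving the minority class a larger weight, not the only choice.

```python
from collections import Counter

# Hypothetical classifier scores; 1 = fraud, 0 = normal transaction.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.60, 0.35, 0.25, 0.70, 0.30, 0.22, 0.10, 0.05, 0.02]

def precision_recall(labels, scores, threshold):
    """Predict fraud when score >= threshold, then compute precision and recall."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# Lowering the threshold catches more fraud at the cost of more false alarms.
for threshold in (0.5, 0.2):
    p, r = precision_recall(labels, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")

# The rarer class receives the larger weight.
print(balanced_class_weights(labels))
```

On this toy data, moving the threshold from 0.5 to 0.2 raises recall on the fraud class while precision drops, so the threshold should be picked according to how costly a missed fraud is compared to a false alarm.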
