New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset

Authors

  • Esraa Faisal Malik, School of Management, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia
  • Khai Wah Khaw, School of Management, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia
  • XinYing Chew, School of Computer Science, Universiti Sains Malaysia, 11800 Gelugor, Penang, Malaysia

DOI:

https://doi.org/10.31577/cai_2022_4_981

Keywords:

Cost-sensitive learning, hybrid, imbalanced dataset, resampling techniques

Abstract

One of the most challenging problems in real-world datasets is the growing prevalence of imbalanced data. When the majority class greatly outnumbers the minority class, results can be misleading, because conventional machine learning algorithms were designed on the assumption of an equal class distribution. The purpose of this study is to build a hybrid data preprocessing approach that addresses class imbalance by combining resampling with cost-sensitive learning (CSL) for fraud detection on a real-world dataset. The proposed approach consists of two steps. The first step compares several resampling approaches to find the technique with the highest performance on the validation set. The second step applies CSL with the optimal weight ratio to the resampled data from the first step. The hybrid technique had a positive impact, with F2-measures of 0.987, 0.974, 0.847 and 0.853 for random forest (RF), decision tree (DT), XGBoost and LightGBM (LGBM), respectively. It also obtained the highest prediction performance relative to the conventional methods.
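The two steps and the F2-measure named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: random oversampling stands in for whichever resampling technique the study selected, and the function names `random_oversample`, `f_beta`, `pick_best_weight` and the `evaluate` callable are hypothetical, introduced here only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority, seed=0):
    """Step 1 stand-in (assumption): duplicate random minority rows
    until the minority class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    extra = max(counts.values()) - counts[minority]
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    new_rows = list(rows) + [rng.choice(minority_rows) for _ in range(extra)]
    new_labels = list(labels) + [minority] * extra
    return new_rows, new_labels


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta=2 gives the F2-measure,
    which weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def pick_best_weight(candidate_weights, evaluate):
    """Step 2 stand-in (assumption): keep the minority-class weight whose
    validation F2 is highest. `evaluate(w)` is a hypothetical callable that
    trains a cost-sensitive model with weight w and returns (tp, fp, fn)."""
    return max(candidate_weights, key=lambda w: f_beta(*evaluate(w)))
```

In practice `evaluate` would train a classifier on the resampled data with the given class weight and count true positives, false positives and false negatives on the validation set; the F2-measure is used for selection because missing a fraud case is usually costlier than a false alarm.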

Published

2022-11-09

How to Cite

Malik, E. F., Khaw, K. W., & Chew, X. (2022). New Hybrid Data Preprocessing Technique for Highly Imbalanced Dataset. Computing and Informatics, 41(4), 981–1001. https://doi.org/10.31577/cai_2022_4_981
