Addressing Intrinsic Data Characteristic Issues of Imbalanced Medical Data Using Nature-Inspired Percolation Clustering
Abstract
Data on diseases are generally skewed towards either positive or negative cases, depending on prevalence. This imbalance can significantly degrade the performance of classification models, resulting in biased predictions and reduced accuracy for the underrepresented class. Other factors that affect classifier performance include intrinsic data characteristics, such as noise, outliers, and within-class imbalance, which complicate the learning task. Contemporary imbalance-handling techniques combine clustering with SMOTE (Synthetic Minority Oversampling Technique) to generate realistic synthetic data that preserves the underlying data distribution, generalizes to unseen data, and mitigates overfitting to noisy points. Centroid-based clustering methods (e.g., K-means) often produce synthetic samples that are too tightly clustered or poorly spaced, while density-based methods (e.g., DBSCAN) may fail to generate sufficient meaningful synthetic samples in sparse regions. This work aims to develop a nature-inspired clustering method that, combined with SMOTE, generates synthetic samples that adhere to the underlying data distribution and maintain sparsity among the data points, thereby enhancing classifier performance. We propose PC-SMOTE, which leverages Percolation Clustering (PC), a novel clustering algorithm inspired by percolation theory. PC uses a connectivity-driven framework to effectively handle irregular cluster shapes, varying densities, and sparse minority instances. The evaluation followed a hybrid design: first, PC-SMOTE was assessed on synthetically generated data with variable spread and other parameters; second, the algorithm was evaluated on eight real medical datasets. The results show that PC-SMOTE performs excellently on the breast cancer, Parkinson's, and cervical cancer datasets, with AUC in the range of 96% to 99%, higher than the other two methods.
These results demonstrate the effectiveness of PC-SMOTE in handling datasets with both low and high imbalance ratios; it often achieves competitive or superior performance relative to K-means and DBSCAN combined with SMOTE in terms of AUC, F1-score, G-mean, and PR-AUC.
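For context on the cluster-based oversampling pipeline the abstract describes, the sketch below shows the core SMOTE interpolation step in plain NumPy: each synthetic point is placed on the segment between a minority sample and one of its k nearest minority neighbours. This is a generic illustration (the function name and parameters are ours), not the authors' PC-SMOTE; cluster-aware variants such as K-means+SMOTE, DBSCAN+SMOTE, or the proposed PC-SMOTE apply this step within each discovered cluster rather than over the whole minority class.

```python
import numpy as np

def smote_sample(minority, n_new, k=3, seed=None):
    """Generate n_new synthetic points by interpolating randomly chosen
    minority points toward one of their k nearest minority neighbours
    (the basic SMOTE step; cluster-based variants run this per cluster)."""
    rng = np.random.default_rng(seed)
    m = minority.shape[0]
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest neighbours per point
    base = rng.integers(0, m, size=n_new)          # random seed points
    nbr = nn[base, rng.integers(0, k, size=n_new)] # one random neighbour each
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[nbr] - minority[base])

# Usage: oversample a tiny 2-D minority class up by 15 synthetic points.
minority = np.random.default_rng(0).random((10, 2))
synthetic = smote_sample(minority, n_new=15, k=3, seed=1)
```

Because each synthetic point is a convex combination of two existing minority points, every generated sample stays inside the bounding box of the minority class, which is why the choice of clustering (and hence which neighbours are eligible) governs how well the synthetic data respects the underlying distribution.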
References
Salmi, M., Atif, D., Oliva, D., Abraham, A., & Ventura, S. (2024). Handling imbalanced medical datasets: review of a decade of research. Artificial Intelligence Review, 57(10), 273. https://doi.org/10.1007/s10462-024-10884-2
Khushi, M., Shaukat, K., Alam, T. M., Hameed, I. A., Uddin, S., Luo, S., Yang, X., & Reyes, M. C. (2021). A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data. IEEE Access, 9, 109960–109975. https://doi.org/10.1109/ACCESS.2021.3102399
Xu, Z., Shen, D., Nie, T., Kou, Y., Yin, N., & Han, X. (2021). A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Information Sciences, 572, 574–589. https://doi.org/10.1016/j.ins.2021.02.056
Ali, A., Shamsuddin, S. M., & Ralescu, A. L. (2013). Classification with class imbalance problem. International Journal of Advances in Soft Computing and its Applications, 5(3), 176–204.
Dudjak, M., & Martinović, G. (2021). An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult. Expert Systems with Applications, 182, 115297. https://doi.org/10.1016/j.eswa.2021.115297
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449. https://doi.org/10.3233/IDA-2002-6504
García, V., Mollineda, R. A., & Sánchez, J. S. (2008). On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Analysis and Applications, 11(3–4), 269–280. https://doi.org/10.1007/s10044-007-0087-5
López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141. https://doi.org/10.1016/j.ins.2013.07.007
Cao, C., & Wang, Z. (2018). IMCStacking: Cost-sensitive stacking learning with feature inverse mapping for imbalanced problems. Knowledge-Based Systems, 150, 27–37. https://doi.org/10.1016/j.knosys.2018.02.031
Roy, S., Roy, U., Sinha, D., & Pal, R. K. (2023). Imbalanced ensemble learning in determining Parkinson’s disease using Keystroke dynamics. Expert Systems with Applications, 217, 119522. https://doi.org/10.1016/j.eswa.2023.119522
Khuat, T. T., & Le, M. H. (2020). Evaluation of Sampling-Based Ensembles of Classifiers on Imbalanced Data for Software Defect Prediction Problems. SN Computer Science, 1(2), 108. https://doi.org/10.1007/s42979-020-0119-4
Thabtah, F., Hammoud, S., Kamalov, F., & Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513, 429–441. https://doi.org/10.1016/j.ins.2019.11.004
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning (pp. 878–887). https://doi.org/10.1007/11538059_91
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem (pp. 475–482). https://doi.org/10.1007/978-3-642-01307-2_43
Dou, J., Gao, Z., Wei, G., Song, Y., & Li, M. (2023). Switching synthesizing-incorporated and cluster-based synthetic oversampling for imbalanced binary classification. Engineering Applications of Artificial Intelligence, 123, 106193. https://doi.org/10.1016/j.engappai.2023.106193
Chen, W., Guo, W., & Mao, W. (2024). An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy. Applied Intelligence, 54(22), 11430–11449. https://doi.org/10.1007/s10489-024-05754-x
Douzas, G., Bacao, F., & Last, F. (2018). Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences, 465, 1–20. https://doi.org/10.1016/j.ins.2018.06.056
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2012). DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique. Applied Intelligence, 36(3), 664–684. https://doi.org/10.1007/s10489-011-0287-y
Yang, Y., Akbarzadeh Khorshidi, H., & Aickelin, U. (2023). A Diversity-Based Synthetic Oversampling Using Clustering for Handling Extreme Imbalance. SN Computer Science, 4(6), 848. https://doi.org/10.1007/s42979-023-02249-3
Zhang, M., Ma, Y., Li, J., & Zhang, J. (2023). A density connection weight-based clustering approach for dataset with density-sparse region. Expert Systems with Applications, 230, 120633. https://doi.org/10.1016/j.eswa.2023.120633
Mahesh Kumar, K., & Rama Mohan Reddy, A. (2016). A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, 58, 39–48. https://doi.org/10.1016/j.patcog.2016.03.008
Kaur, I., & Ahmad, T. (2024). A cluster-based ensemble approach for congenital heart disease prediction. Computer Methods and Programs in Biomedicine, 243, 107922. https://doi.org/10.1016/j.cmpb.2023.107922
Arafa, A., El-Fishawy, N., Badawy, M., & Radad, M. (2022). RN-SMOTE: Reduced Noise SMOTE based on DBSCAN for enhancing imbalanced data classification. Journal of King Saud University - Computer and Information Sciences, 34(8), 5059–5074. https://doi.org/10.1016/j.jksuci.2022.06.005
Tao, X., Li, Q., Guo, W., Ren, C., He, Q., Liu, R., & Zou, J. (2020). Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering. Information Sciences, 519, 43–73. https://doi.org/10.1016/j.ins.2020.01.032
Tong, W., Wang, Y., & Liu, D. (2023). An Adaptive Clustering Algorithm Based on Local-Density Peaks for Imbalanced Data Without Parameters. IEEE Transactions on Knowledge and Data Engineering, 35(4), 3419–3432. https://doi.org/10.1109/TKDE.2021.3138962
Huth, G., Lesne, A., Munoz, F., & Pitard, E. (2014). Correlated percolation models of structured habitat in ecology. Physica A: Statistical Mechanics and Its Applications, 416, 290–308. https://doi.org/10.1016/j.physa.2014.08.006
Ijaz, M., Alfian, G., Syafrudin, M., & Rhee, J. (2018). Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Applied Sciences, 8(8), 1325. https://doi.org/10.3390/app8081325
Hong, S., An, S., & Jeon, J.-J. (2024). Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering.
Torquato, S. (2002). Random Heterogeneous Materials (Vol. 16). Springer New York. https://doi.org/10.1007/978-1-4757-6355-3
Maddison, M. S., & Schmidt, S. C. (2020). Percolation Analysis – Archaeological Applications at Widely Different Spatial Scales. Journal of Computer Applications in Archaeology, 3(1), 269–287. https://doi.org/10.5334/jcaa.54
Amil, P., Almeira, N., & Masoller, C. (2019). Outlier Mining Methods Based on Graph Structure Analysis. Frontiers in Physics, 7. https://doi.org/10.3389/fphy.2019.00194
Ma, L., & Fan, S. (2017). CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics, 18(1), 169. https://doi.org/10.1186/s12859-017-1578-z
Stefanowski, J. (2013). Overlapping, Rare Examples and Class Decomposition in Learning Classifiers from Imbalanced Data (pp. 277–306). https://doi.org/10.1007/978-3-642-28699-5_11
Fernández-Navarro, F., Hervás-Martínez, C., & Antonio Gutiérrez, P. (2011). A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognition, 44(8), 1821–1833. https://doi.org/10.1016/j.patcog.2011.02.019
Wang, J., & Awang, N. (2025). A Novel Synthetic Minority Oversampling Technique for Multiclass Imbalance Problems. IEEE Access, 13, 6054–6066. https://doi.org/10.1109/ACCESS.2025.3526673
Wang, J., & Awang, N. (2024). MKC-SMOTE: A Novel Synthetic Oversampling Method for Multi-Class Imbalanced Data Classification. IEEE Access, 12, 196929–196938. https://doi.org/10.1109/ACCESS.2024.3521120
Copyright (c) 2025 Kaikashan Siddavatam, Subhash Shinde

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).