Addressing Intrinsic Data Characteristic Issues of Imbalanced Medical Data Using Nature-Inspired Percolation Clustering
Abstract
Data on diseases are generally skewed towards either positive or negative cases, depending on prevalence. This imbalance can significantly degrade the performance of classification models, resulting in biased predictions and reduced accuracy for the underrepresented class. Other factors that affect classifier performance include intrinsic data characteristics, such as noise, outliers, and within-class imbalance, which complicate the learning task. Contemporary imbalance-handling techniques combine clustering with SMOTE (Synthetic Minority Oversampling Technique) to generate realistic synthetic data that preserves the underlying data distribution, generalizes to unseen data, and mitigates overfitting to noisy points. Centroid-based clustering methods (e.g., K-means) often produce synthetic samples that are too tightly clustered or poorly spaced, while density-based methods (e.g., DBSCAN) may fail to generate sufficient meaningful synthetic samples in sparse regions. This work aims to develop a nature-inspired clustering method that, combined with SMOTE, generates synthetic samples that adhere to the underlying data distribution and maintain sparsity among the data points, thereby enhancing classifier performance. We propose PC-SMOTE, which leverages Percolation Clustering (PC), a novel clustering algorithm inspired by percolation theory. PC uses a connectivity-driven framework to effectively handle irregular cluster shapes, varying densities, and sparse minority instances. The evaluation followed a hybrid design: first, PC-SMOTE was assessed on synthetically generated data with variable spread and other parameters; second, the algorithm was evaluated on eight real medical datasets. The results show that PC-SMOTE performs excellently on the breast cancer, Parkinson's, and cervical cancer datasets, with AUC in the range of 96% to 99%, higher than the other two methods.
These results demonstrate the effectiveness of PC-SMOTE in handling datasets with both low and high imbalance ratios; it often achieves competitive or superior performance relative to K-means and DBSCAN combined with SMOTE in terms of AUC, F1-score, G-mean, and PR-AUC.
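For context on the cluster-based oversampling pipeline the abstract describes, the sketch below shows the core SMOTE interpolation step in plain NumPy: each synthetic point is placed on the segment between a minority sample and one of its k nearest minority neighbours. This is a generic illustration (the function name and parameters are ours), not the authors' PC-SMOTE; cluster-aware variants such as K-means+SMOTE, DBSCAN+SMOTE, or the proposed PC-SMOTE apply this step within each discovered cluster rather than over the whole minority class.

```python
import numpy as np

def smote_sample(minority, n_new, k=3, seed=None):
    """Generate n_new synthetic points by interpolating randomly chosen
    minority points toward one of their k nearest minority neighbours
    (the basic SMOTE step; cluster-based variants run this per cluster)."""
    rng = np.random.default_rng(seed)
    m = minority.shape[0]
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                    # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]              # k nearest neighbours per point
    base = rng.integers(0, m, size=n_new)          # random seed points
    nbr = nn[base, rng.integers(0, k, size=n_new)] # one random neighbour each
    gap = rng.random((n_new, 1))                   # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[nbr] - minority[base])

# Usage: oversample a tiny 2-D minority class up by 15 synthetic points.
minority = np.random.default_rng(0).random((10, 2))
synthetic = smote_sample(minority, n_new=15, k=3, seed=1)
```

Because each synthetic point is a convex combination of two existing minority points, every generated sample stays inside the bounding box of the minority class, which is why the choice of clustering (and hence which neighbours are eligible) governs how well the synthetic data respects the underlying distribution.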
References
Salmi, M., Atif, D., Oliva, D., Abraham, A., & Ventura, S. (2024). Handling imbalanced medical datasets: review of a decade of research. Artificial Intelligence Review, 57(10), 273. https://doi.org/10.1007/s10462-024-10884-2
Khushi, M., Shaukat, K., Alam, T. M., Hameed, I. A., Uddin, S., Luo, S., Yang, X., & Reyes, M. C. (2021). A Comparative Performance Analysis of Data Resampling Methods on Imbalance Medical Data. IEEE Access, 9, 109960–109975. https://doi.org/10.1109/ACCESS.2021.3102399
Xu, Z., Shen, D., Nie, T., Kou, Y., Yin, N., & Han, X. (2021). A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Information Sciences, 572, 574–589. https://doi.org/10.1016/j.ins.2021.02.056
Ali, A., Shamsuddin, S. M., & Ralescu, A. L. (2013). Classification with class imbalance problem. International Journal of Advances in Soft Computing and its Applications, 5(3), 176–204.
Dudjak, M., & Martinović, G. (2021). An empirical study of data intrinsic characteristics that make learning from imbalanced data difficult. Expert Systems with Applications, 182, 115297. https://doi.org/10.1016/j.eswa.2021.115297
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449. https://doi.org/10.3233/IDA-2002-6504
García, V., Mollineda, R. A., & Sánchez, J. S. (2008). On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Analysis and Applications, 11(3–4), 269–280. https://doi.org/10.1007/s10044-007-0087-5
López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141. https://doi.org/10.1016/j.ins.2013.07.007
Cao, C., & Wang, Z. (2018). IMCStacking: Cost-sensitive stacking learning with feature inverse mapping for imbalanced problems. Knowledge-Based Systems, 150, 27–37. https://doi.org/10.1016/j.knosys.2018.02.031
Roy, S., Roy, U., Sinha, D., & Pal, R. K. (2023). Imbalanced ensemble learning in determining Parkinson’s disease using Keystroke dynamics. Expert Systems with Applications, 217, 119522. https://doi.org/10.1016/j.eswa.2023.119522
Khuat, T. T., & Le, M. H. (2020). Evaluation of Sampling-Based Ensembles of Classifiers on Imbalanced Data for Software Defect Prediction Problems. SN Computer Science, 1(2), 108. https://doi.org/10.1007/s42979-020-0119-4
Thabtah, F., Hammoud, S., Kamalov, F., & Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513, 429–441. https://doi.org/10.1016/j.ins.2019.11.004
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953
Han, H., Wang, W.-Y., & Mao, B.-H. (2005). Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning (pp. 878–887). https://doi.org/10.1007/11538059_91
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem (pp. 475–482). https://doi.org/10.1007/978-3-642-01307-2_43
Dou, J., Gao, Z., Wei, G., Song, Y., & Li, M. (2023). Switching synthesizing-incorporated and cluster-based synthetic oversampling for imbalanced binary classification. Engineering Applications of Artificial Intelligence, 123, 106193. https://doi.org/10.1016/j.engappai.2023.106193
Chen, W., Guo, W., & Mao, W. (2024). An adaptive over-sampling method for imbalanced data based on simultaneous clustering and filtering noisy. Applied Intelligence, 54(22), 11430–11449. https://doi.org/10.1007/s10489-024-05754-x
Douzas, G., Bacao, F., & Last, F. (2018). Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Information Sciences, 465, 1–20. https://doi.org/10.1016/j.ins.2018.06.056
Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2012). DBSMOTE: Density-Based Synthetic Minority Over-sampling TEchnique. Applied Intelligence, 36(3), 664–684. https://doi.org/10.1007/s10489-011-0287-y
Yang, Y., Akbarzadeh Khorshidi, H., & Aickelin, U. (2023). A Diversity-Based Synthetic Oversampling Using Clustering for Handling Extreme Imbalance. SN Computer Science, 4(6), 848. https://doi.org/10.1007/s42979-023-02249-3
Zhang, M., Ma, Y., Li, J., & Zhang, J. (2023). A density connection weight-based clustering approach for dataset with density-sparse region. Expert Systems with Applications, 230, 120633. https://doi.org/10.1016/j.eswa.2023.120633
Mahesh Kumar, K., & Rama Mohan Reddy, A. (2016). A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, 58, 39–48. https://doi.org/10.1016/j.patcog.2016.03.008
Kaur, I., & Ahmad, T. (2024). A cluster-based ensemble approach for congenital heart disease prediction. Computer Methods and Programs in Biomedicine, 243, 107922. https://doi.org/10.1016/j.cmpb.2023.107922
Arafa, A., El-Fishawy, N., Badawy, M., & Radad, M. (2022). RN-SMOTE: Reduced Noise SMOTE based on DBSCAN for enhancing imbalanced data classification. Journal of King Saud University - Computer and Information Sciences, 34(8), 5059–5074. https://doi.org/10.1016/j.jksuci.2022.06.005
Tao, X., Li, Q., Guo, W., Ren, C., He, Q., Liu, R., & Zou, J. (2020). Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering. Information Sciences, 519, 43–73. https://doi.org/10.1016/j.ins.2020.01.032
Tong, W., Wang, Y., & Liu, D. (2023). An Adaptive Clustering Algorithm Based on Local-Density Peaks for Imbalanced Data Without Parameters. IEEE Transactions on Knowledge and Data Engineering, 35(4), 3419–3432. https://doi.org/10.1109/TKDE.2021.3138962
Huth, G., Lesne, A., Munoz, F., & Pitard, E. (2014). Correlated percolation models of structured habitat in ecology. Physica A: Statistical Mechanics and Its Applications, 416, 290–308. https://doi.org/10.1016/j.physa.2014.08.006
Ijaz, M., Alfian, G., Syafrudin, M., & Rhee, J. (2018). Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Applied Sciences, 8(8), 1325. https://doi.org/10.3390/app8081325
Hong, S., An, S., & Jeon, J.-J. (2024). Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering.
Torquato, S. (2002). Random Heterogeneous Materials (Vol. 16). Springer New York. https://doi.org/10.1007/978-1-4757-6355-3
Maddison, M. S., & Schmidt, S. C. (2020). Percolation Analysis – Archaeological Applications at Widely Different Spatial Scales. Journal of Computer Applications in Archaeology, 3(1), 269–287. https://doi.org/10.5334/jcaa.54
Amil, P., Almeira, N., & Masoller, C. (2019). Outlier Mining Methods Based on Graph Structure Analysis. Frontiers in Physics, 7. https://doi.org/10.3389/fphy.2019.00194
Ma, L., & Fan, S. (2017). CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests. BMC Bioinformatics, 18(1), 169. https://doi.org/10.1186/s12859-017-1578-z
Stefanowski, J. (2013). Overlapping, Rare Examples and Class Decomposition in Learning Classifiers from Imbalanced Data (pp. 277–306). https://doi.org/10.1007/978-3-642-28699-5_11
Fernández-Navarro, F., Hervás-Martínez, C., & Antonio Gutiérrez, P. (2011). A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recognition, 44(8), 1821–1833. https://doi.org/10.1016/j.patcog.2011.02.019
Wang, J., & Awang, N. (2025). A Novel Synthetic Minority Oversampling Technique for Multiclass Imbalance Problems. IEEE Access, 13, 6054–6066. https://doi.org/10.1109/ACCESS.2025.3526673
Wang, J., & Awang, N. (2024). MKC-SMOTE: A Novel Synthetic Oversampling Method for Multi-Class Imbalanced Data Classification. IEEE Access, 12, 196929–196938. https://doi.org/10.1109/ACCESS.2024.3521120
Copyright (c) 2025 Kaikashan Siddavatam, Subhash Shinde

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).