An Interpretable Machine Learning Framework for Diabetes Prediction Using SMOTE-ENN Resampling and Feature Importance Analysis

Mustafa Hammad; Ahmed E. Aboanber; Ibrahim Gad

doi:10.9734/ajrcos/2026/v19i6869

Published: 2026-06-15

Page: 40-61

Issue: 2026 - Volume 19 [Issue 6]

Mustafa Hammad *

Department of Mathematics, Faculty of Science, Tanta University, Tanta, Egypt.

Ahmed E. Aboanber

Department of Mathematics, Faculty of Science, Tanta University, Tanta, Egypt.

Ibrahim Gad

Department of Mathematics, Faculty of Science, Tanta University, Tanta, Egypt.

*Author to whom correspondence should be addressed.

Abstract

Diabetes is a major global health challenge, making early and accurate prediction essential for improving patient outcomes and reducing healthcare burdens. This study presents an integrated framework for diabetes risk prediction using the Pima Indians Diabetes Dataset. The novelty of the proposed approach lies in combining class imbalance handling, comparative machine learning analysis, causal inference, and feature importance evaluation to achieve both high predictive performance and improved model interpretability.

To address class imbalance, the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbors (SMOTE-ENN) was applied during data preprocessing. Seven machine learning algorithms were trained and evaluated: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree Classifier, Support Vector Classifier (SVC), Random Forest Classifier, Gradient Boosting Classifier, and Extra Tree Classifier. LightGBM-based feature importance analysis and causal inference techniques were then used to identify the factors most strongly associated with diabetes risk and to make the predictive models easier to explain. The KNN classifier performed best, with an accuracy of 94.33% and an AUC-ROC score of 98.47%.
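
The resampling idea behind SMOTE-ENN can be illustrated with a minimal, standard-library sketch. It is not the authors' code and not the implementation found in any particular library. It assumes a binary problem with at least two minority samples, uses synthetic two-dimensional toy data rather than the Pima Indians Diabetes Dataset, and uses a majority-vote cleaning rule. The function names (`smote`, `enn`, `smote_enn`) and the parameter `k` are hypothetical choices made for this illustration.

```python
import math
import random
from collections import Counter


def nearest(point, pool, k):
    """Return indices of the k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(X, y, minority, k=3, rng=None):
    """Oversample the minority class by interpolation until both classes match.

    Assumption: exactly two classes and at least two minority samples.
    """
    rng = rng or random.Random(0)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(X) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_pts)
        # Ask for k + 1 neighbours and drop the first (the base point itself).
        idx = nearest(base, minority_pts, k + 1)[1:]
        neighbour = minority_pts[rng.choice(idx)]
        gap = rng.random()  # position along the segment base -> neighbour
        X_new.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
        y_new.append(minority)
    return X_new, y_new


def enn(X, y, k=3):
    """Keep only samples whose label agrees with the majority of their k neighbours."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        idx = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        vote = Counter(y[j] for j in idx).most_common(1)[0][0]
        if vote == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


def smote_enn(X, y, minority, k=3, seed=0):
    """Oversample with SMOTE, then clean both classes with ENN."""
    X_bal, y_bal = smote(X, y, minority, k, random.Random(seed))
    return enn(X_bal, y_bal, k)


# Toy imbalanced data: 40 majority points around (0, 0), 10 minority around (2, 2).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10

X_res, y_res = smote_enn(X, y, minority=1)
print("before:", dict(Counter(y)))
print("after: ", dict(Counter(y_res)))
```

The oversampling step equalises the class counts, and the cleaning step then removes points from either class that sit among neighbours of the other class, so the final counts are usually close to balanced but not exactly equal.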

These findings indicate that integrating advanced data balancing techniques with interpretable machine learning methods can improve both predictive accuracy and the understanding of diabetes-related risk factors, thereby supporting the development of reliable clinical decision-support systems.

Keywords: Diabetes prediction, machine learning, Pima Indians diabetes dataset, SMOTE-ENN, imbalanced data, K-Nearest Neighbors (KNN), feature importance, Explainable Artificial Intelligence (XAI)

How to Cite

Hammad, Mustafa, Ahmed E. Aboanber, and Ibrahim Gad. 2026. “An Interpretable Machine Learning Framework for Diabetes Prediction Using SMOTE-ENN Resampling and Feature Importance Analysis”. Asian Journal of Research in Computer Science 19 (6):40-61. https://doi.org/10.9734/ajrcos/2026/v19i6869.
