From Raw Data to Reliable Predictions: The Significance of Data Processing in COVID-19 Modelling

Sangita Das *

Syamaprasad College, West Bengal-700026, India.

Subhrajyoti Maji

Independent Researcher, West Bengal-712232, India.

*Author to whom correspondence should be addressed.


Abstract

Accurate predictive models are essential for analysing COVID-19 mortality trends. This study evaluates the effect of a novel customised data pre-processing pipeline on ten machine learning models that predict COVID-19 mortality in India, using data from Our World in Data (OWID). The pipeline differs from a standard pre-processing pipeline in four main steps. First, it corrects administrative reporting delays by converting weekly reported totals into daily figures, which reduces reporting bias and yields more precise estimates. Second, it applies momentum-preserving local outlier processing to retain data variability and improve accuracy. Third, it enforces mathematical relationships among features to ensure strict data consistency. Finally, it integrates an iterative feature selection process to optimise the feature set and boost model performance. Results show a considerable improvement with the custom pipeline: the MLP Regressor achieved a test RMSE of 83.663 and a test R² of 0.986, outperforming the Gradient Boosting Regressor from the standard pipeline, which had a test RMSE of 171.525 and a test R² of 0.859. A primary contribution of this research is the formal validation of model stability through a newly introduced diagnostic, RMSE Variance. This metric quantifies how consistent predictive performance is across multiple iterations, distinguishing genuine generalisability from chance success. These results highlight the necessity of tailored pre-processing and provide a transferable framework for global health datasets affected by non-stationary noise and reporting inconsistencies.

Keywords: COVID-19 Mortality Prediction, Data Pre-processing, Custom Pipeline, Feature Selection, Predictive Modelling, Machine Learning
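
The abstract describes RMSE Variance only as a measure of how consistent predictive performance is across multiple iterations; it does not give a formula. The following is a minimal sketch of one plausible reading, assuming the metric is the population variance of the test RMSE over repeated runs. The function names `rmse` and `rmse_variance` and the sample values are hypothetical and are not taken from the study.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_variance(y_true, predictions_per_run):
    """Variance of the test RMSE across repeated runs.

    Assumption: population variance of per-run RMSE scores. The study's
    exact definition may differ (e.g. sample variance or a per-fold scheme).
    """
    scores = [rmse(y_true, y_pred) for y_pred in predictions_per_run]
    return statistics.pvariance(scores)


# Hypothetical values for illustration only, not data from the study.
y_true = [10.0, 12.0, 15.0, 11.0]
predictions_per_run = [
    [11.0, 11.0, 16.0, 10.0],  # e.g. predictions from run 1 (seed A)
    [10.0, 12.0, 15.0, 11.0],  # run 2 (seed B)
    [12.0, 14.0, 17.0, 13.0],  # run 3 (seed C)
]
variance = rmse_variance(y_true, predictions_per_run)
```

Under this reading, a low value would indicate that a model's error changes little from run to run, while a high value would suggest that a single good score may owe more to a favourable random seed than to generalisable fit.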


How to Cite

Das, Sangita, and Subhrajyoti Maji. 2026. “From Raw Data to Reliable Predictions: The Significance of Data Processing in COVID-19 Modelling”. Asian Journal of Research in Computer Science 19 (2):75-96. https://doi.org/10.9734/ajrcos/2026/v19i2826.
