Improving Medical Insurance Cost Prediction Accuracy with Explainable Supervised Machine Learning based Classification Techniques

  • Unique Paper ID: 182714
  • Volume: 12
  • Issue: 2
  • PageNo: 3181-3195
  • Abstract: Health insurance plans protect people financially by covering medical bills and reducing the financial burden of disease. Healthcare costs and health insurance premiums are influenced by many variables, and early cost predictions can help identify the right level of coverage and its likely benefits. In the insurance sector, machine learning (ML) has the potential to increase policy efficiency, and ML algorithms are well suited to predicting high healthcare costs. Traditional actuarial methods often fall short in capturing complex relationships in the data. ML models such as the boosting ensembles LightGBM and CatBoost, together with Decision Trees, offer improved accuracy and interpretability. The primary objective of this study is to create supervised ML models capable of producing accurate predictions of health insurance cost. The dataset, Medicalpremium.csv from Kaggle, was preprocessed through data cleaning, feature scaling with StandardScaler, and class balancing with RandomOverSampler. Three regression models (LightGBM, CatBoost, and Decision Tree) were developed and compared against baseline models such as XGBoost and Random Forest. Model performance was assessed using R-squared, MAE, RMSE, and MAPE, and hyperparameters were tuned with GridSearchCV. LightGBM emerged as the best model with an R-squared of 98.67%, outperforming CatBoost (97.62%) and Decision Tree (96.18%), as well as the baselines XGBoost (82.78%) and Random Forest (82.25%). Visual explainability was incorporated through learning curves, actual vs. predicted plots, residual plots, Q-Q plots, prediction error plots, and ICE plots. The study concludes that ensemble-based boosting models, especially LightGBM, offer superior accuracy and generalization in predicting medical insurance costs, establishing a reliable methodology for real-world healthcare applications.
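The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `regression_metrics` and the sample premium values are hypothetical, and the formulas are the standard textbook definitions. R-squared is returned here as a fraction, so a reported 98.67% would correspond to 0.9867.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R-squared, MAE, RMSE, and MAPE (in percent) for two equal-length lists.

    Assumes y_true is not constant (needed for R-squared) and contains
    no zeros (needed for MAPE).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical premiums, made up for illustration only (not taken from the dataset).
actual = [15000.0, 25000.0, 30000.0, 20000.0]
predicted = [16000.0, 24000.0, 29000.0, 21000.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as the premium, while R-squared and MAPE are unitless, which is one reason studies often report all four together.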

Copyright & License

Copyright © 2025. The authors retain the copyright of this article. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

BibTeX

@article{182714,
        author = {Appurva Sharma and Alok Bansal and Subhash Chandra Jat},
        title = {Improving Medical Insurance Cost Prediction Accuracy with Explainable Supervised Machine Learning based Classification Techniques},
        journal = {International Journal of Innovative Research in Technology},
        year = {2025},
        volume = {12},
        number = {2},
        pages = {3181-3195},
        issn = {2349-6002},
        url = {https://ijirt.org/article?manuscript=182714},
        abstract = {Health insurance plans protect people financially by covering medical bills and reducing the financial burden of disease. Healthcare costs and health insurance premiums are influenced by many variables, and early cost predictions can help identify the right level of coverage and its likely benefits. In the insurance sector, machine learning (ML) has the potential to increase policy efficiency, and ML algorithms are well suited to predicting high healthcare costs. Traditional actuarial methods often fall short in capturing complex relationships in the data. ML models such as the boosting ensembles LightGBM and CatBoost, together with Decision Trees, offer improved accuracy and interpretability. The primary objective of this study is to create supervised ML models capable of producing accurate predictions of health insurance cost. The dataset, Medicalpremium.csv from Kaggle, was preprocessed through data cleaning, feature scaling with StandardScaler, and class balancing with RandomOverSampler. Three regression models (LightGBM, CatBoost, and Decision Tree) were developed and compared against baseline models such as XGBoost and Random Forest. Model performance was assessed using R-squared, MAE, RMSE, and MAPE, and hyperparameters were tuned with GridSearchCV. LightGBM emerged as the best model with an R-squared of 98.67%, outperforming CatBoost (97.62%) and Decision Tree (96.18%), as well as the baselines XGBoost (82.78%) and Random Forest (82.25%). Visual explainability was incorporated through learning curves, actual vs. predicted plots, residual plots, Q-Q plots, prediction error plots, and ICE plots. The study concludes that ensemble-based boosting models, especially LightGBM, offer superior accuracy and generalization in predicting medical insurance costs, establishing a reliable methodology for real-world healthcare applications.},
        keywords = {Healthcare, Medical Insurance Costs, Machine Learning, LightGBM, CatBoost, Decision Tree, Class Imbalance, Explainable AI, GridSearchCV},
        month = {July},
        }
