Investigating cancer risk factors using machine learning on national electronic health records


Thesis Type: Postgraduate

Institution Of The Thesis: Akdeniz University, Institute Of Health Sciences, Turkey

Approval Date: 2024

Thesis Language: Turkish

Student: ESRA TOKUR SONUVAR

Supervisor: Kemal Hakan Gülkesen

Abstract:

Objective: Cancer is one of the leading causes of death worldwide. National electronic health records (EHR) provide a rich source of data that can be analyzed to identify potential cancer risk factors. This study aims to evaluate the accuracy and performance of cancer prediction using various machine learning (ML) models. It also aims to examine the effects of individual variables on cancer risk and to develop recommendations for clinical applications and patient management based on these findings.
Method: The data set consists of records of citizens of the Republic of Turkey and persons holding a residence permit in Turkey, all over the age of 18. The data were obtained from the e-Nabız system. The experimental group comprised individuals diagnosed with cancer between 1 January 2018 and 31 December 2022, and the control group comprised individuals with no cancer diagnosis. After standard scaling, several ML models (logistic regression [LR], support vector machines [SVM], XGBoost, decision trees, random forests, artificial neural networks) were applied. Model performance was evaluated using accuracy, sensitivity, precision, F1 score, Matthews correlation coefficient (MCC), area under the ROC curve (AUC-ROC), and precision-recall curve (PRC) metrics. Additionally, the effects of the variables on cancer risk were analyzed using odds ratios, p-values, and effect sizes.
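As an illustration of the evaluation measures named above, the following minimal Python sketch computes the threshold-based metrics from confusion-matrix counts and converts a logistic regression coefficient into an odds ratio with a Wald 95% confidence interval. The function names, the example counts, and the coefficient values are hypothetical and are not taken from the thesis; the thesis does not state how its intervals were computed, so the Wald interval is an assumption.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix counts (hypothetical helper)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


def odds_ratio_from_coef(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from an LR coefficient.

    Assumption: beta and se come from a fitted logistic regression.
    With standard-scaled predictors, the odds ratio refers to a
    one-standard-deviation increase in the variable.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


if __name__ == "__main__":
    # Made-up counts and coefficient, for illustration only.
    print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
    print(odds_ratio_from_coef(beta=0.25, se=0.05))
```

An odds ratio whose interval excludes 1 corresponds to a variable associated with higher (above 1) or lower (below 1) odds of a cancer diagnosis.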
Results: The XGBoost model achieved the highest performance, with an AUC of 0.846 (95% CI: 0.841-0.850). LR analysis showed that older age, residence in the Istanbul region, high haemoglobin, low ALT, and some comorbidities were associated with cancer risk.
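To make the reported AUC and its confidence interval concrete, the sketch below computes the AUC as the probability that a randomly chosen case receives a higher score than a randomly chosen control, and estimates a 95% interval with a percentile bootstrap. The bootstrap is an assumption chosen for illustration; the abstract does not state how its interval was derived. The function names and the toy data are hypothetical.

```python
import random


def auc_score(labels, scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (illustrative assumption)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc_score(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    # Toy data for illustration only: 1 = cancer diagnosis, 0 = control.
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print(auc_score(y, p))
    print(bootstrap_auc_ci(y, p, n_boot=200))
```

The pairwise AUC computation is quadratic in the sample size, so on a national data set a rank-based or library implementation would be the practical choice.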
Conclusion: Our study showed that the XGBoost model performed best in cancer prediction. The effects of certain variables on cancer risk provide critical information for clinical applications and patient management. These findings support the use of ML models in healthcare applications and contribute to a better understanding of cancer.