A Comprehensive Modeling Framework for Air Quality Prediction in Istanbul and CatBoost-SHAP Based Explainability


AKINER M. E.

Pure and Applied Geophysics, vol.182, no.11, pp.4771-4803, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 182 Issue: 11
  • Publication Date: 2025
  • DOI: 10.1007/s00024-025-03840-w
  • Journal Name: Pure and Applied Geophysics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Agricultural & Environmental Science Database, Aquatic Science & Fisheries Abstracts (ASFA), Compendex, Geobase, INSPEC
  • Pages: pp.4771-4803
  • Keywords: Air quality prediction, CatBoost, deep learning, meteorological parameters, particulate matter, SHAP
  • Akdeniz University Affiliated: Yes

Abstract

This study evaluates six machine learning models (BAT-ANN, CatBoost, LightGBM, Bi-LSTM, RF-Bi-LSTM, and SVM-Bi-LSTM) to identify the most suitable one for air quality prediction in Istanbul, with the aim of contributing to the development of scientific decision support systems. Among the models trained on air quality and meteorological data collected between 2013 and 2024, CatBoost gave the most accurate predictions, with the lowest error and AIC values (RMSE: 2.2781, MAE: 1.3708, AIC: 3924.774) and the highest goodness-of-fit scores (R²: 0.9959, NSE: 0.9959). The RF-Bi-LSTM and LightGBM models ranked second and third, respectively. SHAP analysis revealed that PM10 and PM2.5 are the most decisive factors in the air quality predictions and that their synergistic effect significantly raises the predicted AQI, although this effect saturates beyond a certain PM10 threshold. NOx was strongly correlated with PM2.5 levels and increased air pollution by accumulating, especially at low wind speeds. The effect of CO on AQI is significant at low concentrations but saturates at high levels. The impact of O3 on AQI varies with factors such as temperature and solar radiation, and O3 causes sudden AQI increases at high temperatures. This indicates that time series-based models (Bi-LSTM, RF-Bi-LSTM) can generalize better thanks to their ability to take meteorological variables into account. The CatBoost model achieved high accuracy in air pollution prediction by handling categorical data natively, and SHAP analysis increased the explainability of its predictions.
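
The evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. It is not the author's code: the function name `regression_metrics`, its parameters, and the sample values are hypothetical. Two assumptions are worth flagging. R² is taken here as the coefficient of determination, which has the same formula as NSE (this would be consistent with the identical values reported, though the paper's exact definitions are not given in the abstract). AIC is computed with the common least-squares form n·ln(SSE/n) + 2k, which may differ from the form used in the study.

```python
import math


def regression_metrics(observed, predicted, n_params):
    """Return RMSE, MAE, R2, NSE and a least-squares AIC for paired values.

    n_params is the number of fitted model parameters (k) used by the AIC.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    sse = sum(r * r for r in residuals)                  # sum of squared errors
    mean_obs = sum(observed) / n
    sst = sum((o - mean_obs) ** 2 for o in observed)     # total sum of squares
    if sst == 0:
        raise ValueError("observed values are constant; R2 and NSE are undefined")

    rmse = math.sqrt(sse / n)
    mae = sum(abs(r) for r in residuals) / n
    nse = 1.0 - sse / sst                                # Nash-Sutcliffe efficiency
    r2 = nse  # assumption: R2 as coefficient of determination, same formula as NSE
    # Assumption: Gaussian-error AIC, n*ln(SSE/n) + 2k, additive constants dropped.
    aic = n * math.log(sse / n) + 2 * n_params if sse > 0 else float("-inf")

    return {"rmse": rmse, "mae": mae, "r2": r2, "nse": nse, "aic": aic}


if __name__ == "__main__":
    # Hypothetical AQI values, for illustration only (not data from the study).
    obs = [10.0, 20.0, 30.0, 40.0]
    pred = [12.0, 18.0, 33.0, 39.0]
    for name, value in regression_metrics(obs, pred, n_params=2).items():
        print(f"{name}: {value:.4f}")
```

Lower RMSE, MAE, and AIC and higher R² and NSE indicate a better fit, which is the sense in which the abstract ranks the models.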