BMC ORAL HEALTH, vol. 25, pp. 1-6, 2025 (SCI-Expanded)
Introduction
This study hypothesized that large language models (LLMs) would underperform compared to expert clinicians in diagnosing and managing complex endodontic anomalies, such as dens invaginatus, when provided with periapical radiographs. Although LLMs have shown promise in dental education and basic diagnostics, their effectiveness in nuanced clinical reasoning has remained unclear.
Methods
Nineteen anonymized periapical radiographs depicting challenging endodontic conditions were paired with clinical vignettes. Six advanced LLMs and one expert endodontist independently answered six structured clinical questions per case. Each response was scored against a reference key. Accuracy rates were compared using Kruskal-Wallis and Mann-Whitney U tests. Chi-square tests were used to evaluate model performance across question types.
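A minimal sketch of how this analysis could be run in Python with SciPy is shown below; the data values, variable names, and question-type breakdown are hypothetical placeholders, not figures from the study.

```python
# Sketch of the statistical comparison described above, assuming per-case
# accuracy scores (0-1) for each responder. All data shown are hypothetical.
from scipy.stats import kruskal, mannwhitneyu, chi2_contingency

# Hypothetical per-case accuracy for the expert and two of the LLMs
# (the study used 19 cases; shortened here for illustration).
expert  = [1.0, 1.0, 1.0, 1.0, 1.0]
model_a = [0.83, 0.67, 1.0, 0.5, 0.83]
model_b = [0.67, 0.5, 0.83, 0.33, 0.67]

# Omnibus comparison of accuracy across all responders (Kruskal-Wallis).
h_stat, p_omnibus = kruskal(expert, model_a, model_b)

# Pairwise follow-up between the expert and one model (Mann-Whitney U).
u_stat, p_pair = mannwhitneyu(expert, model_a, alternative="two-sided")

# Chi-square test of correct/incorrect counts across question types for
# one model (rows: hypothetical question types; columns: correct, incorrect).
counts = [[15, 4],   # e.g., diagnosis
          [10, 9],   # anomaly classification
          [13, 6]]   # treatment planning
chi2, p_chi, dof, _ = chi2_contingency(counts)

print(f"Kruskal-Wallis P={p_omnibus:.3f}, "
      f"Mann-Whitney P={p_pair:.3f}, chi-square P={p_chi:.3f}")
```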
Results
The expert achieved 100% accuracy, while all LLMs scored significantly lower (P < 0.05). Copilot demonstrated the lowest scores across all questions. The most substantial performance drop was observed in anomaly classification tasks, particularly in identifying and categorizing dens invaginatus. No significant performance differences were found among the top-performing LLMs.
Conclusions
While LLMs showed competence in basic diagnostic tasks, they failed to replicate expert-level decision-making in complex endodontic scenarios. Their current capabilities remain insufficient for unsupervised clinical use. This study is among the first to assess LLMs using real radiographic data in endodontics and highlights the need for further multimodal model development.