BMC ORAL HEALTH, vol. 25, pp. 1-6, 2025 (SCI-Expanded)
Introduction
This study hypothesized that large language models (LLMs) would underperform compared to expert clinicians in diagnosing and managing complex endodontic anomalies, such as dens invaginatus, when provided with periapical radiographs. Although LLMs have shown promise in dental education and basic diagnostics, their effectiveness in nuanced clinical reasoning has remained unclear.
Methods
Nineteen anonymized periapical radiographs depicting challenging endodontic conditions were paired with clinical vignettes. Six advanced LLMs and one expert endodontist independently answered six structured clinical questions per case. Each response was scored against a reference key. Accuracy rates were compared using Kruskal-Wallis and Mann-Whitney U tests. Chi-square tests were used to evaluate model performance across question types.
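A minimal sketch of how this analysis could be run in Python with SciPy is shown below; the data values, variable names, and question-type breakdown are hypothetical placeholders, not figures from the study.

```python
# Sketch of the statistical comparison described above, assuming per-case
# accuracy scores (0-1) for each responder. All data shown are hypothetical.
from scipy.stats import kruskal, mannwhitneyu, chi2_contingency

# Hypothetical per-case accuracy for the expert and two of the LLMs
# (the study used 19 cases; shortened here for illustration).
expert  = [1.0, 1.0, 1.0, 1.0, 1.0]
model_a = [0.83, 0.67, 1.0, 0.5, 0.83]
model_b = [0.67, 0.5, 0.83, 0.33, 0.67]

# Omnibus comparison of accuracy across all responders (Kruskal-Wallis).
h_stat, p_omnibus = kruskal(expert, model_a, model_b)

# Pairwise follow-up between the expert and one model (Mann-Whitney U).
u_stat, p_pair = mannwhitneyu(expert, model_a, alternative="two-sided")

# Chi-square test of correct/incorrect counts across question types for
# one model (rows: hypothetical question types; columns: correct, incorrect).
counts = [[15, 4],   # e.g., diagnosis
          [10, 9],   # anomaly classification
          [13, 6]]   # treatment planning
chi2, p_chi, dof, _ = chi2_contingency(counts)

print(f"Kruskal-Wallis P={p_omnibus:.3f}, "
      f"Mann-Whitney P={p_pair:.3f}, chi-square P={p_chi:.3f}")
```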
Results
The expert achieved 100% accuracy, while all LLMs scored significantly lower (P < 0.05). Copilot demonstrated the lowest scores across all questions. The most substantial performance drop was observed in anomaly classification tasks, particularly in identifying and categorizing dens invaginatus. No significant performance differences were found among the top-performing LLMs.
Conclusions
While LLMs showed competence in basic diagnostic tasks, they failed to replicate expert-level decision-making in complex endodontic scenarios. Their current capabilities remain insufficient for unsupervised clinical use. This study is among the first to assess LLMs using real radiographic data in endodontics and highlights the need for further multimodal model development.