Diagnosis of shoulder dislocation on AP radiographs: A comparative analysis of diagnostic performance between orthopedic surgeons, emergency physicians, and ChatGPT models


Kirilmaz A., Erdem T. E., Yaka H., Yildirim A., Ozer M.

Injury, vol. 57, no. 2, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 57 Issue: 2
  • Publication Date: 2026
  • DOI: 10.1016/j.injury.2025.112957
  • Journal Name: Injury
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Abstracts in Social Gerontology, CINAHL, EMBASE
  • Keywords: Artificial intelligence, ChatGPT, Clinician experience, Orthopedic imaging, Prompt design, Shoulder AP radiograph, Shoulder dislocation
  • Akdeniz University Affiliated: Yes

Abstract

Objective: This study aimed to evaluate the diagnostic performance of ChatGPT in identifying acute shoulder dislocations and to compare its accuracy with that of orthopedic specialists and emergency medicine residents.

Methods: A total of 250 anteroposterior (AP) shoulder radiographs were included. All images were evaluated for the presence or absence of dislocation and for dislocation subtype (anterior, posterior, inferior) by four groups: orthopedic specialists (n = 10), orthopedic residents (n = 10), emergency medicine residents (n = 10), and ChatGPT. ChatGPT-4o (OpenAI, May 2024) and ChatGPT-5.1 (OpenAI, July 2025) were accessed through the web interface using a standardized single-image, text-based prompt. The models had no prior training on radiological images. Diagnostic performance was assessed using sensitivity, specificity, positive and negative predictive values, overall accuracy, area under the ROC curve (AUC), F1 score, and Cohen's kappa for inter-reader agreement.

Results: In detecting shoulder dislocation (yes/no), orthopedic specialists demonstrated the highest accuracy (95.0%), whereas ChatGPT-4o showed the lowest (72.4%). Orthopedic residents achieved 90.1% accuracy, emergency medicine residents 89.0%, and ChatGPT-5.1 78.0%. When subtype classification (anterior, posterior, inferior) was included, orthopedic specialists again performed best (89.7%), while ChatGPT-4o had the lowest accuracy (68.0%). Orthopedic residents (84.7%) outperformed emergency medicine residents (76.7%), while ChatGPT-5.1 achieved 69.6% accuracy. Internal-rotation AP images of nondislocated shoulders were frequently misinterpreted as posterior dislocations.

Conclusion: This study demonstrates that diagnostic accuracy for acute shoulder dislocation varies with the clinician's level of experience. A single AP shoulder radiograph alone is not sufficient for diagnosing shoulder dislocation. Clinicians most frequently misinterpreted internally rotated AP radiographs as posterior dislocations. ChatGPT models showed moderate performance and are not yet suitable as standalone diagnostic tools in clinical decision-making. However, with further development of artificial intelligence-based systems, these models may serve as rapid preliminary screening aids in emergency settings.
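For readers unfamiliar with the metrics reported above, the binary (dislocation yes/no) measures all derive from a 2x2 confusion matrix per reader. The sketch below shows how sensitivity, specificity, predictive values, accuracy, F1, and Cohen's kappa (here, agreement between one reader and the reference standard) are computed; the counts used are invented for illustration only and are not the study's data.

```python
# Diagnostic metrics from a 2x2 confusion matrix, as used in the abstract.
# The example counts are hypothetical, NOT taken from the study.

def binary_metrics(tp, fp, fn, tn):
    """Return the binary diagnostic metrics for one reader vs. ground truth."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / total
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_obs = accuracy
    p_chance = ((tp + fp) / total) * ((tp + fn) / total) \
             + ((fn + tn) / total) * ((fp + tn) / total)
    kappa = (p_obs - p_chance) / (1 - p_chance)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "accuracy": accuracy,
            "f1": f1, "kappa": kappa}

# Hypothetical reader over 250 radiographs:
# 120 true positives, 10 false positives, 15 false negatives, 105 true negatives.
m = binary_metrics(tp=120, fp=10, fn=15, tn=105)
print({k: round(v, 3) for k, v in m.items()})  # accuracy is 0.9 for these counts
```

AUC is omitted here because it requires a ranking or confidence score rather than a single yes/no call; for hard binary ratings it reduces to the average of sensitivity and specificity.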