Diagnostic accuracy of artificial intelligence versus 263 pediatric clinicians for childhood exanthems

Gençeli, Mustafa; Metin Akcan, Özge; Soran, Gonca; Çokbiçer, Abdulkerim; Saraç, Uğur; Üstüntaş, Talha; Yücel, Mehtap; Doğan, Methiye; Yılık Kömür, Ezgi; Gençeli, Sipil; Yılmaz Dağlı, Hatice; Sarı, Memduha; Kılıç, Ahmet; Şahin, Süleyman; Akkuş, Abdullah

doi:10.1007/s00431-026-07044-9

European Journal of Pediatrics, vol. 185, no. 6, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 185 Issue: 6
Publication Date: 2026
DOI: 10.1007/s00431-026-07044-9
Journal Name: European Journal of Pediatrics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, CINAHL, EMBASE
Keywords: Artificial Intelligence, Diagnosis, Exanthematous diseases
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Akdeniz University: Yes

Abstract

Pediatric exanthematous diseases pose diagnostic challenges because their clinical presentations overlap. This study aimed to determine whether current artificial intelligence (AI) models achieve diagnostic accuracy within or above the performance distribution of pediatric residents and specialists for common rash-associated diseases. Participants and AI models were evaluated against definitive diagnoses confirmed by clinical features, laboratory findings, and the consensus of two pediatric infectious disease specialists. A volunteer sample of 263 pediatric clinicians took part: 107 residents (years 1 through 4) and 156 specialists. Each clinician completed a blinded multiple-choice questionnaire in which every case included a clinical photograph and accompanying clinical data. The same cases were presented to three AI models: ChatGPT, Gemini, and Copilot. Specialists scored higher than residents (median, 46 [IQR, 42–50] vs 41 [IQR, 36–46]; P < .001; r = 0.32). ChatGPT correctly diagnosed 53 of 61 cases (86.9%), Gemini 50 (82.0%), and Copilot 44 (72.1%). Both ChatGPT and Gemini exceeded the upper bound of the 95% CI for the specialist median (47.17), and all three AI models scored above the corresponding upper bound for residents (42.76). Disease-level accuracy ranged from 0% (insect bites, all models) to 100% (9 conditions, all models). Fourth-year residents scored higher than first- and second-year residents (P = .001; ε² = 0.13). Conclusions: AI models given clinical data alongside images matched or exceeded specialist-level performance for pediatric exanthems. Accuracy varied by disease; failures clustered in conditions that require contextual reasoning. Physician oversight remains necessary where AI accuracy is lowest. (Table presented.)
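
The percentages and threshold comparisons reported in the abstract can be rechecked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' analysis code. It uses only the figures quoted above and assumes that the clinician scores and the 95% CI upper bounds are expressed as the number of correct answers on the same 61 cases, which the abstract implies but does not state outright. All function and variable names are hypothetical.

```python
# Counts of correctly diagnosed cases as reported in the abstract.
TOTAL_CASES = 61
CORRECT = {"ChatGPT": 53, "Gemini": 50, "Copilot": 44}

# Upper bounds of the 95% CI for the group medians, as reported.
# Assumption: these are on the same 0-61 "cases correct" scale.
SPECIALIST_CI_UPPER = 47.17
RESIDENT_CI_UPPER = 42.76


def accuracy_percent(n_correct: int, n_total: int = TOTAL_CASES) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100 * n_correct / n_total, 1)


def compare_to_bounds(n_correct: int) -> dict:
    """Flag whether a score exceeds each reported CI upper bound."""
    return {
        "above_resident_bound": n_correct > RESIDENT_CI_UPPER,
        "above_specialist_bound": n_correct > SPECIALIST_CI_UPPER,
    }


# Build one summary row per model.
SUMMARY = {
    model: {"accuracy": accuracy_percent(n), **compare_to_bounds(n)}
    for model, n in CORRECT.items()
}

if __name__ == "__main__":
    for model, row in SUMMARY.items():
        print(model, row)
```

Under that assumption, the comparison reproduces the abstract's statements: all three models fall above the resident bound, while only ChatGPT and Gemini fall above the specialist bound.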