JOURNAL OF ASSISTED REPRODUCTION AND GENETICS, vol. 43, no. 1, pp. 1-9, 2026 (SCI-Expanded, Scopus)
Artificial intelligence (AI) has emerged as a promising tool for clinical decision support in reproductive medicine, yet the performance of general-purpose large language models (LLMs) in predicting in vitro fertilization (IVF) outcomes remains insufficiently characterized. This exploratory proof-of-concept study aimed to evaluate and compare the out-of-the-box performance of three widely accessible LLM-based systems (ChatGPT, DeepSeek, and Gemini) in forecasting key clinical and laboratory outcomes of IVF treatments.
This retrospective single-center study used data from 1473 autologous IVF/ICSI cycles, each representing a unique patient. For each cycle, relevant clinical and laboratory variables were incorporated into a standardized anonymized patient-level vignette and submitted via the publicly available web interfaces of three LLMs (ChatGPT, DeepSeek, Gemini) without any fine-tuning or internal customization. The models were asked to predict stimulation protocol, ovulation trigger type, total and mature oocyte counts, usable embryo counts, and clinical pregnancy. Predictive performance was evaluated using accuracy and tolerance-based accuracy for categorical and count-based outcomes, mean absolute error for numerical predictions, and the area under the receiver operating characteristic (ROC) curve for clinical pregnancy.
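The evaluation metrics described above can be sketched in plain Python. This is an illustrative reconstruction, not the authors' actual analysis code: the tolerance threshold, example predictions, and function names are hypothetical, and the AUC is computed via the standard rank-sum (Mann-Whitney) identity rather than an explicit ROC curve.

```python
# Hedged sketch of the three metric families named in the methods:
# tolerance-based accuracy for count outcomes, mean absolute error
# for numerical predictions, and AUC for clinical pregnancy.
# All inputs below are hypothetical, for illustration only.

def tolerance_accuracy(preds, actual, tol=1):
    """Fraction of count predictions within +/- tol of the true value.
    The tolerance value is an assumption; the paper does not state it here."""
    return sum(abs(p - a) <= tol for p, a in zip(preds, actual)) / len(preds)

def mean_absolute_error(preds, actual):
    """Average absolute deviation of numerical predictions."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(preds)

def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney identity:
    probability that a randomly chosen positive case is scored
    above a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: predicted vs. retrieved oocyte counts,
# and model-assigned pregnancy probabilities vs. observed outcomes.
oocyte_pred, oocyte_true = [10, 8, 12], [9, 8, 15]
print(tolerance_accuracy(oocyte_pred, oocyte_true))   # 2 of 3 within +/-1
print(mean_absolute_error(oocyte_pred, oocyte_true))  # (1+0+3)/3
print(auc([0.9, 0.4, 0.7, 0.2], [1, 0, 1, 0]))        # perfect separation
```

For a production analysis one would typically use `sklearn.metrics.roc_auc_score` and `mean_absolute_error` rather than hand-rolled versions, but the pure-Python forms make the definitions explicit.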
Gemini achieved the highest accuracy in predicting stimulation protocols (51.26%) and embryo counts (68.22%), while DeepSeek demonstrated the lowest numerical error for oocyte count predictions. Clinical pregnancy prediction was the most challenging task; all models showed only moderate discrimination, with Gemini achieving the highest AUC (0.711), followed by ChatGPT (0.690) and DeepSeek (0.676). Overall, model performance varied considerably across tasks and remained below thresholds that would be considered sufficient for reliable stand-alone clinical use.
In this exploratory proof-of-concept setting, general-purpose AI systems showed variable and overall suboptimal performance in predicting IVF outcomes from standardized clinical vignettes. Although certain models demonstrated relative strengths in specific tasks, none reached the reliability, consistency, or interpretability required for safe clinical implementation. These findings indicate that, in their current form, such models should not be used as clinical decision-support tools for IVF. Their use should remain restricted to carefully controlled research settings until they have been prospectively validated in multicenter cohorts and systematically compared with rigorously developed, task-specific prediction models. This study provides comparative insight into how these AI systems behave in IVF-related prediction tasks and underscores the need for cautious interpretation of AI-generated outputs.