Abstract
Objective: Hypoxic-ischemic encephalopathy is a major cause of neonatal morbidity and mortality, and diffusion-weighted imaging (DWI) plays a key role in its early diagnosis. With the growing interest in large language models (LLMs) such as ChatGPT, it is important to evaluate their potential in radiological interpretation. The aim of this study was to assess the diagnostic performance of ChatGPT-4o in identifying diffusion restriction on neonatal brain DWI and to determine whether clinical information (the Thompson score) influences its diagnostic responses.
Material and Methods: This retrospective study included 36 neonates (18 with and 18 without diffusion restriction) who underwent brain DWI between postnatal days 4 and 7. For each case, representative DWI and apparent diffusion coefficient (ADC) images were uploaded to ChatGPT-4o in five separate sessions. The same process was repeated after adding the Thompson score. Performance was evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), odds ratio (OR), Fleiss' kappa, and the McNemar test.
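As a reference for how these statistics are derived, the sketch below shows the standard calculations from a 2×2 table of model responses versus the reference standard; the function name and placeholder counts are illustrative assumptions, not the study's analysis code.

```python
# Illustrative sketch (not the authors' analysis code): diagnostic metrics
# derived from a 2x2 table of model responses vs. the reference standard.
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """tp/fp/fn/tn are counts of true/false positive/negative responses."""
    return {
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "ppv": tp / (tp + fp),                # positive predictive value
        "npv": tn / (tn + fn),                # negative predictive value
        "odds_ratio": (tp * tn) / (fp * fn),  # diagnostic odds ratio
    }

# Usage with placeholder counts:
print(diagnostic_metrics(tp=50, fp=10, fn=40, tn=80))
```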
Results: Without clinical information, sensitivity was 56.7%, specificity 90%, PPV 85%, and NPV 67.5% (OR = 11.77). With the Thompson score, sensitivity increased to 72.2%, specificity to 91.1%, PPV to 89%, and NPV to 76.6% (OR = 26.65). Intra-observer agreement was very high both without and with the Thompson score (κ = 0.825 vs. κ = 0.920). The McNemar test showed a statistically significant difference after the clinical data were included (p = 0.045).
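As a consistency check on the reported odds ratios (an illustrative back-calculation, not the authors' analysis script): with 18 positive and 18 negative cases each read in five sessions, there are 90 readings per group, and the reported percentages correspond to roughly 51/90 true positives and 81/90 true negatives without the Thompson score, and 65/90 and 82/90 with it.

```python
# Illustrative back-calculation (not the authors' analysis script):
# 90 readings per group = 18 cases x 5 sessions; counts reconstructed from
# the reported percentages reproduce the reported odds ratios.
tp, fn, tn, fp = 51, 39, 81, 9            # without clinical information
print(round((tp * tn) / (fp * fn), 2))    # 11.77
tp, fn, tn, fp = 65, 25, 82, 8            # with the Thompson score
print(round((tp * tn) / (fp * fn), 2))    # 26.65
```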
Conclusion: ChatGPT-4o showed high specificity but only moderate sensitivity in detecting diffusion restriction on neonatal DWI. Clinical information significantly influenced its diagnostic responses, highlighting both the potential and the limitations of LLMs in radiology.
Keywords: Artificial intelligence, diffusion-weighted imaging, hypoxic-ischemic encephalopathy, large language models, neonatal MRI
Copyright and license
Copyright © 2025 The Author(s). This is an open access article distributed under the Creative Commons Attribution License (CC BY), which permits unrestricted use, distribution, and reproduction in any medium or format, provided the original work is properly cited.