Comprehensive Summary
Bulut et al. evaluated three multimodal large language models (LLMs), ChatGPT-4o, Claude 3.5, and Gemini 2.0, on diagnostic accuracy for primary spontaneous pneumothorax (PSP) in children and adults, and on how age and pneumothorax size affect that accuracy. Conducted between March and April of 2025, the retrospective study included chest X-rays (CXRs) from 172 patients at the Emergency Department of Etlik City Hospital who had presented with dyspnea or chest pain and had CT-confirmed PSP. Experts categorized the CXRs into “small” and “large” PSP: in patients 12 and older, a pleural line-to-chest wall distance of more than 3 cm defined a large PSP; in younger patients, a PSP affecting greater than 15% of the patient’s total lung volume did. The AI models were trained on textbooks concerning emergency medicine, thoracic surgery, and pediatric surgery and then generated 3 separate responses to each image. Accuracy was assessed at three levels: overall accuracy (all 3 responses correct), strict accuracy (at least 2 responses correct), and ideal accuracy (at least 1 response correct). In patients older than 12, overall accuracy in identifying PSP was 69.6%, 57.4%, and 64.9% for ChatGPT-4o, Gemini 2.0, and Claude 3.5, respectively; ChatGPT-4o also consistently outperformed the other models in strict and ideal accuracy for this age group (p < 0.001). Regarding PSP size, ChatGPT-4o’s overall accuracy was 81.6% for large and 42.2% for small PSP (p < 0.001), while the other models generally showed no statistically significant difference in accuracy by PSP size. In patients younger than 12, overall accuracy was 20.8%, 20.8%, and 12.5% for ChatGPT-4o, Gemini 2.0, and Claude 3.5, respectively, and no model showed a statistically significant accuracy difference between small and large PSP.
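The age-dependent size cutoff and the three accuracy tiers described above can be sketched as simple decision logic. This is an illustrative sketch only; the function names and the assumption of one boolean correctness judgment per response are mine, not the study’s:

```python
def classify_psp_size(age_years, pleural_distance_cm=None, lung_volume_pct=None):
    """Classify a PSP as 'large' or 'small' using the study's age-dependent cutoffs."""
    if age_years >= 12:
        # Patients 12 and older: pleural line-to-chest wall distance > 3 cm => large
        return "large" if pleural_distance_cm > 3.0 else "small"
    # Younger patients: PSP affecting > 15% of total lung volume => large
    return "large" if lung_volume_pct > 15.0 else "small"


def accuracy_tiers(responses):
    """Given 3 correctness judgments for one image, report which tiers are met.

    overall: all 3 correct; strict: at least 2 correct; ideal: at least 1 correct.
    """
    n_correct = sum(responses)
    return {
        "overall": n_correct == 3,
        "strict": n_correct >= 2,
        "ideal": n_correct >= 1,
    }


# Hypothetical example: two of three model responses were correct
print(accuracy_tiers([True, False, True]))
# Hypothetical example: a 14-year-old with a 4 cm pleural line-to-chest wall distance
print(classify_psp_size(14, pleural_distance_cm=4.0))
```

Note that an image counted toward strict accuracy also counts toward ideal accuracy, so the three tiers are nested rather than mutually exclusive.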
In terms of consistency, Gemini 2.0 stood out with a Fleiss’ kappa coefficient of 1.00, indicating perfect agreement for large PSP in the over-12 group, while ChatGPT-4o and Claude 3.5 showed only moderate consistency (p < 0.001). ChatGPT-4o’s consistency was very low (0.04) for small PSP in patients over 12, and consistency was generally low across patients under 12. Overall, both accuracy and consistency were lower in the pediatric cases (patients younger than 12 years) and higher in the total large pneumothorax group, and ChatGPT-4o was the most reliable model across all groups in terms of accuracy. Limitations included the potential lack of pneumothorax data in the LLMs’ training, the resolution of the images used, the absence of images from patients without PSP, and potential PSP false positives within the images.
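Fleiss’ kappa, the agreement statistic cited above, can be computed directly from a count table of the three repeated responses per image. The following is a minimal pure-Python sketch with made-up example data, not the study’s data:

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table where table[i][j] counts the ratings
    (here: the 3 repeated model responses) placing image i in category j."""
    n_subjects = len(table)
    n_raters = sum(table[0])  # same number of ratings per subject (3 in the study)

    # Per-subject observed agreement P_i
    p_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_subject) / n_subjects

    # Chance agreement P_e from marginal category proportions
    totals = [sum(row[j] for row in table) for j in range(len(table[0]))]
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals)

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 images, categories (correct, incorrect),
# 3 responses per image; the last image received split responses.
ratings = [[3, 0], [0, 3], [3, 0], [1, 2]]
print(round(fleiss_kappa(ratings), 3))  # ≈ 0.657
```

A kappa of 1.00 (as reported for Gemini 2.0 on large PSP in the over-12 group) means every image received three identical responses; values near 0 (such as ChatGPT-4o’s 0.04 on small PSP) indicate agreement scarcely better than chance.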
Outcomes and Implications
PSP requires immediate medical attention and, consequently, must be diagnosed very quickly. However, thoracic CT, the gold standard for diagnosing pneumothorax, imposes radiation and cost burdens on patients. AI can help provide these fast diagnoses from CXRs and has the potential to increase diagnostic accuracy. Concerns remain about the legality and ethics of using patient data for AI analysis, which raise the possibilities of malpractice and of having to take responsibility for machine errors. Even so, with further validation, training, and testing, the models could be highly useful for less experienced clinicians and have the potential to assist the clinical decision-making of radiologists and clinicians in the future.