ChatGPT-4o is Not a Reliable Study Source for Orthopedic Surgery Residents

Orthopedics

Journal of Bone and Joint Surgery

Research Authors: Jain, Neil MD; Gottlich, Caleb MD; Fisher, John MD; Winston, Travis MD; Matullo, Kristofer MD, FAOA; Greenhill, Dustin MD, FAOA, FAAOS

AIIM Authors: Nikhil Angani, Nicholas Leonard

Approved by President Reda Riffi

Publication Date: Sep 1, 2025

Comprehensive Summary

This study examines the ability of ChatGPT-4o to answer Orthopedic In-Training Examination (OITE) questions with appropriate explanations for orthopedic surgery trainees. OITE questions from 2020 to 2022 were collected and sorted into 11 orthopedic knowledge domains. The questions, including images and supporting information, were then entered into ChatGPT-4o. A resident physician and a board-certified orthopedic surgeon classified each answer explanation as consistent, disparate, or nonexistent (C, D, N). Each response was then assigned to one of 6 categories that pair answer correctness, correct or incorrect (C/I), with the explanation category (C, D, N). Finally, these categories were sorted into 3 groups: CC was “ideal”, CN was “inadequate”, and all other categories were “unacceptable”. ChatGPT-4o answered 64.7% of questions correctly, but overall response quality was 58.7% ideal, 6.9% inadequate, and 34.3% unacceptable. ChatGPT-4o performed as well as or better than PGY-3 residents every year in the pediatrics or spine domains, and was comparable to PGY-1 and PGY-2 residents in all other domains. Previous studies evaluated ChatGPT purely on whether its answers were correct or incorrect, without examining the quality of the explanations that junior residents use as study material. This study examines multiple facets of ChatGPT’s responses to OITE questions to give a more comprehensive view of its abilities, finding that sole reliance on it as study material will often be unacceptable.
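The grouping scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study authors' code: the function names `grade` and `summarize`, the boolean correctness flag, and the sample responses are all assumptions made for the example.

```python
from collections import Counter

# Explanation categories named in the summary:
# C = consistent, D = disparate, N = nonexistent.
EXPLANATION_CATEGORIES = {"C", "D", "N"}


def grade(answer_correct: bool, explanation: str) -> str:
    """Map one response to a quality group (hypothetical helper).

    The answer label (C = correct, I = incorrect) is paired with the
    explanation category, giving one of six codes: CC, CD, CN, IC, ID, IN.
    """
    if explanation not in EXPLANATION_CATEGORIES:
        raise ValueError(f"unknown explanation category: {explanation!r}")
    code = ("C" if answer_correct else "I") + explanation
    if code == "CC":
        return "ideal"
    if code == "CN":
        return "inadequate"
    return "unacceptable"  # CD, IC, ID, IN


def summarize(responses):
    """Return the percentage of responses in each quality group."""
    counts = Counter(grade(correct, expl) for correct, expl in responses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses to summarize")
    groups = ("ideal", "inadequate", "unacceptable")
    return {g: round(100 * counts[g] / total, 1) for g in groups}


# Made-up sample data, not taken from the study.
sample = [(True, "C"), (True, "N"), (False, "D"), (True, "C")]
print(summarize(sample))
```

Note that under this scheme a correct answer with a disparate explanation (CD) counts as unacceptable, so a response is judged on its explanation as well as its answer.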

Outcomes and Implications

This research assesses the ability of ChatGPT-4 and 4o to answer OITE questions and explain the reasoning behind their answers, finding that the quality of both has improved significantly. This has implications across medicine, since better reasoning across all question types could support residents in every specialty. The study finds that current ChatGPT models are not adequate study resources, but it points to the potential of newer models that may have more data to draw on. As AI continues to evolve, the quality of ChatGPT as a study source is expected to improve significantly.

Our mission is to

Connect medicine with AI innovation.

No spam. Only the latest AI breakthroughs, simplified and relevant to your field.
