Comprehensive Summary
The aim of the study was to evaluate whether ChatGPT can aid diagnosis and decision-making during toxicological emergencies. The study presented 100 randomized multiple-choice questions from "The Study Guide for Goldfrank’s Toxicologic Emergencies" to ChatGPT-4 and evaluated performance by the proportion of correct and incorrect responses. This was compared with the collective proportion of correct human responses to the same question set on AccessEmergencyMedicine.com; because this is a paid emergency medicine resource, the human responses were assumed to come from emergency medicine professionals, although trainees also access the study guide. The researchers grouped the questions as either “case-based” or “theoretical knowledge,” and further divided the case-based questions into pediatric and adult cases. Overall, ChatGPT answered correctly 89% of the time, while the mean human accuracy was 56 ± 19%, indicating that ChatGPT outperformed human respondents overall. Despite this, ChatGPT’s errors were concentrated in case-based questions (6 of its 11 incorrect responses), and it answered 27% of pediatric case-based questions incorrectly. Because the identities of the respondents were unknown, the knowledge and skill set of the human respondents could not be characterized, making the study’s external validity difficult to judge. Altogether, the study shows that ChatGPT is less reliable on case-based questions, but its overall accuracy relative to humans demonstrates potential as an aid for emergency room triage and for assisting physicians in toxicology cases.
Outcomes and Implications
Although the question set was drawn from a multiple-choice study guide, meaning that even the case-based questions could not capture the ambiguity of real-life presentations, ChatGPT-4’s higher overall accuracy is promising for its ability to aid diagnosis and decision-making during toxicological emergencies. Its relatively lower accuracy on pediatric and complex case-based questions underscores the high level of attention these cases require. If ChatGPT were to be implemented in such scenarios, further training on relevant data sets would be imperative to refine its performance. Overall, the study suggests that ChatGPT may assist toxicological diagnosis and decision-making in the emergency room as an aid to physicians, but it will require additional training before being implemented in a more significant capacity.