Yoseph et al. Art Int Surg 2024;4:267-77
Artificial Intelligence Surgery
DOI: 10.20517/ais.2024.38
Original Article | Open Access
Patient perspectives on AI: a pilot study comparing large language model and physician-generated responses to routine cervical spine surgery questions
Ezra T. Yoseph¹, Aneysis D. Gonzalez-Suarez¹, Siegmund Lang¹,², Atman Desai¹, Serena S. Hu³, Corinna C. Zygourakis¹

¹Department of Neurosurgery, Stanford University School of Medicine, Stanford, CA 94304, USA.
²Department of Trauma Surgery, University Hospital Regensburg, Regensburg 93053, Germany.
³Department of Orthopedic Surgery, Stanford University School of Medicine, Stanford, CA 94063, USA.
Correspondence to: Dr. Ezra T. Yoseph, Department of Neurosurgery, Stanford University School of Medicine, 300 Pasteur Dr,
Palo Alto, Stanford, CA 94304, USA. E-mail: ezyoseph@stanford.edu
How to cite this article: Yoseph ET, Gonzalez-Suarez AD, Lang S, Desai A, Hu SS, Zygourakis CC. Patient perspectives on AI: a
pilot study comparing large language model and physician-generated responses to routine cervical spine surgery questions. Art
Int Surg 2024;4:267-77. https://dx.doi.org/10.20517/ais.2024.38
Received: 4 Jun 2024 | First Decision: 2 Sep 2024 | Revised: 11 Sep 2024 | Accepted: 25 Sep 2024 | Published: 29 Sep 2024
Academic Editor: Andrew A. Gumbs | Copy Editor: Pei-Yun Wang | Production Editor: Pei-Yun Wang
Abstract
Aim: The purpose of this study was to elucidate differences in patient perspectives on large language model (LLM)
vs. physician-generated responses to frequently asked questions about anterior cervical discectomy and fusion
(ACDF) surgery.
Methods: This cross-sectional study had three phases. In phase 1, we generated 10 common questions about ACDF surgery using ChatGPT-3.5, ChatGPT-4.0, and Google search. In phase 2, we obtained answers to these questions from two spine surgeons, ChatGPT-3.5, and Gemini. In phase 3, we recruited 5 cervical spine surgery patients and 5 age-matched controls to assess the clarity and completeness of the responses.
Results: LLM-generated responses were significantly shorter, on average, than physician-generated responses (30.0 ± 23.5 vs. 153.7 ± 86.7 words, P < 0.001). Participants rated LLM-generated responses significantly higher for clarity (H = 6.25, P = 0.012), with no significant difference in completeness ratings (H = 0.695, P = 0.404). On an individual question basis, there were no significant differences in ratings