Giakas et al. Art Int Surg 2024;4:233-46
DOI: 10.20517/ais.2024.24
Artificial Intelligence Surgery
Original Article | Open Access
Assessing the accuracy and utility of ChatGPT
responses to patient questions regarding posterior
lumbar decompression
Alec M. Giakas, Rajkishen Narayanan, Teeto Ezeonu, Jonathan Dalton, Yunsoo Lee, Tyler Henry, John
Mangan, Gregory Schroeder, Alexander Vaccaro, Christopher Kepler
Department of Orthopaedic Surgery, Rothman Institute at Thomas Jefferson University, Philadelphia, PA 19107, USA.
Correspondence to: Dr. Rajkishen Narayanan, Department of Orthopaedic Surgery, Rothman Institute at Thomas Jefferson
University, 925 Chestnut Street Floor 5, Philadelphia, PA 19107, USA. E-mail: rajkishen.narayanan@gmail.com
How to cite this article: Giakas AM, Narayanan R, Ezeonu T, Dalton J, Lee Y, Henry T, Mangan J, Schroeder G, Vaccaro A, Kepler
C. Assessing the accuracy and utility of ChatGPT responses to patient questions regarding posterior lumbar decompression. Art
Int Surg 2024;4:233-46. https://dx.doi.org/10.20517/ais.2024.24
Received: 25 Apr 2024 First Decision: 22 Jul 2024 Revised: 15 Aug 2024 Accepted: 22 Aug 2024 Published: 4 Sep 2024
Academic Editor: Andrew A. Gumbs Copy Editor: Pei-Yun Wang Production Editor: Pei-Yun Wang
Abstract
Aim: To examine the clinical accuracy and applicability of ChatGPT answers to commonly asked questions from
patients considering posterior lumbar decompression (PLD).
Methods: A literature review was conducted to identify 10 questions encompassing some of the most common
questions and concerns patients may have regarding lumbar decompression surgery. The selected questions were
posed to ChatGPT, and the initial responses were recorded; no follow-up or clarifying questions were permitted.
Two attending fellowship-trained spine surgeons then graded each chatbot response using a modified Global
Quality Scale to evaluate ChatGPT’s accuracy and utility, and analyzed each response, providing evidence-based
justifications for their scores.
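For readers who wish to reproduce the single-turn prompting protocol programmatically, a minimal sketch follows. It assumes OpenAI's Python client (openai >= 1.0); the model name and the questions.txt file are illustrative assumptions, and the study may well have queried the ChatGPT interface directly rather than the API.

```python
# Minimal sketch of a single-turn prompting protocol, assuming the OpenAI
# Python client (openai >= 1.0). The model name and questions file are
# illustrative assumptions, not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical file: one patient question per line.
with open("questions.txt") as f:
    questions = [line.strip() for line in f if line.strip()]

responses = {}
for question in questions:
    # Each question is sent in a fresh, single-turn conversation, mirroring
    # the rule that no follow-up or clarifying questions were permitted.
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; the abstract does not name a model version
        messages=[{"role": "user", "content": question}],
    )
    responses[question] = completion.choices[0].message.content

for q, r in responses.items():
    print(f"Q: {q}\nA: {r}\n")
```

Keeping each question in its own conversation prevents earlier answers from conditioning later ones, which matches the no-follow-up design described in the Methods.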
Results: With two raters each grading ten questions on a 1-to-5 scale, the minimum possible total score is 20 and
the maximum is 100. ChatGPT’s responses earned a total score of 59 when evaluated by the two attending spine
surgeons, corresponding to an average grade just under 3. A grade of 3 denoted a somewhat useful response of
moderate quality, with some important information adequately discussed and some poorly discussed.
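The scoring arithmetic implied above can be made explicit; the following sketch uses only the totals reported in this abstract (the individual per-question grades are not given here):

```python
# Score bounds and mean grade for a modified Global Quality Scale rated
# 1-5 per question by 2 raters across 10 questions.
raters, questions = 2, 10
min_total = raters * questions * 1   # 20, if every response were graded 1
max_total = raters * questions * 5   # 100, if every response were graded 5
observed_total = 59                  # total reported in the Results
mean_grade = observed_total / (raters * questions)
print(min_total, max_total, mean_grade)  # 20 100 2.95 (just under 3)
```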
Conclusion: ChatGPT can provide broadly useful responses to common preoperative questions that patients may
have when considering PLD. ChatGPT has excellent utility in providing background