ramifications of such programs and tools, it is crucial to evaluate ChatGPT’s utility and accuracy in
disseminating orthopedic information. Chatbot responses may impact patients’ perceptions of treatment
options and risks prior to an evaluation by a physician. Several studies have analyzed the utility of ChatGPT
for patients considering orthopedic surgery [11-17]. Assessing ChatGPT’s usefulness for preoperative patient
education in spine surgery is especially critical due to the relatively high risk of spine surgery and the
nuances that often guide decision making regarding the indications for different operations. To our
knowledge, the present study is the first to use a modified validated scoring system to appraise and evaluate
ChatGPT’s responses to common patient questions when considering PLD surgery.
Minimum scores across all ten questions would yield a total score of 20, whereas a maximum score would be 100. ChatGPT’s responses in this analysis earned a total score of 59, corresponding to an average rating just under 3, when
evaluated by two attending, fellowship-trained orthopedic spine surgeons. A score of 3 denoted a somewhat
useful response of moderate quality, with some important information adequately discussed but some
poorly discussed [Figure 1].
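For context, a brief arithmetic sketch of these bounds follows. It assumes each of the ten questions was rated on a 1-5 scale by each of the two reviewers (20 individual ratings in total); this assumption is consistent with the stated minimum of 20 and maximum of 100 but is not spelled out explicitly above.

% Assumed scoring scheme: 10 questions x 2 reviewers, each rating on a 1-5 scale.
\[
\text{minimum total} = 10 \times 2 \times 1 = 20, \qquad
\text{maximum total} = 10 \times 2 \times 5 = 100.
\]
% Observed total of 59 spread over the 20 individual ratings:
\[
\text{average rating} = \frac{59}{10 \times 2} = 2.95 \approx 3.
\]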
In the present study, ChatGPT was generally able to provide an accurate, albeit cursory, overview of
relevant surgical indications, techniques, complications, and alternate therapies. However, some of these
answers, when evaluated individually, lacked the clarification necessary to provide patients with a thorough
understanding to inform their medical decision making. Some of the answers have the potential to be
harmful to patients, especially those answers suggesting alternative therapy without the necessary context of
the patient’s particular history and symptom severity. In some instances, for example, PLD might be
necessary to reverse or prevent further neurologic injury, especially for urgent and emergent indications.
Suggesting alternative, non-operative treatment options to these patients could adversely impact their outcomes. Concordantly, a prior study reported that ChatGPT had a 53% mismanagement rate, which would be especially deleterious for serious underlying pathology [36]. Furthermore, descriptions of non-operative treatment options were often vague, such as physical therapy to “strengthen muscles”. This
could lead some patients to pursue inadequate or harmful treatment, which may exacerbate or accelerate
their disease processes.
Additionally, several of the claims were not fully substantiated by current spine surgery literature, and some of the listed indications (spondylolisthesis and degenerative disc disease) may be better treated with
other procedures, such as spinal fusion. As noted in previous literature, ChatGPT has been trained to
generate definitive responses to questions, even when the existing literature may not be conclusive enough
to make a specific recommendation [37,38]. In particular, the chatbot seemed to indicate the superiority of
MISS over the traditional open approach. While there is increasing research regarding the potential benefits
of minimally invasive surgery, there are still gaps in the literature, which can be most appropriately
addressed by a trained and experienced surgeon [33,34]. These discrepancies may be confusing to patients
considering PLD and could potentially lead to a delay in care. Nevertheless, ChatGPT did repeatedly
emphasize that its responses should be taken in conjunction with consultation with a spine surgeon. This
inability to address appropriate, patient-specific context affirms the findings of previous literature [36] supporting the spine surgeon’s role in providing individualized clinical recommendations.
One limitation of any study attempting to characterize the utility of online sources of medical information
to patients prior to a doctor’s visit is the inherent subjectivity with which the online source is evaluated. To
combat this weakness, the present analysis implemented a more objective, validated numeric scoring
system. Additionally, the responses were analyzed by two attending spine surgeons, both of whose scores
were presented, providing additional insight from physicians with differing levels of experience and areas of