understanding and make complex medical information more accessible to those with less familiarity.
Practically, this can be accomplished with more specific LLM prompting based on one’s prior
understanding (or lack thereof) of the medical intervention. The lack of significant differences in individual
question responses in our study is important because it indicates that our findings are not skewed by any particular question, ultimately reinforcing their reliability.
As LLMs become more advanced, responding to complex medical questions more quickly, clearly, and completely, it may become prudent for physicians to employ them as tools to improve practice
efficiency and patient education. A recent study by Jahanshahi et al. assessed AI and machine learning
techniques to process online messages between doctors and patients and to generate multiple automatic responses[14]. Their BERT-based machine learning model achieved an accuracy rate of 85.41% when suggesting the top three doctor responses. Worldwide, other studies have employed LLMs in telemedicine to reduce barriers to healthcare access and provide rapid consultations in the setting of a pandemic[15-17].
Collectively, these studies suggest that LLMs show great potential for quickly addressing medical questions
from patients. Building upon this research, our study found that both spine patients and non-spine patient
controls were satisfied with the clarity and completeness of LLM-generated responses as compared to physician-generated responses, and that LLMs outperformed physicians in some respects, including brevity and clarity.
Our study is limited by its small sample size and poor to fair inter-rater reliability. The uniformly poor to fair inter-rater reliability across all questions is likely due to differences in participants' background knowledge and potential ambiguities in our questions. Our initial intent was to capture participants' gut reactions and initial responses to the educational material, which is why we did not provide in-depth training. It is
likely (and has been shown here) that these “gut reactions” or impulse responses are less reliable than ones
that are given with systematic criteria. To improve reliability in future studies, we could provide rater
training to ensure raters are aligned in their understanding of evaluation criteria. This study is also limited
in that we used the free, more easily accessible ChatGPT-3.5, rather than paying for the newest version, ChatGPT-4.0, which, at the time of writing, is OpenAI's most advanced system, featuring its safest and most useful responses[18]. It is important to consider the differences between these models since advancements in
models’ abilities can significantly enhance their performance. Specifically, ChatGPT-4.0 boasts significant
improvements in understanding and generating human-like text, likely resulting in higher accuracy and a
deeper comprehension of complex topics. If ChatGPT-4.0 had been used in our study, the responses might
have been clearer and more closely aligned with expert-level answers, potentially influencing our assessment of AI's utility in this study[19]. We expect that as the models continue to be refined, the capabilities of LLMs
in this space will only improve.
We nevertheless feel that our study is significant in that it is the first of its kind to specifically evaluate LLM
vs. physician-generated responses regarding ACDF surgery and the first to look for differences between
patient and non-patient populations. Future studies examining patient perspectives on LLM vs. physician-
generated responses should explore multiple other dimensions associated with patient satisfaction,
including empathy and perceived trustworthiness of the response. Prior research has shown that physicians
are more likely to rate LLM-produced responses as higher in empathy compared to physician-generated
responses[20]. Another study revealed that ChatGPT-4.0 shows the capacity for empathy when used to answer USMLE Step 2 Clinical Skills questions, which are known to forecast performance in key residency domains such as patient care, teamwork, professionalism, and communication[21,22]. These studies both raise the question of whether the empathy imparted by artificial intelligence is felt by patients scouring LLMs for answers to their healthcare queries. The impact of the significantly shorter responses produced by LLMs compared with physicians on patient satisfaction is also an avenue worth exploring in future studies.