
understanding and make complex medical information more accessible to those with less familiarity. Practically, this can be accomplished with more specific LLM prompting based on one's prior understanding (or lack thereof) of the medical intervention. The lack of significant differences in responses to individual questions is important because it indicates that our findings are not skewed by any particular question, reinforcing their reliability.
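As a purely illustrative sketch of such tailored prompting (not part of this study's protocol), the following Python snippet adjusts an LLM system prompt to the reader's stated familiarity; the model name, prompt wording, and helper function are assumptions for demonstration only:

```python
# Illustrative sketch: tailoring an LLM prompt to a reader's prior
# familiarity with a procedure. Model choice and prompt wording are
# assumptions, not the protocol used in this study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_about_procedure(question: str, familiarity: str) -> str:
    """familiarity: e.g., 'no medical background' or 'clinical training'."""
    system_msg = (
        "You are explaining anterior cervical discectomy and fusion (ACDF) "
        f"to a reader with {familiarity}. Match your vocabulary and level "
        "of detail to that background, and keep the answer brief and clear."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # free-tier model comparable to the one studied
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# The same question framed for two different backgrounds
print(ask_about_procedure("What are the risks of ACDF?", "no medical background"))
print(ask_about_procedure("What are the risks of ACDF?", "clinical training"))
```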

As LLMs grow more advanced, responding to complex medical questions more quickly, clearly, and completely, it may become prudent for physicians to employ LLMs as tools to improve practice efficiency and patient education. A recent study by Jahanshahi et al. assessed AI and machine learning techniques to process online messages between doctors and patients and to generate multiple automatic responses[14]. Their machine learning model, BERT, achieved an accuracy of 85.41% when suggesting the top 3 doctor responses. Worldwide, other studies have employed LLMs in telemedicine to reduce barriers to healthcare access and to provide quick consultations in the setting of a pandemic[15-17]. Collectively, these studies suggest that LLMs show great potential for quickly addressing medical questions from patients. Building upon this research, our study found that both spine patients and non-spine patient controls were satisfied with the clarity and completeness of LLM-generated responses relative to physician-generated ones, and that LLMs outperformed physicians in some respects, including brevity and clarity.
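Jahanshahi et al.'s pipeline is not reproduced here; purely to illustrate the general response-suggestion idea, the sketch below ranks candidate physician replies against a patient message using an off-the-shelf BERT-style sentence encoder. The encoder choice, messages, and scoring are assumptions for demonstration:

```python
# Minimal sketch of suggesting top-ranked physician replies for a patient
# message with a BERT-style sentence encoder. Illustrative only; this is
# not Jahanshahi et al.'s actual model or data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

patient_message = "I still have numbness in my arm two weeks after my ACDF."
candidate_replies = [
    "Some numbness can persist for weeks after ACDF; we will monitor it.",
    "Your lab results look normal; no action is needed.",
    "Persistent numbness warrants a follow-up exam; can you come in this week?",
    "Please remember to bring your insurance card to the next visit.",
]

# Embed the message and candidates, then score by cosine similarity
msg_emb = model.encode(patient_message, convert_to_tensor=True)
cand_emb = model.encode(candidate_replies, convert_to_tensor=True)
scores = util.cos_sim(msg_emb, cand_emb)[0]

# Suggest the top 3 replies, analogous to the top-3 setting described above
for idx in scores.argsort(descending=True)[:3]:
    i = int(idx)
    print(f"{scores[i]:.2f}  {candidate_replies[i]}")
```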


Our study is limited by its small sample size and poor to fair inter-rater reliability. The uniformly poor to fair inter-rater reliability across all questions is likely due to differences in participants' background knowledge and potential ambiguities in our questions. Our initial intent was to capture participants' gut reactions and initial responses to the educational material, which is why we did not provide in-depth training. It is likely (and has been shown here) that these "gut reactions," or impulse responses, are less reliable than responses given against systematic criteria. To improve reliability in future studies, we could provide rater training to ensure raters share a common understanding of the evaluation criteria. This study is also limited in that we used the free, more easily accessible ChatGPT-3.5 instead of paying for the newest version, ChatGPT-4.0, which is, at the time of writing, OpenAI's most advanced system, featuring its safest and most useful responses[18]. It is important to consider the differences between these models, since advancements in model capabilities can significantly enhance performance. Specifically, ChatGPT-4.0 boasts significant improvements in understanding and generating human-like text, likely resulting in higher accuracy and a deeper comprehension of complex topics. Had ChatGPT-4.0 been used in our study, the responses might have been clearer and more closely aligned with expert-level answers, potentially influencing our assessment of AI's utility[19]. We expect that as these models continue to be refined, the capabilities of LLMs in this space will only improve.
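As context for the inter-rater reliability limitation discussed above, agreement between raters is commonly quantified with statistics such as Cohen's kappa. A minimal sketch using scikit-learn follows; the ratings are invented for illustration and are not data from this study:

```python
# Minimal sketch: quantifying agreement between two raters with Cohen's
# kappa. The ratings below are invented for illustration only.
from sklearn.metrics import cohen_kappa_score

# Likert-style scores (e.g., 1-5 clarity ratings) from two raters on the
# same ten responses
rater_a = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater_b = [4, 4, 3, 3, 5, 3, 4, 4, 2, 4]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# A common rule of thumb (Landis & Koch): <0.20 poor, 0.21-0.40 fair,
# 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
```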

We nevertheless feel that our study is significant in that it is the first of its kind to specifically evaluate LLM- vs. physician-generated responses regarding ACDF surgery and the first to look for differences between patient and non-patient populations. Future studies examining patient perspectives on LLM- vs. physician-generated responses should explore multiple other dimensions associated with patient satisfaction, including empathy and the perceived trustworthiness of the response. Prior research has shown that physicians are more likely to rate LLM-produced responses as higher in empathy than physician-generated responses[20]. Another study revealed that ChatGPT-4.0 shows the capacity for empathy when used to answer USMLE Step 2 Clinical Skills questions, which are known to forecast performance in key residency domains such as patient care, teamwork, professionalism, and communication[21,22]. Both studies raise the question of whether empathy imparted by artificial intelligence is felt by patients turning to LLMs for answers to their healthcare queries. The impact of the significantly shorter responses produced by LLMs vs. physicians is also an avenue worth exploring as a measure of patient satisfaction in future studies.