
Figure 5. (A) Clarity and (B) completeness ratings, expressed in percentages, from cervical spine surgery patients (n = 5) vs. age-matched controls (n = 5) comparing answers generated by LLMs and physicians. *P < 0.05. **P < 0.01. LLMs: large language models.
Comparisons of individual questions revealed no statistically significant differences in clarity or completeness between patients and controls for responses generated by ChatGPT-3.5, Gemini, or physicians [Supplementary Figure 2A and B, Supplementary Table 2]. Overall, study participants exhibited poor to fair inter-rater reliability in their ratings of LLM- and physician-generated responses with regard to clarity (LLMs: κ = 0.16, P < 0.001; physicians: κ = 0.24, P < 0.001) and completeness (LLMs: κ = 0.23, P < 0.001; physicians: κ = 0.12, P < 0.001).
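For context, inter-rater agreement among more than two raters is commonly quantified with Fleiss' kappa. The snippet below is only an illustrative sketch of that calculation, assuming binary (clear/unclear) ratings, synthetic data, and the statsmodels implementation; the study's actual rating scale, data, and statistical software are not specified in this section, and the reported P values would require a separate significance test.

```python
# Minimal sketch: estimating inter-rater agreement (Fleiss' kappa) from a
# subjects-by-raters matrix of binary ratings. Data here are synthetic.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: rows = questions rated, columns = 10 raters,
# entries = 1 (rated "clear") or 0 (rated "unclear").
rng = np.random.default_rng(0)
ratings = rng.integers(0, 2, size=(20, 10))

# Convert the subjects-by-raters matrix into subjects-by-categories counts,
# then compute Fleiss' kappa across all raters.
counts, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa = {kappa:.2f}")  # near 0 for random ratings (poor agreement)
```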


DISCUSSION
Several recent studies have demonstrated the potential of LLMs to deliver precise medical information and educate patients across various medical specialties[11-13]. This study used patients who had undergone cervical spine surgery and gender- and age-matched controls to investigate perspectives on LLM- vs. physician-generated answers to commonly asked questions regarding ACDF surgery. We found that study participants were more likely to give positive clarity ratings to LLM-generated responses than to physician-generated responses. Despite LLM responses being much shorter than physician-generated responses, they received equivalent ratings for completeness. This finding is exciting, as it demonstrates that LLMs can provide short, concise responses to complex medical questions that are both clear and complete, appealing to patients and controls alike.

We also found that, when compared to age-matched controls, patients were more likely to rate physician-generated responses as clear and complete. This could potentially be explained by the patients having recently undergone spine surgery and spine surgery education (from the surgeon and surgical team), making them more familiar with medical terminology regarding ACDF surgery. This is further supported by the trend of patients also giving higher clarity and completeness ratings to LLM responses, potentially reflecting their familiarity with the subject matter. This familiarity with spine surgery likely introduces a bias, leading these patients to prefer responses that align with their prior knowledge. While this effect is evident in our study, it could potentially be generalized to other medical contexts, particularly where patients have prior experience or familiarity with a specific procedure. However, more research with larger sample sizes is needed to confirm this effect across different medical questions and procedures. For patients without prior surgery experience, LLMs could offer a more neutral perspective, potentially leveling the playing field between LLM- and physician-generated responses. To better meet the needs of such patients, LLMs could be tailored with explanations that build foundational