

Table 3. Inclusion and exclusion criteria used to select cervical spine surgery patients and age-matched controls

Inclusion criteria:
- Speaks and reads English at native-level proficiency
- Age > 18 years old
- Patients must have a history of cervical spine surgery at Stanford from 2019 to 2023

Exclusion criteria:
- Participants who did not fill out the study questionnaire
- Participants who did not consent to participate in the study
- Age-matched controls cannot have a history of spine surgery

                Figure 2. Line graph showing total and average word count for answers to the 10 frequently asked questions regarding ACDF surgery
                generated by ChatGPT-3.5, Gemini, and doctors. ACDF: Anterior cervical discectomy and fusion.


averaged. For a subset of the analysis, ChatGPT-3.5 and Gemini ratings were also averaged to assess patient perspectives on the two different LLMs. An independent two-sample t-test was used to compare ratings for each question. Inter-rater reliability was assessed using Fleiss' Kappa. Kappa values of > 0.80 indicate excellent reliability; 0.61 to 0.80, substantial reliability; 0.41 to 0.60, moderate reliability; 0.21 to 0.40, fair reliability; and ≤ 0.20, poor reliability[10]. The level of statistical significance was set at P < 0.05, or at a specifically listed P-value when a conservative Bonferroni correction was applied for analyses involving multiple comparisons. All statistical analyses were executed using R Studio (version 4.1.2) or Python (version 3.8; Python Software Foundation).
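As a minimal sketch of this analysis pipeline (the authors' actual scripts are not published with the paper, and the variable names and example ratings below are hypothetical), the per-question comparison, the Bonferroni adjustment, and the reliability check could be run in Python as follows:

```python
# Hypothetical sketch of the statistical comparisons described above.
# Example data and names are illustrative, not the study's data or code.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Example ratings for one question: one rating per survey participant,
# for LLM-generated vs. physician-generated answers.
llm_ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4])
doctor_ratings = np.array([3, 4, 4, 3, 5, 4, 3, 3])

# Independent two-sample t-test comparing the two groups of ratings.
t_stat, p_val = ttest_ind(llm_ratings, doctor_ratings)

# Bonferroni correction: with 10 questions, each per-question test is
# judged against alpha / 10 rather than the nominal 0.05.
n_questions = 10
alpha_corrected = 0.05 / n_questions
print(f"t = {t_stat:.2f}, P = {p_val:.4f}, "
      f"significant after Bonferroni: {p_val < alpha_corrected}")

# Fleiss' Kappa for inter-rater reliability: rows are rated items,
# columns are raters, entries are the category each rater assigned.
ratings_matrix = np.array([
    [4, 4, 5],   # item 1, three raters
    [3, 3, 3],   # item 2
    [5, 4, 4],   # item 3
])
counts, _ = aggregate_raters(ratings_matrix)  # items x categories counts
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa = {kappa:.2f}")
```

The kappa value printed here would then be read against the thresholds listed above (e.g., > 0.80 for excellent reliability).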

               RESULTS
               The Shapiro-Wilk test indicated that the data were not normally distributed (W = 0.825, P < 0.001). This
               finding justified the use of non-parametric statistical methods for subsequent analyses.
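For illustration, a normality check like the one reported above can be run with SciPy's Shapiro-Wilk implementation (the data below are placeholders, not the study's ratings):

```python
# Illustrative Shapiro-Wilk normality test; data are placeholders.
import numpy as np
from scipy.stats import shapiro

ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4, 2, 5, 5, 4, 3, 5])
w_stat, p_val = shapiro(ratings)
print(f"W = {w_stat:.3f}, P = {p_val:.4f}")
# P < 0.05 suggests the data deviate from normality, supporting the
# switch to non-parametric methods described in the text.
```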

               Word count analysis
Compared to physician-generated responses, ChatGPT-3.5 and Gemini produced markedly shorter responses to every question (LLM average = 30.0 ± 23.5 words vs. doctor average = 153.7 ± 86.7 words; P < 0.01; Figure 2). Despite being asked to limit responses to 250 words, the longest responses produced by ChatGPT-3.5 and Gemini were 77 and 107 words, respectively, while their average response lengths were 31 and 28.9 words, respectively. Responses from physicians were significantly longer, with an average of 153.7 words per question [Table 4]. Overall, LLMs produced significantly shorter responses than physician-generated responses (P < 0.001). Comparisons of individual LLM platforms also revealed shorter responses produced by ChatGPT-3.5 vs. physicians (P < 0.001) and shorter responses produced by Gemini vs.