Rather than quantifying how accurate an output was, accuracy was assessed in a binary fashion: any inaccuracy within an LLM output classified the entire output as inaccurate. Any discrepancies were discussed with an independent arbitrator (N.S.K.), an expert breast surgeon, until a consensus was reached. Statistical analysis was performed using Microsoft Excel (Version 7; Microsoft Corp., Redmond, WA), including descriptive statistics, t-tests, and chi-square tests where appropriate, with a predetermined significance level of P < 0.05.
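For illustration, the comparison described above could be reproduced with any standard statistical package. The sketch below uses Python with SciPy rather than Excel, and all input values are hypothetical placeholders, not study data.

    # Minimal sketch of the statistical comparison, assuming per-output
    # metrics have already been tabulated; values are illustrative only.
    from scipy import stats

    # Hypothetical per-output word counts (the study had 40 outputs per platform)
    chatgpt_words = [331, 305, 348, 322]
    doximity_words = [218, 230, 201, 224]

    # Two-sample t-test for a continuous metric such as word count
    t_stat, p_value = stats.ttest_ind(chatgpt_words, doximity_words)
    print(f"t = {t_stat:.2f}, P = {p_value:.4f}")  # significant if P < 0.05

    # Chi-square test for a categorical outcome such as binary accuracy
    # (rows: platform; columns: accurate vs. inaccurate counts, hypothetical)
    contingency = [[39, 1],
                   [40, 0]]
    chi2, p_cat, dof, expected = stats.chi2_contingency(contingency)
    print(f"chi-square = {chi2:.2f}, P = {p_cat:.4f}")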


RESULTS
In total, eighty NLP outputs were included in the study, half from Doximity GPT and the remainder from ChatGPT. ChatGPT responses were longer than Doximity GPT outputs when measured by word count (331 vs. 218 words, P < 0.001). ChatGPT outputs also used more characters overall (1,842 vs. 1,139 characters, P < 0.001) and more sentences (16.8 vs. 13.7 sentences, P < 0.001). ChatGPT sentences were also more verbose, containing nearly four additional words per sentence compared to texts produced by Doximity GPT (20 vs. 16.3 words per sentence, P < 0.001) [Table 2].
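These length metrics are straightforward to derive from raw text. A minimal sketch follows, with a deliberately naive sentence splitter and a placeholder input string rather than an actual study output.

    import re

    def length_metrics(text: str) -> dict:
        # Words delimited by whitespace; sentences by '.', '!', or '?'
        words = text.split()
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        return {
            "words": len(words),
            "characters": len(text),
            "sentences": len(sentences),
            "words_per_sentence": len(words) / max(len(sentences), 1),
        }

    print(length_metrics("Keep the incision dry. Call your surgeon if you notice redness."))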

Considering the overall structure of the outputs, Doximity GPT outputs were, by default, structured as a letter to the patient from a medical provider, whereas ChatGPT generated a bulleted list. Overall, the letter format of Doximity GPT yielded outputs that appeared more personal and conversational in nature. Regarding readability, Doximity GPT outputs were more readable on all four validated instruments: Flesch-Kincaid Reading Ease (42.6 vs. 29.9, P < 0.001), Flesch-Kincaid Grade Level (grade 11.4 vs. 14.1, P < 0.001), Coleman-Liau Index (grade 14.9 vs. 17, P < 0.001), and Automated Readability Index (grade 11.3 vs. 14.8, P < 0.001). Regarding content, there was no difference between the two platforms in the appropriateness of the topic (99% overall). All outputs provided a degree of background medical information on the subject, and 96% also included direct prescriptive advice, including contacting the surgeon (100% of cases) and adhering to postoperative instructions (90% of instances). Medical advice from all AI-generated outputs was deemed reasonable. The full list of outputs is provided for reference in Supplementary Table 1.
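The four instruments are deterministic functions of simple text counts. The sketch below implements their published formulas directly; the syllable counter is a crude vowel-group heuristic (the study presumably used a validated calculator), and letters stand in for characters in the Automated Readability Index.

    import re

    def count_syllables(word: str) -> int:
        # Crude heuristic: one syllable per run of consecutive vowels
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability(text: str) -> dict:
        words = re.findall(r"[A-Za-z]+", text)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]
        letters = sum(len(w) for w in words)
        syllables = sum(count_syllables(w) for w in words)
        w, s = len(words), len(sentences)
        L = letters / w * 100  # letters per 100 words (Coleman-Liau)
        S = s / w * 100        # sentences per 100 words (Coleman-Liau)
        return {
            "flesch_reading_ease": 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w),
            "flesch_kincaid_grade": 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59,
            "coleman_liau_index": 0.0588 * L - 0.296 * S - 15.8,
            "automated_readability_index": 4.71 * (letters / w) + 0.5 * (w / s) - 21.43,
        }

    print(readability("The implant pocket should be kept clean and dry after surgery."))

Note that a higher Reading Ease score indicates easier text, whereas the other three indices report an approximate U.S. grade level; this is why Doximity GPT scores higher on the first instrument but lower on the remaining three.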


DISCUSSION
To our knowledge, this is the first study assessing the performance of the novel healthcare-specific AI platform, Doximity GPT, in the setting of perioperative management following plastic surgery. We compared the performance of Doximity GPT and ChatGPT in responding to common perioperative questions for a breast augmentation procedure based on the accuracy, format, and readability of the generated outputs. This work represents necessary fundamental research to establish the fidelity of NLP-generated responses to medically sound recommendations before attempting to integrate this technology into patient-facing clinical practice. This study demonstrates that Doximity's AI platform produces reasonable, accurate information in response to common patient queries about breast augmentation procedures.


A key difference between the Doximity GPT- and ChatGPT-generated outputs was observed in the structure and formatting of the responses. The Doximity GPT outputs were automatically formatted as letters to the patient on behalf of the provider, signed by the account holder who entered the query. This aligns with the purpose of the Doximity GPT platform, which is free for all U.S. clinicians and medical students and intended to facilitate the creation of patient education materials, note templates, and other administrative tasks. In contrast, the outputs generated by ChatGPT defaulted to bulleted lists, consistent with an open-access, general-purpose platform. This difference may be explained by Doximity GPT's additional training on healthcare documentation examples.