Rather than quantifying how accurate an output was, accuracy was assessed in a binary fashion: any inaccuracy within the LLM output classified the entire output as inaccurate. Any discrepancies were discussed with an independent arbitrator (N.S.K.), an expert breast surgeon, until a consensus was reached. Statistical analysis was performed using Microsoft Excel (Version 7; Microsoft Corp., Redmond, WA) and included descriptive statistics, t-tests, and chi-square tests where appropriate, with a predetermined significance level of P < 0.05.
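The analysis described above was performed in Excel; for readers who prefer a scripted workflow, the snippet below is a minimal sketch of equivalent tests in Python with SciPy. The per-output values and accuracy counts are hypothetical placeholders (they are not published in this excerpt), so the code only illustrates the form of the comparisons, not the study's actual data.

```python
# Illustrative only: two-sample t-test on a continuous metric (word count) and
# chi-square test on a categorical metric (binary accuracy), mirroring the
# tests named in the Methods. All values below are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-output word counts for 40 outputs per platform,
# centered on the reported means (331 vs. 218 words).
chatgpt_words = rng.normal(loc=331, scale=40, size=40)
doximity_words = rng.normal(loc=218, scale=40, size=40)

t_stat, p_value = stats.ttest_ind(chatgpt_words, doximity_words)
print(f"t-test on word counts: t = {t_stat:.2f}, P = {p_value:.4g}")

# Hypothetical 2x2 table of accurate vs. inaccurate outputs per platform.
contingency = np.array([[38, 2],   # ChatGPT: accurate, inaccurate
                        [39, 1]])  # Doximity GPT: accurate, inaccurate
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi-square on accuracy: chi2 = {chi2:.2f}, P = {p:.4g}")
```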
RESULTS
In total, eighty NLP outputs were included in the study, half from Doximity GPT and the remainder from
ChatGPT. ChatGPT responses were longer than Doximity GPT outputs when measured by word count (331 vs. 218 words, P < 0.001). ChatGPT outputs also contained more total characters (1,842 vs. 1,139 characters, P < 0.001) and more sentences (16.8 vs. 13.7 sentences, P < 0.001). Sentences were also more verbose in ChatGPT outputs, with nearly 4 additional words per sentence compared with texts produced by Doximity GPT (20 vs. 16.3 words per sentence, P < 0.001) [Table 2].
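The paper does not describe how these length metrics were tallied; the sketch below shows one way the word, character, and sentence counts and the words-per-sentence ratio could be derived for a single output. The sentence-splitting heuristic and the decision to count all characters (including spaces) are assumptions for illustration only.

```python
# Minimal sketch of per-output length metrics; not the authors' tooling.
import re

def length_metrics(text: str) -> dict:
    """Word, character, sentence, and words-per-sentence counts for one output."""
    words = text.split()
    # Sentences approximated by terminal punctuation; bulleted or heavily
    # abbreviated text may need a more careful splitter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "characters": len(text),  # counts all characters, including spaces
        "sentences": len(sentences),
        "words_per_sentence": len(words) / max(len(sentences), 1),
    }

print(length_metrics(
    "The implants are placed under the muscle. Swelling is expected for one to two weeks."
))
```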
Considering the overall structure of the outputs, Doximity GPT outputs were, by default, structured as a letter to the patient from a medical provider, whereas ChatGPT generated a bulleted list. Overall, the letter format of Doximity GPT yielded outputs that appeared more personal and conversational in nature.
Regarding readability, Doximity GPT outputs were more readable as assessed by all four validated instruments: Flesch-Kincaid Reading Ease (42.6 vs. 29.9, P < 0.001), Flesch-Kincaid Grade Level (11.4 vs. 14.1 grade, P < 0.001), Coleman-Liau Index (14.9 vs. 17 grade, P < 0.001), and Automated Readability Index (11.3 vs. 14.8 grade, P < 0.001). Regarding content, there was no difference between the two platforms in the appropriateness of the topic (99% overall). All outputs provided a degree of background medical information on the subject, and 96% also included direct prescriptive advice, such as contacting the surgeon (100% of cases) and adhering to postoperative instructions (90% of instances). Medical advice
from all AI-generated outputs was deemed reasonable. The full list of outputs is provided for reference in
Supplementary Table 1.
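The four readability instruments are standard formulas based on sentence length, word length, and (for the Flesch-Kincaid measures) syllable counts; the paper does not state which calculator was used. The sketch below shows how they could be computed for a single output using the open-source textstat package; the sample text is a hypothetical output fragment, not one of the study's outputs.

```python
# Illustrative sketch: the four readability instruments applied to one output,
# computed with the textstat package (the study's exact tool is not specified).
import textstat

sample_output = (
    "After breast augmentation, mild swelling and soreness are expected for "
    "the first week. Contact your surgeon if you develop fever, worsening "
    "pain, or redness around the incision."
)

print("Flesch Reading Ease:        ", textstat.flesch_reading_ease(sample_output))
print("Flesch-Kincaid Grade Level: ", textstat.flesch_kincaid_grade(sample_output))
print("Coleman-Liau Index:         ", textstat.coleman_liau_index(sample_output))
print("Automated Readability Index:", textstat.automated_readability_index(sample_output))
```

Note that a higher Flesch Reading Ease score indicates easier text, whereas the three grade-level indices are lower for easier text, which is why Doximity GPT scores higher on the first instrument and lower on the other three.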
DISCUSSION
To our knowledge, this is the first study assessing the performance of the novel healthcare-specific AI platform, Doximity GPT, in the setting of perioperative management following plastic surgery. We compared the performance of Doximity GPT and ChatGPT in responding to common perioperative questions for a breast augmentation procedure based on the accuracy, format, and readability of the generated outputs. This work represents necessary foundational research to establish the fidelity of NLP-generated responses to medically sound recommendations before attempting to integrate this technology into patient-facing clinical practice. This study demonstrates that Doximity’s AI platform produces reasonable, accurate information in response to common patient queries about breast augmentation procedures.
A key difference between the Doximity GPT- and ChatGPT-generated outputs was observed in the structure and formatting of the responses. The Doximity GPT outputs were automatically formatted as letters to the patient on behalf of the provider, signed by the account holder who entered the query. This aligns with the purpose of the Doximity GPT platform, which is free for all U.S. clinicians and medical students and intended to facilitate the creation of patient education materials and note templates, along with other administrative tasks. The outputs generated by ChatGPT, on the other hand, defaulted to bulleted lists, consistent with its design as a general-purpose, open-access platform. This difference may be explained by the additional training of Doximity GPT on healthcare documentation examples.