have demonstrated a lack of sufficient understanding and knowledge of plastic surgery within the broader
healthcare workforce [19,20]. A plastic surgery-specific AI tool or an NLP with additional plastic surgery
training represents an opportunity to improve the knowledge and applicability of LLM integration within
plastic surgery. Analyses of generic AI chatbots on plastic surgery in-service training examinations, for
example, have demonstrated a wide range of accuracy, with scores comparable to those of a first-year plastic
surgery trainee [21,22]. The creation of a specialty-specific LLM has previously been explored, particularly in
the field of otolaryngology, where an ENT-specific LLM called ChatENT was found to outperform existing
LLMs and exhibited promise in medical and patient education [23]. An opportunity exists to develop a plastic
surgery-focused LLM to deliver the most accurate and accessible information to patients and plastic
surgeons alike. Such an LLM should also be customizable, so that individual surgeon preferences regarding
perioperative instructions can be programmed. To ensure safety in clinical application, appropriate escalation
of patient inquiries describing scenarios that merit urgent or emergent medical attention must be incorporated
into the AI tool, as sketched below. Patients will inevitably utilize AI
platforms to seek medical counsel independent of physician supervision. Patients have long used the
Internet for self-diagnosis, self-referral, and research of their conditions [24,25]. Thus, studies of this nature are
critical to ensure the reliability and accuracy of AI-generated health information to protect patients from
misinformation [26].
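As a purely illustrative sketch of how the customization and escalation behavior described above might be prototyped (the names RED_FLAG_TERMS, SURGEON_PREFERENCES, and triage_inquiry are hypothetical and do not reflect any implementation from this study):

```python
# Hypothetical sketch: surgeon-customizable instructions plus a safety
# screen that escalates urgent inquiries before any LLM call is made.

RED_FLAG_TERMS = {  # illustrative examples, not a validated triage list
    "chest pain", "shortness of breath", "uncontrolled bleeding",
    "spreading redness", "high fever", "calf pain",
}

SURGEON_PREFERENCES = {  # programmed per surgeon, per procedure
    "drain_removal": "Drains are removed once output falls below 30 mL/day.",
    "showering": "You may shower 48 hours after surgery.",
}

def triage_inquiry(message: str) -> str:
    """Escalate red-flag symptoms; otherwise build a customized LLM prompt."""
    text = message.lower()
    if any(term in text for term in RED_FLAG_TERMS):
        return ("ESCALATE: please contact your surgical team or seek "
                "emergency care immediately.")
    preferences = " ".join(SURGEON_PREFERENCES.values())
    # In a real tool, this prompt would be sent to the underlying LLM
    # and the response reviewed per the surgeon's protocol.
    return (f"Answer using these surgeon preferences: {preferences}\n"
            f"Patient question: {message}")

print(triage_inquiry("I have spreading redness around my incision"))
print(triage_inquiry("When can I shower after my procedure?"))
```

In practice, any such keyword screen would need clinical validation; the point of the sketch is only that escalation must occur upstream of the model's response rather than being delegated to it.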
Since Doximity GPT and ChatGPT are backed by the same underlying NLP, they are subject to similar
training data biases. While the generated outputs were largely deemed clinically reasonable by the study team, previous
studies have identified inaccuracies and inadequacies when utilizing ChatGPT to answer common
postoperative questions [27,28]. Additionally, the ever-evolving nature of clinical dogma and accepted
practices may not always align with the knowledge cut-off dates of these LLMs. Doximity GPT’s knowledge
of clinical data extends only until September 2021, so novel medical or surgical information will not be
included in any outputs. This highlights the importance of clinicians prioritizing clinical judgment and
thoroughly reviewing any AI-generated output prior to distribution to patients.
The ethical implications of incorporating AI tools into plastic surgery practice also warrant further
discussion. Previous studies have highlighted the importance of informed consent, privacy protection, bias
reduction, and regulation for these technologies [29,30] . Kenig et al. described the need for a partnership
between physicians and lawmakers when creating guidelines and regulations for the use of AI in clinical
practice, to ensure that the highest standards of quality and transparency are upheld [29]. They also suggested
the creation of an independent body to aid in the testing and validation of healthcare-specific AI models.
Further, these tools must be trained on diverse datasets, as bias in the training data may affect the
accuracy of AI-generated responses for patients of diverse backgrounds. Periodic review and validation of
AI models used in healthcare can help foster fairness, equity, and higher-quality patient-facing information, as sketched below.
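A minimal, purely hypothetical sketch of such a periodic review, comparing reviewer-judged accuracy of model outputs across patient subgroups (the reviews records and subgroup labels are illustrative, not data from this study):

```python
# Hypothetical periodic-review sketch: compare reviewer-judged accuracy
# of model outputs across patient subgroups to surface potential bias.
from collections import defaultdict

# Each record: (subgroup label, was the output judged clinically accurate?)
reviews = [  # illustrative data only
    ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False),
]

totals = defaultdict(int)
accurate = defaultdict(int)
for subgroup, judged_accurate in reviews:
    totals[subgroup] += 1
    accurate[subgroup] += judged_accurate

for subgroup in sorted(totals):
    rate = accurate[subgroup] / totals[subgroup]
    print(f"{subgroup}: {rate:.0%} judged accurate ({totals[subgroup]} outputs)")
# A persistent accuracy gap between subgroups would flag the model for
# retraining or for review of its training data.
```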
Limitations of this study include the comparison between Doximity GPT and only one other NLP. While
ChatGPT has previously been demonstrated to have the highest working knowledge in plastic surgery, this
may have changed or evolved since that time [21]. Furthermore, Doximity GPT is powered by the updated
ChatGPT 4.0. We elected to use ChatGPT 3.5 for comparison in this study, given that it is freely accessible.
Differences assessed by this study may be attributable to the subtle nuances between the two versions of the
LLM. Assessing the accuracy of LLM outputs is a time-consuming process, as each output must be reviewed
individually, and it remains difficult to determine accuracy objectively without relying on the clinical judgment of the
study team. Future studies should seek to develop methodologies or tools that can more objectively
determine medical accuracy on a broader scale. This difficulty further limited the scope of
the study, as the study team prioritized critical evaluation of each individual output rather than reviewing