

Table 2. Objective structural metrics and readability scores for AI-generated outputs from Doximity GPT and ChatGPT

 Variable                         Doximity GPT     ChatGPT         P-value
 Word count                       218 ± 43         331 ± 48        < 0.001
 Total characters                 1,139 ± 223      1,842 ± 266     < 0.001
 Total sentences                  13.7 ± 3.4       16.8 ± 2.9      < 0.001
 Words per sentence               16.3 ± 2.3       20.0 ± 3.6      < 0.001
 Flesch-Kincaid reading ease      42.6 ± 9.5       29.9 ± 7.2      < 0.001
 Flesch-Kincaid grade level       11.4 ± 1.5       14.1 ± 1.6      < 0.001
 Coleman-Liau index               14.9 ± 1.6       17.0 ± 1.1      < 0.001
 Automated readability index      11.3 ± 1.7       14.8 ± 1.9      < 0.001

               AI: Artificial intelligence.
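For readers who wish to reproduce these metrics, the formulas underlying Table 2 are published and straightforward to compute. The Python sketch below illustrates them; the tokenization and the vowel-group syllable counter are crude assumptions of this sketch (the original analysis may have used dedicated readability software), so scores will vary slightly from tool to tool.

```python
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; dedicated tools use pronunciation
    # dictionaries and will produce slightly different counts.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    letters = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    W, S = len(words), max(len(sentences), 1)

    wps = W / S              # words per sentence
    spw = syllables / W      # syllables per word
    L = letters / W * 100    # letters per 100 words
    Sc = S / W * 100         # sentences per 100 words

    return {
        "word_count": W,
        "total_sentences": S,
        "words_per_sentence": round(wps, 1),
        # Published formulas for each index reported in Table 2:
        "flesch_reading_ease": round(206.835 - 1.015 * wps - 84.6 * spw, 1),
        "flesch_kincaid_grade": round(0.39 * wps + 11.8 * spw - 15.59, 1),
        "coleman_liau_index": round(0.0588 * L - 0.296 * Sc - 15.8, 1),
        "automated_readability_index": round(4.71 * (letters / W) + 0.5 * wps - 21.43, 1),
    }

print(readability("Liposuction removes excess fat through small incisions. "
                  "Most patients return to normal activity within two weeks."))
```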

While ChatGPT responses provided more detailed information for each query, Doximity GPT outputs were significantly more readable. Still, readability remains a limitation of NLP-generated outputs, as both LLMs generated responses at a reading level higher than national recommendations[17].
               With continued RLHF, Doximity GPT has the potential to be a useful tool for plastic surgeons and can
               assist with a range of tasks, such as providing basic information on procedures and writing appeal letters to
               insurance providers.

Excitement regarding the possibility of incorporating NLPs into clinical workflows is evidenced by an exponential rise in exploratory papers discussing potential applications. These studies have focused on analyzing and comparing generic NLPs against one another. Garg et al. compared ChatGPT to Google Bard (Google, Mountain View, CA) for outputs regarding patient education materials for facial aesthetic surgery. Although this group specifically requested outputs at the eighth-grade reading level, the generated outputs averaged a tenth-grade reading level[18]. Lim and associates analyzed four generic LLMs to determine the applicability of AI-generated outputs for common perioperative questions for patients undergoing abdominoplasty. All LLMs generated information above the nationally recommended reading levels for medical literature. This group also investigated more subjective aspects of the AI-generated outputs, such as patient friendliness, which may be an important feature if such technology is integrated in a direct patient-facing manner[12]. In terms of improving the readability of content, Vallurupalli et al. suggest that LLMs may function more efficiently at simplifying pre-written patient instructions to an appropriate reading level than at producing novel outputs at the reading level recommended by the National Institutes of Health[11]. Further assessment of this theory represents future work from our group.
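The two strategies at issue here are easy to contrast in code. The following is a minimal sketch assuming OpenAI's chat completions API (v1 Python SDK); the prompts and function names are illustrative assumptions, not those used in any of the cited studies.

```python
# Sketch of the two prompting strategies discussed above. Prompts are
# illustrative assumptions, not those used in the cited studies.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_at_level(topic: str, grade: int = 8) -> str:
    """Strategy 1: ask the model to produce novel patient education text."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write patient education material about {topic} "
                       f"at a grade-{grade} reading level.",
        }],
    )
    return resp.choices[0].message.content

def simplify_existing(instructions: str, grade: int = 8) -> str:
    """Strategy 2: ask the model to simplify pre-written instructions,
    the approach Vallurupalli et al. suggest may work more reliably."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Rewrite the following patient instructions at a "
                       f"grade-{grade} reading level, keeping all clinical "
                       f"content:\n\n{instructions}",
        }],
    )
    return resp.choices[0].message.content
```

In either case, the output's actual reading level should be verified with an objective metric (e.g., the Flesch-Kincaid calculation sketched earlier) rather than taken on trust, given the cited finding that requested reading levels are often not met.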

While Doximity GPT represents a novel, healthcare-specific LLM, it has several limitations. First, Doximity GPT lacks working knowledge of information published after September 2021. This limitation is not unique to Doximity GPT but is common among LLMs; ChatGPT 3.5, for instance, is temporally limited to information published prior to January 2022. Given that there is only a three-month difference between the two LLMs' knowledge cutoffs, the impact of this variance on their working knowledge is likely limited. Nevertheless, when analyzing and comparing LLMs, it is essential to consider the temporal limitations of each, and any AI-powered tool must be employed with awareness of how current its knowledge is. Continual retraining of NLPs is required to keep these programs up to date as medical knowledge constantly evolves. Furthermore, while Doximity GPT is marketed as a healthcare-trained LLM, the details of what additional functionality this training provides are unclear given its proprietary nature. Another limitation of Doximity GPT is that while it has specific medical reinforcement, it does not have specific plastic surgery training or reinforcement. Studies