
               exhibited similar comprehensibility.


               DISCUSSION
               Overall, ChatGPT provided the most comprehensive and comprehensible information, generating an
               extensive array of management solutions while incorporating contextual data, all in a reader-friendly
               manner. BARD’s answers were similar in structure to ChatGPT’s, albeit lacking the same level of detail and
clarity. Bing AI struggled to match the comprehensibility and ease of understanding achieved by ChatGPT and BARD, often using technical language with frequent statistical references and lacking succinct summaries; its responses consequently read as unempathetic and impersonal, a shortcoming reflected in its Flesch Reading Ease Score. However, its consistent provision of sources compensated for these weaknesses, yielding a DISCERN score comparable to ChatGPT's.
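For context, the Flesch Reading Ease Score cited above is calculated using the standard formula, in which higher scores indicate text that is easier to read:

Flesch Reading Ease = 206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words)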


ChatGPT and BARD demonstrated superior comprehensibility and user-centricity compared with Bing AI, making them more suitable for improving public comprehension of nerve injury management. Overall, the accuracy of the three LLMs was insufficient for use as automated diagnostic tools supporting healthcare professionals, as they neglected to reference key peripheral nerve injury guidelines or high-quality research[22]. Additionally, the LLMs omitted experimental treatments such as stem cell therapy and photochemical tissue bonding, indicating a limitation in their algorithms' capacity to generate innovative solutions[23]. Despite this, as a clinical practice tool, they could help ensure that patients are not misled or offered unfeasible options.

               In terms of information depth, ChatGPT provided supplementary data and enhanced its primary response
when asked, outperforming BARD. Although BARD's comprehensiveness was comparable to ChatGPT's, it failed to respond to the third prompt, severely impacting its readability and reliability scores. Moreover, ChatGPT presented the most thorough rationale for each suggestion, bolstering its credibility. BARD closely followed, but its explanations lacked ChatGPT's depth. Bing AI delivered the least detailed responses, which at times fell short of the gold standard. For example, it suggested exercise as an alternative treatment but neglected to specify that aerobic exercise is the optimal form for addressing nerve injuries[24].
               Nevertheless, Bing AI offered the most diverse range of treatment alternatives. Therefore, while Bing AI
               lacked depth, ChatGPT and BARD were limited in breadth. Notably, all three LLMs concentrated on
layperson first aid, offering limited information for healthcare professionals and academics. They omitted management algorithms for nerve injuries; for example, Bhandari (2019) recommends immediate surgery for penetrating trauma with neurological symptoms but conservative management for blunt trauma. The LLMs also neglected to address Seddon and Sunderland's classifications of peripheral nerve injuries, which influence management[8,25,26]. This oversight may be attributed to the phrasing of the queries, as the LLMs may have presumed the authors were non-medical professionals, consequently yielding less scholarly and comprehensive replies. As this is the first study to compare these LLMs on this topic, further research should seek to rectify these deficiencies.


Consistency with the literature has been flagged by many previous studies investigating ChatGPT, so comparative performance in generating references was pertinent to this study. Bing AI demonstrated superior consistency compared with BARD and ChatGPT, often supplying relevant hyperlinks with which to fact-check its claims. Meanwhile, BARD failed to produce high-level references, and ChatGPT only recommended databases for users to search or generated aberrant references. However, Bing AI primarily cited health websites over scholarly articles and directed users to irrelevant web pages, resulting in a DISCERN score higher than BARD's but not surpassing ChatGPT's. Despite these constraints, LLMs' rapid information retrieval and summarization capacity makes them attractive for patients who are gathering information about emerging