
Lim et al. Plast Aesthet Res 2023;10:43  https://dx.doi.org/10.20517/2347-9264.2023.70  Page 3 of 11

               Table 1. Qualitative analysis results of large language models

               Question                                             Language model  1-strongly  2-disagree  3-neutral  4-agree  5-strongly
                                                                                    disagree                                    agree
               The language model provided accurate and reliable    ChatGPT-4                                           ×
               information on hand trauma nerve laceration          Bing’s AI                               ×
                                                                    Google’s BARD                           ×
               The information provided by the language model was   ChatGPT-4                                           ×
               easy to understand                                   Bing’s AI                   ×
                                                                    Google’s BARD                                       ×
               The language model conveyed empathy and maintained   ChatGPT-4                                           ×
               an appropriate tone                                  Bing’s AI                   ×
                                                                    Google’s BARD                                                ×
               The language model provided relevant information     ChatGPT-4                               ×
               quickly                                              Bing’s AI                   ×
                                                                    Google’s BARD                                       ×
               Overall performance                                  ChatGPT-4                                           ×
                                                                    Bing’s AI                   ×
                                                                    Google’s BARD                                       ×


               Timeliness: the language models' speed in generating pertinent information.


               Additionally, the readability and reliability of the LLMs’ responses were evaluated using specific metrics.
               The Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and the Coleman-Liau Index were employed for
               readability assessment, whereas the DISCERN score was utilized for reliability. The results are consolidated
               in Table 2 and were subsequently subjected to a t-test [Table 3] to appraise statistical significance.
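               For reference, the three readability indices above are closed-form functions of word, sentence, and
               syllable (or letter) counts. The sketch below is not from the paper; it applies the published formulas,
               with a crude vowel-group heuristic standing in for a proper syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    # Assumption: approximate syllables as runs of consecutive vowels.
    # Dictionary-based tools are more accurate; this is only illustrative.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    w = len(words)
    syllables = sum(count_syllables(word) for word in words)
    letters = sum(c.isalpha() for c in "".join(words))

    # Flesch Reading Ease: higher = easier to read
    fre = 206.835 - 1.015 * (w / sentences) - 84.6 * (syllables / w)
    # Flesch-Kincaid Grade Level: approximate U.S. school grade
    fkgl = 0.39 * (w / sentences) + 11.8 * (syllables / w) - 15.59
    # Coleman-Liau Index: L = letters per 100 words, S = sentences per 100 words
    L = letters / w * 100
    S = sentences / w * 100
    cli = 0.0588 * L - 0.296 * S - 15.8

    return {
        "flesch_reading_ease": fre,
        "flesch_kincaid_grade": fkgl,
        "coleman_liau": cli,
    }
```

               Dedicated packages such as textstat refine the syllable count, so scores from this sketch will differ
               slightly from published tools.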

               Figure 1 introduced the first scenario to the LLMs, reading “Hi Large Language Model, I am a 23-year-old
               male who has cut his right index finger with a knife. I am right hand dominant and a professional guitar
               player. I do not have any sensation on my right index fingertips. What do you think has occurred and what
               treatment do I require?” ChatGPT's response, which highlighted the prompt in red and warned of content
               policy violations, immediately suggested consulting professionals for clinical advice[5]. Nonetheless, it
               identified the correct affected nerve and outlined damage mitigation steps[6]. Notably, ChatGPT considered
               the user's right-handedness and occupation, urging prompt expert help and personalized care. It concluded
               by discouraging risky activities and reiterating the need for professional assistance. Google’s BARD
               presented a response between ChatGPT and Bing AI in quality, suggesting nerve injury and advising
               immediate physician consultation. It also recommended basic first-aid and concluded with insights into the
               injury’s nature, prognosis, and potential therapies[7]. Bing AI proffered a comparable response, elucidating
               its primary diagnosis of nerve damage, advocating for professional consultation and delineating possible
               treatment methods. Unlike ChatGPT, Bing AI did not propose an intermediate care model.

               Figure 2 shows the second prompt, which evaluated the models’ ability to follow up on previous queries
               (recall ability) with additional information. It read: “Same patient as the previous question, what surgical procedure do you
               recommend? Do I require any diagnostic test prior to surgical intervention?” ChatGPT reemphasized that it
               was ill-suited to provide recommendations and encouraged users to consult a physician. Nonetheless, it
               enumerated diagnostic tests typically conducted for injury diagnosis and severity assessment, concluding its
               response by mentioning common treatment modalities[8,9]. BARD furnished a concise response, advocating
               for surgical repair and explaining what it entailed, and listed the same diagnostic evaluations as ChatGPT. It
               expounded on the repair prognosis being dependent on various factors, with earlier intervention often