
Page 4 of 11               Lim et al. Plast Aesthet Res 2023;10:43  https://dx.doi.org/10.20517/2347-9264.2023.70

               Table 2. Readability and reliability scores comparing the three LLMs, including t-test analysis

                                                                 Readability                                                 Reliability
               LLM             Prompt                            Flesch reading ease   Flesch-Kincaid grade   Coleman-Liau   DISCERN
                                                                 score                 level                  index          score
               ChatGPT-4       No sensation on right fingertip   55.5                  11.3                   8              56
                               Recommended surgical procedure    32.5                  13.1                   14             54
                               Likely outcome of complete repair 37                    12.4                   14             45
                               Timeframe of complete return      38.2                  12.8                   12             49
                               Options aside from surgery        28.2                  14.6                   15             52
               Mean                                              38.28                 12.84                  12.6           51.2
               Standard deviation                                10.41                 1.20                   2.79           4.32

               Google’s BARD   No sensation on right fingertip   51                    10.5                   9              47
                               Recommended surgical procedure    35.2                  13.2                   12             40
                               Likely outcome of complete repair 48.2                  10.3                   9              55
                               Timeframe of complete return      41.6                  13.7                   10             48
                               Options aside from surgery        41.2                  12.2                   13             55
               Mean                                              43.44                 11.98                  10.60          49
               Standard deviation                                6.25                  1.54                   1.82           6.28

               Bing’s AI       No sensation on right fingertip   73                    7                      6              44
                               Recommended surgical procedure    57.4                  10                     9              49
                               Likely outcome of complete repair NIL                   NIL                    NIL            NIL
                               Timeframe of complete return      50.1                  9.6                    10             50
                               Options aside from surgery        72.6                  6.1                    8              52
               Mean                                              63.28                 8.18                   8.25           42.2
               Standard deviation                                11.40                 1.92                   1.71           14.9
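
               The mean and standard deviation rows in Table 2 can be checked directly from the per-prompt scores. The short Python sketch below is an illustration only, not the authors' analysis code. It assumes the standard deviation is the sample (n - 1) form, which reproduces the reported ChatGPT-4 readability values; the variable and function names are hypothetical.

```python
from statistics import mean, stdev

# ChatGPT-4 per-prompt scores copied from Table 2.
chatgpt_flesch_reading_ease = [55.5, 32.5, 37.0, 38.2, 28.2]
chatgpt_flesch_kincaid_grade = [11.3, 13.1, 12.4, 12.8, 14.6]


def summarize(scores):
    """Return (mean, sample standard deviation), each rounded to 2 decimals."""
    # Assumption: sample SD (n - 1 denominator), not population SD.
    return round(mean(scores), 2), round(stdev(scores), 2)


fre_mean, fre_sd = summarize(chatgpt_flesch_reading_ease)
fkgl_mean, fkgl_sd = summarize(chatgpt_flesch_kincaid_grade)

print(fre_mean, fre_sd)    # Flesch reading ease score
print(fkgl_mean, fkgl_sd)  # Flesch-Kincaid grade level
```

               Under that assumption, the two printed pairs match the ChatGPT-4 "Mean" and "Standard deviation" rows for the first two readability columns.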


               Table 3. Student's t-test analysis

               T-test analysis       Flesch reading ease score   Flesch-Kincaid grade level   Coleman-Liau index   DISCERN score
               BARD vs. Bing AI      0.01                        0.01                         0.08                 0.37
               BARD vs. ChatGPT      0.01                        0.002                        0.03                 0.23
               Bing AI vs. ChatGPT   0.36                        0.35                         0.21                 0.53
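
               As a rough illustration of how a two-sample Student's t-test operates on scores like those in Table 2, the sketch below computes the t statistic for the Flesch reading ease scores of BARD and Bing AI. The text does not state which variant of the test was used, so the unpaired, equal-variance (pooled) form and the exclusion of Bing AI's "NIL" response are assumptions made here; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Flesch reading ease scores copied from Table 2.
bard = [51.0, 35.2, 48.2, 41.6, 41.2]
bing = [73.0, 57.4, 50.1, 72.6]  # assumption: the "NIL" response is excluded


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled (equal) variance.

    Returns the t statistic and its degrees of freedom.
    """
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, df


t_stat, df = pooled_t_statistic(bard, bing)
print(round(t_stat, 2), df)
```

               Turning the statistic into a p-value requires the t-distribution's cumulative distribution function, which the Python standard library does not provide; `scipy.stats.ttest_ind(bard, bing, equal_var=True)` is one way to obtain both values in a single call.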


               producing better outcomes. Augmenting ChatGPT’s reply, BARD outlined postoperative rehabilitation
               strategies, encompassing orthotic support and physical therapy for functional recovery [10]. Unlike ChatGPT,
               BARD did not stress the importance of seeking expert counsel. Bing AI acknowledged its limitations by
               suggesting consultation with a professional and highlighting the importance of diagnostic assessments.
               However, it failed to delineate precise examinations and suitable therapeutic alternatives. Ultimately, it
               offered an indistinct summary compared to ChatGPT and BARD.

               In Figure 3, the inquiry “If I have completely lacerated my digital nerve, what is the likely outcome of being
               completely repaired? Provide 5 high-level evidence studies to support your answer.” sought to assess the
               models’ capacity to supply pertinent references and predict surgical outcomes. ChatGPT delivered a vague