Page 4 of 11 Lim et al. Plast Aesthet Res 2023;10:43 https://dx.doi.org/10.20517/2347-9264.2023.70

Table 2. Readability and reliability scores comparing the three LLMs, including t-test analysis
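| LLM | Question | Flesch reading ease score (readability) | Flesch-Kincaid grade level (readability) | Coleman-Liau index (readability) | DISCERN score (reliability) |
|---|---|---|---|---|---|
| ChatGPT-4 | No sensation on right fingertip | 55.5 | 11.3 | 8 | 56 |
| ChatGPT-4 | Recommended surgical procedure | 32.5 | 13.1 | 14 | 54 |
| ChatGPT-4 | Likely outcome of complete repair | 37 | 12.4 | 14 | 45 |
| ChatGPT-4 | Timeframe of complete return | 38.2 | 12.8 | 12 | 49 |
| ChatGPT-4 | Options aside from surgery | 28.2 | 14.6 | 15 | 52 |
| ChatGPT-4 | Mean | 38.28 | 12.84 | 12.6 | 51.2 |
| ChatGPT-4 | Standard deviation | 10.41 | 1.20 | 2.79 | 4.32 |
| Google’s BARD | No sensation on right fingertip | 51 | 10.5 | 9 | 47 |
| Google’s BARD | Recommended surgical procedure | 35.2 | 13.2 | 12 | 40 |
| Google’s BARD | Likely outcome of complete repair | 48.2 | 10.3 | 9 | 55 |
| Google’s BARD | Timeframe of complete return | 41.6 | 13.7 | 10 | 48 |
| Google’s BARD | Options aside from surgery | 41.2 | 12.2 | 13 | 55 |
| Google’s BARD | Mean | 43.44 | 11.98 | 10.60 | 49 |
| Google’s BARD | Standard deviation | 6.25 | 1.54 | 1.82 | 6.28 |
| Bing’s AI | No sensation on right fingertip | 73 | 7 | 6 | 44 |
| Bing’s AI | Recommended surgical procedure | 57.4 | 10 | 9 | 49 |
| Bing’s AI | Likely outcome of complete repair | NIL | NIL | NIL | NIL |
| Bing’s AI | Timeframe of complete return | 50.1 | 9.6 | 10 | 50 |
| Bing’s AI | Options aside from surgery | 72.6 | 6.1 | 8 | 52 |
| Bing’s AI | Mean | 63.28 | 8.18 | 8.25 | 42.2 |
| Bing’s AI | Standard deviation | 11.40 | 1.92 | 1.71 | 14.9 |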
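
The Mean and Standard deviation rows of Table 2 can be recomputed from the per-question scores. The following is a minimal sketch, not the authors' analysis code. It uses ChatGPT-4's Flesch reading ease column and assumes the table reports the sample (n − 1) standard deviation, which is the convention that reproduces the printed values. The names `chatgpt_flesch` and `summarize` are illustrative only.

```python
from statistics import mean, stdev

# Flesch reading ease scores for ChatGPT-4's five responses, copied from Table 2.
chatgpt_flesch = [55.5, 32.5, 37.0, 38.2, 28.2]


def summarize(scores):
    """Return (mean, sample standard deviation), each rounded to 2 decimals."""
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    return round(mean(scores), 2), round(stdev(scores), 2)


chatgpt_mean, chatgpt_sd = summarize(chatgpt_flesch)
print(chatgpt_mean, chatgpt_sd)  # 38.28 10.41
```

The same function can be applied to any other column. For Bing's AI, the NIL row would have to be dropped first, leaving four values instead of five.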

Table 3. Student T-test analysis
| T-test analysis | Flesch reading ease score | Flesch-Kincaid grade level | Coleman-Liau index | DISCERN score |
|---|---|---|---|---|
| BARD vs. Bing AI | 0.01 | 0.01 | 0.08 | 0.37 |
| BARD vs. ChatGPT | 0.01 | 0.002 | 0.03 | 0.23 |
| Bing AI vs. ChatGPT | 0.36 | 0.35 | 0.21 | 0.53 |
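
As an illustration of how one entry in Table 3 could be obtained, the sketch below computes a pooled-variance (Student) two-sample t statistic for the Flesch reading ease scores of BARD and Bing AI from Table 2. It rests on several assumptions not stated in the text: that the test is the equal-variance, two-tailed form, that the table entries are p-values, and that Bing AI's NIL row is simply omitted. The function name `pooled_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Flesch reading ease scores copied from Table 2.
bard_flesch = [51.0, 35.2, 48.2, 41.6, 41.2]
bing_flesch = [73.0, 57.4, 50.1, 72.6]  # assumption: the NIL row is dropped


def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) for an equal-variance t-test."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error, df


t_stat, df = pooled_t(bard_flesch, bing_flesch)
print(f"t = {t_stat:.2f}, df = {df}")  # t = -3.35, df = 7
```

Converting the statistic to a p-value requires the t distribution, for example via `scipy.stats.ttest_ind`, which assumes equal variances by default. With |t| of about 3.35 on 7 degrees of freedom, the two-tailed p-value lies between 0.01 and 0.02, which is in line with the 0.01 listed for BARD vs. Bing AI.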

producing better outcomes. Augmenting ChatGPT’s reply, BARD outlined postoperative rehabilitation
strategies, encompassing orthotic support and physical therapy for functional recovery [10]. Unlike ChatGPT,
BARD did not stress the importance of seeking expert counsel. Bing AI acknowledged its limitations by
suggesting consultation with a professional and highlighting the importance of diagnostic assessments.
However, it failed to delineate precise examinations and suitable therapeutic alternatives. Ultimately, it
offered an indistinct summary compared to ChatGPT and BARD.

In Figure 3, the inquiry “If I have completely lacerated my digital nerve, what is the likely outcome of being
completely repaired? Provide 5 high-level evidence studies to support your answer.” sought to assess the
models’ capacity to supply pertinent references and predict surgical outcomes. ChatGPT delivered a vague
