Lim et al. Plast Aesthet Res 2023;10:43 https://dx.doi.org/10.20517/2347-9264.2023.70 Page 3 of 11
Table 1. Qualitative analysis results of large language models

| Question | Language model | Rating (1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree) |
|---|---|---|
| The language model provided accurate and reliable information on hand trauma nerve laceration | ChatGPT-4 | × |
| | Bing's AI | × |
| | Google's BARD | × |
| The information provided by the language model was easy to understand | ChatGPT-4 | × |
| | Bing's AI | × |
| | Google's BARD | × |
| The language model conveyed empathy and maintained an appropriate tone | ChatGPT-4 | × |
| | Bing's AI | × |
| | Google's BARD | × |
| The language model provided relevant information quickly | ChatGPT-4 | × |
| | Bing's AI | × |
| | Google's BARD | × |
| Overall performance | ChatGPT-4 | × |
| | Bing's AI | × |
| | Google's BARD | × |
Timeliness: the language models' speed in generating pertinent information.
Additionally, the readability and reliability of the LLMs' responses were evaluated using specific metrics.
The Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index were employed for
readability assessment, whereas the DISCERN score was utilized for reliability. The results are consolidated
in Table 2 and subsequently subjected to a t-test [Table 3] for statistical significance appraisal.
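The three readability indices named above are closed-form formulas over simple text counts. As a rough sketch of how such scores could be computed (using a naive vowel-group syllable counter, which is an approximation; published readability tools use more careful syllable rules and dictionaries), one might write:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, drop a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    """Compute Flesch Reading Ease, Flesch-Kincaid Grade Level, and Coleman-Liau Index."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    letters = sum(len(w) for w in words)

    # Flesch Reading Ease: higher = easier to read
    fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    # Flesch-Kincaid Grade Level: approximate US school grade
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    # Coleman-Liau Index: uses letters and sentences per 100 words
    L = letters / n_words * 100   # average letters per 100 words
    S = sentences / n_words * 100  # average sentences per 100 words
    cli = 0.0588 * L - 0.296 * S - 15.8

    return {
        "flesch_reading_ease": round(fre, 1),
        "flesch_kincaid_grade": round(fkgl, 1),
        "coleman_liau_index": round(cli, 1),
    }
```

Short, monosyllabic sentences score high on Reading Ease and low on grade level, while dense clinical prose scores the opposite, which is why these indices are commonly used to judge whether patient-facing text is accessible.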
Figure 1 introduced the first scenario to the LLMs, reading: “Hi Large Language Model, I am a 23-year-old
male who has cut his right index finger with a knife. I am right hand dominant and a professional guitar
player. I do not have any sensation on my right index fingertips. What do you think has occurred and what
treatment do I require?” ChatGPT's response, which highlighted the prompt in red and warned of content
policy violations, immediately suggested consulting professionals for clinical advice[5]. Nonetheless, it
identified the correct affected nerve and outlined damage mitigation steps[6]. Notably, ChatGPT considered
the user's right-handedness and occupation, urging prompt expert help and personalized care. It concluded
by discouraging risky activities and reiterating the need for professional assistance. Google's BARD
presented a response intermediate in quality between ChatGPT and Bing AI, suggesting nerve injury and advising
immediate physician consultation. It also recommended basic first aid and concluded with insights into the
injury's nature, prognosis, and potential therapies[7]. Bing AI offered a comparable response, explaining
its primary diagnosis of nerve damage, advocating professional consultation, and outlining possible
treatment methods. Unlike ChatGPT, Bing AI did not propose an intermediate care model.
Figure 2 aimed to evaluate the models' ability to follow up on previous queries (recall ability) with
additional information. This read: “Same patient as the previous question, what surgical procedure do you
recommend? Do I require any diagnostic test prior to surgical intervention?” ChatGPT reemphasized that it
was not suited to provide recommendations and encouraged users to consult a physician. Nonetheless, it
enumerated the diagnostic tests typically conducted for injury diagnosis and severity assessment, concluding its
response by mentioning common treatment modalities[8,9]. BARD furnished a concise response, advocating
for surgical repair and explaining what it entailed, and listed the same diagnostic evaluations as ChatGPT. It
explained that the repair prognosis depends on various factors, with earlier intervention often