
Despite the advantages outlined herein, these TMs have several drawbacks. The need to optimize perfusion flow pressures and the absence of hilar dissection, clamping, and hemostasis management were identified as areas needing improvement. Overcoming these shortcomings will accelerate the evolution from basic benchtop and part-task trainers toward realistic and accurate recreations of an entire PN procedure, which would underpin effective surgical training.

               Studies
The clinical differentiation of the study populations was heterogeneous, and the skill-level criteria used to distinguish novices, intermediates, and experts varied considerably between studies. These criteria were unclear, and expertise was defined based on the total number of surgeries performed rather than the number of PNs performed by the surgeon.

The face and content validity studies used qualitative questionnaires (i.e., based on Likert scales) that did not appear to be supported by validation evidence [9,19,29]. Responses were elicited from participants in variable time frames, that is, up to one week after use of the TM [14]. High ratings of realism and usefulness of the training tools were mainly obtained from expert evaluations. Furthermore, some studies enrolled novice surgeons with little to no PN operative experience [9,18,29,31].


One study used photographs of the models and the tasks performed to complete the evaluation [16]. The majority of the construct validity studies assessed video recordings [9,17,19,27,29], using expert assessors who were blinded to the experience level and identity of the surgeon performing the task. Time was employed as the main metric despite evidence demonstrating that it has a weak association with performance quality [38]. Only one concurrent validity study was conducted, with a single VR simulator, and no studies assessing predictive validity or transfer of skills were identified.


In the studies reviewed, Likert-type scales, such as GEARS and GOALS, were used to evaluate users’ performance on the TMs, although it has been consistently demonstrated that such scales produce unreliable assessment measures [9,16,19,27,29,39]. No procedure-specific binary metrics were reported, and none of the tasks used performance errors as units of performance assessment. Furthermore, neither the methodology employed to train assessors in using the assessment scales nor an interrater reliability level was reported.


All identified validation studies followed the nomenclature and methodology described by Messick [40] and Cronbach [41] rather than the framework described by Kane [18], reporting data on face, content, construct, and concurrent validation instead of using Kane’s validation processes (i.e., scoring, generalization, extrapolation, and implication) [18]. In the “Scoring inference”, the developed skill stations included different performance steps of the PN, and fairness was partially guaranteed by the production of standardized TMs. However, the main problem was that scoring predominantly used global rating scales with no reported attempts to demonstrate or address the issue of performance score reliability.

Furthermore, no effort was expended in the “Generalization inference” area. The items used to assess performance were ill-defined. The researchers did not evaluate the reproducibility of scores, nor did they investigate the magnitude of performance error; therefore, the sources of error were not identified.


The studies reviewed here investigated whether the test domains reflected key aspects of the real PN, but no analysis was performed to evaluate the relationship between test performance and real-world performance. The same can be said about the “Implications inference” theme. Although a weak evaluation of the impact