Despite the advantages outlined herein, these TMs have several drawbacks. The need to optimize perfusion flow pressures and the lack of hilar dissection, clamping, and hemostasis management were identified as areas needing improvement. Overcoming these shortcomings will accelerate the evolution from basic benchtop
and part-task trainers to the development of realistic and accurate recreation of an entire PN procedure,
which would underpin effective surgical training.
Studies
The study populations were clinically heterogeneous, and the skill-level criteria used to
differentiate novices, intermediates, and experts varied considerably between studies. These criteria were
unclear, and expertise was defined based on the number of surgeries performed rather than the number of
PNs performed by the surgeon.
The face and content validity studies used qualitative questionnaires (i.e., based on Likert scales) that did not appear to be supported by validation evidence[9,19,29]. Responses were elicited from the participants over variable time frames, that is, up to one week after the use of the TM[14]. High ratings of realism and usefulness as a training tool were obtained mainly from expert evaluations. Furthermore, some studies enrolled novice surgeons with little to no PN operative experience[9,18,29,31].
One study used photographs of the models and the tasks performed to complete the evaluation[16]. The majority of the construct validity studies assessed video recordings[9,17,19,27,29], using expert assessors who were blinded to the experience level and identity of the surgeon performing the task. Time was employed as the main metric despite evidence demonstrating that it has a weak association with performance quality[38]. Only one concurrent validity study was conducted, using a single VR simulator, and no studies assessing predictive validity or the transfer of skills were identified.
In the studies reviewed, Likert-type scales, such as GEARS and GOALS, were used to evaluate users’ performance on the TMs, although it has been consistently demonstrated that they produce unreliable assessment measures[9,16,19,27,29,39]. No procedure-specific binary metrics were reported, and none of the tasks used performance errors as units of performance assessment. Furthermore, the methodology employed to train assessors in using the assessment scales was not reported, nor was an interrater reliability level.
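For context, interrater reliability is straightforward to report. For two raters assigning categorical scores, one common statistic (an illustrative choice on our part; the reviewed studies specified none) is Cohen’s kappa:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of agreement between the raters and $p_e$ is the proportion of agreement expected by chance; by the widely used Landis and Koch benchmarks, values above 0.61 suggest substantial agreement.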
All identified validation studies followed the nomenclature and methodology described by Messick[40] and Cronbach[41] rather than the framework described by Kane[18], reporting data on face, content, construct, and concurrent validation instead of using Kane’s validation processes (i.e., scoring, generalization, extrapolation, and implication)[18]. In the “Scoring inference”, the developed skill stations included different performance steps of the PN, and fairness was partially guaranteed by the production of standardized TMs.
However, the main problem was that scoring predominantly relied on global rating scales, with no reported attempts to demonstrate or address the reliability of the performance scores.
Furthermore, no effort was expended on the “Generalization inference”. The items used to assess performance were ill-defined. The researchers did not evaluate the reproducibility of scores, nor did they investigate the magnitude of measurement error; consequently, the sources of error were not identified.
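For illustration only (none of the reviewed studies reported such an analysis), the Generalization inference is commonly examined with generalizability theory, which decomposes the variance of observed scores into components attributable to separate sources, for example

$$\sigma^2_{\text{observed}} = \sigma^2_{\text{surgeon}} + \sigma^2_{\text{rater}} + \sigma^2_{\text{task}} + \sigma^2_{\text{residual}},$$

so that the contribution of raters and tasks to score variability can be quantified and the number of observations required for a dependable score can be estimated.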
In the “Extrapolation inference”, the studies reviewed here investigated whether the test domains reflected key aspects of the real PN, but no analysis was performed to evaluate the relationship between performance on the TMs and real-world operative performance. The same can be said about the “Implications inference” theme. Although a weak evaluation of the impact