In our study, we also aim to design a dry lab model for basic robotic suturing skills and to create DL CV models capable of automatically assessing a trainee's performance on suturing tasks. We further aim to determine whether participants are proficient at robotic suturing or need more practice.
METHODS
Study design and participants
Twenty-seven surgeons were recorded while completing two repetitions of two robotic tasks, backhand suturing and railroad suturing, on a bench-top model created by our lab. The bench-top model consisted of artificial skin padded with packaging foam taped to a wooden block, as shown in Figure 1. This assembly was placed inside a robotic simulation abdominal cavity.
The railroad suturing task consisted of performing a running stitch by driving the needle from one side of the wound to the opposite side and then re-entering next to where it exited. The backhand suturing task consisted of performing a continuous stitch by guiding the needle from the side closest to the operator to the opposite side.
Videos were recorded at 30 frames per second (FPS) and downsampled to 15 FPS for the study. This
prospective cohort study was approved by the Institutional Review Board of Northwell Health (IRB 23-069).
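The paper does not specify how the downsampling was performed; below is a minimal sketch of one way to do it, assuming OpenCV, with the function name and file paths being illustrative only:

```python
# Minimal sketch: halve 30 FPS footage to 15 FPS by keeping every other frame.
# OpenCV, the mp4v codec, and the paths are assumptions; the paper does not
# specify its downsampling tooling.
import cv2

def downsample_video(src_path: str, dst_path: str, keep_every: int = 2) -> None:
    cap = cv2.VideoCapture(src_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS)  # expected to be 30 for these recordings
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                          src_fps / keep_every, size)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % keep_every == 0:  # retain frames 0, 2, 4, ... -> 15 FPS
            out.write(frame)
        idx += 1
    cap.release()
    out.release()

downsample_video("task_recording_30fps.mp4", "task_recording_15fps.mp4")
```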
Video segmentation and data labeling
Video of each suturing task was broken down into a sequence of four sub-stitch phases: needle positioning,
needle targeting, needle driving, and needle withdrawal [Figure 1]. Using Encord (Cord Technologies
Limited, London, UK), every video was first temporally labeled into the four sub-stitch phases and then
each sub-stitch was annotated with a binary technical score (ideal or non-ideal) based on a previously
validated model [15,17]. The ideal/non-ideal classification reflects the operator's skill while performing the suturing action. The annotators were all surgical residents trained by a senior surgical resident. The annotators and engineers were blinded to the participants' identities and experience levels when performing the annotations and building the model. Sub-stitch annotations (labels) were mapped to frame-level annotations.
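As an illustration of this final mapping step, the sketch below converts temporal sub-stitch segments into per-frame labels; the segment tuple layout and phase-name strings are our assumptions and do not reflect Encord's export format:

```python
# Hypothetical sketch: map temporal sub-stitch annotations to frame-level labels.
# The segment tuple layout and phase names are assumptions, not Encord's format.
PHASES = ["background", "needle positioning", "needle targeting",
          "needle driving", "needle withdrawal"]

def segments_to_frame_labels(segments, n_frames):
    """segments: iterable of (start_frame, end_frame, phase, is_ideal).
    Returns per-frame phase ids and binary technical scores; frames outside
    any segment default to the background (no action) class with no score."""
    phase_ids = [0] * n_frames   # 0 = background
    scores = [None] * n_frames   # None = no technical score assigned
    for start, end, phase, is_ideal in segments:
        pid = PHASES.index(phase)
        for f in range(start, min(end + 1, n_frames)):
            phase_ids[f] = pid
            scores[f] = 1 if is_ideal else 0
    return phase_ids, scores

# e.g., a 10-frame "needle driving" segment scored as ideal:
ids, sc = segments_to_frame_labels([(30, 39, "needle driving", True)], 100)
```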
CV model
Our proposed system leverages a multi-model approach to surgical skill assessment. We employ two
distinct video DL models trained on overlapping 16-frame clips (stride of 1) extracted from the
videos: the first model classifies the clip into one of the four sub-stitch phases or a background (no action)
class, while the second model classifies the clip into a binary technical score. These models use the Video
Swin Transformer architecture to capture spatiotemporal features within the surgical workflow. To generate
a comprehensive skill assessment, the individual model predictions are aggregated and fed into a Random
Forest Classifier, which produces a final classification of trainee skill level, categorized as either “Proficient”
or “Trainee”. Proficiency in suturing was defined by a case experience of 50 robotic cases and was then confirmed by video review by two minimally invasive fellowship-trained attendings. This multi-faceted
approach aims to provide a robust and objective evaluation of surgical competency in robotic surgery
training [Figure 2].
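To make the two-stage design concrete, the sketch below shows how per-clip predictions from the two video models could be summarized into a fixed-length feature vector for the Random Forest; the specific summary features (time per phase, mean and fraction of ideal clips) are our assumptions, as the paper does not enumerate them:

```python
# Illustrative sketch of the aggregation stage. The summary features are
# assumptions; the paper does not list the exact inputs to the Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def clip_starts(n_frames: int, clip_len: int = 16, stride: int = 1):
    """Start indices of the overlapping 16-frame clips (stride of 1)."""
    return range(0, n_frames - clip_len + 1, stride)

def video_features(phase_probs: np.ndarray, ideal_probs: np.ndarray) -> np.ndarray:
    """phase_probs: (n_clips, 5) softmax over 4 sub-stitch phases + background,
    from the first video model. ideal_probs: (n_clips,) probability of an
    ideal technical score, from the second. Returns one vector per video."""
    phase_fraction = phase_probs.mean(axis=0)      # proportion of time per phase
    return np.concatenate([phase_fraction,
                           [ideal_probs.mean(), (ideal_probs > 0.5).mean()]])

# X: one feature vector per video; y: 1 = "Proficient", 0 = "Trainee"
# clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```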
At 15 FPS, 16 frames correspond to 1.067 s of video, which is a very short window for the models to identify which suturing sub-stitch is taking place and whether any mistakes are made during it. Rather than naively increasing the clip size, which would enlarge the memory footprint, we introduced a dilation (frame skip) of 15 frames between consecutive frames of the clip, resulting in an effective clip length of ~17 s. We employ dilation in both the training and testing phases, which improved sub-stitch classification.
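A sketch of the dilated sampling follows, under the reading that each of the 16 sampled frames stands for 16 source frames (15 skipped), which reproduces the stated ~17 s effective span; the helper name is illustrative:

```python
# Sketch of dilated (frame-skip) clip sampling. With 15 frames skipped between
# consecutive sampled frames, each of the 16 frames stands for 16 source
# frames, so a clip spans 16 * 16 = 256 frames ≈ 17 s at 15 FPS while keeping
# the same 16-frame memory footprint as an undilated clip.
def dilated_clip_indices(start: int, clip_len: int = 16, dilation: int = 15):
    step = dilation + 1
    return [start + i * step for i in range(clip_len)]

indices = dilated_clip_indices(0)        # [0, 16, 32, ..., 240]
effective_seconds = 16 * 16 / 15         # ~17.07 s of video per clip
```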

