not require de-identification, such as blurring of faces. All visible subjects provided formal written consent
to be recorded and agreed to the use of the data for this research. We analyzed eight simulated surgical
videos with a total runtime of approximately 40 min with our integrated HMR framework. Videos were
gathered from multiple perspectives within a single hybrid OR and depicted team members, including a
surgeon, scrub nurse, circulating nurse, and anesthesia nurse, entering the room, preparing the OR table
along with associated technical instruments, and engaging in attentive hand-tool movements to mimic a
real endovascular procedure. To demonstrate the utility of the derived HMR features in modeling human
behaviors, we curated a dataset of 5-second clips with discernible, common actions exhibited in
endovascular surgery. Specifically, we derived tracklet sequences from our simulated videos for each human
subject, which we further separated into 864 5-second clips. We manually annotated each clip with common
surgical actions, including (1) hand-tool interaction, (2) walking movement, and (3) visual observation of
peers, ensuring that actions were mutually exclusive for each clip in our dataset. Our curated action dataset
included 313 examples of “hand-tool interaction”, 91 examples of “walking movement”, and 460 examples
of “visual observation of peers” [Table 1].
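For illustration, a minimal sketch of this clip-curation step is given below; the 25 fps frame rate and the array layout of each tracklet are assumptions made for the example, not details reported above.

import numpy as np

# Hypothetical curation sketch: assumes each subject's tracklet is an array of
# per-frame mesh features and that the videos run at 25 fps (the actual frame
# rate is not stated here).
FPS = 25
CLIP_SECONDS = 5
CLIP_LEN = FPS * CLIP_SECONDS  # frames per 5-second clip

ACTIONS = ("hand-tool interaction", "walking movement", "visual observation of peers")

def split_into_clips(tracklet: np.ndarray) -> list[np.ndarray]:
    """Split one subject's tracklet (frames x features) into 5-second clips."""
    n_clips = len(tracklet) // CLIP_LEN
    return [tracklet[i * CLIP_LEN:(i + 1) * CLIP_LEN] for i in range(n_clips)]

Each resulting clip is then manually assigned exactly one of the three action labels above.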
Surgical behavior analysis
We performed a qualitative analysis of surgical scenes with mesh-derived visual attention, positioning, and
movement metrics to enhance our understanding of how human behavior emerges from human mesh-
based representations.
Movements and positioning
In comparing the positional heatmaps of different individuals, we found that individuals engaging in
hand-tool interactions (typically near the operating table) have distinctly concentrated positional heatmap
signatures compared to individuals engaging in walking movements [Figure 3]. While the positional
heatmap signatures can vary in concentration for individuals directly observing peer activities, they are not
always clearly discernible from those of individuals engaging in hand-tool interactions. We observed a
similar trend in our graphical comparisons of movement patterns, noting substantially greater movement in
walking movement clips than in all other clips [Figure 4].
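As a rough sketch (not the authors' exact implementation), a positional heatmap and a simple movement metric of the kind compared above can be derived from the per-frame root translations of the recovered meshes:

import numpy as np

def positional_heatmap(root_xyz: np.ndarray, bins: int = 64) -> np.ndarray:
    """Bin a subject's mesh root translations (frames x 3, metres) over the
    floor plane; assumes y is the vertical axis."""
    heatmap, _, _ = np.histogram2d(root_xyz[:, 0], root_xyz[:, 2], bins=bins)
    return heatmap / max(heatmap.sum(), 1.0)  # normalise to a spatial distribution

def movement(root_xyz: np.ndarray) -> float:
    """Cumulative root displacement over the clip, a simple movement metric."""
    return float(np.linalg.norm(np.diff(root_xyz, axis=0), axis=1).sum())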
Visual attention
Analysis of the visual attention profiles of individuals revealed that the distribution of visual attention
differs markedly when a given subject is engaging in hand-tool interactions, walking movements, or
observation of peers [Figure 5]. Unlike the positional heatmaps and movement pattern graphs, the visual
attention maps displayed clear qualitative differences in the dispersion of attention between subjects
engaging in hand-tool interactions and observations of peers.
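One plausible way to build such an attention map, assuming a per-frame head position and forward viewing direction are available from the recovered head joints, is to cast the viewing ray onto the floor plane and accumulate the intersection points; the estimate used in the paper may differ.

import numpy as np

def attention_map(head_pos: np.ndarray, head_fwd: np.ndarray,
                  bins: int = 64, extent: float = 8.0) -> np.ndarray:
    """head_pos, head_fwd: (frames, 3) arrays; extent: assumed room size in metres.
    Assumes y is the vertical axis and the gaze has a downward component."""
    # Scale each forward vector so the ray from the head reaches the floor (y = 0).
    t = -head_pos[:, 1] / np.minimum(head_fwd[:, 1], -1e-3)
    hits = head_pos + t[:, None] * head_fwd  # floor intersection points
    grid, _, _ = np.histogram2d(hits[:, 0], hits[:, 2],
                                bins=bins, range=[[0.0, extent], [0.0, extent]])
    return grid / max(grid.sum(), 1.0)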
Recovering actions from mesh sequences
Motivated by our qualitative observations of action-specific differences in visual attention, positional
metrics, and movement metrics, we leveraged sequences of mesh-based embeddings for the classification of
common surgical actions from 5-second tracklet clips.
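As a minimal illustration of this classification setup, assuming each clip is stored as a (frames, joints, 3) array of 3D joint positions together with its manually annotated action label, one could pool a chosen joint subset over time and fit a linear classifier; the classifier and joint groupings actually evaluated are those reported in Table 2, and the helper names below are hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def clip_embedding(clip: np.ndarray, joint_ids: list[int]) -> np.ndarray:
    """Pool the selected joints' 3D positions over time (mean and std per joint)."""
    joints = clip[:, joint_ids, :]  # (frames, k, 3)
    return np.concatenate([joints.mean(0), joints.std(0)], axis=-1).ravel()

def evaluate(clips: np.ndarray, labels: np.ndarray, joint_ids: list[int]):
    """Train/test split, linear classifier, and macro-averaged precision/recall/F1."""
    X = np.stack([clip_embedding(c, joint_ids) for c in clips])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                              stratify=labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    p, r, f1, _ = precision_recall_fscore_support(y_te, clf.predict(X_te),
                                                  average="macro")
    return p, r, f1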
Experiments on the choice of mesh embedding representation showed that composing mesh embeddings
from 3D joint positions improved model performance in F1 score, precision, and recall by 0.03, 0.04,
and 0.04, respectively, compared to representing mesh embeddings as 3D joint poses [Table 2, bolded
entries]. In both representation strategies, we observed notable performance improvements upon
inclusion of joints from the “cranial” and “arm” categories, with minor performance differences seen in the
further inclusion of joints in the “thorax”, “spine”, and “leg” categories. In our experiments with 3D joint

