

               not require de-identification, such as blurring of faces. All visible subjects provided formal written consent
               to be recorded and agreed to the use of the data for this research. We analyzed eight simulated surgical
               videos with a total runtime of approximately 40 min with our integrated HMR framework. Videos were
               gathered from multiple perspectives from a single hybrid OR and depicted team members, including a
               surgeon, scrub nurse, circulating nurse, and anesthesia nurse, entering the room, preparing the OR table
               along with associated technical instruments, and engaging in attentive hand-tool movements to mimic a
               real endovascular procedure. To demonstrate the utility of the derived HMR features in modeling human
               behaviors, we curated a dataset of 5-second clips with discernible, common actions exhibited in
               endovascular surgery. Specifically, we derived tracklet sequences from our simulated videos for each human
               subject, which we further separated into 864 5-second clips. We manually annotated each clip with common
               surgical actions, including (1) hand-tool interaction, (2) walking movement, and (3) visual observation of
               peers, ensuring that actions were mutually exclusive for each clip in our dataset. Our curated action dataset
               included 313 examples of “hand-tool interaction”, 91 examples of “walking movement”, and 460 examples
               of “visual observation of peers” [Table 1].
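
               As a minimal sketch of this clip curation step (not the authors' exact pipeline), the snippet below splits each subject's tracklet into non-overlapping 5-second clips to which a single action label is later attached. The `tracklets` structure and the 30 fps frame rate are assumptions for illustration.

               ```python
               # Sketch: splitting per-subject tracklets into fixed-length 5-second clips
               # for manual action annotation. `tracklets` is a hypothetical dict mapping
               # subject IDs to lists of per-frame mesh records; 30 fps is an assumed rate.

               FPS = 30                      # assumed video frame rate
               CLIP_LEN = 5 * FPS            # 5-second clips

               ACTIONS = ("hand_tool_interaction", "walking_movement", "visual_observation_of_peers")

               def make_clips(tracklets):
                   """Split each subject's tracklet into non-overlapping 5-second clips."""
                   clips = []
                   for subject_id, frames in tracklets.items():
                       for start in range(0, len(frames) - CLIP_LEN + 1, CLIP_LEN):
                           clips.append({
                               "subject_id": subject_id,
                               "frames": frames[start:start + CLIP_LEN],
                               "label": None,  # filled in manually with one ACTIONS entry
                           })
                   return clips
               ```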

               Surgical behavior analysis
               We performed a qualitative analysis of surgical scenes with mesh-derived visual attention, positioning, and
               movement metrics to enhance our understanding of how human behavior emerges from human mesh-
               based representations.


               Movements and positioning
               In comparing the positional heat maps displayed between different individuals, we found that individuals
               engaging in hand-tool interactions (normally near the operating table) have distinctly concentrated
               positional heatmap signatures compared to individuals engaging in walking movements [Figure 3]. While
               the positional heatmap signatures of individuals directly observing peer activities can vary in
               concentration, they are not always clearly discernible from those of individuals engaging in hand-tool
               interactions. We observed a similar trend in our graphical comparisons of movement patterns, noting
               substantially greater movement in walking-movement clips than in all other clips
               [Figure 4].
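
               One plausible way to compute such positional and movement summaries from mesh-derived positions is sketched below; it is an illustrative assumption rather than the authors' implementation. Per-frame floor-plane positions (e.g., a root-joint projection) are binned into a 2D occupancy heatmap, and movement is summarized as cumulative frame-to-frame displacement; the room extent and grid size are assumed values.

               ```python
               import numpy as np

               def positional_heatmap(positions_xy, xy_range=((0.0, 8.0), (0.0, 8.0)), bins=64):
                   """Accumulate per-frame floor-plane (x, y) positions into a 2D occupancy heatmap.
                   `positions_xy` is an (N, 2) array, e.g., root-joint positions projected onto the floor.
                   The room extent `xy_range` (in meters) and bin count are assumed values."""
                   heatmap, _, _ = np.histogram2d(
                       positions_xy[:, 0], positions_xy[:, 1], bins=bins, range=xy_range
                   )
                   return heatmap / max(heatmap.sum(), 1e-8)  # normalize to a distribution

               def movement_magnitude(positions_xy):
                   """Cumulative frame-to-frame displacement over a clip (one simple movement metric)."""
                   steps = np.diff(positions_xy, axis=0)
                   return float(np.linalg.norm(steps, axis=1).sum())
               ```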


               Visual attention
               Analysis of the visual attention profiles of individuals revealed that the distribution of visual attention
               differs significantly depending on whether a given subject is engaging in hand-tool interactions, walking
               movements, or observation of peers [Figure 5]. Unlike the positional heatmaps and movement pattern graphs, the
               visual attention maps displayed clear qualitative differences in the dispersion of attention between subjects
               engaging in hand-tool interactions and those observing peers.
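
               As a hedged sketch of one way such an attention map could be derived from the recovered mesh, the snippet below casts a gaze ray from the head position along an estimated facial direction and accumulates its intersections with the floor plane into a 2D map. The specific joint convention, gaze-direction estimate, and floor-plane target are assumptions, not the paper's stated method.

               ```python
               import numpy as np

               def attention_map(head_positions, gaze_directions,
                                 xy_range=((0.0, 8.0), (0.0, 8.0)), bins=64):
                   """Accumulate gaze-ray intersections with the floor plane (z = 0) into a 2D attention map.
                   `head_positions`: (N, 3) head-joint positions; `gaze_directions`: (N, 3) unit vectors
                   derived from the head/facial orientation of the recovered mesh (an assumed convention)."""
                   hits = []
                   for p, d in zip(head_positions, gaze_directions):
                       if d[2] >= -1e-6:          # skip rays parallel to or pointing away from the floor
                           continue
                       t = -p[2] / d[2]           # ray parameter where the ray reaches z = 0
                       hits.append(p[:2] + t * d[:2])
                   if not hits:
                       return np.zeros((bins, bins))
                   hits = np.asarray(hits)
                   hmap, _, _ = np.histogram2d(hits[:, 0], hits[:, 1], bins=bins, range=xy_range)
                   return hmap / max(hmap.sum(), 1e-8)
               ```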


               Recovering actions from mesh sequences
               Motivated by our qualitative observations of action-specific differences in visual attention, positional
               metrics, and movement metrics, we leveraged sequences of mesh-based embeddings for the classification of
               common surgical actions from 5-second tracklet clips.
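
               A minimal sketch of this kind of clip-level classifier is given below, assuming each clip is represented as a sequence of flattened 3D joint positions fed to a small recurrent model. The GRU architecture, joint count, frame rate, and dimensions are illustrative assumptions and do not reproduce the paper's exact model.

               ```python
               import torch
               import torch.nn as nn

               class ClipActionClassifier(nn.Module):
                   """Illustrative sequence classifier over per-frame mesh embeddings.
                   Each clip is a tensor of shape (T, J * 3): T frames of flattened 3D joint
                   positions. Architecture and sizes are assumptions, not the paper's model."""

                   def __init__(self, num_joints=24, hidden_dim=128, num_actions=3):
                       super().__init__()
                       self.gru = nn.GRU(input_size=num_joints * 3, hidden_size=hidden_dim,
                                         batch_first=True)
                       self.head = nn.Linear(hidden_dim, num_actions)

                   def forward(self, clips):              # clips: (batch, T, J * 3)
                       _, last_hidden = self.gru(clips)   # last_hidden: (1, batch, hidden_dim)
                       return self.head(last_hidden[-1])  # logits over the three action classes


               # Usage sketch: a batch of 5-second clips at an assumed 30 fps (150 frames).
               model = ClipActionClassifier()
               dummy_clips = torch.randn(8, 150, 24 * 3)
               logits = model(dummy_clips)                # shape (8, 3)
               ```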


               Experiments on the choice of mesh embedding representation showed that composing mesh embeddings
               from 3D joint positions improved model performance in F1 score, precision, and recall by 0.03, 0.04,
               and 0.04, respectively, compared to representing mesh embeddings as 3D joint poses [Table 2, bolded
               entries]. In both representation strategies, we observed notable performance improvements with the
               inclusion of joints from the “cranial” and “arm” categories, and only minor performance differences from
               the further inclusion of joints in the “thorax”, “spine”, and “leg” categories. In our experiments with 3D joint