movement patterns between subjects in transition and those focusing on a stationary task can be applied to
identify transition periods in an OR procedure and instances of supply retrieval. Furthermore, visual
attention profiles can provide a broad assessment of one's focus on a stationary task and be used to
discriminate between surgical tasks that require different levels of visual attention.
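As a minimal illustration of how such movement cues might be operationalized, the Python sketch below flags candidate transition frames from the speed of a recovered pelvic joint; the joint choice, frame rate, and speed threshold are assumptions for illustration rather than parameters from our experiments.

```python
import numpy as np

def flag_transition_frames(pelvis_xyz, fps=30.0, speed_thresh=0.3):
    """Flag frames where the recovered pelvic joint moves quickly, as a coarse
    proxy for a subject transitioning between stationary tasks.

    pelvis_xyz: (T, 3) array of 3D pelvic joint positions (meters) per frame.
    speed_thresh: assumed threshold in m/s, chosen only for illustration.
    """
    # Per-frame displacement converted to speed (m/s).
    speed = np.linalg.norm(np.diff(pelvis_xyz, axis=0), axis=1) * fps
    in_transition = speed > speed_thresh
    # Pad so the boolean mask aligns with the original T frames.
    return np.concatenate([[False], in_transition])
```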
These qualitative behavioral differences served as inspiration for comprehensive ablation studies on
recovering surgical actions from mesh sequences. Overall, these studies provided critical insights for
applying mesh-level features to downstream surgical prediction tasks. In comparing model performance
when training with different mesh embedding compositions, we found that constructing mesh embeddings
from 3D joint positions resulted in improved performance over 3D joint pose compositions. One possible
explanation for this is that unlike joint poses, joint positions implicitly capture poses while carrying
information about a subject’s position in the overall scene. The observed performance difference suggests
that scene positioning is important for distinguishing between surgical actions and that our action
recognition model can learn poses from joint positions.
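In its simplest form, the position-based composition can be viewed as flattening each frame's recovered 3D joint positions into a single feature vector; the sketch below illustrates this idea and omits any normalization or additional mesh features used in our pipeline.

```python
import numpy as np

def position_embedding(joints_xyz):
    """Build a position-based embedding sequence from recovered joints.

    joints_xyz: (T, J, 3) array of 3D joint positions for T frames and J joints.
    Returns a (T, J * 3) sequence suitable as input to a sequence model.
    Unlike joint rotations (poses), these coordinates implicitly encode both
    body configuration and the subject's location in the scene.
    """
    T, J, _ = joints_xyz.shape
    return joints_xyz.reshape(T, J * 3)
```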
Leveraging this finding, we studied the impact of various joint categories on model performance and
observed that the inclusion of joints from the “pelvic”, “arm”, and “cranial” joint categories was optimal.
This quantitative result was consistent with previous qualitative observations, underscoring the differences
in attention, movement, and positional patterns between human subjects performing different surgical
actions. Interestingly, the further inclusion of joints in the “thorax”, “spine”, and “leg” categories resulted in
successive performance drops. One possible explanation for this trend is that estimations of joints from
these categories may be more imprecise due to higher tendencies for occlusion by adjacent equipment,
specifically in joints of the “spine” and “leg” categories. We observed this phenomenon in a recovered
human mesh in Figure 2, row 2, which erroneously modeled a standing subject in a sitting position. Similar
to the effects of occlusion, each subject in our videos displayed a homogeneous appearance due to their
surgical attire, which may have affected the precision of pelvic and spine joint estimations. These challenges
have been observed less frequently in previous HMR studies of natural imagery, where the upper- and
lower-body attire of human subjects typically provides distinctive features [17,19].
Future work should explore methods to mitigate these errors and assess the uncertainty of joint predictions
in surgical scenes.
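Concretely, a joint-category ablation amounts to selecting a subset of joint indices before constructing the embedding; the index groupings below are hypothetical placeholders, since the actual category-to-index mapping depends on the joint layout of the underlying mesh model.

```python
import numpy as np

# Hypothetical category-to-index mapping, for illustration only.
JOINT_CATEGORIES = {
    "pelvic": [0, 1, 2],
    "spine": [3, 6, 9],
    "thorax": [13, 14, 16, 17],
    "cranial": [12, 15],
    "arm": [18, 19, 20, 21],
    "leg": [4, 5, 7, 8, 10, 11],
}

def ablate_to_categories(joints_xyz, keep_categories):
    """Keep only joints from the requested categories.

    joints_xyz: (T, J, 3) array of recovered 3D joint positions.
    keep_categories: e.g., ("pelvic", "arm", "cranial").
    """
    keep = sorted(i for c in keep_categories for i in JOINT_CATEGORIES[c])
    return joints_xyz[:, keep, :]  # (T, len(keep), 3)
```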
Follow-up investigations into individual joints in the “arm” and “cranial” joint categories provided
empirical evidence on the importance of individual joints that are closely tied to arm movements and visual
attention for discriminating between surgical actions. Specifically, we observed considerable, isolated improvements
to model performance over ablated baselines when testing the separate inclusion of (1) pelvic joints, (2)
anchoring joints for visual field computation (Section “Surgical behavior analysis”), such as the head and ear
joints, and (3) arm joints, such as the wrist and elbow joints. Due to the importance of modeling intricate
hand movements to analyze surgical performance, we hope to perform future studies that recover finger
joints to discriminate between different hand movements. While a granular understanding of hand
geometry was not central to our study of basic actions, our findings lay the groundwork for future studies
on hand movements by providing evidence that mesh sequences can effectively encode physical actions.
Furthermore, previous HMR studies have demonstrated the recovery of finger joints from in-the-wild
scenes, supporting the feasibility of this research direction [28,29].
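As an example of how the visual-field anchoring joints can be used, the sketch below derives a coarse horizontal view direction from the head and ear joints; the orientation heuristic is an assumption for illustration and simplifies the computation described in the Section "Surgical behavior analysis".

```python
import numpy as np

def coarse_view_direction(head, left_ear, right_ear):
    """Estimate a coarse horizontal facing direction from anchoring joints.

    head, left_ear, right_ear: 3D positions (x, y, z) with z pointing up.
    Assumes an approximately upright subject; the facing direction is taken
    as the horizontal vector perpendicular to the ear-to-ear axis, oriented
    toward the side on which the head joint lies (an illustrative heuristic).
    """
    ear_axis = right_ear - left_ear
    ear_mid = 0.5 * (left_ear + right_ear)
    # Horizontal vector perpendicular to the ear-to-ear axis.
    facing = np.array([-ear_axis[1], ear_axis[0], 0.0])
    if np.dot(facing, head - ear_mid) < 0:
        facing = -facing
    norm = np.linalg.norm(facing)
    return facing / norm if norm > 0 else facing
```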
Altogether, our findings demonstrated that surgical actions can be consistently recovered from sequences of
human mesh features alone. Conventional approaches that directly use
video frames to make predictions can be prone to overfitting, and our approach may circumvent this due to

