
movement patterns between subjects in transition and those focused on a stationary task can be applied to identify transition periods in an OR procedure and instances of supply retrieval. Furthermore, visual
               attention profiles can provide a broad assessment of one’s focus on a stationary task and be used to
discriminate between surgical tasks that require different levels of visual attention.


               These qualitative behavioral differences served as inspiration for comprehensive ablation studies on
               recovering surgical actions from mesh sequences. Overall, these studies provided critical insights for
               applying mesh-level features to downstream surgical prediction tasks. In comparing model performance
               when training with different mesh embedding compositions, we found that constructing mesh embeddings
               from 3D joint positions resulted in improved performance over 3D joint pose compositions. One possible
               explanation for this is that unlike joint poses, joint positions implicitly capture poses while carrying
               information about a subject’s position in the overall scene. The observed performance difference suggests
that scene positioning is important for distinguishing surgical actions and that our action recognition model can learn poses from joint positions.
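To make the embedding comparison concrete, the sketch below illustrates one way per-frame mesh embeddings could be built from 3D joint positions versus joint poses (axis-angle rotations). The joint count, sequence length, and flattening scheme are illustrative assumptions, not necessarily the exact construction used in our experiments.

```python
# Minimal sketch (assumed shapes and joint count, not the exact pipeline) of
# building per-frame mesh embeddings from 3D joint positions vs. joint poses.
import numpy as np

NUM_JOINTS = 24   # assumed SMPL-style joint count
SEQ_LEN = 16      # assumed number of frames in a mesh sequence


def embed_from_positions(joint_positions: np.ndarray) -> np.ndarray:
    """Flatten per-frame 3D joint positions (T, J, 3) into embeddings (T, J*3).

    Positions implicitly encode pose (the relative layout of joints) while also
    carrying the subject's location within the scene.
    """
    num_frames, num_joints, _ = joint_positions.shape
    return joint_positions.reshape(num_frames, num_joints * 3)


def embed_from_poses(joint_rotations: np.ndarray) -> np.ndarray:
    """Flatten per-frame axis-angle joint rotations (T, J, 3) into embeddings (T, J*3).

    Rotations describe body articulation only; the subject's scene position is lost.
    """
    num_frames, num_joints, _ = joint_rotations.shape
    return joint_rotations.reshape(num_frames, num_joints * 3)


if __name__ == "__main__":
    positions = np.random.randn(SEQ_LEN, NUM_JOINTS, 3)  # dummy mesh sequence
    rotations = np.random.randn(SEQ_LEN, NUM_JOINTS, 3)
    print(embed_from_positions(positions).shape)  # (16, 72)
    print(embed_from_poses(rotations).shape)      # (16, 72)
```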


               Leveraging this finding, we studied the impact of various joint categories on model performance and
               observed that the inclusion of joints from the “pelvic”, “arm”, and “cranial” joint categories was optimal.
               This quantitative result was consistent with previous qualitative observations, underscoring the differences
               in attention, movement, and positional patterns between human subjects performing different surgical
               actions. Interestingly, the further inclusion of joints in the “thorax”, “spine”, and “leg” categories resulted in
successive performance drops. One possible explanation for this trend is that joint estimates from these categories may be less precise because they are more likely to be occluded by adjacent equipment, particularly joints in the “spine” and “leg” categories. We observed this phenomenon in a recovered
               human mesh in Figure 2, row 2, which erroneously modeled a standing subject in a sitting position. Similar
to the effects of occlusion, each subject in our videos displayed a homogeneous appearance due to their
               surgical attire, which may have affected the precision of pelvic and spine joint estimations. These challenges
have been observed less frequently in previous HMR studies dealing with natural imagery, where the upper- and lower-body attire of human subjects typically has distinctive features [17,19].
               Future work should explore methods to mitigate these errors and assess the uncertainty of joint predictions
               in surgical scenes.
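As an illustration of the category-level ablation, the sketch below restricts a mesh sequence to selected joint categories before embedding. The category-to-index mapping follows a common SMPL-style joint ordering and is a hypothetical assumption; the exact groupings used in our ablations may differ.

```python
# Minimal sketch of a joint-category ablation. The index mapping below assumes
# an SMPL-style joint ordering and is hypothetical, not our exact grouping.
import numpy as np

JOINT_CATEGORIES = {
    "pelvic":  [0, 1, 2],                   # pelvis, left/right hips
    "spine":   [3, 6, 9],                   # lower/mid/upper spine
    "leg":     [4, 5, 7, 8, 10, 11],        # knees, ankles, feet
    "cranial": [12, 15],                    # neck, head
    "thorax":  [13, 14],                    # left/right collar
    "arm":     [16, 17, 18, 19, 20, 21],    # shoulders, elbows, wrists
}


def select_joint_categories(joint_positions: np.ndarray, categories: list[str]) -> np.ndarray:
    """Restrict a (T, J, 3) sequence of joint positions to the chosen categories."""
    indices = sorted({i for c in categories for i in JOINT_CATEGORIES[c]})
    return joint_positions[:, indices, :]


if __name__ == "__main__":
    sequence = np.random.randn(16, 22, 3)  # dummy mesh sequence
    ablated = select_joint_categories(sequence, ["pelvic", "arm", "cranial"])
    print(ablated.shape)  # (16, 11, 3)
```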


               Follow-up investigations into individual joints in the “arm” and “cranial” joint categories provided
               empirical evidence on the importance of individual joints that are closely tied to arm movements and visual
attention for discriminating between surgical actions. Specifically, we observed considerable, isolated improvements
               to model performance over ablated baselines when testing the separate inclusion of (1) pelvic joints, (2)
               anchoring joints for visual field computation (Section “Surgical behavior analysis”), such as the head and ear
               joints, and (3) arm joints, such as the wrist and elbow joints. Due to the importance of modeling intricate
               hand movements to analyze surgical performance, we hope to perform future studies that recover finger
               joints to discriminate between different hand movements. While a granular understanding of hand
               geometry was not central to our study of basic actions, our findings lay the groundwork for future studies
               on hand movements by providing evidence that mesh sequences can effectively encode physical actions.
               Furthermore, previous HMR studies have demonstrated the recovery of finger joints from in-the-wild
scenes, supporting the feasibility of this research direction [28,29].
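For context on the visual-field anchoring joints mentioned above, the sketch below derives a coarse facing direction from the head and ear joint positions. The geometry is an illustrative assumption rather than the exact formulation used in Section “Surgical behavior analysis”.

```python
# Minimal sketch of estimating a coarse gaze direction from the head and ear
# anchor joints; the geometry is an illustrative assumption, not our exact method.
import numpy as np


def gaze_direction(head: np.ndarray, left_ear: np.ndarray, right_ear: np.ndarray) -> np.ndarray:
    """Return a unit vector approximating the facing direction of the subject.

    The direction is taken perpendicular to the inter-ear axis within the plane
    spanned by the ears and the head joint; its sign depends on the left/right
    ear labeling convention.
    """
    ear_axis = right_ear - left_ear
    up_hint = head - (left_ear + right_ear) / 2.0   # roughly "up" through the skull
    forward = np.cross(ear_axis, up_hint)           # perpendicular to both
    norm = np.linalg.norm(forward)
    return forward / norm if norm > 0 else forward


if __name__ == "__main__":
    head = np.array([0.00, 1.70, 0.00])
    left_ear = np.array([-0.08, 1.62, 0.00])
    right_ear = np.array([0.08, 1.62, 0.00])
    print(gaze_direction(head, left_ear, right_ear))  # e.g. [0. 0. 1.]
```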


Altogether, our findings demonstrated that surgical actions can be recovered consistently from sequences of human mesh features alone. Conventional approaches that directly use
               video frames to make predictions can be prone to overfitting, and our approach may circumvent this due to