
Despite their potential for analyzing behavior-rich scenes, HMR-based methods have yet to be explored for analyzing human behavior across long time frames (i.e., more than one minute). Several studies have developed temporal approaches to HMR; however, these methods are limited to short videos spanning less than one minute due to computational constraints [11]. Similar studies have investigated frame-based approaches, but these focus either on frames with a single human subject or on single frames with multiple people [12]. Few studies have leveraged HMR techniques to analyze group dynamics, individual behavior, and global movements. Furthermore, to our knowledge, no previous studies have investigated the development of HMR-based methods to analyze human behavior in OR videos.

We propose an HMR-based computer vision framework for detecting, recovering, and tracking human meshes in surgical simulation videos. Our framework integrates a dual human head-body detector [13], a Kalman filter-based tracker [14], and a frame-based HMR model [15] trained on accessible, large-scale human mesh and human detection datasets [16-19]. Our framework presents a unified approach to studying human behavior in surgical scenes by deriving metrics on human attention, human movement, and hand-tool interactions from a small dataset of simulated surgical videos. To evaluate the potential of leveraging our estimated human mesh sequences for downstream surgical prediction tasks, we trained and evaluated a customized multi-layer perceptron (MLP) Mixer model on a self-curated dataset of human mesh sequences annotated with common, short-duration surgical actions. We show that sequences of mesh embeddings can be leveraged successfully to discriminate between actions with similar physical behaviors yet striking differences in surgical significance. Overall, our work advances efforts in systematizing OR video review for the study of human behavior with HMR.
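
As a schematic illustration of this classification component, the sketch below implements an MLP-Mixer-style classifier over fixed-length sequences of per-frame mesh embeddings in PyTorch; the dimensions, block count, and module names are illustrative assumptions rather than the exact configuration used in our experiments.

    import torch.nn as nn

    class MixerBlock(nn.Module):
        """One Mixer block: token mixing across frames, channel mixing per frame."""
        def __init__(self, num_frames, embed_dim, hidden=256):
            super().__init__()
            self.norm1 = nn.LayerNorm(embed_dim)
            self.token_mlp = nn.Sequential(
                nn.Linear(num_frames, hidden), nn.GELU(), nn.Linear(hidden, num_frames))
            self.norm2 = nn.LayerNorm(embed_dim)
            self.channel_mlp = nn.Sequential(
                nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, embed_dim))

        def forward(self, x):                          # x: (batch, frames, embed_dim)
            y = self.norm1(x).transpose(1, 2)          # mix across the time axis
            x = x + self.token_mlp(y).transpose(1, 2)
            x = x + self.channel_mlp(self.norm2(x))    # mix within each embedding
            return x

    class MeshActionMixer(nn.Module):
        """Classify a sequence of mesh embeddings into a surgical action label."""
        def __init__(self, num_frames=64, embed_dim=128, num_blocks=4, num_classes=8):
            super().__init__()
            self.blocks = nn.Sequential(
                *[MixerBlock(num_frames, embed_dim) for _ in range(num_blocks)])
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):                          # x: (batch, frames, embed_dim)
            return self.head(self.blocks(x).mean(dim=1))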


METHODS
We designed an integrated, scalable method to identify individual actions and analyze individual behavior from OR videos [Figure 1]. In the following sections, we describe each successive component of our method in detail.

Human mesh recovery framework
The analysis of each individual's behavior in the OR requires three key steps: robustly detecting human subjects in each frame, tracking human subjects from frame to frame, and recovering human mesh parameters for each detected individual [Figure 1A].
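
Conceptually, these three steps compose into a per-frame loop. The following Python sketch shows the data flow only; `detector`, `tracker`, and `hmr_model` are hypothetical interfaces standing in for the components detailed in the subsections below.

    def process_video(frames, detector, tracker, hmr_model):
        tracklets = {}  # track_id -> ordered list of per-frame mesh parameters
        for frame in frames:
            boxes = detector.detect(frame)                 # 1. detect human subjects
            tracks = tracker.update(boxes)                 # 2. associate with prior frames
            for track_id, box in tracks:
                mesh = hmr_model.recover_mesh(frame, box)  # 3. recover mesh parameters
                tracklets.setdefault(track_id, []).append(mesh)
        return tracklets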


Human and head detection
We performed human detection to identify regions in the image that our HMR model should focus on to recover human mesh features [Figure 2, middle column]. We obtained a YoloV5 model [20] pretrained on the COCO 2017 dataset [17] and finetuned it separately on the detection of whole human subjects and of human heads using the CrowdHuman dataset [13]. To improve the precision of our detections, we cross-referenced each human subject prediction with a prediction of an associated human head obtained from a second object detection model.
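
The cross-referencing rule between body and head predictions can be implemented in several ways; one minimal sketch, assuming boxes in (x1, y1, x2, y2) format and an illustrative containment threshold, keeps a body detection only when a detected head lies largely inside it:

    def head_containment(head, body):
        """Fraction of the head box that falls inside the body box."""
        ix1, iy1 = max(head[0], body[0]), max(head[1], body[1])
        ix2, iy2 = min(head[2], body[2]), min(head[3], body[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        head_area = (head[2] - head[0]) * (head[3] - head[1])
        return inter / head_area if head_area > 0 else 0.0

    def filter_body_detections(body_boxes, head_boxes, min_containment=0.5):
        """Keep body detections corroborated by at least one head detection."""
        return [b for b in body_boxes
                if any(head_containment(h, b) >= min_containment for h in head_boxes)]

Requiring corroboration from an independent head detector suppresses body-shaped false positives (e.g., draped equipment), at the cost of dropping subjects whose heads are occluded.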

Subject tracking
We leveraged these predictions in subsequent tracking with a simple Kalman Box-based tracker [14]; tracking results were used to associate human meshes in each frame with meshes in prior frames, allowing us to construct a sequential view of each individual's changes in movement and pose over time. To improve the fidelity of our tracking procedure, we introduced constraints on how a tracklet can be abandoned or created based on its most recent estimated position. We defined a tracklet as a temporal sequence of consecutive human mesh observations associated with a single individual in the scene.
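
To make the association and gating logic concrete, the sketch below implements a simplified greedy IoU tracker with a creation/abandonment gate of the kind described above. For brevity it substitutes direct box assignment for the Kalman predict/update step, and all thresholds are illustrative assumptions rather than our tuned values.

    def iou(a, b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def update_tracklets(tracklets, detections,
                         iou_thresh=0.3, revive_iou=0.1, max_age=30):
        """One tracking step: associate detections, then gate creation/abandonment.

        Each tracklet is a dict with keys 'box' (last estimated position),
        'age' (frames since last match), and 'history' (box sequence).
        """
        unmatched = list(detections)
        for trk in tracklets:
            best = max(unmatched, key=lambda d: iou(trk['box'], d), default=None)
            if best is not None and iou(trk['box'], best) >= iou_thresh:
                trk['box'], trk['age'] = best, 0   # full tracker: Kalman update here
                trk['history'].append(best)
                unmatched.remove(best)
            else:
                trk['age'] += 1                    # coast on the last estimate
        for det in unmatched:
            # Creation gate: revive a recently lost tracklet whose last estimated
            # position overlaps this detection, instead of spawning a duplicate identity.
            lost = [t for t in tracklets if 0 < t['age'] <= max_age]
            near = max(lost, key=lambda t: iou(t['box'], det), default=None)
            if near is not None and iou(near['box'], det) >= revive_iou:
                near['box'], near['age'] = det, 0
                near['history'].append(det)
            else:
                tracklets.append({'box': det, 'age': 0, 'history': [det]})
        # Abandonment gate: drop tracklets unmatched for too long.
        return [t for t in tracklets if t['age'] <= max_age]

Gating creation against the last estimated positions of recently lost tracklets reduces identity switches when a subject is briefly occluded, which matters here because downstream behavior metrics assume each tracklet corresponds to a single individual.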