Despite their potential for analyzing behavior-rich scenes, HMR-based methods have yet to be explored for analyzing human behavior across long time frames (i.e., more than one minute). Several studies have developed temporal approaches to HMR; however, these methods are limited to short videos spanning less than one minute due to computational constraints[11]. Other studies have investigated frame-based approaches, but focus on either frames with a single human subject or single frames with multiple people[12]. Few studies have leveraged HMR techniques to analyze group dynamics, individual behavior, and global movements. Furthermore, to our knowledge, no previous studies have investigated the development of HMR-based methods to analyze human behavior in OR videos.
We propose an HMR-based computer vision framework for detecting, recovering, and tracking human meshes in surgical simulation videos. Our framework integrates a dual human head-body detector[13], a Kalman filter-based tracker[14], and a frame-based HMR model[15] trained on accessible, large-scale human mesh and human detection datasets[16-19]. Our framework presents a unified approach to studying human
behavior in surgical scenes by deriving metrics on human attention, human movement, and hand-tool
interactions from a small dataset of simulated surgical videos. To evaluate the potential of leveraging our
estimated human mesh sequences for downstream surgical prediction tasks, we trained and evaluated a
customized multi-layer perceptron (MLP) Mixer model on a self-curated dataset of human mesh sequences
annotated with common, short-duration surgical actions. We show that sequences of mesh embeddings can
be leveraged successfully to discriminate between actions with similar physical behaviors yet striking
differences in surgical significance. Overall, our work advances efforts in systematizing OR video review for
the study of human behavior with HMR.
METHODS
We designed an integrated, scalable method to identify individual actions and analyze individual behavior
from OR videos [Figure 1]. In the following sections, we describe each successive component of our method
in detail.
Human mesh recovery framework
The analysis of each individual’s behavior in the OR requires three key steps: robustly detecting human
subjects in each frame, tracking human subjects from frame to frame, and recovering human mesh
parameters from each detected individual [Figure 1A].
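In code form, these three steps compose into a simple per-frame loop. The sketch below is illustrative only: `detect`, `track`, and `recover_mesh` are hypothetical stand-ins for the detector, tracker, and HMR model described in the following sections, not the authors' implementation.

```python
from typing import Callable, Dict, List
import numpy as np

def process_video(
    frames: List[np.ndarray],
    detect: Callable[[np.ndarray], List[np.ndarray]],            # frame -> body boxes
    track: Callable[[List[np.ndarray]], Dict[int, np.ndarray]],  # boxes -> {track_id: box}
    recover_mesh: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (frame, box) -> mesh
) -> Dict[int, List[np.ndarray]]:
    """Return one chronological mesh sequence per tracked individual."""
    sequences: Dict[int, List[np.ndarray]] = {}
    for frame in frames:
        boxes = detect(frame)                   # step 1: detect human subjects
        assigned = track(boxes)                 # step 2: associate boxes with track IDs
        for track_id, box in assigned.items():  # step 3: recover a mesh per person
            sequences.setdefault(track_id, []).append(recover_mesh(frame, box))
    return sequences
```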
Human and head detection
We performed human detection to identify regions in the image that our HMR model should focus on to recover human mesh features [Figure 2, middle column]. We obtained a YoloV5 model pretrained on the COCO 2017 dataset[20] and finetuned it separately on the detection of whole human subjects and human heads[13] using the CrowdHuman dataset[17]. To improve the precision of our detections, we cross-referenced each human subject prediction with a prediction of an associated human head obtained from a second object detection model.
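The following Python sketch shows one way this dual-detector scheme could be assembled with the standard YOLOv5 hub interface. The weight file paths and the head-in-upper-body matching rule are our illustrative assumptions, not the paper's released implementation.

```python
import numpy as np
import torch

# Load the two finetuned detectors via the standard YOLOv5 hub entry point.
# The weight paths below are placeholders, not the paper's artifacts.
body_model = torch.hub.load("ultralytics/yolov5", "custom", path="body_crowdhuman.pt")
head_model = torch.hub.load("ultralytics/yolov5", "custom", path="head_crowdhuman.pt")

def head_in_upper_body(body: np.ndarray, head: np.ndarray, top_frac: float = 0.5) -> bool:
    """One plausible cross-referencing rule: the head box center must fall
    in the upper `top_frac` of the body box (the paper's exact rule may differ)."""
    cx, cy = (head[0] + head[2]) / 2.0, (head[1] + head[3]) / 2.0
    x1, y1, x2, y2 = body[:4]
    return x1 <= cx <= x2 and y1 <= cy <= y1 + top_frac * (y2 - y1)

def cross_reference(frame: np.ndarray) -> list:
    """Keep only body detections corroborated by at least one head detection."""
    bodies = body_model(frame).xyxy[0].cpu().numpy()  # rows: x1, y1, x2, y2, conf, cls
    heads = head_model(frame).xyxy[0].cpu().numpy()
    return [b for b in bodies if any(head_in_upper_body(b, h) for h in heads)]
```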
Subject tracking
We leveraged these predictions in subsequent tracking with a simple Kalman Box-based tracker[14]; tracking results were used to associate human meshes in each frame with meshes in prior frames, allowing us to construct a sequential view of each individual's changes in movement and pose over time. To improve the fidelity of our tracking procedure, we introduced constraints on how a tracklet can be abandoned or created based on its most recent estimated position. We defined a tracklet as a temporal sequence of consecutive human mesh observations associated with a single individual in the scene.
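As a rough illustration of this gating, the sketch below implements a simplified constant-velocity Kalman tracker in the spirit of SORT-style box trackers, with explicit creation and abandonment constraints. The thresholds, the nearest-neighbor association, and tracking only the box center are simplifying assumptions for clarity, not the paper's exact design.

```python
import numpy as np
from filterpy.kalman import KalmanFilter  # the filter family used by SORT-style trackers

def make_center_filter(cx: float, cy: float) -> KalmanFilter:
    """Constant-velocity Kalman filter over a box center (a full box tracker
    would also filter scale and aspect ratio)."""
    kf = KalmanFilter(dim_x=4, dim_z=2)
    kf.x = np.array([cx, cy, 0.0, 0.0])      # state: [cx, cy, vx, vy]
    kf.F = np.array([[1., 0., 1., 0.],       # position advances by velocity
                     [0., 1., 0., 1.],
                     [0., 0., 1., 0.],
                     [0., 0., 0., 1.]])
    kf.H = np.array([[1., 0., 0., 0.],       # we observe position only
                     [0., 1., 0., 0.]])
    kf.P *= 10.0                              # inflated initial state uncertainty
    return kf

class Tracklet:
    """Temporal sequence of observations associated with one individual."""
    def __init__(self, track_id: int, cx: float, cy: float):
        self.track_id = track_id
        self.kf = make_center_filter(cx, cy)
        self.missed = 0                       # consecutive unmatched frames

MAX_MISSED = 5      # abandon a tracklet after this many unmatched frames (assumed value)
MAX_DIST = 100.0    # a detection may only join a tracklet predicted nearby (assumed value)

def step(tracklets, detections, next_id):
    """One frame of predict / associate / update with creation and abandonment gating.
    `detections` is a list of (cx, cy) box centers for the current frame."""
    for t in tracklets:
        t.kf.predict()
    unmatched = list(detections)
    for t in tracklets:
        pred = t.kf.x[:2]
        dists = [float(np.hypot(d[0] - pred[0], d[1] - pred[1])) for d in unmatched]
        if dists and min(dists) < MAX_DIST:   # gate on the most recent estimated position
            t.kf.update(np.asarray(unmatched.pop(int(np.argmin(dists)))))
            t.missed = 0
        else:
            t.missed += 1
    tracklets = [t for t in tracklets if t.missed <= MAX_MISSED]
    for cx, cy in unmatched:                  # unmatched detections seed new tracklets
        tracklets.append(Tracklet(next_id, cx, cy))
        next_id += 1
    return tracklets, next_id
```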

