We trained our MLP-mixer model for 200 epochs with a learning rate of 1e-4, a batch size of 16, and an Adam optimizer with default parameters[26]. During training, we selected the best-performing models based on the F1 score computed on the validation set at the end of each training epoch. We ceased training if this metric did not improve over 50 epochs. Training was performed on a single GeForce RTX 2080 and completed in approximately one hour.
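As an illustration only, a minimal PyTorch-style sketch of this training procedure is given below; the model, data loaders (assumed to use a batch size of 16), and the compute_f1 helper are hypothetical placeholders, not the authors' actual code.

```python
import copy
import torch
from torch.optim import Adam

def train_with_early_stopping(model, train_loader, val_loader, compute_f1,
                              epochs=200, lr=1e-4, patience=50):
    """Sketch of training with F1-based model selection and early stopping."""
    optimizer = Adam(model.parameters(), lr=lr)   # Adam with default betas/eps
    criterion = torch.nn.CrossEntropyLoss()
    best_f1 = 0.0
    best_state = copy.deepcopy(model.state_dict())
    epochs_since_improvement = 0

    for epoch in range(epochs):
        model.train()
        for meshes, labels in train_loader:       # hypothetical mesh-sequence batches
            optimizer.zero_grad()
            loss = criterion(model(meshes), labels)
            loss.backward()
            optimizer.step()

        # Select the best-performing model by validation F1 at the end of each epoch.
        model.eval()
        with torch.no_grad():
            val_f1 = compute_f1(model, val_loader)
        if val_f1 > best_f1:
            best_f1 = val_f1
            best_state = copy.deepcopy(model.state_dict())
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:  # stop after 50 stale epochs
                break

    model.load_state_dict(best_state)
    return model, best_f1
```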
Evaluation metrics
We determined the predicted action class for each mesh sequence by selecting the class with the highest
corresponding predicted probability. Our most important performance metrics included (1) precision, which quantifies the fraction of mesh sequences predicted as a specific action class that truly belong to that class; (2) recall, which quantifies the fraction of mesh sequences of a specific action class that are correctly identified as such; and (3) F1, which combines precision and recall using a harmonic mean. We defined precision, $P_c$, and recall, $R_c$, for class $c$ as:

$$P_c = \frac{TP_c}{TP_c + FP_c}$$

and

$$R_c = \frac{TP_c}{TP_c + FN_c}$$

where $TP_c$, $FP_c$, and $FN_c$ denote true positives, false positives, and false negatives corresponding to a given
action class. To provide more comprehensive measures of model performance, we calculated the area under
the precision-recall curve (AUPRC). For all metrics, we computed a weighted average based on class
prevalence.
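For illustration, the sketch below shows one way these weighted metrics could be computed with scikit-learn, using average precision as the estimate of AUPRC; the probability array and labels are hypothetical placeholders, not data from this study.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, average_precision_score
from sklearn.preprocessing import label_binarize

# Hypothetical per-sequence class probabilities and ground-truth action labels.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.6, 0.3, 0.1]])
y_true = np.array([0, 1, 2, 1])
classes = np.arange(probs.shape[1])

# Predicted class = class with the highest predicted probability.
y_pred = probs.argmax(axis=1)

# Precision, recall, and F1, averaged with weights given by class prevalence.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, average="weighted", zero_division=0)

# Per-class average precision (an estimate of AUPRC), weighted by prevalence.
y_true_onehot = label_binarize(y_true, classes=classes)
auprc = average_precision_score(y_true_onehot, probs, average="weighted")

print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}  AUPRC={auprc:.3f}")
```

Here, average="weighted" weights each per-class score by that class's support, which corresponds to the prevalence-weighted averaging described above.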
RESULTS
This section presents qualitative and quantitative insights into our framework’s ability to analyze surgical behavior and recover short-duration actions from human mesh sequences of OR videos.
Datasets
To train our HMR model, we drew on a broad set of commonly used, open-access HMR datasets. As no surgical HMR datasets exist, to the best of our knowledge, we employed diverse datasets from general settings. We followed the widely referenced schema outlined by Kolotouros et al.[16] for querying examples from the Common Objects in Context (COCO) dataset and the Max Planck Institute for Informatics (MPII) Human Pose dataset, along with their associated 2D keypoints[17,27]. We also added examples and 3D ground truth from the 3D Poses in the Wild (3DPW) and Human 3.6M (H36M) datasets. We conducted our evaluation on the official train/test splits of 3DPW, an in-the-wild dataset capturing humans in diverse poses and camera angles, and H36M, which captures human activities in controlled environments[18,19].
For human detection, we trained our model on CrowdHuman, a large, richly annotated dataset of human subjects in crowded, natural scenes, chosen to mimic the crowded nature of OR scenes[13].
We curated an in-house dataset based on simulated surgical videos for experiments on surgical behavior analysis. These videos replicated actions in the OR performed by real clinical personnel but did not involve actual patients or procedures. Accordingly, our data do not include Protected Health Information (PHI) and do

