
               Table 1. Breakdown of our dataset containing short-duration action clips for downstream surgical task evaluation leveraging
               recovered human meshes
                Action type                            Train     Validation       Test     All splits
                Hand-tool interaction                  219       46               48       313
                Walking movement                       64        13               14       91
                Visual observation of peer(s)          322       68               70       460
                All types                              605       127              132      864

               We separately used subsets of four, two, and two simulated surgical videos to create our train, validation, and test splits, respectively. We learned
               MLP mixer model parameters using our training set, tuned hyperparameters with our validation set, and evaluated our model on our held-out test
               set. MLP: Multi-layer perceptron.
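As an illustration of the video-level split described above, the sketch below assigns action clips to train, validation, and test sets by source video, so that clips from the same simulated surgical video never appear in more than one split. This is an illustrative outline rather than the authors' code; the video identifiers and the clip data structure are hypothetical.

```python
# Minimal sketch (not the authors' code): video-level assignment of action
# clips to splits. Video identifiers and the clip structure are hypothetical.
from collections import defaultdict

# Hypothetical mapping: video id -> list of (clip, action_label) pairs,
# populated elsewhere from the eight simulated surgical videos.
clips_by_video = defaultdict(list)

split_videos = {
    "train":      ["vid_1", "vid_2", "vid_3", "vid_4"],  # four videos
    "validation": ["vid_5", "vid_6"],                     # two videos
    "test":       ["vid_7", "vid_8"],                     # two videos
}

splits = {
    name: [example for vid in vids for example in clips_by_video[vid]]
    for name, vids in split_videos.items()
}
```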


               Table 2. Performance of our multi-class classification model under ablations that form the mesh embeddings separately from 3D
               joint positions (top) and 3D joint poses (bottom)
                Pelvic  Arm    Cranial   Thorax    Spine   Leg  Recall↑   Precision↑    F1↑    AUPRC↑
                Mesh embeddings as 3D joint positions
                √       -      -         -         -       -    0.62      0.38          0.47   0.57
                √       √      -         -         -       -    0.75      0.72          0.73   0.74
                √       √      √         -         -       -    0.83      0.82          0.81   0.85
                √       √      √         √         -       -    0.82      0.80          0.81   0.81
                √       √      √         √         √       -    0.78      0.78          0.77   0.74
                √       √      √         √         √       √    0.73      0.73          0.72   0.71
                Mesh embeddings as 3D joint poses
                √       -      -         -         -       -    0.75      0.75          0.75   0.72
                √       √      -         -         -       -    0.78      0.77          0.77   0.77
                √       √      √         -         -       -    0.78      0.78          0.78   0.83
                √       √      √         √         -       -    0.78      0.77          0.77   0.81
                √       √      √         √         √       -    0.78      0.77          0.77   0.85
                √       √      √         √         √       √    0.79      0.77          0.78   0.81

               Both ablations rely on the same major categories of joints, and check marks indicate that parameters from the joints in the referenced category
               are used to form the mesh embedding. For example, in the second row of the top table, 3D positions of the joints categorized under the pelvic and
               arm regions [Supplementary Material] are concatenated together to form the mesh embedding in each frame. Mesh embeddings from sampled
               frames in the 5-second action clip are collated, forming one dataset example, together with its associated action class label. Bolding indicates a
               top score.
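To make the embedding construction concrete, the sketch below concatenates the 3D positions of joints in the selected categories in each frame and stacks the embeddings of frames sampled across the 5-second clip into one dataset example. It is a sketch under stated assumptions: the joint-category index lists, the number of sampled frames, and the uniform sampling scheme are placeholders, not the paper's values (the paper's joint-to-category mapping is given in its Supplementary Material).

```python
# Minimal sketch: per-frame mesh embeddings from selected joint categories,
# stacked over sampled frames of one 5-second action clip.
# Joint indices and sampling parameters below are hypothetical placeholders.
import numpy as np

JOINT_CATEGORIES = {          # assumed index lists, for illustration only
    "pelvic":  [0, 1, 2],
    "arm":     [16, 17, 18, 19, 20, 21],
    "cranial": [15, 22, 23, 24],
    "thorax":  [9, 12, 13, 14],
    "spine":   [3, 6],
    "leg":     [4, 5, 7, 8, 10, 11],
}

def clip_example(joints_3d, categories, n_sampled_frames=16):
    """joints_3d: (n_frames, n_joints, 3) recovered 3D joint positions for one
    clip. Returns an array of shape (n_sampled_frames, 3 * n_selected_joints)."""
    idx = [j for c in categories for j in JOINT_CATEGORIES[c]]
    # Uniformly sample frames across the clip (sampling scheme is assumed).
    frame_ids = np.linspace(0, len(joints_3d) - 1, n_sampled_frames).astype(int)
    # Per-frame mesh embedding: concatenated 3D positions of the selected joints.
    return joints_3d[frame_ids][:, idx, :].reshape(n_sampled_frames, -1)

# e.g., the second row of the top table: pelvic and arm joint positions.
example = clip_example(np.zeros((150, 25, 3)), ["pelvic", "arm"])
```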


When forming mesh embeddings from 3D joint poses, we observed less variance in model performance among joint categories, with 0.04, 0.03, and 0.03 as the maximal differences between the lowest- and highest-performing experimental settings in recall, precision, and F1, respectively (Table 2, bottom).
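These spreads follow directly from the bottom half of Table 2, as the short check below illustrates (values copied from the table):

```python
# Maximal difference between the lowest- and highest-scoring settings per
# metric, using the six rows of Table 2 (bottom).
recall    = [0.75, 0.78, 0.78, 0.78, 0.78, 0.79]
precision = [0.75, 0.77, 0.78, 0.77, 0.77, 0.77]
f1        = [0.75, 0.77, 0.78, 0.77, 0.77, 0.78]

for name, vals in [("recall", recall), ("precision", precision), ("F1", f1)]:
    print(name, round(max(vals) - min(vals), 2))  # 0.04, 0.03, 0.03
```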

Based on the results in Table 2, we performed further experiments to understand the contributions of specific joints to modeling the action recognition task, using 3D joint positions to construct the mesh embeddings. Specifically, we ablated joints in the "cranial" and "arm" categories, such as the wrist, elbow, eye, ear, and head joints, while retaining joints from the "pelvic" category as a positional anchor. We chose to ablate joints from these categories because they had previously been observed to introduce substantial gains in model performance [Table 2]. We observed that optimal performance was achieved only after all individual cranial joints were included (Table 3, row 5). Ablating the pelvic joints from mesh embeddings that included all arm and cranial joints produced a considerable decrease in performance relative to the non-ablated baseline (Table 3, row 6).
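One way to express these joint-level settings is sketched below: each experimental row keeps the pelvic joints as a positional anchor (except the ablation in Table 3, row 6) and adds individual arm and cranial joints group by group before forming the per-frame mesh embedding as above. The joint names and indices are hypothetical placeholders used only for illustration.

```python
# Minimal sketch (hypothetical joint names and indices): selecting the joints
# that enter the mesh embedding for one joint-level ablation setting.
PELVIC_ANCHOR = [0, 1, 2]          # assumed pelvic joint indices
CANDIDATE_JOINTS = {                # assumed arm/cranial joint indices
    "wrists": [20, 21],
    "elbows": [18, 19],
    "eyes":   [23, 24],
    "ears":   [25, 26],
    "head":   [15],
}

def ablation_setting(included_groups, keep_anchor=True):
    """Return the joint indices used to build the mesh embedding for one
    experimental row; keep_anchor=False drops the pelvic positional anchor."""
    idx = list(PELVIC_ANCHOR) if keep_anchor else []
    for group in included_groups:
        idx += CANDIDATE_JOINTS[group]
    return sorted(idx)

# e.g., all arm and cranial joints with the pelvic anchor removed:
setting = ablation_setting(["wrists", "elbows", "eyes", "ears", "head"],
                           keep_anchor=False)
```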