class classification task. Specifically, our model predicts the action associated with a mesh sequence
[Figure 1C].
Architecture and experimental design
We used a customized MLP-Mixer model for our action recognition task. An MLP-Mixer model
leverages MLPs to process channel-wise and token-wise information, allowing it to capture complex
interactions among channels and patches for the modeling of image-based inputs[23]. Follow-up studies have
applied these architectural principles to successfully model dependencies in non-image modalities and
sequential data[24,25], demonstrating the suitability of the architecture for our action recognition task; we
aimed to separately capture relationships among (1) different joints of a single subject in a given frame and
(2) joints in sequential frames. In our experiments, we adapted the original MLP-Mixer architecture to
accept an input sequence of human mesh-based embeddings by discarding the image patch layer and
performing token mixing across the temporal and embedding dimensions of the human mesh input
sequence. We defined mesh-based embedding as a vector representation of a human in a frame by any
combination or subset of HMR-derived parameters, including estimated 3D joint poses and positions. Each
mesh-based embedding effectively captures information on how the individual can be physically modeled at
a specific point in time. The temporal dimension brings together mesh-based representations in sequence,
which, we argue, can collectively represent a specific action, gesture, or behavior. In each training step, the
MLP-Mixer model takes in an input sequence of human mesh-based embeddings, representing a subject’s
physical motion across a 5-second clip, and outputs a predicted action class from the options of (1) hand-
tool interaction, (2) walking movement, and (3) visual observation of peers.
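
To make this adaptation concrete, the following is a minimal PyTorch sketch of an MLP-Mixer operating on a sequence of mesh-based embeddings, with token mixing applied across the temporal dimension and channel mixing applied within each embedding. The layer sizes, depth, and class names are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn


class MixerBlock(nn.Module):
    def __init__(self, num_frames: int, embed_dim: int, hidden: int = 256):
        super().__init__()
        self.token_norm = nn.LayerNorm(embed_dim)
        # Token mixing: an MLP applied across the temporal (frame) dimension.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_frames, hidden), nn.GELU(), nn.Linear(hidden, num_frames)
        )
        self.channel_norm = nn.LayerNorm(embed_dim)
        # Channel mixing: an MLP applied within each mesh-based embedding.
        self.channel_mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.GELU(), nn.Linear(hidden, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, embed_dim)
        y = self.token_norm(x).transpose(1, 2)          # (batch, embed_dim, frames)
        x = x + self.token_mlp(y).transpose(1, 2)       # mix information across frames
        x = x + self.channel_mlp(self.channel_norm(x))  # mix within each embedding
        return x


class MeshSequenceMixer(nn.Module):
    def __init__(self, num_frames: int, embed_dim: int, num_classes: int = 3, depth: int = 4):
        super().__init__()
        self.blocks = nn.Sequential(*[MixerBlock(num_frames, embed_dim) for _ in range(depth)])
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.blocks(x)      # (batch, frames, embed_dim)
        x = x.mean(dim=1)       # average-pool over the temporal dimension
        return self.head(x)     # logits for the three action classes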
Throughout initial experiments, we explored training with different numbers of mixer layers, learning rates,
optimization algorithms, and mesh representation strategies. For all experiments, we used an unweighted
cross-entropy loss function,

L = −Σ_c y_c log(ŷ_c),

where y_c is the real class label for the mesh sequence, and ŷ_c is the predicted confidence score for the
designated class.
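
For reference, below is a minimal PyTorch sketch of one training step under this unweighted cross-entropy objective. The stand-in model, batch shape, optimizer, and learning rate are placeholders for illustration; the paper explored several optimizers and learning rates.

import torch
import torch.nn as nn

# Stand-in classifier: 25 frames x 72-dim mesh embeddings -> 3 action classes.
# Shapes are illustrative only; a mixer model as sketched above would be used in practice.
model = nn.Sequential(nn.Flatten(), nn.Linear(25 * 72, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # unweighted cross-entropy by default

mesh_sequences = torch.randn(8, 25, 72)    # (batch, frames, mesh embedding)
action_labels = torch.randint(0, 3, (8,))  # 0: hand-tool, 1: walking, 2: observing peers

logits = model(mesh_sequences)
loss = criterion(logits, action_labels)    # -Σ_c y_c log(softmax(logits)_c), averaged over the batch
optimizer.zero_grad()
loss.backward()
optimizer.step()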
To understand the representative power of the HMR-derived parameters in distinguishing short-duration,
common surgical actions, we experimented with different formulations of mesh-based embeddings.
Specifically, we performed an ablation study where mesh-based embeddings were constructed solely from
the 3D positions of joints from major joint categories, such as the pelvic, thorax, and cranial joints for the
same task. Each joint set J = {j_1, ..., j_n} was composed of n joints, for which each joint j_i ∈ ℝ³ represents the 3D
position of the joint in the global scene, and the corresponding mesh embedding is a concatenation of all
j_i ∈ J. We performed a similar ablation study, where mesh-based embeddings were represented strictly with
predicted 3D joint poses rather than 3D joint positions. Specifically, we collected pose parameters in
accordance with the joint categories defined previously. Each pose set P = {p_1, ..., p_n} was composed of a
concatenation of n flattened pose vectors, for which each p_i ∈ ℝ^(3×3) is defined by a rotation matrix that
represents the pose of joint j_i. We performed follow-up studies looking into the performance effects of
ablating specific joints that are crucial for performing hand-tool interactions and computing visual
attention. Furthermore, we also studied the dependency of our approach on the rate of video frame
sampling to provide insight into the scalability of our method to videos with longer durations.
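
The two embedding formulations used in the ablation could be built from HMR outputs roughly as follows. The joint count, the indices of the chosen joint category, and the function names are hypothetical placeholders, not the authors' exact joint groupings.

import numpy as np

# Hypothetical indices of the joints in one category (e.g., thorax-related joints).
JOINT_SUBSET = [0, 3, 6, 9]

def position_embedding(joint_positions: np.ndarray, subset=JOINT_SUBSET) -> np.ndarray:
    """joint_positions: (num_joints, 3) global 3D joint positions from HMR.
    Returns the concatenation of the selected joints' positions (len(subset) * 3 values)."""
    return joint_positions[subset].reshape(-1)

def pose_embedding(joint_rotations: np.ndarray, subset=JOINT_SUBSET) -> np.ndarray:
    """joint_rotations: (num_joints, 3, 3) per-joint rotation matrices from HMR.
    Returns the concatenation of the selected joints' flattened rotations (len(subset) * 9 values)."""
    return joint_rotations[subset].reshape(-1)

# Stacking per-frame embeddings yields the (frames, embed_dim) input sequence for one clip.
frames = np.random.randn(25, 24, 3)  # 25 sampled frames, 24 joints, 3D positions (illustrative)
sequence = np.stack([position_embedding(f) for f in frames])  # shape (25, len(JOINT_SUBSET) * 3)

Varying the stride at which frames are drawn from a clip changes the temporal length of this sequence, which is the quantity examined in the frame-sampling-rate study.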

