Ao et al. Intell Robot 2023;3(4):495-513 I http://dx.doi.org/10.20517/ir.2023.28 Page 5 of 19
2.2. CNN model for gesture recognition
A CNN is a feed-forward neural network whose artificial neurons respond to a subset of the surrounding units within their coverage area, which makes it well suited to large-scale image processing [21]. A computational model based on deep CNNs is trained end-to-end, from raw pixels to the final category, without any additional information or manually designed feature extractors. This method can therefore effectively fulfill the requirement for experimental validation. A deep learning framework is used to recognize gestures from sEMG images and to computationally elucidate patterns in transient sEMG images. We built a network architecture with four convolutional layers and three fully connected layers. Although this is a very basic CNN, its test results are still very good.
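The four-conv, three-FC architecture described above can be sketched as follows; the channel counts, kernel sizes, input resolution (a 1 × 16 × 8 sEMG "image"), and number of gesture classes are illustrative assumptions, since the text specifies only the layer counts:

```python
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    """Sketch of a four-convolutional-layer, three-fully-connected-layer CNN.

    All hyperparameters (channels, kernel sizes, input size) are assumed,
    not taken from the paper.
    """
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 8, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),  # raw class scores (pre-softmax)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = GestureCNN()
out = model(torch.randn(2, 1, 16, 8))  # a batch of two dummy sEMG images
```

Because padding preserves the spatial resolution, the first linear layer simply flattens the 64 × 16 × 8 feature volume; any real implementation would tune these sizes to the actual electrode grid.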
2.3. Critical electrode channel selection
Muscle synergy analysis conventionally relies on manually extracting specific features from sEMG signals. The interpretable Grad-CAM method can be embedded in a CNN, avoiding the step of hand-selecting features while still recovering information about muscle activation during the recognition process. Moreover, Grad-CAM can explain, from a global perspective, the basis on which the network recognizes gestures and can quantify the contribution of features to the gesture recognition task. The electrode channel at the location of the contributing feature region is selected and used as the key electrode channel for muscle synergy analysis. Grad-CAM uses the gradients of any target concept flowing into the final convolutional layer to generate a coarse localization map that highlights the image regions important for predicting that concept [22]. Given a gesture image as input, we propagate it through the CNN part of the model and obtain the raw score for that class by task-specific computation. The gradients of all classes are set to zero except for the desired class, which is set to one. This signal is then back-propagated to the rectified convolutional feature maps, which we combine to compute the coarse Grad-CAM localization; the result indicates where the model must look to make its specific decision. Finally, we multiply the heat map pointwise with the guided backpropagation result to obtain a concept-specific Grad-CAM visualization.
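The forward-then-backward procedure above (set the gradient of the target class to one and all others to zero, then propagate back to the final convolutional feature maps) can be sketched in PyTorch with hooks; the tiny stand-in model, layer sizes, and target class here are assumptions for illustration, not the authors' network:

```python
import torch
import torch.nn as nn

# Minimal stand-in model: one conv layer followed by a classifier head.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),  # final (only) conv layer
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),                            # 5 gesture classes (assumed)
)

# Hooks capture the forward activations A^k and the gradients dy^c/dA^k.
feature_maps, feature_grads = [], []
conv = model[0]
conv.register_forward_hook(lambda m, inp, out: feature_maps.append(out))
conv.register_full_backward_hook(lambda m, gin, gout: feature_grads.append(gout[0]))

x = torch.randn(1, 1, 12, 10)                   # a dummy sEMG "image"
scores = model(x)                               # raw class scores, pre-softmax

# One-hot signal: gradient of the desired class is 1, all other classes 0.
one_hot = torch.zeros_like(scores)
one_hot[0, 3] = 1.0                             # target gesture class c = 3
scores.backward(gradient=one_hot)

A = feature_maps[0]      # A^k, shape (1, 8, 12, 10)
dA = feature_grads[0]    # dy^c/dA^k, same shape
```

The captured pair `(A, dA)` is exactly the input needed for the weight and localization computations that follow.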
The class-discriminative localization map for gesture category $c$, $L^{c}_{\text{Grad-CAM}} \in \mathbb{R}^{u \times v}$, can be calculated in the following steps. We first compute the gradient of the score $y^{c}$ for gesture category $c$ before the softmax, where $y^{c}$ is the linear categorical logit score of gesture category $c$. $A^{k}$ is a feature map of the convolutional layer, and $A^{k}_{ij}$ represents each element of the feature map of the $k$-th channel. The neuron importance weights $\alpha^{c}_{k}$ are obtained by a global average pooling (GAP) over these gradients. The importance of the $k$-th channel for gesture category $c$, $\alpha^{c}_{k}$, is calculated as

$$\alpha^{c}_{k} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}} \qquad (1)$$

where $Z$ is the number of elements in the feature map.
This weight represents a partial linearization of the deep network downstream of $A^{k}$ and captures the importance of feature map $k$ for the target gesture class $c$. The localization map is then obtained by a weighted combination of the forward activation maps followed by a ReLU, which filters out irrelevant features. The map $L^{c}_{\text{Grad-CAM}} \in \mathbb{R}^{u \times v}$ is computed as

$$L^{c}_{\text{Grad-CAM}} = \text{ReLU}\left( \sum_{k} \alpha^{c}_{k} A^{k} \right) \qquad (2)$$
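Equations (1) and (2) reduce to a few array operations once the activations $A^{k}$ and the gradients $\partial y^{c} / \partial A^{k}$ are available as arrays; a minimal NumPy sketch:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM localization map from precomputed arrays.

    activations: (K, u, v) forward feature maps A^k of the last conv layer
    gradients:   (K, u, v) gradients dy^c/dA^k for the target class c
    Returns a (u, v) non-negative localization map.
    """
    # Eq. (1): neuron importance weights alpha^c_k via global average pooling
    # of the gradients (mean over the u x v spatial positions, i.e. 1/Z * sum).
    alpha = gradients.mean(axis=(1, 2))                # shape (K,)
    # Eq. (2): weighted combination of the activation maps, then ReLU.
    cam = np.einsum('k,kuv->uv', alpha, activations)   # shape (u, v)
    return np.maximum(cam, 0.0)

# Toy usage: channel 0 contributes +2 everywhere, channel 1 contributes -1.
acts = np.ones((2, 3, 3))
grads = np.stack([np.full((3, 3), 2.0), np.full((3, 3), -1.0)])
cam = grad_cam(acts, grads)
```

In the toy call, the channel weights come out as 2 and −1, so the combined map is 1 everywhere before the ReLU, which leaves it unchanged.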
By contrast, class activation mapping (CAM) generates the $K$ feature maps of the penultimate layer, $A^{k} \in \mathbb{R}^{u \times v}$. These feature maps are then spatially pooled using GAP and linearly transformed to generate a score for each gesture class.