



2.2. CNN model for gesture recognition
A CNN is a feed-forward neural network whose artificial neurons respond to a subset of the surrounding units within their receptive fields, which makes it well suited to large-scale image processing [21]. A computational model based on a deep CNN is trained end-to-end, from raw pixels to the final category, without any additional information or manually designed feature extractors; this method can therefore effectively fulfill the requirements of our experimental validation. A deep learning framework is used to recognize gestures from sEMG images and to computationally elucidate the patterns in transient sEMG images. We built a network architecture with four convolutional layers and three fully connected layers. Although this is a very basic CNN, it still achieves good test results.
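
As a minimal sketch of such an architecture, assuming PyTorch: the paper specifies only the layer counts (four convolutional, three fully connected), so the channel widths, kernel sizes, input channels, and class count below are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    """Sketch of a four-conv, three-FC gesture classifier (dimensions assumed)."""
    def __init__(self, in_channels=1, num_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),  # final conv layer, used later by Grad-CAM
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, num_classes),  # linear logits y^c, before softmax
        )

    def forward(self, x):
        return self.classifier(self.features(x))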

2.3. Critical electrode channel selection
Muscle synergy methods conventionally rely on manually extracting specific features from sEMG signals. The Grad-CAM interpretability method can be embedded in a CNN, avoiding the hand-selection of features and thereby yielding information about muscle activation during the recognition process. In addition, Grad-CAM can explain, from a global perspective, the basis on which the network recognizes gestures and quantify the contribution of each feature to the gesture recognition task. The electrode channel at the location of the contributing feature region is selected and used as the key electrode channel for muscle synergy analysis. Grad-CAM uses the gradients of any target concept flowing into the final convolutional layer to generate a coarse localization map that highlights the regions of the image important for predicting the concept [22]. Given a gesture image as input, we propagate it through the CNN part of the model and obtain the raw score for the target class by task-specific computation. The gradients of all classes are set to zero except for the desired class, which is set to one. This signal is then back-propagated to the rectified convolutional feature maps, and we combine the two to compute the coarse Grad-CAM localization, whose result indicates where the model must look to make its specific decision. Finally, we multiply the heat map pointwise with the guided backpropagation result to obtain a concept-specific Grad-CAM visualization.
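
A minimal sketch of this forward/backward procedure, assuming the hypothetical GestureCNN above; the helper name, hook placement, and tensor shapes are our own assumptions, not code from the paper.

import torch

class GradCamHooks:
    """Capture the final conv layer's activations and back-propagated gradients."""
    def __init__(self, model, target_layer):
        self.model = model
        self.activations = None
        self.gradients = None
        target_layer.register_forward_hook(self._save_activations)
        target_layer.register_full_backward_hook(self._save_gradients)

    def _save_activations(self, module, args, output):
        self.activations = output.detach()        # A^k, shape (1, K, u, v)

    def _save_gradients(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()  # dy^c/dA^k, shape (1, K, u, v)

    def run(self, image, class_idx):
        logits = self.model(image)                # raw class scores y^c, before softmax
        self.model.zero_grad()
        one_hot = torch.zeros_like(logits)
        one_hot[0, class_idx] = 1.0               # gradient 1 for the desired class, 0 elsewhere
        logits.backward(gradient=one_hot)         # back-propagate to the conv feature maps
        return self.activations, self.gradients

For example, hooks = GradCamHooks(model, model.features[7]) attaches to the fourth (final) convolutional layer of the sketch above; Equations (1) and (2) below then turn the captured tensors into the localization map.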


The class-discriminative localization map for gesture category $c$, $L^c_{\text{Grad-CAM}} \in \mathbb{R}^{u \times v}$, is calculated in the following steps. We first compute the gradient $\partial y^c / \partial A^k$ of the score for gesture category $c$ before the softmax. Here, $y^c$ is the linear categorical logit score of gesture category $c$, and $A^k$ is the feature map of the $k$-th channel of the convolutional layer, with $A^k_{ij}$ denoting each element of that feature map. The neuron importance weights $\alpha^c_k$ are obtained by applying a global average pooling (GAP) to these gradients. The importance of the $k$-th channel for gesture category $c$, $\alpha^c_k$, is calculated as

$$\alpha^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A^k_{ij}} \tag{1}$$

where $Z$ is the number of spatial locations in the feature map.
This weight $\alpha^c_k$ represents a partial linearization of the deep network downstream of $A^k$ and captures the importance of feature map $k$ for the target gesture class $c$. The localization map is then obtained as a weighted combination of the forward activation maps passed through a ReLU, which filters out irrelevant (negatively contributing) features. The map $L^c_{\text{Grad-CAM}} \in \mathbb{R}^{u \times v}$ is computed as

$$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_k \alpha^c_k A^k \right) \tag{2}$$
                                                                                                    
By contrast, class activation mapping (CAM) operates on the $K$ feature maps of the penultimate layer, $A^k \in \mathbb{R}^{u \times v}$. These feature maps are spatially merged using GAP and then linearly transformed to generate a score $y^c$ for each gesture class.
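
For comparison, the standard CAM formulation (restated here from the original CAM method, not from this page) writes the class score as a linear combination of globally pooled feature maps:

$$y^c = \sum_k w^c_k \, \frac{1}{Z} \sum_i \sum_j A^k_{ij}$$

where $w^c_k$ are the weights of the final fully connected layer. CAM therefore requires this GAP-plus-linear architecture, whereas Grad-CAM recovers the corresponding weights from the gradients in Equation (1) without modifying the network.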