Ji et al. Intell Robot 2021;1(2):151-75  https://dx.doi.org/10.20517/ir.2021.14     Page 161

               • Xception is a 71-layer CNN with an input image size of 299 × 299[43]. The network was trained on more
               than a million images from the ImageNet database and learned rich feature representations for a wide range
               of images. Users can load a pre-trained version of the network that can classify images into 1000 object
               categories.


               • MobileNet is a lightweight deep neural network designed for computer vision tasks in mobile
               applications[44]. Because a filter's depth and spatial dimensions can be separated, MobileNet uses depthwise
               separable convolutions to significantly reduce the number of parameters. A depthwise separable convolution
               consists of a depthwise convolution, which applies a D_K × D_K spatial filter to each channel separately,
               followed by a pointwise 1 × 1 convolution that changes the channel dimension.


               • FractalNet is a type of CNN that uses a fractal design instead of residual connections[45]. A simple
               expansion rule is repeatedly applied to generate deep networks. These networks have structures of truncated
               fractals and contain interacting subpaths of different lengths. There are no pass-through or residual
               connections, and every internal signal is transformed before flowing to subsequent layers.

               • Both Trimps-Soushen and PolyNet performed very well in the ILSVRC image classification competition.
               Trimps-Soushen uses the pre-trained models from Inception-v3, Inception-v4, Inception-ResNet-v2,
               Pre-Activation ResNet-200, and Wide ResNet (WRN-68-2) for classification. PolyNet introduced a building
               block called the PolyInception module, which adds a second-order polynomial term to increase the
               accuracy. A very deep PolyNet is then composed from PolyInception modules.
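
               The parameter saving that MobileNet obtains from depthwise separable convolutions can be checked with a
               little arithmetic. The sketch below counts weights only (no biases) for a single layer with M input
               channels, N output channels, and a D_K × D_K kernel. The function names and the layer sizes are
               illustrative assumptions, not values taken from the cited work.

```python
def standard_conv_params(d_k, m, n):
    """Weights in a standard D_K x D_K convolution mapping M to N channels (no bias)."""
    return d_k * d_k * m * n


def depthwise_separable_params(d_k, m, n):
    """Weights in a depthwise D_K x D_K conv followed by a pointwise 1 x 1 conv (no bias)."""
    depthwise = d_k * d_k * m  # one D_K x D_K spatial filter per input channel
    pointwise = m * n          # 1 x 1 convolution that mixes channels, M -> N
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    d_k, m, n = 3, 64, 128
    std = standard_conv_params(d_k, m, n)
    sep = depthwise_separable_params(d_k, m, n)
    print(std, sep, sep / std)
```

               For these assumed sizes, the standard layer has 73,728 weights and the separable one has 8,768, a ratio
               of 1/N + 1/D_K², or roughly 12%.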

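               The repeated expansion rule behind FractalNet can be illustrated structurally. The sketch below assumes
               the rule takes the commonly stated form f_1 = conv and f_(C+1) = join(f_C followed by f_C, conv); the
               text above does not spell the rule out, so treat that form, the tuple encoding, and all function names
               as assumptions. No tensors are involved: the code only builds the structure and lists the lengths of its
               subpaths.

```python
def expand(columns):
    """Build a fractal block with the given number of columns as nested tuples."""
    if columns == 1:
        return "conv"  # base case: a single convolution layer
    sub = expand(columns - 1)
    # Expansion step: two copies of the previous fractal in sequence,
    # joined with one fresh convolution running in parallel.
    return ("join", ("seq", sub, sub), "conv")


def path_lengths(block):
    """Return the set of conv-layer counts over all input-to-output subpaths."""
    if block == "conv":
        return {1}
    if block[0] == "seq":
        first, second = path_lengths(block[1]), path_lengths(block[2])
        return {a + b for a in first for b in second}
    # "join": a signal may travel through either branch
    return path_lengths(block[1]) | path_lengths(block[2])


def count_convs(block):
    """Count the convolution layers in the whole block."""
    if block == "conv":
        return 1
    return count_convs(block[1]) + count_convs(block[2])


if __name__ == "__main__":
    block = expand(4)
    print(sorted(path_lengths(block)), count_convs(block))
```

               Under this assumed rule, a four-column block contains 15 convolutions and subpaths of every length from
               1 to 8, which matches the description of interacting subpaths of different lengths with no pass-through
               connection.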

               For object detection tasks, the following deep learning methods can be deployed:

               • OverFeat is a classic type of CNN architecture, employing convolution, pooling, and fully connected
               layers[46].

               • R-CNN uses the selective search algorithm to extract only about 2000 regions from the image as region
               proposals[47]. A CNN extracts the features from each proposed region, and the features at the output
               dense layer are fed into an SVM to classify the presence of the object within that candidate region proposal.
               In Fast R-CNN, the region proposals are identified from the convolutional feature map that the CNN
               generates from the input image[48]. The region proposals are then warped into squares and reshaped into a
               fixed size by an RoI pooling layer before being fed into a fully connected layer. From the RoI feature
               vector, a softmax layer predicts the class of the proposed region, and a sibling regression layer predicts
               the offset values for the bounding box. Fast R-CNN is faster than R-CNN because the convolution operation
               is performed only once per image to generate a feature map. Faster R-CNN is similar to Fast R-CNN but much
               faster. It uses a separate network to predict the region proposals instead of running a selective search
               algorithm on the feature map generated by the CNN[49]. An RoI pooling layer then reshapes the predicted
               region proposals, which are used to classify the object within each proposed region and to predict the
               offset values for the bounding boxes. Faster R-CNN can be adopted for real-time object detection tasks.

               • DeepID-Net introduces a deformable part-based CNN[50]. A new deformable constrained pooling layer
               models the deformation of the object parts with geometric constraint and penalty. Besides directly detecting
               the entire object, it is also crucial to detect object parts which can then support detecting the entire object.
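
               The RoI pooling step used by Fast R-CNN and Faster R-CNN, which reshapes region proposals of any size
               into a fixed-size feature, can be sketched in plain Python. This is a simplified illustration under
               stated assumptions: a single-channel feature map stored as a nested list, RoI coordinates already
               expressed in feature-map cells with exclusive end indices, and a floor/ceil binning scheme chosen here
               for simplicity. The function names are hypothetical, and real implementations differ in how they
               quantize bins.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    roi is (top, left, bottom, right) in feature-map cells, end-exclusive,
    and must cover at least one cell in each direction.
    """
    top, left, bottom, right = roi
    roi_h, roi_w = bottom - top, right - left
    pooled = []
    for i in range(out_h):
        # Floor for the start and ceil for the end keep every bin non-empty.
        r0 = top + (i * roi_h) // out_h
        r1 = top + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c0 = left + (j * roi_w) // out_w
            c1 = left + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    # Toy 4 x 4 feature map holding the values 0..15 row by row.
    fmap = [[4 * r + c for c in range(4)] for r in range(4)]
    print(roi_max_pool(fmap, (0, 0, 4, 4), 2, 2))  # 4 x 4 RoI
    print(roi_max_pool(fmap, (1, 0, 4, 3), 2, 2))  # 3 x 3 RoI
```

               Both calls return a 2 × 2 grid even though the two RoIs differ in size, which is what allows the
               following fully connected layer to accept arbitrary region proposals.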