
• R-FCN follows the logic of R-CNN-based detectors but reduces the amount of work needed for each region proposal to increase detection speed[51]. The region-based feature maps are computed once, outside the region proposals, and are independent of them, so each proposal only requires a cheap pooling step (sketched below).
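As an illustration, the sketch below uses torchvision's position-sensitive RoI pooling to show how one shared score map serves every proposal; the feature-map size, 3×3 position grid, class count, and the two proposal boxes are all hypothetical, and the full R-FCN pipeline (backbone, proposal network, training) is omitted.

    import torch
    from torchvision.ops import ps_roi_pool

    k, num_classes = 3, 21                       # 3x3 position grid, 20 classes + background
    # Score maps are computed ONCE per image, outside the proposals.
    score_maps = torch.randn(1, k * k * num_classes, 50, 50)

    # Two hypothetical proposals: (batch_index, x1, y1, x2, y2) in image coordinates.
    rois = torch.tensor([[0,   0.0,  0.0, 320.0, 320.0],
                         [0, 100.0, 80.0, 400.0, 360.0]])

    # Per-proposal work is just position-sensitive pooling: (num_rois, num_classes, k, k).
    pooled = ps_roi_pool(score_maps, rois, output_size=(k, k), spatial_scale=50 / 800)
    class_scores = pooled.mean(dim=(2, 3))       # vote over the k*k positions
    print(class_scores.shape)                    # torch.Size([2, 21])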

• The You Only Look Once (YOLO) algorithm detects objects in real time with a single neural network[52]. Its architecture passes the n×n image once through a fully convolutional neural network and outputs an m×m prediction. YOLO splits the input image into an m×m grid and, for each grid cell, generates two bounding boxes with associated class probabilities; the bounding boxes can be larger than the cell itself. During training, different weights are applied to the confidence predictions of boxes with and without objects, and the square roots of box width and height are predicted so that the same absolute error is penalized more heavily for small boxes than for large ones. These changes to the loss function enable YOLO to produce better results (a sketch of the loss terms follows below). YOLOv3 (You Only Look Once, Version 3) was the version most commonly adopted by the studies reviewed here.
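A minimal sketch of the YOLO loss terms just described, with a hypothetical tensor layout and one box per cell; the real loss also handles class probabilities and box-responsibility assignment. lambda_coord up-weights localization for cells that contain objects, lambda_noobj down-weights confidence for empty cells, and the square roots make small-box errors count more.

    import torch

    lambda_coord, lambda_noobj = 5.0, 0.5

    def yolo_box_loss(pred, target, obj_mask):
        """pred/target: (cells, 5) = (x, y, w, h, confidence); obj_mask: (cells,) bool."""
        obj, noobj = obj_mask, ~obj_mask
        xy_loss = ((pred[obj, :2] - target[obj, :2]) ** 2).sum()
        # Square roots of width/height: the same absolute error costs more for small boxes.
        wh_loss = ((pred[obj, 2:4].sqrt() - target[obj, 2:4].sqrt()) ** 2).sum()
        conf_obj = ((pred[obj, 4] - target[obj, 4]) ** 2).sum()
        conf_noobj = ((pred[noobj, 4] - target[noobj, 4]) ** 2).sum()
        return lambda_coord * (xy_loss + wh_loss) + conf_obj + lambda_noobj * conf_noobj

    pred = torch.rand(49, 5)                  # 7x7 grid flattened, one box per cell
    target = torch.rand(49, 5)
    obj_mask = torch.zeros(49, dtype=torch.bool)
    obj_mask[10] = True                       # one cell contains an object
    print(yolo_box_loss(pred, target, obj_mask))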

• The Single Shot MultiBox Detector (SSD) builds on the VGG-16 architecture while discarding its fully connected layers [Figure 7][53]. The original VGG fully connected layers are replaced with a set of auxiliary convolutional layers (from conv6 onwards) that extract features at multiple scales and progressively decrease the size of the input to each subsequent layer, as sketched below.
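A minimal sketch of SSD-style auxiliary layers with simplified channel counts; the real model attaches a detection head to each scale, which is omitted here. Each stride-2 block halves the spatial size, so features are extracted at progressively coarser scales.

    import torch
    import torch.nn as nn

    aux_layers = nn.ModuleList([
        nn.Sequential(nn.Conv2d(512, 256, 1), nn.ReLU(inplace=True),
                      nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
        nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(inplace=True),
                      nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True)),
    ])

    x = torch.randn(1, 512, 19, 19)             # hypothetical feature map from the VGG-16 base
    feature_maps = [x]
    for block in aux_layers:
        x = block(x)
        feature_maps.append(x)                  # detection heads would run on each of these
    print([f.shape[-1] for f in feature_maps])  # [19, 10, 5]: multi-scale feature maps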


For semantic segmentation tasks, the following deep learning methods can be adopted:


• FCN uses a CNN to transform image pixels to pixel classes[54]. Instead of performing image classification or object detection, FCN transforms the height and width of intermediate feature maps back to those of the input image using transposed convolutional layers. The classification output and the input image therefore correspond one-to-one at the pixel level: the classification result for each input pixel is held in the channel dimension at the output pixel in the same spatial position.
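A minimal sketch of an FCN-style head with hypothetical channel counts: a 1×1 convolution maps backbone features to per-class scores, and a transposed convolution scales the score map back to the input resolution, leaving one class-score vector per input pixel along the channel dimension.

    import torch
    import torch.nn as nn

    num_classes, stride = 21, 16
    head = nn.Sequential(
        nn.Conv2d(512, num_classes, kernel_size=1),                 # features -> class scores
        nn.ConvTranspose2d(num_classes, num_classes,                # 16x upsampling back to input size
                           kernel_size=2 * stride, stride=stride, padding=stride // 2),
    )

    features = torch.randn(1, 512, 14, 14)     # backbone output for a 224x224 input image
    logits = head(features)
    print(logits.shape)                        # torch.Size([1, 21, 224, 224])
    pixel_classes = logits.argmax(dim=1)       # per-pixel class map, same H x W as the input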


• DeconvNet gradually deconvolves and un-pools to obtain its output label map, in contrast to the conventional FCN, whose coarser up-sampling can produce a rough segmentation label map[55].
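A minimal sketch of one DeconvNet-style decoder stage (illustrative shapes only): max-unpooling restores activations to the locations recorded during encoder pooling, and a transposed convolution then densifies the sparse map, rather than relying on a single coarse up-sampling step.

    import torch
    import torch.nn as nn

    pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # encoder pooling keeps indices
    unpool = nn.MaxUnpool2d(2, stride=2)                    # decoder reverses it exactly
    deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1)

    x = torch.randn(1, 64, 32, 32)
    pooled, indices = pool(x)                 # (1, 64, 16, 16) plus pooling locations
    restored = unpool(pooled, indices)        # sparse (1, 64, 32, 32) map
    dense = deconv(restored)                  # densified output at the restored resolution
    print(dense.shape)                        # torch.Size([1, 64, 32, 32])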

• DeepLab applies atrous convolution for up-sampling[56]. Atrous convolution is shorthand for convolution with up-sampled filters, where filter up-sampling amounts to inserting holes between nonzero filter taps. It effectively enlarges the field of view of the filters without increasing the number of parameters or the amount of computation. The dense prediction is produced by up-sampling the output of the last convolutional layer and computing a pixel-wise loss.
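The comparison below sketches this property with hypothetical channel counts: both layers use a 3×3 kernel and have identical parameter counts, but dilation=2 inserts holes between the filter taps and enlarges the field of view from 3×3 to an effective 5×5.

    import torch
    import torch.nn as nn

    standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)
    atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # holes between taps

    x = torch.randn(1, 64, 32, 32)
    print(standard(x).shape, atrous(x).shape)      # same output size: (1, 64, 32, 32)
    n_params = lambda m: sum(p.numel() for p in m.parameters())
    print(n_params(standard) == n_params(atrous))  # True: larger field of view, no extra parameters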

• ParseNet aggregates the activations of each channel of a feature map to capture contextual information[57]. These aggregations are merged and appended to the final features of the network, as sketched below. This approach is less laborious than the proposal-plus-classification approach and avoids the unrelated predictions for different pixels that can occur under the FCN approach.
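A minimal sketch of ParseNet-style global-context aggregation, simplified from the paper (which also L2-normalizes both branches before merging): each channel is averaged over the whole image, broadcast back to the spatial grid, and concatenated with the local features so that every pixel sees global context.

    import torch

    features = torch.randn(1, 256, 32, 32)            # hypothetical feature map
    context = features.mean(dim=(2, 3), keepdim=True) # global average per channel: (1, 256, 1, 1)
    context = context.expand_as(features)             # broadcast back to the 32x32 grid
    fused = torch.cat([features, context], dim=1)     # appended to the final features
    print(fused.shape)                                # torch.Size([1, 512, 32, 32])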


• DilatedNet uses dilated convolutions (filters with holes) to avoid losing resolution altogether[58]. In this way, the receptive field grows exponentially while the number of parameters grows only linearly. The front end is based on VGG-16, with the last two pooling layers replaced by dilated convolutions. A context module, a plug-and-play structure, is introduced for multi-scale reasoning by applying a stack of dilated convolutions to a feature map (sketched below).
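A minimal sketch of a DilatedNet-style context module with simplified channel counts: doubling the dilation at each layer grows the receptive field exponentially (3, 7, 15, 31 pixels here) while the parameter count grows only linearly with depth, and the spatial resolution is preserved throughout.

    import torch
    import torch.nn as nn

    dilations = [1, 2, 4, 8]                  # doubling dilation -> exponential receptive field
    context = nn.Sequential(*[
        nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d),
                      nn.ReLU(inplace=True))
        for d in dilations
    ])

    feature_map = torch.randn(1, 64, 64, 64)  # hypothetical output of the VGG-16 front end
    out = context(feature_map)                # same resolution, much larger receptive field
    print(out.shape)                          # torch.Size([1, 64, 64, 64])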