• R-FCN follows the logic of R-CNN-based detectors but reduces the amount of work needed for each region proposal to increase speed[51]. The region-based feature maps are computed once, outside the region proposals, and are independent of them (see the first sketch after this list).
• The You Only Look Once (YOLO) algorithm detects objects in real time using a single neural network[52]. Its architecture passes the n×n input image once through a fully convolutional neural network and outputs an m×m prediction. YOLO splits the input image into an m×m grid and, for each grid cell, generates two bounding boxes with associated class probabilities; the bounding boxes may be larger than the cell itself. During training, different weights are applied to the confidence predictions of boxes with and without objects, and the square roots of the bounding-box width and height are predicted rather than the raw values, so that a given error penalizes small boxes more than large ones. These changes to the loss function enable YOLO to produce better results (a sketch of these loss terms follows this list). YOLOv3 (You Only Look Once, version 3) was the variant most commonly adopted by the studies reviewed here.
• The Single Shot MultiBox Detector (SSD) builds on the VGG-16 architecture while discarding its fully connected layers [Figure 7][53]. The original VGG fully connected layers are replaced with a set of auxiliary convolutional layers (from conv6 onwards) that extract features at multiple scales and progressively decrease the size of the input to each subsequent layer (see the multi-scale sketch after this list).
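To make the R-FCN idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the tensor shapes, channel counts, and region proposals are all hypothetical. It shows position-sensitive score maps computed once from a shared feature map, with only a cheap pooling step performed per region proposal (torchvision's ps_roi_pool implements that pooling).

```python
import torch
import torchvision.ops as ops

# Hypothetical setup: 21 classes (20 + background) and a 3 x 3
# position-sensitive grid, the R-FCN paper's default.
num_classes = 21
k = 3

# Shared backbone feature map, computed ONCE for the whole image.
features = torch.randn(1, 512, 50, 50)

# A 1x1 convolution produces k*k*num_classes position-sensitive score maps,
# still independent of any region proposal.
score_conv = torch.nn.Conv2d(512, k * k * num_classes, kernel_size=1)
score_maps = score_conv(features)                       # (1, k*k*C, 50, 50)

# Two hypothetical region proposals, (batch_idx, x1, y1, x2, y2) format.
rois = torch.tensor([[0., 4., 4., 40., 40.],
                     [0., 10., 8., 30., 44.]])

# Per-region work is just position-sensitive RoI pooling: each of the k*k
# bins reads its own channel group, then bins are averaged into class scores.
pooled = ops.ps_roi_pool(score_maps, rois, output_size=(k, k))  # (R, C, k, k)
class_scores = pooled.mean(dim=(2, 3))                          # (R, C)
print(class_scores.shape)  # torch.Size([2, 21])
```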
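The YOLO loss modifications described above can be illustrated with a toy PyTorch sketch. This is only a partial loss under assumed shapes and random tensors; the λ weights are the values from the original YOLO paper, while box matching and the coordinate and class terms are omitted.

```python
import torch

# Toy sketch of the YOLO loss terms described above (hypothetical tensors).
S, B = 7, 2                              # S x S grid, B boxes per cell
lambda_coord, lambda_noobj = 5.0, 0.5    # weights from the original paper

pred_wh   = torch.rand(S, S, B, 2)       # predicted width/height (normalized)
true_wh   = torch.rand(S, S, B, 2)
pred_conf = torch.rand(S, S, B)
obj_mask  = torch.zeros(S, S, B, dtype=torch.bool)
obj_mask[3, 4, 0] = True                 # one box responsible for an object
true_conf = obj_mask.float()

# Square-root encoding: the same absolute error costs more for small boxes.
wh_loss = lambda_coord * ((pred_wh[obj_mask].sqrt()
                           - true_wh[obj_mask].sqrt()) ** 2).sum()

# Confidence errors are weighted differently with and without an object.
conf_err = (pred_conf - true_conf) ** 2
conf_loss = conf_err[obj_mask].sum() + lambda_noobj * conf_err[~obj_mask].sum()

print((wh_loss + conf_loss).item())
```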
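Finally for this list, a rough PyTorch sketch of SSD's auxiliary multi-scale layers; the channel sizes and the 38 × 38 base feature map are illustrative stand-ins, not the exact values of the published architecture.

```python
import torch
import torch.nn as nn

# Rough sketch of SSD's auxiliary layers on top of a truncated VGG-16.
# Each extra stage halves the spatial size; detections are read from every
# scale. Channel counts here are illustrative.
base_out = torch.randn(1, 512, 38, 38)   # stand-in for a VGG-16 feature map

aux = nn.ModuleList([
    nn.Sequential(nn.Conv2d(512, 256, 1), nn.ReLU(),
                  nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(),
                  nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(256, 128, 1), nn.ReLU(),
                  nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU()),
])

x = base_out
feature_maps = [x]
for stage in aux:
    x = stage(x)                 # 38x38 -> 19x19 -> 10x10 -> 5x5
    feature_maps.append(x)

for f in feature_maps:           # a detection head would run on each map
    print(tuple(f.shape))
```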
For semantic segmentation tasks, the following deep learning methods can be adopted:
• FCN uses a CNN to transform image pixels into pixel classes[54]. Instead of performing image classification or object detection, FCN transforms the height and width of the intermediate feature maps back to those of the input image using transposed convolutional layers. Thus, the classification output and the input image have a one-to-one correspondence at the pixel level: the classification result for each input pixel is held in the channel dimension at the output pixel in the same spatial position (a minimal sketch follows this list).
• DeconvNet gradually deconvolves and unpools to obtain its output label map, in contrast to the conventional FCN, whose segmentation output label map can be coarse[55] (see the unpooling sketch after this list).
• DeepLab applies atrous convolution for up-sampling[56]. Atrous convolution is shorthand for convolution with up-sampled filters; filter up-sampling amounts to inserting holes between nonzero filter taps. Atrous convolution effectively enlarges the field of view of the filters without increasing the number of parameters or the amount of computation (see the dilation sketch after this list). Up-sampling the output of the last convolutional layer and computing the pixel-wise loss produce the dense prediction.
• ParseNet aggregates the activations of each channel's feature map to capture contextual information[57]. These aggregated global features are then merged back into, i.e., appended to, the final features of the network (a global-pooling sketch follows this list). This approach is less laborious than the proposal-plus-classification approach and avoids the unrelated predictions for different pixels that can arise under the FCN approach.
• DilatedNet uses dilated convolutions, i.e., filters with holes, to avoid losing resolution altogether[58]. In this way, the receptive field can grow exponentially while the number of parameters grows only linearly. The front end is based on VGG-16, with the last two pooling layers replaced by dilated convolutions. A plug-and-play context module is introduced for multi-scale reasoning using a stack of dilated convolutions on a feature map (see the final sketch after this list).
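A minimal PyTorch sketch of the FCN idea follows; the layer sizes are hypothetical, and a real FCN uses a much deeper backbone with skip connections. The point is that a transposed convolution restores the input height and width, so the channel dimension at each output pixel holds that pixel's class scores.

```python
import torch
import torch.nn as nn

# Minimal FCN-style sketch with hypothetical layer sizes: downsample with
# convolution + pooling, score each location, then restore the input H x W
# with a transposed convolution so channels hold per-pixel class scores.
num_classes = 21
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                               # 1/2 resolution
    nn.Conv2d(64, num_classes, 1),                 # per-location class scores
    nn.ConvTranspose2d(num_classes, num_classes,   # learnable up-sampling
                       kernel_size=4, stride=2, padding=1),
)

x = torch.randn(1, 3, 128, 128)
logits = net(x)                # (1, num_classes, 128, 128): same H x W as input
pred = logits.argmax(dim=1)    # per-pixel class labels
print(logits.shape, pred.shape)
```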
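For DeconvNet, the key operations are unpooling with saved max-pooling switches followed by deconvolution. A toy sketch of one such step, with hypothetical shapes:

```python
import torch
import torch.nn as nn

# Toy DeconvNet-style step: max-pool while saving the switch locations,
# then unpool with those switches and "deconvolve" (transposed convolution)
# toward a dense label map.
pool = nn.MaxPool2d(2, return_indices=True)
unpool = nn.MaxUnpool2d(2)
deconv = nn.ConvTranspose2d(64, 21, kernel_size=3, padding=1)  # 21 classes

x = torch.randn(1, 64, 32, 32)
pooled, switches = pool(x)           # (1, 64, 16, 16) + pooling switches
restored = unpool(pooled, switches)  # (1, 64, 32, 32), sparse activations
labels = deconv(restored)            # (1, 21, 32, 32) dense class scores
print(labels.shape)
```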
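The dilation sketch below illustrates the atrous convolution used by DeepLab: increasing the dilation rate enlarges the effective field of view of a 3 × 3 kernel while the parameter count and output size stay the same (all shapes are hypothetical).

```python
import torch
import torch.nn as nn

# Atrous (dilated) convolution sketch: the same 3x3 kernel covers a larger
# field of view as the dilation rate inserts holes between filter taps,
# without adding parameters or changing the output size.
x = torch.randn(1, 64, 32, 32)

conv_r1 = nn.Conv2d(64, 64, 3, padding=1, dilation=1)  # 3x3 field of view
conv_r2 = nn.Conv2d(64, 64, 3, padding=2, dilation=2)  # effective 5x5
conv_r4 = nn.Conv2d(64, 64, 3, padding=4, dilation=4)  # effective 9x9

for conv in (conv_r1, conv_r2, conv_r4):
    y = conv(x)
    n_params = sum(p.numel() for p in conv.parameters())
    print(tuple(y.shape), n_params)  # same shape, same parameter count
```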
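Next, a global-pooling sketch of the ParseNet idea with hypothetical shapes; the L2 normalization the paper applies before merging is omitted here.

```python
import torch
import torch.nn.functional as F

# ParseNet-style global context: average each channel over the whole map,
# broadcast ("unpool") it back, and append it to the local features.
features = torch.randn(1, 256, 32, 32)           # stand-in FCN features

global_ctx = F.adaptive_avg_pool2d(features, 1)  # (1, 256, 1, 1)
global_ctx = global_ctx.expand_as(features)      # back to 32 x 32

combined = torch.cat([features, global_ctx], dim=1)  # (1, 512, 32, 32)
print(combined.shape)
```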
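Finally, a sketch of a DilatedNet-style context module as a stack of 3 × 3 convolutions with exponentially increasing dilation; the channel count and dilation schedule are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# DilatedNet-style context module: stacked 3x3 convolutions with
# exponentially increasing dilation. The receptive field grows roughly
# exponentially while parameters grow linearly (one 3x3 kernel per layer).
C = 64
context = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(C, C, 3, padding=d, dilation=d), nn.ReLU())
    for d in (1, 2, 4, 8)            # illustrative dilation schedule
])

x = torch.randn(1, C, 64, 64)
y = context(x)                       # plug-and-play: same shape in and out
print(y.shape)                       # torch.Size([1, 64, 64, 64])
# Receptive field after dilations 1, 2, 4, 8: 1 + 2*(1+2+4+8) = 31 pixels.
```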