Ji et al. Intell Robot 2021;1(2):151-75 https://dx.doi.org/10.20517/ir.2021.14 Page 161
• Xception is a 71-layer CNN with an input image size of 299 × 299[43]. The network was trained on more
than a million images from the ImageNet database and learned rich feature representations for a wide range
of images. Users can load a pre-trained version of the network that can classify images into 1000 object
categories.
• MobileNet is a lightweight deep neural network designed for mobile applications of computer vision
tasks[44]. Because a filter's depth and spatial dimensions can be separated, MobileNet uses depthwise separable
convolutions to significantly reduce the number of parameters. A depthwise separable convolution consists of a
depthwise convolution (a channel-wise DK × DK spatial convolution) followed by a pointwise (1 × 1)
convolution that changes the channel dimension.
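The parameter savings can be illustrated with a quick count: a standard DK × DK convolution mapping M input channels to N output channels needs DK·DK·M·N weights, while the depthwise-plus-pointwise factorization needs only DK·DK·M + M·N. A minimal sketch (function names are ours, for illustration):

```python
# Parameter counts for a standard vs. a depthwise separable convolution
# (bias terms omitted for simplicity).

def standard_conv_params(dk: int, m: int, n: int) -> int:
    """DK x DK convolution from M input channels to N output channels."""
    return dk * dk * m * n

def separable_conv_params(dk: int, m: int, n: int) -> int:
    """Depthwise DK x DK per-channel convolution plus a 1 x 1 pointwise convolution."""
    return dk * dk * m + m * n

# Example: 3 x 3 kernel, 64 -> 128 channels.
std = standard_conv_params(3, 64, 128)   # 9 * 64 * 128 = 73,728 weights
sep = separable_conv_params(3, 64, 128)  # 576 + 8,192 = 8,768 weights
print(std, sep, round(std / sep, 1))
```

For this typical layer size, the separable form uses roughly 8× fewer parameters, which is the source of MobileNet's efficiency on mobile hardware.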
• FractalNet is a type of CNN that uses a fractal design instead of residual connections[45]. A simple
expansion rule is repeatedly applied to generate deep networks. These networks have structures of truncated
fractals and contain interacting subpaths of different lengths. There are no pass-through or residual
connections, and every internal signal is transformed before flowing to subsequent layers.
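The expansion rule can be sketched recursively: with f₁(z) a single convolution, the rule f_{C+1}(z) joins conv(z) with f_C applied twice, so the longest path doubles at each expansion. A toy Python sketch that counts layers only (not an actual network; the recurrences are ours, derived from the rule):

```python
def longest_path(c: int) -> int:
    """Depth (in conv layers) of the longest path in a fractal block of order C.
    f_1 is one conv; f_{C+1} joins conv(z) with f_C applied twice in sequence."""
    if c == 1:
        return 1
    return 2 * longest_path(c - 1)

def num_convs(c: int) -> int:
    """Total conv layers in a fractal block of order C: L_{C+1} = 2*L_C + 1."""
    if c == 1:
        return 1
    return 2 * num_convs(c - 1) + 1

print([longest_path(c) for c in range(1, 5)])  # [1, 2, 4, 8]
print([num_convs(c) for c in range(1, 5)])     # [1, 3, 7, 15]
```

The interacting subpaths of lengths 1, 2, 4, ... are exactly what replaces the pass-through shortcuts of residual networks.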
• Both Trimps-Soushen and PolyNet performed very well in the ILSVRC image classification competition.
Trimps-Soushen uses the pre-trained models from Inception-v3, Inception-v4, Inception-ResNet-v2, Pre-
Activation ResNet-200, and Wide ResNet (WRN-68-2) for classification. PolyNet introduced a building
block called PolyInception module formed by adding a polynomial second-order term to increase the
accuracy. Then, a very deep PolyNet is composed based on the PolyInception module.
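The polynomial composition behind the PolyInception module can be sketched abstractly: in its second-order form, a block computes I + F + F∘F, i.e., the input plus a first-order and a second-order application of an Inception unit F. A toy sketch with a scalar stand-in for F (the function here is illustrative, not a real Inception unit):

```python
def poly2(f, x):
    """Second-order PolyInception-style composition: x + F(x) + F(F(x)).
    f stands in for an Inception unit; any callable works for illustration."""
    fx = f(x)
    return x + fx + f(fx)

# Toy example with a scalar "unit" F(x) = 0.5 * x.
f = lambda x: 0.5 * x
print(poly2(f, 4.0))  # 4.0 + 2.0 + 1.0 = 7.0
```

Adding the second-order term F(F(x)) deepens the block without adding an entirely separate path, which is how PolyNet trades a small amount of computation for accuracy.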
For object detection tasks, the following deep learning methods can be deployed:
• OverFeat is a classic type of CNN architecture, employing convolution, pooling, and fully connected
layers[46].
• R-CNN extracts only 2000 regions from the image as region proposals to work with using the selective
search algorithm[47]. The CNN extracts the features from the image. The extracted features at the output
dense layer are fed into an SVM to classify the presence of the object within that candidate region proposal.
For Fast R-CNN, the region proposals are identified from the convolutional feature map generated by the
CNN with the input image[48]. The region proposals are then warped into squares and reshaped into a fixed
size using a RoI pooling layer before being fed into a fully connected layer. From the RoI feature vector, the
class of the proposed region and the offset values for the bounding box are predicted with a softmax layer.
Fast R-CNN is faster than R-CNN because the convolution operation is performed only once per image to
generate a feature map. Faster R-CNN is similar to Fast R-CNN but much faster. It uses a separate network
to predict the region proposals instead of using a selective search algorithm to identify the region proposals
on the feature map generated by the CNN[49]. An RoI pooling layer then reshapes the predicted region proposals
for classifying the image within the proposed region and predicting the offset values for the bounding boxes.
Because of this speedup, Faster R-CNN can be adopted for real-time object detection tasks.
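The RoI pooling step shared by Fast and Faster R-CNN can be sketched in NumPy: each region proposal is divided into a fixed grid of bins and max-pooled per bin, so proposals of different sizes all yield fixed-size features. This is a simplified sketch; real implementations also handle channels, sub-pixel coordinates, and feature-map strides.

```python
import numpy as np

def roi_max_pool(feat, roi, out_size=2):
    """Max-pool one region of a 2-D feature map into an out_size x out_size grid.
    feat: (H, W) array; roi: (y0, x0, y1, x1) in feature-map coordinates."""
    y0, x0, y1, x1 = roi
    out = np.zeros((out_size, out_size), dtype=feat.dtype)
    ys = np.linspace(y0, y1, out_size + 1).round().astype(int)
    xs = np.linspace(x0, x1, out_size + 1).round().astype(int)
    for i in range(out_size):
        for j in range(out_size):
            # Ensure each bin covers at least one feature-map cell.
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[i, j] = feat[ya:yb, xa:xb].max()
    return out

feat = np.arange(36).reshape(6, 6)
pooled = roi_max_pool(feat, (0, 0, 4, 4))
print(pooled)  # fixed 2 x 2 output regardless of the RoI's size
```

Because every proposal is reduced to the same shape, the subsequent fully connected layers can process all regions with a single set of weights.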
• DeepID-Net introduces a deformable part-based CNN[50]. A new deformable constrained pooling layer
models the deformation of the object parts with geometric constraint and penalty. Besides directly detecting
the entire object, it is also crucial to detect object parts which can then support detecting the entire object.