
Li et al. Intell Robot 2021;1(1):84-98  I http://dx.doi.org/10.20517/ir.2021.06      Page 88

Figure 2. The overall architecture of both the depth network and the pose network.

Figure 3. The architecture of the ResNet and ResNeXt blocks: (a) the ResNet block; and (b) the aggregated residual transformations. Both have similar complexity, but the ResNeXt block has better adaptability and extensibility.
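The "similar complexity" claim in the caption can be checked with a quick parameter count. The sketch below assumes the standard 256-channel bottleneck shapes from the original ResNeXt design (width 64 for the ResNet block; cardinality 32 with bottleneck width 4 for the ResNeXt block) and omits bias terms; these shapes are an assumption, not taken from this paper.

```python
# Parameter-count comparison between a ResNet bottleneck block and a
# ResNeXt block (cardinality 32, bottleneck width 4), illustrating the
# "similar complexity" claim in Figure 3. Shapes assume standard 256-d
# blocks; bias terms are omitted for simplicity.

def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with the given group count."""
    return (c_in // groups) * (c_out // groups) * k * k * groups

def resnet_bottleneck_params(c=256, width=64):
    # 1x1 reduce -> 3x3 -> 1x1 expand
    return (conv_params(c, width, 1)
            + conv_params(width, width, 3)
            + conv_params(width, c, 1))

def resnext_block_params(c=256, cardinality=32, bottleneck_width=4):
    inner = cardinality * bottleneck_width  # 128 inner channels
    # 1x1 reduce -> grouped 3x3 (the aggregated transformations) -> 1x1 expand
    return (conv_params(c, inner, 1)
            + conv_params(inner, inner, 3, groups=cardinality)
            + conv_params(inner, c, 1))

print(resnet_bottleneck_params())   # 69632
print(resnext_block_params())       # 70144
```

The two counts differ by under one percent, yet the ResNeXt block exposes cardinality as an extra dimension that can be tuned independently of depth and width.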


3.1. Problem statement
The aim of the unsupervised monocular depth network is to learn a mapping Γ : I(p) → D(p), where I(p) is an arbitrary image, D(p) is the predicted depth map of I(p), and p indexes each pixel in the image. This paper considers establishing a more accurate mapping function Γ, which requires: (a) a simple and effective network pipeline that does not increase the network's computational complexity; and (b) a high-quality depth map D(p) with subtle details for a given input image I(p).
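As a rough illustration, the mapping Γ can be viewed as a function from an image array to a depth map of the same spatial resolution. The sketch below is not the paper's network: it stands in a zero tensor for the depth CNN's output and uses a common sigmoid-to-depth parameterization with hypothetical depth bounds (0.1 m to 100 m) that this paper does not specify.

```python
import numpy as np

# A minimal sketch of the mapping Gamma: I(p) -> D(p). The "network"
# output here is a placeholder; a real depth CNN would produce
# `raw_output`. The sigmoid-to-depth conversion with hypothetical
# bounds (0.1 m .. 100 m) follows common practice in unsupervised
# depth estimation and is an assumption, not taken from the paper.

def gamma(image: np.ndarray) -> np.ndarray:
    """Map an H x W x 3 image to a per-pixel H x W depth map."""
    h, w = image.shape[:2]
    raw_output = np.zeros((h, w))                  # stand-in for the depth CNN
    disparity = 1.0 / (1.0 + np.exp(-raw_output))  # sigmoid in (0, 1)
    min_disp, max_disp = 1.0 / 100.0, 1.0 / 0.1    # depth in [0.1, 100]
    return 1.0 / (min_disp + (max_disp - min_disp) * disparity)

depth = gamma(np.zeros((4, 6, 3)))
# depth has the same spatial resolution as the input, with positive values
```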



For Item (a), our focus is to change the basic building block of the depth CNN using aggregated residual transformations (ResNeXt). In the depth network, ResNeXt serves as the feature extraction module, learning the image's high-dimensional features without increasing the computational burden. For Item (b), low-texture regions are weakened in low-scale depth maps, which leads to inaccurate image reconstruction. Inspired by the authors of [22], four full-resolution images are reconstructed instead of four images at different resolutions. Before the four images are reconstructed, the predicted four-scale depth maps are resized to the resolution of the input image with bilinear interpolation.
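The resizing step can be sketched as follows. The paper presumably relies on a deep learning framework's built-in bilinear upsampling; this NumPy version, using an align-corners convention, is only illustrative of the operation applied to each low-scale depth map.

```python
import numpy as np

def bilinear_resize(depth: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Upsample a low-scale depth map to full resolution with bilinear
    interpolation (align-corners convention). Illustrative sketch only."""
    in_h, in_w = depth.shape
    # Sample positions in the input grid for each output pixel.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Interpolate horizontally on the two bracketing rows, then vertically.
    top = depth[y0][:, x0] * (1 - wx) + depth[y0][:, x1] * wx
    bot = depth[y1][:, x0] * (1 - wx) + depth[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

low = np.array([[0.0, 1.0],
                [0.0, 1.0]])
full = bilinear_resize(low, 3, 3)
# middle column interpolates to 0.5: [[0, .5, 1], [0, .5, 1], [0, .5, 1]]
```

The full-resolution depth maps are then used to reconstruct the four full-resolution images, so the photometric loss is always computed at the input resolution.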



A single image I(p) is the input of the depth network. The designed depth network outputs five-scale feature maps F_s (s ∈ {1, 2, 3, 4, 5}) in the encoder and four-scale depth maps D_s in the decoder network.