Li et al. Intell Robot 2021;1(1):84-98 | http://dx.doi.org/10.20517/ir.2021.06 Page 88
Figure 2. The overall architecture of both the depth network and the pose network.
Figure 3. The architecture of ResNet and ResNeXt block: (a) the ResNet block; and (b) the aggregated residual transformations. Both have
similar complexity, but the ResNeXt block has better adaptability and expansibility.
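The "similar complexity" claim can be checked with a rough weight count. The following is a minimal sketch, not the configuration used in this paper: the channel sizes (256-d input, 64-d bottleneck, cardinality 32 with 4-d groups) are assumed from the illustrative block in the original ResNeXt work, biases and normalization layers are ignored, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


def resnet_bottleneck_params(channels=256, width=64):
    """1x1 reduce -> 3x3 -> 1x1 expand, as in a ResNet bottleneck block."""
    return (conv_params(channels, width, 1)
            + conv_params(width, width, 3)
            + conv_params(width, channels, 1))


def resnext_block_params(channels=256, cardinality=32, group_width=4):
    """Same layout, but the 3x3 layer is a grouped convolution whose
    groups play the role of the aggregated parallel transformations."""
    width = cardinality * group_width
    return (conv_params(channels, width, 1)
            + conv_params(width, width, 3, groups=cardinality)
            + conv_params(width, channels, 1))


if __name__ == "__main__":
    print("ResNet bottleneck weights:", resnet_bottleneck_params())
    print("ResNeXt block weights:   ", resnext_block_params())
```

Under these assumed sizes the two blocks differ by less than one percent in weight count, while the ResNeXt block exposes cardinality as an extra design dimension.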
3.1. Problem statement
The aim of the unsupervised monocular depth network is to learn a mapping $\Gamma : I(p) \rightarrow D(p)$,
where $I(p)$ is an arbitrary image, $D(p)$ is the predicted depth map of the image $I(p)$, and $p$ is a pixel in the
image $I(p)$. This paper seeks a more accurate mapping function $\Gamma$, which requires:
(a) a simple and effective network pipeline that does not increase the computational complexity of the network; and (b) a
high-quality depth map $D(p)$ with subtle details for a given input image $I(p)$.
For Item (a), our focus is to change the basic building blocks of the depth CNN structure using aggregated
residual transformations (ResNeXt). In the depth network, ResNeXt serves as the feature extraction module,
learning the image's high-dimensional features without increasing the computational burden of the network. For Item (b),
low-texture regions are weakened in the low-scale depth maps, which leads to inaccurate image reconstruction.
Inspired by the authors of [22], four images are reconstructed at full resolution instead of four images
at different resolutions. Before the four images are reconstructed, the predicted four-scale depth maps are
resized to the resolution of the input image with bilinear interpolation.
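The resizing step can be illustrated with a minimal pure-Python sketch of bilinear interpolation. It is not the implementation used in this paper: the corner-aligned sampling convention, the function names, and the toy scale sizes in the usage example are assumptions made for illustration, and deep learning frameworks offer equivalent built-in operations with their own conventions.

```python
def bilinear_resize(src, out_h, out_w):
    """Resize a 2-D grid (list of rows) to out_h x out_w.

    Assumes the corner-aligned convention: the corner samples of the
    input and output grids coincide.
    """
    in_h, in_w = len(src), len(src[0])
    # Step, in input coordinates, between neighbouring output samples.
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for r in range(out_h):
        y = r * sy
        y0 = min(int(y), in_h - 1)   # row above the sample
        y1 = min(y0 + 1, in_h - 1)   # row below, clamped at the border
        wy = y - y0
        row = []
        for c in range(out_w):
            x = c * sx
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on both rows, then along y.
            top = (1 - wx) * src[y0][x0] + wx * src[y0][x1]
            bottom = (1 - wx) * src[y1][x0] + wx * src[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def upsample_depth_pyramid(depth_maps, height, width):
    """Bring every scale of a depth pyramid to the input resolution."""
    return [bilinear_resize(d, height, width) for d in depth_maps]


if __name__ == "__main__":
    # Toy four-scale pyramid for an 8 x 8 input (sizes are illustrative).
    pyramid = [[[1.0] * s for _ in range(s)] for s in (8, 4, 2, 1)]
    full = upsample_depth_pyramid(pyramid, 8, 8)
    print([(len(m), len(m[0])) for m in full])
```

After this step every scale has the same height and width as the input image, so each of the four reconstructions can be compared with the input at full resolution.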
A single image $I(p)$ is taken as the input of the depth network. The designed depth network outputs five-scale
feature maps (one per scale $i \in \{1, 2, 3, 4, 5\}$) in the encoder network and a four-scale depth map in the decoder