

map for each module is

    F(x) = x + \sum_{i=1}^{C} T_i(x)    (5)

where x is the input of the module, T_i is the i-th transformation branch, and C is the cardinality (the number of parallel transformations).
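As a concrete illustration of Equation (5), the aggregated transformations of a ResNeXt block can be realized with a single grouped convolution, as in the original ResNeXt design. The PyTorch sketch below is illustrative only; the channel width, cardinality, and bottleneck size are assumed values, not the exact configuration of our encoder.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Minimal ResNeXt-style block implementing y = x + sum_i T_i(x), Eq. (5).

    The C parallel branches T_i are realized with one grouped convolution
    (groups = C), following the equivalent form in the ResNeXt paper.
    All sizes below are illustrative."""

    def __init__(self, channels=256, cardinality=32, bottleneck=4):
        super().__init__()
        width = cardinality * bottleneck  # total bottleneck width
        self.transform = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),  # C parallel paths
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity shortcut plus the sum of the aggregated transformations
        return self.relu(x + self.transform(x))
```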

3.3. Network architecture
The proposed depth estimation network adopts a U-Net structure consisting of an encoder network and a decoder network. The encoder network is built by embedding the ResNeXt block [20]. It transforms the three-channel monocular image into multi-channel feature maps. The decoder network builds the relationship between the extracted feature maps and the depth map through a series of upsampling and convolution (up-convolution) operations, as shown in Figure 4.



(1) To eliminate texture-copy artifacts in the depth map, the up-convolution operation [22] is used instead of deconvolution to reshape the feature map. (2) Because max-pooling and stride operations ignore some local features and cause details to be lost in the depth image, skip connections are used to merge the corresponding feature maps from the encoder network into the decoder network and recover fine image details. (3) Inspired by the authors of [22], we resize all depth maps to the same resolution as the input using bilinear interpolation (the upsampling operation in Equation (2)). A minimal sketch of these decoder operations is given below.
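The following PyTorch sketch illustrates points (1) to (3): an up-convolution realized as upsampling followed by a 3x3 convolution, fusion with the encoder feature map through a skip connection, and bilinear resizing of predicted depth maps to the input resolution. The module name UpConvBlock, the activation choice, and the channel arguments are assumptions for illustration, not the exact decoder configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpConvBlock(nn.Module):
    """Up-convolution: upsample then 3x3 convolution (instead of
    deconvolution), followed by fusion with the corresponding encoder
    feature map through a skip connection."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
        self.act = nn.ELU(inplace=True)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # upsample
        x = self.act(self.conv1(x))
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.act(self.conv2(x))

def resize_to_input(depth, input_hw):
    """Resize a multi-scale depth map to the input resolution with
    bilinear interpolation before computing the photometric loss."""
    return F.interpolate(depth, size=input_hw, mode="bilinear",
                         align_corners=False)
```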



The structure of the pose network is a standard ResNet18 encoder, similar to the one in [22]. Feeding more input images into the pose network can yield more accurate depth estimation under certain conditions. However, to reduce the number of training parameters of the pose network, it takes N (N = 3) adjacent images as input. Therefore, the shape of the convolutional weights in the first layer is (3 × N) × 64 × 3 × 3 rather than the default 3 × 64 × 3 × 3. The output of the pose network has 6 × (N − 1) channels. In addition, our pose network is trained without pre-training. All convolution layers are activated by the ReLU function [25] except for the last layer. When the pose result is evaluated, an image pair is fed into the pose network to produce six output channels: the first three channels are the rotation and the last three channels are the translation.
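A minimal sketch of how the pose network input and output layers could be adapted is given below, assuming a recent torchvision ResNet18 backbone trained from scratch. The class PoseNet and its regression head are hypothetical; only the first-layer weight shape (3 × N input channels, 64 output channels, 3 × 3 kernel) and the 6 × (N − 1)-channel output follow the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet(nn.Module):
    """Pose network sketch: ResNet18 encoder whose first layer takes N
    adjacent frames (3*N channels) and a head that outputs 6*(N-1)
    channels, i.e. 3 rotation + 3 translation values per image pair."""

    def __init__(self, num_frames=3):
        super().__init__()
        self.num_frames = num_frames
        encoder = models.resnet18(weights=None)  # trained from scratch
        # First-layer weights: shape (64, 3*N, 3, 3) in PyTorch ordering,
        # i.e. the (3 x N) x 64 x 3 x 3 shape described in the text.
        encoder.conv1 = nn.Conv2d(3 * num_frames, 64, kernel_size=3,
                                  stride=2, padding=1, bias=False)
        self.encoder = encoder
        # Hypothetical regression head producing 6*(N-1) channels.
        self.head = nn.Conv2d(512, 6 * (num_frames - 1), kernel_size=1)

    def forward(self, frames):
        # frames: (B, 3*N, H, W) -> pose: (B, N-1, 6)
        x = self.encoder.conv1(frames)
        x = self.encoder.bn1(x)
        x = self.encoder.relu(x)
        x = self.encoder.maxpool(x)
        x = self.encoder.layer1(x)
        x = self.encoder.layer2(x)
        x = self.encoder.layer3(x)
        x = self.encoder.layer4(x)              # (B, 512, h, w)
        pose = self.head(x).mean(dim=[2, 3])    # global average pooling
        return pose.view(-1, self.num_frames - 1, 6)
```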


               3.4. Wavelet SSIM loss
In general, an SSIM [26] loss is included in the photometric loss to measure the degree of similarity between images. In this paper, the 2D discrete wavelet transform (DWT) is applied to the SSIM term to decrease the photometric loss. First, the DWT divides an image into patches with different frequency content. Then, the SSIM of each patch is computed. The weight of each patch in the SSIM loss can then be adjusted flexibly to preserve high-frequency image details and to avoid producing "holes" or artifacts in low-texture regions.
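For reference, a common formulation of the SSIM photometric term used in self-supervised depth estimation is sketched below. This is a generic implementation assumption (3x3 average pooling, per-pixel SSIM averaged over the image), not necessarily the exact variant used in [26] or in our training code.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """SSIM-based photometric term (1 - SSIM) / 2 between images x and y,
    computed with 3x3 average pooling and averaged over all pixels."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1).mean()
```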



In the 2D discrete wavelet transform (DWT), low-pass and high-pass filters are convolved with an image to obtain the decomposition. Specifically, four filters, f_LL, f_LH, f_HL, and f_HH, are obtained by multiplying combinations of the low-pass filter and the high-pass filter. The DWT divides an image into four small patches with different frequencies through these four filters, which can remove unnecessary interference from the images (e.g., haze and rain). Applied iteratively, the DWT can be formulated as follows:

    I^{LL}_{i+1}, I^{LH}_{i+1}, I^{HL}_{i+1}, I^{HH}_{i+1} = \mathrm{DWT}(I^{LL}_{i})    (6)

where i is the iteration index of the DWT and I^{LL}_0 is the original image. In this paper, i = 2. I^{LL} is the down-sampled image, I^{LH} and I^{HL} are the horizontal and vertical edge detection images, respectively, and I^{HH} is the corner detection image.
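A minimal sketch of Equation (6) with Haar filters is shown below, together with one way the per-sub-band SSIM terms could be weighted and summed. The Haar filter choice, the weight values, and the names haar_dwt2 and wavelet_ssim_loss are assumptions for illustration; the paper does not specify these exact settings. The ssim_fn argument can be, for example, the ssim_loss sketch above.

```python
import torch
import torch.nn.functional as F

def haar_dwt2(x):
    """One level of a 2D Haar DWT on a (B, C, H, W) tensor with even H, W.
    Returns (LL, LH, HL, HH) at half resolution. The four filters are
    outer products of the 1D low-pass l = [1, 1]/sqrt(2) and high-pass
    h = [1, -1]/sqrt(2) filters."""
    l = torch.tensor([1.0, 1.0]) / 2 ** 0.5
    h = torch.tensor([1.0, -1.0]) / 2 ** 0.5
    filters = torch.stack([
        torch.outer(l, l),  # f_LL: approximation (down-sampled image)
        torch.outer(h, l),  # f_LH: high-pass vertically -> horizontal edges
        torch.outer(l, h),  # f_HL: high-pass horizontally -> vertical edges
        torch.outer(h, h),  # f_HH: diagonal details / corners
    ]).unsqueeze(1).to(x.device, x.dtype)                  # (4, 1, 2, 2)
    b, c, hgt, wid = x.shape
    # Apply the same four filters to every channel independently.
    out = F.conv2d(x.reshape(b * c, 1, hgt, wid), filters, stride=2)
    out = out.reshape(b, c, 4, out.shape[-2], out.shape[-1])
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

def wavelet_ssim_loss(pred, target, ssim_fn,
                      weights=(0.4, 0.2, 0.2, 0.2), levels=2):
    """Iterate the DWT on the LL band (Eq. (6)) and sum weighted SSIM
    terms over the four sub-bands at each level. The weights and the
    number of levels here are illustrative, not the paper's values."""
    loss = 0.0
    for _ in range(levels):
        bands_p = haar_dwt2(pred)
        bands_t = haar_dwt2(target)
        for w, bp, bt in zip(weights, bands_p, bands_t):
            loss = loss + w * ssim_fn(bp, bt)
        pred, target = bands_p[0], bands_t[0]  # recurse on the LL band
    return loss
```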