

map for each module is

    F(x) = x + \sum_{i=1}^{C} T_i(x)    (5)

where x is the input of the module, T_i is the i-th transformation branch, and C is the cardinality (the number of parallel transformations).
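As a concrete illustration of Equation (5), the aggregated transformations of a ResNeXt block can be realized with a single grouped convolution, as in the original ResNeXt design. The PyTorch sketch below is illustrative only; the channel width, cardinality, and bottleneck size are assumed values, not the exact configuration of our encoder.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Minimal ResNeXt-style block implementing y = x + sum_i T_i(x), Eq. (5).

    The C parallel branches T_i are realized with one grouped convolution
    (groups = C), following the equivalent form in the ResNeXt paper.
    All sizes below are illustrative."""

    def __init__(self, channels=256, cardinality=32, bottleneck=4):
        super().__init__()
        width = cardinality * bottleneck  # total bottleneck width
        self.transform = nn.Sequential(
            nn.Conv2d(channels, width, kernel_size=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),  # C parallel paths
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity shortcut plus the sum of the aggregated transformations
        return self.relu(x + self.transform(x))
```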

3.3. Network architecture
The proposed depth estimation network adopts a U-Net structure consisting of an encoder network and a decoder network. The encoder network is built by embedding the ResNeXt block [20]. It transforms the three-channel monocular image into multi-channel feature maps. The decoder network builds the relationship between the extracted feature maps and the depth map through a series of upsampling and convolution (up-convolution) operations, as shown in Figure 4.



(1) To eliminate texture-copy artifacts in the depth map, the up-convolution operation [22] is used instead of deconvolution to reshape the feature map. (2) Because max-pooling and stride operations ignore some local features and cause details to be lost in the depth image, skip connections are used to merge the corresponding feature maps from the encoder network into the decoder network and recover fine image details. (3) Inspired by the authors of [22], we resize all depth maps to the same resolution as the input using bilinear interpolation (the upsampling operation in Equation (2)). A minimal sketch of these decoder operations is given below.
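The following PyTorch sketch illustrates points (1) to (3): an up-convolution realized as upsampling followed by a 3x3 convolution, fusion with the encoder feature map through a skip connection, and bilinear resizing of predicted depth maps to the input resolution. The module name UpConvBlock, the activation choice, and the channel arguments are assumptions for illustration, not the exact decoder configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpConvBlock(nn.Module):
    """Up-convolution: upsample then 3x3 convolution (instead of
    deconvolution), followed by fusion with the corresponding encoder
    feature map through a skip connection."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
        self.act = nn.ELU(inplace=True)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")  # upsample
        x = self.act(self.conv1(x))
        x = torch.cat([x, skip], dim=1)  # skip connection from the encoder
        return self.act(self.conv2(x))

def resize_to_input(depth, input_hw):
    """Resize a multi-scale depth map to the input resolution with
    bilinear interpolation before computing the photometric loss."""
    return F.interpolate(depth, size=input_hw, mode="bilinear",
                         align_corners=False)
```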



The structure of the pose network is a standard ResNet18 encoder, similar to the one in [22]. Feeding more input images into the pose network can yield more accurate depth estimation under certain conditions. However, to reduce the number of training parameters of the pose network, it takes N (N = 3) adjacent images as input. Therefore, the shape of the convolutional weights in the first layer is (3 × N) × 64 × 3 × 3 rather than the default 3 × 64 × 3 × 3. The output of the pose network has 6 × (N − 1) channels. In addition, our pose network is trained without pre-training. All convolution layers are activated by the ReLU function [25] except for the last layer. When the pose result is evaluated, an image pair is fed into the pose network to produce six output channels: the first three channels are the rotation and the last three channels are the translation.
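A minimal sketch of how the pose network input and output layers could be adapted is given below, assuming a recent torchvision ResNet18 backbone trained from scratch. The class PoseNet and its regression head are hypothetical; only the first-layer weight shape (3 × N input channels, 64 output channels, 3 × 3 kernel) and the 6 × (N − 1)-channel output follow the description above.

```python
import torch
import torch.nn as nn
from torchvision import models

class PoseNet(nn.Module):
    """Pose network sketch: ResNet18 encoder whose first layer takes N
    adjacent frames (3*N channels) and a head that outputs 6*(N-1)
    channels, i.e. 3 rotation + 3 translation values per image pair."""

    def __init__(self, num_frames=3):
        super().__init__()
        self.num_frames = num_frames
        encoder = models.resnet18(weights=None)  # trained from scratch
        # First-layer weights: shape (64, 3*N, 3, 3) in PyTorch ordering,
        # i.e. the (3 x N) x 64 x 3 x 3 shape described in the text.
        encoder.conv1 = nn.Conv2d(3 * num_frames, 64, kernel_size=3,
                                  stride=2, padding=1, bias=False)
        self.encoder = encoder
        # Hypothetical regression head producing 6*(N-1) channels.
        self.head = nn.Conv2d(512, 6 * (num_frames - 1), kernel_size=1)

    def forward(self, frames):
        # frames: (B, 3*N, H, W) -> pose: (B, N-1, 6)
        x = self.encoder.conv1(frames)
        x = self.encoder.bn1(x)
        x = self.encoder.relu(x)
        x = self.encoder.maxpool(x)
        x = self.encoder.layer1(x)
        x = self.encoder.layer2(x)
        x = self.encoder.layer3(x)
        x = self.encoder.layer4(x)              # (B, 512, h, w)
        pose = self.head(x).mean(dim=[2, 3])    # global average pooling
        return pose.view(-1, self.num_frames - 1, 6)
```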


               3.4. Wavelet SSIM loss
In general, an SSIM [26] loss is included in the photometric loss to measure the degree of similarity between images. In this paper, the 2D discrete wavelet transform (DWT) is applied to the SSIM term to decrease the photometric loss. First, the DWT divides an image into patches with different frequency content. Then, the SSIM of each patch is computed. The weight of each patch in the SSIM loss can then be adjusted flexibly to preserve high-frequency image details and to avoid producing "holes" or artifacts in low-texture regions.
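For reference, a common formulation of the SSIM photometric term used in self-supervised depth estimation is sketched below. This is a generic implementation assumption (3x3 average pooling, per-pixel SSIM averaged over the image), not necessarily the exact variant used in [26] or in our training code.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    """SSIM-based photometric term (1 - SSIM) / 2 between images x and y,
    computed with 3x3 average pooling and averaged over all pixels."""
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1).mean()
```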



In the 2D discrete wavelet transform (DWT), low-pass and high-pass filters are convolved with an image to obtain the decomposition. Specifically, four filters, f_LL, f_LH, f_HL, and f_HH, are obtained by multiplying combinations of the low-pass filter and the high-pass filter. The DWT divides an image into four small patches with different frequencies through these four filters, which can remove unnecessary interference from the images (e.g., haze and rain). Applied iteratively, the DWT can be formulated as follows:

    I^{LL}_{i+1}, I^{LH}_{i+1}, I^{HL}_{i+1}, I^{HH}_{i+1} = \mathrm{DWT}(I^{LL}_{i})    (6)

where i is the iteration index of the DWT and I^{LL}_0 is the original image. In this paper, i = 2. I^{LL} is the down-sampled image, I^{LH} and I^{HL} are the horizontal and vertical edge detection images, respectively, and I^{HH} is the corner detection image.
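A minimal sketch of Equation (6) with Haar filters is shown below, together with one way the per-sub-band SSIM terms could be weighted and summed. The Haar filter choice, the weight values, and the names haar_dwt2 and wavelet_ssim_loss are assumptions for illustration; the paper does not specify these exact settings. The ssim_fn argument can be, for example, the ssim_loss sketch above.

```python
import torch
import torch.nn.functional as F

def haar_dwt2(x):
    """One level of a 2D Haar DWT on a (B, C, H, W) tensor with even H, W.
    Returns (LL, LH, HL, HH) at half resolution. The four filters are
    outer products of the 1D low-pass l = [1, 1]/sqrt(2) and high-pass
    h = [1, -1]/sqrt(2) filters."""
    l = torch.tensor([1.0, 1.0]) / 2 ** 0.5
    h = torch.tensor([1.0, -1.0]) / 2 ** 0.5
    filters = torch.stack([
        torch.outer(l, l),  # f_LL: approximation (down-sampled image)
        torch.outer(h, l),  # f_LH: high-pass vertically -> horizontal edges
        torch.outer(l, h),  # f_HL: high-pass horizontally -> vertical edges
        torch.outer(h, h),  # f_HH: diagonal details / corners
    ]).unsqueeze(1).to(x.device, x.dtype)                  # (4, 1, 2, 2)
    b, c, hgt, wid = x.shape
    # Apply the same four filters to every channel independently.
    out = F.conv2d(x.reshape(b * c, 1, hgt, wid), filters, stride=2)
    out = out.reshape(b, c, 4, out.shape[-2], out.shape[-1])
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

def wavelet_ssim_loss(pred, target, ssim_fn,
                      weights=(0.4, 0.2, 0.2, 0.2), levels=2):
    """Iterate the DWT on the LL band (Eq. (6)) and sum weighted SSIM
    terms over the four sub-bands at each level. The weights and the
    number of levels here are illustrative, not the paper's values."""
    loss = 0.0
    for _ in range(levels):
        bands_p = haar_dwt2(pred)
        bands_t = haar_dwt2(target)
        for w, bp, bt in zip(weights, bands_p, bands_t):
            loss = loss + w * ssim_fn(bp, bt)
        pred, target = bands_p[0], bands_t[0]  # recurse on the LL band
    return loss
```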