
Li et al. Intell Robot 2021;1(1):84-98  |  http://dx.doi.org/10.20517/ir.2021.06


               network. The mapping function is designed as

$$D_{1/2^s}(I(t)) = \Gamma(F_1(I(t)), \ldots, F_n(I(t))) \tag{1}$$

               where $n$ denotes the number of feature maps, $n = 5$; $s$ represents the scale factor of the depth map, $s \in \{0, 1, 2, 3\}$. The resolution of the predicted depth map $D_{1/2^s}$ is $1/2^s$ of the input resolution.
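As a concrete illustration of the scale notation in Equation (1), each scale $s$ halves the spatial resolution (a minimal sketch; the input resolution below is illustrative, not stated in the text):

```python
# At each scale s in {0, 1, 2, 3}, the predicted depth map D_{1/2^s} has
# 1/2^s of the input resolution.
H, W = 128, 416  # illustrative input resolution, not specified in the paper
depth_map_shapes = [(s, H // 2**s, W // 2**s) for s in range(4)]
# s = 0 is full resolution; s = 3 is 1/8 of the input resolution.
```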


               Then, bilinear interpolation is applied to each predicted depth map $D_{1/2^s}$ to acquire the full-resolution depth map $D(I(t))$, which is defined as follows:

$$D(I(t)) = \varphi(D_{1/2^s}(I(t))) \tag{2}$$

               where $\varphi$ represents bilinear interpolation, which recovers the $1/2^s$ resolution of $D_{1/2^s}$ to the input full resolution.
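The bilinear upsampling step can be sketched in pure Python (a minimal illustration; the function name and the toy 2 × 2 map are ours, and a real implementation would use an optimized library routine):

```python
def bilinear_upsample(depth, factor):
    """Upsample a 2D depth map by an integer factor using bilinear interpolation."""
    h, w = len(depth), len(depth[0])
    H, W = h * factor, w * factor
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Map the output pixel back to (fractional) input coordinates.
            fy = min(y / factor, h - 1)
            fx = min(x / factor, w - 1)
            y0, x0 = int(fy), int(fx)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            wy, wx = fy - y0, fx - x0
            # Weighted average of the four surrounding input pixels.
            out[y][x] = ((1 - wy) * (1 - wx) * depth[y0][x0]
                         + (1 - wy) * wx * depth[y0][x1]
                         + wy * (1 - wx) * depth[y1][x0]
                         + wy * wx * depth[y1][x1])
    return out

quarter = [[1.0, 2.0], [3.0, 4.0]]       # toy 2x2 depth map at scale s = 1
full = bilinear_upsample(quarter, 2)     # recovered 4x4 full-resolution map
```

A depth map at scale $s$ would be upsampled with `factor = 2**s` to reach the input resolution.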


               The full-resolution depth map $D(I(t))$ is necessary to reconstruct the input image. Given two adjacent images with a target view and a source view $\langle I(t), I(s) \rangle$, and the predicted 6-DoF pose transformation, a pixel $p_t$ in the target image $I(t)$'s mapping homogeneous coordinate $p_{t \to s}$ in the source image $I(s)$ is computed as

$$p_{t \to s} \sim K \, T_{t \to s} \, D(p_t) \, K^{-1} p_t \tag{3}$$

               where $K$ is the camera intrinsic matrix, $p_t$ is set as the normalized coordinate in the target image $I(t)$, and $T_{t \to s}$ is a $4 \times 4$ matrix transformed from the predicted pose.
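The projection in Equation (3) can be sketched step by step with small matrix arithmetic (a minimal illustration; the function names and the example intrinsics and translation are ours, not from the paper):

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def warp_pixel(p_t, depth, K, K_inv, T):
    """Map a target pixel p_t = (u, v) into the source view, following
    p_{t->s} ~ K T_{t->s} D(p_t) K^{-1} p_t."""
    # Back-project to a 3-D point in the target camera frame: D(p_t) K^{-1} [u, v, 1]^T
    ray = matvec(K_inv, [p_t[0], p_t[1], 1.0])
    X = [depth * c for c in ray] + [1.0]   # homogeneous 3-D point
    X_s = matvec(T, X)[:3]                 # rigid transform into the source frame
    p_s = matvec(K, X_s)                   # project with the intrinsics
    return (p_s[0] / p_s[2], p_s[1] / p_s[2])  # dehomogenize

# Illustrative intrinsics (focal length 100, principal point (50, 50)).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
K_inv = [[0.01, 0.0, -0.5], [0.0, 0.01, -0.5], [0.0, 0.0, 1.0]]
# 4x4 pose: pure translation of 0.1 along the x axis.
T = [[1.0, 0.0, 0.0, 0.1],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
u, v = warp_pixel((50.0, 50.0), 2.0, K, K_inv, T)  # pixel at depth 2
```

The warped coordinate is generally fractional, which is why the differentiable bilinear sampling of the next paragraph is needed to read off a pixel value.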



                                                     
               Therefore, the reconstructed target image $\hat{I}(t)$ can be obtained from Equation (3) by using the differentiable bilinear sampling mechanism [16] to sample the corresponding pixels $p_{t \to s}$ from the source image $I(s)$. The reconstructed target image $\hat{I}(t)$ is used to calculate the photometric loss in Part D.
                        
                        
               3.2. Feature extraction module
               Equation (1) is applied to exploit higher-dimensional features and acquire feature maps with more detail. Since the ResNeXt block performs well on classification tasks, the feature extraction module is constructed from ResNeXt blocks. In contrast to the ResNet used in most depth CNNs, the ResNeXt block aggregates more image features without adding more network parameters, as shown in Figure 3.




               The ResNeXt block splits the input into 32 parallel groups, each of which learns image features separately. Each group shares the same hyper-parameters and is designed as a bottleneck structure that cascades three convolution layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1, respectively. The first 1 × 1 convolution layer extracts high-dimensional abstract features by reducing (or increasing) the output channels. Given an input image $x$ with $h \times w \times c$ resolution, the transformation function $T_i$ of the $i$th group maps image $x$ to the high-dimensional feature map $T_i(x)$. The aggregated output $F(x)$ is the summation of the outputs of all the groups, which is defined as follows:

$$F(x) = \sum_{i=1}^{C} T_i(x) \tag{4}$$

               where $C$ is the number of groups, $C = 32$, referred to as the cardinality.
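The aggregation of Equation (4) and the residual connection that follows can be sketched as below (a minimal illustration; the function names and the toy scaling transforms stand in for the actual grouped bottleneck convolutions):

```python
def resnext_aggregate(x, group_transforms):
    """Sum the outputs of C parallel group transforms: F(x) = sum_i T_i(x)."""
    out = [0.0] * len(x)
    for T_i in group_transforms:
        y = T_i(x)  # each group processes the same input independently
        out = [a + b for a, b in zip(out, y)]
    return out

def resnext_block(x, group_transforms):
    """Residual connection around the aggregated output: x + F(x)."""
    return [a + b for a, b in zip(x, resnext_aggregate(x, group_transforms))]

# Toy example with C = 2 groups (a real block uses C = 32 bottleneck branches).
groups = [lambda v: [2.0 * c for c in v],
          lambda v: [3.0 * c for c in v]]
out = resnext_block([1.0, 1.0], groups)
```

Because the groups share the same structure, the cardinality $C$ can grow without increasing the parameter count per group.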



               Then, to be closely connected with the input, a residual operation is applied to $F(x)$. The aggregated output feature