network. The mapping function is designed as
$$D_k(I_t) = \Gamma\big(F_1(I_t), \ldots, F_n(I_t)\big) \tag{1}$$
where $n$ denotes the number of feature maps, $n = 5$; $k$ represents the scale factor of the depth map, $k \in \{0, 1, 2, 3\}$; and $D_k(I_t)$ denotes the predicted depth map whose resolution is $1/2^k$ of the input resolution.
Then, bilinear interpolation is applied to each predicted depth map $D_k(I_t)$ to acquire the full-resolution depth map $D(I_t)$, which is defined as follows:
$$D(I_t) = \beta\big(D_k(I_t)\big) \tag{2}$$
where $\beta$ represents the bilinear interpolation that recovers the resolution of $D_k(I_t)$ from $1/2^k$ to the full input resolution.
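As a rough illustration of Equations (1) and (2), the PyTorch sketch below upsamples each scale-$k$ depth prediction back to the input resolution with bilinear interpolation. The names `decoder_features`, `depth_heads`, and `predict_full_resolution_depths` are hypothetical, and feeding each head only the feature map of its own scale is a simplification of the mapping $\Gamma$ in Equation (1).

```python
import torch
import torch.nn.functional as F

def predict_full_resolution_depths(decoder_features, depth_heads, input_hw):
    """decoder_features: list of decoder feature maps (one per scale k).
    depth_heads: list of small conv heads playing the role of the mapping Gamma in Eq. (1).
    input_hw: (H, W) of the input image I_t.
    Returns the full-resolution depth maps D(I_t), one per scale k in {0, 1, 2, 3}."""
    full_res_depths = []
    for k, head in enumerate(depth_heads):                 # k = 0, 1, 2, 3
        d_k = head(decoder_features[k])                    # Eq. (1): D_k(I_t) at 1/2^k resolution
        d_full = F.interpolate(d_k, size=input_hw,         # Eq. (2): bilinear interpolation beta
                               mode="bilinear", align_corners=False)
        full_res_depths.append(d_full)
    return full_res_depths
```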
The full-resolution depth map $D(I_t)$ is necessary to reconstruct the input image. Given two adjacent images with a target view and a source view $\langle I_t, I_s \rangle$, and the predicted 6-DoF pose transformation, the homogeneous coordinate $p_{t \to s}$ in the source image $I_s$ onto which a pixel $p_t$ in the target image $I_t$ is mapped is computed as
$$p_{t \to s} \sim K\, T_{t \to s}\, D(p_t)\, K^{-1} p_t \tag{3}$$
where $K$ is the camera intrinsic matrix, $p_t$ is set as the normalized coordinate in the target image $I_t$, $D(p_t)$ is the predicted depth at $p_t$, and $T_{t \to s}$ is the $4 \times 4$ matrix transformed from the predicted 6-DoF pose.
Therefore, the reconstructed target image can be obtained from Equation (3) by using the differentiable bilinear sampling mechanism [16] to sample the corresponding pixel $p_{t \to s}$ on the source image $I_s$. The reconstructed target image is used to calculate the photometric loss in Part D.
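Putting Equation (3) and the bilinear sampling step together, a minimal PyTorch sketch of the view reconstruction might look like the following. The function name `warp_source_to_target`, the tensor shapes, and the use of `grid_sample` for the differentiable bilinear sampler are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_source_to_target(I_s, D_t, K, T_t2s):
    """I_s: source image (B, 3, H, W); D_t: full-resolution depth D(I_t), shape (B, 1, H, W);
    K: camera intrinsics (B, 3, 3); T_t2s: 4x4 pose matrices T_{t->s}, shape (B, 4, 4).
    Returns the target image reconstructed by sampling I_s at p_{t->s} (Eq. (3))."""
    B, _, H, W = I_s.shape
    device = I_s.device

    # Homogeneous pixel coordinates p_t of the target image.
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32, device=device),
                            torch.arange(W, dtype=torch.float32, device=device),
                            indexing="ij")
    p_t = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)      # (3, H, W)
    p_t = p_t.view(1, 3, -1).expand(B, -1, -1)                   # (B, 3, H*W)

    # Back-project to camera space: D(p_t) * K^{-1} p_t, then apply T_{t->s} and K (Eq. (3)).
    cam_points = D_t.view(B, 1, -1) * (torch.inverse(K) @ p_t)   # (B, 3, H*W)
    ones = torch.ones(B, 1, H * W, device=device)
    src_points = K @ (T_t2s @ torch.cat([cam_points, ones], dim=1))[:, :3, :]

    # Normalize homogeneous coordinates to obtain p_{t->s}, then rescale to [-1, 1].
    xy = src_points[:, :2, :] / (src_points[:, 2:3, :] + 1e-7)   # (B, 2, H*W)
    x_norm = 2.0 * xy[:, 0, :] / (W - 1) - 1.0
    y_norm = 2.0 * xy[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([x_norm, y_norm], dim=-1).view(B, H, W, 2)

    # Differentiable bilinear sampling [16] of the source image.
    return F.grid_sample(I_s, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```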
3.2. Feature extraction module
Equation (1) is applied to exploit higher-dimensional features and acquire the feature maps $F_i(I_t)$ with more details. Since the ResNeXt block performs well on classification tasks, the feature extraction module is constructed from ResNeXt blocks. In contrast to the ResNet blocks used in most depth CNNs, the ResNeXt block aggregates more image features without introducing more network parameters, as shown in Figure 3.
The ResNeXt block splits the input into 32 parallel groups, each of which learns image features independently. All groups share the same hyper-parameters and are designed as a bottleneck structure that cascades three convolution layers with kernel sizes of $1 \times 1$, $3 \times 3$, and $1 \times 1$, respectively. The first $1 \times 1$ convolution layer extracts high-dimensional abstract features by reducing (or increasing) the number of output channels. Given an input image $x$ with resolution $H \times W \times D$, the transformation function $\mathcal{T}_i$ of the $i$-th group maps the image $x$ to the high-dimensional feature map $\mathcal{T}_i(x)$. The aggregated output $F(x)$ is the summation of the outputs of all the groups, which is defined as follows:
$$F(x) = \sum_{i=1}^{C} \mathcal{T}_i(x) \tag{4}$$
where $C$ is the number of groups, $C = 32$, which is referred to as the cardinality.
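A minimal PyTorch sketch of one such ResNeXt bottleneck block is given below, including the residual connection described next. The channel widths (256 and 128) are assumed for illustration; the 32 parallel groups of Equation (4) are realised with a grouped $3 \times 3$ convolution, which is the standard equivalent of the summed formulation.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        # 1x1 convolution: reduce the number of channels (bottleneck entry).
        self.conv1 = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        # 3x3 grouped convolution: 32 parallel groups, i.e. the aggregation of Eq. (4).
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1,
                               groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        # 1x1 convolution: restore the number of channels.
        self.conv3 = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))   # aggregated group features F(x)
        out = self.bn3(self.conv3(out))
        return self.relu(out + x)                    # residual connection with the input
```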
Then, to keep a close connection with the input, a residual operation $x + F(x)$ is used. The aggregated output feature