applied when predicting monocular image depth. Garg et al. [11] first used stereo pairs with known disparities between the left and right images to train a depth network and achieved strong performance. Inspired by the authors of [11], Godard et al. [12] designed a novel loss function that enforces consistency between the left-to-right and right-to-left disparities produced from stereo images. Zhan et al. [13] extended the stereo-based architecture by adding a visual odometry (VO) network; its performance was superior to other unsupervised methods at the time. To recover an absolute-scale depth map from stereo pairs, Li et al. [14] proposed a visual odometry system (UnDeepVO) capable of estimating the 6-DoF camera pose and recovering absolute depth values.
2.2.2. Unsupervised depth estimation from monocular images
For monocular depth estimation, an additional pose network is needed to obtain the pose transformation between consecutive frames; the depth and pose networks are trained jointly through a shared loss function. Zhou et al. [16] pioneered training a depth network from monocular video, proposing two separate networks (SfMLearner) to learn image depth and inter-frame pose transformation. However, the accuracy of the depth network was often limited by moving objects and occlusion, which motivated later work to address these shortcomings. Casser et al. [17] developed a separate network (struct2depth) to learn the motion of each moving object, but their approach assumes that the number of moving objects is known in advance. In addition, researchers found that optical flow could be employed to handle moving-object motion: Yin et al. [18] developed a cascaded network framework (GeoNet) to adaptively learn rigid and non-rigid object motion. More recently, multi-task training methods have been proposed. Luo et al. [19] jointly trained depth, camera pose, and optical flow networks (EPC++) with a holistic 3D understanding. Similarly, Ranjan et al. [24] proposed a competitive collaboration mechanism (CC) that couples depth, camera motion, optical flow, and motion segmentation. Both joint frameworks, however, inevitably increase training difficulty and computational burden.
From the above works, we can see that most studies aim to improve depth-network accuracy by changing the network structure or building a robust supervisory signal. It is worth noting that these methods add network complexity and computational burden while improving accuracy, which motivates us to study how to balance the two. Poggi et al. [15] presented an efficient pyramid feature extraction network that runs in real time on a CPU; however, its accuracy cannot satisfy the requirements of practical applications. Xie et al. [20] proposed a network template with aggregated residual transformations (ResNeXt), which achieves better classification results without increasing network computation. Because of these advantages, we apply ResNeXt to image depth prediction: the ResNeXt block serves as the feature extraction module of the depth network to learn the image's high-dimensional features. The proposed approach neither depends on ground-truth depth nor increases the computational burden.
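For clarity, the following is a minimal sketch of a ResNeXt-style bottleneck block realized with a grouped convolution, as used for feature extraction. The channel sizes, cardinality, and class name are illustrative assumptions for exposition, not the exact configuration of our depth network.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Illustrative ResNeXt-style bottleneck block.

    The grouped 3x3 convolution realizes the aggregated residual
    transformations of Xie et al. [20]: `cardinality` parallel paths
    summed together, at roughly the cost of a single convolution.
    Channel sizes below are assumed values, not the paper's settings.
    """

    def __init__(self, in_channels=256, bottleneck_channels=128, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            # 1x1 convolution reduces the channel dimension
            nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            # grouped 3x3 convolution = the aggregated parallel paths
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3,
                      padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            # 1x1 convolution restores the channel dimension
            nn.Conv2d(bottleneck_channels, in_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # residual connection: output keeps the input resolution and channels
        return self.relu(x + self.transform(x))
```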
3. METHOD
The proposed method contains two parts: an end-to-end network framework and a loss function. The network
framework consists of a depth network and a pose network, as shown in Figure 2. Given unlabeled monocular
sequences, the depth network outputs the predicted depth map, while the pose network outputs the 6-DoF
relative pose transformation between adjacent frames. The loss function comprises a basic photometric loss and a depth smoothness loss, and it couples the two networks into a single end-to-end framework.
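To illustrate how these two terms couple the networks, below is a hedged sketch in PyTorch. The function names, the L1-only photometric term, and the smoothness weight are assumptions for exposition rather than the exact formulation given later in this section; the key point is that the reconstructed view depends on both the predicted depth and the predicted 6-DoF pose, so gradients from one objective reach both networks.

```python
import torch

def photometric_loss(target, reconstructed):
    """L1 photometric difference between the target frame and the view
    synthesized from an adjacent frame (simplified; SSIM is often added)."""
    return torch.mean(torch.abs(target - reconstructed))

def smoothness_loss(depth, image):
    """Edge-aware depth smoothness: penalize depth gradients, down-weighted
    where the image itself has strong gradients (likely object boundaries)."""
    d_dx = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    d_dy = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])
    i_dx = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), dim=1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), dim=1, keepdim=True)
    return torch.mean(d_dx * torch.exp(-i_dx)) + torch.mean(d_dy * torch.exp(-i_dy))

def total_loss(target, reconstructed, depth, smooth_weight=1e-3):
    """Combined objective coupling the depth and pose networks.

    `reconstructed` is the target view warped from an adjacent frame using the
    predicted depth and 6-DoF pose; `smooth_weight` is an assumed value.
    """
    return photometric_loss(target, reconstructed) + \
           smooth_weight * smoothness_loss(depth, target)
```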