

applied when predicting monocular image depth. Garg et al. [11] first used stereo pairs with known left-right disparities to train a depth network and achieved strong performance. Inspired by the authors of [11], Godard et al. [12] designed a novel loss function that enforces consistency between the left-right and right-left disparities produced from stereo images. Zhan et al. [13] extended the stereo-based architecture by adding a visual odometry (VO) network, and its performance surpassed that of the other unsupervised methods available at the time. To recover a depth map with absolute scale from stereo pairs, Li et al. [14] proposed a visual odometry system (UnDeepVO) capable of estimating the 6-DoF camera pose and recovering absolute depth values.
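
As a rough illustration of the left-right consistency idea mentioned above, the sketch below (PyTorch) warps each predicted disparity map into the other view and penalizes the difference. The helper `warp_horizontal`, the normalized-coordinate convention for disparities, and the tensor shapes are assumptions for illustration, not the exact formulation of [12].

```python
import torch
import torch.nn.functional as F

def warp_horizontal(img, disp):
    """Sample img (B,C,H,W) at pixels shifted horizontally by disp (B,1,H,W),
    with disp expressed in normalized [-1, 1] image coordinates (an assumption)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img.device),
                            torch.linspace(-1, 1, w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)               # (1,H,W,2) base grid
    shift = torch.cat((disp, torch.zeros_like(disp)), dim=1)        # x-shift only
    return F.grid_sample(img, grid + shift.permute(0, 2, 3, 1), align_corners=True)

def lr_consistency_loss(disp_left, disp_right):
    """Left-right consistency: each disparity map, re-projected into the other
    view, should match the disparity predicted for that view."""
    right_in_left = warp_horizontal(disp_right, -disp_left)
    left_in_right = warp_horizontal(disp_left, disp_right)
    return (disp_left - right_in_left).abs().mean() + \
           (disp_right - left_in_right).abs().mean()
```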



2.2.2. Unsupervised depth estimation from monocular images
For monocular depth estimation, an extra pose network is needed to obtain the pose transformation between consecutive frames, and the depth and pose networks are trained jointly under a common loss function. Zhou et al. [16] pioneered training depth networks on monocular video. They proposed two separate networks (SfMLearner) to learn image depth and inter-frame pose transformation. However, the accuracy of the depth network was often degraded by moving objects and occlusions. Their work motivated other researchers to address these shortcomings. Subsequently, Casser et al. [17] developed a separate network (struct2depth) to learn the motion of each moving object, but their approach assumed that the number of moving objects was known in advance. In addition, researchers found that optical flow could be employed to handle moving-object motion. Yin et al. [18] developed a cascaded network framework (GeoNet) to adaptively learn rigid and non-rigid object motion. Recently, multi-task training methods have been proposed. Luo et al. [19] jointly trained depth, camera pose, and optical flow networks (EPC++) with a holistic 3D understanding. Similarly, Ranjan et al. [24] proposed a competitive collaboration mechanism (CC) that trains depth, camera motion, optical flow, and motion segmentation networks together. Both joint frameworks inevitably increase training difficulty and computational burden.
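
For concreteness, the following sketch (PyTorch) outlines the view synthesis step that underlies these monocular methods: a depth map predicted for the target frame and a relative pose predicted by the pose network are used to warp a source frame into the target view, and the photometric difference between the target frame and the warped image supervises both networks without ground-truth depth. The pinhole back-projection details and tensor conventions are assumptions for illustration, not code from any of the cited works.

```python
import torch
import torch.nn.functional as F

def inverse_warp(source, depth, T, K):
    """Synthesize the target view from `source`, given the depth map `depth`
    (B,1,H,W) predicted for the target frame, the relative pose `T` (B,4,4)
    predicted by the pose network, and the camera intrinsics `K` (B,3,3)."""
    b, _, h, w = source.shape
    # homogeneous pixel grid of the target frame
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack((xs, ys, torch.ones_like(xs)), 0).view(1, 3, -1).to(source.device)
    # back-project to 3D camera coordinates and apply the relative pose
    cam = torch.inverse(K) @ pix * depth.view(b, 1, -1)
    cam = torch.cat((cam, torch.ones(b, 1, h * w, device=source.device)), 1)
    proj = K @ (T @ cam)[:, :3]
    # re-project into the source image and sample it there
    px = proj[:, 0] / (proj[:, 2] + 1e-7)
    py = proj[:, 1] / (proj[:, 2] + 1e-7)
    grid = torch.stack((2 * px / (w - 1) - 1, 2 * py / (h - 1) - 1), -1).view(b, h, w, 2)
    return F.grid_sample(source, grid, align_corners=True)
```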



From the above works, we can see that most studies aim to improve the accuracy of the depth network by changing the network structure or building a robust supervisory signal. It is worth noting that these methods increase network complexity and computational burden while improving accuracy, which motivates us to study how to balance the two. Poggi et al. [15] presented an efficient pyramid feature extraction network that runs in real time on a CPU; however, its accuracy cannot satisfy the requirements of practical applications. Xie et al. [20] provided a template with aggregated residual transformations (ResNeXt), which achieved better classification results without increasing network computation. Because of these advantages, we apply ResNeXt to image depth prediction: the ResNeXt block serves as the feature extraction module of the depth network to learn the image's high-dimensional features. The proposed approach is not only independent of depth ground truth, but also does not increase the computational burden.
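
As a rough illustration of what such a feature extraction module could look like, the sketch below (PyTorch) implements a ResNeXt-style residual block in which a grouped 3x3 convolution realizes the aggregated parallel transformations. The channel widths and cardinality are illustrative assumptions, not the configuration used in this paper.

```python
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Residual block with aggregated transformations (grouped convolution)."""
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            # the grouped 3x3 convolution realizes the parallel "paths"
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity shortcut plus the aggregated transformations
        return self.relu(x + self.body(x))
```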




               3. METHOD
The proposed method contains two parts: an end-to-end network framework and a loss function. The network framework consists of a depth network and a pose network, as shown in Figure 2. Given unlabeled monocular sequences, the depth network outputs the predicted depth map, while the pose network outputs the 6-DoF relative pose transformation between adjacent frames. The loss function is made up of the basic photometric loss and the depth smoothness loss, and it couples both networks into the end-to-end network.
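
A minimal sketch (PyTorch) of how these two terms could be combined is given below, assuming the target frame, a warped view of it synthesized from the predicted depth and pose, and the predicted depth map are available. The L1 photometric term, the edge-aware form of the smoothness term, and the weights are common choices in this literature and are assumptions here, not the paper's exact formulation.

```python
import torch

def smoothness_loss(depth, image):
    """Edge-aware first-order smoothness: penalize depth gradients except
    where the input image itself has strong gradients."""
    dx_d = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dy_d = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def total_loss(target, warped, depth, alpha=1.0, beta=0.001):
    """Couple both networks: photometric reconstruction plus depth smoothness."""
    photometric = (target - warped).abs().mean()   # L1 photometric loss
    return alpha * photometric + beta * smoothness_loss(depth, target)
```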