
improvement of the algorithm performance. In general, two-stage detection fuses feature maps either before or after proposal generation. To enhance the quality of proposals, 3D-CVF [64] fuses spatial features from images and point clouds across views with an auto-calibrated feature projection. Based on PointNet [1], Roarnet [65]
designs a two-stage object detection network whose input data contain an RGB image and a LiDAR point cloud and improves performance with 3D pose estimation. As for ROI-wise feature fusion, Chen et al. [12] fused features extracted from the bird's eye view and front view of the LiDAR as well as from the RGB image. As
shown in Figure 2b, Scanet [46] applies a spatial-channel attention module and an extension spatial up-sample module to generate proposals from RGB images and point clouds, respectively, in the first stage, and then classifies and regresses the 3D bounding boxes with a novel multi-level fusion method. Meanwhile, some studies adopt multiple fusion strategies within a single scheme. For instance, the authors of [47] built a two-stage detection framework with front-end fusion and medium fusion. The front-end fusion merges the sparse depth image (projected from the LiDAR point cloud) with the RGB image so that the image backbone network can extract dense depth features. These depth features are then fed into a dense fusion module together with LiDAR point clouds and
pseudo-LiDAR points to prepare for medium fusion. Vora et al. [66] complemented the context information of the point cloud with the semantic segmentation results of the image. Through the point painting operation (a minimal code sketch is given after this paragraph), point clouds are painted with semantic scores, and the painted point cloud is then fed into a point-based 3D detector to produce the final results. The pipeline in [67] fuses point-wise features and couples 2D–3D anchors (which are
               generated from images and point clouds, respectively) to improve the quality of proposals in the first stage,
               after which it handles ROI-wise feature fusion in the second stage. To deal with adverse weather, MVDNet [28]
exploits the potential complementary advantages of LiDAR and radar. This novel framework conducts a deep late fusion, meaning that proposals are first generated from the two sensors separately and region-wise features are then fused. Moreover, MVDNet provides a foggy-weather LiDAR and radar dataset generated from the Oxford Radar Robotcar dataset. EPNet [68] is a closed-loop two-stage detection network. Its LI-fusion module projects the point cloud onto the image and then generates point-wise correspondences for the fusion. To close the loop, EPNet achieves end-to-end 3D detection on the high-definition map and estimates the map on the fly from raw point clouds. ImVoteNet [48] (an extension of VoteNet [49]) supplements the point-wise 3D information with geometrical and semantic features extracted from 2D images. In its head module, LiDAR-only, image-only, and LiDAR-fusion features all participate in the voting to improve the detection accuracy.
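As a concrete illustration of the painting step of [66], the listing below decorates LiDAR points with per-pixel class scores predicted on the image. It is a minimal NumPy sketch written under our own assumptions, not the authors' implementation: the function name, the array layouts, and the availability of a single 3x4 LiDAR-to-image projection matrix from the sensor calibration are hypothetical.

    import numpy as np

    def paint_point_cloud(points, seg_scores, proj_matrix):
        # points:      (N, 3) LiDAR points (hypothetical layout).
        # seg_scores:  (H, W, C) per-pixel class scores from an image segmentation network.
        # proj_matrix: (3, 4) LiDAR-to-image projection matrix (assumed given by calibration).
        # Returns an (N, 3 + C) "painted" cloud; points outside the image keep zero scores.
        H, W, C = seg_scores.shape
        pts_h = np.hstack([points, np.ones((points.shape[0], 1))])  # homogeneous coordinates
        uvw = pts_h @ proj_matrix.T                                  # project onto the image plane
        w = uvw[:, 2]
        in_front = w > 0                                             # keep points in front of the camera
        u = np.zeros_like(w)
        v = np.zeros_like(w)
        u[in_front] = uvw[in_front, 0] / w[in_front]
        v[in_front] = uvw[in_front, 1] / w[in_front]
        valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        scores = np.zeros((points.shape[0], C), dtype=seg_scores.dtype)
        scores[valid] = seg_scores[v[valid].astype(int), u[valid].astype(int)]
        # The painted cloud is then fed to any point-based 3D detector.
        return np.hstack([points, scores])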


               4.2.2. Single-stage detection
Single-stage detectors outperform two-stage detectors in terms of runtime thanks to their compact network structure. To retain both high efficiency and accuracy, fusion for single-stage detectors is placed in the post-processing stage (i.e., late fusion), so that the superior single-shot detection performance is preserved while supplementary multi-sensor data still improve the results. In other words, only the outputs of the detectors for the LiDAR point cloud and for other sensor data (e.g., the RGB image) are fused in a post-processing module, without changing the network structure of either detector. CLOCs [69] builds a late fusion architecture on top of any pair of pre-trained image and LiDAR detectors. The output candidates of the LiDAR and image detectors are combined before the non-maximum suppression operation to exploit geometric and semantic consistencies. Individual 2D and 3D candidates are first pre-processed by specific tensor operations so that both are encoded in a consistent joint representation as a sparse tensor. A set of 2D convolution layers then fuses this sparse tensor into a processed tensor, and a max-pooling operation maps that tensor to the targets (formatted as a score map). Experimental results on the KITTI dataset show that fusing the single-stage 3D detector SECOND [70] with the 2D detector Cascade R-CNN [71] outperforms single-modality SECOND by a large margin. The architecture of CLOCs is shown in Figure 2d.
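To make this fusion head concrete, the following is a minimal PyTorch sketch in the spirit of CLOCs [69] rather than the original implementation: the class name and channel widths are illustrative, and we assume the pairwise candidate features (e.g., the IoU between each projected 3D box and each 2D box plus the two detection scores) have already been assembled into a dense (k, n, 4) tensor for k 2D and n 3D candidates.

    import torch
    import torch.nn as nn

    class LateFusionHead(nn.Module):
        # Fuses per-pair candidate features with 1x1 convolutions and max-pools
        # over the 2D-candidate axis to produce one refined score per 3D candidate.
        def __init__(self, in_channels: int = 4):
            super().__init__()
            self.fuse = nn.Sequential(                      # illustrative channel widths
                nn.Conv2d(in_channels, 18, kernel_size=1), nn.ReLU(),
                nn.Conv2d(18, 36, kernel_size=1), nn.ReLU(),
                nn.Conv2d(36, 1, kernel_size=1),
            )

        def forward(self, pair_features: torch.Tensor) -> torch.Tensor:
            # pair_features: (k, n, 4) -> (1, 4, k, n) so the 1x1 convs act on each 2D-3D pair.
            x = pair_features.permute(2, 0, 1).unsqueeze(0)
            x = self.fuse(x)                                # (1, 1, k, n)
            scores = x.max(dim=2).values                    # max-pool over the k 2D candidates
            return scores.squeeze(0).squeeze(0)             # (n,) refined scores for the 3D boxes

    # Example: 20 2D candidates, 50 3D candidates, 4 pairwise features each.
    refined = LateFusionHead()(torch.rand(20, 50, 4))

The dense tensor above is used only for clarity; as described in the text, CLOCs encodes only the geometrically consistent 2D-3D pairs, which is what makes the joint representation sparse and keeps the fusion step inexpensive.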