
detectors can be categorized into two-stage and single-stage detectors. Two-stage detectors detect the target from regions of interest proposed from the feature map, while single-stage detectors perform the task directly on densely sliding anchor boxes or anchor points from the pyramid map. This section summarizes contemporary 3D object detection research, focusing on the diverse data modalities produced by different sensors. Table 3 gives a summary of 3D object detection methods, and Table 4 summarizes their experimental results on the KITTI 3D object detection test benchmark.

4.1. LiDAR-only detection
LiDAR-only detection generates 3D bounding boxes with networks that are fed only a LiDAR point cloud. In general, two-stage detection processes LiDAR data in a point-based representation, while single-stage detection performs the task on multiple formats, including point cloud-based, multi-view, and volumetric representations.
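To make the volumetric option concrete, the minimal sketch below (Python with NumPy; grid extents, voxel sizes, and the per-voxel point cap are illustrative choices, not values from any cited paper) shows the grouping step that voxel-based detectors build on: bucketing raw points into a 3D grid.

    import numpy as np

    def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
                 pc_range=(0, -40, -3, 70.4, 40, 1), max_points=35):
        """Bucket LiDAR points (N, 4: x, y, z, intensity) into a 3D voxel grid.

        Mirrors the grouping step of voxel-based detectors; all parameter
        values here are illustrative defaults, not taken from the papers.
        """
        lo, hi = np.array(pc_range[:3]), np.array(pc_range[3:])
        keep = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
        pts = points[keep]
        # Integer (ix, iy, iz) voxel index for every surviving point.
        coords = ((pts[:, :3] - lo) / np.array(voxel_size)).astype(np.int32)
        voxels = {}
        for pt, c in zip(pts, map(tuple, coords)):
            bucket = voxels.setdefault(c, [])
            if len(bucket) < max_points:  # cap occupancy, akin to random sampling
                bucket.append(pt)
        return voxels

    # Example on a synthetic cloud of 10,000 points.
    cloud = np.random.uniform([0, -40, -3, 0], [70.4, 40, 1, 1],
                              size=(10000, 4)).astype(np.float32)
    print(len(voxelize(cloud)), "non-empty voxels")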

4.1.1. Two-stage detection
In two-stage detection, segmentation is a widely used method to remove noisy points and generate proposals in the first sub-module of the detector. One typical detection model is IPOD [50], which seeds instance-level proposals with context and local features extracted via projected segmentation. In 2019, STD [51] created point-level spherical anchors and parallel intersection-over-union (IoU) branches to improve localization accuracy. Following the proposal scheme of PointRCNN [52] (whose network is illustrated in Figure 2a), PointRGCN [53] introduces a graph convolutional network that aggregates per-proposal/per-frame features to improve detection performance. Shi et al. [54] extended the method of PointRCNN [52] in another way, obtaining 3D proposals and intra-object part locations with a part-aware module and regressing the 3D bounding boxes from fused appearance and location features in a part-aggregation framework. HVNet [55] fuses multi-scale voxel features point-wise, a scheme termed hybrid voxel feature encoding: after voxelizing the point cloud at multiple scales, HVNet extracts hybrid voxel features with an attentive voxel feature encoder, and pseudo-image features are then obtained through point-wise scale aggregation. To remedy the proposal size ambiguity problem, LiDAR R-CNN [56] uses boundary offsets and virtual points, yielding a plug-and-play universal 3D object detector.
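As a hedged illustration of the propose-then-refine flow these methods share, the runnable sketch below replaces every learned component with a trivial stand-in: a height threshold for foreground segmentation, a fixed car-sized box prior for proposal seeding, and recentring on pooled points as "refinement". All function names and parameter values are hypothetical; no cited method works exactly this way.

    import numpy as np

    def segment_foreground(points):
        """Stage 1a: score points as foreground. Real detectors learn this;
        a height threshold stands in so the skeleton runs end to end."""
        return points[:, 2] > -1.0  # hypothetical stand-in, not a learned model

    def generate_proposals(points, fg_mask, num_proposals=64):
        """Stage 1b: seed one axis-aligned box per sampled foreground point."""
        fg = points[fg_mask]
        idx = np.random.choice(len(fg), size=min(num_proposals, len(fg)),
                               replace=False)
        centers = fg[idx, :3]
        sizes = np.tile([3.9, 1.6, 1.56], (len(centers), 1))  # car-sized prior
        return np.hstack([centers, sizes])  # (M, 6): x, y, z, l, w, h

    def refine_proposals(points, proposals):
        """Stage 2: pool the points inside each proposal and refine the box;
        here 'refinement' is simply recentring on the pooled points."""
        refined = proposals.copy()
        for i, (cx, cy, cz, l, w, h) in enumerate(proposals):
            inside = ((np.abs(points[:, 0] - cx) < l / 2) &
                      (np.abs(points[:, 1] - cy) < w / 2) &
                      (np.abs(points[:, 2] - cz) < h / 2))
            if inside.any():
                refined[i, :3] = points[inside].mean(axis=0)
        return refined

    cloud = np.random.uniform([0, -40, -3], [70.4, 40, 1], size=(5000, 3))
    boxes = refine_proposals(cloud, generate_proposals(cloud,
                                                       segment_foreground(cloud)))
    print(boxes.shape)  # (64, 6)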


4.1.2. Single-stage detection
Unlike the two-stage detector, which outputs final fine-grained detection results on the proposals, the single-stage detector classifies and locates 3D objects with a fully convolutional framework and a transformed representation. This makes the foreground more susceptible to adjacent background points and thus decreases detection accuracy, and multiple methods have emerged to address the problem. For example, VoxelNet [57] extracts voxel-wise features from point clouds in a volumetric representation with random sampling and normalization, after which it utilizes a 3D-CNN-based framework and a region proposal network to detect 3D objects. To bridge the gap between 3D-CNN-based and 2D-CNN-based detection, the authors of [58] applied PointNet [1] to point clouds to generate a vertical-column representation, which enables point clouds to be processed by a subsequent 2D-CNN-based detection framework.
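The sketch below shows the core idea of such a vertical-column representation: scattering per-column features into a dense pseudo-image that a standard 2D CNN can consume. Grid and feature sizes are illustrative, and a coordinate-wise max stands in for the learned per-column point encoder of [58].

    import numpy as np

    def pillarize(points, grid=(432, 496),
                  xy_range=(0, -39.68, 69.12, 39.68), feat_dim=64):
        """Collapse a point cloud into vertical columns and scatter per-column
        features into a dense (feat_dim, H, W) pseudo-image for a 2D CNN.
        A coordinate-wise max replaces the learned per-column encoder."""
        xs = ((points[:, 0] - xy_range[0]) /
              (xy_range[2] - xy_range[0]) * grid[0]).astype(int)
        ys = ((points[:, 1] - xy_range[1]) /
              (xy_range[3] - xy_range[1]) * grid[1]).astype(int)
        valid = (xs >= 0) & (xs < grid[0]) & (ys >= 0) & (ys < grid[1])
        canvas = np.zeros((feat_dim, grid[1], grid[0]), dtype=np.float32)
        d = points.shape[1]  # raw channels actually filled (here 4)
        for p, x, y in zip(points[valid], xs[valid], ys[valid]):
            canvas[:d, y, x] = np.maximum(canvas[:d, y, x], p)
        return canvas  # ready for a 2D convolutional backbone

    cloud = np.random.uniform([0, -39.68, -3, 0], [69.12, 39.68, 1, 1],
                              size=(8000, 4)).astype(np.float32)
    print(pillarize(cloud).shape)  # (64, 496, 432)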
Multi-task learning work [59] introduces a part-sensitive warping module and an auxiliary module to refine the features extracted from the backbone network, adapting the ROI pooling of the R-FCN [60] detection module. As illustrated in Figure 2c, TANet [61] designs a stacked triple attention module and a coarse-to-fine regression module to reduce the disturbance of noisy points and improve detection performance on hard-level objects. SE-SSD [62] contains a teacher SSD and a student SSD: the teacher produces soft targets by predicting relatively accurate results (after global transformation) from the input point cloud, while the student takes input augmented by a novel shape-aware data augmentation scheme and is trained with a consistency loss under the supervision of hard-level targets.
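A minimal sketch of such a teacher-student consistency loss is given below (PyTorch). SE-SSD pairs teacher and student boxes by IoU and filters them carefully; here that pairing is reduced to a simple confidence mask, and all tensor shapes and thresholds are illustrative.

    import torch
    import torch.nn.functional as F

    def consistency_loss(student_boxes, student_logits,
                         teacher_boxes, teacher_logits, conf_threshold=0.3):
        """Consistency loss in the spirit of SE-SSD: confident teacher
        predictions act as soft targets for the student's paired
        predictions. A confidence mask stands in for IoU-based pairing."""
        with torch.no_grad():
            conf = teacher_logits.softmax(-1).max(-1).values
            mask = conf > conf_threshold  # keep only confident soft targets
        box_loss = F.smooth_l1_loss(student_boxes[mask], teacher_boxes[mask])
        cls_loss = F.kl_div(student_logits[mask].log_softmax(-1),
                            teacher_logits[mask].softmax(-1),
                            reduction="batchmean")
        return box_loss + cls_loss

    # Toy example: 100 paired predictions, 7-dim boxes (x, y, z, l, w, h, yaw),
    # 3 classes; only the student tensors carry gradients.
    s_boxes = torch.randn(100, 7, requires_grad=True)
    s_logits = torch.randn(100, 3, requires_grad=True)
    loss = consistency_loss(s_boxes, s_logits,
                            torch.randn(100, 7), torch.randn(100, 3))
    loss.backward()
    print(float(loss))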
3D auto-labeling [63], which aims to save the cost of human labeling, proposes a novel off-board 3D object detector that exploits complementary contextual information from point cloud sequences, achieving performance on par with human labels.