detectors can be categorized into two-stage and single-stage detectors. Two-stage detectors detect targets from regions of interest proposed from the feature map, while single-stage detectors perform the task directly on densely tiled anchor boxes or anchor points over the pyramid feature map. This section summarizes contemporary 3D object detection research, focusing on the diverse data modalities provided by different sensors. Table 3 summarizes the 3D object detection methods, and Table 4 summarizes their experimental results on the KITTI 3D object detection test benchmark.
4.1. LiDAR-only detection
LiDAR-only detection generates 3D bounding boxes with networks fed solely by a LiDAR point cloud. In general, two-stage detection processes LiDAR data in a point-based representation, while single-stage detection performs the task on multiple formats, including point cloud-based, multi-view, and volumetric representations.
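To make the volumetric representation concrete, below is a minimal sketch (not from any of the surveyed papers) of discretizing a LiDAR point cloud into a voxel grid; the function name, grid resolution, and range are illustrative assumptions.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4),
             pc_range=(0.0, -40.0, -3.0, 70.4, 40.0, 1.0)):
    """Assign each LiDAR point (x, y, z, ...) to an integer voxel coordinate.

    Grid resolution and range are illustrative, not taken from any paper.
    Returns the in-range points, their voxel coordinates, and a dict
    grouping point indices by voxel.
    """
    lo = np.array(pc_range[:3])
    hi = np.array(pc_range[3:])
    mask = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    pts = points[mask]
    coords = np.floor((pts[:, :3] - lo) / np.array(voxel_size)).astype(np.int64)
    voxels = {}
    for i, key in enumerate(map(tuple, coords)):
        voxels.setdefault(key, []).append(i)
    return pts, coords, voxels
```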
4.1.1. Two-stage detection
For two-stage detection, segmentation is a widely used technique in the first sub-module to remove noisy points and generate proposals. One typical detection model is IPOD [50], which seeds instance-level proposals with context and local features extracted via projected segmentation. In 2019, STD [51] created point-level spherical anchors and parallel intersection-over-union (IoU) branches to improve localization accuracy.
Following the proposal scheme of PointRCNN [52] (whose network is illustrated in Figure 2a), PointRGCN [53] introduces a graph convolutional network that aggregates per-proposal and per-frame features to improve detection performance. Shi et al. [54] extended PointRCNN [52] in another direction: a part-aware module obtains 3D proposals together with intra-object part locations, and the 3D bounding boxes are then regressed from the fusion of appearance and location features in a part-aggregation framework.
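The detectors above all follow a propose-then-refine control flow: a first stage scores foreground points and emits coarse proposals, and a second stage refines boxes from pooled per-proposal features. The following is a schematic sketch of that pattern only; the sub-modules and the 0.5 foreground threshold are placeholders, not the architecture of any specific paper.

```python
import torch.nn as nn

class TwoStageDetector(nn.Module):
    """Schematic propose-then-refine pipeline (PointRCNN-style pattern).

    All three sub-modules are injected placeholders; this sketch only
    shows how the two stages are chained.
    """
    def __init__(self, backbone, proposal_head, roi_head):
        super().__init__()
        self.backbone = backbone            # point-wise feature extractor
        self.proposal_head = proposal_head  # foreground scores + coarse boxes
        self.roi_head = roi_head            # per-proposal box refinement

    def forward(self, points):
        feats = self.backbone(points)                 # (N, C) point features
        fg_logits, proposals = self.proposal_head(feats)
        keep = fg_logits.sigmoid() > 0.5              # crude foreground filter
        return self.roi_head(points, feats, proposals[keep])
```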
HVNet [55] fuses multi-scale voxel features point-wise, a scheme termed hybrid voxel feature encoding: after voxelizing the point cloud at multiple scales, HVNet extracts hybrid voxel features with an attentive voxel feature encoder, and pseudo-image features are then obtained through point-wise scale aggregation.
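As a rough illustration of point-wise multi-scale fusion in the spirit of HVNet's attentive encoder, here is a simplified sketch under assumed shapes; it mirrors the idea of attention-weighted scale aggregation, not the paper's exact module.

```python
import torch
import torch.nn as nn

class AttentiveScaleFusion(nn.Module):
    """Sketch of point-wise fusion of features from multiple voxel scales.

    Given per-point features gathered from S voxel scales, a small MLP
    predicts one attention weight per scale and the features are fused
    as a weighted sum.
    """
    def __init__(self, channels, num_scales):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(channels * num_scales, num_scales),
            nn.Softmax(dim=-1),
        )

    def forward(self, scale_feats):        # (N, S, C): N points, S scales
        n, s, c = scale_feats.shape
        w = self.attn(scale_feats.reshape(n, s * c))       # (N, S) weights
        return (w.unsqueeze(-1) * scale_feats).sum(dim=1)  # (N, C) fused
```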
To remedy the proposal size ambiguity problem, LiDAR R-CNN [56] uses boundary offsets and virtual points, yielding a plug-and-play universal 3D object detector.
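The boundary-offset idea can be illustrated with a short sketch: for each point in a proposal (expressed in the proposal's canonical frame), append its distances to the six box faces so the refinement stage can perceive the proposal's extent. The function name and frame convention are assumptions; this is not LiDAR R-CNN's exact implementation.

```python
import numpy as np

def boundary_offsets(points, box):
    """Append each point's distance to the six faces of a proposal box.

    `points` is (N, 3) in the proposal's canonical, axis-aligned frame
    (origin at the box center) and `box` = (l, w, h). Concatenating the
    offsets onto the raw coordinates gives the network an explicit
    notion of proposal size, addressing the size-ambiguity problem.
    """
    l, w, h = box
    half = np.array([l / 2, w / 2, h / 2])
    to_max = half - points        # distances to the +x/+y/+z faces
    to_min = points + half        # distances to the -x/-y/-z faces
    return np.concatenate([points, to_min, to_max], axis=1)  # (N, 9)
```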
4.1.2. Single-stage detection
Unlike the two-stage detector, which outputs final fine-grained detection results from proposals, the single-stage detector classifies and locates 3D objects with a fully convolutional framework operating on a transformed representation. This makes foreground points more susceptible to interference from adjacent background points and thus decreases detection accuracy, and multiple methods have emerged to address the problem. For example, VoxelNet [57] extracts voxel-wise features from point clouds in a volumetric representation with random sampling and normalization, after which it utilizes a 3D-CNN-based framework and a region proposal network to detect 3D objects.
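Below is a minimal sketch of one voxel feature encoding (VFE) layer in the VoxelNet style, assuming densely padded voxel tensors and omitting the empty-slot masking of the original.

```python
import torch
import torch.nn as nn

class VFELayer(nn.Module):
    """Sketch of a voxel feature encoding (VFE) layer, VoxelNet style.

    Each point in a voxel is transformed by a shared linear layer; the
    element-wise max over the voxel's points gives a locally aggregated
    feature that is concatenated back onto every point.
    """
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fc = nn.Linear(in_ch, out_ch // 2)
        self.bn = nn.BatchNorm1d(out_ch // 2)

    def forward(self, voxel_points):              # (V, T, C): V voxels, T points
        v, t, _ = voxel_points.shape
        x = self.fc(voxel_points)
        x = self.bn(x.view(v * t, -1)).view(v, t, -1).relu()
        agg = x.max(dim=1, keepdim=True).values   # (V, 1, C') voxel aggregate
        return torch.cat([x, agg.expand(-1, t, -1)], dim=-1)  # (V, T, out_ch)
```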
To bridge the gap between 3D-CNN-based and 2D-CNN-based detection, the authors of [58] applied PointNet [1] to point clouds to generate a vertical-column representation, which enables point clouds to be processed by a subsequent 2D-CNN-based detection framework.
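The column-to-2D-CNN bridge can be sketched as a scatter of encoded column features onto a dense 2D canvas, after which any 2D convolutional backbone applies; the function name and shapes here are assumptions.

```python
import torch

def scatter_to_pseudo_image(column_feats, coords, nx, ny):
    """Scatter encoded vertical-column features onto a 2D grid.

    `column_feats` is (P, C) with one feature per non-empty column and
    `coords` is a (P, 2) long tensor holding each column's (x, y) grid
    index. The resulting (C, ny, nx) pseudo-image can be consumed by an
    ordinary 2D CNN.
    """
    c = column_feats.shape[1]
    canvas = torch.zeros(c, ny * nx, dtype=column_feats.dtype)
    flat = coords[:, 1] * nx + coords[:, 0]   # row-major flat index
    canvas[:, flat] = column_feats.t()
    return canvas.view(c, ny, nx)
```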
A multi-task learning work [59] introduces a part-sensitive warping module and an auxiliary module that refine the features extracted from the backbone network, adapting the RoI pooling of the R-FCN [60] detection module.
As illustrated in Figure 2c, TANet [61] designs a stacked triple attention module and a coarse-to-fine regression module to reduce the disturbance of noisy points and improve detection performance on hard-level objects.
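To give a flavor of the stacked triple attention, here is a heavily simplified sketch in which point-wise, channel-wise, and voxel-wise gates successively re-weight the points inside each voxel; it mirrors the idea of suppressing noisy points before aggregation and is not TANet's exact module.

```python
import torch
import torch.nn as nn

class TripleAttention(nn.Module):
    """Simplified point-/channel-/voxel-wise attention over voxel features.

    Three lightweight sigmoid gates re-weight the features of each
    voxel's points so that noisy points contribute less.
    """
    def __init__(self, ch):
        super().__init__()
        self.point_gate = nn.Linear(ch, 1)   # one weight per point
        self.chan_gate = nn.Linear(ch, ch)   # one weight per channel
        self.voxel_gate = nn.Linear(ch, 1)   # one weight per voxel

    def forward(self, x):                    # (V, T, C)
        pw = torch.sigmoid(self.point_gate(x))             # (V, T, 1)
        cw = torch.sigmoid(self.chan_gate(x.mean(dim=1)))  # (V, C)
        x = x * pw * cw.unsqueeze(1)
        vw = torch.sigmoid(self.voxel_gate(x.max(dim=1).values))  # (V, 1)
        return x * vw.unsqueeze(1)
```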
SE-SSD [62] contains a teacher SSD and a student SSD. The teacher SSD produces soft targets by predicting relatively accurate results (after global transformation) from the input point cloud. The student SSD takes augmented input, produced by a novel shape-aware data augmentation, and is trained with a consistency loss toward the soft targets along with supervision from hard targets.
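The teacher-student loop can be sketched as follows; the loss composition, augmentation handling, and EMA rate are assumptions for illustration (in particular, SE-SSD aligns the teacher's predictions with the augmented frame, which is elided here).

```python
import torch
import torch.nn.functional as F

def self_ensembling_step(student, teacher, points, augment, hard_loss_fn,
                         ema=0.999):
    """One illustrative teacher-student training step (SE-SSD spirit).

    The teacher predicts soft targets on the clean input; the student is
    trained on an augmented copy with a consistency loss toward those
    soft targets plus the usual hard-target loss. The teacher is then
    updated as an exponential moving average (EMA) of the student.
    """
    with torch.no_grad():
        soft = teacher(points)              # soft targets from the teacher
    pred = student(augment(points))         # student sees augmented input
    loss = F.smooth_l1_loss(pred, soft) + hard_loss_fn(pred)
    loss.backward()
    with torch.no_grad():                   # EMA update of the teacher
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema).add_(p_s, alpha=1.0 - ema)
    return loss.detach()
```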
3D auto-labeling [63], which aims to save the cost of human labeling, proposes a novel off-board 3D object detector that exploits complementary contextual information from point cloud sequences, achieving performance on par with human labels.