Table 2. Experimental results of 3D object classification methods on the ModelNet40 benchmark. Here "I", "mvPC", "vPC", "pPC", and "rm" stand for image, multiple views of point cloud, voxelized point cloud, point cloud, and range map, respectively. "OA" denotes the overall accuracy, i.e., the mean accuracy over all test instances; "mAcc" denotes the mean class accuracy, i.e., the mean accuracy over all shape categories. The "%" after each number is omitted for simplicity; "-" means the result is not available.
| Category | Model | Modal. & Repr. | Novelty | OA | mAcc |
|---|---|---|---|---|---|
| LiDAR-only | PointNet [1] | pPC | point-wise MLP + T-Net + global max pooling | 89.2 | 86.2 |
| | PointNet++ [2] | pPC | set abstraction (sampling, grouping, feature learning) + fully connected layers | 90.7 | 90.7 |
| | Momen(e)t [32] | pPC | MLP + max pooling + pPC coordinates and their polynomial functions as input | 89.3 | 86.1 |
| | SRN [34] | pPC | structural relation network (geometric and locational features + MLP) | 91.5 | - |
| | PointASNL [35] | pPC | adaptive sampling module + local-nonlocal module | 92.9 | - |
| | PointConv [36] | pPC | MLP to approximate a weight function + a density scale | 92.5 | - |
| | RS-CNN [38] | pPC | relation-shape convolution (shared MLP + channel-raising mapping) | 92.6 | - |
| | DensePoint [39] | pPC | PConv + PPooling (dense connection like) | 93.2 | - |
| | ShellNet [29] | pPC | shellconv (KNN + max pooling + shared MLP + conv order) | 93.1 | - |
| | InterpConv [40] | pPC | interpolated convolution operation + max pooling | 93.0 | - |
| | DRINet [42] | vPC+pPC | sparse point-voxel feature extraction + sparse voxel-point feature extraction | 93.0 | - |
| LiDAR-fusion | MV3D [12] | I&mvPC | 3D proposal network + region-based fusion network | - | - |
| | SCANet [46] | I&mvPC | multi-level fusion + spatial-channel attention + extension spatial upsample module | - | - |
| | MMF [47] | I&mvPC | point-wise fusion + ROI feature fusion | - | - |
| | ImVoteNet [48] | I&pPC | lift 2D image votes, semantic and texture cues to the 3D seed points | - | - |
Classifiers integrated into two-stage LiDAR-fusion 3D detectors can be divided into two categories: (1) classifiers that distinguish targets from background; and (2) classifiers that predict the final category of the target object. Chen et al. [12] designed a deep fusion framework named multi-view 3D networks (MV3D) that combines LiDAR point clouds and RGB images. The network adopts a deep fusion scheme that alternately performs feature transformation and feature fusion, which belongs to the early fusion architecture. MV3D comprises a 3D proposal network and a region-based fusion network, each of which has a classifier. The classifier in the 3D proposal network distinguishes whether a region belongs to the foreground or the background, and its results, along with the 3D boxes generated by the 3D box regressor, are fed to the 3D proposal module to generate 3D proposals. The final results are obtained by a multiclass classifier that predicts object categories through a deep fusion approach, using the element-wise mean as the join operation and fusing the regions generated from the multi-modal data.
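As a rough illustration of this deep fusion scheme, the sketch below alternates per-view feature transformation with an element-wise mean join, in the spirit of MV3D; the layer sizes, number of views, and module names are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of an MV3D-style deep fusion block (shapes and layer sizes
# are assumptions for illustration, not taken from the paper).
import torch
import torch.nn as nn


class DeepFusionBlock(nn.Module):
    """Alternates per-view feature transformation with element-wise mean fusion."""

    def __init__(self, channels: int, num_views: int = 3):
        super().__init__()
        # One transformation branch per view (e.g., BEV, front view, image).
        self.transforms = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, channels), nn.ReLU())
            for _ in range(num_views)
        )

    def forward(self, view_feats):
        # view_feats: list of per-view ROI features, each of shape (N, C).
        # Join operation: element-wise mean across views.
        fused = torch.stack(view_feats, dim=0).mean(dim=0)
        # Feed the fused feature back through each view-specific transform.
        return [t(fused) for t in self.transforms]


# Usage: stack a few blocks so transformation and fusion alternate.
feats = [torch.randn(8, 256) for _ in range(3)]  # toy ROI features per view
for block in [DeepFusionBlock(256) for _ in range(3)]:
    feats = block(feats)
final = torch.stack(feats, dim=0).mean(dim=0)  # fused feature for the classifier
```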
Motivated by deep fusion [12], SCANet [46] proposes multi-level fusion layers that fuse the 3D region proposals generated by an object classifier and a 3D box regressor, enabling interactions among features. SCANet also introduces an attention mechanism along the spatial and channel-wise dimensions in order to capture global and multi-scale context information.
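The following is a minimal sketch of spatial- and channel-wise attention over a 2D feature map in the spirit of SCANet's attention modules; the exact layer configuration in [46] may differ, so the reduction ratio and 1x1 convolutions here are assumptions.

```python
# Hedged sketch of spatial- and channel-wise attention; not SCANet's exact design.
import torch
import torch.nn as nn


class SpatialChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, then re-weight channels.
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 1x1 conv collapses channels to a saliency map.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, x):
        x = x * self.channel_fc(x)    # re-weight channels
        x = x * self.spatial_conv(x)  # re-weight spatial locations
        return x


attn = SpatialChannelAttention(256)
out = attn(torch.randn(2, 256, 32, 32))  # (N, C, H, W) feature map
```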
The multi-task multi-sensor fusion architecture [47] can accomplish several tasks within one framework, including object classification, 3D box estimation, 2D and 3D box refinement, depth completion, and ground estimation. In the 3D classification part, LiDAR point clouds are first projected into a ground-relative bird's eye view (BEV) representation through the online mapping module; features extracted from the LiDAR point clouds and the RGB images are then fused by the dense fusion module and fed into the LiDAR backbone network to predict category probabilities. This multi-task multi-sensor architecture performs robustly on the TOR4D benchmark.
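To make the BEV projection step concrete, the sketch below voxelizes a LiDAR point cloud into a ground-relative BEV occupancy grid; the crop ranges, resolution, and number of height slices are assumptions for illustration, not the values used in [47].

```python
# Illustrative BEV projection of a LiDAR point cloud into an occupancy grid.
import numpy as np


def points_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                  z_range=(-3.0, 1.0), resolution=0.1, height_bins=8):
    """points: (N, 3) array of x, y, z in the LiDAR frame (metres)."""
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((height_bins, ny, nx), dtype=np.float32)

    # Keep only points inside the crop.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    p = points[mask]

    # Discretize x/y to grid cells and z to height slices.
    xi = ((p[:, 0] - x_range[0]) / resolution).astype(int)
    yi = ((p[:, 1] - y_range[0]) / resolution).astype(int)
    zi = ((p[:, 2] - z_range[0]) / (z_range[1] - z_range[0]) * height_bins).astype(int)
    zi = np.clip(zi, 0, height_bins - 1)

    bev[zi, yi, xi] = 1.0  # binary occupancy per height slice
    return bev


bev = points_to_bev(np.random.rand(1000, 3) * [70, 80, 4] + [0, -40, -3])
print(bev.shape)  # (8, 800, 700)
```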
For one-stage fused 3D detectors, the classifier is generally applied in a different way because one-stage detectors conduct classification and regression simultaneously. Qi et al. [48] proposed a one-stage architecture named ImVoteNet, which lifts 2D votes to 3D to improve 3D classification and detection performance. The architecture consists of two parts: one leverages 2D images to pass geometric, semantic, and texture cues to the 3D voting stage; the other proposes and classifies targets on the basis of a Hough-style voting mechanism. The results show that this method boosts 3D recognition, with improved mAP compared with the previous best model [49].
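The sketch below illustrates one plausible way to lift a 2D image vote to a 3D seed point in the spirit of ImVoteNet: the offset from the seed's image projection to the 2D box centre is back-projected at the seed's depth. The pinhole model, variable names, and the use of the seed depth are assumptions for illustration, not the exact formulation in [48].

```python
# Hedged sketch of lifting a 2D image vote to a 3D seed point.
import numpy as np


def lift_2d_vote(seed_xyz, box_center_2d, K):
    """seed_xyz: (3,) point in camera frame; box_center_2d: (2,) pixel centre;
    K: (3, 3) camera intrinsics."""
    # Project the 3D seed to the image plane.
    uvw = K @ seed_xyz
    seed_uv = uvw[:2] / uvw[2]

    # 2D vote: offset from the projected seed to the 2D box centre.
    vote_2d = box_center_2d - seed_uv

    # Lift: back-project the shifted pixel at the seed's depth to obtain a
    # pseudo 3D vote (the true depth of the object centre is unknown).
    target_uv = seed_uv + vote_2d
    ray = np.linalg.inv(K) @ np.array([target_uv[0], target_uv[1], 1.0])
    lifted_xyz = ray * seed_xyz[2]   # scale the ray to the seed's depth
    return lifted_xyz - seed_xyz     # pseudo 3D vote for this seed


K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
vote = lift_2d_vote(np.array([1.0, 0.5, 8.0]), np.array([350.0, 260.0]), K)
```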
4. 3D OBJECT DETECTION
All deep learning detectors follow a similar idea: they extract features from the input data with the backbone and neck of the framework to generate proposals, and then classify and localize the objects with 3D bounding boxes in the head part. Depending on whether region proposals are generated or not, the object