Table 2. Experimental results of 3D object classification methods on the ModelNet40 benchmark. Here "I", "mvPC", "vPC", "pPC", and "rm" stand for image, multiple views of a point cloud, voxelized point cloud, point cloud, and range map, respectively. "OA" denotes the overall accuracy, i.e., the mean accuracy over all test instances; "mAcc" denotes the mean accuracy, i.e., the mean accuracy over all shape categories. The "%" after each number is omitted for simplicity; "-" means the result is not available.

Category | Model | Modal. & Repr. | Novelty | OA | mAcc
LiDAR-Only | PointNet [1] | pPC | point-wise MLP + T-Net + global max pooling | 89.2 | 86.2
LiDAR-Only | PointNet++ [2] | pPC | set abstraction (sampling, grouping, feature learning) + fully connected layers | 90.7 | 90.7
LiDAR-Only | Momen(e)t [32] | pPC | MLP + max pooling + pPC coordinates and their polynomial functions as input | 89.3 | 86.1
LiDAR-Only | SRN [34] | pPC | structural relation network (geometric and locational features + MLP) | 91.5 | -
LiDAR-Only | PointASNL [35] | pPC | adaptive sampling module + local-nonlocal module | 92.9 | -
LiDAR-Only | PointConv [36] | pPC | MLP to approximate a weight function + a density scale | 92.5 | -
LiDAR-Only | RS-CNN [38] | pPC | relation-shape convolution (shared MLP + channel-raising mapping) | 92.6 | -
LiDAR-Only | DensePoint [39] | pPC | PConv + PPooling (dense connection-like) | 93.2 | -
LiDAR-Only | ShellNet [29] | pPC | ShellConv (KNN + max pooling + shared MLP + conv order) | 93.1 | -
LiDAR-Only | InterpConv [40] | pPC | interpolated convolution operation + max pooling | 93.0 | -
LiDAR-Only | DRINet [42] | vPC+pPC | sparse point-voxel feature extraction + sparse voxel-point feature extraction | 93.0 | -
LiDAR-Fusion | MV3D [12] | I&mvPC | 3D proposal network + region-based fusion network | - | -
LiDAR-Fusion | SCANet [46] | I&mvPC | multi-level fusion + spatial-channel attention + extension spatial upsample module | - | -
LiDAR-Fusion | MMF [47] | I&mvPC | point-wise fusion + ROI feature fusion | - | -
LiDAR-Fusion | ImVoteNet [48] | I&pPC | lift 2D image votes, semantic and texture cues to the 3D seed points | - | -
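To make the two metrics in Table 2 concrete, the following minimal sketch (not part of any surveyed method; function and variable names are illustrative only) computes OA and mAcc from predicted and ground-truth labels:

```python
import numpy as np

def overall_and_mean_accuracy(y_true, y_pred, num_classes):
    """OA: accuracy over all test instances;
    mAcc: per-class accuracy averaged over shape categories."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)

    # OA: fraction of correctly classified instances.
    oa = float(np.mean(y_true == y_pred))

    # mAcc: accuracy computed per class, then averaged over classes.
    per_class = [float(np.mean(y_pred[y_true == c] == c))
                 for c in range(num_classes) if np.any(y_true == c)]
    macc = float(np.mean(per_class))
    return oa, macc
```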


Classifiers integrated into two-stage LiDAR-fusion 3D detectors can be divided into two categories: (1) classifiers that distinguish targets from the background; and (2) classifiers that predict the final category of the target object. Chen et al. [12] designed a deep fusion framework named multi-view 3D networks (MV3D), which combines LiDAR point clouds and RGB images. The network adopts a deep fusion scheme that alternately performs feature transformation and feature fusion, which belongs to the early fusion architecture. MV3D comprises a 3D proposal network and a region-based fusion network, each of which contains a classifier. The classifier in the 3D proposal network distinguishes foreground from background, and its output, together with the 3D boxes generated by the 3D box regressor, is fed to the 3D proposal module to generate 3D proposals. The final results are obtained by a multiclass classifier that predicts the object category through a deep fusion approach, using the element-wise mean as the join operation and fusing region features generated from the multi-modal data.
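As a rough illustration of the element-wise mean join described above, the following PyTorch-style sketch implements a generic deep fusion block; the feature dimension, number of stages, and module names are assumptions for illustration rather than the exact MV3D configuration.

```python
import torch
import torch.nn as nn

class DeepFusion(nn.Module):
    """Sketch of a deep fusion block: each view keeps its own transformation
    per stage, and the branches are joined by an element-wise mean."""

    def __init__(self, dim=256, num_stages=3, num_views=3):  # illustrative sizes
        super().__init__()
        self.stages = nn.ModuleList([
            nn.ModuleList([nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                           for _ in range(num_views)])
            for _ in range(num_stages)
        ])

    def forward(self, view_feats):
        # view_feats: list of per-view ROI features, each of shape (N, dim)
        fused = torch.stack(view_feats, dim=0).mean(dim=0)   # initial element-wise mean join
        for stage in self.stages:
            branch_out = [h(fused) for h in stage]            # view-specific transforms
            fused = torch.stack(branch_out, dim=0).mean(dim=0)  # join again by element-wise mean
        return fused
```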
Motivated by this deep fusion scheme [12], SCANet [46] proposes multi-level fusion layers that fuse 3D region proposals generated by an object classifier and a 3D box regressor, enabling interactions among features. SCANet also introduces an attention mechanism in the spatial and channel-wise dimensions in order to capture global and multi-scale context information.
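The spatial- and channel-wise attention idea can be illustrated with a generic module like the sketch below; it is a simplified stand-in with assumed layer sizes, not SCANet's exact attention design.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Generic channel-wise and spatial attention over a (N, C, H, W) feature map."""

    def __init__(self, channels, reduction=8):  # reduction ratio is illustrative
        super().__init__()
        # Channel attention: squeeze spatial dims, then re-weight channels.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # Spatial attention: 1x1 convolution produces a per-location weight map.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        ca = self.channel_fc(x.mean(dim=(2, 3)))   # (N, C) channel weights
        x = x * ca.view(n, c, 1, 1)
        sa = self.spatial_conv(x)                  # (N, 1, H, W) spatial weights
        return x * sa
```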
The multi-task multi-sensor fusion architecture (MMF) [47] accomplishes several tasks within one framework, including object classification, 3D box estimation, 2D and 3D box refinement, depth completion, and ground estimation. In the 3D classification part, LiDAR point clouds are first projected into a ground-relative bird's eye view (BEV) representation by the online mapping module; the features extracted from the LiDAR point clouds and RGB images are then fused by the dense fusion module and fed into the LiDAR backbone network to predict category probabilities. This multi-task multi-sensor architecture performs robustly on the TOR4D benchmark.
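A minimal sketch of rasterizing a LiDAR point cloud into a BEV grid is given below; it assumes a fixed ground plane and illustrative grid extents and resolution, whereas MMF forms a ground-relative BEV from an online ground estimate.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                  resolution=0.1):
    """Rasterize a LiDAR point cloud (N, 3) into a BEV grid holding the
    maximum height per cell; ranges and resolution are illustrative."""
    xs, ys, zs = points[:, 0], points[:, 1], points[:, 2]
    keep = (xs >= x_range[0]) & (xs < x_range[1]) & \
           (ys >= y_range[0]) & (ys < y_range[1])
    xs, ys, zs = xs[keep], ys[keep], zs[keep]

    cols = ((xs - x_range[0]) / resolution).astype(np.int32)
    rows = ((ys - y_range[0]) / resolution).astype(np.int32)

    h = int((y_range[1] - y_range[0]) / resolution)
    w = int((x_range[1] - x_range[0]) / resolution)
    bev = np.full((h, w), -np.inf, dtype=np.float32)
    np.maximum.at(bev, (rows, cols), zs)   # keep max height per cell
    bev[np.isinf(bev)] = 0.0               # empty cells set to zero
    return bev
```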
For one-stage fused 3D detectors, the classifier is generally applied in a different way, because one-stage detectors perform classification and regression simultaneously. Qi et al. [48] proposed a one-stage architecture named ImVoteNet, which lifts 2D votes to 3D to improve 3D classification and detection performance. The architecture consists of two parts: one leverages 2D images to pass geometric, semantic, and texture cues to the 3D seed points; the other proposes and classifies targets on the basis of a voting mechanism such as Hough voting. The results show that this method boosts 3D recognition, with improved mAP compared with the previous best model [49].
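The geometric part of lifting a 2D vote to 3D can be sketched as follows; this simplified version assumes the object centre shares the seed point's depth and uses illustrative names, so it only approximates ImVoteNet's geometric cue.

```python
import numpy as np

def lift_2d_vote(seed_xyz, vote_2d, K):
    """Lift a 2D image vote to a pseudo 3D vote for one seed point.

    seed_xyz : (3,) seed point in camera coordinates
    vote_2d  : (2,) pixel offset from the seed's projection to the 2D box centre
    K        : (3, 3) camera intrinsic matrix
    """
    # Project the 3D seed onto the image plane.
    uvw = K @ seed_xyz
    uv = uvw[:2] / uvw[2]

    # Apply the 2D vote to reach the estimated 2D object centre.
    uv_center = uv + vote_2d

    # Back-project the voted pixel to 3D at the seed's depth (simplifying assumption).
    depth = seed_xyz[2]
    center_xyz = depth * (np.linalg.inv(K) @ np.array([uv_center[0], uv_center[1], 1.0]))

    # The pseudo 3D vote is the displacement from the seed to that centre.
    return center_xyz - seed_xyz
```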




               4. 3D OBJECT DETECTION
All deep learning detectors follow a similar idea: they extract features from the input data with the backbone and neck of the framework to generate proposals, and then classify and localize the objects with 3D bounding boxes using the head part. Depending on whether region proposals are generated or not, the object