
clouds adaptively to generate a compact yet rich representation by superpoint graph. RandLA-Net [121] leverages random sampling to downsample large-scale point clouds and a local feature aggregation module to increase the receptive field size. SCF-Net [122] utilizes a spatial contextual features (SCF) module for large-scale point cloud segmentation. As for sensor fusion, deep learning approaches tackling the fusion of large-scale and high-resolution data should place more emphasis on point-based and multi-view-based fusion approaches, which are more scalable than voxel-based ones; a minimal sketch of the random-sampling-and-aggregation idea is given after this list. Overall, the trade-off between performance and computational cost is inevitable for real applications of autonomous driving.
• A robust representation of fused data. For deep learning methods, how to pre-process the multi-modal input data is fundamental and important. Although there are several effective representations for point clouds, each has both advantages and disadvantages: voxel-based representation resolves the ordering problem, but its computational cost grows cubically as the scale of the point cloud or the voxel resolution increases, while the number of points that point-based representations can process is limited by the permutation-invariance requirement and by computational capacity. A consensus on a unified, robust, and effective representation for point clouds is therefore needed. For data fusing images and point clouds, the representation depends on the fusion method. (1) Image representation-based methods mainly project point clouds onto multi-view planes and use them as additional branches of the image; this representation is not directly applicable to 3D tasks because the network outputs its results on the image plane. (2) Point representation-based methods use features or ROIs extracted from the RGB image as additional channels of the point cloud; their performance is limited by the resolution gap between images (relatively high resolution) and point clouds (relatively low resolution). A minimal sketch of this style of fusion (sampling image features at the pixels where LiDAR points project) follows this list. (3) Intermediate data representation methods introduce an intermediate representation (e.g., frustum point clouds and voxelized point clouds); voxel-based variants are limited at large scale, while frustum-based methods have much potential to generate a unified representation from the contextual and structural information of RGB images and LiDAR point clouds.
• Scene understanding tasks based on data sequences. The spatiotemporal information implied in temporally continuous sequences of point clouds and images has long been overlooked. For sensor fusion methods in particular, the mismatch of refresh rates between LiDAR and camera causes incorrect time synchronization between the perception system and the surrounding environment. In addition, predictions based on spatiotemporal information can improve the performance of tasks such as 3D object recognition, segmentation, and point cloud completion. Research has started to take temporal context into consideration: RNNs, LSTMs, and derived deep learning models are able to deal with temporal context. Huang et al. [123] proposed a multi-frame 3D object detection framework based on a sparse LSTM, which predicts 3D objects in the current frame by feeding the features of each frame, together with the hidden and memory features from the last frame, into the LSTM module; a minimal sketch of this idea follows this list. Yuan et al. [124] designed a temporal-channel transformer, whose encoder encodes multi-frame temporal-channel information and whose decoder decodes spatial-channel information for the current frame. TempNet [125] presents a lightweight semantic segmentation framework for large-scale point cloud sequences that contains two key modules, temporal feature aggregation (TFA) and partial feature update (PFU): TFA aggregates features only on a small portion of key frames with an attentional pooling mechanism, and PFU updates features with information from the non-key frames.
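As a rough illustration of the random-sampling strategy mentioned for RandLA-Net [121] above, the sketch below uniformly subsamples a toy point cloud and then averages the features of each point's k nearest neighbours to enlarge its receptive field. The shapes, sampling ratio, and mean-pooling aggregation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def random_downsample(points, features, ratio=0.25, rng=None):
    """Uniformly subsample a large point cloud (random-sampling strategy)."""
    rng = rng or np.random.default_rng(0)
    n_keep = max(1, int(len(points) * ratio))
    idx = rng.choice(len(points), size=n_keep, replace=False)
    return points[idx], features[idx]

def aggregate_local_features(points, features, k=16):
    """Simplified local feature aggregation: average the features of the
    k nearest neighbours of every point (O(N^2), fine for a toy example)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # pairwise distances
    knn = np.argsort(d2, axis=1)[:, :k]                            # k nearest neighbours
    return features[knn].mean(axis=1)                              # (N, C) aggregated features

# Toy usage: 10,000 points with 8-dimensional features.
pts = np.random.rand(10_000, 3).astype(np.float32)
feats = np.random.rand(10_000, 8).astype(np.float32)
pts_ds, feats_ds = random_downsample(pts, feats, ratio=0.1)
feats_agg = aggregate_local_features(pts_ds, feats_ds, k=16)
print(pts_ds.shape, feats_agg.shape)   # (1000, 3) (1000, 8)
```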
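The point representation-based fusion described in item (2) above can be sketched as follows: each LiDAR point is projected into the camera image using assumed (hypothetical) intrinsics K and extrinsics T_cam_lidar, and the image feature at the nearest pixel is appended to the point as extra channels. This is a simplified illustration, not a specific published method.

```python
import numpy as np

def paint_points_with_image_features(points_lidar, image_feats, K, T_cam_lidar):
    """Append the image feature at each LiDAR point's projected pixel as extra
    per-point channels (nearest-pixel sampling; out-of-view points are zeroed)."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])      # homogeneous coords (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]              # LiDAR frame -> camera frame
    in_front = pts_cam[:, 2] > 0.1                          # keep points in front of the camera
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]                           # perspective division -> pixels
    h, w, _ = image_feats.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    painted = image_feats[v, u]                             # (N, C) sampled image features
    painted[~in_front] = 0.0                                # no valid projection -> zero feature
    return np.hstack([points_lidar, painted])               # (N, 3 + C) fused point cloud

# Toy usage with made-up calibration: identity extrinsics, simple pinhole intrinsics.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cam_lidar = np.eye(4)
pts = np.random.uniform(-5.0, 5.0, size=(5000, 3))
pts[:, 2] = np.abs(pts[:, 2]) + 1.0                        # place all points in front of the camera
image_features = np.random.rand(480, 640, 16).astype(np.float32)
fused = paint_points_with_image_features(pts, image_features, K, T_cam_lidar)
print(fused.shape)   # (5000, 19)
```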
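Finally, the LSTM-based temporal fusion idea of Huang et al. [123] can be sketched as follows: a pooled feature vector per frame is fed to an LSTM cell together with the hidden and memory states carried over from the previous frame, and the output represents the temporally fused feature of the current frame. The dimensions and the use of a single global feature per frame are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class TemporalFeatureFusion(nn.Module):
    """Minimal sketch of LSTM-based temporal fusion: each frame's (pooled)
    feature is combined with the hidden and memory states of the previous
    frame, and the resulting state feeds the current-frame task head."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)

    def forward(self, frame_feats):
        # frame_feats: (T, B, feat_dim) - one global feature vector per frame.
        t_steps, batch, _ = frame_feats.shape
        h = frame_feats.new_zeros(batch, self.cell.hidden_size)
        c = frame_feats.new_zeros(batch, self.cell.hidden_size)
        outputs = []
        for t in range(t_steps):
            h, c = self.cell(frame_feats[t], (h, c))   # fuse current frame with history
            outputs.append(h)
        return torch.stack(outputs)                    # (T, B, hidden_dim)

# Toy usage: 5 consecutive frames, batch of 2.
fusion = TemporalFeatureFusion(feat_dim=256, hidden_dim=256)
seq = torch.randn(5, 2, 256)
fused = fusion(seq)
print(fused.shape)   # torch.Size([5, 2, 256])
```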




8. CONCLUSIONS
LiDAR captures point-wise information that is less sensitive to illumination than camera imagery. Moreover, it possesses invariance to scale and rigid transformation, showing a promising future in 3D scene understanding. Focusing on LiDAR-only and LiDAR-fusion 3D perception, this paper first summarizes LiDAR-based datasets and evaluation metrics, and then presents a contemporary review of four key tasks: 3D classification, 3D object detection, 3D object tracking, and 3D segmentation. This work also points out the existing challenges and possible development directions. We hold the belief that LiDAR-only and LiDAR-fusion 3D perception systems will feed back a precise and real-time description of the real world.