clouds adaptively to generate a compact yet rich representation via a superpoint graph. RandLA-Net [121] leverages random sampling to downsample large-scale point clouds and a local feature aggregation module to enlarge the receptive field (a minimal sketch of this sampling strategy is given after this list). SCF-Net [122] utilizes a spatial contextual feature (SCF) module for large-scale point cloud segmentation. As for sensor fusion, deep learning approaches tackling the fusion of large-scale and high-resolution data should place more emphasis on point-based and multi-view-based fusion approaches, which are more scalable than voxel-based ones. Overall, the trade-off between performance and computational cost is unavoidable in real-world autonomous driving applications.
• A robust representation of fused data. For deep learning methods, how to pre-process the multi-modal input data is fundamental and important. Although there are several effective representations for point clouds, each of them has both advantages and disadvantages: voxel-based representation solves the ordering problem, but the computational cost grows cubically as the scale of the point cloud or the voxel resolution increases, while the number of points that point-based representation methods can process is limited by the requirement of permutation invariance and by computational capacity. A consensus on a unified, robust, and effective representation for point clouds is therefore needed. For data fusing images and point clouds, the representation depends on the fusion method. (1) Image representation-based methods mainly utilize point clouds projected onto multi-view planes as additional branches of the image; this representation is not directly applicable to 3D tasks because the network outputs results on the image plane. (2) Point representation-based methods leverage features or ROIs extracted from RGB images as additional channels of the point cloud (see the projection sketch after this list); the performance of this representation is limited by the resolution gap between images (relatively high resolution) and point clouds (relatively low resolution). (3) Intermediate data representation methods introduce an intermediate representation (e.g., frustum point clouds and voxelized point clouds). Voxel-based methods are limited at large scales, while frustum-based methods have much potential to generate a unified representation based on the contextual and structural information of RGB images and LiDAR point clouds.
• Scene understanding tasks based on data sequences. The spatiotemporal information implied in temporally continuous sequences of point clouds and images has long been overlooked. Especially for sensor fusion methods, the mismatch of refresh rates between LiDAR and camera causes imperfect time synchronization between the perception system and the surrounding environment. In addition, predictions based on spatiotemporal information can improve the performance of tasks such as 3D object recognition, segmentation, and point cloud completion. Research has started to take temporal context into consideration: RNNs, LSTMs, and derived deep learning models are able to deal with temporal context. Huang et al. [123] proposed a multi-frame 3D object detection framework based on a sparse LSTM. This work predicts 3D objects in the current frame by feeding the features of each frame, together with the hidden and memory features from the last frame, into an LSTM module (see the recurrent sketch after this list). Yuan et al. [124] designed a temporal-channel transformer, whose encoder encodes multi-frame temporal-channel information and whose decoder decodes spatial-channel information for the current frame. TempNet [125] presents a lightweight semantic segmentation framework for large-scale point cloud sequences, which contains two key modules: temporal feature aggregation (TFA) and partial feature update (PFU). TFA aggregates features only on a small portion of key frames with an attentional pooling mechanism, and PFU updates features with information from non-key frames.
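To make the first direction concrete, the sketch below shows plain random downsampling, the inexpensive sampling strategy that RandLA-Net builds on instead of farthest-point sampling; the function name, keep ratio, and synthetic cloud are illustrative assumptions, not the paper's implementation, which additionally aggregates local features to compensate for the lost geometric detail.

```python
import numpy as np

def random_downsample(points, ratio=0.25, rng=None):
    """Randomly keep a fraction of an (N, C) point cloud; O(N) cost per layer."""
    rng = np.random.default_rng() if rng is None else rng
    n_keep = max(1, int(points.shape[0] * ratio))
    idx = rng.choice(points.shape[0], size=n_keep, replace=False)
    return points[idx], idx

# Synthetic cloud of 1,000,000 points with xyz + intensity channels.
cloud = np.random.rand(1_000_000, 4).astype(np.float32)
sub, kept = random_downsample(cloud, ratio=0.25)
print(sub.shape)  # (250000, 4)
```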
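For the second direction, the resolution gap in point representation-based fusion can be illustrated as follows: LiDAR points are projected into the image with a 3x4 projection matrix and the sampled RGB values are appended as extra point channels. The names, the placeholder calibration matrix, and the toy data are assumptions for illustration only, not the interface of any cited method; with real data, only a subset of points lands inside the image and most pixels receive no point, which is exactly the resolution mismatch discussed above.

```python
import numpy as np

def append_rgb_channels(points_xyz, image, P):
    """Project (N, 3) LiDAR points with a 3x4 matrix P and append sampled RGB."""
    n = points_xyz.shape[0]
    homog = np.hstack([points_xyz, np.ones((n, 1), dtype=points_xyz.dtype)])  # (N, 4)
    proj = homog @ P.T                                    # (N, 3) homogeneous pixel coords
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)  # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w, _ = image.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[:, 2] > 0)
    rgb = np.zeros((n, 3), dtype=np.float32)
    rgb[valid] = image[v[valid], u[valid]]                # only points that land in the image
    return np.hstack([points_xyz, rgb]), valid

# Toy data: a sparse cloud, a camera image, and a placeholder calibration matrix.
cloud = (np.random.rand(10_000, 3).astype(np.float32) - 0.5) * 50
img = np.random.rand(375, 1242, 3).astype(np.float32)
P = np.random.rand(3, 4).astype(np.float32)
fused, valid = append_rgb_channels(cloud, img, P)
print(fused.shape, int(valid.sum()))  # (10000, 6) and the number of covered points
```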
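Finally, the recurrent pattern described in the third direction (per-frame features fed to an LSTM together with the hidden and memory features from the previous frame, producing predictions for the current frame) can be sketched as below. This is a hedged PyTorch illustration, not the implementation of Huang et al. [123]; the feature dimensions, output head, and class name are assumptions.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """Carry hidden/memory states across frames and predict per-frame outputs."""
    def __init__(self, feat_dim=256, hidden_dim=256, out_dim=7):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, out_dim)  # e.g., 3D box parameters

    def forward(self, frame_feats):
        """frame_feats: list of (B, feat_dim) tensors, one per frame."""
        b = frame_feats[0].shape[0]
        h = frame_feats[0].new_zeros(b, self.cell.hidden_size)
        c = frame_feats[0].new_zeros(b, self.cell.hidden_size)
        preds = []
        for feat in frame_feats:            # temporal loop over the sequence
            h, c = self.cell(feat, (h, c))  # hidden/memory features carried forward
            preds.append(self.head(h))      # prediction for the current frame
        return preds

feats = [torch.randn(2, 256) for _ in range(4)]  # 4 frames, batch size 2
preds = TemporalHead()(feats)
print(len(preds), preds[-1].shape)               # 4 torch.Size([2, 7])
```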
8. CONCLUSIONS
LiDAR captures point-wise information that is less sensitive to illumination than camera imagery. Moreover, it possesses invariance to scale and rigid transformations, showing a promising future in 3D scene understanding. Focusing on LiDAR-only and LiDAR-fusion 3D perception, this paper first summarizes LiDAR-based datasets as well as evaluation metrics and then presents a contemporary review of four key tasks: 3D classification, 3D object detection, 3D object tracking, and 3D segmentation. This work also points out the existing challenges and possible development directions. We hold the belief that LiDAR-only and LiDAR-fusion 3D perception systems will feed back a precise and real-time description of the real world.