tasks, whose input data are divided into pure LiDAR point cloud data and LiDAR point cloud data fused with other modalities. Summaries are given in Tables 6 and 7.
6.1. 3D semantic segmentation
6.1.1. LiDAR-only semantic segmentation
PointNet [1] provides the classic prototype of a point cloud semantic segmentation architecture, built from shared MLPs and symmetric pooling. On this basis, several dedicated point-wise MLP networks have been proposed to capture richer information and local structure for each point. PointNet++ [2] introduces a novel hierarchical architecture that applies PointNet recursively to capture multi-scale local context. Engelmann et al. [92] proposed a feature network with K-means clustering and KNN to learn a better feature representation. In addition, an attention mechanism named group shuffle attention (GSA) [93] is introduced to exploit the relationships among subsets of the point cloud and to select a representative subset.
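To make the shared-MLP-plus-symmetric-pooling recipe concrete, the following minimal PyTorch sketch shows a per-point MLP whose max-pooled global feature is concatenated back to each point before classification; the layer widths and class count are illustrative assumptions, not the original PointNet configuration.

```python
import torch
import torch.nn as nn

class SharedMLPSegHead(nn.Module):
    """Minimal PointNet-style segmentation sketch (illustrative widths).

    A shared MLP (Conv1d with kernel size 1) processes each point
    independently; max pooling over points yields a permutation-
    invariant global feature that is concatenated back to every point.
    """

    def __init__(self, in_dim=3, num_classes=20):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
        )
        self.head = nn.Conv1d(128 + 128, num_classes, 1)

    def forward(self, pts):            # pts: (B, in_dim, N)
        feat = self.point_mlp(pts)     # (B, 128, N) per-point features
        glob = feat.max(dim=2, keepdim=True).values   # symmetric pooling
        glob = glob.expand(-1, -1, feat.shape[2])     # broadcast to every point
        return self.head(torch.cat([feat, glob], dim=1))  # (B, C, N) logits
```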
Apart from MLP methods, convolutional methods on pure points also achieve state-of-the-art performance, especially after the fully convolutional network (FCN) [94] was introduced to semantic segmentation; by replacing the fully connected layers with convolutions, it accepts input of any size. Based on the inception idea of GoogLeNet [95], Piewak et al. [96] proposed an FCN framework called LiLaNet for point-wise, multi-class semantic labeling of semi-dense LiDAR data, taking cylindrical projections of point clouds as input; their overall system draws on both fisheye camera and LiDAR sensor data. LiLaNet is composed of a sequence of LiLaBlocks, each combining convolutions with various kernel shapes and a 1×1 convolution, so that lessons learned from 2D semantic labeling methods transfer to the point cloud domain. Recently, the fully convolutional network 3D-MiniNet [97] extended MiniNet [98] to the 3D LiDAR point cloud domain: it learns 2D representations from raw points, passes them to a 2D fully convolutional network to obtain 2D semantic labels, and recovers 3D semantic labels by re-projecting and refining the 2D labels.
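The LiLaBlock pattern can be sketched as an inception-style module: parallel convolutions with differently shaped kernels, concatenated and compressed by a 1×1 convolution. The kernel shapes and channel widths below are illustrative assumptions; see [96] for the exact configuration.

```python
import torch
import torch.nn as nn

class LiLaStyleBlock(nn.Module):
    """Inception-style block in the spirit of LiLaNet's LiLaBlock.

    Parallel convolutions with differently shaped kernels (tall, wide,
    square) see different contexts on the cylindrical range image; their
    outputs are concatenated and compressed by a 1x1 convolution.
    Kernel shapes and widths here are illustrative; see [96] for details.
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.tall = nn.Conv2d(in_ch, out_ch, kernel_size=(7, 3), padding=(3, 1))
        self.wide = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 7), padding=(1, 3))
        self.square = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 3), padding=(1, 1))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):  # x: (B, in_ch, H, W) cylindrical projection
        branches = [self.act(b(x)) for b in (self.tall, self.wide, self.square)]
        return self.act(self.fuse(torch.cat(branches, dim=1)))
```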
Based on the pioneering FCN framework, the encoder–decoder framework U-Net [99] was proposed to conduct multi-scale segmentation of large inputs, and several point cloud-based semantic segmentation works extend it to 3D space. LU-Net [100] proposes an end-to-end model consisting of a module that extracts high-level features for each point and an image segmentation network similar to U-Net that takes the projections of these high-level features as input. SceneEncoder [101] presents an encoding module that strengthens the exploitation of global scene information. As shown in Figure 4a, RPVNet [13] exploits the complementary advantages of the point, voxel, and range-map representations of point clouds: after features are extracted by the encoder–decoders of the three branches and projected into the point-based representation, a gated fusion module (GFM) fuses them.
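A minimal sketch of such a gated fusion is given below, assuming the three branch features have already been projected into the per-point representation. Predicting the gates with a single linear layer followed by a softmax over branches is an illustrative choice, not necessarily the exact design of [13].

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Hedged sketch of a gated fusion over three per-point feature maps.

    Assumes point-, voxel-, and range-branch features have already been
    projected back to the point representation (shape (N, C) each).
    Gates are predicted from the concatenated features and normalized
    with a softmax over the three branches (an illustrative choice).
    """

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Linear(3 * channels, 3)

    def forward(self, f_point, f_voxel, f_range):  # each: (N, C)
        stacked = torch.stack([f_point, f_voxel, f_range], dim=1)  # (N, 3, C)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=1)  # (N, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (N, C) fused
```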
Because the receptive field size is closely tied to network performance, a few works concentrate on expanding the receptive field through dilated (atrous) convolution, which enlarges the field of view while preserving spatial resolution. As an extension of SqueezeSeg [102], the CNN architecture PointSeg [103] also uses SqueezeNet [104] as its backbone network, with spherical images generated from point clouds as input. However, instead of the CRF post-processing used in SqueezeSeg [102], PointSeg [103] takes several image-based semantic segmentation networks into consideration and transfers them to the LiDAR domain. The PointSeg [103] architecture includes three main kinds of layers: the fire layer adapted from SqueezeNet [104], a squeeze reweighting layer, and an enlargement layer in which dilated convolutional layers extend the receptive field.
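Both SqueezeSeg [102] and PointSeg [103] operate on such spherical range images. A minimal NumPy sketch of this projection is shown below; the grid resolution and vertical field of view are illustrative assumptions and must be tuned to the sensor.

```python
import numpy as np

def spherical_projection(points, H=64, W=512, fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) LiDAR point cloud onto an H x W spherical image.

    Azimuth and elevation angles index the image grid; each pixel stores
    the range of a point that lands on it (collisions keep one point per
    pixel). The vertical field of view (degrees) and grid size are
    illustrative; tune them to the sensor.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)      # elevation angle

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = (0.5 * (1.0 - yaw / np.pi) * W).astype(np.int32)                     # column
    v = ((fov_up_r - pitch) / (fov_up_r - fov_down_r) * H).astype(np.int32)  # row
    u, v = np.clip(u, 0, W - 1), np.clip(v, 0, H - 1)

    image = np.zeros((H, W), dtype=np.float32)
    image[v, u] = r
    return image
```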
Hua et al. [105] introduced a point-wise convolution for 3D point cloud semantic segmentation, which orders the point cloud before feature learning and adopts atrous convolution. Recently, Engelmann et al. [106] proposed dilated point convolutions (DPC) to systematically expand the receptive field; DPC generalizes well enough to be dropped into most existing CNNs for point clouds.
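The core of DPC can be sketched as a dilated k-nearest-neighbor query: instead of the k nearest neighbors, each point gathers its k·d nearest neighbors and keeps every d-th of them, enlarging the receptive field at no extra parameter cost. The brute-force distance computation below is for clarity only.

```python
import numpy as np

def dilated_knn(points, k=16, d=4):
    """Dilated k-NN in the spirit of DPC [106].

    Query the k*d nearest neighbors of each point, then keep every d-th
    one, which enlarges the receptive field without extra parameters.
    Note that each point appears as its own nearest neighbor, and the
    O(N^2) brute-force distances are for clarity; real pipelines use
    spatial indexing.

    points: (N, 3) array. Returns (N, k) neighbor indices.
    """
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 3) offsets
    dist = np.einsum('ijk,ijk->ij', diff, diff)      # squared distances
    order = np.argsort(dist, axis=1)[:, :k * d]      # k*d nearest per point
    return order[:, ::d]                             # keep every d-th neighbor
```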