
tasks whose input data are divided into LiDAR-only point clouds or LiDAR point clouds fused with other modalities. Summaries are given in Tables 6 and 7.


6.1. 3D semantic segmentation
               6.1.1. LiDAR-only semantic segmentation
PointNet [1] provides a classic prototype of point cloud semantic segmentation architecture, combining shared MLPs with symmetric pooling. On this basis, several dedicated point-wise MLP networks have been proposed to capture more information and local structure for each point. PointNet++ [2] introduces a novel hierarchical architecture that applies PointNet recursively to capture multi-scale local context. Engelmann et al. [92] proposed a feature network with K-means and KNN to learn a better feature representation. In addition, an attention mechanism named group shuffle attention (GSA) [93] is introduced to exploit the relationships among subsets of the point cloud and select a representative subset.
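To make the shared-MLP idea concrete, the sketch below shows the core PointNet pattern: a per-point MLP (implemented as 1×1 convolutions) followed by a symmetric max pooling whose global feature is concatenated back to each point for segmentation. Layer widths and class count are illustrative assumptions, not the exact PointNet configuration.

```python
# Minimal sketch of the PointNet pattern: a shared per-point MLP plus a
# symmetric (order-invariant) max pooling. Sizes are illustrative only.
import torch
import torch.nn as nn

class SharedMLPSegHead(nn.Module):
    def __init__(self, in_dim=3, global_dim=1024, num_classes=13):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across all points
        self.point_mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, global_dim, 1), nn.ReLU(),
        )
        # Per-point classifier over [local feature, global feature]
        self.seg_mlp = nn.Sequential(
            nn.Conv1d(2 * global_dim, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_classes, 1),
        )

    def forward(self, xyz):                 # xyz: (B, 3, N)
        feat = self.point_mlp(xyz)          # per-point features (B, C, N)
        glob = feat.max(dim=2, keepdim=True).values   # symmetric max pool
        glob = glob.expand(-1, -1, xyz.shape[2])      # broadcast to N points
        return self.seg_mlp(torch.cat([feat, glob], dim=1))  # (B, K, N) logits
```

Because max pooling is permutation-invariant, the global feature is identical for any ordering of the input points, which is the key property these point-wise MLP networks rely on.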


Apart from MLP methods, convolutional methods on raw points also achieve state-of-the-art performance, especially after the fully convolutional network (FCN) [94] was introduced to semantic segmentation; it replaces fully connected layers with convolutions and can therefore accept input of any size. Building on ideas from GoogLeNet [95] and taking fisheye camera and LiDAR sensor data as input, Piewak et al. [96] proposed an FCN framework called LiLaNet that assigns point-wise, multi-class semantic labels to semi-dense LiDAR data, using cylindrical projections of point clouds as input. LiLaNet is composed of a sequence of LiLaBlocks, each combining convolutions with various kernel shapes and a 1×1 convolution, so that lessons learned from 2D semantic labeling methods can be transferred to the point cloud domain. More recently, the fully convolutional network 3D-MiniNet [97] extends MiniNet [98] to the 3D LiDAR point cloud domain: it learns 2D representations from raw points, passes them to a 2D fully convolutional network to obtain 2D semantic labels, and recovers 3D semantic labels by re-projecting and enhancing the 2D labels.
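The sketch below illustrates the LiLaBlock pattern described above: parallel 2D convolutions with differently shaped kernels over the projected range image, concatenated and compressed by a 1×1 convolution. The specific kernel shapes and channel counts here are assumptions for illustration, not the exact LiLaNet configuration.

```python
# Hedged sketch of a LiLaBlock-style module: parallel convolutions with
# differently shaped kernels, fused by a 1x1 convolution. Kernel shapes
# and widths are illustrative assumptions.
import torch
import torch.nn as nn

class LiLaBlockSketch(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Tall, square, and wide kernels capture differently shaped context
        self.tall = nn.Conv2d(in_ch, out_ch, (7, 3), padding=(3, 1))
        self.square = nn.Conv2d(in_ch, out_ch, (3, 3), padding=(1, 1))
        self.wide = nn.Conv2d(in_ch, out_ch, (3, 7), padding=(1, 3))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)   # 1x1 fusion conv
        self.act = nn.ReLU()

    def forward(self, x):  # x: (B, C, H, W) cylindrical projection
        branches = [self.act(b(x)) for b in (self.tall, self.square, self.wide)]
        return self.act(self.fuse(torch.cat(branches, dim=1)))
```

The anisotropic kernels suit projected LiDAR data, whose range images are typically much wider than they are tall, while the 1×1 fusion keeps the block fully convolutional so any input resolution is accepted.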


Based on the pioneering FCN framework, the encoder–decoder framework U-Net [99] was proposed to conduct multi-scale and large-size segmentation, and several point cloud semantic segmentation works extend it to 3D space. LU-Net [100] proposes an end-to-end model consisting of a module that extracts high-level features for each point and a U-Net-like image segmentation network that takes projections of these high-level features as input. SceneEncoder [101] presents an encoding module that strengthens the use of global scene information. As shown in Figure 4a, RPVNet [13] exploits the complementary advantages of the point, voxel, and range-map representations of point clouds: after features are extracted by the encoder–decoder of each of the three branches and projected into a point-based representation, a gated fusion module (GFM) is adopted to fuse them.
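A minimal sketch of the gated fusion step, in the spirit of RPVNet's GFM, is shown below: once the three branch features are in per-point form, learned softmax-normalized gates weight each branch before summation. The gating network and dimensions are illustrative assumptions, not the authors' exact design.

```python
# Hedged sketch of a gated fusion module (GFM): per-point features from
# the point, voxel, and range branches are combined with learned,
# softmax-normalized gates. Illustrative assumption, not RPVNet's code.
import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Predict one scalar gate per branch from the concatenated features
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, f_point, f_voxel, f_range):   # each: (N, C)
        stacked = torch.stack([f_point, f_voxel, f_range], dim=1)       # (N, 3, C)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)  # (N, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)             # (N, C)
```

The gates let the network decide, point by point, which representation to trust, e.g., favoring the range branch in dense regions and the voxel branch in sparse ones.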


Because receptive field size is closely tied to network performance, a few works concentrate on expanding the receptive field through dilated (à trous) convolution, which enlarges the field of view while preserving spatial resolution. As an extension of SqueezeSeg [102], the CNN architecture PointSeg [103] also uses SqueezeNet [104] as its backbone, with spherical images generated from point clouds as input. However, instead of the CRF post-processing used in SqueezeSeg [102], PointSeg [103] adapts several image-based semantic segmentation ideas and transfers them to the LiDAR domain. The PointSeg [103] architecture includes three main kinds of layers: a fire layer adapted from SqueezeNet [104], a squeeze reweighting layer, and an enlargement layer in which dilated convolutional layers extend the receptive field. Hua et al. [105] introduced a point-wise convolution for 3D point cloud semantic segmentation that orders the point cloud before feature learning and adopts à trous convolution. Recently, Engelmann et al. [106] proposed dilated point convolutions (DPC) to systematically expand the receptive field; the approach generalizes well and can be applied to most existing CNNs for point clouds.
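The dilation idea carries over from image grids to point neighborhoods: instead of aggregating over the k nearest neighbors, one keeps every d-th neighbor from the k·d nearest, widening the receptive field at the same per-point cost. The NumPy sketch below illustrates this dilated k-NN selection; it is a minimal illustration of the principle, not the authors' DPC implementation.

```python
# Hedged sketch of dilated k-NN, the neighborhood selection behind dilated
# point convolutions (DPC): keep every d-th neighbor from the k*d nearest.
# Minimal NumPy illustration, not the authors' implementation.
import numpy as np

def dilated_knn(points, query, k=8, d=4):
    """Return indices of a dilated neighborhood for one query point.

    points: (N, 3) array of point coordinates
    query:  (3,) coordinates of the query point
    """
    dists = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(dists)[: k * d]   # k*d nearest neighbors, sorted
    return nearest[::d]                    # keep every d-th -> k neighbors

# Example: d=1 recovers a standard k-NN; larger d widens spatial coverage
pts = np.random.rand(1000, 3)
idx = dilated_knn(pts, pts[0], k=8, d=4)
```

Since only the neighbor indices change, this drop-in substitution is what allows DPC to be applied to most existing point cloud CNNs without altering their convolution operators.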