

               images. In this work, the proposed multi-view point cloud (MVPC) representation transforms 2D images
               into 3D points that express a discrete approximation of a ground-truth 3D surface: a sequence of 1-VPCs
               is generated and the predicted MVPC is formed as their union, instead of simply combining projections.
               FuseSeg [3] proposes a LiDAR point cloud segmentation method that fuses RGB and LiDAR data at the
               feature level and develops a network whose encoder can be applied as a feature extractor for various 3D
               perception tasks. Figure 4b shows the details of its network. As an extension of SqueezeSeg [102], FuseSeg
               first establishes correspondences between the two input modalities and warps the features extracted from
               the RGB images; the image and point cloud features are then fused using these correspondences. PMF [109]
               exploits the complementary advantages of appearance information from RGB images and 3D depth informa-
               tion from LiDAR point clouds. Its two-stream network, comprising a camera stream and a LiDAR stream,
               extracts features from the projected point cloud and the RGB image, and the features from the two modalities
               are then fused into the LiDAR stream by a novel residual-based fusion module. Additionally, a perception-aware
               loss further improves the fusion network's performance. Unlike the ideas above, a novel permutohedral lattice
               representation for data fusion is introduced in SParse LATtice Networks (SPLATNet) [110], which directly
               processes a set of points represented as a sparse set of samples in a high-dimensional lattice. To reduce memory
               and computational cost, SPLATNet adopts sparse bilateral convolutional layers as its backbone. This network
               incorporates point-based and image-based representations to deal with multi-modal data fusion and processing.
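
               To make the residual-based fusion idea concrete, the following PyTorch sketch shows one plausible form
               of such a module. It is a minimal sketch under stated assumptions: the layer widths, the gating design, and
               the name ResidualFusion are illustrative choices, not PMF's published implementation; it assumes the
               camera features have already been warped into the LiDAR projection via point-to-pixel correspondences.

               import torch
               import torch.nn as nn

               class ResidualFusion(nn.Module):
                   # Hypothetical residual-based fusion block in the spirit of PMF:
                   # camera-stream features enter the LiDAR stream as a learned,
                   # gated residual, so the LiDAR identity path is preserved.
                   def __init__(self, lidar_ch: int, cam_ch: int):
                       super().__init__()
                       # A 1x1 conv maps the concatenated modalities back to the
                       # LiDAR feature width before fusion.
                       self.reduce = nn.Conv2d(lidar_ch + cam_ch, lidar_ch, kernel_size=1)
                       self.gate = nn.Sequential(
                           nn.Conv2d(lidar_ch, lidar_ch, kernel_size=3, padding=1),
                           nn.Sigmoid(),  # attention-like weights on the residual
                       )

                   def forward(self, lidar_feat, cam_feat):
                       # cam_feat is assumed to be already warped to the LiDAR
                       # range view, so both tensors share the same H x W grid.
                       fused = self.reduce(torch.cat([lidar_feat, cam_feat], dim=1))
                       return lidar_feat + self.gate(fused) * fused

               # Example: fuse 32-channel camera features into a 64-channel LiDAR stream.
               out = ResidualFusion(64, 32)(torch.randn(1, 64, 64, 512),
                                            torch.randn(1, 32, 64, 512))

               Keeping the LiDAR stream as the identity path mirrors the intuition in the text: depth is the primary
               signal, and appearance information enters only as a correction.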


               6.2. 3D Instance segmentation
               Instance segmentation is the most challenging task of scene understanding because it must combine object
               detection and semantic segmentation while focusing on each individual instance within a class.


               6.2.1. LiDAR-only instance segmentation
               One line of work follows a top-down concept (also called the proposal-based method), which first detects
               the bounding box of an instance with object detection methods and then performs semantic segmentation
               within the bounding box. GSPN [111] designs a novel architecture for 3D instance segmentation named
               region-based PointNet (R-PointNet). A generative shape proposal network is integrated into R-PointNet to
               generate 3D object proposals with instance-sensitive features by constructing shapes from the scene, which
               are then converted into 3D bounding boxes. A point ROIAlign module aligns features for the proposals to
               refine them and generate the segmentation. Different from GSPN [111], the single-stage, anchor-free, and
               end-to-end 3D-BoNet [112] directly regresses 3D bounding boxes for all instances with a bounding box
               prediction branch. The backbone network extracts local point features and global features, which are then
               fed into a point mask prediction branch together with the predicted object bounding boxes, as shown in
               Figure 5a.
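
               The top-down recipe (detect a box, then segment only inside it) can be summarized in a few lines. The
               following sketch is illustrative only: points_in_box and top_down_instance_masks are hypothetical helpers,
               and mask_head stands in for a small point-wise network (e.g., a PointNet-style head); none of this is the
               actual GSPN or 3D-BoNet code.

               import torch

               def points_in_box(points: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
                   # Boolean mask of the points inside an axis-aligned 3D box.
                   # points: (N, 3) xyz; box: (6,) = (xmin, ymin, zmin, xmax, ymax, zmax).
                   lo, hi = box[:3], box[3:]
                   return ((points >= lo) & (points <= hi)).all(dim=1)

               def top_down_instance_masks(points, boxes, mask_head):
                   # For each predicted box, crop the interior points and run a
                   # point-wise mask head on the crop only, as proposal-based methods do.
                   instance_masks = []
                   for box in boxes:
                       inside = points_in_box(points, box)
                       scores = mask_head(points[inside])   # per-point foreground score
                       full = torch.zeros(points.shape[0], dtype=torch.bool)
                       full[inside] = scores > 0.5
                       instance_masks.append(full)
                   return instance_masks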


               However, the top-down idea ignores the relation between masks and features and extracts a mask for each
               foreground feature, which is redundant. Bottom-up methods, also named proposal-free methods, may provide
               a solution for these problems: they perform point-wise semantic segmentation first and then distinguish
               different instances. For example, Zhou et al. [113] presented a combined instance segmentation and object
               detection architecture to exploit both detailed and global information about objects. It is a two-stage
               network containing a spatial embedding (SE)-based clustering module and a bounding box refinement module.
               For instance segmentation, semantic information is obtained by an encoder-decoder network, and object
               information is obtained by the SE strategy, which takes the center points of objects as key information.
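
               A minimal sketch of this center-vote grouping, assuming the SE branch has already predicted a per-point
               offset toward its instance center, is given below; the greedy radius grouping is only a stand-in for the
               clustering actually used, and cluster_by_center_votes is a hypothetical name.

               import numpy as np

               def cluster_by_center_votes(points, pred_offsets, radius=0.3):
                   # Each point votes for its instance center (point + offset); votes
                   # landing within `radius` of each other merge into one instance.
                   votes = points + pred_offsets            # (N, 3) shifted toward centers
                   labels = -np.ones(len(votes), dtype=int)
                   next_id = 0
                   for i in range(len(votes)):
                       if labels[i] != -1:
                           continue
                       near = np.linalg.norm(votes - votes[i], axis=1) < radius
                       labels[near & (labels == -1)] = next_id
                       next_id += 1
                   return labels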
               Aside from the above ideas, utilizing conditional random fields (CRFs) as a post-processing step helps
               refine the label map generated by the CNN and further improves segmentation performance. Inspired by
               SqueezeNet [104], SqueezeSeg [102] proposes a pioneering lightweight end-to-end CNN pipeline for 3D
               semantic segmentation of road objects. The network takes a transformed (spherically projected) LiDAR
               point cloud as input, leverages a SqueezeNet-based [104] network to extract features and label points
               semantically, and feeds the results into a CRF that refines them and outputs the final results.
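
               For reference, the spherical (range-image) projection that turns a raw scan into the 2D input of such
               pipelines can be sketched as follows; the 64 x 512 image size and the field-of-view limits are assumptions
               roughly matching a 64-beam sensor, not necessarily SqueezeSeg's exact settings.

               import numpy as np

               def spherical_projection(xyz, h=64, w=512, fov_up=3.0, fov_down=-25.0):
                   # Project an (N, 3) LiDAR scan onto an H x W range image.
                   x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
                   r = np.linalg.norm(xyz, axis=1) + 1e-8
                   yaw = np.arctan2(y, x)        # azimuth in [-pi, pi]
                   pitch = np.arcsin(z / r)      # elevation angle
                   up, down = np.radians(fov_up), np.radians(fov_down)
                   # Normalize both angles to [0, 1], then scale to pixel coordinates.
                   u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * w), 0, w - 1).astype(int)
                   v = np.clip(np.floor((1.0 - (pitch - down) / (up - down)) * h),
                               0, h - 1).astype(int)
                   img = np.zeros((h, w), dtype=np.float32)  # range channel only; real
                   img[v, u] = r                             # inputs stack x, y, z, r, etc.
                   return img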
               As an extension of SqueezeSeg [102], SqueezeSegV2 [114] introduces three novel