     Wu et al. Intell Robot 2022;2(2):105-29  |  http://dx.doi.org/10.20517/ir.2021.20    Page 121
                Figure 5. Typical frameworks for two categories of LiDAR-based instance segmentation: (a) LiDAR-only and (b) LiDAR-fusion methods.
               modules to dropout noise and improve the accuracy.
               6.2.2. LiDAR-fusion instance segmentation
                Studies on LiDAR-fusion instance segmentation can also be divided into proposal-based and proposal-free
                methods. Among proposal-based methods, 3D-SIS [115]  introduces a two-stage architecture that fuses image
                and RGB-D data, leveraging both geometric and color signals to jointly learn semantic features for instance
                segmentation and detection. 3D-SIS consists of two branches, i.e., a 3D detection branch and a 3D mask
                branch. The backbone of the 3D mask branch takes the projected color and geometry features of each detected
                object, together with the 3D detection results, as input and outputs the final per-voxel mask prediction of
                each instance. For mask prediction,
                3D convolutions that keep the same spatial resolution, and thus preserve spatial correspondence with the raw
                point inputs, are applied. Then, the bounding box predictions generated by the 3D-RPN are used to obtain the
                associated key mask features. The final mask of each instance is predicted by a 3D convolution that reduces
                the dimensionality of the features. PanopticFusion [116]  presents an online large-scale 3D reconstruction architecture that
                fuses RGB images and depth images. The 2D instance segmentation network based on Mask R-CNN takes the
               incoming RGB frame as input and fuses both semantic and instance segmentation results to attain point-wise
               panoptic labels that are integrated into the volumetric map with depth data. As illustrated in Figure 5b, Qi
                et al. [117]  proposed a pioneering object detection framework named Frustum PointNets, which takes point cloud
                and RGB-D fusion data as input. Frustum PointNets contains three modules, namely frustum proposal, 3D instance
                segmentation, and amodal 3D box estimation, in order to bring efficient, mature 2D object detectors into the
                point cloud domain. The frustum point cloud is first extracted from RGB-D data by frustum proposal generation
                and is then fed into set abstraction layers and point feature propagation layers based on PointNet to predict a mask for
               each instance by point-wise binary classification. When it comes to proposal-free methods, 3D-BEVIS [118]
               introduces a framework for 3D semantic and instance segmentation that transfers 2D bird’s eye view (BEV)
                to 3D point space. This framework concentrates on both local point geometry and global context information.
                The 3D instance segmentation network takes a point cloud as input; it consists of a 2D (i.e., RGB and height
                above ground) and a 3D feature network that jointly exploit point-wise instance features, and predicts final instance
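
As an illustration of the frustum proposal and point-wise binary classification steps of Frustum PointNets [117] described above, the following is a minimal sketch rather than the authors' implementation. It assumes a simple pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, points already expressed in the camera frame, and a 2D detection box supplied by some external detector. The helper names `frustum_points` and `binary_instance_mask` are invented for this example, and the per-point scores stand in for the output of a PointNet-style segmentation network.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]          # (x, y, z) in the camera frame
Box2D = Tuple[float, float, float, float]   # (u_min, v_min, u_max, v_max) in pixels


def frustum_points(
    points: Sequence[Point],
    box: Box2D,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Keep the points whose pinhole projection falls inside a 2D box.

    Assumption: x points right, y points down, z points forward, so a
    point with z <= 0 lies behind the camera and is discarded.
    """
    u_min, v_min, u_max, v_max = box
    kept: List[Point] = []
    for x, y, z in points:
        if z <= 0.0:
            continue
        # Pinhole projection of the 3D point onto the image plane.
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


def binary_instance_mask(
    scores: Sequence[float], threshold: float = 0.5
) -> List[bool]:
    """Turn per-point foreground probabilities into a point-wise binary mask.

    The scores are placeholders for what a segmentation network would
    predict for each point inside the frustum.
    """
    return [s > threshold for s in scores]
```

Given a point cloud and one 2D detection box, `frustum_points` returns the subset of points that a segmentation network would then classify as object or background, and `binary_instance_mask` thresholds those per-point predictions into the instance mask. The threshold of 0.5 is an illustrative default, not a value taken from the paper.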





