

               images. In this work, the proposed multi-view point cloud (MVPC) representation transforms 2D images
               into 3D points that express a discrete approximation of a ground-truth 3D surface: a sequence of 1-VPCs
               is generated and the predicted MVPC is formed as their union, instead of simply combining projections.
               FuseSeg [3] proposes a LiDAR point cloud segmentation method that fuses RGB and LiDAR data at the
               feature level and develops a network whose encoder can be applied as a feature extractor for various 3D
               perception tasks. Figure 4b shows the details of its network. As an extension of SqueezeSeg [102], FuseSeg
               first establishes correspondences between the two input modalities and warps the features extracted from
               the RGB images; the image and point cloud features are then fused using these correspondences. PMF [109]
               exploits the complementary advantages of appearance information from RGB images and 3D depth informa-
               tion from LiDAR point clouds. Its two-stream network, comprising a camera stream and a LiDAR stream,
               extracts features from the projected point cloud and the RGB image, and the features from the two modalities
               are then fused into the LiDAR stream by a novel residual-based fusion module. Additionally, a perception-aware
               loss further improves the fusion network's performance. Unlike the ideas above, a novel permutohedral lattice
               representation for data fusion is introduced in SParse LATtice Networks (SPLATNet) [110], which directly
               processes a set of points represented as a sparse set of samples in a high-dimensional lattice. To reduce memory
               and computational cost, SPLATNet adopts sparse bilateral convolutional layers as its backbone. This network
               incorporates point-based and image-based representations to deal with multi-modal data fusion and processing.
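
               To make the residual-based fusion idea concrete, the following PyTorch sketch shows one plausible form
               of such a module. It is a minimal sketch under stated assumptions: the layer widths, the gating design, and
               the name ResidualFusion are illustrative choices, not PMF's published implementation; it assumes the
               camera features have already been warped into the LiDAR projection via point-to-pixel correspondences.

               import torch
               import torch.nn as nn

               class ResidualFusion(nn.Module):
                   # Hypothetical residual-based fusion block in the spirit of PMF:
                   # camera-stream features enter the LiDAR stream as a learned,
                   # gated residual, so the LiDAR identity path is preserved.
                   def __init__(self, lidar_ch: int, cam_ch: int):
                       super().__init__()
                       # A 1x1 conv maps the concatenated modalities back to the
                       # LiDAR feature width before fusion.
                       self.reduce = nn.Conv2d(lidar_ch + cam_ch, lidar_ch, kernel_size=1)
                       self.gate = nn.Sequential(
                           nn.Conv2d(lidar_ch, lidar_ch, kernel_size=3, padding=1),
                           nn.Sigmoid(),  # attention-like weights on the residual
                       )

                   def forward(self, lidar_feat, cam_feat):
                       # cam_feat is assumed to be already warped to the LiDAR
                       # range view, so both tensors share the same H x W grid.
                       fused = self.reduce(torch.cat([lidar_feat, cam_feat], dim=1))
                       return lidar_feat + self.gate(fused) * fused

               # Example: fuse 32-channel camera features into a 64-channel LiDAR stream.
               out = ResidualFusion(64, 32)(torch.randn(1, 64, 64, 512),
                                            torch.randn(1, 32, 64, 512))

               Keeping the LiDAR stream as the identity path mirrors the intuition in the text: depth is the primary
               signal, and appearance information enters only as a correction.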


               6.2. 3D Instance segmentation
               Instance segmentation is the most challenging task of scene understanding because it must combine object
               detection and semantic segmentation while focusing on each individual instance within a class.


               6.2.1. LiDAR-only instance segmentation
               One line of work follows a top-down concept (also called the proposal-based method), which first detects
               the bounding box of an instance with object detection methods and then performs semantic segmentation
               within the bounding box. GSPN [111] designs a novel architecture for 3D instance segmentation named
               region-based PointNet (R-PointNet). A generative shape proposal network is integrated into R-PointNet to
               generate 3D object proposals with instance-sensitive features by constructing shapes from the scene, which
               are then converted into 3D bounding boxes. A point ROIAlign module aligns features for the proposals to
               refine them and generate the segmentation. Different from GSPN [111], the single-stage, anchor-free, and
               end-to-end 3D-BoNet [112] directly regresses 3D bounding boxes for all instances with a bounding box
               prediction branch. The backbone network extracts local point features and global features, which are then
               fed into a point mask prediction branch together with the predicted object bounding boxes, as shown in
               Figure 5a.
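
               The top-down recipe (detect a box, then segment only inside it) can be summarized in a few lines. The
               following sketch is illustrative only: points_in_box and top_down_instance_masks are hypothetical helpers,
               and mask_head stands in for a small point-wise network (e.g., a PointNet-style head); none of this is the
               actual GSPN or 3D-BoNet code.

               import torch

               def points_in_box(points: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
                   # Boolean mask of the points inside an axis-aligned 3D box.
                   # points: (N, 3) xyz; box: (6,) = (xmin, ymin, zmin, xmax, ymax, zmax).
                   lo, hi = box[:3], box[3:]
                   return ((points >= lo) & (points <= hi)).all(dim=1)

               def top_down_instance_masks(points, boxes, mask_head):
                   # For each predicted box, crop the interior points and run a
                   # point-wise mask head on the crop only, as proposal-based methods do.
                   instance_masks = []
                   for box in boxes:
                       inside = points_in_box(points, box)
                       scores = mask_head(points[inside])   # per-point foreground score
                       full = torch.zeros(points.shape[0], dtype=torch.bool)
                       full[inside] = scores > 0.5
                       instance_masks.append(full)
                   return instance_masks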


               However, the top-down idea ignores the relation between masks and features and extracts a mask for each
               foreground feature, which is redundant. Bottom-up methods, also named proposal-free methods, may provide
               a solution for these problems: they perform point-wise semantic segmentation first and then distinguish
               different instances. For example, Zhou et al. [113] presented a combined instance segmentation and object
               detection architecture to exploit both detailed and global information about objects. It is a two-stage
               network containing a spatial embedding (SE)-based clustering module and a bounding box refinement module.
               For instance segmentation, semantic information is obtained by an encoder-decoder network, and object
               information is obtained by the SE strategy, which takes the center points of objects as key information.
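
               A minimal sketch of this center-vote grouping, assuming the SE branch has already predicted a per-point
               offset toward its instance center, is given below; the greedy radius grouping is only a stand-in for the
               clustering actually used, and cluster_by_center_votes is a hypothetical name.

               import numpy as np

               def cluster_by_center_votes(points, pred_offsets, radius=0.3):
                   # Each point votes for its instance center (point + offset); votes
                   # landing within `radius` of each other merge into one instance.
                   votes = points + pred_offsets            # (N, 3) shifted toward centers
                   labels = -np.ones(len(votes), dtype=int)
                   next_id = 0
                   for i in range(len(votes)):
                       if labels[i] != -1:
                           continue
                       near = np.linalg.norm(votes - votes[i], axis=1) < radius
                       labels[near & (labels == -1)] = next_id
                       next_id += 1
                   return labels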
               Aside from the above ideas, utilizing conditional random fields (CRFs) as a post-processing step helps
               refine the label map generated by the CNN and further improves segmentation performance. Inspired by
               SqueezeNet [104], SqueezeSeg [102] proposes a pioneering lightweight end-to-end CNN pipeline for 3D
               semantic segmentation of road objects. The network takes a transformed (spherically projected) LiDAR
               point cloud as input, leverages a SqueezeNet-based [104] network to extract features and label points
               semantically, and feeds the results into a CRF that refines them and outputs the final results.
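
               For reference, the spherical (range-image) projection that turns a raw scan into the 2D input of such
               pipelines can be sketched as follows; the 64 x 512 image size and the field-of-view limits are assumptions
               roughly matching a 64-beam sensor, not necessarily SqueezeSeg's exact settings.

               import numpy as np

               def spherical_projection(xyz, h=64, w=512, fov_up=3.0, fov_down=-25.0):
                   # Project an (N, 3) LiDAR scan onto an H x W range image.
                   x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
                   r = np.linalg.norm(xyz, axis=1) + 1e-8
                   yaw = np.arctan2(y, x)        # azimuth in [-pi, pi]
                   pitch = np.arcsin(z / r)      # elevation angle
                   up, down = np.radians(fov_up), np.radians(fov_down)
                   # Normalize both angles to [0, 1], then scale to pixel coordinates.
                   u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * w), 0, w - 1).astype(int)
                   v = np.clip(np.floor((1.0 - (pitch - down) / (up - down)) * h),
                               0, h - 1).astype(int)
                   img = np.zeros((h, w), dtype=np.float32)  # range channel only; real
                   img[v, u] = r                             # inputs stack x, y, z, r, etc.
                   return img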
               As an extension of SqueezeSeg [102], SqueezeSegV2 [114] introduces three novel