

5.1. LiDAR-only tracking
As the temporal extension of detection, tracking exploits appearance similarity and motion trajectories to achieve more accurate and consistent results than detection alone. Tracking-by-detection is an intuitive method. For example, Vaquero et al. [72] fused vehicle information segmented from dual-view detectors (i.e., a front view and a bird’s eye view) and then utilized an extended Kalman filter, the Mahalanobis distance, and a motion update module to perform 3D tracking. Furthermore, Shi et al. [73] performed 3D tracking and domain adaptation based on a variant of the 3D detection framework PV-RCNN, which incorporates temporal information and performs classification with RoI-wise features, among other refinements. In addition, detection results can be enhanced by extra target templates. As a typical example, P2B [74] first matches proposals against augmented target-specific features and then regresses target-wise centers to generate high-quality detection results for tracking. Following CenterTrack [75], CenterPoint [76] develops an object-center tracking network that combines velocity estimation with point-based detection treating objects as points, achieving more accurate and faster performance.
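
To make the tracking-by-detection recipe concrete, the following sketch implements its two core ingredients as described above: motion prediction with a Kalman filter and data association gated by the Mahalanobis distance. It is a minimal illustration rather than Vaquero et al.'s implementation [72] (which uses an extended Kalman filter over richer state); the class names, noise settings, and gating threshold are all illustrative assumptions.

```python
import numpy as np

class KalmanTrack:
    """Constant-velocity Kalman filter over a BEV centroid (x, y).

    State: [x, y, vx, vy]; measurement: [x, y].
    """

    def __init__(self, xy, dt=0.1):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0                    # state covariance
        self.F = np.eye(4)                           # constant-velocity motion model
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(2, 4)                        # we observe position only
        self.Q = np.eye(4) * 0.01                    # process noise (assumed)
        self.R = np.eye(2) * 0.1                     # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

    def mahalanobis(self, z):
        """Gating distance between detection z and the predicted state."""
        y = z - self.H @ self.x                      # innovation
        S = self.H @ self.P @ self.H.T + self.R      # innovation covariance
        return float(np.sqrt(y @ np.linalg.solve(S, y)))

    def update(self, z):
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def associate(tracks, detections, gate=3.0):
    """Greedy nearest-neighbour association under a Mahalanobis gate."""
    for trk in tracks:
        trk.predict()
    for z in detections:
        dists = [trk.mahalanobis(z) for trk in tracks]
        if dists and min(dists) < gate:
            tracks[int(np.argmin(dists))].update(z)  # matched: correct the track
        else:
            tracks.append(KalmanTrack(z))            # unmatched: spawn a new track
    return tracks
```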


As for image-based tracking, the Siamese network reduces data redundancy and speeds up the task by converting tracking into patch matching, an idea that can be extended to LiDAR-based tracking. Inspired by SAMF [77], Mueller et al. [78] designed a correlation filter-based tracker (i.e., SAMF_CA) that incorporates global context in an explicit way. Experiments show that the improved optimization solution achieves better performance in the single-target tracking domain. The work of Zarzar et al. [79] shows that Siamese network-based tracking with LiDAR-only data performs well in aerial navigation. Holding that appearance information alone is insufficient for tracking, Giancola et al. [80] encoded the model shape and candidate shape into latent representations with a Siamese tracker. Zarzar et al. [81] generated efficient proposals with a Siamese network from the BEV representation of point clouds and then tracked 3D objects using RoI-wise appearance information regularized by a second Siamese framework. PSN [82] first extracts features through a shared PointNet-like framework and then conducts feature augmentation and an attention mechanism in two separate branches to generate a similarity map for matching the patches. Recently, MLVSNet [83] proposed conducting Hough voting on multi-level features of the target and the search area, instead of only on the final features, to overcome insufficient target detection in sparse point clouds. Moreover, the ground-truth bounding box in the first frame can be regarded as a strong cue, enabling a better feature comparison [84], as shown in Figure 3a.
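
The shared pattern behind these Siamese trackers fits in a few lines: a weight-shared encoder maps the target template and every candidate patch to a global feature, and the best-matching candidate is taken as the tracked object. Below is a hedged sketch of that pattern with a PointNet-like encoder and cosine similarity; it does not reproduce the specific architectures of [80-83], and all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """PointNet-like encoder: per-point MLP followed by max-pooling."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, pts):                     # pts: (B, 3, N)
        return self.mlp(pts).max(dim=2).values  # (B, feat_dim) global feature

class SiameseMatcher(nn.Module):
    """Scores candidate patches against a target template by cosine similarity."""

    def __init__(self):
        super().__init__()
        self.encoder = PointEncoder()           # weights shared by both branches

    def forward(self, template, candidates):
        # template: (1, 3, Nt); candidates: (K, 3, Nc) cropped from the search area
        t = F.normalize(self.encoder(template), dim=1)    # (1, D)
        c = F.normalize(self.encoder(candidates), dim=1)  # (K, D)
        return (c @ t.T).squeeze(1)             # (K,) scores; argmax = tracked patch
```

In a full tracker, the candidate patches would be sampled around the previous frame's box, and the selected patch would be refined by a regression head, as P2B [74] does with its target-wise center regression.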


5.2. LiDAR-fusion tracking
Sensors capture data from various views, which helps supplement the insufficient information available to trackers. A challenge of tracking-by-detection is how to match the detection results with the context information. The simplest way is to conduct an end-fusion of the tracking results, as done by Manghat et al. [85]. In addition, Frossard et al. [86] produced precise 3D trajectories for diverse objects based on detection proposals and linear optimization. Introducing 2D visual information, Complexer-YOLO [87] first performs joint 3D object detection on voxelized semantic point clouds (fused with image-based semantic information) and then extends the model to multi-target tracking through a multi-Bernoulli filter. This work also demonstrates the role of its scale–rotation–translation score, which enables the framework to track in real time.
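
As one concrete picture of what an end-fusion can look like, the sketch below matches two per-sensor track lists in a common world frame with Hungarian assignment and averages the matched centroids. This is an illustrative late-fusion baseline, not Manghat et al.'s method [85]; the function name, distance gate, and averaging rule are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_tracks(lidar_xy, camera_xy, max_dist=2.0):
    """End-fusion sketch: merge per-sensor track centroids in a shared frame.

    lidar_xy: (N, 2) and camera_xy: (M, 2) BEV centroids, already transformed
    into a common world coordinate system.
    """
    cost = np.linalg.norm(lidar_xy[:, None, :] - camera_xy[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)          # Hungarian assignment
    fused, used_l, used_c = [], set(), set()
    for i, j in zip(rows, cols):
        if cost[i, j] <= max_dist:                    # gate implausible pairs
            fused.append((lidar_xy[i] + camera_xy[j]) / 2.0)  # naive average
            used_l.add(i)
            used_c.add(j)
    # keep unmatched tracks from either sensor so no object is dropped
    fused += [lidar_xy[i] for i in range(len(lidar_xy)) if i not in used_l]
    fused += [camera_xy[j] for j in range(len(camera_xy)) if j not in used_c]
    return np.stack(fused)
```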


However, data sampled by different sensors vary in frequency and dimension, and thus it is challenging and not cost-effective to match similarity across diverse data sources. Recent years have witnessed the emergence of ingenious algorithms, while tracking based on a Siamese network is still in its infancy. Developed for single object tracking, F-Siamese Tracker [88] extrudes a 2D region of interest produced by a Siamese network in order to generate several valid 3D proposals, which are fed into another Siamese network together with a LiDAR template. Although these studies achieve a lot, there is still a long way to go to further integrate point clouds and other sensor data (e.g., RGB images) into the Siamese network for LiDAR-fusion tracking. The pipeline of F-Siamese Tracker is illustrated in Figure 3b.
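
The extrusion step itself is simple geometry: each 2D RoI from the image branch defines a viewing frustum, and the LiDAR points that project into it form a 3D proposal. The sketch below shows a minimal version of that step under assumed interfaces (the function name and arguments are illustrative; F-Siamese Tracker's actual proposal generation and 3D scoring are more elaborate).

```python
import numpy as np

def frustum_from_roi(points, roi, K):
    """Keep LiDAR points whose image projection falls inside a 2D RoI.

    points: (N, 3) points in the camera frame; roi: (u_min, v_min, u_max, v_max)
    in pixels; K: 3x3 camera intrinsic matrix. Returns one frustum proposal.
    """
    pts = points[points[:, 2] > 0]                # keep points in front of camera
    uv = (K @ pts.T).T                            # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective divide
    u_min, v_min, u_max, v_max = roi
    inside = ((uv[:, 0] >= u_min) & (uv[:, 0] <= u_max) &
              (uv[:, 1] >= v_min) & (uv[:, 1] <= v_max))
    return pts[inside]

# Each frustum proposal would then be scored against the LiDAR template by a
# 3D Siamese network, e.g., a matcher like the one sketched in Section 5.1.
```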