5.1. LiDAR-only tracking
As the temporal extension of detection, tracking exploits appearance similarity and motion trajectories to achieve more precise and robust performance. Tracking-by-detection is an intuitive approach. For example, Vaquero et al. [72] fused vehicle information segmented from dual-view detectors (i.e., a front view and a bird's eye view) and then applied an extended Kalman filter, Mahalanobis-distance gating, and a motion update module to perform 3D tracking. Furthermore, Shi et al. [73] performed 3D tracking and domain adaptation based on a variant of the PV-RCNN 3D detection framework, which incorporates temporal information and classifies with RoI-wise features, among other components. In addition, detection results can be enhanced by extra target templates. As a typical example, P2B [74] first matches proposals against augmented target-specific features and then regresses target-wise centers to generate high-quality detection results for tracking. Following CenterTrack [75], CenterPoint [76] develops an object-center tracking network that combines velocity estimation with point-based detection, which views objects as points, achieving more accurate and faster performance.
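To make the association step concrete, the following is a minimal NumPy sketch (an illustration, not the code of any cited work) of a tracking-by-detection building block: a linear constant-velocity Kalman filter, a simplification of the extended Kalman filter mentioned above, predicts each track forward, and new 3D detections are gated by Mahalanobis distance before being used to update the track. The time step and noise covariances are assumed values.

```python
import numpy as np

class KalmanTrack3D:
    """State [x, y, z, vx, vy, vz]; measurement: a detected 3D center [x, y, z]."""
    def __init__(self, center, dt=0.1):
        self.x = np.hstack([center, np.zeros(3)])            # initial state
        self.P = np.eye(6)                                   # state covariance
        self.F = np.eye(6); self.F[:3, 3:] = dt * np.eye(3)  # constant-velocity motion
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])    # observe position only
        self.Q = 0.01 * np.eye(6)                            # process noise (assumed)
        self.R = 0.10 * np.eye(3)                            # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

    def mahalanobis(self, z):
        """Gating distance between a detection z and the predicted measurement."""
        S = self.H @ self.P @ self.H.T + self.R              # innovation covariance
        y = z - self.H @ self.x
        return float(np.sqrt(y @ np.linalg.solve(S, y)))

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)             # Kalman gain
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
```

In a full pipeline, a track accepts only detections whose Mahalanobis distance falls below a chi-square gate (about 2.8 for three measurement dimensions at the 95% level), after which the motion update refines the state.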
In image-based tracking, Siamese networks reduce data redundancy and speed up the task by converting tracking into patch matching, an idea that can be extended to LiDAR-based tracking. Inspired by SAMF [77], Mueller et al. [78] designed a correlation filter-based tracker (i.e., SAMF_CA) that incorporates global context in an explicit way. Experiments show that the improved optimization achieves better performance in single-target tracking. The work of Zarzar et al. [79] shows that Siamese network-based tracking with LiDAR-only data performs well in aerial navigation. Holding that appearance information alone is insufficient for tracking, Giancola et al. [80] encoded the model shape and candidate shape into a latent representation with a Siamese tracker. Zarzar et al. [81] generated efficient proposals with a Siamese network from the BEV representation of point clouds and then tracked 3D objects according to RoI-wise appearance information regularized by a second Siamese framework. PSN [82] first extracts features through a shared PointNet-like backbone and then applies feature augmentation and an attention mechanism in two separate branches to generate a similarity map for patch matching. Recently, MLVSNet [83] proposed conducting Hough voting on multi-level features of the target and search area, instead of only on the final features, to overcome insufficient target detection in sparse point clouds. Moreover, the ground-truth bounding box in the first frame can be regarded as a strong cue that enables better feature comparison [84], as shown in Figure 3a.
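The shared-encoder matching idea common to these trackers can be sketched as follows. This is a generic PyTorch illustration, not the architecture of any cited method; the layer sizes are assumptions. A PointNet-like encoder with shared weights embeds the target template and each candidate patch, and cosine similarity scores how well a candidate matches the template.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Shared per-point MLP followed by max pooling (PointNet-style)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, feat_dim, 1), nn.ReLU(),
        )

    def forward(self, pts):                       # pts: (B, N, 3)
        f = self.mlp(pts.transpose(1, 2))         # per-point features (B, D, N)
        return f.max(dim=2).values                # global feature (B, D)

class SiameseMatcher(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = PointEncoder()             # weights shared by both branches

    def forward(self, template, candidates):
        # template: (1, N, 3); candidates: (B, M, 3) -> one score per candidate
        t = F.normalize(self.encoder(template), dim=1)    # (1, D)
        c = F.normalize(self.encoder(candidates), dim=1)  # (B, D)
        return (c @ t.t()).squeeze(1)                     # cosine scores (B,)

# Usage: the candidate patch with the highest score is taken as the target.
```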
5.2. LiDAR-fusion tracking
Sensors capture data from various views, which helps supplement the information available to trackers. A challenge of tracking-by-detection is how to match detection results with context information. The simplest way is to fuse the tracking results at the end of the pipeline, as done by Manghat et al. [85]. In addition, Frossard et al. [86] produced precise 3D trajectories for diverse objects based on detection proposals and linear optimization. Introducing 2D visual information, Complexer-YOLO [87] first performs joint 3D object detection on voxelized semantic point clouds (fused with image-based semantic information) and then extends the model to multi-target tracking through a multi-Bernoulli filter. This work demonstrates the role of scale-rotation-translation scoring, which enables the framework to track in real time.
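As a toy illustration of such end-fusion (the function name and the 2 m gate below are assumptions, not Manghat et al.'s method), per-sensor track centers can be matched with the Hungarian algorithm and merged by averaging:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fuse_tracks(centers_a, centers_b, gate=2.0):
    """centers_a: (Na, 3) and centers_b: (Nb, 3) 3D track centers from two sensors."""
    # Pairwise Euclidean distances form the assignment cost matrix.
    cost = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    fused = []
    unmatched_a, unmatched_b = set(range(len(centers_a))), set(range(len(centers_b)))
    for i, j in zip(rows, cols):
        if cost[i, j] < gate:                     # reject implausible pairings
            fused.append(0.5 * (centers_a[i] + centers_b[j]))
            unmatched_a.discard(i); unmatched_b.discard(j)
    return fused, unmatched_a, unmatched_b
```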
However, data sampled by different sensors vary in frequency and dimension, and thus it is challenging and not cost-effective to match similarity across diverse data sources. Recent years have witnessed the emergence of ingenious algorithms, but Siamese network-based fusion tracking is still in its infancy. Developed for single object tracking, F-Siamese Tracker [88] extrudes a 2D region of interest from a Siamese network to generate several valid 3D proposals, which are fed into another Siamese network together with a LiDAR template. Although these studies achieve a lot, there is still a long way to go toward further integrating point clouds and other sensor data (e.g., RGB images) into Siamese networks for LiDAR-fusion tracking. The pipeline of F-Siamese Tracker is shown in Figure 3b.
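The geometric core of the extrusion step above can be sketched in a few lines. This is a generic frustum crop, not the F-Siamese Tracker implementation, and it assumes the point cloud is already expressed in camera coordinates with K as the camera intrinsic matrix:

```python
import numpy as np

def crop_frustum(points_cam, box2d, K):
    """points_cam: (N, 3) points in camera coordinates (z forward);
    box2d: (u_min, v_min, u_max, v_max) in pixels; K: (3, 3) intrinsics."""
    pts = points_cam[points_cam[:, 2] > 0]        # keep points in front of the camera
    uvw = pts @ K.T                               # project onto the image plane
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    u_min, v_min, u_max, v_max = box2d
    mask = (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return pts[mask]                              # points inside the viewing frustum
```

Each cropped point set then serves as the search region from which 3D proposals are generated and scored against the LiDAR template.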