reducing the computational cost and power consumption compared with ORB-SLAM2. Similarly, ref. [23] improved the execution efficiency of the VSLAM front-end on a high-performance embedded platform by reasonably allocating computing resources between the CPU and GPU and reordering the modules of the original feature detection algorithm. Nagy et al. proposed feature detection based on a lookup table and non-maximum suppression based on cell grids, which further improved the front-end parallel scheme and achieved a new breakthrough in execution speed [24]. However, current parallel acceleration schemes for feature detection are limited to separate tasks such as corner detection and non-maximum suppression; as a result, data must be copied between device memory and the host multiple times, which greatly degrades the overall acceleration effect. For the optimization module, which involves solving large-scale matrix systems, traditional sequential algorithms become increasingly inadequate as the number of variables grows with runtime in a long-term VSLAM system. The bundle adjustment (BA) scheme, an improvement based on Levenberg-Marquardt (LM), optimizes the access strategy for the Hessian, Schur and Jacobian matrices, with an overall speedup of nearly 30 times. Zheng et al. used the preconditioned conjugate gradient (PCG) method and the inexact Newton method to solve the BA problem [25], which improved the computational efficiency by 20 times and reduced memory consumption compared to [26]; however, the method suffered from low stability. Cao et al. applied parallel bundle adjustment (PBA) to structure from motion (SfM), using the GPU only for high-dimensional matrix multiplication, and did not study the parallelization of BA in depth [27]. At present, there is little research on the parallel acceleration of VSLAM back-end optimization, and existing works suffer from low accuracy, incomplete functionality, and poor robustness [22].
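To make the transfer-overhead argument concrete, the following is a minimal sketch (not any of the cited implementations) of chaining two detection stages on the GPU so that the intermediate result never returns to the host. The kernel names detectCorners and suppressNonMaxima and their bodies are placeholders for a real FAST test and grid-based suppression:

#include <cuda_runtime.h>

__global__ void detectCorners(const unsigned char* img, int n, int* scores) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scores[i] = img[i] > 128 ? img[i] : 0;   // placeholder for the FAST segment test
}

__global__ void suppressNonMaxima(const int* scores, int n, int* keep) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) keep[i] = (scores[i] > 0) ? 1 : 0;       // placeholder for grid-cell suppression
}

void detectOnDevice(const unsigned char* h_img, int n, int* h_keep) {
    unsigned char* d_img; int* d_scores; int* d_keep;
    cudaMalloc(&d_img, n);
    cudaMalloc(&d_scores, n * sizeof(int));
    cudaMalloc(&d_keep, n * sizeof(int));
    cudaMemcpy(d_img, h_img, n, cudaMemcpyHostToDevice);                 // single upload
    int threads = 256, blocks = (n + threads - 1) / threads;
    detectCorners<<<blocks, threads>>>(d_img, n, d_scores);
    suppressNonMaxima<<<blocks, threads>>>(d_scores, n, d_keep);         // intermediate stays in device memory
    cudaMemcpy(h_keep, d_keep, n * sizeof(int), cudaMemcpyDeviceToHost); // single download
    cudaFree(d_img); cudaFree(d_scores); cudaFree(d_keep);
}

If the score buffer were copied back to the host between the two stages, the extra transfers and synchronization would dominate the kernel time for typical image sizes, which is exactly the degradation described above.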
Works on the complete (front-end and back-end) parallel acceleration of SLAM methods are very rare [28]. Lu et al. [26] realized a parallelization scheme for the VINS-Mono algorithm [29] by rewriting the optical flow tracking, nonlinear least-squares optimization, and marginalization routines, but the resulting speedup was not significant. To enhance the operating speed of the SLAM system, this paper proposes a relatively complete parallelization scheme for VSLAM based on heterogeneous computing. For the time-consuming feature extraction and matching at the front end, a full-flow acceleration algorithm is designed to perform the entire process from image input to feature matching on the GPU. It mainly includes Gaussian pyramid generation with arbitrary scales, FAST corner extraction based on a lookup table, non-maximum suppression based on grid cells, calculation of feature descriptors, feature matching based on Hamming distance, and Random Sample Consensus (RANSAC)-based false-match filtering. For the back-end optimization of VSLAM, the paper uses CUDA to implement parallel graph optimization based on the Levenberg-Marquardt method. Exploiting the marginalization structure of the incremental equations and the independence of observations in graph optimization, parallel algorithms are realized for error calculation, construction and update of the linear system, Schur complement reduction, and linear equation solving. In each subtask, we make full use of system resources such as shared memory and constant memory, and combine coarse-grained and fine-grained parallelism to optimize the acceleration strategy. Finally, we integrate the improved front-end and back-end parallel algorithms into a state-of-the-art VSLAM framework and compare it with other popular methods on public datasets to verify the effectiveness of the proposed method. The flowchart of the proposed algorithm is shown in Figure 1. The main contributions of the work are listed as follows:
(1) The paper designs a full-flow strategy for feature extraction and matching. After the image is copied from the CPU to the GPU, all computing tasks are performed entirely on the GPU, reducing the time spent on multiple data transfers between devices (a minimal matching sketch follows this list).
(2) The parallel graph optimization method implemented in the paper is a drop-in substitute for g2o (general graph optimization): it not only supports basic operations on the CPU, such as adding and removing vertices and edges, querying errors, and setting optimization levels, but also takes full account of the independence of edges and the sparsity of the matrices in VSLAM graph optimization. It is therefore well suited to both the global and local optimization of VSLAM (a minimal error-reduction sketch also follows this list).
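As a concrete illustration of the front-end stage in contribution (1), the sketch below shows brute-force Hamming-distance matching of 256-bit binary descriptors (eight 32-bit words, as in ORB) with one thread per query descriptor. It is a minimal sketch under these assumptions, not the paper's actual kernel; the name matchHamming and the buffer layout are hypothetical, and the full pipeline additionally runs pyramid construction, lookup-table FAST, grid-cell suppression, descriptor computation, and RANSAC filtering on the GPU without intermediate host copies:

#include <cuda_runtime.h>
#include <limits.h>

// One thread per query descriptor; each thread scans all train descriptors
// and records the index with the smallest Hamming distance.
__global__ void matchHamming(const unsigned int* query,   // numQuery x 8 words
                             const unsigned int* train,   // numTrain x 8 words
                             int numQuery, int numTrain,
                             int* bestIdx, int* bestDist)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= numQuery) return;

    unsigned int qd[8];
    for (int w = 0; w < 8; ++w) qd[w] = query[q * 8 + w]; // cache the query descriptor in registers

    int best = INT_MAX, bestT = -1;
    for (int t = 0; t < numTrain; ++t) {
        int dist = 0;
        for (int w = 0; w < 8; ++w)
            dist += __popc(qd[w] ^ train[t * 8 + w]);      // popcount of XOR = Hamming distance
        if (dist < best) { best = dist; bestT = t; }
    }
    bestIdx[q]  = bestT;
    bestDist[q] = best;
}

A ratio test or cross-check and the RANSAC-based filtering can then consume bestIdx and bestDist directly in device memory; staging the train descriptors through shared memory in tiles is a common further optimization.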
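For the back-end in contribution (2), the sketch below illustrates one of the listed subtasks, error calculation, using the independence of edges and shared memory: one thread evaluates the squared reprojection error of one edge, a shared-memory tree reduction sums the block, and one atomicAdd per block accumulates the total chi-square. It is a simplified sketch, not the paper's implementation; the residuals are assumed to be precomputed, the kernel name is hypothetical, and double-precision atomicAdd requires compute capability 6.0 or higher:

#include <cuda_runtime.h>

__global__ void accumulateChi2(const double* residuals,  // numEdges x 2 (u, v) reprojection residuals
                               int numEdges, double* chi2)
{
    extern __shared__ double partial[];                   // blockDim.x doubles of dynamic shared memory
    int e = blockIdx.x * blockDim.x + threadIdx.x;

    double err = 0.0;
    if (e < numEdges) {
        double ru = residuals[2 * e];
        double rv = residuals[2 * e + 1];
        err = ru * ru + rv * rv;                          // squared error of edge e
    }
    partial[threadIdx.x] = err;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(chi2, partial[0]);    // one atomic per block
}

Launched, for example, as accumulateChi2<<<blocks, 256, 256 * sizeof(double)>>>(d_res, numEdges, d_chi2) with d_chi2 zeroed beforehand, the same edge-parallel pattern extends to accumulating the Hessian blocks, the Schur complement reduction, and the linear solve.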

