
Page 263          Liu et al. Intell Robot 2024;4(3):256-75 | http://dx.doi.org/10.20517/ir.2024.17


                                  dimGrid.x = ⌈(m ∗ n) / dimBlock.x⌉                                 (7)


In the matching process, threads are first allocated, with the thread id set as tid = blockIdx.x ∗ blockDim.x + threadIdx.x, to calculate the Hamming distance for each candidate descriptor pair. The Hamming distance is defined as the number of differing positions between two binary codes of the same length. In the actual calculation process, the distance needs to be compared m ∗ n times, where m and n are the numbers of feature points in the two images; therefore, the total number of threads in the program is m ∗ n. After this initial matching, the coarse matching point pairs are obtained.
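The per-thread work described above can be illustrated with a minimal host-side sketch. The 256-bit descriptor width and the tid → (i, j) index mapping are illustrative assumptions (typical for ORB-style binary descriptors), not details fixed by the paper; on the GPU, each iteration of the flat loop would be one CUDA thread.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// 256-bit binary descriptor stored as four 64-bit words (an assumption;
// ORB uses 256 bits, but any fixed width works the same way).
using Descriptor = std::array<std::uint64_t, 4>;

// Hamming distance: the number of differing bit positions between two
// binary codes of the same length (XOR, then population count).
int hammingDistance(const Descriptor& a, const Descriptor& b) {
    int d = 0;
    for (int w = 0; w < 4; ++w)
        d += __builtin_popcountll(a[w] ^ b[w]);
    return d;
}

// Host-side reference of the parallel layout: the flat index tid
// (blockIdx.x * blockDim.x + threadIdx.x on the GPU) is mapped to the
// descriptor pair (i, j) = (tid / n, tid % n), so m * n threads cover
// all m * n comparisons exactly once.
std::vector<int> allPairDistances(const std::vector<Descriptor>& A,
                                  const std::vector<Descriptor>& B) {
    const std::size_t m = A.size(), n = B.size();
    std::vector<int> dist(m * n);
    for (std::size_t tid = 0; tid < m * n; ++tid)   // one GPU thread each
        dist[tid] = hammingDistance(A[tid / n], B[tid % n]);
    return dist;
}
```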

For fine feature matching, the RANSAC algorithm parallelizes the scoring of coarse match pairs and the removal of false matches. The iterative process is executed in parallel, followed by the matching-pair evaluation process. The computation of each randomly sampled model is independent, as is the evaluation of each matched pair; hence, both steps can be accelerated by CUDA parallel computation. If the number of randomly sampled point-set groups is k and the number of matched point pairs is n, then dimBlock.x = min{512, n} is set, and the thread block parameters are given by



                                  dimGrid.x = ⌈(k ∗ n) / dimBlock.x⌉                                 (8)
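The launch geometry in Eq. (8) can be sketched as follows. The function name and struct are illustrative; the point is the integer ceiling division that guarantees a thread for every one of the k ∗ n (hypothesis, pair) evaluations.

```cpp
#include <algorithm>

// Thread-launch geometry for the parallel RANSAC scoring step:
// k random model hypotheses, each scored against all n matched pairs,
// giving k * n independent evaluations in total.
struct LaunchDims {
    unsigned block;  // dimBlock.x
    unsigned grid;   // dimGrid.x
};

LaunchDims ransacLaunchDims(unsigned k, unsigned n) {
    // dimBlock.x = min{512, n}, as set in the text.
    unsigned block = std::min(512u, n);
    // dimGrid.x = ceil(k * n / dimBlock.x), Eq. (8), computed with
    // integer ceiling division so no evaluation is left without a thread.
    unsigned grid = (k * n + block - 1) / block;
    return {block, grid};
}
```

With k = 10 hypotheses and n = 1000 pairs this yields 512-thread blocks and a grid of 20 blocks, i.e. 10240 threads covering the 10000 evaluations.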


               3. PARALLEL ACCELERATION FOR VSLAM BACK-END
The nonlinear optimization of VSLAM is eventually transformed into solving a sparse matrix problem, and the scale of the matrix depends on the number of variables that need to be optimized. Although the complexity of the solution can be reduced by marginalization, it is still difficult for BA to run in real time when the scale of the variables to be optimized is too large. In particular, when SLAM detects a loop closure and carries out global optimization, it is necessary to adjust the poses of all keyframes and the relevant map points. Therefore, this paper proposes a BA scheme based on g2o [30] that introduces GPU parallel computing and improves the traditional BA process to suit a parallel design, mainly comprising the reconstruction of key, time-consuming parts such as error calculation and the linear solution in g2o, memory-scheduling optimization, and task allocation. As a result, the running speed can be greatly improved while the accuracy remains unchanged. The specific process can be seen in Figure 2B.


               3.1 Construction for the linear structure
Creating the linear structure is the process of building the incremental equations. Its main purpose is to build the structures of all matrices used in the subsequent solution. Therefore, it is necessary to allocate fixed memory for the relevant data structures on the GPU.


Here, take the creation of the coefficient matrix H as an example. Hpp and Hll are diagonal block matrices, so the graphics-memory space to be allocated is determined by the number, size, and data type of the non-zero blocks on their own diagonals. The number of non-zero blocks in Hpp is the same as the number of pose vertices, denoted by poses_size. Since each non-zero block is a 6*6 matrix, if a double-precision floating-point type is used to save the data, a space of sizeof(double)*36*poses_size needs to be allocated. For Hpl, in addition to allocating graphics memory according to the non-zero blocks, it is also necessary to save the number and positions of the non-zero blocks and the indices of the corresponding edges for the subsequent Schur complement reduction. According to the physical meaning of Hpl, when the j-th map point can be observed from the i-th camera, Hpl has a non-zero block at position (i, j). Since the edges do not affect each other, we assign one thread per edge to count the non-zero blocks in parallel, and the index information can be obtained from the prestored Hpl non-zero block set on the device.
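The sizing and block-position bookkeeping above can be sketched in host code. The Edge struct and function names are hypothetical; on the GPU, the loop over edges would run as one thread per edge, with the set standing in for the deduplicated non-zero block list kept on the device.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// One observation edge: camera `cam` observes map point `point`
// (a hypothetical stand-in for g2o's edge type).
struct Edge {
    int cam;
    int point;
};

// Bytes to reserve for the diagonal pose block Hpp: one 6x6
// double-precision block per pose vertex, i.e. sizeof(double)*36*poses_size.
std::size_t hppBytes(std::size_t poses_size) {
    return sizeof(double) * 36 * poses_size;
}

// Non-zero blocks of Hpl: a block exists at (i, j) exactly when map point j
// is observed from camera i. Each edge contributes its (cam, point) entry;
// the set deduplicates repeated observations of the same pair.
std::set<std::pair<int, int>> hplNonZeroBlocks(const std::vector<Edge>& edges) {
    std::set<std::pair<int, int>> blocks;
    for (const Edge& e : edges)            // one GPU thread per edge
        blocks.insert({e.cam, e.point});
    return blocks;
}
```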