
Page 263          Liu et al. Intell Robot 2024;4(3):256-75 | http://dx.doi.org/10.20517/ir.2024.17


                                  dimGrid.x = ⌈(m ∗ n) / dimBlock.x⌉                                 (7)


In the matching process, threads are first allocated, with the thread id set as tid = blockIdx.x ∗ blockDim.x + threadIdx.x, to calculate the Hamming distance for each candidate descriptor pair. The Hamming distance is defined as the number of differing positions between two binary codes of the same length. In the actual calculation process, the distance needs to be compared m ∗ n times, where m and n are the numbers of feature points in the two images; therefore, the total number of threads in the program is m ∗ n. After this initial matching, the coarse matching point pairs are obtained.
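The per-thread work described above can be illustrated with a minimal host-side sketch. The 256-bit descriptor width and the tid → (i, j) index mapping are illustrative assumptions (typical for ORB-style binary descriptors), not details fixed by the paper; on the GPU, each iteration of the flat loop would be one CUDA thread.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// 256-bit binary descriptor stored as four 64-bit words (an assumption;
// ORB uses 256 bits, but any fixed width works the same way).
using Descriptor = std::array<std::uint64_t, 4>;

// Hamming distance: the number of differing bit positions between two
// binary codes of the same length (XOR, then population count).
int hammingDistance(const Descriptor& a, const Descriptor& b) {
    int d = 0;
    for (int w = 0; w < 4; ++w)
        d += __builtin_popcountll(a[w] ^ b[w]);
    return d;
}

// Host-side reference of the parallel layout: the flat index tid
// (blockIdx.x * blockDim.x + threadIdx.x on the GPU) is mapped to the
// descriptor pair (i, j) = (tid / n, tid % n), so m * n threads cover
// all m * n comparisons exactly once.
std::vector<int> allPairDistances(const std::vector<Descriptor>& A,
                                  const std::vector<Descriptor>& B) {
    const std::size_t m = A.size(), n = B.size();
    std::vector<int> dist(m * n);
    for (std::size_t tid = 0; tid < m * n; ++tid)   // one GPU thread each
        dist[tid] = hammingDistance(A[tid / n], B[tid % n]);
    return dist;
}
```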

For fine feature matching, the RANSAC algorithm parallelizes the scoring of coarse match pairs and the removal of false matches. The iterative process is executed in parallel, followed by the matching-pair evaluation process. The computation of each randomly sampled model is independent, as is the evaluation of each matched pair; hence, both steps can be accelerated by CUDA parallel computation. If the number of randomly sampled point-set groups is k and the number of matched point pairs is n, then dimBlock.x = min{512, n} is set, and the thread block parameters are given by



                                  dimGrid.x = ⌈(k ∗ n) / dimBlock.x⌉                                 (8)
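The launch geometry in Eq. (8) can be sketched as follows. The function name and struct are illustrative; the point is the integer ceiling division that guarantees a thread for every one of the k ∗ n (hypothesis, pair) evaluations.

```cpp
#include <algorithm>

// Thread-launch geometry for the parallel RANSAC scoring step:
// k random model hypotheses, each scored against all n matched pairs,
// giving k * n independent evaluations in total.
struct LaunchDims {
    unsigned block;  // dimBlock.x
    unsigned grid;   // dimGrid.x
};

LaunchDims ransacLaunchDims(unsigned k, unsigned n) {
    // dimBlock.x = min{512, n}, as set in the text.
    unsigned block = std::min(512u, n);
    // dimGrid.x = ceil(k * n / dimBlock.x), Eq. (8), computed with
    // integer ceiling division so no evaluation is left without a thread.
    unsigned grid = (k * n + block - 1) / block;
    return {block, grid};
}
```

With k = 10 hypotheses and n = 1000 pairs this yields 512-thread blocks and a grid of 20 blocks, i.e. 10240 threads covering the 10000 evaluations.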


               3. PARALLEL ACCELERATION FOR VSLAM BACK-END
The nonlinear optimization of VSLAM is eventually transformed into solving a sparse matrix problem, and the scale of the matrix depends on the number of variables that need to be optimized. Although the complexity of the solution can be reduced by marginalization, it is still difficult for BA to run in real time when the scale of the variables to be optimized is too large. In particular, when SLAM detects a loop closure and carries out global optimization, it is necessary to adjust the poses of all keyframes and the relevant map points. Therefore, this paper proposes a BA scheme based on g2o [30] that introduces GPU parallel computing and improves the traditional BA process to suit a parallel design, mainly comprising the reconstruction of key, time-consuming parts such as error calculation and the linear solution in g2o, memory-scheduling optimization, and task allocation. As a result, the running speed can be greatly improved while the accuracy remains unchanged. The specific process can be seen in Figure 2B.


               3.1 Construction for the linear structure
Creating the linear structure is the process of building the incremental equations. Its main purpose is to build the structures of all matrices used in the subsequent solution. Therefore, it is necessary to allocate fixed memory for the relevant data structures on the GPU.


Here, take the creation of the coefficient matrix H as an example. Hpp and Hll are diagonal block matrices, so the graphics-memory space to be allocated is determined by the number, size, and data type of the non-zero blocks on their own diagonals. The number of non-zero blocks in Hpp is the same as the number of pose vertices, denoted by poses_size. Since each non-zero block is a 6*6 matrix, if a double-precision floating-point type is used to save the data, a space of sizeof(double)*36*poses_size needs to be allocated. For Hpl, in addition to allocating graphics memory according to the non-zero blocks, it is also necessary to save the number and positions of the non-zero blocks and the indices of the corresponding edges for the subsequent Schur complement reduction. According to the physical meaning of Hpl, when the j-th map point can be observed from the i-th camera, Hpl has a non-zero block at position (i, j). Since the edges do not affect each other, we assign one thread per edge to count the non-zero blocks in parallel, and the index information can be obtained from the prestored Hpl non-zero block set on the device.
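The sizing and block-position bookkeeping above can be sketched in host code. The Edge struct and function names are hypothetical; on the GPU, the loop over edges would run as one thread per edge, with the set standing in for the deduplicated non-zero block list kept on the device.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// One observation edge: camera `cam` observes map point `point`
// (a hypothetical stand-in for g2o's edge type).
struct Edge {
    int cam;
    int point;
};

// Bytes to reserve for the diagonal pose block Hpp: one 6x6
// double-precision block per pose vertex, i.e. sizeof(double)*36*poses_size.
std::size_t hppBytes(std::size_t poses_size) {
    return sizeof(double) * 36 * poses_size;
}

// Non-zero blocks of Hpl: a block exists at (i, j) exactly when map point j
// is observed from camera i. Each edge contributes its (cam, point) entry;
// the set deduplicates repeated observations of the same pair.
std::set<std::pair<int, int>> hplNonZeroBlocks(const std::vector<Edge>& edges) {
    std::set<std::pair<int, int>> blocks;
    for (const Edge& e : edges)            // one GPU thread per edge
        blocks.insert({e.cam, e.point});
    return blocks;
}
```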