
Page 265                          Liu et al. Intell Robot 2024;4(3):256-75 | http://dx.doi.org/10.20517/ir.2024.17
                                             Figure 4. Flow chart of the parallel solution for S.

               It can be seen from the formulas that the Schur complement S and the reduced right-hand side g are each
               obtained by summing two matrices, and parallel computing with CUDA is very well suited to this kind of
               matrix superposition.
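The per-block independence that makes this superposition GPU-friendly can be sketched as follows. This is an illustrative sketch, not the authors' code: each output entry of S = B − P (with P standing for the product term E C⁻¹ Eᵀ) depends only on its own position, so every loop iteration below corresponds to the work item a single CUDA thread would own.

```python
def superpose(B, P):
    """Return S = B - P, one entry at a time.

    Each iteration is independent of all others -- exactly the
    work a single CUDA thread performs in the parallel version,
    where blockIdx/threadIdx select the (i, j) position.
    """
    m = len(B)
    S = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            S[i][j] = B[i][j] - P[i][j]
    return S

B = [[4.0, 1.0], [1.0, 3.0]]
P = [[1.0, 0.5], [0.5, 1.0]]
print(superpose(B, P))  # -> [[3.0, 0.5], [0.5, 2.0]]
```

Because no entry reads any other entry, the kernel needs no synchronization: the thread count simply equals the number of (non-zero) blocks.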


               For the example of the coefficient matrix S, we copy B to the corresponding positions in S as its initialized
               structure, take the number of non-zero blocks of B as the number of threads, and let each thread perform
               one copy operation. Assuming that the number of camera poses is m and the number of map points is n, S
               can be divided into m × m sub-blocks. The sub-block at position (i, j) is:


                                    S(i, j) = ∑_{x=1}^{n} E(i, x) · C(x)⁻¹ · E(j, x)ᵀ,                    (11)

               where S(i, j) is the sub-block in the i-th row and j-th column of S, and C(x) is the x-th non-zero (diagonal)
               sub-block of C. That is, each sub-block of S needs to accumulate up to n terms. Due to the sparse structure
               of E, the actual number of calculations is much smaller than n. The number of calculations for S(i, j) is
               determined by the i-th and j-th rows of E: if row i and row j of E share k non-zero blocks with identical
               column indexes, there will be exactly k sum operations. The index structure of E does not change between
               iterations, so all non-zero block indexes can be pre-computed during construction of the linear structure.
               Since the calculation of each sub-block of S is independent, and the calculation of each sum term within a
               sub-block is also independent, S is computed in a two-level parallel mode, with coarse-grained parallelization
               among non-zero blocks and fine-grained parallelization within non-zero blocks, further improving the
               operation efficiency. The calculation of the reduced right-hand side g adopts the same strategy. Figure 4
               shows the flow chart for solving S.
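A minimal sketch of Eq. (11) and the sparsity argument above, not the authors' implementation: blocks are reduced to scalars (1×1) for readability, each row of E is stored as a map from column index to value, and the shared non-zero indexes are pre-computed once per structure, so S(i, j) costs exactly k sum operations rather than n. The function name `schur_subblocks` is illustrative.

```python
def schur_subblocks(E_rows, C_diag, m):
    """E_rows[i] maps column index x -> scalar block E(i, x);
    C_diag[x] is the x-th diagonal block of C (here a scalar)."""
    # Pre-compute the shared non-zero column indexes once: the
    # index structure of E does not change between iterations.
    shared = {(i, j): sorted(set(E_rows[i]) & set(E_rows[j]))
              for i in range(m) for j in range(m)}
    S = [[0.0] * m for _ in range(m)]
    # Each (i, j) sub-block is independent (coarse-grained
    # parallelism), and each term of the sum is independent too
    # (fine-grained parallelism); here both run sequentially.
    for (i, j), idxs in shared.items():
        S[i][j] = sum(E_rows[i][x] * (1.0 / C_diag[x]) * E_rows[j][x]
                      for x in idxs)
    return S

# Two poses, three points: pose 0 observes points 0 and 1,
# pose 1 observes points 1 and 2.
E_rows = [{0: 2.0, 1: 1.0}, {1: 3.0, 2: 1.0}]
C_diag = [2.0, 1.0, 4.0]
print(schur_subblocks(E_rows, C_diag, 2))
# -> [[3.0, 3.0], [3.0, 9.25]]
```

Note that the off-diagonal block S(0, 1) touches only the single shared index x = 1 (k = 1), while a dense formulation would sum over all n = 3 points.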


               In order to make full use of the limited memory of the graphics processor, the Compressed Sparse Row
               (CSR) format is adopted to store the sparse matrix S. Compared with dense matrix storage, the CSR format
               avoids wasting space on zero entries. Finally, the incremental equation is solved with the cuSolver linear
               solver provided by CUDA.
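The CSR layout can be sketched in a few lines. This is a generic illustration of the format, not code from the paper: a matrix with z non-zeros is stored as z values, z column indexes, and m + 1 row pointers, instead of m·m dense entries; arrays of this shape are what cuSolver's sparse routines consume.

```python
def to_csr(dense):
    """Convert a dense row-major matrix to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:          # store only non-zero entries
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))  # end of this row in `values`
    return values, col_idx, row_ptr

S = [[3.0, 0.0, 0.5],
     [0.0, 2.0, 0.0],
     [0.5, 0.0, 4.0]]
print(to_csr(S))
# -> ([3.0, 0.5, 2.0, 0.5, 4.0], [0, 2, 1, 0, 2], [0, 2, 3, 5])
```

Here 9 dense entries shrink to 5 values plus index arrays; for the large, very sparse S of bundle adjustment the saving is far greater.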