
Page 261                          Liu et al. Intell Robot 2024;4(3):256-75  I http://dx.doi.org/10.20517/ir.2024.17

               In CUDA, the basic scheduling unit for threads is the warp. A warp consists of 32 threads, so execution is
               most efficient when the number of threads per thread block is a multiple of 32. However, registers are a
               limited resource: when too many threads are allocated to a thread block, the registers cannot meet the
               computational demand and data spills to local memory, which is slower to access. To improve the
               utilization of video memory and avoid wasting resources, thread blocks are allocated according
               to the size of the image to be processed. The width and height of the target
               image are denoted by “imgcol” and “imgrow”, respectively. The thread block and thread grid are both
               two-dimensional, with a scaling factor of two. The thread-grid allocation strategy is given by:



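               $$
               \begin{aligned}
               \mathrm{grid}.x &= \left\lceil \frac{\mathrm{imgcol}}{\mathrm{block}.x} \right\rceil \\
               \mathrm{grid}.y &= \left\lceil \frac{\mathrm{imgrow}}{\mathrm{block}.y} \right\rceil
               \end{aligned}
               \tag{2}
               $$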


               Each thread corresponds to one pixel, and the pixel coordinates (x, y) are obtained from the thread indices as follows:




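               $$
               \begin{aligned}
               x &= \mathrm{blockDim}.x \cdot \mathrm{blockIdx}.x + \mathrm{threadIdx}.x \\
               y &= \mathrm{blockDim}.y \cdot \mathrm{blockIdx}.y + \mathrm{threadIdx}.y
               \end{aligned}
               \tag{3}
               $$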
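
As a sanity check on this thread-to-pixel mapping, the following host-side C++ sketch emulates a launch with nested loops and counts how many threads land on each pixel. It is not the paper's kernel: `Idx2`, `pixel_of_thread`, and `coverage` are hypothetical names, and the bounds guard is an assumption about how threads outside the image would typically be handled when the grid is rounded up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index vectors.
struct Idx2 {
    unsigned x;
    unsigned y;
};

// Same arithmetic as the mapping above, written as a plain function.
inline Idx2 pixel_of_thread(Idx2 blockDim, Idx2 blockIdx, Idx2 threadIdx) {
    return Idx2{blockDim.x * blockIdx.x + threadIdx.x,
                blockDim.y * blockIdx.y + threadIdx.y};
}

// Emulate a 2D launch on the host and count the threads mapped to each pixel.
inline std::vector<int> coverage(unsigned imgcol, unsigned imgrow, Idx2 blockDim) {
    // Grid rounded up so that the whole image is covered.
    Idx2 gridDim{(imgcol + blockDim.x - 1) / blockDim.x,
                 (imgrow + blockDim.y - 1) / blockDim.y};
    std::vector<int> hits(static_cast<std::size_t>(imgcol) * imgrow, 0);
    for (unsigned by = 0; by < gridDim.y; ++by)
        for (unsigned bx = 0; bx < gridDim.x; ++bx)
            for (unsigned ty = 0; ty < blockDim.y; ++ty)
                for (unsigned tx = 0; tx < blockDim.x; ++tx) {
                    Idx2 p = pixel_of_thread(blockDim, Idx2{bx, by}, Idx2{tx, ty});
                    // Assumed guard: threads past the image border do nothing.
                    if (p.x >= imgcol || p.y >= imgrow) continue;
                    ++hits[static_cast<std::size_t>(p.y) * imgcol + p.x];
                }
    return hits;
}
```

Every entry of the returned vector should be exactly 1, meaning each pixel is handled by one and only one thread, even when the image dimensions are not multiples of the block dimensions.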



               In the kernel function, we follow the look-up table method [24] to avoid the warp
               divergence triggered by if/else statements. The FAST response values are calculated using the Sum
               of Absolute Differences (SAD-A) method.
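
One common way to realize a branch-free segment test with a look-up table is sketched below in host-side C++. This is an illustration under stated assumptions, not the method of ref [24] or the paper's kernel: it assumes the usual 16-pixel FAST ring with an arc length of 9, encodes the brighter and darker ring pixels as 16-bit masks, and indexes a precomputed table instead of branching per pixel. The score is assumed to be the plain sum of absolute differences between the ring pixels and the center, which may differ from the paper's SAD-A definition. All names (`has_arc`, `build_arc_table`, `fast_test`, `FastResult`) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// True if the 16-bit circular pattern contains at least `arc` contiguous set bits.
inline bool has_arc(std::uint32_t mask, int arc) {
    std::uint32_t doubled = (mask & 0xFFFFu) | ((mask & 0xFFFFu) << 16);  // unroll the circle
    int run = 0;
    for (int i = 0; i < 32; ++i) {
        run = ((doubled >> i) & 1u) ? run + 1 : 0;
        if (run >= arc) return true;
    }
    return false;
}

// Precompute the answer for all 65536 ring patterns (done once, on the host).
inline std::vector<std::uint8_t> build_arc_table(int arc) {
    std::vector<std::uint8_t> table(65536);
    for (std::uint32_t m = 0; m < 65536; ++m) table[m] = has_arc(m, arc) ? 1 : 0;
    return table;
}

struct FastResult {
    bool is_corner;
    int score;  // assumed SAD-style response
};

// Segment test without data-dependent branches: comparisons become mask bits.
inline FastResult fast_test(int center, const std::array<int, 16>& ring,
                            int threshold, const std::vector<std::uint8_t>& table) {
    std::uint32_t bright = 0, dark = 0;
    int sad = 0;
    for (int i = 0; i < 16; ++i) {
        bright |= static_cast<std::uint32_t>(ring[i] > center + threshold) << i;
        dark |= static_cast<std::uint32_t>(ring[i] < center - threshold) << i;
        sad += std::abs(ring[i] - center);
    }
    return FastResult{(table[bright] | table[dark]) != 0, sad};
}
```

Because every thread runs the same fixed-length loop followed by two table reads, all threads in a warp follow the same instruction path regardless of the pixel values they see.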

               2.3 Non-maximal suppression
               To avoid duplicate detections and an excessive concentration of corners in one area, non-maximal
               suppression is generally used to filter the corners according to their response values. In this paper, each layer of
               the image pyramid is divided into rectangular grids using a method similar to that described in ref [24],
               and the grid size of each layer is adjusted according to the scale factor so that the grid
               index is consistent across layers. In NVIDIA's CUDA architecture, the basic thread scheduling unit, the warp,
               contains 32 threads. The threads of a warp always belong to the same block and therefore share that block's
               shared memory, which has low access latency, and all threads in a warp execute the same instruction. If
               threads in a warp take different branches, warp divergence occurs, which degrades performance.
               Non-maximal suppression maps well onto this warp-based execution model. If each grid
               in the image corresponds to a block, each pixel in the grid is assigned to a thread, and these threads are
               scheduled in warp units. In this way, high communication speed can be achieved while avoiding warp divergence.
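
The per-grid suppression described above amounts to keeping the strongest response in each grid. The host-side C++ sketch below shows that reduction sequentially. On the GPU each grid would be handled by one thread block, which this sketch does not model. The names `Corner` and `cell_maxima`, the square cell shape, and the convention that a response of 0 means "no corner" are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Corner {
    int x;
    int y;
    int score;  // 0 means no corner was found in the cell (assumed convention)
};

// Keep only the strongest response inside each cell x cell grid of the image.
inline std::vector<Corner> cell_maxima(const std::vector<int>& response,
                                       int imgcol, int imgrow, int cell) {
    assert(cell > 0);
    assert(response.size() == static_cast<std::size_t>(imgcol) * imgrow);
    int cols = (imgcol + cell - 1) / cell;  // number of grids per row
    int rows = (imgrow + cell - 1) / cell;  // number of grids per column
    std::vector<Corner> best(static_cast<std::size_t>(cols) * rows, Corner{-1, -1, 0});
    for (int y = 0; y < imgrow; ++y)
        for (int x = 0; x < imgcol; ++x) {
            int s = response[static_cast<std::size_t>(y) * imgcol + x];
            Corner& b = best[static_cast<std::size_t>(y / cell) * cols + (x / cell)];
            if (s > b.score) b = Corner{x, y, s};
        }
    return best;
}
```

In an 8 x 8 response map with 4 x 4 grids, two corners falling into the same grid are reduced to the stronger one, while grids without any response keep a score of 0.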

               After the non-maximal suppression inside the grid, the maximal values in the grids with the same index among
               different layers are then compared, and the FAST corners with the largest response value are retained and
               normalized to the original image, so as to reduce redundant feature points and ensure the uniform distribution
               of feature points on the image.
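
The cross-layer comparison can be sketched as follows in host-side C++. The sketch assumes that every pyramid layer provides one candidate per grid index, that a score of 0 means "no corner", and that layer `l` is downscaled by `scale` raised to the power `l`, so coordinates are mapped back to the original image by multiplying with that factor. These conventions, and the names `LayerCorner` and `merge_layers`, are assumptions and are not taken from the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct LayerCorner {
    float x;
    float y;
    int score;  // 0 means no corner in this grid (assumed convention)
    int level;  // pyramid layer; filled in by merge_layers
};

// For each grid index, keep the strongest corner across all layers and
// rescale its coordinates to the original image.
inline std::vector<LayerCorner> merge_layers(
        const std::vector<std::vector<LayerCorner>>& layers, float scale) {
    assert(!layers.empty());
    const std::size_t cells = layers[0].size();
    std::vector<LayerCorner> out;
    for (std::size_t c = 0; c < cells; ++c) {
        LayerCorner best{0.0f, 0.0f, 0, -1};
        for (std::size_t l = 0; l < layers.size(); ++l) {
            assert(layers[l].size() == cells);  // same grid indexing on every layer
            if (layers[l][c].score > best.score) {
                best = layers[l][c];
                best.level = static_cast<int>(l);
            }
        }
        if (best.level < 0) continue;  // no corner in this grid on any layer
        const float s = std::pow(scale, static_cast<float>(best.level));
        best.x *= s;
        best.y *= s;
        out.push_back(best);
    }
    return out;
}
```

With two layers and a scale of 2, a layer-1 corner at (4, 6) that beats the layer-0 candidate in the same grid is reported at (8, 12) in original-image coordinates.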


               2.4 Descriptor calculation
               For rotation-invariant features, the “angle” of the corner points needs to be calculated during feature extraction,
               which corresponds to the direction calculation of Figure 2A. The first step is to calculate the grayscale centroid