Page 261 Liu et al. Intell Robot 2024;4(3):256-75 I http://dx.doi.org/10.20517/ir.2024.17

In CUDA, the basic scheduling unit for threads is a warp. A warp consists of 32 threads, so execution is
most efficient when the number of threads per thread block is a multiple of 32. However, the number of
registers is limited: when too many threads are allocated to a thread block, the registers cannot meet the
computational demand and data spills to local memory, which is slower to access. To improve the utilization
of video memory, thread blocks are allocated according to the size of the image to be processed so that
resources are not wasted. The width and height of the target image are denoted by “imgcol” and “imgrow”,
respectively. The thread block and thread grid are both two-dimensional, with a scaling factor of two. The
thread-grid allocation strategy is given by:

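$$
\begin{aligned}
\mathrm{gridDim}.x &= \left\lceil \frac{\mathrm{imgcol}}{\mathrm{blockDim}.x} \right\rceil \\
\mathrm{gridDim}.y &= \left\lceil \frac{\mathrm{imgrow}}{\mathrm{blockDim}.y} \right\rceil
\end{aligned}
\tag{2}
$$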
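The ceiling division behind the thread-grid allocation can be sketched on the host side in plain C++. This is a minimal illustration, not the authors' code: the names `Dim2`, `ceilDiv` and `gridFor` are hypothetical, and the block size is left as a parameter because the text does not state one.

```cpp
// Host-side sketch of the grid sizing used for the thread-grid allocation.
// Dim2, ceilDiv and gridFor are hypothetical helper names, not from the paper.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer ceiling division: the smallest n such that n * b >= a (requires b > 0).
inline unsigned ceilDiv(unsigned a, unsigned b) {
    return (a + b - 1) / b;
}

// Number of blocks needed along each axis so that every pixel gets a thread.
inline Dim2 gridFor(unsigned imgcol, unsigned imgrow, Dim2 block) {
    return Dim2{ceilDiv(imgcol, block.x), ceilDiv(imgrow, block.y)};
}
```

With an assumed block of 32 x 8 threads (256 threads, a multiple of the warp size), a 640 x 480 image needs 20 x 60 blocks, and a 641 x 481 image needs 21 x 61 blocks, because a partially filled block is still required to cover the last column and row.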

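Each thread corresponds to a pixel, and the pixel coordinates (x, y) correspond to the threads as follows:

$$
\begin{aligned}
x &= \mathrm{blockDim}.x \ast \mathrm{blockIdx}.x + \mathrm{threadIdx}.x \\
y &= \mathrm{blockDim}.y \ast \mathrm{blockIdx}.y + \mathrm{threadIdx}.y
\end{aligned}
\tag{3}
$$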
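The thread-to-pixel mapping can be checked with a sequential emulation of the launch in plain C++. The sketch below is an assumption-laden illustration: `countVisits` is a hypothetical name, the loops stand in for the hardware scheduling of blocks and threads, and the bounds guard is a common addition that the text does not mention explicitly (it is needed whenever the image size is not a multiple of the block size).

```cpp
#include <vector>

// Emulates a 2D launch on the CPU and counts how often each pixel is visited.
// countVisits is a hypothetical helper; the four loops replace the GPU scheduler.
std::vector<int> countVisits(int imgcol, int imgrow, int blockDimX, int blockDimY) {
    const int gridDimX = (imgcol + blockDimX - 1) / blockDimX;
    const int gridDimY = (imgrow + blockDimY - 1) / blockDimY;
    std::vector<int> visits(static_cast<std::size_t>(imgcol) * imgrow, 0);

    for (int blockIdxY = 0; blockIdxY < gridDimY; ++blockIdxY)
        for (int blockIdxX = 0; blockIdxX < gridDimX; ++blockIdxX)
            for (int threadIdxY = 0; threadIdxY < blockDimY; ++threadIdxY)
                for (int threadIdxX = 0; threadIdxX < blockDimX; ++threadIdxX) {
                    // Pixel coordinates computed from block and thread indices.
                    const int x = blockDimX * blockIdxX + threadIdxX;
                    const int y = blockDimY * blockIdxY + threadIdxY;
                    // Threads in the padding of the last block do no work.
                    if (x >= imgcol || y >= imgrow) continue;
                    ++visits[static_cast<std::size_t>(y) * imgcol + x];
                }
    return visits;
}
```

If the mapping is correct, every entry of the returned vector equals 1: each pixel is handled by exactly one thread, and no thread writes outside the image.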

In the kernel function, we use the look-up table method [24] to avoid the warp divergence triggered by
if/else statements. The FAST response values are calculated using the Sum of Absolute Differences (SAD-A)
method.
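One common way to replace the branching segment test with a table look-up is sketched below in plain C++. It is not taken from ref [24] or from the authors' kernel: the 16-pixel ring, the arc length of 9, the helper names, and the plain sum of absolute differences used as the score are all assumptions made for illustration, and the exact SAD-A definition may differ.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>
#include <vector>

// True if the 16-bit ring mask holds a circular run of at least `arc` set bits.
inline bool hasContiguousArc(std::uint32_t mask, int arc) {
    // Duplicate the ring so that circular runs become linear runs.
    const std::uint32_t doubled = (mask & 0xFFFFu) | ((mask & 0xFFFFu) << 16);
    int run = 0;
    for (int i = 0; i < 32; ++i) {
        run = ((doubled >> i) & 1u) ? run + 1 : 0;
        if (run >= arc) return true;
    }
    return false;
}

// Precomputed once: table[mask] == 1 if the mask passes the segment test.
inline std::vector<std::uint8_t> buildArcTable(int arc = 9) {
    std::vector<std::uint8_t> table(65536);
    for (std::uint32_t m = 0; m < 65536; ++m)
        table[m] = hasContiguousArc(m, arc) ? 1 : 0;
    return table;
}

struct RingMasks {
    std::uint16_t brighter;
    std::uint16_t darker;
};

// Every thread runs the same 16 comparisons; no data-dependent branch is taken.
inline RingMasks classifyRing(const std::array<int, 16>& ring, int center, int t) {
    RingMasks m{0, 0};
    for (int i = 0; i < 16; ++i) {
        const unsigned b = static_cast<unsigned>(ring[i] > center + t) << i;
        const unsigned d = static_cast<unsigned>(ring[i] < center - t) << i;
        m.brighter = static_cast<std::uint16_t>(m.brighter | b);
        m.darker = static_cast<std::uint16_t>(m.darker | d);
    }
    return m;
}

// Corner decision as two table reads combined with a bitwise OR.
inline bool isCorner(const std::vector<std::uint8_t>& table, RingMasks m) {
    return (table[m.brighter] | table[m.darker]) != 0;
}

// Assumed score: sum of absolute differences between ring pixels and the centre.
inline int sadScore(const std::array<int, 16>& ring, int center) {
    int sum = 0;
    for (int p : ring) sum += std::abs(p - center);
    return sum;
}
```

The design point is that the per-pixel work is identical for all threads: the comparisons produce bit masks, and the decision that would otherwise need nested if/else statements becomes a memory read.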

2.3 Non-maximal suppression
To avoid duplicate detections and an excessive concentration of corners in one area, non-maximal
suppression is generally used to filter the corners according to their response values. In this paper, each
layer of the image pyramid is divided into several rectangular grids using a method similar to that described
in ref [24], and the grid size of each layer is adjusted according to the scale factor so that the grid index is
consistent across layers. In NVIDIA's CUDA architecture, the basic thread scheduling unit, the warp,
contains 32 threads. The threads of a warp belong to the same block and therefore access the same shared
memory, which has low transfer latency, and all threads in a warp execute the same instruction. If different
threads take different branches, warp divergence occurs and performance degrades. Non-maximal suppression
maps well onto this warp-based execution model. The threads in a warp must reside in the same block; if
each grid in the image corresponds to a block, each pixel in the grid is assigned a thread within a warp. In
this way, high communication speed can be achieved while avoiding warp divergence.
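The per-grid suppression can be illustrated with a sequential reference in plain C++. This is only a sketch of the selection rule (keep the strongest response in each grid cell); on the GPU the same result would typically come from a maximum reduction among the threads of the block that owns the cell. The names `CellBest` and `gridNms`, the row-major response map, and the convention that a response of 0 means "not a corner" are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Strongest corner found in one grid cell; x = y = -1 marks an empty cell.
struct CellBest {
    int x;
    int y;
    float response;
};

// Keeps, for every cell of size `cell` x `cell`, the pixel with the largest response.
// `response` is a row-major map of imgcol * imgrow scores (0 for non-corners).
std::vector<CellBest> gridNms(const std::vector<float>& response,
                              int imgcol, int imgrow, int cell) {
    const int cellsX = (imgcol + cell - 1) / cell;
    const int cellsY = (imgrow + cell - 1) / cell;
    std::vector<CellBest> best(static_cast<std::size_t>(cellsX) * cellsY,
                               CellBest{-1, -1, 0.0f});
    for (int y = 0; y < imgrow; ++y) {
        for (int x = 0; x < imgcol; ++x) {
            const float r = response[static_cast<std::size_t>(y) * imgcol + x];
            CellBest& c = best[static_cast<std::size_t>(y / cell) * cellsX + x / cell];
            if (r > c.response) c = CellBest{x, y, r};
        }
    }
    return best;
}
```

Because every cell yields at most one corner, the output size is bounded by the number of cells, which is what keeps corners from clustering in highly textured regions.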

After the non-maximal suppression inside each grid, the maximal values in the grids with the same index
among different layers are compared, and the FAST corner with the largest response value is retained and
normalized to the original image, so as to reduce redundant feature points and ensure a uniform distribution
of feature points over the image.
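The cross-layer comparison can be sketched in plain C++ as follows. The sketch assumes that every layer reports one candidate per grid index, that a non-positive response marks an empty cell, and that a coordinate on layer `l` is mapped back to the original image by multiplying it by `scale` raised to the power `l`; the text only says that corners are "normalized to the original image", so this mapping and the names `CellMax` and `mergeLayers` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-cell winner of one pyramid layer; response <= 0 marks an empty cell.
struct CellMax {
    float x;
    float y;
    float response;
};

// For each grid index, keeps the strongest candidate over all layers and
// rescales its coordinates to the original image (assumed: x * scale^level).
std::vector<CellMax> mergeLayers(const std::vector<std::vector<CellMax>>& layers,
                                 float scale) {
    if (layers.empty()) return {};
    std::vector<CellMax> out(layers[0].size(), CellMax{-1.0f, -1.0f, 0.0f});
    for (std::size_t level = 0; level < layers.size(); ++level) {
        const float toOriginal = std::pow(scale, static_cast<float>(level));
        const std::size_t n =
            layers[level].size() < out.size() ? layers[level].size() : out.size();
        for (std::size_t cell = 0; cell < n; ++cell) {
            const CellMax& c = layers[level][cell];
            if (c.response > out[cell].response)
                out[cell] = CellMax{c.x * toOriginal, c.y * toOriginal, c.response};
        }
    }
    return out;
}
```

This only works if the grid index means the same image region on every layer, which is why the grid size of each layer has to follow the pyramid scale factor.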

2.4 Descriptor calculation
For rotation-invariant features, the “angle” of the corner points needs to be calculated during feature
extraction, which corresponds to the direction calculation of Figure 2A. The first step is to calculate the
grayscale centroid
