as in:

$$
\begin{aligned}
m_{10} &= \sum_{x=-r}^{r}\sum_{y=-r}^{r} x\,I(x,y),\\
m_{01} &= \sum_{x=-r}^{r}\sum_{y=-r}^{r} y\,I(x,y),\\
m_{00} &= \sum_{x=-r}^{r}\sum_{y=-r}^{r} I(x,y),
\end{aligned}
\tag{4}
$$
where $r$ is the radius of the circular image patch and $m_{00}$ is the zeroth-order image moment. The centroid coordinates of the image block are defined as follows:

$$
\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}. \tag{5}
$$
Then, the principal direction of the feature point is obtained by connecting the geometric center of the image block with its centroid:
$$
\theta = \arctan\!\left(\frac{\bar{y}}{\bar{x}}\right) = \arctan\!\left(\frac{m_{01}}{m_{10}}\right). \tag{6}
$$
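As a simple illustration with made-up values, consider a $3\times3$ patch that is zero everywhere except $I(1,0)=6$ and $I(0,1)=3$. Equation (4) gives $m_{10}=6$, $m_{01}=3$, and $m_{00}=9$, so the centroid lies at $(\bar{x},\bar{y})=(2/3,\,1/3)$, and Equation (6) yields $\theta=\arctan(3/6)\approx 26.6^{\circ}$; the principal direction points from the patch center toward its brighter side.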
In a GPU, the warp is the basic scheduling unit: all threads within a warp execute the same instruction at the same time and share the same block of memory with low transfer latency. Since the direction is generally calculated over a circular area with a diameter of 32 pixels, one warp is used to process one feature point, and each thread in the warp is assigned to process one row of pixels, which improves execution efficiency. During execution, 32 calculation units are required for the center row, while the number of calculations for each of the remaining rows is a fixed value smaller than 32. A one-dimensional 32-element array is therefore constructed to store the number of valid pixels per row, and it is kept in constant memory.
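A minimal CUDA sketch of this warp-per-keypoint scheme is given below; it is an illustration under our own assumptions, not the authors' code. The names `icAngleKernel` and `c_uMax`, the `short2` keypoint layout, and the use of `atan2f` are all ours; `c_uMax` plays the role of the 32-element constant-memory array just described, with -1 marking the one unused lane.

```cuda
#include <cuda_runtime.h>

#define PATCH_RADIUS 15   // circular patch with a diameter of ~32 pixels
#define WARP_SIZE 32

// Half-width of valid pixels for each of the 32 rows: the 32-element
// constant-memory array described above, uploaded once from the host.
__constant__ int c_uMax[WARP_SIZE];

// One 32-thread block (exactly one warp) per keypoint; lane i sums row i.
__global__ void icAngleKernel(const unsigned char* img, size_t pitch,
                              const short2* kpts, float* angle, int nKpts)
{
    const int kp   = blockIdx.x;
    const int lane = threadIdx.x;              // 0..31
    if (kp >= nKpts) return;

    const int v    = lane - PATCH_RADIUS;      // row offset from patch center
    const int uMax = c_uMax[lane];             // valid half-width; -1 = idle lane
    int m10 = 0, m01 = 0;

    if (uMax >= 0) {
        const short2 c = kpts[kp];
        const unsigned char* row = img + (c.y + v) * pitch + c.x;
        int rowSum = 0;
        for (int u = -uMax; u <= uMax; ++u) {  // per-row part of Eq. (4)
            const int val = row[u];
            m10    += u * val;
            rowSum += val;
        }
        m01 = v * rowSum;
    }

    // Warp-level reduction of the partial moments across all 32 lanes.
    for (int off = WARP_SIZE / 2; off > 0; off >>= 1) {
        m10 += __shfl_down_sync(0xffffffff, m10, off);
        m01 += __shfl_down_sync(0xffffffff, m01, off);
    }

    if (lane == 0)   // Eq. (6); atan2f also resolves the quadrant
        angle[kp] = atan2f((float)m01, (float)m10);
}
```

Launched as `icAngleKernel<<<nKpts, WARP_SIZE>>>(...)`, each block maps onto exactly one warp, so the reduction needs no shared memory or explicit synchronization beyond the shuffle intrinsics.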
In image matching, the BRIEF descriptor is generally used to represent and describe the detected feature points, owing to its faster calculation speed and higher matching quality compared with the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) descriptors. The descriptor is determined by 256 random point pairs within the neighborhood window around the feature point, where the random pairs can be obtained by Gaussian sampling.
Since the descriptor calculations for different feature points are mutually independent and each 256-bit descriptor occupies 32 bytes of storage, 32 threads are allocated to each feature point to improve running efficiency. To speed up data fetches, “cudaMemcpyToSymbol()” is utilized to store the point pairs in constant memory.
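The sketch below illustrates this arrangement under our own naming assumptions (`c_pattern`, `uploadPattern`, `briefKernel`, and the `char4` pair encoding are illustrative, not the paper's identifiers); rotation of the sampling pattern by the keypoint angle, as in steered BRIEF, is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// 256 sampling pairs, one char4 per pair: (x,y) is the first point and
// (z,w) the second, both as offsets from the keypoint. Constant memory
// broadcasts efficiently when all lanes of a warp read the same element.
__constant__ char4 c_pattern[256];

// Upload the Gaussian-sampled point pairs once at startup, as in the text.
void uploadPattern(const char4* hostPattern)
{
    cudaMemcpyToSymbol(c_pattern, hostPattern, 256 * sizeof(char4));
}

// One 32-thread block per keypoint; each thread evaluates 8 point pairs
// and writes one byte of the 32-byte (256-bit) descriptor.
__global__ void briefKernel(const unsigned char* img, size_t pitch,
                            const short2* kpts, unsigned char* desc, int nKpts)
{
    const int kp   = blockIdx.x;
    const int lane = threadIdx.x;                  // 0..31
    if (kp >= nKpts) return;

    const short2 c = kpts[kp];
    unsigned char byte = 0;
    for (int i = 0; i < 8; ++i) {
        const char4 p = c_pattern[lane * 8 + i];
        const unsigned char a = img[(c.y + p.y) * pitch + (c.x + p.x)];
        const unsigned char b = img[(c.y + p.w) * pitch + (c.x + p.z)];
        byte |= (unsigned char)((a < b) << i);     // BRIEF binary intensity test
    }
    desc[kp * 32 + lane] = byte;
}
```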
2.5 Feature matching
During initialization, the number of extracted feature points is generally several times larger than during tracking, because feature matching at this stage needs more accurate map points and initial pose estimation; hence, feature matching is very time-consuming. Meanwhile, it is also difficult to guarantee real-time matching even if a local matching method is employed within a window, since no a priori information, such as the known motion between camera poses, is available. Therefore, this paper implements the parallelization of feature matching by resorting to the OpenCV CUDA module, which mainly includes searching for point pairs based on the Hamming distance and culling false matches with the RANSAC algorithm.
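A condensed sketch of this pipeline with the OpenCV CUDA module follows; the helper name `matchAndCull` is ours, and the use of `cv::findFundamentalMat` as the RANSAC geometric check is an assumption, since the text does not state which model is fitted.

```cpp
#include <opencv2/cudafeatures2d.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

// Brute-force Hamming matching on the GPU, then RANSAC outlier culling.
std::vector<cv::DMatch> matchAndCull(const cv::cuda::GpuMat& descQuery,
                                     const cv::cuda::GpuMat& descTrain,
                                     const std::vector<cv::KeyPoint>& kptsQuery,
                                     const std::vector<cv::KeyPoint>& kptsTrain)
{
    // Coarse matching: minimum Hamming distance between descriptor sets.
    cv::Ptr<cv::cuda::DescriptorMatcher> matcher =
        cv::cuda::DescriptorMatcher::createBFMatcher(cv::NORM_HAMMING);
    std::vector<cv::DMatch> matches;
    matcher->match(descQuery, descTrain, matches);

    // Gather matched coordinates for the geometric consistency check.
    std::vector<cv::Point2f> pts1, pts2;
    for (const cv::DMatch& m : matches) {
        pts1.push_back(kptsQuery[m.queryIdx].pt);
        pts2.push_back(kptsTrain[m.trainIdx].pt);
    }

    // RANSAC model fitting marks inliers in `mask`; false matches are culled.
    std::vector<unsigned char> mask;
    cv::findFundamentalMat(pts1, pts2, cv::FM_RANSAC, 3.0, 0.99, mask);

    std::vector<cv::DMatch> inliers;
    for (size_t i = 0; i < matches.size(); ++i)
        if (mask[i]) inliers.push_back(matches[i]);
    return inliers;
}
```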
For coarse feature matching, the Brute-Force algorithm based on the Hamming distance is applied: for each feature vector in one set, the vector with the minimum distance in the other set is taken as its matching point pair. Since the size of the feature vectors is fixed, the thread block dimensions are set to fixed values accordingly, with dimBlock.x = dimBlock.y when performing the initial feature matching, and the dimension of the thread grid is set to one dimension.
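As an illustration only, a warp-per-query brute-force Hamming kernel consistent with a one-dimensional grid might look as follows; the kernel name, the one-dimensional 32-thread block (the paper's actual two-dimensional dimBlock values are not recoverable from the text), and the layout of each 256-bit descriptor as eight 32-bit words are all assumptions.

```cuda
#include <cuda_runtime.h>
#include <climits>

// One 32-thread block per query descriptor; each lane strides over the
// train set, and a warp reduction keeps the minimum-distance candidate.
__global__ void matchKernel(const unsigned int* descQ,  // nQuery x 8 words
                            const unsigned int* descT,  // nTrain x 8 words
                            int nQuery, int nTrain, int* bestIdx)
{
    const int q    = blockIdx.x;
    const int lane = threadIdx.x;                 // 0..31
    if (q >= nQuery) return;

    int bestDist = INT_MAX, best = -1;
    for (int t = lane; t < nTrain; t += 32) {     // strided scan of train set
        int d = 0;
        for (int w = 0; w < 8; ++w)               // 8 x 32 bits = 256-bit BRIEF
            d += __popc(descQ[q * 8 + w] ^ descT[t * 8 + w]);
        if (d < bestDist) { bestDist = d; best = t; }
    }

    // Warp reduction: keep the candidate with the smallest Hamming distance.
    for (int off = 16; off > 0; off >>= 1) {
        const int oDist = __shfl_down_sync(0xffffffff, bestDist, off);
        const int oIdx  = __shfl_down_sync(0xffffffff, best, off);
        if (oDist < bestDist) { bestDist = oDist; best = oIdx; }
    }
    if (lane == 0) bestIdx[q] = best;             // minimum-distance match
}
```

A one-dimensional grid, `matchKernel<<<nQuery, 32>>>(...)`, then covers every query descriptor, matching the one-dimensional grid configuration described above.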