Deep learning-based methods, on the other hand, aim to leverage data to learn both the feature extractor Eλ₁ and the decoder Dλ₂. Early approaches used multi-layer perceptrons (MLPs) to learn relevant needle features and classify them accordingly[70]. In the approach proposed by Geraldes and Rocha, the MLP took as input a region of interest (ROI) selected from the input ultrasound image and output a probability estimate of each pixel in the ROI being a needle; a threshold was then applied to this output to localize the needle[70]. This approach, however, yielded tip localization errors greater than 5 mm.
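A minimal sketch of this style of pixel-wise MLP classification is given below. It is not the authors' code: the patch size, layer widths, and the 0.5 threshold are illustrative assumptions, chosen only to show how per-pixel probabilities over an ROI are produced and thresholded.

```python
import torch
import torch.nn as nn

class PixelMLP(nn.Module):
    """Scores each ROI pixel as needle vs. background from a small intensity patch."""
    def __init__(self, patch_size: int = 9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_size * patch_size, 64),  # flattened patch -> hidden features
            nn.ReLU(),
            nn.Linear(64, 1),                        # single needle-probability logit
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (N, patch_size*patch_size), one flattened patch per ROI pixel
        return torch.sigmoid(self.net(patches))     # per-pixel needle probability

mlp = PixelMLP()
patches = torch.rand(1000, 81)         # 1000 ROI pixels, 9x9 patches (dummy data)
probs = mlp(patches)
needle_mask = probs.squeeze(1) > 0.5   # threshold to localize the needle
```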
Most deep learning methods use convolutional neural networks (CNNs) with multiple layers, whereby the
first layers learn local features from the image and the deeper layers combine the local features to learn more
global features. CNN-based approaches can be categorized into four: (1) classification; (2) regression; (3)
segmentation; and (4) object detection. Using CNNs for classification is common in methods working with
3D ultrasound. For instance, in the approach by Pourtaherian et al., a CNN is used to classify voxels
extracted from 3D ultrasound volumes as either needle or background yielding a 3D voxel-wise
segmentation map of the needle [71,72] . A cylindrical model is then fitted to this map using RANSAC to
estimate the needle axis which is used to determine the 2D plane containing the entire needle. Another
approach would be to classify each scan plane in the 3D volume as either containing a needle or not, and
then similarly combine and visualize the 2D plane that contains the entire needle . While these approaches
[54]
enhance the visualization of the needle in 3D ultrasound, they do not localize the needle tip.
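The RANSAC step can be sketched as below. This is a simplified line-fitting variant, not the cylindrical model of Pourtaherian et al.[71,72]; the iteration count and inlier tolerance are illustrative assumptions.

```python
import numpy as np

def ransac_needle_axis(voxels: np.ndarray, n_iters: int = 500, tol: float = 2.0):
    """Fit a 3D axis to voxels (N, 3) classified as needle by the CNN."""
    best_inliers, best_model = 0, None
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        p0, p1 = voxels[rng.choice(len(voxels), 2, replace=False)]
        d = p1 - p0
        norm = np.linalg.norm(d)
        if norm < 1e-6:
            continue
        d = d / norm
        # perpendicular distance of every voxel to the candidate axis
        diff = voxels - p0
        dist = np.linalg.norm(diff - np.outer(diff @ d, d), axis=1)
        inliers = int((dist < tol).sum())
        if inliers > best_inliers:
            best_inliers, best_model = inliers, (p0, d)
    return best_model  # (point on axis, unit direction)
```

The recovered axis then defines the 2D plane in the volume that contains the entire needle.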
With regression, the features extracted by the CNN are used to directly regress the needle tip coordinates (x, y)[65], or their proxy by regressing four values representing the two opposite vertices of a tight bounding box centered around the needle tip[66] [Figure 3D]. These approaches are suitable for needle localization in both in-plane and out-of-plane insertion as they do not heavily rely on shaft information. The only downside of these approaches is that they do not enhance visualization of the entire needle during in-plane insertions.
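A minimal sketch of such a regression network is shown below. The backbone and head here are illustrative assumptions, not the architectures of [65] or [66]; the point is that the same design regresses either two tip coordinates or four bounding-box values by changing the output dimension.

```python
import torch
import torch.nn as nn

class TipRegressor(nn.Module):
    """Regresses needle tip coordinates from a grayscale ultrasound frame."""
    def __init__(self, out_dim: int = 2):  # 2 for (x, y); 4 for box vertices
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global features for direct regression
        )
        self.head = nn.Linear(32, out_dim)

    def forward(self, us_image: torch.Tensor) -> torch.Tensor:
        # us_image: (B, 1, H, W); returns (B, out_dim) coordinate estimates
        feats = self.backbone(us_image).flatten(1)
        return self.head(feats)

model = TipRegressor()
tip_xy = model(torch.rand(1, 1, 256, 256))  # e.g., tensor([[x, y]])
```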
For segmentation, the high-level features extracted by the CNN are used in reverse to generate a probability map with pixel-wise probabilities of the existence of a needle[72-78]. This probability map can then be post-processed, usually by thresholding, to generate a binary segmentation map. CNNs with segmentation are the most commonly used deep learning approach for needle detection because they can detect the entire needle, including the shaft, while producing probabilities for their outputs, which leaves room for a variety of post-processing approaches [Figure 3C]. A special case of segmentation can be found in high dose rate (HDR) prostate brachytherapy applications, where multiple needles are segmented simultaneously[79-82]. In these applications, transverse 2D slices obtained from the 3D ultrasound volume are passed as input to a CNN trained to output the corresponding multi-needle segmentations for each slice. Unlike shaft segmentations for in-plane needle insertion, segmentations in HDR prostate brachytherapy slices are circular and centered around each needle in a given slice. These segmentations are then combined; the centers of the circles are taken to be the needle shaft, and the most distal bright intensity is considered the needle tip.
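The common post-processing step can be sketched as follows: threshold the probability map into a binary mask, then take the most distal segmented pixel as the tip estimate. The 0.5 threshold and the row-index-equals-depth convention are illustrative assumptions.

```python
import numpy as np

def postprocess(prob_map: np.ndarray, thresh: float = 0.5):
    """Threshold a pixel-wise needle probability map and estimate the tip."""
    mask = prob_map > thresh                 # binary needle segmentation
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return mask, None                    # no needle detected
    deepest = np.argmax(ys)                  # largest row index = greatest depth
    return mask, (int(xs[deepest]), int(ys[deepest]))  # (x, y) tip estimate

mask, tip = postprocess(np.random.rand(256, 256))  # dummy probability map
```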
Object detection methods are similar to segmentation methods but output bounding boxes encasing the detected needle. For instance, Mwikirize et al. used a CNN to automatically generate potential bounding box regions containing the needle and fed them to a region-based CNN (R-CNN) to classify which regions contained the needle[83]. On the other hand, Wang et al. used the YOLOX-nano detector, which outputs bounding box predictions for each pixel. These predictions are then combined using non-max suppression to obtain a single bounding box indicating the predicted needle[84]. Rubin et al. combined a 3D CNN, to extract temporal features from an ultrasound video stream, with a 2D YOLOv3-tiny object detector to

