Video object segmentation and tracking in surgical videos mainly focus on moving objects such as surgeons' hands and surgical instruments[175]. However, work on tracking the tissue itself is still necessary when the camera moves frequently.
Taking advantage of the limited, and sometimes fixed, number of objects to track and their similar appearance throughout a procedure, some traditional methods assume explicit models of the targets[176-178], optimize similarity or energy functions over the observations and low-level features, and apply registration or template matching for prediction. Others[179,180] rely on additional markers and external sensors for more accurate tracking.
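To make the template-matching strategy concrete, the following minimal sketch re-locates a single target in each new frame by normalized cross-correlation; the OpenCV-based routine, the box format, and the naive template refresh are illustrative assumptions rather than the method of any cited work.

```python
import cv2

def track_by_template(frames, init_box):
    """Track one target across frames with normalized cross-correlation
    template matching (a classical baseline; real systems add target
    models, energy minimization, or registration refinement)."""
    x, y, w, h = init_box
    template = frames[0][y:y + h, x:x + w]
    boxes = [init_box]
    for frame in frames[1:]:
        # Slide the template over the frame and score local similarity.
        scores = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
        _, _, _, (bx, by) = cv2.minMaxLoc(scores)  # location of best match
        boxes.append((bx, by, w, h))
        # Naively refresh the template to adapt to appearance changes.
        template = frame[by:by + h, bx:bx + w]
    return boxes
```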
Deep learning methods became the dominant approach once they were applied to surgical videos. Because surgical scenes contain less intra-class occlusion, there is less need for the sophisticated feature fusion mechanisms used in general-purpose vision. Most works continue to use less data-hungry image-based segmentation and detection architectures[181-185] and maintain correspondence at the result level; others attempt end-to-end sequence models and spatio-temporal feature aggregation[186,187].
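As an illustration of result-level correspondence, the sketch below greedily links per-frame instance masks across consecutive frames by mask overlap; the threshold, greedy matching, and function names are illustrative assumptions, not a specific published pipeline.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union if union else 0.0

def link_masks(prev_tracks, new_masks, iou_thresh=0.3):
    """Assign each newly segmented mask the track ID of the most similar
    mask from the previous frame, or start a new track."""
    linked, used = {}, set()
    next_id = max(prev_tracks, default=-1) + 1
    for mask in new_masks:
        best_id, best_score = None, iou_thresh
        for track_id, prev_mask in prev_tracks.items():
            score = mask_iou(mask, prev_mask)
            if track_id not in used and score > best_score:
                best_id, best_score = track_id, score
        if best_id is None:                 # no overlap above threshold
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        linked[best_id] = mask
    return linked
```

Running the segmentation network frame by frame and linking each new set of masks in this way yields identity-consistent tracks without any temporal feature fusion.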
Depth estimation
The goal of depth estimation is to associate with each pixel a value that reflects the distance from the camera
to the object in that pixel, for a single timestep. This depth may be expressed in absolute units, such as
meters, or in dimensionless units that capture the relative depth of objects in an image. The latter goal often
arises for monocular depth estimation (MDE), which is depth estimation using a single image, due to the
fundamental ambiguity between the scale of an object and its distance from the camera. Conventional
approaches use shadows[188], edges, or structured light[189,190] to estimate relative depth. Although prior knowledge about the scale of specific objects in a scene enables absolute measurement, it was not until the advent of deep neural networks that object recognition became computationally tractable[191]. Stereo depth
estimation (SDE), on the other hand, leverages the known geometry of multiple cameras to estimate depth
in absolute units, and it has long been studied as a fundamental problem relevant to robotic navigation,
manipulation, 3D modeling, surveying, and augmented reality applications[192,193]. Traditional approaches in this area used patch-based statistics[194] or edge detectors to identify sparse point correspondences between
images, from which a dense disparity map can be interpolated[193,195]. Alternatively, dense correspondences can be established directly using local window-based techniques[196,197] or global optimization algorithms[198,199], which often make assumptions about the surfaces being imaged to constrain the energy function[193]. However, the advent of deep learning revolutionized approaches to both monocular and stereo depth estimation, proving particularly advantageous for the challenging domain of surgical images.
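For intuition about the classical stereo pipeline, the following sketch computes a dense disparity map with a local block matcher and converts it to metric depth via Z = fB/d; the focal length, baseline, and matcher parameters are placeholder values standing in for real stereo calibration.

```python
import cv2
import numpy as np

# Placeholder calibration values; real ones come from stereo calibration.
FOCAL_PX = 700.0     # focal length in pixels
BASELINE_M = 0.004   # distance between the two camera centers, in meters

def stereo_depth(left_gray, right_gray):
    """Local window-based stereo: match small blocks along epipolar lines to
    get a disparity map, then convert disparity to absolute depth Z = f*B/d."""
    matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    # OpenCV returns fixed-point disparity values scaled by 16.
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    disparity[disparity <= 0] = np.nan        # mark unmatched/occluded pixels
    return FOCAL_PX * BASELINE_M / disparity  # per-pixel depth in meters
```

Because depth error grows as the baseline shrinks, the narrow baselines of stereo endoscopes make the conversion from disparity to depth especially sensitive to matching errors.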
Within that domain, depth estimation is a valuable step toward geometric understanding in real time. As
opposed to 3D reconstruction or structure-from-motion algorithms, as described below, depth estimation
requires only a single frame of monocular or stereo video, meaning this geometric snapshot is just as
reliable as post-hoc analysis that leverages future frames as well as past ones. Furthermore, it makes no
assumptions about the geometric consistency of the anatomy throughout the surgery. In combination with
detection and recognition algorithms, such as those above, depth estimation provides a semantically
meaningful, 3D representation of the geometry of a surgical procedure, particularly minimally invasive
procedures that rely on visual guidance. Traditionally, laparoscopic, endoscopic, and other visible-spectrum image-guided procedures relied on monocular vision systems, which were sufficient when the video was presented as raw guidance on a flat monitor[200,201]. The introduction of 3D monitors, which display stereo images using
glasses to separate the left- and right-eye images, enabled stereo endoscopes and laparoscopes to be used
clinically[202], proving to be invaluable to clinicians for navigating through natural orifices or narrow

