
Page 120                           Ding et al. Art Int Surg 2024;4:109-38  https://dx.doi.org/10.20517/ais.2024.16

Video object segmentation and tracking[175] in surgical videos mainly focus on moving objects such as surgeons' hands and surgical instruments. However, work on tracking the tissues themselves remains necessary when the camera moves frequently. Taking advantage of the limited, sometimes fixed, number of objects to track and their similar appearance throughout the surgery, some traditional methods assume models of the targets[176-178], optimize similarity or energy functions with respect to the observations and low-level features, and apply registration or template matching for prediction. Others[179,180] rely on additional markers and other sensors for more accurate tracking.
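As a concrete illustration of the template-matching prediction step mentioned above, the sketch below locates a small grayscale template inside a frame by maximizing normalized cross-correlation. It is a minimal, hypothetical example (pure Python, toy image sizes), not an implementation from the cited works.

```python
import math

def ncc(patch, templ):
    """Normalized cross-correlation between two equal-sized patches."""
    n = len(patch) * len(patch[0])
    mp = sum(sum(r) for r in patch) / n
    mt = sum(sum(r) for r in templ) / n
    num = den_p = den_t = 0.0
    for rp, rt in zip(patch, templ):
        for p, t in zip(rp, rt):
            num += (p - mp) * (t - mt)
            den_p += (p - mp) ** 2
            den_t += (t - mt) ** 2
    den = math.sqrt(den_p * den_t)
    return num / den if den else 0.0

def track(frame, templ):
    """Return (row, col) of the best template match in the frame."""
    th, tw = len(templ), len(templ[0])
    best, best_pos = -2.0, (0, 0)
    for r in range(len(frame) - th + 1):
        for c in range(len(frame[0]) - tw + 1):
            patch = [row[c:c + tw] for row in frame[r:r + th]]
            score = ncc(patch, templ)
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos
```

In a tracking loop, the template would be cropped from the previous frame around the target (e.g., an instrument tip) and re-matched in each new frame, which is why these methods benefit from the target's stable appearance during surgery.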


Deep learning methods became the dominant approach once they were applied to surgical videos. Since there is less intra-class occlusion in surgical scenes, there is less demand for the sophisticated feature-fusion mechanisms used in general vision. Most works continue to use less data-hungry image-based segmentation and detection architectures[181-185] and maintain correspondence at the result level. There are also attempts at end-to-end sequence models and spatio-temporal feature aggregation[186,187].
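The "correspondence at the result level" mentioned above is often realized as tracking-by-detection: per-frame detections are linked across frames by box overlap. The sketch below is an illustrative greedy intersection-over-union (IoU) matcher, a simplification rather than any specific cited method; the threshold value is made up.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def link(prev_boxes, cur_boxes, thresh=0.3):
    """Greedily match each current box to at most one previous box by IoU."""
    matches, used = {}, set()
    for j, cur in enumerate(cur_boxes):
        best_i, best = None, thresh  # only accept overlaps above the threshold
        for i, prev in enumerate(prev_boxes):
            if i in used:
                continue
            s = iou(prev, cur)
            if s > best:
                best_i, best = i, s
        if best_i is not None:
            matches[j] = best_i
            used.add(best_i)
    return matches  # {current detection index: previous detection index}
```

Because surgical scenes contain few, visually distinct instruments, even this simple association step is often sufficient to carry identities between frames.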

               Depth estimation
The goal of depth estimation is to associate with each pixel a value that reflects the distance from the camera to the object in that pixel, for a single timestep. This depth may be expressed in absolute units, such as meters, or in dimensionless units that capture the relative depth of objects in an image. The latter goal often arises for monocular depth estimation (MDE), which is depth estimation using a single image, due to the fundamental ambiguity between the scale of an object and its distance from the camera. Conventional approaches use shadows[188], edges, or structured light[189,190] to estimate relative depth. Although prior knowledge about the scale of specific objects in a scene enables absolute measurement, it was not until the advent of deep neural networks that object recognition became computationally tractable[191]. Stereo depth estimation (SDE), on the other hand, leverages the known geometry of multiple cameras to estimate depth in absolute units, and it has long been studied as a fundamental problem relevant to robotic navigation, manipulation, 3D modeling, surveying, and augmented reality applications[192,193]. Traditional approaches in this area used patch-based statistics[194] or edge detectors to identify sparse point correspondences between images, from which a dense disparity map can be interpolated[193,195]. Alternatively, dense correspondences can be established directly using local window-based techniques[196,197] or global optimization algorithms[198,199], which often make assumptions about the surfaces being imaged to constrain the energy function[193]. However, the advent of deep learning revolutionized approaches to both monocular and stereo depth estimation, proving particularly advantageous for the challenging domain of surgical images.
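To make the local window-based matching and the disparity-to-depth relationship concrete, the sketch below matches a single rectified scanline by sum-of-absolute-differences (SAD) and then applies the standard pinhole relation, depth = focal length x baseline / disparity. It is an illustrative toy (window size, search range, and values are made up), not a method from the cited works.

```python
def disparity(left, right, x, window=1, max_disp=4):
    """Best disparity at column x of a rectified scanline pair via SAD.

    For a point at left column x, its match in the right image lies at
    column x - d; we search d in [0, max_disp] and keep the lowest cost.
    """
    def sad(d):
        return sum(abs(left[x + k] - right[x - d + k])
                   for k in range(-window, window + 1))
    cands = [d for d in range(max_disp + 1)
             if x - d - window >= 0 and x + window < len(left)]
    return min(cands, key=sad)

def depth(disp, focal_px, baseline_m):
    """Absolute depth (meters) from disparity (pixels), pinhole model."""
    return focal_px * baseline_m / disp
```

Real stereo endoscopes apply the same matching over full 2D windows and every pixel, producing the dense disparity map from which absolute depth follows once the rig's focal length and baseline are calibrated.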

Within that domain, depth estimation is a valuable step toward geometric understanding in real time. As opposed to 3D reconstruction or structure-from-motion algorithms, described below, depth estimation requires only a single frame of monocular or stereo video, meaning this geometric snapshot is just as reliable as post-hoc analysis that leverages future frames as well as past ones. Furthermore, it makes no assumptions about the geometric consistency of the anatomy throughout the surgery. In combination with detection and recognition algorithms, such as those above, depth estimation provides a semantically meaningful 3D representation of the geometry of a surgical procedure, particularly for minimally invasive procedures that rely on visual guidance. Traditionally, laparoscopic, endoscopic, and other visible-spectrum image-guided procedures relied on monocular vision systems, which were sufficient when provided as raw guidance on a flat monitor[200,201]. The introduction of 3D monitors, which display stereo images using glasses to separate the left- and right-eye images, enabled stereo endoscopes and laparoscopes to be used clinically[202], proving invaluable to clinicians for navigating through natural orifices or narrow