
Ding et al. Art Int Surg 2024;4:109-38  https://dx.doi.org/10.20517/ais.2024.16     Page 121

incisions [200,203]. Robot-assisted surgical systems like the da Vinci surgical system likewise use a 3D display in the surgeon console to provide a sense of depth. The increasing prevalence of stereo camera systems in surgery has therefore motivated the development of depth estimation algorithms for both modalities in this challenging domain.


Surgical video poses challenges for MDE in particular because it features deformable surfaces under variable lighting conditions, specular reflectances, and predominant motion along the non-ideal camera axis [204,205]. Deep neural networks (DNNs) offer a promising pathway for addressing these challenges by learning to regress dense depth maps consistently based on a priori knowledge of the target domain in the training set [206]. In this context, obtaining reliable ground truth is of the highest importance, and in general, this is obtained using non-monocular methods, while the DNN is restricted to analyzing the monocular image. Visentini-Scarzanella et al. [207] and Oda et al. [208], for example, obtained depth maps for colonoscopic video by rendering depth maps based on a reconstruction of the colon from CT. A more scalable approach uses the depth estimate from stereo laparoscopic [209] or arthroscopic [210] video, but this still requires stereo video data, which may not be available for procedures typically performed with monocular scopes. Using synthetic data rendered from photorealistic virtual models is a highly scalable method for generating images and depth maps, but DNNs must then overcome the sim-to-real gap [211-215]. Refs [204,216] explore the idea of self-supervised MDE by reconstructing the anatomical surface with structure-from-motion techniques but restricting the DNN to a single frame, an approach that was later widely adopted [217,218]. Incorporating temporal consistency from recent frames can yield further improvements [219,220]. Datasets such as EndoSLAM [67] for endoscopy, SCARED [68] for laparoscopy, ref. [71] for arthroscopy, and ref. [221] for colonoscopy use these methods to make ground truth data more widely available for training [218,222-224]. For many of these methods, a straightforward DNN architecture with an encoder-decoder structure and a straightforward loss function was used [204,211,212,216], although geometric constraints such as edge [208,225] or surface [223] consistency may be incorporated into the loss function. More drastic innovations explicitly confront the unique challenges of surgical videos, such as artificially removing smoke, which is frequently present due to cautery tools, using a combined GAN and U-Net architecture that simultaneously estimates depth [226].
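The self-supervised formulation described above can be illustrated concretely: the network predicts depth for a single frame, a neighboring frame is warped into that view using the predicted depth and the camera motion, and a photometric error between the warped and actual frames supplies the training signal, with no ground-truth depth required. The following is a minimal NumPy sketch under simplifying assumptions (pinhole camera, rectified purely horizontal camera translation, nearest-neighbor sampling); all names are illustrative, not from any cited method.

```python
import numpy as np

def warp_photometric_loss(target, source, depth, fx, tx):
    """Warp `source` into the target view using predicted `depth` and a
    horizontal camera translation `tx`, then return the mean L1 photometric
    error -- the self-supervision signal for a depth network.
    For a rectified pinhole camera, translation tx shifts each pixel by
    disparity d = fx * tx / depth along the x axis."""
    h, w = target.shape
    loss, count = 0.0, 0
    for v in range(h):
        for u in range(w):
            disparity = fx * tx / depth[v, u]
            us = int(round(u + disparity))  # nearest-neighbor sample
            if 0 <= us < w:
                loss += abs(target[v, u] - source[v, us])
                count += 1
    return loss / max(count, 1)

# Toy check: with the true depth, the warped source reproduces the target,
# so the photometric loss vanishes.
fx, tx = 100.0, 0.1
depth_true = np.full((4, 16), 5.0)               # constant depth -> disparity 2 px
target = np.tile(np.arange(16.0), (4, 1))        # horizontal intensity ramp
source = np.tile(np.arange(16.0) - 2.0, (4, 1))  # same scene shifted by 2 px
print(warp_photometric_loss(target, source, depth_true, fx, tx))  # -> 0.0
```

In practice the warp is made differentiable (bilinear sampling), the camera motion comes from a jointly trained pose network, and robust losses handle the specularities and deformation noted above.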
As in detection and recognition tasks, large-scale foundation models have been developed with MDE in mind. The Depth Anything model leverages both labeled (1.5M) and unlabeled (62M) images from real-world datasets to massively scale up the data available for training DNNs [227]. Rather than using structure-from-motion or other non-monocular methods to obtain ground truth for unlabeled data, Depth Anything uses a previously obtained MDE teacher model to generate pseudo-labels for unlabeled images, preserving semantic features between the student and teacher models. Although trained using real-world images, Depth Anything's zero-shot performance on endoscopic and laparoscopic video is nevertheless comparable to specialized models in terms of speed and performance [228]. It remains to be seen whether foundation models trained on real-world images will yield substantial benefits for surgical videos after fine-tuning or other transfer learning methods are explored.
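The pseudo-labeling strategy described above can be sketched schematically: a frozen teacher predicts depth for unlabeled images, and those predictions serve as regression targets for the student. The toy example below uses a two-parameter linear "student" in place of a DNN and omits the strong input perturbations and semantic-feature alignment the real pipeline applies; all names are hypothetical stand-ins.

```python
import random

def teacher(image):
    # Stand-in for a pretrained MDE teacher: a fixed affine map from
    # pixel intensity to depth. In practice this is a frozen DNN.
    return [2.0 * px + 1.0 for px in image]

def train_student(unlabeled_images, lr=0.01, epochs=200):
    """Fit a tiny student model (depth = a*px + b) to the teacher's
    pseudo-labels by stochastic gradient descent on an L2 loss."""
    a, b = 0.0, 0.0
    for _ in range(epochs):
        for img in unlabeled_images:
            pseudo = teacher(img)  # pseudo-labels: no ground truth needed
            for px, d in zip(img, pseudo):
                err = (a * px + b) - d
                a -= lr * err * px
                b -= lr * err
    return a, b

random.seed(0)
images = [[random.random() for _ in range(8)] for _ in range(20)]
a, b = train_student(images)
print(a, b)  # student recovers the teacher's mapping (close to 2.0, 1.0)
```

The point of the scheme is scale: once the teacher exists, every unlabeled image becomes usable training data, which is how the 62M-image unlabeled pool is exploited.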

For stereo depth estimation, DNNs have likewise shown vast improvements over conventional approaches. The ability of CNNs to extract salient local features has proved efficacious for establishing point correspondences based on image appearance, compared to handcrafted local or multi-scale window operators [229,230], leading to similar approaches on stereo laparoscopic video [231,232]. With regard to dense SDE, however, the ability of vision transformers [233] to train attention mechanisms on sequential information has proved especially apt for overcoming the challenges of generating globally consistent disparity maps, especially over smooth, deformable, or highly specular surfaces often encountered in surgical video [234-237].
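Whether produced by a handcrafted matcher or a learned network, the disparity map relates to metric depth through the rig's calibration: Z = f·B / d for focal length f, baseline B, and disparity d. The sketch below pairs a toy 1-D block-matching step (standing in for the learned correspondence; the local SAD search is exactly what attention-based aggregation improves on) with that conversion; parameter values are illustrative.

```python
import numpy as np

def match_disparity(left, right, max_disp, block=3):
    """Toy 1-D block matching: for each pixel in the left scanline, find
    the horizontal shift into the right scanline minimizing the sum of
    absolute differences (SAD). Learned stereo networks replace this
    local search with deep features and global aggregation."""
    w = len(left)
    disp = np.zeros(w, dtype=int)
    half = block // 2
    for u in range(half, w - half):
        patch = left[u - half:u + half + 1]
        costs = []
        for d in range(max_disp + 1):
            if u - d - half < 0:
                break
            cand = right[u - d - half:u - d + half + 1]
            costs.append(np.abs(patch - cand).sum())
        disp[u] = int(np.argmin(costs))
    return disp

def disparity_to_depth(disp, fx, baseline):
    # Standard rectified pinhole relation: Z = fx * B / d (for d > 0).
    disp = np.asarray(disp, dtype=float)
    return np.where(disp > 0, fx * baseline / np.maximum(disp, 1e-6), np.inf)

left = np.array([0, 0, 9, 1, 7, 3, 8, 2, 0, 0], dtype=float)
right = np.roll(left, -2)  # scene shifted by 2 px: true disparity = 2
disp = match_disparity(left, right, max_disp=4)
print(disp[4])                                            # -> 2
print(disparity_to_depth(disp, fx=500.0, baseline=0.004)[4])  # -> 1.0 (metres)
```

The smooth, textureless, and specular surfaces mentioned above are precisely where such local cost minima become ambiguous, motivating the globally consistent, attention-based disparity estimation the cited works pursue.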
               When combined with object recognition and segmentation networks, as described above, they can be used