Ding et al. Art Int Surg 2024;4:109-38 https://dx.doi.org/10.20517/ais.2024.16 Page 121
incisions [200,203]. Robotic surgical systems such as the da Vinci surgical system likewise use a 3D display in
the surgeon console to provide a sense of depth. The increasing prevalence, therefore, of stereo camera
systems in surgery has motivated the development of depth estimation algorithms for both modalities in
this challenging domain.
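Stereo systems of this kind recover depth by triangulation: a scene point's horizontal disparity between the rectified left and right views is inversely proportional to its depth. A minimal sketch of this conversion follows, using hypothetical intrinsics (the focal length and baseline values below are illustrative, not taken from any particular endoscope):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) from a rectified stereo pair
    into metric depth via triangulation: Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Clamp tiny disparities to avoid division by zero at far points.
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Hypothetical parameters: 1000 px focal length, 4 mm stereo baseline.
depth = disparity_to_depth(np.array([[20.0, 40.0]]),
                           focal_px=1000.0, baseline_m=0.004)
# 20 px disparity -> 0.2 m; 40 px -> 0.1 m
```

The inverse relationship is why disparity errors matter most for distant, low-disparity structures.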
Surgical video poses particular challenges for MDE because it features deformable surfaces under variable
lighting conditions, specular reflections, and camera motion predominantly along the optical axis, which is non-ideal for reconstruction [204,205].
Deep neural networks (DNNs) offer a promising pathway for addressing these challenges by learning to
regress dense depth maps consistently based on a priori knowledge of the target domain in the training
set [206]. In this context, obtaining reliable ground truth is of the highest importance; in general, it is
obtained using non-monocular methods, while the DNN is restricted to analyzing the monocular image.
Visentini-Scarzanella et al. [207] and Oda et al. [208], for example, obtained depth maps for colonoscopic video
by rendering them from a CT-based reconstruction of the colon. A more scalable approach uses
the depth estimate from stereo laparoscopic [209] or arthroscopic [210] video, but this still requires stereo video
data, which may not be available for procedures typically performed with monocular scopes. Using
synthetic data rendered from photorealistic virtual models is a highly scalable method for generating images
and depth maps, but DNNs must then overcome the sim-to-real gap [211-215]. Refs [204,216] explore the idea of
self-supervised MDE by reconstructing the anatomical surface with structure-from-motion techniques while
restricting the DNN to a single frame, an approach that has since been widely adopted [217,218]. Incorporating temporal consistency from
recent frames can yield further improvements [219,220]. Datasets such as EndoSLAM [67] for endoscopy,
SCARED [68] for laparoscopy, ref [221] for arthroscopy, and ref [71] for colonoscopy use these methods to make
ground truth data more widely available for training [218,222-224]. For many of these methods, a straightforward
DNN architecture with an encoder-decoder structure and a simple loss function was used [204,211,212,216],
although geometric constraints such as edge [208,225] or surface [223] consistency may be incorporated into the
loss function. More drastic innovations explicitly confront the unique challenges of surgical videos, such as
artificially removing smoke, which is frequently present due to cautery tools, using a combined GAN and
U-Net architecture that simultaneously estimates depth [226].
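The self-supervised formulation above rests on a view-synthesis loss: a predicted depth map reprojects one frame into another, and the photometric discrepancy supervises the network. The sketch below illustrates the idea for the simplified case of a rectified, pure-horizontal camera translation with nearest-neighbour sampling; real pipelines use full 6-DoF poses, bilinear sampling, and SSIM terms, and all names here are hypothetical:

```python
import numpy as np

def photometric_loss(target, source, depth, focal_px, tx_m):
    """Self-supervised reconstruction loss, simplified: each target pixel
    (v, u) reprojects into the source frame at u' = u - f * tx / Z(v, u);
    the source is sampled there and compared to the target intensity."""
    h, w = target.shape
    u = np.tile(np.arange(w), (h, 1))
    disparity = focal_px * tx_m / depth          # reprojection shift in px
    u_src = np.rint(u - disparity).astype(int)   # nearest-neighbour sample
    valid = (u_src >= 0) & (u_src < w)           # ignore out-of-view pixels
    rows = np.tile(np.arange(h)[:, None], (1, w))
    recon = np.where(valid, source[rows, np.clip(u_src, 0, w - 1)], target)
    diff = np.abs(recon - target)[valid]
    return float(diff.mean()) if diff.size else 0.0

# A depth map consistent with the true 2 px shift reconstructs the
# target exactly, driving the loss to zero.
target = np.tile(np.arange(8.0), (4, 1))
source = np.roll(target, -2, axis=1)   # second view, shifted by 2 px
depth = np.ones((4, 8))                # implies disparity f*tx/Z = 2 px
loss = photometric_loss(target, source, depth, focal_px=100.0, tx_m=0.02)
# loss == 0.0
```

Minimizing this loss over many frame pairs is what lets the DNN learn depth without any explicit ground truth.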
As in detection and recognition tasks, large-scale foundation models have been developed with MDE in
mind. The Depth Anything model leverages both labeled (1.5M) and unlabeled (62M) images from real-
world datasets to massively scale up the data available for training DNNs [227]. Rather than using structure-
from-motion or other non-monocular methods to obtain ground truth for unlabeled data, Depth-Anything
uses a previously obtained MDE teacher model to generate pseudo-labels for unlabeled images, preserving
semantic features between the student and teacher models. Although trained using real-world images,
Depth Anything’s zero-shot performance on endoscopic and laparoscopic video is nevertheless comparable
to specialized models in terms of speed and performance [228]. It remains to be seen whether foundation
models trained on real-world images will yield substantial benefits for surgical videos after fine-tuning or
other transfer learning methods are explored.
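Two ingredients of this teacher-student recipe can be sketched concretely: generating pseudo-labels with a frozen teacher, and scoring predictions with a scale-and-shift-invariant error, since monocular pseudo-depth is only defined up to an affine ambiguity (a MiDaS-style alignment; the function names and toy teacher below are hypothetical, not the actual Depth Anything implementation):

```python
import numpy as np

def affine_invariant_error(pred, target):
    """Solve least-squares for the scale s and shift t aligning pred to
    target, then measure the residual |s * pred + t - target|."""
    p, t_ = pred.ravel(), target.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, t_, rcond=None)
    return float(np.abs(s * p + t - t_).mean())

def pseudo_label(teacher, unlabeled_images):
    """Distillation step: run a frozen teacher over unlabeled images to
    produce pseudo depth maps for training a student."""
    return [teacher(img) for img in unlabeled_images]

target = np.array([[1.0, 2.0], [3.0, 4.0]])
teacher = lambda img: 0.5 * img - 1.5        # toy affine-ambiguous teacher
pseudo = pseudo_label(teacher, [target])
err = affine_invariant_error(pseudo[0], target)   # ~0 after alignment
```

The affine invariance is what allows labeled and pseudo-labeled data with inconsistent depth scales to be mixed in one training set.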
For stereo depth estimation, DNNs have likewise yielded substantial improvements over conventional approaches.
The ability of CNNs to extract salient local features has proved efficacious for establishing point
correspondences based on image appearance, compared to handcrafted local or multi-scale window
operators [229,230] , leading to similar approaches on stereo laparoscopic video [231,232] . With regard to dense SDE,
however, the ability of vision transformers [233] to train attention mechanisms on sequential information has
proved especially apt for overcoming the challenges of generating globally consistent disparity maps,
especially over smooth, deformable, or highly specular surfaces often encountered in surgical video [234-237] .
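The correspondence problem these networks solve can be seen in the classical cost-volume skeleton they inherit: score every candidate disparity per pixel, then take the lowest-cost match. The sketch below uses raw intensities and winner-take-all selection; learned approaches substitute CNN features for the intensities and regularise the volume (transformers instead attend globally), and the function name is illustrative:

```python
import numpy as np

def match_disparity(left, right, max_disp):
    """Winner-take-all stereo matching: build a cost volume of absolute
    differences |left[v, u] - right[v, u - d]| over candidate disparities
    d, then pick the lowest-cost d per pixel."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        # Compare each left pixel with the right pixel shifted d columns.
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :w - d])
    return cost.argmin(axis=0)

# Right view is the left view shifted by a true disparity of 3 px;
# matching recovers 3 wherever the candidate range is fully valid.
left = np.tile(np.arange(10.0) ** 2, (2, 1))
right = np.roll(left, -3, axis=1)
disp = match_disparity(left, right, max_disp=5)
# disp[:, 3:] == 3
```

Textureless or specular surgical surfaces make these per-pixel costs ambiguous, which is precisely where globally consistent attention-based models help.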
When combined with object recognition and segmentation networks, as described above, they can be used

