GEOMETRIC SCENE UNDERSTANDING TASKS
In this section, we review and analyze existing geometric scene understanding tasks, together with the corresponding datasets and methods, that serve as the building blocks of the DT framework. The tasks' functionality and their relative roles in building and updating the DT are illustrated in Figure 4. We mainly introduce four large categories: segmentation and detection, depth estimation, 3D reconstruction, and pose estimation. Segmentation and detection, in both static images and videos, focus on identifying and isolating the targets and decomposing the entire scene into semantic components. This procedure is the prerequisite for building and consistently updating the DT. While depth estimation extracts pixel-level depth information of the entire scene from static images, the resulting depth map is limited in accuracy and representational ability due to its grid-based representation. It is therefore not suitable as the sole source for building and updating the digital twin and is often employed together with segmentation/detection and 3D reconstruction methods. Based on the identified targets, 3D reconstruction methods take multiple observations and extract more accurate and detailed geometric information to form 3D models of all components. Pose estimation tasks align the existing 3D models, together with their identities, to new observations. While all of these techniques contribute significantly to building and updating the digital model, the construction of the initial digital model relies mostly on segmentation, detection, and 3D reconstruction, following an "identify then reconstruct" principle. Once the digital model is initialized, the real-time updating of semantic components in the digital scene relies more heavily on pose estimation. We first summarize the availability of related data and materials in one subsection and then analyze the techniques in the following subsections, in the order of segmentation and detection, depth estimation, 3D reconstruction, and pose estimation.
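To make this division of labor concrete, the following minimal Python sketch shows how the four task families could be composed into a "build then update" loop. It is an illustration only, not an implementation from the surveyed literature: Component, DigitalTwin, segment_scene, reconstruct_component, and estimate_pose are hypothetical placeholders for task-specific models and data structures.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Component:
    """One semantic component of the digital twin (e.g., a tool or organ)."""
    label: str
    mesh: np.ndarray   # placeholder for a 3D model (N x 3 vertices)
    pose: np.ndarray   # 4 x 4 rigid transform in the scene frame


@dataclass
class DigitalTwin:
    components: dict = field(default_factory=dict)


# --- Hypothetical stand-ins for learned task-specific models ----------
def segment_scene(frames):
    """Segmentation/detection: decompose frames into per-component observations."""
    return {"instrument": frames}  # placeholder output


def reconstruct_component(label, observations):
    """3D reconstruction: fuse multiple observations into a 3D model."""
    return np.zeros((1, 3)), np.eye(4)  # placeholder mesh and initial pose


def estimate_pose(component, frame):
    """Pose estimation: align an existing 3D model with a new observation."""
    return component.pose  # placeholder: pose unchanged


def build_twin(init_frames) -> DigitalTwin:
    """"Identify then reconstruct": segmentation/detection feeds 3D reconstruction."""
    twin = DigitalTwin()
    for label, obs in segment_scene(init_frames).items():
        mesh, pose = reconstruct_component(label, obs)
        twin.components[label] = Component(label, mesh, pose)
    return twin


def update_twin(twin: DigitalTwin, frame) -> None:
    """Real-time updating: pose estimation moves the already-built models."""
    for comp in twin.components.values():
        comp.pose = estimate_pose(comp, frame)
```

In a real system, each placeholder would be replaced by methods from the task families surveyed below, and the "identify then reconstruct" path would be re-run whenever a new component enters the scene.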
Availability of data and materials
Segmentation and detection
Segmentation and detection, as fundamental tasks in surgical data science, receive an enormous amount of attention from the community. Following the success of the EndoVis challenges[50,51], many challenges with corresponding datasets have been proposed. EndoVis[50,51] collected data from abdominal porcine procedures with the da Vinci surgical robot and manually annotated the surgical instruments and some tissue anatomy for segmentation. The SAR-RARP50[52] and CATARACTS[53] challenges released in-vivo datasets for semantic segmentation in real procedures. SurgT[54] and STIR[55] provide annotations for tissue tracking. The RobustMIS challenge[56] divided its test data into three categories: the same procedures as training, the same surgery but different procedures, and different but similar surgery. With these three levels of similarity, the challenge aimed to assess algorithms' robustness against domain shift. SegSTRONG-C[57] collected ex vivo data with manually added corruptions such as smoke, blood, and low brightness to assess models' robustness against non-adversarial corruptions unseen in the training data. Besides challenges, researchers also collect data to support algorithm development and publications[58-61].
Although various datasets are available for segmentation and detection, the complexity of segmentation and detection annotation keeps these datasets at scales (<10k images) far smaller than general vision datasets (MS COCO[62]: 328k images; Objects365[63]: 600k images). Thus, the SurgToolLoc challenge[64] provided tool presence annotations as weak labels in 24,695 video clips, for machine learning models to be trained to detect and localize tools in video frames with bounding boxes. Lap. I2I Translation[65] attempted to generate a larger-scale dataset using an image-to-image translation approach from synthetic data. SegSTRONG-C[57] provided a foundation model[66]-assisted annotation protocol to expedite the annotation process. Despite these efforts, the demand for a large-scale and uniform dataset remains pressing.
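As one illustration of the foundation model-assisted annotation idea mentioned above, the sketch below turns a cheap bounding-box annotation into a dense mask proposal using a promptable segmentation foundation model, which an annotator would then verify and correct. It assumes the publicly available segment_anything package (SAM) and a locally downloaded checkpoint; the checkpoint path is illustrative, and the actual SegSTRONG-C protocol may differ from this box-to-mask workflow.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Assumed: a ViT-H SAM checkpoint downloaded locally (path is illustrative).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)


def propose_mask(image: np.ndarray, box_xyxy: np.ndarray) -> np.ndarray:
    """Turn a coarse bounding-box annotation into a dense mask proposal.

    The annotator only draws the box; the foundation model fills in the
    pixel-level mask, which is then manually verified and corrected.
    """
    predictor.set_image(image)  # image: H x W x 3, uint8, RGB
    masks, scores, _ = predictor.predict(
        box=box_xyxy,            # [x0, y0, x1, y1] box prompt
        multimask_output=False,  # single best mask per box
    )
    return masks[0]              # boolean H x W mask proposal
```

Even with the overhead of human verification, prompting masks from boxes is typically far cheaper than drawing pixel-level masks from scratch, which is the annotation cost gap this paragraph describes.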

