
Ding et al. Art Int Surg 2024;4:109-38  https://dx.doi.org/10.20517/ais.2024.16     Page 117

               simultaneous tool detection, segmentation, and geometric primitive extraction in laparoscopic non-robotic
               surgery. It includes tool presence, segmentation masks, and geometric primitives, supporting
               comprehensive tool detection and pose estimation tasks. Datasets capturing surgical tools in realistic
               conditions are essential for accurate pose estimation. The 3D Surgical Tools (3dStool) dataset[78] was
               constructed to include RGB images of surgical tools in action alongside their 3D poses. Four surgical tools
               were chosen: a scalpel, scissors, forceps, and an electric burr. The tools were recorded operating on a
               cadaveric knee to accurately mimic real-life conditions. The dVPose dataset[79] offers a realistic multi-modality
               dataset intended for the development and evaluation of real-time single-shot deep-learning-based
               6D pose estimation algorithms on a head-mounted display. It includes comprehensive data for vision-based
               6D pose estimation, featuring synchronized images from the RGB, depth, and grayscale cameras of the
               HoloLens 2 device, and captures the extra-corporeal portions of the instruments and endoscope of a da
               Vinci surgical robot.
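The 6D poses annotated in these datasets combine three rotational and three translational degrees of freedom. The following minimal sketch (plain NumPy; the function names are illustrative and not taken from any of the cited datasets) packs such a pose into a 4x4 homogeneous transform and applies it to a point:

```python
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation matrix from a unit axis and an angle (radians)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])       # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose_to_transform(axis, angle, translation):
    """Pack a 6DoF pose (3 rotational + 3 translational DoF) into a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :3] = axis_angle_to_matrix(axis, angle)
    T[:3, 3] = translation
    return T

# Rotate a tool-tip point 90 degrees about z, then shift it 10 units along x.
T = pose_to_transform([0, 0, 1], np.pi / 2, [10.0, 0.0, 0.0])
tip = np.array([1.0, 0.0, 0.0, 1.0])   # homogeneous coordinates
print(np.round((T @ tip)[:3], 6))      # rotated about z, then translated along x
```

Estimating the six parameters of such a transform from images is precisely the task these datasets support.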

               Simulated and synthetic datasets can provide large-scale data and annotations through controlled
               environments. The Edinburgh Simulated Surgical Tools Dataset[80] includes RGBD images of five simulated
               surgical tools (two scalpels, two clamps, and one tweezer), featuring both synthetic and real images.
               Synthetic Surgical Needle 6DoF Pose Datasets[81] were generated with the AMBF simulator and assets from
               the Surgical Robotics Challenge[82]. These synthetic datasets focus on the 6DoF pose estimation of surgical
               needles, providing a controlled environment for developing and testing pose estimation algorithms. The
               POV-Surgery dataset[83] offers a large-scale, synthetic, egocentric collection focusing on pose estimation for
               hands with different surgical gloves and three orthopedic instruments: scalpel, friem, and diskplacer. It
               consists of high-resolution RGB-D video streams, activity annotations, accurate 3D and 2D annotations for
               hand-object poses, and 2D hand-object segmentation masks.


               Pose estimation in surgical settings also extends to the use of X-ray imaging. i3PosNet[84] introduced three
               datasets: two synthetic Digitally Rendered Radiograph (DRR) datasets (one with a screw and the other with
               two surgical instruments) and a real X-ray dataset with manually labeled screws. These datasets facilitate the
               study and development of pose estimation algorithms using X-ray images in temporal bone surgery.
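A DRR simulates an X-ray by integrating attenuation along rays through a CT volume. Under a simplifying parallel-beam assumption, this reduces to a sum along one volume axis followed by Beer-Lambert attenuation; the sketch below illustrates the idea only (real DRR pipelines such as those behind the i3PosNet datasets use perspective ray casting with calibrated projection geometry):

```python
import numpy as np

def parallel_beam_drr(volume, axis=0):
    """Minimal parallel-beam DRR: integrate attenuation along one axis
    and convert the line integrals to X-ray intensities (Beer-Lambert)."""
    line_integrals = volume.sum(axis=axis)            # per-ray attenuation integral
    intensity = np.exp(-line_integrals)               # Beer-Lambert law
    # Normalize to [0, 1] for display; dark pixels correspond to dense structures.
    span = intensity.max() - intensity.min()
    return (intensity - intensity.min()) / (span + 1e-12)

# Toy 32^3 "CT" volume: weak soft-tissue attenuation plus a dense screw-like rod.
vol = np.full((32, 32, 32), 0.01)
vol[8:24, 15:17, 15:17] = 0.5                         # metal rod along the first axis
drr = parallel_beam_drr(vol, axis=0)
print(drr.shape)  # → (32, 32)
```

The dense rod projects to a dark region in the synthetic radiograph, which is why metallic instruments such as screws are comparatively easy to annotate in DRRs.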

               Table 1 summarizes the datasets mentioned in this section.


               Segmentation and detection
               Segmentation and detection aim to identify target objects in a static observation, representing them either as
               pixel/voxel-level space-occupancy labels or as simplified bounding boxes around target instances, respectively.
               Segmentation and detection methods play a foundational role in performing geometric understanding for a
               complex scenario like surgery and provide basic geometric information by indicating which areas are
               occupied by objects. In this section, we focus on the segmentation and detection of videos and images. This
               section is divided into frame-wise and video-wise segmentation and detection, namely, (a) single frame
               segmentation and detection and (b) video object segmentation and tracking.
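The two output representations are closely related: a segmentation mask can always be collapsed into a detection-style bounding box, while the reverse loses the pixel-level occupancy. A minimal sketch (plain NumPy; the function name is illustrative):

```python
import numpy as np

def mask_to_bbox(mask):
    """Collapse a binary segmentation mask to an axis-aligned bounding box.

    Returns (x_min, y_min, x_max, y_max) in pixel coordinates,
    or None for an empty mask.
    """
    ys, xs = np.nonzero(mask)                 # row/column indices of occupied pixels
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# A toy 6x6 "tool" mask occupying rows 2-4 and columns 1-3.
mask = np.zeros((6, 6), dtype=bool)
mask[2:5, 1:4] = True
print(mask_to_bbox(mask))  # → (1, 2, 3, 4)
```

This asymmetry is why segmentation annotations are generally more expensive to produce than detection annotations, but also strictly more informative.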


               Single frame segmentation and detection
               Owing to the availability of more accurate and larger-scale annotations, segmentation and detection in 2D
               space have received the most attention. Because they serve different demands, detection and segmentation
               methods developed from different perspectives. Traditional segmentation methods rely on low-level features
               such as edges and region colors, discriminating regions with thresholding or clustering techniques[85-87].
               Traditional detection methods rely on hand-crafted features and predict based on template matching[88-90].
               With the emergence of deep learning algorithms, both detection and