Ding et al. Art Int Surg 2024;4:109-38 https://dx.doi.org/10.20517/ais.2024.16 Page 117
simultaneous tool detection, segmentation, and geometric primitive extraction in laparoscopic non-robotic surgery. It includes tool presence, segmentation masks, and geometric primitives, supporting comprehensive tool detection and pose estimation tasks. Datasets capturing surgical tools in realistic conditions are essential for accurate pose estimation. The 3D Surgical Tools (3dStool) dataset[78] was constructed to include RGB images of surgical tools in action alongside their 3D poses. Four surgical tools were chosen: a scalpel, scissors, forceps, and an electric burr. The tools were recorded operating on a cadaveric knee to accurately mimic real-life conditions. The dVPose dataset[79] is a realistic multi-modality dataset intended for the development and evaluation of real-time single-shot deep-learning-based 6D pose estimation algorithms on a head-mounted display. It includes comprehensive data for vision-based 6D pose estimation, featuring synchronized images from the RGB, depth, and grayscale cameras of the HoloLens 2 device, and captures the extra-corporeal portions of the instruments and endoscope of a da Vinci surgical robot.
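Concretely, a 6D (i.e., 6-DoF) pose annotation in such datasets pairs a 3D rotation with a 3D translation that maps points from the tool's model frame into the camera frame. The sketch below illustrates this with a quaternion-plus-translation record; the field names and numeric values are purely illustrative and not taken from any specific dataset:

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Hypothetical 6-DoF annotation for one tool in one frame:
# orientation as a quaternion plus a translation in metres.
pose = {
    "rotation": np.array([1.0, 0.0, 0.0, 0.0]),   # identity orientation
    "translation": np.array([0.05, -0.02, 0.30]), # camera-frame offset
}

R = quat_to_rotmat(pose["rotation"])
p_model = np.array([0.0, 0.0, 0.1])           # a point on the tool model
p_camera = R @ p_model + pose["translation"]  # same point in the camera frame
# With the identity rotation, the point simply shifts by the translation,
# ending up 0.40 m deep along the camera z-axis.
```

A pose estimator trained on such a dataset predicts the rotation and translation for each frame; evaluation then compares the predicted transform against the annotated one.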
Simulated and synthetic datasets can provide large-scale data and annotations through controlled environments. The Edinburgh Simulated Surgical Tools Dataset[80] includes RGBD images of five simulated surgical tools (two scalpels, two clamps, and one tweezer), featuring both synthetic and real images. Synthetic Surgical Needle 6DoF Pose Datasets were generated with the AMBF simulator and assets from the Surgical Robotics Challenge[81]. These synthetic datasets focus on the 6DoF pose estimation of surgical needles[82], providing a controlled environment for developing and testing pose estimation algorithms. The POV-Surgery dataset[83] offers a large-scale, synthetic, egocentric collection focusing on pose estimation for hands with different surgical gloves and three orthopedic instruments: scalpel, friem, and diskplacer. It consists of high-resolution RGB-D video streams, activity annotations, accurate 3D and 2D annotations for hand-object poses, and 2D hand-object segmentation masks.
Pose estimation in surgical settings also extends to the use of X-ray imaging. i3PosNet[84] introduced three datasets: two synthetic Digitally Rendered Radiograph (DRR) datasets (one with a screw and the other with two surgical instruments) and a real X-ray dataset with manually labeled screws. These datasets facilitate the study and development of pose estimation algorithms using X-ray images in temporal bone surgery.
Table 1 summarizes the datasets mentioned in this section.
Segmentation and detection
Segmentation and detection aim to identify target objects in a static observation, representing them either as space occupancy via pixel- or voxel-level labels (segmentation) or as simplified bounding boxes around target instances (detection). Segmentation and detection methods play a foundational role in geometric understanding of complex scenarios such as surgery, providing basic geometric information by indicating which areas are occupied by objects. In this section, we focus on the segmentation and detection of videos and images. The section is divided into frame-wise and video-wise segmentation and detection, namely, (a) single frame segmentation and detection and (b) video object segmentation and tracking.
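The two output forms can be contrasted on a toy example: a segmentation mask records per-pixel occupancy, while a detection reduces the same object to an axis-aligned bounding box (here simply derived from the mask; the array shape and values are illustrative):

```python
import numpy as np

# Toy 8x8 binary segmentation mask: 1 where the instrument occupies a pixel.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1  # instrument occupies rows 2-4, columns 3-6

# Detection-style summary: the tight bounding box enclosing the mask,
# in (x_min, y_min, x_max, y_max) pixel coordinates.
ys, xs = np.nonzero(mask)
bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

print(bbox)        # → (3, 2, 6, 4)
print(mask.sum())  # exact pixel-level occupancy: 12 pixels
```

The mask preserves the object's exact extent (12 occupied pixels), whereas the box only bounds it; this trade-off between annotation cost and geometric precision is what separates the two tasks.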
Single frame segmentation and detection
Segmentation and detection in 2D space have received more attention, since they pose easier goals and benefit from more accurate, larger-scale annotations. Owing to their different demands, detection and segmentation methods developed from different perspectives. Traditional segmentation methods rely on low-level features such as edges and region colors, discriminating regions with thresholding or clustering techniques[85-87]. Traditional detection methods rely on hand-crafted features and predict based on template matching[88-90]. With the emergence of deep learning algorithms, both detection and

