
Glaser et al. Art Int Surg. 2025;5:1-15  https://dx.doi.org/10.20517/ais.2024.36      Page 9

                               Figure 2. Forest plot showing weight distribution of the different spino-pelvic parameters.

               End-to-end runtimes of automated measurement pipelines ranged from 2 to 75 s [23,24], up to 17× faster
               than manual analysis; most systems took under 20 s [19,23,35], which is adequate for surgical use.
               Inference-only times were often sub-second [23,27]. Faster measurement allows more intraoperative images
               to be analyzed in support of surgical decisions. However, detailed computational profiling was generally
               lacking, which impedes comparison between systems. Cloud-based implementations could make these
               techniques broadly accessible.
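The distinction between end-to-end and inference-only runtime can be made explicit by timing each pipeline stage separately. The following is a minimal sketch, not taken from any of the reviewed studies; the function `profile_pipeline` and the three stage callables are hypothetical placeholders standing in for a real preprocessing step, CNN forward pass, and parameter computation.

```python
import time

def profile_pipeline(preprocess, infer, postprocess, image):
    """Time each stage of a measurement pipeline separately (all stages are hypothetical)."""
    t0 = time.perf_counter()
    x = preprocess(image)
    t1 = time.perf_counter()
    y = infer(x)
    t2 = time.perf_counter()
    result = postprocess(y)
    t3 = time.perf_counter()
    timings = {
        "preprocess_s": t1 - t0,
        "inference_s": t2 - t1,    # inference-only time
        "postprocess_s": t3 - t2,
        "end_to_end_s": t3 - t0,   # what the user actually waits for
    }
    return result, timings

# Placeholder stages for illustration only; a real pipeline would load a
# radiograph, run a trained network, and derive angles from its output.
result, timings = profile_pipeline(
    preprocess=lambda img: [v / 255.0 for v in img],
    infer=lambda x: sum(x),
    postprocess=lambda y: round(y, 3),
    image=[0, 128, 255],
)
print(timings)
```

Reporting all four numbers, together with the hardware used, would make runtimes from different studies easier to compare.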


               Studies used statistical comparisons between automated and manual measurements for validation,
               incorporating Bland-Altman analysis [19,23,25,27,31,35], paired significance tests [19,23,27,35], linear
               regression [19,23,25,27,31,35], Pearson correlation coefficients [19,23,25,27,31,35], and intra-class
               correlation coefficients [19,23-25]. Manual measurement reliability was sometimes quantified [27]. Both
               preoperative [19,23-25,27,31,32,35] and postoperative subjects [19,24,37] were included, although only
               Kim et al. performed validation in distinct pre- and postoperative cohorts [24]. Most evaluations used
               held-out testing data from the same institution as model development; multicenter validation was absent.
               Generalizability beyond the typically homogeneous training populations requires further scrutiny.
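Two of the agreement statistics named above, Bland-Altman analysis and the Pearson correlation coefficient, can be computed in a few lines. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the paired angles are invented illustrative values, not data from any reviewed study.

```python
import math
from statistics import mean, stdev

def bland_altman(auto, manual):
    """Return the bias and 95% limits of agreement for paired measurements."""
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical paired angles in degrees (illustrative values only).
manual = [50.0, 42.0, 61.0, 38.0, 55.0]
auto = [51.0, 41.0, 63.0, 38.0, 54.0]

bias, lower, upper = bland_altman(auto, manual)
r = pearson_r(auto, manual)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f}), r={r:.3f}")
```

A high correlation alone does not establish agreement, since a constant offset between methods leaves r unchanged; the Bland-Altman bias and limits of agreement capture that offset directly.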


               CNN backbones ranged from VGG [19] and U-Nets [31,36] to ResNets [24,25,33]. Both feedforward [19,25] and
               fully convolutional layouts were used. Custom network engineering was common [19,23-25,27,31,32,35], given
               insufficient anatomical representational power in generic classification architectures. Pretraining on
               natural images via Mask R-CNN [36] and DetectNet [34] helped offset smaller target dataset sizes.
               Segmentation-based approaches employed secondary algorithms on CNN outputs to estimate spinal
               parameters [24,25,31,35,36], adding measurement variability. End-to-end sagittal measurement could
               minimize error propagation within integrated networks.
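To illustrate how a secondary algorithm can add variability, the sketch below derives an angle from endplate landmarks and shows how a one-pixel shift in a single landmark changes the result. It is an assumption-laden toy example, not the method of any reviewed study: the function names and coordinates are hypothetical, and the angle is treated as the angle between two undirected lines, so it is confined to the 0 to 90 degree range.

```python
import math

def line_angle_deg(p1, p2):
    """Orientation of the line through p1 and p2, in degrees."""
    return math.degrees(math.atan2(p2[1] - p1[1], p2[0] - p1[0]))

def angle_between_endplates(upper, lower):
    """Angle (0-90 degrees) between two endplate lines, each a pair of (x, y) points."""
    diff = abs(line_angle_deg(*upper) - line_angle_deg(*lower)) % 180.0
    return min(diff, 180.0 - diff)

# Hypothetical landmark coordinates in pixels, as might be extracted
# from the corners of a segmentation mask.
upper_endplate = ((0.0, 0.0), (10.0, 0.0))
lower_endplate = ((0.0, 0.0), (10.0, 10.0))
angle = angle_between_endplates(upper_endplate, lower_endplate)

# Shift one landmark by a single pixel to mimic a small segmentation error.
shifted_endplate = ((0.0, 0.0), (10.0, 11.0))
perturbed = angle_between_endplates(upper_endplate, shifted_endplate)

print(f"angle={angle:.2f} deg, after 1 px shift={perturbed:.2f} deg")
```

In this toy geometry the endplate is only about 14 pixels long, so a one-pixel landmark error moves the angle by several degrees; longer endplates in higher-resolution images would be less sensitive.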


               Reported batch sizes during neural network training spanned 16-256. However, 10 studies did not specify
               this optimization detail at all [18-31,34,36] . Small batches can enhance generalization and reduce overfitting, but at
               a computational cost. Larger batches offer efficiency yet may miss anomalous cases. Standardization would
               benefit reproducibility. The median batch size was 64 [24,31,33,36] , aligning with typical practices.