The number of training epochs for deep neural networks ranged from 30 to 6,000, but again, most studies omitted specifics[18-31,34,36]. Two reports described adaptive epoch counts driven by validation improvement[24,36] rather than fixed values; typical regimes were 30-50 epochs[31,33]. Standardized reporting of these details would aid reproducibility, and generalizability after shorter training warrants scrutiny where transfer learning was not employed.
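To illustrate the adaptive epoch counts mentioned above, the sketch below shows a generic early-stopping loop that halts training once a validation metric stops improving for a fixed number of epochs. It is a minimal sketch, not a method from any reviewed study; the train_one_epoch and validate stubs and the patience value are hypothetical placeholders.

# Minimal early-stopping sketch (hypothetical; not from the reviewed studies).
# Training stops when validation error fails to improve for `patience` epochs,
# yielding an adaptive epoch count instead of a fixed one.
import random

def train_one_epoch():
    """Stand-in for one pass over the training set."""
    pass

def validate():
    """Stand-in for a validation error (e.g., mean angle error in degrees)."""
    return random.uniform(2.0, 6.0)  # placeholder metric

def fit(max_epochs=6000, patience=10):
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        val_error = validate()
        if val_error < best_error:
            best_error = val_error
            epochs_without_improvement = 0  # improvement: reset the counter
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopped after {epoch} epochs (best error {best_error:.2f})")
            break
    return best_error

if __name__ == "__main__":
    fit()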
The IJMEDI checklist for medical imaging AI highlighted several shortcomings (see tabulated results), particularly around enabling reproducibility. Areas such as software details, computational resource usage, model accessibility, and evaluation set specificity suffered from poor reporting. However, studies did well in conveying overall aims, statistical and evaluation methodology, and limitations. Recent initiatives for standardizing ML reporting[39,40], plus reproducibility checklists[38], may benefit new spine AI imaging research.
Despite promising accuracy, certain limitations remain. Most studies used single-institutional data lacking
sufficient diversity[19-21,23,25,28-30]. Reference standards derived from manual radiograph measurements intrinsically incorporate subjectivity from inter-observer variation[41]. CT imaging remains unevaluated, studies for some parameters are still few, and real-world clinical validation is lacking[42]. Our subgroup analyses found that
studies using CNN architectures demonstrated higher accuracy for parameters like lumbar lordosis
compared to other models. This highlights the importance of selecting appropriate architectures tailored to
the specific radiographic quantification task. As deep learning continues advancing, further research is still
needed to optimize model design and determine the most effective architectures for automated spinopelvic
measurement. Larger comparative studies evaluating different network architectures on common datasets
would help elucidate the relative merits and guide selection.
Moving forward, larger multicenter studies should validate these models before clinical
implementation[40,43]. Continued research on handling label noise and measurement uncertainty is required[13,41]. Standardized reporting guidelines could enhance reproducibility[40]. Models should be optimized across diverse settings and pathologies[42,43]. Clinically meaningful accuracy metrics deserve focus beyond raw error values alone[41].
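One common way to handle label noise and measurement uncertainty of this kind is to have a network predict a variance alongside each angle and train with a Gaussian negative log-likelihood, so that noisier reference measurements are down-weighted automatically. The sketch below is a generic illustration of that idea, not a technique from the reviewed studies; the toy angles and the two-output model it presumes are assumptions.

# Heteroscedastic regression loss sketch (hypothetical; illustrative only).
# The model outputs a predicted angle `mu` and a log-variance `log_var` per
# radiograph; samples with high predicted variance contribute less to the loss.
import numpy as np

def gaussian_nll(mu, log_var, target):
    """Per-sample negative log-likelihood of target under N(mu, exp(log_var)).

    Predicting log-variance keeps the variance positive and numerically stable;
    the constant 0.5*log(2*pi) is dropped since it does not affect optimization.
    """
    return 0.5 * (log_var + (target - mu) ** 2 / np.exp(log_var))

# Toy example: predicted lumbar lordosis angles (degrees) with uncertainties.
mu = np.array([45.0, 52.0, 38.0])      # predicted angles
log_var = np.array([0.0, 2.0, 0.5])    # predicted log-variances
target = np.array([47.0, 60.0, 39.0])  # manually measured reference angles

loss = gaussian_nll(mu, log_var, target).mean()
print(f"mean NLL: {loss:.3f}")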
The application of deep learning models and their potential role in spine surgery has already begun to be
explored. Of value to spine surgeons, models have demonstrated success in diagnosing various
musculoskeletal and spinal disorders, including sarcopenia, scoliosis, and low back pain[44-47]. Regarding prognosis, deep learning models have successfully predicted postoperative complications such as surgical site infections and 30-day readmission rates after lumbar fusion procedures[48,49]. While these initial
findings are promising, further research validating the use of these models in other realms of patient care,
particularly surgical planning, is needed.
Spinopelvic parameters are of great importance to the surgeon for planning, and methods of measurement
have evolved significantly. Early assessments began with the Cobb angle and focused on spinal curvatures
but overlooked the pelvis. In the 1980s and 1990s, the introduction of parameters such as PI, PT, and SS
revolutionized the understanding of sagittal balance. These measurements linked pelvic alignment to spinal
posture. By the 2000s, global spinal alignment gained attention, with the SVA and newer measures like the
T1 pelvic angle (TPA) becoming essential for surgical planning in adult spinal deformity (ASD).
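As background, the three pelvic parameters are geometrically linked: PI is a fixed anatomic constant equal to the sum of the position-dependent PT and SS. This standard relationship (general anatomy, not a finding of any study reviewed here) is:

PI = PT + SS

For example, a patient measured with PT = 20° and SS = 35° has PI = 55°; because PI does not change with posture, a rising PT accompanied by a falling SS signals compensatory pelvic retroversion rather than a change in pelvic morphology.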
Up until the early 2000s, measurement of spinopelvic parameters was mostly done manually and, on
average, took 3-15 min. The manual measurement process is tedious and time-consuming while also being