Table 3. Summary of AI/ML studies discussed in the postoperative prognostication subsection

Area of investigation: Perioperative complication prediction and risk stratification
Selected studies:
- General technique: Berven et al., 2023 [38]
- Lumbar degenerative disease: Abdelrahman et al., 2022 [39]
- Pediatric deformity: Comstock et al., 2023 [40]; Lim et al., 2023 [41]
- Trauma: Yeretsian et al., 2022 [42]; Malacon et al., 2022 [43]

Area of investigation: Long-term outcome prognostication
Selected studies: Eliahu et al., 2022 [44]; Auloge et al., 2020 [45]; Burström et al., 2019 [46]; Elmi-Terander et al., 2020 [47]; Charles et al., 2021 [48]

AI: Artificial intelligence; ML: machine learning.
consuming tasks (e.g., image segmentation of the spine or robotic navigation).
Challenge 2: subjective outcome measures
In addition to the aforementioned challenges in developing a broad AI understanding of spine surgery
that arise from patient heterogeneity, another substantial barrier lies in current outcome measures
themselves. Many of the endpoints we follow are subjective or are influenced by a wide variety of factors that
AI may not be able to accurately capture in an unbiased manner. For example, endpoints such as pain and
functional status may be influenced by psychological factors. Endpoints such as the return to work may be
influenced by socioeconomic status. Endpoints such as the need for revision surgery may be influenced by
many factors, including preoperative comorbidities and postoperative access to care in addition to the
surgery itself. Postoperative pain medication use is influenced by preoperative levels of tolerance and
patterns of clinical prescription. It is critical that such models and their predictions do not lead clinicians to
select patients or surgical approaches in a way that perpetuates existing disparities. Potential solutions
include focusing on more immediate rather than long-term measures, prioritizing quantitative or
radiographic endpoints that can be measured in a validated manner, and using AI and new technologies to
develop novel outcome metrics that better capture the impact of spine surgery on patients' lives.
Challenge 3: tradeoffs in data quality and quantity
One of the central principles of ML is that capabilities and performance increase with ever-larger
datasets [15]. In particular, cutting-edge approaches such as deep learning and large language models (the
types of models underlying self-driving cars and ChatGPT, respectively) rely on immense amounts of data
to tune hundreds of billions of parameters, from which their intelligence emerges [62].
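As a hedged illustration of this scaling principle, the following sketch runs a learning-curve experiment with scikit-learn on purely synthetic tabular data. The dataset, model, and parameters are assumptions made for the example, not drawn from any study cited here; the point is simply that validation performance tends to rise with training-set size and plateau once the model saturates.

```python
# Illustrative only: performance as a function of training-set size.
# All data are simulated; no real patient data are involved.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Simulate a registry-like tabular dataset: 5,000 "patients", 20 features
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8,
                           random_state=0)

train_sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% to 100% of the training folds
    cv=5,                                  # 5-fold cross-validation
    scoring="roc_auc",
)

# Mean validation AUC generally improves, then plateaus, as n grows
for n, auc in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"n_train={n:>5d}  mean validation AUC={auc:.3f}")
```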
In spine surgery, large registries such as the Quality Outcomes Database (QOD), British Spine Registry,
and International Spine Study Group (ISSG) have aggregated patient data across numerous centers, and the largest ML studies may
incorporate thousands of patients. However, these numbers are likely sufficient only for tasks that take
simple categorical and numerical variables as inputs, not for tasks involving complex data such as
cross-sectional images, text, and video, which require immense amounts of data. Even for these simpler
tasks, healthcare databases often encounter
quality issues such as missing or incomplete data and variable practices across the sites where the data were
collected. Furthermore, as the number of variables per patient in the database increases, the difficulty of
expanding the dataset grows, limiting the number of patients incorporated and increasing the
administrative burden on centers that participate.
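As a small, hedged illustration of these data-quality issues, the sketch below audits missingness and per-site collection differences in a toy registry extract using pandas. Every column, value, and site name is invented for the example and implies nothing about QOD, the British Spine Registry, or ISSG.

```python
# Illustrative only: first-pass data-quality audit of a hypothetical
# multi-center registry extract. All values below are fabricated.
import pandas as pd

df = pd.DataFrame({
    "site":         ["A", "A", "B", "B", "C", "C"],
    "age":          [64, 71, 58, None, 66, 62],
    "odi_baseline": [42.0, None, 38.5, 41.0, None, None],  # hypothetical ODI scores
    "ct_available": [True, True, False, False, True, False],
})

# Fraction of missing values per variable across the whole extract
print(df.isna().mean().sort_values(ascending=False))

# Per-site missingness exposes variable collection practices across centers
print(df.set_index("site").isna().groupby(level="site").mean())
```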
Due to limitations in data quantity, many studies are validated using withheld patients or cross-validation
within the same single-center dataset, which may result in model overfitting and limited clinical utility.
Validation using independently collected external datasets would allow for improved assessments of model
accuracy and generalizability; a minimal sketch contrasting the two approaches follows this paragraph.
Even findings from multi-center studies may be affected by this problem, as
the datasets are not completely independent of one another. In addition, some studies used large national
datasets that may have limited granularity of clinically relevant variables, potentially limiting their models'
clinical applicability.
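The contrast between internal cross-validation and external validation can be made concrete with a short scikit-learn sketch. Everything below is simulated under stated assumptions (a single synthetic task with an artificially shifted "external" cohort); it is not a reproduction of any cited study's methodology.

```python
# Illustrative only: internal cross-validation vs. external validation.
# All data are simulated under the assumptions described above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# One synthetic task, split into a "development" cohort and an "external"
# cohort whose features are shifted to mimic site-to-site drift.
X, y = make_classification(n_samples=1200, n_features=15, n_informative=5,
                           random_state=1)
X_dev, y_dev = X[:800], y[:800]
X_ext, y_ext = X[800:] + 0.25, y[800:]  # crude covariate shift, for illustration

model = LogisticRegression(max_iter=1000)

# Internal validation: 5-fold cross-validation within the development cohort
internal_auc = cross_val_score(model, X_dev, y_dev, cv=5, scoring="roc_auc")
print(f"Internal 5-fold AUC: {internal_auc.mean():.3f}")

# External validation: fit on all development data, score on the shifted cohort
model.fit(X_dev, y_dev)
external_auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
print(f"External AUC:        {external_auc:.3f}")
```

In practice, a gap between the internal and external scores is a warning sign that a model has fit center-specific patterns rather than generalizable signal.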