Xu et al. Art Int Surg 2023;3:48-63 https://dx.doi.org/10.20517/ais.2022.33 Page 58
Existing barriers to the use of AI approaches include the lack of standardized algorithms and software used across institutions, difficulty justifying AI-based predictions given the “black box” phenomenon, and poor generalizability outside the training set. ML algorithms require external validation in independent datasets with patient populations of substantial size and diversity for successful training[81,82]. There are also considerable differences between experimental algorithms written for proof-of-concept studies and those required to produce a marketable healthcare product. The latter must be developed following the Good Manufacturing Practice guidelines of the Food and Drug Administration[83], often requiring immense labor and experience.
Distributional shift and imbalanced data
Distributional shift is a critical problem in model creation[84]. ML models perform best when index cases and control cases are similar in the training set[85], but this is rarely the case with HCC. Disease patterns in cirrhosis and cancer also evolve drastically over time (such as the current epidemic of non-alcoholic fatty liver disease), resulting in mismatches between training and operational data. Imbalanced datasets can be “re-balanced” with under-sampling or over-sampling, but a failure to correct inherent biases will result in a model that over-diagnoses rare cases[86].
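As a minimal illustration of the re-balancing step, the sketch below applies random over-sampling to a toy imbalanced dataset (the feature values and class ratio are invented for the example; production pipelines typically use dedicated libraries rather than this hand-rolled version):

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class examples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Keep every original example, then duplicate random picks up to the target size
        picks = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(picks)
        out_y.extend([y] * target)
    return out_x, out_y

# Imbalanced toy cohort: 6 controls, 2 index (HCC) cases
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [5.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
Xb, yb = oversample_minority(X, y)
# Classes are now balanced: 6 of each
```

Note that over-sampling only duplicates existing examples, so the train/validation split must be made before re-balancing; otherwise copies of the same patient leak into both sets.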
Lack of standardization
In pursuit of safety and efficacy in AI use, standardization is key. As described above, comparability and reproducibility remain poor across studies due to gross inconsistencies in data management, in the imaging and data processing equipment used, and in the reporting of methods and results. Common metrics used in reporting the results of AI prediction, such as area under the curve, sensitivity and specificity, do not reliably reflect clinical efficacy[87]. Biomedical researchers should strongly consider following the standardized reporting guidelines published by Luo et al. in 2016[88]. Their seminal work highlights how most pitfalls of applying ML in medicine originate from a small set of common issues, such as data leakage and overfitting. They have thus generated guidelines for developing predictive models and a minimum list of reporting items, including information on independent variables, negative or positive examples, and modeling technique selection[89]. The majority of clinical studies reported here fail to reach such reporting standards. Scientific journals should stipulate these reporting standards for AI-based studies as part of quality assurance and, therefore, of potential clinical consideration.
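To see why sensitivity and specificity alone do not guarantee clinical utility, consider the positive predictive value at a realistic screening prevalence (the numbers below are illustrative only, not drawn from the studies reviewed):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# A model with 90% sensitivity and 90% specificity looks strong on paper,
# but at 1% disease prevalence most of its positive calls are false positives.
print(round(ppv(0.90, 0.90, 0.01), 3))  # → 0.083
```

In other words, fewer than one in ten flagged patients would actually have the disease, which is why prevalence-aware measures matter alongside AUC-style metrics.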
Overfitted data and generalizability
Following the initial success of various models trained and tested on small datasets, few have translated into any real-world impact because of problems with data overfitting and difficulty generalizing results to other patient populations[89]. The application of AI in HCC remains an emerging field, and most algorithms require training on diverse datasets, as well as testing with external validation or prospective trials. Several studies discussed here have managed to maintain high accuracy rates in independent external validation cohorts. For instance, the AI model for predicting HCC risk in chronic hepatitis B patients developed by Kim et al. using a Korean cohort (C-index: 0.79) remained accurate in testing against both an independent external Korean validation cohort (C-index: 0.79) and an independent external Caucasian validation cohort (C-index: 0.81)[13]. Notably, the training/derivation cohort, external Korean validation cohort, and external Caucasian validation cohort differed in their baseline characteristics, with significant differences in age and prevalence of cirrhosis[13]. Other AI models that have achieved similar results include the ML analysis of contrast-enhanced CT radiomics for HCC recurrence by Ji et al.[66]. The inclusion of such external national and international cohorts would rapidly advance generalizability.
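The C-index values quoted above are concordance statistics: the fraction of patient pairs in which the model ranks risk in the same order as the observed outcomes. A simplified sketch for uncensored data follows (real survival analyses use Harrell's C-index with censoring handling, which this toy version omits; the cohort values are invented):

```python
from itertools import combinations

def c_index(times, risk_scores):
    """Fraction of comparable pairs where the higher-risk patient
    experiences the event earlier (pairs with tied times are skipped)."""
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # not comparable in this simplified version
        comparable += 1
        earlier = i if times[i] < times[j] else j
        later = j if earlier == i else i
        if risk_scores[earlier] > risk_scores[later]:
            concordant += 1.0
        elif risk_scores[earlier] == risk_scores[later]:
            concordant += 0.5  # ties in score count as half
    return concordant / comparable

# Toy cohort: event times in months and model risk scores (invented)
times = [5, 12, 20, 34, 60]
scores = [0.9, 0.7, 0.8, 0.3, 0.1]
print(round(c_index(times, scores), 2))  # → 0.9
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the 0.79-0.81 values reported above indicate that the model's discrimination held up across ethnically distinct cohorts.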