
Xu et al. Art Int Surg 2023;3:48-63  https://dx.doi.org/10.20517/ais.2022.33         Page 58

Existing barriers to the use of AI approaches include the lack of standardized algorithms and software across institutions, difficulty justifying AI-based predictions given the “black box” phenomenon, and poor generalizability outside the training set. ML algorithms require external validation in independent datasets with patient populations of substantial size and diversity for successful training[81,82]. There are also considerable differences between experimental algorithms written for proof-of-concept studies and those required to produce a marketable healthcare product. The latter must be developed under the Good Manufacturing Practice guidelines of the Food and Drug Administration[83], often requiring immense labor and experience.

               Distributional shift and imbalanced data
Distributional shift is a critical problem in model creation[84]. ML models perform best when index cases and control cases are similar in the training set[85], but this is rarely the case with HCC. Disease patterns in cirrhosis and cancer also evolve drastically over time (such as the current epidemic of non-alcoholic fatty liver disease), resulting in mismatches between training and operational data. Imbalanced datasets can be “re-balanced” with under-sampling or over-sampling, but a failure to correct inherent biases will result in a model that over-diagnoses rare cases[86].
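As a concrete illustration of the re-balancing step described above, the following minimal, dependency-free sketch duplicates randomly chosen minority-class examples until both classes are equally represented. The helper name `random_oversample` is illustrative, not code from any cited study, and, as the text warns, this only equalizes class counts; it cannot correct biases inherent in the data itself.

```python
import random

def random_oversample(X, y, seed=0):
    """Naive random over-sampling for a binary dataset.

    Duplicates randomly chosen minority-class examples until both
    classes have the same number of samples.  This equalizes class
    counts only; it does not correct biases inherent in the data.
    """
    rng = random.Random(seed)
    pairs = list(zip(X, y))
    pos = [p for p in pairs if p[1] == 1]
    neg = [p for p in pairs if p[1] == 0]
    minority, majority = sorted([pos, neg], key=len)
    extra = [rng.choice(minority)
             for _ in range(len(majority) - len(minority))]
    balanced = pairs + extra
    rng.shuffle(balanced)
    return ([x for x, _ in balanced],
            [label for _, label in balanced])
```

Under-sampling is the mirror-image choice (discarding majority-class examples instead), which trades wasted data for a smaller training set; production pipelines typically use library implementations rather than a hand-rolled loop like this one.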

               Lack of standardization
In pursuit of safety and efficacy in AI use, standardization is key. As described above, comparability and reproducibility remain poor across studies due to gross inconsistencies in data management, in the imaging and data-processing equipment used, and in the reporting of methods and results. Common metrics used in reporting the results of AI prediction, such as area under the curve, sensitivity, and specificity, do not reliably demonstrate clinical efficacy[87]. Biomedical researchers should strongly consider following the standardized reporting guidelines published by Luo et al. in 2016[88]. Their seminal work highlights how most pitfalls of applying ML in medicine originate from a small set of common issues such as data leakage and overfitting. They have thus generated guidelines for developing predictive models and a minimum list of reporting items, including information on independent variables, negative or positive examples, and modeling technique selection[89]. The majority of clinical studies reported here fail to reach such reporting standards. Scientific publications should stipulate such reporting standards in AI-based studies as part of quality assurance and, therefore, of potential clinical consideration, a step the scientific community should take.
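One reason headline metrics can mislead, as noted above, is that sensitivity and specificity are independent of disease prevalence, whereas a test's clinical usefulness is not. The hedged sketch below (the helper name `predictive_values` is hypothetical) applies Bayes' rule to show how the positive predictive value of an apparently excellent model collapses when the target condition is rare in the screened population:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value via Bayes' rule."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence
        + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence)
        + (1 - sensitivity) * prevalence)
    return ppv, npv

# A screening model with 90% sensitivity and 90% specificity,
# applied to a population where only 1% truly have the disease:
ppv, npv = predictive_values(0.90, 0.90, 0.01)
# ppv is about 0.083: over 90% of positive calls are false alarms,
# despite headline metrics that look excellent.
```

This is why reporting guidelines ask for prevalence and cohort composition alongside discrimination metrics: the same sensitivity/specificity pair implies very different clinical performance in different populations.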

               Overfitted data and generalizability
Following the initial success of various models trained and tested on small datasets, few have translated into any real-world impact because of problems with data overfitting and difficulty generalizing results to other patient populations[89]. The application of AI in HCC remains an emerging field, and most algorithms require training on diverse datasets, as well as testing with external validation or prospective trials. Several studies discussed here have managed to maintain high accuracy in independent external validation cohorts. For instance, the AI model for predicting HCC risk in chronic hepatitis B patients developed by Kim et al. using a Korean cohort (C-index: 0.79) remained accurate when tested against both an independent external Korean validation cohort (C-index: 0.79) and an independent external Caucasian validation cohort (C-index: 0.81)[13]. Notably, the training/derivation cohort and the external Korean and Caucasian validation cohorts differed in their baseline characteristics, with significant differences in age and prevalence of cirrhosis[13]. Other AI models that have achieved similar results include the ML analysis of contrast-enhanced CT radiomics for HCC recurrence by Ji et al.[66]. The inclusion of such external national and international cohorts would rapidly advance generalizability.
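The C-index figures quoted above are concordance statistics: over all comparable patient pairs, the fraction in which the model assigned the higher risk score to the patient who experienced the event earlier. A minimal sketch of Harrell's C follows; the `c_index` helper and its inputs are illustrative, not code from the cited studies.

```python
def c_index(times, events, risks):
    """Harrell's concordance index for right-censored survival data.

    times  -- observed follow-up time for each patient
    events -- 1 if the event (e.g. HCC) was observed, 0 if censored
    risks  -- model-predicted risk score (higher = higher risk)

    A pair (i, j) is comparable when patient i had an observed event
    before patient j's follow-up time; it is concordant when the
    model gave patient i the higher risk.  Risk ties count as half.
    """
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable
```

A perfectly ranked cohort yields 1.0 and random risk scores hover near 0.5, so the externally validated values of 0.79-0.81 cited above indicate discrimination that survived the shift to cohorts with different baseline characteristics.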