
Figure 5. Machine learning model optimizations. (A) Cross-validation scores for various learning models with polynomial degrees up to 7. (B) The learning curve of the Ridge model. The training score and cross-validation score are compared for different training sizes; both tend to saturate and merge at large training sizes.


good for the training, so they are removed from the dataset for the subsequent steps. Then we standardize the features by removing the mean and scaling to unit variance, so that every input feature behaves like a standard normal distribution. This data processing step is key to reducing the bias of many machine learning models, including the ones we use below.
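As a concrete illustration, the standardization step could look like the following scikit-learn sketch; the array name X and the placeholder data are our own assumptions, not the authors' code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder matrix standing in for the four basic features (illustrative only)
X = np.random.rand(100, 4)

# Remove the mean and scale each feature column to unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has approximately zero mean and unit variance
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```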

Since we have already established from the above discussion that GFA is not a simple, linear, single-parameter problem, we try to build higher-order correlations of these basic features. To this end, we construct high-dimensional features through polynomial expansion. In detail, we generate polynomial and interaction features from the four basic features. The new feature matrix thus consists of all polynomial combinations of the basic features with a degree less than or equal to the specified degree. In this way, we capture not only the nonlinearity but also the feature interactions. The higher the degree, the more input features there are; meanwhile, the risk of overfitting also increases.
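A minimal sketch of this polynomial feature generation, assuming scikit-learn's PolynomialFeatures; the chosen degree and array names are illustrative (the study scans degrees 1 through 7).

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Standardized basic features (placeholder data, four columns as in the text)
X_scaled = np.random.rand(100, 4)

# All polynomial and interaction combinations up to the specified degree
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X_scaled)

print(X_poly.shape)  # (100, 34) for 4 basic features and degree 3
```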
Starting from these polynomial features, we aim to train a linear model that maps the features to the labels. We therefore compare several linear models, including basic linear regression and its derivatives with different regularizations. For example, Ridge regression applies an L2 regularization to the size of the coefficients, while Lasso regression imposes an L1 regularization. By adding both L1- and L2-norm regularization, an ElasticNet model can be trained. To create a workflow, we build a pipeline of feature engineering, model construction, and cross-validation, covering different polynomial degrees and linear algorithms. During training, 10-fold cross-validation is used for optimization, minimizing the root mean squared error (RMSE) between the real and predicted values. Figure 5A compares the performance of the different training models. For Lasso and ElasticNet, where feature selection is automatically induced by the L1 regularization, the models always under-fit the training data and thus their RMSE is much higher. Moreover, even as an increasingly large number of features (from degree 1 to 7) is fed to the training model, their performance improves little. These models are very aggressive in feature reduction and cannot pick up important high-degree features, which demonstrates their unsuitability for the problem at hand.
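The workflow described above could be assembled roughly as follows. This is a sketch under our own assumptions: scikit-learn's Pipeline and cross_val_score are used, X and y are placeholders for the basic features and GFA labels, and the regularization strengths are illustrative rather than the authors' settings.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 4))   # placeholder for the four basic features
y = rng.random(200)        # placeholder for the GFA labels

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),                               # L2 penalty
    "Lasso": Lasso(alpha=0.01, max_iter=10000),              # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.01, l1_ratio=0.5,
                             max_iter=10000),                # L1 + L2 penalty
}

results = {}
for degree in range(1, 8):  # polynomial degrees 1 through 7, as in Figure 5A
    for name, model in models.items():
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
            ("reg", model),
        ])
        # 10-fold cross-validation scored by RMSE on the held-out fold
        scores = cross_val_score(pipe, X, y, cv=10,
                                 scoring="neg_root_mean_squared_error")
        results[(name, degree)] = -scores.mean()

best = min(results, key=results.get)
print("lowest cross-validated RMSE:", best, results[best])
```

Plotting these cross-validated RMSE values against the polynomial degree for each model would reproduce the kind of comparison shown in Figure 5A.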


We then turn to the basic linear regression and Ridge regression models. They behave similarly, except that Ridge does a better job when the polynomial degree is 6. We first emphasize that the RMSE in Figure 5A comes from the 10-fold cross-validation, i.e., the trained model is evaluated on the held-out test sub-dataset of each fold. For machine learning models, the bias decreases with increasing model complexity, while the variance can greatly increase. There