
               RR spectral data analysis method by PCA-SVM
The baseline of each raw Raman spectrum was fitted to a polynomial using an asymmetric Huber function as the loss function[49]. The difference between each raw spectrum and its baseline was calculated. Each baseline-subtracted Raman spectrum was then normalized by its Euclidean norm and used for subsequent analysis.
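
As an illustration of this preprocessing step, the following minimal Python/NumPy sketch uses an iterative asymmetric polynomial fit as a simple stand-in for the asymmetric Huber-loss fit of [49]; the function name, polynomial degree, and iteration count are assumptions, not the authors' implementation.

    import numpy as np

    def preprocess_spectrum(spectrum, degree=5, n_iter=20):
        """Baseline-subtract and L2-normalize one Raman spectrum."""
        x = np.arange(spectrum.size)
        y = spectrum.astype(float).copy()
        for _ in range(n_iter):
            coeffs = np.polyfit(x, y, degree)        # polynomial baseline fit
            baseline = np.polyval(coeffs, x)
            y = np.minimum(y, baseline)              # asymmetry: suppress peaks above the fit
        corrected = spectrum - baseline              # difference between raw spectrum and baseline
        return corrected / np.linalg.norm(corrected) # Euclidean-norm normalization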


Raman peaks in the normalized, baseline-subtracted Raman spectra were investigated first. Student's t-test was used to determine whether particular RR peak intensities differed significantly between normal and BCC samples. Then unsupervised machine learning algorithms such as PCA were used to analyze the entire spectral data set, reduce its dimensionality, and detect underlying spectral features.
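
For a single peak, the comparison can be sketched with SciPy's two-sample t-test; the intensity arrays below are placeholders, not the paper's data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    normal_peak = rng.normal(0.8, 0.1, size=30)    # placeholder peak intensities, normal skin
    bcc_peak = rng.normal(1.0, 0.1, size=30)       # placeholder peak intensities, BCC

    t_stat, p_value = stats.ttest_ind(normal_peak, bcc_peak)
    print(f"t = {t_stat:.2f}, p = {p_value:.3g}")  # p < 0.05 -> significant difference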

PCA finds the uncorrelated components that explain the most variance in the signal. It has been widely used for various applications, such as spectroscopy[50], face recognition[51], and optical imaging[52]. Mathematically, PCA solves an eigenvalue equation and finds a set of orthonormal eigenvectors, taken as the principal components (PCs), whose corresponding eigenvalues are the variances of the PCs in the data. Consider Raman spectral data contained in a matrix X_{M×N} = {x_1, …, x_N}, where M is the number of wavenumbers and N is the number of spectra or samples (assuming M > N). PCA considers the spectral data x_i to be linear combinations of PC loadings {w_j} with scores {h_ji}, i.e., X_{M×N} ≈ W_{M×N} H_{N×N}, where W_{M×N} = {w_1, …, w_N} and H_{N×N} = {h_1, …, h_N}. To calculate the PCs, the data matrix is "mean centered" first, i.e., the mean of each
               row is calculated and subtracted off that row. Then an eigenvalue equation of the covariance matrix of the
               “mean centered” data matrix X’ is solved to find the eigenvectors and the corresponding eigenvalues. The
               eigenvectors are the PC loadings, and eigenvalues are the variances explained by the corresponding PCs.
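
In NumPy, this eigendecomposition route can be sketched as follows for the M × N layout defined above (a sketch, not the authors' code):

    import numpy as np

    def pca_eig(X):
        """PCA of an M x N matrix (M wavenumbers, N spectra) via the covariance matrix."""
        Xc = X - X.mean(axis=1, keepdims=True)   # mean-center each row (wavenumber)
        C = np.cov(Xc)                           # M x M covariance matrix
        eigvals, eigvecs = np.linalg.eigh(C)     # symmetric matrix -> eigh
        order = np.argsort(eigvals)[::-1]        # sort by explained variance, descending
        return eigvals[order], eigvecs[:, order] # variances, PC loadings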
In practice, this can be solved using singular value decomposition[53] of the data matrix X', i.e., X' = UΣV^T, where U and V are the left and right singular vectors and σ_i = diag{Σ} are the singular values. Columns of U_{M×N} are taken to be the PC loadings, i.e., W = U, the eigenvalues are λ_i = σ_i^2, and H = pinv(W)X, where pinv denotes the pseudoinverse, pinv(W) = (W^T W)^{-1} W^T.
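
Equivalently, the SVD route maps onto NumPy as below; the sizes and spectra are placeholders, and the factor 1/(N − 1) relating σ_i^2 to the covariance eigenvalues is left implicit in the text.

    import numpy as np

    M, N = 1000, 40                                    # example sizes (assumption)
    rng = np.random.default_rng(1)
    X = rng.normal(size=(M, N))                        # placeholder spectra
    Xc = X - X.mean(axis=1, keepdims=True)             # mean-centered X'

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # X' = U Sigma V^T
    W = U                                              # PC loadings
    eigvals = s**2                                     # lambda_i = sigma_i^2 (up to 1/(N-1))
    H = np.linalg.pinv(W) @ X                          # PC scores, H = pinv(W) X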
The PC scores contained in H are essentially the projections of the spectral data in matrix X onto the PCs or "eigenspectra". The PC scores h_i are a set of mixing coefficients of the PCs. These scores can be considered characteristic information of the spectra (samples) and used for classification. Alternatively, PC scores obtained from the mean-centered data matrix can also be used; they differ from those obtained from the raw data only by a shift in the origin.
Then the PC scores of the different spectra (samples) were used for classification after standardization. The scores were standardized for each spectrum using the formula (score − "score mean")/"score standard deviation". An SVM with a linear kernel was used for classification. An SVM attempts to find a hyperplane (a boundary line in two dimensions) that separates two classes with the largest distance from the nearest class members (data points), which are called support vectors. Once the SVM classifier is trained, it is tested for classification using all the data points, which is called re-substitution validation. Various combinations of features were tested for classification. Since the contributions of higher-order PCs decrease significantly according to their eigenvalues, only a limited number of PCs needs to be evaluated and compared. A more thorough search for optimal feature selection may be carried out[47,48].
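
A compact scikit-learn sketch of this classification step follows; the PC scores and labels are placeholders, the per-spectrum standardization implements the formula above, and re-substitution simply means scoring the classifier on its own training data.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    H_scores = rng.normal(size=(60, 5))     # placeholder: 60 spectra x 5 PC scores
    y = rng.integers(0, 2, size=60)         # placeholder labels: 0 = normal, 1 = BCC

    # standardize each spectrum's scores: (score - "score mean") / "score standard deviation"
    Z = (H_scores - H_scores.mean(axis=1, keepdims=True)) / H_scores.std(axis=1, keepdims=True)

    clf = SVC(kernel='linear').fit(Z, y)
    print(f"re-substitution accuracy: {clf.score(Z, y):.2f}")  # tested on all training points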
The classification performance of the SVM classifier was evaluated using statistical measures including sensitivity, specificity, and accuracy, along with the receiver operating characteristic (ROC) curve[54,55]. To plot the ROC curve for the SVM classifier, the positive-class (cancer) posterior probability (the probability that a data point is classified into the positive class) was calculated for each data point by using a sigmoid function to map the SVM scores, which are the distances from the data points to the SVM separating hyperplane[56]. The posterior probabilities were then used to calculate the true positive rate (i.e., sensitivity) and false positive rate (i.e., 1 − specificity) at varying thresholds, generating the ROC curve of true positive rate vs. false positive rate. The area under the ROC curve (AUROC)[55,57] was calculated