RR spectral data analysis method by PCA-SVM
The baseline of each raw Raman spectrum was fitted to a polynomial using an asymmetric Huber function as the loss function[49]. The difference between each raw spectrum and its baseline was calculated. Each baseline-subtracted Raman spectrum was then normalized by its Euclidean norm and used for subsequent analysis.
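As a rough illustration of this preprocessing step, the Python sketch below fits a polynomial baseline by minimizing an asymmetric Huber loss, subtracts it, and normalizes by the Euclidean norm. The loss parameters (delta, weight_above) and the polynomial degree are assumptions for illustration; the exact loss used in [49] may differ.

```python
import numpy as np
from scipy.optimize import minimize

def asymmetric_huber(coeffs, x, y, delta=0.01, weight_above=0.1):
    """Huber loss with asymmetric weights: points where the spectrum lies
    above the polynomial (i.e., Raman peaks) are down-weighted so the fit
    follows the baseline instead of the peaks."""
    r = y - np.polyval(coeffs, x)                 # spectrum minus baseline
    huber = np.where(np.abs(r) <= delta,
                     0.5 * r**2,
                     delta * (np.abs(r) - 0.5 * delta))
    w = np.where(r > 0, weight_above, 1.0)        # r > 0: point above the fit
    return np.sum(w * huber)

def preprocess(wavenumbers, spectrum, degree=5):
    # rescale the abscissa to keep the polynomial fit well conditioned
    x = (wavenumbers - wavenumbers.mean()) / (wavenumbers.max() - wavenumbers.min())
    coeffs0 = np.polyfit(x, spectrum, degree)     # least-squares starting point
    res = minimize(asymmetric_huber, coeffs0, args=(x, spectrum),
                   method="Nelder-Mead")
    corrected = spectrum - np.polyval(res.x, x)   # baseline subtraction
    return corrected / np.linalg.norm(corrected)  # Euclidean-norm normalization
```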
Raman peaks in the normalized baseline-subtracted Raman spectra were first investigated. Student's t-test was used to determine whether particular RR peak intensities differed significantly between normal and BCC samples. Then unsupervised machine learning algorithms such as PCA were used to analyze the entire spectral data set, reduce its dimensionality, and detect underlying spectral features.
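For example, a two-sample t-test on one peak's intensities across the two groups might look like the following sketch; the peak intensity values here are invented purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Hypothetical intensities of a single RR peak across samples (values invented)
peak_normal = rng.normal(0.10, 0.02, size=20)
peak_bcc = rng.normal(0.13, 0.02, size=20)

t_stat, p_value = ttest_ind(peak_normal, peak_bcc)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")  # p < 0.05: significant difference
```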
PCA finds the uncorrelated components that explain the most variance in the signal. It has been widely used for various applications, such as spectroscopy[50], face recognition[51] and optical imaging[52]. Mathematically, PCA solves an eigenvalue equation and finds a set of orthonormal eigenvectors, which are taken as the principal components (PCs); the corresponding eigenvalues are the variances of the PCs in the data.
For Raman spectral data contained in a matrix X_{M×N} = {x_1, …, x_N}, where M is the number of wavenumbers and N is the number of spectra or samples (assuming M > N), PCA considers each spectrum x_i to be a linear combination of PC loadings {w_j} with scores {h_{ji}}, i.e., X_{M×N} ≈ W_{M×N} H_{N×N}, where W_{M×N} = {w_1, …, w_N} and H_{N×N} = {h_1, …, h_N}. To calculate the PCs, the data matrix is "mean centered" first, i.e., the mean of each row is calculated and subtracted from that row. Then an eigenvalue equation of the covariance matrix of the mean-centered data matrix X′ is solved to find the eigenvectors and the corresponding eigenvalues. The eigenvectors are the PC loadings, and the eigenvalues are the variances explained by the corresponding PCs.
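As a sketch, this eigenvalue route might be implemented as follows; this is an illustration, not the authors' code, and X is assumed to hold one spectrum per column.

```python
import numpy as np

def pca_eig(X):
    """PCA via the covariance eigenvalue equation. X has shape (M, N):
    M wavenumbers (rows), N spectra (columns)."""
    Xc = X - X.mean(axis=1, keepdims=True)    # subtract each row's mean
    C = Xc @ Xc.T / (X.shape[1] - 1)          # covariance of the centered data
    eigvals, eigvecs = np.linalg.eigh(C)      # returned in ascending order
    order = np.argsort(eigvals)[::-1]         # sort PCs by explained variance
    return eigvecs[:, order], eigvals[order]  # PC loadings W, variances
```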
In practice, this can be solved using the singular value decomposition of the data matrix X′[53], i.e., X′ = UΣV^T, where U and V are the left and right singular vectors and σ_i = diag{Σ} are the singular values. The columns of U_{M×N} are taken to be the PC loadings, i.e., W = U, the eigenvalues are λ_i = σ_i^2, and H = pinv(W)X, where pinv denotes the pseudoinverse, pinv(W) = (W^T W)^{-1} W^T.
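The SVD route can be sketched the same way, under the same assumption that X holds one spectrum per column; note that, as in the text, the scores project the raw X rather than the centered X′.

```python
import numpy as np

def pca_svd(X):
    """PCA via SVD of the mean-centered matrix X' = U S V^T, following the
    text: W = U, lambda_i = s_i**2, H = pinv(W) X."""
    Xc = X - X.mean(axis=1, keepdims=True)   # mean-center each row
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = U                                    # PC loadings ("eigenspectra")
    lam = s**2                               # variances explained by each PC
    H = np.linalg.pinv(W) @ X                # PC scores: projection onto the PCs
    return W, lam, H
```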
The PC scores contained in H are essentially the projections of the spectral data in matrix X onto the PCs or "eigenspectra". The PC scores h_i are a set of mixing coefficients of the PCs. These scores can be considered the characteristic information of the spectra (samples) and used for classification. Alternatively, PC scores obtained from the mean-centered data matrix can also be used; they differ from those obtained from the raw data only by a shift in the origin.
The PC scores of different spectra (samples) were then used for classification after standardization. The scores were standardized for each spectrum as (score − score mean)/(score standard deviation). An SVM with a linear kernel was used for classification. An SVM attempts to find a hyperplane (a boundary line in two dimensions) that separates two classes with the largest distance from the nearest class members (data points), which are called support vectors. Once the SVM classifier is trained, it is tested for classification using all the data points, which is called re-substitution validation. Various combinations of features were tested for classification. Since the contributions of higher-order PCs decrease significantly according to their eigenvalues, only a limited number of PCs need to be evaluated and compared. A more thorough search for an optimal feature selection may be carried out[47,48].
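A minimal sketch of this step with scikit-learn follows (a stand-in for whatever toolbox the authors used); the number of retained PCs and the labels y are assumptions, and H is the score matrix from the PCA sketch above.

```python
import numpy as np
from sklearn.svm import SVC

def standardize_scores(H, n_pcs=5):
    """Keep the first n_pcs PC scores per spectrum (one row per spectrum)
    and standardize each spectrum's scores, as described in the text."""
    S = H[:n_pcs, :].T
    return (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

def train_svm(features, y):
    """Train a linear-kernel SVM; probability=True fits the sigmoid mapping
    used later for the ROC curve. Re-substitution validation scores the
    trained classifier on the same points it was trained on."""
    clf = SVC(kernel="linear", probability=True)
    clf.fit(features, y)
    return clf, clf.score(features, y)

# features = standardize_scores(H); y: 0 = normal, 1 = BCC (hypothetical labels)
# clf, resub_accuracy = train_svm(features, y)
```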
The classification performance of the SVM classifier was evaluated using statistical measures including sensitivity, specificity and accuracy, along with the receiver operating characteristic (ROC) curve[54,55]. To plot the ROC curve for the SVM classifier, the positive-class (cancer) posterior probability of each data point (the probability that it is classified into the positive class) was calculated by using a sigmoid function to map the SVM scores, which are the distances from the data points to the SVM separation line[56]. The posterior probabilities were then used to calculate the true positive rate (i.e., sensitivity) and the false positive rate (i.e., 1 − specificity) at varying thresholds, generating the ROC curve of true positive rate vs. false positive rate. The area under the ROC curve (AUROC)[55,57] was calculated.
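Continuing the sketch above: with probability=True, scikit-learn maps the SVM decision values through a fitted sigmoid (Platt scaling), which matches the mapping described here, and roc_curve sweeps the threshold over the resulting posteriors; clf, features and y come from the previous sketch.

```python
from sklearn.metrics import roc_curve, auc

# Posterior probability of the positive (cancer) class via the fitted sigmoid
posterior = clf.predict_proba(features)[:, 1]

fpr, tpr, thresholds = roc_curve(y, posterior)  # sweep the decision threshold
auroc = auc(fpr, tpr)                           # area under the ROC curve
print(f"AUROC = {auroc:.3f}")
```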