
Li et al. Cancer Drug Resist. 2025;8:31                                           Page 5 of 26





               Biological function analysis in high- and low-risk groups
               To further investigate the variations in biological functions between the high- and low-risk groups within the
               GEO dataset, we conducted differential gene expression analysis using the R package DESeq2 [20]. We applied
               a significance threshold of corrected P-values of < 0.05 and an absolute log2FC exceeding 1. The results of
               the differential gene expression analysis are represented using volcano plots and heatmaps.
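The two cutoffs above can be illustrated with a minimal Python sketch. It is not the DESeq2 call used in the study; the gene names and values are hypothetical, and the only things taken from the text are the corrected P-value < 0.05 and absolute log2FC > 1 thresholds.

```python
# Hypothetical DESeq2-style result rows: (gene, log2 fold change, corrected P-value).
results = [
    ("GENE_A", 2.3, 0.001),
    ("GENE_B", -1.6, 0.020),
    ("GENE_C", 0.4, 0.0005),  # significant, but the effect is too small
    ("GENE_D", 3.1, 0.200),   # large effect, but not significant
    ("GENE_E", -1.0, 0.010),  # |log2FC| equals 1, so it does not exceed the cutoff
]

PADJ_CUTOFF = 0.05  # corrected P-value threshold stated in the text
LFC_CUTOFF = 1.0    # absolute log2FC threshold stated in the text


def classify(log2fc, padj):
    """Label a gene as 'up', 'down', or 'ns' (not significant)."""
    # A missing corrected P-value (None) is treated as not significant.
    if padj is None or padj >= PADJ_CUTOFF or abs(log2fc) <= LFC_CUTOFF:
        return "ns"
    return "up" if log2fc > 0 else "down"


labels = {gene: classify(lfc, padj) for gene, lfc, padj in results}
up_genes = [g for g, lab in labels.items() if lab == "up"]
down_genes = [g for g, lab in labels.items() if lab == "down"]
```

The same three labels are what a volcano plot typically colors by.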


               Subsequently, the R package clusterProfiler [27,28] was used to perform GO and KEGG enrichment analyses on
               significantly differentially expressed genes, employing corrected P-values of < 0.05. Enrichment analysis
               results are presented as bar and bubble plots.
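As a rough illustration of what an over-representation analysis computes, the sketch below runs a hypergeometric test per term and applies a Benjamini-Hochberg correction. This is an assumption about the general technique, not a reproduction of the clusterProfiler run in the study; the term names, gene counts, and universe size are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of at least k annotated genes among n drawn
    from a universe of N genes, K of which carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvals):
    """Return BH-corrected P-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


UNIVERSE = 1000  # hypothetical number of background genes
N_DEG = 50       # hypothetical number of differentially expressed genes

# term -> (genes annotated in the universe, annotated genes among the DEGs)
terms = {"term_a": (40, 10), "term_b": (100, 5), "term_c": (20, 1)}

names = list(terms)
pvals = [hypergeom_sf(k, UNIVERSE, K, N_DEG) for K, k in terms.values()]
padj = dict(zip(names, benjamini_hochberg(pvals)))
enriched = [t for t in names if padj[t] < 0.05]
```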


               Gene mutation feature analysis
               To reveal disparities in genomic characteristics between the high- and low-risk group samples, we performed
               gene-level mutation analysis and assessed the tumor mutation burden (TMB).


               First, the function tcga_load in the R package TCGAmutations (https://github.com/PoisonAlien/TCGAmutations)
               was used to obtain gene mutation data for PRAD, and the R package maftools [33] was employed
               for mutation data analysis. The oncoplot function was used to create waterfall plots depicting the gene
               mutations. The TMB was calculated using the tmb function. To assess any differences in TMB between the
               high- and low-risk groups, box plots were used for visualization, and comparisons were made using the
               Wilcoxon rank-sum test. In addition, stratified survival analysis was conducted, and the KM survival curves
               were plotted. Finally, the samples were categorized into four groups based on the combination of risk and
               TMB levels, and stratified survival analysis with KM survival curves was used to assess the
               predictive performance of the model.
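The four-group stratification can be sketched as follows. The text does not state how the high and low levels were cut, so the median split used here is an assumption, as are the capture size, sample names, and values; TMB is taken as mutations per megabase.

```python
from statistics import median

CAPTURE_SIZE_MB = 50.0  # assumed capture size in megabases (hypothetical)

# Hypothetical per-sample mutation counts and risk scores.
samples = {
    "S1": {"n_mut": 25, "risk": 0.9},
    "S2": {"n_mut": 100, "risk": 0.2},
    "S3": {"n_mut": 150, "risk": 1.5},
    "S4": {"n_mut": 50, "risk": 0.1},
}

# TMB as mutations per megabase.
tmb = {s: v["n_mut"] / CAPTURE_SIZE_MB for s, v in samples.items()}

# Assumed cutoffs: the median of each variable (not stated in the source).
tmb_cut = median(tmb.values())
risk_cut = median(v["risk"] for v in samples.values())


def level(value, cutoff):
    """Return 'high' when the value is above the cutoff, else 'low'."""
    return "high" if value > cutoff else "low"


# Combine both levels into one of four strata per sample.
groups = {
    s: f"{level(v['risk'], risk_cut)}-risk/{level(tmb[s], tmb_cut)}-TMB"
    for s, v in samples.items()
}
```

Each of the four labels would then define one curve in a stratified survival plot.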


               Cell subtyping based on scRNA-seq data
               ScRNA-seq data for PCa were obtained from the literature [13]. Based on the reported cell labels [13], 835 PCa
               cells were extracted for downstream analysis. The R package Seurat [34], which is widely used for the
               systematic processing of scRNA-seq data, was applied for standard data processing. ScRNA-seq data and cell
               subpopulation annotations were directly adopted from published literature. Downstream analyses were
               exclusively performed on the PCa cell subpopulations already annotated in these source publications. Quality
               control standards and parameters were strictly followed according to the processing methods of the original
               datasets; no additional thresholds were set or adjusted in this study. Gene feature selection was performed to
               identify highly variable genes using the FindVariableFeatures function. PCA [35] was then applied to the
               highly variable genes. The optimal number of principal components was determined using the JackStraw and
               Elbow methods [34]. Unsupervised clustering of cells was performed using the FindClusters function with the
               resolution set to 0.05. The clustering results were visualized using uniform manifold
               approximation and projection (UMAP) for dimension reduction [36]. In each cell cluster, genes
               exhibiting differential expression were identified using the FindAllMarkers function at corrected P-values of
               < 0.05 and an absolute log2FC exceeding 1. The differentially expressed genes of each cell
               cluster are presented using both violin plots and heatmaps.
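To make the idea of highly variable gene selection concrete, the sketch below ranks genes by a simple variance-to-mean ratio across cells. This is a deliberately simplified stand-in, not the method implemented by Seurat's FindVariableFeatures, and the expression values, gene names, and the number of genes kept are all hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical expression values: gene -> one value per cell.
expr = {
    "GENE_A": [0, 0, 10, 12, 0, 11],  # strongly bimodal across cells
    "GENE_B": [5, 5, 6, 5, 5, 6],     # expressed, but nearly constant
    "GENE_C": [0, 3, 0, 4, 0, 3],     # moderately variable
    "GENE_D": [0, 0, 0, 0, 0, 0],     # not detected
}

N_TOP = 2  # hypothetical number of genes to keep


def dispersion(values):
    """Variance-to-mean ratio; genes with zero mean get a dispersion of 0."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


# Rank genes from most to least dispersed and keep the top N_TOP.
ranked = sorted(expr, key=lambda g: dispersion(expr[g]), reverse=True)
highly_variable = ranked[:N_TOP]
```

Downstream steps such as PCA and clustering would then operate on the selected genes only.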

               Identification of high-risk cell subgroups
               To identify the high-risk cell subgroups, risk scores were assigned to each cell using the risk model. Due to
               the sparsity of scRNA-seq data, the average risk score was calculated for each cell subgroup based on cell
               clustering labels. The high-risk cell subgroups were defined as those with the highest risk scores. The risk
               score results were visually presented using the UMAP [36] and t-distributed stochastic neighbor embedding
               (t-SNE) [37] dimensionality reduction techniques for enhanced clarity.
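The averaging step can be sketched as below. The form of the risk model is not given in this section, so the sketch assumes a linear score (the sum of coefficient times expression); the coefficients, gene names, and cells are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical model coefficients (assumed linear risk model).
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}

# Hypothetical cells with a cluster label and sparse expression values.
cells = [
    {"cluster": 0, "expr": {"GENE_A": 2.0, "GENE_B": 0.0}},
    {"cluster": 0, "expr": {"GENE_A": 0.0, "GENE_B": 4.0}},
    {"cluster": 1, "expr": {"GENE_A": 4.0, "GENE_B": 0.0}},
    {"cluster": 1, "expr": {"GENE_A": 2.0}},  # GENE_B not detected (dropout)
]


def risk_score(expr):
    """Linear score; genes missing from a cell count as zero expression."""
    return sum(coef * expr.get(gene, 0.0) for gene, coef in coefficients.items())


# Collect per-cell scores by cluster, then average within each cluster.
scores_by_cluster = defaultdict(list)
for cell in cells:
    scores_by_cluster[cell["cluster"]].append(risk_score(cell["expr"]))

mean_risk = {c: mean(s) for c, s in scores_by_cluster.items()}

# The cluster with the highest average score is flagged as high-risk.
high_risk_cluster = max(mean_risk, key=mean_risk.get)
```

Averaging within clusters dampens the effect of dropout in individual cells, which is the motivation the text gives for scoring at the subgroup level.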



