Li et al. Cancer Drug Resist. 2025;8:31 Page 5 of 26
Biological function analysis in high- and low-risk groups
To further investigate the variations in biological functions between the high- and low-risk groups within the
GEO dataset, we conducted differential gene expression analysis using the R package DESeq2 [20]. Genes with a
corrected P-value < 0.05 and an absolute log2FC exceeding 1 were considered significantly differentially
expressed. The results of the differential gene expression analysis are presented as volcano plots and heatmaps.
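The thresholding step can be illustrated with a minimal Python sketch. It is not the DESeq2 pipeline itself; it assumes that log2 fold changes and corrected P-values have already been computed, and the gene names and values are hypothetical.

```python
from typing import NamedTuple


class GeneResult(NamedTuple):
    gene: str      # hypothetical gene identifier
    log2fc: float  # log2 fold change (high-risk vs. low-risk)
    padj: float    # corrected P-value


def classify_deg(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return (up, down) gene lists passing both thresholds.

    A gene is kept only if padj < padj_cutoff AND |log2FC| > lfc_cutoff.
    """
    up, down = [], []
    for r in results:
        if r.padj < padj_cutoff and abs(r.log2fc) > lfc_cutoff:
            (up if r.log2fc > 0 else down).append(r.gene)
    return up, down


# Illustrative values only (not taken from the study).
results = [
    GeneResult("GENE_A", 2.3, 0.001),
    GeneResult("GENE_B", -1.7, 0.02),
    GeneResult("GENE_C", 0.8, 0.0001),  # significant, but effect too small
    GeneResult("GENE_D", 3.1, 0.20),    # large effect, but not significant
]
up, down = classify_deg(results)
```

Requiring both conditions excludes genes that are statistically significant but change little, as well as genes with large but unreliable fold changes.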
Subsequently, the R package clusterProfiler [27,28] was used to perform GO and KEGG enrichment analyses on
significantly differentially expressed genes, employing corrected P-values of < 0.05. Enrichment analysis
results are presented as bar and bubble plots.
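Over-representation analysis of this kind is commonly based on a hypergeometric test. The following Python sketch shows that calculation for a single term under that assumption; the function name and the counts are hypothetical, and multiple-testing correction across terms is not shown.

```python
from math import comb


def enrichment_pvalue(n_background, n_in_term, n_selected, n_overlap):
    """Upper-tail hypergeometric P-value for one GO/KEGG term.

    n_background: genes in the background set
    n_in_term:    background genes annotated to the term
    n_selected:   differentially expressed genes tested
    n_overlap:    selected genes annotated to the term
    Returns P(X >= n_overlap).
    """
    total = comb(n_background, n_selected)
    upper = min(n_in_term, n_selected)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 in the term, 4 selected, 2 overlapping.
p_value = enrichment_pvalue(20, 5, 4, 2)
```

In practice, one such P-value is computed per term and then corrected before applying the 0.05 cutoff.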
Gene mutation feature analysis
To reveal disparities in genomic characteristics between the high- and low-risk group samples, we performed
gene-level mutation analysis and assessed the tumor mutation burden (TMB).
First, the function tcga_load in the R package TCGAmutations (https://github.com/PoisonAlien/TCGAmutations)
was used to obtain gene mutation data for PRAD, and the R package maftools [33] was employed
for mutation data analysis. The oncoplot function was used to create waterfall plots depicting the gene
mutations. The TMB was calculated using the tmb function. To assess any differences in TMB between the
high- and low-risk groups, box plots were used for visualization, and comparisons were made using the
Wilcoxon rank-sum test. In addition, stratified survival analysis was conducted, and the KM survival curves
were plotted. Finally, the samples were categorized into four groups based on the combination of risk and
TMB levels. Stratified survival analysis was performed and KM survival curves were constructed to assess the
predictive performance of the model.
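The group comparison and the four-way stratification can be sketched in Python as follows. The cutoffs are an assumption: the text does not state how high and low levels were defined, so a median split is used here. The sample identifiers and values are hypothetical. The rank-sum comparison is shown only as the Mann-Whitney U statistic, which is equivalent to the Wilcoxon rank-sum statistic; the P-value is omitted.

```python
from statistics import median


def split_by_median(values):
    """Label each sample 'high' if its value exceeds the median, else 'low'."""
    cutoff = median(values.values())
    return {s: ("high" if v > cutoff else "low") for s, v in values.items()}


def combine_groups(risk_group, tmb_group):
    """Cross risk and TMB labels into four strata for survival analysis."""
    return {
        s: f"{risk_group[s]}-risk/{tmb_group[s]}-TMB"
        for s in risk_group
        if s in tmb_group
    }


def rank_sum_u(x, y):
    """Mann-Whitney U for x vs. y: count pairs with x > y (ties count 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Illustrative values only.
risk = {"S1": 0.2, "S2": 1.5, "S3": 0.9, "S4": 2.1}
tmb = {"S1": 3.0, "S2": 0.5, "S3": 0.4, "S4": 2.0}  # mutations per Mb

risk_group = split_by_median(risk)
tmb_group = split_by_median(tmb)
strata = combine_groups(risk_group, tmb_group)

tmb_high_risk = [tmb[s] for s, g in risk_group.items() if g == "high"]
tmb_low_risk = [tmb[s] for s, g in risk_group.items() if g == "low"]
u_stat = rank_sum_u(tmb_high_risk, tmb_low_risk)
```

Each of the four strata would then receive its own KM curve.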
Cell subtyping based on scRNA-seq data
ScRNA-seq data for PCa were obtained from the literature [13]. Based on the reported cell labels [13], 835 PCa
cells were extracted for downstream analysis. The R package Seurat [34], which is widely used for the
systematic processing of scRNA-seq data, was applied for standard data processing. ScRNA-seq data and cell
subpopulation annotations were directly adopted from published literature. Downstream analyses were
exclusively performed on the PCa cell subpopulations already annotated in these source publications. Quality
control standards and parameters were strictly followed according to the processing methods of the original
datasets; no additional thresholds were set or adjusted in this study. Gene feature selection was performed to
identify highly variable genes using the FindVariableFeatures function. PCA [35] was then performed on the
highly variable genes. The optimal number of principal components was determined using the JackStraw and
Elbow methods [34]. Unsupervised clustering of cells was performed using the FindClusters function with the
resolution set to 0.05. The clustering results were visualized using the uniform manifold
approximation and projection (UMAP) dimensionality reduction method [36]. In each cell cluster,
differentially expressed genes were identified using the FindAllMarkers function at corrected P-values of
< 0.05 and an absolute log2FC exceeding 1. These cluster-specific genes are presented using both violin plots
and heatmaps.
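The elbow criterion is usually applied by visual inspection of the per-component standard deviations. As an illustration only, the Python sketch below implements one common automated heuristic: it picks the component farthest from the straight line joining the first and last points of the curve. This is an assumption and not necessarily the procedure used here, and the standard deviations are hypothetical.

```python
def elbow_index(stdevs):
    """Return the 1-based component number at the elbow of a scree curve.

    The elbow is taken as the point with the largest perpendicular
    distance to the chord from the first to the last point.
    """
    n = len(stdevs)
    if n < 3:
        return n
    x1, y1, x2, y2 = 1, stdevs[0], n, stdevs[-1]
    best_pc, best_dist = 1, -1.0
    for i, y in enumerate(stdevs, start=1):
        # Numerator of the point-to-line distance; the constant
        # denominator does not change which point is farthest.
        dist = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_pc, best_dist = i, dist
    return best_pc


# Illustrative standard deviations for the first eight components.
stdevs = [10.0, 6.0, 3.0, 2.5, 2.3, 2.2, 2.1, 2.0]
n_pcs = elbow_index(stdevs)
```

With these toy values the curve flattens after the third component, which is the one the heuristic selects.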
Identification of high-risk cell subgroups
To identify the high-risk cell subgroups, risk scores were assigned to each cell type using this model. Due to
the sparsity of scRNA-seq data, the average risk score was calculated for each cell subgroup based on cell
clustering labels. The high-risk cell subgroups were defined as those with the highest risk scores. The risk
score results were visualized using the UMAP [36] and t-distributed stochastic neighbor embedding
(t-SNE) [37] dimensionality reduction techniques.
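The cluster-level scoring can be sketched in Python as follows. The sketch assumes the risk model is a linear combination of gene expression values weighted by model coefficients, which this section does not state. The gene names, coefficients, and expression values are hypothetical.

```python
from collections import defaultdict


def cell_risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over model genes.

    Genes not detected in a cell (dropouts) contribute zero.
    """
    return sum(
        coef * expression.get(gene, 0.0) for gene, coef in coefficients.items()
    )


def mean_score_by_cluster(cells, clusters, coefficients):
    """Average the per-cell risk scores within each cluster label."""
    totals, counts = defaultdict(float), defaultdict(int)
    for cell_id, expression in cells.items():
        label = clusters[cell_id]
        totals[label] += cell_risk_score(expression, coefficients)
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Illustrative values only.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
cells = {
    "cell1": {"GENE_A": 2.0, "GENE_B": 0.0},
    "cell2": {"GENE_A": 4.0},
    "cell3": {"GENE_B": 4.0},
    "cell4": {"GENE_A": 1.0, "GENE_B": 2.0},
}
clusters = {"cell1": 0, "cell2": 0, "cell3": 1, "cell4": 1}

means = mean_score_by_cluster(cells, clusters, coefficients)
high_risk_cluster = max(means, key=means.get)
```

Averaging within clusters dampens the effect of dropout zeros in individual cells, which is the motivation given above for scoring at the subgroup level.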