Unlike diagnosis and outcome prediction, where correlational relationships are often sufficient, predicting responses to a treatment critically depends on knowledge of the causes of biological processes. Gene regulatory networks describe the causal and mechanistic interactions between transcription factors and genes and are, therefore, critical for treatment discovery and precision treatment selection. Many gene regulatory pathways (i.e., functional components of regulatory networks) have been derived for biological systems under many different conditions based on bulk gene expression and other data, and several have led to targeted treatments, especially in cancer^{[1]}.
Recent developments in single-cell RNA sequencing (scRNA-seq) analysis have revealed regulatory behaviors not previously described using bulk analysis^{[2-4]}. Regarding intratumor heterogeneity, analysis of glioblastomas showed that one tumor contains individual cells that resemble the four bulk gene expression molecular subtypes (proneural, neural, classical, and mesenchymal), revealing diverse regulatory programs within the same tumor^{[5]}. Concerning tumor immune function, scRNA-seq analysis of the breast tumor microenvironment observed a continuum of T cell states, leading to a new understanding of immune responses to tumors^{[6]}. Highlighting the potential clinical utility of scRNA-seq data, several scRNA-seq-based studies suggested a link between the transcriptomic profile of specific cell subpopulations and patient-level cancer outcomes^{[5,7,8]}. scRNA-seq identified subgroups of drug-resistant cells^{[9,10]} and expression profiles linked to resistance^{[11]}. One case study provided evidence for the ability of scRNA-seq to identify treatment-refractory mechanisms and treatment selection for surviving tumor cells^{[12]}.
Deducing or modeling gene regulatory networks at the single-cell level from scRNA-seq data is an active pursuit of the bioinformatics community. Several methods have been developed in various biological model systems leveraging single-cell transcriptome profiling datasets^{[16-20]}. SCOUP infers gene regulatory networks by modeling the cell dynamics as ordinary differential equations with pseudotime as the temporal reference^{[21]}. TENET utilizes transfer entropy to approximate the strength of causal relationships between genes and predict large-scale gene regulatory cascades/relationships from scRNA-seq data^{[22]}.
Furthermore, ordering cells according to pseudotime may not be appropriate in specific cell populations, limiting these methods’ applicability. Methods that do not depend on pseudotime, such as SCENIC^{[24]}, have other limitations, including the reliance on correlational relationships (e.g., coexpression) for the discovery of causal mechanisms and the dependence on databases of known regulatory relationships. To summarize, these methods are not based on mathematically sound theories that guarantee reliable causal inference from observational data and are thus only heuristic. Finally, benchmark studies showed that the performance of these methods was moderate^{[25]}.
Formal causal inference methods exist for general distributions and have been tested in biological discovery and bulk gene expression data^{[26-31]}. The current study aims to assess the performance of formal causal inference methods for regulatory network reconstruction using scRNA-seq data. To apply these methods to scRNA-seq data, one needs to consider that these are multivariate count data requiring appropriate statistical tests of association and conditional independence. Various statistical and machine learning methods have been introduced for modeling this type of data. For identifying statistical relationships among count variables, such as correlations and differences between groups, the most straightforward approach is to transform the count data toward a Gaussian distribution so that existing statistical methods designed for Gaussian data can be applied. Such data transformations are simple and widely adopted in applied research, despite the debate over their effectiveness^{[32-35]}. We explored log transformation in the current study (see the METHODS section for detail). Other methods include models specifically designed for count data, such as Poisson regression^{[35]}. We examined a conditional independence test based on Poisson regression.
Another class of methods utilizes nonparametric models to infer relationships among count data. These methods make no distributional assumptions and thus can be applied to count data^{[35]}. In our study, we explored two methods that fit into this category, the kernel conditional independence test^{[36]} and the partial distance correlation test^{[37]} (see the METHODS section for detail).
More closely related to causal pathway/regulatory network discovery is learning Bayesian networks over multivariate count variables. Initial work for modeling multivariate count variables focused on modeling the multivariate joint distribution, similar to how multivariate Gaussian distributions can be modeled. However, due to the nature of the count data, this approach is problematic because the density of the joint distribution for count data is only normalizable if the coefficients specifying the model are nonpositive^{[38,39]}. Various modifications, such as truncation and modifying the base measures of the Poisson distribution, have been implemented to mitigate this issue^{[40]}. Another way to address the normalization problem is to circumvent it by modeling the distribution of each variable as a local conditional Poisson distribution given the local neighborhood of the variable without requesting a consistent joint distribution. Most of the more recent work in this domain uses this approach.
The details of local neighborhood selection, one key component of this variety of methods, vary across studies. Allen and Han approximated the local neighborhood by fitting L1 penalized Poisson or lognormal regressions^{[41,42]}. Hadji heuristically inferred the local neighborhood via functional gradient descent, i.e., boosting^{[43]}. These approaches for local neighborhood selection are essentially feature selection approaches that optimize for reconstructing the conditional density (i.e., predictivity) while penalizing the size of the local neighborhood. They do not leverage the causal nature of the data generation process and, therefore, do not guarantee the discovery of local causality.
Based on the literature, we chose to model the distribution of each count variable as a local conditional Poisson distribution given its local neighborhood. However, instead of using predictive feature selection approaches for local neighborhood selection, as employed by previous studies, we utilized local causal discovery methods for local neighborhood identification. Causal discovery methods (in contrast to predictive feature selection methods and scRNA-seq-specific regulatory network reconstruction methods based on pseudotime or coexpression) guarantee the discovery of causal relationships under broad distributional assumptions^{[44]}.
We focused on local causal discovery methods for the following reasons. First, local causal neighborhood discovery is conducive to modeling the Poisson conditional density for each variable. Second, local causal discovery methods uncover (arguably) the most important causal relationships around a gene, i.e., its direct causes and direct effects. Third, local causal discovery methods are more sample-efficient than global causal discovery methods and have excellent scalability to networks with millions of vertices.
The causal discovery methods used here identify causal relationships by examining the statistical properties in the data using conditional independence tests, following the frameworks of Pearl
Several causal structure discovery methods have been used successfully for de novo reconstruction of gene regulatory networks from bulk gene expression data^{[26-30]}. In this study, we used a family of causal structure discovery methods called the generalized local learning (GLL) causal discovery methods to reconstruct the gene regulatory network based on scRNA-seq data. The GLL can be adapted to numerous distributions and application domains while guaranteeing that the causal structure discovered will be correct under broad assumptions.
In general, the algorithms in the GLL framework take two inputs: (1) a dataset
The GLL framework can be instantiated in many ways, giving rise to existing state-of-the-art and novel algorithms. Different instantiations of the GLL can discover different components of the local causal structure. For example, the GLL-PC subfamily discovers the direct causes and direct effects of the target of interest, whereas the GLL-MB subfamily discovers the Markov boundary of the target of interest, consisting of the direct causes, direct effects, and direct causes of the direct effects. In the current study, intending to identify the target’s direct causes and direct effects, we chose to instantiate the GLL-PC family, more specifically, as the HITON-PC algorithm^{[46-48]}.
The GLL algorithmic framework is sound under well-defined and sufficient conditions. Moreover, it is computationally efficient and applicable to datasets of very high dimensionality (i.e., millions of variables using modest computing equipment). Empirically, benchmark studies on simulated and various real-world data demonstrated that GLL outperforms other methods with excellent local structure reconstruction accuracy given moderate sample sizes^{[46,47]}. GLL algorithms have been applied to many real-world datasets for causal discovery and feature selection with great success^{[49-53]}. In addition, GLL algorithms can be used for global causal discovery through local-to-global learning^{[46,47]} and equivalence class discovery^{[54]}.
The GLL algorithm framework infers the local causal neighborhood of the target of interest via systematically examining the statistical dependencies and independencies in the data using statistical tests of conditional independence. Briefly, a pair of variables
The GLL algorithm framework leverages this foundational principle of causality to identify a target variable’s direct causes and effects. Its search strategy optimizes for correct statistical inference given the available sample size and computational efficiency. To identify a variable’s direct causes and direct effects, the GLL algorithm iteratively conducts multiple conditional independence tests among variables. Errors incurred in individual conditional independence tests affect the quality of the discovery. In general, the error rate of a conditional independence test (like any statistical test) depends on the assumptions of the test, effect size, sample size, and the trade-off between type I error (false positives) and type II error (false negatives).
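The idea of finding neighbors through repeated conditional independence tests can be illustrated with a deliberately simplified sketch. This is not the HITON-PC algorithm (whose pseudocode is given in the cited references); it is a plain elimination loop, and the names `local_neighborhood`, `is_independent`, `max_k`, and the toy oracle are hypothetical. A variable is dropped from the candidate set as soon as some small conditioning set renders it independent of the target.

```python
from itertools import combinations

def local_neighborhood(target, variables, is_independent, max_k=2):
    """Keep variables that no conditioning set of size <= max_k separates from target.

    is_independent(x, target, z) is any conditional independence test
    (or oracle) returning True when x is independent of target given set z.
    """
    candidates = [v for v in variables if v != target]
    for x in list(candidates):
        others = [v for v in candidates if v != x]
        separated = any(
            is_independent(x, target, set(z))
            for k in range(min(max_k, len(others)) + 1)
            for z in combinations(others, k)
        )
        if separated:
            candidates.remove(x)
    return set(candidates)

# Toy oracle for the chain A -> T -> B -> C (an illustrative assumption):
# the only separable variable is C, which is independent of T given B.
def chain_oracle(x, target, z):
    return x == "C" and target == "T" and "B" in z

neighbors = local_neighborhood("T", ["A", "T", "B", "C"], chain_oracle)
print(neighbors)
```

With real data the oracle would be replaced by a statistical test thresholded at a chosen significance level, which is where the type I and type II error trade-off discussed above enters.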
The comparative advantage of various conditional independence tests for local causal discovery leveraging count data (such as scRNA-seq data) has not been characterized systematically in the literature. Therefore, we evaluated five conditional independence tests on systematically generated simulated and real datasets. The five conditional independence tests are:
● Fisher: this is the classical Fisher’s z test. This test uses Fisher’s ztransformation of the partial correlation and tests for zero partial correlation between variable
● LogFisher: the Fisher’s
● Poisson CI test: conditional independence test based on Poisson regression. It tests for nonzero partial correlations between
● Kernel conditional independence (KCI) test: kernel-based conditional independence test. This test does not explicitly estimate the conditional or joint densities of the variables in question but computes test statistics based on kernel matrices of the variables. This test does not make assumptions regarding the distribution of the variables or about the functional relationships among the variables^{[36]}. Notably, the KCI test is designed for continuous variables. Even though count data violate the test assumption, we chose to test its empirical performance.
● Partial distance correlation (pdcor) test: partial distance correlation test is a test for zero partial distance correlation for variable sets
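As an illustration of the first two tests in the list, the following is a minimal sketch of a Fisher z test on the partial correlation, with an optional log transformation applied beforehand. It is not the authors' MATLAB implementation. The function names, the dictionary-of-columns data layout, and the use of log(x + 1) as the transformation are assumptions made for this example; the partial correlation is computed with the standard recursive formula.

```python
import math
import random

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def partial_corr(data, x, y, z):
    """Partial correlation of columns x and y given the columns named in list z."""
    if not z:
        return pearson(data[x], data[y])
    z0, rest = z[0], z[1:]
    r_xy = partial_corr(data, x, y, rest)
    r_xz = partial_corr(data, x, z0, rest)
    r_yz = partial_corr(data, y, z0, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def fisher_z_pvalue(data, x, y, z=(), log_transform=False):
    """Two-sided p-value for the null hypothesis of zero partial correlation."""
    if log_transform:  # assumed transformation: log(count + 1)
        data = {k: [math.log(v + 1) for v in col] for k, col in data.items()}
    n = len(data[x])
    r = partial_corr(data, x, y, list(z))
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    stat = math.sqrt(n - len(z) - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2))

# Toy data: x and y both depend on z, and on nothing else in common.
rng = random.Random(0)
z_col = [rng.randint(0, 10) for _ in range(500)]
counts = {
    "z": z_col,
    "x": [v + rng.randint(0, 3) for v in z_col],
    "y": [v + rng.randint(0, 3) for v in z_col],
}
p_marginal = fisher_z_pvalue(counts, "x", "y")
p_conditional = fisher_z_pvalue(counts, "x", "y", z=("z",))
p_log = fisher_z_pvalue(counts, "x", "y", z=("z",), log_transform=True)
```

In this toy example the marginal test strongly rejects independence of x and y, while conditioning on z removes the shared source of variation. Setting `log_transform=True` gives the log-transformed variant on the same inputs.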
We systematically simulated count datasets to test the performance of different conditional independence tests for local causal discovery under various conditions that affect the discoverability of causal structure. The following simulation conditions were explored: network structure, the form of the data generation function, the signal-to-noise ratio, and the sample size.
The task of qualitative causal discovery is to learn the causal structure that generates the data distribution from the analysis of experimental data or from an observational sample from that data distribution. The goal of quantitative causal discovery is to estimate the magnitude of causal effects that variable manipulations have on some target variable of interest using experimental or observational modeling methods. We focused on the discovery process based on observational data. Local causal discovery aims to learn the local causal structure (e.g., direct causes and direct effects) of a target variable.
Therefore, the first step in our data generation process is to generate a network structure that encodes a set of causal relationships among variables. Specifically, we generated random directed acyclic graphs (DAGs) with a specified number of vertices (
Given the qualitative causal information encoded in the generated DAG, the second step in the data generation process is to define the quantitative causal relationship among variables. Most studies in the statistical and machine learning literature generate multivariate Poisson data in two ways: (1) using the Poisson distribution; and (2) using the Poisson LogNormal distribution. We generate data using both methods because it is unknown which method better approximates true multivariate Poisson data observed in various domains.
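For intuition about the first (Poisson) data generation method, the sketch below samples counts from a small DAG in topological order, with each variable drawn from a Poisson distribution whose rate is the exponential of a linear function of its parents. The exact parameterization used in the study is not reproduced here; the log-linear form, the gene names, the intercepts, and the coefficient are illustrative assumptions, and the Poisson LogNormal method is not covered. The sampler is Knuth's multiplication method, which is adequate only for small rates.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method for a Poisson draw with rate lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(parents, intercepts, weights, n, seed=0):
    """Sample n observations; `parents` must list nodes in topological order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for node in parents:  # dicts preserve insertion order
            log_rate = intercepts[node] + sum(
                weights[(p, node)] * row[p] for p in parents[node]
            )
            row[node] = sample_poisson(math.exp(log_rate), rng)
        rows.append(row)
    return rows

# Hypothetical two-gene network g1 -> g2 with a negative (repressive) effect.
parents = {"g1": [], "g2": ["g1"]}
intercepts = {"g1": 1.0, "g2": 1.5}
weights = {("g1", "g2"): -0.3}
rows = simulate(parents, intercepts, weights, n=2000)
mean_g1 = sum(r["g1"] for r in rows) / len(rows)
```

A negative coefficient is used in the example because, with an exponential link, positive coefficients can make the rate grow very quickly as parent counts increase.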
The first method specifies the conditional distribution of a variable
The second data generation method uses the PoissonLogNormal distribution which contains a latent multivariate normal distribution over the variables
It is worth noting the exponential function in the conditional distributions for the Poisson and Poisson LogNormal models. Due to the exponential function, some choices of
We explored two signal-to-noise conditions to study the effect of the signal-to-noise ratio of the causal effect size on local causal discovery. The data generated as specified in the previous section are considered the low noise condition. We also generated data with added noise (a lower signal-to-noise ratio) by randomly selecting 30% of the data and permuting them per variable, referred to as the high noise condition. This strategy injects additional noise while preserving the marginal distribution.
We studied the effect of sample size by examining simulated data with sample sizes of 100, 500, and 1000 observational units.
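The permutation-based noise injection can be sketched as follows. This is one possible reading of the procedure, under the assumption that the 30% of observations are selected independently for each variable; the function name `add_permutation_noise` and the toy columns are hypothetical. Because values are only shuffled within a column, each variable's marginal distribution is unchanged.

```python
import random

def add_permutation_noise(columns, fraction=0.3, seed=0):
    """Shuffle a random `fraction` of the entries within each column."""
    rng = random.Random(seed)
    noisy = {}
    for name, values in columns.items():
        values = list(values)  # copy so the input is left untouched
        n_pick = round(fraction * len(values))
        idx = rng.sample(range(len(values)), n_pick)
        shuffled = [values[i] for i in idx]
        rng.shuffle(shuffled)
        for i, v in zip(idx, shuffled):
            values[i] = v
        noisy[name] = values
    return noisy

# Toy count columns (illustrative values only).
columns = {"g1": list(range(10)), "g2": [5] * 5 + [0] * 5}
noisy = add_permutation_noise(columns)
```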
To summarize, we explored four types of network structure, two types of data generation functions, and two signal-to-noise ratios, resulting in 4 × 2 × 2 = 16 types of data generation processes. To reduce artifacts due to (and assess the variability of) simulated datasets, each data generation process was repeated 50 times to produce 16 × 50 = 800 datasets. Each simulated dataset contained 1000 samples. To test the influence of sample size, subsamples of 100, 500, and 1000 were drawn from each simulated dataset, giving 16 × 3 = 48 simulation conditions. Local causal discovery with different conditional independence tests was conducted on each data sample.
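The bookkeeping of the simulation design can be checked with a short enumeration. The structure labels below are placeholders (the four network structures are not named here); the other factor levels are taken from the text.

```python
from itertools import product

structures = ["structure_1", "structure_2", "structure_3", "structure_4"]  # placeholder labels
functions = ["Poisson", "Poisson LogNormal"]
noise_levels = ["low noise", "high noise"]
sample_sizes = [100, 500, 1000]
repeats = 50

# Every combination of structure, generation function, and noise level.
processes = list(product(structures, functions, noise_levels))
n_datasets = len(processes) * repeats

# Each process is evaluated at every sample size.
conditions = list(product(processes, sample_sizes))

print(len(processes), n_datasets, len(conditions))
```

The last count corresponds to the 48 simulation conditions referred to when the results are summarized.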
We analyzed two singlecell RNAseq datasets to reconstruct the local causal neighborhood of transcription factors. For the THP1 dataset, the scRNAseq data was obtained from Park (2021)^{[61]} (GSE176294), and we used the network described in Tomaru (2009)^{[62]} as the gold standard. The gold standard was developed with knockdown experiments and transcription factor binding experiments. For the Yeast dataset, the scRNAseq was obtained from Jackson (2020)^{[63]}, and we used the network developed in Tchourine (2018)^{[64]} as the gold standard. This gold standard was derived by combining binding and expression information from various sources. The description of these datasets is displayed in
Characteristics of the singlecell datasets







THP1  40  3159  10  58  scRNAseq from THP1 cell line  [ 
Yeast  1071  38,225  98  1402  scRNAseq from Yeast  [ 
The GLL algorithm instantiated with the five conditional independence tests was applied to individual variables in each data sample (simulated and real-world data) for local causal discovery. The output of the GLL algorithm is the local causal neighborhood of the variable in question, estimated from the data sample using the specific conditional independence test. The pseudocode and detailed discussions regarding the GLL algorithm have been described previously^{[46,47]}. We implemented the GLL algorithm in MATLAB.
The true local causal neighborhood of a variable consists of its direct causes and direct effects. The performance of local causal discovery is evaluated by comparing the discovered local causal neighborhood to the true local causal neighborhood. Because a variable is either in the local neighborhood of another variable or is not (i.e., a binary decision where being in the local neighborhood is considered positive), we chose metrics for binary classification for performance evaluation. Specifically, we computed the following metrics: sensitivity, specificity, positive predictive value (PPV), negative predictive value, and F1 score. The metrics were computed using each variable as a target separately. The mean and variability were reported.
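The per-target metrics can be computed directly from the true and discovered neighbor sets, as in the sketch below. The function name, the gene labels, and the choice to return NaN when a denominator is zero are assumptions for illustration.

```python
def neighborhood_metrics(true_nb, found_nb, all_vars, target):
    """Binary classification metrics for one target's local neighborhood."""
    others = set(all_vars) - {target}
    tp = len(true_nb & found_nb)   # true neighbors that were found
    fp = len(found_nb - true_nb)   # found, but not true neighbors
    fn = len(true_nb - found_nb)   # true neighbors that were missed
    tn = len(others) - tp - fp - fn

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

# Hypothetical example: six genes, target g1, one hit, one miss, one false alarm.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
metrics = neighborhood_metrics({"g2", "g3"}, {"g2", "g4"}, genes, "g1")
```

Averaging these dictionaries over all targets in a dataset gives the mean values, and their spread gives the variability.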
As stated in the previous sections, for the simulated data, the true local neighborhood is determined by the true network structure that generated the data. For the real data, the true network was obtained from the prior literature.
We compared the performance of GLL instantiated with five different conditional independence tests for local causal discovery across all simulation conditions and multiple performance metrics. We reported sensitivity [
Local causal neighborhood discovery sensitivity for various conditional independence tests and simulation conditions. The numerical value represents the mean sensitivity of a given conditional independence test for a given simulation condition over 50 randomly generated datasets. We colored the cells according to their sensitivity to aid the inspection of the figure. Deeper red indicates better performance, and deeper yellow indicates worse performance. Bolded cells indicate the best performance among the five conditional independence tests for a given simulation condition.
Specificity of local causal neighborhood discovery for various conditional independence tests and simulation conditions. Cells were colored according to the performance to aid the inspection of the figure. Deeper red indicates better performance, and deeper yellow indicates worse performance. Bolded cells indicate the best performance among the five conditional independence tests for a given simulation condition.
PPV of local causal neighborhood discovery for various conditional independence tests and simulation conditions. Cells were colored according to the performance to aid the inspection of the figure. Deeper red indicates better performance, and deeper yellow indicates worse performance. Bolded cells indicate the best performance among the five conditional independence tests for a given simulation condition.
Local causal neighborhood discovery performance measured by sensitivity, specificity, and PPV for various conditional independence tests and real-world datasets. Cells were colored according to the performance to aid the inspection of the figure. Deeper red indicates better performance, and deeper yellow indicates worse performance. Bolded cells indicate the best performance among the five conditional independence tests for a given sample size and dataset.
The sensitivity metric is the number of true positives over the total number of positives. In our case, it is the number of identified neighbors that are true neighbors over the total number of neighbors for a vertex. Higher sensitivity indicates that the algorithms identify a higher proportion of true neighbors. The Poisson conditional independence test (PoissonCI) achieved the best sensitivity in 35 out of 48 simulation conditions nominally. In the remaining 13 of 48 simulation conditions, pdcor achieved the best sensitivity.
As the sample size increases, the sensitivity of all conditional independence tests improves, as evidenced by
Moreover, the sensitivity for the simulation condition generated from graphs with the same edgetovertex ratio but a different number of vertices (NV) and a number of edges (NE) were comparable (e.g., NV = 20, NE = 20
In summary, as evidenced by
Turning to specificity (the number of true negatives over the total number of negatives), nominally, the KCI test achieved the best performance in 42 of 48 simulation conditions. In the remaining six of 48 simulation conditions, Fisher’s test (Fisher) achieved the best performance.
As the sample size increases, the specificity of all conditional independence tests decreases in general, as expected. However, for the sample sizes tested, the influence of sample size on specificity is small. Adding noise to the data resulted in similar specificity for the Poisson data generation function but increased specificity for the Poisson LogNormal data generation function. It is worth noting that the mean difference in specificity among the conditional independence tests is relatively low (< 0.02) within each simulation condition when the data are generated using the Poisson data generation function. However, when the data are generated with the Poisson LogNormal data generation function, the specificity of PoissonCI, the most sensitive conditional independence test in most conditions, is lower than that of the other conditional independence tests.
In summary, as evidenced by
PPV is the number of true positives over the number of predicted positives. In our case, it is the number of identified neighbors that are true neighbors over the number of identified neighbors. A higher PPV indicates that a higher proportion of the identified neighbors are true. For PPV, Fisher’s test (Fisher) nominally achieved the best performance in 41 out of 48 simulation conditions. In the remaining seven conditions, the KCI test achieved the best performance.
PPV increased as the sample size increased and decreased as more noise was added to the data. Unlike sensitivity and specificity, PPV is mathematically affected by the prior (i.e., the NE over the number of vertex pairs in our setting); as a result, we observed a decrease in the positive predictive value when the NV in the graph increased.
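The dependence of PPV on the prior follows from Bayes' rule and can be made concrete with a small calculation. In the sketch below, the sensitivity and specificity values and the two graph sizes are hypothetical and chosen only for illustration; the prior is taken as NE divided by the number of vertex pairs.

```python
from math import comb

def expected_ppv(sensitivity, specificity, nv, ne):
    """PPV implied by Bayes' rule when the prior is NE over all vertex pairs."""
    prior = ne / comb(nv, 2)
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

# Same test accuracy and same edge-to-vertex ratio, different graph sizes.
ppv_small = expected_ppv(0.8, 0.98, nv=20, ne=20)
ppv_large = expected_ppv(0.8, 0.98, nv=200, ne=200)
print(round(ppv_small, 2), round(ppv_large, 2))
```

Holding the test's sensitivity and specificity fixed, the larger graph has a much smaller fraction of true edges among all vertex pairs, so the same false positive rate yields many more false positives relative to true positives.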
In summary, as evidenced by
We applied GLL with different conditional independence tests to two realworld scRNAseq datasets for local causal discovery. We subsampled these datasets to assess the change in performance as a function of sample size (core results in
In these two datasets, not all methods performed as well as in simulated data, suggesting that the data are unusually hard outliers or that the gold standards are not as precise as needed (see DISCUSSION).
We tested the performance of local causal discovery algorithms equipped with different conditional independence tests for reconstructing the local causal neighborhood based on simulated and real-world scRNA-seq data.
To our knowledge, this study is the first where local causal discovery algorithms and conditional independence tests were benchmarked on systematically generated count data. Our simulation study showed that local causal discovery methods with appropriate conditional independence tests could result in excellent discovery performance given a sufficient sample size. Different conditional independence tests, as expected, have different power-sample characteristics. Therefore, the best conditional independence test depends on the discovery task. When one wishes to discover as many true neighbors as possible (maximizing sensitivity), the Poisson conditional independence test and Fisher’s test have an advantage for networks with low edge-to-vertex ratios. In contrast, the Poisson conditional independence test, Fisher’s test, and the KCI test have an advantage for networks with higher edge-to-vertex ratios. Our simulation study provides general guidance for choosing a conditional independence test for local causal discovery given count data.
Although the primary goal of our simulation study was to compare the effectiveness of various conditional independence tests for local causal discovery given count data under laboratory (i.e., controlled and simulated) analysis conditions, it can also help in experiment planning^{[65,66]}. For example, our results on simulated data can help answer questions such as, “Is scRNA-seq data from one hundred cells sufficient for identifying the local causal neighborhood given specific conditional independence tests and desired level of performance metrics?” Across all simulation conditions, we found that sensitivity, PPV, and F1 score increased substantially when the sample size increased from 100 to 500, while the difference in performance for these metrics was less drastic when the sample size increased from 500 to 1000. For datasets with a larger number of vertices and edges (e.g., for simulated data where NV = 200 and NE = 500), the performance was less than ideal even at a sample size of 1000.
Several directions for future work can be taken to expand and enrich the results of the current study. First, despite being a systematic benchmark study for local causal discovery utilizing various conditional independence tests for count data, this benchmark study only explored a subspace of the available methods for single-cell regulatory network reconstruction. We did not compare the local causal discovery method with previously reported methods that utilize non-causal techniques for neighborhood identification^{[21-24,41-43]}. In general, methods not designed for causal discovery have (as expected) underperformed causality-optimized methods^{[46,47]}; however, they might be advantageous for specific performance metrics (e.g., trading low specificity and many false positives for higher sensitivity). Second, a major difficulty in evaluating causal discovery methods on real-world data is the lack of suitable gold standards^{[67]}. Despite the increasing availability of scRNA-seq data, high-quality gold standards are scarce. The gold standards^{[62,64]} we used in our study were constructed from bulk-level rather than single-cell data.
Furthermore, they were derived from studies limited to a partial set of genes. These limitations may explain the performance gap in our results between simulated and real data. More analyses must be conducted as more real-life datasets with reliable true direct causality gold standards become available. Using gold standards constructed from single-cell data or conducting experimental validations based on the local causal neighborhoods discovered by different algorithms would produce a more accurate evaluation of the performance of the algorithms. Finally, our data simulation methods do not capture all the complexities in real-world single-cell data. This limitation might contribute to the large difference between the results from the simulated data and the real-world study. Future studies using simulations that generate data that better approximate scRNA-seq data (e.g., resimulation methods) for developing and benchmarking methods are desired.
Until the performance of various discovery methods is fully characterized, we propose, based on our results, that a more exploratory attitude and careful (i.e., not over-interpreted) examination of results is warranted in situations where experiments do not meet apparent requirements for good discovery performance (e.g., a small sample size, a large edge-to-vertex ratio for sensitivity, and a large number of vertices for PPV). Similarly, experimental validation of the results is warranted when analysis operates in high-PPV regions.
In conclusion, the current study is the first to systematically evaluate the performance of the local causal discovery algorithm given different conditional independence tests on simulated count data, with empirical results in real-world scRNA-seq data. It provides an initial set of insights for designing analyses and choosing discovery methods in research involving scRNA-seq data for gene regulatory network reconstruction.
Conception and design: Ma S, Aliferis C
Data analysis: Bieganek C, Ma S, Tourani R
Interpretation of data: Ma S, Tourani R, Aliferis C, Wang J
Writing and editing the manuscript: Ma S, Aliferis C, Wang J, Tourani R
Data supporting the findings are either shared in the paper’s main text or the
The efforts of Drs. Ma S., Wang J., and Aliferis C. were partly supported by the University of Minnesota Clinical and Translational Science Institute grant 5UL1TR002494, the Minnesota Tissue Mapping Center of Cellular Senescence, grant 1U54AG076041, and the Midwest MurineTissue Mapping Center (MMTMC) grant 1U54AG07975401. Dr. Wang J. gratefully acknowledges support from the UMN Masonic Cancer Center and NCI grants P30CA077598 and U54AG079754.
All authors declared that there are no conflicts of interest.
Not applicable
Not applicable
© The Author(s) 2023.