Page 115 | Waller et al. J Transl Genet Genom 2021;5:112-23 | http://dx.doi.org/10.20517/jtgg.2021.09
at every position across the genome, the best evidence (lowest empirical P-value) for an excessive length
of sharing is established (Figure 1, Step 4). This process results in a final optimized set of shared segments
for a single pedigree. Each optimal segment corresponds to a specific subset of cases and has a nominal
empirical P-value.
For two pedigrees, the duo-SGS evidence is the combination of the nominal empirical P-values for the
optimal segments at the same genome position in the two pedigrees. Specifically, the Fisher method to
combine P-values was used. All possible pedigree pairs could be considered as separate analyses, but there
are $\binom{n}{2}$ pedigree pairs (ways to select 2 pedigrees from n total pedigrees), and hence multiple testing can
rapidly become an issue. Alternatively, a single analysis comprising optimization across all pedigree pairs
could be considered, but this global approach may cloud individual pedigree-pair findings. To balance these
two extremes, we propose a fixed-pedigree duo-SGS strategy (Figure 1, Step 5). The procedure is as follows:
(1) fix a pedigree of interest; (2) calculate genome-wide duo-SGS evidence for the fixed pedigree with each
of the other pedigrees; and (3) optimize across the duo-SGS findings to identify the most significant duo-
SGS result at each point across the genome. The optimized findings over pedigree pairs are the duo-SGS
results for the fixed pedigree. In this approach, we identify the best two-pedigree results that include the
fixed pedigree. The procedure is then repeated for each pedigree, thus producing duo-SGS results for each
pedigree.
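As an illustration of the fixed-pedigree strategy described above, the following is a minimal sketch rather than the authors' implementation. It assumes each pedigree's optimal nominal empirical P-values are already available as one list per pedigree, aligned by genome position and strictly greater than zero. The names `fisher_combine_two`, `fixed_pedigree_duo_sgs`, and the toy pedigree labels are hypothetical. For exactly two P-values, the Fisher statistic follows a chi-square distribution with 4 degrees of freedom, whose tail probability has a closed form, so no statistics library is needed.

```python
import math


def fisher_combine_two(p1, p2):
    """Fisher's method for two P-values in (0, 1].

    The statistic -2*(ln p1 + ln p2) is chi-square with 4 degrees of freedom
    under the null; its survival function is exp(-x/2) * (1 + x/2).
    """
    half_stat = -(math.log(p1) + math.log(p2))  # half of the Fisher statistic
    return math.exp(-half_stat) * (1.0 + half_stat)


def fixed_pedigree_duo_sgs(fixed, pvalues):
    """Best duo-SGS result per position for one fixed pedigree.

    pvalues: dict mapping a (hypothetical) pedigree id to a list of nominal
    empirical P-values, one per genome position, aligned across pedigrees.
    Returns a list of (combined P-value, partner pedigree) tuples.
    """
    best = []
    for pos in range(len(pvalues[fixed])):
        candidates = [
            (fisher_combine_two(pvalues[fixed][pos], pvalues[other][pos]), other)
            for other in pvalues
            if other != fixed
        ]
        best.append(min(candidates))  # most significant pair at this position
    return best


# Toy data (invented for illustration only): 3 pedigrees, 3 positions.
toy = {
    "ped_A": [0.01, 0.5, 0.2],
    "ped_B": [0.02, 0.4, 0.9],
    "ped_C": [0.3, 0.001, 0.8],
}

# Repeat the procedure with each pedigree fixed in turn.
duo_results = {ped: fixed_pedigree_duo_sgs(ped, toy) for ped in toy}

# Number of distinct pedigree pairs if every pair were analysed separately.
n_pairs = math.comb(len(toy), 2)
```

Fixing one pedigree at a time keeps each result attributable to a specific pedigree pair while optimizing over only n - 1 partners per fixed pedigree.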
Genome-wide thresholds for duo-SGS
Critical to interpreting the observed duo-SGS results are genome-wide significance duo-SGS thresholds
for each pedigree (Figure 1, Step 6). To establish these, we echo the same optimization process in null data.
Establishing these thresholds is similar to the calculation described for the single pedigree SGS method[20].
Under the reasonable assumption that the vast majority of the genome represents chance sharing (i.e.,
most of the genome does not contain a disease risk gene) we model the distribution for null sharing on the
distribution of the empirical P-values for each pedigree. To avoid comparing the findings to themselves
or skewing toward possible true positives, the empirical P-values are perturbed, and the distribution fitting
is performed on P-values at a resolution of 1 million simulations. The latter avoids inappropriately fitting the distribution to extreme
outliers, that is, the few results from the alternative hypothesis that would be included at their final resolution. To perturb an
empirical P-value we determine its Wilson score 95% confidence interval (CI) (Equation 1) and randomly
sample a value from within it.
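$$\frac{\hat{p} + \frac{z^{2}}{2n}}{1 + \frac{z^{2}}{n}} \pm \frac{z}{1 + \frac{z^{2}}{n}} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^{2}}{4n^{2}}}$$ (Equation 1)
where $\hat{p}$ is the empirical P-value, z is 1.96 (for the 95% CI), and n is the number of simulations (here,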
1,000,000). The Wilson interval was selected because it always produces non-negative confidence bounds
for the P-values. The genome-wide set of perturbed empirical P-values for a pedigree is considered the
“null” P-values for that single pedigree. The duo-SGS procedure (described above) is performed using
the single pedigree genome-wide null P-values. The result of this process is a set of optimal duo-SGS null
P-values.
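The perturbation step can be sketched as follows. This is an illustrative example under stated assumptions, not the published code: it uses the standard Wilson score interval, and it assumes the value is drawn uniformly from within the interval, since the text does not specify the sampling distribution. The function names and the example P-values are hypothetical.

```python
import math
import random


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score confidence interval for a proportion p_hat from n trials."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    # Clamp only to guard against floating-point round-off at the boundaries;
    # the Wilson bounds themselves lie within [0, 1].
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def perturb(p_hat, n, rng, z=1.96):
    """Draw a perturbed P-value from within the Wilson interval.

    Uniform sampling is an assumption made for this sketch.
    """
    low, high = wilson_interval(p_hat, n, z)
    return rng.uniform(low, high)


rng = random.Random(0)   # seeded for reproducibility of the sketch
n_sims = 1_000_000       # number of simulations behind each empirical P-value

# Invented empirical P-values for a handful of genome positions.
observed = [0.0, 2e-6, 0.013, 0.5]
null_pvalues = [perturb(p, n_sims, rng) for p in observed]
```

Even for an empirical P-value of exactly zero, the lower Wilson bound is zero rather than negative, which is the property the text cites as the reason for choosing this interval.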
Genome-wide significant and suggestive thresholds are determined following our previously described
method for single pedigree SGS[20]. Briefly, the null duo-SGS P-values are log-transformed and fitted to a
gamma distribution. The shape (k) and rate (σ) parameters of the fitted distribution are applied using the
Theory of Large Deviations to calculate the significance thresholds by solving:
$$\mu(X) = [C + 2GX]\,\alpha(X)$$ (Equation 2)