Page 148 Goodman et al. J Transl Genet Genom 2020;4:144-58 | http://dx.doi.org/10.20517/jtgg.2020.23
(TCAG), Hospital for Sick Children Research Institute, Toronto, Ontario, Canada in accordance with
the manufacturer’s protocols. Samples were randomly stratified across chips and run in two batches but
balanced for case/control proportions and sex.
Raw data were then processed in R statistical software, using the package minfi [27]. Quality control measures
included removing probes that failed detection P-value, meaning the signal was not significantly above
background noise, as well as probes mapping to X and Y chromosomes, cross-reactive probes and SNP
probes [28,29]. All criteria and methods for pre-processing are fully described in Chater-Diehl et al. [11].
Following these steps, data underwent background signal subtraction and control normalization, also using
minfi [29]. The normalized data consisted of 774,583 methylation sites or CpGs for each sample. DNAm,
measured in β values, ranges from 0 to 1, corresponding to 0% to 100% methylation.
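As an illustration of the β scale described above, the following minimal Python sketch computes a β value from methylated and unmethylated probe intensities and applies a detection P-value filter. The offset of 100 is the conventional Illumina default and is an assumption here; the 0.01 detection threshold and the rule that a probe must pass in every sample are hypothetical choices, since the text does not state them. This is not the minfi implementation.

```python
def beta_value(meth, unmeth, offset=100.0):
    """Return beta = M / (M + U + offset) for one probe in one sample.

    The offset (assumed to be 100 here) stabilises the ratio when both
    intensities are low.
    """
    return meth / (meth + unmeth + offset)


def passes_detection(p_values, threshold=0.01):
    """Return True if the probe's detection P-value is below the threshold
    in every sample (hypothetical threshold and rule, for illustration only).
    """
    return all(p < threshold for p in p_values)


# Illustrative intensities, not data from the study.
print(beta_value(900.0, 0.0))
print(passes_detection([0.001, 0.2]))
```

A β near 0 indicates a largely unmethylated site and a β near 1 a largely methylated one.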
DNAm signature derivation
Prior to statistical analysis, underlying proportions of monocytes, neutrophils, CD4+ T cells, CD8+ T cells, natural killer
cells and B cells were estimated from the DNAm data using the Houseman algorithm [30]. At each CpG
site, a two-group comparison of KS discovery cases vs. controls was performed using limma regression,
accounting for sex, age, batch and estimated blood cell proportion covariates [31]. CpG sites found to be
differentially methylated between cases and controls were reported if they met both a statistical significance
threshold [false discovery rate (FDR)-corrected P-value < 0.01] and a minimum effect size (absolute Δβ > 10%). Δβ
represents the difference in average DNAm (β) between groups. Principal component analysis (PCA) and
hierarchical clustering were performed using Qlucore Omics Explorer (QOE, www.qlucore.com).
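The two reporting criteria can be sketched in a few lines of Python. The sketch assumes the FDR correction is the Benjamini-Hochberg procedure, which the text does not specify, and it takes per-site P-values and Δβ values as given rather than re-implementing the limma regression. The function names and input numbers are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def signature_sites(p_values, delta_beta, fdr=0.01, min_effect=0.10):
    """Indices of sites with adjusted P < fdr and |delta beta| > min_effect."""
    q = bh_adjust(p_values)
    return [i for i in range(len(q))
            if q[i] < fdr and abs(delta_beta[i]) > min_effect]


# Illustrative inputs, not data from the study.
p = [0.0001, 0.0002, 0.5, 0.9]
delta = [0.15, 0.05, 0.30, -0.20]
print(signature_sites(p, delta))
```

In this toy input only the first site is retained: the second passes the significance threshold but not the effect-size threshold, and the last two pass the effect-size threshold but not the significance threshold.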
SVM model classification
Statistically significant CpG sites, i.e., the DNAm signature, were used as input into a machine-learning
algorithm, support vector machine (SVM), to generate a predictive classification model. To remove noise
and to filter out information that did not improve the efficacy of the model, we first removed redundant
sites. Any methylation site that was highly correlated (r > 0.9) with any other site was removed, leaving
429 CpG sites. We then built an SVM model using the R package caret (for details of model training and
validation, see Butcher et al. [10]) [32]. The classification model generated by SVM was then applied to all
remaining samples. The output of this model was a probability score indicating likelihood of having KS or a
genomic alteration that causes KS.
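One way to implement the redundancy filter is sketched below in Python. It is a greedy variant that keeps the first site of each highly correlated group and drops later sites whose Pearson r with any kept site exceeds 0.9. The text could also be read as removing every member of a correlated pair, so this is an assumption and not necessarily the authors' procedure. Site names and values are hypothetical, and each site is assumed to vary across samples (a constant site would make r undefined).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_redundant(sites, cutoff=0.9):
    """Greedily keep sites whose r with every already-kept site is <= cutoff.

    `sites` is a list of (name, beta_values_across_samples) pairs.
    """
    kept = []
    for name, values in sites:
        if all(pearson(values, kept_values) <= cutoff
               for _, kept_values in kept):
            kept.append((name, values))
    return [name for name, _ in kept]


# Illustrative beta values across four samples, not data from the study.
sites = [
    ("site_a", [0.10, 0.20, 0.30, 0.40]),
    ("site_b", [0.11, 0.21, 0.29, 0.41]),  # nearly identical to site_a
    ("site_c", [0.40, 0.10, 0.30, 0.20]),
]
print(drop_redundant(sites))
```

The cutoff is applied to r as written in the text, not to its absolute value, so strongly anti-correlated sites are retained in this sketch.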
GO analysis
Gene ontology (GO) enrichment analysis was performed on the KS signature sites using GREAT (Genomic
Regions Enrichment of Annotations Tool) [33]. We used a custom “background” that included all 774,583
CpG sites that passed quality control. “Basal+extension” was used to identify associated genes, using
the following modified parameters: constitutive 5.0 kb upstream and 1.0 kb downstream, up to 10.0 kb
maximum extension. We also refined the output by requiring that significant terms contain two or more
gene hits.
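To make the basal-domain parameters concrete, the Python sketch below computes a strand-aware basal regulatory domain (5.0 kb upstream, 1.0 kb downstream of a transcription start site) and lists the genes whose basal domain contains a CpG position. It deliberately omits the extension step (up to 10.0 kb) and is not GREAT's implementation. Gene names and coordinates are hypothetical, and all genes are assumed to lie on the same chromosome as the site.

```python
def basal_domain(tss, strand, upstream=5000, downstream=1000):
    """Return (start, end) of the basal domain around a TSS.

    Upstream and downstream are relative to the direction of transcription,
    so they swap sides on the minus strand.
    """
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def genes_for_site(position, genes):
    """Names of genes whose basal domain contains the CpG position.

    `genes` is a list of (name, tss, strand) tuples on one chromosome.
    """
    hits = []
    for name, tss, strand in genes:
        start, end = basal_domain(tss, strand)
        if start <= position <= end:
            hits.append(name)
    return hits


# Hypothetical genes and CpG positions for illustration.
genes = [("GENE_A", 100000, "+"), ("GENE_B", 120000, "-")]
print(genes_for_site(96000, genes))
print(genes_for_site(124000, genes))
```

A site 4 kb upstream of a plus-strand TSS falls inside that gene's basal domain, whereas a site 2 kb downstream does not.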
RESULTS
Identifying a DNA methylation signature for Kleefstra syndrome
To define a DNAm signature associated with KS, DNA from KS patients and neurotypical controls was
extracted from blood and assayed using the EPIC array, generating high-quality measurements at 774,583
CpG sites. Ten unrelated individuals with a confirmed clinical diagnosis of KS, samples KS1_T - KS10_T,
and pathogenic variants in EHMT1 or microdeletions of 9q34.3, which included partial or full deletions
of EHMT1 (n = 3 and n = 7, respectively; n = 6 females; age 1-25 years) were compared to 42 neurotypical
controls (n = 21 females; age 1-28 years). Since we combined data from patients with pathogenic variants in
EHMT1 and those with 9q34.3 microdeletions together, our analyses identified DNAm changes common to