Moore et al. J Transl Genet Genom 2021;5:200-217 https://dx.doi.org/10.20517/jtgg.2021.08 Page 5
the European Prospective Investigation into Cancer, Chronic Diseases, Nutrition, and Lifestyles (EPIC), the
Mayo Clinic Case-Control Study of Diffuse Large B-cell Lymphoma (Mayo), the Genetic Epidemiology of
CLL Consortium (GEC), and the Utah Chronic Lymphocytic Leukemia Study (Utah). Genotyping was
performed on commercially available Illumina and Affymetrix platforms [Supplementary Table 1]. Details,
including information on quality control and data cleaning, have been previously reported [6,8,10,11]. All studies
obtained informed consent from participants and were approved by their appropriate Institutional Review
Boards.
Prior to analysis, additional quality control and filtering were applied to each GWAS separately, including
removal of SNPs with a minor allele frequency < 0.05, > 3% missing genotypes, or Hardy-Weinberg P-value < 1 × 10⁻⁶
among controls, and removal of subjects with call rates < 97%. After applying these quality control filters, genotype data
were available for 10,467 NHL cases, including 3061 CLL, 3814 DLBCL, 2784 FL, and 808 MZL cases, as well
as 9374 controls [Supplementary Table 2].
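The per-GWAS filters above can be written as simple predicates. The following is a minimal sketch in Python, not the pipeline used in the study: the `SnpStats` container and both function names are hypothetical, and the summary statistics are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary statistics for one GWAS."""
    maf: float             # minor allele frequency
    missing_rate: float    # fraction of subjects with no call at this SNP
    hwe_p_controls: float  # Hardy-Weinberg P-value computed among controls


def snp_passes_qc(s, min_maf=0.05, max_missing=0.03, min_hwe_p=1e-6):
    """Keep a SNP unless MAF < 0.05, > 3% missing, or HWE P < 1e-6."""
    return (
        s.maf >= min_maf
        and s.missing_rate <= max_missing
        and s.hwe_p_controls >= min_hwe_p
    )


def subject_passes_qc(call_rate, min_call_rate=0.97):
    """Keep a subject unless the call rate is below 97%."""
    return call_rate >= min_call_rate
```

Thresholds are exposed as keyword arguments so that the boundary handling (for example, whether a SNP with exactly 3% missing is kept) is explicit; the text only specifies the exclusion side of each cutoff.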
We used PLINK1.9 [26,27] to identify ROH; specifically, we used the two-step command --homozyg. In the first
step, PLINK1.9 identifies directly genotyped SNPs that are possibly within an ROH by looking at 50-SNP
sliding windows across the genome and flagging all SNPs that are encompassed by at least 5% of fully
homozygous windows. For this step, we allowed one heterozygous SNP and up to five SNPs with no calls
within each window to account for a small amount of possible genotyping error and loss. In the second step,
ROH are identified from these sliding windows by requiring a minimum number of consecutive
homozygous SNPs. We required at least 100 consecutive homozygous SNPs for each ROH and that these
SNPs span at least 1500 kilobases (kb), with at least one SNP every 50 kb and the maximum gap between
SNPs of 5000 kb. These parameters were selected with reference to the “ROH_1.5Mb” ROH calling
parameters used by Gazal et al. [28]. We restricted analyses to the autosomal chromosomes.
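As an illustration, the ROH-calling settings described above could be assembled into a PLINK 1.9 invocation as follows. This is a sketch under assumptions: the text names only `--homozyg`, so the mapping of each stated value to a specific `--homozyg-*` flag is our reading of the description; the executable name `plink`, the helper `build_roh_command`, and the file prefixes are hypothetical. The function only builds the argument list and does not run PLINK.

```python
def build_roh_command(bfile_prefix, out_prefix):
    """Build a PLINK 1.9 argument list for ROH calling (not executed here)."""
    return [
        "plink",                              # assumed executable name
        "--bfile", bfile_prefix,              # binary genotype fileset prefix
        "--autosome",                         # restrict to autosomes
        "--homozyg",
        # Step 1: sliding-window scan
        "--homozyg-window-snp", "50",         # 50-SNP windows
        "--homozyg-window-het", "1",          # allow one heterozygous SNP
        "--homozyg-window-missing", "5",      # allow up to five no-calls
        "--homozyg-window-threshold", "0.05", # >= 5% of windows homozygous
        # Step 2: segment-level requirements
        "--homozyg-snp", "100",               # >= 100 SNPs per ROH
        "--homozyg-kb", "1500",               # >= 1500 kb per ROH
        "--homozyg-density", "50",            # at least one SNP per 50 kb
        "--homozyg-gap", "5000",              # max 5000 kb between SNPs
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical prefixes; print the command rather than executing it.
    print(" ".join(build_roh_command("gwas_study", "gwas_study_roh")))
```

The list form can be passed to `subprocess.run` directly, which avoids shell quoting issues if the file prefixes contain spaces.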
To estimate the extent of homozygosity across the genome, we calculated the fraction of the autosome
covered by ROH (FROH) by summing the lengths of ROH and dividing by 3 × 10⁹ base pairs as the
approximate size of the autosome for all GWAS. As another measure to assess homozygosity, we also
quantified and tested differences in relatedness across the genome in our study using a variant of the
inbreeding coefficient, F3 [29]. F3, which estimates the correlation between uniting gametes, is an alternative
to ROH-based estimates with potentially reduced bias and standard errors [30]. We estimated F3 using the
--ibc command in PLINK1.9. To estimate the association of FROH and F3 with NHL, we then estimated beta
coefficients and standard errors for each GWAS using logistic regression, adjusting for age, sex (except in
the UCSF1/NHS study, where all controls were female), fraction of missing SNPs, and the ten principal
components of ancestry to account for population stratification. The fraction of missing SNPs was
calculated for each participant as the number of SNPs without calls divided by the total number of SNPs
genotyped on the array that passed quality control metrics. Associations were combined across GWAS for
each subtype of NHL using random-effects meta-analysis implemented with the command “metan” in
STATA v15.
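The per-participant quantities and the pooling step described above can be sketched in Python. This is an illustrative sketch, not the analysis code: the function names are hypothetical, ROH lengths are assumed to be supplied in kb, and the random-effects step assumes the DerSimonian-Laird estimator, which the text does not specify for the “metan” call.

```python
import math

AUTOSOME_BP = 3e9  # approximate autosome size used in the text


def froh(roh_lengths_kb):
    """Fraction of the autosome covered by ROH for one participant."""
    return sum(roh_lengths_kb) * 1000 / AUTOSOME_BP


def missing_fraction(n_no_call, n_snps_passing_qc):
    """SNPs without calls divided by array SNPs that passed QC."""
    return n_no_call / n_snps_passing_qc


def random_effects_pool(betas, ses):
    """Pool per-GWAS betas (DerSimonian-Laird estimator, an assumption).

    Returns (pooled beta, pooled standard error, tau^2).
    """
    w = [1 / se ** 2 for se in ses]  # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, betas)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, betas))
    df = len(betas) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_star = [1 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * b for wi, b in zip(w_star, betas)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2
```

For example, a participant with two ROH of 1500 kb and 3000 kb has FROH = 4.5 × 10⁶ / 3 × 10⁹ = 0.0015. When the per-study estimates agree exactly, tau² is zero and the pooled result equals the fixed-effect inverse-variance estimate.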
After determining ROH as described above, we also tested whether specific genomic regions encompassed
by ROH were associated with risk of each of the four NHL subtypes. We divided each autosomal
chromosome into “bins” of 500 kb in length. We then calculated the midpoint of each identified ROH and
assigned it to the corresponding bin. Each study participant in the analysis was therefore categorized as
either homozygous (exposed) or heterozygous (unexposed) at each bin across the autosome. We calculated
beta coefficients and standard errors for the association between presence of an ROH in each bin and risk of
NHL subtype within each GWAS using logistic regression, adjusting for age, sex (except in the UCSF1/NHS