Strange India All Strange Things About India and world


DNM filtering in 100,000 Genomes Project

We analysed DNMs called in 13,949 parent–offspring trios from 12,609 families from the rare disease programme of the 100,000 Genomes Project. The rare disease cohort includes individuals with a wide array of diseases, including neurodevelopmental disorders, cardiovascular disorders, renal and urinary tract disorders, ophthalmological disorders, tumour syndromes, ciliopathies and others. These are described in more detail in previous publications60,61. The cohort was whole-genome sequenced at around 35× coverage and variant calling for these families was performed through the Genomics England rare disease analysis pipeline. The details of sequencing and variant calling have been previously described61. DNMs were called by the Genomics England Bioinformatics team using the Platypus variant caller62. These were selected to optimize various properties, including the number of DNMs per person being approximately what we would expect, the distribution of the VAF of the DNMs to be centred around 0.5 and the true positive rate of DNMs to be sufficiently high as calculated from examining IGV plots. The filters applied were as follows:

  • Genotype is heterozygous in child (1/0) and homozygous in both parents (0/0).

  • Child read depth (RD) > 20, mother RD > 20, father RD > 20.

  • Remove variants with >1 alternative read in either parent.

  • VAF > 0.3 and VAF < 0.7 for child.

  • Remove SNVs within 20 bp of each other. Although this is probably removing true MNVs, the error mode was very high for clustered mutations.

  • Removed DNMs if child RD > 98 (ref. 14).

  • Removed DNMs that fell within known segmental duplication regions as defined by the UCSC (http://humanparalogy.gs.washington.edu/build37/data/GRCh37GenomicSuperDup.tab).

  • Removed DNMs that fell in highly repetitive regions (http://humanparalogy.gs.washington.edu/build37/data/GRCh37simpleRepeat.txt).

  • For DNM calls that fell on the X chromosome, these slightly modified filters were used:

  • For DNMs that fell in PAR regions, the filters were unchanged from the autosomal calls apart from allowing for both heterozygous (1/0) and hemizygous (1) calls in males.

  • For DNMs that fell in non-PAR regions the following filters were used:

  • For males: RD > 20 in child, RD > 20 in mother, no RD filter on father.

  • For males: the genotype must be hemizygous (1) in child and homozygous in mother (0/0).

  • For females: RD > 20 in child, RD > 20 in mother, RD > 10 in father.

DNM filtering in DDD

To identify individuals with hypermutation in the DDD study, we started with exome-sequencing data from the DDD study of families with a child with a severe, undiagnosed developmental disorder. The recruitment of these families has been described previously63: families were recruited at 24 clinical genetics centres within the UK National Health Service and the Republic of Ireland. Families gave informed consent to participate, and the study was approved by the UK Research Ethics Committee (10/H0305/83, granted by the Cambridge South Research Ethics Committee, and GEN/284/12, granted by the Republic of Ireland Research Ethics Committee). Sequence alignment and variant calling of SNVs and indels were conducted as previously described. DNMs were called using DeNovoGear and filtered as described previously12,64. The analysis in this paper was conducted on a subset (7,930 parent–offspring trios) of the full current cohort, which was not available at the start of this research.

In the DDD study, we identified 9 individuals out of 7,930 parent–offspring trios with an increased number of exome DNMs after accounting for parental age (7-17 exome DNMs compared to an expected number of ~2). These were subsequently submitted along with their parents for PCR-free whole-genome sequencing at >30x mean coverage using Illumina 150bp paired end reads and in house WSI sequencing pipelines. Reads were mapped with bwa (v0.7.15)65. DNMs were called from these trios using DeNovoGear64 and were filtered as follows:

  • Child RD > 10, mother RD > 10, father RD > 10.

  • Alternative allele RD in child of >2.

  • Filtered on strand bias across parents and child (p-value > 0.001, Fisher’s exact test).

  • Removed DNMs that fell within known segmental duplication regions as defined by the UCSC (http://humanparalogy.gs.washington.edu/build37/data/GRCh37GenomicSuperDup.tab).

  • Removed DNMs that fell in highly repetitive regions (http://humanparalogy.gs.washington.edu/build37/data/GRCh37simpleRepeat.txt).

  • Allele frequency in gnomAD < 0.01.

  • VAF < 0.1 for both parents.

  • Removed mutations if both parents have >1 read supporting the alternative allele.

  • Test to see whether VAF in the child is significantly greater than the error rate at that site as defined by error sites estimated using Shearwater66.

  • Posterior probability from DeNovoGear > 0.00781 (refs. 12,64).

  • Removed DNMs if the child RD > 200.

After applying these filters, this resulted in 1,367 DNMs. All of these DNMs were inspected in the Integrative Genome Viewer67 and removed if they appeared to be false-positives. This resulted in a final set of 916 DNMs across the 9 trios. One out of the nine had 277 dnSNVs genome wide, whereas the others had expected numbers (median, 81 dnSNVs).

Parental phasing of DNMs

To phase the DNMs in both 100kGP and DDD, we used a custom script that used the following read-based approach to phase a DNM. This first searches for heterozygous variants within 500 bp of the DNM that was able to be phased to a parent (so not heterozygous in both parents and offspring). We next examined the reads or read pairs that included both the variant and the DNM and counted how many times we observed the DNM on the same haplotype of each parent. If the DNM appeared exclusively on the same haplotype as a single parent then that was determined to originate from that parent. We discarded DNMs that had conflicting evidence from both parents. This code is available on GitHub (https://github.com/queenjobo/PhaseMyDeNovo).

Parental age and germline-mutation rate

To assess the effect of parental age on germline-mutation rate, we ran the following regressions on autosomal DNMs. These and subsequent statistical analyses were performed primarily in R (v.4.0.1). On all (unphased) DNMs, we ran two separate regressions for SNVs and indels. We chose a negative binomial generalized linear model (GLM) here as the Poisson was found to be overdispersed. We fitted the following model using a negative Binomial GLM with an identity link where Y is the number of DNMs for an individual:

E(Y) = β0 + β1paternal age + β2maternal age

For the phased DNMs we fit the following two models using a negative binomial GLM with an identity link where Ymaternal is the number of maternally derived DNMs and Ypaternal is the number of paternally derived DNMs:

E(Ypaternal) = β0 + β1paternal age

E(Ymaternal) = β0 + β1maternal age

Individuals with hypermutation in the 100kGP cohort

To identify individuals with hypermutation in the 100kGP cohort, we first wanted to regress out the effect of parental age as described in the parental age analysis. We then looked at the distribution of the studentized residuals and then, assuming these followed a t distribution with N − 3 degrees of freedom, calculated a t-test P value for each individual. We took the same approach for the number of indels except, in this case, Y would be the number of de novo indels.

We identified 21 individuals out of 12,471 parent–offspring trios with a significantly increased number of dnSNVs genome wide (P < 0.05/12,471 tests). We performed multiple quality control analyses, which included examining the mutations in the Integrative Genomics Browser for these individuals to examine DNM calling accuracy, looking at the relative position of the DNMs across the genome and examining the mutational spectra of the DNMs to identify any well-known sequencing error mutation types. We identified 12 that were not truly hypermutated. The majority of false-positives (10) were due to a parental somatic deletion in the blood, increasing the number of apparent DNMs (Supplementary Fig. 7). These individuals had some of the highest numbers of DNMs called (up to 1,379 DNMs per individual). For each of these 10 individuals, the DNM calls all clustered to a specific region in a single chromosome. In this same corresponding region in the parent, we observed a loss of heterozygosity when calculating the heterozygous/homozygous ratio. Moreover, many of these calls appeared to be low-level mosaic in that same parent. This type of event has previously been shown to create artifacts in CNV calls and is referred to as a ‘loss of transmitted allele’ event68. The remaining two false-positives were due to bad data quality in either the offspring or one of the parents leading to poor DNM calls. The large number of DNMs in these false-positive individuals also led to significant underdispersion in the model so, after removing these 12 individuals, we reran the regression model and subsequently identified 11 individuals who appeared to have true hypermutation (P < 0.05/12,459 tests).

Extraction of mutational signatures

Mutational signatures were extracted from maternally and paternally phased autosomal DNMs, 24 controls (randomly selected), 25 individuals (father with a cancer diagnosis before conception), 27 individuals (mother with a cancer diagnosis before conception) and 12 individuals with hypermutation that we identified. All DNMs were lifted over to GRCh37 before signature extraction (100kGP samples are a mix of GRCh37 and GRCh38) and, through the liftover process, a small number of 100kGP DNMs were lost (0.09% overall, 2 DNMs were lost across all of the individuals with hypermutation). The mutation counts for all of the samples are shown in Supplementary Table 1. This was performed using SigProfiler (v.1.0.17) and these signatures were extracted and subsequently mapped on to COSMIC mutational signatures (COSMIC v.91, Mutational Signature v.3.1)19,40. SigProfiler defaults to selecting a solution with higher specificity than sensitivity. A solution with 4 de novo signatures was chosen as optimal by SigProfiler for the 12 individuals with germline-hypermutated genomes. Another stable solution with five de novo signatures was also manually deconvoluted, which has been considered as the final solution. The mutation probability for mutational signature SBSHYP is shown in Supplementary Table 3.

External exposure signature comparison

We compared the extracted signatures from these individuals with hypermutation with a compilation of previously identified signatures caused by environmental mutagens from the literature. The environmental signatures were compiled from refs. 24,51,52. Comparison was calculated as the cosine similarity between the different signatures.

Genes involved in DNA repair

We compiled a list of DNA-repair genes that were taken from an updated version of the table in ref. 69 (https://www.mdanderson.org/documents/Labs/Wood-Laboratory/human-dna-repair-genes.html). These can be found in Supplementary Table 4. These are annotated with the pathways that they are involved with (such as nucleotide-excision repair, mismatch repair). A ‘rare’ variant is defined as those with an allele frequency of <0.001 for heterozygous variants and those with an allele frequency of <0.01 for homozygous variants in both the 1000 Genomes as well as across the 100kGP cohort.

Kinetic characterization of MPG

The A135T variant of MPG was generated by site-directed mutagenesis and confirmed by sequencing both strands. The catalytic domain of WT and A135T MPG was expressed in BL21(DE3) Rosetta2 Escherichia coli and purified as described for the full-length protein70. Protein concentration was determined by absorbance at 280 nm. Active concentration was determined by electrophoretic mobility shift assay with 5′-FAM-labelled pyrolidine-DNA48 (Extended Data Fig. 8). Glycosylase assays were performed with 50 mM NaMOPS, pH 7.3, 172 mM potassium acetate, 1 mM DTT, 1 mM EDTA, 0.1 mg ml−1 BSA at 37 °C. For single-turnover glycosylase activity, a 5′-FAM-labelled duplex was annealed by heating to 95 °C and slowly cooling to 4 °C (Extended Data Fig. 9). DNA substrate concentration was varied between 10 nM and 50 nM, and MPG concentration was maintained in at least twofold excess over DNA from 25 nM to 10,000 nM. Samples taken at timepoints were quenched in 0.2 M NaOH, heated to 70 °C for 12.5 min, then mixed with formamide/EDTA loading buffer and analysed by 15% denaturing polyacrylamide gel electrophoresis. Fluorescence was quantified using the Typhoon 5 imager and ImageQuant software (GE). The fraction of product was fit by a single exponential equation to determine the observed single-turnover rate constant (kobs). For Hx excision, the concentration dependence was fit by the equation kobs = kmax [E]/(K1/2 + [E]), where K1/2 is the concentration at which half the maximal rate constant (kmax) was obtained and [E] is the concentration of enzyme. It was not possible to measure the K1/2 for εA excision using a fluorescence-based assay owing to extremely tight binding71. Multiple turnover glycosylase assays were performed with 5 nM MPG and 10–40-fold excess of substrate (Extended Data Fig. 8).

Fraction of variance explained

To estimate the fraction of germline mutation variance explained by several factors, we fit the following negative binomial GLMs with an identity link. Data quality is likely to correlate with the number of DNMs detected so, to reduce this variation, we used a subset of the 100kGP dataset that had been filtered on some base quality control metrics by the Bioinformatics team at GEL:

We then included the following variables to try to capture as much of the residual measurement error which may also be impacting DNM calling. In brackets are the corresponding variable names used in the models below:

  • Mean coverage for the child, mother and father (child mean RD, mother mean RD, father mean RD)

  • Proportion of aligned reads for the child, mother and father (child prop aligned, mother prop aligned, father prop aligned)

  • Number of SNVs called for child, mother and father (child snvs, mother snvs, father snvs)

  • Median VAF of DNMs called in child (median VAF)

  • Median ‘Bayes Factor’ as outputted by Platypus for DNMs called in the child. This is a metric of DNM quality (median BF).

The first model only included parental age:

E(Y) = β0 + β1paternal age + β2maternal age

The second model also included data quality variables as described above:

$$begin{array}{cc}E(Y),= & {beta }_{0}+{beta }_{1}{rm{paternal; age}}+{beta }_{2}{rm{maternal; age}}\ & +{beta }_{3}{rm{child; mean; RD}}+{beta }_{4}{rm{mother; mean; RD}}\ & +{beta }_{5}{rm{father; mean; RD}}+{beta }_{6}{rm{child; prop; aligned}}\ & +{beta }_{7}{rm{mother; prop; aligned}}+{beta }_{8}{rm{father; prop; aligned}}\ & +{beta }_{9}{rm{childs; nvs}}+{beta }_{10}{rm{mother; snvs}}+{beta }_{11}{rm{father; snvs}}\ & +{beta }_{12}{rm{median; VAF}}+{beta }_{13}{rm{median; BF}}end{array}$$

The third model included a variable for excess mutations in the 11 confirmed individuals with hypermutation (hm excess) in the 100kGP dataset. This variable was the total number of mutations subtracted by the median number of DNMs in the cohort (65), Yhypermutated − median(Y) for these 11 individuals and 0 for all other individuals.

$$begin{array}{cc}E(Y),= & {beta }_{0}+{beta }_{1}{rm{paternal; age}}+{beta }_{2}{rm{maternal; age}}\ & +{beta }_{3}{rm{child; mean; RD}}+{beta }_{4}{rm{mother; mean; RD}}\ & +{beta }_{5},{rm{father; mean; RD}}+{beta }_{6}{rm{child; prop; aligned}}\ & +{beta }_{7}{rm{mother; prop; aligned}}+{beta }_{8}{rm{father; prop; aligned}}\ & +{beta }_{9}{rm{child; snvs}}+{beta }_{10}{rm{mother; snvs}}+{beta }_{11}{rm{father; snvs}}\ & +{beta }_{12}{rm{median; VAF}}+{beta }_{13}{rm{median; BF}}+{beta }_{14}{rm{hm; excess}}end{array}$$

The fraction of variance (F) explained after accounting for Poisson variance in the mutation rate was calculated in a similar way to in ref. 1 using the following formula:

$$F={rm{pseudo}},{R}^{2}frac{1-underline{Y}}{{rm{Var}}(Y)}$$

McFadden’s pseudo R2 was used here as a negative binomial GLM was fitted. We repeated these analyses fitting an ordinary least squares regression, as was done in ref. 1, using the R2 and got comparable results. To calculate a 95% confidence interval, we used a bootstrapping approach. We sampled with a replacement 1,000 times and extracted the 2.5% and 97.5% percentiles.

Rare variants in DNA-repair genes

We fit eight separate regressions to assess the contribution of rare variants in DNA-repair genes (compiled as described previously). These were across three different sets of genes: variants in all DNA-repair genes, variants in a subset of DNA-repair genes that are known to be associated with base-excision repair, MMR, NER or a DNA polymerase, and variants within this subset that have also been associated with a cancer phenotype. For this, we downloaded all ClinVar entries as of October 2019 and searched for germline ‘pathogenic’ or ‘likely pathogenic’ variants annotated with cancer55. We tested both all non-synonymous variants and just PTVs for each set. To assess the contribution of each of these sets, we created two binary variables per set indicating a presence or absence of a maternal or paternal variant for each individual, and then ran a negative binomial regression for each subset including these as independent variables along with hypermutation status, parental age and quality-control metrics as described in the previous section.

Simulations for parental age effect

We downsampled from the full cohort to examine how the estimates of the fraction of variance in the numberof DNMs explained by paternal age varied with sample number. We first simulated a random sample as follows 10,000 times:

  • Randomly sample 78 trios (the number of trios in ref. 1.)

  • Fit ordinary least squares of E(Y) = β0 + β1paternal age.

  • Estimated the fraction of variance (F) as described in ref. 1.

We found that the median fraction explained was 0.77, with a s.d. of 0.13 and with 95% of simulations fallings between 0.51 and 1.00.

Parental cancer diagnosis before conception

To identify parents who had received a cancer diagnosis before the conception of their child, we examined the admitted patient care hospital episode statistics of these parents. There were no hospital episode statistics available before 1997, and many individuals did not have any records until after the birth of the child. To ensure that comparisons were not biased by this, we first subset to parents who had at least one episode statistic recorded at least two years before the child’s year of birth. Two years before the child’s birth was our best approximation for before conception without the exact child date of birth. This resulted in 2,891 fathers and 5,508 mothers. From this set we then extracted all entries with ICD10 codes with a ‘C’ prefix, which corresponds to malignant neoplasms, and ‘Z85’, which corresponds to a personal history of malignant neoplasm. We defined a parent as having a cancer diagnosis before conception if they had any of these codes recorded ≥2 years before the child’s year of birth. We also extracted all entries with ICD10 code ‘Z511’, which codes for an ‘encounter for antineoplastic chemotherapy and immunotherapy’.

Two fathers of individuals with hypermutation who we suspect had chemotherapy before conception did not meet these criteria as the father of GEL_5 received chemotherapy for treatment for systemic lupus erythematosus and not cancer and, for the father of GEL_8, the hospital record ‘personal history of malignant neoplasm’ was entered after the conception of the child (Supplementary Table 5).

To compare the number of dnSNVs between the group of individuals with parents with and without cancer diagnoses, we used a Wilcoxon test on the residuals from the negative binomial regression on dnSNVs correcting for parental age, hypermutation status and data quality. To look at the effect of maternal cancer on dnSNVs, we matched these individuals on maternal and paternal age with sampling replacement with 20 controls for each of the 27 individuals. We found a significant increase in DNMs (74 compared to 65 median dnSNVs, P = 0.001, Wilcoxon Test).

SNP heritability analysis

For this analysis, we started with the same subset of the 100kGP dataset that had been filtered as described in the analysis of the impact of rare variants in DNA-repair genes across the cohort (see above). To ensure variant quality, we subsetted to variants that have been observed in genomes from gnomAD (v.3)72. These were then filtered by ancestry to parent–offspring trios where both the parents and child mapped on to the 1000 Genomes GBR subpopulations. The first 10 principal components were subsequently included in the heritability analyses. To remove cryptic relatedness, we removed individuals with an estimated relatedness of >0.025 (using GCTA grm-cutoff, 0.025). This resulted in a set of 6,352 fathers and 6,329 mothers. The phenotype in this analysis was defined as the residual from the negative binomial regression of the number of DNMs after accounting for parental age, hypermutation status and several data quality variables, as described when estimating the fraction of DNM count variation explained (see above). To estimate heritability, we ran GCTA GREML-LDMS on two linkage disequilibrium stratifications and three MAF bins (0.001–0.01, 0.01–0.05, 0.05–1)56. For mothers, this was run with the –reml-no-constrain option because it would otherwise not converge (Supplementary Table 9).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.



Source link

Leave a Reply

Your email address will not be published.