Identification of constrained sequence elements across 239 primate genomes - Nature

De novo assembly and repeat-masking

Table of Contents

To maximize the species diversity of primates in our analyses, we newly sequenced and assembled the genomes of 187 different primate species, initially presented in refs. ^11,23, for which no other reference genome assembly was available. In brief, each individual was sequenced with 150 bp paired end reads on the Illumina NovaSeq 6000 platform to an average whole-genome coverage of ~35×, and we assembled the resulting reads into contigs using Megahit²⁵ (version 1.2.9) using default parameters. The resulting assemblies had an average contig N50 of 34 kb, and the assembly sizes ranged from 2.1–3.0 Gb, thus falling within the typical range of previously reported genome sizes for primates⁵⁶ (see Extended Data Fig. 1a). We then combined these assemblies with the reference genomes of 52 additional species that had been previously generated as part of other studies⁵⁷ and or available through public repositories (Supplementary Data 1). The final species sampling densely covers the whole primate radiation and includes members of all 16 primate families and 72 primate genera. We identified and soft-masked common genomic repeats within the assemblies using RepeatMasker (version 4.1.2-p1; http://www.repeatmasker.org), utilizing the primates repeat catalogue as query.

Multiple sequence alignment

We aligned the assemblies with Cactus²¹ (version 2.1.1), using the phylogeny presented in¹¹ as a guide tree for progressive decomposition, and used the previously available high-quality assemblies as alignment outgroups. All computation was done by running cactus-prepare with options–wdl–noLocalInputs–preprocessBatchSize 5–defaultDisk 3000 G–halAppendDisk 9000 G–defaultCores 64–gpu–gpuCount 8–defaultMemory 385 G–alignMemory 450 to produce a script in workflow description language (WDL), then uploading it to Terra (https://app.terra.bio/) where it was executed on Google Cloud Platform. GPU-related issues prevented that version of Cactus from executing to completion, so the job was resumed using a WDL made without the–gpu and–gpuCount options. An outgroup to primates (Mus musculus reference mm10) was manually added to the root alignment job by editing the WDL, and the ‘LOCAL’ disk parameter of the hal_append_subtree task was manually increased to 9,000. Cactus has since been fixed (v2.2.3) to resolve all issues encountered during this alignment.

We then combined our resulting primate MSA with the recently generated mammalian MSA by the Zoonomia consortium²⁰. In brief, we used hal2fasta from the haltools²¹ package to output the ancestral genome at the root of the primate MSA, and used it to generate a bridge alignment with the Sunda colugo (Galeopterus variegatus), the closest outgroup to primates in the Zoonomia MSA. We used this bridge alignment to insert the primate MSA into the Zoonomia MSA, and replace the original primate branch with it.

To generate the final, filtered alignment used as input for subsequent analyses described below, we output maf files centred on the human genome reference using haltools including the “–onlyOrthologs–noAncestors –noDupes” flags, thus removing any regions with potentially ambiguous mappings at multiple locations.

Pairwise alignments error rate estimate

To quantify residual error rates within the genome assemblies generated in this project, we identified 25 species for which a reference genome was previously assembled with an orthogonal, state of the art combination of technologies (Supplementary Table 1). After introducing a minimum contig length cutoff of 1 kb, we generated pairwise alignments between the two assemblies using minimap2⁵⁸ (v. 2.17-r941) using the following flags:–cs -x asm5. We called variants on the resulting alignments by retaining alignment blocks of at least 1 kb within the PAF file using paftools.js, by applying the following flags: paftools.js call -l 1000 -L 1000. We quantified mismatch rates from the resulting output accounting for the fraction of the genome within alignment blocks, resulting in mismatch rates that range from 0.00026–0.00515 mismatches per bp. As the genome assemblies produced herein are haploid compressions of diploid organisms, a random allele will be sampled and incorporated at heterozygous positions, and thus the resulting differences between two assemblies of the same species should be strongly correlated with the species’ intraspecific diversity. We compared our mismatch rates to the estimates of heterozygosity for the same genomes presented in ref. ¹¹, and confirmed that heterozygosity accounts for 83% of the observed variation in mismatch rates across assemblies. We quantified the residual mismatch rate after regressing out it’s the effects of heterozygosity, and found the resulting average mismatch rate to be 0.0004 mismatches per bp, which we consider to be sufficiently low for our analyses. We note that the number of base differences due to assembly error is likely lower than this, as residual mismatches also include fixed differences between individuals, which are not accounted for by heterozygosity.

Detecting selective constraint

We measured selective constraint genome wide using the widely used phyloP and phastCons algorithms from the PHAST package^26,59. To do so, we extracted the ancestral genomes of primates and of eutherian mammals from our alignment using haltools hal2fasta, and annotated common genomic repeats in both using ReapeatMasker as described above, but using the mammalian repeat catalogue for the eutherian ancestor. We lifted the resulting annotations into human reference space, and randomly sampled 1 Mb of autosomal short interspersed nuclear element (SINE), long interspersed nuclear element (LINE), long terminal repeat (LTR) and DNA repeats from the alignments as putatively neutrally evolving regions. We used these regions as input for phyloFit together with the general reversible model (“–subst-mod REV”) as the nucleotide substitution model and expectation maximization algorithm (“-EM”) to fit it to the data. As our goal is to detect elements with sequence constraint specific to primates, we generated the neutral background models once for all primates, and once for all mammals after excluding the primate branch. We additionally generated a neutral model for the 100-way vertebrate MSA from UCSC Genome Browser in our analysis to minimize false negatives on the mammalian track, for which we also excluded the primate branch containing 11 species and defined neutral background models via alignments at 4D sites as putatively neutral regions, due to their easier detection across the much larger phylogenetic distances present in this alignment.

We used the models to estimate constraint in different ways across the three clades (primates, mammals, vertebrates): For phyloP, we calculated scores for both constraint and acceleration with the “–mode CONACC” flag, and used the likelihood ratio test “–method LRT” yielding phyloP scores—that is, the −log₁₀(P value) from the hypothesis test, and the associated scale factor. We scored individual bases by outputting them via the “–wig-scores” flags. We additionally scored element-wide annotations for coding sequences, DHS and TFBS by passing them to phyloP via the “–features” flag, to increase power as the test is performed across more than a single basepair. Finally, we generated discrete constrained elements in primates using phastCons, using primate neutral background model, the “–expected-length 45–target-coverage 0.3–rho 0.31” consistent with previous studies¹⁸, and output constrained elements with the “–most-conserved” flag.

To explore the potential impact of regional variation in substitution rates on our estimates of constraint, we additionally generated regional neutral background models for primates and other mammals from 1-Mb sliding windows across the human genome. In each window, we subset the previously identified ancestral repeats and randomly selected 100 kb of sequence after trimming sites with >20% missing data. As described above, these sites were used to estimate substitution rates input with phyloFit, and the resulting models were used to run phyloP for individual bases and DHSs elements.

To additionally ensure our estimates of constraint are robust to topological variation in the underlying phylogeny due to potential sources of uncertainty such as incomplete lineage sorting, we additionally inferred regional phylogenies for primates using a maximum likelihood approach implemented in IQtree. In brief, we randomly subset 150 kb of trimmed sequence from each 1 Mb window, which was used to estimate an appropriate substitution model and infer the phylogeny including 1,000 bootstraps. We used the topology of the resulting consensus tree and the ancestral repeat alignments to infer neutral models as described, using the same subset of sites as for the regional models to minimize additional sources of variation, and assessed the concordance of constraint for DHS elements between regional models using the canonical and regional phylogenies.

Protein-coding exons

To identify protein-coding exons with constrained specifically in the primate lineage, we used phyloP with protein-coding exons from GENCODE (v 42)^9,27 as element-wise input as described above across the primate, mammalian, and vertebrate tracks. We restricted these analyses to exons that are part of ‘Ensembl canonical’ transcript, and additionally excluded any exon that overlaps known human segmental duplications, as defined by the segmental duplication track on UCSC Genome Browser. We ran element-wise phyloP tests on these remaining coding exons, and defined constrained exons for each clade (primates, mammals, vertebrates) directly based on the resulting P values. We accounted for multiple testing by retaining those that remained significant at a 5% FDR⁶⁰. To define exons with primate-specific constraint, we required them to be significantly constrained in primates, but not in mammals or vertebrates. To detect whether these exons also have coding potential in other mammals, we lifted the underlying coordinates to the mouse genomes (mm10) and checked weather they overlap protein-coding annotations there. To define genes with primate-specific constraint, we looked for genes containing one or more exons with primate-specific constraint, but no mammal differentially constrained ones. To calculate differences in the proportion of alternatively spliced exons between broadly constrained and primate specifically constrained exons, we calculated the mean exon inclusion rate across tissues from the GTEx project⁶¹, and defined exons with an inclusion rate different from 1 as alternatively spliced. A list of exons and genes with primate-specific constraint is presented in Supplementary Data 2.

GO-term enrichment

We used Panther⁶² to calculate Gene Ontology (GO)-term enrichments of genes with primate-specific constraint, and those overlapping primate UCEs. We used Fishers’ exact to test for statistical overrepresentation on the ‘GO biological process’ annotation, by using the Ensembl identifiers of the underlying genes from either analysis as foreground set and the human gene annotation as background. To account for multiple testing, we report only results that remain significance at a FDR (Benjamini–Hochberg) of 5%.

DHSs and TFBSs

We obtained high-resolution maps of DHSs from 733 human biosamples encompassing 438 cell and tissue types and states⁴⁷. The study reported 3.6 million DHS elements, and we applied several additional quality control steps to remove low-quality peaks. First, we excluded all peaks without 1-to-1 matches between GRCh38 and hg19. We normalized peaks to 300 bps in size for all analyses, except for the element-wise constraint scoring described below. Finally, we required all peaks to be within the top 100,000 in at least one annotated cell type in the datasets, by the normalized score provided from the study. After excluding sex chromosomes, this resulted in a set of 1,238,405 peaks that were used in downstream analyses. We similarly obtained 3,622,316 consensus DNase I hypersensitivity footprints for the set of DHS elements used in our primary analyses³⁸. Cell types and tissues where each DHS element was most strongly associated were previously estimated using non-negative matrix factorization with 16 components⁴⁷.

We defined a core 40-bp window surrounding the summit of the peak of each DHS annotation as the input to calculate element-wise. Analogous to protein-coding exons, we then calculated constraint in DHS and TFBS element-wise using phyloP across primates, mammals, and vertebrates, and define constrained elements in each clade as those remaining significant at a 5% FDR⁶⁰. To define primate-specific constraint in DHS and TFBS, we required the elements to be significantly constrained in primates, but not in mammals or vertebrates. Finally, DHS elements and TFBSs that did not have primate-specific constraint by phyloP but overlapped with a primate PhastCons elements were excluded from the primary analyses for consistency in interpretation, since these sequences represent a mixture of primate-specific and deeply constrained sequences. The depth of constraint for each DHS and TFBS are provided in Supplementary Data 9 and 10. Approximate target genes of each DHS element were based on the closest gene using the ‘nearest’ function the R GenomicRanges package.

TFBS enrichment analysis

We obtained archetypal motifs overlapping each TFBS or DHS footprint from the annotations presented in ref. ³⁸. Footprints typically had multiple motif matches and were considered independently. For each motif, we computed the proportion of footprints in either constraint category (primate or mammal constrained below an FDR of 5%, as described above), where the denominator was the total number of constrained footprints (primate or mammal) regardless of motif match. We then calculated the odds ratio for each motif to test whether the proportion of primate-constrained and mammal-constrained footprints were different. After observing a small bias where short footprints were more likely to be detected as constrained in mammals, we split footprints into 10 equal-sized bins, computed the odds ratio for each motif in each bin, then performed a fixed effects meta-analysis for each motif.

Primate UCEs

We defined UCEs across primates analogous to ref. ¹⁸: We filtered regions with ambiguous or multiple alignments using haltools including the “–onlyOrthologs–noAncestors –noDupes” flags, and parsed the resulting alignment to exclude any alignment column that is different from all other species in at least one species. We then kept consecutive stretches of 20 bp or more for the final set of UCEs. For a more lax definition, we allowed for missing data (“-” or “N”) in the alignment in at most 2 species (1%). We strictly defined overlap to previous annotations as 1 bp or more.

Estimates of constraint in human populations

Gene constraint in the human population was estimated using the LOEUF metric. In brief, this metric conservatively estimates the selection against loss-of-function mutations by taking the upper bound of a 95% Poisson confidence interval around the observed to expected ratio of loss-of-function mutations. LOEUF values were obtained from 141,456 individuals in gnomAD v2⁴⁵. Constraint across noncoding regions of the genome was estimated as a z-score for depletion of mutations compared to expectation⁴⁶. Z-scores for non-overlapping 1,000-bp bins were obtained from 71,156 individuals in gnomAD v3. When a DHS element overlapped multiple bins the average z-score was used.

Trait-associated variant analyses

Fine-mapping results for 96 complex traits and diseases across 366,194 unrelated ‘white British’ individuals in the UKBB⁶³ were obtained from https://www.finucanelab.org/data and have previously been described in detail⁶⁴. In brief, fine mapping was performed using FINEMAP^65,66 and SuSiE⁶⁷ with GWAS summary statistics from SAIGE/BOLT-LMM and in-sample dosage linkage disequilibrium (LD) computed by LDstore 2⁶⁸. Regions were defined by expanding ±1.5 Mb for each lead variant and were merged if they overlapped. Up to 10 causal variants were allowed per region. Posterior inclusion probabilities (PIPs) were averaged across the two methods and variants where PIPs from the two methods disagreed by >0.05 were excluded.

Fine-mapping results for expression quantitative traits in 49 tissues across 838 individuals were obtained from https://www.finucanelab.org/data and have been described in detail^61,64. In brief, fine mapping was performed using SuSiE on cis-eQTL summary statistics from the GTEx portal (https://gtexportal.org/). Covariates (sex, PCR amplification, sequencing platform, genotype principal components, and probabilistic estimation of expression residuals factors⁶⁹) were projected out from the genotypes prior to fine mapping. After fine mapping, all variants were lifted over from GRCh38 to hg19.

Definition of constraint at DHS and TFBSs was slightly modified such that evidence of constraint out to mammals or vertebrates was separated and elements with discrepant estimates of constraint were excluded. Specifically, constraint at approximately 100 Ma required that mammal and primate phyloP scores were below the FDR threshold but vertebrate phyloP was above the FDR threshold. Similarly, constraint at approximately 160–400 Ma required that vertebrate, mammal, and primate phyloP scores were below the FDR threshold.

Bigwig files for accessible chromatin and transcription factor occupancy were obtained from the ENCODE project^47,70 (ENCFF220IWU, ENCFF659BVQ, ENCFF619LIB and ENCFF842XRQ) or the Sequence Read Archive (SRX097095). Coding variants were annotated as loss-of-function, missense, or synonymous using the Ensembl Variant Effect Predictor (VEP) v85⁷¹. When a variant had multiple coding annotations, the most severe consequence on the canonical transcript (GENCODE v19) was used.

We computed the enrichment of fine-mapped variants for different annotations by comparing the proportion of variants with PIP > 0.5 to the proportion of variants with PIP < 0.01. Distal elements were defined as DHS elements that did not overlap promoters⁷². When variants were fine-mapped across multiple traits, tissues, or genes, only the highest PIP variant was used. Confidence intervals and P values were estimated using Fisher’s exact test. Enrichments were performed in hg19 and annotations were lifted over from GRCh38.

A similar enrichment analysis was performed using stratified LD score regression (S-LDSC)⁷² to estimate the heritability in each annotation. Similar to previous studies⁷, S-LDSC models were fit using approximately 10 million common variants including the Baseline v2.2 annotations. Annotations derived in GrCH38 were lifted over to hg19, and their LD scores were estimated using the EUR sub-population of the 1000 Genomes project. Enrichment and average per-SNP heritability estimates were meta-analysed across 69 mostly independent traits using a random effects model.

The predicted effects of fine-mapped variants on transcription factor binding was estimated using motifbreakR⁷³ for 426 position weight matrices from HOCOMOCOv11⁷⁴. A motif match was determined using the information content (ic) if either allele obtained a P value < 0.0001. A variant disrupted a motif match if there was a difference of >0.4 for the scaled motif matrix between alleles.

Enformer analysis

We obtained the 733 bio-sample aggregated DNase peak dataset as curated by⁴⁷ and deduplicated the technical replicates by retaining the top bio-sample for samples with technical replicates. We retained all DHS peaks found in more than two biosamples for downstream analysis, calculated the midpoint for each DHS and scored the regions using the Enformer model⁴¹. To assess the local functional relevance of the Enformer scores, we averaged them across ±128 bp around the midpoint of each DHS. To compute the correlation between the Enformer score and phyloP in each bio-sample, we pairwise intersected DHS with primate-specific constraint for all bio-sample pairs, and computed the correlation between the Enformer and phyloP scores for the retained regions, and row and column normalized the final correlation matrix. The final matrix was hierarchically clustered on the rows, and the same order was retained for the columns in the heat map. Major cell types for each correlation block identified are highlighted as annotations.

Luciferase reporter vector construction

Mouse, chimp and human CRE with 150 bp in length were synthesized by IDT. The CRE was cloned into the linearized pGL3- Promoter vector (cut by Nhel and BglII). The fusion product (pGL3-cRE) was subsequently transformed into Mix & Go Competent Cells Strain Zymo 5-a (Zymo Research, T3007). Clones were selected by ampicillin and plasmids were prepared using the NucleoSpin Plasmid Transfection-grade (Takara, 740490).

Transfection and luciferase assays

Human iPS cells were transfected in a 24-well plate using the Lipofectamine Stem Transfection Reagent (Invitrogen, STEM00001) and Opti-MEM Reduced Serum medium (Invitrogen, 31-985- 070). On the day of transfection, cell density was 50% confluent. For each well, 500 ng of pGL3-enhancer, pGL3-control, or pGL3-promoter was co-transfected with 10 ng of pRL-CMV (Promega, E2261) as an internal control for the normalization of luciferase activity. Cells were incubated with DNA–lipid complex overnight and media was changed for another two days. The firefly and Renilla luciferase activity were measured respectively using a Dual-Glo Luciferase Assay System (Promega, E2920). Human iPS cells were obtained from the Stanford CVI iPS cell Biobank.

Massively parallel reporter assays

Measured effects of single nucleotide substitution effects from saturation mutagenesis experiments across 29 regulatory elements were obtained from⁴⁰ and across 131 elements from⁹. For each nucleotide, the mean substitution effect across all reported nucleotides was correlated (Pearson) with phyloP scores that were truncated such that negative values, which are indicative of possible acceleration, were set to zero. A Storey FDR⁶⁰ was used to control for multiple comparisons. Regulatory effects from 27,017 common variants in the DHS elements investigated in this study were obtained from⁹. Variants with a reported FDR below 5% were defined as allele-specific. A generalized linear model with a binomial probability distribution was used to estimate the effects of constraint on allele-specific activity.

Chromatin accessibility and histone modifications in non-humans

Chromatin accessibility from ATAC-seq in fibroblasts obtained from human and 4 non-human primates (chimpanzee, gorilla, orangutan and macaque) at 89,744 merged peaks with orthologous sequences in all 5 species were obtained from^42,75. Counts were transformed to log₂ counts per million (cpm), and FDR values from differential accessibility testing across any primate species were obtained⁴².

Histone modifications (H3K27ac) were also obtained from three matching cell types during corticogenesis for human, macaque, and mouse⁴³. First, H3K27ac peaks at orthologous sequences from all species were obtained from the authors and filtered such that at least 200 bp of these peaks overlapped with a DHS element in this study. Next, DHS elements coordinates in GRCh38 were lifted over to each species and the maximum H3K27ac signal (cpm) at each element was calculated using the provided bigwig files. Spearman correlations between matching cell types were then computed for each pair of species stratified by the type of constraint on the DHS element.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

Identification of constrained sequence elements across 239 primate genomes – Nature

ByAUTHOR