Strange India All Strange Things About India and world

Epigenomic datasets and processing

Primary data sources and metadata information

We analysed 3,030 datasets, including 2,329 epigenomic chromatin immunoprecipitation followed by sequencing (ChIP–seq) datasets, 635 DNase-seq datasets and 66 ATAC-seq datasets from ENCODE at, released as of 24 September 2018. These marks include tier 1 assays: DNase-seq, H3K4me1, H3K4me3, H3K27ac, H3K36me3, H3K9me3 and H3K27me3; tier 2 assays: ATAC-seq, H3K9ac, H3K4me2, H2AFZ, H3K79me2 and H4K20me1; tier 3 assays: POLR2A, p300, CTCF, SMC3 and RAD21; and tier 4 histone marks: 16 non-imputed histone acetylation marks, 4 methylation marks (H3K9me2, H3K79me1, H3K9me1 and H3K23me2), H3.3 and H3T11ph. We assigned unique sample IDs to each unique combination of: extended biosample summary, donor, sex, age and life stage, wherever each attribute was available. We removed samples with genetic perturbations and kept only samples with appropriately matched ChIP–seq controls. We provide a metadata matrix including the mapping between ENCODE accessions and our unique sample IDs (Supplementary Table 1; also at We mapped the 111 Roadmap biosamples and the 16 ENCODE 2012 biosamples to any of our biosamples with overlapping dataset accessions if the accessions were used in the flagship Roadmap epigenomics analysis. This mapping assigned 25 samples to ENCODE 2012 and 184 samples to Roadmap 2015, some of which were merged multi-donor samples in Roadmap, out of the final 833 samples that passed quality control. These were merged into 16 and 111 tissue types, respectively, in the Roadmap 2015 publication6.

Uniform data processing

We downloaded one alignment file per replicate, prioritizing filtered alignments aligned with BWA in hg19 whenever possible. We uniformly processed the ChIP–seq and DNase-seq datasets according to the processing pipelines established by the Roadmap Epigenomics Consortium6. In brief, we filtered out improperly paired and non-uniquely mapped reads, truncated reads to 36 bp, filtered out a blacklist of low complexity and artefact regions (ENCODE accession ENCSR636HFF), and filtered reads against a mappability track of uniquely mappable regions for 36-bp reads47. Truncating read lengths inevitably missed some repetitive regions that the longer reads could have helped resolve, but helped to avoid potential biases from alignment differences, as over two-thirds of the datasets had read lengths of 36 bp or lower (Supplementary Fig. 9). We converted .bam files to tagAlign, used liftOver48 to map GRCh38 alignments to hg19, and pooled all experiments within each ID and assay combination. We subsampled the pooled ChIP–seq datasets to a maximum of 30 million reads and the DNase-seq and ATAC-seq datasets to a maximum of 50 million reads. We used the SPP peak caller49 to estimate fragment length. In cases with extremely low fragment length in the ATAC-seq and DNase-seq datasets we used the average fragment length (73 bp) from the average of the rest of the tracks. We generated −log10P value signal tracks against matched whole cell extracts for both the ChIP–seq and the accessibility datasets using the MACS250 and the SPP49 peak caller and cross-correlation analysis to identify the proper fragment length as in the Roadmap analysis.

Epigenomic imputation


We carried out epigenomic imputation on 859 unique biosamples using ChromImpute20 for a total of 10,778 imputed datasets over 13 tier 1 and tier 2 assays using predictors trained on all 35 epigenomic assays across 859 samples. We also imputed 4,345 datasets for the five DNA-associated factors, using only the 35 epigenomic assays as features to train predictors with ChromImpute. We provide all imputed and processed observed tracks along with track sets for the 833 quality controlled samples at

Quality control

For imputation quality control and validation, we compared observed tracks to imputed tracks when both were available (that is, when at least two original observed datasets were available for that biosample). We calculated all imputation quality control metrics from the original ChromImpute publication20, including genome-wide correlation, imputed and observed peak recovery (%), and the area under the receiver-operator characteristic curve (AUC) for all pairs of imputed and observed tracks. In addition to the quantitative metrics, we visually inspected the epigenomic predictions as part of our quality control. We showed (Extended Data Fig. 1b) three dense and varied regions of different resolutions (25 kb, 200 kb and 1.5 Mb) for each of two randomly chosen samples containing both observed and imputed tracks for each assay. We calculated the epigenomic profile quality metrics normalized strand cross-correlation coefficient (NSC), relative strand cross-correlation coefficient (RSC) and read depth for all datasets and compared these to the imputation quality control metrics (see the tables in Supplementary Table 1). We flagged low-quality tracks by detecting the elbow in the ranked correlation metrics, which we calculated as the point where the change in correlation exceeded 5% of the correlation. Validation on external datasets was carried out on 51 experimental tracks across eight marks and assays from ENCODE after our data freeze, similarly subsampled to 30 million (marks) and 50 million reads (accessibility), which we remapped from GRCh38 to hg19 and evaluated on fully remapped 200-bp bins (90.1%) in chromosome 1 (Extended Data Fig. 1c, d). For the data homogeneity analysis, we restricted the data to only biosamples in each mark with both observed and imputed data (Supplementary Fig. 10).

Sample and antibody swap detection

To systematically identify both potential sample or antibody swaps and poor-quality experiments, we computed the correlation of each observed experiment against all 10,734 imputed tracks for histone marks and assays (all imputed tracks before removing samples by quality control). We then calculated the average correlation among the top 10 most similar tracks to each observed track. We flagged potential antibody swaps by comparing the average correlation against samples of the putative mark against those computed for other marks. We fitted a multivariate linear model to each mark comparison, flagged datasets with residuals greater than 3 standard deviations of the average correlation and visually confirmed seven antibody swaps (six low-quality tracks). Similarly, we flagged potential sample swaps by comparing the correlation between imputed and observed tracks against the average correlation in the top 10 tracks in the same mark. We fitted a multivariate linear model and flagged datasets with residuals greater than 3 standard deviations of the residuals distribution. We report 19 potentially swapped samples, of which 5 were also flagged as low-quality tracks (Supplementary Fig. 8).

Secondary reactivities

In addition to genome-wide quality control of imputed tracks, we also focused on the specific differences between observed and imputed tracks. For each observed mark, we generated a genome-wide ‘delta’ track, computed as the difference in signal intensity between the observed and the imputed data, rescaling imputed tracks to match the signal intensity properties of the observed tracks, as the observed tracks showed a general bias for higher intensity. Some of these ‘delta’ tracks showed surprisingly high correlations with ‘primary’ tracks of non-putative marks, indicating potential secondary antibody reactivities. To flag these reactivities, we compared the average correlation of each of the delta tracks to the top 10 closest imputed tracks for each mark. As with antibody swaps, we fitted a multivariate linear model in each mark combination to flag outliers. We flagged 19 tracks and reported 13 after visual inspection as potential secondary reactivities or single replicate swaps (for example, in the case of DNase-seq) (Supplementary Figs. 7, 8). We noted that some cases showed clear difference tracks that do not match available antibodies, suggesting that the secondary reactivity is not a common mark in our compendium.

Biological space coverage

To evaluate the similarity of imputed and observed tracks across samples, we calculated the pairwise genomic correlations between all pairs of imputed and observed signal tracks. We hierarchically clustered the imputed or observed correlation matrix of each individual mark using Ward’s method. We averaged all imputed matrices for the six main marks (H3K27ac, H3K4me1, H3K4me3, H3K36me3, H3K27me3 and H3K9me3) to create a fused correlation matrix, which we similarly clustered. We plotted the hierarchically clustered tree for the fused matrix alongside the metadata information for each biosample using the circlize R package51.

In addition, we calculated mark-specific Spearman correlations that were restricted to relevant features within all observed and imputed tracks per mark. We mapped each of the 13 marks to its top state by emission probability in the ChromHMM 25-state model and any other states with emission probability over 80%. For ATAC-seq, we used the same region list as DNase-seq. For each mark, we averaged and reduced each 25-bp signal track to any 200-bp regions that were labelled as one of the states associated with the mark in any of the 127 imputed Roadmap biosamples under the 25-state model6,20. We calculated the Spearman correlation between sets of these region-restricted mark signal tracks and generated similarity matrices across all datasets for a mark. Using these Spearman correlation matrices on all observed and imputed signal tracks, we computed UMAP dimensionality reductions for each mark and assay using with the uwot R package52 with the default parameters, except for n_neighbours = 250, min_dist = 0.25 and repulsion_strength = 0.25.

Epigenomic annotations

Chromatin-state annotations

We computed epigenomic annotations on 3,533 imputed and 1,465 observed datasets for 6 marks on 833 samples using ChromHMM with the fixed 18-state model from Roadmap6 with the same mnemonics and colours. We used observed data wherever possible, except in cases with no observed data or where observed data were removed in quality control. The table of the signal tracks used to calculate the annotations is available as Supplementary Table 2. The observed data were binarized from signal tracks with a −log10P value signal cut-off of 2. To binarize the imputed data and facilitate comparison with the observed data, we established mark-specific binarization cut-offs. We first separately calculated the overall probability distributions of all imputed and observed tracks for each mark. Then, for each mark, we set the imputed binarization cut-off value to the value of the quantile that matched the quantile in the observed data for the −log10P value > 2 cut-off. We used liftOver48 to map all 833 (after quality control) ChromHMM annotations to GRCh38, using a stringent reciprocal mapping strategy, ensuring that all resulting GRCh38 regions were also 200 bp and non-overlapping, and we have provided these alongside hg19 annotations and as track sets at

Defining active enhancers

We define active enhancers as the intersection of DHS regions with enhancer annotations and high H3K27ac signal (average signal of >2 in the region containing the DHS ± 100 bp). We defined DHS regions from an index list of 3,591,898 DHS element consensus locations in GRCh38, determined from 733 DNase-seq experiments, that we mapped using liftOver48 to 3,568,912 hg19 locations24. We intersected the hg19 regions with the 833 imputed enhancer annotations (states 7, 8, 9, 10, 11 and 15 in the 18-state model). This resulted in 2,842,995 regions with at least one enhancer annotation in any biosample. Finally, we intersected this matrix with the H3K27ac signal in the ±100-bp region that encompassed each DHS from the same tissue-specific imputed and observed datasets used to calculate the ChromHMM annotations. This procedure resulted in 2,356,914 active-enhancer regions. We created an equivalent promoter element region using the promoter annotations (states 1, 2, 3, 4 and 14 in the 18-state model). We noticed that several regions shared both enhancer and promoter annotations. As a conservative cut-off, we assigned all regions to either enhancers or promoters if over 75% of its active occurrences were labelled as that type of element (Supplementary Fig. 13). This final thresholding procedure yielded 2,069,090 enhancers, 204,104 promoters and 122,358 dyadic elements (neither specifically promoter nor enhancer). The matrices and enhancer locations are available at

For all images using tissue group order, including ChromHMM tracks and module heat maps, groups were ordered alphabetically within six major groups: tissue or organs (adipose, bone, digestive, endocrine, heart, kidney, liver, lung, mesenchymal, muscle, myosatellite, pancreas, placenta and EEM, reproductive, smooth muscle and urinary), other primary cells (endothelial, epithelial and stromal), blood and immune (blood and T cell, HSC and B cell, lymphoblastoid, spleen and thymus), nervous system (brain, eye, neurosphere and PNS), stem (embryonic stem cell-derived, embryonic stem cell and induced pluripotent stem cell), and other (cancer and other).

Defining enhancer modules

To define enhancer modules, we clustered the binary enhancer matrix defined by intersecting enhancer annotations with DHS regions and with the average centred and flanking (±100 bp) H3K27ac signal above a −log10P value of 2 using the k-centroids algorithm with the Jaccard distance and the number of clusters set to k = 300. The average module contained 6,897 enhancers, and the largest module (enumerating constitutive elements) contained 93,554 enhancer regions. In all heat map plots of module centres (and associated enrichment figures), we diagonalized the matrix by ordering each column in the heat map (module centres) by the biosample that contributed the maximal signal. All columns that had a signal over 25% in more than 50% of rows were shown first. We used this diagonalization procedure for all diagonalized heat maps. We coloured each module by the tissue group that contained its maximal signal. Modules highlight sample groupings and organize according to cell type and tissue. Major groups were ordered alphabetically within six major groups and samples were ordered within groups according to Ward method’s clustering of the Jaccard distance of the module centres matrix. We performed enrichment on the module centres against the metadata of included samples (signal over 25%) by the hypergeometric test, and show enrichments with −log10P > 2 (Fig. 2b).

Gene ontology enrichment

We performed gene ontology enrichments on each enhancer module using GREAT v3.0.0 for the biological process, cellular component and molecular function ontologies53. We analysed and visualized the results in the same manner as in the Roadmap core paper6. We only considered enrichments of 2 or greater with a multiple testing-corrected P < 0.01. For Fig. 4c, we reduced the gene ontology enrichment by modules matrix to terms with a maximal −log10P > 4 that were enriched in less than 10% of modules. The full enrichment matrix is shown in Supplementary Fig. 16. As in the case of the diagonalized module centres, we labelled each term according to the module containing its maximal signal. We used a bag of words approach (as described in Roadmap6) to pick 36 representative terms out of 865 total terms for Extended Data Fig. 6b, such that each tissue group has at least one term and the rest are representatively allocated across groups.

Motif enrichment

We performed motif enrichment analysis across enhancer modules as described in the Roadmap paper6,54. In brief, we measured the enrichment of 1,690 motifs consisting of the JASPAR (2018)55 core non-redundant vertebrate motifs, the HOCOMOCO v1156 human motif set and the SELEX motifs by Jolma et al.57. We computed the enrichments for each of the 1,690 motifs relative to a joint DHS and intergenic background, additionally controlled by 100 shuffled motifs for each motif. We reported the motif with the highest enrichment in any module for each of the 286 previously identified motif archetypes26. We only reported motifs with a maximum log2-transformed fold change of at least 1, resulting in 160 motif archetypes (corresponding to 1,175 total motifs), which we show with their position weight matrix (PWM) logos against all 300 modules in Extended Data Fig. 6c.

Enhancer–gene linking

We predicted enhancer–gene links for each biosample using the Pearson correlation between gene expression and the histone mark activity of nearby enhancers (within 1 Mb) for six marks (H3K27ac, H3K4me1, H3K4me2, H3K4me3 and H3K9ac). We precomputed correlations between all genes and nearby enhancers across the 304 biosamples with paired expression data. A negative set of correlations for each enhancer was computed using random genes in a different chromosome. We predicted links for each biosample and ChromHMM enhancer state separately (states E7, E8, E9, E10, E11 and E15). Predictions were made by training an XGBoost classifier on the positive set of all valid links against their paired negative links, using precomputed correlations and distance to the transcription start site as features, and keeping all links with a probability above 5/7 (ref. 58).

We validated enhancer–gene links using curated gold-standard data59 in CD34, GM12878, HeLa and K562 cells (Extended Data Fig. 8). We compared four sets of correlation-based predictions (alone or with H3K27ac and H3K4me1 activity, and with and without distance-based rescaling) against distance alone, enhancer–gene links from Roadmap, and H3K27ac correlation and/or activity times distance (calculated using EpiMap tracks and enhancers in compared epigenomes)60. For methods without a threshold value, such as distance alone, only the nearest or highest score gene was used for each as a cut-off value for F1. In addition, we created a gene ontology-based gold-standard set of links from gene ontology terms that were enriched within enhancer clusters by GREAT53. For each gene ontology term per cluster, we added enhancer–gene links for enhancers within 1 Mb of at least two genes in the gene ontology term. Negative link sets were constructed by taking physical and expression quantitative trait locus (eQTL) negative link sets that were also not enriched by gene ontology.

GWAS enrichment analysis

We pruned the NHGRI-EBI GWAS catalogue34 (downloaded from on 3 May 2019) using a greedy approach: within each trait + PMID combination, we ranked associations by their significance (P value) and added SNPs iteratively if they were not within 5 kb of previously added SNPs. We also removed all associations in the HLA locus (for hg19: chr6: 29,691,116–33,054,976). This reduced the catalogue from 121,000 to 113,000 associations. Finally, we reduced the catalogue to 926 unique GWAS (from 5,454 GWAS) with an initial sample size of at least 20,000 cases or individuals (wherever cases and controls were not annotated). This resulted in 66,801 lead SNPs, which landed in 33,417 unique genome intervals when we split the genome into 10,000-bp intervals.

Flat GWAS–epigenome enrichments and module-based GWAS–epigenome enrichments

We performed the hypergeometric test to evaluate GWAS enrichments on flat epigenomes and on modules. For these flat enrichments, we compared each number of SNP–enhancer intersections for each enhancer set (flat epigenome or module) to the full set of intersections in all M enhancers. As above, we corrected for multiple testing for each GWAS and enhancer set combination by computing and correcting with null association P values for flat epigenomes and modules using the null catalogues generated for the tree enrichment. Rarefaction curves were calculated on the flat epigenome enrichments by iteratively adding the sample that was either significantly enriched or the maximal enrichment for the most remaining GWAS until all GWAS were accounted for (Extended Data Fig. 10c, d).

Tree-based GWAS–epigenome enrichments

We constructed a tree by hierarchically clustering the Jaccard similarity of the binary enhancer-by-epigenomes matrix using complete-linkage clustering. Then, for each node in the tree, we calculated its consensus epigenomic set, defined as the set of all enhancers present in all leaves of the subtree, such that each node’s set was a superset of that of its parent. For each GWAS, we asked whether the novel consensus enhancers at a node were significantly enriched for lead SNPs by comparing the enrichment between each node and its parent as measured by the likelihood ratio test between two logistic regressions.

In brief, for each GWAS catalogue unique trait and PubMed ID, we found all intersections of its pruned SNPs with M = 2,069,090 enhancers. Then Y is an indicator vector of size M, which shows the intersected enhancers. We found all consensus enhancers (the intersection of epigenomes in the subtree) in the node of interest (vector XN) and in its parent (XP). All vectors are 1×M. We calculated XD = XNXP (specific enhancers), which was also in {0,1}(1×M) as each node contained a superset of its parent’s enhancers. We then calculated the following two logistic regressions: M1: Y~XP + 1; M2: Y~XP + XD + 1. We calculated the log-likelihood difference and applied the likelihood ratio test to test whether adding the specific enhancers (M2) was significantly different from the parent model (M1). To correct for multiple testing on a per GWAS and node basis, we generated 1,000 null GWAS for each lead SNP set size by shuffling the trait associations across GWAS locations, giving 243,000 null GWAS in total. We used these catalogues to compute the null association P values for each permuted GWAS and used the 0.1% and 1% top quantiles as false discovery rate cut-offs.

For the CAD example, gene ontology terms61 were calculated using the nearest gene of each enhancer hit by a lead SNP. We pruned genes to expressed genes by calculating the average RNA-seq profiles for each tissue group and excluded genes that had log2 FPKM < 2 in the average RNA-seq of each sample’s group. Of 833 samples, 341 samples have matched RNA-seq, which we list in addition to releasing the processed data at We kept only the gene ontology terms that were significant in 25% or less of nodes, and report the top two gene ontology terms per node in Fig. 4d and all gene ontology terms in Supplementary Fig. 26.

For locus investigations (in NTN4, CACNA1C, EDNRA and PLPP3), we found the nearest active enhancer to each lead SNP in each node (within 2.5 kb), plotted the H3K27ac signal in the 2.1 million enhancers only, and (1) directly mapped links that originated at one of the enhancers near a lead SNP in the top three enriched epigenomes or (2) any links in the locus present in at least half of the samples in one of the selected tissue groups.

Tissue similarity

We assigned each internal node in the tree to a unique tissue if over 50% of the leaves of the subtree came from that tissue and as ‘multiple’ if the subtree was not the majority of one tissue. We assigned tissue labels to 641 of 832 (77%) internal nodes where the majority of leaves corresponded to a single group. Using these assignments, we created a tissue by GWAS matrix by adding the −log10P values for each tissue node set from all of the GWAS enrichments on the tree. We binarized this matrix and computed the Jaccard similarity across tissues to calculate a tissue similarity matrix. To assess the significance of tissue overlap, we compared each overlap value against the overlaps from 10,000 permuted enrichments. We collapsed each permuted matrix into a tissue by the GWAS matrix to compute the overlaps under the null. We performed the permutations for each tissue against other tissues by shuffling the enrichment P values on the node by the GWAS matrix. Specifically, we (1) binarized the enrichment matrix, (2) fixed the column of the group of interest, (3) permuted the remainder of the matrix, keeping its row and column marginals the same, and then (4) calculated the cosine distance between the permuted and the original matrix of enrichments.

Cross-GWAS network

To evaluate the cross-GWAS similarity, we normalized the tissue by the GWAS matrix for each GWAS to obtain the proportion of significance attributed to each tissue for each GWAS (Supplementary Fig. 21). We reduced the matrix to 538 significant GWAS with at least 20,000 cases (or individuals when no cases were specified) passing a false discovery rate correction at 0.1% at the per-node and per-GWAS size level. We created a GWAS–GWAS network using the cosine distance matrix as an adjacency matrix, keeping 5,547 links with a cosine distance of 0.25 or less. We used the Fruchterman–Reingold algorithm to lay out the graph62. We used the tissue by the GWAS matrix to colour links according to the maximum tissue in the product between each pair of nodes and to colour nodes according to the maximal tissue for each node (Supplementary Fig. 22).

To compare the epigenetic network to trait genetic similarity, we binned SNPs in the GWAS catalogue into 10-kb windows starting from the beginning of each chromosome. We counted the number of intersecting bins between two traits and kept any trait pairs with Jaccard similarity of at least 1%. To compare this to the epigenetic network, we plotted only links in the epigenetic network that coincided with any SNP-sharing GWAS pairs. In addition, we plotted the heat maps of the tree enrichments distance matrix and the genetic similarity matrix side by side, first organized by hierarchically clustering the enrichments matrix and then by clustering the genetic similarity matrix (Supplementary Figs. 23–25).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

Source link

Leave a Reply

Your email address will not be published. Required fields are marked *