# A DNA methylation atlas of normal human cell types – Nature

Jan 4, 2023

### Human tissue samples

Human tissues were obtained from various sources, as detailed in Supplementary Table 1. The majority (148) of the 205 samples analysed were sorted from tissue remnants obtained at the time of routine, clinically indicated surgical procedures at the Hadassah Medical Center. In all cases, normal tissue distant from any known pathology was used. Surgeons and/or pathologists were consulted before removal of tissue to confirm that its removal would not compromise the final pathologic diagnosis in any way. For example, in patients undergoing right colectomy for carcinoma of the caecum, the most distal part of the ascending colon and most proximal part of the terminal ileum were obtained for cell isolation. Normal bone marrow was obtained at the time of joint replacement in patients with no known haematologic pathology. The patient population included 135 individuals (n = 60 males, n = 74 females) aged 3–83 years. The majority of donors were White. Approval for collection of normal tissue remnants was provided by the Institutional Review Board (IRB, Helsinki Committee), Hadassah Medical Center, Jerusalem, Israel. Written informed consent was obtained from each donor or legal guardian before surgery.

As described in Supplementary Table 1, some cells and tissues were obtained through collaborative arrangements: pancreatic exocrine and liver samples (cadaveric organ donors, n = 5) from M. Grompe, Oregon Health & Science University; adipocytes (subcutaneous adipocytes at time of cosmetic surgery following weight loss, n = 3), oligodendrocytes and neurons (brain autopsies, n = 14) from K. L. Spalding and H. Druid, Karolinska Institute, Stockholm; and research-grade cadaveric pancreatic islets from J. Shapiro, University of Alberta (n = 16). In all cases, tissues were obtained and transferred in compliance with local laws and after the approval of the local ethics committee on human experimentation. Sixteen cell types were obtained from commercial sources, including 15 from Lonza and one from Sigma-Aldrich. Three pancreatic islet preparations were obtained from the Integrated Islet Distribution Program (https://iidp.coh.org).

### Tissue dissociation and FACS sorting of purified cell populations

Fresh tissue obtained at the time of surgery was trimmed to remove extraneous tissue. Cells were dispersed using enzyme-based protocols optimized for each tissue type. The resulting single-cell suspension was incubated with the relevant antibodies and FACS sorted to obtain the desired cell type (Extended Data Fig. 2 and Supplementary Information).

Purity of live sorted cells was determined by messenger RNA analysis for key known cell-type-specific genes, whereas the purity of cells fixed before sorting was determined using previously validated cell-type-specific methylation signals (Extended Data Fig. 2c and Supplementary Information). DNA was extracted using the DNeasy Blood and Tissue kit (no. 69504, Qiagen) according to the manufacturer’s instructions, and stored at −20 °C for bisulfite conversion and whole-genome sequencing.

### WGBS

Up to 75 ng of sheared genomic DNA was subjected to bisulfite conversion using the EZ-96 DNA Methylation Kit (Zymo Research), with liquid handling on a MicroLab STAR (Hamilton). Dual-indexed sequencing libraries were prepared using Accel-NGS Methyl-Seq DNA library preparation kits (Swift BioSciences) and custom liquid handling scripts executed on the Hamilton MicroLab STAR. Libraries were quantified using KAPA Library Quantification Kits for Illumina Platforms (Kapa Biosystems). Four uniquely dual-indexed libraries, along with the 10% PhiX v.3 library (Illumina), were pooled and clustered on an Illumina NovaSeq 6000 S2 flow cell followed by 150 bp, paired-end sequencing. Total read count and average sequencing depth (in read pairs), as well as percentage of CpGs, per sample, at 1× and 10×, are detailed in Supplementary Table 1. Also listed are average methylation levels, per sample, at CpG, nonCpG and CC dinucleotides. Intriguingly, sorted neuron samples showed higher CpA methylation (approximately 10%) compared with other samples (approximately 1%).

### WGBS computational processing

Paired-end FASTQ files were mapped to the human (hg19, hg38), lambda, pUC19 and viral genomes using bwa-meth (v.0.2.0)51 then converted to BAM files using SAMtools (v.1.9)52. Duplicated reads were marked by Sambamba (v.0.6.5) with parameters ‘-l 1 -t 16 –sort-buffer-size 16000 –overflow-list-size 10000000’ (ref. 53). Reads with low mapping quality, duplicated or not mapped in a proper pair were excluded using SAMtools view with parameters ‘-F 1796 -q 10’. Reads were stripped from nonCpG nucleotides and converted to PAT files using wgbstools (v.0.1.0)54.

### Genomic segmentation into multisample homogenous blocks

We developed and implemented a multichannel dynamic Pprogramming segmentation algorithm to divide the genome into continuous genomic regions (blocks), showing homogeneous methylation levels across multiple CpGs for each sample54. A generative probabilistic model is used, each block inducing a Bernoulli distribution with some $${\theta }_{i}^{k}$$, where i is the block index and k the sample index (k = 1,…, K), and each observation (occurence of one CpG at one sequenced fragment) is represented by a random variable sampled i.i.d. (independent and identically distributed) from the same beta value Ber $${\theta }_{i}^{k}$$. The log-likelihood of all sequencing data is the sum of log-likelihoods across all blocks, each decomposing as the sum of log-likelihoods across all samples. The log-likelihood of the ith block can therefore be formalized as:

$${\rm{score}}({{\rm{block}}}_{i})={ll}_{i}={{\varSigma }^{K}}_{k=1}({{({N}_{C})}_{i}}^{k}\times \log ({{\hat{\theta }}_{i}}^{k})+{{({N}_{T})}_{i}}^{k}\times \log (1-{{\hat{\theta }}_{i}}^{k}))$$

where $${({N}_{C})}_{i}^{k}\,,\,{({N}_{T})}_{i}^{k}$$ is the number of methylated and unmethylated observations, respectively, in the ith block in the kth sample, whereas $${{\hat{\theta }}_{i}}^{k}$$ marks a Bayes estimator of the Bernoulli distribution parameter, calculated with $${a}_{C},{a}_{T}$$ pseudocounts for each block/sample:

$${{\hat{\theta }}_{i}}^{k}=\frac{{{({N}_{C})}_{i}}^{k}+{\alpha }_{C}}{{{({N}_{C})}_{i}}^{k}+{{({N}_{T})}_{i}}^{k}+{\alpha }_{C}+{\alpha }_{T}}$$

These hyperparameters are used for regularization, to control the trade-off between overfitting (shorter blocks) and generalization (longer blocks). Dynamic programming is then used to find the optimal segmentation across the genome. Briefly, we maintain a 1 × N table T (N = 28,217,448 CpGs) for optimal segmentation scores across all prefixes. Specifically, T[i] holds the score of the optimal segmentation of all CpG sites from 1 through to i, and T[N] holds the final, optimal, score across the entire genome. The table itself is updated sequentially from 1 to N, where the optimal segmentation up to position i is achieved by the addition of a new block to a shorter optimal segmentation (for example, up to position i′):

$$T[i]=\mathop{\max }\limits_{i{\prime} < i}\{T[i{\prime} ]+{\rm{score}}({\rm{block}}[i{\prime} +1,...,i])\}$$

For this, all previous optimal segmentations are considered and a new block is added from position (iʹ + 1) to position i (with a maximal block size of 5,000 bp). The combination that maximizes log-likelihood is selected as the optimal segmentation from 1 to i, and the start index of the last block is recorded in a traceback table. Once the score of optimal segmentation is calculated in T[N], the traceback table is used to retrieve the full segmentation. An upper bound on block length (5,000 bases) is set to improve running times and each chromosome is run separately. The linear distance between consecutive CpGs is ignored under this model. The model and segmentation algorithm are further described in Supplementary Information.

### Segmentation and clustering analysis

We segmented the genome into 7,104,162 blocks using wgbstools (with parameters ‘segment –max_bp 5000’) with all of the 205 samples as reference, and retained 2,099,681 blocks covering at least four CpGs. For hierarchical clustering (Fig. 2) we selected the top 1% (20,997) blocks showing the highest variability in average methylation across all samples. Blocks with sufficient coverage of at least ten observations (calculated as sequenced CpG sites) across two-thirds of the samples were further retained. We then computed the average methylation for each block and sample calculated using wgbstools (–beta_to_table -c 10), marked blocks with fewer than ten observations as missing values and imputed their methylation values using sklearn KNNImputer (v.0.24.2)55. The 205 samples were clustered with the unsupervised agglomerative clustering algorithm23, using scipy (v.1.6.3)56 and L1 norm. The fanning diagram was plotted using ggtree (v.2.2.4)57.

### Cell-type-specific markers

The 205 atlas samples were divided into 51 groups by cell type, yielding 39 basic groups and 12 composite supergroups (Supplementary Table 3). We then performed a one-versus-all comparison to identify differentially methylated blocks unique for each cell type. For this we used wgbstools’ ‘find_markers’ function to first identify blocks covering at least five CpGs (length 10–1,500 bp) to calculate the average methylation per block/sample and rank the blocks according to the difference in average methylation between target samples versus all other samples. To allow some flexibility, this difference was computed (for unmethylated markers) as the difference between the 75th percentile in target samples (typically allowing one outlier) versus the 2.5th percentile in the background group (typically allowing about five outlier samples). For methylated markers, this was computed as the difference between the 25th and 97.5th percentiles (Supplementary Information). Low-coverage blocks (fewer than 25 observations), in which the estimation error of average methylation was around 10%, were replaced by a default value of 0.5 which is neither unmethylated nor methylated, thus reducing the block’s methylation difference and downgrading its rank. For cell type-specific markers we selected the top 25 per cell type, for a total of 1,246 markers (Supplementary Table 4a).

Atlases for 450K/EPIC, RRBS and hybrid capture panels were identified similarly while examining a subset of genomic regions, overlapping various probe sets or genomic regions (-b option). Chromatin analysis was performed on the top 250 markers per cell type (total of 11,713 markers; Supplementary Table 4b). Motif analysis was performed on the top 1,000 markers per cell type (total of 50,286 markers; Supplementary Table 4b) using the difference between the 25th and 75th percentile, to allow putative enhancers unmethylated in additional cell types.

### Enrichment for gene set annotations

Analysis of gene set enrichment was performed using GREAT31. For each cell type we selected the top 250 differentially unmethylated regions and ran GREAT via batch web interface using default parameters. Enrichments for ‘Ensembl Genes’ were ignored, and a significance threshold of binomial false discovery rate ≤0.05 was used.

### Enrichment for chromatin marks

For each cell type we analysed the top 250 differentially unmethylated regions versus published ChIP–seq (H3K27ac and H3K4me1) and DNase sequencing from the Roadmap Epigenomics project (downloaded from ftp.ncbi.nlm.nih.gov/pub/geo/DATA/roadmapepigenomics/by_experiment and http://egg2.wustl.edu/roadmap/data/byDataType/dnase/BED_files_enh) in bigWig and bed formats. These include E032 for B cell markers, E034 for T cell markers, E029 for monocyte/macrophage markers, E066 for liver hepatocytes, E104 for heart cardiomyocytes and fibroblasts and E109 and E110 for gastric/small intestine/colon4. Annotations for chromHMM were downloaded (15-states version) from https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final3, and genomic regions annotated as enhancers (7_Enh) were extracted and reformatted in bigWig format. Raw single-cell ATAC–seq data were downloaded from GEO GSE165659 (ref. 32) as ‘feature’ and ‘matrix’ files for 70 samples. For each sample, cells of the same type were pooled to output a bedGraph file, which was mapped from hg38 to hg19 using UCSC liftOver58. Overlapping regions were dropped using bedtools (v.2.26.0)59. Finally, bigWig files were created using bedGraphToBigWig (v.4)60. Heatmaps and average plots were prepared using deepTools (v.3.4.1)61, with the functions ‘computeMatrix’, ‘plotHeatmap’ and ‘plotProfile’. We used default parameters except for ‘referencePoint=center’, 15 kb margins and ‘binSize=200’ for ChIP–seq, DNaseI and chromHMM data, and 75 kb margins with ‘binSize=1000’ for ATAC–seq data.

### Motif analysis

For each cell type we analysed the top 1,000 differentially unmethylated regions for known motifs (Supplementary Table 6a) using the HOMER function ‘findMotifsGenome.pl’, with parameters ‘-bits’ and ‘-size 250’39. Similar analyses were performed for the unmethylated regions in each cell type (Supplementary Table 6b), as well as unmethylated regions overlapping H3K27ac, but not H3K4me3, peaks (Supplementary Table 6c).

### Methylation marker–gene associations

For each cell-type-specific marker we identified all neighbouring genes up to 500 kb apart. We then examined the expression levels of these genes across the GTEx dataset covering 50 tissues and cell types62. We then standardized the expression of each gene across all conditions, by replacing expression values with standard deviations (z-scores) above/below the average expression of that gene across samples. This was followed by column-wise standardization in which the relative enrichment of a gene under a given condition is normalized by the enrichment of other genes under that condition. This highlighted the most overexpressed genes for each tissue. We then classified each ‘marker–gene–condition’ combination as tier 1: distance ≤5 kb, expression ≥10 TPM and z-score ≥1.5; tier 2: same as tier 1 but with distance ≤50 kb; tier 3: up to 750 kb, expression ≥25 TPM and z-score ≥5; and tier 4: same as tier 3 but with z-score ≥3.5.

### A catalogue of unmethylated loci and putative enhancers for each cell type

For each genomic region (blocks of at least four CpGs), and for any of the 39 cell type groups, fragments with at least four CpGs from all replicates were merged and classified as either U (fragment-level methylation 15% or less), M (at least 85%) or X (over 15% but below 85%). The percentage of U fragments was then calculated using ‘wgbstools homog –threshold .15,.85’, and blocks with at least 85% unmethylated fragments retained. These blocks were overlapped with genomic features based on UCSC hg19 annotations, including CpG islands and transcriptional start site regions (up to 1 kb from a gene start site). We also used narrowPeak annotations downloaded from Roadmap4 and ENCODE project5 (accessions listed in Supplementary Table 6d). hg38 bed files were converted to hg19 using liftOver58. For putative enhancers, nonpromoter active regulatory regions were defined as those overlapping H3K27ac, but not H3K4me3, peaks under matching conditions. TF binding sites were downloaded from JASPAR 2022 (ref. 63).

### Interindividual variation in cell type methylation

We define a similarity score between two samples as the fraction of blocks containing at least three CpGs and at least ten binary observations (sequenced CpG sites) in which the average methylation of the two samples differs by at least 0.5. Only cell types with n ≥ 3 FACS-sorted replicates from different donors are considered (136 samples in total).

### CTCF ChIP–seq analysis

CTCF ChIP–seq data were downloaded from the ENCODE project5 as 168 bigWig files, covering 61 tissues/cell types (hg19). Samples of the same cell type were averaged using multiBigwigSummary (v.3.4.1)61.

### Endodermal marker analysis

All 892 endodermal hypomethylated markers were found using wgbstools function ‘find_markers’ (v.0.2.0), with parameters ‘–delta_quants 0.4 –tg_quant 0.1 –bg_quant 0.1’ (ref. 54). For endoderm-derived epithelium, 51 samples were compared with 103 nonepithelial samples from mesoderm or ectoderm. Blocks were selected as markers if the average methylation of the 90th percentile of the epithelial samples was lower than the tenth percentile of the nonepithelial samples by at least 0.4.

### UXM fragment-level deconvolution algorithm

We developed a fragment-level deconvolution algorithm: each fragment was annotated as U (mostly unmethylated), M (mostly methylated) or X (mixed) depending on the number of methylated and unmethylated CpGs64. We then calculated, for each genomic region (marker) and across all cell types, the proportion of U/X/M fragments with at least k CpGs. Here we used k = 4 and thresholds of less than or equal to 25% methylated CpGs for U reads, and more than or equal to 75% methylated CpGs for M reads. We then constructed reference atlas A with 1,232 regions (top 25 markers per cell type), in which the Ai,j cell holds the U proportion of the ith marker in the jth cell type. Given an input sample, the U proportion at each marker is computed to form a 1,232 × 1 vector b. Then, NNLS is applied to infer coefficient vector x by minimizing $${| A\times x-b| }_{2}$$ subject to non-negative x, normalized to $${\Sigma }_{j}{x}_{j}=1$$. Alternatively, each marker can be weighed differently based on fragment coverage in the input sample. For this, b can be defined as the number of U fragments in each region and the rows of A similarly multiplied by Ci, the total number of fragments in each region, thus minimizing $${| {\rm{diag}}(C)\times A\times x-b| }_{2}$$. Additional details are available in Supplementary Information.

### In silico simulation of WGBS deconvolution

Simulated mixtures were performed for cardiomyocytes (n = 4), bladder epithelium (n = 5), breast epithelium (n = 7), endothelial cells (n = 19) and erythrocyte progenitors (n = 3) in a leave-one-out manner. For this, one sample was held out and segmentation and marker selection (25 per cell type) were rerun using the remaining 204 samples. We then simulated mixtures by sampling and mixing reads from the held-out sample at 10, 3, 1, 0.3, 0.1, 0.03 and 0% into a background of leukocyte samples. This was repeated ten times. Finally, mixed samples were analysed using the UXM fragment-level algorithm with markers from the reduced (204) atlas, using fragments with at least three CpGs. Merging, splitting and mixing of reads were performed using wgbstools (v.0.1.0)54.

Array-based analysis was performed by computing, for each mixed set of fragments, average methylation levels across each of around 480,000 CpG sites present in the 450K array (‘wgbstools beta_to_450k’). We then deconvolved these data according to the method of Moss et al.28 (https://github.com/nloyfer/meth_atlas).

We also simulated four-way mixtures in which background plasma methylomes were simulated as a combination of 90% fragments from leukocytes, 7.5% from a vascular endothelial sample and 2.5% from a hepatocyte sample. As described above, this was done by holding out the three samples (for example, cardiomyocytes, endothelial cells and hepatocytes) and then rererunning segmentation and marker selection on the (202 = 205 – 3) remaining samples, to obtain a set of markers that was then used for fragment-level deconvolution of mixtures.

### WGBS deconvolution

Leukocytes and matching plasma samples (n = 23) were processed as described above and analysed using the WGBS methylation atlas, including 1,246 markers plus (for plasma samples) an additional 25 megakaryocyte markers. Fifty-two plasma samples from 28 patients with SARS-CoV-2 (ref. 44) downloaded as FASTQ files were processed as described above. Because of the low coverage (1–2×) of these samples, we extended the marker set from the top 25 to the top 250 markers per cell type (Supplementary Table 4b), and also included 250 megakaryocyte markers65. Roadmap4 and ENCODE5 samples were processed as described above and analysed using the UXM algorithm.

### Deconvolution of 450K array data

Previously published 450K array data were downloaded from either The Cancer Genome Atlas (lung and breast biopsies)49,50 or GEO accession no. GSE62640 (ref. 48) and deconvoluted with meth_atlas NNLS software (https://github.com/nloyfer/meth_atlas) using our array-adapted atlas (Supplementary Table 12). Breast biopsies were grouped using PAM50 classifications66.

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.