Strange IndiaStrange India

Data reporting

For newly generated mouse embryo data, no statistical methods were used to predetermine sample size. Embryos used in the experiments were randomized before sample preparation. Investigators were blinded to group allocation during sample collection and data generation and analysis. Embryo collection and sci-RNA-seq3 data generation were performed by different researchers in different locations.

Mouse embryo collection and staging

All animal use at The Jackson Laboratory was done in accordance with the Animal Welfare Act and the AVMA Guidelines on Euthanasia, in compliance with the ILAR Guide for Care and Use of Laboratory Animals, and with prior approval from The Jackson Laboratory Animal Care and Use Committee under protocol AUS20028.

The details of collecting the 12 mouse embryos with somite counts ranging from 0 to 12 were described previously8. In brief, C57BL/6NJ (strain 005304) mice were obtained at The Jackson Laboratory and mice were maintained via standard husbandry procedures. Timed matings were set in the afternoon and plugs were checked the following morning. Noon of the day a plug was found was defined as E0.5. On the morning of E8.5, individual decidua were removed and placed in ice cold PBS during the collection. Individual embryos were dissected free of extraembryonic membranes, imaged, and the number of somites present were noted prior to snap freezing in liquid nitrogen (Extended Data Fig. 1a). A portion of yolk sac from each embryo was collected for sex based genotyping and samples were stored at −80 °C until further processing.

For newly processed mouse embryos, we used a combination of staging methodologies depending on gestational age of collection (Extended Data Fig. 1b–f). To maximize temporal coherence, resolution, and accuracy, we sought to stage individual embryos based on well-defined morphological criteria, rather than by gestational day alone. Embryos collected between E8.0–E10.0 were staged based upon the number of somites counted at the time of collection and further characterized by morphological features (Extended Data Fig. 1a). For E10.25–E14.75 embryos, developmental age was determined using the embryonic mouse ontogenetic staging system (eMOSS,, which leverages dynamic changes in hindlimb bud morphology and landmark-free based morphometry to estimate the absolute developmental stage of a sample71,72. A modified staging tool, implemented in Python and exhibiting better performance on E14.0–E15.0 samples, was used to confirm staging of samples within this window (documentation and Python scripts available at To distinguish samples staged via eMOSS, these samples are prefixed with ‘mE’ to indicate morphometric embryonic day (for example, mE13.5; Extended Data Fig. 1b–f). Due to the increased complexity of limb morphology at later stages automated staging beyond E15.0 is not possible. As a consequence, collections for all remaining embryonic samples (E15.0–E18.75) was performed precisely at 00:00, 06:00, 12:00 and 18:00 on the targeted day. From close inspection of limbs in this sample set we defined additional dynamics related to digit morphogenesis that allowed further binning of samples collected on days 15 and 16 (Extended Data Fig. 1b–f). Therefore, amongst samples profiled in this study, only the E17.0–E18.75 samples were staged solely by gestational age. Finally, P0 samples were collected from litters at noon of the day of birth (parturition for C57BL/6NJ occurs between E18.75 and E19.0).

Collection of mouse pups immediately after birth

Samples for the validation experiment on periparturition transcriptional dynamics were collected from a plugged female that was monitored for signs of labour beginning at E18.75. Following the natural delivery of 3 pups the dam was euthanized, and following removal from the uterus and extraembryonic membranes, the remaining pups were either collected immediately or placed in a warming chamber to monitor respiratory response and collected at 20-min intervals. We collected nine new pups altogether. The first 3 pups were estimated to be between 1 h to 2 h old, although this was not precisely timed (samples 1–3 in Fig. 6c and Extended Data Fig. 12a). None of these pups had nursed at the time of collection. The next two pups were taken by C-section, decapitated and snap frozen immediately; no breaths were taken (samples 4 and 5 in Fig. 6c and Extended Data Fig. 12a). The next 4 pups were taken by C-section and used for a ‘pink up’ time course, collecting one pup every 20 min (that is, 20 min, 40 min, 60 min and 80 min; samples 6–9 in Fig. 6c and Extended Data Fig. 12a). During this time, all pups remained very active and working to establish a breathing rhythm. Pup 6 had not fully pinked up at time of collection, but pups 7–9 had. Pups 8 and 9 had visible lungs in their chest cavities at 60 min. The last pup collected at 80 min was fully pink with a reasonably stable breathing rhythm. No vocalization was heard from any pups during this collection. Of note, for additional quality control, we put nuclei from previously profiled E18.75 and P0 embryos into a small number of wells of the sci-RNA-seq3 experiment in which nuclei from this validation series were processed.

Generating data using an optimized version of sci-RNA-seq3

Together with E8.5 data, which has been reported previously8, a total of 15 sci-RNA-seq3 experiments were performed on a total of 75 mouse embryos. At least one sample was included for every 6-h interval from E8.0 to P0, and we also included embryos with as many specific somite counts as we could for the 0–34 somite range. Multiple samples were selected for a few timepoints (for example, two samples for E13.0) to boost cell numbers. Meanwhile, we tried to ensure that both male and female mice roughly alternated at adjacent timepoints (Extended Data Fig. 2j). A detailed summary and images of individual embryos can be found in Extended Data Fig. 1 and Supplementary Table. 1.

To generate the dataset, we used the optimized sci-RNA-seq3 protocol3 as written, adjusting the volume and type of lysis buffer to the size of the embryos. In brief, frozen embryos were pulverized on dry ice and cells were lysed with a phosphate-based, hypotonic lysis buffer containing magnesium chloride, Igepal, diethyl pyrocarbonate as an RNase inhibitor, and either sucrose or bovine serum albumin (BSA). Lysate was passed over a 20-μm filter, and the nuclei-containing flow-through was fixed with a mixture of methanol and dithiobis (succinimidyl propionate) (DSP). Nuclei were rehydrated and washed in a sucrose/PBS/Triton X-100/magnesium chloride buffer (SPBSTM), then counted and distributed into 96-well plates for reverse transcription with indexed oligonucleotide-dT primers.

Age-specific adaptations were as follows. E10–E13 embryos use 5 ml BSA lysis buffer, E14 embryos use 10 ml BSA lysis buffer, E15–E18 embryos use 20 ml sucrose-based lysis buffer. Each of these samples were split over 48–96 wells for reverse transcription and the first round of indexing. A newborn P0 mouse requires 40 ml of sucrose-based lysis buffer, and the lysate is divided into 4 fractions for filtration and fixing because of the amount of tissue involved. The two P0 mice were each processed as an individual experiment and were each split over 384 wells for reverse transcription.

For the mouse samples E8.0–E9.75, we used the ‘Tiny Sci’ adaptation of the optimized sci-RNA-seq33. Frozen embryos were gently resuspended in 100 μl lysis buffer to free the nuclei, then 400 μl of dithiobis (succinimidyl propionate)-methanol fixative was added. In the same tube, fixed nuclei were rehydrated, washed and then put directly into 8–32 wells for reverse transcription.

After reverse transcription, nuclei were pooled, washed, and redistributed into fresh 96-well plates to attach a second index sequence by ligation. Then the nuclei were pooled again, washed and redistributed into the final plates. There, the nuclei would undergo second-strand synthesis, extraction, tagmentation with Tn5 transposase and finally PCR to add the final indexes. The PCR products were pooled, size-selected, and then the library was sequenced on an Illumina NovaSeq. For some experiments, a second NovaSeq run was necessary to capture the extent of the library complexity, so we would add more sequencing reads until the PCR duplication rate met a threshold of 50% or the median UMI count per cell went over 2,500. The validation dataset (Extended Data Fig. 4a–f) generated from 8–21-somite embryos was sequenced on an Illumina NextSeq.

Processing of sci-RNA-seq3 sequencing reads

Data from each individual sci-RNA-seq3 experiment was processed independently. For each experiment, read alignment and gene count matrix generation was performed using the pipeline that we developed for sci-RNA-seq314 ( In brief, base calls were converted to fastq format using Illumina’s bcl2fastq v2.20 and demultiplexed based on PCR i5 and i7 barcodes using maximum likelihood demultiplexing package deML73 with default settings. Demultiplexed reads were filtered based on the reverse transcription (RT) index and hairpin ligation adapter index (Levenshtein edit distance (ED) < 2, including insertions and deletions) and adapter-clipped using trim_galore v0.6.5 ( with default settings. Trimmed reads were mapped to the mouse reference genome (mm10) for mouse embryo nuclei using STAR v2.6.1d74 with default settings and gene annotations (GENCODE VM12 for mouse). Uniquely mapping reads were extracted, and duplicates were removed using the UMI sequence, RT index, ligation index and read 2 end-coordinate (that is, reads with identical UMI, RT index, ligation index and tagmentation site were considered duplicates). Finally, mapped reads were split into constituent cellular indices by further demultiplexing reads using the RT index and ligation index. To generate digital expression matrices, we calculated the number of strand-specific UMIs for each cell mapping to the exonic and intronic regions of each gene with the Python v2.7.13 HTseq package75. For multi-mapping reads (that is, those mapping to multiple genes), the read were assigned to the gene for which the distance between the mapped location and the 3′ end of that gene was smallest, except in cases where the read mapped to within 100 bp of the 3′ end of more than one gene, in which case the read was discarded. For most analyses, we included both expected-strand intronic and exonic UMIs in per-gene single-cell expression matrices. After the single-cell gene count matrix was generated, cells with low quality (UMI < 200 or detected genes <100 or unmatched_rate (proportion of reads not mapping to any exon or intron) ≥ 0.4) were filtered out. Each cell was assigned to its originating mouse embryo on the basis of the reverse transcription barcode.

Doublet removal

We performed three steps with the goal of exhaustively detecting and removing potential doublets. Of note, all these analyses were performed separately on data from each experiment.

First, we used Scrublet to detect doublets directly. In this step, we first randomly split the dataset into multiple subsets (six for most of the experiments) in order to reduce the time and memory requirements. We then applied the Scrublet v0.1 pipeline76 to each subset with parameters (min_count = 3, min_cells = 3, vscore_percentile = 85, n_pc = 30, expected_doublet_rate = 0.06, sim_doublet_ratio = 2, n_neighbors = 30, scaling_method = ‘log’) for doublet score calculation. Cells with doublet scores over 0.2 were annotated as detected doublets.

Second, we performed two rounds of clustering and used the doublet annotations to identify subclusters that are enriched in doublets. The clustering was performed based on Scanpy v.1.6.020. In brief, gene counts mapping to sex chromosomes were removed, and genes with zero counts were filtered out. Each cell was normalized by the total UMI count per cell, and the top 3,000 genes with the highest variance were selected, followed by renormalizing the gene expression matrix. The data was log-transformed after adding a pseudocount, and scaled to unit variance and zero mean. The dimensionality of the data was reduced by PCA (30 components), followed by Louvain clustering with default parameters (resolution = 1). For the Louvain clustering, we first computed a neighbourhood graph using a local neighbourhood number of 50 using scanpy.pp.neighbors. We then clustered the cells into sub-groups using the Louvain algorithm implemented by the function. For each cell cluster, we applied the same strategies to identify subclusters, except that we set resolution = 3 for Louvain clustering. Subclusters with a detected doublet ratio (by Scrublet) over 15% were annotated as doublet-derived subclusters. We then removed cells which are either labelled as doublets by Scrublet or that were included in doublet-derived subclusters. Altogether, 2.7% to 16.8% of cells in each experiment were removed by this procedure.

We found that the above Scrublet and iterative clustering-based approach has difficulty identifying doublets in clusters derived from rare cell types (for example, clusters comprising less than 1% of the total cell population), so we applied a third step to further detect and remove doublets. This step uses a different strategy to cluster and subcluster the data, and then looks for subclusters whose differentially expressed genes differ from those of their associated clusters. This step consists of a series of ten substeps. (1) We reduced each cell’s expression vector to retain only protein-coding genes, long intergenic non-coding RNAs (lincRNAs) and pseudogenes. (2) Genes expressed in fewer than 10 cells and cells in which fewer than 100 genes were detected were further filtered out. (3) The dimensionality of the data was reduced by PCA (50 components) first on the top 5,000 most highly dispersed genes and then with UMAP (max_components = 2, n_neighbors = 50, min_dist = 0.1, metric = ‘cosine’) using Monocle 3-alpha14. (4) Cell clusters were identified in UMAP 2D space using the Louvain algorithm implemented in Monocle 3-alpha (resolution = 10−6). Cell partitions were detected using the partitionCells function implemented in Monocle 3-alpha. This function applies algorithms that automatically partition cells to learn disjoint or parallel trajectories based on concepts from ‘approximate graph abstraction’77. (5) We took the cell partitions identified by Monocle 3-alpha (cell clusters were used instead for three experiments that profiled embryos before E10), downsampled each partition to 2,500 cells, and computed differentially expressed genes across cell partitions with the top_markers function of Monocle 3 (reference_cells = 1000). (6) We selected a gene set combining the top ten gene markers for each cell partition (filtering out genes with fraction_expressing <0.1 and then ordering by pseudo_R2). (7) Cells from each main cell partition were subjected to dimensionality reduction by PCA (10 components) on the selected set of top partition-specific gene markers. (8) Each cell partition was further reduced to 2D using UMAP (max_components = 2, n_neighbors = 50, min_dist = 0.1, metric = ‘cosine’). (9) The cells within each partition were further sub-clustered using the Louvain algorithm implemented in Monocle 3-alpha (resolution = 10−4 for most clustering analysis). (10) Subclusters that expressed low levels of the genes that were found to be differentially expressed in step 5, had high levels of markers specific to a different partition, and had relatively high doublet scores, were labelled as doublet-derived subclusters and removed from the analysis. On average, this procedure eliminated 3.4% of cells from each experiment (range 0.5–13.2%) of the cells in each experiment (Extended Data Fig. 2a–e).

Cell clustering and cell-type annotations

For data from individual experiments, after removing the potential doublets detected by the above three steps, we further filtered out the potential low-quality cells by investigating the numbers of UMIs and the proportion of reads mapping to the exonic regions per cell (Extended Data Fig. 2f). Then, we merged cells from individual experiments to generate the penultimate dataset, which included 15 sci-RNA-seq3 experiments and 21 runs of the Illumina NovaSeq instrument. In our early embeddings of this penultimate dataset, we noticed that one mouse embryo at E14.5 had a grossly reduced proportion of neuronal cells. This particular sample had been divided during pulverization, and we suspect that specific anatomical portions of the frozen embryo did not make it into the experiment. We therefore removed cells from this E14.5 embryo, and we further filtered out cells from the whole dataset with doublet score (by Scrublet) > 0.15 (~0.3% of the whole dataset), as well as cells with either the percentage of reads mapping to ribosomal chromosome (Ribo%) > 5 or the percentage of reads mapping to mitochondrial chromosome (Mito%) > 10 (~0.1% of the whole dataset). Finally, 11,441,407 cells from 74 embryos were retained, of which the median UMI count per cell is 2,700 and median gene count detected per cell is 1,574. For this final matrix, the number of cells recovered by each embryo and the basic quality information for cells from each sci-RNA-seq3 experiment is summarized in the Supplementary Tables 1 and 2. For sex separation and confirmation of embryos with or without sex genotyping, we counted reads mapping to a female-specific non-coding RNA (Xist) or chromosome Y genes (except Erdr1 which is in both chromosome X and chromosome Y). Embryos were readily separated into females (more reads mapping to Xist than chromosome Y genes) and males (more reads mapping to chromosome Y genes than Xist).

We then applied Scanpy v.1.6.020 to this final dataset, performing conventional single-cell RNA-seq data processing: (1) retaining protein-coding genes, lincRNA, and pseudogenes for each cell and removing gene counts mapping to sex chromosomes; (2) normalizing the UMI counts by the total count per cell followed by log transformation; (3) selecting the 2,500 most highly variable genes and scaling the expression of each to zero mean and unit variance; (4) applying PCA and then using the top 30 principal components to calculate a neighbourhood graph (n_neighbors = 50), followed by Leiden clustering (resolution = 1); (4) performing UMAP visualization in 2D or 3D space (min.dist = 0.1). For cell clustering, we manually adjusted the resolution parameter towards modest overclustering, and then manually merged adjacent clusters if they had a limited number of DEGs relative to one another or if they both highly expressed the same literature-nominated marker genes. For each of the 26 major cell clusters identified by the global embedding, we further performed a sub-clustering with the similar strategies, except setting n_neighbors = 30 when calculating the neighbour graph and min_dist = 0.3 when performing the UMAP. Subsequently, we annotated individual cell clusters identified by the sub-clustering analysis using at least two literature-nominated marker genes per cell-type label (Supplementary Table 5).

To be clear, we have hierarchically nominated three levels of cell-type annotations in the manuscript. (1) In the global embedding involving all 11.4 M cells we identified 26 major cell clusters (Fig. 1b,c and Supplementary Table 4). (2) For individual major cell clusters, we performed sub-clustering, resulting in 190 cell types (Extended Data Fig. 3 and Supplementary Table 5). (3) For a handful of cell types, in specific parts of the manuscript, we performed further sub-clustering, to identify cell subtypes. For example: (i) we re-embedded 745,494 cells from the lateral plate and intermediate mesoderm derivatives, identifying 22 subtypes, most of which correspond to different types of mesenchymal cells (Fig. 3d and Supplementary Table 12). (ii) we re-embedded 296,020 cells (glutamatergic neurons, GABAergic neurons, spinal cord dorsal progenitors and spinal cord ventral progenitors) from stages

Of note, we processed and analysed the birth series dataset (n = 962,697 nuclei after removing low-quality cells and potential doublets cells) and the early versus late somites data (n = 104,671 nuclei after removing low-quality cells and potential doublets cells) using exactly the same strategy, except without performing sub-clustering on each major cell cluster.

Whole-mouse embryo analysis

Each cell was assigned to the mouse embryo from which it derived based on its reverse transcription barcode. For each of the 74 samples, UMI counts mapping to the sample were aggregated to generate a pseudo-bulk RNA-seq profile for the sample. Each cell’s counts were then normalized by dividing by its estimated size factor. The data were then log2-transformed after adding a pseudocount, and PCA was performed on the transformed data using the 3,000 most highly variable genes. The normalization and dimension reduction were performed using Monocle v3.

Quantitatively estimating cell number for individual mouse embryo at any stage during organogenesis

To estimate the cell number of individual embryos, we selected a representative embryo from 12 timepoints at 1-day increments, from E8.5 to P0 (roughly considered as E19.5). Each embryo was digested with proteinase K overnight, and total genomic DNA was isolated with a Qiagen Puregene tissue kit (Qiagen 158063). DNA was quantified and cell number was estimated by taking the total ng of recovered DNA and assuming 2.5 billion base pairs per mouse genome (times two for a diploid cell), 650 g per mole of a base pair. Estimating cell number this way does not include any losses due to the DNA preparation, and does not count non-nucleated cells.

Based on the experimentally estimated cell numbers of those 12 embryos, we applied polynomial regression (degree = 3) to fix a curve across embryos between the embryonic day and log2-scaled cell number (adjusted R2 > 0.98) (Extended Data Fig. 2l). P0 was treated as E19.5 in the model. Then, the total cell number of a whole mouse embryo at any day between E8.5 and P0 is predicted using the below formula:

$${\log }_{2}({\rm{cell}}\,{\rm{number}})=0.011369\times {{\rm{day}}}^{3}-0.583861\times {{\rm{day}}}^{2}+10.397036\times {\rm{day}}-35.469755$$

To estimate the dynamic ‘doubling time’ of the total cell number in a whole mouse embryo, at a given timepoint (day), we took the derivative from the above formula as the log2-scaled proliferation rate p(day), and then calculated \(24\times 2/{2}^{p({\rm{day}})}\), resulting in a point estimate of the number of hours required for the mouse embryo to double its total cell number (Extended Data Fig. 2m).

Characterizing transcriptional heterogeneity in the posterior embryo

We re-analysed 121,118 cells which were initially annotated as NMPs and spinal cord progenitors, mesodermal progenitors (Tbx6+), notochord, ciliated nodal cells, or gut, from embryos during the early somitogenesis (somite counts 0–34; E8–E10). Three clusters were identified, with cluster 1 dominated by NMPs and their derivatives (n = 98,545 cells), cluster 2 dominated by notochord and ciliated nodal cells (n = 3,949 cells), and cluster 3 dominated by gut cells (n = 18,624 cells).

To characterize transcriptional heterogeneity within each of the three cell clusters, we performed PCA on the 2,500 most highly variable genes in each cluster. Then, we calculated the Pearson correlation between the expression of the top highly variable genes and each of the top principal components within each of the three cell clusters. In brief, for each cell cluster, the top 2,500 highly variable genes were identified and their gene expression values were calculated from original UMI counts normalized to total UMIs per cell, followed by natural-log transformation and scaling. After performing Pearson correlation with the selected principal component, significant genes were identified if their correlation coefficients are less than mean − 1 × s.d. or greater than mean + 1 × s.d. of all the correlation coefficients, and false discovery rate < 0.05. In addition, we identified differentially expressed genes between early (n = 4,949 cells) and late (n = 3,910 cells) NMPs, using the FindMarkers function of Seurat v363, after filtering out genes that are detected in <10% of cells in both of the two populations. Significant genes were identified if their absolutely log-scaled fold changes >0.25, and adjusted P values < 0.05. Of note, here cells are labelled as NMPs if they are both strongly T+ (raw count ≥5) and Meis1 (raw count = 0).

In Fig. 2k, the Pearson correlation coefficient between gene expression for the top highly variable genes and either PC1 of notochord (x axis) or PC1 of gut (y axis) are plotted. The overlapped genes between two cell clusters are shown as each dot, and the overlapped significant genes are highlighted in blue. The first quadrant corresponds to the inferred anterior aspect of each cluster, while the third quadrant corresponds to the inferred posterior aspect. In Fig. 2l, the log-scaled fold change of the average expression for the top highly variable genes between early versus late NMPs (x axis), and the Pearson correlation coefficient between gene expression for the top highly variable genes and PC2 of gut (y axis) are plotted. The first quadrant is associated with early somite counts for each cluster, while the third quadrant is associated with late somite counts. In the gene expression line plots in Fig. 2e, left and Fig. 2k,l, right, gene expression values were calculated from original UMI counts normalized to total UMIs per cell, followed by natural-log transformation. The line of gene expression was plotted by the geom_smooth function in ggplot2.

Spatial mapping with Tangram

To infer the spatial origin of each lateral plate and intermediate mesoderm derivative, we used a public dataset called Mosta46, which profiles spatial transcriptomes for 53 sections of mouse embryos spanning 8 timepoints from E9.5 to E16.5. We combined this data with our own data to perform spatial mapping analysis using Tangram47. In brief, for each timepoint of the Mosta data, we combined scRNA-seq data from three adjacent timepoints from our data (for example, E16.25, E16.5 and E16.75 from scRNA-seq versus E16.5 from Mosta data), and the total number of voxels within each section was randomly downsampled to 9,000 for computational efficiency. We used the Tangram with default parameters to estimate the spatial coordinates of cells from each cell type in the scRNA-seq data, and then visualized the results on the coordinates provided by Mosta. The Tangram model was trained in GPU mode using a NVIDIA A100 GPU. After applying Tangram, for each section, a cell-by-voxel matrix with mapping probabilities was returned. This matrix shows the probability that each cell originated from each voxel in the section. To reduce noise, we further smoothed the mapping probabilities for each voxel by averaging values of their k-nearest neighbouring voxels (k is calculated by natural-log-scaled total number of voxels on that section) followed by scaling it to 0 to 1 across voxels of each section. Although only selected results are presented in the paper, the mapping results for each Mosta section on which we performed this analysis are available at

Generating a cell-type tree for mouse development

We collected and combined scRNA-seq data from four published datasets, which consisted of 110,000 cells spanning E0 to E8.5, and the main dataset described in this paper, which consisted of 11.4 million cells spanning E8 to P0 (Supplementary Table 17). We generated the tree of cell types for mouse development via the following steps.

First, based on data source, developmental window and cell-type annotations, we split cells into fourteen subsystems which could be separately analysed and subsequently integrated. The first two subsystems correspond to the pre-gastrulation and gastrulation phases of development and are based on the external datasets4,5,6,7. The remaining 12 subsystems derive from the data reported here, and collectively encompass organogenesis and fetal development (Supplementary Tables 17 and 18).

Second, dimensionality reduction was performed separately on cells from each of the fourteen subsystems. Manual re-examination of each subsystem led to some corrections or refinements of cell-type annotations, ultimately resulting in 283 annotated cell-type nodes, some with only a handful of cells (for example, 60 ciliated nodal cells) and others with vastly more (for example, 650,000 fibroblasts) (Supplementary Tables 19 and 20). Of note, each of these annotated cell-type nodes derives from one data source, such that there are some redundant annotations that facilitate ‘bridging’ between datasets (Extended Data Fig. 11d–h). In contrast to our previous strategy in which nodes were stage-specific8, each cell-type node here is temporally asynchronous, and of course may also contain other kinds of heterogeneity (for example, spatial, differentiation, cell cycle and others).

Third, we sought to draw edges between nodes (Fig. 5a–f). Within each subsystem, we identified pairs of cells that were MNNs in 30-dimensional PCA space (k = 10 neighbours for pre-gastrulation and gastrulation subsystems, k = 15 for organogenesis and fetal development subsystems). Although the overwhelming majority of MNNs occurred within cell-type nodes, some MNNs spanned nodes and are presumably enriched for bona fide cell-type transitions. To approach this systematically, we calculated the total number of MNNs that spanned each possible pair of cell-type nodes within a given subsystem, normalized by the total number of possible MNNs between those nodes, and ranked all possible intra-subsystem edges based on this metric (Supplementary Table 21). Of note, due to its complexity, this was done in two stages for the ‘Brain and spinal cord’ subsystem, first applying the heuristic to the subset of cell types corresponding to the patterned neuroectoderm, and then again to identify edges between the patterned neuroectoderm and its derivatives (that is, neurons, glial cells and others).

Fourth, we manually reviewed the ranked list of 1,155 candidate edges for biological plausibility (those with a normalized MNN score > 1; Extended Data Fig. 11d), resulting in 452 edges which we manually annotated as more likely to correspond to either ‘developmental progression’ or ‘spatial continuity’ (Supplementary Table 22). Where nodes were connected to more than one other node, distinct subsets of cells were generally involved in each edge (Fig. 5a,b,d,e), and inter-node MNN pairs exhibited temporal coincidence (Fig. 5c,f). As only a handful of cells were profiled in the pre-gastrulation subsystem, those edges were added manually.

Finally, to bridge subsystems, we performed batch correction and co-embedding of selected timepoints from either the pre-gastrulation and gastrulation datasets, or the gastrulation and organogenesis and fetal development datasets, to identify equivalent cell-type nodes, resulting in a third category of ‘dataset equivalence’ edges (Extended Data Fig. 11e–h). For example, we performed anchor-based batch correction63 followed by integration between cells from E6.5 to E8.5 generated on the 10x Genomics platform7 (n = 108,857 cells) and the earliest 1% of this dataset (0–12 somite stage embryos) generated by sci-RNA-seq3 (n = 153,597 nuclei) (Extended Data Fig. 11e,f). This allowed us to identify 36 cell types from the integrated dataset, which we used to identify bridging edges between the gastrulation subsystem and the later subsystems (Extended Data Fig. 11g,h). Most of the 12 organogenesis and fetal development subsystems originate in cell-type nodes for which equivalent nodes are already present at gastrulation. The exceptions, presumably due to undersampling of this transition, were the ‘blood’ and ‘PNS neuron’ subsystems, for which we manually added edges to connect them with biologically plausible pseudo-ancestors. Altogether, we added 55 inter-subsystem edges.

In practice, a small number of nodes in the tree have more than one parent, so the ‘tree’ is formally a rooted, directed graph that represents mouse development from E0 to P0. The visualization shown in Fig. 5g was created using yFiles Hierarchical layout in Cytoscape v3.9.1. For presentation purposes, we removed most of the spatial continuity edges, except for those between spinal cord dorsal and ventral progenitors after E13.0 and GABAergic and glutamatergic neurons after E13.0. We also merged nodes with redundant labels derived from different datasets (that is, dataset equivalence edges). This resulted in a rooted graph with 262 cell-type nodes and 338 edges.

Our evaluation of the robustness of our approach to technical factors or parameter choices is provided in Extended Data Fig. 11a–c and Supplementary Note 2.

Nominating key transcription factors and genes

The list of 1,636 mouse proteins that are putatively transcription factors was collated from AnimalTFDB v3 ( For each edge in the cell-type tree, we stratified each cell-type transition into four phases. Specifically, we identified the subset of cells within each node that were either ‘inter-node’ MNNs of the other cell-type or ‘intra-node’ MNNs of those cells. If A → B, this approach effectively models the transition as group 1 → 2 → 3 → 4 (Extended Data Fig. 11i,j). Next, we identified DETFs and genes (DEGs) across each portion of the modelled transition—that is, early (1 → 2), inter-node (2 → 3) and late (3 → 4)—by applying FindMarkers function in Seurat v3 with parameters (logfc.threshold = 0, min.pct = 0). This strategy highlights differences between cells that are most proximate to the cell-type transition itself.

After excluding dataset equivalence edges and the ‘pre-gastrulation’ subsystem, we nominated key transcription factors and genes that specify cell types for each of the 436 edges. Of note, the directionality of many of these edges was not immediately obvious (that is, those annotated as “spatial continuity” edges). In these cases, the orientation of the ‘early’ and ‘late’ phases is arbitrary. For edges with a relatively small number of MNN pairs, we expanded each group to at least 200 cells by iteratively including their MNNs within the same cell type, to increase statistical power.

Identifying cell types with abrupt transcriptional changes before versus after birth

To systematically identify which cell types exhibit abrupt transcriptional changes before versus after birth, we performed the following steps.

  • We focused on the 71 cell types with at least 200 cells from P0 and at least 200 cells from at least 5 timepoints prior to P0.

  • We combined cells from animals collected subsequent to E16 and performed PCA based on the top 2,500 highly variable genes.

  • Timepoints with at least 200 cells were selected and cells were downsampled from each timepoint to the median number of cells across those selected timepoints.

  • The k-nearest neighbours (k was adjusted for different cell types, by taking the log2-scaled median number of cells across the selected timepoints) were identified in PCA space (n = 30 dimensions).

  • We calculated the average proportion of nearest neighbour cells that were from a different timepoint for cells within each cell type. In this framing, a low proportion of neighbours from different timepoints corresponds to a relatively abrupt change in transcriptional state.

We subjected the birth-series dataset to a similar analysis. For each major cell cluster in the birth-series dataset, we took cells from the 6 pups delivered by C-section and calculated the Pearson correlation coefficient between the timepoint of each cell and the average timepoints of its 10 nearest neighbours identified from the global PCA embedding (n = 30 dimensions). In this framing, a high correlation indicates that the cell and its nearest neighbours all underwent rapid, synchronized changes in transcriptional state.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link


Leave a Reply

Your email address will not be published. Required fields are marked *