All tissue samples used for this study were obtained with written informed consent from all participants in accordance with the guidelines in The Declaration of Helsinki 2000.
Human embryo and fetal samples were obtained from the MRC and Wellcome-financed Human Developmental Biology Resource (HDBR, http://www.hdbr.org), with appropriate maternal written consent and approval from the Fulham Research Ethics Committee (REC reference no. 18/LO/0822) and Newcastle and North Tyneside 1 Research Ethics Committee (REC reference no. 18/NE/0290). The HDBR is regulated by the UK Human Tissue Authority (www.hta.gov.uk) and operates in accordance with the relevant Human Tissue Authority Codes of Practice.
Assignment of developmental stage
Embryos up to 8 PCW were staged using the Carnegie staging method47. At stages beyond 8 PCW, age was estimated from measurements of foot length and heel-to-knee length and compared with the standard growth chart48. A piece of skin, or if this was not possible, chorionic villi tissue, was collected from every sample for quantitative PCR analysis using markers for the sex chromosomes and autosomes 13, 15, 16, 18, 21 and 22, which are the most commonly seen chromosomal abnormalities. All samples were karyotypically normal.
All tissues for sequencing and spatial work were collected in HypoThermosol biopreservation medium and stored at 4 °C until processing. Tissue dissociation was conducted within 24 h of tissue retrieval with the exception of tissues that were cryopreserved and stored at −80 °C (Supplementary Table 1).
We used the previous protocol optimized for gonadal dissociation8 and this is available at protocols.io (ref. 49). In short, tissues were cut into <1 mm3 segments before being digested with Trypsin/EDTA 0.25% for 5–15 min at 37 °C with intermittent shaking. Samples less than 17 PCW were also digested using a combination of collagenase and Trypsin/EDTA, a protocol adapted from Wagner et al.50,51. In short, samples were first digested with collagenase 1A (1 mg ml−1) and liberase TM (50 µg ml−1) for 45 min at 37 °C with intermittent shaking. The cell solution was further digested with Trypsin/EDTA 0.25% for 10 min at 37 °C with intermittent shaking. In both protocols, digested tissue was passed through a 100 µm filter and cells collected by centrifugation (500g for 5 min at 4 °C). Cells were washed with PBS before cell counting.
Dissociated cells were incubated at 4 °C with 2.5 μl of antibodies in 1% FBS in Dulbecco’s PBS without calcium and magnesium (Thermo Fisher Scientific, 14190136). To isolate CD45+ and CD45− cells, we used the antibody CD45-BUV395 BD Bioscience 563791 Clone HI30 (RUO) Flow cytometry (dilution 2.5 μl:100 μl). 4,6-Diamidino-2-phenylindole (DAPI) was used for live versus dead discrimination. Cells were sorted using a Becton Dickinson (BD) FACS Aria Fusion with five excitation lasers (355, 405, 488, 561 and 635 nm red), and 18 fluorescent detectors, plus forward and side scatter. The sorter was controlled using BD FACS DIVA software (v.7), and FlowJo v.10.3 was used for analysis.
Single-nuclei suspensions were isolated from dissociated cells when performing scATAC-seq, following the manufacturers’ instructions, and from frozen tissue sections when performing multiomic snRNA-seq/scATAC-seq. For the latter, thick (300 µm) sections were cryosectioned and kept in a tube on dry ice until subsequent processing. Nuclei were released by Dounce homogenization as described in detail in the protocols.io (ref. 52).
Fresh tissue was cut into <1 mm3 segments before being resuspended with 1 ml of ice-cold Cryostor solution (CS10) (C2874-Sigma). The tissue was frozen at −80 °C by decreasing the temperature at about 1 °C per minute. The detailed protocol is available at https://www.protocols.io/view/tissue-freezing-in-cryostor-solution-processing-bgsnjwde.
Fresh tissue samples of human developing gonads were embedded in cold optimal cutting temperature compound (OCT) medium and flash frozen using a dry ice-isopentane slurry. The protocol is available at protocols.io (ref. 53).
Tissue collection from mouse embryos
Developing ovaries, testes and mesonephros were collected from E10.5, E11.5 and E12.5 mouse embryos carrying the Oct4ΔPE-GFP transgene. Mice were housed in specific pathogen-free conditions at the UK Home Office-approved facility at the University of Cambridge. Mice were maintained with a 12 h light/12 h dark cycle, with temperature ranging from 20–24 °C and humidity of 45–65%. Embryos were genotyped to identify the gender. We included six males and three females at E10.5, six males and two females at E11.5, and three males and three females at E12.5. Sample size was not estimated. Developing gonads were dissected from the mesonephros and both organs were separately dissociated with 0.25% Trypsin/EDTA into single-cell suspensions as described for the human tissue. Tissues (gonads or mesonephros) from the same sex and stage were sequenced together. For smFISH imaging, we collected another E13.5 female embryo. For sectioning, tissues were fixed in 4% (w/v) formaldehyde solution for 2 h at 4 °C. Samples were washed with PBS and afterwards sequentially incubated with 10 and 20% (w/v) sucrose at 4 °C. After, samples were embedded in OCT and subsequently flash frozen using a dry ice-isopentane slurry. All experimental procedures were in agreement with the project licence PE596D1FE issued by the Animal Welfare Ethical Review Board committee under the UK Home Office and carried out in a Home Office designated facility, in accordance with ethical guidelines and with the UK Animals (Scientific Procedures) Act of 1986.
Haematoxylin and eosin staining and imaging
Fresh frozen sections were removed from −80 °C storage and air dried before being fixed in 10% neutral buffered formalin for 5 min. After being rinsed with deionized water, slides were dipped in Mayer’s haematoxylin solution for 90 s. Slides were completely rinsed in 4–5 washes of deionized water, which also served to blue the haematoxylin. Aqueous eosin (1%) was manually applied onto sections with a pipette and rinsed with deionized water after 1–3 s. Slides were dehydrated through an ethanol series (70, 70, 100, 100%) and cleared twice in 100% xylene. Slides were coverslipped and allowed to air dry before being imaged on a Hamamatsu NanoZoomer 2.0HT digital slide scanner.
Multiplexed smFISH and high-resolution imaging
Large tissue section staining and fluorescent imaging were conducted largely as described previously54. Sections were cut from fresh frozen or fixed frozen samples embedded in OCT at a thickness of 10 μm using a cryostat, placed onto SuperFrost Plus slides (VWR) and stored at −80 °C until stained. For formalin-fixed paraffin-embedded samples, sections were cut at a thickness of 5 μm using a microtome, placed onto SuperFrost Plus slides (VWR) and left at 37 °C overnight to dry and ensure adhesion. Tissue sections were then processed using a Leica BOND RX to automate staining with the RNAscope Multiplex Fluorescent Reagent Kit v2 Assay (Advanced Cell Diagnostics, Bio-Techne), according to the manufacturers’ instructions. Probes are listed in Supplementary Table 10. Before staining, human fresh frozen sections were post-fixed in 4% paraformaldehyde in PBS for 15 min at 4 °C, then dehydrated through a series of 50, 70, 100 and 100% ethanol, for 5 min each. Following manual pretreatment, automated processing included epitope retrieval by protease digestion with Protease IV for 30 min before probe hybridization. Mouse fixed frozen sections were subjected to the same manual pretreatment described above. Subsequently, the automated processing for these sections included heat-induced epitope retrieval at 95 °C for 5 min in buffer ER2 and digestion with Protease III for 15 min before probe hybridization. On this treatment, no endogenous fluorescence from the Oct4ΔPE-GFP transgene was observed. For formalin-fixed paraffin-embedded sections, automated processing included baking at 60 °C for 30 min and dewaxing, as well as heat-induced epitope retrieval at 95 °C for 15 min in buffer ER2 and digestion with Protease III for 15 min before probe hybridization. Tyramide signal amplification with Opal 520, Opal 570 and Opal 650 (Akoya Biosciences) and TSA-biotin (TSA Plus Biotin Kit, Perkin Elmer) and streptavidin-conjugated Atto 425 (Sigma Aldrich) was used to develop RNAscope probe channels.
Stained sections were imaged with a Perkin Elmer Opera Phenix High-Content Screening System, in confocal mode with 1 μm z-step size, using a ×20 (numerical aperture (NA) 0.16, 0.299 μm per pixel), ×40 (NA 1.1, 0.149 μm per pixel) or ×63 (NA 1.15, 0.091 μm per pixel) water-immersion objectives. Channels were as follows: DAPI (excitation 375 nm, emission 435–480 nm), Atto 425 (excitation 425 nm, emission 463–501 nm), Opal 520 (excitation 488 nm, emission 500–550 nm), Opal 570 (excitation 561 nm, emission 570–630 nm) and Opal 650 (excitation 640 nm, emission 650–760 nm).
Confocal image stacks were stitched as two-dimensional maximum intensity projections using proprietary Acapella scripts provided by Perkin Elmer.
10X Genomics Chromium GEX (gene expression) library preparation and sequencing
For the scRNA-seq experiments, cells were loaded according to the manufacturer’s protocol for the Chromium Single Cell 5′ Kit v.1.0, v.1.1 and v.2 (10X Genomics) to attain between 2,000 and 10,000 cells per reaction. Library preparation was carried out according to the manufacturer’s protocol. Libraries were sequenced, aiming at a minimum coverage of 20,000 raw reads per cell, on the Illumina HiSeq4000 or Novaseq 6000 systems using the sequencing format: read 1, 26 cycles; i7 index, 8 cycles, i5 index, 0 cycles; read 2, 98 cycles.
For the scATAC-seq and multimodal snRNA-seq/scATAC-seq experiments, cells were loaded according to the manufacturer’s protocol for the Chromium Single Cell ATAC v.1.0 and Chromium Single Cell Multiome ATAC + Gene Expression v.1.0 to attain between 2,000 and 10,000 cells per well. Library preparation was carried out according to the manufacturer’s protocol. Libraries for scATAC-seq were sequenced on Illumina NovaSeq 6000, aiming at a minimum coverage of 10,000 fragments per cell, with the following sequencing format; read 1, 50 cycles; i7 index, 8 cycles, i5 index, 16 cycles; read 2, 50 cycles.
10X Genomics Visium library preparation and sequencing
Cryosections of 10 μm were cut and placed on Visium slides. These were processed according to the manufacturer’s instructions. In brief, sections were fixed with cold methanol, stained with H&E and imaged on a Hamamatsu NanoZoomer S60 before permeabilization, reverse transcription and complementary DNA synthesis using a template-switching protocol. Second-strand cDNA was liberated from the slide and single-indexed libraries prepared using a 10X Genomics PCR-based protocol. Libraries were sequenced (one per lane on a HiSeq4000), aiming for 300 million raw reads per sample, with the following sequencing format; read 1, 28 cycles, i7 index, 8 cycles, i5 index, 0 cycles and read 2, 91 cycles.
Alignment and quantification of sc or snRNA-seq data
For each sequenced scRNA-seq library, we performed read alignment to the 10X Genomics’ GRCh38 v.3.1.0 (human) or Mm10-2020 (mouse) reference genomes, quantification and initial quality control using the Cell Ranger Software (v.3.1, 10X Genomics) using default parameters. For each sequenced multimodal snRNA-seq library, we performed read alignment to the 10X Genomics’ GRCh38 v.3.1.0 (human) reference genome, quantification and initial quality control using the Cell Ranger ARC Software (v.1.0.1, 10X Genomics) using default parameters. Cell Ranger filtered count matrices were used for downstream analysis.
Downstream scRNA-seq analysis
We used Scrublet for cell doublet calling on a per-library basis. We used a two-step diffusion doublet identification followed by Bonferroni-false discovery rate (FDR) correction and a significance threshold of 0.01, as described31. Predicted doublets were not excluded from the initial analysis, but used afterwards to flag clusters with high doublet scores.
Quality filters, alignment of data across different batches and clustering
For scRNA-seq libraries, we integrated the filtered count matrices from Cell Ranger and analysed them with Scanpy v.1.7.0, with the pipeline following their recommended standard practices. In brief, we excluded genes expressed by fewer than three cells and excluded cells expressing fewer than 500 genes (or 2,000 genes in mouse), more than 20% mitochondrial content (5% in mouse) or with both more than 10% mitochondrial content and fewer than 1,500 counts correctly mapped to the transcriptome. After converting the expression space to log(CPM/100 + 1), the object was transposed to gene space to identify cell cycling genes in a data-driven manner, as described31,32. After performing principal component analysis (PCA), neighbour identification and Leiden clustering, the members of the gene cluster including known cycling genes (CDK1, MKI67, CCNB2 and PCNA) were flagged as the data-derived cell cycling genes and discarded in each downstream analysis. We identified highly variable genes (n = 2,000) using Seurat v3 flavour on the raw counts, which were used to correct for batch effect with single-cell variational inference (scVI) v.0.6.8. In the analysis of human scRNA-seq, we corrected for sample source and donor effect in both the main and the germ and somatic reanalysis. In the analysis of mouse scRNA-seq we corrected for sample effect and origin of the dataset, this last if combined with external data (below). The resulting latent representation of each cell in the dataset was used for neighbour identification, Leiden clustering and uniform manifold approximation and projection (UMAP) visualization.
General analysis was done separately on males and females in each species. Germ, gonadal somatic, endothelial and immune cells were subsequently reanalysed integrating both sexes into the same manifold, using the approach described in the previous paragraph. Furthermore, gonadal somatic cells from samples at the time of sex specification (younger than CS23) were further reanalysed for fine-grained annotation and validation.
Mouse gonad data
We combined our in-house mouse raw counts matrix with the raw count matrices from the ovarian samples profiled by Niu et al.5, comprising E11.5 to P5 developmental stages (GSE136441)5. For the Niu et al.5 dataset, we excluded cells that expressed fewer than 1,000 genes or more than 20% of mitochondrial genes.
For the analysis of mouse germ cells, we also included the mouse dataset generated by Mayère et al.10, which contains germ cells from mice from E10 to E18 developmental stages (GSE136220)10. ENSEMBL gene IDs provided by the authors were converted to gene names using the appropriate genome build (GRCm38.p5). We filtered out cells that expressed fewer than 1,000 genes or more than 20% of mitochondrial genes. Next, we concatenated male and female mouse germ cells data from our general analysis (already including data from Niu et al.5) with the germ cells dataset from Mayère et al.10, keeping the genes shared between the three datasets. The resulting matrix was integrated by sample and origin of the dataset using scVI on the basis of the procedure described above.
Macaque gonads data
In addition, we downloaded a macaque dataset profiling fetal ovaries at stages E84 and E116 (GSE149629)11 and included it in our cross-species comparison of germ and female somatic cells. Owing to low sequencing depth, we filtered out cells expressing fewer than 300 genes and more than 20% of mitochondrial genes.
As for mice, macaque gene identifiers were converted to human genes using ENSEMBL Biomart multi-species comparison filter. Genes with several mappings were discarded.
Annotation of scRNA-seq datasets cross-species
Identification, labelling and naming of the unbiased clusters was carried out on each species individually using a manual approach that we validated using a SVM classifier (see Cross-species comparison section below). For the manual approach, we first identified cluster-specific genes that we used to classify clusters into main cell types on the basis of bona fide marker genes previously reported in the literature. Next, we refined the annotation accounting for the spatiotemporal dynamics in each sex.
To identify marker genes specific to a cluster, we used the TF-IDF approach from the SoupX package v.1.5.0 (ref. 55) in R v.4.0.3. To estimate the cell cycle phase of each cell (that is, G1, S or G2/M), we aggregated the expression of G2/M and S phase markers and classified the barcodes following the method described in ref. 56 implemented in Scanpy score_genes_cell_cycle function. We discarded the clusters that: (1) were specific to a single donor; (2) had a higher average doublet score; (3) had lower numbers of expressed genes with no distinctive gene expressed (from TF-IDF approach) or (4) were enriched for marker genes for erythroid cells (red blood cells) and likely to be cell-free messenger RNA soup55.
We compared the transcriptional signatures of the cell types identified in our human scRNA-seq to their mouse counterparts, considering all developmental stages combined. Mouse gene identifiers were converted to human genes using ENSEMBL Biomart multi-species comparison filter. Genes with several mappings were discarded. Furthermore, genes associated with the cell cycle were removed to avoid biases. Before training the model, human cell types were downsampled to the cell type with the lowest number of cells to obtain a balanced dataset. Here, 75% of the data were used for training the model and 25% of the data were used to test the model. Raw counts were normalized and log-transformed, and the 300 most highly variable genes were selected. We then trained an SVM classifier (sklearn.svm.SVC) on human data and projected the cell type annotations onto the mouse datasets. By doing so, we obtained a predicted probability value that each cell in the mouse and macaque dataset corresponded to every given human cell type annotation. To study the transcriptomic similarity of a given cell type across species, we compared the estimated probabilities between human–mouse matching cell types and visualized them with boxplots. A detailed description of the workflow used for cross-species comparison is reported in Supplementary Note 2.
Agreement with external human gonads data
We evaluated the consistency between the main lineages identified in our study with the Smart-seq2 dataset of gonadal cells from Li et al.7 (GSE86146). From Li et al.7, we downloaded the normalized transcripts per million matrix and annotated their cells using the ‘FullAnnot’ field provided in the S1 table of the publication. We used the scmap tool57 to project the Li et al. annotations onto our dataset, using a similarity cut-off of 0.5 to retrieve high confidence alignment, on each sex separately. To speed up computational times, we downsampled our dataset to 50% size. Li et al.’s annotations were visualized onto the male and female UMAPs, respectively.
To validate the new ESGCs population, we queried the 10X scRNA-seq dataset of developing testis from GSE143356 (ref. 58) analysed by Guo et al.59. Here, we downloaded the raw expression count matrix, and excluded cells expressing fewer than 300 genes and more than 20% of mitochondrial genes. We carried out downstream analysis as previously described for UMAP visualization. Finally, we trained a SVM classifier (sklearn.svm.SVC) on our early human male somatic cells (
Analysis of immune cells in the gonads
Cell Ranger filtered count matrices of CD45+ enriched samples were processed using the workflow described above for the main scRNA-seq analysis (doublet detection, alignment of data across different batches with scVI and clustering). These cells were then merged with the cluster of immune cells from the non-enriched samples. The resulting clustered manifold was preliminary annotated by transferring labels from a publicly available dataset of human fetal liver haematopoiesis31. Developing liver scRNA-seq raw counts were downloaded from ArrayExpress (E-MTAB-7407), processed with Scanpy v.1.7.0 workflow described above for the main scRNA-seq analysis and filtered on the basis of the expression of CD45 (PTPRC) to exclude non-immune cells. We then trained a SVM classifier (sklearn.svm.SVC) on the filtered liver dataset and used it to predict cell types on our gonadal immune dataset. The label transfer workflow is analogous to that described for cross-species comparison (Supplementary Note 2), except for the initial ENSEMBL gene ID conversion, which is not necessary in this case as we are transferring labels between human datasets. Predicted cell type annotations were validated or disproved by looking at the expression of known marker genes.
To study the unique profile of our gonadal macrophages, we downloaded immune cells from several developing tissues: liver, skin, kidney, yolk sac, gut, thymus, placenta, bone marrow and brain28,31,32,33,34,35. Raw sequencing data were downloaded from ArrayExpress (E-MTAB-7407, E-MTAB-8901, E-MTAB-8581, E-MTAB-0701, E-MTAB-9801) or Gene Expression Omnibus (GEO) (GSE141862). For all datasets, we filtered out cells expressing fewer than 300 genes and more than 20% of mitochondrial genes. Downstream data analyses for these datasets were performed with the Scanpy v.1.7.0 workflow analogously to what is described in the main scRNA-seq analysis section above. Myeloid cells from fetal liver, skin, kidney, yolk sac, gut, thymus, placenta, bone marrow and brain datasets were selected on the basis of the expression of established myeloid markers (CD14, CD68, CSF1R). We then combined the resulting myeloid dataset with our gonadal myeloid cells and used scVI with a combined batch of donor and sample to integrate across the different organs.
Projection of fetal osteoclasts from Jardine et al.35 and microglial cells from Bian et al.29 onto our immune dataset was done using an SVM model. Similarly, we trained an SVM model on our gonadal macrophages and projected the cell type annotations onto fetal testicular myeloid cells from Chitiashvili et al.58. The label transfer workflow is analogous to that described for cross-species comparison (Supplementary Note 2), except for the initial ENSEMBL gene ID conversion, which is not necessary in this case as we are transferring labels between human datasets
Trajectory inference in the germ and early somatic lineages
For both germ and early somatic cells, we modelled differentiation trajectories and conducted pseudotime analysis by ordering cells along the reconstructed trajectory with Palantir (v.1.0.0)60 following their tutorial. In brief, cells were subsampled to balance cell type and sex contribution (n = 500 for germ and n = 150 for somatic cells). The top 2,000 highly variable genes were used for PCA. Next, we determined the diffusion maps from the PCA space (with five top components), and projected the diffusion components onto a t-SNE low dimensional embedding to visualize the data. Finally, we used the function run_palantir (with num_waypoints = 500) to estimate the pseudotime of each cell from the root cell. The barcode with the highest normalized expression of POU5F1 (PGC marker) or UPK3B (mesothelial marker) was used as the cell of origin in the germ and early somatic analyses, respectively. Terminal states were determined automatically by Palantir.
For samples at the time of sex specification, we computed RNA velocities61 to model early somatic development with scVelo (v.0.2.4)62 following their tutorial. Analysis was done on each sample separately in humans and mice. First, we used STARsolo to quantify spliced and unspliced counts, keeping the same 10X Genomics genome references used in Cell Ranger before. Next, we reprocessed the somatic cells (only cells at G1 phase) from each sample independently, performed PCA on the top 2,000 highly variable genes, neighbour identification and UMAP projection to visualize previously annotated cell types. Doublets and low quality control were discarded with unbiased Leiden clustering if necessary. We also excluded extragonadal coelomic epithelium GATA2+. Using scVelo, we computed the RNA moments and estimated velocities with ‘stochastic’ mode. Next, with scVelo we combined transcriptional similarity-based trajectory inference with directional RNA velocity and generated the velocity graph on the basis of cosine similarities. To further characterize the cell fate decision process in an unbiased way, we leveraged the RNA moments with the CellRank package (v.1.5.1). Specifically, CellRank uses a random walk model to learn directed, probabilistic state-change trajectories and determine initial and terminal states. We set the number of terminal states to four, letting CellRank determine the number of initial states. We extracted the fate probability of each cell ending up in one of the terminal states.
Alignment, quantification and quality control of ATAC data
We processed scATAC-seq libraries (read filtering, alignment, barcode counting and cell calling) with 10X Genomics Cell Ranger ATAC pipeline (v.1.2.0) using the prebuilt 10X’s GRCh38 genome (v.3.1.0) as reference. We called the peaks using an in-house implementation of the approach described in Cusanovich et al.63 (available at https://github.com/cellgeni/cellatac, revision 21-099). In short, the genome was broken into 5 kb windows and then each cell barcode was scored for insertions in each window, generating a binary matrix of windows by cells. Matrices from all samples were concatenated into a unified matrix, which was filtered to retain only the top 200,000 most commonly used windows per sample. Using Signac (https://satijalab.org/signac/ v.0.2.5), the binary matrix was normalized with TF-IDF followed by a dimensionality reduction step using singular value decomposition. Latent semantic indexing was clipped at ±1.5. The first latent semantic indexing component was ignored as it usually correlates with sequencing depth (technical variation) rather than a biological variation63. The 2–30 top remaining components were used to perform graph-based Louvain clustering. Next, peaks were called separately on each cluster using macs2 (ref. 64). Finally, peaks from all clusters were merged into a master peak set (that is, peaks overlapping in at least one base pair were aggregated) and used to generate a binary peak by cell matrix, indicating any reads occurring in each peak for each cell.
Downstream scATAC-seq analysis
Quality filters, alignment of data across different batches and clustering
To obtain a set of high-quality peaks for downstream analysis, we filtered out peaks that (1) were included in the ENCODE blacklist, (2) had a width outside the 210–1,500 bp range and (3) were accessible in less than 4% of cells from a cellatac cluster. Low-quality cells were also removed by setting to 5.5 the minimum threshold for log1p transformed total counts per cell.
We adopted the cisTopic approach65,66 v.0.3.0 for the core of our downstream analysis. cisTopic uses latent Dirichlet allocation to estimate the probability of a region belonging to a regulatory topic (region-topic distribution) and the contribution of a topic within each cell (topic-cell distribution). The topic-cell matrix was used for constructing the neighbourhood graph, computing UMAP projections and clustering with the Leiden algorithm. Donor effects were corrected using Harmony67 (theta = 0). Cell doublets were identified and removed using scrublet68.
Gene activity scores
Next, we generated a denoised accessibility matrix (predictive distribution) by multiplying the topic-cell and region-topic distribution and used it to calculate gene activity scores. To integrate them with scRNA-seq data, gene activity scores were rounded and multiplied by a factor of 107, as previously described66.
Cell type annotation
To annotate cell types in scATAC-seq data, we first performed label transfer from scRNA-seq data of matched individuals. We used canonical correlation analysis as a dimensionality reduction method and vst as a selection method, along with 3,000 variable features and 25 dimensions for finding anchors between the two datasets and transferring the annotations6. The predicted cell type annotations by label transfer were validated by importing annotations of the multiomic snRNA-seq/scATAC-seq profiling data. To visualize the correspondence between scATAC-seq final annotations and predictions from label transfer, we plotted the average label transfer score (value between 0 and 1) of each cell type in the annotated cell types in scATAC-seq data.
Cell type-specific cis-regulatory networks
Coaccessible peaks in the genome and cis-coaccessibility networks (CCANs) were estimated using the R package Cicero69 v.188.8.131.52 with default parameters. We then filtered the denoised accessibility matrix from cisTopic to keep only the peaks included in CCANs. The resulting matrix was further processed to average cells by cell type and peaks by CCAN. Finally, we z scored the matrix across CCANs and visualized the separation of CCANs by cell type by hierarchical clustering and plotting the heatmap.
Alignment, quantification and quality control of Visium data
For each 10X Genomics Visium sequencing data, we used Space Ranger Software Suite (v.1.2.1) to align to the GRCh38 human reference genome (official Cell Ranger reference, v.2020-A) and quantify gene counts. Spots were automatically aligned to the paired H&E images by Space Ranger software. All spots under tissue detected by Space Ranger were included in downstream analysis.
Downstream analysis of 10X Genomics Visium data
Location of cell types in Visium data
To spatially locate the cell states on the Visium transcriptomics slides, we used the cell2location tool v.0.05-alpha (ref. 70). As reference, we used scRNA-seq data from individuals of the same sex and gestational stage. We used general cell annotations from the main analysis, with the exception of the main gonadal lineages (germ, supporting and mesenchymal) for which we considered the identified subpopulations. We used default parameters with the exception of cells_per_spot that was set to 20. Each Visium section was analysed separately. Results were visualized following the cell2location tutorial. Plots represent estimated abundance for cell types. The size of the Visium spot in the plots was scaled accordingly to enhance visualization.
CellPhoneDB and CellSign
We updated the CellphoneDB database to include: (1) extra manually curated protein cell–cell interactions (n = 1,852 interactions) and (2) cell–cell interactions involving non-protein ligands such as steroid hormones and other small molecules (n = 194). For the latter, we used the last bona fide enzyme in the biosynthesis pathway (Supplementary Table 11a,b).
To retrieve interactions between supporting and other cell populations identified in our gonadal samples, we used an updated version of our CellPhoneDB34,71 (https://github.com/ventolab/CellphoneDB) approach described in ref. 72. In short, we retrieved the interacting pairs of ligands and receptors meeting the following requirements: (1) all the protein members were expressed in at least 10% of the cell type under consideration; and (2) at least one of the protein members in the ligand or the receptor was a differentially expressed gene, with an adjusted P value below 0.01 and a log2 fold change above 0.2. To account for the distinct spatial location of cells, we further classified the cells according to their location in the developing ovaries (outer cortex, inner cortex, medulla) as observed by Visium and smFISH. We filtered cell–cell interactions to exclude cell pairs that do not share the same location.
Furthermore, we added a new module to the database called CellSign that links receptors in CellphoneDB to their known downstream TF. To build CellSign, we have manually mined the literature to identify TFs with high specificity for an upstream receptor and recorded the relevant pubmed reference number (Supplementary Table 11c). We used this database to link our CellPhoneDB results to the relevant downstream TFs, which were derived from our TF analysis.
To prioritize the TF relevant for a cell state in a human lineage, we integrated three measurements: (1) expression levels of the TF and (2) the activity status of the TF measured from (2a) the expression levels of their targets (described below in TF activities derived from scRNA-seq) and/or (2b) the chromatin accessibility of their binding motifs (described below in TF motif activity analysis from scATAC-seq). Plots in main figures include TFs meeting the following criteria: (1) TF was differentially expressed, with log2 fold change greater than 0.5 and adjusted P < 0.01 and (2) TF was differentially active, with log2 fold change greater than 0.75 and adjusted P < 0.01 in at least one of the TF activity measurements (2a/2b). For mouse and macaque, we performed differential expression analysis only and compared the results to the orthologous TF in humans.
TF differential expression
We computed differential expression using the one-sided Wilcoxon Rank Sum test implemented in the FindAllMarkers function with Seurat v.3.2.2, in a one-versus-all fashion.
TF activities derived from scRNA-seq
We estimated protein-level activity for human TFs as a proxy of the combined expression levels of their targets. Target genes were retrieved from Dorothea73, an orthogonal collection of TF targets compiled from a range of different sources. Next, we estimated TF activities for each cell using Viper74, a GSEA-like approach, as implemented in the Dorothea R package and tutorial75. Finally, to identify TF whose activity was upregulated in a specific cell type, we applied the Wilcoxon Rank Sum test from Seurat onto the z-transformed ‘cell × TF’ activity matrix in a one-versus-all fashion.
TF motif activity analysis from scATAC-seq
TF motif activities were computed using chromVar76 v.1.12.2 with positional weight matrices from JASPAR2018 (ref. 77), HOCOMOCOv10 (ref. 78), SwissRegulon79, HOMER80. chromVar returns a matrix with binding activity estimates of each TF in each cell, which we used to test for differential TF binding activity between cell types in a one-versus-all fashion with Wilcoxon Rank Sum test (FindAllMarkers function in Seurat).
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.