Inclusion and ethics
Table of Contents
Genetic studies of ancient human diseases shed light on how past populations thrived and dealt with health problems, which may trigger concerns such as stigmatization due to diseases or rights and legal issues among people living today. Historical injustices, colonization, and dispossession have often complicated indigenous communities’ ability to assert and maintain their territorial rights in a legal or administrative framework. It is therefore crucial to consider, besides the scientific aspects, also the perspectives of living (indigenous) communities and people when carrying out this work55.
Here we study human remains of fully anonymous individuals who died more than 1,000 years ago and were buried in the archaeological site Jabuticabeira II, in the municipality of Tubarao in Santa Catarina state, Brazil. This site was excavated by P. de Blasis and team56, funded by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). and a research permit was obtained from the Brazilian National Institute of Historical and Artistic Heritage (IPHAN), according to the correspondence 1793/2019 GAB PRESI-IPHAN of the process 01506.000720/2019-65 by K. Santos Bogia. The use of the samples of the remains for this study has been also approved by P. de Blasis, custodian of the Jabuticabeira II collections at the Museum of Archeology and Ethnology of the University of Sao Paulo. The human remains have been curated, studied and sampled by S.E. and team at the University of Sao Paulo until 2016 and thereafter at the Natural History Museum of Vienna.
The territories and sites spanning across Rio Grande do Sul, Santa Catarina and Paraná are inherent to the ancestral heritage of the Kaingang, Guarani and Xokleng communities (also referred to as the ‘Sun people’ or ‘coastal group’), who are still living in the region today. These societies have not only utilized the region for mobilization and migration in search of food supply, but also traditionally traversed significant distances, leaving a trail of cultural imprints, particularly in the domain of funerary practices. In a previous study57, samples of five of the individuals exhumed at Jabuticabeira II were studied, revealing some genetic affinity with the Kaingang (a Ge-speaking group of Southern Brazil). However, to the best of our knowledge, the Kaingang are not seen as direct descendants of sambaqui societies, nor do they identify with the people who once dwelled at Jabuticabeira II or request their remains. Finally, research in the Instituto Socioambiental (https://www.socioambiental.org/; for the defence of Brazilian socio-environmental diversity, including Indigenous Rights) states that the region around Jabuticabeira II is not part of any Indigenous reserve, nor are there claims of groups for territorial rights of this region or for the archaeological remains of this site (P. de Blasis, personal communication).
Degenerative processes, often resulting from contexts of marginalization, conflict and displacement, bear witness to the impact of historical relationships of the indigenous groups with colonizers and invaders. The afflictions and diseases experienced by these groups carry historical and environmental ramifications of notable significance, warranting explicit acknowledgment and examination. Regarding possible stigmatization of local (indigenous) communities and persons affected with bejel, it must be stressed that this contagious disease is an endemic, mostly non-sexually transmitted disease, common in hot regions where people live in close contact to each other, have no need for especially covering clothing, and share utensils. Today bejel, which can lead to stigmatization owing to disfiguring wounds, occurs especially in east-mediterranean and west-African communities with limited access to modern medical care. Although the World Health Organization realizes the importance of actions taken to eradicate bejel worldwide since 1949 (WHA2.36 Bejel and other Treponematoses (https://www.who.int/publications/i/item/wha2.36)), the disease is not seen as a current public health issue in Brazil, as it is for some other countries58,59. This stands in contrast to the high prevalences of sexually transmitted diseases such as HIV and venereal syphilis, which affect the indigenous communities in Brazil (Em São Paulo, ação em aldeias promove debate e testagem rápida de HIV e Sífilis — Fundação Nacional dos Povos Indígenas (http://www.gov.br/funai/pt-br/assuntos/noticias/2019/em-sao-paulo-acao-em-aldeias-promove-debate-e-testagem-rapida-de-hiv-e-sifilis)). It is, however, notable that, in the archaeological context, nothing implied that those prehistoric people of Jabuticabeira II with the local treponematosis would have been discriminated against in their time and culture.
Furthermore, the culturally insensitive descriptions in palaeogenomic research articles are an ethical issue of concern60. To ensure discretion, we curated for potentially insensitive or discriminatory expressions within the manuscript. Importantly, we had the invaluable help of E. Krenak, Cultural Survival’s lead on Brazil, Indigenous activist and PhD candidate at the University of Vienna, to critically analyse our texts and provide advice in ethically correct and fair use of terminology.
The sambaquis of the Laguna region
A sambaqui is the prevalent type of archaeological site on the Brazilian coast: a human-built shell midden or shell mound of varying dimensions, located in rich resource areas such as lagoons, mangroves or estuaries. Sambaquis consist of inorganic sediment, mollusc shells, food debris and organic matter mixed in intricate stratigraphies associated with domestic and/or funerary functions61. More than 1,000 sambaquis are mapped along the 7,500 km long Brazilian coast and are dated to between 7,500 and 1,000 yr bp61,62. Recent archaeological research suggests that these shell mound-building populations were sedentary, with an abundant and stable marine-based subsistence, horticulture, high population growth61,63, elaborate funerary rituals64 and landscape appropriation56.
Jabuticabeira II excavation site
Jabuticabeira II (UTM 22 J − 0699479E; 6835488 S) is a medium-sized shell mound (400 × 250 × 10 m in height), settled on a palaeodune and located in the Laguna region, the highest density area of sambaquis from the southern Brazilian coast, 3 km from Laguna do Camacho, one of several water sources associated with a barrier-lagoon geological system formed during the Holocene (Fig. 1). Jabuticabeira II, built during a nearly 1,000-year period, is one of 65 sambaquis mapped around the lagoon system. This large number of settlements and their chronologically overlapping occupation history attest to a fairly dense occupation and intense interactions of the sambaqui builders between 7,500 and 900 calibrated years (cal yr) bp56. According to stratigraphic studies, Jabuticabeira II is the result of incremental funerary rituals accumulated over centuries. Although Jabuticabeira II was not completely excavated, 204 burials containing the remains of 282 individuals were exhumed from a 373 m2 area64. Radiocarbon dates of Jabuticabeira II stratigraphy56,64 suggest a long occupation period between 1214–830 cal bc and 118–413 cal ad or 3137–2794 to 1860–1524 cal yr bp (2σ), roughly in line with the radiocarbon datings from bone material of the four individuals in this study, ranging from 350 cal bc to 573 cal ad.
The human remains from the Jabuticabeira II sambaqui were found in single, double, and multiple burials, dispersed in clusters. The skeletons recovered were mostly incomplete, avoiding categorical estimations of age and sex or other osteological findings. The burial pattern was tightly flexed and suggested intentional treatment of the body prior to the internment. The small size of the graves suggested that the bodies suffered previous desiccation or decomposition of soft tissues, but not enough to produce complete disarticulation (hand and feet bones were found articulated). Many burials come from profiles and are incomplete. The bones of several individuals are stained with red ochre65, a common practice in archaeological sites of the Santa Catarina state66,67. Offerings are common in sambaqui burial contexts and include adornments made with faunal material and lithic tools in a wide range of forms, from debris to polished tools and zooliths, with differences in frequency of occurrence among different sites and strata68. The most common offering in Jabuticabeira II was fish.
Altogether, 99 Jabuticabeira II individuals, with and without bone alterations suggestive of infection, were screened for pathogen DNA content. 37 samples deemed positive for treponemal DNA in the initial screening and four samples yielded sufficient data for T. pallidum genome reconstruction (Supplementary Table 1).
Palaeopathological analysis of treponematoses
Bioarchaeological analyses showed results compatible with increasing population growth and high population density in Jabuticabeira II, including high frequencies of nonspecific stress markers69 and occasional infant stress70, but no evidence of trauma associated with interpersonal conflicts over resources or territory69.
There is, however, evidence of communicable systemic diseases in Jabuticabeira II and other local Brazilian sambaquis30. Eleven 14C accelerator mass spectrometry dates obtained directly from the presumably treponematosis-affected individuals suggest that these diseases are very old on the east coast of South America, with a time-range between 6,300 and 500 yr BP. Among the possible treponemal cases based on osteological analysis, three came from Jabuticabeira II. However, these did not overlap with the individuals yielding the detected genetic evidence in this study.
Information on individuals
Individual 41A-L2.05-E4, sample ZH1390
The individual is an adult male of robust build, with an estimated stature of 150.49 ± 2.6 cm (ref. 70). Although fragmented, the bones of this individual comprised an almost complete skeleton (80%), articulated and buried in an oval shell-rich matrix in a hyper-flexed position. The bones of the individual showed signs of systemic infectious disease in the lower limbs. Femurs, tibias, and fibulas all show discrete generalized periostitis and osteoarthrosis. A widening in the lateral portion of clavicles was also observed. According to Filippini et al.30, applying the SPIRAL method71, this individual’s disease could be classified non-conclusively as syphilis, yaws or bejel. The sampling was performed on an active lesion on the tibia fragment.
Individual FS9–L3–T2, sample ZH1540
The sample comes from an assemblage of commingled bones, of probably more than one individual. The bones assigned to this individual consist of several skeletal elements, some with pathological alterations, such as severe osteomyelitis in the distal third of the right humerus, severe periostitis in the left ulna, periostitis in a fibula diaphysis, and two vertebral bodies with osteophytosis. The sample was taken from the fibula fragment, in the area with periostitis.
Individual FS3B-L3-T4, sample ZH1541
The sample comes from one of three separate individuals, found commingled. The skeletal elements belonging to this robust adult of unknown age and sex include a left radius with arthritis, a fragment of the left ulna (very robust), a fragment of the left humerus, fragments of a femur, a tibia, and a fibula and a first metatarsal. The sample was taken from a femur fragment, under the immediate surface of the bone, to best avoid the possible introduction of external contaminants.
Individual 2B-L6-E3, ZH1557
The sample comes from a probably adult male individual. The individual was articulated and in a flexed position with another, adult female individual buried on top. Osteopathological findings on the bones of the sampled individual included signs of degenerative joint disease, severe lumbar intervertebral osteoarthritis, scoliosis, and possible injuries to the patellae. However, no typical lesions suggestive of treponemal infection were observed. The sample was taken from a small piece of long bone, under the immediate surface of the bone, to best avoid the possible introduction of external contaminants.
Marine reservoir effect correction for 14C dating
Radiocarbon dating was performed by the Laboratory of Ion Beam Physics at ETH Zurich (laboratory number: ETH-127328) using bone collagen purified by a modified ultrafiltration method72. Data calibration was done with OxCal v4.4.4. The diet of the Jabuticabeira II inhabitants, substantially consisting of marine food sources, produces a reservoir effect in the radiocarbon dates calculated as mean age of 247.8 (σ = 103.7) years73. Considering the high contribution of marine carbon to bone collagen of individuals in Jabuticabeira II, the radiocarbon dates were modelled with Calib Rev 8.2074 (http://calib.org/calib/calib.htm) using the Mixed Marine SHCal20 calibration curve75,76 and applying the estimated average local marine radiocarbon reservoir correction value (ΔR) of −126 ± 29 for the South coast of Brazil (Marine Reservoir Correction database)73,77. We considered the average relative contribution of marine carbon to collagen derived from Bayesian Mixing Models for Jabuticabeira II individuals, calculated at a mean value of 42.5%78,79. For the individual estimates for the samples, see Supplementary Table 2.
Samples were documented and carried through sampling, DNA extraction, library preparation and library indexing in facilities dedicated to ancient DNA work at the University of Zurich, including decontamination of samples, laboratory equipment and reagents with UV irradiation and using protective clothing and minimum contamination-risk working methods.
All post-amplification steps were performed in the regular laboratory facility available for the Paleogenetics Group at the Institute of Evolutionary Medicine (IEM), University of Zurich (UZH). DNA sequencing was performed at the Next Generation Sequencing facility of the Vienna BioCenter Core Facilities (VBCF) or at the Functional Genomics Center at the University of Zurich (FGCZ).
Ancient DNA extraction
All sample surfaces were irradiated with ultraviolet light to minimize potential contamination from modern DNA. The bone powder was obtained using a dental drill and diamond head drill bits. DNA extraction was performed on around 50–100 mg of bone powder, according to a well-established extraction protocol for ancient DNA80. Negative controls for extraction and library processes were processed in parallel through all experiments, one control per ten samples, sequenced and bioinformatically compared to their corresponding sample batches, as precaution against possible contamination.
Double stranded DNA libraries were produced for initial screening with shotgun sequencing, without UDG treatment (that is, chemical treatment aiming to limit age-related damage in the DNA). Two additional libraries for each of the potentially positive samples from the first round of capture were produced to maximize the DNA complexity. For the preparation of DNA libraries, 20 µl of DNA extract was converted into double stranded DNA libraries31. Sample-specific barcodes (indexes) were added to both ends of the DNA fragments in the libraries81. The indexed libraries were then amplified to reach a minimum DNA concentration of approximately 90 ng ml−1. The amplification was performed using 1× Herculase II buffer, 0.4 mM IS5 and 0.4 mM IS6 primer81, Herculase II Fusion DNA Polymerase (Agilent Technologies), 0.25 mM dNTPs (100 mM; 25 mM each dNTP) and 5 ml indexed library as DNA template. Four reactions per library were prepared and the total amplification reaction volume was 100 ml. The thermal profile included an initial denaturation for 2 min at 95 °C and 3–18 cycles, depending on DNA concentration after indexing of the libraries, denaturation for 30 s at 95 °C, 30 s annealing at 60 °C and a 30 s elongation at 72 °C, followed by a final elongation step for 5 min at 72 °C. All splits of one indexed library were pooled and purified using the QIAGEN MinElute PCR purification kit. DNA libraries were then quantified with D1000 ScreenTape on an Agilent 2200 TapeStation (Agilent Technologies) and combined in equimolar pools for sequencing.
Shotgun data were used for an initial screening of the 99 candidate samples, with Kraken2 software82, and 41 samples that had more than 7 hits to T. pallidum were selected for target enrichment. The samples selected were subjected to a target enrichment process and subsequently processed by FastQ Screen v0.15.183 to check the number of mapped reads against three representative high-quality reference genomes of T. pallidum subspecies (CDC2, BosniaA and Nichols). The nine most promising samples (>5,000 Kraken hits to T. pallidum after first round of in-solution capture), were turned into two extra libraries and re-captured as explained in detail in the following sections.
Target enrichment for T. pallidum DNA
Genome-wide enrichment of double stranded libraries was performed through custom target enrichment kits (Arbor Bioscience). RNA baits with a length of 60 nucleotides and a 4 bp tiling density were designed based on three reference genomes: Nichols (CP004010.2), SS14 (CP000805.1), Fribourg-Blanc (CP003902). 500 ng library pools were enriched according to the manufacturer’s instructions. Captured libraries were amplified in 100 µl reactions containing 1 unit Herculase II Fusion DNA polymerase (Agilent), 1× Herculase II reaction buffer, 0.25 mM dNTPs, 0.4 mM primers IS5 and IS681 and 15 µl library template, with the following thermal profile: initial denaturation at 95 °C for 2 min, 14 cycles of denaturation at 95 °C for 30 s, annealing at 60 °C for 30 s, and elongation at 72 °C for 30 s, followed by a final elongation at 72 °C for 5 min. Captured libraries were purified with MinElute spin columns (QIAGEN) and quantified with a D1000 High Sensitivity ScreenTape on an Agilent 2200 TapeStation.
For both shotgun data retrieval and after the capture processing, the samples were pooled in unimolar quantity (for SG sequencing up to 50 samples per pool, and for the capture process 2–8 samples per pool), and sequenced on an Illumina NextSeq500 with 2 × 75 + 8 + 8 cycles using the manufacturer’s protocols for multiplex sequencing at the Functional Genomics Center in Zurich or at the Vienna BioCenter Core Facilities.
We assembled a genomic dataset comprising 98 publicly available T. pallidum genomes (8 TEN, 30 TPE and 60 TPA) from previously published studies (including 8 ancient genomes), and the newly generated ZH1540 genome. The genomes represent the genetic variation of the three known subspecies of T. pallidum (TPA, TPE and TEN) available by December 2022, and were selected with a focus on TEN and TPE, because of their proximity to the new ancient genome classified as TEN.
Published data for the modern genome dataset in this study are available at the European Nucleotide Archive (ENA) database: PRJNA313497 (accession numbers: SRR3268682, SRR3268724, SRR3268715, SRR3268694, SRR3268696, SRR3268709, SRR3268710), PRJEB11481 (accession numbers: ERR1470343, ERR3596780, ERR3596747, ERR3596783), PRJEB28546 (accession numbers: ERR4045394, ERR3684452, ERR3684456, ERR3684465, SRR13721290, ERR4853530, ERR4993349, ERR4853587, ERR4899206, ERR5207017, ERR5207018, ERR5207019, ERR4899215, ERR4853623, ERR4853625), PRJNA508872 (accession numbers: SRR8501165, SRR8501164, SRR8501167, SRR8501166, SRR8501168, SRR8501171), PRJNA723099 (accession numbers: SRR14277267, SRR14277266, SRR14277458, SRR14277444), PRJEB11481 (accession number: ERR1470331), PRJDB9408 (accession numbers: DRR213712, DRR213718), PRJNA588802 (accession numbers: SRR10430858, SRS5636328), PRJNA322283 (accession number: SRR3584843), PRJNA754263 (accession numbers: SRR15440297, SRR15440150, SRR15440451, SRR15440240), PRJEB40752 (accession numbers: ERR4690809, ERR4690806, ERR4690810, ERR4690812, ERR4690811). Assembly files were used for 9 genomes from National Center for Biotechnology Information (NCBI) database: CP002375.1, CP002376.1, NC_016842.1, NC_017268.1, NC_018722.1, NC_021490.2, NC_021508.1, GCA_000813285.1, CP035193.1 and for 24 modern genomes from the European Nucleotide Archive (ENA): CP021113.1, CP073572.1, CP073557.1, CP073553.1, CP073536.1, CP073526.1, CP073490.1, CP073487.1, CP073470.1, CP073447.1, CP073446.1, CP073399.1, CP040555.1, LT986433.1, LT986434.1, CP032303.1, CP020366.1, CP024088.1, CP024089.1, CP078121.1, CP078090.1, CP081507.1, CP051889.1 and CP003902 .1. Raw sequence data (fastq files) used for 6 modern genomes is available at the NCBI database: PRJEB20795 (accession numbers: ERS1724928, ERS1724930, ERS1884567) and PRJNA343706 (accession numbers: SRR4308604, SRR4308606, SRR4308597). Previously published ancient treponemal genomes here used are available at the ENA: PRJEB37490 (accession number: ERR4065503), PRJEB37633 (accession number: ERR4000645), PRJEB35855, PRJEB21276 (accession numbers: ERS2470995, ERS2470994) and PRJEB62102. Detailed source information for the reference dataset is documented in Supplementary Table 3.
We selected all eight publicly available TEN genomes, all of which have more than 99.4% genome coverage, with the exception of C7717 (81.4%). We selected 30 TPE genomes (Supplementary Table 3). To represent each lineage or sublineage, we selected at least one genome, preferring the ones with the highest sequencing depth and genome coverage. All included TPE genomes have more than 95.3% genome coverage, except the four ancient TPE genomes: SJN003, AGU007, 133 and CHS119, displaying 97.4%, 92.7%, 57% and 62% genome coverage, respectively. Furthermore, 60 TPA genomes from the major lineages and sublineages described in previous studies were included (Supplementary Table 3). All of these genomes had more than 90% coverage, except the four ancient genomes, PD28, W86, SJ219 and 94B, all of which have genome coverage of 30% or more. All genomes in the dataset are separated from each other by at least 5 SNPs. The TPA strain Seattle-81 was excluded from the final dataset owing to mutations probably accumulated during extensive passaging in rabbits that can cause ambiguous placement in phylogenies4,16,36.
The raw data and/or assembly files for each genome in our dataset were downloaded from the public databases: European Nucleotide Archive (ENA)84 and National Center for Biotechnology Information (NCBI)85. Accession numbers are given in Supplementary Table 3.
Read processing and multiple reference-based genome alignment generation
To reconstruct the individual genomes from the raw data, we carried out raw read quality control and preprocessing, removing duplicates, variant calling and filtering using the default parameters when not otherwise specified. After processing the de-multiplexed sequencing reads, sample sequencing quality was analysed with FastQC version 0.11.983, filtering reads with a QC value < 25. Following processing by cutadapt version 4.186 to remove the sequencing adapters, in order to reduce the reference bias, and improve the posterior phylogenetic inference and assignment87, the genome reference selection for mapping each sample was determined according to the results from the original manuscript where the genomes were published (see Supplementary Table 3). The mapping was carried out by BWA mem88 (using parameters: -k 19, -r 2.5). Four reference genomes were used; the well-studied TEN and TPE genomes BosniaA (NZ_CP007548.1) and CDC2 (NC_016848.1), as well as the Nichols (NC_021490.2) and SS14 (NC_010741.1) genomes, representing the two main lineages within TPA. However, for the new ancient samples obtained here, genomes for each sample were reconstructed by mapping to three high-quality reference genomes, representing the three T. pallidum subspecies (CDC2, BosniaA and Nichols).
CleanSam, from Picard Toolkit version 2.18.29 (http://broadinstitute.github.io/picard), was used to clean the provided SAM or BAM files. Duplicate reads were removed using MarkDuplicates, from Picard toolkit version 2.18.29. AddOrReplaceReadGroups, from Picard Toolkit version 2.18.29, was used to assign all the reads in a file to a single new read-group before using mapDamage version 2.2.0-86-g81d0aca89 to estimate the DNA damage parameters and rescale quality scores of probably damaged positions in the reads (using parameter: –rescale).
After generating a text pileup output for the BAM files with the mpileup tool from Samtools version 1.790, SNPs were called using VarScan version 2.4.391 (using parameters: -p-value 0.01, -min-reads2 1, -min-coverage 1, -min-freq-for-hom, 0.4 -min-var-freq 0.05, -output-vcf 1). Next, a SNP filtering was also carried out with VarScan (using for the modern samples parameters: -p-value 0.01, -min-reads2, 5 -min-coverage 10, -min-avg-qual 30 -min-freq-for-hom 0.4, -min-var-freq 0.9, -output-vcf 1; and modifying some parameters for the ancient samples because of their lower read coverage and quality: -p-value 0.01 -min-reads2 3, -min-coverage 5, -min-avg-qual 30, -min-freq-for-hom 0.4, -min-var-freq 0.9 -output-vcf 1). Additionally, all positions with less than 3 mapped reads were masked with Genomecov from Bedtools version 2.26.092 for modern and ancient samples. All steps of genome generation were visualized and manually confirmed with Tablet version 1.21.02.0893, checking each SNP one by one and discarding the possible spurious SNPs from the new ancient genome ZH1540. The resulting final sequences were obtained by maskfasta from Bedtools v2.26.0.
Additionally, we used tested sequencing and posterior analysis methodologies17,42 to obtain higher coverage and more reliable modern T. pallidum genomes. Where possible, assembly files were obtained rather than raw data (Supplementary Table 3). A multiple reference-based genome alignment for all sequences was generated in MAFFT v7.46794 (using parameters: –adjustdirection –auto –fastaout –reorder). However, due to the use of different genomic references, regions with low coverage for some genomes, corresponding mostly to tpr and arp genes, were reviewed and manually aligned with Aliview version 1.2595.
The samples ZH1390, ZH1541, and ZH1557 had sufficient data to attempt a genome reconstruction and were determined to have the most SNPs in common with the TEN reference but they were excluded from downstream analyses due to the limited coverage acquired for each of them, which made the obtained SNPs less reliable. The sample ZH1540, however, yielded a remarkable 33.6× genomic coverage and was selected for subsequent in-depth analyses.
Proteinortho version 6.0b96 (using parameters: -p=blastn -singles -keep) was used to conduct an orthology study in order to find orthologous genes in the four reference genomes used96. Each gene present in at least one of the four reference genomes had its genomic coordinates determined based on its location in the final merged alignment (see Supplementary Table 3).
To verify the accuracy of the final multiple genome alignment, and that no protein-coding gene was inadvertently truncated, the protein translations for every gene present in at least one reference genome were compared to the original gff3 files of each of the four references (Supplementary Table 3). The reconstructed ZH1540 genome and its main features were represented graphically using BRIG version 0.95-dev.000397.
Recombination analysis using PIM
As previously noted36, the presence of recombination in the genomes of T. pallidum may interfere with the topologies of phylogenetic trees inferred. In order to look into potential gene recombination, we used the PIM pipeline36 to detect recombination gene by gene. In brief, the process involved the following steps:
Using IQ-TREE version 1.6.10, a maximum-likelihood tree was created for the multiple genome alignment98. All maximum-likelihood trees for the remaining steps were obtained using GTR99 + G100 + I101 as an evolutionary model and 1,000 bootstraps replications.
The 1,161 genes found in at least one of the reference genomes were extracted, and the number of SNPs for each gene was calculated. Genes with less than three SNPs were excluded.
The phylogenetic signal in each gene alignment for each of the remaining genes was evaluated by likelihood mapping102 in IQ-TREE (using parameters: -lmap 10000 -n 0), retaining only those genes that showed a phylogenetic signal.
A maximum-likelihood tree was generated for each of the remaining genes using IQ-TREE.
For each included gene, we tested the phylogenetic congruence between trees using IQ-TREE (using parameters: -m GTR + G8 -zb 10000 -zw), comparing the maximum-likelihood tree obtained from the gene alignment and the maximum-likelihood tree obtained from the whole-genome alignment using two different methods: Shimodaira–Hasegawa103 and expected likelihood weights (ELW)104. Genes for which at least one test rejected the reference tree topology with the gene alignment adopting a conservative approach (P < 0.2, weight value close to 0, for Shimodaira–Hasegawa and ELW tests, respectively) and the complete genome alignment rejected the topology of the tree built using the gene alignment (reciprocal incongruence, P < 0.2 and weight value close to 0) in at least one of them were selected and examined more closely in the next step.
Using MEGAX105, the selected genes that displayed reciprocal incongruence were subsequently examined to assess and describe potential recombination events. A gene has to have at least three nearby homoplastic SNPs—SNPs that are shared by several groups (TPE, TEN, TPA-Nichols or TPA-SS14) and produce a polyphyletic distribution—in order to be labelled as recombinant. The homoplastic SNPs found in the gene alignment served as the boundaries of the recombinant areas.
Using a parsimony criterion on the distribution of alternative states of the homoplastic SNPs, the potential donor and recipient clades or strains of each recombination event were inferred.
DNA sections, a number of genes have a high percentage of sites with missing data. The majority of these genes are members of the tpr and arp families, which include collections of paralogous genes. In order to continue analysing these intriguing genes with the PIM pipeline, strains that had a high percentage of missing data in each of these genes were eliminated. Following previous findings35,36, the hypervariable gene tprK (tp0897), with seven hypervariable regions that undergo intrastrain gene conversion17,37,106,107,108,109, and the tp0316 and tp0317 genes, also under gene conversion, were completely excluded from the recombination analysis.
PIM procedure for likelihood mapping and topology tests
A likelihood mapping test was performed using IQ-TREE to determine which genes (Supplementary Table 4) showed a phylogenetic signal (out of the 382 genes for which >3 SNPs were found in pairwise comparison with at least one reference genome). For each quartet (subset of four sequences) in the data, the test creates unrooted phylogenetic trees. The quartet likelihoods are then plotted within a triangle, where the position denotes the ‘tree-likeness’ of the quartet in question. Corner quartets are completely resolved, quartets on the sides are partially resolved, and quartets in the centre are unresolved. Of the 382 genes, 29 had too many missing values to be tested using the likelihood mapping method. In order to include these genes in the next steps of the PIM pipeline and topology comparisons, the problematic sequences with more than 50% of positions with missing data were removed.
Following the likelihood mapping test, 9 genes falling within the central zone of the triangle were discarded (Supplementary Table 4). Then, using the Shimodaira–Hasegawa and ELW topology tests, we compared the gene trees of the remaining genes to the preliminary reference tree of the whole-genome alignment to assess their phylogenetic congruence (Supplementary Table 4). Of the 373 genes that tested positive for phylogenetic incongruence, 27 contained at least three consecutive SNPs, supporting a recombination event. To these we added tp0859, which was detected as recombinant in a previous study35, resulting in a total of 27 recombinant genes.
Recombination analysis using Gubbins and ClonalFrameML
Gubbins version 2.3.1110 and ClonalFrameML version 1.11-1111 are frequently used tools for the genome-wide identification of recombinant positions in bacterial genomes. To test the robustness of our recombination analysis using PIM, we also ran these two programs, with default parameters and the same whole-genome alignment used with PIM. Gubbins identified 301 distinct recombination events associated with 103 genes, ranging in size from 5 bp to 13,866 bp. Similarly, ClonalFrameML detected 656 events, with 32 of them being 1 or 2 bp long, and the longest event spanning 782 bp. Notably, all the genes identified by PIM as having a recombinant region were also detected by both ClonalFrameML and Gubbins, except for gene tp0558, which was missed by ClonalFrameML but detected by Gubbins. Additionally, genes tp0164 and tp0179 were detected by ClonalFrameML but missed by Gubbins.
A maximum-likelihood tree based on the alignment including all genes was constructed with IQ-TREE, using GTR + G + I as the evolutionary model and 1,000 bootstrap replications (Extended Data Fig. 2a). Next, genes identified as recombinant by PIM were removed from the multiple genome alignment. Three additional genes (tp0897, tp0316, and tp0317), which contain repetitive regions and have been identified as hypervariable and/or under gene conversion in the past, were also removed to prevent the introduction of a potential bias. Because the tp0317 gene is nested inside the tp0316 gene and the coordinates from the BosniaA reference genome for tp0316 covered a larger area than those of the other reference genomes, tp0316 and tp0317 were removed according to the tp0316 coordinates from the BosniaA reference genome. A reference phylogenetic tree was then constructed employing the new vertical-inheritance genome alignment, also with IQ-TREE using GTR + G + I as the evolutionary model and 1,000 bootstrap replications (Extended Data Fig. 2b). Both trees obtained were compared and are shown in Extended Data Fig. 2.
The SS14 lineage was previously described as a largely epidemic, macrolide-resistant cluster that emerged after, and was possibly prompted by, the clinical use of antibiotics following its discovery12,16. Based on our phylogenetic analysis results and expanding on earlier phylogenetic classifications and nomenclature of the SS14 lineage12,16, we defined the clade that contains almost all SS14 genomes from clinical and contemporary samples as the SS14-Ω sublineage. However, two contemporary clinical samples (MD18Be and MD06B), were not classified as SS14-Ω sublineage, because these samples cluster together with the MexicoA genome, in line with previously published results42.
To compare the PIM-based analysis with other widely used recombination detection methods, Gubbins and ClonalFrameML, we followed a similar procedure of removing the recombinant positions detected by these tools and inferred maximum-likelihood trees with the retained positions in the corresponding multiple genome alignments. All the phylogenetic trees with recombination events removed exhibit general congruence with each other, whether the events were identified by PIM, Gubbins or ClonalFrameML. Furthermore, the placement of the ZH1540 genome remained consistent in the phylogenetic trees, regardless of the recombination detection method employed, and despite the elimination of recombinant genes to generate the vertically inherited alignment.
Exploratory characterization of the 16S-23S genes
T. pallidum contains two rRNA (rrn) operons, each of which encodes the 16S-23S-5S rRNA genes and intergenic spacer regions (ISRs). There is evidence that the random distribution of rrn spacer patterns in T. pallidum may be generated by reciprocal translocation of rrn operons mediated by a recBCD-like system found in the intergenic spacer regions (ISRs)112. In concordance with previous studies112,113,114,115, we found that the 16S–23 S ISRs of the TPA strains contain the tRNA-Ile (tRNA-Ile-1; tp0012) and tRNA-Ala (tRNA-Ala-3; tp00t15) genes within the rrn1 and rrn2 operons, respectively. By contrast, the TPE genomes show an Ala/Ile spacer pattern, where the tp0012 and tp00t15 orthologues are located within the rrn2 and rrn1 operons, respectively.
We identified 68 SNPs in genes r0001, r0002, r0004 and r0005, encoding the 16S-23S rRNA genes of the new ancient genome ZH1540, placing them among the most variable genes in our alignment and raising the potential that including them in the alignment could result in a biased phylogenetic reconstruction. Although the SNPs found appear to be well supported by the reads obtained from the sequence mapping (Supplementary Table 3), their origin from possible contamination cannot be completely ruled out and further analyses would be necessary to confirm them.
Excluding these genes from the alignment, in addition to the recombinant genes and tp0316, tp0317 and tp0897, did not result in any changes to the topology (Extended Data Figs. 2b and 3), although branch lengths were altered. As these genes are known to have conserved regions in addition to variable regions used to explore the evolutionary relationships among pathogenic bacteria116,117,118, we decided to retain them in the alignment for all subsequent analyses. Finally, we note that the ZH1540 genome did not possess either of the two T. pallidum 23 S ribosomal RNA gene mutations known to confer macrolide resistance (A2058G and A2069G). In contrast, four modern TEN strains from Japan possess the A2048G mutation, suggesting recent selection pressure for antibiotic resistance mutations.
Molecular clock dating
We used the Bayesian phylogenetics package BEAST2 v2.6.7119 to estimate a time-calibrated phylogeny of the context dataset of 98 T. pallidum genomes along with our new ancient genome, ZH1540. We removed hypervariable and recombining genes from the alignment, as described above, reduced it to variable sites and used an ascertainment bias correction to account for constant sites.
Root-to-tip regression analyses (Extended Data Fig. 4) show that while there is a positive correlation between sampling year and root-to-tip divergence among all modern clinical strains, indicating clock-like evolution, the correlation is very weak when also including passaged strains and negative when including ancient strains. Within the TPE, TEN and SS14 clades there exists a positive correlation among all modern clinical and passaged strains. On the other hand, the correlation is negative for Nichols strains, even when looking only at clinical strains. In order to account for rate variation and the long terminal branches on some strains (likely due to a multitude of effects, including sequencing errors, contamination and mutations introduced during rabbit passaging) we used a UCLD and a UCED clock model for the molecular clock dating analysis120. For both models we placed a narrow lognormal prior with a mean (in real space) of 1 × 10−7 substitutions per site per year and standard deviation 0.25 on the mean clock rate. This strong prior was used to compensate for the poor temporal signal among T. pallidum genomes and was calibrated on previous estimates of the substitution rate4,35. We further used a GTR + G + I substitution model118 and a Bayesian skyline plot121 demographic model (tree-prior) with 10 groups. For all genomes where the sampling dates are not known exactly, we used uniform priors across the date ranges reported in the original studies to account for the uncertainty4,5,6,16,122. For ZH1540 we set the date range to 364–573 ad, in accordance with the marine reservoir effect corrected radiocarbon dating results above. Default priors were used for all other model parameters. The same analysis was repeated without ZH1540 in order to assess the effect of our new ancient genome on the divergence dates. We further repeated the analysis using a wide lognormal prior with a mean (in real space) of 1 × 10−7 substitutions per site per year and standard deviation 1 on the mean clock rate and using both constant-size and exponential growth coalescent models to assess the impacts of the mean clock rate prior and demographic models on divergence time estimates.
For each analysis we ran four Markov chain Monte Carlo (MCMC) chains of 5 × 108 steps each, sampling parameters and trees every 10,000 steps. After assessing convergence in Tracer v1.7123 and confirming that all four chains converged to the same posterior distribution, we combined the chains after discarding the first 10% of samples as burn-in. In the resulting combined chains all parameters have effective sample size (ESS) values > 150. TreeAnnotator v2.6.7 was used to compute MCC trees and the results were visualized using ggplot2124, ggtree125 and custom scripts. The 95% HPD of the coefficient of variation estimated under the UCLD model excluded 0 (median = 1.46, 95% HPD 1.08–1.9), indicating that a strict clock model is not appropriate for our dataset. Robustness analyses show that under a narrow mean clock rate prior both the UCED and UCLD clock models result in similar divergence time estimates (Extended Data Fig. 5a–f), with the UCED model estimates tending to be more recent and the UCLD model estimates usually having longer tails. Under a wide mean clock rate prior, estimates with the UCED are broadly similar, albeit wider, while the UCLD model estimates very wide posterior distributions for divergence times, indicating little information under this model. Divergence time estimates were not sensitive to the demographic model used. The MCC trees under the UCED model with a narrow prior, both with and without ZH1540 included in the analysis are shown in Extended Data Figs. 6 and 7, respectively.
Finally, we performed a Bayesian date randomization test126,127,128 (DRT) to further assess the strength of the temporal signal in our dataset, by permuting sampling dates among genomes and performing 50 replicate analyses. For the analyses, the full dataset, a UCED clock model with a narrow prior and the Bayesian skyline plot demographic model were used, while fixing the sampling dates of ancient strains to the means of the radiocarbon date ranges for simplicity. MCMC chains were run for 1 × 108 steps, sampling parameters every 10,000 steps. Convergence was assessed using the coda129 package to ensure that all parameters in all chains have ESS values > 150. The DRT results show that the 95% HPD intervals of the mean clock rate on replicates with permuted sampling dates are much smaller than expected if all information came from the mean clock rate prior (Extended Data Fig. 5g). In general, the HPD intervals do not overlap with the 95% HPD interval of the mean clock rate estimated with the true sampling dates.
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.