No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.
Overall synteny and BLASTn analyses
To define orthology in the OT, VT and OTR-VTRs in all vertebrates, we used interspecies synteny analyses at three scales: a manual 10-gene window microsyteny analyses using BLAT and BLAST38,39 searches and cross-species genome alignments; a more automated 100-gene macrosynteny window using SynFind and GeVo22; and automated chromosomal-scale alignments with syntenic dot plots using SynMap240. To define paralogy and further trace the evolutionary history of the genes, we used intraspecies synteny analysis, searching for paralogous genes in 10-Mb windows. Microsynteny was useful for determining orthologous and paralogous relationships between genes in the majority of the vertebrate lineages. Macrosynteny was useful for determining orthologous and paralogous relationships in cases in which the microsynteny was weaker, such as between genes found in lampreys or hagfish with the rest of the vertebrates. Sequence identity was determined using BLASTn to understand relationships between the per cent identity and synteny (Supplementary Table 12). Only results with a bit score >40 and hits with high probability E value <10−4 were kept. We describe the specific methods for each synteny approach in ‘Microsynteny between species in approximately 10-gene windows’, ‘Macrosynteny between species in approximately 100-gene windows’ and ‘Chromosome-scale macrosynteny between species’.
Microsynteny between species in approximately 10-gene windows
We ran microsynteny analysis by manually scanning annotated alignments for 5 protein-coding genes before and after each focus gene (Supplementary Table 4a–e) in 35 species spanning all major vertebrate lineages (Supplementary Table 2). The candidate genes in each species (accession number or gene identifier in Supplementary Table 4a–e) were first selected by BLAT and BLAST searches using the UCSC genome browser and alignment (http://genome.ucsc.edu/)38 and the SynFind tool from the CoGe comparative genomics research platform22. The NCBI and Ensembl41 (v.95) database genome alignments were used to identify the neighbouring genes. For the neighbouring genes, we kept in our Supplementary Tables the family gene names used in the genome annotation of each species, even though in some cases—we believe erroneously—different family names have been given to the orthologous gene in different species (for example, FABP1A in spotted gar and FABP1B.1 in stickleback; Supplementary Table 4a). For the species that had more lineage-specific duplications, we labelled the gene that shared more synteny with the orthologue in other vertebrate lineages with ‘a’ (for example, OTRa), and labelled the copy with ‘b’ (for example, OTRb). We listed the aliases in NCBI and Ensembl for each focus gene in each organism (‘Aliases’ column in Supplementary Table 4a–e) and included the most frequent ones in Table 1. When our target genes appeared to be lost in a species (no initial BLAST hit), we searched the surrounding gene territory to determine whether only the gene of interest or a larger block of genes were deleted, or whether the deletion was due to an incomplete genome assembly or assembly artefact.
For some species with more-fragmented genome assemblies or annotations or greater divergences in NCBI and Ensembl, we analysed other higher-quality assemblies and annotations. This included the VGP zebra finch, Anna’s hummingbird, pale spear-nosed bat and platypus genome assemblies3. For the Japanese lamprey, we included previously published synteny data10. For the sea lamprey, we used the assembly of the germline genome14 and analysed it with BLAST, Genome Browser and Gene Search tools (https://genomes.stowers.org/organism/Petromyzon/marinus). For amphioxus, we used the BLAST and Gene Browser tools available at https://genome.jgi.doe.gov/Brafl1/Brafl1.home.html with the latest version of the amphioxus genome (Branchiostoma floridae v.2.0), whereas previously reported data13 are based on the first version of the genome (B. floridae v.1.0). For the inshore hagfish genome assembly, the contigs were relatively short and not fully annotated, and thus we first BLAT-searched all the OT, VT and OTR-VTR sequences of all the aforementioned species against the hagfish genome in Ensembl, found two putative OTR-VTRs in two separate contigs in the hagfish assembly, and then used the ‘Region comparison’ tool of Ensembl to map each gene of these contigs against the human, zebrafish and lamprey genomes (Supplementary Table 4f, g). BLAST gave many gene hits in the hagfish genome, but only with short segments aligning to OT and VT orthologues in other species. Thus, to determine whether they were real OT or VT orthologues, we used the ‘Gene Tree’ tool of Ensembl that constructs a phylogeny using the entire protein sequence, with the sea lamprey VT as reference. For the receptors, we used our data from the SynMap2 dot plots (described in ‘Chromosome-scale macrosynteny between species’) and included in the synteny of the hagfish receptors the gene hits that appear on the chromosomes in which the OTR-VTRs are located in human, chicken, zebrafish and sea lamprey (Supplementary Table 4f, g).
Macrosynteny between species in approximately 100-gene windows
We generated gene sequence alignments between pairs of species using SynFind22 (density, LastZ defaults). This results in a matrix containing syntenic gene-hit values in the reference species relative to query species along with their chromosomal locations. This data matrix was parsed and visualized using a custom R script (https://github.com/ggedman/OT_VT_synteny). First, a 100-gene window centred around a given receptor gene in the reference organism (x axis) was defined using biomaRt (v.3.10). As we move 5′ (left) or 3′ (right) from zero (the focus gene) and if synteny exists, the number of gene hits for a given receptor in the query species shows a cumulative increase. This allowed us to identify large stretches of homologous sequences interspersed by divergent sequences.
Chromosome-scale macrosynteny between species
We used SynMap240 to generate syntenic dot plots of chromosome sequence alignments between species that contain OTR-VTRs (Supplementary Tables 15–30). SynMap2 identifies collinear sets of genes or regions of sequence similarity to infer synteny between two sequences, and generates a dot plot of the results. We used the default parameters (as of December 2018), except for ‘Minimum number of aligned pairs’. This parameter defines the minimum number of homologous genes (based on last default parameters) that should be found in a 20-gene distance for these genes to be considered syntenic and to appear on the dot plot. In more closely related lineages, we selected three as a minimum number (for example, between sea lamprey on the one hand, and Japanese medaka, or zebrafish, frog and chicken genomes on the other); for more distantly related species, we used two (for example, between hagfish on the one hand, and sea lamprey, or Japanese medaka, zebrafish, frog, chicken and human genomes on the other). Additionally, because the hagfish contigs were shorter than most other assemblies (making synteny more difficult to identify), we also ran a dot plot with 1 as the minimum number to search for all possible homologous hits, regardless of synteny.
To test for significant differences, we ran a χ2 test with distinct samples of genes on the difference of the proportions between the first two chromosomes with the highest number of gene hits, using the number of genes in the super-scaffold of the reference species (for example, sea lamprey) as sample size: Supplementary Table 30 provides confidence intervals, degrees of freedom and P values. For cases that reached significance, to confirm that the number of hits between two species was independent of the number of protein-coding genes located on the chromosome of the query organism, we applied a gene density-normalization test to rule out the possibility that the chromosomes with most gene hits were owing to them containing the most genes: we did not find such correlations with our macrosynteny analyses.
Macrosynteny within species in approximately 10-Mb windows
We primarily used the human genome, as it is the best assembled genome and therefore subject to generating fewer errors. We listed all genes found in a 10-Mb window from the present OTR-VTRs (for example, OTR, VTR1A, VTR1B and VTR2C in mammals) as well as absent ones (for example, VTR2A and VTR2B, which are absent in mammals). We chose a 10-Mb window because this genomic region size often captured macrosynteny of >40 genes, allowing within- and between-species macrosynteny analyses described in ‘Macrosynteny between species in approximately 100-gene windows’ to be comparable. We then searched each gene in the HUGO Gene Nomenclature Committee Database (https://www.genenames.org/) to classify its gene-family. For the deleted genes, we defined their territories by manually identifying in the human genome the genes around spotted gar VTR2B and chicken VTR2A; some of these syntenies around the deleted OTR-VTRs had previously been identified8,9, which we confirmed.
Evolutionary history analyses of OT and VT
We noted annotated DNA transposable elements in the UCSC Genome Browser in close vicinity of the OT and VT genes (except for the elephant shark genome, which was not annotated for DNA transposable elements), and thus we quantitatively searched for adjacent transposable elements in the human and chimpanzee genomes using RepeatMasker (http://genome.ucsc.edu/)38 and obtained information for each specific transposable element using Dfam 2.012. We calculated GC content using ENDMEMO (http://www.endmemo.com/bio/gc.php/). We aligned the introns of human OT and VT in all possible combinations using DIALIGN42 and compared intron lengths using Serial Cloner v.2.6 (http://serialbasics.free.fr/Serial_Cloner.html). For relative OT and VT orientations, we examined whether they were in the same direction (tail-to-head) or in opposite directions (tail-to-tail) in the annotated positions in each species. In the cases in which OT and VT were found in opposite directions, we determined which gene was inverted according to the orientation of other genes in the territory. In addition to the genomes used for all other analyses of this study (Supplementary Table 2), we also used the koala (Phascolarctos cinereus) (phaCin_unsw_v4.1; GCF_002099425.1) and the grey short-tailed opossum (Monodelphis domestica) (MonDom5; GCF_000002295.2) genomes to include orientation data from the marsupial clade in Supplementary Table 8.
Evolutionary history analyses of OTR-VTRs
To assess in which ancestral vertebrate chromosomes the OTR-VTRs originated, we used four ancestral chromosome models from the literature14,23,25,27, in which the reconstructed chromosomes were based on different species and different genome qualities. Specifically, human, mouse, dog, chicken and tetraodon genomes were used in ref. 23; human, chicken, stickleback, pufferfish, sea squirt, amphioxus, sea urchin, fruitfly and sea anemone genomes in ref. 27; human, chicken and sea lamprey (somatic) genomes in ref. 25; and chicken, spotted gar and sea lamprey (germline) genomes in ref. 14. We searched for the presence of annotated OTR-VTRs in four outgroup invertebrate lineages (through literature review, BLAST and BLAT searches)—namely in sea squirt, roundworm, California sea hare and amphioxus. For the amphioxus genome (B. floridae v.2.0), we performed BLAT queries on OTR-VTR FASTA sequences from all species studied using the JGI genome browser (https://genome.jgi.doe.gov/portal/).
To test which sea lamprey receptor(s) most probably represents the orthologous ancestral gene(s), we compared the sea lamprey OTR-VTRs in all possible combinations to each other using BLASTn (same parameters). We compared the exons and introns of the identified genes separately to understand the divergence of the paralogous genes, following a previously proposed paradigm43, using the maximum score and per cent identities of the comparisons that were above the threshold (maximum score >40 and E value < 10−4). We performed a similar analysis for VTR1B and VTR2A in elephant shark and coelacanth, to test whether sequence identity can help to solve ancestry questions. To shed light on the orthology between the inshore hagfish and the sea lamprey OTR-VTRs, we compared their exons and introns as well.
To analyse conserved non-coding RNA synteny around the OTR-VTRs, we looked for them in alignments in all the species studied in Ensembl, in the miRbase (http://www.mirbase.org/; release 22), and the miRviewer44 database (28 February 2012 update). We aligned (BLASTn) long non-coding RNA regions within species (sea lamprey and human).
Gene tree phylogeny analyses
Exonic nucleotide tree
Exonic sequences from all the OTR-VTRs from representative species that had the most-complete assembled genes were aligned with MAFFT under the E-INS-i parameter set, which is optimized for sequences with multiple conserved domains and long gaps. Any incomplete non-lamprey OTR-VTR of less than 1,000 bp was excluded, as alignments on short sequences often lack power to resolve species relationships, resulting in weakly supported gene trees. Because of the basal phylogenetic position of the lamprey, all lamprey OTR-VTRs (754 bp and longer) were included. From this alignment, we generated a phylogenetic maximum likelihood tree using GTRGAMMA model of RAXML (version 8.2.10)45, with 1,000 replicates. We calculated the GC content of all the exonic sequences using http://www.endmemo.com/bio/gc.php (Supplementary Table 13).
Protein amino acid tree
A maximum likelihood phylogenetic tree was constructed on one representative amino acid sequence for every gene in every species, using TreeFAM and TreeBeST5 pipeline in the gene tree tool package of Ensembl (https://www.ensembl.org/info/genome/compara/homology_method.html). Thereafter, we manually curated the Ensembl tree (gene tree identifier: ENSGT00760000119156) using the universal nomenclature that we propose here. All the sequences used to generate both trees, sequence alignments and Newick files can be found at https://github.com/constantinatheo/otvt.
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.