Strange India All Strange Things About India and world


No statistical methods were used to predetermine sample size. The experiments were completely randomized, and investigators were not aware of genotype identifiers while conducting experiments or sequencing.

Plant collections, propagation, cultivation and phenotyping

To form the diversity panel, seeds, rhizomes and clonal propagules from natural and common garden sources were collected from 2010 to 2018. Plants grown from seed followed a standard growth procedure16. In brief, 10–15 seeds were sown in 9-cm square pots containing a mixture of ProMix BX potting soil (Premier Tech Horticulture) and Turface MVP calcined clay (Turface Athletics) and vernalized for 7 days at 4 °C. Pots were then placed in a lit greenhouse with 14-h day length and 30-°C/22-°C day/night temperature. Seedlings were thinned at the 3-leaf stage to 1 plant per pot and allowed to grow until the 5-tiller stage. Rhizome propagules and 5-tiller seedlings were transferred to 5-gallon pots containing finely ground pine bark mulch (Lone Star Mulch) and time-release fertilizer (Osmocote 14-14-14, ScottsMiracleGro). All individual plants were propagated in Austin by clonal division from 2016 to 2018, targeting >10 clones per unique accession. Cleary 3336F systemic fungicide (Cleary Chemicals) was applied to the plants as necessary to control fungal pathogens. Plants were placed in 1-gallon pots for the final propagation.

Planting in the field sites occurred from 15 May to 10 July 2018 and followed previously published methods16. In brief, plants were transported to each site by truck, where each field was covered with one layer of DeWitt weed cloth. Plants were placed in holes that were cut into the weed cloth into a honeycomb design in which each plant had four nearest neighbours, all located 1.56 m from one another. To prevent edge effects, the lowland Blackwell cultivar was planted at every edge position. Plants were hand-watered following transplantation. Aboveground portions of all plants were left to stand over the winter of 2018–2019 and removed in the spring of 2019 before spring tiller emergence. At the end of the 2019 season, plants were tied upright as a bunch and harvested with sickle bar mowers.

We generated two measures of fitness for the 2019 growing season: log-transformed biomass (kg) and proportion of winter survival (Supplementary Data 8). Biomass data were obtained from all living individuals during harvest in October and November 2019. Plants with an estimated mass <750 g were placed in paper bags and dried whole at 60 °C until no additional moisture loss occurred, then weighed for total dry biomass. Plants with an estimated mass >750 g were weighed in the field for wet biomass on a hanging scale with a ±5-g resolution. To determine biomass of these plants, approximately 500 g of whole tillers were subsampled from each plant, weighed, dried as above and reweighed. The wet biomass of the whole-plant sample was then multiplied by the per cent moisture in the subsample to approximate total dry biomass. Plants were considered to have experienced winter mortality during the 2018–2019 winter season when no new growth was seen from plant crowns by 1 June 2019. The dead plant crowns were excised from the experiment and replaced with plants of the Blackwell cultivar in July or September 2019.

Genome assembly and polishing

We sequenced the Alamo switchgrass genotype AP13 using a whole genome shotgun sequencing strategy and standard sequencing protocols at the Department of Energy Joint Genome Institute and the HudsonAlpha Institute for Biotechnology. The genome was assembled and polished from 4,520,785 PacBio reads (121.66× raw sequence coverage from a total of 59 P6C4 2.0 and 2.1 chemistry cells with 10-h movie times and a p-read yield of 91.76 Gb) (Extended Data Fig. 1) using the MECAT assembler52 and ARROW polisher53. Final genome polishing and error correction was conducted with one 400 bp insert 2 × 150 bp Illumina HiSeq fragment library (177.1×). Reads with >95% simple sequence repeats and reads <50 bp after trimming for adaptor and quality (q < 20, 5-bp window average) were removed. The final read set consisted of 1,259,053,614 reads for a total of 168× coverage of high-quality Illumina bases. This produced an initial diploid assembly of 6,600 scaffolds (6,600 contigs), with a contig N50 of 1.1 Mb, 3,489 scaffolds larger than 100 kb and a total 2C (diploid) genome size of 2,013.4 Mb.

Assembling a haploid genome in an outbred individual, such as AP13, will generally yield both haploid copies in heterozygous regions, necessitating computational steps to represent each chromosome as a single-copy haplotype without duplicate copies being unnecessarily repeated. Our initial assembly was approximately double the expected haploid (1C) genome size of 1.2 Gb. Therefore, to detect putative meiotically homologous haplotypes, we identified and counted shared 24-mers that occurred exactly twice in the assembly and binned contigs accordingly. A total of 3,152 shorter and redundant alternative haplotypes and 2,387 overlapping contig ends were identified, comprising a total sequence of 871.2 Mb. The remaining 1,142.2 Mb of sequence was ordered and oriented into 18 chromosomes by aligning genetic markers from 2 available maps (Supplementary Data 1) to the MECAT assembly; 563 joins and 57 breaks were made, with 10,000 Ns representing the unsized gap sequence. Overall, 97.2% of the assembled sequence was contained in the chromosomes. Telomeric sequence was identified using the (TTTAGGG)n repeat and properly oriented. The remaining scaffolds were screened against GenBank bacterial proteins and organelle sequences and removed if found to match these sequences. To resolve minor overlapping regions on contig ends, adjacent contig ends were aligned to one another using BLAT54; a total of 47 adjacent duplicate contig pairs were collapsed.

We conducted two rounds of error correction. First, we corrected homozygous SNPs and insertions and/or deletions (indels) by aligning the Illumina 2 × 150 bp library to the release consensus sequence using bwa mem55 and identifying homozygous SNPs and indels with the UnifiedGenotyper tool of GATK56. A total of 690 homozygous SNPs and 80,199 homozygous indels were corrected in the release. Second, we computationally finished 11,343 assembled contigs sequenced from BAC clones with a combination of ABI 3730XL capillary sequencers57 and single index Illumina clone pools and aligned this set of switchgrass clones to the SNP-fixed genome to find heterozygous SNPs that were out of phase with their neighbours. To resolve these phase-switched alleles, the full set of the raw PacBio reads was aligned to the assembly. For each read, the phase of each heterozygous site was determined and 62,732 out-of-phase heterozygous sites were corrected.

To distinguish the N and K subgenomes, we used a de novo repeat-clustering method and validated this with phylogenetic distances to a related species. We searched for ‘diagnostic’ 15-mers via Jellyfish58 in LTR regions of Gypsy, Copia and Pao insertions (identified by RepeatMasker59 and LTRHarvest60) that distinguished each set of homologous chromosomes (≤1 hit in one homologue and ≥100 in the other). The LTR sequences that shared common 15-mers were grouped as superfamilies and were aligned within each superfamily by BLAST. Superfamily members with significant BLAST hits (e < 0.01, ≥90% length) were assigned into families and aligned by Mafft61. Jukes–Cantor distances between LTR families were computed by the R ape package62, and clustered into two distinct sets of subgenomes. Clustering was identical between LTRs and alignments to P. rudgei (K.M.D. and E. Kellogg, unpublished data), which is an ancient relative of the K subgenome17, giving high confidence that we have effectively assigned all chromosomes to the correct subgenomes. Finally, we assigned chromosome identifiers and oriented each chromosome pseudomolecule via synteny with Setaria italica63. The final haploid version 5.0 release contained 1,125.2 Mb of sequence, consisting of 626 contigs with a contig N50 of 5.5 Mb and a total of 97.2% of assembled bases in chromosomes.

Gene annotation

Transcript assemblies were made from about 2 billion pairs of 2 × 150-bp stranded paired-end Illumina RNA-seq reads, about 1 billion pairs of 2 × 100-bp paired-end Illumina RNA-seq reads and 454 reads (Supplementary Data 3) using PERTRAN (details of which have previously been published64). In brief, PERTRAN conducts genome-guided transcriptome short-read assembly via GSNAP65 and builds splice alignment graphs after alignment validation, realignment and correction. In total, around 4.5 million PacBio Iso-Seq circular consensus sequences66 were corrected and collapsed, resulting in approximately 677,000 putative full-length transcript assemblies. Subsequently, 668,176 transcript assemblies were constructed using PASA67 from RNA-seq reads, full-length cDNA, Sanger expressed sequence tags, and corrected and collapsed PacBio circular consensus sequence reads. Loci were determined by EXONERATE68 alignments of switchgrass transcript assemblies and proteins from Arabidopsis thaliana69, soybean70, Kitaake rice71, Setaria viridis72, P. hallii var. hallii64, Sorghum bicolor73, Brachypodium distachyon74, grape and Swiss-Prot75 proteomes. These alignments were accomplished against a repeat-soft-masked switchgrass genome using RepeatMasker59 (repeat library from RepeatModeler76 and RepBase77) with up to 2,000-bp extension on both ends unless extending into another locus on the same strand. Incomplete gene models, which had low homology support without full transcriptome support, or short single exon genes (<300-bp coding DNA sequences (CDS)) without protein domain or good expression were removed.

Comparative genomics

Syntenic orthologues and paralogues were inferred for the two switchgrass subgenomes via the GENESPACE pipeline64, using default parameters and two outgroups: P. hallii var. hallii64 and S. bicolor73. In brief, GENESPACE parses protein similarity scores into syntenic blocks and runs orthofinder78 on synteny-constrained blast results. The resulting block coordinates and syntenic orthology networks give high-confidence anchors for evolutionary inference.

To calculate the ancestral states of CDS regions, we first determined sequences that share common ancestry using genomes from Phytozome79. The final number of hits to the switchgrass genome were 38,960 and 33,772 for P. hallii, and S. bicolor, respectively. For any given orthology network, we built two multiple sequence alignments in mafft61, one excluding the focal switchgrass sequence (msa0) and one forcing msa0 to align to the coordinate system of the focal sequence via the –keeplength parameter. We then extracted marginal character states with the maximum likelihood algorithm in Phangorn80. For each reconstruction, only the internal node closest to the switchgrass branch was used as the ancestral state. Overall, we analysed 40,943 switchgrass gene models (216,157 exons) covering 54.95 Mb (Supplementary Data 9).

Subgenome evolution and dating

To infer the ages of the subgenomes and tetraploid switchgrass, we took a conservative set of orthologues with simple 2:1:1 networks between P. virgatum, P. hallii and S. italica. This yielded 45,045 switchgrass proteins aligning to 24,549 P. hallii proteins, resulting in 20,496 homologue pairs and 4,053 singletons (2,396 for K subgenome and 1,660 for N subgenome) from the cross-species analysis. We aligned the translated CDS of these sequences using Dialign-TX81. The aligned CDS sequences were concatenated and fed to Gblocks82 using default parameters. Gblocks filtered the alignment of 18,044,244 CDS nucleotides to 16,321,302 positions, in 50,334 blocks. The resulting alignment was then used in PhyML83 to build a maximum-likelihood tree using the general-time reversible model. This tree was used as an input to r8s84, to compute a time tree and calibrate the PanicumSetaria node of the tree to 13.1 Ma63. To date subgenome divergence and therefore the timing of polyploid switchgrass speciation, we leveraged burst distances, which refer to all distances within an LTR family (whereas pairwise distances refer to the distance between the 5′ and 3′ LTRs of the same insertion). The 5′ versus 3′ distances of the N- or K-subgenome-specific retrotransposons were used to date the insertion times of those elements. This method cannot be used for the P. virgatum-specific or Panicum-specific families because the more recent expansions of those elements dominate the distributions. Instead, we relied on comparing the best cross-species alignments to estimate the LTR distances of the P. virgatumP. hallii and PanicumSetaria nodes. This way, we have calibration points to compare the LTR distances to the more confident protein-coding gene divergences between species.

Subfunctionalization and gene expression analyses

To assess whether the subgenome evolution biases observed at the protein-coding sequence scale were manifest in phenotypes, we explored gene expression biases between homologues from biologically replicated AP13 leaf tissue (n ≥ 5) collected at two sites (TX2 and MI). Illumina paired-end RNA-seq 150-bp reads were quality trimmed (Q ≥ 25) and reads shorter than 50 bp after trimming were discarded. High-quality sequences were aligned to P. virgatum v5.1 reference genome using GSNAP65 and counts of reads uniquely mapping to annotated genes were obtained using HTSeq v.0.11.285. The test for differential expression was conducted through a likelihood ratio test in DESeq286. Library sizes were calculated before splitting the reads by subgenome; these sizes were used as the size factors in the analysis of differential expression. Subfunctionalization was defined as a significant subgenome-by-environment interaction from the likelihood ratio test. Subgenome expression bias was tested for both the field gardens and annotation libraries using post hoc Wald-test contrasts between subgenomes within conditions. Significant bias was defined as differential expression false-discovery-rate-adjusted P < 0.05. Weighted gene coexpression clustering of AP13 gene annotation RNA-seq libraries was conducted with WGCNA87 with a power of 6. Raw counts can be found in Supplementary Data 10.

Ploidy assessment

We used a LSRFortessa SORP Flow Cytometer (BD Biosciences) to determine ploidy levels of the resequenced accessions. For each plant, 200–300 mg of young leaf tissue was macerated in a Petri dish with a razor blade and treated for 15 min with 1 ml Cystain PI Absolute P nuclei extraction buffer (Sysmex Flow Cytometry) mixed with 1 μl 2-mercaptoethanol. Samples were filtered to isolate free nuclei with a CellTrics 30-μm filter (Sysmex) and treated for 20 min on wet ice with 2 ml of Cystain PI Absolute P staining buffer (Sysmex), 12 μl of propidium iodide and 6 μl of RNase A. Samples were run on the flow cytometer to determine nuclei size with a minimum of 10,000 nuclei analysed per sample. Output from the flow cytometer was analysed with FlowJo software (BD Biosciences) and samples were binned into three categories on the basis of the average units of fluorescence per nuclei (Supplementary Fig. 1). Ploidy level of the sample was considered 4× if the cell population had 40,000–80,000 units of fluorescence, 6× for 80,000–100,000 units and 8× for 100,000–140,000 units. The binning parameters were established with flow cytometry data from several P. virgatum accessions of known ploidy.

We also assessed ploidy of the samples via the distribution of variant allele frequency at biallelic SNPs (as described in ‘Variant calling’). This method assumes that tetraploids and octoploids follow different allele frequency distribution patterns, with tetraploids having 0.5/0.5 (reference and variant depths) and octoploids having a mixture of 0.75/0.25 and 0.5/0.5. If the proportion of hits with 0.48 ≤ x ≤ 0.52 was <0.035, the library was considered octoploid and if it was ≥0.035, tetraploid; 837 out of 870 samples (96.2%) that had flow cytometry data matched with these results.

Variant calling

A total of 789 tetraploid diversity samples were resequenced at a median depth of 59× (range 20×–140×). Of these, 732 were used for further analysis after filtering for missing data, outlier elevated heterozygosity and collection site discrepancies. The samples were sequenced using Illumina HiSeq X10 and Illumina NovaSeq 6000 paired-end sequencing (2 × 150 bp) at HudsonAlpha Institute for Biotechnology and the Joint Genome Institute. To account for different library sizes, reads were pruned to ≤50× coverage, then mapped to the v5 assembly using bwa-mem55.

SNPs were called by aligning Illumina reads to the AP13 reference with BWA-mem. The resulting .bam file was filtered for duplicates using Picard (http://broadinstitute.github.io/picard) and realigned around indels using GATK 3.056. Multi-sample SNP calling was done using SAMtools mpileup88 and Varscan V2.4.089 with a minimum coverage of eight and a minimum alternate allele count of four. Genotypes were called via a binomial test. SNPs within 25 bp of a 24-mer repeat were removed from further analyses. Only SNPs with ≤20% missing data and minor allele frequencies >0.005 were retained, resulting in 33,905,042 SNPs across 75% of the genome at a coverage depth between 8× and 500×. Phasing was performed using SHAPEIT390. FST calculations were accomplished via vcftools91. We tested for subgenome read-mapping bias by generating mean coverage per Mb for each of the 732 libraries and 18 chromosomes. We then fit a mixed effects linear model to these data in lme492 in which the chromosome number (1–9) was a random effect, to test the main effect of subgenome. Models with and without the main effect term were compared via a likelihood ratio test.

Individual de novo assemblies for the 732 short read libraries were constructed using HipMer93 with a k-mer size of 101 to maximize haplotype splitting among contigs. As the assemblies varied in quality and contiguity, the sample set considered for gene presence–absence and structural variant detection was narrowed to 251 samples (pan-genome set) based on total assembly size, contig N50 length and total gene alignments per library.

To assess presence–absence variation of genes across the pan-genome, we aligned all AP13 proteins and a unique set of 6,161 proteins from Oropetium thomaeum (nproteins = 1,476)94, S. italica (n = 1,085)63, Setaria viridis (n = 891)72, P. hallii var. filipes (n = 1,048)64, S. bicolor (n = 878)95 and P. hallii var. hallii (n = 772)64. These unique genes were extracted from single-copy orthology networks inferred via orthofinder78 and selection owing to a lack of orthology to switchgrass. All proteins (≥100 amino acids) were aligned to all de novo assemblies using BLAT54. Gene alignments from AP13 proteins were considered present if they aligned with greater than or equal to 80% identity and 75% coverage, whereas other grass proteins were considered present with alignments greater than 70% identity and 75% coverage (to allow greater divergence among species). Variable (pan-genome shell) genes (considered present across 40–60% of the population; n = 5,432) were extracted from the presence–absence variation matrix and used to visualize differences among non-admixed individuals from the Atlantic, Gulf and Midwest subpopulations. Testing genes that were significantly over- or under-represented within each subpopulation was conducted with a χ2 test with a Benjamini–Hochberg multiple testing correction (P ≤ 0.05).

To detect structural variants across the pan-genome, contigs (≥2 kb) from each library were aligned to the AP13 reference genome using ngmlr96 with default settings for PacBio reads. The resulting .bam file was sorted using samtools88 and used for calling structural variants with sniffles96. Individual structural variant calls were merged across samples using SURVIVOR97, with a maximum allowed distance of 1 kb. The resulting .vcf file was filtered using bcftools88 using a minimum minor allele frequency of 0.1, and considering only insertions and deletions between 100 and 1,500 bp in length.

Population genomics

To assess the genetic population structure of the 732 tetraploid libraries (Supplementary Data 4), we extracted all fourfold degenerate sites (putatively neutral) with ancestral state calls (Supplementary Data 9) from the ancestral state alignments. This list of sites, which represents our highest confidence neutral loci, was then linkage-disequilibrium-pruned using a threshold of |r| ≤ 0.6, resulting in 59,789 sites for downstream analyses in the R package SNPRelate98.

The extent of linkage disequilibrium for the population was determined from SNPs99 in PLINK100. Linkage disequilibrium (r2) was calculated using plink (–ld-window 500–ld-window-kb 2000). The r2 value was averaged every 500 bp. A nonlinear model was fit for this data in R using the nls function, and the extent was determined as to when the linkage disequilibrium (r2) nonlinear curve stabilized.

Population genetic structure was assessed hierarchically. Given the presence of highly divergent ecotypes across the study range, we first analysed the broadest genetic population structure using discriminant analysis of principal components (DAPC)101 in adegenet v.2.0.1102. This method does not rely on common assumptions (for example, Hardy–Weinberg equilibrium and linkage disequilibrium) that underlie many population clustering approaches and therefore provides a valuable tool to look at broad structural divisions. DAPC demonstrated a strong set of gene pools and separated Midwest genotypes from all others. We then evaluated the genetic population structure and potential admixture of the remaining non-Midwest individuals using a Bayesian clustering algorithm implemented in STRUCTURE v.2.3.4103 via the admixture model with correlated allele frequencies. The analysis consisted of 20,000 burn‐in steps and 30,000 replicates of 1–6 genotypic groups, each of which was run 10 times. Ancestry coefficients across all subpopulations were assigned post hoc through eigenvector decomposition in SNPRelate.

We inferred the demographic history of the switchgrass samples using Multiple Sequentially Markovian Coalescent (MSMCv.2.0104), which is a population genetic method used to infer demographic history and population structure through time from sequence data. This method models an approximate version of the coalescent under recombination, and produces tests of both population size and divergence time. MSMC was run using four haplotypes for each subpopulation, skipping ambiguous sites, an estimated rhoOverMu of 0.25 and a time segment pattern of 10 × 2 + 20 × 5 + 10 × 2. We estimated rhoOverMu as 0.25 as the mean value from 100 iterations without the fixed recombination parameter for 5 sets of 4 haplotypes in each subpopulation and averaged them. To estimate scaled divergence time in generations, we assumed a mutation rate of 6.5 × 10−8. To make estimates of initial divergence time, we compared adjacent relative cross-coalescence rate (RCCR) values (past to present) (Supplementary Data 11). If there was a decline, either at a single time segment or within contiguous segments or within two interleaved time segments (>0.01; observed range 0.01–0.28), and the following neighbours were nearly zero (≤0.009; observed range: −0.1–0.009), we considered that to be a starting point for population separation. However, if there was another decline within five time segments, we considered the latter as the start of population separation. We replicated the analyses with 16 sets of different individuals for each subpopulation contrast.

Population structure was visualized across SNPs, structural variants and presence–absence variants via eigenvector decomposition of a distance matrix. First, a Euclidean distance matrix was calculated among 0/1/2 (reference homozygote, heterozygous, alternative homozygote) library × marker matrices for each of the three variant call types. The Euclidean matrix was then scaled and centred to remove among-library coverage variance via Gower’s centred similarity matrix, implemented in the R package MDMR105.

Ecotype classification

Mature switchgrass accessions at or near anthesis were surveyed for 16 plant traits (leaf: length, width, length/width ratio, area, lamina thickness and lamina/midrib thickness ratio; whole plant: number of tillers, tiller height, product of tiller height × number, tiller height/count ratio, panicle height, panicle height/count ratio, leaf canopy height and tiller/leaf height ratio; phenology: date of green-up and date of panicle emergence) to determine ecotype identity during the summer of 2019 at the University of Texas J. J. Pickle Research Campus (PKLE; or TX2 (Austin, Texas, USA) and Michigan State University Kellogg Biological Station (KBSM; or MI (Hickory Corners, Michigan, USA)) common gardens (see Supplementary Data 5 for detailed descriptions of these variables). The phenology measurements, including green-up (when the first green vegetative structures emerge from the rhizome crown) and panicle emergence (when the first reproductive structures emerge from the tiller), were assayed daily. Detailed leaf morphology was assessed on a representative leaf of each plant by measuring length and width (in mm), midrib and lamina thickness (in μm) (Mitutoyo 547-500S caliper) and leaf area (in mm2) (Licor 3100C leaf area meter). In addition to these quantitative traits, we also generated a qualitative upland–lowland index for both the leaf and whole-plant appearance, collected at the end of the summer 2019 in Austin (TX2 site). Each plant characteristic was assessed on a 1–5 scale from most lowland-like to most upland-like. The established cultivars Alamo and Dacotah were used for baseline measurements of lowland and upland characters, respectively. Plant characters assessed included: tiller appearance, from thickest and most lowland-like to thinnest and most upland-like; leaf appearance, from widest, longest and most lowland-like to shortest, thinnest and most upland-like; canopy colour from bluest and most typically lowland to darkest green and most typically upland. This visual approach is akin to basic selection criteria often used by switchgrass breeders.

To assess phenotypic structure in these data, we used a DAPC101. Prior groups were determined by first transforming the phenotypic data using principal component analysis (PCA), then the first 10 principal components were used in a k-means algorithm to classify individuals into 3 possible groupings aiming to maximize the variation between groups. Next, DAPC was implemented on the 10 retained principal components to provide an efficient description of the ecotypic clusters using two synthetic variables, which are linear combinations of the original phenotypic variables that have the largest between-group variance and the smallest within-group variance (that is, the discriminant functions).

We classified each of the 651 tetraploid genotypes surveyed for the 16 traits at the MI and TX2 gardens (34 total features, 32 quantitative and 2 qualitative ordinal traits) to 1 of the 3 ecotypes through a low-capacity neural network with 1 hidden layer and 5 units (Supplementary Data 5). The neural network was implemented in caret106 and was trained on seven cultivars with known ecotypes (lowland: Kanlow and Alamo; coastal: High Tide and Stuart; upland: Summer, Dacotah and Sunburst) and 78 additional genotypes that were in the same SNP-based genetic cluster (Extended Data Fig. 3), collected in the same states and clustered most closely in phenotypic PCA space with the exemplar cultivars. These high-affinity exemplar genotypes are printed in Supplementary Data 5. Ecotypes for the remaining 582 genotypes that were phenotyped for the ecotype classification traits were predicted with caret106. By using traits collected at gardens representing both the northern and southern switchgrass range, we hoped to avoid local climate bias on plant phenotype and subsequent ecotype classification. Furthermore, the neural network classification approach offers one notable advantage over both DAPC and expert’s qualification: because the neural network is anchored to known and published genotypes, experimentation that includes these common cultivars will be able to more effectively recapitulate our assignments.

Admixture and introgression block calculation and dating

We built a database of admixture-informative SNPs through a two-step pipeline. First, ancestry coefficients were calculated as in ‘Population Genomics’ from fourfold degenerate sites that had associated ancestral-state calls. The 30 samples with the least missing data and proportion of genome-wide admixture ≤ 0.001 for each subpopulation were used to define subpopulation-specific allele frequencies. These libraries were used to find SNPs with at least one pairwise FST value >0.4, as calculated with the ‘W&C84’ method in the snpRelate function snpGdsFst. Second, these global ancestry-informative sites were parsed within each subpopulation to those with minor allele frequencies > 0.05 and missingness < 0.05. These sites were further pruned within subpopulations first to sites with |r| < 0.9 (10 SNPs or 1,000-bp windows), then to |r| < 0.95 (1,000 SNPs or 10,000-bp windows) in snpRelate. This process resulted in the following SNP and library counts for each subpopulation: Atlantic, 579,468 SNPs and 284 libraries; GULF, 641,975 SNPs and 215 libraries; and Midwest, 481,563 SNPs and 196 libraries.

To test for the physical locations of admixture blocks between each pair of subpopulations, we used Ancestry_HMM36,107. This approach leverages allele frequencies in putative parental populations to determine regions of likely introgressions in a test population. For each of the three subpopulations, we sought to determine the timing, extent and current positions of admixture block introgressions. In each case, we permitted two pulses from each of the other two subpopulations. Ancestry_HMM can optimize the number of generations before present when an ancestry pulse occurred and the proportion of individuals involved in the admixture pulse. However, 8-parameter optimization with >480,000 sites and >150 libraries was not computationally feasible. Therefore, we optimized parameters using 40 randomly sampled libraries with admixture coefficients within the 0.2–0.8 quantiles of the admixture proportion distribution and SNPs only on chromosome 4 of the N subgenome. We chose this chromosome as representative of others because of a lack of obvious large high-frequency introgressions. The resulting ancestry pulse parameter optimizations were founded on an initially unadmixed population 10,000 generations before present, and two subsequent admixture pulses for each of the other two subpopulations; the optimized pulses are as follows (source–reference): Midwest–Atlantic (ngenerations = 8,658 and Padmixed = 0.001%; 67 and 0.7%), Gulf–Atlantic (85 and 1.1%; 17 and 0.25%), Atlantic–Gulf (79 and 1.9%; 11 and 0.38%), Midwest–Gulf (79 and 0.86%; 11 and 0.14%), Atlantic–Midwest (66 and 0.27%; 14 and 0.036%), and Gulf–Midwest (71 and 0.15%; 14 and 0.033%). These pulses were supplied to the full model with all individuals and chromosomes, along with an error probability of 0.001, maximum number of generations before present of 10,000 and effective population size of 100,000. Posterior ancestry probabilities were decoded into haplotype blocks and blocks were binned into clusters of similarly positioned blocks.

Landscape genomics

Geographical maps were made with publicly available layers downloaded from Natural Earth (https://www.naturalearthdata.com/). Various plotting routines rely on the sf108 and raster109 packages in the R environment for statistical computing110. Climate data were downloaded from WorldClim22 (19 bioclimatic variables, 0.5-arcmin resolution 1960–2000) and ClimateNA21. The distribution of climate variables across collections sites was explored via dynamic clustering111 followed by partitioning around medoids clustering112 with k = 7. The most representative climate variables were defined as those most correlated with the first eigenvector of variation within each cluster. Six of the seven clusters included WorldClim variables.

Weather data were downloaded from the NOAA portal for the most proximate weather station to each garden site that had complete daily temperature (minimum–maximum), and precipitation data from 1 September 2018 to 31 October 2019. The NOAA weather station identifiers used for each garden are as follows: IL (USC00110338), MI (USW00014815), MO (USW00003945), NE (USC00255362), OK (USW00053926), SD (USC00391076), TX1 (USC00414810), TX2 (USC00410433), TX3 (USC00418862) and TX4 (USW00003901).

Climate–phenotype associations across gardens were conducted on both raw data and imputed data. Latitude–survival associations (Fig. 2b) were accomplished on raw data with logistic regressions via glm with a binomial family in R. Imputations, which were accomplished in base R using nearest neighbours across all available phenotypes (k = 5), were used exclusively for tests of the rank order of gardens (Fig. 2c, d). Climate similarity–biomass associations were accomplished in mixed linear models via lmer92, comparing the full model (fixed = climate distance + intercept, random = genotype identifier) to a reduced model without the climate distance fixed effect using a likelihood ratio test.

Species distribution modelling (SDM) was used to simulate modern-day potential ranges for all ecotypes (upland, lowland and coastal) of P. virgatum. The final datasets used to build the SDMs comprised 277 (upland), 199 (coastal) and 121 (lowland) occurrence records. Six environmental predictors were used in our final SDM modelling (BIO1 = annual mean temperature, BIO2 = mean diurnal range, BIO4 = temperature seasonality, BIO5 = maximum temperature of warmest month, BIO16 = precipitation of wettest quarter and BIO17 = precipitation of driest quarter). SDMs were then generated with BIOMOD2 v.3.3113 with seven modelling algorithms: generalized linear models, boosted regression trees, artificial neural networks, flexible discriminant analysis, random forest, classification tree analysis and multivariate adaptive regression splines. For each model, the occurrence data were coupled with 500 pseudo-absence data generated randomly within the modelled study area with equal weighting for presences and pseudo-absences114. Models were trained with 80% of the coupled occurrences and pseudo-absence data and tested with the remaining 20%. Each modelling algorithm was run 100 times for a total of 700 models, which were evaluated via true skill statistics (TSS)115. TSS values ranging from 0.2 to 0.5 were considered poor, from 0.6 to 0.8 useful, and >0.8 good to excellent116. Unique ensemble SDMs were computed from approximately the 50 best SDMs out of 700 models for the three ecotypes on the basis of TSS threshold values (upland TSS threshold = 0.96, lowland TSS threshold = 0.93 and coastal TSS threshold = 0.965). The final ensemble SDMs were projected onto present climate layers to visualize modern-day potential ranges (Supplementary Data 12).

We examined how the presence of Midwest introgressions in the Atlantic subpopulation were associated with the independent and joint influences of climate, geography and kinship, by implementing redundancy analysis in vegan117,118,119,120. To partition explainable variance in introgression presence attributable to climate, kinship and geography, we ran four models: one full model with introgression presence (potential introgression blocks were coded as 0 for Atlantic inheritance or 1 for Midwest introgression) explained by climate (that is, the seven representative climate variables), kinship (the first two principal components calculated from the set of putatively neutral markers) and geography (latitude and longitude), and three models for each of these three factors conditioned on the other two. The inertia (that is, variance) values from the constrained matrix of each model were compared to determine the relative importance of climate, kinship, geography and their joint effect. Furthermore, to find introgression regions strongly linked to climate and survival-corrected biomass, we extracted the loadings for the redundancy analysis axes from two additional models: (1) one predicted by only climate and (2) one predicted only by survival-corrected biomass. Both models were significant according to permutation tests (n = 999; P < 0.001 for both), and all axes were approximately normally distributed. SNPs loading at the tails of each axis were more likely to indicate selection related to the predictors (that is, climate or survival-corrected biomass), so we identified all markers that were at least 2.5 s.d. (two-tailed P = 0.012) from the centre as introgressions putatively under selection119.

GWAS

Owing to the large sizes of our common garden datasets, we developed a pipeline—the switchgrassGWAS R package (https://github.com/Alice-MacQueen/switchgrassGWAS)—to allow fast, less-memory-intensive GWAS on the diversity panel, and to analyse the extent to which SNP effects were similar or different for phenotypes measured at different sites. This package leverages bigsnpr121 to perform fast (>300× faster than TASSEL) statistical analysis of massive SNP arrays encoded as matrices. It also incorporates current gold standards in the human genetics literature for SNP quality control, pruning and imputation, as well as population structure correction in GWAS. To test the significance of many effects in many conditions (for example, multiple sites, climate variables and so on), we used mashr32, a flexible, data-driven method that shares information on patterns of effect size and sign in any dataset for which effects can be estimated on a condition-by-condition basis for many conditions and SNPs. We determined which SNPs had evidence of significant phenotypic effects using local false sign rates, which are analogous to false discovery rates but more conservative (in that they also reflect the uncertainty in the estimation of the sign of the effect)122. We used these values to find SNPs with log10-transformed Bayes factors > 2. Here, the Bayes factor was the ratio of the likelihood of one or more significant phenotypic effects at a SNP to the likelihood that the SNP had only null effects. Following previous work33, a Bayes factor of >102 is considered decisive evidence in favour of the hypothesis that a SNP has one or more significant phenotypic effects.

To calculate regional heritability for climate- and fitness-associated SNPs we followed a previously described two-step method123. Variance component analysis was accomplished with ASReml (VSN International), using genomic relationship matrices calculated using the van Raden method124. Genomic relationship matrices were calculated within each subpopulation and for the full diversity panel. A kinship matrix based on all SNPs used in the univariate GWAS was calculated (G), as well as a kinship matrix based on SNPs significantly associated with climate in that subpopulation (log10-transformed Bayes factor > 2; Qclimate) and a kinship matrix based on SNPs significantly associated with biomass or winter survival in that subpopulation (log10-transformed Bayes factor > 2, or >1.385 for Gulf subpopulation; Qfitness). These kinship matrices were used for regional heritability mapping123 as in a previous publication125, using mixed models of the form:

$${rm{V}}{rm{a}}{rm{r}}(u)=G{sigma }_{u}^{2}$$

$${rm{V}}{rm{a}}{rm{r}}(v)=Q{sigma }_{v}^{2}$$

$${rm{V}}{rm{a}}{rm{r}}(e)=I{sigma }_{e}^{2}$$

in which the vector y represents the biomass values, Z is the design matrix for random effects, u is the whole genomic additive genetic effect, v is the regional genomic additive genetic effect and e is the residual. Matrix G is the whole genomic relationship matrix using all SNPs for the whole genome additive effect. Matrix Q is the regional genomic relationship obtained as above: one of Qclimate or QfitnessI is the rank-y identity matrix, in which y is equal to the number of biomass values. Whole genomic, regional genomic and residual variances are ({sigma }_{u}^{2}), ({sigma }_{v}^{2}) and ({sigma }_{e}^{2}), respectively. Phenotypic variance (({sigma }_{{rm{p}}}^{2})) is ({sigma }_{u}^{2}) + ({sigma }_{v}^{2}) + ({sigma }_{e}^{2}). Whole genomic heritability, regional heritability and total heritability are ({h}_{u}^{2}) = (({sigma }_{u}^{2})/({sigma }_{{rm{p}}}^{2})), ({h}_{v}^{2}) = (({sigma }_{v}^{2})/({sigma }_{{rm{p}}}^{2})) and ({h}_{u+v}^{2}) = (({sigma }_{u}^{2}) + ({sigma }_{v}^{2})/({sigma }_{{rm{p}}}^{2})), respectively.

These models were run for the three locations where subpopulation GWAS were conducted: Columbia, Missouri; Hickory Corners, Michigan; and Austin, Texas. This resulted in 80 models: 4 sets of populations (the full diversity panel and 3 subpopulations), 2 model types (one model with G only and a G + Q model), for 10 phenotypes (biomass at 3 sites and 7 environmental variables).

Variance component analyses were also used to partition variance between the K- and N-subgenomes. Only SNPs with ancestral state calls (Supplementary Data 9) were used in this analysis, resulting in 460,429 SNPs used for each population subset. Kinship matrices based on all SNPs on a particular chromosome were calculated (QChr01K to QChr09K, and QChr01N to QChr09N), resulting in 18 kinship matrices. These kinship matrices were used for regional heritability mapping, using mixed models of the form:

$${bf{y}}=1+Z{v}_{1{rm{K}}}+Z{v}_{1{rm{N}}}+Z{v}_{2{rm{K}}}+ldots +Z{v}_{9{rm{N}}}+e$$

$${rm{V}}{rm{a}}{rm{r}}({v}_{i})={Q}_{i}{sigma }_{{v}_{i}}^{2}$$

$${rm{V}}{rm{a}}{rm{r}}(e)=I{sigma }_{e}^{2}$$

in which the vector y represents the biomass values, Z is the design matrix for random effects, v1K (to v9K) or v1N (to v9N) (collectively designated vi) are the chromosome-specific genomic additive genetic effects and e is the residual. Matrices Qi are the chromosome-specific genomic relationship matrices for the nine chromosomes of the N and K subgenomes. Chromosome-specific and residual variances are ({sigma }_{{v}_{i}}^{2}) and ({sigma }_{e}^{2}), respectively. Chromosome-specific heritability is ({h}_{{v}_{i}}^{2}) = (({sigma }_{{v}_{i}}^{2}) /({sigma }_{{rm{p}}}^{2})), and subgenome-specific heritability is the sum of these variances across the nine chromosomes within each subgenome.

Candidate gene exploration

We integrated multiple data structures to rank and provide meaningful culling criteria for candidate genes within introgression intervals and physical proximity to quantitative trait loci peaks. In the case of GWAS peaks, candidate genes were defined as those loci within a 20-kb interval surrounding the mashr peak. Candidate genes for genomic introgressions must have at least partially overlapped the introgression interval. As inference of GWAS and introgressions were conducted within genetic subpopulations, all statistics reported in Supplementary Data 7 (candidate gene lists) are also subpopulation-specific, with the exception of gene co-expression analysis (which was conducted only on AP13 RNA-sequencing libraries used for annotation purposes (Supplementary Data 3)). For a given interval, we present a set of statistics. First, the physical proximity to the peak location was calculated as the midpoint of the gene to the midpoint of the interval (introgression) or GWAS peak position. Second, as the causal locus underlying GWAS peaks within a subpopulation must necessarily be variable within that subpopulation, we extracted all SNPs within and proximate to candidate gene models. These variants were annotated with SNPeff126 and the weighted sum of three main categories of variants (high, moderate and low; a description of these can be found at https://pcingola.github.io/SnpEff/se_inputoutput/#effect-prediction-details) for each gene were calculated as SNPeff_score = high × 20 + moderate × 5 + low × 1. Third, for each gene, we calculated the minor allele frequency of structural and presence–absence variants. Fourth, we include a vector of the identity of the WGCNA clusters for each gene. Finally, if the candidate was a homologue of flowering-time GWAS candidate genes from a previous publication127, the identity of the overlapping interval or gene is included.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *