Strange IndiaStrange India


Genome sequencing

Illumina libraries

Illumina libraries for this manuscript were sequenced on a combination of Illumina X10, HiSeq and NovaSeq platforms. HipMer assembly and selfed progeny (Extended Data Fig. 1a): sequencing libraries were constructed using an Illumina TruSeq DNA PCR-free library kit using standard protocols. Libraries were sequenced on an Illumina X10 instrument using paired ends and a read length of 150 base pairs.

Single flow-sorted chromosome libraries

Sequencing libraries were constructed using an Illumina TruSeq DNA Nano library kit using standard protocols. Libraries were sequenced on either the Illumina HiSeq2500 or NovaSeq 6000 instrument using paired ends and a read length of 150 base pairs.

Remaining Illumina libraries

Illumina Tight Insert Fragment, 400 bp–2 ug of DNA was sheared to 400 bp using the Covaris LE220 and size selected using the Pippin (Sage Science). The fragments were treated with end-repair, A-tailing and ligation of Illumina compatible adaptors (IDT) using the KAPA-Illumina library creation kit (KAPA Biosystems). The prepared libraries were quantified using KAPA Biosystems’ next-generation sequencing library qPCR kit (Roche) and run on a Roche LightCycler 480 real-time PCR instrument. The quantified libraries were then prepared for sequencing on the Illumina HiSeq sequencing platform using a TruSeq Rapid paired-end cluster kit, v.2, with the HiSeq 2500 sequencer instrument to generate a clustered flowcell for sequencing. Sequencing of the flowcell was performed on the Illumina HiSeq 2500 sequencer using HiSeq Rapid SBS sequencing kits, v.2, following a 2 × 250 indexed run recipe.

PacBio libraries

Continuous long-read PacBio sequencing primer was then annealed to the SMRTbell template library and sequencing polymerase was bound to them using a Sequel Binding kit v.2.1. The prepared SMRTbell template libraries were then sequenced on a Pacific Biosystem Sequel sequencer using v.3 sequencing primer, 1 M v.2 single-molecule real-time cells and v.2.1 sequencing chemistry with 1 × 600 sequencing video run times. PacBio HiFi sequencing was performed using circular consensus sequencing (CCS) mode on a PacBio Sequel II instrument. High molecular weight DNA was either needle-sheared or sheared using a Diagenode Megaruptor 3 instrument. Libraries were constructed using SMRTbell Template Prep Kit v.2.0 and tightly sized on a SAGE ELF instrument (1–18 kb). Sequencing was performed using a 30 h video time with 2 h pre-extension and the resulting raw data was processed using the CCS4 algorithm.

RNA-seq libraries

Illumina RNA-Seq with poly(A) selection plate-based RNA sample preparation was performed on the PerkinElmer Sciclone NGS robotic liquid handling system using Illumina’s TruSeq Stranded mRNA HT sample prep kit using poly(A) selection of mRNA following the protocol outlined by Illumina in their user guide: https://support.illumina.com/sequencing/sequencing_kits/truseq-stranded-mrna.html, and with the following conditions: total RNA starting material was 1 ug per sample and eight cycles of PCR were used for library amplification. The prepared libraries were quantified using KAPA Biosystems’ next-generation sequencing library qPCR kit and run on a Roche LightCycler 480 real-time PCR instrument. Sequencing of the flowcell was performed on the Illumina NovaSeq sequencer using NovaSeq XP v.1 reagent kits and an S4 flowcell, following a 2 × 150 bp indexed run recipe.

Chromosome in situ hybridization

Chromosome mitotic metaphase preparations and fluorescence in situ hybridization were performed as described in ref. 13. The S. spontaneum retro-transposon specific oligo probe was designed by Arbor Biosciences using their proprietary software based on the retro-transposon sequences as described in ref. 50. Probes were either labelled with fluorochromes ATTO 488 or ATTO 550.

Single flow-sorted chromosome preparation

Stems of adult plants were cut into single-bud segments, cleaned and soaked in 0.5% carbendazim solution for 24 h, placed in a plastic tray, covered with wet perlite and incubated at 32 °C in the dark, until the roots were approximately 1.5 cm long. For cell-cycle synchronization and accumulation of metaphases, the segments were washed in ddH2O, then transferred to a plastic tray filled with 150 ml 0.1 × Hoagland solution containing 3 mmol l−1 hydroxyurea and incubated at 25 or 32 °C for 18 h in the dark. After a 2 h recovery treatment, the roots were immersed in 2.5 µmo l−1 amiprophos-methyl solution and incubated for 3 h at 25 or 32 °C. Suspensions of intact chromosomes were prepared by mechanical homogenization of root tips fixed with 3% formaldehyde and 0.5% Triton X-100, and stained with 4′,6-Diamidino-2-phenylindole dihydrochloride (DAPI)51. The instrument used for flow sorting was a FACSAria II SORP flow cytometer (BD Biosciences) and Beckman Coulter MoFlo AstriosEQ cell sorter (Beckman Coulter). The software used was FACSDiva v.6.1.3 (BD Biosciences) and Summit v.6.2.2 (Beckman Coulter). For chromosome sorting, initial gating was set on dotplots DAPI-A versus FSC-A and the final sorting gate was set on DAPI-A versus DAPI-W dotplots to exclude chromosome doublets (Supplementary Fig. 15). The identity of flow-sorted fractions was determined by fluorescence microscopy of chromosomes sorted onto microscope slides51. The analysis revealed that chromosomes could be separated into a few size fractions and while the sorted populations were 100% pure chromosomes, it was not possible to sort individual sugarcane chromosomes. To overcome this problem and prepare samples of chromosome-specific DNA for sequencing, single copies of chromosomes were sorted and their DNA amplified52. This strategy for preparing sugarcane chromosomes for flow cytometry was first described in ref. 51 and is a modification of the protocol described in ref. 53.

Optical map construction

Ultra-high molecular weight (uHMW) DNA was isolated from agarose-embedded nuclei as previously described in ref. 54 with some modifications. Approximately 2 g of young, healthy R570 leaves were collected and fast-frozen in a 50 ml conical tube, ground in a mortar with liquid nitrogen and briefly incubated in Bionano homogenization buffer (HB+; Bionano Plant DNA isolation Kit; Bionano Genomics). Cell debris was filtered out by sequentially passing the homogenate through 100 µm and 40 µm cell strainers. Nuclei in suspension were pelleted by centrifugation at 2,000g at 8 °C for 20 min, resuspended in 3 ml homogenization buffer HB+ and subjected to discontinuous density gradient centrifugation as described in the Plant Tissue DNA Isolation Base Protocol (Revision D; Bionano Genomics). The nuclei-enriched interphase layer was recovered, pelleted and embedded in low-melting-point agarose using a 90-µl CHEFgel electrophoresis plug mould (Bio-Rad). The resulting plug was incubated twice, for a total of 12 h at 50 °C, in Bionano Lysis buffer supplemented with 1.6 mg ml−1 Puregene Proteinase K, washed four times in Bionano Wash Buffer and five times in TE buffer. The uHMW nDNA was recovered by melting and digesting the plug with agarase at 43 °C, followed by drop dialysis. In total, approximately 9 µg uHMW DNA was recovered at a concentration of 136 ng µl−1 and used for subsequent genome mapping processes.

Genome mapping was performed using the Bionano Genomics Direct Label and Stain chemistry in a Bionano Saphyr instrument, using the method described in ref. 55, with a few modifications. Approximately 800 ng of uHMW DNA was used per reaction and a total of eight flow cells were loaded to collect molecules with a total combined length of 3,499,160 Mbp. A subset of 1,650,737 molecules with a minimum length of 450 kb, and N50 of 547 kb were selected for assembly. The final total combined length of the filtered subset was 1,097,878,758 bp, with estimated effective coverage of assembly of ×101.2.

Genome assembly was performed using the Bionano Genomics Access software platform (Bionano Tools v.1.3.8041.8044; Bionano Solve v.3.3_10252018), running the pipeline v.7981 and RefAligner v.7989. Two separated assemblies were performed using the optArguments_nonhaplotype_noES_BG_DLE1_saphyr.xml parameters. The initial assembly was performed without complex multi-path region (CMPR) cuts and produced 570 maps with a N50 length of 36.444 Mbp and total map length of 7,654.039 Mbp. One additional assembly was performed using the CMPR cut option, which introduces map cuts at potential duplications to reduce potential homeolog and phase switching. CMPR-cut-enabled assembly generated 1,512 maps with N50 length of 9.546 Mbp and total map length of 9,282.351 Mbp.

PacBio HiFi Bionano hybrid scaffolds were generated using the Bionano Genomics Access software (Tools v.1.3) and the DLE-1 configuration file hybridScaffold_DLE1_config.xml using auto-conflict resolution. In total, the genome was captured in 122 hybrid scaffolds (Scaffold N50 = 78.823 and maximum scaffold size of 131.769 Mbp. The total scaffold length was 5,074 Mbp, with 4.9 Mbp of sequence remaining un-scaffolded.

Genome assembly overview

Complete representation of all sequences in the 10 Gb genome of R570 was impossible without artificially duplicating collapsed sequences, of which there are many. To scaffold the contigs into chromosomes, we applied five complementary techniques (Supplementary Data). First, we used the Bionano optical map to initially order contigs into long-range scaffolds. Second, scaffolds were clustered into homeologous groups based on 237 linkage groups constructed from approximately 1.8 million simplex markers that were assayed from 96 self-pollinated progeny. Third, additional clustering was performed using genetic markers derived from single flow-sorted chromosome libraries sequenced from R570 (refs. 52,53). After making initial joins, both simplex and single-chromosome genetic markers were re-aligned putative chromosomes to investigate misjoins, which were broken and corrected. Fourth, we resolved overlapping scaffolds by checking for redundant collinear sets of Sorghum bicolor gene models mapped against the contigs using pblat56 with default parameters. Finally, we manually evaluated chromatin linkages from 558 Gb (approximately ×56) Hi-C data to manually verify joins made between scaffolds during chromosome construction (Extended Data Fig. 1a). The highly contiguous primary assembly (5.04 Gb, 12.6 Mb contig N50; 67 chromosomes) also includes optical scaffolds (‘os’; n = 20) and unanchored scaffolds (n = 56). The primary assembly contains 0.1% gaps with an LTR assembly index21 (LAI; measure of intact LTR elements) of 22.82, indicating the assembly is high quality and complete. Where possible, the alternate assembly (3.73 Gb, 2.1 Mb contig N50; comprised of nearly identical haplotypes in the primary assembly; discussed in Supplementary Data), was physically anchored to the most similar chromosome in the primary assembly based on best unique alignments using minimap2 (v.2.20-r1061)57. Contigs and scaffolds that did not have a single best unique alignment were left unanchored. It should be noted that this sequence similarity-based grouping does not suggest that contigs on alternative scaffolds with the same name (for example, Chr6E and Chr6E_alt) necessarily come from the same biological haplotype. Thus, we provide the alternate scaffolds to represent the complete population of sequences in R570, and not as a source for global comparisons against the primary or other reference genomes.

Collapsed haplotypes

To determine which regions of the genome were perfectly identical and collapsed into a single haplotype (in contrast to the alternate assembly that contains nearly identical haplotypes, which could be distinguished by the assembler but most often not by unique HiFi read placements), PacBio HiFi reads were re-aligned back to the assembly using minimap2 (ref. 57) (parameters: -M 0 –secondary=no –hard-mask-level -t 30 -x asm5). Read coverage (script: combinePAFsAndCount.R) was calculated using script: relative to the median depth (37) per 10 kb window, ignoring repetitive regions where the median coverage was greater than five (greater than ×185 raw coverage). Depth classifications (×0–4) were calculated from the median coverage ranges (×0 (0–0.25), ×1 (0.25–1.4), ×2 (1.4–2.3), ×3 (2.3–3.5), ×4 (3.5–5.0)), based on histogram peaks. Depth classifications per 10 kb window were converted to their run-length equivalent using the script: convertCountsToRLEs.R. To ensure accurate representation of haplotypes, NucFreq54 was used to analyse regions where haplotypes were collapsed (×2–4 depth regions; approximately 1.2 Gb of primary genome sequence). In summary, HiFi reads were aligned to the combined primary and alternate assembly using pbmm2 (v.1.1.0; parameters: –log-level DEBUG –preset SUBREAD –min-length 5,000 –sort). Samtools58 was then used to merge individual bam files (from each HiFi sequencing run) and exclude unmapped reads and supplementary alignments. (samtools view -F 2308). The NucFreq output coverage bed (obed) file was converted to run-length equivalents (script: RLEruns.R), where alternate base calls were greater than 20% of the combined coverage. To ensure adequate coverage for analysis, regions with outlier depth ranges beyond the 10th and 90th percentiles were excluded. Additionally, repetitive regions of the genome (95% repetitive, masked with a 24mer and 10 kb regions where greater than 90% of bases were annotated as retrotransposons (from LAI analysis) were also excluded using BEDtools59 subtract. Of the approximately 1.2 Gb considered, approximately 4.8 Mb of sequence (0.4% of considered regions; 0.1% of bases within constructed primary chromosomes) appear to contain non-identically collapsed haplotypes, mainly driven by high depth collapsed regions (×2–3 depth regions = 0.3% of bases; ×4 depth regions = 1.5% of bases).

Genome annotation

Gene models were annotated using our PERTRAN pipeline (described in detail in ref. 60 using approximately 3.7 B pairs of 2 × 150 stranded paired-end Illumina RNA-seq and 31 M PacBio Iso-Seq CCSs reads. In short, PERTRAN conducts genome-guided transcriptome short read assembly via GSNAP (v.2013-09-30)61 and builds splice alignment graphs after alignment validation, realignment and correction. The resulting approximately 1.5 M putative full-length transcripts were corrected and collapsed by genome-guided correction pipeline, which aligns CCS reads to the genome with GMAP61 with intron correction for small indels in splice junctions if any and clusters alignments when all introns are the same or 95% overlap for single exon. Subsequently 1,763,610 transcript assemblies were constructed using PASA (v.2.0.2)62 from RNA-seq transcript assemblies above. Homology support was provided by alignments to 17 publicly available genomes and Swiss-Prot proteomes. Gene models were predicted by homology-based predictors, FGENESH+ (v.3.1.0)63, FGENESH_EST (similar to FGENESH+, but using expressed sequence tags (ESTs) to compute splice site and intron input instead of protein/translated open reading frames (ORFs) and EXONERATE (v.2.4.0)64, PASA assembly ORFs (in-house homology constrained ORF finder) and from AUGUSTUS (v.3.1.0)65 trained by the high confidence PASA assembly ORFs and with intron hints from short read alignments. We improved these preliminary annotations by comparing sequences and gene quality between R570 subgenomes by aligning high-quality gene models between subgenomes and forming gene models from intragenomic alignments. We compared scores between these intragenomic homology-based models and the PASA assemblies; higher-scoring homology supported models that were not contradicted by transcriptome evidence were retained to replace existing partial copy. The selected gene models were subject to Pfam analysis and gene models with greater than 30% Pfam TE domains were removed. We also removed (1) incomplete, (2) low-homology-supported without full transcriptome support and (3) short single exon (less than 300 BP CDS) without protein domain nor transcript support gene models. Repetitive sequences were defined using de novo by RepeatModeler (v.open1.0.11)66 and known repeat sequences in RepBase.

Comparative genomics

Syntenic orthologs among the R570 primary annotation, S. bicolor (v.3.1)67, S. spontaneum (genotype AP85-441)32, Setaria viridis (v.2.1)68 and the R570 monoploid path16 were inferred via GENESPACE (v.0.9.4)23 pipeline using default parameters (analysis script: genespaceCommands.R). In brief, GENESPACE compares protein similarity scores into syntenic blocks using MCScanX69 and uses Orthofinder (v.2.5.4)70 to search for orthologs/paralogs within synteny constrained blocks. Syntenic blocks were used to query pairwise peptide differences among progenitor alleles, determine divergence among progenitor orthologs using S. bicolor syntenic anchors and search for progenitor specific orthogroups (scripts, PID_calc.R; GENESPACE_orthogroupParsing.R; Jupyter Notebook: r570_orthogroupProgenitorAnalysis_forSupp.ipynb).

Structural variants

To identify the large structural rearrangements (inversions, translocations and inverted translocations) and local variations (insertions and deletions), each homeologous chromosome group (B, C, D, E, F, G) was aligned to chromosome A using minimap2 (v.2.20-r1061)57 with parameter setting ‘-ax asm5 -eqx’. The resulting alignments were used to identify structural variations with SyRI (v.1.6)71 and annotation gff3 was used to obtain genes affected by variations between homeologous chromosomes.

Orthogroup diversity

Calculation of mean pairwise differences among progenitor specific homeologs was performed by first extracting all pairwise combinations of progenitor assigned alleles within orthogroups that were anchored by an S. bicolor ortholog. Among these, 25,000 peptide pairs per progenitor were randomly selected and pairwise aligned using R package Biostrings (v.2.70.2)72. Pairwise identity calculation was based on matches/alignment length (PID2; script PID_calc.R). Multiple sequence alignments among syntenic orthogroups for sugar transport gene candidates were performed using MAFFT (v.7.487)73 and were visualized using ggmsa74 (script MSAalignmentPlots.R). Fold scores for each peptide were calculated using ESMfold (v.2.0.1)75.

Resistance gene analogues

RGAs were annotated on scaffolds larger than 10 megabases with NLR-Annotator (v.2)38 using default parameters. The 4,116 predicted RGAs (Supplementary Table 11) were assigned to progenitors by intersecting the location of each motif with progenitor assignment blocks (Supplementary Table 6).

Progenitor divergence

To determine the neutral substitution rate between S. officinarum and S. spontaneum, 45,000 random ortholog pairs were extracted from all pairwise combinations of progenitor assigned alleles (n = 193,815) within S. bicolor anchored orthogroups. Peptide sequence pairs were aligned using MAFFT (v.7.487)73 and converted into coding sequence (CDS) using pal2nal (v.13)76. Pairwise synonymous mutation rates (Ks) among sequences were calculated using seqinr (v.4.2-16)77, finding a single synonymous (ks) mutation peak at 0.012 (Supplementary Fig. 13). Assuming a neutral nuclear mutation rate of 0.383 × 10−8 to 0.386 × 10−8 (ref. 78), S. officinarum and S. spontaneum diverged approximately 1.55–1.56 million years ago.

Bru1 genetic and physical maps

We developed a map-based cloning approach adapted to the high polyploid context of sugarcane to target the durable major rust resistance gene Bru1. Haplotype-specific chromosome walking was performed through fine genetic mapping exploiting 2,383 individuals from self-progenies of R570 and physical mapping exploiting two BAC libraries44,79. The high-resolution genetic map of the targeted region included flanking markers for Bru1 (at 0.14 and 0.28 cM), 13 co-segregating markers and the partial BAC physical map of the target haplotype included two gaps44; Fig. 3b. To complete the physical map of the target Bru1 haplotype, we constructed a new BAC library (using enzyme BamHI) using a mix of DNA from four brown-rust-resistant individuals from the R570 S1 population. The BAC library contained 119,040 clones with an average insert size of 130 kb and covered 3.2-fold the target haplotype and 1.6-fold the total genome.

BAC-ends and BAC subclones from the four BACs (CIR009O20, 022M06, CIR012E03 and 164H22) surrounding the two remaining gaps (‘left’ and ‘right’) in the physical map of the Bru1 haplotype were isolated and used for chromosome walking (as described in ref. 44). Two BACs (CIRB251D13 (150 kb) and CIRB286F09 (130 kb)) were identified and sequenced to fill the right gap. Five BACs (CIRB009N07 (100 kb), CIRB114G05 (100 kb), CIRB127D08 (125 kb), CIRB210D07 (105 kb) and CIRB236L05 (150 kb)) reduced the size of the left gap by 35 kb, but an unsized gap remained. The R570 genome assembly spanned the entirety of the Bru1 target haplotype region with one contig, closing the left gap (99,750 bp) enabling all candidate genes in the region to be investigated (Fig. 3b).

Bru1 candidate genes

The target gap-filled haplotype that represented 0.42 cM and 309 kb was manually annotated, predicting a total of 13 genes (Fig. 3b and Supplementary Table 13). Nine of these genes were also present on all or some of the hom(e)ologous BACs/haplotypes in the R570 genome27. Three of the curated genes were present only in the insertion specific to the Bru1 haplotype. Other whole-genome annotated genes (SoffiXsponR570.03Dg024000; SoffiXsponR570.03Dg024100; SoffiXsponR570.03Dg024600; SoffiXsponR570.03Dg024700) in the region were short, mono-exonic peptides that either contained no protein domains or appeared to be annotated transposable elements, and thus were not supported in the curated candidate gene list (Supplementary Table 13). Among the 13 predicted genes, we searched genes that presented high homology with genes already shown to be involved in resistance mechanisms. We identify five such genes, four genes encoding serine/threonine kinases (genes 1, 5, 7 and 8) and one gene encoding an endoglucanase (gene 13). Annotation of these genes was refined manually through phylogenetic analysis that included genes with high homology from other plants present in databases and search of conserved functional protein domains.

Gene 13, which encodes an endoglucanase, comprised 3 exons and two introns with a genomic size of 1.8 kb for a predicted transcript of 1.5 kb. Sequence alignment and phylogenetic analyses performed with beta-1-4 endoglucanase and beta-1-3 endoglucanase from monocots and dicots showed that gene 13 belongs to the beta-1-4 endoglucanase. This gene presents high homology (greater than 60%) with beta-1-4 endoglucanase from other plants and has the highest homology (88% of identity, 100% coverage) with the orthologous Miscanthus gene (CAD6248271.1). Beta-1-4 endoglucanases are involved in cell development80 in particular on elongation of the cell wall81 but have not been reported as involved in disease resistance. This suggested that this gene is not a good candidate for being Bru1.

Gene 1 is composed of eight exons and seven introns. Its genomic size is 4.3 kb and the CDS size is 882 bp. The protein encoded by the gene has 96.5 % identity (100% coverage) with a kinase involved in cell division control in Sorghum (XP_002451427.1) and therefore, it did not appear to be a good candidate.

Gene 5 is composed of six exons and five introns. Its genomic size was 1.1 kb and the predicted CDS size 534 bp. Alignment of its amino acid sequence with Interpro conserved protein domain database showed that only part of the protein (exons 4 to 6) has homology with subdomains VIb to XI of the serine/threonine kinases. This serine/threonine kinase was thus not complete, lacking some of the functional subdomains and appeared to be a pseudogene. Therefore, it did not appear to be a good candidate.

Gene 7 is composed of six exons and five introns, and gene 8 has four exons and three introns. Both present homology with receptor-like kinases. Annotation of conserved protein domains showed that gene 7 has all the 12 subdomains of kinases and thus could encode a functional protein, while gene 8 encompasses only part of these sub domains (I to VII) and could correspond to a pseudokinase. The classification with the ITAK database (http://itak.feilab.net/cgi-bin/itak/index.cgi) revealed they both belong to the RLK-PELLE-DSLV family45, the same family to which belong the barley stem rust resistance gene (RPG1 (ref. 46)) and the wheat yellow rust resistance gene (Yr15 (ref. 47)) shown to be a tandem kinase-pseudokinase (TKP). In addition, the third intron of gene 7 has a very large size of approximately 11 kb, including a large TE, a particular structure shared with RPG1 and Yr15 TKPs. Bru1, like RPG1 and Yr15, is among the relatively rare resistance genes that confer durable fungal resistance. This tandem kinase-pseudokinase (TKP7 and TKP8) is therefore a solid candidate for Bru1.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



Source link

By AUTHOR

Leave a Reply

Your email address will not be published. Required fields are marked *