Pan-cancer WGS data sources
Table of Contents
GrCh37/hg19 BAM alignments for 2,489 primary tumour and matched normal whole-genome sequencing data were obtained as previously described18. In brief, 989 tumour–normal (T/N) pairs were obtained from The Cancer Genome Atlas (TCGA) Research Network (Genomic Data Commons at https://portal.gdc.cancer.gov/, accession: phs000178.v11.p8). Additional WGS data were obtained for 874 T/N pairs from the International Cancer Genome Consortium (ICGC) from multiple studies publicly available through the European Genome-phenome Archive (EGA; https://ega-archive.org). These cohorts include: 124 breast cancers21 (EGA: EGAS00001001178), 179 melanomas43 (EGA: EGAS00001001552), 49 lung adenocarcinomas44 (EGA: EGAS00001002801), 422 oesophageal adenocarcinomas45 (EGA: EGAD00001004417) and 100 malignant lymphomas (EGA: EGAD00001002123).
Additional BAMs for 121 T/N pairs from a pan-cancer cohort obtained as part of a New York City-based multi-institution collaborative research effort comprising the Memorial Sloan Kettering Cancer Center (MSKCC), New York University, Stony Brook University Hospital, Lenox Hill, Northwell Health, Columbia University, Montefiore, and Cornell, and led by the New York Genome Center, were included here and were previously described18. Study approval was obtained through a central institutional review board (IRB), Biomedical Research Alliance of New York, and by local IRBs, including Stony Brook University and Northwell Health. In addition, 55 prostate cancers that were previously published were obtained through dbGaP with accession phs000447.v1.p1 (ref. 15). BAMs for 80 T/N pairs were obtained from a collaborative precision oncology effort between the Weill Cornell Englander Institute for Precision Medicine (EIPM) and the New York Genome Center. This study was approved by an institutional review board (WCM IRB no. 1305013903). A total of 340 T/N pairs across 80 cases across longitudinally or spatially distinct biopsies from Barrett’s oesophagus tumours were obtained as part of a previous study46.
Call sets were obtained from 1,484 additional T/N pairs contributing additional primary tumour whole genomes from the Pan-Cancer Analysis of Whole Genomes Consortium47 (Extended Data Fig. 1, ‘PCAWG’ dataset, https://dcc.icgc.org/pcawg) and 3,957 T/N pairs from metastatic whole genomes from the Hartwig Medical Foundation (HMF, https://www.hartwigmedicalfoundation.nl/), which included germline, somatic SNV or indel, and somatic SV calls48 (Extended Data Fig. 1: ‘HMF’ dataset).
MSKCC cohort
LR WGS and short-read WGS were performed on a cohort of 46 cases biopsied for ductal carcinomas of breast and found to have BRCA1 (n = 28) or BRCA2 (n = 18) mutations on clinical panel sequencing. These cases were collected under informed consent as part of a prospective biospecimen research protocol at the Memorial Sloan Kettering Cancer Center (MSKCC, MSKCC IRB no. 16–675). T/N pairs were profiled with Illumina short-read WGS and LR WGS (see below for protocol details). Raw sequencing data from these experiments have been made available (see ‘Data availability’ section; Extended Data Fig. 1: ‘MSKCC’ dataset).
Pipelines
Harmonized variant calling was performed on 2,489 T/N BAM file pairs by adapting previously described pipelines18. Additional details are provided below.
SV calling
In brief, genome-wide, 200-bp binned tumour and normal read depth was calculated from alignments and corrected for GC and mappability biases (https://github.com/mskilab-org/fragCounter). Somatic SV calls were obtained with SvAbA49 and filtered using a panel-of-normals (PON) comprising all germline SVs detected across 2,489 T/N pairs. Any somatic SV found within 500 bp of a junction within the germline SV PON with matching orientations was discarded. PCAWG consensus SVs and 200-bp binned tumour and normal read depths were obtained from PCAWG SV release 1.6 and the PCAWG data coordination centre.
HMF SV data were obtained from the Hartwig Medical Foundation through a data sharing agreement48. In brief, junction calls from GRIDSS50 and 1-kbp tumour/normal coverage ratios17 corrected for GC content were obtained for 3,957 T/N pairs.
Genome graph analysis
High-confidence junctions, binned tumour-normal read depth ratios, and purity and ploidy estimates (see below) were used to perform junction balance analysis (JaBbA; github.com/mskilab-org/JaBbA) and generate balanced genome graphs for 7,918 and 46 cases in the pan-cancer WGS and MSKCC datasets, respectively. For a detailed treatment of the formulation behind JaBbA, see a previous report18. Heterozygous germline single-nucleotide polymorphism (SNP) data were used to infer allelic copy number after total copy number inference was performed genome-wide as described18.
Purity and ploidy estimation
Across Hadi dataset cases, tumour purity and ploidy were estimated using ASCAT51. For the 46 cases from the MSKCC cohort, a manual review of purity and ploidy was conducted to enhance downstream genotyping accuracy; ultimately, alternative manual estimates of purity and ploidy from MSKCC were chosen for 4 out of 46 cases. For PCAWG and HMF datasets, purity and ploidy estimates were obtained from the respective PCAWG (https://dcc.icgc.org/releases/PCAWG/) and HMF (https://www.hartwigmedicalfoundation.nl/en/database/) portals48.
LR WGS SV calling
For the LR WGS profiles generated from the MSKCC cohort of 46 cases, junctions called using LinkedSV52 were merged with SvAbA junctions called on the corresponding short-read WGS for each case. These were then input into JaBbA using short-read coverage profiles to generate genome graphs. Merging was performed using the gGnome R package (https://github.com/mskilab-org/gGnome) to determine junctions that were uniquely detected by LR WGS (LinkedSV).
Analysis of gap segment topology
Gap segments were defined as short genomic segments joining reference-consecutive break ends, each belonging to distinct junctions and occurring on opposite strands. Each gap segment was additionally associated with a polarity (+ or −) based on the topology of junctions around the gap segment; (+) polarity corresponded to a gap segment with junctions directly attached to it, (−) polarity corresponded to a gap segment with junctions attached to the two segments to the left and right of the gap segment on the reference. The length threshold to define gap segments was visually chosen as 1 Mbp after inspection of a density plot of segments lengths across gap segment candidates satisfying the above topological criteria. This threshold was confirmed through the application of a background model, in which the gap segment candidate length distribution in each sample was fit with an exponential distribution and each gap segment candidate was assigned a P value according to the left tail of the exponential cumulative distribution function. Short (less than 1 Mbp) gap segment candidates were found to significantly deviate from expectation (false discovery rate (FDR) < 0.10) under this model.
Applying this definition, gap segments with shared junctions were next clustered together (applying ‘eclusters’ gGraph method in the gGnome R package) to identify and classify reciprocal clusters. Reciprocal clusters in which every junction was connected to two gap segments was labelled as ‘cyclic’. Reciprocal clusters were annotated with the number of cluster-associated junctions and gap segments. A reciprocal pair (rPair) is a special case of a cyclic reciprocal cluster that contains two gap segments of either orientation. rDups, rDels and rDelDups each contain two (+) gap segments, two (−) gap segments and one (+) and one (−) gap segment, respectively.
Annotating known SV events
Classes of previously described18 simple and complex SV were annotated in balanced genome graphs derived by JaBbA for both the pan-cancer WGS (n = 7,918) and MSKCC datasets (n = 46). The following simple events were annotated within each graph: deletions, duplications, translocations, inversions and inverted duplications. The following complex events were annotated: breakage–fusion–bridge cycles, double minutes, tyfonas, chromoplexy, chromothripsis and TICs. Implementation of each event classifier can be found in the ‘events’ function in the gGnome R package.
Variant calling and genotyping
For the 2,489 ‘Hadi’ dataset WGS T/N pairs (Extended Data Fig. 1), somatic mutation calls were generated with Strelka1 for SNVs and indels. Germline mutation calls were obtained with Strelka2 run on alignments from blood or adjacent normal samples and filtered to remove common variants above a population allele frequency of 1% (ExAC population: ftp.broadinstitute.org/pub/ExAC_release/release0.3.1/subsets/). SNVs and indels were filtered through a universal genome-wide mask for hg19 (https://github.com/lh3/CHM-eval) to remove artefacts due to low mappability, as described before53. All germline and somatic SNVs and indels were annotated with ClinVar status (ftp.ncbi.nlm.nih.gov:/pub/clinvar/vcf_GRCh37; database date 2022-07-30). The impacts of protein-coding SNVs and indels were also annotated through SnpEff (GRCh37.75 database). SNVs and indels were considered pathogenic if annotated as ‘pathogenic’ or ‘likely_pathogenic’ through ClinVar CLNSG or if their SnpEff annotation fell within the following classes: ‘frameshift variant’, ‘start_lost’, ‘stop_gained’, ‘stop_lost’, ‘splice_acceptor_variant’ or ‘splice_donor_variant’. ClinVar annotation took precedence over SnpEff.
LOH was determined by allele-specific copy number (CN) using allele counts across germline heterozygous SNP sites. Specifically, LOH was called in regions in which minor allele CN = 0 and major allele CN > 0, using allelic copy number as inferred from short-read sequencing data (see ‘Junction balance analysis’). Similarly, homozygou deletions (homdels) were called in regions in which total copy number (sum of major and minor allele CN) = 0.
Genotype was determined across samples for 48 HR-related genes (Supplementary Table 1), including BRCA1, BRCA2, PALB2 and RAD51C. Eleven of these genes were highlighted in a previous study54. Biallelic loss was called for genes if they contained any of the following: (1) two or more germline and/or somatic pathogenic mutations (including SNVs, indels and SVs); (2) one germline or somatic pathogenic mutation along with LOH; or (3) a homozygous deletion. Within the MSKCC cohort, 22 cases were found to have biallelic loss of BRCA1, 14 cases were found with biallelic loss of BRCA2, and one case was found with biallelic loss of both BRCA1 and BRCA2.
For PCAWG dataset cases, somatic SNV and indel calls were obtained from ICGC (2016 data freeze), and annotated driver mutations were obtained from the PCAWG consortium47. HMF provided the following for cases in their dataset: germline SNVs and indels (through GATK HaplotypeCaller), somatic SNVs and indels (through Strelka1 and annotated by SnpEff).
Short-read WGS
Short-read WGS for the 46 MSKCC donors was performed at the New York Genome Center to a target of 80× tumour and 40× normal coverage. Library preparation from genomic DNA for these new cases was performed using the NEBNext Ultra II End Repair/dA-Tailing Module, NEBNext Ultra Ligation Module (New England Biolabs) and KAPA Dual-Indexed Adapter Kit (Roche) following the manufacturers’ protocols. Quality control was performed on the finished libraries with the Agilent 2100 Bioanalyzer on the High Sensitive DNA Chip platform (Agilent Technologies) and KAPA Library Quantification Kit (Roche). Quality control determined that libraries contained an average peak height (fragment size) of at least 400 bp. Libraries were sequenced on an Illumina NovaSeq 6000 System (Illumina) to generate paired-end 2 × 150-bp reads. Reads were aligned to the GRCh37/hg19 reference using Burrows–Wheeler aligner software55, bwa mem, 0.7.10-r789). Read post-processing was done in accordance with best practices for post-alignment data processing with Picard tools (https://broadinstitute.github.io/picard/) to mark duplicates, the GATK (v.2.7.4) (https://gatk.broadinstitute.org/hc/en-us) IndelRealigner module and GATK base quality recalibration.
LR WGS
Each of the 46 BRCA1- or BRCA2-mutant cases in the MSKCC cohort was subjected to additional LR WGS. High-molecular-weight (HMW) genomic DNA (gDNA) was extracted using a Qiagen MagAttract HMW DNA Kit (Qiagen) according to the suggested protocol. In brief, approximately 1–2 million cells were obtained from each frozen tissue sample and lysed, and HMW gDNA was captured by magnetic particles. Then the magnetic particles with HMW gDNA were washed in wash buffer and eluted in EB Buffer (10 mM Tris-HCl, pH 8.5). The HMW gDNA had a mode length of 50 kbp and max length 200 kbp, as estimated on a separate 75-V pulse-field gel electrophoresis using a BluePippin 5–430-kbp protocol (Sage Science). LR WGS library preparation was performed using a Chromium Genome Library Kit v2 (10X Genomics) following the Chromium Genome Reagent Kits v2 User Guide. In brief, 1 ng of extracted HMW gDNA was used to prepare a library, with an average fragment length of 625 bp (ranging from 300 to 2,000 bp, measured with the Agilent Bioanalyzer High Sensitivity DNA Chip). Quality control for the finished libraries was performed as above for the general WGS library preparation. The prepared libraries were sequenced on an Illumina NovaSeq 6000 Sequencing System (Illumina) with an S4 flow cell, to an average read depth of about 33×. All linked reads were aligned to GRCh37/hg19 with the EMerAld aligner (v.0.6.2)56. Germline haplotypes were obtained from Strelka2 germline SNV calls processed using HapCut2 (https://github.com/vibansal/HapCUT2; ref. 57).
Phasing rearranged haplotypes with LR WGS
Our specific goal in somatic phasing was to distinguish SVs that placed both reciprocal junctions on the same molecule (cis) from those that placed junctions at distant loci (trans, including distinct derivative chromosomes) in the cancer genome (Extended Data Fig. 3a). Somatic phasing is distinct from parental phasing, which determines whether reciprocal break ends arose on the same or distinct parental homologue. Namely, break ends that arise on the same parental homologue (germline cis phase) can end up on distinct derivative chromosomes (somatic trans phase).
We approached somatic phasing by assessing LR WGS molecule support for derivative rearranged haplotypes. Derivative rearranged haplotypes can be deconvolved from junction-balanced genome graphs as walks18. Walks were derived from JaBbA graphs on the MSKCC cohort for which both LR and short-read WGS were available using the ‘walks’ gGraph method in the the gGnome package (https://github.com/mskilab-org/gGnome). Barcoded reads were matched against each possible walk within a 100-kbp window of the junctions to be phased (gGnome score.walks function). The walk (or set of walks) that carried the largest number of junction-supporting barcodes was considered the likeliest haplotype explaining the rearrangement.
Specifically with respect to the rDup and rDelDup patterns, two sets of possible derivative haplotypes exist: cis and trans. Cis haplotypes for rDup or rDelDup patterns are walks that contain both involved rearrangements consecutively, or, in other words, contain a single segment that is flanked by both junctions. Trans haplotypes are two separate walks that each contain one of the two rDup or rDelDup rearrangements and would have to exist simultaneously, and thus are considered as a single set of walks. Junction-supporting LR barcodes were counted for each possible walk across every rDup or rDelDup instance in the MSKCC cohort. To assess whether the rDup or rDelDup patterns existed with the junctions in cis or trans, walk-supporting LR barcode counts were compared among the individual cis walks and the trans walks summed together. The walk (or set of walks in the trans case) was taken to be the derivative haplotype underlying the rDup or rDelDup pattern. Haplotypes were also validated by visually assessing the barcode sharing patterns for each rDup or rDelDup present in the dataset to confirm the haplotype as labelled by this heuristic.
Imputing short-read-sequencing reciprocal pair haplotypes
To impute the haplotype phase of reciprocal pairs identified by short-read WGS, we applied a threshold of 3.5 to the log10 gap length. Specifically, for rDup reciprocal pairs, the imputed haplotype phase was cis if the minimum of the two log10 gap lengths was less than 3.5, and trans otherwise. For rDelDup pairs, the imputed haplotype phase was cis if the log10 length of the (+)-polarity gap was less than 3.5, and trans otherwise. The imputed phase of all rDel pairs was trans because this is the only phase possible given the junction topology.
Sequence homeology
‘Homeology’ refers to approximate (higher than 80%) similarity between a pair of genomic sequences. To assess sequences at junction-associated break ends, we applied a sliding bin approach. For every position within a 200-bp window around each break end pair, a 41-bp bin centred at the base was queried for the corresponding hg19 reference sequence. All pairs of 41-bp bins within each junction-associated 200-bp window were then aligned to one another to construct a 200-by-200 matrix of Levenshtein edit distances. The distance matrix was the converted to a similarity matrix (Fig. 4a heat map) in which each entry ij indicates the sequence similarity, calculated as (1 − Levenshtein edit distance)/41, between a pair of 41-mers corresponding to bins i and j in the junction-associated window. The matrix was then converted to a binary bitmap image in which each pixel denoted sequence similarity of >0.8. Connected components of pixels in the image were annotated with the Pearson’s correlation of the associated pixel indices, which was used as a measure of linearity of the pattern. Each junction was then annotated with the size (in pixels) of the largest connected component with a linearity of at least r2 > 0.9. This value thus represents the longest contiguous stretch of bases with at least 80% sequence similarity. This procedure was run using the ‘homeology’ function in the GxG R package (see ‘Code availability’).
Discordant and split reads supporting junctions with homeology were realigned to hg19 using bwa mem (implemented using an R wrapper in the package RSeqLib), to obtain uniform mapping quality scores for those cases containing junction homeology within the in-house dataset in which alignments were present. Reference mappability was determined using two orthogonal means. In the first, sliding 150-mers stepping by one base were queried across hg19 and aligned to the full reference using bwa mem to determine mapping quality scores. Average mapping quality was determined for each base for hg19. The second method used GEM mappability score with a sliding 150-mer across hg19 as described previously58.
Mutational signatures
Mutational signatures were derived from the signature.tools.lib R package suite for implementing the HRDetect algorithm2. In brief, SNV signatures were deconvolved using the known signature weights from COSMIC SNV signature version 2 (https://cancer.sanger.ac.uk/signatures/signatures_v2/, available through the signature.tools.lib R package59) with an implementation of non-negative least squares (‘SignatureFit’ function from the signature.tools.lib package). With the same approach, JaBbA-derived SVs were classified into the 32 SV types on the basis of size, topology and junction clustering as previously described21, and were fit to rearrangement signatures derived from 560 breast cancers. Microhomology in small deletions was searched in 3′ flanking sequence for up to 25 bases. The HRD-LOH index was determined by the number of segments per genome larger than 15 Mbp (but under the span of an entire chromosome) containing LOH.
Classifying HR, BRCA1 and BRCA2 deficiency
To build classifiers distinguishing overall HR deficiency, BRCA1 deficiency and BRCA2 deficiency, random forests (RFs; from the randomForest R package) were trained on a dataset of pan-cancer primary tumours consisting of 62 BRCA1d, 66 BRCA2d and 2,536 controls that were confidently HRP (lacking CLINVAR pathogenic, CLINVAR VUS, truncating or missense mutation in BRCA1, BRCA2, RAD51, RAD51B, RAD51C, RAD51D and PALB2 and LOH in BRCA1 or BRCA2). The following six features were counted for each case using the R package signature.tools.lib: COSMIC SNV signatures 3 and 8; rearrangement signatures 3 and 5; HRD-LOH index; and proportion of deletions with microhomology. rDups, rDels and rDelDups were also counted after annotation on each genome graph.
To evaluate the performance of RFs, ROC curves and corresponding AUROCs were computed on an independent test set of pan-cancer metastatic tumours (HMF dataset, Extended Data Fig. 1) consisting of 40 BRCA1d, 92 BRCA2d and 1,834 HRP controls using the pROC R package (v.1.18.0, https://cran.r-project.org/web/packages/pROC/). Feature importance was determined by resampling the test set across 30 bootstraps with permutation. The decrease in accuracy after permuting each feature on the test set was calculated.
In the following two comparisons, classifier skill to discriminate overall HR deficiency from HR proficiency was analysed by using the full (Hadi, MSKCC, and PCAWG) training set and evaluating the resulting models on the full (HMF) test set (Extended Data Fig. 1). An SV-only RF was trained on rDups, rDels, rDelDups, homeologous deletions, duplications with length 10–100 kbp, RS3 and RS5 as features and compared against an RF trained on rearrangement signatures 3 and 5 as features to compare the efficacy of the classes of SVs described in this manuscript against the established SV types previously used in HRDetect2. A full RF consisting of currently described features and previously established features (rDups, rDels, rDelDups, homeologous deletions, duplications with length 10–100 kbp, RS3, RS5, MH-dels, SNV3, SNV8 and LOH score) was trained and then tested against the published HRDetect model (consisting of MH-dels, SNV3, SNV8, RS3, RS5 and LOH score as features) using ROC curves and feature importance metrics. HRDetect scores were obtained by running the function ‘applyHRDetectDavies2017’ from the signature.tools.lib R package on a feature matrix composed of test samples.
The third comparison evaluated classifier skill to discriminate BRCA1 deficiency from BRCA2 deficiency. For this test, the full RF trained with current and previous features was used to compare against a RF trained with HRDetect-only features. In contrast to the above, ROC and feature importance evaluation were performed on only the 40 BRCA1d and 92 BRCA2d cases from the test dataset (Extended Data Fig. 1).
Statistical information
All statistical analysis was performed as stated in the figure legends using the R programming language (v.4.0.2). P values obtained that are smaller than 2.2 × 10−16 are not accurately estimated in R and are listed as such (‘P < 2.2 × 10−16’). Generalized linear modelling was performed using the ‘glm’ or ‘glm.nb’ function from the stats or MASS R packages. Wilcoxon rank-sum testing was performed using the ‘wilcox.test’ function from the stats R package. Fisher’s exact test was performed using the function ‘fisher.test’ from the stats R package. ROC curves were generated using the function ‘roc’ from the R package ‘pROC’. Comparison of ROC curves was done using the function ‘roc.test’ from the R package ‘pROC’ with the argument ‘method = ‘delong’. Statistical methods were not used to predetermine sample size. The study design did not involve blinding or randomization.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.