Multi-regional sampling of human surgically resected LUADs and NL tissues
Table of Contents
Study participants were evaluated at the MD Anderson Cancer Center and underwent standard-of-care surgical resection of early-stage LUAD (I–IIIA). Samples from all patients were obtained from banked or residual tissues under informed consent and approved by MD Anderson institutional review board protocols. Residual surgical specimens were then used for derivation of multi-regional samples for single-cell analysis (Supplementary Table 1). Immediately following surgery, resected tissues were processed by an experienced pathologist assistant. One side of the specimen was documented and measured, followed by tumour margin identification. Based on the placement of the tumour within the specimen, incisions were made at defined collection sites in one direction along the length of the specimen and spanning the entire lobe: tumour-adjacent and tumour-distant normal parenchyma at 0.5 cm from the tumour edge and from the periphery of the overall specimen or lobe, respectively. An additional tumour-intermediate normal tissue sample was selected for patients P2–P16 and ranged between 3 and 5 cm from the edge of the tumour. Sample collection was initiated with NL tissues that are farthest from the tumour moving inward towards the tumour to minimize cross-contamination during collection.
Single-cell isolation from tissue samples
Fresh tissues from human donors and mouse lungs were collected in RPMI medium supplemented with 2% FBS and maintained on ice for immediate processing. Tissues were placed in a cell culture dish containing Hank’s balanced salt solution (HBSS) on ice, and extra-pulmonary airways and connective tissue were removed with scissors. Samples were transferred to a new dish on ice and minced into about 1 mm3 pieces followed by enzymatic digestion. For human tissues, the enzymatic solution was composed of collagenase A (10103578001, Sigma Aldrich), collagenase IV (NC9836075, Thermo Fisher Scientific), DNase I (11284932001, Sigma Aldrich), dispase II (4942078001, Sigma Aldrich), elastase (NC9301601, Thermo Fisher Scientific) and pronase (10165921001, Sigma Aldrich) as previously described40. For mouse lung digestion, the enzymatic solution was composed of collagenase type I (CLS-1 LS004197, Worthington), elastase (ESL LS002294, Worthington) and DNase I (D LS002007, Worthington). Samples were transferred to 5 ml LoBind Eppendorf tubes and incubated in a 37 °C oven for 20 min with gentle rotation. Samples were then filtered through 70 μm strainers (Miltenyi Biotech, 130-098-462) and washed with ice-cold HBSS. Filtrates were then centrifuged and resuspended in ice-cold ACK lysis buffer (A1049201, Thermo Fisher Scientific) for red blood cell lysis. Following red blood cell lysis, samples were centrifuged and resuspended in ice-cold FBS, filtered (using 40 μm FlowMi tip filters; H13680-0040, Millipore) and an aliquot was taken to count cells and check for viability by Trypan blue (T8154, Sigma Aldrich) exclusion analysis.
Sorting and enrichment of viable lung epithelial singlets
Single cells from patient P1 were stained with Sytox Blue viability dye (S34857, Life Technologies) and processed on a FACS Aria I instrument. Cells from P2–P16 were stained with anti-EPCAM-PE (347198, BD Biosciences; 1:50 dilution in ice-cold PBS containing 2% FBS) for 30 min with gentle rotation at 4 °C. Mouse lung single cells were similarly stained but with a cocktail of antibodies (1:250 each) against CD45-PE/Cy7 (103114, BioLegend), ICAM2-A647 (A15452, Life Technologies), EPCAM-BV421 (118225, BioLegend) and ECAD-A488 (53-3249-80, eBioscience). Stained cells were then washed, filtered using 40 μm filters, stained with Sytox Blue (human) or Sytox Green (mouse) and processed on a FACS Aria I instrument (gating strategies for epithelial cell sorting are shown in Supplementary Figs. 1 and 4 for human and mouse cells, respectively). Doublets and dead cells were eliminated, and viable (Sytox-negative) epithelial singlets were collected in PBS containing 2% FBS. Cells were washed again to eliminate ambient RNA, and a sample was taken for counting by Trypan Blue exclusion before loading on 10X Genomics Chromium microfluidic chips.
Preparation of single-cell 5′ gene expression libraries
Up to 10,000 cells per sample were partitioned into nanolitre-scale Gel beads-in-emulsion (GEMs) using a Chromium Next GEM Single Cell 5′ Gel Bead kit v.1.1 (1000169, 10X Genomics) and by loading onto Chromium Next GEM Chips G (1000127, 10X Genomics). GEMs were then recovered to construct single-cell gene expression libraries using a Chromium Next GEM Single Cell 5′ Library kit (1000166, 10X Genomics) according to the manufacturer’s protocol. In brief, recovered barcoded GEMs were broken and pooled, followed by magnetic bead clean-up (Dynabeads MyOne Silane, 37002D, Thermo Fisher Scientific). 10X-barcoded full-length cDNA was then amplified by PCR and analysed using a Bioanalyzer High Sensitivity DNA kit (5067-4626, Agilent). Up to 50 ng of cDNA was carried over to construct gene expression libraries and was enzymatically fragmented and size-selected to optimize the cDNA amplicon size before 5′ gene expression library construction. Samples were then subjected to end-repair, A-tailing, adaptor ligation and sample index PCR using Single Index kit T Set A (2000240, 10X Genomics) to generate Illumina-ready barcoded gene expression libraries. Library quality and yield were measured using a Bioanalyzer High Sensitivity DNA kit (5067-4626, Agilent) and a Qubit dsDNA High Sensitivity Assay kit (Q32854, Thermo Fisher Scientific). Indexed libraries were normalized by adjusting for the ratio of the targeted cells per library as well as individual library concentration and then pooled to a final concentration of 10 nM. Library pools were then denatured and diluted as recommended for sequencing on an Illumina NovaSeq 6000 platform.
scRNA-seq data processing and quality control
Raw scRNA-seq data were pre-processed (demultiplex cellular barcodes, read alignment and generation of gene count matrix) using Cell Ranger Single Cell Software Suite (v.3.0.1) provided by 10X Genomics. For read alignment of human and mouse scRNA-seq data, human reference GRCh38 (hg38) and mouse reference GRCm38 (mm10) genomes were used, respectively. Detailed quality control metrics were generated and evaluated, and cells were carefully and rigorously filtered to obtain high-quality data for downstream analyses15. In brief, for basic quality filtering, cells with low-complexity libraries (in which detected transcripts were aligned to <200 genes such as cell debris, empty drops and low-quality cells) were filtered out and excluded from subsequent analyses. Probable dying or apoptotic cells in which >15% of transcripts derived from the mitochondrial genome were also excluded. For scRNA-seq analysis of Gprc5a−/−;SftpccreER/+;RosaSun1GFP/+ mice, cells with ≤500 detected genes or with a mitochondrial gene fraction that is ≥15% were filtered out using Seurat41.
Doublet detection and removal, and batch effect evaluation and correction
Probable doublets or multiplets were identified and carefully removed through a multi-step approach as described in previous studies15,42. In brief, doublets or multiplets were identified based on library complexity, whereby cells with high-complexity libraries in which detected transcripts are aligned to >6,500 genes were removed and, based on cluster distribution and marker gene expression, whereby doublets or multiplets forming distinct clusters with hybrid expression features and/or exhibiting an aberrantly high gene count were also removed. Expression levels and proportions of canonical lineage-related marker genes in each identified cluster were carefully reviewed. Clusters co-expressing discrepant lineage markers were identified and removed. Doublets or multiplets were also identified using the doublet detection algorithm DoubletFinder43. The proportion of expected doublets was estimated based on cell counts obtained before scRNA-seq library construction. Data normalization was then performed using Seurat41 on the filtered gene–cell matrix. Statistical assessment of possible batch effects was performed on non-malignant epithelial cells using the R package ROGUE36, an entropy-based statistic, as described in previous studies15,42 and Harmony44 was run with default parameters to remove batch effects present in the PCA space.
Unsupervised clustering and subclustering analysis
The function FindVariableFeatures of Seurat41 was applied to identify highly variable genes for unsupervised cell clustering. PCA was performed on the top 2,000 highly variable genes. The elbow plot was generated with the ElbowPlot function of Seurat and, based on which, the number of significant principal components (PCs) was determined. The FindNeighbors function of Seurat was used to construct the shared nearest neighbour (SNN) graph based on unsupervised clustering performed using the Seurat function FindClusters. Multiple rounds of clustering and subclustering analyses were performed to identify major epithelial cell types and distinct cell transcriptional states. Dimensionality reduction and 2D visualization of cell clusters was performed using UMAP45 and the Seurat function RunUMAP. The number of PCs used to calculate the embedding was the same as that used for clustering. For analysis of human epithelial cells, ROGUE was used to quantify cellular transcriptional heterogeneity of each cluster. Subclustering analysis was then performed for low-purity clusters identified by ROGUE. Hierarchical clustering of major epithelial subsets was performed on the Harmony batch-corrected PCA dimension reduction space. For malignant cells, except for global UMAP visualization, downstream analyses, including identification of large-scale CNVs, inference of cancer cell differentiation states, quantification of meta-program expression, trajectory analysis and mutation analysis, were performed without Harmony batch correction. The hierarchical tree of human epithelial cell lineages was computed based on Euclidean distance using the Ward linkage method, and the dendrogram was generated using the R function plot.hc. For scRNA-seq analysis of Gprc5a−/− mice, the top-ranked ten PCs were selected using the elbowplot function. SNN graph construction was performed with resolution parameter = 0.4, and UMAP visualization was performed with default parameters. For scRNA-seq analysis of Gprc5a−/−;SftpccreER/+;RosaSun1GFP/+ mice, the top-ranked 20 Harmony-corrected PCs were used for SNN graph construction, and unsupervised clustering was performed with resolution parameter = 0.4. UMAP visualization was performed with the RunUMAP function with min.dist = 0.1. Differentially expressed genes (DEGs) of clusters were identified using the FindAllMarkers function with FDR-adjusted P value < 0.05 and log2(fold change) > 1.2.
Identification of malignant cells and mapping KRAS codon 12 mutations
Malignant cells were distinguished from non-malignant subsets based on information integrated from multiple sources as described in previous studies15,42. The following strategies were used to identify malignant cells. (1) Cluster distribution: owing to the high degree of inter-patient tumour heterogeneity, malignant cells often exhibit distinct cluster distribution compared with normal epithelial cells. Although non-malignant cells derived from different patients are often clustered together by cell type, malignant cells from different patients probably form separate clusters. (2) CNVs: we applied inferCNV16 (v.1.3.2) to infer large-scale CNVs in each individual cell with T cells as the reference control. To quantify CNVs at the cell level, CNV scores were aggregated using a previously described strategy16. In brief, arm-level CNV scores were computed based on the mean of the squares of CNV values across each chromosomal arm. Arm-level CNV scores were further aggregated across all chromosomal arms by calculating the arithmetic mean value of the arm-level scores using the R function mean. (3) Marker gene expression: expression of lung epithelial lineage-specific genes and LUAD-related oncogenes was determined in epithelial cell clusters. (4) Cell-level expression of KRASG12D mutations: as we previously described15, BAM files were queried for KRASG12D mutant alleles, which were then mapped to specific cells. KRASG12D mutations, along with cluster distribution, marker gene expression and inferred CNVs as described above, were used to distinguish malignant cells from non-malignant cells. Following clustering of malignant cells from all patients, an absence of malignant cells that were identified from P12 or P16 was noted. This can be possibly attributed to the low number of epithelial cells captured in tumour samples from these patients (Supplementary Table 2).
Mapping KRAS codon 12 mutations
To map somatic KRAS mutations at single-cell resolution, alignment records were extracted from the corresponding BAM files using mutation location information. Unique mapping alignments (MAPQ = 255) labelled as either PCR duplication or secondary mapping were filtered out. The resulting somatic variant carrying reads were evaluated using Integrative Genomics Viewer (IGV)46 and the CB tags were used to identify cell identities of mutation-carrying reads. To estimate the VAF of KRASG12D mutation and cell fraction of KRASG12D-carrying cells within malignant and non-malignant epithelial cell subpopulations (for example, malignant cells from all LUADs, malignant cells from KM-LUADs, KACs from KM-LUADs), reads were first extracted based on their unique cell barcodes and BAM files were generated for each subpopulation using samtools (v.1.15). Mutations were then visualized using IGV, and VAFs were calculated by dividing the number of KRASG12D-carrying reads by the total number of uniquely aligned reads for each subpopulation. A similar approach was used to visualize KRASG12C–carrying reads and to calculate the VAF of KRASG12C in KACs of normal tissues from KM-LUAD cases. To calculate the mutation-carrying cell fraction, extracted reads were mapped to the KRASG12D locus from BAM files using AlignmentFile and fetch functions in pysam package. Extracted reads were further filtered using the ‘Duplicate’ and ‘Quality’ tags to remove PCR duplicates and low-quality mappings. The number of reads with or without KRASG12D mutation in each cell was summarized using the CB tag in read barcodes. Mutation-carrying cell fractions were then calculated as the ratio of the number of cells with at least one KRASG12D read over the number of cells with at least one high-quality read mapped to the locus.
PCA analysis of malignant cells and quantification of transcriptome similarity
Raw unique molecular identifier counts of identified malignant cells were log-normalized and used for PCA analysis using Seurat (RunPCA function). PCA dimension reduction data were extracted using the Embeddings function. The top three most highly ranked PCs were exported for visualization using JMP (v.15). 3D scatterplots of PCA data were generated using the scatterplot 3D tool in JMP (v.15). Bhattacharyya distances were calculated using the bhattacharyya.dist function from the R package fpc (v.2.2-9). The top 25 highly ranked PCs were used for both patient-level and cell-level distance calculations. For Bhattacharyya distance quantification at the cell level, 100 cells were randomly sampled for each patient group defined by driver mutations (for example, KM-LUADs). The random sampling process was repeated 100 times, and pairwise Bhattacharyya distances were then calculated between patient groups. Differences in Bhattacharyya distances between patient groups were tested using Wilcoxon rank-sum tests, and boxplots were generated using the geom_boxplot function from the R package ggplot2 (v.3.2.0).
Determination of non-malignant cell types and states
Non-malignant cell types and states were determined based on unsupervised clustering analysis following batch effect correction using Harmony44. Two rounds of clustering analysis were performed on non-malignant cells to identify major cell types and cell transcriptional states within major cell types. Clustering and UMAP visualization of human normal epithelial cells (Extended Data Fig. 1a) were performed using Seurat with default parameters. Specifically, the parameters k.param = 20 and resolution = 0.4 were used for SNN graph construction and cluster identification, respectively. UMAP visualization was performed with default parameters (min.dist = 0.3). For clustering analysis of airway and alveolar epithelial cells, the RunPCA function was used to determine the most contributing top PCs for each subpopulation and similar clustering parameters (k.param = 20 and resolution = 0.4) were used for SNN graph construction and cluster identification. UMAP plots were generated with min.dist = 0.3 using the RunUMAP function in Seurat. Density plots of alveolar intermediate cells were generated using the stat_densit_2d function in the R package ggplot2 (v.3.3.5) with the first two UMAP dimension reduction data as the input. DEGs for each cluster were identified using the FindMarkers function in Seurat with a FDR-adjusted P < 0.05 and a fold change cut-off > 1.2. Canonical epithelial marker genes from previously published studies by our group and others15,47,48 were used to annotate normal epithelial cell types and states. Bubble plots were generated for select DEGs and canonical markers to define AT1 cells (AGER1+ETV5+PDPN+), AT2 cells (SFTPB+SFTPC+ETV5+), SCGB1A1+SFTPC+ dual-positive cells, AICs (AGER1+ETV5+PDPN+ and SFTPB+SFTPC+), club and secretory cells (SCGB1A1+SCGB3A1+CYP2F1+), basal cells (KRT5+TP63+), ciliated cells (CAPS+PIFO+FOXJ1+), ionocytes (ASCL3+FOXI+), neuroendocrine cells (CALCA+ASCL1+) and tuft cells (ASCL2+MGST2+PTGS1+). KACs were identified by unsupervised clustering of AICs and defined based on previously reported marker genes25,26,49, including significant upregulation of the following genes relative to other alveolar cells: KRT8, CLDN4, PLAUR, CDKN1A and CDKN2A.
Scoring of curated gene signatures
Genes in previously defined ITH MPs19 were downloaded from the original study. Among a total of 41 consensus ITH MPs identified, MPs with unassigned functional annotations (unassigned MPs 38–41; n = 4), neural and haematopoietic lineage-specific MPs (MPs 25–29, MPs 33–37; n = 10) and cell-type-specific MPs irrelevant to LUAD (MPs 22–24 secreted/cilia, MP 32 skin-pigmentation; n = 4) were filtered out, resulting in 23 MPs that closely correlated with hallmarks of cancer and that were used for further analysis. Signature scores were computed using the AddModuleScore function in Seurat as previously described15,42. The KRAS signature used in this study was derived by calculating DEGs between the KRAS mutant malignant-cell-enriched cluster and other malignant cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2; Extended Data Fig. 2i). Human and mouse KAC signatures and the human ‘other AIC’ signature were derived by calculating DEGs using FindAllMarkers among alveolar cells (FDR-adjusted P value < 0.05, log(fold change) > 1.2). Mouse genes in the p53 pathway were downloaded from the Molecular Signature Database (MSigDB; https://www.gsea-msigdb.org/gsea/msigdb/mouse/geneset/HALLMARK_P53_PATHWAY; MM3896). Signature scores for KACs, other AICs, KRAS and p53 were calculated using the AddModuleScore function in Seurat.
Analysis of alveolar cell differentiation states and trajectories
Analysis of differentiation trajectories of lung alveolar and malignant cells was performed using Monocle 2 (ref. 50) by inferring the pseudotemporal ordering of cells according to their transcriptome similarity. Monocle 2 analysis of malignant cells from P14 was performed using default parameters with the detectGenes function. Detected genes were further required to be expressed by at least 50 cells. For construction of the differentiation trajectory of lineage-labelled epithelial cells (GFP+), the top 150 DEGs (FDR-adjusted P value < 0.05, log(fold change) > 1.5, expressed in ≥50 cells) ranked by fold-change of each cell population from NNK-treated samples were used for ordering cells with the setOrderingFilter function. Trajectories were generated using the reduceDimension function with the method set to ‘DDRTree’. Trajectory roots were selected based on the following criteria: (1) inferred pseudotemporal gradient; (2) CytoTRACE score prediction; and (3) careful manual review of the DEGs along the trajectory. To depict expression dynamics of ITH MPs19, ITH MP scores were plotted along the pseudotime axis and smoothed lines were generated using the smoother tool in JMP Pro (v.15). Using the raw counts without normalization as input, CytoTRACE18 was applied with default parameters to infer cellular differentiation states to complement trajectory analysis and further understand cellular differentiation hierarchies. The normalmixEM function from the R package mixtools was used to determine the CytoTRACE score threshold in AICs with k = 2. A final threshold of 0.58 was used to dichotomize AICs into high-differentiation and low-differentiation groups. The Wasserstein distance metric was applied using R package transport (v.0.13) to quantify the variability of distribution of CytoTRACE scores. The function wasserstein1d was used to calculate the distance between the distribution of actual CytoTRACE scores of one patient and the distribution of simulated data with identical mean and standard deviation. The robustness of Monocle 2-based pseudotemporal ordering prediction was validated by independent pseudotime prediction tools including Palantir51, Slingshot52 and Cellrank53. Slingshot (v.2.6.0) pseudotime prediction was performed using slingshot function with reduceDim parameter set to ‘PCA’ and other parameters set to defaults. Cellrank prediction was performed using the CytoTRACEKernel function with default parameters from Cellrank python package (v.1.5.1). Palantir prediction was performed using Palantir python package (v.1.0.1). A diffusion map was generated using run_diffusion_maps function with n_components = 5. Palantir prediction was generated using run_palantir function with num_waypoints = 500 and other parameters set to defaults. Inferred pseudotime by the three independent methods was then integrated with that generated by Monocle 2 for each single cell, followed by pairwise mapping and correlation analysis. Cell density plots were generated using Contour tool in JMP (v.15) with n = 10 gradient levels and contour type parameter set to ‘Nonpar Density’. To assess the pseudotime prediction consistency between Monocle 2 and the three independent methods, Spearman’s correlation coefficients were calculated and statistically tested using cor.test function in R.
ST data generation and analysis
ST profiling of formalin fixation and paraffin-embedding (FFPE) tissues from P14 with LUAD and of three lung tissues from two Gprc5a−/− mice was performed using the Visium platform from 10X Genomics according to the manufacturer’s recommendations and as previously reported54. P14 FFPE tissues were collected from areas adjacent to the tissues analysed by scRNA-seq. Regions of interest per tissue or sample, each comprising a 6.5 × 6.5 mm capture area, were selected based on careful annotation of H&E-stained slides that were digitally acquired using an Aperio ScanScope Turbo slide scanner (Leica Microsystems). HALO software (Indica Labs) was used for pathological annotation (tumour areas, blood vessels, bronchioles, lymphoid cell aggregates, macrophages, muscle tissue, normal parenchyma and reactive pneumocytes) of H&E histology images. Spot-level histopathological annotation and visualization was generated using loupe browser (v.6.3.0). In brief, cloupe files generated from Space Ranger were loaded into the loupe browser. Visualization of annotation was then generated in svg formats using the export plot tool. ST RNA-seq libraries were generated according to the manufacturer’s instructions, each with up to about 3,600 uniquely barcoded spots. Libraries were sequenced on an Illumina NovaSeq 6000 platform to achieve a depth of at least 50,000 mean read pairs per spot and at least 2,000 median genes per spot.
Demultiplexed raw sequencing data were aligned, and gene level expression quantification was generated with analysis pipelines as previously described54. In brief, demultiplexed clean reads were aligned against the UCSC human GRCh38 (hg38) or the GRCm38 (mm10) mouse reference genomes by Spaceranger (v.1.3.0 for human ST data and v.2.0.0 for mouse ST data) and using default settings. Generated ST gene expression count matrices were then analysed using Seurat (v.4.1.0) to perform unsupervised clustering analysis. Using default parameters, the top-ranked 30 PCA components were used for SNN graph construction and clustering and for UMAP low-dimension space embedding with default parameters. UMAP analysis was performed using the RunUMAP function. The SpatialDimPlot function was used to visualize unsupervised clustering. The R package inferCNV16 was used for copy number analysis. Reference spots used in CNV analysis were selected on the basis of careful review of cluster marker genes using the DotPlot function from Seurat and inspection of pathological annotation. CNV scores were calculated by computing the standard deviations of CNVs inferred across 22 autosomes. Loupe browser (v.6.3.0) was used for visualization of pathological annotation results. Expression levels of genes of interest (for example, KRT8) as well as signatures of interest (for example, KAC and KRAS) were measured and directly annotated on histology images with pixel-level resolution using the TESLA (v.1.2.2) machine learning framework55 (https://github.com/jianhuupenn/TESLA). TESLA can compute superpixel-level gene expression and detect unique structures within and surrounding tumours by integrating information from high-resolution histology images. The annotation and visualize_annotation functions were used to annotate regions with high signature signals. KRT8, PLAUR, CLDN4, CDKN1A and CDKN2A were used for ‘KAC markers’ signature annotation in the human ST analysis. For mouse ST data, Krt8, Plaur, Cldn4, Cdkn1a and Cdkn2a were used for ‘KAC signature’ annotation. Gene level expression visualization of Krt8 and Plaur was generated using the scatter function from scanpy (v.1.9.1). Deconvolution analysis was conducted using CytoSPACE56 (https://github.com/digitalcytometry/cytospace). Annotated scRNA-seq data were first transformed into a compatible format using function generate_cytospace_from_scRNA_seurat_object. Visium spatial data were prepared using the function generate_cytospace_from_ST_seurat_object. Deconvolution was performed using CytoSpace function (v.1.0.4) with default parameters. To determine neighbouring cell composition for a specific cell population in Visium data, CytoSPACE was first applied to annotate every spot with the most probable cell type. Neighbouring spots were defined as the six spots surrounding each spot and, accordingly, the neighbouring cell composition for specific cell types were computed. Trajectory construction of ST data was performed using Monocle 2 (ref. 18) with the DDRTree method using DEGs with FDR-adjusted P value < 0.05.
Bulk DNA extraction and WES
Total DNA was isolated from homogenized cryosections of human lung tissues and, when available, from frozen peripheral blood mononuclear cells (PBMCs) using a Qiagen AllPrep mini kit (80204) or a DNeasy Blood and Tissue kit (69504), respectively (both from Qiagen) according to the manufacturer’s recommendations. Qubit 4 Fluorometer (Thermo Fisher Scientific) was used for measurement of DNA yield. TWIST-WES was performed on a NovaSeq 6000 platform at a depth of 200× for tumour samples and 100× for NL and PBMCs to analyse recurrent driver mutations and using either PBMCs or distant NL tissues when blood draw was not consented, as germline control. WES data were processed and mapped to the human reference genome, and somatic mutations were identified and annotated as previously described57,58 with further filtration steps. In brief, only MuTect59 calls marked as ‘KEEP’ were selected and taken into the next step. Mutations with a low VAF (<0.02) or low alt allele read coverage (<4) were removed. Then, common variants reported by ExAc (the Exome Aggregation Consortium, http://exac.broadinstitute.org), Phase-3 1000 Genome Project (http://phase3browser.1000genomes.org/Homo_sapiens/Info/Index) or the NHLBI GO Exome Sequencing Project (ESP6500) (http://evs.gs.washington.edu/EVS/) with minor allele frequencies greater than 0.5% were further removed. Intronic mutations, mutations at 3′ or 5′ UTR or UTR-flanking regions, and silent mutations were also removed. The mutation load in each tumour was calculated as the number of nonsynonymous somatic mutations (nonsense, missense, splicing, stop gain, stop loss substitutions as well as frameshift insertions and deletions).
Survival analysis
Analysis of OS in the TCGA LUAD and PROSPECT60 cohorts was performed as previously described15. KRAS mutation status in TCGA LUAD samples was downloaded from cBioPortal (https://www.cbioportal.org, study ID: luad_tcga_pan_can_atlas_2018). For TCGA dataset, clinical data were downloaded from the PanCanAtlas study18. The logrank test and Kaplan–Meier methods were used to calculate P values between groups and to generate survival curves, respectively. Statistical significance testing for all survival analyses was two-sided. To control for multiple hypothesis testing, Benjamini–Hochberg method was applied to correct P values, and FDR q values were calculated where applicable. Results were considered significant at P value or FDR q value of <0.05. Multivariate survival analysis was performed using a Cox proportional hazards regression model that calculated the hazard ratio, the 95% confidence interval and P values when using pathologic stage, age, KAC and ‘other AIC’ signatures as covariables.
Analysis of public datasets
Publicly available datasets were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) under accession numbers GSE149813, GSE154989, GSE150263, GSE102511 and GSE219124. Details of the studies28,29 analysed are as follows: GSE149813 investigated single lung cells from KrasLSL-G12D;LSL-YFP mice with Ad5CMV-Cre infection29; GSE154989 studied AT2 lineage-labelled cells from lungs of KrasLSL-G12D/+;Rosa26LSL-tdTomato/+ mice28. Gene expression count matrices of dataset interrogating KrasG12D-driven mouse model from GSE149813 were pre-processed using Seurat following the same filtering steps in that original report. For the GSE154989 dataset28, cells used for analysis were the ones labelled as “PASSED_QC” in supplementary table S7 in that study. For the GSE149813 dataset29, cells with >500 median number of genes detected and <10% fraction of mitochondrial genome derived reads, and according to the pre-processing methods described in their original report29, were retained for analysis. Cells with >7,500 number of genes detected were further filtered to remove potential doublets or multiplets, resulting in 8,304 cells in total for downstream analysis. Both datasets were integrated with mouse cell data generated in this study using Harmony18 with default parameters settings. The top ranked 20 Harmony-corrected PCs were used for clustering with the FindClusters function using resolution = 0.4. UMAP dimension reduction embedding was performed using the RunUMAP function with the same set of Harmony-corrected PCs. Gene expression levels and frequencies of representative cluster marker genes were visualized using DotPlot function from Seurat. The KAC signature score was calculated using the AddModuleScore function from Seurat. The mouse KAC signature was also studied in human AT2 cells with and without inducible KRASG12D (dataset GSE150263) also from ref. 29. Cell filtration criteria described in the original report29 were followed to filter out potential dead cells and doublets (number of detected genes > 800 and the percent of mitochondrial gene reads fraction < 25%). The 20 top-ranked PCs were used for clustering using the FindClusters function with resolution = 0.1. UMAP dimension reduction embeddings were computed using the same SNN graph. The KAC signature score was calculated using AddModuleScore function from Seurat package.
The bulk RNA-seq dataset GSE102511 was a previously published dataset by our group and comprised normal lung tissues, precursor AAHs and matched LUADs (n = 15, each)61. The previously published62 bulk RNA-seq data GSE219124 were generated on cancer stem cell and stem cell-like progenitor cells, in the form of spheres, and their parental MDA-F471 counterparts (a cell line we had developed and cultured from a KM-LUAD of an NNK-exposed Gprc5a−/− mouse)62. To interrogate the association of KACs with tumour formation, gene expression matrices of bulk RNA-seq data GSE102511 (TPM count matrix) and GSE219124 (FPKM count matrix) were extracted and used for quantification of KAC signature expression using MCPcounter (v.1.2.0) R package. Heatmaps were generated using pheatmap (v.1.0.12) R package.
Mouse KACs from this study were compared to mouse Krt8+ transitional cells involved in alveolar regeneration post-acute lung injury from a previous study25. Overlapping marker genes between mouse KACs and the previously reported Krt8+ transitional cells were statistically evaluated using the ggvenn (v.0.1.9) R package using the top-ranked 50 marker genes based on fold change from each study.
Digital spatial profiling of human tissues
The following antibodies were used for digital spatial profiling (DSP): claudin 4 (clone 3E2C1, AF594, LSBio, LS-C354893, concentration 0.5 µg ml–1) and keratin 8 (clone EP1628Y, AF647, Abcam, ab192468, concentration 0.25 µg ml–1). Optimization of antibodies was performed with different dilutions using colorectal carcinoma and LUAD tissues. IF staining was performed on three cases of matched LUAD and NL using the standard GeoMx DSP protocol for morphology markers only (PanCk: clone AE1/AE3, AF532, concentration 0.25 µg ml–1, from GeoMx Solid Tumour Morp kit HsP, 121300301, Novus Biologicals). Slides were scanned at ×20 using the GeoMx DSP platform (NanoString Technologies). Following scanning, multiplex IF image slides were visualized, adjusting channel thresholds for each fluorophore. Expression of KRT8, PanCK and CLDN4 was assessed in adenocarcinoma cells, adjacent reactive lung tissue and distant non-reactive lung tissue.
Animal housing and tobacco carcinogen exposure experiments
Animal experiments were conducted according to Institutional Animal Care and Use Committee (IACUC)-approved protocols at the University of Texas MD Anderson Cancer Center. Mice were maintained in a pathogen-free animal facility. No statistical methods were used to predetermine sample sizes. In all animal experiments, sex-matched and age-matched mice were randomized to treatment groups. For all experiments and until end points were reached (up to 7 months after exposure to saline or NNK), mice were monitored for signs of ill health and their body weight was measured to ensure weight loss did not exceed 20% of body weight over 72 h. None of the mice developed these symptoms; therefore, they were all euthanized after reaching IACUC-approved end points. End points permitted by our IACUC protocols were not exceeded in any of the experiments. Analysis of data from animal experiments was performed in a blinded fashion. To study KACs in the context of KM-LUAD pathogenesis in vivo, Gprc5a−/− mice were interrogated because they form LUADs that are accelerated by tobacco carcinogen exposure and acquire somatic KrasG12D mutations—features that are highly pertinent to KM-LUAD development21,63,64 and therefore to exploring KACs in this setting. Gprc5a−/− mice were generated as previously described21,65. Sex-matched and age-matched Gprc5a−/− mice were divided into starting groups of 4 mice per exposure (NNK or saline control) and time point (EOE or 7 months after exposure, n = 16 mice in total). Eight-week-old mice were intraperitoneally injected with 75 mg kg–1 of body weight NNK or vehicle 0.9% saline (control), 3 times per week for 8 weeks. At EOE or at 7 months after exposure, lungs were collected for derivation of live single cells for scRNA-seq. Whole lungs from additional mice treated as described above were processed by FFPE and for analysis by IF (n = 2 mice per treatment group at EOE and 7 months after exposure, 8 mice in total) and ST (3 lung tissues from n = 2 mice at 7 months after NNK exposure).
SftpccreER/+;RosaSun1GFP/+ mice were provided by H. Chapman (University of California, San Francisco) and were crossed to Gprc5a−/− mice to generate Gprc5a−/−;SftpccreER/+;RosaSun1GFP/+ mice for analysis of lineage-labelled AT2 cells. Gprc5a−/−;SftpccreER/+;RosaSun1GFP/+ mice were treated with 75 mg kg–1 NNK or control saline (intraperitoneally), 3 times per week for 8 weeks. At week 6 of treatment (2 weeks before EOE), mice from both groups received 250 µg (intraperitoneally) tamoxifen dissolved in corn oil for four consecutive days. At EOE or 3 months after exposure to saline or NNK, lungs were digested to derive live (Sytox Blue-negative) GFP+ single cells by flow cytometry using a FACS Aria I instrument as previously described66 (the gating strategy for GFP cell sorting is shown in Supplementary Fig. 6). Sorted single cells were analysed by scRNA-seq (GFP+ and GFP− fractions from n = 2 mice per treatment at 3 months after exposure to saline and NNK) or used to derive organoids (GFP+ cells from n = 4 or 5 mice at EOE to saline or NNK, respectively, and from n = 10 or 13 mice at 3 months after saline or NNK, respectively). Whole lungs from additional mice treated with saline or NNK and tamoxifen as described above (n = 2 per treatment group) were collected (FFPE) at 3 months after NNK and analysed by IF.
Krt8-creER;RosatdT/+ animals were used to generate Gprc5a−/−;Krt8-creER;RosatdT/+ mice for analysis of lineage-labelled KRT8+ cells. Krt8-creER (stock number 017947) and RosatdT/+ (Ai14; stock number 007914) mice were obtained from the Jackson Laboratory. Mice harbouring Krt8-creER;RosatdT/+ were first used for pilot studies to examine labelling of KRT8+ cells. Mice were exposed to control saline (n = 2 mice) or to 8 weeks of NNK (n = 3 mice) as described above followed by 1 mg tamoxifen for 6 continuous days, after which lungs were analysed at the end of tamoxifen exposure. To examine the relevance of labelled KRT8+ cells to tumour development, Gprc5a−/−;Krt8-creER;RosatdT/+ mice were similarly exposed to NNK for 8 weeks followed by tamoxifen, and lungs were then analysed at 8–12 weeks after NNK exposure (n = 3 mice). All lungs were collected and processed for formalin fixation, OCT embedding and IF analysis.
Histopathological and IF analysis of mouse lung tissues
Lungs of Gprc5a−/− mice (n = 2 per treatment and time point) were inflated with formalin by gravity drip inflation, excised, examined for lung surface lesions by macroscopic observation and processed for FFPE, sectioning and H&E staining. Stained slides were digitally scanned using an Aperio ScanScope Turbo slide scanner (Leica Microsystems) at ×200 magnification, and visualized using ImageScope software (Leica Microsystems). Unstained lung tissue sections were obtained for IF analysis of LAMP3 (clone 391005, Synaptic Systems), KRT8 (TROMA-I clone from the University of Iowa DSHB) and PDPN (clone 8.1.1, from the University of Iowa DSHB). Lung FFPE tissue samples were obtained in the same manner from Gprc5a−/−;SftpccreER/+;RosaSun1GFP/+ mice at 3 months after exposure to saline or NNK (n = 2 mice per condition) and following injection with tamoxifen. Tissue sections were obtained for H&E staining and assessment of tumour development, and unstained sections were used for IF analysis using antibodies against GFP (AB13970, Abcam, 1:5000), LAMP3 (391005, Synaptic Systems, 1:10,000), KRT8 (TROMA-I, University of Iowa Developmental Studies Hybridoma Bank, 1:100), PDPN (clone 8.1.1, University of Iowa Developmental Studies Hybridoma Bank, 1:100), claudin 4 (ZMD.306, Invitrogen, 1:250), and PRKCDBP (cavin 3, Proteintech, 1:250). Slides were then stained with fluorophore-conjugated secondary antibodies and 4′,6′-diamidino-2-phenylindole (DAPI). Sections were mounted with Aquapolymount (18606, Polysciences), cover slipped, imaged using an Andor Revolution XDi WD spinning disk confocal microscope and analysed using Imaris software (Oxford Instruments).
Formalin-inflated lung lobes from Krt8-creER;RosatdT/+ mice were cryoprotected in 20% sucrose in PBS containing 10% OCT compound (4583, Tissue-Tek) overnight on a rocker at 4 °C and embedded in OCT. The next day, 10 µm cryosections were blocked in PBS with 0.3% Triton X-100 and 5% normal donkey serum (017-000-121, Jackson ImmunoResearch) and incubated overnight in a humidified chamber at 4 °C with primary antibodies diluted in PBS with 0.3% Triton X-100 and raised against NKX2-1 (sc-13040, Santa Cruz, 1:1000), LAMP3 (same as above) and KRT8 (same as above). The next morning, sections were washed followed by incubation with secondary antibodies (Jackson ImmunoResearch) and DAPI. Slides were then washed, cover slipped as described above and imaged using a Nikon A1plus confocal microscope. Cell counter ImageJ plugin was used to count tdT+ cells within lesions and cells in normal-appearing areas, namely: AT2 cells (LAMP3+), tdT+ AT2 cells (tdT+LAMP3+), AT1 cells (LAMP3–NKX2-1+, avoiding noticeable airways) and tdT+ AT1 cells (tdT+NKX2-1+LAMP3–). Percentages of tdT+LAMP3+ and tdT+NKX2-1+LAMP3– cells out of total tdT+ cells were computed. Counts were averages of triplicate images taken at ×20 magnification for each time point. The percent regional surface area covered by tdT+ cells in normal-appearing regions was estimated by examining the tdT expression across entire lobe sections for each replicate.
3D culture and analysis of AT2-derived organoids
Gprc5a−/−;SftpccreER/+;RosaSun1GFP/+ were treated with NNK or saline and tamoxifen as described above, and they were euthanized at EOE (4 saline-treated and 5 NNK-treated mice) or at 3 months after exposure (10 saline-treated and 13 NNK-treated mice). Lungs were collected, dissociated into single cells (see mouse single-cell derivation in the Methods section ‘Single-cell isolation from tissue samples’), and live (Sytox Blue-negative) GFP+ single cells were collected by flow cytometry using a FACS Aria I instrument as previously described66. GFP+ AT2 cells from NNK-treated or saline-treated groups were immediately washed and resuspended at a concentration of 5,000 cells per 50 µl of 3D medium (F12 medium supplemented with insulin, transferrin and selenium, 10% FBS, penicillin–streptomycin and l-glutamine). GFP+ cells were mixed at a 1:1 ratio (by volume) with 50,000 mouse endothelial cells (collected from mouse lungs by CD31 selection and expanded in vitro as previously described67) and resuspended in 50 µl of Geltrex reduced growth factor basement membrane matrix (A1413301, Gibco). Next, 100 µl of 1:1 GFP+:endothelial cell mixture was plated on Transwell inserts with 0.4 µm pores and allowed to solidify for 30 min in a humidified CO2 incubator (EOE: n = 3 wells per condition; 3 months after exposure: n = 4 wells for saline-derived organoids and n = 12 wells for NNK-derived organoids). Each well was then supplemented with 3D medium containing ROCK inhibitor (Y-27632, Millipore) and recombinant mouse FGF-10 (6224-FG, R&D Systems), and plates were incubated at 37 °C in a humidified CO2 incubator. Wells were replenished with 3D medium every other day. For GFP+ organoids derived from mice exposed to NNK, 200 nM KRAS(G12D)-specific inhibitor MRTX1133 or DMSO vehicle was added to the medium and replenished 3 times a week (n = 6 wells per condition). Organoids were monitored and analysed twice a week using an EVOS M7000 imaging system (Thermo Fisher Scientific), whereby the numbers and sizes of organoids greater than 100 µm in diameter were recorded. At end point, 3D organoids were collected from the basement membrane matrix using Gentle Cell Dissociation reagent (100-0485, StemCell Technologies), fixed with 4% paraformaldehyde, permeabilized, blocked and stained overnight at 4 °C with a mixture of IF primary antibodies raised against LAMP3, GFP, KRT8 and cavin 3. The next day, organoids were washed and stained with fluorophore-conjugated secondary antibodies overnight at 4 °C while being protected from light. Organoids were washed and stained with DAPI nuclear stain for 30 min, after which they were collected in Aqua-Poly/Mount (18606-20, Polysciences) and transferred to slides. Images of organoids were captured using an Andor Revolution XDi WD spinning disk confocal microscope and analysed using Imaris software (Oxford Instruments).
2D viability assays
Mouse mycoplasma-free LUAD cell lines LKR13 (mutant KrasG12D-driven31) and MDA-F471 (Gprc5a−/− and KrasG12D mutant27) were plated on 96-well plates (103 cells per well) and grown in DMEM (Gibco) supplemented with 10% FBS, 1% antibiotic antimycotic solution (A5955, Sigma-Aldrich) and 1% l-glutamine (G7513, Sigma-Aldrich). The next day, cells were cultured for up to 4 days with medium containing 0.5% FBS, 0.5% FBS with 50 ng ml–1 epidermal growth factor (EGF) (E5160, Sigma-Aldrich), or 0.5% FBS with EGF and varying concentrations of MRTX1133 (Mirati Therapeutics). alamarBlue Cell Viability reagent (25 µl; DAL1025, ThermoFisher) was added to each well. At 4 days after treatment, viability was assessed by fluorescence spectrophotometry at 570 nm (and 600 nm as a reference). For the wells showing net positive absorbances relative to blank wells (at least 3 wells per cell line and condition), the percent differences in reduction between treated and control wells were calculated.
Western blot analysis
LKR13 and MDA-F471 cells were plated in 6-well plates (106 cells per well) and grown under different conditions as described above. Protein lysates were extracted at 3 h after treatment and analysed by western blotting following overnight incubation with antibodies to the following primary proteins: vinculin (E1E9V, rabbit, Cell Signaling Technology, 13901; 1:1,000); phosphorylated p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9101; 1:2,000); phosphorylated S6 ribosomal protein (Ser 235/236, rabbit, Cell Signaling Technology, 4858; 1:2,000); p44/42 MAPK (ERK1/2, rabbit, Cell Signaling Technology, 9102; 1:2,000); or S6 (E.573.4, rabbit, Invitrogen, MA5-15164; 1:1,000). This was followed by 1 h of incubation with diluted secondary antibody (1706515 goat anti-rabbit IgG-HRP conjugate, Bio-Rad). Protein lysates from each cell line were analysed on multiple gels (four per cell line) with Precision Plus Protein Dual Color Standard (1610394, Bio-Rad) as the ladder and blotted to membranes to separately probe for phosphorylated and total forms of the same proteins, which have highly similar molecular weights (using phospho-specific antibodies or antibodies targeting total version of same protein). Vinculin protein levels were evaluated as loading control on each of the blots. Four blots (phospho-ERK, total ERK, phospho-S6 and total S6) for each of LKR13 and MDA-F471 are shown in Supplementary Fig. 9, each with its own analysis of equal protein loading (vinculin blot) and whereby only the ones indicated with green rectangles are presented in Extended Data Fig. 12c. Membranes were cut horizontally using molecular weight marker as a guide, and cut membranes were incubated with the specified antibodies (see Supplementary Fig. 9 for site of cutting and for overlay of colorimetric and chemiluminescent images of the same blot to display ladder and the analysed protein, respectively). Blots were imaged using the ChemiDoc Touch Imaging System (Bio-Rad) with Chemiluminescence and Colorimetric (for protein ladder) applications and auto expose or manual settings.
Chemicals and reagents
Tobacco-specific carcinogen (NNK) with a purity of 99.96% by HPLC was purchased from TargetMol. Tamoxifen and H&E staining reagents were purchased from Sigma Aldrich. The KRAS(G12D) inhibitor MRTX1133 was provided by J. Christensen (Mirati Therapeutics).
Statistical analyses
In addition to the algorithms and statistical analyses described above, all other basic statistical analyses were performed in the R statistical environment (v.4.0.0). The Kruskal–Wallis H-test was used to compare variables of interests across three or more groups. Wilcoxon rank-sum test was used for paired comparisons among matched samples from the same patients. Wilcoxon rank-sum test was used to compare other continuous variables such as gene expression levels and signature scores between groups. Spearman’s correlation coefficient was calculated to assess associations between two continuous variables (for example, cellular proportions and gene signature scores). Fisher’s exact test was used to identify differences in frequencies of groups based on two categorical variables. Ordinal logistic regression was performed using the polr function in the built-in R package MASS (v.7.3). Benjamin–Hochberg method was used to control for multiple hypothesis testing. All statistical tests performed in this study were two-sided. Results were considered significant at P values or FDR q values < 0.05. When a P value reported by R was smaller than 2.2e-16, it was reported as P < 2.2 × 10−16.
Ethics declarations
All human LUAD and normal lung tissues were obtained from patients who provided informed consent and under institutional review board-approved protocols at The University of Texas MD Anderson Cancer Center. All human data in this manuscript are deidentified to ensure patient privacy. All animal studies were conducted under IACUC-approved protocols at the University of Texas MD Anderson Cancer Center.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.