Genetic Ancestry Contributes to Somatic Mutations in Lung Cancers from Admixed Latin American Populations

Page created by Walter Farmer
 
CONTINUE READING
Genetic Ancestry Contributes to Somatic Mutations in Lung Cancers from Admixed Latin American Populations
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

Research Brief

Genetic Ancestry Contributes to Somatic
Mutations in Lung Cancers from Admixed
Latin American Populations
Jian Carrot-Zhang1,2,3, Giovanny Soca-Chafre4, Nick Patterson2,3, Aaron R. Thorner1,
Anwesha Nag1, Jacqueline Watson1,2, Giulio Genovese2,3, July Rodriguez5, Maya K. Gelbard1,
Luis Corrales-Rodriguez6,7, Yoichiro Mitsuishi8, Gavin Ha9, Joshua D. Campbell10, Geoffrey R. Oxnard1,
Oscar Arrieta4,11, Andres F. Cardona5,12, Alexander Gusev1,2,13, and Matthew Meyerson1,2,3

               abstract            Inherited lung cancer risk, particularly in nonsmokers, is poorly understood.
                                   Genomic and ancestry analysis of 1,153 lung cancers from Latin America revealed
               striking associations between Native American ancestry and their somatic landscape, including tumor
               mutational burden, and specific driver mutations in EGFR, KRAS, and STK11. A local Native American
               ancestry risk score was more strongly correlated with EGFR mutation frequency compared with global
               ancestry correlation, suggesting that germline genetics (rather than environmental exposure) underlie
               these disparities.

               Significance: The frequency of somatic EGFR and KRAS mutations in lung cancer varies by ethnic-
               ity, but we do not understand why. Our study suggests that the variation in EGFR and KRAS mutation
               frequency is associated with genetic ancestry and suggests further studies to identify germline alleles
               that underpin this association.

Introduction                                                                  mutations is higher in lung adenocarcinomas from patients
                                                                              in East Asia (∼45%) compared with lung adenocarcinomas
  Lung cancer causes more than 1.7 million deaths per year                    from patients in Europe or patients of European (EUR) and/
worldwide (1) and kills more people than any other malig-                     or African (AFR) descent in North America (∼10%; refs. 4–8).
nancy in Latin America (2). Lung adenocarcinoma is the most                   In Latin American countries, the frequency of somatic EGFR
common subtype of lung cancer that is typically driven by                     mutations in lung adenocarcinoma ranges from roughly
genomic alterations of genes in the receptor tyrosine kinase                  14% in Argentina, to 25% to 34% in Colombia, Brazil, and
(RTK)/RAS/RAF pathway (3), often allowing effective thera-                    Mexico, to 51% in Peru (refs. 9–11; Fig. 1A). Moreover, recent
peutic targeting by RTK and other pathway inhibitors. It is well              genomic studies from East Asian (EAS) and AFR populations
known, but mysterious, that the frequency of somatic EGFR                     have suggested different distributions of tumor mutation

1
 Department of Medical Oncology, Dana-Farber Cancer Institute, Boston,        Note: Supplementary data for this article are available at Cancer Discovery
Massachusetts. 2Broad Institute of MIT and Harvard, Cambridge, Massa-         Online (http://cancerdiscovery.aacrjournals.org/).
chusetts. 3Departments of Genetics and Medicine, Harvard Medical School,      Corresponding Authors: Matthew Meyerson, Dana-Farber Cancer Institute,
Boston, Massachusetts. 4Personalized Medicine Laboratory, Instituto           450 Brookline Avenue, Boston, MA 02215. Phone: 617-632-4768; Fax:
Nacional de Cancerologia, México City, México. 5Foundation for Clinical and   617-582-7880; E-mail: matthew_meyerson@dfci.harvard.edu; Alexander
Applied Cancer Research – FICMAC, Bogotá, Colombia. 6Medical Oncology,        Gusev, Phone: 203-980-8760; E-mail: alexander_gusev@dfci.harvard.edu;
Hospital San Juan de Dios, San José, Costa Rica. 7Centro de Investigación y   Andres F. Cardona, Cra. 16 #82-95, Bogotá, Cundinamarca, Colombia. Phone:
Manejo del Cáncer – CIMCA, San José, Costa Rica. 8Division of Respiratory     573-0163-48173; E-mail: a_cardonaz@yahoo.com; and Oscar Arrieta,
Medicine, Graduate School of Medicine, Juntendo University, Bunkyo-ku,        San Fernando #22, Col. Sección XVI, Tlalpan, Mexico D.F., Mexico. Phone:
Tokyo, Japan. 9Division of Public Health Sciences, Fred Hutchinson Can-       5255-5628-0400; E-mail: scararrietaincan@gmail.com
cer Research Center, Seattle, Washington. 10Division of Computational
Biomedicine, Department of Medicine, Boston University School of Medi-        Cancer Discov 2021;11:1–8
cine, Boston, Massachusetts. 11Thoracic Oncology Unit, Instituto Nacional     doi: 10.1158/2159-8290.CD-20-1165
de Cancerología, México City, México. 12Clinical and Translational Oncol-     ©2020 American Association for Cancer Research.
ogy Group, Clínica del Country, Bogotá, Colombia. 13Division of Genetics,
Brigham and Women’s Hospital, Boston, Massachusetts.

                                                                                                    March 2021        CANCER DISCOVERY | OF1

  Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                  Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

RESEARCH BRIEF                                                                                                                    Carrot-Zhang et al.

A                                                                                                              B
                                                                                                                   Inherited germline risk
               Europe
                                                           USA
            EGFR 10%–15%
                                 East Asia   EGFR African American 12%
             KRAS ~30%                      EGFR European American 10%
                              EGFR 40%–50%
                                KRAS ~10% TMB/SCNA European American high
                               TMB/SCNA low
                                                                   Mexico
                                                                 EGFR 34%
                                                                           Colombia
                                                                          EGFR 25%              Brazil
                                                                                              EGFR 27%
                                                                                   Peru
                                                                                 EGFR 51%
                                                                                                                                  Environmental exposure
                                                                                               Argentina
                                                                                              EGFR 14%

Figure 1. Genomic differences in lung adenocarcinoma across patient populations. A, TMB, SCNA burden, and the frequency of KRAS mutations are
lower, whereas the frequency of EGFR mutations is higher, in lung cancers from East Asian patients, compared with lung cancers from patients of Euro-
pean and/or African origin. The somatic EGFR mutation rate in lung cancer varies among Latin American countries. Mexican and Colombian populations
have varying degrees of NAT ancestry, as indicated in blue. B, Both germline variations and environmental exposures such as smoking can predispose to
somatic alterations driving the development of lung cancers, that may cause the genomic differences across populations.

burden (TMB) and levels of somatic copy-number alteration                     and MDM2 (Fig. 2; Supplementary Tables S2 and S3). The
(SCNA; refs. 12, 13), compared with lung adenocarcinoma                       detected mutation frequencies of EGFR and KRAS were 30%
patients of EUR ancestry.                                                     and 10%, respectively, in the tested lung cancer samples from
  Despite the differences in patterns of somatic mutation in                  Mexican patients, and 23% and 13%, respectively, in the tested
lung adenocarcinoma from patients of different ethnicities,                   lung cancer samples from Colombian patients. SCNA analysis
the landscape of ancestry effects on the lung cancer genomes                  (see Methods) identified 9% and 2% cases with high-level ampli-
for the Latin American populations has not been compre-                       fications in MYC and MDM2, respectively. We did not observe
hensively described, and it remains unknown whether the                       novel amplification or deletion peaks in this Latin American
differences are due to ancestry-specific germline variation or                lung cancer cohort as assessed by GISTIC analysis.
rather to population-specific environmental exposures (Fig.                      Ancestry effects on somatic cancer genomes are under-
1B). This is of particular importance as Native American                      studied (17, 18), and few genomic data sets have been devel-
(NAT) ancestry, which includes components of EAS ancestry                     oped from patients with lung cancer with mixed ancestry.
derived through waves of migration (14), is present to varying                One potential source of samples for analysis of ancestry
degrees in modern populations in Latin America, along with                    effects is discarded tissue or nucleic acids, left over after the
EUR and AFR ancestry (15).                                                    clinical analysis of cancer samples. Here, we developed an
                                                                              analytic pipeline (https://github.com/jcarrotzhang/ancestry-
                                                                              from-panel) that offers the advantage of simultaneous meas-
Results                                                                       urement of global and local ancestry from sequencing tumor
   To explore the landscape of somatic cancer mutation in                     DNA only, without requiring matched germline samples
lung cancers from Latin America and to assess the influence of                (Supplementary Fig. S1A–S1C). Briefly, we called the gen-
germline ancestry of genetically admixed patient populations                  otype of germline single-nucleotide polymorphisms (SNP)
on these somatic alterations, we performed genomic analysis                   using on-target and off-target reads from the sequencing
of 601 lung cancer cases from Mexico and 552 from Colom-                      panel and measured global ancestry based on principal com-
bia, including 499 self-reported nonsmokers (Supplementary                    ponent analysis (PCA; ref. 19) of the germline SNPs, in which
Table S1). Next-generation sequencing targeting a panel of 547                principal components (PC) 1, 2, and 3 captured prominently
cancer genes plus intronic regions of 60 cancer genes (16) was                the axis of AFR, EUR, and NAT ancestry, respectively (Sup-
used to identify single-nucleotide variants (SNV), insertions/                plementary Fig. S2A). We then imputed missing SNPs using
deletions (indels), SCNA, and gene fusions. This gene panel                   an external haplotype reference panel (15) and assigned local
covers all currently known lung cancer drivers (3), which are                 ancestry to each genomic region (20), based on the imputed
the focus of this work. Because we do not have matched ger-                   variants. We validated our approach to ancestry analysis by
mline samples, we applied a custom script to identify known,                  performing whole-genome sequencing on a subset of 44
hotspot lung cancer driver mutations for the full 1,153 samples               tumor samples, and SNP genotyping on 12 paired tumor
to ensure the sensitivity for low-coverage samples, as well as to             and normal samples (Supplementary Fig. S2B and S2C). We
avoid potential germline contamination (Methods). We found                    found a high accuracy of our off-target reads approach based
that 552 (48%) of all samples harbored oncogenic mutations in                 on panel sequencing of tumor DNA, by comparing tumor
EGFR, KRAS, BRAF, ERBB2, or MET, or fusions in ALK, ROS1,                     and normal ancestry estimations (Pearson r > 0.99), and
or RET; 785 of 1,153 samples harbored at least one detectable                 by comparing panel sequencing to whole-genome sequenc-
alteration in a broader set of known lung cancer driver genes                 ing (Pearson r > 0.96). As panel testing of cancer genes has
also including TP53, STK11, KEAP1, SMARCA4, SETD2, MYC,                       emerged as a practical diagnostic tool in clinical care, our

OF2 | CANCER DISCOVERY             March 2021                                                                                       AACRJournals.org

     Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                     Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

Genetic Ancestry Affects Somatic Alterations in Lung Cancers                                                                              RESEARCH BRIEF

                                                                 Latin American LUAD                                                             LATAM EUR EAS
                                                                                                           Mutation type                         LUAD LUAD LUAD
                        Gain-of-function SNVs/indels/fusions
         EGFR*                                                                                         Missense_mutation   Amplification          27%      13%     56%
         KRAS*                                                                                         Inframe_indel       Fusion                 12%      33%     13%
         BRAF                                                                                          Splice_mutation     Deletion                2%
         ERBB2                                                                                                             Truncating_mutation     1%
          MET                                                                                                                                      1%
          ALK                                                                                                                                      4%
         ROS1
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

RESEARCH BRIEF                                                                                                                                                                                                                                  Carrot-Zhang et al.

A              EGFR mutation %                                                                                                                                                                                                                                   160

                                                                                                                                                              KRAS mutation %
                                        40                                                                                                                                      40

                                                                                                                                                                                                                                                                       Sample size
                                                                                                                                                                                                                                                                 120
                                                                                                                                                                                                                                                                 80
                                        20                                                                                                                                      20
                                                                                                                                                                                                                                                                 40
                                            0                                                                                                                                    0                                                                               0
                                                 20                40           60       80                                                       100                                20           40           60       80                       100
                                                                          NAT ancestry %                                                                                                                 NAT ancestry %

B
                                                                                    All samples                                                                                                                            Nonsmokers
                                                                                 NAT ancestry %                                                                                                                         NAT ancestry %
          BRAF-mutant                                                                                                                                                                BRAF-mutant
                                                                                                                                      P = 9e-06                                                                                                          P = 2e-06
          EGFR-mutant                                                                                                                                                                EGFR-mutant
                                                                  P = 0.004
          KRAS-mutant                                                                                                                                                                KRAS-mutant
           ALK-mutant                                                                                                                                                                 ALK-mutant
                                                               –0.02 –0.01              0.00    0.01                                       0.02       0.03                                              –0.06 –0.04 –0.02            0.00        0.02       0.04

                                                                             Smoking signature %                                                                                                                    Smoking signature %
         BRAF-mutant                                                                                                                                                                 BRAF-mutant
                                                                                    P = 0.005
         EGFR-mutant                                                                                                             P = 0.02                                            EGFR-mutant
         KRAS-mutant                                                                                                                                                                 KRAS-mutant
          ALK-mutant                                                                                                                                                                  ALK-mutant
                                                              –0.075 –0.050 –0.025 0.000 0.025 0.050                                                                                            –0.125 –0.100 –0.075 –0.050 –0.025 0.000 0.025
                                                                                                                                                            
Figure 3. Targetable LUAD driver genes associated with genetic ancestry. A, The percentage of NAT germline ancestry is positively correlated with
the percentage of somatic EGFR mutations, and negatively correlated with the percentage of somatic KRAS mutations. Color bar represents the number
of samples in the NAT ancestry percentage range. B, Association of targetable LUAD driver genes with NAT ancestry, mutational signature, and gender
(n = 705; left), and the association in never-smokers only (n = 387; right). Multivariable logistic regression P values are shown, with NAT ancestry percent-
age, gender, smoking, and APOBEC signature as covariates. Red dots represent P < 0.05. Lines represent 95% confidence intervals.

differences in Latin American patients with lung adenocarci-                                                                                                                     5,425 genomic regions with an assignment of AFR, EUR, or
noma that are independent of smoking activity.                                                                                                                                   NAT ancestral population for each parental chromosome
  To assess whether the observed association with EGFR                                                                                                                           (Supplementary Table S6–S7). We performed a multivariable
and KRAS mutations is due to NAT ancestry itself or to                                                                                                                           logistic regression of NAT ancestry for each genomic region
an environmental exposure/socioeconomic status related to                                                                                                                        correlating with the EGFR-mutant or KRAS-mutant samples,
NAT ancestry, we next investigated the influence of local                                                                                                                        controlling for the global ancestry (Methods). We did not
ancestry. Previous work has shown that associations between                                                                                                                      identify any region where correlation reached genome-wide
local ancestry and phenotype (while accounting for global                                                                                                                        significance of P < 5 × 10−5 (Fig. 4A; Supplementary Fig. S6).
ancestry) provide evidence of a genetically driven phenotype,                                                                                                                       We next evaluated whether local ancestries across multiple
as local ancestry is not expected to be causally associated                                                                                                                      sub–genome-wide significance threshold regions (5 × 10−5 < P
with environmental exposure or socioeconomic status (22,                                                                                                                         < 0.05) were associated with the somatic mutation phenotype
23). We used RFMix (20) to map local ancestry, producing                                                                                                                         by constructing a polygenic ancestry score. This approach is

A                                               Genomic loci associated with EGFR                                                               Genomic loci associated with KRAS
                                                                                                                                                                                                         B
                                                                                                                                                                                                             Gene ~ local ancestry risk score + global ancestry
                                        4                                                                                              4
               low NAT ancestry high

                                                                                                              low NAT ancestry high

                                                                                                                                                                                                                                                 P val. Coef. 95% CI
                                        2                                                                                              2                                                                            Local ancestry risk score
                                                                                                                                                                                                                       (cross-validated)         9e–08    0.55   0.35–0.75
  −Log10 (P)

                                                                                                −Log10 (P )

                                                                                                                                                                                                             EGFR
                                                                                                                                                                                                                        Global ancestry
                                        0                                                                                              0                                                                               (NAT ancestry %)           0.31    0.11 −0.11–0.34

                                                                                                                                                                                                                    Local ancestry risk score
                                                                                                                                                                                                                       (cross-validated)         0.002    1.00   0.37–1.63
                                       –2                                                                                             –2
                                                                                                                                                                                                             KRAS
                                                                                                                                                                                                                        Global ancestry
                                                                                                                                                                                                                                                  0.14    0.38 −0.12–0.89
                                                                                                                                                                                                                       (NAT ancestry %)
                                       –4                                                                                             –4

                                            1    2    3   4                      - -181920-X
                                                              5 6 7 8 9 10 11 1213 15                                                       1     2   3   4   5 6 7 8 9 10 11 1213 15     - -181920-X
                                                              Chromosome                                                                                      Chromosome

Figure 4. Germline local ancestry in association with somatic EGFR and KRAS mutations. A, Genome-wide association of local NAT ancestry with
EGFR (left) and KRAS (right). “NAT ancestry high” indicates positive association, whereas “NAT ancestry low” indicates negative association. Red line
indicates P = 0.05. Orange line indicates genome-wide significance threshold (P < 5 × 10−5). B, Association of local ancestry risk score with somatic EGFR
or KRAS mutations, controlling for global ancestry (proportion of overall NAT ancestry).

OF4 | CANCER DISCOVERY                                                         March 2021                                                                                                                                                       AACRJournals.org

                Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                                Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

Genetic Ancestry Affects Somatic Alterations in Lung Cancers                                                   RESEARCH BRIEF

conceptually similar to previous work leveraging local ances-       Disparities in access to genetic testing have been observed
try to quantifying phenotypic heritability (22), but we use      among Hispanic patients with lung cancer in the United States
a risk score rather than variance partitioning as the former     (29). Our study shows that while controlling for global ancestry,
is more stable at low sample sizes. The local ancestry risk      local ancestry is associated with mutations in EGFR and KRAS,
score was defined as the sum of NAT ancestry across each         providing the first example, to our knowledge, of a germline
associated region weighted by the Z-score for the association    influence, which may or may not act together with ancestry-
of that region with the given mutation (Methods). To guard       specific environmental exposure, on targetable somatic events
against overfitting, Z-scores and local ancestry risk scores     in lung cancer. These findings highlight the importance of pro-
were computed by cross-validation: splitting the data set into   viding somatic genetic testing for Latin American patients with
10 subsets, obtaining Z-scores for the mutation–ancestry         lung cancer with admixed ancestries. Given the limited sample
associations using nine subsets, and then calculating the        size, we could not determine the precise risk loci for EGFR and
local ancestry risk score on the held-out subset (Supplemen-     KRAS mutation by local ancestry mapping. As low-dose CT
tary Fig. S7). We then performed another logistic regression     scans have enabled lung cancer screening that can significantly
including both the cross-validated local ancestry risk score     reduce lung cancer mortality (30, 31), we believe that the iden-
and global ancestry as covariates and found that the local       tification of germline allele(s) predisposing to the development
ancestry risk score was significantly associated with EGFR       of EGFR-mutant or KRAS-mutant lung adenocarcinoma from
and KRAS mutations, respectively, whereas global ancestry        large, Latin American lung cancer sample sets may improve
was no longer significant in the joint model (Fig. 4B). In       our understanding of the biological causes of EGFR and KRAS
contrast, the local ancestry risk score was not associated       mutations and the evolutionary processes in lung cancers.
with TMB and STK11 mutations in a joint model with global        Such findings could therefore shed light on prevention and
ancestry. Finally, although previous work suggested associa-     early-detection strategies for lung cancer in Latin America and
tions between cis alleles and EGFR mutations (24), we did not    beyond, particularly for nonsmokers.
observe an association between the local ancestry of the EGFR
locus and somatic mutations in EGFR (P = 0.8). Moreover,
when including the local ancestry of the EGFR locus as a         Methods
covariate, the association of EGFR mutations and the local       Sample Collection
ancestry risk score remained significant (P = 4 × 10−6). Our
                                                                    The protocol of this work was approved by the ethical and sci-
finding suggests that one or more genetic loci specific to NAT
                                                                 entific committees of the Instituto Nacional de Cancerologia in
ancestry may modulate the evolution of lung cancer tumors        Mexico City, Mexico, the Foundation for Clinical and Applied Cancer
to harbor EGFR or KRAS mutations in the Latin American           Research in Bogotá, Colombia, and the Dana-Farber Cancer Institute
populations.                                                     in Boston, Massachusetts, for detecting EGFR mutations and further
                                                                 genomic analysis. Biopsies were collected for histologic diagnosis by
                                                                 the pathology departments of Instituto Nacional de Cancerologia
Discussion                                                       and Foundation for Clinical and Applied Cancer Research.
   In summary, the genomic landscape of lung adenocarci-
nomas is strikingly varied in Latin American patients with       Library Preparation and Sequencing
mixed ancestries. In our study of 1,153 lung cancers, we dem-       Genomic DNA was extracted from fresh-frozen, blood and paraffin-
onstrated that NAT ancestry was correlated with somatic          embedded samples by a standard procedure using the Wizard Genom-
driver alterations, including EGFR and KRAS mutations            ics DNA Kit (Promega) according to the manufacturer’s instructions.
that can be effectively targeted by small-molecule inhibitors    DNA was fragmented to 250 bp, and size-selected DNA was ligated to
                                                                 sample-specific barcodes. A custom targeted hybrid capture sequenc-
to prolong survival (4, 25), and TMB and STK11 that are
                                                                 ing platform (OncoPanel) was used to assay genomic alterations in
potential prognostic biomarkers in patients with lung cancer     tumor DNA (16). Each library was quantified by sequencing on an Illu-
(26, 27). The ancestry and TMB association was independ-         mina MiSeq nano flow cell (Illumina). Libraries were pooled in equal
ent of smoking-related mutational processes, and there-          mass to a total of 500 ng for enrichment using the Agilent SureSelect
fore further investigation on the impact of ancestry-related     hybrid capture kit (Agilent Technologies; cat. no. G9611A). Libraries
TMB differences on the response to checkpoint inhibitors is      were sequenced on an Illumina HiSeq2500 or HiSeq3000. Pooled sam-
needed (28). Of note, our TMB estimates may be susceptible       ples were demultiplexed using the Picard tools (https://broadinstitute.
to germline contaminations due to the lack of matched            github.io/picard/). Paired reads were aligned to the hg19 reference
normal samples. If germline variants specific to the Mexican     genome using BWA (32) with the following parameters “-q 5 -l 32 -k 2
or Colombian population could not be sufficiently filtered       -o 1.” Aligned reads were sorted and duplicate-marked using Picard. In
                                                                 each batch, we sequenced a control DNA sample as a “plate normal.”
(due to smaller germline reference panels), individuals with
                                                                 For a subset of 44 cases, the same libraries were sequenced on Illumina
higher NAT ancestry would have more germline contamina-          NovaSeq for low-coverage whole-genome sequencing at 1× coverage.
tion, and the anticorrelation between TMB and NAT ances-
try may thus be even more significant than we have observed.     Mutation Analysis
Furthermore, due to the lack of matched germline samples
                                                                    Mutation detection for SNVs was performed using MuTect v1.1.4
and the use of panel sequencing data, we tested only known,      (33) in paired mode by pairing each sample to a control DNA sample
hotspot mutations and protein-truncating mutations in            profiled with the same OncoPanel. SomaticIndelDetector (34) was
lung cancer driver genes. Future studies will be needed to       used for indel calling. Mutations were annotated by variant effect
comprehensively characterize lung cancer genomes from            predictor (VEP; ref. 35) and Oncotator (36). Called variants with
Latin American patients.                                         a frequency greater than or equal to 0.01% that were found in the

                                                                                      March 2021       CANCER DISCOVERY | OF5

 Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                 Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

RESEARCH BRIEF                                                                                                                   Carrot-Zhang et al.

Genome Aggregation Database (gnomAD; ref. 37), containing 25,748              of NAT ancestry than the Colombian population (15), we accounted
exomes, were excluded. TMB was calculated by dividing the total               for the country of sample collection throughout our analyses. TMB
number of coding, nonsilent mutations in an individual by the target          was included as a covariate when associating PC3 with mutations.
size (3 MB). Mutational signatures were called using SignatureAna-            Total SCNA burden was included as a covariate when associating PC3
lyzer (38) with SNVs classified by 96 trinucleotide contexts. Prior           with SCNA of lung cancer genes. Gender, proportion of smoking,
studies have shown that mutational signature analysis can be inferred         and APOBEC signatures were included as covariates when associating
based on on-target reads from clinical panel sequencing (8, 39).              the percentage of NAT ancestry with oncogenic mutations. To test
Smoking (COSMIC signature 4) and APOBEC signature (COSMIC                     whether smoking signature influences the relationship between ances-
signature 2) activities were inferred by the estimated number of              try and the KRAS mutations, the following model was performed:
mutations in a trinucleotide context associated with each signature.
                                                                                          KRAS ~ NAT ancestry + smoking signature +
   A custom script (https://github.com/jcarrotzhang/ancestry-from-
                                                                                              NAT*smoking + other covariates
panel) was applied to inspect the sites of hotspot driver mutations in
EGFR, KRAS, BRAF, ERBB2, MET, and TP53. For each mutation, we                 Where gender and country of sample collection were considered as
counted reads supporting the reference base and the altered base, after       covariates, and NAT*smoking was included as an interaction effect.
filtering out reads with base quality or mapping quality less than 20 (40).   If the interaction term is not significant, that means that smoking
A mutation was called if the total read count was greater than 5, the         signature activity does not modify the effect of ancestry.
altered read count was greater than 2, and the mutant allele frequency           To identify specific genomic region(s) associated with lung adeno-
was greater than 5%. Identified mutations with total coverage lower than      carcinoma cases harboring certain somatic alterations, a logistic
30× were manually inspected using Integrative Genomics Viewer (41).           regression model was applied controlling for the percentage of
                                                                              NAT ancestry, TMB, and country of sample collection, followed by
Copy Number and Rearrangements                                                                    2
                                                                              genomic control  . Ten-fold cross-validation was performed in
   Read coverage was calculated at 1 MB bins across the genome and            the following steps (Supplementary Fig. S7): the whole data set was
was corrected for guanine-cytosine (GC) content and mappability biases        split into 10 subsets. Z-scores for the mutation–ancestry associations
using ichorCNA version 0.1.0 (42) using the plate normal as the matched       for each genomic region were calculated using nine subsets, and a
control for each sample. GISTIC version 2.0.22 (43) was applied to iden-      cross-validated local ancestry risk score (sum of the NAT ancestry
tify focal and arm-level SCNAs on ichorCNA-generated copy-number              across each associated region weighted by the Z-score of that region)
segments, with the high-level amplification defined as log2-transformed       was calculated for each sample on the held-out subset. These steps
copy-number ratio greater than 0.7. Rearrangement events were called          were repeated 10 times until a local ancestry risk score was generated
by Breakmer (44) and filtered on discordant read counts and split read        for each sample:
counts greater than 0. Total SCNA burden and the degree of aneuploidy
                                                                                              local ancestry risk score   i0 AiZi ,
                                                                                                                             n
was defined by the number of genes or chromosomal arms affected by
SCNAs, respectively (copy-number ratio > 0.1 or > −0.1).
                                                                              where n is the number of regions associated with the somatic feature
Ancestry Analysis from Genotyping Array                                       (P < 0.05), A is the ancestry of each associated region (NAT ancestry
  The Multi-Ethnic Genotyping Array (MEGA) was used for geno-                 was coded as 1, and EUR or AFR ancestry was coded as 0), and Z is
typing of paired fresh-frozen tumor tissue and blood samples. We              the Z-score of that associated region.
used PLINK version 1.9 (45) to filter out variants with missing rate
greater than 2%, or failed Hardy–Weinberg equilibrium test (P <               Data and Code Availability
1 × 10−6). Markers with allele frequency less than 1% in the 1000                Raw sequencing data from cancer gene panel sequencing are depos-
Genomes data set were also excluded. The Mexican and Colombian                ited at European Genome-phenome Archive (EGA) under the acces-
samples were merged with samples from the 1000 Genomes phase III              sion code EGAS00001004752. Ancestry identification method from
(15), and PCA was performed on the merged data set using genome-              panel sequencing and custom code used in the analyses are avail-
wide complex trait analysis (GCTA) version 1.91.6 (46).                       able at https://github.com/jcarrotzhang/ancestry-from-panel. This
                                                                              code and data are also available at https://codeocean.com/­capsule/
Ancestry Analysis from Sequencing                                             6075183. All other data supporting the findings of this study are avail-
   SAMtools (47) was used to genotype germline variants after filtering       able upon reasonable request from the corresponding authors.
out reads with base quality or mapping quality less than 20. LASER
version 2.04 (48) was used to estimate overall ancestry based on              Authors’ Disclosures
637,037 germline variants from all populations in the Human Genome               G. Ha reports a patent for methods for genome characterization
Diversity Project (HGDP; ref. 49). We obtained principal components           pending. G.R. Oxnard reports other from Foundation Medicine
from LASER results that place each sample into a reference PCA space          and Roche outside the submitted work. O. Arrieta reports personal
using 939 HGDP samples as reference samples. For local ancestry               fees from Pfizer, Lilly, Merck, and Bristol Myers Squibb and grants
identification, phasing and imputation were performed using Beagle            and personal fees from AstraZeneca, Boehringer Ingelheim, and
version 4.1 (50) based on SAMtools genotyped variants. We imputed             Roche outside the submitted work. A.F. Cardona reports grants, per-
missing variants using phased haplotypes from 1000 Genomes (15).              sonal fees, and non-financial support from Merck Sharp & Dohme,
Ancestry was assigned to each SNP using RFMix v2 (20). For each               Boehringer Ingelheim, Roche, Bristol-Myers Squibb, Foundation
parental population (NAT, AFR, and EUR), 500 samples from 1000                Medicine, and Astra Zeneca; non-financial support from BioNTech,
Genomes were used as reference samples. Local ancestry regions span-          Rochem Biocare, and INQBox; grants and personal fees from Amgen
ning centromeres were filtered. RFMix outputted global ancestry esti-         and Bayer; personal fees and non-financial support from Foundation
mates were used as the percentage of NAT ancestry.                            for Clinical and Applied Cancer Research; personal fees from EISAI,
                                                                              Merck Serono, Jannsen Pharmaceuticals, Pfizer, Celldex, and Eli Lilly,
Association Analysis                                                          and other from Illumina, Thermo Fisher, and Roche Diagnostics
  Multivariable logistic regression or linear regression was performed        outside the submitted work. M. Meyerson reports personal fees from
using a Python module (https://www.statsmodels.org). Because the              OrigiMed and grants from Bayer, Janssen, Novo, and grants Ono
Mexican population of patients with lung cancer has a higher level            outside the submitted work; in addition, M. Meyerson has a patent

OF6 | CANCER DISCOVERY              March 2021                                                                                    AACRJournals.org

     Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                     Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

Genetic Ancestry Affects Somatic Alterations in Lung Cancers                                                                  RESEARCH BRIEF

for EGFR mutations for lung cancer diagnosis issued, licensed, and             8. Campbell JD, Lathan C, Sholl L, Ducar M, Vega M, Sunkavalli A, et al.
with royalties paid from LabCorp and a patent for EGFR inhibitors                 Comparison of prevalence and types of mutations in lung cancers
pending to Bayer; and was a founding advisor of, consultant to, and               among black and white populations. JAMA Oncol 2017;3:801–9.
equity holder in Foundation Medicine, shares of which were sold to             9. Arrieta O, Cardona AF, Martín C, Más-López L, Corrales-Rodríguez L,
Roche. No disclosures were reported by the other authors.                         Bramuglia G, et al. Updated frequency of EGFR and KRAS mutations
                                                                                  in nonsmall-cell lung cancer in Latin America: the Latin-American
Authors’ Contributions                                                            consortium for the investigation of lung cancer (CLICaP). J Thorac
                                                                                  Oncol 2015;10:838–43.
   J. Carrot-Zhang: Formal analysis, methodology, writing–original            10. Gimbrone NT, Sarcar B, Gordian ER, Rivera JI, Lopez C, Yoder SJ,
draft, writing–review and editing. G. Soca-Chafre: Resources, writing–            et al. Somatic mutations and ancestry markers in hispanic lung can-
review and editing. N. Patterson: Methodology. A.R. Thorner:                      cer patients. J Thorac Oncol 2017;12:1851–6.
Resources, writing–review and editing. A. Nag: Resources, writing–            11. Leal LF, de Paula FE, De Marchi P, de Souza Viana L, Pinto GDJ,
review and editing. J. Watson: Resources. G. Genovese: Methodology.               Carlos CD, et al. Mutational profile of Brazilian lung adenocarci-
J. Rodriguez: Resources. M.K. Gelbard: Resources. L. Corrales-Rodriguez:          noma unveils association of EGFR mutations with high Asian ances-
Resources. Y. Mitsuishi: Resources, writing–review and editing.                   try and independent prognostic role of KRAS mutations. Sci Rep
G. Ha: Methodology, writing–review and editing. J.D. Campbell:                    2019;9:3209.
Methodology, writing–review and editing. G.R. Oxnard: Resources,              12. Chen J, Yang H, Teo ASM, Amer LB, Sherbaf FG, Tan CQ, et al.
writing–review and editing. O. Arrieta: Conceptualization, resources,             Genomic landscape of lung adenocarcinoma in East Asians. Nat
writing–review and editing. A.F. Cardona: Conceptualization,                      Genet 2020;52:177–86.
resources, writing–review and editing. A. Gusev: Supervision, writing–        13. Sinha S, Mitchell KA, Zingone A, Bowman E, Sinha N, Schäffer AA,
review and editing. M. Meyerson: Conceptualization, supervision,                  et al. Higher prevalence of homologous recombination deficiency in
writing–review and editing.                                                       tumors from African Americans versus European Americans. Nature
                                                                                  Cancer 2020;1:112–21.
Acknowledgments                                                               14. Reich D, Patterson N, Campbell D, Tandon A, Mazieres S, Ray N,
                                                                                  et al. Reconstructing native american population history. Nature
   We would like to acknowledge the patients from Mexico and                      2012;488:370–4.
Colombia for their participation in this study. This study is sup-            15. Consortium T 1000 GP, The 1000 Genomes Project Consortium. A
ported by the V Foundation Translational Research Award (the                      global reference for human genetic variation. Nature 2015;526:68–74.
Stuart Scott Memorial Fund) and by NCI grants R35 CA197568, R01               16. Hanna GJ, Lizotte P, Cavanaugh M, Kuo FC, Shivdasani P, Frieden A,
CA116020, and R01 CA227237, as well as an anonymous gift to the                   et al. Frameshift events predict anti-PD-1/L1 response in head and
Dana-Farber Cancer Institute. M. Meyerson is an American Cancer                   neck cancer. JCI Insight 2018;3:e98811.
Society Research Professor. J. Carrot-Zhang holds the CIHR Banting            17. Yuan J, Hu Z, Mahal BA, Zhao SD, Kensler KH, Pi J, et al. Integrated
fellowship. J.D. Campbell is funded by the LUNGevity Career Devel-                analysis of genetic ancestry and genomic alterations across cancers.
opment award. We thank the Center for Cancer Genomics at Dana-                    Cancer Cell 2018;34:549–60.
Farber Cancer Institute and the Genomic Platform at Broad Institute           18. Carrot-Zhang J, Chambwe N, Damrauer JS, Knijnenburg TA, Robert-
for their sequencing and genotyping efforts, and we thank Aruna                   son AG, Yau C, et al. Comprehensive analysis of genetic ancestry and
Ramachandran and Kar-Tong Tan for helpful discussions.                            its molecular correlates in cancer. Cancer Cell 2020;37:639–54.
                                                                              19. Wang C, Zhan X, Liang L, Abecasis GR, Lin X. Improved ancestry
                                                                                  estimation for both genotyping and sequencing data using projec-
 Received August 11, 2020; revised October 26, 2020; accepted
                                                                                  tion procrustes analysis and genotype imputation. Am J Hum Genet
November 19, 2020; published first December 2, 2020.
                                                                                  2015;96:926–37.
                                                                              20. Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: a discrimi-
                                                                                  native modeling approach for rapid and robust local-ancestry infer-
References                                                                        ence. Am J Hum Genet 2013;93:278–88.
 1. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global   21. Koivunen JP, Kim J, Lee J, Rogers AM, Park JO, Zhao X, et al. Muta-
    cancer statistics 2018: GLOBOCAN estimates of incidence and mor-              tions in the LKB1 tumour suppressor are frequently detected in
    tality worldwide for 36 cancers in 185 countries. CA Cancer J Clin            tumours from Caucasian but not Asian lung cancer patients. Br J
    2018;68:394–424.                                                              Cancer 2008;99:245–52.
 2. Raez LE, Cardona AF, Santos ES, Catoe H, Rolfo C, Lopes G, et al. The     22. Zaitlen N, Pasaniuc B, Sankararaman S, Bhatia G, Zhang J, Gusev A,
    burden of lung cancer in Latin-America and challenges in the access           et al. Leveraging population admixture to characterize the heritability
    to genomic profiling, immunotherapy and targeted treatments. Lung             of complex traits. Nat Genet 2014;46:1356–62.
    Cancer 2018;119:7–13.                                                     23. Florez JC, Price AL, Campbell D, Riba L, Parra MV, Yu F, et al. Strong
 3. Cancer Genome Atlas Research Network. Comprehensive molecular                 association of socioeconomic status with genetic ancestry in Latinos:
    profiling of lung adenocarcinoma. Nature 2014;511:543–50.                     implications for admixture studies of type 2 diabetes. Diabetologia
 4. Paez JG, Jänne PA, Lee JC, Tracy S, Greulich H, Gabriel S, et al. EGFR        2009;52:1528–36.
    mutations in lung cancer: correlation with clinical response to gefi-     24. Liu W, He L, Ramírez J, Krishnaswamy S, Kanteti R, Wang Y-C, et al.
    tinib therapy. Science 2004;304:1497–500.                                     Functional EGFR germline polymorphisms may confer risk for EGFR
 5. Shi Y, Au JS-K, Thongprasert S, Srinivasan S, Tsai C-M, Khoa MT, et al.       somatic mutations in non-small cell lung cancer, with a predominant
    A prospective, molecular epidemiology study of EGFR mutations in              effect on exon 19 microdeletions. Cancer Res 2011;71:2423–7.
    Asian patients with advanced non-small-cell lung cancer of adenocar-      25. Canon J, Rex K, Saiki AY, Mohr C, Cooke K, Bagal D, et al. The clini-
    cinoma histology (PIONEER). J Thorac Oncol 2014;9:154–62.                     cal KRAS(G12C) inhibitor AMG 510 drives anti-tumour immunity.
 6. Midha A, Dearden S, McCormack R. EGFR mutation incidence in                   Nature 2019;575:217–23.
    non-small-cell lung cancer of adenocarcinoma histology: a systematic      26. Johnson DB, Frampton GM, Rioth MJ, Yusko E, Xu Y, Guo X, et al.
    review and global map by ethnicity (mutMapII). Am J Cancer Res                Targeted next generation sequencing identifies markers of response
    2015;5:2892–911.                                                              to PD-1 blockade. Cancer Immunol Res 2016;4:959–67.
 7. Steuer CE, Behera M, Berry L, Kim S, Rossi M, Sica G, et al. Role of      27. Arbour KC, Jordan E, Kim HR, Dienstag J, Yu HA, Sanchez-Vega F,
    race in oncogenic driver prevalence and outcomes in lung adenocar-            et al. Effects of co-occurring genomic alterations on outcomes in
    cinoma: results from the Lung Cancer Mutation Consortium. Cancer              patients with KRAS-mutant non-small cell lung cancer. Clin Cancer
    2016;122:766–72.                                                              Res 2018;24:334–40.

                                                                                                    March 2021        CANCER DISCOVERY | OF7

 Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                 Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

RESEARCH BRIEF                                                                                                                    Carrot-Zhang et al.

28. Berland L, Heeke S, Humbert O, Macocco A, Long-Mira E, Lassalle S,       40. Carrot-Zhang J, Majewski J. LoLoPicker: detecting low allelic-fraction
    et al. Current views on tumor mutational burden in patients with             variants from low-quality cancer samples. Oncotarget 2017;8:37032–40.
    non-small cell lung cancer treated by immune checkpoint inhibitors.      41. Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES,
    J Thorac Dis 2019;11:S71–80.                                                 Getz G, et al. Integrative genomics viewer. Nat Biotechnol 2011;29:
29. Lynch JA, Berse B, Rabb M, Mosquin P, Chew R, West SL, et al. Under-         24–6.
    utilization and disparities in access to EGFR testing among Medicare     42. Adalsteinsson VA, Ha G, Freeman SS, Choudhury AD, Stover DG,
    patients with lung cancer from 2010–2013. BMC Cancer 2018;18:306.            Parsons HA, et al. Scalable whole-exome sequencing of cell-free DNA
30. Kovalchik SA, Tammemagi M, Berg CD, Caporaso NE, Riley TL,                   reveals high concordance with metastatic tumors. Nat Commun
    Korch M, et al. Targeting of low-dose CT screening according to the          2017;8:1324.
    risk of lung-cancer death. N Engl J Med 2013;369:245–54.                 43. Mermel CH, Schumacher SE, Hill B, Meyerson ML, Beroukhim R,
31. Duffy SW, Field JK. Mortality reduction with low-dose CT screening           Getz G. GISTIC2.0 facilitates sensitive and confident localization of
    for lung cancer. N Engl J Med 2020;382:572–3.                                the targets of focal somatic copy-number alteration in human can-
32. Li H, Durbin R. Fast and accurate short read alignment with Bur-             cers. Genome Biol 2011;12:R41.
    rows–Wheeler transform. Bioinformatics 2009;25:1754–60.                  44. Abo RP, Ducar M, Garcia EP, Thorner AR, Rojas-Rudilla V, Lin L, et al.
33. Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez          BreaKmer: detection of structural variation in targeted massively par-
    C, et al. Sensitive detection of somatic point mutations in impure and       allel sequencing data using kmers. Nucleic Acids Res 2015;43:e19.
    heterogeneous cancer samples. Nat Biotechnol 2013;31:213–9.              45. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender
34. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kerny-               D, et al. PLINK: a tool set for whole-genome association and popula-
    tsky A, et al. The genome analysis toolkit: a MapReduce framework            tion-based linkage analyses. Am J Hum Genet 2007;81:559–75.
    for analyzing next-generation DNA sequencing data. Genome Res            46. Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genome-
    2010;20:1297–303.                                                            wide complex trait analysis. Am J Hum Genet 2011;88:76–82.
35. McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F.          47. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al.
    Deriving the consequences of genomic variants with the ensembl API           The sequence alignment/map format and SAMtools. Bioinformatics
    and SNP effect predictor. Bioinformatics 2010;26:2069–70.                    2009;25:2078–9.
36. Ramos AH, Lichtenstein L, Gupta M, Lawrence MS, Pugh TJ, Sak-            48. Wang C, Zhan X, Bragg-Gresham J, Kang HM, Stambolian D, Chew
    sena G, et al. Oncotator: cancer variant annotation tool. Hum Mutat          EY, et al. Ancestry estimation and control of population stratification
    2015;36:E2423–9.                                                             for sequence-based association studies. Nat Genet 2014;46:409–15.
37. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang        49. Bergström A, McCarthy SA, Hui R, Almarri MA, Ayub Q, Danecek P,
    Q, et al. The mutational constraint spectrum quantified from varia-          et al. Insights into human genetic variation and population history
    tion in 141,456 humans. Nature 2020;581:434–43.                              from 929 diverse genomes. Science 2020;367:eaay5012.
38. Kim J, Mouw KW, Polak P, Braunstein LZ, Kamburov A, Kwiatkowski          50. Browning SR, Browning BL. Rapid and accurate haplotype phasing and
    DJ, et al. Somatic ERCC2 mutations are associated with a distinct            missing-data inference for whole-genome association studies by use of
    genomic signature in urothelial tumors. Nat Genet 2016;48:600–6.             localized haplotype clustering. Am J Hum Genet 2007;81:1084–97.
39. Touat M, Li YY, Boynton AN, Spurr LF, Iorgulescu JB, Bohrson CL,         51. Campbell JD, Alexandrov A, Kim J, Wala J, Berger AH, Pedamallu CS,
    et al. Mechanisms and therapeutic implications of hypermutation in           et al. Distinct patterns of somatic genome alterations in lung adenocar-
    gliomas. Nature 2020;580:517–23.                                             cinomas and squamous cell carcinomas. Nat Genet 2016;48:607–16.

OF8 | CANCER DISCOVERY              March 2021                                                                                     AACRJournals.org

     Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                     Research.
Published OnlineFirst December 2, 2020; DOI: 10.1158/2159-8290.CD-20-1165

Genetic Ancestry Contributes to Somatic Mutations in Lung
Cancers from Admixed Latin American Populations
Jian Carrot-Zhang, Giovanny Soca-Chafre, Nick Patterson, et al.

Cancer Discov Published OnlineFirst December 2, 2020.

 Updated version         Access the most recent version of this article at:
                         doi:10.1158/2159-8290.CD-20-1165

   Supplementary         Access the most recent supplemental material at:
         Material        http://cancerdiscovery.aacrjournals.org/content/suppl/2020/11/24/2159-8290.CD-20-1165.DC1

      E-mail alerts      Sign up to receive free email-alerts related to this article or journal.

     Reprints and        To order reprints of this article or to subscribe to the journal, contact the AACR Publications
    Subscriptions        Department at pubs@aacr.org.

      Permissions        To request permission to re-use all or part of this article, use this link
                         http://cancerdiscovery.aacrjournals.org/content/early/2021/02/13/2159-8290.CD-20-1165.
                         Click on "Request Permissions" which will take you to the Copyright Clearance Center's (CCC)
                         Rightslink site.

          Downloaded from cancerdiscovery.aacrjournals.org on March 12, 2021. © 2020 American Association for Cancer
                                                          Research.
You can also read