Dissecting Genomic Determinants of Positive Selection with an Evolution-Guided Regression Model

Page created by Sandra Hawkins
 
CONTINUE READING
Dissecting Genomic Determinants of Positive Selection with an
Evolution-Guided Regression Model
Yi-Fei Huang             *1,2
1
 Department of Biology, Pennsylvania State University, University Park, PA, USA
2
 Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA
*Corresponding author: E-mail: yuh371@psu.edu.
Associate editor: Jeffrey Townsend

                                                                                                                                                            Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
Abstract
In evolutionary genomics, it is fundamentally important to understand how characteristics of genomic sequences, such as
gene expression level, determine the rate of adaptive evolution. While numerous statistical methods, such as the
McDonald–Kreitman (MK) test, are available to examine the association between genomic features and the rate of
adaptation, we currently lack a statistical approach to disentangle the independent effect of a genomic feature from the
effects of other correlated genomic features. To address this problem, I present a novel statistical model, the MK
regression, which augments the MK test with a generalized linear model. Analogous to the classical multiple regression
model, the MK regression can analyze multiple genomic features simultaneously to infer the independent effect of a
genomic feature, holding constant all other genomic features. Using the MK regression, I identify numerous genomic
features driving positive selection in chimpanzees. These features include well-known ones, such as local mutation rate,
residue exposure level, tissue specificity, and immune genes, as well as new features not previously reported, such as gene
expression level and metabolic genes. In particular, I show that highly expressed genes may have a higher adaptation rate
than their weakly expressed counterparts, even though a higher expression level may impose stronger negative selection.
Also, I show that metabolic genes may have a higher adaptation rate than their nonmetabolic counterparts, possibly due
to recent changes in diet in primate evolution. Overall, the MK regression is a powerful approach to elucidate the
genomic basis of adaptation.
Key words: adaptive evolution, positive selection, McDonald–Kreitman test, statistical inference, causal inference.

Introduction                                                                            and putatively neutral sites, the MK test seeks to identify
                                                                                        positively selected genes that show an excess of interspecies
Understanding the genetic basis of positive selection is a fun-
                                                                                        divergence at functional sites. Because highly deleterious
damental problem in evolutionary biology. Numerous statis-

                                                                                                                                                          Article
                                                                                        mutations can neither segregate nor reach fixation in a pop-
tical approaches have been developed to detect loci under
                                                                                        ulation, the MK test is intrinsically robust to the presence of
positive selection. A popular framework is codon substitution
                                                                                        strong negative selection. On the other hand, weak negative
models that seek to infer positively selected genes solely from
interspecies sequence divergence (Goldman and Yang 1994;                                selection may lead to biased results in the MK test because
Muse and Gaut 1994; Yang et al. 2000). By contrasting the                               mutations under weak selection can segregate in a popula-
rate of nonsynonymous substitutions (dN) against the rate of                            tion but not reach fixation. To address this problem, several
synonymous substitutions (dS), codon substitution models                                recent studies have extended the MK test to account for the
can identify positively selected genes with a dN/dS ratio                               effects of weak negative selection on intraspecies polymor-
greater than 1. However, because negative (purifying) selec-                            phism (Eyre-Walker and Keightley 2009; Messer and Petrov
tion can dramatically reduce dN/dS ratios, codon substitution                           2013; Galtier 2016; Haller and Messer 2017; Uricchio et al.
models may be underpowered to detect genes that experi-                                 2019), ensuring that the inference of positive selection is
enced both positive selection and strong negative selection                             not biased by the presence of weak selection. Thus, the MK
(Hughes 2007).                                                                          test and its extensions are powerful methods to disentangle
    Unlike codon substitution models, the McDonald–                                     positive selection from ubiquitous negative selection.
Kreitman (MK) test utilizes both interspecies divergence                                   Because MK-based methods use relatively sparse diver-
and intraspecies polymorphism to elucidate positive selection                           gence and polymorphism data from closely related species,
in a species of interest (McDonald and Kreitman 1991; Fay et                            they may be underpowered to pinpoint individual genes un-
al. 2001; Smith and Eyre-Walker 2002). By contrasting the                               der positive selection. To boost statistical power, MK-based
levels of divergence and polymorphism at functional sites                               methods often are applied to a collection of genes or

ß The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.
org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is
properly cited.                                                                                                                        Open Access
Mol. Biol. Evol. doi:10.1093/molbev/msab291 Advance Access publication October 01, 2021                                                              1
Huang . doi:10.1093/molbev/msab291                                                                                           MBE
nucleotide sites with similar genomic features. Using this            features to control for. Therefore, we currently lack a general
pooling strategy, previous studies have identified numerous           and powerful statistical framework to estimate the indepen-
genomic features associated with positive selection in                dent effects of genomic features on positive selection by
Drosophila and primates. The features associated with posi-           adjusting for a large number of correlated genomic features.
tive selection in Drosophila include local mutation rate                  In the current study, I present a novel statistical method,
(Campos et al. 2014; Castellano et al. 2016; Rousselle et al.         the MK regression, to estimate the independent effects of
2020), local recombination rate (Marais and Charlesworth              genomic features on the rate of adaptive evolution. The
2003; Campos et al. 2014; Castellano et al. 2016), gene expres-       MK regression is a hybrid of the MK test and the generalized
sion specificity (Fraı̈sse et al. 2019), residue exposure to sol-     linear regression. Unlike standard linear regression models

                                                                                                                                            Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
vent (Moutinho, Trancoso et al. 2019), X linkage (Avila et al.        and statistical matching algorithms, the MK regression can
2015; Campos et al. 2018), and sex-biased expression                  control for a large number of correlated genomic features and
(Pröschel et al. 2006; Avila et al. 2015; Campos et al. 2018).       is applicable to species with a low level of polymorphism. To
The features associated with positive selection in primates           the best of my knowledge, the MK regression is the first
include protein disorder (Afanasyeva et al. 2018), virus–host         evolutionary model tailored to characterize the independent
interaction (Enard et al. 2016; Uricchio et al. 2019), protein–       effects of genomic features on adaptive evolution. Using syn-
protein interaction (PPI) degree (Luisi et al. 2015), and X           thetic data, I show that the MK regression can unbiasedly
linkage (Hvilsom et al. 2012).                                        estimate the independent effect of a genomic feature even
   While existing MK-based methods can identify genomic               when it is strongly correlated with another genomic feature.
features associated with the signatures of positive selection,        Applying the MK regression to polymorphism and divergence
they may not be able to distinguish genomic features inde-            data in the chimpanzee lineage, I corroborate previous find-
pendently affecting the rate of adaptive evolution from spu-          ings that local mutation rate, residue exposure level, tissue
rious features without independent effects on adaptation              specificity, and immune system genes are key determinants of
(Moutinho, Trancoso et al. 2019; Fraı̈sse et al. 2019). For in-       positive selection in protein-coding genes. In addition, I show
stance, MK-based methods often are applied to one genomic             that highly expressed genes and metabolic genes may have a
feature at a time. If a genomic feature with an independent           higher rate of adaptive evolution than other genes after con-
effect on the rate of adaptive evolution is strongly correlated       trolling for several correlated genomic features, which has not
with a second feature without an independent effect, MK-              been widely reported in previous studies. Taken together, the
based methods may report a spurious association between               MK regression is a valuable addition to evolutionary biolo-
the second feature and adaptive evolution.                            gists’ arsenal for investigating the genetic basis of adaptation.
   Before the current study, two simple heuristic methods
have been previously used to estimate the independent effect          Results
of a genomic feature on the rate of adaptation by controlling
for other potentially correlated features. If we are interested in
                                                                      The MK Regression is a Generalized Linear Model
estimating the independent effect of a gene-level feature,            Tailored to Estimate the Effects of Genomic Features
such as tissue specificity, we may estimate the rate of adap-         on Positive Selection
tion at the gene level and then fit a standard linear regression      The key idea behind the MK regression is to model the site-
model, in which we treat the feature of interest and correlated       wise rate of adaptive evolution as a linear combination of
genomic features as covariates and treat the gene-level rate of       local genomic features (fig. 1). I use xa , the relative rate of
adaptation as a response variable (Luisi et al. 2015; Castellano      adaptive substitutions at a functional nucleotide site with
et al. 2016; Moutinho, Trancoso et al. 2019; Fraı̈sse et al. 2019).   respect to the average substitution rate at neutral nucleotide
The regression coefficient associated with the feature of in-         sites, as a measure of the rate of adaptation (Booker et al.
terest can be interpreted as its independent effect on positive       2017; Moutinho, Bataillon et al. 2019). Unlike previous MK-
selection, holding constant all other genomic features.               based models that treat xa as a gene-level measure, I treat xa
Although this strategy is powerful and elegant, it cannot be          as a measure of adaptive evolution at an individual nucleotide
applied to species with low levels of polymorphism, such as           site and assume that it can be predicted from local genomic
primates, due to the challenge of estimating the rate of ad-          features. To integrate the effects of multiple features on the
aptation at the gene level. Alternatively, we may first stratify      rate of adaptive evolution, I assume that xa , in a site-wise
genes into a “treatment” group and a “control” group based            manner, is a linear combination of local genomic features,
on the genomic feature of interest. Then, we may use statis-          such as local mutation rate, local recombination rate, and
tical matching algorithms to match each gene from the                 gene expression level. For each genomic feature, the MK re-
“treatment” group with a gene of similar characteristics              gression seeks to estimate a regression coefficient indicating
from the “control” group. A significant difference in the             its independent effect on the rate of adaption, holding con-
rate of adaptation between the two groups of matched genes            stant all other genomic features.
indicates that the feature of interest has an independent ef-             Specifically, the MK regression consists of two compo-
fect on positive selection. Although this method has been             nents: a generalized linear model and an MK-based likelihood
successfully used in previous studies (Enard et al. 2016;             function (fig. 1). First, I assume that xa , in a site-wise manner,
Campos et al. 2018; Castellano et al. 2019), it is difficult to       is a linear combination of local genomic features followed by
match genes when there are a large number of genomic                  an exponential transformation,
2
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291                                                          MBE

                                                                                                                                                            Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
FIG. 1. Schematic of the MK regression. The MK regression consists of two components: a generalized linear model and a McDonald–Kreitman-
based likelihood function. First, I assume that, in a site-wise manner, the rate of adaptive evolution (xa ) at a functional site is a linear combination
of local genomic features followed by an exponential transformation, in which regression coefficient bi indicates the effect of the ith feature on
adaptive evolution. Similarly, I assume that the probability of observing a SNP (Pfunc ) at the same functional site is another linear combination of
the same set of genomic features, followed by a logistic transformation. Second, in the McDonald–Kreitman-based likelihood function, I combine
xa and Pfunc at every functional site with two neutral parameters, Dneut and Pneut , to calculate the probability of observed divergence and
polymorphism data given model parameters. Dneut and Pneut denote the expected number of substitutions and the probability of observing a SNP
at a neutral site, respectively. Dfunc denotes the expected number of substitutions at a functional site.

    xa ¼ expðb0 þ b1 X1 þ    þ bi Xi þ    þ bM XM Þ:                      polymorphism in a site-wise manner. Thus, the MK regres-
                                                                                sion can naturally describe the evolution of functional and
   In this equation, Xi is the ith feature at a functional                      neutral sites.
nucleotide site; b0 is an intercept indicating the baseline
rate of adaptive evolution when all genomic features are                        Joint Analysis of Multiple Features Distinguishes
equal to 0; bi is a regression coefficient indicating the ith                   Independent Effects from Spurious Associations
feature’s effect on the rate of adaptation; M is the total                      I conducted two simulation experiments to assess the MK
number of genomic features; exp is an exponential inverse                       regression’s validity and its power to infer the independent
link function which ensures xa is positive. If bi is statisti-                  effects of genomic features on the rate of adaptive evolution.
cally different from 0, I consider that feature i may have an                   The simulation experiments consisted of two steps. First, I
independent effect on adaptation after adjusting for the                        randomly sampled genomic features from a bivariate normal
other features. Similarly, to accommodate the effects of                        distribution at each functional site. Second, I generated syn-
local genomic features on polymorphism data, I assume                           thetic polymorphism and divergence data at both functional
that the probability of observing intraspecies polymor-                         and neutral sites based on the MK regression model.
phism at a functional nucleotide site, Pfunc , is another linear                   In the first simulation experiment, I assumed that there
combination of local genomic features followed by a logis-                      were two genomic features of interest. The first genomic
tic transformation (fig. 1).                                                    feature had an independent effect on the rate of adaptive
   Second, in the component of MK-based likelihood func-                        evolution, and its regression coefficient, b1, was equal to 1. On
tion (fig. 1), I combine xa and Pfunc at every functional site                  the other hand, the second feature had no independent effect
with two neutral parameters to calculate the probability of                     on selection. Thus, its regression coefficient, b2, was equal to 0
observed polymorphism and divergence data at both func-                         by definition. The other parameters required for the simula-
tional and neutral sites, which allows for a maximum like-                      tion experiment were chosen to ensure that genome-wide
lihood estimation of model parameters. Finally, I use the                       levels of polymorphism and divergence are comparable be-
Wald test to examine whether the estimated regression                           tween synthetic data and empirical data from chimpanzees
coefficient, bb i , is significantly different from 0 for each fea-             (see details in the Materials and Methods section). To sys-
ture i. It is worth noting that, unlike the standard linear                     tematically assess the MK regression’s performance with re-
regression, the MK regression does not assume that re-                          spect to various degrees of correlation between genomic
sponse variables, that is, polymorphism and divergence                          features, I generated four sets of synthetic data with different
data, follow a normal distribution. Instead, the likelihood                     correlation coefficients between features (0.0, 0.2, 0.4, and 0.6).
function of the MK regression uses the Jukes–Cantor sub-                        In each synthetic data set, I generated 10 independent repli-
stitution model (Jukes and Cantor 1969) and the Bernoulli                       cates each of which consisted of 10 Mb functional sites and
distribution to describe the generation of divergence and                       10 Mb neutral sites.
                                                                                                                                                       3
Huang . doi:10.1093/molbev/msab291                                                                                       MBE
    I applied two different versions of the MK regression to the   mutations at 0D sites are nonsynonymous, I assume that they
synthetic data. The first version was the simple MK regression     are potentially functional. I obtained a genome-wide map of
that analyzed one genomic feature at a time, which was             single nucleotide polymorphisms (SNPs) in 18 central chim-
designed to mimic previous MK-based methods. The second            panzee (Pan troglodytes troglodytes) individuals (de Manuel
one was the multiple MK regression that analyzed two fea-          et al. 2016), and inferred ancestral alleles using a recon-
tures simultaneously. As shown in figs. 2A and 2B, both the        structed chimpanzee ancestral genome (Herrero et al. 2016;
simple MK regression and the multiple MK regression pro-           Yates et al. 2020). Because the SNP data set consisted of
duced unbiased estimates of regression coefficients when           samples from both females and males, the number of sam-
there was no correlation between features. However, the sim-       pled sequences was different between autosomes and sex

                                                                                                                                        Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
ple MK regression frequently estimated that b   b 2 was positive   chromosomes. Thus, I retained only autosomal genes for
when the two features were correlated with each other,             downstream analysis. To mitigate the impact of weak nega-
whereas the true value of b2 was equal to 0. On the other          tive selection on the inference of positive selection, I filtered
hand, the multiple MK regression always produced unbiased          out SNPs with a derived allele frequency lower than 50% for
estimates of regression coefficients regardless of the degree of   downstream analysis. In addition, I reconstructed fixed sub-
correlation between features.                                      stitutions at 4D and 0D sites in the chimpanzee lineage by
    In the second simulation experiment, I evaluated the ex-       comparing the reconstructed ancestral genome with the
tent to which the correlation between two causal features          chimpanzee reference genome. I estimated that the propor-
complicates the estimation of their independent effects. I set     tion of adaptive nonsynonymous substitutions (a) was equal
the regression coefficients of the two features to b1 ¼ 1 and      to 15.3% in chimpanzee autosomal genes, which is similar to
b2 ¼ 0:2, respectively. Then, I followed the same procedure       the estimate in a previous study (Tataru et al. 2017).
described in the first simulation experiment to generate syn-          I collected six genomic features in chimpanzee autosomal
thetic data. As shown in figs. 2C and 2D, the multiple MK          genes, including local mutation rate, local recombination rate,
regression accurately estimated regression coefficients with-      residue exposure level, gene expression level, tissue specificity,
out any noticeable bias, whereas the simple MK regression          and the number of unique protein–protein interaction part-
produced biased results when the two features were corre-          ners per gene (PPI degree). Specifically, I obtained a
lated with each other. Importantly, when the correlation was       chimpanzee-based map of local recombination rates from a
strong, the simple MK regression estimated that b   b 2 was pos-   previous study (Auton et al. 2012) and constructed a map of
itive while the true value of b2 was equal to 0.2.                local mutation rates using putatively neutral substitutions in
    Furthermore, using the same synthetic data, I evaluated        the chimpanzee lineage. Because functional genomic data
the performance of a previous MK-based method (Smith and           were more complete and of higher quality in humans than
Eyre-Walker 2002; Fraı̈sse et al. 2019), which can only analyze    in chimpanzees, I obtained tissue-based gene expression data
one feature at a time. For each genomic feature, I stratified      from the Human Protein Atlas (Uhlen et al. 2015) and utilized
functional sites into two equal-sized groups. The first group      the expression level averaged across all tissues and a summary
included the top half of functional sites with higher feature      statistic, tau (Yanai et al. 2005), as measures of gene expres-
value, whereas the second group included the bottom half of        sion level and tissue specificity, respectively. I also obtained
functional sites with lower feature value. I estimated xa for      predicted levels of residue exposure to solvent and experi-
each group separately. Then, I calculated Dxa , that is, the       mentally determined PPI degrees in the human genome from
difference in xa between the two groups of functional sites. If    previous studies (Wong et al. 2011; Luck et al. 2020). I con-
the previous MK-based method can unbiasedly estimate the           verted human-based annotations of gene expression level,
effects of genomic features, the sign of Dxa should match the      tissue specificity, residue exposure level, and PPI degree to
sign of the true regression coefficient. However, the previous     the chimpanzee genome (panTro4) using liftOver
MK-based method frequently produced wrong estimates of             (Haeussler et al. 2019).
the sign of Dxa when genomic features were strongly corre-             I first employed the simple MK regression to analyze the
lated with each other (supplementary fig. 1, Supplementary         effect of one genomic feature at a time, with no attempt to
Material online). In summary, it is critical to jointly analyze    distinguish independent effects from spurious associations.
multiple genomic features for an unbiased estimation of their      Because the MK regression used a logarithmic link function
independent effects on adaptive evolution.                         for xa , I explored if a logarithmic transformation of genomic
                                                                   features can improve model fitting. I found that the logarith-
The Multiple MK Regression Elucidates Genomic                      mic transformation improved the fitting of the simple MK
Determinants of Positive Selection in Chimpanzees                  regression for all features but tissue specificity (supplementary
I investigated positive selection in chimpanzee autosomal          table 1, Supplementary Material online). Therefore, I applied
genes using the MK regression. Because gene annotations            the logarithmic transformation to all features except tissue
were of high quality in the human genome, I converted 4-           specificity throughout this study. In the simple MK regression,
fold degenerate (4D) and 0-fold degenerate (0D) sites anno-        the regression coefficients of local mutation rate, residue ex-
tated in dbNSFP (Liu et al. 2013, 2016) from the human ge-         posure level, and PPI degree were significantly higher than 0,
nome to the chimpanzee genome. Because all point                   whereas the regression coefficient of gene expression level was
mutations at 4D sites are synonymous, I assume that they           significantly lower than 0 (fig. 3A and supplementary table 2,
are putatively neutral. On the other hand, because all point       Supplementary Material online). On the other hand, local
4
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291                                                                                               MBE

                               A                              Simple MK regression                        B                              Multiple MK regression
                                                        1.2                                                                        1.2

                           Estimated coefficient (β)

                                                                                                      Estimated coefficient (β)
                                                        1.0    ●          ●           ●        ●
                                                                                                                                   1.0    ●          ●           ●        ●

                                                  ^

                                                                                                                             ^
                                                        0.8                                                                        0.8
                                                        0.6                                    ●                                   0.6
                                                        0.4                           ●                                            0.4
                                                        0.2               ●
                                                                                                                                   0.2
                                                        0.0    ●                                                                   0.0    ●          ●           ●        ●

                                                       −0.2                                                                       −0.2
                                                       −0.4                                                                       −0.4

                                                                                                                                                                                           Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
                                                       −0.6                     True coefficient                                  −0.6                     True coefficient
                                                       −0.8
                                                                                 ●    beta 1 = 1.0                                −0.8
                                                                                                                                                            ●    beta 1 = 1.0
                                                       −1.0
                                                                                 ●    beta 2 = 0.0                                −1.0
                                                                                                                                                            ●    beta 2 = 0.0
                                                              0.0        0.2         0.4      0.6                                        0.0        0.2         0.4      0.6
                                                                    Correlation coefficient                                                    Correlation coefficient

                               C                              Simple MK regression                        D                              Multiple MK regression
                                                        1.2                                                                        1.2
                           Estimated coefficient (β)

                                                                                                      Estimated coefficient (β)
                                                                                                                                                                          ●
                                                        1.0    ●                                                                   1.0    ●          ●           ●
                                                  ^

                                                                                                                             ^
                                                                          ●
                                                                                      ●        ●
                                                        0.8                                                                        0.8
                                                        0.6                                                                        0.6
                                                        0.4                                    ●
                                                                                                                                   0.4
                                                        0.2                           ●                                            0.2
                                                                          ●
                                                        0.0                                                                        0.0
                                                       −0.2    ●
                                                                                                                                  −0.2    ●          ●           ●        ●

                                                       −0.4                                                                       −0.4
                                                       −0.6                     True coefficient                                  −0.6                     True coefficient
                                                       −0.8
                                                                                 ●    beta 1 = 1.0                                −0.8
                                                                                                                                                            ●    beta 1 = 1.0
                                                       −1.0
                                                                                 ●    beta 2 = −0.2                               −1.0
                                                                                                                                                            ●    beta 2 = −0.2
                                                              0.0        0.2         0.4      0.6                                        0.0        0.2         0.4      0.6
                                                                    Correlation coefficient                                                    Correlation coefficient

FIG. 2. Simulation results. (A) Estimates of regression coefficients in the simple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0. (B)
Estimates of regression coefficients in the multiple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0. (C) Estimates of regression
coefficients in the simple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0:2. (D) Estimates of regression coefficients in the multiple
MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0:2. In each plot, dots and error bars indicate the means and the 2-fold standard
deviations of estimated coefficients across 10 independent replicates, respectively.

recombination rate and tissue specificity were not signifi-                                                           tissue specificity was significantly higher than 0 in the multiple
cantly associated with the rate of adaptive evolution in the                                                          MK regression (b  b ¼ 1.168; P-value ¼ 6.394  1031) but not
simple MK regression (fig. 3A and supplementary table 2,                                                              in the simple MK regression (b    b ¼ 0.157; P-value ¼ 0.622).
Supplementary Material online).                                                                                           To examine whether these results were robust to different
    As discussed in the simulation experiments, the simple MK                                                         metrics of tissue specificity, I utilized the negative value of Hg
regression may produce biased estimates of regression coef-                                                           (Kryuchkova-Mostacci and Robinson-Rechavi 2017) as an al-
ficients if genomic features are correlated with each other. To                                                       ternative metric of tissue specificity. Similar to tau, a higher
test if this was the case in the chimpanzee data, I used the                                                          value of negative Hg indicates a higher level of tissue specific-
multiple MK regression to analyze the effects of the six geno-                                                        ity. I observed qualitatively similar regression coefficients
mic features simultaneously. Surprisingly, while local muta-                                                          when I replaced tau with negative Hg in the multiple MK
tion rate, local recombination rate, and residue exposure level                                                       regression (supplementary fig. 2, Supplementary Material on-
showed similar effects in the multiple MK regression, the                                                             line), although the regression coefficient of local mutation
regression coefficients of the other features were different                                                          rate was not statistically significant when negative Hg was
between the multiple MK regression and the simple MK re-                                                              used. Thus, the estimated effects of genomic features may
gression (fig. 3B and supplementary table 3, Supplementary                                                            be robust to different metrics of tissue specificity.
Material online). Specifically, the regression coefficient of PPI                                                         To investigate whether correlations between genomic fea-
degree was not significant in the multiple MK regression (b  b¼                                                       tures could explain the differences in estimated coefficients
0.002; P-value ¼ 0.975), whereas the same coefficient was                                                             between the multiple and the simple MK regression, I calcu-
significant in the simple MK regression (b   b ¼ 0.460; P-value                                                       lated the Kendall rank correlation coefficient for all pairs of
¼ 3.178  104). The coefficient of gene expression level was                                                         genomic features (fig. 3C and supplementary table 4,
significantly higher than 0 in the multiple MK regression (b b¼                                                       Supplementary Material online). I found that local mutation
0.347; P-value ¼ 6.674  1012), whereas the same coefficient                                                         rate, local recombination rate, and residue exposure level
was negative in the simple MK regression (b     b ¼ 0.310; P-                                                        were weakly correlated with other features, which may ex-
value ¼ 3.534  1010). Also, the regression coefficient of                                                           plain why the regression coefficients of these features were
                                                                                                                                                                                      5
Huang . doi:10.1093/molbev/msab291                                                                                                                                                            MBE
    A                                       Simple MK regression                                                                       8,471 genes with no more than one interaction partner
                          2                                                                                                            were assigned to the low PPI-degree group. Without control-
                                                                                                                                       ling for gene expression level and tissue specificity, the high
    Coefficient (β)
                 ^

                          1                                    **                              ***                                     PPI-degree group had a higher xa than the low PPI-degree
                                    ***
                          0
                                                                                                         Genomic feature
                                                                                                                                       group (supplementary fig. 3A, Supplementary Material on-
                                                                         ***                                  Mutation rate            line), but the difference in xa was not significant (P-value ¼
                      −1                                                                                      Recombination rate
                                                                                                              Residue exposure
                                                                                                                                       0.146; two-tailed permutation test), possibly due to a reduc-
    B                                      Multiple MK regression                                             Expression level         tion of sample size in the stratified analysis. Then, using the
                          2                                                                                   Tissue specificity
                                                                                                                                       default propensity score matching algorithm in MatchIt (Ho

                                                                                                                                                                                                              Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
                                                                                                              PPI degree
                                                                                     ***
    Coefficient (β)

                                                                                                                                       et al. 2011), I matched each gene from the high PPI-degree
                 ^

                          1

                                     *                         **        ***                                                           group with a gene of similar expression level and tissue spe-
                          0                                                                                                            cificity from the low PPI-degree group. In the matched data,
                                                                                                                                       xa was not different between the high-PPI and low-PPI
                      −1
                                                                                                                                       groups (supplementary fig. 3B, Supplementary Material on-
    C                                                                                                                                  line). I observed similar results using two alternative cutoffs, 5
                                                                                                      Mutation rate
                                                                                                                                       and 20, for the high PPI-degree group (supplementary fig. 4,
                                                                                                      Recombination rate   Kendall's
                                                                                                                           tau
                                                                                                                                       Supplementary Material online). Thus, PPI degree is unlikely
                                                                                                      Residue exposure        1.0      to be a genomic determinant of positive selection in
                                                                                                                              0.5
                                                                                                      Expression level        0.0      chimpanzees.
                                                                                                                              −0.5         I also verified the effect of tissue specificity on the rate of
                                                                                                      Tissue specificity      −1.0
                                                                                                                                       adaptation after adjusting for gene expression level. Due to
                                                                                                      PPI degree                       the strong negative correlation between expression level and
                                                                                                                                       tissue specificity (supplementary fig. 5, Supplementary
                                     e

                                                  e

                                                              e

                                                                            l

                                                                                     ty

                                                                                                  e
                                                                          ve
                                      t

                                                   t

                                                              ur

                                                                                               re
                                                                                   ci
                                   ra

                                                ra

                                                                                                                                       Material online), MatchIt returned few matched genes
                                                                        le
                                                         os

                                                                                             eg
                                                                                  ifi
                               n

                                           n

                                                                                ec
                                                                    on
                                                          p
                              io

                                          tio

                                                                                          Id
                                                       ex

                                                                             sp
                          at

                                                                   si
                                      na

                                                                                        PP
                                                               es
                      ut

                                                                                                                                       when I attempted to control for expression level. Therefore,
                                                  ue

                                                                          ue
                                     bi
                      M

                                                              pr
                                                id
                               om

                                                                        ss
                                                          Ex
                                            es

                                                                    Ti
                              ec

                                          R

                                                                                                                                       I implemented a different matching approach. By closely ex-
                          R

                                                                                                                                       amining the relationship between expression level and tissue
FIG. 3. Effects of genomic features on the rate of adaptive evolution.
                                                                                                                                       specificity, I found that the variation in tissue specificity was
(A) Estimated coefficients of genomic features in the simple MK re-
                                                                                                                                       high among highly expressed genes (supplementary fig. 5,
gression. (B) Estimated coefficients of genomic features in the multi-
ple MK regression. In each bar plot, error bars indicate 95%                                                                           Supplementary Material online). Thus, I stratified 737 highly
confidence intervals while one, two, and three asterisks indicate                                                                      expressed genes (gene expression level > 30) into two equal-
0.01  P-value < 0.05, 0.001  P-value < 0.01, and P-value <                                                                           sized groups based on the ranking of their tissue specificity. As
0.001, respectively. (C) Correlations between genomic features.                                                                        shown in fig. 4A, highly expressed genes with high tissue
                                                                                                                                       specificity had a significantly higher xa than their counter-
consistent between the multiple and the simple MK regres-                                                                              parts with low tissue specificity (P-value ¼ 0.001; two-tailed
sion. In contrast, gene expression level and tissue specificity                                                                        permutation test). Therefore, my gene matching analysis con-
showed a strong negative correlation, which may cause spu-                                                                             firms the positive effect of tissue specificity on the rate of
rious associations in the simple MK regression. I also found                                                                           adaptation in the multiple MK regression.
that PPI degree was correlated with gene expression level and                                                                              As an alternative analysis to infer the independent effect of
tissue specificity, although the correlations were to a lesser                                                                         tissue specificity after controlling for expression level, I divided
                                                                                                                                       genes into 10 equal-sized groups (deciles) based on their ex-
extent compared with the correlation between gene expres-
                                                                                                                                       pression levels. In each decile, I further divided genes into two
sion level and tissue specificity. Thus, the observed association
                                                                                                                                       equal-sized subgroups based on the ranking of their tissue
of PPI degree with positive selection in the simple MK regres-
                                                                                                                                       specificity within the decile. As expected, the subgroup with
sion could be due to its correlation with gene expression level
                                                                                                                                       high tissue specificity had a significantly higher xa than the
and/or tissue specificity.                                                                                                             subgroup with low tissue specificity in the decile with the
                                                                                                                                       highest expression level (supplementary table 5,
Statistical Matching Analysis Confirms Genomic
                                                                                                                                       Supplementary Material online), which confirms the positive
Determinants Identified by the Multiple MK                                                                                             effect of tissue specificity on adaptation rate in highly
Regression                                                                                                                             expressed genes. The difference in xa was not significant in
I used statistical matching algorithms to corroborate the                                                                              other deciles, possibly due to the low variation in tissue spe-
results of the multiple MK regression. First, I verified whether                                                                       cificity in lowly expressed genes (supplementary fig. 5,
the association between PPI degree and the rate of adaptive                                                                            Supplementary Material online).
evolution could be explained away by controlling for gene                                                                                  I observed that the variation in expression level was high
expression level and tissue specificity. I stratified protein-                                                                         among genes with high tissue specificity (supplementary fig. 5,
coding genes into two groups with different PPI degrees, in                                                                            Supplementary Material online). To verify the positive effect
which 1,556 genes with at least 10 protein interaction part-                                                                           of gene expression level after adjusting for tissue specificity, I
ners were assigned to the high PPI-degree group whereas                                                                                stratified 993 tissue-specific genes (tau > 0.85) into two
6
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291                                                       MBE

                              A 0.75                                         B 0.75

                                             0.50                                           0.50
                                                                                                                ●

                                             0.25                                           0.25

                           ωa

                                                                           ωa
                                                                  ●

                                             0.00     ●                                     0.00     ●

                                                                                                                                                         Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
                                            −0.25                                          −0.25

                                                     Low        High                               Low         High
                                                    Tissue specificity                             Expression level
                              C                                              D
                                               2                                              2
                             PhyloP score

                                                                            PhyloP score
                                               0                                              0

                                              −2                                             −2

                                                     Low        High                               Low         High
                                                    Tissue specificity                             Expression level

FIG. 4. Statistical matching analysis. (A) Estimates of xa in 369 highly expressed genes with low tissue specificity and 368 highly expressed genes
with high tissue specificity. (B) Estimates of xa in 498 tissue-specific genes with a low expression level and 495 tissue-specific genes with a high
expression level. In each violin plot, dots indicate point estimates of xa while violins depict the distributions of xa from a gene-based boot-
strapping analysis with 1,000 resamplings. (C) Distributions of phyloP scores in 369 highly expressed genes with low tissue specificity and 368 highly
expressed genes with high tissue specificity. (D) Distributions of phyloP scores in 498 tissue-specific genes with a low expression level and 495
tissue-specific genes with a high expression level. In each box plot, the bottom, the top, and the middle horizontal bar of the box indicate the first
quartile, the third quartile, and the median of phyloP scores, respectively. The whiskers indicate 1.5-fold interquartile ranges.

approximately equal-sized groups based on the ranking of                            significant in other deciles, possibly due to the low variation
their expression levels. The first group consisted of the top                       in expression level among genes with low tissue specificity
495 tissue-specific genes with higher expression level, whereas                     (supplementary fig. 5, Supplementary Material online). On
the second group consisted of the bottom 498 tissue-specific                        average, highly expressed genes showed a higher rate of adap-
genes with lower expression level. The mean expression levels                       tive evolution than their lowly expressed counterparts.
were equal to 5.905 and 0.368 in the two gene groups, which                            Also, I used phyloP scores (Pollard et al. 2010; Hubisz
corresponds to a 16-fold difference in mean expression level.                       et al. 2011) to examine the effects of gene expression level
As shown in fig. 4B, tissue-specific genes with a high expres-                      and tissue specificity on the rate of protein evolution. After
sion level had a significantly higher xa than their lowly                           controlling for gene expression level, phyloP scores increased
expressed counterparts (P-value ¼ 0.002; two-tailed permu-                          with decreasing tissue specificity (fig. 4C), which is in line
tation test), which confirms the positive effect of gene expres-                    with the observation that housekeeping genes tend to evolve
sion level on the rate of adaptive evolution in the multiple                        at a lower substitution rate than tissue-specific genes
MK regression.                                                                      (Zhang and Li 2004; Zhu et al. 2008). On the other hand, after
    As an alternative analysis to infer the independent effect of                   controlling for tissue specificity, phyloP scores increased
gene expression level after controlling for tissue specificity, I                   with increasing expression level (fig. 4D), which is in line
divided genes into 10 deciles based on their tissue specificity.                    with stronger purifying selection on highly expressed genes
In each decile, I further divided genes into two equal-sized                        (Zhang and Yang 2015). Taken together, it seems that highly
subgroups based on the ranking of their expression levels                           expressed genes may be subject to more frequent positive
within the decile. As expected, I observed that the subgroup                        selection than their lowly expressed counterparts, although a
with a high expression level had a significantly higher xa than                     higher expression level may impose stronger purifying selec-
its counterpart in the decile with the highest tissue specificity                   tion and reduce the overall rate of protein evolution.
(supplementary table 6, Supplementary Material online). To a
lesser extent, xa was slightly lower in the subgroup with a                         The Rate of Adaptive Evolution May Also Increase
high expression level than the subgroup with a low expression                       with Gene Expression Level in Drosophila
level in the 7th decile (supplementary table 6, Supplementary                       In the previous section, I showed that the rate of adaptive
Material online). The difference in xa was not statistically                        evolution may increase with increasing gene expression level
                                                                                                                                                    7
Huang . doi:10.1093/molbev/msab291                                                                                        MBE
in chimpanzees. Recently, Fraı̈sse et al. (2019) examined the        with a high expression level (fig. 5A and 5B; false-discovery
same problem in Drosophila melanogaster. Fraı̈sse et al. (2019)      rate < 0.01). Based on these results, I hypothesized that met-
first estimated xa for each gene separately. Then, they              abolic genes may be subject to more frequent positive selec-
regressed the gene-level xa on expression level and other            tion than nonmetabolic genes in chimpanzees.
potentially correlated genomic features, such as tissue specif-          To test this hypothesis, I constructed a genomic feature
icity, using the standard linear regression. The coefficients of     indicating whether each 0D site was located in one of the
the standard linear regression were interpreted as the inde-         2,220 metabolic genes from MSigDB (Subramanian et al. 2005;
pendent effects of genomic features after controlling for other      Liberzon et al. 2011). Then, I used the multiple MK regression
correlated features. Unlike the current study, Fraı̈sse et al.       to simultaneously estimate the effects of the new feature and

                                                                                                                                         Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
(2019) observed that the rate of adaptation might decrease           the six original features. As shown in fig. 5C and supplemen-
with increasing expression level in D. melanogaster.                 tary table 7, Supplementary Material online, the regression
    To reconcile the discrepancy between the current study           coefficient of metabolic genes was significantly higher than 0
and Fraı̈sse et al. (2019), I reanalyzed the data from Fraı̈sse et   in the multiple MK regression (b  b ¼ 0.537; P-value ¼ 1.075 
al. (2019). Following Fraı̈sse et al. (2019), I used the standard    104), suggesting that metabolic genes might have a higher
linear regression to regress the gene-level estimate of xa on        rate of adaptation than their nonmetabolic counterparts.
gene expression level and tissue specificity, and observed that      Interestingly, the effect of metabolic genes was not significant
the coefficient of gene expression was negative (0.041384; P-       in the simple MK regression (b    b ¼ 0.276; P-value ¼ 0.516).
value ¼ 8.93  109). However, the standard linear regression        Therefore, controlling for potentially correlated features is
may suffer from two critical problems in this data set. First,       critical for revealing elevated positive selection in metabolic
the residuals of the standard linear regression did not follow a     genes. Metabolic genes may partially explain the positive ef-
normal distribution, as suggested by a quantile–quantile plot        fect of gene expression level on adaptive evolution because
(supplementary fig. 6A, Supplementary Material online).              the regression coefficient of gene expression level reduced
Second, the majority of genes had less than 4 nonsynony-             moderately from 0.347 to 0.314 after adding metabolic genes
mous polymorphisms in this data set (supplementary fig. 6B,          as a new feature in the multiple MK regression (supplemen-
Supplementary Material online), so the gene-level estimate of        tary tables 3 and 7, Supplementary Material online).
xa may be highly inaccurate. Thus, I argue that the standard             I observed similar results using propensity score matching.
linear regression may not be an appropriate statistical              Specifically, I observed that residue exposure level, gene ex-
method for analyzing this data set.                                  pression level, and tissue specificity showed different distribu-
    On the other hand, gene expression level had a positive          tions between metabolic and nonmetabolic genes
regression coefficient in the multiple MK regression (supple-        (supplementary fig. 8, Supplementary Material online).
mentary fig. 7A, Supplementary Material online). In an or-           Before controlling for these correlated genomic features, xa
thogonal statistical matching analysis, I stratified 681             was similar between metabolic and nonmetabolic genes (sup-
Drosophila tissue-specific genes (tau > 0.85) into two gene          plementary fig. 8, Supplementary Material online). On the
groups based on the ranking of their expression levels. The          other hand, xa was more than two times higher in metabolic
first group consisted of the top 340 tissue-specific genes with      genes than in nonmetabolic genes (0.0550 vs. 0.0234) after
higher expression level, whereas the second group consisted          controlling for residue exposure level, gene expression level,
of the bottom 341 tissue-specific genes with lower expression        and tissue specificity (supplementary fig. 8, Supplementary
level. Again, tissue-specific genes with higher expression level     Material online) despite that the difference was not statisti-
had a higher xa than tissue-specific genes with lower expres-        cally significant due to reduced sample size (P-value ¼ 0.390;
sion level (supplementary fig. 7B, Supplementary Material            two-tailed permutation test).
online) despite that the difference in xa was marginally sig-            To test whether the effect of metabolic genes on adapta-
nificant (P-value ¼ 0.067; two-tailed permutation test). Taken       tion rate can be explained away by their biased expression in
together, the rate of adaptation may also increase with in-          digestive organs, I obtained 2,274 genes with biased or
creasing gene expression level in D. melanogaster.                   enriched expression in digestive organs, including intestine,
                                                                     liver, pancreas, salivary gland, and stomach, from the Human
Metabolic and Immune Genes May Be under Frequent                     Protein Atlas (Uhlen et al. 2015). As expected, metabolic
Positive Selection in Chimpanzees                                    genes were more likely to have a biased expression in digestive
To explore whether the positive effect of gene expression level      organs than nonmetabolic genes (odds ratio ¼ 2.177; P-value
on the rate of adaptive evolution can be explained by the
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291                                                                                     MBE

    A                 10
                                                                                                                   analysis, I observed that the regression coefficient of immune
                                                                                                                   genes was significantly higher than 0 (supplementary fig. 10,
    enrichment            8
      Fold of

                          6                                                                                        Supplementary Material online and supplementary table 8,
                          4                                                                                        Supplementary Material online). Thus, immune genes may
                          2                                                                                        have a higher rate of adaptive evolution than their nonim-
                          0                                                                                        mune counterparts. Immune genes may also partially explain
                                                                                                                   the positive effect of gene expression level on the rate of
                                     em

                                                   st ne

                                                     ei m

                                                   lip ism

                                                            m

                                                                                                     ec m r t
                                                                                                   ol s o
                                                 ot is

                                                         is
                                                      em

                                                       ns

                                                         s

                                                                                                       ul all
                                               Sy mu

                                                                                                  m of nsp
                                   st

                                               pr ol

                                               of ol

                                                      ol
                                                      id

                                                                                                          es
                              Sy

                                                                                                                   adaptive evolution, since the regression coefficient of gene
                                             of tab

                                                   ab

                                                   ab
                                                  Im

                                                                                                         a
                                                                                                       Tr
                                                et

                                                et
                          e

                                                e
                                            e
                       un

                                              M

                                              M

                                              M
                                          at

                                                                                                                                                                                      Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
                                                                                                                   expression level decreased from 0.314 to 0.238 after adding
                      m

                                        n
                                     In
                Im

                                                                                                                   immune genes as a new feature (supplementary tables 7 and
    B                 10                                                                                           8, Supplementary Material online).
    enrichment

                          8
      Fold of

                          6                                                                                        Discussion
                          4
                                                                                                                   In this work, I have introduced the MK regression, the first
                          2
                                                                                                                   evolutionary model for jointly estimating the effects of mul-
                          0
                                                                                                                   tiple, potentially correlated genomic features on the rate of
                                        is

                                                      cl l

                                                           as

                                                                                           r

                                                                                                            e
                                                   us ta

                                                                                        ve

                                                                                                       tin
                                     m

                                                                                                                   adaptive substitutions. Based on similar ideas, my colleagues
                                                  m ele

                                                        re
                                                         e

                                                                                        Li
                                   dy

                                                                                                        s
                                                     nc

                                                                                                     te
                                                   Sk
                               di

                                                                                                  In
                                                  Pa

                                                                                                                   and I have previously developed statistical approaches to infer
                             i
                          Ep

                                                                                                                   negative selection on genetic variants (Huang et al. 2017;
    C                 2.0
                                                                                                                   Huang and Siepel 2019; Huang 2020) and the evolutionary
                                                                                      ***
    Coefficient (β)

                      1.5                                                                                          turnover of cis-regulatory elements (Dukler et al. 2020). Thus,
                 ^

                      1.0                                                                                    ***   unifying generalized linear models and evolutionary models
                                                                                                                   may be a powerful strategy to address a variety of statistical
                      0.5             *                       **          ***
                                                                                                                   problems in evolutionary biology.
                      0.0                                                                                              As shown in the simulation experiments (fig. 2), when two
                                                                                                                   genomic features are correlated with each other, even at a
                                     e

                                                 e

                                                              re

                                                                             l

                                                                                      ty

                                                                                                         e

                                                                                                   ne ic
                                                                          ve

                                                                                                                   moderate level, the simple MK regression and a previous MK-
                                      t

                                                   t

                                                                                                      re

                                                                                                ge bol
                                                                                   ci
                                   ra

                                                ra

                                                         su

                                                                                                      s
                                                                        le

                                                                                             eg
                                                                                  ifi
                                                         po
                               n

                                             b.

                                                                                                    a
                                                                                 ec
                                                                   on
                              io

                                                                                           Id

                                                                                                 et
                                          om

                                                                                                                   based method (Smith and Eyre-Walker 2002; Fraı̈sse et al.
                                                       ex

                                                                             sp
                          at

                                                                   si

                                                                                                   M
                                                                                        PP
                                                              es
                      ut

                                     ec

                                                 ue

                                                                          ue
                      M

                                                              pr
                                    R

                                                                                                                   2019) cannot accurately estimate the independent effects
                                                id

                                                                        ss
                                                         Ex
                                             es

                                                                    Ti
                                            R

                                                                                                                   of genomic features because they cannot control for corre-
FIG. 5. Positive selection in metabolic genes. (A) Enrichment of                                                   lated genomic features. On the other hand, the multiple MK
Reactome pathways in 495 tissue-specific genes with a high expres-                                                 regression can unbiasedly estimate the independent effects of
sion level. (B) Enrichment of tissue types in 495 tissue-specific genes                                            genomic features if all relevant genomic features are included
with a high expression level. In each enrichment test, 498 tissue-                                                 in the same analysis. Because we are almost always interested
specific genes with a low expression level are used as a background                                                in the independent effects of genomic features, the multiple
gene set. A dashed blue line indicates that the fold of enrichment is                                              MK regression may be superior to the simple MK regression
equal to 1 (no enrichment). (C) Estimates of regression coefficients in                                            and other MK-based methods that can only analyze one
the multiple MK regression. This analysis includes a new binary fea-                                               feature at a time.
ture indicating whether each 0D site is located in a metabolic gene.
                                                                                                                       Regression coefficients in the multiple MK regression
Error bars indicate 95% confidence intervals while one, two, and three
asterisks indicate 0.01  P-value < 0.05, 0.001  P-value < 0.01, and
                                                                                                                   might be interpreted as the direct causal effects of genomic
P-value < 0.001, respectively.                                                                                     features on the rate of adaptation. However, similar to other
                                                                                                                   linear regression models (Pearl et al. 2016), the causal inter-
                                                                                                                   pretation of the multiple MK regression relies on two implicit
adaptation rate could not be explained by their biased ex-                                                         assumptions. First, genomic features of interest and all corre-
pression in digestive organs.                                                                                      lated genomic features should be included in the same MK
   In line with frequent positive selection on the immune                                                          regression analysis. Second, there should be no reverse cau-
system (Schlenke and Begun 2003; Nielsen et al. 2005;                                                              sality, that is, the rate of adaptive evolution should not cause
Kosiol et al. 2008; Barreiro and Quintana-Murci 2010), I ob-                                                       changes in genomic features. As discussed in the literature of
served that genes associated with the immune system had a                                                          causal inference (Pearl et al. 2016), these assumptions cannot
2- to 4-fold enrichment in the 495 tissue-specific genes with a                                                    be verified using observational data alone and, thus, have to
high expression level (fig. 5A; false-discovery rate < 0.01). To                                                   be justified by domain knowledge on a case-by-case basis.
formally test if immune system genes have a higher rate of                                                             It is worth noting that I do not attempt to infer the total
adaptation than nonimmune genes in chimpanzees, I con-                                                             causal effects of genomic features on adaptation rate in the
structed a genomic feature indicating whether each 0D site                                                         current study. According to the theory of causal inference
was located in one of the 3,400 immune system genes from                                                           (Pearl et al. 2016), the total effect of a genomic feature of
MSigDB (Subramanian et al. 2005; Liberzon et al. 2011). After                                                      interest includes its direct effect on the rate of adaptation as
adding this new feature to the multiple MK regression                                                              well as its indirect effect through mediators, that is, others
                                                                                                                                                                                 9
Huang . doi:10.1093/molbev/msab291                                                                                        MBE
genomic features that reside on a directed path from the             hydrophobic cores (Bloom et al. 2005; Bloom, Labthavikul
genomic feature of interest to the rate of adaptation in an          et al. 2006; Bloom, Drummond et al. 2006; Franzosa and Xia
assumed causal graph. Thus, inferring the total effect of the        2009). Thus, missense mutations on protein surfaces may be
genomic feature of interest will require me to impose very           under more frequent positive selection because they are less
strong assumptions about the causal relationship between             likely to perturb protein folding. Alternatively, protein surfa-
genomic features (see examples in Rosenbaum et al. 2020;             ces may have a higher rate of adaptive evolution because they
Laubach et al. 2021). On the other hand, to infer the direct         may play an important role in host–pathogen interactions
effect of the genomic feature of interest, I can simply control      (Moutinho, Trancoso et al. 2019).
for all potentially correlated features without specifying their         Fourth, I have shown that the tissue specificity of a gene

                                                                                                                                         Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021
causal relationship with the genomic feature of interest             has a positive effect on the rate of adaptation after controlling
(Laubach et al. 2021). In other words, the MK regression ef-         for correlated genomic features, such as gene expression level.
fectively assumes a simplified causal graph where I do not           Thus, tissue-specific genes are more likely to be under positive
specify the directions of causality between genomic features         selection than housekeeping genes. Because nonsynonymous
(supplementary fig. 11, Supplementary Material online; see           mutations in housekeeping genes have a higher chance to
also Chapter 4.3 in Shipley 2016).                                   disrupt multiple phenotypes, my findings support that the
    It is also worth noting that I have ignored potential inter-     pleiotropic effect is a key determinant of adaptive evolution
actions between genomic features in the current study. It is         (Fraı̈sse et al. 2019).
possible to add interaction terms to the multiple MK regres-             Fifth, I have shown that the rate of adaptive evolution
sion. However, because of the large number of potential in-          increases with increasing gene expression levels in both chim-
teraction terms and the spareness of polymorphisms in the            panzees (figs. 3B and 4B) and Drosophila (supplementary fig.
chimpanzee genome, I may lack statistical power to detect            7, Supplementary Material online) after controlling for corre-
interactions between features in chimpanzees.                        lated genomic features, such as tissue specificity. Recently,
    Using the multiple MK regression, I have identified numer-       Fraı̈sse et al. (2019) reported an opposite trend in
ous genomic features with independent effects on adaptive            Drosophila by regressing a gene-level estimate of xa on
evolution in the chimpanzee lineage (fig. 3B). First, in line with   gene expression level and potentially correlated features.
previous studies (Campos et al. 2014; Castellano et al. 2016;        However, my reanalysis of their data suggests that the stan-
Rousselle et al. 2020), I have shown that the rate of adaptation     dard linear regression used in Fraı̈sse et al. (2019) may not be
increases with increasing mutation rate. Because mutations           an appropriate method for inferring the effects of genomic
are the ultimate source of genetic variation, a higher mutation      features in Drosophila. Unlike the standard linear regression,
rate will increase the genetic variation for positive selection to   the MK regression does not rely on inaccurate estimates of xa
act on.                                                              at the gene level, and does not assume that the response
    Second, previous studies have shown that local recombi-          variables follow a normal distribution. Thus, the MK regres-
nation rate is positively correlated with the rate of adaptation     sion may be more broadly applicable than the standard linear
in Drosophila (Marais and Charlesworth 2003; Campos et al.           regression in inferring the effects of genomic features on
2014; Castellano et al. 2016), probably due to a reduced effect      adaptation.
of Hill-Robertson interference in recombination hotspots.                Sixth, numerous studies have shown that immune genes
However, I have not observed the same pattern in chimpan-            may have a higher rate of adaptive evolution than other genes
zees, which could be explained by a reduced impact of linked         in various species (Schlenke and Begun 2003; Nielsen et al.
selection in species with a small census population size, such       2005; Kosiol et al. 2008; Barreiro and Quintana-Murci 2010).
as chimpanzees (Corbett-Detig et al. 2015). Alternatively, my        Using the MK regression, I have observed the same trend in
analysis may have limited power to detect a weak association         chimpanzees (supplementary fig. 10, Supplementary Material
between recombination rate and positive selection due to the         online and supplementary table 8, Supplementary Material
lower level of polymorphism and the smaller proportion of            online). Thus, immune genes may be subject to constant
adaptive substitutions in chimpanzees compared with                  adaptation in chimpanzees to fight against ever-evolving
Drosophila (Castellano et al. 2016).                                 pathogens and parasites.
    Third, in agreement with a previous study (Moutinho,                 Last but not least, I have shown that highly expressed genes
Trancoso et al. 2019), I have shown that the rate of adaptive        are more likely to be associated with metabolic pathways and
evolution increases with the increasing level of residue expo-       digestive organs than their lowly expressed counterparts (fig.
sure to solvent. On the other hand, it is well-known that the        5A and 5B), which implies that frequent positive selection in
site-wise rate of protein evolution is positively correlated with    metabolic genes may partially explain the positive effect of
the level of residue exposure, possibly due to relaxed negative      gene expression level on the rate of adaptation. In agreement
selection on exposed residues (Goldman et al. 1998; Franzosa         with this hypothesis, I have shown that metabolic genes may
and Xia 2009; Liberles et al. 2012; Echave et al. 2016). Taken       have a higher rate of adaptation than their nonmetabolic
together, exposed residues on protein surfaces may be subject        counterparts after controlling for potentially correlated geno-
to both weaker negative selection and more frequent positive         mic features (fig. 5C). Similarly, a recent study has reported
selection than their buried counterparts. From a biophysical         that metabolic pathways may be subject to more frequent
perspective, missense mutations on protein surfaces are less         positive selection than nonmetabolic pathways in multiple
likely to disrupt protein stabilities than mutations in              inner branches of the primate phylogeny (Daub et al. 2017).
10
You can also read