Dissecting Genomic Determinants of Positive Selection with an Evolution-Guided Regression Model
←
→
Page content transcription
If your browser does not render page correctly, please read the page content below
Dissecting Genomic Determinants of Positive Selection with an Evolution-Guided Regression Model Yi-Fei Huang *1,2 1 Department of Biology, Pennsylvania State University, University Park, PA, USA 2 Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA *Corresponding author: E-mail: yuh371@psu.edu. Associate editor: Jeffrey Townsend Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 Abstract In evolutionary genomics, it is fundamentally important to understand how characteristics of genomic sequences, such as gene expression level, determine the rate of adaptive evolution. While numerous statistical methods, such as the McDonald–Kreitman (MK) test, are available to examine the association between genomic features and the rate of adaptation, we currently lack a statistical approach to disentangle the independent effect of a genomic feature from the effects of other correlated genomic features. To address this problem, I present a novel statistical model, the MK regression, which augments the MK test with a generalized linear model. Analogous to the classical multiple regression model, the MK regression can analyze multiple genomic features simultaneously to infer the independent effect of a genomic feature, holding constant all other genomic features. Using the MK regression, I identify numerous genomic features driving positive selection in chimpanzees. These features include well-known ones, such as local mutation rate, residue exposure level, tissue specificity, and immune genes, as well as new features not previously reported, such as gene expression level and metabolic genes. In particular, I show that highly expressed genes may have a higher adaptation rate than their weakly expressed counterparts, even though a higher expression level may impose stronger negative selection. Also, I show that metabolic genes may have a higher adaptation rate than their nonmetabolic counterparts, possibly due to recent changes in diet in primate evolution. Overall, the MK regression is a powerful approach to elucidate the genomic basis of adaptation. Key words: adaptive evolution, positive selection, McDonald–Kreitman test, statistical inference, causal inference. Introduction and putatively neutral sites, the MK test seeks to identify positively selected genes that show an excess of interspecies Understanding the genetic basis of positive selection is a fun- divergence at functional sites. Because highly deleterious damental problem in evolutionary biology. Numerous statis- Article mutations can neither segregate nor reach fixation in a pop- tical approaches have been developed to detect loci under ulation, the MK test is intrinsically robust to the presence of positive selection. A popular framework is codon substitution strong negative selection. On the other hand, weak negative models that seek to infer positively selected genes solely from interspecies sequence divergence (Goldman and Yang 1994; selection may lead to biased results in the MK test because Muse and Gaut 1994; Yang et al. 2000). By contrasting the mutations under weak selection can segregate in a popula- rate of nonsynonymous substitutions (dN) against the rate of tion but not reach fixation. To address this problem, several synonymous substitutions (dS), codon substitution models recent studies have extended the MK test to account for the can identify positively selected genes with a dN/dS ratio effects of weak negative selection on intraspecies polymor- greater than 1. However, because negative (purifying) selec- phism (Eyre-Walker and Keightley 2009; Messer and Petrov tion can dramatically reduce dN/dS ratios, codon substitution 2013; Galtier 2016; Haller and Messer 2017; Uricchio et al. models may be underpowered to detect genes that experi- 2019), ensuring that the inference of positive selection is enced both positive selection and strong negative selection not biased by the presence of weak selection. Thus, the MK (Hughes 2007). test and its extensions are powerful methods to disentangle Unlike codon substitution models, the McDonald– positive selection from ubiquitous negative selection. Kreitman (MK) test utilizes both interspecies divergence Because MK-based methods use relatively sparse diver- and intraspecies polymorphism to elucidate positive selection gence and polymorphism data from closely related species, in a species of interest (McDonald and Kreitman 1991; Fay et they may be underpowered to pinpoint individual genes un- al. 2001; Smith and Eyre-Walker 2002). By contrasting the der positive selection. To boost statistical power, MK-based levels of divergence and polymorphism at functional sites methods often are applied to a collection of genes or ß The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons. org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Open Access Mol. Biol. Evol. doi:10.1093/molbev/msab291 Advance Access publication October 01, 2021 1
Huang . doi:10.1093/molbev/msab291 MBE nucleotide sites with similar genomic features. Using this features to control for. Therefore, we currently lack a general pooling strategy, previous studies have identified numerous and powerful statistical framework to estimate the indepen- genomic features associated with positive selection in dent effects of genomic features on positive selection by Drosophila and primates. The features associated with posi- adjusting for a large number of correlated genomic features. tive selection in Drosophila include local mutation rate In the current study, I present a novel statistical method, (Campos et al. 2014; Castellano et al. 2016; Rousselle et al. the MK regression, to estimate the independent effects of 2020), local recombination rate (Marais and Charlesworth genomic features on the rate of adaptive evolution. The 2003; Campos et al. 2014; Castellano et al. 2016), gene expres- MK regression is a hybrid of the MK test and the generalized sion specificity (Fraı̈sse et al. 2019), residue exposure to sol- linear regression. Unlike standard linear regression models Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 vent (Moutinho, Trancoso et al. 2019), X linkage (Avila et al. and statistical matching algorithms, the MK regression can 2015; Campos et al. 2018), and sex-biased expression control for a large number of correlated genomic features and (Pröschel et al. 2006; Avila et al. 2015; Campos et al. 2018). is applicable to species with a low level of polymorphism. To The features associated with positive selection in primates the best of my knowledge, the MK regression is the first include protein disorder (Afanasyeva et al. 2018), virus–host evolutionary model tailored to characterize the independent interaction (Enard et al. 2016; Uricchio et al. 2019), protein– effects of genomic features on adaptive evolution. Using syn- protein interaction (PPI) degree (Luisi et al. 2015), and X thetic data, I show that the MK regression can unbiasedly linkage (Hvilsom et al. 2012). estimate the independent effect of a genomic feature even While existing MK-based methods can identify genomic when it is strongly correlated with another genomic feature. features associated with the signatures of positive selection, Applying the MK regression to polymorphism and divergence they may not be able to distinguish genomic features inde- data in the chimpanzee lineage, I corroborate previous find- pendently affecting the rate of adaptive evolution from spu- ings that local mutation rate, residue exposure level, tissue rious features without independent effects on adaptation specificity, and immune system genes are key determinants of (Moutinho, Trancoso et al. 2019; Fraı̈sse et al. 2019). For in- positive selection in protein-coding genes. In addition, I show stance, MK-based methods often are applied to one genomic that highly expressed genes and metabolic genes may have a feature at a time. If a genomic feature with an independent higher rate of adaptive evolution than other genes after con- effect on the rate of adaptive evolution is strongly correlated trolling for several correlated genomic features, which has not with a second feature without an independent effect, MK- been widely reported in previous studies. Taken together, the based methods may report a spurious association between MK regression is a valuable addition to evolutionary biolo- the second feature and adaptive evolution. gists’ arsenal for investigating the genetic basis of adaptation. Before the current study, two simple heuristic methods have been previously used to estimate the independent effect Results of a genomic feature on the rate of adaptation by controlling for other potentially correlated features. If we are interested in The MK Regression is a Generalized Linear Model estimating the independent effect of a gene-level feature, Tailored to Estimate the Effects of Genomic Features such as tissue specificity, we may estimate the rate of adap- on Positive Selection tion at the gene level and then fit a standard linear regression The key idea behind the MK regression is to model the site- model, in which we treat the feature of interest and correlated wise rate of adaptive evolution as a linear combination of genomic features as covariates and treat the gene-level rate of local genomic features (fig. 1). I use xa , the relative rate of adaptation as a response variable (Luisi et al. 2015; Castellano adaptive substitutions at a functional nucleotide site with et al. 2016; Moutinho, Trancoso et al. 2019; Fraı̈sse et al. 2019). respect to the average substitution rate at neutral nucleotide The regression coefficient associated with the feature of in- sites, as a measure of the rate of adaptation (Booker et al. terest can be interpreted as its independent effect on positive 2017; Moutinho, Bataillon et al. 2019). Unlike previous MK- selection, holding constant all other genomic features. based models that treat xa as a gene-level measure, I treat xa Although this strategy is powerful and elegant, it cannot be as a measure of adaptive evolution at an individual nucleotide applied to species with low levels of polymorphism, such as site and assume that it can be predicted from local genomic primates, due to the challenge of estimating the rate of ad- features. To integrate the effects of multiple features on the aptation at the gene level. Alternatively, we may first stratify rate of adaptive evolution, I assume that xa , in a site-wise genes into a “treatment” group and a “control” group based manner, is a linear combination of local genomic features, on the genomic feature of interest. Then, we may use statis- such as local mutation rate, local recombination rate, and tical matching algorithms to match each gene from the gene expression level. For each genomic feature, the MK re- “treatment” group with a gene of similar characteristics gression seeks to estimate a regression coefficient indicating from the “control” group. A significant difference in the its independent effect on the rate of adaption, holding con- rate of adaptation between the two groups of matched genes stant all other genomic features. indicates that the feature of interest has an independent ef- Specifically, the MK regression consists of two compo- fect on positive selection. Although this method has been nents: a generalized linear model and an MK-based likelihood successfully used in previous studies (Enard et al. 2016; function (fig. 1). First, I assume that xa , in a site-wise manner, Campos et al. 2018; Castellano et al. 2019), it is difficult to is a linear combination of local genomic features followed by match genes when there are a large number of genomic an exponential transformation, 2
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291 MBE Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 FIG. 1. Schematic of the MK regression. The MK regression consists of two components: a generalized linear model and a McDonald–Kreitman- based likelihood function. First, I assume that, in a site-wise manner, the rate of adaptive evolution (xa ) at a functional site is a linear combination of local genomic features followed by an exponential transformation, in which regression coefficient bi indicates the effect of the ith feature on adaptive evolution. Similarly, I assume that the probability of observing a SNP (Pfunc ) at the same functional site is another linear combination of the same set of genomic features, followed by a logistic transformation. Second, in the McDonald–Kreitman-based likelihood function, I combine xa and Pfunc at every functional site with two neutral parameters, Dneut and Pneut , to calculate the probability of observed divergence and polymorphism data given model parameters. Dneut and Pneut denote the expected number of substitutions and the probability of observing a SNP at a neutral site, respectively. Dfunc denotes the expected number of substitutions at a functional site. xa ¼ expðb0 þ b1 X1 þ þ bi Xi þ þ bM XM Þ: polymorphism in a site-wise manner. Thus, the MK regres- sion can naturally describe the evolution of functional and In this equation, Xi is the ith feature at a functional neutral sites. nucleotide site; b0 is an intercept indicating the baseline rate of adaptive evolution when all genomic features are Joint Analysis of Multiple Features Distinguishes equal to 0; bi is a regression coefficient indicating the ith Independent Effects from Spurious Associations feature’s effect on the rate of adaptation; M is the total I conducted two simulation experiments to assess the MK number of genomic features; exp is an exponential inverse regression’s validity and its power to infer the independent link function which ensures xa is positive. If bi is statisti- effects of genomic features on the rate of adaptive evolution. cally different from 0, I consider that feature i may have an The simulation experiments consisted of two steps. First, I independent effect on adaptation after adjusting for the randomly sampled genomic features from a bivariate normal other features. Similarly, to accommodate the effects of distribution at each functional site. Second, I generated syn- local genomic features on polymorphism data, I assume thetic polymorphism and divergence data at both functional that the probability of observing intraspecies polymor- and neutral sites based on the MK regression model. phism at a functional nucleotide site, Pfunc , is another linear In the first simulation experiment, I assumed that there combination of local genomic features followed by a logis- were two genomic features of interest. The first genomic tic transformation (fig. 1). feature had an independent effect on the rate of adaptive Second, in the component of MK-based likelihood func- evolution, and its regression coefficient, b1, was equal to 1. On tion (fig. 1), I combine xa and Pfunc at every functional site the other hand, the second feature had no independent effect with two neutral parameters to calculate the probability of on selection. Thus, its regression coefficient, b2, was equal to 0 observed polymorphism and divergence data at both func- by definition. The other parameters required for the simula- tional and neutral sites, which allows for a maximum like- tion experiment were chosen to ensure that genome-wide lihood estimation of model parameters. Finally, I use the levels of polymorphism and divergence are comparable be- Wald test to examine whether the estimated regression tween synthetic data and empirical data from chimpanzees coefficient, bb i , is significantly different from 0 for each fea- (see details in the Materials and Methods section). To sys- ture i. It is worth noting that, unlike the standard linear tematically assess the MK regression’s performance with re- regression, the MK regression does not assume that re- spect to various degrees of correlation between genomic sponse variables, that is, polymorphism and divergence features, I generated four sets of synthetic data with different data, follow a normal distribution. Instead, the likelihood correlation coefficients between features (0.0, 0.2, 0.4, and 0.6). function of the MK regression uses the Jukes–Cantor sub- In each synthetic data set, I generated 10 independent repli- stitution model (Jukes and Cantor 1969) and the Bernoulli cates each of which consisted of 10 Mb functional sites and distribution to describe the generation of divergence and 10 Mb neutral sites. 3
Huang . doi:10.1093/molbev/msab291 MBE I applied two different versions of the MK regression to the mutations at 0D sites are nonsynonymous, I assume that they synthetic data. The first version was the simple MK regression are potentially functional. I obtained a genome-wide map of that analyzed one genomic feature at a time, which was single nucleotide polymorphisms (SNPs) in 18 central chim- designed to mimic previous MK-based methods. The second panzee (Pan troglodytes troglodytes) individuals (de Manuel one was the multiple MK regression that analyzed two fea- et al. 2016), and inferred ancestral alleles using a recon- tures simultaneously. As shown in figs. 2A and 2B, both the structed chimpanzee ancestral genome (Herrero et al. 2016; simple MK regression and the multiple MK regression pro- Yates et al. 2020). Because the SNP data set consisted of duced unbiased estimates of regression coefficients when samples from both females and males, the number of sam- there was no correlation between features. However, the sim- pled sequences was different between autosomes and sex Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 ple MK regression frequently estimated that b b 2 was positive chromosomes. Thus, I retained only autosomal genes for when the two features were correlated with each other, downstream analysis. To mitigate the impact of weak nega- whereas the true value of b2 was equal to 0. On the other tive selection on the inference of positive selection, I filtered hand, the multiple MK regression always produced unbiased out SNPs with a derived allele frequency lower than 50% for estimates of regression coefficients regardless of the degree of downstream analysis. In addition, I reconstructed fixed sub- correlation between features. stitutions at 4D and 0D sites in the chimpanzee lineage by In the second simulation experiment, I evaluated the ex- comparing the reconstructed ancestral genome with the tent to which the correlation between two causal features chimpanzee reference genome. I estimated that the propor- complicates the estimation of their independent effects. I set tion of adaptive nonsynonymous substitutions (a) was equal the regression coefficients of the two features to b1 ¼ 1 and to 15.3% in chimpanzee autosomal genes, which is similar to b2 ¼ 0:2, respectively. Then, I followed the same procedure the estimate in a previous study (Tataru et al. 2017). described in the first simulation experiment to generate syn- I collected six genomic features in chimpanzee autosomal thetic data. As shown in figs. 2C and 2D, the multiple MK genes, including local mutation rate, local recombination rate, regression accurately estimated regression coefficients with- residue exposure level, gene expression level, tissue specificity, out any noticeable bias, whereas the simple MK regression and the number of unique protein–protein interaction part- produced biased results when the two features were corre- ners per gene (PPI degree). Specifically, I obtained a lated with each other. Importantly, when the correlation was chimpanzee-based map of local recombination rates from a strong, the simple MK regression estimated that b b 2 was pos- previous study (Auton et al. 2012) and constructed a map of itive while the true value of b2 was equal to 0.2. local mutation rates using putatively neutral substitutions in Furthermore, using the same synthetic data, I evaluated the chimpanzee lineage. Because functional genomic data the performance of a previous MK-based method (Smith and were more complete and of higher quality in humans than Eyre-Walker 2002; Fraı̈sse et al. 2019), which can only analyze in chimpanzees, I obtained tissue-based gene expression data one feature at a time. For each genomic feature, I stratified from the Human Protein Atlas (Uhlen et al. 2015) and utilized functional sites into two equal-sized groups. The first group the expression level averaged across all tissues and a summary included the top half of functional sites with higher feature statistic, tau (Yanai et al. 2005), as measures of gene expres- value, whereas the second group included the bottom half of sion level and tissue specificity, respectively. I also obtained functional sites with lower feature value. I estimated xa for predicted levels of residue exposure to solvent and experi- each group separately. Then, I calculated Dxa , that is, the mentally determined PPI degrees in the human genome from difference in xa between the two groups of functional sites. If previous studies (Wong et al. 2011; Luck et al. 2020). I con- the previous MK-based method can unbiasedly estimate the verted human-based annotations of gene expression level, effects of genomic features, the sign of Dxa should match the tissue specificity, residue exposure level, and PPI degree to sign of the true regression coefficient. However, the previous the chimpanzee genome (panTro4) using liftOver MK-based method frequently produced wrong estimates of (Haeussler et al. 2019). the sign of Dxa when genomic features were strongly corre- I first employed the simple MK regression to analyze the lated with each other (supplementary fig. 1, Supplementary effect of one genomic feature at a time, with no attempt to Material online). In summary, it is critical to jointly analyze distinguish independent effects from spurious associations. multiple genomic features for an unbiased estimation of their Because the MK regression used a logarithmic link function independent effects on adaptive evolution. for xa , I explored if a logarithmic transformation of genomic features can improve model fitting. I found that the logarith- The Multiple MK Regression Elucidates Genomic mic transformation improved the fitting of the simple MK Determinants of Positive Selection in Chimpanzees regression for all features but tissue specificity (supplementary I investigated positive selection in chimpanzee autosomal table 1, Supplementary Material online). Therefore, I applied genes using the MK regression. Because gene annotations the logarithmic transformation to all features except tissue were of high quality in the human genome, I converted 4- specificity throughout this study. In the simple MK regression, fold degenerate (4D) and 0-fold degenerate (0D) sites anno- the regression coefficients of local mutation rate, residue ex- tated in dbNSFP (Liu et al. 2013, 2016) from the human ge- posure level, and PPI degree were significantly higher than 0, nome to the chimpanzee genome. Because all point whereas the regression coefficient of gene expression level was mutations at 4D sites are synonymous, I assume that they significantly lower than 0 (fig. 3A and supplementary table 2, are putatively neutral. On the other hand, because all point Supplementary Material online). On the other hand, local 4
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291 MBE A Simple MK regression B Multiple MK regression 1.2 1.2 Estimated coefficient (β) Estimated coefficient (β) 1.0 ● ● ● ● 1.0 ● ● ● ● ^ ^ 0.8 0.8 0.6 ● 0.6 0.4 ● 0.4 0.2 ● 0.2 0.0 ● 0.0 ● ● ● ● −0.2 −0.2 −0.4 −0.4 Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 −0.6 True coefficient −0.6 True coefficient −0.8 ● beta 1 = 1.0 −0.8 ● beta 1 = 1.0 −1.0 ● beta 2 = 0.0 −1.0 ● beta 2 = 0.0 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Correlation coefficient Correlation coefficient C Simple MK regression D Multiple MK regression 1.2 1.2 Estimated coefficient (β) Estimated coefficient (β) ● 1.0 ● 1.0 ● ● ● ^ ^ ● ● ● 0.8 0.8 0.6 0.6 0.4 ● 0.4 0.2 ● 0.2 ● 0.0 0.0 −0.2 ● −0.2 ● ● ● ● −0.4 −0.4 −0.6 True coefficient −0.6 True coefficient −0.8 ● beta 1 = 1.0 −0.8 ● beta 1 = 1.0 −1.0 ● beta 2 = −0.2 −1.0 ● beta 2 = −0.2 0.0 0.2 0.4 0.6 0.0 0.2 0.4 0.6 Correlation coefficient Correlation coefficient FIG. 2. Simulation results. (A) Estimates of regression coefficients in the simple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0. (B) Estimates of regression coefficients in the multiple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0. (C) Estimates of regression coefficients in the simple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0:2. (D) Estimates of regression coefficients in the multiple MK regression. The true coefficients are b1 ¼ 1 and b2 ¼ 0:2. In each plot, dots and error bars indicate the means and the 2-fold standard deviations of estimated coefficients across 10 independent replicates, respectively. recombination rate and tissue specificity were not signifi- tissue specificity was significantly higher than 0 in the multiple cantly associated with the rate of adaptive evolution in the MK regression (b b ¼ 1.168; P-value ¼ 6.394 1031) but not simple MK regression (fig. 3A and supplementary table 2, in the simple MK regression (b b ¼ 0.157; P-value ¼ 0.622). Supplementary Material online). To examine whether these results were robust to different As discussed in the simulation experiments, the simple MK metrics of tissue specificity, I utilized the negative value of Hg regression may produce biased estimates of regression coef- (Kryuchkova-Mostacci and Robinson-Rechavi 2017) as an al- ficients if genomic features are correlated with each other. To ternative metric of tissue specificity. Similar to tau, a higher test if this was the case in the chimpanzee data, I used the value of negative Hg indicates a higher level of tissue specific- multiple MK regression to analyze the effects of the six geno- ity. I observed qualitatively similar regression coefficients mic features simultaneously. Surprisingly, while local muta- when I replaced tau with negative Hg in the multiple MK tion rate, local recombination rate, and residue exposure level regression (supplementary fig. 2, Supplementary Material on- showed similar effects in the multiple MK regression, the line), although the regression coefficient of local mutation regression coefficients of the other features were different rate was not statistically significant when negative Hg was between the multiple MK regression and the simple MK re- used. Thus, the estimated effects of genomic features may gression (fig. 3B and supplementary table 3, Supplementary be robust to different metrics of tissue specificity. Material online). Specifically, the regression coefficient of PPI To investigate whether correlations between genomic fea- degree was not significant in the multiple MK regression (b b¼ tures could explain the differences in estimated coefficients 0.002; P-value ¼ 0.975), whereas the same coefficient was between the multiple and the simple MK regression, I calcu- significant in the simple MK regression (b b ¼ 0.460; P-value lated the Kendall rank correlation coefficient for all pairs of ¼ 3.178 104). The coefficient of gene expression level was genomic features (fig. 3C and supplementary table 4, significantly higher than 0 in the multiple MK regression (b b¼ Supplementary Material online). I found that local mutation 0.347; P-value ¼ 6.674 1012), whereas the same coefficient rate, local recombination rate, and residue exposure level was negative in the simple MK regression (b b ¼ 0.310; P- were weakly correlated with other features, which may ex- value ¼ 3.534 1010). Also, the regression coefficient of plain why the regression coefficients of these features were 5
Huang . doi:10.1093/molbev/msab291 MBE A Simple MK regression 8,471 genes with no more than one interaction partner 2 were assigned to the low PPI-degree group. Without control- ling for gene expression level and tissue specificity, the high Coefficient (β) ^ 1 ** *** PPI-degree group had a higher xa than the low PPI-degree *** 0 Genomic feature group (supplementary fig. 3A, Supplementary Material on- *** Mutation rate line), but the difference in xa was not significant (P-value ¼ −1 Recombination rate Residue exposure 0.146; two-tailed permutation test), possibly due to a reduc- B Multiple MK regression Expression level tion of sample size in the stratified analysis. Then, using the 2 Tissue specificity default propensity score matching algorithm in MatchIt (Ho Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 PPI degree *** Coefficient (β) et al. 2011), I matched each gene from the high PPI-degree ^ 1 * ** *** group with a gene of similar expression level and tissue spe- 0 cificity from the low PPI-degree group. In the matched data, xa was not different between the high-PPI and low-PPI −1 groups (supplementary fig. 3B, Supplementary Material on- C line). I observed similar results using two alternative cutoffs, 5 Mutation rate and 20, for the high PPI-degree group (supplementary fig. 4, Recombination rate Kendall's tau Supplementary Material online). Thus, PPI degree is unlikely Residue exposure 1.0 to be a genomic determinant of positive selection in 0.5 Expression level 0.0 chimpanzees. −0.5 I also verified the effect of tissue specificity on the rate of Tissue specificity −1.0 adaptation after adjusting for gene expression level. Due to PPI degree the strong negative correlation between expression level and tissue specificity (supplementary fig. 5, Supplementary e e e l ty e ve t t ur re ci ra ra Material online), MatchIt returned few matched genes le os eg ifi n n ec on p io tio Id ex sp at si na PP es ut when I attempted to control for expression level. Therefore, ue ue bi M pr id om ss Ex es Ti ec R I implemented a different matching approach. By closely ex- R amining the relationship between expression level and tissue FIG. 3. Effects of genomic features on the rate of adaptive evolution. specificity, I found that the variation in tissue specificity was (A) Estimated coefficients of genomic features in the simple MK re- high among highly expressed genes (supplementary fig. 5, gression. (B) Estimated coefficients of genomic features in the multi- ple MK regression. In each bar plot, error bars indicate 95% Supplementary Material online). Thus, I stratified 737 highly confidence intervals while one, two, and three asterisks indicate expressed genes (gene expression level > 30) into two equal- 0.01 P-value < 0.05, 0.001 P-value < 0.01, and P-value < sized groups based on the ranking of their tissue specificity. As 0.001, respectively. (C) Correlations between genomic features. shown in fig. 4A, highly expressed genes with high tissue specificity had a significantly higher xa than their counter- consistent between the multiple and the simple MK regres- parts with low tissue specificity (P-value ¼ 0.001; two-tailed sion. In contrast, gene expression level and tissue specificity permutation test). Therefore, my gene matching analysis con- showed a strong negative correlation, which may cause spu- firms the positive effect of tissue specificity on the rate of rious associations in the simple MK regression. I also found adaptation in the multiple MK regression. that PPI degree was correlated with gene expression level and As an alternative analysis to infer the independent effect of tissue specificity, although the correlations were to a lesser tissue specificity after controlling for expression level, I divided genes into 10 equal-sized groups (deciles) based on their ex- extent compared with the correlation between gene expres- pression levels. In each decile, I further divided genes into two sion level and tissue specificity. Thus, the observed association equal-sized subgroups based on the ranking of their tissue of PPI degree with positive selection in the simple MK regres- specificity within the decile. As expected, the subgroup with sion could be due to its correlation with gene expression level high tissue specificity had a significantly higher xa than the and/or tissue specificity. subgroup with low tissue specificity in the decile with the highest expression level (supplementary table 5, Statistical Matching Analysis Confirms Genomic Supplementary Material online), which confirms the positive Determinants Identified by the Multiple MK effect of tissue specificity on adaptation rate in highly Regression expressed genes. The difference in xa was not significant in I used statistical matching algorithms to corroborate the other deciles, possibly due to the low variation in tissue spe- results of the multiple MK regression. First, I verified whether cificity in lowly expressed genes (supplementary fig. 5, the association between PPI degree and the rate of adaptive Supplementary Material online). evolution could be explained away by controlling for gene I observed that the variation in expression level was high expression level and tissue specificity. I stratified protein- among genes with high tissue specificity (supplementary fig. 5, coding genes into two groups with different PPI degrees, in Supplementary Material online). To verify the positive effect which 1,556 genes with at least 10 protein interaction part- of gene expression level after adjusting for tissue specificity, I ners were assigned to the high PPI-degree group whereas stratified 993 tissue-specific genes (tau > 0.85) into two 6
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291 MBE A 0.75 B 0.75 0.50 0.50 ● 0.25 0.25 ωa ωa ● 0.00 ● 0.00 ● Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 −0.25 −0.25 Low High Low High Tissue specificity Expression level C D 2 2 PhyloP score PhyloP score 0 0 −2 −2 Low High Low High Tissue specificity Expression level FIG. 4. Statistical matching analysis. (A) Estimates of xa in 369 highly expressed genes with low tissue specificity and 368 highly expressed genes with high tissue specificity. (B) Estimates of xa in 498 tissue-specific genes with a low expression level and 495 tissue-specific genes with a high expression level. In each violin plot, dots indicate point estimates of xa while violins depict the distributions of xa from a gene-based boot- strapping analysis with 1,000 resamplings. (C) Distributions of phyloP scores in 369 highly expressed genes with low tissue specificity and 368 highly expressed genes with high tissue specificity. (D) Distributions of phyloP scores in 498 tissue-specific genes with a low expression level and 495 tissue-specific genes with a high expression level. In each box plot, the bottom, the top, and the middle horizontal bar of the box indicate the first quartile, the third quartile, and the median of phyloP scores, respectively. The whiskers indicate 1.5-fold interquartile ranges. approximately equal-sized groups based on the ranking of significant in other deciles, possibly due to the low variation their expression levels. The first group consisted of the top in expression level among genes with low tissue specificity 495 tissue-specific genes with higher expression level, whereas (supplementary fig. 5, Supplementary Material online). On the second group consisted of the bottom 498 tissue-specific average, highly expressed genes showed a higher rate of adap- genes with lower expression level. The mean expression levels tive evolution than their lowly expressed counterparts. were equal to 5.905 and 0.368 in the two gene groups, which Also, I used phyloP scores (Pollard et al. 2010; Hubisz corresponds to a 16-fold difference in mean expression level. et al. 2011) to examine the effects of gene expression level As shown in fig. 4B, tissue-specific genes with a high expres- and tissue specificity on the rate of protein evolution. After sion level had a significantly higher xa than their lowly controlling for gene expression level, phyloP scores increased expressed counterparts (P-value ¼ 0.002; two-tailed permu- with decreasing tissue specificity (fig. 4C), which is in line tation test), which confirms the positive effect of gene expres- with the observation that housekeeping genes tend to evolve sion level on the rate of adaptive evolution in the multiple at a lower substitution rate than tissue-specific genes MK regression. (Zhang and Li 2004; Zhu et al. 2008). On the other hand, after As an alternative analysis to infer the independent effect of controlling for tissue specificity, phyloP scores increased gene expression level after controlling for tissue specificity, I with increasing expression level (fig. 4D), which is in line divided genes into 10 deciles based on their tissue specificity. with stronger purifying selection on highly expressed genes In each decile, I further divided genes into two equal-sized (Zhang and Yang 2015). Taken together, it seems that highly subgroups based on the ranking of their expression levels expressed genes may be subject to more frequent positive within the decile. As expected, I observed that the subgroup selection than their lowly expressed counterparts, although a with a high expression level had a significantly higher xa than higher expression level may impose stronger purifying selec- its counterpart in the decile with the highest tissue specificity tion and reduce the overall rate of protein evolution. (supplementary table 6, Supplementary Material online). To a lesser extent, xa was slightly lower in the subgroup with a The Rate of Adaptive Evolution May Also Increase high expression level than the subgroup with a low expression with Gene Expression Level in Drosophila level in the 7th decile (supplementary table 6, Supplementary In the previous section, I showed that the rate of adaptive Material online). The difference in xa was not statistically evolution may increase with increasing gene expression level 7
Huang . doi:10.1093/molbev/msab291 MBE in chimpanzees. Recently, Fraı̈sse et al. (2019) examined the with a high expression level (fig. 5A and 5B; false-discovery same problem in Drosophila melanogaster. Fraı̈sse et al. (2019) rate < 0.01). Based on these results, I hypothesized that met- first estimated xa for each gene separately. Then, they abolic genes may be subject to more frequent positive selec- regressed the gene-level xa on expression level and other tion than nonmetabolic genes in chimpanzees. potentially correlated genomic features, such as tissue specif- To test this hypothesis, I constructed a genomic feature icity, using the standard linear regression. The coefficients of indicating whether each 0D site was located in one of the the standard linear regression were interpreted as the inde- 2,220 metabolic genes from MSigDB (Subramanian et al. 2005; pendent effects of genomic features after controlling for other Liberzon et al. 2011). Then, I used the multiple MK regression correlated features. Unlike the current study, Fraı̈sse et al. to simultaneously estimate the effects of the new feature and Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 (2019) observed that the rate of adaptation might decrease the six original features. As shown in fig. 5C and supplemen- with increasing expression level in D. melanogaster. tary table 7, Supplementary Material online, the regression To reconcile the discrepancy between the current study coefficient of metabolic genes was significantly higher than 0 and Fraı̈sse et al. (2019), I reanalyzed the data from Fraı̈sse et in the multiple MK regression (b b ¼ 0.537; P-value ¼ 1.075 al. (2019). Following Fraı̈sse et al. (2019), I used the standard 104), suggesting that metabolic genes might have a higher linear regression to regress the gene-level estimate of xa on rate of adaptation than their nonmetabolic counterparts. gene expression level and tissue specificity, and observed that Interestingly, the effect of metabolic genes was not significant the coefficient of gene expression was negative (0.041384; P- in the simple MK regression (b b ¼ 0.276; P-value ¼ 0.516). value ¼ 8.93 109). However, the standard linear regression Therefore, controlling for potentially correlated features is may suffer from two critical problems in this data set. First, critical for revealing elevated positive selection in metabolic the residuals of the standard linear regression did not follow a genes. Metabolic genes may partially explain the positive ef- normal distribution, as suggested by a quantile–quantile plot fect of gene expression level on adaptive evolution because (supplementary fig. 6A, Supplementary Material online). the regression coefficient of gene expression level reduced Second, the majority of genes had less than 4 nonsynony- moderately from 0.347 to 0.314 after adding metabolic genes mous polymorphisms in this data set (supplementary fig. 6B, as a new feature in the multiple MK regression (supplemen- Supplementary Material online), so the gene-level estimate of tary tables 3 and 7, Supplementary Material online). xa may be highly inaccurate. Thus, I argue that the standard I observed similar results using propensity score matching. linear regression may not be an appropriate statistical Specifically, I observed that residue exposure level, gene ex- method for analyzing this data set. pression level, and tissue specificity showed different distribu- On the other hand, gene expression level had a positive tions between metabolic and nonmetabolic genes regression coefficient in the multiple MK regression (supple- (supplementary fig. 8, Supplementary Material online). mentary fig. 7A, Supplementary Material online). In an or- Before controlling for these correlated genomic features, xa thogonal statistical matching analysis, I stratified 681 was similar between metabolic and nonmetabolic genes (sup- Drosophila tissue-specific genes (tau > 0.85) into two gene plementary fig. 8, Supplementary Material online). On the groups based on the ranking of their expression levels. The other hand, xa was more than two times higher in metabolic first group consisted of the top 340 tissue-specific genes with genes than in nonmetabolic genes (0.0550 vs. 0.0234) after higher expression level, whereas the second group consisted controlling for residue exposure level, gene expression level, of the bottom 341 tissue-specific genes with lower expression and tissue specificity (supplementary fig. 8, Supplementary level. Again, tissue-specific genes with higher expression level Material online) despite that the difference was not statisti- had a higher xa than tissue-specific genes with lower expres- cally significant due to reduced sample size (P-value ¼ 0.390; sion level (supplementary fig. 7B, Supplementary Material two-tailed permutation test). online) despite that the difference in xa was marginally sig- To test whether the effect of metabolic genes on adapta- nificant (P-value ¼ 0.067; two-tailed permutation test). Taken tion rate can be explained away by their biased expression in together, the rate of adaptation may also increase with in- digestive organs, I obtained 2,274 genes with biased or creasing gene expression level in D. melanogaster. enriched expression in digestive organs, including intestine, liver, pancreas, salivary gland, and stomach, from the Human Metabolic and Immune Genes May Be under Frequent Protein Atlas (Uhlen et al. 2015). As expected, metabolic Positive Selection in Chimpanzees genes were more likely to have a biased expression in digestive To explore whether the positive effect of gene expression level organs than nonmetabolic genes (odds ratio ¼ 2.177; P-value on the rate of adaptive evolution can be explained by the
Dissecting Genomic Determinants of Positive Selection . doi:10.1093/molbev/msab291 MBE A 10 analysis, I observed that the regression coefficient of immune genes was significantly higher than 0 (supplementary fig. 10, enrichment 8 Fold of 6 Supplementary Material online and supplementary table 8, 4 Supplementary Material online). Thus, immune genes may 2 have a higher rate of adaptive evolution than their nonim- 0 mune counterparts. Immune genes may also partially explain the positive effect of gene expression level on the rate of em st ne ei m lip ism m ec m r t ol s o ot is is em ns s ul all Sy mu m of nsp st pr ol of ol ol id es Sy adaptive evolution, since the regression coefficient of gene of tab ab ab Im a Tr et et e e e un M M M at Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 expression level decreased from 0.314 to 0.238 after adding m n In Im immune genes as a new feature (supplementary tables 7 and B 10 8, Supplementary Material online). enrichment 8 Fold of 6 Discussion 4 In this work, I have introduced the MK regression, the first 2 evolutionary model for jointly estimating the effects of mul- 0 tiple, potentially correlated genomic features on the rate of is cl l as r e us ta ve tin m adaptive substitutions. Based on similar ideas, my colleagues m ele re e Li dy s nc te Sk di In Pa and I have previously developed statistical approaches to infer i Ep negative selection on genetic variants (Huang et al. 2017; C 2.0 Huang and Siepel 2019; Huang 2020) and the evolutionary *** Coefficient (β) 1.5 turnover of cis-regulatory elements (Dukler et al. 2020). Thus, ^ 1.0 *** unifying generalized linear models and evolutionary models may be a powerful strategy to address a variety of statistical 0.5 * ** *** problems in evolutionary biology. 0.0 As shown in the simulation experiments (fig. 2), when two genomic features are correlated with each other, even at a e e re l ty e ne ic ve moderate level, the simple MK regression and a previous MK- t t re ge bol ci ra ra su s le eg ifi po n b. a ec on io Id et om based method (Smith and Eyre-Walker 2002; Fraı̈sse et al. ex sp at si M PP es ut ec ue ue M pr R 2019) cannot accurately estimate the independent effects id ss Ex es Ti R of genomic features because they cannot control for corre- FIG. 5. Positive selection in metabolic genes. (A) Enrichment of lated genomic features. On the other hand, the multiple MK Reactome pathways in 495 tissue-specific genes with a high expres- regression can unbiasedly estimate the independent effects of sion level. (B) Enrichment of tissue types in 495 tissue-specific genes genomic features if all relevant genomic features are included with a high expression level. In each enrichment test, 498 tissue- in the same analysis. Because we are almost always interested specific genes with a low expression level are used as a background in the independent effects of genomic features, the multiple gene set. A dashed blue line indicates that the fold of enrichment is MK regression may be superior to the simple MK regression equal to 1 (no enrichment). (C) Estimates of regression coefficients in and other MK-based methods that can only analyze one the multiple MK regression. This analysis includes a new binary fea- feature at a time. ture indicating whether each 0D site is located in a metabolic gene. Regression coefficients in the multiple MK regression Error bars indicate 95% confidence intervals while one, two, and three asterisks indicate 0.01 P-value < 0.05, 0.001 P-value < 0.01, and might be interpreted as the direct causal effects of genomic P-value < 0.001, respectively. features on the rate of adaptation. However, similar to other linear regression models (Pearl et al. 2016), the causal inter- pretation of the multiple MK regression relies on two implicit adaptation rate could not be explained by their biased ex- assumptions. First, genomic features of interest and all corre- pression in digestive organs. lated genomic features should be included in the same MK In line with frequent positive selection on the immune regression analysis. Second, there should be no reverse cau- system (Schlenke and Begun 2003; Nielsen et al. 2005; sality, that is, the rate of adaptive evolution should not cause Kosiol et al. 2008; Barreiro and Quintana-Murci 2010), I ob- changes in genomic features. As discussed in the literature of served that genes associated with the immune system had a causal inference (Pearl et al. 2016), these assumptions cannot 2- to 4-fold enrichment in the 495 tissue-specific genes with a be verified using observational data alone and, thus, have to high expression level (fig. 5A; false-discovery rate < 0.01). To be justified by domain knowledge on a case-by-case basis. formally test if immune system genes have a higher rate of It is worth noting that I do not attempt to infer the total adaptation than nonimmune genes in chimpanzees, I con- causal effects of genomic features on adaptation rate in the structed a genomic feature indicating whether each 0D site current study. According to the theory of causal inference was located in one of the 3,400 immune system genes from (Pearl et al. 2016), the total effect of a genomic feature of MSigDB (Subramanian et al. 2005; Liberzon et al. 2011). After interest includes its direct effect on the rate of adaptation as adding this new feature to the multiple MK regression well as its indirect effect through mediators, that is, others 9
Huang . doi:10.1093/molbev/msab291 MBE genomic features that reside on a directed path from the hydrophobic cores (Bloom et al. 2005; Bloom, Labthavikul genomic feature of interest to the rate of adaptation in an et al. 2006; Bloom, Drummond et al. 2006; Franzosa and Xia assumed causal graph. Thus, inferring the total effect of the 2009). Thus, missense mutations on protein surfaces may be genomic feature of interest will require me to impose very under more frequent positive selection because they are less strong assumptions about the causal relationship between likely to perturb protein folding. Alternatively, protein surfa- genomic features (see examples in Rosenbaum et al. 2020; ces may have a higher rate of adaptive evolution because they Laubach et al. 2021). On the other hand, to infer the direct may play an important role in host–pathogen interactions effect of the genomic feature of interest, I can simply control (Moutinho, Trancoso et al. 2019). for all potentially correlated features without specifying their Fourth, I have shown that the tissue specificity of a gene Downloaded from https://academic.oup.com/mbe/advance-article/doi/10.1093/molbev/msab291/6379733 by guest on 08 November 2021 causal relationship with the genomic feature of interest has a positive effect on the rate of adaptation after controlling (Laubach et al. 2021). In other words, the MK regression ef- for correlated genomic features, such as gene expression level. fectively assumes a simplified causal graph where I do not Thus, tissue-specific genes are more likely to be under positive specify the directions of causality between genomic features selection than housekeeping genes. Because nonsynonymous (supplementary fig. 11, Supplementary Material online; see mutations in housekeeping genes have a higher chance to also Chapter 4.3 in Shipley 2016). disrupt multiple phenotypes, my findings support that the It is also worth noting that I have ignored potential inter- pleiotropic effect is a key determinant of adaptive evolution actions between genomic features in the current study. It is (Fraı̈sse et al. 2019). possible to add interaction terms to the multiple MK regres- Fifth, I have shown that the rate of adaptive evolution sion. However, because of the large number of potential in- increases with increasing gene expression levels in both chim- teraction terms and the spareness of polymorphisms in the panzees (figs. 3B and 4B) and Drosophila (supplementary fig. chimpanzee genome, I may lack statistical power to detect 7, Supplementary Material online) after controlling for corre- interactions between features in chimpanzees. lated genomic features, such as tissue specificity. Recently, Using the multiple MK regression, I have identified numer- Fraı̈sse et al. (2019) reported an opposite trend in ous genomic features with independent effects on adaptive Drosophila by regressing a gene-level estimate of xa on evolution in the chimpanzee lineage (fig. 3B). First, in line with gene expression level and potentially correlated features. previous studies (Campos et al. 2014; Castellano et al. 2016; However, my reanalysis of their data suggests that the stan- Rousselle et al. 2020), I have shown that the rate of adaptation dard linear regression used in Fraı̈sse et al. (2019) may not be increases with increasing mutation rate. Because mutations an appropriate method for inferring the effects of genomic are the ultimate source of genetic variation, a higher mutation features in Drosophila. Unlike the standard linear regression, rate will increase the genetic variation for positive selection to the MK regression does not rely on inaccurate estimates of xa act on. at the gene level, and does not assume that the response Second, previous studies have shown that local recombi- variables follow a normal distribution. Thus, the MK regres- nation rate is positively correlated with the rate of adaptation sion may be more broadly applicable than the standard linear in Drosophila (Marais and Charlesworth 2003; Campos et al. regression in inferring the effects of genomic features on 2014; Castellano et al. 2016), probably due to a reduced effect adaptation. of Hill-Robertson interference in recombination hotspots. Sixth, numerous studies have shown that immune genes However, I have not observed the same pattern in chimpan- may have a higher rate of adaptive evolution than other genes zees, which could be explained by a reduced impact of linked in various species (Schlenke and Begun 2003; Nielsen et al. selection in species with a small census population size, such 2005; Kosiol et al. 2008; Barreiro and Quintana-Murci 2010). as chimpanzees (Corbett-Detig et al. 2015). Alternatively, my Using the MK regression, I have observed the same trend in analysis may have limited power to detect a weak association chimpanzees (supplementary fig. 10, Supplementary Material between recombination rate and positive selection due to the online and supplementary table 8, Supplementary Material lower level of polymorphism and the smaller proportion of online). Thus, immune genes may be subject to constant adaptive substitutions in chimpanzees compared with adaptation in chimpanzees to fight against ever-evolving Drosophila (Castellano et al. 2016). pathogens and parasites. Third, in agreement with a previous study (Moutinho, Last but not least, I have shown that highly expressed genes Trancoso et al. 2019), I have shown that the rate of adaptive are more likely to be associated with metabolic pathways and evolution increases with the increasing level of residue expo- digestive organs than their lowly expressed counterparts (fig. sure to solvent. On the other hand, it is well-known that the 5A and 5B), which implies that frequent positive selection in site-wise rate of protein evolution is positively correlated with metabolic genes may partially explain the positive effect of the level of residue exposure, possibly due to relaxed negative gene expression level on the rate of adaptation. In agreement selection on exposed residues (Goldman et al. 1998; Franzosa with this hypothesis, I have shown that metabolic genes may and Xia 2009; Liberles et al. 2012; Echave et al. 2016). Taken have a higher rate of adaptation than their nonmetabolic together, exposed residues on protein surfaces may be subject counterparts after controlling for potentially correlated geno- to both weaker negative selection and more frequent positive mic features (fig. 5C). Similarly, a recent study has reported selection than their buried counterparts. From a biophysical that metabolic pathways may be subject to more frequent perspective, missense mutations on protein surfaces are less positive selection than nonmetabolic pathways in multiple likely to disrupt protein stabilities than mutations in inner branches of the primate phylogeny (Daub et al. 2017). 10
You can also read