Uncovering the mutation-fixation correlation in short lineages
© Vallender and Lahn; licensee BioMed Central Ltd. 2007
Received: 07 July 2007
Accepted: 21 September 2007
Published: 21 September 2007
We recently reported a highly unexpected positive correlation between the fixation probability of nonsynonymous mutations (estimated by ω) and neutral mutation rate (estimated by Ks) in mammalian lineages. However, this positive correlation was observed for lineages with relatively long divergence time such as the human-mouse lineage, and was not found for very short lineages such as the human-chimpanzee lineage. It was previously unclear how to interpret this discrepancy. It may indicate that the positive correlation between ω and Ks in long lineages is a false finding. Alternatively, it may reflect a biologically meaningful difference between various lineages. Finally, the lack of positive correlation in short lineages may be the result of methodological artifacts.
Here we show that a strong positive correlation can indeed be seen in short lineages when a method is introduced to correct for the inherently high levels of stochastic noise in the use of Ks as an estimator of neutral mutation rate. Thus, the previously noted lack of positive correlation between ω and Ks in short lineages is due to stochastic noise in Ks that makes it a far less reliable estimator of neutral mutation rate in short lineages as compared to long lineages.
A positive correlation between ω and Ks can be observed in all mammalian lineages for which large amounts of sequence data are available, including very short lineages. It confirms the authenticity of this highly unexpected correlation, and argues that the correlation likely applies broadly across all mammals and perhaps even non-mammalian species.
Point mutations in coding regions of genes can be classified into two categories: synonymous and nonsynonymous. In mammals, synonymous mutations are largely neutral, though they may sometimes experience weak selection. Nonsynonymous mutations, in contrast, are frequently subject to strong selection. The rate of fixed synonymous (or nonsynonymous) substitutions is often denoted as Ks (or Ka). Ks is typically treated as a reasonable proxy for neutral mutation rate, and the Ka/Ks ratio (abbreviated ω) is often used as a measure for the fixation probability of nonsynonymous mutations scaled to neutral expectation.
Recently, we discovered a strong positive correlation between ω and Ks in several mammalian lineages, arguing that the fixation probability of nonsynonymous mutations in a gene is positively correlated with the gene's neutral mutation rate. This finding is highly unexpected under the classical neutral theory of molecular evolution, which argues that the fixation probability of nonsynonymous mutations is determined by (and serves as an estimator for) selective pressure, and as such, should be independent of neutral mutation rate [3–5]. Conventionally, therefore, ω should not show any positive correlation with Ks, and in fact, simulations demonstrate that only an inverse correlation should in theory exist between ω and Ks due to a mathematical artifact (see below). The observation of a strong positive correlation between ω and Ks therefore challenges the current paradigm of molecular evolution, and necessitates a reexamination of the widely held assumption that ω is determined primarily by selective pressure independent of neutral mutation rate.
There is, however, an important caveat to this finding. The positive correlation between ω and Ks was observed in mammalian lineages with relatively large sequence divergence, specifically, the human-mouse, human-rabbit, and mouse-rat lineages (the average Ks of these lineages being 0.48, 0.38 and 0.18, respectively), but not found in the human-chimpanzee or the human-macaque lineages, which have much lower levels of sequence divergence (the average Ks being 0.012 and 0.063, respectively). Instead, an inverse correlation between ω and Ks was seen in these latter short lineages. This is troubling because it leaves open the possibility that the positive correlation between ω and Ks observed in long lineages may actually be an artifact associated with the imprecise calculation of Ka or Ks. In particular, multiple hits at any given nucleotide position can occur with high probability in long lineages. Such multiple hits, difficult to correct since only one or zero changes can be observed at a given nucleotide position, can constitute a major source of error in the calculation of Ka and, even more so, Ks. It is formally possible, therefore, that errors stemming from the improper correction of multiple hits along with some other unknown factors have systematically biased the calculation of Ka and Ks in long lineages, in a manner that creates an artifactual positive correlation between ω and Ks.
We have argued previously that this inverse correlation may be due to the effect of stochastic noise on Ks, which affects short lineages more profoundly than long lineages. Specifically, stochastic deviation of Ks from the true underlying neutral mutation rate (μ) represents a much greater fraction of μ in short lineages as compared to long lineages. Our simulations showed that this could indeed lead to a strong inverse correlation between ω and Ks in very short lineages such as the human-chimpanzee lineage, because Ks is also used as the denominator in the calculation of ω (which equals Ka/Ks) (see the Supplementary Material of our previous study).
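The following is a minimal sketch of this kind of simulation, not the simulation used in our previous study. It assumes a hypothetical model in which each gene has a true neutral divergence and a true ω drawn independently of each other, each site changes at most once (multiple hits are ignored), and all names and parameter values (`simulate_lineage`, `n_syn`, `n_non`, the uniform ranges) are illustrative choices. Because the true values are independent, any correlation between observed ω and observed Ks arises from sampling noise alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_lineage(mean_divergence, n_genes=400, n_syn=300, n_non=900, seed=1):
    """Return the correlation between observed Ks and observed omega.

    Assumed model (illustrative only): true divergence and true omega are
    drawn independently per gene; substitutions are sampled site by site.
    """
    rng = random.Random(seed)
    ks_obs, omega_obs = [], []
    for _ in range(n_genes):
        mu = mean_divergence * rng.uniform(0.5, 1.5)  # true neutral divergence
        omega = rng.uniform(0.05, 0.4)                # true omega, independent of mu
        syn = sum(rng.random() < mu for _ in range(n_syn))
        non = sum(rng.random() < mu * omega for _ in range(n_non))
        if syn == 0:
            continue  # omega is undefined when observed Ks is zero
        ks = syn / n_syn
        ka = non / n_non
        ks_obs.append(ks)
        omega_obs.append(ka / ks)
    return pearson(ks_obs, omega_obs)


# Mean divergences taken from the average Ks values quoted above.
r_short = simulate_lineage(0.012)  # human-chimpanzee scale
r_long = simulate_lineage(0.48)    # human-mouse scale
print(f"short lineage r = {r_short:.2f}, long lineage r = {r_long:.2f}")
```

Under these assumptions the short lineage is expected to show a clearly negative correlation, while the long lineage should show a correlation close to zero, since the noise in Ks is then small relative to the gene-to-gene variation in divergence.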
If the failure to observe a positive correlation between ω and Ks in short lineages is indeed due to the inherently high levels of stochastic noise affecting Ks in short lineages, then a reasonable correction of noise in Ks might be able to bring out the positive correlation again. Here, we show that by introducing a method that corrects for stochastic noise in Ks, a strong positive correlation between ω and Ks indeed becomes observable in short lineages. This result strengthens the authenticity of this highly unexpected correlation, and argues that a biological mechanism (rather than a methodological artifact) is relating the fixation probability of nonsynonymous mutations to mutation rate.
We generated a set of 6,779 genes each with orthologs from five primate and rodent species for which large-scale genome sequences are currently available: human, chimpanzee, rhesus macaque, mouse, and rat. A subset of 5,831 genes also included orthologs from dog. To ensure that the current data set was consistent with our earlier results based on a separately derived data set , we sought to recapitulate the positive correlation between ω and Ks in human-mouse and mouse-rat comparisons. Indeed, for both lineages, a highly robust correlation exists and binning of genes creates a visually striking representation of the correlation (Additional file 1). We also plotted human-dog and mouse-dog values and observed correlations similar to that seen in the human-mouse lineage (data not shown). To ensure that the correlation is not restricted to genes with orthologs in all the species sampled, we also obtained data sets containing only pairwise orthologs. Again, we observed correlations that are essentially the same as that seen in the five- or six-way ortholog sets (data not shown).
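As an illustration of what binning involves, here is a minimal sketch that assumes equal-sized bins of genes ordered by Ks; the binning scheme actually used for the figures is not described in this text, and the function name `bin_by_ks` and the example values are hypothetical.

```python
def bin_by_ks(genes, n_bins):
    """Sort (ks, omega) pairs by Ks, split them into n_bins near-equal groups,
    and return (mean Ks, mean omega) for each non-empty group."""
    ordered = sorted(genes)
    n = len(ordered)
    result = []
    for b in range(n_bins):
        chunk = ordered[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        mean_ks = sum(k for k, _ in chunk) / len(chunk)
        mean_omega = sum(w for _, w in chunk) / len(chunk)
        result.append((mean_ks, mean_omega))
    return result


# Made-up values for illustration only, not data from this study.
example = [(0.30, 0.10), (0.35, 0.12), (0.50, 0.15),
           (0.55, 0.17), (0.70, 0.22), (0.75, 0.24)]
binned = bin_by_ks(example, 3)
```

Averaging within bins reduces gene-level scatter in ω, which is why a trend can look more striking in binned plots than in the raw gene-by-gene scatter.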
We then considered the human-chimpanzee and human-macaque ortholog pairs within the complete data set. As expected for these short lineages, plotting ω against Ks showed that not only is there not a positive correlation, but there is a strong inverse correlation between these two parameters for the human-chimpanzee lineage (Additional file 2). We have speculated previously that this inverse correlation is due to the inherently high levels of stochastic variation in Ks when it is used as a proxy for neutral mutation rate in short lineages. If this is correct, then the inverse correlation should be stronger in the human-chimpanzee lineage than in the human-macaque lineage because the former is shorter and therefore suffers from an even higher level of stochastic noise in Ks. This is precisely what we observed (Additional file 2). Indeed, the general finding is that as evolutionary distance of a lineage decreases (and hence the stochastic noise associated with Ks relative to neutral mutation rate increases), the relationship between ω and Ks goes from a strong positive correlation in the case of long lineages progressively toward a strong inverse correlation in the case of very short lineages.
In molecular evolutionary studies, Ks is frequently used as a proxy for neutral mutation rate (μ). It is often forgotten, however, that mutational events are a discrete process and as such are subject to stochastic variation. Over short periods of evolutionary time, this stochastic variation is often large relative to the true underlying neutral mutation rate. As evolutionary time lengthens and the number of mutational events increases, stochastic variation decreases relative to neutral mutation rate. This is clearly demonstrated by computer simulations in our previous study.
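One way to make this concrete is the following idealisation, which is an assumption for illustration rather than a model taken from the study: the number of synonymous substitutions N in a gene with L_s synonymous sites is Poisson distributed with mean λ = d L_s, where d is the expected neutral divergence per site over the lineage and multiple hits are ignored.

```latex
\[
N \sim \mathrm{Poisson}(\lambda), \quad \lambda = d\,L_s, \qquad
K_s = \frac{N}{L_s}, \qquad
\frac{\mathrm{SD}(K_s)}{\mathrm{E}[K_s]}
  = \frac{\sqrt{\lambda}/L_s}{\lambda/L_s}
  = \frac{1}{\sqrt{\lambda}} .
\]
```

For a hypothetical gene with L_s = 300, a divergence of 0.012 (the human-chimpanzee average Ks) gives λ = 3.6 and a relative noise of about 0.53, whereas a divergence of 0.48 (the human-mouse average Ks) gives λ = 144 and a relative noise of about 0.08.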
The reason that high levels of stochastic noise in Ks can lead to an artifactual inverse correlation is that Ks appears in both of the parameters being correlated: it is the x-axis variable and also the denominator of ω. Mathematically, by just correcting for noise on the x-axis (Ks), the artifactual inverse correlation should go away even if there is still considerable noise on the y-axis. This prediction was clearly borne out by our previous simulation studies. We therefore decided to focus on devising a means to correct for noise in Ks only, and to examine if such correction can eliminate the artifactual inverse correlation between ω and Ks in short lineages. This may reveal the underlying positive correlation, if any, that has been obscured.
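The argument can be sketched as follows, using hypothetical noise terms ε and ε′ that do not appear in the study. Write the observed Ks as the true neutral value μ perturbed by a relative error ε:

```latex
\[
K_s = \mu\,(1+\varepsilon), \qquad
\omega_{\mathrm{obs}} = \frac{K_a}{K_s}
  = \frac{K_a}{\mu}\cdot\frac{1}{1+\varepsilon}
  \approx \frac{K_a}{\mu}\,(1-\varepsilon)
  \quad \text{for small } \varepsilon .
\]
\[
\hat{\mu} = \mu\,(1+\varepsilon'), \qquad
\mathrm{Cov}(\varepsilon, \varepsilon') = 0
\;\Longrightarrow\;
\text{the error in } \omega_{\mathrm{obs}} \text{ is uncorrelated with the error in } \hat{\mu} .
\]
```

The same ε raises the x-axis value and lowers the y-axis value, which produces the inverse trend; the first-order expansion is only for illustration, since 1/(1+ε) decreases in ε for any ε > −1. If the x-axis instead uses an estimate whose error ε′ is independent of ε, the remaining noise in ω adds scatter but no longer induces a systematic slope.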
We note that the accuracy of Ka and Ks can be compromised by polymorphisms and that the effect is stronger for short lineages. This is because in short lineages, a considerable fraction of observed sequence differences between two reference genomes may actually be polymorphisms rather than fixed divergence. As such, Ka and Ks values calculated from two closely related reference genomes are inflated. This is especially true for Ka because a significant fraction of nonsynonymous polymorphisms can be slightly deleterious mutations destined for elimination before they can reach fixation. This may contribute to the poor quality of Ks in approximating μ and Ka/Ks in approximating selective pressure, and thus the erosion of the positive correlation in short lineages. What remains clear, however, is that polymorphisms are an unlikely source of the positive correlation between ω and Ks. This is because in long lineages, the effect of polymorphisms on Ka and Ks is negligible given that only a tiny fraction of the observed sequence differences between two reference genomes are due to polymorphisms, and yet a robust positive correlation between ω and Ks can be seen in long lineages. This effectively rules out polymorphisms as a major contributing factor to the correlation.
In this study, we show that the positive correlation between ω and Ks is not restricted to specific mammalian lineages. Rather, the correlation can be observed across all mammalian species for which large-scale genome sequence data are available. It is particularly interesting that the correlation can be observed even in very short lineages once measures are taken to correct for the stochastic noise in Ks. This argues that the failure to observe a positive correlation between ω and Ks in short lineages in our previous study is indeed due to noise in Ks as we had speculated. We note that the calculation of Ka and Ks is minimally confounded by the occurrence of multiple hits in short lineages. The ability to observe a strong positive correlation between ω and Ks in very short lineages therefore argues that the correlation is not an artifact stemming from the improper correction of multiple hits in the calculation of Ka and Ks. Our current study thus bolsters the authenticity of this correlation while demonstrating its broad applicability across the mammalian tree.
Another important message from the study is that Ks in long lineages may provide a much better estimator of neutral mutation rate than Ks in short lineages. This is because stochastic variation in Ks is much larger relative to the true underlying neutral mutation rate in short lineages than in long lineages. Indeed, when estimating neutral mutation rate of a gene in a short lineage, it may be more accurate to use the observed Ks of that gene in a long lineage (scaled down proportional to the genome-average differential in sequence divergence of the two lineages) than to use the observed Ks from the short lineage. This study and our unpublished data confirm that neutral mutation rate can remain relatively stable across different mammalian lineages [8, 9].
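The scaling described above can be sketched as a one-line calculation. The function name `scaled_neutral_rate` and the gene-level value 0.60 are hypothetical; the genome-average values are the human-mouse and human-chimpanzee averages quoted earlier in the text. This is one plausible reading of "scaled down proportional to the genome-average differential", not a definitive implementation.

```python
def scaled_neutral_rate(ks_long_gene, mean_ks_long, mean_ks_short):
    """Estimate a gene's short-lineage neutral divergence from its long-lineage
    Ks, scaled by the ratio of the two lineages' genome-average Ks values."""
    if mean_ks_long <= 0:
        raise ValueError("mean_ks_long must be positive")
    return ks_long_gene * (mean_ks_short / mean_ks_long)


# Hypothetical gene with a human-mouse Ks of 0.60, scaled to the
# human-chimpanzee lineage using genome averages of 0.48 and 0.012.
estimate = scaled_neutral_rate(0.60, mean_ks_long=0.48, mean_ks_short=0.012)
print(round(estimate, 4))
```

The estimate inherits the low relative noise of the long-lineage Ks, and its error is independent of the sampling error in the short lineage's own Ks.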
Neither the previous long-lineage study nor the current short-lineage study alone is sufficient to eliminate methodological artifacts from consideration. In the study of long lineages, the correlation is complicated by possible difficulties in correcting for multiple hits, though the stochastic noise in Ks and the effect of polymorphisms are not major issues. In the study of short lineages, the converse situation is true. It is thus gratifying that a strong positive correlation between ω and Ks can be observed not only for long lineages, but also for short lineages after noise in Ks is corrected. Together, these results argue that the correlation is the result of a biological mechanism rather than a methodological artifact.
It is as yet unclear why there should exist such a strong positive correlation between ω and Ks. Neither this study nor our previous study provides a definitive mechanism by which this correlation may occur, though several models have been proposed. Perhaps ω is reflective of some combined effect of selective pressure and neutral mutation rate, or perhaps selective pressure and neutral mutation rate influence each other in some way. It was speculated that the occurrence of intragenic compensatory mutations, which is supported by some theoretical and empirical studies [12–14], may contribute to the former scenario. It was also suggested that modulated mutability may contribute to the latter scenario, that is, the correlation may be partly due to the fact that genes with conserved functions have evolved lower neutral mutation rates over evolutionary time [15–17]. These speculations notwithstanding, the biological mechanism responsible for the correlation remains unclear.
Given the presence of the correlation in all the mammalian lineages for which there is sufficient genome sequence data, including very short lineages, it now seems clear that the correlation is indeed the result of a biologically meaningful process. The stage is set for future studies to identify the mechanisms underpinning this enigmatic correlation.
Using data from Ensembl v36, sequences were obtained for: human, Homo sapiens (NCBI 35); chimpanzee, Pan troglodytes (PanTro 1.0); rhesus macaque, Macaca mulatta (Mmul 1.0); mouse, Mus musculus (NCBI m34); rat, Rattus norvegicus (RGSC 3.4); and dog, Canis familiaris (CanFam 1.0). Genes were clustered into orthologous groups using reciprocal best BLAST hits following established methods [19–21]. We further attempted to ensure proper alignments by imposing maximum acceptable Ks cut-offs at roughly three standard deviations above average for all alignments. Sequences were curated for length differences and for poor alignment. This resulted in 5,831 orthologous groups containing a member from each of the six species with reasonable alignment and 6,779 orthologous groups containing all five primate and rodent species. Ortholog groups, alignments, and evolutionary estimates used in this study can be obtained through the SPEED database.
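The two filtering ideas in this paragraph can be sketched as follows. This is a simplified illustration, not the pipeline used for the study: it assumes the best BLAST hit in each direction has already been extracted into a dictionary, the gene identifiers are hypothetical, and the cut-off uses the sample standard deviation, a detail the text does not specify.

```python
from statistics import mean, stdev


def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


def filter_by_ks(ks_by_gene, n_sd=3.0):
    """Keep genes whose Ks is at most mean + n_sd * SD over all genes."""
    values = list(ks_by_gene.values())
    cutoff = mean(values) + n_sd * stdev(values)
    return {g: k for g, k in ks_by_gene.items() if k <= cutoff}


# Hypothetical identifiers: hC also hits mA, but mA's best hit is hA,
# so hC is not part of a reciprocal pair.
human_to_mouse = {"hA": "mA", "hB": "mB", "hC": "mA"}
mouse_to_human = {"mA": "hA", "mB": "hB"}
pairs = reciprocal_best_hits(human_to_mouse, mouse_to_human)

# Made-up Ks values with one extreme outlier, e.g. from a poor alignment.
ks_values = {f"g{i}": 0.1 for i in range(19)}
ks_values["outlier"] = 5.0
kept = filter_by_ks(ks_values)
```

An upper Ks cut-off of this kind removes pairs whose apparent divergence is implausibly high, which often indicates a misalignment or a paralog mistaken for an ortholog.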
In-frame alignments of orthologs were performed using The Wisconsin Package v10.2 (http://www.accelrys.com/). Evolutionary parameters were estimated using the Li method, though the results obtained were comparable (positive linear correlations between ω and Ks) when other methods such as PAML [23, 24] were used.
We thank L. Hurst and A. Kondrashov who served as constructive reviewers of the manuscript.
- Chamary JV, Parmley JL, Hurst LD: Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet. 2006, 7 (2): 98-108. 10.1038/nrg1770.
- Wyckoff GJ, Malcom CM, Vallender EJ, Lahn BT: A highly unexpected strong correlation between fixation probability of nonsynonymous mutations and mutation rate. Trends Genet. 2005, 21 (7): 381-385. 10.1016/j.tig.2005.05.005.
- Ohta T: Slightly deleterious mutant substitutions in evolution. Nature. 1973, 246 (5428): 96-98. 10.1038/246096a0.
- Kimura M: Evolutionary rate at the molecular level. Nature. 1968, 217 (129): 624-626. 10.1038/217624a0.
- Kimura M: The Neutral Theory of Molecular Evolution. 1983, Cambridge, UK, Cambridge University Press.
- Bulmer M, Wolfe KH, Sharp PM: Synonymous nucleotide substitution rates in mammalian genes: implications for the molecular clock and the relationship of mammalian orders. Proc Natl Acad Sci U S A. 1991, 88 (14): 5974-5978. 10.1073/pnas.88.14.5974.
- Mouchiroud D, Gautier C, Bernardi G: Frequencies of synonymous substitutions in mammals are gene-specific and correlated with frequencies of nonsynonymous substitutions. J Mol Evol. 1995, 40 (1): 107-113. 10.1007/BF00166602.
- Makalowski W, Boguski MS: Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc Natl Acad Sci U S A. 1998, 95 (16): 9407-9412. 10.1073/pnas.95.16.9407.
- Kumar S, Subramanian S: Mutation rates in mammalian genomes. Proc Natl Acad Sci U S A. 2002, 99 (2): 803-808. 10.1073/pnas.022629899.
- Williams EJ, Hurst LD: Is the synonymous substitution rate in mammals gene-specific?. Mol Biol Evol. 2002, 19 (8): 1395-1398.
- Rocha EP, Smith JM, Hurst LD, Holden MT, Cooper JE, Smith NH, Feil EJ: Comparisons of dN/dS are time dependent for closely related bacterial genomes. J Theor Biol. 2006, 239 (2): 226-235. 10.1016/j.jtbi.2005.08.037.
- Kimura M: The role of compensatory neutral mutation in molecular evolution. J Genet. 1985, 64: 7-19.
- Kimura M: Diffusion models in population genetics with special reference to fixation time of molecular mutants under mutational pressure. Population Genetics and Molecular Evolution. Edited by: Ohta T, Aoki K. 1985, New York, New York, Springer-Verlag, 19-39.
- Choi SS, Li W, Lahn BT: Robust signals of coevolution of interacting residues in mammalian proteomes identified by phylogeny-aided structural analysis. Nat Genet. 2005, 37 (12): 1367-1371. 10.1038/ng1685.
- Cox EC: On the organization of higher chromosomes. Nat New Biol. 1972, 239 (92): 133-134.
- Kondrashov AS: Modifiers of mutation-selection balance: general approach and the evolution of mutation rates. Genet Res. 1995, 66: 53-69.
- Chuang JH, Li H: Functional bias and spatial organization of genes in mutational hot and cold regions in the human genome. PLoS Biol. 2004, 2 (2): E29. 10.1371/journal.pbio.0020029.
- Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Graf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin B, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Hubbard TJ: Ensembl 2006. Nucleic Acids Res. 2006, 34 (Database issue): D556-61. 10.1093/nar/gkj133.
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41. 10.1186/1471-2105-4-41.
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science. 1997, 278 (5338): 631-637. 10.1126/science.278.5338.631.
- Vallender EJ, Paschall JE, Malcom CM, Lahn BT, Wyckoff GJ: SPEED: a molecular-evolution-based database of mammalian orthologous groups. Bioinformatics. 2006, 22 (22): 2835-2837. 10.1093/bioinformatics/btl471.
- Li WH: Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993, 36 (1): 96-99. 10.1007/BF02407308.
- Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994, 11 (5): 725-736.
- Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986, 3 (5): 418-426.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.