Using equilibrium frequencies in models of sequence evolution
© Knudsen and Miyamoto; licensee BioMed Central Ltd. 2005
Received: 19 July 2004
Accepted: 02 March 2005
Published: 02 March 2005
The f factor is a new parameter for accommodating the influence of both the starting and ending states in the rate matrices of "generalized weighted frequencies" (+gwF) models for sequence evolution. In this study, we derive an expected value for f, starting from a nearly neutral model of weak selection, and then assess the biological interpretation of this factor with evolutionary simulations.
An expected value of f = 0.5 (i.e., equal dependency on the starting and ending states) is derived for sequences that are evolving under the nearly neutral model of this study. However, this expectation is sensitive to violations of its underlying assumptions as illustrated with the evolutionary simulations.
This study illustrates how selection, drift, and mutation at the population level can be linked to the rate matrices of models for sequence evolution to derive an expected value of f. However, as f is affected by a number of factors that limit its biological interpretation, this factor should normally be estimated as a free parameter rather than fixed a priori in a +gwF analysis.
Felsenstein  was the first to introduce an evolutionary model for DNA sequences, which allows for unequal nucleotide frequencies (see also ). His F81 model allows for substitutions at a rate proportional to the frequencies of the ending nucleotides. It is considered the simplest rate matrix for accommodating variable nucleotide frequencies and is therefore the starting point for the consideration of more complex models with frequency variation (e.g., the HKY model of Hasegawa et al. ). Goldman and Whelan  described new variants of these F81-based models (their +gwF (generalized weighted frequencies) models; e.g., JC+gwF for Jukes and Cantor , and K2P+gwF for Kimura ). At the heart of their +gwF variants was a new free parameter (f) to accommodate the frequencies of the starting, as well as ending, nucleotides in the evolutionary process:
where q ij refers to the substitution rate from nucleotide i to j, π i and π j correspond to their equilibrium base frequencies, and s ij is the exchangeability between the two. In the +gwF variants, the substitution rate becomes more dependent on the ending nucleotide as f decreases from 1 to 0, with f = 0 for the classic F81-type models.
This study starts with a population genetics model to derive equations that link weak selection, genetic drift, and mutation to the f parameter and evolutionary rate matrices of the +gwF variants. These theoretical derivations lead to an expected value of f = 0.5. However, as illustrated with simulations, the f parameter is complex and thus its biological interpretation must be considered with caution.
Derivation of the rate matrix for the weak selection model
The nearly neutral model of molecular evolution states that most DNA mutations of longer-term evolutionary consequence are under weak selection and are therefore prone to drift [7, 8]. For a diploid population of size N, a neutral mutation has a probability of 1/2N of becoming fixed in the population. However, because of drift, even slightly deleterious mutations can become fixed, but at a probability of less than 1/2N. Advantageous mutations have higher fixation probabilities than neutral mutations. In the nearly neutral model, the distribution of alleles is determined by an equilibrium of selection, drift, and mutation.
Consider a number of sites under identical evolutionary constraints and with a bias in nucleotide distribution. Assume that weak selection and drift are the causes of this bias; e.g., as for the codon usage biases in micro-organisms and Drosophila [9, 10]. In our model, some nucleotides confer a slightly higher fitness onto the organism than do others, regardless of their position, and these can become fixed in the population through drift and/or selection. Here, we also assume that selective advantages are additive for the two alleles of the diploid organism [11, 12]. Let the selective advantages of the four nucleotides be given by s A , s C , s G , and s T . The differences between these selection coefficients will be very close to zero, since no strong selection is expected.
Consider a mutation from nucleotide i to j, with a selective advantage of s = s j - s i (a selective disadvantage exists when s is negative). For a population of size N and an effective size of N e , Kimura  showed that the fixation probability in this population is given by:
when s ≠ 0. For s = 0, we have P(s) = 1/2N. This approximation is valid for small values of s, which is the case here.
The substitution rate from nucleotide i to j is proportional to P(s j - s i ):
q ij = 2N μ ij P(s j - s i ), (3)
where μ ij is the mutation rate from i to j. For different i and j, μ ij can vary because of unequal transition versus transversion rates (for example). Furthermore, let us assume that the mutation rate is the same for either direction of substitutions between i and j. This assumption is necessary to maintain the widely used condition of time reversibility in the evolutionary process, which thereby keeps the following derivations tractable [1, 13].
We then have:
Since q ij /q ji can be written as a function evaluated at s j divided by the same function evaluated at s i , evolution is time reversible according to this model with:
Here, c and c' are constants with c' = -l/4N e log c, which will be chosen to make the equilibrium frequencies sum to one. The substitution rates can now be approximated as:
Given an exchangeability of s ij = μ ij , this equation reduces to equation (1) with f = 0.5 and an adjustment factor of:
Starting equilibrium base frequencies and results for the simulations with either homogeneous or heterogeneous sequences (i.e., those with sites from single versus multiple categories, respectively).
0.50 ± 0.01
0.00 ± 0.01
0.50 ± 0.01
0.00 ± 0.01
0.50 ± 0.02
0.00 ± 0.03
0.49 ± 0.01
0.01 ± 0.01
0.51 ± 0.01
-0.01 ± 0.02
0.50 ± 0.02
-0.01 ± 0.01
0.43 ± 0.01
-0.11 ± 0.02
0.34 ± 0.03
-0.16 ± 0.03
0.24 ± 0.02
-0.33 ± 0.03
0.39 ± 0.04
-0.13 ± 0.04
0.68 ± 0.03
0.29 ± 0.04
In the third set of simulations, an acceleration in the C to T substitution rate was incorporated, thereby modeling an increase in their mutation rate due to the deamination of methylated C's in CpG pairs . The introduction of this bias resulted in significant deviations of f in either direction from 0.5, even though their sequences were simulated in equilibrium (Fig. 2). Thus, the value of f can vary considerably when the rates for reciprocal mutations are unequal.
This study illustrates how selection, drift, and mutation within a population can be linked to the f parameter and rate matrices of the +gwF variants for sequence evolution. Our weak selection model relies on the fixation probabilities of mutant alleles with additive genie selection and equal mutation rates for reciprocal substitutions. What is now needed are additional studies that link other population genetics models to the +gwF variants . For example, the population genetics models of Li , which focus on allele frequency distributions and different modes of selection and mutation, could be studied for their connections to the f parameter and +gwF rate matrices.
Collectively, the three sets of simulations highlight that the f parameter is complex and can be influenced by a number of different factors . This complexity limits its biological interpretation and the use of its expected value of 0.5 as derived for the weak selection model. Correspondingly, in many +gwF analyses, f ill need to be estimated as a free parameter rather than fixed beforehand.
Goldman and Whelan  focused on amino acid sequences, where they found that the +gwF models provided better fits to the majority of their protein data sets. They also analyzed two rather small nucleotide data sets for which the general reversible model (REV) outperformed the +gwF variants. As noted by them, the REV model provides enough free parameters to cover the effects of a +gwF analysis. Thus, given sufficient data, this model will consistently outperform the simpler +gwF variants, since it can always accommodate more of the evolutionary process by virtue of its extra parameters. Nevertheless, as widely acknowledged, simpler models have their place, since they allow one to maximize analytical power for more limited data, while minimizing the risk of over-parameterization [13, 16]. Thus, as for the JC, K2P, and HKY models, we expect their +gwF variants to remain of interest as part of the hierarchy of simple to complex models for sequence evolution.
B.K. thanks the Carlsberg Foundation, the University of Aarhus, and the Danish National Science Research Council (grant number 21-00-0283) for their support. Both authors also thank the Department of Zoology, University of Florida for its support.
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376.View ArticlePubMedGoogle Scholar
- Tajima F, Nei M: Biases of the estimates of DNA divergence obtained by the restriction enzyme technique. J Mol Evol. 1982, 18: 115-120.View ArticlePubMedGoogle Scholar
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174.View ArticlePubMedGoogle Scholar
- Goldman N, Whelan S: A novel use of equilibrium frequencies in models of sequence evolution. Mol Biol Evol. 2002, 19: 1821-1831.View ArticlePubMedGoogle Scholar
- Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro HN. 1969, New York: Academic Press, 21-132.View ArticleGoogle Scholar
- Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120.View ArticlePubMedGoogle Scholar
- Ohta T: Slightly deleterious mutant substitutions in evolution. Nature. 1973, 246: 96-98.View ArticlePubMedGoogle Scholar
- Ohta T: The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 1992, 23: 263-286. 10.1146/annurev.es.23.110192.001403.View ArticleGoogle Scholar
- Buhner M: The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991, 129: 897-907.Google Scholar
- Carlini DB, Stephan W: In vivo introduction of unpreferred synonymous codons into the Drosophila Adh gene results in reduced levels of ADH protein. Genetics. 2003, 163: 239-243.PubMed CentralPubMedGoogle Scholar
- Kimura M: On the probability of fixation of mutant genes in populations. Genetics. 1962, 47: 713-719.PubMed CentralPubMedGoogle Scholar
- Kimura M, Ohta T: Population genetics, molecular biometry, and evolution. Proc Sixth Berkeley Symp Math Stat Prob. 1972, 5: 43-68.Google Scholar
- Felsenstein J: Inferring phylogenies. 2004, Sunderland, MA: Sinauer AssociatesGoogle Scholar
- Razin A, Riggs AD: DNA methylation and gene function. Science. 1980, 210: 604-610.View ArticlePubMedGoogle Scholar
- Li WH: Maintenance of genetic variability under mutation and selection pressures in a finite population. Proc Natl Acad Sci U S A. 1977, 74: 2509-2513.PubMed CentralView ArticlePubMedGoogle Scholar
- Posada D, Crandall KA: Selecting models of nucleotide substitution: an application to human immunodeficiency virus 1 (HIV-1). Mol Biol Evol. 2001, 18: 897-906.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.