Skip to main content
  • Methodology article
  • Open access
  • Published:

Computing Ka and Ks with a consideration of unequal transitional substitutions

Abstract

Background

Approximate methods for estimating nonsynonymous and synonymous substitution rates (Ka and Ks) among protein-coding sequences have adopted different mutation (substitution) models. In the past two decades, several methods have been proposed but they have not considered unequal transitional substitutions (between the two purines, A and G, or the two pyrimidines, T and C) that become apparent when sequences data to be compared are vast and significantly diverged.

Results

We propose a new method (MYN), a modified version of the Yang-Nielsen algorithm (YN), for evolutionary analysis of protein-coding sequences in general. MYN adopts the Tamura-Nei Model that considers the difference among rates of transitional and transversional substitutions as well as factors in codon frequency bias. We evaluate the performance of MYN by comparing to other methods, especially to YN, and to show that MYN has minimal deviations when parameters vary within normal ranges defined by empirical data.

Conclusion

Our comparative results deriving from consistency analysis, computer simulations and authentic datasets, indicate that ignoring unequal transitional rates may lead to serious biases and that MYN performs well in most of the tested cases. These results also suggest that acquisitions of reliable synonymous and nonsynonymous substitution rates primarily depend on less biased estimates of transition/transversion rate ratio.

Background

For appraising evolutionary significance of variable protein-coding sequences among diverged species in a quantitative fashion, one of the powerful tools is to compute nonsynonymous and synonymous substitution rates, termed as Ka and Ks, respectively [1–3]. Since Ka and Ks represent the numbers of substitutions per nonsynonymous and synonymous site, respectively, these parameters (or often their ratio ω = Ka/Ks) are used to partition the targeted sequences into three basic scenarios: negative (purifying) selection when Ka < Ks (ω < 1), positive (adaptive) selection when Ka > Ks (ω > 1), and neutral mutation when Ka = Ks (ω = 1).

Approximate methods for estimating Ka and Ks normally involve three steps: numbering synonymous (S) and nonsynonymous (N) sites, counting synonymous (Sd) and nonsynonymous (Nd) substitutions, and correcting for multiple substitutions. Over the past two decades, several methods have been developed [4–13], which are based on different mutation (substitution) models with subtle yet significant differences [14–16]. Among them, Yang and Nielsen made a valuable attempt to consider differences among transitional and transversional substitutions as well as codon frequency bias [4], and their method (denoted as YN in this report), based on the more realistic HKY Model [15] and implemented in PAML (Phylogenetic Analysis by Maximum Likelihood; [17]), has become increasingly popular in the field of molecular evolution studies. However, it does not exclude the possibility that methods based on simpler models have some favorable properties [18]. There are always tradeoffs between incorporating more features into models and avoiding over-parameterization for more accurate captures of evolutionary information [19–21]. In this report, we propose an improved method, a m odified YN algorithm or MYN, based on the Tamura-Nei Model, in which transitional changes (between A and G or T and C) are not assumed to occur with an equal frequency [22]. We also review the basics involved in estimating Ka and Ks (Table 1), explain differences between the two methods, and provide our comparative results as an in-depth evaluation for the new method.

Table 1 Symbols used in estimating Ka and Ks

Results

To compare the performance of YN and MYN, we need to generate simulated and empirical datasets with careful considerations on possible features in evolution of diverged protein-coding sequences, such as biases in transitional rates and codon frequencies. We simulated hypothetical common ancestral sequences according to codon frequencies that were derived from three basic datasets: (1) equal codon frequencies (each sense codon frequency for canonical genetic code is 1/61, and other codes can be accommodated by making simple modifications [4]), (2) human codon frequencies (based on 39,420 human protein-coding genes from ENSEMBL database, Release 35 [23]), and (3) rice codon frequencies (deduced from 19,079 rice protein-coding genes [24]). We used the formula 100% × [(estimated value) - (expected value)]/(expected value) to calculate percentage errors for assessing relative biases between estimated and expected values.

Consistency analysis

We first used a set of data that collectively contain 2 million codons, assuming that a good approximate method should not deviate too far from the real value with near infinite amount of data [4]. Although the selective strength, reflected in ω, differs from gene to gene, some representative values can be set based on analyses of empirical data; we use ω = 0.3, 1, and 3 as such values for negative, neutral, and positive mutations, respectively [3, 4, 25]. We fix t = 0.6 for the initial analysis and the effect of t is examined later. Since genuine values for κ often range from 1.5 to 5, we take 3.75 as a representative one. Considering that MYN differentiates κY from κR, we always fix one of them to 3.75 and allow the other to vary from 1 to 10. We plotted percentage errors for ω between data generated with YN and MYN against κR (fixing κY = 3.75) for different expected values, using the three codon frequencies from our test datasets (Figure 1 A to I). Similar results are readily obtained for fixed κR and variable κY (data not shown).

Figure 1
figure 1

Percentage errors of estimated ω (= Ka/Ks) by YN and MYN when κ Y = 3.75, considering κ R varying from 1 to 10. The percentage error was calculated by the formula 100% × [(estimated value) - (expected value)]/(expected value). The canonical genetic code was used for simulated sequences with 2 million codons. Three sets of codon frequencies were used: equal (A to C), human (D to F) calculated from human protein-coding genes, and rice (G to I) calculated from rice protein-coding genes. ω = 0.3 (A, D, G), ω = 1 (B, E, H), and ω = 3 (C, F, I) were considered as representative values for purifying selection, neutral mutation and positive selection, respectively.

Different codon frequencies have minor influence on the performance of YN and MYN. Ignoring the difference between κR and κY, YN gives estimates close to the expected values only if κR ≈ κY. For instance, when κR = 4, the percentage errors for ω calculated with YN under human codon frequencies are 1.98%, 2.51%, and 0.78% when ω = 0.3, 1, and 3, respectively (Figure 1D to 1F). When κR ≠ κY, YN gives rise to obviously biased ω; it tend to underestimate ω when κR < κY and to overestimate ω when κR > κY. The percentage errors for ω = 0.3, 1, and 3, when human codon frequencies are used, are -4.16%, -7.09% and -11.03% when κR = 1, and are 16.82%, 11.24%, and 5.87% when κR = 10, respectively. Compared to YN, MYN appears to produce lower percentage errors in ω estimations in most cases. When κR ≈ κY, MYN performs in a similar way as YN (it becomes equivalent to YN when κR = κY).

We also compared YN and MYN in Ks estimations. We took a similar approach as what Tzeng and co-workers used to evaluate the expected values of Ks [12]. We plotted percentage errors of Ks against κR (assuming κY = 3.75; Figure 2A to 2C) based on human codon frequencies, showing similar results when taking equal or rice codon frequencies (data not shown). YN and MYN give similar estimates of Ks when κR ≈ κY. When κR = 4, the percentage errors of the estimated Ks generated by YN and MYN are -1.83% and -2.11% for ω = 0.3, -3.77% and -3.78% for ω = 1, -2.71% and -2.56% for ω = 3, respectively. YN tends to overestimate Ks when κR < κY and to underestimate Ks when κR > κY and the bias becomes serious with increasing κR. MYN sometimes gives rise to larger biases than YN when κR < κY, but it overall performs better for most of the parameter combinations tested, especially when κR > κY.

Figure 2
figure 2

Percentage errors of estimated Ks by YN and MYN when κ Y = 3.75, considering κ R varying from 1 to 10. The percentage error was calculated by the formula 100% × [(estimated value) - (expected value)]/(expected value). Sequences with 2 million codons were simulated with human codon frequencies. Three representative values of ω (0.3 in A, 1 in B, and 3 in C) were used for purifying selection, neutral mutation, and positive selection, respectively.

Effects of κR and κY

We examined the effects of κR and κY with a set of simulated sequences, three ω values (0.3, 1 and 3), and two t values (0.1 and 1). We considered the case of κR ≥ κY and four parameter sets: (1) κR = 3 and κY = 1.5, (2) κR = 5 and κY = 1.5, (3) κR = 10 and κY = 1, (4) κR = κY = 3.75. To avoid stochastic errors, we generated 1,000 pairs of sequences with 400 codons each for all the tests. We only described the results when human codon frequencies were used since similar results were obtained for equal or rice codon frequencies. The average estimates of Ka, Ks, and ω were computed with YN and MYN as well as expected values of Ka and Ks (Table 2).

Table 2 Average estimates of Ka, Ks, and ω with YN and MYN

Without considering the difference between κR and κY, YN produces minor bias when κR = κY, and underestimates Ks and overestimates ω when κR > κY. MYN is less biased compared with YN for most parameter combinations. The results agree with the infinite data test for consistency. For example, when κR > κY, YN gives positive values for ω and negative values for Ks in percentage errors for the infinite data. As a result, it overestimates ω and underestimates Ks for sequences with normal length such as 400 codons albeit far from dramatic.

Effect of t

We let t vary from 0.1 to 1 to evaluate its effect. We again used the human codon frequencies for simulations (1,000 pairs of sequences with 400 codons for each case), and tested three different parameter combinations: (1) ω = 0.3, κR = 10 and κY = 1; (2) ω = 1, κR = 10 and κY = 1; (3) ω = 3, κY = 10 and κR = 1. We plotted average estimates of ω with YN and MYN against t for the parameter combinations (Figure 3A to 3C).

Figure 3
figure 3

Average ω estimates over 1,000 pairs of sequences when divergence time t varies from 0.1 to 1. Human codon frequencies were used for simulating sequences (400 codons each). The parameters are ω = 0.3, κR = 10, κY = 1 in A; ω = 1, κR = 10, κY = 1 in B; and ω = 3, κR = 1, κY = 10 in C. Note that the scales of y-axis are different in all three panels.

Both YN and MYN have a nearly parallel overall trend when t varies from 0.1 to 1 but MYN deviates less from the expected values. They both tend to overestimate ω for purifying selection and to underestimate ω for positive selection, whereas t has little influence on ω for neutral mutation. YN overestimates ω when κR > κY and underestimates ω when κR < κY, which is consistent with those found in the infinite data test. MYN tends to overestimate ω for the expected ω = 0.3 and this overestimation becomes severe with increasing t, but it is relatively subtle when compared to YN. When ω = 1 and 3, MYN gives closer ω estimates than YN over most of the parameter combinations.

Effect of sequence length

To examine the effect of variable protein length (number of codons), we took the human codon frequencies, 1,000 pairs of sequences for each case, and two sets of parameters for the analysis: (1) ω = 0.3, κR = 10, κY = 1, and t = 1; (2) ω = 3, κR = 1, κY = 10, and t = 0.1. We calculated the average ω estimates when numbers of codons vary from 100 to 1,000 for the simulated sequences (Table 3). YN and MYN both give larger biases for short sequences (< 300 codons). Despite the fact that YN and MYN do not perform satisfactorily when target sequences are rather short, MYN is less biased than YN for most of the parameter settings.

Table 3 Average estimates of ω with YN and MYN

Testing real data

We collected three orthologous datasets from NCBI HomoloGene database (Build 44.1 [26]): 14,329 pairs of human-mouse, 10,851 pairs of human-dog, and 13,544 pairs of mouse-rat. For a more comprehensive display, we examined the cumulative percentage of κR - κY (Figure 4), emphasizing different transitional substitutions with unequal frequencies. For instance, the cumulative percentages for κR - κY > 1 for human-mouse, human-dog, and mouse-rat orthologs are 30.6%, 33.2%, and 39.7%, respectively, and those for κR - κY < -1 are all approximately 15%. The rest, for | κR - κY | ≤ 1, are 53.6%, 51.5%, and 45.4% for the three ortholog groups, respectively.

Figure 4
figure 4

Cumulative percentage of κ R – κ Y for human-mouse (squares), human-dog (circles), and mouse-rat (triangles) orthologs at a bin size of 0.1. Dashed lines were used to show the cases when κR-κY = -1 and κR-κY = 1.

To evaluate the performance of MYN, we compared a set of values (S%, Ka, Ks, and ω) generated with three other selected methods in a similar way, considering κR - κY > 1, κR - κY < -1, and | κR - κY | ≤ 1 (Table 4). Other than YN, we used another approximate method developed by Li (1993) and by Pamilo and Bianchi (1993) independently (denoted as LPB), and a maximum likelihood method proposed by Goldman and Yang (1994) (denoted as GY). Despite the fact that all four methods yield similar Ka estimates, they do show some differences in other parameter settings. As a whole, LPB tends to overestimate ω and to underestimate Ks when compared to MYN. GY performs similarly as YN does since both consider transition/transversion rate bias and nucleotide (codon) frequency bias. For instance, if κR - κY < -1, both methods, compared to MYN, underestimate S% and ω, and overestimate Ks as we have demonstrated in the simulation studies. In the case of | κR - κY | ≤ 1, YN and MYN work in a similar way. When confined to κR - κY > 1, GY and YN both overestimate S% and ω, and underestimate Ks, compared to MYN. Taking the Ks estimates as an example, they are 0.527 and 0.597 for human-mouse orthologs, 0.329 and 0.357 for human-dog orthologs, and 0.186 and 0.201 for mouse-rat orthologs, calculated with YN and MYN, respectively. Similarly, the estimated ω with YN and MYN are 0.121 and 0.105 for human-mouse orthologs, 0.158 and 0.143 for human-dog orthologs, and 0.157 and 0.142 for mouse-rat orthologs, respectively.

Table 4 Proportions of synonymous sites (S%) and estimates of Ka, Ks and ω

Discussion

MYN, proposed as a modified YN in this paper, allows for unequal transitional rates between purines and between pyrimidines as well as transversional rate and nucleotide (codon) frequencies. Our comparative analyses indicate that ignoring unequal transitional rates often results in closer estimates only when κR ≈ κY but rather biased estimates when κR > κY or κR < κY, and that MYN is more robust in simulations and real datasets by comparison with other methods, especially with YN. Therefore, it is important to take account of unequal transitional rates for accurately capturing evolutionary information (see one of Ka and Ks applications; [27]) when unequal transitional rates among compared sequences exist. In addition, unequal transitional rates can also be implemented in a maximum likelihood framework [10] that allows model evaluation for choosing a better suit to a dataset and therefore allows users to obtain more reliable estimates.

How does κ lead to biased estimates of Ka and Ks?

All approximate methods for Ka and Ks estimations may one way or another give rise to biased results for at least some parameter combinations. We examined κ, which is used not only in estimating S and N, but also in generating a transition probability matrix for estimating Sd and Nd. Since the sum of Sd and Nd between the two compared sequences is always smaller than the sum of possible sites (S + N or the effective length of compared sequences), the influence of κ on S and N is significantly stronger than that on Sd and Nd. Hence, let us focus on the effect of κ on S and N. YN uses κ and codon frequencies to estimate S and N. Since codon frequencies are constant, estimated from targeted sequences, and transitions between two codons are more likely to be synonymous especially at the third codon positions, S thus (as a function of κ and codon frequencies) is positively correlated to κ. Furthermore, the values of Ks depend on S (Ks ≈ Sd /S, assuming no correction for multiple substitutions). Therefore, an overestimated κ could give rise to overestimation of S and underestimation of Ks, which results in overestimation of ω. Likewise, underestimation of κ leads to overestimation of Ks and underestimation of ω. Therefore, κ correlates negatively with Ks and positively with ω and such correlations are also applicable to κR and κY.

Let us examine how κ affects Ka and Ks in our simulations. When κR < κY, YN tends to underestimate ω and to overestimate Ks, which results from the underestimation of κ by assuming κY equal to κR. The trend sometimes is less obvious, because about 4% loss of sites are due to mutations leading to stop codons, resulting in slightly underestimated Ka and Ks [4, 28]. Similarly, when κR > κY, YN overestimates κ, resulting in overestimation of ω and underestimation of Ks. The bias of ω becomes more pronounced as the bias of κ varies to extremes, when κR >> κY or κR << κY. In addition, κR and κY can also affect the estimates from MYN. Since transitions are more likely to occur than transversions, a decrease in κR can be related to more drastic reduction of transversions than transitions between purines, resulting in overestimated κR; this overestimation becomes severer for purifying selection due to a higher occurrence of synonymous substitutions than nonsynonymous ones, and synonymous substitutions are also more likely to be transitional. As a result, ω can be overestimated (Figure 1) and Ks can be underestimated (Figure 2), especially when κR values are smaller. Therefore, to acquire more reliable estimates on Ka and Ks, it is essential to have a less biased estimate of κ, or both κR and κY.

Influences of t and sequence length

Divergence time (t) and sequence length both influence the estimation of Ka and Ks. Since κ in YN, κR and κY in MYN are all deduced from fourfold-degenerate sites at the third codon positions and non-degenerate sites (often the first and the second codon positions), larger t leads to multiple substitutions between the two compared codons, resulting in under-counts of substitution events (a reduction in effective sampling size) for estimating κ, κR and κY. Therefore, extreme t tends to decrease the effective sampling size and short sequences tend to exaggerate this effect, resulting in biased estimates of κ, κR and κY. We indeed found that both YN and MYN tend to overestimate ω for purifying selection and underestimate ω for positive selection and these biases become severe as t increases (Figure 3). These biased estimations are in fact indirect results of biased estimates of κ, κR and κY. The fact is that synonymous substitutions are more likely resulted from transitions than nonsynonymous ones are, such as the cases of two-fold degenerate codons, so the bias between κR and κY does exist. In addition, since for negative selection, synonymous substitutions have higher possibilities to occur than nonsynonymous ones, transitions are more likely to occur than transversions. Therefore, with a decreasing effective sampling size caused by the increasing t, YN tends to underestimate transversional rate and thus to overestimate κ. In the case of positive selection, YN underestimates transitional rate and κ in a similar way. As to neutral mutation, YN tends to simultaneously underestimate both transitional and transversional rates so that the varying t has no apparent effect. Since MYN adopts a similar approach as YN estimates κ, it shows a similar trend. Short sequences have similar effect as smaller sampling size, and therefore, both YN and MYN may give rise to biased estimates κ, κR and κY, albeit MYN's better performance in less extreme cases.

Conclusion

We compared MYN with other methods, especially with YN, by examining infinitely long sequences, performing computer simulations and analyzing real datasets, and found that MYN has minimal deviation even when parameters vary within normal ranges defined by empirical data. In addition, these results indicate that biased estimates of Ka and Ks primarily stem from biased estimates of κ, or both κR and κY, which can be influenced by t and sequence length.

Methods

Methods for estimating Ka and Ks consider sequence variations of both DNA and protein, which are related through the genetic code. Since we are engaging in a generally purposed discussion, the genetic code is always referred to the canonical code. As a DNA-centric consideration, nucleotides substitutions only have two types, either within purines and pyrimidines as transitions or between them as transversions. As a protein-centric consideration, each nucleotide triplet (codon) is defined as they vary according to nucleotide changes, except stop codons (TAG, TAA, and TGA). For protein-coding genes, nucleotide substitutions are classified as nonsynonymous and synonymous (silent), referring to changes that do or do not provoke amino acid variations. Although there are several mutation (substitution) models that take these sequence variation features into account, in this report we limit our discussion only to the HKY and the Tamura-Nei Models (see Table S1 in the additional file 1 for details).

Mutation model

YN adopts the HKY Model that considers transitional rate, transversional rate, and unequal nucleotide (codon) frequencies. It uses an iterative approach to estimate Ka and Ks. Before iteration, YN computes nucleotide frequencies (regarding to the three codon positions), κ, S and N from compared sequences. Codon frequencies are calculated by multiplying each nucleotide frequencies. κ is estimated from fourfold degenerate sites at the third codon positions and non-degenerate sites. S and N are calculated by using κ and codon frequencies. YN then chooses initial values for t and ω as starting point for iteration. It generates a transition probability matrix that represents substitution probabilities from one codon to another by using ω, t, κ, and codon frequencies. This transition probability matrix is then used to deduce Sd and Nd. Hence, new estimates of ω and t can be obtained. YN repeats the calculation for another transition probability matrix, until the algorithm converges.

Compared to the HKY Model, the Tamura-Nei Model simply considers more parameters. It differentiates αR from αY according to different transitional substitutions. In fact, if the transitional rates between purines and between pyrimidines are set equal (αR = αY), the model becomes the HKY Model. Since the Tamura-Nei Model distinguishes αR and αY, we correspondingly use κR and κY to denote the ratios of transitional rates between purines and between pyrimidines over the transversional rate, respectively. MYN also needs a transition probability matrix similar to what YN has. We give the substitution rate q ij from any sense codon i to j (i ≠ j) to generate a transition probability matrix as follows:

q i j = { 0 , if  i  and  j  differ by more than one differnce π j , if  i  and  j  differ by a synonymous transversion κ R π j , if  i  and  j  differ by a synonymous transition between purines κ Y π j ,if  i  and  j  differ by a synonymous transition between pyrimidines ω π j , if  i  and  j  differ by a nonsynonymous transversion ω κ R π j , if  i  and  j  differ by a nonsynonymous transition between purines ω κ Y π j , if  i  and  j  differ by a nonsynonymous transition between pyrimidines ( 1 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGXbqCdaWgaaWcbaGaemyAaKMaemOAaOgabeaakiabg2da9maaceaabaqbaeaabyqaaaaabaGaeGimaaJaeiilaWIaeeyAaKMaeeOzayMaeeiiaaIaemyAaKMaeeiiaaIaeeyyaeMaeeOBa4MaeeizaqMaeeiiaaIaemOAaOMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeiiaaIaeeOyaiMaeeyEaKNaeeiiaaIaeeyBa0Maee4Ba8MaeeOCaiNaeeyzauMaeeiiaaIaeeiDaqNaeeiAaGMaeeyyaeMaeeOBa4MaeeiiaaIaee4Ba8MaeeOBa4MaeeyzauMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeOBa4Maee4yamMaeeyzaugabaGaeqiWda3aaSbaaSqaaiabdQgaQbqabaGccqGGSaalcqqGPbqAcqqGMbGzcqqGGaaicqWGPbqAcqqGGaaicqqGHbqycqqGUbGBcqqGKbazcqqGGaaicqWGQbGAcqqGGaaicqqGKbazcqqGPbqAcqqGMbGzcqqGMbGzcqqGLbqzcqqGYbGCcqqGGaaicqqGIbGycqqG5bqEcqqGGaaicqqGHbqycqqGGaaicqqGZbWCcqqG5bqEcqqGUbGBcqqGVbWBcqqGUbGBcqqG5bqEcqqGTbqBcqqGVbWBcqqG1bqDcqqGZbWCcqqGGaaicqqG0baDcqqGYbGCcqqGHbqycqqGUbGBcqqGZbWCcqqG2bGDcqqGLbqzcqqGYbGCcqqGZbWCcqqGPbqAcqqGVbWBcqqGUbGBaeaacqaH6oWAdaWgaaWcbaGaeeOuaifabeaakiabec8aWnaaBaaaleaacqWGQbGAaeqaaOGaeiilaWIaeeyAaKMaeeOzayMaeeiiaaIaemyAaKMaeeiiaaIaeeyyaeMaeeOBa4MaeeizaqMaeeiiaaIaemOAaOMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeiiaaIaeeOyaiMaeeyEaKNaeeiiaaIaeeyyaeMaeeiiaaIaee4CamNaeeyEaKNaeeOBa4Maee4Ba8MaeeOBa4MaeeyEaKNaeeyBa0Maee4Ba8MaeeyDauNaee4CamNaeeiiaaIaeeiDaqNaeeOCaiNaeeyyaeMaeeOBa4Maee4CamNaeeyAaKMaeeiDaqNaeeyAaKMaee4Ba8MaeeOBa4MaeeiiaaIaeeOyaiMaeeyzauMaeeiDaqNaee4DaCNaeeyzauMaeeyzauMaeeOBa4MaeeiiaaIaeeiCaaNaeeyDauNaeeOCaiNaeeyAaKMaeeOBa4MaeeyzauMaee4CamhabaGaeqOUdS2aaSbaaSqaaiabbMfazbqabaGccqaHapaCdaWgaaWcbaGaemOAaOgabeaakiabbYcaSiabbMgaPjabbAgaMjabbccaGiabdMgaPjabbccaGiabbggaHjabb6gaUjabbsgaKjabbccaGiabdQgaQjabbccaGiabbsgaKjabbMgaPjabbAgaMjabbAgaMjabbwgaLjabbkhaYjabbccaGiabbkgaIjabbMha5jabbccaGiabbggaHjabbccaGiabbohaZjabbMha5jabb6gaUjabb+gaVjabb6gaUjabbMha5jabb2gaTjabb+gaVjabbwha1jabbohaZjabbccaGiabbsha0jabbkhaYjabbggaHjabb6gaUjabbohaZjabbMgaPjabbsha0jabbMgaPjabb+gaVjabb6gaUjabbccaGiabbkgaIjabbwgaLjabbsha0jabbEha3jabbwgaLjabbwgaLjabb6gaUjabbccaGiabbchaWjabbMha5jabbkhaYjabbMgaPjabb2gaTjabbMgaPjabbsgaKjabbMgaPjabb6gaUjabbwgaLjabbohaZbqaaiabeM8a3jabec8aWnaaBaaaleaacqWGQbGAaeqaaOGaeiilaWIaeeyAaKMaeeOzayMaeeiiaaIaemyAaKMaeeiiaaIaeeyyaeMaeeOBa4MaeeizaqMaeeiiaaIaemOAaOMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeiiaaIaeeOyaiMaeeyEaKNaeeiiaaIaeeyyaeMaeeiiaaIaeeOBa4Maee4Ba8MaeeOBa4Maee4CamNaeeyEaKNaeeOBa4Maee4Ba8MaeeOBa4MaeeyEaKNaeeyBa0Maee4Ba8MaeeyDauNaee4CamNaeeiiaaIaeeiDaqNaeeOCaiNaeeyyaeMaeeOBa4Maee4CamNaeeODayNaeeyzauMaeeOCaiNaee4CamNaeeyAaKMaee4Ba8MaeeOBa4gaeaqabeaacqaHjpWDcqaH6oWAdaWgaaWcbaGaeeOuaifabeaakiabec8aWnaaBaaaleaacqWGQbGAaeqaaOGaeiilaWIaeeyAaKMaeeOzayMaeeiiaaIaemyAaKMaeeiiaaIaeeyyaeMaeeOBa4MaeeizaqMaeeiiaaIaemOAaOMaeeiiaaIaeeizaqMaeeyAaKMaeeOzayMaeeOzayMaeeyzauMaeeOCaiNaeeiiaaIaeeOyaiMaeeyEaKNaeeiiaaIaeeyyaeMaeeiiaaIaeeOBa4Maee4Ba8MaeeOBa4Maee4CamNaeeyEaKNaeeOBa4Maee4Ba8MaeeOBa4MaeeyEaKNaeeyBa0Maee4Ba8MaeeyDauNaee4CamNaeeiiaaIaeeiDaqNaeeOCaiNaeeyyaeMaeeOBa4Maee4CamNaeeyAaKMaeeiDaqNaeeyAaKMaee4Ba8MaeeOBa4MaeeiiaaIaeeOyaiMaeeyzauMaeeiDaqNaee4DaCNaeeyzauMaeeyzauMaeeOBa4MaeeiiaaIaeeiCaaNaeeyDauNaeeOCaiNaeeyAaKMaeeOBa4MaeeyzauMaee4CamhabaGaeqyYdCNaeqOUdS2aaSbaaSqaaiabbMfazbqabaGccqaHapaCdaWgaaWcbaGaemOAaOgabeaakiabcYcaSiabbMgaPjabbAgaMjabbccaGiabdMgaPjabbccaGiabbggaHjabb6gaUjabbsgaKjabbccaGiabdQgaQjabbccaGiabbsgaKjabbMgaPjabbAgaMjabbAgaMjabbwgaLjabbkhaYjabbccaGiabbkgaIjabbMha5jabbccaGiabbggaHjabbccaGiabb6gaUjabb+gaVjabb6gaUjabbohaZjabbMha5jabb6gaUjabb+gaVjabb6gaUjabbMha5jabb2gaTjabb+gaVjabbwha1jabbohaZjabbccaGiabbsha0jabbkhaYjabbggaHjabb6gaUjabbohaZjabbMgaPjabbsha0jabbMgaPjabb+gaVjabb6gaUjabbccaGiabbkgaIjabbwgaLjabbsha0jabbEha3jabbwgaLjabbwgaLjabb6gaUjabbccaGiabbchaWjabbMha5jabbkhaYjabbMgaPjabb2gaTjabbMgaPjabbsgaKjabbMgaPjabb6gaUjabbwgaLjabbohaZbaaaiaaxMaacaWLjaWaaeWaaeaacqaIXaqmaiaawIcacaGLPaaaaiaawUhaaaaa@588A@

Estimating κR and κY

Before generating the transition probability matrix, we need to estimate κR and κY. In a similar way to YN's estimation of κ, we estimate four nucleotide frequencies (gT, gC, gA, gG), proportions of transitional differences between purines (TR) and between pyrimidines (TY), and the proportion of transversional differences (V) from compared sequences. We calculate

a = log ( 1 − g R 2 g A g G T R − 1 2 g R V ) b = log ( 1 − g Y 2 g T g C T Y − 1 2 g Y V ) c = log ( 1 − 1 2 g R g Y V ) ( 2 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaqbaeaabiGaaaqaaiabdggaHjabg2da9iGbcYgaSjabc+gaVjabcEgaNjabcIcaOiabigdaXiabgkHiTmaalaaabaGaem4zaC2aaSbaaSqaaiabdkfasbqabaaakeaacqaIYaGmcqWGNbWzdaWgaaWcbaGaemyqaeeabeaakiabdEgaNnaaBaaaleaacqWGhbWraeqaaaaakiabdsfaunaaBaaaleaacqqGsbGuaeqaaOGaeyOeI0YaaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGNbWzdaWgaaWcbaGaemOuaifabeaaaaGccqWGwbGvcqGGPaqkaeaacqWGIbGycqGH9aqpcyGGSbaBcqGGVbWBcqGGNbWzcqGGOaakcqaIXaqmcqGHsisldaWcaaqaaiabdEgaNnaaBaaaleaacqWGzbqwaeqaaaGcbaGaeGOmaiJaem4zaC2aaSbaaSqaaiabdsfaubqabaGccqWGNbWzdaWgaaWcbaGaem4qameabeaaaaGccqWGubavdaWgaaWcbaGaeeywaKfabeaakiabgkHiTmaalaaabaGaeGymaedabaGaeGOmaiJaem4zaC2aaSbaaSqaaiabdMfazbqabaaaaOGaemOvayLaeiykaKcabaGaem4yamMaeyypa0JagiiBaWMaei4Ba8Maei4zaCMaeiikaGIaeGymaeJaeyOeI0YaaSaaaeaacqaIXaqmaeaacqaIYaGmcqWGNbWzdaWgaaWcbaGaemOuaifabeaakiabdEgaNnaaBaaaleaacqWGzbqwaeqaaaaakiabdAfawjabcMcaPaqaaaaacaWLjaGaaCzcamaabmaabaGaeGOmaidacaGLOaGaayzkaaaaaa@8C5A@

where gR = gA + gG and gY = gT + gC. Then we use equation 3 to estimate κR and κY.

κ R = a − g Y × c g R × c κ Y − b − g R × c g Y × c ( 3 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaqbaeqabeGaaaqaaiabeQ7aRnaaBaaaleaacqqGsbGuaeqaaOGaeyypa0ZaaSaaaeaacqWGHbqycqGHsislcqWGNbWzdaWgaaWcbaGaemywaKfabeaakiabgEna0kabdogaJbqaaiabdEgaNnaaBaaaleaacqWGsbGuaeqaaOGaey41aqRaem4yamgaaaqaaiabeQ7aRnaaBaaaleaacqqGzbqwaeqaaOGaeyOeI0YaaSaaaeaacqWGIbGycqGHsislcqWGNbWzdaWgaaWcbaGaemOuaifabeaakiabgEna0kabdogaJbqaaiabdEgaNnaaBaaaleaacqWGzbqwaeqaaOGaey41aqRaem4yamgaaaaacaWLjaGaaCzcamaabmaabaGaeG4mamdacaGLOaGaayzkaaaaaa@660E@

d = − 2 g A g G g R a − 2 g T g C g Y b − 2 ( g R g Y − g A g G g Y g R − g T g C g R g Y ) c ( 4 ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamXvP5wqSXMqHnxAJn0BKvguHDwzZbqegyvzYrwyUfgarqqtubsr4rNCHbGeaGqiA8vkIkVAFgIELiFeLkFeLk=iY=Hhbbf9v8qqaqFr0xc9pk0xbba9q8WqFfeaY=biLkVcLq=JHqVepeea0=as0db9vqpepesP0xe9Fve9Fve9GapdbaqaaeGacaGaaiaabeqaamqadiabaaGcbaGaemizaqMaeyypa0JaeyOeI0YaaSaaaeaacqaIYaGmcqWGNbWzdaWgaaWcbaGaemyqaeeabeaakiabdEgaNnaaBaaaleaacqWGhbWraeqaaaGcbaGaem4zaC2aaSbaaSqaaiabdkfasbqabaaaaOGaemyyaeMaeyOeI0YaaSaaaeaacqaIYaGmcqWGNbWzdaWgaaWcbaGaemivaqfabeaakiabdEgaNnaaBaaaleaacqWGdbWqaeqaaaGcbaGaem4zaC2aaSbaaSqaaiabdMfazbqabaaaaOGaemOyaiMaeyOeI0IaeGOmaiJaeiikaGIaem4zaC2aaSbaaSqaaiabdkfasbqabaGccqWGNbWzdaWgaaWcbaGaemywaKfabeaakiabgkHiTmaalaaabaGaem4zaC2aaSbaaSqaaiabdgeabbqabaGccqWGNbWzdaWgaaWcbaGaem4raCeabeaakiabdEgaNnaaBaaaleaacqWGzbqwaeqaaaGcbaGaem4zaC2aaSbaaSqaaiabdkfasbqabaaaaOGaeyOeI0YaaSaaaeaacqWGNbWzdaWgaaWcbaGaemivaqfabeaakiabdEgaNnaaBaaaleaacqWGdbWqaeqaaOGaem4zaC2aaSbaaSqaaiabdkfasbqabaaakeaacqWGNbWzdaWgaaWcbaGaemywaKfabeaaaaGccqGGPaqkcqWGJbWycaWLjaGaaCzcamaabmaabaGaeGinaqdacaGLOaGaayzkaaaaaa@7B89@

The detailed procedures for deducing κR and κY were summarized in the additional file 1. We also made other modifications accordingly, such as using κR and κY to estimate S and N, generating relevant transition probability matrix (equation 1), considering different transitional evolution pathways to count Sd and Nd, and correcting for multiple substitutions when estimating Ka and Ks (equation 4; [22]).

References

  1. Kimura M: The neutral theory of molecular evolution. 1983, Cambridge, England , Cambridge University Press

    Chapter  Google Scholar 

  2. Gillespie JH: The causes of molecular evolution. 1991, Oxford, England , Oxford University Press

    Google Scholar 

  3. Li WH: Molecular evolution. 1997, Sunderland, Mass. , Sinauer Associates

    Google Scholar 

  4. Yang Z, Nielsen R: Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models. Mol Biol Evol. 2000, 17 (1): 32-43.

    Article  CAS  PubMed  Google Scholar 

  5. Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986, 3 (5): 418-426.

    CAS  PubMed  Google Scholar 

  6. Ina Y: New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J Mol Evol. 1995, 40 (2): 190-226. 10.1007/BF00167113.

    Article  CAS  PubMed  Google Scholar 

  7. Li WH: Unbiased estimation of the Rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993, 36: 96-99. 10.1007/BF02407308.

    Article  CAS  PubMed  Google Scholar 

  8. Li WH, Wu CI, Luo CC: A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985, 2 (2): 150-174.

    PubMed  Google Scholar 

  9. Pamilo P, Bianchi NO: Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol. 1993, 10 (2): 271-281.

    CAS  PubMed  Google Scholar 

  10. Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994, 11 (5): 725-736.

    CAS  PubMed  Google Scholar 

  11. Comeron JM: A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J Mol Evol. 1995, 41 (6): 1152-1159. 10.1007/BF00173196.

    Article  CAS  PubMed  Google Scholar 

  12. Tzeng YH, Pan R, Li WH: Comparison of three methods for estimating rates of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 2004, 21 (12): 2290-2298. 10.1093/molbev/msh242.

    Article  CAS  PubMed  Google Scholar 

  13. Muse SV, Gaut BS: A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994, 11 (5): 715-724.

    CAS  PubMed  Google Scholar 

  14. Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16 (2): 111-120. 10.1007/BF01731581.

    Article  CAS  PubMed  Google Scholar 

  15. Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22 (2): 160-174. 10.1007/BF02101694.

    Article  CAS  PubMed  Google Scholar 

  16. Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro HN. 1969, New York , Academic Press, 21-123.

    Chapter  Google Scholar 

  17. Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13 (5): 555-556.

    CAS  PubMed  Google Scholar 

  18. Muse SV: Estimating synonymous and nonsynonymous substitution rates. Mol Biol Evol. 1996, 13 (1): 105-114.

    Article  CAS  PubMed  Google Scholar 

  19. Lio P, Goldman N: Models of Molecular Evolution and Phylogeny. Genome Res. 1998, 8 (12): 1233-1244.

    CAS  PubMed  Google Scholar 

  20. Goldman N, Yang Z: Models of DNA substitution and the discrimination of evolutionary parameters: International Biometric Society, Hamilton, Ontario, Canada.1994, , 407-421.

    Google Scholar 

  21. Takahashi K, Nei M: Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol. 2000, 17 (8): 1251-1258.

    Article  CAS  PubMed  Google Scholar 

  22. Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993, 10 (3): 512-526.

    CAS  PubMed  Google Scholar 

  23. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E: Ensembl 2005. Nucleic Acids Res. 2005, 33 (Database issue): D447-53. 10.1093/nar/gki138.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  24. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, Zhang J, Zhang Y, Li R, Xu Z, Li S, Li X, Zheng H, Cong L, Lin L, Yin J, Geng J, Li G, Shi J, Liu J, Lv H, Li J, Wang J, Deng Y, Ran L, Shi X, Wang X, Wu Q, Li C, Ren X, Wang J, Wang X, Li D, Liu D, Zhang X, Ji Z, Zhao W, Sun Y, Zhang Z, Bao J, Han Y, Dong L, Ji J, Chen P, Wu S, Liu J, Xiao Y, Bu D, Tan J, Yang L, Ye C, Zhang J, Xu J, Zhou Y, Yu Y, Zhang B, Zhuang S, Wei H, Liu B, Lei M, Yu H, Li Y, Xu H, Wei S, He X, Fang L, Zhang Z, Zhang Y, Huang X, Su Z, Tong W, Li J, Tong Z, Li S, Ye J, Wang L, Fang L, Lei T, Chen C, Chen H, Xu Z, Li H, Huang H, Zhang F, Xu H, Li N, Zhao C, Li S, Dong L, Huang Y, Li L, Xi Y, Qi Q, Li W, Zhang B, Hu W, Zhang Y, Tian X, Jiao Y, Liang X, Jin J, Gao L, Zheng W, Hao B, Liu S, Wang W, Yuan L, Cao M, McDermott J, Samudrala R, Wang J, Wong GK, Yang H: The Genomes of Oryza sativa: a history of duplications. PLoS Biol. 2005, 3 (2): e38-10.1371/journal.pbio.0030038.

    Article  PubMed Central  PubMed  Google Scholar 

  25. Messier W, Stewart CB: Episodic adaptive evolution of primate lysozymes. Nature. 1997, 385 (6612): 151-154. 10.1038/385151a0.

    Article  CAS  PubMed  Google Scholar 

  26. NCBI HomoloGene. [ftp://ftp.ncbi.nih.gov/pub/HomoloGene/]

  27. Nekrutenko A, Makova KD, Li WH: The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 2002, 12 (1): 198-202. 10.1101/gr.200901.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  28. Yang Z, Nielsen R: Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J Mol Evol. 1998, 46 (4): 409-418. 10.1007/PL00006320.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank three anonymous reviewers, Ruiqiang Li, Jia Ye, Jing Wang, and Ang Li for their constructive comments on this manuscript. We are also grateful to Dr Jun Wang and Gane Ka-Shu Wong for many thoughtful suggestions. This work was supported by Ministry of Science and Technology of China (2001AA231061) and National Natural Science Foundation of China (30270748) awarded to JY.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jun Yu.

Additional information

Authors' contributions

ZZ designed and programmed this new method, and drafted the manuscript. JL carried out computer simulations and generated sequence datasets. JY supervised the research and revised the manuscript. All authors read and approved the final manuscript.

Zhang Zhang, Jun Li contributed equally to this work.

Electronic supplementary material

12862_2006_229_MOESM1_ESM.pdf

Additional File 1: Differences between the HKY and Tamura-Nei models and the detailed derivations of κR and κY. This file contains two sections: section I shows the differences between the HKY and Tamura-Nei models and section II details the procedures for deducing κR and κY. (PDF 150 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Zhang, Z., Li, J. & Yu, J. Computing Ka and Ks with a consideration of unequal transitional substitutions. BMC Evol Biol 6, 44 (2006). https://doi.org/10.1186/1471-2148-6-44

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2148-6-44

Keywords