- Methodology article
- Open Access

# Computing Ka and Ks with a consideration of unequal transitional substitutions

- Zhang Zhang†
^{1, 2, 3}, - Jun Li†
^{2}and - Jun Yu
^{1, 2, 4}Email author

**6**:44

https://doi.org/10.1186/1471-2148-6-44

© Zhang et al; licensee BioMed Central Ltd. 2006

**Received:**02 March 2006**Accepted:**02 June 2006**Published:**02 June 2006

## Abstract

### Background

Approximate methods for estimating nonsynonymous and synonymous substitution rates (Ka and Ks) among protein-coding sequences have adopted different mutation (substitution) models. In the past two decades, several methods have been proposed but they have not considered unequal transitional substitutions (between the two purines, A and G, or the two pyrimidines, T and C) that become apparent when sequences data to be compared are vast and significantly diverged.

### Results

We propose a new method (MYN), a modified version of the Yang-Nielsen algorithm (YN), for evolutionary analysis of protein-coding sequences in general. MYN adopts the Tamura-Nei Model that considers the difference among rates of transitional and transversional substitutions as well as factors in codon frequency bias. We evaluate the performance of MYN by comparing to other methods, especially to YN, and to show that MYN has minimal deviations when parameters vary within normal ranges defined by empirical data.

### Conclusion

Our comparative results deriving from consistency analysis, computer simulations and authentic datasets, indicate that ignoring unequal transitional rates may lead to serious biases and that MYN performs well in most of the tested cases. These results also suggest that acquisitions of reliable synonymous and nonsynonymous substitution rates primarily depend on less biased estimates of transition/transversion rate ratio.

## Keywords

- Codon Position
- Transition Probability Matrix
- Codon Frequency
- Transitional Rate
- Canonical Code

## Background

For appraising evolutionary significance of variable protein-coding sequences among diverged species in a quantitative fashion, one of the powerful tools is to compute nonsynonymous and synonymous substitution rates, termed as Ka and Ks, respectively [1–3]. Since Ka and Ks represent the numbers of substitutions per nonsynonymous and synonymous site, respectively, these parameters (or often their ratio ω = Ka/Ks) are used to partition the targeted sequences into three basic scenarios: negative (purifying) selection when Ka < Ks (ω < 1), positive (adaptive) selection when Ka > Ks (ω > 1), and neutral mutation when Ka = Ks (ω = 1).

_{d}) and nonsynonymous (N

_{d}) substitutions, and correcting for multiple substitutions. Over the past two decades, several methods have been developed [4–13], which are based on different mutation (substitution) models with subtle yet significant differences [14–16]. Among them, Yang and Nielsen made a valuable attempt to consider differences among transitional and transversional substitutions as well as codon frequency bias [4], and their method (denoted as YN in this report), based on the more realistic HKY Model [15] and implemented in PAML (Phylogenetic Analysis by Maximum Likelihood; [17]), has become increasingly popular in the field of molecular evolution studies. However, it does not exclude the possibility that methods based on simpler models have some favorable properties [18]. There are always tradeoffs between incorporating more features into models and avoiding over-parameterization for more accurate captures of evolutionary information [19–21]. In this report, we propose an improved method, a m odified YN algorithm or MYN, based on the Tamura-Nei Model, in which transitional changes (between A and G or T and C) are not assumed to occur with an equal frequency [22]. We also review the basics involved in estimating Ka and Ks (Table 1), explain differences between the two methods, and provide our comparative results as an in-depth evaluation for the new method.

Symbols used in estimating Ka and Ks

Symbol | Definition |
---|---|

S | Number of synonymous sites |

N | Number of nonsynonymous sites |

S | Number of synonymous substitutions |

N | Number of nonsynonymous substitutions |

Ks | Synonymous substitution rate |

Ka | Nonsynonymous substitution rate |

ω | Estimator of selective strength, ω = Ka/Ks |

| Divergence time between two sequences, the expected number of nucleotide substitutions per codon, |

α | Transitional rate between purines |

α | Transitional rate between pyrimidines |

α | Transitional rate |

β | Transversional rate |

κ | Ratio of transitional rate between purines to transversional rate, κ |

κ | Ratio of transitional rate between pyrimidines to transversional rate, κ |

κ | Ratio of transitional rate/transversional rate, κ = α/β |

| Frequency of nucleotide N, N ∈ {T, C, A, G} |

π | Frequency of codon |

## Results

To compare the performance of YN and MYN, we need to generate simulated and empirical datasets with careful considerations on possible features in evolution of diverged protein-coding sequences, such as biases in transitional rates and codon frequencies. We simulated hypothetical common ancestral sequences according to codon frequencies that were derived from three basic datasets: (1) equal codon frequencies (each sense codon frequency for canonical genetic code is 1/61, and other codes can be accommodated by making simple modifications [4]), (2) human codon frequencies (based on 39,420 human protein-coding genes from ENSEMBL database, Release 35 [23]), and (3) rice codon frequencies (deduced from 19,079 rice protein-coding genes [24]). We used the formula 100% × [(estimated value) - (expected value)]/(expected value) to calculate percentage errors for assessing relative biases between estimated and expected values.

### Consistency analysis

*t*= 0.6 for the initial analysis and the effect of

*t*is examined later. Since genuine values for κ often range from 1.5 to 5, we take 3.75 as a representative one. Considering that MYN differentiates κ

_{Y}from κ

_{R}, we always fix one of them to 3.75 and allow the other to vary from 1 to 10. We plotted percentage errors for ω between data generated with YN and MYN against κ

_{R}(fixing κ

_{Y}= 3.75) for different expected values, using the three codon frequencies from our test datasets (Figure 1 A to I). Similar results are readily obtained for fixed κ

_{R}and variable κ

_{Y}(data not shown).

Different codon frequencies have minor influence on the performance of YN and MYN. Ignoring the difference between κ_{R} and κ_{Y}, YN gives estimates close to the expected values only if κ_{R} ≈ κ_{Y}. For instance, when κ_{R} = 4, the percentage errors for ω calculated with YN under human codon frequencies are 1.98%, 2.51%, and 0.78% when ω = 0.3, 1, and 3, respectively (Figure 1D to 1F). When κ_{R} ≠ κ_{Y}, YN gives rise to obviously biased ω; it tend to underestimate ω when κ_{R} < κ_{Y} and to overestimate ω when κ_{R} > κ_{Y}. The percentage errors for ω = 0.3, 1, and 3, when human codon frequencies are used, are -4.16%, -7.09% and -11.03% when κ_{R} = 1, and are 16.82%, 11.24%, and 5.87% when κ_{R} = 10, respectively. Compared to YN, MYN appears to produce lower percentage errors in ω estimations in most cases. When κ_{R} ≈ κ_{Y}, MYN performs in a similar way as YN (it becomes equivalent to YN when κ_{R} = κ_{Y}).

_{R}(assuming κ

_{Y}= 3.75; Figure 2A to 2C) based on human codon frequencies, showing similar results when taking equal or rice codon frequencies (data not shown). YN and MYN give similar estimates of Ks when κ

_{R}≈ κ

_{Y}. When κ

_{R}= 4, the percentage errors of the estimated Ks generated by YN and MYN are -1.83% and -2.11% for ω = 0.3, -3.77% and -3.78% for ω = 1, -2.71% and -2.56% for ω = 3, respectively. YN tends to overestimate Ks when κ

_{R}< κ

_{Y}and to underestimate Ks when κ

_{R}> κ

_{Y}and the bias becomes serious with increasing κ

_{R}. MYN sometimes gives rise to larger biases than YN when κ

_{R}< κ

_{Y}, but it overall performs better for most of the parameter combinations tested, especially when κ

_{R}> κ

_{Y}.

### Effects of κ_{R} and κ_{Y}

_{R}and κ

_{Y}with a set of simulated sequences, three ω values (0.3, 1 and 3), and two

*t*values (0.1 and 1). We considered the case of κ

_{R}≥ κ

_{Y}and four parameter sets: (1) κ

_{R}= 3 and κ

_{Y}= 1.5, (2) κ

_{R}= 5 and κ

_{Y}= 1.5, (3) κ

_{R}= 10 and κ

_{Y}= 1, (4) κ

_{R}= κ

_{Y}= 3.75. To avoid stochastic errors, we generated 1,000 pairs of sequences with 400 codons each for all the tests. We only described the results when human codon frequencies were used since similar results were obtained for equal or rice codon frequencies. The average estimates of Ka, Ks, and ω were computed with YN and MYN as well as expected values of Ka and Ks (Table 2).

Average estimates of Ka, Ks, and ω with YN and MYN

Parameters | Expected Values | YN | MYN | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|

ω |
| κ | κ | Ka | Ks | Ka | Ks | ω | Ka | Ks | ω |

0.3 | 0.1 | 3 | 1.5 | 0.021 | 0.069 | 0.021 | 0.065 | 0.353 | 0.021 | 0.067 | 0.340 |

5 | 1.5 | 0.021 | 0.069 | 0.021 | 0.062 | 0.361 | 0.021 | 0.066 | 0.332 | ||

10 | 1 | 0.020 | 0.068 | 0.022 | 0.058 | 0.395 | 0.021 | 0.066 | 0.334 | ||

3.75 | 3.75 | 0.020 | 0.066 | 0.020 | 0.066 | 0.328 | 0.020 | 0.066 | 0.331 | ||

1 | 3 | 1.5 | 0.207 | 0.692 | 0.210 | 0.653 | 0.329 | 0.208 | 0.720 | 0.298 | |

5 | 1.5 | 0.206 | 0.686 | 0.206 | 0.569 | 0.369 | 0.201 | 0.672 | 0.311 | ||

10 | 1 | 0.205 | 0.682 | 0.197 | 0.419 | 0.476 | 0.188 | 0.529 | 0.366 | ||

3.75 | 3.75 | 0.199 | 0.662 | 0.198 | 0.662 | 0.305 | 0.198 | 0.676 | 0.301 | ||

1 | 0.1 | 3 | 1.5 | 0.033 | 0.033 | 0.034 | 0.032 | 1.216 | 0.034 | 0.033 | 1.163 |

5 | 1.5 | 0.033 | 0.033 | 0.034 | 0.030 | 1.294 | 0.034 | 0.032 | 1.187 | ||

10 | 1 | 0.033 | 0.033 | 0.034 | 0.030 | 1.293 | 0.033 | 0.033 | 1.102 | ||

3.75 | 3.75 | 0.033 | 0.033 | 0.034 | 0.033 | 1.144 | 0.034 | 0.033 | 1.150 | ||

1 | 3 | 1.5 | 0.333 | 0.333 | 0.330 | 0.305 | 1.103 | 0.325 | 0.322 | 1.034 | |

5 | 1.5 | 0.333 | 0.333 | 0.325 | 0.283 | 1.168 | 0.317 | 0.310 | 1.044 | ||

10 | 1 | 0.333 | 0.333 | 0.300 | 0.242 | 1.267 | 0.287 | 0.279 | 1.051 | ||

3.75 | 3.75 | 0.333 | 0.333 | 0.326 | 0.318 | 1.043 | 0.327 | 0.318 | 1.047 | ||

3 | 0.1 | 3 | 1.5 | 0.040 | 0.013 | 0.041 | 0.013 | 3.637 | 0.040 | 0.014 | 3.511 |

5 | 1.5 | 0.041 | 0.014 | 0.041 | 0.013 | 3.738 | 0.040 | 0.014 | 3.453 | ||

10 | 1 | 0.041 | 0.014 | 0.041 | 0.014 | 3.077 | 0.039 | 0.016 | 2.783 | ||

3.75 | 3.75 | 0.041 | 0.014 | 0.040 | 0.016 | 2.846 | 0.041 | 0.016 | 2.869 | ||

1 | 3 | 1.5 | 0.403 | 0.134 | 0.396 | 0.129 | 3.173 | 0.391 | 0.135 | 2.994 | |

5 | 1.5 | 0.405 | 0.135 | 0.389 | 0.122 | 3.304 | 0.379 | 0.132 | 2.986 | ||

10 | 1 | 0.406 | 0.135 | 0.354 | 0.113 | 3.216 | 0.340 | 0.128 | 2.734 | ||

3.75 | 3.75 | 0.413 | 0.138 | 0.400 | 0.136 | 3.015 | 0.402 | 0.136 | 3.026 |

Without considering the difference between κ_{R} and κ_{Y}, YN produces minor bias when κ_{R} = κ_{Y}, and underestimates Ks and overestimates ω when κ_{R} > κ_{Y}. MYN is less biased compared with YN for most parameter combinations. The results agree with the infinite data test for consistency. For example, when κ_{R} > κ_{Y}, YN gives positive values for ω and negative values for Ks in percentage errors for the infinite data. As a result, it overestimates ω and underestimates Ks for sequences with normal length such as 400 codons albeit far from dramatic.

### Effect of *t*

*t*vary from 0.1 to 1 to evaluate its effect. We again used the human codon frequencies for simulations (1,000 pairs of sequences with 400 codons for each case), and tested three different parameter combinations: (1) ω = 0.3, κ

_{R}= 10 and κ

_{Y}= 1; (2) ω = 1, κ

_{R}= 10 and κ

_{Y}= 1; (3) ω = 3, κ

_{Y}= 10 and κ

_{R}= 1. We plotted average estimates of ω with YN and MYN against

*t*for the parameter combinations (Figure 3A to 3C).

Both YN and MYN have a nearly parallel overall trend when *t* varies from 0.1 to 1 but MYN deviates less from the expected values. They both tend to overestimate ω for purifying selection and to underestimate ω for positive selection, whereas *t* has little influence on ω for neutral mutation. YN overestimates ω when κ_{R} > κ_{Y} and underestimates ω when κ_{R} < κ_{Y}, which is consistent with those found in the infinite data test. MYN tends to overestimate ω for the expected ω = 0.3 and this overestimation becomes severe with increasing *t*, but it is relatively subtle when compared to YN. When ω = 1 and 3, MYN gives closer ω estimates than YN over most of the parameter combinations.

### Effect of sequence length

_{R}= 10, κ

_{Y}= 1, and

*t*= 1; (2) ω = 3, κ

_{R}= 1, κ

_{Y}= 10, and

*t*= 0.1. We calculated the average ω estimates when numbers of codons vary from 100 to 1,000 for the simulated sequences (Table 3). YN and MYN both give larger biases for short sequences (< 300 codons). Despite the fact that YN and MYN do not perform satisfactorily when target sequences are rather short, MYN is less biased than YN for most of the parameter settings.

Average estimates of ω with YN and MYN

Number of codons | ω = 0.3 | ω = 3 | ||
---|---|---|---|---|

YN | MYN | YN | MYN | |

100 | 0.504 | 0.408 | 2.092 | 2.014 |

200 | 0.477 | 0.369 | 2.340 | 2.394 |

300 | 0.473 | 0.368 | 2.522 | 2.627 |

400 | 0.473 | 0.364 | 2.583 | 2.736 |

500 | 0.470 | 0.363 | 2.568 | 2.824 |

600 | 0.469 | 0.360 | 2.661 | 2.929 |

700 | 0.472 | 0.363 | 2.653 | 2.943 |

800 | 0.469 | 0.361 | 2.684 | 3.003 |

900 | 0.466 | 0.358 | 2.695 | 3.034 |

1000 | 0.473 | 0.361 | 2.671 | 3.006 |

### Testing real data

_{R}- κ

_{Y}(Figure 4), emphasizing different transitional substitutions with unequal frequencies. For instance, the cumulative percentages for κ

_{R}- κ

_{Y}> 1 for human-mouse, human-dog, and mouse-rat orthologs are 30.6%, 33.2%, and 39.7%, respectively, and those for κ

_{R}- κ

_{Y}< -1 are all approximately 15%. The rest, for | κ

_{R}- κ

_{Y}| ≤ 1, are 53.6%, 51.5%, and 45.4% for the three ortholog groups, respectively.

_{R}- κ

_{Y}> 1, κ

_{R}- κ

_{Y}< -1, and | κ

_{R}- κ

_{Y}| ≤ 1 (Table 4). Other than YN, we used another approximate method developed by Li (1993) and by Pamilo and Bianchi (1993) independently (denoted as LPB), and a maximum likelihood method proposed by Goldman and Yang (1994) (denoted as GY). Despite the fact that all four methods yield similar Ka estimates, they do show some differences in other parameter settings. As a whole, LPB tends to overestimate ω and to underestimate Ks when compared to MYN. GY performs similarly as YN does since both consider transition/transversion rate bias and nucleotide (codon) frequency bias. For instance, if κ

_{R}- κ

_{Y}< -1, both methods, compared to MYN, underestimate S% and ω, and overestimate Ks as we have demonstrated in the simulation studies. In the case of | κ

_{R}- κ

_{Y}| ≤ 1, YN and MYN work in a similar way. When confined to κ

_{R}- κ

_{Y}> 1, GY and YN both overestimate S% and ω, and underestimate Ks, compared to MYN. Taking the Ks estimates as an example, they are 0.527 and 0.597 for human-mouse orthologs, 0.329 and 0.357 for human-dog orthologs, and 0.186 and 0.201 for mouse-rat orthologs, calculated with YN and MYN, respectively. Similarly, the estimated ω with YN and MYN are 0.121 and 0.105 for human-mouse orthologs, 0.158 and 0.143 for human-dog orthologs, and 0.157 and 0.142 for mouse-rat orthologs, respectively.

Proportions of synonymous sites (S%) and estimates of Ka, Ks and ω

Method | κ | κ | |κ | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|

S% | Ka | Ks | ω | S% | Ka | Ks | ω | S% | Ka | Ks | ω | |

human-mouse orthologs | ||||||||||||

LPB | - | 0.069 | 0.463 | 0.148 | - | 0.071 | 0.449 | 0.159 | - | 0.105 | 0.500 | 0.209 |

GY | 27.2% | 0.065 | 0.518 | 0.125 | 27.1% | 0.068 | 0.503 | 0.135 | 26.9% | 0.101 | 0.561 | 0.180 |

YN | 27.4% | 0.064 | 0.527 | 0.121 | 27.2% | 0.067 | 0.505 | 0.133 | 26.6% | 0.099 | 0.588 | 0.169 |

MYN | 26.1% | 0.063 | 0.597 | 0.105 | 28.5% | 0.068 | 0.474 | 0.144 | 26.5% | 0.099 | 0.591 | 0.168 |

human-dog orthologs | ||||||||||||

LPB | - | 0.055 | 0.309 | 0.176 | - | 0.057 | 0.296 | 0.192 | - | 0.081 | 0.348 | 0.233 |

GY | 27.5% | 0.052 | 0.332 | 0.157 | 27.5% | 0.055 | 0.318 | 0.172 | 26.5% | 0.078 | 0.381 | 0.205 |

YN | 27.8% | 0.052 | 0.329 | 0.158 | 27.8% | 0.055 | 0.310 | 0.176 | 26.4% | 0.077 | 0.387 | 0.200 |

MYN | 26.5% | 0.051 | 0.357 | 0.143 | 29.1% | 0.056 | 0.294 | 0.189 | 26.3% | 0.077 | 0.389 | 0.199 |

mouse-rat orthologs | ||||||||||||

LPB | - | 0.030 | 0.176 | 0.173 | - | 0.030 | 0.170 | 0.179 | - | 0.048 | 0.196 | 0.245 |

GY | 28.0% | 0.030 | 0.189 | 0.157 | 27.7% | 0.029 | 0.183 | 0.160 | 26.9% | 0.047 | 0.214 | 0.220 |

YN | 28.2% | 0.029 | 0.186 | 0.157 | 27.9% | 0.029 | 0.180 | 0.162 | 26.6% | 0.046 | 0.216 | 0.215 |

MYN | 26.6% | 0.029 | 0.201 | 0.142 | 29.1% | 0.030 | 0.171 | 0.173 | 26.5% | 0.046 | 0.216 | 0.214 |

## Discussion

MYN, proposed as a modified YN in this paper, allows for unequal transitional rates between purines and between pyrimidines as well as transversional rate and nucleotide (codon) frequencies. Our comparative analyses indicate that ignoring unequal transitional rates often results in closer estimates only when κ_{R} ≈ κ_{Y} but rather biased estimates when κ_{R} > κ_{Y} or κ_{R} < κ_{Y}, and that MYN is more robust in simulations and real datasets by comparison with other methods, especially with YN. Therefore, it is important to take account of unequal transitional rates for accurately capturing evolutionary information (see one of Ka and Ks applications; [27]) when unequal transitional rates among compared sequences exist. In addition, unequal transitional rates can also be implemented in a maximum likelihood framework [10] that allows model evaluation for choosing a better suit to a dataset and therefore allows users to obtain more reliable estimates.

### How does κ lead to biased estimates of Ka and Ks?

All approximate methods for Ka and Ks estimations may one way or another give rise to biased results for at least some parameter combinations. We examined κ, which is used not only in estimating S and N, but also in generating a transition probability matrix for estimating S_{d} and N_{d}. Since the sum of S_{d} and N_{d} between the two compared sequences is always smaller than the sum of possible sites (S + N or the effective length of compared sequences), the influence of κ on S and N is significantly stronger than that on S_{d} and N_{d}. Hence, let us focus on the effect of κ on S and N. YN uses κ and codon frequencies to estimate S and N. Since codon frequencies are constant, estimated from targeted sequences, and transitions between two codons are more likely to be synonymous especially at the third codon positions, S thus (as a function of κ and codon frequencies) is positively correlated to κ. Furthermore, the values of Ks depend on S (Ks ≈ S_{d} /S, assuming no correction for multiple substitutions). Therefore, an overestimated κ could give rise to overestimation of S and underestimation of Ks, which results in overestimation of ω. Likewise, underestimation of κ leads to overestimation of Ks and underestimation of ω. Therefore, κ correlates negatively with Ks and positively with ω and such correlations are also applicable to κ_{R} and κ_{Y}.

Let us examine how κ affects Ka and Ks in our simulations. When κ_{R} < κ_{Y}, YN tends to underestimate ω and to overestimate Ks, which results from the underestimation of κ by assuming κ_{Y} equal to κ_{R}. The trend sometimes is less obvious, because about 4% loss of sites are due to mutations leading to stop codons, resulting in slightly underestimated Ka and Ks [4, 28]. Similarly, when κ_{R} > κ_{Y}, YN overestimates κ, resulting in overestimation of ω and underestimation of Ks. The bias of ω becomes more pronounced as the bias of κ varies to extremes, when κ_{R} >> κ_{Y} or κ_{R} << κ_{Y}. In addition, κ_{R} and κ_{Y} can also affect the estimates from MYN. Since transitions are more likely to occur than transversions, a decrease in κ_{R} can be related to more drastic reduction of transversions than transitions between purines, resulting in overestimated κ_{R}; this overestimation becomes severer for purifying selection due to a higher occurrence of synonymous substitutions than nonsynonymous ones, and synonymous substitutions are also more likely to be transitional. As a result, ω can be overestimated (Figure 1) and Ks can be underestimated (Figure 2), especially when κ_{R} values are smaller. Therefore, to acquire more reliable estimates on Ka and Ks, it is essential to have a less biased estimate of κ, or both κ_{R} and κ_{Y}.

### Influences of *t* and sequence length

Divergence time (*t*) and sequence length both influence the estimation of Ka and Ks. Since κ in YN, κ_{R} and κ_{Y} in MYN are all deduced from fourfold-degenerate sites at the third codon positions and non-degenerate sites (often the first and the second codon positions), larger *t* leads to multiple substitutions between the two compared codons, resulting in under-counts of substitution events (a reduction in effective sampling size) for estimating κ, κ_{R} and κ_{Y}. Therefore, extreme *t* tends to decrease the effective sampling size and short sequences tend to exaggerate this effect, resulting in biased estimates of κ, κ_{R} and κ_{Y}. We indeed found that both YN and MYN tend to overestimate ω for purifying selection and underestimate ω for positive selection and these biases become severe as *t* increases (Figure 3). These biased estimations are in fact indirect results of biased estimates of κ, κ_{R} and κ_{Y}. The fact is that synonymous substitutions are more likely resulted from transitions than nonsynonymous ones are, such as the cases of two-fold degenerate codons, so the bias between κ_{R} and κ_{Y} does exist. In addition, since for negative selection, synonymous substitutions have higher possibilities to occur than nonsynonymous ones, transitions are more likely to occur than transversions. Therefore, with a decreasing effective sampling size caused by the increasing *t*, YN tends to underestimate transversional rate and thus to overestimate κ. In the case of positive selection, YN underestimates transitional rate and κ in a similar way. As to neutral mutation, YN tends to simultaneously underestimate both transitional and transversional rates so that the varying *t* has no apparent effect. Since MYN adopts a similar approach as YN estimates κ, it shows a similar trend. Short sequences have similar effect as smaller sampling size, and therefore, both YN and MYN may give rise to biased estimates κ, κ_{R} and κ_{Y}, albeit MYN's better performance in less extreme cases.

## Conclusion

We compared MYN with other methods, especially with YN, by examining infinitely long sequences, performing computer simulations and analyzing real datasets, and found that MYN has minimal deviation even when parameters vary within normal ranges defined by empirical data. In addition, these results indicate that biased estimates of Ka and Ks primarily stem from biased estimates of κ, or both κ_{R} and κ_{Y}, which can be influenced by *t* and sequence length.

## Methods

Methods for estimating Ka and Ks consider sequence variations of both DNA and protein, which are related through the genetic code. Since we are engaging in a generally purposed discussion, the genetic code is always referred to the canonical code. As a DNA-centric consideration, nucleotides substitutions only have two types, either within purines and pyrimidines as transitions or between them as transversions. As a protein-centric consideration, each nucleotide triplet (codon) is defined as they vary according to nucleotide changes, except stop codons (TAG, TAA, and TGA). For protein-coding genes, nucleotide substitutions are classified as nonsynonymous and synonymous (silent), referring to changes that do or do not provoke amino acid variations. Although there are several mutation (substitution) models that take these sequence variation features into account, in this report we limit our discussion only to the HKY and the Tamura-Nei Models (see Table S1 in the additional file 1 for details).

### Mutation model

YN adopts the HKY Model that considers transitional rate, transversional rate, and unequal nucleotide (codon) frequencies. It uses an iterative approach to estimate Ka and Ks. Before iteration, YN computes nucleotide frequencies (regarding to the three codon positions), κ, S and N from compared sequences. Codon frequencies are calculated by multiplying each nucleotide frequencies. κ is estimated from fourfold degenerate sites at the third codon positions and non-degenerate sites. S and N are calculated by using κ and codon frequencies. YN then chooses initial values for *t* and ω as starting point for iteration. It generates a transition probability matrix that represents substitution probabilities from one codon to another by using ω, *t*, κ, and codon frequencies. This transition probability matrix is then used to deduce S_{d} and N_{d}. Hence, new estimates of ω and *t* can be obtained. YN repeats the calculation for another transition probability matrix, until the algorithm converges.

Compared to the HKY Model, the Tamura-Nei Model simply considers more parameters. It differentiates α_{R} from α_{Y} according to different transitional substitutions. In fact, if the transitional rates between purines and between pyrimidines are set equal (α_{R} = α_{Y}), the model becomes the HKY Model. Since the Tamura-Nei Model distinguishes α_{R} and α_{Y}, we correspondingly use κ_{R} and κ_{Y} to denote the ratios of transitional rates between purines and between pyrimidines over the transversional rate, respectively. MYN also needs a transition probability matrix similar to what YN has. We give the substitution rate *q*_{
ij
}from any sense codon *i* to *j* (*i* ≠ *j*) to generate a transition probability matrix as follows:

${q}_{ij}=\{\begin{array}{l}0,\text{if}i\text{and}j\text{differbymorethanonediffernce}\hfill \\ {\pi}_{j},\text{if}i\text{and}j\text{differbyasynonymoustransversion}\hfill \\ {\kappa}_{\text{R}}{\pi}_{j},\text{if}i\text{and}j\text{differbyasynonymoustransitionbetweenpurines}\hfill \\ {\kappa}_{\text{Y}}{\pi}_{j}\text{,if}i\text{and}j\text{differbyasynonymoustransitionbetweenpyrimidines}\hfill \\ \omega {\pi}_{j},\text{if}i\text{and}j\text{differbyanonsynonymoustransversion}\hfill \\ \begin{array}{l}\omega {\kappa}_{\text{R}}{\pi}_{j},\text{if}i\text{and}j\text{differbyanonsynonymoustransitionbetweenpurines}\\ \omega {\kappa}_{\text{Y}}{\pi}_{j},\text{if}i\text{and}j\text{differbyanonsynonymoustransitionbetweenpyrimidines}\end{array}\hfill \end{array}\phantom{\rule{0.1em}{0ex}}\left(1\right)$

### Estimating κ_{R} and κ_{Y}

Before generating the transition probability matrix, we need to estimate κ_{R} and κ_{Y}. In a similar way to YN's estimation of κ, we estimate four nucleotide frequencies (*g*_{T}, *g*_{C}, *g*_{A}, *g*_{G}), proportions of transitional differences between purines (*T*_{R}) and between pyrimidines (*T*_{Y}), and the proportion of transversional differences (*V*) from compared sequences. We calculate

$\begin{array}{ll}a=\mathrm{log}\phantom{\rule{0.1em}{0ex}}(1-\frac{{g}_{R}}{2{g}_{A}{g}_{G}}{T}_{\text{R}}-\frac{1}{2{g}_{R}}V)\hfill & b=\mathrm{log}\phantom{\rule{0.1em}{0ex}}(1-\frac{{g}_{Y}}{2{g}_{T}{g}_{C}}{T}_{\text{Y}}-\frac{1}{2{g}_{Y}}V)\hfill \\ c=\mathrm{log}\phantom{\rule{0.1em}{0ex}}(1-\frac{1}{2{g}_{R}{g}_{Y}}V)\hfill \end{array}\phantom{\rule{0.1em}{0ex}}\left(2\right)$

where *g*_{R} = *g*_{A} + *g*_{G} and *g*_{Y} = *g*_{T} + *g*_{C}. Then we use equation 3 to estimate κ_{R} and κ_{Y}.

$\begin{array}{cc}{\kappa}_{\text{R}}=\frac{a-{g}_{Y}\times c}{{g}_{R}\times c}& {\kappa}_{\text{Y}}-\frac{b-{g}_{R}\times c}{{g}_{Y}\times c}\end{array}\phantom{\rule{0.1em}{0ex}}\left(3\right)$

$d=-\frac{2{g}_{A}{g}_{G}}{{g}_{R}}a-\frac{2{g}_{T}{g}_{C}}{{g}_{Y}}b-2({g}_{R}{g}_{Y}-\frac{{g}_{A}{g}_{G}{g}_{Y}}{{g}_{R}}-\frac{{g}_{T}{g}_{C}{g}_{R}}{{g}_{Y}})c\phantom{\rule{0.1em}{0ex}}\left(4\right)$

The detailed procedures for deducing κ_{R} and κ_{Y} were summarized in the additional file 1. We also made other modifications accordingly, such as using κ_{R} and κ_{Y} to estimate S and N, generating relevant transition probability matrix (equation 1), considering different transitional evolution pathways to count S_{d} and N_{d}, and correcting for multiple substitutions when estimating Ka and Ks (equation 4; [22]).

## Notes

## Declarations

### Acknowledgements

We thank three anonymous reviewers, Ruiqiang Li, Jia Ye, Jing Wang, and Ang Li for their constructive comments on this manuscript. We are also grateful to Dr Jun Wang and Gane Ka-Shu Wong for many thoughtful suggestions. This work was supported by Ministry of Science and Technology of China (2001AA231061) and National Natural Science Foundation of China (30270748) awarded to JY.

## Authors’ Affiliations

## References

- Kimura M: The neutral theory of molecular evolution. 1983, Cambridge, England , Cambridge University PressView ArticleGoogle Scholar
- Gillespie JH: The causes of molecular evolution. 1991, Oxford, England , Oxford University PressGoogle Scholar
- Li WH: Molecular evolution. 1997, Sunderland, Mass. , Sinauer AssociatesGoogle Scholar
- Yang Z, Nielsen R: Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models. Mol Biol Evol. 2000, 17 (1): 32-43.View ArticlePubMedGoogle Scholar
- Nei M, Gojobori T: Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986, 3 (5): 418-426.PubMedGoogle Scholar
- Ina Y: New methods for estimating the numbers of synonymous and nonsynonymous substitutions. J Mol Evol. 1995, 40 (2): 190-226. 10.1007/BF00167113.View ArticlePubMedGoogle Scholar
- Li WH: Unbiased estimation of the Rates of synonymous and nonsynonymous substitution. J Mol Evol. 1993, 36: 96-99. 10.1007/BF02407308.View ArticlePubMedGoogle Scholar
- Li WH, Wu CI, Luo CC: A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol. 1985, 2 (2): 150-174.PubMedGoogle Scholar
- Pamilo P, Bianchi NO: Evolution of the Zfx and Zfy genes: rates and interdependence between the genes. Mol Biol Evol. 1993, 10 (2): 271-281.PubMedGoogle Scholar
- Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol. 1994, 11 (5): 725-736.PubMedGoogle Scholar
- Comeron JM: A method for estimating the numbers of synonymous and nonsynonymous substitutions per site. J Mol Evol. 1995, 41 (6): 1152-1159. 10.1007/BF00173196.View ArticlePubMedGoogle Scholar
- Tzeng YH, Pan R, Li WH: Comparison of three methods for estimating rates of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 2004, 21 (12): 2290-2298. 10.1093/molbev/msh242.View ArticlePubMedGoogle Scholar
- Muse SV, Gaut BS: A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994, 11 (5): 715-724.PubMedGoogle Scholar
- Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16 (2): 111-120. 10.1007/BF01731581.View ArticlePubMedGoogle Scholar
- Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22 (2): 160-174. 10.1007/BF02101694.View ArticlePubMedGoogle Scholar
- Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian Protein Metabolism. Edited by: Munro HN. 1969, New York , Academic Press, 21-123.View ArticleGoogle Scholar
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13 (5): 555-556.PubMedGoogle Scholar
- Muse SV: Estimating synonymous and nonsynonymous substitution rates. Mol Biol Evol. 1996, 13 (1): 105-114.View ArticlePubMedGoogle Scholar
- Lio P, Goldman N: Models of Molecular Evolution and Phylogeny. Genome Res. 1998, 8 (12): 1233-1244.PubMedGoogle Scholar
- Goldman N, Yang Z: Models of DNA substitution and the discrimination of evolutionary parameters: International Biometric Society, Hamilton, Ontario, Canada.1994, , 407-421.Google Scholar
- Takahashi K, Nei M: Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol Biol Evol. 2000, 17 (8): 1251-1258.View ArticlePubMedGoogle Scholar
- Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993, 10 (3): 512-526.PubMedGoogle Scholar
- Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C, Birney E: Ensembl 2005. Nucleic Acids Res. 2005, 33 (Database issue): D447-53. 10.1093/nar/gki138.PubMed CentralView ArticlePubMedGoogle Scholar
- Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, Zhang J, Zhang Y, Li R, Xu Z, Li S, Li X, Zheng H, Cong L, Lin L, Yin J, Geng J, Li G, Shi J, Liu J, Lv H, Li J, Wang J, Deng Y, Ran L, Shi X, Wang X, Wu Q, Li C, Ren X, Wang J, Wang X, Li D, Liu D, Zhang X, Ji Z, Zhao W, Sun Y, Zhang Z, Bao J, Han Y, Dong L, Ji J, Chen P, Wu S, Liu J, Xiao Y, Bu D, Tan J, Yang L, Ye C, Zhang J, Xu J, Zhou Y, Yu Y, Zhang B, Zhuang S, Wei H, Liu B, Lei M, Yu H, Li Y, Xu H, Wei S, He X, Fang L, Zhang Z, Zhang Y, Huang X, Su Z, Tong W, Li J, Tong Z, Li S, Ye J, Wang L, Fang L, Lei T, Chen C, Chen H, Xu Z, Li H, Huang H, Zhang F, Xu H, Li N, Zhao C, Li S, Dong L, Huang Y, Li L, Xi Y, Qi Q, Li W, Zhang B, Hu W, Zhang Y, Tian X, Jiao Y, Liang X, Jin J, Gao L, Zheng W, Hao B, Liu S, Wang W, Yuan L, Cao M, McDermott J, Samudrala R, Wang J, Wong GK, Yang H: The Genomes of Oryza sativa: a history of duplications. PLoS Biol. 2005, 3 (2): e38-10.1371/journal.pbio.0030038.PubMed CentralView ArticlePubMedGoogle Scholar
- Messier W, Stewart CB: Episodic adaptive evolution of primate lysozymes. Nature. 1997, 385 (6612): 151-154. 10.1038/385151a0.View ArticlePubMedGoogle Scholar
- NCBI HomoloGene. [ftp://ftp.ncbi.nih.gov/pub/HomoloGene/]
- Nekrutenko A, Makova KD, Li WH: The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res. 2002, 12 (1): 198-202. 10.1101/gr.200901.PubMed CentralView ArticlePubMedGoogle Scholar
- Yang Z, Nielsen R: Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J Mol Evol. 1998, 46 (4): 409-418. 10.1007/PL00006320.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.