Superiority of a mechanistic codon substitution model even for protein sequences in Phylogenetic analysis

Background Nucleotide and amino acid substitution tendencies are characteristic of each species, organelle, and protein family. Hence, various empirical amino acid substitution rate matrices have needed to be estimated for phylogenetic analysis: JTT, WAG, and LG for nuclear proteins, mtREV for mitochondrial proteins, cpREV10 and cpREV64 for chloroplast-encoded proteins, and FLU for influenza proteins. On the other hand, in a mechanistic codon substitution model, in which each codon substitution rate is proportional to the product of a codon mutation rate and the ratio of fixation depending on the type of amino acid replacement, mutation rates and the strength of selective constraint on amino acids can be tailored to each protein family with additional 11 parameters. As a result, in the evolutionary analysis of codon sequences it outperforms codon substitution models equivalent to empirical amino acid substitution matrices. Is it superior even for amino acid sequences, among which synonymous substitutions cannot be identified? Results Nucleotide mutations are assumed to occur independently of codon positions but multiple nucleotide changes in infinitesimal time are allowed. Selective constraints on the respective types of amino acid replacements are tailored to each gene with a linear function of a given estimate of selective constraints, which were estimated by maximizing the likelihood of an empirical amino acid or codon substitution frequency matrix, each of JTT, WAG, LG, and KHG. It is shown that the mechanistic codon substitution model with the assumption of equal codon usage yields better values of Akaike and Bayesian information criteria for all three phylogenetic trees of mitochondrial, chloroplast, and influenza-A hemagglutinin proteins than the empirical amino acid substitution models with mtREV, cpREV64, and FLU, which were designed specifically for those protein families, respectively. The variation of selective constraint across sites fits the datasets significantly better than variable codon mutation rates, confirming that substitution rate variations across sites detected by amino acid substitution models are caused primarily by the variation of selective constraint against amino acid substitutions rather than the variation of codon mutation rate. Conclusions The mechanistic codon substitution model is superior to amino acid substitution models even in the evolutionary analysis of protein sequences.


Likelihood of amino acid sequences in a codon substitution model
Given a phylogenetic tree T and a codon substitution model Θ, in which codon substitutions are assumed to occur independently at each site, the conditional probability P (A|T, Θ) that an alignment A ≡ (A 1 , A 2 , . . . , A L ) of N sequences with L sites is observed is represented as the product over sites of those of the alignments A i ≡ (A 1i , A 2i , . . . , A N i ) ′ at site i.
The likelihood of the phylogenetic tree T and the model Θ for the alignment at each site can be calculated as where P (θ α ) is the a priori probability distribution of a parameter θ α for the variation of selective constraint or mutation rate across sites [1,2]. Here, a mechanistic codon substitution model [1,2] is used as the evolutionary model Θ. Then, if substitutions are assumed to be in the equilibrium state of a time-reversible Markov process, the likelihood of a sequence alignment A i for site i will be calculated by taking any node as a root node. Let us assume here that the root node is the node v ℓ connected to a leaf node ℓ with branch b ℓ .
P (A i |T, Θ, θ α ) = µ ν δ Aℓiaν P (ν|µ, t ℓ , Θ, θ α )f µ P vℓ (A i |v ℓ = µ, T, Θ, θ α ) where µ and ν denote the type of codon, and f µ is the equilibrium frequency of µ. P (ν|µ, t ℓ , Θ, θ α ) is a substitution probability from µ to ν at the branch b ℓ whose length is equal to t ℓ . P vℓ (A i |v ℓ = µ, T, Θ, θ α ) is the likelihood of the parent subtree with the node v ℓ = µ connected to the leaf node ℓ. δ A ℓi a ν is the Kronecker delta and takes one if A ℓi = a ν otherwise zero, where a ν is the type of amino acid corresponding to codon ν. Equal codon usage is simply assumed here to estimate the equilibrium frequencies of codons from the amino acid composition in the alignment. In the maximum likelihood (ML) method for a phylogenetic tree, the tree T and parameters Θ are estimated by maximizing the likelihood.
The posterior probability of site i being at the category θ α is Then, the posterior frequencies of amino acid a in the category θ α is calculated with from which codon frequencies for the category are estimated with the assumption of equal codon usage. The posterior frequencies of amino acids for each category may be used in the next run as the equilibrium frequencies for each category.
A mechanistic codon substitution model with multiple nucleotide changes When substitutions independently occur at each site with a constant substitution rate R µν per unit time from codon µ to ν, the substitution probability matrix P (ν|µ, t, Θ, θ α ) during time t is calculated as Assuming that the detailed balance condition between states is satisfied, i.e., f µ R µν = f ν R νµ , the substitution rate R µν is represented as where f µ is the equilibrium composition; µ f µ R µν = 0. The symmetric matrix r is named an exchangeability matrix. In the case of the codon substitution matrix, the equilibrium frequencies of stop codons are set to be equal to 0, and therefore the probability flow from any to a stop codon and its inverse flow are always equal to 0. The unit of time is chosen in such a way that the total rate of R is equal to 1; Therefore, only relative values among exchangeabilities r µν are meaningful.
In the present mechanistic codon substitution model [1,2], the substitution rate R µν is represented as the product of a mutation rate M µν and the average ratio of fixation F µν , which is defined to be the average fixation probability multiplied by the chromosomal population size, for mutations from codon µ to ν under selective pressure; R µν ∝ M µν F µν for µ = ν. The M is also assumed to satisfy the detailed balance condition; f mut is the equilibrium codon composition of the rate matrix M . Under this assumption, the average fixation ratio F µν must be represented as the product of the two terms, f ν /f mut ν and e wµν , where w µν = w νµ ; F µν = (f ν /f mut ν )e w µν for µ = ν. Then, the exchangeability r µν can be represented as The arbitrary scaling constant C onst is determined by Eq. 9.
The frequency-dependent term f ν /f mut ν represents the effects of selective pressure at the DNA level as well as at the amino acid level, which change the codon frequency from the mutational equilibrium frequency f mut ν to the frequency f ν specific to a gene. The ratio of fixation F was explicitly given as a function of the fitnesses of mutants µ and ν [3,4]. It must be equal to 0 for lethal mutations and equal to 1 for neutral mutations. Here, we approximate the average quantity e wµν over mutants to be independent of codon frequencies. This quantity e wµν is essentially the same as the one called the rate of acceptance by Miyata et al. [5]. We assume that selective pressure against codon replacements appears primarily on an amino acid sequence encoded by a nucleotide sequence; in other words, w µν for codon pair (µ, ν) depends only on the encoded amino acids a µ and b ν .
A code table specific to each gene such as the standard and vertebrate mitochondrial code tables is employed. At the amino acid level, there should be no selective pressure against synonymous mutations. Thus, the w ab satisfies Selective constraints w ab for a specific protein family are approximated with a linear function of a given estimate w estimate where w estimate ab with "estimate" ∈ {EI, JTT-ML91+, WAG-ML91+, LG-ML91+, KHG-ML200} means the estimate of w ab , which is a physico-chemical estimate based on the Energy-Increment-based (EI) method [1], or a ML estimate [1] from the empirical substitution frequency matrix of JTT, WAG, LG, or KHG. The value of w ab is non-positive, assuming that on average there is negative selection on amino acid replacements; of course, w estimate ab ≤ 0 [2]. Positive selection will be taken into account if selective constraints are variable across sites. The parameter β, which is non-negative, adjusts the strength of selective constraint for a given protein family. The parameter w 0 directly controls the ratio of nonsynonymous to synonymous substitution exchangeability. The model with β = 0 is called the Equal-Constraint model, and the Equal-Constraint model with w 0 = 0 is called the No-Constraint model, which is equivalent to a nucleotide substitution model.
In the model EI, w EI ab ≡ −(∆ε c ab +∆ε v ab ), where ∆ε c ab and ∆ε v ab represent the effects of the mean increment of contact energies between residues and of residue-volume change due to an amino acid replacement, respectively; for details, see Supporting Information, Text S1, in [1]. The selective constraint matrices w estimate with "estimate" ∈ {JTT-ML91+, WAG-ML91+, LG-ML91+} are those estimated by maximizing the respective likelihoods of the 1-PAM amino acid substitution frequency matrices of JTT, WAG, and LG in the ML-91+ model [1]. Similarly, the matrix w KHG-ML200 was estimated from the 1-PAM KHG codon substitution frequency matrix in the ML-200 model [1]. These estimates of selective constraints are available as Supporting Information, Data S1, in [1]. These models are called here by the name of a selective constraint matrix with a suffix (n) meaning the number of adjustable parameters such as Equal-Constraint-n, EI-n, JTT/WAG/LG-ML91+-n, and KHG-ML200-n.
We represent the codon mutation rate matrix M in terms of nucleotide mutation rates as follows by assuming that nucleotide mutations occur independently of codon positions and multiple nucleotide changes can infinitesimally occur.
where B i is a mutation rate matrix between the four types of nucleotides at the ith codon position, δ µiνi is the Kronecker's δ, and the index µ i means the ith nucleotide in the codon µ; µ = (µ 1 , µ 2 , µ 3 ) where µ i ∈ { a, t, c, g } . Assuming that the rate matrix B i satisfies the detailed balance condition, it is represented as where f mut i,νi is the equilibrium composition of nucleotide ν i at the ith codon position, and m i,µiνi is the exchangeability between nucleotides µ i and ν i at the ith codon position. Because the B i is assumed to satisfy the detailed balance condition, the M also satisfies the detailed balance condition.
If multiple nucleotide changes were completely ignored, then Eq. 14 would be sim- , whose formulation for a codon mutation rate matrix with Eq. 15 is the same as the one proposed by Muse and Gaut [6]. Here, it should be noted that B i,µiνi in Eq. 15 is defined to be proportional to the equilibrium nucleotide composition f mut i,νi . Alternatively, one may define M µν as in the same way as Miyazawa and Jernigan [7] and others [8,9] defined it to be proportional explicitly to the composition of the base triplet, f mut ν . This alternative definition with Eqs. 9 and 10 is equivalent to Eqs. 14 and 15 with f mut ν i = 0.25 and m i,µ i ν i ⇒ 4m i,µ i ν i , and thus it is a special case in the present formulation.
The No-Constraint model, in which there is no selective pressure on amino acid replacements (w ab = 0), is a nucleotide substitution model extended to allow multiple nucleotide changes in infinitesimal time. Also, it is useful to note that the present model in the special case of M µν = constant becomes equivalent to an amino acid substitution model converted into a codon substitution model; if (m i ) µ i ν i = 4 and f mut i,ν i = 0.25, then M µν = 1 and Eq. 10 will become r µν ∝ e wµν and equivalent to Eq. 4 in [2] with r empirical ab ∝ e βw estimate ab . In the present analyses, we assume for simplicity that m i,µiνi and f mut i,νi do not depend on codon position i; that is, m i,ξη = m ξη and f mut i,ξ = f mut ξ , where ξ, η ∈ {a, t, c, g}. This approximation is reasonable because mutational tendencies may be independent of nucleotide position in a codon. } constant in Eq. 9. Also, it is noted that unlike the SDT model [10] double nucleotide changes at the first and the third positions in a codon are assumed to occur as frequently as doublet changes.
The number of parameters except equilibrium codon frequencies in the mechanistic codon substitution model is equal to 11; they are β, w 0 , m(≡ m [

Variations of mutation rate and of selective constraint across codon sites
Taking account of the variation of amino acid substitution rate across sites always significantly increases the maximum likelihood of a phylogenetic tree in the analysis of amino acid sequences [11]. The variation of amino acid substitution rate can be caused by the variation of codon mutation rate and also by the variation of selective constraint on amino acids.
The variation of codon mutation rate across codon sites is also assumed to obey a Γ distribution [11] with a shape parameter α and the mean equal to 1, which is then approximated by a discrete-gamma distribution [12,13] with m categories. Basically, categories, 0 < Γ 1 < Γ 2 < . . . < Γ m , with equal probability are employed for the variations of rate, but if Γ 2 < 0.1, two categories of Γ 1 and Γ 2 will be merged and replaced by their (weighted) mean category of the sum of their probabilities, and the largest value of Γ i , Γ m , will be divided into two categories of the half of its probability, virtually increasing the number of categories; that is, each category has unequal probability. The shape parameter α is optimized as one of ML parameters. This model is specified with a suffix dGmr where m denotes the number of categories.
The variation of selective constraint across amino acid sites is approximated to obey a discrete-gamma distribution, too; the average of selective constraints over amino acid pairs (the mean acceptance rate), a b>a e w ab /190, is assumed to vary in a discrete-gamma distribution. The rate matrix of each category is scaled in such a way that the mean rate matrix with the prior probabilities of categories satisfies Eq. 9. The shape parameter α of the discrete-gamma distribution is optimized as one of ML parameters. This model is specified with a suffix dGms where m denotes the number of categories.
In the mechanistic codon substitution model, selective constraint w i,ab for ith category in a discrete-gamma distribution is calculated to satisfy the following equations.
where Γ i is the value of the ith category with probability p(Γ i ) in the discretegamma distribution and its mean is equal to the average of e w ab over all amino acid pairs and whose shape parameter is equal to α; 0 ≤ Γ i < Γ i+1 . If Γ i < 1 and γ i exp w ab ≤ 1 for ∀a = b, γ i will be simply equal to a category in the discretegamma distribution whose mean is equal to 1. Basically, categories with equal probability are also employed for the variation of selective constraint, but if Γ 2 / Γ i < 0.1, two categories of Γ 1 and Γ 2 will be merged and replaced by their (weighted) mean category of p(Γ 1 ) + p(Γ 2 ), and the largest value of Γ i , Γ m , will be divided into two categories of p(Γ m )/2, virtually increasing the number of categories; that is, each category has unequal probability.
Datasets of protein sequences used to evaluate amino acid and codon substitution models Substitution models are evaluated by employing three datasets of protein sequences; (1) fast-evolving interspecific mitochondrial proteins, (2) closely-related chloroplastencoded proteins, and (3) fast-evolving HA proteins of Human Influenza A H1N1. These datasets are chosen, because the empirical amino acid substitution rate matrices, cpREV64 [14], mtREV [15], and FLU [16], that were designed as those specific to the respective protein sequences are available.
1 Dataset mammalian-mtProt: interspecific mammalian mitochondrial protein sequences concatenating 12 protein-coding genes from 69 mammalian species [17], whose genome sequences were obtained from the NCBI RefSeq database of organelle genomes. The alignment of each gene was made with the codon sequences by the modified version [2] of the ClustalW2 [18]. The total length of aligned genes is equal to 3618 amino acid sites, and the minimum amino acid identity between the sequences is equal to 0.66. The tree topology that was estimated as Tree-6 by [17] is used here as the most probable one. Overlapped segments between genes were removed from protein sequences. 2 Dataset cpProt-55: protein sequences concatenating 52 protein-coding genes from 55 chloroplast genomes of the major angiosperm lineages whose genome sequences are available in the NCBI RefSeq database out of the 64 genomes analyzed in [19] and that are genes owned by all 55 taxa. The alignment of each gene was made with the codon sequences by the modified version [2] of the ClustalW2 [18]. The tree topology estimated by [19] is used as the most probable one in the present analysis. The total length of aligned genes is equal to 14128 amino acid sites, and the minimum amino acid identity between the sequences is equal to 0.73. The cpREV64 [14] was estimated from the full set of 77 protein-coding genes in the 64 genomes. 3 Dataset HA Human-Flu-A-H1N1: Hemagglutinin proteins from the H1N1 type of human influenza A are downloaded from the entire influenza database at NCBI (July 2nd 2013 version). Only sequences with full length are retrieved and identical sequences are collapsed to single sequences; the number of the HA proteins was equal to 4231. These sequences were aligned by the MAFFT version 7 with the FFT-NS-2 option [20], and the tree topology assumed as the most probable one is the one inferred by the FastTree version 2 [21] with the JTT and CAT options. Also, virtually identical sequences were collapsed by removing sequences connected to the phylogenetic tree with branch length shorter than a certain threshold, 0.002. As a result, the number of the sequences was reduced to 1309. The multiple sequence alignment consists of 595 sites, of which 408 amino acid sites without gaps are used here, because sites with gaps were excluded in the estimation of FLU [16]. The FLU was estimated from the super set of influenza A proteins. ∼113000 influenza proteins consisting of ∼20 million residues.

Statistical comparison of amino acid and codon substitution models
Model selection must be pursued with considerable attention [22]. For the comparison of models one of which is a special case of the other, the likelihood ratio test (LRT) [23] can be used to test the superiority of a nesting model to nested models. Models that are not nesting or nested can be compared using Akaike information criterion (AIC) [24], Bayesian information criterion (BIC) [25], a decision-theoretical approach [26,27], and Bayes factor [28]. Here, AIC and BIC for a given tree topology of aligned codon sequences are used to compare amino acid substitution models with empirical amino acid substitution rate matrices and the mechanistic codon substitution model with the wide range of selective constraint matrices. The AIC and BIC are defined as follows [29]: BIC ≡ −2ℓ(θ) + K log L where K is the number of adjustable parameters,θ is the vector of the ML estimates of the parameters, ℓ(θ) is the maximum log-likelihood value, and L is the number of sites in an amino acid alignment. A model with a smaller value of AIC or BIC is considered to be a better model.