Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified
- Thomas M Keane^{1}Email author,
- Christopher J Creevey^{2},
- Melissa M Pentony^{3},
- Thomas J Naughton^{4} and
- James O Mclnerney^{1}
https://doi.org/10.1186/1471-2148-6-29
© Keane et al; licensee BioMed Central Ltd. 2006
Received: 19 December 2005
Accepted: 24 March 2006
Published: 24 March 2006
Abstract
Background
In recent years, model based approaches such as maximum likelihood have become the methods of choice for constructing phylogenies. A number of authors have shown the importance of using adequate substitution models in order to produce accurate phylogenies. In the past, many empirical models of amino acid substitution have been derived using a variety of different methods and protein datasets. These matrices are normally used as surrogates, rather than deriving the maximum likelihood model from the dataset being examined. With few exceptions, selection between alternative matrices has been carried out in an ad hoc manner.
Results
We start by highlighting the potential dangers of arbitrarily choosing protein models by demonstrating an empirical example where a single alignment can produce two topologically different and strongly supported phylogenies using two different arbitrarily-chosen amino acid substitution models. We demonstrate that in simple simulations, statistical methods of model selection are indeed robust and likely to be useful for protein model selection. We have investigated patterns of amino acid substitution among homologous sequences from the three Domains of life and our results show that no single amino acid matrix is optimal for any of the datasets. Perhaps most interestingly, we demonstrate that for two large datasets derived from the proteobacteria and archaea, one of the most favored models in both datasets is a model that was originally derived from retroviral Pol proteins.
Conclusion
This demonstrates that choosing protein models based on their source or method of construction may not be appropriate.
Keywords
Background
For a number of years phylogenetic construction has been considered to be a problem of statistical inference. One of the most popular methods of inferring phylogenetic relationships is maximum likelihood (ML). It has often been considered that one of the advantages of ML over parsimony based methods is that it allows for the use of different models of evolution depending on the dataset being examined. Therefore knowing the process of evolution and being able to construct realistic models of evolution is the foundation for being able to infer accurate phylogenetic relationships among species. Currently one of the major challenges in phylogenetics is to accurately model the process of nucleotide or amino acid substitution and to choose among our set of models in order to infer accurate phylogenies. Felsenstein [1] was the first to show in simulations that overly simplistic models that underestimate multiple substitutions can result in inconsistency during phylogeny estimation in certain situations (referred to as the 'Felsenstein zone'). Further simulation studies have shown that even when using ML analysis, underestimation of nucleotide substitutions (as assumed by simpler models) leads to long-branch attraction and inconsistency in the Felsenstein zone [2, 3]. These results have also been duplicated in real datasets where the use of inadequate models can lead to long-branch attraction [4, 5]. However it was also shown in simulations that violations of the model can also favour the true tree in certain situations (often referred to as the 'Farris zone' or 'inverse-Felsenstein zone') [6, 7]. It was later shown in simulations and real data that this can only happen in a very limited number of cases and in general using overly simplistic models should be avoided [3, 7, 8]. One fact that is most certainly true is that the accurate estimation of node support is strongly affected by the use of simplistic models in simulated and real datsets [9, 10]. It is clear that unless we can be totally sure that a dataset fits into one of the categories mentioned above, then the use of overly simplistic (or incorrect) substitution models can negatively bias our phylogenies.
Almost all models of amino acid replacement assume that all amino acids sites evolve independently according to the same Markov process. It is assumed that the Markov process is stationary and homogeneous, so that all rates of substitution are constant across time. Each of the protein substitution models consists of a 20 × 20 instantaneous rate matrix which includes the set of original amino acid frequencies (π_{ i }) obtained from the dataset that was used to generate the model. The (π_{ i }) values represent the equilibrium or stationary frequencies of the 20 amino acids and the matrices are often modified to include the set of observed frequencies in the dataset being examined. Models that take into account the observed amino acid frequencies are often denoted by the '+F' suffix [11]. If we assume that the substitution process is reversible then the number of free parameters is reduced to from 399 to 189. However due the computational burden imposed by optimising all 189 free parameters of the instantaneous rate matrix with large datasets and the risk of overfitting the matrix parameters when analysing small datasets has meant that most phylogeny programs rely on empirical models of protein evolution. Dayhoff et al. [12] were the first to develop a general model of amino acid substitution using the limited amount of sequence data available at the time. Since then, several additional models have been developed from other datasets and using different techniques, such as maximum parsimony and maximum likelihood [13–22].
There has been a great deal of research into various techniques for performing model selection on nucleotide data [23]. In the past, three measures have been used to select the best-fit substitution model. The hierarchical likelihood ratio test (hLRT) consists of a tree hierarchy where the best-fit model is selected by performing a number of pairwise likelihood ratio tests and navigating the tree to arrive at the final model [24]. However the hLRT is only suitable for models which can be defined as subsets of each other, therefore is generally only applied to nucleotide model selection. For example, the F81 model [25] is a subset of the HKY model [26] with the transition and transversion rates set to be equal. As the different amino acid matrices do not have any free parameters, it is not possible to define a similar tree hierarchy as with nucleotide models. The Akaike information criterion (AIC) [27] and Bayesian information criterion (BIC) [28] belong to a different class of model selection measures that compare all of the models simultaneously according to some measure of fitness. Although these measures have been used for many years in nucleotide model selection, only recently have programs such as MODELGENERATOR [29] and Prottest [30] been specifically developed to perform statistical analyses of the complete set of available amino acid substitution models.
Until now many phylogenetic analyses of multiple datasets from a fixed set of taxa have assumed a single substitution model for all sets of homologs (e.g. [31–33]). We argue that if it is assumed that a single amino acid matrix is the best-fit matrix for all genes in a dataset, then there is the possibility that the method may encounter situations like the one mentioned previously and produce suboptimal phylogenies. We have performed experiments using real datasets in order to determine if it is correct to build phylogenies from different genes using the same amino acid matrix. This issue has never been examined before and our results show a large differences in best-fit protein models within all of the datasets analysed. The results presented in this paper raise the question of whether we should be performing full protein model selection analyses prior to any amino acid phylogeny estimation.
Results
Base tree sensitivity
Base Tree Simulations. Results of simulated datasets when a random, NJ-JTT, and the true tree are used as the base tree for the model selection procedure and the sequence length is 500 characters. Each entry is the number of times out of 100 replicates the correct model was selected by each measure.
Random | NJ-JTT | True | |||||||
---|---|---|---|---|---|---|---|---|---|
Model | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC |
Blosum | 0 | 0 | 0 | 91 | 98 | 99 | 84 | 96 | 96 |
Blosum+I | 0 | 0 | 0 | 94 | 99 | 100 | 100 | 100 | 100 |
Blosum+G | 58 | 65 | 67 | 75 | 83 | 87 | 75 | 84 | 87 |
Blosum+I+G | 89 | 86 | 85 | 90 | 88 | 87 | 89 | 85 | 85 |
CPREV | 0 | 0 | 0 | 92 | 99 | 100 | 93 | 98 | 99 |
CPREV+I | 0 | 0 | 0 | 98 | 99 | 99 | 97 | 99 | 100 |
CPREV+G | 80 | 83 | 85 | 89 | 89 | 90 | 89 | 90 | 90 |
CPREV+I+G | 94 | 91 | 91 | 80 | 75 | 73 | 80 | 75 | 73 |
Dayhoff | 0 | 0 | 0 | 95 | 100 | 100 | 94 | 98 | 99 |
Dayhoff+I | 0 | 0 | 0 | 98 | 100 | 100 | 96 | 99 | 100 |
Dayhoff+G | 68 | 72 | 74 | 77 | 86 | 90 | 79 | 88 | 91 |
Dayhoff+I+G | 94 | 93 | 93 | 82 | 74 | 74 | 84 | 74 | 72 |
JTT | 0 | 0 | 0 | 94 | 99 | 100 | 97 | 99 | 99 |
JTT+I | 0 | 0 | 0 | 96 | 100 | 100 | 97 | 100 | 100 |
JTT+G | 54 | 59 | 62 | 78 | 85 | 86 | 81 | 87 | 89 |
JTT+I+G | 94 | 94 | 93 | 89 | 85 | 82 | 92 | 87 | 84 |
MtREV | 0 | 0 | 0 | 85 | 96 | 97 | 94 | 99 | 99 |
MtREV+I | 0 | 0 | 0 | 92 | 99 | 100 | 97 | 100 | 100 |
MtREV+G | 80 | 87 | 87 | 92 | 94 | 95 | 93 | 94 | 94 |
MtREV+I+G | 86 | 85 | 84 | 68 | 65 | 61 | 70 | 63 | 63 |
WAG | 0 | 0 | 0 | 88 | 97 | 99 | 95 | 99 | 100 |
WAG+I | 0 | 0 | 0 | 98 | 100 | 100 | 96 | 100 | 100 |
WAG+G | 74 | 79 | 79 | 83 | 89 | 89 | 83 | 88 | 89 |
WAG+I+G | 90 | 89 | 87 | 79 | 73 | 69 | 78 | 73 | 71 |
Full ML Comparison. A comparison of the models selected from the likelihood values obtained from a full ML tree search using all models and the likelihood values using the default NJ-JTT base tree. The column 'Identical' indicates the number of times (out of 100 alignments) both procedures selected the same model. The column titled 'Rate' indicates cases when the same amino acid matrix and a different ASRV was selected. The column titled 'Matrix' indicates cases when the a different amino acid matrix was selected.
AIC_{1} | AIC_{2} | BIC | |||||||
---|---|---|---|---|---|---|---|---|---|
Dataset | Identical | Rate | Matrix | Identical | Rate | Matrix | Identical | Rate | Matrix |
Proteobacteria | 95 | 4 | 1 | 93 | 6 | 1 | 94 | 2 | 4 |
Archaea | 99 | 1 | 0 | 96 | 2 | 2 | 95 | 2 | 3 |
Vertebrate | 91 | 7 | 2 | 94 | 5 | 1 | 97 | 1 | 2 |
Sequence length
Alignment Length Simulations. Results of the simulated datasets for alignments of 100, 500, and 1000 characters in length. Each entry is the number of times out of 100 replicates the correct model was selected by each measure (using the default NJ-JTT base tree).
100 | 500 | 1000 | |||||||
---|---|---|---|---|---|---|---|---|---|
Model | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC |
Blosum | 86 | 96 | 95 | 91 | 98 | 99 | 94 | 99 | 100 |
Blosum+I | 95 | 100 | 95 | 94 | 99 | 100 | 98 | 100 | 100 |
Blosum+G | 89 | 95 | 95 | 75 | 83 | 87 | 79 | 85 | 88 |
Blosum+I+G | 44 | 30 | 30 | 90 | 88 | 87 | 95 | 95 | 94 |
CPREV | 92 | 99 | 99 | 92 | 99 | 100 | 95 | 100 | 100 |
CPREV+I | 94 | 100 | 100 | 98 | 99 | 99 | 99 | 99 | 100 |
CPREV+G | 87 | 99 | 98 | 89 | 89 | 90 | 91 | 96 | 97 |
CPREV+I+G | 51 | 37 | 37 | 80 | 75 | 73 | 95 | 94 | 94 |
Dayhoff | 92 | 99 | 99 | 95 | 100 | 100 | 93 | 99 | 100 |
Dayhoff+I | 94 | 100 | 100 | 98 | 100 | 100 | 96 | 100 | 100 |
Dayhoff+G | 83 | 93 | 93 | 77 | 86 | 90 | 94 | 94 | 95 |
Dayhoff+I+G | 54 | 35 | 38 | 82 | 74 | 74 | 95 | 92 | 91 |
JTT | 95 | 98 | 98 | 94 | 99 | 100 | 93 | 98 | 100 |
JTT+I | 95 | 99 | 98 | 96 | 100 | 100 | 96 | 100 | 100 |
JTT+G | 87 | 94 | 94 | 78 | 85 | 86 | 91 | 91 | 93 |
JTT+I+G | 48 | 36 | 40 | 89 | 85 | 82 | 96 | 95 | 94 |
MtREV | 95 | 98 | 98 | 85 | 96 | 97 | 91 | 97 | 97 |
MtREV+I | 97 | 100 | 100 | 92 | 99 | 100 | 97 | 100 | 100 |
MtREV+G | 86 | 97 | 97 | 92 | 94 | 95 | 92 | 95 | 96 |
MtREV+I+G | 29 | 17 | 17 | 68 | 65 | 61 | 87 | 85 | 83 |
WAG | 91 | 97 | 96 | 88 | 97 | 99 | 97 | 98 | 100 |
WAG+I | 94 | 100 | 99 | 98 | 100 | 100 | 97 | 99 | 100 |
WAG+G | 85 | 95 | 93 | 83 | 89 | 89 | 86 | 95 | 95 |
WAG+I+G | 50 | 34 | 36 | 79 | 73 | 69 | 97 | 96 | 94 |
Among-site rate variation parameters
Gamma Distribution Simulations. Results of simulations when the α parameter of the gamma distribution was varied between 0.5, 1.0, and 2.0. The sequence length was kept constant at 500 characters and the proportion of invariable sites was 0.2. Each entry is the number of times out of 100 replicates that the correct model was selected.
α = 0.5 | α = 1.0 | α = 2.0 | |||||||
---|---|---|---|---|---|---|---|---|---|
Model | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC | AIC_{1} | AIC_{2} | BIC |
BLOSUM62+G | 75 | 83 | 87 | 32 | 62 | 69 | 36 | 68 | 74 |
BLOSUM62+I+G | 90 | 88 | 87 | 95 | 93 | 92 | 100 | 100 | 100 |
CPREV+G | 89 | 89 | 90 | 39 | 72 | 77 | 39 | 65 | 79 |
CPREV+I+G | 80 | 75 | 73 | 93 | 89 | 89 | 100 | 100 | 100 |
Dayhoff+G | 77 | 86 | 90 | 33 | 36 | 74 | 38 | 60 | 66 |
Dayhoff+I+G | 82 | 74 | 74 | 98 | 95 | 92 | 100 | 100 | 100 |
JTT+G | 78 | 85 | 86 | 43 | 71 | 76 | 25 | 54 | 63 |
JTT+I+G | 89 | 85 | 82 | 98 | 96 | 94 | 100 | 100 | 100 |
MtREV+G | 92 | 94 | 95 | 46 | 72 | 76 | 51 | 75 | 84 |
MtREV+I+G | 68 | 65 | 61 | 90 | 85 | 83 | 100 | 100 | 100 |
WAG+G | 83 | 89 | 89 | 35 | 70 | 76 | 32 | 70 | 79 |
WAG+I+G | 79 | 73 | 69 | 97 | 91 | 90 | 100 | 100 | 100 |
Dayhoff+I+G | 54 | 35 | 38 | 82 | 74 | 74 | 95 | 92 | 91 |
JTT | 95 | 98 | 98 | 94 | 99 | 100 | 93 | 98 | 100 |
JTT+I | 95 | 99 | 98 | 96 | 100 | 100 | 96 | 100 | 100 |
JTT+G | 87 | 94 | 94 | 78 | 85 | 86 | 91 | 91 | 93 |
JTT+I+G | 48 | 36 | 40 | 89 | 85 | 82 | 96 | 95 | 94 |
MtREV | 95 | 98 | 98 | 85 | 96 | 97 | 91 | 97 | 97 |
MtREV+I | 97 | 100 | 100 | 92 | 99 | 100 | 97 | 100 | 100 |
MtREV+G | 86 | 97 | 97 | 92 | 94 | 95 | 92 | 95 | 96 |
MtREV+I+G | 29 | 17 | 17 | 68 | 65 | 61 | 87 | 85 | 83 |
WAG | 91 | 97 | 96 | 88 | 97 | 99 | 97 | 98 | 100 |
WAG+I | 94 | 100 | 99 | 98 | 100 | 100 | 97 | 99 | 100 |
WAG+G | 85 | 95 | 93 | 83 | 89 | 89 | 86 | 95 | 95 |
WAG+I+G | 50 | 34 | 36 | 79 | 73 | 69 | 97 | 96 | 94 |
Amino acid frequency perturbation
Amino Acid Frequency Simulations. Results of the simulated datasets where the original amino acid frequencies are randomly perturbed by up to 10% from the original values and the alignment length is 500 characters. Each entry indicates the number of times out of 100 replicates the correct model was selected by each measure.
Model | AIC_{1} | AIC_{2} | BIC | Model | AIC_{1} | AIC_{2} | BIC |
---|---|---|---|---|---|---|---|
Blosum+F | 94 | 100 | 100 | JTT+F | 93 | 100 | 100 |
Blosum+I+F | 71 | 91 | 95 | JTT+I+F | 67 | 89 | 94 |
Blosum+G+F | 86 | 93 | 96 | JTT+G+F | 75 | 89 | 92 |
Blosum+I+G+F | 99 | 97 | 96 | JTT+I+G+F | 98 | 96 | 96 |
CPREV+F | 92 | 100 | 100 | MtREV+F | 93 | 99 | 99 |
CPREV+I+F | 87 | 98 | 99 | MtREV+I+F | 86 | 96 | 99 |
CPREV+G+F | 93 | 96 | 97 | MtREV+G+F | 86 | 93 | 95 |
CPREV+I+G+F | 89 | 87 | 84 | MtREV+I+G+F | 85 | 82 | 80 |
Dayhoff+F | 93 | 99 | 99 | WAG+F | 95 | 100 | 100 |
Dayhoff+I+F | 91 | 98 | 99 | WAG+I+F | 82 | 96 | 97 |
Dayhoff+G+F | 86 | 93 | 96 | WAG+G+F | 88 | 95 | 96 |
Dayhoff+I+G+F | 99 | 97 | 96 | WAG+I+G+F | 90 | 89 | 89 |
Expected model selections
Real Dataset Analysis. Results of the model selection on the specialised datasets (see the references for full descriptions of the individual datasets). Amino acid matrix expectations are based on previously published information about the sequences ([19, 54, 55] and LANL [56]).
Dataset | Source | Expected | AIC_{1} | AIC_{2} | BIC |
---|---|---|---|---|---|
mtCDNApri | Yang [54] | MtMam | MtMam+I+G | MtMam+G | MtMam+G |
mtCDNAape | Yang [54] | MtMam | MtMam+F | MtMam+F | MtMam+F |
70pep_nogap | Reyes et al. [55] | MtMam | MtMam+I+G+F | MtMam+I+G | MtMam+I+G |
BETA | Dimmic et al. [19] | RtREV | RtREV+G+F | RtREV+G | RtREV+G |
ENDO | Dimmic et al. [19] | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
GAGGAM | Dimmic et al. [19] | JTT | JTT+G+F | JTT+G+F | JTT+G+F |
GAGHIV | Dimmic et al. [19] | JTT | JTT+G+F | JTT+G+F | JTT+G+F |
GAMMA | Dimmic et al. [19] | RtREV | CPREV+G+F | RtREV+G | RtREV+G |
LENTI | Dimmic et al. [19] | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
SPUMA | Dimmic et al. [19] | RtREV | RtREV+G | RtREV+G | RtREV+G |
NONLTR | Dimmic et al. [19] | RtREV | RtREV+I+G+F | RtREV+I+G+F | RtREV+I+G+F |
SIVPOLPRO | LANL | RtREV | RtREV+G+F | RtREV+G+F | RtREV+G |
Model variation among multi-gene datasets
Model selection and tree accuracy
Tree Accuracy Simulations. Results of the simulated tree accuracy test where alignments were generated with a particular model and then phylogenies were built using all of the other available models. Each entry is the average scaled Robinson-Foulds (RF) distance [40] over the trees inferred using the alternative models. This test was repeated 10 times for each model and the values in brackets are the RF distances from the true tree when phylogenies were inferred using the model that generated the alignment. Phyml [53] was used to build all trees.
Model | RF Distance | Model | RF Distance |
---|---|---|---|
Blosum | 0.03 (0.03) | JTT | 0.05 (0.05) |
Blosum+I | 0.02 (0.02) | JTT+I | 0.05 (0.04) |
Blosum+G | 0.08 (0.06) | JTT+G | 0.04 (0.03) |
Blosum+I+G | 0.05 (0.05) | JTT+I+G | 0.12 (0.11) |
CPREV | 0.05 (0.04) | MtREV | 0.06 (0.05) |
CPREV+I | 0.09 (0.04) | MtREV+I | 0.08 (0.08) |
CPREV+G | 0.06 (0.05) | MtREV+G | 0.07 (0.06) |
CPREV+I+G | 0.07 (0.06) | MtREV+I+G | 0.12 (0.1) |
Dayhoff | 0.07 (0.07) | WAG | 0.02 (0.02) |
Dayhoff+I | 0.06 (0.05) | WAG+I | 0.04 (0.04) |
Dayhoff+G | 0.06 (0.06) | WAG+G | 0.1 (0.1) |
Dayhoff+I+G | 0.05 (0.04) | WAG+I+G | 0.04 (0.04) |
Proteobacteria Tree Accuracy Analysis. The scaled Robinson-Foulds (RF) distances [40] of the trees produced from the Proteobacteria dataset using fixing a model used to build trees from each alignment. The values reported are the median and average distance computed by comparing every tree against every other tree. When the optimal set of models were used the median was 0.22 and the average was 0.34. Phyml [53] was used to build all trees.
Model | Median RF | Mean RF | Model | Median RF | Mean RF |
---|---|---|---|---|---|
Blosum | 0.23 | 0.35 | JTT | 0.23 | 0.34 |
Blosum+I | 0.25 | 0.35 | JTT+I | 0.25 | 0.35 |
Blosum+G | 0.25 | 0.35 | JTT+G | 0.25 | 0.35 |
Blosum+I+G | 0.25 | 0.35 | JTT+I+G | 0.25 | 0.35 |
CPREV | 0.24 | 0.35 | MtREV | 0.25 | 0.35 |
CPREV+I | 0.25 | 0.35 | MtREV+I | 0.25 | 0.35 |
CPREV+G | 0.25 | 0.35 | MtREV+G | 0.25 | 0.35 |
CPREV+I+G | 0.25 | 0.35 | MtREV+I+G | 0.25 | 0.35 |
Dayhoff | 0.2 | 0.34 | WAG | 0.21 | 0.34 |
Dayhoff+I | 0.21 | 0.34 | WAG+I | 0.23 | 0.35 |
Dayhoff+G | 0.22 | 0.34 | WAG+G | 0.25 | 0.35 |
Dayhoff+I+G | 0.22 | 0.34 | WAG+I+G | 0.25 | 0.35 |
Discussion
We have studied the influence of various factors on protein model selection. Our simulations have confirmed previous work showing that the model selection procedure performs quite accurately using an approximate tree for model selection. One of the most interesting results that we have shown using real datasets is that less than 9% of the time was a different matrix selected using a full ML analysis than those selected using a quick NJ-JTT method. This further strengthens the recent results presented by Sullivan et al. [38] showing that successive-approximation methodologies to phylogeny estimation does not suffer from any significant loss in accuracy. Our simulations have also shown that protein model selection is not as sensitive as nucleotide model selection to sequence length differences. Recovery rates remain relatively constant over different sequence lengths with the only exception to this being at short sequence lengths when the difference in likelihood values can result in an overly-simplistic model being selected (+G instead of +I+G). We have also shown that when amino acid frequencies deviate from the default amino acid frequencies of each model, there is a clear trend towards the '+F' version of each model. This phenomenon was also observed in the results of the real dataset analysis presented in Table 5 with many of the real datasets being best described by '+F' versions of the expected models. One constant trend across all of the simulations we have performed is that the correct amino acid matrix is selected by both measures close to 100% of the time regardless of factors such as base tree accuracy, sequence length, ASRV variances, or amino acid frequencies.
It should be emphasized that many of the current set of models of amino acid or nucleotide substitution make many unrealistic assumptions such as reversibility, amino acid composition stationarity, and homogeneous substitution rates. However much work is currently taking place to develop methods to loosen many of these restrictions [41–43]. While the focus of this work has been to demonstrate the usefulness of performing protein model selection, it must be stated that model selection measures can only provide information on which of the given set of models best-fits the data and cannot give any indication of how close a particular model is to reality.
We have highlighted an example where two highly-supported and topologically different phylogenies were produced from the same alignment using two arbitrarily selected amino acid substitution matrices (see Fig. 1). The likelihood values of the two trees are -30722 for the MtREV tree and -28996 for the WAG tree with the extremely high bootstrap support values providing evidence that the observed trees are not due to a stochastic error (e.g. the treesearch getting stuck in a local optima). To further rule out any source of stochastic error, the corresponding likelihoods for the MtREV tree with the WAG matrix is -29288 and -30959 for the WAG tree with the MtREV matrix, thereby confirming that both matrices favor different tree topologies. A tree constructed using the optimal model for this alignment (RtREV+I+G+F) agrees with the WAG tree. At first glance, our particular example may seem slightly unrealistic as we have used a mitochondrial model to construct a tree from nuclear genes. However, as we have shown, one of the best models for proteobacterial and archaeal genes is frequently (22% and 33% of the time respectively) a model that was derived from retroviral Pol proteins. Therefore, ad hoc model selection, even when using arguments about the origin of the model (nuclear versus organelle, or some such) are still ad hoc arguments. The maximum likelihood principle suggests the use of the best of the available models and in some cases, the best performing model can be surprising.
The results of our cross-domain substitution model analysis are interesting as there are noticeable differences in the groups of models selected by each dataset with no single matrix emerging as the best for any of the datasets. The large diversity of amino acid matrices cannot come as a great surprise as it would seem intuitively unreasonable to assume that a very large group of independently evolving gene families from a fixed taxon set followed an identical amino acid substitution pattern. Perhaps one of the most significant findings is that the RtREV matrix [19] features so prominently in both the proteobacteria and archaea datasets. The WAG matrix [18], for instance was derived from a globular protein dataset and was shown to produce higher likelihood values in general, compared with the JTT matrix for the dataset from which it was derived. This seemed to indicate that choosing a matrix based upon the method or the data used to derive the matrix might be a good idea. However, our finding that for so many alignments from cellular life, the best matrix was derived from viral sequences is surprising and the consequence is that ad hoc arguments for choice of matrix may not reasonable.
Conclusion
In this study, we have analysed the ability of the AIC and the BIC to select the appropriate evolutionary model in cases where the model is known. We have shown that both methods are suitable for this purpose. We have also shown that none of the currently available models is universally preferred for all alignments and that there is considerable variation in the substitution process across protein families. What we have not attempted to show is that for any given alignment the selected model is the actual model that gave rise to the observed data. However, on the basis of our results we can speculate on the appropriateness of the models. Considering that a viral model is one of the most preferred models for these cellular sequences, perhaps none of the models are really capturing the data. The models are homogeneous across the tree and this is likely to be a simplification. Therefore, even though we have produced a robust method of model selection, it is likely that the models themselves need to be improved.
Methods
The AIC is a popular model selection measure that attempts to strike a balance between the goodness-of-fit and complexity of a model. The AIC is calculated by
AIC_{1} = -2 ln L_{ i }+ 2N_{ i }, (1)
where N_{ i }is the number of free parameters in model i and L_{ i }is the likelihood value of model i. Posada and Crandall [36] presented evidence to show that the more empirically tuned AIC_{2} can sometimes be more accurate at determining the correct nucleotide substitution model. It is calculated by replacing the 2N_{ i }term with 5N_{ i }thus further penalising models of greater complexity. The BIC is another model selection measure and is equivalent to selecting the model with the maximum posterior probability and is calculated from
BIC = -2 ln L_{ i }+ N_{ i }ln n, (2)
where n is the sample size (sequence length). The AIC and BIC select the best model by choosing the model with the minimum criterion value. The main difference between the three measures is that the AIC_{2} and BIC tend to select simpler models than the AIC_{1} because they penalise the addition of further model parameters more than the AIC_{1} [44]. If the models that rank highest for a given dataset all include a certain ASRV parameter, then the AIC and BIC will essentially become an ordering with respect to the likelihood values.
We have recently developed a protein model selection program called MODELGENERATOR [29]. It initially constructs a neighbor-joining (NJ) tree using an arbitrary model (default is Jukes-Cantor [45] for nucleotides and JTT [14] for proteins) in order to get an initial base tree for comparison of models. For each model examined, the branch lengths of the tree and model parameters are optimised independently using the PAL library [46]. The program supports 10 amino acid matrices and 14 nucleotide models with either a proportion of invariable sites (+I), gamma shape ASRV (+G), combined invariable and gamma distribution (+I+G), and for amino acids the observed amino acid frequencies (+F). When all matrix and ASRV permutations are considered, a total of 56 nucleotide and 80 protein models can be derived. In the following subsections, we outline how we investigated the effects of various properties of amino acid alignments on the three non-nested model selection measures (AIC_{1}, AIC_{2}, BIC) when applied to protein model selection. For all of the simulations, we used the same 20 taxon clocklike tree used by Posada and Crandall [36] and the program Seq-Gen vl.3.2 [47] to generate all simulated protein alignments. The presence or absence of a molecular clock in the base tree has been shown to have a negligible effect on the model selection procedure [36]. For all of the simulated and real data tests performed below, MODELGENERATOR was not constrained a priori and the full set of 80 protein models was examined during every execution.
Base tree sensitivity
In order to compare the sensitivity of protein model selection to the accuracy of the base tree, we generated 2400 individual alignments of 500 characters in length using each of the protein models available in Seq-Gen (100 alignments per model) fixing the proportion of invariable sites at 0.2 and the α parameter of the gamma distribution to 0.5 where appropriate and then performed model selection using three different base trees – the true tree, an NJ-JTT tree, and a randomly generated tree.
To further investigate the effect of using a distance-based tree for comparison rather than the fully optimised ML tree of each model, we obtained three real datasets from each of the Domains of life. The first dataset consists of 2135 gene families obtained from 25 complete proteobacteria genomes. The homologs were identified by performing all-against-all blast searches [48] of the 25 fully completed genomes with an e-value cutoff of 10^{-7}. The sequences were aligned using ClustalW 1.81 [49] with the parameters unchanged from their default settings. The alignments were manually edited to remove badly aligned areas and large gapped areas. The second dataset consisted of amino acid sequences of 16 archaeal genomes retrieved from the COGENT database [50] and one (Haloarcula marismortui) from the NCBI. We identified gene families where all members of the family were capable of identifying all other members of the family during database searches (with an e-value cutoff of 10^{-7}). Each of these families consisted of between 4 and 16 taxa and were aligned using ClustalW 1.81 using the default settings [49]. The final dataset is a previously published set of 118 vertebrate gene families which included representatives of all the major vertebrate groups obtained from the HOVERGEN database [51] with each alignment consisting of between 4 and 58 taxa [52]. For each dataset, we took a subset of 100 alignments and used Phyml [53] to construct fully optimised ML phylogenies with each of the available protein models and recorded the final likelihood of each individual phylogeny. We limited the ML analysis to 100 randomly-chosen alignments from each dataset due to excessive execution times for the full ML analyses. We took the final likelihood values produced by Phyml and determined the best-fit model and then compared the selected models with those produced by the NJ-JTT model selection procedure.
Sequence length
We generated 100 replicate alignments of each of the protein models available in Seq-Gen consisting of 100, 500, and 1000 characters in length. For these tests, we fixed the proportion of invariable sites at 0.2 and the α parameter of the gamma distribution to 0.5 where appropriate. We performed model estimation using MODELGENERATOR and recorded the model selected by each of the available tests (AIC1, AIC2, BIC).
Rate-distribution parameters
In order to investigate the possible effect of varying ASRV parameters, we generated a number of different simulated datasets (100 replicate alignments per model) with a fixed sequence length of 500 characters and varied the α parameter of the gamma ASRV between 0.5, 1.0, and 2.0 (corresponding to high, medium, and low rate heterogeneity).
Amino acid frequency perturbation
Expected model selection
We obtained the two primate mitochondrial datasets that are included as example datasets in Paml 3.14 [54], namely the files mtCDNApri.aa and mtCDNAape.nuc (translated the nucleotide sequences to amino acids). We also obtained the amino acid sequences of a recent study examining congruence among mammalian mitochondrial and nuclear genes [55]. We downloaded a copy of the complete test dataset used in the creation and testing of the RtREV matrix [19]. We also obtained a Pol alignment from the 2003 HIV and SIV alignments database [56] and performed model selection on all of the sequences mentioned.
Model variation among empirical datasets
For this test, we used the full set of sequences from the three real datasets of the each Domain of life (as described above). We performed model prediction for each alignment in the datasets in order to assess the extent of model differences within the gene families.
Model selection and tree accuracy
To test for the effect of ad hoc model selection on tree accuracy, we performed an analysis on both simulated and real data. In the simulated analysis, we generated alignments of 500 characters in length using all of the amino acid matrices and rate distributions setting the proportion of invariable sites to 0.3 and the α parameter of the gamma distribution to 0.5 where appropriate. The base tree in Figure 2 was used to generate all alignments. We then analysed each alignment using all of the possible models except the one used to generate the alignment. We repeated this test 10 times for each model and computed the average scaled RF distance [40] from all the inferred trees to the true tree for each model. We then build phylogenies using the same model as was used to generate each alignment and reported the RF distance to the true tree also.
In an attempt to analyse the effect of ad hoc model selection on tree accuracy on real data, we took the proteobacteria dataset (2135 orthologs) and built phylogenies by using each of the available models as a fixed model for the entire dataset. We recorded the median and average of the all-against-all RF distances of the trees using Clann v3.0.3 [57]. For all possible pairs of trees, Clann prunes the taxa of the trees so that only the common taxa are left and then computes the scaled RF distance.
Supplementary data
All of the simulated and real alignments mentioned in the paper are available for download from [58].
Declarations
Acknowledgements
We thank Peter Foster and Zheng Yang for providing valuable comments on the manuscript. We wish to acknowledge James Cotton and Rod Page for making their vertebrate dataset available for our analysis. We thank Davide Pisani and Jennifer Commins for helping create the figures. We would like to acknowledge the financial support of the Irish Research Council for Science, Engineering and Technology (IRCSET). The authors wish to acknowledge the SFI/HEA Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.
Authors’ Affiliations
References
- Felsenstein J: Cases in which parsimony and compatibility methods will be positively misleading. Syst Zool. 1978, 27: 401-410.View ArticleGoogle Scholar
- Gaut BS, Lewis PO: Success of maximum likelihood phylogeny inference in the four-taxon case. Mol Biol Evol. 1995, 12: 152-162.View ArticlePubMedGoogle Scholar
- Sullivan JS, Swofford DL: Should we use model-based methods for phylogenetic inference when we know assumptions about among-site rate variation and nucleotide substitution pattern are violated?. Syst Biol. 2001, 50: 723-729. 10.1080/106351501753328848.View ArticlePubMedGoogle Scholar
- Anderson FE, Swofford DL: Should we be worried about long-branch attraction in real data sets? Investigations using metazoan 18S rDNA. Mol Phylogenet Evol. 2004, 33: 440-451. 10.1016/j.ympev.2004.06.015.View ArticlePubMedGoogle Scholar
- Sullivan J, Swofford DL: Are guinea pigs rodents? The importance of adequate models in molecular phylogenetics. J Mamm Evol. 1997, 4: 477-486.View ArticleGoogle Scholar
- Siddall ME: Success of parsimony in the four-taxon case: long-branch repulsion by likelihood in the Farris zone. Cladistics. 1998, 14: 209-220. 10.1111/j.1096-0031.1998.tb00334.x.View ArticleGoogle Scholar
- Swofford DL, Waddell PJ, Huelsenbeck JP, Foster PG, Lewis PO, Rogers JS: Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst Biol. 2001, 50: 525-539. 10.1080/106351501750435086.View ArticlePubMedGoogle Scholar
- Bruno WJ, Halpern AL: Topological bias and inconsistency of maximum likelihood using wrong models. Mol Biol Evol. 1999, 16: 564-566.View ArticlePubMedGoogle Scholar
- Buckley TR, Cunningham CW: The effects of nucleotide substitution model assumptions on estimates of non-parametric bootstrap support. Mol Biol Evol. 2002, 19: 394-405.View ArticlePubMedGoogle Scholar
- Buckley TR, Cunningham CW: Model misspeciflcation and probabilistic tests of topology: Evidence from empirical data sets. Syst Biol. 2002, 51: 509-523. 10.1080/10635150290069922.View ArticlePubMedGoogle Scholar
- Cao Y, Adachi J, Janke A, Paabo S, Hasegawa M: Phylogenetic relationships among eutherian orders estimated from inferred sequences of mitochondrial proteins: instability of a tree based on a single gene. J Mol Evol. 1994, 39 (5): 519-552. 10.1007/BF00173421.View ArticlePubMedGoogle Scholar
- Dayhoff MO, Eck RV, Park CM: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. Edited by: Dayhoff MO. 1972, Washington D.C.: National Biomedical Research Foundation, 89-99.Google Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Nail Acad Sci USA. 1992, 89: 10915-10919.View ArticleGoogle Scholar
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8: 275-282.PubMedGoogle Scholar
- Adachi J, Hasegawa M: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 1996, 42: 459-468. 10.1007/PL00013324.View ArticlePubMedGoogle Scholar
- Yang Z, Nielsen R, Hasegawa M: Models of amino acid substitution and applications to Mitochondrial protein evolution. Mol Biol Evol. 1998, 15: 1600-1611.View ArticlePubMedGoogle Scholar
- Muller T, Vingron M: Modeling amino acid replacement. J Comp Biol. 2000, 7: 761-776. 10.1089/10665270050514918.View ArticleGoogle Scholar
- Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001, 18: 691-699.View ArticlePubMedGoogle Scholar
- Dimmic MW, Rest JS, Mindell DP, Goldstein RA: rtREV: a substitution matrix for inference of retrovirus and reverse transcriptase phylogeny. J Mol Evol. 2002, 55: 65-73. 10.1007/s00239-001-2304-y.View ArticlePubMedGoogle Scholar
- Muller T, Spang R, Vingron M: Estimating Amino Acid Substitution Models: A Comparison of Dayhoff's Estimator, the Resolvent Approach and a Maximum Likelihood Method. Mol Biol Evol. 2002, 19: 8-13.View ArticlePubMedGoogle Scholar
- Veerassamy S, Smith A, Tillier ERM: A transition probability model for amino acid substitutions from blocks. J Comput Biol. 2003, 10: 997-1010. 10.1089/106652703322756195.View ArticlePubMedGoogle Scholar
- Kosiol C, Goldman N: Different versions of the Dayhoff rate matrix. Mol Biol Evol. 2005, 22: 193-199. 10.1093/molbev/msi005.View ArticlePubMedGoogle Scholar
- Posada D, Buckley TR: Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Syst Biol. 2004, 53: 793-808. 10.1080/10635150490522304.View ArticlePubMedGoogle Scholar
- Cao Y, Adachi J, Janke A, Paabo S, Hasegawa M: Phylogeny estimation and hypothesis testing using maximum likelihood. Annu Rev Ecol Syst. 1997, 28: 437-466. 10.1146/annurev.ecolsys.28.1.437.View ArticleGoogle Scholar
- Felsenstein J: Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.View ArticlePubMedGoogle Scholar
- Hasegawa M, Kishino K, Yano T: Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174. 10.1007/BF02101694.View ArticlePubMedGoogle Scholar
- Akaike H: A new look at the statistical model identification. IEEE Trans Aut Control. 1974, 19: 716-723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
- Schwartz G: Estimating the dimension of a model. Annal Stat. 1978, 6: 461-464.View ArticleGoogle Scholar
- MODELGENERATOR download. [http://bioinf.nuim.ie/software/modelgenerator/]
- Abascal F, Zardoya R, Posada D: ProtTest: Selection of best-fit models of protein evolution. Bioinformatics. 2005, 21 (9): 2104-2105. 10.1093/bioinformatics/bti263.View ArticlePubMedGoogle Scholar
- Brochier C, Forterre P, Gribaldo S: Archaeal phylogeny based on proteins of the transcription and translation machineries: tackling the Methanopyrus kandleri paradox. Genome Biol. 2004, 5: R17-10.1186/gb-2004-5-3-r17.PubMed CentralView ArticlePubMedGoogle Scholar
- Nishiyama T, Wolf PG, Kugita M, Sinclair RB, Sugita M, Sugiura C, Wakasugi T, Yamada K, Yoshinaga K, Yamaguchi K, Ueda K, Hasebe M: Chloroplast phylogeny indicates that bryophytes are monophyletic. Mol Biol Evol. 2004, 21 (10): 1813-1819. 10.1093/molbev/msh203.View ArticlePubMedGoogle Scholar
- Philippe H, Snell EA, Bapteste E, Lopez P, Holland PWH, Casane D: Phylogenomics of Eukaryotes: Impact of Missing Data on Large Alignments. Mol Biol Evol. 2004, 21: 1740-1752. 10.1093/molbev/msh182.View ArticlePubMedGoogle Scholar
- Philip GK, Creevey CJ, McInerney JO: The Opisthokonta and the Ecdysozoa may not be Glades: Stronger Support for the Grouping of Plant and Animal than for Animal and Fungi and Stronger Support for the Coelomata than Ecdysozoa. Mol Biol Evol. 2005, 22: 1175-1184. 10.1093/molbev/msi102.View ArticlePubMedGoogle Scholar
- Yang Z, Goldman N, Friday AE: Comparison of models for nucleotide substitution used in maximum likelihood phylogenetic estimation. Mol Biol Evol. 1994, 11: 316-324.PubMedGoogle Scholar
- Posada D, Crandall KA: Selecting the Best-Fit Model of Nucleotide Substitution. Syst Biol. 2001, 50 (4): 580-601. 10.1080/106351501750435121.View ArticlePubMedGoogle Scholar
- Sullivan J, Holsinger KE, Simon C: The effect of topology on estimates of among-site rate variation. J Mol Evol. 1996, 42: 308-312. 10.1007/BF02198857.View ArticlePubMedGoogle Scholar
- Sullivan J, Abdo Z, Joyce P, Swofford DL: Evaluating the performance of a Sucessive-Approximations Approach to Parameter Optimization in Maximum-Likelihood Phylogeny Estimation. Mol Biol Evol. 2005, 22 (6): 1386-1392. 10.1093/molbev/msi129.View ArticlePubMedGoogle Scholar
- Abdo Z, Minin VN, Joyce P, Sullivan J: Accounting for Uncertainty in the Tree Topology Has Little Effect on the Decision-Theoretic Approach to Model Selection in Phylogeny Estimation. Mol Biol Evol. 2005, 22: 691-703. 10.1093/molbev/msi050.View ArticlePubMedGoogle Scholar
- Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math BioSci. 1981, 53: 131-47. 10.1016/0025-5564(81)90043-2.View ArticleGoogle Scholar
- Sanderson MJ: Estimating absolute rates of molecular evolution and divergence times: a penalized likelihood approach. Mol Biol Evol. 2002, 19: 101-109.View ArticlePubMedGoogle Scholar
- Foster PG: Modeling compositional heterogeneity. Syst Biol. 2004, 53: 485-495. 10.1080/10635150490445779.View ArticlePubMedGoogle Scholar
- Ho SYW, Phillips MJ, Drummond AJ, Cooper A: Accuracy of Rate Estimation Using Relaxed-Clock Models with a Critical Focus on the Early Metazoan Radiation. Mol Biol Evol. 2005, 22: 1355-1363. 10.1093/molbev/msi125.View ArticlePubMedGoogle Scholar
- Forster MR, Sober E: Why likelihood?. Likelihood and Evidence. Edited by: Taper M, Lele S. 1972, Chicago: University of Chicago Press, 89-99.Google Scholar
- Jukes TH, Cantor CR: Evolution of protein molecules. Mammalian protein metabolism. Edited by: Munro HM. 1969, New York: Academic Press, 21-132.View ArticleGoogle Scholar
- Drummond A, Strimmer K: PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics. 2001, 17: 662-663. 10.1093/bioinformatics/17.7.662.View ArticlePubMedGoogle Scholar
- Rambaut A, Crassly NC: Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13: 235-238.PubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Evol. 1990, 215: 403-410.Google Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680.PubMed CentralView ArticlePubMedGoogle Scholar
- Janssen PJ, Enright AJ, Audit B, Cases I, Goldovsky L, Harte N, Kunin V, Ouzounis CA: COGENT: a flexible data environment for computational genomics. Bioinformatics. 2003, 19: 1451-1452. 10.1093/bioinformatics/btg161.View ArticlePubMedGoogle Scholar
- Duret L, Mouchiroud D, Gouy M: HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 1994, 22: 2360-2365.PubMed CentralView ArticlePubMedGoogle Scholar
- Cotton JA, Page RDM: Going nuclear: gene family evolution and vertebrate phylogeny reconciled. Proc R Soc Lond B Biol Sci. 2001, 269: 1555-1561. 10.1098/rspb.2002.2074.View ArticleGoogle Scholar
- Guindon S, Gascuel O: A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.View ArticlePubMedGoogle Scholar
- Yang Z, Nielsen R, Hasegawa M: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13: 555-556.PubMedGoogle Scholar
- Reyes A, Gissi C, Catzeflis F, Nevo E, Pesole G, Saccone C: Congruent Mammalian Trees from Mitochondrial and Nuclear Genes Using Bayesian Methods. Mol Biol Evol. 2004, 21: 397-403. 10.1093/molbev/msh033.View ArticlePubMedGoogle Scholar
- HIV and SIV Database at Los Almos National Laboratory. [http://hiv-web.lanl.gov]
- Creevey CJ, McInerney JO: Clann: investigating phylogenetic information through supertree analyses. Bioinformatics. 2005, 21 (3): 390-392. 10.1093/bioinformatics/bti020.View ArticlePubMedGoogle Scholar
- Supplementary Data. [http://bioinf.nuim.ie/supplementary/models/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.