bModelTest: Bayesian phylogenetic site model averaging and model comparison

Background Reconstructing phylogenies through Bayesian methods has many benefits, which include providing a mathematically sound framework, providing realistic estimates of uncertainty and being able to incorporate different sources of information based on formal principles. Bayesian phylogenetic analyses are popular for interpreting nucleotide sequence data, however for such studies one needs to specify a site model and associated substitution model. Often, the parameters of the site model is of no interest and an ad-hoc or additional likelihood based analysis is used to select a single site model. Results bModelTest allows for a Bayesian approach to inferring and marginalizing site models in a phylogenetic analysis. It is based on trans-dimensional Markov chain Monte Carlo (MCMC) proposals that allow switching between substitution models as well as estimating the posterior probability for gamma-distributed rate heterogeneity, a proportion of invariable sites and unequal base frequencies. The model can be used with the full set of time-reversible models on nucleotides, but we also introduce and demonstrate the use of two subsets of time-reversible substitution models. Conclusion With the new method the site model can be inferred (and marginalized) during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood-based methods. The method is implemented in the bModelTest package of the popular BEAST 2 software, which is open source, licensed under the GNU Lesser General Public License and allows joint site model and tree inference under a wide range of models. Electronic supplementary material The online version of this article (doi:10.1186/s12862-017-0890-6) contains supplementary material, which is available to authorized users.


Background
One of the choices that needs to be made when performing a Bayesian phylogenetic analysis is which site model to use. A common approach is to use a likelihood-based method like ModelTest [1], jModelTest [2], or jModel-Test2 [3] to determine the site model. The site model is comprised of (i) a substitution model defining the relative rates of different classes of substitutions and (ii) a model of rate heterogeneity across sites which may include a gamma distribution [4] and/or a proportion of invariable sites [5,6]. The site model recommended by such likelihood-based method is then often used in a model for all sites. Wu et al. [12] use reversible jump for both substitution models and partitions and furthermore sample the use of gamma rate heterogeneity for each site category. However, since the method divides sites among a set of substitution models, it does not address invariable sites, and only considers a limited set of five (K80, F81, HKY85, TN93, and GTR) substitution models.
Here we introduce a method which combines model averaging over substitution models with model averaging of the parameters governing rate heterogeneity across sites using reversible jump. Whether one considers the method to be selecting the site model, or averaging over (marginalizing over) site models depends on which random variables are viewed as parameters of interest and which are viewed as nuisance parameters. If the phylogeny is viewed as the parameter of interest, then bModelTest provides estimates of the phylogeny averaged over site models. Alternatively if the site model is of interest, then bModelTest can be used to select the site model averaged over phylogenies. These are matters of post-processing of the MCMC output, and it is also possible to consider the interaction of phylogeny and site models. For example one could construct phylogeny estimates conditional on different features of the site model from the results of a single MCMC analysis.
The method is implemented in the bModelTest package of BEAST 2 [13] with GUI support for BEAUti making it easy to use. It is open source and available under LGPL licence. Source code, installation instructions and documentation can be found at https://github.com/BEAST2-Dev/bModelTest.

Implementation
All time-reversible nucleotide models can be represented by a 4 × 4 instantaneous rate matrix: − π C r ac π G r ag π T r at π A r ac − π G r cg π T r ct π A r ag π C r cg − π T r gt π A r at π C r ct π G r gt − ⎞ ⎟ ⎟ ⎠ , with six rate parameters r ac , r ag , r at , r cg , r ct and r gt and four parameters describing the equilibrium base frequencies = (π A , π C , π G , π T ). A particular restriction on the rate parameters can conveniently be represented by a six figure model number where each of the six numbers corresponds to one of the six rates in the alphabetic order listed above. Rates that are constrained to be the same, have the same integer at their positions in the model number. For example, model 123,456 corresponds to a model where all rates are independent, named the general time reversible (GTR) model [14]. Model 121121 corresponds to the HKY model [15] in which rates form two groups labelled transversions (1 : r ac = r at = r cg = r gt ) and transitions (2 : r ag = r ct ). By convention, the lowest possible number representing a model is used, so even though 646,646 and 212,212 represent HKY, we only use 121,121.
There are 203 reversible models in total [7]. However, it is well known that transitions (A↔C, and G↔T substitutions) are more likely than transversions (the other substitutions) [16,17]. Hence grouping transition rates with transversion rates is often not appropriate and these rates should be treated differently. We can restrict the set of substitution models that allow grouping only within transitions and within transversions, with the exception of model 111,111, where all rates are grouped. This reduces the 203 models to 31 models (see Fig. 1 and details in Additional file 1: Appendix). Alternatively, if one is interested in using named models, we can restrict further to include only Jukes Cantor [18,19] (111,111), HKY [15] (121,121), TN93 [20] (121,131), K81 [21] (123,321), TIM [22] (123,341), TVM [22] (123,425),and GTR [14] (123,456). However, to facilitate stepping between TIM and GTR during the MCMC (see proposals below) we like to use nested models, and models 123,345 and 123,324 provide intermediates between TIM and GTR, as well as K81 and TVM, leaving us with a set of 9 models (Fig. 1).
The state space consists of the following parameters: • Rates r ac , r ag , r at , r cg , r ct and r gt are determined from the model number M and rate parameter R. Further, we restrict R such that the sum of the six rates r .. equals 6 in order to ensure identifiability. This is implemented by starting each rate with value 1, and ensuring proposals keep the sum of rates in (see details on proposals below).

Prior
By default, bModelTest uses the flat Dirichlet prior on rates from [7]. From empirical studies [16,17], we know that transition rates tend to be higher than transversion rates. It makes sense to encode this information in our prior and bModelTest allows for rates to get a different prior on transition rates (default log normal with mean 1 and standard deviation of 1.25 for the log rates) and a b c Fig. 1 Model spaces. The model spaces supported by bModelTest. a All reversible models, b transition/transversion split models, and c named models. Arrows indicate which models can be reached by splitting a model. Note all models with the same number of groupings are at the same height transversion rates (default exponential with mean 1 for the rates). An obvious choice for the prior on models is to use a uniform prior over all valid models. As Fig. 1 shows, there are many more models with 3 parameters than with 1. An alternative allowed in bModelTest is to use a uniform prior on the number of parameters in the model. In that case, Jukes Cantor and GTR get a prior probability of 1/6, since these are the only models with 0 and 5 degrees of freedom respectively. Depending on the model set, a much lower probability is assigned to each of the individual models such that the total prior probability summed over models with K parameters, p(K) = 1/6 for K ∈ {0, 1, 2, 3, 4, 5}.
For frequencies a Dirichlet (4,4,4,4) prior is used, reflecting our believe that frequencies over nucleotides tend to be fairly evenly distributed, but allowing a 2.2% chance for a frequency to be under 0.05. For p inv a Beta(4,1) prior on the interval (0, 1) is used giving a mean of 0.2 and for α an exponential with a mean 1. These priors only affect the posterior when the respective binary indicator is 1.

MCMC proposals
The probability of acceptance of a (possibly transdimensional) proposal [23] is min{1, posterior ratio × proposal ratio × Jacobian} where the posterior ratio is the posterior of the proposed state S divided by that of the current state S, the proposal ratio the probability of moving from S to S divided by the probability of moving back from S to S, and the Jacobian is the determinant of the matrix of partial derivatives of the parameters in the proposed state with respect to that of the current state [23].

Model merge/split proposal
For splitting (or merging) substitution models, suppose we start with a model M. To determine the proposed model M , we randomly select one of the child (or parent) nodes in the graph (as shown in Fig. 1). This is in contrast to the approach of Huelsenbeck et al. [7], in which first a group is randomly selected, then a subgrouping is randomly created. For any set of substitution models organised in an adjacency graph our merge/split operator applies, making our graph-based method easier to generalise to other model sets (e.g. the one used in [24]). If there are no candidates to split (that is, model M = 123, 456 is GTR) the proposal returns the current state (this proposal is important to guarantee uniform sampling of models). Likewise, when attempting to merge model M = 111, 111, the current state is proposed (M = 111, 111). Let r be the rate of the group to be split. We have to generate two rates r i and r j for the split into groups of size n i and n j . To ensure rates sum to 6, we select u uniformly from the interval (−n i r, n j r) and set r i = r + u/n i and r j = r − u/n j .
For a merge proposal, the rate of the merged group r from two split groups i and j with sizes n i and n j , as well as rates r i and r j is calculated as r = n i r i +n j r j n i +n j . When we select merge and split moves with equal probability, the proposal ratio for splitting becomes where |M split | (and |M merge |) is the number of possible candidates to split (and merge) into from model M (and M respectively). The proposal ratio for merging is The Jacobian for splitting is n i +n j n i n j and for merging it is n i n j n i +n j .

Rate exchange proposal
The rate exchange proposal randomly selects two groups, and exchanges a random amount such that the condition that all six rates sum to 6 is met. A random number is selected from the interval [0, δ] where δ is a tuning parameter of the proposal (δ is automatically optimized to achieve the desired acceptance probability for the data during the MCMC chain). Let n i , r i , n j and r j as before, then the new rates are r i = r i − u and r j = r j + u n i n j . The proposal fails when r i < 0.
The proposal ratio as well as the Jacobian are 1.

Birth/death proposal
Birth and death proposals set or unset the category count flag and sample a new value for α from the prior when the flag is set. The proposal ratio is d(α ) for birth and 1/d(α) for death where d(.) is the density used to sample from (by default an exponential density with a mean of 1).
Likewise for setting the indicator flag to include a proportion of invariable sites and sampling p inv from the prior. The Jacobian is 1 for all these proposals.

Scale proposal
For the α, we use the standard scale operator in BEAST 2 [13], adapted so it only samples if the category count flag is set for α. Likewise, for pInv this scale operator is used, but only if the indicator flag to include a proportion of invariable sites is set.

Results and discussion
Since implementation of the split/merge and rate exchange proposals is not straightforward, nor is derivation of the proposal ratio and Jacobian, unit tests were written to guarantee their correctness and lack of bias in proposals (available on https://github.com/BEAST2-Dev/ bModelTest).
To validate the method we performed a simulation study by drawing site models from the prior, then used these models to generate sequence data of 10K sites length on a tree (in Newick (A:0.2,(B:0.15,C:0.15):0.05)) with three taxa under a strict clock. The data was analysed using a Yule tree prior, a strict clock and bModelTest as site model with uniform prior over models and exponential with mean one for transversions and log-normal with mean one and variance 1.25 for transition rates. A hundred alignments were generated with gamma rate heterogeneity and a hundred without rate heterogeneity using a (Bouckaert, RR: BEASTShell -scripting for bayesian hierarchical clustering, submitted) script. Invariant sites can be generated in the process and are left in the alignment.
Comparing the model used to generate the alignments with inferred models is best done by comparing the individual rates of these models. Figure 2 shows the rate estimates for the six rates against the rates used to generate the data. Clearly, there is a high correlation between the estimated rates and the ones used to generate (R 2 > 0.99 for all rates). Results were similar with and without rate heterogeneity. Note values for rates AG and CT (middle panels) tend to be higher than the transversion rates due to the prior they are drawn from. Table 1 summarises coverage of the various parameters in the model, which is defined as the number of experiments where the 95% HPD of the parameter estimate contains the value of the parameter used to generate the data. The rows in the table show the four different models of rate heterogeneity among sites; plain means a single category without gamma or invariable sites, +G for discrete gamma rate categories, +I for two categories, one being invariable, and +G+I for discrete gamma rate categories and one invariable category. Furthermore, the experiment was run estimating whether base frequencies were equal or not. The first four rows are for data simulated with Accuracy of estimated substitution rates. True rates (horizontal) against estimated rates (vertical) in simulated data for 3 taxa. In reading order, rate AC, AG, AT, CG, CT and GT. Diamonds are for estimates when no rate heterogeneity was used to simulate the data, circles are for estimates with rate heterogeneity. Error bars represent 95% HPD intervals for each estimate equal frequencies, the latter four with unequal frequencies. The last row shows results averaged over all 800 experiments. On average, one would expect the coverage to be 95% if simulations are drawn from the prior [25], so each entry in Table 1 has an expected value of 95, but can deviate due to small sample size. According to the binomial probability distribution there is a ∼ 1.1% chance of seeing 89 or less successes when sampling 100 times with a success rate of 0.95. The sample size for the mean rows is 800, so is expected to be much closer to 95%.
Coverage of rate estimates and frequencies are as expected, as shown in the table. Substitution model coverage is measured by first creating the 95% credible set of models for each simulation and then counting how often the model used to generate the data was part of the 95% credible set. The 95% credible set is the smallest set of models having total posterior probability ≥ 0.95. As Table 1 shows, model coverage is as expected (Subst. Model coverage column). The situation with gamma shape parameter estimates and proportion of invariable sites is not as straightforward as for the relative rates of the substitution process. The site model coverage can be measured in a similar fashion: the site model coverage column shows how often the 95% credible sets for the four different site models (plain, +G. +I and +G+I) contains the true model used to generate the data. The coverage is as expected. When looking at how well the shape parameter (α column in Table 1) and the proportion invariable sites (p inv column in the table) is estimated, we calculated the 95% HPD intervals for that part of the trace where the true site model was sampled. Coverage is as expected when only gamma rate heterogeneity is used, or when only a proportion of invariable sites is used, but when both are used an interaction between the two site model categories appears to slightly reduce the coverage of both parameters. In these experiments the coverage for the frequency estimates for the individual nucleotides was as expected. In summary, the statistical performance of the model is as expected for almost all parameters except for the case where gamma and a proportion of invariable sites are used due to their interaction as discussed further below.
To investigate robustness of the approach, we repeated the study with a log normal uncorrelated relaxed clock [26] with a gamma(α = 30, β = 0.005) prior over the E q u a l + G 9 1 9 5 9 3 9 3 9 5 9 3 9 3 . 3 9 7 E q u a l + I 9 2 9 4 9 4 9 5 9 3 9 4 9 3 . 6 9 6 E q u a l + G + I 8 9 9 6 9 5 9 4 9 5 9 5 9 4 9 8 U n e q u a l p l a i n 9 6 9 5 9 6 9 7 9 3 9 6 9 5 . 5 The first column lists the frequency and site models used to generate the data, and the last row is the mean coverage over all 800 runs. Coverage for rate parameters and frequencies is defined as the number of replicate simulations in which the true parameter value was contained in the estimated 95% HPD interval. The mean rate column contains the coverage averaged over all six rate coverage columns (i.e. the proportion of the 600 parameter estimates whose values were contained in their respective 95% HPD intervals. For details of substitution model coverage see text. The site model coverage is the number of replicate simulations that contained the correct model specification for rate heterogeneity across sites in the 95% credible set of models. Columns α and p inv are coverages of the shape and proportion invariable parameter conditioned on sampling from the true site model standard deviation for the log normal distribution. Trees with 5 taxa were randomly sampled from a Yule prior with log normal distribution (the birth rate was drawn from a distribution with a mean of the rate of 5.5, and a standarddeviation of the log-rate of 0.048) giving trees with mean height ≈ 0.25 and 95% HPD interval of 0.015 to 0.7. The study as outlined above was repeated, and results are summarised in Additional file 1: Table S1, which looks very similar to that of Table 1. So, we conclude that the model is not sensitive to small variation in molecular clock rates among branches. Figure 3 shows histograms of estimated posterior probability of gamma-distributed rate heterogeneity across sites for the data sets simulated over 5 taxa. When data was generated without gamma-distributed rate heterogeneity across sites, the posterior probability was often estimated to be close to zero (left of Fig. 3), while the posterior probability was estimated to be close to one for most of the analyses on data in which gamma rate heterogeneity was present (middle of Fig. 3). 1 When rate heterogeneity was present, shape estimates were fairly close to the ones used to generate the data (right of Fig. 3). However, there were quite a few outliers, especially when the shape parameter was high (although this is harder to see on a log-log plot which was used here because of the uneven distribution of true values). This can happen due to the fact that when the gamma shape is small, a large proportion of sites gets a very low rate, and may be invariant, so that the invariable category can model those instances. The mean number of invariant sites was 6083 when no rate heterogeneity was used, while it was 6907 when rate heterogeneity was used, a difference of about 8% of the sites. Figure 4 shows similar plots as Fig. 3 but for the proportion of invariable sites for 5 taxa. 2 Empirically for the parameters that we used for our simulations, it appears that if there are less than 60% invariant sites, adding a category to model them does not give a much better fit. True shape parameter Estimated shape parameter Fig. 3 Accuracy of inference of rate heterogeneity across sites. Posterior probability for inclusion of gamma rate heterogeneity when the data is generated without (left) and with (middle) rate heterogeneity for 5 taxa. Right, True gamma shape parameter (horizontal) against estimated shape parameter (vertical) when rate heterogeneity is used to generate the data simulation, there was a high correlation between the true proportion and the estimated proportion of invariable sites.

When a proportion of invariable sites was included in the
The same study with 5 taxa was repeated with the substitution model fixed to HKY and GTR, but estimating the other parts of the model. Results are summarised in Additional file 1: Tables S2 and S3 respectively. Fixing the model to HKY results in severe degradation of accuracy in all parameter and model estimates. The lack of coverage of frequency estimates when the true model has equal frequencies suggests that lack of degrees of freedom in substitution model parameters is compensated by estimating frequencies instead of keeping them equal. So substitution model misspecification can result in considerable misspecification of the remainder of the model. Results when fixing the substitution model to GTR shows a table with results very similar to that of bModelTest, however the substitution model parameters have on average a 95% HPD interval of size 0.17 while that of bModelTest is only 0.13. The extra parameters that need to be estimated for GTR compared to bModelTest result in more uncertain estimates, and thus more uncertainty in the analysis.
To see the impact of the model set, the experiment was repeated with sampling from all 203 reversible models instead of using only the 30 transition/transversion split models. Results are shown in Additional file 1: Table  S4, which do not differ substantially from Table 1. Further, to investigate the effect of the number of taxa and sequence length, the study was repeated with 16 taxa and sequence lengths 1K and 0.5K base pairs long under a relaxed clock as before. Results are summarised in Additional file 1: Tables S5 and S6 respectively. The tables do not show significant differences to Table 1 or

Truth is no proportion invariable sites
Posterior support for pInv > 0

Comparison with jModelTest
We ran jModelTest version 2.1.10 [3] on the sequence data used for the last simulation study with 5 taxa (using all reversible models, since only that set is the same for both jModelTest and bModelTest) and the two simulation studies using 16 taxa and compared the substitution model coverage (with settings -BIC -AIC -f -g 4 -i -s 203). For each dataset, we collected the top models according to the AIC and BIC criteria such that the cumulative weight exceeded 95% of the models as shown in the jModel-Test output and registered whether the true model was contained in the resulting set. Results are summarised in Additional file 1: Table S7, which shows that both AIC and BIC do not cover the true model 95% of the time as would be desirable. For some combinations the coverage is close to the desirable value (89.4% for AIC with 5 taxa) and for some it is much lower (61.1% for BIC with 0.5K length sequences and 16 taxa). Coverage of both AIC and BIC appears to decrease with increasing number of taxa and decreasing sequence length, although we have not attempted a systematic study. In contrast bModelTest has a coverage of ∼ 95% for all scenarios. jModelTest uses a single maximum likelihood tree and it seems that increasing uncertainty in the true tree (by increasing the number of taxa or decreasing sequence length) results in an increasing chance of incorrect model weights from jModelTest. For BIC, we find substantially less coverage of jModelTest than the around 90% model coverage reported in a previous study [3]. This is probably because our data contains a larger amount of uncertainty due to shorter sequences and tree lengths. Another factor is that we use different priors. For example, we use a Beta (1,4) for the proportion of invariable sites, while the previous study [3] used a Beta(1,3) that was then truncated to the interval [0.2,0.8], thereby avoiding extreme values which might cause difficulties. To confirm this we produced simulated data more closely matched to previously published experiments (with 40 taxa, sequences of 2500 base pairs, models selected uniformly from the 11 named models, tree length with mean of 6.5, truncated prior for invariable sites, BIC criterion) and obtained a coverage of 93.8% for the 95% credible set and 89.5% coverage by the best fitting model, similar to the results in [3].
In practice, users of Bayesian phylogenetic packages only use the most highly weighted model returned by jModelTest. Additional file 1: Table S7 shows how often the best fitting model according to AIC and BIC matches the true model, which ranges from 73.9% for BIC on 5 taxa to 30.8% for AIC on 0.5K length sequences and 16 taxa, suggesting that the probability of model misspecification using this approach increases with phylogenetic uncertainty.
To compare the application of bModelTest to jModel-Test (with settings -f -i -g 4 -s 11 -AIC -a) we applied both to two real datasets. The first data set used was an alignment from 12 primate species [27] (available from BEAST 2 as file examples/nexus/Primates.nex) containing 898 sites. In this case the model recommended by jModelTest was TPM2uf+G and the substitution model TPM2 (=121,323) has the highest posterior probability using bModelTest (21.12% see Additional file 1: Appendix for full list of supported models) when empirical frequencies are used. However, when frequencies are allowed to be estimated, HKY has highest posterior probability (16.19%), while TPM2 (10.25%) has less posterior probability then model 121,123 (14.09%). So, using a heuristic maximum likelihood approach (jModelTest and/or empirical frequencies) makes a difference in the substitution model being preferred. Figure 5 left shows the posterior probabilities for all models, and it shows that the 95% credible set is quite large for the primate data. Figure 5 middle and 5 right show correlation between substitution model rates. The former shows correlation between transversion rate AC (horizontally) and transition rate AG (vertically). One would not expect much correlation between these rates since the model coverage image shows there is little support for these rates to be shared. However, since HKY is supported to a large extent and the rates are constrained to sum to 6, any proposed change in a transition rate requires an opposite change in transversion rates in order for the sum to remain 6. So, when sampling HKY, there is a linear relation between transition and transversion rates, which faintly shows up in the Fig. 5 (middle). Figure 5 (right) shows the correlation between transversion rates AC and AT. Since they are close to each other, a large proportion of the time rate AC and AT are linked, which shows up as a dense set of points on the AC=AT line.
The second data set used was an alignment of 31 sequences of 9030 sites of coding hepatitis C virus (HCV) from [28]. It was split into two partitions, the first containing codon 1 and 2 positions (6020 sites) and the second all codon 3 positions (3010 sites). Figure 6 left show the model distributions for the first partition at the top and second at the bottom. The 95% credible sets contain just 7 and 6 models respectively, much smaller than those for the primate data as one would expect from using longer, more informative sequences. Note that the models pre- partition, jModelTest recommends TIM2+I+G. TIM2 is model 121,343, the model with highest posterior probability according to bModelTest, as shown in Fig. 6. For the second partition, jModelTest recommends TVM+G, and though TVM is in the 95% credible set, it has a lower posterior probability than model 123421, which gets the highest posterior probability according to bModelTest. Running jModelTest on all 203 models, model 123,451 is preferred by both AIC and BIC, even though 123421 was considered by jModelTest. Again, we see a difference in heuristic likelihood and full Bayesian approaches. The correlation between transition rates A ↔ G and C ↔ T as well as between two transversion rates A ↔ C and A ↔ T are shown in Fig. 6 top middle and right for the first partition and Fig. 6 bottom middle and right for the second. The transition rates A ↔ G and C ↔ T have a posterior probability of being the same of 0.024 in the first partition, whereas the posterior probability is 0.66 in the second partition containing only 3rd positions of the codons. This leads to most models for the first partition distinguishing between A ↔ G and C ↔ T, while for the second partition most models share these rates. For the two transversion rates A ↔ C and A ↔ T the partitions display the opposite relationship, with the second partition preferring to distinguish them. As a result, overall the two partitions only have one model in common in their respective 95% credible sets, but that model (GTR) has quite low posterior probability for both partitions.

Implementation details
The calculation of the tree likelihood typically consumes the bulk ( 90%) of computational time. Note that for a category with invariable sites, the rate is zero, hence only sites that are invariant (allowing for missing data) contribute to the tree likelihood. The contribution is 1 for those sites for any tree and for any parameter setting, so by counting the number of invariant sites, the tree likelihood can be calculated in constant time. Switching between with and without gamma rate heterogeneity means switching between one and k rate categories, which requires k time as much calculation. Having two tree likelihood objects, one for each of these two scenarios, and a switch object that selects the one required allows use of the BEAST 2 updating mechanism [9] so that only the tree likelihood that needs updating is performing calculations. So, jModelTest and bModelTest can, but do not necessarily agree on the most appropriate model to use.