bModelTest : Bayesian site model selection for nucleotide data

bModelTest allows for a Bayesian approach to inferring a site model for phylogenetic analysis. It is based on trans dimensional MCMC proposals that allow switching between substitution models, whether gamma rate heterogeneity is used and whether a proportion of the sites is invariant. The model can be used with the set of reversible models on nucleotides, but we also introduce other sets of substitution models, and show how to use these sets of models. With the method, the site model can be inferred during the MCMC analysis and does not need to be pre-determined, as is now often the case in practice, by likelihood based methods.


Introduction
One of the choices that need to be made when performing a Bayesian phylogenetic analysis is which site model to use.A common approach is to use a likelihood based method like ModelTest [16], JModelTest [15], or JModelTest2 [4] to determine the substitution model and whether to use gamma rate heterogeneity [21] and/or invariant sites.These three components form what we call the site model.The site model recommended by such likelihood based method is then used in the Bayesian analysis.A more elegant method is to co-estimate site model and phylogeny in a single analysis, thus preventing possible bias caused by the phylogeny assumed by the above methods.Obviously, in a Bayesian setting it is more natural to use Bayesian methods of model selection.
One way to select substitution models of DNA evolution is to use reversible jump between all possible reversible models [8], or just a nested set of model [2].An alternative is to use stochastic Bayesian variable selection [5], though this does not address whether to use gamma rate heterogeneity or invariant sites.Wu et al. [20] use reversible jump for substitution models and furthermore select for each site whether to use gamma rate heterogeneity or not.Since the method divides sites among a set of substitution models, it does not address invariant sites.
We introduce a method that estimates all three components of the site model simultaneously, which combines selection of substitution model as with selection for rate heterogeneity model and invariant sites, all of these are using reversible jump.The method is implemented in the bModelTest package of BEAST 2 [3] with GUI support for BEAUti making it trivial to use.It is open source and available under LGPL licence.Source code, installation instructions and documentation can be found at https://github.com/BEAST2-Dev/bModelTest.

Method
Symmetric nucleotide models can be represented by a 4x4 rate matrix with six rates r ac , r ag , r at , r cg , r ct and r gt .A particular grouping of rates can conveniently be represented by a six figure number where each of the six numbers corresponds to one of the six rates in the order listed before respectively.Rates that are grouped have the same number.For examples, model 123456 corresponds to a model where all rates are independent.This is also known as the GTR model.Model 1211211 corresponds to the HKY model.By convention, the lowest number representing a model is used, so even though 646646 and 212212 represent the same model, we only use 121121.
There are 203 reversible models in total [8].However, it is well known that transitions (A↔C, and G↔T mutations) are much more likely than tranversions (the other mutations) [13,17].Hence grouping transition rates with transversion rates is not appropriate and these rates should be treated differently.If we restrict the set of substitution model such that we only allow grouping within transitions and within transversions (with the exception of model 111111, where all rates are grouped), only 31 models are left (see Figure 1 and details in Appendix).If one is interested in using named models, we can use Jukes Cantor [9,6] (111111), HKY [10] (121121), TN93 [18] (121131), K81 [11] (123321), TIM [14] (123341), TVM [14] (123421),and GTR [19] (123456).However, to facilitate stepping between TIM and GTR during the MCMC (see proposals below) we like to use nested models, and models 123345 and 123451 nicely fit between TIM and GTR, leaving us with a set of 7 models (Figure 1).
The state space consist of the following parameters: • the model number M , • a variable size rate parameter (depending on model number) R, • a flag to indicate whether 1 or 4 gamma categories should be used, • a shape parameter α, used for gamma rate heterogeneity when there are 4 gamma categories, Figure 1: Model sets supported by bModelTest, all reversible models (left), transition/tranversion split models (middle), and named models (right).Arrows indicate which models can be reached by splitting a model.Note all models with the same number of groupings are at the same height.
• a flag to indicate a category for invariant sites should be used, • the proportion of invariant sites p inv , Rates rates r ac , r ag , r at , r cg , r ct and r gt are determined from M and R. Further, we restrict R such that the sum of the six rates r .. equals 6 in order to ensure identifiability.

Prior
By default, bModelTest uses the flat Dirichlet prior on rates from [8] taking the linear transformation in account.From emperical studies [13,17], we know that transition rates tend to be higher than transition rates.It makes sense to encode this information in our prior and bModelTest allows for rates to get a different prior on transition rates (default log normal with mean 1 and standard deviation of 1.25) and transition rates (default exponential with mean 1).
An obvious choice for the prior on models is to use a uniform prior on all models.As Figure 1 shows, there are many more models with 3 parameters than with 1.An alternative allowed in bModelTest is to use a uniform prior on the number of groups.That way, Jukes Cantor and GTR get a prior probability of 1/6, since these are the only models with 1 and 6 groups respectively.Depending on the model set, a much lower probability is assigned to each of the other models such that the sum over models with k groups is 1/6.
For p inv a uniform prior with interval (0, 1) is used and for α an exponential with mean 1.These priors only affect the posterior when the respective flags are set.

MCMC proposals
The probability of acceptance of a (possibly trans-dimensional) proposal [7] is min{1, posterior ratio × proposal ratio × Jacobian} where the posterior ratio is the posterior of the proposed state S divided by that of the current state S, the proposal ratio the probability of moving from S to S divided by the probability of moving back from S to S, and the Jacobian is the determinant of the matrix of partial derivatives of the parameters in the propsoed state with respect to that of the current state.

Model merge/split proposal
For splitting (or merging) substitution models, suppose we start with a model M .To determine the proposed model M , we randomly select one of the child (or parent) nodes in the graph (as shown in Figure 1).This differs from Huelsenbeck's approach [8] where first randomly a group is selected, then randomly a subgrouping created, since the graph based method is easier to generalise to other model sets (e.g. the one used in [12]).If there are no candidates to split (that is, model M = 123456 is GTR) the proposal returns the current state (this proposal is important to guarantee uniform sampling of models).Likewise, when attempting to merge model M = 111111, the currents state is proposed (M = 111111).Let r be the rate of the group to be split.We have to generate two rates r i and r j for the split into groups of size n i and n j .To ensure rates sum to 6, we select u uniformly from the interval (−n i r, n j r) and set r i = r + u/n i and r j = r − u/n j .
For a merge proposal, the rate of the merged group r from two split groups i and j with sizes n i and n j , as well as rates r i and r j is calculated as r = niri+nj rj ni+nj .
When we select merge and split moves with equal probability, the proposal ratio for splitting becomes where |M split | (and |M merge |) is the number of possible candidates to split (and merge) into from model M (and M respectively).The proposal ratio for merging is The Jacobian for splitting is ni+nj ninj and for merging it is ninj ni+nj .

Rate exchange proposal
The rate exchange proposal randomly selects two groups, and exchanges a random amount such that the condition that all six rates sum to 6 is met.A random number is selected from the interval [0, δ] where δ is a parameter of the proposal (0.15 appears a good number).Let n i , r i , n j and r j as before, then the new rates are r i = r i − u and r j = r j + u ni nj .The proposal fails when r i < 0. The proposal ratio as well as the Jacobian are 1.

Birth/death proposal
Birth and death proposals set or unset the category count flag and sample a new value for α from the prior when the flag is set.The proposal ratio is d(α ) for birth and 1/d(α) for death where d(.) is the density used to sample from.Likewise for setting the proportion invariant flag and sampling p inv from the prior.The Jacobian is 1 for all these proposals.

Scale proposal
For the α, we use the standard scale operator in BEAST 2 [3], adapted so it only samples if the category count flag is set for α.Likewise, for pInv this scale operator is used, but only if the proportion invariant flag is set.

Results
Since implementation of the split/merge and rate exchange proposals is not straightforward, nor getting the math right for proposal ratio and Jacobian, unit tests were written to guarantee their correctness and lack of bias in proposals.
To validate the method we performed a simulation study by drawing site models from the prior, then used these models to generate sequence data of 10K sites length on a tree (in Newick (A:0.2,(B:0.15,C:0.15):0.05))with three taxa under a strict clock.The data was analysed using a Yule tree prior, a strict clock and bModelTest as site model with uniform prior over models and exponential with mean one for tranversions and log-normal with mean one and variance 1.25 for transition rates.A hundred alignments were generated with gamma rate heterogeneity and a hundred without rate heterogeneity using a BEASTShell [1] script.Invariant sites can be generated in the process and are left in the alignment.
Comparing the model used to generate the alignments with inferred models is best done by comparing the individual rates of these models.Figure 2 shows the rate estimates for the six rates agains the rates used to generate the data.Clearly, there is a high correlation between the estimated rates and the ones used to generate (R 2 > 0.99 for all rates).Results were similar with and without rate heterogeneity.Note values for rates AG and CT (middle panels) tend to be higher than the transversion rates due to the prior they are drawn from.Figure 3 shows histograms with the proportion of time gamma shapes were used during the MCMC run.When data was generated without gamma shapes, gamma rate heterogeneity was not used most of the time (left of Figure 3), while gamma rates were used for most of the analyses most of the time when gamma rate heterogeneity was present (middle of Figure 3). 1 When rate heterogeneity was present, shape estimates were fairly close to the ones used to generate the data (right of Figure 3).However, there were quite a few outliers, especially when the shape parameter was low, the estimate tended to be biased upward.This can happen due to the fact that when the gamma shape is small, a large proportion of sites gets a very low rate and the invariant category can model those instances.The mean number of invariant sites was 6083 when no rate heterogeneity was used, while it was 6907 when rate heterogeneity was used, a difference of about 8% of the sites.
Figure 4 shows similar plots as Figure 3 but for the proportion invariant.The actual proportion invariant was determined empirically by counting the number of invariant sites in the generated alignments. 2 It appears that if there is less than 60% invariant sites, adding a category to model them does not give a better fit.When invariant rates were used they had a high correlation (R 2 = 0.77) with that of the empirical proportion invariant.
To compare results with jModelTest 2 [4], we used gopher data (available from BEAST 2 as file examples/nexus/gopher.nex)and GTR+G is given the highest likelihood score.So, it may be tempting to just use that model.How-  ever, with bModelTest we get a more nuanced picture (Figure 5) showing GTR only makes a small fraction of the posterior.Further, 78% of the time gamma rate heterogeneity is in use, while 60% of the posterior has an invariant site category.
The calculation of the tree likelihood consumes the bulk (>> 90%) of computational time.Note that for a category with invariant sites, the rate is zero, hence only sites that are invariant (allowing for missing data) contribute to the tree likelihood.The contribution is 1 for those sites for any tree and for any parameter setting, so by counting the number of invariant sites, the tree likelihood can be calculated in constant time.Switching between with and without gamma rate heterogeneity means switching between one and k rate categories, which requires k time as much calculation.Having two tree likelihood objects, one for each of these two scenarios, and a switch object that selects the one required allows use of the BEAST 2 updating mechanism [5] so that only the tree likelihood that needs updating is performing calculations.

Conclusion
bModelTest is a BEAST package which can be used in any analysis where trees are estimated based on DNA sequences, such as multi-species coalescent analysis (*BEAST), various forms of phylogeographical analyses, sampled ancestor analysis, demographic reconstruction using coalescent, birth death skyline analysis, etc.The GUI support provided through BEAUti makes it easy to set up an analysis with the bModelTest site model: just select bModelTest instead of the default gamma site model from the combo box in the site model panel.
bModelTest allows estimation of the site model using a full Bayesian approach, without the need to rely on non-Bayesian tools for determining aspects of the site model.

Appendix
list of all transition/tranversion split models model number r ac r ag r at r cg r ct r gt name 0 1

Figure 2 :
Figure 2: Rate original (horizontal) against estimates (vertical) in simulated data.In reading order, rate AC, AG, AT, CG, CT and GT.

Figure 4 :Figure 5 :
Figure 4: Proportion of time proportion invariant is used without (left) and with (middle) rate heterogeneity used to generate the data.Right, empirical proportion invariant in alignment (horizontal) against estimates (vertical) when rate heterogeneity is used to generate the data.
Figure 3: Proportion of time gamma rate heterogeneity is used without (left)and with (middle) rate heterogeneity used to generate the data.Right, gamma shape original (horizontal) against estimates (vertical) when rate heterogeneity is used to generate the data.