EM for phylogenetic topology reconstruction on nonhomogeneous data
© Ibáñez-Marcelo and Casanellas; licensee BioMed Central Ltd. 2014
Received: 19 November 2013
Accepted: 11 June 2014
Published: 17 June 2014
Skip to main content
© Ibáñez-Marcelo and Casanellas; licensee BioMed Central Ltd. 2014
Received: 19 November 2013
Accepted: 11 June 2014
Published: 17 June 2014
The reconstruction of the phylogenetic tree topology of four taxa is, still nowadays, one of the main challenges in phylogenetics. Its difficulties lie in considering not too restrictive evolutionary models, and correctly dealing with the long-branch attraction problem. The correct reconstruction of 4-taxon trees is crucial for making quartet-based methods work and being able to recover large phylogenies.
We adapt the well known expectation-maximization algorithm to evolutionary Markov models on phylogenetic 4-taxon trees. We then use this algorithm to estimate the substitution parameters, compute the corresponding likelihood, and to infer the most likely quartet.
In this paper we consider an expectation-maximization method for maximizing the likelihood of (time nonhomogeneous) evolutionary Markov models on trees. We study its success on reconstructing 4-taxon topologies and its performance as input method in quartet-based phylogenetic reconstruction methods such as QFIT and QuartetSuite. Our results show that the method proposed here outperforms neighbor-joining and the usual (time-homogeneous continuous-time) maximum likelihood methods on 4-leaved trees with among-lineage instantaneous rate heterogeneity, and perform similarly to usual continuous-time maximum-likelihood when data satisfies the assumptions of both methods.
The method presented in this paper is well suited for reconstructing the topology of any number of taxa via quartet-based methods and is highly accurate, specially regarding largely divergent trees and time nonhomogeneous data.
Obtaining a good method for reconstructing the phylogenetic topology of four taxa is one of the crucial goals in phylogenetics. The four-taxon trees, if correctly inferred, can be used as input of quartet-based methods in order to reconstruct larger trees. But due to the complexity of real data, the problem of reconstructing four-taxon trees is not so easy. Most phylogenetic reconstruction methods assume simple evolutionary models that may not really fit real data, which leads to incorrect phylogenetic inference ( [1–5]). For example, many of them rely on continuous-time Markov processes with a constant instantaneous mutation rate matrix along the tree (the so-called global homogeneity), and also assume time-reversibility (and hence stationarity). On the other hand, as pointed out in , in order to make quartet-based methods work it is extremely important to obtain 4-taxon tree reconstruction methods that are not affected by the presence of long-branch attraction .
Most evolutionary models are described by a Markov process over the tree, that is, conditional rates of change at two diverging sequences depend only on the current state and are independent of the previous sates ( , chapter 8.2). Markov processes on trees are specified by a distribution at the root of the tree and a transition matrix at each branch and, in contrast to continuous-time models, the Markov process on each branch is not assumed to be time-homogeneous  (that is, they are locally heterogeneous). These models directly consider as parameters the entries of the substitution matrices and the root distribution (see , chapter 8, , § 4.2). Barry and Hartigan  considered such a general Markov model (henceforth called GMM), which does not assume any other constraints. In particular, it is locally and globally time nonhomogeneous (instantaneous substitution rates are not constant on any edge, nor on the whole tree) and it is neither time-reversible nor stationary. The only restriction underlying this model is that sites evolve independently and are identically distributed (i.i.d. hypothesis). Considering this model and its submodels is one way of covering more general scenarios, in contrast to those phylogenetic methods that implement time-homogeneous and time-reversible continuous-time models (GTR and its submodels) cf .
The GMM above accounts for 12 parameters per edge plus three parameters for the root distribution. When some symmetries on the transition matrices or on the root distribution are imposed, one obtains the substitution matrices of the corresponding Jukes-Cantor and Kimura (2 and 3 parameters) models among others (( , § 4.2), , see Methods section). For instance, the Markov version of the K81 model  (henceforth referred to as K81*) deals with 3 parameters per edge (one for the conditional probability of transitions, and two for the two types of transversions, see the Methods section) which makes a total of 3 ∗ (2n - 3) parameters in unrooted trivalent trees with n leaves, whereas the usual time-homogeneous continuous-time version accounts for 2 parameters for a normalized instantaneous rate matrix constant over the tree plus one parameter per edge length (that is, 2 + (2n - 3) parameters on an unrooted trivalent tree). Notice, however, that if one considers a time nonhomogeneous continuous-time Kimura 3-parameter model, then the number of parameters is exactly the same as for K81*. In this case, the only difference between both models is that K81* does not even assume local homogeneity (that is, time homogeneity over each branch), while all time-continuous models do. The huge amount of parameters for nonhomogeneous models makes a maximum-likelihood approach unfeasible and unreliable for a whole tree on n taxa if n is large. Nevertheless, there is some hope that these more general models lead to accurate methods on 4-taxon trees. In our setting, we only deal with substitutions of nucleotides (not aminoacids) on 4-taxon trees and we will always assume the i.i.d. hypothesis (thus excluding the possibility of heterogeneity across sites).
In this paper we develop a maximum likelihood framework for inferring the best tree topology under (general) Markov processes. Our approach is based on the widely used Expectation-Maximization algorithm. The Expectation-Maximization algorithm (EM), as introduced in , is an iterative algorithm for finding maximum likelihood estimates of parameters in statistical models that deal with unobserved data. We have adapted this algorithm to the case of phylogenetic 4-taxon trees in what we call EMtree. EM iteratively gives an expectation of the distribution of the nucleotide sequences at the interior nodes (this is called the E-step) and finds the parameters that maximize the likelihood for these data in the so-called M-step (because the parameters for which the maximum likelihood is achieved can be computed in a closed form for complete data). The EM algorithm has been applied to many other disciplines (see for example ).
The use of the EM algorithm to estimate the continuous parameters of a phylogenetic tree under a Markov process (namely, the root distribution and the entries of the transition matrices) has been already discussed in  and . In this paper we focus on the use of EMtree for estimating the tree topology for four taxa and test its performance in reconstructing the correct tree topology on simulated data. We shall see that, although it is a time consuming algorithm, it can lead to very accurate results.
First of all we test it as a 4-taxon tree reconstruction method by using the tree space proposed by Huelsenbeck  on data simulated from a general Markov process first and then restricting to a time-homogeneous (continuous-time) process. We compare EMtree to neighbor-joining and to the usual (continuous-time) maximum likelihood approach under both global homogeneity and nonhomogeneity (note that throughout the experiments we will restrict to time-reversible models to simulate and to recover trees). As all models considered will be stationary (and with uniform stationary distribution), we will not be evaluating compositional heterogeneity but only the effect of the variation of substitution rates among lineages . Afterwards, we use these three methods as input of two quartet-based methods: one weighted (Willson’s QuartetSuite [19, 20]) and one unweighted (Maximum Quartet Fit, QFIT as implemented in Clann ); and compare their performance on the 12-taxon trees proposed in . Specially for largely divergent trees, we observe that EMtree gives the best results and is less subject to long-branch attraction.
In this section we present the results of the topology reconstruction method EMtree on simulated data evolving both under homogeneous and nonhomogeneous Markov processes.
In order to compare EMtree to time nonhomogeneous but continuous-time ML, we generated data evolving under a continuous-time nonhomogeneous K81 model. This is a special case of K81* and therefore this model matches both approaches: EMtree (on its K81* version) and bppml  restricted to a nonhomogeneous continuous-time K81 model. On Figure 3 (right) we present the results of these two methods on nonhomogeneous K81 data to show that both methods perform similarly and outperform NJ.
Although EMtree may give more accurate results on more general data, it is a time consuming algorithm. Indeed, on 4-taxon trees, EMtree is almost 1000 times slower than NJ and more than 2000 times slower than ML. For example on 100 4-taxon alignments of 600bp, the execution time on a Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz (only using one of the four CPU) was 186.66s for EMtree on K81*, 0.06s for PAML ML on homogeneous continuous-time K81, 211.45 for bppml on nonhomogeneous K81 model, and 0.16s for NJ (with K81 distances).
Now we present the results that test the use of EMtree, NJ, and (time-homogeneous continuous-time) ML as methods to obtain the input for the quartet-based methods QuartetSuite and QFIT on data generated under K81* model.
In both quartet-based methods (QuartetSuite and QFIT), the reconstruction of the cc topology presents the best results compared to the others, independently of the algorithm used as input quartets (EMtree, NJ, or ML). The same tendency is shared by global ML and NJ. Conversely, the topology dd is never correctly reconstructed for any of the methods or branch lengths. It is worth noting that cc is the topology that would have the least long branch attraction and dd is the one that would have the most (cd is in between because half of the tree comes from cc and the other half comes from dd). Therefore our results are consistent with the observation in  that the success of a quartet based method depends on the capacity of the input method to correctly reconstruct 4-taxon trees under the long-branch attraction problem.
In both Figures 5 and 6 we observe that the performance of ML and EMtree is quite similar in most cases, although ML never outperforms EMtree. A detailed look at these figures reveals that for largely divergent trees (that is, b=0.1, the last bar in each plot), EMtree is the best quartet input method among those considered here, as its results outperform NJ and ML for both QuartetSuite and QFIT and in all tree topologies. We find the explanation of this result in the management of long-branch attraction by the different methods considered. Long branches lead to similar sequences as a result of multiple substitutions and, as ML estimates have been computed on a wrong substitution model, this method is more influenced by long-branch attraction. EMtree has also a better success than a global NJ and a global ML on the trees cd and dd, but on the “easiest” tree cc, a global ML is more accurate. For the largely divergent cc tree, a global ML has better average performance than EMtree (see also below), but it never succeeds in fully reconstructing the topology (whereas EMtree does in about 2% of the alignments).
When considering the QuartetSuite method, NJ is clearly the worst quartet input method for all tree topologies and branch lengths. This is probably due to the fact that the Willson method implemented in QuartetSuite is intended for weighted quartets, whereas NJ quartets are given only binary weights. On the contrary, for the unweighted method QFIT, NJ seems to be more accurate than ML and EMtree for topologies cc and dd but only for low divergence (b=0.005 or b=0.015).
We want to point out that, in general, QFIT gives worse results than QuartetSuite (except for the NJ algorithm used as input in the cc and cd topologies with low divergence), reinforcing the idea that weighted methods are more reliable [6, 23].
Mean of Robinson-Foulds distance with QuartetSuite and global ML and NJ (variance shown in parentheses), 600 bp
Mean Robinson-Foulds distane with QuartetSuite, 600 bp
Mean of Robinson-Foulds distance with QFIT and global ML and NJ (variance shown in parentheses), 600 bp
Mean Robinson-Foulds distance with QFIT, 600 bp
In these tables we observe that QuartetSuite gives the lowest distance to the original tree in general and that the best results for largely divergent trees (b = 0.1) are obtained by EMtree (for both QuartetSuite and QFIT and all tree topologies). Whenever the mean of Robinson-Foulds distance for (quartet-based) ML is lower than the mean for EMtree, there is no significant difference. Overall, EMtree is the method that outperforms the other two quartet input methods in most cases. As far as global methods are concerned, it is worth pointing out the bad performance of a global NJ on trees cd and dd (or cc with b = 0.05 or 0.1), specially if one takes into account that the Robinson-Foulds distance for the (less resolved) star tree is 9.
When inspecting the variances, one sees that EMtree is the only quartet input method that preserves low variances in all cases (global methods are the ones presenting lower variance, though). Conversely, the variances are extremely large for NJ with QuartetSuite in all different scenarios, and they are also huge for (quartet input) ML when the trees are largely divergent (b = 0.1). QFIT presents low variances in all cases, for all input methods, probably because the input information is less “subject to vary” (the input only considers tree topologies, not weights). It is worth pointing out that whenever quartet input NJ outperforms ML and EMtree (only for QFIT), it does so with larger variance than the other two methods.
We tested the accuracy of the (likelihood-based) method EMtree as a method to infer 4-taxon topologies under (time nonhomogeneous) Markov models, and compared it to NJ and ML (homogeneous and nonhomogeneous). When EMtree and ML are tested on data satisfying the assumptions of both methods, they have a similar performance. Nevertheless, EMtree is based on time nonhomogeneous models (both local and global time heterogeneity), and hence outperforms the other methods when these assume homogeneity.
There are only few nonhomogeneous continuous-time models that could be fairly compared to general Markov processes, and one expects that under more complex evolutionary scenarios (such as non-stationary or not time-reversible data), the success of usual ML or NJ methods (based on these assumptions) will be poorer (as shown in ), confirming that an EM approach based on general Markov processes, could be more recommendable.
EMtree is a time-consuming algorithm, however, and the user has to decide whether it is worth performing such an analysis (we only recommend it for at most 6 taxa).
We have also assessed EMtree, NJ, and ML as input for the quartet-based methods QFIT and QuartetSuite. To do so, we have considered three different topologies on twelve taxa evolving under tome nonhomogeneous processes, and a wide set of branch lengths values. EMtree turns out to be the input method that performs best in most cases on this type of data. Regardless of the quartet-based method chosen and the tree topology, EMtree gives the best results for trees with large divergence among taxa (b = 0.1).
Summing up, an EM approach on Markov models provides an accurate 4-taxon tree reconstruction method suited for data not known to satisfy homogeneity and very useful as input of quartet-based methods, specially for largely divergent trees. However, the method presented here is not valid for data violating the i.i.d. hypotheses, such as data with variation across sites (Gamma-rates, invariable sites, mixtures, and others), or dependency among sites. The method should be strongly modified in order to accommodate these generalizations, and this might be an interesting future project.
Input: Data D (multiple alignment of n sequences), unrooted trivalent tree topology T with n leaves (and edges), Markov model .
Initialization: Root the tree at some internal node and provide tentative initial values for the root distribution π and the substitution matrices S i ( ), and a threshold ε > 0.
E-step: Provide complete data cD (for all nodes in the tree) that maximizes the posterior probability (unique maximum that can be computed efficiently by Felsenstein’s algorithm ).
M-step: Compute the parameters that maximize the loglikelihood for these complete data (this maximum is unique and can be computed in a closed form, see the Additional file 1, section 2).
If , then set , ( ), and go back to E-step.
Output: and .
We set up the EM algorithm with ε = 10-3 and forced it to stop after 100 iterations if it had not finished.
Given four aligned nucleotide sequences s1, s2, s3, s4, we run the EM algorithm for the three possible trivalent trees on four taxa: T 1 = (s1,s2|s3,s4), T 2 = (s1,s3|s2,s4), T 2 = (s1,s4|s2,s3). Our tree reconstruction method returns the tree topology for which the likelihood output by EM is higher. This method is called EMtree throughout the paper.
on each branch e i . In the matrix above, the rows and columns are labeled by nucleotides adenine, cytosine, guanine and thymine (in this order), so that entry (j,k) stands for the conditional probability that nucleotide j at the parent node of edge e i is substituted by nucleotide k at the child node. When trees evolve under this model, the root becomes unidentifiable (different root locations may give rise to the same joint distribution at the leaves, ) and, as a consequence, one can only expect to reconstruct unrooted trees. This model does not assume a constant instantaneous mutation rate matrix over the tree and, therefore, trees evolving under this model are time nonhomogeneous. The more restrictive models K80* and JC* (Markov versions of Kimura 2-parameter and Jukes-Cantor models) are obtained by imposing b i = d i and b i = c i = d i , respectively ( , § 4.2). All of them are time-reversible (and hence stationary), and are part of the so-called group-based models ( [12, 25]). The most general model among Markov processes is the general Markov model GMM, which considers transition matrices and root distribution without any further restriction and is neither stationary nor time-reversible .
The other two methods that are confronted to EMtree in this paper are Neighbor-Joining (NJ) and a usual continuous-time maximum-likelihood (ML). The ML estimates for homogeneous continuous-time K81 model  (with a instantaneous mutation rate matrix constant over the tree) have been obtained by the free software PAML  (baseml program), whereas for nonhomogeneous continuous-time K81 model we used the program bppml in the Bio++ package . Neighbor-Joining was implemented considering the K81 distance . When a global ML was used on 12 taxa, we used PAML and set up the stepwise addition option on this software.
The simulated data in this paper has been produced using the program GenNon-h of  and Seq-Gen . GenNon-h produces directly transition matrices of the required branch length (for any of the Markov models described above) and therefore does not assume time-homogeneity (not globally over the tree, nor locally at the edges). Seq-Gen was used to generate data evolving under continuous-time K81 model (both homogeneous and globally nonhomogeneous). In order to generate nonhomogeneous data, we made the software generate various edges evolving at different instantaneous rates by recording ancestral sequences. Alternatively, the software Hetero  could be also used to generate (global) time nonhomogeneous continuous-time data.
In order to test the performance of the proposed method on four-taxon trees, we based our tests on the tree space proposed by Huelsenbeck , so that it is possible to compare our results to those obtained there with different phylogenetic reconstruction methods. In this tree space, two branch length parameters a, b on trees of four taxa are varied. Parameter a assigns the branch length to the internal branch and two opposite peripheral branches, and parameter b assigns the branch length to the two remaining branches as in Figure 2. Parameters a and b represent expected number of substitution per site and in this paper are varied from 0.01 to 0.75 in increments of 0.02.
For each pair (a,b), we simulated one hundred alignments of lengths 300 and 1000 basepairs (briefly bp) under the tree topology 12|34 (see Figure 2) and we inferred the topology using the three methods EMtree, ML, and NJ. The results for each method are shown in Figure 1.
In order to assess the four-taxon tree reconstruction method EMtree proposed above, it is important to test its performance in quartet-based reconstruction methods. To do so, we considered two of these methods: Maximum Quartet Fit (QFIT) and the method proposed by Willson in . QFIT choses the supertree that shares the maximum number of quartets with the source trees, whereas the other method choses the supertree that optimizes a certain criterion based on the weights assigned at the input quartets. We use the implementation of QFIT distributed in Clann  and the implementation of Willson’s method called QuartetSuite in . While QuartetSuite produces a tree that is expected to minimize inconsistencies with quartets, QFIT uses heuristics to search through the tree space (and we restricted this search to 100000 trees). As our intention was not to evaluate quartet-based methods but testing our 4-taxon method in comparison to others, our criterion to choose these two quartet-based methods was the fact that they were freely available and that one of them allowed the use of weighted quartets.
In QuartetSuite weights are understood as -log of the frequency of a quartet, so that weights are nonnegative and zero denotes the tree with most support. Therefore we used the opposite of the loglikelihood output by EMtree and baseml as weights of the corresponding quartets. Weights for NJ were set up to be 0 for the topology output by NJ and 999 for the other two trees.
We followed  to test the performance of our method as input of quartet-based method. In that paper, the authors consider three tree topologies on 12 taxa, denoted as cc, cd and dd (see Figure 4), and fix the proportions among their branch lengths in order to compare different reconstruction methods. A parameter b denoting the internal branch lengths is varied between 0.005, 0.015, 0.05, and 0.1, which gives a maximum pairwise divergence along the tree of about 0.1, 0.3, 1.0, and 2.0 nucleotides per site, respectively. The lengths of the simulated alignments are 300 and 600 bp.
For each of these scenarios we generated 1000 alignments using GenNon-h and the Kimura 3-parameter model as explained above. For each alignment, the tree was estimated using QFIT and QuartetSuite with the four-taxon methods EMtree, ML and NJ as input. Then the Robinson-Foulds distance to the original tree (that is, the number of partitions that are present in one tree but not in the other) was computed using DendroPy symmetric_difference function . The results are shown in Figures 5, 6, A1 and A2 (in the Additional file 1), and in Tables 1, 2, and S1, S2 (in the Additional file 1), and explained in the Results and discussion section.
MC acknowledges funding from Ministerio de Ciencia y Competitividad of Government of Spain, project MTM2012-38122-C03-01/FEDER, and Generalitat de Catalunya, 2009 SGR 1284. We thank the editor and reviewers for highly valuable comments that lead to major improvements of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.