Looking for trees in the forest: summary tree from posterior samples
 Joseph Heled^{1, 2}Email author and
 Remco R Bouckaert^{1, 2}
DOI: 10.1186/1471214813221
© Heled and Bouckaert; licensee BioMed Central Ltd. 2013
Received: 18 July 2013
Accepted: 18 September 2013
Published: 4 October 2013
Abstract
Background
Bayesian phylogenetic analysis generates a set of trees which are often condensed into a single tree representing the whole set. Many methods exist for selecting a representative topology for a set of unrooted trees, few exist for assigning branch lengths to a fixed topology, and even fewer for simultaneously setting the topology and branch lengths. However, there is very little research into locating a good representative for a set of rooted time trees like the ones obtained from a BEAST analysis.
Results
We empirically compare new and known methods for generating a summary tree. Some new methods are motivated by mathematical constructions such as tree metrics, while the rest employ tree concepts which work well in practice. These use more of the posterior than existing methods, which discard information not directly mapped to the chosen topology. Using results from a large number of simulations we assess the quality of a summary tree, measuring (a) how well it explains the sequence data under the model and (b) how close it is to the “truth”, i.e to the tree used to generate the sequences.
Conclusions
Our simulations indicate that no single method is “best”. Methods producing good divergence time estimates have poor branch lengths and lower model fit, and vice versa. Using the results presented here, a user can choose the appropriate method based on the purpose of the summary tree.
Background
Bayesian Markov Chain Monte Carlo (MCMC) analysis provides powerful and popular techniques for performing phylogenetic analysis. The result of such an analysis is a set of trees drawn from the posterior distribution. The set of correlated draws is often condensed into a single tree for visualistion, comprehension, annotations and presentation. When most trees agree in topology and branch lengths, the most frequent tree topology, properly annotated, can give a fair representation of the posterior distribution.
However, a single tree can be misleading, especially when the agreement between posterior trees is small. Posterior tree topologies can be reduced to a set of common subtopologies [1], but this is also a fragmented view of the posterior. Tree drawing programs such as FigTree [2] can annotate internal nodes with the clade posterior support (the fraction of posterior trees containing the clade), and the credible interval of internal node ages.
Still, the choice of any specific topology highlights one alternative at the expense of others. The tree drawing program DensiTree draws all posterior trees transparently [3]. Where most trees agree in topology and node height, lines are close to each other and distinct edges appear, while areas of uncertainty in topology or heights remain a blur. The composite image allows a direct assessment of posterior support and node height uncertainty by visual inspection. But even the display of the full posterior can be hard to interpret when the uncertainty gets large, and a summary tree overlaid on top can be useful in such situations.
There are many ways of obtaining a “representative” tree topology from a collection of trees. One group of methods look for consensus among the trees using splits, clusters or rooted triplets/quartets present in posterior trees [4]. The TreeAnnotator utility in BEAST [5] uses the clade frequencies as estimated by the posterior to score each tree, selecting the rooted topology of the highest scoring tree amongst the set. This results in a fully resolved topology with a non zero support, in contrast to consensus methods which often produce unresolved trees. A recently published method uses conditional clade splits probabilities to compute a probability for each posterior tree [6].
However, selecting a representative topology is only the first step in generating a summary tree from the output of programs such as BEAST. BEAST trees are rooted with branches proportional to time, as is the summary tree. In the second step, TreeAnnotator assigns a divergence time for each clade using the ages of matching clades from the posterior. If the number of trees containing the clade is small, the divergence time estimate can have high variance which may result in negative branch lengths. Clades in trees from the MCMC samples which do not appear in the summary tree are essentially ignored, and sometimes a large proportion of the posterior goes unrepresented. Ignoring nonmatching parts appears to be the accepted practice and is used in the SumTrees utility in DendroPy [7].
In this paper we describe several new ways for building rooted summary trees. These new constructions use more of the information contained in the posterior even when the disagreement between posterior trees is high. Some of the methods are based on rooted tree distances, and are similar in spirit to the method developed by Huggins et al. for unrooted trees [8]. We perform an extensive simulation study and compare the trees from all methods using multiple criteria. Summary trees are assessed with respect to the posterior and by their distance to the tree used in generating the sequence data. The methods are implemented in biopy [9], and integrated with DensiTree, making it easy to examine the summary tree in the context of the full posterior.
Methods
Definitions and notations
We define a rooted tree as a collection of clades with ages. Specifically, a tree is a strict hierarchy of clades, where each clade is a subset of the taxa, and a nonnegative age is associated with each clade.
Formally, a tree T is a triplet $(L,\u2102,h)$, where $L=\left\{{x}_{1},{x}_{2},\dots ,{x}_{l}\right\}$ is the set of taxa and $\u2102=\left\{{C}_{1},{C}_{2},\dots ,{C}_{n}\right\}$ is a set of clades. Each clade C_{ i }⊆L is a subset of taxa, and $h:\u2102\to {\mathbb{R}}_{\ge 0}$ is a function assigning an age to the clade. The set $\u2102$ describes only the clades hierarchy and is referred to as the tree topology. Sometimes we shall use c∈T as a shorthand for $T=(L,\u2102,h)\text{and}c\in \u2102$ (“clade c is present in tree T”).
To qualify as a tree, the following conditions must hold:
i The tree contains all leaves: ${\forall}_{i}\left\{{x}_{i}\right\}\in \u2102$.
ii The tree contains a root: $L\in \u2102$.
iii Strict hierarchy of clades: for any two clades ${c}_{1},{c}_{2}\in \u2102$, either C_{1}⊂C_{2}, C_{2}⊂C_{1} or C_{1}∩C_{2}=∅. (Note that C_{1}⊂C_{2} implies C_{1}≠C_{2}, otherwise we write C_{1}⊆C_{2}.)
iv NonNegative branches: for ${c}_{1},{c}_{2}\in \u2102$, c_{1}⊂c_{2}⇒h(c_{1})≤h(c_{2}).
For any clade c, the elements in the set $A\left(c\right)=\{y\in \u2102:c\subset y\}$ are called ancestors of c, and the minimal element P(c) in A is the parent of c. Every clade except the root has a parent and by association a branch to its parent with length b(c)=h(P(c))−h(c). For convenience, the branch length of a subset not in is defined as zero. Any subset of taxa x has a Most Recent Common Ancestor in the tree, the minimal clade containing all members of x. Formally, ca(x) is the minimal element of $\{y\in \u2102:x\subseteq y\}$. For brevity we omit the tree when the context is clear, and use ca(c,T) to explicitly associate the clade with the tree T.
Extending the domain of b(·) to all taxa subsets simplifies definitions involving sets of trees with different topologies. We extend h(·) for the same reason and define the age of any subset x⊆L to be the age of the common ancestor of x,$\stackrel{\u0304}{h}\left(x\right)=h\left(\text{ca}\right(x\left)\right)$.
The clade and divergence errors are equal for trees with the same topology, but they differ when topologies disagree, and the difference usually increases with the distance in topology.
The posterior frequency of a subset not in any of the trees is zero.
Distance between trees
The Rooted Branch Score (RBS) measures the distance between two rooted time trees, and is the total sum of the difference in branch lengths of matching clades. This definition is motivated by the distance between unrooted trees [11], but the space of rooted trees is more complex than its unrooted counterpart since branch lengths are not free to vary independently of each other [12]. Since by convention the branch length of a missing clade is zero, any clade present only in one tree contributes its total length to the score.
The heights score is a (nonoptimal) edit distance, where the score is the total sum of a sequence of moves which transform one tree into the other. Each move involves sliding an internal node, and two nodes may “merge” into one when they meet.
RBS and RAS are metrics in tree space, while SRBS and HS are not. RBS is a metric since branches can be mapped to the vector space ${\mathbb{R}}^{{2}^{n}1}$[8], and a similar argument works for RAS. However, we only require that distances are semimetrics and make no use of the triangle inequality.
Summary trees
BEAST Tree annotator
Since each age is set independently, the end result is not guaranteed to be a tree (condition iii). A few “negative branches” are not an unusual occurrence in trees with a medium to large number of taxa and moderate posterior uncertainty.
Minimum distance trees
where d_{ T } is one of the tree scores defined previously. A Minimum Distance Tree is a tree which minimizes $d(T,\mathbb{T})$. While the definition is simple and natural, the details are not. First, the minimal tree is not necessarily unique; there might be several or even an infinite number of minimal trees in some cases. Second, with anything more than a few taxa the space of trees is vast and topologically complex, so there is no guarantee of finding the minimal tree. We therefore limit the search to the topologies present in the posterior, and designate this approach by a lowercase ‘m’ followed by the distance method (mRBS, mRAS, etc). However, even this can be time consuming when the posterior contains many topologies, and in addition we examine a family of methods which consider just a single topology, using one of the heuristics outlined in the next section. The details about the algorithm for searching the best branch assignment for a specific topology are in Appendix 2.
Selecting a topology
All of the two stage methods we considered selects a topology first and assign branch lengths conditional on that topology. We examined three alternatives to the MCC for selecting a topology.
The first alternative uses the recently published Conditional Clade Probability Distribution (CCD). The CCD computes a probability for each tree based upon the posterior probability of the splits in the tree, conditional on the clade posterior frequency [6]. The second is a the Total Clade Branch (TCB), which assigns a score to each clade in the tree equal to the total length of matching branches in the posterior. The total length reflects the support for a clade by combining both the frequency (the number of trees with the clade) and confidence, under the assumption that longer branches are more likely to be “real” than shorter branches. The third is the Highest Posterior Frequency (HPF), which picks the topology of the tree most frequent in the posterior. To break ties, the HPF picks the tree whose height is closest to the mean root height of the posterior.
CA Tree
The generated ages always produce a tree, since $x\subset y\Rightarrow \stackrel{\u0304}{h}(x,{T}_{i})\le \stackrel{\u0304}{h}(y,{T}_{i})$. Unlike TreeAnnotator, which may end up using a small number of values for some clades, CAT uses $\left\mathbb{T}\right$ posterior values for estimating the age of each clade.
Taxa partitions tree
We now present the Taxa Partition (TP) tree, a single stage method which does not commit to a particular topology before assigning ages. The TP is inspired by the tree operator described by Mau et al[13]. In this representation each internal node is assigned a left/right orientation, inducing a linear order on the taxa and positioning each internal node between two tips (Figures one and two in [13]). We reverse the process by first ordering the taxa, then using the posterior to assign the ages between tips and finally reconstructing the tree topology from the ages.
For a given ordering of taxa, each posterior tree provides ages according to its topology. A clade contributes an age if it spans an unbroken range in the ordering. For example, for the order [a b c d], the tree (((a,b),c),d) contributes the age of (a,b) to [a  bcd], the age of ((a,b),c) to [ab  cd] and the root height to [abc  d]. The tree ((a,d),(b,c)) contributes only the age of (b,c) to [ab  cd]. (a,((d,b),c)) contribute only its root height to [a  bcd].
After collecting ages for all splits from the posterior, a point estimate of the height at each split is used to build the tree. The precise definitions are given in Appendix 2.
TP incorporates clade ages from competing topologies before committing to the final topology. For example, take the set with a mixture of two topologies, ((a,b),c) and (a,(b,c)). With the obvious ordering [a b c], TP uses all ages in every tree, and the choice between the two topologies is determined by the age of the [ab  c] and [a  bc] splits. If [ab  c] is higher we end up with ((a,b),c), otherwise with (a,(b,c)).
Finding an optimal ordering is hard. Assigning an orientation which minimizes the distance between taxa orders of just two trees is NP complete [14]. We use a fast heuristic which proved effective in practice: build a distance matrix for pairs of taxa and use simple clustering to build the ordering. The distance between taxa a and b in each tree is the size of the clade of their common ancestor, d(a,b)=ca({a,b}). The overall distance is the mean of pair distances over all posterior trees. The clustering starts with each taxon in its own group, then progressively joins the two closest groups.
Test cases
To evaluate the various methods we generated 2000 test cases, divided into 20 groups of 100 repeats. For each case, a tree with n tips was drawn from the Kingman coalescent distribution [15] with population size N_{ e }. All repeats shared the same n and N_{ e }, and each group was assigned one pair from the 5x4 grid formed by n=8,16,32,64,128 and N_{ e }=1,2,4,8.
A sequence of length 800bp was generated for the tips of the tree, starting with an ancestral sequence at the root and mutating the sequence along the branches using the JukesCantor substitution model [16] with a mutation rate of 0.005. The sequences were analyzed using BEAST2 [17] under the same model (JukesCantor and a coalescent prior with constant population size). The tree and population size were estimated but the mutation rate was fixed at its true value. The chain was 2.2M steps, sampled every 2k steps. 200k of the initial samples were discarded (burnin), leaving 1000 posterior samples. Those were used as input for building a summary tree by each of the methods under consideration.
The test trees contain 8 to 128 tips and range (on average) from a height of 0.01 substitutions to 0.08, or 2 to 16 million years for a nuclear mammalian gene. Sampling the posterior of such trees normally requires a longer MCMC chain, but here a relatively short one is sufficient. The data was generated under a simple model and the exact same model is used for inference, resulting in excellent mixing. Not only was the effective sample size high for all parameters, we made sure the clades were adequately sampled by running a second independent chain, starting with a different seed. We then computed the maximum of the absolute difference between posterior frequency of all clades; this number was well below 5% in most settings, and around 6% for the most diffuse case (128 tips and height of 0.01 substitutions).
The posterior for trees with 32 and more taxa was completely diffuse, with a distinct topology for each sample. Even the easiest cases (n=8 and N_{ e }=8) contained between 1 and 45 distinct topologies, with a mean of 6. Also note that even when the posterior has a single topology, a method may do better that others by setting more accurate branch lengths.
Summary trees were compared using two main criteria: accuracy in estimating ages and model fit. The first criteria was broken into 3 related error measures: accuracy in estimating the root height, accuracy in estimating clade ages (equation 1) and accuracy in estimating divergence times (equation 4). The second criteria was also divided into three: the loglikelihood of the sequence data given the tree (tree likelihood), the loglikelihood of the tree under the coalescent (coalescent likelihood), and the overall model fit, which is the sum of the tree and coalescent likelihood.
How methods are ranked
The methods were compared by aggregating the results from all test cases. Let us take the root height as an example. For each test case, an error value is computed for each method by taking the absolute difference between the summery and true tree heights. Next, the methods are ranked by error using dense ranking (the 1223 rule). Finally, the mean rank of each method is computed by averaging its rank over all 2000 tests.
This scoring procedure was repeated (bootstrapped) 4000 times. In each repeat 2000 test cases are sampled (with replacement) from the pool of 2000 test cases, and a mean score computed for each method. Method A was deemed better than B only if A’s mean ranking was greater than B’s in 90% (3600) of the bootstraps. The method gets a final score of 0 (best) if no other method is better, and a score of R+1 if there is a better method of score R.
The same process is repeated, using not the rankings of errors but the normalized error values. The normalization takes the errors of each case and transforms them to have a mean of 0 and a variance of 1. This ranks the methods by the magnitude of the error they make compared to other methods.
This may seem overly complex but making a fair comparison requires extra care. The methods and error measures are correlated in both obvious and subtle ways. Multiple criteria allows for a more nuanced comparison. Ideally, the particular mix of methods should not matter: adding a duplicate (or a very close variant) of one method should not penalize the ranking of lesser methods. Using dense ranking should minimize those effects. Strong correlations exist between the test settings (N_{ e } and n) and the magnitude of errors, so aggregating results from the 20 groups requires some care. Rankings based on comparison alone are insensitive to those correlations, and the normalization of errors makes aggregation possible without going through the complex exercise of modeling the relations between settings. Another reason for using two rankings is that method A may be slightly better than B in (say) 60% of the cases, yet its errors in the other 40% are large. The difference between the two ranks would alert us to such situations.
Finally, any number of test cases, 2000 included, is small when considering the size of tree space. Bootstrapping provides some confidence that the results are stable and not due to random noise.
Results
Rankings of methods for building a summary tree
Method  RH  CME  CCE  CAE  DVE  MF  TLL  CLL 

TP(med)  1/3  0/0  12/9  8/8  7/5  3/3  0/0  3/3 
TP(avg)  0/4  0/0  13/9  6/7  0/3  11/10  1/6  14/15 
MED,TCB  1/0  3/3  10/7  6/4  6/6  9/7  8/11  9/9 
MED,MCC  1/0  6/6  12/10  7/4  7/7  7/6  7/10  7/7 
RBS,TCB  6/8  10/10  4/3  12/12  11/10  2/2  4/3  1/1 
RBS,MCC  7/9  12/12  5/4  12/12  11/11  1/1  3/2  0/0 
HSO,TCB  1/1  2/2  11/8  6/4  6/6  10/7  8/11  10/10 
HSO,MCC  1/2  5/5  13/11  6/4  7/7  8/6  7/10  8/8 
SRBS,TCB  3/5  8/7  8/5  6/5  1/2  12/9  6/9  13/13 
SRBS,MCC  4/6  9/9  9/6  6/5  3/4  11/9  5/9  11/12 
RAS,MCC  5/6  14/14  7/4  9/9  9/8  4/4  3/4  6/6 
RAS,TCB  5/6  13/13  6/3  10/10  10/9  5/4  5/5  5/5 
mSRBS  3/5  9/8  9/6  7/6  2/3  11/8  4/8  12/11 
mRAS  5/7  15/15  7/4  11/11  10/9  6/5  6/7  4/4 
mRBS  6/8  11/11  3/3  13/13  11/10  0/0  2/1  2/2 
mHS  1/0  18/19  1/1  2/0  1/0  16/12  12/14  18/18 
AVG,MCC  0/4  7/7  11/9  5/5  5/8  13/11  9/12  15/14 
CAT,TCB  0/5  1/1  14/10  0/0  1/2  18/16  14/15  20/21 
CAT,MCC  0/4  4/4  15/12  1/1  1/2  17/15  13/14  19/20 
HS,TCB  2/2  17/17  2/2  4/3  4/1  15/14  11/17  16/16 
HS,MCC  1/1  19/18  2/2  3/2  3/0  14/13  10/16  17/17 
CONS(med)  1/0  16/16  0/0  3/2  8/8  17/14  15/13  19/19 
Condensed rankings for methods in Table 1 with additional performance numbers
Method  TIMES  CME  CCE  MODEL  POLY  MF%  CAE%  CME% 

CAT,TCB  1  1  14  19  0.0%  45.2%  3.79%  36.33% 
CAT,MCC  2  4  15  18  0.0%  45.3%  3.79%  36.46% 
TP (avg)  5  0  13  11  0.0%  93.2%  4.37%  36.22% 
TP (med)  12  0  12  1  0.0%  98.6%  4.50%  36.21% 
SRBS,TCB  6  8  7  12  4.1%  91.8%  4.38%  36.64% 
MED,TCB  7  3  9  8  1.1%  94.0%  4.36%  36.36% 
HSO,TCB  8  2  10  10  1.1%  94.0%  4.36%  36.35% 
MED,MCC  9  6  13  6  0.9%  94.6%  4.37%  36.48% 
mSRBS  9  9  8  9  3.8%  92.1%  4.39%  36.80% 
HSO,MCC  10  5  14  7  0.9%  94.6%  4.37%  36.48% 
AVG,MCC  10  7  11  13  1.1%  84.2%  4.30%  36.49% 
SRBS,MCC  11  10  8  11  4.2%  91.9%  4.39%  36.8% 
mHS  0  19  1  16  29.8%  50.0%  3.97%  44.48% 
HS,MCC  3  19  2  14  34.3%  54.9%  4.16%  44.27% 
HS,TCB  4  18  2  15  34.5%  54.6%  4.18%  44.09% 
CONS (med)  6  17  0  17  27.5%  46.3%  3.98%  43.00% 
RAS,MCC  13  15  6  3  24.4%  93.1%  4.66%  42.45% 
RAS,TCB  14  14  5  4  24.2%  92.7%  4.67%  42.26% 
mRAS  15  16  6  5  24.5%  88.2%  4.74%  42.73% 
RBS,TCB  16  11  4  2  23.4%  99.0%  4.82%  40.60% 
mRBS  17  12  3  0  23.8%  99.1%  4.84%  40.80% 
RBS,MCC  18  13  5  0  23.7%  99.0%  4.81%  40.94% 
Discussion
Clearly no method in Table 1 is “best”, but several interesting trends and patterns can be identified. The agreement of ranking by comparison and magnitude is excellent, suggesting a similar distribution of errors for all methods. The table shows 22 of the 55 methods examined; most of the reduction comes from removing methods using CCD and HPF to select a topology, as MCC/TCB were significantly better for almost all combinations of methods and error criteria. This is slightly surprising, especially since we expected CCD, which assigns a proper probability to every tree topology, to fare better than heuristics such as TCB or MCC. The online supplement compares the four selection methods in more detail.
As expected there is a strong correlation between model fit and tree/coalescent likelihood (r=0.89 and r=0.98), but in addition the tree and coalescent likelihood are strongly correlated as well (r=0.85). Basically, methods generating trees with a good model fit tend to do well on both counts. The only exception is TP(avg) with a good tree likelihood but bad coalescent likelihood. Also, low clade age errors and low divergence errors go together (r=0.79), again with TP(avg) as the exception. Slightly unexpected at first sight is the negative correlation (r=−0.88) between clades missed and clades called. Either a method plays it safe by calling only definite clades, and tends to miss a lot (CONS), or calls everything and makes more mistakes (TP).
The table shows a second unexpected result: strong negative correlation between clade age errors and model fit (r=−0.94). Since model fit is highly correlated with branch lengths (r=0.87), no method provides good clade ages and good branch lengths/model fit. Methods optimizing branches, such as RBS, generate trees with good fit but worse ages, and methods optimizing ages exhibit the opposite. This negative correlation exists between all measures of age and fit. It is quite interesting that the two variants of the TP end up at different ends: medians give better model fit while means gives lower divergence errors.
Another performance split can be observed between pairs employing the same method for setting branch lengths but using MCC and TCB for selecting the topology. The MCC variant has better model fit, while the TCB fares better with clade calls and misses.
While Table 1 makes it easy to compare pairs of methods, it is quite hard to interpret as a whole. Table 2 complements it by aggregating some performance ranking and adding a few permethod statistics. The first statistic is the mean number of zero length branches in the summery tree, which effectively create polytomies. The methods in the table are divided into three groups: those who never create polytomies, those with occasional polytomies (up to 5%), and those with a high number (20% or more). The number of polytomies is strongly correlated with missed and called clades: methods which “resolve” conflict in the posterior by not committing and creating zero length branches miss more true clades but make less mistakes, and so have a high clade calls. Somewhat surprisingly there is no connection between zero length branches and model fit. We suspected that short branches were the main cause for low model fit, since they create noncoalescent like trees, however we see that RBS methods manage to have high model fit and around 24% polytomies. The other three statistics are the mean model fit percentile, clade age errors per clade as a percent of tree height, and the percent of missed clades from the total number of nontrivial clades. Those numbers can help in deciding how a difference in ranking translates to performance: for example, TP(avg) is seven ranks higher than TP(med) in clade time errors, but this amounts only to a difference of 0.13%, about ^{1}/_{7} of the total range. On the other hand there are a seven ranks between TP(avg) and CAT in model fit, but here the difference is very large  from 45% to 93%.
Conclusions
Properly analysing the test cases proved to be as challenging as the research itself. The number of possible methods for constructing summary trees, coupled with the number of ways of assessing the results can be overwhelming. In addition, the domain of trees is vast and the evaluation and construction methods are not independent. For example, the distance between the summary and true tree seems the most natural error measure. However, we have four ways of measuring distance and four related methods, each searching for the minimal distance tree using that distance. Not unexpectedly, each distance score finds the tree produced by its counterpart to be closer to the true tree than the trees generated using other scores. Interconnections such as these show the importance of using multiple error criteria when comparing methods. The space of tree sets is complex, and each measure sheds light on different aspects of that space. Both “Clades Missing” and “Clades Called” measure topological distance via success in detecting clades, both seem reasonable and valid, yet one is the reverse of the other. Having only one of them would give a biased view. Simultaneously examining many methods – while complicating the comparison process – can reveal general performance trends.
By examining results from a large simulation study we found there is no clear “winner”. Having low clade age errors and good branch lengths in a tree seems fundamentally exclusive. Methods setting clade ages from posterior ages tend to have lower age errors while methods matching the branch lengths produce trees with a better fit to the model.
Therefore, it makes sense to consider the purpose of the summary tree when choosing a method. If divergence times matters most, use either HS or CAT. If only topology matters, use the consensus (CONS) or TP. In both cases the decision between the two alternatives depend on whether you are conservative and prefer unresolved clades (polytomies) in areas of conflict, or whether you wish a “the best guess” at a fully resolved tree. Use RBS to get a tree with good model fit and therefore closer to a Maximum Likelihood tree.
TP(med) provides a good compromise: good model fit and low missed clades, with middle of the pack ages/divergence errors (but still better than RBS). All of these are better than the MCC as implemented in TreeAnnotator, which is middle of the pack in all measures except for doing worse on clades called and well on root height.
While the simulations show a few surprising results, we were most surprised by the performance of the “theory based” methods. We set out to replace heuristics with methods based upon firmer theoretical consideration, and strongly expected that RAS, a tree metric which takes into account both ages and branch lengths, will outperform the alternatives. Likewise, we expected the CCD to fare better than other methods for selecting a topology. However, heuristics seem to do better when measured against the main objective  recovering the true tree.
We think the different types of summaries are all valuable when the posterior trees are in conflict. Together with the full posterior as drawn by DensiTree, they provide different insights into the information contained in the posterior. We suggest that researchers generating a summary tree for annotation or publication use one of the newer methods since all of them outperform the existing consensus methods and BEAST’s own TreeAnnotator.
While we focused on obtaining a single point estimate from posterior MCMC samples, we would like to emphasize that researchers should treat single point estimates as end points, and use the full posterior whenever possible, especially for secondary analyses. In addition, one should look at several methods for extracting a point estimate when dealing with the complex space of phylogenetic trees.
Appendix
A Taxa partitions formal definition
where S_{1,2}(c) are the left and right sons of c.
The ages are computed by taking the median (or mean) of V_{ k }. The tree is reconstructed by picking the maximal age as the root, and recursively building the subtrees to the left and right of the split.
B Implementation: Minimum distance tree
 1.
Represent the tree as a point in ${\mathbb{R}}_{m}$.
 2.
Preprocess posterior trees to speed up the evaluation of the total distance to all posterior trees.
 3.
Analytically compute the derivative.
 4.
Find a good initial starting point.
Tree parametrization
The tree $T=(L,\u2102,h)$ is represented as a vector of real numbers $z=({h}_{r},{\alpha}_{2},\dots ,{\alpha}_{m})\in {\mathbb{R}}_{m}$, where $m=\left\u2102\right\leftL\right$ is the number of internal clades in the tree. h_{ r } is the height of T, and ${\alpha}_{2},{\alpha}_{3},\dots ,{\alpha}_{m}$ are m−1 values, one per internal clade, equal to the ratio of the clade age to the age of its parent. To retrieve a clade age from z, multiply the root height by the α for all the clade ancestors. That is, ${h}_{r}\prod _{k}{\alpha}_{k}$, where k ranges over the clade ancestors. By traversing the tree in preorder (clade before its descendants) all ages can be extracted from z using just m−1 multiplications, and an additional 2(m−1) subtractions would extract all branch lengths. Each component in z has a simple bound independent of other components; 0≤h_{ r }<∞ and 0≤α_{ i }≤1. This makes the tree a suitable optimization target for a method such as LBGFSB, a quasi newton algorithm for minimizing a multivariate function with simple bounds [18].
Preprocessing of posterior trees
The search for the minimum distance tree involves many evaluations of the target function, the mean distance $d(T,\mathbb{T})$. This evaluation is sped up by transforming the expression, which is a sum on trees, into a sum over clades. The details vary somewhat, depending on the distance metric d_{ T }. Here we elaborate for the rooted branch score case (Equation 8), and the interested reader should consult the code for details of the other metrics.
For the tree $T=(L,\u2102,h)$ the total distance is expanded as follows,
The terms in parentheses do not depend on T and can be precomputed, so the last two terms take $O\left(\right\u2102\left\right)$ operations to evaluate. The first term appears to require $O\left(\right\u2102\left\right\mathbb{T}\left\right)$ but we can cut this down to $O\left(\right\u2102log(\left\mathbb{T}\right\left)\right)$.
The reason for this complicated looking expression is that the last two terms in parentheses can be precomputed, and the first is simply the number of branches in the posterior greater than b(x) minus the number of branches smaller than it. After we presort the branches from the posterior for each clade, this number can be found by a binary search, taking at most $O(log(\left\mathbb{T}\right\left)\right)$ since there can be at most $\left(\right\mathbb{T}\left\right)$ matched branches, one for each tree in the posterior.
Analytical derivative
The search is significantly faster when a derivative can be computed analytically, since estimating a derivative requires at least m evaluations (the number of dimensions). While the details are tedious the calculations themselves are simple, since the target function is composed in a series of multiplications and additions/subtractions, so the derivative is easy to compute using the chain rule at each stage. Again the interested reader should consult the code for the exact details in each case.
Search initialization
We found that a good starting point can be vital, as under some settings the number of multiple local minima can be large. While the procedure to obtain the initial tree seems natural and obvious in hindsight, several other obvious looking approaches did not perform well at all.
The initial tree is obtained by first examining each branch independently. Each branch has its own optimal length, based on the matching branches in the posterior and the distance metric. This optimal value is computed for each branch, but since branch lengths are not independent, the next step builds a tree from those optimal values. The build assigns an age to each clade, proceeding in postorder, that is assigning an age to all descendants of a clade before assigning the clade age. The age of the clade is obtained by averaging the expected age from the direct descendants, which is the sum of their own assigned age and their optimal branch length.
Abbreviations
 AVG:

Summary tree method which sets internal nodes heights using average of posterior heights
 ca():

Common ancestor
 CAE:

Clades ages error. Equation (1)
 CAT:

Common ancestor tree. Equation (14)
 CC:

Clade credibility
 CCD:

Conditional clade probability distribution
 CCE:

Clades called error. Equation (6)
 CLL:

Coalescent likelihood
 CME:

Clades missed error. Equation (5)
 DVE:

Divergence times error. Equation (2)
 HS:

Heights score. Equation (10)
 HSO:

Heights only summary tree method
 HPF:

Highest posterior frequency
 MCC:

Maximal clade credibility
 MED:

Summary tree method which sets internal nodes heights using medians of posterior heights
 RAS:

Rooted agreement score. Equation (11)
 RBS:

Rooted branch score. Equation (8)
 RH:

Root height
 SRBS:

Squared rooted branch score. Equation (9)
 TCB:

Total clade branch
 TP:

Taxa partition tree.
Declarations
Acknowledgements
Many thanks to Alexei Drummond, David Bryant and Mike Steel for their useful suggestions and comments. JH and RB were funded by AJD’s Rutherford discovery fellowship from the Royal Society of NZ. JH partially funded by AWC.
Authors’ Affiliations
References
 Cranston K, Rannala B: Summarizing a posterior distribution of trees using agreement subtrees. Syst Biol. 2007, 56 (4): 578590. 10.1080/10635150701485091.PubMedView ArticleGoogle Scholar
 Rambaut A: FigTree. 2006, [http://tree.bio.ed.ac.uk/software/figtree]Google Scholar
 Bouckaert RR: DensiTree: making sense of sets of phylogenetic trees. Bioinformatics. 2010, 26 (10): 13721373. 10.1093/bioinformatics/btq110.PubMedView ArticleGoogle Scholar
 Bryant D: A classification of consensus methods for phylogenetics. DIMACS Series Discrete Math Theor Comput Sci. 2003, 61: 163184.Google Scholar
 Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007, 7: 21410.1186/147121487214.PubMed CentralPubMedView ArticleGoogle Scholar
 Larget B: The estimation of tree posterior probabilities using conditional clade probability distributions. Syst Biol. 2013, 62 (4): 501511. 10.1093/sysbio/syt014.PubMed CentralPubMedView ArticleGoogle Scholar
 Sukumaran J, Holder M: DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010, 26 (12): 15691571. 10.1093/bioinformatics/btq228.PubMedView ArticleGoogle Scholar
 Huggins P, Li W, Haws D, Friedrich T, Liu J, Yoshida R: Bayes estimators for phylogenetic reconstruction. Syst Biol. 2011, 60 (4): 528540. 10.1093/sysbio/syr021.PubMed CentralPubMedView ArticleGoogle Scholar
 Heled J: Small collection of phylogenetics functions. 2010, [https://code.google.com/p/biopy]Google Scholar
 Robinson D, Foulds L: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131147. 10.1016/00255564(81)900432.View ArticleGoogle Scholar
 Billera L, Holmes S, Vogtmann K: Geometry of the space of phylogenetic trees. Adv Appl Math. 2001, 27 (4): 733767. 10.1006/aama.2001.0759.View ArticleGoogle Scholar
 Owen M, Provan J: A fast algorithm for computing geodesic distances in tree space. IEEE/ACM Trans Comput Biol Bioinformatics (TCBB). 2011, 8: 213.View ArticleGoogle Scholar
 Mau B, Newton M, Larget B: Bayesian phylogenetic inference via Markov Chain Monte Carlo methods. Biometrics. 1999, 55: 112. 10.1111/j.0006341X.1999.00001.x.PubMedView ArticleGoogle Scholar
 Fernau H, Kaufmann M, Poths M: Comparing trees via crossing minimization. FSTTCS 2005: Foundations Software Technology Theoretical Computer Science. 2005, Heidelberg: Springer, 457469.View ArticleGoogle Scholar
 Kingman J: The coalescent. Stochastic Process Appl. 1982, 13 (3): 235248. 10.1016/03044149(82)900114.View ArticleGoogle Scholar
 Jukes T, Cantor C: Evolution of protein molecules. Mamm Protein Metab. 1969, 3: 21132.View ArticleGoogle Scholar
 Remco B, Heled J, Kuehnert D, Vaughan T, Wu CH, Xie D, Suchard M, Rambaut A, Drummond A: BEAST 2: A software platform for Bayesian evolutionary analysis. 2013Google Scholar
 Zhu C, Byrd R, Lu P, Nocedal J: Algorithm 778: LBFGSB: Fortran subroutines for largescale boundconstrained optimization. ACM Trans Math Softw (TOMS). 1997, 23 (4): 550560. 10.1145/279232.279236.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.