RETRACTED ARTICLE: TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics
 Gangolf Jobb^{1},
 Arndt von Haeseler^{2, 3} and
 Korbinian Strimmer^{1}
DOI: 10.1186/1471-2148-4-18
© Jobb et al; licensee BioMed Central Ltd. 2004
Received: 16 March 2004
Accepted: 28 June 2004
Published: 28 June 2004
Abstract
Background
Most analysis programs for inferring molecular phylogenies are difficult to use, in particular for researchers with little programming experience.
Results
TREEFINDER is an easy-to-use, integrative, platform-independent analysis environment for molecular phylogenetics. In this paper the main features of TREEFINDER (version of April 2004) are described. TREEFINDER is written in ANSI C and Java and implements powerful statistical approaches for inferring gene trees and for related analyses. In addition, it provides a user-friendly graphical interface and a phylogenetic programming language.
Conclusions
TREEFINDER is a versatile framework for analyzing phylogenetic data across different platforms that is suited to both exploratory and advanced studies.
Background
Computational inference of molecular phylogenies has a wide spectrum of applications in the analysis of DNA sequences, ranging from systematic biology to population genetics and comparative genomics [1].
As a result, a large body of theoretical methodology has developed [2], along with numerous specialist software packages. However, the most advanced of these computer programs typically provide only a very Spartan user interface and hence are too difficult to use without additional training, especially for novices in phylogeny. One notable exception is the popular commercially distributed PAUP* software [3], which implements powerful probabilistic methods for modeling and inferring gene trees and at the same time offers a friendly graphical user interface (GUI). Unfortunately, this GUI is currently available only on the Macintosh platform.
On the other hand, a more experienced user will quickly outgrow the limits of a graphical user interface. Consequently, to facilitate complex sequence analysis, corresponding scripting languages have been developed. For example, in PAUP* all elements of its GUI can also be invoked on the command line. However, for the rapid deployment of specialized phylogenetic analysis tools one still needs the additional flexibility of a programming language rather than a scripting language.
Therefore, an integrative general-purpose phylogenetic analysis environment should ideally take several complementary objectives into account:

platform independence and modular design,

an easy-to-use GUI which masks the complexity of tree inference from the non-expert user,

a phylogenetic computer language that allows both scripting of all GUI functions and generic programming, and

availability of powerful tree inference and related analysis approaches, and means for simulation of data and trees.
The development of the TREEFINDER software is an attempt to address these issues and to provide a unified, powerful framework for phylogenetic analysis for both occasional and experienced users across different platforms.
Implementation
General design
The TREEFINDER software has a modular design. It consists of a graphical frontend (written in Java) and a computational kernel (written in ANSI C). Both communicate in the special-purpose language TL ("TREEFINDER's language"). The frontend translates mouse clicks and keyboard hits into TL commands that are sent to the kernel. The kernel evaluates these commands and sends the results back to the window interface, where they are displayed.
TREEFINDER has grown into a fairly large project. The current version as of this writing (April 2004) consists of approximately 30,000 lines of C code, 9,000 lines of Java and 2,500 lines of TL. TREEFINDER is portable to any operating system for which an ANSI C compiler and a Java virtual machine are available.
TREEFINDER components and language
The kernel performs the actual analysis. For an overview of the currently implemented phylogenetic procedures and algorithms see section "Results" below. In addition to these specialized tasks, the kernel implements many other general mathematical and statistical functions, including pdf, cdf, and quantile functions of common statistical distributions and most functions from the public-domain CEPHES library [4]. It is also possible to run the kernel without the graphical frontend. In this case TL commands may simply be typed in at the operating system shell prompt or may be read from a text file.
The computer language TL developed for use with TREEFINDER is a functional language, similar to LISP and Mathematica. This makes TL ideally suited to the processing of lists and trees. The language is interpreted and provides all the common programming elements such as flow control, variables, operator notations and a large set of basic routines. It supports stack-oriented programming as well as rule-based data transformations. The TL language is extensively documented in the TREEFINDER software package.
Note that the clear separation of kernel and frontend and the use of the TL language for communication between the two components greatly facilitates the writing of third-party plugins to extend the capabilities of the kernel. Correspondingly, a substantial part of the computational library of TREEFINDER is itself written in TL.
Results
Available phylogenetic methods
The phylogenetic analysis procedures currently implemented in TREEFINDER focus mainly on probabilistic and statistical approaches. One important reason for this choice is that these methods consistently provide the most powerful and accurate inferences [2]. The following is a non-exhaustive list of features present in the TREEFINDER version of April 2004.
Substitution models
The program offers the standard set of evolutionary models for nucleotide substitution (GTR and submodels [5–7]) and two different models of rate heterogeneity among sites (Gamma [8], different rates for each codon position). All model parameters including the rate heterogeneity and base frequencies can be estimated from the data.
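To make the notion of a GTR submodel concrete, the following Python sketch builds the rate matrix of the HKY model [7] from a transition/transversion ratio and a set of base frequencies. It is an illustration only and not TREEFINDER code; the function name `hky_rate_matrix`, the parameter name `kappa` and the scaling to one expected substitution per unit time are assumptions made for this example.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def hky_rate_matrix(kappa, freqs):
    """Return an HKY rate matrix as a dict of dicts (hypothetical helper).

    kappa -- transition/transversion rate ratio
    freqs -- dict mapping each base to its equilibrium frequency
    The matrix is scaled so that the expected rate at equilibrium is 1.
    """
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x == y:
                continue
            # A<->G and C<->T are transitions; all other changes are transversions.
            is_transition = (x in PURINES) == (y in PURINES)
            q[x][y] = freqs[y] * (kappa if is_transition else 1.0)
        # Each row of a rate matrix sums to zero.
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    # Rescale to one expected substitution per unit time.
    mean_rate = -sum(freqs[x] * q[x][x] for x in BASES)
    for x in BASES:
        for y in BASES:
            q[x][y] /= mean_rate
    return q
```

With `kappa = 1` and equal base frequencies the matrix reduces to the Jukes-Cantor model, in which every off-diagonal rate is 1/3.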
Estimation of branch lengths and absolute rates
Tree topology search
TREEFINDER employs a genetic algorithm to search for the most likely tree topology [11]. Genetic algorithms are global search procedures and, unlike local rearrangement methods, are less prone to getting trapped in local maxima. The specific details of the genetic algorithm implemented in TREEFINDER are explained in the manual. In addition to exploring the whole tree space, the tree search may also be constrained by a guide tree (a tree whose multifurcations represent the remaining degree of freedom in the search space). To speed up the evaluation of the likelihood for different tree topologies, a technique called "likelihood hashing" is employed.
Tree manipulation
The tree viewer built into TREEFINDER opens trees in various data formats and provides access to basic rearrangement capabilities, such as rerooting, midpoint rooting, placing outgroups, and collapsing of small edges. The transformations may also be applied simultaneously to whole sets of trees. More advanced tree manipulation is available through the TL programming interface. For instance, comparison of trees and other expressions is straightforward in TL due to its functional nature.
Simulation of data and trees
A method to simulate sequence data along a specified gene tree and model of nucleotide substitution is implemented [12]. In addition, a simple procedure to generate random bifurcated trees is available. Simulation of trees and data is useful to assess the accuracy of phylogenetic methods and to generate empirical distributions for test statistics [13, 14].
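The article does not specify how the random trees are drawn. As one possible illustration, the Python sketch below generates a random rooted bifurcating topology by repeatedly joining two randomly chosen subtrees. The function names and the joining scheme are assumptions for this example and need not correspond to the procedure used in TREEFINDER.

```python
import random


def random_bifurcating_tree(leaves, rng=None):
    """Join two randomly chosen subtrees until a single tree remains.

    Leaves are strings; internal nodes are 2-tuples of subtrees.
    """
    if rng is None:
        rng = random.Random()
    nodes = list(leaves)
    if len(nodes) < 2:
        raise ValueError("need at least two leaves")
    while len(nodes) > 1:
        i, j = rng.sample(range(len(nodes)), 2)
        joined = (nodes[i], nodes[j])
        # Remove the two chosen subtrees and add their parent.
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append(joined)
    return nodes[0]


def leaf_names(tree):
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaf_names(child)]
```

Passing a seeded generator, e.g. `random_bifurcating_tree("abcdef", random.Random(1))`, makes a simulation run reproducible.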
Rate profiles
The computation of rate and mutation profiles [15] along sequence alignments is implemented in TREEFINDER. These plots are useful in the inference of functional regions and in investigating the selective forces acting on DNA sequences.
Other features
The confidence of inferred evolutionary relationships may be assessed by bootstrap analysis [16]. Corresponding routines for computing consensus trees [17] with the option to count and output the distinct topologies in the set of samples are available. Further TL procedures include checks for compositional bias in the data and functions for reading, writing and manipulating sequence alignments.
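The resampling step at the heart of bootstrap analysis [16] can be sketched in a few lines of Python: alignment columns are drawn with replacement until a pseudo-replicate of the original length is obtained. The representation of an alignment as a dictionary of equal-length strings and the name `bootstrap_alignment` are assumptions for this illustration, not TREEFINDER's interface.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment -- dict mapping taxon name to an aligned sequence string
    Columns (sites) are sampled with replacement; rows stay aligned
    because the same column indices are used for every taxon.
    """
    if rng is None:
        rng = random.Random()
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    n_sites = lengths.pop()
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}
```

A tree would then be inferred from each pseudo-replicate, and the resulting set of trees summarized, for example by a consensus tree.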
User interface and TL language
A major design goal of TREEFINDER is to provide both a simple, easy-to-use graphical user interface and a correspondingly powerful programming language for phylogenetic analysis.
All analyses can also be done on the command line or at the script level. For instance, to reconstruct a phylogenetic tree from a sequence alignment contained in some "file" one enters the following simple command:
ReconstructPhylogeny ["file",
SubstitutionModel->"HKY"].
In this case, "SubstitutionModel" is a named optional argument (it takes a default value and hence need not be specified), whereas "file" is a required positional argument.
A noteworthy detail about TL is that a programmer can choose at any time between functional and stack-oriented calls. The stack-oriented form of the above command is:
"file",
SubstitutionModel->"HKY",
ReconstructPhylogeny
A phylogenetic tree in TL is an object of the form
{{"a",{"b","c"}},"d",{"e","f"}}
and with edge lengths
(*) {{"a":0.15,{"b":0.1,"c":0.1}
:0.001}:0.1, "d":0.2,
{"e":0.1,"f":0.1}:0.001}.
The structure of the nested list represents the topology, as in the familiar NEWICK bracket notation for phylogenetic trees. However, the above examples are not merely a data format to store tree information in files. They are genuine TL expressions, and as such may be subjected to further transformations. For instance, if you have a "treelist" variable with value
{tree1, tree2, tree3, tree4}
one easily computes the corresponding strict consensus tree via the simple command
ConsensusTree [treelist, Strict->True].
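For readers unfamiliar with strict consensus trees, the following Python sketch shows the underlying idea on nested lists analogous to the TL expressions above: only groupings found in every input tree are kept. It is not the TL implementation. As a simplification it treats the nested lists as rooted clusters rather than unrooted splits, and all function names are assumptions for this example.

```python
def clusters(tree):
    """Return (all leaves, set of leaf sets below each internal node)."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        found.add(leaves)
        return leaves

    return visit(tree), found


def build(leaves, shared):
    """Rebuild a nested list for `leaves` from the shared clusters."""
    inside = [c for c in shared if c < leaves]
    # Children are the largest shared clusters strictly inside this one.
    maximal = [c for c in inside if not any(c < d for d in inside)]
    covered = frozenset().union(*maximal)
    children = [build(c, shared) for c in sorted(maximal, key=sorted)]
    # Leaves not covered by any child cluster attach directly here.
    return children + sorted(leaves - covered)


def strict_consensus(trees):
    """Keep only the clusters that occur in every input tree."""
    if not trees:
        raise ValueError("need at least one tree")
    leaf_sets, cluster_sets = zip(*(clusters(t) for t in trees))
    if len(set(leaf_sets)) != 1:
        raise ValueError("all trees must have the same leaves")
    shared = set.intersection(*cluster_sets)
    return build(leaf_sets[0], shared)
```

Groupings on which the input trees disagree are thus replaced by multifurcations in the result.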
One of TL's most advanced features is rule-based data transformation. Given a variable 'tree' containing a tree with edge lengths, it requires less than one line of code to collapse the short edges into multifurcations:
tree={{__b}:_e->__b/?_e<0.01}
will transform the above example (*) into
{{"a":0.15,"b":0.1,"c":0.1}:0.1,
"d":0.2,"e":0.1,"f":0.1}.
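The effect of this rule can be reproduced in a conventional language with an explicit recursion, which may help readers who are new to pattern-based rewriting. In the Python sketch below a tree is a list of `(subtree, length)` pairs, where a subtree is either a leaf name or another such list. This representation and the function name are assumptions for the illustration and are not part of TL.

```python
def collapse_short_edges(children, threshold=0.01):
    """Splice the children of short internal edges into their parent.

    children -- list of (subtree, edge_length) pairs; a subtree is a
                leaf name (str) or a nested list of the same form
    Leaf edges are never removed, mirroring the TL rule above, which
    only matches lists.
    """
    result = []
    for subtree, length in children:
        if isinstance(subtree, str):
            result.append((subtree, length))
            continue
        collapsed = collapse_short_edges(subtree, threshold)
        if length < threshold:
            # Short internal edge: its children move up one level.
            result.extend(collapsed)
        else:
            result.append((collapsed, length))
    return result


# The example tree (*) in this representation.
example = [
    ([("a", 0.15), ([("b", 0.1), ("c", 0.1)], 0.001)], 0.1),
    ("d", 0.2),
    ([("e", 0.1), ("f", 0.1)], 0.001),
]
```

Applied to `example`, the function removes both edges of length 0.001 and keeps all other edge lengths unchanged.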
Furthermore, TL internally applies rule-based transformations also to algebraic expressions, e.g.,
a+a+b
will be simplified to
b+2*a.
This pattern matching property is heavily relied on in TREEFINDER's internal TL kernel routines, but it may also prove useful for writing extensions.
These are only some illustrative examples of TL programming. Further code examples can be found in the TL documentation and in the 'Kernel' directory of the TREEFINDER distribution.
Relative speed and accuracy
The algorithms implemented in TREEFINDER have been tested and cross-compared with those of other likelihood-based phylogeny software packages, such as PAUP* [3], PHYLIP [18], fastDNAml [19], and TREE-PUZZLE [20].
Specifically, we conducted a simulation study to investigate the computation time, the accuracy (the probability of recovering the exact true tree topology), and the expected dissimilarity of true and inferred tree [21] for several widely used programs and TREEFINDER. Varying the number of taxa between 4 and 60, we generated a set of random trees. Subsequently, sequence data of length 1,000 nucleotides were artificially evolved along these trees. The resulting alignments were then used to infer the original trees using TREEFINDER (version of April 2004), PAUP* (version 4b10), TREE-PUZZLE (version 5.1) and fastDNAml (version 1.2.2). For each tree size (4–60 leaves) this procedure was repeated 100 times to assess the average relative performance of each program.
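The dissimilarity measure of Robinson and Foulds [21] counts the bipartitions of the leaf set that occur in one tree but not in the other. The Python sketch below computes it for unrooted trees written as nested lists of leaf names. It is given only to illustrate the measure; the function names are assumptions and the code is not taken from the software used in the study.

```python
def splits(tree):
    """Return (all leaves, set of non-trivial bipartitions of the tree).

    Each bipartition is stored as the side that does not contain the
    alphabetically smallest leaf, so that rerooting does not matter.
    """
    below = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        below.append(leaves)
        return leaves

    all_leaves = visit(tree)
    anchor = min(all_leaves)
    result = set()
    for side in below:
        other = all_leaves - side
        if len(side) < 2 or len(other) < 2:
            continue  # trivial split, present in every tree
        result.add(other if anchor in side else side)
    return all_leaves, result


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same leaves")
    return len(splits_a ^ splits_b)
```

A distance of zero means that the inferred topology is identical to the true one, which is the event counted by the accuracy measure defined above.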
Future work
The TREEFINDER environment, while already a versatile analysis framework in the present version, has many options for further enhancement. These include, most importantly, substitution models for amino acids, e.g., the classic Dayhoff model [22] or the more recent WAG model [23]. Other desirable directions for extension are the implementation of modern population genetic methods, such as tools for coalescent simulation and estimation of demographic parameters [24]. These and other procedures are scheduled for inclusion in future releases of TREEFINDER.
Conclusions
The TREEFINDER project is an ongoing effort to provide an easy-to-use yet powerful platform-independent analysis environment for molecular phylogenetics. Currently, it offers a solid set of well-tested statistical methods for inferring gene trees and for related analyses, with its functional programming interface providing additional flexibility. This article presents the status of TREEFINDER as of the version of April 2004. With a release cycle of 3–4 updates per year, as in 2003, we expect that TREEFINDER will soon mature further and provide, especially for beginners, a convenient and quick route to phylogenetic analysis.
Availability and requirements
The TREEFINDER analysis environment can be downloaded free of charge from the web page http://www.treefinder.de. Packages are currently provided for the Windows, MacOS X, SUN Solaris, and Intel Linux platforms. TREEFINDER requires the prior installation of a Java virtual machine (preferably version 1.4 or later). The TREEFINDER software is provided "as is" with no guarantee or warranty of any kind. It may be distributed non-commercially, provided that neither its manual nor any other component of the software is changed (for details refer to the web page or the manual).
Acknowledgements
Development of TREEFINDER was supported in part by grants from the Deutsche Forschungsgemeinschaft and the Max-Planck-Gesellschaft to A.v.H. and K.S. We also thank Wolfgang Ludwig (TU Munich) for many valuable suggestions that greatly helped the TREEFINDER project.
References
 Page RDM, Holmes EC: Molecular Evolution: A Phylogenetic Approach. 1998, Oxford: Blackwell Science
 Felsenstein J: Inferring Phylogenies. 2004, Sunderland, MA: Sinauer Associates
 Swofford DL: PAUP*. Phylogenetic analysis using parsimony (* and other methods). Version 4. 1998, Sunderland, MA: Sinauer Associates
 Moshier SL: Methods and Programs for Mathematical Functions. 1989, Upper Saddle River, New Jersey: Prentice-Hall
 Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evol. 1984, 20: 86-93.
 Tamura K, Nei M: Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993, 10: 512-526.
 Hasegawa M, Kishino H, Yano K: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol. 1985, 22: 160-174.
 Yang Z: Maximum-likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J Mol Evol. 1994, 39: 306-314.
 Felsenstein J: Evolutionary trees from DNA sequences: A maximum-likelihood approach. J Mol Evol. 1981, 17: 368-376.
 Sanderson MJ: A nonparametric approach to estimating divergence times in the absence of rate constancy. Mol Biol Evol. 1997, 14: 1218-1231.
 Lewis PO: A genetic algorithm for maximum-likelihood inference using nucleotide sequence data. Mol Biol Evol. 1998, 15: 277-283.
 Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Applic Biosci. 1997, 13: 235-238.
 Goldman N: Statistical tests of models of DNA substitution. J Mol Evol. 1993, 36: 182-198.
 Goldman N: Simple diagnostic statistical tests of models for DNA substitution. J Mol Evol. 1993, 37: 650-661.
 Simon AL, Stone EA, Sidow A: Inference of functional regions in proteins by quantifications of evolutionary constraints. Proc Natl Acad Sci USA. 2002, 99: 2912-2917. 10.1073/pnas.042692299.
 Felsenstein J: Confidence limits on phylogenies: An approach using the bootstrap. Evolution. 1985, 39: 783-791.
 Wilkinson M: Majority-rule reduced consensus trees and their use in bootstrapping. Mol Biol Evol. 1996, 13: 437-444.
 Felsenstein J: PHYLIP: Phylogenetic Inference Package, version 3.5c. Seattle. 1993, Department of Genetics, University of Washington
 Olsen GJ, Natsuda H, Hagstrom R, Overbeek R: FastDNAML: A Tool for construction of phylogenetic trees of DNA sequences using maximum-likelihood. Comput Applic Biosci. 1994, 10: 41-48.
 Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum-likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics. 2002, 18: 502-504. 10.1093/bioinformatics/18.3.502.
 Robinson DF, Foulds LR: Comparison of phylogenetic trees. Mat Biosci. 1981, 53: 131-147. 10.1016/0025-5564(81)90043-2.
 Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequences and Structure. Edited by: Dayhoff MO. 1978, Silver Springs: Natl. Biomed. Res. Found., 5: 345-352.
 Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 2001, 18: 691-699.
 Nordborg M: Coalescent Theory. In Handbook of Statistical Genetics. Edited by: Balding D, Bishop M, Cannings C. 2001, Chichester: Wiley, 179-212.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.