Volume 7 Supplement 1
SCaFoS: a tool for Selection, Concatenation and Fusion of Sequences for phylogenomics
© Roure et al; licensee BioMed Central Ltd. 2007
Published: 8 February 2007
Phylogenetic analyses based on datasets rich in both genes and species (phylogenomics) are becoming a standard approach to resolve evolutionary questions. However, several difficulties are associated with the assembly of large datasets, such as multiple copies of a gene per species (paralogous or xenologous genes), lack of some genes for a given species, or partial sequences. The use of undetected paralogous or xenologous genes in phylogenetic inference can lead to inaccurate results, and the use of partial sequences to a lack of resolution. A tool that selects sequences, species, and genes, while dealing with these issues, is needed in a phylogenomics context.
Here, we present SCaFoS, a tool that quickly assembles phylogenomic datasets containing maximal phylogenetic information while adjusting the amount of missing data in the selection of species, sequences and genes. Starting from individual sequence alignments, and using monophyletic groups defined by the user, SCaFoS creates chimeras with partial sequences, or selects, among multiple sequences, the orthologous and/or slowest evolving sequences. Once sequences representing each predefined monophyletic group have been selected, SCaFoS retains genes according to the user's allowed level of missing data and generates files for super-matrix and super-tree analyses in several formats compatible with standard phylogenetic inference software. Because no clear-cut criteria exist for sequence selection, a semi-automatic mode is available to accommodate the user's expertise.
SCaFoS is able to deal with datasets of hundreds of species and genes, at either the amino acid or the nucleotide level. It has a graphical interface and can be integrated in an automatic workflow. Moreover, SCaFoS is the first tool that integrates the user's knowledge to select orthologous sequences, creates chimerical sequences to reduce missing data and selects genes according to their level of missing data. Finally, applying SCaFoS to different datasets, we show that the judicious selection of genes, species and sequences reduces tree reconstruction artefacts, especially if the dataset includes fast evolving species.
Phylogenomics, i.e. phylogenetic inference based on large amounts of sequence data, is an alternative approach to single gene phylogenies, which are insufficient to resolve many phylogenetic questions. The most common phylogenomic strategies using primary sequences are the concatenation of sequences before tree reconstruction (super-matrix approach) and the combination of single gene phylogenies (super-tree approach). Several difficulties are associated with handling large amounts of data: (i) the uneven distribution of species across genes (genes that have been lost or that are not yet sequenced); (ii) the existence of partial sequences, especially in EST (Expressed Sequence Tag) and WGS (Whole Genome Shotgun)-based projects; and (iii) the presence of multiple copies per gene for the same species (paralogs or xenologs). The first two points imply the presence of missing data in the final dataset, whereas the third implies the presence of sequences that do not reflect the species tree and could therefore mislead phylogenetic inference.
Undoubtedly, one of the most problematic aspects when assembling phylogenomic datasets for reconstructing species trees is the presence of paralogous or xenologous genes. As opposed to orthologous genes, which arose by speciation and reflect the organismal phylogeny, paralogs arose by gene duplications, and xenologs by lateral gene transfer from another species. Both cases generally imply the presence of multiple copies of a given gene per species, some of which do not reflect the organismal phylogeny. Orthology assignment is a difficult task. Similarity of primary sequence alone is not always sufficient to discriminate correct orthologs. A rigorous, albeit extreme, solution would be to retain only genes having one and only one copy in all the species under study (an approach particularly suited when complete genomes are available). However, if an objective is to minimise the amount of missing data, this implies retaining a tiny fraction of the genome (e.g., 14 genes from 10 complete eukaryotic genomes in the study of Philip et al.), rejecting a large number of genes whose paralogy history may be inferred and/or does not disturb the inference of the species phylogeny. In particular, this approach would needlessly reject in-paralogs (i.e. genes arising from a recent duplication within a single species), which do not disturb the inference of species phylogeny. In contrast, great care should be taken to detect out-paralogs (i.e. genes for which the duplication event arose before speciation), whose presence may induce erroneous phylogenies. Unfortunately, orthology determination is difficult when only one sequence per species exists. In brief, a gene should only be discarded when its duplication history cannot be reliably inferred, meaning that gene and sequence selection should integrate information about duplication histories in order to optimally infer organismal phylogeny from genomic data.
Missing data are also often considered to be a significant obstacle in phylogenetic reconstruction (see Wiens 1998 and references therein), and researchers generally prefer to avoid incomplete super-matrices [7, 8]. Nevertheless, this implies that a compromise has to be made between using a large number of species for a few sequenced genes or a large number of genes for a few completely sequenced species. The first strategy often fails to provide statistically supported trees due to the limited sequence information contained in single or few genes, whereas the second can lead to highly supported, albeit erroneous trees, due to systematic biases (e.g. compositional or rate heterogeneity among lineages) [1, 9]. The influence of systematic bias is more limited with the first strategy, because multiple substitutions (and hence convergence) are detected more easily when many species are sampled. Therefore, using a large number of both genes and species is necessary to infer accurate and well-resolved phylogenies, even if this implies the presence of missing data. To achieve this purpose, algorithms have been developed to identify optimal incomplete phylogenetic datasets [10, 11], allowing the assembly of huge super-matrices (e.g. 70 taxa and 1131 genes) automatically from a given database. However, this automation favours the selection of species for which the complete genome is sequenced, without consideration of their phylogenetic interest. For instance, it may lead to the inclusion of redundant taxa (e.g. mouse and rat when studying the eukaryotic phylogeny) or of rogue taxa (e.g. microsporidia), which would needlessly increase computational time and phylogenetic inaccuracy, respectively. Nevertheless, recent studies using simulations, as well as real data, have shown that the presence of missing data does not drastically reduce phylogenetic accuracy as long as a sufficient number of characters is available for each species [12–14].
This is why reducing the amount of missing data must not be an end in itself. In particular, it has been shown that including partial sequences to break a long branch (i.e. adding species that are the sister-group of a fast evolving species) reduces one of the most common tree reconstruction artefacts, known as long branch attraction (LBA). With the same goal, an extreme approach is to exclude the fastest evolving genes from a fast evolving taxon (up to 90% of missing data for a given species). Even if these approaches imply much more incomplete matrices, the ultimate aim of selecting sequences, genes and species is to increase the amount of phylogenetic signal to the detriment of noisy signal; minimizing the level of missing data is only one of the ways to pursue this aim. In fact, no rules currently exist to find the optimal number of taxa and level of missing data, and a tool is therefore required to easily explore this question.
In summary, accurate and statistically supported phylogenetic inferences rely on the construction of large datasets with a minimal amount of missing data and free of non-orthologous sequences, which makes species, gene and sequence sampling a crucial issue. In order to facilitate the construction of such phylogenomic datasets, we have developed SCaFoS, a tool that semi-automatically or automatically selects species, genes and sequences taking into account their level of missing data. Moreover, the software presents two novel functions: (i) it allows the combination of closely related species into a single pseudo-species to minimize missing data while retaining poorly represented taxa, and (ii) it uses the relative evolutionary distance of the sequences and/or the user's expertise to judiciously select orthologous and/or slowest evolving sequences to avoid inaccurate phylogenetic reconstructions. These new functions will be particularly useful in a data mining context as more and more genomes are sequenced.
SCaFoS runs in an easy-to-use graphical mode, as well as in a command-line mode that can be integrated into a workflow. It can deal with either amino acid or nucleotide sequences. Common formats for input and output alignment files are handled: Fasta, Phylip, Must or Nexus. SCaFoS is developed in Perl and the graphical interface is designed with Perl-Tk.
the minimum length of a single sequence is used to remove sequences that are too short, because stochastic errors might be induced by partial sequences, especially in the super-tree approach (default = 10%);
the sequence completeness defines how many residues may be missing for a sequence to still be considered complete (default = 10%); such sequences are called quasi-complete sequences;
the divergence threshold is the maximum allowed pairwise phylogenetic distance within the OTU, expressed as a percentage of the average pairwise distance to the other sequences (default = 25%).
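The first two thresholds can be illustrated with a minimal sketch. It is not taken from the SCaFoS source code: the function names, the set of characters treated as missing, and the choice of expressing both percentages relative to the alignment length are assumptions made for illustration only.

```python
# Hypothetical sketch of the length and completeness thresholds.
# Assumptions (not from the paper): '-' and '?' mark missing positions, and
# both thresholds are fractions of the alignment length.
MISSING_CHARS = set("-?")


def missing_fraction(aligned_seq: str) -> float:
    """Fraction of alignment columns without a residue for this sequence."""
    if not aligned_seq:
        return 1.0
    n_missing = sum(1 for ch in aligned_seq if ch in MISSING_CHARS)
    return n_missing / len(aligned_seq)


def classify(aligned_seq: str, min_length: float = 0.10,
             completeness: float = 0.10) -> str:
    """Return 'too_short', 'quasi_complete' or 'partial' for one sequence."""
    missing = missing_fraction(aligned_seq)
    present = 1.0 - missing
    if present < min_length:
        return "too_short"          # removed before any selection step
    if missing <= completeness:
        return "quasi_complete"     # treated as a full-length sequence
    return "partial"


if __name__ == "__main__":
    for seq in ("ACDEFGHIK-", "ACDEF-----", "A" + "-" * 19):
        print(seq, classify(seq))
```

Raising the completeness value makes more sequences count as quasi-complete, which shifts the balance from chimera construction towards the selection of existing sequences.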
Schematically, the steps for sequence selection occur as follows according to the different thresholds:
if only one sequence for a given OTU is present in the file, the sequence is systematically selected except if it is too short;
if only one quasi-complete sequence (according to the sequence completeness criterion) exists for the OTU, the sequence is also systematically selected, even if this sequence has a higher evolutionary rate than the non-complete sequences in the OTU;
if none of the sequences is quasi-complete and the chimera option has been chosen by the user, a chimerical sequence will be constructed and selected as described in the 'Construction of chimerical sequences' paragraph, except if the created chimera is too short;
if at least two quasi-complete sequences are present, only these quasi-complete sequences are sent to the selection criteria step described in the 'Selection according to evolutionary distances' paragraph;
otherwise, all incomplete sequences are sent to the selection criteria step.
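The steps above can be sketched as a small decision function. This is an illustrative reading of the list, not the actual SCaFoS implementation: the status labels, the action names and the decision to filter out too-short sequences before applying the other rules are assumptions.

```python
# Hypothetical sketch of the selection steps for one OTU in one gene.
# Each sequence is summarised by a status label (assumed names):
# 'too_short', 'partial' or 'quasi_complete'.
from typing import List


def choose_action(statuses: List[str], chimera: bool = True) -> str:
    """Return the action implied by the selection steps for one OTU."""
    # Assumption: too-short sequences are removed first.
    usable = [s for s in statuses if s != "too_short"]
    if not usable:
        return "discard"                  # nothing left to represent the OTU
    if len(usable) == 1:
        return "select_single"            # a lone sequence is always kept
    n_complete = usable.count("quasi_complete")
    if n_complete == 1:
        return "select_quasi_complete"    # kept even if it evolves faster
    if n_complete == 0 and chimera:
        return "build_chimera"            # still subject to the length check
    if n_complete >= 2:
        return "rank_quasi_complete"      # size or distance criterion
    return "rank_incomplete"              # no chimera option: rank partials


if __name__ == "__main__":
    print(choose_action(["partial", "quasi_complete", "partial"]))
    print(choose_action(["partial", "partial"], chimera=False))
```

Returning an action label rather than a sequence keeps the flow separate from the ranking criterion, which the text describes as a distinct step.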
Two mutually exclusive selection criteria, sequence size or evolutionary distances, constitute the starting point of the selection criteria step. The more straightforward criterion is the size of the sequences, in which case the longest sequence will be selected. Although this criterion is best for minimizing the quantity of missing data, selection according to evolutionary distances allows a more judicious choice of sequences (see below). These two kinds of sequence selection are provided in an automatic mode, which makes SCaFoS a stand-alone tool.
Selection according to evolutionary distances
For each gene alignment, evolutionary distances between each pair of sequences are calculated with TREE-PUZZLE. While the choice of the substitution model is left to TREE-PUZZLE, the user can enforce a Gamma distribution to handle rate heterogeneity across sites. In practice, we have observed that the assumption of uniform rates provides sufficiently accurate estimates, while significantly reducing computational time. Evolutionary distances are used for two purposes: (i) verifying that the OTU does not include xenologous or paralogous sequences, and, mainly, (ii) selecting the least divergent sequence. For each OTU, the ratio between the in-OTU distance (the maximum pairwise phylogenetic distance within the OTU) and the out-OTU distance (the average pairwise distance between each OTU sequence and each non-OTU sequence) is calculated. If the in-OTU/out-OTU ratio is greater than the divergence threshold, all sequences from this OTU are discarded and, for this gene, the OTU is represented by question marks in the super-matrix. Otherwise, the sequence that displays the lowest average distance to the other sequences represents the OTU. This approach is rather drastic, but it is efficient at keeping out-paralogs out of the resulting file. Nevertheless, as detailed below, a more accurate selection might be obtained with the semi-automatic mode. Because evidence of gene duplication somewhere in the tree is a reason to worry about the orthology of the other sequences, a more conservative option is also available, which eliminates the complete gene when at least one OTU needs to be removed.
Finally, if the OTU does not present a risk of xeno- or paralogy, the least divergent sequence is selected from the quasi-complete sequences of the OTU in order to decrease the noisy signal contained in the terminal branches (without decreasing the phylogenetic signal contained in the inner branch). For this last step, the definition of the sequence completeness is an important option, because it is often preferable to select an almost complete, slowly evolving sequence rather than a complete but highly divergent one.
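The distance-based test can be sketched as follows. The sketch assumes that pairwise distances have already been computed (for instance with TREE-PUZZLE) and stored in a dictionary keyed by pairs of sequence names; the function names and this data layout are hypothetical, and "the other sequences" is read here as the non-OTU sequences.

```python
# Hypothetical sketch of the in-OTU/out-OTU ratio test and of the choice of
# the least divergent sequence. Distances are assumed to be precomputed.
from itertools import combinations
from typing import Dict, List, Optional, Tuple

Distances = Dict[Tuple[str, str], float]


def pair_distance(d: Distances, a: str, b: str) -> float:
    """Look up a symmetric distance stored under either key order."""
    return d[(a, b)] if (a, b) in d else d[(b, a)]


def select_for_otu(d: Distances, otu: List[str], others: List[str],
                   threshold: float = 0.25) -> Tuple[Optional[str], float]:
    """Return (selected sequence or None, in-OTU/out-OTU ratio)."""
    # In-OTU distance: maximum pairwise distance within the OTU.
    in_max = max((pair_distance(d, a, b) for a, b in combinations(otu, 2)),
                 default=0.0)
    # Out-OTU distance: mean distance between OTU and non-OTU sequences.
    out = [pair_distance(d, a, b) for a in otu for b in others]
    ratio = in_max / (sum(out) / len(out))
    if ratio > threshold:
        return None, ratio  # OTU discarded for this gene (question marks)

    def mean_to_others(name: str) -> float:
        return sum(pair_distance(d, name, b) for b in others) / len(others)

    return min(otu, key=mean_to_others), ratio


if __name__ == "__main__":
    dist = {("a1", "a2"): 0.1, ("a1", "b"): 0.5, ("a1", "c"): 0.7,
            ("a2", "b"): 0.9, ("a2", "c"): 1.1, ("b", "c"): 0.4}
    print(select_for_otu(dist, ["a1", "a2"], ["b", "c"]))
```

In this toy example the in-OTU distance is 0.1 and the mean out-OTU distance is 0.8, so the ratio stays below the default 25% threshold and the sequence closest to the non-OTU sequences is kept.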
Selection according to user's expertise
In the semi-automatic mode, after computation of the in-OTU/out-OTU distance ratio as previously described, SCaFoS proposes that the user select the sequence that displays the lowest average distance. A visual flag indicates whether the in-OTU/out-OTU distance ratio exceeds the user-defined divergence threshold. In this manner, the user can choose between selecting the suggested sequence, selecting another complete sequence that he/she considers a better ortholog, or discarding the OTU from this gene. The user can use any external information to validate his/her choice, in particular a phylogenetic tree or the position of the genes on the chromosome (synteny). The use of human expertise is advised because there are no known reliable methods for automatically identifying orthologs. As this user intervention is time consuming, SCaFoS saves the information on selected sequences. In subsequent dataset constructions, this information can be reused, allowing the fast assembly of numerous combinations of genes and taxa. The sequence selected in the first run for each OTU becomes the default sequence for that OTU. As long as the list of complete sequences included in the OTU remains unchanged (i.e. no sequences are added or removed), SCaFoS automatically keeps the default sequence.
Construction of chimerical sequences
Finally, SCaFoS is able to modulate between the creation of chimera from partial sequences and the selection of complete sequences, by considering sequences with few missing characters as full-length sequences.
Global level of missing data
Once the sequences are selected for each gene, the user may want to select genes according to their global level of missing data. For this purpose, SCaFoS creates several directories that contain the processed files including the selected species and sequences. These files are sorted according to their level of missing species or characters, and an additional file containing the super-matrix is also produced for each level. Since there are no established rules on the maximum amount of missing data in a super-matrix, the user is free to select the threshold of missing data (either globally or for the species of interest) that he/she considers appropriate. In making this choice, the user is guided by the statistical information about the composition in genes, species and missing positions, the nature of the phylogenetic question also being of major importance.
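Sorting genes by their level of missing species can be sketched as below. The cumulative levels (0%, 10%, 20%, 30%), the function name and the input layout are assumptions chosen for illustration; the paper does not specify the levels or directory names that SCaFoS uses.

```python
# Hypothetical sketch: group genes by the percentage of missing OTUs.
# For each level, list the genes whose missing-OTU percentage does not
# exceed that level (cumulative bins are an assumption).
from typing import Dict, Iterable, List, Sequence, Set


def genes_by_missing_level(presence: Dict[str, Set[str]],
                           all_otus: Sequence[str],
                           levels: Iterable[float] = (0, 10, 20, 30)
                           ) -> Dict[float, List[str]]:
    """Map each missing-data level (in %) to the genes that satisfy it."""
    otu_set = set(all_otus)
    n_otus = len(otu_set)
    missing_pct = {
        gene: 100.0 * (n_otus - len(otus & otu_set)) / n_otus
        for gene, otus in presence.items()
    }
    return {
        level: sorted(g for g, pct in missing_pct.items() if pct <= level)
        for level in levels
    }


if __name__ == "__main__":
    otus = ["A", "B", "C", "D", "E"]
    presence = {"g1": {"A", "B", "C", "D", "E"},
                "g2": {"A", "B", "C", "D"},
                "g3": {"A", "B"}}
    for level, genes in genes_by_missing_level(presence, otus,
                                               (0, 20, 50)).items():
        print(level, genes)
```

The same idea applies to missing characters instead of missing species by replacing the OTU count with the fraction of missing positions per gene.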
Results and Discussion
Typical use of SCaFoS
Second, using the OTUs defined by the user, SCaFoS creates a copy of each file that will contain only the sequences of the species of interest. It should be noted that no chimerical sequences will be created at this step, and all sequences from a given OTU will be included in each file (Fig. 3, step 3: FILE SELECTION). With a reduced number of sequences, one can more accurately remove ambiguously aligned positions in each file, and construct preliminary phylogenetic trees of each gene to control for laterally transferred or paralogous genes (Fig. 3, step 4).
Third, for each OTU and each gene, SCaFoS selects one sequence or constructs a chimerical sequence following the steps shown in Figure 1, and assembles the final datasets (Fig. 3, step 5: ASSEMBLING DATASETS). In the semi-automatic mode, the user incorporates information from the trees constructed for single genes (step 4) to select sequences. Moreover, if phylogenetic trees are available in postscript format (produced by MUST), the selection is visually reported onto the trees.
Finally, all the relevant information about sequence selection is provided in a text file, allowing the analysis to be reproduced. Once the sequences are selected for each gene, files for super-matrix and super-tree analyses are generated in formats usable by MrBayes, PAUP, PHYLIP, or TREE-PUZZLE. Files summarizing the presence of OTUs for each gene and the amount of missing data in various datasets help the user to select the best set of genes for subsequent inferences.
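As an illustration of the super-matrix output, the following minimal sketch concatenates per-gene alignments and fills absent OTUs with question marks, as described for discarded OTUs. It does not reproduce any SCaFoS file format; the function name and the input layout (one dictionary per gene, mapping OTU name to aligned sequence) are hypothetical.

```python
# Hypothetical sketch: build a super-matrix by concatenating gene alignments.
# An OTU that is absent from a gene is padded with '?' over that gene's width.
from typing import Dict, List, Sequence


def build_supermatrix(genes: List[Dict[str, str]],
                      otus: Sequence[str]) -> Dict[str, str]:
    """Return one concatenated sequence per OTU, in gene order."""
    rows: Dict[str, List[str]] = {otu: [] for otu in otus}
    for index, alignment in enumerate(genes):
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"gene {index}: sequences differ in length")
        width = widths.pop()
        for otu in otus:
            rows[otu].append(alignment.get(otu, "?" * width))
    return {otu: "".join(parts) for otu, parts in rows.items()}


if __name__ == "__main__":
    genes = [{"A": "ACGT", "B": "AC-T"}, {"A": "GG", "C": "GA"}]
    for otu, seq in build_supermatrix(genes, ["A", "B", "C"]).items():
        print(otu, seq)
```

Each row of the result has the same total length, so the matrix can then be written out in whichever alignment format the downstream inference program expects.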
Evaluation of SCaFoS performance
Impact of missing data
First, the use of chimerical sequences slightly reduces the level of missing data. For instance, for a global level of 30% of missing data, chimeras allow the incorporation of seven additional genes (115 versus 108). This is not surprising because the Metazoa dataset is mainly constructed from EST sequences, implying that data will frequently be missing for the same, weakly expressed genes. In practice, chimeras are especially interesting for OTUs having a key phylogenetic position (i.e. that break long branches or that are the only representative of a taxonomic group of interest).
Second, the conservative elimination of sequences when several copies are present for a given OTU, as performed in the automatic mode of SCaFoS, has a much more drastic consequence. When the divergence threshold (the maximum allowed in-OTU/out-OTU distance ratio) is set to 60%, 25% or 1%, the global percentage of missing data in the final dataset is 16%, 24% and 64%, respectively. Nevertheless, a similar number of genes (52, 47 and 56, respectively) is incorporated in the datasets. Note that this severe effect is not only due to paralogy, but is also an incidental consequence of chimera construction through the OTU concept. In fact, when an OTU contains several species, the orthologous copies from these species are artificially treated in exactly the same way as paralogs from the same organism. As a result, the more divergent the species within the OTU are, the more likely SCaFoS is to remove the OTU, because at least one sequence will have a higher evolutionary distance than permitted by the divergence threshold. In such cases, the automatic approach of SCaFoS is too conservative. We strongly recommend the use of the semi-automatic mode, in which sequences are discarded only when paralogy problems are recognized by the user. Nevertheless, the automatic mode yields reasonable results when each OTU is represented by a single species (data not shown).
Sequence selection and the reduction of tree reconstruction artefacts
An important function of SCaFoS is to automatically determine, for each OTU, the best sequence for representing a given gene according to user-defined criteria. When several complete sequences are present for an OTU, SCaFoS tries to select the one that possesses the maximum amount of phylogenetic signal. To achieve this, the sequence that has the lowest evolutionary distance to all other sequences is selected to represent the OTU. As we will show, this approach helps to reduce the long branch attraction (LBA) artefact.
Table: Selection frequency for species included in the Arthropoda OTU. Number of sequences per species and their corresponding frequency in the two datasets used for Figures 5 and 6, constructed according to two different selection criteria: longest sequence (LC) or smallest evolutionary distance (SC).
Importance of the investigator expertise
Since there is no clearly defined limit for an acceptable level of global missing data, the investigator is free to choose his/her preferred compromise between the number of genes, the frequency of missing data and the stringency of the threshold used to extract the orthologs. To do so, the user is guided by a table containing the number of genes, of positions and of missing data for each subdirectory into which the resulting files with a given amount of missing data have been copied.
Some improvements could be considered. The most evident one is to take into account compositional biases when selecting sequences, especially when several sequences within an OTU have similar relative evolutionary distances. However, combining this criterion with the evolutionary distance is not straightforward because the compositional bias is not always correlated with the evolutionary distance. As we have shown, sequence length is not the best criterion for choosing a sequence, and estimating the evolutionary distances of partial sequences to create intra-gene chimeras would improve the results. Yet, the evolutionary distance of each fragment should be corrected for the difference in the average evolutionary rate of the corresponding part of the protein, because a conserved domain of a fast evolving species may have a slower evolutionary rate than a variable domain in a less divergent species. Taking the evolutionary distance into account when making chimeras would also have two advantages: (i) avoiding the risk of artificial heterotachy (i.e. incorporating partial sequences with various evolutionary rates), and (ii) allowing the comparison of complete and chimerical sequences in order to select the slowest evolving one. Another idea for creating chimeras might be to infer the ancestral state for each site; unfortunately, this rule is difficult to apply because it needs a within-OTU phylogenetic tree and at least 4 residues per site, two conditions rarely met when only a few overlapping sequences, like those obtained by EST methods, are considered. Finally, incorporating refined tools to facilitate species selection (i.e. the definition of the OTUs), such as the biclique and quasi-biclique algorithms [10, 11], would also be useful.
Phylogenetic studies based on a huge sampling of both genes and species remain rare despite the great quantity of genomic data currently available. We have designed software suited to wide use in a phylogenomic context. SCaFoS is a helpful tool for rapidly constructing large datasets of aligned sequences that can be easily used with different phylogenetic inference approaches. Simplifying the construction of these datasets should permit a better phylogenetic use of genomic data through various samplings of sequences, species and genes. This latter point is particularly important because of the increasing number of contradictory papers that are based on different samples, as illustrated by the question of Ecdysozoa monophyly [5, 22, 30–32]. Finally, we have shown that the selection by SCaFoS of the slowest evolving representative sequence of a monophyletic group is an efficient approach to reduce the impact of tree reconstruction artefacts, suggesting that increasing the amount of phylogenetic signal during the construction of phylogenomic datasets should be a priority for future research.
Availability and requirements
Project name: SCaFoS
Project home page: http://megasun.bch.umontreal.ca/Software/scafos/scafos.html
Operating systems: native Xwindow environment on Unix/Linux, Mac OSX and Windows platforms (Win32)
Programming language: Perl version 5.8.0 or later
Other requirements: Tcl/Tk version 8.4.5 or later and Tree-puzzle version 5.1 or later
List of abbreviations
EST: Expressed Sequence Tags
LBA: Long Branch Attraction
OTU: Operational Taxonomic Unit
WGS: Whole Genome Shotgun
We wish to thank Denis Baurain, Henner Brinkmann, Nicolas Rodrigue, Mike Sanderson and one anonymous referee for their helpful comments and suggestions. This work was supported by Genome Quebec. H.P. is a member of the Program in Evolutionary Biology of the CIAR and of the Canada Research Chairs. B.R. has been supported by 'Bourses d'Excellence biT', a strategic program of the Canadian CIHR, and N.R.E. by 'Programa de Formación de Investigadores del Departamento de Educación, Universidades e Investigación' (Government of Basque Country).
This article has been published as part of BMC Evolutionary Biology Volume 7, Supplement 1, 2007: First International Conference on Phylogenomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcevolbiol/7?issue=S1.
1. Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet. 2005, 6 (5): 361-375. 10.1038/nrg1603.
2. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol. 2000, 17 (1): 164-178.
3. Koonin EV: Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005, 39: 309-338. 10.1146/annurev.genet.39.073003.114725.
4. Pearson WR, Sierk ML: The limits of protein sequence comparison?. Curr Opin Struct Biol. 2005, 15 (3): 254-260. 10.1016/j.sbi.2005.05.005.
5. Philip GK, Creevey CJ, McInerney JO: The Opisthokonta and the Ecdysozoa may not be clades: Stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol Biol Evol. 2005, 22 (5): 1175-1184. 10.1093/molbev/msi102.
6. Wiens JJ: Does adding characters with missing data increase or decrease phylogenetic accuracy?. Syst Biol. 1998, 47 (4): 625-640. 10.1080/106351598260635.
7. Sanderson MJ, Purvis A, Henze C: Phylogenetic supertrees: assembling the trees of life. Tree. 1998, 13 (3): 105-109.
8. Anderson JS: The phylogenetic trunk: maximal inclusion of taxa with missing data in an analysis of the lepospondyli (Vertebrata, Tetrapoda). Syst Biol. 2001, 50 (2): 170-193. 10.1080/10635150119889.
9. Philippe H, Delsuc F, Brinkmann H, Lartillot N: Phylogenomics. Annu Rev Ecol Evol Syst. 2005, 36: 541-562. 10.1146/annurev.ecolsys.35.112202.130205.
10. Sanderson MJ, Driskell AC, Ree RH, Eulenstein O, Langley S: Obtaining maximal concatenated phylogenetic data sets from large sequence databases. Mol Biol Evol. 2003, 20 (7): 1036-1042. 10.1093/molbev/msg115.
11. Yan C, Burleigh JG, Eulenstein O: Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol Phylogenet Evol. 2005, 35 (3): 528-535. 10.1016/j.ympev.2005.02.008.
12. Driskell AC, Ane C, Burleigh JG, McMahon MM, O'Meara BC, Sanderson MJ: Prospects for building the tree of life from large sequence databases. Science. 2004, 306 (5699): 1172-1174. 10.1126/science.1102036.
13. Wiens JJ: Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol. 2003, 52 (4): 528-538. 10.1080/10635150390218330.
14. Philippe H, Snell EA, Bapteste E, Lopez P, Holland PW, Casane D: Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol Biol Evol. 2004, 21 (9): 1740-1752. 10.1093/molbev/msh182.
15. Wiens JJ: Can Incomplete Taxa Rescue Phylogenetic Analyses from Long-Branch Attraction?. Syst Biol. 2005, 54 (5): 731-742. 10.1080/10635150500234625.
16. Brinkmann H, van der Giezen M, Zhou Y, Poncelin de Raucourt G, Philippe H: An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst Biol. 2005, 54 (5): 743-757.
17. Felsenstein J: PHYLIP (Phylogeny Inference Package). 2001, Department of Genetics, University of Washington, Seattle, 3.6.
18. Philippe H: MUST, a computer package of Management Utilities for Sequences and Trees. Nucleic Acids Res. 1993, 21 (22): 5264-5272. 10.1093/nar/21.22.5264.
19. Maddison DR, Swofford DL, Maddison WP: NEXUS: an extensible file format for systematic information. Syst Biol. 1997, 46 (4): 590-621. 10.2307/2413694.
20. Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics. 2002, 18 (3): 502-504. 10.1093/bioinformatics/18.3.502.
21. Delsuc F, Brinkmann H, Chourrout D, Philippe H: Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature. 2006, 439 (7079): 965-968. 10.1038/nature04336.
22. Philippe H, Lartillot N, Brinkmann H: Multigene analyses of bilaterian animals corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol Biol Evol. 2005, 22 (5): 1246-1253. 10.1093/molbev/msi111.
23. Rodriguez-Ezpeleta N, Brinkmann H, Burey SC, Roure B, Burger G, Loffelhardt W, Bohnert HJ, Philippe H, Lang BF: Monophyly of primary photosynthetic eukaryotes: Green plants, red algae, and glaucophytes. Current Biology. 2005, 15 (14): 1325-1330. 10.1016/j.cub.2005.06.040.
24. Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003, 19 (12): 1572-1574. 10.1093/bioinformatics/btg180.
25. Swofford DL: PAUP*: Phylogenetic Analysis Using Parsimony and other methods. 2000, Sinauer, Sunderland, MA, 4b10.
26. Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Syst Zool. 1978, 27: 401-410. 10.2307/2412923.
27. Jobb G, von Haeseler A, Strimmer K: TREEFINDER: a powerful graphical analysis environment for molecular phylogenetics. BMC Evol Biol. 2004, 4 (1): 18. 10.1186/1471-2148-4-18.
28. Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8 (3): 275-282.
29. Lockhart PJ, Howe CJ, Bryant DA, Beanland TJ, Larkum AW: Substitutional bias confounds inference of cyanelle origins from sequence data. Journal of Molecular Evolution. 1992, 34 (2): 153-162. 10.1007/BF00182392.
30. Wolf YI, Rogozin IB, Koonin EV: Coelomata and not ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 2004, 14 (1): 29-36. 10.1101/gr.1347404.
31. Dopazo H, Santoyo J, Dopazo J: Phylogenomics and the number of characters required for obtaining an accurate phylogeny of eukaryote model species. Bioinformatics. 2004, 20 (Suppl 1): i116-i121. 10.1093/bioinformatics/bth902.
32. Dopazo H, Dopazo J: Genome-scale evidence of the nematode-arthropod clade. Genome Biology. 2005, 6 (5): R41. 10.1186/gb-2005-6-5-r41.
33. Castresana J: Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Mol Biol Evol. 2000, 17 (4): 540-552.
34. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52 (5): 696-704. 10.1080/10635150390235520.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.