ESTimating plant phylogeny: lessons from partitioning
© de la Torre et al. 2006
Received: 03 February 2006
Accepted: 15 June 2006
Published: 15 June 2006
Skip to main content
© de la Torre et al. 2006
Received: 03 February 2006
Accepted: 15 June 2006
Published: 15 June 2006
While Expressed Sequence Tags (ESTs) have proven a viable and efficient way to sample genomes, particularly those for which whole-genome sequencing is impractical, phylogenetic analysis using ESTs remains difficult. Sequencing errors and orthology determination are the major problems when using ESTs as a source of characters for systematics. Here we develop methods to incorporate EST sequence information in a simultaneous analysis framework to address controversial phylogenetic questions regarding the relationships among the major groups of seed plants. We use an automated, phylogenetically derived approach to orthology determination called OrthologID generate a phylogeny based on 43 process partitions, many of which are derived from ESTs, and examine several measures of support to assess the utility of EST data for phylogenies.
A maximum parsimony (MP) analysis resulted in a single tree with relatively high support at all nodes in the tree despite rampant conflict among trees generated from the separate analysis of individual partitions. In a comparison of broader-scale groupings based on cellular compartment (ie: chloroplast, mitochondrial or nuclear) or function, only the nuclear partition tree (based largely on EST data) was found to be topologically identical to the tree based on the simultaneous analysis of all data. Despite topological conflict among the broader-scale groupings examined, only the tree based on morphological data showed statistically significant differences.
Based on the amount of character support contributed by EST data which make up a majority of the nuclear data set, and the lack of conflict of the nuclear data set with the simultaneous analysis tree, we conclude that the inclusion of EST data does provide a viable and efficient approach to address phylogenetic questions within a parsimony framework on a genomic scale, if problems of orthology determination and potential sequencing errors can be overcome. In addition, approaches that examine conflict and support in a simultaneous analysis framework allow for a more precise understanding of the evolutionary history of individual process partitions and may be a novel way to understand functional aspects of different kinds of cellular classes of gene products.
In contrast, the majority of molecular phylogenies have postulated the gymnosperms to be a monophyletic group sister to all angiosperms. Most molecular studies place the Gnetales as a sister group to the conifers (Fig. 1B; [7–9, 15, 16]). However, some molecular evidence can also be interpreted as supporting the anthophyte theory . Attempts to associate molecular expression data with morphological structures (e.g. ) also place the Gnetales and conifers together, with shared expression of orthologous genes indicating that the Gnetum strobilar collar and ovule are homologous to the conifer bract-and-ovule/ovuliferous scale complex. Adding to the controversy, a recent study involving phytochrome genes (Fig. 1D; ) has placed the Gnetales as basal gymnosperms, with Ginkgoales and Cycadales as sister groups branching after the Coniferales. No recent combined analyses of molecular and morphological data have been produced and a very early one was equivocal . In the past this question has been addressed with a single partition and more recently with eight  and thirteen partitions . To our knowledge as of yet no general consensus has been reached as to the phylogenetic arrangement of these six major seed plant lineages. In fact the Tree of Life website for Spermatophytes  resolves only two nodes involving the five relevant taxa listed above. In the tree of life study, Gnetales are shown as the sister group to angiosperms, yet the difference between the Gnetales as the sister group to the angiosperms versus a monophyletic gymnoperms with cycads sister to other gymnosperms requires a very different set of morphological concepts and transformations. For example, are carpels leaves with marginal ovules or are they subtending leaves with axillary ovules? These are very different and mutually exclusive scenarios. Consequently, attempts to understand the genes involved in the innovations achieved by the angiosperms are severely hampered.
It is clear from the literature on seed plant phylogenetics that the addition of information relevant to the seed plants may be a viable way to solve this difficult problem. In addition, many studies on other taxa have demonstrated that the simultaneous analysis of multiple data partitions can result in an increase in overall branch support, despite conflict among the characters, due to emergent properties not evident in the separate analyses of individual data partitions [23–27]. An additional positive aspect of adding process partitions to an analysis is that once a large number of partitions from various cellular functional classes are available, partitioned analysis will also allow detailed examination of the evolutionary dynamics of these classes of genes. The latter advantage may shed light on the role of certain genes in organismal evolution.
Sequencing errors and orthology determination pose challenges to the use of ESTs as a source of characters for systematics. There can be a high rate of sequencing error in raw EST data, since it is derived from single pass reads. A strategy to minimize this problem is contig assembly and EST clustering using several reads at every region (e.g. [39, 40]). In our approach, a minimum of 10 reads were used to determine each EST sequence. While orthology assessment is difficult in sub-genomic studies such as ones that use PCR or gene cloning approaches to obtain sequences, one can enhance orthology assessment in such studies by careful design of primers, and by referring to the whole genome sequences of closely related model taxa as guides for assessing orthology. Assessing orthology of ESTs is more difficult, because of the inaccuracy that accompanies EST analysis and by the possibility that some desired orthologs are not expressed or expressed at low levels. We determined the orthology of EST sequences using a tree-building approach. Initially this was accomplished by including ESTs in the gene tree analysis for each gene family. Without automation, this approach would be prohibitively time consuming and labor intensive and would greatly restrict the use of genomic-scale EST data in phylogenetic analyses. Therefore, during the course of this study, we developed automated methods for orthology determination within a parsimony framework, described elsewhere  (see also Methods section below for an overview).
Studies with large numbers of process partitions in a dataset exist [23, 26, 42–48], and some of these have attempted to address higher phylogenetic questions of mammals, yeast and bacteria by taking advantage of genomic level approaches. These large data set approaches can be divided into whole-genome approaches (mostly microbial) and "subgenomic"  approaches. The whole-genome approach has the obvious advantage that orthology assessment is made with more certainty and ease when whole genome sequences are used in a phylogenetic analysis. Such is the case in a study of the relationships of seven ingroup yeast species with whole genomes sequenced  and much of the whole genome bacterial phylogenetic studies that are beginning to appear in the literature [44, 45]. The yeast study is particularly interesting in that the authors approached the very question of where in sequence analysis space we need to be to resolve phylogenies with robustness, Using 106 carefully chosen orthologous genes they showed that > 20 genes or > 15 kb of sequence produced a plateau of robustness at measures of 100% for conventionally used detectors of node robustness (bootstrapping; ) in phylogenetics. In addition, they showed that despite rampant incongruence (as typified by the large number of single gene trees that disagree in topology amongst the 106 single gene trees that can be produced), combining gene partitions into a concatenated or simultaneous analysis [51, 52] was always the best way to analyze the sequence information in a phylogenetic context.
Implementing a different node support measure  than the ones used in the yeast study, DeSalle  demonstrated that this phenomenon is the result of hidden support in the various gene partitions included in the analysis. Hidden support  is simply the amount of support at a node that is NOT found in the separate gene partitions analyzed individually. All character partitions have either positive, neutral or negative latent support for any given phylogenetic hypothesis, that becomes evident only after combining or concatenating data partitions and performing a simultaneous analysis of all available data. An assessment of hidden support using the yeast dataset of Rokas et al.  reveals that one in every five characters that support the simultaneous analysis (SA) tree is hidden . This large amount of hidden support for the nodes in the SA tree, suggests that interaction of character information is an important concept in reconstructing phylogenetic relationships. More importantly quantifying hidden support can enlighten researchers about the degree of positive or negative interaction of characters in a concatenated analysis that can help determine "next steps" in phylogenetic studies. Hidden support and partitioned analyses can also aid in determining the effects of missing data and the congruence of particular partitions with an overall phylogenetic hypothesis. Issues arising from missing data are especially problematic with EST phylogenetic studies and are caused by two factors. First, in EST studies partial gene sequences are more frequently reported than full length cDNA sequences; and second, because of the random nature of clones in EST libraries often times orthologs are not found in all taxa in the study. An exploration of the amount of support contributed by each partition to the simultaneous analysis tree can aid in determining the effect of these two kinds of missing data on overall phylogenetic hypotheses.
In our analysis, cycads appear basal to a grouping of Ginkgo, Gnetum and conifers, with Gnetales sister to the Coniferales. This arrangement is in accordance with other recent molecular studies, that conflict with the Anthophyte hypothesis. These other studies did not include a morphological component. Our separate analysis of morphological characters supported the grouping of Gnetales with angiosperms. However, the inclusion of the morphological data set in a combined analysis contributed nine steps of hidden support to the grouping of Gnetales and Coniferales.
Total branch support on the simultaneous analysis tree in Figure 3 is 275 steps. Of those 275 steps, only 81 are apparent in the separately analyzed partitions, while 194 are hidden. In other words, 70.5% of the phylogenetically informative characters provide hidden support that would not have been apparent had each gene region been analyzed separately. Figure 3 shows the distribution of this hidden support (contributed by the 43 partitions) at each node of the tree. Strikingly, 100% of the support for node 3 is hidden (69.3% hidden for node 1, 65.6% hidden for node 2 and 78.1% hidden for node 4).
In general, the broad nuclear gene data partition proved to be the most consistent with the simultaneous analysis tree. Not only is the nuclear gene tree topologically identical to the simultaneous analysis tree, but all simultaneous analysis tree nodes also receive positive branch support, both hidden and apparent, from the nuclear data set. While this might be seen as an argument for the preferential use of nuclear genes (or ESTs) in phylogenetic analyses, over the molecular characters from other subcellular compartments (such as chloroplast and mitochondria), these subcellular compartments did contribute character support to the simultaneous analysis tree despite topological conflict. In addition, the topological differences among subcellular gene partitions examined using the ILD test (implemented in PAUP*) were not significant (p-value > 0.05, see below).
In order to explore the interaction among data partitions analyzed separately, we calculated ILDs  and tested the significance of the resulting length differences  between all possible pairwise comparisons of the individual data partitions as well as among the broader scale groupings of the data that we examined for hidden support above. Of the 937 pairwise comparisons, 83 showed significant length differences [see Additional file 4]. Of those 83 conflicting pairwise comparisons, 13 were comparisons between the morphological data set and an individual gene partition; 18 were between mitochondrial CO1 and another individual partition; 11 were between the nuclear heat shock protein 82 and other individual partitions; and 8 were between the chloroplast RNA polymerase beta subunit 1 and other individual partitions. We highlight these examples to show that no single partition dominated in terms of contributing conflict, but only a handful of partitions are involved in significant length differences. As with our examination of hidden support, when we examined conflict among broader scale groupings of the data, we found less conflict (as measured by ILD). In addition, none of the broader scale groupings examined for hidden support showed significant conflict (as measured by ILD) except for those groupings compared to the morphological data set.
Several of the partitions we used had missing taxa due, for example, to the lack of available sequence data for a given taxon for a particular gene region [see Additional Files 1 and 3]. We explored the effect these missing taxa had on the overall phylogenetic hypothesis by comparing the amount of branch support and hidden branch support for each node using partitions where information was available for 7, 6, 5 and 4 taxa. This type of analysis is particularly relevant to EST studies as the probability of obtaining a full complement of taxa for a particular ortholog is reduced as the number of taxa in the analysis increases. Recent studies using large data sets, also containing ESTs [58, 59] examined the effect of missing data by removing taxa with large amounts of missing data and comparing those results to an analysis in which these taxa were not excluded. Since the results of these analyses were similar, it was concluded that the use of taxa with large amounts of missing data did not bias the results. A simulation study  concluded that it is not the amount of missing data that is problematic in terms of resolving trees but the presence of too few characters to allow taxon placement.
In our analyses, we established a matrix with six ingroup taxa and one outgroup, and no taxa were removed from the matrix at any time in our analysis regardless of amounts of missing data in the various partitions. This approach allowed us to explore the effect of the inclusion of taxa with missing data by examining branch support values contributed to the simultaneous analysis tree by partitions with varying amounts of missing data. In this case, we compared the contribution to support provided by those partitions that contained at least some data (but not necessarily the complete dataset) for all seven taxa to the group of partitions that were lacking data for one taxon for an entire partition (an individual gene region); to those that were lacking data for two taxa and so forth. In this way we were able to examine the affect of incompletely taxonomically sampled partitions.
We also partitioned the data set into classes of genes based on their cellular function. Our functional partitions were MnoPSR (Non-Photosynthetic or Respiratory Metabolism: 7 gene partitions), photosynthetic (11 gene partitions), respiration (7 gene partitions), signalling (3 gene partitions), structural (8 gene partitions), transcription factors (2 gene partitions) and genes of unknown function (4 gene partitions).
While our results would suggest that conserved structural proteins and signalling proteins might be better at defining both deeper nodes and tree tip nodes, and proteins in the transcription factor and respiration class of genes might be best for nodes nearer the tips of the tree; the sample size of genes in functional partitions for this study is small. Nevertheless, these results do suggest a potential method for categorizing functional gene classes with respect to their congruence with a simultaneous or organismal phylogeny. As more ESTs are added to the sequence database, the sample sizes of these functional classes will become larger and a more rigorous test of the role of functional class in phylogenetic analysis may be possible.
Rokas et al.,  addressed the question of where in sequence analysis space we need to be to robustly resolve phylogenies; showing that > 20 genes or > 15 kb of sequence produced a plateau of robustness at measures of 100% for conventionally used detectors of node robustness. The > 15, 000 base pairs and > 40 genes in the present study is only enough to garner strong support for three of the four nodes in the concatenated analysis tree with node 3 receiving the weakest support in the analysis. More sequence information is thus needed to resolve this problem, and it appears from the hidden support analysis that nuclear gene partitions will most efficiently provide information for all nodes in the concatenated analysis tree. In addition, both the yeast analysis by Rokas et al. , and the seed plant analyses presented here strongly suggest that even though a single gene partition might support an alternative topology to the concatenated analysis tree, hidden support in most gene partitions will contribute positively to overall robustness of a phylogenetic hypothesis. Finally, the yeast and seed plant examples, while having similar numbers of ingroup taxa, suggest that different numbers of characters and genes will be needed to assign robust inferences to nodes in studies. We suggest that this discrepancy may be a factor of the different phylogenetic ages of the groups: the ingroup species in the yeast phylogeny diverged between 50 and 100 MYA [63, 64] is basically within a genus, while the ingroup taxa in the plant study diverged no earlier than 400 MYA [65–67]. In addition, several studies with much larger numbers of ingroup taxa exist [23, 42, 44] and these studies suggest that larger numbers of characters than those of the yeast study are required for robust resolution of this simple phylogenetic hypothesis. How many more characters? A strong indication may be given by the high support and robustness of node 1 (Arabidopsis + Oryza). A plateau of topological robustness for other nodes may be reached when a similar number of phylogenetically informative characters is reached.
The approach we describe here, where support for the SA tree is estimated for each process partition, will also pinpoint those partitions that disagree or conflict with the overall general pattern of divergence of the taxa in the analysis. If one assumes that the SA tree best represents evolutionary history of the taxa involved, then such partitions are in conflict with overall organismal history of the taxa in the analysis. This approach then would provide a method for detecting process partitions that might be selected for or have experienced drift and such partitions might be important in some of the more interesting organismal differences amongst the taxa in the analysis.
One final and important aspect of the present analysis highlights a problem that will be prevalent in future genomic level phylogenetic studies. This problem concerns the almost continual revision of the overall phylogenetic hypothesis for a set of taxa. For instance, as more and more EST data are added to the database, more and more process partitions can be added to an analysis. This will effectively create a growing matrix that might even expand daily. With the addition of each new process partition to an analysis, all support values and other tree metrics such as bootstrap values , jackknife values [68, 69], Bayesian posterior probabilities [70–72], and node support values [21–23, 42, 61, 62, 73] need to be recalculated. In addition, the manual inclusion of the new process partitions to a growing matrix is time consuming and sometimes prone to error. We therefore suggest that such important systematic questions where large amounts of genomic level data are available have a need for an automated and rapid means for inclusion of new process partitions to the growing matrix. Such an automated approach is under development for the seed plant question and will be discussed in a separate publication .
Simultaneous analysis using 42 gene partitions and a morphological partition yield a phylogenetic hypothesis with a monophyletic gymnosperms which is at odds with the Anthophyte hypothesis.
Addition of short EST sequences to a data set can enhance a phylogenetic analysis, if the problems of sequence quality and orthology are overcome.
The majority of support in this study is hidden support, meaning that the support is not immediately apparent in single gene partitions.
Completeness of data partitions with respect to full complement of taxa had a large affect on levels of support in phylogenetic analysis. In our study example with seven taxa, support from partitions that had sequences for five or fewer taxa was nonexistent. However, variation in the amount and distribution of data within partitions may also play a role.
When phylogenetic incongruence between a partitioned functional class of genes (such as transcription factors) and the organismal phylogeny is detected, this result suggests that the partition has experienced a unique evolutionary history relative to the organisms. This different evolutionary history can be used as a signpost of altered evolutionary pressure in a particular class of genes. In this way, incongruence of a particular class of genes (such as transcription factors) in a partitioned analysis allow us to establish hypotheses about the evolution and potential function of these gene classes.
Many studies use pairwise sequence comparison schemes, such as BLAST , COG (Clusters of Orthologous Groups; ), INPARANOID [77, 78], RBH (Reciprocal Blast Hits; [79, 80]), and RSD (Reciprocal Smallest Distance Algorithm; ) to determine gene orthology on a genomic scale.
Since we are ultimately interested in exploring the characters associated with particular evolutionary novelties, we use a character-based alternative to distance based methods for the identification of orthologous gene regions. The tree-building approach to orthology determination involves the generation of gene family trees in order to identify the orthologous gene family member for each EST sequence. Within a character-based parsimony framework, nodes are defined by shared derived characters.
Without automation, this approach would be prohibitively time consuming and labor intensive and would greatly restrict the use of genomic-scale EST data in parsimony based analyses since the placement of ESTs into orthology groups using this character-based approach would require manual rebuilding of gene family trees for each new EST to be classified. Therefore, during the course of this study, we developed automated methods for orthology determination within a parsimony framework: OrthologID ,  firefox/Netscape is the preferred browser–Internet Explorer is not supported by the current OrthologID viewer). This approach builds gene family trees using sequences from available completely sequenced genomes (currently, Arabidopsis, Oryza and Populus; Chlamydomonas reinhardtii is used as an outgroup in the gene tree analyses for orthology determination), such that all members of a given gene family are included. In addition, rather than including EST sequences in the gene tree construction analysis, the whole genome gene family trees are first used to construct "guide" trees. These gene family guide trees are used to identify diagnostic characters for each gene family member and then EST query sequences are screened for the presence of shared diagnostics using the CAOS algorithm (See  for details of the guide tree/CAOS approach). This approach eliminates the need to manually rebuild a gene family tree each time a new EST sequence requires orthology determination.
We use only completely sequenced genomes for constructing gene family guide trees with OrthologID in order to minimize the possibility of the erroneous placement of query sequences due to missing data. If gene family guide trees had been constructed using partially sequenced genomes, it is possible that some gene family members would be missing, in which case it could be possible that queries orthologous to these missing gene family members would be incorrectly placed. The current database of plant genomes will soon be expanded to include complete genomes from other phylogenetic lineages, including prokaryotes and non-plant eukaryotes.
OrthologID automatically searches the local database of completely sequenced plant genomes and performs an initial clustering of gene sequences into putative gene families, using NCBI BLAST  with an expectation value cutoff of 1e-20. Next it builds gene family trees. It performs sequence alignments using the program MAFFT  using different sets of alignment parameters to create three different alignments for each gene family and culls  alignment ambiguous regions. The three pairs of gap open penalty and offset values are (1.53, 0.123), (2.4, 0.1), and (1.0, 0.2). It performs tree searches within a parsimony framework, using either exhaustive searches or, where exhaustive tree searches are not possible due to a large number of putative gene family members, heuristic searches are performed implementing the parsimony ratchet  with 200 re-weighting iterations for each of 20 ratchets; in order to rigorously explore tree space. It saves resultant trees and computes the strict consensus when multiple equally parsimonious trees are obtained from the analysis and then passes these guide trees to the CAOS algorithm to identify node diagnostics. In order to identify the ortholog of an EST sequence, OrthologID uses the CAOS algorithm to screen the ESTs for the presence of characters that are diagnostic of nodes on the guide tree.
Once EST orthologs had been identified, we manually assembled a process partition matrix for each orthologous gene region for all of the seven chosen plant taxa; sequences for the moss Physcomitrella patens [87, 88] were used as outgroup in all phylogenetic analyses. We aligned the sequences for each process partition using the default parameters in Clustal . We assembled a simultaneous analysis matrix composed of 42 gene regions consisting of mitochondrial (6), chloroplast (16) and nuclear (20; including 19 composed of EST protein sequences and one of DNA sequences [18S rDNA]) along with a single morphological partition. A list of all the gene regions, and the accession numbers of all sequences used in the analysis, can be found in Additional file 1. In several instances in which a mitochondrial or chloroplast sequence was not available for a given taxon, we substituted the corresponding sequence from a related species. These are noted in Additional file 1. The matrix used in the analyses can be found as Additional file 2. Phylogenetic analyses were accomplished in PAUP* version 4.0b1.0  using exhaustive searches. Measures of branch support [21, 23, 61, 62] were accomplished using batch command files in PAUP*, and the resulting log files were imported into an Excel spreadsheet for final calculations.
The authors acknowledge Rob Martienssen and W. Richard McCombie (CSHL), Indra Neil Sarkar (AMNH) and other members of the New York Plant Genomics Consortium (NYPGC) for stimulating discussion and comments on the manuscript. The work in this manuscript was supported by an NSF Plant Genome grant #DBI -0421604 to GMC, DWS, and RD. MGE, RD and DWS thank the continued support of the Lewis B and Dorothy Cullman Program in Molecular Systematics at the AMNH and the NYBG.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.