Exploration of phylogenetic data using a global sequence analysis method
© Chapus et al; licensee BioMed Central Ltd. 2005
Received: 15 April 2005
Accepted: 09 November 2005
Published: 09 November 2005
Molecular phylogenetic methods are based on alignments of nucleic or peptidic sequences. The tremendous increase in molecular data permits phylogenetic analyses of very long sequences and of many species, but also requires methods to help manage large datasets.
Here we explore the phylogenetic signal present in molecular data by genomic signatures, defined as the set of frequencies of short oligonucleotides present in DNA sequences. Although violating many of the standard assumptions of traditional phylogenetic analyses – in particular explicit statements of homology inherent in character matrices – the use of the signature does permit the analysis of very long sequences, even those that are unalignable, and is therefore most useful in cases where alignment is questionable. We compare the results obtained by traditional phylogenetic methods to those inferred by the signature method for two genes: RAG1, which is easily alignable, and 18S rRNA, where alignments are often ambiguous for some regions. We also apply this method to a multigene data set of 33 genes for nine bacterial species and one archaeal species, as well as to the whole genomes of a set of 16 γ-proteobacteria. In addition to delivering phylogenetic results comparable to traditional methods, the comparison of signatures for the sequences involved in the bacterial example identified putative candidates for horizontal gene transfers.
The signature method is therefore a fast tool for exploring phylogenetic data, providing not only a pretreatment for discovering new sequence relationships, but also for identifying cases of sequence evolution that could confound traditional phylogenetic analysis.
Phylogenetic classifications traditionally rely on phenotypic traits and the paleontological record. As a result of the large amount of DNA sequences now available in the databases, molecular phylogeny has become an essential companion in studying evolutionary relationships among species. As usually practiced, it allows the construction of phylogenetic trees based on differences between homologous sequences or genes. A basic and indispensable step in a phylogenetic study is the alignment of the set of homologous sequences. However, distantly related sequences can be difficult to align and, under these conditions, different algorithms often lead to different phylogenetic results [5, 6]. There are other problems linked to the use of biological sequences in phylogenetic analysis, including the sampling of representative sequences and biological processes such as lateral gene transfer, fusion events and recombination (see Brocchieri et al. for a review).
New approaches to molecular phylogeny, taking into account new characteristics of sequences, have recently been developed. Such methods use other aspects of molecular data, such as structural properties of proteins, the presence and organization of genes along genomes [8–11], the occurrence of characteristic patterns [12, 13] and the relative abundance of short nucleotide or peptide words [14–18]. These methods contribute to the understanding of species evolution from different points of view, particularly in terms of our understanding of genome evolution. What is intriguing about these methods is that they often yield phylogenetic results comparable to those of traditional methods, frequently employing data sets much larger than those of traditional phylogenetic analyses. As such, they deserve the attention of those wishing to extract maximal information from comparative genomic data sets.
We expand on a method to characterize DNA sequences: the sequence signature. The sequence signature is defined as the whole set of frequencies of short oligonucleotides (words, currently up to ten nucleotides long) of a sequence. The principal characteristics of the sequence signature used for phylogenetic studies are its species-specificity and its conservation in any part of the genome, which allow researchers to compare sequences from diverse regions of the genome. It has already been established that distances between species signatures of the same taxonomic group are smaller than between signatures of species belonging to different groups [19, 21]. A difference of signatures between two sequences could arise from shifts in the pattern of point substitution, but could also involve interactions among adjacent nucleotides, natural selection, DNA repair processes and conformational constraints (supercoiling, nucleosome formation, DNA bending). A phylogenetic analysis of signatures could therefore reflect underlying genomic changes that shift motif frequencies, thereby yielding higher-order homologies available for phylogenetic analysis. The method has already been used for taxonomic classification of some species groups [23–25]. One advantage of such a method is that it avoids the alignment step and can be used on numerous sequences of varying size. In addition, distance matrices, such as those applicable to genomic signatures, generally permit fast building of trees. Perhaps most importantly, genomic signatures provide a means of comparing large-scale patterns in genomes and can help evaluate trends in genome evolution across a phenetic tree. However, no systematic analysis of the reliability of the signature approach has been performed on homologous sequences.
It has been demonstrated that long word frequencies describe DNA sequence information more accurately [19, 25], but because long words are much more numerous, they are difficult to apply to short sequences, where word frequencies are poorly estimated. Wang et al. have also qualitatively analyzed the impact of the choice of the divergence metric on phylogenetic results. However, no quantitative analyses or simulations have yet been presented on this subject.
In this paper, we investigate statistically the ability of a signature approach to reconstruct phylogenies, specifically in order to determine the optimum word length and the influence of the divergence metric on the results. One of the tests we employ allows us to determine whether the signature distance can be considered tree-like, possessing hierarchical information. Working with homologous, fully alignable sequences, we tested the method on simulated sequences whose true topologies are known and also analyzed two published examples of DNA sequences that propose novel interspecific relationships. Overall we find that there is a strong correspondence between signature trees and those generated by conventional means. As a means of improving large multi-gene studies [27, 28], we also propose the use of signatures for rapid, large-scale sequence analysis, specifically to detect subsets of genes supporting similar species phylogenies and to identify cases of horizontal transfer. In an analysis of 16 complete γ-proteobacteria genomes, we also illustrate how the signature method can be used on data sets in which some gene sequences are missing.
Results and discussion
Word length and metrics
We then tried to determine how tree-like the trees inferred by the signature method were, and whether the distances in our signature matrices reflected tree distances. To do that, we used the distance matrices and the trees of the RAG1 study (see below for a discussion of these results). Various criteria for evaluating treeness, such as arboricity and stress, were used as proposed by Guénoche and Garetta to answer this question. Considering the three sums involved in the four-point condition in quadruples, arboricity measures the percentage of quadruples for which the middle sum is closer to the largest one than to the smallest one. Stress corresponds to the square root of the quadratic difference between tree and matrix distances divided by the average distance value. These criteria are numerical and topological. All the criteria were calculated on the signature-based distance matrices. These distance matrices were obtained using different word lengths (between 1 and 10), because we have no a priori knowledge of the optimum length.
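The arboricity criterion described above can be illustrated with a minimal Python sketch. It assumes the distance matrix is a symmetric list of lists, and the function name `arboricity` and the handling of ties (a quadruple with three equal sums is not counted) are choices made for this illustration, not details taken from Guénoche and Garetta. Stress is not implemented here because it also requires the distances read from the fitted tree.

```python
from itertools import combinations


def arboricity(d):
    """Percentage of quadruples whose middle four-point sum is closer
    to the largest sum than to the smallest one.

    d is a symmetric distance matrix given as a list of lists.
    """
    hits = 0
    total = 0
    for i, j, k, l in combinations(range(len(d)), 4):
        # The three sums of the four-point condition, sorted ascending.
        s = sorted((d[i][j] + d[k][l],
                    d[i][k] + d[j][l],
                    d[i][l] + d[j][k]))
        total += 1
        # Assumption: strict inequality, so exact ties are not counted.
        if s[2] - s[1] < s[1] - s[0]:
            hits += 1
    return 100.0 * hits / total if total else float("nan")
```

For distances read from a tree with positive internal branches, the two largest sums are equal for every quadruple, so the function returns 100.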
d T decreases when word length increases (Fig 3), indicating a convergence of the trees towards a stable topology that is reached with 6-letter words whatever the metric used; for longer words a plateau is observed. The 5- or 6-letter word signature trees are comparable to those obtained by NJ or ML. The d T values observed between the signature trees and the NJ or ML trees are similar to those between the conventional NJ and ML trees for words of 5 letters or more, confirming our choice of 6-letter words for the study.
Simulation of sequences
We decided to compare signature trees to known trees using simulated sequences from a known phylogeny. Our simulation tests used a protocol similar to the work of Kumar and Gascuel. 100 phylogenetic trees were chosen randomly from a dataset of 2000 random trees proposed by Gascuel to test phylogeny methods. These simulation sets are composed of 24-taxon or 96-taxon trees. For each tree T, we used SEQGEN to generate 10 data files with sequences of length 1 kb, 3 kb and 5 kb. These sequences were obtained by simulating the evolution of nucleotides along T according to the Kimura two-parameter model with a transition/transversion rate of 2 and a model of site-specific rate heterogeneity following a gamma distribution (with parameter α = 0.75). For each sequence length and each number of taxa, we obtained 1000 data files.
Simulation results with 1000 trees. The values correspond to the proportion of wrong branches in the inferred trees. Two distance metrics (χ2 and Euclidean) were used with three word lengths. For the distance method, three different evolutionary models were used: JC, K2P and HKY85.
Signature rows: Euclidean with 4-, 5- and 6-letter words; χ2 with 4-, 5- and 6-letter words.
The methods are compared by their ability to infer the "true" tree, i.e. the topology of the tree that was used to generate the sequences. We used the Robinson-Foulds topological distance d T between the inferred tree and the true one. The Robinson-Foulds bipartition distance is equal to the number of bipartitions present in one of the two trees and not in the other. The results are presented in terms of the percentage of misinferred branches. This percentage is equal to the topological distance divided by the maximum number of different bipartitions between two trees: 2N-6, where N is the number of taxa.
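The bipartition distance and the percentage of misinferred branches can be sketched in Python as follows. This is a minimal illustration that assumes trees are written as nested tuples of leaf names (a hypothetical representation chosen for brevity, not the file format used in the study) and that both trees carry the same N ≥ 4 taxa.

```python
def leaf_set(tree):
    """Return the set of leaf names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree as unordered pairs."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - leaves
        # Keep only internal branches: both sides need at least two leaves.
        if len(leaves) >= 2 and len(other) >= 2:
            splits.add(frozenset([leaves, other]))
        return leaves

    visit(tree)
    return splits


def robinson_foulds(t1, t2):
    """Number of bipartitions present in one tree and not in the other."""
    return len(bipartitions(t1) ^ bipartitions(t2))


def percent_wrong_branches(inferred, true):
    """Topological distance as a percentage of the maximum 2N - 6."""
    n = len(leaf_set(true))
    return 100.0 * robinson_foulds(inferred, true) / (2 * n - 6)
```

With five taxa the maximum distance is 2N-6 = 4, so two trees that share one of their two internal branches are 50 % apart.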
In both methods, the Neighbor-Joining reconstruction algorithm was used. The differences in the results come principally from the choice of the distance. The Kimura two-parameter distance can be designated as the "true" distance, because the parameters of the distance are exactly the same as those chosen to generate the sequences. Normally, therefore, the Kimura distances should match the branch lengths of the original trees. The fact that the results obtained by the distance method are not perfect can be attributed to the Neighbor-Joining reconstruction algorithm (see Gascuel). HKY85 is a model that includes the Kimura 2-parameter (K2P) model, so the results should be the same.
The proportion of wrong branches decreases in the signature method when word length increases (Table 1). At the same time, the longer the sequences, the better the results with the signature method. However, the proportion of correct branches obtained from the signature is not as high as for the distance method. As expected, the results of HKY85 are the same as those of the Kimura 2-parameter model. The results of the Jukes-Cantor model are similar to those of the signature for 1 kb sequences, but for longer sequences the signature method is less effective than the JC method. The result of K2P can be explained by the fact that the distance method uses exactly the model used to generate the data. This fact also explains why the results of the signature method improve less with increasing sequence length than those of the distance method. The fact that, for the moment, no evolutionary model can be designed for signatures limits the estimation of distances between signatures. An improvement would be to find how the signature evolves with time as a function of nucleotide substitution models. Increases in sequence length facilitate estimation of distance by conventional methods, because the substitution model is known. With the signature, 3 kb sequences are sufficient to obtain a representative signature of the species using 6-letter words. As a result, the increase in accuracy between 3 kb and 5 kb is not significant.
Despite the fact that no evolutionary model has been used with the signature, the results obtained from the signature method are reasonable. With 6-letter words, only 10 % of the internal branches are incorrect. This can be compared to the results presented by Gascuel. The results of the signature method are not as good as those of the distance method, but they are nevertheless rather accurate. In general, the median size of genes is around 1 kb. Longer sequences will mostly be used in the case of non-homologous sequences, and for such long sequences no conventional method can be applied.
We used RAG1, a highly conserved gene that produces small distances between sequences, to infer the vertebrate phenetic tree. The analysis of the 46 sequences in the dataset showed that four sequences were complete and the others contained only the conserved core, with lengths ranging from 1 kb for core sequences to 3 kb for complete ones. This large difference in length induced a bias in the signatures of the four complete sequences, and so in the trees obtained. For comparison with published works, we only used the conserved core of the RAG1 gene.
In the signature tree, species are placed within classes in agreement with taxonomy. For example, in the signature analysis, the relationships within birds are congruent with conventional analysis . With regard to mammals, the signature method is the only method that correctly recovers bats as a monophyletic group, with the exception of Felis catus. But the cat, Felis catus, is misplaced by every method, and so its incorrect placement cannot be attributed to a specific phylogenetic method. Mammal relationships appear much more problematic when analyzed by conventional phylogenetic methods than with the signature method. The polyphyly of tetrapods may be explained by the paucity of batrachian sequences, which could lead to an unreliable position for this clade. The monophyly of taxonomic classes, as well as relationships within each class appear quite robust as measured by bootstrap values.
Difference in log likelihood. The differences are computed between the ML tree and the other trees.
Rows: χ2 signature trees with 4-, 5- and 6-letter words; mean of random trees.
This study, based on an article of Soltis et al. , used 18S rRNA for 93 plant species whose sequences are available from the "Green Plant Phylogeny Research Coordination Group" http://ucjeps.berkeley.edu/bryolab/GPphylo/RNA/18S_12.html. The species can be grouped into nine main clades (Angiosperms (flowering plants), Conifers, Gnetales, Cycads (palm trees), Hornworts, Liverworts, Ferns, Mosses, Lycophytes), with some additional isolated species and an outgroup.
Recent analyses based on molecular data confirm this result (Soltis and Källersjö). In addition, Equisetum and Psilotaceae are placed with the Ferns. This grouping is found in other studies [50, 51], and these species are presented as a sister group of Ferns. The sister group relationship of Psilotaceae and Ophioglossaceae is also found. Contrary to the results obtained by Soltis, the ferns are polyphyletic in the signature tree.
The outgroup separates the plants into two groups: the seed plants and the other land plants. To confirm the position of this outgroup, 18S rRNA sequences of Homo sapiens, Saccharomyces cerevisiae and Schizosaccharomyces pombe have been added (Data not shown). The outgroup is still confirmed as well as the tree split. This separation of land/flowering plants, the separation of the Lycophytes and the fact that the moss and liverwort do not form a monophyletic clade have been found also by Soltis when a NJ analysis was performed . Thus, the signature method leads to a similar topology as the NJ method with alignment.
Phylogenetic trees carry two types of signal: species evolution and gene evolution. For a variety of reasons, gene trees can be different from the tree of species from which they are sampled . In addition, signals coming from different genes could lead to different inferred phylogenetic relationships between species .
In order to deal with this problem, several genes can be used to build a multigene tree [27, 28]. The addition of signals coming from various genes can under some conditions reinforce the information on species evolution. In general, the alignment of each gene can be determined, and alignments concatenated prior to tree building. The signature has many properties that facilitate the calculation of multigene trees.
Another problem deals with the selection of genes participating in the multigene tree. In general, several selection steps are carried out to eliminate horizontally transferred genes, duplications or genes leading to aberrant phylogenies (see [27, 28] for an example of these steps). Signatures are an ideal pretreatment tool for identifying horizontally transferred genes and for selecting those genes that conform to the evolutionary relationships of the species under consideration. Moreover, due to the rapidity of the treatment with the signature, a very large number of genes can be tested at once.
We propose applying the signature method to infer a consensus tree of multiple genes. Two methods are possible. First, assuming that each gene brings the same quantity of information to the phylogeny for each species, an average signature is computed from several genes. The set of average signatures is then analyzed by the signature method. Another approach is to assume that each gene brings a quantity of phylogenetic information that is correlated with its length. In this approach, the sequences are concatenated and signatures are computed on the set of concatenated sequences.
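The two consensus strategies described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, words are counted on a single strand with overlapping windows (an assumption), and the literal concatenation creates a few artificial words at gene junctions.

```python
from collections import Counter
from itertools import product


def signature(seq, k):
    """Frequencies of all 4**k words, in lexicographic order (A, C, G, T)."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Words containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def average_signature(genes, k):
    """Equal weight per gene: mean of the individual gene signatures."""
    sigs = [signature(g, k) for g in genes]
    return [sum(column) / len(sigs) for column in zip(*sigs)]


def concatenated_signature(genes, k):
    """Weight proportional to length: signature of the concatenation."""
    return signature("".join(genes), k)
```

With a 4-nucleotide gene made only of A and an 8-nucleotide gene made only of C, the average signature gives A and C a frequency of 0.5 each, whereas the concatenated signature gives C twice the weight of A.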
To carry out this study, we used 33 genes originating from ten species (nine Bacteria: Bacillus subtilis, Clostridium perfringens, Escherichia coli, Lactococcus lactis, Neisseria meningitidis, Salmonella typhimurium, Staphylococcus aureus, Vibrio cholerae, Xanthomonas axonopodis and one Archaebacteria: Archaeoglobus fulgidus – see Material & Methods).
By contrast, the EF-Tu gene is an example where gene conservation is very strong; the signatures of nearly all the species are grouped together at the root of the V. cholerae/E. coli/S. typhimurium group. As can be observed in the phylogenetic trees (signature and distance method, results not shown), the A. fulgidus and C. perfringens copies of the gene are quite different, enough for their species signal to be stronger than the EF-Tu signal.
Some gene signatures cluster with species other than their own in the hierarchical tree. This could result from horizontal gene transfer. For instance, the phosphomannomutase gene of S. typhimurium is placed at the root of the S. aureus group. In the phosphomannomutase NJ tree and the signature tree, the relationships between the Gram- and the Gram+ bacteria are incongruent with other data and presumably wrong. Despite that, the other phosphomannomutase signatures are correctly assigned to their host species. The misplacement of this gene may indicate a horizontal transfer in S. typhimurium from an unknown donor. Two other potential horizontal transfers can be found deep inside species groups: the elongation factor 2 signature of N. meningitidis and the ornithine carbamoyltransferase signature of S. aureus, located respectively inside the V. cholerae group and inside the C. perfringens group. In each case the signature is near the signature of the homologous gene of that species, so the gene signal is strong enough to displace the signature into a different species group. To see whether these original sequences are horizontal transfers, we examined two horizontal transfer databases: HGT-DB and HGT Analysis Database. In HGT-DB, the phosphomannomutase sequence of S. typhimurium is tagged as a horizontal transfer, but the other two original sequences detected by the hierarchical classification are not. Thus our novel result points to original sequences that need to be studied more closely before being incorporated into a multigene study.
Statistical analysis of the distribution of dissimilarity distances as a function of method used.
The consensus signature method allows us to analyze genes present only for some species. We added 9 genes to the data set (see Materials & Methods), thereby increasing the amount of sequence used to 65 kb per species. The signatures of these genes are amalgamated into the species signatures. The tree obtained (data not shown) exhibits the same topology as the consensus tree obtained with the whole set of genes per species computed previously.
Bootstrap values as a function of the number of genes analyzed in the multigene study.
Percentage of used genes
100 % except for two clades:
– (E. coli + S. typhimurium) = 91 %
– (N. meningitidis + X. axonopodis) = 96 %
100% for all branches
100% for all branches
100% for all branches
In the individual phylogenetic trees, the variations in topology are so large whatever the method used (except the signature) (Fig 9, Table 3) that they do not allow us to confirm whether these sequences have in fact undergone a horizontal transfer.
Phylogeny of γ-proteobacteria
Species names and taxonomic groups of γ-proteobacteria.
The 16S rRNA tree permits the establishment of reference relationships between the γ-proteobacteria. Some taxonomic groups are recovered: Xanthomonadales, Pseudomonadaceae as well as Pasteurellales. The tree groups Xanthomonadales and Pseudomonadaceae, and places B. aphidicola close to Pasteurellales but with a long branch. This long branch can explain the incongruent placement of Pasteurellales in Enterobacteria for the ML tree (the phenomenon of long branch attraction).
The tree calculated using the base compositionally-corrected signatures of complete genomes is more in agreement with the 16S rRNA tree. A group of Enterobacteria similar to that found in the 16S rRNA signature tree was obtained. However, the monophyly of Xanthomonadales is not recovered in any of the trees obtained from signatures of complete genomes. X. fastidiosa is placed at the root of the group (Xanthomonadales + Pseudomonadaceae). Another difference between the trees for complete genomes and those of the 16S rRNA is a grouping of Pasteurellales, Vibrionales and S. oneidensis found in the signature tree.
In the complete genome trees, B. aphidicola appears misplaced. It is always positioned apart from the Enterobacterial clade, despite its belonging to this group taxonomically. An analysis of genome signatures of B. aphidicola revealed that this species exhibits a very different signature from those of the other γ-proteobacteria (result not shown). This result is not due to a bias in the signature method arising from the size of the B. aphidicola genome, because a tree obtained by randomly selecting the same sequence length in the 15 other genomes (650 kb) leads to the same topology (result not shown). We suggest that the source of this anomaly is that B. aphidicola is a symbiotic bacterium and has a very small genome (650 kb) compared with those of the other γ-proteobacteria (4 to 5 Mb). This genome reduction arises from its parasitic lifestyle and is the result of many independent losses of genes and genomic segments. B. aphidicola experienced very strong evolutionary pressures that led to a profound shift in its signature, and also transferred numerous genes to its host. Such symbiotic species are known to be difficult to place phylogenetically. B. aphidicola also has a strong compositional bias (its genome is nearly 75% AT), whereas the other γ-proteobacteria are more GC rich. These problems appear when using whole genomes to infer a tree and are bypassed when using conserved genes or a selection of genes sharing a common history.
We used the method of Dufraigne et al. to detect sequences in the B. aphidicola genome that may have arisen by horizontal transfer. We divided the entire genome into 5 kb sequence windows. For each window, a 4-letter word signature was computed. The method developed by Dufraigne et al. allows us to detect which sequences have original signatures such as would be found in cases of horizontal transfer. We removed these original sequences from the genome and a new 6-letter word signature was computed. The tree obtained is exactly the same as the base compositionally-corrected tree (Fig 10C).
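The windowing step can be illustrated with a short Python sketch. The detection criterion of Dufraigne et al. is not described here, so the rule below (flag windows whose Euclidean distance to the whole-genome signature exceeds the mean plus `n_sd` standard deviations) is only a hypothetical stand-in, and the function names and the default `n_sd` are assumptions made for this example.

```python
import math
import statistics
from collections import Counter
from itertools import product


def signature(seq, k):
    """Frequencies of all 4**k words, in lexicographic order (A, C, G, T)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def atypical_windows(genome, window=5000, k=4, n_sd=2.0):
    """Indices of non-overlapping windows with an unusual signature.

    Stand-in criterion (an assumption, not the published method):
    distance to the whole-genome signature above mean + n_sd * sd.
    A trailing fragment shorter than `window` is ignored.
    """
    genome = genome.upper()
    reference = signature(genome, k)
    starts = range(0, len(genome) - window + 1, window)
    dists = [math.dist(signature(genome[s:s + window], k), reference)
             for s in starts]
    if len(dists) < 2:
        return []
    cutoff = statistics.mean(dists) + n_sd * statistics.pstdev(dists)
    return [i for i, d in enumerate(dists) if d > cutoff]
```

The flagged windows would then be removed before the 6-letter word signature of the remaining genome is computed.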
In this paper we have illustrated the exploration of phylogenetic data with a global sequence analysis method, the signature method. Using a variety of genes, this method yields tree topologies similar to those obtained using traditional phylogenetic approaches. The results presented here suggest that trees obtained by this method could be used as an exploratory step in phylogenetic studies. The signature method can deliver a quick overview of phylogenetic relationships between species in data sets that can be challenging or time consuming for traditional alignment and phylogenetic analysis. As our simulations showed, the signature method sometimes yields phylogenies that are less accurate than those produced by conventional analyses, but this arises mainly from the fact that no evolutionary model is known for word frequencies comprising genomic signatures. The signature tree can be used as fast pretreatment in conjunction with classical methods such as ML. We also demonstrated that the signature distances are tree-like, reflect tree distances and that in the case of short sequences such as frequently assembled in studies of homologous sequences, the optimal word length seems to be 6. This length represents a trade-off between long words that represent more accurately the DNA sequences [21, 25] and the size of the sequences.
The signature method is particularly useful as a first step in data exploration. The speed of the analysis permits detection of either misplacement of particular species, in some cases due to local composition fluctuations (horizontal transfer), or unexpected groupings of species that can be scrutinized further by biological means or conventional phylogenetic study. Thus, the signature method easily permits the researcher to use long and/or numerous genes in a study. When using numerous species, their phylogenetic proximities can be analyzed using their signatures by conventional statistical methods and the set of species split into subgroups. This method is also useful in combining information from different genes. The signature method permits the averaging of a great number of genes of any length to obtain a consensus and a unique signature per species and thereby take into account a great number of evolutionary events. The signature method does not rely on homology of DNA sites to compare sequences and it is possible to compare non-homologous sequences to infer a phylogenetic tree. Thus, many genes not present in every species can be added to this tree, giving more confidence in the species tree. This approach was already applied to birds , bacterial [15–18, 21, 62] or mitochondrial  phylogenetic studies. In contrast to conventional methods, the signature method utilizes information present in the sequences that may not be analyzable with conventional alignments, such as additional sequences at the beginning or the end of alignments.
For studies of complete genomes, detection of horizontal transfer using signatures, such as proposed by Dufraigne et al. , permits removal of sequences that will compromise phylogenetic analysis. Finally, signatures allow the rapid detection of horizontally transferred genes or simply misplaced genes that require additional attention via hierarchical clustering or other statistical classification methods.
The sequence signature can be computed easily and very quickly (about 1 Mb per second on a laptop computer) thanks to an algorithm, the "Chaos game representation" (CGR). The signature can be displayed as an image, where each pixel represents a word and the darkness of the pixel increases with the frequency of the word in the sequence.
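As an illustration, a signature can also be obtained by counting words directly with a sliding window. The Python sketch below does this rather than implementing the CGR algorithm itself; the function name, the single-strand counting and the treatment of ambiguous symbols (words containing anything other than A, C, G or T are skipped) are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def signature(seq, k=6):
    """Return the frequencies of all 4**k words of length k.

    Words overlap (step of one nucleotide) and are listed in
    lexicographic order of the alphabet A, C, G, T.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only words made of A, C, G, T contribute to the total.
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]
```

For example, `signature("ACGTACGT", k=2)` contains seven overlapping words, so the word AC (seen twice) has a frequency of 2/7.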
We selected two genes to compare signature analysis of two different clades with results from the literature. These genes are long enough to get a significant signature and address the phylogeny of vertebrates and plants including a large number of species. The recombination activation gene RAG1 is used for inferring the phenetic tree of 46 species of vertebrates. Ribosomal RNA sequence analysis is the de facto standard for phylogenetic reconstruction. Here we use ribosomal 18S RNA to analyze 93 plant species. Finally, 42 genes, accounting for more than 50 kb of sequence, are used for a multigene study (see Annex), including nine Bacteria and one Archaea. To select the 42 genes, we utilized the SYSTERS database . For all the selected species, the database returned 119 orthologous protein families shared by the whole set of species. These families were filtered by size of the corresponding DNA sequences (retained families contain sequences with mean lengths > 1 kb). From these, 33 complete sets and 9 partial sets of genes were obtained. The selected genes belong mainly to amino acid, nucleotide and protein synthesis and DNA metabolism families. All the sequences were extracted from GenBank or Genome Information Broker . The complete genomes of 16 γ-proteobacteria were gathered from GenBank (see appendix). Simulated sequences from a known phylogeny were found on Gascuel's website .
Phylogenetic analysis and signature method
Two distance metrics (Euclidean and χ2) were used to quantify the differences between signatures. Other metrics (Manhattan, Mahalanobis, Correlation and Cosine) were investigated as well; these methods rarely performed better than our two focal methods, and often performed worse, so we do not consider them further. Distance matrices were obtained via the Euclidean and χ2 metrics. We used these matrices to infer trees with the Neighbor-Joining (NJ) reconstruction algorithm implemented in the PHYLIP package. In order to estimate the robustness of the tree topology, we simulated by bootstrap a whole new set of signatures from the initial set of motif frequencies, sampling with replacement (in general, 100 bootstrap trees were computed). Each dataset contains the same individuals as the initial data and N new variables (words) randomly drawn in order to replace the N variables from the initial set. For each bootstrap replicate, the phylogenetic tree was inferred, and a consensus tree was calculated from the set of replicate trees. Besides the signature method, three commonly used methods were used to analyze aligned sequences from the same data sets: Neighbor-Joining (NJ), maximum parsimony (MP) and maximum likelihood (ML). All three methods were implemented using the PAUP* and PHYLIP packages. Alignments were obtained with ClustalW (default parameters) and were similar to those used in their respective sources. For the different conventional methods, we used the HKY85 model of sequence evolution, and gaps were treated as missing data in the MP analysis. For ML analyses, a gamma distribution of rate heterogeneity with simultaneous parameter estimation was used.
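The distance and bootstrap steps described above can be sketched in Python as follows. The exact χ2 formula is not given in the text, so the symmetric form used here is an assumption, as are the function names; tree building with NJ and the consensus step are left to a package such as PHYLIP.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two signatures."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def chi2(p, q):
    """Symmetric chi-square distance (assumed form): sum (a-b)^2 / (a+b)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def distance_matrix(signatures, metric):
    """Pairwise distances between species signatures."""
    n = len(signatures)
    return [[metric(signatures[i], signatures[j]) for j in range(n)]
            for i in range(n)]


def bootstrap_signatures(signatures, rng):
    """Resample the N words with replacement, keeping all species.

    The same resampled word indices are applied to every species.
    """
    n_words = len(signatures[0])
    columns = [rng.randrange(n_words) for _ in range(n_words)]
    return [[sig[c] for c in columns] for sig in signatures]
```

A typical use would draw 100 replicates with `bootstrap_signatures(sigs, random.Random(seed))`, compute a distance matrix for each, and pass the matrices to an NJ implementation.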
Species annotation for the 18S rRNA sequences of plants
Genes used in multigene study:
1/ whole set of species:
Cell division protein ftsZ
DNA-directed RNA polymerase
DNA topoisomerase I
Elongation factor 2
Pantothenate metabolism flavoprotein
Phosphoribosylformylglycinamidine synthase II
Thermosome alpha subunit
Translation elongation factor EF-Tu
Total length = 57.2 kb
2/ partial set of species:
Acetolactate synthase large subunit
GTP cyclohydrolase II
Tryptophan synthase subunit beta
This research was supported by contract 120910 from the "Action inter EPST Bio-informatique 2001" of the French Research Ministry and contract A02114DS from the "Action inter EPST Bio-informatique 2003" of the French Research Ministry.
- Lecointre G, Le Guyader H: Classification phylogénétique du vivant. 2001, Paris, Belin, 544-
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376.
- Li WH: Molecular Evolution. 1997, Sinauer, 487-
- Higgins DG, Thompson JD, Gibson TJ: Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 1996, 266: 383-402.
- Brocchieri L: Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol. 2001, 59: 27-40.
- Russo CAM, Takezaki N, Nei M: Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol Biol Evol. 1996, 13: 525-536.
- Billoud B, Guerrucci MA, Masselot M, Deutsch JS: Cirripede phylogeny using a novel approach: molecular morphometrics. Mol Biol Evol. 2000, 17: 1435-1445.
- Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV: Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol. 2001, 1: 8-
- Korbel JO, Snel B, Huynen MA, Bork P: SHOT: a web server for the construction of genome phylogenies. Trends Genet. 2002, 18: 158-162.
- Tekaia F, Lazcano A, Dujon B: The genomic tree as revealed from whole proteome comparisons. Genome Res. 1999, 9: 550-557.
- Daubin V, Gouy M, Perrière G: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Research. 2002, 12: 1080-1090.
- Gupta RS: Protein phylogenies and signature sequences: a reappraisal of evolutionary relationships among Archaebacteria, Eubacteria and Eukaryotes. Microbiol Mol Biol Rev. 1998, 62: 1435-1491.
- Rivera MC, Lake JA: Evidence that eukaryotes and eocyte prokaryotes are immediate relatives. Science. 1992, 257: 74-76.
- Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends In Genetics. 1995, 11: 283-290.
- Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 2003, 13: 145-158.
- Yu ZG, Jiang P: Distance, correlation and mutual information among portraits of organisms based on complete genomes. Phys Lett A. 2001, 286: 34-46.
- Qi J, Wang B, Hao BI: Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J Mol Evol. 2004, 58: 1-11.
- Stuart GW, Moffett K, Leader JJ: A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol Biol Evol. 2002, 19: 554-562.
- Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by Chaos Game Representation of sequences. Molecular Biology and Evolution. 1999, 16: 1391-1399.
- Deschavanne P, Giron A, Vilain J, Dufraigne C, Fertil B: Genomic signature is preserved in short DNA fragments. BIBE2000 IEEE international Symposium on bio-informatics & biomedical engineering, Washington USA, 8-10 november 2000. 2000, 161-167.
- Chapus C, Fertil B, Edwards S, Giron A, Deschavanne P: Classification of species based on DNA style. Berlin, Germany. Edited by: Spang R, Béziat P and Vingron M. 2003, 147-148.
- Karlin S: Bacterial DNA strand compositional asymmetry. Trends Microbiol. 1999, 7: 305-308.
- Karlin S, Mràzek J, Campbell AM: Compositional biases of bacterial genomes and evolutionary implications. J Bact. 1997, 179: 3899-3913.
- Edwards SV, Fertil B, Giron A, Deschavanne PJ: A genomic schism in birds revealed by phylogenetic analysis of DNA strings. Syst Biol. 2002, 51: 599-613.
- Wang Y, Hill K, Singh S, Kari L: The spectrum of genomic signatures: from dinucleotides to chaos game representation. Gene. 2005, 346: 173-185.
- Guénoche A, Garreta H: Can We Have Confidence in a Tree Representation?. Lecture Notes in Computer Science. Edited by: Gascuel O and Sagot MF. 2000, Springer, 2066: 45-56.
- Daubin V, Gouy M, Perriere G: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 2002, 12: 1080-1090.
- Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: the case of the γ-proteobacteria. PLOS Biology. 2003, 1: 101-109.
- Buneman P: Mathematics in the Archeological and Historical Sciences. Edited by: Hobson FR, Kendall DG and Tautu P. 1971, Edinburgh, University Press, 387-
- Robinson DF, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147.
- Kuhner MK, Felsenstein J: A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol Biol Evol. 1994, 11: 459-468.
- Penny D, Hendy MD: The use of tree comparison metrics. Systematic Zoology. 1985, 34: 75-82.
- Kumar S: A stepwise algorithm for finding minimum evolution trees. Mol Biol Evol. 1996, 13: 584-593.
- Ranwez V, Gascuel O: Quartet-based phylogenetic inference: improvements and limits. Mol Biol Evol. 2001, 18: 1103-1116.
- Gascuel O: http://www.lirmm.fr/~w3ifa/MAAS/US-MAAS.html.
- Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13: 235-238.
- Gascuel O: Getting a Tree Fast: Neighbor Joining and Distance Based Methods. Current Protocols in Bioinformatics. Edited by: Baxevanis A, Davison D, Page R, Stein L, Stormo G and Petsko G. 2004, Wiley & Sons, 6.3.1-6.3.18.
- Greenhalgh P, Steiner LA: Recombination activating gene 1 (Rag1) in zebrafish and shark. Immunogenetics. 1995, 41: 54-55.
- Sibley CG, Ahlquist JA: Phylogeny and classification of birds: a study in molecular evolution. 1990, New Haven, Yale University Press
- Feil EJ, Holmes EC, Bessen DE, Chan MS, Day NP, Enright MC, Goldstein R, Hood DW, Kalia A, Moore CE, Zhou J, Spratt BG: Recombination within natural populations of pathogenic bacteria: short-term empirical estimates and long-term phylogenetic consequences. Proc Natl Acad Sci U S A. 2001, 98: 182-187.
- Soltis PS, Soltis DE, Wolf PG, Nickrent DL, Chaw SM, Chapman RL: The phylogeny of land plants inferred from 18S rDNA sequences: pushing the limits of rDNA signal?. Mol Biol Evol. 1999, 16: 1774-1784.
- Chaw SM, Parkinson CL, Cheng Y, Vincent TM, Palmer JD: Seed plant phylogeny inferred from all three plant genomes: monophyly of extant gymnosperms and origin of Gnetales from conifers. Proc Natl Acad Sci U S A. 2000, 97: 4086-4091.
- Crane PR: Phylogenetic analysis of seed plants and the origin of the angiosperms. Ann Mo Bot Gard. 1985, 72: 716-793.
- Doyle JA: Seed plant phylogeny and the relationships of Gnetales. Int J Plant Sci. 1996, 157: S3-S39.
- Doyle JA: Molecules, morphology, fossils, and the relationship of angiosperms and Gnetales. Mol Phylogenet Evol. 1998, 9: 448-462.
- Doyle JA, Donoghue MJ: Seed plant phylogeny and the origin of the angiosperms: an experimental cladistic approach. Bot Rev. 1986, 52: 321-431.
- Doyle JA, Donoghue MJ, Zimmer EA: Integration of morphological and ribosomal RNA data on the origin of the angiosperms. Ann Mo Bot Gard. 1994, 81: 419-450.
- Doyle JA: Phylogeny of vascular plants. Annu Rev Ecol Syst Biol. 1998, 29: 567-569.
- Källersjö M, Farris JS, Chase W, Bremer B, Fay MF, Humpries CJ, Petersen G, Seberg O, Bremer K: Simultaneous parsimony jackknife analysis of 2538 rbcL DNA sequences reveals support for major clades of green plants, land plants, seed plants and flowering plants. Plant Syst Evol. 1998, 213: 259-287.
- Kenrick P, Crane PR: The origin and early evolution of plants on land. Nature. 1997, 389: 33-39.
- Manhart JR: Phylogenetic analysis of green plant rbcL sequences. Mol Phylogenet Evol. 1994, 3: 114-127.
- Manhart JR: Chloroplast 16S rDNA sequences and phylogenetic relationships of ferns and fern allies. Am Fern J. 1995, 85: 182-192.
- Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003, 425: 798-804.
- Gray MW, Burger G, Lang BF: Mitochondrial evolution. Science. 1999, 283: 1476-1481.
- Dufraigne C, Fertil B, Lespinats S, Giron A, Deschavanne P: Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res. 2005, 33: e6-
- Garcia-Vallve S, Guzman E, Montero MA, Romeu A: HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Research. 2003, 31: 187-189.
- Nakamura Y, Itoh T, Matsuda H, Gojobori T: Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nature Genetics. 2004, 36: 760-766.
- Lanyon SM: Detecting internal inconsistencies in distance data. Systematic Zoology. 1985, 34: 397-403.
- Felsenstein J: Cases in which parsimony or compatibility methods will be positively misleading. Systematic Zoology. 1978, 27: 401-410.
- Moran NA, Mira A: The process of genome shrinkage in the obligate symbiont Buchnera aphidicola. Genome Biol. 2001, 2: RESEARCH0054-
- Itoh T, Martin W, Nei M: Acceleration of genomic evolution caused by enhanced mutation rate in endocellular symbionts. Proc Natl Acad Sci U S A. 2002, 99: 12944-12948.
- Yu ZG, Anh V, Lau KS: Measure representation and multifractal analysis of complete genomes. Phys Rev E. 2001, 64: 1-9.
- Jeffrey HJ: Chaos game representation of gene structure. Nucleic Acids Research. 1990, 18: 2163-2170.
- Krause A, Stoye J, Vingron M: The SYSTERS protein sequence cluster set. Nucleic Acids Res. 2000, 28: 270-272.
- GIB: http://gib.genes.nig.ac.jp/.
- Felsenstein J: PHYLIP (Phylogeny Inference Package) 3.6. 2004
- Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985, 4: 783-791.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.
- Fitch WM: On the problem of discovering the most parsimonious tree. Am Nat. 1977, 111: 223-257.
- Swofford DL: PAUP*, Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 2003, Sunderland, Massachusetts, Sinauer Associates
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.