BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments
© Criscuolo and Gribaldo; licensee BioMed Central Ltd. 2010
Received: 10 February 2010
Accepted: 13 July 2010
Published: 13 July 2010
The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. It has been shown that removing ambiguously aligned regions, but also other sources of bias such as highly variable (saturated) characters, can improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow a precise manual curation of each dataset, there exists a real need for efficient bioinformatic tools dedicated to this alignment character trimming step.
Here is presented a new software, named BMGE (Block Mapping and Gathering with Entropy), that is designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. Calculation of these entropy-like scores is weighted with BLOSUM or PAM similarity matrices in order to distinguish among biologically expected and unexpected variability for each aligned character. Sets of contiguous characters with a score above a given threshold are considered as not suited for phylogenetic inference and then removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity.
BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.
Most phylogenetic inference approaches are based on an alignment of homologous sequences (e.g. DNA, RNA, amino acids). The alignment of sequences aims at highlighting the substitutions that have occurred during the evolutionary process from their common ancestral sequence. The quality of a multiple sequence alignment can have a strong impact on the accuracy of the inferred phylogenetic tree, whatever the inference criterion used [1–4]. In spite of constant improvements of the multiple sequence alignment heuristics [5, 6], an alignment can contain regions (i.e. sets of contiguous characters, also often called blocks [7, 8]) where homology is ambiguous. Moreover, too divergent regions (even when correctly aligned) may induce a mutational saturation effect, which is an important source of bias for many phylogenetic reconstruction methods. In order to minimize the bias introduced by these problematic regions, a frequent approach is to detect and remove them from the multiple sequence alignment prior to phylogenetic analysis (e.g. [9–13]). Indeed, it has been observed that the removal of such regions allows more accurate trees to be inferred [7, 8, 14–16].
A current trend consists in reconstructing phylogenetic trees by using a large number of datasets of aligned sequences from many complete genomes. Phylogenetic trees are then reconstructed from these datasets in many contexts, such as the construction of gene tree databases , the inference of species trees based on a core-gene set [17, 18] or the estimation of amino acid substitution matrices . These different phylogenetic explorations are often based on (semi-)automated processes (e.g. [15, 18, 20]), requiring a software solution for each step of these computer pipelines. Given the importance of dataset quality, the use of practical and accurate software dedicated to alignment trimming task has become a real need.
In this paper, we present a novel software, named BMGE (Block Mapping and Gathering with Entropy), that identifies regions inside multiple sequence alignments that are suited for phylogenetic inference. BMGE computes a score for each character (i.e. amino acid, nucleotide or codon column), mainly determined by the entropy induced by the proportion of character states. To estimate realistic scores that take into account biologically relevant substitution processes (e.g. transition rates more frequent than transversions for DNA sequences, highest probability of changes between amino acids with physicochemical similarities), BMGE weights the entropy estimation with standard substitution matrices (e.g. PAM or BLOSUM). Averaging score values across the characters of the multiple sequence alignment allows identifying conserved (i.e. with low entropy-like score values) and highly variable/uncertain regions (i.e. with large entropy-like score values [15, 21]). By removing such high entropy regions, BMGE returns trimmed datasets that allow the reconstruction of more accurate phylogenetic trees than the initial alignment, as shown by simulation studies.
Character state coding used by BMGE
A or C
A or G
Weak (3 H bonds)
A or T
Strong (3 H bonds)
C or G
C or T
G or T
N or X
one of the 4 nucleotides
Degenerated amino acid
N or D
Q or E
one of the 20 amino acids
Input/output files and sequence coding conversions
The input file for BMGE is a multiple sequence alignment in FASTA (or PHYLIP sequential) format. The user must indicate whether the sequences are amino acids or DNA (with standard one-letter coding [26, 32]; see Table 1). It is also possible to consider DNA sequences as codons, which allows the multiple sequence alignment to be handled with amino acid substitution matrices. Selected (and/or removed) regions are written in an output file in several formats (i.e. FASTA, PHYLIP sequential, NEXUS). HTML output is also available to display selected sites as well as graphical representation of both entropy values and gap proportions.
Several sequence conversion options are also available: from DNA or codons to RY-coding , and from codons to translated amino acids (according to the universal genetic code). BMGE also allows converting an amino acid alignment into a nucleotide alignment by considering the corresponding degenerated codons (see Table 1). In practice, given an amino acid and its set of corresponding synonymous codons (following the universal genetic code), the degenerated codon is simply obtained, for each of the three codon positions p, by considering the nucleotides corresponding to the set of possible codons at position p. For example, isoleucine (I) can be encoded by the three codons ATA, ATC and ATT; therefore, the degenerated codon corresponding to this set of codons is ATH, knowing that the degenerated nucleotide H (i.e. A, C or T; see Table 1) represents the possible nucleotides at the third threefold degenerate position.
Entropy-based character trimming
where the parameters are the eigenvalues of the matrix μ ∏(c)S estimated via the JAMA package . It should be stressed that the entropy normalization condition is verified with the normalizing factor μ.
with start = max(1;c-w) and end = min(m;c+w), in order to give more weight to characters with few gaps, i.e. g(c) ≅ 0. After this smoothing operation, BMGE defines as conserved those characters c that have value lower than a fixed threshold (0.5 by default). A conserved region is then defined as a set of contiguous conserved characters.
Then, the multiple sequence alignment is partitioned into successive conserved (C ) and variable (i.e. non-conserved; V ) regions, these being either a single character or a set of contiguous ones. Let V i be the ith non-conserved region. By definition, V i is flanked by the two conserved regions C i and Ci+1. In order to distinguish variable regions due to ambiguous alignment from those due to natural variation, the average value is computed for the region C i ∪ V i ∪ Ci+1 by formula (3) with parameters 'start' and 'end' set as the first character of C i and the last character of Ci+1, respectively. The consecutive regions C i , V i , Ci+1 with less than 30% of gaps and with value lower than the fixed 0.5 threshold are then merged into a unique conserved region. Finally, BMGE iteratively performs these merging operations until no more variable region V i can be merged with its two flanking C i and Ci+1 ones.
On the use of similarity matrix
If the similarity matrix S in formula (1) is the identity matrix I r , then h(c) is closely related to the well-known Shannon entropy , given by formula (2) where each is simply the proportion of the character state s for character c. If the character c is constant, then h(c) = 0. On the other hand, if the character c is highly variable, then each of the r character states is present with a relatively high proportion (e.g. ), implying that h(c) is close to 1, its maximal value. Therefore, h allows the level of variability of a character to be quantified . Unfortunately, using h with the identity matrix I r (as suggested in ) suffers from biases. For example, when considering amino acid sequences, if a given character c1 is only made of the four residues I, L, M and V, each with 25% proportion, and a second character c2 is made of the four residues C, Q, W and Y, also with identical proportions, then formula (1) with matrix I r returns h(c1) = h(c2) = -4× 0.25 log20 0.25 ≈ 0.462, and then indicates the same level of variability for both characters c1 and c2, while residues in character c1 are much more likely to be substituted than those in character c2 [27, 37, 38]. In contrast, using dedicated similarity matrices S ≠ I allows relevant substitution processes to be taken into account. As suggested in , when using the Henikoff and Henikoff's  BLOSUM50 target frequency matrix in formula (1), one obtains h(c1) ≈ 0.300 and h(c2) ≈ 0.453. Therefore, the function h with appropriate similarity matrix S computes score values that allow distinguishing among expected (i.e. biologically relevant) and ambiguous (e.g. source of noise) variability.
For practical use with amino acid sequences, BMGE provides the complete range of BLOSUM target frequency matrices (i.e. BLOSUM30, 35, 40, ..., 95 from ; see  for more details) in order to estimate pertinent h values depending on the level of divergence between sequences. As shown in simulation results (see below), our entropy-based character trimming method performs better when using stringent matrices (e.g. BLOSUM95) for closely related sequences, and, reciprocally, when using more relaxed matrices (e.g. BLOSUM30) for distantly related sequences. By default, BMGE uses the popular BLOSUM62 matrix .
For DNA sequences, PAM matrices are first computed. BMGE uses a transition/transversion ratio κ (= 2.0 by default) to compute the PAM-1 4 × 4 matrix: the four diagonal elements are all 0.99, and the off diagonal elements are 0.01 κ (2 + κ)-1 for transition and 0.01(2 + κ)-1 for transversion (see  for more details about the PAM-1 calculation for DNA). Given a prefixed integer η (= 100 by default), BMGE then computes the PAM-η matrix (= PAM-1 η ), which is finally used by BMGE as similarity matrix S in formula (1). In a similar way as the previously described amino acid framework, our character trimming method is more accurate when using a stringent matrix (e.g. PAM-1) with closely related DNA sequences, and when using a relaxed matrix (e.g. PAM-250) with distantly related ones.
Stationary-based character trimming
Given two aligned sequences (e.g. taken from the multiple sequence alignment), one can use a r×r divergence matrix F to represent the relative proportion of each of the possible character state pairs in the pairwise comparison of these two sequences (as schematized for DNA sequences by formula (19) in ). If these two sequences have similar character state composition, then, for each character state s = 1, 2,...,r , this sequence pair verifies the null hypothesis F s. = F .s , named the marginal homogeneity (e.g. [30, 43]) or marginal symmetry (e.g. ). A not-too-low p-value returned by the Stuart's χ2 test  on F allows the marginal homogeneity/symmetry to be assessed; then, one can assess that two homologous sequences arise from a nonstationary evolutionary process if the Stuart's test returns a p-value close to zero (e.g. < 0.1). This test being essentially based on numerical linear algebra operations (see e.g.  for more details about its computation from two sequences), BMGE implements it with, on the one hand, matrix operations available in the JAMA package , and, on the other hand, with fast and accurate numerical algorithms for estimating χ2 cumulative distribution functions (adapted in Java from the C code sources available p. 216-219 in ).
If a multiple sequence alignment seems to be compositionally heterogeneous, BMGE implements a stationary-based character trimming in order to obtain a compositionally homogeneous alignment. Given a multiple alignment of n sequences, for each possible pair of distinct sequences i, j (i.e. 1 ≤ i < j ≤ n ), BMGE computes the Stuart's test p-value, denoted p ij . If there is at least one pair of sequences for which p ij < 0.1, then BMGE progressively removes the characters c ranked in function of their decreasing entropy-like h(c) values --as estimated by formula (1)-- until p ij > 0.1 for every pairs of sequences i, j. This first crude character removal approach leads to a set C of compositionally homogeneous characters. Then, BMGE aims at integrating the set C with some of the previously removed characters, in order to obtain a set of compositionally homogeneous characters of maximal size.
If σ (c) > 0, then adding character c inside C leads to an increase for most of the n(n-1)/2 p-values; reciprocally, adding characters c with σ (c)< 0 leads to a (not wished) overall decrease of the p-values. BMGE then progressively removes the characters c ∉ C from the initial multiple sequence alignment following the increasing order of their respective σ(c) value, i.e. from the smaller (negative) to the larger (positive) σ score value. After each character removal, every p ij are re-estimated. When all p ij > 0.1, BMGE stops removing characters from the initial alignment and considers the remaining (i.e. not removed) characters as the new set C. Finally, BMGE iteratively performs these add-and-remove operations until the set C of compositionally homogeneous characters cannot be increased further.
Results and Discussion
In order to assess the utility of multiple alignment character trimming to infer more accurate phylogenetic trees and to compare the respective performances of BMGE (with different similarity matrices) with other available character trimming methods (i.e. Gblocks ; Noisy ; trimAl ), we carried out computer simulations and real case studies. Protocol and results are described below.
Simulation results with entropy-based character trimming
The 200 first 40-taxon trees available in  were selected as model trees to generate artificial amino acid sequence datasets. On the one hand, in order to simulate phylogenetically informative characters, 10 clusters of 40 sequences each were generated using Seq-Gen  under the JTT model  from each of the 200 model trees. Sequence lengths for each of these 10 sequence clusters were randomly drawn from 30 to 70 amino acids. On the other hand, in order to simulate uninformative/variable characters, amino acid sequences were generated in the same way from a 40-taxon star tree (i.e. a tree with 40 leaves and a unique internal node).
Following these two steps, for each of the 200 initial model trees, we generated 20 clusters of 40 amino acid sequences of lengths 50 ± 20, ten containing informative phylogenetic signal, and ten containing uninformative/variable signal. Finally, from each of these 200 sets of 20 sequence clusters, 10 clusters were randomly drawn and concatenated, in order to produce a dataset composed of 200 clusters of 40 sequences of 500-amino acid length on average, each containing an equal mixture of phylogenetically informative and uninformative regions.
This simulation procedure to generate 200 clusters of sequences containing 50% (on average) of uninformative/variable characters was repeated three times, each with initial model tree branch lengths (see [45, 48] for more details) multiplied by a divergence factor of 1, 2 and 3, respectively, in order to mimic from closely- to distantly-related sequence clusters. Each of these (3 × 200=)600 clusters of simulated amino acid sequences were aligned with MUSCLE [49, 50]. The average lengths of these multiple sequence alignments are 513, 561, and 573 for the levels of divergence ×1, ×2, and ×3, respectively.
The software BMGE was applied on these multiple sequence alignments with three similarity matrices: BLOSUM30, BLOSUM62 and BLOSUM95. As a comparison, character trimming was also performed by using three available softwares.
Gblocks (0.91 b) was used with 'strict' and 'relaxed' parameter sets (see  for a precise description of these two parameter sets). However, the 'strict' conditions (default parameters in Gblocks) are indeed very stringent (e.g. no character was selected in ≈40% of the multiple sequence alignments with level of divergence ×3), and phylogenetic trees inferred from the remaining blocks (when these existed) were always less accurate than those inferred from the blocks returned by the 'relaxed' conditions (results not shown). Worse results than those obtained with the 'relaxed' conditions were also observed with alternative parameter sets (e.g. those described by ; results not shown). Then, results from Gblocks presented below are only those obtained with the 'relaxed' conditions.
The software Noisy was used with the --nogap options. Several other options were tested (especially the --cutoff one; see  for more details) but the Noisy default options allows better results to be observed with our simulated datasets.
The software trimAl (1.2rev59) allows three trimming methods (among numerous ones) to be used: 'gappyout', 'strictplus' and 'automated1'. Since the 'gappyout' method mainly focuses on highly gapped regions, this approach removed too few characters in our poorly gapped simulated datasets. Consequently, the 'gappyout' results are not shown since they are very close to those observed with the initial (i.e. non-trimmed) multiple sequence alignments.
All initial (i.e. non-trimmed) multiple sequence alignments (i.e. tn = fn = 0) correspond to the (1,1) point (i.e. fpr = fp/fp = tpr = tp/tp =1); reciprocally, the removal of all characters (i.e. tp = fp = 0) corresponds to the (0,0) point (i.e. fpr = 0/tn = tpr = 0/fn = 0). A cloud close to the (1,1) point in the ROC graph then indicates that the corresponding trimming method is liberal (i.e. it keeps too many uninformative/variable characters). Conversely, conservative trimming methods (i.e. that remove too many phylogenetically informative characters) correspond to clouds close to the (0,0) point. Ideally, the best method selects only those characters that are phylogenetically informative (i.e. fn = fp = 0), and then corresponds to the (0,1) point inside the ROC graph (i.e. fpr = 0/tn = 0 and tpr = tp/tp = 1). For each case in Figure 1, the L1 distance between each point and the (0,1) point (= 1-tpr+fpr ) was computed, averaged and written under its ROC graph. For each of the three levels of divergence (i.e. ×1, ×2, ×3), a sign test [55–57] was performed to assess the statistical significance between the best average L1 distance measure (i.e. the lowest) and the other ones. For each level of divergence in Figure 1, an average L1 distance measure is considered as non-significantly different to the best one if the p-value returned by the sign test is > 5%.
Figure 1 shows that, in all cases, trimAl has always tpr ≈ 1 but often fpr ≫ 0, indicating that it is too liberal (i.e. fp ≫ 0); moreover, L1 distances observed for trimAl are often among the worst, similarly to those observed for Noisy (except for the level of divergence ×3). We can also see from Figure 1 that Gblocks with relaxed conditions presents very good performance with closely related sequences (i.e. level of divergence ×1), but induces tpr «1 when the level of divergence increases (i.e. from ×1 to ×3), showing that it becomes too conservative (i.e. fn ≫ 0). As expected, BMGE with stringent option (i.e. BLOSUM95) corresponds to points that are more concentrated around the y-axis as the level of sequence divergence increases, showing that the use of stringent similarity matrices such as ranging from BLOSUM80 to BLOSUM95 lead to conservative character trimming with distantly related sequences (e.g. level of divergence ×3). Reciprocally, the use of the similarity matrix BLOSUM30 leads to selecting too many characters (i.e. fpr ≫ 0) with closely related sequences (i.e. level of divergence ×1). However, BMGE with BLOSUM30 allows minimizing both fn and fp when the level of divergence increases; indeed this last BMGE usage leads to the best average L1 distances for the levels of divergence ×2 and ×3.
Simulation results on phylogenetic tree accuracy
For each of the three levels of divergence and from each multiple sequence alignment (i.e. the initial one as well as those outputted by BMGE, Gblocks, Noisy and trimAl), phylogenetic trees were inferred with several approaches. Maximum Likelihood (ML) trees were inferred by PhyML  with model JTT-Γ4. BioNJ  trees were also inferred by PhyML with JTT distances. Maximum Parsimony (MP) trees were inferred by TNT  with TBR branch swapping and parsimony ratchet . Bayesian inference was not performed because of the huge running time required by this approach. Moreover, Bayesian trees are expected to be very close to ML trees when inferred from our simulated datasets .
Average quartet distances between the model trees and the ML trees inferred from the different trimmed multiple sequence alignments
Level of divergence
Average quartet distances between the model trees and the BioNJ trees inferred from the different trimmed multiple sequence alignments
Level of divergence
Average quartet distances between the model trees and the MP trees inferred from the different trimmed multiple sequence alignments
Level of divergence
Variable/uncertain characters contained in the initial multiple sequence alignments cause strong artefacts in the resulting phylogenetic trees, especially for BioNJ and MP trees (see Tables 2, 3,and 4). When character trimming softwares are used, almost all reconstructed phylogenetic trees are closer to their model tree (Tables 2, 3, and 4), showing that character trimming is a useful step prior any phylogenetic inference. However, it should be stressed that the ML approach is very robust to the noise introduced by uninformative/variable characters in our simulated datasets, mainly thanks to the Γ parameter. Even if our datasets were generated with equal rates across characters (see above), estimation of a Γ parameter was included in ML inference because preliminary tries without Γ parameter led to less accurate trees, especially those inferred from Noisy and trimAl outputs (results not shown). The Γ parameter then helps to compensate part of the phylogenetic noise contained in our datasets. However, when considering distantly-related sequences (e.g. level of divergence ×3), the ML approach (as well as the BioNJ and MP ones) needs a preliminary trimming step to infer significantly accurate trees, in agreement with results from previous simulation-based studies (e.g. ).
In general, BMGE (when used with a BLOSUM substitution matrix adequate to the level of sequence divergence) allows reconstructing among the most accurate trees for each of the three levels of divergence and every tree reconstruction method used (Tables 2, 3 and 4). trimAl and Noisy infer the worst BioNJ and MP trees, whereas Gblocks with relaxed conditions produces good results as long as sequences are not too divergent. To the minor exception of Noisy with ML trees, BMGE trimming with the (less stringent) BLOSUM30 matrix allows reconstructing the significantly best trees with level of divergence ×3.
Simulation results on phylogenetic tree branch support
In order to observe the impact of character trimming methods on branch supports inside phylogenetic trees, we have focused on two different approaches to estimate confidence values on the internal branches of a phylogenetic tree: the bootstrap-based support  with BioNJ trees, and the approximate likelihood ratio test (aLRT; ) as implemented by default in PhyML 3.0 .
Broadly, Figure 2 shows that the estimate of BioNJ bootstrap proportions from the initial (non-trimmed) multiple sequence alignments leads to very biased values, with a proportion of true branches with bootstrap-based confidence values ≤ 0.1 varying from 31% to 47%, and those > 0.9 varying from 18% to 20%. Figure 2 also shows that the highest average bootstrap-based confidence values are obtained from the character-trimming methods that best optimize both tpr and fpr criteria (see above and Figure 1). The same conclusions hold for aLRT with level of divergence ×1 (Figure 3), with (significantly) best average aLRT values observed with BMGE and Gblocks. Surprisingly, for level of divergence ×3, the best average aLRT values are obtained from the characters selected by the most liberal character trimming methods, i.e. initial alignments and trimAl (strictplus and automated1).
AUC values in Figures 4 and 5 show that initial (non-trimmed) multiple sequence alignments lead to more incorrect confidence values than those estimated from characters selected by trimming methods. Interestingly, Figures 4 and 5 show that aLRT-based ROC curves induce highest AUC values on average that bootstrap-based ones (with the slight exception of BMGE with BLOSUM95 for level of divergence ×3). This shows that aLRT-based confidence values are more able to discreminate true and false branches than BioNJ bootstrap-based ones. ROC curve shapes also show that trimming methods lead to more precise confidence values. Indeed, in Figure 4, the down-left tail point of each ROC curve (i.e. corresponding to threshold τ = 0.98) always has highest tpr (y-axis) for trimmed alignments than for initial ones; this shows that a larger proportion of true branches with BioNJ bootstrap-based confidence values >0.98 is observed with trimmed alignments than with initial alignments. For initial multiple sequence alignments and for level of divergence ×2 and ×3 in Figure 4, the up-right head points of these two ROC curves (i.e. each corresponding to threshold τ = 0.02) have tpr < 1; this means that there exists some true branches with BioNJ bootstrap-based confidence value < 0.02. Notably, this is not the case when using character trimming methods. These two tendencies on tpr values for both extremities of the ROC curves are in agreement with the distributions in Figure 2. Nevertheless, when observing the fpr ranges (x-axis) of the ROC curves in Figure 4, trimmed multiple sequence alignments all induce larger fpr values than initial ones; this shows that when using character trimming methods instead of initial multiple sequence alignments, there is an increase in the proportion of false branches in BioNJ trees with high confidence value (e.g. down-left tail of ROC curves corresponding to τ = 0.98) and a decrease of the proportion of false branches with low confidence values (e.g. up-right tail of ROC curves corresponding to τ = 0.02). This last tendency is clearly obvious with level of divergence ×3 (see Figure 4). However, shapes of ROC curves constructed by thresholding aLRT-based confidence values on ML trees do not seems to be strongly modified by trimming methods (see Figure 5). More precisely, for levels of divergence ×1 and ×2, Gblocks and BMGE (with BLOSUM62) present always among the best results. For level of divergence ×3, Noisy and BMGE with BLOSUM30 present the significantly best AUC values for aLRT-based confidence values.
To sum up, these simulation results show that, by selecting among variable characters those with biologically-relevant expected variability thanks to the use of similarity matrices, BMGE often presents results that are among the (significantly) best (e.g. phylogenetic accuracy, better confidence values for true branches) and leads to a less biased phylogenetic signal.
Entropy-based character trimming in a phylogenomics context
In order to illustrate the benefit of using character trimming approaches, we have re-analysed the multi-gene dataset used by Castresana (2000) to describe the usefulness of Gblocks  in selecting suited characters. This amino-acid dataset is composed by ten genes (i.e. three subsunits of cytochrome c oxidase CO1-CO3, one subunit of cytochrome c-ubiquinol oxidoreductase CYTb, and six subunits of NADH desydrogenase ND1-ND5 and ND4L) gathered from the complete mitochondrial genome of 16 eukaryotes (11 unikonts, 4 archaeplastida, 1 excavate), and from Paracoccus denitrificans, an α-proteobacterium used as outgroup (see  for more details).
Lengths of the different character supermatrices, and average log-likelihood per character of the corresponding ML trees.
Number of characters
ML bootstrap-based (boot.) and aLRT-based confidence values derived from different character supermatrices.
Unfortunately, this formula underestimates the value of η that best minimizes both the number of false positive and negative characters in our simulated datasets, e.g. η, as estimated by (4), varies from 16 to 71 with an average of 33 with level of divergence ×1, and varies from 10 to 62 with an average of 22 with level of divergence ×3, whereas simulation shows that best results are obtained with η = 95 and 30 for levels of divergence ×1 and ×3, respectively. Indeed, formula (4) on the Castresana (2000) phylogenomic dataset gives η values varying from 34 (ND3) to 64 (CO1), which shows that the amino acid sequences constituting these ten alignments are not closely related, whereas we have shown that the conservative BLOSUM95 similarity matrix is best adapted than BLOSUM62 and BLOSUM30 ones. Moreover, when estimated by (4) from the eight considered character supermatrices, η varies from 52 (concatenated initial multiple sequence alignments) to 60 (concatenated alignments trimmed by Noisy). A similar underestimation of η was also observed by averaging the n max values in formula (4), or by considering only characters with no gaps. Finally, even if formula (4) or related could give a suited approximate of the η value by using the building rules of BLOSUM matrices, it will be even more difficult to provide a similar formula for the PAM-η similarity matrices when dealing with DNA sequences.
Therefore, as maintained by Ewens and Grant (2005) for BLAST queries, we believe that "one often has prior knowledge about the evolutionary distance between the sequences of interest that helps one choose which BLOSUM matrix to use" (p. 244 in ). In practice, when inferring a particular gene tree, our opinion is to carefully examine the original multiple sequence alignment and those obtained by BMGE trimming with several similarity matrices (i.e. BLOSUM or PAM). On the contrary, when building a phylogenomic dataset, we believe, as Talavera and Castresana (2007), that "there is enough information from the concatenation of several genes" and then that "stringent conditions tend to give rise to the best phylogenetic trees" (p. 575 in ): the use of BMGE with stringent similarity matrices (e.g. BLOSUM95 or PAM-1) can strongly increase the number of false negatives (i.e. too many characters suited for phylogenetic inference are removed; see Figure 1) but systematic biases due to this conservative approach are often compensated by a sufficiently large number of genes used. Finally and in every case, we think that it is more relevant to deal with well-defined similarity matrices in order to choose among stringent to relaxed character trimming as in BMGE, rather than having to set several (and subjective) numerical parameters.
Simulation results with stationary-based character trimming
There exist many alternative statistical solutions to compare the character state composition of two (or more) sequences. Each of these has its own strengths and weaknesses (see  for instructive survey and discussion). However, matched-pairs tests (such as Stuart's test ) are comparatively efficient, particularly because of their ability to consider aligned sequences on a site-by-site basis . Albeit the stationary-based character trimming may be extended by the use of overall tests for marginal symmetry (i.e. assessing the compositional homogeneity in the complete multiple sequence alignment [75, 31]), we think that using pairwise p-values allows more precise sorting of the characters c according to their σ(c) score value (see above). Moreover, such overall tests are more time consuming. It should also be stressed that the Bowker's  test (i.e. another matched-pairs test assessing the complete symmetry inside F) was tried instead of the Stuart's test in the stationary-based trimming, but it led to worse results in the simulation analysis (not shown). Finally, as stressed in , Stuart's test is based on ordinary χ 2 approximation and is not appropriate for small samples, particularly when the number of categories (i.e. the row number in F) is not small. It is then strongly recommended to use the stationary-based character trimming on a large number of characters (e.g. ≥ 1, 000, such as in supermatrices of characters), especially when dealing with amino acid sequences.
Real case study with stationary-based character trimming
In order to illustrate the performance of the stationary-based character trimming, we applied it on the multi-gene dataset described in  (available at ), which is known to suffer from a GC-content bias. This phylogenomic dataset was built by concatenating 106 alignments of DNA sequences gathered from 7 Saccharomyces species and Candida albicans as outgroup (see  for more details). The so-obtained supermatrix contains 127,026 nucleotide characters.
A subset of 114,105 characters was selected by stationary-based character trimming implemented in BMGE. These so-selected characters were used to infer ME trees following the same methods as described previously. As expected, the right-hand tree in Figure 9 was inferred from both GTR and LogDet distance estimates with 100% bootstrap-based confidence value at each branch. More precisely, pairwise Stuart's test p-values are all ≈ 0 in the initial character supermatrix (with the slight exception of the two sequence pairs S. cerevisiae - S. paradoxus and S. cerevisiae - S. mikatae with p-values of 0.0146 and 0.0868, respectively). After the stationary-based trimming performed by BMGE, all Saccharomyces sequences induce pairwise Stuart's test p-values > 0.85, the lowest p-values being induced by the outgroup species (i.e. varying from 0.1000 to 0.3675 for the sequence pairs C. albicans - S. kluyveri and C. albicans - S. castellii, respectively). This shows that by removing ≈10% characters, the stationary-based trimming is able to select a subset of compositionally homogeneous characters (as assessed by the pairwise Stuart's tests) that leads to unbiased ME phylogenetic inference.
There exists a real need for accurate bioinformatic tools to extract at best the phylogenetic information contained in the ever-growing amount of available genomic sequence data. BMGE allows accurate character trimming of multiple alignments of DNA, codon, or amino acid sequences based on entropy-like scores weighted with BLOSUM or PAM matrices. Thus, BMGE is able to identify and extract unambiguously aligned blocks of characters that contain biologically expected variability and are therefore suitable for phylogenetic analysis. Simulation studies show that the trimmed datasets returned by BMGE lead to inference of accurate trees, in particular when in presence of multiple alignments including distantly-related sequences. BMGE also allows a number of useful recoding and trimming aimed at minimizing compositional heterogeneity in the alignment dataset and therefore the risk of phylogenetic artefacts.
In conclusion, BMGE is an accurate tool that can have several applications in phylogenomics analyses. The software BMGE is freely available (see below), and can also be used online through the Mobyle Web Portal  at http://mobyle.pasteur.fr/cgi-bin/portal.py.
Availability and Requirements
• Project home page: ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/
• Operating systems: Platform independent
• Programming language: Java
• Other requirements: Java 1.6 or higher
• License: GNU General Public License (version 2)
• Any restrictions to use by non-academics: None
We thank Céline Petitjean, Elie Desmond, Céline Brochier and two anonymous reviewers for their comments and for testing the program, as well as the BMGE team for support. Many thanks to Corrine Maufrais for providing us with an ftp home page, and for placing BMGE in the Mobyle portal. This research was supported by the PhyloCyano project of the Agence Nationale de la Recherche (ANR; 07-JCJC-0094-01).
- Lake JA: The order of sequence alignment can bias the selection of tree topology. Mol Biol Evol. 1991, 8: 378-385.PubMedGoogle Scholar
- Morrison DA, Ellis JT: Effects of nucleotide sequence alignment on phylogeny estimation: a case study on 18 S rDNAs of apicomplexa. Mol Biol Evol. 1997, 14: 428-441.View ArticlePubMedGoogle Scholar
- Ogden TH, Rosenberg MS: Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006, 55: 314-328. 10.1080/10635150500541730.View ArticlePubMedGoogle Scholar
- Wang L-S, Leebens-Mack J, Wall PK, Beckmann K, dePamphilis CW, Warnow T: The impact of multiple protein sequence alignment on phylogenetic estimation. IEEE/ACM Trans Comput Biol Bioinf. 2009.Google Scholar
- Edgar RC, Batzoglou S: Multiple sequence alignment. Curr Opin Struct Biol. 2006, 16: 368-373. 10.1016/j.sbi.2006.04.004.View ArticlePubMedGoogle Scholar
- Notredame C: Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007, 3: e123-10.1371/journal.pcbi.0030123.PubMed CentralView ArticlePubMedGoogle Scholar
- Castresana J: Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol. 2000, 17: 540-552.View ArticlePubMedGoogle Scholar
- Talavera G, Castresana J: Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol. 2007, 56: 564-577. 10.1080/10635150701472164.View ArticlePubMedGoogle Scholar
- Olsen GJ, Woese CR: Ribosomal RNA: a key to phylogeny. FASEB J. 1993, 7: 113-123.PubMedGoogle Scholar
- Rodrigo AG, Bergquist PR, Bergquist PL: Inadequate support for an evolutionary link between the Metazoa and the Fungi. Syst Biol. 1994, 43: 578-584. 10.1093/sysbio/43.4.578.View ArticleGoogle Scholar
- Swofford DL, Olsen GJ, Waddell PJ, Hillis DM: Phylogenetic inference. Molecular Systematics. Edited by: Hillis DM, Moritz C, Mable BK. 1996, Sunderland: Sinauer Associates, 407-514.Google Scholar
- Rodríguez-Ezpeleta N, Brinkmann H, Burey SC, Roure B, Burger G, Löffelhardt W, Philippe H, Lang BF: Monophyly of primary photosynthetic eukaryotes: green plants, red algae, and glaucophytes. Curr Biol. 2005, 15: 1325-1330. 10.1016/j.cub.2005.06.040.View ArticlePubMedGoogle Scholar
- Huerta-Cepas J, Bueno A, Dopazo J, Gabaldón T: PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res. 2008, 36: D491-D496. 10.1093/nar/gkm899.PubMed CentralView ArticlePubMedGoogle Scholar
- Dress AWM, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF: Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms for Molecular Biology. 2008, 3: 7-10.1186/1748-7188-3-7.PubMed CentralView ArticlePubMedGoogle Scholar
- Swingley WD, Blankenship RE, Raymond J: Integrating Markov clustering and molecular phylogenetics to reconstruct the cyanobacterial species tree from conserved protein families. Mol Biol Evol. 2008, 25: 643-654. 10.1093/molbev/msn034.View ArticlePubMedGoogle Scholar
- Capella-Gutiérez S, Silla-Martínez JM, Gabaldón T: trimAl: a tool for automated alignment triming in large-scale phylogenetic analyses. Bioinformatics. 2009, 25: 1972-1973. 10.1093/bioinformatics/btp348.View ArticleGoogle Scholar
- Daubin V, Gouy M, Perrière G: A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history. Genome Res. 2002, 12: 1080-1090. 10.1101/gr.187002.PubMed CentralView ArticlePubMedGoogle Scholar
- Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science. 2006, 311: 1283-1287. 10.1126/science.1123061.View ArticlePubMedGoogle Scholar
- Adachi J, Waddell PJ, Martin W, Hasegawa M: Plastid genome phylogeny and a model of amino acid substitution for protein encoded by chloroplast DNA. J Mol Evol. 2000, 50: 348-358.PubMedGoogle Scholar
- McMahon MM, Sanderson MJ: Phylogenetic supermatrix analysis of GenBank sequences from 2228 Papilionoid legumes. Syst Biol. 2006, 55: 818-836. 10.1080/10635150600999150.View ArticlePubMedGoogle Scholar
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucl Acids Res. 1990, 18: 6097-6100. 10.1093/nar/18.20.6097.PubMed CentralView ArticlePubMedGoogle Scholar
- Jayaswal V, Jermiin LS, Robinson J: Estimation of phylogeny using a general Markov model. Evol Bioinf Online. 2005, 1: 62-80.Google Scholar
- Galtier N, Gouy M: Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol. 1998, 15: 871-879.View ArticlePubMedGoogle Scholar
- Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD: The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst Biol. 2004, 53: 638-643. 10.1080/10635150490468648.View ArticlePubMedGoogle Scholar
- Phillips MJ, Delsuc F, Penny D: Genome-scale phylogeny and the detection of systematic biases. Mol Biol Evol. 2004, 21: 1455-1458. 10.1093/molbev/msh137.View ArticlePubMedGoogle Scholar
- International Union of Pure and Applied Chemistery and International Union of Biochemistery (IUPAC-IUB) Commission on Biochemical Nomenclature: Abbreviations and symbols for nucleic acids, polynucleotides and their constituents. Biochem J. 1970, 120: 449-454.View ArticleGoogle Scholar
- Dayhoff MO, Schwartz RM, Orcutt BD: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. Edited by: Dayhoff MO. 1978, Washington: National Biomedical Research Foundation, 5 (Suppl 3): 345-352.Google Scholar
- Embley TM, van der Giezen M, Horner DS, Dyal PL, Foster P: Mitochondria and hydrogenosomes are two forms of the same fundamental organelle. Philos Trans R Soc Lond B Biol Sci. 2003, 358: 191-203. 10.1098/rstb.2002.1190.PubMed CentralView ArticlePubMedGoogle Scholar
- Susko E, Roger AJ: On reduced amino acid alphabets for phylogenetic inference. Mol Biol Evol. 2007, 24: 2139-2150. 10.1093/molbev/msm144.View ArticlePubMedGoogle Scholar
- Stuart A: A test for homogeneity of the marginal distributions in a two-way classification. Biometrika. 1955, 42: 412-416.View ArticleGoogle Scholar
- Ababneh F, Jermiin LS, Ma C, Robinson J: Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences. Bioinformatics. 2006, 22: 1225-1231. 10.1093/bioinformatics/btl064.View ArticlePubMedGoogle Scholar
- International Union of Pure and Applied Chemistery and International Union of Biochemistery (IUPAC-IUB) Commission on Biochemical Nomenclature: A one-letter notation for amino acid sequences (definitive rules). Pure Appl Chem. 1972, 31: 639-645. 10.1351/pac197231040639.Google Scholar
- Von Neumann J: Mathematische Grundlagen der Quantenmechanik. 1932, Berlin: SpringerGoogle Scholar
- Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang ES: Are protein-protein interfaces more conserved in sequence than the rest of the protein surface?. Protein Sci. 2004, 13: 190-202. 10.1110/ps.03323604.PubMed CentralView ArticlePubMedGoogle Scholar
- JAMA: A Java Matrix Package. [http://math.nist.gov/javanumerics/jama/]
- Shannon C: A mathematical theory of communication. Bell System Tech J. 1948, 27: 379-423. 623-656View ArticleGoogle Scholar
- Taylor WR: The classification of amino acid conservation. J Theor Biol. 1986, 119: 205-218. 10.1016/S0022-5193(86)80075-3.View ArticlePubMedGoogle Scholar
- Bordo D, Argos P: Suggestion for "safe" residue substitutions in site-directed mutagenesis. J mol Biol. 1991, 217: 721-729. 10.1016/0022-2836(91)90528-E.View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci. 1992, 89: 10915-10919. 10.1073/pnas.89.22.10915.PubMed CentralView ArticlePubMedGoogle Scholar
- BLOSUM matrices. [http://blocks.fhcrc.org/blocks/uploads/blosum/]
- Eddy SR: Where did the BLOSUM62 alignment score matrix come from?. Nat Biotechnol. 2004, 22: 1035-1036. 10.1038/nbt0804-1035.View ArticlePubMedGoogle Scholar
- States DJ, Gish W, Altschul SF: Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods: A Companion to Methods Enzymol. 1991, 3: 66-70. 10.1016/S1046-2023(05)80165-3.View ArticleGoogle Scholar
- Uesaka H: Validity and applicability of several tests for comparing marginal distributions of a square table with ordered categories. Behaviormetrika. 1991, 30: 65-78. 10.2333/bhmk.18.30_65.View ArticleGoogle Scholar
- Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C -- The Art of Scientific Computing. 1992, Cambridge: Cambridge University Press, 2Google Scholar
- Artificial datasets used in . [http://www.atgc-montpellier.fr/phyml/datasets.php]
- Rambaut A, Grassly NC: Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput Appl Biosci. 1997, 13: 235-238.PubMedGoogle Scholar
- Jones D, Taylor W, Thornton J: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8: 275-282.PubMedGoogle Scholar
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696-704. 10.1080/10635150390235520.View ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004, 5: 113-10.1186/1471-2105-5-113.PubMed CentralView ArticlePubMedGoogle Scholar
- Egan JP: Signal detection theory and ROC analysis, series in cognition and perception. 1975, New York: Academic PressGoogle Scholar
- Metz CE: Basic principles of ROC analysis. Semin Nucl Med. 1978, 8: 283-298. 10.1016/S0001-2998(78)80014-2.View ArticlePubMedGoogle Scholar
- Swets J: Measuring the accuracy of diagnostic systems. Science. 1988, 240: 1285-1293. 10.1126/science.3287615.View ArticlePubMedGoogle Scholar
- Fawcett T: An introduction to ROC analysis. Pattern Recogn Lett. 2006, 27: 861-874. 10.1016/j.patrec.2005.10.010.View ArticleGoogle Scholar
- McStewart W: A note on the power of the sign test. Ann Math Stat. 1941, 12: 279-303. 10.1214/aoms/1177731710.View ArticleGoogle Scholar
- Dixon WJ, Mood AM: The statistical sign test. J Am Statist Assoc. 1946, 41: 557-566. 10.2307/2280577.View ArticleGoogle Scholar
- Hemelrijk J: A theorem on the sign test when ties are present. Proc Nederl Akad Weten Ser A. 1952, 55: 322-Google Scholar
- Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997, 14: 685-695.View ArticlePubMedGoogle Scholar
- Goloboff P, Farris S, Nixon KC: TNT (Tree analysis using New Technology) ver. 1.0. 2000, Published by the authors, Tucumán, ArgentinaGoogle Scholar
- Nixon KC: The parsimony ratchet, a new method for rapid parsimony analysis. Cladistics. 1999, 15: 407-414. 10.1111/j.1096-0031.1999.tb00277.x.View ArticleGoogle Scholar
- Estabrook GF, McMorris FR, Meacham CA: Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool. 1985, 34: 193-200. 10.2307/2413326.View ArticleGoogle Scholar
- Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985, 39: 783-791. 10.2307/2408678.View ArticleGoogle Scholar
- Anisimova M, Gascuel O: Approximate likelihood ratio test for branches: a fast, accurate and powerful alternative. Syst Biol. 2006, 55: 539-552. 10.1080/10635150600755453.View ArticlePubMedGoogle Scholar
- Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010, 59: 307-321. 10.1093/sysbio/syq010.View ArticlePubMedGoogle Scholar
- Hanley JA, McNeil BJ: The meaning and the use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982, 143: 29-36.View ArticlePubMedGoogle Scholar
- Fogarty J, Baker RS, Hudson SE: Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. Proceedings of Graphics Interface (GI 2005): 09-11 May 2005; Victoria, British Columbia. Edited by: Michael McCool. 2005, University of Waterloo, 129-136.Google Scholar
- Concatenate: a software to build supermatrices of characters. [http://www.supertriplets.univ-montp2.fr/PhyloTools.php]
- Adachi J, Hasegawa M: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 1996, 42: 459-468. 10.1007/BF02498640.View ArticlePubMedGoogle Scholar
- Rodríguez-Ezpeleta N, Brinkmann H, Burey SC, Roure B, Burger G, Löffelhardt W, Bohnert HJ, Philippe H: Monophyly of Primary Photosynthetic Eukaryotes: Green Plants, Red Algae, and Glaucophytes. Curr Biol. 2005, 15: 1325-1330. 10.1016/j.cub.2005.06.040.View ArticlePubMedGoogle Scholar
- Hackett JD, Yoon HS, Li S, Reyes-Prieto A, Rümmele SE, Bhattacharya D: Phylogenomic analysis supports the monophyly of Cryptophytes and Haptophytes and the association of Rhizaria with Chromalveolates. Mol Biol Evol. 2007, 24: 1702-1713. 10.1093/molbev/msm089.View ArticlePubMedGoogle Scholar
- Burki F, Shalchian-Tabrizi K, Pawlowski J: Phylogenomics reveals a new 'metagroup' including most photosynthetic eukaryotes. Biol Lett. 2008, 4: 366-369. 10.1098/rsbl.2008.0224.PubMed CentralView ArticlePubMedGoogle Scholar
- Hampl V, Hug L, Leigh JW, Dacks JB, Lang BF, Simpson AGB, Roger AJ: Phylogenomic analyses support the monophyly of Excavata and resolve relationships among eukaryotic "supergroups". Proc Natl Acad Sci USA. 2009, 106: 3859-3864. 10.1073/pnas.0807880106.PubMed CentralView ArticlePubMedGoogle Scholar
- Ewens W, Grant G: Statistical methods in bioinformatics: an introduction. 2005, New York: Springer, SecondView ArticleGoogle Scholar
- Felsenstein J: Evolutionary tree from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981, 17: 368-376. 10.1007/BF01734359.View ArticlePubMedGoogle Scholar
- Rzhetsky A, Nei M: Tests of applicability of several substitution models for DNA sequence data. Mol Biol Evol. 1995, 12: 131-151.View ArticlePubMedGoogle Scholar
- Bowker AH: A test for symmetry in contingency tables. J Am Stat Assoc. 1948, 43: 572-574. 10.2307/2280710.View ArticlePubMedGoogle Scholar
- Rokas A, Williams BL, King N, Carroll SB: Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003, 425: 798-804. 10.1038/nature02053.View ArticlePubMedGoogle Scholar
- Yeast multi-gene dataset. [http://systbio.org/files/Ren_etal_Rokas2003data.txt]
- Lanave C, Preparata G, Saccone C, Serio G: A new method for calculating evolutionary substitution rates. J Mol Evol. 1984, 20: 86-93. 10.1007/BF02101990.View ArticlePubMedGoogle Scholar
- Rodriguez R, Oliver JL, Marin A, Medina JR: The general stochastic model of nucleotide substitution. J Theor Biol. 1990, 142: 485-501. 10.1016/S0022-5193(05)80104-3.View ArticlePubMedGoogle Scholar
- Yang Z: Estimating the pattern of nucleotide substitution. J Mol Evol. 1994, 39: 105-111.PubMedGoogle Scholar
- Lake JA: Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci USA. 1994, 91: 1455-1459. 10.1073/pnas.91.4.1455.PubMed CentralView ArticlePubMedGoogle Scholar
- Lockhart PJ, Steel MA, Hendy MD, Penny D: Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol. 1994, 11: 605-612.PubMedGoogle Scholar
- Steel MA: Recovering a tree from the leaf colourations it generates under a Markov model. Appl Math Lett. 1994, 7: 19-23. 10.1016/0893-9659(94)90024-8.View ArticleGoogle Scholar
- Taylor DJ, Piel WH: An assessment of accuracy, error, and conflict with support values from genome-scale phylogenetic data. Mol Biol Evol. 2004, 21: 1534-1537. 10.1093/molbev/msh156.View ArticlePubMedGoogle Scholar
- Ren F, Tanaka H, Yang Z: An empirical examination of the utility of codon-substitution models in phylogeny reconstruction. Syst Biol. 2005, 54: 808-818. 10.1080/10635150500354688.View ArticlePubMedGoogle Scholar
- Burleigh JG, Driskell AC, Sanderson MJ: Supertree bootstrapping methods for assessing phylogenetic variation among genes in genome-scale data sets. Syst Biol. 2006, 55: 426-440. 10.1080/10635150500541722.View ArticlePubMedGoogle Scholar
- Ané C, Larget B, Baum DA, Smith SD, Rokas A: Bayesian estimation of concordance among gene trees. Mol Biol Evol. 2007, 24: 412-426. 10.1093/molbev/msl170.View ArticlePubMedGoogle Scholar
- Criscuolo A, Michel CJ: Phylogenetic inference with weighted codon evolutionary distances. J Mol Evol. 2009, 68: 377-392. 10.1007/s00239-009-9212-y.View ArticlePubMedGoogle Scholar
- Néron B, Ménager H, Maufrais C, Joly N, Maupetit J, Letort S, Carrere S, Tuffery P, Letondal C: Mobyle: a new full web bioinformatics framework. Bioinformatics. 2009, 25: 3005-3011. 10.1093/bioinformatics/btp493.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.