BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments

Background The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. It has been shown that removing ambiguously aligned regions, but also other sources of bias such as highly variable (saturated) characters, can improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow a precise manual curation of each dataset, there exists a real need for efficient bioinformatic tools dedicated to this alignment character trimming step. Results Here is presented a new software, named BMGE (Block Mapping and Gathering with Entropy), that is designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. Calculation of these entropy-like scores is weighted with BLOSUM or PAM similarity matrices in order to distinguish among biologically expected and unexpected variability for each aligned character. Sets of contiguous characters with a score above a given threshold are considered as not suited for phylogenetic inference and then removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments including distantly-related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity. Conclusions BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.


stantly-rel
ted sequences.BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity.Conclusions: BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences.Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.

Background

Most phylogenetic inference approaches are based on an alignment of homologous sequences (e.g.DNA, RNA, amino acids).The alignment of sequences aims at highlighting the substitutions that have occurred during the evolutionary process from their common ancestral sequence.The quality of a multiple sequence alignment can have a strong impact on the accuracy of the inferred phylogenetic tree, whatever the inference criterion used [1][2][3][4].In spite of constant improvements of the multiple sequence alignment heuristics [5,6], an alignment can contain regions (i.e.sets of contiguous characters, also often called blocks [7,8]) where homology is ambiguous.Moreover, too divergent regions (even when correctly aligned) may induce a mutational saturation effect, which is an important sou ce of bias for many phylogenetic reconstruction methods.In order to minimize the bias introduced by these problematic regions, a frequent approach is to detect and remove them from the multiple sequence alignment prior to phylogenetic analysis (e.g.[9][10][11][12][13]). Indeed, it has been observed that the removal of such regions allows more accurate trees to be inferred [7,8,[14][15][16].

A current trend consists in reconstructing phylogenetic trees by using a large number of datasets of aligned sequences from many complete genomes.Phylogenetic trees are then reconstructed from these datasets in many contexts, such as the construction of gene tree databases [13], the inference of species trees based on a core-gene set [17,18] or the estimation of amino acid substitution matrices [19].These different phylogenetic explorations are often based on (semi-)automated processes (e.g.[15,18,20]), requiring a software solution for each step of these computer pipelines.Given the importance of dataset quality, the use of practical and accurate software dedicated to alignment trimming task has become a real need.

In this paper, we present a novel software, named BMGE (Block Mapping and Gathering with Entropy), that identifies regions inside multiple sequence alignments that are suited for phylogenetic inference.BMGE computes a score for each character (i.e.amino acid, nucleotide or codon column), mainly determined by the entropy induced by the proportion of character states.To estimate realistic scores that take into account biologically relevant substitution processes (e.g.transition rates more frequent than transversions for DNA sequences, highest probability of changes between amino acids with physicochemical similarities), BMGE weights the entropy estimation with standard substitution matrices (e.g.PAM or BLOSUM).Averaging score values across the characters of the ultiple sequence alignment allows identifying conserved (i.e. with low entropy-like score values) and highly variable/uncertain regions (i.e. with large entropy-like score values [15,21]).By removing such high entropy regions, BMGE returns trimmed datasets that allow the reconstruction of more accurate phylogenetic trees than the initial alignment, as shown by simulation studies.

In addition, BMGE also provides simple solutions to alleviate systematic artefacts caused by compositional heterogeneity.Most probabilistic phylogenetic inference methods make the assumption (among other more or less axiomatic ones) that the studied sequences arose from a common ancestral sequence following a stationary evolutionary process, i.e. the marginal probabilities of the character states remained constant over all sequences (e.g.[11,22]).Consequently, when phylogenetic trees are inferred from sequences with heterogeneous composition of character states, the violation of the stationary assumption may cause systematic errors [19,[23][24][25].BMGE is therefore able to perform RYcoding from a DNA sequence alignment [25], and to convert amino acid sequences into their corresponding degenerated codons according to the universal genetic code.These two recoding strategies may prove useful to minimize some biases when dealing with datasets with known heterogeneous composition across sequences.As these two recoding approaches use only the standard one-letter nucleotide alphabet [26] (see Table 1), the resulting datasets can be given to all phylogeny inference programs, in contrast to alternative recoding techniques based on non-standard alphabet cardinality such as the "Dayhoff classes" 6-residue alphabet [27,28] (see also [29] for discussion on other recoding schemes).Moreover, the use of degenerated codons allow fast inference of trees, in particular with Maximum Likelihood (ML) methods which are faster with nucleotide sequences than with amino acid ones.BMGE also implements a novel stationary-based trimming method that allows compositionally heterogeneous characters to be identified and removed.To do so, BMGE uses the Stuart's c 2 matched-pairs test of marginal symmetry [30] that allows assessing th

hat two sequences are compositionally homogeneous [
1], and iteratively performs character removal/addition steps until the Stuart's test assesses that each pair of sequences presents homogeneous composition.As shown by computer simulations with heterogeneous GC-content DNA sequences, this stationary-based trimming leads to unbiased phylogenetic trees.


Implementation


Input/output files and sequence coding conversions

The input file for BMGE is a multiple sequence alignment in FASTA (or PHYLIP sequential) format.The user must indicate whether the sequences are amino acids or DNA (with standard one-letter coding [26,32]; see Table 1).It is also possible to consider DNA sequen es as codons, which allows the multiple sequence alignment to be handled with amino acid substitution matrices.Selected (and/or removed) regions are written in an output file in several formats (i.e.FASTA, PHYLIP sequential, NEXUS).HTML output is also available to display selected sites as well as graphical representation of both entropy values and gap proportions.

Several sequence conversion options are also available: from DNA or codons to RY-coding [25], and from codons to translated amino acids (according to the universal genetic code).BMGE also allows converting an amino acid alignment into a nucleotide alignment by considering the corresponding degenerated codons (see Table 1).In practice, given an amino acid and its set of corresponding synonymous codons (following the universal genetic code), the degenerated codon is simply obtained, for each of the three codon positions p, by considering the nucleotides co

esponding to the set of possible
odons at position p.For example, isoleucine (I) can be encoded by the three codons ATA, ATC and ATT; therefore, the degenerated codon corresponding to this set of codons is ATH, knowing that the degenerated nucleotide H (i.e.A, C or T; see Table 1) represents the possib ach diagonal entry is the relative frequency of each of the r possible character states
Methyl M A or C Pur a [34]:


Amino acid 1-letter code Degenerated codon 3-letter code
Alanine A GCX Arginine R MGX Asparagine N AAY Aspartic acid D GAY Cysteine C TGY Glutamine Q CAR Glut cine G GGX Histidine H CAY Isoleucine I ATH Leucine L YTX Lysine K AAR Methionine M ATG Phenylalaline F TTY Proline P CCX Serine S WSX Threonine T ACX Tryptophan W TGG Thyrosine Y TAY Valine V GTXh c log c r c ( ) [ ( )], ( ) ( ) = −trace       S S(1)
where S is a similarity matrix (see below for more details) and μ = [trace (∏ (c) S) ] -1 is a normalizing factor so that trace (μ∏ (c) S) = 1.To compute formula (1), BMGE uses the simpler equation
h c s c s r r s c ( ) log , ( ) ( ) = − = … ∑   1(2)
where the  s c ( ) parameters are the eigenvalues of the matrix μ∏ (c) S estimated via the JAMA package [35].It should be stressed that the entropy normalization condition
 s c s r ( ) = ∑ = 1 1  is verified with the normalizing fac- tor μ.
Knowing that h(c) = 0 indicates that character c is constant and that h(c) is as close to 1 as character c is unexpectedly variable (see below), BMGE looks for conserved regions of the multiple sequence alignment by smoothing the different h(c) values by using a sliding window of length 2w+1 (with w = 1 by default).For each c = 1, 2,...,m , the smooth  h c ( ) value is estimated by the weighted average
 h c i g i h i i g i ( ) [ ()] () [ ()] , = ∑ = − ∑ = − start end start end 1 1(3)
with start = max(1;c-w) and end = min(m;c+w), in order to give more weight to characters with few gaps, i.e. g(c) ≅ 0. After this smoothing operation, BMGE defines as conserved those characters c that have  h c


( )

value lower than a fixed threshold (0.5 by default).A conserved region is then defined as a set of contiguous conserved characters.Then, the multiple sequence alignment is partitioned into successive conserved (C ) and variable (i.e.nonconserved; V ) regions, these being either a single character or a set of con

guous ones.Let V i be the i th n
nconserved region.By definition, V i is flanked by the two conserved regions C i and C i+1 .In order to distinguish variable regions due to ambiguous alignment from those due to natural variation, the average h˜value is computed for the region C i ∪ V i ∪ C i+1 by formula (3) with parameters 'start' and 'end' set as the first character of C i and the last character of C i+1 , respectively.The consecutive regions C i , V i , C i+1 with less than 30% of gaps and with  h value lower than the fixed 0.5 threshold are then merged into a unique conserved region.Finally, BMGE iteratively performs these merging operations until no more variable region V i can be merged with its two flanking C i and C i+1 ones.


On the use of similarity matrix

If the similarity matrix S in formula (1) is the identity matrix I r , then h(c) is closely related to the well-known Shannon entropy [36], given by formula (2) where each
 s c ss c ( ) ( )
=   is simply the proportion of the character state s for character c.If the character c is constant, then h(c) = 0. On the other hand, if the character c is highly variable, then each of the r character states is present with a relatively high proportion (e.g.  ss c r ( ) / ≈ 1 ), implying that h(c) is close to 1, its maximal value.Therefore, h allows the level of variability of a character to be quantified [21].Unfortunately, using h with the identity matrix I r (as suggested in [15]) suffers from biases.For example, when considering amino acid sequences, if a given character c 1 is only made of the four residues I, L, M and V, each with 25% proportion, and a second character c 2 is made of the four residues C, Q, W and Y, also with identical proportions, then formula (1) with matrix I r eturns h(c 1 ) = h (c 2 ) = -4× 0.25 log 20 0.25 ≈ 0.462, and then indicates the same level of variability for both characters c 1 and c 2 , while residues in character c 1 are much more likely to be substituted than those in character c 2 [27,37,38].In contrast, using dedicated similarity matrices S ≠ I allows relevant substitution processes to be taken into account.As suggested in [34], when using the Henikoff and Henikoff's [39] BLOSUM50 target frequency matrix in formula (1), one obtains h(c 1 ) ≈ 0.300 and h(c 2 ) ≈ 0.453.Therefore, the function h with appropriate similarity matrix S computes score values t at allow distinguishing among expected (i.e.biologically relevant) and ambiguous (e.g.source of noise) variability.

For practical use with amino acid sequences, BMGE provides the complete range of BLOSUM target frequency matrices (i.e.BLOSUM30, 35, 40, ..., 95 from [40]; see [39] for more details) in order to estimate pertinent h values depending on the level of divergence between sequences.As shown in simulation results (see below), our entropy-based character trimming method performs better when using stringent matrices (e.g.BLOSUM95) for closely related sequences, and, reciprocally, when using more relaxed matrices (e.g.BLO-SUM30) for distantly related sequences.By default, BMGE uses the popular BLOSUM62 matrix [41].

For DNA sequences, PAM matrices are first computed.BMGE uses

transition/transversion ratio (= 2.0
by default) to compute the PAM-1 4 × 4 matrix: the four diagonal elements are all 0.99, and the off diagonal elements are 0.01 (2 + ) -1 for transition and 0.01 (2 + ) -1 for transversion (see [42] for more details about the PAM-1 calculation for DNA).Given a prefixed integer h (= 100 by default), BMGE then computes the PAM-h matrix (= PAM-1 h [27]), which is finally used by BMGE as similarity matrix S in formula (1).In a similar way as the previously described amino acid framework, our character trimming method is more accurate when using a stringent matrix (e.g.PAM-1) with closely related DNA sequences, and when using a relaxed matrix (e.g.PAM-250) with distantly related ones.


Stationary-based character trimming

Given two aligned sequences (e.g.taken from the multiple sequence alignment), one can use a r×r divergence matrix F to represent the relative proportion of each of the possible character state pairs in the pairwise comparison of these two sequences (as schematized for DNA sequences by formula (19) in [11]).If these two sequences have similar character state composition, then, for each character state s = 1, 2,...,r , this sequence pair verifies the null hypothesis F s. = F .s , named the marginal homogeneity (e.g.[30,43]) or marginal symmetry (e.g.[31]).A no -too-low p-value returned by the Stuart's c 2 test [30] on F allows the marginal homogeneity/symmetry to be assessed; then, one can assess that two homologous sequences arise from a nonstationary evolutionary process if the Stuart's test returns a p-value close to zero (e.g.< 0.1).This test being essentially based on numerical linear algebra operations (see e.g.[31] for more details about its computation from two sequences), BMGE implements it with, on the one hand, matrix operations available in the JAMA package [35], and, on the other hand, with fast and accurate numerical algorithms for estimating c 2 cumulative distribution functions (adapted in Java from the C code sources available p. 216-219 in [44]).

If a multiple sequence alignment seems to be compositionally heterogeneous, BMGE implements a stationary-based character trimming in order to obtain a compositionally homogeneous alignment.Given a multiple alignment of n sequences, for each possible pair of distinct sequences i, j (i.e. 1 ≤ i <j ≤ n ), BMGE computes the Stuart's test p-value, denoted p ij .If there is at least one pair of sequences for which p ij < 0.1, then BMGE unction of their decreasing entropy-like h(c) values -as estimated by formula (1)-until p ij > 0.1 for every pairs of sequences i, j.This first crude character removal approach leads to a set C of compositionally homogeneous characters.Then, BMGE aims at integrating the set C with some of the previously removed characters, in order to obtain a set of compositionally homogeneous characters of maximal size.

To do so, BMGE uses an iterative add-and-remove approach of characters.Defining p ij (c) as the Stuart's test p-value for the sequence pair i, j after adding the character c inside the set C, BMGE computes the following score for each character c ∉ C :
( ) log( / ). ( ) c p p i j n ij c ij = ≤ < ≤ ∑ 1
If σ (c) > 0, then adding character c inside C leads to an increase for most of the n(n-1)/2 p-v

ues; reciprocally, addi
g characters c with σ (c)< 0 leads to a (not wished) overall decrease of the p-values.BMGE then progressively removes the characters c ∉ C from the initial multiple sequence alignment following the increasing order of their respective σ(c) value, i.e. from the smaller (negative) to the larger (positive) σ score value.After each character removal, every p ij are re-estimated.When all p ij > 0.1

BMGE stops removing characters from the initial alignment
and considers the remaining (i.e.not removed) characters as the new set C. Finally, BMGE iteratively performs these add-andremove operations until the set C of compositionally homogeneous characters cannot be increased further.


Results and Discussion

In order to assess the utility of multiple alignment character trimming to infer more accurate phylogenetic trees and to compare the respective performances of BMGE (with different similarity matrices) with other available character trimming methods (i.e.Gblocks [7]; Noisy [14]; trimAl [16]), we carried out computer simulations and real case studies.Protocol and results are described below.
Simulation results with entropy-based character trimming

The 200 first 40-taxon trees available in [45] were selected as model trees to generate artificial amino acid sequence datasets.On the one hand, in order to simulate phylogenetically informative characters, 10 clusters of 40 sequences each were generated using Seq-Gen [46] under the JTT model [47] from each of the 200 model trees.Sequence lengths for each of these 10 sequence clusters were randomly drawn from 30 to 70 amino acids.On the other hand, in order to simulate uninformative/var able characters, amino acid sequences were generated in the same way from a 40-taxon star tree (i.e. a tree with 40 leaves and a unique internal node).

Following these two steps, for each of the 200 initial model trees, we generated 20 clusters of 40 amino acid sequences of lengths 50 ± 20, ten containing informative phylogenetic signal, and ten containing uninformative/ variable signal.Finally, from each of these 200 sets of 20 sequence clusters, 10 clusters were randomly drawn and concatenated, in order to produce a dataset composed of 200 clusters of 40 sequences of 500-amino acid length on averag , each containing an equal mixture of phylogenetically informative and uninformative regions.

This simulation procedure to generate 200 clusters of sequences containing 50% (on average) of uninformative/variable characters as repeated three times, each with initial model tree branch lengths (see [45,48] for more details) multiplied by a divergence factor of 1, 2 and 3, respectively, in order to mimic from closely-to distantly-related sequence clusters.Each of these (3 × 200=)600 clusters of simulated amino acid sequences were aligned with MUSCLE [49,50].The average lengths of these multiple sequence alignments are 513, 561, and 573 for the levels of divergence ×1, ×2, and ×3, respectively.

The software BMGE was applied on these multiple sequence alignments with three similarity matrices: BLOSUM30, BLOSUM62 and BLOSUM95.As a comparison, character trimming was also performed by using three available softwares.

Gblocks (0.91 b) was used with 'strict' and 'relaxed' parameter sets (see [7] for a precise d scription of these two parameter sets).However, the 'strict' conditions (default parameters in Gblocks) are indeed very stringent (e.g.no character was selected in ≈40% of the multiple sequence alignments with level of divergence ×3), and hylogenetic trees inferred from the remaining blocks (when these existed) were always less accurate than those inferred from the blocks returned by the 'relaxed' conditions (results not shown).Worse results than those obtained with the 'relaxed' conditions were also observed with alternative parameter sets (e.g.those described by [12]; results not shown).Then, results from Gblocks presented below are only those obtained with the 'relaxed' conditi ns.

The software Noisy was used with the -nogap options.Several other options were tested (especially the -cutoff one; see [14] for more details) but the Noisy default options allows better results to be observed with our simulated datasets.

The software trimAl (1.2rev59) allows three trimming methods (among numerous ones) to be used: 'gappyout', 'strictplus' and 'automated1'.Since the 'gappyout' method mainly focuses on highly gapped regions, this approach removed too few characters in our poorly gapped simulated datasets.Consequently, the 'gappyout' results are not shown since they are very close to those observed with the initial (i.e.non-trimmed) multiple sequence alignments.

As expected in regard to the sequence simulation protocol (see above), initial multiple sequence alignments returned by MUSCLE are composed of approximately 50% phylogenetically informative characters (i.e.those characters that are not generated by Seq-Gen from the star tree).For each trimming method, the character set of an initial multiple sequence alignment is then partitioned into four subsets: true positives and false negatives (i.e.phylogenetically informative characters that are selected or not by the trimming method, respectively), false positives (i.e.characters selected by the trimming method and that are not p ylogenetically informative), and true negatives (i.e.characters that are not phylogenetically informative and are indeed not selected by the trimming method).Denoting tp and tn , the number of true positive and negative characters, respectively, and fp and fn , the number of false positive and negative characters, respectively, the true positive rate tpr = tp/ (tp + fn) and false positive rate fpr = fp/(fp + tn) were computed.For each trimming method and each of the three levels of divergence (i.e.×1, ×2, ×3), a ROC graph (e.g.[51][52][53][54]) depicting the plot of the true (y-axis) and false (x-axis) positive rates for each of the 200 datasets is represented in Figure 1.

All initial (i.e.non-trimmed) multiple sequence alignments (i.e.tn = fn = 0) correspond to the (1,1) point (i.e.fpr = fp/fp = tpr = tp/tp =1); reciprocally, the removal of all characters (i.e.tp = fp = 0) corresponds to the (0,0) point (i.e.fpr = 0/tn = tpr = 0/fn = 0).A cloud close to the (1,1) point in the ROC graph then indicates that the corresponding trimming method is liberal (i.e. it keeps too many uninformative/variable characters).Conversely, conservative trimming methods (i.e. that remove too many phylogenetically informative characters) correspond to clouds close to the (0,0) point.Ideally, the best method selects only those characters that are phylogenetically informative (i.e.fn = fp = 0), and then corresponds to the (0,1) point inside the ROC graph (i.e.fpr = 0/tn = 0 and tpr = tp/tp = 1).For each case in Figure 1, the L 1 distance between each point and the (0,1) point (= 1-tpr+fpr ) was computed, averaged and written under its ROC graph.For each of the three levels of divergence (i.e.×1, ×2, ×3), a sign test [55][56][57] was performed to assess the statistical significance between the best average L 1 distance measure (i.e. the lowest) and the other ones.For each level of divergence in Figure 1, an average L 1 distance measure is considered as non-significantly different to the best one if the p-value returned by the sign test is > 5%.

Figure 1 shows that, in all cases, trimAl has always tpr ≈ 1 but often fpr ≫ 0, indicating that it is too liberal (i.e.fp ≫ 0); moreover, L 1 distances observed for trimAl are often among the worst, similarly to those observed for Noisy (except for the level of divergence ×3).We can also see from Figure 1 that Gblocks with relaxed conditions presents very good performance with closely related sequences (i.e. level of divergence ×1), but induces tpr «1 when the level of divergence increases (i.e. from ×1 to ×3), showing that it becomes too Figure 1 ROC graphs plotting true (y-axis) and false (x-axis) positive rates for seven character trimming method

The best methods (i.e., that minimize both the nu
ber of true negative and false positive characters) are those with the corresponding cloud concentrated around the upper left point (0,1).Under each ROC graph, the average L 1 distance between each point and the (0,1) point is given.For each level of divergence, the best (i.e.lower) distance is written in boldface characters.Average L 1 distances that are not significantly different to this best value (as assessed by a sign test) are underscored.conservative (i.e.fn ≫ 0).As expected, BMGE with stringent option (i.e.BLOSUM95) corresponds to points that are more concentrated around the y-axis as the level of sequence divergence increases, showing that the use of stringent similarity mat ices such as ranging from BLOSUM80 to BLOSUM95 lead to conservative character trimming with distantly related sequences (e.g. level of divergence ×3).Reciprocally, the use of the similarity matrix BLOSUM30 leads to selecting too many characters (i.e.fpr ≫ 0) with closely related sequences (i.e. level of divergence ×1).However, BMGE with BLO-SUM30 allows minimizing both fn and fp when the level of divergence increases; indeed this last BMGE usage leads to the best average L 1 distances for the levels of divergence ×2 and ×3.


Simulation results on phylogenetic tree accuracy

For each of the three levels of divergence and from each multiple sequence alignment (i.e. the initial one as well as those outputted by BMGE, Gblocks, Noisy and tri-mAl), phylogenetic trees were inferred with several approaches.Maximum Likelihood (ML) trees were inferred by PhyML [48] with model JTT-Γ 4 .BioNJ [5 ] trees were also inferred by PhyML with JTT distances.Maximum Parsimony (MP) trees were inferred by TNT [59] with TBR branch swapping and parsimony ratchet [60].Bayesian inference was not performed because of the huge running time required by this approach.Moreover, Bayesian trees are expected to be very close to ML trees when inferred from our simulated datasets [3].

For each of the three levels of divergence, topological accuracy of the ML, BioNJ and MP trees inferred from the initial and the trimmed multiple sequence alignments was measured by the quartet distance [61] between each inferred tree and its corresponding model tree.All these distance measures are reported in Tables 2, 3 and 4 for ML, BioNJ and MP trees, respectively, normalized in order to restrict these values to the interval [0,1], then averaged.For each of the three levels of divergence (i.e.×1, ×2, ×3) and each of the three tree reconstruction methods (i.e.ML, BioNJ, MP), a sign test was performed to assess the statistical significance between the best average distance measure (i.e. the lowest) and the other ones.In each column of the Tables 2, 3 and 4, an entry is considered as non-significantly different to the best one if the p-value returned by the sign test is > 5%.

Variable/uncertain characters contained in the initial multiple sequence alignments cause strong artefacts in the resulting phylogenetic trees, especially for BioNJ and MP trees (see Tables 2, 3,and 4).When character trimming softwares are used, almost all reconstructed phylogenetic trees are closer to their model tree (Tables 2, 3, and 4), showing that character trimming is a useful step prior any phylogenetic inference.However, it should be stressed that the ML approach is very robust to the noise introduced by uninformative/variable characters in   our simulated datasets, mainly thanks to the Γ parameter.Even if our datasets wer

generated with equal rates across characters (see above
, estimation of a Γ parameter was included in ML inference because preliminary tries without Γ parameter led to less accurate trees, especially those inferred from Noisy and trimAl outputs (results not shown).The Γ parameter then helps to compensate part of the phylogenetic noise contained in our datasets.However, when considering distantly-related sequences (e.g. level of diver ence ×3), the ML approach (as well as the BioNJ and MP ones) needs a preliminary trimming step to infer significantly accurate trees, in agreement with results from previous simulation-based studies (e.g.[8]).

In general, BMGE (when used with a BLOSUM substitution matrix adequate to the level of sequence divergence) allows reconstructing among the most accurate trees for each of the three levels of divergence and every tree reconstruction method used (Tables 2,3 and 4).tri-mAl and Noisy infer the worst BioNJ and MP trees, whereas Gblocks with relaxed conditions produces good results as long as sequences are not too divergent.To the minor exception of Noisy with ML trees, BMGE trimming with the (less stringent) BLOSUM30 matrix allows reconstructing the significantly best trees with level of divergence ×3.


Simulation results on phylogenetic tree branch support

In order to observe the impact of character trimming methods on branch supports inside phylogenetic trees, we have focused on two different approaches to estimate confidence values on the internal branches of a phylogenetic tree: the bootstrap-based support [62] with BioNJ trees, and the approximate likelihood ratio test (aLRT; [63]) as implemented by default in PhyML 3.0 [64].

For each of the three levels of divergence (i.e.×1, ×2, ×3) a d from each multiple sequence alignment (i.e. the initial as well as the seven outputted by BMGE, Gblocks, Noisy and trimAl; see above), 100 bootstrap-based replicates were generated, and 100 BioNJ trees were inferred from these multiple sequence alignment replicates.From these 100 BioNJ bootstrap-based trees, bootstrap proportions were assessed on the internal branches of the corresponding (true) model tree.For each character trimming method and each level of divergence, we computed the distribution of the so-obtained bootstrapbased confidence values, as well as its average value (Figure 2).Similarly, we also computed the distributions of aLRT branch support estimated on the (true) branches of the model trees from the different multiple sequence alignments, as well as their average values (Figure 3).For each way to estimate confidence values (i.e.BioNJ bootstrap-based and aLRT-based ones) and each level of divergence (i.e.×1, ×2, ×3), a c 2 test was performed to assess whether distributions are significantly different.In Figures 2 and 3, a distribution is considered as non-significantly different to the best one (i.e.those corresponding to the highest average confidence value) if the p-value returned by the c 2 test is > 5%.

For each of the three levels of divergence and from each multiple sequence alignment, BioNJ-based bootstrap proportions were also estimated on the branches of the phylogenetic trees inferred by BioNJ (i.e.those used to compute quartet distances in Table 3).A bootstrap-based confidence value is then expected to be high when the corresponding branch is true (i.e.present in the model tree), and is expected to be low for false branches.Given a threshold τ, all branches of an inferred tree can be partitioned into four subsets: true positives or false negatives (i.e.true branches with confidence values ≥ τ or <τ, respectively), false positives or true negatives (i.e.false branches ith confidence values ≥ τ or <τ, respectively).As a consequence, there exists a point in the ROC space (see above) that is associated to a threshold value τ, and plotting these points for a large number of possible threshold values τ results in a socalled ROC curve (see [52,54] for more details).ROC curves obtained by varying τ from 0 to 1 with 0.02increment are displayed in Figure 4, where up-right head of each ROC curve corresponds to lowest τ values, whereas down-left tail corresponds to highest τ values.Figure 5 represents ROC curves built in the same way from aLRT-based confidence values on the branches of the inferred ML trees (i.e.those used to compute quartet distances in Table 2).Figure 4 and 5 also display the area under each ROC curve (AUC) that corresponds to the probability that a confidence value will be higher for a true branch than for a false branch (see [65,54] for more details).For each of the three levels of divergence, a Z test (as described in [66]) was performed to assess the statistical significance between the best AUC (i.e. the highest) and the other ones.In Figures 4 and 5, an AUC is considered as non-significantly different to the best one if the p-value returned by the Z test is > 5%.

Broadly, Figure 2 shows that the estimate of BioNJ bootstrap proportions from the initial (non-trimmed) multiple sequence alignments leads to very biased values, with a proportion of true branches with bootstrap-based confidence values ≤ 0.1 varying from 31% to 47%, and those > 0.9 varying from 18% to 20%. Figure 2 also shows that the highest average bootstrap-based confidence values are obtained from the character-trimming methods that best optimize both tpr and fpr criteria (see above and Figure 1).The same conclusions hold for aLRT with level of divergence ×1 (Figure 3), with (significantly) best average aLRT values observed with BMGE and Gblocks.Surprisingly, for level of divergence ×3, the best average aLRT values are obtained   , that are able to associate higher confidence values to true branches than to false branches) are those that maximize the area under the ROC curve (AUC).For each simulation case, the AUC is given under the corresponding ROC space representation.For each column, the best (i.e.higher) AUC is written in boldface characters.AUCs that are not significantly different to this best value (as assessed by a Z test) are underscored.AUC values in Figures 4 and 5 show that initial (nontrimmed) multiple sequence alignments lead to more incorrect confidence values than those estimated from characters selected by trimming methods.Interestingly, Figures 4 and 5 show that aLRT-based ROC curves induce highest AUC values on average that bootstrapbased ones (with the slight exception of BMGE with BLOSUM95 for level of divergence ×3).This shows that aLRT-based confidence values are more able to discreminate true and false branches than BioNJ bootstrapbased ones.ROC curve shapes also show that trimming methods lead to more precise confidence values.Indeed, in Figure 4, the down-left tail point of each ROC curve (i.e.corresponding to threshold τ = 0.98) always has highest tpr (y-axis) for trimmed alignments than for initial ones; this shows that a larger proportion of true branches with BioNJ bootstrap-based confidence values >0.98 is observed with trimmed alignments than with initial alignments.For initial multiple sequence alignments and for level of divergence ×2 and ×3 in Figure 4, the up-right head points of these two ROC curves (i.e. each corresponding to threshold τ = 0.02) have tpr < 1; this means that there exists some true branches with BioNJ bootstrap-based confidence value < 0.02.Notably, this is not the case when using character trimming methods.These two tendencies on tpr alues for both extremities of the ROC curves are in agreement with the distributions in Figure 2. Nevertheless, when observing the fpr ranges (x-axis) of the ROC curves in Figure 4, trimmed multiple sequence alignments all induce larger fpr values than initial ones; this shows that when using character trimming methods instead of initial multiple sequence alignments, t

re is an increase in the proportion of false branches in Bio
J trees with high confidence value (e.g.down-left tail of ROC curves corresponding to τ = 0.98) and a decrease of the proportion of false branches with low confidence values (e.g.up-right tail of ROC curves corresponding to τ = 0.02).This last tendency is clearly obvious with level of divergence ×3 (see Figure 4).However, shapes of ROC curves constructed by thresholding aLRT-based confidence values on ML trees do not seems to be strongly modified by trimming methods (see Figure 5).More precisely, for levels of divergence ×1 and ×2, Gblocks and BMGE (with BLOSUM62) present always among the best results.For level of divergence ×3, Noisy and BM E with BLOSUM30 present the significantly best AUC values for aLRT-based confidence values.

To sum up, these simulation results show that, by selecting among variable characters those with biologically-relevant expected variability thanks to the use of similarity matrices, BMGE often presents results that are among the (significantly) best (e.g.phylogenetic accuracy, better confidence values for true branches) and leads to a less biased phylogenetic signal.


Entropy-based character trimming in a phylogenomics context

In order to illustrate the benefit of using character trimming approaches, we have re-analysed the multi-gene dataset used by Castresana (2000) to describe the usefulness of Gblocks [7] in selecting suited characters.This amino-acid dataset is composed by ten genes (i.e. three subsunits of cytochrome c oxidase CO1-CO3, one subunit of cytochrome c-ubiquinol oxidoreductase CYTb, and six subunits of NADH desydrogenase ND1-ND5 and ND4L) gathered from the complete mitochondrial genome of 16 eukaryotes (11 unikonts, 4 archaeplastida, 1 excavate), and from Paracoccus denitrificans, an α-proteobacterium used as outgroup (see [7] for more details).

Protein sequences were aligned with MUSCLE, and each of the 10 multiple sequence alignments were trimmed by BMGE, Gblocks, trimAl and Noisy with the same parameters used in the previous simulations.For each of these seven character trimming approaches as well as the initial (non-trimmed) multiple sequence alignment, the ten so-obtained character matrices were concatenated into a single character supermatrix with the software Concatenate [67].Table 5 shows the different number of characters of these eight supermatrices.ML trees were inferred with PhyML from each character supermatrix with the mtREV model of amino acid substitution [68].Parameters defining the shape of the Γ distribution (8 categories) and the proportion of invariable characters were both left as free.The different loglikelihood (log-lk) estimated for each inferred ML trees were normalized by the total number of characters for better comparison and are shown in Table 5. ML bootstrap-based (100 replicates) and aLRT confidence values were assessed on the branches of the different inferred trees.

It should be stressed that our initial character supermatrix (i.e.4,530 characters; see Table 5) is slightly larger than the number of characters inside the original character supermatrix in [7] (i.e.4,453); this is due to the use of different multiple sequence alignment softwares.It should also be stressed that the log-lk per character estimated from our initial character supermatrix (i.e.-21.07; see Table 5) is higher than those provided in [7] (i.e.-22.71; see Table 4, page 546 in [7]); this can be explained by the different number of characters but also by our use of the amino acid model mtREV+Γ 8 +I instead of just mtREV in [7].However, the ML tree inferred from our initial character supermatrix has the same topology (left-hand tree in Figure 6) as the ML tree inferred by Castresana (2000; see Figure 5A, page 545 in [7]).The use of BMGE (with default BLOSUM62 and liberal BLOSUM30 matrices), Gblocks (relaxed), trimAl (strictplus and automated1) and Noisy lead to the same phylogenetic tree (left-hand tree in Figure 6).However, BMGE with the stringent BLOSUM95 similarity matrix leads to a different ML phylogenetic tree (right-hand tree in Figure 6).These two trees agree on the monophyly of Unikonts (nodes 1 and 1' in Figure 6), but differ in the placement of the jacobid Reclinomonas americana, which is inside the Archaeplastida subtree in the left-hand tree of Figure 6.Knowing that Archaeplastida and Unikonts are each likely monophyletic, and that jacobids are phylogenetically distinct from Archaeplastida (e.g.[69][70][71][72]), the BMGE (with BLOSUM95) tree at the right-hand side in Figure 6 seems more accurate.Moreover, BLOSUM95-based character trimming in BMGE gives among the best confidence val es for the monophyly of Unikonts whereas Gblocks and trimAl both weakly support this subtree (see confidence values for nodes 1 and 1' in Table 6).The monophyly of Archaeplastida is weakly supported by all approaches, Figure 6 Phylogenetic trees obtained from a non-trimmed character supermatrix (left) and from the concatenation of the multiple sequence alignments trimmed by BMGE with BLOSUM95 (right).These ML trees were inferred by PhyML with the model mtREV+Γ 8 +I.Note that the left topology was also inferred from character supermatrices built by concatenating multiple sequence alignments trimmed by BMGE (BLOSUM30 and BLOSUM62), Gblocks (relaxed), trimAl (strictplus and automated1), and Noisy.Bootstap-based and aLRT-based confidence values at nodes (1), (1'), ( 2) and ( 3) are given in Table 6.

suggesting that this dataset does not induces sufficient phylogenetic signal for this particular node.However, BMGE with BLOSUM95 seems to provide a character trimming that leads to both accurate phylogenetic tree and confidence values, probably due to its ability to best minimize the number of false positive characters (see above and Figure 1).

This re-analysis of a known phylogenomic dataset shows that it is quite difficult to choose an appropriate similarity matrix with BMGE, knowing that, in the one hand, these mitochondrial amino acid sequences are distantly related (see [7]), and that, in the other hand, a less unrealistic tree is inferred when BMGE is used with the stringent BLOSUM95 similarity matrix.To deal with amino acid sequences and BLOSUM matrices, a trivial approach would be to use the sequence clustering rule defined by Henikoff and Henikoff [39]: to compute the BLOSUMh similarity matrix, they grouped amino acid sequences into clusters in such a way that each sequence in any cluster has at least h% identity to at least one other sequence in this cluster.Given an alignment with n amino acid sequences, a systematic procedure to choose the BLOSUM h similarity matrix is then to apply the following formula:
 = = … = − + min max {% , , , , ,..., , ,..., i n j i i n 1 2 1 2 1 1
identity between se equences and i j}.(4)

Unfortunately, this formula underestimates the value of h that best minimizes both the number of false positive and negative characters in our simulated datasets, e.g.h, as estimated by (4), varies from 16 to 71 with an average of 33 with level of divergence ×1, and varies from 10 to 62 with an average of 22 with level of divergence ×3, whereas simulation shows that best results are obtained with h = 95 and 30 for levels of divergence ×1 and ×3, respectively.Indeed, formula (4) on the Castresana (2000) phylogenomic dataset gives h values varying from (ND3) to 64 (CO1), which shows that the amino acid sequences constituting these ten alignments are not closely related, whereas we have shown that the conservative BLOSUM95 similarity matrix is best adapted than BLOSUM62 and BLOSUM30 ones.Moreover, when estimated by (4) from the eight considered character supermatrices, h varies from 52 (concatenated initial multiple sequence alignments) to 60 (concatenated alignments trimmed by Noisy).A similar underestimation of h was also observed by averaging the n max values in formula (4), or by considering only characters with no gaps.Finally, even if formula (4) or related could give a suited approximate of the h value by using the building rules of BLOSUM matrices, it will be even more difficult to provide a similar formula for the PAM-h similarity matrices when dealing with DNA sequences.

Therefore, as maintained by Ewens and Grant (2005) for BLAST queries, we believe that "one often has prior knowledge about the evolutionary distance between the sequences of interest that helps one choose which BLO-SUM matrix to use" (p.244 in [73]).In practice, when inferring a particular gene tree, our opinion is to carefully examine the original multiple sequence alignment and those obtained by BMGE tri

ing with several similarity matrices (i.e.BLOSUM or PAM).On
he contrary, when building a phylogenomic dataset, we believe, as Talavera and Castresana (2007), that "there is enough information from the concatenation of several genes" and then that "stringent conditions tend to give rise to the best phylogenetic trees" (p.575 in [7]): the use of BMGE with stringent similarity matrices (e.g.BLOSUM95 or PAM-1) can strongly increase the number of false negatives (i.e.too many characters suited for phylogenetic inference are removed; see Figure 1) but systematic biases due to this conservative approach are often compensated by a sufficiently large number of genes used.Finally and in every case, we think that it is more relevant to deal with well-defined similarity matrices in order to choose among stringent to relaxed character trimming as in BMGE, rather than having to set several (and subjective) numerical parameters.2) and ( 3) are indicated on the phylogenetic trees in Figure 6.Nodes ( 1) and (1') correspond to the monophyly of Unikonts.Paraphyly and monophyly of Archaeplastida correspond to nodes (2) and (3), respectively.For each column, the best (i.e.higher) confidence value is written in boldface characters.


Simulation results with stationary-based character trimming

We illustrate the impact of compositional heterogeneity and stationary-based character trimming on the accuracy of phylogenetic tree inference with a simple computer simulation.Given the quartet tree in Figure 7, the evolution of a DNA sequence of length 10,000 with 25%-proportion of each nucleotide was simulated by Seq-Gen from the root to the four leaves u, v, x and y.

To infer the evolution of this DNA sequence, the evolutionary model F81 [74] was chosen.On the one hand, Seq-Gen was used with equal relative character state frequencies to generate a compositionally homogeneous region (i.e. from 100% to 50% of the total length, with 10% increments).On the other hand, to generate a compositionally heterogeneous region (i.e. from 0% to 50% of the total length, respectively), DNA evolution was simulated with 80% GC-content for the external branches corresponding to the taxa u and x, and 20% GC-content for the two other external branches (i.e.taxa v and y).For each proportion of characters with heterogeneous composition (i.e.0%, 10%, ..., 50% of the length of the alignment), 500 four-sequence alignments were generated following this procedure.ML trees were inferred with PhyML (model F81) from all these simulated alignments.Stationary-based character trimming was applied with BMGE, and ML trees were inferred f om the resulting compositionally homogeneous alignments.For each of the five proportions (i.e. 0%, 10%, ..., 50%) of unequal GC-content characters, the number of times that the model tree (Figure 1) was correctly inferred from both initial and trimmed alignments is graphically represented in Figure 8.The average proportions of characters removed by the stationary-based trimming are also reported.As expected [24], the model tree is often recovered with the initial alignments when these contain no or a small proportion of compositionally heterogeneous characters (e.g.< 20% of the total length of the initial alignment; see Figure 8), whereas the model tree is much less or not recovered when there is a high overall GC-content in taxa u and x, causing a biased attraction between them [25].Thanks to the stationary-based character trimming, BMGE detects and removes regions that are compositionally heterogeneous; then, as shown in Figure 8, the model tree is very often recovered (e.g. the model tree is correctly inferred from more than 90% of the trimmed alignments), even when the proportion of unequal GCcontent characters across sequences is high (e.g.> 30% of the total length of the initial alignment).

There exist many alternative statistical solutions to compare the character state composition of two (or more) sequences.Each of these has its own strengths and weaknesses (see [24] for instructive survey and discussion).However, matched-pairs tests (such as Stuart's test [30]) are comparati

ly efficient, particularly because of their ability to co
sider aligned sequences on a site-by-site basis [24].Albeit the stationary-based character trimming may be extended by the use of overall tests for marginal symmetry (i.e.assessing the compositional homogeneity in the complete multiple sequence alignment [75,31]), we think that using pairwise p-values allows more precise sorting of the characters c according to their s(c) score value (see above).Moreover, such overall tests are more time consuming.It should also be stressed that the Bowker's [76] test (i.e. another matched-pairs test assessing the complete symmetry inside F) was tried instead of the Stuart's test in th stationary-based trimming, but it led to worse results in the simulation analysis (not shown).Finally, as stressed in [43], Stuart's test is based on ordinary c 2 approximation and is not appropriate for small samples, particularly when the number of categories (i.e. the row number in F) is not small.It is then strongly recommended to use the stationary-based character trimming on a large number of characters (e.g.≥ 1, 000, such as in supermatrices of characters), especially when dealing with amino acid sequences.


Real case study with stationary-based character trimming

In order to illustrate the performance of the stationarybased character trimming, we applied it on the multigene dataset described in [77] (available at [78]), which is known to suffer from a GC-content bias.This phylogenomic dataset was built by concatenating 106 alignments of DNA sequences gathered from 7 Saccharomyces species and Candida albica s as outgroup (see [77] for more Figure 7 Phylogenetic tree used to simulate the non-stationary evolution of a DNA sequence.This tree and the different branch lengths are closely related to [24].details).The so-obtained supermatrix contains 127,026 nucleotide characters.

By using on this character supermatrix the same methods and software as Phillips, Delsuc and Penny (2004; see [25] for more details), we retrieved the same phylogenetic tree (left-hand tree in Figure 9) by minimizing the Minimum Evolution (ME) criterion with both GTR [79][80][81] and LogDet [82][83][84] distance estimates.We also retrieved the same ME bootstrap-based confidence values as in [25], i.e. 100% bootstrap proportion at all branches.Nevertheless, as shown by numerous phylogenetic analyses of the same dataset (e.g.[77,25,[85][86][87][88][89]), this tree is incorrect: the real evolutionary history of these 8 yeast taxa is in fact the right-hand tree in Figure 9, with no monophyletic relationship between S. kudriavzevii and S. bayanus.It was shown in [25] that this systematic bias is due to a GCcontent compositional heterogeneity across sequences that is sufficiently important to mislead the ME c

terion.

A s
bset of 114,105 characters was selected by stationary-based character trimming implemented in BMGE.These so-selected characters were used to infer ME trees following the same methods as described previously.As expected, the right-hand tree in Figure 9 was inferred from both GTR and LogDet distance estimates with 100% bootstrap-based confidence value at each branch.More precisely, pairwise Stuart's test p-values are all ≈ 0 in the initial character supermatrix (with the slight exception of the two sequence pairs S. cerevisiae -S.paradoxus and S. cerevisiae -S.mikatae with p-values of 0.0146 and 0.0868, respectively).After the stationarybased trimming performed by BMGE, all Saccharomyces sequences induce pairwise Stuart's test p-values > 0.85, the lowest p-values being induced by the outgroup species (i.e.varying from 0.1000 to 0.3675 for the sequence pairs C. albicans -S.kluyveri and C. albicans -S.ca tellii, respectively).This shows that by removing ≈10% characters, the stationary-based trimming is able to select a subset of compositionally homogeneous


Conclusions

There exists a real need for accurate bioinformatic tools to extract at best the phylogenetic

nformation contained in the ev
r-growing amount of available genomic sequence data.BMGE allows ccurate character trimming of multiple alignments of DNA, codon, or amino acid sequences based on entropy-like scores weighted with BLOSUM or PAM matrices.Thus, BMGE is able to identify and extract unambiguously aligned blocks of ch racters that contain biologically expected variability and are therefore suitable for phylogenetic analysis.Simulation studies show that the trimmed datasets returned by BM