Three large datasets have been used: (i) 13 proteins encoded in mitochondria from 336 metazoan species divided into 15 clades (Additional file 2 : Table S1), named mt336 dataset, (ii) 111 nucleus encoded proteins from 80 metazoan species divided into 5 clades (Additional file 2 : Table S2), named nuc80 dataset, and (iii) 13 mitochondrion encoded proteins from 68 species (66 Metazoa and 2 Choanoflagellata) grouped into 5 clades (Additional file 2 : Table S3), named mt68 dataset. All sequences have been downloaded from the GenBank database. For each dataset, proteins have been aligned by ClustalW , manually refined using ED , and then concatenated into a super-matrix using SCaFoS . Orthology of nuclear proteins was verified using the congruence approach described in . Ambiguously aligned positions have been removed using Gblocks , with some manual refinements (necessary because of an inadequate stringency of Gblocks in the presence of missing data). Since we are not interested in constant or quasi-constant positions, which obviously have the same evolutionary properties in all clades, only the parsimony informative positions have been retained, resulting in 1,851, 12,608 and 1,927 positions (from originally 2,547, 22,082 and 2,382 unambiguously aligned positions) for mt336, nuc80 and mt68 datasets, respectively, which allows us to reduce computational costs. The species were selected in order to obtain the most homogeneous taxonomic diversity, that is, monophyletic groups of a similar size (i.e. similar tree length) represented by about twenty species. Two groups have more species (Actinopterygii and Primates) with respect to the tree length criterion.
Topologies were inferred by maximum likelihood separately for each monophyletic group under a WAG  or a mtREV  model with four gamma discrete categories using Treefinder , for the nuc80 and the mitochondrial datasets respectively. As these topologies are biologically reasonable (see Additional file 2 : Table S4), all subsequent analyses were performed under these fixed topologies in order to reduce the CPU burden. We verified, in the case of the mt336 dataset, that the same results were obtained under free topology (data not shown).
A scheme of the protocol, described only for two clades for clarity, is shown in the Additional file 1 : figure S4. For each clade, the CAT model  implemented in the program Phylobayes inferred substitution profiles. However, comparing the profile affiliation across groups is not straightforward since (1) the CAT model infers profiles independently from each clade resulting in different sets of profiles, (2) the number and nature of profiles varies during the MCMC, and (3) different profiles can be affiliated to a given site during the Monte Carlo Markov Chain (MCMC). To achieve our comparison between clades, we need identical profiles in the different runs, and therefore need to define a set of common profiles to which sites can be affiliated whatever the clade. In a first step, the CAT model freely inferred the profiles separately for each clade under a fixed topology; the phylobayes program performed a total of 10,000 cycles, the 1,000 first cycles being discarded as "burn-in". Profiles and their affiliation to positions are repeatedly updated during the MCMC, so different profiles, which are themselves potentially unstable, can be assigned to a same position; we focus on profiles that are the most stable. Stable profiles were identified according to the protocol described in  with a threshold value of 0.035 for quadratic distance and a threshold value of 4 for the minimum of profile affiliation to site number. Only profiles that are present >50% of draws from the posterior were retained. Among the stable profiles identified in various clades, some are generally highly similar and need to be further clustered to avoid redundancy. For each pair of profiles, the quadratic distance over the twenty amino acid frequencies was calculated to compute a distance matrix as an input for clustering using UPGMA as implemented in NEIGHBOR . A threshold on the quadratic distance was chosen in order to obtain about 25 clusters (Additional file 1 : Figure S1-3), a number of profiles known to provide the greatest step in model fit improvement . Within each cluster, a common profile was defined as the average over the twenty equilibrium frequencies weighted by the affiliation frequency of each initial profile included in the cluster. Twenty-six, twenty-four and twenty-four common profiles were obtained for the large, the small mitochondria encoded proteins and the nuc80 datasets, respectively.
To compare the profile affiliation in different monophyletic groups, phylobayes was re-run with the set of common profiles for each clade separately (CAT model, fixed topology, 1,100 cycles, removing of 100 first cycles). Under these conditions, only the profile affiliations, branch lengths and site-specific rates were free parameters. This allowed to compute pik(c), the posterior probability of affiliation of the profile k to site i for clade c. An affiliation was considered stable if k exists such that pik(c)>0.75.
Two criteria have been defined to test the homogeneity of the evolutionary process over time. Homogeneity implies that a given site is affiliated to the same profile in all clades, apart from stochastic fluctuations. For pairwise clade comparison, the Frequency of Different Profiles (FDP) is a global criterion over all the positions for which a profile is stably allocated in the two alignments. The FDP criterion is defined by:
where ndif is the number of positions with two different profiles in the two clades, and nid is the number of positions with two identical profiles. For a threshold of 75% of stability across the MCMC, only 28% and 11% of sites are on average stably affiliated, for the nuc80 dataset and the mt336 dataset respectively. This statistic cannot be extended for comparing all clades simultaneously, since only 1% of the sites are always stably affiliated for the 15 clades of the later dataset. Moreover, the FDP criterion does not give information at the site level.
For the simultaneous comparison of n clades, the Probability of Identical Profiles (PIPn
) is calculated site by site without any affiliation stability conditions. It is defined for a given site i by:
where K is the number of profiles and n the number of clades. This criterion will take a high value when the site shares the same profiles in different clades. In contrast, a small PIPn corresponds to a site that has different evolutionary profiles in the taxonomic groups under consideration. Indeed, even a site with unstable affiliations (i.e. affected to a different set of profiles within a clade) can be compared and shows a PIPn value close to zero. For instance, the site can belong to various categories containing hydrophobic profiles in one clade and to various categories containing charged amino acid profiles in other clades. For computational reasons, we have limited the phylobayes runs to 1,100 cycles. This choice increases the number of sites with a PIPn value equal to 0: a low frequency of affiliation for a given profile (e.g. 10-5) would artificially be estimated at 0 in the posterior distribution. The more cycles performed, the less sites show a zero PIPn (Additional file 2 : Table S5). However, precision of PIPn estimated with 1,000 points is sufficient for the aim of this work (see R2 = 0.999 for comparison between 1000 and 5000 points on Additional file 1 : Figure S8). As PIPn values are small, in the subsequent analysis, we will use -ln(PIPn), except for sites with a PIPn value of zero. Because of this latter constraint, results are presented by binning sites into four classes of equal size plus one class for PIPn = 0 (Tables 1, 2, 3, 4, 5 and 6, and Additional file 1 : Figure S7).
Evaluation of the protocol
The statistical significance was evaluated by comparing results obtained from real and simulated data. With phylobayes, we performed simulations for each dataset according to the posterior predictive principle, i.e. we took 10 points from the posterior distribution obtained with the real complete dataset (i.e. all species simultaneously) under the CAT+Γ4 model (burn-in discarded) and simulated 10 new sequence alignments according to the parameter values of each point, in particular the profiles (for more details see [24, 31]). Using the same sets of predefined profiles (obtained from real data), the previously described protocol was used to calculate the FDP and the PIPn for each replicate.
Second, we tested that our results were robust vis-à-vis various aspects of our protocol. (i) The use of different stability thresholds yielded virtually identical results in the FDP analysis (Additional file 1 : Figure S6B). (ii) To test that low PIPn values were due to unstable profile affiliations related to an insufficient number of taxa, we randomly removed half and three quarter of the species in each clade of the large mitochondrial dataset and recomputed PIPn. As expected, the less species considered, the less profile affiliations were stable (data not shown), because the phylogenetic signal became insufficient. However, this instability led to a sharp decrease of PIPn values equal to 0 (Additional file 1 : Figure S5A), indicating that instability was not responsible for low PIPn values. (iii) Three additional sets of profiles were used in the case of mt336 alignment: 45 profiles defined using a different threshold on the UPGMA tree, the 25 stable profiles obtained from the analysis of the complete alignment, and the 20 profiles obtained by Le et al. . Similar results were obtained for both PIPn (Additional file 1 : Figure S5B) and FDP (Additional file 1 : Figure S6A) criteria.
Third, we performed a cross-validation test to evaluate the fit of different models on the various datasets: CAT, GTR and WAG/mtREV models were compared using cross-validation as described in Lartillot et al. . An alignment was randomly split in two slices: one tenth for use as a test dataset, and nine-tenths for use as a "training", or "learning" dataset. The parameters were estimated on the learning sets for each model (fixed topology; 21,000 and 11,000 cycles, the first 11,000 and 1,000 cycles discarded, for CAT and others models, respectively) and used to calculate the cross-validation log-likelihood scores of the test sets. Scores were averaged over 10 replicates.
Influence of profile change on phylogenetic inference
Profiles are representative of the functional constraints acting on a given site in a given clade; if a change of profile occurred in the common ancestor of two clades, the same substitutional profile should be shared by the two sister clades. Hence this can be viewed as a synapomorphy. In other words, the variation of substitution profiles across clades may contain a phylogenetic signal (or noise if the same profile has been independently acquired). A simple recoding approach might capture this putative phylogenetic signal. For reasons of compatibility with available inference tools, each profile is encoded as a one-letter amino acid, therefore only the twenty most frequent profiles have been conserved for this analysis. More precisely, a new sequence is created for each clade according to the following rule: the site is encoded as an amino acid when profile affiliation is stable, and by a question mark otherwise. Under these conditions, the percent of un-encoded sites in the alignments is 59% and 54% for the mt336 and the nuc80 dataset, respectively. It is difficult to know which model of sequence evolution should be used on this artificial alignment. Since we do not know a priori which exchangeability rate between profiles should be applied, the resulting file is analyzed with a GTR+Γ4 model to infer a phylogenetic tree using phylobayes. To test the effect of the model, we also made inferences with the CAT+Γ4 model. To verify the significance of the results, we performed the same analysis with the 10 simulated alignments obtained by using a posterior predictive approach for the mtp336 dataset, as described above.
Progressive removal of heteropecillous sites
The mt68 dataset was used to evaluate the potential misleading effect of the detected model violations on phylogenetic inference. More precisely, we made the hypothesis that the observed grouping of Cnidaria and Porifera to the exclusion of Bilateria was due to a long branch attraction artifact. We followed the same approach as for heterotachous positions [71, 72], by removing the most heterogeneous sites, as estimated by PIPn. At each step, we removed ~10% of the positions and stopped when 1,039 positions remained in order to keep a sufficient amount of phylogenetic signal. The five steps corresponded to the exclusion of sites with PIPn = 0, and -ln(PIPn) higher than 12, 8, 6, and 4.5, respectively. Phylogenetic trees were inferred from these reduced dataset with the CAT+Γ4 model using phylobayes. The reduced datasets were also analyzed with GTR+Γ4 and mtREV+Γ4 models using RAxML , the robustness was evaluated with 100 bootstrap replicates.
To compare the qualitative heterogeneity studied here (heteropecilly) with the quantitative rate heterogeneity (heterotachy) over time, we looked for heterotachous positions. The number of substitutions per position and per clade was calculated by phylobayes under the CAT+Γ4 model. Subsequently, heterotachous positions were identified by the test of Lopez et al.  (the improved test of Baele et al.  is not implemented for amino acid sequences). Eventually, the coefficient of correlation between heterotachy p-values and PIPn values over sites was computed.
Biochemical constraint estimation
We want to know whether the time-variation in profiles corresponds to change in physico-chemical properties of the amino acids involved in the profiles. Profiles were classified into five groups with similar physico-chemical properties (small, aliphatic, aromatic, charged, other) according to the properties of the two amino acids with the highest equilibrium frequencies in the profile. Only sites stably affiliated (threshold = 75%) to two different profiles in clade pairwise comparison were considered. The numbers of sites for which the two profiles were in the same physico-chemical group were counted over all pairwise comparisons.
Finally, we looked for a correlation between heteropecilly and variations in biochemical constraints over time. To do that, we computed a site-specific criterion of hydrophobic variation. For each profile k, a Hydrophobic Score (HS) was computed by summing the hydrophobicity of the twenty amino acids according to Kyte and Doolittle [75
] weighted by the equilibrium frequency of the amino acid aj
in the profile:
) is the hydrophobicity of the amino acid aj
(k) its equilibrium frequency in the profile k.
Then, for each clade c and site i, the sitewise Profile Hydrophobic Score (PHS) was calculated by weighting the HS score of each profile k with its affiliation frequency:
To estimate the existence of a hydropathy change over time, the standard deviation of PHS(c,i) across all clades was calculated.