To obtain a representative set of protein alignments with enough taxa and sites to test for departures from the empirical JTT amino acid substitution model, we took the 7459 seed alignments from Pfam-A database (release 14.0) and filtered it using two criteria. First, only alignments with > 30 sequences were considered and submitted to the Gblocks program  to automatically trim regions of ambiguous alignment with a minimum block size set to 5 and maximum number of contiguous nonconserved positions of 16. From this set of trimmed alignments only data sets with > 200 positions were considered, yielding a final set of only 12 alignments. To these 12 alignments, 9 alignments of proteins used for phylogenetic studies in our laboratory were added that met the requirement of > 30 taxa and > 200 sites after trimming with Gblocks (using the same settings as above). The 21 protein families examined are indicated in Table 1 and include proteins with functional roles ranging from components of the cytoskeleton (e.g. tubulins and actin), to globular enzymes (e.g. CTP-synthetase) to viral coat proteins (e.g. Poty_coat).
For each of the 21 data sets, a phylogenetic tree was estimated under the JTT + F + Γ model with 8 gamma rate categories using Tree-Puzzle (version 5.2). The resulting trees were used to simulate amino acid sequences of 100, 000 sites under JTT + F + Γ using Seq-Gen .
Method 1 – Amino acid uniformity and uniformity test: We utilized an information theoretical notation of relative entropy (r
), also called the Kullback-Leibler divergence [35
], to measure the amino acid uniformity at sites. It is defined as:
is the frequency of amino acid i at a given site. A site with all 20 AA's having equal frequencies (Pi = 0.05) would have an r = 0; a perfectly conserved site would result in the maximum possible r = log20 = 4.32 bits; all other sites would have an r between 0 and 4.32 bits.
The uniformity test asks whether the amino acid frequencies at sites in real data have the same uniformity as those in data simulated under the JTT + F + Γ model. An
averaged over all sites is calculated respectively for a real data set (
) and for a corresponding data set simulated under JTT + F + Γ (
). The simulated data have 100,000 sites, so the standard error of
is effectively 0 and therefore can be ignored allowing a simple z-test. The test statistic for the uniformity test is a z
-score defined as
where Sreal is the standard deviation of r
and n is number of sites in the real data.
Furthermore, the sites of the real and simulated data were divided into four rate categories and Z-tests were conducted on each rate category separately.
Method 2 – State frequency test: comparing the number of states at a site in real data and in simulated data. The test statistics is defined as
where Oy is the number of sites showing y distinct character states observed in real data and ey is expected number of sites showing y distinct character states in simulated data under JTT + F + Γ. The c2 has an approximate χ2 distribution with 19 degrees of freedom.
Simulations of four-taxon trees under a site-specific frequency model
We sought to evaluate the potential impact of site-specific amino acid frequencies on ML-based phylogenetic inference with empirical amino acid substitution models and overall data set frequencies. To do this, site-specific amino acid frequencies were calculated from the HSP90 protein family and these frequencies were used to simulate data sets of four sequences using covTREE (morticia.cs.dal.ca/lab_public/?Download:covTREE), a C++ adaptation of Seq-Gen. The simulation settings were as follows.
Following the studies of Huelsenbeck  and Wang et al. , we evaluated tree reconstruction efficiency, over a grid of branch-lengths a and b, for trees of the form of ((1:a,2:b),(3:a,4:b):b). In the figures, the grids of branch-lengths have a on the x-axis and b on the y-axis with an increment of 0.1 for both a and b, within a range of 0.05 to 2.95 substitutions per site. For each branch-length setting implied by a given element of the grid, 100 simulated data sets were generated. For evaluation of the class frequency model described below, similar grids were calculated but with finer increments of 0.05, and focussing on the 'Felsenstein zone' region. In this case the range of a was 0.05–1.45 and the range of b was 0.5–2.95.
Principal components analysis of amino acid frequency at sites
An amino acid frequency composition vector was calculated for every site in each of the 21 protein data sets and assembled into a matrix of 6555 sites by 20 amino acid frequencies. To investigate whether there were any recurring patterns in these frequency vectors, principal component analysis was performed using the R package . The four site classes, given in Figure 3, were determined as follows. An initial clustering divided sites into three classes based of the first principal component (less than -0.04, greater than 0.04 on the x-axis, or between these bounds). For each cluster, linear regression applied to the first two principal components gave the three lines in Figure 3. Sites were then classified to whichever line they were least distant from. To reduce the risk of misclassification, sites with first principal component between -0.1 and 0.05 and second principal component between -0.25 and 0.25 were excluded from class frequency calculation associated with the linearly determined classes; this gave a fourth class. The aggregate amino acid frequencies of these four classes (i.e., the site frequency profiles) were calculated and used in the class frequency model (see below).
A class frequency (cF) mixture model
We proposed a class frequency mixture model under which the likelihood of a sequence site is a weighted sum of the site likelihood conditional on each of the class frequencies found from the PCA analysis. In order to take account of frequency distributions not modelled in the PCA study of the limited data (the 21 data sets), the cF mixture model was further added with a fifth class that corresponds to the average amino acid frequency of the whole data set. The cF model can further be combined with a Γ model to take account of the rates-across-sites variation and the site likelihood under a JTT + cF + Γ is given in the following equation.
are data at site i, the wc's are the probabilities (i.e., weights) of the class frequencies, including the whole data set frequency as one class, rk is the rate of a Gamma distribution discretized into one of the g categories with equal probabilities.
In this likelihood calculation the usual JTT + F + Γ model is a special case of the JTT + cF + Γ mixture model when the probability of each of the class frequency profiles is 0 and the probability of the whole data set frequency profile is 1. Therefore, a likelihood ratio test (LRT) may be used to compare the models, where the test statistics is twice the difference in log-likelihoods of the alignment under the two models and a P-value can be calculated from a χ2 distribution with 4 degrees of freedom. However, since the parameters (i.e., the weights of the cFs) are on the boundary of the parameter space, a simple χ2 approximation does not hold. The real distribution of the test statistics follows a mixture of χ2 distributions [39, 40] and the P-value is even smaller. If this were the only complication, then the P-values reported using the χ2 distributions would be conservative estimates.
However, an additional complication in calculating degrees of freedom arises because the proteins for which a comparison between the JTT + F + Γ and JTT + cF + Γ models was desired were sometimes also used in constructing the class frequencies, although this is not true for the four additional alignments at the bottom of Table 2. An extremely conservative estimate in these cases would adjust the degrees of freedom upwards by 19 × 4 = 76. In this case, a difference in log likelihoods for the two models would be declared significant at the 5% level if it exceeds 51. Table 2 indicates that this is the case for all but two of the 21 proteins used in constructing the class frequencies. This adjustment, however, greatly understates the significance of the differences. For a given alignment, an additional 76 degrees of freedom would be appropriate if the class frequencies were chosen to give the largest likelihood for that alignment. Not only are the class frequencies not chosen to give largest likelihoods but they are based on 21 different alignments, making it unlikely that any one alignment would have a substantial influence on them.
The cF mixture model was implemented in a maximum likelihood framework for phylogenetic inference, by modifying the source code of RAxML-VI-HPC version 2.2.3  to produce the 'Q matrix mixture RAxML' or QmmRAxML for short. As the weights for the class frequency profiles are not known, an Expectation-Maximization algorithm [41, 42] (described in the following subsection) was used to optimize the weights from an initial set of equal weights. In addition to modelling a mixture of site class-frequency profiles as discussed in this paper, QmmRAxML may also be used to model any mixture of amino acid substitution matrices, such as those based on protein secondary structures and solvent accessibilities at sites.
Parameter optimization and the Expectation-Maximization algorithm
In QmmRAxML we use an alternating scheme to optimize parameters in the class frequency mixture model, including branch-lengths, the among-sites rate variation parameter (α) and the weights (wc) of class frequency vectors. First the program has the branch-lengths and α optimized with routines in the original RAxML for an initial set of wc, which set all weights equal. Then it uses an EM algorithm to optimize wc for the current branch-lengths and α. Then it optimizes branch-lengths and α again and followed by updating wc with another round of an EM. These processes repeat until a maximum likelihood is reached for the current topology.
In updating the wc's each round of EM itself alternates between performing an expectation (E) step, which computes a conditional expectation, conditional upon the data, of the complete likelihood by including the latent variables (wc) as if they were known, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated. Specifically the EM updating scheme for the wc's is as follows.
be the likelihood for the i
th site fixing the c
th class frequency vector and Li
be the overall likelihood at the current weight parameters for the i
th site. At the j
where k = 5 is the number of class frequency vectors plus the average frequency vector of the whole data set.
Then the updating scheme is
where n is the number of the sites. The program continues updating wc according to (*) and (**) until they converge.