The role of laterally transferred genes in adaptive evolution

Background Bacterial genomes develop new mechanisms to tide them over the imposing conditions they encounter during the course of their evolution. Acquisition of new genes by lateral gene transfer may be one of the dominant ways of adaptation in bacterial genome evolution. Lateral gene transfer provides the bacterial genome with a new set of genes that help it to explore and adapt to new ecological niches. Methods A maximum likelihood analysis was done on the five sequenced corynebacterial genomes to model the rates of gene insertions/deletions at various depths of the phylogeny. Results The study shows that most of the laterally acquired genes are transient and the inferred rates of gene movement are higher on the external branches of the phylogeny and decrease as the phylogenetic depth increases. The newly acquired genes are under relaxed selection and evolve faster than their older counterparts. Analysis of some of the functionally characterised LGTs in each species has indicated that they may have a possible adaptive role. Conclusion The five Corynebacterial genomes sequenced to date have evolved by acquiring between 8 – 14% of their genomes by LGT and some of these genes may have a role in adaptation.


Background
Vertebrate genomes are now recognized as containing a huge number of non-coding functional regions, a large fraction of which is likely to be involved in regulating the various steps of gene expression [1][2][3][4].
While most of the attention has been centered on understanding the regulation of transcription, post-transcriptional regulatory mechanisms now appear to be more important than originally thought.
Cis-regulation of pre-mRNA splicing is believed to be operated by splicing factors binding intronic and exonic splicing enhancers and helping to include or exclude specific exons from the transcript [5,6].
Post-splicing, parts of the mature mRNA often folds into some RNA secondary structure that determines the level of mRNA degradation [7] as well as mRNA localization [8]. Translational efficiency and accuracy have been shown to be largely determined by the choice of synonymous codon, thus imposing some selective pressure on the codons of certain genes [9]. Translation is also known to be affected by certain secondary structure elements in the mRNA [10]. While most of the known examples of formation of functional secondary structure are restricted to the 5' and 3' UTRs, the coding portion of the mRNA has also been shown to form functional structures [11]. Finally, there are also examples of transcription factor binding sites located in coding exons (e.g. in CD28 [12]). The method presented here should allow the detection of many of these functional elements, which we call coding regions under non-coding selection (CRUNCS).
To this point, the computational methods that have proven the most valuable for identifying non-coding functional regions are based on comparative genomics. The guiding principle of this family of approaches is that functional features of a DNA sequence tend to evolve slower than non-functional ones, because of selective pressure. This simple idea is at the core of phylogenetic footprinting, a method that compares orthologous regulatory DNA regions to identify short conserved motifs likely to be transcription factor binding sites [13,14]. The key here is that most of the DNA in promoter regions is non-functional, with the exception of the regulatory elements we are interested in. The same reasoning applies to the detection of intronic splicing enhancers [15]. With the ongoing sequencing of a large number of vertebrate genomes [16], the power of these methods is quickly improving and, coupled with algorithmic improvements [17], they are now able to detect very short regions under selection, or regions under weak selection.
The search for CRUNCS is more challenging. Although the same "conservation implies function" principle applies in this case, it needs to be used more cautiously. Indeed, CRUNCS are not located in non-functional sequences as are, for example, most known transcription factors binding sites, but rather in coding regions. This means that the sequence conservation observed in exons may be the result of two types of selective pressures. The first one is the pressure to maintain the function of the protein encoded by the gene, which probably explains most of the sequence conservation observed in coding regions. The second type of selective pressure applies only to CRUNCS, which are required to maintain their regulatory role. To apply phylogenetic footprinting to the detection of CRUNCS, one needs to determine which type of selective pressure is responsible for the sequence conservation observed.
The method suggested here takes a conservative approach to the problem. Given a set of aligned orthologous coding sequences, we first evaluate the degree of conservation of each column of the alignment, using either a parsimony score or an entropy score. We then put the burden of explaining the conservation observed as much as possible on the shoulders of the selective pressure on the protein product. Because most amino acids are encoded by many synonymous codons, amino acid selective pressure leaves room for some sequence variation. A region of the sequence will be predicted to be an CRUNCS only if the conservation observed cannot be explained solely by the need for conservation of the encoded amino acids.
The method introduced here build a mixture model of codon evolution, and then uses it as a null model to assess the significance of the observed degree of conservation. We illustrate our approach on two sets of orthologous vertebrate genes (growth hormone 1 and CORTBP2) and compare it to a related approach by Blanchette [18].

Results and Discussion
Given a multiple alignment of orthologous mRNA sequences, our goal is to identify alignment columns that are conserved beyond what would be expected by chance if the corresponding sites were evolving only under the selective pressure on the amino acid they contribute to encode. Such sites are likely to be under non-coding selective pressure. This section, which constitutes the main contribution of this paper, is structured as follows. First, we define two commonly used sequence conservation scoring methods: the entropy, and the parsimony score. We then describe a methods for assigning a p-value to a given entropy or parsimony score, under null models of evolution of codons that are only under coding selective pressure.
Under this method, we model codon evolution as a mixture of codon substitution models and use these models to assign a posterior p-value to a given conservation score.

Two measures of sequence conservation
A number of methods have been proposed to measure the degree of conservation of a set of orthologous sequences and to identify regions under selective pressure (see [19] for an evaluation of some of these methods for finding regulatory elements in intergenic regions). In this paper, we consider two such methods, the entropy score and the parsimony score, and later show how to assess the statistical significance of these scores in the context of coding regions.

Entropy
In the area of transcription factor binding sites detection, a popular method for evaluating sequence conservation is the entropy (see, for example, [20]). Given a set of orthologous nucleotides x 1 , x 2 , ..., x n from n different species, the entropy measures the distance between the observed frequency of A, C, G, and T's at that site and the uniform distribution: f α is the relative frequency of nucleotide α. Perfectly conserved sites have an entropy of zero, while the worst levels of conservation obtain a score of 2.

Parsimony score
A major drawback of the entropy score is that it does not take into consideration the phylogenetic relationships among the sequences being compared, and indeed the method is mostly used for motif discovery within a single species. An alternative to the entropy score is the parsimony score [21]. Given a set of orthologous nucleotides x 1 , x 2 , ..., x n and a phylogenetic tree T whose leaves are labeled with these nucleotides, the parsimony score is defined as the minimum number of substitutions, performed along the branches of the tree T , that can explain the set of nucleotides observed at the leaves. It is thus a lower bound on the actual number of substitutions at that site, and, in cases where this number is not too large compared to the number of branches in the tree, it is a fairly good estimate of the actual number of substitutions that occurred at that site. The parsimony score of a set of n nucleotides can be computed in time O(n) using Sankoff's algorithm [22] or Fitch's algorithm [21]. Figure 1 gives an example of orthologous nucleotides whose conservation is well characterized by either the entropy or the parsimony scores.
Entropy and parsimony scores attempt to measure the total selective pressure on a given site. While, for non-coding regions, this selective pressure can be assumed to come completely from the presence of non-coding functional elements, this is not the case in coding regions. The rest of this paper describes how to measure the statistical significance of a certain conservation (entropy or parsimony) score when the site under study lies within a coding region.

Conservation p-values under a mixture of codon models
In this section, we introduce a null model of coding sequence evolution that consists of a mixture of codon substitution models representing the evolution of codons that are under different types of purely coding selective pressure. We then show how to compute posterior p-values for the entropy and parsimony scores of orthologous sequences evolving under those models.

Mixture models for codon evolution
Different positions in a protein sequence are usually subject to different types of coding selective pressures.
Some are constrained to have a specific amino acid (e.g. the active site in a zinc finger has to be a cystein), while others are free to have any residues with some particular chemical properties (say, a hydrophobic residue), and still others are under little or no selective pressure at all. Selective pressure on amino acids translates into selective pressure on codons, which explains part of the sequence conservation observed at the mRNA level in coding exons. We describe a mixture model of amino acid evolution, and the corresponding mixture model of codon evolution. We derive a set of 50 amino acid substitution rate matrices Q a 1 , ..., Q a 50 , together with codon rate matrices Q c 1 , ..., Q c 50 , describing the substitution rates for as many classes of selective pressures on amino acids. The choice of modeling codon evolution with only 50 classes is a compromise between an highly accurate modeling of codon evolution (which would most likely require a larger number of classes [23]) and constraints on computing time. Tests carried out with 100 classes instead of 50 resulted in very similar results (data not shown).
We learn amino acid functional categories using the Pfam database of amino acid sequence alignments of protein domains [24]. We start by learning the amino acid stationary distribution of each rate category.
We then use these distributions to classify the Pfam alignment columns, and use this classification to estimate the amino acid and codon substitution rate matrices for each class.
The stationary amino acid distributions π 1 , π 2 , ..., π 50 and class prior probability distribution τ are estimated using an unsupervised Expectation-Maximization algorithm (see Methods) in order to fit the Pfam alignments as closely as possible. Figure 2 (top) shows the amino acid distribution obtained for each of the 50 classes (see also Supplementary data). We observe that most of the expected functional classes are present among our 50 classes. First, for each amino acid, there is at least one category where that amino acid has a probability at least 0.5. These correspond to positions that tolerate little variation outside of that particular amino acid. Less constrained categories include several combinations of residues with similar properties: hydrophobic residues (ILV), small neutral residues (AST), aromatic residues (YF), positively charged residues (KR), etc. The class of negatively charged amino acids (DE) is surprisingly not directly represented, although these two amino acids show up together in several more weakly defined classes. Various categories correspond to positions under little selective pressure. Many of our classes are similar to those reported by Sjolander et al. [25] using a related procedure.

Amino acid and codon substitution rate matrices
For each of the 50 classes above, an amino acid substitution rate matrix and a codon substitution rate matrix are derived. We first compute the probability of each Pfam alignment column to belong to each of the classes, and use these to estimate the probability of amino acid and codon transitions between human and mouse sequences. Rate matrices are then derived from these empirically estimated transition probabilities matrices. The detailed procedure is described in Methods.
Figure 2 (bottom) shows two of the codon rate matrices obtained. Note how mutations toward codons encoding favored amino acids occur at a high rate, whereas substitutions away from those are rare. Notice that these rate matrices automatically take into account codon biases, as they are built from mRNA sequences. We make the assumption that the codon biases of human and mouse, which are reflected in our rate matrices, are representative of the codon biases in the other species used in our analysis. This assumption appears to hold quite well within mammals and birds, and to a lesser extent within vertebrates in general [26]. However, we recognize the fact that some genes (e.g. those requiring a high rate of translation) may have codon biases that are stronger than the average, resulting in an unexpected degree of conservation. Still, since this type of selective pressure would most likely apply to the entire transcript, it would be easily detectable using our approach, and would not result in false identification of other types of non-coding elements.

Distribution of conservation scores
We now return to the problem of identifying regions under non-coding selective pressure in a multiple alignment of orthologous coding mRNA sequences X = X 1 ...X m , where X i is the triplet of alignment columns corresponding to the i-th codon in the alignment. Let X i (j) be the codon observed in species j ∈ {1...s}, where s is the number of column triplets in the alignment, and let X i,p (j) be the nucleotide at position p (p = 1, 2 or 3) in that codon. Given a column of orthologous codons X i , we want to assess whether the conservation observed at position p of the codon is unexpected. To this end, we compute the posterior p-value of the observed conservation score (entropy or parsimony score) of that codon position, under the null hypothesis that the columns are only under coding selective pressure.
To describe more formally our null model of sequence evolution, we need to introduce some notation. Let T = (V, E) be a binary phylogenetic tree with vertices V , edges E, root r, and with leaves numbered 1,2,...,n. Let λ(u, v) be the length of the branch going from node u to node v, let a(c) be the amino acid encoded by codon c, and let Q be some codon substitution rate matrix. The codon transition probability matrix for a branch (u, v) is given by P (u,v) = e λ(u,v)Q [27]. Let b(c) be the background probability of codon c, which is assumed to be the stationary distribution of Q. These three parameters (T , λ, Q) describe a process that generates random but related codons at the leaves of the tree T , by drawing a codon from the stationary distribution of Q at the root of T and letting it mutate along the branches of the tree using the appropriate transition probability matrices. Let C(u) be the random variable representing the codon that has been generated by this process at node u of the tree.
We are interested in computing the distribution of the conservation score of a given position p of a set of random orthologous codons generated at the leaves of the tree. We start by showing how to compute this distribution for the entropy score entropy(C p (1), C p (2), ..., C p (n)), and later show the modifications required to do the same for the parsimony score. For a fixed codon position p ∈ {1, 2, 3}, and for any node is the number of nucleotides of type α at position p of the codons at the leaves of the subtree rooted at u. Notice that Y u is only a function of the codons at the leaves of subtree(u), and not of those at the internal nodes of subtree(u). The p-value of a certain entropy score e for position p is obtained by summing the probabilities of all values of Y r that yield an entropy score e or better: We will show how to compute Pr[Y u = (y a , y c , y g , y t )|C(u) = κ], for every node u and all choices of y a , y c , y g , y t , and κ, using a dynamic programming algorithm visiting the nodes of T in post-order. When u is a leaf, we have 1 if y a = 1, y c = y g = y t = 0 and κ(p) = A 1 if y c = 1, y a = y g = y t = 0 and κ(p) = C 1 if y g = 1, y a = y c = y t = 0 and κ(p) = G 1 if y t = 1, y a = y c = y g = 0 and κ(p) = T 0 otherwise Define (y a , y c , y g , y t ) ⊕ (z a , z c , z g , z t ) := (y a + z a , y c + z c , y g + z g , y t + z t ). Now, let u be an internal node We compute the desired conditional probabilities at node u based on those at nodes v and w: Implementation optimizations and computational complexity analysis are given in Methods.

Distribution of parsimony scores
The method described in the previous section can be surprisingly easily modified to compute the conditional p-value of parsimony scores instead of that of the entropy score. We need to redefine the random variable Y u = (y a , y c , y g , y t ) so that y α is now the parsimony score obtained for the nucleotides at position p of codons at the leaves of the subtree rooted at u, assuming that the nucleotide at the ancestral node u is required to be α. Note that this set of four numbers per node is exactly that computed by the Sankoff algorithm for computing parsimony scores [22]. We also redefine the ⊕ operator as (y a , y c , y g , y t ) ⊕ (z a , z c , z g , z t ) = min(y a + z a , y a + zā + 1, yā + z a + 1, yā + zā + 2), min(y c + z c , y c + zc + 1, yc + z c + 1, yc + zc + 2), min(y g + z g , y g + zḡ + 1, yḡ + z g + 1, yḡ + zḡ + 2), min(y t + z t , y t + zt + 1, yt + z t + 1, yt + zt + 2) where yī = min j =i y j . Notice that this is again in direct analogy to Sankoff's algorithm. Again, .., C p (n)) = min(Y r ). With these redefinitions, Equation 2 holds without any modifications needed. We get

Posterior distributions of conservation scores
Having shown how to compute the p-value of a given entropy or parsimony score under a fixed codon rate matrix, it is simple to compute posterior p-values for the case where the functional class is not known in advance. Consider a given set of aligned codons X i = (X i (1), ..., X i (n)) encoding the set of amino acids . Define the unobserved variables Z i,k = 1 if the site i belongs to functional class k, and zero otherwise. We first compute the posterior probability of each Z i,k , given the observed amino acids at site i: is computed using Felsenstein's algorithm [28], with rate matrix Q a k .

Conditional p-values
An alternative to trying to guess the type of selective pressure under which a given codon evolves is to use a single codon rate matrix but subject to the constraint that the random codon generated at each leaf has to encode the amino acid that was actually observed at that leaf. This approach was originally proposed by Blanchette [18]. This model does not rely on amino acid classifications and in fact allows sites to change function during their evolution. By conditioning on the observed amino acids at the leaves of the tree, we ask: given that in species j, the codon had to encode amino acid a(X j (i)), for each leaf j ∈ {1, ..., n}, is the conservation observed in X i surprising? Notice how, compared to the mixture model approach, this model transfers the responsibility of sequence conservation even more onto the shoulders of coding selection. See

Implementation
The algorithms were implemented in C++ and the program is available upon request. A number of optimizations described in [18] have also been implemented and make the program relatively fast. In particular, a caching mechanism allows to re-use the results of computations done on previous columns.
This allows the program to handle very long sequences quickly. All analyses reported here have been obtained in less two hours of computation on a desktop machine.

Analysis of simulated data
We first verify that the p-values computed by our approach have the basic properties we would expect of them. To start, we confirm that sequences evolving under the null model obtain p-values that are approximately uniformly distributed. This would be a trivial statement if the functional category of each site was known, but in the absence of such prior knowledge, the uniformity of p-values under the null model is less obvious. To this end, we simulated the evolution of a 50kb region of DNA, with each codon belonging to one of the 50 rate categories described above. Sequences were evolved along the branches of the 69-leaf phylogenetic tree derived from the GH1 data set described below. Figure 4 shows the distribution of posterior p-values obtained at each of the three codon positions. As desired, the distribution is nearly uniform. However, we observe a depletion of columns with low p-values (<0.1), in particular for codon positions 1 and 2. This is due to the fact that, at these positions, a column that is perfectly conserved is not particularly surprising, and obtains a p-value around 0.2-0.3. To obtain a p-value distribution that is closer to uniform, one would need to use a tree with much larger total branch length.
Second, we study the power of our method to detect selection on sets of aligned codons that are perfectly conserved. As expected, the power of our method to detect CRUNCS depend on the amino acid encoded by the codon. Perfectly conserved codons encoding amino acid W cannot be detected by our approach. On the other hand, codon conservation is easiest to detect among four-fold and six-fold degenerate amino acids. Among those, codons encoding amino acids P and G, both of which have chemical properties making them more difficult to exchange for other amino acids, obtain higher p-values, because their conservation can be explained to a larger extent by the amino acid encoded.

Analysis biological data
We illustrate our approach on two sets of orthologous mRNA sequences: a set of 69 vertebrate growth hormone 1 (GH1) sequences, and a set of 13 vertebrate CORTBP2 (also known as CTTNBP2) sequences (see Supplementary data).
GH1 was one of the first gene shown to harbor an exonic splicing enhancer, in cow [29]. The availability of a large number of orthologous sequences makes it ideal for our study. Figure 5 shows the compounded p-values obtained from the mixture-based method, for the parsimony score, using a sliding window with w = 9. The human region orthologous to a known exonic splicing enhancer in the cow, located around position 577 [29], is clearly identified, obtaining a compounded p-value of about 2 × 10 −3 . Although this region could have been identified on the basis of parsimony scores alone (gray curve on Figure 5), it is highlighted much more clearly by the posterior p-values.. Figure 6 compares the posterior p-values obtained for the entropy and parsimony scores, under the mixture model of codon evolution, on the growth-hormone 1 data set. The correlation between the two is quite obvious, especially for very small p-values. This is in part due to the fact that alignment columns that are perfectly conserved obtain the same p-values from the two methods. On the other hand, many more sites obtain p-values between 0.01 and 0.1 using the parsimony score than using the entropy score. These are sites that have undergone a small number of substitutions early in vertebrate evolution, and that obtain a good parsimony score but a poor entropy score, as in Figure 1. In this data set, there are no sites that obtain a good entropy p-value and a poor parsimony p-value. Since the parsimony score p-values can be computed much faster and since they appear to be strictly more sensitive than entropy p-values, they seem to be the method of choice in most situations. This is also true, to a lesser extent, for the third codon position.
Finally, the analysis of the CORTBP2 transcript is particularly intriguing. Little is known about the post-transcriptional regulation of this gene. We find that its mRNA contains a very large region (roughly between positions 1500 and 2200, see Figure 8 and Supplementary material), which obtain fairly low p-values. This region is much too large to be a binding site, and we conjecture that it may form some large RNA secondary structure affecting the pre-mRNA splicing or the mRNA stability, localization or translation. Notice that in this case, a simple parsimony score is not sufficient for identifying the region, as many other regions of the transcript are well equally well conserved.

Conclusion
With the many genome sequencing projects rapidly producing vertebrate genomic data, comparative genomic approaches are becoming increasingly powerful. In the case of CRUNCS, additional data is often available in the form of ESTs and cDNAs. We believe that within one year or two, there will be sufficient data for accurate detection of CRUNCS in vertebrate genes, using methods like those described here. Once many CRUNCS will have been detected, the next step will of course be to assign functions to these elements. Although the last word will remain with experimentalists, we have good hopes that more advanced bioinformatics approaches will yield insights into these questions.
Finally, we expect that organisms that are under severe genome size constraints, in particular bacteria and viruses, will more often use CRUNCS. We believe that our approaches will prove particularly fruitful to analyze these genomes.

Estimating amino acid stationary distributions for each class
The Pfam database consists of a set of multiple alignments of homologous protein domain sequences. For some domains, the database contains several sequences that are very closely related. To reduce biases due to this over-representation, one member of any pair of domain sequences that share more than 60% identity is discarded. Let For simplicity, we assume that the amino acids in a given column are drawn independently from the same unknown amino acid distribution. Let π 1 , π 2 , ..., π 50 be a set of amino acid distributions, where π k (α) is the probability of amino acid α in class k. Let τ (k) be the prior probability of class k. The class membership of column i is unknown. Let Z i (k) be a hidden variable that takes value 1 if column i belongs to class k, We search for the distributions π 1 , ..., π 50 and prior probabilities τ that maximize . This is achieved using a simple EM algorithm for learning mixtures of multinomial distributions, using the following update formulas to go from iteration t to iteration t + 1. and where N i (α) is the number of occurrences of amino acid α in column D i and where ζ and ζ are normalizing constants that ensure that the probabilities sum to one.
Initialized with noisy uniform distributions, the algorithm converges quickly (less than 50 iterations) to the local optimum depicted in Figure 2. Ten restarts from other random initial conditions converged to very similar distributions, so the first one was retained.

Estimating amino acid and codon rate matrices
Once the amino acid distributions π 1 , ..., π 50 and prior distribution τ are learned, we can use them to compute the posterior probability of each class k for each alignment column D i : where ζ is another normalizing constant. For simplicity, the amino acid and codon substitution rate matrices P a k and P c k are estimated for each class k based only on the observed substitutions between human and mouse sequences (however, the class posterior probabilities of each column are based on all available species). Any column that does not include an entry for these two species is excluded. Let E i (human) and E i (mouse) be the codons encoding amino acid D i (human) and D i (mouse) in the corresponding mRNA sequences. The transition matrices are then estimated as follows: Finally, we estimate the instantaneous rate matrix Q a k (α, β) = ln P a k (α, β)/t (h,m) , where t (h,m) is the expected number of substitutions per site between human and mouse, in neutrally evolving DNA. Codon rate matrices are obtained in a similar manner.
Note that the codon substitution rates we obtain are slightly underestimating the true rates for sites evolving only under coding selective pressure, because some sequences in Pfam are likely to be evolving slower due to non-coding selective pressure. However, we expect that this underestimation is negligible as the fraction of such sites is likely to be small. In any case, underestimating the rates will only cause conservative estimates of the conservation p-values.

Computing entropy and parsimony score p-values
The probabilities Pr[Y u = (y a , y c , y g , y t )|C(u) = κ u ] are stored in a hash table associated to each node, indexed by the quintuplet (y a , y c , y g , y t , κ u ). Only non-zero probabilities are stored. To compute these probabilities for a node u with children v and w, it is simpler to enumerate all pairs (y a , y c , y g , y t , κ v ), (z a , z c , z g , z t , κ w ) of quintuplets from the hash tables of the two children, and add the proper quantity to the entry ((y a , y c , y g , y t ) ⊕ (z a , z c , z g , z t ), κ u ) of the hash table at u, for all choices of κ u .
Please see [18] for more details.
To study the complexity of the resulting algorithm for the entropy p-value computation, observe that the hash table associated with node v whose subtree contains l(v) leaves will contain at most The hash table implementation works well in the case of parsimony score p-value computation too, especially with the following optimizations. First, for binary trees, the only choices of y a , y c , y g , y t that may have non-zero probability are those where |y i − y j | ≤ 2 for all i, j ∈ {A, C, G, T} (this is a direct consequence of the ⊕ operator). When evaluating the p-value for a parsimony score ψ, choices of y a , y c , y g , y t such that min(y a , y c , y g , y t ) > ψ are not affecting the final p-value and can be safely ignored. Therefore, the hash table associated with node v contains O(ψ) = O(n) entries, and computing all entries for node u from those of nodes v and w take O(l(u) · l(v)). Thus, one can compute all entries of all tables in O(n 2 ), irrespective of the tree topology. More details can be found in [18].

Authors contributions
Chen implemented the program for the posterior p-value computation, for learning amino acid functional classes, and for learning the rate matrices associated to each class, and did some of the data analysis.
Blanchette was responsible for the original ideas and mathematical derivations, for some of the data analysis, and for writing the paper.   Each row corresponds to one class. The numbers on the right are the prior probability of each class.
(Bottom) Two examples of codon rates matrices, where dark cells correspond to high substitution rates and light cells to low rates. The left matrix corresponds to a functional class that favors hydrophobic residues (ILV), while the right matrix comes from a class that favors the glycine amino acid (G).