Modeling codingsequence evolution within the context of residue solvent accessibility
 Michael P Scherrer^{1},
 Austin G Meyer^{1} and
 Claus O Wilke^{1}Email author
DOI: 10.1186/1471214812179
© Scherrer et al.; licensee BioMed Central Ltd. 2012
Received: 11 April 2012
Accepted: 3 September 2012
Published: 12 September 2012
Abstract
Background
Protein structure mediates sitespecific patterns of sequence divergence. In particular, residues in the core of a protein (solventinaccessible residues) tend to be more evolutionarily conserved than residues on the surface (solventaccessible residues).
Results
Here, we present a model of sequence evolution that explicitly accounts for the relative solvent accessibility of each residue in a protein. Our model is a variant of the GoldmanYang 1994 (GY94) model in which all model parameters can be functions of the relative solvent accessibility (RSA) of a residue. We apply this model to a data set comprised of nearly 600 yeast genes, and find that an evolutionaryrate ratio ω that varies linearly with RSA provides a better model fit than an RSAindependent ω or an ω that is estimated separately in individual RSA bins. We further show that the branch length t and the transitiontransverion ratio κ also vary with RSA. The RSAdependent GY94 model performs better than an RSAdependent MuseGaut 1994 (MG94) model in which the synonymous and nonsynonymous rates individually are linear functions of RSA. Finally, protein core size affects the slope of the linear relationship between ω and RSA, and gene expression level affects both the intercept and the slope.
Conclusions
Structureaware models of sequence evolution provide a significantly better fit than traditional models that neglect structure. The linear relationship between ω and RSA implies that genes are better characterized by their ω slope and intercept than by just their mean ω.
Background
Substitution patterns in proteincoding genes are shaped by the 3dimensional structure of the expressed proteins. To account for this influence of structure on sequence evolution, evolutionary biologists increasingly aim to combine sequence analysis with structural information or to develop models of sequence evolution that incorporate structural features of the expressed protein. Some authors calculate aminoacid substitution matrices as a function of protein structure [1, 2] or correlate sequence variability in alignments with structural features [3, 4]. Others subdivide proteins into broad categories by solvent exposure (buried/exposed) or secondary structure (αhelix, βsheet, etc.) and then use standard maximumlikelihood models of sequence evolution to infer evolutionary rates as a function of structural features [5–9]. Some authors employ more complex methods that allow for nonindependence among sites, and use energy functions to model how substitutions at one site influence substitutions at others [10–13]. Finally, a few groups have attempted a variety of other approaches to link sequence variability with protein structure [14–17].
These various analyses differ in their specific results as well as in the approaches taken. However, one pattern consistently emerges: Residues in the core of proteins are more conserved than residues on the surface. This finding agrees with our understanding of protein biochemistry. Substitutions in the core of a protein are more likely to disrupt fold stability than substitutions on the surface, and the loss of the structural integrity of a protein is frequently the underlying cause of loss of function [18, 19]. Further, the observed relationship between residue buriedness and evolutionary conservation seems surprisingly simple. When evolutionary rate is plotted as a function of relative solvent accessibility (RSA, a number between 0 and 1 measuring how exposed a residue is to the solvent surrounding the protein), one finds a nearperfect linear relationship [9, 20].
Inspired by the observed linear relationship between evolutionary conservation and RSA, we here take the standard GoldmanYang model of codingsequence evolution (GY94, [21]) and introduce to it a dependency of the model parameters on RSA. We find that the RSAdependent GY94 model provides a substantially better fit to yeast sequence data than the standard, RSAindependent model. We further find that for several model parameters, a simple, linear dependency on RSA provides the best fit. In particular, the ratio of nonsynonymous to synonymous evolutionary rates ω is a linear, increasing function of RSA. Thus, we can characterize protein evolutionary rates by the slope and intercept of the ω–RSA relationship rather than by just a single ω value. We show that slope and intercept of the ω–RSA relationship vary among proteins with different structures or different expression levels.
Results
An RSAdependent Markov model of codingsequence evolution
Previous works assessing the relationship between evolutionary rate and RSA subdivided sites into groups with comparable RSA and then calculated evolutionary rates separately for each group [9, 20]. This approach yields a set of independent evolutionaryrate estimates that can be plotted against representative RSA values for each group. While this approach has provided valuable new insight, it is not satisfactory from a methodological perspective. First, some model parameters (such as parameters describing the nucleotidelevel mutation process, e.g. the transition–transversion bias) could be conserved among groups. Yet they are estimated individually for each group. Second, a consistent framework for hypothesis testing is lacking. For example, in order to test whether evolutionary rates vary linearly with RSA, one would have to do a regression analysis on the previously estimated rates. In this regression analysis, sample size corresponds to the number of RSA groups rather than to the number of sites in the original data set. Consequently, the P value resulting from the regression would likely be incorrect.
where t corresponds to evolutionary time, in arbitrary units. The parameter t measures the branch length in the phylogenetic tree; it is broadly related to the rate of synonymous substitutions. On first glance, it might be surprising that we allow t to vary with RSA. However, as we will see below, models with sitedependent t fit the data better than models with a single t across all sites. The reason for the improved fit is that RSA influences both aminoacid level processes and nucleotidelevel processes.
We implemented this model in the phylogenetic modeling language HyPhy [22]. One problem we faced was that HyPhy does not allow a continuous covariable (such as r) in the model matrix. To overcome this technical problem, we binned RSA values into n bins and represented all RSA values within bin k by the bin midpoint, which we denote by r_{ k }. In this way, we approximate a single matrix Q(r) that changes continuously with r by a set of n discrete matrices Q_{ k }= Q(r_{ k }), with k = 1,…,n. HyPhy allows us to simultaneously fit multiple discrete matrices, and it also allows us to share parameters among these matrices. In the limit of large n, our discretized model converges to the model that is continuous in r.
A linear RSA ependency for all estimated parameters provides the best model fit
We fitted our model to a data set of yeast sequences with available structural information. We identified 587 Saccharomyces cerevisiae genes with known ortholog in Saccharomyces paradoxus and with a representative structure in the Protein Data Bank (PDB). We calculated RSA for each site as described [7]. Unless noted otherwise, we used n = 20 evenlyspaced RSA bins.
Fitted models, in order of ascending AIC
ω  t  κ  lnL  df  AIC  tslope  κslope 

linear  linear  perbin  −839713.86  24  1679476  +  − 
linear  linear  linear  −839736.74  6  1679485  +  − 
perbin  linear  perbin  −839701.37  42  1679487  +  − 
perbin  linear  linear  −839722.37  24  1679493  +  − 
linear  perbin  linear  −839723.27  24  1679495  +  − 
linear  perbin  perbin  −839707.75  42  1679499  +  − 
perbin  perbin  linear  −839710.08  42  1679504  +  − 
perbin  perbin  perbin  −839694.42  60  1679509  +  − 
linear  constant  linear  −839757.23  5  1679524  0  − 
perbin  constant  linear  −839740.64  23  1679527  +  − 
linear  constant  perbin  −839742.62  23  1679531  0  − 
perbin  constant  perbin  −839727.25  41  1679537  0  − 
linear  linear  constant  −839825.99  5  1679662  −  0 
perbin  linear  constant  −839809.70  23  1679665  −  0 
linear  perbin  constant  −839817.06  23  1679680  −  0 
perbin  perbin  constant  −839800.41  41  1679683  −  0 
linear  constant  constant  −839867.98  4  1679744  0  0 
perbin  constant  constant  −839856.43  22  1679757  0  0 
constant  linear  perbin  −840468.84  23  1680984  +  − 
constant  perbin  perbin  −840459.99  41  1681002  +  − 
constant  perbin  linear  −840479.14  23  1681004  +  − 
constant  linear  linear  −840524.57  5  1681059  +  − 
constant  linear  constant  −840697.41  4  1681403  +  0 
constant  perbin  constant  −840688.35  22  1681421  +  0 
constant  constant  linear  −840738.77  4  1681486  0  − 
constant  constant  constant  −840740.37  3  1681487  0  0 
constant  constant  perbin  −840726.86  22  1681498  0  0 
In general, we found that all parameters varied significantly with RSA. The top eight models did not contain a single model in which even one parameter was constant over RSA. This result shows that it is not sufficient to just make ω a function of RSA, the transition–transversion bias κ and the branchlength t also depend on RSA. Among the models with constant parameters, models with constant t ranked the highest. Models with constant ω ranked consistently the lowest. This result highlights the strong dependency of aminoacid substitution patterns on RSA.
Whenever the transition–transversion bias κ was allowed to vary with RSA, either linearly or perbin, we found that it generally had a negative slope (decreased with increasing RSA). The branch length t tended to have a positive slope (increased with increasing RSA), unless κ was made constant, in which case t assumed a negative slope (Table 1).
Effect of the number of bins on parameter estimates
n  ω _{0}  ω _{1}  t _{0}  t _{1}  κ _{0}  κ _{1}  lnL 

4  0.1205  0.0106  0.7110  2.4706  2.5487  5.3465  839824.56 
5  0.1208  0.0116  0.6967  2.4734  2.5948  5.3547  817178.82 
6  0.1162  0.0135  0.7012  2.4828  2.5361  5.3136  839781.41 
7  0.1149  0.0143  0.7034  2.4849  2.5102  5.2976  839764.54 
8  0.1138  0.0148  0.7269  2.4805  2.5336  5.2996  839760.69 
9  0.1123  0.0154  0.7062  2.4900  2.4831  5.2759  835407.29 
10  0.1129  0.0156  0.7020  2.4898  2.5003  5.2811  839745.29 
11  0.1132  0.0159  0.6742  2.4879  2.4497  5.2669  797981.33 
12  0.1119  0.0161  0.6706  2.5007  2.4451  5.2571  837291.42 
13  0.1110  0.0162  0.7114  2.4902  2.4846  5.2703  836692.33 
14  0.1108  0.0164  0.6956  2.5005  2.4632  5.2532  837806.63 
15  0.1115  0.0164  0.6959  2.4941  2.4759  5.2653  839684.07 
16  0.1102  0.0167  0.7174  2.4897  2.4858  5.2666  839740.91 
17  0.1098  0.0169  0.7146  2.4886  2.4609  5.2562  835852.76 
18  0.1097  0.0170  0.7074  2.4942  2.4652  5.2548  839148.15 
19  0.1100  0.0169  0.7038  2.4937  2.4785  5.2627  839318.45 
20  0.1097  0.0171  0.7038  2.4943  2.4732  5.2592  839736.74 
Surprisingly, the loglikelihood did not vary smoothly in n (Table 2). For example, we observed the overall best likelihood score for n = 11, while n = 10 had a comparatively poor likelihood score. We believe that the discontinuity in likelihood scores was caused by aliasing issues. A site’s RSA can be high or low relative to the range of RSA values within a bin. After a small change in the number of bins (for example from n = 10 to n = 11), some sites that previously had a relatively low RSA for their bin will now have a relatively high RSA or vice versa. If those sites are particularly variable or particularly conserved, the change in their location relative to the bin center can substantially affect the quality of the model fit. For this reason, we do not think that it is reasonable to select the number of bins based on the likelihood score of the model. Instead, we opted for using a relatively large bin number (n = 20), which more accurately approximates a smooth dependency of model parameters on RSA.
GY94 model provides a better modelfit than MG94 model
The GY94 model describes evolutionary rates using the two parameters t and ω. An alternative model, the Muse–Gaut model (MG94 [25]), uses instead the parameters α and β. The parameter α in MG94 corresponds to t in GY94 and the parameter β in MG94 corresponds to tω in GY94. If we fit a model without site variability (all parameters are constant across sites), the MG94 model and the GY94 model are identical. However, when we allow for site variability, the two models become different. The GY94 model is usually set up with a constant t and a variable ω [26, 27]. This setup implicitly assumes that the synonymous rate is constant across sites whereas the nonsynonymous rate is variable. The MG94 model, on the other hand, has been used to explicitly model both nonsynonymous and synonymous site variability [28].
Here, we have allowed both ω and t to vary with RSA, so we have considered both nonsynonymous and synonymous rate variation. However, in using the GY94 model, we have assumed that the two quantities that vary linearly with RSA are the synonymous rate and the ratio of the nonsynonymous to synonymous rates. A priori, it is just as reasonable to assume that the synonymous rate α and the nonsynonymous rate β are linear functions of RSA. In this case,the ratio ω = β/α would of course not be linear in RSA.
Effect of relative solvent accessibility on synonymous and nonsynonymous substitution rates
The previous subsections have shown that substitution rates at both synonymous and nonsynonymous sites are affected by RSA, and that the ratio ω = dN/dS changes linearly with RSA. If ω is linear in RSA and both dN and dS vary with RSA, then we expect dN and dS individually to not be linear in RSA.
The quantities dN and dS are not parameters that are estimated in the model fit. Instead, they are derived quantities that we can calculate once the model has been fit to the data. One complication in calculating dN and dS arises, however: There are multiple definitions of these parameters. For example, dS is defined as the number of synonymous differences divided by the number of synonymous sites in the sequence. We obtain the number of synonymous differences by summing over appropriate elements in the matrix Q [29]. The number of synonymous sites can be obtained in two different ways. First, we can simply count the number of sites atwhich a mutation would lead to a synonymous change, using fractional counts for sites at which mutations can cause either a synonymous or a nonsynonymous change. This method of counting gives us the physicalsites definition of dS [30]. Second, we can weigh each site with the probability that a synonymous mutation will occur at this site under the fitted model. This method of counting sites gives us the mutationalopportunity definition of dS [30]. The same two definitions exist for dN.
The effect of core size and expression level on evolutionary rate
In yeast, the primary determinant of evolutionary rate is gene expression level [31, 32]. A second determinant is protein structure, measured either by contact density [7] or by core size [9]. Thus, we investigated how the slope and the intercept of the linear function ω = ω_{0} + ω_{1}r changed with protein core size (measured by average RSA) and with gene expression level (measured by mRNA abundance).
The two slopes we found were more similar to each other than the ones found by Franzosa and Xia [9]. The main difference between our data set and theirs was that we used more stringent criteria to match sequences to structures. To verify that we could reproduce the results of Ref. [9], we relaxed our criteria for alignment length to 70%, thereby increasing our dataset to 870 sequencestructure pairs. For this larger data set, we found a similar slope for largecore proteins as found before (${\omega}_{1}^{\text{lc}}=0.124$), but the slope for smallcore proteins was reduced (${\omega}_{1}^{\text{sc}}=0.058$). These slopes were consistent with the findings of Ref. [9].
We carried out a similar analysis on highexpression and lowexpression genes, fitting a separate line to each group of proteins (${\omega}_{1}^{\text{he}}={\omega}_{0}^{\text{he}}+{\omega}_{1}^{\text{he}}r$ for highexpression genes, ${\omega}_{1}^{\mathrm{le}}={\omega}_{0}^{\mathrm{le}}+{\omega}_{1}^{\mathrm{le}}r$ for lowexpression genes). We found a substantial difference in slope between these two groups of genes (${\omega}_{1}^{\text{he}}=0.047$ vs. ${\omega}_{1}^{\mathrm{le}}=0.164$). The difference was significant (likelihood ratio test, P = 1.75×10^{−62}). We also found a difference in intercept (${\omega}_{0}^{\text{he}}=0.011$ vs. ${\omega}_{0}^{\mathrm{le}}=0.023$) and this difference was significant as well (likelihood ratio test, P = 6.05×10^{−12}). Similar results were found when we used codon adaptation index as a proxy for gene expression level (data not shown).
Discussion
We have developed a method that models the evolutionary rate of a coding sequence within the context of the protein’s 3dimensional structure. Our method is a simple extension of the standard GY94 model, modified such that all parameters are functions of relative solvent accessibility (RSA). We have found that the evolutionaryrate ratio ω = dN/dS, the branch length t, and the transition–transversion bias κ all depend on RSA. The overall best fitting model had a linear relationship of ω and t with RSA, while κ showed small deviations from strict linearity. In the secondbest model, all parameters had a linear relationship with RSA.
Our method presents a unified statistical framework for comparing RSAdependent model parameters among different groups of proteins. Using this framework, we have shown that protein core size affects only the slope of ω as a function of RSA, but not the intercept. The most buried residues have—on average—the same ω value regardless of protein core size. By contrast, expression level affects ω even for the most buried residues.
We have found that the variation in ω with RSA is substantial; for the most exposed residues, ω was on average 510 times larger than it was for the most buried residues. This observation highlights the importance of incorporating protein structure into models of codingsequence evolution. Traditional models of rate variation [27, 29, 33] cannot distinguish between rate variation caused by protein structure and rate variation caused by other factors (e.g., positive or negative selection on sites of functional importance). As an obvious extension to the work presented here, we can combine the present model with more traditional models of rate variation by allowing for additional rate variation among sites with similar RSA. This work will be presented elsewhere [34].
Our findings here are broadly consistent with the findings of Franzosa and Xia [9]. We have confirmed the linear relationship between dN/dS and RSA in an independently derived data set; we have also confirmed that proteins with larger core size show a faster increase of dN/dS with increasing RSA than proteins with smaller core size. Our work goes beyond Franzosa and Xia’s findings by demonstrating that the evolutionary rate of fully buried residues is independent of protein core size, that expression level affects evolutionary rate at all RSA values, and that the GY94 model provides a better model fit than the MG94 model when RSAdependent evolutionary rates are considered. Our work also suggests that nucleotidelevel processes vary systematically with protein structure.
In our joint analysis of core size and expression level, we made the unexpected observation that the effect of core size on the slope of ω is reversed for genes with low expression level. However, this observation disappeared in a larger data set obtained under slightly less stringent criteria for matching sequences to PDB structures. We can offer no good explanation for this observation. It could be a statistical fluke. The number of genes in each of the four groups (low expression and small core size, low expression and large core size, high expression and small core size, high expression and large core size) is relatively small in this analysis, so a few unusual proteins could skew the analysis. What exactly is the cause of this unexpected observation may have to be clarified in future analyses, either using expanded data sets—as more structures become available—or using data from different organisms.
Our approach is conceptually related to other recent works attempting to combine protein structure with sequence evolution [10–13]. These works imposed structural constraints on sequence evolution via sophisticated energy functions describing how protein fold stability changes as amino acids are replaced. In comparison, our approach is much more simplistic. However, we believe that this simplicity has substantial benefits. First, our approach is simple and fast. All the models we have used here can be fit within 10–15 minutes on an offtheshelf laptop. Second, our approach yields results that can be interpreted easily. Instead of a single ω value per gene, we obtain two values, an intercept and a slope. The intercept tells us to what extent selection constrains the most buried residues; the slope tells us by how much selection relaxes as we move towards more exposed residues. Third, our approach can be implemented with relative ease in existing modeling frameworks such as HyPhy [22].
Following Franzosa and Xia [9], we used a model that fit a single rate ratio ω, regardless of which amino acids were substituted into which other ones. A recent study has shown that such models can always be improved upon with aminoacid dependent transition rates, even if amino acids are grouped into exchangeability categories at random [35]. This finding is not entirely surprising, considering that aminoacid substitution matrices have consistently been found to depend substantially on the aminoacid identity (e.g. Refs. [36–38]). Therefore, it would be desirable to develop codonlevel substitution models that accurately capture this rate variation, without adding too many additional parameters. Approaches that have been suggested include automatically grouping amino acids into exchangeability categories [39, 40] and decomposing aminoacid substitution rates into components corresponding to biophysical properties of amino acids (LCAP model, Ref. [41]). Yet substitution rates also depend on protein structure [1, 2, 6, 42], and thus one would want to incorporate structure into these models as well. One study developed a variant of the LCAP model where parameters were fit separately to buried and exposed sites and found to be significantly different [17]. Since we have seen here that substitution rates seem to depend continuously (and linearly) on RSA, it might be worth it to investigate a variant of the LCAP model in which rate parameters are linear functions of RSA. Such a model would have the same number of parameters as the model in Ref. [17] but would quite possibly provide a better fit to the data. Alternatively, one could attempt to incorporate an RSAdependence into models that automatically group amino acids [39, 40].
We found that in our model, both t and κ varied with RSA. We believe that this finding reflects the effect of selection on nucleotidelevel processes. First, equilibrium aminoacid frequencies vary with RSA [20, 43], and this variation will have some effect on equilibrium codon frequencies. Second, protein structure also seems to exert a direct selection pressure on synonymous codon choice [44–50], most likely through an interaction between the translation process and protein folding. A more realistic model could represent this relationship between protein structure and the nucleotidelevel substitution process more accurately, for example via a structuredependent variant of the FMutSel model [51] or by extending models such as the LCAP model [17, 41] to contain structuredependent terms for nucleotidelevel processes.
The challenge in developing any such models will be to make them realistic yet sufficiently simple so they can be fit to moderately sized data sets. An alternative, simpler strategy could be to calculate equilibrium codon frequencies in an RSAdependent manner. We considered calculating codon frequencies per bin and found that doing so generally improved AIC scores but did not eliminate the need for RSAdependent t or κ, nor did it alter any of our other results in a substantive way (not shown).
Our method requires a solved crystal structure to calculate RSA values. Although the Protein Data Bank (PDB) has been growing rapidly over the past decade, the number of available structures is still small compared to the number of available sequences. For example, many of the yeast sequences we used in our analysis did not have a corresponding structure. For those sequences, we relied on homologous protein structures solved in related organism. Homology mapping performs relatively well in predicting relative solvent accessibility [49] but clearly it is not perfect. Further, certain proteins or regions of proteins, such as membrane proteins or intrinsically disordered regions, can usually not be crystalized. Thus, our method cannot be applied to such proteins or regions of proteins.
Our method assumes that RSA remains constant throughout evolution. Yet every aminoacid replacement will cause some distortion in the protein structure [52], and RSA values at homologous sites will slowly diverge with increasing sequence divergence [49]. In the future, if either the number of available PDB structures increases drastically or if atomlevel computational modeling of protein structures becomes sufficiently reliable, we will able to study how changes in structure correlate with evolutionary rate.
Conclusions
Our work has shown that the evolutionary rate ratio ω, the branch length t, and the transition–transversion bias κ all vary significantly with relative solvent accessibility (RSA). All three parameters show an approximately linear RSA dependency. In general, both the slope and intercept of the ω–RSA dependency differ according to the specifics of individual genes, such as protein structure and gene expression level. Our work demonstrates that protein structure can be an important ingredient in comparative sequence analysis. Our work further suggests that a tighter integration of structural and sequence data will improve the performance of comparative analysis methods.
Methods
Homology mapping and categorization of genes
In order to construct a large data set of sequences with corresponding structures, we obtained open reading frames (ORFs) of the yeast Saccharomyces cerevisiae from the Saccharomyces Genome Database [53] and aligned them with orthologous Saccharomyces paradoxus sequences using MUSCLE [54]. Each ORF was translated and searched against the Protein Data Bank (PDB) [55] using the PSIBLAST algorithm [56] and then paired with the structural chain with the lowest alignment Evalue. To ensure that enough of the yeast protein was represented in the chain and that the PDB structure was a reasonable homology model, we only considered pairs with > 80% alignment length and > 40%sequence identity for analysis. Our final data set had 587 sequence–structure pairs. A data set with relaxed criteria used > 70% alignment length and > 40%sequence identity. This data set had 870 sequence–structure pairs.
The percent solventaccessible surface area (ASA) for each aligned residue was calculated using DSSP [57]. We obtained relative solvent accessibility (RSA) by normalizing ASA values with the surface areas of an extended GlyXGly peptide [58].
Calculation of evolutionary rates
The codons from the yeast alignments were binned by the RSA value of their respective residues, as described [9]. Protein core size was estimated by the average RSA value over all residues in a protein. We considered a structure to have a large core if its average RSA ranked within the bottom third of all average RSA values and to have a small core if ranked within the top third of all average RSA values [9]. Yeast expression data measured in mRNA abundance per cell was obtained from [59]. Codon adaptation index (CAI), a measure of the strength of codon usage bias, was used as an alternative for expression level, since the latter may be biased by laboratory growth conditions of the yeast cells [60]. Both expression level and CAI were ranked and divided into thirds with the top third representing highexpression genes and the bottom third lowexpression genes.
We implemented the model described by Eq. (1) in the HyPhy batch language [22]. We estimated codon frequencies (Π_{ j }) using F3×4 model.
We calculated synonymous (dS) and nonsynonymous (dN) substitution rates according to the mutationalopportunity and the physicalsites definitions, as described [29, 30].
Statistical analysis
We used the Akaike information criterion (AIC) [23, 24] to rank models by their quality of fit. For pairwise comparison of nested models, we also carried out likelihoodratio tests. All statistical analyses were performed using the statistics software R [61].
Abbreviations
 AIC:

Akaike information criterion
 CAI:

Codon adaptation index
 GY94:

GoldmanYang 1994
 HyPhy:

Hypothesis testing using Phylogenies (software)
 MG94:

MuseGaut 1994
 ORF:

Open reading frame
 PDB:

Protein data bank
 RSA:

Relative solvent accessibility.
Declarations
Acknowledgements
This work was supported by NIH grant R01 GM088344, NSF grant EF0742373, and NSF Cooperative Agreement No. DBI0939454.
Authors’ Affiliations
References
 Overington J, Donnelly D, Johnson MS, Šali A, Blundell TL: Environmentspecific amino acid substitution tables: tertiary templates and prediction of protein folds. Prot Sci. 1992, 1: 216226.View ArticleGoogle Scholar
 Koshi JM, Goldstein RA: Contextdependent optimal substitution matrices. Protein Eng. 1995, 8: 641645.PubMedView ArticleGoogle Scholar
 Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol. 1999, 291: 177196. 10.1006/jmbi.1999.2911.PubMedView ArticleGoogle Scholar
 Dokholyan NV, Shakhnovich EI: Understanding hierarchical protein evolution from first principles. J Mol Biol. 2001, 312: 289307. 10.1006/jmbi.2001.4949.PubMedView ArticleGoogle Scholar
 Thorne JL, Goldman N, Jones DT: Combining protein evolution and secondary structure. Mol Biol Evol. 1996, 13: 666673. 10.1093/oxfordjournals.molbev.a025627.PubMedView ArticleGoogle Scholar
 Goldman N, Thorne JL, Jones DT: Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics. 1998, 149: 445458.PubMedPubMed CentralGoogle Scholar
 Bloom JD, Drummond DA, Arnold FH, Wilke CO: Structural determinants of the rate of protein evolution in yeast. Mol Biol and Evol. 2006, 23: 17511761. 10.1093/molbev/msl040.View ArticleGoogle Scholar
 Zhou T, Drummond DA, Wilke CO: Contact density affects protein evolutionary rate from bacteria to animals. Mol Biol and Evol. 2008, 66: 395404.View ArticleGoogle Scholar
 Franzosa EA, Xia Y: Structural determinants of protein evolution are contextsensitive at the residue level. Mol Biol and Evol. 2009, 26 (10): 23872395. 10.1093/molbev/msp146.View ArticleGoogle Scholar
 Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL: Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol. 2003, 20: 16921704. 10.1093/molbev/msg184.PubMedView ArticleGoogle Scholar
 Rodrigue N, Lartillot N, Bryant D, Philippe H: Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene. 2005, 347: 207217. 10.1016/j.gene.2004.12.011.PubMedView ArticleGoogle Scholar
 Rodrigue N, Philippe H, Lartillot N: Assessing siteinterdependent phylogenetic models of sequence evolution. Mol Biol Evol. 2006, 23: 17621775. 10.1093/molbev/msl041.PubMedView ArticleGoogle Scholar
 Rodrigue N, Kleinman CL, Philippe H, Lartillot N: Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons. Mol Biol Evol. 2009, 26: 16631676. 10.1093/molbev/msp078.PubMedView ArticleGoogle Scholar
 Bustamante CD, Townsend JP, Hartl DL: Solvent accessibility and purifying selection within proteins of Escherichia coli and Salmonella enterica. Mol Biol and Evol. 2000, 17 (2): 301308. 10.1093/oxfordjournals.molbev.a026310.View ArticleGoogle Scholar
 Dean AM, Neuhauser C, Grenier E, Golding GB: The pattern of amino acid replacements in α/βbarrels. Mol Biol Evol. 2002, 19: 18461864. 10.1093/oxfordjournals.molbev.a004009.PubMedView ArticleGoogle Scholar
 Marsh L, Griffiths CS: Protein structural influences in rhodopsin evolution. Mol Biol Evol. 2005, 22: 894904. 10.1093/molbev/msi081.PubMedView ArticleGoogle Scholar
 Conant GC, Stadler PF: Solvent exposure imparts similar selective pressures across a range of yeast proteins. Mol Biol Evol. 2009, 26: 11551161. 10.1093/molbev/msp031.PubMedView ArticleGoogle Scholar
 Yue P, Li Z, Moult J: Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol. 2005, 353: 459473. 10.1016/j.jmb.2005.08.020.PubMedView ArticleGoogle Scholar
 Bloom JD, Labthavikul ST, Otey CR, Arnold FH: Protein stability promotes evolvability. Proc Natl Acad Sci USA. 2006, 103: 58695874. 10.1073/pnas.0510098103.PubMedPubMed CentralView ArticleGoogle Scholar
 Ramsey DC, Scherrer MP, Zhou T, Wilke CO: The relationship between relative solvent accessibility and evolutionary rate in protein evolution. Genetics. 2011, 188: 479488. 10.1534/genetics.111.128025.PubMedPubMed CentralView ArticleGoogle Scholar
 Goldman N, Yang Z: A codonbased model of nucleotide substitution for proteincoding DNA sequences. Mol Biol and Evol. 1994, 11 (5): 725736.Google Scholar
 Kosakovsky Pond SL, Frost SDW, Muse SV: HyPhy: hypothesis testing using phylogenetics. Bioinformatics. 2005, 21 (5): 676679. 10.1093/bioinformatics/bti079.View ArticleGoogle Scholar
 Akaike H: A new look at the statistical model identification. IEEE Trans Autom Control. 1974, 19 (6): 716723. 10.1109/TAC.1974.1100705.View ArticleGoogle Scholar
 Burnham KP, Anderson DR: Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods & Res. 2004, 33: 261304. 10.1177/0049124104268644.View ArticleGoogle Scholar
 Muse SV, Gaut BS: A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol. 1994, 11: 715724.PubMedGoogle Scholar
 Nielsen R, Yang Z: Likelihood models for detecting positive selected amino acid sites and applications to the HIV1 envelope gene. Genetics. 1998, 148: 929936.PubMedPubMed CentralGoogle Scholar
 Yang ZH, Nielsen R, Goldman N, Pedersen AMK: Codonsubstitution models for heterogeneous selection pressure at amino acid sites. Genetics. 2000, 155: 431449.PubMedPubMed CentralGoogle Scholar
 Kosakovsky Pond S, Muse SV: Sitetosite variation of synonymous substitution rates. Mol Biol Evol. 2005, 22: 23752385. 10.1093/molbev/msi232.View ArticleGoogle Scholar
 Yang Z: Computational Molecular Evolution. 2006, New York: Oxford University PressView ArticleGoogle Scholar
 Bierne N, EyreWalker A: The problem of counting sites in the estimation of the synonymous and nonsynonymous substitution rates: implications for the correlation between the synonymous substitution rate and codon usage bias. Genetics. 2003, 165: 15871597.PubMedPubMed CentralGoogle Scholar
 Drummond DA, Bloom JD, Adami C, Wilke CO: Why highly expressed genes evolve slowly. PNAS USA. 2005, 102: 1433814343. 10.1073/pnas.0504070102.PubMedPubMed CentralView ArticleGoogle Scholar
 Drummond DA, Raval A, Wilke CO: A single determinant dominates the rate of protein evolution. Mol Biol Evol. 2006, 23: 327337.PubMedView ArticleGoogle Scholar
 Kosakovsky Pond SL, Scheffler K, Gravenor MB, Poon AFY, Frost SDW: Evolutionary fingerprinting of genes. Mol Biol Evol. 2010, 27: 520536. 10.1093/molbev/msp260.PubMed CentralView ArticleGoogle Scholar
 Meyer AG, Wilke CO: Integrating sequence variation and protein structure to identify sites under selection. Mol Biol Evol. 10.1093/molbev/mss217.
 Delport W, Scheffler K, Gravenor MB, Muse SV, Kosakovsky Pond S: Benchmarking multirate codon models. PLoS One. 2010, 5: e1158710.1371/journal.pone.0011587.PubMedPubMed CentralView ArticleGoogle Scholar
 Dayhoff MO, Eck EV, Park CM: A model of evolutionary change in proteins. Atlas of protein sequence and structure, Volume 5. Edited by: Dayhoff MO. 1972, Washington D.C.: National Biomedical Research Foundation, 8999.Google Scholar
 Jones D, Taylor W, Thornton J: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8: 275282.PubMedGoogle Scholar
 Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximumlikelihood approach. Mol Biol Evol. 2001, 18: 691699. 10.1093/oxfordjournals.molbev.a003851.PubMedView ArticleGoogle Scholar
 Lartillot N, Philippe H: A Bayesian mixture model for acrosssite heterogeneities in the aminoacid replacement process. Mol Biol Evol. 2004, 21: 10951109. 10.1093/molbev/msh112.PubMedView ArticleGoogle Scholar
 Delport W, Scheffler K, Botha G, Gravenor MB, Muse SV, Kosakovsky Pond SL: CodonTest: modeling amino acid substitution preferences in coding sequences. PLoS Comp Biol. 2010, 6: e100088510.1371/journal.pcbi.1000885.View ArticleGoogle Scholar
 Conant GC, Wagner GP, Stadler PF: Modeling amino acid substitution patterns in orthologous and paralogous genes. Mol Phylogenet Evol. 2007, 42: 298307. 10.1016/j.ympev.2006.07.006.PubMedView ArticleGoogle Scholar
 Koshi JM, Goldstein RA: Models of natural mutations including site heterogeneity. Proteins. 1998, 32: 289295. 10.1002/(SICI)10970134(19980815)32:3<289::AIDPROT4>3.0.CO;2D.PubMedView ArticleGoogle Scholar
 Porto M, Roman HE, Vendruscolo M, Bastolla U: Prediction of sitespecific amino acid distributions and limits of divergent evolutionary changes in protein sequences. Mol Biol Evol. 2004, 22: 630638. 10.1093/molbev/msi048.PubMedView ArticleGoogle Scholar
 Thanaraj TA, Argos P: Ribosomemediated translational pause and protein domain organization. Protein Sci. 1996, 5: 15941612. 10.1002/pro.5560050814.PubMedPubMed CentralView ArticleGoogle Scholar
 Komar AA, Lesnik T, Reiss C: Synonymous codon substitutions affect ribosome traffic and protein folding during in vitro translation. FEBS Lett. 1999, 462: 387391. 10.1016/S00145793(99)015665.PubMedView ArticleGoogle Scholar
 Cortazzo P, Cervenansky C, Marin M, Reiss C, Ehrlich R, Deana A: Silent mutations affect in vivo protein folding in Escherichia coli. Biochem Biophys Res Commun. 2002, 293: 537541. 10.1016/S0006291X(02)002267.PubMedView ArticleGoogle Scholar
 KimchiSarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM: A “silent” polymorphism in the MDR1 gene changes substrate specificity. Science. 2007, 315: 525528. 10.1126/science.1135308.PubMedView ArticleGoogle Scholar
 Zhang G, Hubalewska M, Ignatova Z: Transient ribosomal attenuation coordinates protein synthesis and cotranslational folding. Nat Struct Mol Biol. 2009, 16: 274280. 10.1038/nsmb.1554.PubMedView ArticleGoogle Scholar
 Zhou T, Weems M, Wilke CO: Translationally optimal codons associate with structurally sensitive sites in proteins. Mol Biol Evol. 2009, 26 (7): 15711580. 10.1093/molbev/msp070.PubMedPubMed CentralView ArticleGoogle Scholar
 Lee Y, Zhou T, Tartaglia GG, Vendruscolo M, Wilke CO: Translationally optimal codons associate with aggregationprone sites in proteins. Proteomics. 2010, 10: 41634171. 10.1002/pmic.201000229.PubMedPubMed CentralView ArticleGoogle Scholar
 Yang Z, Nielsen R: Mutationselection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol. 2008, 25: 568579. 10.1093/molbev/msm284.PubMedView ArticleGoogle Scholar
 Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO J. 1986, 5: 823826.PubMedPubMed CentralGoogle Scholar
 Cherry JM, Alder C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, Weng S, Botstein D: SGD: Saccharomyces Genome Database. Nucleic Acids Res. 1998, 26 (1): 7379. 10.1093/nar/26.1.73.PubMedPubMed CentralView ArticleGoogle Scholar
 Edgar R: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 17921797. 10.1093/nar/gkh340.PubMedPubMed CentralView ArticleGoogle Scholar
 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Wessig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28 (1): 235242. 10.1093/nar/28.1.235.PubMedPubMed CentralView ArticleGoogle Scholar
 Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 33893402. 10.1093/nar/25.17.3389.PubMedPubMed CentralView ArticleGoogle Scholar
 Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogenbonded and geometrical features. Biopolymers. 1983, 22 (12): 25772637. 10.1002/bip.360221211.PubMedView ArticleGoogle Scholar
 Creighton T: Proteins: Structures and Molecular Properties. 1992, New York: FreemanGoogle Scholar
 Holstege FCP, Jennings EG, Wyrick JJ, Lee TI, Hengartner CJ, Green MR, Golub TR, Lander ES, Young RA: Dissecting the regulatory circuitry of a eukaryotic genome. Cell. 1998, 95: 717728. 10.1016/S00928674(00)816414.PubMedView ArticleGoogle Scholar
 Sharp PM, Li WH: The codon adaptation index  a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res. 1987, 15: 12811295. 10.1093/nar/15.3.1281.PubMedPubMed CentralView ArticleGoogle Scholar
 Ihaka R, Gentleman R: R: a language for data analysis and graphics. J Comput and Graphical Stat. 1996, 5: 299314.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.