Biophysical and structural considerations for protein sequence evolution
© Grahnen et al; licensee BioMed Central Ltd. 2011
Received: 24 August 2011
Accepted: 16 December 2011
Published: 16 December 2011
Skip to main content
© Grahnen et al; licensee BioMed Central Ltd. 2011
Received: 24 August 2011
Accepted: 16 December 2011
Published: 16 December 2011
Protein sequence evolution is constrained by the biophysics of folding and function, causing interdependence between interacting sites in the sequence. However, current site-independent models of sequence evolutions do not take this into account. Recent attempts to integrate the influence of structure and biophysics into phylogenetic models via statistical/informational approaches have not resulted in expected improvements in model performance. This suggests that further innovations are needed for progress in this field.
Here we develop a coarse-grained physics-based model of protein folding and binding function, and compare it to a popular informational model. We find that both models violate the assumption of the native sequence being close to a thermodynamic optimum, causing directional selection away from the native state. Sampling and simulation show that the physics-based model is more specific for fold-defining interactions that vary less among residue type. The informational model diffuses further in sequence space with fewer barriers and tends to provide less support for an invariant sites model, although amino acid substitutions are generally conservative. Both approaches produce sequences with natural features like dN/dS < 1 and gamma-distributed rates across sites.
Simple coarse-grained models of protein folding can describe some natural features of evolving proteins but are currently not accurate enough to use in evolutionary inference. This is partly due to improper packing of the hydrophobic core. We suggest possible improvements on the representation of structure, folding energy, and binding function, as regards both native and non-native conformations, and describe a large number of possible applications for such a model.
Protein-coding sequences play a central role in cellular functions necessary to produce the vast variation in organismal phenotypes observed in nature. To function, most proteins must fold into a unique and stable structure . The structure is responsible for orienting residues necessary for proper function, including binding and catalysis. To maintain fitness, protein function and the underlying structure must be preserved. Protein structure and binding (for example, protein-ligand and protein-protein interactions) are determined by the interactions of individual amino acid residues. Therefore, to mechanistically model the evolution of protein sequences, these residue-residue interactions must be considered, relaxing the common assumption of independent evolution of each amino acid position . Structural models for sequence evolution where function is protein-protein intermolecular interaction will now be considered.
To evaluate if a sequence will fold into a unique and stable structure, it must be demonstrated that the sequence prefers the folded state to both the unfolded state (and folding intermediates) as well as to alternative folded states . From a practical perspective, it is impossible to enumerate the enormous space of all possible alternative conformations . Therefore, sets of representative "decoy conformations" must be used to approximate folding intermediates and/or other stable or meta-stable states . Alternatively, without explicitly considering decoy structures, the inter-residue contacts from different conformations can be randomly sampled . Additionally, a scoring function is needed to quantify differences between the possible states. Strategies developed for generating such scoring functions include those derived statistically from existing structures (informational models, specifically coarse-grained pairwise statistical potentials)[7–9] and those derived from physical first principles [10, 11]. Both types of scoring function assume that the native sequence lies close to (within a neutral sequence network of) a thermodynamic optimum, and that there is a gap in energy between the native state and the closest non-native state .
Formally, the protein sequence evolution problem relates to the inverse folding problem (whether a sequence will preferentially adopt a particular confirmation) rather than the folding problem. Underlying both problems are similar physical assumptions and similar models .
Protein biological function depends not only on proper folding, but also on the ability to bind the target ligands. Therefore, to study sequence evolution, binding must also be evaluated. From a physical perspective this is easier, as the only states to consider are the bound and the unbound states. However, it has recently been suggested that selective pressures on proteins to avoid non-specific binding are also an important aspect of protein fitness, necessitating consideration of binding "decoys" as well . Identifying the actual selective pressures against non-specific binding in a cell is a daunting task, but the number and nature of binding decoys is a major determinant of the ease of evolving new binding interactions. The binding decoy characteristics also affect the level of selective constraint on the binding interface.
Site and function independent models clearly do not capture critical elements of protein evolution. For structure-based models, a good scoring function must produce sequences with similar properties to real proteins. This includes a hydrophobic core that evolves slower than the hydrophilic surface . A dN/dS ratio (the ratio of the nonsynonymous nucleotide substitution rate to the synonymous nucleotide substitution rate) much smaller than unity, particularly for functional residues, should also emerge . More particularly, amino acid substitution rates should show heterogeneity, expressed by a gamma distribution across positions (rejecting an equal rates model of evolution), reflecting their relative importance to folding and function. While the gamma function for rate heterogeneity was adopted for model fit rather than mechanistic purposes, it is one of the most common parameters in standard evolutionary models and accounts for heterogeneity in the substitution rate driven by structural signals as well as other signals in evolutionary sequence data [18, 19]. In addition to rate heterogeneity, sequences must also have an energy gap between the native and alternative conformations to ensure rapid and stable folding , although structurally disordered proteins represent a distinct class of proteins that do not have this property . The model should place the native sequence near an optimum so as not to provide a signal of directional selection when function is not changing. Consequently, most mutations should be deleterious or nearly neutral rather than adaptive . Lastly, proteins must retain their binding function.
Population size dictates what fitness changes are neutral as well as what mutations become substitutions  and must therefore be taken into account. To enable use in forward (simulation) and backward (phylogenetic or population genetic) studies of evolution, especially in a population genetic context, in a complete genome context, or in studies of interactome evolution, the model needs to be coarse grained at a level that allows for sufficient computational speed. In this context the computational cost makes state of the art models and methods from the protein structure community  intractable.
Here a novel physics-based scoring function is developed and compared with a commonly used informational approach on the criteria described above. The physics-based model is more specific and predicts less drastic stability changes. Both types of models violate the assumption of high native sequence stability and produce many adaptive mutations through directional selection towards a scoring function optimum, even though the model works with truncation selection. These properties are discussed in light of their effect on modeling accuracy and improvements to the models are suggested.
Protein folding and function cause interdependence between sites in the protein sequence. For instance, a mutation that removes a cysteine involved in a disulfide bridge (interaction with another cysteine residue at a different position in the protein) is likely to be deleterious, whereas mutation of other cysteines (not involved in disulfide bridge formation) may be more neutral. Of course, disulfide bridges can be lost and interconverted to other types of stabilizing interactions over evolutionary time, albeit with a large transitional barrier due to the strong site interdependence. Such selective pressures can be modeled by scoring the thermodynamic effects of mutations. Two approaches to calculating such scores are the statistically motivated informational methods (specifically, coarse-grained pairwise statistical potentials) and the first principles physics-based methods. Informational models score the likelihood of observing specific types of contacts in the folded protein based on those seen in previously known structures . Physics-based approaches evaluate structures by measuring the fit of residues to geometrical properties such as backbone torsion and residue-residue distances . In both cases the fit to both the native and the many possible non-native conformations must be measured to ensure specificity.
This manuscript sets out to evaluate the ability to replicate biological properties of proteins using both approaches described above through evolutionary simulation. A new physics-based approach is described. Native sequences were evaluated in both approaches to evaluate if they fell within the neutral network of an optimum or if they underwent directional selection initially. Similarly, sequences sampled from within the optimum were evolved to examine the evolutionary properties of each model. With the physics-based model, the relative importance of each term in the folding and the binding functions was evaluated. Together, these analyses give a picture of the evolutionary performance of each model and the appropriateness for use in addressing various questions in molecular evolution.
In the informational model the sequences are dominated by stretches of hydrophobic residues, interspersed with charged residues and a few tyrosines (Figure 4A, left). The remaining positions are highly randomized and do not maintain a specific biochemical character. When mapping the most informative positions onto the structure (Figure 4A, right) a fairly simple pattern emerges. The β-sheet core is dominated by small hydrophobic residues, surrounded by a shell of larger such residues with the occasional tyrosine embedded. Clusters of charged residues maintain the stability of some loops, although most of the highly exposed positions are randomized.
The physics-based model produces qualitatively similar results, but differs in some aspects. Core residues are prolines rather than leucines, and the larger hydrophobic residues surrounding the core tend to be tryptophans instead of phenylalanines (Figure 4B, left). Tyrosines are again favored in semi-exposed positions. However, in place of ionic interactions the physics-based model prefers glycines. Mapping the informative positions onto the structure (Figure 4B, right) reveals that these conserved glycines are located in or near portions of the backbone that are β-sheet-like and highly exposed. Another notable difference is that the physics-based model seems more residue-specific as it conserves fewer positions overall (36 vs 54), but those residues are typically more strongly conserved with larger co-evolutionary barriers to change. Both models impart roughly the same selective pressure overall.
The protein-like properties of sequence evolution over simulation time is shown
As noted above, the physics-based model sampled with truncation selection for folding and binding is substantially more restrictive in how many non-synonymous substitutions it allows, and the simulated sequences reflect this (Figure 7C, left). Nearly 1/3 of the positions are effectively invariant over shorter evolutionary timeframes, compared to only a few residues in the informational model. Structural mapping of highly conserved positions (Figure 7C, right) again recovers the starting pattern (Figure 4B, right), although a few β-sheet-associated glycines have been replaced by charges on the most exposed loops. The binding site is very highly conserved. Only very exposed positions (17-18, 59-62, etc.) have started to randomize, mostly containing polar and charged residues. Core and surface hydrophobicities started close to those of the native sequence (83% and 50% vs 72% and 40%), and did not change appreciably during the simulation (Table 1).
Comparing the simulated sequences to the Hidden Markov Model of the Pfam  SH2 family (Figure 7B), some similarities are noticeable. Both models reproduce the overall conservation of the core residues, particularly the very buried ones. The informational model is more accurate in predicting the residue type within the core (leucines and phenylalanines), whereas the physics-based model correctly conserves some glycines. Neither model captures the defining functional feature of the family (a conserved arginine in the binding pocket), but this is due to the SAP SH2 domain binding a non-phosphorylated ligand. Instead the models more generally preserve residues within the binding interface. This type of conservation is absent in the HMM since SH2 domains have unique specificity-determining binding site geometries. Similarly, since the HMM reflects several dozen distinct structures and sequences it is less conserved overall than simulated sequences derived from a single structure evolved over relatively short evolutionary distances.
In evaluating our ability to model the evolution of sequences using structural and functional (binding) constraints, both informational and physical models show protein-like properties, but also need further improvement. Despite a stability gap between random sequences (and structures) from the native for both methods, the native sequence is not near enough to a thermodynamic stability optimum to prevent directional selection. Sequences near such an optimum are more homopolymeric in composition. The evolution of such sequences, however, does show many attributes of real protein evolution.
The effect of negative (purifying) selection, a dN/dS ratio smaller than 1, is well modeled by both approaches, even reproducing the known correlation between exposure and evolutionary rate . The physics-based model produces the stronger selective pressure overall, particularly so on the protein surface (due to the solvation term). Variation in this rate across positions is mimicked by the informational model, which more consistently supports a gamma parameter throughout the simulation. The physics-based model instead conserves some positions completely over shorter evolutionary distances, but ultimately supports a covarion process rather than simply rate heterogeneity. In particular, hydrophobic residue content is much closer to that of the native sequence. When it comes to the exact identity of such hydrophobic residues (e.g. leucines vs prolines) the informational model has higher correspondence to native residues, presumably due to being derived from known sequence/structure combinations. Overall the physics-based model appears more specific for local but crucial contributions to free energy of folding, whereas the informational model produces sequences with protein-like properties but without fold specificity.
In the context of fold specificity, a major difference between models is consideration of secondary structure. The physics-based model is enriched in prolines in the β-sheet core, and glycines in other sheet-like portions of the backbone. In principle any small hydrophobic residue should suffice in the tightly packed and highly buried core positions, but prolines in particular score well in the helical term (being helix breakers in the most non-helical segment of the protein) and are mostly acceptable in the sheet term. The glycines in half-exposed positions with sheet-like geometry provide an excellent compromise between the helix, sheet and solvation terms, and have little effect on the remainder of the scoring function. Although somewhat unrealistic in the context of the native sequence, this does demonstrate the strong impact of the particular backbone conformation under selection.
The physics-based model is also different in that its parameters are adjusted for each protein. This is beneficial in adding specificity to the model , but also presents issues with parameter selection . The Lennard-Jones term in particular is surprisingly variable and offers little benefit in isolation. When testing against random sequences, this is mostly due to the side chain replacement protocol. It minimizes a Lennard-Jones function , thereby removing much of the signal a priori. Side chain optimization with a complete scoring function including other terms might remedy this problem. The vastly larger utility in a binding context (Figure 8A), where no such pre-minimization occurs, further illustrates this point. The test against random compact structures with constant sequence (Figure 8B) is more illuminating. It shows that the L-J term needs the restrictions imposed by other factors (for example, solvation and secondary structure) to efficiently identify a protein-like heteropolymer. In other words, it adds information to the total function, but only when the solution space is already somewhat restricted.
Generally, the parameterization of both folding and binding scoring functions is fairly difficult. For instance, the choice of the reference state (the unfolded or unbound molecules) has a substantial impact on the calculations. The ruggedness of the optimality landscape in parameter space makes it quite difficult to find the global optimum, and parameterizing folding and binding functions separately may lead to inconsistencies between the two. Future work will explore more sophisticated approaches to solving these problems.
The specificity of either model is ultimately reflected by the shape of its stability landscape. As seen both near the native sequence (Figure 2) and non-native optima (Figure 6), the physics-based model produces a landscape that is more rugged and has fewer equivalent minima. This is consistent with theoretical expectations of the real free energy landscape in both conformational  and sequence  space. The result is that the simulated sequences tend to be more conserved, stay near the starting ("native") point during simulation, and show mostly deleterious or neutral mutations. In contrast, under the informational model populations simply diffuse across a relatively flat sequence landscape with less co-evolutionary pressure between interacting sites, rapidly moving away from the starting point. In this aspect the physics-based model shows more protein-like and fold-specific behavior overall, despite selecting sequences with some compositional "quirks", as discussed above. Informational methods, especially those built only on pairwise contacts, necessarily lack specificity. For example, all Lys-Phe interactions are not equivalent. Those oriented in a plane will be repulsive, while those that are orthogonal will generate a cation-pi effect and be attractive. This is but one example. Physical models, at the right level of granularity, will not suffer from this problem and are thereby inherently attractive for this interaction specificity. Further, they enable study of the forces driving evolutionary processes at a physical level.
Why then are the native states so far from sequence optima? The answer is found in the approximations made by the models. Packing is perhaps the most severely affected, as can be seen from the accumulation of large hydrophobic residues in both models. These substitutions fill up the empty space that results from the coarse-graining procedure. Spherical Cβ beads, for instance, are a particularly poor approximation for elongated (Lys, Arg, etc.) or flattened (Trp, Phe, etc.) residues. A structural model with more than one bead per side chain, such as that of Hills and co-workers , could remedy this issue. The representation of ions as spherical point charges is also problematic, and salt bridge formation could be better addressed by using a model of induced dipoles . Many entropic contributions are not considered, although the physics-based model includes some of them implicitly in the solvation term. In addition to entropic considerations, explicit hydrogen bonding , cation-pi interactions , and other interactions are not considered, but would increase model complexity. Finally, an overarching problem may be found in the effect of individual mutations. A typical disadvantageous mutation under the informational model has a very small effect on the stability score E(s, c) (eq. 12) (< 0.5 kT units out of tens of kT units, where k is Boltzmann's constant and T is the absolute temperature), and the physics-based model generally predicts even smaller stability changes. In real proteins the average mutation is strongly deleterious (3 kcal/mol out of a total ΔG of ~10 kcal/mol, where ΔG is the free energy difference between the unfolded and folded protein structure), indicating that the models are much too lenient when it comes to the cost of mutation. Appropriate weighting of contributing terms or more precise modeling of packing would improve this aspect, and make the energy landscape more realistically rugged.
Early attempts at considering alternative states in the above models of folding used explicit decoys. Because of the large number of potential alternative states for a sequence, both kinetic traps and thermodynamic optima, evaluating even a large finite number of such states proved inadequate. Further, identifying the states that were most informative near the native confirmation was not clearly driven by fold sequence or geometry and it may be that the relevant structures originate from smaller regions of diverse structures. The random contacts model  proved powerful and efficient in sampling many contacts from diverse structures as an implicit solution.
For binding, there is a greater conundrum in that while the real biological decoys are not known, a random contacts model has the potential to be too restrictive in selecting against binding interactions that are not biologically deleterious. For an alternative binding interaction to be deleterious, the proteins need to have a deleterious interaction (as defined by biological fitness) and to have the potential to interact at the same time, in the same cellular location, at the right concentration . However, the proper decoys are frequently not known and this restricts the use of the alternative approach of explicit binding decoys.
Further, when evaluating binding specificity as a function of affinity it is assumed that complex thermodynamic stability (and relatedly, life time) must be maximized. While this is a relevant constraint for cellular structures such as the nuclear pore complex , protein complexes involved in signaling or even in forming the cytoskeleton have transient associations that are necessary for functional signaling [44–46]. A requirement of transient binding means that dissociation kinetics are selected to be just fast enough, not as slow (corresponding to very stable complexes) as possible. Therefore, even in the limit of perfect predictions one should not expect to recover the native binding site sequence based on thermodynamics alone.
From the perspective of protein structural space , the folding decoys all lie near the native conformation. The total amount of secondary structure is conserved, as are the relative percentages of helices and sheets. This means that the decoys can be thought of as being sampled from the same Superfamily, or at least Fold, in the SCOP hierarchy . Conservation of residue exposures and sequence length also guarantees that the conformations have roughly the same radius of gyration as the native state. This should cause selection against subtle misfolds and accumulating pathway intermediates, but may be a poor representation of the diversity of conformations a sequence may fold into and thus may miss important local alternative conformations that are close in energy. For example, an increase in residues with high helical propensity could shift the lowest-energy conformation toward an all-α topology instead of α+β. The current approach does not allow modeling this type of event, but evaluation of the partition function of the Boltzmann probability of folding is a notoriously difficult (and currently unsolved) problem . A critical aspect of characterizing the energy landscape is the role of decoy structures in restricting the funnel shape as well as providing fold specificity, and this is a challenging task.
Recently, the utility of site-interdependent structural models of sequence evolution was evaluated in the context of phylogenetics [50–52]. Despite use of a variety of scoring functions and probability measures, the utility of such models was found to be generally low. Minimizing a coarse-grained potential failed to recover much of the native sequence and did not outperform site-independent models . Simulating the effects of substitutions on folding energy revealed a rapid divergence from the native state even when modeling solvation in addition to residue-residue contacts , and including the contribution of folding energy to transition path probability can even decrease tree reconstruction accuracy . These results are all consistent with the nature of the coarse-grained energy landscape seen above. There is no global minimum near the native sequence, and its position on a steep slope results in rapid divergence in terms of sequence, energy or both. Models that accurately characterize the structural effects of substitutions are urgently needed to make progress in this direction. It is also worth noting that these approaches all relied on informational potentials, which we have shown above to be less structurally specific than physics-based scoring.
With models that properly characterize the evolution of proteins, many important evolutionary biological questions can be addressed. What are the patterns of sequence evolution associated with different mechanisms of duplicate gene retention? Relatedly, how easily does orthologous neofunctionalization occur and how dependent upon protein fold and binding interface size is it? What are the roles of positive and negative pleiotropy in restricting neofunctionalization? How do structural transitions occur between neighboring folds? At the contact level, how are different stabilizing interactions interconverted (for example, the transition between Cys-Cys, cation-pi, and Coulombic interactions)? What is the interplay between population size, fold distribution, and functional evolution? To address these important evolutionary biology questions, continued progress on models such as described here is needed.
Proteins evolve according to the laws of physical chemistry, biochemistry, and population genetics. Classic phenomenological site independent models, while computationally simple to use, do not adequately describe evolutionary processes for either phylogenetic or functional evolutionary analysis (comparative genomic) purposes. Mechanistic models built upon the type described here will be necessary. Biological (for example, regulatory or metabolic) pathways dictate the level of constraint on physical constants and such considerations will also need to be taken. While the currently available models remain inadequate, an understanding of the problems in physical chemistry associated with evolutionary models will lead to future improvements, with many downstream applications.
The protein structure is represented by a reduced two-bead model originally described by Levitt  and more recently used by Mukherjee and Bagchi  as well as Grahnen et al. . Each residue is represented by one backbone bead (Cαi) and one side chain bead (Cβi) (Figure 1). The Cα bead is centered on the Cα atom in an all-atom representation and has a radius of 1.8 Å. For a residue i (except glycine), the Cβi bead center is placed at a distance b i from Cαi, which is determined as the centroid of all the side chain atoms (including the Cαi atom), and assigning a residue-dependent radius r i . Glycine is simply represented with a Cα bead with the appropriate properties (hydrophobicity etc.). For residue i, where 2 < = i < = N-2 in a protein with N residues, we define θ i as the angle between Cαi-1, Cαi and Cβi, and φ i as the torsion angle between Cαi-1, Cαi, Cαi+1 and Cαi+2. Residues within two bonds of the termini do not have a defined φ i , and the residues at the N-terminus do not have a defined θ i . Finally, r ij describes the distance between the center of any two beads i and j in the model. In the informational representation of protein structure r ij was measured by taking the minimum distance between the Cβ beads in residues i and j (Cα is substituted for Cβ for glycines).
To thread a sequence s into a protein backbone conformation c, the side chain replacement algorithm SARA  was used. SARA is a very fast approach based on iterative Markov Chain Monte Carlo refinement of the replaced side chain geometry. Backbone coordinates were not adjusted during the replacement procedure, and c was assumed to be constant throughout.
K Θ is the force constant for the bending potential (10.0 kJ mol-1 rad-2), Θ i and Θ i+1 are defined as described above, and is the equilibrium bond angle for a Cβ bead of type t.
ε ij is the interaction parameter between beads i and j (a pair of Cα beads, a pair of one Cα bead and one Cβ bead or a pair of two Cβ beads), based on their respective hydropathy indexes, σ ij is the collision diameter of beads i and j (i.e. the sum of their radii), and r ij is the defined in Figure 1. The sum ij runs over all pairs of beads noted above.
K i 1-3 is average helical propensity of residues i, i+1 and i+2. r i, i+2 is the distance between Cα beads i and i+2. r i, i+3 is the distance between Cα beads i and i+3. is the average helical propensity of residues i, i+1, i+2, and i+3. Finally, r h is the equilibrium helix inter-bead distance (5.5 Å).
is the average beta propensity of residues i-1, i, i+1, and i+2. The beta propensity for a given residue type is constructed using the same linear scaling of helical propensities used by Mukherjee, but is based on Kim and Berg's beta propensity scale  rather than Pace and Scholtz's helix propensity scale . C b is a scaling constant (0.01) selected to scale the range of torsion angles to a similar magnitude as the range of bead-bead distances in V helix . Torsional angle φ i is defined above (Figure 1), and φ b is the equilibrium beta sheet value (210°) for a two-bead representation as described by Bahar and coworkers .
C c is a scaling constant (1,000) selected to scale the term to similar magnitude as other terms, q i and q j are the charges of residues i and j (+1 or -1 depending on residue type), r ij is the distance between beads i and j, and ε is the dielectric constant of the protein interior (3.0), with the sum running over all pairs of charged residues with a SASA (see below) of less than 0.25. This roughly approximates the strong screening of charged interactions due to the highly polarizable surrounding water, and the nearly negligible effects of the largely non-polarizable interior of the protein . The whole term is then scaled by a weighting term w ion when used in V(s, c) (eq. 1), so C c has no effect on the parameterization, but is necessary computationally so as not to introduce errors due to lack of floating point precision.
h i is the hydrophobicity-based interaction parameter of residue i (equivalent to ε ii above) and p i is the "polarity index", constructed as p i = (h max -h min ) - h i . SASA(i) is the fraction of solvent-accessible surface area of residue i calculated by the NeighborVector method of Durham  as adapted for the two-bead representation.
and r ij is the distance between two cysteine Cβ beads and r SS is the maximal Cβ-Cβ distance in a typical disulfide-bonded cysteine pair (4.5 Å). The sum ij runs over all pairs of cysteine residues. This form of V cys predicts disulfide bonds with a specificity of 0.98, sensitivity of 0.91 and Matthews Correlation Coefficient of 0.79 as measured on the structural data set used to construct SCWRL 3.0 .
Finally, w x are weights for each individual term in eq. 1, determined in a procedure described below.
and are the same as above but evaluated for inter- rather than intra-molecular bead pairs.
where complex is the bound state. The weights w x are determined as described below.
where U ab is a matrix describing the free energy gain when amino acids of type a and b are in contact, C is the contact map of c (1 when the Cβ beads of i and j are closer than 4.5 Å, 0 otherwise) and the sum runs over all residue pairs (i, j) that are separated by more than four residues in primary sequence.
where C is evaluated between c 1 and c 2 (inter-molecularly), and the sum runs over all residue pairs (i, j) such that i comes from c 1 and j comes from c 2 .
where G is the distribution of free energies (ΔG) in the alternative conformations, G nat is the free energy of folding into the native conformation, and <G> and σ( G ) describe the location and dispersal of that distribution. In other words, Z fold can be interpreted as measuring the specificity of a protein sequence for its native structure, or equivalently a combination of unfolding and misfolding stability as well as the preference for the specified fold over alternative folds.
where <G> is estimated by the median of G. The dispersal σ( G ) is assumed to be constant. This assumption is computationally efficient, which is a major consideration when evaluating tens of millions of sequences. Also, since the dispersal is purely dependent on the amino acid composition under the REM , and the gross features of the composition (the proportion of hydrophobic residues for example) should not change when simulating biologically realistic protein sequences, this enables the approximation Z fold ≈ S gap .
The case of specific protein-protein interaction is simpler due to the lower number of states available. It was previously shown that specific binding can be adequately modeled as a combination of selection for the native ligand and against a single non-specific decoy ligand . Here this approach is adopted during simulations, which are described in more detail below.
where θ k is the set of weights in V(s, c) (eq. 1) for step k of the Markov chain, and T is the temperature of the chain. θ k is proposed by perturbing q k-1 in each dimension with values drawn from a Gaussian distribution with mean 0.1 and variance 0.1. The search space was restricted to (0,1) for all weights, and 100,000 moves were attempted at T = 0.1. The parameter set with the largest Z fold found during sampling was retained as the best set of weights.
To obtain the distribution of non-native scores G necessary to calculate Z fold in this procedure, a two-pronged approach was taken. The goal was to find a set of weights that makes the native sequence-structure combination as specific as possible, both with respect to the fold recognition problem (which conformation does a fixed sequence fold into?) and the inverse folding problem (which sequence best fits a fixed conformation?). Because random sequences may fold into alternative structures, the random sequences also play a role in parameterizing the fold recognition problem, where having a hydrophobic core and preferring a given backbone to an alternative state is not enough. To address specificity of conformation, 1,000 decoy conformations were generated as described above to obtain the distribution of scores G struct . Specificity of sequence was addressed by sampling 1,000 random sequences of the same length as the native sequence (with equal probability of drawing any amino acid), threading each one through the native conformation, and obtaining from these the distribution of scores G seq . For each score G, composed of individual term scores from V(s, c) (eq. 1), the value of each term score was re-scaled such that the overall term-specific distributions of values were of equal range. The full distribution G was formed by the union of G struct and G seq for each ϴ k .
where i runs over all possible other conformations, ΔG is the free energy of folding, k is Boltzmann's constant and T is the absolute temperature. It follows that only those alternative conformations with large negative ΔG contribute greatly to this probability, or equivalently that only those decoys with very low scores V(s, c) (eq. 1) are important for folding specificity. Therefore the median was used as a location estimator, and the difference between the first quartile and the median as the estimator of dispersal.
where G bind = G seq . Weight parameters w x can then be chosen using the MCMC algorithm described above (eq. 16).
where each step k was a single substitution in the protein sequence.
For near-native sampling two temperatures were employed: a very high temperature for nearly unbiased sampling (T = 10 for the informational function E(s, c) of eq. 12, T = 50 for the physics-based function V(s, c) of eq. 1) and a lower temperature to find local minima (T = 0.1 for E(s, c), T = 1.5 for V(s, c)). The low temperature was chosen to make the chain capable of passing the barrier in the energy landscape caused by an average deleterious substitution under the informational model (~0.1 units of E(s, c)) with a reasonably high probability (~0.1). This was also calibrated with respect to the expected fraction of sequences with better folding stability for real proteins and indirectly on dN/dS. The high temperature was chosen to make the sampling nearly independent of S gap (> 0.95 expected acceptance probability). Temperatures for the physics-based function were then chosen such that the actual acceptance frequencies obtained from sampling under the informational model were reproduced. Differences in the observed dN/dS ratios between the physics-based and informational models were due to differences in the ruggedness of the landscape and its sequence context dependence (local ruggedness).
To increase sample density, divergence of the chains from the starting state was restricted to 5%-30%, in increments of 5%, and two chains were run at each temperature and divergence limit. The number of attempted steps was adjusted to obtain ~8,500 unique samples for each function. Samples were projected onto a two-dimensional representation of sequence space using Sammon mapping  and score surfaces were calculated using Gnuplot v 4.2.
To discover regions of the landscape where most mutations are destabilizing (i.e S gap is near a maximum), an important feature of biological protein sequences leading to observed dN/dS ratios, ten replicate chains attempting 100,000 steps were run for each function at the lower sampling temperature without any divergence restrictions. The chains were thinned by taking every fourth unique sample and the final 100 such samples were retained, resulting in 1,000 unique samples for each scoring function. Samples were Sammon-mapped and sequence logos  were calculated to assess convergence. Structural properties of the sampled equilibria were further characterized by projecting the consensus sequence at each position with > 2 bits of information onto the protein conformation.
To test the accuracy of the scoring functions, Z fold was calculated for the native sequence and 1,000 random sequences. The test was performed on the same diverse set of structures used in the development of the side chain replacement method employed .
The interaction score was tested by a similar scoring procedure. In this case a subset of the structures from the PepX database , which describes protein-peptide interactions, was constructed. As for the folding score test set, 100 structures (25 from each major SCOP Class) were selected, each a centroid of a cluster of binding interfaces at the 3 Å-75% level of PepX. For each Class with more than 25 available centroids, the structures were sorted by average B-factor across the binding interface, and the 25 structures with the lowest B-factor were selected. A list of these structures can be found in Additional File 1. Z bind was calculated as described above (eq 18) for the native peptide sequence and a set of 1,000 random peptide sequences for each complex.
In addition, the distribution of weights for each term in V(s, c) (eq. 1) and V(s 1 , c 1 , s 2 , c 2 ) (eq. 10) across the structural data sets was generated to compare the relative importance of the terms. The effect on the gap score of using each term individually was also examined. Furthermore, the numerical stability of the MCMC algorithm for choosing weights and the characteristics of the parameter-stability landscape were probed by producing eight different sets of non-native scores G for the SAP protein, and re-running the parameterization 100 times for each set.
As a test of the utility of the methods in an evolutionary context, simulations were performed under negative selection for protein folding and binding. In a manner similar to that of Rastogi et al. , a population of 1,000 virtual organisms containing a single copy of the SAP protein  (PDB ID: 1D4T) was evolved for 200,000 generations at a rate of 10-5 mutations per bp per generation at the DNA level, with transitions being twice as probable as tranversions. To obtain a thermodynamically stable starting point, non-native optima of S gap (eq. 15) in protein sequence space were sampled as described above and a starting DNA sequence was randomly selected from the reverse translation of those samples.
where S gap (s) is the fold gap score (eq. 15) of the current sequence s, S gap start is the fold gap score of the starting sequence, V bind (s) is the binding score V(s, c protein , s ligand , c ligand ) (eq. 10), s ligand and c ligand correspond to the SLAM peptide (the native ligand of SAP), V bind start is the binding score of the starting protein sequence to the same ligand, and V decoy (s) is the binding score V(s, c protein , s decoy , c ligand ) for a decoy ligand. Simulations under the informational model used E(s, c) (eq. 12) to compute S gap and E(s 1 , c 1 , s 2 , c 2 ) (eq. 13) to compute binding scores. In other words, mutations that make the protein or the protein-ligand complex less stable than the starting point are infinitely deleterious, mutations that maintain those stabilities but enable binding of the decoy ligand are somewhat deleterious, and all other changes are neutral. Organisms were then propagated to the next generation by random sampling weighted by fitness w (eq. 20), with replacement, while maintaining a constant population size.
The decoy ligand were constructed by threading a sequence s decoy = IWMTIYMIIIT through the SLAM peptide conformation c ligand for the informational function, and s decoy = RLPTIYICITG for the physics-based function. Both sequences are variations on the experimentally determined XXXTIYXX(VI)XX SAP-binding motif , and do not bind the protein at the starting point of the simulation.
Finally, each simulation was replicated 10×, and the resulting sequences analyzed. Structural patterns of evolutionary rates (dN/dS) were measured by comparing sequences to the original sequence and calculating according to the PBL method [73, 16], and the distribution of position-specific residue preference was compared with that in the Pfam database  for the same fold by constructing sequence logos  from the evolved sequences. The sequences were also tested for emergence of rate heterogeneity by assessing the superior fit of the data to the rates-across-sites model over an equal-rates model using ProtTest . To discern if the novel solvation model preserves the general pattern of hydrophobic core and hydrophilic surface the proportion of hydrophobic and hydrophilic residues in parts of the protein were measured before and after simulation.
The scoring function and simulation software are available as C++ code by request. The code will be released as part of open source software for sequence simulation to be described in a future publication.
We thank two anonymous reviewers for thorough and helpful reviews. This study was supported by an institution NIH INBRE award to University of Wyoming (P20 RR016474) and by NSF award DBI-0743374. JK was supported by NSF CAREER 0846140.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.