Reducing the false positive rate in the non-parametric analysis of molecular coevolution
© Codoñer et al; licensee BioMed Central Ltd. 2008
Received: 05 September 2007
Accepted: 10 April 2008
Published: 10 April 2008
The strength of selective constraints operating on amino acid sites of proteins has a multifactorial nature. In fact, amino acid sites within proteins coevolve due to their functional and/or structural relationships. Different methods have been developed that attempt to account for the evolutionary dependencies between amino acid sites. Researchers have invested a significant effort to increase the sensitivity of such methods. However, the difficulty in disentangling functional co-dependencies from historical covariation has fuelled the scepticism over their power to detect biologically meaningful results. In addition, the biological parameters connecting linear sequence evolution to structure evolution remain elusive. For these reasons, most of the evolutionary studies aimed at identifying functional dependencies among protein domains have focused on the structural properties of proteins rather than on the information extracted from linear multiple sequence alignments (MSA). Non-parametric methods to detect coevolution have been reported to be especially susceptible to produce false positive results based on the properties of MSAs. However, no formal statistical analysis has been performed to definitively test the differential effects of these properties on the sensitivity of such methods.
Here we test the effect that variations on the MSA properties have over the sensitivity of non-parametric methods to detect coevolution. We test the effect that the size of the MSA (number of sequences), mean pairwise amino acid distance per site and the strength of the coevolution signal have on the ability of non-parametric methods to detect coevolution. Our results indicate that all three factors have significant effects on the accuracy of non-parametric methods. Further, introducing statistical filters improves the sensitivity and increases the statistical power of the methods to detect functional coevolution. Statistical analysis of the physico-chemical properties of amino acid sites in the context of the protein structure reveals striking dependencies among amino acid sites. Results indicate a covariation trend in the hydrophobicities and molecular weight characteristics of amino acid sites when analysing a non-redundant set of 8000 protein structures. Using this biological information as filter in coevolutionary analyses minimises the false positive rate of these methods. Application of these filters to three different proteins with known functional domains supports the importance of using biological filters to detect coevolution.
Coevolutionary analyses using non-parametric methods have proved difficult and highly prone to provide spurious results depending on the properties of MSAs and on the strength of coevolution between amino acid sites. The application of statistical filters to the number of pairs detected as coevolving reduces significantly the number of artifactual results. Analysis of the physico-chemical properties of amino acid sites in the protein structure context reveals their structure-dependent covariation. The application of this known biological information to the analysis of covariation greatly enhances the functional coevolutionary signal and removes historical covariation. Simultaneous use of statistical and biological data is instrumental in the detection of functional amino acid sites dependencies and compensatory changes at the protein level.
There has been great interest in understanding the role of amino acid covariation in protein evolution and function. The importance of covariation goes beyond the linear contribution of amino acid coevolution to explain the stability of protein folds. In fact not only conservation of amino acid positions may be important for maintaining protein folds but also the correlated variation of pairs of amino acid residues around these folds may have an important role in maintaining their stability . Covariation may therefore reflect functional and/or structural constraints. Many parametric and non-parametric methods have been developed for such purpose (i.e., [2–12]). Methods based on the information theory are among the most used ones. These methods measure the reduction in uncertainty in particular amino acid sites (entropy) achieved by considering the site's mutual information (MI) with additional sites. This measure accounts for the predictability of the composition of one amino acid column when the composition of another amino acid site is known. However, the sensitivity of most of these methods has been always halted by the fact that true positive coevolutionary relationships between amino acid sites are swamped in a background of stochastic amino acid covariation. Consequently, studies using methods slightly different in their mathematical parameterisation have been dramatically affected by the properties of MSAs (i.e., Number of sequences in the alignment, mean pairwise sequence amino acid distances, etc.). In addition to the problems of stochastic covariation between amino acid sites, no single pair of sites is unrelated in MSAs due to their historical (phylogenetic) dependencies [6, 7, 13]. Finally, the coevolutionary relationships are due to many different links between amino acid sites including coevolution due to structural, functional or physical interactions .
Many authors have devised mathematically transformed models based on the mutual information of two amino acid sites to highlight true functional coevolutionary relationships between them. Some of these methods included a correction of MI values by the degree of amino acid variability at the amino acid sites , using the phylogenetic information to remove phylogenetic coevolution [6, 7, 16], limiting the size of groups of coevolving amino acid sites  or quantitatively normalising the MI value by the content of information at the sites under test . Even though most of these methods have improved the sensitivity to detect functional/structural relationships between coevolving amino acids, a realistic method accounting for some of the most obvious biological properties of amino acid sites is as yet to be performed. For instance, most of the methods developed use qualitative approaches to remove the phylogenetic coevolutionary signal [i.e., [12, 7]]. A formal or statistical method is then needed to conduct a more rigorous approach of removing the undesired historical coevolution.
Introducing some biological parameters, such as the physico-chemical properties of amino acid sites may dramatically increase the sensitivity of mutual-information based methods to detect coevolution. Furthermore, these biological-information based methods may improve our ability in detecting compensatory mutations. This is due to the fact that compensatory mutations usually occur proximal in the protein structure and they are produced to maintain internal volumes, preserve salt bridges, or retain optimal hydrogen bonding. In summary, information-based coevolutionary analyses introducing biological information may indirectly aid at identifying pairs of three-dimensionally proximal residues. Few attempts have been made to relate coevolving pairs in a MSA to biological information in order to shed light on the structural and functional reasons of such coevolution [2, 6, 7, 18, 19]. These reports however, did not use this biological information as a filter for the coevolutionary analyses but rather as a criterion adopted a posteriori to performing the method to detect functionally important coevolving pairs. Also, no general approach was conducted to determine the relationships between such parameters and their biological meaning. Because of the growing interest in detecting molecular coevolution to infer biologically meaningful selective constraints, measuring the accuracy of such methods when varying the MSAs properties becomes crucial.
In this work we first examine the evolutionary and statistical behaviours of some of the properties of amino acid sites in proteins. Then we introduce this information in the coevolutionary analysis and test its effect on the sensitivity of MI-based methods. Because of the large amount of data examined in this study, we implemented these MI methods in in-house software (available from the corresponding author on request). We then examined the effect of MSAs with different properties. Finally, we apply our improved procedure to detect coevolution to several well-studied proteins, whose functional domains are known and examine the ability of the method to detect important functional sites.
Filtering by the parsimony information criterion removes most of the stochastic coevolution
Comparison of coevolution data corresponding to the levels of 20% and 25% coevolution presented however no significant differences for both sets (F1 = 0.732; P = 0.392). The interaction between the level of coevolution (20% and 25%) and the variation of pairwise divergence levels also showed no effect on the detection of true coevolution (F2 = 1.061; P = 0.374). Neither the interaction between these three factors showed any significant variability levels in PPV (F8 = 1.279; P = 0.250). However, pairwise distance as well as number of sequences still showed significant effects on the variance of PPV, with increasing pairwise distance being the only factor explaining a significant increase in PPV. Regarding the sensitivity (SN) of the method when filtering by parsimonious variability per site, all three factors (number of sequences, strength of coevolution and pairwise divergence levels) as well as all possible combination of factors influenced the variance of SN. In fact, univariate analyses showed significant effect of the size of the MSA (F2 = 105.99; P < 0.001), amino acid substitutions per site (F4 = 223.775; P < 0.001) and strength of coevolution (F2 = 89.214; P < 0.001) on SN. The tendency of SN with variations of each of the factors was however different (Figure 2d to 2f). Even though an increase of the number of sequences from 20 to 50 influenced the SN and PPV values, further increasing to 100 sequences in the MSA did not show any directional influence in SN as it remained within the same range of values (Figure 2d to 2f). When distances were greater than 0.1 amino acid substitutions per site SN tended to increase with increasing coevolutionary signal and increasing pairwise amino acid distances (Figure 2d to 2f). As before, the effect was more pronounced when the signal of coevolution increased from 10% to 20% or to 25% coevolution. In fact, significant differences were observed between the dataset belonging to the 10% of coevolution level and that belonging to the 20% coevolution level (F1 = 130.339; P < 0.001). No differences were found however between the 20% and 25% datasets (F1 = 1.126; P = 0.289). Our results indicate that strong signals of coevolution ameliorate the effect of the number of sequences and pairwise amino acid distances. This was especially notorious in the case of pairwise distance factor because its interaction with coevolution level factor shows no significant variability in SN (F4 = 0.963; P = 0.427). Finally, at high number of sequences, a direct relationship exists between the level of coevolution and the PPV values. For instance the mean PPV value increased when we increased the strength of coevolution from 10% to 20% in MSAs comprising 100 sequences (F1 = 42.325; P < 0.001). This increase is also significant when coevolution increases from 20% to 25% (F1 = 4.124; P = 0.043). We obtained the same results when we tested the effect of the strength of coevolution on SN. Also, there was no proportionality at low number of sequences between the level of coevolution and the PPV and SN values (an increase in the coevolution level does not imply a proportional increase in PPV or SN values).
Biological filtering increases the robustness of the sensitivity of coevolution analyses
The PPV values generated in the coevolution analyses filtered by the correlated hydrophobicity of the pairs of amino acid sites were influenced by all three factors: size of MSAs, mean pairwise distance and level of coevolution. However, simultaneous variation of the coevolution levels and size of MSAs or mean amino acid pairwise distance showed no effect on the amount of true positive values detected, with high coevolution levels compensating for variations in the MSA size (F4 = 0.186; P = 0.946) or mean pairwise amino acid distance (F8 = 0.880; P = 0.533). The combination of the three factors did not have significant effect on the PPV values generated (F16 = 0.962; P = 0.496).
Filtered coevolution analyses identify functional sites in the proteins GroEL, Hsp90 and Env
Amino acid sites in Escherichia coli detected to be important for the function of GroEL.
Amino acid site
30–33, 51–53, 87–91, 150–151, 398, 414–416, 454, 478–481, 493, 495
ATP/ADP and Mg2+ binding residues (ATP/Mg-B)
199, 201, 203–204, 234, 237, 259, 263–264
Substrate binding (SubB)
230, 238, 241, 257, 260, 261, 265, 268, 270, 271
234, 237, 238, 241, 242, 257, 261, 265, 270
GroES binding (ESB)
241, 257, 261, 265, 270
GroES binding (ESB)
4, 41–42, 58–59, 61, 63, 75–76, 80, 83, 178–179, 188, 196–197, 224–226, 252, 253, 255, 257, 277, 283, 286, 303, 304, 308, 327, 328, 359, 361, 363, 364, 367, 368, 371, 380, 386, 390, 393, 397, 404, 408, 523
Charged residues exposed to the central cavity in the cis ring, probably contacting substrates (CER)
Intra-molecular coevolution analysis in the heat-shock protein GroEL.
Group of coevolution
Amino acid sitesa
Mean H(X, Y)d
D11, V94, C458, K171, E172, I379, E191, A251, V254, A293, G306, E354, Q366, V369
Application of the filter of hydrophobicity reduced the number of groups from 6 to 3 groups, including G1 to G3 (Figure 6b). Interestingly, All these groups included sites proximal to functionally important regions, although G2 and G3 only included one out of the two amino acid sites to be proximal (less than 8 Å distance) to functional regions.
Finally, Only G1 was detected when we applied the filter of correlation in the variance of the molecular weight or molecular weight and hydrophobicity between the coevolving amino acid sites (Figure 6c). Once again, only the group sharing the greatest number of functionally/structurally important sites was identified. In conclusion, the results show that application of variation in the biological properties of coevolving amino acids a priori identified removes most of the non-biologically relevant pairs of significant coevolving amino acids.
Coevolution analysis in the heat-shock protein 90.
Group of coevolution
Amino acid sitesa
Mean H(X, Y)d
Coevolution analysis in the env protein of the Human Immunodeficiency Type 1 virus HIV-1.
Group of coevolution
Amino acid sitesa
Mean H(X, Y)d
L21, S640, L641, T723
L21, S640, V833
T232, S640, L641
Proteins evolve (fix amino acid mutations) under rules governed by the different selective constraints imposed over their constituent amino acids. The magnitude of these constraints is a function of the relative contribution of proteins to the biological fitness of individuals in the population. Because different amino acids within a protein have different roles, they are usually under different selective pressures depending upon their importance on the protein's function or structure. This differential relationship of amino acid sites to the protein's function is translated into different levels of within-protein amino acid conservation. Amino acids that contribute equally to the protein's function or structure should theoretically correlate in their evolutionary rates. Unraveling correlated evolution between amino acids is however generally swamped in a background of historical coevolution that makes difficult disentangling both types of covariation.
Despite that most of the functionally important sites are generally restricted to change by selection, mutations studies have shown that non-conserved amino acid sites can be functionally important because of their contribution to maintain geometry-volume characteristics around important sites, their compensatory effects of mutations at important regions or their contribution to protein-protein interaction interfaces. We have previously shown that indeed most of the significant coevolutionary relationships can affect regions surrounding functionally important domains . This implies that coevolving sites detected as not being proximal in the three-dimensional structure in other studies  may be related through their contribution to the structural/functional equilibrium of important regions in the protein. In such cases, introducing statistical filters based on biological information about the physico-chemical properties of the coevolving amino acids can significantly improve our ability to detect functionally important regions as shown in this study.
Our results based on simulated MSAs demonstrate that the accuracy of MI based methods to detect coevolution can improve by introducing statistical and biological filters a priori to the identification of coevolving amino acid sites. Introducing statistical filters such as those proposed previously  or using parsimony-informative amino acid sites as we show in this study increases SN and PPV values significantly. As we show, despite the introduction of statistical filters to the analysis of coevolution, SN and PPV still depend on the size of MSAs, mean pairwise divergence levels and strength of the coevolution signal. All these factors have intermingled effects and they should be considered prior to any coevolution analysis. The effect of the interaction of all three factors in PPV and SN is however complex and differs significantly with specific combinations of values for these factors. Fares and Travers  highlighted the significant effect of amino acid divergence levels and size of the MSAs on the SN of the method. They also showed that SN values of the MI based methods did not go beyond 20% when no filters were introduced. Martin and colleagues demonstrated using simulated MSAs that the contribution of background MI from finite column lengths is mitigated when MSAs comprise more than 150 sequences . When analysing real protein sequences, the number of sequences in MSAs is often limited to few sequences, which make the use of MI to detect coevolution prohibitive given the large expected number of false coevolutionary relationships. We show that, the effect of the number of sequences on the variance of SN and PPV is significant but that parsimony filtering criterion employed in our coevolution analyses reduces the number of FP, increasing hence the PPV. This procedure ameliorates hence the problem of low statistical power of MI coevolutionary analyses when the number of sequences in the MSA is low. The effect of the number of sequences in MSAs becomes less important when the strength of coevolution increases. This is especially notorious when switching from MSAs comprising 50 sequences to those with 100 sequences. These results support the fact that introducing stringent filters will increase the ability of avoiding FP in the coevolutionary analyses even when the MSA sizes are limited. Furthermore, MSAs presenting high number of sequences (for example 100 sequences) present proportional increase of PPV and SN when the coevolution level increases. This means that our ability to classify sites with high MI values and their distinction from those with lower MI values is more feasible when the number of sequences is high. Consequently, these results suggest that the separation between the different types of coevolution may be more approachable in large MSA.
Parsimony criterion has been previously used successfully to identify pairs of coevolving residues involved in the interaction between two proteins . Adding hydrophobicity and molecular weight as a priori filters for MI values we obtained significantly greater MI values compared to parsimony filtering. In addition, the size of the MSA did not have significant effects on PPV and SN. Moreover, increasing the level of coevolution did not involve a significant increase in PPV or SN when divergence levels were high. Although this approximation is similar to that performed previously , their approach was uniquely based on the a posteriori qualitative use of the biological properties of amino acids to decide whether a coevolutionary pair was important biologically. Our approach however, considers the correlation in the hydrophobicity and/or molecular weight of coevolving amino acid sites a priori to statistically determine their biological significance.
In summary, taking into account this study for a given protein we can theoretically interpolate the values for the size of MSAs and mean pairwise amino acid distance and know at which level of coevolution we will be able to identify real coevolving pairs. Application of our method using the different filters to identify coevolving amino acid sites within the heat-shock proteins GroEL and Hsp90 and the envelope protein env of HIV-1 clearly show the usefulness of these biological filters to identify functionally important amino acid regions. This approach was performed previously using a parametric method to detect coevolution  and applying it to the Hsp70-Hop-Hsp90 system  providing great sensitivity to detect functionally important regions. In this study we show that these filters are very useful to ameliorate the limitations of non-parametric methods to detect functional coevolution as well.
The main conclusion derived from this study is that appropriate statistical as well as biological filters introduced in coevolutionary analyses can substantially reduce the false discovery rate. The number of sequences, mean pairwise divergence levels and the strength of coevolution between amino acid sites in MSAs are strong parameters affecting the ability of non-parametric methods to detect coevolution. As additional parameters affecting the accuracy of coevolution methods based on the mutual information criterion are the asymmetric phylogenies and the lack of biological filters. The high correlation in the hydrophobicities and molecular weights of closely located amino acid sites in the protein structure is testament to the importance of the biological information in these kinds of analyses. Introduction of these amino acid properties as biological filters in MI based methods appears to equalise their sensitivity to a similar range of values determined by previous simulation studies of other parametric and highly sensitive methods .
Mutual Information content
The logarithm permits scaling H(X) so as to have H(X) = 1 when there is no uncertainty about the amino acid site composition. The mutual information is then measured taking into account the information (lack of uncertainty) for two amino acid sites (X and Y) and is calculated as:
MI(X, Y) = H(X) + H(Y) - H(X, Y)
Here N and K are the different elements that amino acid sites i and j could contain, respectively.
MI is a value that ranges between 0, when amino acid sites i and j evolve independently from each other, and a positive value that is proportional to the amount of evolutionary dependency between these sites.
Filtering pairs of coevolving amino acid sites
To remove the stochastic coevolutionary signal some authors have weighted the MI value by using the amount of mutual information contained on the amino acid sites under consideration . Other authors have partially accounted for the phylogenetic covariation by predefining in a qualitative way the size of groups of coevolution .
We have introduced two filters different from those introduced by these authors. The first filter is based on selecting pairs of coevolving amino acid sites that are parsimony informative (for example, select those amino acid sites that show at least two different amino acid each one repeated at least twice at that amino acid site in any two sequences of the MSA). Figure 1 shows an example of when an amino acid site would be considered parsimony-informative or not. Notice that the phylogenetic relationships are never accounted for in the identification of coevolving sites since the approach is purely based on the amount of patterned information present at two particular amino acid sites. Although this filter is less quantitatively sensitive than that proposed by Martin and colleagues  nonetheless it presents the advantage of making the detection of true positive coevolving sites less subjected to the background amino acid site variability (for example, the detection of coevolving sites does not depend on the mean level of amino acid site variability). Moreover, coevolutionary analyses are less error prone when we have asymmetric phylogenies. For example, when one branch in the tree is significantly longer than the remaining branches this branch will contain most of the evolutionary signal (this would be regarded as phylogenetic coevolution; Figure 1).
One additional filter that may improve the detection of functional coevolution between amino acid sites is that based on biological parameters. What are the relevant biological parameters is not clear and previous studies have only conducted few qualitative (non-statistical) attempts to use such filters. Among the different physico-chemical characteristics (for example, amino acid volume, geometry, molecular weight and hydrophobicity), molecular weight and hydrophobicity are the most important parameters in explaining functional links between coevolving amino acids. In general, hydrophobic amino acids will be placed in the core of the protein or in the inter-subunit interfaces in a solvent-free environment. In spite of the expected improvement of sensitivity to detect coevolution when biological information is used, methods have usually ignored this information due to the number and complex dependencies among biological properties. In theory the probability for a pair of coevolving sites to be correlated also in these two parameters is very low unless they are located in a protein region with similar hydropathy. We may however find that proximal amino acid sites in the protein structure may be correlated in their hydrophobicities or molecular weight to maintain the volume-geometry characteristics of local protein structures, affecting therefore the stability of the full protein structure. Correlations in the inter-species variation of amino acids molecular weights at neighbouring sites may uncover compensatory relationships between sequences. This kind of approximation has however the drawback of ignoring long-range distant functionally related and coevolving amino acid sites. Its conservative character would on the other hand decrease the rate of false positives.
To examine these dependencies we first tested whether amino acid sites are correlated in their hydrophobicities or molecular weight characteristics based on their three-dimensional locations in the protein structure. For such purpose, we downloaded from NCBI a dataset comprising 45,000 protein structures. We then removed redundant structures and ended with a non-redundant set of 8000 protein structures. We built a MSA for each data protein structure comprising the most similar (A maximum of 40% sequence divergence) protein sequences isolated from eukaryotic organisms. Orthologous sequences were identified using BLASTP program and taking sequences showing mutual best hits and a low probability (P < 10-6). We then examined the correlation in the hydrophobicities and molecular weights for each protein structure looking at all the possible pairs of amino acid sites in each structure and using the following procedure:
Here is the estimated variation in hydrophobicity in the comparison between sequences i and j as shown in equation (4).
Simulation of intra-molecular coevolution
Testing the sensitivity of methods to detect coevolution, using MSA of real proteins is tedious due to the lack of knowledge regarding intra-molecular amino acid site evolutionary and functional interactions. Forcing coevolution between known pairs of sites in simulated MSAs built on the bases of the evolutionary and physico-chemical properties of real proteins allows a better understanding of the power of methods designed for such purpose. Building MSA under realistic models of evolution is however anything but straightforward since models usually oversimplify real evolutionary processes. These limitations oblige researchers to test specific and explicit hypotheses individually and independently from others. In other words, the effect of the different factors has to be tested individually to simplify the problem discerning the main factors affecting the sensitivity of the methods.
We tested the effects of three main factors in the sensitivity to detect coevolution: i) when the strength of coevolution between two amino acid sites varies; ii) under a range of realistic mean pairwise sequence amino acid distances in MSAs; and iii) under different MSA sizes. Although the relationships between the amount of coevolution and MI values is clear, the linearity between MI and the sensitivity of the method when the coevolutionary relationships between sites changes has not been tested before. We tested this relationship using simulated MSAs presenting different coevolutionary strengths. We also examined and tested when the coevolutionary relationship between a pair of amino acids is sufficiently strong as to overcome the effect of background noise or limited MSA size on the sensitivity of the method to detect coevolution.
We used the program Evolver from the PAML package version 3.15  to conduct all the primary simulations. The method of simulation used is the same as that previously published . Briefly, an ancestral sequence of 200 amino acids in length was generated using the amino acid composition corresponding to the equilibrium residue frequencies in naturally occurring proteins . We then evolved the ancestral sequence using a Markov chain Monte Carlo simulation along a bifurcated and symmetrical phylogenetic tree showing the divergence levels explained above. Simultaneously to this simulation we randomly chosen 10 pairs of amino acid positions and forced them to coevolve under the different coevolutionary strengths explained above.
Here TP, FP and FN stand for the percentage of true positives (percentage of times where we detect real pairs under coevolution), false positives and false negatives, respectively. TP refer to those pairs of amino acid sites that were forced to coevolve in the simulated data and that were detected when applying the coevolutionary analyses. We are aware of the circularity that simulations may introduce, however we have assumed that functional sites coevolve in their physico-chemical properties and that this coevolution is diluted by the stochastic noise generated throughout the evolution of proteins. In our simulations we have followed exactly this assumption introducing noise and testing the performance of the method to identify the real coevolutionary signal.
Analysis of real data
Simulated MSAs have the disadvantage of under-representing biological information and being also "well behaved" regarding the effect of a particular factor in the analysis. We hence used a MSA for the heat-shock protein GroEL (Table including accession numbers and name of species is in Additional file 1). The ATPase molecular chaperone GroEL is found specifically in bacteria and the organelles of eukaryotic cells . The multimeric protein GroEL folds 10% to 15% of slow-folding proteins, which are mostly aggregation-prone [30, 31]. Each GroEL subunit is organized in three domains, including the apical, equatorial and intermediate [32, 33]. Several functionally important intra-domain regions in GroEL have been previously identified (Table 1). Here we test if coevolution among sites is crucial for the functional and structural stability of GroEL. We used groEL gene sequences isolated from a diverse range of gamma-proteobacteria. The range of bacteria was selected in a way that allowed us testing the effect of accelerated rates of evolution on the ability of the method to detect functional coevolution. For such purpose, we selected free-living bacteria as well as endosymbiotic bacteria of the aphid insect Buchnera aphidicola. The effective population sizes of this endosymbiotic bacterium are subjected to strong bottlenecks during the transmission between generations. In addition, they present vertical transmission and no recombination, which ensures a high rate of slightly deleterious mutations being fixed due to genetic drift [34, 35]. GroEL performs a good example to examine coevolution because the evolutionary rate is among the slowest of the endosymbiotic proteins and yet is faster than its free-living bacterium homologue. Bacteria used for this study and the accession numbers of their GroEL and GroES sequences are collected in Table 1 of supplementary information.
In the coevolutionary analyses we followed the approach recently published to detect functionally or structurally important domains . In this study we focused on coevolving pairs previously detected as functionally important but also in those pairs three-dimensionally proximal to functionally important domains. The rationale behind this is that GroEL is a highly conserved molecule and most of the sites involved in ATP binding or GroES and substrates binding are very conserved as to be detected by conventional coevolutionary analyses. In these cases, pairs of amino acid sites do mainly coevolve to maintain the structural characteristics around important functional regions and consequently to maintain the conformational and functional stability of the domain . Even though circular, the opposite reasoning can be used for a primary detection of functionally/structurally important protein sites or domains. We therefore used this rationale to test whether identifying coevolving pairs of sites could allow us to identify known functionally important domains due to their three-dimensional proximity. For example, when a residue was involved in inter-protein coevolution we examined whether functionally important sites nearby any of the coevolving amino acid regions (less than 8 Å distance from the functionally important region) was previously reported. If that was the case, we supported the usefulness of the method to detect functional domains. Consequently, this approach could be also valid for identifying functionally important sites that show very low evolutionary rate and that would therefore be completely ignored by standard coevolutionary analyses. We have also used additional real cases of proteins which domains are well characterised, including the env protein of the Human Immunodeficiency Virus HIV-1 group M subtypes and the Heat-shock protein 90 (Hsp90). These proteins show different divergent levels and their functional domains are well characterised and are therefore highly suitable for the purpose of this work. The sequences used for env gene of HIV-1 and for Hsp90 and their accession numbers are shown in Tables 2 and 3 of Supplementary Information, respectively.
Hsp90 is an ATPase molecular chaperone that assists the conformational maturation of molecules involved in signal transduction and cell cycle regulation [37–41]. The function of Hsp90 depends largely on its dimerisation and several domains can be identified in the Hsp90 linear sequence (supplemental Table 1 in ). The fundamental function of Hsp90 relies on the complex intra-molecular interaction between the different domains, which are poorly understood. The env gene of HIV-1 is the gene yielding the functionally important proteins gp120 and gp41 that are involved in host cell recognition, binding and entry [42–44]. The coordinated function of gp120 and gp41 is translated into a complex evolutionary patterns observed in the env gene (i.e. [45, 46]) and even more complex coevolutionary patterns [25, 47]. Here we apply the improved method to detect coevolutionary patterns in the env gene of the HIV-1 subtype.
This work was supported by Science Foundation Ireland to M.A.F. S.O. and F.M.C. are supported by Marie Curie FP6 actions. We would like to thank David McNally for his assistance with Figure 6 of this manuscript. We are very grateful to reviewers who have contributed significantly to the improvement of this manuscript.
- Galitsky B: Revealing the set of mutually correlated positions for the protein families of immunoglobulin fold. In Silico Biol. 2003, 3 (3): 241-264.PubMedGoogle Scholar
- Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol. 1999, 287 (1): 187-198. 10.1006/jmbi.1998.2601.View ArticlePubMedGoogle Scholar
- Korber BT, Farber RM, Wolpert DH, Lapedes AS: Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci USA. 1993, 90 (15): 7176-7180. 10.1073/pnas.90.15.7176.PubMed CentralView ArticlePubMedGoogle Scholar
- Chelvanayagam G, Eggenschwiler A, Knecht L, Gonnet GH, Benner SA: An analysis of simultaneous variation in protein structures. Protein Eng. 1997, 10 (4): 307-316. 10.1093/protein/10.4.307.View ArticlePubMedGoogle Scholar
- Pollock DD, Taylor WR: Effectiveness of correlation analysis in identifying protein residues under going correlated evolution. Protein Eng. 1997, 10 (6): 647-657. 10.1093/protein/10.6.647.View ArticlePubMedGoogle Scholar
- Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW: Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol. 2000, 17 (1): 164-178.View ArticlePubMedGoogle Scholar
- Fares MA, Travers SA: A novel method for detecting intramolecular coevolution: adding a further dimension to selective constraints analyses. Genetics. 2006, 173 (1): 9-23. 10.1534/genetics.105.053249.PubMed CentralView ArticlePubMedGoogle Scholar
- Tuff P, Darlu P: Exploring a phylogenetic approach for the detection of correlated substitutions in proteins. Mol Biol Evol. 2000, 17 (11): 1753-1759.View ArticlePubMedGoogle Scholar
- Afonnikov DA, Oshchepkov DY, Kolchanov NA: Detection of conserved physico-chemical characteristics of proteins by analyzing clusters of positions with co-ordinated substitutions. Bioinformatics. 2001, 17 (11): 1035-1046. 10.1093/bioinformatics/17.11.1035.View ArticlePubMedGoogle Scholar
- Pritchard L, Bladon P, J MOM, M JD: Evaluation of a novel method for the identification of coevolving protein residues. Protein Eng. 2001, 14 (8): 549-555. 10.1093/protein/14.8.549.View ArticlePubMedGoogle Scholar
- Wang ZO, Pollock DD: Context dependence and coevolution among amino acid residues in proteins. Methods Enzymol. 2005, 395: 779-790.PubMed CentralView ArticlePubMedGoogle Scholar
- Shapiro B, Rambaut A, Pybus OG, Holmes EC: A phylogenetic method for detecting positive epistasis in gene sequences and its application to RNA virus evolution. Mol Biol Evol. 2006, 23 (9): 1724-1730. 10.1093/molbev/msl037.View ArticlePubMedGoogle Scholar
- Tillier ER, Lui TW: Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics. 2003, 19 (6): 750-755. 10.1093/bioinformatics/btg072.View ArticlePubMedGoogle Scholar
- Buck MJ, Atchley WR: Networks of coevolving sites in structural and functional domains of serpin proteins. Mol Biol Evol. 2005, 22 (7): 1627-1634. 10.1093/molbev/msi157.View ArticlePubMedGoogle Scholar
- Clarke ND: Covariation of residues in the homeodomain sequence family. Protein Sci. 1995, 4 (11): 2269-2278.PubMed CentralView ArticlePubMedGoogle Scholar
- Wollenberg KR, Atchley WR: Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proc Natl Acad Sci USA. 2000, 97 (7): 3288-3291. 10.1073/pnas.070154797.PubMed CentralView ArticlePubMedGoogle Scholar
- Martin LC, Gloor GB, Dunn SD, Wahl LM: Using information theory to search for co-evolving residues in proteins. Bioinformatics. 2005, 21 (22): 4116-4124. 10.1093/bioinformatics/bti671.View ArticlePubMedGoogle Scholar
- Codoner FM, Fares MA, Elena SF: Adaptive covariation between the coat and movement proteins of prunus necrotic ringspot virus. J Virol. 2006, 80 (12): 5833-5840. 10.1128/JVI.00122-06.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhong J, Gastaminza P, Chung J, Stamataki Z, Isogawa M, Cheng G, McKeating JA, Chisari FV: Per sistent hepatitis C virus infection in vitro: coevolution of virus and host. J Virol. 2006, 80 (22): 11082-11093. 10.1128/JVI.01307-06.PubMed CentralView ArticlePubMedGoogle Scholar
- Pearl LH, Prodromou C: Structure and mechanism of the hsp90 molecular chaperone machinery. Annu Rev Biochem. 2006, 75: 271-294. 10.1146/annurev.biochem.75.103004.142738.View ArticlePubMedGoogle Scholar
- Chen B, Vogan EM, Gong H, Skehel JJ, Wiley DC, Harrison SC: Structure of an unliganded simian immunodeficiency virus gp120 core. Nature. 2005, 433 (7028): 834-841. 10.1038/nature03327.View ArticlePubMedGoogle Scholar
- Kwong PD, Wyatt R, Robinson J, Sweet RW, Sodroski J, Hendrickson WA: Structure of an HIV gp120 envelope glycoprotein in complex with the CD4 receptor and a neutralizing human antibody. Nature. 1998, 393 (6686): 648-659. 10.1038/31405.View ArticlePubMedGoogle Scholar
- Wyatt R, Kwong PD, Desjardins E, Sweet RW, Robinson J, Hendrickson WA, Sodroski JG: The antigenic structure of the HIV gp120 envelope glycoprotein. Nature. 1998, 393 (6686): 705-711. 10.1038/31514.View ArticlePubMedGoogle Scholar
- Pantophlet R, Burton DR: GP120: target for neutralizing HIV-1 antibodies. Annu Rev Immunol. 2006, 24: 739-769. 10.1146/annurev.immunol.24.021605.090557.View ArticlePubMedGoogle Scholar
- Travers SA, Fares MA: Functional coevolutionary networks of the Hsp70-Hop-Hsp90 system revealed through computational analyses. Mol Biol Evol. 2007, 24 (4): 1032-1044. 10.1093/molbev/msm022.View ArticlePubMedGoogle Scholar
- Pritchard L, Dufton MJ: Do pr oteins learn to evolve? The Hopfield network as a basis for the under standing of protein evolution. J Theor Biol. 2000, 202 (1): 77-86. 10.1006/jtbi.1999.1043.View ArticlePubMedGoogle Scholar
- Yang Z: PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 1997, 13 (5): 555-556.PubMedGoogle Scholar
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci. 1992, 8 (3): 275-282.PubMedGoogle Scholar
- Landry SJ, Zeilstra-Ryalls J, Fayet O, Georgopoulos C, Gierasch LM: Characterization of a functionally important mobile domain of GroES. Nature. 1993, 364 (6434): 255-258. 10.1038/364255a0.View ArticlePubMedGoogle Scholar
- Deuerling E, Schulze-Specking A, Tomoyasu T, Mogk A, Bukau B: Trigger factor and DnaK cooperate in folding of newly synthesized proteins. Nature. 1999, 400 (6745): 693-696. 10.1038/23301.View ArticlePubMedGoogle Scholar
- Thulasiraman V, Yang CF, Frydman J: In vivo newly translated polypeptides are sequestered in a protected folding environment. Embo J. 1999, 18 (1): 85-95. 10.1093/emboj/18.1.85.PubMed CentralView ArticlePubMedGoogle Scholar
- Braig K, Adams PD, Brunger AT: Conformational variability in the refined structure of the chaperonin GroEL at 2.8 A resolution. Nat Struct Biol. 1995, 2 (12): 1083-1094. 10.1038/nsb1295-1083.View ArticlePubMedGoogle Scholar
- Braig K, Otwinowski Z, Hegde R, Boisvert DC, Joachimiak A, Horwich AL, Sigler PB: The crystal structure of the bacterial chaperonin GroEL at 2.8 A. Nature. 1994, 371 (6498): 578-586. 10.1038/371578a0.View ArticlePubMedGoogle Scholar
- Moran NA: Accelerated evolution and Muller 's rachet in endosymbiotic bacteria. Proc Natl Acad Sci USA. 1996, 93 (7): 2873-2878. 10.1073/pnas.93.7.2873.PubMed CentralView ArticlePubMedGoogle Scholar
- Herbeck JT, Funk DJ, Degnan PH, Wernegreen JJ: A conservative test of genetic drift in the endosymbiotic bacterium Buchner a: slightly deleterious mutations in the chaperonin groEL. Genetics. 2003, 165 (4): 1651-1660.PubMed CentralPubMedGoogle Scholar
- Gloor GB, Martin LC, Wahl LM, Dunn SD: Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry. 2005, 44 (19): 7156-7165. 10.1021/bi050293e.View ArticlePubMedGoogle Scholar
- Pratt WB: The hsp90-based chaperone system: involvement in signal transduction from a variety of hormone and growth factor receptors. Proc Soc Exp Biol Med. 1998, 217 (4): 420-434.View ArticlePubMedGoogle Scholar
- Buchner J: Hsp90 & Co. – a holding for folding. Trends Biochem Sci. 1999, 24 (4): 136-141. 10.1016/S0968-0004(99)01373-0.View ArticlePubMedGoogle Scholar
- Caplan AJ: Hsp90's secrets unfold: new insights from structural and functional studies. Trends Cell Biol. 1999, 9 (7): 262-268. 10.1016/S0962-8924(99)01580-9.View ArticlePubMedGoogle Scholar
- Mayer MP, Bukau B: Hsp70 chaperone systems: diversity of cellular functions and mechanism of action. Biol Chem. 1998, 379 (3): 261-268.PubMedGoogle Scholar
- Mayer MP, Bukau B: Molecular chaperones: the busy life of Hsp90. Curr Biol. 1999, 9 (9): R322-325. 10.1016/S0960-9822(99)80203-6.View ArticlePubMedGoogle Scholar
- Deng H, Liu R, Ellmeier W, Choe S, Unutmaz D, Burkhart M, Di Marzio P, Marmon S, Sutton RE, Hill CM, et al: Identification of a major co-receptor for primary isolates of HIV-1. Nature. 1996, 381 (6584): 661-666. 10.1038/381661a0.View ArticlePubMedGoogle Scholar
- Dragic T, Litwin V, Allaway GP, Martin SR, Huang Y, Nagashima KA, Cayanan C, Maddon PJ, Koup RA, Moore JP, et al: HIV-1 entr y into CD4+ cells is mediated by the chemokine receptor CC-CKR-5. Nature. 1996, 381 (6584): 667-673. 10.1038/381667a0.View ArticlePubMedGoogle Scholar
- Feng Y, Broder CC, Kennedy PE, Berger EA: HIV-1 entry cofactor: functional cDNA cloning of a seven-transmembrane, G protein-coupled receptor. Science. 1996, 272 (5263): 872-877. 10.1126/science.272.5263.872.View ArticlePubMedGoogle Scholar
- de Oliveira T, Salemi M, Gordon M, Vandamme AM, van Rensburg EJ, Engelbrecht S, Coovadia HM, Cassol S: Mapping sites of positive selection and amino acid diversification in the HIV genome: an alternative approach to vaccine design?. Genetics. 2004, 167 (3): 1047-1058. 10.1534/genetics.103.018135.PubMed CentralView ArticlePubMedGoogle Scholar
- Travers SA, O'Connell MJ, McCormack GP, McInerney JO: Evidence for heterogeneous selective pressures in the evolution of the env gene in different human immunodeficiency virus type 1 subtypes. J Virol. 2005, 79 (3): 1836-1841. 10.1128/JVI.79.3.1836-1841.2005.PubMed CentralView ArticlePubMedGoogle Scholar
- Poon AF, Lewis FI, Pond SL, Frost SD: Evolutionary Inter actions between NLinked Glycosylation Sites in the HIV-1 Envelope. PLoS Comput Biol. 2007, 3 (1): e11-10.1371/journal.pcbi.0030011.PubMed CentralView ArticlePubMedGoogle Scholar
- Fenton WA, Kashi Y, Furtak K, Horwich AL: Residues in chaperonin GroEL required for polypeptide binding and release. Nature. 1994, 371 (6498): 614-619. 10.1038/371614a0.View ArticlePubMedGoogle Scholar
- Buckle AM, Zahn R, Fersht AR: A structural model for GroEL-polypeptide recognition. Proc Natl Acad Sci USA. 1997, 94 (8): 3571-3575. 10.1073/pnas.94.8.3571.PubMed CentralView ArticlePubMedGoogle Scholar
- Brocchieri L, Karlin S: Conservation among HSP60 sequences in relation to structure, function, and evolution. Protein Sci. 2000, 9 (3): 476-486.PubMed CentralView ArticlePubMedGoogle Scholar