Coevolution of amino acid residues in the key photosynthetic enzyme Rubisco
© Wang et al; licensee BioMed Central Ltd. 2011
Received: 15 June 2011
Accepted: 23 September 2011
Published: 23 September 2011
Skip to main content
© Wang et al; licensee BioMed Central Ltd. 2011
Received: 15 June 2011
Accepted: 23 September 2011
Published: 23 September 2011
One of the key forces shaping proteins is coevolution of amino acid residues. Knowing which residues coevolve in a particular protein may facilitate our understanding of protein evolution, structure and function, and help to identify substitutions that may lead to desired changes in enzyme kinetics. Rubisco, the most abundant enzyme in biosphere, plays an essential role in the process of carbon fixation through photosynthesis, thus facilitating life on Earth. This makes Rubisco an important model system for studying the dynamics of protein fitness optimization on the evolutionary landscape. In this study we investigated the selective and coevolutionary forces acting on large subunit of land plants Rubisco using Markov models of codon substitution and clustering approaches applied to amino acid substitution histories.
We found that both selection and coevolution shape Rubisco, and that positively selected and coevolving residues have their specifically favored amino acid composition and pairing preference. The mapping of these residues on the known Rubisco tertiary structures showed that the coevolving residues tend to be in closer proximity with each other compared to the background, while positively selected residues tend to be further away from each other. This study also reveals that the residues under positive selection or coevolutionary force are located within functionally important regions and that some residues are targets of both positive selection and coevolution at the same time.
Our results demonstrate that coevolution of residues is common in Rubisco of land plants and that there is an overlap between coevolving and positively selected residues. Knowledge of which Rubisco residues are coevolving and positively selected could be used for further work on structural modeling and identification of substitutions that may be changed in order to improve efficiency of this important enzyme in crops.
Coevolution is one of the few paramount forces acting on all levels of biological organization from bioms to nucleotides. Observations of the complementary adaptations in two or more species caused by mutual selection pressures have started from Darwin's (1862) work on orchids and their pollinators and resulted in theoretical generalizations such as 'Red Queen Hypothesis' [1, 2]. More recently concepts and methodologies developed for the study of species coevolution were applied to the growing wealth of molecular data, in particular for detection of coevolution between and within proteins . Identifying coevolving positions in proteins allows better understanding of their structure and function and paves the road to engineering proteins with desired properties. Several computational methods have been proposed to detect coevolving residues from multiple sequence alignments (e.g., [4–8]). Best approaches strive to disentangle patterns created by coevolution and those due to shared ancestry (phylogenetic correlation) and stochasticity (random error). Based on recent comparative evaluation of the state-of-art techniques to detect coevolution , here we use one of the top performing approaches implemented in CoMap  to study the coevolution of residues in a key photosynthetic enzyme Rubisco.
Rubisco (ribulose-1,5-bisphosphate carboxylase/oxygenase, EC 18.104.22.168) is the key enzyme of the Calvin cycle, catalyzing the fixation of inorganic carbon dioxide to organic sugars. Rubisco is a gateway for inorganic carbon, which is present in all light-dependent ecosystems. However, due to the poor turnover rate and competition between O2 and CO2 at the active site, Rubisco is often the rate-limiting step of the photosynthesis . These properties of Rubisco coupled with its high concentration in photosynthesizing organs make it the most abundant enzyme on Earth . Both biospheric importance and intracellular abundance of Rubisco stimulated plenitude of molecular studies using Rubisco as a model system (reviewed in ), but despite significant progress our understanding of Rubisco functioning and evolution is still far from complete .
Land plants and green algae have type IB Rubisco, which is a hexadecamer consisting of eight large, plastid encoded, and eight small, nuclear encoded, subunits. Large subunits which possess the active site of Rubisco are encoded by the chloroplast gene rbcL, which over three decades ago was among the first fully sequenced genes  and since that time became one of the most often sequenced genes thanks to its wide use in phylogenetics of plants and algae (e.g. ). Plant systematists have mainly used rbcL paying little attention to its function. However, recently positive selection acting on rbcL was found in the most lineages of land plants . The mapping of the positively selected residues on Rubisco tertiary structure revealed that they are located in regions important for dimer-dimer, intradimer, large subunit-small subunit and Rubisco-Rubisco activase interactions, and that some of the positively selected residues are close to the active site . Positive selection on Rubisco is in concert with well-known variation in Rubisco kinetics found in different species (e.g. ) and its correlation with environmental parameters (e.g. ). Positive selection has been shown as a driving force for kinetic differences in Rubiscos of C3 and C4 plants [19, 20].
Coevolutionary studies have been applied to a few important proteins and provided new information about protein-protein interactions, ligand-receptor bindings, and the 3D protein structure [8, 21–23]. Here we study the coevolution and positive selection on Rubisco large subunit using 142 data sets of the rbcL gene representing the main lineages of land plants (for detailed description see ). Our aim is to provide a better insight into the patterns of groups of non-independent sites and positively selected sites as well as to find their amino acid composition, pairing preference, and spatial distribution.
Known interactions of the inferred coevolving residues
15, 63, 64, 106, 109, 121, 126, 128, 129, 131, 176, 180, 205, 207, 208, 209, 211, 271, 297, 408, 413, 461
34, 105, 142**, 143, 146, 147, 162, 164, 216, 249, 285, 286
Small Subunit (SSU)
76, 163, 166, 223, 226, 227, 229, 230, 260, 261*, 397, 433, 453, 454
DD and ID
SSU and DD
219, 258, 288
SSU and ID
Presumably, sites coevolving for certain biochemical properties (charge, volume, polarity and Grantham distance) may have different amino acid composition preferences. Thus, we studied how amino acid frequency was different in residues that coevolve compared to the whole sequences. All coevolving sites had higher proportion of A, C, E, F, K, V, T compared with whole sequences (Figure 2 and Additional file 1). Among these residues, A, C, F, V are hydrophobic and E, K, T are hydrophilic. Amino acids W, Q, M and I were underrepresented in all coevolving sites. Proportion of A, D, K, L, T and V were significantly higher in coevolving sites detected by polarity (Figure 1a). Proportion of D, F, H, K, P were significantly higher in coevolving sites detected by Grantham (Figure 1b). Proportion of K, D and G were significantly higher in coevolving sites detected by charge (Figure 1c) compared with whole sequences. Frequencies of D, F, K, L V, Y were significantly higher in sites coevolving for volume (Figure 1d). Thus, our results demonstrate that frequencies of certain amino acids at coevolving sites are significantly different from the amino acid composition found in the whole sequence.
Fourteen of the most often positively selected residues of the Rubisco large subunit
Location of residues
Residues within 9 Å4
374, 375, 376, 396, 399, 400, 401, 402, 407, 410, 411, 445, 446, 447, 448, 449,451, 452, 453, 454
ID, SSU, DD
154, 155, 184, 187, 189, 219, 220, 221, 223, 224, 226, 227
DD, SSU, ID
209, 210, 213, 217,247, 248, 249, 250, 252, 253
DD, ID, SSU
142**, 143, 144, 146, 147, 148, 149, 150, 281, 282, 314, 315, 316, 364, 365, 366
140,141, 143, 144, 145, 272, 276, 311, 312, 313, 314, 315, 364, 365
23, 25, 26, 27, 54, 70, 71, 72, 73, 74, 93, 94, 96. 97. 98. 99
23, 86**, 326*, 332
413, 414, 417, 435, 436, 437, 438, 440, 441, 442, 443, 444, 446, 447
33*, 152, 151, 153, 135*, 281*, 310*, 440*, 470*
180, 181, 182, 184, 185, 186, 215, 216, 217, 218, 220, 221, 222, 223, 224, 225,227
121, 388, 423
DD, SSU, ID
143, 144, 152, 249, 253, 254, 274, 285, 286, 277, 278, 280, 281
319, 320, 321, 322, 323, 324, 326, 327, 329, 330, 332, 333, 462, 464
373, 374, 376, 377, 378, 379, 393, 396, 410, 411, 414, 436, 446, 449, 450, 453
AS, ID, SSU
190, 228, 229, 230, 231, 248, 253, 254, 256, 257, 280, 281, 282, 283, 315, 316
101, 86**, 167, 149*, 169*, 256*, 320*, 371*, 398
26, 27, 29, 30, 76, 91, 94, 128, 129, 130, 131, 132
19, 355*, 93*
33, 34, 35, 36, 37, 39, 41, 81, 84, 85, 87, 88
149*, 256*, 167, 169*, 222*, 317*, 320*, 371*, 398, 23, 30*
The protein secondary structure elements helix, strand and coil have different physical and chemical properties, thus play distinct roles in the protein tertiary structure and function. So the evolutionary force may vary among different secondary structures. In Drosophila proteins the coil regions are more likely to be under positive selection than expected, while the helices and strands undergo less positive selection .
The secondary structure of the large subunits of Rubisco is conserved throughout land plants, despite the variation in primary sequences . The helix parts are usually amphipathic with one side hydrophobic and the other side hydrophilic, thus the structured regions can occur anywhere in the protein and involve the largest proportion of residues in Rubisco large subunit. The strands often contain hydrophobic residues and could form a well-structured parallel or anti-parallel beta sheet. The active site of Rubisco is located at the carboxy-terminal end of the beta strand . The coil is the most flexible element without ordered structure and assists the conformational change of the protein. Loop 6 of Rubisco large subunit is conserved in land plants and green algae. It is crucial for the catalytic process because it controls the opened or closed state of the enzyme, which influences the association of the substrate . It was shown that residues in mobile regions of the protein tend to evolve in highly correlated fashion, participating in physical and functional contacts during their motion .
Overall, this shows that evolutionary forces are unevenly distributed on the large subunit of Rubisco, with the helical parts of the structure more frequently affected by coevolution and positive selection compared to other parts. Interestingly, our results differ from ones obtained for Drosophila proteins where less than expected selection was found in helices .
In order to compare the distribution of physical distances between the coevolving residues and all the residues in the LSU of Rubisco, we used four known 3D structures of spinach and tobacco from PDB both in activated and non-activated states. For each PDB record, distances between the center masses of any two residues in the protein 3D structure were calculated. The coevolving residues were mapped onto the PDB structures, and all the pair-wise combinations of the coevolving sites within a group were listed. The corresponding distances of all the coevolving pairs were collected. The minimum pair-wise distances between residues for both the activated and unactivated state of each species were calculated and the smallest value was chosen for further comparisons. It is said that two residues are in physical contact, if the distance between them is under a certain threshold. In some studies, they use the distance between two beta carbons (Cβ) or two alpha carbons (Cα) of the corresponding residues, with the direct contact threshold of 8Å [22, 23]. However, this method only considers one point of the residue, so that the possible position conformations of the other part of the molecule are neglected. In this study, distances between the center mass of the residues were calculated, thus the residue molecule was considered as a whole.
Physical distances between residue pairs
Coevolving sites distance
Positively selected sites distance
Positively selected site/active site distance
Next we analyzed physical proximity of the 14 residues most frequently found under positive selection in the LSU. Interestingly, these positively selected sites showed an opposite trend, as they were significantly further from each other than the background (Table 3), again showing different pattern from the Drosophila proteins . The distances between active sites and sites under positive selection tended to be shorter compared with the background, although not significantly (possibly due to small samples).
The functionally and structurally important sites in the protein are usually more conserved than other sites. But in some cases, one mutation at a crucial site may be compensated by mutations at other sites, so to maintain vital interactions and functions . Our study shows that the coevolving and positively selected sites tend to be located within the functionally and structurally important regions of Rubisco. Substitutions have to be compatible with the protein function and be structurally stable. Therefore, the amino acid composition and the residues pairing preference of the sites under coevolution and positive selection can provide a better insight in terms of protein evolution. Our molecular evolutionary analysis reveals that different evolutionary forces may have distinct amino acid composition and pairing preferences. The coevolving residue composition may not be too different from that of the background because they are wide spread across the sequence, while the positive selection is quite different. Moreover, the groups of non-independent residues have their pairing preference. Based on the amino acid pairing frequency matrix with different biochemical properties, the distinct patterns of coevolving pairs could provide a hint for the further analysis about the mutual information.
Our study indicates that coevolving sites are in closer proximity in the tertiary structure of the Rubisco large subunit. Predicting protein tertiary structure from the primary sequence is a crucial problem in computational biology. The physical interactions between coevolving residues could help to build a residue contact map in protein tertiary structure analysis. Moreover, many residues which coevolve or under positive selection are found in the functionally or structurally important locations, such as dimer-dimer, intradimer, active site and small subunit interactions. Our results appear to be in agreement with the study of Yeang and Haussler , who proposed that in large protein families coevolving positions are spatially coupled and many of the coevolving positions are located at the functionally or structurally important positions. Furthermore, we find that many sites are both under positive selection and coevolution, suggesting that selection towards a new optimum may require more than one substitution. Indeed, multiple neutral changes along the mutational landscape of a protein may precede mutations with high advantageous fitness effect .
Because of the importance of Rubisco, it has been the target of genetic engineering for a long time. Aspects including structure, function and evolution of this enzyme have been studied with the aim to improve its kinetics. Nowadays, the experimental way of random mutagenesis and bioselection could be used to identify mutations that influence important properties of Rubisco [13, 28–30], but the vast amount of candidates and the repetitive lab work make the process slow, unpredictable and tedious. Knowledge of location of coevolving or positively selected residues may be used to design future mutagenesis experiments and accelerate efforts to engineer better Rubisco, which would potentially increase the yield of agriculturally important crops.
We used 142 rbcL data sets from . The sequences were assigned to each data set according to their phylogenetic relations. Each data set had 11 to 40 sequences. The codon alignments were constructed from DNA sequences by back-translating from amino acid alignments. Sequences within each data set were truncated to the same length. Angiosperms, gymnosperms, ferns, and mosses were represented by 122, 8, 9 and 4 datasets, respectively.
For each alignment a phylogeny was reconstructed using maximum likelihood as implemented in PhyML v3.0 [31, 32]. During the inference we used amino acid models WAG  and LG , both with Γ-rate variation. The tree space was traversed using the combination of NNI and SPR heuristics . The inferred phylogenies were used for further coevolution and positive selection analyses.
To detect coevolving residues we used a clustering approach that searches for ancestral co-substitutions or for compensatory changes by correlating amino acid substitution histories, as implemented in the R-program CoMap [6, 35]. Substitution numbers for each branch were sampled from a posterior distribution based on a Markov substitution model and a phylogeny with branch lengths relating the sequences. Parameters of the models and tree branches were estimated by maximum likelihood prior to the sampling. Substitutions were weighted by different biochemical properties (charge, polarity, volume and Grantham) to detect coevolutionary trends specific to amino acid properties . The amount of the biochemical change for one site was represented by weighted substitution vectors, containing weighted substitution counts for each branch of a phylogeny. The correlated or compensatory evolution was estimated based on the correlation coefficient of the substitution vectors. To select candidate groups, the complete linkage hierarchical clustering was applied to distance matrices based on the correlation and compensation statistics . To asses the significance of inferred clusters, the parametric bootstrap with 1000 replicates was used to generate the joint null distribution of minimum site variability together with coevolution or compensation statistic ρ, as described in . From such empirical distribution, p-values for each candidate cluster with observed statistic ρ obs were computed as Pr(ρ > ρ obs | N min), i.e., by conditioning on minimum site variability N min of the cluster. Clusters with p-value ≤ 0.05 were considered as evolving non-independently. A simulation procedure described in  was used to correct for multiple non-independent tests.
Positive selection on the protein level was measured using the ω ratio, which is the ratio of nonsynonymous to synonymous substitution rates per site. Negative selection results in lower nonsynonymous rate relative to synonymous and so ω < 1. If the nonsynonymous mutations are favored, the nonsynonymous rate should be higher than synonymous, and so ω > 1 indicates evolution by positive selection. Here we estimated selective pressure on rbcL using different types of Markov models of codon evolution: (1) site-specific codon models that allow variation of selective pressure among sites in a sequence, and (2) switching codon models that allow variation of selective pressure among sites and over the evolutionary time. All branch lengths of inferred phylogenies were re-optimized under codon models.
Site-specific codon models were used to test each alignment for positive selection using likelihood ratio tests (LRTs) of nested models: M0 (one ratio) vs M3 (discrete), M1a (neutral) vs M2a (selection), and M7 (beta) vs M8 (beta & ω) . These analyses were performed using the codeml program of the PAML package [37, 38]. If the LRT for positive selection was significant, Bayes naïve empirical Bayesian approach was used to infer sites under positive selection . Sites that were inferred to be under positive selection using model M8, if site's posterior probability for the positive selection class was ≥ 0.95.
We also used switching Markov modulated codon models to detect episodes of positive selection during the evolution of Rubisco, as implemented in the program FitModel . Each switching model used in our study allows three possible selective regimes for codon sites, for example like site model M2a with classes for positive (ω > 1), neutral (ω = 1), and negative selection (ω < 1), or like model M2 with 3 classes with no constraints on the ω ratio. Unlike site model, switching models allow each codon site to change the selective regime, and thus be affected by different selective pressures at different time points. This is accomplished by using an additional Markov process to describe the switches between selection regimes at any individual site. We used models both with bias (+S2) and with no bias (+S1) for switching between selective regimes. For each alignment we used LRTs to test whether switches of selective pressure over time occurred (M2a vs M2a+S1 and M3 vs M3+S1), and for alignments with significant evidence for switches we also tested whether there was switching bias (M2a+S1 vs M2a+S2 and M3+S1 vs M3+S2). Sites with episodes under positive selection were detected a posteriori using the Bayesian approach .
The analyses of location, properties and the distance analysis of residues in the protein structure were performed using VMD viewer , a program for visualization, manipulation and analysis of large molecules in three dimensions. Command options were used to extract information about sets of molecules, vectors and coordinates. The center mass of a molecule was computed using the Tcl language of VMD.
M.W. was supported by the ETH Zurich scholarship for international master students. M.V.K. was supported by Natural Environment Research Council UK. M.A. was supported by ETH Zurich and was also receiving funding from the Swiss National Science Foundation (research grant 31003A_127325).
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.