Widespread positive selection in the photosynthetic Rubisco enzyme
© Kapralov and Filatov. 2007
Received: 28 November 2006
Accepted: 11 May 2007
Published: 11 May 2007
The Rubisco enzyme catalyzes the first step in net photosynthetic CO2 assimilation and photorespiratory carbon oxidation and is responsible for almost all carbon fixation on Earth. The large subunit of Rubisco is encoded by the chloroplast rbcL gene, which is widely used for reconstruction of plant phylogenies due to its conservative nature. Plant systematists have mainly used rbcL without regard to its function, and the question of whether it evolves under Darwinian selection has received little attention. The purpose of our study was to evaluate how common positive selection is in Rubisco among the phototrophs and where in the Rubisco structure it occurs.
We searched for positive selection in rbcL sequences from over 3000 species representing all lineages of green plants and some lineages of other phototrophs, such as brown and red algae, diatoms, euglenids and cyanobacteria. Our molecular phylogenetic analysis found the presence of positive selection in rbcL of most analyzed land plants, but not in algae and cyanobacteria. The mapping of the positively selected residues on the Rubisco tertiary structure revealed that they are located in regions important for dimer-dimer, intradimer, large subunit-small subunit and Rubisco-Rubisco activase interactions, and that some of the positively selected residues are close to the active site.
Our results demonstrate that despite its conservative nature, Rubisco evolves under positive selection in most lineages of land plants, and after billions of years of evolution Darwinian selection still fine-tunes its performance. Widespread positive selection in rbcL has to be taken into account when this gene is used for phylogenetic reconstructions.
Ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco; EC 4.1.1.39) is the cornerstone of photosynthesis, responsible for the conversion of inorganic carbon into organic compounds. Rubisco, "the most abundant protein in the world", comprises up to 50% of all soluble proteins in plants, which is the price phototrophs have to pay for the enzyme's relatively slow and inefficient performance. Rubisco confuses the substrate of photosynthesis, CO2, with the product, O2, resulting in energy-wasting photorespiration. As the performance of this enzyme may greatly affect crop yields, significant efforts have been made to study the structure and function of Rubisco, with the aim of artificially improving its performance. Significant natural variation among the kinetic parameters of Rubiscos from different species, and the dependence of this variation mainly on environmental pressure rather than phylogenetic factors, indicate possible positive selection to optimize CO2/O2 specificity and maximize the rate of catalytic turnover of Rubisco in different thermal and gaseous conditions.
Rubisco is one of the slowest and largest enzymes, with a molecular mass of 560 kDa . In land plants and green algae, the chloroplast rbcL gene encodes the 55-kDa large subunit, whereas a family of rbcS nuclear genes encodes nearly identical 15-kDa small subunits [5, 6]; in nongreen algae both the rbcL and rbcS genes are chloroplast encoded . The Form I Rubisco of plants and algae is a hexadecameric protein composed of eight large and eight small subunits, while the Form II Rubisco of some prokaryotes and dinoflagellates consists of a large subunit dimer . Because large subunits of Form II enzymes contain all the structural elements required for catalysis, the origin and the role of small subunits in Form I enzymes remain enigmatic .
Being one of the most conservative genes, rbcL is often chosen by botanists for phylogenetic reconstructions and it has been sequenced in thousands of plant species [8–10]. Surprisingly, despite Rubisco's great physiological importance, well-studied protein structure and abundance of sequence data, "the systematists have generally treated rbcL sequences as strings of anonymous nucleotides, without function". While degeneration or loss of rbcL in parasitic non-photosynthetic flowering plants has attracted much attention [11, 12], little is known about rbcL evolution in other groups. The highly conservative nature of rbcL is probably the reason for the lack of enthusiasm for the evolutionary analysis of this gene, and in particular for the study of putative positive selection acting on rbcL. The only exception to this trend was the analysis of rbcL evolution in the thermotolerant cyanobacterial genus Synechococcus, which showed an increase in the number of hydrophobic residues in the rbcL sequences of more thermotolerant strains, a pattern observed for many thermostable proteins. However, the dataset of eight sequences used in the Synechococcus study could be too small to detect positive selection using likelihood-based molecular phylogenetic analysis.
Previously we reported positive selection in the rbcL gene associated with an adaptive radiation in the Hawaiian endemic genus Schiedea of the Caryophyllaceae family. Interestingly, the Schiedea study demonstrated that adaptive substitutions in rbcL could have caused the spread of an advantageous haplotype across several closely related species, causing incongruence between the chloroplast and species phylogenies. However, positive selection in rbcL of Schiedea could have been driven by adaptation to contrasting environments (e.g. rainforest vs. dry coastal cliffs), and it remains unclear how common positive selection is in rbcL of other phototrophs. This motivated us to conduct a wider study of positive selection in the rbcL gene. Here we report the phylogeny-based and protein structure-based analyses of positive selection in 3228 rbcL sequences representing all the main lineages of green plants and some other phototrophs, such as brown and red algae, diatoms and euglenids, as well as cyanobacteria. We found that while there is no clear evidence for positive selection in cyanobacteria and algae, positive Darwinian selection in rbcL is fairly common in all the main lineages of land plants (mosses, ferns and allies, gymnosperms, angiosperms). Using computational analyses of the tertiary structure of Rubisco we show that positively selected residues are mainly located in regions important for dimer-dimer, intradimer, large subunit-small subunit and Rubisco-Rubisco activase interactions.
[Table: taxonomic groups analyzed. The Plantae Kingdom (plants): Pinophyta and Gnetophyta (gymnosperms); Charophyta (green algae). Plastid-carrying eukaryotes which are not in the Plantae Kingdom: Phaeophyta (brown algae); Rhodophyta (red algae); Euglenida (euglenid protists).]
[Table: LRT summary statistics, including the 11th class from M8, by lineage (e.g. ferns and allies).]
In all analyzed lineages of land plants (mosses, ferns and allies, gymnosperms, angiosperms), positive selection was found in most cases (Table 2). The smallest proportion of cases with detected positive selection and the lowest average dN/dS value of the "eleventh" class in the M8 model were in mosses (75% and 3.8, respectively); the highest were in gymnosperms (88% and 8.6, respectively). There was no significant difference between the lineages of land plants in the proportion of groups with positive selection (2 × 2 contingency χ2 tests with Yates' correction). Among the main lineages of flowering plants (magnoliids, monocots, commelinids, eudicots (including eudicots and core eudicots), rosids (including rosids, rosids I and rosids II) and asterids (including asterids, euasterids I and euasterids II); all groups sensu APG II), the smallest proportion of cases with detected positive selection was in monocots (61%) and the highest was in asterids (89%) (Additional file 2). There was no significant difference between the lineages of flowering plants in the proportion of groups with positive selection (2 × 2 contingency χ2 tests with Yates' correction).
While positive selection is widespread in land plants, we did not find it in algae (including green, brown and red algae, diatoms and euglenids) or cyanobacteria (Table 2). Positive selection has been detected in a combined rbcL dataset of 500 sequences of land plants and algae (M. Anisimova and V. Savolainen, personal communication). The fact that positive selection was detected in a large combined dataset may simply mean that the signal of positive selection in land plants is strong enough to remain detectable even when averaged with lineages showing no positive selection, such as algae.
As increasing the number of sequences should increase the sensitivity of the analysis, we joined the sets of green algae (Charophyta-1 + Charophyta-2, 49 sequences in total) and of brown algae (Phaeophyta-1 + Phaeophyta-2, 48 sequences in total). Again there was no evidence for positive selection in the joint algae datasets in either the M7-M8 or the M8a-M8 comparisons. Although land plants were much better represented in our dataset (143 groups) than algae (7 groups) and cyanobacteria (1 group), the proportion of groups with positive selection was significantly larger in land plants than in algae and cyanobacteria combined (2 × 2 contingency χ2 with Yates' correction = 20.3, P < 0.00001). In fact, this difference between land plants and algae is conservative, given that sequence divergence is higher in the algae and cyanobacteria datasets than in the land plant datasets (Table 2). The power of LRTs increases with sequence divergence until it reaches its maximal value, after which further increases in sequence divergence lead to reduced power. The tree length values (the expected number of nucleotide substitutions per codon along the tree) for the algae and cyanobacteria datasets ranged from 1.3 to 8.6 and were in the optimal range for detection of positive selection. The tree length values for most of the land plant datasets were smaller than optimal (< 1; Table 2 and Additional file 2), hence the number of cases with detected positive selection in land plants may be underestimated.
Could the difference in the presence of positive selection between land plants and algae be due to structural differences between their Rubiscos? There are prominent differences in the Rubisco protein structures within the paraphyletic algae group, in which no positive selection was detected (the "green-like" Rubiscos of green algae, euglenids and cyanobacteria vs. the "red-like" Rubiscos of red and brown algae and diatoms), while the Rubisco structures of land plants, with widespread positive selection, and of green algae, with no selection detected, are virtually identical. Structural differences therefore cannot explain the contrast between algae, which show no evidence for positive selection in rbcL, and land plants, where positive selection in this gene is ubiquitous.
An aquatic habitat is one of the few parameters shared by most algae and cyanobacteria but not by most land plants; thus the presence of positive selection in Rubiscos of land plants, but not in those of algae and cyanobacteria, may be due to ecological differences between aquatic and terrestrial habitats. Thermal and water regimes are more stable in aquatic than in terrestrial environments. Furthermore, algae have a "bicarbonate pump", the CO2-concentrating mechanism that uses bicarbonate dissolved in water and suppresses the oxygenating activity of Rubisco, making the gaseous conditions of Rubisco performance more stable. Interestingly, aquatic land plants, a small group of angiosperms with a submerged aquatic lifestyle as a derived state, also use the bicarbonate pump and hence may be expected to show a weaker signal of positive selection than their terrestrial relatives. Indeed, the sets that consisted exclusively of submerged aquatic angiosperms (monocots-4 and 9) or contained a high proportion of them (monocots-5 and 7) did not show any evidence for positive selection in rbcL (Additional file 2). However, when aquatic land plants from sets monocots-4 and 9 were analyzed together (including submerged aquatic plants from other monocot groups), both the M7-M8 and the M8a-M8 comparisons confirmed the presence of positive selection, suggesting that its signal in aquatic land plants is too weak to be detected in the analyses of smaller groups (joint group of submerged aquatic monocot plants: N sequences = 67; M7-M8: χ2 = 26.72, P = 0.00001; M8a-M8: χ2 = 4.78, P = 0.0144). Thus, it seems likely that aquatic conditions require less fine-tuning of Rubisco activity by positive selection than terrestrial habitats do.
Our analysis of 3228 rbcL sequences revealed the presence of positive selection in 78% of analyzed land plant groups, but not in algae and cyanobacteria. The finding of widespread positive selection in Rubisco suggests either that selection still continues to improve the performance of this ancient, critically important enzyme, or that adaptive evolution in rbcL reflects the fine-tuning of Rubisco to optimize its performance in various gaseous and thermal conditions. The much weaker evidence for positive selection in algae and aquatic land plants, which grow in more stable conditions than terrestrial land plants, suggests that the latter explanation is more probable. This is also consistent with the finding that natural variation among the kinetic parameters of Rubisco enzymes from different species depends mainly on environmental pressures rather than on phylogeny.
Twenty most often positively selected rbcL residues
Residues within 5 Å | Structural motifs within 5 Å
247, 248, 249, 250, 252, 253, 254, 255, 256, 279, 280, 283 | helixes 3, 4
189, 190, 193, 194, 221, 222, 223, 224, 226, 227, 228, 229, 236, 237, 238 | helixes 1, 2; strand 3
33, 140, 141, 143, 144, 145, 146, 367, 369 | N-terminus; strands D, H
295, 311, 326, 327, 329, 330, 342, 345, 346, 349, 376, 377, 378, 394 | AS; loop 6 region; helixes 5, 7; strand 7
445, 446, 447, 448, 450, 451, 452, 453, 455, 456 |
140, 141, 142, 143, 144, 146, 147, 148, 320, 366, 367, 368, 369, 371 | helixes D, 5, H
25, 27, 84, 85, 87, 88, 98, 99, 100 | strands A, C, D
117, 121, 125, 134, 135, 301, 302, 307, 308, 310, 311, 313, 314 | strand E; helixes F, 5
42, 43, 44, 93, 94, 96, 97, 131 | strands B, D, E
155, 158, 159, 169, 324, 325, 326, 373, 374, 376, 377, 397, 398, 399 | helix E; strands 6, 7, 8
336, 468, 469, 471, 472 | loop 6; C-terminus
250, 251, 274, 275, 276, 277, 278, 280, 281, 282, 283 | helixes 3, 4
58I, 59I, 61I, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 256, 260 | helixes 2, 3
251, 252, 253, 254, 256, 257, 258, 259, 283 | helixes 3, 4
25, 26, 27, 29, 30, 84 | strands A, C
190, 193, 194, 224, 225, 226, 227, 229, 230, 231, 232, 236 | helixes 1, 2
40, 41, 42, 44, 50, 87, 88, 89, 90, 95, 96, 98, 99, 100 | helix B; strands B, C, D
59S, 60S, 226, 240, 257, 258, 260, 261, 263, 264, 289 | helixes 2, 3; strand 3
436, 437, 438, 440, 441 |
149, 278, 279, 280, 281, 283, 284, 285, 286, 321 | helixes 4, 5
Directed mutagenesis of Rubisco residues in Chlamydomonas showed that a single amino acid substitution outside the active site has little effect on Rubisco function even if it occurs in a functionally important region, but when double or triple substitutions were created, significant changes were observed in both enzyme catalytic efficiency and specificity [1, 22, 26, 27]. In our study, in 95% of groups with detected positive selection, several residues (> 5 on average) were shown to evolve under positive selection simultaneously (Additional file 3), suggesting their coevolution within Rubisco. Simultaneous coevolution of multiple residues makes mutagenic studies of Rubisco function more laborious, because with each residue added the number of mutant combinations to be tested rises exponentially. However, the integration of phylogenetic and biochemical approaches might be beneficial for understanding Rubisco functional mechanisms. As natural variation among the kinetic parameters of Rubiscos from different species depends mainly on environmental pressure rather than on phylogenetic factors, the integration of Rubisco activity assays with molecular phylogenetic analysis in a group of closely related plant species occupying contrasting environments could give direct evidence of how Rubisco evolves in nature.
Although rbcL has been used in hundreds of phylogenetic studies, only in a small fraction of them was it treated as a biological molecule rather than as "strings of anonymous nucleotides, without function". Our analysis demonstrated that rbcL cannot be regarded as a neutral marker and that positive selection is fairly common in this gene. Källersjö and coauthors simultaneously analyzed 2538 rbcL sequences by parsimony jackknifing and found that the first and second codon positions together contain 764 informative positions which support 431 clades, whereas 471 informative third positions support 1327 clades, clearly showing that the third positions on average have a stronger phylogenetic signal. These findings do not support the frequent assumption that, because changes occur more often in the third position, third positions are likely to contain more homoplasy and provide less phylogenetically reliable information than the more conservative first and second positions. Positive selection may also result in homoplasy due to fixation of the same mutation arising independently in several phylogenetic lineages. As most substitutions in the third codon positions are synonymous, the third codon positions are less frequent targets of positive selection than the first and second ones. Thus, findings that the first and second codon positions in rbcL have a weaker phylogenetic signal than expected [9, 28] can be explained by widespread positive selection on rbcL.
We tested whether the removal of codons evolving under positive selection improves phylogenetic resolution in 112 groups with detected positive selection (Additional file 4). We compared sums of bootstrap values between the trees reconstructed using all sites and the trees reconstructed using only neutrally evolving sites (positively selected sites were excluded). The sums of bootstrap frequencies changed by no more than 5% in either direction in 41% of analyzed cases, decreased by more than 5% in 29% of cases, and increased by more than 5% in 30% of cases. In two cases, deletion of codons evolving under positive selection resulted in a more than 55% increase in the total bootstrap support (Additional file 4). Thus, taking into account the presence of positive selection in rbcL may improve phylogenetic reconstructions. We recommend checking rbcL datasets for positive selection and, if selection is found, testing whether deletion of sites evolving under positive selection from further phylogenetic analyses increases topological resolution or bootstrap support of the selected branches. Alternatively, sites evolving under positive selection could be appropriately modelled to improve their contribution to the phylogenetic signal, although no "ready to use" program that can do this is available at the moment.
Adaptive mutations may spread across subpopulations of a species, or across several species with very little gene flow . Thus, positive selection in Rubisco may facilitate horizontal interspecific gene flow for chloroplast DNA, as spreading of adaptive mutations in rbcL may result in fixation of a single chloroplast haplotype in several occasionally hybridising species, which may dramatically affect phylogeny reconstruction. Previously we detected strong cytonuclear discordance apparently caused by positive selection in rbcL in the Hawaiian endemic plant genus Schiedea . This illustrates the danger of reconstructing phylogenetic relations solely from chloroplast data in groups with putative interspecific hybridization: tests for the presence of positive selection and for the congruence between chloroplast and nuclear phylogenies are essential for correct inference of species phylogenetic relations.
Our molecular phylogenetic analysis unexpectedly revealed that positive selection in the rbcL gene of terrestrial land plants is quite a common phenomenon. On the other hand, positive selection in cyanobacteria, algae and aquatic land plants is less prominent, which may be explained by the more stable conditions of aquatic environments compared to terrestrial ones. The residues involved in dimer-dimer, intradimer, large subunit-small subunit and Rubisco-Rubisco activase interactions, as well as those close to the active site, are apparently the prime targets of positive selection in Rubisco. Widespread adaptive evolution in rbcL may reflect the perpetual fine-tuning of Rubisco to optimize its performance in changing gaseous and thermal conditions and/or co-evolution of proteins in the Rubisco complex. The integration of phylogenetic and biochemical research is required to test the hypothesis that Darwinian selection during Rubisco evolution is driven by continuous fine-tuning to changing conditions. Widespread positive selection in rbcL has to be taken into account when this gene is used for phylogenetic reconstructions, especially when interspecific hybridization is possible.
All but thirteen rbcL sequences used in this study were extracted from NCBI GenBank  and the species names and accession numbers as well as taxonomic information are given in Additional file 1. The thirteen novel sequences (all representing genus Silene) were deposited in [GenBank:EF418555–EF418567].
The obtained sequences were aligned and edited for further analyses using ProSeq3 software. Codon alignments were made from amino-acid alignments and manually checked. All alignments were straightforward and unambiguous, confirming the highly conservative nature of rbcL. We found only one insertion/deletion, which separated the group of red algae, brown algae and diatoms from all the other analyzed lineages. This indel did not affect our analyses because we analyzed many separate datasets instead of a joint one (see below). It should be noted that many sequences lack bases at the 5' and/or the 3' end; within each analyzed dataset we brought all sequences to the same length by truncation. Suspicious sequences (such as those containing stop codons) were not included in the analyses. All alignments are available upon request from the corresponding author.
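The truncation and screening steps described above could be sketched roughly as follows. This is only an illustration in Python, not the ProSeq3 procedure used in the study: the function names are hypothetical, the alignment is assumed to be in frame from its first column with "-" as the gap character, and a stop codon in the final codon position is assumed to be acceptable.

```python
# Stop codons of the standard and plastid genetic codes.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(seq):
    """Return True if an in-frame stop codon occurs before the last codon."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def trim_ragged_ends(alignment, gap="-"):
    """Cut all aligned sequences to the region covered by every sequence.

    `alignment` maps names to equal-length aligned sequences. The kept
    region is rounded inwards to whole codons (hypothetical convention:
    column 0 is the first position of a codon).
    """
    start = max(len(s) - len(s.lstrip(gap)) for s in alignment.values())
    end = min(len(s.rstrip(gap)) for s in alignment.values())
    start += (-start) % 3                     # round up to a codon boundary
    end -= end % 3                            # round down to a codon boundary
    return {name: s[start:end] for name, s in alignment.items()}


def screen(alignment):
    """Trim ragged ends, then drop sequences with premature stop codons."""
    trimmed = trim_ragged_ends(alignment)
    return {n: s for n, s in trimmed.items() if not has_premature_stop(s)}
```

Rounding the kept region to codon boundaries keeps the truncated alignment usable for codon-based analysis.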
For detection of positive selection we used the codon-based analysis (codeml) implemented in the PAML v.3.14 package. It has been shown that the power to detect positive selection is close to 100% in datasets of ≥ 17 sequences. As the computing time grows dramatically with the number of sequences analyzed, we divided all rbcL sequences into 151 relatively small monophyletic groups (Additional file 1) by manual dissection of phylogenetic trees constructed using the neighbor-joining algorithm implemented in MEGA v3.1.
For all analyses of positive selection we used the codeml program from the PAML package. All PAML analyses were performed using the "user tree" runmode in codeml. The trees employed were reconstructed by the neighbor-joining algorithm implemented in MEGA using the following parameters: pairwise deletion of missing sites, all three codon positions used, both transitions and transversions used, homogeneous pattern among lineages and uniform rates among sites. Both Kimura's and Tamura-Nei's models of nucleotide substitution were used, and they resulted in similar topologies. The resulting topologies were manually checked for congruence with the systematics of the analyzed taxa. Although some of these phylogenies may deviate slightly from the "true" species trees, this should not significantly influence the Likelihood Ratio Tests (LRTs) we used for analysis of positive selection, as they have been shown to be robust to phylogenetic uncertainty. We used models of codon evolution that allow for variation in dN/dS among codons to perform LRTs for rate heterogeneity among amino acid sites and for positive selection. We performed one LRT for dN/dS heterogeneity (M0-M3) and two LRTs for positive selection: M7-M8 and M8a-M8. For all LRTs, the first model is a simplified version of the second one, with fewer parameters, and is thus expected to provide a poorer fit to the data (lower maximum likelihood). The M7 and M8a models are the null models without positive selection (no codons with dN/dS > 1) and the M8 model is the alternative model with positive selection. The significance of the LRTs was calculated assuming that twice the difference in the log of maximum likelihood values between the two models follows a χ2 distribution. The degrees of freedom (df) were given by the difference in the numbers of parameters in the two nested models.
It has been argued that for the M0-M3 and M7-M8 comparisons df = 2, while for M8a-M8 comparisons the appropriate test uses a 50:50 mixture of df = 0 and df = 1. Accordingly, to calculate a P-value from this mixture of distributions we first calculated the P-value assuming df = 1, and then halved it. Cases in which the M8 model fitted better with P < 0.05 in both the M7-M8 and the M8a-M8 comparisons were regarded as having positive selection.
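The P-value calculation described above can be illustrated with a short Python sketch. The function names are hypothetical and the code is not part of PAML; it relies only on the closed-form upper tails of the χ2 distribution for df = 1 and df = 2, and takes the log-likelihoods of the two nested models as inputs.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate, for df = 1 or 2 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 and df = 2 are handled in this sketch")


def lrt_statistic(lnl_null, lnl_alt):
    """Twice the log-likelihood difference between alternative and null."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def p_m7_m8(stat):
    """M7-M8 comparison: chi-square with df = 2."""
    return chi2_sf(stat, 2)


def p_m8a_m8(stat):
    """M8a-M8 comparison: 50:50 mixture of df = 0 and df = 1.

    The df = 0 component is a point mass at zero, so for a positive
    statistic the P-value is half of the df = 1 P-value.
    """
    if stat <= 0:
        return 1.0
    return 0.5 * chi2_sf(stat, 1)


def positive_selection(stat_m7_m8, stat_m8a_m8, alpha=0.05):
    """Both LRTs must reject their null model at the chosen level."""
    return p_m7_m8(stat_m7_m8) < alpha and p_m8a_m8(stat_m8a_m8) < alpha
```

For example, `p_m8a_m8(4.78)` evaluates to about 0.0144, in agreement with the value reported for the joint group of submerged aquatic monocots.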
The significance of difference in proportion of cases with detected positive selection between the analyzed plant groups was evaluated by 2 × 2 contingency χ2 tests with Yates' correction. First, we performed pairwise comparisons between four groups of land plants (mosses, ferns and allies, gymnosperms, angiosperms); second, between six groups of angiosperms (magnoliids, monocots, commelinids, eudicots, rosids, asterids; all groups sensu APG II ); and, finally, between land plants and combined algae and cyanobacteria dataset.
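As an illustration of the 2 × 2 contingency test with Yates' correction, the comparison between land plants and the combined algae and cyanobacteria groups can be sketched in Python. The counts in the usage note are an assumption inferred from the text (112 of 143 land plant groups with positive selection, 0 of 8 algae and cyanobacteria groups), and the function name is hypothetical.

```python
import math


def yates_chi2(a, b, c, d):
    """Yates-corrected chi-square for the 2 x 2 table [[a, b], [c, d]].

    Returns the statistic and its P-value (df = 1).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    # Continuity correction: shrink |ad - bc| by n/2, but not below zero.
    diff = max(0.0, abs(a * d - b * c) - n / 2.0)
    stat = n * diff ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

With the assumed counts, `yates_chi2(112, 31, 0, 8)` gives a statistic of about 20.3 and P < 0.00001, matching the reported result.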
To identify amino acid sites potentially under selection in the groups with confirmed positive selection, the parameter estimates from the M8 model were used to calculate the posterior probabilities that a codon belongs to a class with dN/dS > 1, using the Bayes Empirical Bayes approach implemented in PAML.
The analyzed rbcL sequences are fairly conserved, and any differences in length occur at the C-terminus. This allowed us to use the published spinach Rubisco protein structure [36–38] for structural analysis. Throughout the paper, the numbering of Rubisco large subunit residues is based on the spinach sequence. Rubisco structural data files for spinach, 1RBO and 1RCX, were obtained from the RCSB Protein Data Bank. The locations and properties of individual amino acids in the Rubisco structure were analyzed using DeepView – Swiss-PdbViewer v.3.7.
Given that positive selection may result in homoplasy, we tested whether the removal of codons evolving under positive selection improves phylogenetic resolution. We compared bootstrap sums of trees reconstructed using all sites (including those evolving under positive selection) with bootstrap sums of trees reconstructed using only neutrally evolving sites. Phylogenetic trees were reconstructed in MEGA using the neighbor-joining algorithm with Tamura-Nei's model of nucleotide substitution. We used 50% majority rule trees and subtracted 50% from each support value before summing. The subtraction was done to circumvent the bias in summing bootstrap values of a consensus tree: without it, a tree with two groups supported at 51% would have a higher sum than a tree with one group supported at 100%, and if the support of a group decreased from 51% to 49%, its contribution would drop from 51 to zero (due to the threshold of 50%).
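The adjusted bootstrap sum can be written down in a few lines of Python. This is a sketch with hypothetical function names; in particular, it assumes that the percentage change is expressed relative to the sum obtained with all sites, which the text does not state explicitly.

```python
def adjusted_bootstrap_sum(supports, threshold=50.0):
    """Sum of (support - threshold) over groups kept in the consensus tree.

    `supports` are bootstrap percentages; groups below the threshold are
    absent from a 50% majority rule tree and contribute nothing.
    """
    return sum(s - threshold for s in supports if s >= threshold)


def classify_change(sum_all_sites, sum_neutral_sites, tolerance=5.0):
    """Label the change in support after removing positively selected sites.

    The change is a percentage of the all-sites sum (an assumption of this
    sketch); changes within +/- tolerance are reported as 'unchanged'.
    """
    if sum_all_sites <= 0:
        raise ValueError("the all-sites bootstrap sum must be positive")
    change = 100.0 * (sum_neutral_sites - sum_all_sites) / sum_all_sites
    if change > tolerance:
        return "increased"
    if change < -tolerance:
        return "decreased"
    return "unchanged"
```

With the subtraction, two groups at 51% give a sum of 2, whereas one group at 100% gives 50, so a single well-supported group is no longer outweighed by two marginal ones.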
We are grateful to Antonina Votintseva for help with the additional files and to Graham Muir for suggestions and correction of the manuscript. We are deeply grateful to the three anonymous reviewers for their stimulating comments and suggestions, which benefited the manuscript. This work was funded by a grant to DF from the BBSRC.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.