Differential gene transfers and gene duplications in primary and secondary endosymbioses

Background Most genes introduced into phototrophic eukaryotes during the process of endosymbiosis are either lost or relocated into the host nuclear genome. In contrast, groEL homologues are found in different genome compartments among phototrophic eukaryotes. Comparative sequence analyses of recently available genome data have allowed us to reconstruct the evolutionary history of these genes and propose a hypothesis that explains the unusual genome distribution of groEL homologues. Results Our analyses indicate that while two distinct groEL genes were introduced into eukaryotes by a progenitor of plastids, these particular homologues have not been maintained in all evolutionary lineages. This is of significant interest, because two chaperone proteins always co-occur in oxygenic photosynthetic organisms. We infer strikingly different lineage-specific processes of evolution involving deletion, duplication and targeting of groEL proteins. Conclusion The requirement of two groEL homologues for chaperone function in phototrophs has provided a constraint that has shaped convergent evolutionary scenarios in divergent evolutionary lineages. GroEL provides a general evolutionary model for studying gene transfers and convergent evolutionary processes among eukaryotic lineages.


Background
Plastids, the solar-powered energy factories of phototrophic eukaryotes, either translate mRNAs for their organelle-encoded genes or import nuclear-encoded proteins. In both cases, correct folding of proteins is managed by chaperones such as those of the GroEL family. These are an abundant class of chaperones, which are also found in mitochondria, hydrogenosomes and prokaryotes [1,2]. Their importance and distribution have led to intensive investigation of their function, culminating in the 'molecular chaperone concept' [3,4], which has strongly influenced current understanding of protein folding and assembly.
Early genomic studies of the cyanobacterium Synechocystis sp. PCC 6803 and of plastid chromosomes of eukaryotes highlighted a surprisingly varied distribution of genome locations for GroEL homologues amongst photosynthetic taxa [5,6]. Synechocystis sp. PCC 6803 harbours two different groEL genes, whereas only one is maintained in red algal plastomes and the plastid genome of the cyanelle. Genes encoding GroEL have not been located within the sequenced genomes of chloroplasts in green algae and land plants, but two nuclear homologues of groEL, the cpn60s, have been detected in the nuclear genome of Chlamydomonas reinhardtii and some land plants. In Euglena gracilis, no groEL gene has been identified [7]. The secondary endosymbionts of a cryptomonad (Guillardia theta) and also a diatom (Odontella sinensis) are known to encode a single groEL gene in their plastid, and it can be speculated that the presence of a single copy of groEL may indicate the ancestral state in a primary endosymbiont [8,9]. Interestingly, the nucleomorph genome of the cryptomonad Guillardia theta harbours a groEL homologue [6], and a cpn60-like homologue has recently also been found in the nucleomorph genome of another secondary endosymbiont, the chlorarachniophyte Bigelowiella natans (Gilson & McFadden, unpublished). Additionally, cpn60-like genes have recently been discovered in the nuclear genomes of other photosynthetic organisms, including a red alga (Cyanidioschyzon merolae), a diatom (Thalassiosira pseudonana), and Plasmodium falciparum (a parasitic organism that harbours a degenerate plastid). With the exception of P. falciparum, the co-occurrence of two groEL genes in the genomes of these recently sequenced organisms was predicted as necessary for maintaining chloroplast function [6]. We develop here a hypothesis for differential transfer and gene duplication that explains the distribution of groEL homologues amongst the multiple genomes of photosynthetic taxa. We discuss how these proteins may act as an important regulator for plastid functions.

Substitution model selection
In all analyses, except analyses of the red/brown algal GroEL orthologues, an RtREV model that accommodated positional rate heterogeneity in some form was selected as the best model by ProtTest under the AIC criterion. In the case of the GroEL orthologues in red/brown algae, we noted that only small differences in the AIC criterion separated a variety of different models. When an RtREV model was assumed, and positional rate heterogeneity was approximated by either a constant proportion of variable sites or a discrete gamma distribution of rate classes, we noticed that the optimal estimates for these parameters varied among evolutionary lineages (i.e. different estimates were obtained for cyanobacterial GroEL1, cyanobacterial GroEL2, red algal/heterokont plastid GroEL1-like sequences, and highly diverged Cpn60-like sequences). When a uniform rate distribution was assumed, p var values ranged from 0.3 variable sites (cyanobacterial GroEL1) to 0.9 variable sites (highly diverged Cpn60-like sequences). When p var was set to 1, alpha shape parameter values ranged from 0.2 (for cyanobacterial GroEL1) to 0.8 (for highly diverged Cpn60-like sequences). We suggest that this degree of variation in parameter estimates for phylogenetic groupings of anciently diverged sequences is likely to reflect lineage-specific differences in structural and functional constraints [10-12]. As such, it is potentially problematic for phylogenetic reconstruction, since parallel increases in proportions of variable sites in different evolutionary lineages can sometimes induce a form of long branch attraction [10-12].
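The AIC comparison described above can be illustrated with a minimal sketch. The log-likelihoods and parameter counts below are hypothetical values chosen only to show how closely ranked models can be; they are not results from this study, and the helper names (aic, rank_models) are assumptions introduced for illustration rather than part of ProtTest.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion, AIC = 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def rank_models(fits):
    """Rank models by AIC.

    fits maps a model name to (log-likelihood, number of free parameters).
    Returns (name, AIC, delta-AIC) tuples sorted from best to worst.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores.values())
    rows = [(name, score, score - best) for name, score in scores.items()]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical fits, for illustration only (not values from the paper).
example_fits = {
    "RtREV": (-12450.0, 60),
    "RtREV+I": (-12310.0, 61),
    "RtREV+G": (-12295.0, 61),
    "RtREV+I+G": (-12294.5, 62),
}

ranking = rank_models(example_fits)
for name, score, delta in ranking:
    print(f"{name:10s} AIC = {score:10.1f}  delta = {delta:6.1f}")
```

A small delta-AIC between the top-ranked models, as in this toy input, corresponds to the situation noted for the red/brown algal orthologues, where no single model is clearly preferred.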

Evolutionary tree building
Optimal PhyML (protein maximum likelihood) trees showed similar topological relationships over a wide range of p var values specified to accommodate positional rate heterogeneity. The robustness of phylogenetic relationships to sampling error was also found to be relatively stable, when evaluated using non-parametric bootstrapping (100 replicates). Figure 1 shows the optimal unrooted phylogenetic tree built assuming an RtREV model with positional rate heterogeneity modelled with a discrete gamma distribution (α = 0.92; 4 discrete rate categories).
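To make the discrete gamma approximation concrete, the following sketch computes relative rates for four equal-probability categories at α = 0.92, using the mean rate within each category of a gamma distribution with mean one. This is one common discretisation; the text does not state whether category means or medians were used, so that choice is an assumption, as are the function names. Only the Python standard library is used.

```python
import math


def reg_lower_gamma(a, x):
    """Regularised lower incomplete gamma function P(a, x), by series expansion."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def gamma_quantile(a, p):
    """Quantile of a Gamma(shape=a, scale=1) variate, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(a, hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(a, mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k=4):
    """Mean relative rate in each of k equal-probability gamma categories.

    Rates follow Gamma(shape=alpha, rate=alpha), so the overall mean is 1.
    Cut points are expressed on the scale x = alpha * rate.
    """
    cuts = [0.0] + [gamma_quantile(alpha, i / k) for i in range(1, k)]
    # The partial mean up to a cut point x equals P(alpha + 1, x).
    cumulative = [reg_lower_gamma(alpha + 1.0, c) for c in cuts] + [1.0]
    return [k * (cumulative[i + 1] - cumulative[i]) for i in range(k)]


rates = discrete_gamma_rates(0.92, k=4)
print([round(r, 3) for r in rates])
```

By construction the category rates average to one, so the shape parameter changes how unevenly substitution rates are spread across sites without rescaling branch lengths.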
Although it contains more GroEL homologues than were available to Wastl et al. [6], our phylogenetic reconstruction here is nevertheless consistent with observations and inferences made by Wastl et al. [6]. Figure 1 shows the following. (a) The groEL1-like genes from the plastid genomes of red algae/heterokonts and cryptophytes are most closely related to the cyanobacterial groEL1 genes. (b) Interestingly, the cyanelle groEL1-like homologue is somewhat intermediate between cyanobacterial and red algal/chromist groEL1-like sequences, a finding that may reflect the cyanobacterial-like nature of this plastid. (c) The non-photosynthetic eubacterial groEL sequences are arguably more similar to the groEL2 sequences found in cyanobacteria. In any event, the groEL2 orthologues in cyanobacteria are genetically more diverse than the groEL1 sequences in the same taxa. These observations may suggest that groEL2 orthologues represent an ancestral form of GroEL. However, our inference that structural/functional constraints differ amongst GroEL homologues means that it is not possible to exclude other interpretations. (d) Assuming the root of the tree joins the branch leading to the non-photosynthetic taxa, Figure 1 places the Gloeobacter "A" and "B" sequences as the ancestral forms of GroEL homologues in photosynthetic taxa. (e) groEL2-type sequences are found in the nuclear genomes of a diatom (Thalassiosira pseudonana), an alveolate (Plasmodium falciparum), the higher plants and a red alga (Cyanidioschyzon merolae), and in the nucleomorph genomes of a chlorarachniophyte (Bigelowiella natans) and a cryptophyte (Guillardia theta). In the case of higher plants, two forms of groEL2 occur, and appear to represent forms of groEL duplicated from an ancestral groEL2-like sequence. Inferences concerning the origin of this duplication are potentially problematic. A relatively high bootstrap value in Figure 1 (77%) suggests that it may have occurred prior to the divergence of plastids. However, potential long branch attraction problems concerning the placement of α and β Cpn60-like sequences from eukaryotes make this conclusion tentative. Gene duplication within the green lineage and differences in the functional/structural constraints of green α and β Cpn60-like sequences might also explain the results observed. Chlamydomonas reinhardtii Cpn60 α and β sequences (AAA98642 and AAA98643) are not included in the phylogenetic analysis shown in Figure 1, because their inclusion significantly reduced the alignment length and increased phylogenetic uncertainty. However, it is clear from other phylogenetic analyses (unpublished) that this green alga contains the two forms of groEL2 also present in higher plants.

Hypotheses of origin
Our phylogenetic reconstruction suggests a complex pattern of genome transfers, losses and duplications in the evolution of groEL sequences from photosynthetic taxa. In the earliest cyanobacterial-like prokaryotes, the ancestral groEL sequence appears to have duplicated to form an "A" and a "B" type sequence that is still present today in the genome of Gloeobacter [14,15]. It appears that the "A" (groEL2-like) and "B" (more groEL1-like) forms have been inherited by most cyanobacteria, and also the endosymbiont(s) involved in primary plastid endosymbiosis. During the process of endosymbiosis, the groEL2-like orthologue has been transferred from the endosymbiont genome to the nuclear genome in heterokonts, red algae, green algae and higher plants, and cryptophytes. We predict that this will also be true for haptophytes and glaucocystophytes. Less certain is the pattern of evolution in peridinin-containing dinoflagellates, which in other respects appear highly distinctive in their evolution [16,17]. An interesting observation is that Cyanophora paradoxa, the molecular prototype of the glaucophytes, shows archaic features not only by encoding a groES in its plastid genome, but also by containing a groEL1 sequence that is somewhat intermediate in sequence identity between cyanobacterial and plastid-encoded groEL1 genes. In the case of the chlorophyte lineage, it appears that the groEL1 form has been lost, and in Chlamydomonas and higher plants it has been replaced by a duplicated form of a groEL2-like sequence. This situation may also be true for the endosymbiont of the chlorarachniophyte B. natans (generally assumed to have been a green alga [16]), since phylogenetic reconstructions, including those with partial Chlamydomonas GroEL sequences (not shown), provide some support for Bigelowiella and cpn60α-containing taxa being most closely related (e.g. Figure 1). Several investigations on the evolution of the Plasmodium apicoplast indicate a red algal origin for this organelle [18,19]. If so, then based on the generalisations stated above, one would expect the groEL/cpn60 distribution to be similar to the situation found in heterokonts. Additional duplications of groEL1 genes (in the presence of groEL2) have also occurred in some species of filamentous cyanobacteria, and this observation is discussed in the following section.
Figure 1. Optimal PhyML tree, built assuming an RtREV + G(α = 0.92) substitution model. Internal branches relevant to the discussion and which receive greater than 74% bootstrap support have been shown. Branch lengths are indicated; however, those subtending Bigelowiella and Plasmodium have been truncated. Cyanobacteria have been given number identifiers.

GroEL proteins and protein folding
In all eukaryotes and prokaryotes that carry out oxygenic photosynthesis, and for which genome data are complete, two different forms of groEL/cpn60 are known to exist. Thus, at least two copies of the gene or protein seem to be necessary for complete chaperone function. Further, the importance of having two divergent forms of GroEL for protein folding is suggested by observations on GroEL sequences in Synechocystis sp. PCC 6803, where GroEL1 and GroEL2 have been observed to respond differently to heat shock and light conditions [20,21]. An interesting speculation is that the number and genetic diversity of GroEL homologues reflect the complexity of morphotype in cyanobacteria. This is suggested by comparisons within filamentous cyanobacteria. Those strains that harbour three GroEL genes exhibit complex developmental stages (akinetes and hormogonia), whereas filamentous strains with a simpler morphotype, such as Nostoc sp. PCC 7120, harbour only two GroEL genes. Further study is needed to evaluate whether particular groEL homologues are specifically expressed in different developmental stages. Interestingly, P. falciparum encodes one cpn60 gene, but no groEL1 in the apicoplast genome. If phylogenetic inferences of a close relationship with red algae [18,19] are correct, then this observation may reflect relaxed constraints on the folding of apicoplast proteins, and this speculation is also worth further investigation.

Conclusion
Although phylogenetic reconstruction of individual gene histories is inherently problematic for anciently diverged taxa [5,10,11], phylogenetic reconstruction for groEL homologues nevertheless provides a framework for understanding genome-wide patterns of gene loss, relocation and multiple events of gene duplication. The results presented here support and extend the hypothesis of groEL/cpn60 evolution by Wastl et al. [6], which suggests a pattern of differential serial gene transfer and gene duplication.

Resources for sequences
Cyanobacterial groEL genes were retrieved from GenBank and the cpnDB chaperonin sequence database [22]. Two groEL genes were found to be present in Synechocystis sp. PCC 6803. One of these genes, termed groEL1 (slr2076), is arranged in an operon together with groES, whereas the other, groEL2 (sll0416), is not adjacent to a small subunit gene [23]. In comparing homologues from other cyanobacteria to those of Synechocystis sp. PCC 6803, we have adopted the terminology of "groEL1" and "groEL2". All unicellular forms of cyanobacteria retrieved from database searches were found to contain two groEL genes. In contrast, filamentous strains showed variation in the number of groEL genes. Nostoc punctiforme and Anabaena variabilis ATCC29413 harbour three groEL genes, whilst Nostoc sp. PCC 7120, with two groEL genes, was similar to the unicellular forms. GenBank also provided several entries for cpn60 from land plants, Chlamydomonas reinhardtii and the nucleomorph of the cryptophyte Guillardia theta. In a recently finished genome project on the nucleomorph genome of the chlorarachniophyte Bigelowiella natans (Gilson & McFadden, unpublished), a further nucleomorph-encoded cpn60 gene has been annotated. Two new genome projects on Cyanidioschyzon merolae [24] and Thalassiosira pseudonana [25] were also the source of additional nuclear-located cpn60 genes. In PlasmoDB [26], the genome database for Plasmodium falciparum, two different cpn60 genes have been annotated in the nuclear genome [27]; one encodes a mitochondrial copy and the other an apicoplast-targeted copy [28]. Additional BLAST searches of NCBI were made against available plastid genomes using cyanobacterial groEL genes. Significant hits were obtained with the plastid genomes of red algae (Gracilaria verrucosa, Cyanidium caldarium, Porphyra purpurea, Cyanidioschyzon merolae) and of taxa with secondary red plastids (Odontella sinensis, Guillardia theta, Pyrenomonas salina, Thalassiosira pseudonana).
These sequences, together with the nuclear, nucleomorph, plastid, and cyanobacterial sequences, were aligned using the progressive alignment procedure implemented in CLUSTALX [29]. The alignment was edited to remove any ambiguously aligned regions, and conserved blocks of aligned residues (containing 495 amino acids) were used for phylogenetic analyses. The aligned data matrix and accession details are available from the authors on request.
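As a rough illustration of reducing an alignment to conserved blocks, the sketch below drops every column in which the fraction of gap characters exceeds a threshold. This is a deliberate simplification: the text describes editing to remove ambiguously aligned regions, which need not coincide with gapped columns, and the function name, threshold and toy sequences are assumptions introduced here.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.0, gap_chars="-."):
    """Return a copy of the alignment without columns that exceed the gap threshold.

    alignment maps a sequence name to an aligned sequence string;
    all sequences are expected to have the same length.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = []
    for i in range(length):
        gaps = sum(alignment[name][i] in gap_chars for name in names)
        if gaps / len(names) <= max_gap_fraction:
            kept_columns.append(i)

    return {
        name: "".join(alignment[name][i] for i in kept_columns)
        for name in names
    }


# Toy alignment, for illustration only.
toy = {"seq1": "MK-LVAG", "seq2": "MKALV-G", "seq3": "MKALVAG"}
trimmed = trim_gappy_columns(toy)
print(trimmed)
```

With the default threshold of zero, any column containing a gap in at least one sequence is removed, which is the most conservative setting for this kind of filter.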

Evolutionary tree building
Trees were reconstructed from amino acid sequences using the Windows version of PhyML [32]. Trees were built assuming an RtREV model, and the optimal tree is shown in Figure 1. The robustness of phylogenetic reconstructions to variations in assumptions of positional rate heterogeneity was investigated by assuming (a) different proportions of variable sites (p var range = 0.3-1.0) and (b) a discrete gamma distribution of rate classes and a range of alpha shape parameter values (0.2-1). Non-parametric bootstrap trees were analyzed to assess the significance of sampling variability.
Figure 2. Scheme depicting an evolutionary hypothesis that explains genome locations of groEL and cpn60. An ancestral eubacterial-like groEL duplicated in cyanobacteria to give two homologues: groEL1 and groEL2. Both copies were inherited by phototrophic eukaryotes. One of the copies, groEL1, has been lost from the plastid genome in some lineages. groEL2 was transferred into the cell nucleus and gave rise to cpn60 in the case where groEL1 is still maintained in the plastid genome, or to cpn60α and cpn60β where groEL1 has been deleted. For details see text. P.D. = predicted.
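The resampling step behind a non-parametric bootstrap can be sketched as follows: alignment columns are drawn with replacement until a pseudo-alignment of the original length is obtained, and a tree would then be built from each replicate. Tree building itself is outside this sketch, and the function name, seed and toy sequences are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    alignment maps a sequence name to an aligned sequence string.
    The same sampled column indices are applied to every sequence,
    so each resampled column is an intact column of the original alignment.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][c] for c in columns)
        for name in names
    }


# Toy alignment, for illustration only.
toy = {"taxonA": "MKLVAG", "taxonB": "MKIVSG", "taxonC": "MRLVAG"}
rng = random.Random(1)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
print(replicates[0])
```

The proportion of replicate trees that recover a given internal branch is what is reported as its bootstrap support.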

Authors' contributions
PG and GIM sequenced and analysed cpn60 from Bigelowiella natans, and contributed to the manuscript. BSM drew Figure 2 and contributed to the manuscript. PJL analysed the substitution properties of the data, constructed the phylogenetic trees and wrote the manuscript together with UGM. Data mining was done by SZ, who also contributed to the manuscript. UGM initiated the project. All authors read and approved the final manuscript.