Geminiviruses: a tale of a plasmid becoming a virus
© Krupovic et al. 2009
Received: 02 January 2009
Accepted: 21 May 2009
Published: 21 May 2009
Geminiviruses (family Geminiviridae) are small single-stranded (ss) DNA viruses infecting plants. Their virion morphology is unique in the known viral world – two incomplete T = 1 icosahedra are joined together to form twinned particles. Geminiviruses utilize a rolling-circle mode to replicate their genomes. A limited sequence similarity between the three conserved motifs of the rolling-circle replication initiation proteins (RCR Reps) of geminiviruses and plasmids of Gram-positive bacteria allowed Koonin and Ilyina to propose that geminiviruses descend from bacterial replicons.
Phylogenetic and clustering analyses of various RCR Reps suggest that Rep proteins of geminiviruses share a most recent common ancestor with Reps encoded on plasmids of phytoplasmas, parasitic wall-less bacteria that replicate in both plant and insect cells and therefore occupy a common ecological niche with geminiviruses. The capsid protein of Satellite tobacco necrosis virus was found to be the best template for homology-based structural modeling of the geminiviral capsid protein. The good stereochemical quality of the generated models indicates that the geminiviral capsid protein shares the same structural fold, the viral jelly-roll, with the vast majority of icosahedral plant-infecting ssRNA viruses.
We propose a plasmid-to-virus transition scenario, where a phytoplasmal plasmid acquired a capsid-coding gene from a plant RNA virus to give rise to the ancestor of geminiviruses.
The origin(s) of viruses is a longstanding but yet unresolved question in biology. Several hypotheses were put forward in efforts to understand this enigma (reviewed in ). According to the "Virus-first" hypothesis, viruses emerged in the prebiotic world, just before or in parallel with cellular organisms [2, 3]. The "Reduction" hypothesis states that viruses evolved by reduction from free-living ancient cellular lineages , while the alternative "Escape" hypothesis suggests that viruses originated from cellular genomic fragments that became free of their cellular environment . Irrespective of which of the viral origin hypotheses is considered, these converge in the appreciation of the extreme antiquity of viruses, with origin(s) possibly predating the emergence of the last universal common ancestor (LUCA) of cellular organisms. The ancient origin of viruses is inferred not only from bioinformatic investigations  but, perhaps more convincingly, from the recent flow of structural information on a number of individual viral proteins as well as entire virions. Structural comparison of viruses infecting hosts from all three domains of life (Bacteria, Archaea, and Eukarya) revealed that certain viruses utilize very similar assembly principles and can be grouped accordingly into structure-based viral lineages [6, 7]. The viral lineage hypothesis predicts that viruses existed at the time of (or even before) LUCA and their diversification into bacterial, archaeal and eukaryotic viruses was associated with the emergence of the three cellular domains. But do all virus families come from the dawn of life or can we still witness the more recent emergence of new viral families?
Plasmids comprise another group of parasitic genetic elements that inhabit cells in all three domains of life. Resemblance of plasmids to DNA viruses is apparent, especially when DNA replication strategies are considered . Nevertheless, evolutionary relationships between these two groups are far from being understood. Obviously, the main (and in some cases the only) difference is the presence of the capsid protein-coding gene in the viral genome. For example, there are a number of cryptic plasmids that encode a single protein responsible for DNA replication, while some small viruses of the Circoviridae family bear only two genes [8, 9], one for genome replication and the other one for capsid formation. Members of another virus family, Nanoviridae, contain multipartite genomes where each genomic segment contains a single gene and is packed into a separate isometric capsid . For example, Faba bean necrotic yellows virus contains up to eleven chromosomes . Of special interest are plant-infecting satellite RNA viruses, such as Satellite tobacco necrosis virus (STNV), that encode a single capsid protein and depend on helper viruses for genome replication. It is thus reasonable to assume that acquisition of a capsid gene by a plasmid or, vice versa, loss of a capsid gene by a virus will result in the transition from a plasmid to a genuine virus or from a virus to a plasmid, respectively. This hypothesis should be testable by scrupulous analysis of replication and capsid protein sequences and/or structures.
Geminiviruses are plant pathogens and, owing to their agricultural importance, a great number of sequences from geminiviral isolates have been determined and deposited in databases. We generated a specific sequence pattern to select from the non-redundant BLAST database (including environmental protein sequences) all 1072 protein sequences sharing conserved motifs with Rep proteins of geminiviruses. Many of these sequences are almost identical; therefore, to avoid redundancy, the initial dataset was filtered to leave only sequences that are less than 70% identical to each other. After subsequent manual examination, the final dataset contained 40 sequences (see Methods for data collection details). Nineteen of these belonged to geminiviruses, while the rest were from a marine metagenome project (6 sequences), circoviruses (6 sequences), phytoplasmal plasmids (5 sequences), a plasmid of Porphyra pulchra (1 sequence), a nanovirus (1 sequence), Bifidobacterium catenulatum DSM 16992 (1 sequence), and Nicotiana tabacum (1 sequence). Interestingly, the latter sequence was previously concluded to originate from integration of geminiviral DNA into the plant chromosome. Nanoviruses and circoviruses are small icosahedral viruses with ssDNA genomes. While nanoviruses infect plants, circoviruses replicate in mammalian or avian cells. Bifidobacteria are Gram-positive bacteria residing in the gastrointestinal tract of humans and other warm-blooded animals. Interestingly, Rep from B. catenulatum DSM 16992 is homologous to a Rep of the Bifidobacterium pseudocatenulatum plasmid p4M [GenBank:AAM00235], which has been previously observed to be similar to Reps of circoviruses. Phytoplasmas are parasitic bacteria infecting the phloem tissue of plants. Phytoplasmas belong to the class Mollicutes, which encompasses small pleiomorphic wall-less bacteria, also including mycoplasmas, ureaplasmas, spiroplasmas and acholeplasmas.
Phytoplasmas are transmitted by insects that feed on the phloem of infected plants [21, 22]. It should be noted that geminivirus-related bacterial RCR Reps, other than those from phytoplasmal plasmids and B. catenulatum DSM 16992, could be identified neither by BLAST searches nor by geminivirus-specific pattern searches (see Methods). Since reasonable sequence conservation is a prerequisite for robust phylogenetic analysis, we did not incorporate RCR Rep sequences from other origins into our dataset.
When Rep proteins of phytoplasmal plasmids were searched for homologues using PSI-BLAST against bacterial and viral databases at NCBI, only Rep protein sequences of other phytoplasmal plasmids or geminiviruses were identified with significant scores. This suggests that other bacterial RCR Rep proteins share much less similarity with phytoplasmal Reps than do those of geminiviruses. Indeed, sequences of bacterial plasmid Reps identified using pattern searches by Koonin and Ilyina (1992) share only three of the five motifs characteristic of geminiviral Reps [15, 17]. Also, there is no significant sequence similarity, other than the three shared motifs, between RCR Reps of bacterial plasmids (other than phytoplasmal plasmids) and geminiviruses. For example, BLAST searches against geminiviral protein sequences at NCBI using as seeds the Rep sequences of plasmids pMV158 [GenBank:YP_001586272] and pUB110 [GenBank:CAA27141], the two plasmids whose Reps were found to be the closest to geminiviral Reps, returned no positive hits. Our analysis identifies Reps of phytoplasmal plasmids as the sequences most similar to geminiviral Reps in currently available public protein sequence databases. This observation suggests that geminiviral Reps share a more recent common ancestor with Reps of phytoplasmal plasmids than they do with other viral or plasmid RCR Reps.
Interestingly, phytoplasmas and geminiviruses are both obligate parasites occupying a common ecological niche – phloem tissue of plants, which consists of parenchyma cells, sieve-tube cells, and companion cells. Phytoplasmas have been observed in companion cells and phloem parenchyma cells as well as in sieve elements . The same types of cells were shown to contain geminiviral DNA when Nicotiana benthamiana and Lycopersicon esculentum were infected with Tomato yellow leaf curl Sardinia virus and/or Tomato yellow leaf curl virus . It should be noted, however, that not all geminiviruses are phloem-limited . Furthermore, both geminiviruses and phytoplasmas share at least one common insect vector (leafhoppers) that is essential for transmission between plants [21, 27]. It is conceivable that extrachromosomal replicons of phytoplasmas evolved by acquisition of the capsid-coding gene to give rise to geminiviruses.
Next, we superimposed the structural models of the STNV and geminiviral CPs and extracted the structure-based sequence alignment (Fig. 4C). Of the 184 STNV CP amino acid residues for which structural information is available [PDB:2buk], 69.1% had corresponding amino acids in at least one of the four geminiviral CP sequences (75 identical and 52 similar residues) (Fig. 4C). Given that all geminiviral CPs are true homologues, our observation indicates that STNV and geminiviral CPs share not only tertiary but also significantly similar primary structures, which further justifies the suggested relationship between these viral CPs. It is evident from Fig. 4 that secondary structure elements are well conserved and that insertions in the loop regions between β-strands account for the larger size of geminiviral CPs. The most prominent insertions are observed in the CP of mastrevirus (between βB and βC, and between βF and βG) and begomovirus (between βC and βD, and between βD and βE). The βD/βE loop was identified as essential for controlling whitefly transmission of begomoviruses, whereas the βF/βG loop was proposed to be required for leafhopper transmission.
It is notable that the eight-stranded β-barrel fold is characteristic of all icosahedral ssRNA plant and animal viruses as well as of ssDNA viruses of the Microviridae and Parvoviridae families. Previously, twinned particles of two geminiviruses, Maize streak virus (MSV; Mastrevirus) and African cassava mosaic virus (ACMV; Begomovirus), were resolved using electron cryo-microscopy (cryo-EM) and image reconstruction techniques to 25 Å and 16–19 Å resolution, respectively. In both studies the CP of STNV was also found to be the best template for structural modeling of the geminiviral CPs. Successful fitting of the pseudo-atomic model of MSV CP into the cryo-EM density map strongly corroborates the prediction that CPs of STNV and geminiviruses share the same fold.
All these observations suggest a possible scenario for the origin of geminiviruses. Phylogenetic and clustering analyses of the geminiviral Rep proteins (Figs. 2, 3) indicate that they share a more recent common ancestor with Reps of plasmids from phytoplasmas than with those from other bacteria or viruses. There are two possible ways to explain this relationship. One is that a phytoplasmal cell, while inside the plant cell, internalized the genome of a geminivirus-like agent, replication and partitioning of which was subsequently stabilized along with the loss of a CP-coding gene. The other possibility is that phytoplasmal plasmids released upon lysis of the bacterial cell in the cytoplasm of the host plant cell were able to obtain a capsid-coding gene from an unknown plant virus. The former possibility seems unlikely since some geminiviruses have not only maintained features of prokaryotic replicons, such as typical bacterial promoter sequences, but, more surprisingly, are in some instances still able to replicate their DNA in bacterial cells [37, 38]. We were unable to identify any proteins other than RCR Reps that are common to both phytoplasmal plasmids and geminiviruses. However, this is not surprising, since the protein content required for successful persistence inside bacterial (for plasmids) and plant (for geminiviruses) cells is likely to be different. Furthermore, the capsid volume is a limiting factor dictating the amount of genetic information that can be packaged. There is thus strong pressure on the genome content of viruses with small capsids, leading to the loss of genetic information unnecessary for virus propagation.
What virus might be a donor of a capsid-coding gene to the escaping phytoplasmal plasmid? The vast majority of plant viruses have RNA genomes. Modeling of the geminiviral CP suggests that it folds into the eight-stranded β-barrel (Fig. 4A), a fold common to all isometric ssRNA plant viruses. Notably, STNV encodes a single protein, a capsid protein, which was found to be the closest non-geminiviral relative of the geminiviral CP out of the 231 icosahedral virus capsid proteins whose X-ray structures are currently available at the PDB. STNV possesses the simplest capsid, formed from 60 subunits of the CP arranged into a T = 1 icosahedral lattice. Pentamers of the CP are the building blocks of the STNV particles. The same is true for geminiviruses. Geminivirus virions are composed of two incomplete icosahedra (110 copies of CP in MSV) that are joined together (Fig. 1A). Such virion architecture is unique to geminiviruses and is not observed in any other currently known viruses. While the interior volume of the isometric particles is sufficient to pack the 1,239 nt of the STNV genome, it is unable to accommodate the larger (2.5–3.0 kb) genome of geminiviruses. Interestingly, it was found that the CP of geminiviruses produces not only twinned wild-type capsids but also isometric capsids and even capsids formed of three incomplete icosahedra (Fig. 1) [40–42]. The valency of the capsid apparently correlates with the length of the packaged nucleic acid. It has been shown that noninfectious isometric T = 1 MSV particles contain subgenomic MSV DNA fragments from about 0.2 kb to nearly half of the wild-type genome. Such heterogeneity in particle size and production of noninfectious particles per se might be seen as an indication of ongoing optimization and adaptation of the CP, which was originally utilized to form smaller (isometric) particles, to build larger capsids.
Taking into account the high nucleotide substitution rate in geminiviruses, which is similar to that of RNA viruses, the sequence conservation between STNV and geminiviral CPs as well as between phytoplasmal plasmid and geminiviral Reps is striking. It is possible that the emergence of the ancestral geminivirus from a phytoplasmal plasmid and an RNA virus occurred relatively recently on the evolutionary timescale. Although less likely, the possibility of convergent evolution cannot be ruled out either.
An alternative hypothesis for the origin of geminiviruses is that they are descendants of as yet undiscovered ssDNA viruses with geminiviral-like Reps that have acquired their CP-coding genes either from an RNA or DNA virus by horizontal gene transfer. Indeed, recent metagenomic analysis of samples from a rice paddy soil unveiled the presence of putatively viral replicons with geminivirus/phytoplasma-like Reps but not other geminiviral genes . Unfortunately, metagenomic studies do not provide any information on the origin of the amplified replicons, making it impossible to know with certainty that the amplified DNA does not belong to geminiviruses or plasmids. Therefore, there is currently no evidence to support the hypothesis predicting the existence of a virus that would be a missing link between geminiviruses and other ssDNA viruses.
If geminiviruses originated from phytoplasmal plasmids, is it possible that similar transitions happened several times to give rise to different viral families? As mentioned above, RCR Rep of the Bifidobacterium pseudocatenulatum plasmid p4M [GenBank:AAM00235] was previously shown to be more similar to Reps of various circoviruses than it is to Reps from other bacterial plasmids and viruses . It is therefore tempting to speculate that circoviruses might also be direct descendants of bacterial plasmids.
Phylogenetic as well as complete linkage clustering analysis of RCR Rep proteins from geminiviruses suggests their evolutionary relationship with Rep proteins of phytoplasmal plasmids, while structural modeling of the geminiviral CP points to a connection between geminiviruses and icosahedral ssRNA viruses. We suggest a scenario for the origin of geminiviruses in which acquisition of the capsid protein-coding gene from an ssRNA plant virus by a phytoplasmal plasmid gave rise to the ancestor of geminiviruses. This scenario involves two assumptions. The first is that the same plant cell was coinfected by a phytoplasma and an ssRNA virus. Such a coinfection has indeed been observed: sugarcane phloem was found to frequently contain both phytoplasmas and Sugarcane yellow leaf virus (an icosahedral ssRNA virus) [45, 46]. The second assumption is that recombination occurred between the RNA genome of a virus and the DNA molecule of a plasmid. Although recombination between RNA and DNA viruses is not common, there is evidence pointing to the possibility of such gene exchange in the viral world [47, 48]. The scenario proposed here implies that geminiviruses emerged in plant cells through introduction of a structural element (capsid-coding gene) of a plant virus into a plasmid liberated from a plant-infecting bacterium. Although this plasmid-to-virus transition does not satisfy the requirements of de novo virogenesis, since a preexisting viral building block was utilized for virion formation, it nevertheless accounts for the emergence of a novel virus family, the Geminiviridae. Consequently, the borderline between the two selfish genetic elements – viruses and plasmids – becomes transparent.
Koonin and Ilyina (1992) found that geminiviral rolling-circle replication (RCR) initiation proteins (Rep) are related to certain bacterial Reps. To obtain a dataset for phylogenetic analysis of geminiviral Reps, we set out to retrieve all bacterial RCR Reps from the non-redundant protein database at NCBI using PSI-BLAST searches (BLOSUM62 matrix, 0.05 as an E-value cutoff). Surprisingly, only RCR Reps from phytoplasmal plasmids were identified using this approach. To extend the dataset, we used an alternative approach, pattern matching. Rolling-circle replication proteins of geminiviruses contain five conserved motifs that are essential for their activity [13–16]. Based on this knowledge, an exact geminivirus-specific sequence pattern, encompassing all five conserved motifs, was generated: F(T[LI]/[LM]T)[YN]X(1,100)HX[HQ]X(1,100)YXXKX(50,200)GXXXXGK[ST]X(1,100)DD. The residues shown in square brackets are alternatives; X denotes any amino acid; numbers in parentheses denote the allowed distance between corresponding motifs; the slash indicates alternation of the dipeptides in the second and third positions of the pattern. The non-redundant protein sequences and environmental protein sequences from the BLAST database were downloaded (07.02.2009) from the NCBI FTP site and searched for sequences exactly matching the derived pattern, without regard to the sequences surrounding the conserved motifs (as long as their length falls in the range specified in the pattern). This approach is expected to find sequences missed by BLAST searches. Initially, 1072 protein sequences were extracted. To avoid redundancy, the original dataset was subsequently filtered to leave only sequences with less than 70% identity. As a result, a dataset containing 43 protein sequences was obtained.
Of these, two sequences were false positives: a 799 amino acid-long hypothetical protein [GenBank:XP_001614627] from Plasmodium vivax SaI-1 and a 440 amino acid-long hypothetical TrmE domain protein GOS_1133298 [GenBank:EDE42344] from a marine metagenome project. These were not included in the further analysis. The resultant dataset (41 sequences) was used to create a multiple sequence alignment using CLUSTALW. One geminiviral sequence [GenBank:ABD67440] was found to be considerably longer (469 aa) than the rest of the sequences. The protein was found to be a fusion of RCR Rep and the geminiviral transcriptional activator AC2 and was therefore removed from the alignment. The 40 remaining sequences were realigned and, following manual examination and editing, the resulting alignment [see Additional file 1] was utilized for phylogenetic analysis. Maximum likelihood analysis was carried out using PhyML v2.4.4, with a WAG model of amino acid substitution, including a gamma law with 4 categories to take into account differences in evolutionary rates at sites, and an estimated proportion of invariable sites. The robustness of the tree was assessed by bootstrap analysis (1,000 replicates). A Bayesian phylogenetic tree was constructed using MrBayes with a mixed model of amino acid substitution and a gamma law (eight discrete classes). MrBayes was run with four chains for 2.1 × 10^6 generations and trees were sampled every 100 generations. To construct the consensus tree, the first 25% of the trees were discarded as "burn-in".
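The pattern search described above can be illustrated with a regular expression. The following is a minimal sketch, not the authors' actual script (the text does not say which tool was used): it assumes that X maps to any single character, that X(m,n) maps to a bounded repeat, and that the slash maps to regex alternation. The names REP_PATTERN, has_rep_motifs and select_candidates are hypothetical.

```python
import re

# Assumed regex translation of the five-motif pattern quoted in the text:
#   X            -> "."        (any residue)
#   X(m,n)       -> ".{m,n}"   (spacer of m to n residues)
#   (T[LI]/[LM]T) -> "(?:T[LI]|[LM]T)"
REP_PATTERN = re.compile(
    r"F(?:T[LI]|[LM]T)[YN]"  # first conserved block
    r".{1,100}"
    r"H.[HQ]"                # second conserved block
    r".{1,100}"
    r"Y..K"                  # third conserved block
    r".{50,200}"
    r"G.{4}GK[ST]"           # fourth conserved block
    r".{1,100}"
    r"DD"                    # fifth conserved block
)


def has_rep_motifs(sequence):
    """Return True if the protein sequence contains all five blocks in order."""
    return REP_PATTERN.search(sequence.upper()) is not None


def select_candidates(records):
    """Given a mapping of name -> sequence, return names that match the pattern."""
    return [name for name, seq in records.items() if has_rep_motifs(seq)]
```

Because the match is anchored only on the conserved blocks and the spacer lengths, flanking sequence is ignored, which corresponds to the "without regard to the surrounding sequences" condition in the text. The 70% identity filter and manual curation steps are not covered by this sketch.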
The multiple sequence alignment [see Additional file 1] was used to calculate the pairwise distance matrix with MEGA4. Analyses were conducted using the Poisson correction method. All positions containing gaps and missing data were eliminated from the dataset (Complete deletion option). There were a total of 178 positions in the final dataset. The calculated pairwise distances were used to perform complete linkage clustering analysis, where the distance between two clusters is defined as the distance between the two farthest objects in the two clusters. At each round the clusters are examined and split into two clusters according to the longest distance. The members of the clusters were then grouped within the new cluster that has a shorter distance. The clustering was run until all sequences formed their own clusters.
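The distance calculation and the complete linkage criterion can be sketched in a few lines. This is an illustration under stated assumptions, not a reimplementation of MEGA4: it uses the standard Poisson correction d = -ln(1 - p), where p is the proportion of differing sites, treats "-" and "?" as the gap and missing-data symbols, and all function names are hypothetical.

```python
import math

GAP_SYMBOLS = "-?"  # assumed gap and missing-data characters


def complete_deletion(aligned):
    """Drop every alignment column that contains a gap in any sequence."""
    keep = [i for i in range(len(aligned[0]))
            if all(seq[i] not in GAP_SYMBOLS for seq in aligned)]
    return ["".join(seq[i] for i in keep) for seq in aligned]


def poisson_distance(a, b):
    """Poisson-corrected distance d = -ln(1 - p) between two gap-free sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p == 0:
        return 0.0
    if p >= 1:
        return math.inf  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


def distance_matrix(aligned):
    """Pairwise Poisson distances after complete deletion of gapped columns."""
    seqs = complete_deletion(aligned)
    return [[poisson_distance(a, b) for b in seqs] for a in seqs]


def complete_linkage(cluster_a, cluster_b, dist):
    """Cluster distance = distance between the two farthest members."""
    return max(dist[i][j] for i in cluster_a for j in cluster_b)
```

Here clusters are lists of row indices into the distance matrix, so `complete_linkage([0], [1, 2], dist)` returns the larger of `dist[0][1]` and `dist[0][2]`. The splitting procedure described in the text would repeatedly apply this criterion; that loop is left out of the sketch.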
BioInfoBank MetaServer was used for prediction of the tertiary structures. The structure of the STNV capsid protein (CP) was determined to be the best template for structural modeling, with significance scores ranging from 57.67 to 82.50; scores above 50 are assumed to be significant and correspond to a prediction accuracy of above 90%. The sequences of the geminiviral CPs were individually aligned with the corresponding protein sequence of STNV using version 9.2 of the MODELLER program. The align2d algorithm of MODELLER differs from standard sequence-sequence alignment methods because it takes into account structural information from the template when constructing an alignment. This is achieved through a variable gap penalty function that tends to place gaps in solvent-exposed and curved regions, outside secondary structure segments, and between two positions that are close in space. The resulting alignments were utilized to build three-dimensional models of the four geminiviral CPs using MODELLER. Ten variants of each CP were generated and the one with the best stereochemical quality, as validated using MolProbity, was chosen. The structural superposition of the models with the X-ray structure of the STNV CP was performed using the STAMP algorithm, and the results were visualized with the VMD program.
This work was supported by the Finnish Center of Excellence Program (2006–2011) of the Academy of Finland (Grant 1213467 and Grant 1210253 to DHB). MK is supported by the Viikki Graduate School in Biosciences.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.