Gain and loss of polyadenylation signals during evolution of green algae

Background The Viridiplantae (green algae and land plants) consist of two monophyletic lineages: the Chlorophyta and the Streptophyta. Most green algae belong to the Chlorophyta, while the Streptophyta include all land plants and a small group of freshwater algae known as Charophyceae. Eukaryotes attach a poly-A tail to the 3' ends of most nuclear-encoded mRNAs. In embryophytes, animals and fungi, the signal for polyadenylation contains an A-rich sequence (often AAUAAA or related sequence) 13 to 30 nucleotides upstream from the cleavage site, which is commonly referred to as the near upstream element (NUE). However, it has been reported that the pentanucleotide UGUAA is used as polyadenylation signal for some genes in volvocalean algae. Results We set out to investigate polyadenylation signal differences between streptophytes and chlorophytes that may have emerged shortly after the evolutionary split between Streptophyta and Chlorophyta. We therefore analyzed expressed genes (ESTs) from three streptophyte algae, Mesostigma viride, Klebsormidium subtile and Coleochaete scutata, and from two early-branching chlorophytes, Pyramimonas parkeae and Scherffelia dubia. In addition, to extend the database, our analyses included ESTs from six other chlorophytes (Acetabularia acetabulum, Chlamydomonas reinhardtii, Helicosporidium sp. ex Simulium jonesii, Prototheca wickerhamii, Scenedesmus obliquus and Ulva linza) and one streptophyte (Closterium peracerosum). Our results indicate that polyadenylation signals in green algae vary widely. The UGUAA motif is confined to late-branching Chlorophyta. Most streptophyte algae do not have an A-rich sequence motif like that in embryophytes, animals and fungi. We observed polyadenylation signals similar to those of Arabidopsis and other land plants only in Mesostigma. Conclusion Polyadenylation signals in green algae show considerable variation. 
A new NUE (UGUAA) was invented in derived chlorophytes and replaced not only the A-rich NUE but the complete poly(A) signal in all chlorophytes investigated except Scherffelia (only NUE replaced) and Pyramimonas (UGUAA completely missing). The UGUAA element is completely absent from streptophytes. However, the structure of the poly(A) signal was often modified in streptophyte algae. In most species investigated, an A-rich NUE is missing; instead, these species seem to rely mainly on U-rich elements.


Background
In eukaryotes, a polyadenylate tail [poly(A)] is attached to the cleaved 3' end of the nuclear-encoded precursor mRNA of most genes [1]. Polyadenylation is important for the regulation of mRNA stability and also affects translational capacity [2]. The general mechanism of polyadenylation is well understood in yeast and animals [3]. It requires two major components: poly(A) signals (cis-elements) on the pre-mRNA, and a protein complex (trans-acting factors) that carries out the cleavage of the pre-mRNA and the addition of the poly(A). The protein complex is conserved among organisms, but the poly(A) signals show considerable variation among species.
Five classes of cis-acting DNA elements have been identified [4][5][6] that facilitate polyadenylation: a far upstream element (FUE), a near upstream element (NUE), the cleavage site (CS), a cleavage element (CE) that surrounds the CS in Arabidopsis [4], and a downstream element (DSE). The FUE is generally U-rich. There are one or more NUEs, often AAUAAA or a related sequence 13 to 30 nt upstream of the CS (in the following, nucleotide positions are always given relative to the CS), each specific to its own CS. Only in animals is there an additional DSE [5][7][8][9] and the NUE AAUAAA is the major polyadenylation element. In contrast, in embryophytes (land plants), the strict AAUAAA element becomes a minor component and is replaced by variable A-rich sequences. Recently, Loke et al. [4] analyzed mRNA polyadenylation in Arabidopsis using the genome sequence and all available cDNAs. They confirmed the absence of highly conserved consensus signal patterns and showed that in Arabidopsis the poly(A) signal consists of a U-rich FUE (-25 to -160), an A-rich NUE (about -20) and a CE. The CE consists of U-rich sequences on both sides of the CS. Furthermore, they presented evidence for the formation of secondary RNA structures in the 3'-UTR. Because known mutations in these regions affect polyadenylation, Loke et al. [4] suggested that the secondary structures might play an important role in the process.
The Viridiplantae (literally meaning green plants), which include all green algae and embryophyte plants, represent a monophyletic group of organisms, which display a surprising diversity in respect of morphology, cell architecture, life histories and reproduction, and in their biochemistry. The Viridiplantae consist of two monophyletic lineages: the Chlorophyta and the Streptophyta [10]. The Chlorophyta comprise the vast majority of green algae including most scaly green flagellates (e.g. Pyramimonas, Tetraselmis), the Ulvophyceae (e.g. Ulva, Acetabularia), Chlorophyceae (e.g. Chlamydomonas, Volvox) and Trebouxiophyceae (e.g. Chlorella) [11][12][13]. The Streptophyta include all embryophyte plants and a diverse paraphyletic assemblage of freshwater green algae, the Mesostigmatales, Chlorokybales, Klebsormidiales, Zygnematales, Coleochaetales and Charales (stoneworts) [13,14]. The Charales are widely believed to be the sister group of the embryophytes, suggesting that the evolution of true land plants started with an already complex organism [15]. However, the position of the Charales is still controversial [16]. Remarkably, only a single scaly green flagellate, Mesostigma viride Lauterborn, has been found to belong to the Streptophyta [17][18][19][20].
For a few chlorophyte mRNAs (mainly tubulins) an alternative polyadenylation signal (UGUAA) has been suggested [21][22][23][24], and it has been proposed that chlorophytes generally use the UGUAA motif instead of AAUAAA as NUE [25]. So far, UGUAA has only been found in chlorophyte algae, which are thought to have branched deeply from the last common ancestor of Viridiplantae [20]. To test whether UGUAA might be a chlorophyte-specific poly(A) signal, we analyzed the ESTs of chlorophyte and streptophyte algae available from public databases. To include more species, especially early-branching ones, we sequenced ESTs from the flagellate Mesostigma viride [18], the filamentous algae Klebsormidium subtile and Coleochaete scutata, the flagellate Pyramimonas parkeae and the thecate flagellate Scherffelia dubia [26]. The first three are streptophytes, whereas the latter two are chlorophytes. Our results indicate that derived chlorophyte algae use mainly UGUAA, whereas most streptophyte algae prefer U-rich sequences. Interestingly, only in Mesostigma have we found a polyadenylation signal similar to that in Arabidopsis and other embryophytes.

Data sets used
We sequenced ESTs from cDNA libraries for Mesostigma viride [18], Klebsormidium subtile, Coleochaete scutata, Pyramimonas parkeae and Scherffelia dubia. Details on the preparation and assembly of contigs are given in Materials and Methods. To extend our database on polyadenylation in green algae, all ESTs from chlorophyte and streptophyte algae were downloaded from GenBank and TIGR (June 2006). To perform our analysis we relied on information about the orientation of EST sequence data. Thus, only data in which the polyA tail indicated the orientation could be used. For this reason, we could include only the following organismal data sets: Chlamydomonas, Helicosporidium, Acetabularia, Prototheca, Ulva, Scenedesmus (all chlorophytes) and Closterium (streptophyte). For each data set, the origin, total number of RNAs and number of nonredundant poly(A)-containing mRNAs, and the base compositions of the 200 nt upstream from the CS, are given in Table 1.

Positional nucleotide frequencies upstream of the CS
For all data sets, we established the 1 nt pattern of the 200 nt upstream from the CS and the frequencies of penta- and hexanucleotide words within the first 50 nt upstream from the CS. The 1 nt patterns of selected genera are shown in Fig. 1.
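The 1 nt pattern described above can be illustrated with a minimal sketch. It is not the authors' implementation (they report using the Bioedit program for this step); the function name `positional_frequencies` and the convention that each input string ends at the base immediately preceding the CS are assumptions made for illustration.

```python
from collections import Counter

def positional_frequencies(upstream_seqs, window=200):
    """Per-position base fractions upstream of the CS.

    Assumption: every input sequence ends at the base just before the CS.
    Entry i of the result describes position -(i + 1) relative to the CS.
    """
    counts = [Counter() for _ in range(window)]
    for seq in upstream_seqs:
        # EST data are DNA; report in RNA notation (T -> U) as the text does.
        rna = seq.upper().replace("T", "U")
        # Walk backwards from the 3' end so index 0 is position -1.
        for i, base in enumerate(reversed(rna[-window:])):
            if base in "ACGU":
                counts[i][base] += 1
    profile = []
    for c in counts:
        total = sum(c.values())
        profile.append({b: (c[b] / total if total else 0.0) for b in "ACGU"})
    return profile
```

Plotting the U and A fractions from such a profile against position is one way to look for the change in nucleotide frequencies around -20 that the text describes. Small data sets give noisy per-position fractions, which is consistent with the remark that the patterns are easier to see in large data sets.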
For all organisms, we observed a clear change in nucleotide frequencies around 20 nt upstream from the CS (Fig. 1), indicative of a putative poly(A) signal. In all chlorophytes except Pyramimonas, there is a sharp peak of U followed by a sharp peak of A. In contrast, in Pyramimonas and all streptophytes, we observed a broad peak of U followed by peaks of A and then U again. These changes are easier to see in large organismal data sets than in small ones, as variations in nucleotide frequencies at individual positions relative to the CS are higher in the latter. For this reason the curves presented in

Poly(A) signals in Chlorophyta
A short Python program (available from the authors) was written to scan the 50 nt upstream from the CS for the occurrence of penta- and hexanucleotide words in the various organismal data sets. The base composition differs considerably among the various chlorophytes (Table 1). For this reason, the frequency of the UGUAA motif expected by chance within 50 nt upstream from the CS varies between 2 and 12%. To test whether the observed over-representation of penta- and hexanucleotide words was statistically significant, we calculated the log odds ratio (lnω) and its 95% confidence interval for each word [27; see Methods for details]. Table S1 [see Additional file 2] lists all penta- and hexanucleotide words for each organism that are at least 2.7 times (lnω > 1) more frequent than expected by chance and for which the over-representation is statistically significant. Many penta- and hexanucleotide words fulfilling these criteria show considerable overlap and can be arranged into a few clusters centered on the most significantly over-represented word [see Additional file 2]. Some of these clusters occur at frequencies too low (although statistically significant) to be considered as putative polyadenylation signals and might represent regulatory sites for a subset of mRNAs from an organism. Further analysis is required to investigate this possibility.
Similarly, we observed a statistically significant increase of U-rich sequences between -1 and -50 upstream from the CS only for Pyramimonas (Table 1) and Scherffelia (UUUUA: found in 13.5%, expected 2.4%). Again, the U-rich sequences showed a broad distribution within the 50 nt upstream from the CS (Fig. 2).
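The word scan and the significance test described above can be sketched as follows. This is an illustration only, not the authors' script, which is not reproduced in the text. All function names are hypothetical. The chance estimate assumes a zero-order model (independent bases, overlaps between windows ignored), and the log odds ratio uses the generic textbook 2x2 form with a Woolf-type standard error; the exact formula the authors applied may differ.

```python
import math
from collections import Counter

def to_rna(seq):
    # EST data are DNA; the text writes motifs in RNA notation.
    return seq.upper().replace("T", "U")

def word_presence(upstream_seqs, k, window=50):
    """For each k-mer, count the sequences containing it in the last `window` nt."""
    present = Counter()
    for seq in upstream_seqs:
        tail = to_rna(seq)[-window:]
        words = {tail[i:i + k] for i in range(len(tail) - k + 1)}
        # Skip words with ambiguous bases such as N.
        present.update(w for w in words if set(w) <= set("ACGU"))
    return present

def chance_presence(word, base_freq, window=50):
    """Zero-order estimate (assumption) of P(word occurs at least once in the window)."""
    p = math.prod(base_freq[b] for b in word)
    starts = window - len(word) + 1
    return 1.0 - (1.0 - p) ** starts

def log_odds(observed, expected, n):
    """ln(odds ratio) with a 95% confidence interval.

    observed, expected: number of sequences (out of n) containing the word.
    All four cells of the implied 2x2 table must be greater than zero.
    """
    a, b = observed, n - observed
    c, d = expected, n - expected
    ln_omega = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return ln_omega, ln_omega - 1.96 * se, ln_omega + 1.96 * se
```

With equal base frequencies of 0.25, `chance_presence("UGUAA", ...)` gives roughly 4%, which falls inside the 2 to 12% range quoted above for chlorophytes of differing base composition. A word would pass the filter described in the text when lnω exceeds 1 and the confidence interval indicates significance.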

Poly(A) signals in streptophyte algae
In all streptophyte algae analyzed, the chlorophyte-specific UGUAA-motif was present more often within 50 nt upstream from the CS than expected by chance [see Additional file 2]. The over-representation of UGUAA is statistically significant, but the UGUAA-motif showed a more or less random distribution between -1 and -50 nt upstream from the CS [see Additional file 1] for all four streptophyte algae.
We observed significant numbers of the AAUAAA motif only for Mesostigma, Closterium and Coleochaete, whereas it was not over-represented between -1 and -50 upstream from the CS in Klebsormidium. However, AAUAAA was enriched between -13 and -25 nt upstream from the CS only in Mesostigma (Fig. 2). Some single base variations (UAUAAA and AAUUAA) show the same distribution as the motif AAUAAA within 50 nt upstream from the CS, indicating that AAUAAA-like sequences might function as a NUE of the poly(A) signal in Mesostigma in the same way as in land plants. In all streptophyte algae, we observed significant increases in U-rich sequences around -10 and -30 to -40 nt upstream from the CS (Fig. 2).
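Whether a motif is randomly distributed or concentrated in a narrow window, as discussed above for UGUAA and AAUAAA, can be checked by tabulating motif positions relative to the CS. The sketch below is illustrative only. The function names are hypothetical, and it assumes that each input sequence ends at the base immediately preceding the CS and that a motif's position is the offset of its 5'-most base; the text does not state which base of the motif the authors used as reference.

```python
from collections import Counter

def motif_positions(upstream_seqs, motif, window=50):
    """Histogram of motif start positions as negative offsets from the CS."""
    motif = motif.upper().replace("T", "U")
    hist = Counter()
    for seq in upstream_seqs:
        tail = seq.upper().replace("T", "U")[-window:]
        start = tail.find(motif)
        while start != -1:
            # The last base of the tail is position -1, so a motif starting
            # at index `start` begins at offset start - len(tail).
            hist[start - len(tail)] += 1
            start = tail.find(motif, start + 1)  # allow overlapping hits
    return hist

def fraction_in_range(hist, lo, hi):
    """Fraction of all motif occurrences whose offset lies in [lo, hi]."""
    total = sum(hist.values())
    inside = sum(n for pos, n in hist.items() if lo <= pos <= hi)
    return inside / total if total else 0.0
```

For example, `fraction_in_range(motif_positions(seqs, "AAUAAA"), -25, -13)` would report how much of the AAUAAA signal falls in the window where enrichment was seen for Mesostigma. A value close to the window's share of all possible positions would suggest a more or less random distribution.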

Number of NUE per transcript
On average, about 8.2% of the contigs from chlorophytes (except Pyramimonas) and Mesostigma contained more than a single copy of the putative poly(A) signals UGUAA or AAUAAA, respectively, within 50 nt upstream from the CS (Table 3). Inspection of the mRNAs comprising these contigs revealed that about 15% of those with two putative poly(A) signals (0.5% of all contigs) were assembled from two mRNA species differing only in the positions of their poly(A) tails (Table 3). In all cases, one polyadenylation site seemed to be preferred over the other.

Figure caption: Single-nucleotide profiles of the 3'UTR in various green algae.

Discussion
Analysis of the 3'-UTRs of a large set of genes from different chlorophytes and streptophytes revealed major differences in putative polyadenylation signals between chlorophyte and streptophyte algae. Both AAUAAA and UGUAA motifs have previously been described as possible NUE motifs in green algae [25][28]. However, it was not clear how these motifs were distributed phylogenetically. We detected a clear putative AAUAAA-like NUE only in Mesostigma. The other streptophyte algae investigated seem to have lost the AAUAAA-like NUEs, although a weak enrichment of A around position -17 can still be detected in 1 nt positional frequency plots (Fig. 1). We found the UGUAA motif in all chlorophytes except Pyramimonas. The UGUAA motif in chlorophytes was not randomly distributed within the 50 nt upstream from the CS, but showed a clear peak at -10 to -30 upstream from the CS. Although there is currently no experimental proof that UGUAA is a poly(A) signal, its narrow distribution and its presence in all chlorophytes investigated except Pyramimonas are a strong indication that this signal is indeed functional and plays a role in mRNA processing. It could be argued that some data sets are still too small to identify poly(A) signals. Indeed, if we had used only a single data set (e.g. the 28 Acetabularia sequences) we could not have drawn this conclusion. However, the over-representation and narrow distribution of UGUAA within 50 nt upstream of the CS in 7 different chlorophytes (including chlorophycean, ulvophycean and trebouxiophycean algae) that vary greatly in GC content (Table 1) make us confident that UGUAA is probably a true polyadenylation signal.
For some species we lowered the criterion for recognizing a poly(A) tail to a stretch of 5 terminal adenines. This was required because many researchers trim their poly(A) tails. For example, in the original Chlamydomonas Gene Index as downloaded from TIGR, only 7 out of 31608 sequences contained a stretch of 10 or more terminal adenines. Lowering the criterion to a stretch of 5 terminal adenines and including sequences that start with oligo(T) stretches (possibly representing the reverse complement of mRNAs) increased the number to 10508 sequences. Manual inspection revealed that many of these sequences contained the putative UGUAA poly(A) signal, and indeed we detected the UGUAA motif between -10 and -30 from the CS in about 50% of the sequences. However, as a stretch of 5 adenines can also be found in protein-coding sequences, we cannot exclude the possibility that our data set includes some internal sequences from mRNAs rather than the poly(A)-preceding sequence. Therefore, some of the numbers presented for the organismal data sets downloaded from public databases, which include all late-branching chlorophytes, might actually be too low, making the number of mRNAs possessing the UGUAA motif even higher.

Figure 2: Distribution of (putative) polyadenylation signals within 50 nt upstream from the CS in different chlorophyte and streptophyte algae. Distribution of the (putative) polyadenylation signals UGUAA, AAUAAA and U-rich within 50 nt upstream from the CS in different chlorophyte and streptophyte algae.
It has been shown that A-rich NUEs are necessary for polyadenylation in embryophytes and animals [3,7]. In yeast, U-rich upstream elements are even more important than the NUEs for facilitating polyadenylation [29]. These FUEs are also found in animals and embryophytes [5,6]. The absence of U-rich sequences in the 250 nt upstream from the CS in Chlamydomonas and other chlorophytes may indicate that the putative UGUAA-dependent polyadenylation involves a mechanism different from that of A-rich NUE-dependent polyadenylation, although homologues of the known protein machinery for polyadenylation in other eukaryotes can be detected in the Chlamydomonas genome. Alternatively, other sequence motifs may be required in addition to the UGUAA signals for proper polyadenylation, although we failed to detect other nucleotide words that were significantly enriched in all chlorophytes.
No clear NUE (AAUAAA-like or UGUAA) is present in Closterium, Klebsormidium or Coleochaete. The lack of a UGUAA motif appears to be a general feature of streptophytes as this sequence motif was also not detected in Mesostigma (this study), Arabidopsis or rice [4,5]. Given the presence of AAUAAA-like NUE in Mesostigma (this study) and land plants [4,5,8], its complete absence from the other streptophyte algae is surprising. Instead, the sequence motifs most often found are U-rich. We propose that these sequence motifs represent FUEs facilitating polyadenylation, as in yeast [5,29]. However, the number of mRNAs with a poly(A) tail investigated is still rather low for these species and there is still no direct proof for the proposed function of U-rich sequences in Closterium, Klebsormidium and Coleochaete. In addition, it is known that auto-correlated sequences such as oligo(U) show increased variance, i.e. there is a higher probability of observing higher or lower word counts than expected [30]. Therefore, the significance of the occurrence of such auto-correlated patterns in our analyses is not yet clear, as we cannot exclude the possibility that a NUE could be detected in a larger data set for each organism. In this context, it would be most interesting to know the structures of polyadenylation signals in the Charales. Currently, the Charales are considered to be a sister group to the embryophyte lineage [15]. EST projects for two Chara strains are underway in Japan and the United States; however, no data have been released yet into the public domain. At present, only 28 mRNAs from Chara species can be found in GenBank, and a clear poly(A) tail is recognizable only for the nuclear-encoded GAPDHB from Chara vulgaris [31]. This sequence does not contain either UGUAA or AAUAAA within 50 nt upstream from the CS. However, a U-rich region is present about 60 nt upstream from the CS. Whether this is typical for the Charales, as for Klebsormidium and Coleochaete, remains to be seen.
In plants and algae, the usage of multiple polyadenylation signals within a single gene, leading to different mRNA species, has been reported [7,32]. As we observed this phenomenon only in 0.5% of the contigs (expressed genes) investigated, it may not be prominent within green algae. Another type of mRNA variation observed in mammals is the use of different CS downstream from a single polyadenylation signal [33]. In mammals, this seems to occur in a considerable percentage (22-44%) of cases [33]. Polyadenylation heterogeneity is tissue-specific in mammals [34] and has been related to mRNA stability and/or translation efficiency [34]. We encountered only one similar situation during our study. For the rbcs gene from Scherffelia we observed two mRNA species. In one of these (35% of sequenced ESTs), polyadenylation started 14 nt downstream from the UGUAA motif; in the other (65% of sequenced ESTs), it started 26 nt downstream from the same UGUAA motif. We do not yet know the functional significance of this finding or whether it is a common phenomenon in green algae.
Two evolutionary scenarios can be considered. In the first (presented in Fig. 3), the present-day embryophyte signals were already present in the last common ancestor and were lost differentially during the evolution of the different green algal lineages. Only the direct phylogenetic lineage to Mesostigma conserved this motif set. Concurrently, the UGUAA signal evolved as a replacement in the chlorophyte lineage. In the second scenario, polyadenylation in the last common ancestor depended only on U-rich sequences. Then two different additional signals evolved, constituting either an addition to the existing signals (embryophytes) or replacing them (Chlorophyta). We favor the first scenario for the following reason. Polyadenylation is a typical eukaryotic feature and therefore evolutionarily old. Animals, plants and fungi use AAUAAA or A-rich sequences as NUE, and U-rich sequences as additional elements [see introduction for details, see also [6]]. Therefore, either plants, animals and fungi inherited the general structure of the polyadenylation signals from their last common ancestor, or the similar structures of their poly(A) signals indicate convergent evolution. The latter hypothesis seems to us less likely.
If the first scenario outlined above is correct, the proposed replacement of AAUAAA with UGUAA in chlorophytes occurred stepwise and early during chlorophyte evolution. The A-rich NUE was lost in Pyramimonas, UGUAA and U-rich elements are present in Scherffelia, and UGUAA is present and U-rich elements are absent in all the other late-branching chlorophytes investigated. Once the UGUAA motif was established in late-branching chlorophytes, it seems to have been under strong evolutionary pressure, as it has changed so little in the various chlorophyte lineages (compare the observed variation in streptophytes). The gain of the UGUAA motif as poly(A) signal in Scherffelia correlates with major evolutionary transitions such as the conversion of an extracellular matrix consisting of scales into a cell wall, changes in mitosis and cytokinesis and other major cellular changes [12,13,35]. Our finding that Scherffelia (a member of the Chlorodendrales), like the other late-branching chlorophyte classes Ulvophyceae, Trebouxiophyceae and Chlorophyceae (UTC clade), possesses a UGUAA motif in the 3'UTR supports the sister relationship between the Chlorodendrales and the UTC clade as revealed by molecular phylogenetic analyses [13]. The shared derived change in polyadenylation mechanism might represent the first synapomorphic character uniting the Chlorodendrales and the UTC clade.

Figure caption: Drawing showing the phylogenetic relationships for the organisms investigated.

Conclusion
Our results show that putative poly(A) signals may vary considerably among organisms. In chlorophytes (except Pyramimonas) the A-rich NUE was completely replaced by the UGUAA motif, which represents a synapomorphic character of the Chlorodendrales and the UTC clade. Also, the structure of the poly(A) signal was modified in many streptophytes. There is no A-rich NUE and the organisms seem to rely exclusively on U-rich elements.

[18]. Pyramimonas parkeae was cultured in modified ASP medium [36]. mRNA was isolated from interphase cultures using the mRNA Isolation Kit (Roche Applied Science). mRNAs (50 ng) were converted into cDNA using the CapFishing™ Full-length cDNA Premix Kit (Seegene) and then amplified by PCR with either of the following primer combinations: 5'-RACE (Seegene) and (5'-TTTTTTTTTTTTTTTTN-3') (Coleochaete scutata), or 3'-RACE (Seegene) and C1 (5'-NNNNNNNNNNNNNNNATG-3') (Klebsormidium subtile and Pyramimonas parkeae). cDNA was size-fractionated by agarose gel electrophoresis and the smear between 500 bp and 5000 bp was isolated and cloned into pGEM-T Easy vector (Promega). All libraries were transformed into ElectroMAX DH10B E. coli cells (Invitrogen) by electroporation.

Data selection
Isolated plasmids were sequenced by the cycle sequencing method using an ABI3700 96 capillary sequencer. A minimal contig set was assembled using the phrap assembler and all contigs were manually curated.

Sequence analysis
ESTs and cDNAs from public databases were checked for redundancy (within each organism-specific data set) using the GCG software or assembled into contigs using the CAP assembler of the Bioedit program. All non-redundant sequences from the different organisms were checked for the presence of a poly(A) tail. Initially, sequences with at least 15 adenines at the 3' end of the insert were assumed to include poly(A) tails and were truncated to the base preceding the 5'-most adenine and trimmed to the 200 nt upstream of the CS using the Bioedit editor. As we noticed that ESTs with fewer than 15 A at the 3' end contained a clear UGUAA about 20 nt upstream from the putative poly(A) in some species, we lowered the required number of As in the poly(A) tail to 5 to increase the number of sequences in the data sets. 1-nucleotide patterns were calculated for the 200 nt upstream from the CS using the Bioedit program.
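The tail detection and trimming rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure, which was carried out in the Bioedit editor. The function names `reverse_complement` and `upstream_of_cs` are hypothetical, and the handling of sequences that start with an oligo(T) stretch follows the possibility raised in the Discussion that such reads are reverse complements of mRNAs.

```python
import re

# Complement table for upper-case DNA/RNA letters; N stays N.
_COMPLEMENT = str.maketrans("ACGTUN", "TGCAAN")

def reverse_complement(seq):
    return seq.upper().translate(_COMPLEMENT)[::-1]

def upstream_of_cs(est, min_a=5, window=200, allow_reverse=True):
    """Return up to `window` nt preceding a terminal poly(A) run, or None.

    min_a: minimum number of terminal adenines accepted as a poly(A) tail
           (the text uses 15 initially and 5 after relaxing the criterion).
    allow_reverse: also accept reads that begin with an oligo(T) stretch,
           treating them as reverse complements (assumption).
    """
    seq = est.upper()
    tail = re.compile(r"A{%d,}$" % min_a)
    match = tail.search(seq)
    if match is None and allow_reverse and re.match(r"T{%d,}" % min_a, seq):
        seq = reverse_complement(seq)
        match = tail.search(seq)
    if match is None:
        return None
    # Truncate to the base preceding the 5'-most adenine of the terminal run.
    body = seq[:match.start()]
    return body[-window:]
```

As the Discussion notes, a threshold as low as 5 adenines can also match internal A-rich stretches at the end of a truncated read, so results obtained with `min_a=5` deserve manual spot checks.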
The frequencies of all possible penta- and hexanucleotide patterns within the first 50 nt upstream from the CS were determined using a small Python script. The script is available from the authors upon request. To test whether the observed frequencies of penta- and hexanucleotide words were significantly different from chance we used the log odds ratio (lnω) as described by Sokal and Rohlf [27]. Briefly, the expected oligonucleotide frequency F_e(b) of oligonucleotide b was calculated using the collection of all 200 nt upstream regions for each organism. The expected oligonucleotide frequencies were then used to calculate the number of expected occurrences in a given organismal data set using the formula: