Transposable element distribution, abundance and role in genome size variation in the genus Oryza

Background The genus Oryza is composed of 10 distinct genome types, 6 diploid and 4 polyploid, and includes the world's most important food crop – rice (Oryza sativa [AA]). Genome size variation in the Oryza is more than 3-fold and ranges from 357 Mbp in Oryza glaberrima [AA] to 1283 Mbp in the polyploid Oryza ridleyi [HHJJ]. Because repetitive elements are known to play a significant role in genome size variation, we constructed random sheared small insert genomic libraries from 12 representative Oryza species and conducted a comprehensive study of the repetitive element composition, distribution and phylogeny in this genus. Particular attention was paid to the role played by the most important classes of transposable elements (Long Terminal Repeat Retrotransposons, Long Interspersed Nuclear Elements, helitrons, DNA transposable elements) in shaping these genomes and in contributing to genome size variation. Results We identified the elements primarily responsible for the most striking genome size variation in Oryza. We demonstrated how Long Terminal Repeat retrotransposons belonging to the same families have proliferated to very different extents in various species. We also showed that the pool of Long Terminal Repeat Retrotransposons is substantially conserved and ubiquitous throughout the Oryza, and so its origin is ancient and its existence predates the speciation events that originated the genus. Finally we described the peculiar behavior of repeats in the species Oryza coarctata [HHKK], whose placement in the Oryza genus is controversial. Conclusion Long Terminal Repeat retrotransposons are the major component of the Oryza genomes analyzed and, along with polyploidization, are the most important contributors to the genome size variation across the Oryza genus. Two families of Ty3-gypsy elements (RIRE2 and Atlantys) account for a significant portion of the genome size variation present in the Oryza genus.


Background
The C-value paradox, the phenomenon describing the lack of correlation between biological complexity of an organism and its genome size [1], is probably best exemplified in plants, where genome sizes span several orders of magnitude ranging from the 98 Mbp Fragaria viridis genome [2] to the more than 110,000 Mbp genome of the lily Fritillaria assyriaca [3]. Significant genome size variations are also not uncommon even within single genera such as rice, cotton and sorghum, where 3.6-, 3- and 8.1-fold genome size variations have been reported, respectively [4][5][6]. The primary mechanisms that contribute to this variation in plants are polyploidization [7,8] and transposable element (TE) proliferation [9][10][11] and elimination [12][13][14][15][16][17]. TEs are classified according to the different molecules used as intermediates in the replicative mechanism of transposition as class 1 (or RNA elements) and class 2 (or DNA elements) [18]. Class 1 elements transpose via an RNA intermediate and include long terminal repeat (LTR) retrotransposons (LTR-RTs), long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) [19]. Class 2 elements transpose via a DNA intermediate and have been classified into superfamilies (hAT, CACTA and Mutator-like elements) according to the similarity of transposases, the element-encoded protein that catalyzes transposition and integration [20]. Other classes of DNA transposable elements are represented by helitrons [21] and polintons [22].
The genus Oryza is an ideal model system to study the role of TEs in genome size variation. The genus is composed of 23 species: two cultivated (Oryza sativa and Oryza glaberrima) and 21 wild [23,24]. Based on evidence derived from interspecific crossing, cytogenetics and genomic DNA hybridization, ten Oryza genome types have been recognized including 6 diploid (2n = 24) and 4 tetraploid (2n = 48). The relevance of rice as a staple food resource for the world, its compact 389 Mbp genome [25] and its role as a "model" species for genomic studies of cereals [26,27] have driven massive research efforts that include the production of the first finished genome sequence of any crop plant [25]. Additionally, our laboratory has developed a comprehensive set of BAC libraries, BAC end sequences and integrated physical maps representing the 10 Oryza genome types [28,29].
To better understand the role transposable elements have played in genome size variation in Oryza we generated and sequenced a set of random sheared genomic libraries from 12 species representative of the 10 genome types of the Oryza. Random sheared libraries represent an unbiased sampling of genome content and enable the characterization of their most relevant features without undertaking a massive sequencing effort. The sequences obtained were analyzed in order to describe, classify and compare the repetitive fraction of the genus Oryza. Particular attention was paid to the role played by the most important classes of TEs (LTR-RTs, LINEs, helitrons and other DNA transposable elements) in shaping these genomes and in contributing to genome size variation. In the case of four groups of TEs (the class 1 LTR-RTs and LINEs and the class 2 CACTA and Mutator-like elements) phylogenetic analyses were carried out.

Results
To better understand the repeat content of the genus Oryza, 12 random sheared libraries were constructed and sequenced from 12 vouchered accessions that represent the 10 genome types of Oryza (Figure 1). The total number of clones produced was 42,432, which roughly corresponded to 0.015× coverage of each genome. Mean insert size ranged from 2.5 kbp for O. australiensis [genome type: EE; genome size: 965 Mbp] to 3.9 kbp for O. rufipogon [AA; 439 Mbp]. Cloned inserts were bi-directionally sequenced and after removing low quality data (sequences shorter than 50 bases and all those similar to plastid genomes), 72,245 high quality sequence reads were obtained for a total of 51.76 Mb of sequence (0.56% of the total size of all the genomes studied (Table 1)).

Repeats abundance and distribution
Repetitive sequences were identified by similarity searches against databases of previously characterized repetitive elements, isolated from O. sativa and other Oryza species, including centromeric repeats, helitrons, SINEs, LINEs, MITEs, LTR-RTs, DNA transposable elements (DNA-TEs), ribosomal sequences and telomeric repeats (see Methods). All the major classes of repeats were represented throughout the genus (Table 2). The overall amount of repeats was quite variable in different species and ranged [...] 15.28% of reads. The lower than expected MITE content in the non-AA genome species is likely due to the fact that MITEs are highly species specific and the repeat database used to identify MITEs was a curated data set from O. sativa only.
Figure 1. Phylogeny of genus Oryza. The phylogenetic tree is a modified version of that proposed by Ge et al. [53].
Differences in TE size, host genome size and degrees of interspecific conservation make it particularly difficult to perform a detailed comparison based solely on the percentage of significantly similar sequences out of the total. A more rigorous approach was recently proposed by Hawkins et al. [9], who derived an equation (see Methods) that takes into account TE mean size, host genome size, the minimum length needed in order to have a significant BLASTN hit, and the number of matches out of total reads available to estimate the absolute number of repeats in a given genome. We used this equation for all classes of repeats for which reliable estimates of the average length were available and then, in order to have comparable results, normalized the data to the single Mbp (Figure 2 and Additional file 1). Here we considered only the most represented and conserved families of elements: LTR-RTs, LINEs, helitrons, and the DNA-TEs CACTA and Mutator-like. It should however be noted that the use of average lengths for entire classes of elements could introduce a certain degree of uncertainty, and so the following values should be considered as an approximation. Not surprisingly, the LTR-RTs remained the major contributor to the repeat pool of each species. [...] [HHKK] where twice as many Ty1-copia elements (ratio of Ty3-gypsy/Ty1-copia = 0.5) were found.
Among other repeats, it is worth noting that CACTA elements seemed to be significantly more frequent in the BB [...]
The significant variance of LTR-RT representation presented above prompted us to perform a more in-depth analysis of the distribution of the different families in each genome. To identify the primary LTR-RT families responsible for this variance and possibly link them to the genome size variation in each host species, all sequences previously identified as being similar to LTR-RTs were screened a second time against a reduced database containing representatives of twenty-six of the most abundant LTR-RT families isolated in the Oryza genus (11 Ty1-copia and 15 Ty3-gypsy) (Table 3). In all species the overall majority of sequences already annotated as LTR-RTs could be easily classified according to these parameters ([...])
Our strategy of relying on similarity searches to identify the repetitive fraction of a genome has the obvious pitfall that, if a repeat is not present in the repeat database, it will not be detected. To check for this possibility we adopted a different approach to screen all sequence reads that did not show any significant hits with already characterized repeats. Sequences were clustered using the program Blastclust and the results are shown in Table 4. Only a few clusters containing more than 3 sequences were isolated [...]

Phylogenetic analysis of repeats
To establish the phylogenetic relationships among the most prominent superfamilies of TEs in Oryza, we used amino acid sequences of the most conserved domain tracts (see Methods) of Ty1-copia and Ty3-gypsy LTR-RTs, LINEs and CACTA and Mutator-like DNA-TEs to build neighbor-joining trees.
For Ty1-copia elements (Figure 3) it was possible to retrieve 269 reverse transcriptase (RT) sequences out of the 72,245 available sequences. These sequences were resolved into at least eight lineages as representatives of the major Ty1-copia retroelements isolated. RT-like sequences from Ty3-gypsy elements (508 elements) were isolated and used to build a phylogenetic tree (Figure 4). The complete set is less heterogeneous than the Ty1-copia set; however, all the major families are represented. In the case of RIRE2 and Atlantys, the two most abundant Ty3-gypsy families throughout the genus, all the species were included in the corresponding lineages, indicating that their presence in the genus predates its speciation. However, considering the different species in which the major amplification events of elements belonging to these two families took place, we noted that the elements isolated in different species usually do not mix together in the internal lineages, suggesting that their amplification happened after speciation.
LINEs are usually not very abundant in plants as compared to LTR-RTs, with some exceptions such as the del2 element in Lilium speciosum [30]. The genus Oryza appears to follow this general rule. This trend is reflected in the number of LINE reverse transcriptase-like sequences [38] isolated from our set. Twenty-two of them cluster in 4 major bootstrap-supported lineages and many are characterized by very long branches suggestive of an ancient origin of these retroelements in the genus (Figure 5). These findings suggest that even in the case of the Oryza genus, LINEs exhibit, as in other plants, high sequence divergence and extreme heterogeneity [31][32][33].
The two major groups of DNA TEs, CACTA and Mutator-like, were phylogenetically investigated using tracts of their transposase coding domains. In the case of CACTA elements, we identified 125 transposase-like sequences, whereas 55 were isolated for Mutator-like elements. For both classes of TEs, it was not possible to identify species-specific lineages in the corresponding neighbor-joining trees (Figures 6 and 7).

Discussion
Genome survey sequences from 12 random sheared genomic libraries representative of the 10 genome types of the genus Oryza were generated and analyzed to obtain an unbiased sampling of the repeat content of the genus as a whole. Our results showed that the genomic composition of the Oryza species is not exceptional when compared to other plant species in that the repetitive fraction consistently makes up a significant portion of each of the 10 genome types.
Among TEs, the major class of repeats in Oryza was represented by LTR-RTs whose absolute occurrence was estimated to be in the order of thousands of copies for each [...] The consequences of TE proliferation subsequent to speciation on genome size variation are reminiscent of Zea mays, in which five LTR-RT families represent nearly 25% of the genome [34], and of the species Gossypium herbaceum and G. exiguum, where a group of Ty3-gypsy elements, named GORGE-3, has extensively proliferated in comparison to other elements and constitutes a significant portion of each genome [9]. In contrast, no single LTR-RT family was found to constitute significant portions of the AA genome Oryza species, with the only exception being the Ty3-gypsy element RIRE8. The finding that, apart from the polyploids, the largest genomes in the Oryza (O. australiensis and O. granulata) are those that have seen the "explosive" proliferation of one or two LTR-RT families once again identifies the pivotal role of this class of TEs in the dynamics of genome size variation.
Figure 3. Phylogenetic analysis of Ty1-copia retroelements. A) 269 sequences similar to the Ty1-copia reverse transcriptase domain were used to build a phylogenetic tree using the neighbor-joining method. Bootstrap values were calculated for 1000 replicates; only those with values greater than 50 are shown. B) Distribution of the domains isolated in different species. Bar colors are the same as those used in the circles marking sequences from different species on the neighbor-joining tree.
The differential proliferation of LTR-RTs in the Oryza genus could be explained by taking into account the effects of varying genomic backgrounds on element life cycles, different rates of LTR-RT removal by unequal recombination and/or illegitimate recombination [8,15,16] and inherently different retrotranspositional potentials of various element families [35]. However, no single convincing explanation of this phenomenon has so far been advanced for other plant species in which similar events have been described [9,10].
Here we have demonstrated that the pool of LTR-RTs in the Oryza is substantially conserved throughout the genus because in almost all cases relatives of the LTR-RT families identified in O. sativa [AA] could also be found in the other Oryza species. We therefore conclude that its existence predates the speciation events that originated the genus as we know it. This finding is well supported by our phylogenetic analyses and is in accordance with the results of previous studies analyzing the phylogeny and distribution of Ty1-copia elements across different genera of the Gramineae family [36,37].
Besides the identification of different amounts of elements from the same LTR-RT families, all the genomes studied, except O. coarctata, showed the same quantitative relationship between Ty3-gypsy and Ty1-copia elements where the Ty3-gypsy elements constantly outnumber Ty1-copia elements, even if the ratio between the two elements varied (from more than 4. [...] where we found a ratio close to 0.5 between the Ty3-gypsy and Ty1-copia elements. An even more striking difference between the O. coarctata [HHKK] genome and the other Oryza spp. genomes is that it was found to contain the least amount of repetitive elements despite being polyploid and having a predicted large genome size. It should be noted that we have not been able to obtain live plants or tissues in the U.S. in order to determine an accurate genome size measurement by flow cytometry due to quarantine restrictions. To rule out the possibility that our similarity searches could not detect highly diverged elements we analyzed all the sequences to try to identify repeats de novo through a clustering approach. In this case no significant de novo clusters were obtained in O. coarctata. These findings rule out the existence of a single recently originated family of repeats, highly species specific and constituting a significant portion of the genome. [...]
A completely different scenario could be described for TEs lacking conserved domains and having higher species sequence specificity such as MITEs and SINEs. In this case the evolutionary distance from AA genomes consistently results in a steady drop in the amount of TEs identified (with the significant exception of MITEs in O. brachyantha), prompting the need for de novo and ad hoc identification tools based on structural features besides the nucleotide sequence.
Figure 4. Phylogenetic analysis of Ty3-gypsy retroelements. A) 508 sequences similar to the Ty3-gypsy reverse transcriptase domain were used to build a phylogenetic tree using the neighbor-joining method. Bootstrap values were calculated for 1000 replicates; only those with values greater than 50 are shown. B) Distribution of the domains isolated in different species. Bar colors are the same as those used in the circles marking sequences from different species on the neighbor-joining tree.

Conclusion
Here we report the results of a comprehensive analysis of the abundance and relative distribution of major TE classes across twelve species of the genus Oryza.
We have demonstrated how the LTR-RT complement of these species is ancient and conserved throughout the genus, and has attained different retrotranspositional success in different species.
We also identified two LTR-RT families (RIRE2 and Atlantys) that are responsible for a significant portion of genome size variation in the genus and we demonstrated how their massive increase in copy number is recent.

Plant material
Total genomic DNA was isolated from young leaf tissue from the following [...] materials were grown at the International Rice Research Institute, Philippines.

Library construction and DNA sequencing
Genomic DNA was extracted using CTAB [44]. DNA was sheared using the Hydroshear (GeneMachines) to produce 2-5 kb fragments. DNA fragments were end-repaired using Epicentre's End-It repair kit and size selected on 0.6% agarose gels. DNA fragments in the size range of 2-5 kb were cut out of the gel, eluted using a QIAEX II Gel extraction kit and ligated to linearized pBluescriptII KS+ at 16°C overnight. Ligation products (1 µL) were transformed into DH10B electrocompetent cells (Invitrogen). Shotgun plasmids were sequenced bi-directionally on ABI 3730XL DNA sequencers (ABI) using standard protocols. Sequence data were extracted using ABI sequence analysis software and base called using Phred [45]. Vector screening and low quality sequence removal were done using the program Lucy [46].

Similarity searches
Sequences were used as queries in similarity searches against different repeat databases: TIGR rice repeats version 3.1 [47], Repbase version 11_09 [48] and a proprietary collection of retroelements isolated in Oryza sativa and other Oryza species. Searches were carried out using the BLASTN algorithm [49] run under relaxed settings (-q -2 -r 3) in order to accommodate the divergence between the species searched and the major source of repeats in our database (O. sativa). Only hits having an E value equal to or lower than 1e-10 were used. For all sequences that did not result in any significant hits, a second round of searches was carried out using the algorithm BLASTX against the non-redundant division of GenBank. Under these conditions only hits having an E value equal to or lower than 1e-5 were parsed.
[Figure: Phylogenetic analysis of Mutator-like transposable elements]
To estimate the number of significantly similar hits present in the entire genome of each of the 12 Oryza species, we used the equation n = (Xobs/N) * (G/(L - 2m + e)), a slight modification of that proposed by Hawkins et al. [9], where "Xobs" is the observed number of copies, "N" is the total number of sequence reads, "n" is the number of targets in the genome, "L" is the length of the target sequence, "m" is the estimated minimum length required to identify a sequence in a BLAST search (we used the extremely conservative value of 100 bp), "e" is the number of bp sequenced from each insert and "G" is the genome size.
Published sequences for various repetitive elements were used to estimate "L".
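As an illustration only, the estimate above and the per-Mbp normalization used in the Results can be sketched in a few lines of Python. The function and parameter names are hypothetical and are not part of the original analysis; the guard against a non-positive denominator is likewise an assumption added for safety.

```python
def estimate_copy_number(x_obs, n_reads, genome_size_bp,
                         element_len_bp, sequenced_bp, min_match_bp=100):
    """Sketch of n = (Xobs/N) * (G / (L - 2m + e)).

    x_obs          -- Xobs, observed number of copies (reads with a hit)
    n_reads        -- N, total number of sequence reads
    genome_size_bp -- G, genome size in bp
    element_len_bp -- L, (average) length of the target element in bp
    sequenced_bp   -- e, number of bp sequenced from each insert
    min_match_bp   -- m, minimum length needed for a BLAST hit (100 bp in the text)
    """
    window = element_len_bp - 2 * min_match_bp + sequenced_bp
    if window <= 0:
        # Hypothetical guard: the formula is undefined for such inputs.
        raise ValueError("L - 2m + e must be positive")
    return (x_obs / n_reads) * (genome_size_bp / window)


def copies_per_mbp(copies, genome_size_bp):
    """Normalize an absolute copy number to copies per Mbp of genome."""
    return copies / (genome_size_bp / 1e6)
```

Note that after normalization the genome size cancels out, so the per-Mbp density depends only on the fraction of matching reads and on L, m and e.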
A third round of similarity searches was performed using only representative elements from the LTR-RT families listed in Table 3.
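The E-value cut-offs described above (1e-10 for BLASTN, 1e-5 for BLASTX) amount to a simple filter over the search results. The sketch below assumes, purely for illustration, that hits are available as tab-delimited lines in the standard 12-column BLAST tabular layout, with the E value in the eleventh column; the text does not state which output format was actually parsed, and the function names are hypothetical.

```python
def filter_hits(tabular_lines, max_evalue):
    """Keep (query, subject, evalue) for hits with E value <= max_evalue.

    Assumes 12-column tab-delimited BLAST output (an assumption, not
    stated in the text); the E value is the 11th field.
    """
    kept = []
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


def unmatched_queries(all_query_ids, hits):
    """Reads with no significant hit, i.e. candidates for the next search round."""
    matched = {query for query, _, _ in hits}
    return [q for q in all_query_ids if q not in matched]
```

Under these assumptions, `filter_hits(lines, 1e-10)` would correspond to the first (BLASTN) round, and the reads returned by `unmatched_queries` would be passed to the second (BLASTX) round, filtered at 1e-5.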
All sequences that were not classified as repeats were used in a cluster analysis using "Blastclust" [50]. The program was used with the settings -L 0.51 -S 80, meaning that all sequences sharing at least 80% similarity over at least 51% of their length were included in the same cluster. Only clusters containing 3 or more hits were considered and analyzed further.
[...] [49]. The corresponding amino acid domains were retrieved and aligned using Muscle [51]. Neighbor-joining trees were produced and edited using the program MEGA version 3 [52].
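The clustering criterion used for the unclassified reads (at least 80% similarity over at least 51% of the sequence length, keeping clusters of 3 or more) can be illustrated with a small single-linkage sketch. This is not Blastclust itself: the input format, the function name and the choice to require the length coverage on both sequences of a pair are assumptions made for the example.

```python
def cluster_reads(lengths, alignments, min_identity=80.0,
                  min_coverage=0.51, min_size=3):
    """Single-linkage clustering of reads from pairwise alignments.

    lengths    -- dict: read name -> read length in bp
    alignments -- iterable of (read_a, read_b, percent_identity, aligned_bp)
    Returns clusters with at least min_size members, largest first.
    """
    parent = {name: name for name in lengths}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_bp in alignments:
        # Assumption: coverage must hold for both sequences of the pair.
        covered = (aligned_bp >= min_coverage * lengths[a]
                   and aligned_bp >= min_coverage * lengths[b])
        if identity >= min_identity and covered:
            parent[find(a)] = find(b)  # link the two clusters

    clusters = {}
    for name in lengths:
        clusters.setdefault(find(name), []).append(name)
    kept = [sorted(members) for members in clusters.values()
            if len(members) >= min_size]
    return sorted(kept, key=lambda members: (-len(members), members))
```

Because linkage is transitive, two reads can end up in the same cluster without aligning to each other directly, as long as a chain of qualifying alignments connects them.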

Accession numbers
Sequences for this paper were submitted to the GSS division of GenBank under the following accession numbers: