Generation of microsatellite repeat families by RTE retrotransposons in lepidopteran genomes
© Tay et al. 2010
Received: 10 January 2009
Accepted: 17 May 2010
Published: 17 May 2010
Skip to main content
© Tay et al. 2010
Received: 10 January 2009
Accepted: 17 May 2010
Published: 17 May 2010
Developing lepidopteran microsatellite DNA markers can be problematical, as markers often exhibit multiple banding patterns and high frequencies of non-amplifying "null" alleles. Previous studies identified sequences flanking simple sequence repeat (SSR) units that are shared among many lepidopteran species and can be grouped into microsatellite-associated DNA families. These families are thought to be associated with unequal crossing-over during DNA recombination or with transposable elements (TEs).
We identified full-length lepidopteran non-LTR retrotransposable elements of the RTE clade in Heliconius melpomene and Bombyx mori. These retroelements possess a single open reading frame encoding the Exonuclease/Endonuclease/Phosphatase and the Reverse Transcriptase/nLTR domains, a 5' UTR (untranslated region), and an extremely short 3' UTR that regularly consists of SSR units. Phylogenetic analysis supported previous suggestions of horizontal transfer among unrelated groups of organisms, but the diversity of lepidopteran RTE elements appears due to ancient divergence of ancestral elements rather than introgression by horizontal transfer. Similarity searches of lepidopteran genomic sequences in GenBank identified partial RTE elements, usually consisting of the 3' terminal region, in 29 species. Furthermore, we identified the C-terminal end of the Reverse Transcriptase/nLTR domain and the associated 3' UTR in over 190 microsatellite markers from 22 lepidopteran species, accounting for 10% of the lepidopteran microsatellites in GenBank. Occasional retrotransposition of autonomous elements, frequent retrotransposition of 3' partial elements, and DNA replication slippage during retrotransposition offers a mechanistic explanation for the association of SSRs with RTE elements in lepidopteran genomes.
Non-LTR retrotransposable elements of the RTE clade therefore join a diverse group of TEs as progenitors of SSR units in various organisms. When microsatellites are isolated using standard SSR enrichment protocols and primers designed at complementary repeated regions, amplification from multiple genomic sites can cause scoring difficulties that compromise their utility as markers. Screening against RTE elements in the isolation procedure provides one strategy for minimizing this problem.
Microsatellite genetic markers are based on the properties of SSRs (simple sequence repeats) which are numerous and ubiquitous in the DNA of eukaryotes [1, 2]. The basic repeat unit of an SSR is generally considered to be one to six bases long, and an array of two or more basic units repeated in tandem constitutes the SSR. Such SSRs are often miscopied by DNA polymerase, and the resulting high mutation rate, leading to a change in the number of basic units in the array, has proven to be very useful for population genetic studies. The high degree of polymorphism in the size of the tandem array can easily be visualized by gel-separation of PCR products generated by primers placed in the flanking regions, on either side of the SSR.
Another pattern of repetition, however, has proven to be very troublesome in the practical application of microsatellite markers in some organisms, including the insect order Lepidoptera (butterflies and moths). The regions closely flanking one or both sides of the SSR may themselves be highly repeated, dispersed throughout the genome rather than occurring tandemly [3, 4]. Primers designed to match these regions, therefore, may generate multiple bands, or even fail to produce discrete visible products because too many different sites are being amplified. Unlike the internally-repeated patterns of the SSRs, there is nothing obvious about the structure of these repeated flanking regions. The fact that they are repeated can be deduced only by comparisons among many flanking sequences, not by any intrinsic pattern of the sequence itself.
Meglécz et al.  first described families of similar sequences flanking SSR repeats from two species of Lepidoptera, Parnassius apollo and Euphydryas aurinia. These similarity patterns fall into two main types, if the location of the SSR is taken into account but its sequence and orientation is ignored. Unilateral repeat families occur on one side only of the SSR, the other side is not repeated (e.g. L1-SSR-R, L2-SSR-S, L3-SSR-T, etc.). In bilateral repeat families, both flanking sides are repeated (e.g. L1-SSR-R1, L2-SSR-R2, L3-SSR-R3, etc.). van't Hof et al.  denoted these as asymmetric and symmetric families respectively, however these terms are misleading as there is not necessarily any symmetry between L1 and R1, L2 and R2, etc. These authors examined patterns of flanking SSR families in a third lepidopteran, Bicyclus anynana. Using blastn comparisons of these to sequences in GenBank, they defined four "Lepidoptera Specific Core Sequences" (LSCSs) that represented the common elements of four repeated flanking sequences.
The widespread occurrence of these LSCS regions can complicate the use of microsatellite markers in population studies. For example, markers have been developed for the polyphagous noctuid moth Helicoverpa armigera by several groups [7–10]. Markers developed for populations of one region (e.g. China) typically have a low success rate in other regions (e.g. Australia). Studies of Australian populations, even using microsatellites developed from those populations, have produced a highly dynamic picture of temporal variation in migration patterns [11–14]. However, a careful re-examination of ten of these markers by Endersby et al.  showed that a high proportion of these loci showed allele drop-outs or were not in Hardy-Weinberg equilibrium, undermining their reliability as accurate indicators of population structure.
Based on analysis of whole-genome sequences of several insects, including the lepidopteran Bombyx mori, Meglécz et al.  showed there were large interspecific differences in frequencies of microsatellites in insect genomes. They confirmed and extended the observations of shared flanking regions among independently isolated sets of microsatellite markers from several different Lepidoptera. They demonstrated a statistical association between SSRs and previously compiled datasets of repeated DNA sequences in six Drosophila species and Anopheles gambiae (a similar compilation for B. mori was not provided). van't Hof et al.  also suggested a general association between transposable elements (TEs) and some of the LSCS regions. However, no specific class of TEs was identified as contributing to the observed patterns.
Here we show that a class of non-LTR retrotransposons, RTE elements, are associated with SSR repeats in Lepidoptera and other species. Full-length autonomous elements have low copy numbers in the genome of Bombyx mori, however partial elements consisting of the coding sequence of the C-terminal end of the reverse transcriptase protein immediately flanking an SSR are highly abundant. Slippage during reverse transcription and chromosomal integration can increase the number of repeating units within the SSR. These numerous chromosomal integration events of partial elements have peppered the genome with ready-made SSRs flanking an element-specific sequence. These events can account for a substantial fraction of unilateral repeat families recovered in independent efforts from several lepidopteran species.
The C. elegans RTE-1 sequence was also used to identify 25 full-length elements from the unannotated whole-genome shotgun contigs of B. mori (BmRTE-d01 through -d25), the largest number of distinct RTE elements identified from a single species to date. These have the same two conserved protein domains AP and RT, possess various SSR repeats in the short 3' UTR immediately following the stop codon at the end of the RT domain, and are flanked by TSD sequences of differing lengths. Some of the Bombyx elements possess an in-frame stop codon or a frameshift within the first 200 bases of the coding sequence, like other RTE elements. The partial RTE sequence identified by Malik and Eickbush  next to the cecropin B gene of B. mori is represented by the full-length element BmRTE-d05, with some protein sequence differences because of attempts of those authors to correct frameshifts that had been identified in the partial sequence. Three partial RTE sequences identified by Zupunski et al.  correspond to the full-length elements BmRTE-d07, -d24, and -d25. Using the full-length Bombyx RTE elements as query sequences in blast searches of the Bombyx genome contigs yields far more hits of various lengths at the 3' end than at the 5' end, suggesting that the copy number of full-length functional autonomous elements is generally low (around 2 to 4 copies per genome, but up to 16 and 22 copies for BmRTE-d01 and BmRTE-d02 respectively, see Additional File 1), in contrast to frequent insertions (up to 300 in the case of BmRTE-d04) of partial non-autonomous elements with intact 3' ends and flanking SSRs.
In contrast, the high similarity among Bov-B LINEs of reptiles and mammals (V. ammodytes Bov-B LINE, Bos taurus BDDF) could be explained by horizontal transfer, as previously suggested by Zupunski et al. . These authors also inferred another instance of horizontal transfer from arthropods to reptiles based on similarity between Bov-B from V. ammodytes and other reptiles, and partial sequences from B. mori they named Bombyx Bov-B/RTE (equivalent to our full-length BmRTE-d24 and -d25). We found the reptile/mammal group to cluster additionally with B. mori BmRTE-d05, -d06, -d12, and -d16, but outside the other Bombyx elements, pointing to this subgroup as the likely source of the horizontal transfer. As also pointed out by Zupunski et al. , horizontal transfer between plant and fish RTE elements was also evident (Figure 3), and the distribution of plant-like elements within fishes is wider than previously suspected, as shown by the element we have identified from the hagfish Eptatretus.
A few examples of partial RTE elements in non-coding flanking sequences or introns of genes from Lepidoptera have been deposited in public databases. Chen and Li  looked for TEs neighbouring cytochrome P450 genes in Helicoverpa zea, and recognized a partial RTE element they named HzRTE-1 within the third intron of CYP9A12v3 (GenBank:DQ788839). This partial element was 1,754 bp long, flanked by 10 bp TSDs, encoded an RT domain with 40% amino acid sequence identity to the C. elegans RTE-1 (and 78% identity to the last 400 residues of BmRTE-d01, Additional File 4), and was terminated by a region of TGA trinucleotide repeats in the short 3' UTR. Xu et al.  discovered genes with similarity to delta-11 desaturases (GenBank:EF113398) implicated in pheromone biosynthesis in Ostrinia nubilalis and O. furnacalis; each of which was adjacent to a partial RTE element with 78% amino acid identity to the C-terminal 473 residues of BmRTE-d08 (Additional Files 3 and 4). These authors recognized the sequence similarity to the C. elegans RTE-1 element, but considered the partial elements from Ostrinia to represent a new family which they named ezi, although this Ostrinia family was clearly clustered within the RTE-1 clade in their phylogenetic analysis . No SSRs were found in the 3' UTR of these elements.
The largest collection of microsatellite markers for any single Lepidopteran species was assembled by Miao et al.  in B. mori. More than 13,600 positive clones were identified by hybridization, yielding 2,690 confirmed by sequencing and 555 polymorphic markers that were used to construct a linkage map. Of these, 518 were deposited in GenBank, and we identified 64 containing fragments of one or more of the BmRTE elements by BLAST. Thus about 12% of this collection of microsatellite markers contained one or more RTE elements.
In their survey of repeated flanking sequences in Lepidopteran microsatellite markers, van't Hof et al.  defined four different Lepidopteran Specific Core Sequences (LSCS) that were represented in a large number of clones. Translation of the reverse complement of their 44 bp LCSC1 in the second reading frame produces a protein sequence identical to the last 12 residues of BmRTE-d02. Moreover, the nucleotide sequence is identical to the corresponding region of HzRTE-1. Thus the class of unilateral repeated microsatellites similar to LCSC1 has an RTE element flanking one side. Examples (Figure 4) include H. armigera microsatellite clones HaD47 (GenBank:AY497338) , HarSSR3 (GenBank:AJ504787)  and HarSSR7 (GenBank:AJ627416) , the H. zea HzMS1-6 microsatellite marker (GenBank:EF152206) , and the B. anynana microsatellite marker BA-ATG230 (GenBank:DQ225294) . TSD sequences were identified for HzMS1-6, HaD47, HarSSR7, and BA-ATG230.
Additionally, 165 related non-redundant lepidopteran microsatellite DNA loci were identified by tblastn searches based on protein sequences of the full-length HmRTE-e01 element, 13 full-length elements BmRTE-d01 to -d13, one partial HaRTE-t01 and six partial BaRTE elements (BaRTE-d01 to -d06; Additional File 3). A further 29 microsatellite loci with significant threshold values were identified from blastn searches of LSCS1, HzRTE-1-1 and McRTE-t01 partial RTE elements. Lepidopteran RTE protein and DNA sequence homology searches therefore identified association between microsatellite DNA loci and the non-LTR RTE elements from a total of 22 species from superfamilies Papilionidea (B. anynana, H. melpomene, Papilio zelicaon, Arhopala epimuta, P. apollo, Drupadia theda, M. cinxia, and E. aurinia), Hesperiidae (Erynnis properties), Geometroidea (Chiasmia assimilis, Biston betularia), Noctuoidea (H. armigera, H. zea, Arctia caja, Busseola fusca), Bombycoidea (B. mori), Pyraloidea (O. nubilalis), Tortricoidae (Cydia pomonella, Lobesia botrana, Tortrix viridana) and Yponomeutoidea (P. xylostella, Yponomeuta padellus). A wide diversity of SSR units for the reported microsatellite loci was identified (i.e., di-, tri-, tetra- and pentameric nucleotide SSR units of (TGATG)n, (TAGA)n, (TTTA)n, (TGA)n, (TGG)n, (ATT)n, (GAA)n, (TAT)n, (TG)n, (AT)n, (GA)n; Additional File 5). The numbers of SSR units varied greatly from two to > 35, and included perfect, compound and imperfect SSR units.
Examples of RTE insertions in lepidopteran microsatellite loci lacking SSR units at the 3' UTR.
Associated microsatellite locus
79 .. 141
191 .. 3
230 .. 466
364 .. 480
364 .. 480
B. betularia microsatellite
146 .. 3
3 .. 50
387 .. 328
3 .. 41
Overall, of 2,183 lepidopteran microsatellite sequences deposited in GenBank, 218 or almost 10% were found to contain an RTE element (Additional File 6). The top 16 lepidopteran species accounted for 1,723 microsatellites, 202 or almost 12% of which contain RTE elements. There is some heterogeneity within this group, with RTE content ranging from 0% to 44%. The remaining species with 27 or fewer microsatellites each account for 460 loci, only 16 of which (3.5%) contain RTEs. These differences could reflect actual species differences in microsatellite frequencies or methodological differences among different researchers; only some of which used enrichment techniques.
We have identified several RTE non-LTR retrotransposable elements in the Lepidoptera, and provided evidence for their association with unilateral microsatellite DNA families. These lepidopteran RTE elements therefore join a small group of TEs in other organisms (e.g. Drosophila , humans , barley ) as being associated with the genesis of microsatellite DNA repeat units. They account for a significant fraction of microsatellite markers isolated by many independent groups working with Lepidoptera (almost 10% overall, ranging from 0% to 44% for different species), and their wide occurrence may be partly responsible for problematical aspects of these markers.
Non-LTR RTE elements have a wide but disjunct phylogenetic distribution in eukaryotes [17, 22, 23] and have been found in sea urchins (Strongylocentrotus purpuratus), nematodes (C. elegans), blood flukes (Schistosoma mansoni ), mosquitoes (Ae. aegypti), amphioxus (Branchiostoma floridae ), fishes (Xiphophorus maculatus ), snakes (Vipera ammodytes ) and mammals (Bos taurus) as well as plants . We have identified additional full-length elements from cnidaria, mollusca, and hagfish, as well as 25 from the genome of B. mori and one from H. melpomene. Lepidopteran RTE elements discovered so far are very diverse but are all grouped into the RTE clade of Malik et al. .
Genomic insertion sites of RTEs do not appear to be sequence-specific, and can be discerned only by target-site duplications of 6 - 12 bp after insertion. The typically short 3' UTR often contains di- or tri-nucleotide simple sequence repeats. Truncated RTE elements consisting only of the 3' end of the RT coding sequence followed by the 3' UTR and flanked by target-site duplications are often found in genomes. This configuration suggests that a staggered double-stranded break in the target sequence was made, followed by reverse transcription of the full-length RNA element from the 3' end, which was interrupted before completion so that only the 3' end of the element was finally inserted between the target-site duplications. A large number of truncated elements may accumulate due to the replication efforts of just a few full-length elements, similar to that reported for L1 retrotransposons (reviewed in ). For example, in the mosquito Anopheles gambiae, the Ag-JAMMIN-2 element is represented by five copies with intact ORFs and about 1,940 truncated elements . In the absence of a full-length RTE sequence as evidence, such truncated RTE elements may be misidentified as short interspersed nucleotide elements (SINEs; ) which by contrast are usually derived from abundant cellular RNAs, lack coding potential, possess poly(A) tails, are independently transcribed from their own internal Pol III promoters, and utilize proteins encoded by other retro-elements for their insertion. Similarity of the predicted protein sequences at the 3' ends enabled Malik and Eickbush  to recognize that some families previously classified as SINEs were actually truncated RTE elements, just as it has enabled us to recognize RTE elements within isolated microsatellite markers. This is because it is easier to recognize homology of divergent protein sequences than nucleotide sequences.
Some insect TEs show a preference for inserting into specific sequences, and sometimes these sequences are simple sequence repeats. For example, DONG targets the (TAA)n of the ribosomal DNA non-transcribed spacer region , and the telomere-specific TRAS families target pentameric (TTAGG/CCTAA)n SSR units . Since these SSR-target insertion sequences are pre-existing, TE insertion would not increase the abundance of SSRs in the genome, but could increase the abundance of a particular unilateral family of SSR repeats by juxtaposing more copies of the TE next to SSRs existing in different genomic locations. We were not able to find evidence of DONG or TRAS elements in microsatellite markers isolated from Lepidoptera; however it may be that some of the still-uncharacterized LSCS sequence families are formed in this way. By contrast, each new insertion of an RTE element with SSRs in the 3' UTR does increase the abundance of SSRs in the genome. The SSR may have been present in the 3' UTR in the RNA molecule participating in the insertion, or may have been incorporated into the cDNA by slippage of the RT enzyme during reverse transcription. But since the SSR is positioned between the two TSDs, it clearly did not exist in that location prior to the insertion event.
Some fingerprinting methods benefit from the random multiple insertion of transposable elements into genomes [38–40], however, these do not yield single-copy markers. The difficulty of developing lepidopteran microsatellites that function as reliable single-copy codominant markers is widely acknowledged (e.g., [3–5, 41, 42]). Our sequence comparison of size variants of the microsatellite marker HaD47 showed that at least three different loci were being amplified in different individuals, violating the single-copy assumption. With respect to codominance, a deficiency in observed levels of heterozygosity relative to Hardy-Weinberg expectations has been reported for many published lepidopteran microsatellite DNA markers, and has often been attributed to the presence of null alleles (e.g., ). Null alleles can significantly alter the estimation of population substructure patterns by decreasing the within population allelic diversity (e.g., ). However, many microsatellite loci reported to exhibit significantly lower than expected levels of heterozygosity are associated with RTE elements, such as in H. armigera (HarSSR3, ; HarSSR7, ), H. zea (HzMS1-6, ), Y. padellus (YP35, ), A. epimuta (AeG5, ) and B. betularia (Biston 12, ). Although it is still unknown whether a systematic RTE-associated effect on heterozygosity exists, the possibility deserves to be taken into consideration along with other explanations such as the Wahlund effect or selection pressure (e.g., insecticide applications on pest species).
One approach to improving the reliability of microsatellite markers is to target single-copy regions in the genome. Widdel et al.  used Cot analysis to isolate the slowly-reannealing, low-copy-number fraction of the genome of Ae. japonicus from which SSR clones were subsequently isolated. Another approach is to sort through clones already obtained to identify those sharing similarities in their flanking regions and discarding these; Meglécz  has described a computer program MICROFAMILY for that purpose. Another alternative for developing codominant markers is to avoid SSRs entirely and to design PCR primers to conserved exons in protein-coding genes to screen for polymorphisms in the intervening intron (EPIC markers, for example as developed as an alternative to microsatellites for H. armigera, ). A completely opposite strategy is to exploit the high copy number and dispersed distribution of the repeated sequences flanking microsatellites to develop fingerprinting-type, dominant markers [38–40]. Anderson et al.  developed two repetitive flanking sequences (ReFS1 and ReFS2) based on microsatellites isolated from A. caja, where both ReFS markers are part of the LSCS1 sequence with minor variations. They used PCR primers designed to these sequences to discriminate between different species of the moth genus Schrankia and to detect interspecific hybrids; dominant markers are sufficient for both purposes. In this application, the sequence conservation of the RTE element during evolution is useful for extending marker utility outside of the species of discovery, and insertions that occur after species divergence are useful for discriminating species; presuming that these insertions have fixed.
New lepidopteran TEs are continuously being identified and characterized (e.g., [24, 25]). TEs identified to date in the Lepidoptera are likely to represent only a fraction of the total numbers and types, and some of these others may also be responsible for the genesis of unilateral or bilateral microsatellite DNA families in various lepidopteran species.
The presence of microsatellite DNA families in the Lepidoptera has mystified many evolutionary and population geneticists over the last decade. Our findings that non-LTR retrotransposable elements of the RTE clade have rendered large proportions of lepidopteran microsatellite DNA markers developed to-date ineffective is indicative of the depth of problems challenging many researchers who have invested resources to developing such molecular genetics tools. Our study will enable TE-affected DNA markers to be recognised across a wide range of organisms, thus allowing informed decisions to be made regarding the utilisation of such DNA markers in future population and evolutionary genetic studies. The presence of RTEs across diverse plant and animal evolutionary lineages implies that the RTE-associated microsatellite DNA families phenomenon may be widespread in many biological systems, and likely to represent one of the many yet unrecognised classes of TEs capable of generating microsatellite DNA families.
The C. elegans RTE-1 amino acid sequences (GenBank:U58775, U0063, g2253129) as reported in Youngman et al.  and Malik and Eickbush  were used to identify related RTE elements through tblastn searches in NCBI and KaikoBlast. There are currently three different unannotated genome assemblies available in the wgs section at NCBI. We searched the second assembly with 6-fold coverage (GenBank AADK01000001:AADK01066482) in preference to the first assembly with 3-fold coverage (GenBank BAAB01000001:BAAB01213289) because of the larger contig size of the second assembly; and in preference to the third assembly with 9-fold coverage (GenBank BABH01000001:BABH01088672) because most repeated sequences including RTEs had been masked out and not restored in the third version. All positive gDNA sequences identified in B. mori were assembled using Sequencher 4.5 (Gene Codes Corporation, Ann Arbor MI, USA) to obtain consensus copies of full-length B. mori RTE elements. When an RTE element spanned two or more contigs, only those with >95% sequence identity in the region of overlap were used.
We used the C. elegans RTE-1 protein sequence (GenBank:AAC72298) in a tblastn search within the NCBI, specifying 'Lepidoptera' within the Entrez query search parameter. Significant positive matches were recorded and the H. melpomene BAC clone AEHM-22C5 (GenBank:CU462842) was assessed for putative coding sequences using the NCBI ORF Finder search tool and implementing the default genetic codes option. Protein sequences identified by the ORF search were subjected to blastp searches against the non-redundant (nr) database to ascertain the presence of RTE-like conserved domains. SSR units flanking the 3' UTR of putative RTE-like conserved domains were visually identified. DNA sequences flanking the 3' end of the SSR units and the 5' UTR were manually examined for evidence of TSD sequences. Annotation was added to the GenBank record CU462842 by the submitter, Dr. Simon W. Baxter, Cambridge University.
The nr, wgs, and dbEST sections of GenBank were searched using tblastn with protein sequences derived from the following: XM_001196835 (Strongylocentrotus purpuratus), AJ621035 (Rex3 from Tetraodon nigroviridis), and AF332697 (Vipera ammodytes) as well as the B. mori elements BmRTE-d01 to -d23 and C. elegans RTE-1. All new elements identified and their sequences are listed in Additional File 2.
Sequences identified (with GenBank source and base numbers) are for nematodes (Pristionchus pacificus PpRTE-a01 ABKE01000427; 33,932-36,769); (Caenorhabditis brenneri CbRTE-a01 ABEG01003514; 3,331-6,195), lizard (Anolis carolinensis AcRTE-a01 AAWZ01014759; 66,966-69,213), sea slug (Aplysia californica ApcRTE-a01 AASC02045436; 3,412-450), lancelet (Branchiostoma floridae BfRTE-a02 ABEP02012082; 20,876-17,670), zebrafish (Danio rerio DrRex3-a01 BX511127; 5,323-8,013), stickleback fish (Gasterosteus aculeatus GaRex3-a01 AANH01001355; 40,384-43,528) cichlid fish (Mchenga conophorus McRTE-a01 ABPJ01001712; 1,023-3,566), medaka fish (Oryzias latipes Rex3_Ol AB111925; 55,168-57,602, OlRTE-a03 AB207138; 17,235-20,258), pufferfish (Tetraodon nigroviridis Rex3_Tet AJ621035; 168-3,337), platyfish (Xiphophorus maculatus Rex3-XmJ AF125982; 1-2,625), hagfish (Eptatretus stoutii EpRTE-a01 AY965681; 70,600-73,645), sea anemone (Nematostella vectensis NvRTE-a01 ABAV01019150; 20,589-23,804), and plants, including peanut (Arachis hypogaea AhRTE-a01 FJ654705; 20,923-16,248), soybean (Glycine max GmRTE-a01 AC235444; 69,163-66,083), barrel medic (Medicago truncutala MtRTE-a01 AC233569; 64,560-61,525), goatgrass (Aegilops tauschii AtRTE-a01 AF091802 6,510-9,456), barley (Hordeum vulgare HvRTE-a01 AF064563 1,441-2,212), potato (Solanum. tuberosum StRTE-a01 AC233388; 58,011-55,175), rice (Oryza sativa OsRTE-a01 AAAA02042936), maize (Zea mays ZmRTE-a01 AJ850302; 4,233-7,553), bread wheat (Triticum aestivum TaRTE-a01 EU835981; 122,283-119,112), apple (Malus × domestica MdRTE-a01 AM167520; 22,051-24,641). Additional full-length B. mori RTE elements previously identified by Zupunski et al.  on the basis of partial sequences (Prof. Dusan Kordis, Jozef Stefan Institute, Ljubljana, Slovenia, pers. comm.) are BmRTE-d12 (partial sequence AV406121), BmRTE-d24 (GU815089, partial sequence AV4052480), and BmRTE-d25 (GU815090, partial sequence AV406078).
To determine the evolutionary relationships between the full-length RTE elements identified in this study and previously identified full-length non-LTR retrotransposable elements, complete alignment of RT domain amino acids (EMBL:DS36752)  was obtained. The RT conserved domain of non-LTR retroelements from (EMBL:DS36752) , all newly assembled B. mori RTE elements (BmRTE-d01 to BmRTE-d25; (Genbank:FJ265542 - FJ265564, GU815089- GU815090), and the H. melpomene HmRTE-e01 element (GenBank CU462842) were identified using NCBI CD search program against the RT conserved domain CD01650 within the database CDD-31608 PSSMs, prior to a global RT conserved domain alignment using the Kalign program [51, 52] in EMBL-EBI. The RT conserved domain of CRE clade elements (CRE1, CRE2, CZAR and SLACS, see Malik et al. ) were included as outgroups. For cluster analysis, we implemented the Neighbour-Joining method and conditions (tree bisection-reconnection branch swapping, 2,000 bootstrap resampling, and maximum trees saved at each step limited to five) specified by Malik et al.  using PAUP* v4.0b10 . Due to the large data set of Malik et al. , we included only elements from the R2, RTE, R4, L1, Jockey, and CR1 clades with our BmRTE and HmRTE elements. The new amino acid alignment file used in the phylogenetic analysis is provided in Additional File 7.
Phylogenetic analysis of RTE subgroups used the RT conserved domain of RTE and Rex3 elements that were newly identified in this study (Additional File 2) and also included various elements previously reported by Zupunski et al. . The RT conserved domain were identified using the CD Search program in NCBI http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml against conserved domain CD01650. Outgroup for the RTE clade phylogenetic analysis used the Neurospora group II intron (GenBank:S07649), with the RT conserved domain identified against conserved domain CD01651 using the NCBI CD Search program. Sequence alignment was carried out by including amino acid sequences representative of the conserved domains CD01650 and CD01651 with the identified RT conserved domain amino acid sequences of all RTE, Rex3 and Bov-B LINE elements (Additional File 8). Both CD01650 and CD01651 representative sequences were removed prior to phylogenetic analysis using the Neighbour Joining method as described above.
Using the Lepidoptera Specific Core Sequences 1 to 4 (LSCS 1-4, ) we searched the NCBI GenBank non-redundant (nr) genomic DNA database using the Basic Local Alignment Tool blastn , and limiting the Entrez query with search term 'Lepidoptera'. Significant matches with expected threshold (E-value) of < E-04 were manually search for evidence of TSD and the presence of 3' SSR units.
B. anynana microsatellite DNA loci described in van't Hof et al.  were aligned and grouped using Sequencher 4.5. Consensus 5' flanking sequences prior to the SSR units from grouped microsatellite loci were translated into consensus protein sequences for use in tblastn analyses along with the 'Lepidopteran Microsatellite' Entrez query search term. In Melitaea cinxia, a preliminary blastn search using the LSCS1 DNA sequence ('Lepidoptera' as Entrez query search term) identified a significant (E-value = 4e-07) hit (GenBank:DQ389528, MCclone113). SSR units at the 3' UTR, and TSD sequences at 5' and 3' flanking regions were visually identified.
Complete amino acid sequences of RTE elements from B. mori (BmRTE-d01 to -d13), H. melpomene HmRTE-e01, and partial RTE-like element protein sequences from B. anynana and M. cinxia (Additional File 3), and H. zea (HzRTE-1-1, ) were used as input queries in tblastn searches in NCBI. Accession numbers DQ242653:DQ243686 representing the B. mori microsatellites of Miao et al.  were searched separately with the BmRTE elements as queries. Partial RTE-like element DNA sequences from H. armigera (identified based on sequence homology to HzRTE-1-1), H. zea, M. cinxia and B. anynana were also searched using blastn for homology to lepidopteran microsatellite loci within the GenBank. The search term 'Lepidoptera microsatellite' was used in the Entrez query option in all tblastn and blanstn searches. Due to the general short RTE protein sequences identified from H. armigera, B. anynana and M. cinxia, tblastn expected threshold of 100 was specified within the 'General Parameters' window. Matrix options implemented were either the default (BLOSUM62), PAM30 (≤ 40 amino acids) or PAM70 (≤ 70 amino acids) settings. All remaining algorithm parameters were as default settings. Default tblastn settings were implemented for full-length B. mori (BmRTE-d01 to BmRTE-d13) and H. melpomene HmRTE-e01 RTE elements. All significant lepidopteran microsatellite loci matching the complete and partial RTE elements were sorted for non-redundancy based on GenBank Accession numbers. Lepidopteran microsatellite loci from all tblastn and blastn searches that exhibited threshold values of greater than 1E-04 were considered non-significant and thus excluded in all subsequent analyses. All microsatellite loci with significant sequence homologies to partial RTE-like elements of H. zea, H. armigera, B. anynana and M. cinxia, and to full-length B. mori and H. melpomene RTE elements (limiting to within 200 amino acid residues to the 3' terminus) were further characterized with respect to SSR units (Additional File 5, Figure 5) and TSD sequences where possible.
The microsatellite DNA locus HaD47 (GenBank:AY497338)  with its reverse primer designed within the LSCS1 is problematic when used in population genetic studies  (but see ). The molecular dynamics of RTE integration sites within the host genome was investigated by PCR amplification using the HaD47 microsatellite forward and reverse markers following published PCR conditions . DNA from two field-collected H. armigera (AD1, AD2, from Dalmore, Victoria, Australia; ) were used in this analysis. Post PCR purification, cloning, sequencing and post sequencing DNA analyses of HaD47 microsatellite PCR amplicons followed the protocol of Scott et al. . We randomly selected five and six positively transformed colonies from AD1 and AD2 respectively to confirm the presence of partial RTE-like elements, to ascertain insertion site specificity and to characterize the patterns of associated SSR units.
non-Long Terminal Repeat
Simple sequence repeat
Target site duplication
Single nucleotide polymorphisms
Lepidoptera specific core sequences
Open reading frame
WTT and DGH were supported by the Max-Planck-Gesellschaft. GTB was supported by the Melbourne International Research Scholarship (MIRS) and the Melbourne International Fee Remission Scholarship (MIFRS). Siu Fai Lee, Adam Williams, Tamar Sztal and Adrian Boey (CESAR, University of Melbourne, Australia) provided helpful discussion during the course of this project. We thank three anonymous reviewers for helpful comments, and Dusan Kordis and Jean-Nicolas Volff for identifying accession numbers for Bombyx and Tetraodon RTE elements respectively. This project was supported by The Australian Research Council (ARC) through its funding of the Special Research Centre CESAR (Centre for Environmental Stress and Adaptation Research), and the Max-Planck-Gesellschaft.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.