Open Access Research Article Generation of Microsatellite Repeat Families by RTE Retrotransposons in Lepidopteran Genomes

article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Developing lepidopteran microsatellite DNA markers can be problematical, as markers often exhibit multiple banding patterns and high frequencies of non-amplifying "null" alleles. Previous studies identified sequences flanking simple sequence repeat (SSR) units that are shared among many lepidopteran species and can be grouped into microsatellite-associated DNA families. These families are thought to be associated with unequal crossing-over during DNA recombination or with transposable elements (TEs).
Results: We identified full-length lepidopteran non-LTR retrotransposable elements of the RTE clade in Heliconius melpomene and Bombyx mori. These retroelements possess a single open reading frame encoding the Exonuclease/Endonuclease/Phosphatase and the Reverse Transcriptase/nLTR domains, a 5' UTR (untranslated region), and an extremely short 3' UTR that regularly consists of SSR units. Phylogenetic analysis supported previous suggestions of horizontal transfer among unrelated groups of organisms, but the diversity of lepidopteran RTE elements appears due to ancient divergence of ancestral elements rather than introgression by horizontal transfer. Similarity searches of lepidopteran genomic sequences in GenBank identified partial RTE elements, usually consisting of the 3' terminal region, in 29 species. Furthermore, we identified the C-terminal end of the Reverse Transcriptase/nLTR domain and the associated 3' UTR in over 190 microsatellite markers from 22 lepidopteran species, accounting for 10% of the lepidopteran microsatellites in GenBank.
Occasional retrotransposition of autonomous elements, frequent retrotransposition of 3' partial elements, and DNA replication slippage during retrotransposition offer a mechanistic explanation for the association of SSRs with RTE elements in lepidopteran genomes.
Conclusions: Non-LTR retrotransposable elements of the RTE clade therefore join a diverse group of TEs as progenitors of SSR units in various organisms. When microsatellites are isolated using standard SSR enrichment protocols and primers designed at complementary repeated regions, amplification from multiple genomic sites can cause scoring difficulties that compromise their utility as markers. Screening against RTE elements in the isolation procedure provides one strategy for minimizing this problem.


Background
Microsatellite genetic markers are based on the properties of SSRs (simple sequence repeats) which are numerous and ubiquitous in the DNA of eukaryotes [1,2]. The basic repeat unit of an SSR is generally considered to be one to six bases long, and an array of two or more basic units repeated in tandem constitutes the SSR. Such SSRs are often miscopied by DNA polymerase, and the resulting high mutation rate, leading to a change in the number of basic units in the array, has proven to be very useful for population genetic studies. The high degree of polymorphism in the size of the tandem array can easily be visualized by gel-separation of PCR products generated by primers placed in the flanking regions, on either side of the SSR.
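The definition of an SSR given above (a basic unit of one to six bases repeated in tandem) can be illustrated with a short scan for tandem arrays. This is only a minimal sketch: the function names `find_ssrs` and `is_primitive` are hypothetical, and the default threshold of three copies is an assumption chosen to avoid reporting trivial two-copy matches; it is not a threshold taken from the text.

```python
import re

def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter motif."""
    n = len(unit)
    return not any(n % k == 0 and unit == unit[:k] * (n // k) for k in range(1, n))

def find_ssrs(seq, max_unit=6, min_copies=3):
    """Return sorted (start, unit, copies) tuples for tandem arrays in seq.

    start is 0-based; min_copies is an illustrative threshold (assumption).
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit of unit_len bases followed by at least min_copies - 1 exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if is_primitive(unit):  # skip e.g. "AA" reported as a dinucleotide
                hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)

if __name__ == "__main__":
    toy = "GATTACA" + "CTT" * 5 + "GGATCC"  # toy sequence, not real data
    print(find_ssrs(toy))
```

Only perfect repeats are detected here; interrupted or compound arrays would need a more tolerant matcher.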
Another pattern of repetition, however, has proven to be very troublesome in the practical application of microsatellite markers in some organisms, including the insect order Lepidoptera (butterflies and moths). The regions closely flanking one or both sides of the SSR may themselves be highly repeated, dispersed throughout the genome rather than occurring tandemly [3,4]. Primers designed to match these regions, therefore, may generate multiple bands, or even fail to produce discrete visible products because too many different sites are being amplified. Unlike the internally-repeated patterns of the SSRs, there is nothing obvious about the structure of these repeated flanking regions. The fact that they are repeated can be deduced only by comparisons among many flanking sequences, not by any intrinsic pattern of the sequence itself.
Meglécz et al. [5] first described families of similar sequences flanking SSR repeats from two species of Lepidoptera, Parnassius apollo and Euphydryas aurinia. These similarity patterns fall into two main types, if the location of the SSR is taken into account but its sequence and orientation are ignored. Unilateral repeat families occur on one side only of the SSR; the other side is not repeated (e.g. L1-SSR-R, L2-SSR-S, L3-SSR-T, etc.). In bilateral repeat families, both flanking sides are repeated (e.g. L1-SSR-R1, L2-SSR-R2, L3-SSR-R3, etc.). van't Hof et al. [6] denoted these as asymmetric and symmetric families respectively; however, these terms are misleading as there is not necessarily any symmetry between L1 and R1, L2 and R2, etc. These authors examined patterns of flanking SSR families in a third lepidopteran, Bicyclus anynana. Using blastn comparisons of these to sequences in GenBank, they defined four "Lepidoptera Specific Core Sequences" (LSCSs) that represented the common elements of four repeated flanking sequences.
The widespread occurrence of these LSCS regions can complicate the use of microsatellite markers in population studies. For example, markers have been developed for the polyphagous noctuid moth Helicoverpa armigera by several groups [7][8][9][10]. Markers developed for populations of one region (e.g. China) typically have a low success rate in other regions (e.g. Australia). Studies of Australian populations, even using microsatellites developed from those populations, have produced a highly dynamic picture of temporal variation in migration patterns [11][12][13][14]. However, a careful re-examination of ten of these markers by Endersby et al. [15] showed that a high proportion of these loci showed allele drop-outs or were not in Hardy-Weinberg equilibrium, undermining their reliability as accurate indicators of population structure.
Based on analysis of whole-genome sequences of several insects, including the lepidopteran Bombyx mori, Meglécz et al. [16] showed there were large interspecific differences in frequencies of microsatellites in insect genomes. They confirmed and extended the observations of shared flanking regions among independently isolated sets of microsatellite markers from several different Lepidoptera. They demonstrated a statistical association between SSRs and previously compiled datasets of repeated DNA sequences in six Drosophila species and Anopheles gambiae (a similar compilation for B. mori was not provided). van't Hof et al. [6] also suggested a general association between transposable elements (TEs) and some of the LSCS regions. However, no specific class of TEs was identified as contributing to the observed patterns.
Here we show that a class of non-LTR retrotransposons, the RTE elements, is associated with SSR repeats in Lepidoptera and other species. Full-length autonomous elements have low copy numbers in the genome of Bombyx mori; however, partial elements consisting of the coding sequence of the C-terminal end of the reverse transcriptase protein immediately flanking an SSR are highly abundant. Slippage during reverse transcription and chromosomal integration can increase the number of repeating units within the SSR. These numerous chromosomal integration events of partial elements have peppered the genome with ready-made SSRs flanking an element-specific sequence. These events can account for a substantial fraction of unilateral repeat families recovered in independent efforts from several lepidopteran species.

Identification of full-length RTE elements from Lepidoptera
The founding member of the RTE clade is the RTE-1 element from Caenorhabditis elegans [17,18]. We used its deduced protein sequence (GenBank:AF054983) in tblastn searches of lepidopteran sequences in GenBank (nr and wgs subsets). A full-length element that we have named HmRTE-e01, sharing all the major features of RTE elements, was identified in BAC clone AEHM-22C5 from Heliconius melpomene (GenBank:CU462842) (Figure 1). The single open reading frame (ORF) of 990 amino acids contains two conserved domains, an AP (apurinic) endonuclease domain and an RT (reverse transcriptase) domain. Unlike some RTE elements, no skipped stop codons or frameshifts are evident at the 5' end of the coding sequence. The AP endonuclease domain belongs to a family of proteins (pfam03372, Exo_endo_phos, endonuclease/exonuclease/phosphatase family) that includes magnesium-dependent endonucleases as well as phosphatases involved in intracellular signalling. The RT domain belongs to a family of proteins found in non-LTR retrotransposons and retroviruses (cd01650: RT_nLTR_like) with activities that include RNA-directed DNA polymerase, DNA-directed DNA polymerase and a ribonuclease that degrades the RNA in an RNA:DNA duplex (RNase H), although no RNase H domain was found in this element. These RTs catalyze the conversion of single-stranded RNA into double-stranded DNA for integration into the host chromosomes. The 3' stop codon (TAA) is immediately followed by a short 14 bp 3' UTR containing a (CTT)2CT SSR repeat. A 20 bp target-site duplication (TSD) sequence (AGTTTAAACGAAGTATATCT) immediately follows the (CTT)2CT repeat units and also occurs at the 5' end of the element prior to the 5' UTR. The TSD sequences are produced when a staggered cut is made in the double-stranded DNA prior to element insertion, and the two single-stranded regions flanking the new insert are filled in by DNA polymerase [19][20][21].
The 20 bp TSD sequence can be more easily identified through pairwise sequence alignment between the AEHM-22C5 BAC clone (GenBank:CU462842) and the homologous BAC clone AEHM-7G5 (GenBank:CU462858) that lacks the RTE insertion. The comparison of these two BAC clones from the same population of H. melpomene offers a rare opportunity to observe the pre- and post-insertion status of the same genomic region. In the BAC clone AEHM-7G5 the pre-insertion position of the HmRTE-e01 element is between nucleotide positions 22,485 and 22,486, and the 20 bp sequence at 22,466 to 22,485 has become the TSD on both sides of the element in AEHM-22C5. The insertion appears to be recent, as frameshifts and internal stop codons commonly found in other RTE elements have not accumulated, and the population is polymorphic for presence/absence of the element in this genomic locus.
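The logic of locating a TSD by comparing an empty (pre-insertion) site with a filled (post-insertion) site can be sketched as follows. This is an illustrative sketch only, not the alignment procedure used in the study: it assumes the two sequences are identical outside the insertion and cover the same interval, and the function names and toy flanking and element sequences are hypothetical. Only the 20 bp TSD string is taken from the text.

```python
def common_prefix_len(a, b):
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def find_tsd(empty_site, filled_site):
    """Return (tsd, start, end, element) or None.

    start/end are 1-based inclusive coordinates of the TSD in empty_site.
    Assumes exact identity outside the insertion (a simplifying assumption).
    """
    if len(filled_site) <= len(empty_site):
        return None
    p = common_prefix_len(empty_site, filled_site)              # left flank + TSD
    s = common_prefix_len(empty_site[::-1], filled_site[::-1])  # TSD + right flank
    overlap = p + s - len(empty_site)  # bases matched from both ends = TSD length
    if overlap <= 0 or len(filled_site) - s < p:
        return None
    tsd_start = len(empty_site) - s
    tsd = empty_site[tsd_start:p]
    element = filled_site[p:len(filled_site) - s]
    return tsd, tsd_start + 1, p, element

if __name__ == "__main__":
    tsd = "AGTTTAAACGAAGTATATCT"         # 20 bp TSD reported for HmRTE-e01
    left, right = "CCGGA", "GTCAG"       # toy flanks (hypothetical)
    element = "TCATGCATTAACTTCTTCT"      # toy element ending in a short SSR (hypothetical)
    empty = left + tsd + right
    filled = left + tsd + element + tsd + right
    print(find_tsd(empty, filled))
```

The inclusive coordinate arithmetic mirrors the text: a TSD spanning 22,466 to 22,485 has length 22,485 - 22,466 + 1 = 20 bp. If the element happens to begin or end with bases matching the adjacent flank, the inferred TSD boundary is ambiguous by those bases.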
The C. elegans RTE-1 sequence was also used to identify 25 full-length elements from the unannotated whole-genome shotgun contigs of B. mori (BmRTE-d01 through -d25), the largest number of distinct RTE elements identified from a single species to date. These have the same two conserved protein domains, AP and RT, possess various SSR repeats in the short 3' UTR immediately following the stop codon at the end of the RT domain, and are flanked by TSD sequences of differing lengths. Some of the Bombyx elements possess an in-frame stop codon or a frameshift within the first 200 bases of the coding sequence, like other RTE elements. The partial RTE sequence identified by Malik and Eickbush [17] next to the cecropin B gene of B. mori is represented by the full-length element BmRTE-d05, with some protein sequence differences because of attempts of those authors to correct frameshifts that had been identified in the partial sequence. Three partial RTE sequences identified by Zupunski et al. [22] correspond to the full-length elements BmRTE-d07, -d24, and -d25. Using the full-length Bombyx RTE elements as query sequences in blast searches of the Bombyx genome contigs yields far more hits of various lengths at the 3' end than at the 5' end, suggesting that the copy number of full-length functional autonomous elements is generally low (around 2 to 4 copies per genome, but up to 16 and 22 copies for BmRTE-d01 and BmRTE-d02 respectively, see Additional File 1), in contrast to frequent insertions (up to 300 in the case of BmRTE-d04) of partial non-autonomous elements with intact 3' ends and flanking SSRs.

Relation of lepidopteran RTE elements to other non-LTR retrotransposons
A phylogenetic analysis of the lepidopteran RTE elements based on alignment against 220 amino acid residues of the RT domain (domain CD01650 in the GenBank Conserved Domain Database) was conducted in comparison with other RTE elements and, more generally, other clades of non-LTR retrotransposons (Figure 2). We follow the definition of "clade" used by Malik et al. [23] to represent those retroelements that (1) share the same structural features, (2) are grouped together with ample phylogenetic support, and (3) date back to the Precambrian era. Previously analyzed non-LTR retrotransposons [23] and the B. mori CR1 non-LTR element within the CR1 clade [24] were included, with members within the CRE clade (i.e., CRE1, CRE2, CZAR and SLACS) as outgroups. All of the lepidopteran RTE elements cluster together with previously identified RTE elements (RTE-1, RTE-2, BDDF, JAM1) into a single group with a bootstrap value of 85%, placing these newly described lepidopteran elements within the RTE clade [23]. Two major sister groups are evident, consistent with previous analysis [23], one containing RTE-1 and RTE-2 from C. elegans and the other JAM1 from the mosquito Aedes aegypti and BDDF from the cow Bos taurus. A third minor sister group with a more basal position contained three B. mori elements (BmRTE-d13, -d15, -d17) and the H. melpomene element HmRTE-e01.
We extended previous phylogenetic analyses of the RTE subgroups using newly identified fish, molluscan, cnidarian, lizard, amphioxus and plant RTE elements (Additional File 2) as well as previously described nematode RTE (RTE-1, RTE-2, [23]), Aedes JAM1 [23], Schistosoma SR2 [32], Oryzias RTE [22], partial RTEs from Bombyx [22], and plant RTE (Aegilops, Hordeum [22]). This indicates the BmRTE elements are highly divergent, being distributed among the Caenorhabditis RTE, Bov-B LINE and Animal RTE subgroups (Figure 3). From the phylogenetic analysis, elements within the RTE clade are grouped into four subgroups of Plant/Animal RTE, Rex3/RTE, Caenorhabditis RTE/Bombyx RTE, and Bov-B LINE/Bombyx RTE (Figure 3). HmRTE-e01 and BmRTE-d13 are clustered within the animal RTEs, while BmRTE-d17 and Schistosoma SR2, and BmRTE-d14 and Aedes JAM1 are placed basal to the Rex3/RTE and Plant/Animal RTE subgroups. The pattern of wide RTE diversity within Lepidoptera, and low similarity to other groups, suggests that the main lineages of Bombyx RTE elements identified here are ancient and not due to recent horizontal transfer of elements from non-arthropod groups.
Figure 2 … Figure 3 of Malik et al. [23], and included also the recently described B. mori non-LTR CR1B element of the CR1 clade [24]. The tree was constructed using the Neighbour Joining (NJ) method as described in Malik et al. [23] with the CRE element RT conserved domain as outgroups. The NJ tree is a 50% consensus tree, with bootstrap values of >70 from 2,000 bootstrap replications indicated at respective nodes. The RTE clade includes previously described RTE-1, RTE-2, JAM1 and BDDF [23] as well as 25 newly identified elements from B. mori and one from H. melpomene. With the exception of BmCR1B which was obtained from [24], all amino acid sequences from the RTE, R2, R4, L1, Jockey, CR1 and CRE clades of Malik et al. [23] were from their sequence alignment (EMBL:DS36752).
In contrast, the high similarity among Bov-B LINEs of reptiles and mammals (V. ammodytes Bov-B LINE, Bos taurus BDDF) could be explained by horizontal transfer, as previously suggested by Zupunski et al. [22]. These authors also inferred another instance of horizontal transfer from arthropods to reptiles based on similarity between Bov-B from V. ammodytes and other reptiles, and partial sequences from B. mori they named Bombyx Bov-B/RTE (equivalent to our full-length BmRTE-d24 and -d25). We found the reptile/mammal group to cluster additionally with B. mori BmRTE-d05, -d06, -d12, and -d16, but outside the other Bombyx elements, pointing to this subgroup as the likely source of the horizontal transfer. As pointed out by Zupunski et al. [22], horizontal transfer between plant and fish RTE elements was also evident (Figure 3), and the distribution of plant-like elements within fishes is wider than previously suspected, as shown by the element we have identified from the hagfish Eptatretus.

Lepidopteran RTE elements in public DNA databases
A few examples of partial RTE elements in non-coding flanking sequences or introns of genes from Lepidoptera have been deposited in public databases. Chen and Li [25] looked for TEs neighbouring cytochrome P450 genes in Helicoverpa zea, and recognized a partial RTE element they named HzRTE-1 within the third intron of CYP9A12v3 (GenBank:DQ788839). This partial element was 1,754 bp long, was flanked by 10 bp TSDs, encoded an RT domain with 40% amino acid sequence identity to the C. elegans RTE-1 (and 78% identity to the last 400 residues of BmRTE-d01, Additional File 4), and was terminated by a region of TGA trinucleotide repeats in the short 3' UTR. Xu et al. [26] discovered genes with similarity to delta-11 desaturases (GenBank:EF113398) implicated in pheromone biosynthesis in Ostrinia nubilalis and O. furnacalis, each of which was adjacent to a partial RTE element with 78% amino acid identity to the C-terminal 473 residues of BmRTE-d08 (Additional Files 3 and 4). These authors recognized the sequence similarity to the C. elegans RTE-1 element, but considered the partial elements from Ostrinia to represent a new family which they named ezi, although this Ostrinia family was clearly clustered within the RTE-1 clade in their phylogenetic analysis [26]. No SSRs were found in the 3' UTR of these elements.
Other instances of lepidopteran sequences in GenBank that we found have not been previously recognized as RTE elements. These include two partial HzRTE-1-like sequences (named HaRTE-t01; see Additional File 3) within the first intron of two different alleles of a cadherin gene from H. armigera (GenBank:AY714875 and AY714876) (Additional File 4, Figure 4). A 224 bp partial HzRTE-1-like sequence was also found in the second intron of the preproattacin A gene (GenBank:U46130) of the noctuid moth Trichoplusia ni. In B. mori, RTE elements are common features of genes deposited in GenBank. For example, in the gDNA sequence for cuticle protein genes BMWCP5-BMWCP2 (GenBank:AB262389), two partial elements (one of 442 bp with 97% identity to BmRTE-d09 over the last 137 amino acids, and one of 521 bp with 88% identity to BmRTE-d02 over the last 144 amino acids) were identified, with both partial elements flanked by TSDs and (TGA)n SSR units at the 3' UTR (Additional File 4, Figure 4). Other lepidopteran species with significant gDNA homology to RTE elements (but not as microsatellite DNA loci) included the prophenoloxidase-activating proteinase-1 gene from Manduca sexta (GenBank:AY789465), the hemolymph storage protein 2 gene from Samia cynthia (GenBank:AB288052), the farnesyl diphosphate synthase gene from Choristoneura fumiferana (GenBank:AY962308), the arylphorin gene of Galleria mellonella (GenBank:M73793), a BAC clone from Heliconius …
Figure 3 Neighbour-joining RTE clade phylogenetic tree. The NJ tree is a 50% consensus tree, with bootstrap values of >70 from 2,000 bootstrap replications indicated at respective nodes. Alignment of the complete RT conserved domain used the Kalign sequence alignment program [51,52] in EMBL-EBI. The Neurospora Group II intron (GenBank: S07649) was used as the outgroup. Fourteen representative BmRTE sequences used in Figure 2 have been included, along with newly described RTE elements as listed in the Methods, indicated with an asterisk. The RTE elements were broadly clustered as reported in Figure 5 of [22], although the higher number of BmRTEs identified and included in this analysis, together with the lower number of Bov-B LINEs included, has altered the tree topology. Overall, the elements were grouped into four sister groups of animal/plant RTE, Rex3/RTE, Bov-B LINE/RTE, and Caenorhabditis/Bombyx RTE. Although basal to the Plant/Animal RTE and Rex3/RTE subgroups, the positions of JAM1 and SR2 remained uncertain [22] due to the lower (<70%) confidence values at the respective nodes. Two nodes representing horizontal transfer events proposed by Zupunski et al. [22] are indicated: (A) from plants to some fishes, and (B) from arthropods to reptiles and then to ruminant mammals. Note that the medaka fish Oryzias has both a Rex3 RTE element similar to other fishes, and a plant-like RTE element.

RTE elements in Lepidopteran microsatellite clones
The largest collection of microsatellite markers for any single lepidopteran species was assembled by Miao et al. [27] in B. mori. More than 13,600 positive clones were identified by hybridization, yielding 2,690 confirmed by sequencing and 555 polymorphic markers that were used to construct a linkage map. Of these, 518 were deposited in GenBank, and we identified 64 containing fragments of one or more of the BmRTE elements by BLAST. Thus about 12% of this collection of microsatellite markers contained one or more RTE elements.
Overall, of 2,183 lepidopteran microsatellite sequences deposited in GenBank, 218 or almost 10% were found to contain an RTE element (Additional File 6). The top 16 lepidopteran species accounted for 1,723 microsatellites, 202 or almost 12% of which contain RTE elements. There is some heterogeneity within this group, with RTE content ranging from 0% to 44%. The remaining species with 27 or fewer microsatellites each account for 460 loci, only 16 of which (3.5%) contain RTEs. These differences could reflect actual species differences in microsatellite frequencies or methodological differences among different researchers, only some of whom used enrichment techniques.
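The proportions quoted in this section follow directly from the counts given in the text, and the two species groups sum to the overall totals. The short check below is a sketch for the reader's convenience; the dictionary labels and the helper name `percent` are ours, while every count is taken from the two paragraphs above.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole

# (markers containing RTE fragments, total markers), as stated in the text
counts = {
    "B. mori markers of Miao et al. in GenBank": (64, 518),
    "all lepidopteran microsatellites in GenBank": (218, 2183),
    "top 16 species": (202, 1723),
    "remaining species": (16, 460),
}

if __name__ == "__main__":
    for label, (rte, total) in counts.items():
        print(f"{label}: {rte}/{total} = {percent(rte, total):.1f}%")
    # The two groups partition the overall data set.
    print(1723 + 460, 202 + 16)
```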

Discussion
We have identified several RTE non-LTR retrotransposable elements in the Lepidoptera, and provided evidence for their association with unilateral microsatellite DNA families. These lepidopteran RTE elements therefore join a small group of TEs in other organisms (e.g. Drosophila [29], humans [30], barley [31]) as being associated with the genesis of microsatellite DNA repeat units. They account for a significant fraction of microsatellite markers isolated by many independent groups working with Lepidoptera (almost 10% overall, ranging from 0% to 44% for different species), and their wide occurrence may be partly responsible for problematical aspects of these markers. Non-LTR RTE elements have a wide but disjunct phylogenetic distribution in eukaryotes [17,22,23] and have been found in sea urchins (Strongylocentrotus purpuratus), nematodes (C. elegans), blood flukes (Schistosoma mansoni [32]), mosquitoes (Ae. aegypti), amphioxus (Branchiostoma floridae [33]), fishes (Xiphophorus maculatus [34]), snakes (Vipera ammodytes [22]) and mammals (Bos taurus) as well as plants [22]. We have identified additional full-length elements from cnidaria, mollusca, and hagfish, as well as 25 from the genome of B. mori and one from H. melpomene. Lepidopteran RTE elements discovered so far are very diverse but are all grouped into the RTE clade of Malik et al. [23].
Genomic insertion sites of RTEs do not appear to be sequence-specific, and can be discerned only by target-site duplications of 6-12 bp after insertion. The typically short 3' UTR often contains di- or tri-nucleotide simple sequence repeats. Truncated RTE elements consisting only of the 3' end of the RT coding sequence followed by the 3' UTR and flanked by target-site duplications are often found in genomes. This configuration suggests that a staggered double-stranded break in the target sequence was made, followed by reverse transcription of the full-length RNA element from the 3' end, which was interrupted before completion so that only the 3' end of the element was finally inserted between the target-site duplications. A large number of truncated elements may accumulate due to the replication efforts of just a few full-length elements, similar to that reported for L1 retrotransposons (reviewed in [20]). For example, in the mosquito Anopheles gambiae, the Ag-JAMMIN-2 element is represented by five copies with intact ORFs and about 1,940 truncated elements [35]. In the absence of a full-length RTE sequence as evidence, such truncated RTE elements may be misidentified as short interspersed nucleotide elements (SINEs; [17]) which by contrast are usually derived from abundant cellular RNAs, lack coding potential, possess poly(A) tails, are independently transcribed from their own internal Pol III promoters, and utilize proteins encoded by other retro-elements for their insertion. Similarity of the predicted protein sequences at the 3' ends enabled Malik and Eickbush [17] to recognize that some families previously classified as SINEs were actually truncated RTE elements, just as it has enabled us to recognize RTE elements within isolated microsatellite markers. This is because it is easier to recognize homology of divergent protein sequences than nucleotide sequences.
Some insect TEs show a preference for inserting into specific sequences, and sometimes these sequences are simple sequence repeats. For example, DONG targets the (TAA)n of the ribosomal DNA non-transcribed spacer region [36], and the telomere-specific TRAS families target pentameric (TTAGG/CCTAA)n SSR units [37]. Since these SSR-target insertion sequences are pre-existing, TE insertion would not increase the abundance of SSRs in the genome, but could increase the abundance of a particular unilateral family of SSR repeats by juxtaposing more copies of the TE next to SSRs existing in different genomic locations. We were not able to find evidence of DONG or TRAS elements in microsatellite markers isolated from Lepidoptera; however it may be that some of the still-uncharacterized LSCS sequence families are formed in this way. By contrast, each new insertion of an RTE element with SSRs in the 3' UTR does increase the abundance of SSRs in the genome. The SSR may have been present in the 3' UTR in the RNA molecule participating in the insertion, or may have been incorporated into the cDNA by slippage of the RT enzyme during reverse transcription. But since the SSR is positioned between the two TSDs, it clearly did not exist in that location prior to the insertion event.
Some fingerprinting methods benefit from the random multiple insertion of transposable elements into genomes [38][39][40]; however, these do not yield single-copy markers. The difficulty of developing lepidopteran microsatellites that function as reliable single-copy codominant markers is widely acknowledged (e.g., [3][4][5][41,42]). Our sequence comparison of size variants of the microsatellite marker HaD47 showed that at least three different loci were being amplified in different individuals, violating the single-copy assumption. With respect to codominance, a deficiency in observed levels of heterozygosity relative to Hardy-Weinberg expectations has been reported for many published lepidopteran microsatellite DNA markers, and has often been attributed to the presence of null alleles (e.g., [41]). Null alleles can significantly alter the estimation of population substructure patterns by decreasing the within-population allelic diversity (e.g., [43]). However, many microsatellite loci reported to exhibit significantly lower than expected levels of heterozygosity are associated with RTE elements, such as in H. armigera (HarSSR3, [8]; HarSSR7, [7]), H. zea (HzMS1-6, [28]), Y. padellus (YP35, [44]), A. epimuta (AeG5, [45]) and B. betularia (Biston 12, [46]). Although it is still unknown whether a systematic RTE-associated effect on heterozygosity exists, the possibility deserves to be taken into consideration along with other explanations such as the Wahlund effect or selection pressure (e.g., insecticide applications on pest species).
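To make the link between null alleles and heterozygote deficiency concrete, the sketch below computes the heterozygosity that would be scored at a locus carrying a non-amplifying allele. It is a textbook-style illustration, not an analysis performed in this study. It assumes Hardy-Weinberg genotype proportions, that heterozygotes carrying the null allele are scored as homozygotes for the visible allele, and that null homozygotes yield no product and are dropped; the function name and the example frequencies are hypothetical.

```python
def apparent_heterozygosity(visible_freqs, null_freq):
    """Return (observed, expected) heterozygosity as it would be scored.

    visible_freqs: true frequencies of amplifying alleles.
    null_freq: true frequency of the non-amplifying allele (0 <= null_freq < 1).
    """
    if not 0.0 <= null_freq < 1.0:
        raise ValueError("null_freq must be in [0, 1)")
    if abs(sum(visible_freqs) + null_freq - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    scored = 1.0 - null_freq ** 2  # null homozygotes give no PCR product
    # True frequency of heterozygotes between two different visible alleles.
    het = (1.0 - null_freq) ** 2 - sum(p * p for p in visible_freqs)
    observed = het / scored
    # Allele frequencies estimated from scored genotypes: p / (1 - null_freq).
    apparent = [p / (1.0 - null_freq) for p in visible_freqs]
    expected = 1.0 - sum(q * q for q in apparent)
    return observed, expected

if __name__ == "__main__":
    print(apparent_heterozygosity([0.5, 0.5], 0.0))  # no null allele: no deficit
    print(apparent_heterozygosity([0.4, 0.4], 0.2))  # null allele: observed < expected
```

Under these assumptions a null allele at frequency 0.2 reduces the scored heterozygosity to one third while the expectation computed from the scored alleles stays at one half, which is the kind of deficit described above.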
One approach to improving the reliability of microsatellite markers is to target single-copy regions in the genome. Widdel et al. [47] used Cot analysis to isolate the slowly-reannealing, low-copy-number fraction of the genome of Ae. japonicus from which SSR clones were subsequently isolated. Another approach is to sort through clones already obtained to identify those sharing similarities in their flanking regions and discard these; Meglécz [48] has described a computer program MICROFAMILY for that purpose. Another alternative for developing codominant markers is to avoid SSRs entirely and to design PCR primers to conserved exons in protein-coding genes to screen for polymorphisms in the intervening intron (EPIC markers, for example as developed as an alternative to microsatellites for H. armigera, [49]). A completely opposite strategy is to exploit the high copy number and dispersed distribution of the repeated sequences flanking microsatellites to develop fingerprinting-type, dominant markers [38][39][40]. Anderson et al. [50] developed two repetitive flanking sequences (ReFS1 and ReFS2) based on microsatellites isolated from A. caja, where both ReFS markers are part of the LSCS1 sequence with minor variations. They used PCR primers designed to these sequences to discriminate between different species of the moth genus Schrankia and to detect interspecific hybrids; dominant markers are sufficient for both purposes. In this application, the sequence conservation of the RTE element during evolution is useful for extending marker utility outside of the species of discovery, and insertions that occur after species divergence are useful for discriminating species, presuming that these insertions have become fixed.
New lepidopteran TEs are continuously being identified and characterized (e.g., [24,25]). TEs identified to date in the Lepidoptera are likely to represent only a fraction of the total numbers and types, and some of these others may also be responsible for the genesis of unilateral or bilateral microsatellite DNA families in various lepidopteran species.

Conclusions
The presence of microsatellite DNA families in the Lepidoptera has mystified many evolutionary and population geneticists over the last decade. Our finding that non-LTR retrotransposable elements of the RTE clade have rendered large proportions of lepidopteran microsatellite DNA markers developed to date ineffective is indicative of the depth of the problems challenging many researchers who have invested resources in developing such molecular genetic tools. Our study will enable TE-affected DNA markers to be recognised across a wide range of organisms, thus allowing informed decisions to be made regarding the utilisation of such DNA markers in future population and evolutionary genetic studies. The presence of RTEs across diverse plant and animal evolutionary lineages implies that the phenomenon of RTE-associated microsatellite DNA families may be widespread in many biological systems, and that RTEs are likely to represent one of many as yet unrecognised classes of TEs capable of generating microsatellite DNA families.

Identification of lepidopteran RTE elements

In Bombyx mori
The C. elegans RTE-1 amino acid sequences (GenBank:U58775, U0063, g2253129) as reported in Youngman et al. [18] and Malik and Eickbush [17] were used to identify related RTE elements through tblastn searches in NCBI and KaikoBlast. There are currently three different unannotated genome assemblies available in the wgs section at NCBI. We searched the second assembly with 6-fold coverage (GenBank AADK01000001:AADK01066482) in preference to the first assembly with 3-fold coverage (GenBank BAAB01000001:BAAB01213289) because of the larger contig size of the second assembly; and in preference to the third assembly with 9-fold coverage (GenBank BABH01000001:BABH01088672) because most repeated sequences including RTEs had been masked out and not restored in the third version. All positive gDNA sequences identified in B. mori were assembled using Sequencher 4.5 (Gene Codes Corporation, Ann Arbor MI, USA) to obtain consensus copies of full-length B. mori RTE elements. When an RTE element spanned two or more contigs, only those with >95% sequence identity in the region of overlap were used.
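The >95% identity criterion for joining contigs can be illustrated with a simple ungapped comparison of an overlap of known length. This is a simplified sketch and not the Sequencher assembly algorithm, which handles gaps and finds overlaps itself; the function names, the requirement that the overlap length be supplied, and the toy sequences are all assumptions made for illustration.

```python
def overlap_identity(left, right, overlap_len):
    """Percent identity between the last overlap_len bases of `left`
    and the first overlap_len bases of `right` (ungapped comparison)."""
    if not 0 < overlap_len <= min(len(left), len(right)):
        raise ValueError("overlap_len must fit within both contigs")
    a = left[-overlap_len:]
    b = right[:overlap_len]
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / overlap_len

def can_join(left, right, overlap_len, min_identity=95.0):
    """True only if identity in the overlap is strictly above the threshold."""
    return overlap_identity(left, right, overlap_len) > min_identity

if __name__ == "__main__":
    left = "ACGTACGTAC" * 4                          # toy contig, 40 bases
    right = left[:10] + "T" + left[11:] + "GGG"      # one mismatch in a 40-base overlap
    print(overlap_identity(left, right, 40), can_join(left, right, 40))
```

Because the criterion is strict (>95%), an overlap with exactly 95% identity is rejected in this sketch.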

In Heliconius melpomene
We used the C. elegans RTE-1 protein sequence (GenBank:AAC72298) in a tblastn search at NCBI, specifying 'Lepidoptera' in the Entrez query search parameter. Significant positive matches were recorded, and the H. melpomene BAC clone AEHM-22C5 (GenBank:CU462842) was assessed for putative coding sequences using the NCBI ORF Finder search tool with the default genetic code option. Protein sequences identified by the ORF search were subjected to blastp searches against the non-redundant (nr) database to ascertain the presence of RTE-like conserved domains. SSR units flanking the 3' UTR of putative RTE-like conserved domains were identified visually. DNA sequences flanking the 3' end of the SSR units and the 5' UTR were examined manually for evidence of target site duplication (TSD) sequences. Annotation was added to the GenBank record CU462842 by the submitter, Dr. Simon W. Baxter, Cambridge University.
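The TSD inspection described above was done by eye. As a rough illustration of what such an inspection looks for, the sketch below reports the longest exact repeat shared by the sequence immediately upstream of the 5' UTR and the sequence immediately downstream of the 3' SSR units. The function name, the 30-bp window, and the 5-bp minimum length are hypothetical choices and were not part of the study's protocol.

```python
from typing import Optional


def find_candidate_tsd(
    upstream: str, downstream: str, window: int = 30, min_len: int = 5
) -> Optional[str]:
    """Return the longest exact direct repeat shared by the two flanks, or None.

    upstream:   sequence ending where the element's 5' UTR begins
    downstream: sequence starting just after the element's 3' SSR units
    window and min_len are illustrative assumptions.
    """
    left = upstream[-window:].upper()
    right = downstream[:window].upper()
    best = ""
    for start in range(len(left)):
        # Only try candidates longer than the best repeat found so far.
        first_end = start + max(min_len, len(best) + 1)
        for end in range(first_end, len(left) + 1):
            candidate = left[start:end]
            if candidate in right:
                best = candidate
            else:
                break  # extending a non-matching substring cannot match
    return best or None
```

Only exact repeats are detected here; real TSDs can be imperfect, which is one reason visual inspection remains useful.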

Phylogenetic analysis of putative lepidopteran RTE elements
To determine the evolutionary relationships between the full-length RTE elements identified in this study and previously identified full-length non-LTR retrotransposable elements, the complete alignment of RT domain amino acids (EMBL:DS36752) [23] was obtained. The RT conserved domains of the non-LTR retroelements from EMBL:DS36752 [23], of all newly assembled B. mori RTE elements (BmRTE-d01 to BmRTE-d25; GenBank:FJ265542-FJ265564, GU815089-GU815090), and of the H. melpomene HmRTE-e01 element (GenBank CU462842) were identified using the NCBI CD Search program against the RT conserved domain CD01650 within the database CDD-31608 PSSMs, prior to a global RT conserved domain alignment using the Kalign program [51,52] at EMBL-EBI. The RT conserved domains of CRE clade elements (CRE1, CRE2, CZAR and SLACS; see Malik et al. [23]) were included as outgroups. For cluster analysis, we implemented the Neighbour-Joining method and the conditions (tree bisection-reconnection branch swapping, 2,000 bootstrap replicates, and maximum trees saved at each step limited to five) specified by Malik et al. [23], using PAUP* v4.0b10 [53]. Because of the large size of the Malik et al. [23] data set, we included only elements from the R2, RTE, R4, L1, Jockey, and CR1 clades with our BmRTE and HmRTE elements. The new amino acid alignment file used in the phylogenetic analysis is provided in Additional File 7.
Phylogenetic analysis of RTE subgroups used the RT conserved domains of RTE and Rex3 elements that were newly identified in this study (Additional File 2) and also included various elements previously reported by Zupunski et al. [22]. The RT conserved domains were identified using the NCBI CD Search program (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) against conserved domain CD01650. The outgroup for the RTE clade phylogenetic analysis was the Neurospora group II intron (GenBank:S07649), with its RT conserved domain identified against conserved domain CD01651 using the NCBI CD Search program. Sequence alignment was carried out by including amino acid sequences representative of the conserved domains CD01650 and CD01651 with the identified RT conserved domain amino acid sequences of all RTE, Rex3 and Bov-B LINE elements (Additional File 8). Both the CD01650 and CD01651 representative sequences were removed prior to phylogenetic analysis using the Neighbour-Joining method as described above.

RTE-like elements in public DNA databases
Using the Lepidoptera Specific Core Sequences 1 to 4 (LSCS 1-4, [6]), we searched the NCBI GenBank non-redundant (nr) genomic DNA database using the Basic Local Alignment Search Tool blastn [54], limiting the Entrez query with the search term 'Lepidoptera'. Significant matches with an expected threshold (E-value) of < 1E-04 were manually searched for evidence of TSDs and for the presence of 3' SSR units.

RTE-related lepidopteran microsatellite loci in GenBank
B. anynana microsatellite DNA loci described in van't Hof et al. [6] were aligned and grouped using Sequencher 4.5. Consensus 5' flanking sequences prior to the SSR units from grouped microsatellite loci were translated into consensus protein sequences for use in tblastn analyses along with the 'Lepidopteran Microsatellite' Entrez query search term. In Melitaea cinxia, a preliminary blastn search using the LSCS1 DNA sequence ('Lepidoptera' as Entrez query search term) identified a significant (E-value = 4e-07) hit (GenBank:DQ389528, MCclone113). SSR units at the 3' UTR, and TSD sequences at 5' and 3' flanking regions were visually identified.
Complete amino acid sequences of RTE elements from B. mori (BmRTE-d01 to -d13) and H. melpomene (HmRTE-e01), partial RTE-like element protein sequences from B. anynana and M. cinxia (Additional File 3), and H. zea (HzRTE-1-1, [25]) were used as input queries in tblastn searches in NCBI. Accession numbers DQ242653:DQ243686, representing the B. mori microsatellites of Miao et al. [27], were searched separately with the BmRTE elements as queries. Partial RTE-like element DNA sequences from H. armigera (identified based on sequence homology to HzRTE-1-1), H. zea, M. cinxia and B. anynana were also searched using blastn for homology to lepidopteran microsatellite loci within GenBank. The search term 'Lepidoptera microsatellite' was used in the Entrez query option in all tblastn and blastn searches. Because the RTE protein sequences identified from H. armigera, B. anynana and M. cinxia were generally short, a tblastn expected threshold of 100 was specified within the 'General Parameters' window. The matrix options implemented were the default (BLOSUM62), PAM30 (≤ 40 amino acids) or PAM70 (≤ 70 amino acids) settings. All remaining algorithm parameters were left at their default settings. Default tblastn settings were implemented for the full-length B. mori (BmRTE-d01 to BmRTE-d13) and H. melpomene HmRTE-e01 RTE elements. All significant lepidopteran microsatellite loci matching the complete and partial RTE elements were sorted for non-redundancy based on GenBank accession numbers. Lepidopteran microsatellite loci from all tblastn and blastn searches that exhibited E-values greater than 1E-04 were considered non-significant and thus excluded from all subsequent analyses. All microsatellite loci with significant sequence homologies to partial RTE-like elements of H. zea, H. armigera, B. anynana and M. cinxia, and to full-length B. mori and H. melpomene RTE elements (limited to within 200 amino acid residues of the 3' terminus), were further characterized with respect to SSR units (Additional File 5, Figure 5) and TSD sequences where possible.
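The search settings and filtering rules in this section can be summarised in a short sketch. It assumes that hits have already been parsed into (accession, E-value) pairs; the function names and the example accessions are hypothetical, and the choice to keep the lowest E-value per accession is an assumption about how redundancy might be resolved.

```python
from typing import Dict, Iterable, Tuple


def choose_matrix(query_length: int) -> str:
    """Pick the scoring matrix from query length, following the cut-offs in the text."""
    if query_length <= 40:
        return "PAM30"
    if query_length <= 70:
        return "PAM70"
    return "BLOSUM62"  # default matrix for longer queries


def significant_nonredundant(
    hits: Iterable[Tuple[str, float]], max_evalue: float = 1e-4
) -> Dict[str, float]:
    """Drop hits with E-value > 1E-04 and keep one entry per accession.

    Assumption: when an accession is hit more than once, the lowest
    E-value is retained.
    """
    best: Dict[str, float] = {}
    for accession, evalue in hits:
        if evalue > max_evalue:
            continue  # non-significant, excluded from later analyses
        if accession not in best or evalue < best[accession]:
            best[accession] = evalue
    return best


# Hypothetical parsed hits (placeholder accessions, not real records).
example_hits = [("ACC0001", 4e-7), ("ACC0001", 1e-9), ("ACC0002", 0.5), ("ACC0003", 1e-4)]
filtered = significant_nonredundant(example_hits)
```

In this example `ACC0002` is discarded as non-significant, and the two hits to `ACC0001` collapse into a single entry.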

Molecular characterization of HaD47 alleles
The microsatellite DNA locus HaD47 (GenBank:AY497338) [9], whose reverse primer was designed within LSCS1, is problematic when used in population genetic studies [15] (but see [50]). The molecular dynamics of RTE integration sites within the host genome were investigated by PCR amplification using the HaD47 microsatellite forward and reverse primers, following published PCR conditions [9]. DNA from two field-collected H. armigera individuals (AD1 and AD2, from Dalmore, Victoria, Australia; [55]) was used in this analysis. Post-PCR purification, cloning, sequencing and post-sequencing DNA analyses of HaD47 microsatellite PCR amplicons followed the protocol of Scott et al. [49]. We randomly selected five and six positively transformed colonies from AD1 and AD2, respectively, to confirm the presence of partial RTE-like elements, to ascertain insertion site specificity and to characterize the patterns of associated SSR units.

Additional material
Additional file 1 Estimates of full-length BmRTE elements in the genome of Bombyx mori. NCBI GenBank tblastn searches used the first 50 (5' end) and last 50 (3' end) amino acid residues of the Bombyx mori RTE elements BmRTE-d01 to -d25 as queries against the Chinese B. mori contigs (GenBank AADK01000001:AADK01066482) to estimate full-length BmRTE copy numbers. Results indicated low copy numbers overall, between 2 and 4 for most BmRTE elements within the host genome, although BmRTE-d01 and BmRTE-d02 had higher copy numbers, with 16 and 22 copies, respectively.