Screening the bacterial artificial chromosome library
Le Paslier et al. described the construction and characterization of a bacterial artificial chromosome (BAC) library of the Schistosoma mansoni genome. The library, constructed in the plasmid vector pBeloBAC11 with genomic DNA (gDNA) from cercariae of a Puerto Rican strain of S. mansoni partially digested with HindIII, consists of 23,808 clones, about 21,000 of which are estimated to contain inserts ranging from 120 to 170 kb, providing ~8-fold coverage of the schistosome genome. Numerous BAC end sequences determined from randomly selected clones from this library are in the public domain. Inspection of the end sequence of BAC clone number 30-H-16 indicated identity with Pao-like LTR retrotransposons (not shown). Because the retrotransposon sequence was located at the end of the BAC, the clone was unlikely to contain the entire Pao-like element. Given that retrotransposons can be expected to be present in multiple copies in the host genome, we screened the library with a probe based on the end of BAC 30-H-16 in order to locate an entire copy of the retrotransposon. The gene probe was obtained by PCR amplification of a fragment of BAC 30-H-16 using the primers 5'-CGCGGATCCAAGAGAAAAACCTTGATAGAC and 5'-CCGGAATTCCTGTCGAAGATAAAAGAGC; the product was cloned into pBluescript and its identity was confirmed by sequencing (accession AY871176). This probe spanned residues 2457 to 2823 of the BAC 33-N-3 copy of the new retrotransposon (see below). The cloned insert was labeled with digoxigenin (DIG) and employed to screen the BAC library, represented as high-density clone arrays on nylon membranes, as described. Positive clones were cultured as described, and the presence of sequences with identity to the novel retrotransposon in the positive clones was confirmed by PCR (primers as above) or by colony hybridization to the DIG-labeled probe. One positive clone, BAC 33-N-3, was investigated further by sequence analysis.
BAC plasmid DNA was isolated from bacterial cultures using the PhasePrep BAC DNA purification system (Sigma). The insert of 33-N-3 was analyzed by subcloning BamHI fragments of the BAC into pNEB193 (New England Biolabs, MA) and sequencing the inserts of the subclones, and also by direct sequencing of BAC 33-N-3. Automated nucleotide sequencing, using ABI BigDye Terminator chemistry (ABI, Foster City, CA) and an ABI Prism 3100 sequencer, was undertaken at Tulane University and at Davis Sequencing (Davis, CA), first using primers specific for the probe and subsequently using gene-specific primers.
Sequence analysis and alignments
Contigs of the sequences were assembled using SeqMan (DNAstar, Inc., Madison, WI). Repeat sequences were identified with a Pustell-style dot matrix using the DotPlot3 program (Ramin Nakisa, Imperial College, London, UK) and the Pustell DNA Matrix function in MacVector (Accelrys). Amino acid alignments were accomplished with MacVector and ClustalW using sequences from GenBank or conceptual translations of nucleic acid sequences. Open reading frames were located and conceptually translated using MacVector. Sequences of the following retrotransposons were used in the multiple sequence alignments based on gag, protease, and reverse transcriptase: Ninja, T31674; Pao, S33901; MAX, CAD32253; Roo, AAN87269; BEL, AAB03640; and Saci-1, BK004068. Sequences of the following retrotransposons were used in the multiple sequence alignment based on integrase: Saci-1, DAA04498; Pao, S33901; Ninja, T31674; Roo, AAN87269; Suzu, AF537216; BEL, AAB03640; Tas, Z29712; and MAX, CAD32253.
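The dot-matrix comparison used to reveal repeats can be illustrated with a minimal Python sketch. It is not the algorithm of DotPlot3 or MacVector: it simply compares fixed-length windows by counting identical positions, and the function names, window size, threshold and toy sequence are illustrative assumptions. When a sequence is compared with itself, direct repeats such as the two LTRs of a retrotransposon appear as dots away from the main diagonal.

```python
def dot_matrix(seq_a, seq_b, window=4, min_matches=4):
    """Return (i, j) pairs where the windows starting at seq_a[i] and
    seq_b[j] share at least min_matches identical positions.

    Simplified stand-in for a Pustell-style matrix (assumption: plain
    identity scoring, no hashing and no scoring table)."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
                if a == b
            )
            if matches >= min_matches:
                dots.append((i, j))
    return dots


def direct_repeats(seq, window=4, min_matches=4):
    """Self-comparison: keep only dots above the main diagonal (j > i)."""
    return [(i, j) for i, j in dot_matrix(seq, seq, window, min_matches) if j > i]


# Toy sequence (hypothetical) in which ACGT occurs at positions 0 and 6.
toy = "ACGTTTACGT"
print(direct_repeats(toy))
```

For real sequences a longer window and a threshold below 100% identity would normally be chosen so that diverged repeats still produce a visible diagonal.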
Parasite DNAs, Southern hybridization, densitometric estimation of copy number
Genomic DNAs of cercariae of a Puerto Rican strain of S. mansoni and of adults of a Chinese (Anhui Province) strain of S. japonicum were extracted using the AquaPure Genomic DNA Purification system (Bio-Rad, Hercules, CA). S. mansoni gDNA (30 μg/lane) and 33-N-3 BAC DNA (800 ng) were digested with HindIII and BamHI restriction enzymes, and S. japonicum gDNA (20 μg/lane) was digested with HindIII. Digested gDNA and BAC DNA were size fractionated by electrophoresis through a 0.8% agarose gel, transferred to a nylon membrane (Zeta-Probe GT, Bio-Rad) by capillary action, and UV-light cross-linked to the membrane. Southern hybridization analysis with the DIG-labeled probe (above) was performed as described. Chemiluminescent signals were detected using X-ray film (Fuji). Densitometric analysis of Southern hybridization signals was accomplished using the Versa-Doc gel documentation system (Bio-Rad) and Quantity-One software (Bio-Rad). Densitometry values for signals evident in the gDNA and BAC DNA lanes were used to estimate the copy number for the new retrotransposon, Sinbad, according to the formula [(A/B) × C]/E = F. This formula was derived from two equations, (A/B) × C = D and D/E = F, where A was the number of copies of Sinbad in the BAC 33-N-3 lane, B was the density volume of the 33-N-3 lane in units of optical density per mm², C was the density volume of the S. mansoni genomic DNA lanes in units of optical density per mm², D was the total number of copies of Sinbad per genomic DNA lane, E was the number of haploid genomes in the gDNA lane, and F represented the copy number of Sinbad per haploid S. mansoni genome. The insert of 33-N-3 was estimated to be 145 kb in length and was assumed to contain only a single copy of the retrotransposon.
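The arithmetic of the formula [(A/B) × C]/E = F can be sketched in Python as follows. The DNA masses, the 145 kb insert length and the single copy per BAC come from the text. The vector length, the haploid genome size, the average mass per base pair and both density volumes are placeholder assumptions chosen only to make the example run, so the printed number is not a result of this study.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MASS = 650.0            # g/mol per base pair; a common approximation (assumption)


def dna_molecules(mass_ng, length_bp, bp_mass=BP_MASS):
    """Number of double-stranded DNA molecules of length_bp in mass_ng."""
    moles = (mass_ng * 1e-9) / (length_bp * bp_mass)
    return moles * AVOGADRO


def copies_per_haploid_genome(a, b, c, e):
    """[(A/B) x C] / E = F, with the symbols defined in the text."""
    d = (a / b) * c      # D: total Sinbad copies in the gDNA lane
    return d / e         # F: Sinbad copies per haploid genome


# From the text: 800 ng BAC DNA, 30 ug gDNA, 145 kb insert, one copy per BAC.
# Hypothetical: vector length, genome size and both densitometry volumes.
bac_length_bp = 145_000 + 7_500           # insert + assumed vector length
genome_bp = 270_000_000                   # assumed haploid genome size
a = dna_molecules(800, bac_length_bp) * 1     # one Sinbad copy per BAC molecule
e = dna_molecules(30_000, genome_bp)          # haploid genomes in the gDNA lane
b, c = 1.0, 0.5                               # placeholder density volumes
print(round(copies_per_haploid_genome(a, b, c, e)))
```

Because A and E are both derived from DNA mass and molecule length, the assumed mass per base pair cancels in the ratio A/E; only the relative lengths and masses affect F.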
Other copy number estimations
In addition to the densitometry-based estimate, estimates of the copy number of the Sinbad retrotransposon were also obtained by a comparative bioinformatics approach in which BLAST analysis of the BAC-end database of S. mansoni genomic sequences was also carried out for better-characterized retrotransposable elements from S. mansoni for which copy numbers had been reported. These included the Boudicca LTR retrotransposon and the non-LTR retrotransposons SR1 and SR2 [61, 62]. The NCBI database was searched by BLAST using the sequences of these mobile genetic elements and some other genes of S. mansoni, all of which included at least one HindIII site. Specifically, the Advanced BLAST function was used, set to search only the S. mansoni sequences in the GSS database (Limit by Entrez Query: <Schistosoma mansoni[organism]>), and with the E value at 0.000001. The E value (Expect value) is the number of matches with an equal or better score that would be expected purely by chance. Matches with E values at or below this cutoff of 10⁻⁶ were counted as positive. This exceptionally stringent cutoff value was used to minimize the chance of counting other Pao-like elements in the total copy number of Sinbad. Since the formula for E is based not only on the bit scores of the local alignment of each pair of sequences, but also on the lengths of the subject and query, no additional correction was made for the length of the query sequence.
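Counting positives at the E value cutoff can be sketched as below. The sketch assumes the hits are available as 12-column tab-separated BLAST output (subject ID in column 2, E value in column 11), which is not necessarily the form used in this study. It also assumes that each BAC end is counted once however many alignments it yields. The function name and the example hits are hypothetical.

```python
def count_positive_subjects(blast_lines, cutoff=1e-6):
    """Count distinct subject sequences with at least one hit whose
    E value is at or below cutoff.

    Assumes 12-column tab-separated BLAST output: subject ID in
    column 2, E value in column 11."""
    positives = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= cutoff:
            positives.add(subject_id)
    return len(positives)


# Hypothetical hits: query, subject, %identity, length, mismatches, gaps,
# q.start, q.end, s.start, s.end, E value, bit score.
example = [
    "query1\tbac_end_001\t98.0\t300\t6\t0\t1\t300\t10\t309\t1e-150\t540",
    "query1\tbac_end_002\t85.0\t120\t18\t0\t5\t124\t40\t159\t2e-20\t110",
    "query1\tbac_end_002\t80.0\t60\t12\t0\t200\t259\t300\t359\t5e-04\t45",
    "query1\tbac_end_003\t75.0\t40\t10\t0\t1\t40\t1\t40\t0.002\t38",
]
print(count_positive_subjects(example))
```

Running the same count for Sinbad and for elements with reported copy numbers gives hit counts that can be compared directly, since all queries are searched against the same database with the same cutoff.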
Phylogenetic analysis of Pao -like elements
Sequences for phylogenetic analysis comparing the RT region of several different retrotransposons were prepared by trimming sequences from the large single polyprotein of each retrotransposon to just the conserved domains of RT (see [21, 27]). Pol sequences presented in Xiong et al. and Abe et al. were trimmed exactly to the stretch of sequence shown by these authors to represent the RT domain. Other elements were aligned with these sequences and likewise trimmed to obtain just the RT domain. For some elements, nucleotide sequences were analyzed for open reading frames and translated before being trimmed to include just the 7 conserved blocks of the RT domain. Alignments were accomplished using Clustal X, after which bootstrapped trees (1,000 repetitions) were prepared using the neighbor-joining method and drawn with NJplot. The accession numbers for sequences included in the phylogenetic analysis are as follows: Ty3, S53577; Tas, Z29712; Suzu, AF537216; Sinbad, AY506538 (an N was inserted at position 2761 to resolve a frameshift and generate a single ORF); Saci-1, DAA04498; Roo, AAN87269; Ninja, T31674; Moose, AF060859; MAX, CAD32253; Kamikaze, AB042120; HIV-1, P04585; Gypsy, GNFFG1; Gulliver, AF243513; Copia, OFFCP; BEL, AAB03640; Cer7, AAB63932; Cer8, CAB04994; Cer9, CAB1647; Cer11, AAA82437; two uncharacterized Anopheles gambiae retrotransposons, XP_309281 and XM_308737; an uncharacterized Caenorhabditis briggsae retrotransposon, AC084491; and two uncharacterized Danio rerio retrotransposons, BX537152 and BX005079 [see Additional file 2]. Two additional sequences were either not in the database or were composites made to reconstruct sequences more closely resembling non-mutated forms of the retrotransposons. The sequence representing Pao was a reconstruction prepared by Abe et al. from accession numbers S33901, AB042118, and AB042119; the sequence representing Boudicca was a composite of translated cDNA sequences introduced in Copeland et al., AY308018, AY308019, AY308021 and AY308022 [see Additional file 2].
Screening entire or partial genomes for Sinbad
A panel of fully or partially sequenced entire genomes was searched by BLAST for elements exhibiting sequence similarity to Sinbad. The deduced amino acid sequence of the region from the Cys-His box through to the protease domain (encoded by nucleotides 106 to 753 of Sinbad [AY506538]) was employed as the query to search each genome individually using tBLASTn. The genomes searched in this way were as follows: Homo sapiens, Mus musculus, Rattus norvegicus, Takifugu rubripes, Danio rerio, Bos taurus, Gallus gallus, Sus scrofa, Canis familiaris, Anopheles gambiae, Apis mellifera, Drosophila melanogaster, Drosophila pseudoobscura, Brugia malayi, Caenorhabditis elegans, Caenorhabditis briggsae, Strongylocentrotus purpuratus, Ciona intestinalis, Ciona savignyi, Giardia lamblia, Plasmodium falciparum, Plasmodium yoelii, Plasmodium berghei, Cryptosporidium parvum, Eimeria tenella, Theileria annulata, Toxoplasma gondii, Dictyostelium discoideum, Entamoeba histolytica, Leishmania major, Trypanosoma brucei, Trypanosoma cruzi, Arabidopsis thaliana, Avena sativa, Glycine max, Hordeum vulgare, Oryza sativa, Triticum aestivum, Zea mays, Lycopersicon esculentum, Schizosaccharomyces pombe, Saccharomyces cerevisiae, Saccharomyces paradoxus, Saccharomyces mikatae, Saccharomyces bayanus, Saccharomyces castellii, Saccharomyces kluyveri, Saccharomyces kudriavzevii, Neurospora crassa, Magnaporthe grisea, Aspergillus nidulans, Aspergillus fumigatus, Aspergillus terreus, Candida albicans, Coccidioides posadasii, Gibberella zeae, Coprinopsis cinerea, Cryptococcus neoformans, Ustilago maydis and Encephalitozoon cuniculi. In addition, 275 eubacterial and 21 archaeal genomes were searched [see Additional file 3]. Genomes with matches with E values less than 0.001 (corresponding approximately to bit scores greater than 40) were considered positive for Sinbad-like elements.
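The approximate correspondence between an E value of 0.001 and a bit score of about 40 can be illustrated with the standard relation E = m × n × 2^(−S'), where S' is the bit score and m × n is the search space. The Python sketch below is a simplification that ignores the effective-length corrections applied by BLAST. The query length follows from the coordinates given above, whereas the database sizes are hypothetical, since the sizes of the databases actually searched are not stated here.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for bit score S' and search space m * n."""
    return query_len * db_len * 2.0 ** (-bit_score)


def bit_score_for_expect(evalue, query_len, db_len):
    """Bit score S' at which the expected number of chance hits is evalue."""
    return math.log2(query_len * db_len / evalue)


# Query length from the text: nucleotides 106 to 753 encode 216 residues.
query_len = (753 - 106 + 1) // 3
# Hypothetical database sizes, in the same units as the query length.
for db_len in (1e6, 1e7, 1e8):
    threshold = bit_score_for_expect(0.001, query_len, db_len)
    print(f"{db_len:.0e}: {threshold:.1f} bits")
```

Under these assumptions the threshold lies between roughly 38 and 44 bits, and it rises by about 3.3 bits for every tenfold increase in database size, which is why a single E value cutoff maps only approximately onto one bit score across genomes of different sizes.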
GenBank accession numbers
Sequences of the Sinbad LTR retrotransposon have been assigned accession numbers AY506537, AY506538, AY645721, AAT66412, and AY871176. Other sequences introduced here have been assigned GenBank Third Party Annotation accession numbers: BK005570 (Danio rerio), BK005571 (D. rerio), BK005572 (Caenorhabditis briggsae), BK005573 (Anopheles gambiae), and BK005574 (D. rerio).