IS 4 family goes genomic
© De Palmenaer et al. 2008
Received: 13 September 2007
Accepted: 23 January 2008
Published: 23 January 2008
Insertion sequences (ISs) are small, mobile DNA entities able to expand in prokaryotic genomes and trigger important rearrangements. To understand their role in evolution, accurate IS taxonomy is essential. The IS 4 family is composed of ~70 elements and, like some other families, displays extremely elevated levels of internal divergence impeding its classification. The increasing availability of complete genome sequences provides a valuable source for the discovery of additional IS 4 elements. In this study, this genomic database was used to update the structural and functional definition of the IS 4 family.
A total of 227 IS 4 -related sequences were collected among more than 500 sequenced bacterial and archaeal genomes, representing more than a threefold increase of the initial inventory. A clear division into seven coherent subgroups was discovered as well as three emerging families, which displayed distinct structural and functional properties. The IS 4 family was sporadically present in 17% of analyzed genomes, with most of them displaying single or a small number of IS 4 elements. Significant expansions were detected only in some pathogens as well as among certain extremophiles, suggesting the probable involvement of some elements in bacterial and archaeal adaptation and/or evolution. Finally, it should be noted that some IS 4 subgroups and two emerging families occurred preferentially in specific phyla or exclusively inside a specific genus.
The present taxonomic update of IS 4 and emerging families will facilitate the classification of future elements as they arise from ongoing genome sequencing. Their narrow genomic impact and the existence of both IS-poor and IS-rich thriving prokaryotes suggested that these families, and probably ISs in general, are occasionally used as a tool for genome flexibility and evolution, rather than just representing self-sustaining DNA entities.
Insertion sequences (ISs) are small (< 2.5 kb), generally phenotypically cryptic segments of DNA able to jump, or copy themselves, into various genomic sites with no need for DNA homology. They generally encode no functions other than those involved in their mobility, although elements including additional genes are now being identified [2, 3]. While almost exclusively restricted to bacterial and archaeal genomes, they are, like eukaryotic transposable elements, involved in a wide variety of biological transactions leading to genome reshuffling and evolution. Indeed, their ability to proliferate within a genome provides the potential for homologous recombination-mediated deletions or inversions, and their capacity to transport accessory genes represents an additional contribution to genome flexibility. By affecting gene expression and facilitating the emergence of new gene clusters, they play an important role in the adaptability of their host. Eventually, horizontal transfer mechanisms such as conjugation allow these IS-mediated sets of genes to cross barriers between strains, species and beyond.
The DNA breaks and joins necessary for transposition are catalysed by an element-encoded protein referred to as transposase. These proteins determine transposition mechanisms and are now used to guide the classification of prokaryotic transposable elements in general. Accordingly, transposases that form a covalent intermediate with DNA are distinguished from those that do not. Additional distinction is provided by protein active-site residues crucial for transposition. These define the five major transposase classes currently established: tyrosine (Y), serine (S), relaxase (Y1) and rolling-circle (Y2) transposases involve covalent intermediates with DNA during transposition, while the fifth class, namely DDE transposases, prompts transposition via direct transesterification reactions [5–7].
DDE transposases display three acidic residues in three distinct regions of their primary sequence, namely regions N2, N3 and C1, which harbor the aspartate (D), aspartate (D) and glutamate (E) residues, respectively. Spacers of various lengths separate these regions, but their acidic residues are brought together upon protein folding to form a catalytic triad essential for transposition [8–11].
Prokaryotic DDE transposons (mainly ISs) can move in two different ways, depending on the fate of the donor site. Replicative transposons mobilize a copy of their DNA, leaving the parent site intact, while conservative transposons cut themselves out of the donor molecule in order to paste their DNA into the target.
Beyond mechanistic behaviors, each individual IS is characterized by structural features used to fine-tune their classification. A wealth of these data is currently generated with the rising availability of whole genome sequencing projects. As of July 2007, more than 19 different IS families are established based on over 1800 bacterial and archaeal IS sequences [1, 13–16]. An IS family can be defined as a collection of elements sharing the same catalytic site structure (with conserved spacers between key residues), an identical genetic organization (e.g. frameshifting in transposase gene), similar arrangements of their ends and uniform target site fates upon insertion. However, not all families are so coherent. This is why some of them (like families IS 4 and IS 5) are divided into subgroups being composed of a core of closely related elements that can be linked to other members of the family by weaker but still significant similarities.
The IS 4 family, like most IS families described so far, contains elements mobilized by DDE transposases performing a 'cut-and-paste' mechanism. The main and almost only hallmarks of this family's transposases are (i) absence of frameshift in the transposase gene and (ii) a Y-(2)-R-(3)-E-(6)-(K) signature (YREK) in region C1 where the glutamate residue is that of the DDE motif [1, 17]. Only very few IS 4 elements have been studied in detail. IS 231A has been shown to transpose in vivo by a 'cut-and-paste' mechanism, both in its natural host, Bacillus thuringiensis, and in Escherichia coli; and it displays a certain degree of insertion specificity. Both IS 10R and IS 50R are part of composite transposons (Tn 10 [22, 23] and Tn 5 [24, 25], respectively) and are the only members for which in vitro systems have been set up. The latter has allowed extensive acquisition of genetic, biochemical, mechanistic and regulation data for these elements. Also, the Tn 5 transposase is the only IS 4 family transposase for which X-ray crystallographic structure data are available [9, 26].
Since the initial definition of family IS 4 (based on about 45 elements), other elements were progressively added via ISfinder, the prokaryotic IS database. However, some of them displayed distant resemblance both to existing IS 4 members and to members of other groups such as the IS 5 family or ISNCY (IS Not Classified Yet). Some elements even lacked the above-mentioned distinguishing protein motifs. It was therefore necessary to perform a systematic screen for related elements in order to gain a more rational view of the organization of the IS 4 family.
Here we report an extensive in silico search for IS elements related to family IS 4 among more than 500 complete bacterial and archaeal genomes. A total of 227 putative intact IS 4 -related elements were collected and permitted a detailed update of the IS 4 catalogue, together with the description of novel emerging IS families. This allowed the evaluation of their distribution and impact among major prokaryotic phyla. Finally, known transposition mechanisms could be discussed in light of novel primary sequence data.
Family IS 4 included 153 distinct intact elements. The main hallmarks were the presence of D(60~110)D(100~150)E and Y(2)R(3)E(6)K motifs, single orf encoded transposases and target site duplication lengths corresponding approximately to one DNA helix turn. Considerable diversity was observed at the level of transposase regions (sequence and length) outside catalytic residues, DNA end signatures and target site specificity. The following description of established subgroups will summarize this diversity.
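The two motifs quoted above can be expressed as a single pattern for a first-pass scan of a candidate transposase. The following is only a minimal sketch: the function name `find_catalytic_signature` is hypothetical, the spacer numbers are read as counts of residues lying strictly between the key residues (an assumption), and the glutamate of the DDE triad is taken to be the E of the YREK signature, as stated for region C1. A regular expression of this kind is a permissive filter and does not replace the alignment-based analysis described in this study.

```python
import re

# D(60~110)D(100~150)E combined with Y(2)R(3)E(6)K.
# Assumption: spacer values count the residues strictly between key residues.
# The E of the DDE triad is the E of YREK, and Y sits 7 positions before that E,
# so the stretch between the second D and the Y is 100-7 .. 150-7 residues long.
SIGNATURE = re.compile(
    r"(?P<d1>D).{60,110}(?P<d2>D).{93,143}(?P<y>Y).{2}R.{3}(?P<e>E).{6}K"
)

def find_catalytic_signature(protein):
    """Return 0-based positions of the candidate D, D and E residues, or None."""
    match = SIGNATURE.search(protein.upper())
    if match is None:
        return None
    return {
        "D1": match.start("d1"),
        "D2": match.start("d2"),
        "E": match.start("e"),
    }

# Synthetic example sequence (not a real transposase).
example = "M" + "D" + "A" * 80 + "D" + "A" * 113 + "YAARAAAEAAAAAAK" + "A" * 5
print(find_catalytic_signature(example))
```

Because the pattern returns the leftmost match only, a real sequence rich in acidic residues may yield a candidate triad that differs from the one supported by a multiple alignment; hits should therefore be treated as candidates to inspect.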
Table 1. Major features of IS 4 subgroups and newly defined families

Typical size range (bp)    Direct repeat length (bp)
1450 – 5400                10 – 12
1400 – 1800
1150 – 1750                8 – 10
1400 – 1650                10 – 13
1200 – 1350
1350 – 1550                8 – 9
1500 – 1600                7 – 8
1400 – 1550
1225 – 1500                4 – 5
1500 – 2000                5 – 6
Subgroup IS H8 was initially included into subgroup IS 4Sa. Yet, our results showed a closer proximity to subgroup IS 231; not only at the transposase level, but also due to DNA extremity conservations (both displayed conserved 5'-CAT ends, Figure 1B and Additional file 2). IS H8 elements were initially discovered in archaea but recent genome projects uncovered elements of this subgroup in proteobacteria. Examination of IS H8 insertions did not reveal apparent target site specificity, contrary to the one observed for IS 231 elements (Figure 3).
This study provided the first thorough description of subgroup IS 4Sa, which was originally mentioned in an earlier reference. Together with IS 4Sa -like elements appeared a closely related subgroup including element IS 4, which provided the name for this second ensemble. Subgroups IS 4Sa and IS 4 were relatively close at the protein level and the major argument for splitting them in two was a different organization of TIRs (Figure 1B and Additional files 3 & 4). Extremities in subgroups IS 4Sa and IS 4 displayed relaxed conservation of the third and first nucleotide, respectively, which is rather new but not unique in family IS 4 (see below). Some elements of subgroup IS 4 displayed spectacular apparent insertion specificity in left and right target arms, a conservation that was also observed for IS Rm22 from subgroup IS 4Sa (Figure 3). Both subgroups were quite diverse and broadly distributed in bacteria. Rare occurrences were recorded in archaea (IS Fac10) and algae viruses (ISv EsV1_1, see Figure 4 and Additional files 3 and 4).
Subgroups IS 10 and IS 50 were the only ones that did not cluster together with subgroups IS 231, IS H8, IS 4Sa, IS 4 and IS Pepr1 during Tribe-MCL analysis (results not shown), indicating more distant relationships with these subgroups. This was also observed at the level of DNA ends since they systematically displayed 5'-NT extremities instead of 5'-NA. Yet, they displayed key residue conservation (Figure 2) and target site duplication lengths comparable to typical IS 4 elements (Figures 1 & 3). So far, the majority of both subgroups occurred among proteobacteria.
Subgroup IS Pepr1 was also newly established here and is almost exclusively composed of novel elements. Although it is composed of a limited repertoire, it already forms a consistent subgroup with apparent insertion preference for AT-rich sequences (Figure 3) and, like elements of subgroups IS 10 and IS 50, it preferentially displays 5'-CT ends (Additional files 5, 6 and 7), in contrast to 5'-CA extremities encountered in the other elements from family IS 4.
Members of family IS 701 were already considered distantly related to IS 4 in an earlier reference and this was confirmed here. Only four of the 27 considered IS 701 elements displayed the tyrosine of the YREK motif (results not shown, see Figure 2); and almost all elements displayed a highly conserved target site duplication of exactly four bp (results not shown, see Figure 3). The diversity emanating from IS 701 transposases and TIRs allowed identification of three distinct clusters (Additional file 8), which announced a possible division into subgroups. As indicated by their prokaryotic distribution, family IS 701 seemed to be rather widespread (Figure 4).
Family IS H3 was a small group so far restricted to archaea (Additional file 9). Half of their transposases lacked the lysine residue of the YREK motif while all (except IS Fac10) displayed a Y-(2)-R-(3)-E-(3)-(R) motif. DRs flanking IS H3 -like insertions are typically five bp long and generally flanked by A at one end and T at the other end.
Family IS 1634 was initially named IS 1549. This new designation should prevent confusion with an emerging group called IS 1595. Transposases were among the largest due to relatively long N3-C1 spacers, which were sometimes twice as long as those from family IS 4. Only five out of 32 displayed the arginine residue of the YREK motif (results not shown, see Figure 2). This large group could be divided into three distinct clusters according to transposases (Additional file 10). Target site duplications were five to six bp AT-rich tracts while DNA ends were only poorly conserved. They were remarkably diverse in archaea and relatively widespread in bacteria (Figure 4).
The distribution of individual elements among the different bacterial and archaeal phyla is shown in Figure 4. As indicated by the histogram in Figure 4B, current sequencing projects focus unequally on different prokaryotic phyla. The distribution presented in Figure 4A partially reflected this sequencing bias, which is why most gathered elements were found in proteobacteria. Globally, families IS 4, IS 701, IS H3 and IS 1634 were detected in eight, seven, two and seven phyla, respectively.
Figure 4A shows that the distribution of individual IS families or subgroups as defined in this study could either be broad (as for subgroups IS 4, IS 4Sa and family IS 701) or concentrated in a particular phylum (as for subgroup IS 231 and family IS H3). Interestingly, two copies of an element belonging to subgroup IS 4 were found in the genome of the brown algae virus, Ectocarpus siliculosus virus 1. So far, no IS 4 -related element was found in eukaryotes.
The availability of a large number of completely sequenced genomes on one hand, and the extensive manual characterization of an IS family on the other hand, allowed a direct evaluation of global presence of its putatively functional members. Family IS 4 displayed a sporadic distribution among 17% of explored genomes. Moreover, the behavior of these ISs was assessed by determining their genomic occurrence. This showed that numerous elements exhibited single or low copy numbers, which was in line with the assumption that transposable elements need to compromise between transposition for self-maintenance and limited insertions to keep host genome integrity. This distribution also displayed a certain, less pronounced, prevalence for copy numbers between 6 and 8, although the biological relevance of this observation remains unclear.
While the global distribution of IS elements was shown to be rather sporadic, important expansion events were observed in a few pathogens as well as in some free-living extremophiles. One of these pathogens is a member of the facultative intracellular Mycoplasma, of which eleven genomes are available. Sixty-five intact copies of family IS 1634 elements were found together with 17 intact copies of the IS 3 family in M. mycoides. The presence of family IS 1634 was also detected in M. synoviae, M. agalactiae [GenBank:CU179680] and the three strains of M. hyopneumoniae [31, 32]. Only M. genitalium, M. pneumoniae and M. mobile were devoid of ISs. Interestingly, the Mycoplasma species lacking ISs were also those displaying the smallest genomes (between 0.58 and 0.82 Mb), while the IS-rich M. mycoides had a relatively large genome (1.2 Mb). A positive correlation between the total number of genes and the amount of mobile DNA in a genome has been reported for other facultative intracellular bacteria, and a recent study has extended this observation to other prokaryotes. However, this relationship must be seen as a trend with notable exceptions (e.g. the large genome of M. penetrans – 1.36 Mb – has only 21 transposase genes).
Family IS 701 was represented by 70 and 72 elements in the genomes of two sequenced pathovars of the rice pathogen X. oryzae [39, 40]. This family was completely absent in phylogenetically related species with available genome sequences, including X. campestris (black rot in crucifers) and X. axonopodis (citrus canker) [39–43]. Additionally, a moderate presence of family IS 4 was observed in all but one of the six Xanthomonas genomes (X. campestris pv. vesicatoria), together with other, more expanded families. So far, all sequenced xanthomonads are IS-rich, which is supposed to provide the necessary flexibility for adaptation to different plant hosts. Interestingly, each strain is characterized by the expansion of one specific IS family. The IS 4 family was also observed to be relatively expanded in sequenced Shigella spp. where other families, both with higher and lower expansion levels, were also detected [44–47]. These mobile elements are supposed to contribute to the emergence of variable epidemiological and pathological features among these phylogenetically close pathogens.
These observations are reminiscent of IS expansions observed together with emergence of pathogenicity in other host-associated pathogens, such as Bordetella pertussis (whooping cough), Yersinia pestis (plague) and Francisella tularensis (tularemia).
A different relationship was observed in the family Vibrionaceae. The expansion of family IS 4 in the non-pathogenic Photobacterium was striking (Figure 5) as compared to those observed in the pathogenic Vibrionaceae, such as V. cholerae, V. parahaemolyticus and V. vulnificus, that displayed relatively modest IS contents. Therefore, ISs apparently have a quite different role in Vibrionaceae evolution, where some IS-rich species are non-pathogenic piezophiles. Other important IS expansions among free-living extremophiles were observed in one out of three Sulfolobus spp. [55–57], where family IS H3 represents a significant fraction of global IS content, as well as in Halobacterium, which harbors numerous copies of IS 4 and IS H3 elements together with other families.
Analysis of the distribution of individual elements among prokaryotes and among established families and subgroups was also performed regardless of copy numbers (Figure 4). Prokaryotic IS hosts were from world-wide origins and covered a large range of lifestyles and habitats, from free-living environmental germs and extremophiles to host-associated symbionts and pathogens. Evaluation of emergence on evolutionary timescales, as well as estimation of the extent of horizontal gene transfer (HGT) prompted by this distribution, both remain problematic because of low identity levels among transposases. However, many IS 4 -related elements were found on putative mobile DNA, including various plasmids (Figure 5), bacteriophages (results not shown) and one eukaryotic virus (Figure 4). Additionally, most IS 4 -related subgroups and emerging families were scattered over several phyla (Additional files 1 – 10), suggesting an involvement of HGT in the dissemination of these elements.
Reports on prokaryotic IS distribution have shown that IS families are not clade-specific and our data regarding family IS 4 were in agreement with this assumption. However, the distribution of IS 4 subgroups and emerging families showed that the number of elements of some IS groups was not systematically proportional to the extent of genome sequencing (Figure 4). Subgroup IS 231 (family IS 4) was almost exclusively restricted to Firmicutes and 26 of its 32 elements were found among bacteria of the B. cereus s.l. group. Likewise, family IS 1634, and to a lesser extent family IS H3 and subgroup IS H8, were preferentially found among Euryarchaeota. These observations may partially be due to insufficient genome data, but they do not rule out the possibility of existing clade specificity at the level of IS subgroups. Preferential IS occurrences among certain prokaryotic phyla or genera can be the consequence of isolated niches reducing or limiting horizontal transfer of hosted ISs. In this case, a series of extremophilic archaea were almost exclusive hosts of subgroup IS H8 and family IS H3; their preference for harsh ecological niches may represent a frontier for HGT with other phyla or with bacteria. Another possible explanation can be based on IS-associated features. It is possible that the presence of some ISs cannot be tolerated by certain hosts. Uncontrollable transposition behaviors, lack of target site specificity, and preferred insertions into vital genes or regulatory regions can limit the host compatibility of some ISs for which a viable equilibrium can be found only in rare 'IS – host' combinations.
The co-crystal structure of the IS 50R transposase binding the ends of the Tn 5 transposon generated function assignments for numerous residues of this protein. A comparative sequence analysis based on 19 transposases was performed and compared with a previous report on seven transposases from subgroup IS 50 by Reznikoff et al. in 2004. This confirmed the previous function assignment of conserved residues since most identified domains were still conserved among this subgroup, as were a series of residues with unknown function (further results available in Additional file 11).
The primary sequence data concerning the transposase motifs (Figure 2) and target site duplications (Figure 3) suggested divergent biochemical behaviors between IS 4 transposases and those from the emerging families. The two 3'-OH transposon ends are known to attack the target helix at opposite sites in the case of IS 4. For the emerging families, the length of target site duplications (5 bp) suggested a different target strand cleavage path: owing to the structure of B-DNA, this attack needs to come from the same helix side. It is worth noting that in the case of family IS 1634, two elements were reported to display atypical target site duplication lengths. IS 1549 and IS 1634, from M. smegmatis and M. mycoides, respectively, have both been shown to produce long, variable-length DRs upon insertion [60, 61]. They had lengths between 8 and 514 bp, depending on the copy. So far, no mechanistic model has been proposed for the generation of this variable target duplication size.
This study consisted of a thorough and systematic screen for IS 4 -related elements among available genome sequence data. It allowed a considerable improvement regarding the description and definition of family IS 4, as well as the establishment of new IS families which were, until now, assimilated to the IS 4 ensemble (Table 1). Thus, 22 distinct IS families are now officially established. Novel primary sequence descriptions were unraveled for reported families, which will allow easier identification of other related elements as they are uncovered by ongoing genome sequencing projects. This work also established a link between genomic and functional data regarding reaction mechanisms, which underlined the importance of both approaches for a more complete understanding of transposition biology.
The extent of genomic impact of single IS families showed that IS 4 and its related families are far from being ubiquitous among prokaryotes. This limited IS distribution applies to most known IS families as reported by a recent automated survey of bacterial ISs. Significant genome-wide expansions were observed only in a few host-associated pathogens and certain free-living extremophiles, suggesting that particular ISs could have been, at least partially, implicated in the emergence or evolution of these particular lifestyles. Yet, the reasons explaining these sporadic IS demographic explosions remain to be uncovered.
Our interpretation of these results is that the IS families described here, and probably ISs in general, represent an evolutionary tool available among several. In order to provide the necessary genome flexibility for adaptation to new or variable environments, evolution seems to select this tool in some cases. In contrast to a purely 'selfish' or 'parasitic' perception of mobile DNA, this view supports the idea that prokaryotes, and maybe life in general, may also make use of mobile DNA for its own benefits, rather than being constantly invaded by it in an uncontrollable manner.
This search covered more than 500 complete and partial bacterial genomes. Only apparently full-length transposases were retained to avoid inclusion of partially deleted and thus inactivated transposases which can accumulate mutations of functionally important residues.
When we began our search for novel IS 4 -like elements, the ISfinder database contained about 70 ISs designated as 'member of family IS 4 '. First, representative elements (about twenty) covering most of the sequence diversity of this family were selected. The primary transposase sequence of each of these representatives was then used in a BLASTP search, either among microbial genomes only, or against all organisms. Since IS 4 elements exhibit short TIRs and generate target site duplications upon insertion (DRs), the flanking DNA of resulting hits was checked for the presence of these repeats. This, together with DNA extremity comparison of various elements, allowed the determination of both ends of the collected elements. New ISs were checked for the existence of formerly registered iso-forms, which we defined as elements with less than 2% divergence in the amino acid sequence of their putative transposases and/or less than 5% difference in their DNA sequences. ISs were submitted to the ISfinder database, which provided new names according to the current IS nomenclature.
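The iso-form criterion stated above (less than 2% amino acid divergence and/or less than 5% DNA difference) can be illustrated with a small helper. This is a sketch under stated assumptions rather than the procedure actually used: the function names `percent_difference` and `is_isoform` are hypothetical, the two sequences are assumed to be already aligned to equal length with '-' as the gap character, and a gap facing a residue is counted as a difference (the source does not say how gaps were treated).

```python
def percent_difference(aligned_a, aligned_b):
    """Percentage of differing columns between two pre-aligned sequences.

    Assumption: columns where both sequences carry a gap are ignored,
    and a gap opposite a residue counts as one difference.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("no comparable columns")
    mismatches = sum(1 for a, b in columns if a != b)
    return 100.0 * mismatches / len(columns)

def is_isoform(protein_pair=None, dna_pair=None,
               protein_cutoff=2.0, dna_cutoff=5.0):
    """Apply the 'and/or' rule: either criterion alone is sufficient."""
    checks = []
    if protein_pair is not None:
        checks.append(percent_difference(*protein_pair) < protein_cutoff)
    if dna_pair is not None:
        checks.append(percent_difference(*dna_pair) < dna_cutoff)
    return any(checks)

# Toy DNA pair differing at 3 of 100 positions (3% difference).
print(is_isoform(dna_pair=("A" * 100, "A" * 97 + "CCC")))
```

Reading "and/or" as "either criterion suffices" is itself an interpretation; a stricter reading would require both cutoffs to be met whenever both sequences are available.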
BLASTP searches were performed on the NCBI BLAST online interface without the low-complexity filter and with otherwise default parameters. Each transposase sequence retrieved a series of protein hits which were possible transposase candidates. The DNA encoding these candidates was downloaded together with 1000 bp up- and downstream regions. These DNA sequences were then verified for the presence of TIRs flanking the transposase candidate genes using BLASTN and the PALINDROME algorithm of the wEMBOSS package at the Belgian EMBnet Node. If TIRs were present, flanking DRs were looked for by eye.
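The two checks described above, terminal inverted repeats at the element ends and direct repeats in the flanking DNA, can be sketched in a few lines. The code below is a simplified stand-in and not the BLASTN or PALINDROME procedure used in this study: the function names, the mismatch tolerance and the length limits are all hypothetical choices made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(dna):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return dna.upper().translate(_COMPLEMENT)[::-1]

def tir_length(element, max_len=50, max_mismatches=2):
    """Length of the terminal inverted repeat of an element.

    The left end is compared base by base with the reverse complement of
    the right end; scanning stops once more than max_mismatches
    mismatches have accumulated.
    """
    seq = element.upper()
    rc = reverse_complement(seq)
    best = 0
    mismatches = 0
    for i in range(min(max_len, len(seq) // 2)):
        if seq[i] != rc[i]:
            mismatches += 1
            if mismatches > max_mismatches:
                break
        else:
            best = i + 1  # repeat extends at least to this matching base
    return best

def direct_repeat_length(upstream, downstream, max_len=15):
    """Longest k for which the last k bases upstream of the element
    equal the first k bases downstream of it (candidate DR)."""
    up, down = upstream.upper(), downstream.upper()
    for k in range(min(max_len, len(up), len(down)), 0, -1):
        if up[-k:] == down[:k]:
            return k
    return 0

# Toy element: 10 bp perfect TIRs around an unrelated core.
left = "CATCGGAAGT"
element = left + "T" * 20 + reverse_complement(left)
print(tir_length(element), direct_repeat_length("GGGTACGATTGCA", "TACGATTGCACCC"))
```

Short chance matches are common in flanking DNA, so a result of 1 or 2 bp from `direct_repeat_length` should not be read as evidence of a target site duplication.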
The following multiple alignment algorithms were then evaluated with the resulting ISs for their performance to accurately align the catalytic D, D and E residues of transposases: Clustal W, Dialign, Parallel PRRN, Muscle, T-Coffee and M-Coffee. Global alignment of the 227 transposases was made by merging existing M-Coffee alignments with Clustal W. The following order was applied where pre-existing alignments are represented by subgroup or family designations flanked by brackets and merging procedures by + symbols: (((IS 231)+(IS H8)+(IS 4Sa)+(IS 4))+(IS 10, IS 50, IS Pepr1))+((IS 701, IS H3)+(IS 1634)). Examination of resulting alignments and subgroup-specific dendrogram construction (UPGMA, BLOSUM62) were performed using the Jalview alignment editor. Dendrograms were drawn with TreeView.
In order to facilitate visualization of transposase clusters, the TRIBE-MCL clustering algorithm was applied to the complete set of transposases with inflation option (-I) set to 1.2 and default values for other parameters. These settings corresponded to those used by the ISfinder and ACLAME databases.
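To make the role of the inflation parameter more concrete, the core loop of Markov clustering (alternating expansion and inflation of a column-stochastic matrix) can be sketched as follows. This is a toy re-implementation for small dense matrices and is not the TRIBE-MCL program used in this study; the function names, the convergence tolerance and the input format (a symmetric similarity matrix, whereas TRIBE-MCL derives its weights from BLAST scores) are assumptions made for illustration.

```python
def _normalize_columns(matrix):
    """Rescale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]

def markov_cluster(weights, inflation=1.2, max_iter=200, tol=1e-9):
    """Cluster nodes of a weighted graph given as a square similarity matrix."""
    n = len(weights)
    # Self-loops are added so that no column is empty.
    m = [[float(weights[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = _normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: one matrix squaring (random walks of doubled length).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power followed by column renormalization.
        inflated = _normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Rows with a non-zero diagonal act as attractors; each attractor row
    # lists the members of its cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > 1e-6:
            clusters.add(tuple(j for j in range(n) if m[i][j] > 1e-6))
    return sorted(list(cluster) for cluster in clusters)

# Two unconnected triangles of mutually similar sequences.
toy = [[1 if (i != j and i // 3 == j // 3) else 0 for j in range(6)]
       for i in range(6)]
print(markov_cluster(toy))
```

A low inflation value such as 1.2 yields coarse clusters, which fits the stated goal of visualizing broad transposase groupings; higher values split the graph more finely.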
A total of 587 chromosomes and 363 plasmids, representing 540 individual completely sequenced bacterial and archaeal genomes, were screened for the presence of IS 4, IS 701, IS H3 and IS 1634 family DNA. To this end, the 950 molecules were used as input in BLASTN searches against a homemade database containing the 227 DNA sequences encompassing these IS families.
We would like to thank Mick Chandler for critically revising this manuscript and for his assistance in finding cozy accommodations in Toulouse. We are also grateful to Bénédicte Laverne for her contribution to the retrieval of useful bioinformatic software and to Bernard Hallet for some fruitful and inspiring discussions. This work was supported by grants from the National Fund for Scientific Research (FNRS, Belgium) as well as the Université catholique de Louvain (UCL).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.