The origin of a novel gene through overprinting in Escherichia coli

Background Overlapped genes originate by a) loss of a stop codon among contiguous genes coded in different frames; b) shift to an upstream initiation codon of one of the contiguous genes; or c) by overprinting, whereby a novel open reading frame originates through point mutation inside an existing gene. Although overlapped genes are common in viruses, it is not clear whether overprinting has led to new genes in prokaryotes. Results Here we report the origin of a new gene through overprinting in Escherichia coli K12. The htgA gene coding for a positive regulator of the sigma 32 heat shock promoter arose by point mutation in a 123/213 phase within an open reading frame (yaaW) of unknown function, most likely in the lineage leading to E. coli and Shigella sp. Further, we show that yaaW sequences coding for htgA genes have a slower evolutionary rate than those lacking an overlapped htgA gene. Conclusion While overprinting has been shown to be rather frequent in the evolution of new genes in viruses, our results suggest that this mechanism has also contributed to the origin of a novel gene in a prokaryote. We propose the term janolog (from Jano, the two-faced Roman god) to describe the homology relationship that holds between two genes when one originated through overprinting of the other. One cannot dismiss the possibility that at least a small fraction of the large number of novel ORPhan genes detected in pan-genome and metagenomic studies arose by overprinting.


Conclusion:
While overprinting has been shown to be rather frequent in the evolution of new genes in viruses, our results suggest that this mechanism has also contributed to the origin of a novel gene in a prokaryote. We propose the term janolog (from Jano, the two-faced Roman god) to describe the homology relationship that holds between two genes when one originated through overprinting of the other. One cannot dismiss the possibility that at least a small fraction of the large number of novel ORPhan genes detected in pan-genome and metagenomic studies arose by overprinting.

Background
The origin of novelty is one of the central questions in evolutionary biology. Paralogy, orthology, and xenology, as well as domain shuffling between genes, account for a large part of the evolution of protein families at the sequence level [1]. All these mechanisms require the previous existence of coding sequences. Whether novel sequences in cells could arise from spontaneous point mutation remains an open question.
In some cases, two or more genes are known to be encoded in the same DNA region [2]. Overlapping genes have been known since the beginning of virus complete genome sequencing, as exemplified by the gene B of the single stranded DNA bacteriophage ϕX174, which is completely contained within gene A [3]. It has been suggested that overlapping genes originate by spontaneous point mutation due to a mechanism known as overprinting [4]. Overlapping genes could arise by: a) extension of one ORF into other by loss of a stop codon; b) shift to an upstream initiation codon in adjacent genes; or by c) generation of a totally new ORF inside a previous existing ORF by point mutation. The different ways in which two overlapping genes can be coded are known as phases [5], and the different phases determine, in turn, the evolutionary pressure that overlapping genes exert on each other [6,7].
In has been shown that overprinting is an important mechanism for the origin of new genes in viruses [4,8]. Overlapped genes are also a relative common feature of prokaryotic genomes. For instance, it has been suggested that in prokaryotes, where approximately 80% of the overlaps comprise less than 30 base pairs, the phase of coding and their distribution, among other characteristics of overlapping genes are most consistent with the hypothesis that overlaps participate in the regulation of gene expression [9]. A comparative study among the genomes of Mycoplasma genitalium and Mycoplasma pneumoniae found that most overlapped genes were generated primarily due to the loss of a stop codon, the absence of which resulted in elongation of the 3' end of the gene's coding region [10]. Overlapped genes can also be classified according to their direction of transcription into three categories, namely "convergent" (-> <-), "unidirectional" (-> ->) and "divergent" (<-->). In a study that included the analysis of complete genome sequences from 50 bacterial species [11] it was shown that most overlapped genes are coded following the unidirectional pattern. This pattern was confirmed in a comparative study among nine bacteria [12]. Overlapped genes have also been found in mitochondria. For instance, a small protein [UniProtKB: Q6EMS7] of 66 amino acids (gene A6L) overlaps 40 bases into the gene coding for ATPase-6 in bovine mitochondria [13]. Because the formation of overlapping genes necessarily involves the evolution of a coding region from noncoding DNA or from a different frame in coding DNA, their study might help to understand de novo evolution of coding regions [5]. The pattern of natural selection among 71 pairs of 3' overlapping genes (all of them having an overlap larger than 15 nucleotides) conserved at least in two prokaryotic genomes showed a statistically significant bias toward the 123/132 phase, thus ensuring the least mutual constraint on non-conservative amino acid replacements in both overlapping coding sequences [5].
Here, we show that the mechanism of overprinting has also contributed to the origin of a new gene of identifiable function in Escherichia coli.

Results
We have identified a possible case of evolution of a new gene and a new function by overprinting in E. coli K12. The overlapping genes are the gene for the positive regulator for sigma 32 heat shock promoter (htgA) [Uni-ProtKB:P28697] and the gene yaaW [UniProtKB:P75617], whose hypothetical protein product has been classified as an Unidentified Protein Family 0174 in Pfam database [Pfam:UPF0174] (Figure 1). The genes are coded in the 123/213 phase (htgA in the (+) strand, and yaaW in the (-) strand), meaning that the third nucleotide of the (+) codon overlaps with the 3 nucleotide of the (-) codon. The overlap comprises the 591 nucleotides of the full htgA gene.
Although NCBI gene annotation tables provide coordinates for the position of genes in genomes, caution should be taken, since over-annotation and sequencing errors might mislead the identification of true genes. While there is in vitro experimental evidence for the existence of htgA [14], this does not hold true for yaaW. Therefore, we have searched for indirect evidence for its existence by asking whether there are: a) yaaW homologs in other genomes; b) domain fusions of homologs to the yaaW protein product with other protein domains; and c) detectable biases in the ratio of synonymous (ps) versus non-synonymous (pn) substitutions, i.e., departures from neutrality (ps ≠ pn) among yaaW sequences that would suggest the action of natural selection.

Phylogenetic distribution of htgA and yaaW genes
According to NCBI gene annotation tables, htgA is present in three E. coli and one Shigella flexneri strains. Furthermore, based on i) sequence similarity; ii) ORF prediction by ORF Finder; and iii) lowered rates of sequence evolution of yaaW sequences containing overlapped htgA genes (see below), we suggest that homologs to htgA are also present in all the other E. coli and Shigella sp. genomes analyzed here, adding up to a total of nine htgA sequences Overlapped genes in E.coli coli. Numbers indicate nucleotide residue for yaaW gene. The initiation codon of hgtA has been predicted to be in nucleotide 632 [14], and more recently in 527 [23]. A gray box indicates the region where we have detected a lowering in the rate of evolution of yaaW sequences with an htgA overlapped gene.
( Figure 2). On the other hand, yaaW has a wider phylogenetic distribution. This gene is present in 16 γ-proteobacteria and three ε-proteobacteria. yaaW can be identified as a highly divergent sequence in Nostoc sp PCC7120 a filamentous cyanobacteria and in a fusobacteria. In contrast with htgA, which is present as a single copy, some of the genomes analyzed here have paralogous copies of yaaW homologs. There are two genomes endowed with more than one copy of yaaW; Helicobacter pylori J99 has two copies, while H. pylori 26695 has three copies. In total, we detected 24 homologs of yaaW. Since there are three pairs of identical sequences at the nucleotide level (each pair is coded on different strains from the same specie), there are only 21 UniProtKB codes ( Figure 2). Multiple sequence alignment of yaaW protein coding genes is shown in Additional file 1. The wider phylogenetic distribution of yaaW suggests that htgA originated by overprinting in an ancestral copy of extant yaaW sequences (Complete names of proteins are given in Additional file 2).
We detected no homologs among viruses to the protein product of htgA in the non-redundant database. Besides a small similarity of the DnaC protein [Uni-ProtKB:Q9AZV4] from Lactococcus phage bIL286 (41 Phylogenetic distribution of yaaW and htgA overlapped genes amino acids identities along 152 residues, and BLAST evalue of 1.7) to the yaaW protein product. This appears to be a false positive since similarity was not confirmed by Pfam database. These negative results strongly suggest, but do not prove, that yaaW and htgA did not originate in a viral genome.

Protein domain identification of yaaW and htgA coded proteins
The yaaW coded proteins match to [Pfam:UPF0174] profile in Pfam database, the only exception is the hypothetical protein alr3689 [UniProtKB:Q8YQX2] from Nostoc sp. PCC 7120 that exhibits a marginal HMMER E-value of 0.048, comprising 156 amino acids out of 185 (0.01 is considered significant in Pfam). However, a BLAST search of E. coli yaaW coded protein sequence [Uni-ProtKB:P75617] to our database of proteins from complete genomes matches the cyanobacterial protein alr3689 sequence with a significant e-value of 2e-05 BLAST to a 86 amino acid stretch of the protein. The protein alr3689 seems to be a highly divergent homolog of yaaW having the UPF0174 domain. yaaW has fused to a GTPase domain [Pfam:MMR_HSR1] in Shewanella sp. and Campylobacter coli, which has a wide phylogenetic distribution but unknown function. yaaW is also fused to a fraction of this domain in the γ-proteobacteria Idiomarina loihiensis ( Figure 2). This GTPase domain is fused in other proteins with a domain of unknown function: [Pfam:DUF933] related to the ubiquitin and to another GTPase domain [Pfam:GTP1_OBG] which have been shown to be important in normal cell metabolism in Schizosaccharomyces pombe [15]. The fusion of some yaaW homologs to a GTPase domain suggests that yaaW codes for a functional protein. On the other hand, there is no available Pfam domain for the htgA gene at the time being.

Phylogenetic analysis
With the exception of the sequence from Nostoc sp. that branches among the proteobacteria, the phylogeny of yaaW coded proteins conforms approximately to canonical rRNA-based tree ( Figure 3).
Phylogenetic analysis of yaaW protein coded sequences Figure 3 Phylogenetic analysis of yaaW protein coded sequences. Minimum Evolution tree of yaaW protein coding sequences (number in branches denotes 1000 bootstraps). A gray vertical bar indicates the branch where htgA overlapped gene hypothetically originated. A star indicates the placement of the root according to the midpoint root method. A, B and O denotes the groups of sequences used to detect the changes in rate evolution of yaaW sequences having and overlapped htgA gene. Node C is a hypothetical ancestral sequence. Darker branches indicate those sequences having the larger number of statistically significant relative rate tests when used as outgroups. Numbers following the names of proteins indicate the following: number of statistically significant comparisons/out of total number of comparisons, level of significance (number of amino acids involved in the relative rate taste). NS stands for: non-significant.

Synonymous versus non-synonymous substitutions
The relationship between synonymous (ps) versus nonsynonymous (pn) substitutions among all pairs of yaaW sequences is shown in Figure 4a. The dashed oval indicates comparisons between group A and B sequences (as defined in Figure 3). In figure 4b, the values of pn and ps are plotted against Kimura distances for each pair of yaaW sequences. As shown, the excess of synonymous versus non-synonymous substitutions is suggestive of purifying selection. This excess is especially important between sequences belonging to group A and group B (Figure 4c), indicating a higher degree of restriction to non-synonymous changes imposed by htgA on yaaW sequences. Purifying selection is statistically significant for all yaaW sequences ( Table 1).
The pattern of synonymous and non-synonymous substitutions among htgA sequences also suggests purifying selection (Table 1). However this pattern could be the reflection of the accumulation of purifying mutations among yaaW genes and the 123/213 phase of the overlap. Therefore, synonymous and non-synonymous mutations in yaaW will affect the similarly to htgA. This is an unexpected pattern for a novel sequence since it has been shown that new genes originating through overprinting in viruses show an excess of non-synonymous substitutions, indicating the action of positive natural selection [8].
Lower rate of evolution of yaaW sequences with overlapped htgA genes DNA sequences coding for overlapped genes are expected to evolve at a lower rate than those DNA sequences coding for only one gene [6,7]. Accordingly, we have analyzed the substitution rates of closely related yaaW genes with and without the overlapped htgA using a non-parametric relative rate test [16] for all combinations of three sequences consisting of an out-group (O) sequence, and two ingroup (A and B) sequences (see methods and Figure 3).
In-group yaaW sequences lacking the overlap (B sequences in Figure 3) have accumulated more exclusive mutations (m2 changes in Figure 5) in the first 409 nucleotides than those in-group yaaW genes endowed with the overlap (A sequences in Figure 3 and m1 changes in Figure  5). This suggests that htgA exerts an evolutionary pressure to yaaW in their first 409 nucleotides. Accordingly, we have subdivided the yaaW alignment in two sections. The first one comprises nucleotides 1 to 408, while the second one includes nucleotides 409 to 714. We have then applied the Tajima test [16] to both sections independently. As seen in Figure 6, many of the differences are significant at α = 0.05 for the first 408 nucleotides, and for some comparisons even at the α = 0.01 level. This is particularly true for the genes encoding for [UniProtKB: O26107] and [UniProtKB: Q9ZJ24] protein sequences (they also align best with A and B sequences). However, not all comparisons give statistically significant results. It is likely that signal erosion in sequences having experienced more substitutions may explain in part lack of statistically significant results in some relative rate tests, since there seems to be a tendency of lower Chi-square values towards increasing genetic distance ( Figure 6).

Discussion
Ever since Darwin, homology in biology has been explained by common descent. With the advent of the molecular structure of nucleic acids and proteins, the concept of homology was further refined to describe more precisely the phylogenetic relationships among genes (reviewed in [17]). Accordingly, genes separated by speciation events are referred as orthologs, genes duplicated inside the same genome are named as paralogs, and horizontally transferred genes are known as xenologs. Further on, in symbiogenesis lineages of homologous genes previously separated by lineage divergence, became associated in a single cellular system. The term synology has been proposed to describe the phylogenetic relationship among such genes [18]. However, no terminology has been employed to describe the phylogenetic relationship  between homologous overlapping genes. We suggest the term janology (from the two-faced Roman deity Jano) for the homology relationship among the sequences sharing the same DNA.
Other cases of novel genes have been described (i.e., Jingwey, [19]; Sphinx, [20];Morpheus [21]) and the origin of novel sequences through overprinting in viruses is well documented [4,8]. However, to the best of our knowledge, this is the first time that a novel gene originated through overprinting has been carefully described in a cell.
Caution should be taken since the existence of the htgA gene remains controversial. Although the gene has been cloned and its product originally characterized as a positive regulator of sigma 32 heat shock promoter [14], a recent study shows that this gene is not induced by sigma 32 as previously suggested [22]. Moreover, the initiation codon of htgA have recently being changed [23] and one of the predicted copies of htgA ( Figure 2) has a smaller size due to a non-sense mutation. Given that there has not been more experimental work carried on htgA, we rely on indirect evidence of its existence based on the statistically significant lowering rate of evolution of those sequences putatively coding for htgA. It is worth mentioning that available evidence for the role of htgA included cloning of the htgA gene, RNA isolation and Northern (RNA) blots, protein labeling, and promoter mapping experiments [14]. One possibility for the lack of sigma 32 induction is that, as suggested by [22], htgA either requires additional regulators or is not recognized by sigma 32. Our study should stimulate more research on this subject in order to elucidate the true nature of htgA.

Statistic analysis
There is no available experimental evidence of the existence of the protein product of yaaW. Nevertheless, i) the existence of homolog sequences to yaaW in different genomes; ii) the fusion of yaaW sequences to other protein domains (a GTPase domain in this case); and iii) the higher ratio of ps versus pn substitutions, strongly suggest that yaaW truly codes for a functional polypeptide. Here we propose that htgA originated by overprinting (de novo origination of a new gene from a non-coding frame in DNA in a prokaryotic genome) at some moment in the clade leading to E. coli and Shigella flexneri species.
Recently, a survey of genetic diversity on the Sargasso Sea using shot-gun technology showed the existence of large number of previously undetected new genes [24]. Using a different approach, [25] have shown the existence of an open pan-genome for the species Streptococcus agalactiae. This indicates that the genome of S. agalactiae is constantly exchanging genes with other organisms in its environment and that novel sequenced genomes from this species will exhibit an approximate of 30 new genes not previously found in the other strains. In a recent re-annotation of E. coli genome [23], it was shown that despite much effort in understanding the functions of each gene, there still a number of uncharacterized genes unique to the genome of E. coli (the gene products of unknown function are divided into those containing a conserved domain (N = 145, 3.3%), those with (N = 233, 5.3%) or without (N = 238, 5.3%) a detectable homolog in the sequence databases). All these approaches suggest the existence of a large number of as yet uncharacterized new genes in the biosphere. This opens the possibility that some of such novel genes arose through point mutation from coding (overprinting) as well as non-coding sequences. It also suggests that the origin of the novo sequences is an ongoing process in prokaryotic evolution.
The number of gene and protein families in the biosphere continues to be an open area of research. For instance, the 1.69 release of SCOP database has 1536 superfamilies [26], and this is certainly an underestimate. The origin of each one of these families continues to be a matter of debate and several mechanisms may be involved. Over-printing is a clear example of how new protein families can originate independently from non-coding sequences. If the new gene (or the new region of the gene resulting from the loss of a stop codon) is in a different strand or in a different coding frame than the other overlapping gene, a new coding sequence is being generated de novo. If this new coding region is transcribed and translated and the product is able to fold properly, natural selection could recruit this protein for some function, and if successful, it can be fixed in the population, and may in the long term represent the origin of a new protein/domain family.

Conclusion
We have described the origin of a novel gene of known function through overprinting in a prokaryote. Due to the wider phylogenetic distribution of yaaW, parsimony favors the origin of htgA by overprinting inside an already existing yaaW gene. The evidence of htgA being a functional gene is experimental [14], while the evidence for yaaW comes from: i) the existence of homolog sequences to yaaW in different genomes; ii) the fusion of yaaW sequences to other protein domains (a GTPase domain in this case); and iii) and the higher ratio of ps versus pn substitutions among yaaW sequences. Intriguingly there is no evidence of the action of positive natural selection among htgA homologs. This can be explained by a high degree of restriction imposed by yaaW due to the 123/213 phase of the overlap. The lower rate of evolution of yaaW sequences having an overlapped htgA gene is explained by the higher constraints of a DNA sequence coding for two genes. If our interpretation of the above evidence is correct, it is likely that the origin of de novo sequences from non-coding DNA in prokaryotes is an ongoing process.

Methods
We have searched for all strict pairs of completely overlapped genes in available E.coli complete genomes. We have found 227 pairs of such genes (Additional file 3 We have searched for all homologs of the protein products of both genes in a database of 416 prokaryotic chromosomes (including main chromosomes and accessory genetic elements like plasmids) from 251 prokaryotic organisms using BLAST [27] searches (e value cut-off 0.001) as well as in the NCBI non-redundant database for virus homologs.
Protein domain identification of homologs thus identified was assigned according Pfam database [28]. Open reading frames of htgA genes potentially coded in yaaW sequences were predicted using ORF Finder at NCBI [29]. Genomic coordinates of each of the genes were searched in NCBI gene annotation tables [30].

Multiple alignment
Multiple sequence alignment of homologs to yaaW protein products was performed using ClustalW [31] as implemented in Bioedit [32] and edited manually following the alignment of protein family [Pfam:UPF0174] deposited in Pfam. Conserved residues were identified using Jalview [33]. DNA sequences of yaaW genes were aligned following [Pfam:UPF0174] protein sequence alignment, in order to preserve codon positions in the multiple DNA sequence alignment.

Phylogenetic analysis
A Minimum-Evolution tree of amino acid sequences homologs to yaaW protein product was inferred using Poisson correction, constant rate homogeneity and 100 bootstrap replications, using MEGA3.1 software [34].

Selection test
In order to understand the pattern of selection (purifying selection, neutrality or positive selection) we have applied a large sample Z test as implemented in MEGA3.1 soft-ware [34] to all pairs of yaaW genes, as well as to all pairs of htgA homologs.

Relative rate test
A non-parametric relative rate test following Tajima [16] was applied to all combinations of three sequences consisting of an out-group (O) sequence, and two in-group (A and B) sequences (see Figure 3 to identify sequences belonging to each group) using a Perl script RRT.pl available upon request. The analysis was performed using DNA sequences. The method counts the number of substitutions exclusive of A sequences (m1 substitutions) against the number of substitutions exclusive of the B sequences (m2 substitutions), using an out-group sequence O as a reference to identify such changes. The statistical significance is evaluated using a Chi-square test.
We have also analyzed if there are rate differences across the overlap, therefore we have subdivided the multiple alignment of yaaW homologs in 13 windows of 102 nucleotides each (adjacent windows overlap by 51 nucleotides).