The origin of a novel gene through overprinting in Escherichia coli
© Delaye et al. 2008
Received: 22 March 2007
Accepted: 28 January 2008
Published: 28 January 2008
Overlapped genes originate by a) loss of a stop codon among contiguous genes coded in different frames; b) shift to an upstream initiation codon of one of the contiguous genes; or c) by overprinting, whereby a novel open reading frame originates through point mutation inside an existing gene. Although overlapped genes are common in viruses, it is not clear whether overprinting has led to new genes in prokaryotes.
Here we report the origin of a new gene through overprinting in Escherichia coli K12. The htgA gene coding for a positive regulator of the sigma 32 heat shock promoter arose by point mutation in a 123/213 phase within an open reading frame (yaaW) of unknown function, most likely in the lineage leading to E. coli and Shigella sp. Further, we show that yaaW sequences coding for htgA genes have a slower evolutionary rate than those lacking an overlapped htgA gene.
While overprinting has been shown to be rather frequent in the evolution of new genes in viruses, our results suggest that this mechanism has also contributed to the origin of a novel gene in a prokaryote. We propose the term janolog (from Jano, the two-faced Roman god) to describe the homology relationship that holds between two genes when one originated through overprinting of the other. One cannot dismiss the possibility that at least a small fraction of the large number of novel orphan genes detected in pan-genome and metagenomic studies arose by overprinting.
The origin of novelty is one of the central questions in evolutionary biology. Paralogy, orthology, and xenology, as well as domain shuffling between genes, account for a large part of the evolution of protein families at the sequence level . All these mechanisms require the previous existence of coding sequences. Whether novel sequences in cells could arise from spontaneous point mutation remains an open question.
In some cases, two or more genes are known to be encoded in the same DNA region. Overlapping genes have been known since the first complete viral genomes were sequenced, as exemplified by gene B of the single-stranded DNA bacteriophage ϕX174, which is completely contained within gene A. It has been suggested that overlapping genes originate by spontaneous point mutation through a mechanism known as overprinting. Overlapping genes could arise by: a) extension of one ORF into another by loss of a stop codon; b) a shift to an upstream initiation codon in adjacent genes; or c) generation of an entirely new ORF inside a pre-existing ORF by point mutation. The different ways in which two overlapping genes can be coded are known as phases, and the different phases determine, in turn, the evolutionary pressure that overlapping genes exert on each other [6, 7].
It has been shown that overprinting is an important mechanism for the origin of new genes in viruses [4, 8]. Overlapped genes are also a relatively common feature of prokaryotic genomes. For instance, it has been suggested that in prokaryotes, where approximately 80% of the overlaps comprise less than 30 base pairs, the coding phase and distribution of overlapping genes, among other characteristics, are most consistent with the hypothesis that overlaps participate in the regulation of gene expression. A comparative study of the genomes of Mycoplasma genitalium and Mycoplasma pneumoniae found that most overlapped genes were generated primarily by the loss of a stop codon, the absence of which resulted in elongation of the 3' end of the gene's coding region. Overlapped genes can also be classified according to their direction of transcription into three categories, namely "convergent" (-> <-), "unidirectional" (-> ->) and "divergent" (<- ->). A study that analyzed complete genome sequences from 50 bacterial species showed that most overlapped genes are coded following the unidirectional pattern. This pattern was confirmed in a comparative study among nine bacteria. Overlapped genes have also been found in mitochondria. For instance, a small protein [UniProtKB: Q6EMS7] of 66 amino acids (gene A6L) overlaps 40 bases into the gene coding for ATPase-6 in bovine mitochondria. Because the formation of overlapping genes necessarily involves the evolution of a coding region from non-coding DNA or from a different frame in coding DNA, their study might help to understand de novo evolution of coding regions.
The pattern of natural selection among 71 pairs of 3' overlapping genes (all of them having an overlap larger than 15 nucleotides) conserved at least in two prokaryotic genomes showed a statistically significant bias toward the 123/132 phase, thus ensuring the least mutual constraint on non-conservative amino acid replacements in both overlapping coding sequences . Here, we show that the mechanism of overprinting has also contributed to the origin of a new gene of identifiable function in Escherichia coli.
Although NCBI gene annotation tables provide coordinates for the position of genes in genomes, caution should be taken, since over-annotation and sequencing errors might mislead the identification of true genes. While there is in vitro experimental evidence for the existence of htgA , this does not hold true for yaaW. Therefore, we have searched for indirect evidence for its existence by asking whether there are: a) yaaW homologs in other genomes; b) domain fusions of homologs to the yaaW protein product with other protein domains; and c) detectable biases in the ratio of synonymous (ps) versus non-synonymous (pn) substitutions, i.e., departures from neutrality (ps ≠ pn) among yaaW sequences that would suggest the action of natural selection.
We detected no viral homologs of the htgA protein product in the non-redundant database. For the yaaW protein product, the only viral hit was a weak similarity to the DnaC protein [UniProtKB:Q9AZV4] from Lactococcus phage bIL286 (41 amino acid identities along 152 residues, and a BLAST e-value of 1.7). This appears to be a false positive, since the similarity was not confirmed by the Pfam database. These negative results strongly suggest, but do not prove, that yaaW and htgA did not originate in a viral genome.
The yaaW-encoded proteins match the [Pfam:UPF0174] profile in the Pfam database. The only exception is the hypothetical protein alr3689 [UniProtKB:Q8YQX2] from Nostoc sp. PCC 7120, which exhibits a marginal HMMER E-value of 0.048, comprising 156 amino acids out of 185 (0.01 is considered significant in Pfam). However, a BLAST search of the E. coli yaaW protein sequence [UniProtKB:P75617] against our database of proteins from complete genomes matches an 86 amino acid stretch of the cyanobacterial protein alr3689 with a significant e-value of 2e-05. The protein alr3689 therefore seems to be a highly divergent homolog of yaaW that has the UPF0174 domain. In Shewanella sp. and Campylobacter coli, yaaW is fused to a GTPase domain [Pfam:MMR_HSR1], which has a wide phylogenetic distribution but unknown function. yaaW is also fused to a fraction of this domain in the γ-proteobacterium Idiomarina loihiensis (Figure 2). In other proteins, this GTPase domain is fused to a domain of unknown function related to ubiquitin [Pfam:DUF933] and to another GTPase domain [Pfam:GTP1_OBG], which have been shown to be important in normal cell metabolism in Schizosaccharomyces pombe. The fusion of some yaaW homologs to a GTPase domain suggests that yaaW codes for a functional protein. On the other hand, no Pfam domain is currently available for the htgA gene.
Table 1. Natural selection analysis. Mean p values ± standard deviation (SD) for all comparisons among all yaaW and htgA homologs. * p < 0.05; ** p < 0.01. Neutrality: a statistically significant p value indicates non-neutral evolution. Purifying selection: a statistically significant p value indicates purifying selection.

Comparison            | Neutrality                 | Purifying selection
yaaW (all sequences)  | 0.042 ± 0.1215 *           | 0.024 ± 0.0869 *
  A versus B          | 2.079E-09 ± 1.6218E-09 **  | 1.168E-09 ± 6.5436E-10 **
  A versus O          | 0.009 ± 0.0266 **          | 0.005 ± 0.0139 **
  B versus O          | 0.031 ± 0.0696 *           | 0.015 ± 0.0337 *
htgA (all sequences)  | 0.116 ± 0.2171             | 0.068 ± 0.1456
  A versus B          | 6.69E-06 ± 8.3638E-06 **   | 2.378E-06 ± 3.0695E-06 **
  A versus O          | 0.118 ± 0.2358             | 0.057 ± 0.1187
  B versus O          | 0.097 ± 0.2122             | 0.050 ± 0.1050
The pattern of synonymous and non-synonymous substitutions among htgA sequences also suggests purifying selection (Table 1). However, this pattern could reflect the action of purifying selection on yaaW genes combined with the 123/213 phase of the overlap, in which synonymous and non-synonymous mutations in yaaW affect htgA similarly. This is an unexpected pattern for a novel sequence, since new genes originating through overprinting in viruses have been shown to exhibit an excess of non-synonymous substitutions, indicating the action of positive natural selection.
DNA sequences coding for overlapped genes are expected to evolve at a lower rate than those DNA sequences coding for only one gene [6, 7]. Accordingly, we have analyzed the substitution rates of closely related yaaW genes with and without the overlapped htgA using a non-parametric relative rate test  for all combinations of three sequences consisting of an out-group (O) sequence, and two in-group (A and B) sequences (see methods and Figure 3).
Ever since Darwin, homology in biology has been explained by common descent. With the advent of the molecular structure of nucleic acids and proteins, the concept of homology was further refined to describe more precisely the phylogenetic relationships among genes (reviewed in ). Accordingly, genes separated by speciation events are referred to as orthologs, genes duplicated inside the same genome are named paralogs, and horizontally transferred genes are known as xenologs. In symbiogenesis, moreover, homologous genes previously separated by lineage divergence became associated in a single cellular system. The term synology has been proposed to describe the phylogenetic relationship among such genes. However, no terminology has been employed to describe the phylogenetic relationship between homologous overlapping genes. We suggest the term janology (from the two-faced Roman deity Jano) for the homology relationship among sequences sharing the same DNA.
Other cases of novel genes have been described (e.g., jingwei, Sphinx, and Morpheus), and the origin of novel sequences through overprinting in viruses is well documented [4, 8]. However, to the best of our knowledge, this is the first time that a novel gene originating through overprinting has been carefully described in a cell.
Caution is warranted, since the existence of the htgA gene remains controversial. Although the gene has been cloned and its product was originally characterized as a positive regulator of the sigma 32 heat shock promoter, a recent study shows that this gene is not induced by sigma 32 as previously suggested. Moreover, the initiation codon of htgA has recently been changed, and one of the predicted copies of htgA (Figure 2) is shorter due to a nonsense mutation. Given that no further experimental work has been carried out on htgA, we rely on indirect evidence of its existence, namely the statistically significant lower rate of evolution of those sequences putatively coding for htgA. It is worth mentioning that the available evidence for the role of htgA included cloning of the htgA gene, RNA isolation and Northern (RNA) blots, protein labeling, and promoter mapping experiments. One possible explanation for the lack of sigma 32 induction, as has been suggested, is that htgA either requires additional regulators or is not recognized by sigma 32. Our study should stimulate more research on this subject in order to elucidate the true nature of htgA.
There is no experimental evidence available for the existence of the protein product of yaaW. Nevertheless, i) the existence of sequences homologous to yaaW in different genomes; ii) the fusion of yaaW sequences to other protein domains (a GTPase domain in this case); and iii) the higher ratio of ps versus pn substitutions strongly suggest that yaaW truly codes for a functional polypeptide. Here we propose that htgA originated by overprinting (de novo origination of a new gene from a non-coding frame in DNA in a prokaryotic genome) at some point in the clade leading to E. coli and Shigella flexneri.
Recently, a survey of genetic diversity in the Sargasso Sea using shotgun sequencing showed the existence of a large number of previously undetected new genes. Using a different approach, the existence of an open pan-genome has been shown for the species Streptococcus agalactiae. This indicates that the genome of S. agalactiae is constantly exchanging genes with other organisms in its environment and that each newly sequenced genome from this species will exhibit approximately 30 new genes not previously found in the other strains. A recent re-annotation of the E. coli genome showed that, despite much effort to understand the function of each gene, there is still a number of uncharacterized genes unique to the genome of E. coli (the gene products of unknown function are divided into those containing a conserved domain (N = 145, 3.3%), and those with (N = 233, 5.3%) or without (N = 238, 5.3%) a detectable homolog in the sequence databases). All these approaches suggest the existence of a large number of as yet uncharacterized new genes in the biosphere. This opens the possibility that some of these novel genes arose through point mutation from coding (overprinting) as well as non-coding sequences. It also suggests that the origin of de novo sequences is an ongoing process in prokaryotic evolution.
The number of gene and protein families in the biosphere continues to be an open area of research. For instance, the 1.69 release of the SCOP database has 1536 superfamilies, and this is certainly an underestimate. The origin of each of these families continues to be a matter of debate, and several mechanisms may be involved. Overprinting is a clear example of how new protein families can originate independently from non-coding sequences. If the new gene (or the new region of the gene resulting from the loss of a stop codon) is on a different strand or in a different coding frame than the other overlapping gene, a new coding sequence is being generated de novo. If this new coding region is transcribed and translated and the product is able to fold properly, natural selection could recruit this protein for some function; if successful, it can be fixed in the population and may, in the long term, represent the origin of a new protein/domain family.
We have described the origin of a novel gene of known function through overprinting in a prokaryote. Due to the wider phylogenetic distribution of yaaW, parsimony favors the origin of htgA by overprinting inside an already existing yaaW gene. The evidence that htgA is a functional gene is experimental, while the evidence for yaaW comes from: i) the existence of sequences homologous to yaaW in different genomes; ii) the fusion of yaaW sequences to other protein domains (a GTPase domain in this case); and iii) the higher ratio of ps versus pn substitutions among yaaW sequences. Intriguingly, there is no evidence of the action of positive natural selection among htgA homologs. This can be explained by a high degree of restriction imposed by yaaW due to the 123/213 phase of the overlap. The lower rate of evolution of yaaW sequences having an overlapped htgA gene is explained by the higher constraints on a DNA sequence coding for two genes. If our interpretation of the above evidence is correct, it is likely that the origin of de novo sequences from non-coding DNA in prokaryotes is an ongoing process.
We searched for all strict pairs of completely overlapped genes in the available complete E. coli genomes and found 227 such pairs (Additional file 3). After excluding all overlapped genes that were: a) annotated as hypothetical; b) related to phages; or c) lacking homologs for one (or both) of the overlapped genes, 7 pairs were left. Among these seven pairs, there is experimental evidence for the existence of the hypothetically newer gene (the one with fewer homologs) only for the htgA - yaaW pairs (three pairs). Among the four remaining pairs, the pair formed by the molecular chaperone DnaK (GI identification number: 26245936; 500 homologs found) and the putative glutamate dehydrogenase (GI identification number: 26245935; 6 homologs found) looks like a promising alternative. However, we failed to detect a difference in the rate of evolution with the method used here. All of this information is available in Additional file 3 for further studies. We therefore selected the htgA - yaaW pair based on the experimental evidence for the existence of the gene with fewer homologs (htgA) and the presence of homologs of the other gene (yaaW) in genomes belonging to different bacterial species.
Based on the recently re-annotated gene table of the E. coli K12 genome, the genes yaaW [UniProtKB:P75617] and htgA [UniProtKB:P28697] have the following coordinates: yaaW: locus tag K-12 ECK0011, left nucleotide 10643, right nucleotide 11356, direction of transcription (-); htgA: locus tag K-12 ECK0012, left nucleotide 10830, right nucleotide 11315, direction of transcription (+).
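The coordinates above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the `Gene` class and `is_contained` helper are hypothetical names introduced here, and the only data used are the left and right nucleotides and strands quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical container for one row of a gene annotation table."""
    name: str
    left: int    # leftmost nucleotide, 1-based, inclusive
    right: int   # rightmost nucleotide, 1-based, inclusive
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        # Inclusive coordinates, hence the +1.
        return self.right - self.left + 1


def is_contained(inner: Gene, outer: Gene) -> bool:
    """True if `inner` lies entirely within the span of `outer`."""
    return outer.left <= inner.left and inner.right <= outer.right


# Coordinates as quoted in the text for E. coli K12.
yaaW = Gene("yaaW", 10643, 11356, "-")
htgA = Gene("htgA", 10830, 11315, "+")

print(yaaW.length, htgA.length)      # lengths in nucleotides
print(is_contained(htgA, yaaW))      # htgA completely inside yaaW?
print(yaaW.strand != htgA.strand)    # coded on opposite strands?
```

With these coordinates, both lengths are multiples of three and htgA falls entirely within yaaW on the opposite strand, which is what a completely overlapped pair requires.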
We have searched for all homologs of the protein products of both genes in a database of 416 prokaryotic chromosomes (including main chromosomes and accessory genetic elements like plasmids) from 251 prokaryotic organisms using BLAST  searches (e value cut-off 0.001) as well as in the NCBI non-redundant database for virus homologs.
Protein domains of the homologs thus identified were assigned according to the Pfam database. Open reading frames of htgA genes potentially coded in yaaW sequences were predicted using ORF Finder at NCBI. Genomic coordinates of each of the genes were retrieved from NCBI gene annotation tables.
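To illustrate what an ORF prediction step of this kind involves, the Python sketch below scans all six reading frames of a DNA sequence for ATG-to-stop open reading frames. It is not the algorithm of NCBI ORF Finder; it assumes only ATG start codons and the standard stop codons, and the function names and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 50):
    """List ORFs as (strand, frame, start, end) tuples.

    `start` and `end` are 0-based, end-exclusive positions on the scanned
    strand (for "-", on the reverse complement). `end` includes the stop
    codon. Only ORFs with at least `min_codons` sense codons are kept.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


# Toy example: one short ORF (ATG AAA TAG) in frame 2 of the forward strand.
print(find_orfs("CCATGAAATAGCC", min_codons=2))
```

Running such a scan on a yaaW sequence and looking for an ORF on the opposite strand would be one way to ask whether that sequence could also code for htgA; alternative bacterial start codons are deliberately ignored in this sketch.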
Multiple sequence alignment of homologs to yaaW protein products was performed using ClustalW  as implemented in Bioedit  and edited manually following the alignment of protein family [Pfam:UPF0174] deposited in Pfam. Conserved residues were identified using Jalview . DNA sequences of yaaW genes were aligned following [Pfam:UPF0174] protein sequence alignment, in order to preserve codon positions in the multiple DNA sequence alignment.
A Minimum-Evolution tree of amino acid sequences homologous to the yaaW protein product was inferred using Poisson correction, constant rate homogeneity and 100 bootstrap replications, using MEGA3.1 software.
In order to understand the pattern of selection (purifying selection, neutrality or positive selection) we have applied a large sample Z test as implemented in MEGA3.1 software  to all pairs of yaaW genes, as well as to all pairs of htgA homologs.
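For orientation, a large-sample Z test of this kind generally takes the form below. This is a sketch of the general statistic, not a statement of the exact estimator used in MEGA3.1: the symbols p_s and p_n follow the text, while Z and Var(·) are notation assumed here, with the variances typically estimated analytically or by bootstrap.

```latex
Z = \frac{p_n - p_s}{\sqrt{\operatorname{Var}(p_s) + \operatorname{Var}(p_n)}},
\qquad
\begin{cases}
H_0:\ p_s = p_n & \text{(neutrality, two-tailed)} \\
H_1:\ p_s > p_n & \text{(purifying selection, one-tailed)}
\end{cases}
```

Under this reading, a significant two-tailed result indicates a departure from neutrality, and a significant one-tailed result with p_s > p_n indicates purifying selection.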
A non-parametric relative rate test following Tajima was applied to all combinations of three sequences consisting of an out-group (O) sequence and two in-group (A and B) sequences (see Figure 3 to identify the sequences belonging to each group) using a Perl script, RRT.pl, available upon request. The analysis was performed using DNA sequences. The method compares the number of substitutions exclusive to the A sequence (m1 substitutions) with the number of substitutions exclusive to the B sequence (m2 substitutions), using an out-group sequence O as a reference to identify such changes. Statistical significance is evaluated using a Chi-square test. To analyze whether there are rate differences across the overlap, we also subdivided the multiple alignment of yaaW homologs into 13 windows of 102 nucleotides each (adjacent windows overlap by 51 nucleotides).
We are indebted to Drs. Peter Gogarten, Kalin Vetsigian, and two anonymous reviewers for careful reading and many useful suggestions. Support from CONACYT-México (Project 50520-Q) is gratefully acknowledged. LD wishes to dedicate this study to Professor Ramon Adell Febrer (in memoriam) for all his lovely and kind support.
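The counting step and the window layout described above can be sketched in Python as follows. This is not the RRT.pl script mentioned in the text; it is a minimal stand-in that assumes the commonly used form of Tajima's statistic, (m1 - m2)^2 / (m1 + m2) with one degree of freedom, and that skips any column containing a gap or ambiguous base. The function names are hypothetical.

```python
import math


def tajima_relative_rate(seq_a: str, seq_b: str, seq_o: str):
    """Return (m1, m2, chi2, p) for three aligned DNA sequences.

    m1 counts sites where only A differs (B equals the out-group O);
    m2 counts sites where only B differs (A equals O).
    """
    if not (len(seq_a) == len(seq_b) == len(seq_o)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq_a.upper(), seq_b.upper(), seq_o.upper()):
        if any(base not in "ACGT" for base in (a, b, o)):
            continue                # assumption: skip gaps and ambiguities
        if a != b and b == o:
            m1 += 1                 # change exclusive to A
        elif a != b and a == o:
            m2 += 1                 # change exclusive to B
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0     # no informative sites
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of a chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p


def sliding_windows(length: int, size: int = 102, step: int = 51):
    """Half-open (start, end) windows of `size` columns, moved by `step`."""
    return [(s, s + size) for s in range(0, length - size + 1, step)]


# Toy example: four changes exclusive to A, one exclusive to B.
m1, m2, chi2, p = tajima_relative_rate("CCCCAAAAAA", "AAAAAAAAGA", "AAAAAAAAAA")
print(m1, m2, round(chi2, 2))

# Windows of 102 columns with a 51-column step over a 714-column alignment.
print(len(sliding_windows(714)))
```

Thirteen windows of 102 nucleotides that overlap by 51 span exactly 714 columns, which matches the length of yaaW implied by the coordinates given above. The test can then be applied to each window separately by slicing the three sequences with the returned coordinates.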
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.