We have studied mRNA stability in the translation-initiation region of protein-coding genes in 650 genomes of dsDNA viruses. For many of these viruses, mRNA stability tends to be reduced in the first 30-40 nt of the coding sequence. In this region, the mRNA tends to be less stable than expected given a gene's amino-acid sequence and codon-usage bias. We have also found that the GC content of coding sequences is a major predictor of this reduction. The higher the GC content, the larger the reduction in mRNA stability at the 5' end of the coding sequence (i.e., the larger 5'
). For bacteriophage, the 5'
also correlates positively with the 5'
in the host organisms.
Experimental and computational work had previously shown that increased local mRNA stability in the translation-initiation region impaired translation initiation in E. coli [11, 13]. Two computational studies suggested that this effect exists more broadly in both prokaryotes and eukaryotes [14, 15]. Here, we have shown that similar selection pressures exist in the viral kingdom.
As in cellular organisms, the region with reduced mRNA stability lies immediately downstream of the start codon and has a length of 30 to 40 nt (the first two windows in our analysis). Past the first two windows,
tends to be zero or slightly negative. In cellular organisms,
is consistently negative downstream from the start codon. The lack of a negative
in most virus genomes likely reflects a lack of statistical power, a consequence of the small genomes of viruses. The strong positive correlation between genomic GC composition and the reduction of mRNA stability at the translation-initiation region agrees with the finding by Gu et al.
In contrast to cellular organisms, viruses frequently have overlapping ORFs. In fact, nearly all viruses in our analysis had at least one overlapping ORF. Our codon-shuffling approach conserves the amino-acid sequence of the focal ORF, but does not conserve the amino-acid sequence of any second ORF that overlaps with the focal one. Thus, overlapping sequences will experience additional selective constraint that our approach does not accurately take into account. In principle, this issue could cause spurious results. However, we found that there is little difference in
values in overlapping and non-overlapping ORFs. At worst,
values in overlapping ORFs are reduced compared to those in non-overlapping ORFs (Additional File 1 Figure S1). Therefore, treating overlapping ORFs as non-overlapping ORFs, as we have effectively done throughout much of this work, is a conservative approach when looking for elevated
To understand why
increased with increasing GC content, we also considered the raw ΔG values. One can envision two extreme cases of how ΔG might depend on GC content. On the one hand, the ΔG in the first window might be required to be at a fixed low value, independent of GC content, to enable efficient translation. The ΔGs further downstream would then be expected to decrease with increasing GC content, due to the higher thermodynamic stability of GC base pairs. On the other hand, the ΔG in the first window might always differ by a fixed amount from ΔGs further downstream, independent of GC content. We found the reality to be somewhere in between these two extreme cases. Even though the ΔG in the first window showed a strong negative correlation with GC content, the difference in ΔG between the first window and the windows further downstream was not constant. With increasing GC content, it increased strongly for prokaryotic viruses and moderately for eukaryotic viruses. For archaeal viruses, however, it did not increase significantly. Since the correlation between
and GC was of comparable magnitude for all three groups, we infer that two separate mechanisms are at play. First, for prokaryotic and eukaryotic viruses, the requirement for decreased stability in the first window increases with increasing GC content. Second, in general, the
measure seems to become more powerful for sequences with increased GC content, because the higher the GC, the less likely it is that a reshuffled sequence shows reduced stability.
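The codon-shuffling null model referred to above can be illustrated with a minimal sketch. It permutes synonymous codons within one ORF, which keeps both the amino-acid sequence and the gene's codon counts fixed, and then compares a window of the real sequence with the same window in the shuffled sequences. Several things here are assumptions for illustration only: the function names, the default window length and number of shuffles, the use of a z-score as the comparison statistic, and the `energy_fn` argument, which stands in for a real folding-energy calculation (for example, the minimum free energy reported by an RNA folding program).

```python
import random
import statistics
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds):
    """Split an in-frame coding sequence into a list of codons."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    """Translate a coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def shuffle_synonymous(cds, rng):
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(cds)
    positions = {}
    for i, codon in enumerate(codons):
        positions.setdefault(CODON_TABLE[codon], []).append(i)
    shuffled = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)
        for i, codon in zip(idxs, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def window_z_score(cds, energy_fn, start=0, length=30, n_shuffles=200, seed=0):
    """Z-score of a window's energy relative to synonymous shuffles.

    energy_fn is a caller-supplied stand-in for a folding-energy routine;
    start, length, and n_shuffles are hypothetical defaults.
    """
    rng = random.Random(seed)
    observed = energy_fn(cds[start:start + length])
    null = [
        energy_fn(shuffle_synonymous(cds, rng)[start:start + length])
        for _ in range(n_shuffles)
    ]
    sd = statistics.pstdev(null)
    if sd == 0:
        return 0.0  # no variation among shuffles: no deviation measurable
    return (observed - statistics.fmean(null)) / sd
```

Because the shuffle only exchanges codons between positions that encode the same amino acid, it does not preserve the protein encoded by a second, overlapping ORF in a different frame, which is the limitation discussed above for overlapping ORFs.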
For bacteriophages, we asked whether the requirement for low mRNA secondary-structure stability in host genomes affects the 5'
in phages. Because phages share the cellular environment and translation machinery with their hosts, we would expect that phages are optimized for the expression machinery of their hosts. We found a significant positive correlation between the 5'
in phage genomes and that in their hosts. We also observed an even stronger correlation between the genomic GC content in phages and that in their hosts. Moreover, we found that a phage's GC content seems to impose a lower limit on the GC content of the hosts it can infect (Figure 5). These host-specific results are consistent with previous reports that synonymous codon usage in bacteriophage mimics that of their hosts and that viral and host GC content are similar in certain cases (Mycobacterium tuberculosis, 63.6% phage vs. 65.6% host; Staphylococcus aureus, 33.7% phage vs. 32.9% host).
We used independent contrasts to assess whether
correlated with GC content. The independent contrasts method requires an accurate phylogeny of the organisms under study. Such a phylogeny is difficult to obtain for viruses, because viruses have either arisen multiple times independently or their common ancestor is extremely ancient [20–23]. In our analysis, we separately considered viruses infecting eukaryotes, prokaryotes, and archaea, and used phylogenetic trees derived from the taxonomic classification of these viruses. The branch lengths in these trees simply reflect the number of taxonomic levels by which two viruses are separated. Therefore, the branch lengths are almost certainly incorrect. Nevertheless, these trees should at a minimum remove any major biases that might arise if some groups of viruses were more heavily sampled than others. We found generally that the results based on phylogenetically independent contrasts (PIC) were nearly identical to results calculated on the raw data (not shown). Therefore, we believe that our results are not strongly confounded by phylogeny and that the correction for phylogeny we employed was sufficient.
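As a rough illustration of how a taxonomy can stand in for a phylogeny, the sketch below computes a separation between two viruses from their taxonomic lineages. The interpretation used here, counting the ranks below the deepest rank the two lineages share, is an assumption about what "number of taxonomic levels" means, and the function name and lineage labels are hypothetical placeholders rather than real taxa.

```python
def taxonomic_separation(lineage_a, lineage_b):
    """Count the ranks below the deepest rank shared by two lineages.

    Each lineage is a list of taxon names ordered from the highest rank
    to the lowest. This definition is an illustrative assumption.
    """
    shared = 0
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        shared += 1
    return max(len(lineage_a), len(lineage_b)) - shared


# Hypothetical placeholder lineages (order, family, genus, species).
virus_1 = ["order_X", "family_A", "genus_1", "virus_1"]
virus_2 = ["order_X", "family_A", "genus_2", "virus_2"]
virus_3 = ["order_X", "family_B", "genus_3", "virus_3"]

print(taxonomic_separation(virus_1, virus_2))  # same family, different genus
print(taxonomic_separation(virus_1, virus_3))  # same order, different family
```

Distances of this kind carry no information about evolutionary time, which is why branch lengths derived from them are only a coarse correction for uneven sampling.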
In our comparison of viruses with their hosts, we encountered the added complication that virus and host trees will in general not be identical. We are not aware of any method that can calculate correct correlations in this scenario. We addressed this issue by considering both the raw data and PIC based on the virus trees (because properties of the virus host can be considered as a measurement on the virus). Again, both methods produced nearly identical results. Thus, it is unlikely that the results are strongly confounded by phylogeny.
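For readers unfamiliar with independent contrasts, the following minimal sketch shows the standard Felsenstein-style calculation on a fully bifurcating tree with positive branch lengths, followed by a correlation through the origin between the contrasts of two traits. It is not the implementation used in this study: the `Node` class and function names are hypothetical, and taxonomy-derived trees are typically multifurcating, so they would first have to be resolved into binary form (an assumption this sketch does not handle).

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional


@dataclass
class Node:
    length: float                    # branch length leading to this node
    name: Optional[str] = None       # set for tips only
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def independent_contrasts(root, trait):
    """Return standardized contrasts for one trait on a binary tree."""
    contrasts = []

    def visit(node):
        # Tips return their measured value and their own branch length.
        if node.left is None or node.right is None:
            return trait[node.name], node.length
        xa, va = visit(node.left)
        xb, vb = visit(node.right)
        # Contrast between the two daughters, scaled by its expected s.d.
        contrasts.append((xa - xb) / sqrt(va + vb))
        # Ancestral estimate weighted by inverse branch lengths.
        value = (xa / va + xb / vb) / (1 / va + 1 / vb)
        # Lengthen the branch to account for uncertainty in the estimate.
        return value, node.length + va * vb / (va + vb)

    visit(root)
    return contrasts


def contrast_correlation(cx, cy):
    """Correlation through the origin between two sets of contrasts."""
    sxy = sum(a * b for a, b in zip(cx, cy))
    sxx = sum(a * a for a in cx)
    syy = sum(b * b for b in cy)
    return sxy / sqrt(sxx * syy)
```

In the virus-host comparison described above, both traits (the virus property and the host property assigned to that virus) would be passed through `independent_contrasts` with the same virus tree, so that the two lists of contrasts line up node by node.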