High correlation between the turnover of nucleotides under mutational pressure and the DNA composition
© Kowalczuk et al; licensee BioMed Central Ltd. 2001
Received: 10 September 2001
Accepted: 17 December 2001
Published: 17 December 2001
Any DNA sequence is a result of compromise between the selection and mutation pressures exerted on it during evolution. It is difficult to estimate the relative influence of each of these pressures on the rate of accumulation of substitutions. However, it is important to discriminate between the effect of mutations and the effect of selection when studying the phylogenetic relations between taxa.
We have tested in computer simulations, and analytically, the available substitution matrices for many genomes, and we have found that DNA strands in equilibrium under mutational pressure have a unique feature: the fraction of each type of nucleotide is linearly dependent on the time needed for substitution of half of the nucleotides of a given type, with a correlation coefficient close to 1. Substitution matrices found for sequences under selection pressure do not have this property. A substitution matrix for the leading strand of the Borrelia burgdorferi genome, having reached equilibrium in computer simulation, gives a DNA sequence with nucleotide composition and asymmetry corresponding precisely to the third positions in codons of protein coding genes located on the leading strand.
Parameters of mutational pressure allow us to calculate the DNA composition in equilibrium with this mutational pressure. By comparing any real DNA sequence with the sequence in equilibrium, it is possible to estimate the distance between these sequences, which could be used as a measure of the selection pressure. Furthermore, the parameters of the mutational pressure enable direct estimation of the relative mutation rates in any DNA sequence in the studied genome.
Nucleotide sequences of all contemporary genomes are the result of a compromise between mutational pressure and selection. Many mutations which took place in the past have been eliminated by genetic death. Even so, there are differences in the nucleotide composition of protein coding sequences and intergenic sequences, and it is very difficult to discriminate between the effects of selection and mutations on their composition. Furthermore, many prokaryotic genomes have a very asymmetric nucleotide composition of chromosomes [e.g. 2–9]. The strand composition of DNA depends on the role which the strand plays in the replication process, leading or lagging. Usually the leading strand is richer in guanine (G) than in cytosine (C) and richer in thymine (T) than in adenine (A). The replication-associated mutational pressure is thought to be the most probable cause of this asymmetry [10–12].
Analyses of long-range correlations in DNA sequences revealed that a very strong triplet signal can be detected in the intergenic sequences [13, 14]. This signal can be created by fragments of coding sequences transferred into intergenic space by recombination mechanisms. Since the nucleotide compositions of the first, the second and the third nucleotide positions in coding sequences are strongly correlated, these correlations are seen even in some noncoding intergenic sequences. We have assumed that some intergenic sequences are derived from coding sequences and could freely accumulate mutations with frequencies determined by the replication-associated mutational pressure. If the time of divergence has not been very long, the homology between the intergenic sequences and their original protein coding sequences can still be found (we call these original coding sequences the reference sequences).
We have assumed that mutations accumulated only in the intergenic sequences and not in the reference sequences. This is not exactly true, but it made our study feasible and should give a good approximation of the mutational pressure exerted on intergenic sequences. Many other authors who have constructed substitution matrices from the mutations accumulated in pseudogene sequences have made the same assumption [15, 16]. Such an assumption could give a higher estimated mutation rate than the real one. Nevertheless, the substitution rates in the matrices are expressed as relative values, so this should not change the values in the matrix.
We have chosen the B. burgdorferi genome for our analyses because there are many indications that this genome is in the steady state. The B. burgdorferi genome is very asymmetric, which suggests its structural conservation. Either there have not been many inversions of genes between the leading and lagging strands, or the mutational pressure has had enough time to make the inverted genes resemble the genes of the new strand. The nucleotide composition of the third positions in codons testifies to the very conserved structure of the chromosome. These positions follow precisely the sign of the asymmetry of intergenic sequences, and the third positions of Open Reading Frames (ORFs) situated on the leading and lagging strands have precisely mirrored asymmetry, which is even stronger than that of intergenic sequences. This paradox could be explained by assuming that the highly degenerate third positions have accumulated more neutral or nearly neutral mutations introduced by the replication-associated processes because they stay at their positions longer than intergenic sequences. There are constraints on inversions of coding sequences but no constraints on inversions of intergenic sequences. Thus, some newly inverted intergenic sequences could counteract the asymmetry of the "new host" strand.
Results and Discussion
Testing the table of substitutions and verifying the assumptions
Tables of substitutions, DNA composition in equilibrium with the mutational pressure, and half times of nucleotide substitutions. [Table body not recovered; surviving labels: "mtDNA l, 2p", "Half time of substitution".]
$$N_{A>G} + N_{A>C} + N_{A>T} = N_{G>A} + N_{C>A} + N_{T>A} \qquad (1)$$
$$N_{G>A} + N_{G>C} + N_{G>T} = N_{A>G} + N_{C>G} + N_{T>G} \qquad (2)$$
$$N_{C>A} + N_{C>G} + N_{C>T} = N_{A>C} + N_{G>C} + N_{T>C} \qquad (3)$$
$$N_{T>A} + N_{T>G} + N_{T>C} = N_{A>T} + N_{G>T} + N_{C>T} \qquad (4)$$
where $N_{A>G} = N_A \cdot p(A>G)$ and $p(A>G)$ is the probability of substitution of A by G, taken from the BbTS (the other symbols are defined analogously).
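Equations (1)–(4) state that, for each nucleotide, the flow out equals the flow in. The following is a minimal Python sketch of one way to find a composition that satisfies them, by repeatedly applying the flows until nothing changes. It is not the authors' program: the function name is hypothetical, and `P_EXAMPLE` holds made-up values chosen only so that the twelve entries sum to 1. It is not the BbTS.

```python
NUCS = "AGCT"

# Hypothetical relative substitution probabilities p(i>j); NOT the BbTS values.
# The twelve entries sum to 1, like the normalised table described in the paper.
P_EXAMPLE = {
    "A": {"G": 0.10, "C": 0.04, "T": 0.06},
    "G": {"A": 0.16, "C": 0.03, "T": 0.09},
    "C": {"A": 0.09, "G": 0.04, "T": 0.20},
    "T": {"A": 0.07, "G": 0.06, "C": 0.06},
}


def equilibrium_composition(p, tol=1e-12, max_iter=100_000):
    """Return nucleotide fractions for which equations (1)-(4) hold."""
    comp = {n: 0.25 for n in NUCS}  # any start that sums to 1 would do
    for _ in range(max_iter):
        new = {}
        for n in NUCS:
            # Left-hand side of the balance equation: substitutions away from n.
            outflow = comp[n] * sum(p[n].values())
            # Right-hand side: substitutions of the other nucleotides into n.
            inflow = sum(comp[m] * p[m][n] for m in NUCS if m != n)
            new[n] = comp[n] - outflow + inflow
        if max(abs(new[n] - comp[n]) for n in NUCS) < tol:
            return new
        comp = new
    return comp


if __name__ == "__main__":
    for nuc, frac in equilibrium_composition(P_EXAMPLE).items():
        print(nuc, round(frac, 4))
```

Because the total inflow always equals the total outflow, the fractions keep summing to 1 throughout the iteration, so no separate normalisation step is needed.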
Properties of the substitution matrices
Let us consider only nucleotides existing in the original sequence, which is already in the steady state. The substitution of each of the four nucleotides follows the same rules as the decay of radioactive isotopes, with a "half time of substitution" characteristic of each nucleotide ($\tau_A$, $\tau_G$, $\tau_T$, $\tau_C$ for A, G, T, and C, respectively), determined by the sum of the probabilities of substitution of a given nucleotide by the other three nucleotides. In a more formal language:
$$\tau_A = \frac{\ln 2}{p_{mut}\,\bigl(p(A>G) + p(A>T) + p(A>C)\bigr)}$$
(analogously for nucleotides other than A), where $p_{mut}$ is a parameter which denotes the overall rate of mutations and does not influence the ratios between $\tau$ for different nucleotides.
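The decay analogy can be checked numerically. The sketch below, in Python, computes the half time for each nucleotide and the fraction of original sites that have never been substituted after a given number of steps. The matrix values and function names are illustrative assumptions, not data from the paper. Note that with discrete steps the exact half time is $\ln 2 / (-\ln(1 - p_{mut} s))$, where $s$ is the row sum, and the formula above is its approximation for small per-step rates.

```python
import math

# Hypothetical relative substitution probabilities p(i>j); NOT the BbTS values.
P_EXAMPLE = {
    "A": {"G": 0.10, "C": 0.04, "T": 0.06},
    "G": {"A": 0.16, "C": 0.03, "T": 0.09},
    "C": {"A": 0.09, "G": 0.04, "T": 0.20},
    "T": {"A": 0.07, "G": 0.06, "C": 0.06},
}


def half_times(p, p_mut):
    """tau_N = ln2 / (p_mut * sum of p(N>x)) for every nucleotide N."""
    return {n: math.log(2) / (p_mut * sum(p[n].values())) for n in p}


def surviving_fraction(p, p_mut, nuc, steps):
    """Fraction of original `nuc` sites never substituted after `steps` steps."""
    per_step = p_mut * sum(p[nuc].values())  # chance of substitution per step
    return (1.0 - per_step) ** steps


if __name__ == "__main__":
    p_mut = 0.01  # assumed overall mutation rate per step
    tau = half_times(P_EXAMPLE, p_mut)
    for nuc in "AGCT":
        left = surviving_fraction(P_EXAMPLE, p_mut, nuc, round(tau[nuc]))
        print(nuc, round(tau[nuc], 1), round(left, 3))
```

Changing `p_mut` rescales every half time by the same factor, so the ratios between them depend only on the substitution matrix.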
We claim that the mutational pressure leading to the nucleotide substitutions is extremely highly correlated with the DNA composition of the genome in such a way that the higher substitution turnover of a nucleotide determines the lower fraction of this nucleotide in the DNA sequence.
It seems highly unlikely that such a correlation in the B. burgdorferi genome arose by chance. We have tested many other tables of substitutions which had been published for different genomes and different sequences (data collected in Table 1). As long as such matrices describe the substitutions for sequences which are not under selection pressure (i.e. pseudogenes or the third positions in codons), they follow the same rule, with an extremely high correlation between $\tau_N$ and the fraction of nucleotide N in the DNA sequence in equilibrium (all correlations were statistically significant). This rule holds for asymmetric DNA, as in the B. burgdorferi genome, and for much less biased eukaryotic DNA. Matrices found for the third positions in the four-fold degenerate codons in Drosophila mitochondrial DNA fulfil this rule more precisely than those for all third positions in codons in that organelle's genome (the same results were obtained for published matrices of primates' mtDNA; data not shown). These differences could be expected if some mutations in the third positions that lead to amino acid substitutions are not neutral. Furthermore, in some instances, for example for the table describing substitution rates in sequences under strong selection, we have not found a correlation between $\tau_N$ and the fraction of nucleotide N (see the last column in Table 1). That supports the hypothesis that the rule is a specific property of pure mutational pressure. One can also notice that matrices found by analysis of substitutions in different pseudogenes in the same organism or in very closely related organisms give a different DNA composition in equilibrium, which supports the thesis that the mutational pressure varies between regions of the same eukaryotic genome [23–25].
We have no clear answer to the question of what selection forces have tuned the mutational pressure in such a way that it follows strict rules for sequences released from selection. It is logical that nucleotides with higher turnover destabilise the genetic information and that selection would tend to eliminate them from the DNA molecule. On the other hand, a lower frequency of a nucleotide gives it a higher informative value, while at the same time the deviation from equimolar fractions of nucleotides in DNA diminishes the coding capacity of the whole molecule. Perhaps mathematical analysis of this phenomenon, taking into consideration the properties of the universal genetic code, will show that the optimum for information transfer by the DNA molecule lies at exactly such points. Further studies may reveal other properties of these strategic points, where τ determines a very specific balance between the DNA composition and mutational pressure.
Keeping in mind the precise relation between the fraction of a nucleotide and its turnover time, symmetric DNA (with A=T and G=C) is a specific case where the turnover times of the nucleotides in each pair are equal. Asking which type of substitution should be blamed for the DNA asymmetry makes sense for the mutational pressure exerted on DNA released from selection pressure. Now a simple test for such a mutational pressure is available: it should generate DNA in equilibrium whose nucleotide composition fulfils the rule of linear interdependence between the sizes of the nucleotide fractions and their turnover times.
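The test described above can be sketched in Python as follows: compute the equilibrium composition and the half times from a substitution matrix, then take the Pearson correlation between the two. This is an illustration under stated assumptions, not the authors' procedure. The matrix is invented, the function names are hypothetical, and an arbitrary matrix need not show the high correlation that the paper reports for empirically measured matrices.

```python
import math

NUCS = "AGCT"

# Hypothetical relative substitution probabilities p(i>j); NOT the BbTS values.
P_EXAMPLE = {
    "A": {"G": 0.10, "C": 0.04, "T": 0.06},
    "G": {"A": 0.16, "C": 0.03, "T": 0.09},
    "C": {"A": 0.09, "G": 0.04, "T": 0.20},
    "T": {"A": 0.07, "G": 0.06, "C": 0.06},
}


def equilibrium(p, tol=1e-12, max_iter=100_000):
    """Iterate the in/out flows until the composition stops changing."""
    comp = {n: 0.25 for n in NUCS}
    for _ in range(max_iter):
        new = {
            n: comp[n] * (1.0 - sum(p[n].values()))
            + sum(comp[m] * p[m][n] for m in NUCS if m != n)
            for n in NUCS
        }
        if max(abs(new[n] - comp[n]) for n in NUCS) < tol:
            return new
        comp = new
    return comp


def half_times(p, p_mut=1.0):
    """tau_N = ln2 / (p_mut * sum of p(N>x)); p_mut only rescales all values."""
    return {n: math.log(2) / (p_mut * sum(p[n].values())) for n in NUCS}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def tau_composition_correlation(p):
    """Correlation between half times and equilibrium fractions for matrix p."""
    comp, tau = equilibrium(p), half_times(p)
    return pearson([tau[n] for n in NUCS], [comp[n] for n in NUCS])


if __name__ == "__main__":
    print(tau_composition_correlation(P_EXAMPLE))
```

With only four nucleotides the coefficient rests on four points, so it is best read alongside the fitted line rather than as a significance test on its own.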
Substitution matrices make it possible to calculate the DNA composition in equilibrium with a given mutational pressure. It is possible to test whether a given substitution matrix is a pure mutational matrix or whether it is "contaminated" with the effects of selection. The difference between the DNA composition in equilibrium with mutational pressure and a DNA sequence under both mutational and selection pressures allows estimation of the effect of the selection pressure exerted on that particular sequence.
Materials and Methods
Construction of the substitution table
To estimate the frequency of substitutions, we have analysed the differences between coding sequences of the B. burgdorferi genome and sequences homologous to them found in the intergenic regions. For the data, see Additional file 1. The sequence of the B. burgdorferi genome was downloaded from http://www.ncbi.nlm.nih.gov. For our analysis, we extracted all intergenic sequences longer than 90 nucleotides and translated them into amino acid sequences in all six reading frames. The amber and ochre stop codons were translated as tyrosine residues and opal as tryptophan. Then we searched databases for homology with the B. burgdorferi protein sequences using the FASTA program. For detailed amino acid alignment data see Additional file 2. After selecting homologues (with E < 0.05) whose previously (presumably) coding strands were duplicated on the leading strand, we aligned the nucleotide sequences of these intergenic sequences with the sequences of the reference ORFs using the CLUSTAL X program and counted the nucleotide substitutions. The number of analysed alignment sites was 3737 and the average number of substitutions per site was 0.46. For detailed nucleotide alignment data see Additional file 3. The observed numbers of nucleotide substitutions from nucleotide i to j (where i and j stand for A, T, G or C, and i ≠ j) were converted to relative substitution frequencies according to Gojobori, Li, and Graur and Francino and Ochman. That allowed us to calculate the frequency of each of the twelve possible substitutions on the leading strand. Since the observed substitution rates were different for each of the four nucleotides, we introduced corrections for multiple substitutions and reversions for each type of substituted nucleotide instead of one general correction.
That is, for each type of nucleotide we calculated the observed fraction of substituted sites and used it to estimate the corrected number of substitutions according to Kimura's formula. The frequencies of substitutions, normalised in such a way that the sum of all 12 frequencies equals 1, are shown in Table 1 (Table of Substitutions – BbTS).
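The normalisation step can be illustrated with a short Python sketch. It assumes one plausible reading of the conversion to relative frequencies: each count is divided by the number of sites of the original nucleotide in the reference sequence, and the twelve values are then scaled to sum to 1. The counts and site numbers below are invented, the function name is hypothetical, and the per-nucleotide correction for multiple substitutions is deliberately left out.

```python
from itertools import permutations

NUCS = "AGCT"


def relative_frequencies(counts, ref_sites):
    """Scale substitution counts to twelve relative frequencies summing to 1.

    counts    -- {(i, j): observed number of i>j substitutions}
    ref_sites -- {i: number of sites with nucleotide i in the reference}
    """
    per_site = {}
    for i, j in permutations(NUCS, 2):  # the twelve ordered pairs with i != j
        per_site[(i, j)] = counts.get((i, j), 0) / ref_sites[i]
    total = sum(per_site.values())
    if total == 0:
        raise ValueError("no substitutions observed")
    return {pair: value / total for pair, value in per_site.items()}


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    example_counts = {("A", "G"): 30, ("G", "A"): 45, ("C", "T"): 60, ("T", "C"): 25}
    example_sites = {"A": 1300, "G": 700, "C": 500, "T": 1200}
    freqs = relative_frequencies(example_counts, example_sites)
    for (i, j), f in sorted(freqs.items()):
        print(f"{i}>{j}: {f:.3f}")
```

Dividing by the number of reference sites keeps a common nucleotide from appearing to mutate faster simply because there are more copies of it to change.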
Computer simulations were performed on DNA sequences corresponding to the real protein coding sequences of the B. burgdorferi leading strand or on random DNA sequences generated by computer. In the first case all ORFs longer than 100 codons situated on the leading strand of the Borrelia genome were spliced together. In the second case the DNA sequence was constructed by drawing consecutive nucleotides with probabilities given by the assumed composition of this artificial sequence.
During the simulation of mutational pressure, in one Monte Carlo Step (MCS) each nucleotide in the sequence was drawn with probability pmut and then substituted with the probability given by the substitution matrix. Note that a nucleotide drawn for substitution does not necessarily have to be substituted. After each MCS, each substitution was counted under its specific type; in addition, the evolving sequence was compared with the original sequence and the accumulated substitutions were counted. This allowed us to measure not only the divergence rate but also the mutation rate, subdivided into 12 classes. The dynamics of substitution were also calculated analytically using the equations describing first-order reaction kinetics. For more details on computing methods see Kowalczuk et al. [19, 32].
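The simulation loop described above might look like the following Python sketch. It is a reconstruction from the text, not the authors' code (see their references for that). The matrix values, sequence length, `p_mut` and all function names are assumptions made for illustration.

```python
import random
from collections import Counter

NUCS = "AGCT"

# Hypothetical relative substitution probabilities p(i>j); NOT the BbTS values.
P_EXAMPLE = {
    "A": {"G": 0.10, "C": 0.04, "T": 0.06},
    "G": {"A": 0.16, "C": 0.03, "T": 0.09},
    "C": {"A": 0.09, "G": 0.04, "T": 0.20},
    "T": {"A": 0.07, "G": 0.06, "C": 0.06},
}


def random_sequence(length, composition, rng):
    """Draw consecutive nucleotides with the assumed composition."""
    weights = [composition[n] for n in NUCS]
    return rng.choices(NUCS, weights=weights, k=length)


def monte_carlo_step(seq, p, p_mut, rng, counts):
    """One MCS: draw each site with probability p_mut, then maybe substitute it."""
    for pos, nuc in enumerate(seq):
        if rng.random() >= p_mut:
            continue  # this site was not drawn in this step
        r = rng.random()
        cumulative = 0.0
        for target, prob in p[nuc].items():
            cumulative += prob
            if r < cumulative:
                seq[pos] = target
                counts[nuc + ">" + target] += 1  # one of the 12 classes
                break
        # If no branch fired, the drawn nucleotide stays unchanged.


def divergence(seq, original):
    """Number of sites that differ from the original sequence."""
    return sum(a != b for a, b in zip(seq, original))


if __name__ == "__main__":
    rng = random.Random(0)
    original = random_sequence(10_000, {n: 0.25 for n in NUCS}, rng)
    seq, counts = list(original), Counter()
    for step in range(1, 201):
        monte_carlo_step(seq, P_EXAMPLE, 0.05, rng, counts)
        if step % 50 == 0:
            print(step, divergence(seq, original), sum(counts.values()))
```

The accumulated divergence can never exceed the total number of substitutions, because reversions and repeated hits at the same site are counted as substitutions but do not add new differences.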
Table of probabilities of nucleotide substitutions in the Borrelia burgdorferi leading DNA strand.
Funding sources: This work was supported by The State Committee for Scientific Research, grant numbers 6 P04A 025-18 and 6 P04A 016 20. P.M. was supported by Foundation for Polish Science.
References
1. Sueoka N: On the genetic basis of variation and heterogeneity of DNA base composition. Proc Natl Acad Sci USA. 1962, 48: 582-591.
2. Lobry JR: Asymmetric substitution patterns in the two DNA strands of bacteria. Mol Biol Evol. 1996, 13: 660-665.
3. Freeman JM, Plasterer TN, Smith TF, Mohr SC: Patterns of genome organization in bacteria. Science. 1998, 279: 1827-1830. 10.1126/science.279.5358.1827a.
4. Grigoriev A: Analyzing genomes with cumulative skew diagrams. Nucleic Acids Res. 1998, 26: 2286-2290. 10.1093/nar/26.10.2286.
5. McLean MJ, Wolfe KH, Devine K: Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes. J Mol Evol. 1998, 47: 691-696.
6. Mrazek J, Karlin S: Strand compositional asymmetry in bacterial and large viral genomes. Proc Natl Acad Sci USA. 1998, 95: 3720-3725. 10.1073/pnas.95.7.3720.
7. Francino MP, Ochman H: A comparative genomics approach to DNA asymmetry. Ann N Y Acad Sci. 1999, 870: 428-431.
8. Mackiewicz P, Gierlik A, Kowalczuk M, Dudek MR, Cebrat S: Asymmetry of nucleotide composition of prokaryotic chromosomes. J Appl Genet. 1999, 40: 1-14.
9. Rocha EP, Danchin A, Viari A: Universal replication biases in bacteria. Mol Microbiol. 1999, 32: 11-16. 10.1046/j.1365-2958.1999.01334.x.
10. Frank AC, Lobry JR: Asymmetric substitution patterns: a review of possible underlying mutational or selective mechanisms. Gene. 1999, 238: 65-77. 10.1016/S0378-1119(99)00297-8.
11. Mackiewicz P, Gierlik A, Kowalczuk M, Dudek MR, Cebrat S: How does replication-associated mutational pressure influence amino acid composition of proteins?. Genome Res. 1999, 9: 409-416.
12. Tillier ER, Collins RA: The contributions of replication orientation, gene direction, and signal sequences to base-composition asymmetries in bacterial genomes. J Mol Evol. 2000, 50: 249-257.
13. Voss R: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett. 1992, 68: 3805-3808. 10.1103/PhysRevLett.68.3805.
14. Gierlik A, Mackiewicz P, Kowalczuk M, Dudek MR, Cebrat S: Some hints on Open Reading Frame statistics – how ORF length depends on selection. Int J Modern Phys C. 1999, 10: 635-643. 10.1142/S0129183199000474.
15. Li WH, Wu CI, Luo CC: Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J Mol Evol. 1984, 21: 58-71.
16. Yang Z: Estimating the pattern of nucleotide substitution. J Mol Evol. 1994, 39: 105-111.
17. Mackiewicz P, Gierlik A, Kowalczuk M, Szczepanik D, Dudek MR, Cebrat S: Mechanisms generating long-range correlation in nucleotide composition of the Borrelia burgdorferi. Physica A. 1999, 273: 103-115. 10.1016/S0378-4371(99)00345-3.
18. Mackiewicz P, Mackiewicz D, Gierlik A, Kowalczuk M, Nowicka A, Dudkiewicz M, Dudek MR, Cebrat S: The Differential Killing of Genes by Inversions in Prokaryotic Genomes. J Mol Evol. 2001, 53: 615-621. 10.1007/s002390010248.
19. Kowalczuk M, Gierlik A, Mackiewicz P, Cebrat S, Dudek MR: Optimization of gene sequences under constant mutational pressure and selection. Physica A. 1999, 273: 116-131. 10.1016/S0378-4371(99)00346-5.
20. Tamura K: The rate and pattern of nucleotide substitution in Drosophila mitochondrial DNA. Mol Biol Evol. 1992, 9: 814-825.
21. Adachi J, Hasegawa M: Tempo and mode of synonymous substitutions in mitochondrial DNA of primates. Mol Biol Evol. 1996, 13: 200-208.
22. Zhang J: Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol Biol Evol. 1999, 16: 868-875.
23. Filipski J: Why the rate of silent codon substitutions is variable within a vertebrate's genome. J Theor Biol. 1988, 134: 159-164.
24. Wolfe KH, Sharp PM, Li WH: Mutation rates differ among regions of the mammalian genome. Nature. 1989, 337: 283-285. 10.1038/337283a0.
25. Matassi G, Sharp PM, Gautier C: Chromosomal location effects on gene sequence evolution in mammals. Curr Biol. 1999, 9: 786-791. 10.1016/S0960-9822(99)80361-3.
26. Fraser CM, Casjens S, Huang WM, Sutton GG, Clayton R, Lathigra R, White O, Ketchum KA, Dodson R, Hickey EK: Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi. Nature. 1997, 390: 580-586. 10.1038/37551.
27. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA. 1988, 85: 2444-2448.
28. Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ: Multiple sequence alignment with Clustal X. Trends Biochem Sci. 1998, 23: 403-405. 10.1016/S0968-0004(98)01285-7.
29. Gojobori T, Li WH, Graur D: Patterns of nucleotide substitution in pseudogenes and functional genes. J Mol Evol. 1982, 18: 360-369.
30. Francino MP, Ochman H: Strand symmetry around the beta-globin origin of replication in primates. Mol Biol Evol. 2000, 17: 416-422.
31. Kimura M: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 1980, 16: 111-120.
32. Kowalczuk M, Mackiewicz P, Mackiewicz D, Nowicka A, Dudkiewicz M, Dudek MR, Cebrat S: Multiple base substitution corrections in DNA sequence evolution. Int J Modern Phys C. 2001, 12: 1043-1053. 10.1142/S0129183101002279.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.