Genome wide exploration of the origin and evolution of amino acids
© Liu et al. 2010
Received: 15 February 2009
Accepted: 15 March 2010
Published: 15 March 2010
Even after years of exploration, the terrestrial origin of biomolecules remains unsolved and controversial. Observing the amino acid composition of proteins has become an alternative route to a global understanding of the mystery encoded in whole genomes and a way to seek clues to the origin of amino acids.
In this study, we statistically monitored the frequencies of the 20 alpha-amino acids in 549 taxa from the three kingdoms of life: archaebacteria, eubacteria, and eukaryotes. We found that the amino acids evolved independently in these three kingdoms; however, conserved linkages were observed in two groups of amino acids, (A, G, H, L, P, Q, R, and W) and (F, I, K, N, S, and Y). Moreover, the amino acids encoded by GC-poor codons (F, Y, N, K, I, and M) were found to "lose" usage in the progression from single-celled eukaryotes such as S. cerevisiae to H. sapiens, while the amino acids encoded by GC-rich codons (P, A, G, and W) were found to gain usage. These findings further support the co-evolution hypothesis of amino acids and genetic codes.
We proposed a new chronological order of the appearance of amino acids (L, A, V/E/G, S, I, K, T, R/D, P, N, F, Q, Y, M, H, W, C). Two conserved evolutionary paths of amino acids were also suggested: A→G→R→P and K→Y.
Whether life arose from proteins or from nucleic acids has been argued for nearly half a century. Putting this "chicken or egg" question aside, some problems remain unsolved. Which amino acid(s) appeared first in the prebiotic environment? What causes the different usage of amino acids in modern organisms? To address these questions, a number of hypotheses and theories, e.g. mutation drift and natural selection, have been proposed. Multiple factors, such as genetic codes, physicochemical properties, mutation-selection equilibrium, and amino acid biosynthesis, are likely related to the variation of amino acid usage in organisms [1, 2]. Since there is no geological evidence of the kind scientists normally use to chronicle the evolution of organisms, an alternative path is needed: seeking clues in currently living organisms.
Observation of amino acid composition in proteins has recently been applied as a statistical approach to investigating the evolution of genetic codes, the origin of amino acids [1, 2, 4–6], the co-evolution of amino acids and genetic codes, the evolution of protein families [8–10], the conservation of subcellular location, the prediction of protein secondary structure [12–14], the natural selection of protein charge, the correlation between gene expression level and protein function, the kinship of different taxa, the molecular mechanisms of dinosaur extinction, the lifestyles of organisms, and even the tracing of the Latest Universal Ancestor (LUA) of life [4–6]. Recently, some research groups have successfully used genomic information to monitor amino acid composition in connection with various biological phenomena [1, 5, 11, 17, 20]. Insight into the evolution of amino acids on a genomic scale can undoubtedly extend our knowledge of molecular evolution and the origin of life. In this study, 549 genomes from the three kingdoms of life were used to investigate statistically the patterns of amino acid usage during evolution. Clues to the origin of amino acids in the prebiotic environment and to their co-evolution with genetic codes were also explored.
Which amino acid(s) appeared first in the prebiotic environment? To address this question, we might go back to the first life form in the world. When the first simple life was formed, most amino acid biosynthesis processes had not yet become fully functional. The environment was the only source of amino acids and other fundamental biomolecules for life. As a consequence, the amino acid composition of early life was mainly determined by the amino acid content of the "prebiotic soup", with little or no bias in the selection of amino acids. It was assumed that the "early" amino acids had higher concentrations in the primitive environment than the "late" amino acids and thus made up a larger share of early life forms. Conversely, if the amino acid composition of the early life form can be estimated, it can be used to infer the amino acid concentrations in the environment and, from them, the chronological order of amino acid appearance.
It has been suggested earlier that amino acid composition was determined largely by existing genetic codes. In our study, the relationship between amino acids and codons was also examined. As shown in Figure 1, the amino acids with more codons are "favored" by proteins. This phenomenon was observed not only in eukaryotes, but also in most representatives of eubacteria and archaebacteria. Two six-codon amino acids, leucine and serine, are the most frequently used amino acids in all selected eukaryotic species. Arginine is also a six-codon amino acid, but its frequency of use is much lower than expected (ranking on average 9th in eukaryotes, 10th in archaebacteria, and 11th in eubacteria). The under-utilization of arginine is as yet mechanistically unclear, but it may be related to its physicochemical properties and its roles in protein function. All the four-codon amino acids (A, G, V, T, and P) fall in the middle zone, and most of the two-codon amino acids and all the one-codon amino acids are used less often.
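The codon-count relationship described above can be made concrete by computing the amino acid frequencies expected if all 61 sense codons of the standard genetic code were used equally often. The following is a minimal sketch, not the authors' code; the names `CODON_TABLE` and `uniform_codon_frequencies` are hypothetical, and the table is built from the usual compact encoding of the standard code (bases in TCAG order).

```python
from collections import Counter
from itertools import product

# Standard genetic code in compact form: codons enumerated with bases in
# T, C, A, G order (first base varies slowest); "*" marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def uniform_codon_frequencies():
    """Return the amino acid frequencies expected under uniform codon usage.

    Assumption (illustrative only): every sense codon is used equally often,
    so an amino acid's expected frequency is its codon count divided by 61.
    """
    counts = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
    total = sum(counts.values())  # 61 sense codons
    return {aa: n / total for aa, n in counts.items()}


if __name__ == "__main__":
    expected = uniform_codon_frequencies()
    # List amino acids from most to fewest codons.
    for aa, freq in sorted(expected.items(), key=lambda kv: -kv[1]):
        print(f"{aa}\t{freq:.4f}")
```

Under this assumption the six-codon amino acids (L, S, and R) each have an expected frequency of 6/61, which is the baseline against which the observed under-utilization of arginine can be read.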
Additionally, we calculated the correlation coefficients between the "random" amino acid frequencies that follow from uniform usage of the codons of the universal genetic code and the amino acid compositions of modern organisms (Additional File 1). As in previous findings, all eukaryotic representatives showed a higher correlation coefficient, indicating weak selection on the amino acid composition of proteins in eukaryotes. In eubacteria and archaea, however, the correlation coefficients varied from 0.05 to 0.9, suggesting that some microbes show significant selection of amino acids for their proteins. The substantial variation in selection pressure among microbes may be explained by factors such as particular living environments, frequent mutation, and rapid generation.
To gain an overview of how GC content could affect amino acid usage, we compared the GC% of the coding and non-coding regions in the whole genomes of eight organisms. Statistically, the coding regions in lower eukaryotes have a rather higher net GC content than the non-coding regions, but this is manifestly reversed in higher organisms (A. mellifera, D. rerio, M. musculus, and H. sapiens), and the net GC content of the coding regions decreases from lower to higher eukaryotes (Figure 3 and Additional File 4). However, our previous finding (Figure 2c) indicates that the usage of GC-rich codons increases from S. cerevisiae to H. sapiens. The decrease in G and C content in the coding regions of higher eukaryotic species might therefore come from a decrease in the usage of intermediate-GC codons (defined in ref. 24). Together, these observations suggest that GC-rich codons are favored in proteins even under the pressure of decreasing GC content.
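The GC% comparison between coding and non-coding regions rests on a simple per-sequence measurement, which might be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the helper names `gc_percent` and `codon_gc_count` are hypothetical, ambiguous bases such as N are simply ignored, and no threshold for "GC-rich" or "intermediate-GC" codons is implied.

```python
def gc_percent(sequence):
    """Return the percentage of G and C among the unambiguous bases.

    Assumption (illustrative only): characters other than A, C, G, and T
    (for example N) are skipped rather than counted.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


def codon_gc_count(codon):
    """Return how many of the three codon positions are G or C (0 to 3)."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    return sum(1 for b in codon.upper() if b in "GC")


if __name__ == "__main__":
    # Toy sequences, not taken from any genome in the study.
    coding = "ATGGCCGGCTGGTAA"
    non_coding = "ATATTTAACGATTAT"
    print(f"coding GC%:     {gc_percent(coding):.1f}")
    print(f"non-coding GC%: {gc_percent(non_coding):.1f}")
```

Applied separately to the concatenated coding and non-coding regions of a genome, the first function yields the two GC% values being compared; the second can be used to group codons by their G and C count.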
Our study agrees with previous research that statistical analysis of amino acid composition in proteins is a feasible route to a global understanding of the physiological function of living organisms and the mystery encoded in whole genomes. However, proper evaluation of "real" amino acid usage in a modern taxon may be affected by a series of factors, including the time scale of evolution, frequency of organism generation, diverse living environments, chronological order of amino acid appearance, bias of genetic codes, gene mutation frequency, mutation-selection equilibrium, preference for particular physicochemical properties, difficulty of biosynthesis, co-evolution of amino acids and genetic codes, incomplete annotation of genomes, the existence of "retired" genes and pseudogenes in genomes, and other as yet unrecognized factors. Many of these factors cannot currently be predicted or calculated and have therefore been ignored in this study. We conclude that statistical observation of amino acid composition in modern proteomes is an alternative means of broadening our current knowledge of the origin of life.
Whole-genome information for 539 prokaryotes (495 eubacteria and 44 archaebacteria) and 10 eukaryotic representatives (Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Apis mellifera, Danio rerio, Gallus gallus, Mus musculus, Pan troglodytes, and Homo sapiens), 549 taxa in total, was obtained from the NCBI genome resource. The taxonomy of the selected organisms, their unique NCBI entry IDs, and annotation versions are listed in Additional File 1.
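Given the annotated protein sequences of one genome, the amino acid composition can be tallied in a few lines. The sketch below is a hedged illustration, not the pipeline used in the study: it assumes the proteome is available as FASTA-formatted text, the function names `read_fasta_sequences` and `amino_acid_composition` are hypothetical, and residues outside the 20 standard amino acids (such as X or a trailing *) are ignored.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def read_fasta_sequences(lines):
    """Yield one sequence string per FASTA record from an iterable of lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if chunks:
                yield "".join(chunks)
                chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def amino_acid_composition(sequences):
    """Return the relative frequency of each standard amino acid.

    Assumption (illustrative only): non-standard symbols are skipped and
    frequencies are pooled over all residues of all proteins.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Toy records, not real proteins.
    toy = [">protein_1", "MKL", "LA", ">protein_2", "GGX*"]
    composition = amino_acid_composition(read_fasta_sequences(toy))
    print({aa: round(f, 3) for aa, f in composition.items() if f > 0})
```

For a real proteome file, passing an open file handle to `read_fasta_sequences` would work the same way, since it only iterates over lines.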
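The correlation coefficient was calculated as

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where $X$ and $Y$ are the two random amino acids; $\rho_{X,Y}$ is the correlation coefficient between $X$ and $Y$; $\mu_X$ and $\mu_Y$ are expected values; $\sigma_X$ and $\sigma_Y$ are standard deviations; $E$ is the expected value operator; and cov denotes covariance.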
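The symbols defined here correspond to the standard Pearson correlation coefficient, the covariance of X and Y divided by the product of their standard deviations. A minimal sketch of that calculation follows, assuming X and Y are given as two equal-length lists of frequencies (for example, the frequency of one amino acid across taxa); the function name `pearson_correlation` is hypothetical and the code is not taken from the study.

```python
import math


def pearson_correlation(x, y):
    """Return cov(X, Y) / (sigma_X * sigma_Y) for two equal-length samples.

    Population (divide-by-n) estimates are used for both the covariance and
    the standard deviations, so the factor n cancels in the ratio.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sigma_x * sigma_y)


if __name__ == "__main__":
    # Toy frequencies, not data from the study.
    freq_a = [0.08, 0.09, 0.11, 0.12]
    freq_g = [0.06, 0.07, 0.08, 0.09]
    print(f"rho = {pearson_correlation(freq_a, freq_g):.3f}")
```

The same function can compare an organism's observed composition with a reference vector of expected frequencies, provided both are listed in the same amino acid order.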
The project is supported by the National Natural Science Foundation of China (No. 20572061 and No. 20732004). Support from the Program for New Century Excellent Talents in University (NCET) of MOE (2006 to Ji ZL) is gratefully acknowledged.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.