Differences in evolutionary pressure acting within highly conserved ortholog groups
© Przytycka et al; licensee BioMed Central Ltd. 2008
Received: 24 October 2007
Accepted: 17 July 2008
Published: 17 July 2008
In highly conserved, widely distributed ortholog groups, the main evolutionary force is assumed to be purifying selection that enforces sequence conservation, with most divergence occurring by accumulation of neutral substitutions. Using a set of ortholog groups from prokaryotes, with a single representative in each studied organism, we asked whether this evolutionary pressure acts similarly on different subgroups of orthologs defined by major lineages (e.g. Proteobacteria or Firmicutes).
Using correlations in entropy measures as a proxy for evolutionary pressure, we observed two distinct behaviors within our ortholog collection. The first subset of ortholog groups, called here informational, consisted mostly of proteins associated with information processing (i.e. translation, transcription, DNA replication), and the second, the non-informational ortholog groups, consisted mostly of proteins involved in metabolic pathways. The evolutionary pressure acting on non-informational proteins is more uniform relative to their informational counterparts. The non-informational proteins show a higher level of correlation between entropy profiles and more uniformity across subgroups.
The low correlation of entropy profiles in the informational ortholog groups suggests that the evolutionary pressure acting on the informational ortholog groups is not uniform across the different clades considered in this study. This might suggest "fine-tuning" of informational proteins in each lineage, leading to lineage-specific differences in selection. This, in turn, could make these proteins less exchangeable between lineages. In contrast, the uniformity of the selective pressure acting on the non-informational groups might allow the exchange of genetic material via lateral gene transfer.
Previous studies have shown that proteins are under purifying selection, which enforces a certain stasis in terms of sequence and function. Much less frequently they are subject to episodes of positive selection, which are typified by accelerated sequence divergence and corresponding functional shifts [1–6]. A basic assumption in molecular evolution is that the selective pressure represents functional constraints and is correlated with evolutionary conservation. Direct measurement of the functional constraints is not straightforward; however, its effects may be estimated through sequence conservation. For closely related species, selective pressure is usually measured using a nucleotide alignment and the ratio of non-synonymous over synonymous (silent) substitutions [7, 8]. For more divergent species, the purifying selective pressure can be measured through the imprint it makes on the multiple sequence alignment of proteins in an ortholog group.
We consider a set of ortholog groups that are conserved over a broad spectrum of prokaryotes. We additionally require that the selected proteins do not have paralogs in the studied organisms. Hence a change in biological function of the proteins within each group is unlikely. Within such an ortholog group, the general expectation is that the main evolutionary force is purifying selection, which is reflected as sequence conservation, with most divergence between the orthologs arising from neutral substitutions. Sequences in a given ortholog group can be further divided into subgroups, each corresponding to a different monophyletic lineage, for example Proteobacteria, Firmicutes, and Archaea. We were interested in understanding whether the selective pressure was similar across the different subgroups, and whether the measure of selective pressure acting on one subgroup is predictive of that acting on another within the same ortholog group. We present results of this study and provide evidence that Lateral Gene Transfer (LGT) might have a noticeable, apparently non-intuitive, effect on such extrapolations of selective pressure. We outline below the basic approach used in this study.
Selective pressure, defined as the influence of natural selection in enforcing conservation or in favoring divergence in protein or DNA sequence, has been observed to be the basis for conservation patterns across different subfamilies in a family of homologous molecules. It has been measured previously using the entropy of individual positions in a multiple sequence alignment as a proxy [9–11].
The ortholog groups used in this study satisfy the uniqueness condition, i.e. none of the proteins has a confounding paralog within the same genome that cannot be differentiated from the true ortholog. Therefore, it is prudent to assume that proteins from such an ortholog group perform a comparable function in the corresponding organisms. Consequently, significant deviations from uniformity in the pattern of sequence conservation in different subgroups can be attributed to differences in the effects of selection within these subgroups. We stress that observed differences in the sequence conservation patterns do not directly provide information on the causes of the inferred differences (for example in terms of variation in Ks/Kn ratio) nor on the reasons for which the constraints might be different in different clades. Possible reasons may vary from differences in the environment to lineage-specific "fine tuning" of proteins functioning as parts of multi-protein complexes. To emphasize the connection to evolution over longer distances, selective constraints measured in this way are also referred to as evolutionary pressure [10, 11].
We represent variability in sequence conservation by an entropy profile – a vector constructed from a multiple sequence alignment where the value at the ith position in the vector equals the entropy of the corresponding column in the alignment. Thus the entries of the entropy profile vary with the sequence conservation. Consequently, given two subfamilies, it can be tested whether their entropy profiles are correlated. Such correlation would be expected if both subfamilies were subjected to the same evolutionary pressure.
It is important to keep in mind that the entropy profile is shaped not only by selective constraints but also by evolutionary distances between the species. We control for this dependency by keeping the set of species fixed for all ortholog groups under study.
An ultimate test for uniformity of evolutionary constraints would require that one can predict the evolutionary pressure imposed on sequences in one subgroup X of an ortholog group based on information on a different subgroup, N (kNown), of this group. In this work, rather than predicting the entropy profile of X from the entropy profile of N, we focus on the more modest task of predicting the Pearson's correlation coefficient, cc(N, X), between them. The value of the Pearson's correlation coefficient of the entropy profiles of two subgroups of the same ortholog group provides a first estimate of the uniformity of the selective pressure between the groups. In contrast to evolutionary distance measures, the correlation coefficient depends not only on the amount of evolutionary change but also on the location of these changes in the protein sequence.
To delineate the relation between lateral gene transfer and universality of selective constraints, we identified the subset of ortholog groups with putative lateral gene transfers between different clades and examined specific properties of this set. Additionally, we performed a series of in-silico lateral gene transfers (s-LGTs). In these experiments a random member of a given subgroup is replaced with a random member from another subgroup of the same ortholog group. Then we measured the effect of such s-LGT on the correlation coefficient.
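The replacement step of an s-LGT experiment can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `s_lgt` and `simulate`, the representation of a subgroup as a list of aligned sequence strings, and the caller-supplied scoring function `score` are all assumptions made for the example.

```python
import random


def s_lgt(recipient, donor, rng):
    """Return a copy of `recipient` with one random member replaced.

    `recipient` and `donor` are lists of aligned sequences from two subgroups
    of the same ortholog group (hypothetical representation). One randomly
    chosen recipient sequence is replaced by one randomly chosen donor sequence.
    """
    if not recipient or not donor:
        raise ValueError("both subgroups must contain at least one sequence")
    transferred = list(recipient)  # leave the original subgroup untouched
    transferred[rng.randrange(len(transferred))] = rng.choice(donor)
    return transferred


def simulate(recipient, donor, score, n_sim=100, seed=0):
    """Apply `score` (a caller-supplied function, e.g. one that computes a
    correlation coefficient) to n_sim independent simulated transfers."""
    rng = random.Random(seed)
    return [score(s_lgt(recipient, donor, rng)) for _ in range(n_sim)]
```

For example, `simulate(G, A, score)` would correspond to repeated transfers from Archaea into the Gram positive subgroup, with `score` standing in for whatever statistic is recomputed after each transfer.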
2. Results and discussion
This study utilizes three prokaryotic clades: Archaea, Proteobacteria, and Gram positive bacteria. In this paper we use A, P, and G to denote the subgroup of an orthology group restricted to the given clade (Archaea, Proteobacteria, and Gram positive bacteria respectively). The ortholog groups were extracted based on the COG database and were filtered so that each ortholog group has a unique homolog in each of the selected organisms. Such a stringent restriction leads to a trade-off between the number of species in a clade and the number of ortholog groups in the study. After confirming high correlation between values of cc(N, X) for four and six species (R2 was 86, 78, 80 depending on N and X, [see Additional file 1]) we concluded that using the four-element clades should still provide reliable results and at the same time allow for considering a broader range of ortholog groups (see Methods). The set of 63 ortholog groups obtained in this way was divided further into the "informational groups" containing 37 ortholog groups associated with functions related to information processing (i.e. translation, ribosomal proteins, transcription, DNA replication) and the "non-informational groups" containing the 26 remaining ortholog groups, which are mostly proteins involved in metabolism (see Additional file 2 for full description).
2.1. Negative correlation of cc(N, X) and the relative root distance – global measurement
Table 1. Average correlation between entropy profiles of the various clades and average values of the entropy. P-values are computed based on the t-test. Column groups: correlation of entropy profiles (average cc non-informational, average cc informational, p-value for the difference); relative root distance (average r non-informational, average r informational, p-value for the difference); entropy (average entropy non-informational, average entropy informational, p-value for the difference). Row label: Gram Positive (G).
2.2. Dependency of cc(N, X) on sequence conservation in group N – local measurement
The previous test demonstrated a negative correlation between cc(N, X) and the relative root distance r(N, X), computed on the basis of pairwise distances between protein sequences in X and N. Next, we tested whether cc(N, X) is correlated with sequence divergence within the ortholog subgroup N (Figure 1b). For this purpose, we measured the correlation between the negated average entropy E(N) of the subgroup N and the value of cc(N, X), for all choices of N and X (six experiments). We performed the same set of experiments using the percentage of perfectly conserved columns in N, PC(N), instead of E(N). We found that the two measures are strongly correlated (R2 > 0.95 for all subgroups) and that the results obtained with either of the two measures were very consistent. Therefore, we focused on the relation between E(N) and cc(N, X). Out of the six experiments, only the pairs (E(P), cc(A, P)) and (E(G), cc(A, G)) were correlated with R2 > 0.1 (0.17 and 0.38 respectively).
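The two conservation measures for a subgroup N can be illustrated with a short sketch. The function names are hypothetical, and two details are assumptions not stated in the text: a column containing any gap is not counted as perfectly conserved, and E(N) is taken as the negated mean of an already computed per-column entropy profile.

```python
def percent_perfectly_conserved(alignment):
    """PC(N): percentage of alignment columns in which every sequence carries
    the same residue. Columns with a gap ('-') are not counted as conserved
    (an assumption of this sketch)."""
    columns = list(zip(*alignment))
    if not columns:
        raise ValueError("alignment has no columns")
    conserved = sum(1 for col in columns if "-" not in col and len(set(col)) == 1)
    return 100.0 * conserved / len(columns)


def negated_mean_entropy(entropy_profile):
    """E(N): negated average of the per-column entropies, so that larger
    values indicate stronger conservation."""
    if not entropy_profile:
        raise ValueError("entropy profile is empty")
    return -sum(entropy_profile) / len(entropy_profile)
```

Both quantities increase with conservation, which is why they can be used interchangeably when they are as strongly correlated as reported above.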
Subsequently, we focused on comparing average properties of informational and non-informational groups. Although, on average, the entropy of non-informational subgroups is higher than that of informational subgroups (and the percent conservation lower), the difference is not statistically significant. In contrast, the values of cc(N, X) are significantly higher for non-informational groups (Table 1). This clear difference between the two sets of ortholog groups is suggestive of non-uniformity of constraints on the informational groups. These constraints might preserve certain mutations specific to particular subgroups within the informational ortholog groups. Another striking observation was that for non-informational ortholog groups, the average correlation coefficient is approximately the same for all pairs of clades, suggesting an additional level of uniformity of these groups.
2.3. Uncovering the relation between the cc(N, X) for different pairs of subgroups – semi-local measurement
This provides yet more evidence for the observation that evolutionary pressure acts more uniformly on the non-informational groups than on the informational groups. These results also give further support to the observation that a significant fraction of the informational ortholog groups might be subject to lineage-specific evolutionary pressure. If so, this would imply that proteins in this group are not easily exchangeable between species through LGT. In contrast, the selective pressure acting on non-informational proteins is much more uniform and may more easily permit exchange of corresponding orthologs and corresponding xenologous displacement.
2.4. Lateral gene transfer and evolutionary pressure
Table 2. Correlation (R2 value) between correlation coefficients for ortholog groups with putative LGT. Columns: all groups; groups with LGT from A. Rows give the coefficient of determination (R2) between: cc(A, P) and cc(A, G); cc(G, P) and cc(G, A); cc(P, G) and cc(P, A).
2.5. In-silico Lateral Gene Transfers (s-LGT) elucidate unifying role of Lateral Gene Transfer
Table 3. Results of in silico LGT (s-LGT) from Archaea to one of the bacterial clades (A2G or A2P). R2 values for s-LGT are the average over 100 simulations. Each original pair of correlation coefficients is listed with its s-LGT counterpart: cc(A, G); cc(A, P) and cc(A, A2G); cc(A, P). cc(P, G); cc(P, A) and cc(P, A2G); cc(P, A). cc(G, P); cc(G, A) and cc(G, A2P); cc(G, A).
We considered the question of whether the evolutionary pressure within a set of ortholog groups acts uniformly in different subgroups of each ortholog group. We observed a distinctly different behavior of two subsets of our ortholog set, namely the informational and non-informational ortholog groups. We established, through a number of measurements, that the evolutionary pressure acting on non-informational proteins is more uniform relative to their informational counterparts. The informational proteins show a lower level of correlation between entropy profiles of the subgroups. In contrast, the non-informational groups display higher correlation between entropy profiles, and show more uniformity across subgroups. This uniformity of the evolutionary pressure acting on the non-informational groups might allow gene exchange through LGT. In contrast, the correlation between entropy profiles for informational groups is very low. This suggests that the informational proteins are not expected to be easily exchangeable between species. We confirmed that this is indeed the case for the ortholog groups under study. Thus the low level of lateral gene transfers for informational groups might be a result of distinct evolutionary pressure acting on these highly conserved groups.
It is important to bear in mind that the set of ortholog groups considered in this study is very biased: all groups are highly conserved, with a unique homolog in each genome used in the study. In particular, the behavior of the non-informational groups considered in this study does not necessarily generalize across all proteins that are excluded from the informational groups as defined here. The informational ortholog groups in our study do not cover all proteins related to information processing that are present in these organisms. A substantial fraction of these, such as specific transcription factors, are left out because they do not satisfy the conservation criteria used in our study.
Jain et al. proposed that genes involved in multi-protein complexes are expected to be rarely transferred. Indeed, the fact that many informational proteins function as parts of multi-protein complexes is likely to contribute to the unique selective constraints acting on them. However, a number of the ortholog groups in the non-informational set also form complexes with other proteins (e.g. NCAIR mutase, FGAM synthetase and glutamine amidotransferase domains, Molybdenum cofactor biosynthesis enzymes). Yet the properties of these proteins, such as the correlation between entropy profiles, are consistent with other proteins in the non-informational groups rather than with those in the informational groups. This indicates that the selective constraints in the informational groups have been imposed by a broader spectrum of conditions and not just by being present in complexes.
Alternatively, the difference in the correlation of entropy profiles could be a consequence of insufficient sampling of sequence space in the informational groups due to reduced LGT, rather than a signature of a difference in evolutionary pressure. Such a possibility cannot be completely excluded. However, it is unlikely that LGT is the primary cause of those differences. For example, the difference in the average entropy of the informational and non-informational sets is not statistically significant, so the level of exploration of the sequence space is not drastically different. Furthermore, LGT is a stochastic process, with both informational and non-informational groups having a comparable chance of being transferred. It is at the level of fixation of the transferred gene, which depends on its ability to function non-disruptively or to the advantage of the host system, that differences could arise between the two classes. Numerous cases of horizontal transfer were reported amongst informational proteins, although less frequently than in the non-informational set [15–18]. Thus LGT between informational proteins is readily possible. The difference in frequency of LGT is consistent with the proposed differences in selective constraints, which result in rejection of most such transfers in a certain subset of the informational genes.
These observations, combined with our results, suggest that the evolutionary pressure acting on the informational ortholog groups is not uniform across the different subgroups of organisms in this study. This might suggest "fine-tuning" of these informational group proteins in each lineage that makes them less exchangeable between lineages. In part this might relate to them functioning as parts of multi-protein complexes with several distinct conserved subunits. In contrast, the non-informational groups might not experience such lineage-specific differences in selective pressure, as they usually catalyze individual reactions in metabolic pathways, with the flux of substrates mediating most functional interactions between them.
List of organisms used in the study; set4org contains genomes used in ortholog groups that span four organisms in each subgroup (A, G, or P) and set6org contains genomes used in ortholog groups that span six organisms for each subgroup.
▶ Archaeoglobus fulgidus
▶ Halobacterium sp. NRC-1
▶ Methanococcus jannaschii
▶ Pyrobaculum aerophilum
▶ Caulobacter vibrioides
▶ Escherichia coli O157:H7
▶ Pseudomonas aeruginosa
▶ Sinorhizobium meliloti
▶ Bacillus subtilis
▶ Escherichia coli O157:H7
▶ Pseudomonas aeruginosa
▶ Sinorhizobium meliloti
Constructing the multiple sequence alignments
Constructing the entropy profiles and computing the correlation between amino-acid entropy profiles
The entropy of each column in an alignment is computed using AL2CO with default parameters. Each alignment is represented using a conservation profile of length equal to the length of the alignment, with the ith entry in the profile representing the entropy of residues in the ith position of the alignment, computed as $E(i) = -\sum_{a} f_a(i) \log f_a(i)$, where $f_a(i)$ is the frequency of amino acid a at position i. The entries are normalized by subtracting the mean and dividing by the standard deviation. In order to eliminate the impact of gapped positions in the alignments, we only consider those residue positions (in the alignment) with < 50% gaps in at least 2 of the 3 group alignments.
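The construction of a normalized entropy profile can be sketched as below. This is a plain Shannon-entropy illustration under stated assumptions, not a reimplementation of AL2CO: it uses unweighted residue frequencies, the natural logarithm, and ignores gap characters when counting residues, and all function names are hypothetical.

```python
import math
from collections import Counter

GAP = "-"


def column_entropy(column):
    """Shannon entropy (natural log) of the non-gap residues in one column."""
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0
    n = len(residues)
    return sum(-(k / n) * math.log(k / n) for k in Counter(residues).values())


def entropy_profile(alignment):
    """Raw entropy of each column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*alignment)]


def zscore(values):
    """Normalize by subtracting the mean and dividing by the standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


def usable_columns(subgroup_alignments, max_gap=0.5, min_groups=2):
    """Indices of columns with a gap fraction below max_gap in at least
    min_groups of the subgroup alignments (all aligned to the same columns)."""
    fractions = [[col.count(GAP) / len(col) for col in zip(*aln)]
                 for aln in subgroup_alignments]
    n_cols = len(fractions[0])
    return [i for i in range(n_cols)
            if sum(f[i] < max_gap for f in fractions) >= min_groups]
```

A profile for one subgroup would then be `zscore([p[i] for i in keep])`, where `p = entropy_profile(subgroup)` and `keep = usable_columns([A, P, G])`; the order of filtering and normalization is an assumption of this sketch.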
Each ortholog group under consideration is represented using a set of three entropy profiles corresponding to the three subgroups under consideration. The pairwise correlation of these 3 profiles is computed using Pearson's linear correlation coefficient.
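The pairwise comparison of the three profiles can be illustrated with a small sketch. The function names and the dictionary keyed by subgroup label are assumptions made for the example; the profiles are assumed to be restricted to the same alignment columns so that they have equal length.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def pairwise_cc(profiles):
    """Correlation for every pair of subgroup profiles.

    `profiles` maps a subgroup label (e.g. "A", "P", "G") to its entropy
    profile; the result maps each sorted label pair to its coefficient.
    """
    return {(a, b): pearson(profiles[a], profiles[b])
            for a, b in combinations(sorted(profiles), 2)}
```

Because Pearson's coefficient is invariant to shifting and rescaling each vector, the normalization of the profiles does not change the resulting values.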
Computing the evolutionary distance
For each ortholog group, the distance matrix is computed from the multiple sequence alignment obtained with MUSCLE using two approaches: (i) the maximum likelihood method using the quartet puzzling approach [21, 22], and (ii) the programs available in the ClustalW suite (no correction for multiple substitutions was applied), where the distances are computed based on percent identity scores. Positions with gaps are excluded as described in the subsection on computing entropy profiles. The computed distances were highly correlated (r2 between 0.95 and 0.98). Results included in the paper are obtained using the maximum likelihood method.
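The idea of an uncorrected, identity-based distance can be illustrated as follows. This sketch is not ClustalW's implementation: it assumes a simple pairwise treatment in which any column where either sequence has a gap is skipped, which is a simplification of the gap filter described above, and the function names are hypothetical.

```python
GAP = "-"


def p_distance(seq_a, seq_b):
    """Fraction of differing residues over columns where neither aligned
    sequence has a gap (1 minus fractional identity, no correction for
    multiple substitutions)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise p-distances for a list of aligned sequences."""
    n = len(alignment)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = p_distance(alignment[i], alignment[j])
    return matrix
```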
Lateral Gene Transfers
To identify putative LGT events, evolutionary trees for the (12-species) ortholog groups were constructed using the ClustalW suite, displayed with the TreeView program and checked manually for disagreements with the species tree. A putative transfer from Archaea was assigned if a single bacterial gene was found in the Archaeal clade.
This work was funded by the intramural research program of the National Library of Medicine, National Institutes of Health.
- Ohta T: Slightly deleterious mutant substitutions in evolution. Nature. 1973, 246 (5428): 96-98. 10.1038/246096a0.
- Kimura M: The Neutral Theory of Molecular Evolution. 1983, Cambridge University Press.
- Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sunyaev S: A universal trend of amino acid gain and loss in protein evolution. Nature. 2005, 433 (7026): 633-638. 10.1038/nature03306.
- Jordan IK, Kondrashov FA, Rogozin IB, Tatusov RL, Wolf YI, Koonin EV: Constant relative rate of protein evolution and detection of functional diversification among bacterial, archaeal and eukaryotic proteins. Genome biology. 2001, 2 (12): RESEARCH0053. 10.1186/gb-2001-2-12-research0053.
- Goodman M, Moore GW, Matsuda G: Darwinian evolution in the genealogy of haemoglobin. Nature. 1975, 253 (5493): 603-608. 10.1038/253603a0.
- Lynch M: The Origins of Genome Architecture. 2007, Sinauer Associates Inc, 1.
- Nekrutenko A, Makova KD, Li WH: The KA/KS Ratio Test for Assessing the Protein-Coding Potential of Genomic Regions: An Empirical and Simulation Study. Genome Res. 2002, 12 (1): 198-202. 10.1101/gr.200901.
- Hurst LD: The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 2002, 18 (9): 486. 10.1016/S0168-9525(02)02722-1.
- Reva B, Antipin Y, Sander C: Determinants of protein function revealed by combinatorial entropy optimization. Genome biology. 2007, 8 (11): R232. 10.1186/gb-2007-8-11-r232.
- Ye K, Vriend G, IJzerman AP: Tracing evolutionary pressure. Bioinformatics (Oxford, England). 2008, 24 (7): 908-915. 10.1093/bioinformatics/btn057.
- Guharoy M, Chakrabarti P: Conservation and relative importance of residues across protein-protein interfaces. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102 (43): 15447-15452. 10.1073/pnas.0505425102.
- Tatusov RL, Koonin EV, Lipman DJ: A Genomic Perspective on Protein Families. Science. 1997, 278 (5338): 631-637. 10.1126/science.278.5338.631.
- Koonin EV, Makarova KS, Aravind L: Horizontal gene transfer in prokaryotes: quantification and classification. Annual review of microbiology. 2001, 55: 709-742. 10.1146/annurev.micro.55.1.709.
- Jain R, Rivera MC, Lake JA: Horizontal gene transfer among genomes: The complexity hypothesis. PNAS. 1999, 96 (7): 3801-3806. 10.1073/pnas.96.7.3801.
- Brochier C, Philippe H, Moreira D: The evolutionary history of ribosomal protein RpS14: horizontal gene transfer at the heart of the ribosome. Trends Genet. 2000, 16 (12): 529-533. 10.1016/S0168-9525(00)02142-9.
- Brochier C, Bapteste E, Moreira D, Philippe H: Eubacterial phylogeny based on translational apparatus proteins. Trends Genet. 2002, 18 (1): 1-5. 10.1016/S0168-9525(01)02522-7.
- Iyer LM, Koonin EV, Aravind L: Evolution of bacterial RNA polymerase: implications for large-scale bacterial phylogeny, domain accretion, and horizontal gene transfer. Gene. 2004, 335: 73-88. 10.1016/j.gene.2004.03.017.
- Makarova KS, Ponomarev VA, Koonin EV: Two C or not two C: recurrent disruption of Zn-ribbons, gene duplication, lineage-specific gene loss, and horizontal gene transfer in evolution of bacterial ribosomal proteins. Genome biology. 2001, 2 (9): RESEARCH0033. 10.1186/gb-2001-2-9-research0033.
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res. 2004, 32 (5): 1792-1797. 10.1093/nar/gkh340.
- Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics (Oxford, England). 2001, 17 (8): 700-712. 10.1093/bioinformatics/17.8.700.
- Strimmer K, von Haeseler A: Likelihood-mapping: a simple method to visualize phylogenetic content of a sequence alignment. Proceedings of the National Academy of Sciences of the United States of America. 1997, 94 (13): 6815-6819. 10.1073/pnas.94.13.6815.
- Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics (Oxford, England). 2002, 18 (3): 502-504. 10.1093/bioinformatics/18.3.502.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research. 1994, 22 (22): 4673-4680. 10.1093/nar/22.22.4673.
- Page RD: TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci. 1996, 12 (4): 357-358.