Upstream plasticity and downstream robustness in evolution of molecular networks
© Maslov et al; licensee BioMed Central Ltd. 2004
Received: 10 October 2003
Accepted: 08 March 2004
Published: 08 March 2004
Gene duplication followed by the functional divergence of the resulting pair of paralogous proteins is a major force shaping molecular networks in living organisms. Recent species-wide data for protein-protein interactions and transcriptional regulations allow us to assess the effect of gene duplication on robustness and plasticity of these molecular networks.
We demonstrate that the transcriptional regulation of duplicated genes in baker's yeast Saccharomyces cerevisiae diverges fast so that on average they lose 3% of common transcription factors for every 1% divergence of their amino acid sequences. The set of protein-protein interaction partners of their protein products changes at a slower rate exhibiting a broad plateau for amino acid sequence similarity above 70%. The stability of functional roles of duplicated genes at such relatively low sequence similarity is further corroborated by their ability to substitute for each other in single gene knockout experiments in yeast and RNAi experiments in a nematode worm Caenorhabditis elegans. We also quantified the divergence rate of physical interaction neighborhoods of paralogous proteins in a bacterium Helicobacter pylori and a fly Drosophila melanogaster. However, in the absence of system-wide data on transcription factors' binding in these organisms we could not compare this rate to that of transcriptional regulation of duplicated genes.
For all molecular networks studied in this work we found that even the most distantly related paralogous proteins with amino acid sequence identities around 20% on average have more similar positions within a network than a randomly selected pair of proteins. For yeast we also found that the upstream regulation of genes evolves more rapidly than downstream functions of their protein products. This is in accordance with a view which puts regulatory changes as one of the main driving forces of the evolution. In this context a very important open question is to what extent our results obtained for homologous genes within a single species (paralogs) carries over to homologous proteins in different species (orthologs).
Biological processes are rarely performed by single isolated molecules. Instead, they typically involve a coordinated activity of many molecules forming a neighborhood in biomolecular networks. Changes in these networks are thus coupled to the evolution of new functions and functional relationships in the organism. Gene duplication is an important source of raw material for the molecular evolution . Immediately after a duplication event the pair of freshly duplicated genes is thought to be identical in both sequences and functional roles in the cell. However, with time their properties including their positions within molecular networks diverge. Here we quantify this divergence in the baker's yeast Saccharomyces cerevisiae using several recent system-wide data sets. To this end we measure: 1) The similarity of positions of duplicated genes in the transcription regulatory network  given by the number of transcription regulators that regulate both of them; 2) The similarity of the set of binding partners [3, 4] of their protein products, and their ability to substitute for each other in knock-out experiments . These measures reflect, correspondingly, the upstream and downstream properties of molecular networks around duplicated genes. We then repeat this analysis using species-wide data on protein interaction networks in a bacterium Helicobacter pylori  and a fruit fly Drosophila melanogaster , as well as a systematic RNAi gene inactivation assay  in a nematode worm Caenorhabditis elegans.
Results and discussion
Divergence of the upstream transcriptional regulation of duplicated genes in S. cerevisiae
Fig 2B shows the average value of the regulatory overlap as a function of PID. The regulatory overlap in this plot is normalized by a proxy to the ancestral connectivity of a gene, estimated as the total number of distinct transcription factors that are involved in regulation of at least one of the pair of proteins (see Fig 1). The correlation between the normalized regulatory overlap Ω reg and the PID is highly statistically significant: the Pearson correlation is 0.34 (P-value around 10-70 for 2275 data points). Even for the lowest value of PID = 20% the average Ω reg significantly exceeds its value in non-paralogous proteins. One interesting feature of the graph in Fig. 2B is that even pairs of proteins whose amino acid sequences are 100% identical to each other on average have only about 30% overlap in their upstream regulation. Such low regulatory overlap of recently duplicated genes can be partially attributed to false positives and false negatives present in the dataset of Ref.  (see Methods for extended discussion.) It might also be sometimes caused by an incomplete duplication of the upstream regulatory region of a gene, or by a burst of very rapid evolution of the regulatory region immediately following the duplication event. The second feature of the Fig. 2B is a gradual decline of the average regulatory overlap over the whole range of sequence similarities. The data in Fig. 2B can be fitted with an exponential decay with a rate corresponding to an average 3% loss of common regulators of a paralogous pair for every 1% decrease in their amino acid sequence identity. Thus already at PID = 80% about half of the common regulations present at PID = 100% are lost. The decline in the regulatory overlap at lower PIDs clearly visible in Fig. 2A,2B is in accord with a recently published analysis  of similarity between microarray profiles of paralogs. In fact, due to a more direct information about transcriptional regulation contained in the chip-on-chip dataset of Ref.  compared to microarray experiments, our analysis extends the gradual decline to much lower PID than was detected in Ref. . After we submitted this manuscript another group of authors  has reported a rapid decline in the number of shared regulatory motifs of duplicated genes. This study, carried out as a function of a much faster silent substitution rate K s , nicely complements our own findings. Indeed, in their analysis Papp et al.  logarithmically binned the K s into four broad bins: below 0.01, 0.01–0.1, 0.1–1, and above 1. Since the reliability of the measured silent substitution rate dramatically decreases at high values of K s , the whole long-time behavior (i.e. that for PID < 75% which in yeast roughly corresponds to K s > 1) of the regulatory overlap remained inaccessible to the analysis of Ref. .
Divergence in downstream functional roles of duplicated genes in S. cerevisiae
An alternative way to quantify the extent of divergence/redundancy of duplicated genes is to examine phenotypes of of null-mutants lacking one of them. A systematic gene-deletion study in yeast  was recently used  to compare the fraction of essential genes (so that their null-mutants have lethal phenotype) between genes with and without paralogs in the genome. It was found that the fraction of essential genes is approximately 4 times higher among singleton genes than among ones protected by a highly similar paralog. It was also demonstrated that such protection by a paralog persists down to rather low levels of its amino-acid sequence similarity (PID) with the deleted protein. In Fig. 3B we confirm these findings using a more recent and larger systematic study  of viability of null-mutants in yeast as well as demonstrate that the magnitude of this protective effect is the strongest in the nucleus, where the largest fraction of essential proteins resides. Notice that the fraction of essential proteins (especially that of nuclear proteins) shows a dramatic increase as the PID to their closest paralog falls below 50%. Thus paralogous proteins with sequence similarity above 50% can typically substitute for each other.
Having presented different measures of upstream and downstream divergence of duplicated genes in yeast S. cerevisiae we are now in a position to discuss them in a wider context. Comparing Fig. 2B to Figs 3A,3B one concludes that changes in the upstream regulation of duplicated genes happen more readily than changes in their downstream function. The overlap in the set of binding partners (Fig. 3A) and the ability of duplicates to substitute for each other (Fig. 3B) remain virtually constant down to PID of 70%, at which point their average regulatory overlap has dropped to about 40% of its maximum (Fig. 2B). To summarize: our results indicate that duplicated genes would still have the ability to partially substitute for downstream functions of each other even at the time when the repertoire of their regulatory connections has already substantially changed from its ancestral state before the duplication. Such genes would be less constrained in evolving new functions , and thus would contribute to a greater evolutionary plasticity of the network.
Functional redundancy of paralogous proteins from RNAi experiments on C. elegans
In the inset to Fig. 4 we kept all successfully cloned genepairs, while in the main panel we dropped those genepairs whose product was predicted  to target mRNA product of more than one gene in the genome (see Methods for more details). It is instructive that the fraction of essential genes as a function of PID shown in the inset to Fig. 4 has a well pronounced minimum around PID = 70% and then subsequently starts to rise for higher values of PID. The tentative explanation for this behavior is that unlike single-gene deletion technique used in yeast, the RNAi technique is based on RNA complementarity and can eliminate several different mRNAs with similar sequences. Therefore, paralogous genes with nearly identical DNA sequences prove to be useless from the point of view of protection against RNAi since their mRNA products would be eliminated at nearly the same rate as the intended targets. This neatly explains why in the inset to Fig. 4 the fraction of nonviable phenotypes for genes with a 100% identical paralog in the genome approaches that of unprotected genes without paralogous partners (keep in mind that in this plot we use amino acid sequence identity of proteins and not of their mRNA precursors.) This observation also reinforces the point of view that the decline in the fraction of essential genes vs PID shown in Figs 3B,4 is indeed caused by protective effects of paralogs and cannot be explained by a possible tendency of nonessential genes to duplicate more frequently.
Divergence of physical interactions of paralogous genes in H. pylori and D. melanogaster
The evolution of a biological organism modifies it on multiple levels ranging from sequences of individual molecules, to their coordinated activity in the cell (molecular networks), all the way up to the phenotype of the organism itself. While its manifestations both on the level of protein sequences and phenotypes are reasonably well documented, the data needed to quantify evolutionary changes taking place on the level of molecular networks have appeared only very recently. Systematic experiments such as high-throughput two hybrid assays of protein-protein interactions [3, 4, 6, 7], chip-on-chip studies of whole-genome binding of a large number of transcription factors , and whole-genome assays of inactivations of single genes  or proteins  allowed us to go beyond describing particular cases of evolution of molecular networks and look at its large scale dynamics.
For all molecular networks studied in this work we found that even the most distantly related paralogous proteins with amino acid sequence identities around 20% on average have more similar positions within a network than a randomly selected pair of proteins. That means that some pairs of paralogous proteins at least partially retain their functional redundancy for extremely long time after the duplication event.
Our results also indicate that the genetic regulation of paralogous proteins changes faster than both their amino acid sequences and the set of their protein interactions partners. It is tempting to extend this observation to pairs of homologous proteins in different species (orthologs) that diverged from each other as a result of a speciation (as opposed to a gene duplication) event. This would help to explain how species with very similar gene contents can evolve novel properties on a relatively short timescale. However, such an inter-species comparison of molecular networks has to wait for the appearance of whole-genome data on molecular networks in closely related model organisms.
The system-wide data describing the transcription regulatory network of yeast was taken from the Ref. , which reports the so-called "chip-on-chip" study of in-vivo binding of 106 transcription factors to upstream regulatory regions of genes encoding all 6270 of yeast proteins. Since the number of transcriptional regulators in this dataset is quite large, the probability that by pure chance the same transcription factor would be incorrectly detected among upstream regulators of both duplicated genes is small (of order of 1%). Thus the contribution of false positives of the dataset of Ref.  to the regulatory overlap Ω reg is quite insignificant. This allowed us to use a P-value cutoff equal to 10-2 (12854 regulations) less conservative than the 10-3 cutoff (4418 regulations) of Lee et al. . On the other hand, false positives (if present in the data) could significantly affect the average number of regulatory inputs of individual proteins used to normalize the regulatory overlap in Fig. 2B. However, we found that both the initial drop and the rate of exponential decay of the normalized regulatory remains virtually unchanged when Fig. 2B is repeated for different values of the P-value cutoff ranging from 10-2 to 10-4 (data not shown). In the same range of P-values the average number of regulations per gene changes six-fold (from 2 to 0.33)! This suggests that false positives are not a significant part of the experimental dataset of Ref.  at least up to 10-2, and validates the robust nature of parameters extracted from the Fig. 2B. In the analysis shown in Fig. 2 we have dropped 3 paralogous pairs sharing the same intergenic sequence since by design of the chip-on-chip experiment  such pairs would have 100% regulatory overlap. We also checked that Fig. 2A does not change significantly if one limits the analysis to genes without diverging promoters ensuring that a given intergenic could possibly regulate only one gene.
As a source of information about binding partners of yeast proteins we combined the data from two independent high-throughput two-hybrid experiments: the core dataset of Ito et al.  (806 interactions among 797 proteins) and the extended Uetz et al. dataset , downloaded from the website of this group (1446 interactions among 1340 proteins). The resulting network consists of 1734 proteins joined by 2111 non-redundant interactions. Using this combined dataset we found that even 100% identical proteins share on average only 30% of their binding partners. However, unlike for upstream regulation, the set of interaction partners of a protein is fully determined by its amino acid sequence. Therefore, an imperfect overlap in the set of binding partners of identical proteins has to be attributed to false positives/negatives inevitably present in high-throughput two-hybrid experiments. The relatively high rate of false negatives in genome-wide two-hybrid experiments is further corroborated by the fact that datasets used in our study coming from two independent experiments [3, 4] have only 141 interactions in common. The abundance of missing interactions makes the normalization of the interaction overlap impractical. That was the reason why unlike in Fig. 2B in Fig. 3A we used the raw (unnormalized) interaction overlap. To make sure that differences between Figs. 2B and 3A are not caused by differences in normalization we repeated them using various normalization schemes as well as altogether unnormalized (data not shown). We found that apart from the overall scale of the y-axis, changes in normalization do not affect exponential decay parameters of Figs 2B,3A.
The system-wide data on viability of S. cerevisiae null-mutants used in our study was obtained from Ref.  in which 1103 essential (non-viable null-mutants) and 4678 non-essential (viable null-mutants) yeast proteins were reported. The lists of viable and non-viable null-mutants as discovered in Ref.  were downloaded from the Saccharomyces Genome Database .
Our analysis of protective effects of paralogs in C. elegans is based on the set of 15587 viable and 1170 non-viable (embryonic or larval lethality or sterility) RNAi phenotypes reported in . The information about worm paralogs is obtained from the EuGenes database  and consists of 30036 paralogous pairs involving 10071 worm proteins (blastp with 10-30 cutoff and no requirements on the length of aligned region). In Fig. 4 we used 13884 RNAi phenotypes for which we were able to uniquely map the genepair name to the worm protein name used in EuGenes.
The two-hybrid assay of protein-protein inetractions in H. pylori  used in Fig. 5A contains 1465 interactions between 732 proteins, while there are only 260 paralogous pairs involving 140 proteins. As in yeast this set was obtained by blasting all protein sequences found in the fully sequenced genome against each other with a conservative E-value cutoff of 10-10 and leaving only pairs in which the aligned region constituted at least 80% of the length of a longer protein.
Finally, our analysis of the interaction overlap between paralogous proteins in D. melanogaster is based on the full dataset of the high-throughput two-hybrid experiment . It consists of 20671 protein-protein physical interactions involving 7002 of fly proteins obtained in. To generate Fig. 5B we also used the set of 16713 paralogous pairs involving 2827 fly proteins.
SM, KS, and KAE contributed to both the ideas and writing of the manuscript in close collaboration. Koon-Kiu Yan has generated blastp datasets for H. pylori, yeast, and fly, as well as performed the analysis of RNAi experiment shown in Fig. 4. All authors read and approved the manuscript.
Work at Brookhaven National Laboratory was carried out under Contract No. DE-AC02-98CH10886, Division of Material Science, U.S. Department of Energy. Two of us (K.E and K.S.) thank the Institute for Strongly Correlated and Complex Systems at Brookhaven National Laboratory for hospitality and financial support during visits when part of this work was completed. S.M. and K.S. acknowledge the support of the NSF grant PHY99-07949 (work at the KITP, University of California at Santa Barbara). We thank John Little for critically reviewing the manuscript.
- Ohno S: Evolution by gene duplication. 1970, Berlin: Springer-WerlagView ArticleGoogle Scholar
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I: Transcription Regulatory Networks in Saccharomyces cerevisiae. Science. 2002, 298: 799-804. 10.1126/science.1075090.View ArticlePubMedGoogle Scholar
- Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000, 403: 623-627. 10.1038/35001009.View ArticlePubMedGoogle Scholar
- Ito T, Chiba T, Ozawa R, Yoshida M, Hattori Ma, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA. 2001, 98: 4569-4574. 10.1073/pnas.061034498.PubMed CentralView ArticlePubMedGoogle Scholar
- Giaever G, Chu AM, Ni L, Connelly C, Riles L, Véronneau S, Dow S, Lucau-Danila A, Anderson K, André B: Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002, 418: 387-391. 10.1038/nature00935.View ArticlePubMedGoogle Scholar
- Rain JC, Selig L, Reuse HD, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V: The protein-protein interaction map of Helicobacter pylori. Nature. 2001, 409: 211-215. 10.1038/35051615.View ArticlePubMedGoogle Scholar
- Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E: A Protein Interaction Map of Drosophila melanogaster. Science. 2003, 302: 1727-1736. 10.1126/science.1090289.View ArticlePubMedGoogle Scholar
- Kamath RS, Fraser AG, Dong Y, Poulin G, Durbin R, Gotta M, Kanapink A, Le Bot N, Moreno S: Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature. 2003, 421: 231-237. 10.1038/nature01278.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1006/jmbi.1990.9999.View ArticlePubMedGoogle Scholar
- Steinmetz LM, Scharfe C, Deutschbauer AM, Mokranjac D, Herman ZS, Jones T, Chu AM, Giaever G, Prokisch H: Systematic screen for human disease genes in yeast. Nature Genetics. 2002, 31: 400-404.PubMedGoogle Scholar
- Gu Z, Nicolae D, Lu HH-S, Li W: Rapid divergence in expression between duplicate genes inferred from microarray data. Trends in Genetics. 2002, 18: 609-613. 10.1016/S0168-9525(02)02837-8.View ArticlePubMedGoogle Scholar
- Papp B, Pál C, Hurst LD: Evolution of cis-regulatory elements in duplicated genes of yeast. Trends in Genetics. 2003, 19: 417-422. 10.1016/S0168-9525(03)00174-4.View ArticlePubMedGoogle Scholar
- Wagner A: The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol. 2001, 18: 1283-1292.View ArticlePubMedGoogle Scholar
- Gu Z, Steinmetz LM, Gu X, Scharfe C, Davis RW, Li WH: Role of duplicate genes in genetic robustness against null mutations. Nature. 2003, 421: 63-66. 10.1038/nature01198.View ArticlePubMedGoogle Scholar
- Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV: Selection in the evolution of gene duplications. Genome Biology. 2002, 3 (2): RESEARCH0008.10008.9-10.1186/gb-2002-3-2-research0008.View ArticleGoogle Scholar
- After our study was completed we learned of a similar analysis submitted for publication, Conant GC, Wagner A: Duplicate genes and robustness to transient gene knockouts in Caenorhabditis elegans. Proc R Soc Lond B.
- Saccharomyces Genome Database,. [http://genome-www.stanford.edu/Saccharomyces]
- Gilbert DG: euGenes: a eucaryote genome information system. Nucleic Acids Res. 2002, 30: 145-148. 10.1093/nar/30.1.145.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.