Subfunctionalization of duplicated genes as a transition state to neofunctionalization

Background Gene duplication has been suggested to be an important process in the generation of evolutionary novelty. Neofunctionalization, as an adaptive process where one copy mutates into a function that was not present in the pre-duplication gene, is one mechanism that can lead to the retention of both copies. More recently, subfunctionalization, as a neutral process where the two copies partition the ancestral function, has been proposed as an alternative mechanism driving duplicate gene retention in organisms with small effective population sizes. The relative importance of these two processes is unclear. Results A set of lattice model genes that fold and bind to two peptide ligands with overlapping binding pockets, but not a third ligand present in the cell was designed. Each gene was duplicated in a model haploid species with a small constant population size and no recombination. One set of models allowed subfunctionalization of binding events following duplication, while another set did not allow subfunctionalization. Modeling under such conditions suggests that subfunctionalization plays an important role, but as a transition state to neofunctionalization rather than as a terminal fate of duplicated genes. There is no apparent selective pressure to maintain redundancy. Conclusion Subfunctionalization results in an increase in the preservation of duplicated gene copies, including those that are neofunctionalized, but never represents a substantial fraction of duplicate gene copies at any evolutionary time point and ultimately leads to neofunctionalization of those preserved copies. This conclusion also may reflect changes in gene function after duplication with time in real genomes.


Background
A number of mechanisms can generate duplicate copies of genes, ranging from single gene duplications to regional and whole genome duplications [1][2][3]. Large increases in gene number have been coupled to increases in organismal complexity and radiative divergence at several points in the history of metazoans including during the chordate/vertebrate transition and during the teleost fish divergence [1,4,5].
Metazoans differ from prokaryotes in their much smaller effective population sizes, where theory predicts that neutral stochastic processes will be relatively more important than adaptive processes in the expected case that adaptive mutations are rarer than nearly neutral mutations [6]. Large scale analyses, based upon the ratio of nonsynonymous to synonymous nucleotide substitution rates [7] or MacDonald-Kreitman statistics [8] have indicated small to intermediate degrees of positive selection (adaptive substitutions) in mammals, but these clearly do not represent the majority of substitutions. In such studies, it appears to be specific positions in protein-encoding genes, rather than the genes as a whole that are under positive selection [7]. Even examining substitution as a neutral walk through sequence in a folded protein (ignoring positive selection) has shown such a process to have fairly complex dynamics [9]. From this, it is relevant to examine population genomic phenomena, like the fates of duplicated genes, in the context of physical models of proteins. Further, it is not possible to systematically identify fates of real genes (subfunctionalization to the exclusion of neofunctionalization or vice-versa), so modeling under increasingly realistic conditions is likely to be the best way to understand evolutionary mechanisms.
Pseudogenization or nonfunctionalization is a purely neutral process that ultimately eliminates one of the duplicated copies as a functional gene and is the most common fate. Subfunctionalization, is an alternative neutral process that leads to an increase in organismal gene number for genes or functions that show modularity (one Eight different fates are possible in the two simulations, nonfunctionalization (pseduogenization) without cellular death, sub-functionalization, pleiotropic neofunctionalization plus either nonfunctionalization of the other copy or subfunctionalization, non-pleiotropic neofunctionalization, non-pleiotropic neofunctionalization also involving subfunctionalization, redundancy, and cellular death Figure 1 Eight different fates are possible in the two simulations, nonfunctionalization (pseduogenization) without cellular death, subfunctionalization, pleiotropic neofunctionalization plus either nonfunctionalization of the other copy or subfunctionalization, non-pleiotropic neofunctionalization, non-pleiotropic neofunctionalization also involving subfunctionalization, redundancy, and cellular death. Some of the fates have additional combinations of activity that are not represented in this figure. It should also be noted that some of the characterizations overlap. For example, pleiotropic neofunctionalization occurs in combination with nonfunctionalization, subfunctionalization, or is redundant.
representative type of modularity is modeled here, but other types are also possible). Neofunctionalization is an alternative process leading to an increase in organismal gene number, but dependent upon rarer adaptive mutations. Neofunctionalization can include the evolution of a completely new binding capability (as modeled here) or modification/improvement of existing binding capabilities under positive selection after removal of pleiotropic constraint. These alternative fates are presented in the context of a lattice model in Figure 1.
Lattices are models of folded proteins in square or cubic shapes (a cubic lattice was employed here). The folding of a lattice is dictated by the contacts from amino acids that are not adjacent in the primary sequence (these contacts are present in the folded and unfolded states). Because lattices are small and the folding rules are simple, they can be used for evolving populations of proteins to study their structural properties.
Lattice models have previously been used to make important predictions about the behavior of proteins in evolutionary contexts, including their metastability [10] and the evolvability of new folds [11]. Lattices that bind to peptides [12] and small hydrophobic molecules [13] have been described and the latter used to show that subfunc- tionalization can lead to an increase in duplicate gene retention rates. Here, model genes that fold into lattices and bind peptides were duplicated, with neofunctionalization and subfunctionalization (simulation A) or just neofunctionalization (simulation B) as possible events that would preserve duplicated copies in a genome, with nonfunctionalization (pseudogenization) as an alternative fate (see Figure 1 and Table 1). The relative levels of duplicate gene preservation and the importance of both neofunctionalization and subfunctionalization were assessed.

Results and discussion
A set of 10 stably folded lattices was designed to each bind to 2 different ligands at overlapping sites. A third ligand was present in the cell, but did not bind at either site at the start of the simulation. The lattice was duplicated in a con-stant population of 1000 cells, where those cells that bound the third ligand were 5% more likely to appear in the next generation (a selection coefficient of 5% is arbitrary, but only serves as a scaler of the results). In each generation, 10% of molecules became nonfunctional at random through transcriptional knock-out. The fitness function required molecules to fold and genomes to have binding capabilities for the first two ligands. Cells were selected under the constraint that the first and second ligands needed to be bound, but could be bound by either molecule (subfunctionalization possible) in simulation A. The second simulation (simulation B) tightened this constraint and required both ligands to be bound to the same molecule (subfunctionalization not possible). In simulation B, only neofunctionalization is possible as a mechanism to preserve both copies non-redundantly. Neofunctionalization can occur through two mecha-  nisms, a pleiotropic mechanism where the third ligand binds at a site that is also capable of binding one of the other two ligands and a non-pleiotropic mechanism, where the third ligand binds to an inactive site. The average values of each fate (from 10 different lattices) in each of the two simulations are shown in Tables 2, 3.
While initially, both models generate similar levels of neofunctionalization, with time model A begins to show significantly more neofunctionalization. In model A, the total number of subfunctionalized genes, including those that have also neofunctionalized increased initially, but then reached a plateau. These results are shown in Figures  2 (neofunctionalization), 3 (subfunctionalization), and 4 (nonfunctionalization, including those that have also neofunctionalized on the other copy). It is clear that allowing subfunctionalization results in a greater retention rate of duplicate genes with less nonfunctionalization, although subfunctionalization without neofunctionalization never accounts for a large fraction of The total frequency of cells with neofunctionalized molecules (capable of utilizing ligand C) is shown for simulation A in green and simulation B in red Figure 2 The total frequency of cells with neofunctionalized molecules (capable of utilizing ligand C) is shown for simulation A in green and simulation B in red. With time, the total level of neofunctionalization in simulation A surpasses that of simulation B.
The frequency of subfunctionalization without neofunctional-ization is shown for simulation A, where subfunctionalization is allowed Figure 3 The frequency of subfunctionalization without neofunctionalization is shown for simulation A, where subfunctionalization is allowed. Subfunctionalization alone never represents a large fraction of cellular genomic fates.
The total rate of nonfunctionalization is shown for simulation A in green and simulation B in red Figure 4 The total rate of nonfunctionalization is shown for simulation A in green and simulation B in red. Simulation B shows a much higher rate of nonfunctionalization at all evolutionary times.
The total rate of duplicate copy retention without redun-dancy through either neofunctionalization or subfunctionali-zation is shown for simulation A in green and for simulation B in red Figure 5 The total rate of duplicate copy retention without redundancy through either neofunctionalization or subfunctionalization is shown for simulation A in green and for simulation B in red. Allowing subfunctionalization to occur results in a different retention profile, with a much higher rate of duplicate copy retention.
the duplicate genes at any point in evolutionary time (total terminal preservation of both duplicates is shown in Figure 5). Figure 5 indicates that the retention profile is completely different when subfunctionalization occurs compared to when it does not. It is also clear in these simulations that there is not a strong selective pressure to retain robustness through redundancy, as seen in Figure 6.
The role of subfunctionalization as a transition is based upon increasing the mutational space accessible to duplicates to neofunctionalize with removal of selective constraint at a binding site. This walk will differ for different lattices (and for different real proteins), modulating the importance of the effect of subfunctionalization. The rate of neofunctionalization in the absence of gene duplication (the emergence of new function in orthologs) is also related to the accessibility of this pleiotropic walk, but is expected to be even slower than that of neofunctionalization in the absence of subfunctionalization.
While it is not possible to systematically analyze duplicated fates and classify duplicated proteins as neofunctionalized, subfunctionalized, or redundant, this study has implications for our understanding of the role of duplication in the evolution of genomes. Protein segments that have lost function but are stably maintained in an expressed form will drift through sequence space until they achieve a function that makes them the targets of selection. Looking back to the origin of chordates, there is little doubt that gene duplication and the evolution of new function (as evidenced by annotation) went hand in hand. However, it may be that subfunctionalization initially played an important role in preserving copies that subsequently neofunctionalized over the past hundreds of millions of years.

Conclusion
Subfunctionalization has previously been shown to increase the retention rate of duplicate genes using a similar approach [13]. However, when neofunctionalization is included as a possible fate for duplicate genes, subfunctionalization is still important in short time frames after duplication. However, with increasing time, subfunctionalization decreases in importance and its role seems to be to preserve duplicate copies for eventual neofunctionalization, a role as a transition state. Subfunctionalization can still play an important role with larger finite population sizes, but the importance of neofunctionalization as a terminal fate becomes even more dramatic with increasing population size.

Lattice model for protein sequences
We considered a simplified model of evolving proteins. Our model consisted of a chain of 64 random (codon derived) amino acid monomers on a three dimensional 4 × 4 × 4 cubic lattice, simulating a folded protein. Gene sequences were selected randomly and lattices folded as below. Two adjacent binding sites were randomly selected on each lattice. Three peptides were then designed: two that bound specifically at each site and a third that was bound by neither site, as shown by the binding energies in Table 1.

Lattice folding and selection
Each amino acid was embedded at a single lattice point with distinct amino acids correspond to a distinct lattice point. All amino acids were considered to be of uniform size connected with covalent bonds of uniform lengths. A protein fold corresponded to a self avoiding walk over the embedding. The walking algorithm tracks the sites visited to avoid visiting them again. A contact was assumed to exist between two residues if they were not adjacently covalently connected but were on adjacent lattice points. The energy of the protein in a particular conformation was calculated according to the formula, where γ(A i , A j ) is the contact potential between residue type A i at position i and residue type A j at position j, and U ij is equal to one if residues i and j are not adjacent in sequence but are on adjacent lattice sites, and zero otherwise. The value of γ(A i , A j ) is obtained from the symmetric interaction matrix given by Miyazawa & Jernigan [14].
The frequency of redundant copies decreases with time in both simulation A (green) and simulation B (red) Figure 6 The frequency of redundant copies decreases with time in both simulation A (green) and simulation B (red). This occurs faster in simulation B, due to the faster rate of nonfunctionalization.

Evolution of lattice proteins
We have simulated two evolution models. Model A, corresponding to the evolution of a set of ten protein sequences (shown in Table 1) evolving to the alternative fates shown in Figure 1 and Model B corresponding to the evolution of the same set of proteins evolving without allowing subfunctionalization. Cells that did not bind ligands A and B (Model A) and ligands A and B in the same molecule (Model B) died. All molecules also needed to fold to be active.
In each model, we considered 1000 haploid cells that did not recombine between copies (independently for all ten gene duplicate pairs), with a protein molecule evolving according to a Poisson distribution with an average of 1 DNA mutation per gene per generation after the duplication event with a transition to transversion ratio of 2. After every generation, 10% of genes were knocked out at random to simulate mutations to transcriptional regulatory sequences and the cells were subsequently divided into the different fates shown in Figure 1. The next generation of cells was picked randomly from the living cells of the previous generation to keep a constant population size, with a 5% selective advantage to the neofunctionalized cells according to the Wright-Fisher selection model [15]. At the start of each generational round, each cell was viable with two gene copies to bind each set of ligands, as shown in Table 1.

Authors' contributions
SR wrote all programs and carried out the simulations and analysis. DAL conceived of the study, supervised its execution, and wrote the manuscript.