Gene duplications contribute to the overrepresentation of interactions between proteins of a similar age

Background
The study of biological networks and how they have evolved is fundamental to our understanding of the cell. By investigating how proteins of different ages are connected in the protein interaction network, one can infer how that network has expanded in evolution, without the need for explicit reconstruction of ancestral networks. Studies that implement this approach show that proteins are often connected to proteins of a similar age, suggesting a simultaneous emergence of interacting proteins. There are several theories explaining this phenomenon, but despite the importance of gene duplication in genome evolution, none consider protein family dynamics as a contributing factor.

Results
In an S. cerevisiae protein interaction network we investigate to what extent edges that arise from duplication events contribute to the observed tendency to interact with proteins of a similar age. We find that part of this tendency is explained by interactions between paralogs. Age is usually defined on the level of protein families, rather than individual proteins, hence paralogs have the same age. The major contribution, however, is from interaction partners that are shared between paralogs. These interactions have most likely been conserved after a duplication event. To investigate to what extent a nearly neutral process of network growth can explain these results, we adjust a well-studied network growth model to incorporate protein families. Our model shows that the number of edges between paralogs can be amplified by subsequent duplication events, thus explaining the overrepresentation of interparalog edges in the data. The fact that interaction partners shared by paralogs are often of the same age as the paralogs does not arise naturally from our model and needs further investigation.

Conclusion
We amend previous theories that explain why proteins of a similar age prefer to interact by demonstrating that this observation can be partially explained by gene duplication events. There is an ongoing debate on whether the protein interaction network is predominantly shaped by duplication and subfunctionalization or whether network rewiring is most important. Our analyses of S. cerevisiae protein interaction networks demonstrate that duplications have influenced at least one property of the protein interaction network: how proteins of different ages are connected.


Figure S2: Overlap between different yeast PINs
A. Overlap in terms of proteins. B. Overlap in terms of interactions.

Removing proteins from families belonging to overrepresented functional categories entails removing a substantial number of nodes and edges. This does not decrease ∆D or ∆D new; rather, both values are higher in the reduced networks. This is striking because removing random nodes decreases ∆D and does not affect ∆D new (Figure S22). In the reduced networks, the fraction of edges that connect paralogs is 1.5- to 3-fold higher than in the original, unfiltered networks, which is reflected in both ∆D and ∆D new. In the LC and TAP networks, most interparalog edges connect members of two families: COG2319, a family of proteins that contain the extremely promiscuous WD40 repeat (26 interparalog edges in LC and 79 in TAP), and COG0724, a family of RNA-binding proteins that contain the RRM domain (20 interparalog edges in LC and 34 in TAP). In the reduced HTP network, there are 81 edges connecting members of COG0638, a family of alpha and beta subunits of the 20S proteasome.

Families in the AE/BE category, with homologs in Bacteria and Eukaryotes or in Archaea and Eukaryotes, are considered to be younger than families in the ABE category (homologs in all three Kingdoms). This means we assume that Archaea and Eukaryotes share a common ancestor and that proteins with homologs only in Bacteria result from the endosymbiosis event that led to the mitochondrion. However, loss of the protein in the ancestor of Archaea would give the same presence/absence pattern, so proteins in the AE/BE category may have been present in the Last Universal Common Ancestor but lost in either Archaea or Bacteria.
We calculate ∆D as well as our alternative measure ∆D new for the four PINs under alternative age classifications. We either lump the ABE and AE/BE categories into one ('age 3') or, being more stringent with respect to the assumption that Archaea and Eukaryotes share a common ancestor, consider any family with a homolog in Bacteria older than a family with a homolog in Archaea ('age 4'). We find that the positive ∆D is not caused by these specific assumptions. Moreover, we use more fine-grained categories for younger proteins, separating families with a homolog only in Ascomycota from the Fungi-specific families and families with homologs only in Opisthokonta from the Eukaryote-specific families, with ('age 6') and without ('age 5') assumptions regarding a shared ancestor of Archaea and Eukaryotes.

Figure S7: Dn/Ds ratios for different age groups
The frequency distribution of Dn/Ds ratios per age group shows faster sequence evolution for younger proteins than for older proteins.

Figure S9: Duplication events can increase the number of interactions between proteins of a similar age
Ovals represent proteins. Multiple copies of the same protein are in the same color.
A. Growth of a functional module by incorporating duplicates of subunits. Different colours indicate different proteins that belong to the same family. B. Conservation of an ancestral interaction in both paralogs. Purple ovals represent proteins that belong to one family and green ovals represent proteins that belong to another family. Different shades of a colour indicate different proteins that belong to the same family. Edges connecting these two families overlap. C. Co-duplication. Purple ovals represent proteins that belong to one family and green ovals represent proteins that belong to another family. Different shades of a colour indicate different proteins that belong to the same family. Edges connecting these two families do not overlap.

Each row of the table contains the mean overlap in interaction partners for a different class of protein pairs: pairs of homologous proteins ('Paralogs'), pairs of proteins that are not homologous but do have the same age ('Non paralogs, of the same age') and pairs of proteins that are not assigned to the same family ('Non paralogs'). The mean overlap is the average number of interaction partners shared by the protein pairs in each category. The relative overlap is calculated by dividing the absolute overlap by the maximum possible overlap, which is the degree of the protein with the lowest number of interaction partners: $\mathrm{relative\_overlap}_{x,y} = \mathrm{overlap}_{x,y} / \min(\mathrm{degree}_x, \mathrm{degree}_y)$. We compare the average relative overlap in interaction partners for the different categories of protein pairs. A Mann-Whitney test comparing the relative overlap in interaction partners of Paralogs vs. Non paralogs of the same age shows that paralogs share significantly more interaction partners than non paralogs of the same age (P ~ 0.0).

For each pair of paralogs in the network, we compared the age of the interaction partners to the age of the paralogs.
We find that interaction partners that are shared by paralogs (in total 4605 interaction partners for 11559 pairs) are more often of the same age as the paralogs than interaction partners that are not shared between paralogs.

The most likely scenario (the one requiring the smallest number of evolutionary events) in which gene duplication generates additional edges between two families of similar age is that a member A of one family duplicates and both daughters A' and A" keep the ancestral interaction with protein B from the other family. The two edges representing these interactions overlap, as both contain protein B. If, on the other hand, proteins from both families duplicate, the edges representing the interactions do not necessarily overlap: A' interacts with B' and A" interacts with B". For each family pair that occurs multiple times in the network (i.e. multiple edges exist between members of these families), we calculate the fraction of protein pairs that overlap. We find that for 80% of the family pairs, all protein pairs overlap (A'-B and A"-B, rightmost column in Table S14). If both families are of the same age, this fraction is much lower: 65%.
Results for the other networks are in separate tables below.
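The overlap calculation for family pairs can be illustrated with a minimal sketch. It is not the code used for the analysis: the function name `overlapping_fraction`, the input format (a list of protein pairs plus a protein-to-family dictionary) and the toy data are assumptions made for illustration. In particular, the sketch assumes that an edge counts as overlapping when it shares a protein with at least one other edge connecting the same two families.

```python
from collections import defaultdict

def overlapping_fraction(edges, family):
    """For every family pair connected by at least two edges, return the
    fraction of those edges that share a protein with another edge of
    the same family pair (assumed definition of 'overlapping')."""
    by_pair = defaultdict(list)
    for u, v in edges:
        fu, fv = family[u], family[v]
        if fu == fv:
            continue  # interparalog edges are not edges between two families
        by_pair[frozenset((fu, fv))].append((u, v))

    result = {}
    for pair, pair_edges in by_pair.items():
        if len(pair_edges) < 2:
            continue  # the family pair must occur multiple times
        n_overlap = 0
        for i, e in enumerate(pair_edges):
            # e overlaps if any other edge of this family pair shares a protein
            if any(set(e) & set(f) for j, f in enumerate(pair_edges) if j != i):
                n_overlap += 1
        result[pair] = n_overlap / len(pair_edges)
    return result

# Toy data (hypothetical): A1/A2 both kept the interaction with B1,
# whereas C and D co-duplicated into C1-D1 and C2-D2.
family = {"A1": "FamA", "A2": "FamA", "B1": "FamB",
          "C1": "FamC", "C2": "FamC", "D1": "FamD", "D2": "FamD"}
edges = [("A1", "B1"), ("A2", "B1"), ("C1", "D1"), ("C2", "D2")]
fractions = overlapping_fraction(edges, family)
```

In this toy network the FamA-FamB pair corresponds to the conserved-interaction scenario (both edges contain B1), while the FamC-FamD pair corresponds to co-duplication (no shared protein).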

Figure S17: Heatmap of interaction densities between age groups in a network generated by the model
The network for which the interaction densities are shown was generated using default parameter settings (p=0.2, q=0.7, a=0.2, s=0.5) and has a ΔD value of 0.54. Dark purple squares along the diagonal indicate a strong overrepresentation of interactions between proteins of the same age, which in this case boils down to interactions between proteins of the same family.

Figure S18: Heatmap of the fraction of edges that connects paralogs for all parameter combinations we tried in the model
The left panel is a continuous heatmap of the fraction of edges that connects paralogs for all parameter combinations we tried in the model. Blue indicates that a high proportion of all edges connects members of the same family. The right panel is the same heatmap in which each parameter combination is assigned to a category: fitting the data (bright green), too high (dark green) or too low (red).

In the data, removing interparalog edges from the network had little effect on ΔD. In the model networks, ΔD drops to zero after removing interparalog edges, indicating that the positive ΔD value strongly depends on interactions between paralogs. If we collapse the network into a network of protein families rather than proteins, we see a further decline in ΔD. This mainly reflects a bias in connectivity in the collapsed model networks: old families have a high degree. We do not observe this in real networks.
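The two network manipulations mentioned here, removing interparalog edges and collapsing the protein network into a network of families, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used for the figures: the function names, the edge-list input and the toy data are hypothetical, and the sketch assumes that the collapsed network keeps a single edge per family pair and drops edges within a family.

```python
def remove_interparalog_edges(edges, family):
    """Keep only the edges whose endpoints belong to different families."""
    return [(u, v) for u, v in edges if family[u] != family[v]]

def collapse_to_families(edges, family):
    """Collapse a protein network into a family network: one node per
    family and one edge per connected family pair (assumption: parallel
    edges are merged and within-family edges are dropped)."""
    collapsed = set()
    for u, v in edges:
        fu, fv = family[u], family[v]
        if fu != fv:
            collapsed.add(frozenset((fu, fv)))
    return collapsed

# Toy data (hypothetical): A1 and A2 are paralogs that interact with
# each other and share the interaction partner B1.
family = {"A1": "FamA", "A2": "FamA", "B1": "FamB", "C1": "FamC"}
edges = [("A1", "A2"), ("A1", "B1"), ("A2", "B1"), ("B1", "C1")]

no_paralog_edges = remove_interparalog_edges(edges, family)
family_network = collapse_to_families(edges, family)
```

Here the interparalog edge A1-A2 is removed in the first step, and the two edges A1-B1 and A2-B1 merge into the single family-level edge FamA-FamB in the second.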

Figure S20: Average relative overlap in interaction partners of paralogs.
Boxplot of the average relative overlap in interaction partners of paralogs for different levels of divergence after duplication (different values of q). Average relative overlap is the overlap divided by the maximum overlap (see Table S11). Boxes show the 25th and 75th percentiles of 20 runs, the error bars show the extreme values and the black line is the mean of 20 runs.
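As a minimal sketch of the relative overlap used here (the overlap divided by the maximum possible overlap, taken as the lower of the two degrees), the following helper functions work on a plain edge list. The function names and toy data are assumptions for illustration. The sketch also assumes that self-interactions are ignored and that pairs in which one protein has no interaction partners are returned as `None`.

```python
from collections import defaultdict

def build_neighbors(edges):
    """Build an undirected adjacency map: protein -> set of partners."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # assumption: self-interactions are ignored
            neighbors[u].add(v)
            neighbors[v].add(u)
    return neighbors

def relative_overlap(neighbors, x, y):
    """Shared interaction partners of x and y divided by
    min(degree x, degree y); None if either protein has degree 0."""
    nx = neighbors.get(x, set())
    ny = neighbors.get(y, set())
    max_overlap = min(len(nx), len(ny))
    if max_overlap == 0:
        return None
    return len(nx & ny) / max_overlap

# Toy data (hypothetical): P1 and P2 share both partners of P1.
edges = [("P1", "X"), ("P1", "Y"), ("P2", "X"), ("P2", "Y"), ("P2", "Z")]
neighbors = build_neighbors(edges)
shared = relative_overlap(neighbors, "P1", "P2")    # 2 shared / min(2, 3)
unshared = relative_overlap(neighbors, "P1", "Z")   # 0 shared / min(2, 1)
```

Averaging `relative_overlap` over all pairs in a category (for example all pairs of paralogs) gives the average relative overlap for that category.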

Figure S21: ΔD versus the average relative overlap in interaction partners between paralogs
Scatterplot of ΔD versus the average relative overlap in interaction partners between paralogs, for different levels of divergence after duplication (different values of q in green, p=0.2, a=0.2, s=0.5) and different yeast PINs (in grey). Average relative overlap is the overlap divided by the maximum overlap (see Table S11).

Figure S22: The effect of removing randomly selected nodes or edges on ΔD and ΔD new
We randomly selected a certain percentage (x-axes) of nodes (left panels) or edges (right panels) and removed them from the LC network.

Figure S23: ΔD new and new interaction densities between age groups in the original and collapsed protein interaction network

Figure S24: ΔD new for model networks before and after collapsing the network
As we did with real yeast PINs, we remove edges that result from duplication events from model networks and study the effect on ΔD new. A. ΔD new for model networks in different parameter conditions, for three versions of each network: all edges, no interparalog edges, and collapsed into a network of protein families. Default parameter conditions are p=0.2, q=0.7, s=0.5, a=0.2; each plot shows ΔD new values when one of these parameters is varied while the others are kept at their default values. The gray line is the ΔD new value of the yeast LC PIN. Boxes show the 25th and 75th percentiles of 20 runs, the error bars show the extreme values and the black line is the mean of 20 runs. B. ΔD new and interaction densities for a model network generated using default parameter settings (p=0.2, q=0.7, a=0.2, s=0.5), for the same three versions: all edges, no interparalog edges, and collapsed into a network of protein families.
Because ΔD new does not depend on differences in connectivity between age groups, we do not see a strong additional decline when we collapse the model network into families.

Figure S26: ΔD and ΔD new in model networks using age group sizes similar to those in the data
Default parameter conditions are p=0.2, q=0.7, s=0.5, a=0.2; each plot shows ΔD (top) and ΔD new (bottom) values when one of these parameters is varied while the others are kept at their default values. The gray line is the ΔD or ΔD new value, respectively, of the yeast LC PIN. Boxes show the 25th and 75th percentiles of 10 runs, the error bars show the extreme values and the black line is the mean of 10 runs. The size distribution of age groups affects neither ΔD (this figure is very similar to Figure 3 in the main text) nor ΔD new (similar to Figure S24).