The evolution of vertebrate tetraspanins: gene loss, retention, and massive positive selection after whole genome duplications
© Huang et al; licensee BioMed Central Ltd. 2010
Received: 12 May 2010
Accepted: 13 October 2010
Published: 13 October 2010
The vertebrate tetraspanin family has many features which make it suitable for preserving the imprint of ancient sequence evolution and amenable for phylogenomic analysis. So we believe that an in-depth analysis of the tetraspanin evolution not only provides more complete understanding of tetraspanin biology, but offers new insights into the influence of the two rounds of whole genome duplication (2R-WGD) at the origin of vertebrates.
A detailed phylogeny of vertebrate tetraspanins was constructed by using multiple lines of information, including sequence-based phylogenetics, key structural features, intron configuration and genomic synteny. In particular, a total of 38 modern tetraspanin ortholog lineages in bony vertebrates have been identified and subsequently classified into 17 ancestral lineages existing before 2R-WGD. Based on this phylogeny, we found that the ohnolog retention rate of tetraspanins after 2R-WGD was three times as the average (a rate similar to those of transcription factors and protein kinases). This high rate didn't increase the tetrapanin family size, but changed the family composition, possibly by displacing vertebrate-specific gene lineages with the lineages conserved across deuterostomes. We also found that the period from 2R-WGD to recent time is controlled by gene losses. Meanwhile, positive selection has been detected on 80% of the branches right after 2R-WGDs, which declines significantly on both magnitude and extensity on the following speciation branches. Notably, the loss of mammalian RDS2 is accompanied by strong positive selection on mammalian ROM1, possibly due to gene loss-induced compensatory evolution.
First, different from transcription factors and kinases, high duplicate retention rate after 2R-WGD didn't increase the tetraspanin family size but just reshaped the family composition. Second, the evolution of tetraspanins right after 2R-WGD had been impacted by a massive wave of gene loss and positive selection on coding sequences. Third, the lingering effect of 2R-WGD on tetraspanin gene loss and positive selection might last for 300-400 million years.
Gene duplication is a major force for gene functional innovation . However, most duplicate genes are doomed to loss, but gene retention may be facilitated by expression divergence, neofunctionalization or subfunctionalization [25–27]. Small-scale duplication (SSD) and whole genome duplication (WGD) differ in their influence on the fate of duplicate genes [28, 29]. For example, in yeasts and plants, duplicate genes from functional classes like transcription factors and kinases tend to be preferentially retained after WGD; also in yeasts, duplicate genes retained after WGD tend to diverge more quickly in expression regulation than in protein function [reviewed in ]. In vertebrates, two rounds of WGD (2R-WGD) happened at the origin of vertebrates, and a third round occurred specifically to teleost fishes (FSGD). Early genome-wide studies suggest that no increase of positive selection has been detected on coding sequences of retained duplicate genes right after FSGD [31, 32]. The same conclusion is also suggested to be applicable for 2R-WGD . Consistently, several studies on individual gene families find little positive selection right after 2R-WGD, although these studies are not specifically designed to address this issue [33–37]. These reports suggest that the rapid increase of biological complexity and novelties at the origin of vertebrates [38, 39], which is believed to be triggered by 2R-WGD, might not be initially attained through positive selection. However, we have reasons to suspect that 2R-WGD's effect on positive selection could be underestimated. First, the effect has not been carefully evaluated actually; second, since 2R-WGD is so ancient (450-500 Mya), even a massive positive selection happened, not every modern genes may still carry the selection imprint after such a long time.
The vertebrate tetraspanin superfamily has several advantages which make it suitable for preserving the imprint of ancient sequence evolution and amenable for phylogenomic analysis. It is a large family with 38 modern ortholog lineages in bony vertebrates (see Results). It has a rigid and compartmentalized protein architecture, which makes it free from domain reshuffling and internal module duplication, or in other words, which limits its structural evolution to amino acid substitutions, insertions and deletions. It maintains highly conserved coding sequences, structural motifs and intron configurations within sub-families. It has high gene retention rate and most of its members preserve traceable genomic syntenic relationships (see Results). Finally, its molecular function is to act as versatile organizers by interacting with various proteins at different affinities, and most of its members serve important but subtle, overlapping and non-essential functions. These biochemical properties make tetraspanins highly adaptable for functional changes.
In this study, we first reconstructed a detailed phylogeny for bony vertebrate tetraspanins by integrating multiple lines of information from sequence-based phylogenetics, key structural features, intron configuration and genomic synteny. Based on this phylogeny, we evaluated the impact of 2R-WGD on the following gene loss, retention and natural selection of the major lineages of vertebrate tetraspanins. In terms of selection tests, we used a stringent branch-site codon model, and contrasted duplication branches with speciation branches. Finally, we concluded that vertebrate tetraspanins had undergone a massive wave of gene losses and positive selection on coding sequences right after 2R-WGD.
2.1 Tetraspanin genes of deuterostomes
To evaluate the 2R-WGD effect on bony vertebrate genes, we require three aspects of phylogenomic information, including the closest invertebrate (outgroup) orthologs, ohnolog patterns, and molecular gene trees with dense taxa. Since we concentrate on bony vertebrates and most tetraspanin lineages are phylum-specific , this study is focused on deuterostome tetraspanins. To determine the major phylogenetic structure of vertebrate tetraspanins, we curated a reference set of tetraspanins from the genomes of human (33 genes), mouse (32), zebrafish (50), ascidian C. intestinalis (33), amphioxus B. floridae (39) and echinoderm S. purpurastus (29) (see Methods). Four molecular protein trees have been constructed from these reference sequences using ME, MP and ML methods (Additional file 1, Figure S1-4). To our knowledge, these reference sequences define the most complete tetraspanin repertoires from these species thus far (Additional file 2). Since the protein sequences between sub-families are so divergent (p-distance between many sub-families ranging from 0.75 to 0.9), these trees provide little reliable consensus on the relation between major sub-families. However, together with the information from previous reports [2, 21, 22], these trees clearly define 38 ortholog lineages for modern bony vertebrates (discussed later).
2.2 Phylogenetics of deuterostome invertebrate tetraspanins
2.3 A large tetraspanin gene cluster in amphioxus
2.4 Phylogenomic analysis of vertebrate tetraspanins
Classification of all vertebrate tetraspanin LELs
cysteine pattern of LEL
CD151, CD63, TSPAN3, TSPAN6, TSPAN12, TSPAN18, RDS, UPK1,
TSPAN4(the member CD53 secondarily lost cysteine No.4&5)
TSPAN8, some amphioxus and ascidian TSPAN8-like
types other than 6a, 6b1, 6b2
TSPAN13, some amphioxus TSPAN8-like
TSPAN5, TSPAN10, TSPAN15 and TSPAN33
It is worth noting that phylogenetic structures inferred in this study provides more complete information than our previous work , and can be quite different from Garcia-Espana et al's work . For the latter case we figured out three reasons. First, more genome sequences and more complete gene data set have been included; second, multiple tree reconstruction methods including more reliable methods like maximum-likelihood (ML) and Bayesian inference were used (see Methods); third, structural motifs, intron configurations and genomic synteny were integrated into the analyses in addition to sequence-based phylogenetics.
2.5 Seventeen ancestral vertebrate tetraspanin lineages
2.6 The monophyletic 8-cysteine super-lineage and the vertebrate TSPAN4 gene cluster
Although the inferred phylogenetic relations between 17 ancient lineages are much less supported, some reliable parts are nevertheless worth noting. Four 8-Cysteine-containing ancient lineages (TSPAN5, TSPAN10, TSPAN15 and TSPAN33), including eight ortholog sub-lineages, form a monophyletic super-lineage (Figure 4). This super-lineage has a special LEL structure defined by eight cysteines capable of forming four disulphite bonds. In fact, all metaozoan 8-Cys tetraspanins belong to this super-lineage with the oldest one from cnidarians , reflecting the ancient origin and independent evolution of this super-lineage. On the other hand, syntenic analysis indicated that the TSPAN4 cluster not only exists in amphioxus and ascidians, but also exists in bony vertebrates, except that in bony vertebrates the cluster has been quadrupled by 2R-WGD. Figure 3 shows the reconstructed pre-WGD status of the vertebrate TSPAN4 cluster. This gene cluster provides insights into the evolution of six ancestral lineages. Four lineages (CD151, TSPAN4, TSPAN18 and TSPAN8-like) should represent the earliest members of this cluster, whereas two other lineages (CD9 and CD37) should be derived from ancestral TSPAN8-like genes given structural similarity between them. Besides, the origin of uroplakin, CD63 and TSPAN13 seem to be related to this cluster. Lastly, more evidence and discussion on the evolution of these tetraspanins are presented in the text of additional file 1.
2.7 The ohnolog duplication patterns produced by 2R-WGD
The 2R-WGD has been confirmed by lines of convincing evidence [40–42]. Apparently due to WGDs, bony vertebrates have many more genes than invertebrates . Paralogs arisen from WGD are called "ohnologs" . Ideally, four ohnologs produced by two consecutive WGDs manifest a hierarchical pattern like ((A, B),(C, D)), but the pattern recovered from phylogenetic analysis is often violated by gene losses and unequal evolutionary pace in different duplicates. Apparently, all 38 tetraspanin lineages of bony vertebrates are paralogs derived from 17 pre-2R-WGD lineages, but not all paralogs from within an ancient lineage are ohnologs produced by 2R-WGD. In other words, some paralogs could be paralogs originated through SSDs. Here we used phylogenetic trees and syntenic data to determine SSD-derived paralogs, 2R-WGD-derived ohnologs and the ohnolog duplication patterns (Figure 4) (see Methods). More details on these patterns are presented in the legend of Figure 4.
2.8 Gene retentions and losses after 2R-WGD
Among 17 ancestral lineages, 13 retain two or more ohnologs and 6 of 13 retain three ohnologs. These 13 lineages include CD151, TSPAN3, TSPAN4, TSPAN6, TSPAN13, RDS/peripherin, TSPAN18, TSPAN5, TSPAN15, TSPAN33, CD9, CD37 and uroplakin, accounting for 32 of 38 tetrapanin lineages in bony vertebrates (Figure 4). This ohnolog retention rate is quite high, approximately three times the average rate (average rate is ~25%, see ) of all bony vertebrate genes, which is similar to the retention rate of transcription factors and protein kinases. In summary, of 21 new tetraspanin lineages in bony vertebrates (38-17 = 21), 19 are contributed by 2R-WGD and only 2 by SSDs (Figure 4). This contrasts the invertebrate situation, where SSDs control the family size tetraspanins (see Section 2.2). Remarkably, although the high ohnolog retention rate greatly increased the family sizes of vertebrate transcription factors and kinases [42, 44], it didn't increase the family size of bony vertebrate tetraspanins (for comparison: 33 members in human, 32 in mouse, 28 in chicken, 33 in C. intestinalis, 39 in B. floridae, 29 in S. purpurastus, 35 in D. melanogaster and 20 in C. elegans). Taken together, 2R-WGD and the high ohnolog retention rate reshape the tetraspanin family composition rather than increase the tetraspanin family size.
Although 75% (13 of 17) of the ancient tetraspanin lineages have ohnologous duplicates retained in bony vertebrates, nearly half (17*4-36 = 32) of the duplicates produced by 2R-WGD were lost before the radiation of bony vertebrates, not mentioning some lineages which might be lost without trace left. In contrast to this, meanwhile only two new lineages were acquired through SSDs. Gene losses are also observed to continue prevailing over gene gains in later evolution of bony vertebrates. For example, human lost six of 38 tetraspanin ortholog lineages and gained only one new gene (TSPAN32) through SSD; mouse lost eight and gained one (TSPAN32, excluding the TSPAN7-like pseudogene); chicken gained two new tetraspanins but probably lost twelve; amphibians probably lost three but gained one; zebrafish gained two through SSDs (pseudogene XP_692767 excluded), but even FSGD could not prevent it from losing four bony vertebrate tetraspanin lineages. So we concluded that it is gene losses but not gene gains which prevail in the evolutionary history of bony vertebrate tetraspanins since 2R-WGD.
2.9 Massive and extensive positive selection right after WGD
Several factors that complicate the analysis of deep evolution of vertebrates need to be assessed before performing selection tests. A large tree with many branches will improve the power of likelihood tests when sequence divergence is high . We sampled 24 fully-sequenced genomes and many sequences from other sources. Several mammalian genomes were excluded due to poor quality. Excessive mammalian species won't enhance the power and robustness of our tests because they don't break long internal branches and the real bottle neck in our tests is the lack of sequences from reptiles and amphibians. Any way, we compiled at least five mammal species, four teleost species and as many species as possible for reptiles and amphibians. So, to our knowledge, in most cases the taxon number used in our tests has exceeded early studies [31–37]. The structures of the reference trees used for selection tests have been justified by comparing information from Bayesian protein trees, Maximum likelihood trees, distance trees, and syntenic data. So, one may notice that the reference trees (Additional file 1, Figure S5-21) and the Bayesian protein trees (Additional file 2, Figure S37-53) have some discrepancies. In fact, selection tests using both types of trees delivered consistent results (data not shown), which is expected because the difference between two trees is caused by the lack of sufficient substitution information on the related branches and the lack of such information is unlikely to produce false positives. Highly divergent branches were removed before tests (but will be separately analyzed in the next section). Since we are testing multiple branches in one tree, a scheme of multiple-testing correction is required: first, because the evolution of duplicated genes may affect with each other right after WGD, all duplication branches in a tree are treated as one hypothesis; second, since the branch leading to the mammal class has been divided into three portions by amphibians and reptiles, all three portions of the branch are also treated as one hypothesis; third, as for the branch leading to the teleost class, it forms one hypothesis alone (Figure 5A). This scheme is expected to be conservative for duplication branches and relaxed for mammal branches. Hocheberg's method is used here (5% and 1%) , which is more stringent than FDR method use in  (10% and 1%). As for the distributions of codon usage, GC content and transition-transversion ratio, we analyzed them in three layers. First, within the vertebrate ortholog lineages, the difference could be presented between vertebrate classes. but these differences unlikely affect the accuracy and power of likelihood ratio tests (LRT) on branches inside the lineage according to the previous simulations . Second, between ohnolog lineages we detected no significant difference in most cases, which was expected because different lineages contained similar composition of species. Third, there was significant difference between invertebrate and vertebrate genes. Though we had not to test invertebrate branches, we performed tests using trees without invertebrates and confirmed that similar results could be obtained with or without invertebrate sequences (data not shown). The last complicating factor we need to consider is the possible dS saturation and model violation on divergent deep branches (duplication branches). Saturation of dS may not be a problem for PAML's likelihood tests [32, 45, 47], but which is often accompanied by model violation and hence may cause false positives . In the next section, we will show that these problems have little impact on our conclusions.
Statistics on branches under significant positive selection
No. of branches
Proportion of sites under positive selection average(median)
Significant LRTa (P < 0.05) (proportion)
Fisher's exact testb
Significant LRTa (P < 0.01) (proportion)
Fisher's exact testb
* 13 tested trees, CD63, TSPAN8, TSPAN10 and TSPAN12 excluded
B. mammalsc (portion 1,2,3 as a whole)
C. mammalsc (portion 1,2,3 as independent branches)
* 15 tested trees, TSPAN8 and TSPAN12 excluded
F. mammalsc (portion 1,2,3 as a whole)
G. mammalsc (portion 1,2,3 as independent branches)
* TSPAN8 and TSPAN12
(Though the branch-site model detects no positive selection, the site-specific model detects positive selection on TSPAN8)
J. speciation (as independent branches)
2.10 Further confirmation of the massive wave of positive selection after 2R-WGD
Since the statistic conclusion from last section is so different from those early studies using similar detection schemes for positive selection on vertebrate gene lineages [31–37] (also see the Introduction), we decided to thoroughly evaluate its robustness from multiple angles. In our tests, we ignored two trees (TSPAN8 and TSPAN12) due to the lack of proper outgroups and took into account two trees (CD63 and TSPAN10) that have fallen back to single gene status, but in fact, whether to count in these trees or not would not change our statistic conclusions (Additional file 3). To further verify the reliability of inference, we performed tests on twelve arbitrarily selected duplication branches using the F61 codon model (CodonFreq = 3 in PAML4), which is expected to contain the bias on GC content, codon usage and transition-transversion ratio better than the F3X4 model. Both models produced similar results, except that the F61 model caused a bit of power loss in LRT. Notably, the quality of sequences can affect the parameter estimation, especially when genomes are sequenced and assembled in varying quality [for example ], but after analyzing the branch-site likelihood tests, we figured out that sequence errors tend to increase more false positives on the terminal speciation branches than the deep internal duplication branches. In fact, since in this study the positive selection on duplication branches was much stronger and affected much more sites than on speciation branches, we expect that no matter what testing schemes or sequence problems are encountered, the duplication branches should be less affected than speciation branches.
Two factors, the dS saturation and the violation of the codon model assumption, will inflate false positive rates of likelihood ratio tests for positive selection [47, 49]. In our tetraspanin dataset, duplication branches produced by 2R-WGD are so ancient and divergent that they have much higher chance to suffer from saturation and model violation than speciation branches. Therefore, we have designed three schemes to confirm that the excessive positive selection detected on tetraspanins' duplication branches is evident and trust-worthy.
First, we ran simulations that reproduced the original dataset under nearly-neutral model to evaluate the false positive rates of positive selection on duplication branches (see Methods). We detected no positive selection (P < 0.05) on all 170 tested branches. To correction for the possible underestimation of dS substitution rate, we ran another simulation with branch length multiplied by 1.5. This time we detected positive selection (P < 0.05) on only 2 branches among all 170 tested branches. So, it is clear that dS saturation-induced false positive rate is not a big problem, which is consistent with early simulation analyses [32, 45, 47].
Second, we especially assessed those divergent branches in our tetraspanin dataset. There are four duplication branches with dS > 1.8, two of which are reported under positive selection (TSPAN14 and TSPAN15-like). Positive selection on the TSPAN14 branch is highly supported (P < 0.00002) and affects a large proportion of sites (48%), whereas dS of the TSPAN15-like branch is merely above 1.8 (≈1.87) and positive selection on it is also highly supported (P < 0.002). Moreover, after we removed branches with dS > 1.5 (15 branches) or dS > 1.2 (23 branches), our statistic conclusions still hold, suggesting that in our tetraspanin dataset, divergent branches did not show any higher false positive rate than those less divergent.
Third, we employed two protein-based approaches to detect evolutionary rate shifts on or between tetraspanin ohnolog lineages (implemented in software DIVERGE2  and RASER2 , see Methods for details). It is suggested that in theory amino acid-based methods for selection tests may better contain divergent sequences and dS saturation than codon-based methods . Using DIVERGE2, we detected significant functional divergence between all pairs of tetraspanin ohnolog lineages except the pair of TSPAN6/TSPAN7 (Additional file 4). It should be noted that there were also no positive selection detected on TSPAN6 and TSPAN7 using the branch-site codon model (Additional file 3). Using RASER2, we detected significant rate shifts on all tetraspanin ohnolog lineages (Additional file 5). Globally, protein-based methods produced much higher significance on each ohnolog lineages than codon-based methods (readers can compare Additional files 3-5 to see this point). Taken together, we concluded that: 1) protein-based methods seem less stringent than codon-based methods; 2) significant rate shifts did happen on the protein level on tetraspanin ohnolog lineages (corresponding to duplication branches), which strongly corroborates the results from the branch-site codon model.
2.11 Natural selection on the extremely long branches
The most divergent branches were not analyzed in Section 2.9, which include four branches originated by independent SSD (TSPAN4-like, TSPAN6-like, TSPAN32-like, CD9-like), two branches arisen from speciation (mammalian ROM1 and TSPAN16), and one branch produced by WGD (TSPAN19). We set up new tests on these branches by using several different models and phylogenetic contexts, but no positive selection could be detected on these branches over the full-length sequence. This outcome is not surprising because codon-based positive-selection models have been shown to have little power in dealing with highly divergent sequences . However, we observed that the evolutionary rate, transition-transversion ratio and dN/dS ratio could be different between the TM and the non-TM regions in some tetraspanin lineages (data not shown), so we redid tests on these branches with the TM and non-TM regions separated. This time, we detected positive selection in non-TM regions on two branches, the mammalian ROM1 branch (discussed later) and the CD9-like branch. All these tests are present in Additional file 3 (test No.462-671), and their corresponding reference trees are presented on Figure S22-36 (in Additional file 1).
2.12 Possible compensatory evolution of ROM1
The RDS family forks into three ohnolog lineages, RDS, ROM1 and RDS2. RDS is a major organizer for the photoreceptor outer segment (OS) architecture . Defects of RDS may cause blindness. ROM1 forms hetero-tetramers and -octamers with RDS in OS . A study of ROM1 -/- mice indicates that though ROM1 has a role in rod photoreceptor viability and in the regulation of disc morphogenesis, RDS alone is sufficient for both disc and outer segment morphogenesis . Defects of ROM1 may also cause photoreceptor degeneration but the phenotype is milder than that of RDS. As for RDS2, the function has not been explored.
The protein structure of RDS family is highly conserved and free of indels, suggesting strong purifying selection against indels. Positive selection can be detected on all duplication branches and affects 12~29% of the sites, whereas positive selection is much less on speciation branches (Addition file 3). A notable observation about the RDS family is the accelerated substitution rate of the mammalian ROM1 (mROM1) branch (Figure 5B), which is so extreme that it causes aberrant topology in the distance tree (Figure 4H), but the correct topology can be recovered by using ML or Bayesian methods (Figure S3 & S43 in Additional file 1). Analysis of the genes adjacent to the genomic location of mROM1 showed that the acceleration is restricted to mROM1, suggesting that regional effects are not responsible for the phenomenon. Since the high substitution rate of mROM1 produces no indels or poor alignment portions, we suspected that acceleration is directional or under certain constraints. The BEB site prediction method reports many sites under positive selection in non-TM regions on the mROM1 branch without significant LRT support. Prompted by this, we divided the alignment into TM and non-TM regions and re-tested the branch (Test No.384-461 in Additional file 3). The rationale of this scheme is that to increase the proportion of positive selection sites will increase the power of LRT . In this time, positive selection has been significantly detected in non-TM regions, affecting 48% of the sites and having 42 under-positive-selection sites predicted with > 95% probability. The likelihood test of positive selection will become much more significant if we reduced the interference from non-ROM1 sequences and model violations by restricting the tests to the sub-tree of bony vertebrate ROM1 (Test No.462-479 in Additional file 3).
To further find out the relation between the positive selection and the substitution acceleration, we applied the branch-specific model for further tests. These tests indicate that most synonymous substitutions on the mROM1 branch are accumulated in TM regions, whereas the synonymous substitution rate in non-TM regions has no difference from other branches (Figure 5C-D). In contrast, most nonsynonymous replacements on the mROM1 branch occur in non-TM regions (Figure 5E-F). These observations not only explain why positive selection is not significant across the whole sequence, but suggest that massive and rapid positive selection in non-TM regions of the mROM1 branch is driven by the acceleration of global substitution rate. However, it is still hard to comprehend why rapid and massive positive selection suddenly acted on the mROM1 branch. We proposed that it is due to compensatory evolution caused by the loss of mammalian RDS2 (see Discussion).
3.1 Distinct influence of 2R-WGD on family size and composition of vertebrate tetraspanins
It is reported that after 2R-WGD, only 25% of the ancestral chordate genes retained more than two ohnologs in modern vertebrates , but retention events are not random, instead, related to gene functional modes. For instance, genes with high retention rates are enriched in development, signal transduction and transcription regulation . The ohnolog retention rate of vertebrate tetraspanins is found to be 75%, three times the average and similar to that of transcription factors (TFs) and kinases, suggesting that new tetraspanins tended to be quickly recruited into novel biological processes after 2R-WGD. In line with this, in invertebrates, species-specific SSD-derived tetraspanins also had high retention rate, hence accounting for the majority of invertebrate tetraspanins (see Section 2.2). Why do tetraspanins have high duplicability and retention rate? It is proposed that genes with biochemical features permitting easy adaptation for new functions should have higher duplicability . And genes with secondary functions or properties more likely retain both duplicates and develop new functions from secondary properties . These hypotheses may explain the situation of tetraspanins, because tetraspanins are involved in various biological processes and interacting with various proteins at different affinities .
Frequent SSDs dominated the tetraspanin family sizes in invertebrates (see Section 2.2), and 2R-WGD could increase TFs and kinases by 2~4 folds. In contrast, neither 2R-WGD nor SSDs significantly increased the family size of vertebrate tetraspanins. Instead, high ohnolog retention rate just reshaped the family composition. So in these terms, influence of 2R-WGD on tetraspanins is quite different from those genes (TFs and kinases) with high retention rates or those genes with low retention rates. Now the question is: why did this happen? We find that only 5 of 17 ancient vertebrate tetraspanin lineages are vertebrate-specific, contrasting to the invertebrate situation where the majority of tetraspanins are species-specific. If assuming that vertebrate ancestors before 2R-WGD had roughly the same amount of tetraspanins as deuterostome invertebrates had (i.e., 29-39 genes), we may infer that 2R-WGD didn't increase the family size of vertebrate tetraspanins but changed the family composition by displacing those vertebrate-specific lineages with the lineages conserved across the deuterostome phylum. If this inference was true, then how did it happen? It is suggested that conserved genes and essential genes are favorable for duplication if initial gene dosage constraint is not a problem [55–57]. It is also proposed that WGD but not SSD is able to lift dosage constraint because dosage balance is not initially altered by WGD . Based these theories, we speculate that in invertebrates conserved tetraspanins are preferable for duplication but constrained by dosage barrier (which might be especially true in amphioxus, where frequent SSDs kept happening but all conserved tetraspanins remain single gene status). However, in vertebrates, since 2R-WGD lifted the dosage barrier, the duplicates of conserved tetraspanins might be able to displace non-conserved genes quickly.
3.2 Post-2R-WGD massive positive selection and its relation to gene retention and gene loss
In addition to the high ohnolog retention rates, we also detected a massive wave of positive selection in bony vertebrate tetraspanins right after 2R-WGD. This wave of positive selection impacted nearly 80% of the duplication branches and affected 18% of their sites on average. Rather than classify this selection wave as a peculiar case on tetraspanins, we argue that it occurred on genome-wide scale but some families like tetraspanins might preserve the signal of positive selection better than others. Positive selection were also detected on later speciation branches, but at much less extensity and magnitude, with many branches affected by 1~3 sites of strong signal. We theorize that such a few sites in most cases may not create entirely new functions, but likely contribute minor adaptation or optimization for the new function acquired in early evolution.
Initial retention of new tetraspanin ohnologs permitted subsequent differential changes on their coding regions. Early studies suggest that initial retention of duplicates after WGD is likely facilitated by dosage balancing selection, expression divergence and subfunctionalization [review in ]. In this course, Darwin's natural selection favored those adaptive changes (positive selection) and led to neofunctionalization. Although initial retention opens a time window for positive selection, it doesn't directly accelerate positive selection. Some other events like subfunctionalization and gene loss may facilitate positive selection. Initial subfunctionalization on duplicate genes may relax original selective constraints on coding regions and facilitate neofunctionalization (positive selection) in later sequence evolution [58, 59]. As for gene loss, it may also cause positive selection according to the dosage-compensation model. This model predicts that after WGD dosage effect strongly prevent the loss of the duplicated genes encoding interacting proteins in order to maintain the balance of the interaction network (note that tetrapanins are known to form huge "tetraspanin webs" with various proteins), but this model also predicts that once one of the interacting duplicates is lost, the remaining duplicated gene will be positively selected for . Therefore, gene loss may facilitate positive selection by triggering compensatory evolution on the remaining duplicate gene. The positive selection on mammalian ROM1 is possibly due to compensatory evolution triggered by the loss of mammalian RDS2.
Compensatory evolution commonly occurs within a gene, a protein complex or a network . Within a gene, deleterious mutations can be rectified or compensated by other mutations; within a complex or a network, defects of a component may be compensated by changes on other components. However, gene-for-gene compensation is rare. Here we propose a functional vacancy hypothesis to explain gene loss-induced positive selection. This model assumes that close related ohnologs or paralogs have diverged but partially overlapped function. So once an ohnolog is lost, other ohnolog may fill the functional vacancy, but since the replacing ohnolog is not as good as the original, directional (or positive) selective pressure may step in and drive further adaptive changes on the replacing ohnolog. This hypothesis may explain the positive selection on mammalian ROM1. Functional study demonstrates that RDS acquired its present function before the radiation of tetrapods . The branch pattern and length of RDS family suggests that before the mammalian speciation, RDS2 is more divergent from RDS and ROM1, whereas ROM1 is more similar to RDS and could be more dispensable or under stronger subfunctionalization. The dispensability of ROM1 has support from reptiles, where the loss of reptile ROM1 caused no discernable effects on the sequence evolution of reptile RDS and RDS2 (Figure 5B). Taken together, we speculate that the loss of mammalian RDS2 might cause an instant functional vacancy, which may drive the less important ROM1 to evolve adaptive changes to fill the vacancy.
Finally, we infer that 2R-WGD associated gene loss and positive selection could have affected the evolution of vertebrate tetraspanins for 300-400 million years or more. First, the early period right after 2R-WGD is ~100 million years, as the divergence of chordates is 520-680 Mya and that of bony vertebrates is ~420 Mya [62, 63]. Second, gene losses continued to prevail over gene gains in the later evolution of bony vertebrates (see Section 2.8), and the number of branches under positive selection declined in a gradual way along the evolutionary path from 2R-WGD to the mammal radiation (see Section 2.9). Both phenomena suggest that after the bony vertebrate radiation the influence of 2R-WGD kept declining but continued to act in later 200~300 million years. This lingering effect of 2R-WGD is not restricted to tetraspanins but likely presented on a genome-wide scale, because early studies indicate that the effect of 2R-WGD on the gene family size distribution is still detectable after the tetrapod radiation [64, 65].
3.3 The possible differences between 2R-WGD and FSGD
An early study shows that duplicate genes retained after FSGD bias to genes under purifying selection or relaxation of purifying selection . Studer et al's work further suggests that FSGD had no effect on the prevalence of positive selection . These conclusions are quite different from ours on 2R-WGD and tetraspnains (a massive wave of positive selection right after 2R-WGD). However, if disregarding the duplication branches, Studer et al's work and our work are more consistent than different: both studies provide similar estimation of the occurrence of positive selection on the mammal branches (12~20%, P < 0.05, portion 3 only, shown in Figure 5A) and on the bony vertebrate branches (43-50%, P < 0.05 equivalent to the teleost+tetrapod branches, shown in Figure 5A). This implies that the difference lies in duplication branches, in other words, lies between 2R-WGD and FSGD.
One difference between 2R-WGD and FSGD is their differential roles in the increase of biological complexity. 2R-WGD is believed to greatly increase the morphological complexity (including developmental boundaries and gene functional modules) in the transition from basal chordates to vertebrates . However, it appears less dramatic about the changes on morphological complexity before and after FSGD. In addition, 2R-WGD and FSGD seem to have different duplicate retention rate, for example, both tetraspanins and Hox genes had had 75% retention rate after 2R-WGD but 25-30% after FSGD. So it may not be surprising that 2R-WGD and FSGD had different effect on the prevalence of positive selection. One interesting possibility is that the lingering effect of 2R-WGD lasted 300-400 million years and hence eclipsing the effect of FSGD. An alternative possibility is that different choices of gene families for tests may make a different result. Specifically, our work focused on 38 genes from one family that probably suitable for preserving the imprint of ancient sequence evolution, whereas Studer et al's work concerned a set of 117 FSGD-related genes which are biased to genes of high sequence conservation and retention rate . Anyway, the final answer requires the analysis of more genes and gene families.
In this study, we demonstrate how the 2R-WGD affected the gene retention, losses and positive selection on the bony vertebrate tetraspanin superfamily. Based on these results, we argue that a gene family which has high duplicate retention rate after WGD may not lead to the increase of family size; instead, it may just reshape the family composition. This is a new finding about the effect of WGD on the fate of multi-gene families. We also detect a massive wave of gene losses and positive selection that struck vertebrate tetraspanins right after 2R-WGD. To our knowledge, this phenomenon (especially the massive wave of positive selection) has not been detected in early studies [31–37]. So, we argue that the effect of WGD on positive selection should not be underestimated, which could be significantly more extensive and intensive than that of speciation. We finally propose that although the influence of WGD are gradually diminished, the influencing time of WGD could be much longer than we thought before; in terms of tetraspanins, the lingering influence of 2R-WGD on gene losses and positive selection might last for 300-400 million years.
5.1 Identification of tetraspanin sequences
Predicted transcripts of 22 deuterostome genome sequences were downloaded from the Ensembl FTP site , including Dasypus novemcinctus (Dn), Equus caballus (Ec), Homo sapiens (Hs), Monodelphis domestica (Md), Mus musculus (Mm), Rattus norvegicus (Rn), Ornithorhynchus anatinus (Oa), Gallus gallus (Gg), Taeniopygia guttata (Tg), Anolis carolinensis (Ac), Xenopus tropicalis (Xt), Gasterosteus aculeatus (Gac), Danio rerio (Dr), Oryzias latipes (Ol), Takifugu rubripes (Tr), Tetraodon nigroviridis (Tn), Branchiostoma floridae (Bf), Ciona intestinalis (Ci), Ciona savignyi (Cs). Predicted transcripts of the genome of Strongylocentrotus purpuratus (Sp) were downloaded from the NCBI FTP site. Unique gene data sets of Xenopus laevis (Xl), Strongylocentrotus purpuratus (Sp), Ciona intestinalis (Ci), Ciona savignyi (Cs) were downloaded from the NCBI FTP site. Besides, some sequences from other species were also used, including Leucoraja erinacea (Le), Squalus acanthias (Sa), Meleagris gallopavo (Mg), Lachesis muta (Lm), Cyprinus carpio (Cc), salmo salar (Ss) and Oncorhynchus mykiss (Om). Identification of tetraspanin sequences from these data was performed by using the PSSM model pfam00335 and the stand-alone RPS-BLAST . In order to build a reference set of tetraspanin, we set out to find all tetraspanin sequences from human, mouse, zebrafish, C. intestinalis, B. floridae and S. purpuratus. Therefore, for these species, we also downloaded and scanned their ab initio predicted transcripts, expressed sequence tags and genomic sequences.
Because of the structure features (see Background), tetraspanin sequences generally have higher quality than many other gene families, but manual sequence inspection and correction is still needed. Fortunately, the sequence, protein architecture and intron positions of tetraspanins are highly conserved within ortholog lineages (in practice, we can divide tetraspanins in 4 TM, 2EC, and the large EC can be further divided into 4-6 portions (Figure 1B & 1C), so by dividing the sequence into these short portions it is easy to detect and correct the sequence problem. Three problems could be encountered: incorrect splicing sites, fragmented genes and sequencing gaps. Briefly, four procedures were performed: 1) FGENEH or FGENEH+ with close homologs were used to re-predict the transcript to fix incorrect splicing and fragmental genes; 2) mRNA/EST was used to correct sequences; 3) when multiple different sequences for a same gene were available, high-quality sequence from close species were used as referee to decide which one and or which portion was more accurate; 4) incomplete genes with sequencing gaps or not corrected genes were deleted.
5.2 Protein sequence-based phylogenetic reconstruction
Only the sequences of "tetraspanin core" (4TM+SEL+LEL) were used for alignment and tree reconstruction. Clustalw 1.83 with all default settings was used to produce multiple alignments. Unlike the previous analysis , in order not to introduce subjective bias, manual alignment editing was minimized. Minimum-evolution (ME) trees were built by using Mega v4.1 , with 1000 bootstrap tests and pairwise deletion. Both poisson correction model and p-distance model were used in the ME method. Maximum-likelihood (ML) method was conducted by using Phylip v3.65, with 100 bootstrap tests, Jones-Taylor-Thornton model, no gamma correction, and 2 times of jumble for Figure S3 (but 10 times for the analyses of 17 individual trees). Maximum parsimony (MP) method was conducted by using Phylip v3.65, with 100 bootstrap tests and 10 times of jumble. Bayesian protein trees were carried out using MrBayes v3.1, with amino acid substitution models recommended by the program. All other not mentioned parameters were set as defaults. In this study, ME trees were presented as bootstrapped original trees with branch length, whereas ML and MP trees were presented as bootstrapped consensus trees without branch length.
5.3 Codon usage, Intron configuration analysis and syntenic context analysis
Codon bias and GC contents was analyzed using CodonW program http://mobyle.pasteur.fr/cgi-bin/portal.py?form=codonw. Intron positions for each gene were identified by using stand-alone SPLIGN program http://www.ncbi.nlm.nih.gov/sutils/splign/ and marked on the protein sequences using a home-made Perl script. Syntenic context analysis was carried out manually on the UCSC gene sorter and genome browser http://genome.ucsc.edu/.
5.4 Reconstruction of ohnolog lineages and their duplicating patterns
Paralogs arisen from WGDs are called ohnologs. We selected all duplicate pairs, triplets or quartets that were created between the speciation of vertebrates and bony vertebrates, and then used the UCSC gene sorter and genome browser http://genome.ucsc.edu/ to determine if these duplicates are located on syntenic regions. Duplicates sharing synteny were considered ohnolog pairs or triplets (note: no ohnolog quatets found). The next step is to determine the duplicating patterns by constructing phylogenetic trees. The theoretical pattern of four ohnologs arising from 2R-WGD should be (AB)(CD), but 2R-WGD happened more than 500 Mya ago and the interval between 2R-WGDs was rather short, so divergence and gene loss may distorted the ideal pattern. To reconcile the duplicating patterns, we not only constructed trees from tetraspanin ohnologs, but also constructed trees from those ohnolog genes co-localized with tetraspanin genes. Likelihood ratio tests were used to compare the fitness of different phylogenetic topologies using PAML program .
5.5 Detection of positive selection and statistic analysis
Based on the phylogeny obtained from reference tetraspanin seqences, we produced extended alignments and phylogenetic trees for each ancestral vertebrate tetraspanin lineage (seventeen in total) by including more species. For each tetraspanin ortholog lineage, we included at least (but not limited) five mammal species, four teleost species, and as many species as possible for reptiles and amphibians. Because of the conserved protein sequence and architecture, it is easy to produce high-quality alignments for each tetraspanin lineages. Protein-based alignments were converted to nucleotide alignments by using RevTrans . To guarantee reliable likelihood tests, manual sequence correction was conducted on alignments. Reference sequences were used as criteria to detect potential problems by eyes, and the problematic sequences were either deleted or corrected, depending on their dispensability in an alignment. The alignment browser of Mega 4 was used as the basic tool for sequence editing . Highly divergent sequences were also deleted from the alignments.
PAML 4 program was used to detect positive selection . In a codon model, dN/dS > 1 (ω > 1) is considered as indicator of positive selection. In this study, the basic model (one-ratio), two site-specific models (nearly-neutral and positive-selection) and the branch-site model A were applied on each tree. The branch-site model A has been reported to be more powerful and sensitive than the site-specific model and the branch-specific models . In a test using branch-site model A, the branch being tested for positive selection is called the foreground branch, and the other branches in the tree are called the background branches. Likelihood ratio test (LRT) is used to evaluate whether significant positive selection happens on the foreground branch. Two LRTs may be constructed: 1) to compare the model A to the nearly-neutral model or 2) to compare the model A to the modified model A with ω = 1 fixed. The first LRT is known to have high false positive rate because it does not distinguish between positive selection and relaxed selective constraints , hence the second LRT is preferred. The test is done by comparing 2ΔlnL to χ12 or a 1:1 mixture of 0 and χ12. To guide against violations of model assumptions , the more conservative test with χ12 was used in this study. To avoid the known convergence problems (for example, sometimes the parameters may get stuck at a corner of parameter space), we ran the program twice with two set of parameters, the first set is the defaults, whereas the second set is the parameters inferred from the one-ratio model. Since multiple branches from the same tree have been tested, we used the Hochberg multiple testing correction method  to control the family-wise error rate. Except for the construction of reference trees, other processes were automated using a home-made Perl script. All statistic analysis following the evaluation of positive selection was conducted by using Microsoft Office Excel.
Those deep ancient branches (the duplication branches right after 2R-WGD) may be suffered from dS saturation, which in return may cause false detection of positive selection on these branches. To gain insight into the false positive rates on these branches, we used EvolveNSsite from PAML4 to generate alignments under the nearly-neutral (M1a) model , and ran selection tests on the duplication branches of these alignments using a stringent branch-site model as described in Section 5.5. Thirteen tetraspanin families which contain at least two duplication branches were used for simulation. For each family five alignments were simulated, using the same parameters derived from the corresponding family. These parameters include two categories, one derived directly from the original alignment (sequence number, sequence length, tree topology, and codon usage), the other inferred from the original alignment under the nearly-neutral (M1a) model using PAML4 (branch length, dN/dS ratio ω, Ts/Tv ratioκ, and the proportion of site with ω = 1). Furthermore, to correct for possible underestimation of the branch length, we generated and tested five additional simulated alignments for each family with all the same parameters except multiplying the branch length by 1.5.
5.7 Detection of rate shifts among sites using protein-based methods
There are higher chance for those deep ancient branches (the duplication branches right after 2R-WGD) to suffer from dS saturation and codon model violation. If so, the power and accuracy of selection tests on these branches based on codon models may be impaired . Alternatively, maximum-likelihood approaches based on protein sequences are able to better contain divergent sequences in theory , although as a trade-off, they lose part of the codon substitution information. To provide complement and confirmation to the codon models used here, we incorporated two protein-based maximum-likelihood approaches to detect the alteration of selection constraints (or the rate shift) on those tetraspanin ohnolog lineages produced by 2R-WGD. The first method is Gu's method (Gu99) that is implemented in the software DIVERGE2 , which detect specific amino acid sites under functional divergence between two paralog lineages after gene duplication [73, 74]. This method tests whether the coefficient of functional divergence θ of two selected lineages is significantly larger than 0. If so, then there is significant functional divergence between two lineages, and the sites contributed to divergence are also inferred using a probabilistic model. The second method is based on a covarion model that is implemented in the software RASER2, which allows raft shifts to vary among lineages and permits to identify the rate-shifting sites by empirical Beyesian inference . Here we used RASER2 to compare the lineage-specific model (invoked by using the parameter file "raser.stochasticMapping.params", and specifying a tetraspanin ohnolog lineage in this file, which thus enables rate shifts on the specified lineage) with the null model (invoked by using the parameter file "null.params", which does not enable rate shifts). All other parameters were set to defaults. To avoid the known convergence problems, we ran the program for at least twice for each lineage and each model on different platforms. Then likelihood-ratio tests (LRT) were performed to determine whether the lineage-specific model fit the data significantly better than the null model. If so, rate shifts on the specified lineage is significant. Protein-based analyses were only conducted on duplication branches (or ohnolog lineages) from 13 tetraspanin families with at least two duplication branches. Used alignments and tree topologies are the same as used in codon-based analyses.
5.8 Data availability
The reference set of deuterostome tetraspanins including six genomes (human, mouse, zebrafish, C. intestinalis, B. floridae and S. purpuratus) and some other selected sequences were presented in Additional file 2. All alignments used for phylogenetic analyses and positive seletion tests were presented in Additional file 6.
large extracellular loop
likelikhood ratio test
small extracellular loop
whole genome duplication
two rounds of WGD at the origin of vertebrates
We thank two anonymous reviewers for their comments. We thank Jiantao Huang for exquisite artworks. This work was supported by Projects 2007CB815800 (973), and 2008AA092601(863) from the Ministry of Science and Technology of China; project 30730089 and 30901103 from National Natural Science Foundation of China. A. Xu is a recipient of the Outstanding Young Scientist Award from the National Natural Science Foundation of China.
- Hemler ME: Tetraspanin proteins mediate cellular penetration, invasion, and fusion events and define a novel type of membrane microdomain. Annu Rev Cell Dev Biol. 2003, 19: 397-422. 10.1146/annurev.cellbio.19.111301.153609.View ArticlePubMedGoogle Scholar
- Huang S, Yuan S, Dong M, Su J, Yu C, Shen Y, Xie X, Yu Y, Yu X, Chen S, et al: The phylogenetic analysis of tetraspanins projects the evolution of cell-cell interactions from unicellular to multicellular organisms. Genomics. 2005, 86 (6): 674-684. 10.1016/j.ygeno.2005.08.004.View ArticlePubMedGoogle Scholar
- Boucheix C, Duc GH, Jasmin C, Rubinstein E: Tetraspanins and malignancy. Expert Rev Mol Med. 2001, 2001: 1-17.PubMedGoogle Scholar
- Gourgues M, Brunet-Simon A, Lebrun MH, Levis C: The tetraspanin BcPls1 is required for appressorium-mediated penetration of Botrytis cinerea into host plant leaves. Mol Microbiol. 2004, 51 (3): 619-629. 10.1046/j.1365-2958.2003.03866.x.View ArticlePubMedGoogle Scholar
- Martin F, Roth DM, Jans DA, Pouton CW, Partridge LJ, Monk PN, Moseley GW: Tetraspanins in viral infections: a fundamental role in viral biology?. J Virol. 2005, 79 (17): 10839-10851. 10.1128/JVI.79.17.10839-10851.2005.PubMed CentralView ArticlePubMedGoogle Scholar
- Olmos E, Reiss B, Dekker K: The ekeko mutant demonstrates a role for tetraspanin-like protein in plant development. Biochem Biophys Res Commun. 2003, 310 (4): 1054-1061. 10.1016/j.bbrc.2003.09.122.View ArticlePubMedGoogle Scholar
- Adell T, Gamulin V, Perovic-Ottstadt S, Wiens M, Korzhev M, Muller IM, Muller WE: Evolution of metazoan cell junction proteins: the scaffold protein MAGI and the transmembrane receptor tetraspanin in the demosponge Suberites domuncula. J Mol Evol. 2004, 59 (1): 41-50. 10.1007/s00239-004-2602-2.View ArticlePubMedGoogle Scholar
- Wright MD, Moseley GW, van Spriel AB: Tetraspanin microdomains in immune cell signalling and malignant disease. Tissue Antigens. 2004, 64 (5): 533-542. 10.1111/j.1399-0039.2004.00321.x.View ArticlePubMedGoogle Scholar
- Berditchevski F, Odintsova E: Tetraspanins as regulators of protein trafficking. Traffic. 2007, 8 (2): 89-96. 10.1111/j.1600-0854.2006.00515.x.View ArticlePubMedGoogle Scholar
- Maecker HT, Todd SC, Levy S: The tetraspanin superfamily: molecular facilitators. Faseb J. 1997, 11 (6): 428-442.PubMedGoogle Scholar
- Tarrant JM, Robb L, van Spriel AB, Wright MD: Tetraspanins: molecular organisers of the leukocyte surface. Trends Immunol. 2003, 24 (11): 610-617. 10.1016/j.it.2003.09.011.View ArticlePubMedGoogle Scholar
- Yunta M, Lazo PA: Tetraspanin proteins as organisers of membrane microdomains and signalling complexes. Cell Signal. 2003, 15 (6): 559-564. 10.1016/S0898-6568(02)00147-X.View ArticlePubMedGoogle Scholar
- Hemler ME: Tetraspanin functions and associated microdomains. Nat Rev Mol Cell Biol. 2005, 6 (10): 801-811. 10.1038/nrm1736.View ArticlePubMedGoogle Scholar
- Le Naour F, Andre M, Boucheix C, Rubinstein E: Membrane microdomains and proteomics: lessons from tetraspanin microdomains and comparison with lipid rafts. Proteomics. 2006, 6 (24): 6447-6454. 10.1002/pmic.200600282.View ArticlePubMedGoogle Scholar
- Charrin S, le Naour F, Silvie O, Milhiet PE, Boucheix C, Rubinstein E: Lateral organization of membrane proteins: tetraspanins spin their web. Biochem J. 2009, 420 (2): 133-154. 10.1042/BJ20082422.View ArticlePubMedGoogle Scholar
- Stipp CS, Kolesnikova TV, Hemler ME: Functional domains in tetraspanin proteins. Trends Biochem Sci. 2003, 28 (2): 106-112. 10.1016/S0968-0004(02)00014-2.View ArticlePubMedGoogle Scholar
- Seigneuret M, Delaguillaumie A, Lagaudriere-Gesbert C, Conjeaud H: Structure of the tetraspanin main extracellular domain. A partially conserved fold with a structurally variable domain insertion. J Biol Chem. 2001, 276 (43): 40055-40064. 10.1074/jbc.M105557200.View ArticlePubMedGoogle Scholar
- Kitadokoro K, Bordo D, Galli G, Petracca R, Falugi F, Abrignani S, Grandi G, Bolognesi M: CD81 extracellular domain 3D structure: insight into the tetraspanin superfamily structural motifs. Embo J. 2001, 20 (1-2): 12-18. 10.1093/emboj/20.1.12.PubMed CentralView ArticlePubMedGoogle Scholar
- Lambou K, Tharreau D, Kohler A, Sirven C, Marguerettaz M, Barbisan C, Sexton AC, Kellner EM, Martin F, Howlett BJ, et al: Fungi have three tetraspanin families with distinct functions. BMC Genomics. 2008, 9: 63-10.1186/1471-2164-9-63.PubMed CentralView ArticlePubMedGoogle Scholar
- Todres E, Nardi JB, Robertson HM: The tetraspanin superfamily in insects. Insect Mol Biol. 2000, 9 (6): 581-590. 10.1046/j.1365-2583.2000.00222.x.View ArticlePubMedGoogle Scholar
- Garcia-Espana A, Chung PJ, Sarkar IN, Stiner E, Sun TT, Desalle R: Appearance of new tetraspanin genes during vertebrate evolution. Genomics. 2008, 91 (4): 326-334. 10.1016/j.ygeno.2007.12.005.View ArticlePubMedGoogle Scholar
- Garcia-Espana A, Mares R, Sun TT, Desalle R: Intron evolution: testing hypotheses of intron evolution using the phylogenomics of tetraspanins. PLoS One. 2009, 4 (3): e4680-10.1371/journal.pone.0004680.PubMed CentralView ArticlePubMedGoogle Scholar
- DeSalle R, Mares R, Garcia-Espana A: Evolution of cysteine patterns in the large extracellular loop of tetraspanins from animals, fungi, plants and single-celled eukaryotes. Mol Phylogenet Evol. 56 (1): 486-491. 10.1016/j.ympev.2010.02.015.
- Ohno S: Evolution by Gene Duplication. (George Allen and Unwin, London, 1970).
- Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999, 151 (4): 1531-1545.PubMed CentralPubMedGoogle Scholar
- Hughes AL: The evolution of functionally novel proteins after gene duplication. Proc Biol Sci. 1994, 256 (1346): 119-124. 10.1098/rspb.1994.0058.View ArticlePubMedGoogle Scholar
- Wapinski I, Pfeffer A, Friedman N, Regev A: Natural history and evolutionary principles of gene duplication in fungi. Nature. 2007, 449 (7158): 54-61. 10.1038/nature06107.View ArticlePubMedGoogle Scholar
- Hakes L, Pinney JW, Lovell SC, Oliver SG, Robertson DL: All duplicates are not equal: the difference between small-scale and genome duplication. Genome Biol. 2007, 8 (10): R209-10.1186/gb-2007-8-10-r209.PubMed CentralView ArticlePubMedGoogle Scholar
- Maere S, De Bodt S, Raes J, Casneuf T, Van Montagu M, Kuiper M, Van de Peer Y: Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA. 2005, 102 (15): 5454-5459. 10.1073/pnas.0501102102.PubMed CentralView ArticlePubMedGoogle Scholar
- Conant GC, Wolfe KH: Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet. 2008, 9 (12): 938-950. 10.1038/nrg2482.View ArticlePubMedGoogle Scholar
- Brunet FG, Crollius HR, Paris M, Aury JM, Gibert P, Jaillon O, Laudet V, Robinson-Rechavi M: Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Mol Biol Evol. 2006, 23 (9): 1808-1816. 10.1093/molbev/msl049.View ArticlePubMedGoogle Scholar
- Studer RA, Penel S, Duret L, Robinson-Rechavi M: Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes. Genome Res. 2008, 18 (9): 1393-1402. 10.1101/gr.076992.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang M, Zhang X, Zhao H, Wang Q, Pan Y: FoxO gene family evolution in vertebrates. BMC Evol Biol. 2009, 9: 222-10.1186/1471-2148-9-222.PubMed CentralView ArticlePubMedGoogle Scholar
- Fetterman CD, Rannala B, Walter MA: Identification and analysis of evolutionary selection pressures acting at the molecular level in five forkhead subfamilies. BMC Evol Biol. 2008, 8: 261-10.1186/1471-2148-8-261.PubMed CentralView ArticlePubMedGoogle Scholar
- Hughes J, Criscuolo F: Evolutionary history of the UCP gene family: gene duplication and selection. BMC Evol Biol. 2008, 8: 306-10.1186/1471-2148-8-306.PubMed CentralView ArticlePubMedGoogle Scholar
- Lynch VJ, Roth JJ, Wagner GP: Adaptive evolution of Hox-gene homeodomains after cluster duplications. BMC Evol Biol. 2006, 6: 86-10.1186/1471-2148-6-86.PubMed CentralView ArticlePubMedGoogle Scholar
- Bos DH: Natural selection during functional divergence to LMP7 and proteasome subunit X (PSMB5) following gene duplication. J Mol Evol. 2005, 60 (2): 221-228. 10.1007/s00239-004-0120-x.View ArticlePubMedGoogle Scholar
- Freeling M, Thomas BC: Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res. 2006, 16 (7): 805-814. 10.1101/gr.3681406.View ArticlePubMedGoogle Scholar
- Ohno S: Gene duplication and the uniqueness of vertebrate genomes circa 1970-1999. Semin Cell Dev Biol. 1999, 10 (5): 517-522. 10.1006/scdb.1999.0332.View ArticlePubMedGoogle Scholar
- Abi-Rached L, Gilles A, Shiina T, Pontarotti P, Inoko H: Evidence of en bloc duplication in vertebrate genomes. Nat Genet. 2002, 31 (1): 100-105. 10.1038/ng855.View ArticlePubMedGoogle Scholar
- Dehal P, Boore JL: Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005, 3 (10): e314-10.1371/journal.pbio.0030314.PubMed CentralView ArticlePubMedGoogle Scholar
- Putnam NH, Butts T, Ferrier DE, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu JK, et al: The amphioxus genome and the evolution of the chordate karyotype. Nature. 2008, 453 (7198): 1064-1071. 10.1038/nature06967.View ArticlePubMedGoogle Scholar
- Wolfe K: Robustness--it's not where you think it is. Nat Genet. 2000, 25 (1): 3-4. 10.1038/75560.View ArticlePubMedGoogle Scholar
- Prachumwat A, Li WH: Gene number expansion and contraction in vertebrate genomes with respect to invertebrate genomes. Genome Res. 2008, 18 (2): 221-232. 10.1101/gr.7046608.PubMed CentralView ArticlePubMedGoogle Scholar
- Bielawski JP, Yang Z: Maximum likelihood methods for detecting adaptive evolution after gene duplication. J Struct Funct Genomics. 2003, 3 (1-4): 201-212. 10.1023/A:1022642807731.View ArticlePubMedGoogle Scholar
- Hochberg Y: A sharper Bonf erroni procedure for multiple tests of significance. biometrika. 1988, 76: 383-386.Google Scholar
- Anisimova M, Yang Z: Multiple hypothesis testing to detect lineages under positive selection that affects only a few sites. Mol Biol Evol. 2007, 24 (5): 1219-1228. 10.1093/molbev/msm042.View ArticlePubMedGoogle Scholar
- Mallick S, Gnerre S, Muller P, Reich D: The difficulty of avoiding false positives in genome scans for natural selection. Genome Res. 2009, 19 (5): 922-933. 10.1101/gr.086512.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Anisimova M, Bielawski JP, Yang Z: Accuracy and power of bayes prediction of amino acid sites under positive selection. Mol Biol Evol. 2002, 19 (6): 950-958.View ArticlePubMedGoogle Scholar
- Gu X, Vander Velden K: DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics. 2002, 18 (3): 500-501. 10.1093/bioinformatics/18.3.500.View ArticlePubMedGoogle Scholar
- Penn O, Stern A, Rubinstein ND, Dutheil J, Bacharach E, Galtier N, Pupko T: Evolutionary modeling of rate shifts reveals specificity determinants in HIV-1 subtypes. PLoS Comput Biol. 2008, 4 (11): e1000214-10.1371/journal.pcbi.1000214.PubMed CentralView ArticlePubMedGoogle Scholar
- Goldberg AF: Role of peripherin/rds in vertebrate photoreceptor architecture and inherited retinal degenerations. Int Rev Cytol. 2006, 253: 131-175. 10.1016/S0074-7696(06)53004-9.View ArticlePubMedGoogle Scholar
- Goldberg AF, Molday RS: Expression and characterization of peripherin/rds-rom-1 complexes and mutants implicated in retinal degenerative diseases. Methods Enzymol. 2000, 316: 671-687. full_text.View ArticlePubMedGoogle Scholar
- Clarke G, Goldberg AF, Vidgen D, Collins L, Ploder L, Schwarz L, Molday LL, Rossant J, Szel A, Molday RS, et al: Rom-1 is required for rod photoreceptor viability and the regulation of disk morphogenesis. Nat Genet. 2000, 25 (1): 67-73. 10.1038/75621.View ArticlePubMedGoogle Scholar
- Davis JC, Petrov DA: Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol. 2004, 2 (3): E55-10.1371/journal.pbio.0020055.PubMed CentralView ArticlePubMedGoogle Scholar
- He X, Zhang J: Gene complexity and gene duplicability. Curr Biol. 2005, 15 (11): 1016-1021. 10.1016/j.cub.2005.04.035.View ArticlePubMedGoogle Scholar
- Papp B, Pal C, Hurst LD: Dosage sensitivity and the evolution of gene families in yeast. Nature. 2003, 424 (6945): 194-197. 10.1038/nature01771.View ArticlePubMedGoogle Scholar
- He X, Zhang J: Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics. 2005, 169 (2): 1157-1164. 10.1534/genetics.104.037051.PubMed CentralView ArticlePubMedGoogle Scholar
- Rastogi S, Liberles DA: Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol. 2005, 5 (1): 28-10.1186/1471-2148-5-28.PubMed CentralView ArticlePubMedGoogle Scholar
- Hughes T, Ekman D, Ardawatia H, Elofsson A, Liberles DA: Evaluating dosage compensation as a cause of duplicate gene retention in Paramecium tetraurelia. Genome Biol. 2007, 8 (5): 213-10.1186/gb-2007-8-5-213.PubMed CentralView ArticlePubMedGoogle Scholar
- Pazos F, Valencia A: Protein co-evolution, co-adaptation and interactions. EMBO J. 2008, 27 (20): 2648-2655. 10.1038/emboj.2008.189.PubMed CentralView ArticlePubMedGoogle Scholar
- Benton MJ, Donoghue PC: Paleontological evidence to date the tree of life. Mol Biol Evol. 2007, 24 (1): 26-53. 10.1093/molbev/msl150.View ArticlePubMedGoogle Scholar
- Blair JE, Hedges SB: Molecular phylogeny and divergence times of deuterostome animals. Mol Biol Evol. 2005, 22 (11): 2275-2284. 10.1093/molbev/msi225.View ArticlePubMedGoogle Scholar
- Hughes T, Liberles DA: Whole-genome duplications in the ancestral vertebrate are detectable in the distribution of gene family sizes of tetrapod species. J Mol Evol. 2008, 67 (4): 343-357. 10.1007/s00239-008-9145-x.View ArticlePubMedGoogle Scholar
- Hughes T, Liberles DA: The power-law distribution of gene family size is driven by the pseudogenisation rate's heterogeneity between gene families. Gene. 2008, 414 (1-2): 85-94. 10.1016/j.gene.2008.02.014.View ArticlePubMedGoogle Scholar
- Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al: Ensembl 2009. Nucleic Acids Res. 2009, D690-697. 10.1093/nar/gkn828. 37 Database
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24 (8): 1596-1599. 10.1093/molbev/msm092.View ArticlePubMedGoogle Scholar
- Rosenbloom K, Taylor J, Schaeffer S, Kent J, Haussler D, Miller W: Phylogenomic resources at the UCSC Genome Browser. Methods Mol Biol. 2008, 422: 133-144. full_text.View ArticlePubMedGoogle Scholar
- Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24 (8): 1586-1591. 10.1093/molbev/msm088.View ArticlePubMedGoogle Scholar
- Wernersson R, Pedersen AG: RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 2003, 31 (13): 3537-3539. 10.1093/nar/gkg609.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang J, Nielsen R, Yang Z: Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol. 2005, 22 (12): 2472-2479. 10.1093/molbev/msi237.View ArticlePubMedGoogle Scholar
- Gu X: Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol. 1999, 16 (12): 1664-1674.View ArticlePubMedGoogle Scholar
- Gu X: Maximum-likelihood approach for gene family evolution under functional divergence. Mol Biol Evol. 2001, 18 (4): 453-464.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.