Exploring metazoan evolution through dynamic and holistic changes in protein families and domains
© Wang et al.; licensee BioMed Central Ltd. 2012
Received: 1 March 2012
Accepted: 19 July 2012
Published: 3 August 2012
Proteins convey the majority of biochemical and cellular activities in organisms. Over the course of evolution, proteins undergo normal sequence mutations as well as large scale mutations involving domain duplication and/or domain shuffling. These events result in the generation of new proteins and protein families. Processes that affect proteome evolution drive species diversity and adaptation. Herein, change over the course of metazoan evolution, as defined by birth/death and duplication/deletion events within protein families and domains, was examined using the proteomes of 9 metazoan and two outgroup species.
In studying members of the three major metazoan groups, the vertebrates, arthropods, and nematodes, we found that the number of protein families increased at the majority of lineages over the course of metazoan evolution where the magnitude of these increases was greatest at the lineages leading to mammals. In contrast, the number of protein domains decreased at most lineages and at all terminal lineages. This resulted in a weak correlation between protein family birth and domain birth; however, the correlation between domain birth and domain member duplication was quite strong. These data suggest that domain birth and protein family birth occur via different mechanisms, and that domain shuffling plays a role in the formation of protein families. The ratio of protein family birth to protein domain birth (domain shuffling index) suggests that shuffling had a more demonstrable effect on protein families in nematodes and arthropods than in vertebrates. Through the contrast of high and low domain shuffling indices at the lineages of Trichinella spiralis and Gallus gallus, we propose a link between protein redundancy and evolutionary changes controlled by domain shuffling; however, the speed of adaptation among the different lineages was relatively invariant. Evaluating the functions of protein families that appeared or disappeared at the last common ancestors (LCAs) of the three metazoan clades supports a correlation with organism adaptation. Furthermore, bursts of new protein families and domains in the LCAs of metazoans and vertebrates are consistent with whole genome duplications.
Metazoan speciation and adaptation were explored by birth/death and duplication/deletion events among protein families and domains. Our results provide insights into protein evolution and its bearing on metazoan evolution.
Keywords: Proteins; Domains; Evolution; Metazoa; Vertebrates; Arthropods; Nematodes
Proteins convey the majority of biochemical and cellular activities in organisms. Their structural and functional units are defined as domains [1, 2], and each protein may contain one or more domains. Evolutionarily related proteins have been grouped into families. Member proteins from the same family usually share high functional and sequence similarity, and contain similar domain architectures [3, 4]. Over the course of evolution, proteins undergo mutations, duplications, and domain shuffling, which can result in the generation of new proteins and protein families through natural selection. The interplay between the different protein evolutionary events creates complicated mechanisms that help govern speciation and adaptation of organisms. It is believed that duplications can create functional redundancies and provide space for mutation and domain shuffling. Mutation and domain shuffling together with other genetic events can create functional variation and in some cases completely alter protein function. These changes subject proteins to natural selection and adaptation, which in turn lead to the generation of new domains, proteins, protein families, and species. As such, analyzing these changes can greatly improve our understanding of protein evolution, which in turn will enhance our understanding of species diversity and adaptation. Such understanding can be of great economic importance. For example, identifying and characterizing protein families or domains unique to parasites, e.g. parasitic nematodes, can result in better disease treatment and control.
Protein evolution has been explored for decades. Indels and substitutions have been linked to protein structure and function [7, 8]; gene duplication and protein family expansion have been correlated to organism adaptation [9–11]; and studies on protein domains have advanced our understanding of the protein repertoire [12, 13]. Systematic studies of protein evolution, especially those that examine the relationships between domain evolution and protein family evolution, have been limited by a dearth of sequence and functional data at the genomic level. However, significant progress has recently been made in obtaining such data. Today, more than 5000 genomes of species encompassing a broad taxonomic distribution have been sequenced (http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome), and their corresponding proteomes have been annotated. The culmination of these efforts is the emergence of databases of well-defined protein domains such as Pfam, which defines thousands of conserved protein domains with detailed information on sequence and function. These databases make possible methodical evaluations of protein evolution. Indeed, while our work was ongoing, Kawashima et al. extracted important information on vertebrate adaptation from changes in domain architecture. Furthermore, Buljan et al. found that changes in domain architecture are biased to the termini of proteins. These studies highlight the potential to glean important associations between domain evolution, protein family evolution, and species adaptation from systematic studies of protein and genomics databases.
The present investigation analyzed nine metazoan proteomes covering the three major metazoan clades (vertebrates, arthropods, and nematodes), together with those of Saccharomyces cerevisiae and Monosiga brevicollis as outgroups. Using evolutionary and biostatistical methodologies, we evaluated births and deaths of protein families and domains, and duplications and deletions of protein family and domain members within the target species. Herein, we refer to the generation of new protein or domain families in a lineage as birth events and the disappearance of these families at a lineage as death events. To better illustrate the evolutionary dynamics, these events were summarized into four indices: change in protein family complexity, change in protein domain complexity, domain shuffling, and adaptation. We used these datasets to explore and provide new insights into metazoan adaptation, diversity and evolution.
Protein family birth and death
[Table: Classification of protein families and domains. Columns: specific to species; shared by species.]
[Table: Birth and death evolutionary events. Lineages: ((Tsp (Cbr Cel)) ((Bmo (Aae Dme)) (Gga (Hsa Mmu)))); (Tsp (Cbr Cel)); ((Bmo (Aae Dme)) (Gga (Hsa Mmu))); (Bmo (Aae Dme)); (Gga (Hsa Mmu)).]
Duplication and deletion in universal protein families
We selected 804 universal protein families containing members present in all 11 taxa and investigated duplications and deletions among the members. Focusing on universal families helped minimize the effects of species adaptation and detect signals associated with genomic evolutionary events. Six universal families were excluded because their large numbers of sequences (more than 1,000 each) prohibited multiple sequence alignment and tree building. Of those examined, 12,507 duplications and 22,954 deletions were inferred, averaging 16 duplications and 29 deletions per family. In the majority of lineages, deletions outnumbered duplications; at the LCA of nematodes in particular, deletions were 28 times more numerous than duplications, suggesting that protein families became smaller (Table 2). There appear to have been two bursts of duplication, one in the LCA of metazoans with an average of 2.2 duplications per family, and one in the LCA of vertebrates, which averaged 1.95 duplications per family. All other branches averaged less than one duplication per family. Despite the variation in deletion events over different lineages, the numbers of deletions from the LCA to each present taxon were less variable than the numbers of duplications. Comparing the terminal lineages, G. gallus had the fewest duplication events.
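The per-family averages quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the denominator is the 804 universal families mentioned in the text (the text does not state the exact denominator explicitly).

```python
def per_family_average(total_events: int, n_families: int) -> float:
    """Mean number of inferred events per analyzed family."""
    if n_families <= 0:
        raise ValueError("n_families must be positive")
    return total_events / n_families


# Totals quoted in the text; the denominator of 804 is an assumption.
N_FAMILIES = 804
avg_dup = per_family_average(12_507, N_FAMILIES)
avg_del = per_family_average(22_954, N_FAMILIES)

print(f"{avg_dup:.2f} duplications and {avg_del:.2f} deletions per family")
```

Rounded to whole numbers, these values agree with the 16 duplications and 29 deletions per family reported in the text.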
Domain birth and death
We successfully identified 5,106 domains from 123,084 proteins. Unlike protein families where less than 5 percent were universal (810 out of 17,752) and more than 20 percent were species specific, more than 20 percent of domains (1,172 out of 5,106) were universal and less than 6 percent were species specific (Table 1).
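The universal / shared / species-specific classification can be sketched from a presence table. This is a minimal illustration, not the pipeline used in the study: the family identifiers and the three-species toy data are hypothetical, and only the totals in the final lines are taken from the text.

```python
from collections import Counter
from typing import Dict, Set


def classify(presence: Dict[str, Set[str]], n_species: int) -> Counter:
    """Count families (or domains) by how many species contain them."""
    counts: Counter = Counter()
    for species in presence.values():
        if len(species) == n_species:
            counts["universal"] += 1
        elif len(species) == 1:
            counts["species_specific"] += 1
        else:
            counts["shared"] += 1
    return counts


# Hypothetical toy data: family id -> species in which it occurs.
toy_presence = {
    "fam1": {"Hsa", "Dme", "Cel"},  # present everywhere
    "fam2": {"Hsa"},                # species specific
    "fam3": {"Dme", "Cel"},         # shared by some species
}
toy_counts = classify(toy_presence, n_species=3)

# Fractions implied by the totals quoted in the text.
universal_family_fraction = 810 / 17_752
universal_domain_fraction = 1_172 / 5_106
```

With the quoted totals, the universal fraction is below 5 percent for families and above 20 percent for domains, as stated.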
Birth/death events of the 5,106 identified domains were inferred in the same manner as protein family birth/death events (Table 2). Domains had fewer birth/death events than protein families. Consistent with the pattern observed in protein families, there was a burst of domain births in the LCA of metazoans, which was 2 times greater than that found in the LCA of arthropods and vertebrates after normalizing by branch lengths. However, different lineages exhibited dramatic variations in the number of birth/death events. The lineages leading to humans exhibited the largest number of domain births and the smallest number of domain deaths. In contrast, the lineages leading to T. spiralis showed the smallest number of domain births and the largest number of domain deaths. After the split, 614 domains disappeared in T. spiralis while approximately 250 domains disappeared in C. elegans and C. briggsae. Since T. spiralis is a nematode parasite and lateral gene transfer has been reported in parasitic nematodes, details of the 15 domains born in T. spiralis were examined. Interestingly, 13 out of 15 have been annotated as bacterial or viral protein domains (Additional file 4).
Domain duplications and deletions
Similar to family member duplications and deletions, domain duplications and deletions were analyzed for each phylogeny. For the purpose of comparability, only the 1,168 universal domains (domains present in all 11 species) were considered. In total, 49,958 duplications and 94,648 deletions were inferred for the universal domains; 5 domains were excluded because they had more than 1,000 members. As observed among universal protein family members, domain duplication and deletion varied substantially over the course of evolution, and sister lineages did not have similar numbers of duplications and deletions (Table 2). However, domain duplications and deletions were more frequent than protein member duplications and deletions, averaging 43 duplications and 81 deletions per domain over the course of evolution for the species examined, starting with the LCA of metazoa.
Correlation between protein domain evolution and protein family evolution
[Table: Pearson's correlation coefficients (bold text) and their significance (regular text) of different evolutionary events. Row and column labels: universal families duplication; universal families deletion; universal domains duplication; universal domains deletion.]
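For readers unfamiliar with the statistic, Pearson's correlation coefficient between two series of per-lineage event counts can be computed as sketched below. The function name and the toy counts are hypothetical and are not data from this study; the significance test is omitted.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical per-lineage counts, for illustration only.
toy_births = [1, 2, 3, 4]
toy_duplications = [2, 4, 6, 8]
toy_r = pearson_r(toy_births, toy_duplications)
```

A value near 1 indicates that lineages with more events of one type also tend to have more of the other; a value near 0 indicates little linear association.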
Dynamic evolutionary changes over the phylogeny
Domain shuffling and protein family formation
Enriched biological process GO terms in protein families born at the LCA of the three major metazoan groups and the LCA of metazoans
Families born at the LCA of metazoans
regulation of transcription, DNA-dependent
multicellular organismal development
acyl-CoA metabolic process
G-protein coupled receptor protein signaling pathway
lysosome organization and biogenesis
Wnt receptor signaling pathway, calcium modulating pathway
vitelline membrane formation
sphingolipid metabolic process
transforming growth factor beta receptor signaling pathway
regulation of transcription
dicarboxylic acid transport
innate immune response
negative regulation of microtubule depolymerization
Families born at the LCA of nematodes
chaperone cofactor-dependent protein folding
G-protein coupled receptor protein signaling pathway
cell wall catabolic process
trehalose biosynthetic process
Families born at the LCA of arthropods
chitin metabolic process
sodium ion transport
response to stress
sensory perception of smell
Families born at the LCA of vertebrates
G-protein coupled receptor protein signaling pathway
regulation of cell growth
sensory perception of taste
innate immune response
leukotriene metabolic process
regulation of apoptosis
antigen processing and presentation
phospholipid catabolic process
protein amino acid glycosylation
glycerol metabolic process
response to biotic stimulus
negative regulation of Wnt receptor signaling pathway
positive regulation of apoptosis
Enriched biological process GO terms in protein families that died at the LCAs of the three major metazoan groups
Families died at the LCA of arthropods
acyl-CoA metabolic process
vitelline membrane formation
sodium ion transport
cell redox homeostasis
methionine metabolic process
Families died at the LCA of nematodes
DNA catabolic process
regulation of Rho protein signal transduction
porphyrin biosynthetic process
protein amino acid O-linked glycosylation
antibiotic biosynthetic process
Families died at the LCA of vertebrates
G-protein coupled receptor protein signaling pathway
intein-mediated protein splicing
chitin metabolic process
multicellular organismal development
Duplication of whole genome, protein families and domains
This work provided a systematic analysis of both protein family evolution and domain evolution at the genomic level. Related evolutionary events were reconstructed and analyzed using proteomes from nine metazoan species via a variety of evolutionary and statistical programs. We included three well-annotated species, H. sapiens, C. elegans, and D. melanogaster, one for each major metazoan group, to reduce any bias from uneven annotation. We also included other less well characterized vertebrates, arthropods and nematodes for comparative purposes. To assure reliability in reconstructing evolutionary events, yeast and a choanoflagellate (choanoflagellates being the closest living relatives of metazoans) were included in the analysis as outgroups. In this way, we were able to relate both birth events and death events to species diversity and adaptation. This methodology allowed us to better explore any correlation between the evolution of protein families and protein domains, and reveal insights about species adaptation.
Lineage specific protein family and domain evolution
Both birth/death and duplication/deletion of proteins and their domains vary substantially between lineages. We characterized these variations through four measurements: the log ratio of protein family births to deaths (protein family change index), the log ratio of protein domain births to deaths (protein domain change index), the ratio of protein family births to protein domain births (domain shuffling index), and the sum of the logarithms of family birth and death events normalized by branch length (adaptation index). The indices defining protein family and domain changes reflect family and domain gain or loss at different lineages, and represent changes in complexity of the organism’s proteome at each lineage. The domain shuffling index is a good indication of the effects of domain shuffling at any given lineage. The adaptation index illustrates how quickly adaptation occurred through protein family birth/death among the lineages during evolution. These measurements provided an interesting representation of metazoan, lineage-specific evolution.
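The four indices can be written down compactly. The sketch below is one possible reading of the verbal definitions, not the authors' implementation: the class and function names are hypothetical, the logarithm base (base 2 here) is an assumption, and the adaptation index is assumed to be (log births + log deaths) / branch length.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LineageEvents:
    family_births: int
    family_deaths: int
    domain_births: int
    domain_deaths: int
    branch_length: float


def lineage_indices(
    ev: LineageEvents, log: Callable[[float], float] = math.log2
) -> Dict[str, float]:
    """Compute the four per-lineage indices under the stated assumptions."""
    counts = (ev.family_births, ev.family_deaths,
              ev.domain_births, ev.domain_deaths)
    if min(counts) <= 0 or ev.branch_length <= 0:
        raise ValueError("counts and branch length must be positive")
    return {
        # log ratio of family births to deaths
        "family_change": log(ev.family_births / ev.family_deaths),
        # log ratio of domain births to deaths
        "domain_change": log(ev.domain_births / ev.domain_deaths),
        # ratio of family births to domain births
        "domain_shuffling": ev.family_births / ev.domain_births,
        # assumed form: sum of log births and log deaths per unit branch length
        "adaptation": (log(ev.family_births) + log(ev.family_deaths))
        / ev.branch_length,
    }


# Hypothetical toy lineage, chosen so the values are easy to verify by hand.
toy = LineageEvents(family_births=8, family_deaths=2,
                    domain_births=4, domain_deaths=4, branch_length=2.0)
toy_indices = lineage_indices(toy)
```

Under these definitions a positive change index means births outnumber deaths on that lineage, and a domain shuffling index above 1 means more families than domains were born there.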
Our study revealed a consistent increase in complexity during the evolution of mammals from the perspective of protein families, as illustrated by the positive change in the protein family and domain indices at the corresponding lineages. These data corroborate prior work on Pfam protein domains showing large increases in complexity among Metazoa, where high rates of new domain formation and changes in domain architecture were observed. The LCA of nematodes had a large reduction in complexity in both protein families and domains. However, unlike the lineages of arthropods, all three nematodes gained protein family complexity after splitting from their LCA (Figure 1). These data are consistent with the previously reported rapid generation of new protein families in vertebrates and nematodes. As reported, G. gallus exhibited a substantial reduction in protein family and domain complexities after it split from the LCA of vertebrates, and in general, the mean genome size is smaller in birds than in other tetrapods. This occurs in concert with a reduction in ancestral protein-coding genes. Our results also showed that the terminal lineage of G. gallus had the largest protein family and domain losses and the smallest protein family and domain duplications. In addition, G. gallus exhibited the smallest domain shuffling effect. In fact, based on our results, domain shuffling was lower in vertebrates than in arthropods and nematodes, with G. gallus having the lowest domain shuffling index among vertebrates (1.32 vs. 1.97 for M. musculus and 2.18 for H. sapiens). It has been hypothesized that reduced genome size is the result of the evolution of flight and concomitant adaptation of birds to the high rate of oxidative metabolism needed for flying.
At this time, we cannot link the reduced genome size to reduced domain shuffling; however, less redundancy in the proteome in conjunction with strong selective pressures can effectively increase the deleterious effects of domain shuffling over time. This would result in an apparent reduction in domain shuffling in this lineage. Parasitism is clearly not analogous to flying; however, the adaptation to confined environmental niches seen both in parasitic nematodes and in birds capable of flight is worth noting. Trichinella spiralis, which unlike most parasitic nematodes neither has a free-living stage nor requires multiple hosts to complete its life cycle, is likely subject to fewer selective forces. Because of this, there are undoubtedly reduced requirements for the large repertoire of proteins demanded of free-living nematodes and those with more complicated life cycles, hence the remarkable reduction in complexity. In addition, the host-restricted lifestyle of a parasite may result in the elimination of redundant protein families through the selection process. For example, the nuclear receptor superfamily tends to vary largely among different species. Free-living Caenorhabditis species possess hundreds of copies, whereas only 15 copies of this same receptor superfamily have been found in T. spiralis. This large disparity in copy number is accompanied by the loss of sub-family specific domains in T. spiralis. It has been hypothesized that the higher copy numbers are required in free-living nematodes for more efficient regulation of gene expression and for responding better and more quickly to environmental factors such as temperature, nutrient availability, metal ions or pH. It follows therefore that fewer copies would be required in parasites such as T. spiralis that have adapted to a more predictable living environment. In contrast to G. gallus, the lineage of T. spiralis, which has a very small genome relative to vertebrates, exhibited a strong domain shuffling effect. We believe that the high level of protein redundancy in T. spiralis resulting from host parasitism has made domain shuffling more tolerable during the evolutionary process.
Meanwhile, terminal lineages uniformly exhibited reductions in domain complexity while more than half of them showed increases in protein family complexity (Figures 1 and 2). This suggests that new domains formed at a lower rate, and that domain loss outpaced domain gain at the terminal lineages. It is consistent with observations made by Lander et al., who demonstrated that a nearly complete set of human gene domains is common to one or more lower eukaryotes. Even humans have gained only 39 domains while losing 42 domains (Table 1). Overall gains in protein family complexity were substantially larger than those of domain complexity at the different lineages examined. This is consistent with new protein families being generated by recruiting novel domains and by domain shuffling, i.e. architecture rearrangement [26, 27], which also includes domain (or gene) fusion and fission. Because nematode lineages tended to lose domain complexity, the concomitant increase in protein family complexity suggests a strong contribution from domain shuffling over the course of their evolution (Figure 2). In contrast, mammals achieved protein family complexity by utilizing novel domains more so than other organisms. One possible source for these newly generated mammalian domains is genome duplication [29, 30]. Our results also showed that the LCA of Metazoa had a low domain shuffling index. Although this node is quite distinct from the LCA of the Bilateria, this finding appears to conflict with prior observations indicating an increase in domain promiscuity around the divergence of Bilateria. However, our definition of domain shuffling is not synonymous with domain promiscuity, because domain shuffling as defined in our work takes into account the birth and death of both protein families and protein domains, which were not taken into account by Cohen-Gihon et al.
For this reason, the addition of a large number of new domains as observed at the LCA of the Bilateria will automatically increase the domain promiscuity of ancient domains, which is calculated using the abundance of different domain architectures. Indeed, the LCA of Metazoa exhibited the largest number of domain births but a low shuffling index. Similarly, a reduction in the number of domains can decrease domain promiscuity, which is the likely reason why Cohen-Gihon et al. detected the smallest domain shuffling effect in G. gallus. We corroborated this finding, as G. gallus exhibited a large number of domain deaths. Nevertheless, it is possible that our definition of the domain shuffling index underestimates the contribution of domain shuffling when there is a burst of domain births, like that observed at the LCA of Metazoa.
Aside from variations in complexities and the effects of domain shuffling, adaptation speed appears much less variable, as represented by the more consistent adaptation indices in the different lineages (Figure 3). However, as SNP data have shown, there was a recent acceleration of adaptation in humans, where demographic change, gene function, and gene-environment interactions could be key driving forces. In our study, the human lineages exhibited the highest adaptation index. It is highly likely that the same forces drove the fast protein family adaptation at the human lineage.
Adaptation, and protein family and domain evolution
Our adaptation index derived from protein family birth and death did not reveal significant differences among the lineages studied. However, the differences in the numbers of protein and domain birth/death events and family member duplications and deletions in the lineages studied (Table 2), together with the variation of inferred changes in the indices defining protein family and domain complexities, provide sufficient evidence of lineage specific adaptation. In turn, these lineage specific features and variations suggest a role for protein families and domains in species adaptation and diversity. Adaptation related evolutionary variations have been reported multiple times. Taylor et al. showed that the rate of protein duplication varied substantially among lineages. Hillier and coworkers demonstrated that lineage-specific protein duplications and deletions were related to evolutionary change. Finally, Babushok et al. reported that lineage-specific domain shuffling in different protein families promoted phenotypic complexity and species adaptation.
Functional enrichment of protein family births/deaths at the LCAs of the nematodes, arthropods, and vertebrates, and of protein family births at the LCA of metazoans, provides direct evidence for an association between species adaptation and protein evolution. Not surprisingly, protein families born at the LCA of metazoans were significantly enriched in functions related to regulation of transcription, multicellular organismal development, and signaling pathways, among others. These functional enrichments revealed the importance of the relevant families in the adaptation of metazoans while validating our family reconstruction. The most significant of the enrichments, the regulation of transcription, is clearly reflective of the critical role this function plays in metazoan adaptation. Similar to the results presented in our work, other studies have demonstrated that protein families encompassing signaling pathways and adhesion predated the origins of metazoans and were involved profoundly in metazoan adaptation [36–38], as were families involving neurotransmitter transporters. New families related to the G-protein-coupled receptor protein signaling pathway are also enriched. Significantly more families related to this pathway were born at the LCA of nematodes and at the LCA of vertebrates. Interestingly, there was also a significant increase in the number of families of the same pathway that died at the LCA of vertebrates. These birth and death dynamics, especially those that occurred in the same lineage (i.e. the LCA of vertebrates), illustrate that G-protein-coupled receptor families as a whole played a significant role in metazoan speciation and adaptation [40, 41]. For example, olfactory receptor proteins were found to be important for vertebrate diversity, as were protein families associated with cell communication.
Given the complex configuration of tissues and organs in vertebrates, it is very likely that intercommunication between cell types is important in vertebrate evolution. Compared to signaling pathways, fewer enriched terms are associated with metabolism. This may relate to broad conservation of metabolic processes in all metazoans. Therefore, further enrichment of the acyl-CoA metabolic process in family deaths at the LCA of arthropods comes as little surprise. In support of this, the lack of acyl-CoA dehydrogenase homologues in arthropods has been previously reported. We also noticed that GO term enrichment of families born at the LCA of vertebrates overlapped largely with the GO term enrichment of proteins under positive selection in mammals reported by Kosiol et al. This could reflect the role of positive selection in protein family dynamics.
Besides the signaling pathway involving the G-protein-coupled receptor, other families born at the LCA of nematodes with significant functional enrichment include the trehalose biosynthetic process, cation transport, and cell wall catabolic process. Trehalose may be used as a compatible solute to contend with osmotic stress or as an external carbon source [45–47]. In like manner, cation transport enrichment can also address osmotic stress. The birth of families associated with cell wall catabolic processes may reflect the diversification in food resources connected with ancestral plant parasitic nematodes and subsequently coupled to free-living nematodes that followed. As such, additional analyses of enzymes involved in cell wall catabolism may reveal associations between parasitism and nematodes. Consistent with using externally-derived heme sources rather than synthesizing them de novo, porphyrin biosynthesis is among those protein families that died at the LCA of nematodes, along with families associated with Rho protein signal transduction. Not much is known about nematode Rho signaling, but the absence of RhoBTB in C. elegans has been reported. Additional families lost at the LCA of nematodes are those associated with the DNA catabolic process and DNA repair. Given the karyotype diversity, rapid genome changes, and chromatin diminution that occur in nematodes [51, 52], it follows that the loss of protein families involved in the DNA catabolic process and in DNA repair could result in increased chromosome instability that can lead to these collective activities. To date there are no reports referencing evolutionary changes in DNA catabolic and/or repair pathways in nematodes; however, nematodes tend to have higher mutation rates than many organisms. We are currently carrying out multiple nematode genome and transcriptome projects. Their completion will provide more data for better understanding the adaptation of nematodes.
Expansion in some domains, especially those involved in signal transduction and DNA binding, was previously found to correlate positively with organism complexity. However, our preliminary screen failed to confirm this, presumably because species with low numbers of domain copies and low numbers of cell types, such as protozoa and fungi, were not included in this analysis (S. cerevisiae was included only as an outgroup). Such species could contribute significantly to the reported positive correlation.
Whole genome duplication and burst of protein family/domain duplication
Despite the variation in protein family and domain birth/death events, bursts of duplications were observed at the LCA of metazoans and at the LCA of vertebrates. The source of these bursts is not known, but whole genome duplication has been suggested for vertebrates and yeast [29, 55]. The presence of duplication bursts confirms genome duplication within vertebrates, and the large number of deletions that accompanied the duplication bursts points to the difficulty in recovering the duplication history for vertebrates [30, 56]. Meanwhile, our data strongly support genome duplication at the LCA of metazoans. There are more families involved in duplication than involved in deletion at the LCA of vertebrates, and a large fraction of these protein families are involved only in duplication. Though duplications and deletions inferred through tree reconciliation could suffer from errors in tree estimation, systematic bias toward a specific type of family is unlikely. In fact, other analyses involving separate protein families confirmed extensive duplications at the LCA of metazoans [57–59].
Although the evolution of protein families and domains has been a research topic for some time, the current study is the first to closely investigate both duplication and birth/death rates for protein families and domains using a well-balanced and extensive data set. By reconstructing the evolution of protein families and domains over lineages that span the Metazoa which included all three major groups and multiple species within each group, for the first time we were able to quantify the relationship between protein family evolution and domain evolution, and examine the effects of domain shuffling in a lineage specific manner. By revealing the strong positive correlation between domain birth and duplication, we provided evidence for the evolutionary role of functional redundancy. By demonstrating a weak correlation between protein family birth and member duplication (in combination with the close correlation of the same events of domains), our analyses provided direct evidence for domain shuffling.
Additionally, we investigated not only new proteins that emerged (born) throughout evolution, but also proteins that disappeared (died) over this same period. This provided insights into understanding organismal adaptation, such as parasitism. To our knowledge, this is the first systematic study exploring adaptation through the death or disappearance of proteins. Finally, by examining both protein family and domain duplications, we provided strong evidence for whole genome duplication at the LCA of the Metazoa.
In conclusion, we studied metazoan evolution at a proteome level using a phylogenetic approach. Metazoan speciation and adaptation were explored by birth/death and duplication/deletion events among protein families and domains. The results characterized metazoan lineage-specific evolution related to protein families and domains. Despite the large variation, lineages leading to mammals exhibited consistent increases in protein family complexity during evolution. Results also illustrated that domain shuffling had a greater impact on protein family complexity in nematodes than in other metazoans, and that protein redundancy may be critical for evolutionary changes controlled by domain shuffling. By relating the evolutionary events to the functions of the proteins/domains involved, the results exposed the adaptive roles of these events. Overall, our study provides new insights into protein evolution associated with metazoan speciation.
Whole proteome data from 9 metazoans were collected. The dataset comprised 3 species of vertebrates, 3 species of arthropods, and 3 species of nematodes. Data were downloaded as follows: Homo sapiens, Mus musculus, and Gallus gallus were from BioMart (http://www.biomart.org); Drosophila melanogaster and Aedes aegypti were from FlyBase; Bombyx mori was from SilkDB; Caenorhabditis elegans and Caenorhabditis briggsae were from WormBase; and Trichinella spiralis. The proteomes from the outgroups S. cerevisiae (yeast) and M. brevicollis (choanoflagellate) were downloaded from BioMart (http://www.biomart.org) and JGI (http://genome.jgi-psf.org/), respectively. These proteomes were selected to keep the phylogenetic distances among the three species within each clade similar [63–66], i.e., H. sapiens and M. musculus split about 100 million years ago (MYA), as did C. elegans and C. briggsae, and G. gallus split about 300 MYA from the ancestral vertebrates, as did T. spiralis from the ancestral nematodes. Isoforms of the downloaded sequences were examined against the coding genes, and only the longest isoform of each gene was kept. The final dataset contained 22,997, 23,873, 16,736, 14,141, 15,419, 14,623, 20,188, 19,517, 16,124, 9,196, and 6,698 sequences from H. sapiens, M. musculus, G. gallus, D. melanogaster, A. aegypti, B. mori, C. elegans, C. briggsae, T. spiralis, M. brevicollis and S. cerevisiae, respectively.
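The isoform filtering step can be sketched as follows. This is a minimal illustration, assuming the records have already been parsed into (gene, protein, sequence) tuples; the function name, the tuple layout and the toy identifiers are hypothetical and not taken from the original pipeline.

```python
def longest_isoform_per_gene(records):
    """Keep one protein per gene: the longest isoform.

    records: iterable of (gene_id, protein_id, sequence) tuples
             (an assumed input layout, not the authors' format).
    Returns a dict gene_id -> (protein_id, sequence).
    """
    best = {}
    for gene_id, protein_id, sequence in records:
        current = best.get(gene_id)
        # Replace the stored isoform only if this one is strictly longer,
        # so ties are resolved in favour of the first isoform seen.
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


# Toy input: two isoforms of geneA, one isoform of geneB.
toy_records = [
    ("geneA", "A.1", "MKT"),
    ("geneA", "A.2", "MKTLLV"),
    ("geneB", "B.1", "MSS"),
]
kept = longest_isoform_per_gene(toy_records)
```

How ties between equally long isoforms were handled is not stated in the text; keeping the first one encountered is an arbitrary choice made here.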
Protein family reconstruction
Protein sequences from the 11 species were searched against each other using BlastP. Based on these results, we used MCL to cluster the sequences and generate protein families according to Stein et al. A value of 2.0 was chosen as the inflation factor for the MCL clustering because this was the optimum value for preserving the homologous memberships between D. melanogaster and S. cerevisiae proteins identified by INPARANOID. Protein families having members in all 11 species were defined as universal families.
Each protein sequence was searched against the Pfam domain profiles [14, 68] using hmmpfam. Significant matches were selected using the default cutoffs. The domain types, copy numbers, and locations for each protein sequence were recorded. Based on this information, the sequences for every domain were extracted from the above proteins, and each domain was defined as a group. This process enabled an evaluation of domain evolution.
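The definition of a universal family can be illustrated with a short sketch. It assumes the clustering output has been loaded into a dictionary that maps each family to its (species, protein) members; that layout, the function name and the toy species labels are hypothetical.

```python
def universal_families(families, species):
    """Return the ids of families that have at least one member in every species.

    families: dict family_id -> list of (species_name, protein_id) pairs
              (an assumed representation of the clustering output).
    species:  iterable with the names of all species in the study.
    """
    required = set(species)
    return sorted(
        family_id
        for family_id, members in families.items()
        if required <= {species_name for species_name, _ in members}
    )


# Toy example with three species instead of eleven.
toy_species = ["sp1", "sp2", "sp3"]
toy_families = {
    "fam1": [("sp1", "p1"), ("sp2", "p2"), ("sp3", "p3"), ("sp3", "p4")],
    "fam2": [("sp1", "p5"), ("sp2", "p6")],
}
universal = universal_families(toy_families, toy_species)
```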
Duplication and deletion detection
Duplications and deletions of protein sequences and domain sequences were identified using Urec. First, the sequences for each family (or domain) were aligned using Muscle. The distance matrices and reconstructed phylogenetic trees for each set of aligned sequences were computed using PROTDIST and NEIGHBOR of Phylip, respectively. NEIGHBOR was used instead of likelihood-based programs because of its speed and because Urec considers only tree topology. We compared 20 random families using PROML and NEIGHBOR, and did not find any differences in the inferred tree topologies. We believe this topological consistency resulted from the large evolutionary distances among the organisms we analyzed. The reconstructed trees were reconciled with the species tree of the 11 taxa using Urec to infer the duplication and deletion events over their evolution. The relative rates of the corresponding events on each lineage were computed by normalizing the numbers of events by the inferred branch lengths. These branch lengths were derived from the multiple alignments of all universal families with a single member per species using PROMLK of Phylip. This permitted us to take advantage of molecular clocks among core proteins and to make the events comparable across different lineages.
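As a minimal sketch of the normalization described above, the relative rate on a lineage can be taken as the number of inferred events divided by the branch length of that lineage. The function name, lineage labels and numbers below are made up for illustration only.

```python
def relative_rates(event_counts, branch_lengths):
    """Divide the number of events on each lineage by that lineage's branch length.

    event_counts:   dict lineage -> number of inferred events (e.g. duplications)
    branch_lengths: dict lineage -> inferred branch length (must be > 0)
    """
    return {
        lineage: count / branch_lengths[lineage]
        for lineage, count in event_counts.items()
    }


# Toy numbers, not taken from the study.
toy_counts = {"lineage_1": 120, "lineage_2": 30}
toy_lengths = {"lineage_1": 0.5, "lineage_2": 0.25}
rates = relative_rates(toy_counts, toy_lengths)
```

In this toy case the second lineage has fewer events but a much shorter branch, so the rates of the two lineages differ by less than the raw counts suggest.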
Protein family and domain death and birth
Using an approach similar to that of Hughes and Friedman, protein family death and birth were evaluated using DOLLOP by treating each protein family as a character and its presence or absence as a discrete evolutionary state. A species was assigned a value of 1 for a family (character) if it had a member sequence in that family; otherwise it was assigned a value of 0. DOLLOP reconstructed the ancestral states for all the characters (protein families) using a Dollo parsimony algorithm. Dollo parsimony is considered to overestimate the number of domains/proteins present in the most ancient nodes; however, its use does not always lead to such overestimation. This phenomenon is data dependent, and Dollo parsimony actually yielded lower ancestral intron densities than maximum likelihood (ML) based methods. Because of this, we did not attempt to correct the bias. Dollo parsimony is based on simple assumptions and is computationally cheap. In contrast, ML methods are usually computationally intensive and, more importantly, require either an estimate of the rate of evolutionary change or the assumption that the rate of change is constant. They can also produce significant bias when an unrealistic rate of evolutionary change is used. In our work, as in many other evolutionary studies, the rate of change was not available. Because we expected great variation in evolutionary rates among different lineages (which our results confirmed), assuming a constant rate of change would in all probability have severely compromised the current analysis. For these reasons, we chose to use parsimony, which is also the method of choice in evolutionary analyses when large and disparate datasets are involved. In addition, previous reports have also found that Dollo parsimony performed better than ML-based methods in gene content-based tree reconstruction.
Protein family deaths and births were inferred by checking the states of these characters on each lineage of the 11-species tree. In a like manner, the death/birth events of each domain over its evolution were also inferred, as were the unique domain losses of each species. Unique domain losses were defined as deaths of domains present in all of the other 10 species but absent from the one indicated. The associations among family member duplications/deletions, domain duplications/deletions, protein family deaths/births, and domain deaths/births were investigated using Pearson's correlation coefficients. The significance of these correlation coefficients was tested using Student's t-distribution.
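The Dollo assumption (a character arises once and can then only be lost) can be illustrated on a fixed tree. The sketch below is not the DOLLOP program; it is a simplified reconstruction for a single presence/absence character, and the tree encoding, function name and node labels are hypothetical.

```python
def dollo_states(parent, presence):
    """Reconstruct states for one character under the Dollo assumption.

    parent:   dict node -> parent node (the root maps to None)
    presence: dict leaf -> 1 (present) or 0 (absent)
    Returns (states, origin, deaths): the 0/1 state of every node, the node
    on whose incoming branch the single birth is placed, and the nodes on
    whose incoming branches a death is inferred.
    """
    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path

    states = {node: 0 for node in parent}
    present_leaves = [leaf for leaf, state in presence.items() if state == 1]
    if not present_leaves:
        return states, None, []

    # The single birth is placed at the last common ancestor of all
    # leaves that carry the character.
    paths = [path_to_root(leaf) for leaf in present_leaves]
    shared = set(paths[0]).intersection(*paths[1:])
    origin = next(node for node in paths[0] if node in shared)

    # Every node between the origin and a present leaf carries the character.
    for path in paths:
        for node in path:
            states[node] = 1
            if node == origin:
                break

    # A death is a branch whose parent has the character and whose child lacks it.
    deaths = sorted(
        node for node, par in parent.items()
        if par is not None and states[par] == 1 and states[node] == 0
    )
    return states, origin, deaths


# Toy tree: root -> (A, n1); n1 -> (B, n2); n2 -> (C, D)
toy_parent = {"root": None, "A": "root", "n1": "root",
              "B": "n1", "n2": "n1", "C": "n2", "D": "n2"}
toy_presence = {"A": 0, "B": 1, "C": 0, "D": 1}
states, origin, deaths = dollo_states(toy_parent, toy_presence)
```

In the toy tree the family is found in B and D, so it is born on the branch leading to their common ancestor n1 and lost once, on the branch leading to C.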
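The correlation test can be sketched with the standard formulas: Pearson's r, and the statistic t = r * sqrt((n - 2) / (1 - r^2)), which follows Student's t-distribution with n - 2 degrees of freedom when there is no correlation. The function names and the toy vectors are illustrative assumptions; the p-value lookup is left to a statistics library and is not computed here.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy per-lineage event counts, not taken from the study.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
r = pearson_r(toy_x, toy_y)
t = t_statistic(r, len(toy_x))
```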
Indices defining protein family and domain change, adaptation, and domain shuffling
Organism complexity is closely related to the number of protein families and domains. In order to illustrate the changes in organism complexity over the course of evolution, we computed the log ratio of birth to death events of protein families and domains at different lineages, which we define as the protein/domain change index. For example, if the number of birth events equals the number of death events over a lineage, the index is 0, suggesting no change in complexity over the evolution of this lineage. If the number of birth events is larger than the number of death events, the index is larger than 0, suggesting that the organism has gained complexity over the lineage.
In the adaptation index (AI), B denotes the inferred birth events at the corresponding lineage, D the inferred death events at the corresponding lineage, and l the branch length of the corresponding lineage. If a lineage has a short branch length and a large number of birth/death events, then the adaptation index of that lineage is large, suggesting dramatic adaptation along that lineage. Only the adaptation index of protein families was explored because of the limited number of domain birth events.
Protein families can be generated from new domains and/or domain shuffling. To illustrate the effect of domain shuffling on protein family evolution, we examined the log ratio of protein family birth to protein domain birth at different lineages, which we defined as the domain shuffling index. As with the adaptation index, the birth events were first normalized by the lineage branch lengths.
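The two log-ratio indices described in words above can be sketched as follows. The function names are hypothetical, the natural logarithm is an assumption (the base is not stated in the text), and lineages with zero birth or death events are not handled.

```python
import math


def change_index(births, deaths):
    """Log ratio of birth to death events on a lineage (log base assumed natural)."""
    return math.log(births / deaths)


def domain_shuffling_index(family_births, domain_births, branch_length):
    """Log ratio of family birth rate to domain birth rate on a lineage.

    Both counts are first normalized by the branch length, as described in
    the text; the natural logarithm is an assumption.
    """
    family_rate = family_births / branch_length
    domain_rate = domain_births / branch_length
    return math.log(family_rate / domain_rate)


# Toy values, not taken from the study.
balanced = change_index(50, 50)   # equal births and deaths
gaining = change_index(80, 20)    # more births than deaths
shuffling = domain_shuffling_index(120, 30, 0.5)
```

Because both birth counts on a lineage are divided by the same branch length, the normalization cancels inside the ratio in this sketch, and the index depends only on the two counts.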
GO term functional annotation of protein families was examined using InterProScan based on the H. sapiens, D. melanogaster or C. elegans proteomes. Significant enrichments of GO terms were computed based on hypergeometric distributions using FUNC by comparing the number of occurrences of a given GO term in the target group with the number in the background group. When testing a group of families, the GO terms identified by multiple members of the same family were counted only once. For example, when testing the GO term enrichment of nematode-specific families, the GO terms identified by all C. elegans proteins were included, and GO terms identified multiple times by different proteins from the same family were counted only once. When testing the probability of these data, refinement was performed by removing the GO terms identified as significant only because of their derived terms. The false discovery rate (FDR) computed by FUNC was used to reduce false discoveries. Therefore, unless specified otherwise, GO term enrichment was selected based on both p-value <0.001 (after refinement) and FDR <0.1.
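The two ideas in this paragraph, counting a GO term once per family and testing enrichment with a hypergeometric distribution, can be sketched as below. This is not the FUNC implementation and it omits the refinement and FDR steps; the function names, the data layout and the toy term labels are hypothetical.

```python
from collections import Counter
from math import comb


def count_terms_once_per_family(family_to_protein_terms):
    """Count each GO term at most once per family.

    family_to_protein_terms: dict family_id -> list of sets of GO terms,
    one set per member protein (an assumed input layout).
    """
    counts = Counter()
    for protein_terms in family_to_protein_terms.values():
        # Union over all member proteins, so repeated terms collapse to one.
        counts.update(set().union(*protein_terms))
    return counts


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n families are drawn from N, of which K carry the term."""
    total = comb(N, n)
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / total


# Toy annotation: term_a is seen in both families, term_b only in fam1.
toy_annotation = {
    "fam1": [{"term_a", "term_b"}, {"term_a"}],
    "fam2": [{"term_a"}],
}
term_counts = count_terms_once_per_family(toy_annotation)

# Toy test: 2 of 3 target families carry a term found in 4 of 10 background families.
p_value = hypergeom_upper_tail(2, 4, 3, 10)
```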
ZW, DZ and MM designed the study and wrote the manuscript. ZW, SA, and JM did the analysis. All authors read and approved the final manuscript.
FDR: False discovery rate
LCA: Last common ancestor
This work was supported in part by the NIH NIAID grant AI081803 to M.M.
- Bork P: Shuffled domains in extracellular proteins. FEBS Lett. 1991, 286: 47-54.
- Richardson JS: The anatomy and taxonomy of protein structure. Advances in Protein Chemistry. 1981, 34: 167-339.
- Wu CH, Huang H, Yeh L-SL, Barker WC: Protein family classification and functional annotation. Comput Biol Chem. 2003, 27: 37-47.
- Dayhoff MO: Computer analysis of protein sequences. Fed Proc. 1974, 33: 2314-2316.
- Gilbert W: Why genes in pieces? Nature. 1978, 271: 501-501.
- Li W: Molecular evolution. 1997, Sinauer Associates Incorporated, Sunderland, Massachusetts.
- Wang Z, Martin J, Abubucker S, Yin Y, Gasser R, Mitreva M: Systematic analysis of insertions and deletions specific to nematode proteins and their proposed functional and evolutionary relevance. BMC Evol Biol. 2009, 9: 23.
- Jiang H, Blouin C: Insertions and the emergence of novel protein structure: a structure-based phylogenetic study of insertions. BMC Bioinformatics. 2007, 8: 444.
- Cao LH, Ding XM, Yu WB, Yang XM, Shen SQ, Yu L: Phylogenetic and evolutionary analysis of the septin protein family in metazoan. FEBS Lett. 2007, 581: 5526-5532.
- Enmark E, Gustafsson JA: Nematode genome sequence dramatically extends the nuclear receptor superfamily. Trends Pharmacol Sci. 2000, 21: 85-87.
- Hoogewijs D, De Henau S, Dewilde S, Moens L, Couvreur M, Borgonie G, Vinogradov SN, Roy SW, Vanfleteren JR: The Caenorhabditis globin gene family reveals extensive nematode-specific radiation and diversification. BMC Evol Biol. 2008, 8: 13.
- Chothia C, Gough J, Vogel C, Teichmann SA: Evolution of the protein repertoire. Science. 2003, 300: 1701-1703.
- Babushok DV, Ostertag EM, Kazazian HH: Current topics in genome evolution: Molecular mechanisms of new gene formation. Cell Mol Life Sci. 2007, 64: 542-554.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, et al: The Pfam protein families database. Nucleic Acids Res. 2012, 40: D290-D301.
- Kawashima T, Kawashima S, Tanaka C, Murai M, Yoneda M, Putnam NH, Rokhsar DS, Kanehisa M, Satoh N, Wada H: Domain shuffling and the evolution of vertebrates. Genome Res. 2009, 19: 1393-1403.
- Buljan M, Bateman A: The evolution of protein domain families. Biochem Soc Trans. 2009, 37: 751-755.
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30: 1575-1584.
- Mitreva M, Smant G, Helder J: Role of Horizontal Gene Transfer in the Evolution of Plant Parasitism Among Nematodes. In Volume. 2009, 532: 517-535.
- Ekman D, Björklund ÅK, Elofsson A: Quantification of the Elevated Rate of Domain Rearrangements in Metazoa. J Mol Biol. 2007, 372: 1337-1348.
- Lynch M, Conery JS: The Evolutionary Fate and Consequences of Duplicate Genes. Science. 2000, 290: 1151-1155.
- Hughes AL, Friedman R: Genome Size Reduction in the Chicken Has Involved Massive Loss of Ancestral Protein-Coding Genes. Mol Biol Evol. 2008, 25: 2681-2688.
- Cohen-Gihon I, Fong JH, Sharan R, Nussinov R, Przytycka TM, Panchenko AR: Evolution of domain promiscuity in eukaryotic genomes–a perspective from the inferred ancestral domain architectures. Mol Biosyst. 2011, 7: 784-792.
- Opperman CH, Bird DM, Williamson VM, Rokhsar DS, Burke M, Cohn J, Cromer J, Diener S, Gajan J, Graham S, et al: Sequence and genetic map of Meloidogyne hapla: A compact nematode genome for plant parasitism. Proc Natl Acad Sci. 2008, 105: 14802-14807.
- Robinson-Rechavi M, Garcia HE, Laudet V: The nuclear receptor superfamily. J Cell Sci. 2003, 116: 585-586.
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al: Initial sequencing and analysis of the human genome. Nature. 2001, 409: 860-921.
- Peisajovich SG, Rockah L, Tawfik DS: Evolution of new protein topologies through multistep gene rearrangements. Nat Genet. 2006, 38: 168-174.
- Fong JH, Geer LY, Panchenko AR, Bryant SH: Modeling the Evolution of Protein Domain Architectures Using Maximum Parsimony. J Mol Biol. 2007, 366: 307-315.
- Kummerfeld SK, Teichmann SA: Relative rates of gene fusion and fission in multi-domain proteins. Trends in Genetics. 2005, 21: 25-30.
- Ohno S: Evolution by gene duplication. 1970, Springer, New York.
- Lundin LG: Evolution of the Vertebrate Genome as Reflected in Paralogous Chromosomal Regions in Man and the House Mouse. Genomics. 1993, 16: 1-19.
- Hawks J, Wang ET, Cochran GM, Harpending HC, Moyzis RK: Recent acceleration of human adaptive evolution. Proc Natl Acad Sci U S A. 2007, 104: 20753-20758.
- Lespinet O, Wolf YI, Koonin EV, Aravind L: The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 2002, 12: 1048-1059.
- Taylor JS, Raes J: Duplication and divergence: The evolution of new genes and old ideas. Annu Rev Genet. 2004, 38: 615-643.
- Hillier LW, Miller W, Birney E, Warren W, Hardison RC, Ponting CP, Bork P, Burt DW, Groenen MAM, Delany ME, et al: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004, 432: 695-716.
- Larroux C, Luke GN, Koopman P, Rokhsar DS, Shimeld SM, Degnan BM: Genesis and expansion of metazoan transcription factor gene classes. Mol Biol Evol. 2008, 25: 980-996.
- King N: The unicellular ancestry of animal development. Dev Cell. 2004, 7: 313-325.
- King N, Hittinger CT, Carroll SB: Evolution of key cell signaling and adhesion protein families predates animal origins. Science. 2003, 301: 361-363.
- Richards GS, Degnan BM: The dawn of developmental signaling in the Metazoa. Cold Spring Harbor Symp Quant Biol. 2009, 74: 81-90.
- Caveney S, Cladman W, Verellen L, Donly C: Ancestry of neuronal monoamine transporters in the Metazoa. J Exp Biol. 2006, 209: 4858-4868.
- Bargmann CI: Neurobiology of the Caenorhabditis elegans genome. Science. 1998, 282: 2028-2033.
- Zhang XM, Firestein S: The olfactory receptor gene superfamily of the mouse. Nat Neurosci. 2002, 5: 124-133.
- Niimura Y: On the Origin and Evolution of Vertebrate Olfactory Receptor Genes: Comparative Genome Analysis Among 23 Chordate Species. Genome Biol Evol. 2009, 1: 34-44.
- Swigonova Z, Mohsen AW, Vockley J: Acyl-CoA Dehydrogenases: Dynamic History of Protein Family Evolution. J Mol Evol. 2009, 69: 176-193.
- Kosiol C, Vinar T, da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A: Patterns of Positive Selection in Six Mammalian Genomes. PLoS Genetics. 2008, 4 (8): e1000144.
- Strom AR, Kaasen I: Trehalose metabolism in Escherichia coli: stress protection and stress regulation of gene expression. Mol Microbiol. 1993, 8: 205-210.
- Horlacher R, Boos W: Characterization of TreR, the major regulator of the Escherichia coli trehalose system. J Biol Chem. 1997, 272: 13026-13032.
- Arguelles JC: Physiological roles of trehalose in bacteria and yeasts: a comparative analysis. Arch Microbiol. 2000, 174: 217-224.
- Keith PC, Kevin S: Molecular and genetic characterization of osmosensing and signal transduction in the nematode Caenorhabditis elegans. FEBS J. 2007, 274: 5782-5789.
- Rao AU, Carta LK, Lesuisse E, Hamza I: Lack of heme synthesis in a free-living eukaryote. Proc Natl Acad Sci. 2005, 102: 4270-4275.
- Boureux A, Vignal E, Faure S, Fort P: Evolution of the Rho family of Ras-like GTPases in eukaryotes. Mol Biol Evol. 2007, 24: 203-216.
- Blaxter ML, Baker MR: Littlewood: Nematoda: Genes, genomes and the evolution of parasitism. In Adv Parasitol. Academic Press. 2003, 54: 101-195.
- Miller F, Tobler H: Chromatin diminution in the parasitic nematodes Ascaris suum and Parascaris univalens. Int J Parasitol. 2000, 30: 391-399.
- Denver DR, Morris K, Lynch M, Thomas WK: High mutation rate and predominance of insertions in the Caenorhabditis elegans nuclear genome. Nature. 2004, 430: 679-682.
- Vogel C, Chothia C: Protein Family Expansions and Biological Complexity. PLoS Comput Biol. 2006, 2: e48.
- Kellis M, Birren BW, Lander ES: Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004, 428: 617-624.
- Wang YF, Gu X: Evolutionary patterns of gene families generated in the early stage of vertebrates. J Mol Evol. 2001, 52: 309-309.
- Lundin L-G: Gene duplications in early metazoan evolution. Seminars in Cell and Developmental Biology. 1999, 10: 523-530.
- Quiquand M, Yanze N, Schmich J, Schmid V, Galliot B, Piraino S: More constraint on ParaHox than Hox gene families in early metazoan evolution. Dev Biol. 2009, 328 (2): 173-187.
- Stern A, Privman E, Rasis M, Lavi S, Pupko T: Evolution of the metazoan protein phosphatase 2 C superfamily. J Mol Evol. 2007, 64: 61-70.
- Duan J, Li R, Cheng D, Fan W, Zha X, Cheng T, Wu Y, Wang J, Mita K, Xiang Z, Xia Q: SilkDB v2.0: a platform for silkworm (Bombyx mori) genome biology. Nucleic Acids Res. 2010, 38: D453-D456.
- Yook K, Harris TW, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, de la Cruz N, Duong A, Fang R, et al: WormBase 2012: more genomes, more data, new website. Nucleic Acids Res. 2012, 40: D735-D741.
- Mitreva M, Jasmer DP, Zarlenga DS, Wang Z, Abubucker S, Martin J, Taylor CM, Yin Y, Fulton L, Minx P, et al: The draft genome of the parasitic nematode Trichinella spiralis. Nat Genet. 2011, 43: 228-235.
- Glazko GV, Koonin EV, Rogozin IB: Molecular dating: ape bones agree with chicken entrails. Trends in Genetics. 2005, 21: 89-92.
- Nei M, Xu P, Glazko G: Estimation of divergence times from multiprotein sequences for a few mammalian species and several distantly related organisms. Proc Natl Acad Sci. 2001, 98: 2497-2502.
- Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al: The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics. PLoS Biol. 2003, 1: e45.
- Gaunt MW, Miles MA: An Insect Molecular Clock Dates the Origin of the Insects and Accords with Palaeontological and Biogeographic Landmarks. Mol Biol Evol. 2002, 19: 748-761.
- O'Brien KP, Remm M, Sonnhammer ELL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucl Acids Res. 2005, 33: D476-D480.
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al: Pfam: clans, web tools and services. Nucleic Acids Res. 2006, 34: D247-D251.
- Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763.
- Gorecki P, Tiuryn J: URec: a system for unrooted reconciliation. Bioinformatics. 2007, 23: 511-512.
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797.
- Felsenstein J: PHYLIP-Phylogeny inference package (Version 3.2). Cladistics. 1989, 5: 164-166.
- Grishin NV, Wolf YI, Koonin EV: From Complete Genomes to Measures of Substitution Rate Variability Within and Between Proteins. Genome Res. 2000, 10: 991-1000.
- Hughes AL, Friedman R: Differential loss of ancestral gene families as a source of genomic divergence in animals. Proceedings of the Royal Society B-Biological Sciences. 2004, 271: S107-S109.
- Le Quesne WJ: The uniquely evolved character concept and its cladistic application. Systematic Zoology. 1974, 23: 513-517.
- Csuros M, Rogozin IB, Koonin EV: A detailed history of intron-rich eukaryotic ancestors inferred from a global survey of 100 complete genomes. PLoS Comput Biol. 2011, 7: e1002150.
- Knowles DG, McLysaght A: High rate of recent intron gain and loss in simultaneously duplicated Arabidopsis genes. Mol Biol Evol. 2006, 23: 1548-1557.
- Huson DH, Steel M: Phylogenetic trees based on gene content. Bioinformatics. 2004, 20: 2044-2049.
- Zdobnov EM, Apweiler R: InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics. 2001, 17: 847-848.
- Prufer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, Paabo S, Lachmann M, Enard W: FUNC: a package for detecting significant associations between gene sets and ontological annotations. BMC Bioinformatics. 2007, 8: 41.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.