Phylogenetic analyses suggest multiple changes of substrate specificity within the Glycosyl hydrolase 20 family
© Intra et al. 2008
Received: 11 March 2008
Accepted: 22 July 2008
Published: 22 July 2008
Beta-N-acetylhexosaminidases belonging to the glycosyl hydrolase 20 (GH20) family are involved in the removal of terminal β-glycosidically linked N-acetylhexosamine residues. These enzymes, widely distributed in microorganisms, animals and plants, are involved in many important physiological and pathological processes, such as cell structural integrity, energy storage, pathogen defence, viral penetration, cellular signalling, fertilization, development of carcinomas, inflammatory events and lysosomal storage diseases. Nevertheless, only limited analyses of phylogenetic relationships between GH20 genes have been performed until now.
Careful phylogenetic analyses of 233 inferred protein sequences from eukaryotes and prokaryotes reveal a complex history for the GH20 family. In bacteria, multiple gene duplications and lineage specific gene loss (and/or horizontal gene transfer) are required to explain the observed taxonomic distribution. The last common ancestor of extant eukaryotes is likely to have possessed at least one GH20 family member. At least one gene duplication before the divergence of animals, plants and fungi as well as other lineage specific duplication events have given rise to multiple paralogous subfamilies in eukaryotes. Phylogenetic analyses also suggest that a second, divergent subfamily of GH20 family genes present in animals derives from an independent prokaryotic source. Our data suggest multiple convergent changes of functional roles of GH20 family members in eukaryotes.
This study represents the first detailed evolutionary analysis of the glycosyl hydrolase GH20 family. Mapping of data concerning physiological function of GH20 family members onto the phylogenetic tree reveals that apparently convergent and highly lineage specific changes in substrate specificity have occurred in multiple GH20 subfamilies.
Carbohydrates are involved in many biological functions including maintenance of cell structural integrity, energy storage, pathogen defence, viral penetration, cellular signalling and fertilization. Enzymes specifically responsible for carbohydrate hydrolysis have been classified in 111 families of glycosyl hydrolases (GH) on the basis of amino acid sequence similarity [1, 2]. Hexosaminidases belong to families GH3, GH20 and GH84. Among these, family 20 is of particular interest, and includes β-N-acetylhexosaminidases (β-hexosaminidase) (EC 3.2.1.52), enzymes that hydrolyze non-reducing terminal β-1,4 linked N-acetylglucosamine (GlcNAc) or β-N-acetylgalactosamine (GalNAc) residues of oligosaccharides and their conjugates, bacterial chitobiases (EC 3.2.1.30) and lacto-N-biosidase (EC 3.2.1.140). Crystal structures are known for numerous β-N-acetylhexosaminidases including the bacterial enzymes from Serratia marcescens and Streptomyces plicatus, and the α- and β-chains of human lysosomal enzymes. The catalytic domain is an α/β TIM-barrel with the active site at the centre of the barrel complex [5–7].
In mammals, there are two major β-N-acetylhexosaminidase isoforms, named HEXA and HEXB, which reside in lysosomes and participate in the degradation of glycoproteins, glycolipids and glycosaminoglycans. HEXA is a heterodimer of subunits α (encoded by the gene HEXA) and β (encoded by the gene HEXB), whereas HEXB is a homodimer of β subunits. The subunits arose via a gene duplication event and the primary sequences are approximately 60% identical [8, 9]. In particular, mutations in the human HEXA and HEXB genes cause Tay-Sachs and Sandhoff diseases, respectively, both fatal neurodegenerative disorders [10, 11]. Recent data suggest that lysosomal exoglycosidases, along with many other factors, may participate in tumor cell development and progression. The potential involvement of a β-N-acetylhexosaminidase in fertilization in hamster and human has also been hypothesized.
Beta-hexosaminidases are also widely distributed in insects. Several studies have led to the identification, molecular cloning and purification of β-N-acetylhexosaminidases in Lepidoptera such as Manduca sexta, Spodoptera frugiperda, Bombyx mori and Trichoplusia ni, in Diptera such as Drosophila melanogaster [19, 20] and Aedes aegypti, and more recently in the coleopteran Tribolium castaneum. The β-hexosaminidase activity of insects is of particular interest because of the role this glycosidase plays in the alteration of the structures of N-glycans generated in the cells [16, 22–24] and in chitin degradation processes [16, 22, 24, 25]. Chitin, found in the cuticle of the integument and the peritrophic membrane of the midgut, is a linear polymer of β-1,4 linked N-acetylglucosamine. Chitin degradation is catalyzed by chitinolytic enzymes in two successive steps: chitinase (EC 3.2.1.14) hydrolyzes chitin into oligosaccharides of GlcNAc, then β-N-acetylhexosaminidase further degrades the oligomers into monomers. Because of the important role of chitin, β-N-acetylhexosaminidase is considered to be a potential target for insect control agents such as biopesticides [26, 27]. Recently, the presence in the plasma membrane of spermatozoa of Drosophila melanogaster of two β-N-acetylhexosaminidases potentially involved in sperm-egg interactions has been demonstrated [20, 28].
In crustaceans, β-N-acetylhexosaminidase has been shown to be important in the degradation of the chitin that forms the exoskeleton. Beta-N-acetylhexosaminidases have been purified and characterized in different species, such as Euphausia superba and Scylla serrata.
In the ascidian Phallusia mammillata, a β-N-acetylhexosaminidase present in the plasma membrane of spermatozoa might have a role in the primary binding between gametes.
In addition, an important function of a β-N-acetylhexosaminidase has been postulated in the anaerobic parasitic protozoan Entamoeba histolytica, the causative agent of amoebiasis. Two β-N-acetylhexosaminidases have been identified, cloned and characterized. These enzymes are most probably involved in degrading glycoconjugates of extracellular matrix components, allowing the parasite to cross basement membranes.
Hexosaminidase genes are also distributed among fungi and bacteria, and the enzyme family plays an important physiological role in the natural recycling of chitin, a structural component of the cell wall [33, 34]. The molecular cloning of β-N-acetylhexosaminidases has been reported for several bacteria, such as Alteromonas sp. strain O-7, Alteromonas sp. strain 10S-24, Serratia marcescens, Vibrio harveyi, Vibrio vulnificus, Enterobacter sp. strain G1 and Cellulomonas fimi, and several fungi, such as Trichoderma harzianum, Trichoderma atroviride, Aspergillus oryzae, Aspergillus nidulans and Candida albicans. Beta-N-acetylhexosaminidase is an emerging target for the design of fungicides. In fact, several chitinolytic bacteria and fungi have been shown to be powerful biological control agents, protecting, for example, plants against pathogens [41, 46–48].
Plant β-hexosaminidases have been detected in a variety of tissues including seeds and leaves [49, 50]. In particular, high levels of β-N-acetylhexosaminidase activity have been detected in germinating seeds, suggesting a role in the storage of glycoproteins. A function in defence processes has also been proposed, since several of the purified β-N-acetylhexosaminidases could be chitin-degrading enzymes [27, 50]. More recently, β-hexosaminidases of Arabidopsis thaliana have been cloned and characterized at the molecular level.
Despite the increasing number of hexosaminidase homologs that have been identified in different species, only restricted analyses of phylogenetic relationships between glycosyl hydrolases of the GH20 family have been conducted [3, 16, 22, 24, 31, 40, 53]. Because of their important roles in several biological processes, we have undertaken a more comprehensive analysis of the evolutionary history of the GH20 family. In this study we have identified, analyzed and characterized β-N-acetylhexosaminidases from prokaryotes and eukaryotes. We show that while the GH20 family is widely distributed among eubacteria, the observed taxonomic distribution is best explained by a combination of gene duplications and horizontal gene transfer events. Likewise, the limited occurrence of the family in archaebacteria is probably the result of lateral transfer from eubacteria. We trace gene duplication events both at basal and lineage-specific levels within eukaryotes and demonstrate that while the most widely studied GH20 family members derive from a single ancestral eukaryotic gene, a second subfamily of more divergent sequences present in at least the majority of metazoans was likely acquired from an independent prokaryotic source. Finally, we describe the patterns of conservation of protein features in numerous β-N-acetylhexosaminidase subfamilies and relate these features to the current understanding of GH20 family function, an important exercise given apparent convergence of physiological function of non-orthologous GH20 enzymes in eukaryotes.
Eukaryotic β-hexosaminidase proteins belonging to the Glycosyl hydrolase family 20 used in Figure 4.
Aspergillus nidulans FGSC A4
Coccidioides posadasii C735
Cryptococcus neoformans var. neoformans JEC21
Magnaporthe grisea 70-15
Neotyphodium sp. FCB-2004
We observed hydrophobic sequences predicted to be a signal peptide with potential cleavage sites in the N-terminal regions of all metazoan β-hexosaminidases (not shown), except Hex3_Dm, Hex3_Dp and Hex2_Ag, which have a putative signal-anchor domain [20, 24]. Moreover, SOSUI and HMMTOP analyses identified one or two putative transmembrane helices in all sequences, but these predictions have a low degree of confidence. Although β-hexosaminidases are typically soluble lysosomal proteins, they have also been found in body fluids in mammals [54–58], in insects [20, 59–61] and in plants. The presence of β-hexosaminidases in an extracellular compartment is not surprising. It is in fact known that lysosomal hydrolases can be released through constitutive secretion or through lysosomal exocytosis [57, 62, 63]. Furthermore, β-N-acetylhexosaminidases are present as intrinsic proteins of the plasma membrane in insects [20, 24], in ascidians, in plants, in vertebrates, including man [13, 14, 64], as well as in a variety of human somatic cells [65, 66]. Thus, it could be hypothesized that the membrane-associated β-N-acetylhexosaminidase might be generated by a deficient cleavage of the signal peptide sequence, as demonstrated for other transmembrane proteins that are also present as soluble enzymes [20, 67–70]. If the signal peptide were not cleaved, the enzyme would be a type II plasma membrane protein with an anchor sequence, as demonstrated in a few human sperm and fruit fly proteins [14, 20, 28, 70].
Four gene products from C. elegans [GenBank: AAA96105.3, CAI06053.1, CAO72177.1, CAA22078.2] belong to this subfamily of GH20 and it has been recently demonstrated that they show β-hexosaminidase activity. However, no data were presented on other amino acids involved in substrate binding and catalysis except the glutamate residue [5, 6]. Further experiments are necessary to determine the enzyme activity of the probable orthologs to these C. elegans genes – which show most similarity to divergent GH20 proteins from low G+C gram positive bacteria, planctomycetes and a single crenarchaeote (Thermofilum pendens) (see additional files 2 and 3).
The GH20 family does not appear to be distributed evenly among bacterial phyla. Indeed, the majority of available sequences are derived from the Proteobacteria, Actinobacteria, Firmicutes, the Bacteroidetes/Chlorobi (CFB) group and Acidobacteria. Our phylogenetic analyses recover several well-supported clades containing members of well-defined taxonomic groups. For example, beta-proteobacterial and the majority of the gamma-proteobacterial sequences are recovered as a well-supported monophyletic group which has a moderately supported association with some alpha-proteobacterial sequences. However, other gamma- and alpha-proteobacteria fall in disparate positions in the tree, while Actinobacteria also fall in at least two clusters. Analogously, the Bayesian analysis recovers disparate clades of sequences derived from members of the Bacteroidetes/Chlorobi. These data suggest either ancient gene duplications or multiple horizontal gene transfer events within bacteria. We note that both Bayesian posterior probability and, in particular, bootstrap support for deep level relationships between groups are consistently low, prohibiting strong inferences regarding the evolution of the gene family in prokaryotes. Indeed, analysis of constrained tree topologies using the Shimodaira/Hasegawa test suggested that a variety of relationships between the well defined groups indicated in Figure 3 were not significantly worse explanations of the data than the Bayesian consensus tree (not shown). It is thus impossible to judge whether an ancient origin of the GH20 family in bacteria was followed by multiple gene duplications and lineage specific losses of paralogs, or whether the GH20 family arose within a discrete bacterial phylum and the observed current distribution is a result of extensive lateral transfer coupled with occasional gene duplication.
Interestingly, a well supported clade consisting mainly of sequences derived from Firmicutes (herein the "Firmicute" clade) also includes several sequences from CFB group bacteria, members of the Chlamydia/Verrucomicrobia group, a single Actinobacterial sequence, several members of the Thermotogae, several Chloroflexi and the few Archaebacterial sequences identified in this study. Taken together with the discontinuous presence of GH20 family genes in Archaebacteria, Thermotogae, Chlamydiae and other bacterial taxa (and their apparent absence from other major bacterial groups, e.g. Cyanobacteria), we suggest that extensive horizontal gene transfer has played a significant role in generating the observed taxonomic distribution of GH20 family members in prokaryotes.
It is notable that sequences from the parasitic protist Trichomonas vaginalis are recovered within a well-supported clade of sequences from CFB group bacteria. Indeed, this case of probable prokaryote to eukaryote horizontal gene transfer was explicitly noted during the characterization of the genome of this parabasalid flagellate.
To better understand the evolution of the GH20 family within eukaryotes, we performed Bayesian – and maximum likelihood bootstrap – phylogenetic analyses of unambiguously aligned regions of inferred eukaryote clade A protein sequences (Figure 5). Vertebrate alpha and beta chains each constitute monophyletic groups and are each other's well-supported sister groups. The emergence of tunicate sequences as the monophyletic sister group of the vertebrate forms is consistent with a gene duplication in a common ancestor of vertebrates after the divergence of tunicates. Well-supported clusters of arthropod, nematode and schistosome sequences emerge basal to the tunicate clade.
A second, well-supported cluster of arthropod sequences emerges in a distinct part of Eukaryote clade A (Figure 5). Indeed, the Bayesian analysis also recovers two clusters of plant, fungal and mycetozoan sequences – suggesting that a gene duplication event occurred in a common ancestor of plants, animals and fungi. However, it is not possible to position the root of the tree in such a way as to explain these multiple clusters with a single duplication event. We therefore investigated alternative hypotheses to explain the observed distribution of GH20 paralogs using constrained phylogenetic trees. Every constrained topology in which plant or arthropod sequences are monophyletic was a statistically worse explanation of the data than the Bayesian topology according to the Shimodaira/Hasegawa (SH) test implemented in TREE-PUZZLE, although all fungal sequences can be constrained as monophyletic, as can mycetozoan (Entamoeba and Dictyostelium) sequences (which emerge close together in the bootstrap consensus tree). Several topologies describing plausible organismal relationships and invoking a single ancestral gene duplication in eukaryotes are not rejected by the SH test (see additional file 5). The GH20 sequence data are thus compatible with current hypotheses of deep level organismal relationships within eukaryotes and do not require invocation of eukaryote to eukaryote gene transfer as an explanation. While the monophyly, or otherwise, of fungal sequences cannot be established with confidence, the current data suggest that the observed diversity of arthropod and plant GH20 family genes results in part from ancient duplications that occurred before the divergence of plants and animals. However, subsequent "local" gene duplications have also occurred in both lineages.
This phenomenon is particularly notable in plant clade I and in animal clade II – where successive gene duplications have given rise to Hexo1, Hexo2 and fused lobes (named fdl) genes.
Sequence similarity searches of the DOE Joint Genome Institute eukaryotic genome databases and other eukaryotic genome projects recover a large number of sequences that clearly derive from the GH20 family. Given the low quality of annotation and the widespread occurrence of potentially prokaryotic contaminants in such databases, we have omitted such sequences from our phylogenetic analyses. However, it is clear that the GH20 family is represented in all but a few major eukaryotic lineages. Taxa where complete or nearly complete genome sequences are available but where we were unable to recover traces of GH20 genes in at least one genome were restricted to divergent protists and unicellular organisms with secondarily reduced genomes: Diplomonads, Euglenozoa, Apicomplexa, and several Fungi (Saccharomyces, Schizosaccharomyces and Microsporidia).
The Eukaryote clade B sequences (Figure 6), while conserved between each other, are characterized by generally high levels of divergence from the "Firmicute" clade in which they are embedded. Sequence similarity searches of the JGI eukaryotic genome databases revealed convincing traces of clade B sequences in the genomes of the cnidarian Nematostella [JGI:82909, JGI:245803], the placozoan Trichoplax [JGI:23741, JGI:2986, JGI:2419, JGI:51551], the mollusk Lottia [JGI:140764, JGI:235755] and the annelid Capitella [JGI:235755]. Unfortunately, all associated gene predictions were of rather low quality: they were either incomplete or included apparently undetected introns (or highly divergent insertions), and they lacked supporting EST evidence. Accordingly, they were not used in phylogenetic analyses. However, the presence of highly conserved potentially coding sequences in these genomes (and the absence of such traces in the genome of the choanoflagellate Monosiga and other protist, fungal and plant genomes) is of considerable interest. Our data are consistent with the proposition of Gutternigg et al. that eukaryotes have obtained GH20 family members from at least two phylogenetically distinct sources. The apparent absence of clade B sequences from non-metazoan genomes suggests a lateral transfer event into a basal metazoan, although an earlier acquisition – and subsequent loss from plant, protist and fungal genomes – cannot be excluded.
The strong statistical support for the monophyly of the "Firmicute" clade in which these and other more divergent sequences are embedded (Figure 4), our failure to identify acceptable topologies in which Eukaryote clade A and B sequences represent sister groups and the apparent absence of excessively long branches in this clade (not shown) lead us to prefer the hypothesis that these sequences represent a truly monophyletic assemblage rather than a phylogenetic artifact resulting from independent accelerations in evolutionary rates and resulting "long-branch attraction" between unrelated sequences.
The phylogenetic hypotheses presented here may be used to understand the evolution of GH20 family functionality in eukaryotes. While our phylogenetic reconstructions do not allow reliable rooting of the eukaryote tree or permit us to understand the evolutionary position of the more divergent GH20 isoforms, many clades and relationships are robustly supported. In this light it is informative to map functional data onto the trees. One of the most important differences between N-linked glycosylation in vertebrates and in insects, worms and plants is the presence of paucimannosidic N-glycan structures in the latter group. For example, the insect fdl genes are closely related to the Hexo1 and Hexo2 genes of insects. The fdl gene products are involved in the formation of paucimannosidic N-glycans [24, 53], while Hexo1 and Hexo2 gene products (as well as fungal GH20 proteins) are believed to function as degradative enzymes (exochitinases). Plant clade II sequences also have an exochitinase-like function, as recently demonstrated in Arabidopsis thaliana. In this light, the ancestral activity of the plant, fungal and insect clade II sequences was likely that of an exochitinase, and the fdl gene products have undergone a change in substrate specificity towards N-glycans. The animal clade I sequences have been shown to be involved in the degradation of oligosaccharide chains of glycoproteins and glycolipids [7, 77], although the proteins encoded by the genes of the lepidopteran Spodoptera frugiperda have also been shown to be responsible for the generation of paucimannosidic N-glycans [16, 78], and might potentially represent a second example of functional shift within insects. Plant clade I sequences have been demonstrated to participate in the biosynthesis of truncated N-linked oligosaccharides and degradation of chitooligosaccharides [52, 53]. Furthermore, the C. elegans hexosaminidase emerging in animal clade I likely has a role as an exochitinase and may represent another example of a functional shift. We note that Drosophila species are not represented in clade I and speculate that their need for N-glycan-metabolizing β-hexosaminidases is fulfilled by the fdl gene product. Moreover, nematode and insect N-glycan hexosaminidases are membrane-bound proteins and are not localized in the lysosomes, while in plants the processing of N-glycans to paucimannosidic saccharides takes place in the vacuole. An understanding of the functional roles of the mycetozoan proteins, which provide the best candidates for the most ancient divergence in the eukaryote tree, might be the key to understanding the nature of the ancestral enzyme activity. However, functional characterization of the slime mould Dictyostelium GH20 protein has not been published, and biochemical data on the β-N-acetylglucosaminidases of Entamoeba histolytica do not allow strong inference as to the substrate specificity of these amoebozoan enzymes. Furthermore, although we have observed high conservation of amino acids involved in substrate binding (see above), a crystal structure of a β-hexosaminidase involved in paucimannosidic N-glycan formation might provide precise information about the characteristics of active sites of β-hexosaminidases that display exochitinase activity. Despite the incompleteness of functional information, the phylogenetic tree presented in Figure 5 suggests that synthesis of paucimannosidic glycans is a derived characteristic that has evolved independently on at least three occasions. Finally, Gutternigg et al. showed that two clade B C. elegans gene products [GenBank: AAA96105.3, CAI06053.1] are involved in the metabolism of paucimannosidic N-glycans, while two other nematode β-hexosaminidases have chitooligosaccharidase activity [GenBank: CAO72177.1; CAA22078.2].
The functional roles of the vertebrate sequences that appear to be most closely related to these divergent nematode genes (see above and additional file 3) remain obscure.
Our phylogenetic analyses of GH20 family proteins provide new insights into evolutionary relationships and the history of the protein family and represent the first such detailed study of the GH20 protein family. We show that eukaryote sequences derive from two independent gene acquisitions. The most widely studied group of genes was present in a common ancestor of plants, animals and fungi, and the ancestral sequence underwent at least one gene duplication event early in eukaryote evolution, giving rise to at least two paralogs that evolved differential functions as degradative enzymes or a processing activity involved in the synthesis of paucimannosidic N-linked oligosaccharides. Despite these ancient roles, members of both subfamilies of eukaryote clade A have undergone apparent functional shifts, typified by the FDL proteins of insects, which appear to be derived from exochitinase-like ancestors but which now play a role in the metabolism of paucimannosidic N-glycans. Conversely, the C. elegans sequence represented in Figure 5 is derived from paucimannosidic glycan processing enzymes, but functions as an exochitinase. Indeed, the mammalian isoforms are also derived from paucimannosidic N-glycan processing enzymes, but exhibit specificity towards the more complex oligosaccharides present in mammalian cells. It remains unclear whether these differences reflect differential substrate specificities, or are more related to physiological patterns of expression.
Our data lend strong support to the contention of Gutternigg et al. that more divergent GH20 family members from C. elegans and other metazoa are likely to be derived from a separate acquisition from prokaryotes. While the physiological roles of these proteins have not yet been studied experimentally outside of nematodes, it is unlikely that they should be involved in the production of paucimannosidic N-glycans in mammals, and it is thus probable that additional important roles for GH20 members remain to be discovered in animals.
All characterized and complete β-N-acetylhexosaminidase protein sequences belonging to the GH20 family were retrieved from the CAZy, Pfam and Swiss-Prot databases. In order to retrieve all the other β-N-acetylhexosaminidase sequences available, we performed PSI-BLAST searches of the protein database throughout all organisms at NCBI and translated BLAST searches (tBLASTn) against the full genomic sequences available at the UCSC genome browser, using as queries the Homo sapiens protein sequences HEXA [GenBank: AAB00965.1] and HEXB [GenBank: AAA52645.1], Arabidopsis thaliana β-N-acetylhexosaminidase-like protein [GenBank: AAM61367.1], Danio rerio Zgc:112084 protein [GenBank: AAH93192.1], Drosophila melanogaster protein sequences HEXO1 [GenBank: AAF47881.1], HEXO2 [GenBank: AAM48390.1] and FDL [GenBank: AAM29423.1], and Streptomyces plicatus β-N-acetylhexosaminidase protein [GenBank: AAC38798.3]. Only complete and representative protein sequences were employed for subsequent evolutionary study and protein analysis. For proteins with possible splice variants the longest predicted isoforms were used. A total of 223 protein sequences distributed across eubacteria, archaebacteria and eukaryotes were selected for the analysis of protein features and phylogenetic study (Table 1 and additional files 1, 2 and 3). In this study, we denote β-N-acetylhexosaminidase proteins as Hex, followed by an Arabic numeral when two or more β-N-acetylhexosaminidase proteins have been identified in the same organism, and the abbreviation of the genus and species name (e.g. Bacteroides fragilis NCTC 9343 β-N-acetylhexosaminidase will be denoted as Hex1_Bfr).
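The isoform selection and naming rules described above can be illustrated with a minimal sketch. The record layout (gene identifier, isoform identifier, sequence), the function names and the label used when an organism has a single protein are assumptions made for illustration; they are not taken from the original analysis.

```python
def longest_isoforms(records):
    """Keep the longest predicted isoform for each gene.

    `records` is an iterable of (gene_id, isoform_id, sequence) tuples.
    This layout is hypothetical, chosen only for the example.
    """
    best = {}
    for gene_id, isoform_id, seq in records:
        current = best.get(gene_id)
        # Strict '>' keeps the first isoform seen when lengths tie.
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (isoform_id, seq)
    return best


def hex_label(abbrev, number=None):
    """Build a label such as 'Hex1_Bfr'.

    `number` is given only when an organism has two or more proteins.
    The un-numbered form 'Hex_<abbrev>' is an assumption; the text only
    shows a numbered example.
    """
    if number is None:
        return f"Hex_{abbrev}"
    return f"Hex{number}_{abbrev}"
```

For instance, `hex_label("Bfr", 1)` yields the label quoted in the text for the Bacteroides fragilis NCTC 9343 enzyme.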
Beta-N-acetylhexosaminidase sequences were analyzed using SignalP 3.0 with default options to predict signal peptide sequences, SOSUI and HMMTOP with default parameters for the prediction of transmembrane helices, and SMART to predict the domain architecture. Conserved motif analysis was performed with the MEME program using default settings.
In order to maximize the number of unambiguously aligned sites used for phylogenetic analysis of eukaryote sequences, two separate datasets were prepared. The first set contained only eukaryote sequences which passed tests of compositional homogeneity implemented in TREE-PUZZLE, while the second included available prokaryote sequences (several clusters of extremely closely related sequences from taxonomically similar bacteria were excluded to reduce the computational burden of phylogenetic analyses).
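As an illustration of what a compositional homogeneity test checks, the sketch below computes a per-sequence chi-square statistic against the pooled amino acid frequencies of the alignment and compares it with the 5% critical value for 19 degrees of freedom. This is a simplified stand-in written for this purpose; TREE-PUZZLE's own implementation may differ in detail, and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Upper 5% critical value of the chi-square distribution with 19 degrees
# of freedom (20 amino acid classes minus one).
CHI2_CRIT_DF19 = 30.144


def composition_chi2(alignment):
    """Return {name: chi-square statistic} for each sequence.

    `alignment` maps sequence names to aligned sequences; gaps and
    ambiguous characters are ignored. Expected counts come from the
    pooled composition of all sequences.
    """
    counts = {
        name: Counter(c for c in seq.upper() if c in AMINO_ACIDS)
        for name, seq in alignment.items()
    }
    pooled = Counter()
    for c in counts.values():
        pooled.update(c)
    total = sum(pooled.values())
    freqs = {aa: pooled[aa] / total for aa in AMINO_ACIDS}

    stats = {}
    for name, c in counts.items():
        n = sum(c.values())
        chi2 = 0.0
        for aa in AMINO_ACIDS:
            expected = n * freqs[aa]
            if expected > 0:  # skip residues absent from the whole alignment
                chi2 += (c[aa] - expected) ** 2 / expected
        stats[name] = chi2
    return stats


def passes_homogeneity(alignment, critical=CHI2_CRIT_DF19):
    """Flag sequences whose composition does not deviate at the 5% level."""
    return {name: s <= critical for name, s in composition_chi2(alignment).items()}
```

Sequences flagged as failing would be candidates for exclusion from the eukaryote-only dataset.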
In both cases, inferred protein sequences were aligned with the software Muscle and alignments were refined manually. Unambiguously aligned regions were identified using the program GBlocks. The resulting datasets contained 233 sequences with 223 amino acid positions (prokaryote and eukaryote dataset) and 76 sequences with 274 amino acid positions (eukaryote-only dataset). The WAG amino acid substitution model with gamma distributed site rates and an invariable site category was used in all phylogenetic analyses, as the Prottest software indicated that this model provided the best fit to the data. Phylogenetic trees were estimated within the Bayesian statistical framework using a parallelized version of the program MrBayes v3.1 on a cluster of processors running the Linux operating system (2,000,000 generations with trees sampled every 50 generations). The first 5000 trees generated were excluded as "burnin" for the MCMC chains (likelihood stabilization was determined graphically). The program SEQBOOT from the PHYLIP package was used to generate 100 pseudoreplicate datasets; maximum-likelihood trees were estimated for each replicate using the program PHYML, and consensus trees were inferred using the program CONSENSE. Constrained trees were generated manually and evaluated according to the Shimodaira/Hasegawa test of alternative tree topologies as implemented in TREE-PUZZLE.
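The sampling figures above imply a simple piece of arithmetic, and the pseudoreplicate step amounts to resampling alignment columns with replacement. The sketch below illustrates both under stated assumptions: the tree counts ignore the starting tree and refer to a single run, and the resampling function is an illustrative stand-in for what SEQBOOT does rather than a reproduction of it. All function names are hypothetical.

```python
import random


def mcmc_sample_summary(generations, sample_every, burnin_trees):
    """Relate generations, sampling frequency and burn-in (one run)."""
    sampled = generations // sample_every  # sampled trees, starting tree ignored
    return {
        "sampled_trees": sampled,
        "burnin_generations": burnin_trees * sample_every,
        "retained_trees": sampled - burnin_trees,
        "burnin_fraction": burnin_trees / sampled,
    }


def bootstrap_alignment(alignment, rng):
    """Draw one pseudoreplicate by resampling columns with replacement.

    `alignment` maps names to equal-length aligned sequences. The same
    column indices are used for every sequence so that columns stay intact.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in cols) for name in names}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate `n_replicates` pseudoreplicate alignments reproducibly."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

With the values reported here, `mcmc_sample_summary(2_000_000, 50, 5000)` gives 40,000 sampled trees, of which the first 5000 (250,000 generations, 12.5% of the run) are discarded and 35,000 are retained.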
This work was supported by PRIN grants from MIUR to D. Horner and G. Pavesi.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.