Adaptive evolution of Hox-gene homeodomains after cluster duplications

Background Hox genes code for homeodomain-containing transcription factors that function in cell fate determination and embryonic development. Hox genes are arranged in clusters with up to 14 genes. This archetypical chordate cluster has duplicated several times in vertebrates, once at the origin of vertebrates and once at the origin of gnathostoms, an additional duplication event is associated with the origin of teleosts and the agnanths, suggesting that duplicated Hox cluster genes are involved in the genetic mechanisms behind the diversification of vertebrate body plans, and the origin of morphological novelties. Preservation of duplicate genes is promoted by functional divergence of paralogs, either by subfunction partitioning among paralogs or the acquisition of a novel function by one paralog. But for Hox genes the mechanisms of paralog divergence is unknown, leaving open the role of Hox gene duplication in morphological evolution. Results Here, we use several complementary methods, including branch-specific dN/dS ratio tests, branch-site dN/dS ratio tests, clade level amino acid conservation/variation patterns, and relative rate ratio tests, to show that the homeodomain of Hox genes was under positive Darwinian selection after cluster duplications. Conclusion Our results suggest that positive selection acted on the homeodomain immediately after Hox clusters duplications. The location of sites under positive selection in the homeodomain suggests that they are involved in protein-protein interactions. These results further suggest that adaptive evolution actively contributed to Hox-gene homeodomain functions.


Background
The homeobox codes for a highly conserved 60 amino acid DNA-binding motif (the homeodomain) found in transcription factors [1]. One class of homeobox-containing transcription factor genes are the Hox genes, which are homologous to the genes in the Drosophila homeotic (HOM) gene cluster, that specify cell fate during embryonic development [1] and have derived functions in other tissues [2]. Multiple Hox genes located in tightly linked clusters have been identified in all animal phyla examined, with the archetypical chordate cluster having 14 genes (Hox1-Hox14) [3]. The number of Hox clusters has increased several times in vertebrate evolution: the cluster duplicated twice in early vertebrates leading to four clusters (HoxA-D) with 42 genes [4,5] and additional cluster duplications in teleost fish led to 7-8 clusters with [45][46][47] genes [6,7]. Independent duplications have also occurred in the jawless vertebrates hagfish [8] and lamprey [9].
Models of duplicate gene preservation predict functional differentiation of paralogs based on protein sequence or regulatory divergence [10,11]. Although numerous models of duplicate gene divergence have been proposed, four different mechanisms of functional divergence are likely to explain preservation of duplicate Hox genes: acquisition of novel functions by one paralog (neo-functionalization) [12], passive erosion of functional redundancy due to complementary degenerative mutations, (subfunctionalization) [11], models that predict the accumulation of neutral mutations, which later acquire functional constraints because the environment or genetic background changes (the Dykhuizen-Hartl effect) [13] or divergent adaptive selection of both paralogs (adaptive diversification) [14]. This list has recently been expanded by the introduction of the subneofunctionalization [15] and the adaptive radiation [16] models that predict rapid subfunctionalization after duplication followed by a prolonged period of neofunctionalization and adaptive divergence of duplicate genes in a process analogous to species radiations, respectively. Here, we are interested in testing whether positive selection acted immediately after cluster duplications to promote functional divergence and identify which mechanisms discussed above most adequately explain the preservation of Hox duplicates in vertebrates.
How paralogus Hox genes have been retained is not known, although evidence suggestive of positive selection after cluster duplication has been identified in Hox7 [17], Hox5 and Hox6 [18] paralogs. In these studies, however, it is not clear whether directional selection was responsible for the maintenance of the duplicated genes or other mechanisms promoted the maintenance of duplicates [19]. In addition, evidence for positive selection immediately after Hox cluster duplications has recently been identified in teleost fish for HoxA-11 and HoxB-5 [20]. These data suggest that, in the evolution of ray-finned fishes, some duplicate Hox genes have been preserved by functional differentiation through the action of positive Darwinian selection immediately following the gene duplication. This suggests that Hox genes may have also experienced adaptive evolution following the cluster duplications earlier in vertebrate evolution.
Hox cluster duplication and gene diversification has been proposed to be one of the genetic mechanisms behind the diversification of vertebrates and body plans and the origin of morphological novelties [21][22][23]. This association, however, is difficult to reconcile with the perceived degree of sequence conservation between the homeodomains of Hox genes and the numerous examples of functional equivalence of Hox/Hom genes from strikingly divergent organisms [24][25][26][27][28]. Mouse HoxA-5, for example, is able to activate the same target genes as its Drosophila homolog, Sex combs reduced (Src), in axis determination indicating strong conservation of function over 500 to 600 million years [29], but counter examples also exist, showing functional non-equivalence of Ubx orthologs from fairy shrimp, velvet worm and Drosophila [30] and non-equivalence of homeodomains from HoxA-4, HoxA-10, HoxA-11 and HoxA-13 paralogs from mouse [31,32].
In this paper we investigate the sequence divergence in homeoboxes from the four gnathostome Hox clusters, including genes from basal vertebrates and sarcopterygians like shark and coelacanth, respectively. This is the first study of homeodomain divergence with extensive taxon sampling allowing us to identify the relative phylogenetic age of substitution events in vertebrate phylogeny. We use three different, but complementary, approaches to test for functional divergence among paralogs: comparison of patterns of amino acid sequence conservation/variation among paralog clades, d N /d S ratio tests to detect directional selection and identify positive sites, and comparison of clade level polymorphisms/divergence rates.
Our results indicate that after cluster duplication positive Darwinian selection acted on the homeodomain of Hox proteins prior to the divergence of the modern gnathostome and bony fish lineages. We find amino acid substitutions at sites that are not involved in structural constraints and are located on the molecular surface where they are available for protein-protein interactions were targets of positive selection. We suggest that the action of positive selection at a subset of sites not constrained by ancestral (plesiomorphic) functions after cluster duplications led to the emergence of novel protein interactions while maintaining ancestral ones. This model can help reconcile the role of Hox genes in morphological diversification and innovation with their extreme sequence conservation.

Functional divergence of paralog-group homeodomains
We compiled a database of Hox genes with 4-5 species for each gene (155 sequences in total) and compared conserved and variable sites between paralog group members to identify if there are characteristic residues that distinguish which cluster a paralog belongs to (for example, see Figure 1). This analysis identified many sites that are conserved among species but variable between genes in the same paralog group ('cluster-specific' residues; Figure 2). Although the homeodomain is a highly conserved motif, it is not invariant; in fact only 17 residues are absolutely conserved between all vertebrate Hox genes in our alignment, suggesting that variable sites could be functionally divergent. Many of these variable sites have been previ-ously shown to be 'characteristic residues' that distinguish paralog groups from each other and have been suggested to be engaged in protein-protein interactions [33].
To test if 'cluster-specific' residues are functionally divergent we estimated the coefficient of functional divergence (θ), which measures the difference in evolutionary rate at amino acid sites between gene clusters. Rejection of the null hypothesis (θ = 0) is strong evidence for altered functional constraints after gene duplication (or speciation) [34]. We found significant evidence of type-I functional divergence for comparisons between HoxA, HoxB, and HoxD clusters (θ I = 0.24-0.37, p < 0.05; Figure 3) under the ((AD)(BC)) topology and under the (B(A(CD))) topology (θ I = 0.271-0.396, p < 0.05; Figure 3) at sites that differentiate paralog groups. We also found evidence of functional divergence of the HoxB cluster from the proto-HoxACD cluster after the initial duplication event predicted under the (B(A(CD))) model (θ I = 0.221 ± 0.07, p < 0.05; Figure 3). Results were not significant for comparisons between HoxC and any other cluster (θ I = 0.001-0.029) under either duplication model. Comparisons between HoxB and HoxA to protoBC under the (B(A(CD))) model and protoAD to protoBC under the ((AD)(BC)) model, θ I(B(A(CD))) = 0.138-134 and θ I((AD)(BC)) = 0.001, respectively, were not significant.
Type-I sites are defined as those with an amino acid that is conserved in one clade but variable in the sister clade, implying that the site is under structural/functional constraints in the first clade that is absent in the variable clade [34]. Type-I sites are located in the amino-and carboxy terminal ends of the homeodomain (outside of the 3 hel-ices and in regions not predicted to be well structured) and in loop connecting helix 2 and 3.
Recently, a method has been developed to test for type-II functional divergence [35]. Type-II sites are those that are highly conserved in both clades but are fixed for amino acids with different biochemical properties between sister clades, implying these residues are responsible for functional differences between these groups. Although all paralog groups had at least one site with evidence of type-II divergence, all of which were radical amino acid substitutions (defined as a change in polarity or charge, but not size) of surface residues, the θ II values are extremely small (θ II = 0.001-0.062), highlighting the conservation between homeodomains from different clusters. These results are not unexpected given that this method calculates θ across all sites in an alignment and thus effectively averages site-wise θ values. With only ~4% of sites/cluster showing a pattern of type-II divergence in our concatenated alignment, it is not likely that the ~46 possible type-II sites have θ II values high enough to compensate for the extremely low θ II values of the over 900 sites with θ II effectively equal to zero.

Accelerated evolution of homeodomains after cluster duplication
Several models of molecular evolution have been proposed to account for the preservation of duplicate genes. Although the details of each model can vary (see introduction), they differ in their predictions regarding the pattern of sequence evolution following gene duplication. The neo-functionalization and divergent selection models An Example of Functional Divergence in Hox6 Paralogs predict that the nonsynonymous substitution rate will be increased following gene duplication because of positive Darwinian selection in the gene acquiring the new function, while the Dykhuizen-Hartl and DDC models predict and increase in the substitution rate because of relaxed purifying selection. It is possible to distinguish between these models by comparing nonsynonymous (d N ) to synonymous (d S ) substitution rate (d N /d S = ω) with ω = 1, <1, and >1 indicating neutral evolution, purifying selection and directional selection.
Unlike the functional divergence methods developed by Gu [36,37], estimating selection using the d N /d S ratio is, by definition, dependent on the degree of divergence of the sequences under study. Thus, short sequences with a high degree of amino acid conservation but substantial synonymous site divergence may not contain enough sig-nal to reliably obtain estimates of d N and d S . We assessed whether homeodomain sequences contained sufficient information for reliable rate estimates by examining the tree length statistic S, the number of nucleotide substitutions per codon. For individual paralog groups S range from 6.3-13.3 (average = 10.00), with tree length d N averaging 1.2 substitutions per nonsynonymous site and tree length d S averaging 18 substitutions per synonymous site along the tree. Interestingly, simulation studies [38] have shown that at levels of sequence divergence similar to our datasets, use of the χ 2 made the likelihood ratio test statistic (LRT) extremely conservative such that the type-I error rate is very small. Similarly, the power of the LRT to reject the null hypothesis even when it is false (type-II error) was found to be conservative even at medium to high levels of sequence divergence [38]. The power of the LRT increases as the number of sequences increases such that at 17 taxa

Hox1
Relative Rate Ratio Tree and Coefficent of Functional Divergence    power is nearly 100% [38], suggesting that our inclusion of at least 8-12 sequences (depending on the paralog group) helped alleviate loss of power from short conserved sequences. The simulation study results indicated that the optimal sequence divergence depends on the dataset and appears to be within the medium-to-high range [38]. Our data indicate that results based on estimates of d N and d S from this homeodomain dataset are reliable, if conservative.
To estimate the strength and kind of selection acting on Hox gene homeodomains, we used maximum likelihood methods to estimate the nonsysnonymous (d N ) to sysnonymous (d S ) substitution rate ratio [39,40]. The one ratio model is the simplest and provides a measure of the average strength and direction of selection acting on the gene throughout its history and can test if there was an increase in the rate of evolution after Hox cluster duplications. As expected, the d N /d S ratio for the homeodomains of all paralog groups is much less than 1 (0.0033-0.0359) highlighting the dominant role purifying selection plays on Hox gene evolution. To test if there was an increase in the nonsynonymous substitution rate following Hox cluster duplication we used a two ratios model that estimated separate ω's for post cluster duplication (ω PCD ) and post speciation (ω PS ) branches. Post cluster duplication branches evolved significantly faster (3-27x) than post speciation branches for 10 of 13 paralog groups; the remaining 3 paralog groups had ω PCD > ω PS but the results were not significant (Table 1). A more complex model that allowed each post duplication lineage to have separate d N /d S ratios from each other (the paralog 6 group for example: ω PCD-A6 , ω PCD-B6 and ω PCD-C6 ) and post speciation (ω PS ) branches was not better than the simple tworatio model indicating that paralogs experienced similar selective forces after cluster duplication. These results are consistent with previous data from Hox5, Hox6 and Hox7 and indicate there was a period of rapid evolution of the homeodomain after Hox cluster duplication that could have been the result of either positive Darwinian selection or relaxed purifying selection.

Adaptive evolution of homeodomains after cluster duplication: Relative rate ratio tests
Although positive selection at the molecular level is most often tested using the d N /d S ratio, this method has several inherent limitations. The most problematic of which is when positive selection is acting at a limited number of sites while the majority are under strong purifying selection. Under these conditions d N will never become larger than d S and the signal for positive selection will be masked. In addition, when there is a large amount of sequence divergence between two nodes in a tree (site saturation) the accuracy of d S , and to a lesser extent d N , is greatly reduced. These two limitations of the d N /d S ratio to detect positive selection are particularly important for studying selective forces after Hox cluster duplications since very few sites (less than 15%) changed after duplication and the duplication events are relative ancient (about 560 MYA; ref), leading to substantial synonymous site divergence. Thus, even though we found evidence of accelerated rates of sequence evolution post cluster duplication, it is unlikely that the d N /d S ratio tests used above would be able to detect positive selection (ω > 1).
One complementary method that has been developed to compensate for some of limitations of the d N /d S ratio is the relative rate ratio test of Creevey and McInerney [41], which is an extension of the contingency test of neutrality proposed by Templeton [42] and McDonald and Kreitman [43]. Briefly, this method reconstructs ancestral sequences for each node in a phylogenetic tree using parsimony and identifies all substitutions that result in nonsynonymous and synonymous changes for each node. Substitutions are classified as replacement invariable (RI, i.e. nonsynonymous substitutions that are not substituted again in descendent lineages), replacement variable (RV, i.e. nonsynonymous substitutions that are substituted again in descendent lineages), silent invariable (SI, i.e. synonymous substitutions that are not substituted again in descendent lineages) and silent variable (RV, i.e. synonymous substitutions that are substituted again in descendent lineages).
Under neutral evolution the ratio of RI/RV will not be significantly different from SI/SV. Similarly, a period of relaxed purifying selection may increase RI/RV relative to SI/SV, but RI/RV will never be significantly greater than the neutral expectation given by SI/SV since the rate of replacement substitution can only exceed the rate of silent (neutral) substitution under positive selection. During an episode of positive selection, advantageous substitutions will become fixed in a lineage and remain invariant in descendent lineages, elevating the ratio of RI/RV relative to the neutral expectation given by SI/SV. Thus, when lineages are identified with a significantly greater RI/RV than SI/SV positive selection is indicated.
Using the relative rate ratio test to examine selective forces after cluster duplications identified that post-duplication lineages under the ((AD)(BC)) and the (B(A(CD))) models had significantly larger RI/RV than SI/SV (Figure 3 and Tables 2 and 3), indicating these duplication events were followed by adaptive evolution and supporting the results obtained with the d N /d S ratio tests and further suggesting that the increase in rates identified from the d N /d S ratio were due to positive selection.

Adaptive evolution of homeodomains after cluster duplication: d N /d S
The lineage-specific d N /d S model utilized above has been extended to account for variable d N /d S between sites and can detect positive selection at specific sites in specific lineages under appropriate conditions [44,45]. These branch-site models are ideal for detecting short episodes of positive selection that acted on a few sites while the majority of sites in the protein remained under purifying selection, as is likely to have occurred in the homeodomain after Hox cluster duplication. Applying branchsite models and to post cluster duplication (ω PCD ) branches identified sites under positive selection after cluster duplication (Figure 3) in paralog groups 1-6, 9 and 13 (Table 4). Positive Sites were identified with posterior probabilities (PP) greater than 0.90 using the both the liberal Neive Empircal Bayes (NEB) and the more conservative Bayes Empirical Bayes (BEB) methods implemented in PAML3.15, although only the result of the BEB method is shown. In addition, two genes in the Hox3, 5, 6, and 13 paralog groups have evidence of positive selection, but the results are not statistically significant. The sites identified under positive selection are the same as those that show evidence of type-II functional divergence   Table 4). Given that the ability of likelihood models to detect sites with ω > 1 is an extremely difficult computational problem, it is possible that these sites actually experienced positive selection, but that the models are not able to identify ω > 1. An equally likely explanation that does not invoke positive selection is that the ω = 1 is an accurate estimate for the rate at this sites, and is actually indicative of relaxed functional constraints after duplication, that the sites have not been substituted again indicates they under strong purifying selection in post-speciation lineages, supporting a Dykhuizen-Hartl mechanism for their evolution.

The structural basis of homeodomain evolution
To gain a better understanding of how functional constraints on the homeodomain relate to sequence divergence, we generated a sequence logo [46,47] from the multiple sequence alignment of Hox-gene homeodomains and mapped the location of sites under positive  selection and residues with known functions onto the logo and the crystal structure of the homeodomain bound to DNA (Figure 4). Adaptive/functionally divergent sites are grouped into three discrete regions of the homeodomain: the extreme amino and carboxy terminal arms just outside of the homeodomain proper and in the C-terminal end of helix-2 extending into the loop connecting helix-2 and helix-3.
The repressor domain, where the majority of protein interactions have been found, and helix-3 are free from positive sites, likely reflecting conserved functions shared by all Hox genes. Several proteins have been shown to bind in the repressor domain including the CREB binding protein (CBP) [48], high mobility group protein 1 (HMG1) [49], members of the Maf family of basic-leucine zipper (bZip) activators [50], and geminin [51]. This  region also overlaps with the Sp1 transactivation region [52]. In addition to characterized protein-protein interactions, the repressor domain also contains the majority of 'characteristic-residues' that distinguish cognate groups from each other indicating the majority of sites in this region were already under functional constraints after the tandem duplications which created the Hox gene cluster and were not available to be targets of adaptive selection after cluster duplication. Interestingly, the first 3 sites of the geminin-binding region were under directional selection in different paralog groups, including radical amino acid substitutions, suggesting selection to modulate geminin binding between paralog group members.
Mapping functionally divergent amino acid sites and sites under positive selection in and around helix-2 onto the logo and crystal structure shows that purifying selection has acted to preserve hydrophobic/aliphatic residues critical for the nuclear exportation signal [53] and positive selection has acted exclusively on sites that occur at the molecular surface. These sites form a small cluster at the posterior end of helix-2 in a prime location for protein-Function and Evolution of the Homeodomain Figure 4 Function and Evolution of the Homeodomain. The structure of the homeodomain bound to DNA is shown as ribbon models.
The location of the repressor domain is shown in orange, the nuclear localization signal in red, critical hydrophobic residues of the nuclear export signal in blue and positive sites in yellow. Only side chains of amino acids that make base contacts are shown. (C) Sequence logo of the Hox-gene homeodomain and surrounding amino acids. The overall height of the stacked amino acids indicates the sequence conservation at that position, while the height of symbols within the stack indicates the relative frequency of each amino acid at that position. The location of the homeodmain is shown above the logo. The location of protein-protein interaction regions are shown in blue (note that only some sites, not all sites, in blue actually participate in protein-protein interactions), sequence motifs are shown in light gray and the location of helices in dark gray. Sites identified under directional selection after cluster duplication are shown with an asterik (*). Sites with known functional information are shown: G, characteristic paralog-group residue; S, site that assists in binding site discrimination between paralog groups; B, site that makes base contacts; H, site that is part of the hydrophobic core; P, site that contacts the phosphate backbone; E, location of leucine and isoleucine residues critical for the nuclear export signal.
protein interactions. Beyond the ultra-conserved helix-3, which also contains the nuclear localization signal [54], sites under positive selection have been identified in an unstructured connecting loop leading to additional structures in the carboxy terminus. The amino-terminal arm of the homeodomain, which confers functional specificity on Hox proteins, contains the majority of sites under positive selection suggesting that selection has acted to modify functional specificity between paralogs. This region appears to be unstructured and is a prime target for protein-protein interaction sites. This pattern of purifying and positive selection suggests that after Hox cluster duplications, selection acted on protein-protein interaction sites in such a way that ancestral functions were maintained while the acquisition of novel protein interaction partners driven was driven by selection on non-constrained amino acids. These derived interactions could be those responsible for novel Hox gene functions in vertebrates.

Conclusion
The homeodomain serves multiple functions in addition to DNA-binding, including containing nuclear localization and export signals, transcriptional activation and repression domains and other protein-protein interaction sites [33,54]. These functions combine to impose severe limitations on the degree of sequence divergence that can be accommodated by the homeodomain of Hox genes. Even with these constraints, however, the relatively small set of amino acids that were free to diverge after cluster duplication were subject to positive selection. Although the Hox cluster duplications are relatively ancient (450 MYA), complicating the detection of positive selection, we find congruence between multiple methods a strong indicating that our results are reliable. These results support an important role for the action of positive Darwinian selection in the divergence of Hox genes after cluster duplications, particularly at sites that distinguish paralog groups ('cluster-specific' residues).
Nearly all 'cluster-specific' residues map onto the molecular surface of the homeodomain, similar to the paralog group specific sites [33], suggesting changes in amino acid properties could influence interaction of the homeodomain with other proteins. Cofactor associations are important for Hox proteins and most other transcription factor functions; these protein-protein interactions occur at the molecular surface through the formation of hydrophobic and ionic bonds and other intermolecular interactions such as salt bridges and van der Vaals forces. Thus, changes in the physicochemical properties of amino acids participating in these bonds could disrupt preexisting interactions and/or lead to new interactions. These changes could provide a selective advantage for maintaining duplicate genes through the origin of novel protein-protein interactions (effectively reducing degeneracy between paralogs) leading to new gene functions.

Methods
The homeodomain of Hox genes was identified from BLAST searches of the nr database at NCBI. At least four members of each gene from diverse taxa were included in the dataset. The sequences were aligned based on the translated amino acid sequences with Se-Al v2.0, alignments were simple given the high degree of sequence conservation within paralog groups. Regions of ambiguous alignment just outside of the homeodomain but within exon 2 were excluded. Most alignments ranged from 70-82 amino acids. The alignment is available from V.J.L. and has been deposited in TREEBASE.
We used codon-based maximum likelihood models of coding sequence evolution implemented in CODEML in the PAML package of programs (version 3.15) to test for lineages and amino acid sites under positive selection. Sites were classified as being under positive selection if they were identified from the Bayes Empirical Bayes (BEB) method with a posterior probability of greater than 0.90. The branching order of the Hox cluster duplications is still debated (refs), but our analyses suggest that the most likely topologies are ((AD)(BC)) and (B(A(CD))) (a detailed analysis of Hox cluster duplication history is beyond the scope of this paper and will be presented elsewhere). We used 2 alternate trees to test for selection: ((AD)(BC)) and (B(A(CD))) and found no significant differences between the results of these different topologies.
Functional divergence was tested with DIVERGE alpha1.2 (obtained from X. Gu). We also used the relative rate ratio test of Creevey and McInerny [41] implemented in the program CRANN to test for adaptive evolution. Both DIVERGE and CRANN analyses used the 2 alternate topologies discussed above.