Adaptive evolution of the matrix extracellular phosphoglycoprotein in mammals
© Machado et al; licensee BioMed Central Ltd. 2011
Received: 31 March 2011
Accepted: 21 November 2011
Published: 21 November 2011
Skip to main content
© Machado et al; licensee BioMed Central Ltd. 2011
Received: 31 March 2011
Accepted: 21 November 2011
Published: 21 November 2011
Matrix extracellular phosphoglycoprotein (MEPE) belongs to a family of small integrin-binding ligand N-linked glycoproteins (SIBLINGs) that play a key role in skeleton development, particularly in mineralization, phosphate regulation and osteogenesis. MEPE associated disorders cause various physiological effects, such as loss of bone mass, tumors and disruption of renal function (hypophosphatemia). The study of this developmental gene from an evolutionary perspective could provide valuable insights on the adaptive diversification of morphological phenotypes in vertebrates.
Here we studied the adaptive evolution of the MEPE gene in 26 Eutherian mammals and three birds. The comparative genomic analyses revealed a high degree of evolutionary conservation of some coding and non-coding regions of the MEPE gene across mammals indicating a possible regulatory or functional role likely related with mineralization and/or phosphate regulation. However, the majority of the coding region had a fast evolutionary rate, particularly within the largest exon (1467 bp). Rodentia and Scandentia had distinct substitution rates with an increased accumulation of both synonymous and non-synonymous mutations compared with other mammalian lineages. Characteristics of the gene (e.g. biochemical, evolutionary rate, and intronic conservation) differed greatly among lineages of the eight mammalian orders. We identified 20 sites with significant positive selection signatures (codon and protein level) outside the main regulatory motifs (dentonin and ASARM) suggestive of an adaptive role. Conversely, we find three sites under selection in the signal peptide and one in the ASARM motif that were supported by at least one selection model. The MEPE protein tends to accumulate amino acids promoting disorder and potential phosphorylation targets.
MEPE shows a high number of selection signatures, revealing the crucial role of positive selection in the evolution of this SIBLING member. The selection signatures were found mainly outside the functional motifs, reinforcing the idea that other regions outside the dentonin and the ASARM might be crucial for the function of the protein and future studies should be undertaken to understand its importance.
Dentin, one of the major mineralized tissues of teeth, is deposited by odontoblasts, which synthesize collagenous and non-collagenous proteins (NCPs) [1, 2]. Among the NCPs, there is a family of small integrin-binding ligand N-linked glycoproteins (SIBLINGs) consisting of dentin matrix protein 1 (DMP1), dentin sialophosphoprotein (DSPP), integrin-binding sialoprotein (IBSP), matrix extracellular phosphoglycoprotein (MEPE, also known as OF45) and osteopontin (SPP1) . These genes share common genetic and structural features, including a small non-translational first exon, a start codon in the second exon and a large coding segment in the last exon (although exon number varies among the different genes) . The entire SIBLING protein family likely arose from the secretory calcium-binding phosphoprotein (SCPP) family by gene duplication, since this cluster of genes encodes proteins with similar molecular-structural features and functions .
Members of this gene family are encoded by a compact tandem gene cluster (located on chromosome 4 q in Human and 5 q in mouse) characterized by: (i) common exon-intron features, (ii) the presence of the integrin-binding tripeptide Arg-Gly-Asp (RGD) motif that mediates cell attachment/signaling via interaction with cell surface integrins , and (iii) post-translational modifications of conserved phosphorylation and N-glycosylation sites . In humans, the MEPE protein (525 amino acids) is encoded by four exons with a 1960 bp transcript with two N-glycosylation motifs (at residues 477-481), a glycosaminoglycan (SGDG) attachment site at residues 256-259, and the RGD cell attachment motif at residues 247-249 . The RGD motif has a similar function in other members of the SIBLING's (DSPP, DMP1, IBSP, and SPP1) . The protein MEPE has several predicted phosphorylation sites/motifs for protein kinase C, casein kinase II, tyrosine kinase, and cAMP-cGMP-dependent protein kinase and a large number of N-myristoylation sites that appear to be also a feature of the RGD-containing proteins . The acidic serine-aspartate-rich MEPE-associated motif (ASARM motif) occurs at the C-terminus in MEPE (residues 509 to 522)  and when phosphorylated this small peptide can bind to hydroxyapatite and inhibit mineralization .
The basic MEPE protein was first cloned from a tumor resected from a patient with tumor-induced osteomalacia (OHO) [7, 9], which is associated with hypophosphatemia and is caused by a renal phosphate wasting. The MEPE gene is also up-regulated in X-linked hypophosphatemic rickets (XLH or HYP-osteoblasts) and OHO-tumors [7, 10–14]. Under normal conditions it is expressed primarily in osteoblasts, osteocytes, and odontoblasts .
Targeted disruption of the MEPE gene in mouse causes increased bone formation and bone mass, suggesting that MEPE plays an inhibitory role in bone formation and mineralization . In humans, MEPE inhibits mineralization and is also involved in renal phosphate regulation [16, 17]. The inhibition of mineralization and phosphate uptake are related with the protease resistant small peptide ASARM motif located near the end of the protein [7, 17]. However, the MEPE protein has dual functions depending on the proteolytic processing. When the protein is cleaved by cathepsin B or D into several fragments, the small peptide ASARM is released  and when phosphorylated, this small peptide can bind the hydroxyapatite crystal and inhibit mineralization . By contrast, when fragments containing the RGD motif are released and the ASARM is not degraded by proteases, mineralization is accelerated . The influence of MEPE-ASARM peptides in the modulation of mineralization is due to a protein-protein interaction with PHEX, an X-linked phosphate-regulating endopeptidase homolog (also called the minhibin model) . PHEX is also expressed in osteoblasts, osteocytes and odontoblasts and the protein interacts with MEPE, protecting it from the proteolytic process (from cathepsin-B) and preventing ASARM from being released into blood circulation . Most of the disorders associated with MEPE result from a malfunction of this PHEX-MEPE interaction, which in turn leads to an increase of ASARM blood levels.
The majority of mammalian genes are strongly conserved in the coding sequence [20, 21]. Genes carrying signatures of selection may be involved in adaptation and functional innovation, and often have elevated ratios of nonsynonymous/synonymous nucleotide substitutions (dN/dS) in their coding regions . However, evolutionary rates of nuclear and mitochondrial genes are not equal in all the mammalian lineages . For example, while rodents tend to accumulate more mutations in nuclear genes than humans , the differences between the rates in the two lineages seems to be smaller than the generation time difference .
Since MEPE protein has an important role in the regulation of the skeleton mineralization process and since the mineralized tissue is a critical innovation in vertebrate evolution, the evolutionary study of this developmental gene could provide valuable insights on the adaptive diversification of morphological phenotypes in mammals. As the MEPE gene has been suggested to be under selection , our objective was to undergo a thorough analysis to evaluate signatures of positive selection using both a gene-level and protein-level approaches. We assessed the evolution of the MEPE protein-coding gene in 26 mammalian species, from Hyracoidea to Primates, showing that while four regions/motifs in the MEPE gene have a high degree of conservation, the majority of the coding region has a fast evolutionary rate, especially in rodents and tree shrews. Indeed, evidence of strong positive selection (gene and protein-level) was found in 20 amino acids that encompass MEPE protein, highlighting the role of molecular adaptation in the functionality of this gene.
Twenty-six mammalian MEPE sequences were retrieved from the GenBank and Ensembl databases, comprising eight different mammalian Orders (Additional file 1: Table S1). In addition, sequences of the putative MEPE orthologue, Ovocleidin-116, were obtained from the available bird genome projects (Gallus gallus, Taeniopygia guttata, Meleagris gallopavo) for comparative purposes. For the majority of the mammals considered in this work, the MEPE gene encompasses four exons that encode a transcript that varied from 1272 bp in Ochotona princeps to 2030 bp in Pan troglodytes. Some of the smallest reported transcripts may be incomplete, as in the case of O. princeps, which is missing a stop codon. The absence of the ASARM motif in the MEPE's C-terminal in some species (Equus caballus, Ochotona. princeps, Otolemur garnetti and Pteropus. vampyrus) also suggests that those genes were not fully annotated. Thus, we performed a detailed search in databases for those species using TBLASTN , which led to the identification of the ASARM in E. caballus, but not in O. princeps, O. garnetti and P. vampyrus (in these cases, the missing end portion of the protein corresponds to the end of the contig available in the database). However, several stop codons are present between the end of the present sequence and the putative ASARM motif in the E. caballus sequence and therefore it was not included in subsequent analyses.
The cell attachment region, RGD, situated near the center of the MEPE protein, is fully conserved in 20 of the 26 mammalian species (Figure 2). However, some changes are observed in Tursiops truncatus, Procavia capensis, the bats Pteropus vampyrus and Myotis lucifugus, and in the rodents Dipodomys ordii and Spermophilus tridecemlineatus (Figure 2) and it is likely that such amino acids changes in the RGD motif may have functional relevance. Moreover, the RGD motif is also present in other genes of this gene-cluster family.
The SDGD is completely conserved among all the mammals, reinforcing the premise that this peptide region is, along with RGD, important to the MEPE function. These two motifs constitute the dentonin region, which was not detected in any of the others members of the SIBLING protein family. The chicken and the turkey MEPE orthologues appear to be exceptions, since they do not have the cell-adhesion motif, RGD, but contain the glycosaminoglycan-binding motif, SGDG. In these species we found a HGD near the SGDG motif, suggesting that RGD is replaced by HGD (Additional file 4: Figure S2). A similar change from RGD has been described in other members of the DSPP orthologues (e.g. in rat, Rattus norvegicus, the RGD replaces the HGD) . Nevertheless, in zebra finch (T. guttata) we found the RGD motif but not the SGDG region (Additional file 4: Figure S2). The ASARM motif is highly conserved within the 21 mammals for which ASARM is annotated (average above 85%), although the Bottlenose dolphin (Tursiops truncatus) has a similarity of only 59.1%. Pairwise similarity among birds was 79.9% (among the three avian species), but on average only 27.3% similarity was observed between birds and the mammalian ASARM. Moreover, in birds this motif is capped at the C-terminal by 21 (G. gallus, M. gallopavo) to 24 (T. guttata) amino acids, and this region shows 77.2% similarity between G. gallus and M. gallopavo but less than 40% between these two species and T. guttata, showing that this region in birds is probably less constrained than the ASARM.
Results from the RRTree test comparing substitution rates in Rodentia, Scandentia and the other mammals.
Rodentia and Scandentia (n = 6)
Other Mammals (n = 20)
PAML results of MEPE for the 20 mammalian species (excluding ambiguity data).
ω = 0.46086
p0 = 0.63812 p1 = 0.36188
ω0 = 0.27631 ω1 = 1.00000 ω2 = 1.00000
p0 = 0.63812 p1 = 0.27244 p2 = 0.08944
M2 vs M1
p = 1.06299 q = 1.09727
p0 = 0.86974 p = 1.58369 q = 2.27462
(p1 = 0.13026) ω1 = 1.29913
M8 vs M7
The evaluation of positive selection using the model implemented in Single Likelihood Ancestor Counting (SLAC) showed three sites under selection, one of those sites being similar to that retrieved with model M8 in PAML. Since SLAC tends to be quite conservative, we also estimated the selection signatures using the Fixed Effects Likelihood (FEL) model, which is assumed to be more powerful than SLAC [42, 43]. The FEL model revealed a total of 23 sites under selection using this model, including a mutation in the highly conserved small peptide ASARM (from aspartate to glicine) (Figure 2). Such a radical change in ASARM was only observed in a few species and further studies are needed to better document its frequency across mammals. All the sites presented in the model implemented in Datamonkey have a significance value (p < 0.10) in FEL and SLAC, which is an accepted level of significance for the test of those models . When we use a significance threshold of 0.05, the number of positive selected amino acids decreased to 14 in FEL and zero in SLAC, meaning that 9 sites (out of the 23 detected with a significance level of 0.10) had less evidence of being under strong positive selection. However, these positions may still be indicative of selection signatures. Recombination can affect several analyses, including phylogenetic reconstruction and analysis of positive selection . Therefore, we assessed gene recombination using GARD implemented in the Datamonkey web-based server  and repeated the selection analysis including and excluding recombination in the dataset. Partitioning the data did not change the conclusions of the positive selection analyses (data not show), suggesting that recombination is not significantly affecting the MEPE gene evolution.
No additional sites were found using the SLR , but six sites under selection in the previous analysis were also statistically supported. Overall, across MEPE, 32 of the 525 sites (referenced to the length of the human MEPE) were under positive selection; additionally six sites were supported by more than one codon analysis (9 in PAML, 23 in FEL, 3 in SLAC, and 6 in SLR) (Additional file 6: Table S4).
Selection models that use dN/dS ratios to detect selection are generally not sensitive enough to detect subtle molecular adaptations . It is therefore necessary to employ alternative criteria within generally conserved protein-coding genes or within proteins with strict motifs intermixed with regions under fast directional evolution. Therefore, we used TreeSAAP , which evaluates destabilizing radical changes at each site, and an empirical threshold of change in three properties was applied as evidence that a site is under positive (or negative) selection.
MEPE properties under positive selection determined in TreeSAAP.
Equilibrium constant (ionization of COOH)
Power to be at the C-terminal
Power to be at the middle of alpha-helix
Power to be at the N-terminal
Solvent accessible reduction ratio
At the amino acid site level, MEPE has 181 sites (33.8%) under positive selection in at least one property. Although applying the empirical threshold of at least three properties showing signatures of positive selection the number of sites is reduced to 41 (7.6%) (Additional file 7: Table S5). The majority of these 41 sites are located in the N-terminal region of the protein and the dentonin region (68% of the positive selected sites). The alternative calculation method was performed using CONTEST and estimates of variation in amino acid charge and volume revealed 79 sites with signatures of positive selection for at least one of the amino acid properties (Additional file 8: Table S6). However, after the Bonferroni and False Discovery Rate (FDR) correction, only one site showed positive selection. This site, located at position 354 in the alignment (position 349 in the human sequence), corresponds either to lysine or glutamate and was not detected by TreeSAAP. The ancestral protein reconstruction in TreeSAAP, based on the baseml implemented in PAML, shows that glutamate is present in the common ancestor of non-Afrotheria mammals, suggesting that the radical change to lysine occurred in Cetartiodactyla, Perissodactyla and in at least one representative of the Lagomorpha.
Based on selection analyses at the protein level across MEPE, 42 of the 525 sites (human MEPE as reference) were under selection at the amino acid level (41 detected with TreeSAAP and 1 with CONTEST).
MEPE evolution has disproportionally accumulated serines, threonines (potential phosphorylation target residues), arginines, alanines and valines, as all these amino acids showed directional evolution in the DEPS analysis (with a P-value < 0.01) (Additional file 9: Table S7). The MEPE protein had 14 sites under directional selection (Additional file 10: Table S8), seven of which are amino acids that tend to increase the disorder/unstructured probability of the regions. Additionally, eight of these 14 sites had a tendency to change to amino acids that are potentially phosphorylated residues, particularly at positions 496 and 503 (505 and 512 positions in the alignment), since these sites are relatively near the ASARM motif and the cleavage site by cathepsin-B.
The MEPE protein belongs to a category of proteins classified as "intrinsically unstructured/natively disordered", with 53.8% and 55.8% of the human and the mouse MEPE constituted by amino acids that are associated with disorder/unstructured regions, respectively. This is reinforced given that most of the protein (around 78.8%) is disordered at a 0.05% false positive rate. Interestingly, the ASARM motif has a high content of amino acids disorder promoters while the other functional motifs (such as RGD and SGDG) incorporate regions that are structured (Additional file 11: Figure S3). The protein has a high percentage of the amino acid aspartate, which characterizes the proteins of the SIBLING family. Given the importance of disorder/order in MEPE, we analyzed the implications of selection signatures relative to the protein structural differences, and found that sites 75-Ser, 127-Glu, and 481-Arg (human MEPE as reference) are under positive selection and have a higher number of non-synonymous mutations towards codons that encode the amino acids disorder promoters.
Given the absence of the MEPE gene in fishes and amphibians, its origin likely coincides with the divergence of amniotes, when mineralization [11, 48] and phosphate regulation  had a crucial role in species survival and diversification. SPP1 diverged from SPARCL1 (secreted protein acidic cysteine-rich like 1) and both are expressed in bone, participating in the bone formation (as an inhibitor of mineralization in SPP1) . Therefore, the presence of SPP1 in fishes with a broader tissue expression pattern  suggests that SPP1 might also have similar functions to MEPE. Remarkably, after duplication, the genes were conserved during evolution and probably have differentiated to assume various functions related with tissue mineralization specificity. Recently, it was proposed that SPP1 is a more-powerful inhibitor of mineralization than MEPE . This suggests that after the emergence of the complete SIBLING family in vertebrates, some functions were possibly shared among genes, notably because MEPE is absent in fishes.
The MEPE gene has similarities with other SIBLING genes, suggesting that it originated through a duplication event from another member of the gene family , but different dynamics of gene duplication and gene loss have occurred among lineages (e.g. absence of MEPE - Figure 1). The five genes of the SIBLING family are present in therian mammals and reptiles, but birds only have four genes (IBSP, SPP1, DMP1 and MEPE/OC-116), while fish only have two genes (SPP1 and DSPP-like). The DSPP orthology in fishes is controversial . However, despite the low similarity, DSPP starmaker was identified as a functional orthologue  clearly associated with DSPP . The presence/absence of various SIBLING family genes in vertebrates suggests that despite the crucial role of MEPE in mammals, birds and reptiles, its function may have been compensated in other taxa by other genes of the family. For example, in fishes a duplicated copy of SPP1 has not been described, suggesting that the fish SPP1 orthologue may have had a similar function to MEPE since SPP1 and MEPE, interact with PHEX . The release of the ASARM from the MEPE protein and the phosphorylation of this motif lead to an inhibition of mineralization . Similarly, the ASARM from SPP1 inhibit the mineralization . Moreover, the ASARM from SPP1 is potentially phosphorylated and can interact with the hydroxyapatite crystals leading to a negative regulation of mineralization . Although SPP1 has an ASARM motif near the center of the molecule, it does not have the full dentonin region (just the RGD motif). Moreover, the SPP1-ASARM has been described as a more-potent mineralization inhibitor than the MEPE-ASARM . However, the knockouts of SPP1 and MEPE in mice have different phenotypes. MEPE knockouts have increased bone mass and inhibition of age-related bone loss  while SPP1 knockouts cause a resistance to bone loss and trabecular bone mass .
The functional motifs of MEPE (RGD, SGDG and ASARM) are highly conserved among the studied mammals. In the SIBLING proteins the first coding exon encodes the signal peptides [4, 54], as is observed in MEPE. The RGD motif is a common feature of all member of the SIBLING family, remaining functionally preserved after the tandem duplication that gave rise to all the members of this gene family . Surprisingly, birds do not have a complete dentonin region (RGD and SGDG), although the high conservation observed among mammals suggests that this region has an important role in the function of the protein. In fact, in mammals the gene function apparently depends on the full dentonin region, as the RGD motif alone does not enhance an optimal adhesion on biomaterial surfaces in osteoblast . However, when SGDG is close to RGD the mitogenic activity of dentonin increases, while the presence of only the SGDG motif promotes the cell proliferation . In mammals, MEPE is involved in bone formation and osteoblast proliferation [19, 56], while in birds it is involved in egg-shell formation . This functional divergence may explain the sequence differences observed between the two lineages, particularly reinforced by the absence of the full dentonin region in birds. The ASARM motif is also highly conserved among mammals, but shares less than 50% similarity with the avian ASARM. Moreover, we have not detected a similar cathepsin-B cleavage site near avian MEPE-ASARM and this motif is capped at the C-terminal by 21 to 24 amino acids. Amino acids towards the C-terminal after the ASARM motif are also observed in marsupials . Despite the lower similarity with the mammalian ASARM and its different position, the high conservation within birds suggests that this motif continues to have a crucial role. The changes are probably not due a relaxation of selection, but instead may have an adaptive role. In mammals, the cathepsin-B cleavage site is crucial for the function of MEPE, since this small peptide only interferes with hydroxyapaptite crystals when released . Therefore, birds are also expected to have a mechanism for cleavage of ASARM. MEPE has not yet been annotated in a monotremata, no significant matches were found in a representative species of this group, the platypus (Ornithorhynchus anatinus). Nevertheless, the discovery of this gene in egg-laying mammals would be of great relevance to understanding the functional differences between mammals and birds.
The coding region of MEPE that flank the motifs described above is less conserved, but retain considerable phylogenetic signal across species. The human MEPE sequence has a high similarity with the great apes and with the genus Macaca (Cercophitecidae), even in the non-coding regions (M. mulatta). To a lesser degree, human MEPE also has some significant similarities in the non-coding regions with the genes of the lower primates (M. murinus and O. garnetti). MEPE appears to be particularly conserved among primates, in both coding and non-coding regions. The intronic conservation could provide valuable information about the role of non-coding sequences in the regulation/functionality of this gene. Despite the accelerated evolution in rodents, intronic conservation allowed us to reconstruct a well-supported species phylogeny from intronic sequences (even including the rodents sequences) with similar results as those obtained from MEPE coding regions.
Several human diseases increase MEPE expression [7, 14, 58], which may imply functional constraints in the gene even at the intronic level. Previous studies have demonstrated that highly conserved intronic regions are correlated with functional constraints and can be evidence of a hidden class of abundant regulatory elements . Recently, a SNP in the region 7 kb 3' of the gene was associated with osteoporosis, a disease characterized by reduced bone mass and microarchitectural deterioration of bone tissue that reduces bone strength and leads to an increased risk of fracture . These findings suggest that intergenic regions can also be important in gene function and may cause significantly different phenotypes. We hypothesize that intronic regions can also lead to significant differences at the expression level and ultimately to differences in phenotypes. This is consistent with our findings that there are strong evolutionary constraints in the MEPE intronic region.
Within mammals, MEPE in rodents is evolving faster, presenting a high amount of transitions and transversions. A similar trend is also observed in the tree shrew T. belangeri. However, since we only had one MEPE sequence from the order Scandentia we were unable to infer if this pattern is species-specific or if it is typical of this order. The increased number of substitutions in rodents was expected as previous studies have shown that rodents tend to accumulate more mutations in the coding regions [24, 61]. We hypothesize that the observed differences in these two orders have resulted from either a divergent functional role or simply a relaxation in Darwinian selection. It is not known if the function of the rodent MEPE is similar to that in humans , but all the functional motifs are conserved and the signatures of positive selection or the differences observed were only detected outside of these important motifs. It is clear that positive selection may have an important role in the functional divergence of homologous proteins during adaptation to different habitats . Indeed, selection may be episodic as positive and negative selection shifts over time across different lineages, reinforcing the importance of comparing sequences that have diverged within appropriate time frames . The branch-site model, using rodents as foreground branches and allowing ω ratio variation not only between the branches but also among sites, identified 12 sites with strong signatures of positive selection. This suggests that the rodents and probably Scandentia may have lineage-specific selection differences in MEPE, not only in the magnitude of the selective pressure found in the branch, but also in the number of sites under selection. The acceleration of the substitution rates in rodents and the tree shrew potentially compromises the assessment of positive selection by increasing the number of synonymous mutations and because this heterogeneous site selection is observed in only two of the eight orders evaluated (i.e. Rodentia and Scandentia). The results may also be biased by the mixing of species with long and short generation times , as well as the related long-branch-attraction effect in phylogenetic reconstruction. Therefore, we did not include the rodents and the tree shrew in the site analysis.
The evolutionary analyses of mammalian MEPE codons (excluding the rodents and the tree shrew) found 32 sites under positive selection at codon level, and remarkably three were in functional regions of the protein, positions 6-Val and 11-Phe (Signal Peptide) and position 517-Gly (ASARM motif) (Figure 2).
Recent methods for investigating selection in protein coding genes have focused on evaluating the type of positive selection detected (directional or nondirectional, stabilizing or destabilizing), determining the presence of purifying selection, and interpreting how selection affects overall protein structure and function. Amino acid substitutions have different effects on a protein depending on differences in physicochemical properties and their position in the protein structure . Here, we performed multiple analyses to differentiate among the different types of selective pressures acting in MEPE at the amino acid level. The evaluation of the amino acid physiochemical properties changes in the mammalian MEPE identified 37 more sites (36 using TreeSAAP and one using CONTEST) with selection signatures compared with the results retrieved using codon models. This shows that total reliance on models based on dN/dS using codon models may not detect some important sites with signatures of selection, often because a single adaptive mutation may occur in a small number of species, resulting in an omega lower than one. By contrast, these could also primarily be amino acid stabilizing rather than destabilizing changes, and a ω > 1 may not always be indicative of adaptive evolution.
Combining all the selection analyses, we found 69 amino acids with evidence of positive selection (20 well-supported by both codon and amino acid level approaches) (Additional file 12: Figure S4). Three clusters of positively selected sites revealed three new motifs that likely have a functional role, SEASEN (75-80), LNXEXS (96-101) and ENT (170-172) (Figure 10), using the human protein as site reference.
Selection analysis of MEPE in TreeSAAP using amino acid destabilizing properties revealed that the structural properties tend to be more affected by positive selection than the chemical properties. This suggests that the flexible and intrinsically unstructured nature of MEPE is linked to its multiple biological roles. The ASARM motif shows a "high tendency" to be a "disordered region and highly acidic", although the conformation of ASARM should be dependent on the phosphorylation level . The ability to bind to hidroxyapatite is also correlated with phosphorylation state and PHEX cleavage of MEPE is dependent on the Serine phosphorylation status . Moreover, our results shows that the protein tends to accumulate numerous residues with potential phosphorylation sites and this can be important to the folding/function of the protein. Proteins fold to minimize their free energy, although the structure also reflects an organization that can allow the recognition of a ligand or a transition state . In fact, there is a balance between protein function and stability, and most of functional sites are non-optimal in terms of stability. If a residue is replaced by another residue, the protein activity will be reduced but the stability will be increased . In MEPE we detected 75 sites with a Γ lower than -2 kcal/mol, indicating that a large number of sites in MEPE are non-optimal and therefore possibly involved in protein function. Moreover, 13 of those sites showed signatures of selection in one analysis, and sites 55, 127 and 276 in both codon and amino acid level analyses. Proteins have different secondary-structures and physicochemical properties and roles that help determine their evolutionary flexibility . Thus, amino acids that comprise disordered regions, such as random coils, are more likely to be under positive selection than expected from their proportion in the proteins, compared with the residues in helices and β-structures which are subjected to less positive selection . Indeed, when we compare the evidence of positive selection with the protein secondary structure in MEPE we observed that the number of sites under selection in the random coils and disordered regions are slightly higher than expected. This suggests that a high number of sites probably have a functional role or are at least relevant to an increase in MEPE protein flexibility.
Presently, most of the research on MEPE has centered on the biological role of the RGD and ASARM regions. However, our comparative study of mammalian MEPE orthologues revealed that the protein has lineage-specific properties (e.g. biochemical, evolutionary rate, intronic conservation), and that outside these two well-described motifs there are 69 sites (20 with high confidence level) under positive selection and of probable functional relevance. As positively selected sites might be either near catalytically important regions of the proteins  or be functionally relevant sites [70, 71], these sites are good candidates for mutagenesis and structural studies to determine the functionality of MEPE relative with the other SIBLING proteins.
MEPE is found in reptiles, birds and mammals (eutheria and metatheria), and to date has not been identified in monotremes. The description and study of MEPE in other taxonomic groups will be crucial to fully understanding the differences reported in avian and mammalian orthologues, and the adaptive significance of these differences. The absence of this gene in some vertebrate lineages suggests that SPP1 might partially cover the functions of MEPE in those groups. MEPE retains a strong phylogenetic signal at both coding and non-coding regions in mammals, probably due to in the functional relevance of these regions. Nevertheless, the gene is highly variable, particularly in the largest exon outside the functional motif, while other regions appear to be under strong positive selection. We found 20 sites with a significant signature of positive selection at both nucleotide and amino acid level complimentary analyses (in addition to other 69 sites with evidence of selection at either the nucleotide or the amino acid level). The analyses identified three motifs (LNXEXS, SEASEN and ENT) with selection signatures suggesting important adaptive functions. We also showed that Rodentia and Scandentia have an accelerated evolutionary rate with a unique evolutionary pattern. Finally, we showed that MEPE tends to accumulate amino acids that promote "disorder" and that present potential phosphorylation targets, supporting the contention that other regions outside the dentonin and ASARM might have crucial functional roles and demonstrating the need for future studies to understand the importance of these regions.
MEPE nucleotide sequences were retrieved from GenBank and ENSEMBL. We aligned 26 MEPE sequences representing eight orders of mammalian species and produced two different alignments, one including all species and another excluding rodents and the tree shrew due to its nucleotide saturation bias. Given the low similarity between the avian and the mammalian sequences, the avian sequences were excluded from phylogenetic and selection analyses. BLAST searches were used to retrieve non-annotated sequences from several mammalian genomes. All the alignments were performed after the translation of nucleotides to amino acids and the corresponding alignments were back-translated to nucleotides. The alignment were performed in ClustalW  implemented in BIOEDIT v7.05 , MEGA4  and LAGAN . Sliding-window percent amino acid and nucleotide identity, and % GC content were calculated in Swaap 1.0.3 . Saturation plots (including or excluding the third-coding position) and the estimated pI (excluding indels) were assessed in DAMBE . Conservation in the coding and the non-coding regions was assessed using mVISTA .
We used Modelgenerator version 0.85  to determine the optimal model of sequence substitution for our protein dataset, employing the Jones-Taylor-Thornton (JTT+I+G) substitution model. MrModeltest 2.3  was employed to determine the optimal model of sequence substitution for our coding sequence dataset, employing the General-Time-Reversible (GTR+I+G) substitution model with the invariant site plus gamma options (five categories). Bayesian inference methods with Markov chain Monte Carlo (MCMC) sampling were performed in MrBayes [81, 82]. The analysis was run for 5, 000, 000 generations with a sample frequency of 100 and burn-in was set to correspond to 25% of the sampled trees. The Maximum-Likelihood (ML) phylogenetic tree was constructed in PHYML , under the best-fit model for nucleotides and amino acids, 1000 bootstrap replicates and the NNI branch search algorithm. The parameters used in the tree reconstructions were set to: (i) Nucleotides: GTR+I+G with 6 substitution rate parameters and gamma-distributed rate variation with a proportion of invariant sites; (ii) Amino acids: JTT+I+G. A neighbor-joining tree was conducted in MEGA 4  using the complete deletion of ambiguous data and the maximum composite likelihood option. The topologies were tested in TREE-PUZZLE  to identify the tree that best fits the alignment, using three tests: KH, SH and ELW. A phylogenetic signal test was performed in TREE-PUZZLE  using the implemented methodology .
Positive selection analyses were performed in the Eutherian mammals (the closely-related taxa) to avoid nucleotide saturation and base-compositional bias. We assessed positive selection using primarily a gene-level approach  based on the ratio (ω) of nonsynonymous (dN) to synonymous (dS) substitutions rate (i.e., ω = dN/dS), implemented in PAML v4.3  and in the web-based program SELECTON [87, 88], PAML uses LRT to compare two nested models, a model that does not allow, and a model that allows, sites categories > 1 (null versus positive selection, respectively). Here, we used three LRTs based on site-specific models comparing the nested models: M1a-M2a, M7-M8 and M8a-M8. The first LRT was performed comparing M1a (nearly neutral: p0, p1, ω0 < 1, ω1 = 1, NS sites = 1) against M2a (positive selection: p0, p1, p2, ω0 < 1, ω1 = 1, ω2 < 1, NSsites = 2); the second LRT was comparing M7 (beta: p, q, NS sites = 7) with M8 (beta & ω: p0, p1, p, q, ωs > 1, NS sites = 8). The third LRT was between M8a (beta & ωs = 1: fix omega = 1, omega = 1, NS sites = 8) and M8. However, a significant LRT only demonstrated that the selection model is more suitable than the neutral model; it does not provide any indication of the sites under selection . This can be accomplished through an Empirical Bayes (EB) approach to calculate the posterior probability (PP) that a given site comes from the class with ω > 1. Sites presenting a PP above the defined cut-off value (e.g. p > 95%)  are inferred to be under positive selection. A robust method was used to accommodate the uncertainties in the MLEs of parameters in the ω distribution, designated by BEB . This approach was shown to be reliable in both small and large data sets, and also to have a good resolution power for identifying individual sites under positive selection, especially in large data sets or with strong selective pressure. We also performed an analysis using the branch-site model A  (model = 2 NSsites = 2), including and excluding the rodents and tree shrew as foreground branch, allowing the ω ratio vary both among sites and among lineages. The branch-site test 2 was performed using the null model, ω2 = 1 fixed (using the parameters fix omega = 1 and omega = 1). The sites under selection in the foreground branches were obtained after calculating probabilities of site classes using the BEB procedure.
Although the PAML models  allow for variation in the non-synonymous substitution rate, the synonymous rate is fixed across the sequence. To overcome that specificity, we used SLAC and FEL  for detecting positive selection while allowing variation in synonymous rate. SLAC is a heavily modified and improved derivative of the Suzuki-Gojobori counting approach [42, 93] that maps changes in the phylogeny to estimate selection on a site-by-site basis. SLAC calculates the number of non-synonymous and synonymous substitutions that have occurred at each site using ML reconstructions of ancestral sequences [42, 93]. The FEL model estimates the ratio of nonsynonymous to synonymous not assuming an a priori distribution of rates across sites substitution on a site-by-site analysis . The SLAC and FEL methods were implemented using the web interface Datamonkey . Since recombination in the gene can bias the analysis , we also re-run SLAC and FEL in Datamonkey using the GARD method , allowing each calculated partition to have its own phylogenetic tree.
Additionally, we used the LRT based analysis as implemented in the SLR (Sitewise Likelihood-Ratio) software package . This method assumes that substitutions (both synonymous and non-synonymous) can occur independently with every other site, modulating substitution rates as a continuous-time Markov process. The LRT on a site-wise basis is performed testing a null model (neutrality, ω = 1) against an alternative model ω ≠ 1.
We performed multiple analyses to differentiate the different types of selective pressures acting in MEPE: (i) positive versus negative selection, and (ii) stabilizing (selection that tends to maintain the overall biochemistry of the protein) versus destabilizing selection (selection that results in radical structural or functional shifts in local regions of the protein). These analyses provided insight into the structural and functional consequences of the residues under selection . We used TreeSAAP v3.2  and CONTEST  implemented in IMPACT  to detect selection signatures at the amino acid level. In TreeSAAP positive destabilizing-selection is detected based on the properties changes with significantly greater amino acid replacements than would be expected under neutrality for magnitude categories +7 and +8 (i.e., the two most-radical property-change categories). Within TreeSAAP, 31 amino acid properties were evaluated across the phylogenetic tree to identify the specific amino acid residues within each region that showed evidence of positive destabilization for each property. The baseml implemented PAML  is used in TreeSAAP  to reconstruct ancestral character states at the nodes on the MEPE phylogeny.
To test if evolutionary rates varied between lineages we used the relative-rate test, weighting by the predefined tree topology, as implemented in RRTree . To detect directional selection over the tree or a large number of substitutions towards a particular residue in a maximum likelihood context we used the directional evolution in protein sequences (DEPS) analysis to identify statistically significant directional changes in amino acid residue frequencies .
To determine the position of the positive selected amino acids when the protein is folded, we modeled the three-dimensional (3D) structure of MEPE. Protein structure prediction can be approached in three ways: (i) comparative modeling, (ii) threading, and (iii) ab-initio folding. For MEPE, the first two methods, which build a protein model by aligning query sequences onto solved template structures, were not feasible. Thus, the only practical strategy was to run the I-TASSER  to obtain an ab-initio 3D model of MEPE. The model obtained using the Homo sapiens sequence had a TM Score of 0.46 ± 0.15 and a C-Score = -2.18. To accurately infer the correct topology, the model should have a C-score above -1.5, varying from [-5; 2] . A TM score above 0.5 means that the obtained topology is not random . Results using the sequences of the rock hyrax (out-group), the dog (i.e. one of the species showing differences in the pI) and the mouse (which demonstrates accelerated evolution) all had similar C-scores and the 3D structures similar to the results retrieved for the human MEPE, suggesting that the biochemical differences in the composition of the amino acids that constitutes the different orthologues are not imposing significant differences in the folding of the protein.
To assess the surface exposure of the amino acids in the protein structure, we used the GETAREA 1.1  web-based program based on the atom coordinates of the PDB file. This provides an estimate of the solvent exposure based on the ratio of the side-chain surface area to "random coil" value per residue, performing an analytical calculation of solvent accessible surface area residues. These are considered to be solvent exposed if the ratio value exceeds 50% and to be buried if the ratio is less than 20% . Since MEPE has been described as an intrinsic unfolded protein, we also used the Protein DisOrder prediction System (PrDOS) server  to predict natively disordered regions of a protein chain based on the composition of the amino acid sequence. Protein stability was calculated with the PoPMuSiC 2.1 web server  using the MEPE PDB file previously obtained in I-TASSER to calculate the sites Γ considering all the possible mutations in each site. The secondary structure was visualized in POLYVIEW .
The authors acknowledge the Portuguese Fundação para a Ciência e a Tecnologia (FCT) for financial support to JPM (SFRH/BD/65245/2009) and the project PTDC/BIA-BDE/69144/2006 (FCOMP-01-0124-FEDER-007065) and PTDC/AAC-AMB/104983/2008 (FCOMP-01-0124-FEDER-008610). This work was further supported by a grant from Iceland, Liechtenstein and Norway through the EEA Financial Mechanism and the Norwegian Financial Mechanism. We thank Siby Philip from LEGE/CIIMAR for discussion and helpful suggestions. Comments made by the anonymous reviewers improved a previous version of this manuscript.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.