The origin and evolution of ARGFX homeobox loci in mammalian radiation
© Li and Holland; licensee BioMed Central Ltd. 2010
Received: 25 January 2010
Accepted: 17 June 2010
Published: 17 June 2010
Many homeobox genes show remarkable conservation between divergent animal phyla. In contrast, the ARGFX (Arginine-fifty homeobox) homeobox locus was identified in the human genome but is not present in mouse or invertebrates. Here we ask when and how this locus originated and examine its pattern of molecular evolution.
Phylogenetic and phylogenomic analyses suggest that ARGFX originated by gene duplication from Otx1, Otx2 or Crx during early mammalian evolution, most likely on the stem lineage of the eutherians. ARGFX diverged extensively from its progenitor homeobox gene and its exons have been functional and subject to purifying selection through much of placental mammal radiation. Surprisingly, the coding region is disrupted in most mammalian genomes analysed, with human being the only mammal identified in which the full open reading frame is retained. Indeed, we describe a transcript from human testis that has the potential to encode the full deduced protein.
The unusual pattern of evolution suggests that the ARGFX gene may encode a functional RNA or alternatively it may have 'flickered' between functional and non-functional states in the evolutionary history of mammals, particularly in the period when many mammalian lineages diverged within a relatively short time span.
The homeobox genes comprise a large and diverse gene superclass characterized by presence of a DNA motif encoding the homeodomain. Most homeodomain proteins function as transcription factors involved in switching other genes on or off during embryonic development, cell fate specification and cell differentiation. The pivotal importance of homeobox genes to animal development is demonstrated by the fact that mutation or experimental misexpression can cause dramatic developmental abnormalities or cancers [1–3]. The best known homeobox genes include Hox, ParaHox, NK, Otx, Pax and Dlx genes which have been extensively studied in many animal model systems including mice, zebrafish, nematodes and Drosophila. Indeed, it was comparison of homeobox genes between species that led to one of the most striking findings of twentieth century biology: the remarkable conservation of homeobox and other developmental patterning genes between very divergent animal phyla [4, 5].
Not all homeobox genes are ancient, however, and the extent of their evolutionary conservation varies considerably. For example, a detailed search for all homeobox loci in the human genome sequence revealed six novel genes, DPRX, ARGFX, TPRX1, DUXA, DUXB and LEUTX, each of which has no orthologue in the mouse genome, nor in invertebrate genomes [6, 7]. It was hypothesized that these homeobox loci originated relatively recently in evolution and had undergone rapid sequence evolution. Booth and Holland [6] suggested that DPRX, TPRX1 and DUXA may have originated by tandem duplication and extensive sequence divergence from the CRX homeobox gene (a member of the ancient and conserved Otx gene family), because TPRX1 is adjacent to CRX and the other two homeobox genes are just 5.8 Mb and 9.2 Mb distant in chromosomal region 19q13. The evolution of DUXA and DUXB was studied by Clapp et al. [8], who showed that these genes originated before mammalian radiation but have been lost from mouse. In contrast, the origin and subsequent evolution of the ARGFX locus remain unclear. Human ARGFX maps to 3q13, so it is not in the same chromosomal region as CRX, or indeed any other Paired (PRD) class homeobox gene. The sequence of its homeodomain assigns ARGFX as a divergent member of the PRD class, but gives no clear solution to its mode of origin. In addition, it is not yet certain whether ARGFX is a true functional gene or a nonfunctional pseudogene. The existence of two human retrotransposed pseudogenes derived from ARGFX, three ESTs from human testis tumor and a weakly positive RT-PCR amplification from human testis and embryonic stem cells indicates a low level of ARGFX transcription in humans. Here we undertake a comparative study of ARGFX sequences in vertebrate genomes to investigate the origin, the patterns of mutation and gene loss, and the extent of evolutionary conservation of this locus.
Results and Discussion
Human ARGFX gene and transcript
Since the ARGFX mRNA is present at very low levels, we suspected that detection of a protein product would be difficult. We carried out western blots on human testis total protein using a commercial polyclonal antibody and detected a single band of 50 kDa (data not shown); this is much larger than the expected 36 kDa. The size discrepancy suggests the band is artefactual and may reflect the fact that the antibody was raised to the whole protein including the Alu region. We note that exonized coding Alu elements are also found in several proteins; examples include DSERG1 (GeneID 751816), ZMAT1 (GeneID 84460) and POLR3B (GeneID 55703). In addition, we searched the PeptideAtlas (http://www.peptideatlas.org/) and PRIDE (http://www.ebi.ac.uk/pride/) databases and found four matches in the latter. However, these spectra did not pass the filtering criteria used, and thus do not provide conclusive evidence for translation of the human ARGFX gene.
The evolutionary origin of ARGFX sequences
To further refine these results, we used a more sensitive method that exploited the fact that ARGFX has readily recognized flanking genes within an easily identified syntenic region. Starting with the human ARGFX genomic region, we were able to identify the corresponding genomic region in virtually all placental mammals, plus opossum (a non-placental mammal) and a frog (Xenopus tropicalis). We then used mVISTA to search these regions at low stringency for any sequences related to human ARGFX. In neither opossum nor frog was there detectable ARGFX-related sequence in the corresponding region. The locus was also missing from the syntenic region of mouse, rat and dog.
By combining the sequence similarity searching and the synteny analyses with a probable phylogenetic tree of the vertebrates, we can deduce the likely date of evolutionary origin of ARGFX homeobox loci (Figure 2). The absence of the locus in mouse, rat and dog clearly reflects secondary loss, since these species are nested within the tree of placental mammals. The minimal inclusive clade containing taxa possessing ARGFX sequences encompasses Xenarthra, Afrotheria, Supraprimates and Laurasiatheria. In contrast, there is no evidence that ARGFX sequences existed before the origin of placental mammals. We propose, therefore, that ARGFX originated after the divergence of Eutheria from Metatheria.
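The deduction above, finding the smallest clade that contains every taxon carrying ARGFX, can be illustrated with a minimal sketch. The tree below is a toy topology chosen only for illustration (the branching order among the four placental groups is an assumption and is not taken from Figure 2), and the function names are hypothetical.

```python
# Toy tree as nested tuples. The topology is an illustrative assumption,
# not the phylogeny used in Figure 2.
TREE = (
    "Xenopus",
    ("opossum", (("Xenarthra", "Afrotheria"), ("Supraprimates", "Laurasiatheria"))),
)


def leaves(node):
    """Return the set of leaf names found under a node."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def minimal_clade(node, targets):
    """Return the smallest subtree that contains every taxon in targets."""
    children = () if isinstance(node, str) else node
    for child in children:
        if targets <= leaves(child):
            # All targets sit inside one child, so descend into it.
            return minimal_clade(child, targets)
    return node


# Groups reported in the text as possessing ARGFX sequences.
present = {"Xenarthra", "Afrotheria", "Supraprimates", "Laurasiatheria"}
clade = minimal_clade(TREE, present)
print(sorted(leaves(clade)))
```

Under this toy topology the minimal clade excludes opossum and Xenopus, which corresponds to an origin after the divergence of Eutheria from Metatheria.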
If ARGFX did originate as proposed, it is not clear which Otx gene was the progenitor, since ARGFX is on a different chromosome to all three candidates. Interestingly, the canonical Otx genes share more sequence similarity with ARGFX than they do with DPRX and TPRX1 which are located close to CRX chromosomally (Figure 4).
In summary, we propose that ARGFX originated after the divergence of Eutheria from Metatheria. This view is strongly supported by the phylogenetic distribution of ARGFX sequences in mammals and by analysis of the ARGFX syntenic region in placental and marsupial mammals and an amphibian. An alternative hypothesis, that the origin of ARGFX dates to genome duplication events in early vertebrate evolution, gains no support from paralogy analysis. It is likely, therefore, that the eutherian mammal ARGFX locus originated by gene duplication from CRX, OTX1 or OTX2, followed by extensive sequence divergence from these conserved Otx genes.
Mutation in placental mammals
Although ARGFX-related sequences were identified in most placental mammals examined, we were surprised to find that each one carried critical sequence mutations when compared to the human ARGFX open reading frame. In no case did a non-human ARGFX locus contain an intact open reading frame of equivalent length to the human sequence (Figure 1), and hence we conclude that none can code for a functional protein. Furthermore, each lineage of mammals has different disabling mutations (Additional file 2). For example, within the primates, chimpanzee and gorilla have the start codon ATG mutated to ACG, in orangutan the stop codon TGA is mutated to TCA causing termination four codons further downstream, macaque has a four base-pair insertion causing a frameshift, and marmoset and mouse lemur each have different stop codon mutations resulting in a shorter protein. In other mammals, tree shrew has two frameshift mutations and one premature stop codon, guinea pig has four separate frameshift deletions in the C-terminus, cow has a ten base pair deletion at the N-terminus causing a frameshift and a pair of two-nucleotide deletions at the C-terminus, horse has a one nucleotide insertion, two premature stop codons and deletions within the homeobox, while megabat has three frameshift mutations and a premature stop codon. As already mentioned, mouse, rat and dog have lost the locus secondarily.
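The classes of disabling mutation listed above (lost start codon, premature or lost stop codon, frameshift) can be screened for with simple checks on a candidate coding sequence. The following is a minimal sketch only: the function name and the toy sequences are hypothetical, and a real analysis would compare each sequence to the human open reading frame in an alignment rather than inspect it in isolation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_status(cds):
    """List simple open-reading-frame defects in a candidate coding sequence."""
    cds = cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("start codon lost")
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop at codon {index + 1}")
            break
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    return problems or ["intact"]


# Toy sequences (not real ARGFX data) mimicking the mutation types described.
print(orf_status("ATGGCTCGTTGA"))     # intact reading frame
print(orf_status("ACGGCTCGTTGA"))     # ATG mutated to ACG
print(orf_status("ATGGCTCGTTCA"))     # TGA mutated to TCA
print(orf_status("ATGGCTTGACGTTGA"))  # in-frame stop before the end
```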
These results are particularly surprising when considered in relation to the phylogenetic tree of mammals (Figure 2), because the range of species in which mutation of the ARGFX locus is found does not form a monophyletic group to the exclusion of humans. There are several possible explanations for this unusual pattern. First, ARGFX may be a non-functional pseudogene in all mammals, and the intact open reading frame in humans is simply a stochastic variant of a sequence diverging in a neutral manner. This would imply a single loss of function event soon after the origin of ARGFX in the eutherians. Second, the ARGFX locus may have been functional for the entire evolutionary history of humans, and is still functional, but has become a pseudogene (or has been lost) in many independent mammalian lineages. This would imply at least 10 independent loss of function events in mammalian evolution, if the phylogeny used in Figure 2 is correct. Third, the gene may have 'flickered' between functional and non-functional states in the evolutionary history of mammals. This might imply a smaller number of independent loss of function events, yet selective maintenance of function for at least part of human evolutionary history. Fourth, the locus may be functional as an RNA but not as a protein in most mammals.
Signatures of selection in ARGFX evolution
The unusual evolutionary pattern prompted us to reconsider whether the human ARGFX locus encodes a functional gene, or indeed whether ARGFX has ever been functional. Detection of a transcription product from the locus, or even a putative translation product, is not sufficient evidence; many pseudogenes are transcribed [14, 15]. Finding a human phenotype associated with mutation of ARGFX would be informative, but so far no such human condition is known. We therefore addressed this question from an evolutionary perspective. Two approaches were employed.
Table: dN and dS values in different analyses (numerical values not preserved). Regions analysed: whole coding region with Alu; N-terminal (231 bp); homeodomain (183 bp); C-terminal (495 bp); whole coding region without Alu.
In this study, we have performed a deep comparative phylogenetic and phylogenomic analysis on the recently described homeobox locus ARGFX. We present evidence that this gene originated by gene duplication from Otx1, Otx2 or Crx during early mammalian evolution. The most likely origin was on the stem lineage of the eutherian mammals, after their divergence from marsupials and monotremes, and before the radiation of living placental mammals. The gene diverged extensively in sequence from its progenitor homeobox gene and was then subject to purifying selection. This purifying selection was maintained through much of placental mammal radiation, even though the coding sequence of the locus is disrupted in most mammalian genomes analysed. Strangely, human is the only mammal studied to date in which the full open reading frame is retained, and we suggest that the ARGFX locus is still functional in human. Indeed, we detected a transcript in human testis that has the potential to encode the full deduced protein. The unusual pattern of conservation suggests either that there have been very many independent losses of function for this gene in mammalian radiation, or, that the gene encodes a functional RNA molecule. An alternative hypothesis is that the gene has 'flickered' between functional and non-functional states in the evolutionary history of mammals, particularly during the late Cretaceous period when many mammalian lineages diverged in a relatively short timespan.
cDNA cloning of human ARGFX
Human testis total RNA was obtained commercially from AMS Biotechnology Ltd. (Cat. No. R1234260-50), and used to synthesize cDNA using a SMART RACE kit (Clontech Laboratories Inc.). Based on open reading frame predictions made from genomic DNA, gene-specific primers were used to amplify and clone ARGFX cDNA covering the complete predicted open reading frame. Forward primers used were: H.sa-ARGFX-F1 (CACGTAGGACTGAAAATGGTTACTC in exon 1) and HsARGFX-51F (CGGAGAAGGCATAAAGAACG in exon 4). Reverse primer was HsARGFX-51Rm (AGGGTCTAAGGGCTGAGATGG in exon 5). Primer positions are shown in Figure 1. To examine whether the transcribed mRNA is translated, western blots were performed using human ARGFX polyclonal antibody (Abcam Inc., Cat. no. ab67562) and human testis total protein (AMS Biotechnology, Cat. no. P1234260).
Identification of ARGFX sequences in other vertebrates
Initially, we focused attention on nineteen vertebrate genome sequences of high quality in Ensembl (release 56) accessible at http://www.ensembl.org/index.html, comprising ten mammals (chimpanzee, orangutan, rhesus macaque, mouse, rat, dog, cow, horse, opossum, platypus), two birds (chicken, zebra finch), one reptile (anole lizard), one amphibian (Xenopus tropicalis) and five teleost fish (Tetraodon, Takifugu, medaka, stickleback, zebrafish). Genomes were searched using tblastn with the human ARGFX deduced protein sequence as the query and an E-value cutoff of 1e-5. Many partial genome sequences were also searched. Where no match was detected, the cause could be incomplete sequencing, complete absence of the locus, or partial degeneration of the sequence. To distinguish between these possibilities, we identified the region of each genome assembly syntenic to the human ARGFX location and searched this for short degenerated sequences related to ARGFX using the LAGAN program implemented in mVISTA (http://genome.lbl.gov/vista/mvista/submit.shtml) with a match criterion of 50% identity over 50 bp. All sequences used in analysis are included in Additional file 1.
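To make the match criterion concrete, the sketch below scans a pairwise alignment with a sliding window and reports windows that reach 50% identity over 50 aligned columns. It is an illustration under stated assumptions, not the mVISTA implementation: the function name is hypothetical, the input is assumed to be two gapped strings of equal length, and gap columns are counted as mismatches.

```python
def conserved_windows(aligned_a, aligned_b, window=50, min_identity=0.5):
    """Return start columns of alignment windows that meet the identity threshold."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both sequences carry the same base, 0 for mismatches and gaps.
    matches = [
        int(a == b and a != "-")
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    hits = []
    running = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            running += matches[start + window - 1] - matches[start - 1]
        if running / window >= min_identity:
            hits.append(start)
    return hits


# Toy alignment: 50 identical columns followed by 50 gap columns.
reference = "ACGT" * 25
query = reference[:50] + "-" * 50
hits = conserved_windows(reference, query)
print(hits[0], hits[-1])
```

Alignments shorter than the window return no hits, which matches the intent of requiring the identity to hold over a full 50 bp.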
Nucleotide sequences and deduced protein sequences were aligned using CLUSTAL_X. NJ (Neighbor-Joining) trees were calculated using MEGA4 with the Kimura 2-parameter distance estimation for nucleotide sequences and Poisson distance estimation for protein sequences. ML (maximum likelihood) trees were constructed using PHYML with the JTT model, estimated to be the most appropriate model by ProtTest. The reliability of interior nodes in ML and NJ trees was assessed by bootstrapping with 100 replications. Bayesian trees were constructed using MrBayes 3.1 with 1.2 million cycles and sampling 1/10 of trees after a 30,000 cycle burn-in. Deduced protein sequences used for ARGFX trees were human ARGFX (without the region encoded by the Alu element), NP_001012677, and the thirteen most similar human proteins according to blastp: human CRX, NP_000545; human OTX2, NP_068374; human OTX1, NP_055377; human UNCX, NP_001073930; human ALX1, NP_008913; human PHOX2B, NP_003915; human PAX6, NP_001595; human DRGX, NP_001073989; human DMBX1, NP_671725; human ALX4, NP_068745; human PHOX2A, NP_005160; human VSX1, NP_055403; human ALX3, NP_006483. Numbers of synonymous (dS) and nonsynonymous (dN) substitutions per site were calculated using the modified Nei-Gojobori Jukes-Cantor method implemented in MEGA4. Potentially positively selected sites were determined using the codeml program in the PAML 4 package. Alignments of genome sequences were undertaken using the LAGAN or Shuffle-LAGAN programs implemented through mVISTA (http://genome.lbl.gov/vista/mvista/submit.shtml).
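For readers unfamiliar with the two distance estimators named above, the following sketch computes them from a pair of aligned sequences using the standard formulas: the Kimura 2-parameter distance d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions, and the Poisson-corrected distance d = -ln(1 - p), where p is the proportion of differing amino acid sites. This is not MEGA4 code; the function names are hypothetical, and columns containing gaps or ambiguous characters are simply skipped, which is an assumption about gap handling.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned nucleotide sequences."""
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = [(a, b) for a, b in pairs if a != b]
    # A transition stays within purines or within pyrimidines.
    transitions = sum(
        1 for a, b in differences if {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    )
    p = transitions / len(pairs)
    q = (len(differences) - transitions) / len(pairs)
    # math.log raises ValueError if the sequences are too divergent (saturated).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def poisson_distance(prot_a, prot_b):
    """Poisson-corrected distance between two aligned protein sequences."""
    pairs = [(a, b) for a, b in zip(prot_a, prot_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(1 for a, b in pairs if a != b) / len(pairs)
    return -math.log(1 - p)


# Toy sequences: one transition in ten sites, one substitution in five residues.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
print(round(poisson_distance("MKRHR", "MKRHK"), 4))
```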
- ARGFX: Arginine-fifty homeobox
- dS: number of synonymous substitutions per site
- dN: number of nonsynonymous substitutions per site
- RACE: Rapid Amplification of cDNA Ends
The authors thank Yaseen Ladak and Tommy Duncan for help with western blots, and Ying-fu Zhong for help with Bayesian analysis.
- Cillo C: HOX genes in human cancers. Invasion & Metastasis. 1994, 14: 38-49.
- Nunes FD, de Almeida FC, Tucci R, de Sousa SC: Homeobox genes: a molecular link between development and cancer. Pesquisa Odontologica Brasileira = Brazilian Oral Research. 2003, 17: 94-98.
- Del Bene F, Wittbrodt J: Cell cycle control by homeobox genes in development and disease. Seminars in Cell & Developmental Biology. 2005, 16: 449-460. 10.1016/j.semcdb.2005.02.001.
- Kappen C, Ruddle FH: Evolution of a regulatory gene family: HOM/HOX genes. Current Opinion in Genetics & Development. 1993, 3: 931-938. 10.1016/0959-437X(93)90016-I.
- Manak JR, Scott MP: A class act: conservation of homeodomain protein functions. Development (Cambridge, England). 1994, 61-77.
- Booth HAF, Holland PWH: Annotation, nomenclature and evolution of four novel homeobox genes expressed in the human germ line. Gene. 2007, 387: 7-14. 10.1016/j.gene.2006.07.034.
- Holland PWH, Booth HAF, Bruford EA: Classification and nomenclature of all human homeobox genes. BMC Biology. 2007, 5: 47. 10.1186/1741-7007-5-47.
- Clapp J, Mitchell LM, Bolland DJ, Fantes J, Corcoran AE, Scotting PJ, Armour JA, Hewitt JE: Evolutionary conservation of a coding function for D4Z4, the tandem DNA repeat mutated in facioscapulohumeral muscular dystrophy. American Journal of Human Genetics. 2007, 81: 264-279. 10.1086/519311.
- Gal-Mark N, Schwartz S, Ast G: Alternative splicing of Alu exons--two arms are better than one. Nucleic Acids Research. 2008, 36: 2012-2023. 10.1093/nar/gkn024.
- Kriegs JO, Churakov G, Jurka J, Brosius J, Schmitz J: Evolutionary history of 7SL RNA-derived SINEs in Supraprimates. Trends Genet. 2007, 23: 158-161. 10.1016/j.tig.2007.02.002.
- Takatori N, Butts T, Candiani S, Pestarino M, Ferrier DE, Saiga H, Holland PWH: Comprehensive survey and classification of homeobox genes in the genome of amphioxus, Branchiostoma floridae. Development Genes and Evolution. 2008, 218: 579-590. 10.1007/s00427-008-0245-9.
- Garcia-Fernandez J, Holland PW: Archetypal organization of the amphioxus Hox gene cluster. Nature. 1994, 370: 563-566. 10.1038/370563a0.
- Putnam NH, Butts T, Ferrier DE, Furlong RF, Hellsten U, Kawashima T, Robinson-Rechavi M, Shoguchi E, Terry A, Yu JK, et al: The amphioxus genome and the evolution of the chordate karyotype. Nature. 2008, 453: 1064-1071. 10.1038/nature06967.
- Zheng D, Zhang Z, Harrison PM, Karro J, Carriero N, Gerstein M: Integrated pseudogene annotation for human chromosome 22: evidence for transcription. Journal of Molecular Biology. 2005, 349: 27-45. 10.1016/j.jmb.2005.02.072.
- Frith MC, Wilming LG, Forrest A, Kawaji H, Tan SL, Wahlestedt C, Bajic VB, Kai C, Kawai J, Carninci P, et al: Pseudo-messenger RNA: phantoms of the transcriptome. PLoS Genetics. 2006, 2: e23. 10.1371/journal.pgen.0020023.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Research. 2003, 13: 721-731. 10.1101/gr.926603.
- Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research. 1997, 25: 4876-4882. 10.1093/nar/25.24.4876.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution. 1987, 4: 406-425.
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution. 2007, 24: 1596-1599. 10.1093/molbev/msm092.
- Nei M, Kumar S: Molecular evolution and phylogenetics. 2000, New York: Oxford University Press.
- Guindon S, Lethiec F, Duroux P, Gascuel O: PHYML Online--a web server for fast maximum likelihood-based phylogenetic inference. Nucleic Acids Research. 2005, 33 (Web Server issue): W557-559. 10.1093/nar/gki352.
- Abascal F, Zardoya R, Posada D: ProtTest: selection of best-fit models of protein evolution. Bioinformatics (Oxford, England). 2005, 21: 2104-2105. 10.1093/bioinformatics/bti263.
- Felsenstein J: Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985, 39: 783-791. 10.2307/2408678.
- Ronquist F, Huelsenbeck JP: MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (Oxford, England). 2003, 19: 1572-1574. 10.1093/bioinformatics/btg180.
- Zhang J, Rosenberg HF, Nei M: Positive Darwinian selection after gene duplication in primate ribonuclease genes. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95: 3708-3713. 10.1073/pnas.95.7.3708.
- Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution. 2007, 24: 1586-1591. 10.1093/molbev/msm088.
- Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: finding rearrangements during alignment. Bioinformatics (Oxford, England). 2003, 19 (Suppl 1): i54-62. 10.1093/bioinformatics/btg1005.
- Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O'Brien SJ: Molecular phylogenetics and the origins of placental mammals. Nature. 2001, 409: 614-618. 10.1038/35054550.
- Nishihara H, Maruyama S, Okada N: Retroposon analysis and recent geological data suggest near-simultaneous divergence of the three superorders of mammals. Proceedings of the National Academy of Sciences of the United States of America. 2009, 106: 5235-5240. 10.1073/pnas.0809297106.
- Prasad AB, Allard MW, Green ED: Confirming the phylogeny of mammals by use of large comparative sequence data sets. Molecular Biology and Evolution. 2008, 25: 1795-1808. 10.1093/molbev/msn104.
- Schneider A, Cannarozzi GM: Support patterns from different outgroups provide a strong phylogenetic signal. Molecular Biology and Evolution. 2009, 26: 1259-1272. 10.1093/molbev/msp034.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.