Positive selection for the male functionality of a co-retroposed gene in the hominoids
© Zhang et al. 2009
Received: 19 March 2009
Accepted: 15 October 2009
Published: 15 October 2009
Skip to main content
© Zhang et al. 2009
Received: 19 March 2009
Accepted: 15 October 2009
Published: 15 October 2009
New genes generated by retroposition are widespread in humans and other mammalian species. Usually, this process copies a single parental gene and inserts it into a distant genomic location. However, retroposition of two adjacent parental genes, i.e. co-retroposition, had not been reported until the hominoid chimeric gene, PIPSL, was identified recently. It was shown how two genes linked in tandem (phosphatidylinositol-4-phosphate 5-kinase, type I, alpha, PIP5K1A and proteasome 26S subunit, non-ATPase, 4, PSMD4) could be co-retroposed from a single RNA molecule to form this novel chimeric gene. However, understanding of the origination and biological function of PIPSL requires determination of the coding potential of this gene as well as the evolutionary forces acting on its hominoid copies.
We tackled these problems by analyzing the evolutionary signature in both within-species variation and between species divergence in the sequence and structure of the gene. We revealed a significant evolutionary signature: the coding region has significantly lower sequence variation, especially insertions and deletions, suggesting that the human copy may encode a protein. Moreover, a survey across five different hominoid species revealed that all adaptive changes of PSMD4 -derived regions occurred on branches leading to human and chimp rather than other hominoid lineages. Finally, computational analysis suggests testis-specific transcription of PIPSL is regulated by tissue-dependent methylation rather than some transcriptional leakage.
Therefore, this set of analyses showed that PIPSL is an extraordinary co-retroposed protein-coding gene that may participate in the male functions of humans and its close relatives.
Retroposition, an RNA-intermediated copy mechanism, could shape genomes widely in eukaryotes, and in particular plays a substantial role in evolution of functional novelties . People ever viewed it as a trivial molecular process for making functionless processed pseudogenes . However, extensive analyses have revealed that a large number of retrosequences have acquired various functions from vertebrates to invertebrates [3, 4], from spermatogenesis  to courtship behaviors . Many retrogenes could recruit nearby preexisting exon-intron sequences and genomic regions to form a chimeric gene structure with novel protein structures, which may expand protein functional diversity [6–8].
Almost all observed retroposition events involve a single parental gene that serves as substrate for retroposition. However, Akiva et al recently observed an extraordinary case that two adjacent genes in the hominoid lineage, PIP5K1A and PSMD4 in chromosome 1, co-retroposed from a read-through transcript and formed a new chimeric gene, PIPSL, in chromosome 10 . PIP5K1A encodes the alpha isoform of phosphatidylinositol 4-phosphate 5-kinase type I (PIP5K), which is involved in the synthesis of two essential second messengers, 1,2-diacylglycerol and inositol 1,4,5-trisphosphate [10, 11]. On the other hand, PSMD4 recruits ubiquitylated substrates to the proteasome for their degradation, which is mediated by two conserved 20-30 residual hydrophobic regions, i.e., ubiquitin-interacting motifs (UIMs) .
Northern analyses revealed high expression of PIPSL in testis in humans and chimpanzee and base-level or undetectable expression in other tissues . The human PIPSL possesses a 1 bp in-frame deletion at +45 bp with respect to the start codon, which is fixed in human populations. Such a deletion causes an early frameshift with disrupted translation such that western blotting does not reveal protein product in human. However, Western blotting also fails to detect proteins in chimpanzee which does not have such a deletion. Therefore, PIPSL was believed to undergo translational silencing in both species . Comparison of the substitution rates between synonymous and nonsynonymous rates revealed that the C-terminal portion of PIPSL derived from PSMD4 possibly undergoes positive selection in the early stage of the hominoid clade while the PIP5K1A -derived portion of the N-terminal portion in later stage of the hominoid clade does not depart from neutral expectation . The authors even propose PIPSL might be detrimental in human population now.
These analyses provided valuable data to understand the origination process and the characteristics of PIPSL. Moreover, this extraordinary gene raised two interesting questions for further pursuit. First, considering no protein product is detected, is PIPSL a functional protein-coding gene or it is a processed pseudogene that has never had or has lost its function in human? Second, there were only two species of hominoid species, i.e., human and chimp in the analysis. Such limited dataset renders it hard to differentiate between the following two scenarios: positive selection indeed occurred on evolutionary branches toward human/chimp; positive selection occurs on branches leading to other hominoids, e.g., orangutan and gibbon and thus signal observed by comparing human PIPSL and its two parental genes is a by-product. If the later case is true, it will be reasonable to expect the repressed translation in both human and chimp since PIPSL might be only functional in orangutan and gibbon rather than human and chimp.
We tackled these problems by analyzing the evolutionary signature relating to the coding potential of PIPSL in both within-species variation and between species divergence. Such a strategy, complementary to the biological analyses of Babushock et al. (2007) and Akiva et al. (2006), revealed that the present-day PIPSL is subject to significant evolutionary constraint that shapes the standing variation in natural populations of the functional protein coding gene in humans and maintains its open reading frame (ORF). More than that, comparative analysis reveals that the adaptive evolution occurred in the lineage toward human and chimpanzee. Finally, genome-wide analysis of retro-pseudogenes suggests testis-specific transcription of PIPSL is not due to transcriptional leakage but highly regulated by tissue-dependent methylation. These analyses, in conjunction of previous expression data at RNA level, suggest that PIPSL is a functional retro-fusion testis-specific gene. Thus, the extraordinary co-retroposition mechanism played a role in the evolution of the male-specific functions in the lineage toward the humans.
Statistics of polymorphism, which was generated by DnaSP .
#Single nucleotide polymorphisms
π indel /siteD
The probability of CDS generating not more than nine SNPs if the whole PIPSL locus is homogeneously neutral.
P (S <= S obs | θ = 0.00255)
HKA test using chimp as the outgroup.
p = 0.0343
where " n " is 78 and 2 for our case and the gibbon genome sequence, respectively. In other words, if we increase the sample size of gibbon sequences to 78, the expected number of segregating sites in 3' UTR and CDS should be about 10 and 20, respectively. It is notable this iterative formula usually require multiple alleles like six or even more. Therefore, the estimation of 10 or 20 might not be that accurate. However, such a small sample does show CDS might have more polymorphisms than that of 3' UTR in gibbon PIPSL locus. Therefore, considering the observed data in human, 11 substitutions in 3'UTR and 9 substitutions in CDS, the deviation of human polymorphism data from the neutral expectation should be more likely attributed to the increasing constraint in CDS region.
The polymorphism data of Table 1 also show that CDS appears to avoid indels compared to both flanking regions. Specifically, there are five homozygous indels and two heterozygous indels across 1.6 kb flanking regions; by contrast, there is only one heterozygous indel across 2,589 bp coding region (Fisher Exact Test p < 10-4). Regarding indel diversity, it is an order-of-multitude larger in UTR (6 × 10-4) than in CDS (1 × 10-5). Based on Hudson's formula , the probability to observe only one indel in an CDS of 2,589 bp is only 0.001 if CDS has as identical indel coefficient as that of UTR.
Using between-species data and by performing forward-simulation from the ancient sequence of PIPSL in hominoid, we further tested how it might be possible to maintain one ORF of 2,589 bp. Specifically, taking advantage of NCBI trace data, we assembled the complete PIPSL locus in orangutan and gibbon given the high read coverage (>4x) around this locus. By contrast, we were only able to assemble PIP5K1A -derived region in gorilla due to its lower sequencing coverage in PSMD4 -derived region. Like human and chimp, orangutan maintains its PIPSL ORF. However, the gibbon PIPSL lost its coding potential by accumulating three nonsense substitutions and one in-frame indel (See Additional file 3). Gorilla seems to lie in a similar case with one nonsense substitution (CGA->TGA) in the middle of PIP5K1A -derived region.
100,000 simulations are ran to track frame-disrupting features in case of CDS region derived from PSMD4, CDS region derived from PIP5K1A and the complete CDS of PIPSL, respectively.
P NaNs B
P dis C
The above analysis reveals that natural selection maintained a long ORF from the split of human and gibbon at 18 million years ago (Mya). Taking advantage of recently available sequence data of gibbon, orangutan and gorilla, we investigated this process with higher resolution. We used CODEML of PAML package  to infer the background selection force (See also Method section).
Selection of different lineages based on CODEML.
PSMD4 -derived region
PIP5K1A -derived region
Based on the branch-site model of CODEML, we inferred PSMD4 -derived region consists of seven nonsynonymous mutations that occurred before the split of human and chimp inferred. Four out of them happened around the second ubiquitin-interaction motif (UIM-2) [21, 22], R->H (amino-acid 262 in PSMD4-derived region), A->V (278), Y->C (279) and Q->L (292) (see Additional file 5). Moreover, strongly supports all these substitutions are driven by adaptive evolution. All of them have a Bayes Empirical Bayes (BEB) probability greater than 0.9, which measures whether the Ka/Ks is larger than one for this local region. Out of them, Y->C and Q->L even have a BEB value larger than 0.95. These sites are also supported by HyPhy (Additional file 5). The key residue Ala and Ser are unchanged, which ensures that UIM-2 still maintains its ubiquitin-interaction capability to some extent.
By contrast, the parental protein PSMD4 is nearly unchanged across 70~80 million years' mammalian evolution (Figure 2): there is only one nonsynonymous substitution between primate and dog. Inspection of all available vertebrate lineage sequences shows that PSMD4 was always under strong purifying selection, except prior to the split of birds and mammalian (see Additional file 6).
Testis may provide a transciptionally permissive environment , which implies transcription leakage tends to occur in testis more frequently than other tissues. From this prospective, pseudogenes could be more likely expressed in testis. Therefore, testis-specific transcription is not necessarily a signature of functionality. In order to test this possibility, we performed genome-wide profiling of all retroposed pseudogenes in human. In brief, we integrated and improved upon previous strategies [4, 8, 23] to identify retroposed copies (RPCs). Out of 6,750 RPCs, we generated a highly reliable dataset of 729 retropseudogenes and a less stringent dataset of 5,386 retropseudogenes (see Additional file 7).
Exon-array based retropseudogene expression profile across 11 tissues.
Expression intensity of PseudogenesA
Number of pseudogenes with highest transcriptionB
Number of transcribe-able (>=0.2) pseudogenesC
In addition, the exon-array data confirmed the abundant expression of PIPSL in testis, consistent with previous results from northern profiling experiments in testis, liver, lung and many other tissues . The abundance of PIPSL amounted to 0.71 in testis as revealed by both independent probesets, while PIPSL abundance was lower than 0.2 for all other tissues, indicating trace level transcription. For all 548 pseudogenes with exon array data, 38 (7%) are transcribed in testis above the cutoff of 0.2. However, only six out of them (1%) reaches as high abundance as 0.7 in testis. Moreover, all of these six pseudogenes are also transcribed in some other tissues with the abundance above 0.2. Thus, abundant and specific transcription of PIPSL establishes it as a clear outlier compared to retropseudogenes.
If testis-specific transcription of PIPSL has functional significance, how is this tissue-specificity achieved? Weber and his colleagues generated genome-wide methylation data presented as methylation log2 ratios of bound over input signals and they also proposed the value of 0.4 as a cutoff to differentiate hypermethylation from hypomethylation . As a result, they found germline-specific genes preferentially undergo de novo methylation. Specifically, genes that are not methylated in sperm are more likely to get methylated in somatic cells and transcriptionally repressed. Remarkably, PIPSL is consistent with this pattern in that it is hypermethylated in primary lung fibroblast cells with the log value of 0.6 and hypomethylated in sperm with the log value of -0.4.
For the aforementioned 729 retropseudogenes, only three of them are covered by Weber et al 's data and none of them displays such a de novo methylation pattern. Notably, regarding the larger dataset of 5,386 pseudogenes, for which 190 entries are included in Weber et al 's set, only ten (5%) show a strong de novo methylation in somatic cells as PIPSL does. Considering this dataset might include some functional retrogenes, the percentage of real pseudogenes possessing de novo methylation might be smaller than 5%. Again, this analysis suggests that PIPSL is an intriguing outlier in that PIPSL 's transcription profile is highly regulated rather than transcriptionally leaky. That means testis-specific expression could be a functional signature for PIPSL.
Babushok et al. suggests that the indel around the original start codon disrupts the coding potential of PIPSL locus in human. However, our population genetics analysis shows the downstream 2,589 bp ORF is more constrained than the flanking regions in terms of frequency of SNP or indels. Specially, lack of indels in CDS in the current human population not only suggests PIPSL 's coding potential, but also indicates that human PIPSL is still under purifying selection rather than relaxation, as proposed by Babushok et al. Secondly, our forward simulation shows it is highly unlikely that a neutral segment of such a length would remain over 18 million years. As shown by the PIPSL locus in gibbon, there are up to three nonsense substitutions and one in-frame indel scatters across the whole region. Thirdly, positive selection always occurred in the internal branches leading to human and chimp, which reveals continuing gain of function for a long time rather than limited to the hominoid ancestor. All these three lines of evidence suggest human PIPSL is a bona-fide protein-coding gene. Protein translation might be finely tuned and limited to specific time point, producing a small quantity of protein, as in a special stage of spermatogenesis or during sperm-egg interaction. Thus, PIPSL escapes detection of western blotting in whole testis lysates.
The aforementioned comparative analysis also indicates lineage specific evolution after origination of the PIPSL locus. Different hominoid species might face different environmental changes, which affect fitness of the same gene, for example, PIPSL. An alternative interesting reason might be the difference of the long term effective population size (N e ). All branches from the divergence of orangutan and human/chimp/gorilla groups have an estimation of Ne which ranges from 10,000 to 100,000 (see Additional file 8) [25–28]. Organisms with a large N e are selectively efficient: those slightly advantageous alleles would be more likely to be fixed and those slightly deleterious mutations would be more likely to be removed [29, 30]. By contrast, in organisms with a small N e , slightly beneficial mutations have high chance to get lost and slightly deleterious mutations have high chance to get fixed. From this point of view, PIPSL has a small fitness advantage, which is more likely to get fixed or maintained in orangutan or ancestral branches. In contrast, it might be lost in gorilla with a smaller N e . Herein, positive selection might occur before the split of gorilla and human/chimp since they share the majority of their evolutionary history. Thus, two independent losses of the open reading frame occurred in both gorilla and gibbon. It is notable such parallel loss is not that unlikely for new genes. For example, an X-linked testes chimeric gene, Hun, was created about 2~3 million years ago, prior to the the split of D. simulans, D. sechellia, and D. Mauritiana . However, Hun maintains a integral open reading frame only in D. simulans, while its suffers from different frame-distrupting mutations in both D. sechellia and D. Mauritiana.
As revealed by the lower number of SNPs and indels in human, adaptive selection in the chimp/human ancestral branch (with a much larger Ne) might increase the fitness of PIPSL to an extent such that it could be maintained in the current population under selective constraint.
Selection drives the remarkable change of UIM-2 on the evolutionary branch leading to human and chimp. Considering UIM-2 while not UIM-1 is the preferred ubiquitin binding partner, maintenance of the essential key residues Ala and Ser in UIM-2 explains why PIPSL still can bind ubiquitin, although its capability does decrease relative to PSMD4 . UIM-2 is also known to be responsible to bind ubiquitin receptors . The remarkable adaptive changes in the ancestor of human and chimp might contribute to the change of interaction partner of PIPSL. Since the parental gene PSMD4 undergoes strong purifying selection during hundreds of millions of years of evolution, PIPSL could be an interesting target for further comparative functional study.
Finally, it is interesting to ask how many retroposed fused genes the genome encodes considering the prevalence of transcription-mediated gene fusion event . We compared Ensembl gene annotation and retroposed copies we identified, and found PIPSL is the unique case in the human genome. Analogously, we do not find any case in rhesus monkey, rat, dog, cow, opossum, platypus and fruitfly. In mouse, we found another case that transcripts of 6030436E02Rik and C330019G07Rik together with 1 kb intergenic region fused first and retroposed to Chromosome 8 (see Additional file 9). The fused locus has been pseudogenized with seven frame shifts and four nonsense mutations or six frame shifts and two nonsense mutations scattered in the 6030436E02Rik-derived region and C330019G07Rik-derived region, respectively. Its non-functionality is also supported by the lack of transcription evidence like EST or mRNA. Extremely low abundance of retroposed, fused genes across numerous animals suggests the inefficiency or complexity of this generation mechanism itself. First, as Akiva et al.  shows, most transcription-mediated events are rare or confined to certain tissues. In other words, they might not be expressed in the germ line. Second, the mouse case suggests that the fusion might not be able to generate a continuous ORF with intervention of noncoding regions. Akiva also shows only 25% cases can generate a fused ORF. Finally, such locus might not get fixed in the genome considering it interrupts the original dosage balance for multiple genes. In addition to these issues, it is also not clear how retroposed fused copies can insert into an appropriate genomic context to become transciptionally active.
Thus, PIPSL represents an extraordinary case in which natural selection has increased the genetic novelty via a complicated mechanism in primates.
Identification of retroposed copies are described in the supplementary methods (see Additional file 7).
We used several annotation tracks of UCSC genome browser like Chip-chip data of Ludwig institute  and Human Exon 1.0 ST panel data of Affymetrix . As for the exon array data, UCSC processed the raw signal intensity with a quantile normalization method and generated the summary signal using the PLIER algorithm . After that, these summary values were converted to log-ratios, namely, negative values indicate below-median expression and positive value indicates above-median expression.
In order to test functional constraint, we sequenced PIPSL in 39 human individuals. DNA samples were purchased from the Coriell Institute for Medical Research, which consists of 10 African Americans, nine Africans in the south of the Sahara, 10 Russians and 10 Chinese. Such a combination should be able to cover the majority of human diversity. PIPSL locus including the coding sequence (CDS) and 1 Kb flanking regions (mainly untranslated regions, UTR) were PCR amplified based on primers designed by Oligo http://www.oligo.net. If necessary, multiple PCR experiments were run to amplify the full-length region. After that, PCR bands were sent to Invitrogen for sequencing. For each copy, six to eight walking reactions were performed. Subsequently, we implemented a well-established pipeline including Phred, Phrap  and Consed  to assemble PIPSL locus for each individual.
Single nucleotide polymorphisms (SNPs) and Insertion/Deletions (indels) were identified with Polyphred  and Polyscan . Specifically, homozygous or heterozygous SNPs were called by Polyphred first. We retained those highly reliable SNPs with Polyphred score of 99. For SNPs with a score lower than 99, we retained them only if they were identified by Polyscan too. As for indels, we used Polyscan's results because Polyphred failed to identify any indel. After that, we manually checked Polyscan's results and accepted those indels with high scores. Notably, we found Polyscan tends to assign homozygous indels with much higher score by investigating the raw sequencing data. Thus, our strategy tends to overlook heterozygous indels. However, it should not matter that much because our population analysis mainly relies on SNPs rather than indels and such a bias should exist for both coding region and non-coding region.
Finally, we used DnaSP v4.50  to generate the statistics of polymorphisms and perform Hudson, Kreitman and Aguadé's (HKA) test  to detect whether both loci follow the neutral null model. In brief, based on number of segregating sites, S i and number of divergences, d i , HKA jointly estimates θ i (selection coefficient), f (ratio of N e for both loci) and t (divergence time shared by both loci) by fitting the expected values and variations of S i and d i . Finally, goodness-of-fit is tested with an approximate chi-square test.
Where, l, n and s are defined as the length of region of interest, the number of alleles and the number of segregation sites, respectively. Q n (i) indicates the probability that i mutations occur when there are n ancestral lineages, while P n (s) indicates the probability that s segregating sites in a sample of n individuals.
We slightly modified Tracembler  to automatically retrieve homologous reads from NCBI Trace-BLAST website using PIPSL sequence as the query, E-value 10-10 as the cutoff, and gorilla, orangutan, and gibbon reads as the database. Subsequently, all reads were submitted to UCSC BLAT  server to check whether the best hit of each read is PIPSL rather than its parental genes. We retained reads meeting with the following two criteria: the top hit had to match human PIPSL locus; the alignment identity of the second top hit was smaller than that of the top hit. Finally, we fed all the retained reads into the aforementioned Phred, Phrap and Consed pipeline and assembled PIPSL in gorilla, orangutan, and gibbon.
Baylor University College of Medicine Human Genome Sequencing Center (BCM-HGSC) and Washington University Genome Sequencing Center (WUGSC) sequenced two chromosomes of one wild-born gibbon female. Given the high sequencing coverage for both chromosomes, we identified segregating sites using Polyphred and Polyscan.
We constructed the multiple sequence alignment of PIPSL and its parental genes using MUSCLE  and further manually checked the alignment in GeneDoc . Then this protein based alignment was converted to the codon-based alignment with PAL2NAL .
We performed evolutionary simulation using ReEVOLVER v1.0 as its online document describes . It estimates the probability that an ORF is maintained for millions of years of evolution. In simulation, we used the species tree described in , the substitution rate of 1.0 × 10-9 per site per year and the indel rate of 1.0 × 10-10 per site per year . Given these parameters, ReEVOLVER assumed Kimura-2-parameters model of sequence evolution and did forward simulations from the ancestral sequence constructed by PAML  or DNApars . In this process, the so-called disable features, i.e., mutations causing stop codons or frame shifts, are counted. We performed 100,000 simulations for the whole PIPSL locus, PSMD4 -derived region and PIP5K1A -derived region, separately.
ML-based analysis were implemented using CODEML of PAML package v4.0b . We used the free-ratio model to estimate number of synonymous site and nonsynonymous sites and branch-model to estimate whether there is a significant departure compared to the neutral expectation along one specific lineage. In order to infer which site is under adaptive evolution in a specific lineage, we re-ran CODEML with the branch-site model. Specifically, Yang and Nielsen implemented two models, called A and B , which permits variation of the ω ratio (i.e., Ka/Ks) both among sites and among lineages. PAML 4.0b further permits ω of the null model to vary between 0 and 1 rather than the old model A which fixes the ω ratio to 0. There are two tests associated with current model A. We used Test 2 (fix_omega = 1; omega = 1), which is supposed to be more robust to differentiate positive selection compared to a relaxation of functional constraint . However, in order to increase confidence, we used SubtreeSelection module of HyPhy  to do a similar analysis. Across ReEVOLVER, PAML and HyPhy, we used the species tree of primates described in .
As a complementary test, we also estimated the time required to destroy PIPSL 's ORF in one half of all simulations, t 1/2 . We followed the same substitution rate or indel rate as  and found t 1/2 is about 1.9 million years. Thus, considering PIPSL predated diverge of human and gibbon 18 million years ago, the possibility to maintain this ORF is like 0.518/1.9 or 1 × 10-3.
We appreciate three anonymous reviewers for insightful comments. We are grateful for Roman Arguello, Margarida MC Moreira and Benjamin H. Krinsky in Long lab for insightful discussions and revision of the manuscript. We are also debted to Henrik Kaessmann for discussion of ReEVOLVER and Jianzhi Zhang for providing his software in estimating half life of an open reading frame. This work was supported by grants from China Ministry of Science and Technology 973 Basic Research Program (No. 2007CB946904, 2006CB910404) and 863 Hi-Tech Research and Development Programs (No. 2006AA02Z334, 2006AA02Z314, 2006AA02A312, 2007AA02Z165), China Ministry of Education 111 Project (No. B06001), and the Changjiang Adjunct Professorship Funds of Peking University and Sabbatical Leave support of the University of Chicago to M.L.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.