A genome-wide screen for noncoding elements important in primate evolution
© Bush and Lahn. 2008
Received: 20 August 2007
Accepted: 23 January 2008
Published: 23 January 2008
Skip to main content
© Bush and Lahn. 2008
Received: 20 August 2007
Accepted: 23 January 2008
Published: 23 January 2008
A major goal in the study of human evolution is to identify key genetic changes which occurred over the course of primate evolution. According to one school of thought, many such changes are likely to be found in noncoding sequence. An approach to identifying these involves comparing multiple genomes to identify conserved regions with an accelerated substitution rate in a particular lineage. Such acceleration could be the result of positive selection.
Here we develop a likelihood ratio test method to identify such regions. We apply it not only to the human terminal lineage, as has been done in previous studies, but also to a number of other branches in the primate tree. We present the top scoring elements, and compare our results with previous studies. We also present resequencing data from one particular element accelerated on the human lineage. These data indicate that the element lies in a region of low polymorphism in humans, consistent with the possibility of a recent selective sweep. They also show that the AT to GC bias for polymorphism in this region differs dramatically from that for substitutions.
Our results suggest that screens of this type will be helpful in unraveling the complex set of changes which occurred during primate evolution.
Humans possess numerous behaviors absent in other animals. Such behavioral specializations reflect anatomical, physiological and ultimately genetic modifications which occurred during the course of primate evolution.
The nature of the underlying genetic changes has long interested scientists. According to one influential viewpoint, key genetic differences in primate evolution resulted from changes in the regulation of gene expression . Until recently however, there have been few examples of functionally significant cis-regulatory changes in the human lineage. This is beginning to change with the recent identification of several such cases. These include changes which occurred in a cis regulatory element of the PDYN gene in the lineage leading to humans , and changes in cis-regulatory elements of the LCT gene which lead to adult lactase persistence in a number of human populations [3, 4].
Whole genome sequencing efforts may now allow us to identify more such cases. Several recent studies have used genome-wide screens to identify noncoding elements which may have been subject to positive selection in human evolution [5–7]. The basic approach is to look for genomic regions where the human branch contains a surprisingly large number of substitutions-that is regions which have an accelerated substitution rate in the human lineage. Pollard et al. used a likelihood ratio test approach that compared the branch lengths for each element with what would be expected given a genome wide model for conserved elements rescaled to the conservation level of the given element [5, 6]. Prabhakar et al. developed a different method which incorporated variation in the neutral rate among lineages and loci into its estimation of acceleration . In both cases the main focus was the human terminal lineage after its divergence from chimpanzee.
Here we present a new likelihood ratio test method for identifying acceleration in noncoding elements. Our approach is designed to account for lineage specific variation in mutation rates. We apply it not only to the human terminal lineage, as was done in previous studies, but also to other branches of the primate tree . We identify lists of elements which have been accelerated in these branches, and in addition we report a follow up resequencing study on one of the elements we identified. In it we find evidence suggestive of a recent selective sweep in the human lineage.
Our goal was to identify elements which have undergone acceleration due to changes in selective pressure. To distinguish changes in selection from changes in mutation rate we compared the rate of molecular evolution in a given element with the rate in nearby ancestral repeats. For each element we examined, we obtained ancestral repeats in a 750 kb window surrounding (Figure 1b). Ancestral repeats represent sequence that is likely to be nearly neutral, and therefore allowed us to approximate the local mutation rate.
We then calculated the likelihood of the alignment for each element and its associated repeat alignments jointly under a constrained model, and then separately under an unconstrained model (Figure 1c and Methods). The unconstrained model is the same as the constrained with the addition of an extra variable which allows the human terminal branch to grow more rapidly. The likelihood ratio test (LRT) statistic is based on the ratio of these likelihoods, and will be larger in cases where the ratio between the human branch of the element and the human branch of the nearby repeats is significantly greater than the genome wide average. To assess the significance of these LRTs we used a nonparametric bootstrap method to calculate false discovery rates (FDR) for our data  (Methods).
Additional File 1 contains the 63 elements in our 10% FDR group for the human terminal lineage along with the trees calculated for each element, and for nearby repeats. (The file also contains the 10% FDR groups for for the long human lineage, and for internal branch 2.) These elements, which are generally very conserved among vertebrates, are changing extremely rapidly on the human terminal lineage. In all cases the estimated human branch length of the conserved element is greater than the local neutral branch length as estimated by our ancestral repeat tree. For the top 10 it is roughly an order of magnitude larger.
We wish to understand the genomic processes which produced these elements. To do so it is important to examine them carefully for biases in features such as their pattern of substitutions or physical positioning. Such biases can provide clues as to what forces produced them. We examined the position of our elements relative to Ensembl gene annotations and found that the accelerated group for the long human branch was enriched near gene deserts (defined as a region > 500 kb without an Ensembl gene). 34% of our starting set of 169,447 elements fall in or adjacent to a gene desert. In comparison for the accelerated group on the long human branch, the proportion is 49% (Fisher's exact test p = 0.008181). For internal branch 2 and the terminal human branch this trend is not significant after we consider multiple testing.
In an earlier study, Pollard et al. found that in the 202 accelerated elements they identified on the human terminal branch there was a substitution bias from AT to GC. AT to GC substitutions constituted 57% of their total, while GC to AT were 29%. They suggested that biased gene conversion might contribute to this, an idea that was supported by the fact that their human accelerated elements were enriched in the terminal band of chromosomes. These regions tend to have a higher recombination rate, and are therefore a likely site of biased gene conversion . It is interesting to ask if these patterns are also true for our group of accelerated elements. We counted AT to GC and GC to AT substitutions, omitting cases where the mutation occurred at a hypermutable CG dinucleotide. We found that in our human terminal group the bias to GC is weaker: 51% of substitutions are AT to GC while 31% are GC to AT. This ratio is not significantly different than what we find in the whole set of 169,447 elements which we screened. In the long human lineage group the proportion of AT to GC substitutions is 45%, which is less than that found in the whole set of elements. The same is true for the other primate lineage FDR groups. We should note that though we don't see a significant trend toward bias in the groups of elements we look at, certain individual elements do seem to show a strong trend. We will discuss one of these below. And we do find some tendency for our accelerated elements to be located in the final band of chromosomes. Our 10% FDR groups for the long human lineage, and for the terminal human lineage both have a higher proportion of elements in the final band of chromosomes than does the complete set of elements we screened (fisher exact test p = 0.005 and p = 0.15 respectively).
In fact, screens of this type are by nature only a first step. They provide a list of candidate elements which can then be subjected to further experiments. This is so because various confounding factors can lead to spurious hits. For example, if an element becomes nonfunctional in the human lineage, this will lead to relaxation of constraint and a higher substitution rate. Accelerations of this type should be less dramatic than those caused by positive selection, but may nevertheless slip into our lists. The best way to think about these lists, is that they are likely to be enriched for cases of positive selection. But in any individual case, we must take the element and try to find additional evidence.
To address this, we resequenced a 5.5 kb region around this element in a global human diversity panel of 90 individuals (Additional File 2). We found that the human specific substitutions in the element itself are fixed in our sample of humans. We also found that the surrounding region has a low level of polymorphism (pi = 0.00033), and a negative Tajima's D statistic suggesting an abundance of rare alleles (Tajima's D = -1.698, 0.05 < p < 0.10). Levels of polymorphism increase as one moves to the edges of the region we resequenced. In particular the 500 bp on the centromeric side of our region have a much higher level of polymorphism. This may be due to the fact that they overlap a recombination hotspot as defined by LDHot method [13, 14] using Hapmap  and Perlegen  data. Excluding those 500 bp, pi in the remaining 5 kb region is 0.00017 and Tajima's D is -2.154 (p < 0.01). Pollard and colleagues also noted that the region of low polymorphism appears to be about 5 kb .
GC vs. AT bias for polymorphisms and human lineage substitutions
AT to GC
GC to AT
For the future, we see several ways to improve our method. Ancestral repeats are known to have relatively less constraint than much of the genome. However it is worth noting that certain processes such as gene conversion may be more common in repeats, and that there may be variation between different kinds of repeats in terms of substitution rates. Such considerations may provide a basis for further improvements in our method. For example we could focus on certain categories of repeats which are known to have higher average substitution rates, or are less subject to gene conversion.
We think it will be important to choose cases such as these and obtain direct experimental evidence about their function. One can begin by using transgenics to study expression in vivo. For example, in the case of the element located in the HNT gene we can attach the human and mouse versions to reporter constructs. We determine if these drive expression, and if so whether the human version's expression differs significantly from mouse. The hope is that in some cases the results of such work will suggest more detailed functional studies, specific to the biology of a particular adaptation.
We are optimistic that these lists and others like them will allow us to get a toehold in understanding the genetic basis of key adaptations in primate evolution. In this way, we can to begin to unravel the complex set of changes which have occurred in the course of primate evolution.
We began with the Phastcons elements (based on the 17 way vertebrate mutliz alignments) [9, 10] on the March 2006 release of the human genome. We subjected these to a number of filters before obtaining multiple alignments for six eutherian mammal species: human [20, 21] (NCBI build 36.1 Mar. 2006), chimpanzee  (build 2 Mar. 2006), macaque  (v1.0 Jan. 2006), mouse  (build 36 Feb. 2006), rat  (v3.4 Nov. 2004) and dog  (May 2005).
We eliminated coding sequence using the UCSC genome informatics group's table browser . This step was important because one category of spurious alignment is the case where a pseudogene has been aligned in one or more lineages. This situation could be mistaken for acceleration. We therefore filtered out all elements which overlapped UCSC Known genes, Refseq genes, Ensembl genes, mRNAs and spliced ESTs using the table browser on both the human and mouse genomes. We also eliminated elements which overlapped regions of low sequence quality (score<40) in any of our six genomes. We found that spurious alignments often involved cases where the aligning fragment in one of the species did not come from a syntenic genomic region relative to the aligning fragments in other species. We therefore applied a gene synteny filter. For each alignment, we examined the Ensembl genes nearby the aligning fragment in each species. To pass the filter, fragments needed to be less than three genes distant from a homologous gene in all six genomes. Another possible source of misalignments are cases where an element has undergone a recent duplication in one of the genomes. To eliminate this possibility, we filtered out elements which had multiple high blat hits in any of the six genomes. We also eliminated elements which were less than 50 bp long, or which were not present in all six of our species.
We obtained Multiz alignments for the 169,447 elements passing our filters using the six way mammalian synteny alignment available on the UCSC website . In order to keep our molecular evolutionary models simple, we eliminated alignment columns with a CpG in any of the species. We also eliminated columns containing a gap.
By intersecting a genome wide set of repeats with the six way multiple alignments, we identified a set of repeats which were present in the ancestor of our six mammal species. Such elements are a good neutral proxy and we used them to estimate the local mutation rate around a given conserved element.
Our likelihood ratio test method is conceptually similar to a relative ratio test . It is illustrated in Figure 1. For each element we obtained all ancestral repeats in a 750 kb window surrounding (Figure 1b). If an element had less than 6000 bp of repeats, we omitted it. For the whole set the median number of repeat bases per element was 17,930. We calculated the likelihood of each element and associated repeats jointly under a constrained model (weighting the element and repeat bases differently in the likelihood). We then did the same with the unconstrained model (Figure 1c). In the constrained model we require the rate of evolution of element and repeats to be proportional. The constant of proportionality for each branch is determined by the genome wide proportion (Figure 1d). To calculate it we determined the length of a particular branch for all elements taken together, and did the same for all repeats taken together. The ratio between these was the constant of proportionality for a given branch. It was necessary to calculate separate constants for each branch because these vary from branch to branch [22, 28–31]. The constrained model also includes a tree wide scaling factor S, which accounts for the fact that different elements have different levels of conservation. Then for each branch we wish to test, we also obtain the likelihood for an unconstrained model. This model allows the element to evolve more quickly on that particular branch. From these likelihoods we calculate the likelihood ratio test (LRT) statistic for this element and this branch. We implemented this with the HYPHY package  using the HKY85 substitution model . We are interested in cases where acceleration has occured in an individual element on a particular lineage. To isolate such effects, our models are designed to account for three other kinds of factors. Local variations in the mutation rate are accounted for by the fact that we are looking at the ratio between element and nearby reapeats. These should be affected equally by such variations. Lineage specific differences in constraint which are genome wide are accounted for by the constants of proportionality (e.g. Khuman, Kchimp in Fig. 1C.). And element to element variations in the (tree-wide) level of conservation are accounted for by S. differences between constrained and unconstrained models will reflect cases where the substitution rate in an individual element on a particular lineage has been accelerated.
To identify significant LRT values, we need to know the distribution of LRTs in elements which do not have a lineage and locus specific acceleration. We chose to do this with a nonparametric bootstrap method which involved creating pseudo elements by taking a random sample from all element alignment columns. We obtained mock ancestral repeats in the same manner. Both were size matched to actual data. The element and repeat sets produced by this method are a fair approximation of the null (no acceleration) condition. Such mock elements are unlikely to contain a cluster of accelerated columns. Such a cluster is what we would expect in a true accelerated element. The fact that they may contain any accelerated column by chance is conservative for our application. Contrained and unconstrained models were constructed in the same way as for real data. The constants of proportionality for these were created by calculating pseudo-element and pseudo-repeat branch lengths for the complete set of each taken together. We created 3,388,940 element repeat sets, and calculated an LRT for each. Using LRTs from these we calculated false discovery rate groups using the method of Benjamini and Hochberg .
The human samples and re-sequencing procedure are the same as in . In this case we analyzed the data with phred, and phrap [35, 36]. We called SNPs using polyphred with score set to 30 and confirmed them by eye using consed [37, 38].
We thank James Taylor for providing an initial set of ancestral repetitive elements, Aya Pusic for her efforts on an earlier version of the gene synteny filter, and Katherine Pollard for comments on a draft. This work was supported by the Howard Hughes Medical Institute and by US National Institutes of Health grant HL07605.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.