Natural selection among Eurasians at genomic regions associated with HIV-1 control

Background HIV susceptibility and pathogenicity exhibit both interindividual and intergroup variability. The etiology of intergroup variability is still poorly understood, and could be partly linked to genetic differences among racial/ethnic groups. These genetic differences may be traceable to different regimes of natural selection in the 60,000 years since the human radiation out of Africa. Here, we examine population differentiation and haplotype patterns at several loci identified through genome-wide association studies on HIV-1 control, as determined by viral-load setpoint, in European and African-American populations. We use genome-wide data from the Human Genome Diversity Project, consisting of 53 world-wide populations, to compare measures of FST and relative extended haplotype homozygosity (REHH) at these candidate loci to the rest of the respective chromosome. Results We find that the Europe-Middle East and Europe-South Asia pairwise FST in the most strongly associated region are elevated compared to most pairwise comparisons with the sub-Saharan African group, which exhibit very low FST. We also find genetic signatures of recent positive selection (higher REHH) at these associated regions among all groups except for sub-Saharan Africans and Native Americans. This pattern is consistent with one in which genetic differentiation, possibly due to diversifying/positive selection, occurred at these loci among Eurasians. Conclusions These findings are concordant with those from earlier studies suggesting recent evolutionary change at immunity-related genomic regions among Europeans, and shed light on the potential genetic and evolutionary origin of population differences in HIV-1 control.


Background
HIV-1 infection is a global health problem and the fourth leading cause of death worldwide [1]. One approach to effective HIV prevention or treatment is to understand the interplay of host molecular mechanisms that are primarily influenced by genetic variation. Multiple genes have been implicated as important genetic determinants of the clinical course of HIV-1 infection by regulating various biological pathways including adaptive and innate immune responses [2]. However, associations with most of these genes appear to be population-specific, and many exhibit marked allele frequency differences between different racial/ethnic groups [3].
This interpopulation genomic variability and these differential associations are still poorly understood, and could plausibly be due to natural selection resulting from distinct environmental or cultural pressures in different geographical populations [3]. For instance, the most promising host genetic variant with biological and functional significance in the control of HIV-1 infection is a 32-bp deletion (CCR5-Δ32) in the CCR5 gene, which codes for the β-chemokine receptor that the virus uses to enter the host cells [2][3][4][5]. The homozygous deletion of a 32-bp segment results in a nonfunctional receptor [6], rendering individuals largely, but not completely, resistant to HIV-1 infection. The frequency of homozygous deletion is < 1.5% among European-descent populations, in which it has been postulated to have arisen in response to immune challenges [5], is rare in African Americans, and completely absent in African and East Asian populations. Studies have also shown evidence of positive selection at other genes involved in HIV-1 pathogenesis, such as TRIM5a and APOBEC3G [7,8], while others do not find any such evidence [9].
Recent genome-wide association studies (GWAS) in both European Americans (EA) and African Americans (AA) have identified a region within the Human Leukocyte Antigen (HLA) superlocus that is consistently associated with HIV-1 viral load (VL) set-point, and which seems to be the main predictor for disease progression [10][11][12][13]. The two single nucleotide polymorphisms (SNPs) with the strongest associations among EA, rs2395029 and rs9264942, are located within 208 kilobases (Kb) of each other in the HLA class I region of chromosome 6, near the HCP5, HLA-B, and HLA-C genes. Among AA, this is also the most strongly implicated region of the HLA [13,14]. Among both AA and EA, several non-HLA loci were also found to be among the 'top hits' (i.e., lowest p-value) in GWAS. In fact, the strongest associations in the GWAS among AA were located outside the HLA region. Thus, if HLA variants implicated in HIV-1 VL control show population differences and evidence of natural selection, one may expect to observe similar patterns at the non-HLA variants that have been implicated in HIV-1 VL control.
While the HIV-1 virus is known to adapt itself in humans with staggering mutation rates, especially in response to therapy, factors such as the very short time since the emergence of the recent HIV-1 epidemic make it unlikely that selection in humans in response to this specific epidemic has exerted a large, detectable effect [15]. It is however possible that these loci were subject to natural selection in the more distant past in response to other pathogens, and this has contributed to the population differences we currently observe in HIV-1 VL control. We therefore hypothesize that the genomic regions most strongly associated with HIV-1 VL control in GWAS have been subject to natural selection and differentiation in some populations. In order to test this hypothesis, we examined patterns of F ST and relative extended haplotype homozygosity (REHH) in the genomic regions surrounding both the HLA and non-HLA HIV-1 control-associated loci identified through GWAS in EA and AA. We find evidence of natural selection among Eurasian groups at these loci.

F ST analysis
We first analyzed F ST at the loci with the strongest evidence of association among EA (rs2395029 and rs9264942 in the HLA locus; see Table 1). The region surrounding these two loci, which are in low LD (r 2 = 0.06, D' = 0.86) in a EA population [11], exhibits very low F ST compared to random regions of chromosome 6 for all of the pairwise comparisons with sub-Saharan Africans (AFR), suggesting that much less differentiation than expected has occurred between AFR and other groups (see Figure 1). However, the pairwise comparisons with Europeans (EUR) show that the extent of differentiation between EUR and Middle Easterners (MID) is substantially elevated, as it is between South Asians (SAS) and EUR, compared to the comparisons with AFR. Our bootstrap analysis shows that the F ST of AFR- EUR is significantly lower (p < 0.001, and asymptotically, p < 3.8 × 10 -12 ) than the F ST of MID-EUR, SAS-EUR, and AME-EUR, after considering the chromosomal background differences. We next examined whether this pattern of population differentiation applies to the wider region of the HLA that contains four additional independently associated loci among EA, and the top five 'hits' in the HLA among AA (see Table 1, and top two panels in Figure 2). For most of this region we observe a similar pattern of very little differentiation between AFR and other groups and greater differentiation among EUR, MID, and SAS, as well as Oceanians (OCE). Next, we examined the entire HLA region, and find that this pattern of very little differentiation between AFR and other groups applies specifically to the region containing the loci associated with HIV-1 VL control, and is therefore not generalizable to the entire HLA. As shown in Figure 3, this pattern applies approximately to the 31 Mb to 32 Mb region of chromosome 6, which is the region containing variants associated with HIV-1 VL control In order to investigate the finer-grained patterns of differentiation among all 53 populations, we focused on the 400 Kb genomic window surrounding rs2395029. As shown in the above analyses, this region exhibits very little differentiation between AFR groups and other groups. However, within AFR, the degree of differentiation appears to be quite elevated among the Bantu, Mandenka, and Yoruba. We observe the greatest degree of differentiation among the Makrani, Sindhi (SAS), and the Druze (MID) populations (see Figure 4). F ST results for the mean of the top 10 non-HLA 'hits' for each of the EA and AA GWAS (see Table 1) are shown in Figure 5. We find a pattern of differentiation for the AA loci similar to that found for the HLA loci in that the group-specific F ST (GSF ST ) appears to be lower compared to that of other groups. However, this difference is not statistically significant since the 95% CI for the AFR GSF ST overlaps with that of other groups. Results for individual loci are shown in Additional Files 1 and 2. Among the AA 'top hits', the region surrounding rs1357339, a SNP in an intergenic region on Chromosome 11, is much less differentiated between AFR and other groups compared to the differentiation among non-AFR. We observe a similar pattern at rs430374 near the ST8SIA5 gene on chromosome 18, and at rs9910853 in the ZNF652 gene on Chromosome 17. In all of our analyses, we obtained very similar results using cM as opposed to Kb.

REHH
Depending on population group, among the 27,300 -42,400 total haplotype cores across chromosome 6, we examined between 346 and 520 haplotype cores in the region delimited by 200 Kb upstream of rs9264942 and 200 Kb downstream of rs2395029. The number of >99 th percentile REHH values within this region are shown in Table 2. We find an excess of extreme REHH values among SAS, MID, and East Asians (EAS) compared to AFR and Native Americans (AME). This excess is statistically significant (p < 0.0017) for all pairwise comparisons with AFR and AME, except for EUR and OCE. Q-Q plots (Additional file 3) also show the excess of lower-than-expected p-values among SAS, EAS, and MID, compared to the other groups. We also examined REHH values for haplotypes extending to 150 Kb, and found a similar pattern. We find a similar pattern when using only part of chromosome 6 containing the HLA, instead of the entire chromosome, as the empirical distribution. The lowest panel in Figure 3 shows the location of haplotypes with REHH p-values less than 0.005 for each group, except AFR and AME, for which there are no qualifying haplotypes. Many of these haplotypes are located around rs2395029, thus providing confirmatory evidence (based on both allele frequency differentiation and haplotype length) for recent selection at these loci in the MID, EUR and SAS and EAS groups. We observe an upward trend in F ST for the pairwise comparison at the same location where we observe evidence of extreme REHH for a given group. This is especially apparent for MID, SAS, EAS, and EUR. REHH results for the non-HLA 'top hits' are shown in Tables 3 and 4. We find no evidence of major differences among groups in the number of loci exhibiting high REHH values.
Finally, we sought to determine whether the susceptibility or the protective allele for rs2395029 and rs9264942 is contained in the core of the haplotypes (with frequency greater than 5%) that are outliers in the REHH distribution (p < 0.05). We find no qualifying haplotypes containing either of these SNPs among AFR and MID. Among SAS, we find that the susceptibility allele (T) for rs2395029 is the allele which is in the core of a haplotype with high REHH (p = 0.0015 and p = 0.0022, in either direction, frequency = 22.5%). This haplotype contains approximately 23% of the susceptibility alleles at this locus present among SAS. Among EUR, the protective allele (G) is in the core of a haplotype with high REHH (p = 0.0029 and p = 0.032 in either direction, frequency = 5.4%). This haplotype contains all protective alleles at this locus present among EUR. Among EAS, the susceptibility allele of rs2395029 is on a high-REHH haplotype (p = 0.0089 and p = 0.046 in either direction, frequency of 18%). This haplotype contains approximately 10% of the susceptibility alleles at this locus present among EAS. The protective allele (C) of rs9264942 is on a high REHH haplotype (p = 0.024, frequency = 9%). This haplotype contains approximately 21% of the protective alleles at this locus present among EAS. Among OCE, the susceptibility allele of rs2395029 is on a high-REHH haplotype (p = 0.028 and p = 0.044, frequency = 30% and 14%, respectively). Since the susceptibility allele is fixed among OCE, these haplotypes contain approximately 30% and 14% respectively, of the susceptibility alleles at this locus among OCE. We find no qualifying haplotypes among AME.

Discussion
We hypothesized that the genetic variants found to be associated with HIV-1 VL control were subject to recent natural selection and population differentiation. This natural selection could have resulted in the observed population differences in HIV-1 pathogenicity among contemporary populations. We examined the most significant HLA and non-HLA variants associated with HIV-1 VL in GWAS among EA and AA, and found that the top associated loci in the HLA region are located in a sub-region of the HLA that shows very little relative differentiation between the considered sub-Saharan African group and other groups, compared to the relative levels of differentiation among the Eurasian groups considered. We also confirm that the patterns observed in this sub-region are not generalizable across all HLA sub-regions.
Considering all 53 populations, we find the greatest degree of differentiation at the rs2395029 locus for many pairwise comparisons with the Makrani and Sindhi in Pakistan, as well as the Bantu in Africa, and the Druze in Israel. Further studies are needed to confirm and gain a better understanding of the possible reasons for this pattern. Although we take into account a total of 253 SNPs in this locus, these results should be interpreted with some caution given the small sample sizes for some populations.
We find that the differentiation pattern observed in the HLA sub-region also applies to the non-HLA 'top hits', but only among those identified in the HIV-1 VL GWAS among AA. Averaging GSF ST over the top ten regions shows a general trend of a smaller degree of differentiation between the sub-Saharan African group and other groups. The lack of differentiation that we observe between the sub-Saharan African and other groups at the HLA sub-region could be due to a high degree of conservation. It is plausible that as each population in Eurasia was subjected to unique selection pressures, a greater degree of differentiation occurred among Eurasian groups than between each of these groups and sub-Saharan Africans. These results should be interpreted with caution because some of the 'top hits' that we examine do not reach genome-wide statistical significance in the respective GWAS.
Paralleling the pattern in differentiation, we observed evidence of extreme relative extended haplotype homozygosity (REHH) among Eurasian groups but not among sub-Saharan Africans or Native Americans. On the basis of our finding of multiple haplotypes with extreme REHH and at relatively low frequencies (i.e. all below 37%), our results appear to be more consistent with a mode of recent evolution characterized by multiple soft sweeps as opposed to single hard sweeps [16].They also suggest that the patterns of differentiation and REHH are not uniform and homogenous across all Eurasians. Instead, it is possible that population-specific patterns of genetic change, perhaps in response to region-specific selection pressures, resulted in unique localized adaptations in different Eurasian populations. Our REHH   results suggest that the evidence for selection in the HLA among Eurasian groups is one in which several different haplotypes of low to moderate frequency have spread through populations. In this context, it should be noted that REHH may not be powerful enough to detect very low frequency haplotypes [17]. Other tests of selection such as XP-EHH (using the HGDP browser: http:// hgdp.uchicago.edu/cgi-bin/gbrowse/HGDP/) do not appear to show similar trends as those we obtained using F ST and REHH. This may be due to the fact that XP-EHH is most powerful for cases of selection in which the selected haplotype reaches a very high frequency in one population but not another, which does not appear to be the situation for the loci that we have examined. Among Europeans in the HGDP, all rs2395029 protective alleles (G) are on a haplotype that has high REHH, suggesting that this allele or one linked to it may have quickly arisen in response to a selective pressure such as an infectious disease in Europe. However, among South Asians, East Asians and Oceanians, it is the susceptibility allele at this SNP that is the allele present on the haplotype that has high REHH. The inconsistency of these results could signify, among several things, that there are many different polymorphisms in this region that could have functional significance, that the genetic basis of adaptation to similar pressures could be different among different groups, and finally, that the genetic basis of HIV-1 control could be different among different racial/ethnic groups.
Our findings are consistent with recent findings that signatures of selection among Europeans are enriched for immunity related genes in the HLA [18,19]. In fact, Kudaravalli et al. [20] recently found that SNPs associated with gene expression levels of HLA-C also show evidence of selection in both Europeans and East Asians in the HapMap samples. Along with these and other studies (see [21]), our findings suggest the presence of  Listed is the number of haplotypes with REHH empirical p-value less than 0.01 and core haplotype frequency greater than 5%.   Listed is the number of haplotypes with REHH empirical p-value less than 0.01 and core haplotype frequency greater than 5%. (AFR: sub-Saharan Africa; MID: Middle East; SAS: South Asia; EUR: Europe; EAS: East Asia; OCE: Oceania; AME: Americas) selection pressures on the immune system, possibly due to geographic, demographic, cultural, or environmental factors related to subsistence. For example, our fine-scaled analysis of all 53 HGDP populations shows that among the sub-Saharan populations, those populations that show elevated F ST between each other (Bantu, Mandenka, and Yoruba) are also the groups that have a history of practicing agriculture. It should be noted that we have not examined populations from East Africa, as these may show very unique patterns, especially considering that the frequency of the rs2395029 protective allele among the Maasai in the HapMap3 sample is quite elevated (14%), and that Henn et al. have found that this allele is also at a relatively high frequency among the Sandawe (17%) in East Africa, and that it exhibits evidence of recent selection in this group [22]. While we have found evidence suggesting the action of natural selection on the genomic regions associated with HIV-1 VL among Eurasian populations, one has to be cautious in interpreting our results. Although in all of our analyses, we have controlled for the genomic background by comparing F ST and REHH at the locus of interest to a distribution of other loci on the respective chromosome, we have not necessarily directly tested whether the patterns we observe are consistent with natural selection as opposed to more stochastic evolutionary forces such as genetic drift. Although it is difficult to distinguish selection from drift with certainty, the greater extended haplotype homozygosity among Eurasians coinciding with elevated differentiation among these groups makes the situation we have outlined (i.e. positive selection among Eurasians and balancing selection among some sub-Saharan Africans) more compelling.

Conclusions
In conclusion, our results show that the HLA region, as well as some of the non-HLA regions, that contain loci associated with HIV-1 VL control among persons of both European and African ancestry, appear to have been subject to natural selection, resulting in population differentiation among Eurasian groups, but not among other groups. Future studies on different samples and/or with different methods will be needed to confirm our findings, and should examine more fine-scale patterns both at the geographic and genetic level. Understanding the genome-wide selection pressure on HIV-1-interacting proteins can provide insights into the evolutionary dynamics of host factors, the genetic basis of differences between nonpathogenic and pathogenic lentivirus infection, and the roles of individual genes in host-pathogen interaction and immunopathogenesis.

HGDP populations and genotypes
We used the publicly available data from the Human Genome Diversity Project (HGDP) collection of 938 individuals from 53 different populations, genotyped at over 650,000 SNPs on the Illumina 650Y platform, and phased using fastPHASE [23]. Following recent reports [24,25] we grouped the 53 populations into seven broader groups based on geographical regions: Sub-Saharan Africa (AFR; n = 102), Middle East (MID; n = 160), South Asia (SAS; n = 200), Europe (EUR; n = 156), East Asia (EAS; n = 229), Oceania (OCE; n = 28), and America (AME; n = 63). A listing of the populations in each group can be found in Additional file 4.

HIV-1-associated loci from GWAS studies
The set of loci examined in this study (see Table 1) was determined from two HIV-1 GWAS, one performed in European Americans (EA) [11], and the other in African Americans (AA) [13]. In EA, the most strongly associated variants were located in the HLA region of chromosome 6. In AA, the most strongly associated variants were located in various parts of the genome, including the same HLA region found in EA. We chose the top 2 'hits' (according to p-value) in the EA study (rs2395029, rs9264942). We also examined four other independent associations in the HLA region (rs259919, rs9468692, rs9266409, rs8192591) reported in the EA study, and the top five HLA 'hits' in the AA study. We also examined the top ten non-HLA 'hits' in each of the two studies. Among these top ten 'hits', we only considered the SNP with the lowest p-value if more than one in the same genomic region made the list.

F ST
The F ST statistic partitions the total genetic variance into within-and between-population components, thereby quantifying the extent of population differentiation [26]. An elevated F ST at a given locus relative to the rest of the genome indicates a high degree of differentiation among populations, which may be indicative of positive selection, while a low F ST is consistent with balancing or purifying selection, for example in highly conserved regions of the genome. Since individual SNPbased estimates of F ST are highly variable and may therefore be unreliable indicators of differentiation at a genomic locus [27,28], we calculated an average of F ST values for all SNPs contained in varying window sizes, from 50 Kb to 1.6 Mb ("zooming out" by a factor of two), each centered on the HIV-1-associated SNP. Given that we are interested in relatively recent selection events, a window-based estimate of F ST allows us to detect the hitchhiking of loci near the putatively selected variants. In addition, a window-based approach to F ST seems preferable since an associated variant detected by GWAS may not represent the actual causal variant, but instead a proxy for a nearby variant. In order to account for variation in recombination patterns along the genome, we also perform the F ST analyses using cM distance instead of Kb distance. We used the hg18/build 36 genetic map, retrieved from http://hgdp.uchicago. edu/Browser_tracks/Genetic_Maps/.
To calculate F ST , we used the method of Weir and Cockerham [29]. We calculated a single global estimate of F ST (based on all 7 population groups), as well as all 21 pairwise estimates of F ST . F ST was not calculated for a SNP if that SNP is monomorphic in the groups being compared. Negative values of F ST were given a value of 0 because negative values are biologically meaningless.
To control for population-specific demographic effects on the genome, we compared the F ST in the risk window to a null distribution of F ST values calculated in randomly chosen windows. Specifically, for each risk SNP, we randomly chose 1000 equally sized (in bp) windows along the same chromosome, with similar genic/ non-genic content (± 10%), since F ST tends to be slightly higher for genic SNPs [30]. The genic/non-genic classification was performed according to the annotation provided by Sullivan et al. (https://slep.unc.edu/evidence/? tab=Downloads), which classifies a SNP based on whether it is in the transcribed region of a gene. SNP annotations were created using the TAMAL database [31], based chiefly on UCSC genome browser files [32], HapMap [33], and dbSNP [34].
For all 21 possible pairwise F ST comparisons, we obtained percentile ranks of the window centered on the risk SNP, compared to the 1000 randomly centered windows across chromosome 6, and use these to compare differentiation of the risk-SNP window across groups. However, in order to determine the significance of the difference between the risk-SNP window F ST between populations A and B and the same risk-SNP window F ST between populations A and C, while considering the difference in null chromosomal background (through the mean F ST of 1000 randomly chosen windows, excluding the risk-SNP window, as described above), we generated 1000 bootstrap resamples from the total sample of 938 individuals. For each of the bootstrap resamples we calculated the risk-SNP window F ST and the mean F ST of the 1000 random (non-risk) windows. We then tested the null hypothesis that the difference in risk-SNP window F ST is the same as the difference for the mean of the random windows, formulated as follows: where (θ X, A-B ) is the F ST for the risk window (X) between population A and B, θ X, A-C is the F ST for the risk window (X) between population A and C, θ -X,A−B is the F ST at regions other than the risk region X (-X) between population A and B, and θ -X,A−B is the F ST at regions other than the risk region X (-X) between population A and C. To do so, we assumed that the sample estimators of the four θ's utilized in the null hypothesis stated above are asymptotically unbiased estimators. Let us denote the sample estimators of the population parameters by the use of a circumflex and let: T = (θ X,A−B −θ X,A−B ) = (θ X,A−C −θ X,A−C ). Under the null hypothesis and the assumption that the estimators are unbiased, E(T) = 0. Therefore, testing the null hypothesis that E(T) = 0 is a valid test of the null stated above. In an ordinary sample of independent observations (individuals) taken from the populations of interest, one can test E(T) = 0 using bootstrap methods. None of the above analyses depend on any of theθs or any quantities on which they are based being independent. The value of T from the observed sample T obs was calculated. One thousand bootstrap samples were taken from the original sample with replacement, and a value of T calculated in each bootstrap sample, denoted T * i for the ith bootstrap sample. A p-value to test the null hypothesis stated above was then calculated in two ways, conservatively as p 1 + 1000 i=1 I(|T * i − T obs | T obs ) 1000 , and asymptotically by lettingσ T obs be the ordinary sample standard deviation of the T*, and assuming that the T obŝ σ T obs 2 is asymptotically distributed as χ 2 with 1 df. Due to computational limitations, we restricted this analysis to the rs2395029 SNP. Given that we are interested in examining the top ten non-HLA variants as a set, as well as each one separately, we averaged the 10 percentiles of all pairwise F ST comparisons. In order to obtain a group-specific F ST for a given group, which we refer to as GSF ST , we averaged the F ST percentile ranks of the six pairwise comparisons that contain the group in question. For example, to obtain an estimate of GSF ST for AFR, we took the average of the percentile F ST of the following six pairwise comparisons: AFR-MID, AFR-SAS, AFR-EUR, AFR-EAS, AFR-OCE, and AFR-AME. Instead of considering many pairwise comparisons for each locus, GSF ST allows us to examine how differentiated a given group is at a given locus compared to all other groups, and relative to the rest of the chromosome.
Because the associated loci in the HLA region were clustered in a relatively small region, we examined the mean F ST of all SNPs in a sliding 400 Kb window with an overlap of 200 Kb, from 29.8 to 32.5 Mb on chromosome 6. We also examined the entire HLA region to determine if the pattern observed in the 29.8 to 32.5 Mb region is characteristic of the HLA region as a whole (28 to 35.6 Mb). To investigate the level of differentiation among all 53 populations, we constructed a matrix of F ST percentiles for all possible pairwise comparisons for the 400 Kb region surrounding rs2395029.