Evidence of recent natural selection on the Southeast Asian deletion (--SEA) causing α-thalassemia in South China

Background The Southeast Asian deletion (--SEA) is the most commonly observed mutation among diverse α-thalassemia alleles in Southeast Asia and South China. It is generally argued that mutation --SEA, like other variants causing hemoglobin disorders, is associated with protection against malaria that is endemic in these regions. However, little evidence has been provided to support this claim. Results We first examined the genetic imprint of recent positive selection on the --SEA allele and flanking sequences in the human α-globin cluster, covering a genomic region spanning ~410 kb, by genotyping 28 SNPs in a Chinese population consisting of 76 --SEA heterozygotes and 138 normal individuals. The pattern of linkage disequilibrium (LD) and the long-range haplotype test revealed a signature of positive selection. The network of inferred haplotypes suggested a single origin of the --SEA allele. Conclusions Thus, our data support the hypothesis that the --SEA allele has been subjected to recent balancing selection, triggered by malaria.


Background
Malaria has long been known to be a prevalent selective pressure which has greatly shaped the human genome over the past 10,000 years [1,2]. The extensive overlap of the distribution of globin variants and the historical record for malaria prevalence suggests hemoglobinopathies might reflect a selective influence. Many other genes, such as glucose-6-phosphate dehydrogenase (G6PD), ABO, and human leukocyte antigen, are also believed to be evolutionarily interacted with malaria [2,3].
Strong natural selection can leave distinctive genetic imprint in DNA sequences, such as altering the levels of nucleotide variability and/or linkage disequilibrium. These signatures can be identified by comparison with the background distribution of genetic variation in humans, which is generally argued to evolve largely under neutrality [4]. These events have been studied since the advent of techniques for detecting positive selection in humans based on allele and/or haplotype frequency distribution pattern. Novel haplotype-based techniques have been widely used to demonstrate malaria-protective genes. For example, the limited number of observed haplotypes, low haplotype diversity, and extended haplotype homozygosity had indicated recent positive selection on G6PD in India [5]. The pattern of linkage disequilibrium suggests that the HbE mutation occurred recently in Southeast Asia [6]. In our previous study, we found that a 4 bp frameshift deletion of β-globin (β CD41/42 ), common in East and Southeast Asia, perhaps resulting from natural selection, spread widely with the help of gene conversion [7].
Alpha-thalassemia is one of the most common hemoglobinopathies. It is especially frequent in old-world malaria regions, such as Mediterranean countries, South-East Asia, Africa, the Middle East and in the Indian subcontinent [8]. It is caused by the genetic mutations which can lead to reduced production of α-globin chains, thus result in an imbalance between the proportions of αand β-chains in hemoglobin synthesis. In normal individuals there are two copies of α-globin per chromosome. Most documented cases of α-thalassemia are caused by deletions involving one or both of the duplicated α globin genes, although increasing numbers of point mutations have been described. Therefore it is logical to divide α-thalassemia into two different types based on the number of remaining functional copies of the α-globin gene: α + -thalassemia (one copy is deleted or inactivated) and α 0 -thalassemia (both copies are deleted or inactivated). The more functional copies that are lost, the more severe the phenotype is presented: the loss of one copy of α-thalassaemia is phenotypically silent; the loss of two copies leads to mild anaemia; the loss of three copies leads to HbH disease that present anaemia of variable severity and the loss of all four copies is lethal.
Compared with α + -thalassemic deletions, α 0 deletions are much less common on a global scale. Among the various α 0 deletions, most of which are rare, the Southeast Asian deletion (--SEA ) is remarkable for its high frequency. It is found most commonly in Southeast Asia, including Southern China. Our previous epidemiological surveys in Guangdong and Guangxi provinces found that the frequencies of --SEA carriers were around 4-8% in Southern China [9,10]. Its localized distribution and relatively high frequency, considering the severity of its homozygous phenotype, might indicate a recent origin and selective advantage. However, no credible evidence of this has been reported to date. Some research suggests that --SEA in the Thailand population had a single origin [11], but there is a lack of a detailed picture as to how selection elevated its gene frequency.
Fortunately, with the advent of dense maps of human genetic variation, it is now convenient to detect whether positive natural selection operates on variants of interest [12]. Here, we report the results of our fine-grained study of the --SEA allele and haplotypes, based on a population from Southern China, to examine whether the --SEA has been the target of recent natural selection.

Allele frequency
In this study, 28 SNPs encompassing the SEA deletion and covering a~410kb region ( Figure 1, Table 1) were genotyped in 214 samples from Guangdong and Guangxi of southern China. Information for all polymorphisms and the --SEA allele in our samples are listed in Table 1. The derived allele frequency (DAF) for each SNP was computed in carriers and controls. Two SNPs, rs77308790 and rs3760053, which are non-common variants in the normal population (<0.1, Table 1) but were predominant in carriers.

Extended LD around --SEA
We calculated |D'| [13] between markers to evaluate the LD pattern in the surveyed region for the constructed random sample (CRS; See Method). Pairwise comparisons show that strong LD within sites over short distance except those affected by recombination hotspot (Figure 2). However, while hotspot breaks down the LD downstream of the SEA deletion, the --SEA still showed a high degree of LD with other SNPs (Figure 3), such as rs9940585, rs214247, rs1981492, rs11648673 and rs1698232 (although some of them with no statistical significance). Moreover, two SNPs, rs77308790 and rs3760053, showed a pattern similar to the --SEA (Figure 2), indicating a hitchhiking effect due to linkage with the --SEA allele.
Great differences of recombination between haplotypes with the --SEA and wild-type The SEA deletion involves a large fragment, approximately 20.5 kb in length. The increased LD might be due to a lack of recombination in the deletion region on chromosomes carrying the causal allele rather than natural selection, especially when there is a hotspot located in the deletion region. To assess this, we scanned the genetic maps of CHS (Han Chinese in Singapore) and CHB (Han Chinese in Beijing). The local rate of recombination in deletion region were about 0.2 to 0.8 cM/Mb in these populations, which indicated that no hotspot was found in the deletion region compared with the average recombination rate of 1.2 cM/Mb in chromosome 16 [14]. Furthermore, we compared the minimum number of recombination events, namely RM [15] which was obtained using the four-gamete test, between haplotypes with the --SEA allele and wild-type in our samples. The results of the two regions upstream and downstream of the deletion are provided in Table 2. It can be seen that recombination events seem to occur more actively on normal chromosomes than the --SEA chromosomes in both regions.

Test for recent selection on --SEA
A long-range haplotype test (LRH) using the relative extended haplotype homozygosity (REHH) parameter was conducted to detect the signature of recent natural selection [16]. A variant under neutral evolution would take a long time to reach a high frequency and the LD around the variant would decay substantially during this period due to recombination. In contrast, alleles under positive selection will rise to a high frequency so quickly that long-range associations with neighboring polymorphisms are not disrupted by recombination. In this LRH test, we assigned the core region containing four SNPs, rs77308790, rs6600143, rs2858925, and rs3760053, and the --SEA allele using the default parameters in the Sweep software (http://www.broadinstitute.org/ mpg/sweep/). The REHH value of the deletion-bearing haplotype in CRS reached 16.22 on the centromere-proximal side at a distance of 0.35 centimorgans from the focal allele ( Figure 4a). We obtained SNP data on chromosome 16 from the CHS population, which is close to Southern Chinese population, to calculate the empirical distributions of REHH values for core haplotypes. Only the core haplotypes with similar frequency of our data, i.e. 3.1% < MAF < 4.2%, were considered. Compared with other SNPs, the observed REHH values of the deletion haplotypes were significantly larger and the P-value was less than 0.044 ( Figure 4b). To verify the reliability of the result, we regenerated 1000 CRS datasets to obtain a distribution of REHH for the LRH test. The result was consistent, with a mean value of 17.41, and most of the REHH values were within the 5% extreme high value ( Figure 4c).
An approach introduced by Voight [17] and coworkers was used to obtain a crude estimate of the age of the --SEA variant. In this method, age was calculated using the equation, Pr [Homoz] = e -2rg , where Pr [Homoz] is the probability that two chromosomes are homozygous at a recombination distance r from the selected site, given a common ancestor g generations before the present. Here, we used a linear regression to evaluate the value of g through the transformation formula -ln(EHH) = g*2r based on our EHH data. As in Figure 5, the parameter of generation g was 130.03 in CRS. Assuming a generation time of 25 years for humans, the age is thus about 3250 years (If human generation time is 30 years, the age is 3909 years).

Origin of the --SEA allele
A network of haplotypes carrying the --SEA and four SNPs adjacent to downstream of the breakpoint was constructed. There was only one haplotype carrying the --SEA and it arose from the ααAAGC haplotype by a single mutation ( Figure 6). The AAGC haplotype was common in our CRS population (15%) and prominent in East Asian populations in the HapMap project [18] but barely seen in the rest of the world, with a few exceptions such as CEU (only 2 in 300 chromosomes) and GIH (only 1 in 170 chromosomes).

Discussion
Presumably the most common form of α 0 -thalassemia, --SEA , is maintained at a high frequency through heterozygote advantage by malarial selection. In malarious areas, carriers of this mutation will tend to have higher viability, leading to increased gene frequency until it is balanced by its loss in homozygotes who die before reaching sexual maturity. This process can elevate levels of LD around the target locus relative to that expected under a neutral model. From genotyping 28 SNPs around the SEA deletion, our results are consistent with this hypothesis. In the CRS population, mimicking the frequency of the variant in Southern China, extended LD was observed. In Asians, as well as African-Americans and European-Americans, strong LD is hardly observed for SNPs more distant thañ 80 kb [6,19]; there is significantly extended LD from the --SEA to SNPs outside the deletion locus (~200 kb), and the range would be much larger than the average range of strong LD expected in an Asian population.  Although it is difficult to evaluate the effects of largescale fragment deletion on long-range LD accurately, this is unlikely to be the consequence of a lack of recombination in the deletion region on chromosomes. This is done by estimation of the minimum number of recombination events and an LRH test. First, the deletion region in the normal chromosome exhibits a low degree of recombination, which does not explain the difference between chromosomes carrying the --SEA and the wild-type allele. Moreover, it is believed that recombination events in the deletion region only impact the correlation between upstream and downstream, but interaction of markers within upstream or downstream are not affected. However, we can see gaps between the --SEA chromosomes and normal chromosomes in both regions. Another convincing argument is that the LRH test showed a significantly high REHH value, indicating selection favoring the --SEA variant.
This signature probably results from the malarious environment through estimating the age of the --SEA variant. Diseases that might be caused by parasites have been described in detail in early written records from 3000 to 300 BC [20]. By the beginning of the Christian era, malaria was widespread in Central, South, and Southeast Asia, including China [21]. Many variants conferring resistance to malaria appear to be recent polymorphisms from the last 5000-10000 years or less   [1]. These records are consistent with our results, that the --SEA variant is relatively young and arose within the last 3250 years. This estimate is crude and likely underestimates the true age of the alleles: First, this method assumes a star-like genealogy for the selected haplotypes; furthermore, the derived allele may be at an equilibrium stage for some time [22]. However, it is still believed that the SEA deletion is a very recent mutation, unlikely to have been present before the early migrations of populations into Asia. The haplotype ααAAGC from which the --SEA arose was specific in Asian population, probably generated after modern humans migrated into Asia. We added three SNPs (rs2252214, rs4374177 and rs2239739) downstream to the four SNPs to construct the haplotype network again and found that --SEA arose from the ααAAGCCAG haplotype which was only found in Asian populations (data not shown). Considering the geographical distribution in which the --SEA has been found in only Thai, Filipino, Vietnamese, and Chinese populations [23], our phylogenetic network indicates that the --SEA appears to come from an Asian-specific haplotypic background, is consistent with the localized distribution of this mutation, indicating a recent origin of the deletion. The mechanism of protection in α 0 -thalassemia, and α + -thalassemia, remains unclear at the cellular level because there is no evidence for a reduced rate of invasion or growth of P. falciparum in red cells of the genotypesα/αα, -α/-α, or --/αα. However, some reviews of the mechanisms of malaria resistance, based on in vitro work, have supplied some potential mechanisms [24,25]. The most relevant mechanism is an increased immune response against parasite-infected erythrocytes, which was supported by a case-control study carried out by Williams, et al [26]. Some other plausible explanation involves reduced rosetting, the amount of which correlates  strongly with the severity of malaria infection. Another case-control study demonstrated that complement receptor 1 (CR1) expression required for rosette formation, is reduced on α-thalassemia red cells [25,27]. Recently, Fowkes et al. suggested that the increased erythrocyte count and microcytosis in children homozygous for α + -thalassemia may confer an advantage during acute infection with the malaria parasite P. falciparum [28]. This may provide an underlying mechanism for the --SEA protection against malaria, considering that the phenotypic effect of α 0 -thalassemia heterozygotes appears to be the same as α + -thalassemia homozygotes. Moreover, many genetic variants against malaria are involved in hemoglobin-inherited disorders, erythrocyte polymorphisms, enzymopathies, and immunogenetic variants, which may be only a small proportion of the complex interaction between Plasmodium parasites and humans. By harvesting the fruit of high-throughput sequencing techniques and human genome analysis, we may identify some new loci with impacts on susceptibility to malaria by association 2r -lnEHH Figure 5 Evaluation of the ages of the --SEA by linear regression of -ln(EHH) and 2r. X-axis represents the 2*r, where r is the genetic distance between core and a given marker; Y-axis represents the -lnEHH of that marker. We could obtain vector of (r, EHH) for each marker and make a plot. Each diamond corresponds to a SNP. studies. Such work may make the mechanism underlying the --SEA and/or other α 0 -thalassemias clearer. The fact that several hemoglobinopathies variants coexist in southern China make the expansion of the --SEA more complex. These variants interact with each other. For example, the frequency of α-thalassaemias coexisting with β-thalassaemia could be higher than when it exits on its own because of the increased fitness of the α/β-thalassaemia heterozygote; α 0 -thalassemias and α + -thalassemias are in competition for the combination of these two alleles results in a more deleterious effects: HbH disease. According to the result of our previous epidemiological survey, the frequencies of the --SEA and α + deletion appear to be relatively similar [9,10]. Given the model proposed by Hedrick [22], --SEA can persist at the current frequency for a long time, but will always eventually be outcompeted by α + . The interactions between the --SEA and other variants mean that the spread of this allele is not a straightforward case of balancing selection. In Zhang's study on β CD41/42 , gene conversion was shown to play an important role in the spread of the 4 bp frame deletion [7]. Unfortunately, this process cannot be analogized to the --SEA because the lengths of maximally converted tracts in gene conversion rarely exceed 1 kb [29]. Obviously, in addition to the role of natural selection, a more comprehensive mechanism of the spread of the --SEA is deserving of further investigation.

Conclusions
Our data support the hypothesis that the --SEA allele has been subjected to recent balancing selection, triggered by malaria. It ensures the basis of functional experiments about mechanisms of protection in α 0 -thalassemia and α +thalassemia. We plan to explore the mechanism underlying the --SEA and/or other α 0 -thalassemias by some highthroughput sequencing techniques in our future work.

Sample
A total of 214 samples from Guangdong and Guangxi of southern China were collected in this study. They consisted of 76 --SEA /αα heterozygotes and 138 wild-type individuals. Genomic DNA was extracted by a standard phenol/chloroform method from leucocytes in peripheral blood. We recognized the limitation of our sampling strategy and thus built a constructed random sample (CRS) [30] on the basis of the population frequency of the --SEA carrier by randomly excluding 65 --SEA /αα samples to arbitrarily maintain a total frequency of 7.4% (i.e. frequency of --SEA allele was 3.7%) in the studied population. For those analyses in which random sample was preferred, such as LD analysis LRH test and network construction, we used CRS dataset rather than the whole dataset.All participants were recruited after providing written informed consent. The institutional review board of Southern Medical University approved this study.

Gap-PCR to confirm heterozygotes
The Gap-PCR method was performed to discriminate the heterozygotes from normal individuals as reported by our laboratory [31]. In this assay, two primers were designed: one pair bridging the breakpoints producing a 740 bp fragment to identify the chromosomes with the --SEA determinant, while the other was located in the deletion region to detect the normal chromosomes through a 1052 bp fragment.

SNP selection
Because the α-globin gene cluster is located on chromosome 16p13.3, close to the telomere, we scanned a fragment from the beginning to 470 kb, about 300 kb behind the downstream deletion breakpoint, to screen appropriate SNPs. Data from HapMap release 27 were used. SNPs were selected based on their minor allele frequency (MAF ≥ 0.1) in the CHB population and haplotype-tagging SNPs were assessed using HAPLOVIEW 4.2 [32] under the parameter r 2 ≥ 0.8. To reduce the genotyping cost, just one or two SNPs were picked out from each block or boundary. Additionally, three extra SNPs (rs77308790, rs2858925, rs3760053) in the proximal region upstream of the deletion breakpoint, from a study [unpublished] in our laboratory, were also included. As a result, a total of 28 SNPs encompassing the SEA deletion and covering ã 410 kb region were determined ( Figure 1, Table 1).

High resolution melting (HRM) analysis
A high-throughput technique termed high resolution melting of small amplicons (~40-90 bp) was used to genotype these markers after sequencing about 10 samples for each SNP. The common PCR system was 10 μl of 1 × PCR buffer ( Table S1.

Hardy -Weinberg equilibrium tests
In the CRS dataset, we performed a Hardy-Weinberg equilibrium test. None of the markers, including the --SEA allele, deviated from HWE at a significance level of 0.05 except rs216091 (P = 0.05). A double check, sequencing 22 individuals for rs216091, ruled out the possibility of genotyping errors. Additionally, the genotype frequencies of rs216091 were similar to that of CHS population, which is believed to represent the Southern Chinese population, in the SGVP (Singapore Genome Variation Project) [34].

Data analysis
Haplotypes were inferred by a Bayesian statistical method with the PHASE 2.1 software [35] using the default parameter set with 1,000 iterations. This program assigns a pair of haplotypes for each diploid entry and provides probability scores for each uncertain position. Reconstructed haplotypes were inserted into Haploview v. 4.2 to show LD pattern and |D'| value which is a measurement of the deviation of the observed frequency of a haplotype from the expected. To evaluate whether the long-range LD around the focal allele was due to large-scale deletion, we compared the minimum number of recombination events, namely RM, between haplotypes with the --SEA allele and wild-type in our samples using a bootstrap approach. The distribution of difference of RM was obtained in three steps: (1) 428 chromosomes were randomly divided into two groups. One contained 76 and the other contained 352 chromosomes.
(2) The difference of RM between the two groups was calculated.
(3) Steps 1-2 were repeated 5000 times from which we obtained the distribution of the statistics in which we were interested. Relative extended haplotype homozygosity (REHH) values were calculated using the Sweep 1.1 software (http://www.broadinstitute.org/mpg/ sweep/). REHH values of the core haplotypes with frequencies similar (3.1% < MAF < 4.2%) to the --SEA on chromosome 16 were obtained for CHS and calculated as empirical data. The inferred haplotype network was constructed by a median-joining method with NET-WORK 4.6.1.0 [36] using the default settings.