Y-chromosome evidence suggests a common paternal heritage of Austro-Asiatic populations

Background The Austro-Asiatic linguistic family, which is considered to be the oldest of all the families in India, has a substantial presence in Southeast Asia. However, the possibility of any genetic link among the linguistic sub-families of the Indian Austro-Asiatics on the one hand, and between the Indian and the Southeast Asian Austro-Asiatics on the other, has not been explored so far. Therefore, to trace the origin and historic expansion of the Austro-Asiatic groups of India, we analysed Y-chromosome SNP and STR data of 1222 individuals from 25 Indian populations, covering all three branches of Austro-Asiatic tribes, viz. Mundari, Khasi-Khmuic and Mon-Khmer, along with previously published data on 214 relevant populations from Asia and Oceania. Results Our results suggest a strong paternal genetic link, not only among the subgroups of Indian Austro-Asiatic populations but also with those of Southeast Asia. However, a maternal link based on mtDNA is not evident. The results also indicate that the haplogroup O-M95 originated in the Indian Austro-Asiatic populations ~65,000 yrs BP (95% C.I. 25,442 – 132,230) and that their ancestors carried it further to Southeast Asia via the Northeast Indian corridor. Subsequently, in the process of expansion, the Mon-Khmer populations from Southeast Asia seem to have migrated to and colonized the Andaman and Nicobar Islands at a much later point of time. Conclusion Our findings are consistent with the linguistic evidence, which suggests that the linguistic ancestors of the Austro-Asiatic populations originated in India and then migrated to Southeast Asia.


Background
The Indian subcontinent is presently inhabited by four major linguistic groups, viz. Austro-Asiatic, Dravidian, Indo-European and Tibeto-Burman, which might have entered at different points of time. Based on the observation that the Austro-Asiatic family has the greatest divergence in its nouns [1] and on some other linguistic features (for details, see Discussion), it is considered to be the oldest of the four linguistic families [1,2]. It consists of three sub-families [3]: (1) Mundari, spoken by a number of tribes inhabiting the Chota-Nagpur plateau in Central and Eastern India, (2) Mon-Khmer, spoken by the Nicobarese and Shompen tribes from the Andaman and Nicobar islands and (3) Khasi-Khmuic (which linguists earlier considered part of Mon-Khmer), represented only by the Khasi subtribes from Northeast India (Fig. 1). The Indian Khasi-Khmuic, to a certain extent, and the Mon-Khmer groups have physical features of East Asian populations [4], whereas the Mundari populations have features similar to those of the Dravidian linguistic family. Further, except for the Mundari sub-family, which is restricted to the Indian subcontinent, the languages of the other two sub-families of Austro-Asiatics are spoken by a large number of populations in Southeast Asia (Fig. 1). However, neither the possibility of a genetic link among the three linguistic branches of Indian Austro-Asiatics, nor that between the Indian and Southeast Asian Austro-Asiatics, has been comprehensively explored so far, despite the fact that the Indian subcontinent is considered to have probably served as an important corridor for migrations to Southeast Asia.
Two routes of migration by which Austro-Asiatic groups possibly entered the Indian subcontinent have been suggested based on linguistic, archaeological and classical genetic marker evidence ([4] and the references therein): the first is migration from Africa to India via Central Asia, while the second is from Africa to Northeast Asia and then to the Indian subcontinent. Basu et al. [5] found a high frequency of haplogroup K-M9 among the Mundari populations and inferred that the Austro-Asiatic populations migrated from Africa to India via Central Asia; this inference is flawed, since this haplogroup is found ubiquitously in Asia and has a substantial presence in the whole of East Asia. On the other hand, from the analysis of mtDNA 9bp (9-base-pair) del/ins (deletion/insertion) polymorphisms, Thangaraj et al. [6] and Prasad et al. [7] reported only East Asian-specific mtDNA haplogroups in the Nicobarese, while Roychoudhury et al. [8] and Metspalu et al. [9] found only Indian-specific mtDNA haplogroups in Mundari populations. The above inferences were, however, based on meager genetic evidence, and very few Austro-Asiatic populations (a maximum of 3) were included in those studies. Although Kumar et al. [10] analysed a large number of Austro-Asiatic populations and suggested distinct origins and migration histories of the Mundari, Khasi-Khmuic and Mon-Khmer populations, the analysis was based only on the mtDNA 9bp del/ins polymorphism and its characterization.
We sampled almost all the Austro-Asiatic populations of India, covering the entire geographic and micro-linguistic heterogeneity inherent among them (Table 1 and Fig. S1 [see Additional file 1]). This includes molecular genetic data on the Austro-Asiatic Khasi from Northeast India, which is considered an important corridor for human migrations to Southeast Asia. We present results based on the analysis of Y-chromosome SNP and STR data of Austro-Asiatic tribes, along with previously published data on 214 other relevant populations, and try to trace the origin and historic expansion of the Austro-Asiatic groups of India. Based on this evidence, we propose that haplogroup O-M95 originated in the Indian Austro-Asiatics, particularly among the Mundaris, whose ancestors moved further to Southeast Asia along with this haplogroup.

Distribution and frequency of Y-chromosome haplogroups
The population-wise distribution of Y-haplogroup frequency and diversity, along with haplotype diversity based on 16 Y-STRs, is furnished in Table 2. Overall, the haplotype diversity is high (98.87%) and ranges from 95.26% in Pando to 100% in Khasi, Garo, Paharia, Nagesia and Birijia. Out of the 13 potential haplogroups defined by the binary markers typed in the present study (Fig. 2), nine were found among these populations. The average frequency of haplogroup O-M95 is the highest (52%), followed by H-M69 (26%). Among the three sub-families of Austro-Asiatics, on average, 55% of Mundari, 41% of Khasi-Khmuic from Northeast India and all 11 Nicobarese samples belong to O-M95. To find out whether the unclassified O-M95 samples have sub-lineages, we also typed the downstream M88 binary marker, but none showed the presence of the O-M88 haplogroup. Except for the Khasi (29%) and 1 sample of Korku (2%), none of the Indian Austro-Asiatic populations shows the presence of haplogroup O-M122. Further, in the Garo tribe haplogroup O-M122 is the most common (55%), followed by O-M95 (18%). Since the Austro-Asiatic Khasi and the Tibeto-Burman Garo live in close proximity in Meghalaya and are known to have frequent marital interactions [11,12], we further typed all the samples of haplogroup O-M122 from the Garo and Khasi populations to test whether O-M122 among the Khasi is due to admixture with the Garo. We found only 3 out of the 8 haplogroups defined by the binary markers used in this study (Fig. 3).
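As an illustration of how a haplotype diversity value such as those above can be obtained, the following is a minimal sketch assuming the standard unbiased gene diversity estimator of Nei, h = n/(n-1) * (1 - sum of squared haplotype frequencies). The function name and the short three-locus haplotypes are hypothetical and are not data from this study. Under this estimator a sample in which every chromosome carries a different haplotype gives a diversity of exactly 1, i.e. 100%.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum(p_i^2)).

    `haplotypes` is a list of hashable haplotypes (here tuples of
    Y-STR repeat counts), one entry per sampled chromosome.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical three-locus haplotypes, NOT data from the study.
all_unique = [(14, 12, 23), (15, 12, 23), (14, 13, 24), (16, 12, 22)]
with_repeats = [(14, 12, 23), (14, 12, 23), (15, 12, 23), (14, 13, 24)]

print(f"all unique:   {haplotype_diversity(all_unique):.4f}")
print(f"with repeats: {haplotype_diversity(with_repeats):.4f}")
```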

Analysis of Molecular Variance and Y-STR Networks
The Analysis of Molecular Variance (AMOVA) based on Y-SNPs (Table 3) suggests that Indian and Southeast Asian Austro-Asiatic populations are well differentiated (FST = 0.203). The FST value is larger by 3% between Mundari and Southeast Asian Austro-Asiatics, indicating a relatively higher level of differentiation between them. However, the FST value turns out to be much smaller (0.045) between Khasi and Southeast Asian Austro-Asiatic populations, suggesting greater genetic affinity between them when compared with the Khasi affinity to Mundari populations (FST = 0.099). The Nicobarese were not included in this analysis as all their samples fall in O-M95. AMOVA based on Y-STRs suggests a high FST value (0.175) between Mundari and Khasi, but one that is much smaller than that between Khasi and Nicobarese (0.289) or between Mundari and Nicobarese (0.442).
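To give a feel for what an FST value measures, the sketch below computes a simple frequency-based fixation index, FST = (HT - HS) / HT, for two populations given equal weight. This is a simplified illustration and not the AMOVA procedure implemented in ARLEQUIN, which also accounts for sample sizes and, for STRs, molecular distances. The haplogroup labels and frequencies are hypothetical.

```python
def gene_diversity(freqs):
    """Expected diversity 1 - sum(p_i^2) for a list of frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def fst_from_frequencies(pop_a, pop_b):
    """Frequency-based FST = (H_T - H_S) / H_T for two equally weighted populations.

    pop_a and pop_b map haplogroup label -> frequency (each summing to 1).
    """
    haplogroups = sorted(set(pop_a) | set(pop_b))
    fa = [pop_a.get(h, 0.0) for h in haplogroups]
    fb = [pop_b.get(h, 0.0) for h in haplogroups]
    # Mean within-population diversity.
    h_s = (gene_diversity(fa) + gene_diversity(fb)) / 2.0
    # Diversity of the pooled (averaged) frequencies.
    pooled = [(x + y) / 2.0 for x, y in zip(fa, fb)]
    h_t = gene_diversity(pooled)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Hypothetical haplogroup frequencies, NOT values from Table 2 or Table 3.
pop_1 = {"hgA": 0.6, "hgB": 0.4}
pop_2 = {"hgA": 0.2, "hgC": 0.8}

print(f"FST(pop_1, pop_2) = {fst_from_frequencies(pop_1, pop_2):.3f}")
print(f"FST(pop_1, pop_1) = {fst_from_frequencies(pop_1, pop_1):.3f}")
```

Identical frequency profiles give a value of 0, and the value grows as the two populations share fewer haplogroups, which is the sense in which the smaller Khasi versus Southeast Asian value above is read as greater affinity.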

Time to the most recent common ancestor (TMRCA)
Since O-M95 was found to be the most common haplogroup, we estimated the TMRCA for this lineage with the BATWING program [13], using 16 Y-STRs. We chose the demographic model of exponential growth from an initially constant-sized population with sub-division. Although two other models are available, the model chosen for this study is considered to be the most appropriate for human populations [14][15][16][17]; it assumes that population sizes were constant until the populations became semi-sedentary or sedentary, probably until the advent of agriculture, which was followed by exponential growth (for details of the computational procedure, see Methods). The median estimate of TMRCA (Table 4) with 10^6 MCMC cycles for the whole of the Austro-Asiatic groups turns out to be ~68,000 YBP (95% C.I. 25,442 - 132,230). The TMRCA for the Mundari (~66,000) and Khasi (~57,000) is similar, whereas for the Nicobarese it is considerably lower (~17,000). Further, the average of the TMRCA estimated for individual Mundari and transitional populations turns out to be large (~48,000), ranging from ~70,000 YBP to ~30,000 YBP (results not shown), suggesting that the haplogroup O-M95 might have originated early, possibly in the Paleolithic period.

Figure 1
Map showing present-day distribution of Austro-Asiatic groups (modified from van Driem [2]) and the schematic representation of the routes of migration of the different Austro-Asiatic linguistic subgroups of India.

Haplogroup isofrequency maps
Isofrequency maps were generated for all the haplogroups, but only the two relevant maps, pertaining to O-M95 and O-M122, are presented in Figure 5. The data of our study, along with the comparative data on 214 other relevant populations, suggest that the haplogroup O-M95 is ubiquitously found in Southeast Asia, while in India it is restricted to the regions where Austro-Asiatic populations are found. This strongly suggests that the Austro-Asiatic populations of India are not only linguistically linked to Southeast Asian populations but also genetically associated with them. The present-day distribution of Austro-Asiatic linguistic groups and the distribution of haplogroup O-M95 appear to be highly correlated (Table 5 and Fig. 5). For example, its average frequency is only 3.4% and 0.1% (Table 5), respectively, in Northeast and Central Asia, where no Austro-Asiatic population is found, whereas it is much higher in Southeast Asian Austro-Asiatics (38%) as well as in the neighboring non-Austro-Asiatics (14.7%). Further, this frequency is significantly higher in Austro-Asiatics than in non-Austro-Asiatics (χ2 = 22.77; P < 0.001). There is also a decreasing gradient of O-M95 frequency as we move from India to Southeast Asia, although this trend is less apparent in the map because 7 of the 45 groups from Southeast Asia show an O-M95 frequency in the range of 50% to 75%. However, for six of those 7 populations the sample sizes are less than 20, some being very small. In any case, the average frequency of O-M95 in Indian Austro-Asiatic populations is much higher (54%) when compared ( 1]) appears to be confined to the boundaries of the Indian subcontinent and, therefore, very strongly suggests its origin in the Indian subcontinent.

A common genetic heritage of the Austro-Asiatic groups
Mundari populations show O-M95 as the most common haplogroup, and only three of the 22 populations (Lodha, Savara and Mahali) depart from this general trend (Table 2), which appears to be because of their disputed origin [31]. This haplogroup is also found at a relatively high frequency in the Khasi and Nicobarese. It is therefore not surprising that in a recent study [32] all 12 samples of the Shompen from the Nicobar islands, like their linguistic neighbors from the region, the Nicobarese, showed O-M95. This may underscore that the Mundari, Khasi-Khmuic and Mon-Khmer groups of India are not only linguistically related but also genetically linked, probably with a single but relatively broad paternal genetic source. This haplogroup has been reported to be absent or present at low frequency in other linguistic groups of India [20,[25][26][27][28][29], suggesting a distinct genetic identity of the Indian Austro-Asiatic populations. On the other hand, while the Austro-Asiatic populations of Southeast Asia show a high frequency of O-M95 (average 38%), their neighboring populations also show a considerable frequency (14.7%). However, this haplogroup has a negligible presence in North and Central Asia (Fig. 5 and Table 5). Thus the predominance of this haplogroup in the Austro-Asiatic populations of both India and Southeast Asia, and its absence or negligible presence in other Asian populations, suggests a common genetic heritage of the people of this linguistic family.

Figure 2
Rooted maximum-parsimony tree of haplogroups defined by binary markers along with their frequency in different groups.
The virtual absence of O-M95 in the Tibeto-Burman populations of India [20,28,29] suggests that the migrations of these populations into India were not accompanied by the O-M95 haplogroup. Therefore, the presence of this haplogroup in the Garo tribe of Meghalaya is due to a high degree of gene flow from the neighboring Khasi, which has been facilitated by the matrilocal system of marriage among these two tribes [11,12] [28] including Garo from Northeast India may imply that the O-M122 in Khasi probably had its source in the neighboring Tibeto-Burman groups, particularly the Garo. Although the foregoing analysis suggests that the Austro-Asiatic populations of India share common genetic ties, a comparative analysis among the subfamilies suggests that these populations separated quite early and are now well differentiated, as indicated by the results of AMOVA (Table 3), the M-J network (Fig. 4) and the TMRCA (Table 4).

Origin of haplogroup O-M95 and expansion of Austro-Asiatic populations
Given the overwhelmingly high frequency of O-M95 in the Austro-Asiatic populations it is most likely that this haplogroup has originated among them. However, the question is whether it has its origin in India or Southeast Asia? The most likely region of origin of a haplogroup can be identified on the basis of two characteristics -the high-

used relatively higher mutation rates and only 7 of the 16 loci, we reanalyzed our data based on those 7 loci and the mutation rate used by Kayser et al. [16] and observed a similar TMRCA (~65,000 YBP), suggesting that the TMRCA of the present study may not be an artifact of the large number of loci and the low mutation rate. Furthermore, the Mundari populations are considered to be traditionally hunters and food-gatherers, and at present they inhabit areas unfit for cultivation, which may reflect their traditional mode of subsistence. Therefore, migration of the Mundari populations during the demic expansion of agriculturalists in the Neolithic era, as has been suggested for the Nicobarese [6], appears improbable. Based on this evidence, we suggest that the ancestors of present-day Mundari populations migrated to Southeast Asia instead of coming from Southeast Asia. This scenario is also consistent with the inference that the Mundari language is grammatically and phonologically the most conservative branch of the Austro-Asiatic family [2,38] and more similar to proto-Austro-Asiatic than the other branches of this family, suggesting that the linguistic ancestors of the Mundari populations originated in India [39]. The foregoing analysis therefore suggests an in-situ origin of the O-M95 haplogroup, most probably in the ancestors of present-day Mundari populations, who might have carried it further to Southeast Asia.
The results of AMOVA (Table 3), the M-J network (Fig. 4) and the TMRCA of haplogroup O-M95 (Table 4) suggest an early separation of the Mundari and the other Austro-Asiatic populations. Given this early separation, we expected that a sublineage of O-M95 might have originated in at least one of these groups. However, none of the groups showed the sublineage O-M88 (Fig. 2). So far this lineage has been reported in only 1 sample, from the region of Cambodia and Laos [40], suggesting that it is present at a very low frequency and probably originated in and is confined to that region. Therefore, if any sublineage exists, it is probably identified by some other binary marker(s) yet to be discovered. Since the Khasi shows relatively high frequency of O-M122 (29%) and given that populations of Khasi-Khmuic sub-family are concentrated in the regions North of Burma and Thailand (Fig. 1) [8,41]. Therefore, the presence of East Asian mtDNA among the Khasi could be due to gene flow from the neighboring Garo and the other Tibeto-Burman populations, which have virtually only East Asian mtDNA haplogroups. This may reinforce the suggestion that the Mundari and Khasi-Khmuic populations separated long ago and that the latter probably went to Southeast Asia via the northeast Indian corridor, as reflected in their geographic distribution (Fig. 1).
The Nicobarese are also quite distinct from both the Mundari and Khasi-Khmuic tribes, as revealed by the AMOVA (Table 3) and the M-J network (Fig. 4) based on Y-STRs. This tribe has only East Asian female lineages [6,7] and only O-M95 male lineages (Fig. 2 and Table 2). We also performed AMOVA based on the same set of 16 Y-STRs for the Shompen tribe [32], which is also a Mon-Khmer group. The results suggest that the Shompen, like the Nicobarese, are also quite distinct from the Mundari (FST = 0.402) and the Khasi (FST = 0.476). The TMRCA of the Nicobarese (~17,000 YBP) and the Shompen (~19,000 YBP), and the distribution of Mon-Khmer populations (Fig. 1), which is confined to the lower part of Burma and Thailand, Vietnam and Cambodia, suggest that they migrated from Southeast Asia to India during the demic expansion of agriculturalists in the Neolithic era [42]. The complete absence of O-M122 among them appears to be due to the profound impact of founder effect and subsequent genetic drift, although ascertainment bias due to small sample size cannot be ruled out.

Two possible routes of entry of Austro-Asiatics into the Indian Subcontinent
Kumar and Reddy [4] suggested the possibility of ancestors of Indian Austro-Asiatic tribes having migrated from Africa to India via either Northeast Asia through the

Figure 5
Isofrequency maps of haplogroups O-M95 and O-M122.
motifs in the Southeast Asian Austro-Asiatics. The fossils of anatomically modern humans excavated from East Asia are estimated to be no older than 40,000 YBP [18,47,48], which may imply that the earliest possible migration of Austro-Asiatic populations to Southeast Asia was about 40,000 YBP or later. Therefore, the Mundari populations appear to be one of the earliest source populations, from which the Khasi-Khmuic and Mon-Khmer populations separated quite early and then migrated to and settled in Southeast Asia. Another, much later wave of migration by the Mon-Khmer people, from Southeast Asia through Thailand and coastal southern Burma to the Andaman and Nicobar Islands, can be inferred from the current spread of Mon-Khmer populations (Fig. 1).

Conclusion
To sum up, we conclude that, because of its very high frequency and diversity, haplogroup O-M95 had an in-situ origin among the Indian Austro-Asiatics, particularly among the Mundaris, and not in Southeast Asia as envisaged earlier. Given the large estimate of TMRCA, our study suggests that the Mundari populations are among the earliest settlers of the Indian Subcontinent. It is most likely that these populations came from Central Asia through the Western Indian corridor and subsequently colonized Southeast Asia, although more data on Y-chromosome and mtDNA are needed from other relevant populations to draw firmer conclusions.

Sampling
Intravenous blood samples were collected from a total of 1222 healthy unrelated individuals belonging to 25 tribal populations, of which 17 are Austro-Asiatic groups and 7 are transitional groups that are considered to have certain geographical and historical affinities to the former and are presumed to have formerly spoken Austro-Asiatic languages. We collected blood samples from different Mundari, Khasi-Khmuic and Mon-Khmer speaking tribal groups in such a way that the entire gamut of genetic heterogeneity of the Austro-Asiatic tribal populations in India is comprehensively reflected; we collected samples from different dialectal categories of the Austro-Asiatic tribal populations and also from different geographic units of the same tribe, as some of them are distributed over a wide territory. We have also included the Tibeto-Burman speaking Garo tribe of Meghalaya, since it inhabits geographic locations contiguous and/or overlapping with those of the Khasi, with which it is known to have marital interaction. The area of sampling within each state is shown in Figure S1 [see Additional file 1], while the names of the populations, along with the sample sizes and district-wise details of sampling, are furnished in Table 1.
Informed consent was obtained from all the donors before collecting their blood.

Genetic Analyses
DNA was extracted from these samples using the protocol of Sambrook [49]. The following 20 Y-chromosome Single Nucleotide Polymorphisms (SNPs), which are known to detect variation in Asia, were screened using direct sequencing (Fig. 3 and 4): M89, M69, M172, M9, M11, M175, M95, M88, M122, M119, M45, M173, M124, M134, M159, M164, M7, M121, M133 and M162. The details of these markers are described elsewhere [44]. Many of the samples were typed with all the binary markers as an internal check on the reliability of the typing and also to detect recurrent mutations. We used the nomenclature suggested and followed by the Y Chromosome Consortium [50].
Although the Y-chromosome is highly variable, the low rate of parallel and back mutation of the binary markers on its non-recombining part makes them particularly useful for reconstructing and identifying stable paternal lineages that can be traced back in time over thousands of years [18,50]. Further, its smaller effective population size makes the Y-chromosome probably the best genetic tool for studying early human migrations, as the bottleneck events that are often associated with such migrations become more pronounced [18,50]. Although natural selection is a potentially important force that may affect the entire Y-chromosome and increase the frequency of a lineage more rapidly than would be expected by drift, the empirical evidence so far is not conclusive [46].

Statistical Analyses
Since the DYS389II allele length also contains DYS389I, for all statistical analyses the DYS389I allele length was subtracted from that of DYS389II to avoid double-counting the variation at DYS389I. The subtracted DYS389II allele is named DYS389b. Since the DYS385a and DYS385b alleles, and the YCAIIa and YCAIIb alleles, could not be assigned to their respective loci, these loci were omitted and further analyses were based on the remaining 16 Y-STRs. The Y-SNP and modified Y-STR data were then analyzed for haplogroup and haplotype diversity, respectively, along with their associated SE, by means of the software package ARLEQUIN 3.01 [52]. Analysis of Molecular Variance (AMOVA) was also carried out for both the Y-SNP and the Y-STR data using ARLEQUIN 3.01. Y-STR haplotypes were grouped according to haplogroups, and Median Joining (M-J) networks were constructed using the program NETWORK 3.0 [53]. A weighting scheme was followed on the basis of the molecular variance of each microsatellite in all the chromosomes, with the weight inversely proportional to the variance. The weights assigned ranged from 2 to 8.
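The DYS389 correction and the omission of the two duplicated loci described above can be sketched as follows. This is a minimal illustration, assuming each raw haplotype is stored as a mapping from locus name to repeat count; the function name and the example allele values are hypothetical and are not taken from the study data.

```python
# Loci whose two copies could not be assigned to a specific locus.
EXCLUDED = {"DYS385a", "DYS385b", "YCAIIa", "YCAIIb"}


def prepare_haplotype(raw):
    """Return a cleaned haplotype for statistical analysis.

    - DYS389II is replaced by DYS389b = DYS389II - DYS389I, so that the
      variation at DYS389I is not counted twice.
    - The DYS385 and YCAII loci are dropped.
    """
    cleaned = {
        locus: allele
        for locus, allele in raw.items()
        if locus not in EXCLUDED and locus != "DYS389II"
    }
    cleaned["DYS389b"] = raw["DYS389II"] - raw["DYS389I"]
    return cleaned


# Hypothetical raw haplotype (only a few loci shown), NOT study data.
raw_example = {
    "DYS19": 15,
    "DYS389I": 13,
    "DYS389II": 29,
    "DYS385a": 11,
    "DYS385b": 14,
    "YCAIIa": 19,
    "YCAIIb": 21,
    "DYS390": 24,
}

print(prepare_haplotype(raw_example))
```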
The TMRCA of a haplogroup places an important limit on its spread: the haplogroup must have spread after this time, and the population carrying the lineage must have arrived in the region prior to the origin of the haplogroup. Therefore, we estimated the TMRCA using Bayesian Analysis of Tree With Internal Node Generation (BATWING) as implemented in Wilson et al. [13]. This program uses a Markov chain Monte Carlo (MCMC) procedure to generate phylogenetic trees and associated parameter values consistent with the input data (a set of Y haplotypes) and with the genetic and demographic models. The genetic model assumes single-step mutations of the STRs, and the demographic model chosen was exponential growth from an initially constant-sized population with sub-division. Based on the evolutionary mutation rate of Zhivotovsky et al. [54], we applied a gamma distribution of (1.47, 2130) as the mutation rate prior for all 16 Y-STRs. For the alpha, beta and N priors, gamma distributions of (2, 400), (2, 1) and (1, 0.001), respectively, were used [15,17]. We used a generation time of 25 years. In all runs, 13,000 samples of the output were taken and the first 3,000 were discarded as burn-in. Thus, all the results are based on 10,000 samples. The number of MCMC iterations between samples varied between runs from 10^2 to 10^5, so the overall runs ranged from 10^6 to 10^9 MCMC cycles. Our initial analyses were performed with 10^6 MCMC cycles using all the O-M95 Y-chromosomes of all the populations, but the median value of Nposterior (the effective population size before the population began to expand) was ~13,000, which is quite large given that the global value is ~5000 [15,17]. We increased the number of MCMC cycles and found that the Nposterior value decreases and the expansion time increases with the number of cycles, and that neither stabilizes even at 10^9 cycles.
Therefore, we chose 3-5 Y-chromosomes of haplogroup O-M95 from each population at random to produce a sample subset, and observed convergence of the values of Nposterior and expansion time at 10^6 and 10^7 MCMC cycles. These parameters suggest a value of ~1000 for the effective population size for this region, which is consistent with other studies on East Asia [17,18].
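The arithmetic behind the prior and run settings above can be checked with a short sketch. It assumes that the two numbers of the gamma prior are shape and rate, so that the prior mean is shape/rate; under that assumption the mean is about 6.9 x 10^-4 per locus per generation. Whether the burn-in samples are included in the quoted totals of MCMC cycles is not stated, so the totals below are only order-of-magnitude checks. The function names and the example of 2,720 generations are hypothetical.

```python
import math

GENERATION_YEARS = 25      # generation time used in the study
SAMPLES_TAKEN = 13_000     # output samples per run
BURN_IN = 3_000            # samples discarded as burn-in


def gamma_mean_sd(shape, rate):
    """Mean and standard deviation of a gamma(shape, rate) distribution."""
    return shape / rate, math.sqrt(shape) / rate


def total_cycles(iterations_between_samples):
    """Total MCMC iterations if every sample is separated by the given thinning."""
    return SAMPLES_TAKEN * iterations_between_samples


def generations_to_years(generations):
    """Convert a time in generations to years with the 25-year generation time."""
    return generations * GENERATION_YEARS


# Assumption: (1.47, 2130) is (shape, rate) for the mutation rate prior.
mean_mu, sd_mu = gamma_mean_sd(1.47, 2130)
retained = SAMPLES_TAKEN - BURN_IN

print(f"prior mean mutation rate: {mean_mu:.2e} per locus per generation")
print(f"prior sd of mutation rate: {sd_mu:.2e}")
print(f"samples retained after burn-in: {retained}")
print(f"cycles with thinning 10^2: {total_cycles(10**2):.1e}")
print(f"cycles with thinning 10^5: {total_cycles(10**5):.1e}")
# Hypothetical example: 2,720 generations correspond to 68,000 years.
print(f"2720 generations = {generations_to_years(2720)} years")
```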
Haplogroup frequency data on 214 populations (sources given in the legend to Fig. 7) from the whole of Asia, including the Indian subcontinent, Oceania and Australia, gathered from published sources, were used along with our data to generate isofrequency maps using the ArcView program of the GIS software. The data points used are shown as black dots. As the Nicobarese is the only population sampled from its region, it had an overwhelming influence on the contours and was hence excluded from the haplogroup isofrequency map calculation.