Timing and deciphering mitochondrial DNA macro-haplogroup R0 variability in Central Europe and Middle East
© Brandstätter et al. 2008
Received: 14 February 2008
Accepted: 04 July 2008
Published: 04 July 2008
Skip to main content
© Brandstätter et al. 2008
Received: 14 February 2008
Accepted: 04 July 2008
Published: 04 July 2008
Nearly half of the West Eurasian assemblage of human mitochondrial DNA (mtDNA) is fractioned into numerous sub-lineages of the predominant haplogroup (hg) R0. Several hypotheses have been proposed on the origin and the expansion times of some R0 sub-lineages, which were partially inconsistent with each other. Here we describe the phylogenetic structure and genetic variety of hg R0 in five European populations and one population from the Middle East.
Our analysis of 1,350 mtDNA haplotypes belonging to R0, including entire control region sequences and 45 single nucleotide polymorphisms from the coding region, revealed significant differences in the distribution of different sub-hgs even between geographically closely located regions. Estimates of coalescence times that were derived using diverse algorithmic approaches consistently affirmed that the major expansions of the different R0 hgs occurred in the terminal Pleistocene and early Holocene.
Given an estimated coalescence time of the distinct lineages of 10 – 18 kya, the differences in the distributions could hint to either limited maternal gene flow after the Last Glacial Maximum due to the alpine nature of the regions involved or to a stochastic loss of diversity due to environmental events and/or disease episodes occurred at different times and in distinctive regions. Our comparison of two different ways of obtaining the timing of the most recent common ancestor confirms that the time of a sudden expansion can be adequately recovered from control region data with valid confidence intervals. For reliable estimates, both procedures should be applied in order to cross-check the results for validity and soundness.
According to the newest interpretation of C14 calibration data, and according to previous studies on human evolution, Europe was populated around 41–46 thousand years ago (kya) [1–3]. The main features of the post-glacial colonization of Europe was reliably reconstructed using parts of the human mitochondrial genome (mainly the hypervariable segment I; HVS-I [2–5]) or the entire mtDNA molecule [6–8].
In Europe, with the exception of U5 and V, which most likely arose in situ, all mtDNA hgs (H, I, J, K, T, U2e, U3, U4, X, and W) are most likely of Middle Eastern origin and were introduced by either the protocolonization 41–46 kya, by later arrivals in the late Paleolithic or more recent contacts [2, 9, 10].
Nearly half of the West Eurasian pool of human mtDNA lineages is composed of subclades of the predominant West Eurasian hg R0 [6, 11–13]. R0, formerly known as pre-HV , is defined by substitutions at nucleotide positions (nps) 73 and 11719 relative to R [15, 16]. Its frequency as a whole declines towards the East and South, but in the Near East, the Caucasus and Central Asia its frequency is still as high as 10–30% [6, 13, 17]. Until now, more than 20 sub-lineages of hg H, the predominant subclade of R0, which accounts for roughly 40% of West Eurasian mtDNAs, have been described [6, 12, 13, 17] and the variance of their regional distributions has been discussed [4, 6, 13, 17, 18].
Previous studies have proposed that hg H originated in the Middle East ~30 – 25 kya, expanded into Europe in association with a second Paleolithic wave (25 – 20 kya) and was strongly involved in late-glacial expansions from ice-age refugia after the LGM (20 kya) [2, 9]. For a few sub-hg of hg H, coalescence ages were determined using either entire mtDNA genomes  or parts of the mtDNA control region . Hg H1, H3 and V share an estimated common origin in the terminal Pleistocene (16 – 11.5 kya), with major expansion in the early Holocene (~10 kya) . Recent estimates on expansion times of selected H sub-hgs  are in conflict with the appraisals derived by , thus leaving the question on the reliability on the applied methods unanswered.
The objective of this study was to provide new information concerning the phylogenetic characteristics of macro-hg R0, as well as to determine spatial distribution patterns and coalescence ages of all its major sub-hgs.
In a total of 3,569 samples from five European populations residing in Central and South-East Europe (Austria, Germany, Hungary, Macedonia and Romania) and one Middle East population (Dubai), we found 1,408 samples (~39%) to belong to hg R0 based on either control region or coding region analysis. Of these, 1,350 samples contained enough DNA to obtain a dataset of full haplotypes consisting of entire control region sequences and 45 SNPs from the coding region. These haplotypes are listed in the Additional file 1.
Frequency estimates of R0 sub-hgs found in six European populations (Austria, Germany, Hungary, Macedonia and Romania) and one Middle East population (Dubai).
Design and results of AMOVA (Analysis of Molecular Variance)
Source of variation
Sum of squares
Percent of variation
Population pairwise FSTs
The R0a sub-lineage is defined by the motif C64T T2442C T3847C C13188T T16126C T16362C . This hg shows a particular instability in the neighborhood of position 64. All of our R0a samples showed C64T (which is part of the basal motif), half of them showed a T-insertion at position 60 (indicated in  as 57+T). This constitutes a sub-clade of R0a, R0a2; in  this sub-clade is named as (preHV)1b). All but one of the 20 R0a samples carried an additional transition T58C; which could correspond with (preHV)1a1 in ; this seems to constitute part of the motif defining a sub-branch of hg (preHV)1a (identified by transition 827 on top of the basal motif); here we rename (preHV)1a and (preHV)1a1 as R0a1 and R0a1a for consistency. Transition T58C appears also within R0a2 . Conclusively, we found four different haplotypes between nps 58 and 64 within R0a: T58C-60.1T-C64T (10 individuals), 60.1T-C64T (1 individual), T58C-64T (7 individuals) and C64T alone (2 individuals). In  we find additional variation at this segment, for instance, their complete genome Ar222 (GenBank accession number DQ904241) carries a transversion T59A, while their sample Ar20 (DQ904235) carries transversion T58A.
Transition 16209C was found in eleven out of twelve H1a1-samples; therefore, this variant could be perfectly part of the motif of an H1a1 sub-branch. Position 73 has been traditionally used as a diagnostic position of hg R together with only one coding region transition, G11719A; this variant appears also quite often within hg H . In addition, this variant appears for instance in complete mtDNA genomes within H1a (e.g. ), H3 , H4a (e.g. ), etc. Therefore, defining hg R by only A73G is imprecise. In our dataset, we observed for example the occurrence of A73G within H4a1. Within H7, a new sub-lineage, here identified as H7-709 seems to be characterized at least by the following array of variants G709A, T16172C and C16173T. This lineage seems to be geographically restricted to eastern Europe since it appears only in the Székely ; note that the control region motif T16172C C16173T appears also in the Kurdish (, but within hg I, as well as in other L and M hgs [25, 26]. Within H10, another new sub-lineage carrying the additional control region markers T16093C and C16221T could be distinguished; the complete genome #32/H from  give support to this new branch. The transition C16114T could be part of the motif of a sister clade, as inferred from two complete genomes: Tor31 (AY738970; ) and the one from Family Tree DNA retrieved from GenBank with accession number EU073971. Transition T195C is probably part of the basal motif of H11, as identified in three complete genomes [6, 27]. Our data also support the unnamed cluster identified by  characterized by the transition G7337A (and also T13326C). It can be tentatively said that H17 is additionally identified by position G16129A; however, so far there is only one complete genome available supporting this fact (AY495154; ).
Age estimates of R0 subclades.
T ± ΔT (kya) (CR)a
T ± ΔT (kya) (CR)c
T ± ΔT (kya) (HVS-I)c
T ± ΔT (kya) (HVS-I)d
T ± ΔT (kya) (Coding Region)e
18.7 ± 2.4
11.0 ± 1.5
17.4 ± 4.1
18.4 ± 2.0
H1 (w/o H1a)
13.0 ± 0.9
12.0 ± 2.6
12.0 ± 3.2
15.2 ± 5.1
12.8 ± 2.4
18.7 ± 2.4
18.7 ± 6.7
6.4 ± 3.6
13.1 ± 5.8
6.9 ± 3.0
11.9 ± 5.9
10.8 ± 7.3
14.5 ± 4.1
20.7 ± 7.8
43.9 ± 18.5
10.8 ± 1.9
8.0 ± 3.0
9.8 ± 5.7
25.6 ± 5.4
20.7 ± 8.3
23.7 ± 8.6
10.2 ± 3.4
10.2 ± 4.1
2.0 ± 1.4
16.0 ± 6.6
5.1 ± 2.4
7.9 ± 4.9
19.3 ± 2.0
17.4 ± 7.1
13.9 ± 8.0
12.3 ± 4.6
10.2 ± 5.0
10.2 ± 2.8
15.4 ± 6.0
16.0 ± 8.1
10.3 ± 2.4
9.7 ± 3.0
7.3 ± 3.5
3.7 ± 2.1
11.5 ± 5.4
13.6 ± 1.2
9.0 ± 1.7
13.3 ± 3.3
12.6 ± 4.4
9.7 ± 3.1
7.5 ± 2.0
9.1 ± 2.8
3.4 ± 1.7
16.5 ± 5.2
10.1 ± 3.5
17.9 ± 7.4
16.1 ± 7.4
17.0 ± 2.0
13.4 ± 3.8
9.4 ± 3.3
12.4 ± 2.5 g
35.0 ± 6.6
26.4 ± 9.1
30.3 ± 12.4
Interestingly, although different sequence ranges were used and different mathematical and conceptually different methods were applied, the estimates of the majority of subclades corresponded well with each other. In general, it can be said that most hg H sub-lineages show a strong signal for the beginning of the population expansion after the LGM, covering the Late Pleistocene and early Holocene.
The near eastern hg R0a, which is geographically centred in the midst of the Arabian Peninsula , was estimated to have expanded between 26.4 ± 9.1 kya (rho statistics from the CR) and 35.0 ± 6.6 kya (MMD); the estimate derived from HVS-I lay in between (30.3 ± 12.4 kya).
The oldest sublineage of hg H was found to be H14, with highly concordant estimates for the TMRCA of 25.6 ± 5.4 kya (MMD) respectively 23.7 ± 8.6 kya (rho statistics). As in hg H, the estimate derived from CR rho statistics was lowest (20.6 ± 8.3 kya).
Consistent estimates for the TMRCA using both computational methods were obtained for most of the hgs (e.g. H, H1, H1a, H5, H6, H7, H13a1, H14, H15, H16) (Table 5). However, apart from H1 and H13a1, where our three estimates (MMD, rho statistics from entire CR and from HVS-I sequences) were highly consistent with each other, matching estimates were derived by comparing the MMD estimate with rho statistics from either CR or HVS-I sequences. In hgs H11 and HV0 the different approached used yielded very different mean values but associated with very large and overlapping confidence intervals.
Published estimates for the coalescence time of hg H3 lay at 16.0 ± 8.1 kya  or 10.3 ± 2.4 kya . Our estimates were 10.2 ± 5.0 kya (MMD), 10.2 ± 2.8 kya (rho CR) and 16.7 ± 6.1 kya (rho HVS-I). Thus, both our values (MMD and rho) derived from entire control region sequences fit very well with the estimate from coding region sequences, while our rho estimate from HVS-I sequences fits well with the Roostalu's et al. HVS-I appraisal. It seems that HVS-I sequences in hg H3 do not correspond to the molecular clock calibration from the Eurasian tree, in a sense that mutations in HVS-I occur at a higher rate in H3 compared to phylogenetically closely related lineages.
Most recently, molecular dating has been questioned fundamentally, and the calibration of the molecular clock of the mtDNA control region as basic requirement for timing the most recent common ancestor has led to an ongoing discussion (for a review see [34, 35]). However, occasional concerns that the molecular clock might be elusive and not tick regularly for human mtDNA were widely eliminated . A coding-region rate estimate of 5140 years per substitution  corresponds well with estimates for the HVS-I rate  and with a rate of one control-region mutation per 9250 years for the Eurasian tree  and can be used to estimate the coalescence time of various haplogroups of the mtDNA phylogeny. A straight forward way to estimate the age of a particular "root" haplotype, given the mutation rate, is to consider all available descendant individual sequences and take the arithmetic mean over all distances to the root haplotype. This method is referred to as "rho" estimation  and can be performed using the freely available software package Network.
One of the disadvantages of the simple rho method is the conversion of DNA-data into files that can be handled with the network program. Without some in-house executables that assist the conversion, this would be an elaborate and error-prone process. On the other hand, network offers the calculation of the standard error of rho, which would be very complicated if calculated by hand.
The estimation of the demographic parameter τ from the mismatch distributions between pairs of sequences can be performed with ARLEQUIN, which seems to be more user-friendly regarding data import. One of the disadvantages of this method is that it does not account for the rate heterogeneity of the different mutations, and that it needs calibration against an archaeological or historical record.
The observed differences between the two approaches might thus rely on the slightly imprecise estimate of the mtDNA control region's molecular clock affecting the rho estimate and on the disregard of variation in site-specific mutation rates affecting the MMD estimation.
It seems that the rho statistics derived from entire CR-sequences tends to underestimate the TMRCA, while the rho statistics derived from HVS-I sequences tends to predate the TMRCA. This is most likely due to the calibration of the molecular clock of the different DNA stretches of the mitochondrial genome, which need further fine-tuning. Under this perspective, it seems advisable to follow both approaches for timing the coalescence age of phylogenetic clusters, and to put the estimates into a geological-historical context. In general, more complete genome information is desirable in order to get more accurate estimates.
Hg H1 accounted for 12.4% of the samples and showed a high diversity in terms of different lineages (Additional file 2; mean number of pairwise differences: 4.92). Reticulations within H1 were mainly caused by mutational hotspots in the control region that emerged in its different sub-hgs, such as variation at 16129, 16189, etc.
Hg H2a represented 4.6% of the samples and displayed less genetic variation (Additional file 3; mean number of pairwise differences: 4.62). The control region marker 456 identifies hg H1d  and H5, but also popped up on an H2a2 background and led to one reticulation in the H2a network. This transition is however slightly unstable since it appears frequently in other non-H hg backgrounds, such as hg K .
Hg H3 was assigned to 5.6% of samples and was characterized by several clusters of identical profiles, thus implying little genetic differentiation (Additional file 4; mean number of pairwise differences: 3.06). The network of Additional file 4 shows just two reticulations provoked by two fast control region variants, T152C and T16311C.
Hg H4 corresponded to 2.5% of the samples, and showed very little genetic variation (Additional file 5; mean number of pairwise differences: 3.12). As in hg H3, mutations at position 16311 were responsible for network reticulations.
Hg H5 was realized in 9.2% of samples and revealed an enormous amount of lineage diversity (Additional file 6; mean number of pairwise differences: 4.51), especially when compared to the putatively same-aged Hg H3. Again, some fast positions are responsible for a number of reticulations, 146, 152, etc.
Hg H6 appeared in 3.1% of the samples and in relation to its relatively recent arrival in Europe (~9.7 kya), it exhibited pronounced genetic diversification (Additional file 7; mean number of pairwise differences: 2.97).
Hg H7 rendered for 3.9% of samples and demonstrated perfect tree-like evolution of lineages (Additional file 8; mean number of pairwise differences: 2.61).
Hgs H8, H9, H12, H17 and H21 could be differentiated in less than 0.5% of samples each. Due to small sample sizes, corresponding networks could only be constructed for H8 (Additional file 9) and H17 (Additional file 10).
Hg H10 was determined in 3.3% of the sequences but manifested hardly any genetic heterogeneity (Additional file 11; mean number of pairwise differences: 1.70).
Hg H11 encompassed 2.6% of the profiles and was estimated to have arisen 43.9 kya . With respect to its frequency, its genetic distinction (Additional file 12; mean number of pairwise differences: 4.67) and our TMRCA estimations, the age estimate of  seems overstated.
Hg H13a1 comprised 2.7% of samples and similar to hg H7, the profiles grouped in big clusters mirroring a perfect tree-like pattern of evolution (Additional file 13; mean number of pairwise differences: 2.60).
Hg H14 and H16 covered 1.3% of sequences each, but the network of H14 (Additional file 14; mean number of pairwise differences: 5.84) indicated higher genetic diverseness than the network of H16 (Additional file 15; mean number of pairwise differences: 1.80).
Hg H15 embraced 1.5% of samples and was characterized by an array of uncommon mutations that led to a reticulation-free structure of the network (Additional file 16; mean number of pairwise differences: 4.91).
Hg H* (without the above described sub-hgs) was found in 25.6% of samples and showed an amazingly high grade of diversification (Additional file 17; mean number of pairwise differences: 4.13).
Hg R0a spanned only 1.5% of samples, but the corresponding network suggest an old age of the hg, with a complete tree-like structure (Additional file 18; mean number of pairwise differences: 5.85).
Hgs V, HV* (without H), HV1, HV0 and HV0a together comprehended 5.3% of the study individuals and map onto a straightforward phylogenetic network, with reticulations mainly caused by mutations on position 72 (Additional file 19).
Our analysis of 1,350 mtDNA haplotypes belonging to R0, the most common hg-cluster in West Eurasia, revealed significant differences in the distribution of different sub-hgs even between geographically closely located regions. Given an estimated coalescence time of the distinct lineages of 10 – 18 kya, the differences in the distributions could hint to either limited maternal gene flow after the LGM due to the alpine nature of the regions involved, or to a stochastic loss of diversity due to environmental events and/or disease episodes occurred at different times and in distinctive regions (e.g. Black Plague between 1347–1666), or most likely, to genetic drift, particularly strong for less frequent haplogroups as most of R0 sub-haplogroups characterized here.
Our comparison of two different ways of obtaining the timing of the TMRCA confirms that the time of a sudden expansion (τ) can be adequately recovered from control region data with valid confidence intervals. The observed differences between the two approaches might rely on the slightly imprecise estimate of the mtDNA control region's molecular clock affecting the rho estimate and on the disregard of variation in site-specific mutation rates affecting the MMD estimation. So for reliable estimates, both procedures should be applied in order to cross-check the results for validity and soundness.
A total of 1,350 samples that belong to the hg cluster R0 were selected from five West Eurasian populations and one population from the Middle East, corresponding to a total of 3,569 samples. The majority of samples came from Austria (nr. of total samples: 2,487; nr. of R0-samples: 973) [12, 38], the donors of the remaining samples originated in Germany (nr. of total samples: 100; nr. of R0-samples: 31) , Hungary (nr. of total samples: 173; nr. of R0-samples: 71) , Macedonia (nr. of total samples: 200; nr. of R0-samples: 100) , Romania (nr. of total samples: 360; nr. of R0-samples: 124)  and Dubai (nr. of total samples: 249; nr. of R0-samples: 51) .
All samples were sequenced for the entire mtDNA control region (positions 16024–16569; 1–576) as described in  and screened for hg R0-specific single nucleotide polymorphisms within the mtDNA coding region as described in . 847 Austrian samples had already been typed for the 45 coding region SNPs before  and were sequenced for the entire control region for the present study. Of the remaining 503 samples the control region sequences had been generated before [19, 38–42] and the coding region SNPs were additionally typed for the present study.
Phylogenetic networks were constructed with Network.exe  applying the median joining algorithm  with the parameter epsilon set to 0. Polymorphisms within the control region were divided into five classes roughly according to their presumptive mutation rate . Superfast positions (152, 16519, insertions at 573) were weighted by one, fast positions (16093 and the AC-repeat around 523–524) by three and speedy positions (93, 150, 151, 182, 183, 194, 195, 198, 199, 200, 204, 207, 228, 499, 513, 16051, 16069, 16111, 16124, 16126, 16129, 16145, 16148, 16153, 16163, 16166, 16167, 16168, 16169, 16172, 16173, 16184, 16186, 16187, 16189, 16192, 16209, 16214, 16218, 16223, 16224, 16230, 16234, 16243, 16245, 16248, 16249, 16256, 16260, 16261, 16264, 16266, 16270, 16271, 16278, 16287, 16290, 16292, 16294, 16295, 16298, 16300, 16309, 16311, 16316, 16319, 16320, 16325, 16327, 16344, 16355, 16360, 16362, 16390) by six. Transversions at the following positions were weighted by 15: 95, 185, 189, 16114, 16147, 16176, 16188, 16206, 16207, 16257, 16265, 16266, 16286, 16293 and 16318. The remaining polymorphisms within the control region were assigned weight 10. All coding region SNPs were assigned weight 25. Length heteroplasmic C-insertions at 309 and 315 were weighted by zero.
ARLEQUIN (Version 2.0; ) was used for the calculation of haplotype frequencies and analysis of molecular variance (AMOVA). Permutation tests (1,000 replicates) were used to evaluate the significance of FST inter-population genetic distances. For all population genetic analyses, length heteroplasmic C-insertions were not taken into consideration.
The mismatch distributions (MMDs) within hgs (based on entire control region sequences) were calculated as the distribution of the observed number of differences between pairs of haplotypes with ARLEQUIN. The age of the demographic expansion, i.e. the demographic parameter τ, was estimated from the mismatch distributions of the different R0 subclades using a generalized nonlinear least-square approach . The demographic parameter τ was calculated by minimizing the sum of square deviations (SSD) between the observed mismatch distribution and its expectation under an infinite-sites model. Approximate confidence intervals for τ were obtained by parametric bootstrapping with 1000 replicates. In order to test the validity of the sudden expansion model, SSD was complementarily used as a test statistic. The conversion of the expansion time τ into years since a population entered a demographic expansion phase was performed according to the equation τ = 2 ut, where u is the total mutation rate per generation per gene and t is the number of generations passed since the demographic expansion. Assuming a (matrilineal) generation time of 25 years, and dating the age of hg H at 18,400 years , u is estimated as 0.0022 in correlation with a value for τ of 3.3 (Table 5).
As a complementary model-free approach to date the most recent common ancestor of R0 sub-clades, median-joining networks of hypervariable segment I (HVS-I) sequences (16024–16365) and of entire control region sequences were constructed with the parameters as described above. The rho (ρ) statistics were calculated as the average number of mutations from the nodes to the root of the sub-hg networks and scaled by the mutation rate for HVS-I, where one transitional step was taken equal to 20,180 years , or by the rate of one control region mutation per 9,250 years . Standard deviations for age estimates were calculated as in .
The authors thank Harald Niederstätter for discussion and Daniela Niederwieser for excellent technical assistance (both Institute of Legal Medicine, Innsbruck Medical University).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.