Estimating ancestry and heterozygosity of hybrids using molecular markers
- Benjamin M Fitzpatrick^{1, 2}Email author
DOI: 10.1186/1471-2148-12-131
© Fitzpatrick; licensee BioMed Central Ltd. 2012
Received: 10 May 2012
Accepted: 13 July 2012
Published: 31 July 2012
Abstract
Background
Hybridization, genetic mixture of distinct populations, gives rise to myriad recombinant genotypes. Characterizing the genomic composition of hybrids is critical for studies of hybrid zone dynamics, inheritance of traits, and consequences of hybridization for evolution and conservation. Hybrid genomes are often summarized either by an estimate of the proportion of alleles coming from each ancestral population or classification into discrete categories like F1, F2, backcross, or merely “hybrid” vs. “pure”. In most cases, it is not realistic to classify individuals into the restricted set of classes produced in the first two generations of admixture. However, the continuous ancestry index misses an important dimension of the genotype. Joint consideration of ancestry together with interclass heterozygosity (proportion of loci with alleles from both ancestral populations) captures all of the information in the discrete classification without the unrealistic assumption that only two generations of admixture have transpired.
Methods
I describe a maximum likelihood method for joint estimation of ancestry and interclass heterozygosity. I present two worked examples illustrating the value of the approach for describing variation among hybrid populations and evaluating the validity of the assumption underlying discrete classification.
Results
Naively classifying natural hybrids into the standard six line cross categories can be misleading, and false classification can be a serious problem for datasets with few molecular markers. My analysis underscores previous work showing that many (50 or more) ancestry informative markers are needed to avoid erroneous classification.
Conclusion
Although classification of hybrids might often be misleading, valuable inferences can be obtained by focusing directly on distributions of ancestry and heterozygosity. Estimating and visualizing the joint distribution of ancestry and interclass heterozygosity is an effective way to compare the genetic structure of hybrid populations and these estimates can be used in classic quantitative genetic methods for assessing additive, dominant, and epistatic genetic effects on hybrid phenotypes and fitness. The methods are implemented in a freely available package “HIest” for the R statistical software (http://cran.r-project.org/web/packages/HIest/index.html).
Background
Research on hybrids and hybrid zones offers unique insights into several aspects of evolutionary and ecological genetics [1–6], and natural hybridization might sometimes have a key role in evolutionary diversification and innovation [7–11]. Hybridization can also present a major challenge for conservation when it involves endangered and/or invasive species [12–16]. Therefore, accurate detection and characterization of hybridization is important for both basic and applied biology. Molecular genetic markers are making such analyses accessible across a wide range of organisms, but careful data analysis and interpretation are required to avoid erroneous inferences or misleading communications with non-scientists.
When describing a possible hybrid population, investigators often wish to summarize each individual’s multilocus genotype in a simple and informative way. This usually takes the form of either a hybrid index indicating the proportion of an individual’s ancestors belonging to each “parental” lineage [17–20], or a classification as putative parental, F1, F2, or backcross [21–24]. The hybrid index recognizes that hybrids often form a continuum rather than discrete categories, but the index can be unsatisfactory because it summarizes only one dimension of the genotype. Classification emphasizes the differences between early and later generation hybrids (e.g., F1 and F2 hybrids have the same expected hybrid index $=\frac{1}{2}$ but important differences in the fraction of heterozygous loci). This distinction is important because parental genotypes can potentially be recovered from a population in the early generations of admixture [25], and absence of later generation hybrids might indicate hybrid sterility [26]. However, analyses or management strategies that assume discrete classification fail to recognize the continuum of genotypes characteristic of most hybrid zones in the wild, and might perpetuate misleading ideas about the existence of discrete genetic categories [27, 28].
Expected genomic proportions of early generation hybrids
Class | S | H _{ I } | p _{11} | p _{12} | p _{22} |
---|---|---|---|---|---|
P1 | 0 | 0 | 1 | 0 | 0 |
P2 | 1 | 0 | 0 | 0 | 1 |
F1 | 1/2 | 1 | 0 | 1 | 0 |
F2 | 1/2 | 1/2 | 1/4 | 1/2 | 1/4 |
B1 | 1/4 | 1/2 | 1/2 | 1/2 | 0 |
B2 | 3/4 | 1/2 | 0 | 1/2 | 1/2 |
Below, I present simple maximum likelihood methods for estimating ancestry and heterozygosity from molecular marker data and explicitly testing the assumption that a discrete classification adequately describes an individual or dataset. I use empirical data and simulations to illustrate these two dimensions of hybridity and assess the reliability of inferences about discrete vs. continuous distributions of hybrid genotypes.
Methods
Ancestry and interclass heterozygosity for codominant markers
Buerkle [20] developed a maximum likelihood procedure for estimating the ancestry index S from non-diagnostic markers. Here, I generalize his method to jointly estimate S and H_{ I }(the interclass heterozygosity index) for individual hybrid genotypes given parental allele frequencies. It is useful to express genotypic probabilities using Turelli and Orr’s [33] three genomic proportions: p_{11} = proportion of loci with both alleles derived from parental species 1, p_{22} = proportion of loci with both alleles derived from parental species 2, and p_{12} = proportion with one allele from each species. The system is completely specified by two parameters (because p_{11} + p_{12} + p_{22} = 1), and perfectly represents ancestry and interclass heterozygosity because H_{ I }= p_{12}, and $S={p}_{11}+\frac{1}{2}{p}_{12}$ (Table 1) [32].
These expressions assume alleles were drawn at random from within each parental gene pool when the initial admixture was formed, but do not assume Hardy-Weinberg equilibrium within a hybrid population. Equivalent probability statements were used by Pritchard et al. in developing the Bayesian methods implemented in the program STRUCTURE [19, 34, 35]. STRUCTURE provides estimates of ancestry that incorporate uncertainty about parental allele frequencies. Using site-by-site analysis [34], it can also give Bayesian estimates of interclass heterozygosity. However the latter method requires mapped markers and has been used only rarely [27, 36]. Here, I use likelihood to provide simple estimates of ancestry and heterozygosity that allow analysis of the relationship between these two fundamental dimensions of hybrid genotypes. My estimates depend on given parental allele frequencies (rather than incorporating uncertainty about the ancestral populations) and assume all markers are unlinked or sampled at random with respect to linkage (see below). Despite these limitations, I illustrate the usefulness of considering these two dimensions of hybridity, and hope to encourage further development of methods.
Maximizing this function provides estimates of $\u015c={\widehat{p}}_{11}+\frac{1}{2}{\widehat{p}}_{12}$ and ${\u0124}_{I}={\widehat{p}}_{12}$. For diagnostic biallelic markers (f_{ij1 }= 1 and f_{ij2 }= 0), the joint MLE has closed form $\u015c={x}_{11}+\frac{1}{2}{x}_{12}$ and ${\u0124}_{I}={x}_{12}$, where x_{11} is the observed fraction of markers homozygous for species 1 alleles, and x_{12} is the observed fraction of markers heterozygous for species 1 and species 2 alleles.
Dominant Markers
Implementation
Simulations
Evolution of ancestry and heterozygosity in admixed populations
To illustrate how the joint distribution of S and H_{ I }change in the generations following admixture, I created a simple simulation model following Long’s “intermixture” [39]. The simulation begins with individuals from two parental populations in relative frequencies μ and 1−μ. A first admixed generation of size N is formed by randomly drawing N pairs of parents with replacement and randomly drawing one gamete from each parent to form N diploid offspring. Loci are assumed unlinked, so haploid gametes are formed by randomly drawing one allele from each locus within each parent. This model gives expected frequencies of μ^{2}, 2 μ (1 − μ), and (1 − μ)^{2} P1, F1, and P2 genotypes in the first generation. Each succeeding generation is formed in the same way by random mating of pairs from the previous generation. I kept track of diploid genotypes to estimate S and H_{ I } through time. R code for the simulations is available as Additional File 1.
To illustrate the effect of ongoing gene flow, I repeated the simulations above with stochastic immigration from unchanging parental populations (the continent-island admixture model [40, 41]). Each generation, individuals in the hybrid population were replaced by pure parentals with probability m (so the expected number of immigrants was Nm). Each immigrant was equally likely to be a P1 or P2 genotype.
Linkage and sampling of the genome
Linkage among markers is expected to affect the sampling variance (hence reliability) of parameter estimates because linked markers will tend to provide redundant information. The assumption that two markers each provide independent information is violated if they are linked (i.e., if the probability of recombination is less than 0.5). In general this should not be a problem if loci represent a simple random sample with respect to recombinational distance [42]. On the other hand, systematic sampling of a linkage map might provide more reliable estimates if the sample covers most of the genome and the sampling interval does not happen to coincide with some natural periodicity [42], e.g., if the sampled loci were always located near centromeres.
To evaluate the potential effects of linkage on bias and sampling variance, I created a simple linkage model. Each model genome included four diploid chromosomes with 100 loci each. The loci were evenly distributed across two chromosome arms, and one recombination event was modeled per chromosome arm per meiosis (a minimal rate based on mammalian disjunction [43, 44]). Recombination breakpoints were drawn with equal probability at any interval on a chromosome arm. This means the recombinational distance between adjacent loci was 2cM. This certainly does not capture all of the complexities of recombination in real genomes [44–46], but it efficiently models a highly structured genome where many randomly sampled markers will be on the same chromosomes.
Using this model, I simulated F2, backcross, and later generation crosses (up to F10) from parental lines with diagnostic alleles at each marker. For comparison, I simulated the same series of cross types allowing free recombination between all markers (400 unlinked markers). For each simulated individual, I recorded the true values of S and H_{ I } from all 400 loci, and then estimated S and H_{ I }from samples of L = 3, 10, 20, 30, 40, 50 and 60 loci. For the four-chromosome individuals, I compared estimates using simple random sampling to estimates using systematic sampling where a series of L loci at regular 2cM or 10cM intervals was obtained by choosing a single random starting locus. For each simulated individual (1000 of each cross type), I estimated the bias and sampling variance from 1000 random samples of markers for each genomic sample size L and sampling regime.
Uncertainty of parental allele frequencies
My implementation of the estimators for S and H_{ I }depends on prior estimates of parental allele frequencies taken as known constants. To briefly illustrate the consequences of inaccurate assumptions about parental allele frequencies, I simulated ten generations of admixture in small populations (N = 50) with different sets of actual parental allele frequencies, and then estimated S and H_{ I } for each individual under different assumed parental allele frequencies. To evaluate the effect of an overall bias, I used four scenarios: (i) parental populations with L diagnostic markers, (ii) L diallelic markers with allele frequencies all equal to 0.9 in one lineage and 0.1 in the other, (iii) L diallelic markers with allele frequencies all equal to 0.8 in one lineage and 0.2 in the other, and (iv) L diallelic markers with allele frequencies all equal to 0.7 in one lineage and 0.3 in the other. For each of these sets of actual parental allele frequencies, I performed estimation under each set of parental allele frequencies as an assumption. I repeated these analyses with L = 3 and L = 50 to assess how uncertainty interacts with marker number.
To evaluate the effect of balanced inaccuracy, I simulated admixture from parental lineages with 25 diallelic markers with allele frequencies all equal to 0.9 in one lineage and 0.1 in the other, and 25 additional diallelic markers with allele frequencies all equal to 0.7 in one lineage and 0.3 in the other, and then performed estimation assuming all 50 markers had allele frequencies of 0.8 and 0.2. Finally, to assess the impact of having just a few known diagnostic markers, I repeated this analysis replacing one locus of each type with a diagnostic locus, and performed estimation assuming those two were diagnostic but still assuming the other 48 markers had allele frequencies of 0.8 and 0.2.
Hybrid Classification
The most valuable inference from genealogical classification of wild samples is in identifying situations where F1 hybrids are infertile so later generations are never formed [26], or distinguishing brand new hybrid zones from hybrid swarms that are several generations old and therefore unlikely to contain any true parental or F1 individuals [27]. This can be accomplished by evaluating whether any individuals have F1 or parental likelihoods that are (i) sufficiently greater than their likelihoods for other genotype frequency classes to rule those classes out, and (ii) sufficiently similar to the maximum likelihood ancestry and interclass heterozygosity to say the hypothesized classifications cannot be rejected. One approach is to accept a putative classification as credible if the log-likelihood of the best-fit class is over 2 units greater than the log-likelihood of the second best-fit class and within 2 units of the maximum log-likelihood. The first criterion is based on the approximate equivalence of a 2x log-likelihood interval to a 95 percent confidence interval for some distributions [47, 48]. The second is based on the conventional penalty of two log-likelihood units for an additional estimated parameter in model selection [49, 50]. The classification model can be viewed as having one free parameter (for an individual, once the best-fit class is set to “chosen”, the other five are constrained to “not chosen”), while the continuous model has two (S and H_{ I }). This approach has the disadvantage of effectively treating the classification as a null model, which is not biologically justified. A better approach is to accept the classification only if its AIC is lower than the AIC of the MLE (in this case, equivalent to a criterion of within 1.0 log-likelihood units of the MLE). Note that the AIC of the best classification cannot be less than the MLE by more than 2 (the case where MLE is identical to the expectation for a class). This approach avoids the pitfall of assuming that individuals fall into a small set of discrete classes, and instead directly evaluates the validity of classification relative to the continuous model MLE.
Examples
To illustrate inferences based on S and H_{ I }, I analyzed two published data sets. The first is a sample of hybrid tiger salamanders from a 60-year old hybrid swarm where we expect to find no true parental or F1 individuals [51]. The second is from a hybrid zone between Ensatina salamanders in southern California, where Devitt et al. [52] inferred that a large proportion of individuals in the hybrid zone were in fact F1 hybrids, based on analysis with NewHybrids. To describe ancestry and interclass heterozygosity in these datasets and evaluate support for the existence of true F1 hybrids in the wild, I wrote functions in R [37] to find the joint maximum likelihood estimates of S and H_{ I }, and to evaluate the likelihoods of the six genotype frequency classes typically of interest (corresponding to the expectations for pure parentals, F1’s, F2’s and first backcrosses in each direction). These functions and others used in this paper are available as a CRAN package called “HIest” (for “hybrid index estimation”) at http://cran.r-project.org/web/packages/HIest/index.html.
Introduced x native hybrid swarm in tiger salamanders
Barred Tiger Salamanders (Ambystoma tigrinum mavortium) were deliberately introduced from Texas to California in the 1940’s and 1950’s [53]. They have been interbreeding with the native California Tiger Salamander (A. californiense) in ponds throughout the Salinas Valley for roughly 20-30 generations. Thus, unless there has been an unknown source of new “pure” Barred Tiger Salamanders in the recent past, it is extremely unlikely that any true F1, F2, or backcross individuals exist in the wild.
Fitzpatrick et al. [51] used 65 putatively diagnostic markers (one allele assumed fixed in each ancestral population) to genotype 255 salamander larvae from five breeding ponds. This example is instructive because diagnostic markers allow use of the closed-form MLE’s as benchmarks for testing the optimization, and the large number of markers gives high precision in evaluating how the distribution of hybrid genotypes varies across populations and whether any populations might contain putatively pure parentals or F1’s.
A natural hybrid zone in Ensatina
Ensatina eschscholtzii is a classic example of the “ringspecies” pattern illustrating the gradual evolution of reproductive isolation and distinctiveness between species taxa [54–58]. Devitt et al. [52] analyzed a narrow hybrid zone in southern California between the distinctive forms E. e. eschscholtzii and E. e. klauberi using one mitochondrial and three nuclear loci assayed for 335 salamanders densely sampled from across the contact zone. They used NewHybrids [23] and STRUCTURE [19, 35] to estimate ancestry (the Baysian Q-value estimates the same underlying quantity as S here), and classified as “hybrids” the 46 individuals with point estimates between 0.1 and 0.9. Of these, 22 were classified as F1 hybrids and 24 as F2 or backcrosses based on posterior probabilities from NewHybrids. I used their nuclear data (published as online supplementary material) to compare their inferences to my joint likelihood estimation of S and H_{ I }. This example is instructive because the small number of non-diagnostic markers should give considerably less precision than the tiger salamander example, and because the high frequency of F1 hybrids is biologically significant if the inference is credible.
The nuclear markers used by Devitt et al. [52] were not diagnostic, so I repeated their analysis using the admixture model in STRUCTURE (version 2.3.2) with standard settings to estimate “ancestral” allele frequencies to use as givens (f_{ij1},f_{ij2}) for my likelihood calculations. I also saved the Q-values estimated by STRUCTURE to compare to my MLE’s of S (though the inferences are obviously not independent because both depend on the parental allele frequencies inferred by STRUCTURE). This reliance on external estimates of parental allele frequencies is a weakness of my implementation, but I suspect that my approach could be integrated in a fully Bayesian analysis using NewHybrids [23], STRUCTURE [19, 34, 35], or Introgress [59] as a starting point. To evaluate support for classification of Ensatina hybrids into the six standard classes, I once again used both criteria; (i) classification required a difference of two log-likelihood units between the best fit class and any other, and (ii) the best fit class had to have lower AIC than the joint MLE’s of S and H_{ I }.
Sampling and false classification
markers. Although this applies precisely only in the case of F2 hybrids and diagnostic markers, it might be taken as a rule of thumb in the absence of other criteria. In the case of the Ensatina data with 46 putative hybrids and three markers, we might expect 5.75 false F1’s and would have wanted 10 markers to keep the error rate near 5%.
Results and Discussion
Evolution of ancestry and heterozygosity in admixed populations
Figure 2 shows S and H_{ I } from a single random simulation for N = 100 with 100 diagnostic codominant markers. The case is typical in showing clear genotypic clusters corresponding to parentals, F1’s, F2’s, and backcrosses in the first two generations, followed by a few generations with high variance of S, effectively looking like a continuum between backcross-like and F2-like genotypes (0.25 < S < 0.75, H_{ I } near 0.5). By N/10 generations almost all individuals are clustered around S = H_{ I }= 0.5, and the population slowly becomes more homozygous as alleles are lost by drift (S remains roughly constant while H_{ I } declines toward zero).
Codominant markers
Maximum likelihood estimates of S and H_{ I }appear consistent and unbiased for known codominant genotypes (Figure 5). Precision depends on the number of markers and how ancestry-informative they are (how different the known parental allele frequencies are). The simplicity of the triangular sample space makes it easy to visualize the likelihood surface for any individual and get a feel for the uncertainty around an estimate. Figure 5 illustrates that a large number of highly informative markers are needed for precise inference about any single genotype.
Dominant markers
Maximizing the log-likelihood for dominant markers also gives unbiased estimates of S and H_{ I }(Figure 6). With the inherently lower information content of dominant markers, more markers are needed for precision, as seen in other methodological studies [24, 60, 61]. These markers are less informative about heterozygosity, hence the oval ellipses in Figure 6. The method works well as long there is a mixture of loci for which the dominant allele is more common in ancestral species 1 and other loci for which the dominant allele is more common in ancestral species 2. The validity of the estimates depend on the validity of homozygous recessive genotypes as information about p_{11} and p_{22}. If, for example, the absence of PCR product or particular band on a gel cannot be interpreted as a homozygous recessive genotype, the marker system should not be used for this or any other method relying on typical population genetic assumptions.
Linkage and sampling of the genome
Markers sampled at random from a structured genome were indistinguishable from truly unlinked markers in terms of bias and sampling variance of Ŝ and ${\u0124}_{I}$ (Additional File 2: Figures S1-S4). Average bias was indistinguishable from zero for all sampling regimes (Additional file 2: Figures S1, S2), and sampling variance decreased with larger numbers of markers, as expected (Additional file 2: Figures S3, S4). Systematically sampling linked markers affected sampling variance in a manner consistent with statistical intuition [42]. Estimates based on small numbers of tightly linked markers had high sampling variance (i.e., a different sample of markers was likely to give substantially different estimates). However, when coverage of the genome was very good, systematic sampling resulted in lower sampling variance (Additional file 2: Figures S3, S4). For example, given the modeled genome structure (four 200cM chromosomes) 60 markers at 10cM intervals spans 75% of the genome and leads to more reliable estimates of S and H_{ I } than a simple random sample of 60 markers. Thus, for systematically sampled genomes with good coverage, support intervals based on my likelihood calculations will be somewhat conservative.
Uncertainty of parental allele frequencies
Effects of systematic over- or under-estimating differentiation between parental lineages predictably biased hybrid index estimates toward intermediate or extreme values respectively (Additional file 2: Figures S5-S8, Tables S1-S4). For example, if markers are assumed to be diagnostic but actually have frequencies of 0.8 in a parental population, then we would estimate that most pure parental individuals have ancestry $\u015c=0.8$ and are heterozygous for foreign alleles with probability $\widehat{{H}_{I}}=2\left(0.8\right)\left(0.2\right)=0.32$. In contrast, if allele frequencies are assumed to be more intermediate than they truly are in parental lineages (e.g., if parental allele frequencies are estimated from introgressed populations), then estimates will tend to be more extreme than the true values. This situation might result in population samples appearing to have excess F1 hybrids (high H_{ I }) and/or parental-like genotypes (high or low S).
When an equal number of parental allele frequencies were over- and under-estimated, estimates of S were very accurate, but estimates of H_{ I }had increased variance and were slightly biased toward extreme values (Additional file 2: Figure S9, Table S5). Adding two known diagnostic loci to the set made negligible difference. Presumably a Bayesian method that could account for uncertainty in parental allele frequencies would ameliorate the slight bias in ${\u0124}_{I}$ and take better advantage of markers where the quality of information is better. Nevertheless, the simple likelihood approach used here is pretty robust to small errors in the assumed parental allele frequencies, especially if the errors are unbiased.
Examples
Introduced x native hybrid swarm in tiger salamanders
The distributions of individual estimates of ancestry and interclass heterozygosity from the tiger salamander data are illustrated in Figure 4. Populations vary considerably in their joint distributions of S and H_{ I }. The patterns for Bluestone, Pond H, and Sycamore are consistent with gene flow between populations differing in allele frequencies (Figure 4). Melindy is surrounded by predominantly native populations. Toro is relatively isolated and resembles the simulations of neutral admixture with little immigration (Figure 2). For all except Toro, there seems to be a high concentration of estimates near the maximum possible H_{ I } given S (the legs of the triangle), which is consistent with the earlier observation of hybrid vigor in this system [14]. For these diagnostic markers, MLE’s found via MCMC agreed perfectly with the closed form MLE’s (Figure 4f).
Thus, with sufficiently high-resolution data, this kind of analysis can show that admixture has been ongoing for more than two generations and the simple hybrid classification scheme of F1, F2, and backcross is clearly inadequate to describe the distribution of genotypes in the wild. Even for Toro Pond, where 14/52 would be classified as F2, the joint distribution of S and H_{ I }is inconsistent with two generations of admixture because random mating is expected to produce the full array of parental, F1, F2, and backcross genotypes in a population (Figure 2).
A natural hybrid zone in Ensatina
The likelihood surfaces fitted to the Ensatina data are rather flat (Figure 8d-f). All but three of the 46 putative “hybrids” had maximum likelihood estimates of interclass heterozygosity at the maximum possible value given their MLE values of S (Figure 8c). Intuitively, this distribution of genotypes seems consistent with a narrow hybrid zone structured by ongoing immigration of homozygous E. e. eschscholtzii and E. e. klauberi genotypes (corroborated by other analyses in [52] and [62]). Even so, the extreme concentration of estimates at the edges of the sample space might not hold up with the inclusion of more than three markers (see below), or if there is substantial inaccuracy in the estimates of parental allele frequencies. For example the effective sample size for either parental lineage might be small or many generations of introgression might have made the contemporary populations more similar than the true ancestral lineages. It is important to note that the key conclusions about differential introgression of mtDNA across a narrow hybrid zone [52] are not affected by the validity of the hybrid classification in this case.
Sampling and false classification
False classification in subsamples of the tiger salamander data was largely attributed to the difficulty of distinguishing F2 and backcross categories from later generation hybrids. Misclassification of later generation hybrids from these populations as parental or F1 was a problem only for small numbers of markers (Figure 9c and d). In Bluestone, with its more dispersed distribution of S and H_{ I }(Figure 3), a substantial fraction of hybrids could be mistaken for parentals when 10 or fewer markers were used. The tighter distribution of genotypes in Toro Pond made this less of a problem, but a fraction of the Toro Pond animals were consistently classified as F2. Both ponds showed means of ca. 10% F1 misclassification when three markers were used, slightly below the 12.5% that would be expected for a population of F2’s or backcrosses, and substantially below the 37% putative F1’s in the Ensatina dataset.
Conclusions
Hybrids are generally conceived as the genetically mixed descendants of two or more distinct ancestral populations [63]. The mixed genomes of hybrids can be characterized in terms of ancestry (S, the fraction of alleles derived from each ancestral group), and interclass heterozygosity (H_{ I }, the fraction of loci heterozygous for alleles from each ancestral group). Heretofore, interclass heterozygosity has been used only rarely in analyses of hybridization in the wild, but to great effect [14, 27, 36, 64]. I present an effective method for jointly estimating S and H_{ I }. The joint likelihood is efficiently expressed in terms of Turelli and Orr’s [33] genomic proportions given information on ancestral allele frequencies. A future improvement would be to jointly estimate ancestral allele frequencies along with individual ancestries and heterozygosities for a sample. This might be achieved in a Bayesian MCMC framework [19, 41].
Joint consideration of S and H_{ I }provides considerably more biological insight than a single ancestry index or classification of hybrids into the limited categories generated in the first two generations of admixture [14, 29, 32]. My analysis illustrates how reliance on the simple classification scheme (parental, F1, F2, backcross) can be misleading. Classification is appropriate only for study systems in the first two generations of admixture. Even with modest numbers of markers, false acceptance of discrete hybrid classes is likely. More stringent criteria for accepting a classification might be used, but in all cases investigators should carefully consider whether classification of individuals into discrete categories is both realistic and of interest given their research questions. With large numbers of markers (such as the tiger salamander example), the validity of discrete classification can be evaluated and rejected for populations with over two generations of admixture. This might be of biological interest in some cases. In other cases, investigators might be more interested in the MLEs of S and H_{ I } than in the likelihood that an individual is truly an F2 hybrid [22].
Declarations
Acknowledgements
I thank A. Buerkle, J. Fordyce, Z. Gompert, C. Nice, T. Devitt, Marius Roesti, and two anonymous reviewers for helpful discussions and comments on a draft of the manuscript. NSF grant DEB-0516475 helped support the work on tiger salamanders.
Authors’ Affiliations
References
- Barton NH, Hewitt GM: Adaptation, speciation and hybrid zones. Nature. 1989, 341: 497-503. 10.1038/341497a0.PubMedView Article
- Whitham TG: Plant hybrid zones as sinks for pests. Science. 1989, 244: 1490-1493.View Article
- Harrison RG: Hybrid zones: windows on evolutionary process. Oxford Surveys in Evolutionary Biol. 1990, 7: 69-128.
- Jiggins CD, McMillan WO, Neukirchen W, Mallet J: What can hybrid zones tell us about speciation? The case of Heliconius erato and H. himera (Lepidoptera: Nymphalidae). Biol J Linnean Soc. 1996, 59: 221-242.
- Butlin R: What do hybrid zones in general, and the Chorthippus parallelus zone in particular, tell us about speciation?. Endless forms: species and speciation. Edited by: Howard DJ, Berlocher SH. 1998, New, York: Oxford University Press, 367-378.
- Rieseberg LH, Whitton J, Gardner K: Hybrid zones and the genetic architecture of a barrier to gene flow between two sunflower species. Genet. 1999, 152: 713-727.
- Grant BR, Grant PR: Hybridization and speciation in Darwin’s finches. Endless forms: species and speciation. Edited by: Howard DJ, Berlocher SH. 1998, New, York: Oxford University Press, 404-422.
- Rieseberg LH, Raymond O, Rosenthal DM, Lai Z, Livingstone K, Nakazato T, Durphey JL, Schwartzbach AE, Donovan LA, Lexer C: Major ecological transitions in wild sunflowers facilitated by hybridization. Sci. 2003, 301: 1211-1216. 10.1126/science.1086949.View Article
- Gompert Z, Fordyce JA, Forister ML, Shapiro AM, Nice CC: Homoploid hybrid speciation in an extreme habitat. Sci. 2006, 314: 1923-1925. 10.1126/science.1135875.View Article
- Mavarez J, Salazar CA, Bermingham E, Salcedo C, Jiggins CD, Linares M: Speciation by hybridization in Heliconius butterflies. Nature. 2006, 441: 868-871. 10.1038/nature04738.PubMedView Article
- Arnold ML, Martin NH: Adaptation by introgression. J Biol. 2009, 8: 82-10.1186/jbiol176.PubMedPubMed CentralView Article
- Rhymer JM, Simberloff D: Extinction by hybridization and introgression. Annu Rev Ecol and Syst. 1996, 27: 83-109. 10.1146/annurev.ecolsys.27.1.83.View Article
- Allendorf FW, Leary RF, Spruell P, Wenburg JK: The problems with hybrids: setting conservation guidelines. Trends in Ecol and Evol. 2001, 16: 613-622. 10.1016/S0169-5347(01)02290-X.View Article
- Fitzpatrick BM, Shaffer HB: Hybrid vigor between native and introduced salamanders raises new challenges for conservation. Proceedings of the National Academy of Sciences, USA. 2007, 104: 15793-15798. 10.1073/pnas.0704791104.View Article
- Fitzpatrick BM, Johnson JR, Kump DK, Smith JJ, Voss SR, Shaffer HB: Rapid spread of invasive genes into a threatened native species. Proceedings of the National Academy of Sciences, USA. 2010, 107: 3606-3610. 10.1073/pnas.0911802107.View Article
- Placyk JS, Fitzpatrick BM, Casper GS, Small R, Reynolds RG, Noble DWA, Brooks RJ, Burghardt GM: Hybridization between two gartersnake species (Thamnophis) of conservation concern: a threat or an important natural interaction?. Conserv Genet. 2012, 13: 649-663. 10.1007/s10592-012-0315-4.View Article
- Harrison RG: Pattern and process in a narrow hybrid zone. Heredity. 1986, 56: 337-349. 10.1038/hdy.1986.55.View Article
- Jiggins CD, Mallet J: Bimodal hybrid zones and speciation. Trends in Ecol and Evol. 2000, 15: 250-255. 10.1016/S0169-5347(00)01873-5.View Article
- Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genet. 2000, 155: 945-959.
- Buerkle CA: Maximum-likelihood estimation of a hybrid index based on molecular markers. Mol Ecol Notes. 2005, 5: 684-687. 10.1111/j.1471-8286.2005.01011.x.View Article
- Boecklen WJ, Howard DJ: Genetic analysis of hybrid zones: Numbers of markers and power of resolution. Ecol. 1997, 78: 2611-2616. 10.1890/0012-9658(1997)078[2611:GAOHZN]2.0.CO;2.View Article
- Rieseberg LH, Linder CR: Hybrid classification: insights from genetic map-based studies of experimental hybrids. Ecol. 1999, 80: 361-370. 10.1890/0012-9658(1999)080[0361:HCIFGM]2.0.CO;2.View Article
- Anderson E, Thompson EA: A model-based method for identifying species hybrids using multilocus genetic data. Genet. 2002, 160: 1217-1229.
- Vähä JP, Primmer CR: Efficiency of model-based Bayesian methods for detecting hybrid individuals under different hybridization scenarios and with different numbers of loci. Mol Ecol. 2006, 15: 63-72. [http://dx.doi.org/10.1111/j.1365-294X.2005.02773.x]PubMedView Article
- Allendorf FW, Luikart G: Conservation and the genetics of populations. 2007, Malden, MA: Blackwell
- Kanda N, Leary RF, Allendorf FW: Evidence of introgressive hybridization between bull trout and brook trout. Trans Am Fisheries Soc. 2002, 131: 772-782. 10.1577/1548-8659(2002)131<0772:EOIHBB>2.0.CO;2.View Article
- Lexer C, Joseph JA, van Loo M, Barbará T, Heinze B, Bartha D, Castiglione S, Fay MF, Buerkle CA: Genomic Admixture Analysis in European Populus spp. Reveals Unexpected Patterns of Reproductive Isolation and Mating. Genet. 2010, 186: 699-712. 10.1534/genetics.110.118828.View Article
- Fujimara J, Rajagopalan R: Different differences: the use of ’genetic ancestry’ versus race in biomedical human genetic research. Social Stud Sci. 2011, 41: 5-30. 10.1177/0306312710379170.View Article
- Lynch M: The genetic interpretation of inbreeding depression and outbreeding depression. Evol. 1991, 45: 622-629. 10.2307/2409915.View Article
- Hill WG: Dominance and epistasis as components of heterosis. Z. Tierzuchtg. Zuchtgsbiol. 1982, 99: 161-168.View Article
- Lynch M, Walsh B: Genetics and analysis of quantitative traits. 1998, Sunderland, MA: Sinauer Ass
- Fitzpatrick BM: Hybrid dysfunction: Population genetic and quantitative genetic perspectives. Am Naturalist. 2008, 171: 491-198. 10.1086/528991.View Article
- Turelli M, Orr HA: Dominance, epistasis, and the genetics of postzygotic isolation. Genet. 2000, 154: 1663-1679.
- Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genet. 2003, 164: 1567-1587.
- Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: Dominant markers and null alleles. Mol Ecol Notes. 2007, 7: 574-578. 10.1111/j.1471-8286.2007.01758.x.PubMedPubMed CentralView Article
- Scascitelli M, Whitney KD, Randell RA, King M, Buerkle CA, Rieseberg LH: Genome scan of hybridizing sunflowers from Texas (Helianthus annuus and H. debilis) reveals asymmetric patterns of introgression and small islands of genomic differentiation. Mol Ecol. 2010, 19: 521-541. 10.1111/j.1365-294X.2009.04504.x. [http://dx.doi.org/10.1111/j.1365-294X.2009.04504.x]PubMedView Article
- R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing, http://www.R-project.org,
- Hastings W: Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika. 1970, 57: 97-109. 10.1093/biomet/57.1.97.View Article
- Long JC: The genetic structure of admixed populations. Genet. 1991, 127: 417-428.
- Asmussen MA, Arnold J, Avise JC: The effects of assortative mating and migration on cytonuclear associations in hybrid zones. Genet. 1989, 115: 755-768.
- Gompert Z, Buerkle CA: Bayesian estimation of genomic clines. Mol Ecol. 2011, 20: 2111-27. 10.1111/j.1365-294X.2011.05074.x.PubMedView Article
- Cochran WG: Sampling techniques (Third ed.). 1977, New York: John Wiley and Sons
- Pardo-Manuel de Villena F, Sapienza C: Recombination is proportional to the number of chromosome arms in mammals. Mamm Genome. 2001, 12: 318-322. 10.1007/s003350020005.PubMedView Article
- Lynch M: The Origins of Genome Architecture. 2007, Sunderland, MA.: Sinauer Assocs., Inc.
- Auton A, Fledel-Alon A, Pfeifer S, Venn O, Seguerel L, Street T, Leffler EM, Bowden R, Aneas I, Broxholme J, Humburg P, Iqbal Z, Lunter G, Maller J, Hernandez RJ, Melton C, Venkat A, Nobrega MA, Bontrop R, Myers S, Donnelly P, Przeworski M, McVean G: A fine-scale chimpanzee genetic map from population sequencing. Sci. 2012, 336: 193-198. 10.1126/science.1216872.View Article
- Roesti M, Hendry AP, Salzburger W, Berner D: Genome divergence during evolutionary diversification as revealed in replicate lake-stream stickleback population pairs. Mol Ecol. 2012, 21: 2852-2862. 10.1111/j.1365-294X.2012.05509.x.PubMedView Article
- Hudson DJ: Interval estimation from the likelihood function. J R Statistical Soc, Ser B. 1971, 33: 256-262.
- Hilborn R, Mangel M: The ecological detective: Confronting models with data. 1997, New Jersey: Princeton University Press
- Edwards AWF: Likelihood. 1972, Cambridge: Cambridge University Press
- Burnham KP, Anderson DR: Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods Res. 2004, 33: 261-304. 10.1177/0049124104268644.View Article
- Fitzpatrick BM, Johnson JR, Kump DK, Shaffer HB, Smith JJ, Voss SR: Rapid fixation of non-native alleles revealed by genome-wide SNP analysis of hybrid tiger salamanders. BMC Evolutionary Biol. 2009, 9: 176-10.1186/1471-2148-9-176.View Article
- Devitt TJ, Baird SJE, Moritz C: Asymmetric reproductive isolation between terminal forms of the salamander ring species Ensatina eschscholtzii revealed by fine-scale genetic analysis of a hybrid zone. BMC Evolutionary Biol. 2011, 11: 245-10.1186/1471-2148-11-245.View Article
- Fitzpatrick BM, Shaffer HB: Introduction history and habitat variation explain the landscape genetics of hybrid tiger salamanders. Ecol Appl. 2007, 17: 598-608. 10.1890/06-0369.PubMedView Article
- Stebbins RC: Speciation in salamanders of the Plethodontid genus Ensatina. University of California Publications in Zoology. 1949, 48: 377-526.
- Mayr E: Animal species and evolution. 1963, Cambridge: BelknapView Article
- Wake DB, Yanev KP, Frelow MM: Sympatry and hybridization in a “ring species”: the plethodontid salamander Ensatina eschscholtzii. Speciation and its consequences. Edited by: Otte D, Endler JA. 1989, Sunderland, MA: Sinauer, Ass., 134-157.
- Wake DB: Incipient species formation in salamanders of the Ensatina complex. Proceedings of the National Academy of Sciences, USA. 1997, 94: 7761-7767. 10.1073/pnas.94.15.7761.View Article
- Futuyma DJ: Evolution, 2nd edition. 2009, Sunderland, MA: Sinauer Associates, Inc.
- Gompert Z, Buerkle CA: INTROGRESS: A software package for mapping components of isolation in hybrids. Mol Ecol Resour. 2010, 10: 378-384. 10.1111/j.1755-0998.2009.02733.x.PubMedView Article
- Evanno G, Regnaut S, Goudet J: Detecting the number of clusters of individuals using the software structure: a simulation study. Mol Ecol. 2005, 14: 2611-2620. 10.1111/j.1365-294X.2005.02553.x.PubMedView Article
- Gerber JC, Mariette S, Streiff R, Bodence C, Kremer A: Comparison of microsatellites and amplified fragment length polymorphism markers for parentage analysis. Mol Ecol. 2000, 9: 1037-1048. 10.1046/j.1365-294x.2000.00961.x.PubMedView Article
- Pereira RJ, Wake DB: Genetic leakage after adaptive and nonadaptive divergence in the Ensatina eschscholtzii ring species. Evol. 2009, 63: 2288-2301. 10.1111/j.1558-5646.2009.00722.x.View Article
- Harrison RG: Hybrids and hybrid zones: historical perspective. Hybrid zones and the evolutionary process. Edited by: Harrison RG. 1993, New York: Oxford University Press, 3-12.
- Fitzpatrick BM: Dobzhansky-Muller model of hybrid dysfunction supported by poor burst-speed performance in hybrid tiger salamanders. J Evolutionary Biol. 2008, 21: 342-351.View Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.