Estimating ancestry and heterozygosity of hybrids using molecular markers

Fitzpatrick, Benjamin M

doi:10.1186/1471-2148-12-131

Methodology article
Open access
Published: 31 July 2012

BMC Evolutionary Biology volume 12, Article number: 131 (2012)

Abstract

Background

Hybridization, genetic mixture of distinct populations, gives rise to myriad recombinant genotypes. Characterizing the genomic composition of hybrids is critical for studies of hybrid zone dynamics, inheritance of traits, and consequences of hybridization for evolution and conservation. Hybrid genomes are often summarized either by an estimate of the proportion of alleles coming from each ancestral population or classification into discrete categories like F1, F2, backcross, or merely “hybrid” vs. “pure”. In most cases, it is not realistic to classify individuals into the restricted set of classes produced in the first two generations of admixture. However, the continuous ancestry index misses an important dimension of the genotype. Joint consideration of ancestry together with interclass heterozygosity (proportion of loci with alleles from both ancestral populations) captures all of the information in the discrete classification without the unrealistic assumption that only two generations of admixture have transpired.

Methods

I describe a maximum likelihood method for joint estimation of ancestry and interclass heterozygosity. I present two worked examples illustrating the value of the approach for describing variation among hybrid populations and evaluating the validity of the assumption underlying discrete classification.

Results

Naively classifying natural hybrids into the standard six line cross categories can be misleading, and false classification can be a serious problem for datasets with few molecular markers. My analysis underscores previous work showing that many (50 or more) ancestry informative markers are needed to avoid erroneous classification.

Conclusion

Although classification of hybrids might often be misleading, valuable inferences can be obtained by focusing directly on distributions of ancestry and heterozygosity. Estimating and visualizing the joint distribution of ancestry and interclass heterozygosity is an effective way to compare the genetic structure of hybrid populations and these estimates can be used in classic quantitative genetic methods for assessing additive, dominant, and epistatic genetic effects on hybrid phenotypes and fitness. The methods are implemented in a freely available package “HIest” for the R statistical software (http://cran.r-project.org/web/packages/HIest/index.html).

Background

Research on hybrids and hybrid zones offers unique insights into several aspects of evolutionary and ecological genetics [1–6], and natural hybridization might sometimes have a key role in evolutionary diversification and innovation [7–11]. Hybridization can also present a major challenge for conservation when it involves endangered and/or invasive species [12–16]. Therefore, accurate detection and characterization of hybridization is important for both basic and applied biology. Molecular genetic markers are making such analyses accessible across a wide range of organisms, but careful data analysis and interpretation are required to avoid erroneous inferences or misleading communications with non-scientists.

When describing a possible hybrid population, investigators often wish to summarize each individual’s multilocus genotype in a simple and informative way. This usually takes the form of either a hybrid index indicating the proportion of an individual’s ancestors belonging to each “parental” lineage [17–20], or a classification as putative parental, F1, F2, or backcross [21–24]. The hybrid index recognizes that hybrids often form a continuum rather than discrete categories, but the index can be unsatisfactory because it summarizes only one dimension of the genotype. Classification emphasizes the differences between early and later generation hybrids (e.g., F1 and F2 hybrids have the same expected hybrid index $= \frac{1}{2}$ but important differences in the fraction of heterozygous loci). This distinction is important because parental genotypes can potentially be recovered from a population in the early generations of admixture [25], and absence of later generation hybrids might indicate hybrid sterility [26]. However, analyses or management strategies that assume discrete classification fail to recognize the continuum of genotypes characteristic of most hybrid zones in the wild, and might perpetuate misleading ideas about the existence of discrete genetic categories [27, 28].

Although no summary method is likely to satisfy all needs, the situation can be greatly improved by adding a single calculation so that hybrid genotypes are characterized by estimates of both ancestry (S, the axis that arranges all hybrids between two ancestral extremes) and interclass heterozygosity (H_I, the axis that distinguishes F1, F2, and recombinant inbred lines). In fact, joint estimates of ancestry and interclass heterozygosity include all of the information in the typical six-type classification because each class has a unique pair of expected values (Table 1) [29–31]. In evolutionary quantitative genetics, early generation hybrid classes are used to study dominance and epistasis precisely because they provide information on S and H_I, not because the classification itself contains any other information [22, 29–32].

Table 1 Expected genomic proportions of early generation hybrids

Below, I present simple maximum likelihood methods for estimating ancestry and heterozygosity from molecular marker data and explicitly testing the assumption that a discrete classification adequately describes an individual or dataset. I use empirical data and simulations to illustrate these two dimensions of hybridity and assess the reliability of inferences about discrete vs. continuous distributions of hybrid genotypes.

Methods

Ancestry and interclass heterozygosity for codominant markers

Buerkle [20] developed a maximum likelihood procedure for estimating the ancestry index S from non-diagnostic markers. Here, I generalize his method to jointly estimate S and H_I (the interclass heterozygosity index) for individual hybrid genotypes given parental allele frequencies. It is useful to express genotypic probabilities using Turelli and Orr’s [33] three genomic proportions: p₁₁ = proportion of loci with both alleles derived from parental species 1, p₂₂ = proportion of loci with both alleles derived from parental species 2, and p₁₂ = proportion with one allele from each species. The system is completely specified by two parameters (because p₁₁ + p₁₂ + p₂₂ = 1), and perfectly represents ancestry and interclass heterozygosity because H_I = p₁₂ and $S = p_{11} + \frac{1}{2} p_{12}$ (Table 1) [32].

The probability of a hybrid being homozygous for allele j at locus i in terms of the allele frequencies in parental population 1 (f_ij1) and population 2 (f_ij2), and Turelli and Orr’s [33] genomic proportions is

Pr {(j, j)}_{i} = p_{11} f_{ij 1}^{2} + p_{12} f_{ij 1} f_{ij 2} + p_{22} f_{ij 2}^{2} .

(1)

And the probability of being heterozygous for alleles j and k at locus i:

Pr {(j, k)}_{i} = p_{11} 2 f_{ij 1} f_{ik 1} + p_{12} (f_{ij 1} f_{ik 2} + f_{ik 1} f_{ij 2}) + p_{22} 2 f_{ij 2} f_{ik 2} .

(2)
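Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration and not the HIest code: the function name `genotype_prob`, the dictionary representation of allele frequencies, and the example frequencies are all assumptions made here.

```python
def genotype_prob(j, k, f1, f2, p11, p12, p22):
    """Probability of the unordered genotype (j, k) at one locus.

    f1, f2: dicts mapping allele -> frequency in parental populations 1 and 2.
    p11, p12, p22: genomic proportions (assumed to sum to 1).
    """
    if j == k:
        # Equation (1): homozygote for allele j
        return p11 * f1[j] ** 2 + p12 * f1[j] * f2[j] + p22 * f2[j] ** 2
    # Equation (2): heterozygote for alleles j and k
    return (p11 * 2 * f1[j] * f1[k]
            + p12 * (f1[j] * f2[k] + f1[k] * f2[j])
            + p22 * 2 * f2[j] * f2[k])


# Hypothetical biallelic locus; allele "A" is more common in population 1.
f1 = {"A": 0.9, "a": 0.1}
f2 = {"A": 0.2, "a": 0.8}
f2_class = (0.25, 0.5, 0.25)  # expected genomic proportions of an F2

probs = {g: genotype_prob(g[0], g[1], f1, f2, *f2_class)
         for g in [("A", "A"), ("A", "a"), ("a", "a")]}
total = sum(probs.values())  # the three genotype probabilities sum to 1
```

Summing over all possible genotypes at a locus is a quick sanity check that the allele frequencies and genomic proportions have been entered consistently.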

These probabilities can be generalized to consider any number A of ancestral gene pools:

Pr {(j, j)}_{i} = \sum_{a = 1}^{A} p_{aa} f_{ija}^{2} + \sum_{a = 1}^{A - 1} \sum_{b > a}^{A} p_{ab} f_{ija} f_{ijb} .

(3)

And

Pr {(j, k)}_{i} = 2 \sum_{a = 1}^{A} p_{aa} f_{ija} f_{ika} + \sum_{a = 1}^{A - 1} \sum_{b > a}^{A} p_{ab} (f_{ija} f_{ikb} + f_{ika} f_{ijb}) .

(4)

These expressions assume alleles were drawn at random from within each parental gene pool when the initial admixture was formed, but do not assume Hardy-Weinberg equilibrium within a hybrid population. Equivalent probability statements were used by Pritchard et al. in developing the Bayesian methods implemented in the program STRUCTURE [19, 34, 35]. STRUCTURE provides estimates of ancestry that incorporate uncertainty about parental allele frequencies. Using site-by-site analysis [34], it can also give Bayesian estimates of interclass heterozygosity. However the latter method requires mapped markers and has been used only rarely [27, 36]. Here, I use likelihood to provide simple estimates of ancestry and heterozygosity that allow analysis of the relationship between these two fundamental dimensions of hybrid genotypes. My estimates depend on given parental allele frequencies (rather than incorporating uncertainty about the ancestral populations) and assume all markers are unlinked or sampled at random with respect to linkage (see below). Despite these limitations, I illustrate the usefulness of considering these two dimensions of hybridity, and hope to encourage further development of methods.

The log-likelihood of a set of genomic proportions for a given hybrid genotype with n loci is (following Buerkle [20])

\ell (p_{11}, p_{12}, p_{22} \mid \text{genotype}) = \sum_{i = 1}^{n} \begin{cases} \ln \Pr {(j, k)}_{i} & \text{heterozygous loci} \\ \ln \Pr {(j, j)}_{i} & \text{homozygous loci} \end{cases} .

(5)

Maximizing this function provides estimates of $\hat{S} = \hat{p}_{11} + \frac{1}{2} \hat{p}_{12}$ and $\hat{H}_{I} = \hat{p}_{12}$. For diagnostic biallelic markers (f_ij1 = 1 and f_ij2 = 0), the joint MLE has closed form $\hat{S} = x_{11} + \frac{1}{2} x_{12}$ and $\hat{H}_{I} = x_{12}$, where x₁₁ is the observed fraction of markers homozygous for species 1 alleles, and x₁₂ is the observed fraction of markers heterozygous for species 1 and species 2 alleles.
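For diagnostic markers the closed form is easy to check numerically. The sketch below is a Python illustration under assumptions of my own choosing: genotypes are coded as the number of species 1 alleles (2, 1, or 0), and the function names and the ten-locus individual are hypothetical. With diagnostic markers, equation (5) reduces to a sum of ln p₁₁, ln p₁₂, and ln p₂₂ terms, which is what `log_lik` computes.

```python
import math


def closed_form_mle(genotypes):
    """Closed-form joint MLE (S_hat, H_hat) for diagnostic biallelic markers.

    genotypes: one entry per locus, coded as the number of species 1
    alleles carried at that locus (2, 1, or 0).
    """
    n = len(genotypes)
    x11 = sum(g == 2 for g in genotypes) / n  # homozygous for species 1
    x12 = sum(g == 1 for g in genotypes) / n  # interclass heterozygotes
    return x11 + 0.5 * x12, x12


def log_lik(genotypes, p11, p12, p22):
    """Equation (5) when every marker is diagnostic (all proportions > 0)."""
    log_p = {2: math.log(p11), 1: math.log(p12), 0: math.log(p22)}
    return sum(log_p[g] for g in genotypes)


genos = [2, 1, 1, 0, 2, 1, 2, 1, 1, 1]  # hypothetical individual, 10 loci
s_hat, h_hat = closed_form_mle(genos)    # x11 = 0.3, x12 = 0.6

# The closed-form estimate should beat any other point, e.g. the F2 expectation.
ll_mle = log_lik(genos, 0.3, 0.6, 0.1)
ll_f2 = log_lik(genos, 0.25, 0.5, 0.25)
```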

Dominant Markers

The method can be extended to dominant markers (e.g., AFLP). Assume allele j is dominant and k is recessive (e.g., for the phenotype of presence/absence of a PCR product at position i in a gel). The log-likelihood is

\ell (p_{11}, p_{12}, p_{22} \mid \text{marker phenotype}) = \sum_{i = 1}^{n} \begin{cases} \ln [\Pr {(j, j)}_{i} + \Pr {(j, k)}_{i}] & \text{band present} \\ \ln \Pr {(k, k)}_{i} & \text{band absent} \end{cases} .

(6)
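A minimal Python sketch of equation (6) for biallelic dominant markers follows. The function name, the argument layout (lists of dominant-allele frequencies `d1` and `d2` for the two parental populations), and the example data are assumptions for illustration only.

```python
import math


def dominant_log_lik(bands, d1, d2, p11, p12, p22):
    """Equation (6) for biallelic dominant markers.

    bands: list of booleans (True = band present) per locus.
    d1, d2: frequency of the dominant allele j at each locus in parental
            populations 1 and 2; the recessive allele k has frequency 1 - d.
    """
    total = 0.0
    for present, a, b in zip(bands, d1, d2):
        pr_jj = p11 * a ** 2 + p12 * a * b + p22 * b ** 2
        pr_jk = (p11 * 2 * a * (1 - a)
                 + p12 * (a * (1 - b) + (1 - a) * b)
                 + p22 * 2 * b * (1 - b))
        pr_kk = (p11 * (1 - a) ** 2
                 + p12 * (1 - a) * (1 - b)
                 + p22 * (1 - b) ** 2)
        # A band is seen for jj and jk genotypes; absence implies kk.
        total += math.log(pr_jj + pr_jk) if present else math.log(pr_kk)
    return total


# Hypothetical four-locus phenotype; the dominant allele is more common in
# population 1 at the first two loci and in population 2 at the last two.
bands = [True, False, True, True]
d1 = [0.9, 0.9, 0.1, 0.2]
d2 = [0.1, 0.2, 0.9, 0.8]
ll_f2 = dominant_log_lik(bands, d1, d2, 0.25, 0.5, 0.25)
ll_bc1 = dominant_log_lik(bands, d1, d2, 0.5, 0.5, 0.0)
```

The example deliberately mixes loci whose dominant allele is common in species 1 with loci whose dominant allele is common in species 2, since a band-present phenotype alone says little about which parental genome contributed it.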

Implementation

For finding maximum likelihood estimates using equations 5 or 6, I used the general purpose optimization function optim in R [37]. The function uses a quasi-Newton optimization algorithm that can handle simple constraints (i.e., proportions must be in the interval [0,1]). However, it sometimes failed for genotypes close to the edge of the triangular sample space (Figure 1), where the likelihood surface is discontinuous. Therefore I implemented two simple Markov Chain Monte Carlo approaches to more thoroughly explore the likelihood surface. The optim function can use a built-in simulated annealing (SANN) algorithm, given a function for proposing new estimates. I also wrote a simple MCMC algorithm using Metropolis-Hastings sampling [38]. For both of these approaches, I wrote a proposal function that draws new genomic proportions (p₁₁′, p₁₂′, p₂₂′) from a three-dimensional Dirichlet distribution centered on the old genomic proportions and with concentration parameter α. That is, the probability density of the proposal distribution is Dir(α p₁₁, α p₁₂, α p₂₂). Larger α makes the proposal distribution more concentrated near the current state. For efficiency, starting values were obtained by calculating likelihoods for 100 equally spaced pairs of S and H_I on a grid over the sample space and starting the MCMC at the grid point with highest likelihood. For present purposes, I ran the MCMC for 1000 steps (with α = 100) and used the pair of estimates with the maximum likelihood as the MLE. The sample space for this problem is simple (Figure 1) and inspection of dozens of likelihood surfaces never suggested the existence of local optima. The quasi-Newton algorithm was unreliable at the edge of the sample space because it could not approximate the local surface as a continuum, not because it was getting stuck at a local optimum.
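The search strategy described above (grid start, Dirichlet proposals, keep the best state visited) can be sketched in Python using only the standard library. This is not the HIest implementation. The assumptions made here are: diagnostic markers summarized as counts (n₁₁, n₁₂, n₂₂), a coarse 55-point interior grid rather than the 100 points used in the paper, a small floor on the Dirichlet parameters so that proposals stay defined at the edge of the triangle (which makes the Hastings correction only approximate there), and all function names.

```python
import math
import random


def log_lik(counts, p):
    """Equation (5) for diagnostic markers; counts = (n11, n12, n22)."""
    return sum(n * math.log(q) for n, q in zip(counts, p) if n > 0)


def propose(p, alpha, rng, floor=1e-6):
    """Draw new genomic proportions from Dirichlet(alpha * p) via gamma draws."""
    draws = [max(rng.gammavariate(max(alpha * q, floor), 1.0), floor) for q in p]
    total = sum(draws)
    return tuple(d / total for d in draws)


def log_dirichlet(x, p, alpha, floor=1e-6):
    """Log density of x under Dirichlet(alpha * p)."""
    a = [max(alpha * q, floor) for q in p]
    return (math.lgamma(sum(a)) - sum(math.lgamma(ai) for ai in a)
            + sum((ai - 1.0) * math.log(xi) for ai, xi in zip(a, x)))


def search_mle(counts, steps=1000, alpha=100.0, seed=1):
    """Metropolis-Hastings walk; returns the highest-likelihood state visited."""
    rng = random.Random(seed)
    # Coarse grid of interior starting points (p11, p12, p22).
    grid = [(i / 12, j / 12, (12 - i - j) / 12)
            for i in range(1, 11) for j in range(1, 12 - i)]
    current = max(grid, key=lambda p: log_lik(counts, p))
    best = current
    for _ in range(steps):
        cand = propose(current, alpha, rng)
        # Likelihood ratio times the Hastings correction for the
        # asymmetric Dirichlet proposal.
        log_ratio = (log_lik(counts, cand) - log_lik(counts, current)
                     + log_dirichlet(current, cand, alpha)
                     - log_dirichlet(cand, current, alpha))
        if math.log(1.0 - rng.random()) < log_ratio:
            current = cand
        if log_lik(counts, current) > log_lik(counts, best):
            best = current
    return best


best = search_mle((3, 6, 1))  # closed-form MLE for these counts is (0.3, 0.6, 0.1)
```

For diagnostic markers the result can be compared against the closed-form MLE, which is a convenient benchmark for any such search before applying it to non-diagnostic data.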

Simulations

Evolution of ancestry and heterozygosity in admixed populations

To illustrate how the joint distribution of S and H_I changes in the generations following admixture, I created a simple simulation model following Long’s “intermixture” [39]. The simulation begins with individuals from two parental populations in relative frequencies μ and 1−μ. A first admixed generation of size N is formed by randomly drawing N pairs of parents with replacement and randomly drawing one gamete from each parent to form N diploid offspring. Loci are assumed unlinked, so haploid gametes are formed by randomly drawing one allele from each locus within each parent. This model gives expected frequencies of μ², 2μ(1 − μ), and (1 − μ)² for P1, F1, and P2 genotypes, respectively, in the first generation. Each succeeding generation is formed in the same way by random mating of pairs from the previous generation. I kept track of diploid genotypes to estimate S and H_I through time. R code for the simulations is available as Additional File 1.
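The intermixture model just described can be sketched as follows. This Python version is an independent illustration, not a translation of Additional File 1; it assumes diagnostic markers (alleles coded 1 for species 1 and 0 for species 2), and the function names and default parameter values are hypothetical.

```python
import random


def summarize(ind):
    """Return (S, H_I) for one individual with diagnostic markers."""
    loci = len(ind)
    s = sum(a + b for a, b in ind) / (2 * loci)  # fraction of species 1 alleles
    h = sum(a != b for a, b in ind) / loci       # fraction of heterozygous loci
    return s, h


def simulate(n=100, loci=100, mu=0.5, generations=3, seed=1):
    """Random mating after a single admixture event, unlinked loci.

    Returns one list of (S, H_I) pairs per admixed generation.
    """
    rng = random.Random(seed)
    n1 = round(mu * n)
    pop = ([[(1, 1)] * loci for _ in range(n1)]
           + [[(0, 0)] * loci for _ in range(n - n1)])
    history = []
    for _ in range(generations):
        new_pop = []
        for _ in range(n):
            # Parents are drawn with replacement, so the same individual
            # can occasionally be drawn twice.
            mother, father = rng.choice(pop), rng.choice(pop)
            # Unlinked loci: each parent passes on one randomly chosen
            # allele at every locus.
            child = [(rng.choice(m), rng.choice(f))
                     for m, f in zip(mother, father)]
            new_pop.append(child)
        pop = new_pop
        history.append([summarize(ind) for ind in pop])
    return history


history = simulate(n=50, loci=20, generations=3)
```

In the first admixed generation every individual is a P1, F1, or P2, so all points sit at the three corners of the triangular sample space; later generations fill in the interior.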

To illustrate the effect of ongoing gene flow, I repeated the simulations above with stochastic immigration from unchanging parental populations (the continent-island admixture model [40, 41]). Each generation, individuals in the hybrid population were replaced by pure parentals with probability m (so the expected number of immigrants was Nm). Each immigrant was equally likely to be a P1 or P2 genotype.

Linkage and sampling of the genome

Linkage among markers is expected to affect the sampling variance (hence reliability) of parameter estimates because linked markers will tend to provide redundant information. The assumption that two markers each provide independent information is violated if they are linked (i.e., if the probability of recombination is less than 0.5). In general this should not be a problem if loci represent a simple random sample with respect to recombinational distance [42]. On the other hand, systematic sampling of a linkage map might provide more reliable estimates if the sample covers most of the genome and the sampling interval does not happen to coincide with some natural periodicity [42], e.g., if the sampled loci were always located near centromeres.

To evaluate the potential effects of linkage on bias and sampling variance, I created a simple linkage model. Each model genome included four diploid chromosomes with 100 loci each. The loci were evenly distributed across two chromosome arms, and one recombination event was modeled per chromosome arm per meiosis (a minimal rate based on mammalian disjunction [43, 44]). Recombination breakpoints were drawn with equal probability at any interval on a chromosome arm. This means the recombinational distance between adjacent loci was 2 cM. This certainly does not capture all of the complexities of recombination in real genomes [44–46], but it efficiently models a highly structured genome where many randomly sampled markers will be on the same chromosomes.
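Gamete formation under a model of this kind might be sketched as below. The details are assumptions rather than a description of the actual simulation code: the starting homolog is chosen at random once per chromosome, the gamete switches homologs at one uniformly chosen interval within each 50-locus arm, and the function name is hypothetical.

```python
import random


def make_gamete(hom_a, hom_b, rng, arm_len=50):
    """Build one gamete from a homologous pair with one crossover per arm.

    hom_a, hom_b: lists of alleles along the two homologs; consecutive
    blocks of arm_len loci are treated as chromosome arms.
    """
    current, other = (hom_a, hom_b) if rng.random() < 0.5 else (hom_b, hom_a)
    gamete = []
    for start in range(0, len(hom_a), arm_len):
        # The breakpoint falls in one of the arm_len - 1 intervals
        # between adjacent loci on this arm.
        cut = start + rng.randint(1, arm_len - 1)
        gamete += current[start:cut]
        current, other = other, current  # switch homologs at the breakpoint
        gamete += current[cut:start + arm_len]
    return gamete


# Gamete from an F1 parent: one homolog from each parental line.
rng = random.Random(3)
gamete = make_gamete([1] * 100, [0] * 100, rng)
switches = sum(gamete[i] != gamete[i + 1] for i in range(len(gamete) - 1))
```

With 49 intervals per arm and exactly one breakpoint per arm, adjacent loci recombine with probability of about 0.02, matching the 2 cM spacing stated above.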

Using this model, I simulated F2, backcross, and later generation crosses (up to F10) from parental lines with diagnostic alleles at each marker. For comparison, I simulated the same series of cross types allowing free recombination between all markers (400 unlinked markers). For each simulated individual, I recorded the true values of S and H_I from all 400 loci, and then estimated S and H_I from samples of L = 3, 10, 20, 30, 40, 50 and 60 loci. For the four-chromosome individuals, I compared estimates using simple random sampling to estimates using systematic sampling where a series of L loci at regular 2 cM or 10 cM intervals was obtained by choosing a single random starting locus. For each simulated individual (1000 of each cross type), I estimated the bias and sampling variance from 1000 random samples of markers for each genomic sample size L and sampling regime.

Uncertainty of parental allele frequencies

My implementation of the estimators for S and H_I depends on prior estimates of parental allele frequencies taken as known constants. To briefly illustrate the consequences of inaccurate assumptions about parental allele frequencies, I simulated ten generations of admixture in small populations (N = 50) with different sets of actual parental allele frequencies, and then estimated S and H_I for each individual under different assumed parental allele frequencies. To evaluate the effect of an overall bias, I used four scenarios: (i) parental populations with L diagnostic markers, (ii) L diallelic markers with allele frequencies all equal to 0.9 in one lineage and 0.1 in the other, (iii) L diallelic markers with allele frequencies all equal to 0.8 in one lineage and 0.2 in the other, and (iv) L diallelic markers with allele frequencies all equal to 0.7 in one lineage and 0.3 in the other. For each of these sets of actual parental allele frequencies, I performed estimation under each set of parental allele frequencies as an assumption. I repeated these analyses with L = 3 and L = 50 to assess how uncertainty interacts with marker number.

To evaluate the effect of balanced inaccuracy, I simulated admixture from parental lineages with 25 diallelic markers with allele frequencies all equal to 0.9 in one lineage and 0.1 in the other, and 25 additional diallelic markers with allele frequencies all equal to 0.7 in one lineage and 0.3 in the other, and then performed estimation assuming all 50 markers had allele frequencies of 0.8 and 0.2. Finally, to assess the impact of having just a few known diagnostic markers, I repeated this analysis replacing one locus of each type with a diagnostic locus, and performed estimation assuming those two were diagnostic but still assuming the other 48 markers had allele frequencies of 0.8 and 0.2.

Hybrid Classification

Equations (5) and (6) can be used to calculate the likelihood of predefined genotype frequency classes, as in Anderson and Thompson’s program NewHybrids [23]. For example, the likelihood that an individual is in the parental 1 genotype frequency class is ℓ (p₁₁ = 1, p₁₂ = p₂₂ = 0 | marker phenotype), the likelihood for the F2 genotype frequency class is ℓ (p₁₁ = 0.25, p₁₂ = 0.5, p₂₂ = 0.25 | marker phenotype), etc. This provides an instructive comparison between the research goals of estimating ancestry and heterozygosity vs. classifying individuals into genealogical categories. First, as noted clearly by Anderson and Thompson [23] among others [21, 22], the one-to-one correspondence between genotype frequency class and genealogical class (parental, F1, backcross, etc.) applies only to the first two generations of interbreeding, and arbitrarily similar classes become indistinguishable in practice (Figure 2). Second, for most purposes, the value of knowing the genealogical class is as an indicator of the most likely genotype frequencies, not vice versa [22]. That is, there is no more genetic information in the classification “backcross to parental 1” than in the set of expected genomic proportions p₁₁ = 0.5, p₁₂ = 0.5, p₂₂ = 0.0 [31]. Finally, the pitfall of classifying samples from a wild population into a limited set of predefined categories is that a best classification will be obtained even if the set of assumed genealogical classes is not relevant (e.g., after more than two generations of admixture).

The most valuable inference from genealogical classification of wild samples is in identifying situations where F1 hybrids are infertile so later generations are never formed [26], or distinguishing brand new hybrid zones from hybrid swarms that are several generations old and therefore unlikely to contain any true parental or F1 individuals [27]. This can be accomplished by evaluating whether any individuals have F1 or parental likelihoods that are (i) sufficiently greater than their likelihoods for other genotype frequency classes to rule those classes out, and (ii) sufficiently close to the likelihood at the maximum likelihood estimates of ancestry and interclass heterozygosity that the hypothesized classifications cannot be rejected. One approach is to accept a putative classification as credible if the log-likelihood of the best-fit class is over 2 units greater than the log-likelihood of the second best-fit class and within 2 units of the maximum log-likelihood. The first criterion is based on the approximate equivalence of a 2-unit log-likelihood interval to a 95 percent confidence interval for some distributions [47, 48]. The second is based on the conventional penalty of two log-likelihood units for an additional estimated parameter in model selection [49, 50]. The classification model can be viewed as having one free parameter (for an individual, once the best-fit class is set to “chosen”, the other five are constrained to “not chosen”), while the continuous model has two (S and H_I). This approach has the disadvantage of effectively treating the classification as a null model, which is not biologically justified. A better approach is to accept the classification only if its AIC is lower than the AIC of the MLE (in this case, equivalent to a criterion of within 1.0 log-likelihood units of the MLE). Note that the AIC of the best classification cannot be less than the AIC of the MLE by more than 2 (the case where the MLE is identical to the expectation for a class). This approach avoids the pitfall of assuming that individuals fall into a small set of discrete classes, and instead directly evaluates the validity of classification relative to the continuous model MLE.
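The two criteria (a margin of 2 log-likelihood units over the second-best class, and a lower AIC than the continuous model) can be sketched for diagnostic markers as follows. This Python illustration is not the HIest code. The genomic proportions for parental 1, F2, and backcross to parental 1 are those given in the text; the values for parental 2, F1, and the other backcross are filled in here by symmetry and by the definition of an F1, and the class labels and function names are hypothetical.

```python
import math

# Expected genomic proportions (p11, p12, p22) for the six standard classes.
CLASSES = {
    "P1": (1.0, 0.0, 0.0), "P2": (0.0, 0.0, 1.0), "F1": (0.0, 1.0, 0.0),
    "F2": (0.25, 0.5, 0.25), "BC1": (0.5, 0.5, 0.0), "BC2": (0.0, 0.5, 0.5),
}


def log_lik(counts, p):
    """Equation (5) for diagnostic markers; counts = (n11, n12, n22)."""
    total = 0.0
    for n, q in zip(counts, p):
        if n > 0:
            if q == 0.0:
                return -math.inf  # observed genotype impossible for this class
            total += n * math.log(q)
    return total


def classify(counts):
    """Return the best-fit class name, or None if it is not credible."""
    n = sum(counts)
    mle = tuple(c / n for c in counts)  # closed-form MLE of (p11, p12, p22)
    ll_max = log_lik(counts, mle)
    ranked = sorted(CLASSES, key=lambda c: log_lik(counts, CLASSES[c]),
                    reverse=True)
    ll_best = log_lik(counts, CLASSES[ranked[0]])
    ll_second = log_lik(counts, CLASSES[ranked[1]])
    # Criterion (i): best class beats the runner-up by more than 2 units.
    distinct = ll_best - ll_second > 2.0
    # Criterion (ii): AIC = 2k - 2 lnL, with k = 1 for the classification
    # model and k = 2 for the continuous model (S and H_I).
    aic_ok = (2 * 1 - 2 * ll_best) < (2 * 2 - 2 * ll_max)
    return ranked[0] if (distinct and aic_ok) else None


all_het = classify((0, 65, 0))     # heterozygous at all 65 markers
near_f2 = classify((16, 33, 16))   # close to F2 expectations
skewed = classify((30, 20, 15))    # fits no class well
```

The third individual has a best-fitting class (F2, the only class with nonzero likelihood), yet it is rejected because its likelihood falls well below that of the continuous-model MLE. This is the situation in which naive classification would be misleading.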

Examples

To illustrate inferences based on S and H_I, I analyzed two published data sets. The first is a sample of hybrid tiger salamanders from a 60-year-old hybrid swarm where we expect to find no true parental or F1 individuals [51]. The second is from a hybrid zone between Ensatina salamanders in southern California, where Devitt et al. [52] inferred that a large proportion of individuals in the hybrid zone were in fact F1 hybrids, based on analysis with NewHybrids. To describe ancestry and interclass heterozygosity in these datasets and evaluate support for the existence of true F1 hybrids in the wild, I wrote functions in R [37] to find the joint maximum likelihood estimates of S and H_I, and to evaluate the likelihoods of the six genotype frequency classes typically of interest (corresponding to the expectations for pure parentals, F1’s, F2’s and first backcrosses in each direction). These functions and others used in this paper are available as a CRAN package called “HIest” (for “hybrid index estimation”) at http://cran.r-project.org/web/packages/HIest/index.html.

Introduced x native hybrid swarm in tiger salamanders

Barred Tiger Salamanders (Ambystoma tigrinum mavortium) were deliberately introduced from Texas to California in the 1940s and 1950s [53]. They have been interbreeding with the native California Tiger Salamander (A. californiense) in ponds throughout the Salinas Valley for roughly 20-30 generations. Thus, unless there has been an unknown source of new “pure” Barred Tiger Salamanders in the recent past, it is extremely unlikely that any true F1, F2, or backcross individuals exist in the wild.

Fitzpatrick et al. [51] used 65 putatively diagnostic markers (one allele assumed fixed in each ancestral population) to genotype 255 salamander larvae from five breeding ponds. This example is instructive because diagnostic markers allow use of the closed-form MLE’s as benchmarks for testing the optimization, and the large number of markers gives high precision in evaluating how the distribution of hybrid genotypes varies across populations and whether any populations might contain putatively pure parentals or F1’s.

A natural hybrid zone in Ensatina

Ensatina eschscholtzii is a classic example of the “ring species” pattern illustrating the gradual evolution of reproductive isolation and distinctiveness between species taxa [54–58]. Devitt et al. [52] analyzed a narrow hybrid zone in southern California between the distinctive forms E. e. eschscholtzii and E. e. klauberi using one mitochondrial and three nuclear loci assayed for 335 salamanders densely sampled from across the contact zone. They used NewHybrids [23] and STRUCTURE [19, 35] to estimate ancestry (the Bayesian Q-value estimates the same underlying quantity as S here), and classified as “hybrids” the 46 individuals with point estimates between 0.1 and 0.9. Of these, 22 were classified as F1 hybrids and 24 as F2 or backcrosses based on posterior probabilities from NewHybrids. I used their nuclear data (published as online supplementary material) to compare their inferences to my joint likelihood estimation of S and H_I. This example is instructive because the small number of non-diagnostic markers should give considerably less precision than the tiger salamander example, and because the high frequency of F1 hybrids is biologically significant if the inference is credible.

The nuclear markers used by Devitt et al. [52] were not diagnostic, so I repeated their analysis using the admixture model in STRUCTURE (version 2.3.2) with standard settings to estimate “ancestral” allele frequencies to use as givens (f_ij1, f_ij2) for my likelihood calculations. I also saved the Q-values estimated by STRUCTURE to compare to my MLE’s of S (though the inferences are obviously not independent because both depend on the parental allele frequencies inferred by STRUCTURE). This reliance on external estimates of parental allele frequencies is a weakness of my implementation, but I suspect that my approach could be integrated in a fully Bayesian analysis using NewHybrids [23], STRUCTURE [19, 34, 35], or Introgress [59] as a starting point. To evaluate support for classification of Ensatina hybrids into the six standard classes, I once again used both criteria: (i) classification required a difference of two log-likelihood units between the best-fit class and any other, and (ii) the best-fit class had to have lower AIC than the joint MLE’s of S and H_I.

Sampling and false classification

To further explore how the number of markers assayed affects erroneous classification, I took the tiger salamander data from Bluestone Pond and Toro Pond (Figure 3a and e) and randomly subsampled markers and recalculated the likelihoods of the six hybrid classes and the joint MLE of S and H_I. I randomly subsampled three markers (without replacement) and repeated the analysis 1000 times. Then I did the same for samples from 5 to 60 (out of the total of 65) in increments of 5. Given the history of the tiger salamander hybrid swarm and the low frequency of classification using the full dataset, I considered any “successful” classification a false positive.

Because the primary value of classification is in the identification of true F1 or pure parental genotypes [25], I also specifically assessed the frequency with which individuals were classified as parental or F1. For diagnostic markers, this can happen only if an individual is homozygous for alleles of the same parental species at all markers, or heterozygous at all markers, respectively. In these cases, the likelihood of the classification is equal to the maximum likelihood, and the AIC-based test will always favor the classification over the continuous model because of the difference in degrees of freedom. However, for small numbers of markers, spurious inference can be made because all markers might be heterozygous or homozygous by chance. For example, in a true F2 or backcross, 50% of markers are expected to be heterozygous and the probability of sampling three heterozygous markers by chance is (1/2)³ = 0.125. To avoid spurious inference, investigators should avoid classifying individuals based on small numbers of markers [21]. For example, the expected number of individuals, among n F2’s, with heterozygous genotypes at all L markers is α = n (1/2)^L. So, in order to maintain an experiment-wise error rate of α, one would need at least

L = \frac{\log n - \log \alpha}{\log 2}

(7)

markers. Although this applies precisely only in the case of F2 hybrids and diagnostic markers, it might be taken as a rule of thumb in the absence of other criteria. In the case of the Ensatina data with 46 putative hybrids and three markers, we might expect 5.75 false F1’s and would have wanted 10 markers to keep the error rate near 5%.
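The arithmetic behind equation (7) and the Ensatina example is easy to reproduce. The short Python sketch below uses hypothetical function names and rounds the marker count up to the next whole marker.

```python
import math


def expected_false_f1(n, markers):
    """Expected number of n true F2's that are heterozygous at every marker."""
    return n * 0.5 ** markers


def markers_needed(n, alpha):
    """Equation (7), rounded up to a whole number of diagnostic markers."""
    return math.ceil((math.log(n) - math.log(alpha)) / math.log(2))


false_f1 = expected_false_f1(46, 3)   # 46 putative hybrids, three markers
needed = markers_needed(46, 0.05)     # markers for an error rate near 5%
```

For the values used in the text this gives 5.75 expected false F1’s and a requirement of 10 markers.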

Results and Discussion

Evolution of ancestry and heterozygosity in admixed populations

Figure 2 shows S and H_I from a single random simulation for N = 100 with 100 diagnostic codominant markers. The case is typical in showing clear genotypic clusters corresponding to parentals, F1’s, F2’s, and backcrosses in the first two generations, followed by a few generations with high variance of S, effectively looking like a continuum between backcross-like and F2-like genotypes (0.25 < S < 0.75, H_I near 0.5). By N/10 generations almost all individuals are clustered around S = H_I = 0.5, and the population slowly becomes more homozygous as alleles are lost by drift (S remains roughly constant while H_I declines toward zero).

Figure 4 illustrates the effect of ongoing immigration from parental gene pools. With N = 100 and m = 0.10, a stationary distribution was reached at generation 3. The distribution fluctuates from generation to generation, but a wide range is consistently observed. With lower immigration (Nm ≤ 1), results were similar to the no-gene-flow scenario in Figure 2, but H_I remained moderate instead of dropping toward zero. With Nm = 1, the population settled in a steady state similar to t = 50 or t = 100 in Figure 2.

The same basic patterns can be seen when the loci are not entirely diagnostic (e.g., parental allele frequencies of 0.9 vs 0.1). However, when estimates were based on fewer markers, or less informative markers, it was often impossible to discern discrete genotype clusters by generation 2 (e.g., see Figures 5 and 6).

Codominant markers

Maximum likelihood estimates of S and H_I appear consistent and unbiased for known codominant genotypes (Figure 5). Precision depends on the number of markers and how ancestry-informative they are (how different the known parental allele frequencies are). The simplicity of the triangular sample space makes it easy to visualize the likelihood surface for any individual and get a feel for the uncertainty around an estimate. Figure 5 illustrates that a large number of highly informative markers are needed for precise inference about any single genotype.

Dominant markers

Maximizing the log-likelihood for dominant markers also gives unbiased estimates of S and H_I (Figure 6). With the inherently lower information content of dominant markers, more markers are needed for precision, as seen in other methodological studies [24, 60, 61]. These markers are less informative about heterozygosity, hence the elongated ellipses in Figure 6. The method works well as long as there is a mixture of loci for which the dominant allele is more common in ancestral species 1 and other loci for which the dominant allele is more common in ancestral species 2. The validity of the estimates depends on the validity of homozygous recessive genotypes as information about p₁₁ and p₂₂. If, for example, the absence of a PCR product or of a particular band on a gel cannot be interpreted as a homozygous recessive genotype, the marker system should not be used for this or any other method relying on typical population genetic assumptions.