Bayesian inference of population size history from multiple loci
 Joseph Heled^{1} and
 Alexei J Drummond^{1, 2}Email author
DOI: 10.1186/147121488289
© Heled and Drummond; licensee BioMed Central Ltd. 2008
Received: 22 February 2008
Accepted: 23 October 2008
Published: 23 October 2008
Abstract
Background
Effective population size (N_{ e }) is related to genetic variability and is a basic parameter in many models of population genetics. A number of methods for inferring current and past population sizes from genetic data have been developed since JFC Kingman introduced the ncoalescent in 1982. Here we present the Extended Bayesian Skyline Plot, a nonparametric Bayesian Markov chain Monte Carlo algorithm that extends a previous coalescentbased method in several ways, including the ability to analyze multiple loci.
Results
Through extensive simulations we show the accuracy and limitations of inferring population size as a function of the amount of data, including recovering information about evolutionary bottlenecks. We also analyzed two real data sets to demonstrate the behavior of the new method; a single gene Hepatitis C virus data set sampled from Egypt and a 10 locus Drosophila ananassae data set representing 16 different populations.
Conclusion
The results demonstrate the essential role of multiple loci in recovering population size dynamics. Multilocus data from a small number of individuals can precisely recover past bottlenecks in population size which can not be characterized by analysis of a single locus. We also demonstrate that sequence data quality is important because even moderate levels of sequencing errors result in a considerable decrease in estimation accuracy for realistic levels of population genetic variability.
Background
Coalescent theory has been described as "the most significant progress in theoretical population genetics in the past two decades of [last] century" [1]. Methods based on coalescent theory enable estimation of both current and past population sizes directly from genetic data. Effective population size (N_{ e }) is linked to the rate of genetic drift and inbreeding, and is useful for example when investigating the possibility of interbreeding between Neanderthals and early humans [2], or when looking at patterns of genetic variation in human genes [3]. Even when population size history is of secondary interest the incorporation of models of population size may improve the genetic mapping of diseases and estimation of genetic traits [4].
A concise introduction to the coalescent
The coalescent was formally introduced by Kingman in 1982 [5]. This was the culmination of eight years of developing the "circle of ideas known as the coalescent" [6] which brought into light the essence of the relation between population size and ancestry. Informally, the larger the population, the longer any two individuals have to trace their ancestry back in time until meeting their common ancestor (concestor [7]). The meeting of those two ancestral lineages is known as the coalescence event.
To formally develop the theory one has to assume the population is mixing perfectly so that all members of the same generation have equal probability of being the ancestor of any member in the next generation. Kingman introduces the idea using an idealized WrightFisher population but shows how the Moran model also has the coalescent as its diffusion limit (for a good summary of these models see [8]).
Consider two random members from a population of fixed size N. By perfect mixing, the probability they share a concestor in the previous generation is 1/N. The probability the concestor is g + 1 generations back is $\frac{1}{N}{\left(1\frac{1}{N}\right)}^{g}$. This elementary reasoning shows that g, as a random variable, has a geometric distribution with a success rate of λ = 1/N, and so has mean N and variance of N^{3}/(N  1).
With n lineages the time to the first coalescence is derived in the same way, only now there are $\left(\begin{array}{c}n\\ 2\end{array}\right)$ possible pairs that may coalesce, resulting in a success rate of $\lambda =\left(\begin{array}{c}n\\ 2\end{array}\right)/N$ and mean time to coalescence of $N/\left(\begin{array}{c}n\\ 2\end{array}\right)$.
This assumes N is much larger than O(n^{2}). An interesting consequence is that the total number of generations required for n lineages to coalesce into one is ${\sum}_{i=2}^{n}N/\left(\begin{array}{c}i\\ 2\end{array}\right)}=2N(11/n)$, which is always less than 2N regardless of n.
Kingman goes on to show that as N grows the coalescent process converges to a continuous time Markov chain. For the above λ = 1/N is the instantaneous probability of coalescing, i.e. the probability of coalescing on a short time interval Δt is O(λΔt). Unsurprisingly the solution turns out to be the exponential distribution f(t) = λe^{λt}, the continuous equivalent of the geometric distribution.
In the rest of the article we shall call N = N(t) the demographic function or sometimes just demographic when the context is clear. Note that while N(t) may take any form whose inverse can be integrated, the density is characterized by two numbers only, the intensity (average of population size inverse over the interval) and population size at the end of the interval.
One nice feature of coalescent theory is that demographic inference depends only on coalescent times and so can be coupled with any method which can estimate the genealogy in a statistically consistent manner. There are numerous examples of the coalescent theory being coupled with different data and estimation procedures for the coalescent times. Tavare et al estimated coalescent times from genetic data using only the number of segregating sites while assuming a known demographic function [10]. Coalescent theory has also been used within Markov chain Monte Carlo (MCMC) algorithms to estimate constant [11] and exponentially growing [12] population sizes using microsatellite data. More recently the Skyline Plot, a maximum likelihood estimator (MLE) of the demographic function for a known genealogy, was introduced [13]. This was further extended in the form of the Bayesian Skyline Plot (BSP) [14], a MCMC method that estimates the demographic function directly from sequence data and provides much needed credible intervals. This paper extends the BSP further by allowing the joint analysis of multiple loci and eliminating the requirement to prespecify the model dimensionality. An alternative approach using reversible jump MCMC to dynamically change the dimensionality of the demographic model was previously introduced [15], but was implemented for a known genealogy only. More recently an alternative which uses Gaussian Markov random fields (GMRFs) to achieve temporal smoothing of the population sizes has been developed [16], but this method does not currently support multiple loci.
Inferring population size from coalescent times
Estimating changes in population size is a challenging task, especially when the magnitude of the change is small. Even if the exact coalescence times are known the space of demographic functions capable of generating them is large. In general the coalescent times are unknown and have to be estimated by phylogenetic reconstruction, thus increasing the set of plausible demographic functions further. In addition, the "observation window" is limited by the last coalescent which is on average 2N generations in the past. Moreover, under a constant population size it is expected that half of the time to the root of the tree is spanned by just one coalescent interval (one data point), so no dynamics can be inferred, just the average population size over that time span. Similarly (under constant size expectations) just two data points account for 2/3 of the tree, and so on. Thirdly evolutionary bottlenecks (periods with smaller population sizes) have shorter coalescence times, reducing the observation window further.
Faced with these problems, one might be tempted to increase the number of samples. However the returns from such an investment diminish quickly – the additional coalescent events occur inside a small stretch of time. It is much better to add sequences from independent loci from the same population since all loci share the same demographic history. The assumption of independence requires loci from different chromosomes or sufficiently distant to one another to be considered unlinked by recombination. Doubling the number of (independent) loci doubles the amount of information over the whole "observation window". In this paper we introduce the Extended Bayesian Skyline Plot (EBSP), a new variabledimension Bayesian method for inferring nonparametric population size changes through time from multiple loci. The EBSP builds on the Bayesian Skyline Plot (BSP) [14] in several ways:

Permits analysis of multiple loci. Any number of unlinked nuclear or mitochondrial loci from individuals in the population may be combined to infer the shared population size history. Each loci may have its own population factor which takes care of differences in ploidy and inheritance. For example, In many animals the population of alleles of a nuclear gene is four times greater than that of mtDNA since there are two copies of the nuclear gene in each individual and mtDNA is inherited exclusively maternally in most species. In that case the population factor can be set to 4 for nuclear genes and 1 for mtDNA if inference of mtDNA gene population size is desired, or to 2 for nuclear genes and 1/2 for mtDNA to infer the number of individual animals.

Uses Bayesian stochastic variable selection. The original BSP required the researcher to arbitrarily choose i the number of population size steps, or control points. The demographic function is then constrained to be a piecewise constant function with exactly i distinct levels. It is not obvious how to a priori choose i, and a poor choice may lead to larger credible bounds and in more extreme cases inhibit convergence. The EBSP, in a true Bayesian spirit, lets the data select the appropriate smoothness of the demographic function using Bayesian stochastic variable selection (BSVS) [17–19].

Supports piecewise linear demographic functions. Because real life population size dynamics tend to be continuous, a piecewise linear demographic function will generally be a more appropriate model than the piecewise constant function used by the BSP.
Methods
Consider the most simple case of two lineages and constant population size N. Given that the time to coalesce was t, what can be said about N? The maximum likelihood estimate (i.e. the value which maximizes f(tN)) is N = t [20]. While this is the best point estimate possible, the large variability inherent in a stochastic process driven by an exponential distribution makes a point estimate unsatisfactory. The Bayesian framework provides a way to quantify the uncertainty using Bayes rule, $f(Nt)=\frac{f(tN)f(N)}{f(t)}$.
With the natural non informative prior for a scale variable f(N) = 1/N [21] the cumulative probability can be solved which enables us to obtain an explicit numerical solution for the credible interval. This turns out to be [0.093t, 19.504t], an order of magnitude at both sides of t. Point estimates from this distribution can also be computed. The mode is $\sqrt{t/2}$, median is t/ln(2) and the mean is infinite.
On the other hand, a genealogy with n lineages contains n  1 independent observations via the time intervals between subsequent coalescence events, providing a much better estimate of N. However, it should be noted that point estimates from a Bayesian inference typically contain a builtin bias. For example the bias for the median in the two lineages case above is on average 44% (1/ln(2)  1). The bias in real life data sets which contain multiple loci from several individual is naturally much reduced.
The likelihood of genealogies from multiple loci
The likelihood of the EBSP is derived from m genealogies in the form of rooted trees, denoted G = {g_{1}, g_{2}...g_{ m }}, were g_{ k }is estimated from n_{ k }contemporaneous sequences. The time scale for all genealogies should match that of the target population being estimated, but the substitution rate may vary among loci. For example, mtDNA is known to evolve at a much faster rate than nuclear DNA [22, 23] so when combining both in one analysis the difference in substitution rate needs to be estimated. We designate the set of substitution rate parameters μ to keep the notation from becoming overly confusing. In addition let P = p_{1}, p_{2},...,p_{ k }be the population size factor of g_{ k }, which accounts for any differences in ploidy and/or mode of inheritance among loci. The internal nodes of each genealogy g_{ k }define n_{ k } 1 coalescent event times ${u}_{k}={u}_{k,{n}_{k}},{u}_{k,{n}_{k}1},\mathrm{...},{u}_{k,2}$, where u_{k,j}is the time j lineages have coalesced into j  1. The start point is fixed at zero, ${u}_{k,{n}_{k}+1}=0$.
The perlocus demographic function depends on the population size factor as well, ${\widehat{\theta}}_{k}(t)={p}_{k}\theta (t)$.
The use of the ranked coalescence times in the construction of the demographic function is natural since from equation 1 each interval contains the same amount of information. Associating one parameter with each "coalescent interval" (i.e. an interval where all lineages survive terminated by a coalescent event) helps to avoid over specifying the demographic function. This approach has been used in the BSP [14] and more recently in the Bayesian skyride method [16].
The priors on all of the θ's contribute to the posterior, but only active ones participate in the demographic function, and therefore the coalescent likelihood in equation 5. Therefore, when inactive, θ_{ j }follows just the prior distribution but when active it follows the posterior (prior and coalescent).
When an appropriate prior value for the mean population size is not known in advance ϕ may be estimated in a hierarchical manner. A suitable prior distribution is selected and ϕ is allowed to change under that prior. One may choose the scalefree reference prior (f_{ ϕ }(ϕ) = 1/ϕ) as the least informative option or a socalled "diffuse" prior such as a lognormal with high variance. Note that using the reference prior may lead to very slow mixing and so it may be advisable to follow a "empirical Bayes" approach as suggested in [15] and obtain an estimate from the data itself. It is important to remember that since ϕ is the mean of an exponential distribution the choice of prior will make little difference unless the amount of data is small. However selecting a ϕ (or a prior for it) which is smaller by two orders of magnitude or more than the truth may, in our experience, cause non convergence of the chain (data not shown). Selecting higher values may slow mixing but do not appear to impact convergence. In practice fixing ϕ to a large enough value works very well (simulation results not given here).
Our simulation studies used a lognormal prior with a standard deviation in logspace of 2 and the mean (in real space) was randomly selected uniformly from the interval [0.5$\widehat{\varphi}$, 3.5$\widehat{\varphi}$] ($\widehat{\varphi}$ is the mean of the true demographic function averaged over the ages of the simulated trees).
The choice of $\overline{\lambda}$ = ln(2) gives a 50% prior weight to a constant population size and 50% to a nonconstant one. This prior is used in all simulations and analyses unless specifically noted. The prior parameter $\overline{\lambda}$ may be increased to indicated stronger prior belief in a nonconstant demographic. However data with some support for changes in population size will tend to overcome the prior and a prior that focuses probability on a small number of changepoints will tend to result in narrower credible intervals.
The method works for serially sampled data as well, and has been implemented in the software package BEAST [24] for both contemporaneous and timestamped data. The definitions above can be easily modified to accommodate genealogies containing nonzero tip times but the notation is even more cumbersome and therefore for clarity is not given here. For an example of BEAST XML input file of 32 loci [see Additional file 1].
MCMC Implementation
The term f_{ D }{D_{ k } g_{ k }, μ} is the genealogy likelihood calculated from the data and model parameters using standard methods [25]. The genealogy can be sampled by any of the published methods available in BEAST [24].
Inactive θ's can be drawn directly from the prior θ_{ i }~ Exp(ϕ). The posterior distribution for active θ's has no closed form and is sampled by applying generic scaleparameter proposal schemes (the scale operator) available in BEAST to each parameter of population size.
The indicators are sampled by combination of a bitflip operator and Poisson weighting. By itself the bitflip generates samples with a stationary probability of $Pr(\Lambda )=\frac{1}{(n+1)\left(\begin{array}{c}n\\ r\end{array}\right)}$ taking care of the first term in equation 7, while the Poisson weighting accounts for the second half.
The Poisson weighting contributes an additional factor of r/$\overline{\lambda}$.
The ratio for changing a 0 to 1 is derived in the same way and is equal to $\frac{r+1}{nr}\frac{\overline{\lambda}}{r+1}$.
Results and Discussion
To demonstrate some of the features of EBSP and the inference of demographic functions in general, this section describes the results of simulations performed using an implementation of EBSP in BEAST [24] (The BeastEBSP program is available from http://beastmcmc.googlecode.com/ and an example XML input file is available as an additional file). Simulations are invaluable during the development of a new method and in addition may provide insight into the properties of the EBSP in particular and demographic inference in general. Having said this, the simulation results we present here are aimed to serve only as examples of the various issues that need to be considered, since only a small subset of parameter combinations can be feasibly explored.
Here are the steps taken to generate a simulated data set. First pick a demographic N(t), number of loci, ploidy and number of samples for each loci. Then, for each locus, a genealogical tree T is simulated under N(t) and the coalescent process. Then a set of sequences is simulated using T and a DNA substitution model (which has its own set of parameters). The sequences are then used as the raw data for an MCMC run, and posterior samples from the run are used to estimate the population size history (and other parameters of the model if required). Note that while the simulated data contains the complete set of loci for each "individual" this is not strictly necessary. Data sets containing only a subset of loci for some individuals can be used as well.
Bayesian summary for functions
Summary statistics computed from posterior samples are the standard and straightforward way to present results of an MCMC run. When dealing with functions, however, there is no direct equivalent for single value statistics such as a median or Highest/Central Posterior Density (HPD/CPD [27]). For example there is no obvious way to pick one population size function out of the MCMC sample which represents a "middle" or a "center" in the same way a median does for single value statistics. However such statistics are highly useful for the purpose of visualization, quantification and comparison and are easily constructed from multiple estimates at specific grid points [28].
A piecewise linear function connecting the median population size estimated at specific time points is a natural choice for the median when posterior samples are piecewise linear functions. Since the demographic has a natural resolution limit n, the total number of coalescence events, we propose that the construction of the median demographic also use n points which are estimated by the mean times of the ordered coalescent events over all posterior trees. This is different from the approach currently taken by Tracer [29] which uses a fixed (say 100) evenly spaced points. This inappropriately ignores the natural spacing of coalescent events and lumps together information from several intervals at the beginning while half the points at past times are essentially based on the age of a single coalescent interval. It is of course possible to refine the time axis further by introducing more points between mean coalescent event but this adds only a small amount of information and we prefer to keep a visual clue about of the amount of data the estimate is based upon. It should be noted that these choices are relevant only for summarizing the posterior distribution. Specific hypothesis tests related to population size history should always be constructed directly from the posterior samples and not from any summary statistic.
By comparing the median demographic ${{N}^{\prime}}_{50}(t)$ to the true demographic N(t) using a function norm, from t = 0 to the median root height, t = τ, we can define a relative recovery error:
$err({N}^{\prime}(t),N(t))=\left\right\frac{{{N}^{\prime}}_{50}(t)}{N(t)}1{}_{2}=\sqrt{{\displaystyle \underset{0}{\overset{\tau}{\int}}{\left(\frac{{{N}^{\prime}}_{50}(t)N(t)}{N(t)}\right)}^{2}dt}}.$
Note that while the word error above is used to describe the average distance between an estimate and the truth, this quantity contains a bias component which is to be expected of a Bayesian estimate of a scale parameter.
where I is the indicator function.
Estimation when population size is constant
The first set of simulations was set up to reflect a typical data set with current technology. We simulated three nuclear markers and one mtDNA gene from 12 individuals sampled from a constantsized population of 50,000. Twentyfour recombination free sequences of length 1600 were generated for each nuclear marker but only 12 for the mtDNA, mimicking a situation where both alleles are sequenced from each nuclear locus. We used the HKY85 [30] substitution model to describe mutation within each locus with κ = 2.5 for the nuclear loci and κ = 15 for the mtDNA. The mtDNA locus substitution rate was set to 10^{7}, 20 times faster than that of the nuclear (5 × 10^{9}), producing a mtDNA tree with an average height of approximately 0.01 substitutions (2 × 5 × 10^{4} × 10^{7}). Note that although the MCMC runs try to mimic real life usage and estimate all unknown parameters (κ, mtDNA μ etc), only results for the demographic function are given below.
Constant population size, N = 50,000
model  % inside 95 HPD  median relative error  median relative bias  HPD relative bounds 
EBSP  97.65%  0.22  0.053  1.23 
constant population size  96%  0.17  0.009  0.9 
constant population size, fixed trees  96%  0.086  0.006  0.44 
This initial set of simulations demonstrates that even with four loci the credible interval has the same order of magnitude as the population size itself. Furthermore, the uncertainty that arises from estimating the coalescent times from the genetic data (about 40%) is comparable to the uncertainty arising from estimating the population size from the coalescent times. The remaining 20% is due to using the EBSP when the demographic function was in fact constant (this can be considered to be the price of model averaging).
Constant population size, 10× more sites
model  % inside 95 HPD  median relative error  median relative bias  HPD relative bounds 
EBSP (16000 bp)  96.84%  0.13  0.034  0.77 
Constant population size, 10× population size (N = 500,000)
model  % inside 95 HPD  median relative error  median relative bias  HPD relative bounds 
EBSP  98.94%  0.12  0.008  0.71 
constant population size  99%  0.1  0.024  0.54 
constant population size, fixed trees  99%  0.079  0.005  0.44 
The second and third sets of simulations show how longer sequences or a larger population effectively narrow the credible intervals by allowing better estimation of coalescent times. When studying smaller populations it is advisable to use longer sequences in order to ensure accurate estimates of the branch lengths. However we note that this is not necessarily possible with nuclear loci that experience large amounts of recombination.
Number of loci vs. error
Empirically, both the error and 95% credible interval are reduced by a factor of $\sqrt{2}$ when doubling the number of loci. This suggests that the relation between median error/HPD interval size and the number of coalescent points (sample size) follows the pattern of simpler cases where doubling the sample size reduces the variance by half. A rigorous proof requires an analytical solution for the median and HPD and is a non trivial task beyond the scope of this paper.
Number of samples vs. error
The exact form of the errors is unclear but it does seem to fit an inverse relation with a positive limit. The lower bound depends on the nature of the demographic and on the number of loci as well. Unsurprisingly the bounds and error for the constant demographic are smaller than those for the nonconstant case, but the total amount of reduction is approximately the same beyond 8 samples. For example using 16 samples instead of 8 gains an additional reduction of the HPD bounds by 16% and 14% respectively.
Detecting evolutionary bottlenecks
Evolutionary bottlenecks present a tough challenge for reconstruction of population size history. Periods of low population size increase the rate of coalescence and limits the number lineages that survive the bottleneck, therefore severely reducing the ability to detect changes in population size prior to the bottleneck time. Interest in the effect of evolutionary bottlenecks on genetics predates the coalescent. For example Nei and colleagues investigated the effect of a population bottleneck on the expected heterozygosity for a neutral locus [31]. A more recent study [32] demonstrates the difficulties in analyzing population structure from contemporary sequences using several methods.
While there is an equal number of coalescent events in both cases, only the analysis of multiple loci is able to "see" past the first bottleneck and even past the second. It is important to stress that this is a carefully constructed example. Lowering the population size at the bottlenecks or changing the difference between the maximum and minimum population sizes will alter the number of loci required for a successful recovery.
Testing for a non constant demographic function
Effect of DNA sequencing errors
The effect of DNA sequencing errors
% inside 95  median relativeeerror  median relative bias  HPD relative bounds  
Original data, no errors HPD  97.65%  0.22  0.053  1.23 
0.01% sequencing error  96.05%  0.53  0.29  1.95 
0.1% sequencing error  73.97%  29.54  12.22  65.28 
For this particular setting an error rate of 0.1% has a catastrophic effect. Even a rate of 0.01% was enough to double the size of the credible intervals. It is clear that data quality is an important prerequisite for a successful population genetic analysis.
Hepatitis C Virus in Egypt
The epidemic history of Hepatitis C virus in Egypt has been previously analyzed using both the BSP [14] and parametric Bayesian coalescent analysis [35]. We take another look at the data to examine the two nonparametric methods side by side and investigate the effects of prior and model choice.
Drummond et al [14] note that the sequences "contain ample phylogenetic information". Our simulation studies suggest that while 63 samples seem sufficient, a single loci can only detect general trends and not finer details. In this case both methods agree on a sharp decline (going back in time) at around 50–60 years ago. However the BSP favours a constant demographic until that time, while the EBSP favours steady decline and constant demographic from 60 years ago backwards. We suggest this difference is an effect of model and prior choice and we present one run (out of several) which supports this view.
The population size history of Drosophila ananassae
Aparup et al [37] investigated the demography and population structure of Drosophila ananassae. The authors collected samples from 160 individuals over 16 geographic locations and sequenced 10 fragments whose length ranges from 371 to 487, giving a total of 650 kb.
Our simulation study has already demonstrated how posterior probabilities of r (the number of change points) can be used to infer demographic change. Aparup et al found support for population expansion in 4 of the central populations – BKK, BOG, KL, and MNL (Location names are abbreviated as in the original article). Our analysis finds strong evidence for population expansion in 5 of the 16 populations (BKK, BOG, CH, KK and MNL). In each of the five populations the posterior probability of a nonconstant population was greater than 95%, i.e. Pr(r > 0) > 95%.
As a more lenient alternative one may use the Bayes factor $\frac{Pr(Dr\ne 0)}{Pr(Dr=0)}$ to test for the presence of change [38]. The populations of KL and PUR with Bayes factors of 5.2 and 4.4 respectively show "substantial" support according to the suggested interpretation of Kass and Raftery, but whether those values are large enough is a matter of personal preference.
The expansion in the Indian CH and PUR populations started more recently than the expansion in the central areas. The Puri (PUR) population seem to experience a trend which is somewhere in between that of Chennai (CH) to its Southwest which shows clear expansion and Bhubaneswar (BBS) to its Northeast having no expansion.
Performance and Mixing
We would like to briefly comment on the EBSP performance for the two real data sets presented above. The single locus Hepatitis C data was analyzed using both the EBSP and the BSP. In both cases the chain length was set to 100 m and sampled every 2000 states, leaving 49500 samples after removing 10% burnin. The EBSP run took 2.6 hours on a Quad core Intel Xeon CPU at 2.66 GHz and the BSP took 3.6 hours. The effective sample size (ESS) for the posterior was 5200 for the EBSP and 630 for the BSP, giving a rate of approximately 2 seconds per EBSP effective sample and 20 seconds for the BSP. A closer look reveals that the component responsible for the slower mixing is the genealogy likelihood, i.e. likelihood of sequence data given the tree (f_{ D }{D_{ k } g_{ k }, μ}). Since the coalescent acts as a prior for the tree it seems that the EBSP is allowing a faster mixing of the coalescent times in the tree by providing a demographic history N(t) with less variability.
For the Drosophila ananassae data set one MCMC chain was produced for each of the 16 locations. Each chain was run 2.75 × 10^{7} steps and sampled every 5000 steps, leaving 5000 samples after removing 10% burnin. The ESS for the posterior ranged between 220 and 2957, with a mean of 1154. This wide range may be due to variations between populations in the amount of concordance in estimated population size history among loci. Populations which exhibit conflicting demographic signals among loci are expected to mix slower. However this conjecture is based only on visual inspection of results from the analysis of individuals genes. It is also possible that some improvement might come from allowing variation in mutation rate among loci.
Conclusion
Multilocus approaches are becoming more attractive with decreasing DNA sequencing costs and increasing computational power. Felsenstein has recently investigated the ideal combination of samples, sites and loci for a fixed cost under maximum likelihood settings [39]. He found that increasing the number of loci is the most cost effective way of improving accuracy, as well as determining that a small number of samples (around 8) is sufficient. Carling and colleagues find that 25 loci from 10 samples is sufficient to accurately estimate constant population size, again under MLE [40].
The EBSP provides a tool to investigate populations that change size through time, by directly inferring the population size as a function of time using sequence data from multiple loci. The EBSP estimates the dimensionality of the population size function automatically from the data, avoiding model overspecification and its associated noise. The Bayesian nature of the EBSP provide an explicit measure of modeling uncertainty. Since we need less individuals when using multiple loci, computational performance is typically not an issue. Running an EBSP for 10 loci and 10 individuals is approximately the same computational effort as running 1 locus for 90 individuals. Mixing, however, is always an issue with MCMC and is data dependent. The EBSP may take longer to mix when there is very little information in the data or there is some conflicting signal among the loci analyzed.
The power of the multilocus approach in providing better estimates of population size under the stochastic coalescence process is clearly demonstrated by simulation results. However, the results show that taking care of other factors is important too: quality of sequence data, sufficient number of sites and enough samples all help to provide better estimation of coalescent times. Those improvements are reflected directly in the bias and credible intervals of the demographic function estimation.
The results emphasize the importance of support measures when inferring population size history. The inherent uncertainty in population size inference is still substantial even for the large datasets being collected nowadays. Therefore, it is important to always provide credible intervals with any estimate of population size. From our investigations it appears that highquality data sets of 16 unlinked loci or more should be used to accurately infer population size history. A relatively small number of individuals is sufficient, but more individuals still provide tangible benefits for data sets using less loci. We are currently working on an extension of this work to the estimation of a species tree from multiple gene trees.
Declarations
Acknowledgements
The authors would like to thank David Bryant for discussion and acknowledge the financial support of Marsden grant #UOA0502.
Authors’ Affiliations
References
 Fu YX, Li WH: Coalescing into the 21st Century: An Overview and Prospects of Coalescent Theory. Theoretical Population Biology. 1999, 56: 110. 10.1006/tpbi.1999.1421.View ArticlePubMedGoogle Scholar
 Serre D, Langaney A, Chech M, TeschlerNicola M, Paunovic M, Mennecier P, Hofreiter M, Possnert G, Paabo S: No evidence of Neandertal mtDNA contribution to early modern humans. PLoS Biol. 2004, 2 (3): E5710.1371/journal.pbio.0020057.PubMed CentralView ArticlePubMedGoogle Scholar
 Akey JM, Eberle MA, Rieder MJ, Carlson CS, Shriver MD, Nickerson DA, Kruglyak L: Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2004, 2 (10): e28610.1371/journal.pbio.0020286.PubMed CentralView ArticlePubMedGoogle Scholar
 Wang Y, Rannala B: In silico analysis of diseaseassociation mapping strategies using the coalescent process and incorporating ascertainment and selection. Am J Hum Genet. 2005, 76 (6): 10661073. 10.1086/430472.PubMed CentralView ArticlePubMedGoogle Scholar
 Kingman JFC: The coalescent. Stoch Proc Appl. 1982, 13: 235248. 10.1016/03044149(82)900114.View ArticleGoogle Scholar
 Kingman JFC: Origins of the Coalescent: 1974–1982. Genetics. 2000, 156 (4): 14611463.PubMed CentralPubMedGoogle Scholar
 Dawkins R: The ancestor's tale. 2004, PhoenixGoogle Scholar
 Blythe RA, McKane AJ: Stochastic models of evolution in genetics, ecology and linguistics. Journal of Statistical Mechanics: Theory and Experiment. 2007, 2007 (07): P0701810.1088/17425468/2007/07/P07018.View ArticleGoogle Scholar
 Griffiths RC, Tavare S: Sampling Theory for Neutral Alleles in a Varying Environment. Royal Society of London Philosophical Transactions Series B. 1994, 344: 403410. 10.1098/rstb.1994.0079.View ArticleGoogle Scholar
 Tavare S, Balding DJ, Griffiths RC, Donnelly P: Inferring Coalescence Times From DNA Sequence Data. Genetics. 1997, 145 (2): 505518.PubMed CentralPubMedGoogle Scholar
 Wilson IJ, Balding DJ: Genealogical Inference From Microsatellite Data. Genetics. 1998, 150: 499510.PubMed CentralPubMedGoogle Scholar
 Beaumont MA: Detecting Population Expansion and Decline Using Microsatellites. Genetics. 1999, 153 (4): 20132029.PubMed CentralPubMedGoogle Scholar
 Pybus OG, Rambaut A, Harvey PH: An Integrated Framework for the Inference of Viral Population History From Reconstructed Genealogies. Genetics. 2000, 155 (3): 14291437.PubMed CentralPubMedGoogle Scholar
 Drummond AJ, Rambaut A, Shapiro B, Pybus OG: Bayesian Coalescent Inference of Past Population Dynamics from Molecular Sequences. Mol Biol Evol. 2005, 22 (5): 11851192. 10.1093/molbev/msi103.View ArticlePubMedGoogle Scholar
 OpgenRhein R, Fahrmeir L, Strimmer K: Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo. BMC Evolutionary Biology. 2005, 5: 610.1186/1471214856.PubMed CentralView ArticlePubMedGoogle Scholar
 Minin VN, Bloomquist EW, Suchard MA: Smooth skyride through a rough skyline: Bayesian coalescentbased inference of population dynamics. Mol Biol Evol. 2008, 25 (7): 14591471. 10.1093/molbev/msn090.PubMed CentralView ArticlePubMedGoogle Scholar
 George E, McCulloch R: Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993, 88: 881889. 10.2307/2290777.View ArticleGoogle Scholar
 Kuo L, Mallick B: Variable selection for regression models. Sankhya B. 1998, 60: 6581.Google Scholar
 Chipman H, George E, McCulloch R: The practical implementation of Bayesian model selection. IMS Lecture Notes – Monograph Series. 2001, 38: 67134.Google Scholar
 Cowan G: Statistical Data Analysis. 1998, Oxford University PressGoogle Scholar
 Jeffreys H: An Invariant Form for the Prior Probability in Estimation Problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences. 1946, 186 (1007): 453461. 10.1098/rspa.1946.0056.View ArticlePubMedGoogle Scholar
 Denver DR, Morris K, Lynch M, Vassilieva LL, Thomas WK: High direct estimate of the mutation rate in the mitochondrial genome of Caenorhabditis elegans. Science. 289 (5488): 23422344. 10.1126/science.289.5488.2342. 2000 Sep 29Google Scholar
 Montooth K, Rand D: The spectrum of mitochondrial mutation differs across species. PLoS Biol. 2008, 6 (8): e21310.1371/journal.pbio.0060213.PubMed CentralView ArticlePubMedGoogle Scholar
 Drummond AJ, Rambaut A: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol. 2007, 7: 21410.1186/147121487214.PubMed CentralView ArticlePubMedGoogle Scholar
 Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981, 17: 368376. 10.1007/BF01734359.View ArticlePubMedGoogle Scholar
 Hastings W: Monte Carlo sampling methods using Markov chains and its applications. Biometrika. 1970, 57: 97109. 10.1093/biomet/57.1.97.View ArticleGoogle Scholar
 Knight K: Mathematical Statistics. 2000, CRC PressGoogle Scholar
 Thompson W, Rosen O: A Bayesian Model for Sparse Functional Data. Biometrics. 2007, 64: 5463. 10.1111/j.15410420.2007.00829.x.View ArticlePubMedGoogle Scholar
 Rambaut A, Drummond AJ: Tracer [computer program]. 2003, [http://beast.bio.ed.ac.uk/tracer]Google Scholar
 Hasegawa M, Kishino H, Yano T: Dating the humanape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution. 1985, 22: 160174. 10.1007/BF02101694.View ArticlePubMedGoogle Scholar
 Nei M, Maruyama T, Chakraborty R: The Bottleneck Effect and Genetic Variability in Populations. Evolution. 1975, 29: 110. 10.2307/2407137.View ArticleGoogle Scholar
 Johnson J, Dunn P, Bouzat J: Effects of recent population bottlenecks on reconstructing the demographic history of prairiechickens. Molecular Ecology. 2007, 16 (11): 22032222. 10.1111/j.1365294X.2007.03285.x.View ArticlePubMedGoogle Scholar
 Pienaar E, Theron M, Nelson M, Viljoen H: A quantitative model of error accumulation during PCR amplification. Computational Biology and Chemistry. 2006, 30 (2): 102111. 10.1016/j.compbiolchem.2005.11.002.PubMed CentralView ArticlePubMedGoogle Scholar
 Ewing B, Green P: BaseCalling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Research. 1998, 8 (3): 186View ArticlePubMedGoogle Scholar
 Pybus O, Drummond A, Nakano T, Robertson B, Rambaut A: The Epidemiology and Iatrogenic Transmission of Hepatitis C Virus in Egypt: A Bayesian Coalescent Approach. Molecular Biology and Evolution. 2003, 20 (3): 381387. 10.1093/molbev/msg043.View ArticlePubMedGoogle Scholar
 Ray SC, Arthur RR, Carella A, Bukh J, Thomas DL: Genetic epidemiology of hepatitis C virus throughout egypt. J Infect Dis. 2000, 182 (3): 698707. 10.1086/315786.View ArticlePubMedGoogle Scholar
 Das A, Mohanty S, Stephan W: Inferring the Population Structure and Demography of Drosophila ananassae From Multilocus Data. Genetics. 2004, 168 (4): 19751985. 10.1534/genetics.104.031567.PubMed CentralView ArticlePubMedGoogle Scholar
 Kass R, Raftery A: Bayes Factors. Journal of the American Statistical Association. 1995, 90 (430):Google Scholar
 Felsenstein J: Accuracy of Coalescent Likelihood Estimates: Do We Need More Sites, More Sequences, or More Loci?. Mol Biol Evol. 2006, 23 (3): 691700. 10.1093/molbev/msj079.View ArticlePubMedGoogle Scholar
 Carling MD, Brumfield RT: Gene Sampling Strategies for MultiLocus Population Estimates of Genetic Diversity. PLoS ONE. 2007, 2: e16010.1371/journal.pone.0000160.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.