Patterns of genetic variation in populations of infectious agents
© Gordo and Campos. 2007
Received: 25 September 2006
Accepted: 13 July 2007
Published: 13 July 2007
The analysis of genetic variation in populations of infectious agents may help us understand their epidemiology and evolution. Here we study a model for assessing the levels and patterns of genetic diversity in populations of infectious agents. The population is structured into many small subpopulations, which correspond to their hosts, that are connected according to a specific type of contact network. We considered different types of networks, including fully connected networks and scale free networks, which have been considered as a model that captures some properties of real contact networks. Infectious agents transmit between hosts, through migration, where they grow and mutate until elimination by the host immune system.
We show how our model is closely related to the classical SIS model in epidemiology and find that: (i) depending on the relation between the rate at which infectious agents are eliminated by the immune system and the within-host effective population size, genetic diversity either increases with $R_0$ or peaks at intermediate $R_0$ levels; (ii) patterns of genetic diversity in this model are in general similar to those expected under the standard neutral model, but in a scale free network and for low values of $R_0$ a distortion in the neutral mutation frequency spectrum can be observed; (iii) highly connected hosts (hubs in the network) show patterns of diversity different from poorly connected individuals, namely higher levels of genetic variation, lower levels of genetic differentiation and larger values of Tajima's D.
We have found that levels of genetic variability in the population of infectious agents can be predicted by simple analytical approximations, and fall into two distinct scenarios, determined by the relation between the rate of drift and the rate at which infectious agents are eliminated. In one scenario diversity is an increasing function of the level of transmission; in the other it peaks at intermediate levels of transmission. This is independent of the type of host contact structure. Furthermore, for low values of $R_0$, very heterogeneous host contact structures lead to lower levels of diversity.
Patterns of genetic diversity in populations of infectious agents contain important information about their epidemiology and evolution. They depend on the population dynamics of the infectious agents, which involves their replication within hosts and transmission between hosts, and on their mutation and recombination rates. Infectious agents vary enormously in their ability to mutate and to transmit, which leads to large differences in levels of variability. Furthermore, there can be variation within an infectious species in the ability to evade the host immune system. In fact, infectious agent genetic diversity can help in targeting genes under selection pressure created by the immune system. In addition, patterns of infectious agent variation can, under certain circumstances, be used to infer host population history, and the level of infectious agent genetic structure may reflect its evolutionary potential. Importantly, the need for a continuous integration between population genetics and epidemiology has been increasingly recognized [4–7].
In population genetics the standard neutral model has a long history in DNA sequence data analysis, and has been extensively used as a null model for understanding genetic variation in natural populations, including that in our own species [8, 9]. The standard neutral model makes several simplifying assumptions; in particular, it assumes that individuals form a single population of constant size. When considering populations of infectious agents it is much more reasonable to assume, as the null model, a population composed of a collection of much smaller populations.
Here we develop population genetics models of structured populations, that incorporate epidemiological parameters explicitly, in order to study genetic variability under one of the simplest possible epidemiological models. We ask mainly two questions: 1) what do levels and patterns of sequence variation in these infectious agents look like under this model? And 2) how does host contact structure influence their diversity?
The models we will study here are very similar to metapopulation models in which each subpopulation can go extinct and be recolonized [10–12]. Studies of genetic diversity in such subdivided populations [13, 14] generally assume a simple symmetric topology for the metapopulation: the best studied is the island model of Wright. Simple as it is, this model has provided a wealth of results that have contributed enormously to our understanding of evolution in structured populations [15, 16]. Nevertheless, there are several reasons to think that this model is too simple to be readily applicable to natural populations [14, 17], especially if the goal is to understand the molecular diversity of infectious agents. The underlying topology on which certain disease epidemics spread is that of social networks. Several recent investigations have demonstrated that real networks of interaction have a much more complex structure than that predicted by totally regular networks or totally random networks. Most real networks of social interactions present two different topological properties: a low average pairwise distance between nodes and a high degree of clustering (which measures local structuring).
The former occurs in random networks and the latter in regular networks. Accordingly, several models of network topology have recently been proposed in the literature (for a review see Ref. ). One of the most successful models for network structure is the scale-free network. In addition to the common properties of real interaction networks, in scale-free networks the distribution of connectivities obeys a power law, $P(k) \sim k^{-\gamma}$, which is observed in some real systems ranging from the World Wide Web to the network of human sexual contacts [22, 23]. As initially proposed, scale-free networks are dynamical networks in which growth and preferential attachment are key mechanisms.
Accordingly, each newly introduced node in the network preferentially joins an already well-connected node. The result is a highly heterogeneous network where most nodes have a low connectivity while a few nodes display a very large connectivity. The latter are referred to as hubs. Understanding the interplay between the underlying topology and the forces driving these systems is of crucial relevance [24, 25]. One example that has received a great deal of attention is network epidemiology: the study of epidemic and disease spreading [26–29], which is strictly tied to the topology of social contact networks. In this context, a striking result has arisen from the study of the classical susceptible-infected-susceptible (SIS) epidemiological model on scale free networks: scale-free networks are more prone to the spreading of diseases than random graphs and regular lattices [26, 27]. In this kind of model the role of microbe evolution is disregarded. Recently, we have focused on this latter feature and have shown that, although scale-free networks are more prone to infectious agent spread, the accumulation of deleterious mutations in asexual infectious agents with high mutation rates can also be accelerated in this kind of network in comparison to random graphs. This shows that not only the dynamics of a disease but also its evolution should be considered in the investigation of epidemiological models. Another very important feature that has to be considered is co-evolution between infectious agents and their hosts. Modeling of these complex systems has provided us with insights into how host-parasite interactions can modulate the mode of reproduction, ploidy levels, the patterns of gene expression in hosts and parasites, and how different types of interspecies interactions affect genetic and phenotypic variation.
The susceptible-infected-susceptible model (SIS model) is one of the simplest classical models in epidemiology. In this model, hosts are born susceptible (S) and can become infected (I) at a rate $\beta$ per unit time, given contact with at least one infected host. Infected hosts become susceptible again at a rate $\lambda$, such that $1/\lambda$ is the average duration of an infection. One of the most fundamental quantities for assessing the equilibrium frequency of infections in the population is the $R_0$ of the infectious agent, defined as the number of secondary cases produced by an infectious individual in a totally susceptible population. At epidemiological equilibrium, the frequency of infected individuals is $i = 1 - 1/R_0$, with $R_0 = \beta/\lambda$. If $R_0 < 1$ the infection does not spread.
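The equilibrium stated above can be checked numerically. The following is a minimal sketch, assuming the usual mean-field form of the SIS dynamics, di/dt = β i (1 - i) - λ i, which the text does not write out explicitly; the function names, step size and parameter values are illustrative choices, not the authors'.

```python
def sis_equilibrium(beta, lam):
    """Equilibrium frequency of infected hosts, i = 1 - 1/R0 with R0 = beta/lam.

    Returns 0.0 when R0 <= 1, since the infection then does not spread.
    """
    r0 = beta / lam
    return max(0.0, 1.0 - 1.0 / r0)


def integrate_sis(beta, lam, i0=0.01, dt=0.01, steps=200_000):
    """Forward-Euler integration of the assumed mean-field equation
    di/dt = beta * i * (1 - i) - lam * i, starting from frequency i0."""
    i = i0
    for _ in range(steps):
        i += dt * (beta * i * (1.0 - i) - lam * i)
    return i


if __name__ == "__main__":
    # Hypothetical rates giving R0 = 3.
    beta, lam = 0.3, 0.1
    print("analytical equilibrium:", sis_equilibrium(beta, lam))
    print("numerical equilibrium: ", integrate_sis(beta, lam))
```

With R0 above 1 the integrated frequency settles on 1 - 1/R0; with R0 below 1 it decays towards zero.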
| Parameter | Meaning in metapopulation genetics | Meaning in epidemiology |
|---|---|---|
| $D$ | number of demes | number of hosts |
| $N_d$ | number of individuals within a deme | number of infectious agents within an infected host |
| $e$ | probability that a deme goes extinct | probability that the immune system clears the infection |
| $m$ | | transmission ability between hosts |
| $\mu$ | | mutation rate of the infectious agent |
| $k_j$ | number of demes connected to deme j | number of contacts of host j |
We now relate our metapopulation model to the SIS model, and ask what equilibrium patterns of infectious agent genetic variation look like under this model. In our model a deme corresponds to a host. An empty deme means that the host is susceptible, whereas a full deme corresponds to an infected host. A deme that is currently full can become empty with probability $e$, which means that $e$ corresponds to $\lambda$. A deme that is currently empty can become full through the migrants it receives from the demes it is connected to. This implies that $\beta$ is proportional to $m$. Given that the average connectivity of a deme is $K$ and that the number of migrants per link is $N_d m$, $\beta$ corresponds to $N_d m K$.
In order to assess the correspondence between our model and the SIS model, we have compared the average frequency of infected individuals in our metapopulation with the expectation for the deterministic SIS model, which implies that:
$$ i = 1 - \frac{1}{R_0} = 1 - \frac{e}{N_d m K} \qquad (1) $$
Equation 1 is the expected frequency when there is no variance in $k_i$, which is not the case in scale free networks.
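The correspondence between the metapopulation parameters and Equation 1 can be written as a short calculation. This is a sketch only: the function names are hypothetical, and the parameter values in the example are chosen for illustration rather than taken from the simulations.

```python
def basic_reproduction_number(n_d, m, k_mean, e):
    """R0 = beta / lambda, with beta = N_d * m * K and lambda = e."""
    return n_d * m * k_mean / e


def expected_prevalence(n_d, m, k_mean, e):
    """Equation 1: i = 1 - e / (N_d * m * K), taken as 0 when R0 <= 1.

    This ignores variance in connectivity, so it is only a rough guide
    for scale free networks.
    """
    r0 = basic_reproduction_number(n_d, m, k_mean, e)
    return 1.0 - 1.0 / r0 if r0 > 1.0 else 0.0


if __name__ == "__main__":
    # Illustrative (hypothetical) values: 10 agents per host, migration
    # rate 0.001 per link, 9 contacts on average, clearance probability 0.01.
    print(basic_reproduction_number(10, 0.001, 9, 0.01))
    print(expected_prevalence(10, 0.001, 9, 0.01))
```

Because $m$ enters only through the product $N_d m K$, raising $m$ at fixed $e$ is one way of scanning $R_0$.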
One may expect deviations to be observed when these assumptions are violated. Nevertheless the deviations we observe are small, unless $R_0$ is very low. In fact, when $R_0$ is very low there is a high probability that the infection does not spread. For example, in the scale free network, an infection that starts in a poorly connected host may have very little chance of spreading. We performed simulations with the scale free topology in which the infection starts in a single randomly chosen host. With the same parameters as in Figure 1 and $R_0 = 1.5$, the disease failed to spread in 66% of cases. With $R_0 = 3$, the fraction of cases in which the infectious agent could not invade dropped to 40%.
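For orientation, these invasion failures can be set against a standard well-mixed reference point. In a linear birth-death approximation of the early phase of an outbreak (an assumption made here, not the authors' network simulation), an infection started in a single host dies out with probability $1/R_0$ when $R_0 > 1$. The sketch below computes that value and checks it with a simple random walk on the number of infected hosts; the cap, the number of runs and the seed are arbitrary illustrative choices.

```python
import random


def extinction_probability(r0):
    """Probability that an outbreak started by one infected host dies out,
    under a well-mixed linear birth-death approximation: min(1, 1/R0)."""
    return 1.0 if r0 <= 1.0 else 1.0 / r0


def simulate_extinction_fraction(r0, runs=5000, cap=50, seed=1):
    """Monte Carlo check: each event is an infection with probability
    R0 / (R0 + 1), otherwise a clearance. A run counts as an invasion
    once `cap` hosts are infected, and as an extinction at zero."""
    rng = random.Random(seed)
    p_infect = r0 / (r0 + 1.0)
    extinct = 0
    for _ in range(runs):
        infected = 1
        while 0 < infected < cap:
            infected += 1 if rng.random() < p_infect else -1
        if infected == 0:
            extinct += 1
    return extinct / runs


if __name__ == "__main__":
    for r0 in (1.5, 3.0):
        print(r0, extinction_probability(r0), simulate_extinction_fraction(r0))
```

Under this approximation the failure probabilities are about 0.67 for $R_0 = 1.5$ and 0.33 for $R_0 = 3$, to be compared with the 66% and 40% reported for the scale free network. The comparison is only indicative, since the approximation ignores contact structure entirely.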
We observe that, for both topologies and for the sets of parameters considered, the level of $\pi_t$ is maximal for intermediate values of $R_0$. For instance, when $e = 0.01$ this maximum is reached at $R_0$ around 3 for the island model and around 10 for the scale free topology. Beyond these points the level of diversity starts to decrease with increasing $R_0$. From Figure 2 we observe two quite distinct regimes, according to the level of transmission. In the region of low transmission, $R_0$ is small, extinction is much stronger than migration ($e \gg mK$), the fraction of infected hosts is small and levels of diversity are low. Starting from $R_0 = 1$, where the fraction of infected individuals, $i$, is 0, as we increase $R_0$ (by increasing $m$) the level of infection rises and the level of diversity accompanies that increase. In this region the level of infection bounds the level of diversity in the population, since diversity is expected to be higher when the total number of infectious agents in the metapopulation is larger. When the level of infection reaches a value close to 0.9, increments in $m$ lead to only small increments in $i$ and the level of diversity stops increasing. The second regime comes about at high transmission, where $R_0$ is very large. In this region migration is much stronger than extinction ($mK \gg e$), the level of infection is close to 1 and so it is not the limiting factor for diversity to grow. From this point, increments in migration cause a drastic reduction in the isolation between demes and lead to a reduction in diversity. In the limit of extremely high levels of migration the diversity in the structured population tends to that expected in a panmictic population of size $N_t = D N_d$. So, for very high values of $R_0$, diversity tends toward the value $\pi_t = \pi_d = 2 N_d D \mu$, which in the case of Figure 2 is 8 for the value of the mutation rate, $\mu$, assumed.
Figure 2 also shows that in the region of low $R_0$ diversity in the island model is higher than in the scale free network, whereas for large values of $R_0$ there is little difference between the topologies. The latter is expected, since the larger the migration rate the less important the precise contact structure will be. The former can be understood as follows: a low value of $R_0$ corresponds to a small fraction of infected hosts both in the island model and in the scale free network. But whereas in the island model new infections of a susceptible host can come from contact with any of the infected hosts in the metapopulation, in the scale free network infections are more likely to come from well connected hosts, which are a small subset of the metapopulation. This leads to lower diversity levels in the scale free network, as compared to the island model, for the same low $R_0$ value.
[Table 2. Mean values of Tajima's D in the scale free network with parameters … Columns: $D_t$, 2 SE. Table entries not recovered.]
which is valid only when $mK < e$. Since $R_0 = N_d mK/e$, this condition is equivalent to $R_0 < N_d$, and we therefore expect this expression to provide a good approximation for cases in which $R_0 < N_d$.
Equation 2 suggests a strong dependence of the level of metapopulation diversity on $N_d$, the effective population size within a host. This effective population size is likely to vary considerably among different infectious agent species. We have therefore used simulations to explore how the value of $N_d$ affects the levels of diversity.
Furthermore, as suggested by Equation 2, for small values of $R_0$, increasing $N_d$ has a very small effect on the level of diversity, but for intermediate to high $R_0$ values the effect is more pronounced.
When $R_0 > 10$, the level of infection is not a limiting factor in the level of diversity, because the number of infected hosts is very high. Thus for large values of $R_0$ infectious agent diversity will increase with $N_d$.
Comparing the panels in Figure 4, we can observe that when $R_0 \ll 10$, diversity is always smaller in the scale free network, whereas when $R_0 \gg 10$ and $e > 1/N_d$ the levels of diversity are similar in both contact networks. In fact, for large values of $R_0$, the largest difference between the topologies can be observed when $e = 1/N_d$.
In this metapopulation model there are two forces that generate diversity within each host, mutation and transmission, and two forces that undermine it, extinction and genetic drift. So in general we can expect that when the forces that reduce diversity are stronger than those that generate it (that is, low $R_0$, low $N_d$ or high $e$), diversity levels will be low. Conversely, for high $R_0$, high $N_d$ or low $e$, we can expect levels of diversity to be much larger.
The spectrum of frequencies of mutations that are segregating in the population is important for understanding deviations from the standard neutral model, which assumes an undivided, constant size population at equilibrium between mutation and drift. In fact, the mutational spectrum of infectious agent gene sequences has been used to reject the standard neutral model, suggesting that natural selection is determining the evolution of certain genes [40, 41]. Tajima's D is a widely used statistic to assess distortions in the frequency spectrum. If the number of mutations that appear at frequency $1/n$ in a sample of size $n$ (singletons) is higher than that expected under the standard neutral model, then Tajima's D becomes negative. On the other hand, if the number of mutations at intermediate frequency is large then Tajima's D becomes positive. When a departure from the standard neutral model is observed in a given gene of a given species, several alternative hypotheses can be made. These typically involve natural selection and/or demographic factors, such as population growth or population structure. In infectious agent populations the relevant null model against which we would like to test for the molecular signature of selection is closer to a metapopulation neutral model than to the standard neutral model. In all the simulations of all the metapopulation structures we have studied, we have observed that $D_t$ was always very close to 0. This is in agreement with the results of coalescent theory and simulations in metapopulations under the island model [43, 44]. However, in some simulations of scale free networks a slight distortion of the frequency spectrum was apparent: in cases of low $R_0$, mean values of Tajima's D become negative. In Table 2 we show one example where this occurs. Although the values of $D_t$ are not very negative when the sample size is small, they become more negative with increasing sample size.
One of the main goals in infectious disease research is to understand how infectious agent variation, host immunity, transmission dynamics and epidemic dynamics determine patterns of infectious agent evolution. Information about evolutionary and epidemiological processes can be extracted from studying infectious agent genetic diversity. In particular, it can help us to understand the origin of disease and the selective pressures that act on certain infectious agent genes. The link between infectious agent dynamics and genetic diversity at the within-host and between-host levels is a very important problem. Its solution requires the integration of population genetics and epidemiology, which has recently been recognized as a major step towards understanding infectious agent evolution.
Here we have studied levels and patterns of infectious agent diversity under one of the simplest classical epidemiological models: the SIS model. In this model, hosts that are susceptible can become infected at a given rate, and hosts that are infected can become susceptible by clearance of the infectious agent. We have found that, under this model and in the conditions studied, for low clearance rates and low intrahost effective population size, levels of genetic variability in samples from the whole infectious agent population are maximal for intermediate levels of transmission. This pattern of DNA sequence diversity was found to be independent of the type of host contact structure.
Although we have not performed simulations with values of $N_d$ close to those that have been estimated for some infectious agents ($N_d \simeq 1000$ estimated for HIV-1), due to the high computational cost, the simulations we have done show that when the rate at which the immune system clears the infectious agent ($e$) is higher than the rate of drift ($1/N_d$) within the host, levels of infectious agent diversity in the whole metapopulation increase monotonically with $R_0$.
In highly transmitted infectious agents, levels of diversity are weakly dependent on the type of host contact structure. However, for infectious agents with low values of $R_0$, levels of diversity do depend on the host contact structure: when every host is in contact with every other, levels of diversity are higher than when a few hosts have a disproportionate number of contacts and the majority have a small number of contacts. In this latter case levels of infectious agent diversity are expected to be low. Furthermore, in this latter case the frequency spectrum of neutral mutations can be distorted relative to that expected for the standard neutral model. This feature is captured by negative values of Tajima's D statistic. The observation of positive values of $D_t$ in infectious agent genes suggests that strong diversifying selection could be occurring, since even when we account for the complex contact structure in which infectious agents evolve, under a neutral model one would expect to observe values of $D_t$ close to 0 or negative.
The results presented here can also be used to make some predictions about future adaptation in infectious agents. If we assume that new adaptive mutations in infectious agents arise from standing neutral variation [49, 50], Figures 2, 3 and 4 imply that, among infectious agents with low intrahost effective population size, those with intermediate $R_0$ are likely to adapt more rapidly than those with larger $R_0$. For infectious agents in which these conditions are met, an important implication regarding public health measures can be drawn: if control programs aimed at lowering transmission do not reduce $R_0$ to very low values, but instead only lead to small reductions in $R_0$, then this may imply an increased chance of the infectious agent escaping the immune system.
where $\pi_{ij}$ is the number of differences between two sampled sequences, and also for each deme ($\pi_d$).
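As an illustration of a diversity measure built from the pairwise differences $\pi_{ij}$, the sketch below computes the mean number of differences over all pairs of sampled sequences. The averaging over pairs is the usual estimator and is assumed here, since the defining equation is not reproduced in this text; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ (pi_ij)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def mean_pairwise_diversity(sequences):
    """Average of pi_ij over all distinct pairs in the sample."""
    if len(sequences) < 2:
        raise ValueError("need at least two sequences")
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_differences(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy sample of three aligned sequences.
    sample = ["AAAA", "AAAT", "AATT"]
    print(mean_pairwise_diversity(sample))
```

Applying the same function to sequences drawn from a single host gives a within-deme value, and applying it to sequences drawn from the whole metapopulation gives a total value.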
where $b_n = e_1 S + e_2 S(S - 1)$, with $e_1$ and $e_2$ as defined by Tajima.
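For readers who want to compute the statistic, here is a sketch of Tajima's D using the standard constants from Tajima's original definition. It takes the sample size, the number of segregating sites $S$ and the mean pairwise diversity as inputs; the function name and argument names are illustrative, and the code is not taken from the authors' simulations.

```python
import math


def tajimas_d(n, s, pi):
    """Tajima's D for a sample of n sequences with s segregating sites
    and mean pairwise diversity pi.

    Returns NaN when the statistic is undefined (no segregating sites
    or zero variance term).
    """
    if n < 2:
        raise ValueError("need at least two sequences")
    if s == 0:
        return float("nan")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * s + e2 * s * (s - 1)  # the term written b_n in the text
    if variance <= 0:
        return float("nan")
    return (pi - s / a1) / math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical sample: 10 sequences, 16 segregating sites.
    print(tajimas_d(10, 16, 3.888889))
```

An excess of singletons lowers the pairwise diversity relative to $S/a_1$ and makes the statistic negative, which is the direction of distortion discussed for scale free networks at low $R_0$.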
A well studied topology in the population genetics literature is the island model, introduced by Wright, which corresponds to a fully connected network where every deme is connected to all the others, so $k_i = D - 1$. A commonly studied topology in epidemiology is the scale-free network, where the distribution of connectivities obeys a power law: $P(k) \sim k^{-\gamma}$. In real systems the exponent $\gamma$ is in the range between 2 and 3. Nodes of low connectivity are predominant in the network, whereas well-connected nodes are rare. One of the mechanisms that can lead to a network with a power-law degree distribution is growth with preferential attachment, where nodes newly introduced to the network are preferentially attached to nodes that are already well connected. We use the standard algorithm by Albert and Barabasi to build the scale-free networks, and so we generate networks with exponent $\gamma = 3$. Scale free networks, which are extremely heterogeneous, may be appropriate descriptions for studying sexually transmitted diseases. Our results for scale-free networks were compared to the island model. For every network and every parameter set we have run 30 independent simulations.
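The two contact structures can be sketched in a few lines. The code below builds an island network and a network grown by preferential attachment. It is a minimal illustration of the growth mechanism described above, not the authors' implementation: the fully connected seed graph, the number of links added per new node (`m_links`) and the function names are all assumptions.

```python
import random


def island_network(n_demes):
    """Fully connected network: every deme is linked to all others,
    so each connectivity is k_i = D - 1."""
    return {i: set(range(n_demes)) - {i} for i in range(n_demes)}


def preferential_attachment_network(n_nodes, m_links, seed=None):
    """Grow a network by adding nodes one at a time, each linking to
    m_links distinct existing nodes chosen with probability
    proportional to their current connectivity."""
    if n_nodes <= m_links:
        raise ValueError("n_nodes must exceed m_links")
    rng = random.Random(seed)
    adjacency = {i: set() for i in range(n_nodes)}
    # Each node appears in `ends` once per link it holds, so a uniform
    # draw from `ends` is a draw proportional to connectivity.
    ends = []
    core = m_links + 1  # assumed seed: a small fully connected core
    for i in range(core):
        for j in range(i + 1, core):
            adjacency[i].add(j)
            adjacency[j].add(i)
            ends += [i, j]
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < m_links:
            targets.add(rng.choice(ends))
        for target in targets:
            adjacency[new].add(target)
            adjacency[target].add(new)
            ends += [new, target]
    return adjacency


if __name__ == "__main__":
    net = preferential_attachment_network(1000, 2, seed=42)
    degrees = sorted((len(v) for v in net.values()), reverse=True)
    print("largest connectivities:", degrees[:5])
    print("smallest connectivity: ", degrees[-1])
```

Most nodes end up with a connectivity close to `m_links`, while a few early nodes accumulate many links and act as hubs.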
We thank Gabriela Gomes, David Conway and Gareth Weedall for helpful suggestions. This work was supported by project POCTI/BSE/46856/2002 through Fund. para a Ciência e Tecnologia (FCT). I.G. is supported by FCT/FEDER fellowship. PRAC is partially supported by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.