Phenotypic effect of mutations in evolving populations of RNA molecules
© Stich et al; licensee BioMed Central Ltd. 2010
Received: 14 May 2009
Accepted: 17 February 2010
Published: 17 February 2010
The secondary structure of folded RNA sequences is a good model to map phenotype onto genotype, as represented by the RNA sequence. Computational studies of the evolution of ensembles of RNA molecules towards target secondary structures yield valuable clues to the mechanisms behind adaptation of complex populations. The relationship between the space of sequences and structures, the organization of RNA ensembles at mutation-selection equilibrium, the time of adaptation as a function of the population parameters, the presence of collective effects in quasispecies, or the optimal mutation rates to promote adaptation all are issues that can be explored within this framework.
We investigate the effect of microscopic mutations on the phenotype of RNA molecules during their in silico evolution and adaptation. We calculate the distribution of the effects of mutations on fitness, the relative fractions of beneficial and deleterious mutations and the corresponding selection coefficients for populations evolving under different mutation rates. Three different situations are explored: the mutation-selection equilibrium (optimized population) in three different fitness landscapes, the dynamics during adaptation towards a goal structure (adapting population), and the behavior under periodic population bottlenecks (perturbed population).
The ratio between the number of beneficial and deleterious mutations experienced by a population of RNA sequences increases with the value of the mutation rate μ at which evolution proceeds. In contrast, the selective value of mutations remains almost constant, independent of μ, indicating that adaptation occurs through an increase in the amount of beneficial mutations, with little variations in the average effect they have on fitness. Statistical analyses of the distribution of fitness effects reveal that small effects, either beneficial or deleterious, are well described by a Pareto distribution. These results are robust under changes in the fitness landscape, remarkably when, in addition to selecting a target secondary structure, specific subsequences or low-energy folds are required. A population perturbed by bottlenecks behaves similarly to an adapting population, struggling to return to the optimized state. Whether it can survive in the long run or whether it goes extinct depends critically on the length of the time interval between bottlenecks.
The fate of evolving populations is determined by a number of intrinsic properties of the ensemble and extrinsic mechanisms that interact in a highly non-trivial manner. The natural mutation rate of populations and their size, the effect that such mutations have on fitness and how these effects vary with the state of the population, or the environmental perturbations they have to cope with, all are relevant variables conditioning long-term survivability .
Populations evolving in constant environments eventually reach a fitness plateau, often interpreted as an optimum in the fitness landscape. The time required to reach the optimum and the structure of the population at the mutation-selection equilibrium corresponding to that environment depend, in addition to the factors listed above, on the environment and on the initial state of the population. During adaptation, the effect of mutations is different from their effect at equilibrium, probably due, among others, to differences in the fitness of populations and to changes in the genomic context [2, 3]. Though it is common to observe an increase of fitness during adaptation, situations where fitness decreases before the new plateau is attained are not rare [4–6]. Based on several experimental observations and in mathematical models separating genotype and phenotype, it seems that the rate of incorporation of new mutations in adapting populations proceeds regularly, while the phenotype experiences discontinuous changes [7, 8]. This search-and-fix behaviour has characteristic time scales dependent on the mutation rate and the size of the population, such that efficient optimization will only be observed in environments that remain constant for a time span longer than that required to find and fix beneficial changes . Often, the process of smooth adaptation in constant environments is systematically interrupted by external perturbations. Common disturbances are unavoidable environmental fluctuations or durable changes that constitute a new, previously inexperienced selection pressure. An extreme perturbation is represented by population bottlenecks, where one or few individuals face the task of reconstructing, usually in relatively short time, the whole population.
The relationship between the mutation rate of a population and the variability of the environment where it evolves has been a matter of interest for a long time [10–12]. In strictly constant environments there should be selection against mutability, while periodically varying environments would demand different types in the population (optimal at one or another period), such that mutation rates that minimize the genetic load of the population would be favoured . Early theories assumed fitness values monotonically decreasing with the number of mutations accumulated by an individual , and recognised the effect that epistatic interactions could play. Subsequent studies focused on the distribution of beneficial effects  and, under the strong assumption that the mutational neighbourhood of a genome is random, predicted an exponential distribution of the differences between the parental and mutated phenotypes . More recent studies have identified the relevant effect of fitness-dependent mutation rates  and the dependence of the effect of mutations on the environment where they occur . Still, an increasing number of empirical observations [6, 18] indicate that existing theories are not of general applicability [19, 20], and encourage additional efforts to quantify the effect of mutations on phenotype. Knowledge of the functional relationship between genomic mutation rates and phenotypic changes is essential to further develop phenomenological theories of evolution and adaptation. This is a main motivation behind the efforts currently devoted to obtain the distribution of fitness effects of mutations. The activity in the field acknowledges the profound difficulty in deriving such a relationship and points to its conceivable non-universality.
Simple, computational models of evolution explicitly separating genotype and phenotype might be of great aid in bridging the gap between phenomenological theories and experimental observations. The former are built on empirical observations which, as of yet, are far from complete. Actually, most experimental observations deal with a relatively small number of cases and model organisms, such that generalization is still a difficult enterprise . Another problem presented by the experimental data is the small number of beneficial mutations that can be identified. Drift and clonal interference cause many beneficial mutations, especially when they have small effects, to be lost. Occasionally, some of these small effect mutations reach appreciable frequencies in the population, though they can often be incorrectly classified as neutral mutations. Another source of trouble concerns the selection coefficients of the mutations responsible for adaptation. Initial studies concluded that most evolutionary change was due to mutations with a small effect on fitness [22, 23]. However, more recent evidences assign a prominent role to large effect mutations, especially at the first stages of adaptation or when the fitness of the wild type in the new environment is low [19, 20, 24–26]. Studies carried out with bacterial or viral populations focused on the analysis of the distributions of the fitness effects of beneficial mutations produced prior to selection [18, 27, 28]. Some of these studies report a good agreement with an exponential distribution, even when there are large variations in the fitness values of the wild type across environments . On the other hand, other studies on adaptation to environments in which the wild type has low fitness reveal deviations from an exponential distribution due to the increase in the amount of beneficial mutations with large effect on fitness . Finally, a study combining simulation and experimental data suggests that, regardless of the underlying distribution of mutations accessible to individuals, adaptation can be well described by a similar distribution of successful mutations with a simple form, peaked around a single value .
The computational study of random ensembles of RNA sequences folding into their minimum free energy secondary structure (a proxy for the phenotype) has permitted one to assess the role played by compensatory mutations . More recently, analyses of the distribution of fitness effects on such ensembles have led to the conclusion that fitness values among similar genotypes are correlated . Evolving populations of RNA sequences subject to point mutations and selection on structure have been successfully used as a suitable framework to explore several other aspects of evolution in heterogeneous populations . However, these populations differ fundamentally from random sets of sequences.
Genomes within a population have to be highly correlated, since they share a phylogenetic history and have experienced similar selection pressures.
The aim of this contribution is to investigate the internal organization of optimized, adapting and perturbed populations of RNA molecules evolving under different genomic mutation rates, with the goal of quantifying the effect of mutations on fitness and establishing links between the state of the population and the effect of mutations. In the case of optimized populations, different fitness landscapes are used to probe the generality of our results. The model we use does not presuppose any underlying distribution of mutational effects. Mutations arise according to the established mutation rate and their effects are not selected a priori. They depend on the specific change that they produce in the secondary structure of the evolving molecules, in particular subsequences, or in the energy of the folded state. Our model also permits us to analyze the multiple mutations simultaneously present in a population, independently of whether they become fixed or not. Thus, mutations with highly deleterious effects can be pinpointed and those with a very small effect can be distinguished from neutral changes.
Results and Discussion
Selection and mutation
where d i is the distance between the structure corresponding to sequence i and the target structure S. As a measure of structural distance, we use the base-pair distance (see Definitions). The constant l is a scale factor, since the maximum base pair distance between two molecules of length l is proportional to l. The overall normalization factor is . The fitness function defined in Eq. (1) corresponds to the main fitness landscape to be studied in this work (S-landscape).
When a molecule replicates, each nucleotide has a probability μ to be replaced by a nucleotide randomly chosen with equal probability among A, C, G, and U. In cases where the populations exceeds the maximum size N, we perform the usual Wright-Fisher sampling.
The Hamming distance between two RNA sequences or subsequences is given by the number of positions in which their nucleotides differ. Structural differences can be estimated by various means. In this work, we apply the base-pair distance, given by the number of base pairs that have to be opened and closed to transform one structure into the other (as implemented in the RNAfold algorithm ).
A relevant macroscopic quantity to characterize the degree of optimization of the population is the fraction ρ of structures in the population folding into the target structure. This quantity is zero when our simulation starts: note that the probability that a random sequence of length l folds into an arbitrary secondary structure is negligibly small, of the order of 10-15 in the present case  (but see also ). The maximum value of ρ is attained at the statistically stationary state, once the target structure is fixed in the population. The process of fixation and the dependence of ρ with the mutation rate μ have been described in detail elsewhere .
The change in fitness of an RNA molecule under replication with mutation rate μ is quantified by comparing the secondary structures of the mother i and daughter j sequences. We calculate the distances d i and d j , as described, and the difference Δ ij = d i - d j . If Δ ij = Δ > 0 (meaning that the mother sequence was farther from the target structure than its mutated daughter), the mutations increased fitness and count in the total of beneficial changes. If Δ ij = Δ < 0, mutations have caused a negative change in fitness, and the daughter sequence is farther from the target than her mother was. When Δ ij = 0, either mutations had no effect on fitness or no mutation has occurred.
The fraction of replication events with no change in fitness, n' = Π(0)-regardless of whether no mutations occurred or because the ones occurring were neutral-, can be immediately obtained from p and q: n' = 1 - p - q.
The probability of incorporating no mutations is thus P(0) = e-μl, which for μl ≪ 1 reduces to P(0) = 1 - μl. The fraction of truly neutral mutations is thus n = n' - P(0). If one wishes to calculate the fraction of deleterious, beneficial or neutral mutations conditional on at least one mutation having taken place, it suffices to divide the values of p, q, and n above by the expected number of sequences with one or more mutations, 1 - P(0). The effect of single mutations can be estimated in this way only for sufficiently small values of the mutation rate μ, since for relatively large values the probability of having two mutations in the same sequence becomes non-negligible. For example, for μ = 10-3, the probability that a sequence of length l = 50 is hit by two or more mutations is P(k ≥ 2) = 1 - P(0) - P(1) ≃ 0.00121, thus one in a thousand molecules gets more than one mutation under replication. At sufficiently low mutation rates, the distribution of effects of mutations on fitness thus corresponds to the distribution of the effect of single mutations. As μ increases, sequences at a distance of two and more mutations from the parental molecule can appear in a single replication event. This has implications in the mobility of the population in sequence space and in its evolutionary dynamics .
where l corresponds to the maximum fitness change, such that 0 < (σ p , σ q ) < 1. The fraction of deleterious and beneficial mutations and the corresponding average selection coefficients are calculated while keeping a fixed value of the genomic mutation rate μ.
In the following we investigate the response of populations of RNA sequences evolving under the situations described. The three different cases explored are schematically represented in Figure 1. The target structure selected in our simulations is depicted in the upper left corner. Starting with a random population of sequences, for values of the mutation rate below the error threshold μ c , the population is able to climb up towards the optimum of the phenotype space. After the transient, if mutation-selection equilibrium is reached, the population sits around the optimum, with a fraction ρ of correctly folding sequences determined by μ: the error rate at which a population evolves determines the degree of adaptation reached at the equilibrium. The mutation rate also determines the spread of the population in sequence and structure spaces. Populations perturbed through bottlenecks are forced to recover starting with a single sequence. If the time between bottlenecks is long enough, beneficial mutations can be found and fixed, recovery is possible, and the population is to be located near the optimum most of the time. However, for frequent perturbations it is likely that suboptimal sequences are repeatedly chosen. This favours the accumulation of deleterious mutations that separate the population steadily from the optimum: recovery is not possible, and eventual extinction might supervene.
We start the simulation as described, with N = 1000 random RNA sequences subject to selection as a function of their distance to the target structure S, Eq. (1). In the statistically stationary state, the population has attained its maximum degree of optimization and is located as close to the optimum (represented by the target structure) as allowed by the operating mutation rate μ. The fraction ρ of sequences folding into S is a decreasing function of μ for values of the mutation rate below the error threshold; ρ becomes strictly zero for μ > μ c only for sequences of infinite length. In our case, where sequences are short, there is a non-zero probability that the target structure is found, even above the error threshold. However, it is rapidly lost due to the high mutation rate. Hence, though we sporadically see the appearance of S above threshold, it cannot be fixed in the population.
The distributions shown in Figure 2(a) and 2(b) correspond to a mutation rate μ = 0.004, which yields a fraction ρ ≃ 0.34 of correctly folded sequences. Larger values of μ would yield broader distributions and smaller values of μ, corresponding to larger values of ρ, would pack the population closer to the optimum. An important point is to notice that there is always place for improvement within the population for any μ > 0. Suboptimal sequences (with d i ≥ 1) might incorporate mutations (mostly compensatory once at the mutation-selection equilibrium) that diminish their distance to S and thus increase their fitness, while optimal sequences (with d i = 0) might be affected by mutations that change their folded state, and thus count as deleterious. At equilibrium, the two processes balance so as to maintain the fraction ρ constant. The situation is different regarding the genomic diversity of the population. Though at the first generations after fixation of the target structure all sequences are similar, in successive iterations the population explores the neutral network of genotypes. This permits the population to expand in sequence space, displaying broad variability in sequences while maintaining the optimized phenotype. Further, the topological properties of the neutral network allow a deep exploration of the space of configurations, a feature that has been shown to underly the extreme plasticity observed not only in RNA sequence evolution in silico , but also in natural populations . The reorganizations of the populations at the sequence level, which occur all the time, can be masked by the more visible process of adaptation, where the similarity between phenotypes and their degree of optimization is the dominant and obvious outcome of selection processes. The relationship between sequence and structure in RNA molecules illustrates the complex mechanisms relating the genomic level to its phenotypic expression, and constitutes a first, simple instance that allows us to measure the effect of mutations on fitness.
with exponents α ≃ 0.89 ± 0.01 and γ ≃ 1.77 ± 0.02, indicating that, as the degree of adaptation of a population reduces as a consequence of evolution at progressively increased mutation rates, compensatory mutations increase their frequency at a higher speed than do deleterious mutations. We expect the algebraic relationship between phenotypic and microscopic mutation rates to be generic, although the specific values of the exponents α and γ will depend on the length of the sequences and on the particular secondary structure chosen. The ratio q/p quantifies the number of compensatory mutations per deleterious mutation. Its dependence with μ is displayed in the inset of Figure 4. Together with the functional relationships reported in Eq. (6), we obtain q/p ≃ μξ, with ξ = γ - α = 0.87 ± 0.03.
The average selection coefficients vary only slightly with μ, with σ p taking about twice the value of σ q . This relationship is dependent on sequence length and target structure.
The study of adapting populations is performed in a way similar to that for optimized populations. An ensemble of N = 1000 random RNA sequences is the starting point, and we replicate and select sequences as described. Now, however, fitness changes are measured before the population has found the target structure, such that it is formed by a population of phylogenetically related, suboptimal structures. In practice, we measure the structure of the population and the effect of mutations during 50 generations after the initial random state. The average number of generations required to find and fix the target structure is well above this value.
The internal structure of the population at generation 50 is illustrated through the representative example shown in Figure 2(c) and 2(d). The narrow peak around a distance d ≃ 9 to the target structure in Figure 2(c) indicates that the phenotypes of the population are located close to a suboptimal structure at this distance. Note that most sequences are not folding into the best structure at this generation, since structures at distances smaller than the most abundant are also present. They will be likely fixed in subsequent generations, representing a further step towards optimization. Since the population is out of equilibrium and continuously jumping to increasingly better structures, the spread in the genome space is smaller than in the optimized case.
An optimized population can be shifted from equilibrium through the action of an external perturbation. Severe population bottlenecks, very common in the natural environment of highly heterogeneous populations such as RNA viruses, force the population to regenerate from one or a few individuals. The dynamics of this process can be simulated in our scenario. Consider an optimized population that has evolved until reaching mutation-selection equilibrium, as described. Take one of the sequences in the population and let it replicate during g generations. Now the growth of the population is unconstrained until reaching the size N = 1000.
This dynamical rule applies now to the case of a growing population and substitutes the definition of replicative ability given in Eq. (1), the latter applying to populations of constant size. If the sequence at the bottleneck folds into the target structure, then ⟨c⟩ = 2, meaning that it duplicates on the average at each generation. Suboptimal sequences (with d i > 0) replicate at a slower pace. Note that the relative advantage of sequences folding into structures at relative distance m is the same as for the case of constant populations, Eq. (2).
If the number g of generations elapsed between bottlenecks is large enough, the population will be able to recover and again attain mutation-selection equilibrium. As g decreases, the time for optimization becomes shorter and, for g small enough, the total population will not be capable of achieving the optimum. Still, survival might be possible. At too low values of g, however, the perturbation becomes too strong and the few sequences after the bottleneck might be unable to replicate, in which case the population goes extinct.
For optimized and adapting populations, the mutation rate μ is the main source of randomness in the system, permitting as well change and improvement. Population bottlenecks constitute an additional source of stochasticity and represent a stronger perturbation, since selecting a random sequence to found a new population implies choosing a suboptimal structure with a probability of at least 1 - ρ. In those cases, the population faces two important difficulties: first, it has to go again through the adapting transient, where the target structure has not yet been found; second, the reduction in the population size impairs the capability of the ensemble to generate beneficial variants.
Figure 2(e) and 2(f) show the typical structure of a population 20 generations after a bottleneck. A sequence at a large distance from the optimum was selected by chance, such that, after 20 generations, the population is still far from finding the target structure, and formed at that moment by two groups that are identified as the two peaks in the distribution of structural distances. It is interesting how concentrated the population is in sequence space [Figure 2(f)], due to the clonal regeneration and the short time elapsed from the common ancestor of all sequences.
In the possible case that an optimal sequence is selected, it replicates fast and reaches high density in a small number of generations. In this case, almost all sequences in the first generations after the bottleneck fold into optimal structures, thus generating a superoptimal, unstable population that slowly relaxes towards the diversity characteristic of the mutation-selection equilibrium under the action of the mutation rate. Figure 7(b) shows how an increase in the frequency of bottlenecks, here applied every 25 generations, represents a hazard to the population. We have seen extinction in a number of realizations with these parameters, while some other runs show survival for the whole length of the simulation. Extinction becomes systematic for bottlenecks applied too frequently. In Figure 7(c) the perturbation is applied every 10 generations and extinction is certain. Note that, irrespectively of the frequency of bottlenecks, any population suffering extreme bottlenecks is condemned to extinction, since there is a non-zero probability that the molecule chosen at the bottleneck fails to replicate in cases when it is far enough from the optimal structure. This argument applies to any number of molecules, though as the population increases in size this probability becomes negligibly small. The extension of these results to natural systems has to be done with due care. For example, extinction is rare in viral populations subject to bottlenecks even when these are applied at a frequency incompatible with the full regeneration of the population , due to the ability to recover fitness through a multiplicity of pathways.
The critical frequency of bottleneck application for the population to survive depends on other model parameters, especially on the mutation rate μ. Low values of μ correspond to a higher density ρ of correctly folded sequences, thus favouring selection of an optimal sequence at the bottleneck. However, a sufficiently low value of μ hinders the generation of diversity and with it the appearance of more optimal variants. Actually, when bottlenecks are frequent a good evolutionary strategy is to replicate under a not too low mutation rate, such that, in the case that a suboptimal phenotype is the founder of a new population, the time to recover mutation-selection diversity is minimized . This represents an interesting compromise between time-scales: to guarantee long-term survivability, the number of generations between bottlenecks should not be smaller that the number of generations required to find and fix the target structure in the population.
Optimized populations in other fitness landscapes
An important question is whether our results are robust when different definitions of fitness for the RNA molecules are used. Up to now, we have only considered that, the closer a sequence folds to the target structure, the fitter it is. In this section, we introduce two additional definitions of fitness and discuss how they affect the dynamics and organization of optimized populations. As will be shown, the system is able to find a remarkably large number of genotypes compatible with the new selection pressures applied to the system, and the collective statistical properties of the population are only weakly affected.
where is the Hamming distance between the corresponding subsequence in molecule i and the target sequence Q, l s is the length of the target sequence, and β H is a selection parameter that measures the relative strength of selection for target sequence Q versus selection for target structure S. The landscape defined by Eq. (8) is the S + Q-landscape. In the examples to be shown, we have chosen a target sequence GUAUCUUCAC, with l s = 10, placed at the 5' end of molecules in the population, and a selection parameter β H = 1. As previously, β = 2.
where, analogous to previous definitions, is the distance between the energy of molecule i and the reference energy E m , and β E is a selection parameter that again measures the relative strength of selection for low energy versus selection for target structure S. Equation (9) defines the S + E-landscape. In the forthcoming examples, E m = -72 kcal/mol, which corresponds to the energy of a sequence of length l = 50 with 23 G=C pairs that contribute -3 kcal/mol each and where the positive contribution of the loop has been discarded. This value of E m is a lower bound to the energy of sequences of length l = 50. The selection parameter β E = 0.5, and β = 2. Let us compare the mutation-selection equilibria of three populations evolving on each of the landscapes described under a mutation rate μ = 0.003. In order to evaluate the degree of adaptation achieved for each of the traits under selection (structure, sequence, and energy), three relevant quantities describing the macroscopic state of these populations are used: as before, the density of sequences correctly folding into the target sequence, the average energy ⟨E⟩ = N-1∑ i E i of the population, and the fraction ρ H of sequences at distance = 0 from the target sequence Q. As previously performed, the simulation begins with an ensemble of N = 1000 random sequences. At each generation, the quantities d i , and/or E i relevant in each landscape are evaluated. Substitution by a new generation of offspring sequences proceeds as described.
At the mutation-selection equilibrium, we observe the following. The population evolving in landscape S has ρ ≃ 0.43, its average energy is ⟨E⟩ ≃ -12.3 kcal/mol and ρ H ≃ 0. That is, the required secondary structure has been successfully found but essentially none of the subsequences matches Q. The energy of the population is comparable to that of an ensemble of randomly chosen sequences folding into the target structure S. When selection for sequence is turned on in the S + Q landscape, there is a fraction ρ ≃ 0.45 of sequences correctly folding into the target structure. However, this occurs simultaneously with a fraction ρ H ≃ 0.85 of molecules bearing the correct target sequence Q. The average energy in this case is comparable to that in the previous landscape, ⟨E⟩ ≃ -13.0 kcal/mol. Hence, the population has been completely displaced from its position in the space of sequences (compared to landscape S), allowing the optimization of a second trait (target sequence) while keeping other macroscopic values essentially unchanged. The third population, evolving on landscape S + E, is optimized at ρ ≃ 0.36 with ρ H ≃ 0. However, its average energy has significantly decreased to ⟨E⟩ ≃ -28.4 kcal/mol. This again speaks for a major displacement in sequence space with respect to the two previous landscapes, favouring those sequences folding into S with minimum energy, and with a composition that does not match the target sequence Q.
The target structure S and the target sequence Q can be simultaneously (and partly independently) optimized since Q is compatible with folding into S. This is not a general situation, since demanding a target sequence too long or too biased in its composition might prevent the existence of solutions fulfilling both requirements. This is also the case for the selection of S together with the requirement of a low folding energy. If the parameter β E would take an exceedingly high value, structures of low energy (different from S) would dominate the population. This is probably the reason why ρ takes a value slightly lower in landscape S + E than in the two former landscapes. While it is important to take these possibilities into account, our results indicate that, in this model, the simultaneous optimization of two traits is possible for a broad range of selection parameters β, β H , and β E .
In addition to the different conceptual situations represented by the three landscapes introduced in this work, there are relevant geometrical differences among them. Selection for structure represents a rough landscape where changes in a single nucleotide can cause deep reorganizations in the folded configuration. On the contrary, selection for a specific sequence represents a smooth landscape where changes in the distance are strictly proportional to the mutation rate. Actually, for a mutation rate μ the distribution of changes in is a Poisson distribution of average μl, identical to Eq. (4). The landscape corresponding to selection for low energy shares characteristics with both. On the one hand, if mutations acquired through replication do not disrupt the folded state, changes in energy have to be small, since only the composition of the sequence is affected. However, if mutations lead to a different folded state, the minimum energy can suffer major changes, due to the appearance of a different distribution of structural motifs (stacks, loops and dangling ends) having a major role in the folded energy.
with Δ E = E i - E j .
The effects of mutations on fitness
In this section we discuss some of our results in quantitative detail. First, we perform a systematic analysis of the functional form of the distribution of fitness effects for optimized and adapting populations under selection for structure. Second, we analyse the effects on fitness for optimized populations in different fitness landscapes. Finally, we derive gross quantities that are usually measured in experimental research, as the fraction of mutations causing positive or negative changes in fitness conditional on the incorporation of a single mutation.
Statistical analysis of the distribution of fitness effects
Functions used to fit numerical data
Least squares fits to distributions of mutation effects on fitness
μ = 0.001
λ = 11.6 ± 0.6
a = 0.94 ± 0.14
a = 0.87 ± 0.14
k = 0.021 ± 0.001
m = 2.88 ± 0.05
b = 0.092 ± 0.015
b = 9.42 ± 1.70
a = 0.93 ± 0.03
σ = 1.03 ± 0.06
R2 = 0.940
R2 = 0.940
R2 = 0.934
R2 = 0.982
R2 = 0.969
μ = 0.004
λ = 11.4 ± 0.5
a = 1.32 ± 0.15
a = 1.21 ± 0.15
k = 0.022 ± 0.001
m = 2.79 ± 0.03
b = 0.065 ± 0.008
b = 13.2 ± 1.8
a = 0.95 ± 0.05
σ = 0.88 ± 0.03
R2 = 0.974
R2 = 0.975
R2 = 0.972
R2 = 0.959
R2 = 0.989
μ = 0.001
λ = 15.0 ± 1.0
a = 1.95 ± 0.32
a = 1.87 ± 0.32
k = 0.021 ± 0.001
m = 3.01 ± 0.04
b = 0.031 ± 0.006
b = 28.9 ± 5.2
a = 1.13 ± 0.04
σ = 0.76 ± 0.04
R2 = 0.950
R2 = 0.969
R2 = 0.967
R2 = 0.985
R2 = 0.984
μ = 0.004
λ = 13.4 ± 0.7
a = 1.83 ± 0.22
a = 1.73 ± 0.22
k = 0.021 ± 0.001
m = 2.89 ± 0.02
b = 0.039 ± 0.005
b = 23.0 ± 3.1
a = 1.06 ± 0.06
σ = 0.76 ± 0.03
R2 = 0.969
R2 = 0.981
R2 = 0.979
R2 = 0.965
R2 = 0.992
μ = 0.001
λ = 4.85 ± 0.16
a = 0.93 ± 0.09
a = 0.75 ± 0.06
k = 0.026 ± 0.003
m = 2.04 ± 0.05
b = 0.224 ± 0.024
b = 3.03 ± 0.27
a = 0.58 ± 0.05
σ = 1.12 ± 0.07
R2 = 0.962
R2 = 0.964
R2 = 0.972
R2 = 0.840
R2 = 0.947
μ = 0.004
λ = 6.15 ± 0.19
a = 1.27 ± 0.10
a = 1.06 ± 0.07
k = 0.026 ± 0.003
m = 2.17 ± 0.04
b = 0.126 ± 0.011
b = 5.70 ± 0.41
a = 0.68 ± 0.06
σ = 0.91 ± 0.05
R2 = 0.989
R2 = 0.985
R2 = 0.989
R2 = 0.843
R2 = 0.973
μ = 0.001
λ = 6.31 ± 0.18
a = 1.08 ± 0.09
a = 0.90 ± 0.07
k = 0.025 ± 0.003
m = 2.25 ± 0.04
b = 0.146 ± 0.014
b = 4.98 ± 0.42
a = 0.68 ± 0.05
σ = 1.00 ± 0.06
R2 = 0.982
R2 = 0.980
R2 = 0.982
R2 = 0.868
R2 = 0.970
μ = 0.004
λ = 7.86 ± 0.21
a = 1.19 ± 0.09
a = 1.03 ± 0.08
k = 0.024 ± 0.002
m = 2.44 ± 0.03
b = 0.106 ± 0.008
b = 7.28 ± 0.60
a = 0.78 ± 0.06
σ = 0.93 ± 0.04
R2 = 0.990
R2 = 0.987
R2 = 0.986
R2 = 0.900
R2 = 0.987
None of the functions assayed are thus able to explain the shape of the distributions obtained in the whole domain of changes in fitness. One reason might be that our sequences are relatively short and thus fold into small secondary structures. We cannot discard that large effects on fitness could be affected by structural motifs of the particular target secondary structure studied in this work (e.g. the prominent shoulder observed in several distributions of deleterious fitness effects is probably caused by global disruptions of the target secondary structure).
Least-squares fit of a Pareto function to the distribution of small effects on fitness
% of mutations
μ = 0.001, B Optimized
k = 0.0202 ± 0.0003
a = 0.848 ± 0.014
R2 = 0.998
μ = 0.004, B Optimized
k = 0.0210 ± 0.0010
a = 0.812 ± 0.043
R2 = 0.981
μ = 0.001, B Adapting
k = 0.0205 ± 0.0006
a = 1.065 ± 0.048
R2 = 0.988
μ = 0.004, B Adapting
k = 0.0210 ± 0.0011
a = 0.960 ± 0.065
R2 = 0.971
μ = 0.001, D Optimized
k = 0.0212 ± 0.0010
a = 0.393 ± 0.015
R2 = 0.987
μ = 0.004, D Optimized
k = 0.0233 ± 0.0016
a = 0.446 ± 0.027
R2 = 0.967
μ = 0.001, D Adapting
k = 0.0216 ± 0.0011
a = 0.475 ± 0.020
R2 = 0.983
μ = 0.004, D Adapting
k = 0.198 ± 0.0015
a = 0.586 ± 0.036
R2 = 0.967
Comparison of Pareto fits to the distribution of small effects on fitness for three different fitness landscapes
% of mutations
k = 0.0197 ± 0.0003
a = 0.993 ± 0.020
R2 = 0.997
k = 0.0200 ± 0.0003
S + Q
a = 0.976 ± 0.017
R2 = 0.998
k = 0.0198 ± 0.0004
S + E
a = 1.170 ± 0.034
R2 = 0.995
k = 0.0190 ± 0.0010
a = 0.391 ± 0.016
R2 = 0.984
k = 0.0219 ± 0.0016
S + Q
a = 0.529 ± 0.034
R2 = 0.963
k = 0.0197 ± 0.0014
S + E
a = 0.596 ± 0.036
R2 = 0.965
Average effect of mutation rate on phenotype
In this section, we compare by means of explicit examples the overall effect of mutations, as obtained in our simulations with RNA sequences, with some measures performed in natural systems. Due to the important differences between populations, environments, and fitness definitions used in different works, the effect of mutations in those different systems should be compared with due care. Nevertheless, the values obtained might give clues about the degree of optimization of the population and about the likeliness that fitness is improved. In the following, the numerical results obtained in our simulations correspond to populations evolving on the S-landscape.
Let us illustrate, with an example, the overall relative fractions of beneficial, deleterious, and neutral mutations affecting optimized and adapting populations. Take the mutation rate μ = 0.001. As described, there is a fraction P(0) = 0.951 of replication events without mutations and an amount 1 - P(0) - P(1) = 1.209 × 10-3 of daughter sequences incorporating two or more mutations. Hence, in most cases where mutations occur, the replicated sequence is hit by a single mutation (P(1) = 0.047). Let us consider only those cases, where a mutation has been incorporated. Our simulations indicate that, at equilibrium, 0.59% of such mutations have beneficial effects, while 46.34% diminish fitness. A large amount of mutations is neutral: 53.07%. Adapting populations suffer a similar amount of neutral mutations, 54.33% at the same mutation rate μ, but they substantially differ in the amount of mutations with an effect on fitness. Deleterious mutations represent 38.51%, and beneficial mutations increase to 7.16%. It would be difficult to evaluate the effect of single mutations at higher rates of μ. For example, if the average number of mutations per replication event is one per genome (μ = 0.02 in our case), over 26% of the daughter genomes incorporate two or more mutations, and epistatic effects should not be discarded. This might entail a difficulty when translating the results of controlled experiments to natural situations involving fast-mutating replicators.
Empirical results concerning the effect of mutations on fitness have been obtained in different contexts and for different model systems. Single nucleotide substitutions in vesicular stomatitis virus yielded complex distributions of fitness effects, with the functional form of those corresponding to beneficial mutations (Γ-distribution) differing from that of deleterious mutations (a Lognormal distribution), and both depending on how mutations were acquired . The overall probability of a deleterious mutation (conditional on a single mutation having taken place) was above 35%, and the probability of beneficial mutations was around 9%. Such a large amount of beneficial mutations suggests that the population was not yet optimized with respect to the fitness trait measured in laboratory assays.
We have quantified the effect of mutations on fitness for populations of RNA sequences in different situations. Beneficial or compensatory mutations are systematically less probable than deleterious ones, the ratio between both types being strongly dependent on the degree of optimization of the population. Once the error threshold is crossed (at high error rates), optimized and adapted populations behave similarly. Selection coefficients are almost constant, regardless of the degree of adaptation of the population. They are only slightly lower in adapting populations than in optimized populations evolved at the same mutation rate, indicating that adaptation takes place through an increase in the fraction of beneficial mutations, preferably of those with small effect in fitness.
Efforts to quantify the distribution of positive fitness effects rely on the interest to understand the dynamics of the adaptive process. Mutations of small effect are difficult to detect and quantify, so both theory and experiment have mainly addressed the tail of the distribution, where mutations with a large positive effect on fitness sit. Extreme value theory predicted an exponential shape for large effects . However, available data for beneficial mutations in the phage Φ6 rejects the exponential shape and points to a right-truncated distribution . As yet, the distribution of small beneficial effects has not been empirically evaluated. Our numerical results indicate that the whole distribution is compatible with an algebraic decay for small effects up to values around a 25% change in fitness, explaining however up to 90% of mutations. The shape of the distribution of fitness effects is similar for beneficial and deleterious effects, and both can be satisfactorily fit by a Pareto probability distribution.
The results here presented could be used to design more realistic evolutionary models where an explicit representation of the microscopic mutation rate is not feasible. These suggest using phenotypic mutation fractions that increase algebraically with the genomic mutation rate μ. In a first approximation, and in the absence of additional evidence, it would be advisable to maintain constant selection coefficients. As a consequence, at least in the system that we have studied, fitness equilibria are reached and maintained mainly through compensatory epistasis, and not through a variation in the average selective coefficients with fitness. This result is in good agreement with recent measurements of the same effects in populations of the bacteriophage ΦX174, where, moreover, the distributions of beneficial and deleterious fitness effects follow a single functional form , as obtained here.
The restrictions imposed by the simultaneous selection of more than one phenotypic trait is an important subject worthy of additional analysis. Landscapes such as those introduced here could be a first step towards the quantification of constraints in evolution and adaptation of complex populations. Despite remarkable differences among the situations investigated, the distributions of effects on fitness are best explained by a unique functional form (at least for small effects) irrespectively of the state of adaptation of the population (whether optimized or adapting) and of the fitness landscape on which the population evolves (S, S + Q, or S + E). One of our future objectives is to deepen the study of the universality of these and similar models, analyzing whether populations differing in their fitness values across environments, evolving towards different target structures, or incorporating a larger number of realistic phenotypic traits, still adapt through an increase in the amount of beneficial mutations with small effects on fitness, and whether the functional form of the distribution of fitness effects agrees with those obtained here.
Simulations have been carried out at the Itanium II cluster of INTA (Instituto Nacional de Técnica Aeroespacial, Spain). For random number generation, we relied on the Mersenne Twister and Ziff's GFSR4 algorithms as provided by GNU Scientific Library (GSL), Version 1.7 . For secondary structure folding (minimum free energy) and calculation of base-pair and Hamming distances, we use the Vienna RNA package , version 1.5, with the current standard parameter set. Nonlinear regressions to probability distribution functions were performed with the Statistics packages of Mathematica 5.2.
Useful discussions with J. Aguirre are gratefully acknowledged. We thank D. Hochberg for a careful reading of the manuscript. Support from the Spanish MICINN through research project FIS2008-05273 is gratefully acknowledged.
- de Visser JAGM, Rozen DE: Limits to adaptation in asexual populations. J Evol Biol. 2005, 18: 779-788. 10.1111/j.1420-9101.2005.00879.x.View ArticlePubMedGoogle Scholar
- Wilke CO, Lenski RE, Adami C: Compensatory mutations cause excess of antagonistic epistasis in RNA secondary structure folding. BMC Evol Biol. 2003, 3: 3-10.1186/1471-2148-3-3.PubMed CentralView ArticlePubMedGoogle Scholar
- Domingo E, Holland JJ: RNA virus mutations and fitness for survival. Annu Rev Microbiol. 1997, 51: 151-178. 10.1146/annurev.micro.51.1.151.View ArticlePubMedGoogle Scholar
- Lázaro E, Escarmís C, Domingo E, Manrubia SC: Modeling viral genome fitness evolution associated with serial bottleneck events: Evidence of stationary states of fitness. J Virol. 2002, 76: 8675-8681. 10.1128/JVI.76.17.8675-8681.2002.PubMed CentralView ArticlePubMedGoogle Scholar
- Lázaro E, Escarmís C, Pérez-Mercader J, Manrubia SC, Domingo E: Resistance of virus to extinction upon bottleneck passages: study of a decaying and fluctuating pattern of fitness loss. Proc Natl Acad Sci USA. 2003, 100: 10830-10835. 10.1073/pnas.1332668100.PubMed CentralView ArticlePubMedGoogle Scholar
- Silander OK, Tenaillon O, Chao L: Understanding the evolutionary fate of finite populations: The dynamics of mutational effects. PLoS Biol. 2007, 5: 922-931. 10.1371/journal.pbio.0050094.View ArticleGoogle Scholar
- Huynen MA, Stadler PF, Fontana W: Smoothness within ruggedness: the role of neutrality in adaptation. Proc Natl Acad Sci USA. 1996, 93: 397-401. 10.1073/pnas.93.1.397.PubMed CentralView ArticlePubMedGoogle Scholar
- Koelle K, Cobey S, Grenfell B, Pascual M: Epochal evolution shapes the phylodynamics of interpandemic influenza A (H3N2) in humans. Science. 2006, 314: 1898-1903. 10.1126/science.1132745.View ArticlePubMedGoogle Scholar
- Stich M, Briones C, Manrubia SC: Collective properties of evolving molecular quasispecies. BMC Evol Biol. 2007, 7: 110-10.1186/1471-2148-7-110.PubMed CentralView ArticlePubMedGoogle Scholar
- Kimura M: On the evolutionary adjustment of spontaneous mutation rates. Genet Res. 1967, 9: 23-34. 10.1017/S0016672300010284.View ArticleGoogle Scholar
- Leigh EG: Natural selection and mutability. Am Nat. 1970, 104: 301-305. 10.1086/282663.View ArticleGoogle Scholar
- Ishii K, Matsuda H, Iwasa Y, Sasaki A: Evolutionary stable mutation rate in a periodically changing environment. Genetics. 1989, 121: 163-174.PubMed CentralPubMedGoogle Scholar
- Kimura M, Maruyama T: The mutational load with epistatic interactions in fitness. Genetics. 1966, 54: 1337-1351.PubMed CentralPubMedGoogle Scholar
- Gillespie J: A simple stochastic gene substitution model. Theor Pop Biol. 1983, 23: 202-215. 10.1016/0040-5809(83)90014-X.View ArticleGoogle Scholar
- Orr HA: The distribution of fitness effects among beneficial mutations. Genetics. 2003, 163: 1519-1526.PubMed CentralPubMedGoogle Scholar
- Agrawal AF: Genetic loads under fitness-dependent mutation rates. J Evol Biol. 2002, 15: 1004-1010. 10.1046/j.1420-9101.2002.00464.x.View ArticleGoogle Scholar
- Martin G, Lenormand T: The fitness effect of mutations across environments: A survey in the light of fitness landscape models. Evolution. 2006, 60: 2413-2427.View ArticlePubMedGoogle Scholar
- Sanjuán R, Moya A, Elena SF: The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus. Proc Natl Acad Sci USA. 2004, 101: 8396-8401. 10.1073/pnas.0400146101.PubMed CentralView ArticlePubMedGoogle Scholar
- Rokyta DR, Joyce P, Caudle SB, Wichman HA: An empirical test of the mutational landscape model of adaptation using a single-stranded DNA virus. Nat Gen. 2005, 37: 441-444. 10.1038/ng1535.View ArticleGoogle Scholar
- Rokyta DR, Beisel CJ, Joyce P, Ferris MT, Burch CL, Wichman HA: Beneficial fitness effects are not exponential for two viruses. J Mol Evol. 2008, 67: 368-376. 10.1007/s00239-008-9153-x.PubMed CentralView ArticlePubMedGoogle Scholar
- Eyre-Walker A, Keightley PD: The distribution of fitness effects of new mutations. Nat Rev Genet. 2007, 8: 610-618. 10.1038/nrg2146.View ArticlePubMedGoogle Scholar
- Fisher RA: The genetical theory of natural selection. 1930, Oxford University Press, Oxford, EnglandView ArticleGoogle Scholar
- Lande R: The response to selection on major and minor mutations affecting a metrical trait. Heredity. 1983, 50: 47-65. 10.1038/hdy.1983.6.View ArticleGoogle Scholar
- Orr HA: The population genetics of adaptation: the distribution of factors fixed during adaptive evolution. Evolution. 1998, 52: 935-949. 10.2307/2411226.View ArticleGoogle Scholar
- Bull JJ, Badgett MR, Wichman HA: Big-Benefit mutations in a bacteriophage inhibited with heat. Mol Biol Evol. 2000, 17: 942-950.View ArticlePubMedGoogle Scholar
- Barrett RDH, McLean RC, Bell G: Mutations of intermediate effect are responsible for adaptation in evolving Pseudomonas fluorescens populations. Biol Lett. 2006, 2: 236-238. 10.1098/rsbl.2006.0439.PubMed CentralView ArticlePubMedGoogle Scholar
- Mc Lean RC, Buckling A: The distribution of fitness effects of beneficial mutations in Pseudomonas aeruginosa. PLoS Genetics. 2009, 5: e1000406-10.1371/journal.pgen.1000406.View ArticleGoogle Scholar
- Kassen R, Bataillon T: Distribution of fitness effects among beneficial mutations before selection in experimental populations of bacteria. Nat Gen. 2006, 38: 484-488. 10.1038/ng1751.View ArticleGoogle Scholar
- Hegreness M, Shoresh N, Hartl D, Kishony R: An equivalence principle for the incorporation of favourable mutations in asexual populations. Science. 2006, 311: 1615-1617. 10.1126/science.1122469.View ArticlePubMedGoogle Scholar
- Cowperthwaite MC, Bull JJ, Ancel Meyers L: Distribution of beneficial fitness effects in RNA. Genetics. 2005, 170: 1449-1457. 10.1534/genetics.104.039248.PubMed CentralView ArticlePubMedGoogle Scholar
- Schuster P: Prediction of RNA Secondary Structures: From Theory to Models and Real Molecules. Rep Prog Phys. 2006, 69: 1419-1477. 10.1088/0034-4885/69/5/R04.View ArticleGoogle Scholar
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163.View ArticleGoogle Scholar
- Stein PR, Waterman MS: On some new sequences generalizing the Catalan and Motzkin numbers. Discrete Math. 1978, 26: 261-272. 10.1016/0012-365X(79)90033-5.View ArticleGoogle Scholar
- Stich M, Briones C, Manrubia SC: On the structural repertoire of pools of short, random RNA sequences. J Theor Biol. 2008, 252: 750-763. 10.1016/j.jtbi.2008.02.018.View ArticlePubMedGoogle Scholar
- Wilke CO: Adaptive evolution on neutral networks. Bull Math Biol. 2001, 63: 715-730. 10.1006/bulm.2001.0244.View ArticlePubMedGoogle Scholar
- Cases-González C, Arribas M, Domingo E, Lázaro E: Beneficial Effects of Population Bottlenecks in an RNA Virus Evolving at Increased Error Rate. J Mol Biol. 2008, 384: 1120-1129. 10.1016/j.jmb.2008.10.014.View ArticlePubMedGoogle Scholar
- Escarmís C, Gómez-Mariano G, Dávila M, Lázaro E, Domingo E: Resistance to extinction of low fitness virus subjected to plaque-to-plaque transfers: Diversification by mutation clustering. J Mol Biol. 2002, 315: 647-661. 10.1006/jmbi.2001.5259.View ArticlePubMedGoogle Scholar
- Aguirre J, Lázaro E, Manrubia SC: A trade-off between fast growth and diversity generation limits the optimization of viral quasispecies. J Theor Biol. 2009, 261: 148-155. 10.1016/j.jtbi.2009.07.034.View ArticlePubMedGoogle Scholar
- GNU Scientific Library (GSL). [http://www.gnu.org/software/gsl/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.