Probabilistic models for CRISPR spacer content evolution
 Anne Kupczok^{1}Email author and
 Jonathan P Bollback^{1}
DOI: 10.1186/147121481354
© Kupczok and Bollback; licensee BioMed Central Ltd. 2013
Received: 22 November 2012
Accepted: 14 February 2013
Published: 26 February 2013
Abstract
Background
The CRISPR/Cas system is known to act as an adaptive and heritable immune system in Eubacteria and Archaea. Immunity is encoded in an array of spacer sequences. Each spacer can provide specific immunity to invasive elements that carry the same or a similar sequence. Even in closely related strains, spacer content is very dynamic and evolves quickly. Standard models of nucleotide evolution cannot be applied to quantify its rate of change since processes other than single nucleotide changes determine its evolution.
Methods
We present probabilistic models that are specific for spacer content evolution. They account for the different processes of insertion and deletion. Insertions can be constrained to occur on one end only or are allowed to occur throughout the array. One deletion event can affect one spacer or a whole fragment of adjacent spacers. Parameters of the underlying models are estimated for a pair of arrays by maximum likelihood using explicit ancestor enumeration.
Results
Simulations show that parameters are well estimated on average under the models presented here. There is a bias in the rate estimation when including fragment deletions. The models also estimate times between pairs of strains. But with increasing time, spacer overlap goes to zero, and thus there is an upper bound on the distance that can be estimated. Spacer content similarities are displayed in a distance based phylogeny using the estimated times.
We use the presented models to analyze different Yersinia pestis data sets and find that the results among them are largely congruent. The models also capture the variation in diversity of spacers among the data sets. A comparison of spacerbased phylogenies and Cas gene phylogenies shows that they resolve very different time scales for this data set.
Conclusions
The simulations and data analyses show that the presented models are useful for quantifying spacer content evolution and for displaying spacer content similarities of closely related strains in a phylogeny. This allows for comparisons of different CRISPR arrays or for comparisons between CRISPR arrays and nucleotide substitution rates.
Keywords
CRISPR/Cas Maximum Likelihood Microbial genome evolution Bacterial immunityBackground
Bacteria and Archaea have an adaptive heritable immune system against viruses, plasmids and other mobile genetic elements [1, 2]. This locus, CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats), consists of an array of repeats and unique spacers. The repeats are of length 2148 nucleotides depending on CRISPR type and species. The spacer sequences are 2672 nucleotides in length, where the variance of spacer length within one array is small. The spacer sequences were found to be of extrachromosomal origin [3] and are involved in immunity [1, 2]. Cas (CRISPRassociated) genes adjacent to the CRISPR arrays are necessary for the biogenesis of the CRISPR RNA, for the interference with the target nucleic acid and for the acquisition of new spacer sequences [4]. Different types of CRISPR/Cas systems exist based on the set of Cas genes present [5].
Comparisons of the CRISPR array of closely related strains showed that the CRISPR array undergoes a rapid evolution that is mainly determined by the gain and loss of the whole system or of individual spacers [6, 7]. In most cases, spacer addition was observed at the beginning, the ‘leader’ end, of the array [1] and the pattern in metagenomic samples suggest that deletion of consecutive repeatspacer units occurs [6]. Bacterial genomes can have multiple CRISPR arrays that differ in their dynamics [7–9]. It was observed that closely related strains can differ in their spacer content, thus the CRISPR array is used as a tool for strain typing (e.g., [8, 10, 11]).
The targeting of extrachromosomal elements by the CRISPR/Cas system was discovered recently [1, 2] and many questions regarding the functions, mechanisms and evolution of this locus are still open. This is complicated by the fact that different CRISPR/Cas systems have different mechanisms and may have different function [4]. Thus computational methods that make predictions are important to narrow the space of hypotheses that need to be tested experimentally. For example, selftargeting spacers are not conserved between species and CRISPR arrays with selftargeting spacers may get inactivated. These observations exclude the hypothesis of gene regulation by CRISPR [12].
Using model simulations can provide insights into the parameters allowing CRISPR existence and into the details of CRISPR dynamics. One result using population genetics models is that CRISPR is maintained if it provides immunity to viruses or plasmids even when there is a cost of having CRISPR [13]. Simulating a spatial model of virus and host population showed that coexistence is possible with a CRISPRbased immune system [14]. Furthermore, a spatially structured environment can lead to intermediate array lengths, i.e., the number of spacers has an optimum between 0 and the number of viruses excluding the extreme values. Then the lengths are determined by the spacer insertion rate and by the cost for having spacers not by the total number of phages in the environment [15]. Modeling coevolution of hosts and viruses results in the observation that spacers at the leaderdistal end tend to be more conserved, due to selective sweeps, and that immunity to contemporary viruses is mainly determined by the most recently acquired spacers [16, 17]. In addition, simulations can find parameter regimes that are important for the existence of CRISPR like a threshold on the viral mutation rate [18].
Our approach differs from the population genetics models described since it estimates parameters directly from the array data. We describe the dynamics of the CRISPR locus over time in diverging populations related by a phylogeny. This is the phylogeny of the CRISPR/Cas locus. Since the locus can be transferred horizontally [19], the CRISPR/Cas phylogeny does not need to be identical to the strain phylogeny. There are a few instances of recombination inside cas genes [20], but in our model, we exclude recombination in the spacer arrays. The evolutionary events we model are spacer insertion and deletion. By using only strains harboring the locus, we ignore the loss or gain of the whole CRISPR/Cas system.
Mutations inside the CRISPR locus are also not included in the model, but in data analyses multiple spacers with sequence similarities can be subsumed into one identity.
Even before the function of the CRISPR/Cas system was clear, Pourcel et al. [8] formulated three observations for CRISPR evolution by comparing Yersinia pestis arrays: Random deletions of one or more spacers and repeats; polarized addition of new spacers; and identical spacers reflect shared ancestry not independent events. We also assume that the CRISPR arrays analyzed are homologous and that each spacer was only inserted once, i.e., all spacers with identical sequence are identical by descent. Thus we present three models: an unordered model (spacer content is considered as a set), an ordered model (where insertion is polarized, i.e., insertions occur at one end only) and a fragment loss model (where insertion is polarized and successive spacers can get deleted together in a single event).
Another class of models that take order relationships into account are gene order models, i.e., they model the order of genes in the genome over time. Most methods for evaluating the distance between two gene orders find the minimum number of rearrangement events between these genomes. This approach can also be combined with insertions and segment deletions [21, 22]. Probabilistic methods of rearrangement only model inversions [23, 24] or inversions and transpositions [25]. Multigene events are considered in one model of gene innovation, duplication and deletion, but ignoring the order of genes on the genome [26].
Our ordered and fragment loss models are thus different from the probabilistic models for gene order since they capture the properties specific for CRISPR spacer evolution. We describe our method and investigate its properties by simulation and application to real world Yersinia pestis data sets [8, 27].
Methods
Models
We describe different models for estimating insertion and deletion rates from CRISPR arrays. We ignore repeats and only use the spacer information and their order encoded in an array. The leader end is displayed on the left (see also Figure 1). In our models, these arrays evolve by insertion and deletion events. An overview of the types of insertions and deletions allowed in the different models can be found in Table 1. In all models, the waiting time for insertion events is exponentially distributed with rate λ (Figure 1). One spacer is inserted for each insertion event.
Overview over models
Independent loss model  Fragment loss model  

Unordered  Ordered  
Insertions  Random  Polarized  Polarized 
Deletions  Single  Single  Fragments 
In the fragment loss model the position is informative since insertion occurs in the beginning and subsequent spacers can get lost together. This model is motivated by the pattern in metagenomic samples that shows deletion of consecutive repeatspacer units [6]. Each possible nonempty substring of the array is a fragment. Thus fragments can be overlapping and one spacer inside an array is then part of different fragments. For example, the array 321 (Figure 1) consists of the fragments 1, 2, 3, 21, 32 and 321. The fragment 32 overlaps with the fragment 21 in the spacer 2. And the spacer 2 occurs in 4 fragments: 2, 21, 32 and 321. For each possible fragment, the waiting time to get lost is exponentially distributed with rate μ, independent of the number of spacers a fragment contains. In the length model, all lengths smaller than the current length are accessible in a single step (Figure 2).
Since μ has a different meaning in both models, we emphasize this by using μ_{ F } for the fragment loss model (μ_{ F } affects each possible fragment), and μ_{ I } for the independent loss model (μ_{ I } affects only single spacers). The rates are always rescaled such that one event (insertion or deletion) is expected in time t = 1. This allows for estimating times, but only the ratio $\rho =\frac{\lambda}{\mu}$ can be estimated. Again, we distinguish the two models by using ${\rho}_{F}=\frac{\lambda}{{\mu}_{F}}$ and ${\rho}_{I}=\frac{\lambda}{{\mu}_{I}}$. Subscripts are omitted when the underlying model is clearly stated.
Now, we present the stationary distribution of the length models and the transition probabilities of the full model necessary to formulate an estimation approach under each of these models. Afterwards details of the estimation approaches are described.
Independent loss models
Length model
Transition probabilities
M and D follow directly from the exponential model. I is known from queuing theory [28]. The probability of inserting j spacers is the probability of observing j spacers after time t when there were 0 spacers at time 0. That is the integration over all possible paths leading to j, including paths where spacers were inserted and lost and thus never observed.
Fragment loss models
Length model
Equation (3) can be solved from the conditions that in stationarity the flow into a state equals the flow out of that state and that the probabilities of such events necessarily sum to 1 (see Additional file 1).
Transition probabilities
As before, the probability of inserting i spacers includes unobserved spacers that were inserted and lost again. These equations were found by integration over all possible paths in Mathematica 8.0 [29].
For example, in Figure 4 the transition probability is I(3t, λ, μ) × M(2t, μ) × D(t, μ) × M(1t, μ).
Estimation
Maximum likelihood function
We describe a maximum likelihood approach to estimate rates and times of spacer insertions and deletions, given a set of ordered spacer arrays from different strains. Since we do not have phylogenetic information, we consider each pair of arrays and their possible common ancestors.
where s is the list of all different pairs of S and t is the corresponding list of pairs of times.
where λ and μ are computed from ρ given the constraints $\frac{\lambda}{\mu}=\rho $ and the expected number of insertions and deletions in time 1 is 1. Then, q(aλ, μ) is the probability of observing a, T(a → bt, λ, μ) is the transition probability of changing from a to b in time t given insertion rate λ and deletion rate μ.
Note that q and p are different but related by the following constraints: The sum of all q(a) with a = n is p(n) and q(a) = q(b) if a = b.
Optimization
We are interested in both the estimate of ρ, $\widehat{\rho}$, and the estimation of the divergence times. For a pair, we denote the estimated time between two arrays as $\widehat{\tau}={\widehat{t}}_{1}+{\widehat{t}}_{2}$. τ for a phylogeny or for a collection of pairs denotes the average of τ over all pairs.
 1.
Estimate a starting ρ from the length model by maximum likelihood. The likelihood function is ${L}_{\text{start}}\left(\rho \right)=\sum _{\text{arrays}s}p\left(\rights\left\phantom{\rule{1em}{0ex}}\right\rho )$, where S is the length of s and p is the stationary distribution of the length model.
 2.
For each pair of spacers with overlap, generate the possible ancestors: Ancestral arrays can be arbitrarily large, but the probability of observing a certain length is given by p(n). For practical reasons we do not consider ancestors whose length is outside the central 99% of the stationary distribution given by ρ estimated in step 1, since they would have a negligible contribution to the likelihood. In detail, the length l _{1} where the cumulative distribution exceeds 0.005 is the minimum ancestor length and the length l _{2} where the cumulative distribution exceeds 0.995 is the maximum ancestor length. Then the possible ancestor lengths n are between l _{1} and l _{2}: l _{1} ≤ n ≤ l _{2}.
 3.
 (a)
For all pairs with overlap, estimate the times with fixed ρ. It is possible to iterate through the pairs and estimate their times independently of the other pairs. The estimation of both times is iterated alternatingly until the likelihood has converged.
 (b)
Estimate ρ with fixed times using L(ρt, S).
 (c)
Check if the loglikelihood of the estimated parameters has converged, then return the estimated parameters, else repeat step (a) with the new parameters.
 (a)
All three models are analyzed in this computational framework. All optimization steps only optimize one parameter and use Powell’s method from the python package scipy[30]. The python package mpmath is used for highprecision computing [31] that is necessary to compute the probability functions accurately.
Ancestors
Here, we describe for each of the models how we generate the ancestors in step 2 above. Thereby we must account for unobserved spacers, that are not present in the data but in ancestral lineages. We overcome the problem of the infinite state space by ignoring the identity of unobserved spacers. For example, there may be four unobserved spacers, each of them gets a new unique name, but then no other four unobserved spacers with other names or an other order are considered.
Unordered model
Given a pair of arrays s_{1} and s_{2}, they have c spacers in common, d_{1} are unique to s_{1} and d_{2} are unique to s_{2}. Then all n between min(c, l_{1}) and l_{2} are generated. When length n is generated, enumerate all i, j, u such that c + i + j + u = n, i ≤ d_{1} and j ≤ d_{2}. Then for ancestor a, there are c common spacers, i only occur in s_{1}, j only occur in s_{2} and u are unobserved (they are lost in both lineages). Since this ancestor comprises multiple spacer identities, we assign a weight to it, $w\left(a\right)=\left(\genfrac{}{}{0ex}{}{{d}_{1}}{i}\right)\times \left(\genfrac{}{}{0ex}{}{{d}_{2}}{j}\right)$. The weights for each n are rescaled such that they sum to 1, i.e., the rescaled weight w_{ s } is ${w}_{s}\left(a\right)=\frac{w\left(a\right)}{\sum _{b,\leftb\right=\lefta\right}w\left(b\right)}$. Then q(a) = w_{ s }(a)p(a).
Ordered model
Given a pair of arrays s_{1} and s_{2}, find the first shared spacer. The ancestor must contain this spacer and all subsequent spacers from both arrays, these are c spacers in total. There are d_{1} and d_{2} spacers before the first shared spacer in s_{1} and s_{2}, respectively. With these new definitions of c, d_{1} and d_{2}, the method from the unordered model is applied.
Fragment loss model
For the fragment loss model, the ancestors must fulfil several constraints given by the order in the observed arrays. Since all shared ancestors are identical by descent and insertions occurs only in the beginning, all spacers from the first shared spacer on must be present in the ancestor. Thereby the order of spacers must be preserved. Enumerating the ancestors is best explained with an example. Consider the arrays s_{1} = 8764321, s_{2} = 111097652.

7 is the first shared spacer.

The set of spacers necessarily present in the ancestor is the union of all spacers after the first shared spacer: {1,2,3,4,5,6,7}.

Possible orders of these spacers: 7654321, 7645321, or 7643521


The set of spacers possibly present in the ancestor is the union of all spacers before the first shared spacer: {8,9,10,11}.

Order constraints for these spacers: 11 before 10 before 9 before 7


Unobserved spacers (spacers present in the ancestor and lost in both lineages) may have occurred at all possible positions.
Since a lot of possible arrays are generated by this approach, heuristics are used to reduce their number:

Shared fragments cannot be interrupted by an unobserved spacer.

In the example, there is no unobserved spacer between 6 and 7.


Unique fragments in the beginning are not mixed.

In the example, 8 and 11109 are in the beginning and then the following ancestral fragments are not allowed: 11810 and 1089.


Deleted pairs are also not mixed.

In the example 43 and 5 are deleted and the ancestral fragment 453 is ignored.


The number of positions with unobserved spacers is maximal four. That means there can still be a lot of unobserved spacers but they occur only in maximal four stretches.
This reduction is only for computational reasons, and may result in the true/simulated ancestor not being included in the set of possible ancestors. For small simulations it was shown that the results are very similar (data not shown) and that the ancestors generated contain enough information for the likelihood function.
Loss time
Two arrays do not contain information about the divergence time if they have no overlap. To include them in the analysis, we are interested in the time passed until an array lost all spacers present in the ancestor.
The lineage loss time distribution for a given ρ is the following distribution of times: Given an array in stationarity, when does the last spacer from the ancestral array gets lost? The expected lineage loss time is the expectation of this distribution. Analogously, we define the pairwise loss time distribution as the distribution of times when two independently evolving lineages lost their last common spacer. In detail, we simulate two lineages starting from a common ancestor and track changes in both lineages simultaneously. t is the time when the deletion in one lineage results in the loss of all spacers that are present at that time in the other lineage. The pairwise loss time simulated is then 2t since there were two lineages. The distribution is always approximated using 10,000 simulated pairs.
Simulation
Simulation under each model is implemented in a python program. Input is a phylogeny with branch lengths, ρ and the type of the model. An ancestor length is drawn at the root of the phylogeny from the stationary distribution of the length model. Spacers are labelled arbitrarily. Then the tree is traversed in preorder and the descendent of each branch given its ancestor and branch length t is simulated as follows.
 1.Determine the time until the next event of each type:
 (a)
Draw a waiting time until the next insertion event from an exponential distribution with rate λ.
 (b)
Draw the waiting times until the next deletion for each spacer or fragment.
(b1) If the independent loss model is simulated, draw n exponential waiting times, each with rate μ.
(b2) If the fragment loss model is simulated, draw $\frac{n(n+1)}{2}$ exponential waiting times with rate μ, one for each fragment.
 (a)
 2.
Find the minimal time t _{min} over all times generated in step 1.
 3.
t _{ c } = t _{ c } + t _{min}.
 4.
If t _{ c } > t, return s as the sequence at the descendent node.
 5.
Else the event that corresponds to t _{min} is realized, the other events are discarded. If t _{min} corresponds to an insertion, one spacer with a new name is inserted. In case of the unordered model, the spacer is inserted at a random position, in the other cases it is always inserted in the beginning of s. If t _{min} corresponds to a deletion, modify s by deleting the corresponding fragment or spacer.
 6.
Continue at step 1 with the modified s.
Phylogeny computation using CRISPR distances
The sum of the estimated times given two strains, τ, can be interpreted as the distance between these two strains. These distances can be used to compute a distancebased phylogeny using neighbor joining [32] as was presented by Huson and Steel [33]. For the nonreversible models, however, there is more information available, since there is an estimate for the distance of the last common ancestor to each of the two strains. We use a modified neighbor joining method to utilize this information and refer to it as rooted neighbor joining. We describe the algorithm with an example.
Input: For k taxa, all $\left(\genfrac{}{}{0ex}{}{k}{2}\right)$ pairs with rooted time estimates, that is d_{x, y} for the distance to taxon x from the ancestor of the pair (x, y).
Output: A rooted phylogenetic tree with times t.
Algorithm
 1.Compute the weights for all pairs (x,y):$\begin{array}{c}{w}_{x,y}=\sum _{z\ne x,y}({d}_{x,z}{d}_{x,y}+{d}_{y,z}{d}_{y,x})\hfill \\ \phantom{\rule{2.6em}{0ex}}=(2k)({d}_{x,y}+{d}_{y,x})+\sum _{z\ne x,y}({d}_{x,z}+{d}_{y,z})\hfill \end{array}$
 2.
Choose the pair with maximal weight w _{x,y}. Create a new node r that is the ancestor of (x, y) with t _{r,x} = d _{x,y} and t _{r,y} = d _{y,x}.
 3.Compute the distances between all other nodes z and r:${d}_{r,z}=\frac{1}{2}({d}_{x,z}{d}_{x,y}+{d}_{y,z}{d}_{y,x})\text{,}\phantom{\rule{0.3em}{0ex}}{d}_{z,r}=\frac{1}{2}({d}_{z,x}+{d}_{z,y})$
 4.
If only one node is left, return it as the root, else continue with step 1.
By construction, the method results in the correct rooted tree if the distances were extracted from a rooted tree. We show this for three taxa.
For three taxa, there is only one clade, we choose (1,2) to be the correct clade. Then the branch lengths are given in Figure
Weights: w_{1,2} = (d_{1,2} + d_{2,1}) + d_{1,3} + d_{2,3} = (a + b) + a + c + b + c = 2c, w_{1,3} = (a + c + d) + a + d = c, w_{2,3} = (b + c + d) + b + d = c. Thus for all possible a, b, c, d, w_{1,2} = argmax_{i,j}w_{i,j} and the correct grouping is chosen by the algorithm.
Create node 4 with t_{4,1} = d_{1,2} = a and t_{4,2} = d_{2,1} = b. The tree is now (t1:a,t2:b)4.
We abbreviate the method rooted neighbor joining with times from the fragment loss model by RNJ_{ F }, analogously for NJ and subscript O for the ordered model and subscript U for the unordered model.
Yersinia pestisdata set
CRISPR arrays from Yersinia pestis genomes
Strain  Accession  Yp1  Yp2  Yp3 

91001  GenBank:NC_005810.1  2^{1}10  3210  0 
a1122  GenBank:NC_017168.1  76251430  43210  210 
angola  GenBank:NC_010159.1  8140  
antiqua  GenBank:NC_008150.1  1091430  520  210 
ca884125  GenBank:ABCD00000000.1  76251430  43210  210 
co92  GenBank:NC_003143.1  76251430  43210  210 
d106004  GenBank:NC_017154.1  6251430  3210  210 
d182038  GenBank:NC_017160.1  11625430  3210  210 
e1979001  GenBank:AAYV00000000.1  11625430  3210  210 
f1991016  GenBank:ABAT00000000.1  76251430  43210  210 
harbin35  GenBank:NC_017265.1  4 3^{1} 0^{1}  3210  210 
india195  GenBank:ACNR00000000.1  7625  43210  210 
kim10  GenBank:NC_004088.1  430  3^{1}210  210 
mg051020  GenBank:AAYS00000000.1  76251430  43210  210 
nepal516  GenBank:NC_008149.1  0^{2}  3210  210 
pestoidesa  GenBank:ACNT00000000.1  21 0^{3}  6320  1 0^{1} 
pestoidesf  GenBank:NC_009381.1  1251430  986170  43210 
pexu2  GenBank:ACNS00000000.1  76251430  43210  210 
z176003  GenBank:NC_014029.1  6251430  3210  210 
The whole locus was extracted, i.e., the sequence from the start of cas1 until the end of csy4. Nucleotide sequences from the resulting 19 strains were aligned using clustalw [35] into an alignment of 8555 sites that is subsequently used for phylogeny estimation with iqpnni [36].
Putative CRISPR arrays for the 19 strains are extracted using CRISPRfinder [37]. True CRISPR elements are found by comparing the repeat sequence to the known Yersinia pestis repeat. The three types of CRISPR arrays are distinguished by their last degenerated repeat [8]. In total, four CRISPR arrays are missing from the CRISPRfinder results. In these cases, we located the respective leader in the genome and extracted repeats and spacers manually. These arrays harbor none or one spacer. For each data set, spacers were assigned the same identity if they show more than 90% sequence similarity. This is a natural cutoff to choose since there was no pair of spacers with similarity between 65% and 90%. Spacer sequences can be found in Additional file 2 for Yp1, in Additional file 3 for Yp2 and in Additional file 4 for Yp3.
Results
Parameter estimation for simulated pairs
In the first simulation setting, we present basic simulations with clocklike twotaxon trees. A tree height of 1, 5 and 10 is investigated, resulting in τ = 2, 10, 20, and different possible values for ρ: ρ = 10, 50, 100 for the fragment loss model and ρ = 2.4, 4.8, 6.4 for the independent loss model. These values were chosen because they are corresponding ρs (Figure 3).
We also compare the estimation using the full likelihood with the ancestor fixed to the true ancestor and using the full likelihood with summing over possible ancestors. The estimated values of ρ are very similar, which leads to the conclusion that the ancestor enumeration works appropriately.
Next, we use the same simulated data sets, but investigate the results when using an incorrect model for the estimation. We only compare the models with single deletions among each other and the models with polarized insertions among each other (see also Table 1). The independent loss models differ only in their insertions. When using the incorrect insertion model, ${\widehat{\rho}}_{I}$ is very similar (Figure 7A, B). These models are also very similar in their construction. They are the same if after the first shared spacer there are no spacers unique to one strain. When using the incorrect deletion model, the corresponding ρ tends to be estimated (Figure 7C, E). In detail, the ρ_{ I } that is estimated under the ordered model from the data generated under the fragment loss model is on average the ρ_{ I } corresponding to ρ_{ F } used for the simulations (red line in Figure 7E). The underestimation of ρ_{ F } is even present to a larger extent when ρ_{ F } is estimated from data generated under the ordered model compared to the estimation under the true model.
Median ρ estimates
ρ  τ  n _{ o }  $\widehat{\mathit{\rho}}$  ${\widehat{\mathit{\rho}}}_{\mathit{o}}$  ${\widehat{\mathit{\rho}}}_{\mathit{e}}$ 

Unordered model  
2.4  2  781  2.000  2.500  0.500 
2.4  10  269  1.861  3.000  1.491 
2.4  20  29  1.833  3.552  1.833 
4.8  2  981  4.631  4.872  0.788 
4.8  10  840  4.352  4.669  2.191 
4.8  20  455  3.846  4.965  3.351 
6.4  2  998  6.000  6.000  1.000 
6.4  10  949  5.716  5.862  2.326 
6.4  20  715  5.285  5.928  3.757 
Ordered model  
2.4  2  809  2.424  2.500  0.500 
2.4  10  267  1.861  2.950  1.500 
2.4  20  26  1.833  3.282  1.833 
4.8  2  977  4.500  4.552  0.500 
4.8  10  794  4.346  4.761  2.564 
4.8  20  460  3.963  5.128  3.108 
6.4  2  997  6.431  6.431  0.500 
6.4  10  942  6.144  6.361  2.598 
6.4  20  733  5.625  6.286  3.846 
Fragment loss model  
10  2  761  6.437  9.201  2.039 
10  10  173  9.011  9.832  9.011 
10  20  24  9.187  8.705  9.187 
50  2  921  35.040  39.779  9.011 
50  10  602  31.272  33.123  26.442 
50  20  261  36.708  34.167  36.708 
100  2  965  71.452  72.466  25.464 
100  10  704  57.810  62.657  44.085 
100  20  440  66.145  66.145  66.560 
Using the true model, we find that times are well estimated until a threshold depending on the simulated ρ. For example, for the independent loss model, for ρ = 4.8, only τ = 2 is well estimated, but for ρ = 6.4, τ = 2 and τ = 10 are well estimated. This threshold is below the expected pairwise loss time. Time estimation for the fragment loss model is more noisy and a slight overestimation can occur for intermediate times that may be related to the underestimation of ρ for these parameter settings (Figure 8D).
Time estimates for the incorrect independent loss model are very similar (Figure 8A, B). In general, the ordered model results in slightly lower time estimates. Small and intermediate times are overestimated when the ordered model is applied to data generated under the fragment loss model (Figure 8E), possibly because more events are necessary to explain this data. Applying the fragment loss model to ordered independent loss data also results in an overestimation for intermediate times (Figure 8C).
Parameter estimation for simulated phylogenies
Next we apply the estimation to data sets simulated on a phylogeny. The same values of ρ as in the previous simulations were used. Phylogenies of 10 taxa are generated under a Yule process and rescaled to a specific tree height (tree height of 1, 5, 10, 20 and 30, respectively).
These results generally confirm the results for pairs of arrays, but resulting distributions of$\widehat{\rho}$ and$\widehat{\tau}$ have a lower
Yersinia pestisanalysis
Yersinia pestis results
Unordered model  Ordered model  Fragment loss model  

Data  Avg.  Avg.  Avg.  Avg.  Avg.  Avg.  
set  Array  ${\widehat{\mathit{\rho}}}_{\mathit{I}}$  time  diversity  ${\widehat{\mathit{\rho}}}_{\mathit{I}}$  time  diversity  ${\widehat{\mathit{\rho}}}_{\mathit{F}}$  time  diversity 
Yp1  5.555  5.802  0.2271  5.527  5.947  0.2362  46.24  4.96  0.3331  
1  Yp2  4.027  3.479  0.2188  4.005  3.538  0.2241  22.87  3.645  0.3108 
Yp3  2.667  1.506  0.1755  2.667  1.486  0.1772  11.36  1.313  0.2061  
2  Yp1  6.625  4.726  0.1445  6.624  4.761  0.1446  90.21  3.859  0.188 
Yp2  4.676  2.984  0.1494  4.655  3.082  0.156  28.61  3.701  0.2748  
Yp1  6.401  9.138  0.2943  6.36  9.259  0.2983  80.76  9.741  0.4408  
3  Yp2  4.613  3.329  0.1707  4.607  3.379  0.1749  39.31  3.359  0.2492 
Yp3  2.969  0.7221  0.07339  2.959  0.7184  0.07114  12.6  0.8776  0.1025 
time between the arrays divided by the pairwise loss time, where maximum diversity is 1. Yp3 has the lowest diversity for each data set it is present. However, the results for the other array types differ. Pourcel et al. [8] argued that Yp1 is the most dynamic CRISPR locus. Based on the data of Pourcel et al. (data set 2), diversity is similar in Yp1 and Yp2 under the independent loss model, and diversity is lower in Yp1 than in Yp2 under the fragment loss model. This discrepancy is resolved when comparing the average times. The average time is larger in Yp1 than in Yp2 for each model. Thus there are more events present in Yp1 compared to Yp2. When considering that the longer arrays in Yp1 could resolve larger times, the diversity results in a similar value. Data set 3 [27] shows higher diversity in Yp1 compared to Yp2 under all three models. Diversity in Yp1 is also higher in data set 3 compared to data set 2. Data set 3 thus captures a larger fraction of the diversity in CRISPR spacer content present in Yersinia pestis.
Discussion
We present a new method for analyzing CRISPR spacer data from microbial populations. The evolution of CRISPR is mainly driven by the insertion of new spacers during infection with foreign DNA and by the presumably random deletion of successive spacers. We try to meet these biological characteristics in the models presented here. Estimating insertion and deletion rates and time in number of expected events in one lineage allows for comparisons of empirical data sets that could lead to relevant conclusions. First, bacterial groups in different environments can be compared in terms of CRISPR dynamics to assess the relative importance of CRISPR in
these environments. Different CRISPR array types might show different dynamics and thus have different utility for strain typing. An observed switch in spacer dynamics on a phylogeny might suggest a change in CRISPR cost or environment.
The three models presented here capture different mechanisms of CRISPR evolution, namely polarized addition of spacers and deletion of multiple successive spacers (Table 1). The CRISPR spacer arrays used for the analysis are assumed to be homologous. CRISPR homology can be determined by synteny in genomic positions and by repeat and leader similarities.
Models are necessarily a simplification of the past biological process. In our model, we ignore population dynamics. Our insertion and deletion rates are, as the substitution rates in phylogenetics, a compound parameter including the process of random changes and selection. The model is based on a timehomogenous Markov process and the dynamics are assumed to be in stationarity. Since an analysis is based on one species and one CRISPR type, it is reasonable to assume that the mechanistic insertion and deletion rates are homogeneous across the set of strains analyzed. We can not exclude, however, that subsets of strains experienced a different environment and thus different selection pressure on their spacer content. Simulations showed that the number of spacers in an array is determined mainly by internal parameters, like spacer insertion rate and cost of having spacers, not by external parameters, like the number of viruses in an environment [15].
We are not aware of previous publications estimating parameters under the ordered or the fragment loss model. The length model of the independent loss model corresponds to an M / M / ∞ queuing model [28]. The unordered model corresponds to the gene content model for the maximumlikelihood distance estimation in [33].
In the context of birth and death processes it is known as the simple deathandimmigration process (e.g., [40]).
The times estimated under these models also allow a comparison to substitution rates if sequence data is available. This analysis is however complicated by several facts. First, microbial genomes often harbor multiple CRISPR arrays. As a consequence, it is not clear how to combine these estimates to make a comparison possible. Second, spacer content might be different for very closely related strains. Then only a few polymorphisms are available and the substitution rate cannot be estimated reliably. Finally, frequent horizontal gene transfer of the CRISPR/Cas system has been suggested (e.g., [41]), and thus CRISPR rates can only be compared to substitution rates of cas genes.
The parameter estimation as presented here does not use an explicit phylogeny. This is advantageous since no search through tree space is necessary or no precomputed phylogeny needs to be given. The latter may
not be possible since no external information might resolve the CRISPR relationships. On the other hand, only a distancebased approach is available to display the CRISPR relationships. We can use the rooted distance from nonreversible models to compute rooted nonclocklike distancebased trees.
We find that estimation of the rate parameter performs well on average, but the estimates under the independent loss models show a lower variance. The fragment loss model tends to an underestimation and may be affected by the factorization of the likelihood function. The time estimates are most accurate for shorter times. For longer times, the absence of overlap complicates an accurate time estimation. In the analyses presented here, the different models result in similar estimates. If the incorrect loss model is applied, the corresponding ρ tends to be estimated fairly accurately. There is also no clear bias that affects the time estimation under an incorrect model. Note that there is a wide range of possible models accounting for fragment deletions. We chose one with the same instantaneous rate for each possible fragment, i.e., ignoring fragment length. This simplification is mainly for computational reasons. Future work on other fragment loss models, including lengths of fragments, might lead to a better fit for CRISPR spacer data.
We compare the estimations between data sets and between different CRISPR arrays present in a genome. Three Yersinia pestis data sets were chosen since they harbor three CRISPR array types and thus this data sets allows for comparison between data sets and between CRISPR array types evolving with different dynamics. Using this data set, we find ρ estimates to be similar using two published data sets but lower in a data set assembled from published genomes. Time and diversity estimates differ between data sets thus the presented methods allow comparisons of the diversity of CRISPR loci sampled from different populations.
For the Yersinia pestis data from published genomes, we observe only few differences in the cas gene sequences but a high diversity at the spacer level. Thus substitution
rates cannot be compared with reliability, but nucleotide and CRISPR spacer data provide phylogenetic information at very different time scales. It is possible to compute cas gene phylogenies on the species level (e.g., [41]). In contrast, spacer information could be utilized for closely related strains that have only few differences in the other nucleotide sequences, which has already been done in using CRISPR in strain typing (e.g., [8],[10],[11]). The method presented here can be used to define groups based on the clustering or to find relationships between groups.
A steadily growing literature suggests many other possible mechanisms of CRISPR evolution apart from polarized addition and fragment deletion. Spacer insertion can happen together with an internal deletion [42], or at an internal repeat [43]. Spacers or whole fragments may be duplicated [44]. And present spacers can guide the acquisition of new spacers from the same DNA molecule [45]. Note that these results affect only the insertion step of the CRISPR evolution process. But the fragment deletion model as it is presented here is based on the polarized insertion assumption. Combining an unordered insertion with a fragment deletion process is currently infeasible. Given these studies and the fact that the models presented here do not give substantially different results, the unordered model may be a robust choice for estimating rate and time parameters from CRISPR array data. Note that several simplifications are possible for the likelihood computation under this model. First, for the start likelihood, the estimate of the Poisson parameter is well known to be the mean of the data values. Second, it is reversible, thus only the time between two arrays can be estimated and the ancestor generation can be omitted. Third, the loss time can be calculated analytically and does not need to be acquired using simulations. To make the model comparisons fair, the same computational approach is used for all models in this paper. But it is possible to implement a more efficient approach for the unordered model. Under this model, an algorithm for the likelihood computation on a phylogeny is also potentially feasible.
Conclusions
We present different models specific for CRISPR spacer content evolution. The three models differ in two aspects. First, fragment loss models differ from the independent loss models since they allow the loss of a succession of spacers in one event. Second, the unordered independent loss model differs from the others since spacers can be incorporated throughout the array, not only on one end. A probabilistic model for each of these three models is presented here. We developed an approach derived from a well behaved stationary distribution, to establish the bounds on the state space that is a priori infinite. We find that the simpler model, without fragment deletions, is more robust. Distancebased phylogenies can be calculated from the time estimates, but the rapid change of spacer content restricts this method to closely related strains with similar spacer content.
In summary, the models facilitate quantitative statements about the spacer dynamics of microbial communities. Thus comparisons are possible, for example, between strain collections from one species at different locations or between different homologous CRISPR arrays in the same set of species.
Declarations
Acknowledgements
The authors would like to thank Christoph Lampert, Sebastian Matuszewski and Rodrigo A.F. Redondo for productive discussions and helpful comments on the manuscript, and two anonymous reviewers for valuable comments improving the manuscript.
Authors’ Affiliations
References
 Barrangou R, Fremaux C, Deveau H, Richards M, Boyaval P, Moineau S, Romero DA, Horvath P: CRISPR provides acquired resistance against viruses in prokaryotes. Science. 2007, 315 (5819): 17091712. 10.1126/science.1138140.PubMedView Article
 Marraffini LA, Sontheimer EJ: CRISPR interference limits horizontal gene transfer in staphylococci by targeting DNA. Science. 2008, 322: 18431845. 10.1126/science.1165771.PubMed CentralPubMedView Article
 Bolotin A, Quinquis B, Sorokin A, Ehrlich SD: Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology. 2005, 151 (Pt8): 25512561.PubMedView Article
 Wiedenheft B, Sternberg SH, Doudna Ja: RNAguided genetic silencing systems in bacteria and archaea. Nature. 2012, 482 (7385): 331338. 10.1038/nature10886.PubMedView Article
 Makarova KS, Aravind L, Wolf YI, Koonin EV: Unification of Cas protein families and a simple scenario for the origin and evolution of CRISPRCas systems. Biol Direct. 2011, 6: 3810.1186/17456150638.PubMed CentralPubMedView Article
 Tyson GW, Banfield JF: Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environ Microbiol. 2008, 10: 200207.PubMed
 Horvath P, Romero Da, CoûtéMonvoisin AC, Richards M, Deveau H, Moineau S, Boyaval P, Fremaux C, Barrangou R: Diversity, activity, and evolution of CRISPR loci in streptococcus thermophilus. J Bacteriol. 2008, 190 (4): 14011412. 10.1128/JB.0141507.PubMed CentralPubMedView Article
 Pourcel C, Salvignol G, Vergnaud G: CRISPR elements in yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology. 2005, 151 (Pt 3): 653163.PubMedView Article
 Cady KC, White aS, Hammond JH, Abendroth MD, Karthikeyan RSG, Lalitha P, Zegans ME, O’Toole Ga: Prevalence, conservation and functional analysis of Yersinia and Escherichia CRISPR regions in clinical Pseudomonas aeruginosa isolates. Microbiology. 2011, 157 (Pt 2): 430437.PubMed CentralPubMedView Article
 van Embden JD, van Gorkom T, Kremer K, Jansen R, van Der Zeijst, Schouls LM: Genetic variation and evolutionary origin of the direct repeat locus of Mycobacterium tuberculosis complex bacteria. J Bacteriol. 2000, 182 (9): 23932401. 10.1128/JB.182.9.23932401.2000.PubMed CentralPubMedView Article
 Liu F, Barrangou R, GernerSmidt P, Ribot EM, Knabel SJ, Dudley EG: Novel virulence gene and clustered regularly interspaced short palindromic repeat (CRISPR) multilocus sequence typing scheme for subtyping of the major serovars of Salmonella enterica subsp. enterica. Appl Environ Microbiol. 2011, 77 (6): 19461956. 10.1128/AEM.0262510.PubMed CentralPubMedView Article
 Stern A, Keren L, Wurtzel O, Amitai G, Sorek R: Selftargeting by CRISPR: gene regulation or autoimmunity?. Trends Genet. 2010, 26 (8): 335340. 10.1016/j.tig.2010.05.008.PubMed CentralPubMedView Article
 Levin BR: Nasty viruses, costly plasmids, population dynamics, and the conditions for establishing and maintaining CRISPRmediated adaptive immunity in bacteria. PLoS Genet. 2010, 6 (10): e100117110.1371/journal.pgen.1001171.PubMed CentralPubMedView Article
 Haerter JO, Trusina A, Sneppen K: Targeted bacterial immunity buffers phage diversity. J Virol. 1055, 85 (20): 410560.
 Haerter JO, Sneppen K: Spatial structure and Lamarckian adaptation explain extreme Genetic diversity at CRISPR Locus. mBio. 2012, 3 (4): e0012612.PubMed CentralPubMedView Article
 Childs LM, Held NL, Young MJ, Whitaker RJ, Weitz JS: Multiscale model of CRISPRinduced coevolutionary dynamics: diversification at the interface of lamarck and darwin. Evolution. 2012, 66 (7): 20152029. 10.1111/j.15585646.2012.01595.x.PubMed CentralPubMedView Article
 Weinberger AD, Sun CL, Pluciński MM, Denef VJ, Thomas BC, Horvath P, Barrangou R, Gilmore MS, Getz WM, Banfield JF: Persisting viral sequences shape microbial CRISPRbased immunity. PLoS Comput Biol. 2012, 8 (4): e100247510.1371/journal.pcbi.1002475.PubMed CentralPubMedView Article
 Weinberger a, Wolf YI, Lobkovsky a, Gilmore MS, Koonin EV: Viral diversity threshold for adaptive immunity in Prokaryotes. mBio. 2012, 3 (6): e0045612.PubMed CentralPubMedView Article
 Horvath P, CoûtéMonvoisin AC, Romero DA, Boyaval P, Fremaux C, Barrangou R: Comparative analysis of CRISPR loci in lactic acid bacteria genomes. Int J Food Microbiol. 2009, 131: 6270. 10.1016/j.ijfoodmicro.2008.05.030.PubMedView Article
 Takeuchi N, Wolf YI, Makarova KS, Koonin EV: Nature and intensity of selection pressure on CRISPRassociated genes. J Bacteriol. 2012, 194 (5): 12161225. 10.1128/JB.0652111.PubMed CentralPubMedView Article
 ElMabrouk N: Genome rearrangement by reversals and insertions/deletions of contiguous segments. Combinatorial Pattern Matching. Edited by: Giancarlo R, Sankoff D. 2000, Berlin, Heidelberg: SpringerVerlag, 222234.View Article
 Marron M, Swenson KM, Moret BM: Genomic distances under deletions and insertions. Theor Comput Sci. 2004, 325 (3): 347360. 10.1016/j.tcs.2004.02.039.View Article
 York TL, Durrett R, Nielsen R: Bayesian estimation of the number of inversions in the history of two chromosomes. J Comput Biol. 2002, 9 (6): 805818. 10.1089/10665270260518281.PubMedView Article
 Larget B, Simon DL, Kadane JB, Sweet D: A bayesian analysis of metazoan mitochondrial genome arrangements. Mol Biol Evol. 2005, 22 (3): 486495.PubMedView Article
 Miklos I: MCMC genome rearrangement. Bioinformatics. 2003, 19 (Suppl 2): ii130ii137. 10.1093/bioinformatics/btg1070.PubMedView Article
 Spencer M, Susko E, Roger AJ: Modelling prokaryote gene content. Evol Bioinform Online. 2003, 2: 157178.
 Cui Y, Li Y, Gorgé O, Platonov ME, Yan Y, Guo Z, Pourcel C, Dentovskaya SV, Balakhonov SV, Wang X, Song Y, Anisimov AP, Vergnaud G, Yang R: Insight into microevolution of Yersinia pestis by clustered regularly interspaced short palindromic repeats. PLoS One. 2008, 3 (7): e265210.1371/journal.pone.0002652.PubMed CentralPubMedView Article
 Tijms HC: A First Course in, Stochastic Models. 2003, West Sussex: WileyView Article
 Wolfram Research Inc: Mathematica Edition: Version 8.0. 2010, [http://www.wolfram.com/mathematica/]
 Jones E, Oliphang T, Peterson P, Others: SciPy: Open source scientific tools for python. 2001, [http://www.scipy.org]
 Johansson F, Others: mpmath: a Python library for arbitraryprecision floatingpoint arithmetic (version 0.17). 2011, [http://www.code.google.com/p/mpmath/]
 Saitou N, Nei M: The neighbor–joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406425.PubMed
 Huson DH, Steel M: Phylogenetic trees based on gene content. Bioinformatics. 2004, 20 (13): 20442049. 10.1093/bioinformatics/bth198.PubMedView Article
 Eddy SR: HMMER 3.0. [http://www.hmmer.org/]
 Larkin Ma, Blackshields G, Brown NP, Chenna R, McGettigan Pa, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 29472948. 10.1093/bioinformatics/btm404.PubMedView Article
 Vinh LS, von Haeseler A: IQPNNI: Moving fast through tree space and stopping in time. Mol Biol Evol. 2004, 21 (8): 15651571. 10.1093/molbev/msh176.View Article
 Grissa I, Vergnaud G, Pourcel C: CRISPRFinder: a web tool to identify clustered regularly interspaced short palindromic repeats. Nucleic Acids Res. 2007, 35: W52W57. 10.1093/nar/gkm360.PubMed CentralPubMedView Article
 Eppinger M, Worsham PL, Nikolich MP, Riley DR, Sebastian Y, Mou S, Achtman M, Lindler LE, Ravel J: Genome sequence of the deeprooted yersinia pestis strain angola reveals new insights into the evolution and pangenome of the plague bacterium. J Bacteriol. 2010, 192 (6): 16851699. 10.1128/JB.0151809.PubMed CentralPubMedView Article
 Rambaut A: FigTree. [http://www.tree.bio.ed.ac.uk/software/figtree/]
 Novozhilov AS, Karev GP, Koonin EV: Biological applications of the theory of birthanddeath processes. Brief Bioinform. 2006, 7: 7085. 10.1093/bib/bbk006.PubMedView Article
 Godde JS, Bickerton A: The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol. 2006, 62 (6): 718729. 10.1007/s002390050223z.PubMedView Article
 Deveau H, Barrangou R, Garneau JE, Fremaux C, Boyaval P, Romero DA, Horvath P, Moineau S, Labonté J: Phage response to CRISPRencoded resistance in streptococcus thermophilus. J Bacteriol. 2008, 190 (4): 13901400. 10.1128/JB.0141207.PubMed CentralPubMedView Article
 Erdmann S, Garrett Ra: Selective and hyperactive uptake of foreign DNA by adaptive immune systems of an archaeon via two distinct mechanisms. Mol Microbiol. 2012, 85 (6): 10441056. 10.1111/j.13652958.2012.08171.x.PubMed CentralPubMedView Article
 Sorokin Va, Gelfand MS, Artamonova II: Evolutionary dynamics of clustered irregularly interspaced short palindromic repeat systems in the ocean metagenome. Appl Environ Microbiol. 2010, 76 (7): 21362144. 10.1128/AEM.0198509.PubMed CentralPubMedView Article
 Datsenko Ka, Pougach K, Tikhonov A, Wanner BL, Severinov K, Semenova E: Molecular memory of prior infections activates the CRISPR/Cas adaptive bacterial immunity system. Nat Commun. 2012, 3: 945PubMedView Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://www.creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.