We present a new method for analyzing CRISPR spacer data from microbial populations. The evolution of CRISPR is mainly driven by the insertion of new spacers during infection with foreign DNA and by the presumably random deletion of successive spacers. We try to meet these biological characteristics in the models presented here. Estimating insertion and deletion rates and time in number of expected events in one lineage allows for comparisons of empirical data sets that could lead to relevant conclusions. First, bacterial groups in different environments can be compared in terms of CRISPR dynamics to assess the relative importance of CRISPR in

these environments. Different CRISPR array types might show different dynamics and thus have different utility for strain typing. An observed switch in spacer dynamics on a phylogeny might suggest a change in CRISPR cost or environment.

The three models presented here capture different mechanisms of CRISPR evolution, namely polarized addition of spacers and deletion of multiple successive spacers (Table 1). The CRISPR spacer arrays used for the analysis are assumed to be homologous. CRISPR homology can be determined by synteny in genomic positions and by repeat and leader similarities.

Models are necessarily a simplification of the past biological process. In our model, we ignore population dynamics. Our insertion and deletion rates are, as the substitution rates in phylogenetics, a compound parameter including the process of random changes and selection. The model is based on a time-homogenous Markov process and the dynamics are assumed to be in stationarity. Since an analysis is based on one species and one CRISPR type, it is reasonable to assume that the mechanistic insertion and deletion rates are homogeneous across the set of strains analyzed. We can not exclude, however, that subsets of strains experienced a different environment and thus different selection pressure on their spacer content. Simulations showed that the number of spacers in an array is determined mainly by internal parameters, like spacer insertion rate and cost of having spacers, not by external parameters, like the number of viruses in an environment [15].

We are not aware of previous publications estimating parameters under the ordered or the fragment loss model. The length model of the independent loss model corresponds to an *M* / *M* / *∞* queuing model [28]. The unordered model corresponds to the gene content model for the maximum-likelihood distance estimation in [33].

In the context of birth and death processes it is known as the simple death-and-immigration process (e.g., [40]).

The times estimated under these models also allow a comparison to substitution rates if sequence data is available. This analysis is however complicated by several facts. First, microbial genomes often harbor multiple CRISPR arrays. As a consequence, it is not clear how to combine these estimates to make a comparison possible. Second, spacer content might be different for very closely related strains. Then only a few polymorphisms are available and the substitution rate cannot be estimated reliably. Finally, frequent horizontal gene transfer of the CRISPR/Cas system has been suggested (e.g., [41]), and thus CRISPR rates can only be compared to substitution rates of cas genes.

The parameter estimation as presented here does not use an explicit phylogeny. This is advantageous since no search through tree space is necessary or no precomputed phylogeny needs to be given. The latter may

not be possible since no external information might resolve the CRISPR relationships. On the other hand, only a distance-based approach is available to display the CRISPR relationships. We can use the rooted distance from non-reversible models to compute rooted non-clocklike distance-based trees.

We find that estimation of the rate parameter performs well on average, but the estimates under the independent loss models show a lower variance. The fragment loss model tends to an underestimation and may be affected by the factorization of the likelihood function. The time estimates are most accurate for shorter times. For longer times, the absence of overlap complicates an accurate time estimation. In the analyses presented here, the different models result in similar estimates. If the incorrect loss model is applied, the corresponding *ρ* tends to be estimated fairly accurately. There is also no clear bias that affects the time estimation under an incorrect model. Note that there is a wide range of possible models accounting for fragment deletions. We chose one with the same instantaneous rate for each possible fragment, i.e., ignoring fragment length. This simplification is mainly for computational reasons. Future work on other fragment loss models, including lengths of fragments, might lead to a better fit for CRISPR spacer data.

We compare the estimations between data sets and between different CRISPR arrays present in a genome. Three *Yersinia pestis* data sets were chosen since they harbor three CRISPR array types and thus this data sets allows for comparison between data sets and between CRISPR array types evolving with different dynamics. Using this data set, we find *ρ* estimates to be similar using two published data sets but lower in a data set assembled from published genomes. Time and diversity estimates differ between data sets thus the presented methods allow comparisons of the diversity of CRISPR loci sampled from different populations.

For the *Yersinia pestis* data from published genomes, we observe only few differences in the cas gene sequences but a high diversity at the spacer level. Thus substitution

rates cannot be compared with reliability, but nucleotide and CRISPR spacer data provide phylogenetic information at very different time scales. It is possible to compute cas gene phylogenies on the species level (e.g., [41]). In contrast, spacer information could be utilized for closely related strains that have only few differences in the other nucleotide sequences, which has already been done in using CRISPR in strain typing (e.g., [8],[10],[11]). The method presented here can be used to define groups based on the clustering or to find relationships between groups.

A steadily growing literature suggests many other possible mechanisms of CRISPR evolution apart from polarized addition and fragment deletion. Spacer insertion can happen together with an internal deletion [42], or at an internal repeat [43]. Spacers or whole fragments may be duplicated [44]. And present spacers can guide the acquisition of new spacers from the same DNA molecule [45]. Note that these results affect only the insertion step of the CRISPR evolution process. But the fragment deletion model as it is presented here is based on the polarized insertion assumption. Combining an unordered insertion with a fragment deletion process is currently infeasible. Given these studies and the fact that the models presented here do not give substantially different results, the unordered model may be a robust choice for estimating rate and time parameters from CRISPR array data. Note that several simplifications are possible for the likelihood computation under this model. First, for the start likelihood, the estimate of the Poisson parameter is well known to be the mean of the data values. Second, it is reversible, thus only the time between two arrays can be estimated and the ancestor generation can be omitted. Third, the loss time can be calculated analytically and does not need to be acquired using simulations. To make the model comparisons fair, the same computational approach is used for all models in this paper. But it is possible to implement a more efficient approach for the unordered model. Under this model, an algorithm for the likelihood computation on a phylogeny is also potentially feasible.