Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites

Background Although cis-regulatory changes play an important role in evolution, it remains difficult to establish the contribution of natural selection to regulatory differences between species. For protein coding regions, powerful tests of natural selection have been developed based on comparisons of synonymous and non-synonymous substitutions, and analogous tests for regulatory regions would be of great utility. Results Here, tests for natural selection on regulatory regions are proposed based on nucleotide substitutions that occur in characterized transcription factor binding sites (an important type functional element within regulatory regions). In the absence of selection, these substitutions will tend to reduce the strength of existing binding sites. On the other hand, purifying selection will act to preserve the binding sites in regulatory regions, while positive selection can act to create or destroy binding sites, as well as change their strength. Using standard models of binding site strength and molecular evolution in the absence of selection, this intuition can be used to develop statistical tests for natural selection. Application of these tests to two well-characterized regulatory regions in Drosophila provides evidence for purifying selection. Conclusion This demonstrates that it is possible to develop tests for selection on regulatory regions based on the specific functional constrains on these sequences.

Many tests for natural selection can be applied to noncoding DNA and several important studies have identified signatures of natural selection in well-characterized regulatory regions (reviewed in [15]). Tests for selection on differences between species often compare the ratio of substitutions in transcription factor binding sites (an important class of functional element within cis-regulatory regions) to the surrounding non-coding DNA [16]. These tests are modelled after tests on coding regions that compare the patterns of amino acid changing differences to synonymous differences, which are amongst the most widely used and most powerful tests to detect the effects of natural selection on individual protein coding genes [17]. However, in applying these tests to binding sites, several Open Access important caveats must be considered [15]. In particular, it must be assumed that all of the functional elements in a regulatory region have been characterized, and that these remain constant in all species considered.
Here I develop a new approach to detect selection on individual cis-regulatory regions that takes advantage of the specificity of transcription factors to assign functional impact to nucleotide changes in binding sites. Recently, evolutionary analyses of large sets of transcription factor binding sites have highlighted the importance of considering the binding affinity or strength of the binding sites for their appropriate transcription factor [10,11,13,18]. Specifically, sequence differences in transcription factor binding sites can increase protein-DNA affinity, decrease it, or have no effect. In the absence of selection, fixation of random mutations will tend to decrease the strength of binding sties [19,20], whereas purifying selection will tend to preserve binding sites, such that the effects of subsequent fixations will cancel out [18]. On the other hand, though binding sites can arise in regulatory sequences as a result of the action of positive selection [19][20][21] or through genetic drift alone [22], I show that an increase in binding affinity on average is not expected in the absence of selection. I therefore propose to use the distribution of changes in strength of transcription factor binding sites to develop tests for natural selection on regulatory regions where the binding sites have been identified.
I analyze the fixed differences in two well-characterized regulatory regions in Drosophila (the hb anterior activator and the eve stripe 2 enhancer). These tests reveal statistical evidence for conservation of cis-regulatory information, which is consistent with the known conservation of function of these regulatory sequences.

Results
Quantifying the effects of substitutions in regulatory regions Motivated by the power of tests for natural selection that exploit the constraints imposed on coding sequences by the genetic code, I sought to develop a test for natural selection on regulatory regions that takes into account the specific constraints on these regions: binding by transcription factors. Using standard matrix models for DNA binding specificity (known as Position Weight Matrices or Position Specific Scoring Matrices [23]), the binding energy of the interaction between a transcription factor and DNA is given by a sum of independent contributions from each residue at each position [23]. An estimate of the relative affinity or 'strength' a transcription factor binding site X of length w for its binding protein can be quantified using Where X ib = 1 if the sequence X is nucleotide b at position i and 0 otherwise, f ib is the probability of observing nucleotide b at position i in a binding site for a transcription factor (from the specificity matrix), and g b is the probability of observing nucleotide b in the genomic background distribution [23].
Alternatively, the strength of the transcription factor binding sites in a region can be considered the regulatory information in that region, and the formula above can be motivated by information theoretic arguments [23]. Note that the framework and tests for selection presented here can equally be applied to information contained in the cis-regulatory region as to binding affinity. However, because recent work has focused on binding affinity (e.g., [11,13]) this work is presented from that perspective.
In order to quantify the effects of evolutionary changes in binding sites, I consider the effect of a single nucleotide change. In this case I define associated with a change from base a to base b (a, b in {A, C, G, T}) where, once again, i is the position in the motif, g are background probabilities, and f are the probabilities in the specificity matrix model. Extending these methods to the general case of arbitrary numbers of substitutions is an area for further research (see Discussion).
The effect of substitutions on binding sites in the absence of selection Most random mutations will decrease the strength of a transcription factor binding site, and therefore substitutions in the absence of selection will tend to decrease the affinity [19,20]. This follows from the fact that high affinity binding sites represent a small fraction of the possible sequences of a particular length. Since a substitution process that operates independently at each position in the sequence will tend to explore the majority sequence space, sequences that currently represent binding sites are much more likely to move away from these regions of sequence space than to remain in the relatively small regions of sequence space that represent binding sites. This implies that on average ΔS should be negative in the absence of selection. To illustrate this, I simulated the evolution of a binding site for Bcd (a developmental transcription factor in Drosophila whose specificity is well-characterized) under an HKY model (dotted trace in Figure 1a). The strength (S) of the binding site begins high (near the expected value of S for binding sites) and decays as substitutions eventually hit the critical residues. Consistent with this, the distribution of the changes in score (ΔS) is concentrated on values less than zero (dotted trace in Figure 1b).

The effect of substitutions in binding sites under selection
In contrast, in functionally constrained regulatory regions, purifying selection will preferentially remove nucleotide changes that greatly alter the affinity of the binding sites [6,13]. When these substitutions do become fixed (albeit rarely), positive selection will tend to fix additional nucleotide changes that restore the binding affinity [18]. This process will tend to preserve the binding affinity, and ΔS will therefore tend to be zero if the regulatory region is under functional constraint.
Finally, consider adaptive evolution, which could have arbitrary effects on ΔS. For example, new transcription factor binding sites could be created from background sequence through successive adaptive fixations that increase binding site strength; this would lead to an increase in S, and therefore ΔS would be greater than zero on average. However, because new binding sites can also appear by genetic drift [21,22] it is possible that ΔS can be greater than zero in the absence of selection. To illustrate this, I simulated a background sequence of length equal to the Bcd binding site under the same HKY model as above, and found examples where binding sites arose in the absence of selection (Figure 1a solid trace). I argue that, although arbitrarily strong binding sites (high values of S) can be generated in the absence of selection, the distribution of changes in score (ΔS) is specified by the substitution process. Interestingly, since evolution in the absence of selection is unbiased with respect to the strength of the binding site, the distribution of changes in score is symmetric, with mean equal to zero (Figure 1b solid trace). This indicates that in the absence of selection, in background sequences we expect changes in score to cancel out. Therefore, while the creation of binding sites from background sequence cannot be considered evidence for positive selection, if the distribution of ΔS observed is statistically different from the pattern expected in the absence of natural selection, this can only be consistent with adaptive evolution.
Creation of new binding sites in regulatory regions is an intuitive case of adaptive regulatory evolution. However, depending on the situation, natural selection could also favour mutations that remove functional binding sites within a regulatory region, thus leading to an average ΔS of less than zero. Therefore, although a decrease in S on average is expected in the absence of selection, it could also occur in the presence of selection. Nevertheless, if ΔS is more negative than expected in the absence of selection, we have evidence that natural selection must be acting to remove binding sites.
In summary, for substitutions in a set of characterized binding sites we expect: ΔS < 0 in the absence of constraint or adaptive destruction of binding sites ΔS = 0 in the presence of functional constraint ΔS > 0 during the creation of new binding sites (due to selection or genetic drift)

Statistical tests for natural selection in regulatory regions
An attractive feature of using ΔS for a single substitution (as defined above) in a test for natural selection on ). An example of a binding site created from background sequence in the absence of selection (solid trace) is also shown. b) shows the probability (p) of observing a change in score of size ΔS given than a sequence is a binding site (dotted trace) or a background sequence (solid trace).
regulatory regions is that its distribution can be computed exactly under standard models of molecular evolution in the absence of selection (see Methods, Figure 1b). I therefore propose to use the distribution of ΔS to test for the presence of natural selection on regulatory regions. If the distribution of ΔS is significantly different from that expected in the absence of selection, we can rule out the null hypothesis of evolution in the absence of selection.
Here I consider the tests for selection in the following cases.
1. If the observed ΔS in real binding sites is greater on average than ΔS expected for binding sites in the absence of selection, this indicates purifying selection to retain binding sites.
2. If the observed ΔS in real binding sites is less on average than ΔS expected for binding sites in the absence of selection, this indicates adaptive destruction of binding sites.
3. If the observed ΔS in real binding sites is greater on average than the ΔS expected for binding sites arising from background sequence in the absence of selection, this indicates adaptive creation of new sites.
Case 1: Here the pattern of evolution is consistent with purifying selection to preserve the function of the binding sites in the regulatory region. To rule out the null hypothesis of no constraint, we must compare the observed values of ΔS to the distribution of ΔS in sequences we know to be transcription factor binding sites, but in the absence of selection.
In the case of binding sites evolving in the absence of constraint: where E[X] and V[X] are the mean and variance of the random variable X, respectively, and P ab is the probability of substitution between bases a and b (i.e., a, b in {A, C, G, T}), computed under a standard model of molecular evolution, such that P = e Rt where R are the instantaneous rates of substitution and t is time (see Methods). The dotted trace in Figure 1b shows the distribution of ΔS for binding sites evolving in the absence of constraint.
In a practical setting, we expect to have observed some relatively modest number (N) of substitutions in characterized binding sites. Therefore, in order to test the significance of a set of observed ΔS values, I propose the statistic: where k indexes N observed values of ΔS.
Since we can compute the mean and variance of ΔS under standard models of evolution (see Methods), according to the central limit theorem this statistic should be normally distributed with mean = 0 and variance = 1(the standard normal) under the hypothesis that the model of evolution is correct. We can therefore perform a one-tailed test that the observed mean is greater than that expected in the absence of selection.
I sought to confirm that the distribution of this statistic was as expected, particularly in the case of small N (few observed substitutions in binding sites) which is typical of real datasets. To simulate the null hypothesis of binding sites evolving in the absence of constraint, I simulated molecular evolution of the 6 real Bcd sites in the hb anterior activator under an HKY model with the transition-transversion rate ratio estimated from the alignment of the hb anterior activator (see methods) and evolutionary distance scaled so that we would observe approximately 5 substitutions in the 6 binding sites. I computed Z using E[ΔS] and V[ΔS] either under this model, and I observed good agreement with the expected standard normal behavior ( Figure 2, 'exact').
Case 2: If we wish to test for adaptive destruction of transcription factor binding sites in a regulatory region, the average of ΔS should be significantly less than expected in the absence of selection. To test for this, we can perform a one tailed test using the statistic defined above, but in the opposite direction.
Case 3: If the average ΔS in a regulatory region is greater than 0, we wish to test whether the average ΔS is greater than we would expect to observe in the absence of selection. Now the null hypothesis is that the average increase in binding affinity we have observed is due binding sites arising in background sequence by genetic drift. Once again the distribution of ΔS can be computed exactly, and the mean and variance are: The solid trace in Figure 1b shows this distribution. This distribution is symmetric, and the expectation is zero. This follows from the fact that the substitution processes in the absence of selection is unbiased with respect to the binding site strength, and that the residue frequencies in background genomic sequence are assumed to be drawn from the equilibrium of the substitution process. The means and variances can be used to form a Z-statistic as illustrated above, and simulations again confirm the expected distribution of the statistic (data not shown). If the observed average ΔS is significantly greater than expected in the absence of selection, we find evidence for adaptive evolution. For example, for the 20 substitutions shown in Figure 1a (solid trace) the average ΔS is 0.45, which gives Z = 0.97 and is not significant. Thus, although there is a large change in S, the pattern of changes is consistent with the absence of selection.
An approximation to the distribution of ΔS Under substitution models with no transition-transversion bias [24], the distribution of ΔS does not depend on evolutionary distance. For example, I can show (see Methods) that for binding sites evolving in the absence of selection, A similar, albeit more complicated expression is available for the variance (see Methods). These expressions depend only on the equilibrium probabilities of the four nucleotides and the probabilities in the specificity matrix model for the transcription factor.
In the general case, the distribution of ΔS depends very weakly on the evolutionary distance ( Figure 3a) and only somewhat more strongly on the transition-transversion bias ( Figure 3b). It is therefore possible to obtain a good approximation of the distribution of ΔS using the formulas obtained under the simpler substitution models. I refer to this approximation of the distribution of ΔS as the 'long-time limit' because it becomes exact in the limit of long evolutionary time even in the presence of transition-transversion bias ( Figure 3). As expected, using the long-time limit E[ΔS] and V[ΔS] when calculating the Z statistic described above also gives the standard normal behaviour (Figure 2, 'Long time limit'). Thus, this approximation allows application of tests based on the distribution of ΔS without estimates or assumptions about the evolutionary process in the absence of selection.

Application to the hb anterior activator
The hb anterior activator (Figure 4a) responds to the Bcd gradient in the early D. melanogaster embryo [25]. It is thought to have been conserved since the divergence of D. melanogaster and D. virilis [26] and contains well-defined binding sites for Bcd [27]. We therefore expect to see evidence of functional constraint on this regulatory region.
Using D. virilis and D. pseudoobscura as outgroups, I identified 10 substitutions in the alignment of the 6 well-characterized Bcd binding sites ( Figure 4b) and used equation 2 above to compute ΔS for each substitution ( Table 1). The average ΔS for these substitutions was    1) and was concerned that this might indicate that the assumption of single substitutions at each site was invalid on this lineage. I therefore performed the test excluding these substitutions and found similar results ( Table 2). I also found evidence for constraint when excluding substitutions along the relatively long branch leading to the melanogaster group. Noting that removing the substitutions led to an average ΔS for the region greater than 0, I tested for evidence for adaptive creation of binding sites in this regulatory region (case 3). However, performing the test above (equations 7-9) yielded Z = 0.22, which is not significant, indicating that the observed increase in binding strength could have been observed in the absence of selection.

More complex regulatory regions
The hb anterior activator serves as a good test case for this method because it contains multiple binding sites for the same transcription factor. However, in general regulatory regions contain multiple binding sites for multiple different transcription factors. Note that the arguments above regarding the expected ΔS in regulatory regions apply regardless of whether the binding sites are for a single transcription factor or many different transcription factors.
To extend the statistical test to regulatory regions with multiple binding sites for different factors, two approaches are possible. If enough substitutions in each type of binding site are present, the test above can be performed for each type, and then their results can be combined. However, in the case of few substitutions, it may be preferable to pool the substitutions first. To do so, we must compute the distribution of ΔS expected from a mixture of transcription factor binding sites. ΔS is now drawn from a v component mixture model, where v is the number of types of transcription factor binding sites, π i is the probability that the substitution occurred in the j-th type, and p(ΔS) j is the distribution of ΔS for the j-th type of binding motif. We can compute these π j for any regulatory region given the numbers of each type of binding site in a characterized regulatory region (see Methods): where n j , f j and w j are the number of occurrences, specificity matrix and width of the j-th type of binding site, and P is a substitution matrix as above. Since it is possible to compute the means and variances of mixture distributions as a func tion of the component distributions (see Methods), we can apply the test suggested above in the mixture case as well. Once again, we can obtain a good approximation to this distribution using the long-time limit. To confirm the distribution of the test statistic proposed above in the case of mixtures of binding sites, I again performed simulations under the null hypothesis of no constraint, this time using the 6 Kr sites and 5 Bcd sites in the eve stripe 2 enhancer, and once again found the expected standard normal behavior (data not shown).

Application to the eve stripe 2 enhancer
The individual binding sites in the eve stripe 2 enhancer are well-characterized [28,29] and this enhancer exemplifies complex cis-regulatory sequences in that it contains multiple binding sites for multiple transcription factors [30]. Here I consider the 5 Bcd binding sites and 6 Kr binding sites to illustrate the application of the test for selection to a complex regulatory region (see Additional file 1 for alignments of these binding sites). Two of the Kr and Bcd binding sites the eve stripe 2 enhancer overlap,  Positions in the Bcd binding motif (Pos) are numbered as shown in Figure 4. Abbreviations for lineages are as in Figure 4. Coordinates are based on mapping of sites to the D. melanoaster genome [52] BMC Evolutionary Biology 2009, 9:286 http://www.biomedcentral.com/1471-2148/9/286 and I excluded two substitutions that affect the strength of more than one binding site (Table 3).
This left 16 substitutions (9 in bcd binding sites and 7 in Kr binding sites) for which I used equation 2 to compute associated ΔS values ( Table 3). The average ΔS for all the substitutions was -0.23, and I performed the test described above with evolutionary distance and transition-transversion rate ratio estimated from the alignment of the eve stripe 2 enhancer (see methods  (Table 4).
Although its function has been conserved over evolution [31], the eve stripe 2 enhancer has undergone some linage specific evolution [32], as well as gained and lost individual binding sites; its evolution is characterized by rapid sequence divergence [31,33]. Consistent with this, the alignments of D. pseudoobscura for BC-3 were not possible, as this site seems to have appeared recently [32]. Within the closely related species in the melanogaster subgroup, BC-3 contains seven inferred substitutions, four of which are inferred to occur along the lineage leading to D. yakuba. In addition to the rapid divergence of BC-3, I again found cases where more than one substitution had occurred along the D. ananassae lineage in a single binding site. In addition, I therefore performed the tests excluding lineages with multiple substitutions, or excluding BC-3 entirely. In all cases there is still sufficient power to provide statistical evidence against the null hypothesis of no constraint (table 4). In no case could I find evidence for adaptive evolution (case 2 or case 3, data not shown).

Discussion and Conclusion
A new test for natural selection on regulatory regions One of the difficulties in many current evolutionary analyses of cis-regulatory regions is that it is difficult to choose an appropriate set of unconstrained sites to which to compare the functional regulatory sites. In general, studies either choose the rate of substitution in surrounding   Figure 3. Naming of binding sites is as in [31]. Coordinates are based on mapping of sites to the D. melanoaster genome [52]. non-coding sequence [16] or in synonymous sites in adjacent protein coding regions [34]. Both assumptions may be problematic. The former assumes that the surrounding DNA is under no functional constraint (as opposed to some unknown constraints). In the latter case, because noncoding sequences show larger numbers of insertions and deletions than coding regions, it is not always clear that rate estimates based on alignments of coding and non-coding regions can be directly compared.
Tests based on the distribution of ΔS, such as those proposed here, do not rely on these assumptions, as they consider only substitutions that occur in binding sites. Practically, this is an attractive feature of these tests, as they only require accurate alignments of the binding sites, which are generally more reliable than alignments of unconstrained non-coding DNA [35].
Another attractive feature of tests based on the distribution of ΔS in the absence of selection is that they make few assumptions about the nature of selection on binding sites. For example, it is not assumed that binding sites are all under the same strength of selection, or that they all have the same binding affinity -only the changes in strength of binding are important. Further, even under a stabilizing selection model, where binding sites for a given transcription factor are gained and lost over evolution [33], ΔS will be zero on average if the total output of the regulatory sequence is preserved: the negative ΔS associated with the binding site loss will be compensated by positive ΔS associated with the binding site gain. However, if binding sites for one transcription factor are replaced by binding sites for another, ΔS may no longer be zero on average and testing for selection in this case is an area for further research.
Practical considerations, limitations and future improvements Application of these tests to two well-characterized regions demonstrates that they have enough power to detect constraint on individual regulatory regions with 10 substitutions in binding sites, and perhaps even as few as 5 or 6 substitutions (tables 2 and 4). However, application to the eve stripe 2 enhancer illustrates several practical difficulties: First, I didn't include the Hb binding sites in this enhancer [36] because these binding sites contain homopolymeric runs, and it is difficult to assign a 'position' to a substitution; ΔS cannot be reliably computed for each substitution in this case. Second, although the eve stripe 2 enhancer has characterized sites for Gt, I did not include these because the sequence specificity of this transcription factor is poorly characterized. Third, the eve stripe 2 enhancer contains substitutions in overlapping binding sites, for which it is not clear how to calculate ΔS; these were therefore excluded from the analysis. Finally, the distribution of ΔS is sensitive to the estimation of the frequency parameters in the specificity matrix. For example, I excluded the Bcd binding sites in the eve stripe 2 enhancer and reconstructed the Bcd matrix for analysis of that regulatory region. If the binding sites in the regulatory region of interest are included in the estimation of the specificity matrix, there is a potential for circularity in the analysis. Thus, the tests require (i) well-characterized transcription factor binding specificity and (ii) confident alignment of a binding site to a single specificity matrix. None of these constraints are present for tests that compare binding sites to surrounding regions or synonymous sites [16,34] or for tests of natural selection based on spacing between conserved blocks [35][36][37]. However, rapid advances in methods to characterize DNA-protein interactions are making specificity information available for large numbers of transcription factors [38][39][40]. Among these are methods that yield information about binding to each sequence, such that the assumption of independent contributions to binding of each DNA base in the binding site could in principle be relaxed [38,41].
In addition, the tests I have proposed assume that only a single substitution has occurred at any position in binding sites. Although for most of the data analyzed here this assumption seems valid, I noted several cases were multiple substitutions occurred on a single lineage, suggesting the possibility of 'multiple hits' at a single site. Furthermore, there is clear evidence of insertions and deletions occurring near or within the binding sites considered here. These are likely to affect their binding affinity, but are not included in the null model of molecular evolution in the absence of selection. More sophisticated models of molecular evolution [42] may be able to account for these effects, and these could be applied in this framework. Similarly, the evolutionary models here do not account for di-nucleotide substitution bias, particularly the elevated rate of CpG to TpG found in mammals; these could be included using an improved null model [43,44].
Finally, I note that I have suggested one simple statistical test based on the observed average ΔS, however many tests based on distribution of ΔS are possible. For example, purifying selection might also be expected to reduce the variance of ΔS. Indeed, in the case of the Bcd sites in the hb anterior enhancer, the observed variance of ΔS is less than expected, though this difference is not significant (e.g., 1.15 vs. 2.86, n = 10, chi-square test P = 0.089). Determining what tests have the most power to detect various types of selection in regulatory regions is an area for further research. In general, however, it seems very likely that tests that consider the effects of substitutions on transcription factor binding site affinity will facilitate the detection of adaptive evolution in regulatory regions.

Methods
Construction of motif matrices I used publically available compilations of characterized binding sites for Bcd and Kr [33,45] to construct specificity matrices using a pseudocount of 1 per position. Throughout this study, I use as the background distribution (g A , g C , g G , g T ) = (0.3, 0.2, 0.2, 0.3) which is close to the observed nucleotide probabilities in drosophila noncoding DNA. In order to avoid the possibility of circularity, for analysis of each regulatory region I excluded the characterized sites from that region and reconstructed the matrix, such that (for example) Bcd sites from the hb anterior activator were not included in the matrix used for analysis of the hb anterior activator. These matrices were used to compute ΔS for each substitution (Tables 1 and 3

) and E[ΔS] and V[ΔS] (Tables 2 and 4).
Alignments and phylogenetic analysis of regulatory sequences I obtained homologous regions for each regulatory region from the UCSC genome-browser alignments [46]. The sequences were then realigned using mLAGAN [47]. Using these alignments and the known species relationships for these species [48], I estimated the evolutionary distance and transition-transversion rate ratio bias under an HKY model [24] using paml [49]. The parameters estimated using paml for each regulatory region were then used to compute the exact E[ΔS] and V[ΔS] shown in Tables 2 and 4.

Simulations of molecular evolution
To confirm that the test statistic had a standard normal distribution under the null hypothesis, I simulated the evolution of the 6 known binding sites in the hb anterior activator. To do so, I inferred the ancestral sequences using maximum parsimony [50], and then simulated their evolution by introducing substitutions using an HKY model with kappa estimated from the alignment, 60% AT content for the equilibrium distribution of nucleotides, and evolutionary distance scaled to observe an average of 5 substitutions over the 6 binding sites. I then computed the average ΔS for the substitutions we observed, and calculated the Z statistic using E[ΔS] and V [ΔS] computed exactly using the evolutionary model or using the long-time limit approximation. I repeated this simulation until I observed 1000 cases with at least 3 substitutions in total. Simulations for the eve stripe 2 enhancer were similar, except I used the actual D. melanogaster binding sites (because reliable inference of the ancestral sites was difficult) and that the evolutionary distance was scaled so that the 5 substitutions were distributed over the 5 Bcd and 6 Kr sites.
Distribution of ΔS I sought to compute the distribution of ΔS in the absence of selection. Because the number of observed evolutionary differences in any particular binding site is typically small, I make the assumption that each DNA difference in a transcription factor binding site occurs independently, and presence of a single change has no effect on the probability of other changes. Under this assumption, the probability of observing the particular change from base a to base b (a, b in {A, C, G, T}) at position i is and P = e Rt is a substitution probability matrix. The expected value of ΔS for binding sites that evolve in the absence of selection is ΣΔS iab p(a b at i|one subs.) or Similarly, for the variance, we have As computed here, the distribution of ΔS is exact only for the first substitution at each site in a particular sequence. Therefore it is important to apply the tests described here to cases where only small numbers of substitutions have occurred on each lineage. For the regulatory sequences considered here, this assumption seems appropriate. However, if enough substitutions have occurred such that multiple subsequent substitutions occur at the same position, the distribution of ΔS computed based on the sequence of a reference species or inferred ancestral sequence will no longer be exact. Computing the distribution of ΔS under more relaxed assumptions is area for further research.
Since I am considering the conditional probability that one particular substitution occurs out of all the possible substitutions that could have occurred, under some substitution models such as F81 [51], or in the limit of long evolutionary time, this probability does not depend on time and mutation rate (evolutionary distance). I refer to this time independent approximation as the which depends only on the frequencies in the matrix, f, and the background distribution of nucleotides g, where now a, b and c index the bases {A, C, G, T}. Therefore under this model, the long time limit is exact. Substitution into the general formulas for the expectation gives for the case of binding sites evolving in the absence of selection (case 1). This formula can be simplified using the fact that ΔS = 0 if a = b: Therefore, we have for case 1, While these formulas are complicated, they depend only on the residue probabilities in the matrix (f) and the BMC Evolutionary Biology 2009, 9:286 http://www.biomedcentral.com/1471-2148/9/286 background (g), and therefore phylogenetic analysis is not required.

Mixtures of binding sites
In the case of v transcription factors binding a regulatory region, ΔS is drawn from a mixture distribution, where v is the number of types of transcription factor binding sites, π i is the probability that the substitution occurred in the j-th type, and p(ΔS) j is the distribution of ΔS for the j-th type of binding motif. To compute this we need π i = p(subs. in type j | one subs.), so π j p subs in type j p one subs This can be computed using for case 1. To compute the mean and variance of arbitrary mixture models we proceed as follows. To simplify the notation, I will indicate sums over i, a, b, as sums over ΔS. In this notation, We now add and subtract the square of E[ΔS] for the j-th motif.