Volume 7 Supplement 1

## First International Conference on Phylogenomics

# Detecting non-coding selective pressure in coding regions

- Hui Chen
^{1}and - Mathieu Blanchette
^{1}Email author

**7(Suppl 1)**:S9

**DOI: **10.1186/1471-2148-7-S1-S9

© Chen and Blanchette; licensee BioMed Central Ltd. 2007

**Published: **8 February 2007

## Abstract

### Background

Comparative genomics approaches, where orthologous DNA regions are compared and inter-species conserved regions are identified, have proven extremely powerful for identifying non-coding regulatory regions located in intergenic or intronic regions. However, non-coding functional elements can also be located within coding region, as is common for exonic splicing enhancers, some transcription factor binding sites, and RNA secondary structure elements affecting mRNA stability, localization, or translation. Since these functional elements are located in regions that are themselves highly conserved because they are coding for a protein, they generally escaped detection by comparative genomics approaches.

### Results

We introduce a comparative genomics approach for detecting non-coding functional elements located within coding regions. Codon evolution is modeled as a mixture of codon substitution models, where each component of the mixture describes the evolution of codons under a specific type of coding selective pressure. We show how to compute the posterior distribution of the entropy and parsimony scores under this null model of codon evolution. The method is applied to a set of growth hormone 1 orthologous mRNA sequences and a known exonic splicing elements is detected. The analysis of a set of *CORTBP2* orthologous genes reveals a region of several hundred base pairs under strong non-coding selective pressure whose function remains unknown.

### Conclusion

Non-coding functional elements, in particular those involved in post-transcriptional regulation, are likely to be much more prevalent than is currently known. With the numerous genome sequencing projects underway, comparative genomics approaches like that proposed here are likely to become increasingly powerful at detecting such elements.

## Background

Vertebrate genomes are now recognized as containing a huge number of non-coding functional regions, a large fraction of which is likely to be involved in regulating the various steps of gene expression [1–4]. While most of the attention has been centered on understanding the regulation of transcription, post-transcriptional regulatory mechanisms now appear to be more important than originally thought. Cis-regulation of pre-mRNA splicing is believed to be operated by splicing factors binding intronic and exonic splicing enhancers and helping to include or exclude specific exons from the transcript [5, 6]. Post-splicing, parts of the mature mRNA often folds into some RNA secondary structure that determines the level of mRNA degradation [7] as well as mRNA localization [8]. Translational efficiency and accuracy have been shown to be largely determined by the choice of synonymous codon, thus imposing some selective pressure on the codons of certain genes [9]. Translation is also known to be affected by certain secondary structure elements in the mRNA [10]. While most of the known examples of formation of functional secondary structure are restricted to the 5' and 3' UTRs, the coding portion of the mRNA has also been shown to form functional structures [11]. Finally, there are also examples of transcription factor binding sites located in coding exons (e.g. in CD28 [12]). The method presented here should allow the detection of many of these functional elements, which we call coding regions under non-coding selection (CRUNCS). To this point, the computational methods that have proven the most valuable for identifying non-coding functional regions are based on comparative genomics. The guiding principle of this family of approaches is that functional features of a DNA sequence tend to evolve slower than non-functional ones, because of selective pressure. This simple idea is at the core of phylogenetic footprinting, a method that compares orthologous regulatory DNA regions to identify short conserved motifs likely to be transcription factor binding sites [13, 14]. The key here is that most of the DNA in promoter regions is non-functional, with the exception of the regulatory elements we are interested in. The same reasoning applies to the detection of intronic splicing enhancers [15]. With the ongoing sequencing of a large number of vertebrate genomes [16], the power of these methods is quickly improving and, coupled with algorithmic improvements [17], they are now able to detect very short regions under selection, or regions under weak selection.

The search for CRUNCS is more challenging. Although the same "conservation implies function" principle applies in this case, it needs to be used more cautiously. Indeed, CRUNCS are *not* located in non-functional sequences as are, for example, most known transcription factors binding sites, but rather in *coding* regions. This means that the sequence conservation observed in exons may be the result of two types of selective pressures. The first one is the pressure to maintain the function of the protein encoded by the gene, which probably explains most of the sequence conservation observed in coding regions. The second type of selective pressure applies only to CRUNCS, which are required to maintain their regulatory role. To apply phylogenetic footprinting to the detection of CRUNCS, one needs to determine which type of selective pressure is responsible for the sequence conservation observed.

The method suggested here takes a conservative approach to the problem. Given a set of aligned orthologous coding sequences, we first evaluate the degree of conservation of each column of the alignment, using either a parsimony score or an entropy score. We then put the burden of explaining the conservation observed as much as possible on the shoulders of the selective pressure on the protein product. Because most amino acids are encoded by many synonymous codons, amino acid selective pressure leaves room for some sequence variation. A region of the sequence will be predicted to be an CRUNCS only if the conservation observed cannot be explained solely by the need for conservation of the encoded amino acids. The method introduced here build a mixture model of codon evolution, and then uses it as a null model to assess the significance of the observed degree of conservation. We illustrate our approach on two sets of orthologous vertebrate genes (growth hormone 1 and *CORTBP2*) and compare it to a related approach by Blanchette [18].

## Results and Discussion

Given a multiple alignment of orthologous mRNA sequences, our goal is to identify alignment columns that are conserved beyond what would be expected by chance if the corresponding sites were evolving only under the selective pressure on the amino acid they contribute to encode. Such sites are likely to be under non-coding selective pressure. This section, which constitutes the main contribution of this paper, is structured as follows. First, we define two commonly used sequence conservation scoring methods: the entropy, and the parsimony score. We then describe a methods for assigning a p-value to a given entropy or parsimony score, under null models of evolution of codons that are only under coding selective pressure. Under this method, we model codon evolution as a mixture of codon substitution models and use these models to assign a posterior p-value to a given conservation score.

### Two measures of sequence conservation

A number of methods have been proposed to measure the degree of conservation of a set of orthologous sequences and to identify regions under selective pressure (see [19] for an evaluation of some of these methods for finding regulatory elements in intergenic regions). In this paper, we consider two such methods, the entropy score and the parsimony score, and later show how to assess the statistical significance of these scores in the context of coding regions.

#### Entropy

In the area of transcription factor binding sites detection, a popular method for evaluating sequence conservation is the entropy (see, for example, [20]). Given a set of orthologous nucleotides *x*_{1}, *x*_{2}, ...,*x*_{
n
}from *n* different species, the entropy measures the distance between the observed frequency of A, C, G, and T's at that site and the uniform distribution: *entropy* (*x*_{1}, *x*_{2}, ...,*x*_{
n
}) = -∑*α*∈{*A*,*C*,*G*,*T*} *f*_{
α
}log_{2}(*f*_{
α
}), where *f*_{
α
}is the relative frequency of nucleotide *α*. Perfectly conserved sites have an entropy of zero, while the worst levels of conservation obtain a score of 2.

#### Parsimony score

*x*

_{1},

*x*

_{2}, ...,

*x*

_{ n }and a phylogenetic tree

*T*whose leaves are labeled with these nucleotides, the parsimony score is defined as the minimum number of substitutions, performed along the branches of the tree

*T*, that can explain the set of nucleotides observed at the leaves. It is thus a lower bound on the actual number of substitutions at that site, and, in cases where this number is not too large compared to the number of branches in the tree, it is a fairly good estimate of the actual number of substitutions that occurred at that site. The parsimony score of a set of

*n*nucleotides can be computed in time

*O*(

*n*) using Sankoff's algorithm [22] or Fitch's algorithm [21]. Figure 1 gives an example of orthologous nucleotides whose conservation is well characterized by either the entropy or the parsimony scores. Entropy and parsimony scores attempt to measure the total selective pressure on a given site. While, for non-coding regions, this selective pressure can be assumed to come completely from the presence of non-coding functional elements, this is not the case in coding regions. The rest of this paper describes how to measure the statistical significance of a certain conservation (entropy or parsimony) score when the site under study lies within a coding region.

### Conservation p-values under a mixture of codon models

In this section, we introduce a null model of coding sequence evolution that consists of a mixture of codon substitution models representing the evolution of codons that are under different types of *purely* coding selective pressure. We then show how to compute posterior p-values for the entropy and parsimony scores of orthologous sequences evolving under those models.

#### Mixture models for codon evolution

We learn amino acid functional categories using the Pfam database of amino acid sequence alignments of protein domains [24]. We start by learning the amino acid stationary distribution of each rate category. We then use these distributions to classify the Pfam alignment columns, and use this classification to estimate the amino acid and codon substitution rate matrices for each class.

*π*

_{1},

*π*

_{2}, ...,

*π*

_{50}and class prior probability distribution

*τ*are estimated using an unsupervised Expectation-Maximization algorithm (see Methods) in order to fit the Pfam alignments as closely as possible. Figure 2 (top) shows the amino acid distribution obtained for each of the 50 classes (see also Supplementary data). We observe that most of the expected functional classes are present among our 50 classes. First, for each amino acid, there is at least one category where that amino acid has a probability at least 0.5. These correspond to positions that tolerate little variation outside of that particular amino acid. Less constrained categories include several combinations of residues with similar properties: hydrophobic residues (ILV), small neutral residues (AST), aromatic residues (YF), positively charged residues (KR), etc. The class of negatively charged amino acids (DE) is surprisingly not directly represented, although these two amino acids show up together in several more weakly defined classes. Various categories correspond to positions under little selective pressure. Many of our classes are similar to those reported by Sjolander et al. [25] using a related procedure.

#### Amino acid and codon substitution rate matrices

For each of the 50 classes above, an amino acid substitution rate matrix and a codon substitution rate matrix are derived. We first compute the probability of each Pfam alignment column to belong to each of the classes, and use these to estimate the probability of amino acid and codon transitions between human and mouse sequences. Rate matrices are then derived from these empirically estimated transition probabilities matrices. The detailed procedure is described in Methods.

Figure 2 (bottom) shows two of the codon rate matrices obtained. Note how mutations toward codons encoding favored amino acids occur at a high rate, whereas substitutions away from those are rare. Notice that these rate matrices automatically take into account codon biases, as they are built from mRNA sequences. We make the assumption that the codon biases of human and mouse, which are reflected in our rate matrices, are representative of the codon biases in the other species used in our analysis. This assumption appears to hold quite well within mammals and birds, and to a lesser extent within vertebrates in general [26]. However, we recognize the fact that some genes (e.g. those requiring a high rate of translation) may have codon biases that are stronger than the average, resulting in an unexpected degree of conservation. Still, since this type of selective pressure would most likely apply to the entire transcript, it would be easily detectable using our approach, and would not result in false identification of other types of non-coding elements.

#### Distribution of conservation scores

*X*

_{1}...

*X*

_{ m }, where

*X*

_{ i }is the triplet of alignment columns corresponding to the

*i*-th codon in the alignment. Let

*X*

_{ i }(

*j*) be the codon observed in species

*j*∈ {1...

*s*}, where

*s*is the number of column triplets in the alignment, and let

*X*

_{i,p}(

*j*) be the nucleotide at position

*p*(

*p*= 1, 2 or 3) in that codon. Given a column of orthologous codons

*X*

_{ i }, we want to assess whether the conservation observed at position

*p*of the codon is unexpected. To this end, we compute the posterior p-value of the observed conservation score (entropy or parsimony score) of that codon position, under the null hypothesis that the columns are only under coding selective pressure.

To describe more formally our null model of sequence evolution, we need to introduce some notation. Let *T* = (*V, E*) be a binary phylogenetic tree with vertices *V*, edges *E*, root *r*, and with leaves numbered 1,2,...,*n*. Let *λ*(*u*, *v*) be the length of the branch going from node *u* to node *v*, let *a*(*c*) be the amino acid encoded by codon *c*, and let **Q** be some codon substitution rate matrix. The codon transition probability matrix for a branch (*u*,*v*) is given by **P**_{(u,v)}= *e*^{λ(u,v)}^{
Q
}[27]. Let *b*(c) be the background probability of codon *c*, which is assumed to be the stationary distribution of **Q**. These three parameters (*T*, *λ*, **Q**) describe a process that generates random but related codons at the leaves of the tree *T*, by drawing a codon from the stationary distribution of **Q** at the root of *T* and letting it mutate along the branches of the tree using the appropriate transition probability matrices. Let *C*(*u*) be the random variable representing the codon that has been generated by this process at node *u* of the tree.

We are interested in computing the distribution of the conservation score of a given position *p* of a set of random orthologous codons generated at the leaves of the tree. We start by showing how to compute this distribution for the entropy score *entropy*(*C*_{
p
}(1),*C*_{
p
}(2), ..., *C*_{
p
}(*n*)), and later show the modifications required to do the same for the parsimony score. For a fixed codon position *p* ∈ {1,2,3}, and for any node *u* ∈ *V*, let *Y*_{
u
}= (*Y*_{
u
}(A), *Y*_{
u
}(C), *Y*_{
u
}(G), *Y*_{
u
}(T)) be a random multivariable where *Y*_{
u
}(*α*) is the number of nucleotides of type *a* at position *p* of the codons at the leaves of the subtree rooted at *u*. Notice that *Y*_{
u
}is only a function of the codons at the leaves of subtree(*u*), and not of those at the internal nodes of subtree(*u*). The p-value of a certain entropy score *e* for position *p* is obtained by summing the probabilities of all values of *Y*_{
r
}that yield an entropy score *e* or better:

We will show how to compute Pr[*Y*_{
u
}= (*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
})|*C*(*u*) = *k*], for every node *u* and all choices of *y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}and *k*, using a dynamic programming algorithm visiting the nodes of *T* in post-order. When *u* is a leaf, we have

Define (*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}) ⊕ (*z*_{
a
}, *z*_{
c
}, *z*_{
g
}, *z*_{
t
}) : = (*y*_{
a
}*+ z*_{
a
}, *y*_{
c
}*+ z*_{
c
}, *y*_{
g
}*+ z*_{
g
}, *y*_{
t
}*+ z*_{
t
}). Now, let *u* be an internal node with children *v* and *w*. Notice that *Y*_{
u
}= *Y*_{
v
}⊕ *Y*_{
w
}. We compute the desired conditional probabilities at node *u* based on those at nodes *v* and *w*:

Implementation optimizations and computational complexity analysis are given in Methods.

#### Distribution of parsimony scores

The method described in the previous section can be surprisingly easily modified to compute the conditional p-value of parsimony scores instead of that of the entropy score. We need to redefine the random variable *Y*_{
u
}= (*y*_{
a
}, *y*_{
c
}*, y*_{
g
}*, y*_{
t
}) so that *y*_{
α
}is now the parsimony score obtained for the nucleotides at position *p* of codons at the leaves of the subtree rooted at *u*, assuming that the nucleotide at the ancestral node *u* is required to be *a*. Note that this set of four numbers per node is exactly that computed by the Sankoff algorithm for computing parsimony scores [22]. We also redefine the ⊕ operator as

(*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}) ⊕ (*z*_{
a
}, *z*_{
c
}, *z*_{
g
}, *z*_{
t
}) = (min(*y*_{
a
}+ *z*_{
a
}, *y*_{
a
}+
+ 1,
+ *z*_{
a
}+ 1,
+
+ 2),

min(*y*_{
c
}+ *z*_{
c
}, *y*_{
c
}+
+ 1,
+ *z*_{
c
}+ 1,
+
+ 2),

min(*y*_{
g
}+ *z*_{
g
}, *y*_{
g
}+
+ 1,
+ *z*_{
g
}+ 1,
+
+ 2),

min(*y*_{
t
}+ *z*_{
t
}, *y*_{
t
}+
+ 1,
+ *z*_{
t
}+ 1,
+
+ 2))

*min*

_{j≠i}

*y*

_{ j }. Notice that this is again in direct analogy to Sankoff's algorithm. Again,

*Y*

_{ u }=

*Y*

_{ v }⊕

*Y*

_{ w }and we get

*parsimony*(

*C*

_{ p }(1), (

*C*

_{ p },(2),...,

*C*

_{ p }(

*n*)) =

*min*(

*Y*

_{ r }). With these redefinitions, Equation 2 holds without any modifications needed. We get

### Posterior distributions of conservation scores

Having shown how to compute the p-value of a given entropy or parsimony score under a fixed codon rate matrix, it is simple to compute posterior p-values for the case where the functional class is not known in advance. Consider a given set of aligned codons *X*_{
i
}= (*X*_{
i
}(1), ..., *X*_{
i
}(*n*)) encoding the set of amino acids *A*_{
i
}= (*A*_{
i
}(1), ..., *A*_{
i
}(*n*)). Define the unobserved variables *Z*_{i,k}= 1 if the site *i* belongs to functional class *k*, and zero otherwise. We first compute the posterior probability of each *Z*_{i,k}, given the observed amino acids at site *i*:

where Pr[*A*_{
i
}(1), ..., *A*_{
i
}(*n*)|*Z*_{i,k}= 1,
] is computed using Felsenstein's algorithm [28], with rate matrix
. Finally, we obtain the posterior estimate *pv*_{
post
}(*i*,*p*) for the p-value of the entropy score *e* observed at position *p* of codon *i*, by summing over all classes:

and similarly for parsimony scores p-values.

### Conditional p-values

*j*, the codon

*had*to encode amino acid

*a*(

*X*

_{ j }(

*i*)), for each leaf

*j*∈ {1, ...,

*n*}, is the conservation observed in

*X*

_{ i }surprising? Notice how, compared to the mixture model approach, this model transfers the responsibility of sequence conservation even more onto the shoulders of coding selection. See Figure 3 for an example. Mathematical details are provided in Methods.

#### A sliding window approach

Until now, we have shown two ways to compute p-values for individual alignment columns. Since most non-coding functional elements are expected to span several consecutive positions (5-15nt for transcription factor binding sites and exonic splicing enhancers, and up to a few hundred nucleotides for RNA secondary structure elements), we can improve the sensitivity of the method by using a simple sliding window approach. For each position *i*, let *pv*(*i*) be the p-value obtained for the *i*-th column of the alignment (not the *i*-th codon), and let *w* be the width of the sliding window (assume *w* is odd, for simplicity). We compute *S*_{
i
}= ∏_{j = i- ⌊w/2⌋...i + ⌊w/2⌋}*pv*(*j*). If we assume that, under the null model, *pv*(*j*) is approximately uniformly distributed, a compounded p-value can be assigned to *S*_{
i
}: *cpv*(*i*) = Pr[Product of *w* i.i.d. uniform (0,l) ≤ *S*_{
i
}] = *S*_{
i
}(1 + ∑_{j = 1...w-1}- ln(*S*_{
i
})^{
j
}*/j*!). This is the type of p-value being reported in the Applications section. It should however be noted that although the uniformity assumption holds quite well for the third codon positions, the first two codon positions are often completely determined by the codon they encode, so the range of possible p-values they can take is quite small. This results in the compounded p-values being quite conservative.

#### Implementation

The algorithms were implemented in C++ and the program is available upon request. A number of optimizations described in [18] have also been implemented and make the program relatively fast. In particular, a caching mechanism allows to re-use the results of computations done on previous columns. This allows the program to handle very long sequences quickly. All analyses reported here have been obtained in less two hours of computation on a desktop machine.

### Analysis of simulated data

#### Analysis biological data

We illustrate our approach on two sets of orthologous mRNA sequences: a set of 69 vertebrate growth hormone 1 (*GH1*) sequences, and a set of 13 vertebrate *CORTBP2* (also known as *CTTNBP2*) sequences (see Supplementary data).

*GH1*was one of the first gene shown to harbor an exonic splicing enhancer, in cow [29]. The availability of a large number of orthologous sequences makes it ideal for our study. Figure 5 shows the compounded p-values obtained from the mixture-based method, for the parsimony score, using a sliding window with

*w*= 9. The human region orthologous to a known exonic splicing enhancer in the cow, located around position 577 [29], is clearly identified, obtaining a compounded p-value of about 2 × 10

^{-3}. Although this region could have been identified on the basis of parsimony scores alone (gray curve on Figure 5), it is highlighted much more clearly by the posterior p-values.

*CORTBP2*transcript is particularly intriguing. Little is known about the post-transcriptional regulation of this gene. We find that its mRNA contains a very large region (roughly between positions 1500 and 2200, see Figure 8 and Supplementary material), which obtain fairly low p-values. This region is much too large to be a binding site, and we conjecture that it may form some large RNA secondary structure affecting the pre-mRNA splicing or the mRNA stability, localization or translation. Notice that in this case, a simple parsimony score is not sufficient for identifying the region, as many other regions of the transcript are well equally well conserved.

## Conclusion

With the many genome sequencing projects rapidly producing vertebrate genomic data, comparative genomic approaches are becoming increasingly powerful. In the case of CRUNCS, additional data is often available in the form of ESTs and cDNAs. We believe that within one year or two, there will be sufficient data for accurate detection of CRUNCS in vertebrate genes, using methods like those described here. Once many CRUNCS will have been detected, the next step will of course be to assign functions to these elements. Although the last word will remain with experimentalists, we have good hopes that more advanced bioinformatics approaches will yield insights into these questions.

Finally, we expect that organisms that are under severe genome size constraints, in particular bacteria and viruses, will more often use CRUNCS. We believe that our approaches will prove particularly fruitful to analyze these genomes.

## Methods

### Estimating amino acid stationary distributions for each class

The Pfam database consists of a set of multiple alignments of homologous protein domain sequences. For some domains, the database contains several sequences that are very closely related. To reduce biases due to this over-representation, one member of any pair of domain sequences that share more than 60% identity is discarded. Let *D*_{1}, ..., *D*_{
m
}be the set of alignment columns in this reduced Pfam database, let *S*_{
i
}be the number of species in alignment column *i*, let *D*_{
i
}(*j*) be the amino acid from species *j* in column *i*, and let *E*_{
i
}(*j*) be the codon encoding that amino acid in the mRNA of species *j*. Starting from Pfam 17.0, we have *m* = 2292182 amino acid alignment columns.

For simplicity, we assume that the amino acids in a given column are drawn independently from the same unknown amino acid distribution. Let *π*_{1}, *π*_{2}, ..., *π*_{50} be a set of amino acid distributions, where *π*_{
k
}(*α*) is the probability of amino acid *a* in class *k*. Let *τ*(*k*) be the prior probability of class *k*. The class membership of column *i* is unknown. Let *Z*_{
i
}(*k*) be a hidden variable that takes value 1 if column *i* belongs to class *k*, and 0 otherwise. We have Pr[*D*_{
i
}|*Z*_{
i
}(*k*) = 1] =
.

We search for the distributions *π*_{1}, ..., *π*_{50} and prior probabilities *τ* that maximize Pr[*D*_{1} ... *D*_{
m
}] = ∏_{i = 1...m}Pr[*D*_{
i
}] = ∏_{i = 1...m}∑_{k = 1..50}Pr[*D*_{
i
}|(*k*) = l]*τ*(*k*). This is achieved using a simple EM algorithm for learning mixtures of multinomial distributions, using the following update formulas to go from iteration *t* to iteration *t* + 1.

and

where *N*_{
i
}(*α*) is the number of occurrences of amino acid *α* in column *D*_{
i
}and where *ζ* and *ζ*' are normalizing constants that ensure that the probabilities sum to one.

Initialized with noisy uniform distributions, the algorithm converges quickly (less than 50 iterations) to the local optimum depicted in Figure 2. Ten restarts from other random initial conditions converged to very similar distributions, so the first one was retained.

### Estimating amino acid and codon rate matrices

Once the amino acid distributions *π*_{1}, ..., *π*_{50} and prior distribution *τ* are learned, we can use them to compute the posterior probability of each class *k* for each alignment column *D*_{
i
}:

where *ζ*'' is another normalizing constant. For simplicity, the amino acid and codon substitution rate matrices
and
are estimated for each class *k* based only on the observed substitutions between human and mouse sequences (however, the class posterior probabilities of each column are based on all available species). Any column that does not include an entry for these two species is excluded. Let *E*_{
i
}(*human*) and *E*_{
i
}(*mouse*) be the codons encoding amino acid *D*_{
i
}(*human*) and *D*_{
i
}(*mouse*) in the corresponding mRNA sequences. The transition matrices are then estimated as follows:

*α*,

*β*) = ln (

*α*,

*β*)/

*t*

_{(h,m)}, where

*t*

_{(h,m)}is the expected number of substitutions per site between human and mouse, in neutrally evolving DNA. Codon rate matrices are obtained in a similar manner.

Note that the codon substitution rates we obtain are slightly underestimating the true rates for sites evolving only under coding selective pressure, because some sequences in Pfam are likely to be evolving slower due to non-coding selective pressure. However, we expect that this underestimation is negligible as the fraction of such sites is likely to be small. In any case, underestimating the rates will only cause conservative estimates of the conservation p-values.

### Computing entropy and parsimony score p-values

The probabilities Pr[*Y*_{
u
}= (*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
})|*C*(*u*) = *k*_{
u
}] are stored in a hash table associated to each node, indexed by the quintuplet (*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}, *k*_{
u
}). Only non-zero probabilities are stored. To compute these probabilities for a node *u* with children *v* and *w*, it is simpler to enumerate all pairs (*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}, *k*_{
v
}), (*z*_{
a
}, *z*_{
c
}, *z*_{
g
}, *z*_{
t
}, *k*_{
w
}) of quintuplets from the hash tables of the two children, and add the proper quantity to the entry ((*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}) ⊕ (*z*_{
a
}, *z*_{
c
}, *z*_{
g
}, *z*_{
t
}), *k*_{
u
}) of the hash table at *u*, for all choices of *k*_{
u
}. Please see [18] for more details.

To study the complexity of the resulting algorithm for the entropy p-value computation, observe that the hash table associated with node *v* whose subtree contains *l*(*v*) leaves will contain at most
∈ *O*(*l*(*v*)^{3}) non-zero entries and thus computing all entries of the hash table for node *u* takes *O*(*l*(*v*)^{3}*l*(*w*)^{3}) time. Depending on the topology of *T*, computing the hash tables of all nodes takes between *O*(*n*^{4}) time for a completely skewed tree and *O*(*n*^{6}) time for a balanced tree. The time complexity analysis applies to both the posterior p-value and the conditional p-value computations. However, for posterior p-values, this computation needs to be repeated 50 times, once for each rate matrix.

The hash table implementation works well in the case of parsimony score p-value computation too, especially with the following optimizations. First, for binary trees, the only choices of *y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}that may have non-zero probability are those where |*y*_{
i
}– *y*_{
j
}| ≤ 2 for all *i, j* ∈ {A, C, G, T} (this is a direct consequence of the ⊕ operator). When evaluating the p-value for a parsimony score *ψ*, choices of *y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}such that *min*(*y*_{
a
}, *y*_{
c
}, *y*_{
g
}, *y*_{
t
}) > *ψ* are not affecting the final p-value and can be safely ignored. Therefore, the hash table associated with node *v* contains *O*(*ψ*) = *O*(*n*) entries, and computing all entries for node *u* from those of nodes *v* and *w* take *O*(*l*(*u*) · *l*(*v*)). Thus, one can compute all entries of all tables in *O*(*n*^{2}), irrespective of the tree topology. More details can be found in [18].

### Conditional P-values

The method of conditional p-values, introduced by Blanchette [18], can be summarized as follows. The method computes the following conditional p-value:

*pv*_{
cond
}(*i, p*) = Pr[*entropy*(*C*_{
p
}(1), ...,*C*_{
p
}(*n*)) ≤ *entropy*(*X*_{
i
},_{
p
}(1), ..., *X*_{
i
},_{
p
}(*n*))*|a*(*C*(1)) = *a*(*X*_{
i
}(l)), ..., *a*(*C*(*n*)) = *a*(*X*_{
i
}(*n*))]

This p-value can be computed in a manner that is similar to the Equation 2. Let us denote by *l*(*u*) the set of leaves in the subtree rooted at *u* and let us write the set of conditions for the subtree rooted at *u* as *A*(*u*) = ∧_{j∈l(u)}(*a*(*C*(*j*)) = *a*(*X*_{
i
}(*j*))). We get

and

where Pr[*C*(*u*) = *k|A*(*u*)], is computed using Felsenstein's algorithm and Bayes rule. As before, a dynamic programming algorithm proceeding in a post-order traversal of the tree allows the computation of all terms required.

## Supplementary data

Multiple alignments, phylogenetic trees and detailed results are available at [30].

## Declarations

### Acknowledgements

We would like to thank Martin Tompa, Saurabh Sinha, Adam Siepel, and David Haussler for useful discussions early in this project. We also thank the editor Hervé Philippe and one anonymous referee for their hard work on this manuscript and for their useful suggestions.

This article has been published as part of *BMC Evolutionary Biology* Volume 7 Supplement 1, 2007: First International Conference on Phylogenomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcevolbiol/7?issue=S1.

## Authors’ Affiliations

## References

- The International Mouse Genome Sequencing Consortium: Initial sequencing and comparative analysis of the mouse genome. Nature. 2002, 420 (6915): 520-62. 10.1038/nature01262.View ArticleGoogle Scholar
- Margulies EH, Blanchette M, Haussler D, Green ED: Identification and characterization of multi-species conserved sequences. Genome Research. 2003, 13 (12): 2507-2518. 10.1101/gr.1602203.PubMed CentralView ArticlePubMedGoogle Scholar
- Blanchette M, Bataille A, Chen X, Poitras C, Laganiere J, Lefebvre C, Deblois G, Giguere V, Ferretti V, Bergeron D, Coulombe B, Robert F: Genome-wide computation prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Research. 2006, 16 (5): 656-68. 10.1101/gr.4866006.PubMed CentralView ArticlePubMedGoogle Scholar
- King DC, Taylor J, Elnitski L, Chiaromonte F, Miller W, Hardison RC: Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res. 2005, 15 (8): 1051-1060. 10.1101/gr.3642605.PubMed CentralView ArticlePubMedGoogle Scholar
- Hastings M, Krainer A: Pre-mRNA splicing in the new millenium. Current Opinion in Cell Biology. 2001, 13: 302-309. 10.1016/S0955-0674(00)00212-X.View ArticlePubMedGoogle Scholar
- Fairbrother W, Yeh R, Sharp P, Burge C: Predictive identification of exonic splicing enhancers in human genes. Science. 2002, 297: 1007-1013. 10.1126/science.1073774.View ArticlePubMedGoogle Scholar
- Meisner NC, Hackermuller J, Uhl V, Aszodi A, Jaritz M, Auer M: mRNA openers and closers: modulating AU-rich element-controlled mRNA stability by a molecular switch in mRNA secondary structure. Chembiochem. 2004, 5 (10): 1432-1447. 10.1002/cbic.200400219.View ArticlePubMedGoogle Scholar
- Jansen RP: mRNA localization: message on the move. Nat Rev Mol Cell Biol. 2001, 2 (4): 247-256. 10.1038/35067016.View ArticlePubMedGoogle Scholar
- Kozak M: Determinants of translational fidelity and efficiency in vertebrate mRNAs. Biochimie. 1994, 76 (9): 815-821. 10.1016/0300-9084(94)90182-1.View ArticlePubMedGoogle Scholar
- Kozak M: Regulation of translation via mRNA structure in prokaryotes and eukaryotes. Gene. 2005, 361: 13-37. 10.1016/j.gene.2005.06.037.View ArticlePubMedGoogle Scholar
- Meyer IM, Miklos I: Statistical evidence for conserved, local secondary structure in the coding regions of eukaryotic mRNAs and pre-mRNAs. Nucleic Acids Res. 2005, 33 (19): 6338-6348. 10.1093/nar/gki923.PubMed CentralView ArticlePubMedGoogle Scholar
- Lin C, Tam RC: Transcriptional regulation of CD28 expression by CD28GR, a novel promoter element located in exon 1 of the CD28 gene. J Immunol. 2001, 166 (10): 6134-6143.View ArticlePubMedGoogle Scholar
- Tagle D, Koop B, Goodman M, Slightom J, Hess D, Jones R: Embryonic
*ε*and*γ*globin genes of a prosimian primate (Galago crassicaudatus) nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. Journal of Molecular Biology. 1988, 203: 439-455. 10.1016/0022-2836(88)90011-3.View ArticlePubMedGoogle Scholar - Blanchette M, Tompa M: Discovery of Regulatory elements by a computational method for phylogenetic footprinting. Genome Research. 2002, 12: 739-748. 10.1101/gr.6902.PubMed CentralView ArticlePubMedGoogle Scholar
- Sorek R, Ast G: Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 2003, 13 (7): 1631-1637. 10.1101/gr.1208803.PubMed CentralView ArticlePubMedGoogle Scholar
- Margulies EH, Vinson JP, Miller W, Jaffe DB, Lindblad-Toh K, Chang JL, Green ED, Lander ES, Mullikin JC, Clamp M, Program NISCCS: An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc Natl Acad Sci USA. 2005, 102 (13): 4795-4800. 10.1073/pnas.0409882102.PubMed CentralView ArticlePubMedGoogle Scholar
- Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005, 15 (8): 1034-1050. 10.1101/gr.3715005.PubMed CentralView ArticlePubMedGoogle Scholar
- Blanchette M: A comparative analysis method for detecting binding sites in coding regions. Proceedings of the seventh annual international conference on computational molecular biology. 2003Google Scholar
- Stojanovic N, Florea L, Riemer C, Gumucio D, Slightom J, Goodman M, Miller W, Hardison R: Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. Nucleic Acids Res. 1999, 27 (19): 3899-910. 10.1093/nar/27.19.3899.PubMed CentralView ArticlePubMedGoogle Scholar
- Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning. 1995, 21 (1–2): 51-80.Google Scholar
- Fitch WM: Toward Defining the Course of Evolution: Minimum Change for a Specified Tree Topology. Systematic Zoology. 1971, 20: 406-416. 10.2307/2412116.View ArticleGoogle Scholar
- Sankoff DD: Minimal mutation trees of sequences. SIAM Journal on Applied Mathematics. 1975, 28: 35-42. 10.1137/0128004.View ArticleGoogle Scholar
- Lartillot N, Philippe H: A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol. 2004, 21 (6): 1095-109. 10.1093/molbev/msh112.View ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res. 2004, D138-D141. 10.1093/nar/gkh121. 32 DatabaseGoogle Scholar
- Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci. 1996, 12 (4): 327-345.PubMedGoogle Scholar
- Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res. 2000, 28: 292-10.1093/nar/28.1.292.PubMed CentralView ArticlePubMedGoogle Scholar
- Karlin S, Taylor H: A first course in stochastic processes. 1975, Academic Press, secondGoogle Scholar
- Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981, 17: 368-376. 10.1007/BF01734359.View ArticlePubMedGoogle Scholar
- Dirksen WP, Li X, Mayeda A, Krainer AR, Rottman FM: Mapping the SF2/ASF binding sites in the bovine growth hormone exonic splicing enhancer. J Biol Chem. 2000, 275 (37): 29170-29177. 10.1074/jbc.M001126200.View ArticlePubMedGoogle Scholar
- Supplemental material. [http://www.mcb.mcgill.ca/~blanchem/CRUNCS]

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.