Patterns of intron gain and conservation in eukaryotic genes

Background: The presence of introns in protein-coding genes is a universal feature of eukaryotic genome organization, and the genes of multicellular eukaryotes, typically, contain multiple introns, a substantial fraction of which share position in distant taxa, such as plants and animals. Depending on the methods and data sets used, researchers have reached opposite conclusions on the causes of the high fraction of shared introns in orthologous genes from distant eukaryotes. Some studies conclude that shared intron positions reflect, almost entirely, a remarkable evolutionary conservation, whereas others attribute it to parallel gain of introns. To resolve these contradictions, it is crucial to analyze the evolution of introns by using a model that minimally relies on arbitrary assumptions. Results: We developed a probabilistic model of evolution that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we find that parallel gain, on average, accounts for only ~8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform. There are, practically, no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains appear to contribute up to 20% of the shared intron positions. In accord with these findings, we estimated that ancestral introns have a high probability to be retained in extant genomes, and conversely, that a substantial fraction of extant introns have retained their positions since the early stages of eukaryotic evolution. In addition, the density of sites that are available for intron insertion is estimated to be, approximately, one in seven basepairs. Conclusion: We obtained robust estimates of the contribution of parallel gain to the observed sharing of intron positions between eukaryotic species separated by different evolutionary distances. The results indicate that, although the contribution of parallel gains varies across the phylogenetic tree, the high level of intron position sharing is due, primarily, to evolutionary conservation. Accordingly, numerous introns appear to persist in the same position over hundreds of millions of years of evolution. This is compatible with recent observations of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene, suggesting that at least some of the introns are functionally relevant.


Background
The presence of spliceosomal introns and the concurrent splicing machinery are one of the principal distinctive features of eukaryotic genomes [1][2][3]. Indeed, even early branching eukaryotes that were once suspected of being intronless have been shown in recent years to posses at least a few introns [4][5][6][7]. This suggests that the evolution of introns is tightly linked to the central aspects of the evolution of eukaryotes, and it even has been proposed that introns were the driving force behind the emergence of the eukaryotic nucleus and other features of the eukaryotic cell [8,9].
Thanks to their ubiquity and potential major role in the evolution of eukaryotes, intron evolution has drawn considerable attention [1][2][3]10]. It is generally accepted that introns are units of evolution such that their presence/ absence pattern is a result of stochastic processes of loss and gain. The details of these processes, however, remain elusive. Only in recent years, with the accumulation of genomic data, evolution of introns has become amenable to a systematic, genome-wide analysis. However, different attempts to accomplish such a study led to incongruent conclusions regarding the prevalence, rates, and timing of intron loss and gain during the evolution of eukaryotes [11][12][13][14][15][16][17][18]. Apparently, these discrepancies are due, primarily, to incomplete underlying evolutionary models and biased estimation techniques [10].
Recently, we have obtained more conclusive results by developing a comprehensive probabilistic model of intron evolution, and by compiling a data set that is considerably larger than any one previously used [19,20]. The probabilistic models used so far to study intron evolution can be classified into two groups: branch-specific and gene-specific ones. The branch-specific models assume that the processes of intron gain and loss along a branch are determined only by the properties of the branch, regardless of the particular gene in question [17,18,21]. Conversely, gene-specific models assume that these processes are determined only by the particular gene, independent on the branch in question [12]. Obviously, in reality, the characteristics of intron gain and loss processes vary considerably both across genes and across branches. Thus, each of the models employed thus far seem to describe only one facet of a more complex reality. By contrast, our model allows for the variability of intron gain and loss characteristics with respect to both genes and branches such that any of the previously suggested models can be shown to be a special case of this comprehensive model [20]. Moreover, this model is even more realistic in that it also includes rate variability between sites, with respect to both intron gain and intron loss. In order to estimate the model parameters, we devised an expectation-maximization (EM) algorithm that can also be used to reconstruct ancestral states [22]. Combining this algorithm with the profile likelihood technique allows one, in addition to all of the above, to compute confidence intervals of the model parameters.
Here, we describe, in detail, the developed model of intron evolution and the derivation of an improved version of the EM algorithm used to estimate its parameters. Previously, we have applied this comprehensive model to a set of 391 conserved genes from 19 eukaryotic species, to investigate evolutionary trends in gene structure both at the lineage level [20] and at the gene level [19]. The results of this analysis suggest that introns invaded eukaryotic genomes at early stages of eukaryogenesis in a nearly neutral process. At early times, during periods of major transitions in the eukaryotic evolution that led to population bottlenecks, these introns seem to have vastly proliferated in the ancient genomes. Gradually, a considerable fraction of the introns appear to become involved in various cellular functions, mostly, regulation of gene expression [19].
In the present work, we focus on the process of intron gain, and address the causes of the high level of intron position conservation between eukaryotic taxa. It had been already noticed that many intron positions are shared between distant eukaryotic taxa [11,13]. For example, plants and animals share up to 25% of the intron positions [13]. However, these findings can be explained by either remarkable conservation of ancient introns or by parallel, independent, intron gains in the same positions, or (perhaps, most likely) by a combination of both these factors. The previous analyses that have attempted to quantify the relative contributions of evolutionary conservation and parallel gain in intron position sharing have differed widely, with estimates of the extent of parallel gain ranging from nearly 0% to nearly 100% [11][12][13]18,23]. Using our probabilistic model, we developed a rigorous measure for assessing the amount of parallel gain of introns. We found that, overall, parallel gain is responsible for ~8% of the shared intron positions, with the rest due to shared ancestry. However, we also demonstrate substantial heterogeneity in the extent of parallel gain, with almost none in closely related lineages, but up tõ 20% in distant ones, such as plants and unikonts. On the whole, these results support the notion that intron positions are highly conserved during evolution.

Notation
The primary input component in the study of intron evolution consists of G sets of aligned protein-coding sequences of orthologous genes from S species. Each nucleotide in these alignments is substituted by 0 or 1 depending on whether or not an intron is present following the respective position. We allow for missing data by using a third symbol (*) to indicate lack of knowledge about the presence or absence of an intron. Consequently, every site in the alignments, called pattern, is a vector of length S over the alphabet (0,1,*) and is denoted by ω. Let Ω be the total number of unique patterns in the entire set of G alignments, and let n gp be the count of the number of times pattern ω p (p = 1, ..., Ω) is found in the multiple alignment of gene g. Assuming that the sites evolve independently, the set M g = (n g1 , ..., n gΩ ) fully characterizes the multiple alignment of the g th gene.
Let T be a rooted, bifurcating phylogenetic tree with S leaves (terminal nodes), describing the evolutionary relationships between the S species above. The total number of nodes in T is N = 2S -1, and we index them by t = 0,1, ..., N -1, with the convention that zero is the root node. The state of node t is described by the variable q t , which can take the values 0 and 1 (and * in leaves). We use V t for the set of all leaves such that node t is among their ancestors. The entire collection of leaves is, obviously, V 0 . The parent node of t is denoted P(t). We use the notations and for q P(t) and V P(t) , respectively. Analogously, the two direct descendents of the node t are denoted L(t) and R(t), and we use the notations , , , and for q L(t) , q R(t) , V L(t) , and V R(t) , respectively. The branches are indexed by the node into which they are leading, and ∆ t denotes the length (in time units) of the t th branch. Hereinafter we assume that the tree topology, as well as all the branch lengths ∆ 1 , ..., ∆ N-1 , are known.

The probabilistic model
A bifurcating phylogenetic tree can be viewed as a graphical model depicting the probabilistic model We use the notation π i = Pr(q 0 = i) to describe the prior probability of the root, and T ij (g, t) = Pr(q t = j| = i, g) to describe the transition probability for gene g along branch t. In our model, we assume that this transition probability depends on both the gene and the branch, and that it takes the explicit form Here, η g and θ g are non-negative parameters which determine, respectively, the intron gain and loss rates per site for gene g. That is, along branch t the gene's contribution to intron gain and retention probabilities per site is and , respectively. We assume that each branch is characterized by an intrinsic branch-specific intron gain coefficient, ξ t , as well as an intrinsic branch-specific intron loss coefficient, φ t , both of which are bounded, 0 ≤ ξ t , φ t ≤ 1.
In other fields of molecular evolution, it had been long realized that the precision of the analysis significantly improves if one allows for rate variability across sites [24][25][26]. Typically, such rate variability is modeled by introducing a rate variable, r, which scales, for each site, the time units of the phylogenetic tree, ∆ t ← r·∆ t . This rate variable is a random variable that is distributed according to a distribution function with non-negative domain and unit mean, typically, the unit-mean gamma distribution. The rate variability captures rate variations among sites of the same gene. Specifically, there are fast-evolving sites (r >> 1), as well as slow-evolving ones (r << 1). In our model of intron evolution, we extend this idea by assuming that the gain and loss processes are subject to rate variability, independently of each other. Hence, a site can have any combination of gain and loss rates, for example, it can be fast to gain introns but slow to lose them. To implement this approach, we use two independent rate variables, r η and r θ , that are used to scale, for each site, the gene-specific gain rate, η g ← r η ·η g , and the gene-specific loss rate, θ g ← r θ ·θ g , respectively. We further assume that the distributions of these rate variables are independent of the genes, and are explicitly given by Here, Γ(x; λ) is the unit-mean gamma distribution of the variable x with the shape parameter λ, δ(x) is the Dirac delta-function, and ν is the fraction of sites that are assumed to have zero gain rate. The existence of these zero sites reflects the notion that introns cannot be gained at any location within genes, but rather are preferentially inserted at specific locations, contingent on particular sequence motifs known as proto-splice sites [27][28][29], the density of other introns in the neighborhood, the chromatin exposure, and more. According to this interpretation, 1 -ν measures the density of potential intron insertion sites. Importantly, using the same value of ν for the entire tree does not mean that the proto-splice sites are constant throughout the evolution, or are identical for different lineages. It only means that, on the average, the fraction of potential insertion sites, whatever is their concrete nature, is similar across the lineages throughout the course of eukaryotic evolution. The incorporation of such invariant sites in a rate variability model appears natural for intron evolution and has proved beneficial also in other fields of molecular evolution [30][31][32]. By contrast, intron loss does not have an invariant counterpart because the assumption is that, once an intron is gained, it can always be lost. Therefore, the loss rate variable is assumed to be distributed according to a gamma distribution, which is by far the most popular distribution for describing rate variability [24,33].
In practice, the rate distributions in (3) are rendered discrete [34]. We assume that the gain rate variable can take φ t } be the set of parameters that are specific to branch t, and let Ξ = (Ξ 1 , ..., Ξ N-1 ) be the set of all branch-specific parameters. Similarly, let Ψ g = (η g , θ g ) be the set of parameters that are specific for gene g, and let Ψ = (Ψ 1 , ..., Ψ G ) be the set of all gene-specific parameters. Additionally, we denote by Λ = (π 0 , ν, λ η , λ θ ) the "global" parameters that determine the rate variability and the prior probability of an intron absent in the root. When the distinction between the different sets of parameters is irrelevant, we shall use Θ = (Ξ, Ψ, Λ) as the set of all the model's parameters. We achieve further succinctness in notations by denoting the actual gene-specific rate values for particular values and of the rate variables as Ψ kk'g = (η kg , θ k'g ).

The Expectation-Maximization algorithm
We estimate the parameters of the model using maximum likelihood. As the probability model includes observed random variables (state of the tree leaves) as well as hid-den random variables (state of internal nodes in the tree and value of actual rate variables), the expectation-maximization algorithm is a natural tool to use [35]. This is a hill climbing iterative algorithm that requires two steps in each iteration -the expectation (E) step followed by the maximization (M) step. The details of the algorithm are given under Methods.
The total number of parameters in the model is 2G + 4S, where G is the number of genes and S is the number of species. For data sets in the hundreds of genes, this sums up to >1000 parameters. For infinite data, maximum likelihood estimators are known to be nonbiased and efficient. However, as each gene typically goes through a small number of intron-related events during its lifetime, the information content of our data is limited, and cannot straightforwardly support the estimation of such a large number of parameter. To overcome this, we adopt a twophase approach in the data analysis. In the first phase, that we denoted homogeneous evolution, it is assumed that all the genes have identical intron gain and loss rates, formally, that θ g = θ 0 and η g = η 0 for any gene g. In the second phase, denoted heterogeneous evolution, all the global and branch-specific parameters are fixed, which allows for estimation of the gene-specific parameters θ g and η g that can now take different values for different genes. Except for the rate variability within genes, the model of evolution under the homogeneous phase resembles the branch-specific models, and consequently the EM algorithm used in this part has similar structure to the EM algorithm developed by Nguyen et al. [18,36].
Using simulations, we showed that this approach yields highly accurate evolutionary reconstructions: a relative error of 1%, 3%, and 11% in estimating the number of introns in internal nodes, the number of loss events along each branch, and the number of gain events along each branch, respectively [20]. The current analysis is patterncentered rather than gene-centered and thus we have used the results of the homogeneous phase only. While slightly less accurate than the heterogeneous counterparts, the reconstructions after the homogeneous phase are still highly accurate: a relative error of 2%, 4%, and 12% in estimating the number of introns in internal nodes, the number of losses along each branch, and the number of gains along each branch, respectively [20]. Also notable is that, apart from the fraction of zero sites v, the rate variability within genes can be ignored without having any significant effect on the results (Ref. [20] and Additional file 1), indicating that those sites that are capable of gaining introns do so at comparable rates.
The EM algorithm was applied to the set of 391 orthologous genes from 19 eukaryotic species ( [20] and see Methods), under the homogeneous evolution assumption, and the profile likelihood technique was used to estimate the 95% confidence interval for each parameter (Additional file 1). In computing the confidence intervals for a particular parameter, we allow all other parameters to vary. Sometimes, different combinations of parameters yield similar likelihood values, an observation that has already been made formal for simpler models [18]. This occasionally leads to wide single-parameter confidence intervals, especially for the intron gain and loss rates of deep nodes. Importantly, this does not reflect similarly large errors in the ensuing inference, as was demonstrated by an exhaustive simulation study [19,20].

One in 7 nucleotides is a potential intron insertion site
The maximum-likelihood estimate for the fraction of zero sites is ν = 0.862, suggesting a potential intron insertion site every 7 nucleotides. Taking into account the 95% confidence interval for ν, [0.631, 0.928], the density of potential insertion sites is estimated to be 1 site per 3-14 nucleotides. Based on a smaller data set, Nguyen et al. [18] computed a 95% confidence interval of 1 potential insertion site in 9-14 nucleotides. Although this estimate falls within our confidence interval, our results suggest, generally, a denser population of potential insertion sites. The present estimate is also compatible with the high density of intron insertion sites (one site per 9.7 nucleotides) observed for some large protein families, e.g., small G proteins [37]. The problem of estimating the number of zero sites is of fundamental importance in maximum likelihood techniques, so both Csuros [17] and Nguyen et al. [18] developed different heuristic procedure to handle it. We took the alternative and, arguably, more natural approach of integrating this estimate into the model as part of the gain rate variability distribution (see above) such that additional, ad hoc computations are not required.
One should be cautious about identifying potential intron insertion sites with proto-splice sites. More realistically, the density of potential insertion sites reflects the overall average (across species) of the impact of multiple factors such as differential tendencies to be inserted into different proto-splice sites, local densities of pre-existing introns, and degree of chromatin exposure. Furthermore, this density estimation strongly depends on the specific data set as the parameter ν "absorbs" the information on zero sites.
Thus, if intronless genes are added to the data, only this parameter will be heavily affected.
With the total number of sites N 0 = 289,902 our calculations yield 39,962 sites that can gain introns (95% confidence interval 20,891 to 106,901). Accordingly, the 5,755 sites that are actually occupied by introns in the genes analyzed here comprise 14.4% of all sites that can potentially gain introns (95% confidence interval 5.4% to 27.5%).
Thus, even when data from 19 species are combined, the density of the sites actually occupied by introns is still far below the density of sites capable of hosting introns, which emphasizes the inadequacy of any analysis that considers only those sites in which introns are, actually, observed.

The number of shared intron positions is much greater than expected by chance
Numerous intron positions are shared even between taxa that had diverged during the early stages of the eukaryotic evolution [11,13]. To quantify this conservation more precisely and to estimate how surprising it is, compared to the random expectation, we propose the following measure: let N i and N j be the number of intron positions in species i and j, respectively; let S ij be the number of intron positions shared by the two species; and let N be the total number of sites capable of gaining introns. Then, approximately, we would expect that N i N j /N positions will be shared between the two species by chance alone. We define the conservation level as the log-ratio between the observed number and the number expected by chance. Positive values designate that S ij exceeds random expectation, whereas negative values indicate that S ij is below expectation. Even if we take for N its lower 95% confidence interval bound of 20,891 (see the preceding section), almost all pairs show a positive value, namely, a greater than random number of shared positions (Fig. 1). The only exception is the pair S. cerevisiae -P. falciparum (the two most intron-poor species in our set) that do not have even a single shared position.
The number of positions shared by chance between two species is a random variable distributed, approximately, according to a binomial distribution with the probability of success p = N i N j /N 2 , and N experiments. Therefore, it is easy to associate a p-value with any observed number of shared positions, S ij measuring how improbable it is to obtain by chance this value or a greater one. An overall significance level of 0.05 is equivalent to a Bonfferoni-corrected significance level of 0.0003 (overly conservative as we assume that all pairs are independent). The calculations indicate that only 20 species pairs out of 171, all involving the intron-poor S. cerevisiae or one of the apicomplexans, had a number of shared positions that was indistinguishable from the random expectation; all other pairs had a significant excess of shared intron positions (Additional file 2).

Twelve out of thirteen shared intron positions reflect common ancestry
The significant excess of shared intron positions over the random expectation can be explained by one of two factors or a combination thereof. The first explanation is that this observation reflects genuine evolutionary conservation; the implication is that, once an intron is gained, it is hard to lose, perhaps, due to functional importance of introns or because the deletion event itself is destructive. The second explanation is that different lineages gain introns at the same position, independently. The chance of such parallel gain is not necessarily as low as it seems at first sight because introns are, apparently, preferentially inserted into proto-splice sites (see above). Accordingly, regions of high sequence conservation might gain introns at exactly the same position.
The two explanations have opposing impacts on our understanding of the evolution of eukaryotic genes. If introns are persistent, many must have been gained early in the eukaryotic evolution, and the later evolution involved multiple losses [16,38]. By contrast, if parallel gain is the dominant mechanism, many of the extant introns are evolutionarily young, and the eukaryotic evolution involves, primarily, multiple introns gains [12]. In order to estimate the extent of parallel gain, we examined all the sites that host introns in at least two species. Assuming that these sites are not invariant, the probability was estimated that the corresponding pattern had arisen from parallel gain, by computing the probability of the last common ancestor of the intron-bearing species to lack an intron. This approach assumes that, if the last common ancestor had an intron in a particular position, no parallel gain occurred, i.e., highly unlikely scenarios involving at least two gains and one loss at the same site are neglected. The probabilities of parallel gain for all the patterns observed in our data set are summarized in Additional file 3. Overall, the data set harbors 3176 sites with shared intron positions (741 unique patterns) out of which 317 positions (10%) are expected to result from parallel intron gain. As inferences involving the root of the tree are prone to significantly elevated standard errors [20], the calculation was repeated using only patterns that do not have the root as the last common ancestor of all intronbearing species. This calculation yielded 2913 sites with shared intron positions (568 unique patterns) of which 229 (~7.9%) are expected to be due to parallel intron gain. Each of these calculations provides a rough estimate of the level of errors introduced into some of the recent studies that explicitly excluded the possibility of parallel gain [13,16,21]. Importantly, however, the contributions of parallel gains to the emergence of different patterns of intron sharing are widely different; in particular, some rare patterns are explained (almost) entirely by parallel gain and do not reflect evolutionary conservation (Additional file 3). For example, for the single site that harbors introns only in humans and N. crassa, the probability is >0.99 that it results from parallel gain. Considering somewhat more frequent patterns, 11 sites harbor introns in C. intestinalis, A. thaliana and O. saliva, with the probability of parallel gain ~0.875, and another 12 sites harbor introns only in C. elegans and C. intestinalis, with the probability of parallel gain ~0.8.

Conservation of intron positions between eukaryotic species
Generally, the distribution of parallel gains in comparisons of specific clades is, obviously, more informative than overall counting (Table 1). To obtain this information, for each internal node t (excluding the root of the tree), all patterns that have 1s in the two sub-clades stemming from t, and were tallied, and the probability that t is in state zero was computed. By this analysis, in which 728 unique patterns were included, we found that nearly 20% of the shared intron positions between plants and unikonts, thought to have diverged more than a billion and a half years ago [39], are due to parallel gain. In fungi and metazoa, diverged ~1.4 billion years ago, >10% of the shared positions are estimated to derive from parallel gains. In contrast, many recently diverged clades show almost no parallel gain (e.g., humans versus rodents, birds versus mammals, Aspergillus versus Neurospora, and flies versus mosquitoes). Table 2 lists all patterns for which the estimated contribution to parallel gain was greater than two sites. The estimated total number of sites with parallel gain is 248. Out of the 728 unique patterns,  Given that most of the other studies report overall estimates on a smaller (8 species) data set [13], a direct comparison of their results with the present ones is not feasible due to the apparent non-uniformity in the extent of parallel gain in different parts of the tree. Nevertheless, an overall parallel gain of ~10% has been computed also for the smaller data set. Rogozin et al. [13] used simulations to assess the extent of parallel gain and showed that, under various assumptions regarding the density of protosplice sites, parallel gains could be responsible to ~2-40% of the shared intron positions. Accounting more accurately for the likely density of proto-splice sites, Sverdlov et al. estimated the contribution of parallel gains to the observed sharing of intron positions to be in the range of 5-10% [23]. An even higher estimate, 18.5% parallel gains, was obtained by Nguyen et al. using a branch-specific maximum-likelihood model [18].

Intron retention
We conclude, therefore, that a substantial majority of the shared intron positions are due to evolutionary conservation, hence intron positions tend to be retained for long times. To further validate this conclusion, we explicitly computed the probability of an intron to survive along given paths in the phylogenetic tree. To this end, let B denote the set of branches that comprise a path in the tree. Then, intron retention probability along this path is where is the retention probability along branch t, and is the probability of being at loss-rate category k'.
In accord with the above conclusions on the high level of evolutionary conservation, this probability is, typically, high, even for very long paths in the phylogenetic tree (Table 3). For example, an intron present in the last common ancestor of the metazoa has a probability of 0.83 to be retained in humans whereas an intron present in the last common ancestor of multicellular life (AME) has a probability of 0.57 to be retained in extant plants.
A complementary approach involves the computation of the probability of extant introns to be of ancient origin. To calculate this probability, we assumed that a site is known to host an intron in one species, and that no information is available on this site in other species (that is, their state is *). Then, we used our model to compute the probability that this intron was present in any of the ancestors of that species (Table 4 and Fig. 2). Clearly, these probabilities decay quite slowly with evolutionary time. For instance,  Species and lineages abbreviations are as in Figure 1.
an intron in C. elegans, H. sapiens and D. melanogaster has a probability of 0.44, 0.69, and 0.68, respectively, to have been present in the last common ancestor of metazoa.

Conclusion
Despite the extensive attention given to the evolution of eukaryotic gene structure over the last three decades, the fundamental characteristics of this process remain controversial. In particular, depending on the methods and data sets used, different researchers have reached opposite conclusions on the causes of the remarkably high fraction of shared introns in orthologous genes from distant eukaryotic species. Some attribute it (almost) entirely to a remarkable evolutionary conservation of intron positions and others, largely, to parallel gain of introns. To resolve these contradictions, it is important to analyze the evolu-tion of introns by using a probabilistic model that minimally relies on arbitrary assumptions. To this end, we developed a model that allows for variability of intron gain and loss rates over branches of the phylogenetic tree, individual genes, and individual sites. Applying this model to an extended set of conserved eukaryotic genes, we found that parallel gain, on average, accounts for onlỹ 8% of the shared intron positions. However, the distribution of parallel gains over the phylogenetic tree of eukaryotes is highly non-uniform such that there are, practically, no parallel gains in closely related lineages, whereas for distant lineages, such as animals and plants, parallel gains might have contributed up to 20% of the shared intron positions. Given the distinctly non-uniform distribution of the inferred gain events over the phylogenetic tree of eukaryotes [20], most of the recently diverged Species and lineage abbreviations are as in Fig. 1. lineages have amassed very few gains during their separate evolution; by contrast, deeply diverged lineages, such as animals and plants appear to have gone through independent stages of extensive intron gain, which would explain the greater share of parallel gains.
We estimated that ancestral introns have a high probability to be retained in extant genomes, and conversely, many of the extant introns are ancient. The reasons for this remarkable endurance of a substantial fraction of the introns are not clear. One possibility is mechanistic, i.e., removing existing introns might be imprecise and hence potentially deleterious. Another possibility is functional, i.e., introns might have acquired many functional roles since entering eukaryotic genomes. The latter possibility is compatible with the recent observation of a negative correlation between the rate of intron gain and coding sequence evolution rate of a gene [19] which suggests that at least some of the introns are functionally relevant.

The data set
The methods and criteria used in compiling the data set have been described previously [19,20] The probability of an intron in extant species to be present in ancient ancestors Figure 2 The probability of an intron in extant species to be present in ancient ancestors. a) an intron in D. melanogaster, b) an intron in H. sapiens, c) an intron in C. neoformans. AME stands for the last common ancestor of multicellular eukaryotes. Phylogenetic tree topology Throughout this paper we assumed the traditional "crown-group" tree topology (Additional file 4). Specifically, the root position is between the Apicomplexa and the common ancestor of multicellular eukaryotes (plants and animals) [40], as opposed to the alternative Unikont-Bikont division [41]. We furthermore assume the Coelomata topology (Deuterostomia and insects are grouped together to the exclusion of nematodes) [42,43] as opposed to the Ecdysozoa topology (insects and nematodes are grouped together to the exclusion of Deuterostomia) [44,45]. The results, however, are not sensitive to the exact tree topology, as explicitly shown elsewhere [19,20].

The EM Algorithm
For each site, the S leaves form a set of observed random variables, their states being described by the corresponding pattern ω p . The states of all the internal nodes, denoted σ, form a set of hidden random variables, that is, random variables whose states are not observed. In order to account for rate variability across sites, we associate with each pattern two hidden random variables, and , that determine the value of the rate variables in that site.
To sum up, the observed random variables are ω p , and the hidden random variables are (σ, , ).
We assume that sites within a gene, as well as the genes themselves, evolve independently. Therefore, the total likelihood can be decomposed as and so According to the well-known EM paradigm [35], log L(M 1 , ..., M G |Θ) is guaranteed to increase as long as we maximize the auxiliary function where If we replace the formal summing over all states of and in (6) by a direct sum, we get Using our notational conventions, we can write the first term in (7) as and the second term as Substituting (8) and (9) back into (7) gives Denoting by w gpkk' and Q gpkk' the first and second square brackets, respectively, this expression becomes And, consequently.
The E-step In this step, the function Q(Θ, Θ 0 ) or, equivalently, the set of coefficients w gpkk' and Q gpkk' is computed by using inward-outward recursion on the tree.

The inward (γ) recursion
Here we suggest a variation on the well-known Felsenstein's pruning algorithm [46]. Let us associate with each node t (except for the root) a vector . This is the probability of observing the nodes V t (which are a subset of the pattern ω p ) for a gene g, when the gain and loss rate variables are and , respectively, and the parent node of t is known to be in state i. By definition, this function is initialized at all leaves (t ∈ V 0 ) by Here and in the derivations to follow, we omit the superscript from γ. For all internal nodes (except for the root), γ is computed using the recursion where is defined as γ j (L(t))γ j (R(t)). This is easy to see, as The first term is, simply, the definition of T ij (g, t). Given q t , is independent of , thus the second term is just Pr( | q t = j) = γ j (t L ) . By similar argument, the third term is simply Pr( | q t = j) = γ j (t R ). Substituting these results in (14), we recover the recursion formula (13).
The γ-recursion allows for computing the likelihood of any observed pattern ω p , given the values of the rate variables: Given q 0 , is independent of , and so and One of the useful features of this recursion is that is allows to treat missing data fairly easily. Only a single option has to be added to the initialization phase (12),

The outward (α-β) recursion
Once the γ-recursion is computed, we can use it to com- ' '  To prove this recursion, let us start with the definition of α, and make the decomposition V 0 = V t + , with being the set of all leaves such that node t is not among their ancestors. But, given , the state of node t is independent on , and therefore (19) becomes From Bayes formula, But, given q t , V t is independent of P(t) and therefore Combining (22) and (21) in (20), we get which is just another form of (18). Finally, for each leaf that is not a descendant of the root When missing data are present, two simple modifications are required. First, we have to add to the initialization phase (17) an option Second, we have to add to the finalization phase (23) an option These inward-outward recursions are the phylogenetic equivalent of the backward-forward recursions known from hidden Markov models, and other versions of this method have been developed previously [47,48]. The version developed here can be shown to be the realization of the junction tree algorithm [49] on rooted bifurcating trees. The junction tree algorithm is a scheme to compute marginal probabilities of maximal cliques on graphs by means of belief propagation on a modified junction tree.
Indeed, the matrix α computes marginal probabilities of pairs (t, P(t)), but such pairs are nothing but maximal cliques on rooted bifurcating trees.

Computing the coefficients w gpkk'
Here we show that the γ-recursion is sufficient to compute  Here we show that those coefficients require the α, βrecursion. By definition, The probability Pr(ω p , σ | Ξ, Ψ gkk' ) is the likelihood of a particular realization of the tree, thus from (1) Here, δ(a, b) is the Kronecker delta function, which is 1 for Hence, Q gpkk' is given by The M-step Substituting (26) in (11), we obtain an explicit form of the function whose maximization guarantees stepping up-hill in the likelihood landscape, Actually, any increase in Q is sufficient to guarantee an increase in the likelihood, suggesting that the precise maximization of Q is not particularly important. Therefore, we speed up the computations by performing low-tolerance maximization with respect to each of the parameters individually. Except for the parameters λ η and λ θ , it is easy to differentiate Q twice with respect to any parameter. This lends itself to using simple zero-finding algorithms of which we chose the Newton-Raphson algorithm [50].
Maximizing Q with respect to the shape parameters λ η and λ θ is more involved, as Q depends on these parameters only through the discrete approximation of the rate variability distributions (3). In our implementation, we used Yang's quantile method [34] to compute the discrete levels of the gamma distributions such that each level has equal probability. Formally, for k = 2, ..., K η and for k = 1, ..., K θ . To perform the maximization in this case, we used Brent's maximization algorithm that does not require derivatives [50].

Reconstruction of Ancestral States and Events
Given the α, β, γ-recursions on the tree with the final model parameters, it is straightforward to reconstruct the history of intron evolution, and to assign gains and losses to specific branches. The number of introns in an internal node t for a gene g, assuming gain rate variable and loss rate variable , given that the observed pattern is ω p , is . Similarly, the number of loss events

Confidence Intervals
In order to obtain confidence intervals on the model parameters, we used the profile likelihood technique. In brief, if ϑ is one parameter in the model, and is the set of remaining parameters, then the profile likelihood of ϑ is defined as That is, we compute the maximum likelihood under the constraint that the value of ϑ is given. If we denote the overall maximum likelihood by L(Θ) = max Θ Pr(Θ), then the likelihood ratio under the hypothesis that ϑ = ϑ 0 is distributed according to the χ 2 distribution with one degree of freedom. In order to find the 95% confidence interval of the parameter ϑ around its optimal value ϑ 0 , we find the value of ϑ for which the likelihood ratio (28) exceeds the value 3.84 (95%-percentile of the χ 2 (1) distribution). This value is found numerically using Ridder's method [50].

Authors' contributions
LC contributed to the development of the probabilistic model, developed the EM algorithm, and drafted the manuscript; IBR collected the data and contributed to the development of the probabilistic model; YIW contributed to the development of the probabilistic model; EVK conceived of the study, provided the biological interpretation of the results, and finalized the manuscript. All authors read and approved the final manuscript.