Modeling the evolution dynamics of exonintron structure with a general random fragmentation process
 Liya Wang^{1}Email author and
 Lincoln D Stein^{1, 2, 3}Email author
DOI: 10.1186/147121481357
© Wang and Stein; licensee BioMed Central Ltd. 2013
Received: 20 November 2012
Accepted: 22 February 2013
Published: 28 February 2013
Abstract
Background
Most eukaryotic genes are interrupted by spliceosomal introns. The evolution of exonintron structure remains mysterious despite rapid advance in genome sequencing technique. In this work, a novel approach is taken based on the assumptions that the evolution of exonintron structure is a stochastic process, and that the characteristics of this process can be understood by examining its historical outcome, the presentday size distribution of internal translated exons (exon). Through the combination of simulation and modeling the size distribution of exons in different species, we propose a general random fragmentation process (GRFP) to characterize the evolution dynamics of exonintron structure. This model accurately predicts the probability that an exon will be split by a new intron and the distribution of novel insertions along the length of the exon.
Results
As the first observation from this model, we show that the chance for an exon to obtain an intron is proportional to its size to the 3rd power. We also show that such size dependence is nearly constant across gene, with the exception of the exons adjacent to the 5^{′} UTR. As the second conclusion from the model, we show that intron insertion loci follow a normal distribution with a mean of 0.5 (center of the exon) and a standard deviation of 0.11. Finally, we show that intron insertions within a gene are independent of each other for vertebrates, but are more negatively correlated for nonvertebrate. We use simulation to demonstrate that the negative correlation might result from significant intron loss during evolution, which could be explained by selection against multiintron genes in these organisms.
Conclusions
The GRFP model suggests that intron gain is dynamic with a higher chance for longer exons; introns are inserted into exons randomly with the highest probability at the center of the exon. GRFP estimates that there are 78 introns in every 10 kb coding sequences for vertebrate genomes, agreeing with empirical observations. GRFP also estimates that there are significant intron losses in the evolution of nonvertebrate genomes, with extreme cases of around 57% intron loss in Drosophila melanogaster, 28% in Caenorhabditis elegans, and 24% in Oryza sativa.
Keywords
Evolution of exonintron structure General random fragmentation process SimulationBackground
Most eukaryotic genes contain spliceosomal introns, which are removed from mRNA after transcription by the RNA splicing apparatus. The biological origins of introns are uncertain. Since the discovery of introns, there has been significant debate as to whether introns in modernday organisms were inherited from a common, ancient ancestor, the intronearly hypothesis [1–3], or whether they appeared in genes more recently in the evolutionary process, the intronlate hypothesis [4, 5], or indeed whether they result from a mixed model [6, 7]. The mixed model suggests that most introns were gained very early in the evolution of eukaryotic genes, followed by intron loss/gain during the course of eukaryotic diversification. The details of such processes, however, remain elusive.
One way to understand the process is to examine the size distribution of internal translated exons, referring to exons that are fully translated and referred to as itexon in [8]. However, to avoid introducing an unfamiliar term, we will simply refer to itexons as “exon” within this communication unless specified. Although the distribution of exons is just a snapshot of the present day world, fitting it to a model based on well characterized mathematical functions may provide insights into the evolution of exonintron structure. In a previous study, Gudlaugsdottir et al. [9] have suggested that the size distribution of exons can be approximated with a combination of Weibull [10] and exponential distributions. The Weibull distribution is a particular case of the generalized extreme value distribution. It is widely used in survival analysis, describing the size of particles generated by grinding, milling and crushing operations, etc. The exponential distribution can be used to describe the length of intervals between uniformly distributed points. Therefore, GJudlaugsdottir et al. hypothesized that the exponential distribution is the outcome of random insertion of introns (intronlate). However, they then related the Weibull distribution to the intronearly theory without providing a stochastic model that explains the observed distribution.
In later work, Ryabov and Gribskov [11] showed that a combination of two lognormal distributions gives the best fit quality to the size distribution of exons. The lognormal distribution could result from a random Kolmogoroff fractioning process [12], which assumes that the chance of fragmentation is independent of exon size. Inserting an intron into an exon is equivalent to fragmentation (splitting) of the exon. Therefore, they hypothesized that the process of intron insertion is independent of exon size.
On the other hand, Tenchov and Yanev [13] demonstrated that the Weibull distribution could result from a uniform random fragmentation process. Here, “uniform” means that the chance of fragmentation is linearly proportional to the size of the particle (or exon). Under certain conditions, the resulted Weibull distribution is indistinguishable from lognormal distribution. Hence they concluded that the model of random fragmentation could not be inferred based on the basis of fit quality. Therefore, the hypothesis that intron insertion is independent of exon size is debatable.
One assumption made in the exon size based approaches [9, 11] is that introns are inserted into exon randomly. The notion of random insertion of intron has also been proposed based on the analysis of intron distribution in ancient paralogs [14]. Others have argued that there exist certain favored sites for intron insertion  the socalled protosplice sites [4, 15, 16]. Unfortunately, none of the sizebased approaches provide evidence to support the assumption of random intron insertion.
In this work, we aim to revisit these competing hypotheses by addressing the following open questions: Do longer exons have an increased chance of gaining a new intron? For intron gain events, will the intron be inserted into exon randomly or at some protosplice sites? Is there an intron gain/loss bias? Are intron insertion events independent of each other? Is there a common mechanism to explain intron gain/loss in different species? In order to answer these and other related questions, we propose a General Random Fragmentation Process (GRFP) to characterize the evolution dynamics of exonintron structures. The parameters of GRFP are determined by combining simulation and analysis of real genomic data.
Methods
GRFP model
The model of GRFP is motivated by generalizing both Kolmogoroff fractioning process and the uniform random fragmentation process. In GRFP, the probability for an exon to split (gaining an intron) is assumed to be exponentially proportional to the length of the kth exon (L_{ k }) as L_{ k }^{ a }. Under such a generalization, the Kolmogoroff fractioning process, in which insertion events are independent of exon length, is a particular case of GRFP with α = 0, while the uniform random fragmentation process, in which insertions are linearly proportional to exon length, is another special case with α = 1. The generalization not only allows GRFP to model either Kolmogoroff or uniform fragmentation process but also allows it to model the fragmentation process of exons (intron gain) with varying α. In the results section, we will use the empirical size distribution of exons to determine the value of α. GRFP also assumes that introns insert into exons randomly and independently, and these assumptions are confirmed by the analysis of real genome data.
 1.Given a set of n exons, the opportunity for kth exon to acquire a new intron is proportional to its size to the αth power:$\phantom{\rule{1em}{0ex}}{P}_{s}\left(K\right)={L}_{k}^{a}/{\displaystyle \sum _{k=1}^{n}{{\displaystyle L}}_{k}^{a}}$(1)
 2.Within kth exon, the new intron insertion loci follow a normal distribution:$\phantom{\rule{1em}{0ex}}{P}_{I}\left(x\right)~{L}_{k}*N\left({\mu}_{I},{\sigma}_{I}\right),x\in \left(\begin{array}{ll}0,\hfill & {L}_{k}\hfill \end{array}\right)$(2)
 3.
Intron gains are independent of each other.
Where Р_{ Ѕ }(K) denotes the probability for kth exon to obtain an intron; Р_{ I }(x), probability of inserting an intron after xth position within kth exon; L_{ k }, length of the exon k; α, dependency value; μ_{ I } and σ_{ I }, mean and standard deviation of the distribution of insertion loci. The model of GRFP has three unknown parameters to be determined, α, μ_{ I } and σ_{ I }.
Simulation testing
We start each simulation with a long exon. The diagram in Figure 1 shows how the splitting of a long exon during evolution is simulated. The diagram shows intron gains in the following order. After inserting the first intron, Ρ_{ Ѕ }(KL_{k=1,2}) denotes the picking probability between L_{1} and L_{2}; Assuming L_{1} is selected and split by a new intron; the next exon to be split will be chosen from L_{3}, L_{4} and L_{2} with probability Ρ_{ Ѕ }(KL_{k=2,3,4}); Assuming L_{2} is selected, and so on.
For simulations, sequence of pseudorandom number is obtained using the Mersenne Twister algorithm [17] implemented in standard MATLAB 7.12. In this study, all of the simulations start with one single exon. This simplifies the simulation, but it does not imply that the evolution of eukaryotic genes always starts with one long exon. Under such simplification, two more parameters are the initial size of starting exon (L_{0}) and the number of splitting (m). For a given species, we can estimate them from the annotated gene sets.
 1.Mean and standard deviation of the size distribution by fitting it with lognormal distribution (equation (3)) or Weibull distribution (equation (4)):$\phantom{\rule{1em}{0ex}}\phantom{\rule{0.5em}{0ex}}\mathit{dN}=\left(N/\left({\sigma}_{E}\sqrt{2\pi}\right){e}^{{\left(\left(1\mathit{nE}{\mu}_{E}\right)/\sqrt{2{\sigma}_{E}}\right)}^{2}}\right)d\phantom{\rule{0.25em}{0ex}}\text{ln}E$(3)$\phantom{\rule{1em}{0ex}}\phantom{\rule{0.25em}{0ex}}\mathit{dN}=\left({\mathit{Nkz}}^{kI}{e}^{zk}/\lambda \right)d\phantom{\rule{0.25em}{0ex}}\text{ln}E$(4)
Where z=E/λ, E is exon length, dN the number of exons in a bin (bin size is 0.1 unless specified), N the amplitudes of the peak; k shape parameter, and λ scale parameter of the Weibull distribution; μ_{ E } the mean position, and σ_{ E } the standard deviation of the lognormal distribution. These and subsequent fittings in this study are performed using the nonlinear TrustRegionReflective curvefitting algorithm [18, 19] implemented in MATLAB 7.12. Simulation demonstrates that σ_{ E } is primarily determined by the choice of α (in equation (1)).
 2.Mean and standard deviation of the insertion ratio defined below:$\phantom{\rule{1em}{0ex}}{x}_{i}={L}_{i}/\left({L}_{i}+{L}_{i+1}\right)$(5)
Where L_{ i } and L_{ i+1 } are the length of two adjacent fragments (exons). This is an indirect estimation of insertion loci (x in equation (2)). Both simulation and real genome data indicate that x follows a normal distribution (with a standard deviation σ_{ x }).
 3.Correlation between x _{ i } and x _{ j } defined by equation (6):$\phantom{\rule{1em}{0ex}}\rho \left(i,j\right)=\frac{{\sigma}_{{x}_{i}+{x}_{j}}^{2}{\sigma}_{{x}_{i}}^{2}{\sigma}_{{x}_{j}}^{2}}{2{\sigma}_{{x}_{i}}{\sigma}_{{x}_{j}}}$(6)
Where σ_{ x } is estimated from fitting the histograms of ratio x with a normal distribution. In theory, x_{ i } + x_{ j } is still normally distributed, and the mean value is the sum of the means. However, the variances are not additive if x_{ i } and x_{ j } are correlated. We can estimate the relationship between x_{ i } and x_{ j } with equation (6).
In the first experiment, we examine the relationship between GRFP parameters and the size distribution of the simulated fragments. With fixed μ_{ I }, σ_{ I }, initial size of starting exon (L_{0}), and the number of splitting (m), one long exon is fragmented with different choices of α. μ_{ E } and σ_{ E } are estimated through fitting a lognormal distribution to the size distribution of the resulted fragments. The correlation between μ_{ E }, σ_{ E } and α is examined. Then, with fixed μ_{ I }, σ_{ I }, and α, the relationship between μ_{ E }, σ_{ E } and initial size of starting exon (L_{0}), the number of splitting (m) is examined through similar simulations.
In the second experiment, we examine the relationship between real σ_{ I } (in equation (2)) and estimated σ_{ x } (from equation (5)). By fragmenting a long exon, we construct a binary tree to track the splitting process. We classify the adjacent fragments pair (the order is maintained during fragmentation) into four groups based on whether they have the same parent nodes, or if not same parents, comparing their depths. The size distribution of each group and the mixture (equation (5)) is examined. With fixed μ_{ I }, L_{0}, m, and α, the correlation between σ_{ I } and σ_{ x } is examined by simulations with different choices of σ_{ I }. Then, by coupling with empirical observations, we use ExpectationMaximization (EM) iteration to determine the value of α and σ_{ I }.
In the third experiment, we examined the effects of intron loss on the statistical characters of resulted fragments. By introducing various percentages of intron loss after intron gain, we evaluate how σ_{ E }, σ_{ x }, and ρ(i,j) of the resulted fragments are changed.
Empirical data analysis
In this study, we obtained the cDNA sequences of 14 species (Homo sapiens (GRCh37.p8), Mus musculus (GRCm38), Rattus norvegicus (RGSC3.4), Danio rerio (Zv9), Caenorhabditis elegans (WBcel215), Drosophila melanogaster (BDGP5), Bos taurus (UMD3.1), Pan troglodytes (CHIMP2.1.4), Gallus gallus (WASHUC2), Sus scrofa (Sscrofa10.2), Arabidopsis thaliana (TAIR10), Oryza sativa (MSU6), Sorghum bicolor (Sorbi1), Zea mays (AGPv2)) from Ensembl and plant Ensembl database [20]. To ensure the quality of the data, we only use the cDNA sequences of protein coding genes with both RefSeq mRNA ID and the known status of both gene and transcript. To examine the size distribution of exons, we extracted the genomic positions of the exons from cDNA sequences to compute exon sizes. We also extracted the genomic positions of the 5′ and 3′ UTRs and used them to identify internal translated exons.
For testing the first assumption of GRFP, we fitted both Weibull and normal distribution to the size distribution of vertebrate exons (logarithm scale). We also grouped exons by positions for testing position bias of intron gain/loss. For the second assumption, we fitted a normal distribution to x (equation (5)). For the third assumption, we computed ρ(i,j) for exon pairs at ith and jth position in all protein coding genes. Finally, we examined the differences in the parameters fitted to vertebrate and nonvertebrate species.
Results
Empirical data analysis
Statistical counts of empirical data
Statistical counts of coding genes, splitting (number of exons minus one) and total CDS length (b.p.)
Number of coding genes  Total CDS length (10^{7})  Number of splitting (m,10^{5})  Estimated splitting (m_{e}, 10^{5})  $\frac{\mathbf{m}{\mathbf{m}}_{\mathbf{e}}}{{\mathbf{m}}_{\mathbf{e}}}$  

H. sapiens  17275  2.443  1.827  1.901   3.9% 
M. musculus  16319  2.276  1.705  1.768   2.7% 
R. norvegicus  17354  2.193  1.722  1.703  1.0% 
D. rerio  15068  1.932  1.462  1.501   2.7% 
G. gallus  5416  0.655  0.537  0.509  5.4% 
P. troglodytes  12508  1.694  1.295  1.316   1.6% 
B. taurus  11948  1.413  1.142  1.099  4.0% 
S. scrofa  5498  0.524  0.433  0.408  6.1% 
C. elegans  17684  1.833  1.024  1.425   28.4% 
D. melangaster  8063  1.141  0.383  0.886   57.1% 
A. thaliana  16547  1.501  1.083  1.167   7.2% 
O. sativa  23566  2.255  1.329  1.749   24.0% 
S. bicolor  17769  1.445  1.041  1.123   7.3% 
Z. mays  15887  1.320  0.987  1.025   3.7% 
In this study, we ignored noninternal translated exons considering the rate of indels (a type of mutations affecting exon size distribution) is significantly lower in the coding region than the noncoding region [21]. It is true that introns can be inserted anywhere, including noncoding exons or even another intron, but their size distribution is confounded by the appearance of more frequent indels.
Size distribution of exons
Fitted parameters for size distributions of observed vertebrate exons
Weibull  Normal  

λ  κ  μ _{ E }  σ _{ E }  
H. sapiens  2.81  6.98  4.81  0.432 
M. musculus  2.81  7.01  4.82  0.431 
R. norvegicus  2.80  6.89  4.81  0.437 
D. rerio  2.79  6.87  4.79  0.437 
G. gallus  2.80  6.80  4.81  0.442 
P. troglodytes  2.81  6.87  4.81  0.440 
B. taurus  2.80  6.69  4.80  0.449 
S. scrofa  2.82  6.44  4.81  0.472 
These observations suggest that the size distribution of vertebrate exons could be properly fit with either Weibull or normal distribution. The Weibull distribution gives a better fit to both left and right tails (e.g., Additional file 1: Figure S2) because the distribution is skewed to the left. For numerical simulations, we will show that similar size distribution of fragments will be generated based on GRFP model.
Distribution of insertion ratio
Fitted parameters for distribution of insertion ratios from empirical data
μ _{ x }  σ _{ x }  

H. sapiens  0.501  0.132 
M. musculus  0.501  0.132 
R. norvegicus  0.501  0.135 
D. rerio  0.501  0.132 
G. gallus  0.501  0.132 
P. troglodytes  0.502  0.134 
B. taurus  0.502  0.136 
S. scrofa  0.502  0.142 
C. elegans  0.499  0.185 
D. melangaster  0.502  0.215 
A. thaliana  0.501  0.152 
O. sativa  0.502  0.226 
S. bicolor  0.501  0.157 
Z. mays  0.501  0.152 
Another interesting observation in Figure 4 is the sharp spike at 0.5, which suggests that there are excessive adjacent exons pairs with the same length. This is consistent with the observation of tandem exon duplication [22]. Because the spikes are located right on the center, mathematically such deviation has little effect on the fitting of the histogram.
The normal function fitted in Figure 4 describes where introns get inserted into an exon. To assess whether it is consistent or against the hypothesis of protospliced sites, we calculate the position distribution of four possible protosplice sites (tested in [23, 24]) within human coding sequences, and the results are shown in Additional file 1: Figure S3. The position for each of the four sites (GG, AGG, AGGT and (C/A)AG(A/G)) is calculated by dividing the distance between the intron starting site and start codon by the length of the coding sequence. All coordinates are extracted from Ensembl annotation of H. sapiens. The symbol “” stands for the intron position and “/” indicates wo alternative states of one nucleotide site. Additional file 1: Figure S3 shows that these protospliced sites are distributed nearly uniformly within CDS. If introns strongly prefer to be inserted into these sites, the insertion ratio should follow a uniform distribution instead of normal distribution as we observed. Therefore, the analysis here does not favor the protosplice site hypothesis.
Correlation between insertion ratios
The key observation in Figure 5 is that nonadjacent insertion ratios are nearly uncorrelated for vertebrate genomes. However, the correlation between intron insertion ratios has quite different patterns for nonvertebrates. Their insertion ratios are more negatively correlated with each other than those for vertebrates, especially for C. elegans, D. melanogaster, and O. sativa.
In summary, analysis of empirical data reveals three significant differences between vertebrate and nonvertebrate genomes. First, a mixture of two normal functions gives a better fit to the size distribution of nonvertebrate exons, instead of one normal function for that of vertebrate exons; Second, the insertion ratio of nonvertebrates also follows a normal distribution but with larger standard deviation than that of vertebrates; Third, the insertion ratios of nonvertebrates are more negatively correlated than that of vertebrates.
Simulation testing
Default values for L_{0}, m, σ_{ I }, and μ_{ I }
These values are determined through an EM iteration process that will be discussed in the simulation testing section. The EM iteration uses observed values of σ_{ x } and μ_{ x } for vertebrates (Table 3). The simulation described below shows that σ_{ x } overestimates but is linearly proportional to σ_{ I }, while μ_{ x } approximates μ_{ I } extremely well.
Relationship between α, L_{0}, m and σ_{ E }, μ_{ E }
Fitted parameters for size distribution of simulated exons
Weibull  Normal  

λ  κ  μ _{ E }  σ _{ E }  
α = 0.3  4.58  3.95  4.24  1.216 
α = 1  2.88  4.42  4.72  0.688 
α = 3  2.93  7.25  4.85  0.430 
From equation (9), we estimate that α ≈ 3 gives the observed σ_{ E } ≈ 0.43 (Table 2). This suggests the chance of intron gain is proportional to the exon length to the 3^{rd} power, which disagrees with the independency hypothesis of earlier work [11].
Parameterizing GRFP via EM iteration
In the previous simulation studies, with the assumption of known σ_{ I }, we have shown that σ_{ E } is dependent on α but not on L_{0} and m, which suggests that the value of α can be estimated from σ_{ E }. However, simulations show that σ_{ E } is also dependent on σ_{ I }. To derive the values of α and σ_{ I } simultaneously without assuming knowing any one of them, here we determine their values through EM iterations, by combining simulations with empirically observed σ_{ I } (Table 2), μ_{ x }, and σ_{ x } (Table 3) for vertebrate genomes.
Before performing EM iteration, we need to quantify how σ_{ I } is related to σ_{ x }. We performed a series of GRFP simulations with σ_{ I } ranging from 0.06 to 0.18 and α = 3. For each simulation, we calculate the insertion ratio (equation (5)) from the resulted fragments, and estimated σ_{ x } from fitting a normal function to the histogram of insertion ratios. σ_{ x } is plotted against given σ_{ I } in Additional file 1: Figure S5A. The plot shows that there is a linear relationship between the two. From this relationship, we estimate that real σ_{ I } is closer to 0.11 than the 0.13 estimated from Figure 4. To see the over estimation of σ_{ I }, we show the simulation process and results in Additional file 1: Figure S6 and Additional file 1: Figure S7. By classifying adjacent exon pairs into four different groups, we show that the mixture of these four groups still follows a normal distribution but with larger σ_{ x }.
 1.
Given σ _{ I }, determine the relationship between α and σ using simulation (Figure 6A)
 2.
 3.
With α, determine the relationship between σ _{ x } and σ _{ I } (Additional file 1: Figure S5A)
 4.
With the relationship and σ _{ x } = 0.13, estimate σ _{ I }
 5.
Return to step (1), iterate until convergence
The results of the EM iteration are shown in Additional file 1: Figure S5B and Additional file 1: Figure S5C. At the end, σ_{ I } converges to 0.11; α converges to approximately 3. Again, α ≈ 3 suggests that, during evolution, longer exon has much higher chance to gain an intron than the shorter one, with a probability proportional to its size to the 3rd power.
Intron losses accounting for increasing σ_{ I }, σ_{ E } and more negative ρ(i, j)
In Table 3 and Figure 4, we show that σ_{ x } of nonvertebrates is significantly larger than those of vertebrate genomes. In Figure 5, we also observed more negative correlation between insertion ratios for nonvertebrates. Additional file 1: Figure S1 also shows that the size distributions of nonvertebrate exons are different from those vertebrates (Figure 2). In Table 1, we have estimated that there is a significant amount of intron losses in nonvertebrate genomes. Next, numerical simulations indicate that these three differences could result from excessive intron loss during the evolution of nonvertebrate genomes.
With simulated GRFP fragments, we gradually introduce 550% of “intron loss” by randomly reconnecting adjacent fragment pairs. The size distributions of the resulted fragments are fitted with a normal function, (Additional file 1: Figure S8) and the fitted σ_{ E } is plotted against percentage of intron loss in Additional file 1: Figure S9A. The insertion ratio between each adjacent fragment pairs is calculated using equation (5). Their histograms are fitted with a normal function (Additional file 1: Figure S10) with the fitted σ_{ x } plot against intron loss in Additional file 1: Figure S9B. In Additional file 1: Figure S9C, we calculate the correlation of insertion ratios between i and i+4 sites using equation (6) for each of the intron loss simulations and plot them against intron loss. Results in Additional file 1: Figure S9 suggest that intron loss might account for increasing σ_{ x } (Figure 4 and Table 3), and more negative correlation between nonadjacent insertion ratios (Figure 5).
Additional file 1: Figure S8 also shows that the size distribution of exons no longer can be fit properly to a normal function. As the percentage of intron loss increases, a second peak is appearing, as the size distribution of exons for nonvertebrates in Additional file 1: Figure S1.
Discussion and conclusion
In this study, we analyze the size distribution of exons for 14 species, including eight vertebrates and six nonvertebrates. Our approach overcomes the limits of using orthologous genes, thus allowing us to infer evolutionary processes affecting the exonintron structure across widely divergent species. The use of size distributions is more reliable than alignment based approaches if considering the accumulation of repeating intron gain/loss. Based on the size distribution of exons, we propose GRFP to characterize the evolution of eukaryotic genes. The solid agreements between GRFP simulations and observations on genomic data provide several key findings on the evolution of exonintron gene structures.
Chance of intron gain is proportional to exon size to the 3rd power
GRFP reveals that longer exons have a higher chance to gain an intron during evolution, and reveals the novel finding that the chance of intron gain is proportional to the exon length to the third power. This finding was derived after investigating real genome data, comparing with numerical simulations, and excluding various effects on GRFP through EM iterations. This finding might explain why long exons are rare in modern organisms. E.g., statistical study has shown that only 3.5% of the primate exons are longer than 300 nt [8, 9, 25].
The “third power” is derived from σ_{ E } (or width) of the exon size distributions. The model of GRFP indicates that σ_{ E } will remain constant given the same dependency value α. Thus, the nearly identical σ_{ E } (0.43) in Table 2 suggests the existence of a common dependency value (α ≈ 3) across vertebrate species. However, the 5′ deviations of α value might indicate that intron is less preferred there. For nonvertebrate species, α cannot be directly estimated since their exons follow a mixture of two lognormal distributions (instead of one) possibly due to excessive intron loss. For estimation of intron loss in Table 1, we simply assume that α ≈ 3 holds for nonvertebrates.
Why is the probability of intron gain proportional to the exon length to the third power? Given that the third power is usually related to volume, it might be possible that exon occupies a volume proportional to its length to the third power due to dynamic movement, and the chance of an intron attacking it is proportional to this volume. Further investigation will be needed to support this hypothesis.
No evidence for sitespecific bias of intron insertion
We derive this finding from indirectly estimating the position distribution of intron insertion loci. We demonstrate that the insertion loci follow a normal distribution, peaking around the center of the exon with a standard deviation (σ_{ I }) of 0.11. This observation does not support the protosplice site hypothesis. If there were protosplice sites in the exon, the insertion loci would follow the position distribution of these sites, which will most likely be a uniform distribution (Additional file 1: Figure S3).
In Figure 5, we also demonstrate that, for vertebrate genomes, intron gains are independent of each other. This observation is also one of the core assumptions of GRFP simulation. It holds on vertebrate genomes and nonvertebrate genomes if the effect of intron losses is considered. Another simplification of GRFP is that the effect of exon duplications is ignored. As mentioned earlier, the sharp spikes in Figure 4 are related to tandem exon duplication [22]. Such effect is not considered since overall their contribution is not significant in estimating either insertion loci or the chance of intron gain. This is illustrated in Additional file 1: Figure S7 and Additional file 1: Figure S10, where no such spikes are observed.
The assumption behind the estimation of insertion ratio (equation (5)) is that the order of the exons within each gene is maintained during evolution. In the cases of tandem exon duplication, exon shuffling, or intron loss, the order is just locally disrupted. Simulation also shows that the estimated insertion ratio is a mixture of four different groups of adjacent fragment pairs (Additional file 1: Figure S7, Additional file 1: Figure S8), but σ_{ x } is linearly related to σ_{ I } (Additional file 1: Figure S5A).
Suggesting 5′ intron gain/loss bias
By grouping exons by positions within a gene, we demonstrate that exons next to the 5′ UTR have bigger standard deviation (σ_{ E }) than other exons. One may argue that the deviation near the 5′ UTR is caused by the fact that on average exons are longer for genes contain fewer exons. If this is the case, similar trend near the 3′ UTR should have been observed. From equation (9), bigger σ_{ E } indicates smaller GRFP dependency value (α). The dropping of α values for exons adjacent to the 5′ UTR implies that introns are not favored there; In the GRFP model (Equation (1)), a smaller dependency value indicates a lower chance in acquiring introns during evolution. Alternatively, it might be explained as intron loss bias, that is, introns right after 5′ UTR has a tendency to lose than other introns. Certainly such comparison is limited to introns in the coding region and debatable due to the unclear stochastic process of intron loss.
Excessive intron losses accounting for deviations from GRFP
In this study, we show that exons of nonvertebrates are different from those of vertebrates in three aspects. First, the size distribution of their exons fit a mixture of two normal distributions (Additional file 1: Figure S1) instead of one for vertebrates (Figure 2). Second, their insertion ratios have much larger standard deviations (σ_{ x }) as shown in Table 3. Third, their nonadjacent insertion ratios are more negatively correlated as shown in Figure 5.
The estimations in Table 1 (also Figure 7) suggest that there are excessive intron losses in nonvertebrate genomes. Based on this, we performed simulations of intron loss after GRFP fragmentation. Additional file 1: Figure S9 demonstrates that, with increasing intron losses, σ_{ E } increases, σ_{ I } increases, and ρ(i, j) decreases from zero, consistent with the observation on empirical data analysis here. Therefore, GRFP model holds on nonvertebrate genomes when the effect of intron loss is considered. Comparative approaches also show that frequent intron loss has been inferred during the evolution of Nematode [26, 27] and Drosophila genomes [28], though they cannot provide a genome wide estimation of percentage of intron losses.
Here, the excessive intron loss hypothesis in nonvertebrate genomes is interpreted as breaking the equilibrium between intron gain and intron loss. Although GRFP model is built on modeling intron gain events, it does not assume that introns in vertebrate genomes are never lost. Instead, we interpret the straight line in Figure 7 as a “dynamic equilibrium line for vertebrates”, where the genome reaches a state of stability for its intron count. In Additional file 1: Figure S11, we observed the similar linear relationship between intron counts and CDS length by grouping genes by chromosomes for H. sapiens. This further proves that the “equilibrium” is reached in each chromosome and the linear relationship is independent of lineage (since human chromosomes are not related by a simple lineage relationship). If there are many intron losses during evolution, subsequent gains must have also occurred to bring the genome back to the equilibrium line. Therefore, the statistical measurements of the exons can remain the same across examined vertebrate genomes. For the nonvertebrate genomes that fall below the line, equilibrium is shifted to intron loss relative to vertebrates. The shifting (variation of intron density) might be related to the differences in the generation time of each species [29].
The size distribution of exons (Additional file 1: Figure S1), the insertion ratios (Table 3) and the correlation map (Figure 5) suggest that A. thaliana, S. bicolor, and Z. mays underwent intron loss during evolution though not as significant as O. sativa. The estimated intron losses are around 7% for A. thaliana and S. bicolor, and 4% for Z. mays (Table 1). However, such percentage of intron loss is not significant enough to justify the size distribution of exons for these three plant species, a mixture of two Gaussians instead of one as shown in Additional file 1: Figure S1. Another possible explanation is that plants have undergone significant genome duplications and the rate of indels is higher for the duplicate genes [30]. The reason is that one copy of the duplicate genes is freed from the selection pressure.
Weakness and strength of GRFP
In this work, we propose the GRFP model to capture the dynamic processes describing the evolution of exonintron structures. For vertebrate genomes, the model fits well with the well annotated genome data, including exon size distribution, distribution of insertion loci, total CDS length, number of introns, independency among intron gains, and 5′ intron gain bias. For nonvertebrate genomes, simulations show that the deviations from the vertebrate genome can be explained by excessive intron loss. The GRFP model implies that the evolution of gene structure is purely random, from picking which exon to split (gains intron) to picking intron insertion loci on the selected exon. The solid agreements between GRFP simulations and real genome data confirm that GRFP model provides one possible explanation on the exonintron structure evolution.
It is well known that a modern genome is a collection of introns that have accreted (and been deleted) over at least a billion years. Here, by considering the whole process as a black box, we reproduce the output of this box (the current day genomes) with numerical simulations. The size distribution of exons serves as the key component in building GRFP model because of two reasons. First, the dominant factor that can shape such distribution is intron gain/loss (fragmentation). Second, the most prominent cofounding factor on exon sizes  the rate of indels in them during evolution is low. Certainly, the mechanism of intron gain is complicated considering differences across lineage, differences in rates of insertion across sites, the age of introns, the possibility of indels to maximize fit to epigenomic structures that can occur following intron gain, alternative splicing, the different models of intron gain, and so on. The process of exon fragmentation (or intron gain) might be as straightforward as the model of GRFP describes. By focusing on internal translated exons only, we have demonstrated outstanding agreements between empirical observations and GRFP simulations.
It is crucial to note that GRFP does not make any assumptions on the rate of intron gain/loss. Recent studies [6, 7, 31–36] have suggested that the early eukaryotes experienced a rapid burst of intron gain. In later evolution, intron loss dominates the landscape, with occasional bursts of intron gains. This does not contradict the results presented here since GRFP makes no assumptions about the rate of intron gain or loss during evolution, but instead estimates the chance for an exon to gain an intron somewhere along its length and predicts the distribution of those insertion events.
One may argue that the agreement between the GRFP model and well annotated genome structure could be fortuitous. While we cannot rule out that other models might reproduce the exons of modern genomes, the predictive power of GRFP is striking, and we believe that it is a promising approach to understanding the evolution of exonintron structures, and an excellent starting point for new models for revealing the hidden stochastic processes of evolution.
Unanswered questions and future studies
GRFP model provides explicit rules on the exonintron structure evolution. However, it does not address the origin of introns, the mechanism of intron insertion, and the rate of intron gain/loss. In other words, GRFP addresses where introns are inserted (which exon and where in the exon), but not when and how introns are inserted. Future research will focus on extending GRFP to model the evolution of noncoding exon, and developing GRFPbased methods for comparative genomics studies.
Abbreviations
 GRFP:

General Random Fragmentation Process
 UTR:

Untranslated region
 CDS:

Coding DNA Sequence
 EM:

Expectation Maximization.
Declarations
Acknowledgements
We thank the National Science Foundation (DBI0735191) and National Institute of Health (P41HG02223) for funding aspects of this work.
Authors’ Affiliations
References
 Gilbert W: The exon theory of genes. Cold Spring Harb Symp Quant Biol. 1987, 52: 901905. 10.1101/SQB.1987.052.01.098.PubMedView Article
 Gilbert W, Glynias M: On the ancient nature of introns. Gene. 1993, 135 (1–2): 137144.PubMedView Article
 Penny D, Hoeppner MP, Poole AM, Jeffares DC: An overview of the intronsfirst theory. J Mol Evol. 2009, 69 (5): 527540. 10.1007/s0023900992795.PubMedView Article
 Stoltzfus A, Spencer DF, Zuker M, Logsdon JM, Doolittle WF: Testing the exon theory of genes: the evidence from protein structure. Science. 1994, 265 (5169): 202207. 10.1126/science.8023140.PubMedView Article
 Logsdon JM, Tyshenko MG, Dixon C, DJ J, Walker VK, Palmer JD: Seven newly discovered intron positions in the triosephosphate isomerase gene: evidence for the intronslate theory. Proc Natl Acad Sci USA. 1995, 92 (18): 85078511. 10.1073/pnas.92.18.8507.PubMed CentralPubMedView Article
 Rogozin IB, Carmel L, Csuros M, Koonin EV: Origin and evolution of spliceosomal introns. Biol Direct. 2012, 7: 1110.1186/17456150711.PubMed CentralPubMedView Article
 Csuros M, Rogozin IB, Koonin EV: A detailed history of intronrich eukaryotic ancestors inferred from a global survey of 100 complete genomes. PLoS Comput Biol. 2011, 7 (9): e100215010.1371/journal.pcbi.1002150.PubMed CentralPubMedView Article
 Zhang MQ: Statistical features of human exons and their flanking regions. Hum Mol Genet. 1998, 7 (5): 919932. 10.1093/hmg/7.5.919.PubMedView Article
 Gudlaugsdottir S, Boswell DR, Wood GR, Ma J: Exon size distribution and the origin of introns. Genetica. 2007, 131 (3): 299306. 10.1007/s1070900791394.PubMedView Article
 Weibull GW: Citation Classic  a Statistical Distribution Function of Wide Applicability. Curr Cont/Eng Technol Appl Sci. 1981, 10: 1818.
 Ryabov Y, Gribskov M: Spontaneous symmetry breaking in genome evolution. Nucleic Acids Res. 2008, 36 (8): 27562763. 10.1093/nar/gkn086.PubMed CentralPubMedView Article
 Kolmogoroff AN: Concerning the logarithmic normal distribution principle of dimensions of particles during dispersal. Cr Acad Sci Urss. 1941, 31: 99101.
 Tenchov BG, Yanev TK: Weibull Distribution of Particle Sizes Obtained by Uniform Random Fragmentation. J Colloid Interface Sci. 1986, 111 (1): 17. 10.1016/00219797(86)900020.View Article
 Cho G, Doolittle RF: Intron distribution in ancient paralogs supports random insertion and not random loss. J Mol Evol. 1997, 44 (6): 573584. 10.1007/PL00006180.PubMedView Article
 Rogozin IB, Sverdlov AV, Babenko VN, Koonin EV: Analysis of evolution of exonintron structure of eukaryotic genes. Brief Bioinform. 2005, 6 (2): 118134. 10.1093/bib/6.2.118.PubMedView Article
 Logsdon JM, Palmer JD: Origin of intronsearly or late. Nature. 1994, 369 (6481): 526author reply 527–528PubMedView Article
 Matsumoto M, Nishimura T: Mersenne twister: a 623dimensionally equidistributed uniform pseudorandom number generator. ACM Trans Model Comput Simul. 1998, 8 (1): 330. 10.1145/272991.272995.View Article
 Coleman TF, Li YY: An interior trust region approach for nonlinear minimization subject to bounds. Siam J Optim. 1996, 6 (2): 418445. 10.1137/0806023.View Article
 Thomas FC, Li YY: On the Convergence of InteriorReflective Newton Methods for Nonlinear Minimization Subject to Bounds. Math Program. 1994, 67 (2): 189224.
 Flicek P, Amode MR, Barrell D, Beal K, Brent S, CarvalhoSilva D, Clapham P, Coates G, Fairley S, Fitzgerald S: Ensembl 2012. Nucleic Acids Res. 2012, 40 (Database issue): D8490.PubMed CentralPubMedView Article
 Chen JQ, Wu Y, Yang H, Bergelson J, Kreitman M, Tian D: Variation in the ratio of nucleotide substitution and indel rates across genomes in mammals and bacteria. Mol Biol Evol. 2009, 26 (7): 15231531. 10.1093/molbev/msp063.PubMedView Article
 Letunic I, Copley RR, Bork P: Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet. 2002, 11 (13): 15611567. 10.1093/hmg/11.13.1561.PubMedView Article
 Long MY, De Souza SJ, Rosenberg C, Gilbert WE: Relationship between "protosplice sites" and intron phases: Evidence from dicodon analysis. Proc Natl Acad Sci USA. 1998, 95 (1): 219223. 10.1073/pnas.95.1.219.PubMed CentralPubMedView Article
 Long MY, Rosenberg C: Testing the "protosplice sites" model of intron origin: Evidence from analysis of intron phase correlations. Mol Biol Evol. 2000, 17 (12): 17891796. 10.1093/oxfordjournals.molbev.a026279.PubMedView Article
 Berget SM: Exon recognition in vertebrate splicing. J Biol Chem. 1995, 270 (6): 24112414.PubMedView Article
 Cho S, Jin SW, Cohen A, Ellis RE: A phylogeny of caenorhabditis reveals frequent loss of introns during nematode evolution. Genome Res. 2004, 14 (7): 12071220. 10.1101/gr.2639304.PubMed CentralPubMedView Article
 Kiontke K, Gavin NP, Raynes Y, Roehrig C, Piano F, Fitch DH: Caenorhabditis phylogeny predicts convergence of hermaphroditism and extensive intron loss. Proc Natl Acad Sci U S A. 2004, 101 (24): 90039008. 10.1073/pnas.0403094101.PubMed CentralPubMedView Article
 CoulombeHuntington J, Majewski J: Intron loss and gain in Drosophila. Mol Biol Evol. 2007, 24 (12): 28422850.PubMedView Article
 Jeffares DC, Mourier T, Penny D: The biology of intron gain and loss. Trends Genet. 2006, 22 (1): 1622. 10.1016/j.tig.2005.10.006.PubMedView Article
 Xu G, Guo C, Shan H, Kong H: Divergence of duplicate genes in exonintron structure. Proc Natl Acad Sci USA. 2012, 109 (4): 11871192. 10.1073/pnas.1109047109.PubMed CentralPubMedView Article
 Yenerall P, Zhou L: Identifying the mechanisms of intron gain: progress and trends. Biol Direct. 2012, 7: 2910.1186/17456150729.PubMed CentralPubMedView Article
 Koonin EV, Csuros M, Rogozin IB: Whence genes in pieces: reconstruction of the exonintron gene structures of the last eukaryotic common ancestor and other ancestral eukaryotes. Wiley Interdiscip Rev RNA. 2013, 4 (1): 93105.PubMedView Article
 Roy SW, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006, 7 (3): 211221.PubMed
 RodriguezTrelles F, Tarrio R, Ayala FJ: Origins and evolution of spliceosomal introns. Annu Rev Genet. 2006, 40: 4776. 10.1146/annurev.genet.40.110405.090625.PubMedView Article
 Koonin EV: The origin of introns and their role in eukaryogenesis: a compromise solution to the intronsearly versus intronslate debate. Biol Direct. 2006, 1: 122. 10.1186/1745615011.PubMed CentralPubMedView Article
 Whitney KD, Garland T: Did Genetic Drift Drive Increases in Genome Complexity. PLoS Genet. 2010, 6 (8): 16.View Article
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.