In this study, we analyze the size distribution of exons for 14 species, including eight vertebrates and six non-vertebrates. Our approach overcomes the limits of using orthologous genes, thus allowing us to infer evolutionary processes affecting the exon-intron structure across widely divergent species. The use of size distributions is more reliable than alignment based approaches if considering the accumulation of repeating intron gain/loss. Based on the size distribution of exons, we propose GRFP to characterize the evolution of eukaryotic genes. The solid agreements between GRFP simulations and observations on genomic data provide several key findings on the evolution of exon-intron gene structures.
Chance of intron gain is proportional to exon size to the 3rd power
GRFP reveals that longer exons have a higher chance to gain an intron during evolution, and reveals the novel finding that the chance of intron gain is proportional to the exon length to the third power. This finding was derived after investigating real genome data, comparing with numerical simulations, and excluding various effects on GRFP through EM iterations. This finding might explain why long exons are rare in modern organisms. E.g., statistical study has shown that only 3.5% of the primate exons are longer than 300 nt [8, 9, 25].
The “third power” is derived from σ
(or width) of the exon size distributions. The model of GRFP indicates that σ
will remain constant given the same dependency value α. Thus, the nearly identical σ
(0.43) in Table 2 suggests the existence of a common dependency value (α ≈ 3) across vertebrate species. However, the 5′ deviations of α value might indicate that intron is less preferred there. For non-vertebrate species, α cannot be directly estimated since their exons follow a mixture of two lognormal distributions (instead of one) possibly due to excessive intron loss. For estimation of intron loss in Table 1, we simply assume that α ≈ 3 holds for non-vertebrates.
Why is the probability of intron gain proportional to the exon length to the third power? Given that the third power is usually related to volume, it might be possible that exon occupies a volume proportional to its length to the third power due to dynamic movement, and the chance of an intron attacking it is proportional to this volume. Further investigation will be needed to support this hypothesis.
No evidence for site-specific bias of intron insertion
We derive this finding from indirectly estimating the position distribution of intron insertion loci. We demonstrate that the insertion loci follow a normal distribution, peaking around the center of the exon with a standard deviation (σ
) of 0.11. This observation does not support the proto-splice site hypothesis. If there were proto-splice sites in the exon, the insertion loci would follow the position distribution of these sites, which will most likely be a uniform distribution (Additional file 1: Figure S3).
In Figure 5, we also demonstrate that, for vertebrate genomes, intron gains are independent of each other. This observation is also one of the core assumptions of GRFP simulation. It holds on vertebrate genomes and non-vertebrate genomes if the effect of intron losses is considered. Another simplification of GRFP is that the effect of exon duplications is ignored. As mentioned earlier, the sharp spikes in Figure 4 are related to tandem exon duplication . Such effect is not considered since overall their contribution is not significant in estimating either insertion loci or the chance of intron gain. This is illustrated in Additional file 1: Figure S7 and Additional file 1: Figure S10, where no such spikes are observed.
The assumption behind the estimation of insertion ratio (equation (5)) is that the order of the exons within each gene is maintained during evolution. In the cases of tandem exon duplication, exon shuffling, or intron loss, the order is just locally disrupted. Simulation also shows that the estimated insertion ratio is a mixture of four different groups of adjacent fragment pairs (Additional file 1: Figure S7, Additional file 1: Figure S8), but σ
is linearly related to σ
(Additional file 1: Figure S5A).
Suggesting 5′ intron gain/loss bias
By grouping exons by positions within a gene, we demonstrate that exons next to the 5′ UTR have bigger standard deviation (σ
) than other exons. One may argue that the deviation near the 5′ UTR is caused by the fact that on average exons are longer for genes contain fewer exons. If this is the case, similar trend near the 3′ UTR should have been observed. From equation (9), bigger σ
indicates smaller GRFP dependency value (α). The dropping of α values for exons adjacent to the 5′ UTR implies that introns are not favored there; In the GRFP model (equation (1)), a smaller dependency value indicates a lower chance in acquiring introns during evolution. Alternatively, it might be explained as intron loss bias, that is, introns right after 5′ UTR has a tendency to lose than other introns. Certainly such comparison is limited to introns in the coding region and debatable due to the unclear stochastic process of intron loss.
Excessive intron losses accounting for deviations from GRFP
In this study, we show that exons of non-vertebrates are different from those of vertebrates in three aspects. First, the size distribution of their exons fit a mixture of two normal distributions (Additional file 1: Figure S1) instead of one for vertebrates (Figure 2). Second, their insertion ratios have much larger standard deviations (σ
) as shown in Table 3. Third, their non-adjacent insertion ratios are more negatively correlated as shown in Figure 5.
The estimations in Table 1 (also Figure 7) suggest that there are excessive intron losses in non-vertebrate genomes. Based on this, we performed simulations of intron loss after GRFP fragmentation. Additional file 1: Figure S9 demonstrates that, with increasing intron losses, σ
increases, and ρ(i, j) decreases from zero, consistent with the observation on empirical data analysis here. Therefore, GRFP model holds on non-vertebrate genomes when the effect of intron loss is considered. Comparative approaches also show that frequent intron loss has been inferred during the evolution of Nematode [26, 27] and Drosophila genomes , though they cannot provide a genome wide estimation of percentage of intron losses.
Here, the excessive intron loss hypothesis in non-vertebrate genomes is interpreted as breaking the equilibrium between intron gain and intron loss. Although GRFP model is built on modeling intron gain events, it does not assume that introns in vertebrate genomes are never lost. Instead, we interpret the straight line in Figure 7 as a “dynamic equilibrium line for vertebrates”, where the genome reaches a state of stability for its intron count. In Additional file 1: Figure S11, we observed the similar linear relationship between intron counts and CDS length by grouping genes by chromosomes for H. sapiens. This further proves that the “equilibrium” is reached in each chromosome and the linear relationship is independent of lineage (since human chromosomes are not related by a simple lineage relationship). If there are many intron losses during evolution, subsequent gains must have also occurred to bring the genome back to the equilibrium line. Therefore, the statistical measurements of the exons can remain the same across examined vertebrate genomes. For the non-vertebrate genomes that fall below the line, equilibrium is shifted to intron loss relative to vertebrates. The shifting (variation of intron density) might be related to the differences in the generation time of each species .
The size distribution of exons (Additional file 1: Figure S1), the insertion ratios (Table 3) and the correlation map (Figure 5) suggest that A. thaliana, S. bicolor, and Z. mays underwent intron loss during evolution though not as significant as O. sativa. The estimated intron losses are around 7% for A. thaliana and S. bicolor, and 4% for Z. mays (Table 1). However, such percentage of intron loss is not significant enough to justify the size distribution of exons for these three plant species, a mixture of two Gaussians instead of one as shown in Additional file 1: Figure S1. Another possible explanation is that plants have undergone significant genome duplications and the rate of indels is higher for the duplicate genes . The reason is that one copy of the duplicate genes is freed from the selection pressure.
Weakness and strength of GRFP
In this work, we propose the GRFP model to capture the dynamic processes describing the evolution of exon-intron structures. For vertebrate genomes, the model fits well with the well annotated genome data, including exon size distribution, distribution of insertion loci, total CDS length, number of introns, independency among intron gains, and 5′ intron gain bias. For non-vertebrate genomes, simulations show that the deviations from the vertebrate genome can be explained by excessive intron loss. The GRFP model implies that the evolution of gene structure is purely random, from picking which exon to split (gains intron) to picking intron insertion loci on the selected exon. The solid agreements between GRFP simulations and real genome data confirm that GRFP model provides one possible explanation on the exon-intron structure evolution.
It is well known that a modern genome is a collection of introns that have accreted (and been deleted) over at least a billion years. Here, by considering the whole process as a black box, we reproduce the output of this box (the current day genomes) with numerical simulations. The size distribution of exons serves as the key component in building GRFP model because of two reasons. First, the dominant factor that can shape such distribution is intron gain/loss (fragmentation). Second, the most prominent cofounding factor on exon sizes - the rate of indels in them during evolution is low. Certainly, the mechanism of intron gain is complicated considering differences across lineage, differences in rates of insertion across sites, the age of introns, the possibility of indels to maximize fit to epigenomic structures that can occur following intron gain, alternative splicing, the different models of intron gain, and so on. The process of exon fragmentation (or intron gain) might be as straightforward as the model of GRFP describes. By focusing on internal translated exons only, we have demonstrated outstanding agreements between empirical observations and GRFP simulations.
It is crucial to note that GRFP does not make any assumptions on the rate of intron gain/loss. Recent studies [6, 7, 31–36] have suggested that the early eukaryotes experienced a rapid burst of intron gain. In later evolution, intron loss dominates the landscape, with occasional bursts of intron gains. This does not contradict the results presented here since GRFP makes no assumptions about the rate of intron gain or loss during evolution, but instead estimates the chance for an exon to gain an intron somewhere along its length and predicts the distribution of those insertion events.
One may argue that the agreement between the GRFP model and well annotated genome structure could be fortuitous. While we cannot rule out that other models might reproduce the exons of modern genomes, the predictive power of GRFP is striking, and we believe that it is a promising approach to understanding the evolution of exon-intron structures, and an excellent starting point for new models for revealing the hidden stochastic processes of evolution.
Unanswered questions and future studies
GRFP model provides explicit rules on the exon-intron structure evolution. However, it does not address the origin of introns, the mechanism of intron insertion, and the rate of intron gain/loss. In other words, GRFP addresses where introns are inserted (which exon and where in the exon), but not when and how introns are inserted. Future research will focus on extending GRFP to model the evolution of noncoding exon, and developing GRFP-based methods for comparative genomics studies.