Patterns of selective constraints in noncoding DNA of rice
© Guo et al; licensee BioMed Central Ltd. 2007
Received: 29 September 2007
Accepted: 01 November 2007
Published: 01 November 2007
Several studies have investigated the relationships between selective constraints in introns and their length, GC content and location within genes. To date, however, no such investigation has been done in plants. Studies of selective constraints in noncoding DNA have generally involved interspecific comparisons, under the assumption of the same selective pressures acting in each lineage. Such comparisons are limited to cases in which the noncoding sequences are not too strongly diverged, so that reliable sequence alignments can be obtained. Here, we investigate selective constraints in 605 paralogous intron pairs from a recent segmental duplication that occurred about 7 million years ago in rice (O. sativa).
Our principal findings are: (1) intronic divergence is negatively correlated with intron length, a pattern that has previously been described in Drosophila and mammals; (2) there is a signature of strong purifying selection at splice control sites; (3) first introns are significantly longer and have a higher GC content than other introns; (4) the divergences of first and non-first introns are not significantly different from one another, a pattern that differs from Drosophila and mammals; and (5) short introns are more diverged than four-fold degenerate sites, suggesting that selection reduces divergence at four-fold sites.
Our observation of stronger selective constraints in long introns suggests that functional elements subject to purifying selection may be concentrated within long introns. Our results are consistent with the presence of strong purifying selection at splicing control sites. Selective constraints are not significantly stronger in first introns of rice, as they are in other species.
Noncoding intronic and intergenic DNA of multicellular organisms typically comprises a large fraction of their genomes. Comparative genomic studies have revealed extensive evolutionary conservation of noncoding DNA in several mammalian and other species and are beginning to reveal the extent of potentially functional noncoding DNA [1–8]. Several lines of evidence have suggested that introns harbour a variety of untranslated RNAs (for example ) that are involved in mRNA processing, editing and transport [2, 3]. In plants, conserved noncoding sequences were first identified in the grasses [5–7], and evidence of regulatory elements or binding sites in these noncoding sequences has been obtained [6, 7]. Interestingly, in Arabidopsis thaliana, intragenomic conserved noncoding sequences have also been investigated on the basis of a well-documented recent genome duplication event, and a unique set of noncoding DNA sequences enriched for function has been uncovered . These observations indicate that at least some functional regions in plant introns are likely to be under the influence of natural selection.
Selective constraint (also known as functional or evolutionary constraint) is defined here as the factor by which evolutionary divergence of a functional sequence is reduced, relative to a neutrally evolving sequence, due to the action of purifying selection . Several methods for estimating evolutionary constraints have been proposed and applied to coding and noncoding DNA of invertebrates and mammals [11–16]. Shabalina and Kondrashov  proposed a method to quantify the proportion of bases that are subject to strong purifying selection by comparing the genomes of distantly related species. It is assumed that homologous segments that show significant similarity are under strong functional constraints, whereas the remaining segments are evolving free from functional constraints.
Another approach to identify functional regions in the genome is to compare sequences from species showing lower levels of divergence that are far from saturation . The basis of the method is to compare the relative divergence of putatively constrained segments of the genome with that of linked putatively neutrally evolving sequences. In the selectively constrained segments, nucleotides are assumed to fall into two classes: neutral, which evolve at the same rate as the neutral sequence; or strongly constrained, in which mutations are eliminated unconditionally by natural selection. Selective constraint is then the proportion of new mutations that are strongly deleterious and removed by purifying selection [11, 14, 15]. It should be noted that the presence of adaptive substitutions tends to lead to underestimation of constraint, since this leads to divergence of functional regions.
One difficulty in analyzing evolutionary constraints in noncoding DNA is the inference of the correct sequence alignment. If the sequence alignment method tends to miss genuine similarities, then functional elements could be misassigned as non-functional. This uncertainty largely arises from the unknown pattern of indels (gaps) between the pair of sequences . A solution to this problem is to compute probabilities of alternative alignments according to explicit models of indel evolution. Based on this method, MCALIGN2 has been developed to tackle the problem of aligning noncoding DNA .
Selective constraints of introns have recently been investigated in Drosophila, mammals and other animals [11–15, 18]. Several patterns of nucleotide divergence, polymorphism, and selective constraints have been uncovered (described in our results and discussion section). To date, however, no such investigation has been done in plants.
The methodology chosen to study the pattern of noncoding DNA evolution depends heavily on the dataset investigated. In general, noncoding DNA sequences must not be too far diverged, so that they can be aligned reliably. On the other hand, sequences should not be too similar, otherwise there may be insufficient statistical power for comparative genomic analysis. Until now, all studies of evolutionary constraints have compared different lineages, under the assumption of the same selective pressures acting on them (e.g., Drosophila [12, 13, 15], rodents [11, 14] and hominids ). Here, we have compared intronic sequences from just one species, using a dataset of 272 paralogous gene pairs from a recent segmental duplication in rice (O. sativa). The duplication event encompasses a ~3 Mb segmental pair with perfect synteny between chromosomes 11 and 12 . The duplication is estimated to have occurred about 7 million years ago (mya) [19–21], although an alternative date of 21 mya has also been proposed . This divergence time is comparable to estimates for human-chimpanzee (5–7 mya, ) and members of the Drosophila genus (e.g., 2.5–3.4 mya between D. melanogaster and D. simulans, ), which have previously been used for noncoding DNA analysis (for example, [12, 13, 15, 18]). Average divergence is about 0.1 between D. simulans and D. melanogaster and about 0.01 for human-chimpanzee, whereas ours is about 0.08. The divergence of this segment is more suitable for noncoding DNA analysis than, for example, that between different rice subspecies, or between rice and other cereals. Rice has two cultivated subspecies, indica and japonica, for which the genomes have been sequenced. However, the two subspecies separated only about 0.5 mya [25, 26], so their sequence similarity is too high and the power to infer constraints is low. The divergence time of rice and other cereals is estimated to be about 50 mya , and alignment of noncoding sequences between them is usually problematic.
Results and Discussion
Compilation of intron dataset
In this study, we employed several methods to minimize the frequency of incorrect alignments. These included amino acid-guided methods (see methods section) to anchor the coding regions of a paralogous gene pair (T-COFFEE), alignment using explicit models of indel evolution (MCALIGN2), and the use of two masking protocols for nonhomologous sites (for details see methods section). Our final sample size of 605 intron pairs from 272 loci is comparable to those of other similar studies. For example, 200–300 loci were used by Keightley and Gaffney , 24 loci by Halligan et al.  and 225 intron segments by Haddrill et al. .
Intron ordinal position
Divergence and GC content values for intronic sequences. Introns were divided into two classes based on their average intron length (232 bp): short introns, ≤232 bp; long introns, >232 bp. Divergence values ($K_i$) are means across introns (standard errors are in parentheses). Results of Wilcoxon two-sample tests (P) between short and long introns (by column) and between first and non-first introns (by row) are shown.
In addition to single nucleotide mutations, we also investigated the frequency distribution of indels in first and non-first introns. A total of 1,398 indels were identified in our dataset, and no significant difference in the frequencies of indel lengths between first and non-first introns was observed (non-parametric Wilcoxon test, Z = -0.052, P = 0.95). However, significant differences in indel numbers and lengths per base or gene pair were observed (Wilcoxon test, P < 0.002), with more indels in first than in non-first introns. This result indicates that the evolutionary pattern of indels seems to differ somewhat from that of nucleotide divergence in rice introns. Whether this trend exists in other plant or animal species needs further investigation.
In summary, selective constraints seem not to be specific to first introns in rice, so our results are similar to those previously reported in Drosophila. In a comparison of two species of Drosophila (D. melanogaster and D. yakuba), Haddrill et al.  found that first introns evolve at similar rates to other introns. In rodents and mammals, however, it has been reported that divergence varies among introns and depends on their ordinal position within the gene. Gaffney and Keightley  observed a negative correlation between mean intronic selective constraint and intron ordinal number in rodents, implying that first introns are more conserved than other introns. Levels of intronic divergence between humans and closely related species suggest that divergence also depends on intron ordinal number . The above results indicate that the rule of high constraint in first introns is not common to all taxonomic groups. Whether the phenomenon is present in other plants needs further investigation.
Splice control sites
Estimates of selective constraint in intronic sequence close to the intronic splice sites, by position (base pairs). Means (standard errors) of constraint values are shown.
Proportions of nucleotide differences at 4-fold sites, FEI sites and first introns, by type of nucleotide change. Standard errors are shown in parentheses.
In general, the fractions of nucleotide differences at FEI sites are consistently higher than those at 4-fold sites and in first introns. The transitions A↔G and T↔C are expected to be the most common substitutional changes in all categories of sites (Table 3). The situation at 4-fold sites has previously been observed in rice coding sequences, where the A↔G and T↔C changes are predominantly from A/T to G/C and thereby increase GC content . Besides the T↔C transition, the fraction of the C↔G transversion is relatively higher than those of the other four types of nucleotide change in first introns compared to introns in general.
We investigated selective constraints in 605 paralogous intron pairs from a recent segmental duplication that occurred about 7 million years ago in rice. Our observation of stronger selective constraints in long introns suggests that functional elements subject to purifying selection may be concentrated within long introns. Our results are consistent with the presence of strong purifying selection at splicing control sites. Selective constraints are not significantly stronger in first introns of rice, as they are in other species.
Identification of segmentally duplicated genes
Gene sequences and their annotations (release 5) were downloaded from the Rice Genome Annotation of TIGR (The Institute for Genomic Research, http://ftp.tigr.org). The segmental duplication was identified using a reciprocal BLASTP search with E-value < 10^-14 within a distance of 100 kb between collinear gene pairs . A total of 272 pairs of non-transposable element-derived duplicated genes were obtained between chromosomes 11 and 12. A chromosomal alignment between chromosomes 11 and 12 is shown in Additional file 1 and a list of the 272 duplicated gene pairs is provided as Additional file 2.
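The reciprocal filtering step can be illustrated with a minimal Python sketch. It assumes the BLASTP hits have already been parsed into hypothetical (query, subject, E-value) tuples and covers only the E-value threshold and the reciprocity check; the 100 kb collinearity criterion is not modelled, and the function name and gene identifiers are illustrative rather than taken from the study.

```python
def reciprocal_pairs(hits, max_evalue=1e-14):
    """Return gene pairs that hit each other in both search directions.

    hits: iterable of (query_id, subject_id, evalue) tuples (assumed input
    layout, not the raw BLASTP output format).
    """
    # Keep directed hits that pass the E-value threshold, ignoring self-hits.
    passing = {(q, s) for q, s, e in hits if q != s and e < max_evalue}
    # A pair is reciprocal when the reverse directed hit also passed.
    pairs = {tuple(sorted((q, s))) for q, s in passing if (s, q) in passing}
    return sorted(pairs)


# Hypothetical identifiers, for illustration only.
example_hits = [
    ("chr11_geneA", "chr12_geneA", 1e-30),
    ("chr12_geneA", "chr11_geneA", 1e-25),
    ("chr11_geneB", "chr12_geneB", 1e-20),
    ("chr12_geneB", "chr11_geneB", 1e-5),  # reverse hit fails the threshold
]
print(reciprocal_pairs(example_hits))
```

Only the first pair is kept in this toy input, because the reverse hit of the second pair does not pass the threshold.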
Identification of conserved introns and alignment masking
Following the methods of Coghlan and Wolfe , duplicated protein pairs were re-aligned using the T-COFFEE program , and the alignments were then used as a guide to check alignment quality around the intron splice sites. An unambiguously aligned region was defined as one with at least 5 conserved amino acids and no alignment gaps in the 10 positions on each side of the splice site (20 positions in total) [34, 35]. A homologous intron was identified if the location and phase were identical in the alignment of the two paralogs and if there were no other introns within 5 amino acids of this position on either side. A total of 730 pairs of introns were identified by this approach.
Intronic DNA sequences were aligned using MCALIGN2, which aligns noncoding DNA sequences based on explicit models of indel evolution . To infer an appropriate indel frequency model, we first aligned the dataset with an indel model for Drosophila using the Jukes-Cantor model of nucleotide substitution. Then, the parameters for the alignment model (θ = 0.211 and w1 = 0.081) were estimated from 400 paralogous intron sequences, in which nucleotide and indel divergence are sufficiently low as to make the alignments practically unambiguous. In order to minimize the possibility of nonhomologous sites contributing to estimates of divergence, two simple masking protocols were implemented: 1) Regions that contained short aligned blocks surrounded by large gaps (>40 bp) were considered unlikely to be truly homologous and were masked off. A total of 608 pairs remained for further analysis after applying this criterion. 2) A moving window of 40 bp was used to check the degree of divergence in each alignment. Pairs containing more than 25 putatively nonparalogous sites in a window were excluded from further analyses. A total of 3 pairs were identified and excluded according to this criterion. Taken together, the final dataset used in this study contained 605 intron pairs. (Sequence alignments of the 605 intron pairs are provided as Additional file 3).
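The splice-site quality criterion can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the two aligned protein sequences are given as equal-length strings with '-' for gaps, the splice site is given as an alignment column index, "conserved" is read as identical residues, and the count of 5 is taken over the whole 20-position window. The function name and these readings are assumptions, not the authors' implementation.

```python
def is_unambiguous_splice_region(aln_a, aln_b, splice_col, flank=10, min_conserved=5):
    """Check the alignment window around a splice site.

    aln_a, aln_b: aligned protein strings of equal length ('-' marks a gap).
    splice_col:   alignment column immediately after the intron position
                  (assumed convention).
    """
    start, end = splice_col - flank, splice_col + flank
    # The window must lie entirely inside the alignment.
    if start < 0 or end > len(aln_a) or len(aln_a) != len(aln_b):
        return False
    win_a, win_b = aln_a[start:end], aln_b[start:end]
    # No alignment gaps are allowed in the 10 positions on either side.
    if "-" in win_a or "-" in win_b:
        return False
    # Count identical residues across the 20-position window.
    conserved = sum(1 for x, y in zip(win_a, win_b) if x == y)
    return conserved >= min_conserved


seq = "ACDEFGHIKLMNPQRSTVWYAC"
print(is_unambiguous_splice_region(seq, seq, 11))
```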
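The second masking protocol, the moving-window check, can be illustrated with a short Python sketch. It assumes that a "putatively nonparalogous site" is an aligned column in which both sequences carry a base and the bases differ; gap columns are ignored. That reading, and the function names, are assumptions made for illustration.

```python
def max_mismatches_in_window(seq_a, seq_b, window=40):
    """Largest number of mismatched columns found in any window of the alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 for a base-versus-base mismatch, 0 for a match or a gap column.
    flags = [1 if (x != y and x != "-" and y != "-") else 0
             for x, y in zip(seq_a, seq_b)]
    if len(flags) <= window:
        return sum(flags)
    current = sum(flags[:window])
    best = current
    # Slide the window one column at a time, updating the running count.
    for i in range(window, len(flags)):
        current += flags[i] - flags[i - window]
        best = max(best, current)
    return best


def exclude_pair(seq_a, seq_b, window=40, max_sites=25):
    """True if any window contains more than max_sites mismatched columns."""
    return max_mismatches_in_window(seq_a, seq_b, window) > max_sites


a = "A" * 100
b = "A" * 30 + "C" * 30 + "A" * 40  # 30 consecutive mismatches
print(exclude_pair(a, b))
```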
Divergence Estimates and Calculation of Evolutionary Constraint
Introns were either analyzed as complete sequences or as partial sequences after removal of putative splice control sequences (i.e., excluding the 6 bp and 16 bp at the 5' and 3' ends of the intron, respectively). The exact limits of the control sequence are somewhat arbitrary . Divergence estimates ($K_i$) were generated for each alignment by applying the Jukes-Cantor correction to the number of substitutions per intronic site using the distmat program from the EMBOSS package .
In order to estimate selective constraint, a variation of the method of Kondrashov and Crow was employed, as in previous studies [11, 37, 38]. For each sequence, the observed substitution rate was compared to that expected under neutrality. Here, we used substitution rates at FEI sites to predict the expected number (E) of substitutions in adjacent intronic sequences, under the assumption that point mutation rates of each possible kind are equal at FEI sites, 4-fold sites and adjacent intronic DNA sites. The FEI sites are defined as sites in introns, excluding first introns, introns of length > 232 bp, and the 6 bp/16 bp at the 5'/3' end of each intron. FEIs were treated as independent observations in the data sets and were used to estimate six substitution rate parameters (A↔T, A↔C, A↔G, T↔C, T↔G, C↔G), which were taken as the rates of substitution expected under neutrality. For each possible substitution type, let $p_i$ ($i = 1, 2, \ldots, 6$) be the pairwise divergence in the FEI segment, i.e.,
$$p_i = d_i / N_i,$$
where $d_i$ is the number of substitutions of type $i$ in the FEI segment and $N_i$ is the number of FEI sites at which they were scored. The expected number of substitutions in the region of interest is then
$$E = \sum_{i=1}^{6} p_i M_i,$$
where $M_i$ is the corresponding number of intronic sites. This model assumes symmetric mutation rates and equivalent base composition in the FEI sites and the region of interest.
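For illustration, the Jukes-Cantor correction converts the observed proportion of differing sites, p, into an estimated number of substitutions per site, K = -(3/4) ln(1 - 4p/3). The Python sketch below applies this standard formula to a pair of aligned sequences; it skips columns containing gaps or ambiguous bases, which is an assumption, and it is not the distmat implementation.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected divergence between two aligned DNA sequences."""
    # Use only columns where both sequences carry an unambiguous base.
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(1 for x, y in pairs if x != y) / len(pairs)
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return float("inf")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# One difference in ten sites: the corrected value is slightly above 0.1.
print(round(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"), 4))
```

The correction matters little at the divergence levels discussed here (about 0.08) but grows quickly as p approaches 0.75.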
We calculated constraint by comparing E to numbers of observed substitutions (O):
C = 1 - O/E
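The calculation can be illustrated with a small Python sketch. It assumes that the expected number of substitutions is obtained by weighting each of the six FEI substitution rates by the corresponding number of sites in the region of interest, and then applies C = 1 - O/E. All counts below are invented for illustration, and the function and dictionary names are hypothetical.

```python
SUBSTITUTION_TYPES = ("A<->T", "A<->C", "A<->G", "T<->C", "T<->G", "C<->G")


def fei_rates(d, n):
    """Per-type divergence in the FEI segment: substitutions divided by sites."""
    return {t: d[t] / n[t] for t in SUBSTITUTION_TYPES}


def expected_substitutions(p, m):
    """Expected substitutions in the region of interest (assumed weighted sum)."""
    return sum(p[t] * m[t] for t in SUBSTITUTION_TYPES)


def constraint(observed, expected):
    """Selective constraint: fraction of substitutions removed relative to neutrality."""
    return 1.0 - observed / expected


# Invented counts, not data from the study.
d = {t: 10 for t in SUBSTITUTION_TYPES}   # FEI substitutions per type
n = {t: 100 for t in SUBSTITUTION_TYPES}  # FEI sites per type
m = {t: 50 for t in SUBSTITUTION_TYPES}   # sites per type in the region of interest

E = expected_substitutions(fei_rates(d, n), m)
print(E, constraint(15, E))
```

A value of C near 0 indicates divergence close to the neutral expectation, whereas a value near 1 indicates that most new mutations have been removed.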
Standard errors and confidence limits for C were calculated by bootstrapping the data values of O and E 1000 times.
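A minimal Python sketch of such a bootstrap is given below. It assumes that introns are the resampling units, that the paired O and E values of each intron are resampled together, and that confidence limits are taken as percentiles of the bootstrap distribution; these choices and the function name are assumptions, since the exact procedure is not spelled out above.

```python
import random


def bootstrap_constraint(observed, expected, n_boot=1000, seed=0):
    """Bootstrap standard error and 95% percentile limits for C = 1 - O/E.

    observed, expected: per-intron lists of observed and expected
    substitution counts (same order, same length).
    """
    rng = random.Random(seed)
    n = len(observed)
    reps = []
    for _ in range(n_boot):
        # Resample introns with replacement, keeping O and E paired.
        idx = [rng.randrange(n) for _ in range(n)]
        o = sum(observed[i] for i in idx)
        e = sum(expected[i] for i in idx)
        reps.append(1.0 - o / e)
    reps.sort()
    mean = sum(reps) / n_boot
    se = (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5
    lower = reps[int(0.025 * n_boot)]
    upper = reps[int(0.975 * n_boot) - 1]
    return se, (lower, upper)


# Invented per-intron values, for illustration only.
se, (lower, upper) = bootstrap_constraint([5, 6, 4, 5], [10, 10, 10, 10])
print(se, lower, upper)
```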
Proportions of nucleotide differences at FEI sites, 4-fold sites and intronic sites were each treated as independent observations and were calculated for the six substitution types (A↔T, A↔C, A↔G, T↔C, T↔G, C↔G). Standard errors and confidence limits for mean divergence were also calculated by bootstrapping the results for FEI, 4-fold and intronic sites.
This work was supported by the National Basic Research Program of China (2006CB101700), National High Technology Research and Development Program of China (2006AA10A102) and the National Natural Science Foundation of China (30471067).
1. Marais G, Nouvellet P, Keightley PD, Charlesworth B: Intron size and exon evolution in Drosophila. Genetics. 2005, 170: 481-485. 10.1534/genetics.104.037333.
2. Hardison RC: Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 2000, 16 (9): 369-372. 10.1016/S0168-9525(00)02081-3.
3. Mattick JS, Gagen MJ: The evolution of controlled multitasked gene networks: the role of introns and other noncoding RNAs in the development of complex organisms. Mol Biol Evol. 2001, 18 (9): 1611-1630.
4. Le SY, Chen JH, Konings D, Maizel JV: Discovering well-ordered folding patterns in nucleotide sequences. Bioinformatics. 2003, 19 (3): 354-361. 10.1093/bioinformatics/btf826.
5. Kaplinsky NJ, Braun DM, Penterman J, Goff SA, Freeling M: Utility and distribution of conserved noncoding sequences in the grasses. Proc Natl Acad Sci USA. 2002, 99 (9): 6147-6151. 10.1073/pnas.052139599.
6. Inada DC, Bashir A, Lee C, Thomas BC, Ko C, Goff SA, Freeling M: Conserved noncoding sequences in the grasses. Genome Res. 2003, 13 (9): 2030-2041. 10.1101/gr.1280703.
7. Guo H, Moose SP: Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell. 2003, 15 (5): 1143-1158. 10.1105/tpc.010181.
8. Thomas BC, Rapaka L, Lyons E, Pedersen B, Freeling M: Arabidopsis intragenomic conserved noncoding sequence. Proc Natl Acad Sci USA. 2007, 104 (9): 3348-3353. 10.1073/pnas.0611574104.
9. Sunkar R, Girke T, Kumar P, Zhu JK: Cloning and characterization of microRNAs from rice. Plant Cell. 2005, 17: 1397-1411. 10.1105/tpc.105.031682.
10. Kimura M, Takahata N: Selective constraint in protein polymorphism: study of the effectively neutral mutation model by using an improved pseudosampling method. Proc Natl Acad Sci USA. 1983, 80 (4): 1048-1052. 10.1073/pnas.80.4.1048.
11. Keightley PD, Gaffney DJ: Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents. Proc Natl Acad Sci USA. 2003, 100 (23): 13402-13406. 10.1073/pnas.2233252100.
12. Halligan DL, Eyre-Walker A, Andolfatto P, Keightley PD: Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res. 2004, 14 (2): 273-279. 10.1101/gr.1329204.
13. Haddrill PR, Charlesworth B, Halligan DL, Andolfatto P: Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content. Genome Biol. 2005, 6 (8): R67. 10.1186/gb-2005-6-8-r67.
14. Gaffney DJ, Keightley PD: Genomic selective constraints in murid noncoding DNA. PLoS Genet. 2006, 2 (11): e204. 10.1371/journal.pgen.0020204.
15. Halligan DL, Keightley PD: Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res. 2006, 16: 875-884. 10.1101/gr.5022906.
16. Shabalina SA, Kondrashov AS: Pattern of selective constraint in C. elegans and C. briggsae genomes. Genet Res. 1999, 74: 23-30. 10.1017/S0016672399003821.
17. Wang J, Keightley PD, Johnson T: MCALIGN2: faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution. BMC Bioinformatics. 2006, 7: 292. 10.1186/1471-2105-7-292.
18. Gazave E, Marques-Bonet T, Fernando O, Charlesworth B, Navarro A: Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol. 2007, 8 (2): R21. 10.1186/gb-2007-8-2-r21.
19. The Rice Chromosomes 11 and 12 Sequencing Consortia: The sequence of rice chromosomes 11 and 12, rich in disease resistance genes and recent gene duplications. BMC Biol. 2005, 3: 20. 10.1186/1741-7007-3-20.
20. Wang X, Shi X, Hao B, Ge S, Luo J: Duplication and DNA segmental loss in the rice genome: implications for diploidization. New Phytol. 2005, 165 (3): 937-946. 10.1111/j.1469-8137.2004.01293.x.
21. Jiang H, Liu D, Gu Z, Wang W: Rapid evolution in a pair of recent duplicate segments of rice. J Exp Zoolog B Mol Dev Evol. 2007, 308 (1): 50-57. 10.1002/jez.b.21122.
22. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C: The Genomes of Oryza sativa: a history of duplications. PLoS Biol. 2005, 3 (2): e38. 10.1371/journal.pbio.0030038.
23. Glazko GV, Nei M: Estimation of Divergence Times for Major Lineages of Primate Species. Mol Biol Evol. 2003, 20 (3): 424-434. 10.1093/molbev/msg050.
24. Hey J, Kliman RM: Population genetics and phylogenetics of DNA sequence variation at multiple loci within the Drosophila melanogaster species complex. Mol Biol Evol. 1993, 10: 804-822.
25. Ma J, Bennetzen JL: Rapid recent growth and divergence of rice nuclear genomes. Proc Natl Acad Sci USA. 2004, 101 (34): 12404-12410. 10.1073/pnas.0403715101.
26. Zhu Q, Ge S: Phylogenetic relationships among A-genome species of the genus Oryza revealed by intron sequences of four nuclear genes. New Phytol. 2005, 167 (1): 249-265. 10.1111/j.1469-8137.2005.01406.x.
27. Gaut BS: Evolutionary dynamics of grass genomes. New Phytol. 2002, 154: 15-28. 10.1046/j.1469-8137.2002.00352.x.
28. Sharp PA: Split genes and RNA splicing. Cell. 1994, 77 (6): 805-815. 10.1016/0092-8674(94)90130-9.
29. Wong GK, Wang J, Tao L, Tan J, Zhang J, Passey DA, Yu J: Compositional gradients in Gramineae genes. Genome Res. 2002, 12 (6): 851-856. 10.1101/gr.189102.
30. Guo X, Bao J, Fan L: Evidence of selectively driven codon usage in rice: implications for GC content evolution of Gramineae genes. FEBS Lett. 2007, 581 (5): 1015-1021. 10.1016/j.febslet.2007.01.088.
31. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
32. Coghlan A, Wolfe KH: Origins of recently gained introns in Caenorhabditis. Proc Natl Acad Sci USA. 2004, 101: 11352-11357. 10.1073/pnas.0308192101.
33. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for multiple sequence alignments. J Mol Biol. 2000, 302: 205-217. 10.1006/jmbi.2000.4042.
34. Lin H, Zhu W, Silva JC, Gu X, Buell CR: Intron gain and loss in segmentally duplicated genes in rice. Genome Biol. 2006, 7: R41. 10.1186/gb-2006-7-5-r41.
35. Knowles DG, McLysaght A: High rate of recent intron gain and loss in simultaneously duplicated Arabidopsis genes. Mol Biol Evol. 2006, 23: 1548-1557. 10.1093/molbev/msl017.
36. Rice P, Longden I, Bleasby A: EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000, 16: 276-277. 10.1016/S0168-9525(00)02024-2.
37. Kondrashov AS, Crow JF: A molecular approach to estimating the human deleterious mutation rate. Hum Mutat. 1993, 2 (3): 229-234. 10.1002/humu.1380020312.
38. Eyre-Walker A, Keightley PD: High genomic deleterious mutation rates in hominids. Nature. 1999, 397 (6717): 344-347. 10.1038/16915.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.