Phylogenetic distribution of large-scale genome patchiness
© Oliver et al; licensee BioMed Central Ltd. 2008
Received: 15 November 2007
Accepted: 11 April 2008
Published: 11 April 2008
The phylogenetic distribution of large-scale genome structure (i.e. mosaic compositional patchiness) has been explored mainly by analytical ultracentrifugation of bulk DNA. However, with the availability of large, good-quality chromosome sequences, and the recently developed computational methods to directly analyze patchiness on the genome sequence, an evolutionary comparative analysis can be carried out at the sequence level.
The local variations in the scaling exponent of the Detrended Fluctuation Analysis are used here to analyze large-scale genome structure and directly uncover the characteristic scales present in genome sequences. Furthermore, through shuffling experiments on selected genome regions, computationally identified isochore-like regions were shown to be the biological source of the uncovered large-scale genome structure. The phylogenetic distribution of short- and large-scale patchiness was determined in the best-sequenced assemblies of eleven eukaryotic genomes: mammals (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris), birds (Gallus gallus), fishes (Danio rerio), invertebrates (Drosophila melanogaster and Caenorhabditis elegans), plants (Arabidopsis thaliana) and yeasts (Saccharomyces cerevisiae). We found large-scale patchiness of genome structure, associated with in silico determined, isochore-like regions, throughout this wide phylogenetic range.
Large-scale genome structure is detected by directly analyzing DNA sequences in a wide range of eukaryotic chromosome sequences, from human to yeast. In all these genomes, large-scale patchiness can be associated with the isochore-like regions, as directly detected in silico at the sequence level.
As soon as genome sequences of sufficient length were available, three groups [1–3] independently described powerful methods (power spectra, analysis of fluctuations in DNA walks) to study large-scale genome structure at sequence level. The emerging view was the existence of long-range, power-law correlations, thus pointing to fractal (scale-invariant) structure in DNA sequences. However, such fractal structure, implying the existence of DNA segments of all sizes, directly clashes with the view of the genome as composed of long, homogeneous segments (isochores).
Isochores, long (>>300 kb), compositionally fairly homogeneous genome regions of different average GC levels, were uncovered by analytical ultracentrifugation of bulk DNA [4–10]. The phylogenetic distribution of isochores was traditionally studied by centrifugation techniques [7–11], but the analysis of base composition at third codon positions and the comparison of GC content between coding and non-coding sequences [12–17] have also been used.
The paradox between a fractal (scale-invariant) and an isochore structure for the genome has recently been solved in the human genome by the discovery that correlations can show deviations from power-law behavior. Interestingly, such deviations can be associated with isochore-like regions: long, homogeneous genome regions computationally predicted by directly examining the genome sequence and sharing many compositional and biological features with true isochores [20–24].
In this way, the phylogenetic distribution of large-scale genome patchiness can now be explored by analyzing the deviations from power-law behavior in long-range correlations. The method of choice is Detrended Fluctuation Analysis (DFA); the deviations from the power law can then be revealed by the variations in the local behavior of the scaling exponent α [18, 19]. Here, we determined the variation of α at different scales in a wide phylogenetic range of genome sequences. Our analysis clearly distinguishes two characteristic length scales, the larger of which is demonstrated to be unambiguously associated with the isochore-like regions, as detected in silico. The phylogenetic distribution of such patterns provides insights into the evolution of genome compositional heterogeneity.
Two characteristic length scales in human DNA
Biological source for the two characteristic scales in human DNA
The extent of the intermediate scale in the other vertebrate genomes is similar to that observed in the human genome, and also in the Arabidopsis genome, while in the invertebrates (Drosophila and Caenorhabditis) the intermediate scale reaches only 3 kbp. Notably, in the yeast genome this intermediate scale does not exist at all. The beginning of the large scale in all the genomes analyzed ranged from 10 to 30 kbp, while the ending length often extended beyond 100 kbp, or even 1 Mb, depending mainly on the length of the available sequence contigs.
The phylogenetic distribution of compositional patchiness within vertebrates has been traditionally assessed by ultracentrifugation of bulk DNA [11, 13, 16], uncovering a clear isochore structure in mammals and birds [6, 10] and also in some reptiles. The α(ℓ) profiles used here reveal large-scale genome patchiness over a wider phylogenetic range, extending to invertebrates (Drosophila and C. elegans), plants (A. thaliana) and yeasts (S. cerevisiae).
Besides the profiles for the natural sequence, all these figures also include the profiles for the corresponding artificial sequence obtained by internally shuffling the isochore-like sequence regions, as predicted by IsoFinder. In all the genomes analyzed, the large scale was practically unaffected (save the peak increase due to the homogenization at the small scales). Therefore, the large-scale patchiness observed in the α(ℓ) profiles in all these species can be attributed to the in silico determined, isochore-like regions.
Large-scale genome compositional heterogeneity has been traditionally revealed through analytical ultracentrifugation of bulk DNA in a large range of genomes (for recent reviews see [8, 10]). However, at the sequence level, such demonstration has been hampered by the lack of a consistent, reliable, and widely applicable method. The α(ℓ) profiles used here prove to be an excellent tool for this task, allowing us to determine the extent of the different scales of genome compositional heterogeneity in a wide phylogenetic range: yeasts, plants, invertebrates, fishes, birds and mammals. Additionally, the selective shuffling experiments used here to specifically randomize selected genome segments, while not touching others, allow us to identify the in silico determined, isochore-like regions as the biological source for large-scale patchiness throughout the entire range of the species analyzed.
A clear limitation of our approach is that it depends critically on the availability of good-quality sequence contigs of sufficient length. For this reason, we have analyzed only the best genome assemblies from eleven eukaryotic species. However, as these species cover a wide phylogenetic spectrum (mammals, birds, fishes, invertebrates, plants and yeasts) the conclusions should be general.
In all these genomes, the large-scale patchiness revealed by α(ℓ) profiles can be associated with the isochore-like genome regions predicted by IsoFinder, thus emphasizing the reliability of this algorithm in predicting isochore-like regions at the sequence level.
The analysis of the deviations in the power-law behavior of long-range correlations, through α(ℓ) profiles, allowed us to uncover large-scale genome structure in the eleven sequenced genomes for which sequence contigs of sufficient length and quality are available. Furthermore, through selective shuffling experiments, we were able to identify the computationally-determined, isochore-like structure of these genomes as the biological source for such large-scale patchiness.
The following genome assemblies, all having sequence contigs of sufficient length, were downloaded from the UCSC Genome Bioinformatics site: Homo sapiens (hg18), Pan troglodytes (panTro2), Mus musculus (mm8), Rattus norvegicus (rn4), Canis familiaris (canFam2), Gallus gallus (galGal3), Danio rerio (danRer4), Drosophila melanogaster (dm2), Caenorhabditis elegans (ce2), and Saccharomyces cerevisiae (sacCer1). The genome assembly for Arabidopsis thaliana (Arab) was downloaded from the NCBI FTP site.
Detrended Fluctuation Analysis (DFA) and deviations from perfect power-law behavior
The DFA method [2, 27, 28] aims to detect and quantify long-range correlations in numerical sequences; DNA sequences therefore have to be mapped into numerical ones before DFA is applied. Thus, a DNA chain of length N is of the form s_1 s_2 ... s_N, and the s_i values are obtained according to the strong-weak (SW) mapping rule: C or G → s_i = 1, A or T → s_i = 0. Note that this mapping rule is particularly appropriate to analyze genome-wide correlations, since it corresponds to the most fundamental partitioning of the four bases into their natural pairs in the double helix (A-T and G-C). As expected, when the purine-pyrimidine mapping rule was assayed, only structures at intermediate, but not at large, scale were detected (not shown).
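The SW mapping rule can be sketched in a few lines of Python. This is only an illustration: the function name `sw_map` is hypothetical, and the handling of ambiguous symbols such as N (dropped here) is an assumption, since the text does not say how they were treated.

```python
import numpy as np

def sw_map(dna: str) -> np.ndarray:
    """Map a DNA string to 0/1 with the strong-weak rule: C/G -> 1, A/T -> 0."""
    seq = np.frombuffer(dna.upper().encode("ascii"), dtype=np.uint8)
    strong = (seq == ord("C")) | (seq == ord("G"))
    weak = (seq == ord("A")) | (seq == ord("T"))
    # Assumption: symbols other than A, C, G, T (e.g. N) are dropped.
    return strong[strong | weak].astype(np.int8)
```

For example, `sw_map("ACGT")` yields the series 0, 1, 1, 0, which is the numerical sequence s_1 ... s_N that DFA operates on.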
The integrated series y(1) y(2) ... y(N) is divided into N_W windows of equal length ℓ, and in each window we fit the integrated series with a straight line y_fit (the local trend). We then detrend the integrated series by subtracting the local trend, obtaining the detrended fluctuation function Y:
Y(i) = y(i) - y_fit(i)    (3)
Here, Y_j(k) is the k-th point (k = 1 ... ℓ) within the j-th window (j = 1 ... N_W), and the fluctuation function F(ℓ) accounts for the average fluctuation of the series around its local trend at scale ℓ.
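The windowing and detrending steps can be sketched as follows. Two points are assumptions, because the corresponding equations are not reproduced in the text: the integrated series is taken as the cumulative sum of the mean-subtracted series, and F(ℓ) is taken as the root mean square of Y over all points of all complete windows, which is the usual DFA choice. The name `dfa_fluctuation` is hypothetical.

```python
import numpy as np

def dfa_fluctuation(s, window_lengths):
    """Return F(l) for each window length l (each l should be >= 3)."""
    s = np.asarray(s, dtype=float)
    # Assumed integration step: cumulative sum of the mean-subtracted series.
    y = np.cumsum(s - s.mean())
    F = []
    for ell in window_lengths:
        n_w = len(y) // ell                       # number of complete windows
        windows = y[: n_w * ell].reshape(n_w, ell)
        k = np.arange(ell)
        # Least-squares straight line (local trend) fitted in every window at once.
        slope, intercept = np.polyfit(k, windows.T, 1)
        y_fit = np.outer(slope, k) + intercept[:, None]
        Y = windows - y_fit                       # detrended fluctuation, eq. (3)
        # Assumed form of F: RMS of Y over all windows and positions.
        F.append(np.sqrt(np.mean(Y ** 2)))
    return np.array(F)
```

Points left over after the last complete window are simply ignored in this sketch; other conventions (for instance, repeating the procedure from the opposite end of the series) are also in common use.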
Fractal long-range correlations appear when F(ℓ) scales in the form:
F(ℓ) ∝ ℓ^α    (5)
The exponent α quantifies the type and the strength of the correlations and can be determined by fitting F(ℓ) vs. ℓ to a straight line in a log-log plot, the slope being α (Fig. 1). If α = 0.5, there is no correlation and the sequence behaves as a random series (white noise), α < 0.5 indicates anti-correlations, and α > 0.5 positive correlations. However, this procedure can mask information present in the signal (see  for details) and we prefer to determine α as
α(ℓ) = d log (F(ℓ))/d log (ℓ)
In this way, deviations from power-law behavior lead to local variations of α, which can be detected by plotting α(ℓ) vs. ℓ (Fig. 2). The DNA sequence shown in this figure, corresponding to the long contig of 92.1 Mb from human chromosome IV, clearly shows two main deviations from the power-law behavior, suggesting the presence of two main characteristic scales (the two major peaks in α(ℓ)) at intermediate and large ℓ values.
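The difference between the single fitted exponent and the local exponent α(ℓ) can be illustrated with a short sketch. The derivative is approximated here by finite differences on the sampled log F(ℓ) values; this numerical scheme is an assumption, not necessarily the one used for the published profiles, and the function names are hypothetical.

```python
import numpy as np

def global_alpha(window_lengths, F):
    """Single exponent: slope of a straight-line fit in the log-log plot."""
    log_ell = np.log10(np.asarray(window_lengths, dtype=float))
    log_F = np.log10(np.asarray(F, dtype=float))
    return np.polyfit(log_ell, log_F, 1)[0]

def local_alpha(window_lengths, F):
    """Local exponent alpha(l) = d log F / d log l, by finite differences."""
    log_ell = np.log10(np.asarray(window_lengths, dtype=float))
    log_F = np.log10(np.asarray(F, dtype=float))
    # np.gradient accepts unevenly spaced coordinates as its second argument.
    return np.gradient(log_F, log_ell)
```

For a perfect power law both functions return the same value at every scale. When F(ℓ) bends, `global_alpha` averages the bend away, whereas `local_alpha` shows it as a peak or dip at the corresponding ℓ.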
Selective shuffling experiments
A shuffled sequence is a random permutation of the original DNA sequence and can be used to test hypotheses concerning any pattern uncovered in DNA sequences. Shuffling can affect the entire sequence or be restricted to certain regions within it, in order to test the influence of those regions on the overall behavior being observed. Here, we carried out selective-shuffling experiments by shuffling only those regions in the genome sequence corresponding to the isochore-like regions predicted by IsoFinder. All the patterns within the isochore-like regions are destroyed by such shuffling, while those outside these regions remain intact. Selectivity in the shuffling process can go a step further when certain genome elements within the isochore-like regions (e.g. TEs) are protected from shuffling, thereby also remaining intact while the rest of the region is randomized. We obtained both types of partially shuffled sequences in order to identify the biological source for the two length scales observed in the α(ℓ) profiles. The coordinates for isochore-like regions identified by IsoFinder on each chromosome sequence of the eleven genomes analyzed in this study are available at the Online Resource on Isochore Mapping.
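Both kinds of selective shuffling can be sketched as below. The function name, the 0-based half-open (start, end) coordinate convention and the `protected` argument are assumptions made for the illustration; the region coordinates themselves would come from IsoFinder output, which is not parsed here.

```python
import numpy as np

def selective_shuffle(seq, regions, protected=(), seed=None):
    """Shuffle seq internally within each (start, end) region.

    seq is an array of symbols (e.g. the 0/1 series after SW mapping).
    Positions inside any `protected` interval keep their original symbol.
    """
    rng = np.random.default_rng(seed)
    out = np.array(seq, copy=True)
    keep = np.zeros(len(out), dtype=bool)
    for start, end in protected:
        keep[start:end] = True
    for start, end in regions:
        idx = np.arange(start, end)
        idx = idx[~keep[idx]]              # positions free to move
        # Permute symbols among the free positions of this region only,
        # so the composition of every region is preserved.
        out[idx] = rng.permutation(out[idx])
    return out
```

Because each region is permuted on its own, its GC content and boundaries are unchanged; only the arrangement of symbols inside it is randomized, which is what allows the small-scale and large-scale contributions to be separated.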
This work was supported by the Spanish Government (BIO2005-09116-C03-01) and Plan Andaluz de Investigación (CVI-162, P06-FQM-01858, P07-FQM-03163 and TIC-640). The help of David Nesbitt with the English version of the manuscript is also appreciated.
1. Li W, Kaneko K: Long-range Correlations and Partial 1/f Spectrum in a Noncoding DNA Sequence. Europhysics Letters. 1992, 17: 555-660. 10.1209/0295-5075/17/7/014.
2. Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Sciortino F, Simons M, Stanley HE: Long-range correlations in nucleotide sequences. Nature. 1992, 356 (6365): 168-170. 10.1038/356168a0.
3. Voss RF: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett. 1992, 68 (25): 3805-3808. 10.1103/PhysRevLett.68.3805.
4. Macaya G, Thiery JP, Bernardi G: An approach to the organization of eukaryotic genomes at a macromolecular level. Journal of molecular biology. 1976, 108 (1): 237-254. 10.1016/S0022-2836(76)80105-2.
5. Thiery JP, Macaya G, Bernardi G: An analysis of eukaryotic genomes by density gradient centrifugation. Journal of molecular biology. 1976, 108 (1): 219-235. 10.1016/S0022-2836(76)80104-0.
6. Bernardi G, Olofsson B, Filipski J, Zerial M, Salinas J, Cuny G, Meunier-Rotival M, Rodier F: The mosaic genome of warm-blooded vertebrates. Science. 1985, 228 (4702): 953-958. 10.1126/science.4001930.
7. Bernardi G: The human genome: organization and evolutionary history. Annual review of genetics. 1995, 29: 445-476. 10.1146/annurev.ge.29.120195.002305.
8. Bernardi G: The compositional evolution of vertebrate genomes. Gene. 2000, 259 (1-2): 31-43. 10.1016/S0378-1119(00)00441-8.
9. Bernardi G: Isochores and the evolutionary genomics of vertebrates. Gene. 2000, 241 (1): 3-17. 10.1016/S0378-1119(99)00485-0.
10. Bernardi G: Structural and evolutionary genomics. Natural selection in genome evolution. 2004, Amsterdam, Elsevier.
11. Bucciarelli G, Bernardi G, Bernardi G: An ultracentrifugation analysis of two hundred fish genomes. Gene. 2002, 295 (2): 153-162. 10.1016/S0378-1119(02)00733-3.
12. Barakat A, Matassi G, Bernardi G: Distribution of genes in the genome of Arabidopsis thaliana and its implications for the genome organization of plants. Proceedings of the National Academy of Sciences of the United States of America. 1998, 95 (17): 10044-10049. 10.1073/pnas.95.17.10044.
13. Bernardi G, Bernardi G: Compositional patterns in the nuclear genome of cold-blooded vertebrates. Journal of molecular evolution. 1990, 31 (4): 265-281. 10.1007/BF02101122.
14. Fortes GG, Bouza C, Martinez P, Sanchez L: Diversity in isochore structure among cold-blooded vertebrates based on GC content of coding and non-coding sequences. Genetica. 2007, 129 (3): 281-289. 10.1007/s10709-006-0009-2.
15. Hamada K, Horiike T, Ota H, Mizuno K, Shinozawa T: Presence of isochore structures in reptile genomes suggested by the relationship between GC contents of intron regions and those of coding regions. Genes & genetic systems. 2003, 78 (2): 195-198. 10.1266/ggs.78.195.
16. Hughes S, Clay O, Bernardi G: Compositional patterns in reptilian genomes. Gene. 2002, 295 (2): 323-329. 10.1016/S0378-1119(02)00732-1.
17. Chojnowski JL, Franklin J, Katsu Y, Iguchi T, Guillette LJ, Kimball RT, Braun EL: Patterns of vertebrate isochore evolution revealed by comparison of expressed Mammalian, avian, and crocodilian genes. Journal of molecular evolution. 2007, 65 (3): 259-266. 10.1007/s00239-007-9003-2.
18. Viswanathan GM, Buldyrev SV, Havlin S, Stanley HE: Quantification of DNA patchiness using long-range correlation measures. Biophysical journal. 1997, 72 (2 Pt 1): 866-875.
19. Carpena P, Bernaola-Galván P, Coronado AV, Hackenberg M, Oliver JL: Identifying characteristic scales in the human genome. Phys Rev E. 2007, 75: 032903. 10.1103/PhysRevE.75.032903.
20. Nekrutenko A, Li WH: Assessment of compositional heterogeneity within and between eukaryotic genomes. Genome Res. 2000, 10 (12): 1986-1995. 10.1101/gr.10.12.1986.
21. Oliver JL, Bernaola-Galvan P, Carpena P, Roman-Roldan R: Isochore chromosome maps of eukaryotic genomes. Gene. 2001, 276 (1-2): 47-56. 10.1016/S0378-1119(01)00641-2.
22. Oliver JL, Carpena P, Roman-Roldan R, Mata-Balaguer T, Mejias-Romero A, Hackenberg M, Bernaola-Galvan P: Isochore chromosome maps of the human genome. Gene. 2002, 300 (1-2): 117-127. 10.1016/S0378-1119(02)01034-X.
23. Zhang CT, Zhang R: An isochore map of the human genome based on the Z curve method. Gene. 2003, 317 (1-2): 127-135. 10.1016/S0378-1119(03)00665-6.
24. Oliver JL, Carpena P, Hackenberg M, Bernaola-Galvan P: IsoFinder: computational prediction of isochores in genome sequences. Nucleic acids research. 2004, 32: W287-W292. 10.1093/nar/gkh399.
25. The UCSC Genome Bioinformatics. [http://hgdownload.cse.ucsc.edu/downloads.html]
26. NCBI FTP site. [ftp://ftp.ncbi.nih.gov/genomes]
27. Coronado AV, Carpena P: Size effects in correlation measures. Journal of Biological Physics. 2005, 31: 121-133. 10.1007/s10867-005-3126-8.
28. Peng CK, Buldyrev SV, Goldberger AL, Havlin S, Sciortino F, Simons M, Stanley HE: Fractal landscape analysis of DNA walks. Physica A. 1992, 191 (1-4): 25-29. 10.1016/0378-4371(92)90500-P.
29. Bernaola-Galvan P, Carpena P, Roman-Roldan R, Oliver JL: Study of statistical correlations in DNA sequences. Gene. 2002, 300 (1-2): 105-115. 10.1016/S0378-1119(02)01037-5.
30. Online Resource on Isochore Mapping. [http://bioinfo2.ugr.es/isochores/]
31. Makse HA, Havlin S, Schwartz M, Stanley HE: Method for generating long-range correlations for large systems. Physical Review E Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics. 1996, 53 (5): 5445-5449.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.