Phylogenetic distribution of large-scale genome patchiness

Background: The phylogenetic distribution of large-scale genome structure (i.e. mosaic compositional patchiness) has been explored mainly by analytical ultracentrifugation of bulk DNA. However, with the availability of large, good-quality chromosome sequences, and of recently developed computational methods that analyze patchiness directly on the genome sequence, an evolutionary comparative analysis can be carried out at the sequence level.
Results: The local variations in the scaling exponent of the Detrended Fluctuation Analysis are used here to analyze large-scale genome structure and to directly uncover the characteristic scales present in genome sequences. Furthermore, through shuffling experiments on selected genome regions, computationally identified, isochore-like regions were shown to be the biological source of the uncovered large-scale genome structure. The phylogenetic distribution of short- and large-scale patchiness was determined in the best-sequenced genome assemblies from eleven eukaryotic genomes: mammals (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris), birds (Gallus gallus), fishes (Danio rerio), invertebrates (Drosophila melanogaster and Caenorhabditis elegans), plants (Arabidopsis thaliana) and yeasts (Saccharomyces cerevisiae). We found large-scale patchiness of genome structure, associated with in silico determined, isochore-like regions, throughout this wide phylogenetic range.
Conclusion: Large-scale genome structure is detected by directly analyzing DNA sequences in a wide range of eukaryotic chromosome sequences, from human to yeast. In all these genomes, large-scale patchiness can be associated with the isochore-like regions, as directly detected in silico at the sequence level.


Background
As soon as genome sequences of sufficient length were available, three groups [1][2][3] independently described powerful methods (power spectra, analysis of fluctuations in DNA walks) to study large-scale genome structure at the sequence level. The emerging view was that DNA sequences show long-range, power-law correlations, which points to a fractal (scale-invariant) structure. However, such a fractal structure, implying the existence of DNA segments of all sizes, directly clashes with the view of the genome as composed of long, homogeneous segments (isochores).
The paradox between a fractal (scale-invariant) and an isochore structure for the genome has recently been resolved in the human genome by the discovery that correlations can show deviations from the power-law behavior [18]. Interestingly, such deviations can be associated with isochore-like regions [19], i.e. long, homogeneous genome regions computationally predicted by directly examining the genome sequence and sharing many compositional and biological features with true isochores [20][21][22][23][24].
In this way, the phylogenetic distribution of large-scale genome patchiness can now be explored by analyzing the deviations from power-law behavior in long-range correlations. The method of choice is Detrended Fluctuation Analysis (DFA); the deviations from the power law can then be revealed by the variations in the local behavior of the scaling exponent α [18,19]. Here, we determined the variation of α at different scales in a wide phylogenetic range of genome sequences. Our analysis clearly distinguishes two characteristic length scales, the larger of which is shown to be unambiguously associated with the isochore-like regions, as detected in silico. The phylogenetic distribution of such patterns provides insights into the evolution of genome compositional heterogeneity.

Two characteristic length scales in human DNA
Long-range correlations are detected by log-log plots of the fluctuation function F(ᐍ) vs. the length scale ᐍ. The scaling exponent α is then given by the slope of the linear fit (see Methods). Figures 1 and 2 show how α(ᐍ) profiles are used to uncover characteristic scales in human DNA sequences. Significant deviations from the power-law behavior reveal two main characteristic scales in a large human sequence contig (chromosome IV, positions 75671304-167795054, 92.1 Mb). The intermediate scale goes from log10 ᐍ > 1.5 (ᐍ > 30 bp) to log10 ᐍ ≈ 4.5 (ᐍ ≈ 30 kb), while the large scale spans up to log10 ᐍ ≈ 7 (ᐍ ≈ 10 Mb). These values are representative of the entire human complement, but when the available contigs were shorter, the large scale typically extended only up to log10 ᐍ ≈ 6 (ᐍ ≈ 1 Mb).

Biological source for the two characteristic scales in human DNA
Short- and long-range genome structure is shaped by a wide range of structural, functional, and evolutionary factors acting on the DNA sequence: mutational bias, transcriptional activity, translational constraints, patterns of gene expression, open or closed state of the chromatin, insertion of repeat elements, etc. (see [7][8][9][10] for review). Because these factors overlap, discerning the role of each of them is a difficult task. Experiments that selectively shuffle some genome regions, while leaving others intact, can be used to approach this problem. Here, we used such selective shuffling to identify the biological source for the two characteristic scales identified in DNA sequences [19]. As an example, Figure 3 shows the results for human chromosome 21. When the sequence segments corresponding to the isochore-like regions (as determined by the algorithm IsoFinder [24]) were internally shuffled, the small-scale properties disappeared, and the α(ᐍ) profile remained flat with a constant value of 0.5, as expected in a random sequence without compositional structure. The only effect of the shuffling at the large scale was a slight increase in the corresponding peak: as the shuffling homogenizes the small scale, the large-scale patchiness becomes more discernible. Therefore, the large-scale properties observed in the α(ᐍ) profile can be unambiguously attributed to the computationally predicted, isochore-like regions [19]. However, when the isochore-like sequences were shuffled internally but transposable elements (TEs) were left untouched, or when only the TEs were shuffled internally, the α(ᐍ) profile was almost identical to the original α(ᐍ) profile of the natural sequence. This means that the TEs are the main contributors to the heterogeneity at the intermediate scale, although introns and exons also make a contribution (see [19] for details).
Figure 1. Detecting long-range correlations through DFA. Log-log plots of the fluctuation function F(ᐍ) vs. the length scale ᐍ. The scaling exponent α is given by the slope of the linear fit. The artificial sequence was generated using a standard method to create long-range correlated sequences [31] by imposing α = 0.90, and the DFA correctly recovers this scaling exponent. The DNA sequence analyzed is the largest contig from human chr. IV mapped into a binary sequence using the SW (strong-weak) mapping rule: C or G → 1, A or T → 0. Its scaling exponent is α_fit = 0.76.

Phylogenetic distribution
As shown above, the α(ᐍ) profile proved to be a simple and powerful method to determine the phylogenetic distribution of large-scale patchiness. Two α(ᐍ) profiles have already been shown for the human genome in Figures 2 and 3.
The numerical values needed to trace the α(ᐍ) profiles for the remaining chromosomes with sequence contigs of good quality and sufficient length from all these species are given in Additional File 1. The phylogenetic distribution of compositional patchiness within vertebrates has traditionally been assessed by ultracentrifugation of bulk DNA [11,13,16], uncovering a clear isochore structure in mammals and birds [6,10] and also in some reptiles [16]. The α(ᐍ) profiles used here reveal large-scale genome patchiness over a wider phylogenetic range, extending to invertebrates (Drosophila and C. elegans), plants (A. thaliana) and yeasts (S. cerevisiae).
Besides the profiles for the natural sequence, all these figures also include the profiles for the corresponding artificial sequence obtained by internally shuffling the isochore-like sequence regions, as predicted by IsoFinder [24]. In all the genomes analyzed, the large scale was practically unaffected (save the peak increase due to the homogenization at the small scales). Therefore, the large-scale patchiness can be associated with the isochore-like regions in all these genomes.
Figure. Detecting deviations from power-law behavior. The profiles for the natural sequence, the artificial sequence obtained after the in silico determined, isochore-like regions were internally shuffled, the artificial sequence obtained by shuffling the isochore-like regions without touching the TEs, and the artificial sequence obtained by shuffling the TEs are shown.

Discussion
Large-scale genome compositional heterogeneity has been traditionally revealed through analytical ultracentrifugation of bulk DNA in a large range of genomes (for recent reviews see [8,10]). However, at the sequence level, such demonstration has been hampered by the lack of a consistent, reliable, and widely applicable method. The α(ᐍ) profiles used here prove to be an excellent tool for this task, allowing us to determine the extent of the different scales of genome compositional heterogeneity in a wide phylogenetic range: yeasts, plants, invertebrates, fishes, birds and mammals. Additionally, the selective shuffling experiments used here to specifically randomize selected genome segments, while not touching others, allow us to identify the in silico determined, isochore-like regions as the biological source for large-scale patchiness throughout the entire range of the species analyzed.
A clear limitation of our approach is that it depends critically on the availability of good-quality sequence contigs of sufficient length. For this reason, we have analyzed only the best genome assemblies from eleven eukaryotic species. However, as these species cover a wide phylogenetic spectrum (mammals, birds, fishes, invertebrates, plants and yeasts), the conclusions should be general.
Figure 4. α(ᐍ) profiles in chimpanzee and dog genomes. The profiles for the natural sequence of an example chromosome from each genome, and the corresponding artificial sequence obtained after the isochore-like regions were internally shuffled, are shown.
In all these genomes, the large-scale patchiness revealed by α(ᐍ) profiles can be associated with the isochore-like genome regions predicted by IsoFinder, thus emphasizing the reliability of this algorithm in predicting isochore-like regions at the sequence level.

Conclusion
The analysis of the deviations in the power-law behavior of long-range correlations, through α(ᐍ) profiles, allowed us to uncover large-scale genome structure in the eleven sequenced genomes for which sequence contigs of sufficient length and quality are available. Furthermore, through selective shuffling experiments, we were able to identify the computationally-determined, isochore-like structure of these genomes as the biological source for such large-scale patchiness.

Methods

Detrended Fluctuation Analysis (DFA) and deviations from perfect power-law behavior
The DFA method [2,27,28] aims to detect and quantify long-range correlations in numerical sequences, and therefore DNA sequences have to be mapped into numerical ones before DFA is applied. Thus, a DNA chain of length N is of the form s_1 s_2 ... s_N, and the s_i values are obtained according to the strong-weak (SW) mapping rule: C or G → s_i = 1, A or T → s_i = 0. Note that this mapping rule is particularly appropriate for analyzing genome-wide correlations, since it corresponds to the most fundamental partitioning of the four bases into their natural pairs in the double helix (A-T and G-C). As expected [29], when the purine-pyrimidine mapping rule was assayed, only structures at the intermediate scale, but not at the large scale, were detected (not shown).
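The SW mapping rule can be sketched in a few lines of Python. This is only an illustration: the function name `sw_map` is hypothetical, and the choice to skip symbols other than A, C, G and T (for example N) is an assumption, since the text does not say how ambiguous bases were handled.

```python
def sw_map(dna):
    """Map a DNA string to a binary series with the strong-weak (SW) rule."""
    strong = {"C", "G"}  # strong bases (G-C pair) map to 1
    weak = {"A", "T"}    # weak bases (A-T pair) map to 0
    series = []
    for base in dna.upper():
        if base in strong:
            series.append(1)
        elif base in weak:
            series.append(0)
        # Assumption: any other symbol (e.g. N) is skipped.
    return series


print(sw_map("ACGTTGCA"))  # [0, 1, 1, 0, 0, 1, 1, 0]
```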
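To apply DFA, we first create the integrated series by accumulating the original one (in the equations, \ell stands for the length scale ᐍ):
$$y(k) = \sum_{i=1}^{k} \left( s_i - \langle s \rangle \right), \qquad k = 1, \ldots, N,$$
where \langle s \rangle is the global mean:
$$\langle s \rangle = \frac{1}{N} \sum_{i=1}^{N} s_i .$$
The integrated series y(1) y(2) ... y(N) is divided into N_W windows of equal length ᐍ, and in each window we fit the integrated series with a linear fit y_fit (the local trend). Then, we detrend the integrated series by subtracting the local trend, obtaining the detrended fluctuation function Y:
$$Y(k) = y(k) - y_{\mathrm{fit}}(k).$$
Finally, we obtain the fluctuation function F(ᐍ) as:
$$F(\ell) = \sqrt{ \frac{1}{N_W \, \ell} \sum_{j=1}^{N_W} \sum_{k=1}^{\ell} Y_j(k)^2 },$$
where Y_j(k) is the k-th point (k = 1 ... ᐍ) within the j-th window (j = 1 ... N_W). Therefore, F(ᐍ) accounts for the average fluctuation of the series around its local trend at scale ᐍ.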
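The DFA steps described above (integration, windowing, linear detrending, root-mean-square fluctuation) can be sketched as follows. This is a minimal, unoptimized illustration in standard-library Python, not the implementation used in the study. The function names are hypothetical, and two details are assumptions: windows do not overlap, and trailing points that do not fill a complete window are dropped.

```python
import math
import random


def integrate(series):
    """Integrated series y(k): running sum of deviations from the global mean."""
    mean = sum(series) / len(series)
    y, total = [], 0.0
    for s in series:
        total += s - mean
        y.append(total)
    return y


def window_sq_residuals(window):
    """Sum of squared residuals around the least-squares line of one window."""
    n = len(window)
    x_mean = (n - 1) / 2.0
    y_mean = sum(window) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(window))
    slope = sxy / sxx
    return sum((v - y_mean - slope * (x - x_mean)) ** 2
               for x, v in enumerate(window))


def dfa_fluctuation(series, scale):
    """Fluctuation F at one window length (non-overlapping windows)."""
    y = integrate(series)
    n_windows = len(y) // scale  # assumption: incomplete last window is dropped
    if scale < 3 or n_windows < 1:
        raise ValueError("scale must be >= 3 and no longer than the series")
    total = sum(window_sq_residuals(y[j * scale:(j + 1) * scale])
                for j in range(n_windows))
    return math.sqrt(total / (n_windows * scale))


# Usage: an uncorrelated binary series; F should grow roughly as scale**0.5.
random.seed(0)
noise = [random.randint(0, 1) for _ in range(4096)]
for scale in (8, 16, 32, 64):
    print(scale, round(dfa_fluctuation(noise, scale), 3))
```

For chromosome-sized sequences a vectorized implementation would be needed; the sketch favors readability over speed.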
Fractal long-range correlations appear when F(ᐍ) scales in the form (in the equations, \ell stands for the length scale ᐍ):
$$F(\ell) \propto \ell^{\alpha}.$$
The exponent α quantifies the type and the strength of the correlations and can be determined by fitting F(ᐍ) vs. ᐍ to a straight line in a log-log plot, the slope being α (Fig. 1). If α = 0.5, there is no correlation and the sequence behaves as a random series (white noise); α < 0.5 indicates anti-correlations, and α > 0.5 positive correlations. However, this procedure can mask information present in the signal (see [19] for details), and we prefer to determine α locally as
$$\alpha(\ell) = \frac{d \log F(\ell)}{d \log \ell}.$$
In this way, deviations from power-law behavior lead to local variations of α, which can be detected by plotting α(ᐍ) vs. ᐍ (Fig. 2). The DNA sequence shown in this figure, corresponding to the long contig of 92.1 Mb from human chromosome IV, clearly shows two main deviations from the power-law behavior, suggesting the presence of two main characteristic scales (the two major peaks in α(ᐍ)) at intermediate and large ᐍ values [19].
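One simple way to estimate the local exponent α(ᐍ) numerically is a central finite difference of log F against log ᐍ. The sketch below is an assumption about the numerical scheme, since the text does not specify how the derivative was computed (see [19] for the authors' procedure). The function name `local_alpha` and the synthetic crossover data are hypothetical.

```python
import math


def local_alpha(scales, fluct):
    """Central-difference estimate of d log F / d log scale at interior scales."""
    log_l = [math.log10(l) for l in scales]
    log_f = [math.log10(f) for f in fluct]
    mids, alphas = [], []
    for i in range(1, len(scales) - 1):
        slope = (log_f[i + 1] - log_f[i - 1]) / (log_l[i + 1] - log_l[i - 1])
        mids.append(scales[i])
        alphas.append(slope)
    return mids, alphas


def synthetic_fluctuation(scale, crossover=100.0):
    """Toy F with exponent 0.5 below the crossover and 0.9 above it."""
    if scale <= crossover:
        return scale ** 0.5
    return crossover ** 0.5 * (scale / crossover) ** 0.9


scales = [10, 20, 40, 80, 160, 320, 640, 1280]
fluct = [synthetic_fluctuation(l) for l in scales]
for scale, alpha in zip(*local_alpha(scales, fluct)):
    print(scale, round(alpha, 3))
```

A single global fit to this toy data would return one intermediate slope, whereas the local profile shows the change from 0.5 to 0.9 around the crossover, which is the kind of deviation the α(ᐍ) profiles are meant to expose.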

Selective shuffling experiments
A shuffled sequence is a random permutation of the original DNA sequence and can be used to test hypotheses concerning any pattern uncovered in DNA sequences. Shuffling can affect the entire sequence or be restricted to certain regions within a sequence, in order to test the influence of such regions on the overall behavior observed. Here, we carried out selective-shuffling experiments by shuffling only those regions in the genome sequence corresponding to the isochore-like regions predicted by IsoFinder [24]. All the patterns within the isochore-like regions are destroyed by such shuffling, while those outside these regions remain intact. Selectivity in the shuffling process can go a step further when certain genome elements within the isochore-like regions (e.g. TEs) are protected from shuffling, thereby also remaining intact while the rest of the region is randomized. We obtained both types of partially shuffled sequences in order to identify the biological source for the two length scales observed in the α(ᐍ) profiles. The coordinates for isochore-like regions identified by IsoFinder on each chromosome sequence of the eleven genomes analyzed in this study are available at the Online Resource on Isochore Mapping [30].
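The selective shuffling described above can be illustrated with a short Python sketch. It is not the code used in the study: the function name `shuffle_regions`, the 0-based half-open `(start, end)` coordinates, and the toy sequence are all assumptions made for the example. Each region is permuted internally, and positions inside protected intervals (e.g. TEs) keep their original bases.

```python
import random


def shuffle_regions(seq, regions, protected=(), rng=None):
    """Shuffle bases inside each region, leaving protected intervals fixed.

    Assumption: regions and protected intervals are 0-based, half-open
    (start, end) pairs.
    """
    rng = rng or random.Random()
    out = list(seq)
    keep = set()
    for start, end in protected:
        keep.update(range(start, end))
    for start, end in regions:
        # Positions of this region that are allowed to move.
        idx = [i for i in range(start, end) if i not in keep]
        bases = [out[i] for i in idx]
        rng.shuffle(bases)  # permute within the region only
        for i, base in zip(idx, bases):
            out[i] = base
    return "".join(out)


seq = "AAAACCCCGGGGTTTTACGT"
# Shuffle one region (positions 4-15) while protecting a TE-like block (8-11).
print(shuffle_regions(seq, [(4, 16)], protected=[(8, 12)], rng=random.Random(0)))
```

Shuffling only the TEs corresponds to passing the TE intervals as `regions` with no protected intervals. Because each region is permuted internally, its base composition is preserved, so the compositional contrast between regions survives while the structure inside them is destroyed. The set of protected positions is convenient for a toy example but would be too memory-hungry for whole chromosomes.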