Comparative genomics of eukaryotic small nucleolar RNAs reveals deep evolutionary ancestry amidst ongoing intragenomic mobility

Background Small nucleolar (sno)RNAs are required for posttranscriptional processing and modification of ribosomal, spliceosomal and messenger RNAs. Their presence in both eukaryotes and archaea indicates that snoRNAs are evolutionarily ancient. The location of some snoRNAs within the introns of ribosomal protein genes has been suggested to belie an RNA world origin, with the exons of the earliest protein-coding genes having evolved around snoRNAs after the advent of templated protein synthesis. Alternatively, this intronic location may reflect more recent selection for coexpression of snoRNAs and ribosomal components, ensuring rRNA modification by snoRNAs during ribosome synthesis. To gain insight into the evolutionary origins of this genetic organization, we examined the antiquity of snoRNA families and the stability of their genomic location across 44 eukaryote genomes. Results We report that dozens of snoRNA families are traceable to the Last Eukaryotic Common Ancestor (LECA), but find only weak similarities between the oldest eukaryotic snoRNAs and archaeal snoRNA-like genes. Moreover, many of these LECA snoRNAs are located within the introns of host genes independently traceable to the LECA. Comparative genomic analyses reveal the intronic location of LECA snoRNAs is not ancestral however, suggesting the pattern we observe is the result of ongoing intragenomic mobility. Analysis of human transcriptome data indicates that the primary requirement for hosting intronic snoRNAs is a broad expression profile. Consistent with ongoing mobility across broadly-expressed genes, we report a case of recent migration of a non-LECA snoRNA from the intron of a ubiquitously expressed non-LECA host gene into the introns of two LECA genes during the evolution of primates. Conclusions Our analyses show that snoRNAs were a well-established family of RNAs at the time when eukaryotes began to diversify. While many are intronic, this association is not evolutionarily stable across the eukaryote tree; ongoing intragenomic mobility has erased signal of their ancestral gene organization, and neither introns-first nor evolved co-expression adequately explain our results. We therefore present a third model — constrained drift — whereby individual snoRNAs are intragenomically mobile and may occupy any genomic location from which expression satisfies phenotype.


Background
Small nucleolar RNAs (snoRNAs) constitute a major class of small RNA in eukaryotes. There are two broad types of snoRNA, which differ in structure and function. H/ACA snoRNAs are characterized by a double stemloop structure, and many members of this class are involved in guiding pseudouridylation of other functional RNAs, most notably rRNA [1,2]. Most C/D snoRNAs are likewise known to be guides, directing 2'-O-methylation of ribose on a broad assortment of RNAs, including ribosomal RNA (rRNA) [3]. While the variety of target molecules is suggestive of an ongoing role in fine tuning or regulating a range of processes [4], including splicing [5], the antiquity of both classes of snoRNA has been established through the identification in Archaea of sno-like RNAs resembling both H/ACA and C/ D family snoRNAs [6][7][8]. Moreover, comparative analysis of modification patterns across Bacteria, Archaea and Eukaryote rRNA indicates that some pseudouridylation and 2'-O-methylation modification sites may predate the diversification of the three domains of life [9,10]. However, bacterial modifications are not generated in a snoRNA-dependent manner, so while comparative genomic and experimental analyses demonstrate that snoR-NAs and their associated proteins appear ubiquitous among major eukaryote and archaeal groups [11,12], there is a disconnect between the inferred antiquity of this class of RNA and conserved site modifications to rRNA. This may indicate that snoRNAs emerged following the divergence of bacteria from archaea and eukaryotes, replacing a protein enzyme-based modification system [13,14]. Alternatively, snoRNAs may have an origin in an RNA world, having coevolved with the early ribosome [15]. As part of the latter theory, it was proposed that the intronic position of snoRNAs may be related to the origin of the first mRNAs. Under this 'introns-first' model ( Figure 1a), exons were recruited from the regions between individual snoRNAs, the latter becoming intronic following the advent of the first protein-coding genes [16,17].
While it is difficult to definitively establish the timing of emergence of snoRNAs, the introns-first model does generate testable predictions. One prediction is that these early RNA-world snoRNAs may still be housed within the introns of the protein-coding genes postulated to have evolved around them. If such an association has been preserved, host genes should be among the oldest genes, traceable to the origin of templated protein synthesis [16,17]. A strong version of this hypothesis is that this association is evolutionarily stable, such that duplication and retrotransposition have not broken up the association. An alternative view is that the intronic position of snoRNAs reflects selection for coexpression [18] (Figure 1b). This model makes no assumption as to the evolutionary and genomic mechanisms that gave rise to such an organization. A strong version of this model is that the relationship between intronic snoRNA and host gene is fixed, as per intronsfirst, but with one key difference: under coexpression, unless individual snoRNAs originated in the optimal genomic environment, they must have been mobile early in their evolutionary history, with selection precluding further mobility once the optimal intronic location has been reached. These hypotheses both suggest intronic position of snoRNAs has been stable across deep evolutionary timescales, so are not mutually exclusive. A third possibility is that the only constraint on snoRNA location is maintenance of an expression profile compatible with snoRNA function.
Some snoRNAs have been shown to be intragenomically mobile [19,20], with the most stunning example being reported from platypus, where a single snoRNA family was found to be present in over 40 000 copies [21]. In support of the evolutionary stability of intronlocated snoRNAs, a recent study revealed that 14% of annotated snoRNAs present in more than one genome are positionally conserved across birds and mammals, and that of these 97% are intronic [22]. These observations indicate that both mobility and positional stablility of snoRNAs are observed in vertebrate genomes. However, for genomic position to be compatible with the predictions resulting from the introns-first model, there would need to be stability across the entire eukaryote tree. Given that large numbers of introns can be readily traced to the root of the eukaryote tree (i.e. the Last Eukaryotic Common Ancestor, LECA) [23], it seems reasonable to expect that a signal of positional stability, if present, should be preserved at this evolutionary depth. While the coexpression model is not mutually incompatible with introns-first, the function of snoRNAs involved in generating ribosomes means conserved intronic snoR-NAs should be associated with broadly-expressed host genes, but that those host genes need not be orthologous. Across birds and mammals, previous analyses [22] indicate that the most conserved snoRNAs are encoded in broadly expressed host genes, potentially compatible with both models.
In order to better understand the evolutionary dynamics of snoRNAs, we sought to address the following questions. 1. Do any snoRNA families trace back to the LECA? 2. Do eukaryotic snoRNAs show evidence of homology with archaeal snoRNA-like sRNAs? 3. Is the intronic location of snoRNAs evolutionarily stable across the eukaryote tree? We report that dozens of snoRNA families can be traced to the LECA. However, none of these LECA snoRNA families show significant levels of similarity with archaeal sno-like RNAs, precluding firm placement of individual snoRNA families in the common ancestor of these two groups. In agreement with previous studies [23], we find numerous introns can be placed in the LECA. A subset of these introns do carry snoRNAs, but close inspection reveals that snoRNAs in equivalent conserved positions are not orthologous. This result is consistent with independent gains of unrelated snoRNAs into orthologous introns. Finally, we report that snoRNA host genes are characterised by broad expression profiles, as judged from expression patterns across 37 human tissues. We therefore conclude that these data best fit ongoing genomic mobility of snoR-NAs, with selection maintaining expression patterns but not genomic location. On the basis of these results we outline a 'constrained drift' model for the evolution of snoRNA-host gene relationships.

Placement of snoRNA families in the LECA
To establish the degree to which snoRNAs are conserved across eukaryotes, we examined the distribution of all Rfam snoRNA families identifiable across 44 eukaryote genomes (see Materials & Methods). Using Dollo parsimony, we reconstructed the pattern of snoRNA conservation across the eukaryote tree using a five supergroup phylogeny, with the root of the eukaryote tree between unikonts and bikonts [24][25][26]. This analysis identified 10 individual snoRNA families and 32 multi-family snoRNA clans traceable to the LECA (Figure 2; see Materials and Methods for definition of an Rfam clan). Of these, 40 of 42 families/clans were C/D box snoRNAs (Additional file 1: Table S1). H/ACA snoR-NAs are generally more difficult to detect owing to shorter, less well-conserved sequence motifs [27][28][29], which may explain the comparatively lower number of conserved families recovered by this approach.
Characterized snoRNAs are heavily biased towards plants and animals [12] (Figure 2b) so these results are dependent on the eukaryote root in Figure 2 being correctly located. However, independent attempts to locate the root of the eukaryote tree using a range of methods consistently place Opisthokonts (animals, fungi and related microbes) on the opposite side of the root from Archaeplastida [24][25][26]30], as per Figure 2, and minor differences in the exact placement of the root across these studies are all nevertheless compatible with the results we present here.
Rfam clans may potentially provide an additional source of information about snoRNA relationships, and could reflect family expansion through processes such as duplication [31] or retrotransposition [32]. In our analysis a clan may be placed in the LECA if two related families are treated as a single characterthis could be the result of either, 1) a functional shift in one part of the eukaryote tree (orthology with a change in function), 2) hidden paralogy (with a change in function), or 3) Figure 1 Models accounting for the intronic location of snoRNAs. A) The introns-first model proposes that the intronic location of snoRNAs dates back to the origin of genetically encoded protein synthesis [17]. In this model, transcription of RNA genes requires subsequent processing of precursor RNAs to release functional RNAs. Processing via splicing leads to the production of processed snoRNAs, with intervening material spliced together. This material is later coopted for stabilisation of tRNA interactions during peptide synthesis, leading to the emergence of the first mRNAs. In this model, the first proteins serve to augment RNA function, so the association between snoRNAs and ribosomal protein-coding genes is historical and evolutionarily stable. B) Intronic organisation of snoRNAs is the result of selection for coexpression of snoRNA and its host gene [18]. This model does not explicitly state what the ancestral condition would have been, and is therefore compatible with this organisation being ancestral (panel A), a more recent de novo origin of the snoRNA, or with this association being the result of snoRNA retrotransposition. artefactual split of a family into smaller groups so as to maintain specificity of the respective covariance models used to define a family [33]. For clans containing individual Rfam families that can each be independently placed in LECA, it may be therefore possible that the clan evolved prior to the diversification of eukaryotes, with families having evolved in the stem by duplication and divergence.
To assess whether clan groupings carry biological signal consistent with duplication and divergence, we examined the three clans where the constituent Rfam families could be independently traced to the LECA ( Figure 3)all three clans are comprised of C/D-box snoRNAs. For SNORD29 (CL00051), 4 of 11 snoRNA families independently traverse the eukaryote root, while for clans SNORD33 (CL00054) and SNORD61 (CL00067) respectively, 5 of 8 and 3 of 3 families predate LECA. Phylogenetic trees cannot be produced for snoR-NAs at this evolutionary depth. We therefore reasoned that if individual snoRNA families traceable to LECA perform modifications at different target sites, this would be consistent with clans having evolved via (See figure on previous page.) Figure 2 A parsimony-based reconstruction reveals that 10 snoRNA families and 32 multi-family clans can be traced back to the Last Eukaryotic Common Ancestor (LECA). A) Radial tree, showing numbers of Rfam families/clans at each node. B) The same result shown as in A, but with Rfam family and clan distribution for individual species displayed. Colors in both panels delineate the five major eukaryotic supergroups for which genomic data are currently available: Opisthokonta (fungi & metazoa), orange; Amoebozoa, yellow; Excavata, purple; Archaeplastida, green; Chromalveolata, dark green. duplication and divergence prior to the diversification of eukaryotes (i.e. pre-LECA). To examine this, for each clan, we took all families and established probable sites of rRNA modification for each family (Materials and Methods). We then compared our results to published experimental data on modification sites in human, yeast and Arabidopsis. We find support for recent expansions (several Rfam families within the SNORD29 clan are human-specific for example), but no family can be traced to duplication events predating the diversification of eukaryotes. Instead, it appears that the division of clans into families may be artefactual for these three cases, and, in the case of the SNORD61 clan, the clan likely represents the orthologous group (Additional file 1: Table S2).
The modification targets of LECA C/D snoRNAs are also traceable to the LECA An implicit assumption in the classification of snoRNAs into families is that each family contains members that perform the equivalent biological function across species. Consequently, if individual snoRNA families can be traced back to the LECA, we may be able to examine to what extent their function is also conserved across eukaryote evolution. However, relative to sequenced genomes/identified snoRNAs, the number of species for which modification sites for rRNA have been experimentally characterized is comparatively low [9]. We therefore sought to establish a computational means by which to predict target sites, given snoRNA and rRNA sequence data from an organism. We chose to focus on C/D snoRNAs for two reasons. First, in our initial Rfambased analyses, the vast majority of snoRNAs putatively in LECA were C/D snoRNAs. Second, the guide sequences of C/D snoRNAs are contiguous; in contrast, H/ACA snoRNAs carry bipartite guide sequences, making computational identification non-trivial.
For each species in our dataset, we took all Rfam C/D family snoRNA entries and performed blasts against the SSU and LSU rRNA sequences from that species. This generated a blast map of potential interactions. As blast can detect similarities for reverse complements, each map detects hits equivalent to the interaction between C/D guide and cognate rRNA. To separate probable target sites from spurious blast hits, we looked for evidence of evolutionary conservation: for every putative modification site, we examined whether the distribution of hits across all species supported placement of that interaction in LECA (using parsimony). We then examined overlap between predicted LECA modification sites and Rfam families or clans previously determined ( Figure 2) to be traceable to the LECA. As shown in Figure 4 there is good correspondence between predicted LECA modifications and Rfam families (Additional file 2).
As these target sites are predicted using a computational strategy, we sought to establish whether they gave reasonable correspondence to independently verified target sites and their cognate snoRNAs [34][35][36]. We therefore compared our set of deeply conserved modifications against known modification sites from human, yeast and Arabidopsis. Under the unikont/bikont rooting, Arabidopsis is on the other side of the root from human The overlap between both data sets contained two familes. Note that, of the original 42 LECA families, four were not considered in this analysis because they either did not belong to the C/D box class (2) or are known to function exclusively in rRNA cleavage (2). Complete results are provided in supplementary material (Additional file 1: Table S3). and yeast, so this gives us an independent means of predicting putative LECA modifications. Table 1 shows the proportion of overlap between experimentally determined modification sites and our computational predictions for Arabidopsis, yeast and human. Of 60 previously reported sites conserved across these three species [34][35][36][37], our comparative genomics approach fully recovers 54 (indicated by an 'X' in the respective column).
To gauge the reliability of the three types of data (snoRNA genes, blast-mapped modification sites and independently determined sites), we compared the results of all three methods ( Figure 5). The intersection between the 42 LECA snoRNAs, 28 LECA blast-mapped sites and 37 independently-mapped LECA modification sites is 25. We note however that this is an underestimate: two LECA snoRNA families (snoU13 and U3) are involved exclusively in rRNA cleavage but not 2'-Omethylation, and two further putative LECA-snoRNAs belong to the H/ACA class (which we did not map against rRNA) -so are automatically excluded from this intersect for this reason. In summary, a stringent, total evidence-based approach ( Figure 5) identifies a minimum of 25 modification guide C/D snoRNA families with conserved function traceable to the root of the eukaryote tree, under the unikont/ bikont topology.

Some LECA snoRNAs have known archaeal counterparts
Archaea and eukaryotes both carry an equivalent H/ACA and C/D RNA-based guide machinery, with a conserved set of associated proteins [11,12]. However, it is not trivial to identify archaeal and eukaryote RNA counterparts using sequence data alone. We therefore took the alternative approach of asking whether any archaeal sno-like sRNAs are responsible for modification at a position in archaeal rRNA equivalent to conserved eukaryotic rRNA modifications. This identified only one case, archaeal sR12, found in both crenarchaea and euryarchaea, which had previously been predicted by Gaspin and co-workers to guide 2'-O-methylation at an equivalent site to yeast snR70 [6]. Our analysis (Figure 2) placed the snoRD43/ snR70 clan (CL00059) in LECA. In order to examine possible homology, we generated a multiple sequence alignment of members of the CL00059 clans and archaeal sR12 sequences using an alignment algorithm optimised for non-coding RNA [38], and sought to establish whether secondary structure was also conserved, using alifold from the Vienna package [39]. The alignment shown in Figure 6 shows similarities in the guide sequence, but we find no clear indication of structural conservation. Consequently, while these findings are consistent with SNORD43 & sR12 being orthologous (and therefore predating the archaeal-eukaryotic divergence), we cannot rule out that possibility that these sequence similarities are a result of convergence given selection for the underlying function. Gaspin and colleagues reported 4 additional cases of modification equivalence, three of which correspond to eukaryotic snoRNA families in our Rfam dataset. However, these could either not be placed in LECA (snR52) or exhibited limited sequence similarity, precluding clear inference of common ancestry (snoRD60/sR36 and snoRD96/sR11).
Testing the introns first hypothesis for the origin of mRNA Available data are consistent with snoRNA-based rRNA processing/modification being present in the common ancestor of archaea and eukaryotes [11,12]. However, an earlier RNA world origin has also been suggested [17,40]. Consistent with this, a small number of experimentally-characterized 2'-O-methylation and pseudouridylation sites on rRNA are conserved across all three domains of life [9], and we note that other RNA modifications likewise appear to have evolved prior to divergence of the three domains [10]. However, an alternative view is that the ancestral system for 2'-O-methylation and pseudouridylation was protein-enzyme based as in extant bacterial lineages, with the snoRNA-based system evolving in response to a need for more extensive modifications in archaea and eukaryotes [13,14]. It is not possible to directly test these alternative scenarios, both of which are plausible. However, one aspect of the 'snoR-NAs-ancient' view is amenable to testing through comparative genomics. Under the introns-first model, the intronic location of snoRNAs is an ancestral state, predating the origin of genetically-encoded protein synthesis. Briefly, the ancestral state is hypothesized to be RNA genes linked on chromosomes; the intervening spaces between these RNAs is later coopted into the role of messenger RNA. This model assumes that processing of individual snoRNAs from larger polycistronic RNA transcripts occurred via a proto-spliceosome [16,17] ( Figure 1a). Under this model, these intervening regions are coopted as the first proto-exons. Consistent with this view, many snoRNAs are housed in the introns of ribosomal protein genes, particularly in vertebrate genomes [22]. Moreover, ribosomal protein genes are one of the most ancient classes of gene in the cell [41], and, significantly, an evolutionary history dominated by vertical descent is apparent for this class of genes [42]. While any scenario for very early evolution is by nature speculative, introns-first predicts that the intronic position of snoRNAs is ancestral. If the intronic location of any snoRNA is conserved across the eukaryote tree, such a pattern would be consistent with introns-first (with the corollary that introns have been lost/reduced to selfsplicing forms in archaea and bacteria; of note, complete   intron loss has been documented in the Hemiselmis andersenii nucleomorph [43]. We therefore sought to establish whether intronic snoRNAs are positionally conserved across eukaryotes. We first screened our dataset of 44 eukaryote genomes for introns containing annotated snoRNAs. Our screen of eukaryote genomes yielded a set of 1782 host genes carrying intronic snoRNAs. Of these, 1091 of the host genes could be placed in LECA on distribution (Additional file 1: Figure S1; Additional file 3).
To establish if individual introns within the cohort of LECA host genes could also be traced to LECA, we reconstructed intron presence in LECA using parsimony (see methods). This yielded 9117/98871 (7.6%) orthologous intron loci that can be traced to LECA (Additional file 1: Figure S2), in broad agreement with other studies indicating high numbers of introns can be traced to LECA [23]. For any intron carrying a snoRNA, we next independently compared intron ancestry with snoRNA intron occupancy across the eukaryote tree. This yielded 22 orthologous LECA introns that also carry a snoRNA ( Figure 7). Finally, we examined Rfam family constituency for all 22 cases across all eukaryote genomes. We reasoned that if all positionally equivalent snoRNAs were from the same family, this would provide an initial indicator of snoRNA orthology.
Strikingly, we find no evidence for common ancestry for any of these 22 cases, based on Rfam family membership (Table 2), and in many cases the snoRNAs are not even of the same type (i.e. H/ACA or C/D), clearly precluding homology. This pattern, while precluding the placement of intronic snoRNAs in LECA, does not in itself indicate snoRNA mobility. For the 22 cases of nonorthologous snoRNA occupancy in Figure 7, we therefore examined snoRNA distribution more closely in an attempt to distinguish ancestral from derived snoRNA occupancy among orthologous introns. As shown in Table 2 none of the putative LECA-intronic snoRNAs can be claimed to trace to the LECA, being restricted for the most part to a few species. Manual inspection of alignments and sequence features for these equivalent intronic snoRNAs from the same class did not reveal Figure 5 Comparison of snoRNAs traceable to LECA ( Figure 1) and those experimentally characterised 2'-O-methylation sites attributable to LECA based on conservation between Arabidopsis thaliana (At) and at least one of Homo sapiens (Hs) and Saccharomyces cerevisiae (Sc). Using these independent lines of evidence, 25 families can be placed in the LECA. Plant snoRNAs are not always recognized by Rfam models for the equivalent animal snoRNA, suggesting that Rfam coverage is currently incomplete (Figure 4; light-grey boxes). Two of the snoRNAs listed here (U3, U13) do not function as modification guides and therefore cannot be mapped to existing sites. Figure 6 Alignment of eukaryote SNORD43 with archaeal sR12 sequences from Pyrococcus and Sulfolobus indicates detectable sequence similarity, suggestive of common ancestry. The alignment does not yield a conserved secondary structure, and therefore the possibility of convergence cannot be ruled out.
any cases of homology. We therefore conclude that these data do not support the evolutionary stability of intronic snoRNAs, predicted under the introns-first hypothesis.

LECA genes hosting intronic snoRNAs are broadly-expressed
While our results demonstrate that intronic location of snoRNAs does not trace to the LECA, it is striking that 61% (1091/1782) of snoRNA-containing host genes in our study can themselves be placed in the LECA (Additional file 1: Figure S1). We were therefore interested to know what features of these broadly-distributed genes might make them an ideal location for intronic snoR-NAs. As rRNA function is essential, modification of rRNA should also be required in all cell types in multicellular organisms. We may therefore expect that the main requirement for host genes is that they should be ubiquitously expressed. We previously showed that for positionally conserved intronic snoRNAs in the bird- Figure 7 Positional equivalence of intronic snoRNAs in LECA genes. We binned snoRNA-carrying introns into discrete loci based on proteinbased alignments of their respective genes. The results suggest that, on parsimony, 22 loci are traceable to the LECA. However, in no case were we able to establish homology of snoRNAs across supergroups (based on sequence, structure or modification site), suggesting independent gain rather than deep ancestry (see text). Colors are as per Figure 2. mammal ancestor, intronic snoRNA-containing genes were broadly expressed, and fundamental processes such as translation were significantly overrepresented amongst this cohort, whereas intronic snoRNAs with a shallower evolutionary association with their host gene (traceable to the primate ancestor) were not overrepresented among broadly-expressed genes [22]. More generally, 'older' genes are known to be more broadly expressed across a range of tissues in vertebrates [44]. We therefore examined the expression pattern of the entire set of 1091 LECA host genes using human expression data [45]. As shown in Figure 8, the distribution of expression entropies of human genes shows that genes hosting an intronic snoRNA are more broadly expressed (high entropy) than non-host genes (Mann-Whitney-U, p < 0.005). We conclude that the preference for integration into the introns of ancient (LECA) host genes is a consequence of these genes being broadly expressed.
Such host genes are compatible with the requirement for broad expression of snoRNAs.

Recent migration of a snoRNA into the intron of a ribosomal protein gene
As none of the non-homologous 22 snoRNAs identified in LECA introns could be definitively attributed to derived mobility-associated gain (to the exclusion of secondary losses), we combed our dataset for intronic snoRNAs that could be unambiguously attributed to gain. Figure 9 shows one such example. SNORA58 is an intronic snoRNA, the ancestral position of which is within the ubiquitin associated protein 2-like gene, UBAP2L. Both UBAP2L and SNORA58 trace to the amniote (mammal/bird) ancestor, but, prior to the divergence of primates, SNORA58 has expanded into the introns of two LECA genes, MRPL3 and NDC1. NDC1 is a nuclear pore complex constituent that traces to the LECA [46], whereas MRPL3 is a nuclear-encoded constituent of the mitochondrial ribosome, and ultimately derives from the bacterial ancestor of the mitochondrion [47][48][49]. Further, the introns in the MRPL3 gene derive from intron gain [50] and, in the case of SNORA58, the host intron is positionally-conserved across opisthokonts and plants. We therefore conclude that SNORA58 has moved into genes traceable to LECA, and that, in the case of MRPL3, this is an irrefutable case of gain of both intron and intronic snoRNA.

Discussion
Under the RNA world hypothesis, that geneticallyencoded proteins and DNA genomes were preceded by functional RNAs and RNA genomes, it has previously been suggested that many RNA-based processes may directly trace back to this early stage in the evolution of life. Such arguments have largely been made on the basis of comparisons of functional RNAs [40,52], though fewer than 1% of known RNA families show evidence of deep evolutionary ancestry [53]. For snoRNAs, while a compelling case can be made for an RNA-world origin [17,40], it is equally possible that this class of RNA has a more recent origin [13,14]. In so far as eukaryotic snoR-NAs have counterparts in archaea, and given that, in both domains, these RNAs interact with a common  Figure 8 Expression entropy (breadth of expression across tissues) based on data from the Norvatis transcriptome atlas [45] shows that deeply conserved, snoRNA carrying host genes are generally more broadly expressed than non-carrying genes.
On the backdrop of a requirement for broad snoRNA expression, this may suggest a selective advantage for this association. A total of 7022 genes were considered, based on usable expression data and reliable correlation to EnsEMBL genes ids in release 58.
set of associated proteins [11,12], the origin of snoRNAs in the common ancestor of both domains appears likely [6]. We aimed to specifically examine the antiquity of snoRNAs using comparative genomics. Our results confirm that individual snoRNA families can be placed in the Last Eukaryotic Common Ancestor. Using a conservative total-evidence approach, incorporating Rfam family conservation, a blast-based comparative genomics approach to identifying target modification sites, and comparison with experimentally-mapped sites, we report that a minimum of 25 C/D type snoRNAs can be placed in the LECA.
Based on a conservative estimate of snoRNAs in LECA, we found three LECA snoRNAs for which a positionally-conserved modification was present in archaea. While the archaeal counterpart was known for all these cases, we deemed the sequence and secondary structure similarities across domains to be insufficient for an inference of homology to be made. While the equivalence of the modification-sites is striking, our analysis was unable to provide stronger evidence in favour of a common origin for individual RNA families. It may simply be the case that, given that snoRNAs are short, evolutionary signal, if it did exist, has been erased over such timescales.
Finally, we aimed to establish whether the intronic location of snoRNAs is consistent with snoRNA gene mobility, with selection favouring colonisation of introns in genes with broad gene expression, or whether position is evolutionarily conserved. We find that, despite intron position being conserved across the eukaryote tree (suggestive of an intron-rich LECA), none of the snoRNAs residing in these putative LECA introns are positionally stable at this evolutionary depth. A specific prediction of the introns-first model is that snoRNAs are found in the introns of the most deeply-conserved protein-coding genes. Evolutionary signal consistent with this model requires the intron, the snoRNA associated with that intron, and the position of the snoRNA within the intron to all be traceable to the LECA. Our analysis therefore rejects this strict version of the introns-first hypothesis. As per the introns-early vs introns-late debate [16,54,55], alternative versions of introns-first are possible, and the snoRNA mobility we observe is not in itself incompatible with an RNA-world origin. However, if, as our data suggest, all trace of any patterns compatible with an early origin have been lost (if they were present at all), an early origin necessarily remains speculative.
We suggest that our results may instead better fit an alternative evolutionary model which we call 'constrained drift' (Figure 10). In this model, individual snoR-NAs can migrate to different genomic locations, provided the overall expression profile is preserved. This model is compatible with a range of genetic organisations, including a single snoRNA expressed from a ubiquitously-expressed host gene, subfunctionalised snoRNAs expressed from host genes with compatible, non-overlapping expression profiles, and independent snoRNA genes with either ubiquitous or subfunctionalised expression profiles. As phenotype is satisfied by all four cases, we predict that genomic location and organisation should be free to drift and is selectively constrained only by the requirement that a ubiquitous expression profile is maintained.
Our analysis of expression profiles of human genes bearing intronic snoRNAs (Figure 8) indicates that snoRNA-containing genes have a broad expression profile. However, these expression data do not enable us to  [47][48][49] and NDC1 [46] can be traced to the LECA, while UBAP2L is traceable to the vertebrate ancestor (data not shown). Expression data [51] indicate all three host genes are widely expressed in amniotes (Additional file 4). Grey squares indicate SNORA58 presence in the intron. The bottom graphic shows the position of SNORA58 in the introns of each gene is indicated by arrowheads. Gene orientation is indicated by < and >. The gene structure in the figure is from human. LECA-Last Eukaryotic Common Ancestor, VA-Vertebrate Ancestor, MA-Mammalian Ancestor, PR -Primate ancestor.
assess individual expression profiles of snoRNAs, and we have not examined snoRNA paralogy, both of which would need to be examined to fully determine whether one of the gene organisation patterns described in Figure 10 dominates. High-throughput expression data may soon enable us to examine this question in more detail.

Conclusions
Using an evolutionary analysis, we have shown that the snoRNA apparatus was well-established in the LECA. While numerous snoRNAs can be traced to the LECA, the intronic location of individual snoRNAs is not stable over large evolutionary timescales, with no evidence for positional conservation traceable to the LECA. The data presented here fit a model of ongoing mobility over shorter evolutionary timescales [19,32], with natural selection acting to constrain drift. As shown in Figure 10, some genomic locations may be selected against if function is compromised. However, a more extreme drift model is plausible, since this role for selection assumes that all modifications are under strong selection, a contention which has not been demonstrated. In light of experimental studies showing weak observable phenotype associated with individual snoRNA knockouts [56][57][58], it will be interesting to establish to what extent the modification of rRNAs by snoRNAs is governed by drift versus selection. For the set of most deeply conserved snoRNAs identified here, we suspect that constrained drift best accounts for their genomic evolution. However, for snoRNAs with a more recent evolutionary history the 'extreme' version of the drift model provides a valuable null hypothesis, which may be particularly helpful in assessing the significance of knockout studies [59][60][61]. While it is possible that snoRNAs have expanded into novel functions [5], it is notable that no other examples have been reported. Rather than representing a cohort of novel functional RNAs, some snoRNAs housed in narrowly-expressed host genes [22] may well be neutral, Figure 10 SnoRNA evolution by constrained drift. In this model, snoRNAs are genomically mobile and may vary in copy number (through processes of retrotransposition or DNA-level duplication). A) An intronic snoRNA (dark red) may be integrated into a new genomic location (orange gene) through retrotransposition. The original copy is retained (light red gene) leading to a copy number increase. B) The genomic location (and copy-number) of snoRNAs is constrained only by the requirement for phenotype to be satisfied, and a range of genomic organisations may satisfy phenotype. Within these bounds, multiple genomic architectures may generate the same phenotypic outcome, so architecture and copy number may be free to drift within these bounds. The top three expression profiles satisfy phenotype, and are therefore viable. Cases 1 & 2 show different host genes with equivalent expression profiles, whereas case 3 exhibits subfunctionalisation, where combined expression from two host genes, each with limited expression profiles, satisfies the required phenotypic expression profile through expression of the equivalent snoRNA. Profile 4 shows two (non-intronic) copies of a snoRNA gene, where the full expression profile is not achieved. An individual with such an expression profile would be eliminated from the population under constrained drift. SnoRNA genes (black squares); exons (rectangles); snoRNA expression profiles (thin black bars). Primes (') indicate a functionally equivalent snoRNA copy. Note that proteins A & B need not be evolutionarily related.
rather than the outcome of neofunctionalisation ( Figure 10). Combining comparative analyses, large-scale expression data and knockout phenotypes will be required to resolve this question.

Genome dataset assembly
We built a custom database, based on a local installation of the EnsEMBL PanCompara database [62]. Additional genome data, including cDNA coordinates and snoRNA annotations, were obtained from one of two sources: where available, these were downloaded directly from the Ensembl database, release 58 [63,64], through a Ruby-based API [65]. For Non-EnsEMBL species, genomes were downloaded from GenBank and parsed into a custom database using bio-ruby version 1.2 [66]. A full list of species is shown in Additional file 1: Table S3. Orthologous genes (identified as described below) from these genomes were subsequently merged into the comparative database.

SnoRNA dataset
Our snoRNA dataset derives from release 10 of the Rfam database [67]. Release 10 can be downloaded from the Rfam website (rfam.sanger.ac.uk). Rfam release 10 allows for RNA families which may perform distinct functions yet share sequence or structural similarity to be grouped into higher units, called clans. Clans are a pragmatic approach to ensuring homologous snoRNAs are not artefactually split into discrete families, and are described in detail elsewhere [67]. Rfam clans are representative of homology, but do not distinguish between orthologous and paralogous relationships. As we used both Rfam clans and families in our analysis, we assessed clans for evidence of paralogy, as described in the section, 'Identification of Putative Modification Sites via Comparative Analyses'.
The genomic location of snoRNAs was determined in several complementary ways. First, existing annotations were imported from the EnsEMBL database [63,64]. These annotations are based either on experimentally verified sequences and derived from specialist databases (including Flybase [68] and Wormbase [69]) or stem from computational predictions using Rfam covariance models (most EnsEMBL genomes) [68]. The latter are built from curated and thresholded seed alignments combined with information on conserved structural motifs, as described in Rfam documentation [67]. Each Rfam family corresponds to a group of significantly similar sequences [67] and was used in our study to establish snoRNA homology across species.
We then checked the literature for additional snoR-NAs not yet included in Rfam, and added these to our data set (Additional file 1: Table S4). Genomic coordinates were determined using BLAT [70], allowing for full-length matches with fewer than 2 substitutions. Where possible, sequences were assigned to an existing Rfam family.
We also used the (unmodified) Rfam annotation script (available from the Rfam FTP server at ftp://ftp.sanger. ac.uk/pub/databases/Rfam/) to identify additional snoRNA candidates across all genomes (unless such annotations were readily available from EnsEMBL) and added these to the final dataset. A full list of annotations with genomic coordinates is available from the authors upon request.

Assessing host gene orthology
Orthologous groups of snoRNA-bearing host genes were constructed in two steps using human genes as seed. First, for all EnsEMBL genomes, orthologs were obtained directly from the EnsEMBL database [71]. For non-EnsEMBL genomes (Additional file 1: Table S3), putative orthologs were identified using the InParanoid algorithm [72]. Each group of orthologs carrying at least one snoRNA (or snoRNA candidate) was subsequently aligned on the protein level using ProbCons [73], yielding a total of 1782 alignments to be used in the reconstruction of ancestral states for snoRNA and intron loci (below). A full list of groups and gene accession numbers is available upon request.

Identification of putative modification sites via comparative analyses
We reasoned that since C/D family snoRNAs possess linear and well-defined guide sequences, it should be possible to identify putative interactions using conserved sequence complementarity between snoRNA and target RNA. Putative ribosomal RNA targets of methylationguide (C/D) snoRNAs (including Infernal-derived candidatessee above) were therefore predicted through a comparative approach that utilised blastn from the Blast package [74]. To ensure specificity, search parameters were set to a word-size (W) of 5 and gap open (G) and gap extension (E) penalties of 50. Curated alignments of small-and large-subunit ribosomal RNA sequences were obtained from the SILVA database [75] (Additional file 1: Table S3). Ribosomal RNA sequences for Arabidopsis thaliana, Homo sapiens and Saccharomyces cerevisiae corresponding to those used in specialist snoRNA databases [34][35][36] and previous publications [9] were added to the alignment using the profile alignment option in ClustalW [76].
All putative rRNA-C/DsnoRNA interactions were binned into loci based on their center position (+/− 2 nucleotides). Loci restricted to a single species were discarded. We then reconstructed the ancestral state for all modification sites across species for families that we previously identified as likely LECA-candidates. Only those candidate sites whose phylogenetic distribution was compatible with presence in LECA were considered, resulting in a total of 40 deeply conserved sites. To gauge the accuracy of this approach, we compared predicted, conserved sites with previously reported methylation sites from specialist databases [34][35][36] (as shown in Figure 4).
We did not perform an equivalent analysis for H/ACA family snoRNAs, since their guide sequences are discontinuous, and therefore non-trivial to predict using the above methods.

Ancestral state reconstruction
We mapped introns and intronic snoRNAs onto aligned proteins using their respective genomic coordinates. Specifically, the position of the first codon in the exon upstream of the intron in question was converted from genomic coordinates into cDNA coordinates and divided by 3. Introns and intronic snoRNAs located in UTRs were excluded for obvious reasons. All positions were then binned into discrete loci within +/− 5 amino acids, as described previously [77].
The reconstruction of ancestral states requires an underlying phylogeny to establish the timing of emergence or degree of positional conservation of introns, modification sites and snoRNAs. To this end, we created a phylogenetic tree for the 44 species used in our analysis based on the supergroup division suggested by Adl and colleagues [78].
Ancestral states of loci and snoRNA families were reconstructed using Dollo Parsimony as implemented in DolloP from the Phylip package [79]. Parsimony is based on the underlying assumption that presence across taxa is due to common ancestry rather then independent gain and that absence is the default state for any feature [80]. It therefore offers a conservative estimate of the evolutionary processes governing snoRNA distribution. While more complex bayesian implementations are available [81], evolutionary models necessary to accurately describe the dynamics of gain and loss of snoRNAs are, to our knowledge, not yet available. We also note that independent gains, while possible, can often be distinguished, since there appear to be 1) many more sites for insertion than snoRNAs, and 2) snoRNAs from different families inserted into the equivalent position can be readily distinguished. This makes independent gains nonequivalent, in contrast to datasets such as intron insertion at proto-splice sites that may confound simple parsimony [50].