- Research article
- Open Access
Comparative genomics of eukaryotic small nucleolar RNAs reveals deep evolutionary ancestry amidst ongoing intragenomic mobility
BMC Evolutionary Biologyvolume 12, Article number: 183 (2012)
Small nucleolar (sno)RNAs are required for posttranscriptional processing and modification of ribosomal, spliceosomal and messenger RNAs. Their presence in both eukaryotes and archaea indicates that snoRNAs are evolutionarily ancient. The location of some snoRNAs within the introns of ribosomal protein genes has been suggested to belie an RNA world origin, with the exons of the earliest protein-coding genes having evolved around snoRNAs after the advent of templated protein synthesis. Alternatively, this intronic location may reflect more recent selection for coexpression of snoRNAs and ribosomal components, ensuring rRNA modification by snoRNAs during ribosome synthesis. To gain insight into the evolutionary origins of this genetic organization, we examined the antiquity of snoRNA families and the stability of their genomic location across 44 eukaryote genomes.
We report that dozens of snoRNA families are traceable to the Last Eukaryotic Common Ancestor (LECA), but find only weak similarities between the oldest eukaryotic snoRNAs and archaeal snoRNA-like genes. Moreover, many of these LECA snoRNAs are located within the introns of host genes independently traceable to the LECA. Comparative genomic analyses reveal the intronic location of LECA snoRNAs is not ancestral however, suggesting the pattern we observe is the result of ongoing intragenomic mobility. Analysis of human transcriptome data indicates that the primary requirement for hosting intronic snoRNAs is a broad expression profile. Consistent with ongoing mobility across broadly-expressed genes, we report a case of recent migration of a non-LECA snoRNA from the intron of a ubiquitously expressed non-LECA host gene into the introns of two LECA genes during the evolution of primates.
Our analyses show that snoRNAs were a well-established family of RNAs at the time when eukaryotes began to diversify. While many are intronic, this association is not evolutionarily stable across the eukaryote tree; ongoing intragenomic mobility has erased signal of their ancestral gene organization, and neither introns-first nor evolved co-expression adequately explain our results. We therefore present a third model — constrained drift — whereby individual snoRNAs are intragenomically mobile and may occupy any genomic location from which expression satisfies phenotype.
Small nucleolar RNAs (snoRNAs) constitute a major class of small RNA in eukaryotes. There are two broad types of snoRNA, which differ in structure and function. H/ACA snoRNAs are characterized by a double stem-loop structure, and many members of this class are involved in guiding pseudouridylation of other functional RNAs, most notably rRNA [1, 2]. Most C/D snoRNAs are likewise known to be guides, directing 2'-O-methylation of ribose on a broad assortment of RNAs, including ribosomal RNA (rRNA) . While the variety of target molecules is suggestive of an ongoing role in fine tuning or regulating a range of processes , including splicing , the antiquity of both classes of snoRNA has been established through the identification in Archaea of sno-like RNAs resembling both H/ACA and C/D family snoRNAs [6–8]. Moreover, comparative analysis of modification patterns across Bacteria, Archaea and Eukaryote rRNA indicates that some pseudouridylation and 2'-O-methylation modification sites may predate the diversification of the three domains of life [9, 10]. However, bacterial modifications are not generated in a snoRNA-dependent manner, so while comparative genomic and experimental analyses demonstrate that snoRNAs and their associated proteins appear ubiquitous among major eukaryote and archaeal groups [11, 12], there is a disconnect between the inferred antiquity of this class of RNA and conserved site modifications to rRNA. This may indicate that snoRNAs emerged following the divergence of bacteria from archaea and eukaryotes, replacing a protein enzyme-based modification system [13, 14]. Alternatively, snoRNAs may have an origin in an RNA world, having coevolved with the early ribosome . As part of the latter theory, it was proposed that the intronic position of snoRNAs may be related to the origin of the first mRNAs. Under this ‘introns-first’ model (Figure 1a), exons were recruited from the regions between individual snoRNAs, the latter becoming intronic following the advent of the first protein-coding genes [16, 17].
While it is difficult to definitively establish the timing of emergence of snoRNAs, the introns-first model does generate testable predictions. One prediction is that these early RNA-world snoRNAs may still be housed within the introns of the protein-coding genes postulated to have evolved around them. If such an association has been preserved, host genes should be among the oldest genes, traceable to the origin of templated protein synthesis [16, 17]. A strong version of this hypothesis is that this association is evolutionarily stable, such that duplication and retrotransposition have not broken up the association. An alternative view is that the intronic position of snoRNAs reflects selection for coexpression  (Figure 1b). This model makes no assumption as to the evolutionary and genomic mechanisms that gave rise to such an organization. A strong version of this model is that the relationship between intronic snoRNA and host gene is fixed, as per introns-first, but with one key difference: under coexpression, unless individual snoRNAs originated in the optimal genomic environment, they must have been mobile early in their evolutionary history, with selection precluding further mobility once the optimal intronic location has been reached. These hypotheses both suggest intronic position of snoRNAs has been stable across deep evolutionary timescales, so are not mutually exclusive. A third possibility is that the only constraint on snoRNA location is maintenance of an expression profile compatible with snoRNA function.
Some snoRNAs have been shown to be intragenomically mobile [19, 20], with the most stunning example being reported from platypus, where a single snoRNA family was found to be present in over 40 000 copies . In support of the evolutionary stability of intron-located snoRNAs, a recent study revealed that 14% of annotated snoRNAs present in more than one genome are positionally conserved across birds and mammals, and that of these 97% are intronic . These observations indicate that both mobility and positional stablility of snoRNAs are observed in vertebrate genomes. However, for genomic position to be compatible with the predictions resulting from the introns-first model, there would need to be stability across the entire eukaryote tree. Given that large numbers of introns can be readily traced to the root of the eukaryote tree (i.e. the Last Eukaryotic Common Ancestor, LECA) , it seems reasonable to expect that a signal of positional stability, if present, should be preserved at this evolutionary depth. While the coexpression model is not mutually incompatible with introns-first, the function of snoRNAs involved in generating ribosomes means conserved intronic snoRNAs should be associated with broadly-expressed host genes, but that those host genes need not be orthologous. Across birds and mammals, previous analyses  indicate that the most conserved snoRNAs are encoded in broadly expressed host genes, potentially compatible with both models.
In order to better understand the evolutionary dynamics of snoRNAs, we sought to address the following questions. 1. Do any snoRNA families trace back to the LECA? 2. Do eukaryotic snoRNAs show evidence of homology with archaeal snoRNA-like sRNAs? 3. Is the intronic location of snoRNAs evolutionarily stable across the eukaryote tree? We report that dozens of snoRNA families can be traced to the LECA. However, none of these LECA snoRNA families show significant levels of similarity with archaeal sno-like RNAs, precluding firm placement of individual snoRNA families in the common ancestor of these two groups. In agreement with previous studies , we find numerous introns can be placed in the LECA. A subset of these introns do carry snoRNAs, but close inspection reveals that snoRNAs in equivalent conserved positions are not orthologous. This result is consistent with independent gains of unrelated snoRNAs into orthologous introns. Finally, we report that snoRNA host genes are characterised by broad expression profiles, as judged from expression patterns across 37 human tissues. We therefore conclude that these data best fit ongoing genomic mobility of snoRNAs, with selection maintaining expression patterns but not genomic location. On the basis of these results we outline a 'constrained drift' model for the evolution of snoRNA-host gene relationships.
Placement of snoRNA families in the LECA
To establish the degree to which snoRNAs are conserved across eukaryotes, we examined the distribution of all Rfam snoRNA families identifiable across 44 eukaryote genomes (see Materials & Methods). Using Dollo parsimony, we reconstructed the pattern of snoRNA conservation across the eukaryote tree using a five supergroup phylogeny, with the root of the eukaryote tree between unikonts and bikonts [24–26]. This analysis identified 10 individual snoRNA families and 32 multi-family snoRNA clans traceable to the LECA (Figure 2; see Materials and Methods for definition of an Rfam clan). Of these, 40 of 42 families/clans were C/D box snoRNAs (Additional file 1: Table S1). H/ACA snoRNAs are generally more difficult to detect owing to shorter, less well-conserved sequence motifs [27–29], which may explain the comparatively lower number of conserved families recovered by this approach.
Characterized snoRNAs are heavily biased towards plants and animals  (Figure 2b) so these results are dependent on the eukaryote root in Figure 2 being correctly located. However, independent attempts to locate the root of the eukaryote tree using a range of methods consistently place Opisthokonts (animals, fungi and related microbes) on the opposite side of the root from Archaeplastida [24–26, 30], as per Figure 2, and minor differences in the exact placement of the root across these studies are all nevertheless compatible with the results we present here.
Rfam clans may potentially provide an additional source of information about snoRNA relationships, and could reflect family expansion through processes such as duplication  or retrotransposition . In our analysis a clan may be placed in the LECA if two related families are treated as a single character – this could be the result of either, 1) a functional shift in one part of the eukaryote tree (orthology with a change in function), 2) hidden paralogy (with a change in function), or 3) artefactual split of a family into smaller groups so as to maintain specificity of the respective covariance models used to define a family . For clans containing individual Rfam families that can each be independently placed in LECA, it may be therefore possible that the clan evolved prior to the diversification of eukaryotes, with families having evolved in the stem by duplication and divergence.
To assess whether clan groupings carry biological signal consistent with duplication and divergence, we examined the three clans where the constituent Rfam families could be independently traced to the LECA (Figure 3) — all three clans are comprised of C/D-box snoRNAs. For SNORD29 (CL00051), 4 of 11 snoRNA families independently traverse the eukaryote root, while for clans SNORD33 (CL00054) and SNORD61 (CL00067) respectively, 5 of 8 and 3 of 3 families predate LECA. Phylogenetic trees cannot be produced for snoRNAs at this evolutionary depth. We therefore reasoned that if individual snoRNA families traceable to LECA perform modifications at different target sites, this would be consistent with clans having evolved via duplication and divergence prior to the diversification of eukaryotes (i.e. pre-LECA). To examine this, for each clan, we took all families and established probable sites of rRNA modification for each family (Materials and Methods). We then compared our results to published experimental data on modification sites in human, yeast and Arabidopsis. We find support for recent expansions (several Rfam families within the SNORD29 clan are human-specific for example), but no family can be traced to duplication events predating the diversification of eukaryotes. Instead, it appears that the division of clans into families may be artefactual for these three cases, and, in the case of the SNORD61 clan, the clan likely represents the orthologous group (Additional file 1: Table S2).
The modification targets of LECA C/D snoRNAs are also traceable to the LECA
An implicit assumption in the classification of snoRNAs into families is that each family contains members that perform the equivalent biological function across species. Consequently, if individual snoRNA families can be traced back to the LECA, we may be able to examine to what extent their function is also conserved across eukaryote evolution. However, relative to sequenced genomes/identified snoRNAs, the number of species for which modification sites for rRNA have been experimentally characterized is comparatively low . We therefore sought to establish a computational means by which to predict target sites, given snoRNA and rRNA sequence data from an organism. We chose to focus on C/D snoRNAs for two reasons. First, in our initial Rfam-based analyses, the vast majority of snoRNAs putatively in LECA were C/D snoRNAs. Second, the guide sequences of C/D snoRNAs are contiguous; in contrast, H/ACA snoRNAs carry bipartite guide sequences, making computational identification non-trivial.
For each species in our dataset, we took all Rfam C/D family snoRNA entries and performed blasts against the SSU and LSU rRNA sequences from that species. This generated a blast map of potential interactions. As blast can detect similarities for reverse complements, each map detects hits equivalent to the interaction between C/D guide and cognate rRNA. To separate probable target sites from spurious blast hits, we looked for evidence of evolutionary conservation: for every putative modification site, we examined whether the distribution of hits across all species supported placement of that interaction in LECA (using parsimony). We then examined overlap between predicted LECA modification sites and Rfam families or clans previously determined (Figure 2) to be traceable to the LECA. As shown in Figure 4 there is good correspondence between predicted LECA modifications and Rfam families (Additional file 2).
As these target sites are predicted using a computational strategy, we sought to establish whether they gave reasonable correspondence to independently verified target sites and their cognate snoRNAs [34–36]. We therefore compared our set of deeply conserved modifications against known modification sites from human, yeast and Arabidopsis. Under the unikont/bikont rooting, Arabidopsis is on the other side of the root from human and yeast, so this gives us an independent means of predicting putative LECA modifications. Table 1 shows the proportion of overlap between experimentally determined modification sites and our computational predictions for Arabidopsis, yeast and human. Of 60 previously reported sites conserved across these three species [34–37], our comparative genomics approach fully recovers 54 (indicated by an 'X' in the respective column).
To gauge the reliability of the three types of data (snoRNA genes, blast-mapped modification sites and independently determined sites), we compared the results of all three methods (Figure 5). The intersection between the 42 LECA snoRNAs, 28 LECA blast-mapped sites and 37 independently-mapped LECA modification sites is 25. We note however that this is an underestimate: two LECA snoRNA families (snoU13 and U3) are involved exclusively in rRNA cleavage but not 2'-O-methylation, and two further putative LECA-snoRNAs belong to the H/ACA class (which we did not map against rRNA) - so are automatically excluded from this intersect for this reason. In summary, a stringent, total evidence-based approach (Figure 5) identifies a minimum of 25 modification guide C/D snoRNA families with conserved function traceable to the root of the eukaryote tree, under the unikont/bikont topology.
Some LECA snoRNAs have known archaeal counterparts
Archaea and eukaryotes both carry an equivalent H/ACA and C/D RNA-based guide machinery, with a conserved set of associated proteins [11, 12]. However, it is not trivial to identify archaeal and eukaryote RNA counterparts using sequence data alone. We therefore took the alternative approach of asking whether any archaeal sno-like sRNAs are responsible for modification at a position in archaeal rRNA equivalent to conserved eukaryotic rRNA modifications. This identified only one case, archaeal sR12, found in both crenarchaea and euryarchaea, which had previously been predicted by Gaspin and co-workers to guide 2'-O-methylation at an equivalent site to yeast snR70 . Our analysis (Figure 2) placed the snoRD43/snR70 clan (CL00059) in LECA. In order to examine possible homology, we generated a multiple sequence alignment of members of the CL00059 clans and archaeal sR12 sequences using an alignment algorithm optimised for non-coding RNA , and sought to establish whether secondary structure was also conserved, using alifold from the Vienna package . The alignment shown in Figure 6 shows similarities in the guide sequence, but we find no clear indication of structural conservation. Consequently, while these findings are consistent with SNORD43 & sR12 being orthologous (and therefore predating the archaeal-eukaryotic divergence), we cannot rule out that possibility that these sequence similarities are a result of convergence given selection for the underlying function. Gaspin and colleagues reported 4 additional cases of modification equivalence, three of which correspond to eukaryotic snoRNA families in our Rfam dataset. However, these could either not be placed in LECA (snR52) or exhibited limited sequence similarity, precluding clear inference of common ancestry (snoRD60/sR36 and snoRD96/sR11).
Testing the introns first hypothesis for the origin of mRNA
Available data are consistent with snoRNA-based rRNA processing/modification being present in the common ancestor of archaea and eukaryotes [11, 12]. However, an earlier RNA world origin has also been suggested [17, 40]. Consistent with this, a small number of experimentally-characterized 2'-O-methylation and pseudouridylation sites on rRNA are conserved across all three domains of life , and we note that other RNA modifications likewise appear to have evolved prior to divergence of the three domains . However, an alternative view is that the ancestral system for 2'-O-methylation and pseudouridylation was protein-enzyme based as in extant bacterial lineages, with the snoRNA-based system evolving in response to a need for more extensive modifications in archaea and eukaryotes [13, 14]. It is not possible to directly test these alternative scenarios, both of which are plausible. However, one aspect of the 'snoRNAs-ancient' view is amenable to testing through comparative genomics. Under the introns-first model, the intronic location of snoRNAs is an ancestral state, predating the origin of genetically-encoded protein synthesis. Briefly, the ancestral state is hypothesized to be RNA genes linked on chromosomes; the intervening spaces between these RNAs is later coopted into the role of messenger RNA. This model assumes that processing of individual snoRNAs from larger polycistronic RNA transcripts occurred via a proto-spliceosome [16, 17] (Figure 1a). Under this model, these intervening regions are coopted as the first proto-exons. Consistent with this view, many snoRNAs are housed in the introns of ribosomal protein genes, particularly in vertebrate genomes . Moreover, ribosomal protein genes are one of the most ancient classes of gene in the cell , and, significantly, an evolutionary history dominated by vertical descent is apparent for this class of genes . While any scenario for very early evolution is by nature speculative, introns-first predicts that the intronic position of snoRNAs is ancestral. If the intronic location of any snoRNA is conserved across the eukaryote tree, such a pattern would be consistent with introns-first (with the corollary that introns have been lost/reduced to self-splicing forms in archaea and bacteria; of note, complete intron loss has been documented in the Hemiselmis andersenii nucleomorph . We therefore sought to establish whether intronic snoRNAs are positionally conserved across eukaryotes.
We first screened our dataset of 44 eukaryote genomes for introns containing annotated snoRNAs. Our screen of eukaryote genomes yielded a set of 1782 host genes carrying intronic snoRNAs. Of these, 1091 of the host genes could be placed in LECA on distribution (Additional file 1: Figure S1; Additional file 3).
To establish if individual introns within the cohort of LECA host genes could also be traced to LECA, we reconstructed intron presence in LECA using parsimony (see methods). This yielded 9117/98871 (7.6%) orthologous intron loci that can be traced to LECA (Additional file 1: Figure S2), in broad agreement with other studies indicating high numbers of introns can be traced to LECA . For any intron carrying a snoRNA, we next independently compared intron ancestry with snoRNA intron occupancy across the eukaryote tree. This yielded 22 orthologous LECA introns that also carry a snoRNA (Figure 7). Finally, we examined Rfam family constituency for all 22 cases across all eukaryote genomes. We reasoned that if all positionally equivalent snoRNAs were from the same family, this would provide an initial indicator of snoRNA orthology.
Strikingly, we find no evidence for common ancestry for any of these 22 cases, based on Rfam family membership (Table 2), and in many cases the snoRNAs are not even of the same type (i.e. H/ACA or C/D), clearly precluding homology. This pattern, while precluding the placement of intronic snoRNAs in LECA, does not in itself indicate snoRNA mobility. For the 22 cases of non-orthologous snoRNA occupancy in Figure 7, we therefore examined snoRNA distribution more closely in an attempt to distinguish ancestral from derived snoRNA occupancy among orthologous introns. As shown in Table 2 none of the putative LECA-intronic snoRNAs can be claimed to trace to the LECA, being restricted for the most part to a few species. Manual inspection of alignments and sequence features for these equivalent intronic snoRNAs from the same class did not reveal any cases of homology. We therefore conclude that these data do not support the evolutionary stability of intronic snoRNAs, predicted under the introns-first hypothesis.
LECA genes hosting intronic snoRNAs are broadly-expressed
While our results demonstrate that intronic location of snoRNAs does not trace to the LECA, it is striking that 61% (1091/1782) of snoRNA-containing host genes in our study can themselves be placed in the LECA (Additional file 1: Figure S1). We were therefore interested to know what features of these broadly-distributed genes might make them an ideal location for intronic snoRNAs. As rRNA function is essential, modification of rRNA should also be required in all cell types in multicellular organisms. We may therefore expect that the main requirement for host genes is that they should be ubiquitously expressed. We previously showed that for positionally conserved intronic snoRNAs in the bird-mammal ancestor, intronic snoRNA-containing genes were broadly expressed, and fundamental processes such as translation were significantly overrepresented amongst this cohort, whereas intronic snoRNAs with a shallower evolutionary association with their host gene (traceable to the primate ancestor) were not overrepresented among broadly-expressed genes . More generally, ‘older’ genes are known to be more broadly expressed across a range of tissues in vertebrates . We therefore examined the expression pattern of the entire set of 1091 LECA host genes using human expression data . As shown in Figure 8, the distribution of expression entropies of human genes shows that genes hosting an intronic snoRNA are more broadly expressed (high entropy) than non-host genes (Mann–Whitney-U, p < 0.005). We conclude that the preference for integration into the introns of ancient (LECA) host genes is a consequence of these genes being broadly expressed. Such host genes are compatible with the requirement for broad expression of snoRNAs.
Recent migration of a snoRNA into the intron of a ribosomal protein gene
As none of the non-homologous 22 snoRNAs identified in LECA introns could be definitively attributed to derived mobility-associated gain (to the exclusion of secondary losses), we combed our dataset for intronic snoRNAs that could be unambiguously attributed to gain. Figure 9 shows one such example. SNORA58 is an intronic snoRNA, the ancestral position of which is within the ubiquitin associated protein 2-like gene, UBAP2L. Both UBAP2L and SNORA58 trace to the amniote (mammal/bird) ancestor, but, prior to the divergence of primates, SNORA58 has expanded into the introns of two LECA genes, MRPL3 and NDC1. NDC1 is a nuclear pore complex constituent that traces to the LECA , whereas MRPL3 is a nuclear-encoded constituent of the mitochondrial ribosome, and ultimately derives from the bacterial ancestor of the mitochondrion [47–49]. Further, the introns in the MRPL3 gene derive from intron gain  and, in the case of SNORA58, the host intron is positionally-conserved across opisthokonts and plants. We therefore conclude that SNORA58 has moved into genes traceable to LECA, and that, in the case of MRPL3, this is an irrefutable case of gain of both intron and intronic snoRNA.
Under the RNA world hypothesis, that genetically-encoded proteins and DNA genomes were preceded by functional RNAs and RNA genomes, it has previously been suggested that many RNA-based processes may directly trace back to this early stage in the evolution of life. Such arguments have largely been made on the basis of comparisons of functional RNAs [40, 52], though fewer than 1% of known RNA families show evidence of deep evolutionary ancestry . For snoRNAs, while a compelling case can be made for an RNA-world origin [17, 40], it is equally possible that this class of RNA has a more recent origin [13, 14]. In so far as eukaryotic snoRNAs have counterparts in archaea, and given that, in both domains, these RNAs interact with a common set of associated proteins [11, 12], the origin of snoRNAs in the common ancestor of both domains appears likely .
We aimed to specifically examine the antiquity of snoRNAs using comparative genomics. Our results confirm that individual snoRNA families can be placed in the Last Eukaryotic Common Ancestor. Using a conservative total-evidence approach, incorporating Rfam family conservation, a blast-based comparative genomics approach to identifying target modification sites, and comparison with experimentally-mapped sites, we report that a minimum of 25 C/D type snoRNAs can be placed in the LECA.
Based on a conservative estimate of snoRNAs in LECA, we found three LECA snoRNAs for which a positionally-conserved modification was present in archaea. While the archaeal counterpart was known for all these cases, we deemed the sequence and secondary structure similarities across domains to be insufficient for an inference of homology to be made. While the equivalence of the modification-sites is striking, our analysis was unable to provide stronger evidence in favour of a common origin for individual RNA families. It may simply be the case that, given that snoRNAs are short, evolutionary signal, if it did exist, has been erased over such timescales.
Finally, we aimed to establish whether the intronic location of snoRNAs is consistent with snoRNA gene mobility, with selection favouring colonisation of introns in genes with broad gene expression, or whether position is evolutionarily conserved. We find that, despite intron position being conserved across the eukaryote tree (suggestive of an intron-rich LECA), none of the snoRNAs residing in these putative LECA introns are positionally stable at this evolutionary depth. A specific prediction of the introns-first model is that snoRNAs are found in the introns of the most deeply-conserved protein-coding genes. Evolutionary signal consistent with this model requires the intron, the snoRNA associated with that intron, and the position of the snoRNA within the intron to all be traceable to the LECA. Our analysis therefore rejects this strict version of the introns-first hypothesis. As per the introns-early vs introns-late debate [16, 54, 55], alternative versions of introns-first are possible, and the snoRNA mobility we observe is not in itself incompatible with an RNA-world origin. However, if, as our data suggest, all trace of any patterns compatible with an early origin have been lost (if they were present at all), an early origin necessarily remains speculative.
We suggest that our results may instead better fit an alternative evolutionary model which we call 'constrained drift' (Figure 10). In this model, individual snoRNAs can migrate to different genomic locations, provided the overall expression profile is preserved. This model is compatible with a range of genetic organisations, including a single snoRNA expressed from a ubiquitously-expressed host gene, subfunctionalised snoRNAs expressed from host genes with compatible, non-overlapping expression profiles, and independent snoRNA genes with either ubiquitous or subfunctionalised expression profiles. As phenotype is satisfied by all four cases, we predict that genomic location and organisation should be free to drift and is selectively constrained only by the requirement that a ubiquitous expression profile is maintained.
Our analysis of expression profiles of human genes bearing intronic snoRNAs (Figure 8) indicates that snoRNA-containing genes have a broad expression profile. However, these expression data do not enable us to assess individual expression profiles of snoRNAs, and we have not examined snoRNA paralogy, both of which would need to be examined to fully determine whether one of the gene organisation patterns described in Figure 10 dominates. High-throughput expression data may soon enable us to examine this question in more detail.
Using an evolutionary analysis, we have shown that the snoRNA apparatus was well-established in the LECA. While numerous snoRNAs can be traced to the LECA, the intronic location of individual snoRNAs is not stable over large evolutionary timescales, with no evidence for positional conservation traceable to the LECA. The data presented here fit a model of ongoing mobility over shorter evolutionary timescales [19, 32], with natural selection acting to constrain drift. As shown in Figure 10, some genomic locations may be selected against if function is compromised. However, a more extreme drift model is plausible, since this role for selection assumes that all modifications are under strong selection, a contention which has not been demonstrated. In light of experimental studies showing weak observable phenotype associated with individual snoRNA knockouts [56–58], it will be interesting to establish to what extent the modification of rRNAs by snoRNAs is governed by drift versus selection. For the set of most deeply conserved snoRNAs identified here, we suspect that constrained drift best accounts for their genomic evolution. However, for snoRNAs with a more recent evolutionary history the 'extreme' version of the drift model provides a valuable null hypothesis, which may be particularly helpful in assessing the significance of knockout studies [59–61]. While it is possible that snoRNAs have expanded into novel functions , it is notable that no other examples have been reported. Rather than representing a cohort of novel functional RNAs, some snoRNAs housed in narrowly-expressed host genes  may well be neutral, rather than the outcome of neofunctionalisation (Figure 10). Combining comparative analyses, large-scale expression data and knockout phenotypes will be required to resolve this question.
Genome dataset assembly
We built a custom database, based on a local installation of the EnsEMBL PanCompara database . Additional genome data, including cDNA coordinates and snoRNA annotations, were obtained from one of two sources: where available, these were downloaded directly from the Ensembl database, release 58 [63, 64], through a Ruby-based API . For Non-EnsEMBL species, genomes were downloaded from GenBank and parsed into a custom database using bio-ruby version 1.2 . A full list of species is shown in Additional file 1: Table S3. Orthologous genes (identified as described below) from these genomes were subsequently merged into the comparative database.
Our snoRNA dataset derives from release 10 of the Rfam database . Release 10 can be downloaded from the Rfam website (rfam.sanger.ac.uk). Rfam release 10 allows for RNA families which may perform distinct functions yet share sequence or structural similarity to be grouped into higher units, called clans. Clans are a pragmatic approach to ensuring homologous snoRNAs are not artefactually split into discrete families, and are described in detail elsewhere . Rfam clans are representative of homology, but do not distinguish between orthologous and paralogous relationships. As we used both Rfam clans and families in our analysis, we assessed clans for evidence of paralogy, as described in the section, ‘Identification of Putative Modification Sites via Comparative Analyses’.
The genomic location of snoRNAs was determined in several complementary ways. First, existing annotations were imported from the EnsEMBL database [63, 64]. These annotations are based either on experimentally verified sequences and derived from specialist databases (including Flybase  and Wormbase ) or stem from computational predictions using Rfam covariance models (most EnsEMBL genomes) . The latter are built from curated and thresholded seed alignments combined with information on conserved structural motifs, as described in Rfam documentation . Each Rfam family corresponds to a group of significantly similar sequences  and was used in our study to establish snoRNA homology across species.
We then checked the literature for additional snoRNAs not yet included in Rfam, and added these to our data set (Additional file 1: Table S4). Genomic coordinates were determined using BLAT , allowing for full-length matches with fewer than 2 substitutions. Where possible, sequences were assigned to an existing Rfam family.
We also used the (unmodified) Rfam annotation script (available from the Rfam FTP server at ftp://ftp.sanger.ac.uk/pub/databases/Rfam/) to identify additional snoRNA candidates across all genomes (unless such annotations were readily available from EnsEMBL) and added these to the final dataset. A full list of annotations with genomic coordinates is available from the authors upon request.
Assessing host gene orthology
Orthologous groups of snoRNA-bearing host genes were constructed in two steps using human genes as seed. First, for all EnsEMBL genomes, orthologs were obtained directly from the EnsEMBL database . For non-EnsEMBL genomes (Additional file 1: Table S3), putative orthologs were identified using the InParanoid algorithm . Each group of orthologs carrying at least one snoRNA (or snoRNA candidate) was subsequently aligned on the protein level using ProbCons , yielding a total of 1782 alignments to be used in the reconstruction of ancestral states for snoRNA and intron loci (below). A full list of groups and gene accession numbers is available upon request.
Identification of putative modification sites via comparative analyses
We reasoned that since C/D family snoRNAs possess linear and well-defined guide sequences, it should be possible to identify putative interactions using conserved sequence complementarity between snoRNA and target RNA. Putative ribosomal RNA targets of methylation-guide (C/D) snoRNAs (including Infernal-derived candidates – see above) were therefore predicted through a comparative approach that utilised blastn from the Blast package . To ensure specificity, search parameters were set to a word-size (W) of 5 and gap open (G) and gap extension (E) penalties of 50. Curated alignments of small- and large-subunit ribosomal RNA sequences were obtained from the SILVA database  (Additional file 1: Table S3). Ribosomal RNA sequences for Arabidopsis thaliana Homo sapiens and Saccharomyces cerevisiae corresponding to those used in specialist snoRNA databases [34–36] and previous publications  were added to the alignment using the profile alignment option in ClustalW .
All putative rRNA-C/DsnoRNA interactions were binned into loci based on their center position (+/− 2 nucleotides). Loci restricted to a single species were discarded. We then reconstructed the ancestral state for all modification sites across species for families that we previously identified as likely LECA-candidates. Only those candidate sites whose phylogenetic distribution was compatible with presence in LECA were considered, resulting in a total of 40 deeply conserved sites. To gauge the accuracy of this approach, we compared predicted, conserved sites with previously reported methylation sites from specialist databases [34–36] (as shown in Figure 4).
We did not perform an equivalent analysis for H/ACA family snoRNAs, since their guide sequences are discontinuous, and therefore non-trivial to predict using the above methods.
Ancestral state reconstruction
We mapped introns and intronic snoRNAs onto aligned proteins using their respective genomic coordinates. Specifically, the position of the first codon in the exon upstream of the intron in question was converted from genomic coordinates into cDNA coordinates and divided by 3. Introns and intronic snoRNAs located in UTRs were excluded for obvious reasons. All positions were then binned into discrete loci within +/− 5 amino acids, as described previously .
The reconstruction of ancestral states requires an underlying phylogeny to establish the timing of emergence or degree of positional conservation of introns, modification sites and snoRNAs. To this end, we created a phylogenetic tree for the 44 species used in our analysis based on the supergroup division suggested by Adl and colleagues .
Ancestral states of loci and snoRNA families were reconstructed using Dollo Parsimony as implemented in DolloP from the Phylip package . Parsimony is based on the underlying assumption that presence across taxa is due to common ancestry rather then independent gain and that absence is the default state for any feature . It therefore offers a conservative estimate of the evolutionary processes governing snoRNA distribution. While more complex bayesian implementations are available , evolutionary models necessary to accurately describe the dynamics of gain and loss of snoRNAs are, to our knowledge, not yet available. We also note that independent gains, while possible, can often be distinguished, since there appear to be 1) many more sites for insertion than snoRNAs, and 2) snoRNAs from different families inserted into the equivalent position can be readily distinguished. This makes independent gains non-equivalent, in contrast to datasets such as intron insertion at proto-splice sites that may confound simple parsimony .
Placement of SnoRNAs in LECA
For the 22 snoRNA-carrying introns that we were able to trace back to the LECA, we employed three independent criteria to establish homology of the snoRNAs across major taxonomic groups. First, we checked that all putative orthologous snoRNAs belonged to the same Rfam family or clan (where defined). Second, we manually inspected the primary sequence and checked it for conservation of sequence and structural features using the R-coffee algorithm  and RNAalifold from the Vienna package . Finally, we examined the predicted site of guide-modification to establish whether each orthlogous group of snoRNAs are also likely to carry out equivalent functions across species. This cross-checking was undertaken to ensure our results were not the result of undetected errors in Rfam; none were detected.
Host gene expression profile analyses
Expression data from the human transcriptome atlas  were obtained from the ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress, accession E-TABM-145; . To estimate the global expression level of each gene, we calculated the median array signal for all tissues (removing duplicates, such as brain subsamples). To estimate the expression breadth, we calculated the Shannon entropy as S = −sum (Pi x ln(Pi)), where Pi is the proportion of expression in tissue i, and Pi = Ei/T; Ei is the expression of the gene in tissue i and T (total expression) is the sum of all expression values for tissues (1, …, i).
small nucleolar RNA
Last Eukaryotic Common Ancestor.
Ganot P, Bortolin ML, Kiss T: Site-specific pseudouridine formation in preribosomal RNA is guided by small nucleolar RNAs. Cell. 1997, 89: 799-809. 10.1016/S0092-8674(00)80263-9.
Ni J, Tien AL, Fournier MJ: Small nucleolar RNAs direct site-specific synthesis of pseudouridine in ribosomal RNA. Cell. 1997, 89: 565-573. 10.1016/S0092-8674(00)80238-X.
Kiss-Laszlo Z, Henry Y, Bachellerie JP, Caizergues-Ferrer M, Kiss T: Site-specific ribose methylation of preribosomal RNA: a novel function for small nucleolar RNAs. Cell. 1996, 85: 1077-1088. 10.1016/S0092-8674(00)81308-2.
Bachellerie JP, Cavaillé J, Hüttenhofer A: The expanding snoRNA world. Biochimie. 2002, 84: 775-790. 10.1016/S0300-9084(02)01402-5.
Kishore S, Stamm S: The snoRNA HBII-52 regulates alternative splicing of the serotonin receptor 2C. Science. 2006, 311: 230-232. 10.1126/science.1118265.
Gaspin C, Cavaille J, Erauso G, Bachellerie JP: Archaeal homologs of eukaryotic methylation guide small nucleolar RNAs: lessons from the Pyrococcus genomes. J Mol Biol. 2000, 297: 895-906. 10.1006/jmbi.2000.3593.
Omer AD, Lowe TM, Russell AG, Ebhardt H, Eddy SR, Dennis PP: Homologs of small nucleolar RNAs in Archaea. Science. 2000, 288: 517-522. 10.1126/science.288.5465.517.
Tang TH, Bachellerie JP, Rozhdestvensky T, Bortolin ML, Huber H, Drungowski M, Elge T, Brosius J, Huttenhofer A: Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci USA. 2002, 99: 7536-7541. 10.1073/pnas.112047299.
Ofengand J, Bakin A: Mapping to nucleotide resolution of pseudouridine residues in large subunit ribosomal RNAs from representative eukaryotes, prokaryotes, archaebacteria, mitochondria and chloroplasts. J Mol Biol. 1997, 266: 246-268. 10.1006/jmbi.1996.0737.
Cermakian N, Cedergren R: Modified nucleotides always were: an evolutionary model. Modification and editing of RNA. Edited by: Grosjean HRB. 1998, Washington D.C: ASM Press, 535-541.
Dennis PP, Omer A: Small non-coding RNAs in Archaea. Curr Opin Microbiol. 2005, 8: 685-694. 10.1016/j.mib.2005.10.013.
Gardner PP, Bateman A, Poole AM: SnoPatrol: how many snoRNA genes are there?. J Biol. 2010, 9: 4-10.1186/jbiol211.
Lafontaine DL, Tollervey D: Birth of the snoRNPs: the evolution of the modification-guide snoRNAs. Trends Biochem Sci. 1998, 23: 383-388. 10.1016/S0968-0004(98)01260-2.
Morrissey JP, Tollervey D: Birth of the snoRNPs: the evolution of RNase MRP and the eukaryotic pre-rRNA-processing system. Trends Biochem Sci. 1995, 20: 78-82. 10.1016/S0968-0004(00)88962-8.
Jeffares DC, Poole AM, Penny D: Pre-rRNA processing and the path from the RNA world. Trends Biochem Sci. 1995, 20: 298-299. 10.1016/S0968-0004(00)89053-2.
Penny D, Hoeppner MP, Poole AM, Jeffares DC: An overview of the introns-first theory. J Mol Evol. 2009
Poole AM, Jeffares DC, Penny D: The path from the RNA world. J Mol Evol. 1998, 46: 1-17. 10.1007/PL00006275.
Sollner-Webb B: Novel intron-encoded small nucleolar RNAs. Cell. 1993, 75: 403-405. 10.1016/0092-8674(93)90374-Y.
Weber MJ: Mammalian small nucleolar RNAs are mobile genetic elements. PLoS Genet. 2006, 2: e205-10.1371/journal.pgen.0020205.
Luo Y, Li S: Genome-wide analyses of retrogenes derived from the human box H/ACA snoRNAs. Nucleic Acids Res. 2007, 35: 559-571.
Schmitz J, Zemann A, Churakov G, Kuhl H, Grützner F, Reinhardt R, Brosius J: Retroposed SNOfall–a mammalian-wide comparison of platypus snoRNAs. Genome Res. 2008, 18: 1005-1010. 10.1101/gr.7177908.
Hoeppner MP, White S, Jeffares DC, Poole AM: Evolutionarily stable association of intronic snoRNAs and microRNAs with their host genes. Genome biology and evolution. 2009, 1: 420-428.
Roy SW, Gilbert W: Complex early genes. Proc Natl Acad Sci U S A. 2005, 102: 1986-1991. 10.1073/pnas.0408355101.
Stechmann A, Cavalier-Smith T: Rooting the eukaryote tree by using a derived gene fusion. Science. 2002, 297: 89-91. 10.1126/science.1071196.
Stechmann A, Cavalier-Smith T: The root of the eukaryote tree pinpointed. Curr Biol. 2003, 13: R665-666. 10.1016/S0960-9822(03)00602-X.
Derelle R, Lang BF: Rooting the eukaryotic tree with mitochondrial and bacterial proteins. Mol Biol Evol. 2012, 29: 1277-1289. 10.1093/molbev/msr295.
Schattner P, Barberan-Soler S, Lowe TM: A computational screen for mammalian pseudouridylation guide H/ACA RNAs. RNA. 2006, 12: 15-25. 10.1261/rna.2210406.
Schattner P, Decatur WA, Davis CA, Ares M, Fournier MJ, Lowe TM: Genome-wide searching for pseudouridylation guide snoRNAs: analysis of the Saccharomyces cerevisiae genome. Nucleic Acids Res. 2004, 32: 4281-4296. 10.1093/nar/gkh768.
Freyhult E, Edvardsson S, Tamas I, Moulton V, Poole AM: Fisher: a program for the detection of H/ACA snoRNAs using MFE secondary structure prediction and comparative genomics - assessment and update. BMC Res Notes. 2008, 1: 49-10.1186/1756-0500-1-49.
Katz LA, Grant JR, Parfrey LW, Burleigh JG: Turning the crown upside down: gene tree parsimony roots the eukaryotic tree of life. Syst Biol. 2012, 61: 653-660. 10.1093/sysbio/sys026.
Zemann A, op de Bekke A, Kiefmann M, Brosius J, Schmitz J: Evolution of small nucleolar RNAs in nematodes. Nucleic Acids Res. 2006, 34: 2676-2685. 10.1093/nar/gkl359.
Brosius J: The contribution of RNAs and retroposition to evolutionary novelties. Genetica. 2003, 118: 99-116. 10.1023/A:1024141306559.
Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009, 25: 1335-1337. 10.1093/bioinformatics/btp157.
Brown JW, Echeverria M, Qu LH, Lowe TM, Bachellerie JP, Huttenhofer A, Kastenmayer JP, Green PJ, Shaw P, Marshall DF: Plant snoRNA database. Nucleic Acids Res. 2003, 31: 432-435. 10.1093/nar/gkg009.
Lestrade L, Weber MJ: snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Res. 2006, 34: D158-162. 10.1093/nar/gkj002.
Piekna-Przybylska D, Decatur WA, Fournier MJ: New bioinformatic tools for analysis of nucleotide modifications in eukaryotic rRNA. RNA. 2007, 13: 305-312. 10.1261/rna.373107.
Piekna-Przybylska D, Decatur WA, Fournier MJ: The 3D rRNA modification maps database: with interactive tools for ribosome analysis. Nucleic Acids Res. 2008, 36: D178-183.
Wilm A, Higgins DG, Notredame C: R-Coffee: a method for multiple alignment of non-coding RNA. Nucleic Acids Res. 2008, 36: e52-10.1093/nar/gkn174.
Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF: RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinforma. 2008, 9: 474-10.1186/1471-2105-9-474.
Jeffares DC, Poole AM, Penny D: Relics from the RNA world. J Mol Evol. 1998, 46: 18-36. 10.1007/PL00006280.
Lecompte O, Ripp R, Thierry JC, Moras D, Poch O: Comparative analysis of ribosomal proteins in complete genomes: an example of reductive evolution at the domain scale. Nucleic Acids Res. 2002, 30: 5382-5390. 10.1093/nar/gkf693.
Harris JK, Kelley ST, Spiegelman GB, Pace NR: The genetic core of the universal ancestor. Genome Res. 2003, 13: 407-412. 10.1101/gr.652803.
Lane CE, van den Heuvel K, Kozera C, Curtis BA, Parsons BJ, Bowman S, Archibald JM: Nucleomorph genome of Hemiselmis andersenii reveals complete intron loss and compaction as a driver of protein structure and function. Proc Natl Acad Sci U S A. 2007, 104: 19908-19913. 10.1073/pnas.0707419104.
Milinkovitch MC, Helaers R, Tzika AC: Historical constraints on vertebrate genome evolution. Genome Biol Evol. 2010, 2: 13-18. 10.1093/gbe/evp052.
Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A. 2004, 101: 6062-6067. 10.1073/pnas.0400782101.
Neumann N, Lundin D, Poole AM: Comparative genomic evidence for a complete nuclear pore complex in the last eukaryote common ancestor. PLoS One. 2010, 5: e13241-10.1371/journal.pone.0013241.
Desmond E, Brochier-Armanet C, Forterre P, Gribaldo S: On the last common ancestor and early evolution of eukaryotes: reconstructing the history of mitochondrial ribosomes. Res Microbiol. 2011, 162: 53-70. 10.1016/j.resmic.2010.10.004.
Bonen L, Calixte S: Comparative analysis of bacterial-origin genes for plant mitochondrial ribosomal proteins. Mol Biol Evol. 2006, 23: 701-712.
Smits P, Smeitink JA, van den Heuvel LP, Huynen MA, Ettema TJ: Reconstructing the evolution of the mitochondrial ribosomal proteome. Nucleic Acids Res. 2007, 35: 4686-4703. 10.1093/nar/gkm441.
Yoshihama M, Nakao A, Nguyen HD, Kenmochi N: Analysis of ribosomal protein gene structures: implications for intron evolution. PLoS Genet. 2006, 2: e25-10.1371/journal.pgen.0020025.
Brawand D, Soumillon M, Necsulea A, Julien P, Csardi G, Harrigan P, Weier M, Liechti A, Aximu-Petri A, Kircher M, et al: The evolution of gene expression levels in mammalian organs. Nature. 2011, 478: 343-348. 10.1038/nature10532.
Breaker RR: Riboswitches and the RNA World. Cold Spring Harb Perspect Biol. 2010, 10.1101/cshperspect.a003566.
Hoeppner MP, Gardner PP, Poole AM: Comparative analysis of RNA families reveals distinct repertoires for each domain of life. PLoS Comp Biol. 2012, in press
Rodriguez-Trelles F, Tarrio R, Ayala FJ: Origins and evolution of spliceosomal introns. Annu Rev Genet. 2006, 40: 47-76. 10.1146/annurev.genet.40.110405.090625.
Roy SW, Gilbert W: The evolution of spliceosomal introns: patterns, puzzles and progress. Nat Rev Genet. 2006, 7: 211-221.
Badis G, Fromont-Racine M, Jacquier A: A snoRNA that guides the two most conserved pseudouridine modifications within rRNA confers a growth advantage in yeast. RNA. 2003, 9: 771-779. 10.1261/rna.5240503.
King TH, Liu B, McCully RR, Fournier MJ: Ribosome structure and activity are altered in cells lacking snoRNPs that form pseudouridines in the peptidyl transferase center. Mol Cell. 2003, 11: 425-435. 10.1016/S1097-2765(03)00040-6.
Esguerra J, Warringer J, Blomberg A: Functional importance of individual rRNA 2'-O-ribose methylations revealed by high-resolution phenotyping. RNA. 2008, 14: 649-656. 10.1261/rna.845808.
Runte M, Varon R, Horn D, Horsthemke B, Buiting K: Exclusion of the C/D box snoRNA gene cluster HBII-52 from a major role in Prader-Willi syndrome. Hum Genet. 2005, 116: 228-230. 10.1007/s00439-004-1219-2.
Skryabin BV, Gubar LV, Seeger B, Pfeiffer J, Handel S, Robeck T, Karpova E, Rozhdestvensky TS, Brosius J: Deletion of the MBII-85 snoRNA gene cluster in mice results in postnatal growth retardation. PLoS Genet. 2007, 3: e235-10.1371/journal.pgen.0030235.
Ding F, Li HH, Zhang S, Solomon NM, Camper SA, Cohen P, Francke U: SnoRNA Snord116 (Pwcr1/MBII-85) deletion causes growth deficiency and hyperphagia in mice. PLoS One. 2008, 3: e1709-10.1371/journal.pone.0001709.
Kersey PJ, Lawson D, Birney E, Derwent PS, Haimel M, Herrero J, Keenan S, Kerhornou A, Koscielny G, Kahari A, et al: Ensembl Genomes: extending Ensembl across the taxonomic space. Nucleic Acids Res. 2010, 38: D563-569. 10.1093/nar/gkp871.
Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, et al: Ensembl 2009. Nucleic Acids Res. 2009, 37: D690-697. 10.1093/nar/gkn828.
Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, et al: Ensembl 2011. Nucleic Acids Res. 2011, 39: D800-806. 10.1093/nar/gkq1064.
Strozzi F, Aerts J: A Ruby API to query the Ensembl database for genomic features. Bioinformatics. 2011, 27: 1013-1014. 10.1093/bioinformatics/btr050.
Goto N, Prins P, Nakao M, Bonnal R, Aerts J, Katayama T: BioRuby: bioinformatics software for the Ruby programming language. Bioinformatics. 2010, 26: 2617-2619. 10.1093/bioinformatics/btq475.
Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res. 2011, 39: D141-145. 10.1093/nar/gkq1129.
Drysdale R: FlyBase: a database for the Drosophila research community. Methods Mol Biol. 2008, 420: 45-59. 10.1007/978-1-59745-583-1_3.
Harris TW, Antoshechkin I, Bieri T, Blasiar D, Chan J, Chen WJ, De La Cruz N, Davis P, Duesbury M, Fang R, et al: WormBase: a comprehensive resource for nematode research. Nucleic Acids Res. 2010, 38: D463-467. 10.1093/nar/gkp952.
Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664.
Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009, 19: 327-335.
Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O, Sonnhammer EL: InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010, 38: D196-203. 10.1093/nar/gkp931.
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res. 2005, 15: 330-340. 10.1101/gr.2821705.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glockner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23: 2947-2948. 10.1093/bioinformatics/btm404.
Carmel L, Rogozin IB, Wolf YI, Koonin EV: Patterns of intron gain and conservation in eukaryotic genes. BMC Evol Biol. 2007, 7: 192-10.1186/1471-2148-7-192.
Adl SM, Simpson AG, Farmer MA, Andersen RA, Anderson OR, Barta JR, Bowser SS, Brugerolle G, Fensome RA, Fredericq S, et al: The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J Eukaryot Microbiol. 2005, 52: 399-451. 10.1111/j.1550-7408.2005.00053.x.
Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. http://evolution.genetics.washington.edu/phylip.html,
Farris J: Phylogenetic analysis under Dollo's Law. Syst Biol. 1977, 26: 77-88. 10.1093/sysbio/26.1.77.
Pagel M, Meade A, Barker D: Bayesian estimation of ancestral character states on phylogenies. Syst Biol. 2004, 53: 673-684. 10.1080/10635150490522232.
Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, et al: ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011, 39: D1002-1004. 10.1093/nar/gkq1040.
AMP is supported through a Rutherford Discovery Fellowship, and a Marsden Fund grant, both administered by the Royal Society of New Zealand. Past support to AMP through a Royal Swedish Academy of Sciences Research Fellowship via a grant from the Knut and Alice Wallenberg Foundation is gratefully acknowledged. Past financial support for MPH through the Stockholm University Astrobiology Graduate School is gratefully acknowledged. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
The authors declare that they have no competing interests.
Conceived and designed analyses: MPH, AMP. Performed analyses: MPH. Analyzed the data: MPH, AMP. Wrote the paper: MPH, AMP. Both authors read and approved the final manuscript.
Electronic supplementary material
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.