Analysis of 5’ gene regions reveals extraordinary conservation of novel non-coding sequences in a wide range of animals
© Davies et al. 2015
Received: 9 June 2015
Accepted: 28 September 2015
Published: 19 October 2015
Phylogenetic footprinting is a comparative method based on the principle that functional sequence elements will acquire fewer mutations over time than non-functional sequences. Successful comparisons of distantly related species will thus yield highly important sequence elements likely to serve fundamental biological roles. RNA regulatory elements are less well understood than those in DNA. In this study we use the emerging model organism Nasonia vitripennis, a parasitic wasp, in a comparative analysis against 12 insect genomes to identify deeply conserved non-coding elements (CNEs) conserved in large groups of insects, with a focus on 5’ UTRs and promoter sequences.
We report the identification of 322 CNEs conserved across a broad range of insect orders. The identified regions are associated with regulatory and developmental genes, and contain short footprints revealing aspects of their likely function in translational regulation. The most ancient regions identified in our analysis were all found to overlap transcribed regions of genes, reflecting stronger conservation of translational regulatory elements than transcriptional elements. Further expanding sequence analyses to non-insect species we also report the discovery of, to our knowledge, the two oldest and most ubiquitous CNE’s yet described in the animal kingdom (700 MYA). These ancient conserved non-coding elements are associated with the two ribosomal stalk genes, RPLP1 and RPLP2, and were very likely functional in some of the earliest animals.
We report the identification of the most deeply conserved CNE’s found to date, and several other deeply conserved elements which are without exception, part of 5’ untranslated regions of transcripts, and occur in a number of key translational regulatory genes, highlighting translational regulation of translational regulators as a conserved feature of insect genomes.
The orchestration of gene expression is accomplished through a wide variety of regulatory mechanisms. One of the most well-characterized of these mechanisms is regulation of transcription through the binding of transcription factors to regulatory DNA sequence . The DNA sequences bound by transcription factors are generally short, the average binding site length being around 10 bp in eukaryotes . Other types of regulatory sequences are less well characterized; for example those sequences which become functional when transcribed into RNA. The most well-known example of this kind of regulatory element is perhaps the IRE (iron response element), a hairpin loop found in the mRNA of many genes involved in iron metabolism which helps to maintain iron homeostasis .
Detecting regulatory elements experimentally is time consuming , and identifying appropriate experimental targets may be difficult. Using straightforward computational methods for prediction of regulatory elements also presents issues; for example, prediction of transcription factor binding sites is usually accomplished by scanning a sequence of interest for matches to position-specific scoring matrices (PSSMs). These PSSMs  describe the kinds of, generally short  sequence motifs bound by these proteins. As such, the probability of finding a chance match in a sequence of any considerable length is high, and the majority of predicted transcription factor binding sites are therefore likely to have no functional role; a concept dubbed the ‘futility theorem’ . Many other regulatory elements are also characterized by short sequence motifs, and so identification of these elements through straightforward sequence scanning methods is subject to the same problem.
Phylogenetic footprinting is a method that can greatly reduce the search space when looking for functional regulatory elements . It is based on the principle that functionally important sequence elements are more likely to be conserved over time than less (or non-) functional elements, leaving behind a ‘footprint’ of functionality. This approach can be highly successful at identifying functional regulatory elements, the specificity of detection increasing with the divergence times of the species used; for example >40 % of conserved non-coding elements (CNEs) detected through a human-fugu (454.6 Myr divergence)  comparison showed enhancer activity when tested , as opposed to only 5 % of human-rodent CNEs tested . CNE detection tends to drop sharply at taxonomic boundaries, for example sensitive BLAST analysis shows a clear alignment signal between similar loci of two Drosophila species (~60 Myr), but an almost complete lack of alignment between two more diverged dipteran species (~75 Myr) .
The most deeply conserved CNEs detected to date originated before the divergence of deuterostomes and protostomes, only four examples of which have been found so far. The first two sequences of this kind to be discovered were found conserved between a variety of deuterostomes and a cnidarian, Nematostella vectensis , dating back over 670 million years . The other two conserved sequences that predate this split were, unlike the other two sequences, found to be present in species belonging to both Deuterostomia and Protostomia  and date back at least 600 million years .
Here, we took advantage of the recent releases of various insect genomes to identify novel regulatory elements conserved across large (180–700 myr) evolutionary distances. The majority of phylogenetic footprinting studies in insects use the model organism Drosophila melanogaster as a central comparison species, aimed at finding regulatory elements conserved within the fast-evolving  order Diptera. For a new perspective, we here use the emerging model organism Nasonia vitripennis, a member of the more slowly evolving order Hymenoptera, as a central comparison species to identify conserved regulatory elements. The aim of this study was to characterize a small subset of deeply conserved sequences in the upstream region of genes, thus potentially capturing both novel transcriptional and translational regulatory elements. By using a sensitive alignment algorithm (see Methods)  and ensuring that our analysis was conducted with a low false discovery rate, we identified a set of conserved sequences. Among the sequences that we identified are both known regulatory elements and a variety of novel regulatory elements on or near genes with core regulatory or developmental roles, some of which could potentially represent novel classes of RNA regulatory elements. We use our set of CNEs to examine the nature of conserved regulatory elements and their evolution. We also report the discovery of, to our knowledge, the two most deeply and ubiquitously conserved regulatory elements yet identified in the animal kingdom which date back to the radiations of basal animal phyla and are likely over 670  and 700 million years old  respectively.
Identification of deeply conserved non-coding elements
As a control, we aligned pairs of randomly matched upstream non-coding sequences. The number of ‘conserved’ sequences detected in the control set at various alignment score thresholds can therefore be used to estimate the false discovery rate. We adjusted the algorithm parameters such that no conservation at all was detected in the control, and then used these parameters to align the truly orthologous sequences. Sequences were pre-filtered for repetitive regions, and post-filtered for similarities to known coding sequences. At this very strict level of false discovery, we detected 322 CNEs on or near 276 genes (Fig. 1b). Each of these genes was given a combined conservation score (CCS) in the interval 0–1, where anything above zero is considered statistically significant and one represents particularly strong conservation (see Methods).
Since the most closely related species in the analysis (the three Hymenopteran species) diverged from Nasonia approximately 180 million years ago , all of the 322 CNEs have been conserved for at least this long. The CNEs found in Hymenoptera tend to be found in more than just two species; ~58 % of the hymenoptera-specific CNEs are conserved in three species or more (N. vitripennis and two others). Focusing on the 276 genes with an associated CNE, we expanded the analysis to a wider range of animals. A handful of CNEs were found to be conserved at greater evolutionary distances; 20 CNEs on or near 18 genes were found to have been conserved for at least 350 million years (i.e. the common ancestor of Holometabola) . Of these, one CNE dates back to the common ancestor of Mandibulata (myriapods, crustaceans, and hexapods), and 2 CNEs date further back to the radiations of basal animal phyla (Cnidaria, Placozoa, Ctenophora, and Porifera). These 20 anciently conserved CNEs exhibit a high degree of overlap, with only two being specific to N. vitripennis and one other species.
Relative position of CNEs is conserved along with sequence
In order to investigate the properties of the CNEs, we performed a series of analyses comparing the CNEs with a control set of sequences. To obtain these control sequences, we adjusted the parameters of the algorithm to allow for the capture of false discoveries, and ran the alignments on randomly matched (pseudo-orthologous) pairs of sequences. By setting an appropriate threshold, we extracted and post-filtered a similar number of sequences to the CNEs from the control set, representing the highest scoring non-orthologous sequence alignments. We term these sequences pseudo-CNEs, as they are sequences that have high alignment scores, albeit below our conservation threshold, but lack orthology. As high-scoring non-orthologous sequences, these pseudo-CNEs can be used as a comparison to elucidate important sequence properties about the true, orthologous CNEs, as opposed to comparisons with sequences with randomized properties.
CNEs are tightly associated with developmental and regulatory genes
The 322 CNEs that we identified here are associated with a specific class of genes. We tested for overrepresentation of gene ontology (GO) terms against the genomic background using the annotation information available (see Methods) for each N. vitripennis gene associated with a conserved region. 319 terms were significantly overrepresented with a q-value below 0.01. The most overrepresented term in the set (Additional file 2: Table S1) was ‘regulation of gene expression’ (q < 2.7e-31) which was associated with over a third (36.7 %) of the genes tested. In addition, many significant terms such as ‘nucleic acid binding transcription factor activity’ (q < 3.7e-28, 21.9 % of genes tested) and ‘developmental process’ (q < 8.8e-30, 51.5 % of genes tested) were returned, suggesting that genes associated with upstream conserved regions often themselves have regulatory and/or developmental roles. A set of 28 terms were overrepresented for the set of 359 pseudo-CNEs (Additional file 2: Table S2), albeit with lower significance compared to the CNE set. This suggests that the long highly AT-rich sequences that are picked up in this control have a weak, but detectable association with gene expression and specific processes – an observation not further explored here.
The 20 most deeply conserved sequences (> = 350 Myr) also appear to be associated with a specific class of genes. 14 of these CNEs were found to lie completely within transcribed regions (see Methods), and all 20 were found to overlap transcribed regions by at least a third of the length of the CNE. This enrichment is significant (p < 6.5e-04, hypergeometric test) when compared to the full set of 322, of which only ~70 % overlap transcribed regions by this amount. Remarkably for such a small set of genes, a GO term overrepresentation test turned up 39 significant terms. The 17 genes associated with these 20 CNEs are enriched for genes active in processes such as post-transcriptional regulation of gene expression (q < 6.1e-4), regulation of translation (q < 4.8e-3), and translational elongation (q < 1.2e-2). This list of overrepresented GO terms is, unlike that obtained from the full set of 322 CNEs, devoid of terms relating to transcriptional regulation, matching the shift towards putative translational regulatory CNEs.
5’ UTR CNEs contain conserved secondary structures
The Not1 hairpin loop has a stem sequence of 12 bp, and the CNE containing it is found directly adjacent to the translation start site. The CNE contains two conserved stem sequences with near-perfect complementarity, a weakly conserved apical sequence, and a highly conserved, upstream ATG-containing motif directly adjacent to the translation start site. In N. vitripennis, this CNE is present in the 5’ UTR of all four known transcripts. As the position of this CNE is so strongly conserved, we scanned the first 100 bp of every orthologous transcript in all Ensembl Metazoa species for presence of either the conserved hairpin or for the conserved sequence adjacent to the translation start site. The results of this search (see Methods) indicated that in all cases where the hairpin loop is present the conserved sequence adjacent to the translation start site is present too, but not vice-versa (i.e. the sequence adjacent to the translation start site may exist on its own). The presence of the sequence in the Antarctic krill Euphasia superba (Hunt and Rosato, unpublished data) and in a centipede (Strigamia maritima) shows that this CNE was an early arthropod adaptation.
An uncharacterized gene cluster contains several CNEs
We identified conserved putative regulatory sequences in six separate genes of the insect-specific Osiris gene cluster (Additional file 1: Figure S6). Our analysis indicates that these regions are Hymenoptera-specific, and are generally conserved in position relative to their associated gene. Since the conserved regions are associated with a specific class of genes with core functions, the fact that conserved promoter regions were identified near to six genes in the same cluster is perhaps indicative of an important developmental or regulatory role for this as-yet uncharacterized gene cluster.
Ribosomal stalk gene CNEs date back to early animals
In this paper, we used a high stringency statistical approach to identify and characterize 322 ancient non-coding elements (Additional file 2: Table S4) which have remained conserved over large evolutionary distances. The bulk of the conserved sequences that we identified are specific to Hymenoptera, but nevertheless have been conserved in place for at least 180 million years of insect evolution (which occurs at a faster pace than vertebrate evolution ). A small proportion of the CNEs (20) that we identified were at least 350 million years old, with three stretching back further still. Two CNEs are found conserved in a range of the most basal animal clades across a wide variety of both vertebrates and invertebrates and are likely over 670  and 700 million years old , likely the oldest CNEs described to date.
These two ancient CNEs are located in the 5’ UTRs of two genes that are known to interact with one another; RPLP1 and RPLP2. The two protein products of these ubiquitously expressed genes, P1 and P2, form a heterodimer; two copies of which bind to the 60s acidic ribosomal protein P0 (coded by the gene RPLP0) to form the ribosomal stalk. The ribosomal stalk is involved in translational fine tuning and is crucial for the correct folding of many proteins . The depth and breadth of conservation of these sequences is indicative of a fundamental regulatory role. Indeed, the 5’ UTR of RPLP2 has already been shown to have a regulatory role in Drosophila , being sufficient to confer full translational control unto RPLP2 as a non-translated gene in the early embryo, but not previously known to be conserved among animals. The fact that this CNE has been previously studied and identified as a regulatory element helps to validate the idea that other CNEs that we have identified are also functional regulatory elements. In Drosophila, the CNE we identified essentially spans the entirety of the RPLP2 5’ UTR, whereas in other organisms it is only a constituent part. In this study, we have characterized the motifs likely to be important for the function of this regulatory element and examined their evolution over time.
Most of the conserved regions identified in our analysis were found to be located within gene bodies as opposed to intergenic space, providing potential insights into a poorly understood class of regulatory elements. Our analysis revealed conserved secondary structures in the 5’ UTRs of several genes, examples including hairpin loops upstream of the Ferritin gene (an IRE), the Paramyosin gene, and the vital  regulatory  gene Not1. Secondary RNA structures such as hairpins can have important regulatory consequences, having the capacity to both enhance and inhibit translation. The effect of a hairpin on translation differs depending on the stability of the hairpin, its distance from the mRNA cap, and GC content . These three hairpins have different fundamental characteristics, and thus likely perform their putative regulatory functions in different ways. The strong (−52.60 kcal/mol), and GC-rich (66 %) Paramyosin hairpin is likely able to present a significant barrier to translation at any distance from the cap, whereas the potential function of the weaker (~ − 15.00 kcal/mol) Not1 hairpin is less obvious. The Not1 hairpin exhibits a complex conservation pattern, with near-perfect stem complementarity, tight positional conservation, and an associated conserved upstream AUG (uAUG), itself a tightly-suppressed  class of regulatory element.
The CNEs that we identified, confirming similar observations in other organisms , are associated with regulatory and developmental genes. This observation is consistent with the idea of regulatory gene cascades, that genes involved in regulation are themselves tightly regulated, allowing master regulators to exert overall control of gene regulation ‘programs’ to reprogram cells . This result is augmented by the more specific observation that the deeply conserved set of 20 CNEs (> = 350 Myr), which are likely to be post-transcriptional regulatory elements, are associated with genes themselves involved in post-transcriptional regulation.
We also identified fundamental differences in basic sequence properties of the CNEs when compared with control sequences. GC content in CNEs is generally elevated (Additional file 1: Figure S8); sharply peaking on the CNE itself but also raised in the flanking regions. This result is informative as GC content is known to be important for regulation; it is associated with regulatory mechanisms such as nucleosome occupancy , aspects of secondary structure stability and effects on translation, and for example in chicken, variance in GC content in the 5’ UTR of genes can perhaps explain 10 % of the variation in expression level .
One important feature of many of the CNEs that we discovered is that their positions relative to the translation start site are conserved, i.e. that the position of the CNE is conserved as well as its sequence (Fig. 3b). When more reliable transcription start site data are available, it will be possible to examine whether putative transcriptional regulatory mechanisms that we identified are conserved relative to the transcription start site, or whether some of the translational regulatory mechanisms that we identified are more closely associated with the mRNA cap position than the translation start site. This positional information could be useful in detecting CNEs over large evolutionary distances, under the assumption that evolution sometimes proceeds by modifying the sequences of existing cis-regulatory CNEs without significantly changing their relative position .
Overall, we have identified a large number of conserved sequence elements that, due to the strict false discovery controls that we have applied, and to previous experimental validation of a subset of these regions, are likely to be functionally important. It is our hope that each and every one of these regions will make interesting candidates for experimental analysis, helping to increase our understanding of regulation of gene expression, and particularly our understanding of regulatory elements in RNA.
Availability of supporting data
We have made all of these regions along with our analysis of each freely available to browse on our interactive website http://waspatlas.com/cns_temp. The website is easy to browse, and includes details of the associated gene as well as detailed graphical representations of a number of CNE features, including sequence alignments, secondary structures, and positional information. Genomes of Nasonia vitripennis, Apis mellifera, Atta cephalotes, Solenopsis invicta, Drosophila melanogaster, Megaselia scalaris, Aedes aegypti, Bombyx mori, Danaus plexippus, Heliconius melpomene, Dendroctonus ponderosae, Tribolium castaneum, and Acyrthosiphon pisum were obtained from the core databases in Ensembl metazoa release 21 .
Detecting non-coding sequence conservation
All insect genomes were obtained from the core databases in Ensembl metazoa release 21 . To search for conserved non-coding sequences, we extracted the 2 kb sequence upstream of each gene’s translation start site in N. vitripennis and compared this sequence with the sequence upstream of the orthologous gene in each comparator organism (Additional file 1: Figure S9). Our choice of 2 kb as an appropriate sequence length to analyse was twofold; firstly, based on the OGS v1.2 gene annotation, 2 kb is enough to cover over 95 % of 5’ UTR sequence, secondly, it is a computationally tractable amount of sequence given our resource and time constraints. Orthologs of N. vitripennis genes were computed using a pairwise reciprocal best BLAST hit (RBH) approach  based on the protein sequences (protein sequences from Ensembl metazoa release 21) of all transcripts in each genome. RBHs are calculated by a two-way comparison; for example if the best BLAST hit for a gene A in Nasonia is gene X in Drosophila, then it is called an RBH if and only if the best BLAST hit for gene X in Drosophila is gene A in Nasonia. The stringency of this criterion provides high-confidence orthologs. The number of N. vitripennis orthologs calculated for each genome is shown in Table S6. For our purposes of conducting a computationally intensive search for deeply conserved regions, this method was judged preferable to other, more comprehensive ortholog search methods such as orthoMCL  through a performance comparison of Apis mellifera and Nasonia vitripennis ortholog detection. While orthoMCL detects orthologs for more genes than the RBH method, RBH detects orthologs for most of these genes (>70 %, Additional file 1: Figure S11). Due to the higher sensitivity of orthoMCL to events such as gene duplication, the pairwise comparisons necessitated by an orthoMCL search (217,485) compared to those (best BLAST hits only) necessitated by an RBH search (7,435) would have been computationally prohibitive.
The 2 kb sequences were then extracted upstream of each translation start site. The choice of 2 kb allows us to capture both 5’ UTR and putative promoter sequence, while still being computationally tractable. If the 5’ end of a 2 kb sequence overlapped with a nearby gene, the sequence was truncated. If the sequence was entirely contained within another gene, then it was removed from the analysis entirely. Nasonia sequences were compared against sequences from each other species. The seaweeds algorithm provides optimal alignment scores for all pairs of possible windows across the two sequences, i.e. about 4 million short optimal alignments. The window length was chosen to be 50 bp. The alignment score was set to 1 for a match, 0 for a mismatch, and −0.5 for any gap. The rationale for this scoring matrix, and for using alignments in general, was to perform a search without a priori knowledge of the regions in question or the types of regulatory elements likely to be found. The choice of 50 bp is a variation on a previous study  which allows for greater sensitivity while maintaining specificity. An advantage of the window-based seaweeds algorithm  over other algorithms such as Smith-Waterman  is the avoidance of the “shadow effect”  where longer, but biologically less significant alignments may be computed while different, shorter alignments are ignored. Instead all windows are considered equally and results can be easily compared and tested for statistical significance as all sequences are of equal length.
To set the appropriate parameters for detecting conservation, we aligned randomly paired upstream sequences (pseudo-orthologs) to get an idea of what scores could be expected by chance. Using real non-orthologous genomic sequence as a control is preferable to using randomized or ‘shuffled’ sequence as it maintains the complex sequence makeup of true genomic sequences, whereas in shuffled sequences these motifs will be depleted providing a less stringent control. We first identified candidate CNEs by running the pseudo ortholog sequences and the true ortholog sequences with lower L and U parameters (L: 80 U: 94), and setting a CCS threshold at the level where no conservation was detected in the pseudo ortholog set. The lower and upper bound parameters were initially set based on examination of pairwise comparison histograms; the lower threshold was set at the point where scores were unlikely to be meaningful (i.e. where the control set shows a similar number of CNEs), and the upper threshold at a level where no sequences were reported in the control, as performed by Baxter et al. . These candidate CNEs were filtered for similarity to known coding sequences, and after increasing the parameters to the level where no conservation was detected in the pseudo-ortholog control (L: 87 U: 100), we scored these candidate CNEs for conservation producing a final filtered set of 322 CNEs. At this point of filtering, the lower and upper bound are simply used to define a continuum of confidence scores for CNEs already known to be significant.
Filtering for coding sequences and pseudogenes
To ensure that the CNEs were unlikely to contain unannotated coding sequences or pseudogenes, we utilized BLASTX 2.2.27+  to filter out such sequences. To set an appropriate filtering threshold, we first randomly permuted the nucleotide sequences in the set to be filtered. These sequences were then scored against the nr protein database  using BLASTX, and the minimum (most significant) E-value score was noted as the most significant hit likely to be produced by random sequences with identical nucleotide composition to the set to be filtered. Once this threshold was set, the true sequences were scored against nr using BLASTX, and any sequence scoring below this threshold was discarded. The thresholds and number of sequences filtered by this method can be seen in Additional file 2: Table S7.
Pseudo CNE comparisons
To elucidate properties of the CNEs through comparison, we generated a set of pseudo CNEs. The set of 359 pseudo CNEs that we obtained were created by aligning pseudo orthologs with relatively low threshold parameters (L: 80 U: 94), applying a CCS cutoff of 0.528 to retrieve a similar number of sequences to the true CNEs, and performing BLASTX filtering. Pseudo CNEs constitute a good control as they are similar to true CNEs in every way but for the fact that they do not represent true orthologous sequences. By comparing the CNEs with these pseudo CNEs, we can therefore identify properties likely to be characteristic of the truly conserved non-coding sequences.
All comparisons were performed on the N. vitripennis versions of each CNE. GO term overrepresentation analysis was performed on both the CNEs and pseudo CNEs using BiNGO , a plugin for Cytoscape . We used annotation obtained from the Gene Ontology Consortium  (data-version: 2013-08-23) and tested against a background annotation set, formed by combining annotation derived from the D. melanogaster Ensembl homologs of the full set of N. vitripennis genes, ensembl GO annotation, and the official gene set 2 GO annotation on NasoniaBase [43, 44]. Statistical comparisons of the distributions of GC content, CpG observed/expected, CNE length, and CNE position were performed using the Wilcoxon rank sum test in R . P-values of 2.2e-16 are at the floating point precision limit (p ~ 0 at machine precision). For each CNE, the distance from translation start site was calculated as the distance from the 3’ end of the CNE from the 5’-most annotated translation start site of each gene. Unless indicated otherwise, all comparisons are between the N. vitripennis true CNEs and the N. vitripennis pseudo CNEs.
Nucleosome occupancy prediction and GC content analysis
We used the nucleosome occupancy prediction software described in  to predict nucleosome occupancy within each CNE and in the flanking region. We extracted the sequence of each CNE along with flanking sequence in order to obtain a 10 kb sequence centered on each CNE. Sequences containing Ns were removed as per the software requirements. This step removed a significant proportion of CNEs - 142 of 322 CNEs (44 %). Control sets were produced by extracting, for each CNE, a region with the same properties (length, and distance from translation start site) upstream of a randomly selected gene. 10 random sets were created, and the results from these controls were averaged and the standard deviations calculated. For the GC content comparison, sequences were extracted in the same way and GC content was measured in a sliding window (50 bp window size, 10 bp step size) along each sequence using a custom Perl script.
5’ UTR analysis
To get an estimate of how many sequences were conserved in transcribed regions as opposed to non-transcribed regions, we split each CNE into all possible 20-mers and used bowtie2 v2.0.5  to map each sequence to the N. vitripennis evidential gene transcriptome dataset [43, 44] supplemented with RNA-seq data (data available on http://www.waspatlas.com), as well as the Ensembl version of the official gene set OGS v1.2 augmented with the same data. A 20-mer was counted as transcribed if it mapped to either transcriptome, and the percentage of mapped to unmapped reads was calculated to give an overlap percentage for each CNE.
To test for the presence of conserved secondary structures, the SCI (structure conservation index) was calculated for each CNE alignment and used to compute the probability of a conserved secondary structure. The SCI is defined as the minimum free energy of the consensus folding structure divided by the mean minimum free energy (MFE) of each sequence in the alignment folded independently . Sequences in alignments with high structural conservation will show similar energies whether allowed to fold independently or forced into the consensus structure. A high SCI (close to 1) therefore indicates a well-conserved structure, and an SCI of more than 1 may indicate the presence of compensatory mutations. We implemented the SCI in Perl using RNAfold  and RNAalifold , and used code from  to implement a shuffling procedure as a control. Alignments are shuffled on a column-by-column basis, keeping the overall conservation pattern intact. By generating sets of shuffled alignments in this way, we can thus calculate the probability that the conservation of RNA secondary structure in the true alignment is exceptionally strong. For each CNE, we generated 1000 control sequences, calculated the Z score distribution, and used this to generate an empirical p-value. 29 CNEs showed conserved structure at p < 0.05, and 11 at p < 0.01.
To test for overrepresentation of motifs, we used a set of 1038 position weight matrices, including matrices from JASPAR  and PLACE  and followed the procedure in . We reduced redundancy by performing hierarchical clustering based on the Hellinger distances between matrices, yielding a set of 735 representative matrices at a threshold of 1.5. The matrix with the median entropy was selected to be the representative of each cluster. Motifs were tested for overrepresentation using a binomial test taking into account the strength of the matches. We produced 100 control sets using the same method as in the nucleosome occupancy prediction/GC content analysis section, with the exception that if we found Ns in the true CNEs then Ns were inserted into each control CNE in the same positions. A matrix was counted as over-represented if the binomial p-value in the true CNE set was lower than the p-value in all control sets, and underrepresented if it was higher than the p-value in all control sets. Of the 735 representative matrices, 88 were found to be under-represented compared to the controls, and 35 were over-represented. As with the nucleosome occupancy analysis, this analysis was highly affected by GC content; in general, matrices with high GC content were found to be overrepresented whilst those with low GC content were found to be underrepresented. GC content of a matrix was measured proportionally with the weight of each position; a matrix with several highly weighted G or C nucleotides will thus have higher GC content than a matrix with G or C nucleotides that have low weight.
Not1 CNE characterization
To test for presence or absence of the Not1 CNE, the consensus sequence from the original CNE alignment was scored against the 100 bp upstream sequences of Not1 homologs in all organisms available in Ensembl metazoa using BLAST . The E-value distribution was plotted, and sequences with E-values lying outside of the main distribution (E < 0.001) (Additional file 2: Table S8) were aligned, and the 30 bp hairpin loop sequences extracted. The stabilities of the hairpins were predicted using RNAfold  and sequence logo diagrams prepared using the seqLogo  package in R.
Motif elicitation and RPLP1/RPLP2 analysis
MEME  was used to elicit motifs from the RPLP1 and RPLP2 CNEs. After initial exploratory analysis, we searched for the 3 best motifs in the sequences of both CNEs, looking for motifs from 6 to 13 bp in the RPLP1 CNE, and from 6 to 15 bp in the RPLP2 CNE. For the RPLP1 CNE, 500 bp sequences upstream of all RPLP1 homologs analyzed were used, and 2 kb sequences for the RPLP2 homologs analyzed. Due to total sequence length restrictions, the RPLP2 motif elicitation analysis was first done with a seed elicitation followed by analysis of the remaining sequences. 3 motifs were extracted for the RPLP1 CNE, and 2 for the RPLP2 CNE, one of which was manually split into two upon further inspection. Sequences were inspected for presence or absence of the CNE motif components, and sequences containing the CNEs were aligned based on the motif positions. Distances of the motifs from translation start sites were calculated and plotted. Sequence logo diagrams were prepared using the SeqLogo  package in R. All genes and organisms used in this analysis can be found in Additional file 2: Table S9.
The motif elicitation analysis was repeated using a phylogeny-aware method, Phylogibbs . The motif-based alignments produced by MAST  were cut to 17 sequences due to memory constraints, and used as inputs for the algorithm. Phylogenetic trees were constructed using divergence estimates from timetree , and blanket proximity values were assigned to branches based on the approximate figure of 0.85 proximity given in the Phylogibbs documentation for the mouse-rat divergence time of 22.6 Myr. The analysis for the RPLP1 CNE revealed three significant motifs overlapping the motifs produced by MEME. Similar results were also obtained for the RPLP2 CNE. The full results of this analysis including the elicited PSSMs can be found in Additional files 3 and 4.
The authors thank Ben Hunt and Ezio Rosato for pre-publication access to the Euphausia superba Not1 transcript sequence. ND is a member of the Midlands Integrative Bioscience Training Programme funded by the Biotechnology and Biological Sciences Research Council (BBSRC) grant BB/J014532/1. PK was funded by BBSRC and ERASysBio + Grant BB/I004823/1. ET was funded by BBSRC grant BB/K001922/1.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Weake VM, Workman JL. Inducible gene expression: diverse regulatory mechanisms. Nat Rev Genet. 2010;11(6):426–37.View ArticlePubMedGoogle Scholar
- Stewart AJ, Hannenhalli S, Plotkin JB. Why transcription factor binding sites are ten nucleotides long. Genetics. 2012;192(3):973–85.PubMed CentralView ArticlePubMedGoogle Scholar
- Piccinelli P, Samuelsson T. Evolution of the iron-responsive element. RNA. 2007;13(7):952–66.PubMed CentralView ArticlePubMedGoogle Scholar
- Haeussler M, Joly JS. When needles look like hay: how to find tissue-specific enhancers in model organism genomes. Dev Biol. 2011;350(2):239–54.View ArticlePubMedGoogle Scholar
- Stormo GD, Schneider TD, Gold L, Ehrenfeucht A. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 1982;10(9):2997–3011.PubMed CentralView ArticlePubMedGoogle Scholar
- Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004;5(4):276–87.View ArticlePubMedGoogle Scholar
- Visel A, Bristow J, Pennacchio LA. Enhancer identification through comparative genomics. Semin Cell Dev Biol. 2007;18(1):140–52.PubMed CentralView ArticlePubMedGoogle Scholar
- Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics. 2006;22(23):2971–2.View ArticlePubMedGoogle Scholar
- Nobrega MA, Zhu Y, Plajzer-Frick I, Afzal V, Rubin EM. Megabase deletions of gene deserts result in viable mice. Nature. 2004;431(7011):988–93.View ArticlePubMedGoogle Scholar
- Kazemian M, Suryamohan K, Chen JY, Zhang Y, Samee MA, Halfon MS, et al. Evidence for deep regulatory similarities in early developmental programs across highly diverged insects. Genome Biol Evol. 2014;6(9):2301–20.PubMed CentralView ArticlePubMedGoogle Scholar
- Royo JL, Maeso I, Irimia M, Gao F, Peter IS, Lopes CS, et al. Transphyletic conservation of developmental regulatory state in animal evolution. Proc Natl Acad Sci U S A. 2011;108(34):14186–91.PubMed CentralView ArticlePubMedGoogle Scholar
- Peterson KJ, Cotton JA, Gehling JG, Pisani D. The ediacaran emergence of bilaterians: congruence between the genetic and the geological fossil records. Philos Trans R Soc Lond B Biol Sci. 2008;363(1496):1435–43.PubMed CentralView ArticlePubMedGoogle Scholar
- Clarke SL, VanderMeer JE, Wenger AM, Schaar BT, Ahituv N, Bejerano G. Human developmental enhancers conserved between deuterostomes and protostomes. PLoS Genet. 2012;8(8):e1002852.PubMed CentralView ArticlePubMedGoogle Scholar
- Wyder S, Kriventseva EV, Schroder R, Kadowaki T, Zdobnov EM. Quantification of ortholog losses in insects and vertebrates. Genome Biol. 2007;8(11):R242.PubMed CentralView ArticlePubMedGoogle Scholar
- Krusche P, Tiskin A. Computing alignment plots efficiently. In: Chapman B, Desprez F, Joubert GR, Lichnewsky AI, Peter F, Priol T, editors. Parallel Computing: From Multicores and GPUs to Petascale. Volume abs/0909.2000. Amsterdam: IOS Press; 2009. p. 158–65.Google Scholar
- Schmieder S, Colinet D, Poirie M. Tracing back the nascence of a new sex-determination pathway to the ancestor of bees and ants. Nat Commun. 2012;3:895.PubMed CentralView ArticlePubMedGoogle Scholar
- Wiegmann BM, Trautwein MD, Kim JW, Cassel BK, Bertone MA, Winterton SL, et al. Single-copy nuclear genes resolve the phylogeny of the holometabolous insects. BMC Biol. 2009;7:34–7007. 7-34.PubMed CentralView ArticlePubMedGoogle Scholar
- Dohrmann M, Worheide G. Novel scenarios of early animal evolution--is it time to rewrite textbooks? Integr Comp Biol. 2013;53(3):503–11.View ArticlePubMedGoogle Scholar
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, et al. Ultraconserved elements in the human genome. Science. 2004;304(5675):1321–5.View ArticlePubMedGoogle Scholar
- Ahituv N, Zhu Y, Visel A, Holt A, Afzal V, Pennacchio LA, et al. Deletion of ultraconserved elements yields viable mice. PLoS Biol. 2007;5(9), e234.PubMed CentralView ArticlePubMedGoogle Scholar
- Perucho L, Artero-Castro A, Guerrero S, Ramon Y, Cajal S, LLeonart ME, et al. RPLP1, a crucial ribosomal protein for embryonic development of the nervous system. PLoS One. 2014;9(6):e99956.PubMed CentralView ArticlePubMedGoogle Scholar
- Patel RC, Jacobs-Lorena M. Cis-acting sequences in the 5'-untranslated region of the ribosomal protein A1 mRNA mediate its translational regulation during early embryogenesis of Drosophila. J Biol Chem. 1992;267(2):1159–64.PubMedGoogle Scholar
- Maillet L, Tu C, Hong YK, Shuster EO, Collart MA. The essential function of Not1 lies within the Ccr4-Not complex. J Mol Biol. 2000;303(2):131–43.View ArticlePubMedGoogle Scholar
- Collart MA, Panasenko OO. The Ccr4--not complex. Gene. 2012;492(1):42–53.View ArticlePubMedGoogle Scholar
- Babendure JR, Babendure JL, Ding JH, Tsien RY. Control of mammalian translation by mRNA structure near caps. RNA. 2006;12(5):851–61.PubMed CentralView ArticlePubMedGoogle Scholar
- Iacono M, Mignone F, Pesole G. uAUG and uORFs in human and rodent 5'untranslated mRNAs. Gene. 2005;349:97–105.View ArticlePubMedGoogle Scholar
- Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, et al. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005;3(1):e7.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee TI, Young RA. Transcriptional regulation and its misregulation in disease. Cell. 2013;152(6):1237–51.PubMed CentralView ArticlePubMedGoogle Scholar
- Tillo D, Hughes TR. G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics. 2009;10:442–2105. 10-44.PubMed CentralView ArticlePubMedGoogle Scholar
- Rao YS, Chai XW, Wang ZF, Nie QH, Zhang XQ. Impact of GC content on gene expression pattern in chicken. Genet Sel Evol. 2013;45:9–9686. 45-9.PubMed CentralView ArticlePubMedGoogle Scholar
- Cande J, Goltsev Y, Levine MS. Conservation of enhancer location in divergent insects. Proc Natl Acad Sci U S A. 2009;106(34):14414–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(Database issue):D749–55.PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278(5338):631–7.View ArticlePubMedGoogle Scholar
- Li L, Stoeckert Jr CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13(9):2178–89.PubMed CentralView ArticlePubMedGoogle Scholar
- Baxter L, Jironkin A, Hickman R, Moore J, Barrington C, Krusche P, et al. Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants. Plant Cell. 2012;24(10):3949–65.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.View ArticlePubMedGoogle Scholar
- Arslan AN, Egecioglu O, Pevzner PA. A new approach to sequence comparison: normalized sequence alignment. Bioinformatics. 2001;17(4):327–37.View ArticlePubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.View ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35(Database issue):D61–5.PubMed CentralView ArticlePubMedGoogle Scholar
- Maere S, Heymans K, Kuiper M. BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics. 2005;21(16):3448–9.View ArticlePubMedGoogle Scholar
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.PubMed CentralView ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet. 2000;25(1):25–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Munoz-Torres MC, Reese JT, Childers CP, Bennett AK, Sundaram JP, Childs KL, et al. Hymenoptera genome database: integrated community resources for insect species of the order Hymenoptera. Nucleic Acids Res. 2011;39(Database issue):D658–62.PubMed CentralView ArticlePubMedGoogle Scholar
- Werren JH, Richards S, Desjardins CA, Niehuis O, Gadau J, Colbourne JK, et al. Functional and evolutionary insights from the genomes of three parasitoid Nasonia species. Science. 2010;327(5963):343–8.View ArticlePubMedGoogle Scholar
- R Development Core Team. R: A Language and Environment for Statistical Computing, 2.10.1. 2010.Google Scholar
- Kaplan N, Moore IK, Fondufe-Mittendorf Y, Gossett AJ, Tillo D, Field Y, et al. The DNA-encoded nucleosome organization of a eukaryotic genome. Nature. 2009;458(7236):362–6.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.PubMed CentralView ArticlePubMedGoogle Scholar
- Gruber AR, Bernhart SH, Hofacker IL, Washietl S. Strategies for measuring evolutionary conservation of RNA secondary structures. BMC Bioinformatics. 2008;9:122–2105. 9-122.PubMed CentralView ArticlePubMedGoogle Scholar
- Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly. 1994;125(2):167–88.View ArticleGoogle Scholar
- Bernhart SH, Hofacker IL, Will S, Gruber AR, Stadler PF. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics. 2008;9:474-–2105-. 9-474.PubMed CentralView ArticlePubMedGoogle Scholar
- Washietl S, Hofacker IL. Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J Mol Biol. 2004;342(1):19–30.View ArticlePubMedGoogle Scholar
- Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004;32(Database issue):D91–4.PubMed CentralView ArticlePubMedGoogle Scholar
- Higo K, Ugawa Y, Iwamoto M, Korenaga T. Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res. 1999;27(1):297–300.PubMed CentralView ArticlePubMedGoogle Scholar
- Bembom O. seqLogo: An R package for plotting DNA sequence logos. 2007.Google Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009;37(Web Server issue):W202–8.PubMed CentralView ArticlePubMedGoogle Scholar
- Siddharthan R, Siggia ED, van Nimwegen E. PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005;1(7):e67.PubMed CentralView ArticlePubMedGoogle Scholar
- You M, Yue Z, He W, Yang X, Yang G, Xie M, et al. A heterozygous moth genome provides insights into herbivory and detoxification. Nat Genet. 2013;45(2):220–5.View ArticlePubMedGoogle Scholar
- Wiegmann BM, Trautwein MD, Winkler IS, Barr NB, Kim JW, Lambkin C, et al. Episodic radiations in the fly tree of life. Proc Natl Acad Sci U S A. 2011;108(14):5690–5.PubMed CentralView ArticlePubMedGoogle Scholar