Evolution of the unspliced transcriptome

Background
Despite their abundance, unspliced EST data have received little attention as a source of information on non-coding RNAs. Very little is known, therefore, about the genomic distribution of unspliced non-coding transcripts and their relationship with the much better studied regularly spliced products. In particular, their evolution has remained virtually unstudied.

Results
We systematically study the evidence on unspliced transcripts available in EST annotation tracks for human and mouse, comprising 104,980 and 66,109 unspliced EST clusters, respectively. Roughly one third of these are located totally inside introns of known genes (TINs) and another third overlaps exonic regions (PINs). Eleven percent are “intergenic”, far away from any annotated gene. Direct evidence for the independent transcription of many PINs and TINs is obtained from CAGE tag and chromatin data. We predict more than 2000 3’UTR-associated RNA candidates each for human and mouse. Fifteen to twenty percent of the unspliced EST clusters are conserved between human and mouse. With the exception of TINs, the sequences of unspliced EST clusters evolve significantly more slowly than genomic background. Furthermore, like spliced lincRNAs, they show highly tissue-specific expression patterns.

Conclusions
Unspliced long non-coding RNAs are an important, rapidly evolving, component of mammalian transcriptomes. Their analysis is complicated by their preferential association with complex transcribed loci that usually also harbor a plethora of spliced transcripts. Unspliced EST data, although typically disregarded in transcriptome analysis, can be used to gain insights into this rarely investigated transcriptome component. The frequently postulated connection between lack of splicing and nuclear retention and the surprising overlap with chromatin-associated transcripts suggest that this class of transcripts might be involved in chromatin organization and possibly other mechanisms of epigenetic control.
Electronic supplementary material The online version of this article (doi:10.1186/s12862-015-0437-7) contains supplementary material, which is available to authorized users.


Additional data
ESTs "within range" of RefSeq genes
Figure S3: ARHGAP26 (Rho GTPase activating protein) is a long protein-coding gene on human chromosome 5, stretching over approx. 46 kb (blue line with arrows). Between the 20th and 21st exons we find an unspliced EST (uEST) cluster (HG73775) with a length of 2,700 nt (black box). Additional evidence for functional importance comes from ENCODE data. Long RNA-seq transcripts antisense to the protein-coding gene have been reported (green bar; white arrows depict the reading direction). In addition, the region upstream of the uEST cluster is classified as a transcription start site [1] (the tiny red bar enclosed by yellow and green bars below the EST track), assuming the reading direction predicted by RNA-seq. There is also an enrichment of the histone modification H3K27Ac around the uEST cluster (the colored transparent overlays near the bottom). This modification is often found near active regulatory elements.

Unspliced mRNAs
The track "all mrna" was downloaded for human and mouse from the UCSC Genome Browser on 15 November 2013 and 18 November, respectively. Because we were interested in unspliced mRNAs, we removed all mRNAs with more than one block. A large part of the remaining data consisted of rather short mRNAs. One of the shortest mini-proteins known is an artificially designed 20-residue construct [2], which corresponds to 60 coding nucleotides. Furthermore, one can assume that an actual protein needs some additional UTR sequence at the genomic level. It is therefore very unlikely that many true-positive mRNAs are shorter than 60 nucleotides. Consequently, we removed all mRNAs shorter than this length from our data set, see Tab. S1.

MALAT-1, for instance, appears among the loci with the highest coverage of unspliced ESTs. Together with NEAT1, which is located in the same genomic region, it belongs to a class of long nuclear-retained transcripts involved in the organization of nuclear speckles [4,5]; see also [6] and the references therein. Other examples, such as KCNQ1OT1 [7] and PTCSC3 [8], are clearly visible in our data, albeit with moderate coverage. Unspliced ESTs are also reported as parts of known spliced non-coding transcripts, in particular those with very long exons such as XIST [9]. In other cases, such as TUG1 [10], we observe predominantly unspliced ESTs that cover (nearly) the entire primary transcript, even though the genomic location is annotated with the spliced forms.
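The two filtering steps applied to the mRNA track (keep single-block alignments, then discard anything shorter than 60 nt) can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the record type `MrnaAlignment` and its field names are hypothetical, and it assumes that "block" means one gap-free aligned segment, so that a block count of 1 identifies an unspliced mRNA.

```python
from typing import Iterable, List, NamedTuple

# Cutoff stated in the text: 20 residues x 3 nt per codon = 60 nt.
MIN_LENGTH_NT = 60


class MrnaAlignment(NamedTuple):
    """Hypothetical minimal record for one aligned mRNA."""
    name: str
    block_count: int  # number of aligned blocks; 1 is taken to mean unspliced
    length_nt: int    # aligned length in nucleotides


def keep_unspliced(records: Iterable[MrnaAlignment],
                   min_length: int = MIN_LENGTH_NT) -> List[MrnaAlignment]:
    """Keep single-block alignments that are at least min_length nt long."""
    return [r for r in records
            if r.block_count == 1 and r.length_nt >= min_length]


# Made-up example records, for illustration only.
examples = [
    MrnaAlignment("short_single_block", 1, 59),
    MrnaAlignment("at_cutoff", 1, 60),
    MrnaAlignment("spliced", 2, 500),
    MrnaAlignment("long_single_block", 1, 2700),
]
kept_names = [r.name for r in keep_unspliced(examples)]
print(kept_names)
```

Records exactly at the cutoff are kept here, because the text removes only mRNAs shorter than 60 nt.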

Independent UTR-derived RNAs
Human and Mouse CAGE data
Cell-type specific human data
The TSS peaks in 156 primary human cell lines predicted by Ohmiya et al. [11] were downloaded from the FANTOM5 webserver. The unspliced EST clusters and RefSeq genes were the same as in the rest of the analysis. For the long non-coding RNAs, we used a dataset based on the GENCODE v14 annotation but filtered for higher reliability by Nitsche et al. [12].
The analysis was the same for all three data sets. We retained elements that overlapped a predicted TSS on the forward strand within their first 84 nucleotides, or a TSS on the reverse strand within their last 84 nt. This window size was chosen so that the results can be compared with our first analysis in 2012 [13].
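The TSS filter described above can be sketched in a few lines. The sketch rests on assumptions that the text does not state: coordinates are 0-based and half-open (as in BED files), "first" and "last" refer to the genomic start and end of the element, and elements shorter than 84 nt are tested over their full length. The `Interval` type and the function names are hypothetical.

```python
from typing import Iterable, NamedTuple

WINDOW_NT = 84  # window size stated in the text


class Interval(NamedTuple):
    """Hypothetical genomic interval, 0-based and half-open (assumption)."""
    chrom: str
    start: int
    end: int
    strand: str = "."


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def has_tss_support(element: Interval, tss_peaks: Iterable[Interval],
                    window: int = WINDOW_NT) -> bool:
    """Forward-strand TSS in the first `window` nt, or reverse-strand TSS in the last."""
    head_end = min(element.start + window, element.end)
    tail_start = max(element.end - window, element.start)
    for tss in tss_peaks:
        if tss.chrom != element.chrom:
            continue
        if tss.strand == "+" and overlaps(element.start, head_end, tss.start, tss.end):
            return True
        if tss.strand == "-" and overlaps(tail_start, element.end, tss.start, tss.end):
            return True
    return False


# Made-up coordinates, for illustration only.
cluster = Interval("chr1", 1000, 3000)
forward_tss_at_start = Interval("chr1", 1050, 1060, "+")
reverse_tss_at_end = Interval("chr1", 2950, 2960, "-")
reverse_tss_at_start = Interval("chr1", 1050, 1060, "-")
print(has_tss_support(cluster, [forward_tss_at_start]))
print(has_tss_support(cluster, [reverse_tss_at_start]))
```

A linear scan over all peaks is used only for brevity; for genome-scale data an interval index or a sorted sweep would be the more practical choice.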
The number of cell lines for which an element was predicted to be expressed is visualized in Supp. Fig. S5.

Figure S6: A) PLA2R1 and its downstream region on human chromosome 2, an example of a candidate for a new lncRNA. The importance of the unspliced EST (uEST) cluster HG61346 is supported by several lines of evidence. Transcripts antisense to the neighboring gene PLA2R1 have been detected by long RNA-seq (green bars; white arrows depict the reading direction). The uEST overlaps a region annotated as a transcription start site based on an analysis of the chromatin structure (red bar named TSS). In the middle of the uEST cluster, a significant peak of evolutionary conservation is visible (blue histogram). B) The homologous region in mouse: PLA2R1 and its downstream region with the lncRNA Gm13880 on chromosome 2. In contrast to human, an antisense, spliced long non-coding RNA is annotated downstream of PLA2R1. A small part of its sequence (76 nt) can be found in the human genome using BLAT. Determining the exact structure of the lncRNA requires additional investigation, most likely by experimental methods.

Figure S7: Conserved secondary structure elements with one of the best RNAz scores. The multiple genome alignment of the region hg19.chr9:66,766,331-66,766,440 contained sequences from diverse mammals, including Pteropus vampyrus and Loxodonta africana. The upper picture shows the structure on the forward strand, the lower one the structure on the reverse strand. Both have a p-value better than 0.99. The color code, from red to ochre and green, indicates that 1, 2, or 3 different types of base pairs are observed in the corresponding alignment columns; unsaturated colors indicate base pairs that cannot be formed by 1 or 2 of the 6 sequences in the alignment. Substitutions in stem regions are indicated by circles.

Overlap with ENCODE RNA-seq data
We investigated the overlap of the unspliced EST clusters with the published ENCODE RNA-seq data. 96.6% of all human unspliced EST clusters are supported by RNA-seq data, defined as overlapping at least 10 reads of at least 1 RNA-seq library. In most classes only a small fraction of the clusters is not supported by RNA-seq data. Only the class IGR stands out, with 15.4% of unspliced EST clusters lacking RNA-seq evidence. All numbers can be found in Supp. Tab. S4.

Figure S8: Distribution of classes among pairs of homologous clusters that are assigned to the same class in human and mouse. 1,495 clusters are assigned to "NO CLASS".

Figure S9: Example of a conserved totally intronic unspliced EST (uEST) cluster. The TINs HG45063/HG45064 and MM35793 have not been reported previously. The overlapping gene is HOXA3. The genomic locus is rather complex, as one can see from the large number of differently connected ESTs and mRNAs. The uEST cluster HG45065 has recently been described as an apoptosis repressor in certain cell lines and was called HOXA-AS2 [14]. The independence of HG45063/HG45064 cannot be determined without additional experiments. It might be a splice variant of HOXA-AS2 or even of HOXA-AS3. In fact, they could turn out to be a single large antisense ncRNA, possibly depending on the gene definition used. Nevertheless, transcription is supported by chromatin marks (cell lines HeLa-S3 and HepG2) and antisense RNA-seq data (see caption of Fig. S6).
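The RNA-seq support criterion used in this section (at least 10 overlapping reads in at least 1 library) can be sketched as below. The sketch assumes that the 10 reads must come from a single library rather than being summed across libraries, which is one reading of the definition. The data layout (a mapping from cluster ID to per-library read counts), the IDs, and the counts are made up for illustration.

```python
from typing import Dict, Mapping

MIN_READS = 10  # read threshold stated in the text


def is_supported(counts_per_library: Mapping[str, int],
                 min_reads: int = MIN_READS) -> bool:
    """True if at least one library contributes min_reads or more overlapping reads."""
    return any(count >= min_reads for count in counts_per_library.values())


def supported_fraction(clusters: Mapping[str, Mapping[str, int]],
                       min_reads: int = MIN_READS) -> float:
    """Fraction of clusters with RNA-seq support; 0.0 for an empty input."""
    if not clusters:
        return 0.0
    n_supported = sum(is_supported(counts, min_reads) for counts in clusters.values())
    return n_supported / len(clusters)


# Hypothetical overlap counts: cluster ID -> {library name: number of overlapping reads}.
toy_counts: Dict[str, Dict[str, int]] = {
    "cluster_a": {"lib1": 3, "lib2": 12},  # supported by lib2
    "cluster_b": {"lib1": 9, "lib2": 9},   # 18 reads in total, but no single library reaches 10
    "cluster_c": {},                       # no overlapping reads at all
}
fraction = supported_fraction(toy_counts)
print(f"{fraction:.3f}")
```

Under the alternative reading, in which reads are pooled across libraries, `cluster_b` would count as supported, so the choice of interpretation changes the reported fractions.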