The evolution of ultraconserved elements with different phylogenetic origins
© Ryu et al.; licensee BioMed Central Ltd. 2012
Received: 26 April 2012
Accepted: 9 November 2012
Published: 5 December 2012
Ultraconserved elements of DNA have been identified in vertebrate and invertebrate genomes. These elements have been found to have diverse functions, including enhancer activities in developmental processes. The evolutionary origins and functional roles of these elements in cellular systems, however, have not yet been determined.
Here, we identified a wide range of ultraconserved elements common to distant species, from primitive aquatic organisms to terrestrial species with complicated body systems, including some novel elements conserved in fruit fly and human. In addition to a well-known association with developmental genes, these DNA elements have a strong association with genes implicated in essential cell functions, such as epigenetic regulation, apoptosis, detoxification, innate immunity, and sensory reception. Interestingly, we observed that ultraconserved elements clustered by sequence similarity. Furthermore, species composition and flanking genes of clusters showed lineage-specific patterns. Ultraconserved elements are highly enriched with binding sites for developmental transcription factors regardless of how they cluster.
We identified large numbers of ultraconserved elements across distant species. Specific classes of these conserved elements seem to have been generated before the divergence of taxa and fixed during the process of evolution. Our findings indicate that these ultraconserved elements are not the exclusive property of higher modern eukaryotes, but were rather transmitted from their metazoan ancestors.
Large numbers of DNA elements (≥200 bp) exhibiting 100% similarity have been found to be conserved across several mammalian species [1, 2]. Shorter ultraconserved elements (UCEs) longer than 50 bp and 100 bp have also been identified in several insect species and plants, respectively [3, 4].
Since the discovery of UCEs, considerable effort has been devoted to elucidating their functions and determining the reasons for their extreme conservation. UCEs are often located near genes implicated in transcription and developmental processes, splicing, and ion flow control across membranes [1, 2, 5–7]. In vivo analysis of the embryos of transgenic mice uncovered the transcriptional enhancer activities of UCEs targeting developmental genes and transcription factors (TFs) [8, 9]. Depletion of UCEs among segmental duplications and copy number variations was also reported. Single nucleotide polymorphisms (SNPs) in UCEs have been linked to cancer risk, impaired TF binding, and homeobox gene regulation in the central nervous system [11, 12]. Nevertheless, homozygous knockout experiments in mouse embryos revealed that deletion of ultraconserved elements can yield viable mice, suggesting the dispensability or functional redundancy of UCEs.
The origin and evolution of UCEs have also been investigated. There is evidence that some UCEs originated from retroposons and stabilized in genomes after acquiring a function that benefitted the host. Stephen et al. studied the evolution of UCEs in several vertebrate genomes and found that they were generated and expanded on a large scale during tetrapod evolution. Other studies of the human genome showed that UCEs experienced strong purifying selection and were not mutational cold spots [16–18].
In this study, we investigated whether evidence of the conservation of DNA elements could be found in primitive species, such as sponge and hydra, and whether these conserved elements have functions similar to those previously reported for higher eukaryotes. We identified many UCEs across diverse phyla, including Porifera, Cnidaria, Arthropoda, Echinodermata, and Chordata, as well as a new type of short UCEs. By comparing distant species, we were able to identify new UCEs in human and fruit fly. Clustering the UCEs based on sequence similarity unveiled lineage specificity and distinct functions outlined by protein domains of their flanking genes and DNA regulatory motifs. We concluded that each UCE group arose independently on a specific lineage and was “frozen” on the genome as a regulatory innovation after the divergence of specific taxa.
We began our analysis by asking if there is evidence of ultraconservation in primitive species and, if so, how UCEs diverged during the process of evolution. We considered six species whose genomes were previously sequenced including demosponge (Amphimedon queenslandica) from the phylum Porifera, hydra (Hydra magnipapillata) and sea anemone (Nematostella vectensis) from the phylum Cnidaria, sea urchin (Strongylocentrotus purpuratus) from the phylum Echinodermata, fruit fly (Drosophila melanogaster) from the phylum Arthropoda, and human (Homo sapiens) from the phylum Chordata. We identified UCEs (≥50 bp) and shorter UCEs (≥30 bp) by pairwise comparison of the whole genomic sequences across six species.
[Table: Identification of UCEs. Recoverable species headers: N. vectensis (sea anemone), D. melanogaster (fruit fly), S. purpuratus (sea urchin).]
We noticed that a large number of conserved DNA elements that we identified overlapped in each species because the UCE-identification program, MUMmer, reported all maximal matches regardless of the overlap . To minimize redundancy and facilitate downstream analysis, neighboring UCEs and short UCEs in each species were joined as non-overlapping ultraconserved regions (UCRs) (Additional file 1 and Additional file 2). The numbers of these non-overlapping UCRs (≥50 bp) were 30 for sponge, 64 for fruit fly, 673 for hydra, 56 for human, 3,807 for sea anemone, and 187 for sea urchin.
As a benchmark for our UCE discovery pipeline, we examined how many previously identified UCEs we were able to recover. Previously reported UCEs in human and fruit fly were aligned to their reference genomes using Bowtie to determine their exact locations in the current genome builds (hg19 and dm3, respectively). Nearly all known UCEs (all 481 elements from the human-mouse-rat alignment, 23,695 out of 23,699 elements from the D. melanogaster-Drosophila pseudoobscura alignment, and all 126 elements from the D. melanogaster-Anopheles gambiae alignment) were successfully aligned. We then compared these elements with our UCR set. Unlike in the fruit fly, where 42 out of 64 UCRs overlapped with data reported by Glazov et al., we could not find any UCR in human that overlapped with previously reported UCEs (Additional file 3).
We then asked whether UCRs from the same or different species share sequence similarity. Considering the short length of UCRs and also assuming that distal regions of ultraconserved elements have higher mutation rates than proximal regions [15, 25, 26], we analyzed UCRs together with their 50 bp flanking sequences. In all, 4,817 UCRs with flanking sequences from all species were clustered, and orthologous and paralogous UCRs were defined. This yielded 61 clusters, of which the largest consisted of 1,168 UCRs from hydra, sea anemone, and sea urchin (Additional file 4).
Although there are large numbers of UCRs across different taxa, we found that UCRs share sequence similarities and that each cluster of UCRs has a distinct species composition. Moreover, Cnidarian UCRs show a tight association, while human UCRs are largely clustered together with those of sea urchin and/or fruit fly (Additional file 4). Gain of essential functions for the survival of the species in ancestral sequences might contribute to the conservation of the sequence in a specific lineage . Another possible explanation would be that even if the ancestral sequences were not beneficial to the species, random sampling contributed to the elimination of other alleles and the fixation of these sequences in the downsized population, creating a new lineage, due to natural catastrophe or population migration, referred to as a “genetic drift” or “population bottleneck” . Although further study is required to explain the immutability of UCEs after lineage divergence and sequence fixation across a long evolutionary history, we cannot rule out this possibility. It also should be noted that the absence of UCRs in species from the same lineage does not necessarily mean that those UCRs disappeared in those species but rather that they may exist as derivative sequences by mutation [2, 15, 28, 29].
Network topology demonstrates the relationship between these UCR clusters, where some clusters are connected due to the sequence similarity between components, although most clusters do not share sequence similarity with others and have unique species composition (Figure 3C). Thus, the UCRs of each cluster may have their own independent origin in a specific lineage.
Ion channel and transporter domains are the predominant categories; they appear in many clusters composed of various species. Neurotransmitter-gated ion channels and sodium or calcium ion exchanger genes are overrepresented in clusters 13, 15, and 17, whose UCRs are conserved in all species considered here except human (Figure 4 and Additional file 4). Cation transporters are identified in cluster 30, which consists of human and fruit fly UCRs. Sugar transporters and mitochondrial carrier domains that transport various molecules across membranes are enriched in clusters 1, 16, and 21. This pattern probably arises because ion channels and transporters are crucial in all living organisms for the maintenance of water, salt, and nutrient homeostasis as well as for electric signal transmission in neuronal and muscle cells.
The homeobox domain, part of the TFs that act during the developmental process, is enriched in five clusters. This domain is found in all six species, with three of the five enriched UCR clusters composed of UCRs from human and fruit fly, one from fruit fly and sea urchin, and the last cluster from hydra, sea anemone, and sea urchin. Fruit fly genes regulating developmental programs ranging from axis patterning to molting, such as bicoid, fushi tarazu, and ecdysone receptor, are also found in several clusters, even those without significant domains.
Histones are overrepresented in cluster 19, which consists of sea anemone and sea urchin UCRs. Evidence that chromatin-related genes flank conserved elements in human (Additional file 7), together with evidence from other studies [32, 33], suggests that there is a link between conserved elements and epigenetic control mechanisms.
Detoxification domains such as cytochrome p450, UDPGT, and GST are enriched in cluster 3 and cluster 35. Cluster 3 consists of UCRs from sponge, hydra, sea anemone, and sea urchin; cluster 35 consists of UCRs from fruit fly and human. These enzymes are important to catalyzing and eliminating endogenous and exogenous substrates and therefore to providing a healthy environment for the cellular system . This remarkable linkage between UCRs and detoxification mechanisms has not previously been reported to our knowledge.
Further analysis of UCRs (≥50 bp) and short UCRs (≥30 bp) in human reveals similar but more interesting properties in terms of nearby gene functions and species conservation (Additional file 7 and Additional file 8). Genes acting in various developmental processes are highly enriched near the UCRs in human that are also conserved in fruit fly and sea urchin. To our surprise and contrary to previous studies, few genes related to development are enriched near the human sequences conserved in sponge, hydra, or sea anemone. Expansion of the relationship between developmental programs and UCRs in human, fruit fly and sea urchin (Figure 1 and Additional file 7 and Additional file 8) implies that the association of conserved sequences with the regulation of developmental genes started or expanded after the divergence of the Bilateria lineage from the metazoan stem. Our UCR clustering results bolster this hypothesis (Figure 4). Four out of five UCR clusters that have overrepresented homeobox domains of nearby genes come from human, fruit fly, and sea urchin.
Interestingly, genes surrounding short UCRs are enriched with epigenetic program-related genes (Figure 2 and Additional file 7). Short UCRs conserved in human and in fruit fly, hydra, sea anemone, or sea urchin are located near histone gene clusters across several chromosomes. Furthermore, many important epigenetic regulators are also found near elements conserved in sponge, hydra, sea anemone, or sea urchin. These include histone demethylases (KDM3B, KDM4C, KDM5C, and KDM5D), histone acetyltransferases (EP300 and KAT7), histone deacetylases (HDAC2 and HDAC10), retinoblastoma-like protein (RBL1), polycomb ring finger oncogene (BMI1), chromodomain helicase (CHD8), and components of the chromatin remodeling complex, SWI/SNF (SMARCA2, SMARCB1, SMARCC2, and SMARCD3). Taken together with the previously suggested relationship between highly/ultraconserved elements and epigenetic control [15, 32, 33], our results suggest the hypothesis that epigenetic control mechanisms are tightly linked to conserved DNA sequences and that the two might have coevolved from metazoan ancestors rather than having developed recently.
Genes implicated in apoptosis, olfactory reception, and defense mechanisms are also enriched near DNA elements conserved in sponge, hydra, or sea urchin (Figure 2 and Additional file 7 and Additional file 8). Our analysis suggests that genomes preserve ancestral sequences well, and these ancestral sequences might have coevolved with a diverse set of essential genes. When and how genes and conserved elements initiated their relationships remains unclear and the mechanism for such an association needs to be further elucidated. However, our analysis expands the repertoire of conserved genomic elements that are possible regulatory elements.
Among 31 TFs that had significant 8-mer matches, 28 were implicated in developmental processes and many were homeobox TFs. Binding sites of homeobox TFs on UCEs near the developmental genes in higher eukaryotes have been identified [35–37], although our clustering results identified various nearby gene categories that were not limited to developmental genes. Prevalent occurrence of developmental TFBSs regardless of cluster and species may be an indication that extensive binding of developmental TFs on UCEs existed in metazoan ancestors and these TFs regulated various nearby genes to coordinate developmental functions. These may have contributed to the strong selective pressure on UCEs that function as regulatory sequences.
Genomes are dynamic entities and are under selective evolutionary pressure from mutation and fixation. Beneficial or neutral mutations in the ancestors of specific lineages are maintained in the population and vertically transferred to descendants . However, these dynamic and selective pressures are not applied uniformly across the whole genome [16, 39, 40]. Deleterious mutations in essential regions are corrected in a population [15, 16]. Sequence conservation thus implies that the function of the sequence is essential. Despite controversy about the indispensability of ultraconserved elements [13, 41], much work has demonstrated various vital functions of such elements [5, 6, 8–10].
As more genomes from various taxa are being sequenced, the opportunity to understand genome conservation and usage increases. Here, we compared genome sequences ranging from primitive aquatic to higher terrestrial species and described for the first time a number of novel UCEs present in primitive species as well as previously uncharacterized UCEs in human and fruit fly. We observed that UCEs cluster by sequence similarity and each cluster has distinct patterns of species composition. These UCEs also exhibited specific biases toward the function of nearby genes and oligomer compositions of the UCE sequences, suggesting that each group of UCEs was generated in the common ancestors of specific lineages and fixed during the evolution of descendants. Although a more detailed functional analysis of UCEs cannot currently be conducted due to the nature of the short draft sequences and because gene functions of non-model species have been less studied, our analysis suggests that UCEs harbor important sequence features, such as binding sites of developmental TFs to coordinate the expression of essential genes, which is why they were readily conserved over the long course of evolution.
Genome sequences, gene annotation, and protein sequences were downloaded from the UCSC database for human (assembly version: hg19) and fruit fly (assembly version: dm3), and from the respective genome projects for sponge (assembly version as of 5 Aug 2010), hydra (assembly version as of 28 Jan 2009), sea anemone (assembly version as of 26 Oct 2005), and sea urchin (assembly version as of 13 Oct 2006).
First, we identified single-copy genes from each of the six species under investigation to infer their phylogenetic relationships. This approach has been used previously in other studies to avoid the paralogy issue [44, 46, 47]. Inparanoid was used to identify orthologs and paralogs between species pairs. Only the longest peptide was used when multiple transcripts came from the same gene. We identified 472 single-copy genes that were found to be largely involved in ribosome, spliceosome, or proteasome pathways. Gene sequences were aligned using MUSCLE, and the evolutionary distance and phylogenetic tree were obtained using MEGA5. The phylogenetic tree reveals the overall relationship among the six species, which is in agreement with the known classification of these lineages (Figure 1) [45, 51, 52].
To identify UCEs for all species pairs, we masked repetitive sequences in the scaffolds of sponge, hydra, sea anemone, and sea urchin using CENSOR and Tandem Repeats Finder. Repeat-masked chromosomes from the UCSC database were used for human and fruit fly. To identify non-gapped conserved elements between two species, we used MUMmer, which rapidly aligns long sequences and detects exact matches using a suffix tree algorithm, with the maxmatch option to compute all maximal identical matches regardless of uniqueness. Both forward and reverse complement matches were reported. Identical matches equal to or longer than 50 bp were identified, and ≥30 bp matches were also identified for incidental analysis. Identified UCEs were then masked again using CENSOR and Tandem Repeats Finder. It should be mentioned that this stringent repeat-masking process may have deleted potential UCEs containing repetitive elements.
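To make the notion of a maximal exact match concrete, the following is a minimal brute-force sketch in Python. It is not MUMmer's suffix tree algorithm and would be far too slow for whole genomes; it only illustrates what "all maximal identical matches regardless of uniqueness" means on short strings. The function names and the half-open coordinate convention are assumptions made for illustration, and masked bases receive no special treatment here.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def maximal_exact_matches(a: str, b: str, min_len: int) -> List[Tuple[int, int, int]]:
    """Return (start_in_a, start_in_b, length) for every exact match of at
    least min_len bases that cannot be extended to the left or to the right.

    Quadratic-time dynamic programming over all position pairs; a stand-in
    for illustration, not the suffix tree approach used by MUMmer.
    """
    matches = []
    prev = [0] * (len(b) + 1)  # prev[j]: match length ending at a[i-2], b[j-1]
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extending the diagonal run makes the match left-maximal.
                curr[j] = prev[j - 1] + 1
                at_end = i == len(a) or j == len(b)
                right_maximal = at_end or a[i] != b[j]
                if right_maximal and curr[j] >= min_len:
                    length = curr[j]
                    matches.append((i - length, j - length, length))
        prev = curr
    return matches
```

Reverse complement matches can be obtained under the same assumptions by calling `maximal_exact_matches(a, reverse_complement(b), min_len)` and converting the reported start positions back to the forward strand of `b`.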
Two UCEs were joined if they overlapped, and this merging process was repeated until no two UCEs overlapped (Additional file 1 and Additional file 2). Fifty-base flanking sequences on both sides of the merged UCEs were retrieved using a custom Python script.
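The merging and flank-retrieval steps can be sketched as follows. This is not the custom script used in the study, which is not available here; it is a minimal illustration that assumes half-open `[start, end)` coordinates on a single sequence, treats directly adjacent (non-overlapping) elements as separate, and clips flanks at the sequence ends. All names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one scaffold or chromosome


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Join overlapping intervals until no two of them overlap.

    Sorting by start position lets a single pass replace the repeated
    pairwise merging described in the text.
    """
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def with_flanks(region: Interval, seq_len: int, flank: int = 50) -> Interval:
    """Extend a region by `flank` bases on both sides, clipped to the sequence."""
    start, end = region
    return (max(0, start - flank), min(seq_len, end + flank))
```

For example, `merge_overlapping([(10, 60), (40, 95)])` yields one region `(10, 95)`, and `with_flanks((10, 95), seq_len=120)` clips it to `(0, 120)`.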
Merged ultraconserved elements with flanking sequences were grouped by sequence similarity. Pairwise alignment of all sequences was computed using BLASTN . The score density, i.e. the BLAST bit-score divided by the alignment length, was used as the similarity measure. Sequences were clustered using the Markov cluster (MCL) algorithm  with default parameters (Additional file 4). In the Minimum Curvilinear Embedding (MCE) analysis , 5-mer compositions of the sequences were used as features. In particular, we used the new singular-value-decomposition-based algorithm to implement MCE , using the Matlab code provided on the author’s website (https://sites.google.com/site/carlovittoriocannistraci/home). The embedding was performed without centering the minimum curvilinear kernel (non-centered MCE).
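The score density measure can be illustrated with a short Python sketch. It assumes the BLASTN results are available in the standard 12-column tabular format (query, subject, percent identity, alignment length, ..., e-value, bit-score); the study does not state which output format was used, so the column positions are an assumption, as is the choice to skip self-hits. The function name is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def score_density_edges(blast_lines: Iterable[str]) -> Iterator[Tuple[str, str, float]]:
    """Yield (query, subject, score_density) from tabular BLAST lines.

    Score density = bit-score / alignment length. Assumes the 12-column
    tabular layout, where column 4 is the alignment length and column 12
    is the bit-score.
    """
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # assumption: self-hits carry no clustering information
        alignment_length = int(fields[3])
        bit_score = float(fields[11])
        yield query, subject, bit_score / alignment_length
```

Each yielded triple could be written as a tab-separated line to give a weighted edge list, which is a common input form for graph clustering tools such as MCL.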
Flanking genes within 100 kb of the merged UCEs were obtained from all species under study. For human and fruit fly, we used the gene models from RefSeq . We used the gene models from the respective genome sequencing projects of the non-model metazoans.
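As an illustration of the 100 kb window, the sketch below selects genes lying within a given distance of a merged UCE. The text does not specify how distance was measured, so this sketch assumes the gap between the nearest edges of the two features, with half-open coordinates; overlapping features have a gap of zero. The `Feature` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    name: str
    chrom: str
    start: int  # half-open [start, end)
    end: int


def gap_between(a: Feature, b: Feature) -> int:
    """Number of bases separating two features; 0 if they overlap or touch."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def nearby_genes(region: Feature, genes: List[Feature], window: int = 100_000) -> List[str]:
    """Names of genes on the same sequence within `window` bases of the region."""
    return [
        gene.name
        for gene in genes
        if gene.chrom == region.chrom and gap_between(region, gene) <= window
    ]
```

A linear scan is adequate for a few thousand regions; an interval index would be a natural replacement for larger inputs.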
where G is the total number of genes from the species pool in the cluster, g is the number of selected nearby genes in the species pool in the cluster, D is the number of occurrences of the domain in the species pool in the cluster, and d is the number of occurrences of the domain in the selected nearby genes in the species pool of the cluster.
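The four quantities defined above (G, g, D, and d) match the parameters of a hypergeometric overrepresentation test. As a sketch under that assumption, the probability of observing d or more occurrences of a domain among the g selected genes can be computed as follows; the function name is hypothetical, and the exact form of the test used in the study may differ.

```python
from math import comb


def hypergeom_upper_tail(G: int, g: int, D: int, d: int) -> float:
    """P(X >= d) when g genes are drawn without replacement from G genes,
    of which D carry the domain.

    Assumes 0 <= D <= G and 0 <= g <= G. Exact integer arithmetic is used
    for the binomial coefficients, with one final division.
    """
    total = comb(G, g)
    upper = min(g, D)  # cannot observe more hits than draws or carriers
    favorable = sum(comb(D, i) * comb(G - D, g - i) for i in range(d, upper + 1))
    return favorable / total
```

For instance, with G = 10, g = 3, D = 4, and d = 2, the function returns (36 + 4) / 120, i.e. one third.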
Gene ontology enrichment of the nearby genes was analyzed using DAVID . Considering that human has the most comprehensive biological process terms and nearly nothing is annotated in non-model species, only human UCRs and their nearby genes were analyzed.
where x is the number of occurrences of the oligomer, n is the sample size, i.e. sequence length - oligomer size + 1, and p is the probability of observing such an oligomer in the random background sequence. Related TFs for oligomers were identified using STAMP .
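The definitions of x, n, and p above are consistent with a binomial model of oligomer occurrence. As a sketch under that assumption, the code below counts overlapping occurrences of an oligomer and computes the probability of seeing at least that many by chance. The function names are hypothetical, and how the background probability p was estimated in the study is not stated here, so p is left as an input.

```python
from math import comb


def count_oligomer(seq: str, oligo: str) -> int:
    """Count overlapping occurrences of `oligo` in `seq`."""
    k = len(oligo)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == oligo)


def sample_size(seq_len: int, oligo_len: int) -> int:
    """Number of oligomer start positions: sequence length - oligomer size + 1."""
    return seq_len - oligo_len + 1


def binomial_upper_tail(x: int, n: int, p: float) -> float:
    """P(X >= x) for X ~ Binomial(n, p).

    Direct summation; suitable for the short sequences considered here,
    but very large n would call for a log-space implementation.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))
```

A typical call would combine the three: `binomial_upper_tail(count_oligomer(seq, oligo), sample_size(len(seq), len(oligo)), p)`.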
SNP: single nucleotide polymorphism
MCE: Minimum Curvilinear Embedding
TFBS: TF binding site
The authors thank Dr. Carlo Vittorio Cannistraci (KAUST) for conducting the visualization analysis by Minimum Curvilinear Embedding and for the creation of the 3D movie in the supporting information. We also are grateful to Professor Christoph Gehring for his critical review on evolutionary analysis. All the authors are supported by King Abdullah University of Science and Technology.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.