Exon-primed intron-crossing (EPIC) markers for non-model teleost fishes
© Li et al. 2010
Received: 3 December 2009
Accepted: 31 March 2010
Published: 31 March 2010
Exon-primed intron-crossing (EPIC) markers have three advantages over anonymous genomic sequences in studying the evolution of natural populations. First, the universal primers designed in exon regions can be applied across a broad taxonomic range. Second, the homology of EPIC-amplified sequences can be easily determined by comparing either their exon or intron portion, depending on the genetic distance between the taxa. Third, having both the exon and intron fragments helps in examining genetic variation at the intraspecific and interspecific levels simultaneously, which is particularly helpful when studying species complexes. However, the paucity of EPIC markers has hindered multilocus studies using nuclear gene sequences, particularly in teleost fishes.
We introduce a bioinformatics pipeline for developing EPIC markers by comparing the whole genome sequences of two or more species. By applying this approach to five teleost fishes whose genomes were available in the Ensembl database (http://www.ensembl.org), we identified 210 EPIC markers that have single-copy and conserved exon regions with identity greater than 85% among the five teleost fishes. We tested 12 randomly chosen EPIC markers in nine teleost species spanning a wide phylogenetic range. The success rate of amplifying and sequencing those markers varied from 44% to 100% in different species. We analyzed the exon sequences of the 12 EPIC markers from 13 teleosts. The resulting phylogeny contains many traditionally well-supported clades, indicating the usefulness of the exon portion of EPIC markers in reconstructing species phylogeny, in addition to the value of the intron portion of EPIC markers in interrogating population history.
This study illustrated an effective approach to develop EPIC markers in a taxonomic group, where two or more genome sequences are available. The markers identified could be amplified across a broad taxonomic range of teleost fishes. The phylogenetic utility of individual markers varied according to intron size and amplifiability. The bioinformatics pipelines developed are readily adapted to other taxonomic groups.
Molecular studies aimed at understanding species limits and population dynamics are often thwarted by conflicting results arising from different markers. Although mitochondrial genes are frequently used for detecting genetic patterns in recently diverged populations, multiple independent nuclear loci are generally thought to provide more reliable estimates of the evolutionary history of populations. Not only do multilocus analyses provide better estimates of divergence times and effective population sizes, but they also provide more reliable estimates of species trees, one of the primary goals of systematics. Gene trees are often affected by factors other than divergence, such as incomplete lineage sorting and migration. The most widely accepted way to separate the historical signal from the stochastic effects on gene trees is to extract the common patterns from many independent loci. In the same vein, state-of-the-art approaches for studying species delimitation and population dynamics, such as coalescent-based methods [4, 5] and individual assignment tests [6, 7], have always emphasized using more loci.
One class of markers commonly used in such studies is anonymous nuclear loci. For example, they have been used to study the demography of the eastern fence lizard and the statistical phylogeography of birds. The major drawback of such approaches is the effort that has to be invested in developing the markers, which usually involves extensive cloning and sequencing of genomic DNA. In addition, the markers developed for one taxon often cannot be applied to other taxa, due to the high mutation rates in priming sites.
An alternative to anonymous nuclear markers is intron sequence. Introns have been successfully used in species-level studies [9–12]. The common strategy for sequencing introns is to design primers on adjacent exon regions and amplify across the intron, the so-called exon-primed intron-crossing (EPIC) markers [13–15]. Because exons are usually more conserved than introns and most anonymous loci, EPIC primers can generally be applied across a wider taxonomic range of organisms. A further advantage of EPIC markers is that having both the exon and intron fragments can be useful for examining genetic variation at the intraspecific and interspecific levels simultaneously, a feature that is particularly useful when studying species complexes. Having both the exon and intron sequences also helps in assessing the orthology of collected sequences.
Development of molecular markers has benefited from the growth of publicly accessible genomes and EST data sets. A few bioinformatics tools have been successfully used to explore the potential of intron markers in plants [17, 18]. Recently, Backström  developed intron markers for a non-model species, zebra finch (Taeniopygia guttata), by comparing its expressed sequence tag (EST) sequences with the genome sequences of chicken. The success of this study demonstrates that genomic data from a model organism can be used effectively to develop EPIC markers for non-model species. These resources "pave the way for easy multilocus study of evolving populations and lineages of birds, and bring the goal of quickly turning nonmodel species into ecological genomic models tantalizingly close" .
The development and use of introns in fish studies, however, are still sporadic [11, 12, 21, 22], and few automated bioinformatics tools have been developed. On the other hand, the genomic resources of fishes are much more extensive than those of birds. Currently, there are five full genome sequences of fishes (Danio rerio, Oryzias latipes, Gasterosteus aculeatus, Takifugu rubripes and Tetraodon nigroviridis) curated at the Ensembl genome browser (http://www.ensembl.org). These five fishes encompass a wide phylogenetic breadth, from the ostariophysans to the tetraodontiforms. In addition, many EST sequences for other fish species are available (http://www.ncbi.nlm.nih.gov). Using these published fish genomes, Li et al. developed a bioinformatics pipeline to identify single-copy and conserved exons for phylogenetics. The strategy they used can be adapted to search for intron markers flanked by single-copy and conserved exons. The objectives of our study were: (1) to develop a bioinformatics tool to search for intron markers flanked by single-copy conserved exons; (2) to identify such markers for teleost fishes using the five published fish genomes; (3) to design primers and survey a set of teleost fishes using the markers identified with our pipeline; and (4) to investigate the relationship between intron size, amplifiability and genetic distance of target taxa in order to predict how distantly related two taxa can be and still amplify for the same markers.
The gene description, chromosomal location and start position of EPIC markers identified in this study.
Marker start (bp)
UPF0027 protein C22orf28 homolog. [Source:Uniprot/SWISSPROT;Acc:Q6NZS4]
Peroxisome proliferator activated receptor gamma coactivator 1 alpha (Fragment). [Source:Uniprot/SPTREMBL;Acc:Q52MY8]
ret proto-oncogene [Source:RefSeq peptide;Acc:NP_858048]
hypothetical protein LOC569455 [Source:RefSeq peptide;Acc:NP_001139076]
hypothetical protein LOC415169 [Source:RefSeq_peptide;Acc:NP_001002079]
60S ribosomal protein L18a [Source:UniProtKB/Swiss-Prot;Acc:Q7ZWJ4]
spectrin alpha 2 [Source:RefSeq peptide;Acc:NP_001091958]
60S ribosomal protein L8 [Source:UniProtKB/Swiss-Prot;Acc:Q6P0V6]
karyopherin (importin) beta 1 [Source:RefSeq_peptide;Acc:NP_001032791]
karyopherin (importin) beta 1 [Source:RefSeq_peptide;Acc:NP_001032791]
CCR4-NOT transcription complex subunit 1 (CCR4-associated factor 1) [Source:UniProtKB/Swiss-Prot;Acc:A1A5H6]
nucleoporin 155 [Source:RefSeq peptide;Acc:NP_956450]
The primer sequences used for the 12 EPIC markers.
Forward primer sequence
Reverse primer sequence
Taxa used and sequences collected in testing the 12 EPIC markers.
Species and higher classifications
Number of taxa amplified
Wiens  described a "wish list" for the ideal method in delimiting species. He suggested that such a method "considers both incomplete lineage sorting and gene flow among populations, can integrate data from multiple loci, can determine species limits without having those limits defined a priori (i.e., it will allow one to discover unanticipated species from the molecular data), and can allow one to estimate the statistical support for species-level decisions". If we think thoroughly about what kind of data are necessary to achieve such goals, we are always led to multilocus data. For example, many independent loci can sort out the signal of population divergence from gene flow or random lineage sorting. Individual based assignment tests, without the need of defining populations a priori, all require multilocus data and the power of those methods is highly dependent on the number of loci available [6, 7].
As mitochondrial loci typically are linked in vertebrates, they essentially act as a single locus, which precludes them as the sole data source for multilocus population studies. Single-copy nuclear polymorphic sequences are widely regarded as the marker of choice, but even these markers are not free of obstacles, such as the availability of the markers and technical hurdles in resolving haplotypes. A good nuclear marker for studies at the species level should be amplifiable across different taxonomic groups, and the target sequence should exhibit reasonable variation at the intraspecific level. EPIC markers seem to be a natural fit to this description. The conserved exon portion of EPIC markers improves the versatility of primer amplification across taxa, whereas the more variable intron portion provides variation for intraspecific studies. Our approach, which focuses on single-copy and conserved coding sequence to facilitate universal primer design and homolog identification, is admittedly a biased representation of the whole genome. Sequence conservation in the exonic part of a gene does not necessarily imply a slow evolutionary rate in its intronic part. Even so, one study showed that introns had a low average genetic diversity relative to anonymous loci in birds, and another found a correlation between the evolutionary rate of exons and non-coding regions in Drosophila melanogaster. Therefore, the limits on variation in EPIC markers and the correlation between the rates of their intron and exon portions await more scrutiny from empirical studies.
Nonetheless, two of the obstacles in applying single-copy nuclear polymorphic sequences in population studies can now be overcome due to recent developments in genomics and high-throughput sequencing techniques. First, due to the fast accumulation of publicly accessible genome sequences and EST databases, mining genetic markers in silico has never been easier, as shown by this study. Currently, the Ensembl genome browser maintains more than 51 genome assemblies (http://www.ensembl.org, as of Nov 2009), while GenBank holds 63,463,018 EST records (http://www.ncbi.nlm.nih.gov/, as of Nov 2009). Methods based on comparison between two or more genomes, or between one genome and EST sequences, have been developed for plants, birds and fishes [16–19, 23, 28, 29]. The worries about the versatility of EPIC markers across different taxa and the difficulties in developing such markers should now be alleviated. Due to limited genomic resources, the early approaches often utilized both the genome and EST sequences available at the time [17, 19]. Since EST sequences do not contain information about introns and are usually incomplete, EPIC markers developed from ESTs will likely be of lower quality: they may be present in multiple copies in the genome and their sizes are less predictable, which might require more lab work in the screening steps to find good markers.
The other technical hurdle for utilizing intron sequences is how to resolve haplotypes. Because EPIC markers are intended for studies at the species level, high polymorphism can exist within species and among alleles. Strictly speaking, the alleles of diploid individuals have to be determined before performing any meaningful species-level analysis. There are experimental methods to resolve haplotypes, such as cloning, gel extraction, "allele-dropout-effect" and SSCP, all of which demand considerable laboratory effort. Alternatively, haplotypes can be resolved statistically, using software packages such as PHASE. However, physically separating the two alleles is usually the only choice if there is length variation between them, the so-called length variant heterozygote (LVH). A new solution for separating sequences from two alleles is next-generation sequencing, which in theory allows the two alleles in each individual to be sequenced in parallel, avoiding the extensive cloning steps. The current challenge is to design ways of tagging DNA to facilitate sequencing many genes and many individuals in one run [34–36].
We aimed to develop a list of candidate markers for studying closely related species or populations for any group of teleost fishes, especially non-model species. Thus, we designed primers on alignments of all five fish genomes to increase the success rate of amplification in any teleost. If we were interested in a particular group of teleosts, for example gasterosteiforms, we could have used Gasterosteus aculeatus as the query and Oryzias latipes as the reference to increase the specificity of the resulting primers, but those primers might only be applicable to that group of fishes. Since our primers were designed on the conserved exonic part of the gene, using all five genomes provided good priming sites for designing universal primers, which can be applied to other non-model teleosts. There are only a handful of model organisms for which whole genome sequences are available, so our approach is particularly important for developing markers in non-model organisms. Our results show that several markers amplified in distantly related non-model teleosts.
The putative "whole-genome duplication" events at the base of teleosts could have a serious impact on ortholog identification. This is especially problematic if differential gene loss happened in different lineages after the genome duplication. Nevertheless, we intended to develop single-copy markers for studying closely related species or populations, so differential gene loss should have less detrimental effects, i.e., the gene copy is most probably orthologous in closely related species or in different individuals of the same species. If one pair of primers resulted in multiple fragments in a particular species, possibly due to gene duplication, the marker should be discarded for that species.
In our experiment, we tested 12 candidate markers in nine distantly related teleost fishes to illustrate the broad taxonomic usage of those markers. We found that five to 12 of the markers amplified in different species, leaving some missing data. Although this test case served as an illustration of how these markers might work across taxa, it was not a typical study, in which closely related species or populations are examined. In such cases, a similar set of loci should amplify equally well in different individuals or species, as shown in our results (Table 3, Figure 3), resulting in little or no missing data.
We developed an efficient strategy for mining EPIC markers by comparing genome sequences. Applying the bioinformatic tools developed in this study, we found 210 candidate EPIC markers in teleost fishes. By testing some of the candidate markers, we illustrated the usefulness of these markers in a broad range of teleost taxa. The strategy and the bioinformatic tools we developed are readily adapted for other taxonomic groups.
The exon portion of each sequence was aligned using ClustalW  implemented in MEGA4 . PriFi  was used to find the optimal regions for designing primers on the aligned sequences. Primers were designed by eye and checked with Primer3 . Whenever possible, the 3' end of the primers was designed on the 1st or 2nd codon position and the number of mismatches in the last five nucleotides of 3' end of the primers also was minimized, in order to increase annealing accuracy of the primers.
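The two 3'-end criteria described above can be checked mechanically once a candidate primer and the aligned priming sites are in hand. The following is a minimal sketch, not the procedure used in this study: the function names are hypothetical, the primer is assumed to be a forward primer written on the coding strand, degenerate bases are not handled, and the coordinate passed to `codon_position` is assumed to be a 0-based position in an in-frame coding sequence.

```python
from typing import List


def three_prime_mismatches(primer: str, sites: List[str], window: int = 5) -> List[int]:
    """Count mismatches between the last `window` bases of the primer and
    the last `window` bases of each same-length priming site."""
    tail = primer[-window:].upper()
    counts = []
    for site in sites:
        site_tail = site[-window:].upper()
        counts.append(sum(1 for p, s in zip(tail, site_tail) if p != s))
    return counts


def codon_position(index_in_cds: int) -> int:
    """Return the codon position (1, 2 or 3) of a 0-based coordinate
    in an in-frame coding sequence."""
    return index_in_cds % 3 + 1


# Hypothetical priming sites from three taxa for one candidate primer.
sites = ["ACGTACGT", "ACGTACGA", "ACGTTCGT"]
mismatches = three_prime_mismatches("ACGTACGT", sites)
```

A primer whose 3'-terminal base falls on a third codon position (`codon_position(...) == 3`), or whose 3' window mismatches several taxa, would be a candidate for redesign under the criteria in the text.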
Twelve randomly selected EPIC markers were tested in nine teleost fishes (Tables 1 and 2). The species were related to the current research interests of the authors, but they also were chosen to cover a broad phylogenetic range. Danio rerio was used as the positive control. In addition, two other cypriniforms (Hypophthalmichthys molitrix and H. nobilis), one clupeiform (Thryssa hamiltonii) and five perciforms (Larimichthys crocea, Siniperca chuatsi, Odontobutis potamophila, Micropercops swinhonis and Rhinogobius giurinus) were picked. By design, the sample included a taxon (T. hamiltonii) that diverged from the others about 307 million years ago, several families, two closely related genera (Odontobutis and Micropercops) and two congeneric species (H. molitrix and H. nobilis), so the selected taxa covered a broad range of genetic distances.
DNA samples were extracted from muscle tissues or fin clips using a standard phenol-chloroform method. All PCRs were performed in a total volume of 20 μl, including 0.1 μl TaKaRa Taq™ (Takara, Shanghai, China), 2.0 μl 10 × PCR buffer (+ MgCl2), 1.6 μl dNTP Mixture (2.5 mM each dNTP), 0.64 μl each of forward and reverse primers, 0.8 μl DNA template and 14.22 μl distilled water. The PCR reactions were carried out in an Eppendorf Mastercycler with silver block (Eppendorf China Ltd, Shanghai, China). The PCR program consisted of an initial heating at 95°C for 30 sec, 15 cycles of 98°C for 10 sec, 60°C for 30 sec and 72°C for 45 sec, 15 cycles of 98°C for 10 sec, 58°C for 30 sec and 72°C for 45 sec, followed by a final extension at 72°C for 5 min. The PCR products were visualized on agarose gels. The amplified products were sequenced by Shanghai Sangon Biological Engineering Technology & Services Co, Ltd. (Shanghai, China).
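The reaction recipe and cycling program above lend themselves to a quick arithmetic check. The sketch below, offered only as an illustration, confirms that the listed volumes sum to 20 μl, scales the template-free components into a master mix, and adds up the nominal hold time of the program. The 10% pipetting overage and the function names are assumptions, not part of the protocol, and ramp times between temperatures are ignored.

```python
# Per-reaction volumes in microlitres, as listed in the protocol.
REACTION_UL = {
    "Taq polymerase": 0.1,
    "10x PCR buffer (+ MgCl2)": 2.0,
    "dNTP mixture (2.5 mM each)": 1.6,
    "forward primer": 0.64,
    "reverse primer": 0.64,
    "DNA template": 0.8,
    "distilled water": 14.22,
}

# (repeats, [(temperature in C, seconds), ...]) for each stage of the program.
PROGRAM = [
    (1, [(95, 30)]),                       # initial heating
    (15, [(98, 10), (60, 30), (72, 45)]),  # first 15 cycles
    (15, [(98, 10), (58, 30), (72, 45)]),  # next 15 cycles, lower annealing
    (1, [(72, 300)]),                      # final extension
]


def master_mix(n_reactions: int, overage: float = 0.1) -> dict:
    """Scale every component except the template for n reactions.
    The overage fraction is a hypothetical allowance for pipetting loss."""
    factor = n_reactions * (1 + overage)
    return {name: round(vol * factor, 2)
            for name, vol in REACTION_UL.items() if name != "DNA template"}


def hold_seconds(program) -> int:
    """Sum of programmed hold times, excluding temperature ramps."""
    return sum(repeats * sum(sec for _, sec in steps) for repeats, steps in program)


total_volume = round(sum(REACTION_UL.values()), 2)  # 20.0
nominal_seconds = hold_seconds(PROGRAM)             # 2880, i.e. 48 minutes
mix_for_ten = master_mix(10)
```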
The sequences determined in this study and the sequences of the model species retrieved from the Ensembl database were aligned to each other using ClustalW. Because of the large genetic distance among most tested species, the intron sequences were not alignable except between the congeneric species (H. molitrix and H. nobilis) and between the two closely related genera (O. potamophila and M. swinhonis). Therefore, the alignment was made only on the exon parts of the EPIC sequences, and the size of the introns was recorded. The exon sequences were translated into amino acids and aligned using ClustalW as implemented in MEGA4; the aligned sequences were then translated back into nucleotides.
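The final step above, translating the protein alignment back into nucleotides, amounts to threading each original coding sequence onto its aligned protein so that every alignment gap becomes a three-base gap. MEGA4 performs this internally; the function below is only a sketch of the idea, with a hypothetical name, and it assumes the exon fragment starts in frame and contains exactly three bases per aligned residue.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its aligned protein,
    emitting one codon per residue and '---' per alignment gap."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length must be three times the number of residues")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    return "".join("---" if aa == "-" else next(codons) for aa in aligned_protein)


# Two residues separated by one alignment gap.
codon_alignment = back_translate("M-K", "ATGAAA")
```

Because gaps are inserted only in multiples of three, the reading frame is preserved, which is what later allows the alignment to be partitioned by codon position.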
To compare amplification success between taxa, we used the index $DA_{ij} = N_{i \cap j} / N_{i \cup j}$, in which $DA_{ij}$ is the difference of amplifiability between taxon i and taxon j, $N_{i \cap j}$ is the number of markers amplified in both taxon i and taxon j, and $N_{i \cup j}$ is the number of markers amplified in either of the two taxa. The value of $DA_{ij}$ ranges from 0 (no shared markers) to 1 (all markers are shared). One important practical question is whether EPIC markers known to work for one species can be applied to other species, or, put differently, how closely related two species must be for the same markers to work. To address this question, we calculated the correlation between amplifiability and the genetic distance among species, and the correlation between intron size and genetic distance.
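Read from its verbal definition, the index is the number of markers amplified in both taxa divided by the number amplified in either taxon, which gives 0 when no markers are shared and 1 when all are shared. A minimal sketch of that reading follows. The function names and marker labels are hypothetical, and the use of a Pearson coefficient is an assumption, since the text does not state which correlation measure was used.

```python
from math import sqrt
from typing import Sequence, Set


def amplifiability_index(amplified_i: Set[str], amplified_j: Set[str]) -> float:
    """Markers amplified in both taxa divided by markers amplified in either."""
    either = amplified_i | amplified_j
    if not either:
        raise ValueError("at least one marker must amplify in one of the taxa")
    return len(amplified_i & amplified_j) / len(either)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical marker sets for two taxa: two shared out of four in total.
taxon_a = {"m1", "m2", "m3"}
taxon_b = {"m2", "m3", "m4"}
da_ab = amplifiability_index(taxon_a, taxon_b)
```

In practice one value of the index would be computed for every pair of taxa and then correlated with the corresponding pairwise genetic distances. Pairwise values are not independent of one another, so a significance test on such a correlation would need to account for that.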
Finally, we tested the utility of the exon parts of the 12 EPIC markers in reconstructing the species phylogeny. The exon sequences from 13 species (9 determined in this study plus 4 retrieved from Ensembl) were concatenated. The concatenated sequences were partitioned by codon position, because the exon sequences collected for each gene were short (≈ 200-300 bp) in this study and too few sites would be available per partition if we partitioned by both codon position and gene. It also has been shown that most heterogeneity often can be captured through partitioning by codon position. The optimal model for each data partition was selected using the "propose model" analysis in TreeFinder. Partitioned Bayesian analyses were carried out using MrBayes, applying the models closest to those selected by TreeFinder. Two independent runs, each with eight Markov chain Monte Carlo (MCMC) chains, were performed. The heating parameter was set as "temp = 0.1" to improve the mixing of the MCMC. The runs were terminated after 10 million generations with a sampling frequency of 1 in 1000 (10,000 trees saved for each run). After discarding the burn-in samples (1000 trees from each run), a 50% majority-rule tree was calculated using sumt. The maximum likelihood phylogeny was searched with TreeFinder using the best model for each data partition. Bootstrap analysis with 1000 replications was performed to assess the statistical support for each node of the ML tree. The resulting phylogeny was graphed using Dendroscope.
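The Bayesian run settings reported above map onto a short MrBayes command block. The sketch below assembles such a block from the stated values and computes how many sampled trees remain per run after burn-in. It is an illustration rather than the command file used in this study: the function names and the partition name `by_codon` are hypothetical, the substitution-model settings are omitted because they are not listed in the text, the codon charsets assume the concatenated alignment stays in frame, and an absolute burn-in may need additional settings depending on the MrBayes version.

```python
def mrbayes_block(ngen: int = 10_000_000, samplefreq: int = 1000,
                  nruns: int = 2, nchains: int = 8,
                  temp: float = 0.1, burnin: int = 1000) -> str:
    """Build a MrBayes block that partitions by codon position and runs the MCMC."""
    lines = [
        "begin mrbayes;",
        "  charset pos1 = 1-.\\3;",
        "  charset pos2 = 2-.\\3;",
        "  charset pos3 = 3-.\\3;",
        "  partition by_codon = 3: pos1, pos2, pos3;",
        "  set partition = by_codon;",
        f"  mcmc ngen={ngen} samplefreq={samplefreq} nruns={nruns} "
        f"nchains={nchains} temp={temp};",
        f"  sumt burnin={burnin};",
        "end;",
    ]
    return "\n".join(lines)


def retained_trees(ngen: int, samplefreq: int, burnin: int) -> int:
    """Nominal number of sampled trees kept per run after discarding burn-in."""
    return ngen // samplefreq - burnin


block = mrbayes_block()
kept_per_run = retained_trees(10_000_000, 1000, 1000)  # 9000
```

With the reported settings, 10,000 trees are sampled per run and 9,000 remain after burn-in, so the majority-rule tree summarizes 18,000 trees across the two runs.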
We are grateful to Jinliang Zhao, Chenghui Wang and Xiaoling Gong for their help with fish tissue samples. We thank Dr Gavin Naylor for his help with writing. Financial support was provided by National Science Foundation grants DEB-0732838 (to C. Li), EPSCoR EPS-0701892 (to J.-J. M. Riethoven) and National Non-Profit Institutes (East China Sea Fisheries Research Institute) 2007Z01 (to L. Ma).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.