Evaluating the performance of anchored hybrid enrichment at the tips of the tree of life: a phylogenetic analysis of Australian Eugongylus group scincid lizards
© Brandley et al.; licensee BioMed Central. 2015
Received: 1 October 2014
Accepted: 24 February 2015
Published: 11 April 2015
High-throughput sequencing using targeted enrichment and transcriptomic methods enables rapid construction of phylogenomic data sets incorporating hundreds to thousands of loci. These advances have enabled access to an unprecedented amount of nucleotide sequence data, but they also pose new questions. Given that the loci targeted for enrichment are often highly conserved, how informative are they at different taxonomic scales, especially at the intraspecific/phylogeographic scale? We investigate this question using Australian scincid lizards in the Eugongylus group (Squamata: Scincidae). We sequenced 415 anchored hybrid enriched (AHE) loci for 43 individuals and mined 1650 exons (1648 loci) from transcriptomes (transcriptome mining) from 11 individuals, including multiple phylogeographic lineages within several species of Carlia, Lampropholis, and Saproscincus skinks. We assessed the phylogenetic information content of these loci at the intergeneric, interspecific, and phylogeographic scales. As a further test of the utility at the phylogeographic scale, we used the anchor hybrid enriched loci to infer lineage divergence parameters using coalescent models of isolation with migration.
Phylogenetic analyses of both data sets inferred very strongly supported trees at all taxonomic levels. Further, AHE loci yielded estimates of divergence times between closely related lineages that were broadly consistent with previous population-level analyses.
Anchored-enriched loci are useful at the deep phylogeny and phylogeographic scales. Although overall phylogenetic support was high throughout the Australian Eugongylus group phylogeny, there were nonetheless some conflicting or unresolved relationships, especially regarding the placement of Pseudemoia, Cryptoblepharus, and the relationships amongst closely-related species of Tasmanian Niveoscincus skinks.
With the increasing access to genomic data enabled by high-throughput sequencing (HTS), phylogeneticists are applying phylogenomics to resolve deeper nodes in the tree of life (e.g., [1-3]). When used along with some form of genome reduction, HTS enables analysis of hundreds to thousands of loci across highly divergent organisms. Especially promising approaches include comparative RNAseq (e.g., [4-6]) and targeted hybrid enrichment (also referred to as sequence capture) using conserved or highly conserved orthologous loci as targets [1,7-9]. To the extent that common target loci are employed across diverse groups, these enrichment approaches have the potential to produce highly connectable data across orthologous loci, as was the case for mtDNA and cpDNA in the past.
For molecular systematists working at the phylogeographic interface between population genetics and phylogenetics, the utility of hybrid enrichment methods is not immediately obvious. Several methods have been used to generate data at low-moderate scales of phylogenetic divergence, including amplicon sequencing [10-12], various RAD-type methods (e.g., [13-15]), and target enrichment either using custom PCR-baits  or custom exon-capture designed from de-novo assembled transcriptomes . Each of these has disadvantages – amplicon sequencing and RAD sequencing often generate very patchy matrices of loci by individuals, especially as divergence increases, and custom exon-capture requires more up-front investment. Custom PCR-bait capture has great potential for projects requiring tens of loci, but is not well suited where researchers aim to sequence hundreds of loci to improve phylogenetic precision. Directly mining transcriptomes (transcriptome mining or TM hereafter) is another source of potentially thousands of homologous loci, but phylogenetic analyses of mRNAs may be misled by recombination, especially in genes with exons separated by long introns. Moreover, transcriptome sequencing also requires very high quality, RNA-preserved tissue, and it remains expensive to sequence hundreds of individuals. Indeed, all phylogenomic methods require significant financial resources, but when factoring in the costs of reagents, sequencing, and labor versus the number of nucleotides sequenced, target enrichment type methods are far less expensive than traditional PCR-based methods .
Target enrichment using conserved targets – either “ultraconserved” elements (UCEs ) or conserved exons (anchored hybrid enrichment, AHE ) – could well yield useful data for phylogeographic inference, including historical demography, species delimitation and inference of divergence processes. However, given that the targets themselves are evolutionarily conserved, divergences and phylogenetic information content within and across closely related lineages could be low. In general, capture efficiency, and hence coverage, is highest across the relatively conserved target areas and decreases into the flanking sequences (e.g. intergenic regions or introns) that are expected to contain the most variability at this scale [9,17]. Further, given their high level of conservation, purifying selection could result in strongly biased estimates of population diversity, and thus of phylogenetic estimates via species trees or divergence parameters [18,19]. On the other hand, with the exception of genes that are highly conserved or consistently under strong adaptive or balancing selection, these issues may be less significant for loci mined from transcriptomes.
That conserved locus hybrid-capture might nonetheless have utility at a phylogeographic scale is suggested by a recent analysis of population divergence across multiple avian taxa from the Amazon basin that used anchored enrichment of UCEs . Similarly, AHE has provided good resolution of phylogeny and divergence histories across closely related species and lineages of chorus frogs (Pseudacris; Lemmon et al., unpub. data). However, using non-random sets of the genome for evolutionary inference has potential to bias estimates of divergence and historical demography because depending on how these loci are ascertained, they might show idiosyncratic patterns from the rest of the genome. For example, one recent study found that data from UCEs versus RAD-Seq yielded very different estimates of divergence history parameters among phylogeographic lineages .
Here we explore the information content of AHE loci at different taxonomic scales including across genera, congeneric species, and phylogeographic lineages of the Australian Eugongylus group skinks (Squamata: Scincidae). In doing so, we both infer phylogenies and divergence histories for this group of lizards. We also compare the performance of AHE to the only other current source of multi-locus data for the Eugongylus group: exons mined from transcriptomes.
The Eugongylus group is a species rich clade of lygosomine skinks that includes ~40 genera and 420 described species. The common ancestor of a clade consisting of 17 extant genera colonized Australia and adjacent landmasses (New Zealand, New Caledonia, South Pacific islands) from Asia in the Early to Middle Miocene (42–22 Mya [20-22]). Species from three of these genera – Carlia, Lampropholis and Saproscincus – have been subjects of detailed phylogeographic analyses in wet forests of eastern Australia [23-28] and of studies of the evolution of reproductive isolation among phylogeographic lineages . Previous molecular phylogenetic analyses using relatively few loci have inferred strong support for relationships among species within these genera (e.g. [23,30]). However, relationships among genera in the Eugongylus group remain largely obscure and are evidently difficult to resolve using few loci (Shea et al., unpub. data). Moreover, Niveoscincus and Pseudemoia are viviparous, and species in these genera show a range of both placental complexity and variation in mother-to-embryo exchange [31-34]. Without a robust intergeneric phylogeny, it remains impossible to uncover the evolutionary processes that explain the diverse ecological, physiological, and reproductive traits among Australian Eugongylus group skinks.
We constructed two multi-locus data sets using from data collected from different sequencing methods. The first data set consists of 415 orthologous genes sequenced for 43 taxa using anchored hybrid enrichment . The second data set includes 1650 exons from 1648 orthologous genes for 11 taxa that were mined from RNA transcriptomes.
Taxon sampling design
We evaluate the performance of AHE and TM at the intergeneric, interspecific, and intraspecific phylogenetic levels. Our nested sampling design includes 9 of 17 Australian Eugongylus group genera as well as two individuals per phylogeographic lineage for Carlia rubrigularis, Lampropholis coggeri, La. robertsi, and Saproscincus basiliscus (Additional file 1). As outgroup taxa, we sequenced Lerista bougainvillii of the Sphenomorphus group clade of lygosomine skinks, which shares a most recent common ancestor with the Eugongylus group 64–100 million years ago . We used a single outgroup because this is as required for STEAC analyses (see below), and this makes the results directly comparable across methods. Inclusion of all three outgroups in RAxML analyses of the concatenated data set does not affect support for relationships of the ingroup (not shown). The final AHE data set includes 42 ingroup samples, and the TM data set includes 10 ingroup samples and a single outgroup in each data set (Le. bougainvillii).
Anchored enrichment data collection and processing
Data were collected following the methods of  through the Center for Anchored Phylogenomics at Florida State University. Briefly, each genomic DNA sample was sonicated to a fragment size of ~300-700 bp. Subsequently, library preparation and indexing were performed following a protocol modified from . Indexed samples were then pooled at equal quantities (eight samples per pool), and enrichments were performed on each multi-sample pool using an Agilent Custom SureSelect kit, which contained probes designed for anchored loci from multiple vertebrate genomes (Vertebrate 1.0, described in ). After enrichment, the six reactions were pooled in equal quantities for sequencing on a single PE100 Illumina HiSeq2000 lane. Sequencing was performed at the QB3 Vincent J. Coates Sequencing Laboratory at the University of California, Berkeley.
Quality-filtered sequencing reads were processed following the methods described in . In short, reads were demultiplexed (with no mismatches tolerated) and pairs were merged following . Merged reads were then scanned for matches to the probe region sequences of Anolis carolinensis using the high sensitivity approach described in . Reads matching to a reference sequence for each individual were then aligned to produce an assembly for each locus. The reads were aligned using a script written by ARL that: (1) sorts the reads by number of matches to the reference, (2) for each read in the sorted list (starting with best-matched read), the position maximizing the match to the previous read in the list is noted, (3) if the best matching position does not generate a 90% match (of at least 20 bp) the read is skipped, (4) the entire sorted list is repeatedly traversed until no additional reads have aligned during a traversal (source code for all analyses in this study is available on Dryad doi:10.5061/dryad.g4mj2).
After assemblies were complete, consensus bases were called as follows: (1) all sites with less than 3-fold coverage and variant sites with less than 10-fold coverage were called as “N”, (2) invariant sites with coverage between 3-fold and 10-fold were called with the observed nucleotide, (3) for sites with greater than 9-fold coverage, the most common base was called unless the distribution of observed bases was unlikely to have arisen under a two allele model with equal allele frequencies. This likelihood was approximated using the equation p = (1-pbinom(nMax,n,0.5)), with n equaling the number of unambiguous base calls and nMax equaling the abundance of the most common base call. The most common base observed was called unless p > 0.05, suggesting that the two allele model could explain the diversity observed. In that case, an ambiguity code corresponding to all of the observed bases at that site was used in the consensus sequence. Sequences for each locus were aligned across species using Muscle version 3.8.31 under default settings . Prior to phylogenetic analysis, each locus data set was trimmed to the second shortest sequence, and all gapped characters were removed.
Transcriptome assembly, identification, and mining
Methods for sequencing, assembly, and identification of the transcriptomes of Le. bougainvillii, Pseudemoia entrecasteauxii, P. pagenstecheri, and P. spenceri contigs were similar to . Briefly, approximately 5 gigabases of 100 bp paired-end reads were generated using Illumina HiSeq 2000 from pregnant uterine (Le. bougainvillii, P. entrecasteauxii, and P. pagenstecheri) or brain tissue (P. spenceri) from a single individual of each species. Contigs were assembled for each species with Abyss v1.3.4  using kmer=69, and otherwise default parameters. Contigs ≤ 100 bp were removed. To identify each contig, we used blastx to align them to the genomes of Anolis lizard, chicken, mouse, human, platypus, and wallaby genomes (Ensembl build 64).
For Carlia, Lampropholis, and Saproscincus species, data were collected and processed as described in . Briefly, for each phylogeographic lineage, we collected 3.5 Gb of paired-end reads from each of five adults’ livers. Reads were trimmed and merged using FLASH v1.0.2  and Trimmomatic v0.16 . We then generated an assembly across all individuals for that species using Trinity r2012-01-25 , and then removed redundancy across these assemblies using a custom merge script using blat, cd-hit-est, and cap3 [44-46]. The final transcriptome data sets included eleven taxa including species of Carlia, Lampropholis, Pseudemoia, Saproscincus, and Le. bougainvillii (outgroup).
Assessing phylogenetic history
To test the performance of both 415 AHE loci and 1650 TM loci, we performed phylogenetic analyses for different taxonomic levels including the (i) the full 43 sample data set (Eugongylus group + outgroup) to assess performance at the deep phylogenetic scale, and (ii) Lampropholis only and Saproscincus only data sets to test performance at the interspecific and phylogeographic scale. For each of the AHE loci and TM loci, we first calculated the most appropriate models of sequence evolution for each locus using MrAIC  choosing the model with the best AIC score amongst 24 possible models (GTR, SYM, HKY, K2P, F81, and JC69, with and without I and Γ parameters). We then estimated a maximum likelihood gene tree for each locus using PHYML  using the appropriate model selected by MrAIC.
We performed concatenated data and species tree phylogenetic analyses for the full taxon, Lampropholis + S. basiliscus N1 (outgroup), and Saproscincus + La. coggeri N1 (outgroup). Invariant loci were excluded and the final AHE data sets for the full taxon, Lampropholis, and Saproscincus data sets were 415, 414, and 412 loci, respectively. There are no invariant TM loci in all three data sets and the full 1650 loci were used. Maximum likelihood phylogenetic analyses of the concatenated data sets were performed using RAxML  assuming separate GTR + Γ models for each locus. We then estimated clade support using 1000 pseudoreplicates of bootstrap resampling with the same partition and model scheme as the single-tree analyses.
Simulated and empirical analyses have shown that, if there is strong incongruence amongst loci, analyses of concatenated data sets may infer a strongly supported, but incorrect trees [50-52]. We therefore performed additional species tree analyses using STEAC . STEAC estimates a phylogeny by calculating the average coalescence times across multiple loci. The method is useful because it is computationally tractable with large multi-locus data sets. On the other hand, it assumes incongruence between the gene and species trees is due only to incomplete lineage sorting rather than other phenomena such as hybridization.
STEAC assumes that gene trees are estimated without error, so to incorporate estimation error for individual gene trees in our STEAC analyses, we used 1000 bootstrap trees estimated per locus rather than a single point estimate tree. Although our phylogenetic analyses of concatenated data used RAxML, we chose PHYML , an alternative maximum likelihood phylogenetic program, for this analysis to enable efficient scripting of the STEAC analytical pipeline. We first calculated the most appropriate models of sequence evolution for each locus using the same procedure as above. For each locus, we performed a 1000 pseudoreplicate bootstrap analysis using these chosen models. We used STEAC to infer a species tree. We estimated support for this tree by calculating 1000 ‘bootstrap’ STEAC species trees, where each ‘bootstrap’ species trees draws (without replacement) one of the PHYML bootstrap trees for each locus (e.g., making use, in total, of 415,000 trees for the full AHE data set). Species tree analyses were performed using the phybase package in R . Given the number of taxa and loci, use of Bayesian concordance analyses (e.g., BUCKy ) were not computationally feasible.
To evaluate the spatial distribution of variation across loci, and to enable comparison with results for ultraconserved elements (UCEs) , we assessed the distribution of phylogenetically informative characters (PICs)/across the 415 AHE loci. All loci were aligned so that the 0th nucleotide corresponded to the middle of the aligned locus. Moving laterally along the loci in both directions, we plotted the number of total loci (max = 415) with a PIC site at that nucleotide position across all taxa. For example, if there were 69 loci with a PIC at the 4th nucleotide position, then we recorded a value of 69. Because the loci vary in length, we overlaid these plots on a distribution of locus lengths calculated by counting the number of loci with any character at a nucleotide position. Unlike the alignments used for the phylogenetic analyses, we did not remove gapped characters to avoid artificially increasing the number of PICs near the alignment origin due to shortening the alignments. We repeated these analyses using only the Lampropholis and Saproscincus data sets. Finally, we constructed a “normalized” plot for the full-taxon data set by dividing the number of PICs at each nucleotide position by the number of alignments of that specific length.
Inferring divergence histories
To further explore the utility of AHE markers for evolutionary inference, we inferred divergence history parameters for the five lineage-pairs for which we had prior estimates of divergence history from population samples of transcriptomes . These lineage-pairs, representing different levels of divergence, were North and South populations of C. rubrigularis (N/S), Central and South La. coggeri (C/S), La. N/C, and S. basiliscus N/C. Additionally, we used comparative transcriptome data from Le. bougainvillii, P. entrecasteauxii and P. pagenstecheri (Griffith et al., unpublished data) for sequence divergence comparisons. To infer divergence history, we used the program 3s, which implements a coalescent model for isolation with migration . 3s builds its inference across many genome loci sampled at one chromosome in each of the two focal lineages and an outgroup lineage. Previous work  has shown that with single samples per taxon, migration rates and ancestral population sizes are non-identifiable but that estimates of divergence time (tau) are robust, so we focus our comparisons on divergence time.
To generate haplotypes for our 3s runs, we first took the raw reads generated from AHE data collection, trimmed them for quality and adaptors Trimmomatic v0.16 , merged them using FLASH v1.0.2 , and removed sequence duplicates. For a given lineage-pair, we defined one reference assembly, to which we mapped both ingroup lineages and the outgroup. We mapped cleaned reads to the previously-generated assemblies using Bowtie2  and called SNP variants using Samtools default parameters . We phased variants using GATK ReadBackedPhasing  and filtered any variant and non-variant sites with coverage less than 5X. The resulting, coverage-filtered haplotypes were used in all downstream analyses.
We calculated divergence between the lineage-pairs using several metrics. First, we calculated nuclear divergence for both AHE and TM sequences . Second, we used previously calculated estimates for divergence time in years, which has been estimated for all lineage-pairs but P. entrecasteauxii and P. pagenstecheri . Briefly, we had previously inferred these divergence times by sequencing transcriptomes for each lineage (N = 5) in the lineage-pair, identifying variants in the untranslated regions of the lineage-pair, constructing the two-dimensional site-frequency spectrum (2D-SFS) for these variants, and then using dadi  to fit an isolation-with-migration model to the 2D-SFS. Third, we used 3s to infer tau in coalescent units. For each lineage-pair, we used all available loci (details on loci number, sequence length, and outgroups available in Additional file 2: Table S1), running each analysis ten times to ensure convergence across runs. We compared these estimates of divergence history to those obtained previously, expecting that, although the values should not be absolutely similar due to differences in units and mutation rates, divergence estimates should be correlated.
From the original target set of 512 loci, we successfully captured and sequenced 415 across all samples including the outgroups. After applying coverage criteria, trimming all loci to the length of the second shortest sequence and removing all internal gapped characters, the mean locus length was 534 bp with a minimum and maximum lengths of 250 bp and 1458 bp, respectively (Note that loci with trimmed alignments shorter than 250 bp were removed). Of the 221,792 total characters, only 0.36% of the data was missing in any species.
For the TM analyses, we identified 10,847 exons of 7233 loci that met the initial filtering process. However, not all taxa were present for all loci. From this initial data set, we identified 1650 exons from 1648 genes that could be aligned across all ten taxa. These exons ranged in length from 101 to 3249 bp (mean = 337). Because all but two of our loci are single exons, we expect that phylogenetic error due to recombination is minimal. We refer to these 1650 exons as loci (rather than exons) to simplify comparative discussion with the AHE loci. Loci were trimmed so that all nucleotide positions are present for all genes (i.e., no missing data). Complete details on TM loci are available in Additional file 4.
Phylogenetic information content and gene tree concordance
Summary statistics for the anchored-enriched (AHE) loci and transcriptome mining (TM) exons at different taxonomic levels and excluding non-informative loci
C. rubrigularis 1
La. coggeri 1
S. basiliscus 1
Transcriptome mining (TM)
TM data pruned to number of AHE loci (highest PICs)
TM data pruned to number of AHE loci (randomly sampled)
AHE data pruned to TM taxon sampling
The transcriptome mining (TM) data showed even stronger skew in information content (Table 1) and this was also correlated with locus length (R2 = 0.65; p < 0.001). Although TM produced more usable loci than AHE (1650 vs 415), both the maximum number of PICs and average PICs per locus are higher in the AHE loci (mean PICAHE = 52; mean PICTM = 19.2; Table 1). To compare the data sets more directly, we limited our AHE data set to the 11 taxa also included in the TM data set. Analyzing these data, we find the mean number of PICs per locus dropped from 52 to 25. Despite this reduction, PICAHE remains higher than PICTM. We also trimmed the TM data to the 415 loci with the highest PICs to compare the most informative TM loci to the AHE loci. The mean number of PICs/locus in this data set is 41. To make a more direct comparison, we randomly selected 415 loci from the 1650 TM loci 10,000 times and mean number of PICs/locus (19) is identical to the entire 1650 locus data set, as expected.
Repeating the analyses at the congeneric to phylogeographic scale, there is no strongly supported incongruence amongst the concatenated RAxML tree and the STEAC species tree for either of the Lampropholis and Saprosincus datasets (Additional file 5: Figure S1), but clade support does differ at some nodes. Within Lampropholis, the RAxML and STEAC analyses strongly support the sister relationship of La. coggeri and La. adonis. Moreover, the relationships inferred among phylogeographic lineages within La. coggeri and within S. basiliscus, using either concatenation or species tree approaches accord with expectations from prior sequencing of multiple introns [24,28] (Figure 2).
Information content of phylogenomic data sets
We evaluated the phylogenetic information content of the anchored-enrichment capture approach (AHE) at multiple scales of divergence in Australian Eugongylus group skinks. By focusing on phylogeographic lineages that have been the focus of previous demographic inference (using multiple introns) and comparative transcriptome sequencing, we are able to gauge the performance of AHE loci against a broader sample of genes with varying rates of divergence.
Performance of anchored hybrid enrichment and transcriptome mining
When using conserved targets such as those employed in AHE, we expected to see low phylogenetic resolution among phylogeographic lineages and higher resolution towards the base of the tree, but we did not uncover this pattern. Rather, relationships among phylogeographic lineages and congeneric species were well resolved, except the Tasmanian Niveoscincus. This genus aside, the overall good performance of AHE loci across closely related taxa reflects the presence of reasonable levels of diversity within, as well as adjacent to, the “conserved” AE probe regions. However, ambiguity remains for some deeper relationships in the Eugongylus group and phylogenies inferred from concatenation and species tree approaches strongly supported alternative placements of Cryptoblepharus and Pseudemoia. A potential explanation for this conflict and/or lack of resolution is that current phylogenetic methods cannot adequately model the complex processes of molecular evolution inherent in such large data sets. However, the generally short and poorly resolved branches among genera suggest a rapid radiation subsequent to colonization of Australia (as suggested for Carlia ), in which case there is strong potential for confounding of phylogenetic signal due to incomplete lineage sorting across gene trees [63,64]. Because concatenation does not account for conflict due to incomplete lineage sorting, the STEAC tree is our preferred hypothesis of phylogenetic relationships of the Australian Eugongylus group species. However, this hypothesis should be subject to further testing with additional nucleotide data and broader sampling of taxa.
The key and positive result is that AHE loci provided mostly high phylogenetic resolution among closely-related species and phylogeographic lineages. This parallels the reported utility of another targeted enrichment approach – UCEs – for analyses of phylogeography . Our primary test cases concerned phylogeographic lineages within three species of skinks from the Australian Wet Tropics rainforests (Carlia rubrigularis, Lampropholis coggeri and Saproscincus basiliscus) that have been the subject of multilocus analyses of phylogeography, historical demography and reproductive isolation in contact zones in the rainforests of north-east Australia [23-25,28,29]. The results of our analyses of the AHE loci add substantial support to previous analyses of 6–10 mostly intronic loci that found deep (~ Late Miocene to Pliocene) divergence and strong reproductive isolation between the two major lineages of C. rubrigularis and between the southern (“S”) and central (“C”) lineages of La. coggeri, and more recent divergence and weaker isolation between the C and northern (“N”) lineages of La. coggeri and the N and C lineages of S. basiliscus. The southern “S” lineage of S. basiliscus represents a case of strong discordance between deep mtDNA divergence and shallow nDNA separation from the adjacent (but now isolated) central “C” lineage; ABC coalescent modeling inferred that this was due to recent range expansion of C and introgression .
Further bolstering AHE loci’s utility for phylogeographic studies, we found that AHE loci yield sequence divergence estimates strongly correlated with those from TM data, and inferred divergence histories reasonably consistent with previous population-level studies of intron sequences. Often, researchers use rapidly evolving markers like introns to infer divergence history, with the assumption that such markers have substitution rates that allow us to investigate recent history . Here, we show that AHE loci, despite largely being conserved across deep phylogenetic scales, are sufficiently informative that they can be used to estimate such divergence histories. That said, we note that our inference deviated from previous population genomic analyses. First, the AHE loci supported divergence with gene flow where previous inference did not, and second, the correlations, while on trend, are weak. These deviations could be due to differences in model assumptions between inference methods – i.e., how sensitive a method is to recombination, expectations for how gene flow should affect coalescent trees and times – and/or such discrepancies can arise simply from differences in bioinformatic pipelines ?
The phylogeny inferred from AHE loci in this study is identical to that obtained using all 1650 loci derived from transcriptome mining. This was the case even for Saproscincus lineages for which the information content of AHE loci and tree confidence factors were relatively low and noisy in relation to PIC scores. The stand-out exception is the poor resolution among closely-related species of Tasmanian Niveoscincus, evidently also a recent and rapid radiation (see also ). Ongoing improvements to the design of AHE targets, to increase locus length and information content by targeting adjacent exons and thus capturing intervening introns, will only serve to further increase their utility among closely related taxa (Lemmon and Lemmon, unpublished data).
Given the demonstrated strong performance of AHE loci for resolving a phylogeny at the phylogeographic scale of divergence and inferring divergence histories (and see also  for UCEs), along with the previous demonstrations of their utility at deeper phylogenetic scale (e.g., ), one could assume that these approaches will suffice for the vast majority of molecular systematics analyses. In particular, AHE loci exemplify a priori selection of loci to maximize both phylogenetic breadth of application and phylogenetic signal, both of which are emerging as key factors .
One caveat with the AHE (and UCE) approaches is the quality of alignments across non-coding regions adjacent to conserved anchor targets. By definition, conserved elements evolve slowly and therefore are expected to contribute few PICs at the shallow (e.g., phylogeographic) scale (Figure 1) as seen in UCE studies (Figure 2 in ). Although interspecific variation is lower close to the AHE target region compared to immediate flanking regions, there is nonetheless an intermediate level of variation as ~35% or more loci have at least one PIC close to the target region (Figure 1). As expected, the majority of PICs are from more variable flanking regions. On the other hand, sequencing coverage drops the farther from the anchor region, thereby increasing the possibility that variable sites are simply artifacts of sequencing error and low coverage (Figure 1). However, one could largely mitigate this problem by trimming flanking data so that all or most taxa share the same set of characters, and removing all gapped characters as we did in this study.
There will remain situations where broader sampling of exons is desirable, either to attempt to resolve short internodes, whether shallow or deep in the tree, or to estimate historical demography and divergence processes. Phylogenetic and species-delimitation analyses of genome-scale data remain constrained by computational limits, though there are promising developments in inference methods that use independent SNPs across thousands of loci [68,69].
Transcriptome mining can potentially provide 1000s of loci with many PICs (see Supp Info 2) and can incorporate transcriptomes for phylogenetic analysis that were otherwise collected for other gene expression projects (e.g., ). Exons do have the advantages of more secure alignment via open reading frames, a broad spectrum of variability, and potential to link the frameworks of molecular evolution and phylogenetics . Further, the development of custom exon capture systems from de novo assembled transcriptomes is now relatively routine  and will become all the more informative when paired with genomes from reasonably closely-related species. Depending on the application, thousands of exons sequenced can be analysed en masse (e.g. for population genomic analyses via site frequency spectra), or loci with the highest information content and most coherent modes of sequence evolution can be selected for phylogenetic analyses.
On the other hand, several aspects currently make TM less feasible for large-scale phylogenetics. First, transcriptome sequencing requires high quality RNA, preferably from fresh tissue or preserved in an RNA preservative and stored at ≤ −80C. This precludes ethanol-preserved tissues and older tissues that would otherwise preserve DNA. Second, transcriptome sequencing is still expensive when compared to AHE and exon-capture methods. Third, recombination between distant exons within single transcripts could violate assumptions of phylogenetic analyses, e.g., where all SNPs within a single locus are assumed to reflect a common gene tree history. This last phenomenon (termed a mixture tree in ) can be mitigated by using only a single exon from each gene.
Phylogeny of the Australian Eugongylus group skinks
We inferred phylogenies of the Australian Eugongylus group lizards with generally high clade support. When compared to other studies including Eugongylus group taxa, several relationships match our expectations such as a clade including Carlia and Lygisaurus , the interrelationships of Saproscincus basiliscus, S. czechurai, S. hannahae, S. lewisi, and S. tetradactyla (the “northern lineage” of ), and a putative rapid radiation early in the Australian Eugongylus group’s history . However, because there is no previous comprehensive molecular phylogenetic analysis of the group, most of our results are novel.
Despite using 415 loci (or subsets with the many PICs), some relationships remain unresolved or conflicting. The relationships of the Tasmanian species of Niveoscincus remain unknown. This lack of resolution likely represents a rapid radiation and that there is still insufficient amount of data (characters or taxa) to reconstruct the evolutionary history of these species. There is strong support for conflicting placements of both Cryptoblepharus and Pseudemoia between the concatenated data and STEAC analyses (Figures 3 and 4) reflecting the complex process of DNA evolution in these taxa as well. These conflicting or unresolved relationships continue to prohibit a complete understanding of placental evolution in Niveoscincus and Pseudemoia which exhibit a range of placental complexity from relatively simple placenta likely used for gas exchange and minimal maternal provisioning, to complex placentae with substantial maternal-to-embryo transfer [32,72]. Although TM performed well, it is not currently feasible to sequence transcriptomes for the remaining ~140 species of the group. Given the promising performance of AHE shown here, it is likely that further sampling across and within genera will do much to resolve the early phylogenetic history of the group.
Our study demonstrates that, overall, anchored-enriched loci are informative at the intraspecific phylogeographic scale as well as deeper in the tree of life. Transcriptome mining provides a wealth of informative loci, but unlike hybrid enrichment methods, is less practical when scaling up to 100 s of taxa unless obtained via a capture approach . In practice, we suggest a mixed approach that uses standard systems such as AHE or UCE to obtain initial results, and then, if warranted, more extensive capture of exons or anonymous loci for more detailed analyses of rapid radiations, divergence histories and population genomics (including for museum specimens ). Alternatively, mixed capture kits can be designed to span the scales of divergence to be studied for various projects. Design and testing of such capture systems containing mixtures of anchored loci (centered in exons) and anonymous loci (centered in randomly-chosen parts of the genome) are already underway in numerous animal and plant clades. These systems produce a spectrum of loci that are informative at the deep phylogenetic, phylogeographic, and population genetic scales.
Availability of supporting data
The data sets and scripts supporting the results of this article are available in the Dryad repository, doi:10.5061/dryad.g4mj2.
We thank L. Liu for assistance with STEAC analyses; D. Warren for assisting MCB with Perl scripting; J. Mai for computational assistance; Intersect Australia Ltd. for supercomputing resources; O. Griffith for access to unpublished Pseudemoia transcriptome data; S. Emme and H. Ralicki for assistance with AHE data collection; O. Griffith, M. Laird, A. Seago, J. Van Dyke, S. E. Williams and Moritz lab members for fieldwork assistance; and two anonymous reviewers for comments on the manuscript. MCB thanks S. Keogh and the Research School of Biology at the Australian National University for temporary accommodation. MCB was supported by the Discovery Early Career Researcher Award from the Australian Research Council (ARC; DE120101615), DGC by ARC DP0771913 and CM and JB by ARC LP110100104. Specimens were collected with permits from Queensland (WITK07252610), New South Wales (SL100401), South Australia (Q25988-1), and tissue collection was approved by the ANU (A2103/08), UC Berkeley (Moritz-R278) and University of Sydney Animal Ethics Committees (2010/1014 5409 and 2011/1013 5636).
- Lemmon EM, Lemmon AR. High-throughput genomic data in systematics and phylogenetics. Ann Rev Ecol Evol Syst. 2013;44:99–121.View ArticleGoogle Scholar
- Faircloth BC, Sorenson L, Santini F, Alfaro ME. A phylogenomic perspective on the radiation of ray-finned fishes based upon targeted sequencing of ultraconserved elements (UCEs). PLoS ONE. 2013;8:e65923.View ArticlePubMed CentralPubMedGoogle Scholar
- Johnson BR, Borowiec ML, Chiu JC, Lee EK, Atallah J, Ward PS. Phylogenomics resolves evolutionary relationships among ants, bees, and wasps. Curr Biol. 2013;23:2565.View ArticleGoogle Scholar
- Simon S, Narechania A, DeSalle R, Hadrys H. Insect phylogenomics: exploring the source of incongruence using new transcriptomic data. Genome Biol Evol. 2012;4:1295–309.View ArticlePubMed CentralPubMedGoogle Scholar
- Wen J, Xiong Z, Nie Z-L, Mao L, Zhu Y, Kan X-Z et al. Transcriptome sequences resolve deep relationships of the grass family. PLoS ONE 2013, e74394.Google Scholar
- Peters RS, Meusemann K, Petersen M, Mayer C, Wilbrandt J, Ziesmann T, et al. The evolutionary history of holometabolous insects inferred from transcriptome-based phylogeny and comprehensive morphological data. BMC Evol Biol. 2013;14:52.View ArticleGoogle Scholar
- Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst Biol. 2012;61:717–26.View ArticlePubMedGoogle Scholar
- Bi K, Vanderpool D, Singhal S, Linderoth T, Moritz C, Good JM. Transcriptome-based exon capture enables highly cost-effective comparative genomic data collection at moderate evolutionary scales. BMC Genomics. 2012;13:403.View ArticlePubMed CentralPubMedGoogle Scholar
- Lemmon AR, Emme SA, Lemmon EM. Anchored hybrid enrichment for massively high-throughput phylogenomics. Syst Biol. 2012;61:727–44.View ArticlePubMedGoogle Scholar
- Bybee SM, Bracken-Grissom HD, Hermansen RA, Clement MJ, Crandall KA, Felderc DL. Directed next generation sequencing for phylogenetics: an example using Decapoda (Crustacea). Zool Anz. 2011;250:497–506.View ArticleGoogle Scholar
- O’Neill EM, Schwartz R, Bullock CT, Williams JS, Shaffer HB, Aguilar-Miguel X, et al. Parallel tagged amplicon sequencing reveals major lineages and phylogenetic structure in the North American tiger salamander (Ambystoma tigrinum) species complex. Mol Ecol. 2013;22:111–29.View ArticlePubMedGoogle Scholar
- Barrow LN, Ralicki HF, Emme SA, Moriarty Lemmon E. Species tree estimation of North American chorus frogs (Hylidae: Pseudacris) with parallel tagged amplicon sequencing. Mol Phylogenet Evol. 2014;75:78–90.View ArticlePubMedGoogle Scholar
- Miller MR, Atwood TS, Eames BF, Eberhart JK, Yan Y-L, Postlethwait JH, et al. RAD marker microarrays enable rapid mapping of zebrafish mutations. Genome Biol. 2007;8:R105.View ArticlePubMed CentralPubMedGoogle Scholar
- Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL, Lewis ZA, et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS ONE. 2008;3:e3376.View ArticlePubMed CentralPubMedGoogle Scholar
- Peterson BK, Weber JN, Kay EH, Fisher HS, Hoekstra HE. Double digest RADseq: an inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS ONE. 2012;7:e37135.View ArticlePubMed CentralPubMedGoogle Scholar
- Peñalba JV, Smith LL, Tonione MA, Sass C, Hykin SM, Skipwith PL, et al. Sequence capture using PCR-generated probes: a cost-effective method of targeted high-throughput sequencing for nonmodel organisms. Mol Ecol. 2014;14:1000–10.Google Scholar
- Smith BT, Harvey MG, Faircloth BC, Glenn TC, Brumfield RT. Target capture and massively parallel sequencing of ultraconserved elements for comparative studies at shallow evolutionary time scales. Syst Biol. 2014;63:83–95.View ArticlePubMedGoogle Scholar
- Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, et al. Ultraconserved elements in the human genome. Science. 2004;304:1321–5.View ArticlePubMedGoogle Scholar
- Katzman S, Kern AD, Bejerano G, Fewell G, Fulton L, Wilson RK, et al. Human genome ultraconserved elements are ultraselected. Science. 2007;317:915–5.View ArticlePubMedGoogle Scholar
- Smith SA, Sadlier RA, Bauer AM, Austin CC, Jackman T. Molecular phylogeny of the scincid lizards of New Caledonia and adjacent areas: evidence for a single origin of the endemic skinks of Tasmantis. Mol Phylogenet Evol. 2007;43:1151–66.View ArticlePubMedGoogle Scholar
- Chapple DG, Ritchie PA, Daugherty CH. Origin, diversification, and systematics of the New Zealand skink fauna (Reptilia: Scincidae). Mol Phylogenet Evol. 2009;52:470–87.View ArticlePubMedGoogle Scholar
- Skinner A, Hugall AF, Hutchinson MN. Lygosomine phylogeny and the origins of Australian scincid lizards. J Biogeogr. 2011;38:1044–58.View ArticleGoogle Scholar
- Moussalli A, Moritz C, Williams SE, Carnaval AC. Variable responses of skinks to a common history of rainforest fluctuation: concordance between phylogeography and palaeo-distribution models. Mol Ecol. 2009;18:483–99.View ArticlePubMedGoogle Scholar
- Bell RC, Parra JL, Tonione M, Hoskin CJ, Mackenzie JB, Williams SE, et al. Patterns of persistence and isolation indicate resilience to climate change in montane rainforest lizards. Mol Ecol. 2010;19:2531–44.PubMedGoogle Scholar
- Dolman G, Moritz C. A multilocus perspective on refugial isolation and divergence in rainforest skinks (Carlia). Evolution. 2006;60:573–82.View ArticlePubMedGoogle Scholar
- Chapple DG, Hoskin CJ, Chapple SNJ, Thompson MB. Phylogeographic divergence in the widespread delicate skink (Lampropholis delicata) corresponds to dry habitat barriers in eastern Australia. BMC Evol Biol. 2011;11:191.View ArticlePubMed CentralPubMedGoogle Scholar
- Chapple DG, Chapple SNJ, Thompson MB. Biogeographic barriers in south-eastern Australia drive phylogeographic divergence in the garden skink, Lampropholis guichenoti. J Biogeogr. 2011;38:1761–75.View ArticleGoogle Scholar
- Singhal S, Moritz C. Testing hypotheses for genealogical discordance in a rainforest lizard. Mol Ecol. 2012;21:5059–72.View ArticlePubMedGoogle Scholar
- Singhal S, Moritz C. Reproductive isolation between phylogeographic lineages scales with divergence. Proc R Soc B. 2013;280:20132246.View ArticlePubMed CentralPubMedGoogle Scholar
- Dolman G, Hugall AF. Combined mitochondrial and nuclear data enhance resolution of a rapid radiation of Australian rainbow skinks (Scincidae: Carlia). Mol Phylogenet Evol. 2008;49:782–94.View ArticlePubMedGoogle Scholar
- Adams SM, Biazik JM, Thompson MB, Murphy CR. Cyto-epitheliochorial placenta of the viviparous lizard Pseudemoia entrecasteauxii: a new placental morphotype. J Morph. 2005;264:264–76.View ArticlePubMedGoogle Scholar
- Thompson MB, Speake BK. A review of the evolution of viviparity in lizards: structure, function and physiology of the placenta. J Comp Physiol B Biochem Syst Environ Physiol. 2006;176:179–89.View ArticleGoogle Scholar
- Stewart JR, Thompson MB. Parallel evolution of placentation in Australian scincid lizards. J Exp Zool B Mol Dev Evol. 2009;312:590–602.View ArticlePubMedGoogle Scholar
- Stewart JR, Thompson MB. Placental ontogeny in Tasmanian snow skinks (Genus Niveoscincus) (Lacertilia: Scincidae). J Morph. 2009;270:485–516.View ArticlePubMedGoogle Scholar
- Meyer M, Kircher M. Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb Protoc. 2011;6:pdb prot5448.Google Scholar
- Rokyta DR, Lemmon AR, Margres MJ, Arnow K. The venom-gland transcriptome of the eastern diamondback rattlesnake (Crotalus adamanteus). BMC Genomics. 2012;13:312.View ArticlePubMed CentralPubMedGoogle Scholar
- Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and throughput. Nucleic Acids Res. 2004;32:1792–7.View ArticlePubMed CentralPubMedGoogle Scholar
- Brandley MC, Young RL, Warren DL, Thompson MB, Wagner GP. Uterine gene expression in the live-bearing lizard, Chalcides ocellatus, reveals convergence of squamate reptile and mammalian pregnancy mechanisms. Genome Biol Evol. 2012;4:394–411.View ArticlePubMed CentralPubMedGoogle Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–23.View ArticlePubMed CentralPubMedGoogle Scholar
- Singhal S. De novo transcriptomic analyses for non-model organisms: an evaluation of methods across a multi-species data set. Mol Ecol Resources. 2013;13:403–16.View ArticleGoogle Scholar
- Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27:2957–63.View ArticlePubMed CentralPubMedGoogle Scholar
- Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114–20.View ArticlePubMed CentralPubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotech. 2011;29:644–52.View ArticleGoogle Scholar
- Kent WJ. BLAT—The BLAST-Like Alignment Tool. Genome Res. 2002;12:656–64.View ArticlePubMed CentralPubMedGoogle Scholar
- Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.View ArticlePubMedGoogle Scholar
- Huang X, Madan A. CAP3: a DNA sequence assembly program. I. 1999;9:868–77.Google Scholar
- Nylander JAA. MrAIC.pl, program distributed by the author. Evolutionary Biology Centre. Uppsala University.Google Scholar
- Guindon S, Dufayard J-F, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010;59:307–21.View ArticlePubMedGoogle Scholar
- Stamatakis A. RAxML-VI-HPC: maximum likelihood based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006;22:2688–90.View ArticlePubMedGoogle Scholar
- Kubatko LS, Degnan JH. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Syst Biol. 2007;56:17–24.View ArticlePubMedGoogle Scholar
- Belfiore NM, Liu L, Moritz C. Multilocus phylogenetics of a rapid radiation in the genus Thomomys (Rodentia: Geomyidae). Syst Biol. 2008;57:294–310.View ArticlePubMedGoogle Scholar
- Weisrock DW, Smith SD, Chan LM, Biebouw K, Kappeler PM, Yoder AD. Mouse lemur phylogeny: an empirical demonstration of the effect of allele sampling in phylogenetics. Mol Biol Evol. 2012;29:1615–2630.View ArticlePubMed CentralPubMedGoogle Scholar
- Liu L, Yu L, Pearl DK, Edwards SV. Estimating species phylogenies using coalescence times among sequences. Syst Biol. 2009;58:468–77.View ArticlePubMedGoogle Scholar
- Liu L, Yu L. Phybase: an R package for species tree analysis. Bioinformatics. 2010;26:962–3.View ArticlePubMedGoogle Scholar
- Larget BR, Kotha SK, Dewey CN, Ané C. BUCKy: Gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics. 2010;26:2910–1.View ArticlePubMedGoogle Scholar
- Zhu T, Yang Z. Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow. Mol Biol Evol. 2012;29(10):3131–42.View ArticlePubMedGoogle Scholar
- Yang Z. A likelihood ratio test of speciation with gene flow using genomic sequence data. Gen Biol Evol. 2010;2:200–11.View ArticleGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.View ArticlePubMed CentralPubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–9.View ArticlePubMed CentralPubMedGoogle Scholar
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20:1297–303.View ArticlePubMed CentralPubMedGoogle Scholar
- Tamura K. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G + C-content biases. Mol Biol Evol. 1992;9:678–87.PubMedGoogle Scholar
- Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 2009;5:e1000695.View ArticlePubMed CentralPubMedGoogle Scholar
- Degnan JH, Rosenberg NA. Discordance of species trees with their most likely gene trees. PLoS Genet. 2006;2:e68.View ArticlePubMed CentralPubMedGoogle Scholar
- Degnan JH, Rosenberg NA. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol Evol. 2009;24:332–40.View ArticlePubMedGoogle Scholar
- Brito PH, Edwards SV. Multilocus phylogeography and phylogenetics using sequence-based markers. Genetica. 2009;135:439–55.View ArticlePubMedGoogle Scholar
- Zwickl DJ, Stein JC, Wing RA, Ware D, Sanderson MJ. Disentangling Methodological and Biological Sources of Gene Tree Discordance on Oryza (Poaceae) Chromosome 3. Syst Biol. 2014;63:645–59.View ArticlePubMedGoogle Scholar
- Melville J, Swain R. Mitochondrial DNA-sequence based phylogeny and biogeography of the snow skinks (Squamata: Scincidae: Niveoscincus) of Tasmania. Herpetol. 2000;56:196–208.Google Scholar
- Bryant D, Bouckaert R, Felsenstein J, Rosenberg NA, RoyChoudhury A. Inferring species trees directly from biallelic genetic markers: bypassing gene trees in a full coalescent analysis. Mol Biol Evol. 2012;29:1917–32.View ArticlePubMed CentralPubMedGoogle Scholar
- Mirarab S, Bayzid MS, Boussau B, Warnow T. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science. 2014;29:1250463.View ArticleGoogle Scholar
- Cutter AD. Integrating phylogenetics, phylogeography and population genetics through genomes and evolutionary theory. Mol Phylogenet Evol. 2013;69:1172–85.View ArticlePubMedGoogle Scholar
- Moussalli A, Hugall AF, Moritz C. A mitochondrial phylogeny of the rainforest skink genus Saproscincus, Wells and Wellington (1984). Mol Phylogenet Evol. 2005;34:190–202.View ArticlePubMedGoogle Scholar
- Van Dyke JU, Brandley MC, Thompson MB. The evolution of viviparity: molecular and genomic data from squamate reptiles advance understanding of live birth in amniotes. Reproduction. 2014;147:R15–26.View ArticlePubMedGoogle Scholar
- Bi K, Linderoth T, Vanderpool D, Good JM, Nielsen R, Moritz C. Unlocking the vault: next-generation museum population genomics. Mol Ecol. 2013;22:6018–32.View ArticlePubMed CentralPubMedGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.