The field of molecular phylogenetics is heading towards an exciting future, in which genomics will allow dozens of unlinked nuclear loci to be used to estimate phylogenies [e.g. [1–5]]. These data may then be analyzed using species-tree methods that use principles of population genetics to resolve incongruence among loci (e.g., BEST; STEM; *BEAST).
But even as the field of phylogenetics seems to be moving towards such a future, it is clearly not there yet. For example, in animals, many phylogenies continue to be estimated based on mitochondrial (mtDNA) data alone [e.g. [9–12]], or on a combined (concatenated) analysis of nuclear (nucDNA) and mtDNA data [e.g. [13–18]]. In many cases, these analyses of mtDNA or concatenated data may be necessary because sampling many species makes it impractical to include many nuclear loci given fiscal constraints, and because sampling many species and/or few loci makes it impractical to utilize explicit species-tree methods (despite strong theoretical justification for their use; e.g., [6, 8, 19]). Many review papers have addressed the pros and cons of mtDNA data [e.g. [20–24]], and many empirical studies have suggested the need for caution in their use [e.g. [25–27]]. However, most reviews have focused on the use of mtDNA in phylogeographic studies [e.g. [23, 24, 28]] and on the question of whether mtDNA should be used in phylogenetics at all.
Here, we address a somewhat different question. Given that many systematists routinely estimate phylogenies using combined mtDNA and nucDNA, we ask: what are the consequences of the common practice of combining these two types of data? For example, will the combined-data tree tend to resemble the mtDNA tree due to larger numbers of variable mtDNA characters? Or will the combined-data tree contain a mixture of clades favored by the separate data sets? Are there any generalities that can be made about when mtDNA or nucDNA data will be favored in particular clades or data sets? These questions are particularly important because many published studies simply present trees from combined analyses of mtDNA and nucDNA, without any examination of whether the mtDNA and nucDNA trees are congruent, or to what extent the combined-data tree reflects the contributions of each data set [e.g. [14–18]], although there are exceptions. In fact, if combined-data trees are often discordant with trees from nucDNA and largely reflect the mtDNA data instead, there may be little to be gained by collecting and adding nucDNA data in the first place (i.e., if trees are estimated from the combined data and the nucDNA data have negligible impact on that analysis). To our knowledge, these important questions have never been the subject of a focused study.
In this paper, we address these and related questions, by evaluating combined-data analyses that utilize both mtDNA and nucDNA data. We approach these questions using new data and analyses for Plethodon salamanders, along with new analyses of existing data sets from 13 other vertebrate groups. Below, we describe the four main questions (and five associated predictions) that we address. For each of the four main questions, we are attempting to discern if there are generalities that can be made regarding the interaction of mtDNA and nucDNA data sets in a combined-data analysis.
First, are there frequent conflicts between separate mtDNA and nucDNA trees, and are the conflicting clades strongly supported by each data set? Weakly supported conflicts may be spurious and thus not problematic, whereas strongly supported conflicts may reflect more serious issues (such as long-branch attraction or discordance between gene and species trees) that may confound combined analyses [e.g. [6, 30–34]]. As a working hypothesis, we predict that (i) discordance between mtDNA and nucDNA will generally be uncommon, and if found, will often be weakly supported by one or both data sets. This prediction is based on the simple expectation that both mitochondrial and nuclear genes will frequently share the same underlying phylogenetic history (especially given that smaller effective population sizes of mitochondrial genes may reduce discordance due to incomplete lineage sorting), and that incongruence may often be due to estimated phylogenies that do not fully match the underlying gene trees [30–32].
Second, are conflicts between the separate mtDNA and nucDNA trees generally resolved in favor of mtDNA or nucDNA in the combined-data tree? Mitochondrial genes are generally thought to evolve more rapidly than nuclear genes, and so should have more variable characters but should also have more homoplasy [e.g. [21, 22]]. In general, we expect conflicts between data sets to be resolved in favor of the data set with more variable characters, but also with less homoplasy. A data set with extensive conflict among characters (i.e., high homoplasy due to random noise from high overall rates of character change) may be less likely to overturn relationships inferred from a data set with less internal conflict among characters. Thus, the resolution of conflicts between mtDNA and nucDNA data sets in the combined-data tree may vary from analysis to analysis, depending on the number of characters sampled in each data set and their levels of variability and homoplasy. We predict that (ii) when mtDNA dominates a combined-data tree, it will be due to larger numbers of variable characters compared to nucDNA, and (iii) when nucDNA dominates a combined-data tree, it will be due to lower levels of homoplasy compared to mtDNA.
We address these predictions by first comparing the number of nodes shared between trees from mtDNA, nucDNA, and the combined data, across 14 vertebrate clades. Next, we test whether the proportion of nodes shared between the combined-data and mtDNA trees is correlated with the overall proportion of the variable sites in the combined data that are from mtDNA (given the prediction that the data set with more variable characters will have a stronger influence on the combined-data tree). We also test whether the resolution of conflicts in the combined-data tree is related to the level of homoplasy in the mtDNA versus nucDNA data sets, given the prediction that the combined-data tree will be resolved in favor of the data set with less homoplasy (i.e., nucDNA) regardless of the relative numbers of variable sites.
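As an illustration of the node-sharing comparison described above, the following minimal Python sketch counts the clades of a combined-data tree that also appear in a separate-data tree, and computes the fraction of variable sites contributed by mtDNA. It is a sketch under stated assumptions, not the procedure used in this study: trees are assumed to be rooted and written as nested tuples of taxon names, a "shared node" is assumed to mean an identical set of descendant tips, and the function names, taxon labels, and site counts are hypothetical.

```python
from typing import FrozenSet, Set, Union

# Assumption: a rooted tree is a nested tuple of taxon names, e.g. (("A", "B"), "C").
Tree = Union[str, tuple]


def clades(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the tip sets of all internal nodes except the root."""
    found: Set[FrozenSet[str]] = set()

    def collect(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):          # a tip: contributes only itself
            return frozenset([node])
        tips: FrozenSet[str] = frozenset()
        for child in node:                 # an internal node: union of its children
            tips |= collect(child)
        found.add(tips)
        return tips

    all_tips = collect(tree)
    found.discard(all_tips)                # the root clade is shared trivially
    return found


def proportion_shared(combined: Tree, separate: Tree) -> float:
    """Proportion of combined-data clades also present in the separate-data tree."""
    combined_clades = clades(combined)
    if not combined_clades:
        return 0.0
    return len(combined_clades & clades(separate)) / len(combined_clades)


def mtdna_fraction_of_variable_sites(mt_variable: int, nuc_variable: int) -> float:
    """Fraction of all variable sites in the combined data that come from mtDNA."""
    return mt_variable / (mt_variable + nuc_variable)


# Hypothetical five-taxon example (labels A-E are placeholders, not real data).
combined_tree = ((("A", "B"), "C"), ("D", "E"))
mtdna_tree = ((("A", "B"), "C"), ("D", "E"))
nucdna_tree = ((("A", "C"), "B"), ("D", "E"))

print(proportion_shared(combined_tree, mtdna_tree))
print(proportion_shared(combined_tree, nucdna_tree))
```

Computing both quantities for each of several data sets would give paired values that could then be tested for correlation. For unrooted trees, bipartitions of the taxon set would be the appropriate unit of comparison in place of rooted clades.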
Third, what generalities, if any, can we make about which nodes of the combined-data tree are resolved in favor of mtDNA vs. nucDNA? We expect that the resolution of nodes in the combined-data tree may depend on the underlying branch lengths and the depth of those branches in the tree. We predict (iv) mtDNA and nucDNA will be more congruent on longer branches, because allele histories should coalesce on longer branches, reducing discordance among genes due to incomplete lineage sorting. Furthermore, introgression is less likely among more distantly related species (i.e., separated by longer branches), due to the accumulation of reproductive isolating mechanisms over time, which should also contribute to greater congruence between mtDNA and nucDNA on longer branches (especially if mitochondrial introgression is an important source of discordance between mtDNA and nucDNA trees). Longer branches may also be more congruent if they tend to be more strongly supported by each gene, reducing spurious conflicts between mtDNA and nucDNA due to weak support. We expect shorter branches to be resolved in favor of mtDNA, given that there may be too little time for mutations to accumulate on the shortest branches for slower-evolving nuclear genes. In addition, there may be extensive incongruence among nuclear genes on short branches due to incomplete lineage sorting, also leading to weaker branch support. In contrast, the mitochondrial genome is a single locus (such that there should be no incongruence among histories of mitochondrial genes), and incomplete lineage sorting may be less problematic at the between-species level due to the generally smaller effective population size of the mitochondrial genome [e.g. [20, 22, 37]].
Finally, when mtDNA and nucDNA trees conflict, we predict (v) that nucDNA may be more likely to win conflicts deeper in the combined-data tree, while mtDNA may be more likely to win shallower conflicts [e.g. [38, 39]]. Clades deep in the tree may be harder to resolve due to long-branch attraction, and faster-evolving genes (like mtDNA) will likely exacerbate problems of long-branch attraction (i.e., branch lengths may generally tend to be longer). The importance of tree depth may depend not only on the relative placement of branches in the tree, but also on overall branch lengths (with mtDNA being more problematic when branches are generally longer). The potential for nucDNA data to better resolve deep branches may be an important justification for including these data in the first place, along with the desire to sample unlinked loci.
In summary, a consideration of general principles suggests conflicts between mtDNA and nucDNA may be uncommon and weakly supported, and that the resolution of conflicting nodes in the combined analysis (i.e., favoring mtDNA vs. nucDNA) may vary based on the number of variable characters and level of homoplasy in each mtDNA and nucDNA data set, the lengths of branches, and the depths of branches in the tree. We test these predictions empirically here, using new data from Plethodon salamanders and published data from 13 other vertebrate clades.
Plethodon is the most species-rich genus of North American salamanders. They are terrestrial, direct-developing salamanders that are generally common and diverse in North American forests. Plethodon have long interested evolutionary biologists and ecologists, and hundreds of papers have been published on Plethodon in diverse areas, including studies of behavior [e.g. [43–46]], community ecology [e.g. [47–49]], patterns of trait evolution [e.g. [13, 50]], speciation and hybridization [e.g. [51–58]], and response to environmental change [e.g. [59–61]]. Many of these studies have used a phylogenetic approach, making a reliable phylogeny for Plethodon particularly important.
Earlier studies addressed Plethodon phylogeny using data from allozymes [e.g. [52, 53]] and mtDNA, whereas more recent studies have combined mtDNA and nucDNA data [e.g., [13, 57]]. In general, these studies have yielded similar estimates of higher-level Plethodon phylogeny (e.g., most agree on a split between eastern and western species, and on the species groups in eastern North America). However, there have been substantive disagreements between studies regarding some species-level relationships (e.g., within the cinereus group). Furthermore, all previous studies used relatively few nuclear loci (two or three; [13, 57, 61]). Here we obtain new data from five nuclear loci and combine these with existing data from four nuclear genes and three mitochondrial genes, and use these data to address Plethodon phylogeny and general questions about combining mtDNA and nucDNA in phylogenetic studies.