Detection of mitochondrial insertions in the nucleus (NuMts) of Pleistocene and modern muskoxen
© Kolokotronis et al. 2007
Received: 12 October 2006
Accepted: 27 April 2007
Published: 27 April 2007
Skip to main content
© Kolokotronis et al. 2007
Received: 12 October 2006
Accepted: 27 April 2007
Published: 27 April 2007
Nuclear insertions of mitochondrial sequences (NuMts) have been identified in a wide variety of organisms. Trafficking of genetic material from the mitochondria to the nucleus has occurred frequently during mammalian evolution and can lead to the production of a large pool of sequences with varying degrees of homology to organellar mitochondrial DNA (mtDNA) sequences. This presents both opportunities and challenges for forensics, population genetics, evolutionary genetics, conservation biology and the study of DNA from ancient samples. Here we present a case in which difficulties in ascertaining the organellar mtDNA sequence from modern samples hindered their comparison to ancient DNA sequences.
We obtained mitochondrial hypervariable region (HVR) sequences from six ancient samples of tundra muskox (Ovibos moschatus) that were reproducible but distinct from modern muskox sequences reported previously. Using the same PCR primers applied to the ancient specimens and the primers used to generate the modern muskox DNA sequences in a previous study, we failed to definitively identify the organellar sequence from the two modern muskox samples tested. Instead of anticipated sequence homogeneity, we obtained multiple unique sequences from both hair and blood of one modern specimen. Sequencing individual clones of a >1 kb PCR fragment from modern samples did not alleviate the problem as there was not a consistent match across the entire length of the sequences to Ovibos when compared to sequences in GenBank.
In specific taxa, due to nuclear insertions some regions of the mitochondrial genome may not be useful for the characterization of modern or ancient DNA.
NuMts have been characterized in numerous species . Various genome sequencing projects have demonstrated that many copies ranging in length from parts of genes to nearly full copies of the mtDNA genome exist in most mammalian genomes [2, 3]. When identified correctly, older NuMts can be used as a constructional source of phylogenetic outgroups . However, they are better known for their negative impact on the interpretation of collected data. For example, a series of mutations originally thought to be correlated with Alzheimer's syndrome were in fact NuMts . A well publicized sequence obtained from a dinosaur fossil was in fact a NuMt from a human contaminant . Using elephant hair, Greenwood and Pääbo  demonstrated that in some species and in some tissues, NuMts may be preferentially retrieved over mtDNA by PCR. More recently, it has been shown that in some primates, specific mtDNA loci may be unusable for phylogeny reconstruction because of NuMts [8–10].
Methods that depend on the circularity of the mtDNA genome, including amplification of long fragments or the entire mtDNA genome, could exclude linear integrated NuMts. However, this generally requires a good source of intact DNA and is not commonly practiced. Given that most non-invasive techniques for retrieving sequence from wild animals require using tissues such as hair or dung containing low concentrations of DNA, often of poor quality, high copy sequences such as those present in mtDNA are often the preferred target for analysis. The problem also applies to degraded DNA such as DNA from ancient samples. Although nuclear DNA is expected to be harder to retrieve from ancient samples, NuMts have been detected therein and such studies are not free from the risks presented by NuMts . For all these reasons, NuMts present a serious challenge.
In our experiments, two primer pairs amplifying overlapping PCR fragments (270 bp and 162 bp, respectively) were applied to both ancient muskox extracts and modern muskox DNA. The sequences were largely uniform among the ancient samples muskoxen although in one sample, among several clones sequenced, one unequivocal NuMt sequence was detected. In contrast to the ancient materials, the same primers yielded multiple distinct sequences from modern muskox hair and blood. Thus, unlike proboscideans, hair and blood both yielded large numbers of NuMts.
An additional PCR product was amplified from the modern materials encompassing the entire HVR. This yielded sequences that were more uniform; however, outside of the region being compared to the ancient samples, the sequences did not match Ovibos as well as it matched other related bovids. Phylogenetic analysis of the sequences obtained did not provide clear separation of NuMts into a clade separate from that for organellar mtDNA. Thus, for the specific HVR sequence under study, the organellar mtDNA sequence could not be determined with absolute certitude.
One clone of sample OMTai46 (clone 2.3 in Figure 1) was highly divergent (differing from the consensus sequence at 32 nucleotide positions). A BLASTN search of this sequence resulted in a highest similarity score to a roe deer control region sequence (Capreolus capreolus, e-value = 1e-51) . This finding is typical for a phylogentically older NuMt, such that the sequence divergence from the target taxon's mtDNA sequence places it as more similar, though not identical, to the outgroup. Given the likely age of the NuMt, as suggested by its large divergence from the other clone sequences and its similarity to other caprines, the clone is not a human contamination and likely represents an old NuMt. However, the clone was unique among clones from the same animal, and did not appear in other samples. On the whole, then, NuMt retrieval from the ancient specimens was minimal. This is not surprising as it is expected that, given the high copy number of mtDNA compared to nuclear DNA, in most subfossil samples mtDNA will be greatly in excess over single copy sequence nuclear DNA and therefore preferentially amplified by PCR.
Several indels observed in the ancient samples were not identified in an earlier investigation of modern muskox DNA by Groves . As noted by MacPhee et al. , these indels are probably artifacts produced by Groves'  single-strand direct sequencing protocol. By contrast, when both strands of multiple cloned sequences are investigated, no indels appear (Figure 1). Methodological improvements in sequencing methodology that have taken place since the original publication of the data could explain the discrepancy.
Having observed a NuMt sequence in the ancient DNA, we performed two additional experiments using modern DNA samples. In the first, the same primers used in the investigation of the ancient DNA specimens were applied to samples from two modern muskoxen (both hair and blood from the first, blood only from the second, unrelated individual). In the second experiment, the primers used by Groves  to generate her modern muskox database were applied to the ancient samples, which resulted in > 1 kb product (see Additional file 1). Thus, three primer pairs, two tissue types and two individuals were each tested for the presence of NuMts using the same protocol.
All PCR products were cloned and multiple clone sequences determined. The sequences were curated by removing the vector sequence from the reads and the beginning comparisons from the first base following the 3' end of the primers. The most consistent sequences obtained came from the largest fragment investigated (see Additional File 2). Some sequence heterogeneity was observed in the region of overlap for all sequences compared with the exception of one highly divergent clone (CHL.8). Two additional clone sequences, CBL.10 and CHL.5 however, were almost identical to the majority of clones in the region overlapping the shorter PCR products retrieved from ancient DNA but were more divergent in the 5' end of the HVR (see Additional File 2). This extended into the region of the HVR not covered by any of the database muskox sequences. The alignment of the sequences to an outgroup sequence (Taiwanese serow, Nemorhaedus swinhoei) and to the database did not generate a consistent alignment across the entire fragment covered by Groves'  sequences and those generated in this study. The 3' end of the first fragments from GenBank (U47061, 63, 65, 67, 69, 71, 73 and 75) aligned poorly to the sequences generated in this study and to the outgroup sequence. Similarly, the 5' end of the second fragment sequences from GenBank aligned poorly to the outgroup and to the sequences generated from this study. Thus, the sequences generated in this study aligned consistently to each other, except where noted, and to the outgroup; yet the database muskox sequences did not. Thus, sequence heterogeneity suggesting NuMts were detected among the long fragment sequences and odd behaviour of the database muskoxen was also observed.
Tests for recombination and gene conversion were negative. In addition, whereas in the Groves  study 37 muskoxen yielded only 8 distinct haplotypes, the ~1.1 kb fragment analyzed in this study of two individuals yielded two new haplotypes not previously observed. It should be noted that the indels tend to inflate the divergence of the sequences. Most of the differences were base substitutions. Given the presence of insertions and deletions among the ~1.1 kb fragment clones and alignment issues with muskox sequences in the database, we conclude that correct organellar mtDNA sequences cannot be unambiguously determined from the sequencing of the longest PCR product amplified in this study.
The chromosomal rearrangement feature enabled in POY is a powerful tool for the establishment of homology between loci differing in their physical arrangement. However, no clone sequence clusters resulted from rearrangement analysis as all combinations had the same cost in a maximum parsimony framework. In addition, no recombination was detectable in this sequence dataset. Thus, the multiple clones derived from different individuals cannot be explained as a PCR artefact or the presence of recombinant mtDNA.
The nucleotide frequency distribution across all haplotypes did not reveal any shifts in base composition, so as to hint at a pseudogenization (i.e. the process leading to loss of function of a locus following duplication or transposition), which is mostly expected for coding loci (graph not shown). Up to date, shifts in base composition of D-loop Numts have not been studied, as cases of noncoding Numts have rarely been reported [4, 9, 10, 16–18].
The Pairwise Relative Rates Test identified clones CH.1.3, CH.1.9, CH.1.1, and CH.1.2 to be evolving faster than all other sequences, when compared to all other clones, using the serow as outgroup (see Additional file 3). The RRT also returned significant results for other sequences of the divergent clade, but also for haplotypes of the greater clade. For instance, CH.1.4 evolved at a different rate than same-tissue CH.1.6, CB, or CHB -to name a few- and CB.1.1's rate differed from that of other blood CB and E clones.
The unusual haplotype structure and phylogenetic divergence of some CH hair clones, as well as their different evolutionary rate, suggest these sequences are control region fragments transposed to the nucleus. That said, there are other haplotypes that exhibit signs of differential evolution, although they are placed in the main clade and their base composition seems standard, such as CH.1.4, CB.1.1, CHB, CB.1.3. Among the sequences from the long fragment, not all of the clone sequences grouped together. For example, CHL.8 was in a different more basal portion of the tree than any other CHL clone.
In sum, our results are similar to those obtained by Thalmann et al. , who noted that, in gorillas, NuMts were distributed throughout the phylogenetic tree of HVR sequences. These authors concluded that non-invasive studies relying on the portion of HVR1 examined would not yield reliable results.
Although we were able to identify a set of clones as more divergent even than the outgroup Nemorhaedus swinhoei, we were not able to cluster together all clones with unusual sequences, a proof that NuMts may be distributed throughout the phylogenetic tree, rendering phylogenetic detection not the most powerful diagnostic tool.
NuMts are not an insurmountable problem if one is dealing with high quality DNA. Several methods have been suggested for determining correct sequences when confronted by NuMts:
Overlapping PCR strategies using multiple independent primers operates on the principle that the chances of detecting the same NuMts with independent PCR primers is low. However, this method is not infallible .
Phylogenetic and substitution rate analyses can be used to identify NuMts. However, this may only detect highly divergent NuMts and, as seen in this study, the distribution of sequences did not clearly separate a NuMt clade from an organellar mtDNA clade (Figure 3).
Amplifying large fragments of mtDNA that overlap from high quality genomic DNA or samples enriched for mtDNA are a good way of detecting organellar DNA. Most NuMts are short, containing one or a few genes. This technique, however, requires a source of high quality DNA or the possibility of enriching samples for mtDNA.
By contrast, studies of ancient DNA or non-invasive sampling often require the use of DNA of substandard quality. In addition, the hypervariable region is often particularly critical for determining variation among individuals that are closely related (i.e., intrapopulation variation). The results of this study demonstrate that even when high quality DNA from fresh blood extractions is available, difficulties in establishing the correct organellar mtDNA sequence may be encountered. In fact, in our example the modern samples proved to be more intractable than the ancient DNA samples.
In the muskoxen samples reviewed here, the frequency of NuMt detection varied among the individuals tested. However, unlike the case with proboscideans , both hair and blood from the same individual yielded abundant NuMt sequences. NuMts were also detected, though at lower frequency, in the ancient DNA samples. This makes sense: since only the highest-copy DNA generally remains in significant amounts, the incidence of NuMts should be reduced in ancient samples. Indeed, this inference remains the basic justification for the continued use of mtDNA in most ancient DNA studies. However, because NuMts have also been detected in nuclear DNA derived from ancient samples, there is no avoiding the conclusion that NuMts are a potential problem for all sample types, irrespective of their origin or condition .
Direct sequencing of the PCR products would not alleviate the problem: if a NuMt predominates, then the NuMt sequence will be retrieved. Furthermore, the presence of multiple distinct sequences would make direct sequencing impossible. In our example, the largest fragment sequenced demonstrated inconsistent homology with the 8 muskox haplotypes deposited in GenBank , despite the fact that the same primers were used throughout. Because of this inconsistency, even though one of the blood samples yielded matching sequences in all overlapping PCR fragments, it cannot be definitively stated that the sequence represented is actually organellar mtDNA. Nor did phylogenetic analysis distinguish NuMts from organellar mtDNA sequences.
Mitochondrial sequences have proven to be useful in a wide variety of contexts. However, there are some often overlooked risks associated with their use. We have observed significant differences in the retrieval of NuMts from modern and ancient DNA samples of the muskox Ovibos moschatus. These differences, which occurred among independent PCRs using different primer pairs, among individuals and among tissues of individuals, challenge our ability to correctly characterize organellar mtDNA sequences in this taxon. Ovibos joins proboscideans, some primates, and several other major taxa in which the high incidence of NuMts complicates proper sequence identification. Caution should be exercised in interpreting the results of any mtDNA study that relies on limited or degraded DNA samples or single PCR primer pairs. In extreme cases, the pervasive presence of NuMts may render some loci useless for the purpose of modern and ancient DNA studies.
Ancient muskox samples OMTai 14, 38, 39, 46, 95, and 23564, localities, carbon dates and museum identification numbers are described in detail in our previous study (see Table 1 in ). Hair and blood from a modern muskox male was sampled from Tierpark Hellabrunn (Munich, Germany). Blood from a second muskox was kindly provided by E. Willerslev (University of Copenhagen, Denmark). Ancient samples were extracted as described in . Modern blood and hair extractions were performed using QIAamp Mini DNA kits (QIAGEN, Germany) and eluted in 500 μl of distilled water.
PCR primers used in this study were designed to amplify the mitochondrial HVR of muskoxen based on the sequences for Ovibos in GenBank . To amplify the entire HVR from modern muskoxen, the primers from  were used. For both ancient and modern DNA two overlapping PCR fragments were used to obtain an approximately 276 bp fragment of the HVR. Primer sequences were HV.1L 5'-AAAGAATTCTGCTGTCATACATTT-3', HV.1H 5'-AAAGGATCCAGGGATGAGTGTGTT-3', HV.2L 5'-AAAGAATTCTATCATATATGCTCTTCGTA-3', HV.2H 5'-AAAGGATCCTATCTTGGTTGGAGTGCAGA-3'. Ancient DNA PCR buffer and reaction conditions are described in . Modern PCR reactions were performed using 0.8 pmol of each primer and Taq DNA polymerase (Promega). The re-amplification procedure, cloning of PCR products and sequencing of clones was done using standard methods and has already been described .
Sequences were aligned in MUSCLE 3.6  and manually adjusted in Se-Al 2.0a11  before being collapsed into haplotypes in COLLAPSE 1.2  treating gaps as 5th state (see Additional file 4). Previously published control region sequences  were also included in the alignment (GenBank:U47061–U47076). The haplotype alignment was used in evolutionary analyses. Sequences were submitted to GenBank (GenBank:EF057069–EF057098). The Taiwanese serow Nemorhaedus swinhoei (GenBank:AY149639) was used as outgroup, as a closely related caprine .
We attempted to cluster homologous clone sequences, so as to differentiate true mitochondrial sequences from Numts, by applying two different classification schemes: (i) clones were categorized by each muskox individual regardless of the number of clones per individual, resembling to a multiple-partitions dataset; the inequality of the number of clones per individual was corrected by coding missing clones as partitions with missing data. (ii) clones were categorized by individual and by tissue type, e.g. hair (H) and blood (B), knowing from previous research  that hair may be more Numt-rich than blood. The homology assessment analysis was done in POY  using the -chromosome -n2reorder command.
In recent years, evidence has burgeoned for the occurrence of mitochondrial recombination in a wide variety of animal taxa [27, 28], and more specifically in members of the family Bovidae, such as Cephalophus spp. (duikers), Ovis spp. (sheep), Tragelaphus spp., and Kobus spp (see Table 1 in ). Recombination should not produce misleading interpretations of populations with limited or nonexistent gene flow; however, it can mimic the traces of population size expansion, underestimate divergence times, and mask ancient polymorphisms as recurring mutations [27, 29]. In the particular case of caprines, in which mtDNA recombination is an acknowledged issue , it should be routine to search for evidence of recombination before making evolutionary inferences.
In order to examine the possibility of detecting HVR recombinant sequences, we employed several recombination detection methods, such as Bootscan, Chimaera, Geneconv, MaxChi, RDP, and SiScan, as implemented in the program RDP2 2b08 , using the automated scan option, and 1,000 permutations and 1,000 bootstrap replicates where applicable.
We tried to detect changes in base composition across sequences by plotting the relative compositions of individual bases, purines, and pyrimidines across all haplotypes, using a sliding window of 50 bp and a step of 10 bp in Treefinder .
The substitution model suggested by MrModeltest 2.2  in conjunction with PAUP* 4b10  using Akaike's Information Criterion (AIC)  was the general time-reversible model [35–37] with substitution rates following a Γ-distribution (GTR+Γ4) with shape parameter α = 0.3732.
Bayesian inference (BI) of phylogeny was performed for all haplotypes of the short alignment in MrBayes 3.1.2 [38, 39] on a 9-node dual G5 processor (2.0 GHz, 18 GB RAM) XServe cluster at AMNH, using the model suggested by MrModeltest (GTR+Γ, with a flat prior on base frequencies following a Dirichlet distribution and six rate categories). Two simultaneous analyses were run for 107 generations each, on two separate occasions, starting from different random trees, and the resulting trees were saved every 1000 generations. One "cold" and three "heated" chains were run with heating scheme fine-tuning; temperature parameters ranging from 2 to 0.03 were used and the latter was chosen, as it was leading to better mixing. The heat applied to all chains except for the "cold" one, was 0.97, 0.94, and 0.92, respectively. Successful chain swaps were in the range of 33–75%. Following visual inspection in Tracer 1.3 , stationarity of the Ln-likelihood (i.e. the Ln-probability of the data given the parameter values) was reached before 106 generations. Similarly, we examined the average standard deviation of split frequencies and confirmed it approached zero (0.003418), indicating a satisfactory run length. Subsequently, the first 1,000 trees were discarded as burn-in and a 50% majority-rule consensus tree  was built with 18,002 optimal trees. The proportion of resulting trees presenting a given clade, in other words, the posterior probability of this clade, represents clade support in a Bayesian phylogenetic analysis .
Phylogeny was also estimated in a maximum likelihood (ML) framework in Treefinder using the GTR+Γ4 model. Approximate bootstrap support was estimated by applying the Shimodaira-Hasegawa (SH) test with RELL approximation  to all local rearrangements around an edge on the topology (also referred to as edge support or LRSH) and 50,000 replicates. For every edge on the tree, all its adjacent branch nearest-neighbor interchanges are computed and edge lengths are re-estimated by fixing all other parameters, and finally the SH test is applied. LRSH support values are the complement of the worst p -values from the SH test. A support value of 99 means this clade is significant at the 0.01 level, a value of 95 at the 0.05 level, and so on.
We employed unweighted and weighted maximum parsimony (MP) in PAUP* with a heuristic search using 100 random addition sequence replicates and TBR branch swapping. Given the number of transitions and transversions as calculated in MacClade 4.08  (88 transitions and 22 transversions) yielding a transition-transversion ratio of 4:1, we downweighted transitions four times using a step-matrix. In unweighted parsimony, gaps were treated as missing data and as a 5th state. Clade support was provided with 500 bootstrap replicates  and strict consensus trees  were built using all equally most parsimonious trees.
We used a Pairwise Relative Rates Test  to detect sequences with significantly different evolutionary rates, as implemented in HyPhy 0.99b  with a GTR+Γ substitution model and a "Local" model. Given a predefined outgroup sequence, the maximum likelihood estimate (MLE) for each 3-taxon tree is calculated, followed by the MLE calculation for the 3-taxon trees with evolutionary rates along the ingroup sequences constrained to be equal. A likelihood-ratio test (LRT) is then performed to examine if rates are equal or independent. The desired precision (absolute error) in the calculation of the Ln-likelihood value was set at 0.001.
The authors wish to thank Simone Watzinger and Jörg Hauf (Scientific Research & Development GmbH, Germany), and Lars Giesen and Uwe Kohler (Medigenomix GmbH, Germany) for excellent technical assistance. We also thank Julia Maltzan (Tierpark Hellabrunn, Munich, Germany) and Eske Willerslev (University of Copenhagen, Denmark) for providing modern muskox materials. We are grateful to Andrés Varón (AMNH) and Kurt Pickett (University of Vermont) for assistance with POY and helpful discussion. SOK is supported by a Columbia University Doctoral Fellowship, and a Conservation Genetics Research Fellowship and the Lewis B. and Dorothy Cullman Program for Molecular Systematics Studies at the American Museum of Natural History. This work was supported by NSF grant OPP 0117400 (RDEM and ADG).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.