Low number of mitochondrial pseudogenes in the chicken (Gallus gallus) nuclear genome: implications for molecular inference of population history and phylogenetics

Background: Mitochondrial DNA has been detected in the nuclear genome of eukaryotes as pseudogenes, or Numts. Human and plant genomes harbor a large number of Numts, some of which have high similarity to mitochondrial fragments and thus may have been inadvertently included in population genetic and phylogenetic studies using mitochondrial DNA. Birds have smaller genomes relative to mammals, and the genome-wide frequency and distribution of Numts is still unknown. The release of a preliminary version of the chicken (Gallus gallus) genome by the Genome Sequencing Center at Washington University, St. Louis provided an opportunity to search this first avian genome for the frequency and characteristics of Numts relative to those in human and plants.
Results: We detected at least 13 Numts in the chicken nuclear genome. Identities between Numts and mitochondrial sequences varied from 58.6 to 88.8%. Fragments ranged from 131 to 1,733 nucleotides, collectively representing only 0.00078% of the nuclear genome. Because fewer Numts were detected in the chicken nuclear genome, they do not represent all regions of the mitochondrial genome and are not widespread in all chromosomes. Nuclear integrations in chicken seem to occur by a DNA intermediate and in regions of low gene density, especially in macrochromosomes.
Conclusion: The number of Numts in chicken is low compared to those in human and plant genomes, and is within the range found for most sequenced eukaryotic genomes. For chicken, PCR amplifications of fragments of about 1.5 kilobases are highly likely to represent true mitochondrial amplification. Sequencing of these fragments should expose the presence of unusual features typical of pseudogenes, unless the nuclear integration is very recent and has not yet been mutated. Metabolic selection for compact genomes with reduced repetitive DNA and gene-poor regions where Numts occur may explain their low incidence in birds.


Background
The establishment of the mitochondrion as a cellular organelle by endosymbiosis [1] changed the fate of the ancestral genome that free-living eubacterial ancestors possessed. Mitochondria have reduced genome size as a result of the interaction between them and their host cells. Genes once needed to support life as a free-living organism were lost or transferred to the nuclear genome of the host eukaryote. One of the reasons why mitochondrial genes would benefit from being located in the nuclear genome is reduction in the accumulation of deleterious mutations. Asexually propagated genomes tend to build up their genetic load more quickly than sexually propagated genomes, a principle known in population genetics as Muller's ratchet. Additionally, the formation of reactive oxygen species within mitochondria as a result of the process of respiratory electron transport increases the frequency of mutations, exacerbating the effects of Muller's ratchet [2].
Mode of gene regulation, special properties of gene products, mechanisms of import of proteins into the mitochondrion, and other as yet unknown features may be acting against the complete transfer of all mitochondrial genes to the nucleus [reviewed in [3]]. Although essential genes are still located in the mitochondrial genome, amplification of nuclear copies of mitochondrial genes has been detected occasionally in several taxonomic groups [reviewed in [4]]. Non-functional nuclear copies or pseudogenes have been termed Numts (pronounced 'new-mights', for NUclear MiTochondrial DNA segments) by Lopez and collaborators [5], who found tandem-duplicated mitochondrial copies of a 7.9 kilobase (kb) fragment in the nuclear genome of cats. Subsequently, caution has been recommended when attempting to amplify authentic mitochondrial fragments by polymerase chain reaction (PCR) techniques, as the nuclear copies might be amplified in preference to mitochondrial ones, especially using conserved primers designed on gene sequences of organisms of different taxonomic levels [4,6,7].
With the completion of the sequencing of the human genome [8], extensive genomic analyses have found hundreds of Numts in the human nuclear genome [9][10][11]. These analyses indicate that nuclear copies are widespread in all human chromosomes and involve all mitochondrial genes and the control region. Some of these integrations encompass about 80% of the complete mitochondrial genome. Similarity between human Numts and their mitochondrial counterparts is as high as 99%, raising concerns for the fields of molecular population genetics and phylogenetics because PCR amplification and sequencing of mitochondrial DNA segments are major tools used to address many biological questions in ecology and evolution. High similarity of Numts with mitochondrial genes not only increases the chance of accidentally amplifying the nuclear copy but also lessens any suspicion that the fragment isolated is not of mitochondrial origin, and thus has the potential to invalidate the conclusions of many studies.
Although birds are well studied with over 815,000 sequences deposited in GenBank as of June 4, 2004, Numts have been reported for only four different avian orders (Table 1). In most cases, the nuclear integration involved the control region and cytochrome b gene (cyt b). Sequence divergence between these Numts and the corresponding mitochondrial segment varied from 2 to 31%. However, the extent and details of the Numt fraction in avian genomes will only be adduced when more nuclear genome sequences become available. The recent sequencing of the chicken (Gallus gallus) nuclear genome by the Genome Sequencing Center at Washington University, St. Louis and its availability for public access at the Ensembl [12] website provides an opportunity to check whether the high incidence and occasionally large size of Numts in the human genome also occur in this avian genome. In contrast to human chromosomes, avian chromosomes are classified into macrochromosomes and microchromosomes, according to whether or not they are cytogenetically identifiable by conventional banding techniques. The chicken has a diploid number of 78 chromosomes classified into eight pairs of macrochromosomes, 30 pairs of microchromosomes, and one pair of sex chromosomes (Z and W). We searched for mitochondrial pseudogenes in the chicken genome, and provide a descriptive characterization of the Numts found. We also compare our results to other similar studies on sequenced eukaryotic genomes and show that the frequency and amplification of Numts varies from species to species, and that the numerous Numts found in the human and plant genomes may be the exception to the general rule in eukaryotes.

Results
Results of the BLAST search for sequences in the chicken nuclear genome that have homology with chicken mitochondrial DNA revealed 22 alignments that seemed to be biologically significant as defined by our threshold of e = 10^-4 (Table 2). Moreover, size of alignments and similarity between legitimate mitochondrial sequences and their homologues in the nuclear genome indicate that the nuclear homologues could represent ancient degenerate mitochondrial sequences. Careful inspection of returned alignments led us to infer the presence of at least 13 mitochondrial fragments in the nuclear genome. They are numbered 1-7, 8a, 8b, and 9-13. Identities between Numts and corresponding mitochondrial sequences varied from 58.6 to 88.8%.
Regarding size of mitochondrial pseudogenes, we found six Numts ranging from 782 to 1,733 bp that were recovered by two or more alignments, and seven Numts of 131 to 412 bp recovered by a single alignment. Considered together, Numts contributed 8,869 bp or 0.00078% of the nuclear genome of the chicken.
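The genome fraction quoted above is simple arithmetic, and a minimal sketch can reproduce it. The nuclear genome size used below (about 1.137 Gb) is not stated in the text; it is back-calculated here from the reported 8,869 bp and 0.00078%, so it should be read as an assumption rather than as the assembly size the authors used.

```python
# Total Numt length reported in the text.
NUMT_TOTAL_BP = 8_869

# ASSUMPTION: nuclear genome size back-calculated from the reported
# percentage (8,869 bp / 0.00078%); the text does not state this value.
ASSUMED_GENOME_BP = 1_137_000_000


def percent_of_genome(segment_bp: int, genome_bp: int) -> float:
    """Return a segment length as a percentage of the genome length."""
    return 100.0 * segment_bp / genome_bp


numt_percent = percent_of_genome(NUMT_TOTAL_BP, ASSUMED_GENOME_BP)
print(f"{numt_percent:.5f}%")  # 0.00078%
```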
Ten protein-coding genes (except for ND2, ND3, ND4L, ND6, CO1 and CO3), ribosomal genes, the control region (CR), and 10 of 22 tRNAs were found in Numts. The mitochondrial control region, ND5 and ND4 were each found in three different Numts, followed by cyt b, which was detected in two Numts. All other genes included in a Numt were present only once (Table 2; Fig. 1).
Most genes found in chicken Numts correspond to partial mitochondrial sequences. Complete sequences for the control region, tRNA Phe, tRNA His, tRNA Leu, tRNA Lys, and ATPase 8 were found in Numts 2, 4, 7, and 13. Regardless of the completeness of the mitochondrial pseudogene in the nuclear genome, all protein-coding genes had internal stop codons and/or frame-shift mutations. No tRNA Numt could be perfectly folded in its predicted secondary structure, with the exception of the tRNA His that had only one substitution compared to the mitochondrial counterpart, and that did not interfere with its secondary structure (Fig. 2). However, this tRNA is part of a bigger nuclear fragment (containing partial sequences for tRNA Ser and ND4) that has high similarity with the chicken mitochondrial fragment, indicating that it may be a recent Numt.
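One way to screen a candidate protein-coding fragment for the internal stop codons mentioned above is sketched below. It assumes the vertebrate mitochondrial genetic code (NCBI translation table 2) and a known reading frame; the function name and the toy sequences are hypothetical and are not taken from the chicken data. Frame-shift mutations are not detected by this check and would need an alignment against the mitochondrial gene.

```python
# Stop codons in the vertebrate mitochondrial genetic code
# (NCBI translation table 2): TGA encodes Trp, while AGA/AGG are stops.
VERT_MT_STOPS = {"TAA", "TAG", "AGA", "AGG"}


def internal_stop_positions(cds: str, frame: int = 0) -> list[int]:
    """Return 0-based offsets of stop codons that precede the final codon."""
    seq = cds.upper()
    starts = list(range(frame, len(seq) - 2, 3))
    # Exclude the last complete codon: a stop there is a normal terminator.
    return [i for i in starts[:-1] if seq[i:i + 3] in VERT_MT_STOPS]


# HYPOTHETICAL toy sequences, for illustration only.
intact = "ATGGCCTGACTATAA"     # ATG GCC TGA CTA TAA: TGA is Trp, TAA is terminal
disrupted = "ATGGCCAGACTATAA"  # ATG GCC AGA CTA TAA: AGA is an internal stop

print(internal_stop_positions(intact))     # []
print(internal_stop_positions(disrupted))  # [6]
```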

Table 2. [...] and parameters of alignments returned on BLASTN searches. Start and end indicate positions of alignments in the chicken mitochondrial (mtDNA) and chromosomal (chrom) sequences. Orientation corresponds to whether integration in the nuclear genome is 5' > 3' (+) or 3' > 5' (-). E-val and % ID are respectively expected value and % of identity for each returned alignment. Some Numts were identified by more than one alignment.
The site of Numt integration was further analyzed using the graphic interface available at the Ensembl website [12], using the mitochondrial region and chromosomal positions given in Table 2. We found that integrations occurred in regions where no known or predicted genes were located. These regions were also rich in repeat elements like LINEs, microsatellites and low complexity repeats, but with no apparent association between them and Numts. Ten Numts were localized in three macrochromosomes, and the remaining four in different microchromosomes (Fig. 1). No Numts were identified in the sex chromosomes W and Z, or in contigs not yet assigned to chromosomes. Chromosome 4 (GGA4) had three of the five largest Numts detected in the nuclear genome.
Two Numts deserve more consideration. Numt 7 in GGA4 was located at the very 3' end of the contig, and it may be longer than 1,182 bp. However, it is necessary to close the gap between the contig where it was found and the adjacent contig to check the extension of this integration. Numt 8a and 8b, also found in GGA4, would be considered to represent two independent integrations according to our criteria (e.g. they are inserted in opposite directions and they do not overlap). However, because the region between them spans 816 bp in the nuclear genome and this is similar to the missing fragment of 936 bp in the mitochondrial genome, they may have been part of one transfer event that was later involved in a chromosomal rearrangement, leading to change of orientation of one of the fragments. Moreover, the presence of an intercalated microsatellite at the 3' end of Numt 8a and 5' end of Numt 8b indicates that the rearrangement is more plausible than two independent transfers.
The mechanism of mitochondrial integration in the nuclear genome may be via RNA [13] or DNA [5], and can be identified by checking the 5' and 3' ends of the genes involved. Integration via a DNA intermediate is the most common mechanism in the human genome [10]. In chicken, Numts 2, 9, and 11 contain the CR and the integration is clearly by a DNA intermediate, as the CR is not transcribed. Polycistronic mitochondrial RNA transcripts are quickly processed as they are transcribed, and mRNAs do not have polyadenylation signals at the 5' end [14]. Therefore Numts 3, 4, 6, 7, 8a, 8b and 13 were also integrated via DNA as these signals of processing were not present. Mode of integration cannot be inferred for Numts 1, 5, 10 and 12 as they represent integrations of partial fragments of protein-coding genes with no associated neighboring gene in the same integration, and no end is present in the Numt to check for these processing signatures. We also discarded the possibility that any of these Numts found in the chicken nuclear genome originated from a duplication of another Numt. Furthermore, we found no evidence of tandem repeats around the site of integration.
Figure 1. Representation of the chicken mitochondrial genome and chicken karyotype. Gene names are as follows: cyt b - cytochrome b; COI, COII and COIII - subunits I, II and III of cytochrome oxidase; ND1-6 - subunits 1 to 6 of NADH reductase; tRNAs are represented by their IUPAC one-letter amino acid abbreviations; ribosomal gene subunits are represented by 12S and 16S. Relative position of each Numt, and their numbers as in Table 2, are shown outside the circular mitochondrial genome. A karyotype representation for chicken is shown inside the circular mitochondrial genome. Chromosomes 1-8 are macrochromosomes, W and Z are sex chromosomes, and all others are microchromosomes. Not all chicken microchromosomes can be unambiguously identified by conventional banding techniques, and they are not represented here. Range for BLAST scores is also shown.
Figure 2. Prediction of secondary structure for tRNAs. Secondary structure for legitimate mitochondrial tRNA Phe and tRNA His are shown to the left, and their corresponding nuclear pseudogenes to the right.

Numts in the chicken and other eukaryotic genomes
Our search of the chicken nuclear genome indicates the presence of 13 apparently independent integrations of mitochondrial DNA genes. Two of these Numts may actually represent a single integration that underwent rearrangement resulting in loss of an intermediate region and change of the orientation of one of the remaining fragments. Such rearrangements of Numts have been detected previously in birds and humans [9,15]. No correlation seems to exist between the size of a nuclear genome and number of Numts, although bigger genomes and larger chromosomes can bear more integrations [11]. Although no clear site for integration of Numts has been recognized so far, regions with low gene content are more prone to integrations [10], probably because insertions there avoid disrupting well-organized gene complexes in gene-rich regions, which favors survival of the integration in the nuclear genome. In chicken, most insertions were detected in macrochromosomes that are low in gene content compared to microchromosomes [16,17]. Two mitochondrial DNA regions were identified as hotspots for insertions into the nuclear genome, one at the control region and the other encompassing the intervening sequence between ND4 and cyt b. Although most Numts detected in PCR products in birds are examples of the integration of the CR or cyt b genes (Table 1), this is a consequence of these genes being the most targeted for amplification in ecological and evolutionary studies compared to other regions of the mitochondrial genome. As our analysis was performed in a pre-assembled version of the chicken genome, other Numts may be found when the complete assembly is released. However, the conclusions of our study should still hold as the genomic assembly we searched included contigs not yet assigned to chromosomes.
The number of Numts found in chicken is within the range found in most sequenced eukaryotic genomes (Fig. 3). That is, mitochondrial pseudogenes do not seem to represent a large portion of eukaryotic genomes, and with the exception of human, mouse and plants, they number less than 100. Our results are consistent with the observation that avian genomes harbor fewer repetitive elements and other non-coding sequences [18,19]. Only 17% of the chicken genome is assumed to be composed of repetitive elements including LINEs, SINEs, microsatellites, minisatellites and simple repeats [20] compared to 40-50% of the genome of humans and rodents [8,21,22]. Flight has been claimed to impose constraints on the size of bird genomes, and there is an association between genome size and flying abilities: stronger fliers possess smaller genomes than weak fliers [23]. As flight demands a high metabolic rate, and high metabolic rate in turn restricts cell size, genome content is expected to be reduced to fit a small cell. These same reasons appear to explain why bats have small genomes [23][24][25] and provide independent evidence for the association between flight and compact genomes in homoeothermic vertebrates. Because the number of genes in chicken is similar to that in human, small genome size in chicken has been achieved in part by loss of repetitive DNA and gene-poor chromosomal regions where most Numts occur. Metabolic selection for compact genomes could therefore explain the low incidence of Numts that has been observed in birds. Although sequenced plant genomes of Arabidopsis and rice are smaller than those of humans and chicken, they have a high number of Numts. The reasons for this discrepancy are not well understood, but plant genomes seem to be able to harbor a large number of repetitive elements and to transfer DNA bidirectionally between chloroplasts and mitochondria [28][29][30].

Implications for inference of population history and phylogenetics
As mitochondrial DNA is one of the main sources of information for population genetics and phylogenetics at several taxonomic depths, the inadvertent amplification of Numts via PCR technologies may seriously impact studies and lead to erroneous conclusions about phylogeography and taxon relationships. For example, in a recent study of great apes [7], Numts seem to have been preferentially amplified in gorillas, and similarity of these inserts with mitochondrial copies was high enough to avoid suspicion that Numts had been amplified. Also, the demonstration that the human genome has hundreds of Numts representing all mitochondrial regions including large portions of the mtDNA molecule, some of which have high similarity with their mitochondrial counterparts, has raised concerns that Numts may have gone undetected in many studies published in the last decade. This problem would be especially acute if Numts are a common feature of genomes. Fortunately, it seems that Numts are not as frequent in most sequenced eukaryotic genomes as they are in humans or plants (Fig. 3).
In birds, for example, mitochondrial pseudogenes have been occasionally detected (Table 1), but numbers reported may actually be underestimates as not all findings of Numts are formally published. Unfortunately, these studies do not provide information on the real extent of Numts because the experimental design was not aimed at a search for such elements. Our search indicates that the size of Numts detected in the chicken genome is often smaller than the usual size of the fragments isolated by PCR technology (>600 bp) in most published studies. Also, similarities between chicken Numts and their mitochondrial counterparts were below 89%, and the presence of indels, stop codons or frame-shift mutations would clearly indicate the amplification of a pseudogene instead of a fragment of the mitochondrial genome. In humans, some large Numts representing about 80% of the total mitochondrial genome have been found. However, most human mitochondrial pseudogenes are smaller than 500 bp [26]. Collectively these observations imply that the amplification of a Numt will be rare if mitochondrial fragments targeted for amplification are above the size range of most described Numts.
If the Numts in chicken are typical of those in other birds, amplifications of fragments of about 1.5 kb are highly likely to represent true mitochondrial amplification, and are economically more viable than amplifying very large segments (e.g. > 5 kb) of the mitochondrial genome or cloning PCR products. Moreover, sequencing of fragments of about 1.5 kb should easily detect the presence of unusual features of pseudogenes unless the nuclear integration is very recent. However, if PCR amplification results in more than one band, or sequence ambiguities or background signal are present, direct PCR amplification may not produce authentic mtDNA sequences. In this case other methodologies such as isolation of mitochondria from cells prior to DNA isolation, or isolation of DNA from mitochondria-enriched tissues may provide a solution. Use of conserved primers increases the chance that they might preferentially anneal to a Numt, because Numts are effectively molecular fossils with a slower rate of DNA substitution than mitochondrial DNA [27,31].
Figure 3. Size of nuclear genome for eukaryotes and number of Numts detected. Scale to the left is genome size in megabases. Numbers of Numts are indicated on gray cylinders. Data are from [11], except for chicken, dog, zebrafish and bee. See material and methods for more details.

Conclusions
We have shown that the number of Numts in the nuclear genome of the chicken is low compared to what was found in the genomes of humans and plants. Although caution must still be exercised in PCR-based studies, the small size and sequence divergence of these chicken pseudogenes from mitochondrial copies indicate that they may be less of a concern in mtDNA-based studies of birds relative to primates and plants. However, we will only know to what extent these findings apply generally to avian genomes when sequences of more diverse bird taxa are completed.

Sequence analysis
The full-length mitochondrial genome for chicken [32] was retrieved from The National Center for Biotechnology Information database [33] under accession number NC_001323 and used to perform similarity searches against a database of the draft sequence of the chicken nuclear genome released by the Genome Sequencing Center at Washington University, St Louis (Build WASHUC1) and publicly available at the Ensembl Genome Browser [12,34] as of March 2004. BLAST [35] searches were used, with the whole mitochondrial genome sequence or mitochondrial genes individually as query. Results from both strategies were the same. We set the maximum expectation value in BLASTN searches to be e = 10^-4 to recover hits that are biologically significant. No filters were used during searches. Assuming e values in the range of 10^-4 to 10 resulted in extra hits that have lower similarity with the query sequence, and shorter alignments, therefore indicating the randomness of these hits. Further analyses indicated that some recovered alignments represented short T-rich regions in the nuclear genome that aligned with a short T-stretch present at the beginning of the chicken mitochondrial control region.
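The expectation-value cutoff described above can be illustrated with a short filter over BLAST hits. This is only a sketch: it assumes the hits are available in the standard 12-column tabular BLAST layout (e-value in the 11th column), which is not necessarily how the Ensembl interface returned them, and the rows, contig names and helper names are hypothetical.

```python
import csv
import io
from typing import NamedTuple

MAX_EVALUE = 1e-4  # maximum expectation value used in the text


class Hit(NamedTuple):
    subject: str
    percent_identity: float
    length: int
    evalue: float


def significant_hits(tabular_text: str, max_evalue: float = MAX_EVALUE) -> list[Hit]:
    """Parse 12-column tabular BLAST output; keep hits with e-value <= max_evalue."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = Hit(subject=row[1], percent_identity=float(row[2]),
                  length=int(row[3]), evalue=float(row[10]))
        if hit.evalue <= max_evalue:
            hits.append(hit)
    return hits


# HYPOTHETICAL rows for illustration only; not results from the study.
example = (
    "mtDNA\tcontigA\t85.0\t400\t60\t0\t1\t400\t1000\t1399\t1e-50\t200\n"
    "mtDNA\tcontigB\t70.0\t40\t12\t0\t500\t539\t77\t116\t2.5\t30\n"
)
for hit in significant_hits(example):
    print(hit.subject, hit.evalue)
```

In this toy input only the first row passes the cutoff; the second, with a short alignment and an e-value of 2.5, is the kind of hit the text attributes to chance.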

Identification of mitochondrial integrations in the nuclear genome
Results from searches were analyzed via BLASTView, a graphic interface that displays the results after a BLAST search in the Ensembl website. For all recovered alignments that had similarity between mitochondrial and nuclear genomes above 50% and a significant e value, we downloaded the contigs where these BLAST hits were observed to investigate the characteristics of the mitochondrial pseudogene in the nuclear genome, also known as Numt [5]. For most Numts, contig and chromosomal position were obtained from BLASTView, and information on the region of integration was gathered by examining the maps and annotation provided in ContigView and ExportView links, respectively. When two alignments were returned for the same contig, they were merged and considered to be the same integration event if they were in the same orientation and had overlapping bases. When gaps between alignments in the same contig were observed, the alignments were considered to be the same integration event if the gap was similar in size to the mitochondrial fragment expected to fill this gap. Also, for the latter case, we checked the intervening sequence for the possible presence of insertion or deletion of nucleotides.
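The merging criteria above can be written as a small decision function. The sketch below is an interpretation, not the authors' procedure: the text does not quantify "similar in size", so the 25% tolerance is a hypothetical stand-in, overlap is checked on nuclear coordinates only, and all coordinates in the example are invented. Only the two gap sizes (816 bp nuclear, 936 bp mitochondrial) come from the description of Numts 8a and 8b in the Results.

```python
from dataclasses import dataclass, replace


@dataclass
class Alignment:
    contig: str
    strand: str     # "+" or "-"
    mt_start: int   # mitochondrial coordinates, start <= end
    mt_end: int
    nuc_start: int  # nuclear contig coordinates, start <= end
    nuc_end: int


# ASSUMPTION: hypothetical relative tolerance for "similar in size".
GAP_TOLERANCE = 0.25


def same_integration(a: Alignment, b: Alignment, tol: float = GAP_TOLERANCE) -> bool:
    """Return True if two alignments are treated as one integration event."""
    if a.contig != b.contig or a.strand != b.strand:
        return False
    first, second = sorted((a, b), key=lambda x: x.nuc_start)
    nuc_gap = second.nuc_start - first.nuc_end - 1
    if nuc_gap < 0:
        return True  # overlapping bases in the same orientation
    # Size of the mitochondrial fragment expected to fill the gap.
    mt_lo, mt_hi = sorted((a, b), key=lambda x: x.mt_start)
    mt_gap = mt_hi.mt_start - mt_lo.mt_end - 1
    return abs(nuc_gap - mt_gap) <= tol * max(mt_gap, 1)


# HYPOTHETICAL coordinates chosen to give an 816 bp nuclear gap and a
# 936 bp mitochondrial gap, with the two fragments on opposite strands.
numt_8a = Alignment("contigX", "+", 1000, 1999, 5000, 5999)
numt_8b = Alignment("contigX", "-", 2936, 3500, 6816, 7380)

print(same_integration(numt_8a, numt_8b))                       # False
print(same_integration(numt_8a, replace(numt_8b, strand="+")))  # True
```

Under these rules the opposite orientations alone keep the two fragments separate, which matches how the Results treat Numts 8a and 8b before considering a rearrangement; with matching orientation, the gap comparison would pass under the assumed tolerance.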

Secondary structure for tRNAs involved in Numts
Predictions of secondary structure for legitimate tRNAs and for their Numts that contained the complete tRNA sequence were obtained using the DNA mfold web server [36]. The folding temperature was the default, 37°C. For some tRNAs, some bases were forced to pair to obtain the expected mitochondrial tRNA structure as previously described [32].

Search for Numts in other available genomes
A recent study has summarized the distribution of Numts in a variety of organisms [11]. However, it did not include information on Numts for the dog, zebrafish and bee genomes, which have only recently become available in GenBank. Therefore, we performed an initial analysis for Numts in these genomes (GenBank Builds cra_dog_assembly, zebrafish_HTGS 1.1, Amel 1.1, respectively), using the corresponding full-length mitochondrial sequence. The database for these organisms was the pre-draft assembly available in GenBank in late March, 2004. In these searches, we only recorded the number of alignments found by BLASTN searches, and no further analysis was performed to evaluate the overlap between alignments. This same procedure was adopted in [11]. Therefore, caution is necessary in the interpretation of the number of Numts reported in those organisms and in chicken. Our goal was to have a rough estimate of the number of Numts in these genomes for comparative purposes only.