Gene duplication as a pervasive force in the molecular evolution of angiosperms
As a result of pervasive and recurring small-scale duplications (e.g. local, tandem, segmental) and whole genome duplications [1–3], which may be followed by functional divergence, many nuclear genes in angiosperms are members of gene families and may exhibit copy number variation. This complicates the identification of potentially orthologous nuclear genes that could be used for applications such as molecular systematics and mapping markers. However, there is a small subset of genes that appear to persist in low copy numbers, ranging from 1 to 4 copies per taxon [4–6]. At the extreme are genes that have resolved to single copy within a few million years after duplication in many independent lineages. There is evidence from Arabidopsis that genes that become single copy following genome duplication are more likely to return to single copy status after subsequent genome duplications. This suggests that there could be a small subset of nuclear genes that are single copy throughout much of angiosperm diversity.
However, genes encoding dosage-insensitive products are also likely to return repeatedly to single copy at random. If a small subset (< 15) of sequenced genomes is sampled for shared single-copy genes, it is likely that some of these genes would be shared in single copy by all the surveyed genomes merely by chance. As would be predicted, dosage-insensitive gene categories more frequently exhibit copy number variation compared to dosage-sensitive genes. Thus, dosage-insensitive genes may be shared by chance in single copy when comparing a few plant genomes, but in general would vary greatly in copy number throughout the plant kingdom. As the number of surveyed genomes increases, we predict that the number of dosage-insensitive genes in the shared single copy list will steadily decrease. Genes that encode dosage-insensitive products should ideally not be used as phylogenetic markers. Dosage-insensitive genes shared in single copy may be paralogs that have randomly lost alternate copies since the shared duplication, and such hidden paralogs are known to produce artifacts in phylogenetic reconstruction, including inaccurate reconstructions of organismal history. However, other genes, which we would term conserved single copy genes, that are truly shared in single copy (i.e. strict orthologs) throughout seed plants are ideal nuclear phylogenetic markers. Conserved single copy genes have not previously been characterized on a large scale, and possible common mechanisms that repeatedly and convergently return these genes to single copy are still unknown.
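The prediction that chance sharing should decline as more genomes are surveyed can be illustrated with a minimal sketch. It assumes, purely for illustration, that each dosage-insensitive gene is single copy in any given genome with the same probability and independently of the other genomes; real genomes share duplication history, so this assumption is a simplification. The function name and all numerical values are hypothetical and are not taken from the data analyzed here.

```python
def expected_shared_by_chance(n_genes: int, p_single: float, n_genomes: int) -> float:
    """Expected number of genes that are single copy in every surveyed genome.

    Assumes (hypothetically) that each gene is single copy in a given genome
    with probability p_single, independently across genomes.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie between 0 and 1")
    if n_genomes < 1:
        raise ValueError("n_genomes must be at least 1")
    # Under independence, a gene is single copy in all genomes
    # with probability p_single ** n_genomes.
    return n_genes * p_single ** n_genomes


# Illustrative values only: 1000 dosage-insensitive genes,
# each single copy in a given genome half of the time.
for n in (2, 4, 8, 15):
    print(n, expected_shared_by_chance(1000, 0.5, n))
```

Under these assumptions the expected number of genes shared in single copy by chance falls geometrically with the number of genomes, which is the qualitative pattern predicted above.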
An ancient history of genome duplication in plant genomes
Whole genome duplications have been inferred in all angiosperm genomes sequenced to date. Analysis of the Arabidopsis thaliana genome provides evidence based on synteny for at least three whole genome duplications [1, 11–14]. Analysis of the Oryza sativa genome suggests two genome duplications in the evolutionary history of the genome, one close to the divergence of the Poaceae and another older duplication [15, 16]. The Populus trichocarpa genome shows evidence for three whole genome duplications. The Vitis vinifera genome is currently interpreted to show evidence of an ancient hexaploidy event with no recent whole genome duplications. These results, in addition to evidence from analysis of ESTs in a number of species throughout the angiosperm tree of life, suggest that polyploidy has occurred in most if not all major extant angiosperm lineages [1–3]. Evidence for frequent gene duplication is also seen in the evolutionary history of numerous gene families that have expanded during the diversification of the angiosperms [19–22]. Gene families that retain duplicated genes can provide rich evidence about species relationships. If the rate of gene duplication and loss is modest relative to the rate of speciation, gene duplications should result in duplicate gene trees that are reciprocally monophyletic. Additionally, the position of a gene duplication in a gene tree can itself be valuable phylogenetic information. But obtaining sequences from throughout multiple gene families for large numbers of target species is costly as well as being experimentally and analytically challenging. The prevalence of duplication in flowering plants means that orthologous loci without retained duplicates in one or more flowering plant lineages may be rare. Especially for phylogenetics, it has been considered important to identify orthologous sequences.
Evidence from yeast indicates that phylogenies based on paralogous genes with asymmetric divergence are misrepresentations of the organismal phylogeny. Duplication has been well documented to result in rate asymmetries in paralogs [10, 23, 24]. This rate variation can confound phylogenetic analysis by introducing long branch attraction artifacts [10, 25]. Although strict orthologs cannot be identified without intensive study and identification of locus position, co-orthologs that are single or low-copy may reduce artifacts introduced by asymmetric divergence after duplication and can be easier to amplify and sequence for phylogenetic studies than genes that are members of gene families.
Previously identified single copy nuclear genes in flowering plants
Relatively few single copy nuclear genes (in the context of the entire genome) have been well studied in flowering plants. Shared single copy nuclear genes in flowering plants are in the unique position of being the closest semblance of strict orthologs in their genomes, and therefore are of great interest. Given the amount of duplication present in flowering plant genomes and their evolutionary history, orthologous sequences that are only separated by speciation events and have not been duplicated since the most recent common ancestor can be considered to be rare, and the number of genes that can be considered orthologous decreases dramatically as we compare increasingly distant lineages. The identification of orthologous sequences is especially relevant for molecular systematics, since the addition of easily amplified and phylogenetically informative sequences to current datasets allows for the independent testing of phylogenetic hypotheses using as much data as possible. A recent study that used high-throughput techniques to identify orthologous sequences in animals showed improved resolution of the animal tree of life when putatively single copy nuclear genes were used to determine the phylogeny, and it was able to test a variety of different phylogenetic hypotheses.
Considerable attention has been paid to the prospect of identifying single copy nuclear genes in flowering plants, primarily for their potential use as phylogenetic markers [4–6, 27–32]. A number of low copy nuclear genes have been previously identified in flowering plants, including the phytochromes, ADH, TPI, GAP3DH, LEAFY, ACCase, PGK, petD, GBSSI, GPAT, ncpGS, GIGANTEA, GPA1, AGB1, PPR and RBP2, primarily for their use as phylogenetic markers [5, 32–46]. Evidence from wheat indicates that duplicated low-copy genomic regions, which may include low-copy genes similar to those in the present study, are rapidly eliminated following polyploidization. In the rare instances in which duplicated copies of single or low copy genes are maintained over long evolutionary periods (tens of millions of years), paralogs show distinct patterns of functional and/or expression divergence. For example, overexpression of LEAFY generally results in early flowering [48, 49], and in cases in which LEAFY is present in duplicate (typically in recent polyploids), expression patterns are typically complementary, suggesting that subfunctionalization may be necessary for the maintenance of both loci [50, 51].
Current tools for phylogenetic analysis in angiosperms
Molecular systematics in flowering plants has been dominated by the use of phylogenetic markers derived from the plastid genome (e.g., rbcL, matK, ndhF, trnL-F) or ribosomal DNA (18S, 26S, ITS, ETS). The predominant use of plastid and ribosomal DNA markers limits the number of genes available for phylogenetic analysis. Typically only angiosperms with sequenced genomes are included in taxon sets for large eukaryotic tree of life datasets [52, 53]. Although the majority of phylogenetic markers used in angiosperms are from the plastid or mitochondrial genomes, low copy nuclear genes have been sought after as phylogenetic and mapping markers [4, 29, 30, 54]. The incorporation of nuclear markers into a combined dataset including plastid and ribosomal DNA markers should improve the robustness of phylogenetic reconstructions at all taxonomic levels by increasing the total number of informative characters (i.e. increasing phylogenetic signal). For example, a combined analysis of multiple low-copy nuclear genes provided an improved and robust phylogeny for relationships that could not be fully resolved using cpDNA and nrDNA sequence data. Further, nuclear genes are necessary to detect hybridization, introgression events, and ancient allopolyploidization events [30, 56]. Ribosomal DNA markers, which undergo concerted evolution, are unreliable for reconstructing ancient allopolyploidization events. However, both recent and ancient gene duplications have complicated the identification of low-copy genes. Nonetheless, the nuclear genome is an important source of genetic diversity that can be used to establish phylogenetic relationships between species, genera, families, and deeper lineages, and resolve the timing of landmark events such as the origin of angiosperms [57, 58] and origin of eukaryotes [52, 59, 60].
For instance, previous studies indicate that intron sequences from nuclear genes such as LFY, ACCase, PGK, petD, GBSSI, GPAT, ncpGS and others are at least as useful as ITS or plastid intron/spacer sequences in resolving family-level phylogenies and in many cases are more informative than ITS and plastid intron or spacer sequences [34, 42, 43]. However, because of lineage-specific duplications, datasets using protein-coding nuclear genes to resolve relationships across all angiosperms are limited (but see PHYC [57, 58]; Pires, unpublished data) and analyses of organellar and ribosomal markers are more typical. Recently, whole plastid genome sequencing has been used to decipher the angiosperm tree of life [61, 62].
Studies using genes that are members of gene families are excellent for identifying duplication events that can be used as synapomorphies and using the greatest amount of data available, since the majority of genes in the nuclear genome of flowering plants are members of gene families [2, 18, 63–67]. However, there are several technical limitations concerning the use of genes that are members of larger gene families. In large informatic-based studies, it can be difficult to eliminate noise, identify closest co-orthologs and select appropriate models of sequence evolution. In traditional PCR-based amplification of phylogenetic markers, the co-occurrence of paralogs results in increased efforts to isolate and sequence all available members of the gene family, with increased time spent in amplification and cloning to separate sequences from different members of the same gene family. However, it is possible that members of gene families can be used for phylogenetic purposes if copy number is stable and orthology can be easily assessed.
In this paper, we describe analyses of a set of shared single copy nuclear genes identified in four sequenced angiosperm genomes. The first analysis utilizes EST contigs and sequences from the TIGR Plant Transcript Assemblies to investigate the utility of 18 shared single copy nuclear genes for deep phylogenetic analysis, as well as to identify the occurrence and timing of lineage-specific duplications. This application of the shared single copy genes will indicate whether these genes are suitable for use as phylogenetic markers and whether they provide phylogenetically informative sequences. The second analysis is a family-level phylogeny in the Brassicaceae based on sequences for a set of shared single copy genes that have been amplified by RT-PCR and sequenced. The mustard family (Brassicaceae, 338 genera, 3,700 species) is an ideal system to test the utility of these shared single-copy nuclear genes for phylogenetic studies and to test if these genes have repeatedly returned to a single copy state following multiple whole genome duplication and diploidization events. The Arabidopsis thaliana genome harbors signatures of at least three rounds of whole genome duplication [1, 11–14], and the "diploid" Brassica species have undergone additional duplication events [68, 69]. A paleopolyploid event occurred approximately 40 mya near the origin of the family (the alpha event in Arabidopsis), while the Arabidopsis beta event occurred within the order Brassicales following the divergence from papaya [8, 71], and a more recent putative triplication event, shared within the tribe Brassiceae, occurred 7.9–14.6 mya [68–70].
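The notion of a gene that is "shared single copy" across a set of genomes can be made concrete with a short sketch. It assumes that genes have already been clustered into families and that the number of members per genome is known. The family names, genome labels and copy counts below are hypothetical placeholders, and the function illustrates the filtering idea only, not the procedure used to produce the gene set analyzed in this paper.

```python
from typing import Dict, List

# Hypothetical labels for four surveyed genomes.
GENOMES = ["genome_1", "genome_2", "genome_3", "genome_4"]

# Hypothetical copy counts per gene family and genome (illustrative only).
COPY_COUNTS: Dict[str, Dict[str, int]] = {
    "family_A": {"genome_1": 1, "genome_2": 1, "genome_3": 1, "genome_4": 1},
    "family_B": {"genome_1": 1, "genome_2": 2, "genome_3": 1, "genome_4": 1},
    "family_C": {"genome_1": 1, "genome_2": 1, "genome_3": 1},  # absent from genome_4
    "family_D": {"genome_1": 3, "genome_2": 4, "genome_3": 2, "genome_4": 5},
}


def shared_single_copy(copy_counts: Dict[str, Dict[str, int]],
                       genomes: List[str]) -> List[str]:
    """Return families with exactly one member in every listed genome."""
    return sorted(
        family
        for family, counts in copy_counts.items()
        # A family missing from a genome counts as zero copies and is excluded.
        if all(counts.get(genome, 0) == 1 for genome in genomes)
    )


print(shared_single_copy(COPY_COUNTS, GENOMES))
```

Requiring exactly one copy in every genome is the strictest possible criterion. A low-copy filter (for example, 1 to 4 copies) could be obtained by relaxing the equality test, at the cost of admitting families in which orthology is harder to assess.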
In the face of local, tandem, segmental, and whole genome duplications, non-random gene loss may maintain some genes in low copy number [7, 65, 72]. The number of single copy genes shared among the four genomes investigated here is higher than expected if one assumes that all duplicate gene pairs are equally likely to lose one copy following genome-wide duplication events. This observation could be explained by selection for the retention of dosage-sensitive duplicates to maintain dosage balance following genome duplication, in which case the shared single copy genes may represent a random subset of genes that are free to be lost following duplication. Alternatively, there may be selection to conserve some genes as singletons in plant genomes. Of course, these are not mutually exclusive hypotheses. In either case, we predict that RT-PCR amplification of multiple single copy nuclear genes will yield many single copy genes with sufficient phylogenetic information to resolve family-level phylogenies. At the same time, researchers must remain aware of the impact that paralogous genes may have on phylogenetic inference.