Slow but not low: genomic comparisons reveal slower evolutionary rate and higher dN/dS in conifers compared to angiosperms

Background Comparative genomics can inform us about the processes of mutation and selection across diverse taxa. Among seed plants, gymnosperms have been lacking in genomic comparisons. Recent EST and full-length cDNA collections for two conifers, Sitka spruce (Picea sitchensis) and loblolly pine (Pinus taeda), together with full genome sequences for two angiosperms, Arabidopsis thaliana and poplar (Populus trichocarpa), offer an opportunity to infer the evolutionary processes underlying thousands of orthologous protein-coding genes in gymnosperms compared with an angiosperm orthologue set. Results Based upon pairwise comparisons of 3,723 spruce and pine orthologues, we found an average synonymous genetic distance (dS) of 0.191, and an average dN/dS ratio of 0.314. Using a fossil-established divergence time of 140 million years between spruce and pine, we extrapolated a nucleotide substitution rate of 0.68 × 10-9 synonymous substitutions per site per year. When compared to angiosperms, this indicates a dramatically slower rate of nucleotide substitution rates in conifers: on average 15-fold. Coincidentally, we found a three-fold higher dN/dS for the spruce-pine lineage compared to the poplar-Arabidopsis lineage. This joint occurrence of a slower evolutionary rate in conifers with higher dN/dS, and possibly positive selection, showcases the uniqueness of conifer genome evolution. Conclusions Our results are in line with documented reduced nucleotide diversity, conservative genome evolution and low rates of diversification in conifers on the one hand and numerous examples of local adaptation in conifers on the other hand. We propose that reduced levels of nucleotide mutation in large and long-lived conifer trees, coupled with large effective population size, were the main factors leading to slow substitution rates but retention of beneficial mutations.


Background
Determining the mutational and the selective forces responsible for evolution has overarching implications in biology, e.g. in understanding what makes species unique and how organisms respond to biotic and abiotic challenges. Identifying the rate of evolution and the patterns of nucleotide substitution underlying DNA evolution has thus become a fundamental goal of molecular genomics [1,2]. Key to the central dogma of molecular biology, protein-coding sequences (hereafter referred to as genes) have classically been regarded as a major unit of evolution. Substitutions at synonymous (silent) and nonsynonymous (replacement) sites are commonly distinguished to differentiate between neutral (or at least weak) and active selective forces acting on genes, respectively. In pairwise comparisons of orthologous genes, the ratio of non-synonymous distance (i.e. number of substitutions per non-synonymous site; dN) over synonymous distance (dS) gives a general but conservative indication of the mode and strength of selection [1,2]. An excess of nonsynonymous substitutions (dN/dS > 1) suggests adaptive or diversifying selection, while an excess of synonymous mutations (dN/dS < 1) indicates purifying selection, and no difference between synonymous and non-synonymous mutation rates (dN/dS = 1) is taken as evidence for neutrality [3].
Large-scale sequence datasets now exist, allowing comparisons to be made for thousands of genes in all domains of life. Synonymous and non-synonymous substitution rates have been found to vary widely within and between taxa [4][5][6][7]. From early studies based on a limited number of species and genes to the era of genomics and systems biology [8,9], a complex blend of non-mutually exclusive biological, biochemical and demographic mechanisms emerged to explain these variations. While intraspecies differences are believed to be influenced by selection on protein structure and function (reviewed in [10][11][12][13][14]), interspecies differences are influenced by (i) the efficacy of the DNA repair machinery, (ii) life history traits (e.g. generation time), (iii) metabolic rate, (iv) effective population size (random genetic drift), (v) purifying (background) selection and (vi) reproductive strategy. Some factors (i -iii) influence the way mutations appear, while others (iv -vi) influence their fixation over generations (reviewed in [9,13,14]).
Among plants, most of the attention in comparative evolutionary studies has been focused on flowering plants [4,5,14,15], and interest is now growing for other plant taxa as more sequence data is produced. Gymnosperms are separated from angiosperms by~300 million years of evolution [16]. Expectedly, many biological features of gymnosperms and angiosperms differ greatly, including seed morphology, life span, diversification rate, pollination processes, environmental requirements and response to environmental stresses. With~600 extant species, conifers make up about two thirds of all gymnosperm species, and are the dominant plants in most temperate and boreal ecosystems. Conifers have an immense ecological and economical value such as practical forestry economics, immediate ecological value of forest ecosystems and in the long term, large capacity for carbon sequestration. Biological differences between angiosperms and conifers and the need for long-lived conifer species to cope with challenges such as insect pests and environmental changes, underscore the importance of understanding the molecular and functional evolution of conifer genomes.
In this study, we take advantage of the existing large and high-quality sequence data in two conifer species, Sitka spruce (Picea sitchensis) and loblolly pine (Pinus taeda), consisting of a collection of bona fide full-length cDNA sequences (FL-cDNAs) [33,34] and UniGenes constructed from several EST libraries, respectively. Together with whole-genome gene sets available for two angiosperms, Arabidopsis thaliana and Populus trichocarpa; a rich data set exists to identify rates and patterns of evolution between conifer species and between conifer and angiosperm species. We find evidence for significantly slower evolutionary rates in conifers. In stark contrast, we find a significantly higher dN/dS ratio in conifers as compared to angiosperms, indicating perhaps higher adaptation. We also investigate these patterns across functional categories of genes.

Conifer sequences
Clustered ESTs from loblolly pine were downloaded from NCBI UniGene (build 10, which had 18,921 clusters). Sitka spruce FLcDNAs came from the Treenomix II project [35]; as of Nov. 10 2009, this collection comprised 10,665 FLcDNAs, of which 3,218 clustered in contigs. We used all individual FLcDNAs because our approach ultimately removes any redundant or duplicated sequences.

Open reading frame (ORF) search in conifer genes
All possible ORFs (from start to stop codons) found in spruce FLcDNAs were queried against the plant Uni-ProtKB SwissProt and trEMBL datasets [36], with predicted proteins from Sitka spruce [33] removed from the trEMBL dataset. Only ORFs from the 5,680 spruce FLcDNAs that had no hit against the SwissProt dataset were queried against the trEMBL dataset. ORFs from 3,296 spruce FLcDNAs had no homology with either of the plant UniProtKB datasets; for those, the longest ORF was arbitrarily selected for further analysis. A single FLcDNA with no ORF structure in its sequence was discarded.
We did not use the same strategy for loblolly pine because the pine UniGene set may contain only a truncated portion of the actual coding sequence. For conifers, we looked in each member of the UniGene set for the ORF among all possible ORFs with the same frame as the longest overlapping sequence with the best-scoring BLAST query against the spruce ORFs. Of 18,921 pine UniGenes, we found 7,627 ORFs in the same frame as spruce ORFs.

Orthology of conifer genes
We used the reciprocal best hit (RBH) approach [37,38] to infer putative 1:1 orthologues between spruce FLcDNAs and pine UniGene sequences, using BLAST with -e threshold = 10 -20 . We found a total of 4,774 RBHs, of which 4,250 contained a complete ORF in pine.

Angiosperm orthologues
A. thaliana was chosen because it represents the best characterized plant genome. Poplar was included in the analyses as the first completely sequenced tree genome. A. thaliana coding sequences were downloaded from the TAIR9 annotation release [39]. Poplar coding sequences (annotation 1.1) were downloaded from the JGI Genome Portal [40]. We used Ensembl Compara predictions through the BioMart server [41] to select a list of orthologous genes from Arabidopsis and poplar. Only 1:1 and apparent 1:1 orthologous coding sequences were retained for analysis, finalizing a set of 5,108 orthologues.

Alignment
Gymnosperm (spruce-pine) and angiosperm (Arabidopsispoplar) orthologous coding sequences were aligned using DIALIGN-TX [42] with highest sensitivity (-L option). Gaps in the alignments and gap-free regions > 7 bp, interpreted as non-homologous by DIALIGN-TX, were excluded from the analysis. Finally, alignments shorter than 30 amino acids were discarded. The RBH conifer orthologue set contained 3,883 alignments and the angiosperm gene set totaled 5,073 successfully aligned 1:1 orthologues.

Data analysis Substitution rates
Pairwise distances at non-synonymous (dN), synonymous (dS) and 4-fold degenerate (4D) sites (d4) were estimated for individual genes in both gymnosperm and angiosperm alignment sets using codeml (PAML 4.0) [43,44], with settings seqtype = 1, CodonFreq = 2, Runmode = -2, and transition-transversion ratio () estimated from the data. Genes showing signs of saturated divergence were excluded because codeml results are reliable for moderate ranges of sequence divergence. For conifers, we discarded 42 orthologues with dN/dS = 98.99 and 118 with dS > 0.5, and for angiosperms, we discarded two genes with dN > 5 and 996 genes with dS > 4. Threshold dS values were determined by plotting dN as a function of dS and excluding outliers from the main distribution. Final RBH orthologue sets (see Additional file 1) contained 3,723 conifer genes (average gap-free length = 510 bp) and 4,080 1:1 angiosperm genes (average gap-free length = 387 bp). 95% confidence intervals for evolutionary estimates were calculated based on 1,000 bootstrap replicates using R [45]. Absolute rates of substitution at coding sites (μ) in pairwise comparisons were inferred using the formula: with d the distance at synonymous (dS), non-synonymous (dN) or 4D (d4) sites; T divergence time between spruce and pine, or between Arabidopsis and poplar. Divergence times are documented from fossil records, between~120 and~160 MYA for conifers [46][47][48][49][50][51], and between~105 and~115 MYA for Arabidopsis and poplar [52]. Unless mentioned otherwise, we used 140 MYA and 110 MYA, respectively, as working divergence times.

Analyses of functional categories
Functions of conifer orthologues were inferred using analogy with Arabidopsis proteins for GO annotations, and with plant proteins for descriptive annotation. In detail, spruce ORFs were queried against the TAIR9 protein-coding genes and the plant UniprotKB database using BLASTX (-e threshold = 10 -5 ). Of the 3,983 best hits against Arabidopsis, 1,230 contained an ORF that successfully aligned to loblolly pine ORFs and were assigned the GO annotation corresponding to that of the best Arabidopsis hits, when available.
For statistical comparisons among conifer genes, we used gene set enrichment analysis tools in the Babelomics platform [53], a web application that implements threshold-independent statistics (FatiScan and logistic regression) to investigate asymmetrical distributions of GO terms, KEGG pathways and InterPro domains within our list of annotated genes ranked by dN/dS. Fatiscan uses a Fisher exact test over a collection of partitions of the ranked list of genes, while the logistic model is used to find association of each functional block with the high or low values of the ranked list; under-and over-represented functional terms are then extracted. Prior to these analyses, we removed 43 genes that showed no nonsynonymous substitution. For other functional analyses, we used the 'GO Slim' classification system provided by TAIR database [54].

Substitution rates in conifer protein-coding genes
We aligned the sequences of 3,723 spruce-pine orthologous genes and inferred the number of pairwise synonymous (dS) and non-synonymous (dN) substitutions per site (see Table 1, Additional file 1). Mean dS was 0.191 (95% confidence interval [CI] = 0.188, 0.193), meaning that on average, one mutation occurred about every five sites along both lineages since the common ancestor. Mean dN was lower than dS (0.049; CI = 0.048, 0.050), reflecting the expected elevated mutational constraint on non-synonymous sites.
The neutral theory of molecular evolution predicts that the evolutionary rate at neutral sites corresponds to the actual mutation rate in an organism [55]. Because neutrality at synonymous sites is disputed [56], distance in a subset of synonymous sites known as 4-fold degenerate (μ 4D ) sites (i.e. sites where a change to any of the four nucleotides will not alter the amino acid during translation) stands as a better proxy to estimate the mutation rate. From our comparison in conifers, we inferred distance at μ 4D sites (d4) at 0.177 (95% CI = 0.174, 0.179), which translates into a substitution rate of 0.64 × 10 -9 per 4D site per year (μ 4 , see Table 1), and a range of 0.55 × 10 -9 and 0.74 × 10 -9 using the extreme estimates of divergence time between spruce and pine.

dN/dS in conifer protein-coding genes
Ideally, dN/dS should be estimated at every site to find evidence of selection (which is only possible when comparing more than two species in a phylogenetic context) and not averaged over the entire gene. However, an overrepresentation of non-synonymous substitutions can be used as a crude indication of either adaptive evolution or at least relaxed constraint in protein-coding genes. Mean dN/dS in conifer genes was 0.314 (95% CI = 0.299, 0.329). Of the 3,723 pairwise comparisons, 100 (2.68%) had a dN/dS > 1 (Additional file 2). We note the presence of genes that are involved in abiotic and biotic stress response; some examples are protein kinases, protein phosphatases, heat shock proteins, leucine-rich repeat proteins, histone modification proteins, glycosyltransferases, and transcription factors (see Table 2). Genes with dN/dS lower than 1 can in fact be under positive selection at specific sites [3] and dN/dS measured over the whole gene length is thus considered too conservative to identify genes or groups of genes putatively under positive selection. Hence, we also applied a segmentation test and a logistic regression test to look for functional groups of genes that are significantly and coordinately associated to high and/or low values of dN/dS. Based on 1,230 GO-annotated conifer genes, we found that heat shock proteins, genes involved in signal transduction and regulation of transcription and nucleic acids seem more likely to evolve under reduced constraint; whereas genes involved in translation, protein assembly, chlorophyll biosynthesis and cellular organization are under strong selective constraint (Additional File 3).

Comparison between gymnosperms and angiosperms
We compared evolutionary distances between two representative conifer taxa, Sitka spruce and loblolly pine, and two representative angiosperm taxa, Arabidopsis and poplar (see Table 1). Mean dN in 4,080 Arabidopsis-poplar orthologous genes was 0.202 (95% CI = 0.199, 0.205), mean dS was 2.184 (95% CI = 2.164, 2.206), and mean d4 was 2.006 (95% CI = 1.985, 2.026). Based on a relatively confident divergence time of~110 million years [52], we inferred an average synonymous mutation rate μ S of 9.93 × 10 -9 substitutions per year along the lineages separating Arabidopsis and poplar (CI = 9.84 × 10 -9 , 10.03 × 10 -9 ). This is 15-fold higher than the average mutation rate found in conifer orthologues (see Table 1). Even using the lowest estimate of divergence time between spruce and pine, μ S is more than 10-fold higher in angiosperms. Absolute rates of substitution are calculated assuming equal rates on the poplar and the Arabidospis lineages, but it has been suggested that the evolutionary rate in the poplar branch is one-sixth that of the Arabidopsis branch since divergence [57,58]. Using this factor, we obtained μ S estimates of 2.84 × 10 -9 in the poplar lineage and 1.70 × 10 -8 in the Arabidopsis lineage (Additional file 1), which compares well with 1.50 × 10 -8 , a previously known rate in Arabideae [59]. However, this rate has since been revised to 7.5 × 10 -9 with the recent finding that the divergence time between A. thaliana and A. lyrata is about twice the previously known time, i.e.~10 MYA instead of~5 MYA [60]. We also found a difference in μ N between gymnosperms and angiosperms (0.18 × 10 -9 and 0.92 × 10 -9 mutations per year, respectively), representing a five-fold difference. If we account for the differential rate between the two angiosperm species, the difference for μ N is 9-fold and 1.5-fold with Arabidopsis and poplar, respectively (see Table 1). Figure 1.A illustrates the difference in dS and dN distributions between conifers and angiosperms, in particular the strikingly low dS estimates for conifers.
Overall, our results indicate a relative over-representation of non-synonymous mutations versus synonymous mutations in conifer species compared to angiosperm species. Consequently, mean dN/dS is higher in conifers than in angiosperms, i.e. 0.3137 and 0.0924, respectively, on average, and the distribution of dN/dS values for conifers extends towards and over unity (Figure 1.B). While we found 100 conifer genes with dN/dS > 1 out of 3,723 orthologues, there was a single Arabidopsis-poplar orthologue out of 4,080 orthologues that showed signs of positive selection over the entire alignment (dN/dS = 1.8565). This gene (ORF25; TAIR ID: ATMG00640; Uni-Prot ID: Q04613) encodes a plant b subunit of mitochondrial ATP synthase.
We compared dN/dS between functional categories in conifers and gymnosperms, and consistently found higher dN/dS in conifers in most functional GO Slim categories ( Figure 2; Mann-Whitney test, P < 0.05). However, 'DNA/ RNA metabolism' (biological processes; P = 0.37), and 'chloroplast' and 'ribosome' (cellular component; P = 0.46 and P = 0.62, respectively) showed no significant difference.
If synonymous mutations, and even more so mutations at 4D sites, follow a neutral mode of evolution, we would expect no significant difference in average μ S between functional categories (Additional file 4). However, there were significant disparities among some of the functional categories, even when considering the 'more neutral' mutations at 4D sites (Kruskal-Wallis test; H = 52.831, P < 0.001), a surprising finding because it goes against the neutral expectancy. Interestingly, a recent study in birds has found evidence for selective constraints at 4D sites in the avian genome [61], and completes previous evidence accumulated in mammals [56]. Taken together, these results should call for careful attention when using dS as an estimate of neutral mutation rate, especially when inferring positive selection from dN/dS estimates or when applying molecular clocks. The present study does not claim positive selection but merely reports evolutionary trends; our results are therefore not significantly affected by the assumed neutrality of dS.

Discussion
Our findings, based upon large-scale sampling rather than a small set of genes, are of significance for understanding the differences in patterns of evolution between  conifers and angiosperms. First, we found that evolutionary rates are dramatically lower in conifers than in angiosperms. Second, we find that such differences vary across functional categories of genes. Classically, interspecific studies of protein-coding genes in conifers have involved very few loci. Kusumi et al. [62] studied evolutionary rates of 11 genes in the Cupressacea. Bouillé and Bousquet [63] compared polymorphisms of three nuclear genes in Picea. More recently, Palmé et al. [64] scrutinized patterns of selection in 21 nuclear genes in a pine phylogeny while Chen et al. [65] carried out similar analyses for 10 genes in four spruce species. Large-scale comparative approaches are needed to grasp global evolutionary trends representative of conifer genomes.
Genome-scale sequencing of conifer genomes is coming of age [26,27], in particular for two economically and environmentally important species of the Pinaceae: Sitka spruce and loblolly pine. EST datasets for these species have previously been used in a comparative framework to find conifer-specific genes [66] and studying the evolution of gene families [67] and of xylem-specific genes [68] in vascular plants. Here, we carried out the first comparative study of substitution rates and mutational patterns in a sizable fraction of the conifer gene set -or that of any gymnosperm.
Lower rates of evolution in conifers as compared to angiosperms Are evolutionary rates slower in conifers and gymnosperms than in angiosperms?
We estimated evolutionary measures at 3,723 conifer orthologues and 4,080 angiosperm orthologues. As in any partial list of ESTs (i.e. not genome-wide), there might have been an unintentional selection of particular functional categories of genes, but we believe that our gene set is large enough to be representative of the genome as a whole. We found a much smaller dS in conifers than in angiosperms (0.1908 and 2.1846, respectively; see Table 1). A practical consequence of this difference is that we discarded almost 10 times as many angiosperm genes before final analysis; these genes showed a significant level of genetic saturation compared to conifer genes. Genetic saturation artificially reduces sequence divergence because multiple mutations at any given site of a particularly fastevolving gene cannot be ruled out. All considered, not discarding these genes would only increase the difference in dS between conifers and angiosperms. Estimates of dN were also lower in conifers than in angiosperms (0.0492 and 0.2019, respectively), but the difference was not as dramatic as for dS (see Table 1, Figure 1.A), suggesting that substitutions at synonymous sites are particularly constrained -or that those at non-synonymous sites are less constrained, at equal mutation rate, in conifers as compared to angiosperms. Although the causes for this pattern of substitutions in conifer genes are unclear, the answer resides in what seems a unique picture of mutational processes and/or selective influences that affect conifer genes (see below).
Using published divergence times, we inferred an average synonymous mutation rate of 0.68 × 10 -9 substitutions per site per year in conifer genes (see Table 1); this is 15 times less than the average rate in 4,080 Arabidopsispoplar orthologues (μ S = 9.93 × 10 -9 ). If we account for the lower (1:6) rate in the poplar lineage [57], the difference is 25 times less in conifers than in Arabidopsis (μ S = 17.02 × 10 -9 ), and four times less than poplar (μ S = 2.84 × 10 -9 ). We compiled a list of substitution rates that have been published for gymnosperms and angiosperms (Additional File 5), and our findings fall well into the range of rates reported for the two seed plant groups. For example, two phytochrome genes were shown to evolve at a synonymous rate of 0.48 × 10 -9 per year in Pinus sylvestris and Picea abies [69]. For angiosperms, a rate of 1.5 × 10 -8 per year was commonly accepted for Arabidopsis [59] and the resulting 1:6 rate in poplar (2.5 × 10 -9 per year) is also very similar to our results (Table 1). However, with a divergence time between A. thaliana and A. lyrata recently revised at~10 MY [60], the current estimate of the mutation rate in Arabidopsis has doubled. Although it is unclear how this relates to our results, it is important to acknowledge the uncertainty that exists in our results, in the 1:6 poplar:Arabidopsis ratio and in timing divergence, even when relaxed molecular clocks are used. Interestingly, at the population level, conifers also exhibit lower nucleotide diversity despite high gene flow and low population structure [65,70,71]. In addition, low substitution rate and low nucleotide diversity in conifers are paralleled with reports of relatively low evolutionary rates above the nucleotide level. For example, angiosperms are highly diversified while gymnosperms have experienced a very low speciation rate [72]. At least in birds, diversification has been shown to be positively correlated with mutation rate [73]. At the chromosome level, not only is there little variation in the number of haploid conifer chromosomes (n = 11-13) with only scarce evidence of whole genome duplication and polyploidy [74] but comparative genome maps also suggest that macrosynteny is conserved; making it possible to easily navigate across genomes [75] and suggesting that conifer chromosomes are 'fossilized'. There is on the contrary, a high rate of chromosome evolution in angiosperms [72], as well as frequent polyploidy and genome duplication events. Finally, Jaramillo-Correa et al. [76] found that recombination, which has been correlated with levels of genetic diversity, is lower in conifers compared to angiosperms.
There are only a few known exceptions to this general trend of lower evolutionary rates in gymnosperms.
Conifers have larger genomes than angiosperms [74], partly due to larger gene families and abundance of pseudogenes and partly due to a very high content in repetitive DNA such as transposable elements [27,74]. Possible elevated rates of gene duplication and transposition could have occurred along the gymnosperm lineage to cause this genome expansion, with evidence to date suggesting that these events were ancient [77]. Despite these exceptions, conifers exhibit dramatically slower evolutionary rates compared to angiosperms, in particular substitution rates in protein-coding genes, suggesting the existence of conifer-specific evolutionary mechanisms.
What are the causes for the slow substitution rates in conifer genomes?
Substitution rates vary depending on rates at which mutations appear in individuals and are fixed in the population [9,13].
First, the rate at which mutations appear is affected by the efficacy of the DNA repair machinery, generation time, and metabolic rate. In animals, mitochondrial genes evolve ten times faster than nuclear genes, but the inverse situation is found in plants [4]. This difference may at least in part originate from the presence of the DNA repair gene recA in plant mitochondrial genomes, and its absence in those of animals [78]. To our knowledge, there is no information on the efficiency of the conifer DNA repair system compared to that of angiosperm species. Life history traits such as generation time or total life span are factors that are commonly called forth to explain differences in evolutionary rates detected between species, e.g. in mammals [79], in invertebrates [80] and in plants [81]. In angiosperms, rates of evolution are higher in annuals than in perennials [15]. Our data supports this finding as Arabidopsis (an annual) has higher rates than trees. This accords with the germline theory of mutations [82]. However, generation time effects will be unknown until we can reconcile the difference between cell lineage division time and generation time in plants [14]. Conifers exhibit lower values of nucleotide diversity at the population level despite high gene flow and low population structure [65,70,71] suggesting that trees accumulate fewer mutations per unit of time than other plants and thus generation time is not sufficient to explain the annual-perennial difference in mutation rates. Finally, the low metabolite rate of conifer trees, with their large body size and temperate to boreal habitats [83], as well as reduced recombination rates [76], could generate fewer nucleotide substitutions in their genomes.
Second, the fixation rate of new mutations depends on the interplay between random genetic drift (i.e. effective population size and population structure), purifying (background) selection and reproductive strategy. Large population sizes and extensive gene flow are often suggested as the causes of low synonymous polymorphism found in conifer populations [58]. Both empirically and theoretically, grey areas remain about the effect of effective population size (N e ), population subdivision and selection on the pattern of nucleotide divergence between species [84][85][86]. Our results however support the inverse relationship between N e and neutral substitution rate that is expected by the "nearly neutral theory of molecular evolution" [87]. In addition, with low diversification rate in conifers [72], there have been fewer speciationassociated bottleneck events than in angiosperms, thus continuous low diversity between populations. That conifers are mainly outcrossing (selfing is generally avoided through high early inbreeding depression) is only adding to the homogenization of populations. Indeed, studies have shown that there is weak population structure in Sitka spruce [88] and loblolly pine [89]. Finally, the influence of background selection and other selective forces such as hitchhiking on the genomic reduction of substitution rate in conifers is mostly unknown, although selective sweeps following bottlenecks have been reported for several loci [22,23,90].
Teasing out the evolutionary mechanisms controlling the rate of evolution in any organism is a daunting task. When comprehensive data are available across several conifer and other gymnosperm species, comparative analyses will help elucidate if, in what manner and to what extent typical conifer features such as low metabolite rate, long generation time, large effective population and low genetic structure affect substitution rates [91,92].
Is the evolutionary slow-down similar between conifer and angiosperm trees?
Conifers have high levels of genetic diversity within population but experience low nucleotide substitution rates and low speciation rates. Strikingly, the same trend can be seen in angiosperm trees and all trees (angiosperm and gymnosperm) share common attributes that may explain this similarity such as perenniality, outcrossed mating system and large population sizes [58,82]. However, vast evidences point at a more pronounced slow-down in conifers compared to angiosperm trees, for example: recombination rate [76], nucleotide diversity [58] and substitution rates. In this study, we found that conifers have a lower substitution rate at both synonymous and non-synonymous sites than poplar (see Table 1). The existence of conifer-specific factors that explain this difference is therefore likely; gymnosperms have evolved separately from angiosperms for about 300 MY. However, the exact nature and influence of these factors are still to be determined.

High adaptability of conifers to their environment
We found that mean dN/dS was about three times higher in conifers than in angiosperms (0.3137 vs. 0.0924, respectively; see Table 1) despite much lower substitution rates in conifer protein-coding genes, and that this trend was found throughout almost all functional categories. Higher dN/dS in conifers could be due to a general low mutation rate and a high selective constraint on synonymous mutations, which seems at odds with the neutral expectancy but cannot be completely ruled out, or a general very low mutation rate but a proportionally lower constraint (relative to angiosperm genes) at non-synonymous sites. Assuming a relatively high rate of amino acid change in conifer proteins, high average estimates of dN/dS in conifers have important evolutionary implications, especially in light of the distinctive biology of conifer trees.

Characteristics of fast-evolving genes and functional gene categories
Among 100 conifer genes with dN/dS > 1, we found a large fraction of genes involved in abiotic and biotic stress response. For example, we found two protein phosphatases with dN/dS > 6, and one protein kinase with dN/ dS~3 (see Table 2). Protein phosphatases and kinases act in tandem to regulate signaling pathways for plant stress tolerance or avoidance [93]. Four heat shock proteins, one leucine-rich repeat protein, one histone modification protein, two glycosyltransferases, four glycoside hydrolases, and seven transcription factors are also gene products involved in defense, resistance and/or stress response. Other genes with dN/dS > 1 were involved in cell signaling, development and growth, vesicle trafficking and DNA/RNA binding. These single-gene results were paralleled by a gene set analysis on 1,230 annotated genes ranked by dN/dS, where functional categories involved with heat shock proteins, signal transduction and in the regulation of transcription and nucleic acids were more likely to contain genes with high dN/dS (Additional File 3). Conifers, like other long-lived sessile plants, require responsiveness and plasticity to defend themselves against various herbivores and pathogens, as well as abiotic stresses (e.g. temperature and drought). This plasticity can for example be obtained by regulating transcription and DNA/RNA binding proteins, which could explain why these groups of genes seem to have experienced adaptive selection in conifer lineages. In contrast, categories of genes involved in translation, protein assembly, cellular organization and chlorophyll biosynthesis are under strong selective constraint (low dN/dS) because these processes are highly conserved across either the tree of Life, or across photosynthetic organisms (i.e. chlorophyll biosynthesis).

Adaptability of conifers
The conifer divergence was dramatically slower at synonymous sites than at non-synonymous sites (11-fold vs. 4-fold), suggesting that more adaptive mutations (and deleterious mutations, but see below) are fixed in conifers than in angiosperms. Indeed, there was a single Arabidopsis-poplar orthologue gene with a dN/dS > 1 while values for other orthologues were below 0.6. Conversely, we found a distribution of conifer dN/dS ratios significantly deviated near unity (Figure 1.B), with 100 genes showing values suggesting positive selection (dN/ dS > 1). In addition, all GO Slim functional categories showed a significantly higher dN/dS in conifers than in angiosperms, with the exception of DNA/RNA metabolism and translation, which are evolutionary stable processes ( Figure 2).
A threshold of unity is usually applied to determine if a gene shows signs of adaptive evolution, but this threshold is overly conservative in the case of pairwise comparisons over the whole length of the alignment. Algorithms exist to identify adaptive mutations at specific sites and/or on specific branches of a species tree, even when dN/dS < 1 over the entire gene, but there is an implicit requirement for comparisons of at least three species [3]. At the time of this study, loblolly pine and Sitka spruce had significantly more publicly available sequences than any other conifer, and we chose to restrain our study to two species and several thousands of genes, rather than opting for additional species but a few hundreds of genes. With more sequences becoming available for conifer species [94], it will be possible to test for positive selection using models of evolution across a tree composed of three or more species.
An overarching goal of modern biology is to uncover the genetic architecture of biological adaptations. Our study suggests that there is a substantial amount of adaptive substitutions in two conifer species and we expect that this finding will be generalized to other conifer taxa, especially in environments where conifers compete in extreme ecological niches. For example, the Vietnamese pine has evolved broad leaves, i.e. flattened needles, to compete for light with evergreen angiosperm trees in tropical forests [95]. In Western North America, lodgepole pine has evolved large and thick-scaled cones where squirrels are absent but crossbills are present, while crossbills evolve larger beaks [96]. An arms race between conifers and herbivorous insects, such as bark beetles, results in the diversification of constitutive defense and stress-induced genes in conifers [97]. Sitka spruce and loblolly pine, like most conifers in their natural environment, have been confronted by various endemic herbivorous pests, which we speculate could be reflected by high dN/dS estimates at genes involved in defense and stress response.
Why do conifers show more signs of adaptive evolution than most plant lineages?
Our results show that the low mutational rate seen in conifer genes is congruent with higher dN/dS, i.e. higher adaptability at the amino acid level, compared to angiosperm genes. At first, this relationship might seem contradictory and counter-intuitive; it is accepted that mutations are the foundation for adaptation. In conifers, a combination of factors seems to have promoted a staggering high rate of fixation for non-synonymous mutations, despite a generalized low mutation rate.
Little evidence has been found for adaptive evolution in angiosperm genes. In Arabidopsis thaliana and A. lyrata, purifying selection is the determinant force acting on amino acid substitutions [98]. In addition, Gossmann et al. [99] found little or no signal of adaptation in nine pairs of angiosperm species, except in sunflowers. Other exceptions to this rule are European aspen [100] and the crucifer Capsella grandiflora [101], where 30% and 40% of amino acid substitutions have been fixed by natural selection, respectively. What differentiates sunflowers and C. grandiflora from the other studied angiosperms are low population genetic structure and especially large effective population size (N e > 500,000). European aspen has a lower reported N e (118,000) but it has been argued that 500,000 individuals may not be unrealistic [100]. Strasburg et al. [102] compared different species of sunflowers, and found a positive correlation between N e and levels of adaptive divergence. Sunflowers, European aspen and C. grandiflora are also outcrossing species but an excess of non-synonymous mutations was found in the outcrossing A. lyrata [98], so mating system may only have limited effect on selective pressure compared to demographic factors. Lastly, selfing A. thaliana appears to have rare adaptive substitutions, likely due to consequent population subdivision and reduced N e through different bottleneck episodes [98,103,104].
In conifers, investigations of sequence divergence at the genome level have not been performed yet. Resequencing and comparative data have already provided a large body of evidence that several individual genes in conifers species have evolved under positive selection [58,64,89]. In addition, there are various examples of local adaptation in conifer species, whereby a specific population within the range of the species has expressed a phenotype adapted to an environmental constraint [105][106][107]. Concurrent with our results, the overall picture from the study of molecular evolution of conifer genes is that ecology, demography, life history and genome stability of conifers are favorable for the fixation of non-synonymous mutations. While fixation of deleterious mutations is reduced by outcrossing and large effective population size, most non-synonymous mutations are likely beneficial mutations in the conifer phyla. In addition, although deleterious mutations could be fixed through bottlenecks and selective sweeps, it has been shown that the time to establishment of complex adaptations is minimized in species with a large effective population size, even in the advent of deleterious intermediate steps [108].

Conclusions
Large-scale and genomewide comparative approaches go beyond comparisons of small groups of candidate genes and provide global evolutionary trends. In this study, we found that there was a dramatic slow-down in the overall mutation rate of conifer orthologues compared to angiosperm orthologues. This finding is compatible with an increase in the fixation of non-synonymous mutations, which can be beneficial for adaptation. Large effective population size is likely the main factor that contributes to this trend, along with low population structure, low recombination and outcrossing mating system.
Several genome sequencing projects in conifer species are now funded including for loblolly pine, Douglas fir, sugar pine, white spruce and Norway spruce. These data will allow phylogenetic comparisons of much greater power then we currently employ. Not only should the present approach be expanded to a phylogenetic context, but future studies may also apply comparative methods to tease out the evolutionary processes under various demographic and ecological scenarios [91,92]. Finally, resequencing large numbers of candidate genes, once a reference genome sequence is established, will further identify the mode and strength of selection in conifer genomes.

Additional material
Additional file 1: Evolutionary measures for angiosperm and gymnosperm orthologues. Includes gene/transcript/EST IDs, ORF length, aligned and analyzed length, and dN, dS and dN/dS estimates.
Additional file 2: Annotation and dN/dS values for conifer orthologous genes. A more detailed description based on UniProt, PFAM and Interpro searches is provided for the 100 genes that showed dN/dS > 1, as well putative function where relevant.
Additional file 3: Gene set analyses of conifer annotated genes (Fatiscan and logistic regression methods). Includes references to Babelomics and statistical methods, and results of over-represented categories of genes with high and low dN/dS (adjuste p < 0.05, flase discovery rate correction) in InterPro, KEGG pathways, and GO functional cetegories.
Additional file 4: dS estimates in conifer and angiosperm genes across Arabidopsis' GO Slim functional categories. Mean dS values for conifer (full circle) and angiosperm (open circle) protein-coding genes. Conifer genes were BLASTed against Arabidopsis gene transcripts, whose GO Slim annotations were used for homologous conifer genes. Brackets represent the standard error of the mean. A: Biological processes; B: Molecular functions; C: Cellular component. grant to KR and JB). We thank Stephen Ralph for the production of the Sitka spruce FL-cDNA, Nancy Liao at the Michael Smith Genome Sciences Centre for bioinformatics work, and Elizabeth Flavall for editing the manuscript.