A genetic polymorphism evolving in parallel in two cell compartments and in two clades

Background The enzyme phosphoenolpyruvate carboxykinase, PEPCK, occurs in its guanosine-nucleotide-using form in animals and a few prokaryotes. We study its natural genetic variation in Colias (Lepidoptera, Pieridae). PEPCK offers a route, alternative to pyruvate kinase, for carbon skeletons to move between cytosolic glycolysis and mitochondrial Krebs cycle reactions. Results PEPCK is expressed in both cytosol and mitochondrion, but differently in diverse animal clades. In vertebrates and independently in Drosophila, compartment-specific paralogous genes occur. In a contrasting expression strategy, compartment-specific PEPCKs of Colias and of the silkmoth, Bombyx, differ only in their first, 5′, exons; these are alternatively spliced onto a common series of following exons. In two Colias species from distinct clades, PEPCK sequence is highly variable at nonsynonymous and synonymous sites, mainly in its common exons. Three major amino acid polymorphisms, Gly 335 ↔ Ser, Asp 503 ↔ Glu, and Ile 629 ↔ Val occur in both species, and in the first two cases are similar in frequency between species. Homology-based structural modelling shows that the variants can alter hydrogen bonding, salt bridging, or van der Waals interactions of amino acid side chains, locally or at one another′s sites which are distant in PEPCK′s structure, and thus may affect its enzyme function. We ask, using coalescent simulations, if these polymorphisms′ cross-species similarities are compatible with neutral evolution by genetic drift, but find the probability of this null hypothesis is 0.001 ≤ P ≤ 0.006 under differing scenarios. Conclusion Our results make the null hypothesis of neutrality of these PEPCK polymorphisms quite unlikely, but support an alternative hypothesis that they are maintained by natural selection in parallel in the two species. This alternative can now be justifiably tested further via studies of PEPCK genotypes′ effects on function, organismal performance, and fitness. This case emphasizes the importance, for evolutionary insight, of studying gene-specific mechanisms affected by natural genetic variation as an essential complement to surveys of such variation.


Background
Phosphoenolpyruvate carboxykinase, PEPCK, converts phosphoenolpyruvate (PEP) plus nucleotide diphosphate and carbon dioxide to and from oxaloacetic acid (OAA) plus nucleotide triphosphate, in multiple metabolic contexts among the domains of life. Its guanosine-nucleotideusing form (EC 4.1.1.32; for reaction see Figure 1), while present in some Bacteria and Archaea, occurs mainly in Animalia and in both cytosol and mitochondrial matrix.
Its production of PEP from OAA begins gluconeogenesis or glycerol synthesis from Krebs cycle metabolites, or through them from dietary amino acids or lipids [1]. Its production of OAA from PEP may ″replenish″ Krebs cycle metabolites, or play a role in reaction paths which produce moderate ATP yields during chronic anoxia in some invertebrates [2]. In mice, its overexpression in skeletal muscle yields striking extensions of exercise capacity, lifespan, and reproduction [3].
We have studied natural genetic variation in enzymes of energy metabolism, using Colias (Lepidoptera, Pieridae) as a test system for evolutionary functional genomics [4,5]. While surveying such variation across all enzyme-coding genes of glycolysis and its links to other processes, separate study of PEPCK was prompted by finding 3 highfrequency amino acid polymorphisms at the same PEPCK codons in two Colias species from distinct clades. We first clarify the basis of dual cell compartment expression of PEPCK in Colias vs. other animals. We next study PEPCK 0 s natural variation in the two Colias species, especially the high-frequency amino acid variants shared between species. We locate these variants in PEPCK 0 s structure and explore their possible effects on structure-function relations. With coalescent simulations, we test a population-genetic null hypothesis of genetic drift as cause of this variation, the alternative cause being natural selection.

Methods
Animals and basic molecular biology PEPCK cDNA was made by reverse-transcription (with Invitrogen MMLV enzyme) of mRNA extracted from fat body (preserved in Ambion RNAlater) of 18 Colias eurytheme (randomly sampled near Tracy, California, elevation 25 m) and 18 Colias meadii (randomly sampled from Cottonwood Pass, Colorado, elevation 3780 m). Using ButterflyBase [6], we compared PEPCK sequences from Bombyx mori (BMP026541_1) and Heliconius erato (HEP05212_1) to design consensus primers for initial PCR amplification (using Invitrogen HiFi Platinum Taq and Stratagene Robo-cyclers) of a central sequence fragment of Colias PEPCK cDNA. Primers matching this fragment were designed, using Oligo 6 software (Molecular Biology Insights, Inc.), first to amplify the gene 0 s 3 0 end with an antisense primer matching the cDNA polyA tail (GEN22 (15)-A3end: [4,5]), and then, using the Ambion RLM-RACE kit, to amplify the 5 0 end (unexpectedly complex, as discussed below). Colias-specific primers were then designed (Additional file 1) to amplify and sequence the whole gene from 5 0 to 3 0 untranslated regions (UTRs). Nucleic acids were purified with Qiagen kits and a Qiacube processing robot. Sequences were read in sense and antisense directions with ABI BigDye 3.1 reagents and an ABI 377 sequencer.

Data processing and bioinformatics
Colias PEPCK sequences were cross-checked and edited using BioEdit [7]. DnaSP 5.1 [8] was used to estimate haplotype compositions from individuals 0 heterozygous PEPCK sequences using the PHASE algorithm [9,10], and to tabulate diverse evolutionary-genetic statistics from these sequences; previously written filter programs [5] were used to organize DnaSP analysis of linkage disequilibrium. Sequences of Bombyx mori 0 s PEPCK were retrieved from expressed sequence tag libraries in Butter-flyBase [6] and from assembled genomic DNA in SilkDB 2.0 [11], using search tools of each site. Drosophila sequences were drawn from FlyBase [12], and vertebrate and prokaryotic sequences from GenBank [13].
All sequences were evaluated for cell compartment specificity using the TargetP server [14]. This server 0 s elaborate algorithm assesses mitochondrial targeting of proteins, as contrasted to properties of proteins retained by default in the cytosol, on the basis of characteristics of their N-terminal amino acid sequences: a) richness in basic (Arg, Lys) and hydroxylated (Ser, Thr) amino acids; b) absence of acidic amino acids (Asp, Glu); c) certain secondary structure features [14].
Sequences were aligned with the ClustalW algorithm as implemented in BioEdit. Additional file 2 lists accession numbers of all sequences, including those from Colias as submitted to GenBank. Phylogenetic relationships among sequences were evaluated with PHYML and PHYLIP software [15,16]. Colias sequences were matched to the best available structural templates, for homology-based structural modelling, by the 3D-Jury metaserver [17]. Template structure files were drawn from the Protein Data Bank [18]. Homology-based calculation of Colias PEPCK structures using these templates was done with MODELLER 9.8 [19]; in each case, 5 replicates were run and the best-scoring one (i.e. with lowest value of the molpdf criterion [19]) was used. These models were visualized, and their structural features measured, with DeepView (Swiss-PDB Viewer) 4.0.1 [20].

Basic genomic structure of PEPCK in Colias, other insects, and vertebrates
During primer set development for the autosomal Colias PEPCK gene, we found two sequence forms, differing in their 5 0 ends and 5 0 -untranslated regions (UTRs). Inspection of Bombyx mori 0 s PEPCK sequences [6] clarified this: PEPCK sequence BMP026541_1, closely matching one of the Colias forms, is annotated to cytosolic expression, and BMP000643_1, close in sequence to the other Colias form, to mitochondrial expression. The TargetP server confirmed this compartment targeting for the two forms in each species. In each taxon, these sequences differ only in their 5 0 ends, being mRNA splice variants whose alternative 5 0 exons (each associated with a unique 5 0 untranslated region, in which unique 5 0 amplifying primers are located) are attached to common exons 2-13. (That the sequences following the 5 0 exon are the same, and not parts of fully distinct paralogs, was shown in Colias by the fact that in every case, in amplifying from the 5 0 untranslated region, regardless of which of the two 5 0 exons was amplified, all varying base positions following the 5 0 exon, whether heterozygous or variant-homozygous in an individual, were the same between the alternately amplified sequences.) But in Drosophila melanogaster, distinct, though closely linked, paralogous genes code for PEPCK of cytosol and of mitochondria (12; Figure 2 shows these insects 0 PEPCK 5 0 ends). Pairs of compartment-specific paralogs also occur in diverse vertebrates [13].
A truncated expressed-sequence-tag sequence from Bombyx, similar but not identical to BMP026541_1, occurs in ButterflyBase as BMP026778_1. Consultation of the Bombyx genome assembly [11] clarified this. The whole gene corresponding to BMP026541_1, including the cytosolic exon 1, occupies one locus, punctuated by 12 introns, on the minus strand of scaffold nscaf2789. Roughly 20 kb beyond this on the minus strand begins the second locus BMP026778_1, which is interrupted by base dropouts causing frameshifts in comparison to BMP026541_1; if the first of these is ″repaired″ by substitution from BMP026541_1, more PEPCK-like sequence is recovered to about codon 235, after which more frameshifting results in premature stop codons. Thus this locus behaves like an incipient, but not yet fully silenced, pseudogene. We0ve found no expressed mRNA sequence evidence of any such locus in Colias.

Evolutionary history of PEPCK
What is the evolutionary history of the alternate PEPCK expression strategiesparalogs vs. splice variants? One study has examined the phylogenetic relationships of the GTP-using (E.C. 4.1.1.32) and ATP-using (E.C. 4.1.1.49) PEPCK enzymes [21]. It focused on distribution of these co-substrate types among domains Bacteria, Archaea, and Eukarya, but did not address the eukaryotic cellcompartment-specific forms. Therefore we reconstructed, using protein sequences, the phylogeny of vertebrate and insect GTP-using PEPCKs, with prokaryotic GTP-using PEPCKs as outgroups, drawing sequences from sources listed above. Figure 3 shows that vertebrate mitochondrial and cytosolic PEPCKs form two compartment-specific paralogous branches whose most basal members on each branch are fish sequences. This apparent duplicationand-divergence event may have been part of the wholegenome duplications found at the base of vertebrate evolution [22]. The Drosophila paralogs form a coherent sub-branch of an Insecta branch, originated independently of the vertebrate paralogs, evolving the duplicationand-divergence genomic mechanism for compartmentspecificity in parallel. The apparent Bombyx pseudogene (bmocyto2; BMP026778_1) groups closely with the main Bombyx cytosolic gene; it is unrelated to the Drosophila paralog pair. The exon-splicing mechanism of the two Lepidoptera, Bombyx and Colias, is a distinct alternative to whole gene paralogy for compartmentspecificity of PEPCK.

Overall sequence variation of Colias PEPCK
The cDNA of Colias cytosolic PEPCK includes 1929 bp, 643 codons, and the mitochondrial form includes 1896 bp, 632 codons (omitting the stop in each case). The length differences lie in the 5 0 exons of 22 and 11 codons respectively; base pair and amino acid sites for the gene as a whole are numbered beginning with the mitochondrial 5 0 exon. Cytosol form sequences add 11 codons/33 base pairs (bp) to numbers beyond the end of the cytosol exon 1. The mitochondrion-targeting exon 1 may be excised once its protein has been imported into mitochondria, as this happens with other such proteins, ostensibly to prevent further interaction with the mitochondrial membranes [23]. If so, this would affect interpretation of both structural and population-genetic aspects of mitochondrial PEPCK0s functional evolutionary interactions (see below).
18 C. eurytheme and 18 C. meadii were sequenced for both 5 0 exons and the common exons 2-13, and haplotype phases were estimated for each species as noted above. Genetic statistics are tabulated in Table 1 for the common exons 2-13 and for the two 5 0 exons. PEPCK is highly variable at both amino acid and DNA levels: e.g., for exons 2-13 of Colias eurytheme, in common between the compartment forms, overall nucleotide diversity π = 0.0269, synonymous diversity π ss = 0.1054, and θ = 4N e μ = 0.028 (estimated from the number of segregating sites S). Both 5 0 exons are less variable than exons 2-13. In comparison, Colias eurytheme PGI, one of the most variable animal genes known (maintained so by strong natural selection [4]), shows π Σ = 0.0267, π ss = 0.0993, and θ = 0.034 [5], while average values for a sample of Drosophila melanogaster genes are π Σ = 0.0040, π ss = 0.0135, and θ = 0.0040 [24]. Colias 0 PEPCK thus matches its PGI in level of variability. It also shows similarly high estimates of the minimum number of intragenic recombination events [25], i.e. 60 and 41 for C. eurytheme and C. meadii respectively, vs. 58 in the 1668 bp of C. eurytheme PGI [5].

Patterns of allelic amino acid variation
Each unique PEPCK sequence at either nucleotide or amino acid level of organization constitutes a distinct genetic allele at that level. We focus here on amino acid variation as the possible basis of naturally selected enzyme properties, thence effects on higher-level phenotypes and eventually on Darwinian fitness. Figure 4 shows all mitochondrial-form amino acid allelic variants as inferred by PHASE (above). Nearly all the variation occurs in those exons, 2-13, which are in common Figure 3 Maximum likelihood reconstruction of PEPCK gene phylogeny. Numbers at edges are bootstrap support values from 1000 iterations using a JTT amino acid substitution model in PHYML [15]. Sequences were bootstrapped and iterations compiled with SEQBOOT and CONSENSE from the PHYLIP package [16]. Abbreviations are those of Figure 2 plus: mycobact, Mycobacterium sp. (Eubacteria); sulfolob, Sulfolobus sp. (Archaea); danio, Danio rerio (zebrafish); salm, Salmo salar (salmon); gallus, Gallus gallus (chicken); bost, Bos taurus (cattle); mus, Mus domesticus (housemouse); hsap, Homo sapiens (human).
between the compartment forms. 23 codons have singleton amino acid variants in C. eurytheme or C. meadii while 20 codons have major polymorphism (p 2 ≥ 0.05) in either species (the cytosol 5 0 exons have only one singleton in C. eurytheme, none in C. meadii). These combine into 28 (of 36) distinct alleles in C. eurytheme and 24 (of 36) distinct alleles in C. meadii.
Three polymorphic codons, 335, 503, and 629, have p 2 ≥ 0.05 in both species: Gly/Ser 335, Asp/Glu 503, and Ile/Val 629. All these changes are charge-neutral. By exact binomial test [26], corrected for multiple tests [27], codon 629 differs significantly in its variant frequencies between species, while codons 335 and 503 do not ( Table 2). These variants combine into 8 allele classes or allele ″macrostates″ [5], which are identified by one-letter amino acid codes at each position, e.g. GEI for Gly Glu Ile). Frequencies of these alleles for the two species are given in Table 3; their differences mostly follow C. eurytheme 0 s increase in Ile 629 frequency compared to C. meadii.
We asked if there is intragenic linkage disequilibrium among codon 335, 503, and 629 variants; as Additional file 3 shows, there is not. For reasons noted below, we also tested C. eurytheme for disequilibrium between these 3 codons and the 5 0 -mitochondrial-exon codon 11 Arg/Lys polymorphism of C. eurytheme, but none was found (Additional file 3). Indeed, scanning whole cDNA sequences of each species for linkage disequilibrium with DnaSP and filters (Methods, above) found only one instance of disequilbrium between nonsynonymous variants (codons 122 and 220 gave a locally significant Fisher 0 s exact test at P = 0.001, but this was not significant by DnaSP0s Bonferroni criterion for multiple testing) in C. eurytheme, and no such instances in C. meadii. Absence of disequilibria among amino acid variants fits with the finding above of extensive intragenic recombination in PEPCK of both species. Some mainly nearby disequilibria involving synonymous variants are seen, but no interpretation is evident and we omit these data for the sake of brevity.
On a hypothesis of selective neutrality some variable sites are expected to be shared between species, given large θ and thus a number of sites variable in each species at a time [28]. But on this hypothesis, the vast majority of those variants are expected to be of very low frequency and destined to be lost by drift to fixation [29]. A finding of multiple high-frequency variants shared between species at even roughly similar frequencies is unusual on this hypothesis, and so merits study of its possible causesneutrality as null hypothesis, or some form of natural selection as an alternative. Direct study of the variants 0 functional effects, and fitness consequences in the wild, will be the future and final arbiter of this issue. But we can even now make more use of present data, by homology-based structural modelling and by populationgenetic simulation, to gain further insight.
Structural nature and potential impacts of amino acid sequence variation We summarize PEPCK 0 s protein structure in order to study placement of its polymorphic variants in Colias for clues to their evolutionary meaning. This is a first step in assessing neutrality or selection in functional terms: if molecular modelling shows changes in intramolecular protein bonding by amino acid variants, that may at least suggest variants 0 possible functional effects, while if no such changes are evident, functional neutrality of the variants is strongly suggested. PEPCK is one of a few enzymes of glycolysis and related processes which are active as monomers, without oligomeric structure. The best homologous modelling template available, per 3D-Jury [17], is PEPCK of Rattus: high-resolution crystal structures exist for two catalytic conformations of this ( [30,31], see Additional file 4 for a Colias-Rattus alignment). Sequence identity between Colias and Rattus PEPCKs is 0.61, and their Dayhoff similarity is 0.77. These values support homology-based modelling, with accuracy of mid-range crystallographic resolution [19], to explore variants 0 potential structural effects. We modelled both conformations of each amino acid polymorph allele for each of three compartmentspecific forms: cytosolic exon 1 plus common exons 2-13, mitochondrial exon 1 plus exons 2-13, and exons 2-13 alone in light of the above-noted likelihood that mitochondrial exon 1 is excized once mitochondrial PEPCK has reached its target. Figures 5a,b show that Colias PEPCK 0 s tertiary structure is built up from its secondary structure as an irregular lattice of β-strands. This supports α-helices which form much of the protein 0 s surfaces. Ends of these αand β-structures are connected by loops. The catalytic center includes a mobile ″lid″ loop which when open (structure PDB 2qew, Figure 5a) allows substrate/ product binding or release, but closes (structure PDB 2qf2, Figure 5b) over these ligands when they are bound during catalysis [30][31][32]. Other kinds of changes accompany this lid movement, e.g.: the ″p-loop″, including substrate-binding residues Cys 304 and Lys 306, moves in the catalytic site with the lid0s movement [30,32]; Figure 4 Amino acid sequence variation of Colias PEPCK (as mitochondrial form). Sequence names are formed from species abbreviations as in Figure 2, "m" or "f" for specimen sex, specimen numbers, and "mt1" and "mt2" for alleles inferred from heterozygous sequences with the PHASE algorithm as discussed in the text.   Figure 7. Ile/Val 629 is near one end of the 3 0 α-helix Asn 616 -Gln 630. Their nonpolar side chains, whose volumes differ by one -CH 2group, make van der Waals contact with Trp 595, and often Leu 596, in a different α-helix starting with Lys 592. In comparison to salt bridges or hydrogen bonds, van der Waals contacts occur over a wider range of carbon-carbon distances (hence different values of contact energy), which may be grouped: contact 2.95 -4.5 Ǻ, marginal contact 4.5 -5.2 Ǻ, no contact > 5.2 Ǻ cf. [33]. Figure 8 shows the absence or presence of van der Waals contact with Leu 596 for Ile 629 vs. Val 629, with the other polymorphic sites the same in each case.
absence/presence of a salt bridge to Arg 506 with the extension of 503 0 s side chain between alleles SDI and SEI (Figures 7a,c).    Figure 6); movement of Glu 503 to the right, and Arg 272 down, relative to Arg 506, forming 503-506 salt bridge (cf. Figure 7); loss of strong van der Waals contact between Leu 596 and Ile 629; shortening of the left end of the α-helix containing Leu 596. P, 0.02). This may arise from the combination of Gly 335 0 s small volume and Asp 503 0 s short side chain, keeping Arg 272 and Asp 503 side chains distant from one another. the full mitochondrial form shows the fewest polar bonds by variants at site 335 (1/16) compared to the cytosol and the 5 0 -exon-excised mitochondrial forms (13/32; ×* = 2.47, P < 0.02). The cytosol form shows generally greater van der Waals contact distance between Ile/Val 629 and Leu 596 when closed than when open. This effect is less pronounced in the full mitochondrial form and is slightly reversed in the 5 0 -exon-excised mitochondrial form.
In summary, these results show that each allelic bond combination may make a different potential energy contribution to the stabilization of the protein 0 s structure as a whole, or of one part of the catalytic cycle (i.e. open or closed) vs. the other. By such effects, the alleles might alter either catalytic function or thermal stability, or both, of the PEPCK enzyme, and might do so differently among compartment-specific forms.

Population-genetic testing of hypotheses for shared PEPCK polymorphisms
The species studied here, Colias eurytheme and C. meadii, represent in North America the lowland species complex and an alpine/northern species complex, respectively. They are fully reproductively isolated [34] and are well separated in phylogeny [35]. We now test population-genetic explanations for their sharing of 3 major (p 2 ≥ 0.05) PEPCK polymorphisms, at codons 335, 503, and 629, and their close similarity of allele frequencies at codons 335 and 503 (above). Obvious alternative hypotheses are that some form of balancing selection maintains these polymorphisms in the two species, or instead that the variation is selectively neutral and subject to genetic drift.  We first consider the null hypothesis of selective neutrality. We assume the two Colias species diverged from an ancestral species t generations before the present. A non-synonymous polymorphism shared by the two species could result from a single mutation that occurred in the ancestral species and drifted to the observed frequencies in the descendant species. The probability of this approaches zero as t increases. It is also possible that independent mutations occurred in each descendant species after the split and drifted to the observed frequencies.
We have used the coalescent simulation program ″ms″ [36] to simulate this scenario with a symmetric two-allele mutation assumption, so that mutations can occur both in the ancestral species and in the two descendant species. The simulations are used to assess the probability, given the null hypothesis that the variants are drifting neutrally, that the variants' frequencies in the descendant species would be as similar as or more similar than their observed values. We denote the frequency of the second allele at a particular codon in species A (C. eurytheme) and B (C. meadii) by p 2A and p 2B , respectively. From the ms program's output we estimate the probability that |p 2A -p 2B | is less than or equal to the observed allele frequency difference. This probability will be a function of time since species A and B split from their common ancestor, scaled by effective population size: i.e., t/2N e . To choose a focal set of simulation outcomes from which to estimate this null probability, we set the condition that (p 2A + p 2B ) = observed value. This allows an unequivocal ordering of |p 2Ap 2B | values from maximum to minimum agreement with the null hypothesis, hence minimum to maximum agreement with its selective alternative.
We estimate overall θ ss = 4N e μ ss [N e = effective population size, μ ss = synonymous (= neutral) mutation rate], a central parameter for these simulations, based on values of S ss , the number of segregating synonymous sites, from glycolytic genes of each Colias species: glyceraldehyde phosphate dehydrogenase GAPdH, phosphoglycerate kinase PGK, phosphoglycerate mutase PGAM (C. eurytheme only), pyruvate kinase PK, and triose phosphate isomerase TPI. These genes give no evidence of balancing or positive-directional selection on nonsynonymous sites, which if present could bias estimates by causing neutral variants to "hitchhike" on such sites (W. Watt et al., unpublished). We correct estimates from sex-linked TPI for its N e being a priori ¾ that of the other genes. The average value among estimates from these genes is θ ss = 0.0556. We subdivide this following the common observation that transition mutants are twice as common as transversion mutants overall, and the fact that for any base pair one transition and two transversions are possible as single mutants. Thus, in our primary simulation analysis, to test transition polymorphisms we set θ ss-tr = 0.037, and for transversion polymorphisms we set θ ss-tv = 0.009.
We do not use a value of θ ss from PEPCK itself in our primary analysis, as on the alternative hypothesis of selective maintenance of the shared polymorphisms, θ ss might well be elevated (by hitchhiking) above neutral expectations, biasing any further calculations. However, a variant of the neutral null hypothesis would invoke a PEPCKspecific elevation of the mutation-rate component of θ ss to explain high levels of PEPCK variation. We can, therefore, ask what is the probability of finding |p 2A -p 2B | less than or equal to the observed allele frequency difference using PEPCK θ ss in simulations, to see if elevated mutation rate could explain observed results on a neutral assumption. The overall θ ss value averaged between C. eurytheme and C. meadii = 0.0892, so θ ss-tr = 0.059 and θ ss-tv = 0.015 for these additional simulations.
Next, what should be the "choice rule" for which PEPCK codons to test? Test power would be poor for small total numbers of the second allele at each codon, so our rule is that the total counts (n 2A + n 2B ) be ≥ 10 (out of 72, given 36 sequences for each species). Therefore, on one hand, besides the shared codons of primary interest, we should also test codon 11 of the mitochondrial PEPCK form, polymorphic for Arg/Lys. This has (n 2A + n 2B ) = 16, which satisfies the choice rule although it is polymorphic only in C. eurytheme, not C. meadii, as this is a possible outcome of the null hypothesis. But on the other hand, if the mitochondrialtargeting exon which includes codon 11 is indeed excised after PEPCK enters the mitochondrion (above), codon 11's polymorphism would not be expressed, would be synonymous in functional terms, and should not be tested with the other codons. Hence we report our significance testing both with and without inclusion of codon 11.
We ran 2 × 10 7 simulations of 36 sampled sequences for each of two species and for each of the codon polymorphisms (codons 11, 335, and 629, transitions, and codon 503, transversion), filtered their output for cases satisfying [(p 2A + p 2B ) = observed value], and tabulated the fraction of those with (|p 2Ap 2B | ≤ observed value) at intervals of t/2N e between 0.2 and 2.6 as our null-hypothesis probability. Table 4 presents these results.
For comparison to Table 4, we estimate t/2N e for our two Colias species via genetic statistics of synonymous (assumed neutral) variation at Colias GAPdH, hexokinase HK, PGK, PK, PGAM (both species), and TPI. We define these symbols: π ssA or π ssB = synonymous nucleotide diversity in species A or B; π ssAB = between-species synonymous nucleotide diversity; π ssCA = synonymous nucleotide diversity in the most recent common ancestor (MRCA); μ = mutation rate; N e = effective population size; t = time in generations since MRCA; E(x) = expected value of x. Next, we assume π ssA~πssB π ssCA . π ssA~πssB is evident for all 6 genes (W.B. Watt et al., unpublished); that these values also reflect π ssCA is reasonable, given present patterns of Colias speciation by differentiation of large "semispecies" populations without evident bottlenecking or founder effects [37,38]. Then: Assuming π ssCA ∼ π ssA , E(π ssAB -π ssA ) = 2μt By a parallel argument, E[(π ssAB -π ssB )/π ssB ] ≅ 2μt/4N e μ = t/ 2N e . Estimates of t/2N e were made for each species in turn at each of the 6 genes listed above, again correcting estimates from TPI for its smaller N e . The final average t/2N e over all 6 genes and 2 species is 1.828 ± 0.635 (mean ± standard error of mean). Table 5 applies "Fisher's method" of combining probabilities [27] to the joint analysis of "ms" simulation results for the three polymorphic codons shared among species, with and without the mitochondrial codon 11 polymorphism as discussed above. We used t/2N e = 1.80 as the closest tabulated value less than our averaged estimate. As Table 5 shows, using θ ss estimated from the five genes as above, it is highly unlikely that these polymorphisms would be as or more similar than observed due to neutral drift, whether (P = 0.004) or not (P = 0.001) codon 11 is included in the analysis. This is so even though at codon 629 the polymorphism does differ significantly in frequency between species (above). Using θ ss from PEPCK itself does not change these conclusions importantly: including codon 11 in the simulations, P = 0.006, while without codon 11 P = 0.002. Thus an elevated PEPCKspecific mutation rate combined with neutrality is also rejected as an explanatory hypothesis for the shared polymorphisms. These results support the alternative working hypothesis that codon polymorphisms 335, 503, and 629 are shared between species because they are maintained by parallel natural selection in the two species.

Discussion
PEPCK and PGI: different forms of chronically maintained polymorphism?
We have found that neutral drift is quite unlikely to explain the sharing of PEPCK's amino acid polymorphisms Probabilities of joint neutrality including codon 11, df = 8 P = 0.004 P = 0.006 omitting codon 11, df = 6 P = 0.001 P = 0.002 Site-specific probabilities estimated by simulation as described in the text and given in Table 4. t/2N e estimate (actually 1.828) calculated as described in the text.
Fisher's method [27] is used to assemble joint probabilities of neutrality. Analysis using θ ss from 5 genes tests basic neutral hypothesis vs. selective alternative; analysis using θ ss from PEPCK tests whether an elevated mutation rate component of θ ss could allow the null hypothesis to be sustained. See the text for details. Outcomes for values of species' separation time (t) scaled by effective population size (2N e ), calculated by simulation using "ms" software [36] as described in the text. Alternative sources of θ ss for simulations are explained in the text. P values for codon 11 are the same for both θ ss sources.
between Colias species. Thus it makes sense that PEPCK, as a candidate for selectively maintained chronic polymorphism among species, shows very high levels of genetic variability comparable to those of Colias' PGI gene, whose amino acid polymorphism is maintained widely across the genus by strong balancing selection [4,39,40]. However, the cases differ in detail. PGI polymorphism is maintained in C. eurytheme and C. meadii without preserving allelic identity between species. At PGI, a two-transversion change, Gly 370 GGG → Ser 370 TCG, was fixed by a selective sweep in the midst of chronic polymorphism at other codon sites [4], somewhere in phylogeny between more basal C. meadii and derived C. eurytheme [35]. This increases the eurytheme PGI genotypes' thermal stabilities compared to those of meadii [41], fitting with differences in thermal ecology between the species. In contrast, PEPCK's polymorphism engages the same 3 codons between species although, despite the inter-species similarity of amino acid variant frequencies at codons 335 and 503, the increase of p 2 frequency at codon 629 from basal C. meadii to derived C. eurytheme does change several multicodon allele frequencies ( Table 3).
The amino acid polymorphs of PEPCK and PGI have one structural feature in common: they occur outside their enzymes' catalytic centers. This is often so for natural variants that change catalytic (or stability) properties of enzymes without altering their reaction mechanisms [42]. But other structural aspects of these polymorphisms differ between the genes. PGI, with interpenetrated dimeric structure, is completely inactive as a dissociated monomer, so that no separable allelic properties exist beyond sequence differences themselves, and the genotype is the minimum unit of function and thus of performance or fitness effects. In contrast, PEPCK is active catalytically as a single polypeptide, so allelic structural and functional properties exist distinct from genotypic properties, which would be linear combinations of allelic ones. If PEPCK's genotypes do differ in function, they could, for example, increase heterozygotes' breadth of function across the range of a state variable such as temperature, or differ in balances of alternate metabolic roles (below).
Molecular evolutionists have long recognized the prevalence of conservation of sequence via stabilizing ("purifying") selection across broad clades. In contrast, polymorphism is often seen as transient, whether neutral or selected, with "exceptions" recognized for recombination-suppressing inversion blocks as in Drosophila [43], or for frequency-dependent cases such as host-pathogen "arms races" [44][45][46] or selfincompatibility systems [47]. Cases such as PGI, phosphoglucomutase [48][49][50], and now probably PEPCK, demonstrate that chronic polymorphism may often be a long-term, stable response to multiple-scale environmental variation, albeit perhaps predisposed by complexities of protein structure (PGI) and/or metabolic role (PEPCK).
Accordingly, studies of the kinetic properties and thermal stabilities of the PEPCK variants will be of high priority, testing the present working hypothesis of selective maintenance for their polymorphism. If functional differences are indeed found, these may well give clues to the genotype-phenotype-environment interactions responsible for variants' maintenanceas was the case for Colias PGI. Field studies to test those interactions will follow in turn.

Compartmentation: functional rôles and expression strategies among taxa
The concept of "elementary flux modes" expresses how a group of enzyme steps can be deployed to execute different reaction series serving distinct metabolic functionse.g., glycolysis vs. gluconeogenesis, or interactions of glycolysis with the pentose shunt [51]. As seen above, expression of PEPCK in both cytosol and mitochondria may support alternative elementary flux modes. Krebs cycle carbon skeletons derived from dietary or stored lipids or amino acids could be converted by mitochondrial PEPCK from OAA to PEP, then moved to the cytoplasm (by the tricarboxyl transporter [52]) for reverse-glycolytic support of glycerol synthesis or storage in glycogen [1]. Otherwise, cytosolic PEPCK could prime the Krebs cycle to match large transients in glycolytic flux (such as seen in insect flight), by diverting part of this flux from PEP into OAA, thence into mitochondria, perhaps via the malate shuttle, to react with acetyl-CoA derived from pyruvate [2].
How these flux modes might interact with functional effects of the Colias PEPCK amino acid polymorphisms remains to be explored. But in addition, the alternative strategies for PEPCK forms' expressionsplice variants in higher Lepidoptera vs. full-scale paralogous genes of independent origin in vertebrates and in Dipterabespeak a level of adaptive specialization on a large scale. They may, for example, reflect deep clade differences in nutritional mass-energy budget structures. Thus multiple levels of evolutionary comparison are evoked by our present findings.
These expression strategy differences have implications for studies of clade structure itself. It was proposed that PEPCK sequences might offer good phylogenetic signal differentiating Mesozoic to early Cenozoic divergences of insect taxa, mainly in Lepidoptera but with other insects including Drosophila as outgroups [53]. The part of PEPCK studied is in the common block of splicevariant sequences in Bombyx and Colias, a region also quite similar between the Drosophila paralogs. However, what are the most appropriate outgroups, and whether basal Lepidoptera concur in the strategy of Bombyx and Colias, or display another expression pattern which may complicate sorting out sequence homology vs. paralogy, are open questions which systematists must address if they study this gene.
PEPCK and the place of specific-gene studies in a time of genomic variation surveying High-throughput sequencing and variation surveys using it have remarkable power to screen genomes for genetic evidence of evolution [54]. But it is increasingly recognized that "genomics is not enough" to overcome underdetermination of genetic variation patterns by theoretically possible alternative processes [55]. (This problem should be clear even from simple population genetics: e.g., a heterozygote deficiency compared to Hardy-Weinberg expectation may arise from inbreeding or a Wahlund effect or underdominance in fitness, and only study of process can choose the right explanation.) Genome-wide surveys can at best evoke working hypotheses to be tested by study of varying mechanisms in specific gene systems [56][57][58]. The focus of our molecular survey on a central, functionally well-known pathway has allowed augmentation of PEPCK's genetic statistics with initial structural and population-genetic studies, and poises the case for mechanistic testing. This study of Colias PEPCK, like our earlier work [39], engages diverse processes which can shape natural variation, from proteinspecific structural predispositions or constraints to epistatic interaction among nearby enzyme steps, systemic pathway organization and enzymes' roles in it, or "global" issues of energy allocation or network connectedness. Case studies of natural variation's effects are a potent source of insight into partition of evolutionary causes among these processes.
This does not imply a surrender to evolutionary particularism. On the contrary, we seek, with Whitehead [59], ". . .to see the forest by means of the trees". The now-obvious universality of the genetic code, the "unity of biochemistry", and other unifying concepts of molecular and physiological evolution were established by detailed studies of the molecular mechanisms of diverse organismsin complement to the distillation of natural selection and other early evolutionary generalities out of many specific cases by Darwin and his successors. In biology, the path to heuristic generalization runs through the comparative study of specificity [42,60].
This situation underscores the importance of a difference of evolutionary paradigms: an approach which is self-limited to amechanistic pattern analysis in evolution [61], vs. a view which values patterns as starting points but, as in our earlier work and as begun here, tests their causes by mechanistic studies of genotype-phenotypeenvironment interactions [62][63][64] which are the actual drivers of natural selection [65]. Increasing focus on the power of the latter paradigm [66,67] will lead to deeper insight into evolutionary processes and into realistic generalities concerning them.

Conclusions
We've pursued diverse approaches to PEPCK's evolution in two Colias species: phylogenetic comparisons of strategy for expression of cell compartment forms, finding splice variation in Colias (like Bombyx) as contrasted to paralogous gene divergence in other clades; finding extensive genetic variation at nucleotide and amino acid levels, including three amino acid polymorphisms which are shared among species, in two cases with similar frequencies; homology-based modelling, finding that these three polymorphisms may have both local structural impacts and longer-range interactions among their distinct locations in PEPCK structure; population genetic simulation, testing the null hypothesis of neutrality of amino acid polymorphs and finding it improbable (0.001 ≤ P ≤ 0.006), leaving the alternative of natural-selective maintenance.
Each by itself gives important clues to causes of the gene's extensive variation in context of the splice-based evolutionary strategy of compartment-specific PEPCK expression, in contrast to the paralogy seen in Drosophila and in vertebrates. Together they offer a coherent hypothesis of selectively maintained polymorphism, chronically persistent among species. This hypothesis is now poised for further test, clarifying PEPCK's genotype-phenotypeenvironment interaction by studies of PEPCK's allelic and genotypic functions, performances, and fitnesses.
Additional file 5: Absence/presence of polymorphic amino acid variable bonds with sidechains of nearby invariant amino acids in Colias PEPCK. Abbreviations: H bond, hydrogen bond; Ǻ, Ǻngstrom unit of length; vdW, van der Waals; marg, marginal. Alleles identified by oneletter amino acid codes at each of the three polymorphic sites shared between species. Invariant amino acid sites bonding to variable amino acids are identified by number prior to bond length in Ǻ. 272-503 H bonds engage the backbone carbonyl of amino acid 503. All bond types, including distance criteria for marginal or absent van der Waals contacts, are discussed in the main text and many are illustrated in Figures 6, 7, 8.