We show here that phased haplotypes of engrailed associated with wing pattern forms are highly divergent phylogenetically. It has recently been shown that sets of SNPs across engrailed show strong association with the wing patterns of f. poultoni and f. planemoides and that this region shows elevated linkage disequilibrium (Figure 1; ). The presence of multiple linked SNPs is a prerequisite for the recognition of phylogenetically distinct lineages, and we show here that these morph-associated SNPs define a small number of morph-associated haplotypes. Each individual of these two morphs possesses at least one of these haplotypes, which occupy exceptionally long branches in the haplotype tree, even by comparison to the corresponding sequences across distantly related species of Papilio (Figure 3). The divergence of the morph-associated haplotypes is contributed disproportionately by non-synonymous changes. The allele diversity does not follow the strong geographic structure suggested by the mtDNA, which splits the populations in eastern and western branches ( and Additional file 6: Figure S1) in roughly the same way as genitalic characters , although other nuclear markers also show only weak subdivision ( and unpublished). Taken together, these findings strongly suggest that the genomic region around the first exon of engrailed diverges from patterns of neutral variation at phylogenetic and genetic levels, which further supports a role of this region for controlling wing pattern variation in P. dardanus.
The phylogenetic analysis revealed several details about the evolutionary history of engrailed and the adjacent invected loci. First, the higher-level analyses of the genus Papilio (Figures 3 and 4 respectively) did not yield many surprises; expected subclades were largely recovered for both loci, albeit with generally poor support due to the short (<800 bp) fragment length and high homoplasy (Additional file 7: Table S6). Hence it can be concluded that the engrailed and invected loci are useful to track the lineage history, and are not generally distorted by selection outside of P. dardanus. Another key result is that P. dardanus is recovered as monophyletic in the engrailed phylogeny, despite the inclusion of the divergent haplotypes, arguing against an origin for divergent alleles in other species. The intraspecific trees of P. dardanus alleles were characterised by a largely unresolved polytomy at the base in both loci, with very little internal structure related to geography, confirming the conclusions of  based on a more limited sample. Additionally, there is no obvious phylogenetic structure beyond the divergent alleles associated with f. poultoni and f. planemoides in the sequenced region of engrailed (Figure 5). Hence, aside from these divergent alleles, the phylogenies corroborate the findings from the SNP associations , that did not relate the recessive morphs to particular SNPs.
The existence of multiple fixed differences in the two divergent alleles is likely a result of recurrent fixation of novel mutations. The two groups of morph-associated alleles may share some changes due to a common ancestry or alternatively, if these non-synonymous changes have a direct functional role, this might represent convergence between the two alleles. Indeed, all of the changes that unite the f. poultoni and f. planemoides alleles are non-synonymous, consistent with the second hypothesis. However, the observation of an elevated level of non-synonymous intraspecific polymorphism in P. dardanus engrailed is not limited to the obvious cases of the morph-associated alleles. The results of the McDonald-Kreitman test provide strong evidence for non-neutral evolution in this gene. This has been observed with the SNP variation in the highly polymorphic Kenyan population used for the initial association studies , and is confirmed here for a second population from a different region (South Africa) that does not even include the most divergent alleles. The engrailed locus in P. dardanus therefore shows a pattern of evolution that differs from the remainder of the genus Papilio and also from the adjacent invected locus, in that non-synonymous changes have accumulated at a much increased evolutionary rate.
It is highly surprising to find such rapid coding sequence evolution in a gene that is so highly conserved across the arthropods. In particular, one of the changes unique to f. planemoides occurs in the conserved EH1 domain of the Engrailed protein, changing the core motif from FSISNIL in all other P. dardanus to YSISNIL which confers a change in the otherwise conserved core of FxIxxIL . Given this conserved sequence it seems likely that this change may affect binding of Engrailed to the transcriptional repressor Groucho [52, 53]. Whether or not the coding changes in the two divergent alleles affect the function of the Engrailed protein remains an open question. Similarly, we cannot be certain whether these variant sites are themselves the focus of selection, or whether their fixation is the result of hitchhiking due to linkage with other changes. In the latter case, we hypothesise that favourable selection due to a novel mimetic resemblance outweighs the negative effects of a linked genetic load; selection coefficients for mimicry are likely to be very high .
Theory predicts that alleles with a high linked genetic load, as might be conferred by the amino acid changes in the EH1 domain, are likely to be under selection to increase in dominance: linked genetic load is ‘sheltered’ in rare dominant alleles as individuals will nearly always be heterozygous for these alleles . In the Batesian mimicry system of P. dardanus, the equilibrium frequency is the abundance of the mimic relative to their model at which fitness is maximised, without predators associating a pattern with palatability rather than toxicity . The equilibrium allele frequency will therefore be on average lower for a dominant allele as compared to alleles further down the dominance hierarchy. Hence, we speculate that whether the coding sequence changes at engrailed are in any way functional or not, their likely negative pleiotropic effects may be shielded from selection by the recessive-morph alleles at engrailed. This shielding of deleterious mutations through heterozygosity may therefore explain why it is especially the dominant morphs which show such a large excess of non-synonymous mutations.
Finally, we may ask what these results contribute to our understanding of the evolution of the supergene that was hypothesized to underlie the phenotypic variation in P. dardanus. The evolution of linkage disequilibrium to maintain co-adapted alleles is a core facet of classical supergene theory [57, 58] and other supergene systems have been shown to possess complex genomic architectures that may act to reduce recombination between supergene alleles [12, 59–61]. Here we show that the genomic signature of selection evident from the phylogenetic trees and increased rates of non-synonymous changes does not extend beyond engrailed; we find little evidence for morph-associated variants at linked genes such as invected. A similar pattern was seen in the SNP analysis , along with the low level of linkage disequilibrium observed elsewhere in the H region. The high level of diversity uncovered in a population sample of P. dardanus is indicative of negative frequency-dependent selection, as would be expected for a locus underlying polymorphic Batesian mimicry, as selection against over-abundant phenotypes results in a balanced polymorphism, which can be detected by a signature of increased diversity. The existence of multiple fixed changes in these alleles could be the result of recurrent selective sweeps fixing the variants, with either the coding region sites under selection themselves or hitchhiked to fixation as the result of selection on linked variants. The changes might also reflect the build-up of genetic load due to a reduction of recombination as predicted by classic supergene theories of polymorphic mimicry. The previous observation of low linkage disequilibrium beyond the engrailed locus  means that this predicted effect only affects a single gene which, however, spans a large genomic region of ~70 kb .
The findings of high levels of non-synonymous diversity, potentially affecting protein structure, are similar to patterns observed in the doublesex coding region in P. polytes
. Both the P. dardanus engrailed locus and P. polytes doublesex have alternative alleles associated with different female mimetic morphs and these alternative alleles differentiated by multiple changes with a high ratio of non-synonymous to synonymous SNPs in the coding regions. These two mimetic butterflies have evolved similar systems of Batesian mimetic polymorphism through a complex sequence of changes in single developmentally important genes, although the genes involved are very different.