Here we present data on a cluster of 3 female LRT-specific serine protease genes suggested to be involved in post-mating processes in A. gambiae s.s. As already shown for other genes with different functions, the reconstruction of the 3 gene-trees shows that most species share alleles at all loci, as an effect of introgression and/or retention of ancestral polymorphisms, and that only A. merus and A. melas are placed in monophyletic assemblages (Figure 2). On the other hand, we found an unusually high substitution rate, which contributes mostly to an exceptionally high level of intra-specific polymorphisms, especially at nonsynonymous sites (Table 1). Moreover, while A. gambiae, A. arabiensis, and A. quadriannulatus do not differ for any fixed replacement, A. melas and A. merus diverge from the other species at all loci, showing a high number of fixed substitutions at both synonymous (7-9) and nonsynonymous (13-23) sites at locus AGAP005195.
The comparisons of different site-models - used to test for the presence of positively selected sites in our codon alignments - indicate that in all 3 serine proteases most amino acidic residues (75-87%) are conserved among the species analysed and that these proteins are overall subjected to purifying selection (SLAC global d
= 0.482, 0.672, 0.406 for AGAP005194, AGAP005195, AGAP005196 respectively, see also figure 3). However, a number of codons appear to be targeted by positive selective pressures in all genes (Tables 2 and 3) and, noteworthy, some of them lie relatively close to the catalytic triad and/or on the edge of the specificity pocket, which is considered to be important for substrate recognition and/or binding (Figure 4a-c).
Moreover, lineage-specific tests of adaptive evolution detected events of episodic selection in AGAP005195 and AGAP005196. In the former gene, the branch-site models detected episodic selection on lineages that were not detected by branch-models, as expected if positive selection occurs at a few sites in an overall purifying selection background. In particular, codon 71 is shown to have evolved under selective pressures along the branch separating the clade grouping A. melas and A. merus from the other members of the A. gambiae complex. These two species are grouped apart also on the basis of other substitutions at positions 51, 52, 70, 72, 107, 157 and 193. Among them, sites 51 and 157 were also detected by BEB analysis of site-model comparisons (Table 1). It is worth to note that most of these residues are exposed to the solvent and form a sort of semi-circle around the active site (Figure 4b), suggesting that this epitope might interact with a peculiar substrate in A. melas and A. merus. It can be hypotesized that this epitope evolved under selective pressures in a common ancestor of these two species, or, alternatively, because of convergent evolution. The latter hypothesis would be consistent with the distant phylogenetic relationship between these two species, as suggested by their chromosomal inversions patterns [11, 28]. In addition, we found A. merus- (13, 36, 104, 105, 152, 155, 159, 160, 165, 183, 199) and A. melas- (152, 181) specific replacements: among them, residues in position 152, 155, 159 and 199 (the last identified by all ML selection methods) on the same protein side of the catalytic triad, while the remaining residues are located in exposed loops in regions far from it (not shown in Figure 4). Follow up molecular studies will be needed to clarify whether AGAP005195 has evolved a different role in A. melas and A. merus as opposed to the other species of the complex.
In AGAP005196, the branch model detected episodic selection along the branch separating the A. melas-like from the A. gambiae-like alleles, and along the branch leading to A. melas. In both cases, the ω assigned to these branches (assumed to be the same at all sites) was not significantly >1. On the contrary, the branch-site test applied to the branch leading to A. melas was statistically significant, but BEB failed to identify sites under positive selection. Finally, a significant excess of fixed replacements between A. gambiae and A. melas was also detected by the MK-test for this gene (Table 1).
The interpretation of the results obtained is not straightforward, with particular reference to those from the models of episodic selection. As already mentioned, the occurrence in AGAP005196 of relatively high frequencies of nonsynonymous mutations at both intra- and inter-specific levels may affect the interpretation of evidence of episodic selection along the branch leading to A. melas. In addition, although A. melas presents four species-specific amino acid replacements at positions 2, 88, 110, 129 (which also contribute to the significance of MK test in the comparison with A. gambiae), these are shared with some A. arabiensis and A. quadriannulatus individuals, thus revealing again a pattern of incomplete lineage sorting also for this particular haplotype.
among conspecifics (or among closely related sequences/species) has been observed in other taxa, and a variety of hypotheses has been suggested to interpret these results under a regime of negative selection: balancing selection, variable population sizes, variable mutation rates, relaxed selective constraints and/or the prevalence of slightly deleterious mutations [19, 20]. Based on our data, we cannot rule out any of these hypotheses, nor provide an unambiguous explanation for the observed pattern. However, the high level of replacement polymorphisms observed in all the 3 serine proteases (Table 1) suggests that these genes might evolve as a functionally redundant cluster. In fact, duplicated genes with partially or completely overlapping functions may experience relaxed evolutionary constraints that allow them to rapidly explore and eventually fix new advantageous variants [29–31]. Indeed, in some Drosophila species, female-expressed serine proteases have experienced recurrent events of lineage-specific gene duplications immediately followed by a period of positive selection, indicating neo-functionalization of gene duplicates [32, 33]. Evidence of recent duplication activity playing a crucial role also in the evolution of Anopheles female proteases is provided by the finding of an additional copy of AGAP005196 located in the same gene cluster. This previously undetected paralog bears the insertion of a miniature transposable element of the TA-I-α-Ag MITE family, which is frequently associated with gene introns and putatively affects gene regulation [17, 34]. If we assume that the A. gambiae serine-protease cluster is experiencing relaxed evolutionary constraints, the decrease of d
observed only at increasing evolutionary distances in all 3 proteases (Figure 3) may simply reflect a lag in removal of slightly-deleterious mutations by purifying selection occurring in the early stages of sequence differentiation. This could, for instance, explain the ω >>1 observed for most A. merus intra-specific comparisons in AGAP005194 (Figure 3), which could alternatively be interpreted also as the result of long-term positive selective pressures in this species (Table 3). The latter interpretation is consistent with the observation that some A. merus-specific polymorphic replacements (i.e. 42 and 43, identified by M2 and M8) map close to the catalytic triad (Figure 4a), although some of them are also shared with other species (e.g. A. quadriannulatus). Hence, a better knowledge on the functional importance of these non-synonymous changes would be needed to evaluate if balancing selection is maintaining different haplogroups at intermediate frequencies in the A. merus gene pool.
In addition, we cannot exclude that genetic drift caused by long-term small effective population sizes of A. melas and A. merus might have also contributed to determine the observed fixation of species-specific substitutions (and, therefore, lineage-sorting). Indeed, it has been argued by many authors that coalescence processes and demographic fluctuations have differently affected and shaped the population genetic history of members of the A. gambiae complex [35–38]. Since πs can be used as an estimator of 4Neμ, assuming the same mutation rate (μ) in all lineages, differences at neutral sites at the AGAP005195 locus would indicate that A. gambiae and A. arabiensis have experienced larger effective population sizes than A. merus, consistently both with their wider geographic range and their higher levels of shared ancestral polymorphisms (see above: πs ~ 1.7%, 0.4%, and 0.0% for A. gambiae, A. arabiensis and A. merus, respectively). A similar explanation could be applied to the four species-specific replacements found in A. melas at the AGAP005196 locus and indeed, under the same assumption, πs would indicate that A. gambiae and A. arabiensis have experienced larger effective population sizes than A. melas (πs ~1.9%, 2.8%, and 0.1% for A. gambiae, A. arabiensis and A. melas, respectively).
Difficulties in the interpretation of results from selection-tests are mostly due to the unresolved phylogenetic history of the A. gambiae complex, as already argued for genes modulating immune responses to the malaria parasites [15, 35, 39–43]. However, the accurate inferences of the structural models of these proteins allowed us to better evaluate the importance of these putative positive selected residues. It is worth to note that most positive selected residues appear not located "at random" on 3D structures. In particular, the structural models show that AGAP005194 and AGAP005195 positive selected residues are placed in a relatively large surface area near the catalytic triad and in the specificity pocket (Figure 4). A similar pattern was also observed for duplicated genes encoding several mating-induced female serine proteases of Drosophila [32, 33, 44–47], which are supposed to interact with rapidly evolving accessory gland proteins transferred by males during mating [48–50]. Sexual selection and/or conflict due to male-female protein interactions have been considered to be responsible for these patterns in Drosophila and to have promoted rapid divergence, which in some cases has been found to be ten-fold higher than in genes expressed in non-reproductive tissues . However, in the case of the monandrous species of the A. gambiae complex, sexual selection and/or conflict cannot be convincingly invoked to explain the presence of positive selected sites near the catalytic triad or in the specificity pocket. A more likely explanation could be that the observed pattern is derived from the interaction of the 3 female-specific proteases with rapidly evolving substrates or inhibitors that may differently modulate their catalytic activity. This scenario mirrors the rapid evolution of immune-related genes engaged in host-pathogen arms race (see ref. 35 and reference therein). Indeed, in Drosophila females sexually antagonistic interactions at the time of mating activate a number of immune-related genes, which are induced by the transfer of sperm and seminal fluid peptides, rather than by pathogens . It has been suggested that this immune response could account for the 'cost of mating', in the form of decreased female lifespan and fecundity . Although in A. gambiae no significant induction of known immune genes was detected after mating and no cost of mating has ever been reported, some genes encoding immune-like peptides were shown to be strongly upregulated in the female atrium . It could be hypothesized that the three serine proteases studied here have a dual role in Anopheles fertility: helping the preservation of the female reproductive tract from possible damaging factors transferred during mating, and processing of the mating plug. In effect, a dual role could be hypothesized for AGAP005194: this protease has been found to respond to bacterial infection  and its role in processing the mating plug is confirmed by its localization on the plug surface (Figure 5). Given the strong down-regulation of the 3 serine proteases at 24 h post mating, it is reasonable to speculate that their transcription may be turned down by male-derived factors released during mating plug digestion, thereby reducing the cost of mating and allowing females to entirely divert their energy resources to reproductive processes. This hypothesis would be more consistent with a co-operation between the sexes in optimizing their reproductive success, rather than with an arms race among sexes.
The relaxation in purifying selection provided by the functional redundancy in the cluster would allow the maintenance of high genetic variability, on which positive selection could act to eventually fix novel variants to perform either new or more specific functions (neo-functionalization) . In fact, the 3D models highlighted an important structural differentiation in the two atrium-specific proteases AGAP005194 and AGAP005195 that might have a different substrate preference (Figure 4d and Additional file 4). This differentiation, due to a bulky aromatic residue (Phe) in position 213 of AGAP005194 respect to a smaller hydrophobic one (Val) in AGAP005195 and AGAP005196, is fixed in all species of the A. gambiae complex and thus likely appeared very early during the evolution of the paralogs, probably because of neo-functionalization. The relative high differentiation (35-50% identity on PEST genome ver. 3.5, Sept. 2009) and the absence of gene conversion among the three paralogs (except for the 'recent' copy of AGAP005196 bearing the MITE insertion) indicate that these are not at their very early stages of duplication in the A. gambiae complex. In this context, it would be important to obtain more information on the conservation of gene linkage (synteny) for this cluster of functionally related genes in species more distant to those within the A. gambiae complex. Interestingly, a comparative study of gene orders between A. gambiae and A. stephensi at ~1 Mb resolution did not detect a conserved synteny block for the chromosome region containing these serine protease genes . Furthermore, the order of these genes was inverted (if not reshuffled) in A. stephensi because of the accumulation of a large number of fixed inversions during the divergence of the two species. Novel data from the ongoing genome sequencing project from 13 more Anopheles species will provide a better knowledge on the orthology and synteny of these genes and, hopefully, a stable phylogenetic framework to trace the evolution of relevant amino acid substitutions in copies of this gene cluster. These data would also help validating the findings of other 'novelties' such as fixed autapomorphic and synapomorphic replacements located in strategic positions close to the catalytic triad of the 3 proteases, which suggests an interaction with factors transferred by males (e.g. substrates, inhibitors or pathogens) that may have differently evolved in independent A. gambiae lineages. In addition, the spermatheca-specific protease AGAP005196, because of its relative low divergence among members of the A. gambiae complex, is likely to have appeared more recently than the other copies with functions partially or completely overlapping with those of AGAP005195. Indeed, preliminary data show that a number of seminal proteins are transferred to the spermatheca bound on sperm (Catteruccia F., personal communication), as it has been shown in Drosophila . It is possible that AGAP005196 is also experiencing relaxed evolutionary constraints that allow the accumulation of mutations not tolerated in the previous selective regime and that might be responsible for its adaptation to substrates present in a novel reproductive tissue (i.e. the spermatheca). This is consistent with the observation that A. melas-specific sites 88 and 129 are located nearby the edge of the specificity pocket and might play a role in substrate recognition and/or binding (Figure 4c).