We analyzed an A. gambiae gene family in the "fertilization island" of chromosome arm 3R, where male-specific genes possibly involved in modulating female post-mating behavioural responses are located. The following evidence strongly suggests the presence of a duplicate of the AgAcp34A-2 gene, which we have named AgAcp34A-3: i) a consistent, reproducible permanent heterozygosity in sequence data obtained for AgAcp34A-2 in all specimens, and ii) the detection of SNPs (i.e. double peaks in chromatograms) in each paralog-specific sequence. Although trace reads of this novel duplicate are found in the repository of the A. gambiae genome (as well as in those of M- and S-forms), the annotation and localization of this copy on the genome map are probably hindered by the presence of a segmental duplication, which complicates the assembly. In fact, our FISH assays suggest that this gene is placed in the same chromosomal division as AgAcp34A-2 in A. gambiae, A. arabiensis and A. merus (Figure 3) and that a tandem duplication likely occurred in this genomic region in a common ancestor of the A. gambiae complex. Note that, presently, two genes are reported as putative paralogs of AgAcp34A-2 in the A. gambiae genome: AGAP009369 (i.e. AgAcp34A-1, sharing 86% identity), mapping to the same chromosomal division as AgAcp34A-2, and AGAP012706 (99% identity with AgAcp34A-2), which is annotated on the artificial "unknown chromosome" containing unassigned scaffolds. However, the high level of identity between AGAP012706 and AgAcp34A-2 in both coding/non-coding and flanking regions raises doubts about the actual existence of this additional paralog and suggests that it should rather be considered an alternative haplotype of AgAcp34A-2.
Overall, the AgAcp34A-2 protein shares 83% and 78% of amino acid residues with the AgAcp34A-1 and AgAcp34A-3 proteins, respectively. A smaller proportion of residues (68%) is shared between AgAcp34A-1 and AgAcp34A-3 (see also Figure 4). Since the three paralogs are present in all analyzed species, it is likely that the duplication events happened before the split of the Afrotropical A. gambiae complex species from a common ancestor, and even before the divergence of the complex from the Asian malaria vector A. stephensi. In fact, we also identified all three genes in the latter species (data not shown), and AgAcp34A-2 can be found in its transcriptome. Although it is difficult to determine with certainty when and how these duplications occurred, as well as the sequence of duplication events, it is likely that the most closely related copies (i.e. AgAcp34A-1 or AgAcp34A-3) could represent the direct descendant of an ancestral gene from which the other paralogs would have originated. A duplication of this ancestral gene could have given rise to a copy that, as a result of functional redundancy, would have been free to accumulate mutations along its coding and flanking regions. Subsequently, a gene conversion event between these two copies would have homogenized their N-terminal portions, as well as part of their 5' flanking regions. This would explain why the high diversity between AgAcp34A-1 and AgAcp34A-2 is observed mostly at the C-terminus and in the 3' flanking region (Figure 4). A further gene duplication event (possibly due to an unequal cross-over) would have then generated a third copy (the ancestor of AgAcp34A-3) from an AGAP009370-like ancestral gene. Again, this new duplicate would have accumulated mutations leading to the fixation of several amino acid changes, possibly under selective pressure.
The finding of several replacements in the N-terminal portion of the AgAcp34A-3 secreted peptide leads us to hypothesize that this region represents the least constrained part of the protein, where the fixation of novel mutations could more easily occur. Finally, other homogenization events among the three genes (e.g. an ectopic recombination, as inferred between AgAcp34A-1 and AgAcp34A-3) likely contributed to the evolution of these paralogs, complicating the interpretation of their evolutionary patterns. For instance, the degree of sequence variation observed in the flanking regions of AgAcp34A-1 and AgAcp34A-2 is consistent with the hypothesis that these two duplicates originated earlier than AgAcp34A-3, whose flanking regions nearly perfectly match those of AgAcp34A-2, thus indicating a more recent origin. However, we cannot rule out the hypothesis that AgAcp34A-3 could be the actual ancestral gene from which AgAcp34A-2 (and, later, AgAcp34A-1) would have originated. This alternative scenario would have required an extensive homogenization process between AgAcp34A-2 and AgAcp34A-3, leading to decreased nucleotide diversity in flanking regions but not in coding regions. Additional data are needed to assess the orthology, synteny and diversity of these genes in other Anopheles species and to clarify the sequence of duplication events that generated the copies of this gene family.
From a functional perspective, gene duplication is believed to increase the plasticity of transcriptomes [15, 40]. In fact, novel copies, being initially free from selective pressure because of their functional redundancy, may, depending on the relative forces of selection and drift: i) remain very similar to the original copy and maintain the same function; ii) diverge and acquire a new function (neo-functionalization); or iii) lose their function and become pseudogenes (pseudogenization). Although we do not have direct allele-specific evidence of protein translation, we did not detect mutations in the nucleotide sequences of any of the three genes that could impair the ability of some haplotypes to code for a functional protein. Furthermore, we found that all three genes were transcribed in males of at least three species (A. gambiae s.s., A. arabiensis and A. merus) (Additional file 1), suggesting that pseudogenization has not occurred. Previous RT-PCR assays had shown that AgAcp34A-1 and AgAcp34A-2 were exclusively expressed in MAG of A. gambiae s.s., although these experiments were not designed to efficiently distinguish among transcripts produced by the three paralogs. Recently, mass spectrometry proteomic analysis demonstrated that AgAcp34A-2 (and/or AgAcp34A-3, but no data are available to confirm the presence of AgAcp34A-1) is among the proteins expressed exclusively in MAG of A. gambiae s.s. and among the components of the mating plug. Indeed, our immunofluorescence assays confirmed that an AGAP009370-like protein is specifically expressed in the posterior part of MAG in all Anopheles species examined here (Figure 5c, d) and showed that one or both proteins are concentrated at the posterior tip of the A. gambiae s.s. mating plug (Figure 5b). As the antibody used in immunofluorescence might recognize both AgAcp34A-2 and AgAcp34A-3, we cannot state with certainty whether one or both proteins are transferred to females during copulation.
Regardless, the specific localization of the AGAP009370-like protein on the portion of the mating plug that is in close proximity to the duct connecting the atrium to the spermatheca suggests two considerations: i) the process of plug formation and transfer is spatially and temporally organized, and ii) the AGAP009370-like protein might play a role in sperm function upon migration to the spermatheca.
In Drosophila, comparisons of gene sequences within and between species have shown that Acps are rapidly evolving and that many paralogs that arose by gene duplication events have diverged rapidly from their ancestral copy under directional selection, leading in some circumstances to the acquisition of lineage-specific duplicates [4–12]. The availability of multiple Drosophila genome sequences has allowed evaluation of the evolutionary hypotheses of positive selection on these male-specific reproductive proteins in a robust phylogenetic and functional context. In contrast, the lack of genomic data on Anopheles taxa other than A. gambiae s.s. and the absence of a reliable phylogenetic background do not allow the evolution of characters to be inferred along well-defined lineages (e.g. in monophyletic sister-groups) [21, 41]. Furthermore, although the A. gambiae complex represents an interesting model to study the adaptive evolution of genes potentially involved in reproductive isolation, the pervasive incomplete lineage sorting of alleles among closely related species affects the interpretation of genetic estimates and the application of selection models [21, 42]. Similarly to what has been reported for other genes of this complex [21, 43–45], we found several haplotypes shared by multiple A. gambiae complex species in all three paralogs, due to introgressive hybridization and/or retention of ancestral polymorphisms. Despite this, we observed fixed species-specific replacements in at least one gene from each taxon along their geographical distribution, with the exception of A. gambiae s.s. (Figures 1 and 4). In AgAcp34A-2, one fixed species-specific replacement (i.e. a glutamine (Q) to glutamic acid (E) change in the QLLQLLQLL sequence motif at the C-terminus) is observed in A. arabiensis (Figure 4b).
To our knowledge, this represents the first report of an amino acid substitution that clearly distinguishes A. arabiensis from A. gambiae s.s. in a gene positioned in an area of an autosome not affected by chromosomal inversion polymorphisms. In fact, extensive genetic exchange of autosomal sequences has been frequently reported between these two largely sympatric sibling species [44, 46, 47]. Since it has been suggested that genes involved in reproductive isolation may be protected against extensive gene flow, it is tempting to speculate on a possible role of AgAcp34A-2 in the reproductive isolation between A. gambiae s.s. and A. arabiensis.
In the other two genes, a higher number of fixed substitutions is observed in those species of the complex characterized by a more restricted geographic distribution: A. merus shows a very high level of differentiation in AgAcp34A-1, A. melas in AgAcp34A-3 and A. quadriannulatus in both genes (Figure 4a, c). As discussed in studies on the molecular evolution of other genes in this species group [21, 41, 42], genetic drift might have been a major force in the diversification of these geographically more restricted species and would thus have determined the fixation of species-specific substitutions (and, therefore, lineage sorting).
It is interesting to note that almost all species-specific substitutions are placed in the C- and N-terminal portions of the secreted peptides: since these are the most strongly differentiated regions among the three paralogs, they are likely to be the least constrained portions of the three proteins. Alternatively, if these substitutions affect protein function, their fixation might be better explained by positive selection.
The selective forces shaping the evolution of these genes cannot be fully clarified here, owing to the above-cited limits on the application of selection inferences in the A. gambiae complex and to the lack of other information to corroborate the possible significance of the observed amino acid replacements (e.g., no conserved structural domains are recognisable in these proteins, nor are any indications of possible interactions with other molecules/proteins available). However, our data provide some clues to the evolutionary forces that may have contributed to the diversification of the paralogs. In all three genes, most polymorphisms occurred at synonymous rather than nonsynonymous sites (Table 2). We can thus hypothesize that purifying selection operates to retain the structure and function of these proteins. Moreover, AgAcp34A-1 and AgAcp34A-2 appear to be more conserved than AgAcp34A-3 (Table 2, Figures 3 and 4). This is particularly evident when comparing the levels of genetic polymorphism in A. gambiae s.s. and A. arabiensis (Table 2): the average π values computed over all sites are comparable for AgAcp34A-1 and AgAcp34A-2, ranging from 0.0 to 0.9%, whereas ~2-3 fold higher values are found for AgAcp34A-3 (1.7% to 2.9%). One of the main findings of genomic studies is that duplicate genes do not evolve symmetrically, i.e. they do not evolve at the same rate, due either to differences in recombination rate or to relaxation of negative selection and/or an increase in positive selection. Since the three genes are located in the same chromosomal area (at least within ~100 kb, the limit of resolution of a tandem duplication by FISH), they should be subject to the same mutation and recombination rates. It is then more plausible to hypothesize a relaxation of negative selection in AgAcp34A-3: in fact, the higher π values found at synonymous versus non-synonymous sites likely indicate that purifying selection is more relaxed than in the other two paralogs.
This implies that the resulting protein could tolerate a higher number of amino acid changes, but that purifying selection still prevents pseudogenization, preserving its function. If both AGAP009370-like proteins were expressed in MAG and transferred to females as part of the mating plug, the observed differences in the selective regime between AgAcp34A-2 and AgAcp34A-3 could be related to different interactions between these factors and proteins expressed in the female lower reproductive tract. In this case, a possible specialization into different (but probably complementary) functions during the post-mating processes can be hypothesized.
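For readers less familiar with the statistic, the π values compared above are estimates of nucleotide diversity, i.e. the average number of nucleotide differences per site between two sequences drawn from the sample. The following is a minimal sketch of that calculation, assuming equal-length aligned haplotypes and pairwise deletion of gapped or ambiguous sites; the function name and the toy sequences are hypothetical, and this is not necessarily the estimator or software used to produce Table 2.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Return pi: mean per-site pairwise difference among aligned sequences.

    Illustrative assumption: sites where either sequence of a pair carries
    a gap or an ambiguous base are skipped for that pair only.
    """
    if len(haplotypes) < 2:
        raise ValueError("at least two sequences are required")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("sequences must be aligned to the same length")

    per_pair = []
    for a, b in combinations(haplotypes, 2):
        compared = 0
        differences = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGT" and y in "ACGT":
                compared += 1
                if x != y:
                    differences += 1
        if compared:  # ignore pairs with no comparable sites
            per_pair.append(differences / compared)

    if not per_pair:
        raise ValueError("no comparable sites in any pair")
    return sum(per_pair) / len(per_pair)


# Toy alignment (hypothetical data, not taken from the study).
toy_haplotypes = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAT",
    "ACGAACGTAT",
]
pi = nucleotide_diversity(toy_haplotypes)
print(f"pi = {pi:.4f} ({pi * 100:.2f}%)")
```

Applying the same function separately to synonymous and nonsynonymous sites would require first classifying codon positions, which is beyond this sketch.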