Arm-specific dynamics of chromosome evolution in malaria mosquitoes

Background The malaria mosquito species of subgenus Cellia have rich inversion polymorphisms that correlate with environmental variables. Polymorphic inversions tend to cluster on the chromosomal arms 2R and 2L but not on X, 3R and 3L in Anopheles gambiae and homologous arms in other species. However, it is unknown whether polymorphic inversions on homologous chromosomal arms of distantly related species from subgenus Cellia nonrandomly share similar sets of genes. It is also unclear if the evolutionary breakage of inversion-poor chromosomal arms is under constraints. Results To gain a better understanding of the arm-specific differences in the rates of genome rearrangements, we compared gene orders and established syntenic relationships among Anopheles gambiae, Anopheles funestus, and Anopheles stephensi. We provided evidence that polymorphic inversions on the 2R arms in these three species nonrandomly captured similar sets of genes. This nonrandom distribution of genes was not only a result of preservation of ancestral gene order but also an outcome of extensive reshuffling of gene orders that created new combinations of homologous genes within independently originated polymorphic inversions. The statistical analysis of distribution of conserved gene orders demonstrated that the autosomal arms differ in their tolerance to generating evolutionary breakpoints. The fastest evolving 2R autosomal arm was enriched with gene blocks conserved between only a pair of species. In contrast, all identified syntenic blocks were preserved on the slowly evolving 3R arm of An. gambiae and on the homologous arms of An. funestus and An. stephensi. Conclusions Our results suggest that natural selection favors specific gene combinations within polymorphic inversions when distant species are exposed to similar environmental pressures. This knowledge could be useful for the discovery of genes responsible for an association of inversion polymorphisms with phenotypic variations in multiple species. Our data support the chromosomal arm specificity in rates of gene order disruption during mosquito evolution. We conclude that the distribution of breakpoint regions is evolutionary conserved on slowly evolving arms and tends to be lineage-specific on rapidly evolving arms.


Background
Despite the growing recognition of the importance of chromosomal inversions for adaptation and evolution of species [1][2][3][4], the evolutionary forces responsible for rearrangement establishment and maintenance remain an enigma of evolutionary biology. Comparative mapping has yielded important insights into patterns and mechanisms of genome rearrangements in plants, mammals, fruit flies, and yeasts [5][6][7][8][9][10]. One of the important findings is that inversions are distributed nonuniformly across different chromosomes. Cytogenetic studies performed on malaria mosquito species of subgenus Cellia provided some of the most obvious examples of the nonuniform inversion distribution [11][12][13][14]. Cellia is the largest subgenus within genus Anopheles and is restricted to the Old World. It includes some of the most important malaria vectors, such as, An. gambiae, An. funestus, and An. stephensi. The polytene chromosome complement of Anopheles female consists of the X chromosome and four autosomal arms: 2R, 2L, 3R, and 3L. A study of the distribution of 82 polymorphic inversions in An. gambiae s.s. has found that the chromosome 2 is the most inversion-rich. It has 77 inversions versus only five inversions on chromosome 3 [13]. In An. funestus and An. stephensi, polymorphic inversions tend to cluster on the chromosomal arms that are homologous to the 2R and 2L arms of An. gambiae [15][16][17][18]. These observations suggest that genome rearrangements have chromosome-specific facilitators or inhibitors. The major consequence of unequal rates of chromosome evolution is a disproportionally large role of genetic content of certain chromosomal arms in adaptation and, possibly, speciation. However, the mechanisms that govern the unequal distribution of rearrangements among chromosomes are poorly understood.
The common inversions 2Rb, 2Rbc, 2Rcu, 2Ru, 2Rd, and 2La of An. gambiae s.s. are widespread in the arid sub-Saharan Africa but almost absent in the humid equatorial Africa [14]. Similarly, multiple inversions on arms 2R and 3R in natural populations of An. funestus are fixed in the southern humid rainforest area of Cameroon and decrease in frequency going northwards, with their complete absence in the northernmost dry Sahelian savannas [19]. Chromosomal studies of the Asian malaria vector An. stephensi reveal striking differences in the kinds and frequencies of paracentric inversions on 2R and 3L between rural and urban populations, especially with respect to the common 2Rb inversion [16]. These observations suggest that the genes on the 2R and 2L arms of An. gambiae and on the homologous arms of An. funestus and An. stephensi are more sensitive to variation in the environment experienced by these species and thus show evidence of selective response to environmental pressures. More recent studies provided ecological evidence that sympatric species An. gambiae and An. funestus inhabit a wide range of the same ecoclimatic settings in Cameroon [20,21], suggesting an intriguing possibility that polymorphic inversions capture identical sets of genes in different species and, thus, confer similar ecological adaptations. It has been proposed that inversions 2st of D. buzzatii and In(3R)Payne of D. melanogaster have common genes because both inversions are associated with body size [22]. However, the nonrandom occurrence of similar sets of genes within polymorphic inversions of different species has not yet been demonstrated. The presence of common genes within inversions of homologous chromosomal arms would indicate that natural selection favors certain adaptive gene combinations when different species are exposed to similar environmental variables.
Polytene chromosomes, which are present in many species of Anopheles, provide an opportunity to develop high-resolution physical maps and to study genome organization and evolution in malaria mosquitoes. We previously reported a 1-Mb resolution physical map for An. stephensi and analyzed the evolutionary dynamics of chromosomal inversions in subgenus Cellia [11]. The study has shown that despite the paucity of inversion polymorphisms on the X chromosome, this chromosome has the fastest rate of inversion fixation, while the 2R arm has the highest inversion fixation rate among autosomes. The results have also indicated that the rapidly and slowly evolving chromosomal arms have different genome landscapes characterized by distinctly enriched gene subpopulations and classes of repetitive DNA. Although the propensity of chromosomes to rearrangements seems to play a major role in the rates of inversion origin, negative and positive selections may differentially control the establishment and maintenance of polymorphic inversions on different chromosomal arms. For example, chromosome arms 3R and 3L of An. gambiae and homologous arms in other mosquito species have the lowest rates of inversion polymorphism and fixation [11,13,16,18]. The low tolerance of a chromosome to gene order disruption could contribute to the low rate of inversion establishment. Large blocks of genes that are conserved in the evolution of several species may represent functionally important combinations that are maintained together by natural selection [23] and/or genomic fragments devoid of evolutionary fragile regions [24]. Comparative analysis of gene order preservation and disruption across species can determine possible differences in the chromosomal arm tolerance to rearrangements.
Here, we report evolutionary insights from a comparative physical mapping among three species of malaria mosquitoes. We demonstrated that polymorphic inversions on the 2R arm, involved in environmental adaptations in these three species, nonrandomly share similar sets of genes. These results suggest that distantly related species acquire parallel adaptations through capturing common genes by independent polymorphic inversions. We also found that the chromosomal arms differ in their tolerance to gene order disruption during mosquito evolution. The distribution of breakpoint regions is evolutionary conserved on slowly evolving arms and tends to be lineage-specific on rapidly evolving arms. The arm-specific tolerance to chromosomal breakage could be responsible for the nonuniform establishment of inversions.

Results
Inversion distances among An. stephensi, An. gambiae, and An. funestus To avoid a lineage-specific bias in pair-wise analyses of gene orders, we estimated the chromosomal divergence among the mosquito species. Three-way comparative mapping can be very efficient in determining inversion distances among species. The cDNA and BAC clones physically and in silico mapped to polytene chromosomes of An. funestus, An. stephensi, and An. gambiae [11] were used to identify conserved gene orders among the three species (Additional file 1). The comparison of the physical maps of these species has identified the whole-arm translocations and paracentric inversions and detected no pericentric inversions or partial-arm translocations (Figures 1, 2, 3 and 4). Therefore we were able to determine the arm homology among species [11]. Accordingly, chromosome X (Additional file 2) and arm 2R (Figure 1) are homologous across all three species. The 2L arm of An. gambiae corresponds to the 3R of An. funestus and the 3L of An. stephensi ( Figure 2). The 2L arm of An. funestus corresponds to the 3R arms of An. gambiae and An. stephensi ( Figure  3). The 2L arm of An. stephensi corresponds to the 3L arms of An. funestus and An. gambiae ( Figure 4). We have calculated inversion distances among An. stephensi, An. gambiae, and An. funestus based on locations of 87 common autosomal DNA markers in all three species (Additional file 3). This comparison has been done at the~2.42-Mb level of resolution using the Multiple Genome Rearrangements (MRG) program (signed option) [25]. The MGR program has estimated 51 fixed inversions between An. stephensi and An. gambiae, 54 fixed inversions between An. stephensi and An. funestus, and 50 fixed inversions between An. gambiae and An. funestus. We also used the Genome Rearrangements In Man and Mouse (GRIMM) program without assuming directionality of the markers (unsigned option) to perform a pair-wise analysis of rearrangements [26]. The GRIMM program calculated 30 fixed inversions between An. stephensi and An. gambiae, 35 fixed inversions between An. stephensi and An. funestus, and 34 fixed inversions between An. gambiae and An. funestus. These data indicate that the three species have approximately equal chromosomal divergence from each other.
Presence of similar sets of genes in polymorphic inversions of An. gambiae, An. stephensi, and An. funestus The presence of common genes within inversions of homologous chromosomal arms could indicate that natural selection favors certain adaptive gene combinations when different species are exposed to similar environments. We tested for the presence or absence of physically and in silico mapped cDNA and BAC clones, which contained genes, in common polymorphic inversions of three mosquito species [14,16,18,27] at~1-Mb level of resolution (Figures 1, 2, 3 and 4). In the previous study, we performed a test on the uniformity of the marker distribution across the chromosomes in An. gambiae, An. stephensi, and An. funestus using the Χ 2 statistic [11]. The distribution of the DNA markers was shown to be uniform for each arm and each species. The observed number of shared genes in polymorphic inversions of An. stephensi and An. gambiae (Additional file 4), as well as of An. funestus and An. gambiae (Additional file 5), were compared to those that would be expected under pure chance. Under the hypothesis that the genes are distributed due to pure chance with respect to polymorphic inversions and to each other, identical markers would be randomly distributed across a pair of chromosome arms from different species. Our results rejected this hypothesis as we found cases of nonrandom clustering of markers within polymorphic inversions in different species. Figure 5 shows the heat plots for the test statistic: (O i,j -E i,j ) 2 /E i,j , which demonstrate the difference between the observed and expected number of shared markers in each inversion of An. gambiae and An. stephensi, as well as An. gambiae and An. funestus. Simulated p-values were computed from Monte Carlo simulated distributions, based on our test statistic, by considering the number of simulated replicates which were larger than the observed statistics ( Table 1). Additional file 6 shows the probabilities that the intensity rate exceeds one for shared genes between An. stephensi and An. gambiae. Figure 5a, b shows the corresponding intensity heat map (based on the test statistic (O i,j -E i,j ) 2 /E i,j , which aids in visually assessing the locations of shared hot and cold spots. We define g i,j as the estimated factor of increased gene sharage, over what we would expect at random (See Methods for details). Inferred values where g i,j = 1 suggest the shared polymorphism is in line with what we might expect to see at random. On the 2R arm, we observed that marker pairs 2Rf (in An. stephensi) and 2Ru (in An. gambiae) correspond to an activity hotspot (indicated by the lightyellow color in Figure 5a) with posterior probability Pr (g i,j > 1|Data) = 1. The level of increase over expectation is dramatic (6.23 times). Similarly, additional file 6, accompanied by Figure 5 c, d, details results for shared polymorphic inversions between An. funestus and An. gambiae. Again we observed on the 2R arm that marker pairs between 2Rt (in An. funestus) and 2Ru (in An. gambiae) determine a hotspot of shared polymorphisms (Pr(g i,j > 1|Data) = 1, 6.22 times). Another example of nonrandomly shared genes is the small 2Rb inversion of An. gambiae (Figure 6a) and the overlapping inversions 2Rd and 2Rh of An. funestus (Figures 1 and 5c). The large 2La inversion is the only common inversion in An. gambiae on this arm (Figure 6b). We found from little to no co-occurrence of the same markers in this inversion and polymorphic inversions on 3R of An. funestus and 3L of An. stephensi (less yellow and more orange boxes) (Figures 2 and 5b, d). The results provide evidence that several polymorphic inversions at least on the 2R arm of An. gambiae nonrandomly share gene combinations with inversions of An. stephensi and An. funestus.    It is important to note that chromosomal arms 2R and 2L of An. gambiae had the most rapid reshuffling of gene order in evolution among autosomes. The mean length of conserved gene blocks is 1.3 Mb on 2R and 1.7 Mb on 2L (Figures 1 and 2). These lengths are much smaller than the sizes of the common polymorphic inversions: 12.5 Mb (2Rj, 813 genes) [28] [29], and 21.6 Mb (2La, 1,281 genes) (Figure 6b) [30]. In some cases, the presence of similar sets of genes within inversions in different species could be explained by preservation of ancestral gene orders. For example, all three markers mapped to inversion 2Ru of An. gambiae and inversion 2Rf of An. stephensi formed a conserved block ( Figure 1). However, many of the common markers did not belong to conserved gene blocks. Moreover, we observed the reshuffled positions of small conserved blocks within polymorphic inversions in different species (e.g., inversions 2Rb of An. gambiae, 2Re of An. stephensi, and 2Rd/2Rh of An. funestus in Figure 1). Therefore, colocalization of the common markers within polymorphic inversions is not only a result of the ancestral gene order but also a result of new genes combinations independently originated in different species.  To support the conclusion that gene sharing is more common in inversions than outside inversions, we have formed a metric for comparing the outside regions to the inverted regions. Table 2 shows the ratio of average levels of gene sharing (as compared to expected under the null model), between the outside regions relative to the inversions. We notice that for all comparisons the outside regions have less shared genes than inversions, as shown by the expected posterior assessments of the ratios being less than 1. For all except the An. gambiae/An. funestus 2L/3R comparisons, 95% posterior credible intervals are bounded above by 1, which demonstrate these results are significant. For An. gambiae/An. funestus 2L/3R, the 95% interval has strong overlap with 1, which demonstrates the level of connectivity (as compared to expected) is similar between inverted and outside regions.

Conservation and disruption of gene order in mosquito evolution
To determine possible differences among chromosomal arms in the tolerance to generating evolutionary breakpoints, we performed comparative analysis of gene order preservation across species. Conserved gene blocks were defined as regions with the same order and distance between at least two markers in two or more different species. For example, there were small conserved blocks (~1 Mb) on arm 2R ( Figure 1) as well as large conserved blocks (up to 6-8 Mb) on arms 3R and 3L of An. gambiae (Figures 3 and 4). In this study, we identified two types of conserved blocks of genes: (i) blocks that were conserved among all three species (fully conserved blocks) and (ii) blocks that were conserved between two species, but were disrupted in the third species (partially conserved blocks). The largest fully conserved and disrupted gene blocks are shown in Figure 3 and Figure 4, respectively. For each of the four autosomal arms, we counted blocks that were conserved between only An. gambiae and An. funestus, An. gambiae and An. stephensi, and those that were conserved among all three species simultaneously. Given the most recent common ancestor between each of our species of interest, disrupted blocks have accumulated through time, accounting the current level of block disruption in our sample (see Additional file 7 for a schematic visualization of this process). We focus our attention on two primary features of the blocks: the number and length of such blocks. The level of block disruption was inferred using a compound Poisson process, in which the number of conserved blocks follows a Poisson distribution and the length of the chromosomal arms scales the rate of the process (see Methods for details). While we were primarily interested in the level of block disruption, and how it differs between groups, we explicitly model levels for fully and partially conserved blocks (conserved+disrupted), since the counts are higher for these sets of blocks, and yield better estimation properties. The rate parameters {l j , g j } measure the abundance or block counts, and lengths, respectively on arm j {2R, 2L, 3R, 3L}.
We denote the fitted compound Poisson process rate parameters as {l j (c+d) , g j (c+d) } and {l j (c) , g j (c) }, where the superscripts (c+d) and (c) denote whether the rates were fitted to conserved+disrupted or conserved blocks.
The difference between these sets of rates (l j (diff) =l j (c+d) -l j (c) and g j (diff) = g j (c+d) -g j (c) ) models the rates governing  the disrupted blocks, which were then used to compare the levels of block disruption between groups. We report these rate parameters in conjunction with the combined summary (l j (diff) g j (diff) )/L j , where L j is the total length of arm j. This combined summary has the interpretation of blocks per region length per total length (See Methods for further explanation). For both fitted processes, posterior summaries are displayed in Table 3. We demonstrated that the rate of accumulation of conserved and conserved +disrupted blocks was mildly higher for arms 2R and 3L. However, the lengths of the blocks were dramatically smaller on 2R than those found on other arms. For inferences of disrupted blocks, we considered the difference of these parameter pairings l j (diff) = l j (c) -l j (c+d) and (g j -1 ) (diff) = (g j (c) ) -1 -(g j (c+d) ) -1 . Strong overlap with zero, in each of the above parameter differences, indicated a negligible disruption rate. On the other hand, highly negative values indicated that the rates for the disrupted blocks were less than those for the conserved blocks. Conversely, large values suggested higher rates for the disrupted blocks. The analysis revealed that the 2R arm has the highest rate of accumulation of disrupted blocks per unit length, l j (diff) , with a probability equal to 0.905 (Figure 7, Table 3). The effects for the other arms were less pronounced. Moreover, the 3R arm of An. gambiae and the homologous arms of the other species had the lowest l j (diff) value with a probability of 0.5, and all identified gene blocks in this arm were preserved among An. gambiae, An. funestus, and An. stephensi. The data suggest that this chromosomal arm possesses evolutionary conserved breakpoint clusters and has low tolerance to generating new breakpoints.

Parallel evolution of adaptive inversion polymorphisms in mosquito species
Natural selection seems to play a role in maintaining inversion polymorphisms on 2R and 2L of An. gambiae and their homologous arms in An. stephensi and An.
funestus [14,16,17]. Our previous study has shown that chromosomal arms rich in polymorphic inversions (2R, 2L) have higher gene densities [11]. This observation confirmed the assumption that an inversion with fewer genes would have a smaller selective advantage [31]. Moreover, the study of gene ontology terms has provided evidence that 2L is enriched with genes involved in the structural integrity of a cuticle, while the 2R arm has an overrepresentation of genes involved in cellular response to stress (e.g., temperature, humidity) [11]. These data strongly support the role of natural selection in maintaining polymorphic inversions associated with adaptation of mosquitoes to the dry environment [14].
A study of larval ecology demonstrated that sympatric species An. gambiae and An. funestus inhabit a wide range of the same ecological settings in Cameroon [20]. We hypothesized that if polymorphic inversions in the Table 3 Expected parameter values and their associated 95% credible intervals shown for both conserved and disrupted blocks of each arm of An. gambiae*  distant species confer the same adaptations than similar sets of genes can be present within inversions of these species. The presence of similar sets of genes within independent inversions would imply the role of natural selection acting on similar genetic content of homologous chromosomal arms and creating parallel phenotypes of the evolutionary distant species. A theoretical model suggests that the probability of parallel evolution under natural selection is about two times bigger than that under neutrality [32]. Our results demonstrated that inversions on 2L of An. gambiae, 3L of An. stephensi, and 3R of An. funestus have almost random sets of genes (Figure 5b, d). However, we found that the several 2R inversions in An. gambiae, An. stephensi, and An. funestus do share common genes. The 2Rf inversion of An. stephensi has an increased frequency in the urban environment [16] and nonrandomly share common genes with overlapping inversions 2Rc, 2Rd, 2Rbk, and 2Ru of An. gambiae. Another "urban" inversion in An. stephensi, 2Rb, had a gene homology to the inversion 2Rc of An. gambiae (Figure 5a). In contrast, the 2Re inversion of An. stephensi has an increased frequency in the rural environment [16] and nonrandomly shares common genes with inversion 2Rb of An. gambiae and overlapping inversions 2Rd/2Rh of An. funestus. This nonrandom distribution of markers is not only the result of preservation of ancestral gene order. In fact, we observed cases with extensively reshuffled gene orders within independently originated polymorphic inversions ( Figure 1).
The gene shuffling is common in malaria mosquitoes [11], and these species are phylogenetically distant enough from each other [33,34] to have independently originated polymorphic inversions, which differ in chromosomal positions and size. The nonrandom presence of homologous genes within inversion 2Rb of An. gambiae and inversion 2Rh of An. funestus is especially interesting in the light of ecological adaptations associated with these inversions. The high frequency of the 2Rb inversion of An. gambiae has been found strongly associated with increased degree of aridity. In contrast, the low frequencies of this inversion have been recorded in humid areas [35]. The 2Rh inversion of An. funestus has a similar (although reverse) pattern of association with aridity. Correlation of the 2Rh inversion with the higher vapor pressure has been demonstrated the strongest among all studied inversions of An. funestus. In contrast, the standard (2Rh+) arrangement has been found associated with the lower vapor pressure [21]. Thus, it is likely that natural selection favors adaptive gene combinations within polymorphic inversions on 2R when distantly related species are exposed to similar environmental pressures. The availability of these gene complexes would support long-term maintenance of polymorphic inversions. This knowledge could be useful for the discovery of genes responsible for an association of inversion polymorphisms with phenotypic variations in multiple species. If candidate genes were indentified within a polymorphic inversion in one species, the orthologous genes in another species likely play a similar role in adaptation if they are captured by a polymorphic inversion involved in the parallel adaptation. Future studies should identify specific alleles associated with parallel adaptation of species of subgenus Cellia.

Arm-specific tolerance to disruption of gene order
We hypothesized that the arm-specific tolerance to chromosomal breakage could be responsible for the nonuniform distribution of inversions in autosomes. The comparative analysis of conserved and disrupted gene blocks in chromosomal arms across the three species provided evidence that 2R is more tolerant to disrupting gene orders and generating new evolutionary breakpoints than other arms (Figure 7). We observed that if a block on 2R was conserved between two mosquito species it was likely disrupted in the third species (Figure 1). In contrast, all identified gene blocks remain preserved on the 3R arm of An. gambiae and the homologous arms of the other species suggesting the existence of arm-specific constraints to breakage (Figure 4). These constraints could be controlled by negative selection acting against disruption of certain gene combinations. It is possible that slowly and rapidly evolving chromosomes may differ in sizes or abundance of coregulated gene clusters. Purifying selection against genomic rearrangements may preserve physical colocalization of coexpression clusters [23]. For example, clusters of genes deregulated in trx mutant D. melanogaster larvae are not uniformly distributed along the genome; 60% of them are located on chromosome 3L [36], which is a slowly-evolving arm in Drosophila [8,37]. Physically clustered genes may have shared regulatory regions, common expression pattern, and chromatin-level regulation, or they may represent clusters of essential genes [38]. Additionally, conservation of gene order in certain regions of the genome has been explained by long-range gene regulation [39]. If the 3R arm of An. gambiae is enriched in large functional gene clusters, then generating inversions on this arm or inserting a transgene into 3R will likely have a negative effect on mosquito fitness. Alternatively, the location of breakpoint clusters in only specific chromosomal regions of the 3R arm of An. gambiae and the homologous arms of the other species could be responsible for the preservation of the gene order in evolution. Indeed, a recent study has shown that fragility of certain regions rather than functional constraints plays the main role in nonuniform distribution of inversions in Drosophila chromosomes [24]. If this is the case in mosquitoes, than the distribution of breakpoints is lineage-specific on 2R and evolutionary conserved on the 3R arm of An. gambiae and the homologous arms of the other species. Our previous study has found that the 2R arm has the highest density of regions involved in segmental duplications that clustered in the breakpoint-rich zone of the arm. In contrast, the slower evolving 2L, 3R, and 3L, arms were enriched with matrix-attachment regions [11]. Future analyses of the genome sequence in different species will shed light on the exact mechanism of breakage in each individual chromosomal arm. Regardless of the mechanism, it is clear that new rearrangement breaks are more easily allowable on 2R than on other arms in different mosquito lineages, thus, contributing to the arm-specific differences in rates of chromosomal evolution in Anopheles.

Conclusions
Our study demonstrated that polymorphic inversions on the 2R arm nonrandomly captured similar sets of genes in An. gambiae, An. funestus, and An. stephensi. This finding suggests that natural selection favors specific gene combinations within polymorphic inversions when distant species are exposed to similar environmental pressures. This knowledge could be useful for the discovery of genes responsible for an association of inversion polymorphisms with ecological adaptations in multiple species. We also found that the autosomal arms differ in their tolerance to disruption of syntenic blocks during mosquito evolution. The arm 2R has the highest level of inversion fixation among autosomal arms and the highest tolerance to disruption of syntenic blocks in evolution of subgenus Cellia. In contrast, the 3R arm of An. gambiae and homologous arms of the other species have few inversions and the lowest tolerance to disruption of syntenic blocks. All syntenic blocks found on this arm in two mosquito species were also preserved in the third species. Therefore, the distribution of breakpoint regions tends to be evolutionary conserved on slowly evolving arms and lineage-specific on rapidly evolving arms. These data support the chromosomal arm specificity in rates of gene order disruption during mosquito evolution.

Calculation of rearrangement distances
The sequences of cDNA and BAC clones physically mapped to polytene chromosomes of An. funestus and An. stephensi were used to identify homologous sequences in the An. gambiae genome through BLASTN and BLASTX algorithms available at VectorBase [40] in our previous study [11]. The MGR [25] and GRIMM [26] programs were utilized to calculate inversion distances among An. gambiae, An. stephensi, and An. funestus (Additional file 3). The MGR program is available at http://www.cs.ucsd.edu/groups/bioinformatics/ MGR. The signed option of the MGR program was used. This program implements an algorithm that minimizes the sum of the rearrangements over all the edges of the phylogenetic tree [25]. The GRIMM program (unsigned option) was used to perform a pair-wise analysis of rearrangements [26]. GRIMM software uses the Hannenhalli and Pevzner algorithms for computing the minimum number of rearrangement events and for finding optimal scenarios for transforming one genome into another (http://grimm.ucsd.edu/GRIMM/). These algorithms use gene order information to estimate rearrangement distances.

Chromosome preparation and visualization of inversions
Ovaries from the An. gambiae half-gravid females prefixed in Carnoy's fixative solution were dissected in 50% propionic acid under a Leica MZ6 dissection microscope (Leica Microsystems GmbH, Wetzlar, Germany). A cover slide was placed on the follicles and pressed to squash the cells. The banding pattern of polytene chromosomes was examined using Olympus CX-41 phasecontrast microscope (x1000) (Olympus America Inc., Melville, NY, USA). Slides with good chromosomal preparations were dipped in liquid nitrogen, then cover slips were removed and slides were dehydrated in 50%, 70%, 95%, and 100% ethanol. Preparations with polymorphic inversions were imaged with an Olympus Q-Color 5 camera and Q-Imaging software (Olympus America Inc., Melville, NY, USA). The chromosomal inversions were identified using a standard cytogenetic map for An. gambiae [41].

Analysis of presence of common markers within polymorphic inversions of distant species
To assess the functional impact of shared genes on the genomic inversions, we begin by assessing the rate of accumulation of genes in inverted regions. If these genes have no bearing on inversions, we would expect the density of the genes located in these regions to be uniformly distributed throughout the genome. Our analysis centers on identifying regions of the genome that exhibit hot and cold spots for interaction. To test independence of markers, we compared the observed number of shared genes, in inverted regions, to those that would be expected under pure chance. Under the hypothesis that the genes are distributed due to pure chance, identically classified genes would be uniformly and independently distributed across a pair of chromosome arms (each from different species, which share the same genes).
Hence, if the inverted region, on each chromosome, accounts for the fraction p (1) i , and p (2) j of the chromosome in species 1 and 2, respectively, for shared genes (i,j), then the probability that genes would be shared strictly by chance isp (1) i p (2) j . Therefore, given N total homologous gene markers, the expected number of shared genes, under pure chance, is For each inverted region, indexed by shared genes (i,j), (O i,j -E i,j ) 2 /E i,j represents a discrepancy between the expected and observed number (O i,j ) of shared genes. The statistic has an c 2 distribution when the sample size (N) is large. We use this framework to test if the genes are distributed in a manner consistent with the null hypothesis. While the test statistic has an asymptotic c 2 distribution, the small sample sizes may yield inaccurate p-values. To simulate the distribution, we reposition each gene pair (i,j) on each of the corresponding gene arms, uniformly on each arm. For each stochastic realization, we count (O (s) i,j ) the number of (stochastically repositioned) genes that fall in the regions corresponding to gene pair (i,j). Summing the test statistic ((O (s) i,j − E i,j ) 2 /E i,j ) over each gene pair yields a random draw from the null distribution. We approximate the full distribution using 10,000 random draws and compare our test statistic, based on the observed quantities (O i,j , for all (i,j)-pairs). The fraction of random draws which exceed our observed test statistic closely approximates the true p-value, without any asymptotic assumptions.
To further study the observed arrangement of shared polymorphisms, we estimate the multiplicative rate at which gene sharing (for each (i,j)-pair) occurs above or below which is explainable at random. Below, we describe a Bayesian model for estimating these rates. Under the hypothesis that these regions arise uniformly and independently, the expected number of such shared genes is where p (1) i , and p (2) j are the respective fraction of the total genome length for species 1 and 2. Again, N represents the total number of homologous gene markers.
We use a Binomial sampling model, with mean ρ i,j = (λ i,j + μ i,j )Np (1) i p (2) j , for measuring the abundance of shared genes (i,j), between species.
In the main text, we let (g i,j = l i,j + μ i,j ) for ease of explanation. We explain the parameters in this model below. We let l i,j > 0 be a unit-less intensity parameter measuring the over/under abundance of shared genes, based solely on the interdependence of genes (i,j), and let μ i,j correspond to an additional random effect for the shared polymorphism being additionally hot (μi,j > 0), additionally cold (μi,j > 0), or neutral (μ i,j = 0). Explicitly, l i,j models the average shared gene intensity, and μ i,j models any additional hot or cold intensity.
Since shared inversion regions are not necessarily independent (i.e. l i,j and l i,j , both depend on gene j in species 2), we factor the gene specific rate of influence by each region, and let where λ The random effects term μ i,j allows for additional variation in the intensity of shared polymorphisms, which can relate to extra cold (μ i,j < 0), or hot (μ i,j > 0) regions. This term accounts for the increase/decrease in shared polymorphisms, not accounted for by the averaging over gene intensities given by l i,j . The random effect model, for additional hot or cold intensities, follows as: where δ(condition) is an indicator function, taking on the value 1 if the condition is true, and 0 otherwise. N (μ i,j |μ H ,σ)δ({μ i,j ,μ H } > 0) and N(μ i,j |μ C ,σ)δ({μ i,j ,μ C } < 0) represent Normal density functions, constrained to be positive or negative, depending on if the shared polymorphisms is exceedingly hot or cold. Under our scenario, mixing weights (p 0 , p H , p c ) are unknown, as are the mean hot and cold temperatures: μ H and μ C , respectively.
Naturally, p 0 + p H + p C = 1. While this model has a rich set of parameters, the overall model is identified due to the strong joint dependency in the parameter set. That is, λ (1) i and λ (2) j are identified by the average gene intensity, in species 1 and 2, respectively. The μ i,j 's are identified by the extreme over and under distribution of shared polymorphisms which are not explained by λ i,j = λ (1) i λ (2) j . Given observed gene counts (O i,j ) for each shared inversion region, the likelihood function follows as: where ρ i,j = (λ (1) i λ (2) j + μ i,j )Np (1) i p (2) j , and Δ and O represent the matrices of intensity parameters, and associated counts of shared genes, for all (i,j) pairs. To estimate the unknown intensity parameters, we rely on a Bayesian inferential framework. Priors were selected as marginal reference priors. We describe these as follows.
For each l i,j we let: .
We note that this prior is a global measure on the probability of observing shared polymorphisms between genes (i,j), given the fractional gene lengths p (1) i , p (2) j . Such a structure will have nearly optimal frequency coverage properties (see [42] for details). For the mean hot and cold temperatures in the random effects compartment of the model, we specify p(μ H ) = p(μ C )∝1. For the mixing weights, we let p(p 0 ) = 0.99, and p(p H ) = (p C ) = 0.005. We have chosen this structure so that only very large deviations from the gene neutral model (μ i,j = 0) are declared hot or cold.
We focus on interpreting the joint quantities g i,j = l i,j + μ i,j , through their proximity to 1. That is, values of g i,j = l i,j + μ i,j close to 1 will support that genes are shared at random, for region (i,j). On the other hand, very high (and low) values will demonstrate and over (and under) accumulation of such genes. The relative over (or under) connectedness of individual genes can easily be made by considering the ratios λ Parameters were estimated using Markov Chain Monte Carlo (MCMC) [43]. Chains were burned in for 1,000,000 iterations, and used an additional 100,000 samples for estimation. Visual assessments of trace plots showed the chains had all reached stationarity. Multiple runs were made, starting from various distant locations in order to ensure convergence.
To analyze the level of gene sharing outside inverted regions, relative to that inside the inversion regions, we consider the average rate of connectivity (as compared to expected) of outside regions: (2) j / ∈out γ out,j + γ out,out N (1) where g i,out , g out,j , and g out,out denote the rates corresponding to outside the inversion regions on their respective arms. N (1) and N (2) denote the number of inverted regions on their respective arms. The sums are explicitly indexed over regions inside inversion regions which have connections between all outside regions. The average rate of connectivity (as compared to expected) within the inverted regions follows as: By taking the ratio of these average rates: we obtain a measure of how under (or over) connected the regions outside the inversions are compared to those within the inverted regions. Ratios near 1, demonstrate similar levels of connectivity, whereas under connectivity in the outside regions are demonstrated by ratios <1, and over connectivity in the outside regions are shown by ratios >1. Bayesian posterior assessments of these ratios are easily computed given our MCMC samples corresponding to each gene pair.

Analysis of the rates of gene order disruption
We identified two types of conserved blocks of genes: (i) blocks that were conserved among all three species (fully conserved blocks) and (ii) blocks that were conserved between two species but were disrupted in the third species (partially conserved blocks). To analyze the rate of gene order disruption, between An. stephensi and An. funestus, with An. gambiae, it is necessary to consider the process that describes such genetic disruptions. Since these three separated species were descendants of some common species, at some point in time, each species had complete preservation of gene order with respect to the other two species. We denote this time by t 0 = 0. This time represents the most recent time for which all three lineages can be mapped back to their most recent common ancestor. Through time, after these three species speciated, each of their respective genomes evolved at different rates, which resulted in sets of disruptions between the species groups. For discrete time points (t 0 <t 1 < ··· <t c ) (t c is the current state of time), these disruptions appeared, resulting in a lower level of arm-specific preservation of gene order. This process of disruption accumulation is illustrated in Additional file 7, from which we observed that through time the length of the conserved blocks decrease with increased frequency. While we do not get to see this progression through time, we do observe the level of conservation at time t c . If this process were allowed to continue for an infinite amount of time, all blocks would eventually be disrupted. We note that our schematic only illustrates the basic idea, since inversions typically alter the location of the conserved blocks. In this analysis, we are concerned with how this process changes for each of the chromosome arms:{2R, 2L, 3R, 3L}. Since the disrupted blocks are accumulating through time, we account for both the number of conserved blocks and the length of each block at the current time (t c ). We model the number and length of blocks using a compound Poisson process, where the number of conserved blocks follows a Poisson process and where the length of the chromosomal arm scales the rate of the process; a separate process governs each conserved block's length. Formally, for each arm j {2R, 2L, 3R, 3L}, with total chromosome length L j , we model this process by Where the number of observed blocks (N(L j )) follows a Poisson process, and b i,j denotes each individual block length and R j denotes the total length of conserved blocks on arm j (either fully conserved blocks, partially conserved blocks or both simultaneously), which is computed by the sum of individual block lengths on arm j. Specifically, we let Pr(N(L j ) = n) = e λ j L j (λ j L j ) n n b i,j ∼ exp(γ j /L j ), where the arm-specific parameters l j and g j define rates for the counting and length process, respectively. For the counting process, we have that E[N(L j )] = λ j L j , for the block length process, g j is interpreted as the per-unit-length rate of accumulation of blocks. Hence, larger values of g j indicate longer block lengths, whereas, for the block counting process, larger values of l j will correspond to a higher quantity of blocks. It should be noted that the process is conditioned on the length of each chromosome (L j ), which induces conditional independence. However, marginally N(L j ), and b i,j are not assumed to be independent.
The individual lengths b i,j are scaled to the total arm length, so that E[b i,j ] = L j /γ j models the fraction of each arm that is occupied by a block. Alternatively, the length of all conserved blocks follows the distribution so that E[R j ] = N(L j )L j /g j . We consider the simple summary of the compound process that has the arm-specific interpretation of blocks per region length per total length.
On its own, the number of blocks per region length shows the density of conserved blocks within the conserved portion of each arm. Because each arm is of different length, we summarize our inferences by scaling this density to the total length of each arm. High values correspond to short blocks and/or high counts. Specifically, which rate is high (counts or lengths) is indistinguishable for our purposes. Hence, when summarizing results, it is convenient to think about l j , g j , and the above equation together. Our primary question is if the level of conserved disruptions accumulates at a rate different from that for fully conserved blocks. For this, we fit the process on both fully conserved and partially conserved blocks, and those pertaining to only fully conserved blocks. The parameter sets associated with each of the respective processes are denoted (λ (c+d) j , γ (c+d) j ) and (λ (c) j , γ (c) j ) representing the partially conserved and fully conserved processes. Other parameterizations are justifiable; however, due to the sparsity in disrupted blocks on some arms, we prefer this parameterization. For both processes, posterior summaries (rate expectations and 95% credible intervals) are displayed in Table 3.

Additional material
Additional file 1: Physically and in silico mapped DNA markers in the An. gambiae, An. funestus, and An. stephensi genomes. The markers used for calculating inversion distances among An. stephensi, An. gambiae, and An. funestus are highlighted by yellow on 2R, green on 2L, teal on 3R and gray on 3L arms of An. gambiae. The genomic coordinates of polymorphic inversions are show for 2La and 2Rj in bold black, for 2Rb in green bold font, for 2Rc in red bold font, for 2Ru in blue bold font. The genomic coordinates of overlapping polymorphic inversions are show for 2Rd by underline and for 2Rbk in italic font.