### Analysis of presence of common markers within polymorphic inversions of distant species

To assess the functional impact of shared genes on the genomic inversions, we begin by assessing the rate of accumulation of genes in inverted regions. If these genes have no bearing on inversions, we would expect the density of the genes located in these regions to be uniformly distributed throughout the genome. Our analysis centers on identifying regions of the genome that exhibit hot and cold spots for interaction. To test independence of markers, we compared the observed number of shared genes, in inverted regions, to those that would be expected under pure chance. Under the hypothesis that the genes are distributed due to pure chance, identically classified genes would be uniformly and independently distributed across a pair of chromosome arms (each from different species, which share the same genes).

Hence, if the inverted region, on each chromosome, accounts for the fraction

and

of the chromosome in species 1 and 2, respectively, for shared genes (

*i,j*), then the probability that genes would be shared strictly by chance is

. Therefore, given

*N* total homologous gene markers, the expected number of shared genes, under pure chance, is

.

For each inverted region, indexed by shared genes (

*i,j*),(

*O*
_{
i,j
}
*- E*
_{
i,j
})

^{2}/

*E*
_{
i,j
}represents a discrepancy between the expected and observed number (O

_{
i,j
}) of shared genes. The statistic

has an *χ*
^{2} distribution when the sample size (*N*) is large. We use this framework to test if the genes are distributed in a manner consistent with the null hypothesis. While the test statistic has an asymptotic *χ*
^{2} distribution, the small sample sizes may yield inaccurate p-values. To simulate the distribution, we reposition each gene pair (*i,j*) on each of the corresponding gene arms, uniformly on each arm. For each stochastic realization, we count (
) the number of (stochastically repositioned) genes that fall in the regions corresponding to gene pair (*i,j*). Summing the test statistic (
) over each gene pair yields a random draw from the null distribution. We approximate the full distribution using 10,000 random draws and compare our test statistic, based on the observed quantities (O_{
i,j
}, for all (*i,j*)-pairs). The fraction of random draws which exceed our observed test statistic closely approximates the true p-value, without any asymptotic assumptions.

To further study the observed arrangement of shared polymorphisms, we estimate the multiplicative rate at which gene sharing (for each (

*i,j*)-pair) occurs above or below which is explainable at random. Below, we describe a Bayesian model for estimating these rates. Under the hypothesis that these regions arise uniformly and independently, the expected number of such shared genes is

where
and
are the respective fraction of the total genome length for species 1 and 2. Again, *N* represents the total number of homologous gene markers.

We use a Binomial sampling model, with mean
, for measuring the abundance of shared genes (*i,j*), between species.

In the main text, we let (*γ*
_{
i,j
}= *λ*
_{
i,j
}+ *μ*
_{
i,j
}) for ease of explanation. We explain the parameters in this model below. We let *λ*
_{
i,j
}> 0 be a unit-less intensity parameter measuring the over/under abundance of shared genes, based solely on the interdependence of genes (*i,j*), and let *μ*
_{
i,j
}correspond to an additional random effect for the shared polymorphism being additionally hot (*μi,j* > 0), additionally cold (*μi,j* > 0), or neutral (*μ*
_{
i,j
}= 0). Explicitly, *λ*
_{
i,j
}models the average shared gene intensity, and *μ*
_{i,j} models any additional hot or cold intensity.

Since shared inversion regions are not necessarily independent (i.e.

*λ*
_{
i,j
}and

*λ*
_{
i,j
}, both depend on gene

*j* in species 2), we factor the gene specific rate of influence by each region, and let

where
and
are marginal intensities corresponding to genes *i* and *j*, in their respective species. To create an identifiable model, we arbitrarily set
. Hence, interpretation of the marginal intensity of gene connectivity is made solely through ratios of the form
(*k*,*k*' ∈ {1,2}), which indicate the relative intensity of gene *i* (in species *k*), to gene *j* (in species *k*').

The random effects term *μ*
_{
i,j
}allows for additional variation in the intensity of shared polymorphisms, which can relate to extra cold (*μ*
_{
i,j
}< 0), or hot (*μ*
_{
i,j
}> 0) regions. This term accounts for the increase/decrease in shared polymorphisms, not accounted for by the averaging over gene intensities given by *λ*
_{
i,j
}.

The random effect model, for additional hot or cold intensities, follows as:

where *δ*(*condition*) is an indicator function, taking on the value 1 if the condition is true, and 0 otherwise. *N*(*μ*
_{
i,j
}|*μ*
_{H},σ)δ({*μ*
_{
i,j
},*μ*
_{
H
}} > 0) and *N*(*μ*
_{
i,j
}|*μ*
_{C},σ)δ({*μ*
_{
i,j
},*μ*
_{
C
}} < 0) represent Normal density functions, constrained to be positive or negative, depending on if the shared polymorphisms is exceedingly hot or cold. Under our scenario, mixing weights (*p*
_{0}, *p*
_{
H
}, p_{c}) are unknown, as are the mean hot and cold temperatures: *μ*
_{
H
}and *μ*
_{
C
}, respectively.

Naturally, *p*
_{0} + *p*
_{H} + *p*
_{C} = 1. While this model has a rich set of parameters, the overall model is identified due to the strong joint dependency in the parameter set. That is,
and
are identified by the average gene intensity, in species 1 and 2, respectively. The *μ*
_{
i,j
}'s are identified by the extreme over and under distribution of shared polymorphisms which are not explained by
.

Given observed gene counts (

*O*
_{
i,j
}) for each shared inversion region, the likelihood function follows as:

where
, and Δ and *O* represent the matrices of intensity parameters, and associated counts of shared genes, for all (*i,j*) pairs. To estimate the unknown intensity parameters, we rely on a Bayesian inferential framework. Priors were selected as marginal reference priors. We describe these as follows.

For each

*λ*
_{
i,j
}we let:

.

We note that this prior is a global measure on the probability of observing shared polymorphisms between genes (*i,j*), given the fractional gene lengths
. Such a structure will have nearly optimal frequency coverage properties (see [42] for details). For the mean hot and cold temperatures in the random effects compartment of the model, we specify *p*(*μ*
_{
H
}) = *p*(*μ*
_{
C
})∝1. For the mixing weights, we let *p*(*p*
_{0}) = 0.99, and *p*(*p*
_{
H
}) = (*p*
_{
C
}) = 0.005. We have chosen this structure so that only very large deviations from the gene neutral model (*μ*
_{
i,j
}= 0) are declared hot or cold.

We focus on interpreting the joint quantities *γ*
_{
i,j
}= *λ*
_{
i,j
}+ *μ*
_{
i,j
}, through their proximity to 1. That is, values of *γ*
_{
i,j
}= *λ*
_{
i,j
}+ *μ*
_{
i,j
}close to 1 will support that genes are shared at random, for region (*i,j*). On the other hand, very high (and low) values will demonstrate and over (and under) accumulation of such genes. The relative over (or under) connectedness of individual genes can easily be made by considering the ratios

Parameters were estimated using Markov Chain Monte Carlo (MCMC) [43]. Chains were burned in for 1,000,000 iterations, and used an additional 100,000 samples for estimation. Visual assessments of trace plots showed the chains had all reached stationarity. Multiple runs were made, starting from various distant locations in order to ensure convergence.

To analyze the level of gene sharing outside inverted regions, relative to that inside the inversion regions, we consider the average rate of connectivity (as compared to expected) of outside regions:

where

*γ*
_{
i,out
},

*γ*
_{
out,j
}, and

*γ*
_{
out,out
}denote the rates corresponding to outside the inversion regions on their respective arms.

*N*
^{(1)} and

*N*
^{(2)} denote the number of inverted regions on their respective arms. The sums are explicitly indexed over regions inside inversion regions which have connections between all outside regions. The average rate of connectivity (as compared to expected) within the inverted regions follows as:

By taking the ratio of these average rates:

we obtain a measure of how under (or over) connected the regions outside the inversions are compared to those within the inverted regions. Ratios near 1, demonstrate similar levels of connectivity, whereas under connectivity in the outside regions are demonstrated by ratios <1, and over connectivity in the outside regions are shown by ratios >1. Bayesian posterior assessments of these ratios are easily computed given our MCMC samples corresponding to each gene pair.

### Analysis of the rates of gene order disruption

We identified two types of conserved blocks of genes: (i) blocks that were conserved among all three species (fully conserved blocks) and (ii) blocks that were conserved between two species but were disrupted in the third species (partially conserved blocks). To analyze the rate of gene order disruption, between

*An. stephensi* and

*An. funestus*, with

*An. gambiae*, it is necessary to consider the process that describes such genetic disruptions. Since these three separated species were descendants of some common species, at some point in time, each species had complete preservation of gene order with respect to the other two species. We denote this time by

*t*
_{0} = 0. This time represents the most recent time for which all three lineages can be mapped back to their most recent common ancestor. Through time, after these three species

*speciated*, each of their respective genomes evolved at different rates, which resulted in sets of disruptions between the species groups. For discrete time points (

*t*
_{0} <

*t*
_{1} < ··· <

*t*
_{
c
}) (

*t*
_{
c
}is the current state of time), these disruptions appeared, resulting in a lower level of arm-specific preservation of gene order. This process of disruption accumulation is illustrated in Additional file

7, from which we observed that through time the length of the conserved blocks decrease with increased frequency. While we do not get to see this progression through time, we do observe the level of conservation at time

*t*
_{
c
}. If this process were allowed to continue for an infinite amount of time, all blocks would eventually be disrupted. We note that our schematic only illustrates the basic idea, since inversions typically alter the location of the conserved blocks. In this analysis, we are concerned with how this process changes for each of the chromosome arms:{2R, 2L, 3R, 3L}. Since the disrupted blocks are accumulating through time, we account for both the number of conserved blocks and the length of each block at the current time (

*t*
_{c}). We model the number and length of blocks using a compound Poisson process, where the number of conserved blocks follows a Poisson process and where the length of the chromosomal arm scales the rate of the process; a separate process governs each conserved block's length. Formally, for each arm

*j* ∈ {2R, 2L, 3R, 3L}, with total chromosome length

*L*
_{
j
}, we model this process by

Where the number of observed blocks (

*N*(

*L*
_{
j
})) follows a Poisson process, and

*b*
_{
i,j
}denotes each individual block length and

*R*
_{
j
}denotes the total length of conserved blocks on arm

*j* (either fully conserved blocks, partially conserved blocks or both simultaneously), which is computed by the sum of individual block lengths on arm

*j*. Specifically, we let

where the arm-specific parameters

*λ*
_{
j
}and

*γ*
_{
j
}define rates for the counting and length process, respectively. For the counting process, we have that

for the block length process, *γ*
_{
j
}is interpreted as the per-unit-length rate of accumulation of blocks. Hence, larger values of *γ*
_{
j
}indicate longer block lengths, whereas, for the block counting process, larger values of *λ*
_{
j
}will correspond to a higher quantity of blocks. It should be noted that the process is conditioned on the length of each chromosome (*L*
_{
j
}), which induces conditional independence. However, marginally *N*(*L*
_{
j
}), and b_{
i,j
}are not assumed to be independent.

The individual lengths b

_{
i,j
}are scaled to the total arm length, so that

models the fraction of each arm that is occupied by a block. Alternatively, the length of all conserved blocks follows the distribution

so that

*E*[

*R*
_{
j
}] =

*N*(

*L*
_{
j
})

*L*
_{
j
}/

*γ*
_{
j
}. We consider the simple summary of the compound process

that has the arm-specific interpretation of *blocks per region length per total length*.

On its own, the number of *blocks per region length* shows the density of conserved blocks within the conserved portion of each arm. Because each arm is of different length, we summarize our inferences by scaling this density to the *total length* of each arm. High values correspond to short blocks and/or high counts. Specifically, which rate is high (counts or lengths) is indistinguishable for our purposes. Hence, when summarizing results, it is convenient to think about *λ*
_{
j
}, *γ*
_{
j
}, and the above equation together. Our primary question is if the level of conserved disruptions accumulates at a rate different from that for fully conserved blocks. For this, we fit the process on both fully conserved and partially conserved blocks, and those pertaining to only fully conserved blocks. The parameter sets associated with each of the respective processes are denoted
and
representing the partially conserved and fully conserved processes. Other parameterizations are justifiable; however, due to the sparsity in disrupted blocks on some arms, we prefer this parameterization. For both processes, posterior summaries (rate expectations and 95% credible intervals) are displayed in Table 3.