The impact of migratory flyways on the spread of avian influenza virus in North America

Wild birds are the major reservoir hosts for influenza A viruses (AIVs) and have been implicated in the emergence of pandemic events in livestock and human populations. Understanding how AIVs spread within and across continents is therefore critical to the development of successful strategies to manage and reduce the impact of influenza outbreaks. In North America many bird species undergo seasonal migratory movements along a North-South axis, thereby providing opportunities for viruses to spread over long distances. However, the role played by such avian flyways in shaping the genetic structure of AIV populations remains uncertain. To assess the relative contribution of bird migration along flyways to the genetic structure of AIV we performed a large-scale phylogeographic study of viruses sampled in the USA and Canada, involving the analysis of 3805 to 4505 sequences from 36 to 38 geographic localities depending on the gene segment data set. To assist in this we developed a maximum likelihood-based genetic algorithm to explore a wide range of complex spatial models, depicting a more complete picture of the migration network than determined previously. Based on phylogenies estimated from nucleotide sequence data sets, our results show that AIV migration rates are significantly higher within than between flyways, indicating that the migratory patterns of birds play a key role in viral dispersal. These findings provide valuable insights into the evolution, maintenance and transmission of AIVs, in turn allowing the development of improved programs for surveillance and risk assessment.


Background
Avian influenza viruses (AIVs) infect a wide range of bird species, with sporadic jumps to mammalian hosts, notably humans, causing short-lived epidemics and occasionally establishing endemic transmission cycles [1,2]. Wild birds, particularly Anseriformes (e.g. ducks, geese, and swans) and Charadriiformes (e.g. gulls, shorebirds, and terns), act as the main natural reservoirs for AIV, and viral prevalence in these species is considerably higher than in other birds [3,4]. Importantly, many bird species that experience high levels of AIV infection undertake long-distance seasonal movements along migration routes, or flyways, in reaction to fluctuations in the availibity of food or breeding sites [5,6]. This natural phenomenon provides an obvious mechanism for AIVs to spread over long distances, connecting spatially disjunct localities and creating opportunities for viral transmission to those wild bird species and poultry resident in disparate geographic localities [7]. Indeed, migrating wild birds have been linked to the geographic diffusion of a variety of AIV subtypes [3], including highly pathogenic H5N1 influenza virus [8][9][10][11], as well as other RNA viruses such as West Nile virus [12,13].
Despite their classification into multiple subtypes based on sequence diversity in the hemagglutinin (HA) and neuraminidase (NA) genes, those AIVs sampled from the Western and Eastern hemispheres tend to form distinct monophyletic groups, with relatively infrequent viral movement between hemispheres [14][15][16]. This phylogenetic pattern implies that there is a low transmission rate between birds that are located in disjunct localities, in turn suggesting that bird movements, including wild bird migration, between North America and the Old World are limited [17]. It is therefore reasonable to assume that natural physical barriers like extended areas of water and mountain ranges lead to the ecological separation of bird species and, by extension, to their viral populations. One such obvious barrier at the continental scale is the presence of avian flyways, which loosely describe the migratory pathways followed by diverse avian species [18]. Four such major flywaysthe Pacific, Central, Mississippi, and Atlantichave been described in North America, and describe (albeit loosely) the patterns by which birds migrate along the North-South axis within the continent. However, because flyway assignments are often only approximate, and the borders between them fluid because they do not reflect absolute physical boundaries, there will evidently be some movement among flyways [19].
Despite the potential importance of avian flyways in shaping the population structure of AIV and its patterns of spread, those studies performed to date have produced strongly contradictory results [7,14,20,21]. For example, Lam et al. [14] showed that virus dispersion occurred more frequently within than between migratory flyways and concluded both that flyways acted as physical barrier for viral dispersal and that there was significant virus isolation-by-distance. In marked contrast, through a Bayesian phylogeographic approach Bahl et al. [7] suggested that migratory flyways had relatively little impact on the spatial spread of AIV, particularly as bird movements along the East-West axis in North America had a larger contribution to viral spread than previously proposed. However, these studies used relatively small data sets and applied very different analytical methods, although both made use of the underlying AIV phylogeny. As a consequence, the role played by flyways in AIV evolution and phylogeography is uncertain.
To better understand the role played by avian migratory flyways in the dispersal of AIV in North America we analysed large data sets of the internal gene segments of AIVs sampled from 36 to 38 administrative regions across the United States and Canada. As in previous studies [7,14] we used continuous-time Markov models to characterize transmission rates between discrete geographic locations. We evaluated simple models such as the flyway-based models proposed by Lam et al. (7) and, more importantly, designed an efficient genetic algorithm that, given a fixed and small number of parameters, automatically finds the best model that fits the data. Importantly, the new phylogeographic method developed here does not require information on sampling time such that data sub-sampling is not necessary when sampling dates are missing or inaccurate, as was the case with a number of the AIV sequences analysed here. In addition, the simplicity of the method means that it is able to accommodate thousands of gene sequences, which would pose a significant challenge to more computationally intensive phylogeographic methods such as those incorporated within the BEAST package [7,22]. Critically, our results indicate that the mean transition (i.e. dispersal) rate within flyways is between 4 and 13 times greater than that between flyways, suggesting that the migratory patterns exhibited by birds have a major impact on the spread of AIV in North America.

Data preparation
To investigate the strength of association between viral and bird migration we focused on the internal gene segments of AIV, encoding the PB2, PB1, PA, NP, and MP (M1 and M2 coding regions were concatenated) proteins. Those genes encoding the viral HA and NA proteins were excluded due to the very deep divergences between subtypes, while NS was excluded due to the presence of two phylogenetically distinct alleles (A and B). Full-length nucleotide sequences of AIV isolated in North America were downloaded from the Influenza Virus Resource Database at NCBI (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html). For each gene, sequences were aligned using MAFFT [23] and were manually edited using Seqotron [24]. The sampling location within North America was extracted from the sequence name, and each sequence was labelled with a discrete geographic location using either the state classification in the USA or the province classification in Canada. Sequences belonging to a geographic location that was represented less than 6 times were discarded. This resulted in final data sets of 4346 (PB2), 4421 (PB1), 4505 (PA), 3853 (NP), and 3805 (MP) sequences. The number of distinct geographic locations per gene was 36, 38, 38, 37, and 37 for the PB2, PB1, PA, NP, and MP data sets, respectively.
Sequences in these data sets were sampled from both wild birds (the majority) and poultry (i.e. quail, chicken, turkey, and pheasant). To determine whether poultry influenced the dispersal of AIV on these data we subsampled the representative PB2 gene data by removing all those sequences (n = 232) associated with poultry. This resulted in an alignment of 4114 wild bird AIV sequences sampled from 34 localities. All the data sets utilized here are available on the Zenodo repository https:// doi.org/10.5281/zenodo.153883.

Flyway designation
Each US state and Canadian province was assigned to one of the four North American flyways defined by the United States Fish and Wildlife Service and Flyway Councils (Lincoln 1979) and usually referred to as localities in the text. From the west coast to east coast the designated flyways are the Atlantic flyway (AF), Mississippi flyway (MF), Central flyway (CF), and the Pacific flyway (PF). The assignment of individual localities to specific flyways is described in the countries.csv file in the Zenodo repository, although flyways are better regarded as loose assemblages rather than entities with fixed boundaries. The spatial distance between each pair of states ⁄ provinces was calculated as the great circle distance between the average latitude and longitude of each locality.

Phylogenetic analyses
Maximum likelihood trees for each data set were inferred using ExaML [25] assuming the generalised timereversible (GTR) substitution model and a discretised gamma distribution (4 categories) of substitution rate across sites using the default settings.

Genetic differentiation of AIV among sampling localities
We used the nearest neighbour statistic S nn [26] to determine the genetic differentiation among localities. This statistic measures how often the nearest neighbors of sequences are found in the same geographic locality. Accordingly, an S nn estimate close to 1 suggests that populations at two localities are highly differentiated, while an estimate near 0.5 indicates little differentiation (i.e. a panmictic population). This method requires pairwise genetic distances be calculated for every sequence. To this end we used the phylogenetic tree inferred from the nucleotide data sets of each internal gene segment and calculated patristic distances between each pair of taxa, in which the patristic distance is the sum of branches over the shortest path between two taxa [27]. A C++ program implementing this procedure is available from Github repository https://github.com/4ment/gdp.

Estimation of AIV migration between geographic localities
Migration rates of avian influenza viruses between discrete geographic locations were analysed using a reversible continuous-time Markov chain model as described in Pagel [28] and implemented in Physher [29]. The state space of the Markov chain was defined as the set of geographic locations. For the phylogeographic analysis of each gene segment, we fixed the tree topology and the branch lengths to their maximum likelihood estimates obtained in the phylogenetic inference of the corresponding nucleotide alignment. Time in this Markov chain is therefore measured in units of nucleotide substitutions per site.
To investigate the patterns of viral migration we use several classes of models: (i) a homogeneous rate model (HRM) with equal transition rates among localities; (ii) a two-rate model (TRM) in which transition rates within any flyway are equal and transition rates of pairs of states belonging to different flyways are equal; (iii) flyway-specific rate models (FRM) where transition rates within a flyway are equal and transition rates between flyways are different; (iv) a general rate model (GRM) in which given a fixed number of rates, the equal transition rate within flyways assumption is relaxed. We refer to 4-FRM for models based on four flyways (i.e. AF, CF, MF, and AF) and 3-FRM when the central and Mississippi flyways are merged into a single flyway due to their geographic overlap. Similarly, the TRM are named either 4-TRM or 3-TRM depending on the number of flyways under investigation.
Calculating the maximum likelihood of the HRM, TRM, and FRM models is relatively straightforward as it only requires standard numerical methods to optimize continuous parameters. In contrast, the parameter space in the GRM is high dimensional and contains both discrete and continuous parameters. Given a fixed number of transition rate categories k and a symmetric d × d transition rate matrix, the GRM will have the following parameters: r = (r 1 ,…,r k ) the value of the rate for each category. z = (z 1 ,…,z d*(d-1)/2 ) vector of rate-class assignments for each non-diagonal element of the rate matrix where z i ∈ 1 … k.
Rate parameters r 1 ,…,r k are optimized using standard numerical methods while rate-class assignment is optimized using a genetic algorithm (GA). We implemented the GA as a generational genetic search algorithm CHC GA, an approach that was previously applied to natural selection and molecular clock inference [29,30]. Each individual of the GA population is represented as a vector containing a particular rate-class assignment (z). We implemented the method in Physher [29] and the genetic algorithm was parallelized using POSIX threads.

Model comparison
Since the homogeneous, 2-rate model, and flywaybased models are nested, we used the likelihood ratio test (LRT) to test their goodness-of-fit [31]. The HRM is the simplest model is at contains only one transition rate. The number of flyways under investigation will determine the number of parameter in the FRM models. For a four-flyway rate model (4-FRM) the number of free parameters is 10, while the number of free parameters for three flyways (3-FRM) is reduced to 6. Although the GRM and HRM models are nested, this is not necessarily the case for the GRM and FRM models, so that the LRT cannot be used to compare these latter models.
Another approach to model selection is to use information theory-based criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion [32]. Unlike the LRT, these selection criteria are applicable to non-nested models. The AIC penalizes the number of parameters using the following formula: where LnL is the log-likelihood and k is the number of estimated parameters. When the number of observation is small, it is recommended to use a second order correction to the AIC: where LnL is the log-likelihood, s is the number of observations, and k is the number of estimated parameters. To be valid, the AICc requires that the number of observations exceed the number of estimated parameters. Unfortunately, it is difficult to define the number of observations in phylogenetics. In nucleotide-based inference, the number of observations is usually assumed to be either the number of characters or the number of unique sites in the alignment [32]. Using the total number of characters is likely to be an over-estimate due to the correlation of characters among sites, while the number of unique sites would underestimate the effective sample size. In the context of phylogeography we can only observe one realization of the spatial process that generated the discrete geographic locations. Due to the difficulty in statistically comparing models of different dimensionality, we fixed a priori the number of parameters to 6, which represents the number of free parameters in the 3-FRM.

Assessing the impact of migratory flyway on viral spread
We assessed the influence of migratory flyways on viral spread by calculating the within-and between-flyway mean rates and their ratio, in which a ratio greater than 1.0 suggests that migratory flyways act as a barrier to viral dispersal.
Importantly, however, this flyway assignment is unlikely to be accurate since flyway boundaries coincide with administrative boundaries of localities and are therefore not entirely based on ecological data. To investigate the impact of flyway boundary choices we changed the flyway boundaries and calculated both mean rates and calculated their ratio. Depending on the gene segment under investigation there are at most 8 localities abutting flyway boundaries, so we redefined the boundaries around these localities. For example, using the flyway classification used in this paper Alberta belongs to the Central flyway while the neighboring province of British Columbia belongs to the Pacific flyway. In this particular case we recalculated means rates and their ratios twice by assigning both states to either the Pacific or Atlantic flyway. Specifically, we tried every combination where one or more localities were assigned to the flyway of its neighbour, while avoiding flyway overlaps. For example, we do not assign Alberta to the Pacific flyway and British Columbia to the Central flyway as this would change the order of flyways on the east-west axis (i.e. from Pacific/Central/Mississippi/Atlantic to Central/Pacific/Mississippi/Atlantic).

Results
To investigate the structure of AIV transmission between US states and Canadian provinces, and its association with patterns of bird migration (i.e. the presence of avian flyways), we constructed several continuous-time Markov chains (Table 1) and compared them when appropriate.  First, to assess the heterogeneity of the migration network we compared the (non-flyway) HRM model to the 3-TRM and 4-TRM (flyway-based) models using a likelihood ratio test. For each gene we found that the TRM models provided a significantly better fit to the data than the (non-flyway) HRM (LRT p-value << 10 −10 for every gene analyzed). Most notably, the within-flyway rate is about 4 to 5 times greater than the between-flyway rate depending on the gene segment and the number of flyways assumed. Although the 3-TRM and 4-TRM have the same number of parameters, the log-likelihood of the former is higher than that of the latter in every analysis suggesting that viral gene flow is relatively unconstrained between the Central and Mississippi flyways. Another explanation for this pattern is that the within-flyway rate of these two flyways is equal. Both the 3-FRM and 4-FRM models impose fewer constraints on the underlying structure of the migration network by allowing heterogeneity in migration rate between flyways. These complex models fit the data better than any of the TRMs in all data sets (LRT p-values = 0). Similarly, the most parameter rich model, 4-FRM, also fit the data significantly better than the 3-FRM in each analysis (LRT p-value << 10 −10 ). Migration rates inferred using the 3-TRM, 4-TRM, 3-FRM and 4-FRM Table 3 Estimates of the within-and between-flyway rates and their ratio under the 3-TRM assuming 3 flyways for each gene segment. Under the 3-TRM the ratio is equal to the within flyway rate estimate over the between flyway rate estimate. Under the 3-FRM and GRM the ratio is equal to the mean within flyway rate over the mean between flyway rate. Mean rates are calculated assuming three flyways 3 (Tables 2 and 3). Importantly, the migration rate between the most distant flyways (i.e. Pacific and Atlantic flyways) was significantly lower than the other rates, indicating that there is clear isolation-by-distance along the East-West axis.
Using the same number of rates as in the 3-FRM (i.e. 6 rates), we relaxed the assumption that transition rates within flyways are identical and inferred the best model using a genetic algorithm. This successfully identified models that fit the data better than the flyway-based models. Notably, although the 3-FRM and the general rate GRM have the same number of parameters, the log-likelihood of the GRM is significantly higher than the 3-FRM in every data set ( Table 1), suggesting that the flyway-based classification of rates is too stringent or unrealistic. For example, Nevada and Alaska are allocated to the same (Pacific) flyway but are separated by thousands  of kilometres, obviously providing fewer opportunities for direct transmission of avian viruses between birds than neighbouring localities. Overall, our analyses show that transmission rates vary widely not only within flyways but also between them (Figs 1, 2,  3 and 4). In what follows we present the results based on the PB2 gene segment in detail as these are illustrative of the overall pattern (results for the other gene segments are provided in the supplementary material). In the PB2 gene analysis (Figs 1 and  2), the highest migration rate category was assigned to pairs of localities that belong to different flyways, while some migration rates within flyways were equal to 0. Importantly, the three highest migration rate categories were not assigned to pairs of localities belonging to either the Pacific or Atlantic flyways. Many of the estimated transmission rates were equal to 0, while non-zero rates spanned several orders of magnitude, revealing a patchy and heterogeneous transmission network between localities. We also estimated the mean within and between flyway rates for each data set; their ratio suggested that the within-flyway rate is at least four times higher than the between-flyway estimate (Tables 2 and 3). Finally, Figs 1 and 2 also suggest that Alaska is highly connected with localities belonging to any flyway, confirming its importance as a hub for wild bird migration.
It can be argued that our estimates may be influenced by our a priori assignment of localities to flyways. To identify a potential bias in our analyses we therefore calculated the same ratio described above but using different combinations of locality-to-flyway assignments for localities located along flyway boundaries. In all cases the ratio is greater than two, suggesting that our model is robust to uncertainties in locality-to-flyway assignments (Fig 5).
We next determined the level of genetic differentiation among geographic localities using the nearest neighbour statistic (S nn ). Although there appears to be no correlation between spatial distance and S nn , our results show that as the spatial distance between localities increases so the mean S nn tends to be higher, suggesting the presence of a structured population (Figs 6 and 7). Furthermore, when three flyways are considered, the mean within-flyway S nn is lower (i.e. less structured) than the mean between-flyway S nn involving contiguous flyways, which in turn is  Fig. 3 Patterns of migration rates across localities. Plots of migration rates using the PB2 phylogeny (as an exemplar) on a two-dimensional matrix where each cell represents a rate between two locations. Each cell is color-coded according to the migration rate between the locations. The matrix is symmetric and location labels on the x-and y-axis are ordered so that locations belonging to the same flyway are placed next to each other lower than the between-flyway statistic involving noncontiguous flyways (i.e. the Pacific and Atlantic flyways) (Table 4). Finally, a recent study suggested that poultry are an important driver of the regional spread of AIV [33]. Since poultry movements are independent of the flyway system, we reanalyzed the PB2 gene segment data set by removing all those sequences isolated from poultry. Using the same genetic algorithm and assuming four flyways, the mean within-flyway rate is inferred to be 4.28 times higher than the mean between-flyway rate in this wild bird-only data set, a value almost identical to the ratio estimated using the full data set (ratio = 4.33). We argue that if poultry played an important part in the spread of AIV it would dampen the north-south pattern observed in these data since domestic birds  Fig. 4 Patterns of migration rates across localities. Plots of migration rates inferred for the PB1, PA, NP, and MP gene segments on a two-dimensional matrix where each cell represent a rate between two locations. Each cell is color-coded according to the migration rate between the locations. The matrix is symmetric and location labels on the x-and y-axis are ordered so that locations belonging to the same flyway are next to each other could in theory spread AIV in any geographic direction. Although we have insufficient data to examine the impact of the poultry trade in data, the apparent lack of sensitivity to the inclusion of sequences originating from poultry supports our conclusion that wild birds primarily spread AIV along flyways.

Discussion
Our large-scale phylogeographic analysis reveals that the main gradient of diffusion of avian influenza viruses in North America occurs along a North-South axis, within the migratory flyways utilized by wild birds. In particular, our results show that the withinflyway migration rate was 4-13 times greater than the between-flyway rate depending on the gene segment and the model used, thereby providing clear evidence that migratory flyway plays an important role in structuring AIV populations in North America. The most compelling observation in this context was that viruses sampled from the Pacific and Atlantic flyways, the most geographically distant flyways, showed the least gene flow implying a significant isolation-by-distance along the East-West axis, in support of the suggestion of Lam et al. [14]. However, it is also evident that gene flow between flyways has occurred at a measureable rate, as expected with geographically contiguous localities. As noted previously [14], our analyses suggest a strong overlap between the Central and Mississippi flyways, with high transmission rates between these two areas (Figs 1, 2, 3 and 4). Indeed, given that the barriers between flyways are obviously fluid, some viral spread between flyways is expected. This is in sharp contrast to the separation of North American and European birds by the Atlantic Ocean, which acts as a strong barrier to bird interaction [34,35]. Although genetic differentiation (S nn ) and spatial distance appear to be uncorrelated, the trend depicted in Figs 6 and 7 provides a more nuanced explanation of the role of migratory flyways in the spread of AIV and the impact of isolation by distance on viral genetic diversity. Our results show that spatial distance tends to reduce genetic diversity less along the North-South axis than along the East-West axis, suggesting that gene flow occurs more frequently along migratory flyways. For example, the mean S nn calculated for distinct pairs of localities that belong to the Pacific and Atlantic flyways is higher than for pairs of localities that belong to the same flyway.
Other studies have investigated the extent of the correlation between viral spread and migratory birds using different data sets and methods, which have provided contradicting conclusions. One early study [14] utilized a combination of parsimony and maximum likelihood phylogenetic methods and provided evidence that gene flow was greater within flyways than between flyways, hence supporting a key role for flyways. However, at the time of this study, AIV sequences were only available from 16 localities (states and provinces) in North America, thereby increasing the chance that intermediate transmission chains between close localities would  be missed. Indeed, a later study based on the Bayesian analysis of a larger data set (although still restricted to 16 geographic localities) suggested that viral spread was mainly independent of bird migration patterns [7]. Herein, we have greatly expanded these data to an analysis of 36 to 38 localities, employing a model-based approach similar to Lam et al. [14] but extending the intuitive but rather inflexible flyway-based models to allow variable rates within flyways without increasing parameter space. Our models, which showed a better fit to the data than the 3-FRM with the same complexity, depicted a more intricate picture of viral spread in North America, but still captured a strong correlation between gene flow and migratory bird along flyways, and showed that migration rates within flyways are not homogeneous. In addition, the GRM reveals the absence of viral transmission between some localities, especially between pairs of localities that belong to the Pacific and Atlantic flyways.
Another key element in our study is that transition rates are estimated using a fixed phylogenetic tree topology with branch lengths that are inferred from nucleotide sequences. While transition rates expressed in units of time would have been a more natural interpretation, AIV exhibits strong substitution rate heterogeneity [36], rendering the estimation of chronograms and substitution rates challenging. In addition, the sampling dates of many sequences are either absent or incomplete, which will clearly limit the scope of all phylogeographic approaches that rely on molecular clock dating.
As in previous studies [14], our estimation of the migration prossess relies on a single phylogenetic tree inferred from the nucleotide sequence alignment of each segment separately. This is computationaly attractive since our large-scale comparative analysis is based on thousands of sequences, while the focus on individual gene segments removes the possible impact of reassortment. However, it is currently unclear how to deal with the uncertainty involved in the estimation of the tree and rate parameters within the maximum likelihood framework developed here.

Conclusions
Our analyses of this set of genome sequence data provides strong evidence that the North-South migration of birds in North America, reflected in the presence of geographically-based flyways, plays an important role in shaping the genetic structure of populations of avian influenza virus. As such, the spread of AIV in wild birds at the continental scale has some degree of predictability that may eventually assist in our attempts to control the future spread of any highly pathogenic influenza viruses that emerge in North America.