New approaches for unravelling reassortment pathways
© Svinti et al.; licensee BioMed Central Ltd. 2013
Received: 5 April 2012
Accepted: 21 November 2012
Published: 1 January 2013
Skip to main content
© Svinti et al.; licensee BioMed Central Ltd. 2013
Received: 5 April 2012
Accepted: 21 November 2012
Published: 1 January 2013
Every year the human population encounters epidemic outbreaks of influenza, and history reveals recurring pandemics that have had devastating consequences. The current work focuses on the development of a robust algorithm for detecting influenza strains that have a composite genomic architecture. These influenza subtypes can be generated through a reassortment process, whereby a virus can inherit gene segments from two different types of influenza particles during replication. Reassortant strains are often not immediately recognised by the adaptive immune system of the hosts and hence may be the source of pandemic outbreaks. Owing to their importance in public health and their infectious ability, it is essential to identify reassortant influenza strains in order to understand the evolution of this virus and describe reassortment pathways that may be biased towards particular viral segments. Phylogenetic methods have been used traditionally to identify reassortant viruses. In many studies up to now, the assumption has been that if two phylogenetic trees differ, it is because reassortment has caused them to be different. While phylogenetic incongruence may be caused by real differences in evolutionary history, it can also be the result of phylogenetic error. Therefore, we wish to develop a method for distinguishing between topological inconsistency that is due to confounding effects and topological inconsistency that is due to reassortment.
The current work describes the implementation of two approaches for robustly identifying reassortment events. The algorithms rest on the idea of significance of difference between phylogenetic trees or phylogenetic tree sets, and subtree pruning and regrafting operations, which mimic the effect of reassortment on tree topologies. The first method is based on a maximum likelihood (ML) framework (MLreassort) and the second implements a Bayesian approach (Breassort) for reassortment detection. We focus on reassortment events that are found by both methods. We test both methods on a simulated dataset and on a small collection of real viral data isolated in Hong Kong in 1999.
The nature of segmented viral genomes present many challenges with respect to disease. The algorithms developed here can effectively identify reassortment events in small viral datasets and can be applied not only to influenza but also to other segmented viruses. Owing to computational demands of comparing tree topologies, further development in this area is necessary to allow their application to larger datasets.
Influenza viruses are a major cause of infections in humans, with a dynamic history characterized by common seasonal epidemics and occasional pandemics. The evolution of the virus during and in between these outbreaks is difficult to describe because it undergoes rapid evolution in order to evade the constantly adapting immune response of their hosts . The influenza A genome consists of eight individual segments of single stranded, negative sense RNA, each containing a single gene [2, 3]. The segmented nature of the genome allows for the exchange of entire genes between different viral strains when they co-infect the same cell through the process of reassortment . Two viruses co-infecting a cell could potentially generate 254 genotypes. Reassortment is particularly interesting because it is an event that can quickly generate influenza strains with novel infectious properties [5–7]. The most recent pandemic in 2009 provides a fresh example of a reassortant H1N1 strain that arose to claim an infectious ability of global scale . In fact, three of the most recent pandemics have been caused by reassortant influenza A strains (1957, 1968 and 2009) . Li and colleagues generated all the possible reassortants from two H5N1 and H3N2 strains and found that 72% of all the new subtypes replicate differently to the parental strains whereas 28% were not viable . In addition, 22 reassortants were more pathogenic in mice than the parental H5N1. These results underline the importance that reassortment plays in the evolution of influenza viruses and how essential it is to be able to describe these events. Identification of new reassortants is a crucial step in understanding viral evolution and preferential reassortment patterns, in working towards preventing infections with and spread of fatal viruses.
Phylogenetic trees depict hypotheses of evolutionary relationships between species or sequences . If no reassortment events have occurred in the history of a set of sequences, then, allowing for phylogenetic error and stochastic effects, we expect that their respective phylogenetic trees will display overall congruence. More formally stated, we have a null hypothesis that a single phylogenetic tree (i.e. evolutionary history) best fits all regions of the genome. The alternative hypothesis is that some part of the genome fits an alternative phylogenetic tree better and this is sufficient to overcome our initial expectation of a single tree - thereby rejecting the null hypothesis in frequentist terms, or overcoming our prior in Bayesian terms. Reassortment events, therefore, can be identified in those cases where genetic segments from the same isolate occupy different positions on the phylogenetic trees inferred from these segments . In many studies up to now, the assumption has been made that if two phylogenetic trees differ, it is because reassortment has caused them to be different [13–16]. While phylogenetic incongruence may be caused by real differences in evolutionary history, it can also be the result of evolutionary model misspecification , high levels of homoplasy , long branch attraction , inadequate sampling  or separating data into partitions . Therefore, if we wish to understand the frequency and nature of reassortment we must have methods for distinguishing between topological inconsistency on phylogenetic trees that can be accounted for by confounding effects and topological inconsistency that is due to reassortment. A reassortment event is more easily detected if the two strains involved in producing the reassortant are sufficiently divergent in their sequences. Reassortment between very similar strains is likely to go undetected by most, if not all, methods. At its most extreme, consider the instance where reassortment has involved two identical parental strains. This event cannot be detected using a computational approach but it is also not relevant in a biological context, as we would expect the identical gene segments to have identical properties. In addition to the detection of the existence of reassortment events, it is desirable to be able to identify the direction of reassortment – the likely origin of genetic segments in a reassortant strain - and also the frequency with which particular segments reassort. This information can be translated into an understanding of overall reassortment pathways and can be used to make predictions about future reassortment events.
In order to accomplish the objective of adequately reconstructing reassortment events in influenza, it is necessary to have a robust statistical platform, based on evaluating the differences between phylogenetic trees derived from different viral genome segments. In this paper we assess reassortment by the evaluation of the significance of the differences between phylogenetic trees constructed from different genome segments. If phylogenetic trees constructed from distinct segments are non-trivially different (in other words, if the differences in tree topologies cannot be explained by stochastic effects, by having insufficient amount of data, low levels of phylogenetic signal or high levels of artifactual noise), the occurrence of one or more reassortment events may be the best explanation of this divergence. If the differences between phylogenetic trees are trivial, say, when they can be accounted for by invoking error in tree estimation as an explanation, it is appropriate to assume that no reassortment has taken place. We acknowledge however, that even though we might reject the null hypothesis (no reassortment), there can be situations where this is not correct. We do not have a method for discovering the false rejection rate and indeed it is unlikely to be discoverable using current technology.
The Robinson-Foulds (RF) distance  and subtree transfer distances (subtree pruning and regrafting, or SPR)  are two methods of measuring dissimilarity when comparing trees. Counting how many bipartitions in one tree are not shared with the other tree gives a value of the RF distance between two trees. However, this metric treats each feature of the tree equally and as a result, trees that agree in an important area can be given a large tree distance . An SPR operation on a tree consists of pruning a subtree by cutting an edge, and then regrafting the subtree by the same cut edge to a new vertex . It can be said that two trees are close together if one can be obtained from the other by a small number of SPR operations . The “cut and paste” nature of this operation resembles the effect caused by reassortment events. Therefore, identifying the minimum number of SPR operations to convert one phylogenetic tree into another can describe the minimum number of reassortment events that have occurred in the history of the strains being studied. The EEEP (Efficient Evaluation of Edit Paths)  algorithm can be used to achieve this objective in a very short time. EEEP works in a pairwise manner: it uses a reference and a test tree to construct the ‘edit path’ between them, or the SPR operations needed to convert the reference tree to the test tree. Choosing this fast algorithm for the calculation of edit paths resulted in a pairwise approach to the reassortment problem, the outcomes of which are combined at a later stage in the analysis.
Not all differences in phylogenetic tree topology need to invoke reassortment events as the explanation. Determining whether the topologies produced by SPR modifications are significantly different to the starting tree, requires a statistical model for the variation of the features of the tree . Tests such as the bootstrap probability , Kishino-Hasegawa , and the approximately unbiased (AU) test  are used to assess confidence in phylogenetic hypotheses. In order to determine whether trees represent significantly different hypotheses based on a particular dataset, the AU test has been used [29–31]. This test uses a multi-scale bootstrap technique that consists of generating sets of bootstrap replicates with varying sequence lengths and estimating the AU p-value from the change in the bootstrap probability values along the changing sequence length . Given a set of trees and an alignment of sequences, the resulting p-value for each tree reflects how well that particular topology describes the given data. The AU test has some desirable properties. The use of an explicit model of sequence evolution in order to reconstruct the history of a segment of DNA is an important step in understanding virus evolution. In addition, the AU test has a robust statistical platform on which to evaluate the significance of the difference between sets of phylogenetic trees. In our analyses, we have used the AU test for the sets of trees produced by the EEEP program for each pairwise analysis.
We present two automated methods to assess the significance of difference between phylogenetic trees and we use these methods to test whether reassortment occurred in a simulated dataset as well as a small collection of influenza viruses isolated in Hong Kong in 1999. The first method (MLreassort) applies subtree pruning and regrafting (SPR) modifications on maximum likelihood trees and the AU test to identify significant differences in tree topologies. The second method (Breassort) uses a Bayesian approach, together with multidimensional scaling and SPR, to identify the most frequent SPR operations that connect two significantly different sets of trees. Reconciliation of topologies is sought between all pairs of maximum likelihood trees derived from the eight influenza genetic segments. The steps required to reconcile two trees, starting with one tree (the reference tree) and perturbing its topology to obtain the other tree (the test tree), involve subtree pruning and regrafting operations. We only infer reassortment events for those SPRs that are found by both methods as a conservative approach. Although one method may be found to perform better than the other, for now we have opted to focus on consistency across methods. We propose this kind of approach as a way of evaluating the nature of reassortment in influenza and other segmented viruses.
Reassortment is a process that results in conflicting phylogenetic hypotheses derived from their respective gene segments. Our overall aim is to detect real reassortment events that have occurred in a set of influenza isolates. We apply these algorithms to a small set of influenza sequences as well as a simulated data set. The enormous complexity of influenza virus evolutionary history cannot reasonably be explored in this manuscript, so instead we present exemplar datasets and describe how the approaches work. We also suggest future improvements that would be useful for a thorough understanding of the complex nature of reassortment networks.
We inferred phylogenetic trees for all the eight segments using the GTR model of evolution. For MLreassort, maximum likelihood trees were obtained for each segment (Figure 1A and B). In order to evaluate the significance of the differences between pairs of maximum likelihood tree topologies, we used the EEEP (Efficient Evaluation of Edit Paths) algorithm  in combination with the AU test to first of all define the best way of using SPR edits to convert one tree into another via the fewest moves and subsequently to evaluate whether any of these edits result in statistically significant changes in tree topology. Once the SPR edits and resulting tree topologies are obtained, the AU test is used to estimate confidence intervals around trees. Non-overlapping confidence intervals indicate that the trees included in the sets are significantly different to each other. SPR operations connecting trees with significantly different topologies are referred to as significant operations/transitions/branch swaps, and they are depicted on the maximum likelihood tree for each segment (Figure 1A and B). The direction of arcs, pointing from the end with an empty circle (the likely source) to filled circle (the recipient lineage) indicates the direction of transfer. For the simulated data, all significant branch swaps involved taxon ‘G’. Using the HA alignment (the alignment simulated using a reassortment-containing hypothesis) and the GTR model of evolution, a HA tree was obtained (Figure 1B) with a negative log likelihood of 9256.28. MLreassort proposes a branch swap on the HA tree to reconcile it with the other seven trees (moving taxon ‘G’ to group with taxon ‘F’). After applying this SPR to the HA tree, the negative log likelihood increases by 1868.43 (to 11124.71). This means that the tree obtained after the SPR modification provides a worse explanation of the HA data than the tree before the SPR modification.
Breassort uses a Bayesian-based approach for tree reconstruction. The result of a Bayesian MCMC sampling of parameter space, including tree space, is a set of phylogenetic trees. In order to detect reassortment, we developed a method for comparing these sets. One way of achieving a meaningful comparison of two sets of trees that are derived from two different alignments is to evaluate tree ‘distances’, both within and between the two sets, in geometric space , with each point in this space representing a phylogenetic tree. Using a standard statistical method, convex hull peeling , we arranged the points into convex hulls. Eliminating the 5% outliers in each set of trees using convex hull peeling, we carried out pairwise comparisons of the remaining 95% of trees. The pairs of Bayesian tree sets all overlap, with the exception of cases in which HA is involved. The set of HA trees does not overlap with the sets of trees for MP, NA, NP, NS, PA, PB1 and PB2. This kind of situation might indicate that the genes have different evolutionary histories, but could also be caused by stochastic effects as might be expected, for instance, if the alignments were short. Therefore, for those cases where the sets of trees do not overlap, we randomly selected 100 trees from each segment’s tree space. We then carried out EEEP analyses in order to find branch swaps that connect the trees in the two sets. SPR edits that were found in more than 70% of cases were considered significant. The results of these comparisons are summarized using a network diagram, where each node represents the set of trees from an MCMC Bayesian analysis for its corresponding segment (Figure 1C). The edges between the nodes correspond to the proposed SPR modifications, and the direction (ends with filled circles) is towards the trees being modified. Edges of the same colour show the same SPR operation, as shown in the legend.
In inferring reassortment events, we have opted for a conservative approach that uses consistency across both MLreassort and Breassort (Figure 1D and E). The HA tree proposes that all the other seven trees should be altered by moving taxon ‘G’ to cluster with taxon ‘B’ (Figure 1A, C and D). On the other hand, segments MP, NA, NP, NS, PA, PB1 and PB2 propose that an SPR transition on the HA tree, which would involve moving taxon ‘G’ to group with taxon ‘F’ (Figure 1B, C and E), is required in order to reconcile the trees. This modification would result in a topology that would resemble the original HA tree, before the manual introduction of the modification. Interpreting the output in the light of all the results, it makes sense to infer that the HA tree, rather than seven other trees, has a ‘wrong’ topology. This would imply that taxon ‘G’ has an HA segment that is inconsistent in its evolution with the other seven genes. If this was a real viral data it would mean that ‘G’ is a reassortant strain that has acquired a HA segment from a ‘B’-like strain. Therefore, the SPR move manually introduced into the HA tree by moving taxon ‘G’ to group with ‘B’ (instead of ‘F’), has been recovered by both algorithms. This result suggests that the algorithm described can accurately detect manually induced reassortment events. This is not a stringent test of the methods and more complex scenarios such as those involving multiple reassortant viruses will be more difficult to detect, however, the approach can also be shown to work well with small datasets of real viral data.
Alignments and models for Hong Kong 1999 dataset
Alignment length (nt)
PTP test p-value
PHI test p-value
Best tree (−ln L)
Another significant SPR modification (Figure 6B) involves grouping hk1073 with A/pheasant/HongKong/SH39/99 H6N1 (sh39). A general trend can be seen where the majority of trees suggest this modification for the HA and NA trees. The topologies of the six internal segments show hk1073 (of type H9N2) clustering with the H6N1 group, whereas this is not the case with the two external viral proteins, HA (where the hk1073 sequence is placed as an outgroup to the H6N1/H5N1 clade) and NA (where the sequence is placed as a sister to a H3N2 sequence) (Additional file 1: Figure S1). Also, the network diagram indicates this SPR modification for the HA and NA trees (blue edge, Figure 5). This scenario indicates one or more reassortment events where hk1073 is formed by acquiring gene segments encoding internal proteins from a H6N1 virus, or one with the same composition of genes. It was reported previously that this H9N2 isolate shares the same six internal segments with H5N1 and H9N2 strains circulating in 1997 , as well as with H6N1 1997 strains . Another significant SPR modification involves moving sh39 to cluster with the quail isolates (quail99) of type H6N1 (Figure 6C), suggested by the HA tree only. This grouping is altered in the internal segments’ trees by the presence of the hk1073 isolate, resulting in sh39 being an outgroup of the quail99/hk1073 clade (MP) or a sister clade to hk1073 (NP, NS, PB1, PB2). The result reinforces the hypothesis that the hk1073 is a reassortant in the H6N1 group.
The proposed scenario is that hk1073 is a reassortant virus that is made up of internal gene segments derived from H6N1 viruses (sh39 and quail isolates), an NA from a H3N2-like virus, and a HA from another H9 virus. The segments resulting in this genome architecture could have been incorporated into this virus at the same time in one reassortment event, or this could have occurred in many stages. A possibility is that hk1073 first acquired an NA segment from an HxN2 strain (Hx stands for any type of HA), internal segments from an H6N1-like virus that circulated in 1999 and a HA segment from either parent, generating an intermediate type. This newly formed strain would be involved in another event that gave rise to the hk1073 reassortant by acquiring a HA segment from an H9Nx virus (where Nx stands for any type of NA). Previous work  suggests that the H9N2 viruses isolated in Hong Kong in 1997 shared the segments encoding for internal proteins with the H6N1 and H5N1 viruses of that year. Also, Lin and colleagues  report that the hk1073 strain itself has not picked up any segments by reassortment since 1997 and that it is closely related to A/quail/HongKong/G1/1997. This leads us to believe that the 1997 precursor of hk1073 is the first reassortant strain with this genetic make up from which the 1999 isolates were generated. Influenza H9N2 strains in general have been observed to possess a high propensity for reassortment .
It is expected that the outgroup strain would not be involved in any reassortment events (in this analysis, the outgroup was isolated on a different continent ten years previously). However, there are some SPR operations that incorporate this sequence, which could point to reassortment events with other strains that were not included in our dataset. None of the branch swaps involving the outgroup strain appear in the results from both algorithms, therefore we didn’t consider them. Another SPR (grouping A/Environment/HongKong/43710/99 H5N1 or env99, with hk1774) was found by both algorithms (Figure 6D), but there is little agreement on which segments propose this move. The HA tree disagrees with the PB2 tree on the grouping of the hk1774 and env99 strains. It seems that these two gene segments have evolved separately, however there is not enough information to hypothesise how this may have happened and what is the source of the other segments.
In the reassortment network, the PA node is not connected by edges to any other nodes (Figure 5). Poorly supported branches in the phylogenetic topology can cause this effect. Another challenge is to determine which SPR operations are complementary to each other and this can only be done in the context of all the other trees. One of the limitations with EEEP is memory consumption  but most paths in pairwise comparisons were recovered. The main drawback in using an individual phylogeny as a starting point for the algorithm is the potential of wrongly inferring reassortment events if the start topology is incorrect. Comparing Bayesian phylogenetic tree sets in geometrical space is one way of approaching the comparison of large sets of trees. MDS was found useful in comparing multiple Bayesian analyses and for exploring sets of trees . However, its challenges include information loss during the scaling of distances into lower dimensions of space and defining tree space from a limited number of trees. Another approach could consist of constraining the MCMC tree search by using another segment’s topology, and using Bayes factors  to determine whether one hypothesis describes the data significantly better than the other. Nevertheless, the algorithms presented here have the potential to detect real reassortment events by minimising the inference of false positives.
Two statistical frameworks for identifying reassortment events using phylogeny are presented. The first algorithm (MLreassort) uses a maximum likelihood approach. Here we identify the shortest path (consisting of SPR operations) for converting a phylogenetic tree derived from gene A to the tree derived from gene B. An SPR operation is evaluated as significant when the resulting tree has a significantly lower likelihood score compared to the pre-modified tree. This was carried out for all the pairwise comparisons of trees and interpretation of results is done considering all the trees and the proposed SPR modifications. The second algorithm is based on a Bayesian approach (Breassort). The steps consist of identifying non-overlapping tree spaces, detecting SPR moves that would convert trees from the tree space of gene A to trees from that of gene B (200 comparisons), and focusing on the most frequent branch swaps. The SPR operations that are considered to result in significant changes in tree topology are events that cannot be simply explained by errors in the phylogenetic estimation and, having rejected the hypothesis of recombination, reassortment needs to be considered. We use a conservative approach when inferring reassortment events by only considering those SPRs that are found by both algorithms. Further work will involve development of the algorithm for use on larger data sets. As the size of the data set increases, it is likely that more reassortant strains would be included and a better way of representing the results would also be necessary. Nevertheless, the algorithms presented are able to detect reassortment events in any segmented viruses, using statistical methods for distinguishing between true reassortment and other causes of incongruence between phylogenetic tree topologies.
Automated computational methods for detecting reassortment events have been explored previously and these involve or avoid the use of phylogenetic analysis. Suzuki’s phylogenetic approach  is based on examining quartets of strains; the FluReF  algorithm selects candidate reassortment groups through a bottom-up search of phylogenetic trees, and another approach combines genotyping and phylogenetics to infer reassortment events . Recently, Nagarajan and Kingsford  reported GiRaF, a graph-incompatibility-based reassortment finder. The program involves the generation of an incompatibility graph from sets of trees resulting from a Bayesian tree search, followed by mining of phylogenetic discordances using a search algorithm. In contrast with our algorithm, GiRaF can be used to process large datasets efficiently. In addition to the detection of reassortment events, GiRaF also reports a measure of confidence for each prediction. Rabadan et al. have developed a statistical method to estimate the likelihood that two segments have co-evolved, without relying on phylogenetic analysis . Another algorithm based on the theory of quasispecies was reported by Wan et al. , where using segment genotype thresholds allowed the identification of reassortment events. Macken and colleagues  established a baseline of circulating genotypes at a certain time period, and used it to characterize reassortment events of viruses from a second time period. Another algorithm based on neighbourhood of strains for each segment  was developed to detect reassortment in all the unique full genomes available until June 2011. MDS has also been used to identify differences in influenza A phylogenetic history. Rambaut et al.  used MDS to plot the distances between 500 trees for segments derived from H3N2 isolates. In this case, the tree distances were calculated from the time to the most recent common ancestor for each season. Although some of these methods are more efficient in processing large datasets, we believe that our approaches provide a sensitive and accurate test for the detection of reassortment events.
We used the Influenza Virus Resource at NCBI  to extract the nine sequences of the fully completed genomes isolated in Hong Kong during 1999, using as outgroup A/redknot/NJ/325/1989 H7N7 (accession numbers provided in Additional file 1: Table S1). This set includes four strains isolated from the environment. We kept only one of these sequences due to their high similarity (≥99%). The dataset then consisted of one H5N1 virus isolated from the environment, three H6N1 strains from quail and pheasant, one human H9N2, one human H3N2 strain and a H7N7 isolate as the outgroup. Data deposited in the Dryad Repository: http://dx.doi.org/10.5061/dryad.mg040.
We compiled alignments for each of the 8 segments using MUSCLE  with the parameters set to their default values. The alignments were inspected by eye to see if any ambiguously aligned regions were included. The alignments were examined using the PTP (permutation probability) test implemented in PAUP  and also analysed with likelihood mapping implemented in TreePuzzle  (default parameters used in both cases). The best fitting model of nucleotide substitution was selected by evaluating 56 compositionally homogenous models using the Akaike information criterion test implemented in the ModelTest software . In order to eliminate the possibility that variation in phylogenetic tree topologies is due to the presence of recombinant segments, the PHI (Pairwise Homoplasy Index) test  for recombination was applied to each of the eight alignments, with the default parameters. The program uses the alignment in question and returns files with informative and unambiguous sites as well as a log file containing the PHI p-values.
PAUP*4b10  was used to carry out the maximum likelihood calculations. When the best fitting models were identified, PAUP was used in order to find the maximum likelihood tree topology. We used the Tree-Bisection-Reconnection method with 10 random addition replicates. The software program MrBayes v3.1.2  was used to carry out the Bayesian analysis. The program was run twice for each segment with 4 chains and sampling frequency of 1000, for 10 million generations with 10% of the sampled trees discarded as ‘burn-in’. We checked for convergence of the two runs by using the standard deviation of split frequencies, which is a measure of the similarity of the tree samples of the two independent runs. As the two tree samples become increasingly similar, the average standard deviation of split frequencies approaches zero. Only the trees sampled after this convergence point were used for further analysis.
In order to get a general overview of the differences between tree topologies, Robinson-Foulds (RF) distances between the trees were calculated using PAUP. We then carried out 56 pairwise comparisons (all possible pairwise comparisons of the 8 segments) using the EEEP (Efficient Evaluation of Edit Paths) software . EEEP is a program that seeks to reconcile a reference tree and a test tree, by finding the minimum number of SPR moves required to convert the former into the latter. The output of the program consists of the smallest set of SPR permutations that could convert the reference tree topology into the test tree topology. Such a set of SPR operations is called an edit path, and it is possible for more than one path to be found. The goal is to find the shortest edit path, or the smallest number of SPR operations that can reconcile the reference and test trees. As these transfers are analogous to lateral gene transfer (LGT) events, the edit path output by the program corresponds to a minimum set of lateral gene transfers between strains in the reference tree. The following settings were used: partitioning of data into regions of discordance, using time constraints and permissive tree distance ratchets (see original publication for details of the algorithm). Ratchets can be used to assess whether a modified reference tree is more or less similar to the test tree than the pre-modification tree. The use of ratchets can reduce the running time of the algorithm but the frequency of the cases where no solution is found increases . EEEP was run using a reference tree, a test tree and a support threshold of 0, which will not collapse any nodes into polytomies. Perl scripts were developed to run the software on each of the eight segments’ rooted trees using each of the other seven segments’ unrooted trees as test trees, and to parse the output. The SPR moves suggested by EEEP were used to modify the reference tree (code available on request) and a set of intermediate trees was compiled for each pairwise comparison.
In order to run the AU test, PAUP was used to produce a text file with the site-wise log-likelihoods for each set of trees resulting from the EEEP analysis. This log-likelihood file was used as input for CONSEL . CONSEL is a program package for assessing the confidence of tree selection i.e. to evaluate which trees are within the confidence set. The confidence set, the set of trees that are not rejected by the tests, is expected to include the true tree . The output consists of a list of p-values for each of the trees in the input file, one of which corresponds to the AU test statistic. We used a p-value cut-off of 0.01 to determine the confidence sets. For all the trees in the confidence, we can suggest that the differences between them are not statistically significant and can be explained by sampling artifacts. Trees outside the confidence set are a significantly worse explanation of the data. The null hypothesis (that there is no reassortment) is rejected by the trees with a p-value lower than a chosen threshold. The approximately unbiased (AU) test (implemented in CONSEL) was run twice in order to obtain confidence intervals around both the reference and test trees. To find the confidence interval around a reference tree, the site-wise log likelihoods were calculated using the alignment for the reference tree. To find the confidence interval around the test tree, the procedure was repeated but this time the alignment for the test tree was used when getting the log likelihoods. (Additional file 1: Figure S2) shows diagrammatically how an SPR is determined to be significant, given different SPR paths between two confidence sets of trees. The interpretation is as follows.
A significant branch move (or edit) is one that, when applied to a tree, causes a significant change in tree topology. Additional file 1: Figure S2a and Sb shows those instances where the trees on the edit path are always within the confidence set of optimal trees of at least one alignment. In these cases, by definition, there are no significant transitions. In the first case, the two confidence intervals overlap and in the second, the confidence set for Tr (reference tree) is a subset of the confidence set for Tt (test tree). In Additional file 1: Figure S2c and Sd, the edit path connects two non-overlapping confidence sets. In both cases, we would consider that LGT is the best explanation of the observed data. If multiple transfers are found in each path between these confidence intervals, the shortest path is considered (Additional file 1: Figure S2e). If transitions in a path move in and out of a confidence interval, the last transition that moves out of the confidence region is considered (Additional file 1: Figure S2f). When multiple trees are found outside both confidence sets, pairwise analysis of neighbouring trees needs to be carried out again using the AU test, in order to determine the significance of the transitions that connect them.
Multidimensional scaling, also known as principle coordinate analysis, is a way of representing pairwise distances between objects . The idea is based on ordering points (representing phylogenetic trees) in geometrical space such that similar points are closer to each other. Pairwise tree sets were compared by visually representing them in 2D space. In order to do this, we combined the sets of trees for two segments at a time (26 pairwise sets) and obtained the Robinson-Foulds distance between all trees in the combined sets (this includes distances between trees within a set, and between the two sets). The resulting matrix was analysed using the R statistical software . The high dimensionality of the data makes it impossible to explain and visualize the observed distances between the trees and therefore multi-dimensional scaling was used to reduce the data to two dimensions. The stress, which is a measure of the squared differences between ideal distances and actual distances, needs to be minimized so as to better reflect the original distances between objects. A standard statistical method for ordering points into convex hulls was applied . Convex hull peeling was carried out to remove the outermost points in a cloud (e.g. 5%) in order to eliminate outliers. The remaining 95% hulls were compared. A plot of two sets of points was obtained, representing trees derived from two different segments. If the clouds of points (more precisely, their 95% confidence sets) overlap, we conclude that no significant differences between the tree topologies in the two sets exist. However, if there is no overlap, we can hypothesize that the trees in the two sets reflect different evolutionary histories. In this case we carried out EEEP analysis between the two clouds of trees. We carried out 200 EEEP analyses (100 in one direction, and 100 in reverse) between the two sets, randomly picking the trees to compare. This resulted in a set of significant SPR branch moves. We deemed any SPR modification that was present in more than 70% of paths between two tree sets as worthy of consideration. Python packages reportlab and rpygraphviz were used to draw a frequency (or reassortment) network to visually represent these results.
Plots were used to visualize the overlap between the results from the maximum likelihood and Bayesian approaches together. The lists of SPR moves from each algorithm were compared and a plot was drawn for each SPR found by both approaches. The x-axis shows the eight segments and any modifications their corresponding trees (or tree sets) propose. The y-axis shows the trees that need to be modified. For example, if the HA tree proposes an SPR move (say hk1073 to hk1774, Figure 6A) to the MP tree, this will be illustrated by a mark at the HA x-axis and MP y-axis positions. The marks can be either circles or crosses. A circle indicates that the SPR comes from the MLreassort, whereas a cross indicates a result from the Breassort. We took this conservative approach order to eliminate method-dependent inferences of reassortment.
The algorithms were tested on a simulated dataset in order to assess their capacity to detect manually induced reassortment events. The program Seq-Gen  was used to simulate eight alignments on a tree derived from the HA segment of the Hong Kong data set. An alignment of seven sequences was generated for each gene segment using the GTR model of nucleotide substitution, and the alignment length is equivalent to the length of the respective gene (e.g. the simulated alignment for HA is 1,778 bp in length). The sequences were named using letters A-G. Phylogenetic trees were built from these alignments and were rooted on taxon ‘A’. An SPR modification was then manually introduced in the tree used to generate the HA data, by moving taxon ‘G’ to group with taxon ‘B’ rather than with taxon ‘F’. The algorithms were reapplied to the simulated data with the modified HA alignment, in order to test whether they can detect this manually induced SPR.
This work was supported by a Science Foundation Ireland Research Frontiers Programme grant (Grant No. 09/RFP/EOB2510) and the NUI Maynooth John Hume Scholarship.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.