Dating the time of viral subtype divergence
© O'Brien et al; licensee BioMed Central Ltd. 2008
Received: 28 January 2008
Accepted: 09 June 2008
Published: 09 June 2008
Precise dating of viral subtype divergence enables researchers to correlate divergence with geographic and demographic occurrences. When historical data are absent (that is, the overwhelming majority), viral sequence sampling on a time scale commensurate with the rate of substitution permits the inference of the times of subtype divergence. Currently, researchers use two strategies to approach this task, both requiring strong conditions on the molecular clock assumption of substitution rate. As the underlying structure of the substitution rate process at the time of subtype divergence is not understood and likely highly variable, we present a simple method that estimates rates of substitution, and from there, times of divergence, without use of an assumed molecular clock. We accomplish this by blending estimates of the substitution rate for triplets of dated sequences where each sequence draws from a distinct viral subtype, providing a zeroth-order approximation for the rate between subtypes. As an example, we calculate the time of divergence for three genes among influenza subtypes A-H3N2 and B using subtype C as an outgroup. We show a time of divergence approximately 100 years ago, substantially more recent than previous estimates which range from 250 to 3800 years ago.
Precise estimates are sorely lacking for dating the emergence and divergence of viral subtypes. Improved estimates equip epidemiologists and virologists to begin to correlate these important establishing events with historical demographic changes, geographical invasions and zoonoses, the transferring of a virus from one host species to another [7, 1, 25]. For example, archeological sequence data can furnish accurate dates and show that substantial genomic changes associate with geographical invasion and zoonosis [14, 17]. Further, the recent availability of viral gene sequences sampled at a pace commensurate with their rate of nucleotide substitution vastly augments the ability to rigorously infer the time scale of phylogenies and hence determine the time of the most recent common ancestor (TMRCA) for different viral types [18, 26, 6].
Systematic studies characterize the substitution process and substitution rate process of several classes of viral subtypes in, for example, Dengue, influenza subtype A, human immunodeficiency virus (HIV) and the virus responsible for sudden acute respiratory syndrome (SARS). For the last three viruses, a unique zoonotic transfer appears to co-occur with substantial changes in both the composition of nucleotides and amino acids as well as alterations in the rate of nucleotide substitution [15, 14, 1]. In Dengue, where a single subtype simultaneously inhabits two hosts (humans and Aedes aegypti) in a persistent zoonotic process, the introduction of the virus to new geographical environments associates with a dramatic increase in sequence diversity . Unfortunately, no studies thus far analyze the rate of nucleotide substitution during either geographical invasion or zoonosis. Consequently, studies of the date of origins of viral subtypes must use strong a priori assumptions on the rate structure of nucleotide substitution.
Two primary methods find use to date the time of viral subtype divergence. The most commonly employed approach determines the divergence time of subtypes using a molecular clock assumption (MCA) over an entire phylogeny [18, 21, 5, 26]. In its strict formulation, the MCA posits a proportional relation between the number of substitutions and the intervening time period over the entire phylogeny. Looser forms of MCAs require only that the proportionality hold along individual branches, with the rates across branches drawn from a pre-specified distribution . Committed to some variant of the MCA, current algorithms then estimate the rate of nucleotide substitution over all taxa in a given set. Consequently, these methods provide inference most suitable for situations where sequence evolution follows a MCA (e.g. influenza A-H3N2 in human hosts, as in ) or deviates from the MCA homogenously in time (e.g. perhaps influenza A in wild fowl, see ). In considering divergence events between viral subtypes, even when the MCA well-approximates nucleotide substitution within a given subtype, the above methods may incorrectly infer the time of divergence across subtypes. By either assuming that a single rate of nucleotide substitution holds for the region preceding the common ancestor of each subtype or by smoothing the rate of nucleotide substitution over clades with different numbers of taxa, the adherence to a MCA prevents direct inference of the rate during subtype divergence.
Suzuki and Nei (2002) propose an alternative, more heuristic method of estimation to counteract the problem of differing rates of substitution before and after zoonotic events [23, 25]. In these studies, the evolutionary models draw a distinction between the rate of substitution within a given subtype and the rate of substitution between subtypes. However, trouble arises since there are no methods for estimating the latter quantity. Consequently, the models assume that the rate of substitution for portions of the phylogeny between the subtypes equals the mean rate in the initial host species population. For instance, in dating the time of divergence between influenza B hemagglutinin and influenza C hemagglutinin-esterase, Suzuki and Nei use the rate of amino acid substitution for water fowl for the portions of the phylogeny previous to the TMRCA of these two proteins . While this method may accurately reflect the rate within avian and human hosts, it neglects whatever additional changes in the rate of substitution are due to the process of zoonotic adaptation, likely leading to a substantial underestimation of the date of the TMRCA.
The study here focuses on influenza, although the techniques are readily applied to other rapidly evolving organisms. Influenza has three types, A, B and C, classified based on serological analysis. To date, only type A sequences have been demonstrably associated with global pandemics . Since modern surveillance began in the 1930s, type B has only been responsible for mild epidemics while type C has been nearly asymptomatic in human infection. Several subtypes of A, notably H1N1 and H3N2, are currently co-circulating in the human population. As the H1N1 and H3N2 subtypes may be as divergent from each other as they are from types B and C, we will refer to all types and subtypes simply as subtypes for the remainder of this paper. We select for this study three genes, coding for hemagglutinin (HA), the matrix protein (MP) and the non-structural protein (NS) responsible for interfering with host immune response. Subtype C has a hemagglutinin-esterase gene that is analogous to the hemagglutin gene in other subtypes . We hence refer to the hemagglutinin gene generally and the hemagglutinin-esterase gene when referring specifically to the subtype C sequences.
We present a simple estimation tool to determine the date of divergence among viral subtypes that overcomes the difficulties encountered with use of the MCA by measuring the pairwise rate of substitution between taxa. Our estimator derives from the triplet statistic developed in [26, 22, 13], where each sequence member of the triplet draws from a different subtype. In this manner, we generate from each triplet an estimate of the rate of nucleotide substitution between the most recently diverged subtypes, and consequently provide an estimate of the TMRCA. This circumvents the problems posed by earlier methods by directly estimating the pairwise rate of nucleotide substitution over the set of pairs of sequences straddling the subtype divergence without any further rate assumptions other than the existence of a mean. However, this method is only capable of determining the rate between two subtypes where a third, more distantly related, subtype functions as an outgroup. This method thus trades the ad hoc rate assumptions of the previous methods with two implicit conditions: (i) that subtypes have a unique divergence and (ii) a third, comparable subtype is available to serve as an outgroup. In exchange, we arrive at a precise statistical measure of the TMRCA that converges as the number of taxa increases and is robust to the balancing of the numbers of taxa between different subtypes. We show that applying this method to dating the divergence of influenza subtypes A-H3N2 and B gives a time of divergence approximately 100 years before present, substantially more recent than previous estimates.
Subtracting the first equation from the second equation yields D ik - D jk = p ij (t j - t i ), which is equivalent to Equation 1. This derivation makes clear that the estimator (1) measures the rate along the path from sequence i to sequence j, with only incidental dependence on sequence k.
We note that the term t i + t j - 1 is used rather than t i + t j to account for the expected error coming from the uniformly distributed ν i and ν j .
As nucleotide data increases without bound, K ij → D ij and → p ij , ensuring that → α ij . For finite sequence lengths, this relation ensures that . To gain an understanding of this estimator, we note that with a standard model of substitution (e.g. JC69, HKY85), a rate of substitution of 10-4 (s/s/yr) and a sequence of 2000 nucleotides, the above estimator yields a standard error of approximately 23 years .
where P is the sum of the inverse variance of each estimate, .
where P α is the sum of the inverse variance of each estimate, . Having found , we estimate its variance by a bootstrap resampling of sequences from each subtype .
The computational efficiency of this estimator is on the order O(n3) for a tree of n taxa. This is natural as each of the initial rate estimates is composed of information concerning three taxa. While the growth of computational expense in the number of taxa may appear unpleasant, in practice this algorithm is both fast and stable, owing to the absence of costly optimization procedures for parameter inference, and is able to handle data sets of thousands of taxa. The authors detail the computational efficiency of a similar statistic in . As an example, for the data presented below all computations required only a few seconds on a desktop computer.
Data and Results
Within-subtypes rates of nucleotide substitution for hemagglutinin (HA), matrix (MP) and non-structural (NS) genes for subtypes A-H3N2, B and C.
3.21 ± 0.43
2.31 ± 0.37
0.68 ± 0.18
1.57 ± 0.38
2.20 ± 0.48
1.31 ± 0.33
2.14 ± 0.25
1.92 ± 0.20
1.68 ± 0.51
Assuming a molecular clock within a subtype and with the rates above, we generated the corresponding dates of the TMCRA. Figure 3 shows histograms of the TMRCA estimates for different genes and subtypes. All genes are similar in dating the TMRCA for A-H3N2 to approximately 1965 (1964, 1965, and 1962 for HA, MP and NS genes, respectively). These dates are consistent with the emergence of the A-H3N2 subtype into global circulation during the 1968 pandemic . Both the MP and NS genes date the TMRCA of subtype B to 1943, while the HA rate places the TMRCA at 1953. This latter value is inconsistent with the influenza B sub-epidemics of 1950–51 but is consistent with the emergence of the more lethal Victoria strain of influenza B in 1953 . Each of these estimates has a standard error of approximately 2 years and so these discrepancies may be accounted by measurement uncertainty. The 10 year gap between the TMCRA suggested by the different genes can be explained by a reassortment event. Finally, the TMRCA of subtype C is calculated as 1952 and 1953 by the MP and NS genes, respectively, while the HA gene places the TMCRA at 1906. This nearly half century discrepancy suggests that the subtype C HA gene experienced a markedly different evolutionary history than either the MP or the NS gene. A biologically plausible explanation would be a reassortment event. Another possible explanation is that non-MCA rate behavior has lead to substantial bias in dating the TMRCA.
Across-subtype rates of nucleotide substitution between subtypes A-H3N2 and B for hemagglutinin (HA), matrix (MP) and non-structural (NS) genes.
We present a new method for ascertaining the rate of nucleotide substitution between subtypes and apply this method together with traditional MCA methods to date the divergence of influenza subtypes A-H3N2, B, and C. We use three genes, HA, MP and NS, to date two types of divergence events: the time of the most recent common of each subtype and the time of divergence between two subtypes, A-H3N2 and B. For the former event type, we show that the three genes are loosely consistent in their dating of the TMRCA of the subtypes, with the notable exception of the HA-derived estimate of subtype C's TMRCA approximately 50 years before the MP- and NS-derived estimates. This discrepancy may indicate either that subtype C's hemagglutinin-esterase gene engaged in a biologically significant event, such as reassortment, or that MCA estimation does not adequately model the evolution of the gene.
For the divergence between subtypes A-H3N2 and B, previous studies using the MCA generally place a time of divergence of several hundred years ago, ranging from the 16th to early 19th centuries. Other analysis have yielded estimates of 3600 years ago . In the current study, application of the MCA yielded estimates in the last half of the 18th century. However, applying the pairwise rate estimate developed above we find uniformly, across genes, that the divergence likely occurred in the very early 20th century. The discrepancy between these two measures is likely due to the increased modeling flexibility of the pairwise rate estimate relative to the MCA.
This discrepancy between the rates and corresponding TMCRA estimates has important biological consequence. The phylogenetic divergence between subtype A-H3N2 and B corresponds to a subspeciation event for the virus. The results in this study indicate that the process of speciation is not neutral but instead a period of rapid and intense genetic change. The three genes studied here consistently show large acceleration in the rate of nucleotide substitution for the divergence period relative to the rates observed within a stable subtype. This study gives strong evidence that, at least for influenza viral subtype divergence, the process of subspeciation is associated not just with large genomic changes but also with an accelerated, finite process of adaption.
Assuming that the more recent estimate is correct, a subsequent question is whether or not a pandemic or epidemic associates with subtype A-H3N2/B divergence. In the twentieth century, all influenza pandemics associate with the emergence or reemergence of subtypes (A-H1N1 in 1918, A-H2N2 in 1957 and A-H3N2 in 1968). Serological analysis indicates that the 1897 pandemic was likely due to subtype A-H2N2. However, the pandemic of 1900 is of uncertain type, although it is commonly reported in the literature as being due to A-H3N2 . The above analysis suggests that it is possible to postulate that the cause of this pandemic is due to the emergence of subtype A-H3N2 or B.
As noted above, we condition the results presented here on a specific sequence alignment. As the question under consideration concerns the divergence of specific genes and proteins over a (presumably) long time scale, the capacity to generate reasonable alignments diminishes with increasing time of divergence between types, conditional on the rate of substitution. We find that for the hemagglutinin gene, a proportion of sequence alignments support the split of subtype B from subtype C after the split between subtypes A-H3N2 and B, in opposition to the topology enforced in our analysis. Hence, to some unknown degree, our analysis is necessarily biased by the choice of alignment. This suggests that improved dating can be found by integrating estimation procedures over an ensemble of alignments .
where is the total time over the phylogeny and p is mean rate over the phylogeny . This relation dictates that as divergence events become more remote the ability of the triplet method to resolve the time of divergence diminishes. While this limit prohibits the calculation of remote divergence events, the example presented above lies within the appropriate scale.
In place of a specific MCA, the estimates presented here directly calculate the rate of substitutions between taxa from different viral subtypes. As such estimates span paths between subtypes, they simultaneously capture the rate evolution along branches both within and between subtypes. From these estimates, we are able to directly infer the time of divergence between subtypes. As a trade-off for limited MCAs, the method requires an outgroup subtype to function as an origin relative to the subtypes under consideration. We feel that the triplet method provides a simple and widely applicable way to calculate the dates of divergence of rapidly evolving organisms without the pitfalls of the MCA.
We present a simple method for calculating the time of viral subtype divergence that does not assume a molecular clock over the entire phylogeny. Additionally, the estimator of this method, a weighted sum of pairwise estimates, furnishes a defined variance for the time of the most common ancestor between subtypes. As a tradeoff for this increased precision, the structure of the triplet statistic requires an outgroup set of sequences, usually a closely related subtype. We apply this estimator to the case of influenza subtype divergence, considering three genes. We show that the estimated divergence time of subtypes A-H3N2 and B is more than a century later than those calculated with a molecular clock.
that has been previously used in the paper outlining the TREBLE algorithm , and originates in . However, this apparently natural statistic is substantially biased when the sampling times of sequences i and j are close. To be seen in the following derivation, this bias is the result of the time sampling error structure.
JD O'Brien was supported by the NIGMS Systems and Integrative Biology Training Grant for the duration of this work. MA Suchard is supported by an Alfred P. Sloan Research Fellowship in Computational and Evolutionary Molecular Biology and a John Simon Guggen-heim Memorial Fellowship.
- Brown EG: Influenza virus genetics. Biomedical Pharmacotherapy. 2000, 54: 196-209. 10.1016/S0753-3322(00)89026-5.View ArticleGoogle Scholar
- Buonagurio DA, Nakada S, Fitch WM, Palese P: Epidemiology of influenza C virus in man: multiple evolutionary lineages and low rate of change. Virology. 1986, 153 (1): 12-21. 10.1016/0042-6822(86)90003-6.View ArticlePubMedGoogle Scholar
- Chen R, Holmes EC: Avian influenza virus exhibits rapid evolutionary dynamics. Moleclar Biology and Evolution. 2006, 23 (12): 2336-2341. 10.1093/molbev/msl102.View ArticleGoogle Scholar
- Dowdle WR: Influenza pandemic periodicity, virus recycling, and the art of risk assessment. Emerging Infectious Diseases. 2006Google Scholar
- Drummond A, Ho SY, Phillips MJ, Rambaut A: Relaxed phylogenetics and dating with confidence. Public Library of Science Biology. 2006, 4 (5):Google Scholar
- Drummond A, Nicholls GK, Rodrigo AG, Solomon W: Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics. 2002, 161: 1307-1320.PubMed CentralPubMedGoogle Scholar
- Drummond A, Pybus OG, Rambaut A, Forsberg R, Rodrigo AG: Measurably evolving populations. Trends in Ecology and Evolution. 2003, 18 (9): 481-488. 10.1016/S0169-5347(03)00216-7.View ArticleGoogle Scholar
- Efron B, Tibshirani RJ: Introduction to the Bootstrap. 1993, CRC Press, New YorkView ArticleGoogle Scholar
- Ferguson NM, Galvani AP, Bush RM: Ecological and immunologial determinants of influenza evolution. Nature. 2003, 422 (6930): 428-433. 10.1038/nature01509.View ArticlePubMedGoogle Scholar
- Hasegawa M, Kishino H, Yano T-A: Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution. 1985, 22 (2): 160-174. 10.1007/BF02101694.View ArticlePubMedGoogle Scholar
- Hennessy AV, Minuse E, Davenport FM: A twenty-one-year experience with anitgenic variation among influenza B viruses. Journal of Immunology. 1965, 94 (2): 301-306.Google Scholar
- Huber PJ: Robust statistics: A review (1972 Wald lecture). Annals of Mathematical Statistics. 1972, 43 (4): 1041-1067. 10.1214/aoms/1177692459.View ArticleGoogle Scholar
- Kashyap R, Subas S: Statistical estimation of parameters in a phylogenetic tree using a dynamics model of the substitutional process. Journal of Theoretical Biology. 1974, 47 (1): 75-101. 10.1016/0022-5193(74)90100-3.View ArticlePubMedGoogle Scholar
- Lemey P, Pybus O, Wang B, Saksena NK, Salemi M, Vandamme A-M: Tracing the origin and history of the HIV-2 epidemic. Proceeding of the National Academy of Sciences. 2003, 100 (11): 6588-6592. 10.1073/pnas.0936469100.View ArticleGoogle Scholar
- Lu H, Zhao Y, Zhang J, Wang Y, Li W, Zhu X, Sun S, Xu J, Ling L, Cai L, Bu D, Chen R: Date of origin of the SARS coronavirus strains. BMC Infectious Diseases. 2004, 4: 3-10.1186/1471-2334-4-3.PubMed CentralView ArticlePubMedGoogle Scholar
- Macken C, Lu H, Goodman J, Boykin L: The value of a database in surveillance and vaccine selection. Options for the Control of Influenza IV. Edited by: Osterhaus A, Cox N, Hampson A. 2001, Elsevier Science, Amsterdam, NL, 103-106.Google Scholar
- Mills CE, Robins JM, Lipsitch M: Transmissibility of 1918 pandemic influenza. Nature. 2004, 432 (7019): 904-906. 10.1038/nature03063.View ArticlePubMedGoogle Scholar
- Rambaut A: Estimating the rate of molecular evolution: Incorporating non-contemporaneous sequences into maximum likelihood phylogenies. Bioinformatics. 2000, 16 (4): 395-399. 10.1093/bioinformatics/16.4.395.View ArticlePubMedGoogle Scholar
- Redelings B, Suchard MA: Joint Bayesian estimation of alignment and phylogeny. Systematic Biology. 2005, 54 (3): 401-418. 10.1080/10635150590947041.View ArticlePubMedGoogle Scholar
- Rzhetsky A, Nei M: Tests of applicability of several substitution models for DNA sequence data. Molecular Biology and Evolution. 1995, 12 (1): 131-151.View ArticlePubMedGoogle Scholar
- Sanderson MJ: r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics. 2003, 19 (2): 301-302. 10.1093/bioinformatics/19.2.301.View ArticlePubMedGoogle Scholar
- Seo TK, Thorne JL, Hasegawa M, Kishino H: A viral sampling design for testing the molecular clock and for estimating evolutionary rates and divergence times. Molecular Biology and Evolution. 2002, 18 (1): 115-123.Google Scholar
- Suzuki Y, Nei M: Origin and evolution of influenza hemagglutinin genes. Molecular Biology and Evolution. 2002, 19 (2): 501-509.View ArticlePubMedGoogle Scholar
- Thompson JD, Gibson TJ, Plewniak F, Jeanmourgin F, Higgins DG: The Clustal X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research. 1997, 25 (24): 4876-4882. 10.1093/nar/25.24.4876.PubMed CentralView ArticlePubMedGoogle Scholar
- Twiddy SS, Holmes EC, Rambaut A: Inferring the rate and time-scale of Dengue virus evolution. Molecular Biology and Evolution. 2001, 20 (1): 122-129. 10.1093/molbev/msg010.View ArticleGoogle Scholar
- Yang Z, O'Brien JD, Zheng X-B, Zhu H-Q, She Z-S: Tree and rate estimation by local evaluation of heterochronous data. Bioinformatics. 2007, 23 (2): 169-176. 10.1093/bioinformatics/btl577.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.