The notion that inaccurate specification of dates used for molecular rate calibration could produce misleading results is not a novel one [12, 35, 40–43]. In the context of RHDV in particular, seven partial VP60 sequences were identified as misdated modern contaminants by maximum likelihood analysis , but were nonetheless included in one BEAST analysis , resulting in the slowest substitution rate published for RHDV. It has also been suspected for some time that certain taxa could have a strong influence over inferred phylogenies, and a number of methods have been developed to identify weak clades in phylogenetic trees [44–47] or the highly influential taxa responsible for weakening phylogenetic relationships . However, there is currently no direct method for identifying the presence of a taxon, such as AY269825, that has significant influence over evolutionary rate and TMRCA estimates, while not altering phylogenetic relationships. Further, the influence of misdated taxa on estimates of evolutionary parameters has yet to be extensively examined or quantified.
The comprehensive jackknifing controls used here demonstrate that the Chinese RHDV isolate NJ/China/1985, GenBank accession number AY269825, was responsible for dragging down the substitution rate estimate for the VP60 gene by 65%. The Chinese-language paper that described this taxon [49
] contained important details about the isolation and handling of this strain. The first paragraph of the methods section (see translation below) revealed that, though it was isolated from nature in 1985, it was maintained in the laboratory for vaccine preparation and was likely not sequenced until much later. The 2003 submission date of AY269825 to GenBank is concordant with the lower bound of the age estimated by the leaf-dating method, as well as its grouping with isolates from 2006–2009 in the MCC and ML trees (Figure 1
"“RHDV NJ85 strain isolate was discovered and characterized byInstitute of Veterinary medicine, Jiangsu Academy of Agricultural Sciences (JAAS), from rabbits raised in an unknown farm in Nanjing City, China, in 1985. Since the discovery, this strain hasbeen maintained in lab rabbits until now. This strain has been usedto prepare potent rabbit hemorrhagic disease vaccine for years Our lab has cloned the gene VP60 in E. coli JM109 and BL21(DE3).”"
The grouping with much more recent isolates could be explained by AY269825’s use in RHDV vaccine production. Whether attenuated or improperly inactivated, the strain could have been released into China, and now this lineage can be isolated from other regions of China and Russia ([50, 51]; see GenBank file for HM623309 and FJ794179). This is similar to the lab-escape strain that complicated substitution rate estimation for Influenza A virus . Instead of changing many dates of isolation, however, only AY269825 would have to be assigned a different date (removing the four taxa that grouped with AY269825 did not change the estimated substitution rate or TMRCA of RHDV, data not shown). Previous studies have shown that the long-term rate of viral evolution in the lab can mimic the rate in nature , so unlike the 20 years of frozen stasis that the Influenza isolate experienced, AY269825 was changing at a rate similar to its wild relatives.
However, consistent results were obtained by excluding this isolate from analyses altogether. Without AY269825, the TMRCA for the entire virulent RHDV complex and each of its lineages is substantially lower, resolving the much of the debate over its puzzling evolutionary history. Indeed, without this misdated taxon, the ancestor of the entire complex is estimated to have existed between 1955-1978 (Table 1), as few as six years before the 1984 appearance of RHD. These results cannot address whether virulence was a shared trait of the most recent common ancestor of RHDV, or if it evolved independently in multiple lineages. Whenever virulence did evolve, it did not have to go undocumented for several decades [9, 12, 14, 15, 23, 25, 32, 33].
While the root-to-tip regression analysis identified AY269825 as potentially deviating from the molecular clock, FR823354, a taxon that did not affect VP60’s evolutionary dynamics (Figure 5, ), also had a similarly high residual value (0.055 cf 0.049). This underscores the problem of using congruence with a strict molecular clock as the sole means of assessing the validity of dates of isolation. Deviation from the root-to-tip regression line is expected for viruses with variable rates of evolution, which would be accurately modeled with a relaxed molecular clock BEAST analysis [11, 53–57]. Indeed, correlation between genetic distance and time was stronger for the VP60 dataset including the misdated AY269825 taxon than the RdRp dataset, which produced a more trustworthy substitution rate (r = 0.76 compared to r = 0.70). The decision to include or exclude taxa based on residuals from a best-fitting line is largely subjective, as there are no guidelines for how common large residuals are in tip-dated viral datasets. In fact, one of the AY269825-containing RHDV datasets had been subjected to this control prior to rate analysis, and the authors did not reject it as an outlier . Finally, root-to-tip regression provides no insight into the magnitude of effect of any taxon on evolutionary estimates. The jackknife controls proposed here focus on detecting taxa that have had a disproportionate effect on the BEAST results, and, in the case of RHDV, offered strong quantitative evidence against including AY269825.
Another interesting finding is that without AY269825, the estimated substitution rate for the VP60 gene is almost identical to that of the RdRp gene, despite the fact that the latter dataset had fewer than half as many taxa (Table 1). Even while the RdRp dataset did not have as strong a temporal signal (Additional file 3), probably due to the lower number of taxa, it still produced a significantly more accurate substitution rate estimate than datasets with two to three times as many taxa that included just one misdated taxon (Table 1). Further, the estimated substitution rate from our VP60 RHDV + RCV dataset was nearly identical to that from another dataset which contained just 29 taxa (27 RHDV, 1 RCV, did not contain AY269825), including one distantly related European brown hare syndrome virus taxon as an outgroup . This pattern of different sized datasets producing very similar substitution rates is not unique to RHDV. For example, two BEAST analyses of Dengue virus type 2 from datasets of 115 taxa and 67 taxa yielded nearly identical substitution rates [58, 59]. Further, in the case of Human parechovirus, three BEAST analyses of three different genomic regions based on datasets with a range of 29–199 taxa also produced nearly identical substitution rates [55, 60].
It is evident that assigning years of isolation to taxa should be done with great caution in tip-calibrated rate analyses. These results support favoring data sets with fewer taxa with verifiable dates of isolation over larger data sets with less quality control: additional good data do not swamp out the effects of one badly dated taxon. When researchers are including any ambiguously dated taxa, or when they want to be certain about the effects of each taxon on the rate analysis, jackknife controls provide a clear way to see these effects. As many sequences are added to GenBank without easily accessible papers describing in detail the isolation, passaging and sequencing of each isolate, it is necessary to verify if one or more of the sequences is having a disproportionate influence on the results. We propose n-1 jackknifing as one method for researchers using tip-calibrated analyses in BEAST to ensure that a small number of taxa are not spoiling their estimates.