We retrieved 3281 complete sequences of human HBV and one full-length sequence of woolly monkey HBV from the GenBank of the National Center for Biotechnology Information available on April 2011 . The full sequence set comprised 320 genotype A, 387 genotype B, 836 genotype C, 383 genotype D, 221 genotype E, 72 genotype F, 15 genotype G, 19 genotype H, and 1043 unknown or uncertain genotype sequences. The genotypes assigned to the different sequences were obtained either directly from the GenBank records or from the associated publications.
All the sequences were screened to exclude entries that were related to patents, artificial mutants, and identical sequences. Further, sequences with unknown, uncertain genotype or documented recombination information were removed. The remaining sequences were aligned using the MUSCLE software with default parameters . Results of the alignments were checked manually for further validation. Gaps (insertions/deletions) and all nonstandard nucleotide bases (all characters except A, C, G, T, and –) were considered as missing values in further analysis. After that, sequences with more than 20% gaps or missing data were removed. Positions of sites were identified by their relative positions to the traditional hypothetical EcoRI site in the full-genome alignments.
To achieve a fair and representative presentation for all the genotypes, we applied a multi-step procedure to remove extra sequences from the initial sequences set. In the first step, we sequentially removed sequences with high similarity to any others until all remaining sequences had a pairwise difference larger than or equal to 2.5%. After the initial cleaning, the sequence pool had 379 full-length HBV sequences (including 38 genotype A, 82 genotype B, 138 genotype C, 77 genotype D, 32 genotype E, 9 genotype F, 2 genotype G, and 3 genotype H).
From the filtered sequences, 30 sequences were randomly drawn for each of genotypes A, B, C, D, and E. Genotypes F, G, and H were not included in further analysis because the purpose of the present study was to elucidate the phylogenetic relationship of genotypes A, B, and C. Furthermore, to involving the limited sequences of genotypes F, G, and H (9 genotype F, 2 genotype G, and 3 genotype H) in the analysis may produce problematic results due to unequal number of involving sequences of each genotype. The full-length HBV sequence of woolly monkey was considered as an ancestral reference (outgroup) in this study . This woolly monkey HBV sequence and the randomly selected human HBV sequences were combined together and aligned by MUSCLE with default parameter settings. To improve the data quality of the aligned sequences, GBLOCKs was used to remove aligned columns with more than half gaps or with low data quality [35, 36]. In total, 105 columns (3.2%) were removed in the process. The working dataset therefore included 151 full-length sequences of HBV for further phylogenetic investigation.
Constructing a consensus phylogenetic relationship
A sliding window approach was used in which an analyzing window moves along the aligned HBV sequences with the same step length (10 bp), but a different window size in different runs. The work of sliding window is similar with that of previous publication about recombination detection . Analysis of the results from different runs with different window sizes (250 bp, 500 bp, 750 bp, 1000 bp, 1250 bp, or 1500 bp) could show how differences in window size impact phylogeny reconstruction. In each stop of the window movement, local phylogenetic trees of the aligned sequence fragments were reconstructed by Ninja software using the neighbor-joining method and Kimura 2 parameter model . With the given outgroup, all the local phylogenetic trees were further split into primary rooted triplets. From each local phylogenetic tree, 551,300 (
, the number of combinations of any 3 sequences from the given set of 150 HBV sequences) primary rooted triplets were obtained. Because of the circular characteristic of HBV genome, the initial start of HBV sequences were concatenated at the end of the original sequences, in order to make each base have an equal coverage by the sliding window.
The primary rooted phylogenetic triplets of each window in each run were filtered to remove the minor triplets that presented two different minor phylogenetic relationships. It is worth to note here that, for every combination with 3 human HBV sequences and the root, there were three possible topologies for each window in each run and the three topologies were not compatible with each other. We took only one of the possible topologies, i.e. the major triplet, for further analysis. The removed triplets were less common and inconsistent with the major phylogenetic relationship presented in the same analyzing window (see Results for further details, Figure 1). The remaining rooted triplets from all the analyzing windows in the same run were then pooled together to reconstruct a consensus tree using the rooted triplet consensus method . Ewing, et al. (2008) declared that the consensus method based on rooted triplets outperformed the extended majority rule consensus strategy. We constructed consensus phylogenetic relationships of HBV genotypes in different runs separately using different window sizes.
Evaluating the reliability of the reconstructed phylogenetic relationship
The reliability of the reconstructed phylogenetic relationship of HBV sequences can be evaluated by comparing the consensus phylogenetic relationship with phylogenetic trees of genome segments (local phylogenetic trees). Good consistency between them would indicate good reliability of the consensus phylogeny. In this study, multiple comparisons were conducted to achieve a thorough understanding of the reliability.
First the consistency of the reconstructed consensus phylogeny and local phylogenetic trees was investigated on a genome-segment level. For each genome segment, local neighbor-joining trees (involving all 151 taxa) were built using Ninja software with the aforementioned substitution model . We then dissected the local neighbor-joining trees and our consensus tree-like phylogenetic relationship into rooted triplets. For phylogenies with n taxa (including an outgroup), the proportion of compatible triplets between the local tree and consensus tree could be obtained by
, where k is the total number of compatible triplets and
is the number of total rooted triplets (n = 151 in this case). The proportions were calculated for all genome segments and then used as a measure for the agreement of reconstructed consensus phylogeny and local phylogenetic trees.
Second, the consistency of internal branches (nontrivial splits) of the consensus phylogenetic tree and local phylogenetic trees was evaluated by checking how often the nontrivial splits of the consensus tree were supported by nontrivial splits of local phylogenetic trees. For any given internal branch (with m children) of an n-taxa consensus tree (including an outgroup), the phylogenetic relationship was dissected into rooted triplets with a total number
to form a consensus rooted triplet pool. The probability that a given rooted triplet from the consensus rooted triplet pool was supported by dissected rooted triplets of local phylogenetic tree could be estimated by
, where y was the number of dissected rooted triplets of the local phylogenetic trees which shared the same phylogenetic relationships with their corresponding triplets of the consensus tree, and j was the total number of local neighbor-joining trees determined by the size of the sliding window and length of the moving step. The 95% CI of the estimation was obtained by a bootstrapping method in which local phylogenetic trees were randomly sampled with replacements to generate an artificial rooted triplet pool for the aforementioned evaluation.
Performance demonstration in the presence of recombination
Synthetic data was generated by introducing simulated genotype A/C recombinants to the raw data set that was used for aforementioned investigation of HBV phylogeny. For a pair of sequences, one from each of the two genotypes, we gave the recombination probability p. Expected frequency of recombinants in the sequence pool of genotype A, C, and A/C recombinant could be estimated as f = 1 - (1 - p)30 because 30 sequences of each genotype were included in the raw data set. We considered all possible pairs of the involving sequences of genotypes A and C to simulate the occurrence of recombination between the two genotypes. When a recombination occurred between a pair of sequences with probability p, location of the recombinant fragment was randomly chosen on the HBV genome, and length of the recombinant fragment was determined by the empirical length distribution of recombinants from Yang et al’s study . Because HBV genome is a circular molecular, we allowed recombinant fragment cover the junction of sequence end and start.
Phylogenetic relationship of the synthetic data was reconstructed by using ML method. Before the reconstruction, jModelTest2 was executed to choose the best-fit model from the 88 candidate models . Since GTR + I + G model was selected as the best-fit model, a ML tree was built using the ML method implemented in PALM package . The same synthetic data was also analyzed by our consensus method to produce a consensus tree. By given different probability of recombination p, we performed the data simulation and phylogeny reconstruction multiple times to achieve a thoughtful evaluation.