- Research Article
- Open Access
Pan-genome and phylogeny of Bacillus cereus sensu lato
BMC Evolutionary Biologyvolume 17, Article number: 176 (2017)
Bacillus cereus sensu lato (s. l.) is an ecologically diverse bacterial group of medical and agricultural significance. In this study, I use publicly available genomes and novel bioinformatic workflows to characterize the B. cereus s. l. pan-genome and perform the largest phylogenetic and population genetic analyses of this group to date in terms of the number of genes and taxa included. With these fundamental data in hand, I identify genes associated with particular phenotypic traits (i.e., “pan-GWAS” analysis), and quantify the degree to which taxa sharing common attributes are phylogenetically clustered.
A rapid k-mer based approach (Mash) was used to create reduced representations of selected Bacillus genomes, and a fast distance-based phylogenetic analysis of this data (FastME) was performed to determine which species should be included in B. cereus s. l. The complete genomes of eight B. cereus s. l. species were annotated de novo with Prokka, and these annotations were used by Roary to produce the B. cereus s. l. pan-genome. Scoary was used to associate gene presence and absence patterns with various phenotypes. The orthologous protein sequence clusters produced by Roary were filtered and used to build HaMStR databases of gene models that were used in turn to construct phylogenetic data matrices. Phylogenetic analyses used RAxML, DendroPy, ClonalFrameML, PAUP*, and SplitsTree. Bayesian model-based population genetic analysis assigned taxa to clusters using hierBAPS. The genealogical sorting index was used to quantify the phylogenetic clustering of taxa sharing common attributes.
The B. cereus s. l. pan-genome currently consists of ≈60,000 genes, ≈600 of which are “core” (common to at least 99% of taxa sampled). Pan-GWAS analysis revealed genes associated with phenotypes such as isolation source, oxygen requirement, and ability to cause diseases such as anthrax or food poisoning. Extensive phylogenetic analyses using an unprecedented amount of data produced phylogenies that were largely concordant with each other and with previous studies. Phylogenetic support as measured by bootstrap probabilities increased markedly when all suitable pan-genome data was included in phylogenetic analyses, as opposed to when only core genes were used. Bayesian population genetic analysis recommended subdividing the three major clades of B. cereus s. l. into nine clusters. Taxa sharing common traits and species designations exhibited varying degrees of phylogenetic clustering.
All phylogenetic analyses recapitulated two previously used classification systems, and taxa were consistently assigned to the same major clade and group. By including accessory genes from the pan-genome in the phylogenetic analyses, I produced an exceptionally well-supported phylogeny of 114 complete B. cereus s. l. genomes. The best-performing methods were used to produce a phylogeny of all 498 publicly available B. cereus s. l. genomes, which was in turn used to compare three different classification systems and to test the monophyly status of various B. cereus s. l. species. The majority of the methodology used in this study is generic and could be leveraged to produce pan-genome estimates and similarly robust phylogenetic hypotheses for other bacterial groups.
Bacillus cereus sensu lato (s. l.) is an ecologically diverse bacterial group that comprises a growing number of species, many of which are medically or agriculturally important. Historically recognized and most well-sampled of the species are B. anthracis (the causative agent of anthrax), B. cereus sensu stricto (capable of causing food poisoning and other ailments), and B. thuringiensis (used to control insect pests). Other species are distinguished by rhizoidal growth patterns (B. mycoides and B. pseudomycoides ), thermotolerance and cytotoxicity (B. cytotoxicus ), psychrotolerance and ability to cause food spoilage (B. weihenstephanensis  and B. wiedmannii ), and utility as a probiotic in animal nutrition (B. toyonensis ). In addition, several new species have also recently been described (B. bingmayongensis , B. gaemokensis , and B. manliponensis ). In order to understand the fantastic diversity of B. cereus s. l. and its concomitant ability to occupy diverse environmental niches and exhibit a variety of phenotypes, it is crucial to accurately characterize genomic diversity within the group and to generate robust phylogenetic hypotheses about the evolutionary relationships among group members.
A typical B. cereus s. l. genome contains ≈5500 protein-coding genes [9, 10]. Due to rampant horizontal gene transfer in bacterial ecosystems, however, the genome of a particular strain or species often contains genes not found in closely related taxa . Thus, it is now common practice to characterize the full gene complement of a closely related group of bacterial strains or species, otherwise known as a “pan-genome” . In this study, a “core” gene is defined as present in at least 99% of sampled taxa, an “accessory” gene as a non-core gene present in at least two taxa, and a “unique” gene as present in only one taxon. A previous effort to characterize the B. cereus s. l. pan-genome  based on a comparison of a relatively small number of strains estimated that there are ≈3000 core genes and ≈22,500 total genes in the B. cereus s. l. pan-genome. A more recent study  using 58 strains reported similar estimates.
Phylogenetic hypotheses of B. cereus s. l. have been generated from a variety of data sources, including 16S rRNA sequences , amplified fragment length polymorphism (AFLP) data [14, 15], multilocus sequence typing (MLST) of housekeeping genes [15–18], single-copy protein-coding genes , locally collinear blocks (LCBs) , conserved protein-coding genes , whole-genome single nucleotide polymorphisms (SNPs) , and digital DNA-DNA hybridization (dDDH) data . Phylogenetic analyses have used distance methods [13, 14, 17, 20], maximum likelihood [14, 18, 19], maximum parsimony , and Bayesian methods . For the most part, published phylogenies have tended to agree with and reinforce one another, although naturally there have been different classification systems developed with attendant implications for species designations. One popular classification system divides the B. cereus s. l. phylogeny into three broad clades [13, 16, 21]; traditionally, Clade 1 contains B. anthracis, B. cereus, and B. thuringiensis; Clade 2 contains B. cereus and B. thuringiensis; and Clade 3 contains a greater diversity of species including B. cereus, B. cytotoxicus, B. mycoides, B. thuringiensis, B. toyonensis, and B. weihenstephanensis. A somewhat more fine-grained classification system divides the phylogeny into seven major groups [14, 15, 18], each with its own thermotolerance profile  and propensity to cause food poisoning .
In this study I aimed to produce the most accurate and comprehensive estimate of the B. cereus s. l. pan-genome and phylogeny to date by analyzing all publicly available B. cereus s. l. genome data with a novel bioinformatic workflow for pan-genome characterization and pan-genome-based phylogenetic analysis.
Distance-based phylogeny of the genus Bacillus
All “reference” and “representative” Bacillus genome assemblies were retrieved from the NCBI RefSeq  database in October 2016, comprising 86 assemblies from 74 well-described Bacillus species and 44 assemblies from as-yet uncharacterized species. In addition, 16 “latest” assemblies were added for five Bacillus species that are thought to be part of B. cereus s. l. (B. bingmayongensis , B. gaemokensis , B. pseudomycoides , B. toyonensis , and B. wiedmannii ). In total, 146 Bacillus genomes were included in the distance-based phylogenetic analysis. The sketch function in Mash  version 1.1.1 (arguments: -k 21 -s 1000) was used to create a compressed representation of each genome, and then the Mash distance function was used to generate all pairwise distances among genomes. The Mash distance matrix was converted to PHYLIP format and analyzed with FastME  version 2.1.4 using the default BIONJ  algorithm.
Creation of taxon sets
All complete genomes of eight B. cereus s. l. species (B. anthracis, B. cereus, B. cytotoxicus , B. mycoides, B. pseudomycoides , B. thuringiensis, B. toyonensis , and B. weihenstephanensis ) were downloaded from the NCBI RefSeq  database in October 2016, which altogether comprised 114 genomes. One strain from each species was designated the “reference taxon” for that species, as required by HaMStR  (Table 1). This taxon set of complete genomes (“BCSL_114”; Table 1) was used to build the HaMStR databases and as the basis for the majority of the analyses performed in this study.
To perform analyses involving all publicly available B. cereus s. l. genome data, all “latest assemblies” were downloaded for the eight species mentioned above, and based on analysis of the Bacillus distance-based phylogeny, assemblies were added for B. bingmayongensis , B. gaemokensis , B. manliponensis , B. wiedmannii , and one uncharacterized species (Bacillus sp. UNC437CL72CviS29), which altogether comprised 498 genomes (“BCSL_498”; Table 1). A list of RefSeq assembly accessions for all taxa used in this study is provided (Additional file 1).
B. cereus s. l. isolate metadata, including “Assembly Accession”, “Disease”, “Host Name”, “Isolation Source”, “Motility”, and “Oxygen Requirement” was downloaded from PATRIC  in December 2016. This metadata was used to associate patterns of gene presence and absence with phenotypes exhibited by groups of taxa.
All B. cereus s. l. genomes were annotated de novo with Prokka  version 1.12-beta (arguments: –kingdom Bacteria –genus Bacillus).
The pan-genome of B. cereus s. l. was inferred with Roary  version 3.7.0. The BCSL_114 Prokka annotations were provided to Roary as input; in turn, Roary produced a gene presence/absence matrix (Additional file 2), a multi-FASTA alignment of core genes using PRANK  version 0.140603, and a tree based on the presence and absence of accessory genes among taxa using FastTree 2  version 2.1.9. The “accessory binary tree” was computed using only the first 4000 genes in the accessory genome.
Phylogenetic network analysis
A NEXUS-format binary version of the BCSL_114 gene presence/absence matrix was analyzed with SplitsTree4  version 4.14.4. Three methods of calculating distances between taxa were evaluated: Uncorrected_P, GeneContentDistance , and the MLDistance variant of GeneContentDistance . The NeighborNet  algorithm was used to reconstruct the phylogenetic network.
Scoary  version 1.6.9 was used to associate patterns of gene presence and absence with particular phenotypes (traits), an analysis known as “pan-GWAS” . Scoary required two basic input files: the BCSL_114 gene presence/absence matrix, augmented with gene presence/absence information for BCSL_498 taxa obtained from orthology determination with HaMStR  (Additional file 3), and a binary trait matrix that was created using the isolate metadata obtained from PATRIC (Additional file 4). Assignment of traits to taxa was performed conservatively in that missing data was not assumed to be an indication of the presence or absence of a particular trait. Scoary was run with 1,000 permutation replicates, and genes were reported as significantly associated with a trait if they attained a naive P-value less than 0.05, a Benjamini-Hochberg-corrected P-value less than 0.05, an empirical P-value less than 0.05, and were not annotated as “hypothetical proteins”. Lists of genes were subsequently tested for enrichment of biological processes using the data and services provided by AmiGO 2  version 2.4.24, which in turn used the PANTHER database  version 11.1.
HaMStR database creation
The orthologous protein sequence clusters output by Roary were filtered to produce a set of gene models suitable for use with HaMStR  version 13.2.6. HaMStR enables one to build gene models for a clade of interest (using, ideally, high-quality complete genomes), which are subsequently used to identify orthologs in other sequence data (e.g., draft genome assemblies, transcriptomes, etc.). HaMStR required that each sequence cluster contain at least one sequence from the set of previously selected reference taxa (Table 1), so clusters not meeting this requirement were omitted. Furthermore, each cluster was required to contain at least four sequences, the minimum number of sequences required to produce an informative unrooted phylogenetic tree. Finally, clusters that Roary flagged as having a quality-control issue were removed. The 9,070 clusters that passed these filters were aligned using the linsi algorithm in MAFFT  version 7.305. Gene models (i.e., Hidden Markov Models, or HMMs) were produced from the aligned cluster sequences using the hmmbuild program from HMMER  version 3.0. Finally, for each reference taxon, a BLAST  database was built using the full complement of protein-coding genes for that taxon. This completed the construction of the initial HaMStR database, which is called “HAMSTR_FULL”. A variant of HAMSTR_FULL called “HAMSTR_CORE” was created, which contained only the 594 gene models corresponding to core genes.
Mobile genetic element removal
For tree-based phylogenetic analyses that assume a process of vertical inheritance, the inclusion of mobile genetic elements (MGEs) that may be horizontally transferred is likely to confound the phylogenetic inference process . Thus, an effort was made to identify and remove putative MGEs from the HaMStR databases. In December 2016, a list of Bacillus genes derived from a plasmid source was downloaded from the NCBI Gene  database. In addition, all genes were exported from the ACLAME  database version 0.4. Using this information, gene models that were either plasmid-associated or found in the ACLAME list of MGEs were removed from HaMStR databases. Gene models whose annotation included the keywords “transposon”, “transposition”, “transposase”, “insertion”, “insertase”, “plasmid”, “prophage”, “intron”, “integrase”, or “conjugal” were also removed. The resulting HaMStR databases, “HAMSTR_FULL_MGES_REMOVED” and “HAMSTR_CORE_MGES_REMOVED”, contained 8954 and 578 gene models, respectively. The workflow used to construct the HAMSTR_FULL_MGES_REMOVED database is shown as a diagram (Additional file 5).
The protein-coding gene annotations of “query” taxa — i.e., taxa not included in BCSL_114 — were searched for sequences matching HaMStR database gene models using HaMStR  version 13.2.6 (which in turn used GeneWise  version 2.4.1, HMMER  version 3.0, and BLASTP  version 2.2.25+). In the first step of the HaMStR search procedure, the hmmsearch program from HMMER was used to identify translated substrings of protein-coding sequence that matched a gene model in the database, which were then provisionally assigned to the corresponding sequence cluster. To reduce the number of highly divergent, potentially paralogous sequences returned by this initial search, the E-value cutoff for a “hit” was set to 1e-05 (the HaMStR default was 1.0). In the second HaMStR step, BLASTP was used to compare the hits from the HMM search against the proteome of the reference taxon associated with that gene model; sequences were only retained if the reference taxon protein used in the construction of the gene model was also the best BLAST hit. The E-value cutoff for the BLAST search was set to 1e-05 (the HaMStR default was 10.0).
Data matrix construction
Amino acid sequences assigned to orthologous sequence clusters were aligned using MAFFT  version 7.305. The resulting amino acid alignments were converted to corresponding nucleotide alignments using a custom Perl script that substituted for each amino acid the proper codon from the original coding sequence. Initial orthology assignment may sometimes result in multiple sequences for a particular taxon/locus combination , which need to be reduced to a single sequence for inclusion in phylogenetic data matrices. For this task the “consensus”  procedure was used, which collapsed all sequence variants into a single sequence by replacing multi-state positions with nucleotide ambiguity codes. Following application of the consensus procedure, individual sequence cluster alignments were concatenated, adding gaps for missing data as necessary using a custom Perl script. The workflow used for orthology determination and data matrix construction is shown as a diagram (Additional file 6).
Maximum likelihood phylogenetic analysis
Concatenated nucleotide data matrices were analyzed under the maximum likelihood criterion using RAxML  version 8.2.8 (arguments: -f d -m GTRGAMMAI). The data were analyzed either with all nucleotides included in a single data subset (ALL_NUC), or with sites partitioned by codon position (CODON_POS). Partitioned analyses assigned a unique instance of the substitution model to each data subset, with joint branch length optimization. Analyses of the BCSL_114 taxon set consisted of an adaptive best tree search  and an adaptive bootstopping procedure that used the autoMRE RAxML bootstopping criterion ; thus, the number of search replicates performed varied from 10 to 1000, depending on the analysis. DendroPy  was used to map bootstrap probabilities onto the best tree. Analysis of the BCSL_498 taxon set required ≈256 GB of RAM and multiple weeks of runtime, and was thus limited to a single best tree search.
Genomic regions that may have been involved in past recombination events should be excluded from phylogenetic analyses that assume a process of vertical inheritance, or phylogenetic inference methods should incorporate this information to produce a more accurate phylogeny . In this study, two different software packages that address this problem were evaluated. First, the profile program from PhiPack  was used to flag and remove from concatenated data matrices sites that exhibited signs of mosaicism. Following the procedure employed in Parsnp , the profile program defaults were used, except that the step size was increased from 25 to 100 (-m 100). RAxML was then used to create new versions of data matrices that excluded regions whose Phi statistic P-value was less than 0.01. Second, ClonalFrameML  (downloaded from GitHub June 14, 2016) with default parameters was used to correct the branch lengths of phylogenies to account for recombination. ClonalFrameML required all ambiguous bases in data matrices to be coded as “N”.
Maximum parsimony phylogenetic analysis
Concatenated nucleotide data matrices were analyzed under the maximum parsimony criterion using PAUP*  version 4.0a150. A heuristic search was performed using default parameters.
Tree distance calculation
To complement phylogenetic analysis and existing classification systems, taxa were clustered with hierBAPS  (bugfixed version dated August 15, 2013), a Bayesian model-based population genetic approach that accounts for admixture within and among lineages. The BCSL_498 alignment of 8954 genes (MAT_6) was provided to hierBAPS as input, and hierBAPS was directed to produce a single-level clustering with a maximum of 10 clusters.
Clustering of taxon-associated attributes
The degree of clustering of taxa sharing a common attribute, given a phylogeny relating those taxa, was quantified using the genealogical sorting index  (gsi) version 0.92 made available through the web service at molecularevolution.org . Significance of the gsi was determined by running 104 permutation replicates.
Distance-based phylogeny of the genus Bacillus
The Mash-distance-based phylogeny of the genus Bacillus (Additional file 7) indicated a B. cereus s. l. clade containing the following species: B. anthracis, B. bingmayongensis, B. cereus (sensu stricto), B. cytotoxicus, B. gaemokensis, B. manliponensis, B. mycoides, B. pseudomycoides, B. thuringiensis, B. toyonensis, B. weihenstephanensis, B. wiedmannii, and one uncharacterized species (Bacillus sp. UNC437CL72CviS29). Within B. cereus s. l., the first taxon to split off from the remainder of the group was B. manliponensis, followed by B. cytotoxicus (which has been previously recognized as an outlier [12, 19]).
The pan-genome of B. cereus s. l. was inferred with Roary  using the BCSL_114 taxon set. Roary produced a total of 59,989 protein-coding gene sequence clusters (Additional file 2) from an average of 5726 genes per input genome (Additional file 8). The shortest cluster sequence was 122 nt, the longest cluster sequence was 22,967 nt, and the average length of a cluster sequence was 755 nt. The average difference between the shortest and longest sequence in a cluster was only 67 nt, suggesting that the input data was relatively uniform, as would be expected from complete genomes. The B. cereus s. l. “core genome”, consisting of genes present in at least 99% of taxa sampled, was represented by 598 genes (≈1% of all genes). A rarefaction curve shows that after ≈35 genomes have been sampled (≈31% of all genomes), the number of core genes remains fairly constant at ≈600 genes, while the total number of genes in the pan-genome continues to increase almost linearly (Additional file 9). The 59,391 non-core genes were divided into 32,324 “accessory genes” (i.e., non-core genes present in at least two taxa; ≈54% of all genes), and 27,067 “unique genes” (i.e., genes present in only one taxon; ≈45% of all genes). A rarefaction curve shows that as genomes are sampled, genes never before observed continue to be found at a fairly steady rate, and the total number of unique genes discovered continues to increase, with no indication of soon approaching an asymptote (Additional file 10); these trends indicate that the B. cereus s. l. pan-genome is “open”. Finally, Roary produced an “accessory binary tree”, which was plotted alongside core and accessory gene presence/absence information (Additional file 11). This figure shows that the outermost B. cereus s. l. clades include taxa with relatively few accessory genes included in the analysis (≤40), such as B. cytotoxicus, B. mycoides, and B. pseudomycoides; by contrast, the genomes with the most accessory genes present (>1000) belong to the highly clonal clade of B. anthracis strains.
Phylogenetic network analysis
SplitsTree4  was used to build a phylogenetic network from the gene presence/absence information produced by Roary. The choice of method for computing distance did affect network branch lengths; the network presented here was computed with the MLDistance variant of GeneContentDistance (Fig. 1), as that method seemed most appropriate for gene presence/absence data. The phylogenetic network recapitulated both the three-clade [13, 16, 21] and seven-group [14, 15, 18] classification systems used in previous studies. Group I and Group VII, both part of Clade 3, were most radically diverged from the remainder of the network. Notably, Group II was absent from the network, as it was not represented by any complete B. cereus s. l. genomes at the time the study was performed.
Scoary  was used to associate patterns of gene presence and absence with particular phenotypes (traits), an analysis known as “pan-GWAS” . Pan-GWAS was performed for the following traits: isolation source (cattle, human, invertebrate, non-primate mammal, or soil); motility; oxygen requirement (aerobic or facultative); and disease (anthrax or food poisoning). Eight of ten traits tested had some number of significant positively or negatively associated genes (Table 2). Traits with a sufficient number of associated genes were tested for possible enrichment of gene ontology biological processes (Additional file 12). The most interesting findings from this analysis concerned taxa isolated from soil. Specifically, metabolic and biosynthetic processes involving quinone (and in particular, menaquinone) were positively associated with soil isolates. Analysis of quinone species present in soil have been used previously to characterize soil microbiota . Furthermore, a high ratio of menaquinone to ubiquinone (the two dominant forms of quinone in soil) has been associated with the presence of gram-positive bacteria such as Bacillus species . On the other hand, biological processes involving flagella, cilia, or motility more generally were negatively associated with soil isolates. This finding is consistent with observations that motility may not be necessary for bacterial colonization of plant roots , doubts about the evolutionary advantage of maintaining flagella in a soil environment , and general properties of soil that bring into question the importance of active movement and the extent to which it occurs .
Concatenated data matrices
In total, six different concatenated nucleotide data matrices were constructed and analyzed (MAT_1–MAT_6; Table 3). The majority of the data matrices used the BCSL_114 taxon set (MAT_1–MAT_5); only MAT_6 used the BCSL_498 taxon set. Various gene sets were used, including
all core genes identified by Roary (all_core);
only the core genes used to build the HaMStR database (hamstr_core);
HaMStR core genes with mobile genetic elements (MGEs) removed (hamstr_core_mges_removed), and a variant of this gene set with PhiPack sites removed; and finally,
all HaMStR genes with MGEs removed (hamstr_full_mges_removed).
Aligned data matrices ranged from 96,802 nt to 8,207,628 nt in length. Matrix completeness, defined as the percentage of non-missing data, ranged from 47.4 to 99.5%. The percentage of ambiguous characters present in data matrices ranged from 0.0 to 17.0%.
In total, nine different phylogenetic analyses of the six concatenated data matrices were performed (Table 4). Eight of the nine analyses used maximum likelihood (ML_1–ML_8), and one analysis used maximum parsimony (MP_1). For reasons of computational tractability, all exploratory analyses used the BCSL_114 taxon set (ML_1–ML_7 and MP_1); only when the best-performing methods were established was analysis of the BCSL_498 taxon set pursued (ML_8). During the exploratory phase, several variables were tested for their effect on phylogenetic outcome:
use of MAFFT instead of PRANK to align protein sequence clusters;
removal of MGEs;
use of maximum parsimony in addition to maximum likelihood;
partitioning of sites by codon position;
removal of sites implicated in recombination; and finally,
use of all eligible genes from the pan-genome versus only core genes.
Importantly, all phylogenetic analysis results recapitulated the three-clade [13, 16, 21] and seven-group [14, 15, 18] classification systems of previous studies. Taxa were consistently assigned to the same clade and group, independent of the particular phylogenetic analysis performed. Thus, topological differences between analysis results, as measured by the Robinson-Foulds distance  (Additional file 13), were confined to intra-group relationships. Bootstrap support was fairly consistent for all analyses that used core genes, and increased dramatically when all eligible genes from the pan-genome were used (Table 4). Additional detail about the phylogenetic analyses, and the logic behind their progression, is provided in the subsections that follow.
Choice of multiple sequence alignment program
Roary produced multiple sequence alignments of all 598 core genes with PRANK , which explicitly models insertions and deletions, but as a consequence runs more slowly than some other alignment programs. The PRANK alignments were concatenated to produce MAT_1. A similar matrix was built using the 594 HaMStR-eligible core genes, except that the gene sequence clusters were aligned with MAFFT  (MAT_2). Phylogenetic analyses of these two matrices with RAxML  revealed only negligible differences in bootstrap probabilities (ML_1 vs. ML_2; Table 4), so for the sake of computational efficiency MAFFT was used for the remainder of the analyses.
Removal of mobile genetic elements
For tree-based phylogenetic analyses that assume a process of vertical inheritance, the inclusion of mobile genetic elements (MGEs) that may be horizontally transferred is likely to confound the phylogenetic inference process . Thus, putative MGEs were identified and removed from HAMSTR_CORE, leaving 578 core genes (HAMSTR_CORE_MGES_REMOVED). Phylogenetic analysis of this slightly smaller data matrix (MAT_3) revealed comparable bootstrap probabilities to those from the analysis that used HAMSTR_CORE (ML_3 vs. ML_2; Table 4); nevertheless, out of principle, HaMStR databases with MGEs removed were used for the remainder of the analyses.
Partitioning of sites by codon position
It is well known that nucleotides in different codon positions (first, second, or third) are likely to be under different selective pressures ; thus, when analyzing protein-coding nucleotide sequences, it is common practice to apply a different substitution model (or different instance of the same substitution model) to the sites associated with each codon position, thus effectively partitioning the data matrix into three data subsets. The effect of partitioning by codon position was tested with two different matrices (MAT_3 and MAT_5); only negligible differences in bootstrap probabilities were found as compared to the unpartitioned results (ML_4 vs. ML_3 and ML_7 vs. ML_6; Table 4).
Removal of sites implicated in recombination
Genomic regions that may have been involved in past recombination events should not be used for phylogenetic analyses that assume a process of vertical inheritance . The profile program from PhiPack  was used to flag and remove sites from MAT_3 that exhibited signs of mosaicism. The resulting data matrix (MAT_4) contained less than one-fourth the number of unique alignment patterns of MAT_3, thus representing a substantial reduction in data suitable for phylogenetic analysis. This was reflected in bootstrap probabilities, which were somewhat depressed overall (ML_5 vs. ML_3; Table 4). It was thus concluded that removing sites implicated in recombination had a deleterious effect on phylogenetic analysis results, and so this procedure was not applied to subsequent analyses.
Use of all eligible genes from the pan-genome versus only core genes
Using all eligible genes (HAMSTR_FULL_MGES_REMOVED) for phylogenetic analysis as opposed to using only core genes (HAMSTR_CORE_MGES_REMOVED) caused bootstrap probabilities to increase dramatically (ML_6 vs. ML_3 and ML_7 vs. ML_4; Table 4). Thus, the ML_6 result was selected as the best estimate of the phylogenetic relationships among the BCSL_114 taxa. ClonalFrameML  was used to correct the branch lengths of this tree to account for recombination, and the tree was rooted using B. cytotoxicus [12, 19]. The resulting BCSL_114 phylogeny is shown as a phylogram with major clades and groups indicated (Fig. 2), and as a cladogram with bootstrap probabilities annotated (Additional file 14).
Maximum likelihood-based analysis of all taxa
Once the exploratory analyses were completed, an analysis of BCSL_498 was executed using the HAMSTR_FULL_MGES_REMOVED gene set. The average number of genes included in the analysis for each species, clade, and group is given in Additional file 8, and a count of the number of genes included for each taxon is given in Additional file 15. Due to the size of the data matrix (almost 4×106 unique alignment patterns), only a single best tree search replicate was completed (ML_8; Table 4). Informed by the distance-based analysis of Bacillus species (Additional file 7), the tree was rooted using B. manliponensis. The resulting BCSL_498 phylogeny is shown as a phylogram with major clades and groups indicated (Fig. 3). In contrast to analyses of BCSL_114, Group II is now represented, and is located on the tree where expected [14, 15, 18]. Based on this topology of currently sequenced genomes, a count of the number of taxa by species is provided for major B. cereus s. l. clades and groups (Additional file 8).
The hierBAPS  clustering analysis divided BCSL_498 into nine clusters (Additional file 15), which are displayed alongside major B. cereus s. l. clades and groups in Fig. 4. The hierBAPS clusters are congruent with the three-clade classification system, and largely agree with the seven-group classification system, with the following differences. Clade 1 included members of three clusters (as opposed to only two groups), and Clade 3 included members of six clusters (as opposed to four groups). Some Clade 3 clusters expanded slightly relative to their counterpart group to include taxa that were not assigned to any group, and B. manliponensis was assigned to its own cluster. Interestingly, two Clade 3 B. cereus taxa were assigned to Cluster 3, whereas other members of Cluster 3 were assigned to Clade 1, thus suggesting some genetic admixture between these two clades that was not reflected in the phylogenetic analysis.
Clustering of taxon-associated traits
The genealogical sorting index  (gsi) was used to quantify the degree of clustering of taxa sharing a common attribute given a phylogeny relating those taxa. The gsi statistic for a particular attribute takes a value from the unit interval [0,1]; if taxa associated with the attribute form a monophyletic group, the g s i=1; otherwise, the greater the degree to which taxa associated with the attribute are dispersed throughout the tree (accounting for the number of taxa and the size of the tree), the smaller the gsi will be for that attribute.
Quantifying the degree of B. cereus s. l. species monophyly
The gsi was calculated for six B. cereus s. l. species that were sufficiently represented in the BCSL_498 phylogeny; all P-values were < <0.05 (Additional file 16 and Table 5). Due to its highly clonal nature, B. anthracis was the species closest to monophyly (g s i=0.95), and would have indeed been monophyletic except that one B. anthracis taxon (GCF_001029875) did not group with the others (but still placed in Group III). This might be a misannotation and should be investigated. B. weihenstephanensis was the species furthest from monophyly (g s i=0.15), primarily because it was represented by only six taxa, one of which (GCF_000518025) was found in Group IV — the remainder were found in Group VI. Again, the annotation of the Group IV taxon with regard to species affiliation should be scrutinized.
Quantifying the degree of clustering of taxa sharing common traits
The gsi was calculated for ten traits shared by various B. cereus s. l. taxa using the BCSL_114 phylogeny from the ML_6 analysis; all P-values with the exception of one were less than 0.05 (Table 6). As not all of the taxa in BCSL_114 were assayed for each trait, the gsi values were artificially depressed; nevertheless, their relative values may be compared. The traits with the largest gsi values were “isolation source: cattle” and “isolation source: non-primate mammal”, the taxa associated with the former being a subset of the taxa associated with the latter. These taxa were all located in Group III, and all but two were identified as B. anthracis. This finding is consistent with the prevalence of mortality due to anthrax among cattle and other herbivores .
I show that the B. cereus s. l. pan-genome is still “open” (Additional files 9 and 10), thus implying that continued sampling of the group — especially of underrepresented taxa such as environmental strains  — will continue to reveal novel gene content. My estimate of the number of protein-coding genes in the B. cereus s. l. core and pan-genome (≈600 and ≈60,000, respectively), based on 114 complete genomes, is consistent with previous estimates [12, 13], as more extensive sampling of an open pan-genome will necessarily reduce the core genome size while simultaneously increasing the pan-genome size. It is interesting to observe that the basic phylogenetic structure of B. cereus s. l. can be accurately computed by relatively quick phylogenetic analyses based solely on the distribution of accessory genes among taxa (Fig. 1 and Additional file 11), which may in fact be sufficient for some applications. The diversity and adaptability of B. cereus s. l. may be in part attributable to the significant proportion of unique genes in its pan-genome (≈27,000, almost 50% of all genes; Additional file 10).
Pan-GWAS analysis found a number of genes significantly associated with various phenotypic traits (Table 2). In terms of validating this analysis, one might naturally look for genes known to be associated with B. anthracis virulence  or B. cereus s. l.-induced food poisoning ; however, these genes are not found among the analysis results. Many of these genes were not annotated by Roary, and of the ones that were, some were not represented in the HAMSTR_FULL database, thus reducing the number of taxa for which there would have been usable data. The genes that were reported to be significantly associated with “disease: anthrax”, “disease: food poisoning”, and other traits thus represent hypotheses that remain to be validated. Only four traits had enough significant positively or negatively associated genes to allow for the identification of enriched subsets of genes involved in particular biological processes (“isolation source: cattle”, “isolation source: human”, “isolation source: non-primate mammal”, and “isolation source: soil”; Additional file 12). Of these, only the biological processes associated with “isolation source: soil” were sufficiently specific so as to be meaningfully interpretable. To increase the statistical power of the pan-GWAS analysis and thereby generate more comprehensive and specific lists of genes associated with various traits, one would need to include additional taxa with relevant metadata and gene content information.
All phylogenetic analyses in this study recapitulated the three-clade and seven-group classification systems, and taxa were consistently assigned to the same clade and group (Figs. 1, 2, 3 and 4), irrespective of the data source or analysis methodology used (Tables 3 and 4). This strongly suggests that the broad phylogenetic structure of B. cereus s. l. has been inferred correctly. I demonstrate that the three-clade and seven-group systems are compatible with each other, as no group has its member taxa assigned to multiple clades. Clades 1 and 2 are much more extensively sampled than Clade 3 due to historical interest in B. anthracis and B. thuringiensis (Additional file 8); a recent study has shown that there is likely to be a tremendous amount of as-yet incompletely characterized diversity in Clade 3 that can be assayed by sampling various natural environments . Indeed, Clade 3 exhibited the greatest degree of species diversity; in particular, Group I contained representatives of seven different species, including two newly characterized species (B. bingmayongensis  and B. gaemokensis ; Additional file 8). Six of the 498 taxa did not place into one of the seven previously circumscribed groups, which suggests that classification systems will need to be updated and refined as additional isolates are sequenced. Perhaps most interesting among the unplaced taxa is B. manliponensis , which appears to be even more divergent from other B. cereus s. l. taxa than B. cytotoxicus  (Fig. 3 and Additional file 7). One possibility for updating the group-level classification system is to incorporate information from the Bayesian model-based clustering analysis, the results of which were shown to be compatible with the three-clade system and which recommended nine clusters instead of seven groups (Fig. 4).
Using the phylogeny of BCSL_498, I quantified the degree of monophyly for six current B. cereus s. l. species designations (Additional file 16 and Table 5). This analysis demonstrates quantitatively that with the exception of B. anthracis, species definitions within B. cereus s. l. are not currently based on phylogenetic relatedness, but rather on phenotypes such as virulence, physiology, and morphology [14, 18]. The primary focus of this study is the accurate reconstruction of phylogenetic relationships among taxa, and thus I make no specific recommendations for species re-designation based on these results. However, I do note a trend towards refined species designations that correlate with group affiliation; for example, several B. cereus strains in Group II have recently been re-designated B. wiedmanii ; similarly, Böhm et al.  suggested that all Group V taxa should be designated B. toyonensis . In general, I recommend that taxonomic revisions are informed by well-supported phylogenetic hypotheses that have been generated without bias towards any particular species concept (e.g., dDDH boundaries ).
In a bioforensic setting, phylogenies that include well-supported strain-level relationships aid greatly in the identification of new isolates, and thus support both the attribution process (traceback of an isolate to its source) as well as analyses of pathogen evolution in an epidemic or outbreak scenario. However, the extremely high level of genomic conservation among closely related bacterial strains, especially in the core genome or in commonly typed conserved regions such as housekeeping genes, has limited the ability of previous analyses to make robust strain-level phylogenetic inferences. An important contribution of the current study is to show that bootstrap probabilities increase substantially when accessory genes are included in phylogenetic analyses along with core genes (Table 4). Thus, I have been able to resolve many strain-level, intra-group relationships of B. cereus s. l. with 100% bootstrap support for the first time (Additional file 14).
In this study, I used novel bioinformatic workflows to characterize the pan-genome and phylogeny of B. cereus sensu lato. Based on data from 114 complete genomes, I estimated that the B. cereus s. l. core and pan-genome contain ≈600 and ≈60,000 protein-coding genes, respectively. Pan-GWAS analysis revealed significant associations of particular genes with phenotypic traits shared by groups of taxa. All phylogenetic analyses recapitulated two previously used classification systems, and taxa were consistently assigned to the same major clade and group. By including accessory genes from the pan-genome in the phylogenetic analyses, I produced an exceptionally well-supported phylogeny of 114 complete B. cereus s. l. genomes. The best-performing methods were used to produce a phylogeny of all 498 publicly available B. cereus s. l. genomes, which was in turn used to compare three different classification systems and to test the monophyly status of various B. cereus s. l. species. The majority of the methodology used in this study is generic and could be leveraged to produce pan-genome estimates and similarly robust phylogenetic hypotheses for other bacterial groups. All scripts, software, and data products associated with this study are freely available.
A CLAssification of mobile genetic elements
Amplified fragment length polymorphism
Bayesian analysis of population structure
Bacillus cereus sensu lato
Basic local alignment search tool
digital DNA-DNA hybridization
Genealogical sorting index
Genome-wide association study
Hidden Markov model
Hidden Markov model based search for orthologs using reciprocity
locally collinear block
Multiple alignment using fast fourier transform
Mobile genetic element
Multilocus sequence typing
National center for biotechnology information
Protein ANalysis THrough evolutionary relationships
Pathosystems resource integration center
Phylogenetic analysis using parsimony *and other methods
Phylogeny inference package
Probabilistic alignment kit
Random access memory
Randomized axelerated maximum likelihood
Reference sequence database
Single nucleotide polymorphism
Nakamura LK. Bacillus pseudomycoides sp. nov. Int J Syst Evol Microbiol. 1998; 48(3):1031–5.
Guinebretière MH, Auger S, Galleron N, Contzen M, De Sarrau B, De Buyser ML, Lamberet G, Fagerlund A, Granum PE, Lereclus D, De Vos P, Nguyen-The C, Sorokin A. Bacillus cytotoxicus sp. nov. is a novel thermotolerant species of the Bacillus cereus Group occasionally associated with food poisoning. Int J Syst Evol Microbiol. 2013; 63(1):31–40.
Lechner S, Mayr R, Francis KP, Prüss BM, Kaplan T, Wiessner-Gunkel E, Stewart GS, Scherer S. Bacillus weihenstephanensis sp. nov. is a new psychrotolerant species of the Bacillus cereus group. Int J Syst Evol Microbiol. 1998; 48(4):1373–82.
Miller RA, Beno SM, Kent DJ, Carroll LM, Martin NH, Boor KJ, Kovac J. Bacillus wiedmannii sp. nov., a psychrotolerant and cytotoxic Bacillus cereus group species isolated from dairy foods and dairy environments. Int J Syst Evol Microbiol. 2016; 66(11):4744–53.
Jiménez G, Urdiain M, Cifuentes A, López-López A, Blanch AR, Tamames J, Kampfer P, Kolsto A-B, Ramón D, Martínez JF, Codoner FM, Rosselló-Móra R. Description of Bacillus toyonensis sp. nov., a novel species of the Bacillus cereus group, and pairwise genome comparisons of the species of the group by means of ANI calculations. Syst Appl Microbiol. 2013; 36(6):383–91. doi:10.1016/j.syapm.2013.04.008.
Liu B, Liu GH, Hu GP, Cetin S, Lin NQ, Tang JY, Tang WQ, Lin YZ. Bacillus bingmayongensis sp. nov., isolated from the pit soil of Emperor Qin’s Terra-cotta warriors in China. Anton Leeuw. 2014; 105(3):501–10. doi:10.1007/s10482-013-0102-3.
Jung MY, Paek WK, Park IS, Han JR, Sin Y, Paek J, Rhee MS, Kim H, Song HS, Chang YH. Bacillus gaemokensis sp. nov., isolated from foreshore tidal flat sediment from the Yellow Sea. J Microbiol. 2010; 48(6):867–71. doi:10.1007/s12275-010-0148-0.
Jung MY, Kim JS, Paek WK, Lim J, Lee H, Kim PI, Ma JY, Kim W, Chang YH. Bacillus manliponensis sp. nov., a new member of the Bacillus cereus group isolated from foreshore tidal flat sediment. J Microbiol. 2011; 49(6):1027–32. doi:10.1007/s12275-011-1049-6.
Papazisi L, Rasko DA, Ratnayake S, Bock GR, Remortel BG, Appalla L, Liu J, Dracheva T, Braisted JC, Shallom S, Jarrahi B, Snesrud E, Ahn S, Sun Q, Rilstone J, Økstad OA, Kolstø A-B, Fleischmann RD, Peterson SN. Investigating the genome diversity of B. cereus and evolutionary aspects of B. anthracis emergence. Genomics. 2011; 98(1):26–39. doi:10.1016/j.ygeno.2011.03.008.
Toby IT, Widmer J, Dyer DW. Divergence of protein-coding capacity and regulation in the Bacillus cereus sensu lato group. BMC Bioinformatics. 2014; 15(11):8. doi:10.1186/1471-2105-15-S11-S8.
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, DeBoy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O’Connor KJB, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R, Fraser CM. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome”. Proc Natl Acad Sci U S A. 2005; 102(39):13950–13955. doi:10.1073/pnas.0506758102. http://www.pnas.org/content/102/39/13950.full.pdf.
Lapidus A, Goltsman E, Auger S, Galleron N, Ségurens B, Dossat C, Land ML, Broussolle V, Brillard J, Guinebretiere MH, Sanchis V, Nguen-the C, Lereclus D, Richardson P, Wincker P, Weissenbach J, Ehrlich SD, Sorokin A. Extending the Bacillus cereus group genomics to putative food-borne pathogens of different toxicity. Chem Biol Interact. 2008; 171(2):236–49. doi:10.1016/j.cbi.2007.03.003. Frontiers of Pharmacology and Toxicology
Zwick ME, Joseph SJ, Didelot X, Chen PE, Bishop-Lilly KA, Stewart AC, Willner K, Nolan N, Lentz S, Thomason MK, Sozhamannan S, Mateczun AJ, Du L, Read TD. Genomic characterization of the Bacillus cereus sensu lato species: Backdrop to the evolution of Bacillus anthracis. Genome Res. 2012; 22(8):1512–24. doi:10.1101/gr.134437.111. http://genome.cshlp.org/content/22/8/1512.full.pdf+html.
Guinebretière MH, Thompson FL, Sorokin A, Normand P, Dawyndt P, Ehling-Schulz M, Svensson B, Sanchis V, Nguyen-The C, Heyndrickx M, De Vos P. Ecological diversification in the Bacillus cereus group. Environ Microbiol. 2008; 10(4):851–65. doi:10.1111/j.1462-2920.2007.01495.x.
Tourasse NJ, Økstad OA, Kolstø A-B. HyperCAT: an extension of the SuperCAT database for global multi-scheme and multi-datatype phylogenetic analysis of the Bacillus cereus group population. Database. 2010; 2010:017. doi:10.1093/database/baq017.
Didelot X, Barker M, Falush D, Priest FG. Evolution of pathogenicity in the Bacillus cereus group. Syst Appl Microbiol. 2009; 32(2):81–90. doi:10.1016/j.syapm.2009.01.001.
Drewnowska JM, Swiecicka I. Eco-genetic structure of Bacillus cereus sensu lato populations from different environments in northeastern Poland. PLOS ONE. 2013; 8(12):1–11. doi:10.1371/journal.pone.0080175.
Böhm ME, Huptas C, Krey VM, Scherer S. Massive horizontal gene transfer, strictly vertical inheritance and ancient duplications differentially shape the evolution of Bacillus cereus enterotoxin operons hbl, cytk and nhe. BMC Evol Biol. 2015; 15(1):246. doi:10.1186/s12862-015-0529-4.
Schmidt TR, Scott EJ, Dyer DW. Whole-genome phylogenies of the family Bacillaceae and expansion of the sigma factor gene family in the Bacillus cereus species-group. BMC Genomics. 2011; 12(1):430. doi:10.1186/1471-2164-12-430.
Liu Y, Lai Q, Göker M, Meier-Kolthoff JP, Wang M, Sun Y, Wang L, Shao Z. Genomic insights into the taxonomic status of the Bacillus cereus group. Sci Rep. 2015; 5:14082.
Okinaka RT, Keim P. The phylogeny of Bacillus cereus sensu lato. Microbiol Spectr. 2016; 4(1):TBS-0012-2012.
Guinebretière MH, Velge P, Couvert O, Carlin F, Debuyser ML, Nguyen-The C. Ability of Bacillus cereus group strains to cause food poisoning varies according to phylogenetic affiliation (groups I to VII) rather than species affiliation. J Clin Microbiol. 2010; 48(9):3388–91. doi:10.1128/JCM.00921-10. http://jcm.asm.org/content/48/9/3388.full.pdf+html.
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O’Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, Tatusova T, DiCuccio M, Kitts P, Murphy TD, Pruitt KD. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015; 44(D1):733. doi:10.1093/nar/gkv1189.
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016; 17(1):132. doi:10.1186/s13059-016-0997-x.
Lefort V, Desper R, Gascuel O. FastME 2.0: A comprehensive, accurate, and fast distance-based phylogeny inference program. Mol Biol Evol. 2015; 32(10):2798–800. doi:10.1093/molbev/msv150. http://mbe.oxfordjournals.org/content/32/10/2798.full.pdf+html.
Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997; 14(7):685–95. http://mbe.oxfordjournals.org/content/14/7/685.full.pdf+html.
Ebersberger I, Strauss S, von Haeseler A. HaMStR: Profile hidden markov model based search for orthologs in ESTs. BMC Evol Biol. 2009; 9(1):157. doi:10.1186/1471-2148-9-157.
Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, Machi D, Mao C, Nordberg EK, Olson R, Overbeek R, Pusch GD, Shukla M, Schulman J, Stevens RL, Sullivan DE, Vonstein V, Warren A, Will R, Wilson MJC, Yoo HS, Zhang C, Zhang Y, Sobral BW. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014; 42(D1):581–91. doi:10.1093/nar/gkt1099. http://nar.oxfordjournals.org/content/42/D1/D581.full.pdf+html.
Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014; 30(14):2068–9. doi:10.1093/bioinformatics/btu153. http://bioinformatics.oxfordjournals.org/content/30/14/2068.full.pdf+html.
Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015; 31(22):3691–3. doi:10.1093/bioinformatics/btv421. http://bioinformatics.oxfordjournals.org/content/31/22/3691.full.pdf+html.
Löytynoja A, Goldman N. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A. 2005; 102(30):10557–62. doi:10.1073/pnas.0409137102. http://www.pnas.org/content/102/30/10557.full.pdf.
Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLOS ONE. 2010; 5(3):1–10. doi:10.1371/journal.pone.0009490.
Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol. 2006; 23(2):254–67. doi:10.1093/molbev/msj030. http://mbe.oxfordjournals.org/content/23/2/254.full.pdf+html.
Huson DH, Steel M. Phylogenetic trees based on gene content. Bioinformatics. 2004; 20(13):2044–9. doi:10.1093/bioinformatics/bth198. http://bioinformatics.oxfordjournals.org/content/20/13/2044.full.pdf+html.
Bryant D, Moulton V. In: Guigó R, Gusfield D, (eds).NeighborNet: An Agglomerative Method for the Construction of Planar Phylogenetic Networks. Berlin: Springer; 2002, pp. 375–91. doi:10.1007/3-540-45784-4_28. http://dx.doi.org/10.1007/3-540-45784-4_28
Brynildsrud O, Bohlin J, Scheffer L, Eldholm V. Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 2016; 17(1):238. doi:10.1186/s13059-016-1108-8.
Carbon S, Ireland A, Mungall CJ, Shu S, Marshall B, Lewis S, Hub A, Group WPW. AmiGO: online access to ontology and annotation data. Bioinformatics. 2008; 25(2):288. doi:10.1093/bioinformatics/btn615.
Mi H, Huang X, Muruganujan A, Tang H, Mills C, Kang D, Thomas PD. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements. Nucleic Acids Res. 2016; 45(D1):183. doi:10.1093/nar/gkw1138.
Katoh K, Frith MC. Adding unaligned sequences into an existing alignment using MAFFT and LAST. Bioinformatics. 2012; 28(23):3144–6. doi:10.1093/bioinformatics/bts578. http://bioinformatics.oxfordjournals.org/content/28/23/3144.full.pdf+html.
Eddy SR. Profile hidden Markov models. Bioinformatics. 1998; 14(9):755–63. doi:10.1093/bioinformatics/14.9.755. http://bioinformatics.oxfordjournals.org/content/14/9/755.full.pdf+html.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool,. J Mol Biol. 1990; 215(3):403–10. doi:10.1006/jmbi.1990.9999.
Boto L. Horizontal gene transfer in evolution: facts and challenges. Proc R Soc Lond B Biol Sci. 2010; 277(1683):819–27. doi:10.1098/rspb.2009.1679. http://rspb.royalsocietypublishing.org/content/277/1683/819.full.pdf.
Coordinators NR. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2016; 44(Database issue):7–19. doi:10.1093/nar/gkv1290.
Leplae R, Lima-Mendez G, Toussaint A. ACLAME: A CLAssification of Mobile genetic Elements, update 2010. Nucleic Acids Res. 2010; 38(suppl 1):57–61. doi:10.1093/nar/gkp938.
Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004; 14(5):988–95. doi:10.1101/gr.1865504. http://genome.cshlp.org/content/14/5/988.full.pdf+html.
Bazinet AL, Mitter KT, Davis DR, Van Nieukerken EJ, Cummings MP, Mitter C. Phylotranscriptomics resolves ancient divergences in the Lepidoptera. Syst Entomol. 2017; 42:305–16. doi:10.1111/syen.12217.
Bazinet AL, Cummings MP, Mitter KT, Mitter CW. Can RNA-Seq resolve the rapid radiation of advanced moths and butterflies (Hexapoda: Lepidoptera: Apoditrysia)? An exploratory study. PLOS ONE. 2013; 8(12). doi:10.1371/journal.pone.0082615.
Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014; 30(9):1312–3. doi:10.1093/bioinformatics/btu033. http://bioinformatics.oxfordjournals.org/content/30/9/1312.full.pdf+html.
Bazinet AL, Zwickl DJ, Cummings MP. A gateway for phylogenetic analysis powered by grid computing featuring GARLI 2.0. Syst Biol. 2014; 63(5):812–8. doi:10.1093/sysbio/syu031. http://sysbio.oxfordjournals.org/content/63/5/812.full.pdf+html.
Pattengale ND, Alipour M, Bininda-Emonds ORP, Moret BME, Stamatakis A. In: Batzoglou S, (ed).How Many Bootstrap Replicates Are Necessary?. Berlin: Springer; 2009, pp. 184–200. doi:10.1007/978-3-642-02008-7_13. http://dx.doi.org/10.1007/978-3-642-02008-7_13
Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010; 26(12):1569–71. doi:10.1093/bioinformatics/btq228. http://bioinformatics.oxfordjournals.org/content/26/12/1569.full.pdf+html.
Bruen TC, Philippe H, Bryant D. A simple and robust statistical test for detecting the presence of recombination. Genetics. 2006; 172(4):2665–81. doi:10.1534/genetics.105.048975. http://www.genetics.org/content/172/4/2665.full.pdf.
Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol. 2014; 15(11):524. doi:10.1186/s13059-014-0524-x.
Didelot X, Wilson DJ. ClonalFrameML: Efficient inference of recombination in whole bacterial genomes. PLoS Comput Biol. 2015; 11(2):1–18. doi:10.1371/journal.pcbi.1004041.
Swofford DL. Phylogenetic analysis using parsimony (* and other methods). Version 4.Sunderland: Sinauer Associates; 2002.
Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981; 53(1):131–47. doi:10.1016/0025-5564(81)90043-2.
Rambaut A. http://tree.bio.ed.ac.uk/software/figtree/. Accessed Oct 2016.
Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016; 44(W1):242. doi:10.1093/nar/gkw290.
Cheng L, Connor TR, Sirén J, Aanensen DM, Corander J. Hierarchical and spatially explicit clustering of DNA sequences with BAPS software. Mol Biol Evol. 2013; 30(5):1224. doi:10.1093/molbev/mst028.
Cummings MP, Neel MC, Shaw KL. A genealogical approach to quantifying lineage divergence. Evolution. 2008; 62(9):2411–22. doi:10.1111/j.1558-5646.2008.00442.x.
Bazinet AL, Cummings M. The Lattice Project: a Grid research and production environment combining multiple Grid computing models. Distributed & Grid Computing—Science Made Transparent for Everyone. Principles, Applications and Supporting Communities. Marburg: Tectum Publishing House; 2008, pp. 2–13.
Fujie K, Hu HY, Tanaka H, Urano K, Saitou K, Katayama A. Analysis of respiratory quinones in soil for characterization of microbiota. Soil Sci Plant Nutr. 1998; 44(3):393–404. doi:10.1080/00380768.1998.10414461. http://dx.doi.org/10.1080/00380768.1998.10414461
Kandeler E. Physiological and biochemical methods for studying soil biota and their function In: Stotzky G, editor. Soil Biochemistry: vol. 9. CRC Press, n.p.: 1996. p. 253–86. Chap. 7.
Czaban J, Gajda A, Wróblewska B. The motility of bacteria from rhizosphere and different zones of winter wheat roots. Pol J Environ Stud. 2007; 16(2):301–8.
Kaiser D. Bacterial swarming: A re-examination of cell-movement patterns. Curr Biol. 2007; 17(14):561–70. doi:10.1016/j.cub.2007.04.050.
Murphy SL, III RLT. In: Paul EA, (ed).Soil Microbiology, Ecology and Biochemistry: Academic Press, n.p.; 2006, pp. 53–83. Chap. 3.
Bofkin L, Goldman N. Variation in evolutionary processes at different codon positions. Mol Biol Evol. 2006; 24(2):513. doi:10.1093/molbev/msl178.
World Health Organization and International Office of Epizootics and Food and Agriculture Organization of the United Nations. Anthrax in Humans and Animals. Nonserial Publication Series: World Health Organization, n.p.; 2008. https://books.google.com/books?id=EKYihvnaA7oC.
Koehler TM. In: Koehler TM, (ed).Bacillus anthracis Genetics and Virulence Gene Regulation. Berlin: Springer; 2002, pp. 143–64. doi:10.1007/978-3-662-05767-4_7. http://dx.doi.org/10.1007/978-3-662-05767-4_7
I thank Shashikala Ratnayake for assistance generating the Mash-distance-based phylogeny of Bacillus, Todd Treangen for helpful discussions about the project, and M.J. Rosovitz, Brian Janes, and Martina Eaton for providing feedback on drafts of the manuscript. I also thank two anonymous reviewers for their comments.
This work was funded under Contract No. HSHQDC-15-C-00064 awarded by the Department of Homeland Security (DHS) Science and Technology Directorate (S&T) for the operation and management of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the author and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the DHS or S&T. In no event shall DHS, NBACC, S&T or Battelle National Biodefense Institute have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. DHS does not endorse any products or commercial services mentioned in this publication.
Availability of data and materials
All data analyzed during the current study were downloaded from public databases (ACLAME, NCBI, and PATRIC), and dates of download are provided in the text. A list of RefSeq assembly accessions for the taxa used in this study is provided in Additional file 1. HaMStR databases, concatenated phylogenetic data matrices, phylogenetic trees, Perl scripts and other data supporting the results of this study are available in the Dryad repository, Dryad DOI: doi:10.5061/dryad.dm82j.
Ethics approval and consent to participate
Consent for publication
The author declares that he has no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
RefSeq assembly accessions for the taxa used in this study. A list of RefSeq assembly accessions for the bcsl_498 taxa. (TXT 7 kb)
Roary gene presence/absence matrix for bcsl_114 taxa. The gene presence/absence spreadsheet lists all genes in the pan-genome and the taxa in which they are present, along with summary statistics and additional information. (CSV 38195 kb)
Roary gene presence/absence matrix for bcsl_114 taxa, augmented with gene presence/absence information for bcsl_498 taxa. The gene presence/absence spreadsheet lists all genes in the pan-genome and the taxa in which they are present, along with summary statistics and additional information. (CSV 59187 kb)
Binary matrix of phenotypic traits exhibited by bcsl_498 taxa. Binary phenotypic trait matrix for bcsl_498 taxa, created using the isolate metadata obtained from PATRIC. (XLSX 61 kb)
Construction of a HaMStR database. Prokka was used to annotate 114 B. cereus s. l. complete genomes. The resulting protein-coding gene annotations were provided as input to Roary, which constructed a pan-genome consisting of 59,989 orthologous protein sequence clusters. After filtering, which included mobile genetic element removal, the 8954 remaining clusters were aligned with MAFFT. Gene models were built from the multiple sequence alignments using the hmmbuild program from HMMER. The 8954 gene models, together with separately constructed reference taxon BLAST databases, constituted the hamstr_full_mges_removed HaMStR database. (PDF 15 kb)
Construction of a concatenated data matrix. Prokka was used to annotate B. cereus s. l. “query genomes”— i.e., draft genomes that were not included in bcsl_114. The resulting protein-coding gene annotations were provided as input to HaMStR, which used the hmmsearch program from HMMER followed by BLASTP to assign query sequences to HaMStR database gene models. Clusters of orthologous protein sequences from query and database taxa were aligned with MAFFT and converted to corresponding nucleotide alignments. The multiple sequence alignments were reduced to a single sequence per taxon with a consensus procedure that used nucleotide ambiguity codes to combine information from sequence variants where necessary. The individual alignments were then concatenated to produce the final data matrix. (PDF 15 kb)
Mash-distance-based phylogeny of the genus Bacillus. Phylogeny of 146 Bacillus genomes, computed with Mash and FastME. (PDF 34 kb)
Attributes of B. cereus s. l. species, clades, and groups. Tables that provide number of taxa, average number of genes found among Roary clusters, and average number of genes present in mat_6 for B. cereus s. l. species, clades, and groups. (XLSX 44 kb)
Rarefaction curve: core vs. all genes. The rarefaction curve shows that after ≈35 genomes have been sampled (≈31% of all genomes), the number of core genes remains fairly constant at ≈600 genes, while the total number of genes in the pan-genome continues to increase almost linearly. (PDF 25 kb)
Rarefaction curve: new vs. unique genes. The rarefaction curve shows that as genomes are sampled, genes never before observed continue to be found at a fairly steady rate, and the total number of unique genes discovered continues to increase, with no indication of soon approaching an asymptote. (PDF 15 kb)
Accessory binary tree and gene presence/absence visualization. The “accessory binary tree” and gene presence/absence information produced by Roary are plotted side-by-side. The outermost B. cereus s. l. clades include taxa with relatively few accessory genes included in the analysis, such as B. cytotoxicus, B. mycoides, and B. pseudomycoides. By contrast, the genomes with the most accessory genes present belong to the highly clonal clade of B. anthracis strains. (PDF 485 kb)
Scoary result summary, including enriched gene ontology biological processes. Positively or negatively trait-associated gene sets produced by Scoary were subsequently tested for possible enrichment of gene ontology biological processes. Complete Scoary results for eight traits, including gene annotations, are also given. (XLSX 423 kb)
Robinson-Foulds distance between all pairs of bcsl_114 phylogenetic results. Both the standard and normalized Robinson-Foulds distance is given. (XLSX 42 kb)
BCSL_114 maximum likelihood phylogenetic analysis results. Cladogram depicting the best estimate of the phylogenetic relationships among bcsl_114 taxa, computed with RAxML using 8954 genes (ml_6; Table 4). B. cytotoxicus was used to root the tree. Major B. cereus s. l. clades and groups are indicated, as are bootstrap probabilities. (PDF 40 kb)
Taxon metadata for bcsl_498. Table providing clade, group and hierBAPS cluster affiliation for bcsl_498 taxa, along with the number of genes found among Roary clusters (complete genomes only) and the number of genes present in mat_6 (out of a possible total of 8954 genes). (XLSX 72 kb)
BCSL_498 maximum likelihood phylogenetic analysis results, color-coded by species. Phylogram depicting an estimate of the phylogenetic relationships among bcsl_498 taxa, computed with RAxML using 8954 genes (ml_8; Table 4). B. manliponensis was used to root the tree. B. cereus s. l. species tested for monophyly with the gsi are color-coded. (PDF 71 kb)
About this article
- Bacillus cereus sensu lato
- Bacillus cereus group
- Bayesian model-based clustering
- Phylogenetic clustering