Genome wide evolutionary analyses reveal serotype specific patterns of positive selection in selected Salmonella serotypes

Background The bacterium Salmonella enterica includes a diversity of serotypes that cause disease in humans and different animal species. Some Salmonella serotypes show a broad host range, some are host restricted and exclusively associated with one particular host, and some are associated with one particular host species, but able to cause disease in other host species and are thus considered "host adapted". Five Salmonella genome sequences, representing a broad host range serotype (Typhimurium), two host restricted serotypes (Typhi [two genomes] and Paratyphi) and one host adapted serotype (Choleraesuis) were used to identify core genome genes that show evidence for recombination and positive selection.
Results Overall, 3323 orthologous genes were identified in all 5 Salmonella genomes analyzed. Use of four different methods to assess homologous recombination identified 270 genes that showed evidence for recombination with at least one of these methods (false discovery rate [FDR] <10%). After exclusion of genes with evidence for recombination, site and branch specific models identified 41 genes as showing evidence for positive selection (FDR <20%), including a number of genes with confirmed or likely roles in virulence and ompC, a gene encoding an outer membrane protein, which has also been found to be under positive selection in other bacteria. A total of 8, 16, 7, and 5 genes showed evidence for positive selection in Choleraesuis, Typhi, Typhimurium, and Paratyphi branch analyses, respectively. Sequencing and evolutionary analyses of four genes in an additional 42 isolates representing 23 serotypes confirmed branch specific positive selection and recombination patterns.
Conclusion Our data show that, among the four serotypes analyzed, (i) less than 10% of Salmonella genes in the core genome show evidence for homologous recombination, (ii) a number of Salmonella genes are under positive selection, including genes that appear to contribute to virulence, and (iii) branch specific positive selection contributes to the evolution of host restricted Salmonella serotypes.


Background
Salmonella is a ubiquitous human and animal pathogen. This genus contains >2,500 recognized serotypes and is divided into two species, Salmonella bongori and Salmonella enterica. S. enterica consists of six subspecies (i.e., enterica, salamae, arizonae, diarizonae, houtenae, and indica) [1]. Salmonella enterica subsp. enterica serotypes can also be divided into subdivisions according to their host adaptation [2]. For example, Uzzau et al. [2] proposed that Salmonella serotypes can be divided into (i) host-restricted Salmonella serotypes (i.e., serotypes exclusively associated with one particular host, e.g., Salmonella Typhi and Paratyphi A); (ii) host-adapted Salmonella serotypes (i.e., serotypes prevalent in one particular host species, but able to cause disease in other host species, e.g., Salmonella Choleraesuis); and (iii) unrestricted Salmonella serotypes (i.e., serotypes capable of causing self-limiting gastroenteritis and, less commonly, systemic disease in a wide range of host species, e.g., Salmonella Typhimurium).
Multi-locus sequence typing (MLST) data indicate that the last common ancestor of the human host-adapted Salmonella Typhi existed 15,000-150,000 years ago [3]. The evolution of Salmonella Typhi towards a lifestyle characterized by systemic infection and transmission by excretion through the gall bladder rather than luminal gut colonization [4] involved a combination of acquisition events (e.g., acquisition of Vi capsule related genes), and deletion events (e.g., loss of virulence-associated genes, such as several genes in SPI-1, SPI-2, SPI-3, SPI-4 and SPI-5). Salmonella Paratyphi A also causes typhoid fever, although the symptoms are typically milder than those caused by Salmonella Typhi. While Salmonella Paratyphi A also appears to have evolved recently, Salmonella Typhi and Paratyphi A clearly show distinct differences in their genome evolution, including a number of unique gene inactivation events in these two serotypes [5]. Non-typhoidal Salmonella serotypes are responsible for gastroenteritis in humans and other animals. These serotypes are mainly transmitted by ingestion of food, feed, or water contaminated with infected feces [6], but can also be transmitted by direct contact [7,8]. Disease caused by non-typhoidal Salmonella is one of the most common bacterial foodborne diseases worldwide [9]. Salmonella Typhimurium is one of the most common non-typhoidal Salmonella serotypes, is found worldwide, and can cause disease, predominantly self limiting gastroenteritis, in a large number of animal species [2]. The host adapted Salmonella Choleraesuis can cause severe disease, characterized by septicemia and enterocolitis, in swine. While relatively uncommon, this serotype can also infect humans where it typically causes severe invasive infections, e.g., infective aneurysm [10].
The importance of acquisition of novel (non-homologous) genes by lateral gene transfer has been clearly demonstrated in a number of bacteria, including a number of bacterial pathogens [11][12][13][14]. For example, acquisition of pathogenicity islands has played a critical role in the evolution of Salmonella [13] and other Gram-negative and Gram-positive pathogens [15]. Gene degradation and gene deletions also have been shown to play a critical role in bacterial evolution, particularly when organisms with a broad niche specificity adapt to narrow and specific ecological niches [5,16]. For example, it has been suggested that gene degradation and gene deletion contribute to host adaptation in both Salmonella Typhi and Salmonella Paratyphi A [5]. Microarray technologies have also allowed for rapid and large scale studies on gene presence/absence in a large number of isolates, including in Salmonella [17]. In addition to gene acquisition and deletion, positive selection and homologous recombination play important roles in the evolution of bacteria and bacterial pathogens [18][19][20][21].
Genome wide studies on positive selection and homologous recombination in bacterial pathogens, including Streptococcus spp. [20], Listeria monocytogenes [18], Campylobacter [22], E. coli [23,24], and Shigella [24] have contributed to a better understanding of the evolution of these important pathogens. So far, no genome wide analyses of positive selection in Salmonella have been reported. One study [25] evaluated 410 genes present in both S. enterica and E. coli and reported that 50% of amino acid substitutions in these genes appear to have been fixed by positive selection in one of these species. In order to further improve our understanding of the evolution of Salmonella, we performed full genome analyses for homologous recombination and positive selection using the completed and published genome sequences for five Salmonella strains, including the host restricted Salmonella Typhi (two strains) and Paratyphi A, the host adapted Salmonella Choleraesuis, and the broad host range Salmonella Typhimurium. Our analyses focused on the evolution of core genome genes (i.e., genes found in all 5 genomes) and did not include efforts to detect genes acquired by Salmonella through horizontal gene transfer and subsequent nonhomologous recombination (e.g., virulence gene islands), as these types of evolutionary events have already been well characterized [13,26,27]. Analysis of the Salmonella serotypes included in our study will, in particular, provide an improved understanding of the roles of positive selection and homologous recombination in the evolution of host-adapted pathogen strains and lineages.

Genome sequences
Five available annotated Salmonella enterica subsp. enterica genome sequences were used in this study (Table 1). Although, as of August 20, 2009, 16 fully sequenced Salmonella genomes, including the 5 genomes used in our study, were available in GenBank (see Additional file 1), the 5 genomes used were the only fully sequenced Salmonella genomes available when our analyses were initiated. These 5 genomes allow for evaluation of evolutionary trends among host-restricted and host adapted Salmonella strains as they include the serotypes Typhi, Paratyphi A, and Choleraesuis.

Identification of orthologous genes present in all five Salmonella genomes analyzed
OrthoMCL [28], which has previously been used for prokaryotic genome analyses [20,22], was used to identify orthologous genes in the five Salmonella genomes. Orthologs present in all five genomes were aligned using ClustalW [29]. Multiple sequence alignments were carried out on amino acid sequences from each orthologous group, followed by conversion to nucleotide sequence alignments using the PAL2NAL software [30]. This strategy was used to allow for correct alignment of diversified regions in which multiple nucleotide substitution events have taken place; since amino acid sequences are more conserved than DNA sequences, they are easier to align and the final alignments are more reliable. Alignments containing variable sequence lengths or having low alignment scores were manually evaluated and edited, using BioEdit software [31], as previously described [18]. For example, alignments containing sequences with different lengths and alignments that contained multiple indels that caused incorrect alignments were reviewed and edited as detailed in [18].
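The conversion from an amino acid alignment to a codon alignment described above can be illustrated with a minimal sketch. This is not the PAL2NAL implementation; the function name `protein_to_codon_alignment` and the toy sequences are hypothetical, and the sketch assumes that each coding sequence is in frame and matches its protein residue for residue (it does not verify the translation).

```python
def protein_to_codon_alignment(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned coding sequence onto its gapped protein alignment row.

    Each aligned residue consumes one codon (3 nt) of the coding sequence;
    each alignment gap becomes a 3-nt gap in the codon alignment.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")
    codons = []
    pos = 0  # current position in the ungapped coding sequence
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example (hypothetical sequences): geneA lacks the third residue.
aligned = {"geneA": "MK-LV", "geneB": "MKTLV"}
cds = {"geneA": "ATGAAACTGGTT", "geneB": "ATGAAAACCCTGGTT"}
codon_alignment = {
    name: protein_to_codon_alignment(aligned[name], cds[name]) for name in aligned
}
for name, row in codon_alignment.items():
    print(name, row)
```

Because gaps are always inserted in multiples of three nucleotides, the reading frame is preserved, which is what the downstream codon-based tests for positive selection require.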

Detection of genes under positive selection
Positive selection can be detected by comparing the rate of non-synonymous substitutions (dN) to the rate of synonymous substitutions (dS). While different methods exist for detection of positive selection, PAML (Phylogenetic Analysis by Maximum Likelihood) was used here as (i) its use for detection of signals of positive selection in bacteria [18,20,23,24,32,33], viruses [34], and eukaryotes [35,36] has been well documented, (ii) it has been shown to have a relatively good power to detect positive selection even with as few as 5 sequences, while keeping the number of false positives low [37], and (iii) it allows for detection of signals of branch specific positive selection. We used two types of tests implemented in PAML v3.15 to identify genes with evidence for positive selection [38], as previously detailed [18]. Briefly, an overall test for positive selection (Test Overall; TO) was carried out to identify genes under positive selection in any or all of the branches of a given phylogeny; this test compares the null model M1a (nearly-neutral) to the alternative model M2a (positive selection) [37]. To identify genes that are under positive selection in specific branches of the Salmonella phylogeny, the branch-site test 2 [39] was used. The branch-site test was specifically used to identify genes under positive selection in the ancestral branches of (i) the human restricted serotypes Typhi (Ty#) and Paratyphi A (Pty#), (ii) the porcine adapted serotype Choleraesuis (Ch#), and (iii) the unrestricted serotype Typhimurium (Tym#) (Figure 1). Overall, 18 different phylogenetic trees represented the phylogeny of the 3316 Salmonella orthologous genes, including one tree that represented the phylogeny of 1198 genes. Both the overall test and the branch site tests were performed using the gene specific trees.
For each test, nested models (one null model that does not allow for positive selection and one alternative model that allows for positive selection) were compared using a Likelihood Ratio Test (LRT) [40]. For each model, three replicates were generated and the maximum likelihood values for each model were used in the LRT in order to eliminate the runs that could not reach the global maximum likelihood score. Tests that yielded LRT values < -0.1 were re-run 10 times and the maximum values for each model were used to calculate the LRT. Remaining negative LRT values (some tests yielded values between -0.1 and 0) were rounded to zero (p-value = 1). For all branch-specific tests, one degree of freedom was used to calculate p-values, while for the overall test, two degrees of freedom were used to calculate p-values. Because recombination may generate false positive results with PAML, the final analysis of positive selection was carried out only for those genes that showed no evidence for recombination with any of the four methods used to detect evidence of recombination.
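The likelihood ratio calculation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `lrt_p_value` and the log-likelihood values are hypothetical, the statistic is taken as twice the difference between the best log-likelihoods of the alternative and null models, and the re-running of tests with strongly negative values is not modeled. The p-values use the closed-form chi-square survival functions for one and two degrees of freedom.

```python
import math


def lrt_p_value(lnl_null_runs, lnl_alt_runs, df):
    """Return (LRT statistic, p-value) for a pair of nested models.

    lnl_null_runs / lnl_alt_runs: log-likelihoods from replicate runs;
    the maximum of each is used, as described in the text.
    df: 1 (branch-site tests) or 2 (overall test, M1a vs. M2a).
    """
    lnl_null = max(lnl_null_runs)
    lnl_alt = max(lnl_alt_runs)
    stat = 2.0 * (lnl_alt - lnl_null)
    stat = max(stat, 0.0)  # small negative values are rounded to zero (p = 1)
    if df == 1:
        p = math.erfc(math.sqrt(stat / 2.0))  # chi-square survival, 1 df
    elif df == 2:
        p = math.exp(-stat / 2.0)             # chi-square survival, 2 df
    else:
        raise ValueError("only df = 1 or df = 2 are handled in this sketch")
    return stat, p


# Hypothetical replicate log-likelihoods for one gene.
null_runs = [-1000.4, -1000.0, -1000.2]
alt_runs = [-998.0, -998.3, -998.1]
stat_overall, p_overall = lrt_p_value(null_runs, alt_runs, df=2)
stat_branch, p_branch = lrt_p_value(null_runs, alt_runs, df=1)
print(stat_overall, p_overall, p_branch)
```

With the same statistic, the two-degree-of-freedom test gives a larger p-value than the one-degree-of-freedom test, which is why the degrees of freedom must match the pair of models being compared.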

Detection of genes with evidence of recombination
Recombination analyses were performed using GENECONV version 1.81 [41], Maximum χ² [42], pairwise homoplasy index (PHI) [43] and neighbor similarity score (NSS) [44] to specifically detect evidence of homologous recombination among orthologous genes found in all 5 genomes; the 3316 alignments of orthologous genes used for these analyses thus contained one sequence from each genome and only recombination events between sequences present in the alignment were considered. Except for GENECONV, the approaches used are implemented in PhiPack [43]. GENECONV and Maximum χ² are substitution distribution methods, while NSS and PHI are compatibility methods [45,46]. None of these tests require that the true phylogenetic tree is known. GENECONV detects evidence of recombination by assessing the significance of long tracts of identical sites among pairs of sequences in a multiple alignment of informative sites. Maximum χ² searches for recombination breakpoints in the alignment by comparing the number of polymorphic and non-polymorphic sites downstream and upstream of each putative break point (in this method, all polymorphic sites are initially considered as putative recombination breakpoints). NSS uses pairs of informative sites to detect evidence for recombination by assessing the tendency of neighboring sites to be more compatible than sites that are farther apart. PHI measures the similarity between closely linked sites to assess whether a fragment shows evidence for recombination. GENECONV, Maximum χ² and NSS were used here as these methods, in a comparison of several methods (not including PHI), were shown to perform best (high power and low false positive rates) for sequences with divergence around 5% to 20% [47], representing a level of divergence expected between different Salmonella serotypes.
These methods still differ in their relative power and specificity for detecting recombination though (e.g., depending on sequence divergence) and multiple methods were thus used to identify genes with evidence of recombination, particularly to allow for exclusion of any genes that may have evolved through recombination from subsequent positive selection analyses, which may be affected by recombination.
For the GENECONV analyses, the parameter g-scale was set to 1 and inner p-values were used to identify genes with evidence for recombination [41]. For Maximum χ², a fixed window size of 2/3 the number of polymorphic sites was used, while for PHI, a window size of 50 nucleotides was used. P-values were estimated using 10,000 permutations of the alignment for GENECONV and 1,000 permutations for NSS, Maximum χ² and PHI.

Assessment of codon bias, nucleotide diversity and number of informative sites
To assess the codon bias, we identified the effective number of codons used in a gene (Nc) using the program "chips" in the EMBOSS package [48]. Nc values range from 20, where one codon is used for each amino acid, to 61, where all alternative synonymous codons are used. Lower values of Nc indicate higher codon bias in the gene, while higher values of Nc indicate lower codon bias. Nucleotide diversity and number of informative sites were obtained from PhiPack outputs.
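To make the two alignment summary statistics mentioned above concrete, the sketch below computes them for a toy alignment under their usual textbook definitions: an informative site is taken to be a column in which at least two different nucleotides each occur in at least two sequences, and nucleotide diversity is taken to be the mean proportion of differing sites over all sequence pairs. These definitions, the function names, and the toy sequences are assumptions for illustration; PhiPack's exact conventions (e.g., its handling of gaps) may differ.

```python
from collections import Counter
from itertools import combinations


def count_informative_sites(seqs):
    """Count columns where >= 2 nucleotides each appear in >= 2 sequences."""
    n_informative = 0
    for column in zip(*seqs):
        counts = Counter(base for base in column if base in "ACGT")
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_informative += 1
    return n_informative


def nucleotide_diversity(seqs):
    """Mean pairwise proportion of differing sites (assumes a gap-free alignment)."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(1 for a, b in zip(s1, s2) if a != b) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


# Toy alignment of five sequences (one per genome); hypothetical data.
toy_alignment = ["ATGCA", "ATGCA", "ATGTA", "ACGTA", "ACGTT"]
n_inf = count_informative_sites(toy_alignment)
pi = nucleotide_diversity(toy_alignment)
print(n_inf, pi)
```

In the toy alignment, columns 2 and 4 are informative, whereas the last column carries only a singleton difference and is therefore polymorphic but not informative.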

Statistical analyses
Correction for multiple testing was performed using the procedure reported by Benjamini and Hochberg [49] as implemented in the program Q-Value [50]. As previously detailed by our group [18], for each p-value, the q-value was calculated; the q-value represents the false discovery rate [FDR], i.e., the expected proportion of false positives among the significant tests. Corrections were performed separately for each test to account for testing of multiple genes. In a preliminary analysis of positive selection, all 3,316 genes were used for FDR correction. As recombination affects the tests for positive selection, the final positive selection analysis was performed using only those 3,046 genes that showed no evidence for recombination; FDR correction for this final positive selection analysis was thus performed with 3,046 genes. As the tests used for positive selection are already conservative [39], a false discovery rate (FDR) cutoff of 20% was used for the positive selection analyses [18]. For recombination analyses, an FDR cut-off of 10% was used to compensate for the fact that no correction for multiple tests (GENECONV, NSS, Maximum χ² and PHI) was carried out due to the high correlation among the tests [18].

Figure 1. Example of neighbor joining tree used for positive selection analysis. Gene specific trees were used for all positive selection analyses. The tree shown here represents the phylogeny of 849 genes. Branches used for branch specific analyses are indicated; Ch# = Choleraesuis branch specific test; Ty# = Typhi branch specific test; Tym# = Typhimurium branch specific test; Pty# = Paratyphi A branch specific test.
Associations between JCVI role categories and number of genes with (i) evidence of positive selection and (ii) evidence of recombination were tested using chi-square tests (or Fisher's exact tests where appropriate). Mann-Whitney U-tests (Wilcoxon tests) were used to determine whether selected continuous variables (i.e., gene length, codon bias, and nucleotide diversity) differed between a given role category and all other role categories. In addition, Mann-Whitney U-tests were used to test whether the p-values of the positive selection tests for genes in a given role category were significantly lower than the p-values among the genes in the other role categories. All Mann-Whitney U-tests were performed as one-sided tests. All tests were performed in the Statistical Analysis System (SAS) 9.1 (SAS Institute Inc., Cary, NC).
Bonferroni corrections for all tests were performed based on the number of tests performed. The cut off value for significance was set at 0.05; Bonferroni corrected p-values are reported unless otherwise stated. Actual p-values are reported unless p-values were < 0.001 or < 0.0001.
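The false discovery rate correction used for the gene-level tests in this section can be illustrated with a minimal sketch of the Benjamini and Hochberg step-up adjustment. This is a generic implementation written for illustration; the Q-Value program cited above may compute q-values somewhat differently, and the function name and p-values below are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes from one test (e.g., one branch test).
p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]
q_vals = benjamini_hochberg(p_vals)
significant = [i for i, q in enumerate(q_vals) if q < 0.20]  # 20% FDR cutoff
print(q_vals, significant)
```

Applying the function separately to the p-values of each test mirrors the statement that corrections were performed separately for each test; the 20% cutoff is the one used in the text for the positive selection analyses.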

Verification of positive selection and recombination patterns in selected genes in a larger Salmonella set
For four genes (Table 2), including two genes that showed evidence for positive selection and recombination (i.e., folK-2, sseC) and two genes that only showed evidence for positive selection (i.e., STM3258, purE) in the initial genome wide analyses, gene sequences were determined for an additional 42 Salmonella isolates to further test positive selection and recombination patterns. The 42 Salmonella isolates were selected to reflect a diversity of human and animal associated serotypes; specifically, the isolates were selected to represent the 15 most common human and animal associated serotypes in the US (as detailed in the 2003 Salmonella Annual Report from the US Centers for Disease Control and Prevention [51]) as well as two additional Salmonella Typhi isolates.

Table 2 footnotes: (a, start of footnote missing) for genes that showed evidence of recombination, results are shown in parentheses as recombination may affect the positive selection analyses. (b) Recombination tests that were significant (Q < 0.1) are listed; GEN = GENECONV; MAX = Maximum χ²; PHI = pairwise homoplasy index; NSS = neighbor similarity score. (c) Results of positive selection and recombination analyses were based on gene sequence data for the 5 genomes and 42 additional Salmonella isolates (see Additional file 2); for folK-2 and sseC, sequences were only obtained for 36 additional isolates; for STM3258, sequences were only obtained for 37 additional isolates. (d) Positive selection tests that were significant (P < 0.05); for genes that showed evidence of recombination with multiple tests, results are shown in parentheses as recombination may affect the positive selection analyses. (e) Recombination tests that were significant (P < 0.05).
Human and cattle isolates representing the common human and animal associated serotypes were selected as a convenience sample from the strain collection available at the Cornell University Food Safety Laboratory, which includes human and animal clinical isolates originally obtained from the New York State Department of Health and the Cornell University Animal Health Diagnostic Center, respectively. For common serotypes (e.g., Typhimurium) more isolates were included in this set as compared to less common serotypes (e.g., Dublin) (see Additional file 2 for a listing of all isolates used).
Multiple isolates with the same serotype were selected to represent the most common distinct Pulsed Field Gel Electrophoresis (PFGE) and multilocus sequence typing (MLST) types within a given serotype.
PCR conditions and primers for folK-2, sseC, purE, and STM3258 amplification are described in Additional file 3. PCR products were purified using Exonuclease I (USB) and shrimp alkaline phosphatase (USB). Purified PCR products were sequenced using the Applied Biosystems Automated 3730 DNA Analyzer at the Cornell University Life Sciences Core Laboratories Center. Big Dye Terminator chemistry and AmpliTaq-FS DNA Polymerase were used for sequencing. Alignments for positive selection and recombination analyses, which were performed as detailed above, were constructed using the gene sequences for the five genomes analyzed and the gene sequences for the additional isolates sequenced.

Initial identification and characterization of orthologous genes present in the five Salmonella genomes representing serotypes Typhi, Typhimurium, Choleraesuis, and Paratyphi A
Using OrthoMCL, a total of 3323 orthologous genes present in all 5 Salmonella genomes were identified. Since seven orthologous genes had low quality alignments, we excluded these genes and used 3316 orthologous genes for the analyses described below. Genes that were not found in all of the five strains were excluded from our analyses. The 3316 core genes represented 69, 81, 73, and 75%, respectively, of the Salmonella Choleraesuis, Paratyphi A, Typhimurium, and Typhi genes annotated in the genomes analyzed.
Interestingly, we identified one 2-gene cluster (i.e., STM0947 and STM0948), which was repeated 12 times in the Salmonella Choleraesuis genome, present once in the Typhimurium genome and absent in the Typhi and Paratyphi A genomes. These two genes encode a putative integrase (STM0947) and a putative cytoplasmic protein (STM0948), which differ by 4 and 1 non-synonymous substitution(s), respectively, between Choleraesuis and Typhimurium LT2. In addition, we identified one other gene (NT03ST2087, encoding a putative Tn10 transposase), which was repeated 7 times in the Salmonella Choleraesuis genome and found once in the Salmonella Typhi CT18 genome, while not present in the other genomes analyzed.
Salmonella Choleraesuis thus appears to contain at least two multi-copy mobile genetic elements.
Genes categorized in the JCVI role categories "Hypothetical Proteins", "Protein synthesis", "Unclassified" and "Unknown function" showed a tendency to have shorter alignments (P < 0.001, P = 0.027, P = 0.002, P = 0.017, respectively; one sided U-test) as compared to genes in other role categories, while genes in the JCVI role categories "Amino Acid Biosynthesis", "DNA Metabolism", "Energy Metabolism", and "Transport and Binding Proteins" showed a tendency to have longer alignments (P < 0.001, P = 0.001, P < 0.001, and P < 0.001, respectively; one sided U-test) as compared to genes in other role categories.
Genes in the JCVI role categories "Cellular envelope", "Hypothetical proteins", and "Unclassified" showed a tendency to have more non-synonymous substitutions (P = 0.009, P < 0.001, and P < 0.001, respectively; one sided U-test) as compared to genes in other role categories. Genes in the JCVI role categories "Biosynthesis of cofactors, prosthetic groups, and carriers", "Energy Metabolism", and "Transport and Binding Proteins" showed a tendency to have more synonymous substitutions (P < 0.001, P < 0.001, and P = 0.001, respectively; one sided U-test) as compared to genes in other role categories. Genes in the JCVI role categories "Amino acid biosynthesis", "Energy metabolism", "Protein Synthesis", "Purines, pyrimidines, nucleosides, and nucleotides", "Transcription", and "Transport and binding proteins" showed a tendency to have higher codon bias (P = 0.006, P < 0.001, P < 0.001, P < 0.001, P = 0.033, and P = 0.010, respectively; one sided U-test) as compared to genes in other role categories.

Approximately 8% of core genes show significant evidence for recombination
Among the 3316 orthologous genes, 233 genes showed no substitutions; these genes thus were not analyzed for evidence of homologous recombination (since the methods used cannot detect evidence of recombination if an alignment presents no polymorphisms). While the remaining 3083 genes were analyzed for recombination using GENECONV, only 2849 genes were analyzed using Maximum χ², NSS and PHI (467 ortholog alignments had ≤1 informative site and thus could not be analyzed with these programs in PhiPack). Overall, 270 genes (8.14% of all 3,316 core genes) showed evidence for recombination in at least one of the four tests used (FDR < 10%). A total of 192, 155, 69, and 20 orthologs showed evidence of recombination using GENECONV, Maximum χ², NSS and PHI, respectively. Only 10 genes showed evidence for recombination with all 4 approaches (Table 3). Substitution methods (i.e., GENECONV and Maximum χ²) thus identified more genes with evidence of recombination as compared to compatibility methods (i.e., NSS and PHI). The differences in the number of genes with evidence of recombination detected with each method are related to (i) the power of the methods to detect recombination in sequences with different divergence and recombination levels, as well as (ii) the number of false positives associated with each method under different scenarios of heterogeneous substitution rates and convergent evolution. For example, GENECONV and Maximum χ² showed more power to detect recombination as compared to NSS in a study using computer simulations [47], consistent with the observation that both of these methods identified the largest number of genes with evidence of recombination here.
Both GENECONV and NSS also have been found, in a study using empirical data, to show higher levels of false positives as compared to Maximum χ² when the sequences are very divergent [45], while, in another study [43], both NSS and Maximum χ² have been shown to yield more false positives than PHI, particularly in sequences with mutational hot spots. This is consistent with our observation that PHI identified the lowest number of genes with evidence for homologous recombination.
When considering all 270 genes identified as having evidence of recombination by at least one method, genes with higher numbers of informative sites (P < 0.0001; one sided U-test), longer alignments (P < 0.0001; one sided U-test), higher codon bias (P < 0.0001; one sided U-test), and higher nucleotide diversity (P < 0.0001; one sided U-test) were more likely to have evidence for recombination. An overall test showed that genes with evidence of recombination were not randomly distributed among the 20 JCVI role categories (P < 0.001; Fisher's exact test with Monte Carlo simulation). Subsequent individual chi-square and Fisher's exact tests, determining whether genes with evidence for recombination were associated with individual role categories, showed that genes with evidence of recombination were significantly overrepresented in the role categories "Biosynthesis of cofactors, prosthetic groups, and carriers", "Energy metabolism", "Hypothetical proteins" and "Purines, pyrimidines, nucleosides, and nucleotides" (uncorrected P = 0.0035, P = 0.0037, P = 0.0034, and P = 0.0493, respectively) (Figure 2). However, after corrections for multiple comparisons, these associations were not significant (Bonferroni corrected P = 0.063, P = 0.066, P = 0.061, and P = 0.887, respectively).
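The relationship between the uncorrected and Bonferroni corrected p-values reported above can be sketched as a simple multiplication by the number of tests. The number of tests is not stated explicitly in the text; the value of 18 used below is an assumption inferred from the ratio of the reported corrected and uncorrected values, and the function name is hypothetical.

```python
def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Uncorrected p-values reported in the text for the four role categories.
uncorrected = [0.0035, 0.0037, 0.0034, 0.0493]
# ASSUMPTION: 18 tests, inferred from the reported values (not stated in the text).
corrected = bonferroni(uncorrected, n_tests=18)
for p_raw, p_adj in zip(uncorrected, corrected):
    print(f"{p_raw:.4f} -> {p_adj:.3f}")
```

Under this assumption, the corrected values agree with those reported to within the rounding of the uncorrected p-values, and none falls below the 0.05 significance cutoff.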

Initial analysis revealed a total of 81 Salmonella genes showing evidence for positive selection
When preliminary positive selection analyses were performed on all 3,316 orthologous genes, 21 genes showed evidence for positive selection (FDR <20%) in the overall test (TO) (Additional file 4). A total of 23, 21, 13, and 14 genes, respectively, showed evidence of positive selection (FDR <20%), using the branch-site test, in the Choleraesuis, Typhi, Typhimurium, and Paratyphi A branch (Additional file 4). As the two Typhi isolates formed a single branch in the phylogenies for only 1261 genes, only these genes were used to test for positive selection in the Typhi branch. While 81 genes showed evidence of positive selection in at least one test (including 11 genes with evidence for positive selection in two tests, see Additional file 4), 32 of these genes also showed evidence of recombination with at least one of the four recombination tests used (Table 4; Additional file 4). Genes with evidence of recombination were more likely to be under positive selection (P < 0.0001; Chi-square test). Although this may indicate that positive selection contributes to fixation of new allelic variants that were generated by recombination [18], it may also reflect that the positive selection tests were affected by intragenic recombination [52]. Thus, FDR corrections for positive selection analyses were repeated after removal of the 270 genes with evidence of recombination; these new FDR corrections used 3,046 genes for the overall (TO) test and the branch tests of Choleraesuis, Typhimurium and Paratyphi A, and 1,108 genes for the Typhi branch test. All data in the subsequent sections represent the data for genes with no evidence for homologous recombination, unless otherwise stated.
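The association between recombination and positive selection reported above can be illustrated with a Pearson chi-square test on a 2x2 table. The table below is a reconstruction from counts given in the text (3,316 genes, 270 with evidence of recombination, 81 with evidence of positive selection, 32 with both) and is an assumption; the authors' analysis was run in SAS and may have used a different table or a continuity correction. The function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df
    return chi2, p


# ASSUMPTION: counts reconstructed from the totals reported in the text.
recomb_and_selected = 32
recomb_not_selected = 270 - 32                  # 238
no_recomb_selected = 81 - 32                    # 49
no_recomb_not_selected = 3316 - 270 - 49        # 2997

chi2, p = chi_square_2x2(
    recomb_and_selected, recomb_not_selected,
    no_recomb_selected, no_recomb_not_selected,
)
print(round(chi2, 1), p)
```

Under these reconstructed counts, about 12% of genes with evidence of recombination (32 of 270) also show evidence of positive selection, compared with under 2% of the remaining genes (49 of 3,046), consistent with the reported P < 0.0001.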

A total of 41 Salmonella genes with no evidence of recombination showed evidence of positive selection
Positive selection tests identified 5 genes with evidence for positive selection (FDR <20%) in the overall test (TO) (Table 5). A total of 8, 16, 7, and 5 genes, respectively, showed evidence of positive selection (FDR <20%), using the branch-site test, in the Choleraesuis, Typhi, Typhimurium, and Paratyphi A branches (Table 5; Additional file 5). None of these genes showed evidence of positive selection in more than one test.
No association between a low effective number of codons used by a gene (Nc) and evidence for positive selection was observed (P > 0.05; one-sided U-test), suggesting that results of positive selection analyses were not biased by constraints on codon usage, which could result in a low synonymous substitution rate in these genes. Moreover, no association between low dS (the number of synonymous substitutions divided by the number of synonymous sites) and positively selected genes was observed (P > 0.05; one-sided U-test), supporting that the results were not biased by a low synonymous substitution rate. A Fisher's exact test did not find any significant overall association between the 20 JCVI role categories and the genes under positive selection (Figure 3), possibly due to the low number of genes under positive selection in each role category. To further test for associations between positive selection and gene role category, we thus assessed, for each of the role categories, whether the distribution of the p-values for each positive selection test deviated from the random distribution, using the non-parametric U-test. The JCVI role category "Hypothetical proteins" showed significant trends of having genes with low p-values in the Choleraesuis, Typhimurium and Paratyphi A branch specific tests for positive selection (Bonferroni corrected P = 0.042, P = 0.034 and P < 0.001, respectively; one sided U-test) as compared to genes in other role categories. In addition, genes in the JCVI role categories "Unclassified" and "Protein synthesis" showed a significant trend of having low p-values in the Choleraesuis and Typhimurium branch tests for positive selection, respectively, as compared to genes in other role categories (Bonferroni corrected P = 0.002 and P = 0.013, respectively; one sided U-test).
Interestingly, ompC showed evidence for positive selection in our study (Table 5) as well as in a previous study of Shigella and E. coli [24]. Our analyses showed that aa residues 228 and 274 show evidence for positive selection (Additional file 6), while aa 163, 202, and 203 showed evidence for positive selection in E. coli and Shigella [24]. Salmonella OmpC aa site 228, which was found to be under positive selection here, is located in a region that is absent from E. coli OmpC but present in Shigella OmpC, while Salmonella OmpC aa site 274 is located in a region that is absent from OmpC in both E. coli and Shigella.

Verification of positive selection and recombination patterns, identified by genome wide analyses, for four genes among 42 Salmonella isolates
In order to confirm positive selection and recombination patterns identified by the full genome analyses, we used a larger set of 42 Salmonella isolates to sequence and analyze four genes, including two genes that showed evidence for positive selection and recombination (i.e., folK-2, sseC) and two genes that only showed evidence for positive selection (i.e., STM3258, purE). folK-2, which encodes an enzyme involved in the synthesis of folic acid, could not be PCR amplified in 6 Salmonella isolates, representing serotypes Montevideo (n = 2), Oranienburg, Javiana, Urbana, and Muenster. Analyses of 41 folK-2 sequences (5 sequences from the genomes and 36 newly determined sequences) confirmed that this gene shows evidence for recombination (Table 2). sseC, which is located in the Salmonella pathogenicity island 2, could not be PCR amplified in 6 Salmonella isolates, representing serotypes Agona (n = 2), Havana, Kentucky, and Mbandaka (n = 2). Analyses of the sseC sequences also confirmed that this gene shows evidence for recombination (Table 2).

Table footnotes: 32 genes also showed evidence of recombination with at least one of the four recombination tests used in our study. Statistical analysis showed that genes with evidence of recombination were more likely to be under positive selection (P < 0.0001; chi-square test). Therefore, we excluded the 270 genes with evidence of recombination from our final positive selection analysis. c This column lists the number of genes that show evidence for recombination and positive selection in a given test (e.g., TO); since many genes showed evidence of recombination in > 1 recombination test, the total number of genes in this column is lower than the sum of the numbers in a given row. While a total of 32 genes showed evidence of recombination and positive selection, the sum of the numbers in this column is > 32 as some genes showed evidence of positive selection in two tests.
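The chi-square association between recombination and positive selection mentioned in the table footnote can be approximated from counts given in the text. The 2x2 table below is a reconstruction under stated assumptions, not the authors' actual input: it assumes 3316 genes in total, that the 32 genes with both signals are a subset of the 270 recombinant genes, and that the 41 positively selected genes are all non-recombinant. The function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


total_genes = 3316         # orthologs analyzed, as stated in the Discussion
recombinant = 270          # genes with evidence for recombination
both = 32                  # recombination and positive selection
selected_only = 41         # positive selection, no recombination

# Rows: recombination yes / no; columns: positive selection yes / no.
a = both
b = recombinant - both
c = selected_only
d = total_genes - recombinant - selected_only

stat, p_value = chi_square_2x2(a, b, c, d)
print(stat, p_value)
```

Under these assumptions the p-value falls far below 0.0001, in line with the reported result, although the exact statistic depends on how the table was actually constructed.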
The STM3258 gene, which encodes a putative PTS component, could not be PCR amplified in one Salmonella Typhimurium and three serotype 4,5,12:i:- isolates. Results from the analyses of the resulting 43 STM3258 gene sequences were consistent with the genome analyses data and confirmed that this gene shows no evidence for recombination, but is under positive selection in the Salmonella Typhi branch. purE, which encodes an enzyme involved in the synthesis of purine ribonucleotide, was successfully amplified and sequenced in all 42 isolates; analyses of the resulting sequences also found evidence for positive selection in the Salmonella Typhi branch (Table 2); one test (NSS) on all 47 purE gene sequences found evidence for recombination in this gene (P < 0.001).

Discussion
In this study, we used 5 Salmonella genomes representing host restricted (i.e., Typhi and Paratyphi A), host adapted (i.e., Choleraesuis), and unrestricted (i.e., Typhimurium) serotypes to study the evolution of core genes in different Salmonella serotypes. A total of 3,316 orthologs found in these 5 Salmonella genomes were used to (i) identify genes with evidence of recombination and (ii) identify genes under positive selection. Positive selection and recombination patterns for four genes of interest were confirmed in a larger set of isolates representing 23 different serotypes. Overall, our data show that, among the serotypes evaluated, (i) less than 10% of Salmonella genes in the core genome show evidence for homologous recombination, (ii) a number of core Salmonella genes are under positive selection, including genes that appear to contribute to virulence, and (iii) the cell surface protein ompC, which may contribute to multidrug resistance in Salmonella, is targeted by positive selection in both Salmonella and E. coli [24].

Less than 10% of Salmonella genes show evidence for intragenic recombination
Since the first bacterial genome was sequenced in 1995, comparative tools have shown that horizontal gene transfer is the major process in the evolution of prokaryotes [12,14,55]. Horizontal gene transfer has also been proposed to have played an important role in the evolution of the Salmonella genome. Salmonella Typhimurium LT2 seems to have acquired a number of novel genomic regions after the divergence from E. coli around 100 million years ago [56] and it has been estimated that 25% of the Salmonella Typhimurium genome may have been introduced by horizontal gene transfer [3]. Groups of genes introduced by horizontal gene transfer include prophages and Salmonella pathogenicity islands (SPIs) [13]. While the role of horizontal gene transfer in introducing novel genes into the Salmonella genome has been well established, our analyses show that horizontal transfer (and recombination) of homologous genes also plays an important role in the diversification of Salmonella; 270 of the 3316 genes characterized (8.1%) showed evidence for intragenic homologous recombination. By comparison, analysis of four E. coli and two Shigella genomes found 236 genes with evidence for intragenic recombination, representing approximately 6.3% of genes analyzed [24]. Chen et al. [23] reported that 12.8% of core genome genes, found in seven E. coli genomes, showed evidence for recombination. A study of 410 genes present in six E. coli and six Salmonella enterica genomes reported that 23% of these genes showed evidence of recombination in Salmonella; this estimate may be higher than the one reported here as the 410 genes evaluated do not represent a random sample of the Salmonella core genome [25]. Interestingly, even novel genes that were initially introduced into the Salmonella genome through horizontal gene transfer and non-homologous recombination showed evidence for further subsequent diversification through homologous recombination (e.g., one and two genes in SPI-1 and 2, respectively, showed evidence for intragenic recombination). A recent analysis by Didelot et al. [57] also suggested convergence of Salmonella Typhi and Paratyphi A, two human host-restricted serotypes, through >100 recombination events involving both transfer of novel genes and transfer of homologous genes, further supporting the importance of horizontal transfer of homologous gene sequences in the evolution of Salmonella [53].

Figure caption: Proportions of genes with evidence of positive selection among individual JCVI role categories.

Table footnote: Genes in SPIs 1 to 5 are reported as described by [53] using primary annotation locus numbers for Salmonella Typhimurium LT2; genes in SPI-6 are reported as described by [54] using JCVI locus numbers for Salmonella Typhi CT18.

A number of core Salmonella genes are under positive selection, including genes that appear to contribute to virulence and systemic infection
A total of 1.2% of genes found in all five Salmonella genomes (i.e., 41 genes) showed evidence for positive selection and no evidence for recombination. While 5 genes showed evidence for positive selection in the overall analyses, 36 genes showed evidence for positive selection only in specific branches, indicating considerable branch specific positive selection in the Salmonella serotypes evaluated. Previously, Petersen et al. [24] reported that, among 3,505 E. coli and Shigella genes that showed no evidence for recombination, a total of 23 genes (0.66%) showed evidence for positive selection. Among Gram-positive pathogens, Orsi et al. [18] reported that 36 L. monocytogenes and L. innocua genes (1.6%) showed evidence of positive selection (among a total of 2267 genes analyzed), while Lefebure and Stanhope [20] reported that 11 to 34% of the genes in the Streptococcus core genome showed evidence for positive selection, although this study did not control for multiple comparisons and thus may have somewhat overestimated the number of genes under positive selection. Recently, Lefebure and Stanhope [22] showed that 92.5% of non-recombinant core genome loci are under positive selection, in at least one lineage, in 17 Campylobacter genomes, which represented 8 different species. While these different analyses suggest that the proportion of genes with evidence for positive selection varies considerably between different bacterial species or genera, methodological aspects (e.g., approaches used to correct for multiple comparisons, approaches used to identify genes with evidence for recombination) may also affect the number of genes identified as showing evidence for positive selection.
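The proportions compared in this paragraph can be recomputed from the counts quoted in the text. The sketch below is only an arithmetic check; it assumes 3316 as the Salmonella denominator (the number of orthologs given in the Discussion) and uses illustrative dictionary keys.

```python
# (genes with evidence for positive selection, genes analyzed), as quoted in the text.
studies = {
    "Salmonella (this study)": (41, 3316),
    "E. coli / Shigella [24]": (23, 3505),
    "Listeria [18]": (36, 2267),
}

# Percentage of analyzed genes with evidence for positive selection.
percentages = {name: round(100 * k / n, 2) for name, (k, n) in studies.items()}

# Overall test plus the Choleraesuis, Typhi, Typhimurium and Paratyphi A branch tests.
overall = 5
branch_counts = [8, 16, 7, 5]
total_selected = overall + sum(branch_counts)

print(percentages)
print(total_selected)
```

The branch counts can simply be added because no gene showed evidence of positive selection in more than one test. Note that the denominators differ in kind: the E. coli and Shigella figure is based on non-recombinant genes only, whereas the Salmonella percentage is taken over all orthologs analyzed.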
Interestingly, three Salmonella genes with evidence for positive selection were located in Salmonella pathogenicity islands (SPIs). SPIs are chromosomal regions that contain genes contributing to a particular virulence phenotype [26,58,59]. So far, five common SPIs (i.e., SPI-1 through SPI-5), found among the majority of Salmonella enterica strains, as well as a number of additional less common SPIs have been reported. siiB, which showed evidence for positive selection, is located in SPI-4 and encodes a probable membrane protein (putative methyl-accepting chemotaxis protein). Morgan et al. [60] reported that the SPI-4 genes siiD, siiE, and siiF play a role in Salmonella Typhimurium intestinal colonization of calves. Kiss et al. [61] specifically showed that a Salmonella Typhimurium strain lacking siiB shows reduced secretion of SiiE, as compared to the wildtype, suggesting a possible involvement of siiB in calf virulence (as an siiE mutant showed reduced colonization in a calf model [60]). pipB, located in SPI-5, also showed evidence for positive selection. SPI-5 encodes T3SS-1 and T3SS-2 effector proteins [62]. PipB localizes to the Salmonella-containing vacuole (SCV) in mammalian host cells [63]. In addition, Wood et al. [62] reported that a pipB null mutant showed reduced intestinal secretory and inflammatory responses in ligated bovine ileal loops, suggesting that this, as well as other genes in SPI-5, may contribute to bovine enteric infections. PipB also appears to be required for colonization of the cecum, by Salmonella Typhimurium, in chickens [64]. safC, a gene located in SPI-6 [54], a region called Salmonella enterica centisome 7 genomic island (SCI) in Salmonella Typhimurium [65], was also found to be under positive selection. safC encodes an outer membrane usher protein for Salmonella atypical fimbriae [65].
While a Salmonella Typhimurium strain with a deletion of SPI-6 showed reduced ability to invade Hep2 cells [65], we are not aware of any studies characterizing virulence of a safC null mutant. While the SPI-2 genes sseC and sseF have previously been reported to (i) show evidence for differential evolution [66] and (ii) contain distinct clusters of polymorphic sites that might be unique to the human adapted serotypes Typhi and Paratyphi [67], these genes did not show evidence for positive selection in our final analyses. Both sseC and sseF showed evidence for positive selection in the Choleraesuis branch in our initial analysis, but sseC was removed from the final analysis as this gene also showed evidence of recombination and sseF did not meet the 20% cutoff for FDR. In combination with a previous study [66] that reported that a number of genes located in Salmonella pathogenicity islands show evidence for differential evolution in different Salmonella serotypes, our findings do support that positive selection contributes to evolution of pathogenicity island genes in Salmonella, even though further analyses on larger data sets will be needed to clarify the contributions of positive selection and recombination to evolution of these genes.
Overall, three genes in the JCVI role category "Purine, pyrimidine, nucleoside and nucleotide biosynthesis" (i.e., wcaH, purE and nrdI) showed evidence for positive selection (while showing no evidence for recombination). wcaH, which encodes a GDP-mannose mannosyl hydrolase, is under positive selection in the Typhimurium branch, while purE and nrdI were found to be under positive selection in the Typhi branch. purE encodes a phosphoribosylaminoimidazole carboxylase, while nrdI, which is located in an operon with genes that encode a Class 1b ribonucleotide reductase, encodes a small flavoprotein with unknown function in Streptococcus pyogenes [68]. Positive selection for purE in the Salmonella Typhi branch was also confirmed in our analyses of 22 human and 20 animal Salmonella isolates, which included two additional Typhi strains. This is a striking finding since Samant et al. [69] recently reported that de novo nucleotide biosynthesis is essential for bacterial growth in blood. As Salmonella Typhi predominantly causes systemic septicemic infections in humans, these findings suggest that adaptive changes in genes encoding purine, pyrimidine, nucleoside and nucleotide biosynthesis functions may have been critical in the evolution of this host restricted human pathogen. Our findings thus further support that development of novel drugs targeting appropriate purine, pyrimidine, nucleoside and nucleotide biosynthesis pathways may represent an opportunity for therapeutic approaches for bacterial pathogens causing septicemic infections [69].
Additional genes with evidence for positive selection and possible roles in host infection include katG, which encodes a catalase. While antioxidant defense mechanisms appear to contribute to virulence in a number of pathogens, Salmonella katG null mutations have shown no effect on Salmonella's ability to survive inside phagocytic cells and in a murine model of infection [70]. The importance of adaptive changes in Salmonella katG thus remains to be determined. It seems possible that adaptive changes in genes involved in anaerobic growth may contribute to an improved ability of different strains of this gastrointestinal pathogen to survive under anaerobic conditions encountered in the intestinal tract. We also identified a number of genes with evidence for positive selection that have no apparent link to infection and virulence, including malZ, malT, and mtlA, which encode, respectively, a maltodextrin glucosidase, a transcriptional activator of mal genes, and a mannitol specific PTS system component. While it has been proposed that horizontal transfer of genes encoding proteins involved in acquisition and synthesis of nutrients and genes encoding components of metabolic networks is critical as bacteria adapt to specific environments and ecological niches [12], our findings suggest that positive selection of genes encoding metabolic capabilities also contributes to adaptation to new environments.

Cell surface proteins are targeted by positive selection in both Salmonella and E. coli
While we identified, in our preliminary analysis, three genes encoding outer membrane proteins (ompC, ompS1 and ompS2) that showed evidence for positive selection, only ompC showed no evidence of recombination. ompC, a highly expressed omp gene, encodes a protein that not only appears to play a role in Salmonella virulence [71], but also is a receptor for Gifsy-1 and Gifsy-2 phages [72]. An analysis of six E. coli and Shigella genomes also found that three omp genes (i.e., ompF, ompC and ompA) showed evidence of positive selection [24], while Chen et al. [23] reported that ompC and ompF were under positive selection in uropathogenic E. coli strains. Furthermore, genes encoding the outer membrane proteins OmpA and OmpB showed evidence for positive selection in Rickettsia spp. [73]. Overall, these data strongly suggest that adaptive changes in genes encoding outer membrane proteins critically contribute to the evolution of a variety of bacteria, including pathogenic Enterobacteriaceae. In particular, ompC, which encodes one of the most abundant E. coli proteins [24], appears to be under positive selection in a number of pathogenic Enterobacteriaceae. As proposed by Petersen et al. [24], positive selection in omp genes may be an important mechanism that facilitates adaptation of bacterial pathogens, allowing them to escape recognition by the host immune system and phages. In addition, mutations in porin genes (e.g., those belonging to OmpC and OmpF groups), as well as changes in Omp expression levels, have been linked to increased resistance to β-lactam antibiotics [74-76]. For example, under strong antibiotic pressure, bacteria can reduce the influx of antibiotic through downregulation of porin expression or expression of modified porins. Positive selection in porin genes, particularly ompC, thus may also be associated with selection to increase antibiotic resistance.
These findings provide potentially interesting avenues for future mutagenesis studies to elucidate the role of ompC polymorphisms in various phenotypes, including β-lactam resistance.

Conclusion
Our analyses strongly suggest that both homologous recombination and positive selection (particularly lineage specific positive selection) contribute to the evolution of the Salmonella core genome, at least in the serotypes analyzed here. While genes with evidence of positive selection identified here may provide promising targets for future mutational studies aimed at further identifying mechanisms that contribute to Salmonella diversification, including its adaptation to specific host species, one cannot extrapolate our findings on a few Salmonella serotypes to other serotypes unless additional analyses are performed. The relevance of the lineage specific positive selection patterns identified here is supported, though, by the convergence of the positive selection patterns identified in the Salmonella Typhi lineage (i.e., for genes encoding proteins involved in purine, pyrimidine, nucleoside and nucleotide biosynthesis) and experimental evidence that genes involved in de novo nucleotide biosynthesis are essential for bacterial growth in blood [69].
In conjunction with previous genome wide studies on positive selection in uropathogenic E. coli [23], Shigella and E. coli [24], Listeria spp. [18], Campylobacter [22] and Streptococcus spp. [20], our data clearly indicate that positive selection and homologous recombination among core genome genes play an important role in the evolution of bacterial pathogens, in addition to the well established importance of gene acquisition and deletion. Positive selection and homologous recombination also appear to contribute to further evolution of novel genes initially acquired by lateral gene transfer, such as selected genes in the Salmonella pathogenicity islands. As additional pathogen genomes, including additional Salmonella genomes, have become and continue to become available, positive selection and recombination analyses on larger numbers of genomes will further improve our understanding of bacterial pathogens.