Complete genome sequences were downloaded from GenBank as follows: for R. etli CFN42: chromosome [GenBank:NC_007761], and plasmids pCFN42a [GenBank:NC_007762], pCFN42b [GenBank:NC_007763], pCFN4c [GenBank:NC_007764], pCFN42d [GenBank:NC_004041], pCFN42e [GenBank:NC_007765], and pCFN42f [GenBank:NC_007766]; for R. etli CIAT652: chromosome [GenBank:NC010994], and plasmids pCIAT652a [GenBank:NC010998], pCIAT652b [GenBank:NC010996], and pCIAT652c [GenBank:NC010994]; and for R. leguminosarum 3841: chromosome [GenBank:NC_008380], and plasmids pRL7 [GenBank:NC_008382], pRL8 [GenBank:NC_008383], pRL9 [GenBank:NC_008379], pRL10 [GenBank:NC_008381], pRL11 [GenBank:NC_008384], and pRL12 [GenBank:NC_008378]. We also used reads and contigs from the draft genomes of R. etli strains 8C-3 [GenBank:NZ_ABRA00000000], BRASIL5 [GenBank:NZ_ABQZ00000000], CIAT894 [GenBank:NZ_ABRD0000000], GR56 [GenBank:NZ_AABRD0000000], IE4771 [GenBank:NZ_ABRD00000000], and KIM5 [GenBank:NZ_ABQY0000000].
Determination of SNPs and pairwise nucleotide differences
Paired alignments between the draft genomes (contigs) and the ORFs from the genomes of CFN42 or CIAT652 were performed using the Dds2 program , which produces ungapped alignments of fragments having similarities greater than 80%. Each duplicated paired alignment (i.e., segments for which paralogous existed in the reference genome) was filtered using the reciprocal best hits option of the Fil program  under the following parameter set: coverage > 60% with respect to a reference gene and a percentage differential score cutoff < 10%. When two alignments had the same coverage, we selected the alignment with the higher score. Once the results were filtered, we created a gapped alignment using the Gap22 program  on segments for which the identity was greater than 85%. Both sequences were extracted using an ad hoc Perl script (homemade) formed for each paired alignment. To avoid frameshifts, we realigned each pair using cross-match  with the following parameters: discrep_lists masklevel, 0; tags gap_init, 3; gap_ext, 2; ins_gap_ext, 2; del_gap_ext, 2; minmatch, 14; maxmatch, 14; max_group_size, 20; minscore, 30; bandwidth, 14; and indexwordsize, 10. Finally, for each alignment, we determined the probability that a site was polymorphic using the Polybayes program , with the probability set at greater than 0.99 and a minimum Phred of Q45 [51, 52].
Assessment of methodological accuracy at low coverage
To determine if differences in coverage among the studied strains affected the reliability of the variability estimations, we took readings representing ~1× sequence coverages of seven E. coli genomes and complete-genome readings (about 10×) of the same genomes from GenBank (ftp://ftp.ncbi.nih.gov/genomes/) and assembled these readings using the Celera assembler . The above-described analysis was applied to both the 1× and 10× coverage datasets, and the results were compared using the Mann-Whitney and Kolmogorov-Smirnov tests . The utilized E. coli draft genomes were: 101-1 [GenBank:NZ_AAMK00000000], 53638 [GenBank:NZ_AAKB00000000], B171 [GenBank:NZ_AAJX0000000], E1100019 [GenBank:NZ_AAJW0000000], F11 [GenBank:NZ_AAJU0000000], HS [GenBank:NC_009800], and O157_H7_ec4024 [GenBank:NZ_ABJT0000000]. The complete genome sequence of E. coli K12 [GenBank:NC_000913] was used as the reference.
Determination of triplets (homologous segments)
For the comparisons between all ORFs of the reference genomes (both CFN42 and CIAT652 were used throughout the work) and each incomplete genome (the contigs), we obtained the coordinates of all homologous segments (triplets) using the Mauve program . Our analysis was standardized by aligning p42F (CFN42) against (R. leguminosarum bv viciae 3841) pRL12 using the following parameters: backbone-size = 100; max-backbone-gap = 50; weight = 90; island-size = 100. These plasmids were chosen because they contain shared syntenic blocks . Sequence extraction, realignment of each conserved segment (backbone) and SNP determination were all performed as described above (see determination of SNPs section).
Determination of quartets (orthologous segments)
To detect recombination events among DNA sequences, at least four sequences are required for the analysis . Here, we first identified SNPs that distinguished each draft genome from the two reference genomes (CFN42 and CIAT652), and then determined the fragments that were shared between each draft genome and the ORFs from CFN42 and CIAT652, together with all replicons of R. leguminosarum 3841 (chosen because of its extensive synteny with CFN42) . Sharing was determined using the Mauve program  (see determination of triplets section) and the shared fragments were realigned with the Muscle program (default parameters) . To eliminate any large gaps within the alignments (rare in orthologous fragments), we used the Gblocks program under its default parameters .
Detection of recombination
A variety of methods for detection of recombination have been reported in the literature , but no one strategy performs optimally under all evolutionary scenarios . Therefore, a reasonable approach is to employ multiple methods and consider recombination events predicted by at least two methods as being the most reliable. Here, we used this strategy and considered recombination events that were detected by at least two of the following four programs :
A) Geneconv : Using this program, we ran 100,000 simulations for each quartet with the following parameters chosen: Dumptab; Dumpjseq; Dumpfrag; Annotate; WideCol; ShowBlast; Indel_blocs; ShowBcPwKaPvals; SortGfragsBySeq; Show_maxmeansims; ShowUnal; Gscale = 1; ListPair; ListBest; Bcsims; Allouter; Numsim = 100000/sp. This allowed us to detect possible genetic conversion events.
B) Pist : With this program, we first identified the best-fit DNA substitution model for each shared fragment using the Akaike information criterion. We then used the best model to reconstruct the phylogeny using a maximum likelihood method (Phyml ) with 100 non-parametric bootstrap replicates. We next determined the invariant sites, alpha values, ts/tv ratios, base frequencies, and constant sites using the PAML program  and the GTR model. Finally, we ran Pist with the REV model and 10,000 permutations. Pist uses parsimony-informative sites to detect recombination events and is robust for highly divergent genes.
C) PhiPack : We used the parameters of 10,000 permutations and a window size of 25 nt, and implemented the Pairwise Homoplasy Index, Maximum X2, and the Neighbor Similarity Score.
D) Hyphy program : We used the routine GARD, which enables automated phylogenetic detection of recombination. We employed the GTR model and beta-gamma rate variation.
To determine if a recombinant gene was present in two different quartets or strains, we constructed a binary presence/absence matrix (1/0) for each gene that was found in two or more strains. These profiles were hierarchically clustered using the Cluster program .
Regions shared among all strains of R. etli and R. leguminosarum bv viciae 3841 were identified using the Mauve program , realigned by Muscle using the default parameters , and filtered for long gaps with Gblocks . We then obtained the phylogeny of each region using a maximum likelihood approach employed by the Phyml program  (with 1,000 non-parametric bootstrap replicates) and the best nucleotide substitution model identified by the Akaike information criterion [67, 68]. We used three methods to construct the phylogeny from the concatenated dataset, in order to determine the species tree. The first was the RAxML program (maximum likelihood) , in which we ran the GTR nucleotide substitution model and a GAMMA+P-Invar estimation of rate heterogeneity. This analysis yielded a Maximum Likelihood ML estimate of the alpha parameter and 1,000 distinct randomized Maximum Parsimony trees. The second program used was Phyml (maximum likelihood) , running 1,000 non-parametric replicates and the GTRG model. Finally, we employed the MrBayes program (Bayesian analysis)  running the Nucmodel 4by4 for DNA. The number of rate categories for the gamma distribution was set at four, with an allowance for a proportion of invariable sites. Because of the high computational burden, we performed two runs with four chains, for 500,000 generations in total. Trees were sampled every 500 generations, 25% of all samples were removed as reflecting burn-ins, and a consensus was obtained. Moreover, to assess differences in topology among the probable strain trees and individual gene trees, we used the Consel program , which calculated expected likelihood weighting and performed the Shimodaira-Hasegawa SH test . Finally, a neighbor-net network was generated using the concatenated sequences and the Splits tree4 program .
Nucleotide diversity and ts/tv ratios
For each shared fragment (quartet), we determined the nucleotide diversity and segregating sites using R. leguminosarum 3841 as an outgroup and employing the libsequence library . The transition/transversion ts/tv ratios were determined for each quartet by using the PAML program  and applying the best model of nucleotide substitution obtained from each orthologous segment (see determination of quartet).
We used the COGs database  to undertake functional annotation across the four broad categories and sub categories to shared regions (all strains) as well as recombinant quartets. Quartets that had not been functionally assigned within the COG database were placed in the "Poorly Characterized'' category. For assignation to a category, we used the reciprocal best hits technique with an E-value < 1 × 10-7.