New analysis for consistency among markers in the study of genetic diversity: development and application to the description of bacterial diversity
 Sandrine Pavoine^{1} and
 Xavier Bailly^{2}
DOI: 10.1186/1471-2148-7-156
© Pavoine and Bailly; licensee BioMed Central Ltd. 2007
Received: 17 January 2007
Accepted: 03 September 2007
Published: 03 September 2007
Abstract
Background
The development of postgenomic methods has dramatically increased the amount of qualitative and quantitative data available to understand how ecological complexity is shaped. Yet, new statistical tools are needed to use these data efficiently. In support of sequence analysis, diversity indices were developed to take into account both the relative frequencies of alleles and their genetic divergence. Furthermore, a method for describing interpopulation nucleotide diversity has recently been proposed and named the double principal coordinate analysis (DPCoA), but this procedure can only be used with one locus. In order to tackle the problem of measuring and describing nucleotide diversity with more than one locus, we developed three versions of multiple DPCoA by using three ordination methods: multiple coinertia analysis, STATIS, and multiple factorial analysis.
Results
This combination of methods allows i) testing and describing differences in patterns of interpopulation diversity among loci, and ii) defining the best compromise among loci. These methods are illustrated by the analysis of both simulated data sets, which include ten loci evolving under a stepping-stone model and a locus evolving under an alternative population structure, and a real data set focusing on the genetic structure of two nitrogen-fixing bacteria, which is influenced by geographical isolation and host specialization. All programs needed to perform multiple DPCoA are freely available.
Conclusion
Multiple DPCoA allows the evaluation of the impact of various loci on the measurement and description of diversity. This method is general enough to handle a large variety of data sets. It complements existing methods such as the analysis of molecular variance or other analyses based on linkage disequilibrium measures.
Background
The exponential increase in sequencing capacity is modifying the way genetic diversity is assessed. For instance, multilocus sequencing (MLS) now allows the estimation of genetic relatedness among microorganisms for both housekeeping genes and accessory genes such as virulence or symbiotic determinants [1]. Thus, several publications have reported complex MLS schemes studying more than ten genes located in different genomic regions and involved in various metabolic pathways. These studies have indicated the influence of various parameters, such as recombination rate [2] or epidemiological traits [3], on the diversification of bacterial populations. Furthermore, recent progress in sequencing technologies suggests that ever more sequence data will be available to study questions related to community ecology in the near future [4]. New statistical methodologies should therefore be developed to deal with the complexity of the data sets that will be produced. One of the main problems raised by the increase in sequence information is the assessment of congruence among population structures depicted by different molecular markers [5]. In bacterial lineages, especially those in which sex is common, the diversity of each locus can be shaped by the gain or loss of genes, gene flow boundaries and specific selective pressures [6]. The problems which can arise from the overall analysis of an MLS data set in which loci do not share congruent evolutionary constraints include, among others, misleading inferences of genetic relatedness and phylogenetic relationships [7] or overestimation of linkage disequilibrium [8].
Bacterial isolates characterized by MLS usually belong to several genetic groups (i.e. species or populations), which can be defined according to the sampling strategy or according to more refined methodologies [9]. For each locus of an MLS data set, the different sequence types recovered are called alleles. In this context, the properties of the data set can be summarized by two sets of matrices. The first set includes G matrices {F_{1}, ..., F_{g}, ..., F_{G}}, in which G is the number of loci. Each of these matrices contains the frequencies of the different alleles recovered at a given locus among the populations under study. The dimensions of these matrices are thus (ρ_{1}, r), ..., (ρ_{g}, r), ..., (ρ_{G}, r), in which ρ_{g} is the number of alleles observed at locus g and r is the number of populations delineated. The second set also includes G matrices, {D_{1}, ..., D_{g}, ..., D_{G}}, in which D_{g} contains the pairwise genetic distances between the alleles observed at locus g. Usually, the information contained within these two sets of matrices is analyzed independently, using population genetic statistics (i.e. diversity indices and differentiation measures) and phylogenetic methods, respectively. Yet, while it is possible to perform analyses over all loci in either a population genetic or a phylogenetic framework, few methodologies are available to assess the congruence of the information obtained from different loci. In particular, comparing the patterns revealed by differentiation measures among the populations sampled, i.e. the population structures, remains problematic.
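As an illustration of these data structures, the following sketch builds a frequency matrix for one hypothetical locus from raw allele counts. It assumes that each entry of F_{g} is the frequency of an allele-population pair over the whole sample, so that column sums give population weights and row sums give allele weights; the counts and all function names are invented for the example.

```python
# Hypothetical allele counts at one locus: rows = alleles, columns = populations.
counts = [
    [6, 0, 1],   # allele a1
    [3, 5, 0],   # allele a2
    [1, 5, 9],   # allele a3
]

def frequency_matrix(counts):
    """Return F (alleles x populations): each count divided by the grand total."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]

def population_weights(F):
    """Column sums of F: proportion of individuals drawn from each population."""
    return [sum(row[i] for row in F) for i in range(len(F[0]))]

def allele_weights(F):
    """Row sums of F: frequency of each allele over all populations."""
    return [sum(row) for row in F]

F = frequency_matrix(counts)
mu = population_weights(F)   # would fill the diagonal of the population weight matrix
w = allele_weights(F)        # would fill the diagonal of the allele weight matrix

# Within-population allele frequencies, one list per population.
p = [[F[k][i] / mu[i] for k in range(len(F))] for i in range(len(mu))]
```

Dividing each column of F by the corresponding population weight recovers the relative frequencies of the alleles within each population.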
Multivariate analysis is an interesting methodological way to approach this problem. For instance, Moazami-Goudarzi and Laloë [5] have proposed a two-step procedure to test the dissimilarity in population structures revealed by different microsatellite loci. Although this analysis can be used to test the similarity of population differentiations inferred from a set of markers, it should be noted that: i) it cannot be used to describe population structures, and ii) genetic divergences among alleles are not taken into account, although they can be quite informative. Consequently, further improvements should be considered since alternative statistical approaches are available [10]. In this context, the aim of this study is to propose a new procedure called multiple double principal coordinate analysis (mDPCoA). The mDPCoA aims at comparing the interpopulation structures provided by the different markers of an MLS scheme. Firstly, a pattern of population differences is obtained for each MLS marker using a double principal coordinate analysis (DPCoA), a recently developed ordination method which takes into account both the frequency of alleles and their genetic divergence [11] (see Eckburg et al. [12] and Bik et al. [13] for applications of this method to the analysis of bacterial diversity). Secondly, population patterns are compared using three different methods: the Multiple Coinertia Analysis [14], STATIS [15], and the Multiple Factorial Analysis [16]. Finally, a permutation procedure can be used to test the pairwise correlation among MLS markers. These analysis pipelines have been applied to both simulated and published MLS data sets to check the accuracy and the relevance of the procedures. The results obtained illustrate the ability of this methodology to make inferences on various features of the populations under study.
Results
Algorithms of multiple Double Principal Coordinate Analysis
Computations were performed using new functions and functions implemented in the ade4 [17] and ape [18] packages written in the R software [19] [see Additional file 1]. A manual describing the use of the different functions is supplied [see Additional file 2].
Let {F_{1}, ..., F_{g}, ..., F_{G}} be the set of matrices of type alleles × populations, containing the frequencies of alleles in the populations for the G loci, {D_{1}, ..., D_{g}, ..., D_{G}} be the set of matrices containing the distances among alleles, B_{r} be the diagonal matrix containing the population weights (the weight of a population is the proportion of individuals drawn from this population), and $B_{\rho_g}$ be the diagonal matrix containing the allele weights for the g^{th} locus (the weight of an allele is its frequency over all the populations studied). The matrices of distances must be Euclidean [20], which can be obtained with, for example, either the Lingoes [21] or the Cailliez [22] correction.
For a single locus g, the analysis of the among-population diversity corresponds to a DPCoA, which consists of three main steps:
1. Defining a Euclidean space composed of the principal axes of the distances among the alleles. The coordinates of the alleles in this space are in R_{g} such that: $Q_g^t D_g Q_g = R_g R_g^t$, where $Q_g = I_{\rho_g} - B_{\rho_g} 1_{\rho_g} 1_{\rho_g}^t$ is a projector which performs a weighted centering, with $I_{\rho_g}$ the ρ_{g} × ρ_{g} identity matrix and $1_{\rho_g}$ a ρ_{g} × 1 vector of ones. That is to say, $Q_g^t D_g Q_g$ is the matrix centered by rows and columns;
2. Positioning, in this space, the populations at the centroid of the alleles they possess. The coordinates of the populations, in this space, are in C_{g} such that: $C_g = B_r^{-1} F_g^t R_g$;
3. Proceeding to the singular value decomposition of the triplet (C_{g}, $I_{\mu_g}$, B_{r}), where μ_{g} is the number of principal axes for the alleles of the g^{th} locus. This third step leads to a set of positive eigenvalues, in a diagonal (ν_{g} × ν_{g}) matrix Ψ_{g}, and to a basis of orthonormal eigenvectors, in a (μ_{g} × ν_{g}) matrix V_{g}, defining the new Euclidean space. The eigenvectors constitute the principal axes of the distances among populations. In this new space, which is the DPCoA space, the coordinates of the alleles are in X_{g} = R_{g}V_{g}, and the coordinates of the populations in Y_{g} = C_{g}V_{g}.
Considering all the loci thus leads to G triplets $\left(Y_1, I_{\nu_1}, B_r\right), \dots, \left(Y_g, I_{\nu_g}, B_r\right), \dots, \left(Y_G, I_{\nu_G}, B_r\right)$.
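The first two steps can be sketched in a few lines of pure Python. The sketch assumes that the weighted centering subtracts weighted row and column means and adds back the weighted grand mean, and it takes the allele coordinates as given, because obtaining them requires an eigendecomposition that is omitted here. The matrices, weights and function names are hypothetical.

```python
def double_center(D, w):
    """Weighted double centering of a square matrix D with weights w (summing to 1).

    Each entry has its weighted row mean and weighted column mean removed,
    and the weighted grand mean added back.
    """
    n = len(D)
    row_mean = [sum(w[l] * D[k][l] for l in range(n)) for k in range(n)]
    col_mean = [sum(w[k] * D[k][l] for k in range(n)) for l in range(n)]
    grand = sum(w[k] * row_mean[k] for k in range(n))
    return [[D[k][l] - row_mean[k] - col_mean[l] + grand for l in range(n)]
            for k in range(n)]

def centroids(F, mu, R):
    """Place each population at the weighted mean of the alleles it possesses.

    F: alleles x populations frequencies, mu: population weights,
    R: alleles x axes coordinates (assumed to be already available).
    """
    n_alleles, n_axes = len(R), len(R[0])
    return [[sum(F[k][i] * R[k][a] for k in range(n_alleles)) / mu[i]
             for a in range(n_axes)]
            for i in range(len(mu))]

# Hypothetical example with three alleles and two populations.
D = [[0.0, 1.0, 4.0], [1.0, 0.0, 1.0], [4.0, 1.0, 0.0]]
w = [0.5, 0.3, 0.2]
G = double_center(D, w)

F = [[0.2, 0.0], [0.2, 0.2], [0.0, 0.4]]
mu = [0.4, 0.6]
R = [[0.0], [1.0], [2.0]]      # one-dimensional allele coordinates
C = centroids(F, mu, R)
```

After centering, every weighted row mean and weighted column mean of the matrix is zero, which is the property the projector is meant to enforce.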
Since our objective was to evaluate the consistency among the patterns of interpopulation diversity provided by each locus, considering evolutionary distances among alleles, we had to find a Euclidean space allowing direct comparison among the individual DPCoA analyses. We evaluated three alternative solutions taken from K-table multivariate analysis: the multiple coinertia analysis (MCoA) [14], STATIS [15] and the multiple factorial analysis (MFA) [16].
DPCoA and Multiple Coinertia analysis
The Multiple Coinertia Analysis applied to the triplets $\left(Y_1, I_{\nu_1}, B_r\right), \dots, \left(Y_g, I_{\nu_g}, B_r\right), \dots, \left(Y_G, I_{\nu_G}, B_r\right)$ can be viewed as follows:
The main step is the definition of a set of axes $u_g^{[k]}$, for 1 ≤ k ≤ K and 1 ≤ g ≤ G, normalized in each space $\mathbb{R}^{\nu_g}$, which serve to position the populations according to each individual locus, and of K unique variables v^{[k]}, for 1 ≤ k ≤ K, B_{r}-normalized in ℝ^{r}, which may be used to synthesize the information provided by the G loci. This definition is done by maximizing
$\sum_{g=1}^{G} \pi_g \langle Y_g u_g^{[k]} \mid v^{[k]} \rangle_{B_r}^{2}$, given that
$\langle v^{[k]} \mid v^{[l]} \rangle_{B_r} = 0$ and $\langle u_g^{[k]} \mid u_g^{[l]} \rangle_{B_r} = 0$ for all k, l (1 ≤ k < l ≤ K), and all g (1 ≤ g ≤ G).
The value π_{g} is a weight attributed to the triplet (Y_{g}, $I_{\nu_g}$, B_{r}) so as to homogenize the impact of each triplet in the multiple analysis. We use π_{g} equal to the inverse of the inertia of the triplet (Y_{g}, $I_{\nu_g}$, B_{r}), i.e. the inverse of the sum of all its eigenvalues. Let U_{g} be the matrix $\left[u_g^{[1]} \mid \dots \mid u_g^{[k]} \mid \dots \mid u_g^{[K]}\right]$ and V the matrix [v^{[1]} | ... | v^{[k]} | ... | v^{[K]}]. The individual analyses can be projected on the MCoA space. In this space, it is possible to compare the coordinates of the populations according to the consensus of the information provided by the different loci with the coordinates of the populations obtained from each locus. While V contains the consensual coordinates of the populations, the coordinates at which the g^{th} locus positions the populations are obtained from $L_{Y_g} = \sqrt{\pi_g} Y_g U_g$. Because $Y_g = B_r^{-1} F_g^t X_g$, the matrix $L_{X_g} = \sqrt{\pi_g} X_g U_g$ positions the alleles of the g^{th} locus, so that each population is at the centroid of its allelic composition. However, to compare the individual analyses with the compromise, it is better to B_{r}-normalize $L_{Y_g}$ and $L_{X_g}$ because V is by definition B_{r}-normalized.
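The role of the weight given to each triplet can be illustrated with a short sketch. It relies on the fact that the inertia of a table of population coordinates, with the identity metric on the columns and population weights on the rows, is the weighted sum of the squared row norms, which equals the sum of the eigenvalues. The table, the weights and the function names are hypothetical.

```python
import math

def inertia(Y, mu):
    """Weighted sum of squared row norms of Y, with row weights mu."""
    return sum(m * sum(y * y for y in row) for row, m in zip(Y, mu))

def triplet_weight(Y, mu):
    """Inverse of the inertia, used to homogenize the impact of each locus."""
    return 1.0 / inertia(Y, mu)

# Hypothetical coordinates of three populations on two axes (weighted-centered).
Y = [[1.0, -1.0], [-1.0, -1.0], [0.0, 1.0]]
mu = [0.25, 0.25, 0.5]

pi_g = triplet_weight(Y, mu)
# Rescaling the table by the square root of its weight gives it unit inertia.
Y_scaled = [[math.sqrt(pi_g) * y for y in row] for row in Y]
```

With such weights, a locus whose alleles are highly divergent does not dominate the multiple analysis merely because its coordinates are on a larger scale.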
DPCoA and STATIS
STATIS first builds a compromise matrix E as a weighted combination of the G individual locus analyses. Its eigenanalysis, E = UΛU^{t}, leads to the best compromise of the population pattern over the G loci. Note that $\Vert B_r^{1/2} Y_g Y_g^t B_r^{1/2} \Vert = Vav(Y_g)$. According to this compromise, the coordinates of the populations are in $B_r^{-1/2} U \Lambda^{1/2}$. Following Lavit et al. [15], the G individual population patterns, corresponding to the loci considered independently, can be obtained. The coordinates of the i^{th} population according to the g^{th} locus are the elements of the i^{th} row of $Y_g Y_g^t B_r^{1/2} U \Lambda^{-1/2}$. Given that $Y_g = B_r^{-1} F_g^t X_g$, the rows of the matrix $X_g Y_g^t B_r^{1/2} U \Lambda^{-1/2}$ position the alleles of the g^{th} locus, so that each population is at the centroid of its allelic composition.
DPCoA and Multiple Factorial Analysis
The MFA is the Principal Component Analysis (PCA) of the global matrix Y_{TOT} = [π_{1}Y_{1} | ... | π_{g}Y_{g} | ... | π_{G}Y_{G}], in which the G weighted tables are juxtaposed.
Because $Y_g = B_r^{-1} F_g^t X_g$, the matrix $\pi_g X_g Y_g^t B_r Y_{TOT} U \Lambda^{1/2}$ positions the alleles of the g^{th} locus, so that each population is at the centroid of its allelic composition.
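The construction of the global table can be sketched as follows, assuming that each locus table has one row per population and that the tables are simply placed side by side after multiplication by their weights. The tables, weights and function name are hypothetical, and the PCA itself is not shown.

```python
def juxtapose(tables, weights):
    """Column-bind the tables after multiplying table g by weights[g].

    All tables must have the same number of rows (the populations).
    """
    n_rows = len(tables[0])
    if any(len(Y) != n_rows for Y in tables):
        raise ValueError("all tables must describe the same populations")
    return [[w * x for Y, w in zip(tables, weights) for x in Y[i]]
            for i in range(n_rows)]

# Two hypothetical loci described on three populations.
Y1 = [[1.0, 0.0], [0.0, 2.0], [-1.0, -2.0]]
Y2 = [[0.5], [0.5], [-1.0]]
Y_tot = juxtapose([Y1, Y2], [2.0, 4.0])
```

The resulting table has as many rows as populations and as many columns as the total number of axes retained over the loci.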
Relationships between the multiple DPCoA and the measurement of diversity
In the definition of the diversity H_{g}(p_{i}) of the i^{th} population, g designates the g^{th} locus, ρ_{g} is the number of different alleles observed for that locus, $p_i = \left(p_{1i}, \dots, p_{ki}, \dots, p_{\rho_g i}\right)^t$ is the vector containing the relative frequencies of the alleles in the i^{th} population, so that p_{ki} is the frequency of the allele k in the i^{th} population, and $d_{kl}^{\text{all},g}$ is the distance between the alleles k and l of the g^{th} locus. The DPCoA uses a decomposition of this diversity component defined by Rao [27]:
H_{TOTAL, g}({μ_{i}},{p_{i}}) = H_{INTRA, g}({μ_{i}},{p_{i}}) + H_{INTER, g}({μ_{i}},{p_{i}}),
where the among-population component H_{INTER, g} is built from the dissimilarity between populations $d^{\text{pop},g}\left(p_i, p_j\right) = 2 H_g\left(\frac{p_i + p_j}{2}\right) - H_g\left(p_i\right) - H_g\left(p_j\right)$.
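The decomposition can be checked numerically on a small example. The sketch below assumes a diversity index of the quadratic form, i.e. the sum over all pairs of alleles of the product of their frequencies and a symmetric dissimilarity, without any additional scaling factor, and it assumes that the within-population component is the weighted mean of the population diversities while the among-population component sums the weighted dissimilarities over all ordered pairs of populations. These conventions, the data and the function names are assumptions made for illustration.

```python
def quad_entropy(p, delta):
    """Quadratic diversity: sum over allele pairs of p_k * p_l * delta[k][l]."""
    n = len(p)
    return sum(p[k] * p[l] * delta[k][l] for k in range(n) for l in range(n))

def d_pop(pi, pj, delta):
    """Dissimilarity between two populations: 2 H(midpoint) - H(pi) - H(pj)."""
    mid = [(a + b) / 2.0 for a, b in zip(pi, pj)]
    return (2.0 * quad_entropy(mid, delta)
            - quad_entropy(pi, delta) - quad_entropy(pj, delta))

def decompose(P, mu, delta):
    """Return (H_total, H_intra, H_inter) for populations P with weights mu."""
    r, n = len(P), len(delta)
    pooled = [sum(mu[i] * P[i][k] for i in range(r)) for k in range(n)]
    h_total = quad_entropy(pooled, delta)
    h_intra = sum(mu[i] * quad_entropy(P[i], delta) for i in range(r))
    h_inter = sum(mu[i] * mu[j] * d_pop(P[i], P[j], delta)
                  for i in range(r) for j in range(r))
    return h_total, h_intra, h_inter

# Hypothetical locus with three alleles and two equally weighted populations.
delta = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
P = [[0.6, 0.3, 0.1], [0.0, 0.5, 0.5]]
mu = [0.5, 0.5]
h_total, h_intra, h_inter = decompose(P, mu, delta)
```

Under these assumptions the total diversity equals the sum of the two components up to rounding error, whatever the frequencies and weights chosen.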
In the first step of the DPCoA, all the points (i.e. alleles and populations) are in a space called "common space" [11]. In this common space, the inertia (i.e. variance) of the allele points weighted by p_{i} is equal to H_{g}(p_{i}), the diversity of the population i, according to locus g. The inertia of all the allele points weighted by $\sum_{i=1}^{r} \mu_i p_i$ is equal to H_{TOTAL, g}, the total diversity of the data set. Finally, the inertia of all the population points weighted by μ = (μ_{1}, ..., μ_{i}, ..., μ_{r}) is equal to H_{INTER, g}, the component of diversity among populations [11]. At the end of the DPCoA analysis, all the points are projected in a subspace which optimizes the representation of the differences among populations. In this subspace, only H_{INTER, g} is maintained, which is thus the focus of the analysis: optimally displaying the diversity among populations.
Consequently, the multiple DPCoA allows us to optimize the description of diversity among populations obtained with several loci. The first goal of this method is to describe the differences in population patterns across the loci, hence studying the congruence among loci. Another objective may be to erase these differences and provide a compromise population pattern revealed by the majority of the loci. The DPCoA-STATIS is advocated for this purpose. Concerning the measurement of diversity, when several loci are considered, the sum or average of the diversity components over the loci is commonly used as a global measure of diversity [see for example [28, 29]]. With such processes, the weights given to the loci for the sum or averaging are uniform. We have just shown that STATIS provides optimal locus weights for the calculation of the component of diversity among populations. The great advantage of these multivariate analyses is that visualization of the differences among loci is possible, so that one can assess the relevance of using average information over loci, whether these means are weighted or not.
Associated tests
We performed both Mantel and Rv tests to evaluate the significance of the differences in population patterns among loci. For each locus, distances among populations are calculated with the interpopulation diversity H_{INTER, g}({μ_{i}},{p_{i}}) according to Nei and Li [23] and Rao [24, 27]. As noted above, this statistic is at the core of the DPCoA. As we apply formula (H_{INTER, g}) in a pairwise fashion, the distance between population i and population j for locus g is μ_{i}μ_{j}d^{pop, g}(p_{i}, p_{j}). We choose μ_{i}μ_{j}d^{pop, g}(p_{i}, p_{j}) and not simply d^{pop, g}(p_{i}, p_{j}) to take into account differential sample sizes, exactly in the way that we considered them in the ordination procedures. The Mantel test calculates correlations among the raw distance measures, while the Rv test compares principal coordinates obtained by PCoA. Rv correlations are typically higher than Mantel correlations because their values lie between 0 and 1, while Mantel correlation values lie between -1 and 1.
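Both statistics can be sketched in pure Python. The sketch below is a simplified illustration rather than the procedure implemented in the additional files: it uses uniform population weights, a one-sided permutation test for the Mantel correlation, and an Rv coefficient computed from the doubly centered matrices that a classical PCoA would decompose, treating the inputs as ordinary distances. The distance matrices and function names are hypothetical.

```python
import math
import random

def upper(D):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(D1, D2, n_perm=999, seed=0):
    """Mantel correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    n = len(D1)
    obs = pearson(upper(D1), upper(D2))
    hits = 1  # the observed value counts as one permutation
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permute rows and columns of the second matrix simultaneously.
        D2p = [[D2[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if pearson(upper(D1), upper(D2p)) >= obs:
            hits += 1
    return obs, hits / (n_perm + 1)

def gower(D):
    """Doubly centered matrix of -d^2 / 2 (uniform weights)."""
    n = len(D)
    A = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]
    rm = [sum(row) / n for row in A]   # A is symmetric: row means = column means
    gm = sum(rm) / n
    return [[A[i][j] - rm[i] - rm[j] + gm for j in range(n)] for i in range(n)]

def rv(D1, D2):
    """Rv coefficient between the configurations underlying two distance matrices."""
    W1, W2 = gower(D1), gower(D2)
    n = len(W1)

    def dot(A, B):
        return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n))

    return dot(W1, W2) / math.sqrt(dot(W1, W1) * dot(W2, W2))

# Hypothetical distances among four populations, as seen by two loci.
x = [0.0, 1.0, 3.0, 6.0]
y = [0.0, 2.0, 3.0, 7.0]
D1 = [[abs(a - b) for b in x] for a in x]
D2 = [[abs(a - b) for b in y] for a in y]
r_mantel, p_value = mantel(D1, D2)
r_v = rv(D1, D2)
```

Comparing a matrix with itself returns 1 for both statistics, which is a convenient sanity check before applying the functions to real data.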
Application to simulated and real data sets
We used the following procedure to test the methodologies presented above on simulated and real data sets. First, pairwise correlations among loci were assessed by Mantel and/or Rv tests to define groups of consistent loci. At this step, atypical loci can be identified. Then mDPCoA was performed to describe both the compromise population structure and the differences among groups of loci. Finally, we described the connections between the observed structures and ecological, evolutionary or functional data.
Application to a simulated data set
Simulation process
In order to assess the efficiency of the present method, simulated sequence data sets, which illustrate various population structures, were obtained assuming linkage equilibrium among loci. Assuming recombination, the different markers can indeed have different histories and thus different population structures. Moreover, if every marker has an independent history, finding similarities and differences among their genetic structures would be more difficult. Using SIMCOAL 2.0 [30] we considered a one-dimensional stepping-stone model with eight populations of constant size [31]. The eight populations evolved for 10^{6} generations after emerging from a single ancestral population. For each population, 60 individuals were sampled out of 10000 individuals. In this context, we simulated DNA sequence evolution of ten loci of 300 base pairs under a Jukes and Cantor model [32] assuming a mutation rate of 5 × 10^{-6}. The stepping-stone model allows migration between adjacent populations: for example, at time t, population 4 can exchange individuals with populations 3 or 5, but not with other populations. We chose the following migration rates: 5 × 10^{-2}, 10^{-2}, 5 × 10^{-3}, 10^{-3}, 5 × 10^{-4}, 10^{-4}, 5 × 10^{-5}, 10^{-5}, 5 × 10^{-6}. We also simulated an eleventh locus that reveals a different population structure. For this locus, we assumed no migration between odd populations (i.e. populations 1, 3, 5, 7) and even populations (i.e. populations 2, 4, 6, 8) and a migration rate of 10^{-3} among odd or even populations, with other parameters kept unchanged. Such a simulation resulted in two clades of alleles which are obviously divergent, the first clade being specific to some populations (e.g. odd ones), the second clade being specific to other populations (e.g. even ones). Such a genetic structure can be observed in the case of either balancing/disruptive selection [e.g. [33]] or horizontal transfer of an outlier allele [e.g. [7]].
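The two migration schemes can be written down as migration matrices. The sketch below is only an illustration of the structures described above, not the SIMCOAL input used for the simulations: it assumes a symmetric rate, and for the eleventh locus it assumes that any two distinct populations of the same parity exchange migrants. The rate used in the example and the function names are illustrative.

```python
def stepping_stone(n_pop, m):
    """One-dimensional stepping-stone model: migration at rate m between neighbours only."""
    return [[m if abs(i - j) == 1 else 0.0 for j in range(n_pop)]
            for i in range(n_pop)]

def parity_structure(n_pop, m):
    """Migration at rate m between distinct populations of the same parity only.

    Assumed reading of the alternative structure: odd populations exchange
    migrants with odd ones, even with even, and never across the two groups.
    """
    return [[m if i != j and (i - j) % 2 == 0 else 0.0 for j in range(n_pop)]
            for i in range(n_pop)]

m = 0.001                      # illustrative migration rate
M_step = stepping_stone(8, m)
M_alt = parity_structure(8, m)

# Populations are numbered from 1 in the text and indexed from 0 here:
# population 4 (index 3) exchanges migrants with populations 3 and 5 only.
neighbours_of_4 = [j + 1 for j, rate in enumerate(M_step[3]) if rate > 0]
```

In the second matrix, the odd and even populations form two blocks that never exchange migrants, which is what produces two divergent clades of alleles.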
We applied the mDPCoA approach first on the complete data set, second on the allele distances only and then taking into account just the allele frequencies. We evaluated the intensity of interpopulation structure by measuring the AMOVA ϕ_{ ST }parameter [25].
Results
Application to the description of Sinorhizobium species diversity
The data set
In order to test the efficiency of the procedures we proposed, we needed a real data set which would give simple and explicit results but which would also encompass the features of complex MLS data sets. We chose to focus on nitrogen-fixing bacteria belonging to the genus Sinorhizobium (Rhizobiaceae) associated with the plant genus Medicago (Fabaceae). The data set we chose is a combination of two data sets fully available online from GenBank and published in two recent papers [8, 34]. The complete sampling procedure is described in the two papers and summarized in an additional file [see Additional file 3]. Based on the sampling scheme, we delineated seven populations according to geographical origin (France: F, Tunisia Hadjeb: TH, Tunisia Enfidha: TE), the host plant (M. truncatula or similar symbiotic specificity: T, M. laciniata: L), and the taxonomical status of the bacteria (S. meliloti: mlt, S. medicae: mdc). Each population is hereafter named according to these three criteria, e.g. THLmlt is the population of S. meliloti isolates sampled in Tunisia at Hadjeb from M. laciniata nodules. S. medicae interacts with M. truncatula while S. meliloti interacts with both M. laciniata (S. meliloti bv. medicaginis) and M. truncatula (S. meliloti bv. meliloti) [35, 36]. The numbers of individuals are respectively 46 for FTmdc, 43 for FTmlt, 20 for TETmdc, 24 for TETmlt, 20 for TELmlt, 42 for THTmlt and 20 for THLmlt [see Additional files 4, 5, 6, 7].
We applied the multiple DPCoA to this data set, and compared the results to those obtained with STRUCTURE [42, 43]. STRUCTURE estimates population structure using genotype data. The basic hypotheses are linkage equilibrium within subpopulations (or possibly weak linkage [44]) and Hardy-Weinberg equilibrium (if the organism under study is not haploid).
Results
Pairwise correlations among loci with the complete real data set
Mantel tests   IGS_{NOD}   IGS_{EXO}   IGS_{GAB}
IGS_{EXO}      0.164
IGS_{GAB}      0.173       1.000*
IGS_{RKP}      0.164       1.000*      0.999*

Rv tests       IGS_{NOD}   IGS_{EXO}   IGS_{GAB}
IGS_{EXO}      0.232
IGS_{GAB}      0.230       1.000*
IGS_{RKP}      0.227       1.000*      0.999*
There is a clear relationship between the patterns of population differences and the distribution of allelic diversity (Figure 6B). For instance, the two bacterial species did not share any alleles, even for the IGS_{NOD} locus. Furthermore, the populations associated with M. laciniata did not share any alleles with the populations associated with M. truncatula for the IGS_{NOD} locus, resulting in three independent allelic pools belonging respectively to S. medicae and the two biovars of S. meliloti. In addition, the distance between the IGS_{NOD} alleles associated with M. laciniata and those associated with M. truncatula is very high, almost as high as the distance which separates S. meliloti and S. medicae on IGS_{EXO}. The particular polymorphism pattern observed for IGS_{NOD} might be explained by both the host-plant selective pressure that acts on nod genes and the events of horizontal transfer that affect the nod gene cluster [34].
Relative effects of distances and frequencies
The conclusions which can be drawn from these analyses of the effects of distances and frequencies on the interpopulation diversity are as follows. In all of the analyses, the most peculiar locus remains IGS_{ NOD }. The high separation of populations according to their host plant is due to distinct and distant alleles for IGS_{ NOD }and allele distances for IGS_{ GAB }. The differences among IGS_{ GAB }, IGS_{ RKP }, and IGS_{ EXO }are due to differentiation patterns among S. meliloti populations. Finally, the distinction between the French and the Tunisian populations mostly relies on allele frequency data.
Discussion
The mDPCoA approach provides a useful tool for: (i) identifying atypical loci by both tests and factorial maps; (ii) describing differences in population structures between groups of congruent loci by factorial maps; (iii) including evolutionary distances among alleles, which is seldom done.
Missing data
In all the analyses we performed, the weight of a population is the number of individuals sampled from this population divided by the total number of individuals sampled. Given that we consider several loci, this definition of the weights supposes that we have identified the allelic composition of each individual for all loci. In case of missing allelic data, i.e. if the allelic content of some individuals is missing for one or several loci, one should define different weight systems depending on the loci. According to the g^{th} locus, the weight of population i is the number of characterized individuals from population i divided by the total number of characterized individuals. This would lead to G different systems of weights, i.e. one per locus. Unfortunately, neither STATIS nor the MCoA nor the MFA can support different population weights. Consequently, one will have to assume a similar set of population weights over loci although some data are missing. To overcome this problem, it may be assumed that the weight of a population is the number of individuals sampled from this population divided by the total number of individuals sampled, whether or not the allelic information for all the loci and for all the individuals is available.
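The two weighting options discussed above can be compared on a toy data set. In this sketch, each individual is described by its population and by a dictionary of alleles in which a missing allele is coded as None; the data, the coding convention and the function name are hypothetical.

```python
# Each individual: (population label, {locus name: allele, or None if missing}).
individuals = [
    ("pop1", {"locusA": "a1", "locusB": "b1"}),
    ("pop1", {"locusA": "a2", "locusB": None}),
    ("pop1", {"locusA": "a1", "locusB": None}),
    ("pop2", {"locusA": "a2", "locusB": "b2"}),
    ("pop2", {"locusA": None, "locusB": "b2"}),
]

def population_weights(individuals, locus=None):
    """Population weights over sampled individuals.

    With locus=None every sampled individual counts (a single weight system
    shared by all loci). With a locus name, only individuals characterized
    at that locus count (one weight system per locus).
    """
    counts = {}
    for pop, alleles in individuals:
        if locus is not None and alleles.get(locus) is None:
            continue
        counts[pop] = counts.get(pop, 0) + 1
    total = sum(counts.values())
    return {pop: n / total for pop, n in counts.items()}

shared = population_weights(individuals)               # same weights for all loci
per_locus_b = population_weights(individuals, "locusB")  # weights specific to one locus
```

Here the shared weights are 0.6 and 0.4, whereas the weights restricted to the second locus are 1/3 and 2/3, which shows how strongly missing data can change a locus-specific weight system.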
Another common case of missing data is the lack of information on nucleotide divergence among alleles. In that case, we suggest fixing the distance between any two different alleles equal to 1, so that the DPCoA is equal to the non-symmetric correspondence analysis [11, 45]. Furthermore, the inertia of the allelic points per population in the DPCoA "common space" is then equal to the gene diversity index H, introduced by Nei [28], and the inertia of the population points is equal to the gene diversity among populations defined by Nei [28] in his decomposition of gene diversity. The inertia among population points in the best compromise plot of DPCoA-STATIS is a measure of gene diversity among populations averaged over the G loci, where the weights given to the loci are not simply uniform but set optimally for synthesizing what is common to the loci. This process gives less weight to outliers and reflects the distances among populations as they are seen by the majority of the loci.
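The link with gene diversity can be checked numerically. The sketch assumes a diversity index of the quadratic form, i.e. the sum over all pairs of alleles of the product of their frequencies and their dissimilarity, with no additional scaling factor; under that assumption, unit dissimilarities between different alleles give one minus the sum of squared allele frequencies. The frequencies and function names are hypothetical.

```python
def quad_entropy(p, delta):
    """Quadratic diversity: sum over allele pairs of p_k * p_l * delta[k][l]."""
    n = len(p)
    return sum(p[k] * p[l] * delta[k][l] for k in range(n) for l in range(n))

def unit_dissimilarities(n):
    """Dissimilarity of 1 between different alleles, 0 between identical ones."""
    return [[0.0 if k == l else 1.0 for l in range(n)] for k in range(n)]

def gene_diversity(p):
    """One minus the sum of squared allele frequencies."""
    return 1.0 - sum(x * x for x in p)

p = [0.5, 0.3, 0.2]   # hypothetical allele frequencies in one population
h_unit = quad_entropy(p, unit_dissimilarities(len(p)))
h_nei = gene_diversity(p)
```

Both quantities are equal for any vector of frequencies summing to one, because the sum of the products over pairs of different alleles is what remains of the squared total once the squared frequencies are removed.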
Effects of frequencies and distances
The effect of frequencies and distances comprises two components: the effect due to sampling error and the effect due to population structure. The effects of sampling error on the component of nucleotide diversity within and between populations have been studied elsewhere [23, 46], and might be the object of further research in the context of the mDPCoA.
The relative effects of frequencies and distances on the analysis of population structure depend on the degree of differentiation among the populations under study. In the case of low differentiation, population structure is usually due to variations in allelic frequencies. For instance, differences among French and Tunisian populations of S. meliloti that are highlighted by IGS_{EXO}, IGS_{GAB} and IGS_{RKP} are due to allelic frequencies. Conversely, as the number of alleles shared by the different populations decreases, taking into account the information provided by sequence divergence becomes crucial to describe their relationships efficiently. For instance, the specific interpopulation structure of IGS_{NOD} is mainly due to sequence divergence.
Pertinence of the correlation tests
Both correlation tests (Mantel and Rv) can be nonsignificant for two reasons: either because of an absence of population structure or because the two loci compared reveal different population structures. As highlighted in a previous section, the estimated ϕ_{ST} parameter and the factorial maps obtained by one of the three versions of the mDPCoA (with MCoA, STATIS or the MFA) can be used to choose between the two alternatives. Concerning the relative interest of the two tests, the Rv test proved more powerful when applied to our simulated data set, so we advocate its use.
Relative advantages and disadvantages of the three proposed analyses – choice of a method
The three methods are alike in their procedure because they are all based on a compromise. However, they differ in the way the compromise is obtained. With the MCoA, the compromise is built during the definition of the factorial axes. It maximizes the average correlation among the individual analyses and the compromise. With STATIS, the compromise is obtained before going to the core of the multivariate ordination analysis. Here, the compromise maximizes the correlations among the patterns of interpopulation diversity provided by the loci. With the MFA, the pieces of information given by the loci are simply added to each other by creating a large table juxtaposing the information on the loci. This last method is the simplest, where pieces of information are simply added. On the other hand, MCoA and STATIS first compare the patterns of interpopulation diversity provided by the loci, either for visualizing in a single space the differences among loci or for erasing these differences, and find a best compromise over the loci, respectively.
Unfortunately, the representation of the differences among loci with STATIS is not optimal [15], because STATIS focuses on similarities rather than dissimilarities among loci. Consequently, in comparison with the alternative methods, it is theoretically less suited to explaining and describing the differences in population patterns among loci. The description of the differences among population patterns is thus more precise using MCoA and MFA. Conversely, the main advantage of STATIS over the other methods is that it provides a simpler compromise pattern.
The choice among the three methods therefore depends on the goal of the underlying study. If the objective is to obtain the best compromise over the loci, then we advocate the use of DPCoA with STATIS. However, if the objective is to obtain a detailed comparison among the population patterns provided by the G loci, then we encourage the use of the DPCoA with the MCoA.
Complementarity between mDPCoA and other analyses
The mDPCoA could be associated with other tools to study population structure, including the AMOVA, which forms the basis of the DPCoA, Linkage Disequilibrium (LD) statistics, and also recent approaches such as STRUCTURE or CLONAL FRAME.
The AMOVA averages molecular variability over loci to test the existence of differences between populations or groups of populations in terms of both allele frequencies and nucleotide distances among alleles. The Mantel and Rv statistics associated with the mDPCoA use the same information to test the differences between the interpopulation structures inferred by several loci.
Both linkage disequilibrium (LD) measures and the mDPCoA aim at assessing whether there is a significant association among the polymorphism patterns observed for different molecular markers. However, LD approaches and mDPCoA differ in several ways. Without discrepancies among the population structures, mDPCoA would fail to detect that different loci evolve independently, even if these are in linkage equilibrium at the population scale. Conversely, in the Sinorhizobium spp. data set, the mDPCoA detected that the IGS_{NOD} pattern of population differences was drastically different from the ones obtained with IGS_{RKP}, IGS_{GAB} and IGS_{EXO}, suggesting a horizontal gene transfer of nod genes between S. meliloti bv. meliloti and S. medicae. Because of the differentiation between S. meliloti and S. medicae, LD measures would have failed to detect such a transfer event. Linkage disequilibrium measures and mDPCoA therefore appear as complementary tools to study the influence of sex during the evolution of bacterial lineages.
The mDPCoA is above all a descriptive method, as it does not rely on any assumptions about models of evolution such as linkage equilibrium or selective neutrality. Nevertheless, this analysis pipeline can raise questions that can then be investigated using complementary analyses. For example, demonstrating differences among population structures obtained from different loci raises questions regarding the definition of population boundaries, or the genealogy of both genes and individuals. A consensus population structure could be inferred without any a priori knowledge using STRUCTURE, and its efficiency can be confirmed and illustrated using the correlation tests and the graphical outputs of the mDPCoA. CLONAL FRAME is an explanatory method that estimates clonal relationships and looks for key recombination events with a view to finding the mechanisms involved in microevolution [47]. It can be used to gain insights into the history of an atypical locus. Finally, the detection of traces of selection and mechanistic experiments can be of great interest to explain mDPCoA results. These different approaches thus complement the mDPCoA, and conversely, the mDPCoA complements these approaches. For instance, both STRUCTURE and CLONAL FRAME rely on MLS data, and the choice of the finite set of loci used in these analyses may be crucial. Each method can be improved by looking at the results returned by the two others. A joint interpretation of the results of these alternative methods may thus lead to a deeper analysis of particular loci and a better understanding of the data.
Conclusion
All three methods proposed can be used for a better description of interpopulation genetic diversity measured over more than one locus. They call for a new reflection on the role of means in measures of diversity: can we work on information averaged over loci, or do we first need to examine the differences among the patterns of diversity given by the loci? Sometimes, the differences among loci are so large that the compromise obtained by the multivariate analyses will be unstable and the use of averaged information can hamper interpretation. This issue is related to a question raised decades ago: can we build a unique, very synthetic measure of biodiversity, or do we have to resign ourselves to defining several conflicting measures? As it is based on multivariate analyses, the multiple DPCoA in its three forms can be used to analyze large data sets. It allows a comparison of genetic diversity measured on various loci. It complements existing tools such as AMOVA and linkage disequilibrium measures. It is used here on molecular data because it is in genetics that the question of congruence among markers was raised several years ago. We illustrated this procedure using a limited but complex sequence database. The method will have to be tested on other data sets, yet the results are already very promising. Moreover, mDPCoA is potentially more general than presented here, since it can be extended to any data set made of pairs of matrices, each pair comprising a matrix of abundance or presence/absence and a matrix of dissimilarities. Further applications in ecology could thus be considered, such as the description of intercommunity diversity based on both genotypic and phenotypic features.
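The kind of input mentioned above, an abundance matrix paired with a dissimilarity matrix, can be illustrated with a short sketch in plain Python. It assumes Rao's quadratic entropy computed on half squared dissimilarities as the within-group diversity, and the associated Jensen difference as a between-group dissimilarity; this is our reading of the framework on which the DPCoA builds, and the toy communities, the species coordinates and the function names are hypothetical.

```python
def relative_frequencies(counts):
    """Convert raw abundances (or presence/absence) into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def quadratic_entropy(p, D):
    """Quadratic entropy sum_ij p_i p_j d_ij^2 / 2 (assumed convention)."""
    n = len(p)
    return sum(p[i] * p[j] * D[i][j] ** 2 / 2 for i in range(n) for j in range(n))

def jensen_difference(p, q, D):
    """Diversity of the even mixture of p and q minus the mean diversity of p and q."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return quadratic_entropy(mix, D) - (quadratic_entropy(p, D) + quadratic_entropy(q, D)) / 2

# Hypothetical data: three categories (alleles or species) placed on one axis,
# so that their pairwise dissimilarities are Euclidean.
coords = [0.0, 1.0, 3.0]
dissimilarities = [[abs(a - b) for b in coords] for a in coords]

# Hypothetical abundance table: one row of counts per community.
abundances = {"site_1": [10, 0, 0], "site_2": [0, 0, 5], "site_3": [2, 2, 2]}
freqs = {name: relative_frequencies(row) for name, row in abundances.items()}

within = {name: quadratic_entropy(p, dissimilarities) for name, p in freqs.items()}
between_1_2 = jensen_difference(freqs["site_1"], freqs["site_2"], dissimilarities)

print("within-site diversity:", {k: round(v, 4) for k, v in within.items()})
print("site_1 vs site_2:", round(between_1_2, 4))
```

With one such pair of matrices per marker or per trait set, the same quantities could be computed separately for each pair and then compared, which is the situation the multiple analyses are designed for.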
Abbreviations
AMOVA: Analysis of MOlecular Variance
bv.: biovar
DPCoA: Double Principal Coordinate Analysis
FTmdc: Population sampled at Sainte Colombe l'Eglise in France from M. truncatula nodules which include S. medicae isolates
FTmlt: Population sampled at Sainte Colombe l'Eglise in France from M. truncatula nodules which include S. meliloti bv. meliloti isolates
IGS: Intergenic spacers
LD: Linkage disequilibrium
MCoA: Multiple Coinertia Analysis
mDPCoA: multiple Double Principal Coordinate Analysis
MFA: Multiple Factorial Analysis
MLS: Multilocus Sequencing
PCA: Principal Component Analysis
STATIS: comes from a French expression "structuration des tableaux à trois indices de la statistique" which means: structuration of the tables characterized by three statistical modes
TELmlt: Population sampled in Tunisia at Enfidha from M. laciniata nodules which include S. meliloti bv. medicaginis isolates
TETmdc: Population sampled in Tunisia at Enfidha from M. truncatula nodules which include S. medicae isolates
TETmlt: Population sampled in Tunisia at Enfidha from M. truncatula nodules which include S. meliloti bv. meliloti isolates
THLmlt: Population sampled in Tunisia at Hadjeb from M. laciniata nodules which include S. meliloti bv. medicaginis isolates
THTmlt: Population sampled in Tunisia at Hadjeb from M. truncatula nodules which include S. meliloti bv. meliloti isolates
Declarations
Acknowledgements
The authors are grateful to Pr. I Olivieri, Pr. JPW Young and two anonymous reviewers for their useful comments about this study. We also thank R. Lower and American Journal Experts, who helped us to improve the quality of this manuscript. This paper is part of a research project on "Biodiversity, perception and use" funded by the French Institute of Biodiversity. Within this more general context, we develop and discuss methodologies for measuring biodiversity on multimarker data sets at various scales, from individuals' gene loci to species' functional traits.
References
Cooper JE, Feil EJ: Multilocus sequence typing: what is resolved? Trends in Microbiology. 2004, 12: 373-377. 10.1016/j.tim.2004.06.003.
Hanage WP, Fraser C, Spratt BG: The impact of homologous recombination on the generation of diversity in bacteria. Journal of Theoretical Biology. 2006, 239: 210209. 10.1016/j.jtbi.2005.08.035.
Fraser C, Hanage WP, Spratt BG: Neutral microepidemic evolution of bacterial pathogens. Proceedings of the National Academy of Sciences of the United States of America. 2005, 102: 1968-1973. 10.1073/pnas.0406993102.
Metzker ML: Emerging technologies in DNA sequencing. Genome Research. 2005, 15: 1767-1776. 10.1101/gr.3770505.
Moazami-Goudarzi K, Laloë D: Is a multivariate consensus representation of genetic relationships among populations always meaningful? Genetics. 2002, 162: 473-484.
Hanage WP, Fraser C, Spratt BG: Fuzzy species among recombinogenic bacteria. BMC Biology. 2005, 3: 6. 10.1186/1741700736.
Falush D, Torpdahl M, Didelot X, Conrad DF, Wilson DJ, Achtman M: Mismatch induced speciation in Salmonella: model and data. Philosophical Transactions of the Royal Society of London Series B Biolog. 2006, 361: 2045-2053. 10.1098/rstb.2006.1925.
Bailly X, Olivieri I, De Mita S, Cleyet-Marel JC, Béna G: Recombination and selection shape the molecular diversity pattern of nitrogen-fixing Sinorhizobium sp. associated to Medicago. Molecular Ecology. 2006, 15: 2719-2734.
Falush D, Wirth T, Linz B, Pritchard JK, Stephens M, Kidd M, Blaser MJ, Graham DY, Vacher S, Perez-Perez GI, Yamaoka Y, Megraud F, Otto K, Reichard U, Katzowitsch E, Wang X, Achtman M, Suerbaum S: Traces of human migrations in Helicobacter pylori populations. Science. 2003, 299: 1582-1585. 10.1126/science.1080857.
Escoufier Y: Le traitement des variables vectorielles. Biometrics. 1973, 29: 750-760. 10.2307/2529140.
Pavoine S, Dufour AB, Chessel D: From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology. 2004, 228: 523-537. 10.1016/j.jtbi.2004.02.014.
Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, Gill SR, Nelson KE, Relman DA: Diversity of the human intestinal microbial flora. Science. 2005, 308: 1635-1638. 10.1126/science.1110591.
Bik EM, Eckburg PB, Gill SR, Nelson KE, Purdom EA, Francois F, Perez-Perez G, Blaser MJ, Relman DA: Molecular analysis of the bacterial microbiota in the human stomach. Proceedings of the National Academy of Sciences of the United States of America. 2006, 103: 732-737. 10.1073/pnas.0506655103.
Chessel D, Hanafi M: Analyses de la coinertie de K nuages de points. Revue de Statistique Appliquée. 1996. [http://www.numdam.org/item?id=RSA_1996__44_2_35_0]
Lavit C, Escoufier Y, Sabatier R, Traissac P: The ACT (Statis method). Computational Statistics and Data Analysis. 1994, 18: 97-119. 10.1016/01679473(94)901341.
Escofier B, Pagès J: Multiple factor analysis: results of a three-year utilization. Multiway data analysis. Edited by: Coppi R and Bolasco S. 1989, Elsevier Science Publishers B.V., North-Holland, 277-285.
Chessel D, Dufour AB, Thioulouse J: The ade4 package - I: One-table methods. R News. 2004, 4: 5-10. [http://cran.rproject.org/doc/Rnews/Rnews_20041.pdf]
Paradis E, Strimmer K, Claude J, Jobb G, Opgen-Rhein R, Dutheil J, Noel Y, Bolker B: ape: Analyses of Phylogenetics and Evolution. 2005, R package version 1.7.
Ihaka R, Gentleman R: R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996, 5: 299-314. 10.2307/1390807.
Gower JC: Euclidean distance geometry. Mathematical Scientist. 1982, 7: 1-14.
Lingoes JC: Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika. 1971, 36: 195-203. 10.1007/BF02291398.
Cailliez F: The analytic solution of the additive constant problem. Psychometrika. 1983, 48: 305-310. 10.1007/BF02294026.
Nei M, Li WH: Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the United States of America. 1979, 76: 5269-5273. 10.1073/pnas.76.10.5269.
Rao CR: Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology. 1982, 21: 24-43. 10.1016/00405809(82)900041.
Excoffier L, Smouse PE, Quattro JM: Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics. 1992, 131: 479-491.
Pavoine S, Dolédec S: The apportionment of quadratic entropy: a useful alternative for partitioning diversity in ecological data. Environmental and Ecological Statistics. 2005, 12: 125-138. 10.1007/s1065100510372.
Rao CR: Rao's axiomatization of diversity measures. Encyclopedia of Statistical Sciences. Edited by: Kotz S and Johnson NL. 1986, New York, Wiley and Sons, 614-617.
Nei M: Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences of the United States of America. 1973, 70: 3321-3323. 10.1073/pnas.70.12.3321.
Nei M: Molecular evolutionary genetics. 1987, New York, NY, USA, Columbia University Press.
Laval G, Excoffier L: SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004, 12: 2485-2487. 10.1093/bioinformatics/bth264.
Kimura M: Stepping Stone model of population. Annual Report of the National Institute of Genetics. 1953, 3: 62-63.
Jukes T, Cantor C: Evolution of protein molecules. Mammalian protein metabolism. Edited by: Munro HN. 1969, New York, Academic press, 21-132.
Charlesworth D, Mable BK, Schierup MH, Bartolomé C, Awadalla P: Diversity and Linkage of Genes in the Self-Incompatibility Gene Family in Arabidopsis lyrata. Genetics. 2003, 164: 1519-1535.
Bailly X, Olivieri I, Brunel B, Cleyet-Marel JC, Béna G: Horizontal gene transfer and homologous recombination drive the evolution of the nitrogen-fixing symbionts of Medicago species. Journal of Bacteriology. 2007, 189: 5223-5236. 10.1128/JB.0010507.
Bena G, Lyet A, Huguet T, Olivieri I: Medicago-Sinorhizobium symbiotic specificity evolution and the geographic expansion of Medicago. Journal of Evolutionary Biology. 2005, 18: 1547-1558.
Villegas MDC, Rome S, Maure L, Domergue O, Gardan L, Bailly X, Cleyet-Marel JC, Brunel B: Nitrogen-fixing sinorhizobia with Medicago laciniata constitute a novel biovar (bv. medicaginis) of S. meliloti. Systematic and Applied Microbiology. 2006, 29: 526-538. 10.1016/j.syapm.2005.12.008.
Barran LR, Bromfield ES, Brown DC: Identification and cloning of the bacterial nodulation specificity gene in the Sinorhizobium meliloti-Medicago laciniata symbiosis. Canadian Journal of Microbiology. 2002, 48: 765-771. 10.1139/w02072.
Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology. 2003, 52: 696-704. 10.1080/10635150390235520.
Felsenstein J, Churchill GA: A Hidden Markov model approach to variation among sites in rate of evolution. Molecular Biology and Evolution. 1996, 13: 93-104. [http://mbe.oxfordjournals.org/cgi/content/abstract/13/1/93]
McGuire G, Prentice MJ, Wright F: Improved error bounds for genetic distances from DNA sequences. Biometrics. 1999, 55: 1064-1070. 10.1111/j.0006341X.1999.01064.x.
Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 1981, 17: 368-376. 10.1007/BF01734359.
Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes. 2007, Published article online, doi: 10.1111/j.14718286.2007.01758.x.
Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
Falush D, Stephens M, Pritchard JK: Inference of population structure: Extensions to linked loci and correlated allele frequencies. Genetics. 2003, 164: 1567-1587.
Lauro N, D'Ambra L: L'analyse non symétrique des correspondances. Data Analysis and Informatics, III. Edited by: Diday E, Jambu M, Lebart L, Pages J and Tomassone R. 1984, North-Holland, Elsevier, 433-446.
Lynch M, Crease TJ: The analysis of population survey data on DNA sequence variation. Molecular Biology and Evolution. 1990, 7: 377-394. [http://mbe.oxfordjournals.org/cgi/content/abstract/7/4/377]
Didelot X, Falush D: Inference on bacterial microevolution using multilocus sequence data. Genetics. 2007, 175: 1251-1266. 10.1534/genetics.106.063305.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.