New analysis for consistency among markers in the study of genetic diversity: development and application to the description of bacterial diversity

Background The development of post-genomic methods has dramatically increased the amount of qualitative and quantitative data available to understand how ecological complexity is shaped. Yet, new statistical tools are needed to use these data efficiently. In support of sequence analysis, diversity indices were developed to take into account both the relative frequencies of alleles and their genetic divergence. Furthermore, a method for describing inter-population nucleotide diversity has recently been proposed and named the double principal coordinate analysis (DPCoA), but this procedure can only be used with one locus. In order to tackle the problem of measuring and describing nucleotide diversity with more than one locus, we developed three versions of multiple DPCoA by using three ordination methods: multiple co-inertia analysis, STATIS, and multiple factorial analysis.

Results This combination of methods allows i) testing and describing differences in patterns of inter-population diversity among loci, and ii) defining the best compromise among loci. These methods are illustrated by the analysis of both simulated data sets, which include ten loci evolving under a stepping stone model and a locus evolving under an alternative population structure, and a real data set focusing on the genetic structure of two nitrogen fixing bacteria, which is influenced by geographical isolation and host specialization. All programs needed to perform multiple DPCoA are freely available.

Conclusion Multiple DPCoA allows the evaluation of the impact of various loci in the measurement and description of diversity. This method is general enough to handle a large variety of data sets. It complements existing methods such as the analysis of molecular variance or other analyses based on linkage disequilibrium measures, and is very useful to study the impact of various loci on the measurement of diversity.


Additional file 2 - Instructions for performing multiple DPCoA in R
R is free software available at http://lib.stat.cmu.edu/R/CRAN. It includes hundreds of packages covering many scientific areas.
Create a working directory, and be careful to set it as the working directory in your R console (in the tool bar, use the File menu to choose the directory). To analyze your genetic data and perform the multiple DPCoA, you need to load two R packages: 'ape' and 'ade4'. Write the following instructions in your R console:
library(ade4)
library(ape)
Copy the file 'mdpcoa.R' (Additional file 1) into your working directory and load it into your R session. This file contains three functions: one for the preparation of the data (prep.mdpcoa), the second for the numerical calculations (mdpcoa), and the third for graphical displays (kplotX.mdpcoa). These three functions are described below.
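The set-up described above can be sketched as follows. This is only an illustration: it assumes that 'mdpcoa.R' has been copied into the current working directory, and the file.exists() guard is an addition made here so that the snippet does not stop with an error when the file is elsewhere.

```r
# Load the two packages required for the analysis.
library(ade4)
library(ape)

# Check which directory R is currently using.
getwd()

# Assumption: 'mdpcoa.R' (Additional file 1) sits in that directory.
if (file.exists("mdpcoa.R")) {
  source("mdpcoa.R")  # defines prep.mdpcoa, mdpcoa and kplotX.mdpcoa
} else {
  message("mdpcoa.R not found in ", getwd())
}
```

Some ade4 releases also provide functions named prep.mdpcoa, mdpcoa and kplotX.mdpcoa; their arguments may differ from those in the additional file, so it is worth checking which version is in use.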

1/ IF YOUR DATA SET CONTAINS LISTS OF DNA SEQUENCES
Create a folder in your working directory (for our case study, name it "dna" in lowercase), open it and copy your DNA files into it. For our case study, copy the files 'NOD.aa', 'EXO.aa', 'GAB.aa', and 'RKP.aa' (Additional files 4 to 7) into the "dna" folder. Each file corresponds to a locus and contains the sequences in 'FASTA' format. Note that the package ape can handle various DNA file formats (see the manual of this package for more information).
Next we need to load the four files. To obtain a factor giving the name of the population to which each sequence belongs, load the file "pop.txt":
pop <- as.factor(read.
Use the function "prep.mdpcoa" to prepare the data. If your data set contains lists of DNA sequences, the function will provide you with the list of matrices giving the abundance of each allele in the populations, the list of distances among alleles, and the list of the allele DNA sequences.
The function prep.mdpcoa has five parameters:
prep.mdpcoa(folder, pop, format, model, ...)
folder is a character string giving the path to the folder that contains the DNA sequences. This folder should contain only the sequences of interest for the study.
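As an illustration of how such a population factor can be built, here is a minimal sketch. The layout of "pop.txt" is not specified in these instructions, so the snippet assumes a hypothetical file with one population label per line, no header, in the same order as the sequences; the labels and the temporary file are invented for the example.

```r
# Hypothetical layout: one population label per line, no header.
pop_file <- tempfile(fileext = ".txt")
writeLines(c("popA", "popA", "popB", "popC", "popB"), pop_file)

# Read the labels and convert them to a factor.
pop_labels <- readLines(pop_file)
pop <- as.factor(pop_labels)

levels(pop)  # names of the populations
table(pop)   # number of sequences per population
```

The factor must have one entry per sequence, since it is used to assign each sequence to its population.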
pop is a factor giving the name of the population to which each sequence belongs.
format is a character string specifying the format of the DNA sequences. Three choices are possible: '"interleaved"', '"sequential"', or '"fasta"', or any unambiguous abbreviation of these.
model is a vector giving the model to be applied for the calculation of the distances for each locus. One model should be attributed to each locus, with the loci taken in alphabetical order. The models can take the following values: "raw", "JC69", "K80" (the default), "F81", "K81", "F84", "BH87", "T92", "TN93", "GG95", "logdet", or "paralin". See the help documentation for the function "dist.dna" of ape for a description of the models.
... denotes further arguments passed to the read.dna function.
We now apply this function to our real data set:
dat <- prep.mdpcoa("dna", pop, model = c("F84", "F84", "F84", "F81"), pairwise.deletion = TRUE)
The object "dat" is a list of three sublists:
dat$sam: a list of data frames with the populations as columns, alleles as rows and abundances as entries. Each table corresponds to a locus.
dat$dis: a list of objects of class 'dist', corresponding to the distances among alleles.
dat$alleleseq: a list of objects of class 'dna' providing the DNA sequence of each allele for all the loci.
We will now store the abundance matrices and the distances in two objects (called sam and dis in the calls below), and proceed to the mDPCoA.
The distances should be Euclidean. Several transformations exist to render a distance object Euclidean (see functions cailliez, lingoes and quasieuclid in the ade4 package). Here we use the quasieuclid function.
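To make the layout of one table of dat$sam concrete, the following sketch builds a table of the same shape from invented data. The allele and population labels are hypothetical, and the cross-tabulation is only an illustration of the structure, not the code used inside prep.mdpcoa.

```r
# Hypothetical toy data: one allele label and one population label per sequence.
allele <- c("a1", "a1", "a2", "a3", "a2", "a1")
pop    <- factor(c("P1", "P1", "P1", "P2", "P2", "P2"))

# Cross-tabulate: alleles as rows, populations as columns, counts as entries.
sam_toy <- as.data.frame.matrix(table(allele, pop))
print(sam_toy)
```

In the real analysis there is one such table per locus, and all tables share the same columns (populations).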
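The Euclidean check and transformation can be sketched on small invented distance objects. In the real workflow the list would be dat$dis (with the abundance tables taken from dat$sam); here two hypothetical three-point distance matrices that violate the triangle inequality stand in for it, so that the effect of quasieuclid is visible.

```r
library(ade4)

# Hypothetical stand-in for dat$dis: two distance objects that violate the
# triangle inequality and therefore cannot be Euclidean.
m1 <- matrix(c(0, 1, 3,
               1, 0, 1,
               3, 1, 0), nrow = 3)
m2 <- matrix(c(0, 2, 5,
               2, 0, 1,
               5, 1, 0), nrow = 3)
dis_toy <- list(locus1 = as.dist(m1), locus2 = as.dist(m2))

sapply(dis_toy, is.euclid)               # check before the transformation
dis_toy <- lapply(dis_toy, quasieuclid)  # transform locus by locus
sapply(dis_toy, is.euclid)               # check after the transformation
```

Note that quasieuclid issues a warning, and returns its input unchanged, when a distance object is already Euclidean.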

2/ A UNIQUE FUNCTION WITH A LARGE SET OF OPTIONS:
As indicated above, the functions developed for the multiple DPCoA are available in Additional file 1. If it has not yet been done, copy this file into your working directory and load it into your R session. This file contains three functions: the function prep.mdpcoa that we have just seen, a function for the numerical calculations (mdpcoa), and a third function for graphical displays (kplotX.mdpcoa). The last two functions have the following signatures:
mdpcoa <- function(msamples, mdistances = NULL, method = c("mcoa", "statis", "mfa"), option = c("inertia", "lambda1", "uniform", "internal"), scannf = TRUE, nf = 3, full = TRUE, nfsep = NULL, tol = 1e-07)
kplotX.mdpcoa <- function(object, xax = 1, yax = 2, mfrow = NULL, which.tab = 1:length(object$nX), includepop = FALSE, clab = 0.7, cpoi = 0.7, unique.scale = FALSE, csub = 2, possub = "bottomright")
In these two expressions, the parameters are:
msamples: a list of data frames with the populations as columns, alleles as rows and abundances as entries. All the tables should have equal numbers of columns (populations). Each table corresponds to a locus.
mdistances: a list of objects of class 'dist', corresponding to the distances among alleles. The order of the loci should be the same in msamples as in mdistances.
method: one of the three possibilities: "mcoa", "statis", or "mfa". If a vector is given, only its first value is considered.
option: one of the four possibilities for normalizing the population coordinates over the loci: "inertia", "lambda1", "uniform", or "internal". These options are used with MCoA and MFA only.
scannf: a logical value indicating whether the eigenvalue bar plots should be displayed.
nf: if scannf is FALSE, an integer indicating the number of axes kept for the multiple analysis.
full: a logical value indicating whether all the axes should be kept in the separate analyses (one analysis, DPCoA, per locus).
nfsep: if full is FALSE, a vector indicating the number of axes kept for each of the separate analyses.
tol: a tolerance threshold for null eigenvalues (a value less than tol times the first one is considered as null).
object: an object of class 'mdpcoa'.
xax: the number of the x-axis.
yax: the number of the y-axis.
mfrow: a vector of the form 'c(nr,nc)', otherwise computed by a special own function 'n2mfrow'.
which.tab: a numeric vector containing the numbers of the loci to analyse.
includepop: a logical indicating whether the populations must be displayed. In that case, the alleles are displayed by points and the populations by labels.
clab: a character size for the labels.
cpoi: a character size for plotting the points, used with 'par("cex")*cpoi'. If zero, no points are drawn.
unique.scale: if TRUE, all the arrays of figures have the same scale.
csub: a character size for the labels of the arrays of figures, used with 'par("cex")*csub'.
possub: a string of characters indicating the sub-title position ("topleft", "topright", "bottomleft", "bottomright").
An object obtained by the function mdpcoa has two classes. The first one is "mdpcoa" and the second is either "mcoa", "statis", or "mfa", depending on the method chosen. Consequently, other functions already available in ade4 for displaying graphical results can be used:
With MCoA, plot.mcoa: this function displays (1) the differences among the populations according to each locus and the compromise, (2) the projection of the principal axes of the individual analyses onto the synthetic variables (noted V in the main text of the paper).
Further comments on these functions can be found in the help files of ade4.

DPCOA and MCOA:
# The calculations are saved in an object called "mdpcoa1":
mdpcoa1 <- mdpcoa(sam, dis, "mcoa", scannf = FALSE, full = FALSE, nfsep = rep(2,4))
# The following plots correspond to Figure 6 in the article. They show the discrepancies among loci in the population patterns they provide. The first plot is the compromise. In the next four plots, arrows connect the positions of the populations according to the compromise with their positions according to the given locus.
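For readers without the FASTA files at hand, the MCoA step can be tried end to end with the following sketch. It rests on several assumptions: that the installed ade4 provides a 'rhizobium' data set and its own prep.mdpcoa, whose first argument is a list of DNA objects rather than a folder path, and that kplot() is an acceptable display of the differences among loci. It is not claimed to reproduce Figure 6 exactly.

```r
library(ade4)
library(ape)

# Assumption: ade4 bundles the 'rhizobium' data (list of DNA objects + factor).
data(rhizobium)
dat <- ade4::prep.mdpcoa(rhizobium[[1]], rhizobium[[2]],
                         model = c("F84", "F84", "F84", "F81"),
                         pairwise.deletion = TRUE)
sam <- dat$sam
dis <- lapply(dat$dis, quasieuclid)  # warns if a distance is already Euclidean

# Same call as in the text: MCoA, two axes kept for each separate DPCoA.
mdpcoa1 <- ade4::mdpcoa(sam, dis, "mcoa", scannf = FALSE, full = FALSE,
                        nfsep = rep(2, 4))
class(mdpcoa1)

# kplot() dispatches on the second class ("mcoa"): one panel per locus.
kplot(mdpcoa1)
```

Because the object also inherits from "mcoa", the generic plot() can be used in the same way to obtain the displays of plot.mcoa.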

DPCOA and STATIS:
# The calculations are now saved in an object called "mdpcoa2":
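By analogy with the MCoA call above, a STATIS run might look like the sketch below. The argument values (nf = 2 in particular) are hypothetical choices, not taken from the article, and the data preparation again assumes the 'rhizobium' data set and the prep.mdpcoa version bundled with ade4.

```r
library(ade4)
library(ape)

# Assumption: same bundled data and preparation as in a standard ade4 install.
data(rhizobium)
dat <- ade4::prep.mdpcoa(rhizobium[[1]], rhizobium[[2]],
                         model = c("F84", "F84", "F84", "F81"),
                         pairwise.deletion = TRUE)
sam <- dat$sam
dis <- lapply(dat$dis, quasieuclid)

# Hypothetical argument choices: STATIS with two axes kept.
mdpcoa2 <- ade4::mdpcoa(sam, dis, method = "statis", scannf = FALSE, nf = 2)
class(mdpcoa2)
```

The "option" argument is not set here because, as stated in the parameter description, it is used with MCoA and MFA only.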