Identification of geographically distributed sub-populations of Leishmania (Leishmania) major by microsatellite analysis

Background Leishmania (Leishmania) major, one of the agents causing cutaneous leishmaniasis (CL) in humans, is widely distributed in the Old World where different species of wild rodent and phlebotomine sand fly serve as animal reservoir hosts and vectors, respectively. Despite this, strains of L. (L.) major isolated from many different sources over many years have proved to be relatively uniform. To investigate the population structure of the species highly polymorphic microsatellite markers were employed for greater discrimination among it's otherwise closely related strains, an approach applied successfully to other species of Leishmania. Results Multilocus Microsatellite Typing (MLMT) based on 10 different microsatellite markers was applied to 106 strains of L. (L.) major from different regions where it is endemic. On applying a Bayesian model-based approach, three main populations were identified, corresponding to three separate geographical regions: Central Asia (CA); the Middle East (ME); and Africa (AF). This was congruent with phylogenetic reconstructions based on genetic distances. Re-analysis separated each of the populations into two sub-populations. The two African sub-populations did not correlate well with strains' geographical origin. Strains falling into the sub-populations CA and ME did mostly group according to their place of isolation although some anomalies were seen, probably, owing to human migration. Conclusion The model- and distance-based analyses of the microsatellite data exposed three main populations of L. (L.) major, Central Asia, the Middle East and Africa, each of which separated into two sub-populations. This probably correlates with the different species of rodent host.


Page 2 of 13
(page number not for citation purposes) Background Leishmania (Leishmania) major is one of the agents causing Old World cutaneous leishmaniasis (CL) in humans, which, in the case of L. (L.) major, is a rural, zoonotic, vector-borne disease, involving different species of wild rodents as animal reservoir hosts and different species of phlebotomine sand flies as vectors, depending on the geographical location where it occurs (reviewed, [1]. Humans, though infected in large numbers, are considered to be incidental hosts not directly implicated in the transmission cycle [2]. The parasite and, thus, the disease are geographically widely distributed in the arid and semi-arid areas of: North-West, North, Central sub-Saharan and East Africa; the Near and Middle East; Central Asia; and Rajasthan, India. Despite the very broad geographical distribution and large variety of types of wild animal host and sand fly vector, the different strains of L. (L.) major isolated from many sources over many years have proved to be relatively uniform when studied by most of the classical and more modern methods used for characterizing leishmanial strains. Serological tests like excreted factor (EF) serotyping and the application of Leishmania species-specific monoclonal antibodies have shown that antigenic differences exist among different strains of L. (L.) major [3]. Multilocus enzyme electrophoresis (MLEE) exposed some enzyme electrophoretic variation with some isoenzyme variant profiles showing a degree of geographical subdivision within a general enzyme electrophoretic uniformity [4][5][6].
The more recent application of various molecular biological methods revealed geographically distributed genetic polymorphism among different strains of L. (L.) major. Analysis of sequence polymorphisms in seven coding, non-coding and anonymous nuclear DNA sequences [1] as well as a partial sequence from the kinetoplast DNA maxicircle divergent region [7] of strains of L. (L.) major showed that the strains from Central Asia and the Middle East were genetically distinct. None to very little variation was seen among the Central Asian strains and somewhat more among the Middle Eastern ones. Only a few East African strains of L. (L.) major were included in these studies, which tended to be genetically closer to the Middle Eastern than to the Central Asian ones. Different patterns in the most variable sequence were attributed to variations in complex microsatellite repeats. This prompted a search for highly polymorphic microsatellite markers that would allow greater discrimination of otherwise closely related strains of L. (L.) major.
Microsatellites are repeated simple motifs of a few nucleotides (<6) flanked by unique sequences. They are ubiquitous in prokaryotes and eukaryotes and have been shown to exhibit a substantial level of polymorphism in a number of eukaryotic genomes [8,9]. They are becoming one of the principal genetic marker systems in phylogenetic, population genetic and molecular epidemiological studies. The leishmanial genome is relatively rich in microsatellite sequences, about 600 (CA) n loci per haploid genome, with CA being the most abundant dinucleotide repeat as in all vertebrates and fungi investigated so far [9][10][11]. Microsatellite markers have been used successfully for characterizing and detecting genetic variation in other Old World leishmanial strains and species, i. e., L. (Leishmania) donovani and L. (Leishmania) infantum [12][13][14] and L.
This study adds the species L. (L.) major. Ten informative microsatellite loci based on nucleotide sequence information of L. (L.) major obtained from the Leishmania genome project were used to determine polymorphism and microheterogeneity, and their geographical and epidemiological distribution.

Microsatellite analysis
Twenty-three sequences located on chromosomes 1, 3, 5, 21 and 35 that contained nucleotide repeats such as (AT)n, (GC)n, (CA)n, (GTG)n, and (GACA)n were selected from the genome sequence of L. (L.) major published by the European Bioinformatics Institute. PCR primers were designed for microsatellite loci that contained at least 6 repeats in the reference strain L. (L.) major MHOM/IL/1980/Friedlin and were located on different chromosomes or, if located on the same chromosome, were at least distant enough to be considered unlinked ( Table 1). Fifteen of the 23 primer pairs yielded a single PCR fragment for the majority of the strains tested. Ten of these 15 primer pairs amplified polymorphic fragments of different size in different strains of L. (L.) major (Table 1).
Only 4 out of 45 pair-wise combinations (9%) were in significant linkage disequilibrium (P < 0.05) when GENE-POP was used. No linkage was observed using the Bonferroni corrections implemented in FSTAT.
The 10 polymorphic loci were used to perform multilocus microsatellite typing (MLMT) on 106 strains of L. (L.) major collected in 10 Asian and 9 African countries. Four of the 10 loci showed only homozygous allele combinations. However, three GTG loci, two AT loci and the GACA locus were heterozygous for one or more of the strains (see Additional file 1). In some cases, no PCR product could be obtained in repeated PCR runs, which was treated as missing data for the statistical analyses.
Sixty-six differing MLMT profiles comprising the number of repeats in each of the 10 markers were revealed among the 106 strains (see Additional file 1). The MLMT profiles were named Lmj, thus indicating that these microsatellite profiles are unique to L. (L.) major, and numbered from 1 to 66. Figures 1 to 3 show their genetic inter-relationship. Most profiles were represented by a single strain; eleven were present in more than one strain. Nine microsatellite variant profiles were exposed among the 23 strains in the sub-population CA1; six among the 16 strains in the subpopulation CA2; thirteen among the 26 strains in the subpopulation ME1; fifteen among the 17 strains in the subpopulation ME2; eleven among the 11 strains in the subpopulation AF1; and twelve among the 13 strains in the sub-population AF2 (see Table 2, and Figure 2 for origins and geographical distributions). Table 3 shows that the number of alleles varies between loci, ranging from 3 to 10. An increased degree of inbreeding within loci (Fis) was detected, ranging from 0.874-1.00 (P < 0.05) with a mean of 0.976. Observed heterozygosity (Ho) among loci was extremely low compared with the heterozygosity (He) expected under assumption of the Hardy-Weinberg equilibrium. These results clearly show that L. (L.) major is largely clonal and that hybridization only occurs at very low frequencies.

Population structure
The population structure was investigated, using the Bayesian-model based clustering approach implemented in the software STRUCTURE [16]. The most probable number of populations in this data set by calculating ΔK was three: Central Asia (CA, comprising strains from Uzbekistan, Turkmenistan and Kazakhstan; the Middle East (ME), comprising strains from Israel, The Palestinian Authority, Egypt, Saudi Arabia, Kuwait and eastern Turkey; and Africa (AF), comprising strains from North, East and West Africa, western Turkey, Iran and Iraq ( Figure 1A and Table 2). STRUCTURE analyses were re-done with each of the three main populations, CA, ME and AF, to expose further subdivision; and each main population did separate into two sub-populations: CA1 and CA2; ME1 and ME2; AF1 and AF2 (Figures 1B, C and 1D, Table 2). Although calculations of ΔK do not allow for differentiation between one or two groups in the dataset, the strains' assignment ( Table 2) was most congruent with the existence of two sub-populations. The distribution of the different subpopulations is shown in Figure 2.
On using a predefined clustering approach, the total area from which the strains of L. (L.) major were collected was divided into five geographical sectors: Central Asia (CA), the Middle East (ME), North, West and East Africa with the last three being referred to collectively as Africa (AF). The boundaries between these sectors are mainly significant geographical barriers like seas and deserts. Each strain was assigned to one of the sectors, according to its place of collection without regard to underlying genetic relationships ( Figure 3). The Central Asian cluster was highly homogenous, genetically and according to predefined geographical placement, with all the strains having a very high membership coefficient of 0.999. The Middle  East cluster was also genetically well defined, except for the exclusion of the two western Turkish and sole Iraqi and Iranian strains, which grouped with the African strains. The entire African cluster was very heterogeneous regarding MLMT profiles and did not separate into dis-tinct North, West and East African regional clusters as envisaged by the predefined clustering approach.
Increasing values of K (2-6) were used to identify ancestral populations. At K = 2, the first split separated the population CA from other strains. At K = 3, three main  populations, CA, ME and AF were exposed. At K = 4, the African strains split into two clusters. At K = 5, the Middle Eastern cluster separated into two distinct sub-clusters, ME1 and ME2 while at K = 6 the population CA split into the sub-clusters CA1 and CA2.

Construction of distance trees
For the phylogenetic analyses of these strains of L. (L.) major, the dataset was clone-corrected and presented 66 microsatellite genotypes. NJ and UPMGA trees were constructed from the different distance matrices obtained. All phylograms displayed the same five clusters, albeit with different bootstrap support values, as shown, for example, in the unrooted NJ tree (Figure 4) that was based on a Dps distance matrix and has: i) a Central Asian cluster; ii) two clusters of strains from the Middle East, corresponding to the sub-populations ME1 and ME2 as determined by STRUCTURE, and iii) two clusters of strains from Africa, which correlate perfectly with the sub-populations AF1 and AF2 as determined by STRUCTURE. The sub-division seen in the Central Asian cluster is largely congruent with the two sub-populations CA1 and CA2 identified by STRUCTURE. As in the STRUCTURE analysis, the two western Turkish strains (Lmj 36) grouped with the strains in the sub-population AF2 while the eastern Turkish one (Lmj 50) belonged to the sub-population ME 1. The two strains from Saudi Arabia formed a unique branch in the unrooted NJ tree, closer to the Central Asian cluster, however with only low bootstrap values.
Estimated population structure obtained with STRUCTURE Figure 1 Estimated population structure obtained with STRUCTURE. This is shown as plots of the estimated membership coefficient (Q), which is represented by a single vertical line for each sample. Coloured segments represent the sample's estimated membership in each of the K inferred clusters. Individual isolates can have membership in multiple clusters, with membership coefficients summing up to 1 across clusters. Panel A represents the STRUCTURE result for the whole data set, whereas panels B -D show sub-structuring obtained when STRUCTURE was re-run for each main population separately. CA = Central Asia; ME = Middle East; AF = Africa.

Population genetic data
Fst, as a measure of genetic differentiation between populations, calculated in a pair-wise manner for the main populations, revealed high genetic differentiation between the populations CA and ME, and CA and AF. Genetic isolation was less pronounced between the populations ME and AF. When Fst was calculated at a sub-population level, moderate (Fst = 0.05 -0.15), high (Fst = 0.15 -0.25) and very high (Fst > 0.25) genetic differentiation was observed for the pairs AF1/AF2, ME1/ME2 and CA1/CA2, respectively (Table 4).
Genetic flow or migration rate, Nm, refers to the movement of strains between populations and sub-populations and sets a limit as to how much genetic divergence can occur. At the sub-population level, the genetic flow was the highest between the two African sub-populations and the least between the two Central Asian sub-populations (Table 4). At the pre-defined population classification, there was clear genetic flow (Nm) between the African regions (1.4962 and 1.7216).

The numbers of alleles encompassed by geographical groupings
To compare the number of alleles falling within the three main geographical groupings, the MLMT profiles were grouped according to their geographical origin as specified by their WHO codes: Central Asia (15 profiles); the Middle East (27 profiles); and Africa (21 profiles). In order to compare groups of equal size, a re-sampling procedure was used [15]. The African group had the highest

Discussion
Elfari et al. [1] showed that genetic and biological variation among strains of L. (L.) major tended to correspond with their geographical origin. Among the several analytical techniques and methods developed in that study, only two employed genetic markers that were polymorphic enough to distinguish between closely related strains. For population genetic studies of L. (L.) major like this one, highly discriminative markers had to be developed. Of the microsatellite markers designed, ten, all based on sequences published by the Leishmania Genome Project [11], have proved to be suitable for multilocus microsatellite typing (MLMT). Microsatellite analysis of the 106 strains of L. (L.) major revealed 66 different microsatellite profiles. These were unique to this species of Leishmania and demonstrated substantial genetic micro-heterogeneity with regard to microsatellite profiles. In many cases in this study, particular microsatellite profiles were found in single strains. Had many more strains of L. (L.) major been available, these profiles would, probably, have been associated with groups of strains as was seen with the strains isolated at Mubarak and Termez, Uzbekistan, (Table 2 and Figure 3).
The model-and distance-based analyses of the microsatellite data exposed three main populations of L. (L.) major, corresponding to three separate geographical regions, Central Asia, the Middle East and Africa. When the data were analyzed by STRUCTURE with an increase in the number of populations from 2 to 6 in order to identify the ancestral source population, the first split separated the Central Asian strains of L. (L.) major from all the others, suggesting, possibly, their older origin. The Bayesian algorithm implemented in STRUCTURE was more appropriate for characterizing population structure because it identified distinct sub-populations based on patterns of allele frequencies and, also, determined fractions of the genotype that belong specifically to each sub-population. STRUCTURE has been shown to accurately infer individ-Population structure as shown by plots of Q (estimated membership coefficient for each sample), using five populations pre-defined according to their geographical origin Figure 3 Population structure as shown by plots of Q (estimated membership coefficient for each sample), using five populations pre-defined according to their geographical origin. Regions may have more than one colour. For instance, strains from Iran and Iraq geographically belong to ME but genetically to East Africa as represented by the colour yellow.  ual ancestries [16] and provide information on population relationships and history [17].
The presence of these three populations was supported by the detection of genetic isolation among them, particularly between population CA and the two populations AF and ME that were separate but genetically closer to one another. This may reflect the geographical distance and barriers between the three main clusters (data not shown).
Genetic distances between populations pre-defined by Struc-ture Figure 5 Genetic distances between populations pre-defined by Structure. Strains belonging to populations and sub-populations as inferred by Structure are indicated in Table 2.   Both types of analysis led to sub-division of the main clusters. Congruence of the results supported tripartite clustering despite the absence of strong bootstrap values on the main branches of the distance tree (Figures 4 and 5). The two sub-groups in each of the three main populations were also supported by F-statistics, although the genetic differentiation was less pronounced, especially between the sub-populations AF1 and AF2, and between the subpopulations ME1 and ME2. Wright (1978) maintained that even slight genetic differentiation among sub-populations is significant.
Within the context of this study, it seems that the populations AF and ME were more genetically diversified than the population CA. This might be due to different sampling procedures in the three areas. The 22 strains from Africa were collected between 1976 and 2004 in 16 separated locations. In contrast to that, the 37 ME strains were collected between 1967 and 2003 from different foci encompassed by the Eastern Mediterranean region and Arabian peninsula, and the 39 strains from Central Asia between 1973 and 2003 in only 3 countries. To correct, at least partially, for this sampling bias mean number of alleles have been calculated based on the MLMT profiles present in the three areas, e.g. by excluding profiles shared by more than one strain. When corrected for Central Asia, the smallest group, the lowest number of alleles was still found in Central Asia followed by the Middle East and then Africa. The sampling bias might still account for the high heterogeneity of the African strains, but the differences in diversity for CA and ME strains cannot be explained only by differences in sampling. Most of the ME strains were collected in Israeli and Palestinian foci from a territory much smaller than the Central Asian area. The area from where the Central Asian strains of L. (L.) major studied here came is a landlocked generally arid plateau. The greater variation of habitats and biotopes within Middle Eastern environments seems to translate into a greater variety of animal hosts and sand fly vectors compared to Central Asia.
Where groups of strains with the same microsatellite profile were available, it was possible to make some assessment with regard to time and spatial distribution. For example, the strains in the groups of strains with the microsatellite profiles Lmj 6, Lmj 15 and Lmj 40 were isolated in a given area for each group at the same time or within short periods of time. However, strains with the microsatellite profile Lmj 1, Lmj 39 and Lmj 53 were isolated in several locations at different times. This, in addition to the observed low heterozygosity at microsatellite loci, seems to provide evidence against extensive and frequent genetic exchange within L. (L.) major. That microsatellite variants can persist for long periods of time and are restricted or widely spread in their distribution, supposedly, results from the effect of biotopic conditions on the animal hosts and vectors of leishmanial parasites and much less so in the case of the humans, who do alter the biotope by construction, agriculture and many of their other activities.
Most of the strains in the population ME came from Israel and The Palestinian Authority and their separation into the sub-populations ME1, Jordan Valley-Dead Sea area, and ME2, the Negev-Sinai area, did seem to parallel geographical distribution but not entirely so. Most 'misplaced' strains were from human cases and could be explained by travel and migration between these areas. Several strains have membership coefficients for both subpopulations, indicating gene flow between them.
The two African sub-populations did not correlate well with the geographical origin of the strains that fell into them. Their analysis was hampered owing to the small number of strains available and their very wide geographical dispersal. Unlike Dps-based phenetic analysis (NJ, UPGMA), which is not affected by population size [18], STRUCTURE improves with increasing sample size for each population [16]. In this study, the results obtained using STRUCTURE were consistent with those of the phenetic analysis. Nevertheless, more strains from different African foci need to be analysed to really determine the population structure of African strains of L. (L.) major.
Preferably, one should use strains isolated from sand fly vectors and rodent hosts to avoid the effect of human migration. Small sample size was also a problem in the case of the two single strains from Iran and Iraq and the two from western Turkey that grouped together with the African strains and the eastern Turkish strain that grouped with the Middle Eastern strains (Table 2 and Figure 3).
Waki et al. [19] hypothesized that adaptation to the micro-environment in the macrophages of the mammalian hosts, the place of long-term residence of all leishmanial parasites, including L. (L.) major, is most significant in the evolution of the genus Leishmania. They considered adaptation to the sand fly vectors' gut, the place of shortterm residence, to be of lesser evolutionary importance. That the three main populations of L. (L.) major identified by this study exist in endemic foci that each have different rodent species serving as the reservoir hosts [20] seems to support this hypothesis. Variation among the strains of L. (L.) major from the same endemic area leading to different sub-populations could be attributed to differences in sand fly vector populations. For instance, analysis of mitochondrial cytB haplotypes exposed two genetically distinct populations of Ph. papatasi in the Middle East [21].

Conclusion
Multilocus Microsatellite Typing (MLMT) based on 10 different microsatellite markers and using a Bayesian model-based analysis as well as phylogenetic reconstructions based on genetic distances, identified three main populations, corresponding to three separate geographical regions: Central Asia (CA); the Middle East (ME); and Africa (AF). Each of them separated into two sub-populations which might reflect the existence of different species of rodent host in these areas. The African and Middle Eastern populations seemed to be more genetically diversified than the Central Asian population. Sampling bias might account for the high heterogeneity of the African strains but cannot explain the differences in diversity for Central Asian and Middle Eastern strains.  [25]. Screening for microsatellite variation was also done using 3.5% MetaPhor agarose gels [13]. The lengths of PCR products in PAGE and Metaphor agarose electrophoresis were determined by comparing them with small-molecular-size markers (10-bp ladder; Invitrogen, Life Tech, Carlsbad, CA USA). Fluorescent-labelled PCR products were analyzed and measured for size with the fragment analysis tool of the CEQ 8000 automated genetic analysis system (Beckman Coulter, USA) as previously described [13,26]. The fragment sizes estimated by the different methods used were always compared to that of strain L.

Data analysis
The software STRUCTURE Version 2 [16] was used to reveal potential population structure in the data set. The programme was run using an admixture model with a burn-in period of 30,000 iterations, followed by 1,000,000 Markov Chain Monte Carlo (MCMC) repeats. Based on allele frequencies, this approach identifies genetically distinct populations by estimating, for each individual studied, in this case each leishmanial strain, the fraction of the genotype belonging solely to it. Individuals can be assigned to multiple clusters with the membership coefficients of all those clusters summing up to one. The most fitting number of populations was determined by comparing log-likelihood values for K (number of populations) between 1 and 10. These were plotted and the value of K at the plateau (maximum) of the Gaussian graph plotted indicated the most likely population structure. In order to quantify the degree of variation of the likelihood for each K, ten runs were performed for each K. In addition, ΔK was calculated based on the rate of change in the log probability of data between successive K values [27]. Moreover, runs were conducted with strains assigned to 5 pre-defined populations (Central Asia, the Middle East, North Africa, East Africa and West Africa), according to the actual geographical origin of the CL isolates.
Different microsatellite genetic distance measurements based on the infinite allele model (IAM) and on the stepwise mutation model (SMM) were used to construct phylogenetic trees. The proportion of shared alleles (Dps) [28] calculates multilocus pairwise distance measurements as 1 -(the total number of shared alleles at all loci/ n), where n is the number of loci compared. Chord distance [29], Nei's standard genetic distance [30] and delta mu squared (Dmu2) [31], were also calculated. The last is based on the average squared difference in allele size. Confidence intervals for all distance measurements were calculated by bootstrapping over loci, using the program MICROSAT [32]. Neighbor-joining (NJ) and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) consensus trees of both distances were then constructed in PAUP, version 4.0b8 [33] and PHYLIP version 3.6 [34].
The degree of genetic difference among populations was estimated by Wright's F-statistics [35], calculated according to Weir and Cockerham [36]. FSTAT software version 2.9.3.2 [37] was used to calculate Fis and Fst values, numbers of alleles per locus and genetic flow or migration rate, as Nm = 1-Fst/4Fst [38]. Average sample size (n), mean number of alleles per locus (A), observed (H o ) and expected (H e ) heterozygosity under Hardy-Weinberg equilibrium were calculated using Gene Data Analysis (GDA), version 1.0 (d16c) [39]. The calculation was based on the permutation method [40], which is useful for multiallelic microsatellite loci. Pair-wise linkage disequilibrium (D) was calculated using GENEPOP and FSTAT. The former applies Fisher's exact test [41] to the allele combinations between all pairs of loci in the whole population and in each population, while the latter uses loglikelihood ratio G-statistics [36] with Bonferroni corrections for multiple comparisons.
To compare the allelic abundance in different geographical regions, the MLMT profiles of L. (L.) major were grouped according to their main geographical origin: Africa, Central Asia and Middle East. Profiles shared by more than one strain were excluded from this analysis. To compare the different groups, all of them were reduced to the size of the smallest group, which was Central Asia with 12 MLMT profiles. The groups of larger size were re-sampled and the number of alleles in each group was estimated, using 100 replicates of the re-sampled data [15].