Uninformative polymorphisms bias genome scans for signatures of selection
© Roesti et al.; licensee BioMed Central Ltd. 2012
Received: 24 January 2012
Accepted: 22 June 2012
Published: 22 June 2012
Skip to main content
© Roesti et al.; licensee BioMed Central Ltd. 2012
Received: 24 January 2012
Accepted: 22 June 2012
Published: 22 June 2012
With the establishment of high-throughput sequencing technologies and new methods for rapid and extensive single nucleotide (SNP) discovery, marker-based genome scans in search of signatures of divergent selection between populations occupying ecologically distinct environments are becoming increasingly popular.
On the basis of genome-wide SNP marker data generated by RAD sequencing of lake and stream stickleback populations, we show that the outcome of such studies can be systematically biased if markers with a low minor allele frequency are included in the analysis. The reason is that these ‘uninformative’ polymorphisms lack the adequate potential to capture signatures of drift and hitchhiking, the focal processes in ecological genome scans. Bias associated with uninformative polymorphisms is not eliminated by just avoiding technical artifacts in the data (PCR and sequencing errors), as a high proportion of SNPs with a low minor allele frequency is a general biological feature of natural populations.
We suggest that uninformative markers should be excluded from genome scans based on empirical criteria derived from careful inspection of the data, and that these criteria should be reported explicitly. Together, this should increase the quality and comparability of genome scans, and hence promote our understanding of the processes driving genomic differentiation.
A major challenge in evolutionary biology is to understand how natural selection acts on molecular genetic variation [1–4]. One approach to studying the consequences of selection at the genomic level is the application of genome scans that screen a collection of polymorphic genetic marker loci for their extent of differentiation between multiple (typically two) populations occupying ecologically distinct environments. Loci or genomic regions displaying particularly high population differentiation (usually quantified by an FST estimator ) relative to some differentiation baseline (reflecting primarily neutral drift) are interpreted as either being directly under divergent selection, or exhibiting genetic hitchhiking along with a quantitative trait locus (QTL) under divergent selection [6–9]. Genome scans therefore have the potential to illuminate the link between ecological selection and molecular variation, and hence to contribute to our understanding of adaptive diversification. This is particularly true if information from genome scans is integrated with complementary lines of evidence such as QTL mapping .
Genome scans can be performed in different ways, depending on the genomic resources available for a focal research system. On the one hand, reference-free (anonymous) scans are carried out without information on the physical genomic position of a marker locus. Here the FST value for each locus is treated as an independent data point and is evaluated against a baseline distribution derived from the entire data set e.g., [11–14]. Loci exhibiting extreme FST values relative to the baseline (‘outlier loci’) are then interpreted as being directly or indirectly influenced by divergent selection. (Note that we here use divergent selection in a broad sense, including situations where an allele is selected in one environment but neutral in the other.) On the other hand, reference-based genome scans map loci physically to an available genome e.g., [15–18]. This offers a great advantage: loci occurring in the same genomic neighbourhood, and consequently exhibiting some physical linkage, will tend to display correlated FST values that can be integrated by taking a sliding window approach. This allows not only the identification of genomic regions displaying exceptionally high population differentiation, but also exploring the number and physical extent of such regions . Moreover, depending on the marker resolution, outlier regions may be screened for candidate genes potentially targeted by divergent selection.
Differentiation between two populations, as quantified by Weir and Cockerham’s FST estimator theta []
Genotypes population A
Genotypes population B
Second, a very low (or negative) FST value will also arise if the alleles at a marker locus exhibit an extremely skewed frequency distribution. That is, if a locus is nearly monomorphic in both populations but contains an alternative allele segregating at very low frequency such that this allele occurs only once or a few times in the entire data set (lower example in Table 1). Such a locus is constrained to display a very low FST value between the populations . However, inferring the absence of population differentiation from this FST value is problematic. The reason is that such rare alleles primarily represent relatively recent mutations, most of which will experience rapid stochastic loss . Markers with a very low minor allele frequency therefore lack the adequate sensitivity to capture the historical signatures of drift and hitchhiking, the key processes in genome scans.
Of course, in addition to the situation where a natural allele segregates at very low frequency within populations, a highly skewed allele frequency distribution at a locus can also arise artificially during marker data acquisition. For instance due to PCR replication or sequencing error. The locus then produces a minimal FST value although correctly no FST value would be calculated because the locus is not polymorphic. However, many strategies exist to avoid such technical errors (including achieving high sequencing coverage, or re-sequencing; see also  and references therein). Our paper is therefore primarily concerned with biological polymorphisms.
To summarize, there are two fundamentally different causes for minimal FST values in genome scan data sets: polymorphisms with relatively even allele frequency distribution, but without population differentiation, versus polymorphisms with extremely skewed allele frequency distribution unable to pick up population differentiation. Hereafter, we refer to these forms of polymorphisms as ‘informative’ versus ‘uninformative’. We emphasize, however, that we restrict this crude classification to genome scans searching for signatures of selection in the form of elevated differentiation. Markers with highly skewed allele frequency distributions might well be informative in other analytical contexts, such as the estimation of mutational or demographic parameters based on allele frequency spectra [24, 25].
If uninformative polymorphisms are abundant in a marker data set used for a genome scan (and they generally will, see below), we can predict a number of undesirable consequences: in both reference-free and reference-based approaches, the estimated overall baseline differentiation, which is considered to reflect the effect of drift, will be biased downward. As a consequence of this bias in the baseline, the number of loci considered outliers driven by divergent selection in reference-free genome scans may be inflated. By contrast, in sliding window type of scans, the magnitude of among-population differentiation in genomic regions influenced by selection will be weakened, or in the worst case erased. Both effects can lead to incorrect conclusions about the genomic consequences of divergent selection. We emphasize that these problems will arise irrespective of the specific estimator used to quantify population differentiation, or the method chosen for outlier detection. That is, uninformative marker loci will also influence sophisticated methods that estimate FST for a locus by taking into account genome-wide differentiation and locus-specific sample size , or approaches based on P-values from locus-specific significance tests (e.g., ).
It would thus seem straightforward to eliminate uninformative marker loci from polymorphism data sets prior to performing a genome scan, as reflected in Beaumont and Nichols’  recommendation to preferably use loci with high heterozygosity for such analyses. However, a screen of 24 recent genome scan papers based on single nucleotide polymorphisms (SNPs), including most such studies currently available, suggests that the above issue is not generally recognized. (Note that our paper focuses on SNPs because this marker type is becoming standard in population genomics; but the conclusions hold for any type of marker.) Only three studies report marker filtering according to some minor allele frequency threshold ([18, 26, 27]; the latter study excluded singleton loci only, i.e., markers with the minor allele occurring only a single time). It is therefore possible that patterns reported and conclusions drawn in many genome scan studies are unreliable to some extent. Given that genome scans are becoming increasingly easy to perform owing to the advent of high-throughput sequencing technology , new techniques for extensive SNP discovery (in particular restriction site associated DNA (RAD) sequencing ), and automated data analysis pipelines, the problem of bias arising from uninformative marker loci deserves wide recognition. A first goal of our study is therefore to use extensive SNP data from lake and stream population of stickleback fish to demonstrate that uninformative marker loci indeed have the potential to bias results from genome scans. Our second goal is to show that such bias can be avoided through careful inspection of the data set and subsequent exclusion of uninformative marker loci based on empirical criteria.
Our study uses SNP data from threespine stickleback (Gasterosteus aculeatus) populations occurring in lake and stream habitats within two independently colonized drainages. The first is the Lake Constance drainage in Switzerland (the ‘COW’ lake-stream population pair from ), hereafter called the ‘Constance system’. The divergence between the lake and stream population in this system appears to be recent (a few hundred years). The second is the Boot Lake drainage on Vancouver Island, Canada (the Boot sites ‘L’ and ‘S2’ in ), hereafter called the ‘Boot system’. Lake-stream divergence in this system is more ancient (thousands of years). Lake and stream stickleback are known to experience divergent selection [31, 32], and the specific population pairs were chosen because they differ in the magnitude of habitat-related phenotypic and neutral genetic (microsatellite) divergence (stronger divergence in the older Boot system than in the younger Constance system). For further details on the locations and populations see [30, 31]. All samples were taken with permission from the British Columbia Ministry of Environment (permit number NA06-20791), and the fisheries authority of the canton Thurgau.
For SNP detection, we Illumina-sequenced RAD  derived from 27 stickleback specimens from each of the four sites (i.e., one lake and one stream site in two drainages; total N = 108). Library preparation essentially followed the method described in . In short, DNA was digested by using the Sbf1 restriction enzyme and barcode-ligated for each individual separately. Amplified barcoded DNA was then single-end sequenced on an Illumina genome analyzer IIx with 76 cycles in libraries of 18 pooled individuals each. The Illumina short reads (sequenced RAD sites; deposited at the NCBI Short Read Archive, accession number SRP007695) were parsed by individual barcode, and for each individual separately aligned to the stickleback genome (Ensembl database version 63.1, assembly Broad S1) using Novoalign v2.07.06 (http://novocraft.com). Alignment to a unique genome position was enforced, effectively eliminating sequences derived from repeated elements. The average sequence coverage per individual and RAD site was 27 and 31 for the lake and stream sample in the Constance system, and 30 and 11 for the Boot system. Alignments were converted to BAM format using Samtools v0.1.11 . For each individual and RAD site, we then determined the consensus diploid genotype if ten or more replicate reads were available, or a haploid consensus genotype if replication was below ten. This threshold was chosen because for polymorphic nucleotide positions, we identified heterozygote diploids based on a binomial test with insufficient power at low replication. This test involved calculating the binomial likelihood of the observed frequency distribution of the SNP alleles under the null hypothesis of heterozygosity (i.e., assuming a probability of 0.5 for both alleles). Positions were considered heterozygous if the likelihood was greater than 0.01. Consensus genotyping was quality-aware in that bases with a greater than 0.01 calling error probability were excluded from the binomial test.
To find SNP markers and calculate genome-wide lake-stream population differentiation within each of the two systems, we pooled individual consensus genotypes from the lake and stream sample for each RAD site. If at least 27 genotypes were available from each of the two habitats, we proceeded with FST calculation. In other words, a RAD site was considered only if each individual contributed at least one haploid consensus genotype on average to the site’s genotype pool. For FST calculation, the genotype pool for each RAD site was screened base by base for polymorphisms. If a variable position occurred, we calculated FST based on haplotype diversity (equation 7 in ). For RAD sites exhibiting multiple SNPs, we retained only the highest FST value observed across all variable base positions. (Using the average FST value across all positions, or selecting a single SNP at random, produced very similar results supporting identical conclusions.) Negative FST values were rounded to zero, as commonly done.
The above FST calculation considered any type of SNPs. To explore the effect of informative versus uninformative markers, we repeated the above FST calculation protocol by imposing the restriction that the minor (less frequent) allele had to occur at least n times in the lake-stream genotype pool, where n spanned the range from two to ten in increments of one. (The above default FST calculation represents the case with n = 1.) For each calculation series, we then computed the number of resulting SNPs, and the mean FST value across all SNPs. We also visualized genomic differentiation by a sliding window approach using local polynomial fitting (LOESS) implemented in R (R Development Core Team ; 2nd order polynomial with band width of 0.4; using simpler polynomials and different band widths did not alter our conclusions). All post-sequencing analysis except for alignment and file conversion was coded in the R language, making use of the Bioconductor packages ShortRead , Biostrings, and Rsamtools.
Note that in Figure 3, we define informative marker loci as those with the minor allele occurring at least four times (n = 4), resulting in an average inter-locus distance of 53 kb and 63 kb for the Constance and Boot system. This minor allele threshold eliminated bias associated with uninformative marker loci relatively effectively; choosing higher thresholds had a relatively minor influence on the sliding window profiles.
Our empirical analysis demonstrates that abundant uninformative polymorphisms in a genome scan data set can bias the estimated baseline differentiation, and hence affect conclusions about the genomic signatures of selection.
In our stickleback data set, uninformative polymorphisms (essentially in the form of singleton loci) were very abundant. Illumina sequencing type one errors (i.e., wrong base calls despite high indicated base call quality) in RAD sequences poorly replicated at the individual level might contribute to this pattern [23, 39]. To examine this possibility, we inspected 50 randomly chosen SNPs exhibiting zero FST from the full data set accepting any type of polymorphisms (i.e., minor allele count threshold n = 1) for each lake-stream system. As expected, a high proportion of these markers were singleton loci (Constance: 28 [56%]; Boot: 35 [70%]). For the Boot system with lower replication per locus in the stream sample (see above), 15 of the 35 singleton loci represented unreplicated RAD sequences. For these loci, the minor allele is likely a sequencing error.
However, all but one of the singleton alleles in the Constance system represented consensus genotypes integrating multiple (2–26, mean: 9.1) replicate RAD sequences. Hence, the bulk of the uninformative marker loci in our data clearly cannot be attributed to sequencing error, because the probability of multiple identical errors at a specific nucleotide position at a given RAD site is practically zero. The abundance of rare SNP alleles therefore represents a real biological feature of the studied stickleback populations (acknowledging a small potential contribution from PCR artefacts). This is not unexpected: theory consistently predicts a skew toward polymorphisms with low minor allele frequency, and hence a high proportion of singleton polymorphisms, under a broad range of demographic and selective conditions [24, 36, 40–44]. Bias associated with uninformative polymorphisms is therefore of general importance to genome scan studies, and not specific to our empirical system. Our analysis also raises a caveat regarding marker densities; the effective number of markers providing relevant information in genome scans might often be dramatically lower than the number reported.
In the present study, excluding singleton polymorphisms had the greatest influence on the results. Reliable quantification of differentiation patterns, however, might require substantially more stringent minor allele frequency thresholds. (Note that such marker filtering also effectively eliminates any sequencing and PCR error from the data.) Bradbury et al., for instance, excluded SNPs exhibiting an overall minor allele frequency of 0.25 or less, and a similar threshold was adopted in a recent lake-stream stickleback study carried out in our lab . To obtain a guideline for marker filtering, the latter RAD-based study evaluated the strength of the correlation in FST values between ‘sister’ RAD sites (i.e., DNA sequences flanking the same restriction site in the genome) in relation to increasingly stringent minor allele frequency thresholds (see Appendix S2 in the Supporting information to ). The rationale was that if an FST value provided by a given marker reliably quantifies the consequences of drift and selection in a genomic region, then another extremely tightly linked marker should yield a similar FST value. This approach, however, requires tightly physically linked markers data and substantial population differentiation (otherwise the correlation in FST between linked will remain poor even with stringent marker filtering).
Given the rapidly increasing feasibility and popularity of genome scans for signatures of selection, researchers should be aware that uninformative polymorphisms need to be excluded from data sets. This is not achieved by just avoiding technical errors, as a high prevalence of nearly monomorphic loci is a general biological feature of samples from natural populations. We suggest that a reasonable strategy to define and eliminate uninformative polymorphisms should be chosen by inspecting the allele frequency distribution of the polymorphisms, and by assessing the influence of different marker filtering thresholds on the genomic patterns of interest, or appropriate statistics (such as the correlation of FST between sister RAD sites). Also, the approach taken to eliminate uninformative polymorphisms should be reported explicitly. Together, this should increase the quality and comparability of genome scans, and hence promote our understanding of the processes shaping genomic differentiation.
The authors declare no competing interest.
Field work was aided by Anne-Catherine Grandchamp and, for the Boot system, supported financially by Andrew Hendry. Roman Kistler (fisheries authority of the canton Thurgau) permitted sampling of the Constance specimens. Paul Etter and Bill Cresko kindly shared their experience and protocol for RAD library preparation. Brigitte Aeschbach and Nicolas Boileau facilitated wet lab work, Ina Nissen performed Illumina sequencing at the Quantitative Genomics Facility, D-BSSE, ETH Zürich, and Lukas Zimmermann provided IT support. The bioinformatics pipeline benefited from modifications to Novoalign by Colin Hercus, and from coding suggestions by Martin Morgan and Hervé Pagès. Matthieu Foll, Markus Pfenninger, and three anonymous reviewers provided valuable suggestions that improved the paper. WS was supported financially by the European Research Council (Starting Grant ‘INTERGENADAPT’), the Swiss National Science Foundation, and the University of Basel. DB was supported by the Swiss National Science Foundation (Ambizione grant PZ00P3_126391 / 1) and the Research Found of the University of Basel. We kindly thank all these people and institutions.