Variant validation procedure | Rationale |
---|---|
1  Remove variants that do not match one of the expected allele sizes (212, 215 or 221 bp) | Variants with deletions/substitutions that shift the reading frame probably result from sequencing errors (Assumption 1) |
2  Remove variants that have fewer than four copies in the whole dataset | Variants represented once in an individual probably result from sequencing errors (Assumption 4), and variants represented in only one individual probably result from PCR errors (Assumption 5) |
3  Remove individuals with fewer than 200 reads | A low number of reads per individual might lead to incomplete genotyping, making the results unreliable (Assumption 6). The minimum number of reads required per individual is estimated using the probability distribution plotted by Galan et al. [28] |
4  Remove variants that have an MPAF lower than 0.01 | Variants that are rare in the whole dataset probably result from sequencing errors (Assumption 2) |
   Remove variants that have an MPAF between 0.01 and 0.025 if they can be explained as a chimera or a single base-pair mutation | Variants that are rare in the whole dataset but more frequent on a per-individual basis probably result from PCR errors if the parental sequences are also present (Assumption 3) |
5  Remove variants that have a single copy per individual | Variants represented once in an individual probably result from sequencing errors (Assumption 4) |
   Remove variants that have fewer than five copies per individual if they can be explained as a chimera or a single base-pair mutation | Variants represented two, three or four times within an individual probably result from PCR errors if the parental sequences are present (Assumption 3). The threshold for PCR errors is estimated from the distribution of artefacts in the previous step |
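
The five steps in the table can be sketched as a single filtering function. This is a minimal illustration under stated assumptions, not the authors' implementation. The input layout (a dictionary mapping each individual to its variant read counts) and the names `validate`, `is_artefact`, `is_chimera` and `one_mismatch` are hypothetical. MPAF is computed here as the highest frequency a variant reaches in any single individual. The parental sequences are assumed to be variants with MPAF of at least 0.025 in step 4 and variants with at least five copies in the same individual in step 5. Read depth in step 3 is counted after steps 1 and 2, and chimeras are only detected between sequences of equal length.

```python
from collections import Counter

ALLELE_LENGTHS = {212, 215, 221}  # expected allele sizes in bp (step 1)

def one_mismatch(a, b):
    """True if a and b have equal length and differ at exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def is_chimera(seq, parents):
    """True if seq is the start of one parent joined to the end of another.
    Simplification: only parents of the same length as seq are considered."""
    same_len = [p for p in parents if len(p) == len(seq)]
    for a in same_len:
        for b in same_len:
            if a != b and any(seq == a[:k] + b[k:] for k in range(1, len(seq))):
                return True
    return False

def is_artefact(seq, parents):
    """Hypothetical helper: chimera or single base-pair mutation of a parent."""
    parents = [p for p in parents if p != seq]
    return any(one_mismatch(seq, p) for p in parents) or is_chimera(seq, parents)

def keep_variants(reads, keep):
    """Return a copy of reads restricted to variants for which keep(v) is true."""
    return {ind: {v: n for v, n in vs.items() if keep(v)} for ind, vs in reads.items()}

def validate(reads, lengths=ALLELE_LENGTHS):
    """reads: {individual: {variant_sequence: read_count}} (assumed layout)."""
    # Step 1: drop variants that do not match an expected allele size.
    reads = keep_variants(reads, lambda v: len(v) in lengths)

    # Step 2: drop variants with fewer than four copies in the whole dataset.
    totals = Counter()
    for vs in reads.values():
        totals.update(vs)
    reads = keep_variants(reads, lambda v: totals[v] >= 4)

    # Step 3: drop individuals with fewer than 200 (remaining) reads.
    reads = {ind: vs for ind, vs in reads.items() if sum(vs.values()) >= 200}

    # Step 4: MPAF thresholds (assumed: maximum per-individual frequency).
    mpaf = {}
    for vs in reads.values():
        depth = sum(vs.values())
        for v, n in vs.items():
            mpaf[v] = max(mpaf.get(v, 0.0), n / depth)
    confident = [v for v, f in mpaf.items() if f >= 0.025]
    kept = {v for v, f in mpaf.items()
            if f >= 0.025 or (f >= 0.01 and not is_artefact(v, confident))}
    reads = keep_variants(reads, lambda v: v in kept)

    # Step 5: per-individual copy-number thresholds.
    result = {}
    for ind, vs in reads.items():
        parents = [v for v, n in vs.items() if n >= 5]
        result[ind] = {v: n for v, n in vs.items()
                       if n >= 5 or (n > 1 and not is_artefact(v, parents))}
    return result
```

The `lengths` parameter defaults to the allele sizes given in step 1 and can be overridden, for example to try the function on short toy sequences. The thresholds themselves are taken from the table; everything about how parental sequences are chosen is an assumption of this sketch.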