Chapter 14 Variant filtering

We filtered the previously produced raw VCF in several steps:

  • Gathering the raw VCF files, resulting in 26 813 513 variants over 432 individuals
  • Filtering the raw VCF for biallelic variants, resulting in 19 242 294 variants over 432 individuals
  • Filtering the biallelic VCF for SNPs, resulting in 17 521 879 variants over 432 individuals
  • Applying quality filters to the biallelic SNP VCF, resulting in 15 531 866 variants over 432 individuals
  • Filtering the filtered biallelic SNP VCF for missing data, resulting in 454 262 variants over 406 individuals
  • Filtering the non-missing, filtered biallelic SNP VCF for Paracou individuals, resulting in 454 262 variants over 385 individuals
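
The share of variants kept at each step can be checked with a few lines of arithmetic. The sketch below uses only the counts listed above; the short step labels are shorthand chosen for illustration and do not correspond to file names in the pipeline.

```python
# Variant counts after each filtering step, as reported in the list above.
# The short step labels are illustrative shorthand, not file names.
steps = [
    ("raw", 26_813_513),
    ("biallelic", 19_242_294),
    ("biallelic SNP", 17_521_879),
    ("quality filtered", 15_531_866),
    ("missing-data filtered", 454_262),
]


def retention(steps):
    """Return (label, count, fraction kept relative to the previous step)."""
    rows = []
    previous = None
    for label, count in steps:
        fraction = 1.0 if previous is None else count / previous
        rows.append((label, count, fraction))
        previous = count
    return rows


for label, count, fraction in retention(steps):
    print(f"{label:<24}{count:>12,d}{fraction:>9.1%}")
```

The missing-data step is by far the most stringent: it keeps roughly 3% of the quality-filtered SNPs, whereas each earlier step keeps more than 70% of its input.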

14.4 Filters

We filtered the biallelic SNP VCF with the following filters (name, filter, description), resulting in 15 531 866 filtered biallelic SNPs. We used the histograms below to set and test parameter values:

  • Quality (QUAL) QUAL < 30: Phred-scaled confidence that the site is truly variant rather than homozygous reference across all samples; we filter out variants with a low quality score (Figure 14.1)
  • Quality by depth (QD) QD < 2: filter out variants with low variant confidence relative to read depth (Figure 14.1)
  • Fisher strand bias (FS) FS > 60: filter out variants showing strand bias, based on the Phred-scaled p-value of Fisher’s exact test (Figure 14.1)
  • Strand odds ratio (SOR) SOR > 3: filter out variants showing strand bias, based on the symmetric odds ratio test (Figure 14.1)

Figure 14.1: Quality, quality by depth, Fisher strand and strand odds ratios per biallelic SNP.
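
As a minimal sketch of how the four thresholds combine, the function below flags a site as soon as any single annotation fails its test. It is not the pipeline's actual code: the function name and the dictionary layout are hypothetical, the annotation values are invented, and it assumes that a high SOR (SOR > 3) is what indicates strand bias, as in GATK's hard-filtering recommendations.

```python
# Thresholds taken from the filter list above; a site is removed if ANY test fails.
# Assumption: SOR is filtered when it is high (SOR > 3), since larger values
# indicate stronger strand bias.
FAIL_TESTS = {
    "QUAL": lambda v: v < 30.0,
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "SOR": lambda v: v > 3.0,
}


def failed_filters(site):
    """Return the names of the filters a site fails (empty list = site kept).

    `site` is a hypothetical dict of annotation name -> float; in this sketch,
    missing annotations are skipped rather than treated as failures.
    """
    return [name for name, fails in FAIL_TESTS.items()
            if name in site and fails(site[name])]


# Invented annotation values, for illustration only.
good = {"QUAL": 812.4, "QD": 17.3, "FS": 1.2, "SOR": 0.8}
biased = {"QUAL": 95.0, "QD": 4.1, "FS": 72.5, "SOR": 4.6}
print(failed_filters(good))    # []
print(failed_filters(biased))  # ['FS', 'SOR']
```

Returning the list of failed filters, rather than a single boolean, makes it easy to count how many sites each threshold removes when testing parameter values.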

14.5 Missing data

Missing data filtering is trickier because missing data per SNP and per individual are related: which individuals are removed changes the per-SNP missing rate, and therefore the number of SNPs retained. Ideally, we would keep all individuals, but this would result in a large loss of SNPs because of the least represented individuals. So we need to choose a missing data threshold for individuals (--mind) and for SNPs (--geno).
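
The coupling between the two thresholds can be illustrated on a toy genotype matrix. The sketch below is pure Python, not PLINK: it only mimics the meaning of --mind and --geno (maximum missing fraction per individual and per SNP), it assumes individuals are filtered first and SNPs second, and the genotype data are made up.

```python
# Toy genotype matrix: rows = individuals, columns = SNPs, None = missing call.
# Data are invented for illustration only.
genotypes = {
    "ind1": [0, 1, 2, 0],
    "ind2": [0, None, 1, 0],
    "ind3": [1, 1, None, 2],
    "ind4": [None, None, None, 1],  # poorly sequenced individual
}


def missing_rate(calls):
    """Fraction of missing calls in a list of genotypes."""
    return sum(c is None for c in calls) / len(calls)


def filter_missing(genotypes, mind, geno):
    """Drop individuals with missing rate > mind, then SNPs with missing rate > geno.

    Assumption: individuals are filtered before SNPs, so SNP missingness is
    computed on the retained individuals only.
    """
    kept_inds = {ind: calls for ind, calls in genotypes.items()
                 if missing_rate(calls) <= mind}
    if not kept_inds:
        return [], []
    n_snps = len(next(iter(kept_inds.values())))
    kept_snps = [j for j in range(n_snps)
                 if missing_rate([calls[j] for calls in kept_inds.values()]) <= geno]
    return sorted(kept_inds), kept_snps


# Keeping every individual costs two SNPs ...
print(filter_missing(genotypes, mind=1.0, geno=0.4))
# (['ind1', 'ind2', 'ind3', 'ind4'], [0, 3])

# ... while dropping the worst individual rescues them.
print(filter_missing(genotypes, mind=0.5, geno=0.4))
# (['ind1', 'ind2', 'ind3'], [0, 1, 2, 3])
```

With the same per-SNP threshold, the number of SNPs retained depends on which individuals are kept, which is why the two thresholds have to be chosen together.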

Figure 14.2: Missing data statistics for filtered biallelic SNP before missing data filtering per individual.

Figure 14.3: Missing data statistics for filtered biallelic SNP before missing data filtering per SNP.

14.5.1 Normal filter

With a maximum of 95% missing data per individual (--mind 0.95) and a maximum of 15% missing data per SNP (--geno 0.15), we obtained 454 262 filtered biallelic SNPs for 406 individuals.

Figure 14.4: Missing data statistics for filtered biallelic SNP after missing data filtering (95% for individuals and 15% for SNPs) per individual.

Figure 14.5: Missing data statistics for filtered biallelic SNP after missing data filtering (95% for individuals and 15% for SNPs) per SNP.

Figure 14.6: Heterozygosity statistics for filtered biallelic SNP after missing data filtering (95% for individuals and 15% for SNPs) per SNP.

14.5.2 Hard filter

With a maximum of 95% missing data per individual (--mind 0.95) and a maximum of 5% missing data per SNP (--geno 0.05), we obtained 180 217 filtered biallelic SNPs for 406 individuals.

Figure 14.7: Missing data statistics for filtered biallelic SNP after missing data filtering (95% for individuals and 5% for SNPs) per individual.

Figure 14.8: Missing data statistics for filtered biallelic SNP after missing data filtering (95% for individuals and 5% for SNPs) per SNP.

Figure 14.9: Heterozygosity statistics for filtered biallelic SNP after missing data filtering (95% for individuals and 5% for SNPs) per SNP.