Chapter 13 Variant call

We used GATK as it has apparently similar performance to other variant callers (Supernat et al. 2018) and was more known by Myriam. For that we used the following pipeline:

  1. Variant calling Run the HaplotypeCaller on each sample’s BAM files to create single-sample gVCFs using the .g.vcf extension for the output file.
  2. Data aggregation Aggregate the GVCF files and feed in one GVCF with GenomicsDBImport to be genotyped
  3. Joint genotyping Run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.

13.2 Data aggregation

We aggregated the GVCF files and feed in one GVCF database with GenomicsDBImport to be genotyped. Beware, GATK 4.0.0.0 does not deal with multiple intervals when using GenomicsDBImport, so we used GATK 4.1.2.0. We divided the step on several intervals files of a maximum of 1000 sequences computed in parralel to speed up the operation. NB, we tested the pipeline with 3 individual haplotypes and 10 intervals of 100 sequences ran in parrallel; and it took 24 minutes. Consequently with 10 fold more sequences per intervals we may increase to 4H, and the effect of 10 fold more individual haplotypes is hard to assess.

Due to a memory overload on the cluster I boosted the sarray to 24G per node beside limiting gatk java session to 20G, still the overload is strange as if gatk was opening parrallel session of 20G java. We should not decrease batch size as batch of 50 individuals means that we will use 9 batches ! If memorry issues persist we may decrease the intervals length (currently 1000 sequences) and increase the number of jobs in sarray. We may even decrease intervals to 100 sequences resulting in more than 800 jobs and run theim by batch of 100 hundreds if they are really more efficient.

Running a first batch of 10 samples on 100 sequences took 35 minutes. Thus 432 samples should took ca 1 day and 40 minutes. This is the first run of DB among 8, so in total DB built should take8 days ! But on the other hand joint genotyping might be launch on each run as soon as they finish. So we might clean the first vcf and obtained a preview of population genetics structure with the first 100 sequences.

13.3 Joint genotyping

We joint-genotyped individuals with GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers. We divided the step on several intervals files of a maximum of 1000 sequences computed in parralel to speed up the operation (similarly to previous step). NB, we tested the pipeline with 3 individual haplotypes and 10 intervals of 100 sequences ran in parrallel; and it took 6 minutes. Consequently with 10 fold more sequences per intervals we may increase to 1H, and the effect of 10 fold more individual haplotypes is hard to assess. Then we merged genotypes of all intervals with GatherVcfs from Picard in one raw VCF to be filtered.

References

Supernat, A., Vidarsson, O.V., Steen, V.M. & Stokowy, T. (2018). Comparison of three variant callers for human whole genome sequencing. Scientific Reports, 8, 17851. Retrieved from http://www.nature.com/articles/s41598-018-36177-7