Chapter 13 Variant calling
We used GATK, as it reportedly performs similarly to other variant callers (Supernat et al. 2018) and was more familiar to Myriam. We used the following pipeline:
- Variant calling: run HaplotypeCaller on each sample's BAM file to create single-sample gVCFs, using the .g.vcf extension for the output files.
- Data aggregation: aggregate the gVCF files into one GVCF database with GenomicsDBImport to be genotyped.
- Joint genotyping: run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
13.1 Variant calling
Run HaplotypeCaller on each sample's BAM file to create a single-sample gVCF, using the .g.vcf extension for the output file. We used sarray, which is more convenient than sbatch for submitting many jobs in parallel.
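The output name in the commands below is derived from each BAM path with shell parameter expansion; for example (hypothetical path):

```shell
# ${file%.*} drops the .bam extension, basename drops the directory
file=mapping/bam5/sample1.bam
echo "$(basename "${file%.*}").g.vcf.gz"   # sample1.g.vcf.gz
```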
# folders
mkdir variantCalling/gvcf5
# test
file=$(ls mapping/bam5/*.bam | head -n 1)
srun --mem=20G --pty bash
module load bioinfo/gatk-4.1.2.0 ; gatk --java-options "-Xmx20G" HaplotypeCaller -R reference/reference.fasta -I $file -O variantCalling/gvcf5/$(basename "${file%.*}").g.vcf.gz -ERC GVCF
exit
rm variantCalling/gvcf5/*
# sarray
for file in $(ls mapping/bam5/*.bam); do echo "module load bioinfo/gatk-4.1.2.0 ; gatk --java-options \"-Xmx20G\" HaplotypeCaller -R reference/reference.fasta -I $file -O variantCalling/gvcf5/$(basename "${file%.*}").g.vcf.gz -ERC GVCF"; done > haplo5.sh
mkdir haplo5
sarray -J haplo5 -o haplo5/%j.out -e haplo5/%j.err -t 48:00:00 --mem=20G --mail-type=BEGIN,END,FAIL haplo5.sh
# clean
rm -r haplo5
rm -r tmp
13.2 Data aggregation
We aggregated the gVCF files into one GVCF database with GenomicsDBImport to be genotyped. Beware: GATK 4.0.0.0 does not handle multiple intervals with GenomicsDBImport, so we used GATK 4.1.2.0. We divided the step over several interval files of at most 1,000 sequences each, computed in parallel to speed up the operation. NB: we tested the pipeline with 3 individual haplotypes and 10 intervals of 100 sequences run in parallel, and it took 24 minutes. Consequently, with 10-fold more sequences per interval the runtime may increase to ca. 4 hours, while the effect of 10-fold more individual haplotypes is hard to assess.
Due to a memory overload on the cluster, I raised the sarray allocation to 24G per node while limiting the gatk Java session to 20G; the overload remains strange, as if gatk were opening parallel 20G Java sessions. We should not decrease the batch size, as a batch of 50 individuals already implies 9 batches! If memory issues persist, we may decrease the interval length (currently 1,000 sequences) and increase the number of jobs in sarray. We may even decrease intervals to 100 sequences, resulting in more than 800 jobs, and run them by batches of 100 if that proves more efficient.
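The interval length directly sets the number of sarray jobs; a back-of-the-envelope count, assuming ~84,000 sequences in the reference index (an illustrative figure, the real count is the number of lines in reference.fasta.fai):

```shell
# Ceiling division: number of interval files for a given chunk size
# (n_seq is an assumed figure, for illustration only)
n_seq=84000
for chunk in 1000 100; do
  echo "$chunk sequences/interval -> $(( (n_seq + chunk - 1) / chunk )) jobs"
done
```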
Running a first batch of 10 samples on 100 sequences took 35 minutes; thus 432 samples should take ca. 1 day. This is the first of 8 database runs, so in total the database build should take ca. 8 days! On the other hand, joint genotyping can be launched on each run as soon as it finishes, so we may filter the first VCF and obtain a preview of the population genetic structure from the first 100 sequences.
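Assuming runtime scales linearly with sample count (a rough approximation only), the extrapolation behind this estimate is:

```shell
# Extrapolate the observed 35 min for 10 samples to 432 samples
minutes=$(( 432 * 35 / 10 ))
echo "$minutes minutes (~$(( minutes / 60 )) hours)"   # 1512 minutes, ~25 hours
```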
# Sample map
touch sample_map.txt
for file in $(ls gvcf*/*.g.vcf.gz)
do
echo -e "$(basename "$file" .g.vcf.gz)\t$file" >> sample_map.txt
done
# seq lists
mkdir reference.sequences.lists
cut -f1 ../reference/reference.fasta.fai > reference.sequences.lists/reference.sequences.list
cd reference.sequences.lists
split -l 100 -d reference.sequences.list reference.sequences_ --additional-suffix=.list
rm reference.sequences.list
ls | wc -l
cd ..
# folders
mkdir tmp
mkdir symcaptureDB
# test
srun --mem=24G --pty bash
file=$(ls reference.sequences.lists/ | head -n 1)
module load bioinfo/gatk-4.1.2.0 ; gatk --java-options "-Xmx20g -Xms20g" GenomicsDBImport --genomicsdb-workspace-path symcaptureDB/"${file%.*}".DB -L reference.sequences.lists/$file --sample-name-map sample_map.txt --batch-size 50 --tmp-dir=tmp
exit
rm -r symcaptureDB/*
rm tmp/*
# sarray
for file in $(ls reference.sequences.lists/); do echo "module load bioinfo/gatk-4.1.2.0 ; gatk --java-options \"-Xmx20g -Xms20g\" GenomicsDBImport --genomicsdb-workspace-path symcaptureDB/\"${file%.*}\".DB -L reference.sequences.lists/$file --sample-name-map sample_map.txt --batch-size 10 --consolidate"; done > combine.sh
split -l 207 -d combine.sh combine_ --additional-suffix=.sh
rm combine.sh
mkdir combine
sarray -J combine -o combine/%j.out -e combine/%j.err -t 48:00:00 --mem=40G --mail-type=BEGIN,END,FAIL combine_00.sh
rm combine_00.sh
# clean
rm -r combine
rm -r tmp
13.3 Joint genotyping
We joint-genotyped individuals with GenotypeGVCFs, run on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers. We divided the step over several interval files of at most 1,000 sequences each, computed in parallel to speed up the operation (similarly to the previous step). NB: we tested the pipeline with 3 individual haplotypes and 10 intervals of 100 sequences run in parallel, and it took 6 minutes. Consequently, with 10-fold more sequences per interval the runtime may increase to ca. 1 hour, while the effect of 10-fold more individual haplotypes is hard to assess. We then merged the genotypes of all intervals with GatherVcfs from Picard into one raw VCF to be filtered.
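GatherVcfs expects its inputs in genomic order; because the interval files were split from the reference index in order and split -d zero-pads its suffixes, a plain ls already lists the per-interval VCFs correctly. A quick sanity check with dummy names (temporary directory, illustrative suffixes):

```shell
# Dummy VCF names with the zero-padded suffixes produced by `split -d`
# (illustrative; the real files sit in symcapture.vcf.gz/)
dir=$(mktemp -d)
touch "$dir"/reference.sequences_00.vcf.gz \
      "$dir"/reference.sequences_01.vcf.gz \
      "$dir"/reference.sequences_10.vcf.gz
ls "$dir"   # lexical order matches interval order thanks to zero padding
rm -r "$dir"
```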
# folders
mkdir tmp
mkdir symcapture.vcf.gz
# test
file=$(ls reference.sequences.lists/ | head -n 1)
srun --mem=20G --pty bash
module load bioinfo/gatk-4.1.2.0 ; gatk --java-options "-Xmx20g" GenotypeGVCFs -R ../reference/reference.fasta -L reference.sequences.lists/$file -V gendb://symcaptureDB/"${file%.*}".DB -O symcapture.vcf.gz/"${file%.*}".vcf.gz
exit
rm tmp/*
rm symcapture.vcf.gz/*
# sarray
for file in $(ls reference.sequences.lists); do echo "module load bioinfo/gatk-4.1.2.0 ; gatk --java-options \"-Xmx20g\" GenotypeGVCFs -R ../reference/reference.fasta -L reference.sequences.lists/$file -V gendb://symcaptureDB/${file%.*}.DB -O symcapture.vcf.gz/${file%.*}.vcf.gz"; done > genotype.sh
mkdir genotype.array
sarray -J genotype -o genotype.array/%j.out -e genotype.array/%j.err -t 48:00:00 --mem=20G --mail-type=BEGIN,END,FAIL genotype.sh
# clean
rm -r genotype.array
rm -r tmp
# merge
echo -e '#!/bin/bash\n#SBATCH --time=48:00:00\n#SBATCH -J gather\n#SBATCH -o gather.out\n#SBATCH -e gather.err\n#SBATCH --mem=20G\n#SBATCH --cpus-per-task=1\n#SBATCH --mail-type=BEGIN,END,FAIL\nmodule load bioinfo/picard-2.14.1\njava -Xmx20g -jar $PICARD GatherVcfs \' > gather.sh
for file in $(ls symcapture.vcf.gz/*.gz)
do
echo -e '\tI='$file' \' >> gather.sh
done
echo -e '\tO=symcapture.all.raw.vcf.gz\n' >> gather.sh
References
Supernat, A., Vidarsson, O.V., Steen, V.M. & Stokowy, T. (2018). Comparison of three variant callers for human whole genome sequencing. Scientific Reports, 8, 17851. Retrieved from http://www.nature.com/articles/s41598-018-36177-7