Chapter 4 Neutral region selection

The Guianan sequences from Torroba-Balmor et al (unpublished)

4.1 Raw reads from Torroba-Balmori et al. (unpublished) alignment on scaffolds from Scotti et al. (in prep)

We will use the French Guianan raw reads from Torroba-Balmori et al. (unpublished), by aligning them on scaffolds from Olsson et al. (2017) with bwa.

#!/bin/bash
#SBATCH --time=36:00:00
#SBATCH -J alignIvan
#SBATCH -o alignIvan_output.out
#SBATCH -e alignIvan_error.out
#SBATCH --mem=20G
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=BEGIN,END,FAIL

# Environment
module purge
module load bioinfo/bwa-0.7.15
module load bioinfo/picard-2.14.1
module load bioinfo/samtools-1.4
module load bioinfo/bedtools-2.26.0

# read preparation
cd ~/work/Symphonia_Torroba/
tar -xvzf Gbs.tar.gz
cd raw
rm PR_49.fastq RG_1.fastq
for file in ./*.fastq
do
  echo $file
  filename=$(basename "$file")
  filename="${filename%.*}"
  perl -pe 's|[\h]||g' $file > "${filename}".renamed.fastq
  rm $file
done
# variables
cd ~/work/Symphonia_Genomes/Ivan_2018/torroba_alignment
reference=~/work/Symphonia_Genomes/Ivan_2018/merged_1000/merged_1000.fa
query_path=~/work/Symphonia_Torroba/raw
# alignment
bwa index $reference
mkdir bwa
for file in $query_path/*.fastq
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  rg="@RG\tID:${filename}\tSM:${filename}\tPL:IONTORRENT"
  bwa mem -M -R "${rg}" $reference $file > bwa/"${filename}.sam"    
#  rm $file
done
# sam2bam
for file in ./bwa/*.sam
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  java -Xmx4g -jar $PICARD SortSam I=$file O=bwa/"${filename}".bam SORT_ORDER=coordinate
done
# Bam index
for file in bwa/*.bam
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  java -Xmx4g -jar $PICARD BuildBamIndex I=$file O=bwa/"${filename}".bai
done
# sam2bed
mkdir bed
for file in ./bwa/*.bam
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  bedtools bamtobed -i bwa/"${filename}".bam > bed/"${filename}".bed
done
# merge bed
mkdir merged_bed
for file in ./bed/*.bed
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  bedtools merge -i bed/"${filename}".bed > merged_bed/"${filename}".bed
done
cat bed/* | sort -k 1,1 -k2,2n >  all.nonunique.bed
bedtools merge -i all.nonunique.bed -c 1 -o count > all.merged.bed
Table 4.1: alignment coverage summary
N Width (Mbp) Coverage (%)
aligned sequence 1 786 852 130.3105 29.87923
selected scaffold 179 665 421.7514 96.70448
total 190 098 436.1239 100.00000

4.3 Raw reads from Torroba-Balmori et al. (unpublished) alignment on scaffolds from Olsson et al. (2017)

We will again use the French Guianan raw reads from Torroba-Balmori et al. (unpublished), by aligning them on scaffolds from Olsson et al. (2017) with bwa.

#!/bin/bash
#SBATCH --time=36:00:00
#SBATCH -J alignOlsson
#SBATCH -o alignOlsson_output.out
#SBATCH -e alignOlsson_error.out
#SBATCH --mem=20G
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=BEGIN,END,FAIL

# Environment
module purge
module load bioinfo/bwa-0.7.15
module load bioinfo/picard-2.14.1
module load bioinfo/samtools-1.4
module load bioinfo/bedtools-2.26.0

# read preparation
cd ~/work/Symphonia_Torroba/
tar -xvzf Gbs.tar.gz
cd raw
rm PR_49.fastq RG_1.fastq
for file in ./*.fastq
do
  echo $file
  filename=$(basename "$file")
  filename="${filename%.*}"
  perl -pe 's|[\h]||g' $file > "${filename}".renamed.fastq
  rm $file
done
# variables
cd ~/work/Symphonia_Genomes/Olsson_2016/torroba_alignment
reference=~/work/Symphonia_Genomes/Olsson_2016/Olsson2017/Olsson2017.fa
query_path=~/work/Symphonia_Torroba/raw
# alignment
bwa index $reference
mkdir bwa
for file in $query_path/*.fastq
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  rg="@RG\tID:${filename}\tSM:${filename}\tPL:IONTORRENT"
  bwa mem -M -R "${rg}" $reference $file > bwa/"${filename}.sam"    
  rm $file
done
# sam2bam
for file in ./bwa/*.sam
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  java -Xmx4g -jar $PICARD SortSam I=$file O=bwa/"${filename}".bam SORT_ORDER=coordinate
done
# Bam index
for file in bwa/*.bam
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  java -Xmx4g -jar $PICARD BuildBamIndex I=$file O=bwa/"${filename}".bai
done
# sam2bed
mkdir bed
for file in ./bwa/*.bam
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  bedtools bamtobed -i bwa/"${filename}".bam > bed/"${filename}".bed
done
# merge bed
mkdir merged_bed
for file in ./bed/*.bed
do
  filename=$(basename "$file")
  filename="${filename%.*}"
  bedtools merge -i bed/"${filename}".bed > merged_bed/"${filename}".bed
done
cat bed/* | sort -k 1,1 -k2,2n >  all.nonunique.bed
bedtools merge -i all.nonunique.bed -c 1 -o count > all.merged.bed
Table 4.3: alignment coverage summary
N Width (Mbp) Coverage (%)
aligned sequence 1 786 852 130.3105 12.68385
selected scaffold 1 056 548 590.1957 57.44708
total 2 653 526 1 027.3729 100.00000

4.4 Masking scaffolds with multimatch from Olsson et al. (2017)

Table 4.4: alignment coverage summary
N Width (Mbp) Mask (%N) Coverage (%)
selected scaffold 245 646 464.1304 1.896894 45.17643
total 2 653 526 1 027.3729 NA 100.00000

4.6 Merge of selected scaffolds

We merged selected scaffolds from Scotti et al (in prep) and Olsson et al. (2017) with quickmerge.

We merged selected scaffolds from Scotti et al (in prep) and Olsson et al. (2017) with quickmerge. We found 13 343 overlaps resulting a final merged assembly of 82 792 scaffolds for a total lenght of 146.80 Mb.

Merging result from quickmerge. Left graph represents the overlap distribution. Right graph represent the merged scaffolds distribution.

Figure 4.1: Merging result from quickmerge. Left graph represents the overlap distribution. Right graph represent the merged scaffolds distribution.

4.7 Final subset of selected neutral scaffolds

We finally selected 0.533 Mb of sequences by sampling 533 1-kb sequences among 533 scaffolds (1 sequence per scaffold) with a probability \(p=\frac{scaffold\_length}{total\_length}\).

Table 4.5: Selected neutral scaffolds
N Width (Mbp) Mask (%N)
533 0.533 0.0042383

4.8 Repetitive regions final check

Last but not least, we do not want to include repetitive regions in our targets for baits design. We consequently aligned raw reads from one library from Scotti et al. (in prep) on our targets with bwa.

We obtained a continuous decreasing distribution of read coverage across our scaffolds regions (figure 4.2). We fitted a \(\Gamma\) distribution with positive parameters for scaffolds regions with a coverage under 5 000 (non continuous distribution with optimization issues). We obtained a distribution with a mean of 324 reads per region and a \(99^{th}\) quantile of 4 042. We decided to mask regions with a coverage over the \(99^{th}\) quantile and remove scaffolds with a mask superior to 75% of its total length (figure ??).

Read coverage distribution.

Figure 4.2: Read coverage distribution.

target regions with a coverage over the 99th quantile of the fitted Gamma distribution (4042).

Figure 4.3: target regions with a coverage over the 99th quantile of the fitted Gamma distribution (4042).

Table 4.6: Selected, masked and filtered funcional targets.
N Width (Mbp) Mask (%N)
415 0.415 0.024412

References

Olsson, S., Seoane-Zonjic, P., Bautista, R., Claros, M.G., González-Martínez, S.C., Scotti, I., Scotti-Saintagne, C., Hardy, O.J. & Heuertz, M. (2017). Development of genomic tools in a widespread tropical tree, Symphonia globulifera L.f.: a new low-coverage draft genome, SNP and SSR markers. Molecular Ecology Resources, 17, 614–630. Retrieved from http://doi.wiley.com/10.1111/1755-0998.12605