Chapter 11 Quality Check

We received demultiplexed libraries from sequencing. We then checked sequence quality by combining the fastqc reports already produced, and compared the sequences with the originally provided (i) baits, (ii) targets, and (iii) references:

  1. Multi Quality Check: we used multiqc to combine the fastqc outputs for every library (1002, counting forward and reverse reads for each individual) and check sequence counts, quality and GC content
  2. Trimming: we trimmed sequences, removing low-quality bases and adapter sequences
  3. Targets mapping: we mapped 10 libraries on the targets to check for off-target sequences
  4. Reference mapping: we mapped 10 libraries on the hybrid reference to check for off-reference sequences and assess the usefulness of de novo assembly

11.1 Multi Quality Check

We used multiqc to combine the fastqc outputs for every library (1002, counting forward and reverse reads for each individual) and check sequence counts, quality and GC content.

11.1.1 Counts

Sample representation is highly heterogeneous (a 215 000-fold range), but 85% of samples have more than 66 667 sequences (ca. 1M of targets / 150 bp * 10X). Moreover, duplicated sequences are clearly more frequent in overrepresented individuals, which is probably linked to PCR bias rather than to sequencing issues.

Figure 11.1: Sequence counts.
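
The per-sample threshold quoted above comes from a simple coverage calculation. The following is a minimal sketch of that arithmetic, assuming about 1 Mb of targets, 150 bp reads and a 10X depth goal as stated in the text; the function name is illustrative and not part of the original pipeline.

```python
import math


def reads_for_depth(target_bp: int, read_bp: int, depth: float) -> int:
    """Minimum number of reads needed to reach a mean depth over the targets."""
    # depth = reads * read_bp / target_bp  =>  reads = depth * target_bp / read_bp
    return math.ceil(depth * target_bp / read_bp)


# Values taken from the text: ~1 Mb of targets, 150 bp reads, 10X depth
print(reads_for_depth(1_000_000, 150, 10))  # 66667
```

This assumes that every read falls on a target and that reads are spread evenly, so it is best read as a lower bound on the number of sequences needed per sample.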

11.1.2 Quality

Sequence quality is very good: the Phred score is above 25 for every base, at all positions, across all sequences.

Figure 11.2: Phred score.
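
For readers less familiar with the scale, a Phred score Q corresponds to a base-calling error probability of 10^(-Q/10). The short sketch below shows the conversion for a few scores, including the threshold of 25 mentioned above and the value of 15 used later for trimming; the helper name is illustrative.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score into a base-calling error probability."""
    return 10 ** (-q / 10)


for q in (15, 25, 30):
    print(q, f"{phred_to_error_probability(q):.4f}")
# 15 0.0316
# 25 0.0032
# 30 0.0010
```

A score of 25 therefore means roughly 3 expected errors per 1000 base calls.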

11.1.3 GC content

The mean GC content is 41.5% and only a few sequences show an unexpected GC content, either overall or along the sequence.

Figure 11.3: GC content across sequences.

Figure 11.4: GC content within sequences.
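
As a reminder of what is being summarised, GC content is the percentage of G and C bases in a sequence. The sketch below is a minimal illustration on a toy sequence. It assumes that every character counts towards the length, including ambiguous bases such as N; the exact convention used by fastqc is not covered here.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(seq.count(base) for base in "GC")
    return 100 * gc / len(seq)


print(gc_percent("ATGCATTA"))  # 25.0
```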

11.2 Trimming

We listed all libraries in a text file and trimmed them with trimmomatic in paired-end (PE) mode, writing paired and unpaired reads to compressed fastq files (fq.gz). We clipped the adapters (ILLUMINACLIP) of our protocol (TruSeq3-PE) with a seed mismatch of 2 (number of mismatches allowed), a palindrome clip threshold of 30 (match required for ligated adapters), a simple clip threshold of 10 (match required between adapter and sequence), a minimum adapter length of 2, and keeping both reads each time (keepBothReads). We trimmed sequences on Phred score with a minimum of 15 in a sliding window of 4 (SLIDINGWINDOW:4:15), without trimming the beginning (LEADING:X) or the end (TRAILING:X) of reads. Unsurprisingly, given the high sequencing quality, 99.91% of raw reads were retained as paired trimmed reads (Figure 11.5). For now, the main issue with our dataset is therefore the representation of sequences rather than their quality.

Figure 11.5: Trimming results.
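
The trimming settings described above can be assembled into a Trimmomatic command line roughly as follows. This is a sketch under stated assumptions, not the command actually run: the `trimmomatic` launcher name (as provided by some package managers), the input and output file names, and the adapter file path are all hypothetical. The last ILLUMINACLIP field is written as `true`, which is how the Trimmomatic manual enables keepBothReads; the exact string used in the original run is not given in the text.

```python
def trimmomatic_pe_command(forward, reverse, prefix, adapters="TruSeq3-PE.fa"):
    """Build the argument list for a paired-end Trimmomatic run (illustrative)."""
    # ILLUMINACLIP fields: adapter file : seed mismatches : palindrome clip
    # threshold : simple clip threshold : minimum adapter length : keepBothReads
    illuminaclip = f"ILLUMINACLIP:{adapters}:2:30:10:2:true"
    return [
        "trimmomatic", "PE",  # assumed launcher name, then paired-end mode
        forward, reverse,  # raw forward and reverse reads
        f"{prefix}_R1_paired.fq.gz", f"{prefix}_R1_unpaired.fq.gz",
        f"{prefix}_R2_paired.fq.gz", f"{prefix}_R2_unpaired.fq.gz",
        illuminaclip,
        "SLIDINGWINDOW:4:15",  # window of 4 bases, minimum mean Phred of 15
    ]


# Hypothetical file names, for illustration only
cmd = trimmomatic_pe_command("lib_R1.fastq.gz", "lib_R2.fastq.gz", "lib")
print(" ".join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True)
```

Building the command as a list keeps each argument separate, which makes it easy to loop over the library list file and avoids shell quoting issues.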

11.3 Targets mapping

We mapped 10 libraries on the targets to check for off-target sequences and target loss. Overall, target coverage was good (median of 90%, Figure 11.6), but 70% to 81% of reads were off-target (Table 11.1). Consequently, we could not use the targets alone as a reference for read mapping.

Figure 11.6: Reads alignment coverage on targets. Distribution has been cut at 2000X.

Table 11.1: Reads mapped on targets statistics.
Library      Reads mapped  Percentage of reads mapped
BCI-SG14     950           27.15
BCI-SG47     16677         27.18
P11-2-240    1064          19.25
P14-2-2842   607526        23.85
P2-2-675     499249        28.01
P4-2-2657    784026        26.77
P5-3-2202    722215        30.28
P6-3-2800    474588        20.19
P6-4-2867    1210288       19.31
P7-3-2806    358925        28.22
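
The off-target range quoted in the text can be recovered directly from the percentages in Table 11.1, since the off-target share of a library is 100 minus its mapped percentage. A minimal sketch, with the values copied from the table and illustrative variable names:

```python
# Percentage of reads mapped on targets, copied from Table 11.1
mapped_pct = {
    "BCI-SG14": 27.15, "BCI-SG47": 27.18, "P11-2-240": 19.25,
    "P14-2-2842": 23.85, "P2-2-675": 28.01, "P4-2-2657": 26.77,
    "P5-3-2202": 30.28, "P6-3-2800": 20.19, "P6-4-2867": 19.31,
    "P7-3-2806": 28.22,
}

# Off-target share is the complement of the mapped share
off_target = {lib: round(100 - pct, 2) for lib, pct in mapped_pct.items()}

print(min(off_target.values()), max(off_target.values()))  # 69.72 80.75
```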

11.4 Reference mapping

We mapped 10 libraries on the hybrid reference to check for off-reference sequences and assess the usefulness of de novo assembly. Overall, coverage of the reference was low (median of 19%, Figure 11.7), but 79% to 88% of reads were on-reference (Table 11.2). Finally, we had a median of 4 Mb of the reference covered at 10X, which is 4 times what we designed in probes. Consequently, we will not need de novo assembly and will proceed to read mapping for every library on the built reference, which is already partly annotated.

Figure 11.7: Reads alignment coverage on reference. Distribution has been cut at 2000X.

Table 11.2: Reads mapped on reference statistics.
Library      Reads mapped  Percentage of reads mapped
BCI-SG14     3232          85.28
BCI-SG47     57142         85.57
P11-2-240    4669          78.74
P14-2-2842   2384919       85.85
P2-2-675     1684774       86.47
P4-2-2657    2717886       85.82
P5-3-2202    2276779       87.75
P6-3-2800    2161522       84.74
P6-4-2867    5707686       83.93
P7-3-2806    1153783       83.91