Chapter 5 Functional region selection

We used open reading frames (ORFs) to target genes within scaffolds. ORFs were detected with transdecoder on assembled transcripts. First, we filtered ORFs to keep those including a start codon (figure 5.1). Then, we aligned the ORFs on the pre-selected and merged genomic scaffolds with blat and obtained 7 744 aligned scaffolds (table 5.1 and figure 5.2). Using the alignment, we removed overlapping genes (figure 5.3) and obtained 4 076 pre-selected genes with a total length of 757 kbp (figure 5.4). Finally, we used transcript differential expression to select all genes differentially expressed between Symphonia globulifera and Symphonia sp1 (figure 5.5). We selected 1 150 sequences of 500 bp to 1 kbp, representing 1.063 Mbp (table 5.2). To validate our final target set, we aligned raw reads from one library from Scotti et al. (in prep) on it with bwa. …
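The start codon filter can be illustrated with a minimal sketch. It assumes that the ORFs are available as nucleotide FASTA records and that "including a start codon" means the sequence begins with ATG; the function names and the toy records are hypothetical and are not taken from our pipeline.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_start_codon(sequence, start_codon="ATG"):
    """Assumption: an ORF 'includes a start codon' if it begins with ATG."""
    return sequence.upper().startswith(start_codon)


def filter_orfs(lines):
    """Keep only the ORFs whose sequence begins with a start codon."""
    return [(h, s) for h, s in read_fasta(lines) if has_start_codon(s)]


# Toy records (hypothetical): orf2 lacks a start codon and is dropped.
example = """>orf1
ATGGCTTAA
>orf2
GCTGCTTAA
>orf3
atgaaa
tag
""".splitlines()

kept = filter_orfs(example)
print([header for header, _ in kept])
```

A filter based on the ORF type reported by the ORF caller would be an alternative; the sequence-based check is used here only because it does not depend on any header format.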

5.2 ORF alignment on genomic scaffolds

7 744 scaffolds matched ORFs, representing 15.4 Mbp or 10.5% of the total scaffold width (see table 5.1 and figure 5.2).

Table 5.1: Alignment coverage of Tysklind et al. (in prep) ORFs over genomic scaffolds with blat.
                         N    Width (Mbp)    Coverage (%)
aligned sequence    21 146       2.201315        1.499565
selected scaffold    7 744      15.425248       10.507881
total               82 792     146.796946      100.000000

Figure 5.2: Alignment result of Tysklind et al. (in prep) ORFs over genomic scaffolds with blat. The left graph represents the overlap distribution. The right graph represents the distribution of selected and deduplicated scaffolds.
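The coverage column of table 5.1 is the width of each row divided by the total scaffold width. The short sketch below recomputes it from the widths reported in the table; the function name is ours and only serves the illustration.

```python
# Total width of the genomic scaffolds, in Mbp (table 5.1, row "total").
TOTAL_WIDTH_MBP = 146.796946


def coverage_percent(width_mbp, total_mbp=TOTAL_WIDTH_MBP):
    """Return the share of the total scaffold width covered, in percent."""
    return 100.0 * width_mbp / total_mbp


# Widths in Mbp, as reported in table 5.1.
rows = [("aligned sequence", 2.201315), ("selected scaffold", 15.425248)]
for label, width in rows:
    print(f"{label}: {coverage_percent(width):.6f} %")
```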

5.4 Pre-selected genes

We obtained 4 076 pre-selected genes with a total length of 757 kbp (figure 5.4).


Figure 5.4: Available genes for target sequences design.
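Removing overlapping genes can be sketched as an interval sweep over each scaffold. The sketch assumes that every gene involved in an overlap is discarded (the alternative of keeping one gene per overlapping group is not covered), that gene names are unique, and that coordinates are half-open intervals; the tuple layout and the toy data are hypothetical.

```python
from collections import defaultdict


def remove_overlapping(genes):
    """Drop every gene that overlaps another gene on the same scaffold.

    genes: list of (name, scaffold, start, end) with half-open [start, end)
    coordinates and unique names (assumptions of this sketch).
    """
    by_scaffold = defaultdict(list)
    for gene in genes:
        by_scaffold[gene[1]].append(gene)

    overlapping = set()
    for hits in by_scaffold.values():
        hits.sort(key=lambda g: (g[2], g[3]))
        furthest = None  # gene reaching the largest end coordinate so far
        for gene in hits:
            # A gene starting before the furthest end seen so far overlaps.
            if furthest is not None and gene[2] < furthest[3]:
                overlapping.add(furthest[0])
                overlapping.add(gene[0])
            if furthest is None or gene[3] > furthest[3]:
                furthest = gene
    return [g for g in genes if g[0] not in overlapping]


# Toy alignment: g1 and g2 overlap on scaffold s1 and are both removed.
genes = [
    ("g1", "s1", 0, 100),
    ("g2", "s1", 50, 150),
    ("g3", "s1", 200, 300),
    ("g4", "s2", 0, 80),
]
print([g[0] for g in remove_overlapping(genes)])
```

Sorting by start coordinate keeps the sweep at one pass per scaffold, which matters little for a few thousand genes but avoids comparing every pair.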

5.5 Differential Expression (DE) of genes

Figure 5.5 shows gene differential expression. The first circle represents genes whose isoforms are not enriched, whereas the second and third circles represent genes with isoforms enriched in S. sp1 and in S. globulifera, respectively. Relatively few genes contained enriched isoforms, and most of these were enriched in S. globulifera.


Figure 5.5: Gene differential expression.

5.7 Repetitive regions final check

Finally, we did not want to include repetitive regions in our targets for bait design. We therefore aligned raw reads from one library from Scotti et al. (in prep) on our targets with bwa.

We obtained a continuously decreasing distribution of read coverage across our scaffold regions (figure 5.6). We fitted a \(\Gamma\) distribution with positive parameters to the scaffold regions with a coverage under 5 000, because the distribution was not continuous above this value, which caused optimization issues. The fitted distribution had a mean of 309 reads per region and a \(99^{th}\) quantile of 2 606. We decided to mask regions with a coverage over the \(99^{th}\) quantile and to remove scaffolds in which the mask exceeded 75% of the total length (figure ??).
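The masking and filtering rule can be sketched as follows. The sketch takes the threshold of 2 606 and the 75% limit from the text, but it assumes that coverage is available as one depth value per base and that masking replaces bases with N; the function names and the toy targets are hypothetical.

```python
# 99th quantile of the fitted Gamma distribution, as reported in the text.
COVERAGE_THRESHOLD = 2606
# Targets with more than this fraction of masked bases are removed.
MAX_MASKED_FRACTION = 0.75


def mask_sequence(sequence, coverage, threshold=COVERAGE_THRESHOLD):
    """Replace by N every base whose read depth exceeds the threshold."""
    if len(sequence) != len(coverage):
        raise ValueError("sequence and coverage must have the same length")
    return "".join(
        "N" if depth > threshold else base
        for base, depth in zip(sequence, coverage)
    )


def masked_fraction(sequence):
    """Fraction of N in the sequence (pre-existing N are counted too)."""
    return sequence.upper().count("N") / len(sequence) if sequence else 0.0


def mask_and_filter(targets, threshold=COVERAGE_THRESHOLD,
                    max_masked_fraction=MAX_MASKED_FRACTION):
    """Mask each target, then keep those masked on at most 75% of their length."""
    kept = {}
    for name, (sequence, coverage) in targets.items():
        masked = mask_sequence(sequence, coverage, threshold)
        if masked_fraction(masked) <= max_masked_fraction:
            kept[name] = masked
    return kept


# Toy targets: t2 is fully masked and removed, t3 sits exactly at 75% and is kept.
targets = {
    "t1": ("ACGTACGT", [10, 10, 3000, 3000, 10, 10, 10, 10]),
    "t2": ("ACGT", [3000, 3000, 3000, 3000]),
    "t3": ("ACGT", [3000, 3000, 3000, 10]),
}
print(mask_and_filter(targets))
```

The comparison is strict on both sides, following the wording of the text: coverage must be over the quantile to be masked, and the mask must be superior to 75% for a target to be removed.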


Figure 5.6: Read coverage distribution.


Figure 5.7: Target regions with a coverage over the 99th quantile of the fitted Gamma distribution (2 606).

Table 5.3: Selected, masked and filtered functional targets.
N Width (Mbp) Mask (%N)
975 0.896759 0.0168273