Chapter 16 Annotation
We developed and manually tested with few SNPs a method to match transcriptomic data from Tysklind et al., in prep with our genomic data. We found expected proportion of SNPs type, with a lot of putatively hitchhiker SNPs in functional SNPs. Finally, ca 2000 SNPs were matching between the two datasets.
population structure of individuals. Then, we quickly looked at th spatial and environmental distribution of the different gene pools and individual mismatch (before association genomic analyses).
- method data preparation and methodology to match transcriptomic data from Tysklind et al., in prep with our genomic data
- results SNP classification, corresponding SNPs and genes, and common SNPs between datasets.
16.1 Method
We focused on scaffolds from the hybrid genomic reference which contained a filtered SNP from the final dataset (7551 over 82792). We focused on transcripts longer than 500 bp and with at least one ORF (137 656 over 257 140). We then, used transcripts alignment on scaffolds to align back SNPs on transcripts, (beware the indexing position depend on the format vcf start with 1 while psl with 0). We proceeded as follow (not a common procedure but a home-made code to be double-checked):
- SNP on scaffolds with no transcript match were classified as “purely neutral”
- SNP on scaffolds matching a transcript were positioned on the global alignment
- SNP out the alignment area were classified as “putatively hitchhiking”
- SNP within the alignment area were further positionned among the different alignment blocks (cf
blat
functionning) - SNP out of alignment blocks (intron) were classified as “putatively hitchhiking”
- SNP in one alignment blocks were classified as “functional”
scf <- read_tsv(file.path(path, "..", "variantCalling", "paracou",
"symcapture.all.biallelic.snp.filtered.nonmissing.paracou.bim"),
col_names = c("scf", "snp", "posM", "pos", "A1", "A2")) %>%
dplyr::select(scf) %>%
unique() %>%
unlist()
reference <- readDNAStringSet(file.path(path, "..", "reference", "reference.fasta"))
writeXStringSet(reference[scf],
file.path(path, "scaffolds.fasta"))
orf <- src_sqlite(file.path("..", "..", "..", "data", "Symphonia_Genomic", "Niklas_transcripts",
"Trinotate", "symphonia.trinity500.trinotate.sqlite")) %>%
tbl("ORF") %>%
dplyr::select(transcript_id) %>%
dplyr::rename(trsc = transcript_id) %>%
collect() %>%
unique()
trsc.tbl <- src_sqlite(file.path("..", "..", "..", "data", "Symphonia_Genomic", "Niklas_transcripts",
"Trinotate", "symphonia.trinity500.trinotate.sqlite")) %>%
tbl("Transcript") %>%
dplyr::select(transcript_id, sequence) %>%
dplyr::rename(trsc = transcript_id) %>%
collect() %>%
filter(trsc %in% orf$trsc)
rm(orf)
trsc <- DNAStringSet(trsc.tbl$sequence)
names(trsc) <- trsc.tbl$trsc
writeXStringSet(trsc, file.path(path, "transcripts.fasta"))
# blat scaffolds.fasta transcipts.fasta alignment.psl
16.2 Results
We obtained as planified a third of neutral snps (33%) against two thirds of non-neutral snps for which the majority was out of transcripts and may behave as hitchhikers (Fig. 16.1). Nevertheless, few functional snps matched several transcripts and/or genes due to transcripts isoforms, genes superposition and possibl multimatching errors (Fig. 16.2). For the moment we wont further annotate functional SNPs with ORFs and SNPs on transcript alignment, or gene ontology and synonymy. We will only finish annotation for selected SNPs in further analyses. Finally, we compared obtained SNP in genomics in this analysis with transcriptomic SNPs identified by Tysklind et al., in prep (Fig. 16.3)) and validated 2 098 common SNPs.
16.3 Synonymy
snps <- read_tsv(file.path(path, "snpsFunc.annotation")) %>%
left_join(read_tsv(file.path(path, "snps.annotation")))
trsc <- src_sqlite(file.path("..", "..", "..", "data", "Symphonia_Genomic", "Niklas_transcripts",
"Trinotate", "symphonia.trinity500.trinotate.sqlite")) %>%
tbl("Transcript") %>%
filter(transcript_id %in% local(snps$trsc)) %>%
dplyr::rename(trsc = transcript_id) %>%
dplyr::select(trsc, sequence) %>% collect()
t <- snps %>%
dplyr::select(snp, trsc, posTrsc, A1, A2) %>%
reshape2::melt(id.vars = c("snp", "trsc", "posTrsc"),
variable.name = "Allele", value.name = "base") %>%
arrange(snp) %>%
left_join(trsc) %>%
mutate(ref = stringr::str_sub(sequence, posTrsc, posTrsc)) %>%
group_by(snp) %>%
filter(any(base %in% ref))
t %>%
filter(base != ref) %>%
ungroup() %>%
dplyr::select(trsc, posTrsc, base) %>%
write_tsv(path = file.path(path, "SNPsOnTrsc.tsv"), col_names = F)
mkdir TransDecoder
cp transcripts.fasta TransDecoder/
cd TransDecoder/
TransDecoder.LongOrfs -t transcripts.fasta
makeblastdb \
-in /bank/ebi/uniprot/current/fasta/uniprot_sprot.fasta \
-dbtype prot -input_type fasta -out uniprot/uniprot_sprot.fa -hash_index
blastp -query transcripts.fasta.transdecoder_dir/longest_orfs.pep \
-db uniprot/uniprot_sprot.fa \
-max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 4 > blastp.outfmt6
hmmscan --cpu 4 --domtblout pfam.domtblout \
Pfam/Pfam-A.hmm \
transcripts.fasta.transdecoder_dir/longest_orfs.pep
TransDecoder.Predict -t transcripts.fasta \
--retain_pfam_hits pfam.domtblout --retain_blastp_hits blastp.outfmt6
~/Tools/gffread/gffread longest_orfs.gff3 -T -o longest_orfs.gtf
perl ~/Tools/SNPdat_package_v1.0.5/SNPdat_v1.0.5.pl \
-i SNPsOnTrsc.tsv \
-g longest_orfs.gtf \
-f transcripts.fasta \
-s synonymy.summary \
-o synonymy.output
snps <- read_tsv(file.path(path, "snpsFunc.annotation")) %>%
left_join(read_tsv(file.path(path, "snps.annotation")))
synonymy <- read_tsv(file.path(path, "synonymy.output"), col_names = T)
ggplot(synonymy, aes(synonymous)) +
geom_bar() +
ggtitle(paste0(round(178908/475272*100), "% of non synonymous SNPs in genic SNPs"))
snps %>%
mutate(TRSC = toupper(trsc)) %>%
left_join(dplyr::rename(synonymy, TRSC = "Chromosome Number", posTrsc = "SNP Position")) %>%
dplyr::select(-TRSC) %>%
group_by(snp) %>%
summarise(synonymous = as.numeric(any(synonymous == "Y", na.rm = T))) %>%
mutate(synonymous = recode(synonymous, "1" = "synonymous", "0" = "anonymous")) %>%
write_tsv(path = file.path(path, "snpSynonymy.annotation"), col_names = F)
# grep synonymous snpSynonymy.annotation | cut -f1 > snp.synonymous
# grep anonymous snpSynonymy.annotation | cut -f1 > snp.anonymous