Tue. Sept. 06, 2022

Project name: NOPP-gigas-ploidy-temp
Funding source: National Oceanographic Partnership Program
Species: Crassostrea gigas
variable: ploidy, elevated seawater temperature, desiccation
Github repo: NOPP-gigas-ploidy-temp

« previous notebook entry « | » next notebook entry »

Tagseq analysis - using HISAT2

I think I fixed the issue with low alignment scores that I was getting with HISAT2, which is a preferred aligner for TagSeq data because it is splice aware, unlike BowTie2. I hard trimmed the first 15 bp (-u 15) of the TagSeq transcripts using the following:

trim adapter sequences

mkdir trim-fastq/
cd raw-data/

for F in *.fastq
do
#strip .fastq and directory structure from each file, then
# add suffice .trim to create output name for each file
results_file="$(basename -a $F)"

# run cutadapt on each file, hard trim first 15 bp, minimum length 20
/home/shared/8TB_HDD_02/mattgeorgephd/.local/bin/cutadapt $F -a A{8} -a G{8} -a AGATCGG -u 15 -m 20 -o \
/home/shared/8TB_HDD_02/mattgeorgephd/NOPP-gigas-ploidy-temp/trim-fastq/$results_file
done

Merge, compar with MultiQC

After merging by lane, I got an average alignment rate of 72.98 +/- 3.07 sd (much better then without hard trimming ~ 55%). Here is the multiqc report for the merged, trimmed sequences.

Looking through the results, I see issues with the following samples:

D54 - only 2.1 million reads
N56 - only 0.9 million reads, 57.6% duplicates
X44 - only 1.0 million reads, 67.1% duplicates

Run DESeq2 on all, unfiltered

I ran DESeq2 on all aligned reads using this R script, without excluding samples.

Here is the pheatmap comparing all samples:

Looking at the map it looks like R53 and X42 may also be a problem, but lets run the comparisons to see if it stands out. I ran DESeq2 again, removing D54, N56, and X44 using the following R script. The analysis yielded 28,877 DEGs.

The all-treatments biplot agreed that R53 and X42 is an outlier, while also identifying M43 and N54.

Run DESeq2 w/ filters

I used the following code to filter bad/outlier samples,

coldata <- coldata[!(row.names(coldata) %in% c('D54','N56', 'X44', 'R53', 'M43', 'N54', 'X42')),]
cts <- as.matrix(subset(cts, select=-c(D54, N56, X44, R53, M43, N54, X42)))
coldata %>% dplyr::count(group)
all(colnames(cts) %in% rownames(coldata))

This resulted in at least 10 samples per treatment.

group	ploidy	treatment	n
A	diploid	control	11
B	triploid	control	12
C	diploid	heat	11
D	triploid	heat	10
E	diploid	desiccation	10
F	triploid	desiccation	11

I then ran DESeq2 on all samples, filtering by group numbers to run individual comparisons.

# Filter data
coldata_trt <- coldata # %>% filter(group == "A" | group == "B")
cts_trt     <- subset(cts, select=row.names(coldata_trt))

# Calculate DESeq object
dds <- DESeqDataSetFromMatrix(countData = cts_trt,
                              colData = coldata_trt,
                              design = ~ group)

dds <- DESeq(dds) # Run DESeq2
resultsNames(dds) # lists the coefficients

I then removed genes that didn’t have at least 1/3 of samples with 10 or more counts. See this bioconductor thread.

keep <- rowSums(DESeq2::counts(dds) >= 10) >= ncol(cts_trt)/3
dds <- dds[keep,]

Across all samples, 31,371 genes were identified and 19,089 genes were removed through filtering, resulting in 12,282 genes to compare across samples.

Here are some results: PCA: All treatments:

PCA: Diploid only:

PCA: Triploid only:

PCA: Multiple comparisons by ploidy:

comparison	diploid	triploid
single-stressor v control
multi-stressor v control

DEG: significant genes by treatment: </br> Differentially expressed genes with p<0.05 and log2fold change > 1.5, using the apeglm shrinkage estimator

comparison	diploid	triploid
single-stressor v control	317	94
multi-stressor v control	74	62

PCA: Multiple comparisons by treatment:

comparison	control	single-stressor	multi-stressor
ploidy

DEG: significant genes by treatment:

Differentially expressed genes with p<0.05 and log2fold change > 1.5, using the apeglm shrinkage estimator

comparison	control	single-stressor	multi-stressor
ploidy	177	344	138