Project name: NOPP-gigas-ploidy-temp
Funding source: National Oceanographic Partnership Program
Species: Crassostrea gigas
variable: ploidy, elevated seawater temperature, desiccation
Github repo: NOPP-gigas-ploidy-temp

« previous notebook entry « | » next notebook entry »

Background

We received 3’end RNA sequencing (3’Tag RNA-Seq or TagSeq) data from 72 samples crassostrea gigas samples from the UT-Austin Genomic Sequencing and Analysis Facility (GSAF).

The tagseq sample list with sample IDs and treatments is available here. See the prior post for QC information and location of files on gannet.

Tagseq data processing

The tagseq data was received as zipped fastq.gz files, so the first thing to download them from gannet using wget:

# Download tag-seq data
mkdir raw-data/
cd raw-data/

wget -r -A .fastq.gz https://gannet.fish.washington.edu/panopea/NOPP-gigas-ploidy-temp/022022-tagseq/ \
--no-check-certificate

and unzip them using gunzip:

# unzip .fastq.gz files
cd raw-data/
gunzip *.fastq.gz

The sequencing facility provided the data as the result of two sequencing lanes per sample ID. The next step was to combine them:

# concatenate fastq files by lane

cd raw-data/

printf '%s\n' *.fastq | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read prefix; do
    cat "$prefix"*R1*.fastq >"${prefix}_R1.fastq"
    # cat "$prefix"*R2*.fastq >"${prefix}_R2.fastq" # include if more than one run
done

# I moved files to merged-fastq

Trimming and Filtering

Then trim the adapter sequences used in tagseq using cutadapt. These include poly a and g tails, as well as AGATCGG. The first 15 basepairs of the 3’ end were trimmed to prevent the inclusion of low quality reads (-q, quality-cutoff). After this process, any fragments that were less than 20 basepairs (-m, minimum read length).

# trim adapter sequences

mkdir trim-fastq/
cd merged-fastq

for F in *.fastq
do
#strip .fastq and directory structure from each file, then
# add suffice .trim to create output name for each file
results_file="$(basename -a $F)"

# run cutadapt on each file
/home/shared/8TB_HDD_02/mattgeorgephd/.local/bin/cutadapt $F -a A{8} -a G{8} -a AGATCGG -q 15 -m 20 -o \
/home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/trim-fastq/$results_file
done

Trimming and filtering resulted in an ~7% loss in reads.

Alignment

After trimming I aligned the reads to the C gigas Roslin genome using bowtie2. The first step was to create the bowtie2 index for the Rosline genome (.fa file, previously downloaded from gannet) with mitochondrial DNA included.

# create bowtie2 index for cgigas genome (took 8 min on Raven)

/home/shared/bowtie2-2.4.4-linux-x86_64/bowtie2-build \
/home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/sequences/cgigas_uk_roslin_v1_genomic-mito.fa \
/home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/sequences/cgigas_roslin_v1.fa

and then run bowtie on the trimmed reads

# Run bowtie on trimmed reads, pre-set option= --sensitive-local

mkdir bowtie_sam/
cd bowtie_sam/

for file in /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/trim-fastq/*.fastq
do
results_file="$(basename -a $file).sam"

# run Bowtie2 on each file
/home/shared/bowtie2-2.4.4-linux-x86_64/bowtie2 \
--local \
-x /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/sequences/cgigas_roslin_v1.fa \
--sensitive-local \
--threads 48 \
--no-unal \
-k 5 \
-U $file \
-S $results_file; \
done >> bowtieout.txt 2>&1

I then checked the alignment rate:

# check % alignment from Bowtie

grep "overall alignment rate" /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/bowtie_sam/bowtieout.txt

# average alignment rate = 87.8% +/- 1.3

The resulting .sam files were then converted to .bam files:

# Convert .sam files to .bam files, create bam indices

mkdir bowtie_bam/
cd bowtie_bam/

for file in /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/bowtie_sam/*.sam
do
results_file="$(basename -a $file)_sorted.bam"
/home/shared/samtools-1.12/samtools view -b $file | /home/shared/samtools-1.12/samtools sort -o /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/bowtie_bam/$results_file
done

Assembly

After alignment, assembly and preparation for DESeq2 analysis was performed using StringTie using the mRNA feature track of the Rosline C gigas genome.

# Assemble bowtie alignments w/ stringtie2 using mRNA genome feature track
array=($(ls /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/bowtie_bam/*.bam))

for i in ${array[@]}; do
        sample_name=`echo $i| awk -F [.] '{print $1}'`
	      /home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
	      -p 48 \
	      -e \
	      -B \
	      -G /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/sequences/cgigas_uk_roslin_v1_mRNA.gff \
	      -A ${sample_name}.gene_abund.tab \
	      -o ${sample_name}.gtf ${i} \

        echo "StringTie assembly for seq file ${i}" $(date)
done

echo "StringTie assembly COMPLETE, starting assembly analysis" $(date)

# 20220607 - I could not figure out how to designate the output. All outputs ended up in bowtie output folder.

Next I merged the stringtie output and generated the gene count matrix to prepare for DESeq2 analysis.


cd /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/bowtie_bam

# make gtf list file (needed for stringtie merge function)
for filename in *.gtf; do
  echo $PWD/$filename;
  done > gtf_list.txt

# make listGTF file (needed for count matrix), two columns w/ sample ID
for filename in *.gtf; do
  echo $filename $PWD/$filename;
  done > listGTF.txt

# merge GTFs into a single file
/home/shared/stringtie-2.2.1.Linux_x86_64/stringtie \
  --merge \
  -p 48 \
	-G /home/shared/8TB_HDD_02/mattgeorgephd/gigas-WGBS-ploidy-desiccation/sequences/cgigas_uk_roslin_v1_mRNA.gff \
	-o cgigas_merged.gtf gtf_list.txt #Merge GTFs to form $

echo "Stringtie merge complete" $(date)

# Compile gene count matrix from GTFs
/home/shared/stringtie-2.2.1.Linux_x86_64/prepDE.py \
  -g cgigas_gene_count_matrix.csv \
  -i listGTF.txt #Compile the gene count matrix

echo "Gene count matrix compiled." $(date)