Project name: project-gigas_ploidy
Funding source: unknown
Species: Crassostrea gigas
variable: ploidy, desiccation, high temperature

previous notebook entry

Background

After running methylkit to find the DML within each dataset, we now need to see where they are located. To do this I will be running this jupyter notebook and using the genome feature tracks available here

Genome feature analysis

Here I use the output of getMethylDiff from our prior MethylKit analysis to create BEDfiles:

%%bash

#Replace , with tabs
#Remove extraneous quotes entries (can also be done in R)
#Print chr, start, end, meth.diff
#Remove header
#Save as BEDfile

for f in /home/mngeorge/project-gigas_ploidy/bisulfide_analysis/WGBS/DML/DML-getMethylDiff-Cov10-20-Ronit-COMBO.csv
do
    tr "," "\t" < ${f} \
    | tr -d '"' \
    | awk '{print $2"\t"$3"\t"$4"\t"$8}' \
    | tail -n+2 \
    > ${f}.bed
done

Next, I used intersectBed to calculate the overlap between the getMethylDiff bed files and the genome feature track files:

#Find overlaps between DML and each genome feature - Ronit dataset, factor: ploidy
#Count number of overlaps

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_gene.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-gene-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-gene-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_exon.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-exon-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-exon-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_exonUTR.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-exonUTR-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-exonUTR-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_intron.bed \
> DML/RONIT/DML-getMethylDiff-Cov10-20-intron-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-intron-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_lncRNA.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-lncRNA-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-lncRNA-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_mRNA.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-mRNA-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-mRNA-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_CDS.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-CDS-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-CDS-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_nonCDS.bed \
> DML/RONIT/DML-getMethylDiff-Cov10-20-nonCDS-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-nonCDS-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_intergenic.bed \
> DML-getMethylDiff-Cov10-20-intergenic-Ronit-PLOIDY.bed
!wc -l DML-getMethylDiff-Cov10-20-intergenic-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_rm.te.bed \
> DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY-rm.te.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY-rm.te.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_upstream.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-upstream-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-upstream-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_downstream.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-downstream-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-downstream-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_flanks.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-flanks-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-flanks-Ronit-PLOIDY.bed

Here is the output:

57267 DML/RONIT/DML-getMethylDiff-Cov10-20-gene-Ronit-PLOIDY.bed
19069 DML/RONIT/DML-getMethylDiff-Cov10-20-exon-Ronit-PLOIDY.bed
5416 DML/RONIT/DML-getMethylDiff-Cov10-20-exonUTR-Ronit-PLOIDY.bed
38267 DML/RONIT/DML-getMethylDiff-Cov10-20-intron-Ronit-PLOIDY.bed
1951 DML/RONIT/DML-getMethylDiff-Cov10-20-lncRNA-Ronit-PLOIDY.bed
56319 DML/RONIT/DML-getMethylDiff-Cov10-20-mRNA-Ronit-PLOIDY.bed
13654 DML/RONIT/DML-getMethylDiff-Cov10-20-CDS-Ronit-PLOIDY.bed
41696 DML/RONIT/DML-getMethylDiff-Cov10-20-nonCDS-Ronit-PLOIDY.bed
2223 DML-getMethylDiff-Cov10-20-intergenic-Ronit-PLOIDY.bed
7564 DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY-rm.te.bed
343 DML/RONIT/DML-getMethylDiff-Cov10-20-upstream-Ronit-PLOIDY.bed
935 DML/RONIT/DML-getMethylDiff-Cov10-20-downstream-Ronit-PLOIDY.bed
1207 DML/RONIT/DML-getMethylDiff-Cov10-20-flanks-Ronit-PLOIDY.bed

Running this code on all the outputs from all the datasets analyzed, I generated stacked bar plots linking at the proportion of CpG and DMLs within each genome feature, using the following R code:

#####################################################################
### DML-genome-feature-location Stacked bar plot

library(readxl)
library(ggplot2)
library(RColorBrewer)

data         <- read_excel("DML_locations-all-datasets.xlsx", sheet = "plot", col_names = TRUE)
data$feature <- factor(data$feature, levels = c("exon_UTR",
                                                "CDS",
                                                "intron",
                                                "upstream_flank",
                                                "downstream_flank",
                                                "lncRNA",
                                                "TE",
                                                "intergenic"), ordered=TRUE)

# Plot
p1 <- ggplot(data = data, aes(y = B_combo_contrast, x = 'X', fill = feature)) +
      geom_bar(stat  = "identity",
               color = "black",
               #fill  = rep(v,length.out=18),
               size  = 1) +
      scale_fill_brewer(palette = "YlGnBu") +
      scale_y_continuous(limits = c(0,100), breaks = seq(0,100,20)) +
      theme( line              = element_line(size=1.5),
             rect              = element_rect(size=1.5),
             text              = element_text(size=22,color="black"),
             panel.background  = element_blank(),
             panel.grid.major  = element_blank(),
             panel.grid.minor  = element_blank(),
             axis.text.x       = element_text(size=22,color="black"),
             axis.text.y       = element_text(size=22,color="black"),
             # axis.ticks.y      = element_blank(),
             axis.title.x      = element_blank(),
             axis.title.y      = element_blank(),
             axis.line         = element_line(color = "black", size = 0.5),
             panel.border      = element_rect(color = "black", fill=NA, size=2))

p1

ggsave("B_combo_contrast.png",
       plot   = p1,
       dpi    = 1200,
       device = "png",
       width  = 4.8,
       height = 6,
       units  = "in")

##########################################################################

Here is the prior table for reference:

Table 1. DML counts from each analysis

Here are the plots:

Figure 1. Each dataset separately

Figure 2. Both datasets together