Project name: project-gigas_ploidy
Funding source: unknown
Species: Crassostrea gigas
variable: ploidy, desiccation, high temperature

previous notebook entry

next notebook entry

Background

After running methylkit to find the DML within each dataset, we now need to see where they are located. To do this I will be running this jupyter notebook and using the genome feature tracks available here

Genome feature analysis

Here I use the output of getMethylDiff from our prior MethylKit analysis to create BEDfiles:

%%bash

#Replace , with tabs
#Remove extraneous quotes entries (can also be done in R)
#Print chr, start, end, meth.diff
#Remove header
#Save as BEDfile

for f in /home/mngeorge/project-gigas_ploidy/bisulfide_analysis/WGBS/DML/DML-getMethylDiff-Cov10-20-Ronit-COMBO.csv
do
    tr "," "\t" < ${f} \
    | tr -d '"' \
    | awk '{print $2"\t"$3"\t"$4"\t"$8}' \
    | tail -n+2 \
    > ${f}.bed
done

Next, I used intersectBed to calculate the overlap between the getMethylDiff bed files and the genome feature track files:

#Find overlaps between DML and each genome feature - Ronit dataset, factor: ploidy
#Count number of overlaps

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_gene.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-gene-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-gene-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_exon.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-exon-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-exon-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_exonUTR.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-exonUTR-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-exonUTR-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_intron.bed \
> DML/RONIT/DML-getMethylDiff-Cov10-20-intron-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-intron-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_lncRNA.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-lncRNA-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-lncRNA-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_mRNA.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-mRNA-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-mRNA-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_CDS.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-CDS-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-CDS-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_nonCDS.bed \
> DML/RONIT/DML-getMethylDiff-Cov10-20-nonCDS-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-nonCDS-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_intergenic.bed \
> DML-getMethylDiff-Cov10-20-intergenic-Ronit-PLOIDY.bed
!wc -l DML-getMethylDiff-Cov10-20-intergenic-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_rm.te.bed \
> DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY-rm.te.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY-rm.te.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_upstream.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-upstream-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-upstream-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_downstream.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-downstream-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-downstream-Ronit-PLOIDY.bed

!intersectBed \
-u \
-a DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY.bed \
-b GFF/cgigas_uk_roslin_v1_flanks.gff \
> DML/RONIT/DML-getMethylDiff-Cov10-20-flanks-Ronit-PLOIDY.bed
!wc -l DML/RONIT/DML-getMethylDiff-Cov10-20-flanks-Ronit-PLOIDY.bed

Here is the output:

57267 DML/RONIT/DML-getMethylDiff-Cov10-20-gene-Ronit-PLOIDY.bed
19069 DML/RONIT/DML-getMethylDiff-Cov10-20-exon-Ronit-PLOIDY.bed
5416 DML/RONIT/DML-getMethylDiff-Cov10-20-exonUTR-Ronit-PLOIDY.bed
38267 DML/RONIT/DML-getMethylDiff-Cov10-20-intron-Ronit-PLOIDY.bed
1951 DML/RONIT/DML-getMethylDiff-Cov10-20-lncRNA-Ronit-PLOIDY.bed
56319 DML/RONIT/DML-getMethylDiff-Cov10-20-mRNA-Ronit-PLOIDY.bed
13654 DML/RONIT/DML-getMethylDiff-Cov10-20-CDS-Ronit-PLOIDY.bed
41696 DML/RONIT/DML-getMethylDiff-Cov10-20-nonCDS-Ronit-PLOIDY.bed
2223 DML-getMethylDiff-Cov10-20-intergenic-Ronit-PLOIDY.bed
7564 DML/RONIT/DML-getMethylDiff-Cov10-20-Ronit-PLOIDY-rm.te.bed
343 DML/RONIT/DML-getMethylDiff-Cov10-20-upstream-Ronit-PLOIDY.bed
935 DML/RONIT/DML-getMethylDiff-Cov10-20-downstream-Ronit-PLOIDY.bed
1207 DML/RONIT/DML-getMethylDiff-Cov10-20-flanks-Ronit-PLOIDY.bed

Running this code on all the outputs from all the datasets analyzed, I generated stacked bar plots linking at the proportion of CpG and DMLs within each genome feature, using the following R code:

#####################################################################
### DML-genome-feature-location Stacked bar plot

library(readxl)
library(ggplot2)
library(RColorBrewer)

data         <- read_excel("DML_locations-all-datasets.xlsx", sheet = "plot", col_names = TRUE)
data$feature <- factor(data$feature, levels = c("exon_UTR",
                                                "CDS",
                                                "intron",
                                                "upstream_flank",
                                                "downstream_flank",
                                                "lncRNA",
                                                "TE",
                                                "intergenic"), ordered=TRUE)

# Plot
p1 <- ggplot(data = data, aes(y = B_combo_contrast, x = 'X', fill = feature)) +
      geom_bar(stat  = "identity",
               color = "black",
               #fill  = rep(v,length.out=18),
               size  = 1) +
      scale_fill_brewer(palette = "YlGnBu") +
      scale_y_continuous(limits = c(0,100), breaks = seq(0,100,20)) +
      theme( line              = element_line(size=1.5),
             rect              = element_rect(size=1.5),
             text              = element_text(size=22,color="black"),
             panel.background  = element_blank(),
             panel.grid.major  = element_blank(),
             panel.grid.minor  = element_blank(),
             axis.text.x       = element_text(size=22,color="black"),
             axis.text.y       = element_text(size=22,color="black"),
             # axis.ticks.y      = element_blank(),
             axis.title.x      = element_blank(),
             axis.title.y      = element_blank(),
             axis.line         = element_line(color = "black", size = 0.5),
             panel.border      = element_rect(color = "black", fill=NA, size=2))

p1

ggsave("B_combo_contrast.png",
       plot   = p1,
       dpi    = 1200,
       device = "png",
       width  = 4.8,
       height = 6,
       units  = "in")

##########################################################################

Here is the prior table for reference:

Table 1. DML counts from each analysis

Here are the plots:

Figure 1. Each dataset separately

Figure 2. Both datasets together