Load required libraries

library(methrix)
#> Loading required package: data.table
#> Loading required package: SummarizedExperiment
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:data.table':
#> 
#>     first, second
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> 
#> Attaching package: 'IRanges'
#> The following object is masked from 'package:data.table':
#> 
#>     shift
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: DelayedArray
#> Loading required package: matrixStats
#> 
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#> 
#>     anyMissing, rowMedians
#> 
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#> 
#>     aperm, apply, rowsum
library(GEOquery) #Only used for downloading example dataset 
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)

Download bedgraph files

The example data analysis presented here uses the following publicly available whole genome bisulfite sequencing data from GEO: GSE90553

data_dir = "./GSE90553/"
dir.create(path = data_dir, showWarnings = FALSE, recursive = TRUE)
#Download the files
lapply(c("GSM3039352", "GSM3039353", "GSM3039354", "GSM3039348", "GSM3039349", "GSM3039350"), function(gsm){
  filePaths <-  GEOquery::getGEOSuppFiles(GEO = gsm, baseDir = data_dir, makeDirectory = TRUE)  
})

Above command should download files to data_dir.

bed_files <- list.files(
  path = data_dir,
  pattern = "*bed\\.gz$",
  full.names = TRUE, 
  recursive = TRUE
)

print(basename(bed_files))
#> [1] "GSM3039348_BiSeq_WT_D14_R1.bed.gz" "GSM3039349_BiSeq_WT_D14_R2.bed.gz"
#> [3] "GSM3039350_BiSeq_WT_D14_R3.bed.gz" "GSM3039352_BiSeq_WT_D6_R1.bed.gz" 
#> [5] "GSM3039353_BiSeq_WT_D6_R2.bed.gz"  "GSM3039354_BiSeq_WT_D6_R3.bed.gz"

Importing with methrix

CpG annotation

As a first step, we need a list of CpG sites in the respective reference genome. The CpG sites are extracted using the respective Bsgenome annotation package. The read_bedgraph function is also able to extract CpG sites on it’s own, however, it might be beneficial to do it separately.

hg19_cpgs <- methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> -Extracting CpGs
#> -Done. Extracted 29,891,155 CpGs from 298 contigs.

hg19_cpgs is a list containing the position of all CpG sites in the genome (1 based coordinates), chromosome lengths, and the genome build used.

Sample annotation

An annotation table is also necessary to perform analyses. The data will be added to the methrix object, as a colData slot.

sample_anno <- data.frame(
  row.names = gsub(
    pattern = paste0("GSM[[:digit:]]*_BiSeq_(WT_D[[:digit:]]{1,2}_R[[:digit:]]).bed.gz"),
    replacement = "\\1",
    x = basename(bed_files)
    ),
  Day = gsub(
    pattern = paste0("GSM[[:digit:]]*_BiSeq_WT_(D[[:digit:]]{1,2})_R[[:digit:]].bed.gz"),
    replacement = "\\1",
    x = basename(bed_files)
    ),
  Replicate = gsub(
    pattern = paste0("GSM[[:digit:]]*_BiSeq_WT_D[[:digit:]]{1,2}_(R[[:digit:]]).bed.gz"),
    replacement = "\\1",
    x = basename(bed_files)
    ),
  stringsAsFactors = FALSE
)

print(sample_anno)
#>           Day Replicate
#> WT_D14_R1 D14        R1
#> WT_D14_R2 D14        R2
#> WT_D14_R3 D14        R3
#> WT_D6_R1   D6        R1
#> WT_D6_R2   D6        R2
#> WT_D6_R3   D6        R3

Reading bedGraph files

read_bedgraphs() function is a versatile bedgraph reader intended to import bedgraph files generated virtually by any sort of methylation calling program. It requires user to provide indices for column containing chromosome names, start position and other required fields. However, there are also presets available to import bedgraphs from most common programs such as

In case if the bedGraphs are generated using any of the above programs, use pipiline argument in read_bedgraphs() to specify - which causes the function to automatically infer/parse the required columns.

For the example in hand, since we do not know the program used to generate the files, it might worth to take a look at one of the files, and manually specify the column indices.

data.table::fread(bed_files[1], nrows = 5)
#>    #chrom start   end ratio totalC methC strand next Plus totalC methC Minus
#> 1:     10 60024 60026   0.5      2     1      +    G    +      2     1     -
#> 2:     10 60088 60090   1.0      2     2      +    G    +      2     2     -
#> 3:     10 60108 60110   1.0      3     3      +    G    +      3     3     -
#> 4:     10 60139 60141   1.0      6     6      +    G    +      6     6     -
#> 5:     10 60141 60143   1.0      7     7      +    G    +      7     7     -
#>    totalC methC localSeq
#> 1:      0     0    ATCGG
#> 2:      0     0    TGCGT
#> 3:      0     0    ATCGT
#> 4:      0     0    GACGC
#> 5:      0     0    CGCGC

It seems first three columns are standard BED format. Beta values, read counts for total depth, methylated Cs are in 4th, 5th and 6th column respectively. We can pass this information to read_bedgraphs() which in-turn parses the required information.

In addition read_bedgraphs function also adds CpGs missing from the reference set, and creates a methylation/coverage matrices. Once the process is complete - it returns an object of class methrix which in turn inherits SummarizedExperiment class. methrix object contains ‘methylation’ and ‘coverage’ matrices (either in-memory or as on-disk HDF5 arrays) along with pheno-data and other basic info. This object can be passed to all downstream functions for various analysis. For further details on the data structure, see the SummarizedExperiment package.

Tips:

  • For computers with large memory, increase the speed using vect=TRUE. The function will open multiple files, defined by the vect_batch_size argument. Keep in mind, that increasing the batch size will increase the memory need as well. Furthermore n_threads will distribute reading function across multiple workers to speed up the process.

  • If vect=FALSE, the function will iterate along the list of files, keeping only one file open at a time - which helps to reduce the memory footprint.

meth <- methrix::read_bedgraphs(
    files = bed_files,
    ref_cpgs = hg19_cpgs,
    chr_idx = 1,
    start_idx = 2,
    beta_idx = 4,
    cov_idx = 5, M_idx = 6,
    stranded = FALSE,
    zero_based = TRUE, 
    collapse_strands = FALSE, 
    coldata = sample_anno,
    vect = TRUE, n_threads = 3
)
meth
#> An object of class methrix
#>    n_CpGs: 28,217,448
#> n_samples: 6
#>     is_h5: FALSE
#> Reference: hg19

Gotchas!

In case if you see notes about too many missing CpGs while running read_bedgraphs() - it might be due to one of the following reasons.

  • Obvious reason would be your library size is not large enough to cover entire ref. genome in which case you can ignore the notes.

  • Another obvious mistake is using the wrong reference genome.

  • A common mistake is falsely indicating strandedness and zero or one based positioning. The combination of the two might lead to only using C-s from one strand (due to a 1 bp shift), resulting lower coverage for the sites. A nice description of the zero- and one-based coordination system can be found here

  • A more detailed description of the read_bedgraphs() function is here