Working with HDF5-based objects

read_bedgraphs function

For memory-efficient reading, one can use an HDF5-based methrix object. Only one bedGraph file is held in memory at a time, and the resulting object is stored on disk rather than in memory.

Additional arguments for using HDF5:

  • Set h5=TRUE
  • Set vect=FALSE
  • h5_dir –> a directory in which to save the final object. The object can also be saved later, but writing it to disk increases the processing time significantly.
  • h5temp –> a temporary directory to use during data processing. Set this if, for example, the default temporary location does not have enough free space to hold the temporary data.
# read the bedGraph files into an HDF5-based methrix object
meth <- methrix::read_bedgraphs(
  files = bed_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 5,
  U_idx = 6,
  stranded = FALSE,
  zero_based = TRUE, 
  collapse_strands = FALSE,  
  coldata = sample_anno,
  vect = FALSE,
  h5 = TRUE)
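
If the storage locations need to be controlled explicitly, the same call can also set h5_dir and h5temp (the paths below are placeholders):

meth <- methrix::read_bedgraphs(
  files = bed_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 5,
  U_idx = 6,
  stranded = FALSE,
  zero_based = TRUE,
  collapse_strands = FALSE,
  coldata = sample_anno,
  vect = FALSE,
  h5 = TRUE,
  h5_dir = "/path/to/final_object",  # placeholder: where the final object is written
  h5temp = "/path/to/scratch_space") # placeholder: temporary directory with enough free space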

All methrix functions work with HDF5-based objects as well; there is no difference in how the functions are used.

meth <- methrix::remove_uncovered(meth)

It is also possible to convert non-HDF5-based objects to HDF5-based ones and back:


# convert an HDF5-based object to an in-memory methrix object
m <- convert_HDF5_methrix(m = meth)
# convert an in-memory methrix object back to an HDF5-based one
m2 <- convert_methrix(m = m)

Saving and loading

Saving and loading an HDF5-based object is not possible with the standard save or saveRDS functions. methrix offers easy-to-use saving and loading tools, which are essentially wrappers around the saveHDF5SummarizedExperiment and loadHDF5SummarizedExperiment functions.


target_dir <- paste0(getwd(), "/temp/")
save_HDF5_methrix(meth, dir = target_dir, replace = TRUE)

meth <- load_HDF5_methrix(dir = target_dir)
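
The loaded object behaves like any other methrix object; for instance, summary statistics can be computed directly from the on-disk data (a minimal check using methrix's get_stats):

# per-chromosome, per-sample methylation and coverage statistics
methrix::get_stats(meth)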

Working with a large number of samples

The primary goal of methrix is to allow users to handle whole-genome methylation data. The functions are optimized to keep speed high and memory requirements low. However, additional effort was put into allowing methrix to handle a large number of samples (even >100) in the same efficient way. Therefore, many functions implement the argument n_chunks, to split these datasets into digestible chunks, and n_cores, to parallelize the processing of these chunks.

Functions currently supporting both the n_chunks and n_cores arguments: coverage_filter and get_region_summary. The remove_snps and mask_methrix functions support only the n_cores argument.
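
For example, a coverage filter on a large dataset might be run in chunks like this (a sketch; the cov_thr and min_samples values are illustrative):

meth <- methrix::coverage_filter(m = meth, cov_thr = 2, min_samples = 1,
                                 n_chunks = 4, n_cores = 1) # n_cores > 1 processes chunks in parallel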

The multicore option is not available on Windows.

if (grepl("Windows", Sys.getenv("OS"))) {
  res <- get_region_summary(meth, regions = dmrs[1:5], n_chunks = 2, n_cores = 1)
} else {
  res <- get_region_summary(meth, regions = dmrs[1:5], n_chunks = 2, n_cores = 2)
}

res