Introduction

The core functionality of methrix is a generic reader function for virtually every possible format of single sample bed, bedGraph or txt files.

This description helps the users to find the correct settings for their own files.

Generic settings

#Load library
library(methrix)

Files and annotation

A character list of the files with their paths’.


bdg_files <- list.files(
  path = system.file('extdata', package = 'methrix'),
  pattern = "*bedGraph\\.gz$",
  full.names = TRUE
)

The sample annotation is a data.frame containing all annotations we would like to have in the final object. The rownames will be used as sample names.

** The sample annotation and the file list should have the same order! **

Since the filenames and the sample names can be different, there is no way to check if the order is correct during reading in. Therefore the user has to take care that they are in the correct order.

#Generate some sample annotation table
sample_anno <- data.frame(
  row.names = gsub(
    pattern = "\\.bedGraph\\.gz$",
    replacement = "",
    x = basename(bdg_files)
  ),
  Condition = c("cancer", 'cancer', "normal", "normal"),
  Pair = c("pair1", "pair2", "pair1", "pair2"),
  stringsAsFactors = FALSE
)

print(sample_anno)

The sample names will be changed if they contain not allowed characters (e.g. space) or don’t follow the R naming conventions (e.g. starts with number). In this case a warning will appear during read in.

It is possible not to provide annotation data. In this case the sample names will be converted from the filenames and the colData slot of the methrix object will be empty.

Reference data

methrix is using BSgenome packages to find the reference position of the CpG sites in the genome. This package will not only read in the measured methylation values, but fills it into an object that contains all possible CpGs on the genome. This also allows to perform genome-wide statistics for example on covergage. The relevant BSgenome package has to be installed. The user can define the name of the BSgenome package or use the extract_CPGs function with the given genome and provide the resulting object as reference.

The argument ref_build is an additional argument for the user’s information. It doesn’t affect the reading functionality. Optional.

#Genome of your preference to work with
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

library(BiocManager)

if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) {
BiocManager::install("BSgenome.Hsapiens.UCSC.hg19")
}
library(BSgenome.Hsapiens.UCSC.hg19) 
#First extract genome wide CpGs from the desired reference genome
hg19_cpgs <- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
#Read the files 
meth <- methrix::read_bedgraphs(
  files = bdg_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 3,
  U_idx = 4,
  stranded = FALSE,
  zero_based = FALSE, 
  coldata = sample_anno
)
#or just use the name of the BSgenome package

#Read the files 
#meth <- methrix::read_bedgraphs(
#  files = bdg_files,
#  ref_cpgs = "BSgenome.Hsapiens.UCSC.hg19",
#  chr_idx = 1,
#  start_idx = 2,
#  M_idx = 3,
#  U_idx = 4,
#  stranded = FALSE,
#  zero_based = FALSE, 
#  collapse_strands = FALSE, 
#  coldata = sample_anno
#)

The contigs argument restricts the coontigs used during read in. Defaults to all autosomes and sex chromosomes. Please keep in mind that this function was not tested will all possible organisms.

meth <- methrix::read_bedgraphs(
  files = bdg_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 3,
  U_idx = 4,
  contigs = "chr21",
  stranded = FALSE,
  zero_based = FALSE, 
  coldata = sample_anno
)

Changing sample names:

#Generate some sample annotation table
sample_anno <- data.frame(
  row.names = gsub(
    pattern = "\\.bedGraph\\.gz$",
    replacement = "",
    x = basename(bdg_files)
  ),
  Condition = c("cancer", 'cancer', "normal", "normal"),
  Pair = c("pair1", "pair2", "pair1", "pair2"),
  stringsAsFactors = FALSE
)
rownames(sample_anno) <- paste0("1 ", rownames(sample_anno))

print(sample_anno)

meth <- methrix::read_bedgraphs(
  files = bdg_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 3,
  U_idx = 4,
  stranded = FALSE,
  zero_based = FALSE, 
  coldata = sample_anno
)

Predifined pipelines

methrix automatically assigns the column layout for many widely used software outputs. Currently supported:

If these pipelines are used, the user doesn’t have to define the following arguments: chr_idx, start_idx, end_idx, beta_idx, M_idx, U_idx, strand_idx, cov_idx

If they are still defined, they won’t be taken into account.

idx columns

  • Required: chr_idx and start_idx for identifying the position.

  • A combination of M (count of methylated reads), U (count of unmethylated reads), beta (beta methylation value) and coverage (read count) so that all the values can be calculated. Accepted combinations:

                                      * M and U
                                      * M and coverage 
                                      * U and coverage
                                      * beta and coverage

The beta value can be either percentage or ratio. The function will check the first lines and will convert it to ratio.

Strandedness

It is very important to correctly set the strandedness in order to have the correct methylation values.

Arguments:

  • stranded Is the data in stranded format? If for each CpG position there are two values, the data is stranded. In many cases the strand information is available (+ or -).
#chr21 27866423  2 8 +
#chr21 27866424  1 3 -
#chr21 27866921  2  2 +
#chr21 27866923  3  2 -
#chr21 27867197  1  2 +
#chr21 27867198  1  7 -

In some cases the strand information is not available, but one can infer from the position:

#chr21 27866423  2 8 
#chr21 27866424  1 3 
#chr21 27866921  2  2 
#chr21 27866923  3  2 
#chr21 27867197  1  2 
#chr21 27867198  1  7 
  • collapse_strands If TRUE, the read in function will collapse the reads and summarize their methylation value. Usually, this option is set to TRUE, except in some special cases.

  • synced_coordinates In some rare cases, the coordinates are the same for the + and - strands with no strand information available:


#chr21 27866423  27866425 2   8 
#chr21 27866423  27866425 1   3 
#chr21 27866921  27866923 2   2 
#chr21 27866921  27866923 3   2 
#chr21 27867197  27867198 1   2 
#chr21 27867197  27867198 1   7 

Specific settings

During read in the user have to decide on parameters that are mostly affected by the available resources. If the study population is not so big or there is ample of RAM available, one can go for using in-memory object and read in in batches, as well as parallel processing. This is the fastest way.

  • vect if one chooses the vectorized processing, the data will be processed in batches: the read in is not performed one by one, but is done on multiple files at once.
  • vect_batch_size defines the number of samples that should be processed at the same time. It is defined by the available memory. If e.g. vect_batch_size=5, 5 bedgraph files will be in the memory at the same time.
  • n_threads not available on Windows. The reading will be separated on multiple cores. Be careful: it will cause a linear increase in memory usage.

HDF5 backend

If the memory is limited and/or the study population is large, there is an option to use HDF5 backend.
Only one bedgraph file is in the memory at the same time, while the resulting object won’t be stored in the memory, but on-disk. Additional arguments to use HDF5:

  • Set h5=TRUE
  • Set vect=FALSE
  • h5_dir –> a directory to save the final object. It is possible to save the object later. It increases the processing time significantly.
  • h5temp –> a temporary directory to use during data processing. Set this if for example the default temporary location doesn’t have enough free space to store the temporary data.

meth <- methrix::read_bedgraphs(
  files = bdg_files,
  ref_cpgs = hg19_cpgs,
  chr_idx = 1,
  start_idx = 2,
  M_idx = 3,
  U_idx = 4,
  stranded = FALSE,
  zero_based = FALSE, 
  coldata = sample_anno,
  vect = FALSE,
  h5 = TRUE)