07-Reader_FAQ.Rmd
The core functionality of methrix
is a generic reader function for virtually every possible format of single sample bed, bedGraph or txt files.
This description helps the users to find the correct settings for their own files.
#Load library library(methrix)
A character list of the files with their paths’.
bdg_files <- list.files( path = system.file('extdata', package = 'methrix'), pattern = "*bedGraph\\.gz$", full.names = TRUE )
The sample annotation is a data.frame
containing all annotations we would like to have in the final object. The rownames will be used as sample names.
** The sample annotation and the file list should have the same order! **
Since the filenames and the sample names can be different, there is no way to check if the order is correct during reading in. Therefore the user has to take care that they are in the correct order.
#Generate some sample annotation table sample_anno <- data.frame( row.names = gsub( pattern = "\\.bedGraph\\.gz$", replacement = "", x = basename(bdg_files) ), Condition = c("cancer", 'cancer', "normal", "normal"), Pair = c("pair1", "pair2", "pair1", "pair2"), stringsAsFactors = FALSE ) print(sample_anno)
The sample names will be changed if they contain not allowed characters (e.g. space) or don’t follow the R naming conventions (e.g. starts with number). In this case a warning will appear during read in.
It is possible not to provide annotation data. In this case the sample names will be converted from the filenames and the colData
slot of the methrix
object will be empty.
methrix
is using BSgenome
packages to find the reference position of the CpG sites in the genome. This package will not only read in the measured methylation values, but fills it into an object that contains all possible CpGs on the genome. This also allows to perform genome-wide statistics for example on covergage. The relevant BSgenome
package has to be installed. The user can define the name of the BSgenome
package or use the extract_CPGs
function with the given genome and provide the resulting object as reference.
The argument ref_build
is an additional argument for the user’s information. It doesn’t affect the reading functionality. Optional.
#Genome of your preference to work with if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") library(BiocManager) if(!requireNamespace("BSgenome.Hsapiens.UCSC.hg19")) { BiocManager::install("BSgenome.Hsapiens.UCSC.hg19") } library(BSgenome.Hsapiens.UCSC.hg19)
#First extract genome wide CpGs from the desired reference genome hg19_cpgs <- suppressWarnings(methrix::extract_CPGs(ref_genome = "BSgenome.Hsapiens.UCSC.hg19"))
#Read the files meth <- methrix::read_bedgraphs( files = bdg_files, ref_cpgs = hg19_cpgs, chr_idx = 1, start_idx = 2, M_idx = 3, U_idx = 4, stranded = FALSE, zero_based = FALSE, coldata = sample_anno ) #or just use the name of the BSgenome package #Read the files #meth <- methrix::read_bedgraphs( # files = bdg_files, # ref_cpgs = "BSgenome.Hsapiens.UCSC.hg19", # chr_idx = 1, # start_idx = 2, # M_idx = 3, # U_idx = 4, # stranded = FALSE, # zero_based = FALSE, # collapse_strands = FALSE, # coldata = sample_anno #)
The contigs
argument restricts the coontigs used during read in. Defaults to all autosomes and sex chromosomes. Please keep in mind that this function was not tested will all possible organisms.
meth <- methrix::read_bedgraphs( files = bdg_files, ref_cpgs = hg19_cpgs, chr_idx = 1, start_idx = 2, M_idx = 3, U_idx = 4, contigs = "chr21", stranded = FALSE, zero_based = FALSE, coldata = sample_anno )
Changing sample names:
#Generate some sample annotation table sample_anno <- data.frame( row.names = gsub( pattern = "\\.bedGraph\\.gz$", replacement = "", x = basename(bdg_files) ), Condition = c("cancer", 'cancer', "normal", "normal"), Pair = c("pair1", "pair2", "pair1", "pair2"), stringsAsFactors = FALSE ) rownames(sample_anno) <- paste0("1 ", rownames(sample_anno)) print(sample_anno) meth <- methrix::read_bedgraphs( files = bdg_files, ref_cpgs = hg19_cpgs, chr_idx = 1, start_idx = 2, M_idx = 3, U_idx = 4, stranded = FALSE, zero_based = FALSE, coldata = sample_anno )
methrix
automatically assigns the column layout for many widely used software outputs. Currently supported:
MethylDackel - default MethylDackel extract output
“BSseeker2_CGmap”, BSseeker2 used with CGmapTools
If these pipelines are used, the user doesn’t have to define the following arguments: chr_idx
, start_idx
, end_idx
, beta_idx
, M_idx
, U_idx
, strand_idx
, cov_idx
If they are still defined, they won’t be taken into account.
Required: chr_idx
and start_idx
for identifying the position.
A combination of M (count of methylated reads), U (count of unmethylated reads), beta (beta methylation value) and coverage (read count) so that all the values can be calculated. Accepted combinations:
* M and U
* M and coverage
* U and coverage
* beta and coverage
The beta value can be either percentage or ratio. The function will check the first lines and will convert it to ratio.
It is very important to correctly set the strandedness in order to have the correct methylation values.
Arguments:
stranded
Is the data in stranded format? If for each CpG position there are two values, the data is stranded. In many cases the strand information is available (+ or -).#chr21 27866423 2 8 + #chr21 27866424 1 3 - #chr21 27866921 2 2 + #chr21 27866923 3 2 - #chr21 27867197 1 2 + #chr21 27867198 1 7 -
In some cases the strand information is not available, but one can infer from the position:
#chr21 27866423 2 8 #chr21 27866424 1 3 #chr21 27866921 2 2 #chr21 27866923 3 2 #chr21 27867197 1 2 #chr21 27867198 1 7
collapse_strands
If TRUE, the read in function will collapse the reads and summarize their methylation value. Usually, this option is set to TRUE, except in some special cases.
synced_coordinates
In some rare cases, the coordinates are the same for the + and - strands with no strand information available:
#chr21 27866423 27866425 2 8 #chr21 27866423 27866425 1 3 #chr21 27866921 27866923 2 2 #chr21 27866921 27866923 3 2 #chr21 27867197 27867198 1 2 #chr21 27867197 27867198 1 7
During read in the user have to decide on parameters that are mostly affected by the available resources. If the study population is not so big or there is ample of RAM available, one can go for using in-memory object and read in in batches, as well as parallel processing. This is the fastest way.
vect
if one chooses the vectorized processing, the data will be processed in batches: the read in is not performed one by one, but is done on multiple files at once.vect_batch_size
defines the number of samples that should be processed at the same time. It is defined by the available memory. If e.g. vect_batch_size=5
, 5 bedgraph files will be in the memory at the same time.n_threads
not available on Windows. The reading will be separated on multiple cores. Be careful: it will cause a linear increase in memory usage.If the memory is limited and/or the study population is large, there is an option to use HDF5 backend.
Only one bedgraph file is in the memory at the same time, while the resulting object won’t be stored in the memory, but on-disk. Additional arguments to use HDF5:
h5=TRUE
vect=FALSE
meth <- methrix::read_bedgraphs( files = bdg_files, ref_cpgs = hg19_cpgs, chr_idx = 1, start_idx = 2, M_idx = 3, U_idx = 4, stranded = FALSE, zero_based = FALSE, coldata = sample_anno, vect = FALSE, h5 = TRUE)