diff --git a/vignettes/SE-read-summary-level.png b/vignettes/SE-read-summary-level.png new file mode 100755 index 0000000..b95db23 Binary files /dev/null and b/vignettes/SE-read-summary-level.png differ diff --git a/vignettes/SingleMoleculeGenomicsIO.Rmd b/vignettes/SingleMoleculeGenomicsIO.Rmd index 36e2846..7dc7df1 100644 --- a/vignettes/SingleMoleculeGenomicsIO.Rmd +++ b/vignettes/SingleMoleculeGenomicsIO.Rmd @@ -21,11 +21,28 @@ knitr::opts_chunk$set( `SingleMoleculeGenomicsIO` provides tools for reading and processing single-molecule genomics data in R. These include functions for reading base -modifications from `modBam` files, e.g. generated by -[Dorado](https://github.com/nanoporetech/dorado), -or text files generated by [modkit](https://nanoporetech.github.io/modkit), -efficient representation of such data as R objects, and functions to manipulate -and summarize such objects. +modifications from `modBam` files (standard bam files with modification data +stored in `MM` and `ML` tags, see +[SAMtags.pdf](https://samtools.github.io/hts-specs/SAMtags.pdf)), +e.g. generated from Oxford Nanopore or PacBio data using tools like +[Dorado](https://github.com/nanoporetech/dorado) or equivalent, +from text files generated by +[modkit](https://nanoporetech.github.io/modkit), or from `bam` files where +modification information is encoded via sequence mismatches (e.g., resulting +from bisulfite sequencing or deamination-based protocols). +The package also provides an efficient representation of such data as R objects +(based on the widely used `SummarizedExperiment` class), and functions to +manipulate and summarize such objects. +`SingleMoleculeGenomicsIO` does not provide functions for downstream analyses +(e.g. differential footprint identification or nucleosome analyses), or +visualization. Such functionality can for example be found in +[footprintR](https://fmicompbio.github.io/footprintR/) or +[nomeR](https://github.com/fmi-basel/compbio-nomeR). + +In this vignette, we will focus on reading data from these different formats +and outline how to process the obtained data, e.g. to filter out +low-quality reads, subset to specific sequence contexts, or regroup the reads +according to some annotation of interest. Current contributors include: @@ -46,58 +63,111 @@ if (!requireNamespace("BiocManager", quietly = TRUE)) BiocManager::install("fmicompbio/SingleMoleculeGenomicsIO") ``` -# Load data +# Load required packages -## Load read-level data - -We start by loading single-molecule genomics data at the level of individual -reads (a small example dataset with 6mA modifications representing -accessibility). Read-level data is provided as a `modBam` -file (standard bam file with modification data stored in `MM` and `ML` tags, -see [SAMtags.pdf](https://samtools.github.io/hts-specs/SAMtags.pdf)) and is -read using the `readModBam()` function from `SingleMoleculeGenomicsIO`. - -Alternatively, read-level modification data can also be extracted from -`modBam` files using [modkit](https://nanoporetech.github.io/modkit) (see -`readModkitExtract()` to read the output generated by `modkit extract`). - -The small example `modBam` files contained in `SingleMoleculeGenomicsIO` were -generated using the [Dorado](https://github.com/nanoporetech/dorado) aligner: - -```{r data-files, message=FALSE} -# load packages +```{r, message=FALSE} library(SingleMoleculeGenomicsIO) library(SummarizedExperiment) library(Biostrings) library(SparseArray) library(BiocParallel) +``` + +# Data representation + +The figure below illustrates the basic structure of the data container used +within `SingleMoleculeGenomicsIO`. All reading functions described in the +next section generate a container of this type. + +![](SE-read-summary-level.png) + +As we see, the data is stored in a `RangedSummarizedExperiment` object. This is +a standard Bioconductor container that can contain one or more assays (storing +the observed data), together with annotations for rows (genomic positions) and +columns (samples or individual reads), as well as general metadata in a list +format. + +The assay data in the object can be either on the individual read level, or +represent data summarized across all reads for a given sample. As we will see +below, an object can contain both read-level and summary-level data. + +# Load data from `modBam` files + +In this section we will mainly focus on reading data from `modBam` files, and +explore the generated object. For more information about reading other types of +supported files (see above), we refer to section \@ref(other-formats). +As mentioned, regardless of the reading function that is used, the resulting +`SummarizedExperiment` object will have the same format. + +## Load read-level data + +A small example dataset with 6mA modifications representing +accessibility, in the form of two `modBam` files generated using the +[Dorado](https://github.com/nanoporetech/dorado) aligner, is provided with +`SingleMoleculeGenomicsIO`. Here, we read this into R using the `readModBam()` +function. Typically, only a subset of the full `modBam` file, corresponding +to a specific genomic region of interest, is loaded. -# read-level 6mA data generated by 'dorado' +As these `modBam` files represent 6mA modifications, we set `modbase = "a"` +in the `readModBam()`. For a full list of +possible modifications, see the section about base modifications in the +[SAM tag specification](https://samtools.github.io/hts-specs/SAMtags.pdf)). +We also provide a sequence reference (genome sequence) from which to extract +the genomic sequence context for modified bases. This can be helpful during +analysis, e.g. to filter out observations that are likely to represent +sequencing error. + + +```{r data-files} +# read-level 6mA data generated by Dorado modbamfiles <- system.file("extdata", c("6mA_1_10reads.bam", "6mA_2_10reads.bam"), package = "SingleMoleculeGenomicsIO") names(modbamfiles) <- c("sample1", "sample2") + +# file with sequence of the reference genome in fasta format +reffile <- system.file("extdata", "reference.fa.gz", + package = "SingleMoleculeGenomicsIO") ``` -Modification data is read from these files using `readModBam()`: ```{r read-data} se <- readModBam(bamfiles = modbamfiles, regions = "chr1:6940000-7000000", modbase = "a", verbose = TRUE, + sequenceContextWidth = 1, + sequenceReference = reffile, BPPARAM = BiocParallel::SerialParam()) se ``` -This will create a `RangedSummarizedExperiment` object with positions in rows: +This will create a `RangedSummarizedExperiment` object with rows corresponding +to individual genomic positions: ```{r row-data} # rows are positions... rowRanges(se) ``` -The columns of `se` correspond to samples: +The positions are stranded, as base modifications can be observed on either of +the two strands: + +```{r} +table(strand(se)) +``` + +We notice that (as expected) most of the positions for which we obtain a +modification value indeed correspond to an 'A' in the genome sequence. +Discrepancies here may arise from, e.g., sequencing errors or differences +between the genome sequence of the studied subjects and the reference sequence. + +```{r} +table(rowRanges(se)$sequenceContext) +``` + +The columns of `se` correspond to samples (recall that in our case we loaded +two `modBam` files): ```{r column-data} # ... and columns are samples @@ -116,6 +186,13 @@ element per sample: se$readInfo ``` +In general, columns in `colData` that provide read-level information can be +obtained using `getReadLevelColDataNames()`: + +```{r coldata-read-level-colnames} +getReadLevelColDataNames(se) +``` + Each sample typically contains several reads (see `se$n_reads`), which correspond to the rows of `se$readInfo`, for example for `"sample1"`: @@ -125,7 +202,8 @@ se$readInfo$sample1 ## Explore assay data -The single assay `mod_prob` is a `DataFrame` with modification probabilities. +At this point, the `RangedSummarizedExperiment` object contains a single assay +named `mod_prob`, which is a `DataFrame` with modification probabilities. ```{r mod-prob-assay} assayNames(se) @@ -154,7 +232,14 @@ ncol(m$sample2) # 2 reads in "sample2" ncol(as.matrix(m)) # 5 reads in total ``` -One advantage of the grouping of reads per sample is that you can easily +In general, assays that provide read-level information can be +obtained using `getReadLevelAssayNames()`: + +```{r se-read-level-assaynames} +getReadLevelAssayNames(se) +``` + +One advantage of the grouping of reads per sample is that we can easily perform per-sample operations using `lapply` (returns a `list`) or `endoapply` (returns a `DataFrame`): @@ -163,8 +248,8 @@ lapply(m, ncol) endoapply(m, ncol) ``` -The `NaMatrix` objects do not store `NA` values and thus use much less -memory compared to a normal (dense) matrix. The `NA` values here occur +The `NaMatrix` objects do not store `NA` values explicitly and thus use much +less memory compared to a normal (dense) matrix. The `NA` values here occur at positions (rows) that are not covered by the read of that column, and there are typically a large fraction of NA values: @@ -179,9 +264,10 @@ using `mean(..., na.rm = TRUE)`, otherwise the result will be `NA` for all incompletely covered positions: ```{r mod-prob-zeros} -# modification probabilities at position "chr1:6928850:-" +# find row corresponding to position "chr1:6928850:-" (idx <- which(seqnames(se) == "chr1" & start(se) == 6928850 & strand(se) == "-")) +# modification probabilities at the given position m[idx, ] # WRONG: take the mean of all values (including NAs) @@ -202,11 +288,15 @@ above for illustration. ## Summarize read-level data Summarized data can be obtained from the read-level data by calculating -a per-position summary over the reads in each sample: +a per-position summary over the reads in each sample, and are stored as +additional assays in the resulting object. The `flattenReadLevelAssay()` +function can calculate several relevant summaries: ```{r summarise-read-level} se_summary <- flattenReadLevelAssay( se, statistics = c("Nmod", "Nvalid", "FracMod", "Pmod", "AvgConf")) +# new assays are added to the object +assayNames(se_summary) ``` As discussed above, this will correctly exclude non-observed values @@ -234,7 +324,10 @@ probability (`Pmod`) and the average confidence of the modification calls per position (`AvgConf`). As we have not changed the default `keepReads = TRUE`, we keep the read-level assay from the input object (`mod_prob`) in the returned object. Because of the grouping of reads by sample in `mod_prob`, the dimensions -of read-level or summary-level objects remain identical: +of read-level or summary-level objects remain identical. However, as the +summary-level assays only contain a single value per position and sample, they +are regular matrices rather than the hierarchical `DataFrames` used for the +read-level data. ```{r collapsed-assays} # read-level data is retained in "mod_prob" assay @@ -246,19 +339,20 @@ assay(se_summary, "mod_prob") # the dimensions of read-level `se` and summarized `se_summary` are identical dim(se) dim(se_summary) + +# summary-level assays are regular matrices +class(assay(se_summary, "FracMod")) ``` ## Load summary-level data As illustrated above, summary-level data (e.g. average modification fraction at -each position) can be calculated from read-level data. Alternatively, it can -also be directly read from `modBam` files, which is faster and reduces the -size of the object in memory. In addition, there are other file formats that -only store summary-level data (e.g. `bed` files) that can also be imported -using `SingleMoleculeGenomicsIO`. - -### Read summary-level data from `modBam` files +each position) can be calculated from read-level data. Alternatively, reads can +also be directly summarized while reading from `modBam` files, which is faster +and reduces the size of the object in memory. In addition, there are other file +formats that only store summary-level data (e.g. `bed` files) that can also be +imported using `SingleMoleculeGenomicsIO` (see below). To read summary-level data with `readModBam()`, we set the `level` argument to `"summary"`: @@ -273,8 +367,8 @@ se_summary2 <- readModBam(bamfiles = modbamfiles, se_summary2 ``` -Notice that `se_summary2` contains only summary-level assays (`Nmod`, `Nvalid` and -`FracMod`), and no read-level modifications anymore (assay `mod_prob`). +Notice that `se_summary2` contains only summary-level assays (`Nmod`, `Nvalid` +and `FracMod`), and no read-level modifications anymore (assay `mod_prob`). Otherwise, it resembles the read-level object `se`, for example it covers the same genomic positions: @@ -292,105 +386,176 @@ identical(assay(se_summary, "Nvalid"), assay(se_summary2, "Nvalid")) ``` -### Read summary-level data from `bedMethyl` files - -As mentioned, summary-level data can also be imported from `bedMethyl` files -that contain total and methylated counts for individual positions:: +## Load data for a random subset of the reads -```{r data-files-summary} -# collapsed 5mC data for a small genomic window in bedMethyl format -bedmethylfiles <- system.file("extdata", - c("modkit_pileup_5mC_1.bed.gz", - "modkit_pileup_5mC_2.bed.gz"), - package = "SingleMoleculeGenomicsIO") +By default, `readModBam()` will load all reads overlapping the regions +indicated by the `regions` argument. However, it is also possible to load a +random subset of reads (this can be useful e.g. to calculate distributions +of quality scores, using reads aligning to different parts of the genome but +without the need to load the entire `modBam` file into R). This is +achieved by setting the `nAlnsToSample` argument to a non-zero value +(representing the desired number of reads per sample), and indicating which +reference sequences to sample reads from. -# file with sequence of the reference genome in fasta format -reffile <- system.file("extdata", "reference.fa.gz", - package = "SingleMoleculeGenomicsIO") +```{r sample-reads} +se_sample <- readModBam(bamfiles = modbamfiles, + modbase = "a", + nAlnsToSample = 5, + seqnamesToSampleFrom = "chr1", + verbose = TRUE, + BPPARAM = BiocParallel::SerialParam(RNGseed = 1327828L)) +se_sample$n_reads ``` -As these `bedMethyl` files represent 5mC modifications, we set `modbase = "m"` -in the `readBedMethyl()` call (the same could be done in `readModBam` to load -`modBam` files representing 5mC modifications). For a full list of -possible modifications see the section about base modifications in the -[SAM tag specification](https://samtools.github.io/hts-specs/SAMtags.pdf)). -We also provide a sequence reference (genome sequence) from which to extract -the sequence context for each position. +Note that the exact number of sampled reads may not be identical to the +requested number. One reason for this is that the requested number is combined +with the total number of alignments to calculate the fraction of reads to be +sampled (in this small case, 0.5), and each alignment will be retained with +this probability. -```{r read-data-bedmethyl} -names(bedmethylfiles) <- c("sample1", "sample2") -se_bed <- readBedMethyl(bedmethylfiles, - modbase = "m", - sequenceContextWidth = 3, - sequenceReference = reffile, - BPPARAM = BiocParallel::SerialParam()) -se_bed -``` +## Label reads by sequence variants -As before, the summarized data is read into a `RangedSummarizedExperiment` -object, with rows corresponding to genomic positions: +The `readModBam` and `readMismatchBam` functions allow to automatically label +reads using variant positions (`variantPositions` argument): -```{r rowranges-bedmethyl} -rowRanges(se_bed) +```{r read-varpositions} +# create GPos object with reference sequence variant positions +varpos <- GPos(seqnames = "chr1", pos = c(6937731, 6937788, 6937843, 6937857, + 6937873, 6937931, 6938932, 6939070)) + +# load and label reads at varpos +se_varpos <- readModBam(bamfiles = modbamfiles, + modbase = "a", + regions = "chr1:6925000-69500000", + variantPositions = varpos, + verbose = FALSE, + BPPARAM = BiocParallel::SerialParam(RNGseed = 1327828L)) ``` -... and the columns correspond to the different samples (here corresponding -to the names of the two input files in `bedmethylfiles`): +Read bases at `variantPositions` will be extracted and concatenated to create +a variant label for each read, using `'-'` for non-overlapping positions. +The resulting labels are stored for each sample in the `variant_label` column +of `readInfo`: -```{r colnames-bedmethyl} -colnames(se_bed) +```{r readlabels-sample1} +se_varpos$readInfo$sample1$variant_label ``` -The data is stored in two matrices (assays) called `Nmod` (the number of -modified bases per sample and position) and `Nvalid` (the number of valid -total read covering that position in each sample): +In a diploid experimental system with known heterozygous positions, this allows +for example to measure allele specific signals by grouping reads based on these +variant labels. A step-by-step example illustrating this procedure is +available in the [footprintR book](https://fmicompbio.github.io/footprintR_book/grouping-reads.html#sec-group-by-sequence). -```{r assays-bedmethyl} -assayNames(se_bed) +# Load data from other file types {#other-formats} -head(assay(se_bed, "Nmod")) -head(assay(se_bed, "Nvalid")) -``` +In the previous sections we illustrated how to read data from `modBam` files +and extract both read-level and summary-level information. As mentioned in the +beginning of this document, `SingleMoleculeGenomicsIO` can also be used to read +data from other file formats that are often used to store data from single +molecule genomics experiments. -From these, we can easily calculate the fraction of modifications: +In cases where read-level modification data has already been extracted from the +`modBam` files using [modkit](https://nanoporetech.github.io/modkit), these +files can be read into a `RangedSummarizedExperiment` using the +`readModkitExtract()` function. -```{r fraction-modified-bedmethyl} -fraction_modified <- assay(se_bed, "Nmod") / assay(se_bed, "Nvalid") -head(fraction_modified) +```{r read-modkit-extract} +# define path to modkit extract output file +extrfile <- system.file("extdata", "modkit_extract_rc_5mC_1.tsv.gz", + package = "SingleMoleculeGenomicsIO") + +# read complete file +readModkitExtract(extrfile, modbase = "m", filter = NULL, + sequenceContextWidth = 3, + sequenceReference = reffile, + BPPARAM = BiocParallel::SerialParam()) +# only read entries that are not suggested to be filtered out +readModkitExtract(extrfile, modbase = "m", filter = "modkit", + sequenceContextWidth = 3, + sequenceReference = reffile, + BPPARAM = BiocParallel::SerialParam()) ``` -As before, non-finite fractions result from positions that had zero coverage -in a given sample. +Summary-level data can also be provided as `bedMethyl` files (containing +total and methylated counts for individual positions). These can be read using +the `readBedMethyl()` function, which again returns a +`RangedSummarizedExperiment` object (this time only containing summary-level +assays). -## Load data for a random subset of the reads +```{r read-data-bedmethyl} +# collapsed 5mC data for a small genomic window in bedMethyl format +bedmethylfiles <- system.file("extdata", + c("modkit_pileup_5mC_1.bed.gz", + "modkit_pileup_5mC_2.bed.gz"), + package = "SingleMoleculeGenomicsIO") -By default, `readModBam()` will load all reads overlapping the regions -indicated by the `regions` argument. However, it is also possible to load a -random subset of reads (this can be useful e.g. to calculate distributions -of quality scores, using reads aligning to different parts of the genome but -without the need to load the entire `modBam` file into R). +names(bedmethylfiles) <- c("sample1", "sample2") +readBedMethyl(bedmethylfiles, + modbase = "m", + sequenceContextWidth = 3, + sequenceReference = reffile, + BPPARAM = BiocParallel::SerialParam()) +``` + +In addition to the `modBam` files, where modification information is +encoded in the `MM` and `ML` tags, `SingleMoleculeGenomicsIO` can also load +data from `bam` files where modification information has to be inferred from +the observed read sequence. This includes bisulfite sequencing data (where +unmethylated and typically inaccessible Cs are converted to Ts) and data +generated using deamination protocols (where instead the accessible Cs are +converted to Ts). Because of these conversions, such data is aligned to the +reference genome using dedicated alignment tools like the ones provided with +the `QuasR` R package or the `Bismark` aligner. `SingleMoleculeGenomicsIO` +supports reading `bam` files generated by either of these, using the +`readMismatchBam()` function. +The genomic sequence context to focus on, as well as specifications about +which bases represent modified and unmodified states, respectively, are +determined based on the arguments `sequenceContext`, `readBaseUnmod` and +`readBaseMod`. + +```{r} +# bisulfite sequencing bam file aligned with QuasR::qAlign +quasrbisfile <- system.file("extdata", "BisSeq_quasr_paired.bam", + package = "SingleMoleculeGenomicsIO") -```{r sample-reads} -se_sample <- readModBam(bamfiles = modbamfiles, - modbase = "a", - nAlnsToSample = 5, - seqnamesToSampleFrom = "chr1", - verbose = TRUE, - BPPARAM = BiocParallel::SerialParam(RNGseed = 1327828L)) -se_sample$n_reads +# read data +readMismatchBam(bamfiles = quasrbisfile, + bamFormat = "QuasR", + regions = "chr1:6950000-6955000", + sequenceContext = "C", + sequenceReference = reffile, + readBaseUnmod = "T", + readBaseMod = "C") + +# bisulfite sequencing bam file aligned with Bismark +bismarkbisfile <- system.file("extdata", "BisSeq_bismark_paired.bam", + package = "SingleMoleculeGenomicsIO") + +# read data +readMismatchBam(bamfiles = bismarkbisfile, + bamFormat = "Bismark", + regions = "chr1:6950000-6955000", + sequenceContext = "C", + sequenceReference = reffile, + readBaseUnmod = "T", + readBaseMod = "C") ``` -Note that the exact number of sampled reads may not be identical to the -requested number. One reason for this is that the requested number is combined -with the total number of alignments to calculate the fraction of reads to be -sampled (in this small case, 0.5), and each alignment will be retained with -this probability. +The table below compares the capabilities of the provided reading functions. + +| Function | Description | Read-level output | Summary-level output | Region-based access | Sampling | Variant read labels | +| :------: | :---------: | :---------------: | :------------------: | :-----------------: | :------: | :-----------------: | +| `readModBam` | Read `modBam` file(s) | Yes | Yes | Yes | Yes | Yes | +| `readModkitExtract` | Read `modkit` extract file(s) | Yes | No | No | No | No | +| `readBedMethyl` | Read `bedMethyl` file(s) | No | Yes | No | No | No | +| `readMismatchBam` | Read `bam` file(s) with modifications encoded as as sequence mismatches | Yes | Yes | Yes | Yes | Yes | + # Expand the `SummarizedExperiment` object to single base resolution The objects returned by the reading functions above contain data only for -the positions with an observable modification state for the given modbase. For -cases where a single nucleotide resolution is required (i.e., where the rows +the positions with an observable modification state. For +cases where single nucleotide resolution is required (i.e., where the rows in the object must cover each position within a region), the `expandSEToBaseSpace` will return a suitable object. Note that any strand information in the original positions will be ignored. @@ -400,12 +565,60 @@ se_expand <- expandSEToBaseSpace(se) dim(se_expand) ``` +# Filter positions + +`SingleMoleculeGenomicsIO` provides several ways to filter out individual +genomic positions, based on e.g. the genomic sequence context or the +observed read coverage. Such filtering can be important to reduce noise in +the data for downstream analyses. + +To illustrate the position-level filtering, let us consider the `se` object +obtained from the two `modBam` files above: + +```{r} +se +``` + +Recall that we added the genomic sequence context of each position when reading +this data (in `rowRanges(se)$sequenceContext`). We could also add this manually +to the `RangedSummarizedExperiment`, via the `addSeqContext()` function. +We use the `filterPositions()` function to filter the rows of `se`, retaining +only rows where the nucleotide in the corresponding genomic position is +an `A`, and where the read coverage is at least 2. Note that for the sequence +context filter, any IUPAC code can be used - for example, `NCG` would match +trinucleotide contexts with a central C followed by a G (and preceded by any +base). + + +```{r} +se_filt <- filterPositions( + se, + filters = c("sequenceContext", "coverage"), + sequenceContext = "A", + assayNameCov = "mod_prob", + minCov = 2 +) +dim(se_filt) +``` + +In this case, the filtering reduced the number of unique positions from +`r nrow(se)` to `r nrow(se_filt)` (a reduction by +`r 100*(1 - signif(nrow(se_filt)/nrow(se), 3))`%). +At the same time, the number of non-NA +values in the `mod_prob` assay is only reduced from +`r length(nnawhich(as.matrix(assay(se))))` to +`r length(nnawhich(as.matrix(assay(se_filt))))` (a reduction by +`r 100*(1 - signif(length(nnawhich(as.matrix(assay(se_filt))))/length(nnawhich(as.matrix(assay(se)))), 3))`%), indicative of the fact that +the filtered out positions are generally covered by only a few reads. -# Add quality scores and filter `SummarizedExperiment` object +# Filter reads -`SingleMoleculeGenomicsIO` provides functionality to calculate a variety of -statistics for individual reads. The results will be added as a new column in -the `colData` of the `SummarizedExperiment`. +In addition to the position filtering illustrated above, +`SingleMoleculeGenomicsIO` provides functionality for calculating read-level +quality scores and filtering out entire reads with low quality. The +calculation of the quality scores is performed using the `addReadStats()` +function, which will return a `RangedSummarizedExperiment` object with the +quality scores added as a new column in the `colData` (by default named `QC`). ```{r add-read-stats} # start from read-level data @@ -418,6 +631,12 @@ se <- addReadStats(se, name = "QC", se$QC$sample1 ``` +As a result, the `colData` now contains two columns with read-level information: + +```{r readlevel-coldata-after-qc} +getReadLevelColDataNames(se) +``` + The distribution of the quality statistics can also be plotted, which is often helpful in order to decide on cutoffs for downstream analysis. Note that in the current example data, each sample contains only a few reads and hence the plot @@ -441,38 +660,34 @@ lapply(assay(se_filtreads, "mod_prob"), ncol) metadata(se_filtreads)$filteredOutReads ``` -It is also possible to filter out positions (rows), based on a variety of -criteria, using the `filterPositions()` function. Typical filter criteria -include the sequence context of the observation and the read coverage. +Both position and read filters above were applied after reading the data from +the `modBam` file into a `RangedSummarizedExperiment` object. An alternative to +this is to create a filtered `modBam` file once, and then use that for all +downstream analyses. This can be done using the `filterReadsModBam()` file, +which accepts the same filter criteria as `filterReads()` above. Here we +illustrate the process using one of the example `modBam` files provided with +the package. -In order to filter by sequence context, we first need to add it to our -object (if this was not already done when the data was imported). This can be -done using the `addSeqContext()` function. We can extract sequence contexts of -arbitrary width, based on a provided genome reference. +```{r} +# input file +modbamfile <- system.file("extdata", "6mA_1_10reads.bam", + package = "SingleMoleculeGenomicsIO") -```{r add-seqcontext} -gnm <- readDNAStringSet(system.file("extdata", - "reference.fa.gz", - package = "SingleMoleculeGenomicsIO")) -se <- addSeqContext(se, sequenceContextWidth = 1, sequenceReference = gnm) -``` +# output file +filtmodbamfile <- tempfile(fileext = ".bam") -Next, we can filter the object to only contain positions where the annotated -nucleotide at the corresponding genomic position is an `A`. Any IUPAC code can -be used - for example, `NCG` would match trinucleotide contexts with a -central C followed by a G (and preceded by any base). - -```{r filter-seqcontext} -# check sequence context distribution across observations -table(as.character(rowData(se)$sequenceContext)) - -se_posfilt <- filterPositions(se, filters = "sequenceContext", - sequenceContext = "A") -dim(se_posfilt) -# check that the sequence context of all retained positions is A -table(as.character(rowData(se_posfilt)$sequenceContext)) +res <- filterReadsModBam(infiles = modbamfile, + outfiles = filtmodbamfile, + modbase = "a", + indexOutfiles = FALSE, + minQscore = 10, + minAlignedFraction = 0.8, + BPPARAM = BiocParallel::SerialParam(), + verbose = TRUE) +res ``` + # Subset and regroup reads After importing read-level data with `SingleMoleculeGenomicsIO`, the set of