Dev Docs
Note: This setup will change in the near(ish) future, but it is accurate for now (July 2024)
We'll trace through the TransformPow function, which raises all entries in the matrix to a fixed power.
General Flow:
- We define an S4 object which inherits the "IterableMatrix" type (for all BPCells matrices) link
  - For direct loading from disk, this will store an absolute path to the input file
  - For transforms, generally one slot named `@matrix` will store the input matrix S4 object
- User calls a function that requires accessing the data (e.g. `rowSums`)
- BPCells R code calls `iterate_matrix`, which converts the R object into a C++ pointer and returns that pointer to R. R then passes this pointer, which can load data, to an Rcpp function that will use the data. (rowSums R side)
  - These pointers should generally be short-lived and not be returned to the user (e.g. if a file is deleted while a C++ pointer wants to read it, that might create a badly-controlled crash rather than a nice R error).
  - Each "IterableMatrix" subclass implements `iterate_matrix` via `setMethod("iterate_matrix", ...)` (example link). This is generally implemented by calling `iterate_matrix` on an input, then passing the pointer to an Rcpp function.
  - On the C++ side, these are represented as `Rcpp::XPtr<std::unique_ptr<T>>`.
    - Wrapping and unwrapping is handled by functions in R_xptr_wrapper.h
    - With few exceptions, this means that once C++ gets passed an object from R, it takes ownership and will destruct the object in a timely manner (thus closing any open files and avoiding locking issues with HDF5 files)
- An Rcpp top-level function takes the argument as an Rcpp `SEXP` type, then converts it using `take_unique_xptr` to the assumed C++ type. (example link)
- The top-level function calls a method/function that operates only on C++ objects (`&MatrixLoader<double>::rowSums` in our example)
These steps are basically the same for fragment objects
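The ownership hand-off in the flow above can be sketched in plain C++ without Rcpp. This is a minimal sketch under stated assumptions: `ToyLoader`, `ToyDenseColumns`, `ToyPow`, `ToyHandle`, `takeLoader`, and `toySum` are all hypothetical names invented for illustration, and `ToyHandle` only stands in for the pointer that R holds. It is not the BPCells implementation of `take_unique_xptr` or of the pow transform, and the error on reuse is an assumption of the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical minimal loader interface (NOT the BPCells MatrixLoader API)
struct ToyLoader {
    virtual ~ToyLoader() = default;
    // Copies the next column's values into `out`; returns false when no columns remain
    virtual bool nextCol(std::vector<double> &out) = 0;
};

// Hypothetical input type: serves columns from memory
struct ToyDenseColumns : ToyLoader {
    explicit ToyDenseColumns(std::vector<std::vector<double>> cols) : cols_(std::move(cols)) {}
    bool nextCol(std::vector<double> &out) override {
        if (next_ >= cols_.size()) return false;
        out = cols_[next_++];
        return true;
    }
    std::vector<std::vector<double>> cols_;
    std::size_t next_ = 0;
};

// Hypothetical transform: owns its input and raises every value to a fixed power
struct ToyPow : ToyLoader {
    ToyPow(std::unique_ptr<ToyLoader> input, double exponent)
        : input_(std::move(input)), exponent_(exponent) {
        assert(input_ != nullptr);
    }
    bool nextCol(std::vector<double> &out) override {
        if (!input_->nextCol(out)) return false;
        for (double &v : out) v = std::pow(v, exponent_);
        return true;
    }
    std::unique_ptr<ToyLoader> input_;
    double exponent_;
};

// Stand-in for the short-lived pointer handed back to R
struct ToyHandle {
    std::unique_ptr<ToyLoader> ptr;
};

// Moves ownership out of the handle, leaving the handle empty
std::unique_ptr<ToyLoader> takeLoader(ToyHandle &handle) {
    if (!handle.ptr) throw std::runtime_error("loader pointer was already consumed");
    return std::move(handle.ptr);
}

// "Top-level" function: takes ownership, then works only on C++ objects
double toySum(ToyHandle &handle) {
    std::unique_ptr<ToyLoader> loader = takeLoader(handle);
    double total = 0;
    std::vector<double> col;
    while (loader->nextCol(col)) {
        for (double v : col) total += v;
    }
    return total;
    // `loader` is destroyed here, releasing any resources it holds
}
```

Because the consumer takes ownership, the loader is destroyed as soon as the computation finishes, and in this sketch a second use of the same handle fails with an ordinary exception rather than touching a stale object.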
R "IterableMatrix" -- base class for all BPCells matrix types. Every transform or input file type gets a new subclass. New subclasses must implement the following methods:
- iterate_matrix() -> converts from R object to C++ pointer
- matrix_type() -> Return the data type (uint32_t, float, double)
- short_description() -> Return a vector of strings describing the steps
- storage_order() (optional override) -> "row" for stored contiguously by row, "col" for stored contiguously by column
  - "col" is more common. By default, types inherit and use the `@transpose` slot, which is TRUE if the matrix is row-major. C++ treats everything as col-major.
- matrix_inputs() (optional override) -> return a list of the input matrix or matrices for transform objects
- matrix_inputs<- (optional override) -> Allow setting the input matrices given a list of IterableMatrix objects
Important slots:
- `@dim` - dimensions of the matrix (row, col)
- `@dimnames` - length-2 list with row/col names if present (in that order)
C++ MatrixLoader
New operations implement a subclass, and override methods as needed. For convenience when writing transforms, use MatrixLoaderWrapper on the implementation side (will default to using the input's implementation for all functions). For convenience on the consumer side, use MatrixIterator, which allows accessing one element at a time rather than in batches via pointers. Example MatrixIterator usage
- Conceptually, each entry is a tuple of (val, row), and these are loaded one column at a time (so the col index will not change within an inner loop). In practice, there's a pointer to a `val` array and a pointer to a `row` array, which have equal lengths. Often these are pointers to the `.data()` of a `std::vector<T>`
- As metadata, rows and columns can have string names assigned to them. This is the only metadata supported on BPCells matrices
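To make the (val, row) layout concrete, here is a minimal sketch of a consumer that reads one column at a time through two equal-length arrays. `ToyColumnLoader` and `toyRowSums` are hypothetical names and the method set is illustrative only. It is not the BPCells `MatrixLoader` interface, and it assumes the data sit in memory in compressed sparse column form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a loader (NOT the BPCells MatrixLoader API).
// Holds a column-major sparse matrix and exposes one column at a time
// as a `row` array and a `val` array of equal length.
class ToyColumnLoader {
  public:
    ToyColumnLoader(uint32_t n_rows, std::vector<uint32_t> col_ptr,
                    std::vector<uint32_t> row, std::vector<double> val)
        : n_rows_(n_rows), col_ptr_(std::move(col_ptr)),
          row_(std::move(row)), val_(std::move(val)) {
        assert(row_.size() == val_.size());
    }

    uint32_t rows() const { return n_rows_; }

    // Advance to the next column; returns false once every column was visited
    bool nextCol() {
        if (next_ + 1 >= col_ptr_.size()) return false;
        begin_ = col_ptr_[next_];
        end_ = col_ptr_[next_ + 1];
        next_++;
        return true;
    }

    // Number of stored entries in the current column
    std::size_t size() const { return end_ - begin_; }
    // Pointers into the underlying std::vector storage for the current column
    const uint32_t *rowData() const { return row_.data() + begin_; }
    const double *valData() const { return val_.data() + begin_; }

  private:
    uint32_t n_rows_;
    std::vector<uint32_t> col_ptr_, row_;
    std::vector<double> val_;
    std::size_t next_ = 0, begin_ = 0, end_ = 0;
};

// Consumer: the column index is fixed inside the inner loop, only row varies
std::vector<double> toyRowSums(ToyColumnLoader &loader) {
    std::vector<double> sums(loader.rows(), 0.0);
    while (loader.nextCol()) {
        const uint32_t *row = loader.rowData();
        const double *val = loader.valData();
        for (std::size_t i = 0; i < loader.size(); i++) sums[row[i]] += val[i];
    }
    return sums;
}
```

For a 3x2 matrix with entries (row 0, 1.0) and (row 2, 2.0) in the first column and (row 1, 3.0) and (row 2, 4.0) in the second, `toyRowSums` accumulates each value into the slot named by its row index.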
R IterableFragments link
- Unlike IterableMatrix, does not include any slots in the base class
- iterate_fragments() -> basically same deal as iterate_matrix(). Converts R object to C++ pointer
- short_description() -> Same as IterableMatrix. Return a character vector listing transformation steps for this object and all its inputs.
- cellNames() -> return a vector of the cell names for this object
- chrNames() -> return a vector of the chromosome names for this object
C++ FragmentLoader
Same idea as MatrixLoader. Similarly, FragmentLoaderWrapper is useful for implementing new transforms, and FragmentIterator is useful for writing a consuming function.
- Each entry is a tuple of (cell, start, end), where start is a 0-based inclusive index and end is a 0-based exclusive index. Cell is a numeric ID. Chromosome changes slowly and has a separate numeric ID. Range coordinate systems link. BPCells uses a zero-based, end-exclusive system similar to BED files.
- Cell and chromosome string labels are the only metadata supported on BPCells fragment objects.
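The zero-based, end-exclusive convention makes width and overlap arithmetic free of off-by-one corrections. The sketch below illustrates this with a hypothetical `ToyFragment` struct and two helper functions. These names are invented for the example and do not reflect how BPCells stores fragments.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical fragment record (NOT the BPCells storage layout)
struct ToyFragment {
    uint32_t cell;   // numeric cell ID
    uint32_t start;  // 0-based, inclusive
    uint32_t end;    // 0-based, exclusive
};

// With a half-open interval, the width is simply end - start
uint32_t fragmentWidth(const ToyFragment &f) {
    assert(f.end >= f.start);
    return f.end - f.start;
}

// True if the fragment shares at least one base with [region_start, region_end)
bool overlaps(const ToyFragment &f, uint32_t region_start, uint32_t region_end) {
    return f.start < region_end && region_start < f.end;
}
```

A fragment with start 100 and end 150 covers 50 bases. It overlaps a region starting at 149 but not one starting at 150, because position 150 is excluded.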
Problem: when a user presses Ctrl-C, we want to know to quit out of our computation promptly.
BPCells solution:
- Worker function takes a final argument of `std::atomic<bool> *user_interrupt`.
- If the pointer is non-null, the worker function checks it periodically and exits early if it becomes true: `if (user_interrupt != nullptr && count++ % 65536 == 0 && *user_interrupt) return;` (example link)
  - Ideally, don't put this in the very inner loop. For matrices, make sure it gets checked at least once per column. For fragments, once per chromosome is OK, but ideally something like every ~100K fragments (pick a power of 2 as in the example above)
- In the top-level function, call the worker function through `run_with_R_interrupt_check`. (example usage)
  - Note that getting the C++ reference types mixed up will cause confusing compiler errors. Check the example usage to see how it works.
What else to know:
- run_with_R_interrupt_check will check for an interrupt in the main thread every 100ms, and run the worker in a background thread. It will signal an interrupt by setting user_interrupt to true.
- It is EXTREMELY IMPORTANT that no R objects are created/destroyed inside the spawned thread, which includes destructors that mess with R's GC protection.
- i.e. worker functions should basically operate on pointers/references to C++ objects
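The interrupt pattern can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not the BPCells implementation: `toyWorker` and `toyRunWithInterruptCheck` are hypothetical names, and the `interrupt_pending` callback stands in for whatever check the real `run_with_R_interrupt_check` performs against R on the main thread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <future>

// Worker: sums 0..n-1 and checks the flag once every 65536 iterations.
// Returns false if it stopped early because an interrupt was signalled.
bool toyWorker(uint64_t n, uint64_t *result, std::atomic<bool> *user_interrupt) {
    assert(result != nullptr);
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; i++) {
        if (user_interrupt != nullptr && i % 65536 == 0 && *user_interrupt) return false;
        total += i;
    }
    *result = total;
    return true;
}

// Hypothetical stand-in for run_with_R_interrupt_check: the worker runs in a
// background thread while the calling thread polls `interrupt_pending` every 100ms.
bool toyRunWithInterruptCheck(uint64_t n, uint64_t *result,
                              const std::function<bool()> &interrupt_pending) {
    std::atomic<bool> interrupt(false);
    std::future<bool> done =
        std::async(std::launch::async, toyWorker, n, result, &interrupt);
    while (done.wait_for(std::chrono::milliseconds(100)) != std::future_status::ready) {
        // Only the calling thread consults the outside world; the worker
        // sees nothing but the atomic flag.
        if (interrupt_pending()) interrupt = true;
    }
    return done.get();
}
```

The worker only ever touches plain C++ data and the atomic flag, which is what lets it run safely off the main thread.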
To help avoid unnecessary copies, BPCells in some circumstances keeps direct references to R objects. The major use-cases when that happens are:
- Loading data from an R object (e.g. dgCMatrix)
- Using pre-loaded R vectors holding row/col names (or cell/chr names) to avoid repeatedly reading from disk.
Unfortunately, we need to be extremely careful that these R objects are only created/destroyed on the R main thread. As stated in the R extensions documentation:
Calling any of the R API from threaded code is ‘for experts only’ and strongly discouraged. Many functions in the R API modify internal R data structures and might corrupt these data structures if called simultaneously from multiple threads. Most R API functions can signal errors, which must only happen on the R main thread. Also, external libraries (e.g. LAPACK) may not be thread-safe.
Within Rcpp, R objects will default to being protected from garbage collection (code link). This means, however, that the Rcpp object constructors and destructors must only execute on the main thread as they interact with the R API to manage GC protection.
Because of how the BPCells C++ code is designed, code will often run off the main
thread, e.g. the C++ function run_with_R_interrupt_check runs the main work in a background
thread to simplify interrupt checking. Explicit parallelization will also result in code running
off the main thread.
For safety, we must assume that MatrixLoader and FragmentLoader objects may contain references to R objects.
Therefore their constructors and destructors can only be called from the main thread.
Ensuring constructors are only called from the main thread is fairly simple -- construct all loaders in the top-level function that R calls directly.
To ensure destructors are only called on the main thread, we must never give ownership of a MatrixLoader or FragmentLoader
object to a worker thread. Passing references or raw pointers is fine. Giving ownership via unique_ptr, move semantics,
or pass-by-value must all be avoided when handing a MatrixLoader or FragmentLoader to a worker thread.
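The difference between lending and giving ownership can be seen by recording which thread runs a destructor. The sketch below uses a hypothetical `ToyTrackedLoader` and two helper functions. None of these names exist in BPCells, and no R objects are involved. It only illustrates where destruction happens under each pattern.

```cpp
#include <cassert>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical loader that records which thread ran its destructor
struct ToyTrackedLoader {
    explicit ToyTrackedLoader(std::thread::id *destroyed_on) : destroyed_on_(destroyed_on) {
        assert(destroyed_on_ != nullptr);
    }
    ~ToyTrackedLoader() { *destroyed_on_ = std::this_thread::get_id(); }
    int load() { return 42; }
    std::thread::id *destroyed_on_;
};

// Safe pattern: the worker borrows the loader by reference, so the
// destructor runs on the calling thread after the worker has finished.
std::thread::id borrowInWorker() {
    std::thread::id destroyed_on;
    {
        auto loader = std::make_unique<ToyTrackedLoader>(&destroyed_on);
        std::thread worker([&loader_ref = *loader] { loader_ref.load(); });
        worker.join();
    }  // `loader` goes out of scope here, on the calling thread
    return destroyed_on;
}

// Pattern to avoid: ownership moves into the worker, so the destructor
// runs on the worker thread.
std::thread::id moveIntoWorker() {
    std::thread::id destroyed_on;
    auto loader = std::make_unique<ToyTrackedLoader>(&destroyed_on);
    std::thread worker([owned = std::move(loader)]() mutable {
        owned->load();
        owned.reset();  // destruction happens here, off the calling thread
    });
    worker.join();
    return destroyed_on;
}
```

If the loader held a reference to an R object, the second pattern would release it from a background thread, which is exactly the situation the rules above are meant to prevent.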