Conversation
Ideas for follow-up improvements:
|
d0f1355 to
5862fba
Compare
5862fba to
f685ad8
Compare
r/R/data.R
Outdated
| file.remove(file.path(data_dir, "demo_mat.tar.gz")) | ||
| } | ||
| } | ||
| return(open_matrix_dir(file.path(data_dir, "demo_mat"))) |
There was a problem hiding this comment.
I'm also thinking about the edge case that data_dir/demo_mat has been written to, but the file does not represent a functionally correct matrix. In this case, it might make sense to put this in a try catch block. Any thoughts?
|
Given our discussions about wanting to provide pre-filtered matrices and fragments, I chose the path of providing some additional parameterization on |
bnprks
left a comment
There was a problem hiding this comment.
As mentioned when we discussed earlier, I think we may want to make another couple tweaks later but this seems good to use as a base for the function documentation examples in later PRs.
I think when we revisit this, we'll want to switch to a github releases URL rather than cloudflare, and consider if we want to switch the function names to like get_demo_data(type=c("matrix", "fragments")) or something like that. We'll probably also want to try harder to get rid of the very large unfiltered download options to reduce file sizes.
But with the understanding that we'll re-visit once the function examples are merged in I think this is good enough to use as a base. I don't mind if this lives in main while we work on the function docs, or we can keep in the test-data branch if you prefer.
| if (is.null(directory)) { | ||
| directory <- file.path(tempdir()) | ||
| dir.create(directory, recursive = TRUE, showWarnings = FALSE) | ||
| } |
There was a problem hiding this comment.
It might be nicer to create a subdir just for the downloaded files, then add an on.exit() to delete the directory so it's more foolproof to clean up downloads
Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>
delete intermediates when exiting clean up docs wording
ecf938a to
c6872dc
Compare
…ecessary parameterization + fix example styling
4d71059 to
b166d10
Compare
r/tests/testthat/test-data.R
Outdated
|
|
||
|
|
||
| test_that("Getting test data works", { | ||
| expect_no_error(BPCells:::prepare_demo_data(file.path(tools::R_user_dir("BPCells", which = "data"), "demo_data"))) |
There was a problem hiding this comment.
I'm worried about the impact of this on time to run the test suite -- maybe we set this to skip by default?
There was a problem hiding this comment.
Since we don't have a "comprehensive" flag for our tests, what would be the preferred way of doing this? Should I just use skip(), or should I use skip_if() depending on a pre-defined variable by the user?
There was a problem hiding this comment.
I think either unconditional skip or skip unless an environment variable is defined are fine options. Maybe just put as skip() for now?
One future option once we move to github-based data set hosting would be to just put the data generation functions into a small "BPCellsData" package in that repo, in which case we'd fully avoid the issue of having tests for these in the main BPCells test suite.
There was a problem hiding this comment.
Agreed, I'll add that to the canvas
| # Recreate mat if mat is malformed | ||
| tryCatch({ | ||
| mat <- open_matrix_dir(file.path(directory, "pbmc_3k_rna_raw")) | ||
| } | ||
| # Check if we already ran import | ||
| if (!file.exists(file.path(directory, "pbmc_3k_frags"))) { | ||
| atac_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz") | ||
| ensure_downloaded(file.path(directory, "pbmc_3k_10x.fragments.tsv.gz"), atac_raw_url, timeout = timeout) | ||
| frags <- open_fragments_10x(file.path(directory, "pbmc_3k_10x.fragments.tsv.gz")) %>% | ||
| write_fragments_dir(file.path(directory, "pbmc_3k_frags")) | ||
| } else { | ||
| }, error = function(e) { | ||
| rna_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_raw_feature_bc_matrix.h5") | ||
| ensure_downloaded(file.path(intermediate_dir, "pbmc_3k_10x.h5"), rna_raw_url, timeout = timeout) | ||
| mat <<- open_matrix_10x_hdf5(file.path(intermediate_dir, "pbmc_3k_10x.h5"), feature_type="Gene Expression") %>% | ||
| write_matrix_dir(file.path(directory, "pbmc_3k_rna_raw"), overwrite = TRUE) | ||
| }) |
There was a problem hiding this comment.
Two notes:
- The use of
mat <<- ...without havingmatalready be defined in the outer scope is a problem -- runningprepare_demo_data()then has a side effect of defining a variablematin the top-level environment. Simply setmat <- NULLbefore thetryCatchcall to fix this - Is the intent to still keep the
pbmc_3k_rna_rawandpbmc_3k_fragsfiles in the output directory indefinitely? I had kind of assumed they would be cleaned up as intermediate files.
There was a problem hiding this comment.
- That is a good catch. I assumed incorrectly that using
<<-only goes to the parent scope, instead of going directly to global env if the var is not declared.
There was a problem hiding this comment.
2.Maybe there was a little bit of a miscommunication here. I think there are two parts of this. Firstly, you're correct in that this function was holding two copies of the non-{filtered, subsetted} matrix/fragments, which only happens if you run it with filter_qc = FALSE and subset = FALSE. That was uncaught and is now fixed, and only one copy is being held now.
But in the case of whether we should be holding the non-filtered, and non-subsetted matrix and fragments at all...
My impression was that choosing which permutations of filtering and subsetting was going to be a post PR task, and we would just leave all those options here for now. The reason was that we might use them for some of the examples to make some more sense in the short term (see #233 plot_read_count_knee() and plot_tss_scatter()), and we would refactor after concluding the examples. However, I think we can do a middle ground for now of not saving pbmc_3k_rna_raw and pbmc_3k_frags unless the user explicitly requests them at least once previously. What do you think?
There was a problem hiding this comment.
Your changes here on (1) and (2) match what I was aiming for.
I agree we can wait on exactly what version of filtering to settle on in final cleanup, I just was concerned about full-size files getting retained unconditionally even if the user was just trying to get a smaller subset.
Add functions for getting test data (
get_test_mat(),get_test_frags()) and removal of test data (remove_test_data()).Data was prepared using an internal function named
prepare_test_data(). This function is still exposed to maintain reproducibility in the future. Outputs were stored as a tarball on our cloudflare object store.As we discussed, data is being stored in
file.path(tools::R_user_dir("BPcells", which="data"), "test_data"), and can be easily removed usingremove_test_data(). It would be good to note that this feature is specific to R 4.0 and above.For versions below R 4.0. We can probably add in a conditional that checks if a user provided directory exists, otherwise just write to tmp if not given. Nonetheless, I think it would still be safe to keep BPCells to R 4.0 above.