
[r] Test data functions#193

Merged
immanuelazn merged 15 commits into main from ia/test-data on Apr 10, 2025

Conversation

@immanuelazn
Collaborator

Add functions for fetching test data (get_test_mat(), get_test_frags()) and for removing it (remove_test_data()).
Data was prepared using an internal function named prepare_test_data(). This function remains exposed to keep the preparation reproducible in the future. Outputs were stored as a tarball on our Cloudflare object store.

As we discussed, data is stored in file.path(tools::R_user_dir("BPCells", which="data"), "test_data") and can be easily removed using remove_test_data(). It would be good to note that this feature requires R 4.0 and above.

For versions below R 4.0, we could add a conditional that checks whether a user-provided directory exists, and otherwise write to tmp if none is given. Nonetheless, I think it would still be safe to require R 4.0 or above for BPCells.
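
The fallback described above could be sketched roughly like this (get_data_dir() is a hypothetical helper name, not the merged implementation):

```r
# Sketch: pick a data directory depending on R version.
# tools::R_user_dir() only exists in R >= 4.0, so fall back to tempdir().
get_data_dir <- function(directory = NULL) {
  if (!is.null(directory)) {
    return(directory)
  }
  if (getRversion() >= "4.0.0") {
    file.path(tools::R_user_dir("BPCells", which = "data"), "test_data")
  } else {
    file.path(tempdir(), "bpcells_data")
  }
}
```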

@immanuelazn immanuelazn changed the title from "Ia/test data" to "Test data functions" on Jan 28, 2025
@bnprks
Owner

bnprks commented Jan 31, 2025

  • We should register an actual domain rather than using Cloudflare's test setup (e.g. greenleaf-lab-data.com?)
  • We should have error handling so that if the Cloudflare download doesn't work we'll attempt prepare_test_data() (that way, if our Cloudflare URL ever goes down accidentally, there's still a good chance the functions will work)
  • If we're going to use tools::R_user_dir we'll need to require the tools package and do something about the fact that the function doesn't exist in R versions <4.0. Either bump our required R version, or check the R.version object and download to file.path(tempdir(), "bpcells_data") instead for older R versions
  • I'm not sure I love the name "test data". I think "demo data" or "example dataset" or something along those lines might seem more inviting to users
  • Maybe specify the file download size in the docs?
  • When calling options(timeout=300) in prepare_test_data(), we should record the previous value (if any) and restore it, probably via on.exit. Maybe just use the internal BPCells:::ensure_downloaded() helper, which already handles that and the file.exists check
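
The save/restore pattern from the last bullet can be sketched as follows (a minimal stand-alone version, not the actual ensure_downloaded() internals; download_with_timeout() is a hypothetical name):

```r
# Sketch: temporarily raise the download timeout, restoring the old
# value on exit (including on error) via on.exit().
download_with_timeout <- function(url, destfile, timeout = 300) {
  old <- options(timeout = timeout)  # options() returns the previous values
  on.exit(options(old), add = TRUE)
  if (!file.exists(destfile)) {
    utils::download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}
```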

Ideas for follow-up improvements:

  • If we're starting to use tools::R_user_dir, there might be an opportunity to improve the "Reference Annotations" functions/data. (e.g. caching a tibble RDS of the most recent gencode download, since that is slow to parse the gtf even if it's pre-downloaded).
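
That caching idea could look roughly like this (hypothetical helper and cache layout; parse_fun stands in for the real gtf reader):

```r
# Sketch: cache the parsed annotation as an RDS so repeat calls skip
# the slow gtf parse. Only the first call pays the parsing cost.
cached_gencode <- function(gtf_path, parse_fun) {
  cache_dir <- tools::R_user_dir("BPCells", which = "cache")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  cache_file <- file.path(cache_dir, paste0(basename(gtf_path), ".rds"))
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  parsed <- parse_fun(gtf_path)  # slow parse happens only once
  saveRDS(parsed, cache_file)
  parsed
}
```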

@immanuelazn immanuelazn force-pushed the ia/test-data branch 3 times, most recently from d0f1355 to 5862fba on February 10, 2025
r/R/data.R Outdated
file.remove(file.path(data_dir, "demo_mat.tar.gz"))
}
}
return(open_matrix_dir(file.path(data_dir, "demo_mat")))
Collaborator Author

@immanuelazn immanuelazn Feb 27, 2025


I'm also thinking about the edge case where data_dir/demo_mat has been written, but the file does not represent a functionally correct matrix. In that case, it might make sense to wrap this in a tryCatch block. Any thoughts?

@immanuelazn immanuelazn changed the title from "Test data functions" to "[r] Test data functions" on Mar 27, 2025
@immanuelazn
Collaborator Author

Given our discussions about wanting to provide pre-filtered matrices and fragments, I chose the path of providing some additional parameterization on get_demo_mat(), get_demo_frags() and prepare_demo_data(). Let me know if you have any thoughts!

Owner

@bnprks bnprks left a comment


As mentioned when we discussed earlier, I think we may want to make a couple more tweaks later, but this seems good to use as a base for the function documentation examples in later PRs.

I think when we revisit this, we'll want to switch to a GitHub releases URL rather than Cloudflare, and consider whether we want to switch the function names to something like get_demo_data(type=c("matrix", "fragments")). We'll probably also want to try harder to get rid of the very large unfiltered download options to reduce file sizes.

But with the understanding that we'll revisit once the function examples are merged in, I think this is good enough to use as a base. I don't mind if this lives in main while we work on the function docs, or we can keep it in the test-data branch if you prefer.

Comment on lines 28 to 31
if (is.null(directory)) {
directory <- file.path(tempdir())
dir.create(directory, recursive = TRUE, showWarnings = FALSE)
}
Owner


It might be nicer to create a subdir just for the downloaded files, then add an on.exit() to delete the directory so it's more foolproof to clean up downloads
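
The suggestion could be sketched like this (hypothetical wrapper name; the real download/write body is elided):

```r
# Sketch: stage downloads in a dedicated subdirectory, and remove it
# automatically when the function exits, even on error.
prepare_with_cleanup <- function(directory) {
  intermediate_dir <- file.path(directory, "downloads")
  dir.create(intermediate_dir, recursive = TRUE, showWarnings = FALSE)
  on.exit(unlink(intermediate_dir, recursive = TRUE), add = TRUE)
  # ... download into intermediate_dir, then write final outputs ...
  invisible(directory)
}
```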

immanuelazn and others added 2 commits on April 7, 2025:
  • delete intermediates when exiting (Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>)
  • clean up docs wording


test_that("Getting test data works", {
expect_no_error(BPCells:::prepare_demo_data(file.path(tools::R_user_dir("BPCells", which = "data"), "demo_data")))
Owner


I'm worried about the impact of this on time to run the test suite -- maybe we set this to skip by default?

Collaborator Author


Since we don't have a "comprehensive" flag for our tests, what would be the preferred way of doing this? Should I just use skip(), or should I use skip_if() keyed on a variable pre-defined by the user?

Owner


I think either unconditional skip or skip unless an environment variable is defined are fine options. Maybe just put as skip() for now?

One future option once we move to github-based data set hosting would be to just put the data generation functions into a small "BPCellsData" package in that repo, in which case we'd fully avoid the issue of having tests for these in the main BPCells test suite.
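
The environment-variable option could look like this (the variable name BPCELLS_RUN_DOWNLOAD_TESTS is a made-up placeholder, not an agreed convention):

```r
# Sketch: skip the download-heavy test unless an opt-in environment
# variable is set, e.g. BPCELLS_RUN_DOWNLOAD_TESTS=1.
test_that("Getting test data works", {
  skip_if(Sys.getenv("BPCELLS_RUN_DOWNLOAD_TESTS") == "",
          "Set BPCELLS_RUN_DOWNLOAD_TESTS=1 to run download tests")
  expect_no_error(BPCells:::prepare_demo_data(
    file.path(tools::R_user_dir("BPCells", which = "data"), "demo_data")
  ))
})
```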

Collaborator Author


Agreed, I'll add that to the canvas

Comment on lines 49 to 57
# Recreate mat if mat is malformed
tryCatch({
  mat <- open_matrix_dir(file.path(directory, "pbmc_3k_rna_raw"))
}, error = function(e) {
  rna_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_raw_feature_bc_matrix.h5")
  ensure_downloaded(file.path(intermediate_dir, "pbmc_3k_10x.h5"), rna_raw_url, timeout = timeout)
  mat <<- open_matrix_10x_hdf5(file.path(intermediate_dir, "pbmc_3k_10x.h5"), feature_type="Gene Expression") %>%
    write_matrix_dir(file.path(directory, "pbmc_3k_rna_raw"), overwrite = TRUE)
})
# Check if we already ran import
if (!file.exists(file.path(directory, "pbmc_3k_frags"))) {
  atac_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz")
  ensure_downloaded(file.path(directory, "pbmc_3k_10x.fragments.tsv.gz"), atac_raw_url, timeout = timeout)
  frags <- open_fragments_10x(file.path(directory, "pbmc_3k_10x.fragments.tsv.gz")) %>%
    write_fragments_dir(file.path(directory, "pbmc_3k_frags"))
}
Owner


Two notes:

  1. The use of mat <<- ... without having mat already be defined in the outer scope is a problem -- running prepare_demo_data() then has a side effect of defining a variable mat in the top-level environment. Simply set mat <- NULL before the tryCatch call to fix this
  2. Is the intent to still keep the pbmc_3k_rna_raw and pbmc_3k_frags files in the output directory indefinitely? I had kind of assumed they would be cleaned up as intermediate files.
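
The pitfall in point 1 can be shown in isolation (a toy example, not the PR code):

```r
# <<- walks up the enclosing environments looking for an existing
# binding; if none exists, it assigns in the global environment.
f <- function() {
  handler <- function() mat <<- "oops"
  handler()
}
f()
exists("mat")  # TRUE: `mat` now lives in the global environment

# Fix: pre-define the binding in the enclosing function scope,
# so <<- finds it there and stops.
g <- function() {
  mat <- NULL
  handler <- function() mat <<- "ok"
  handler()
  mat  # returns "ok"; the global env is untouched by g()
}
```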

Collaborator Author


  1. That is a good catch. I incorrectly assumed that <<- only assigns in the parent scope; in fact it falls through to the global environment when the variable is not declared in any enclosing scope.

Collaborator Author

@immanuelazn immanuelazn Apr 10, 2025


2. Maybe there was a bit of a miscommunication here. I think there are two parts to this. First, you're correct that this function was holding two copies of the non-filtered, non-subsetted matrix/fragments, which only happens if you run it with filter_qc = FALSE and subset = FALSE. That went uncaught and is now fixed; only one copy is held now.

But on whether we should keep the non-filtered, non-subsetted matrix and fragments at all:
my impression was that choosing which permutations of filtering and subsetting to keep was going to be a post-PR task, and we would just leave all those options here for now. The reason was that we might use them to make some of the examples more meaningful in the short term (see #233, plot_read_count_knee() and plot_tss_scatter()), and we would refactor after concluding the examples. However, I think a middle ground for now is to not save pbmc_3k_rna_raw and pbmc_3k_frags unless the user has explicitly requested them at least once. What do you think?

Owner


Your changes here on (1) and (2) match what I was aiming for.

I agree we can wait until final cleanup to settle on exactly which version of filtering to keep; I was just concerned about full-size files getting retained unconditionally even if the user was only trying to get a smaller subset.

@immanuelazn immanuelazn merged commit 16faead into main on Apr 10, 2025
4 checks passed
@immanuelazn immanuelazn deleted the ia/test-data branch on August 6, 2025