
[r] Test data functions#193

Merged
immanuelazn merged 15 commits into main from ia/test-data on Apr 10, 2025

Conversation

@immanuelazn
Collaborator

Add functions for fetching test data (get_test_mat(), get_test_frags()) and for removing it (remove_test_data()).
Data was prepared using an internal function named prepare_test_data(). This function remains exposed to keep the preparation reproducible in the future. Outputs were stored as a tarball on our Cloudflare object store.

As we discussed, data is stored in file.path(tools::R_user_dir("BPCells", which="data"), "test_data") and can be easily removed using remove_test_data(). It would be good to note that this feature requires R 4.0 and above.

For versions below R 4.0, we could add a conditional that checks whether a user-provided directory exists, and otherwise write to tmp if none is given. Nonetheless, I think it would still be safe to require R 4.0 or above for BPCells.
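
The fallback described above could be sketched roughly like this (get_data_dir() is a hypothetical helper name, not the merged implementation):

```r
# Sketch: pick a data directory depending on R version.
# tools::R_user_dir() only exists in R >= 4.0, so fall back to tempdir().
get_data_dir <- function(directory = NULL) {
  if (!is.null(directory)) {
    return(directory)
  }
  if (getRversion() >= "4.0.0") {
    file.path(tools::R_user_dir("BPCells", which = "data"), "test_data")
  } else {
    file.path(tempdir(), "bpcells_data")
  }
}
```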

@immanuelazn immanuelazn changed the title from "Ia/test data" to "Test data functions" on Jan 28, 2025
@bnprks
Owner

bnprks commented Jan 31, 2025

  • We should register an actual domain rather than using Cloudflare's test setup (e.g. greenleaf-lab-data.com?)
  • We should have error handling so that if the Cloudflare download doesn't work we'll attempt prepare_test_data() (that way, if our Cloudflare URL ever goes down accidentally, there's still a good chance the functions will work)
  • If we're going to use tools::R_user_dir we'll need to require the tools package and do something about the fact that the function doesn't exist in R versions <4.0. Either bump our required R version, or check the R.version object and download to file.path(tempdir(), "bpcells_data") instead for older R versions
  • I'm not sure I love the name "test data". I think "demo data" or "example dataset" or something along those lines might seem more inviting to users
  • Maybe specify the file download size in the docs?
  • When calling options(timeout=300) in prepare_test_data(), we should record the previous value (if any) and restore it, probably via on.exit. Maybe just use the internal BPCells:::ensure_downloaded() helper, which already handles that and the file.exists check
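
The save/restore pattern from the last bullet can be sketched as follows (a minimal stand-alone version, not the actual ensure_downloaded() internals; download_with_timeout() is a hypothetical name):

```r
# Sketch: temporarily raise the download timeout, restoring the old
# value on exit (including on error) via on.exit().
download_with_timeout <- function(url, destfile, timeout = 300) {
  old <- options(timeout = timeout)  # options() returns the previous values
  on.exit(options(old), add = TRUE)
  if (!file.exists(destfile)) {
    utils::download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}
```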

Ideas for follow-up improvements:

  • If we're starting to use tools::R_user_dir, there might be an opportunity to improve the "Reference Annotations" functions/data. (e.g. caching a tibble RDS of the most recent gencode download, since that is slow to parse the gtf even if it's pre-downloaded).
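
That caching idea could look roughly like this (hypothetical helper and cache layout; parse_fun stands in for the real gtf reader):

```r
# Sketch: cache the parsed annotation as an RDS so repeat calls skip
# the slow gtf parse. Only the first call pays the parsing cost.
cached_gencode <- function(gtf_path, parse_fun) {
  cache_dir <- tools::R_user_dir("BPCells", which = "cache")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  cache_file <- file.path(cache_dir, paste0(basename(gtf_path), ".rds"))
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  parsed <- parse_fun(gtf_path)  # slow parse happens only once
  saveRDS(parsed, cache_file)
  parsed
}
```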

@immanuelazn immanuelazn force-pushed the ia/test-data branch 3 times, most recently from d0f1355 to 5862fba on February 10, 2025
r/R/data.R Outdated
file.remove(file.path(data_dir, "demo_mat.tar.gz"))
}
}
return(open_matrix_dir(file.path(data_dir, "demo_mat")))
Collaborator Author

@immanuelazn immanuelazn Feb 27, 2025


I'm also thinking about the edge case where data_dir/demo_mat has been written, but the file does not represent a functionally correct matrix. In that case, it might make sense to wrap this in a tryCatch block. Any thoughts?

@immanuelazn immanuelazn changed the title from "Test data functions" to "[r] Test data functions" on Mar 27, 2025
@immanuelazn
Collaborator Author

Given our discussions about wanting to provide pre-filtered matrices and fragments, I chose the path of providing some additional parameterization on get_demo_mat(), get_demo_frags() and prepare_demo_data(). Let me know if you have any thoughts!

Owner

@bnprks bnprks left a comment


As mentioned when we discussed earlier, I think we may want to make a couple more tweaks later, but this seems good to use as a base for the function documentation examples in later PRs.

I think when we revisit this, we'll want to switch to a GitHub releases URL rather than Cloudflare, and consider whether we want to switch the function names to something like get_demo_data(type=c("matrix", "fragments")). We'll probably also want to try harder to get rid of the very large unfiltered download options to reduce file sizes.

But with the understanding that we'll revisit once the function examples are merged in, I think this is good enough to use as a base. I don't mind if this lives in main while we work on the function docs, or we can keep it in the test-data branch if you prefer.

Comment on lines 28 to 31
if (is.null(directory)) {
directory <- file.path(tempdir())
dir.create(directory, recursive = TRUE, showWarnings = FALSE)
}
Owner


It might be nicer to create a subdir just for the downloaded files, then add an on.exit() to delete the directory so it's more foolproof to clean up downloads
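
The suggestion could be sketched like this (hypothetical wrapper name; the real download/write body is elided):

```r
# Sketch: stage downloads in a dedicated subdirectory, and remove it
# automatically when the function exits, even on error.
prepare_with_cleanup <- function(directory) {
  intermediate_dir <- file.path(directory, "downloads")
  dir.create(intermediate_dir, recursive = TRUE, showWarnings = FALSE)
  on.exit(unlink(intermediate_dir, recursive = TRUE), add = TRUE)
  # ... download into intermediate_dir, then write final outputs ...
  invisible(directory)
}
```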

immanuelazn and others added 2 commits on April 7, 2025:
  • delete intermediates when exiting (Co-authored-by: Ben Parks <bnprks@users.noreply.github.com>)
  • clean up docs wording


test_that("Getting test data works", {
expect_no_error(BPCells:::prepare_demo_data(file.path(tools::R_user_dir("BPCells", which = "data"), "demo_data")))
Owner


I'm worried about the impact of this on time to run the test suite -- maybe we set this to skip by default?

Collaborator Author


Since we don't have a "comprehensive" flag for our tests, what would be the preferred way of doing this? Should I just use skip(), or should I use skip_if() keyed on a variable pre-defined by the user?

Owner


I think either unconditional skip or skip unless an environment variable is defined are fine options. Maybe just put as skip() for now?

One future option once we move to github-based data set hosting would be to just put the data generation functions into a small "BPCellsData" package in that repo, in which case we'd fully avoid the issue of having tests for these in the main BPCells test suite.
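
The environment-variable option could look like this (the variable name BPCELLS_RUN_DOWNLOAD_TESTS is a made-up placeholder, not an agreed convention):

```r
# Sketch: skip the download-heavy test unless an opt-in environment
# variable is set, e.g. BPCELLS_RUN_DOWNLOAD_TESTS=1.
test_that("Getting test data works", {
  skip_if(Sys.getenv("BPCELLS_RUN_DOWNLOAD_TESTS") == "",
          "Set BPCELLS_RUN_DOWNLOAD_TESTS=1 to run download tests")
  expect_no_error(BPCells:::prepare_demo_data(
    file.path(tools::R_user_dir("BPCells", which = "data"), "demo_data")
  ))
})
```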

Collaborator Author


Agreed, I'll add that to the canvas

Comment on lines 49 to 57
# Recreate mat if mat is malformed
tryCatch({
  mat <- open_matrix_dir(file.path(directory, "pbmc_3k_rna_raw"))
}, error = function(e) {
  rna_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_raw_feature_bc_matrix.h5")
  ensure_downloaded(file.path(intermediate_dir, "pbmc_3k_10x.h5"), rna_raw_url, timeout = timeout)
  mat <<- open_matrix_10x_hdf5(file.path(intermediate_dir, "pbmc_3k_10x.h5"), feature_type="Gene Expression") %>%
    write_matrix_dir(file.path(directory, "pbmc_3k_rna_raw"), overwrite = TRUE)
})
# Check if we already ran import
if (!file.exists(file.path(directory, "pbmc_3k_frags"))) {
  atac_raw_url <- paste0(url_base, "pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz")
  ensure_downloaded(file.path(directory, "pbmc_3k_10x.fragments.tsv.gz"), atac_raw_url, timeout = timeout)
  frags <- open_fragments_10x(file.path(directory, "pbmc_3k_10x.fragments.tsv.gz")) %>%
    write_fragments_dir(file.path(directory, "pbmc_3k_frags"))
}
Owner


Two notes:

  1. The use of mat <<- ... without having mat already be defined in the outer scope is a problem -- running prepare_demo_data() then has a side effect of defining a variable mat in the top-level environment. Simply set mat <- NULL before the tryCatch call to fix this
  2. Is the intent to still keep the pbmc_3k_rna_raw and pbmc_3k_frags files in the output directory indefinitely? I had kind of assumed they would be cleaned up as intermediate files.
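
The pitfall in point 1 can be shown in isolation (a toy example, not the PR code):

```r
# <<- walks up the enclosing environments looking for an existing
# binding; if none exists, it assigns in the global environment.
f <- function() {
  handler <- function() mat <<- "oops"
  handler()
}
f()
exists("mat")  # TRUE: `mat` now lives in the global environment

# Fix: pre-define the binding in the enclosing function scope,
# so <<- finds it there and stops.
g <- function() {
  mat <- NULL
  handler <- function() mat <<- "ok"
  handler()
  mat  # returns "ok"; the global env is untouched by g()
}
```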

Collaborator Author


  1. That is a good catch. I incorrectly assumed that <<- only assigns in the parent scope; in fact it falls through to the global environment when the variable is not declared in any enclosing scope.

Collaborator Author

@immanuelazn immanuelazn Apr 10, 2025


2. Maybe there was a bit of a miscommunication here. I think there are two parts to this. First, you're correct that this function was holding two copies of the non-filtered, non-subsetted matrix/fragments, which only happens if you run it with filter_qc = FALSE and subset = FALSE. That went uncaught and is now fixed; only one copy is held now.

But on whether we should keep the non-filtered, non-subsetted matrix and fragments at all:
my impression was that choosing which permutations of filtering and subsetting to keep was going to be a post-PR task, and we would just leave all those options here for now. The reason was that we might use them to make some of the examples more meaningful in the short term (see #233, plot_read_count_knee() and plot_tss_scatter()), and we would refactor after concluding the examples. However, I think a middle ground for now is to not save pbmc_3k_rna_raw and pbmc_3k_frags unless the user has explicitly requested them at least once. What do you think?

Owner


Your changes here on (1) and (2) match what I was aiming for.

I agree we can wait until final cleanup to settle on exactly which version of filtering to keep; I was just concerned about full-size files getting retained unconditionally even if the user was only trying to get a smaller subset.

@immanuelazn immanuelazn merged commit 16faead into main on Apr 10, 2025
4 checks passed
@immanuelazn immanuelazn deleted the ia/test-data branch on August 6, 2025