Update adding treatment metadata, event columns, wgd and pre-computed msi #36
davidrequena wants to merge 2 commits into main from
Would you mind reopening the old PR? I'd prefer to keep these changes contained to a single PR thread as it makes it a bit easier for me to follow the thread of discussion. I'm going to close this PR for now. Please reopen the old one (which already has your changes) and add your comment.
    return(metadata)
}

#' @name add_sv_columns
We discussed the convention of how the nested events should be represented in metadata.json and datafiles.json based on the datasets schema (see the screenshot you sent me) earlier at my desk. This code won't create the nested entry in the json for complex events.
Have you tried loading in a json via jsonlite::fromJSON in R to see how nested columns would be represented in R before they're written to the json? There are examples for you to work with to be able to engineer this.
To clarify: when complex events are parsed in the R data.table object metadata, what gets written to metadata.json should conform to the nested structure specified in the datasets.json schema in your screenshot above (as we spoke about).
The schema in datasets.json means it expects entries in datafiles.json that look like:
    {
      "complex_events": {
        "pyrgo": 0,
        "del": 0,
        ...
      }
    }
That means nesting the column as a list element in the R data.table object metadata. Load other datafiles.json files in R via jsonlite::fromJSON() and look at how similar fields are constructed in a data.table/data.frame structure.
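A minimal sketch of the nesting described above, for illustration only. It uses a plain data.frame with a list column rather than skilift's actual data.table metadata object, and the sample name "s1" and the counts are made-up values; whether this carries over unchanged to the real metadata object is an assumption.

```r
library(jsonlite)

# Hypothetical one-row metadata table; "s1" and the counts are made-up values.
metadata <- data.frame(pair = "s1", stringsAsFactors = FALSE)

# Store the per-sample counts as a single list element, so the column is a
# list column rather than one flat column per event type.
metadata$complex_events <- list(list(pyrgo = 0, del = 0))

# auto_unbox = TRUE writes length-one vectors as scalars instead of arrays.
json <- toJSON(metadata, auto_unbox = TRUE)
print(json)

# Read it back without simplification to inspect the nesting.
parsed <- fromJSON(json, simplifyVector = FALSE)
str(parsed[[1]]$complex_events)
```

Going the other way, calling fromJSON() on an existing datafiles.json and running str() on the result shows how nested fields are represented in R before anything is written.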
Please re-implement this such that this will conform to the convention in the schema in datasets.json.
In the new commit of the PR, I am removing this function. I think I can pull this info from sv_type_counts in the datasets.json directly. I deleted all references to the function.
lstix = seq_len(NROW(added_field_values))
for (ii in lstix) {
    field = added_field_values[ii]
    fnm = names(field)
    value = field[[1]]
    metadata[[fnm]] = value
}
What is the reason for moving this block of code to the beginning?
This lets me add the added fields to metadata first and then check, in the lines below, whether an input was already provided, so the function that calculates the corresponding column doesn't need to run (or try to run). This is easier to see in the new commit of the PR.
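The ordering described here can be sketched roughly as follows. This is not the code in the PR: fill_metadata and compute_tmb are hypothetical names, compute_tmb stands in for whatever calculation would otherwise run, and a plain data.frame is used in place of the data.table metadata object.

```r
# Hypothetical stand-in for an expensive calculation.
compute_tmb <- function() {
  message("computing tmb")
  1.5
}

# Apply user-supplied fields first, then compute only what is still missing.
fill_metadata <- function(metadata, added_field_values = list()) {
  for (fnm in names(added_field_values)) {
    metadata[[fnm]] <- added_field_values[[fnm]]
  }
  # A column counts as provided if it exists and is not entirely NA.
  has_value <- function(col) {
    !is.null(metadata[[col]]) && !all(is.na(metadata[[col]]))
  }
  if (!has_value("tmb")) {
    metadata[["tmb"]] <- compute_tmb()
  }
  metadata
}

provided <- fill_metadata(data.frame(pair = "s1"), list(tmb = 3.2))
computed <- fill_metadata(data.frame(pair = "s1"))
```

In the first call the supplied value is kept and compute_tmb() never runs; in the second call nothing was supplied, so the value is calculated.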
#' @name add_treatment_metadata
#' @title Add treatment metadata
#' @description
#' Adds treatment metadata information such as age at biopsy, treatment lines, number of different treatment lines, and information about the treatment line with the best response (name, response, mechanism, PFS duration).
#'
#' @param metadata A data.table containing metadata.
#' @param input_age_at_biopsy Age at the biopsy.
#' @param input_treatment_lines Different treatment lines received.
#' @param input_n_treatment_lines Number of treatment lines received.
#' @param input_best_treatment Name of the treatment line with the best response.
#' @param input_best_treatment_response Best response obtained among the treatment lines.
#' @param input_best_treatment_mechanism Mechanism of the treatment line with the best response.
#' @param input_best_treatment_PFS_duration Progression-free survival of the treatment line with the best response.
#' @return Updated metadata with treatment information added.
add_treatment_metadata <- function(
    metadata,
    input_age_at_biopsy = NULL,
    input_treatment_lines = NULL,
    input_n_treatment_lines = NULL,
    input_best_treatment = NULL,
    input_best_treatment_response = NULL,
    input_best_treatment_mechanism = NULL,
    input_best_treatment_PFS_duration = NULL
) {

    # Validate and use age_at_biopsy if provided
    if (!is.null(input_age_at_biopsy)) {
        if (!is.numeric(input_age_at_biopsy)) {
            warning("age_at_biopsy must be a number, ignored")
        } else {
            metadata[, age_at_biopsy := input_age_at_biopsy]
        }
    }

    # Validate and use treatment_lines if provided
    if (!is.null(input_treatment_lines)) {
        if (!is.character(input_treatment_lines)) {
            warning("input_treatment_lines must be a character, ignored")
        } else {
            metadata[, treatment_lines := input_treatment_lines]
        }
    }

    # Validate and use input_n_treatment_lines if provided
    if (!is.null(input_n_treatment_lines)) {
        if (!is.numeric(input_n_treatment_lines)) {
            warning("input_n_treatment_lines must be a number, ignored")
        } else {
            metadata[, n_treatment_lines := input_n_treatment_lines]
        }
    }

    # Validate and use input_best_treatment if provided
    if (!is.null(input_best_treatment)) {
        if (!is.character(input_best_treatment)) {
            warning("input_best_treatment must be a character, ignored")
        } else {
            metadata[, best_treatment := input_best_treatment]
        }
    }

    # Validate and use input_best_treatment_response if provided
    if (!is.null(input_best_treatment_response)) {
        if (!is.character(input_best_treatment_response)) {
            warning("input_best_treatment_response must be a character, ignored")
        } else {
            metadata[, best_treatment_response := input_best_treatment_response]
        }
    }

    # Validate and use input_best_treatment_mechanism if provided
    if (!is.null(input_best_treatment_mechanism)) {
        if (!is.character(input_best_treatment_mechanism)) {
            warning("input_best_treatment_mechanism must be a character, ignored")
        } else {
            metadata[, best_treatment_mechanism := input_best_treatment_mechanism]
        }
    }

    # Validate and use input_best_treatment_PFS_duration if provided
    if (!is.null(input_best_treatment_PFS_duration)) {
        if (!is.numeric(input_best_treatment_PFS_duration)) {
            warning("input_best_treatment_PFS_duration must be a number, ignored")
        } else {
            metadata[, best_treatment_PFS_duration := input_best_treatment_PFS_duration]
        }
    }
    return(metadata)
}
This treatment metadata is specific to HMF and does not apply to other datasets. Skilift shouldn't have to support every variable that is only useful for one dataset. The right place for this would be in code documenting a specific analysis (like Jupyter, R Markdown, emacs blog files, etc).
Please remove this block from the PR.
Done in the new update of the PR
…onditional to msi, tmb, and wgd to detect if the value was already provided before running
@shihabdider in this second commit I no longer pass the provided values as parameters, and I added conditionals (lines 1844-1872) that prevent the corresponding function from executing if the value is already provided in the cohort.
My changes are the following:
I added a new function (add_wgd) which calculates whole-genome doubling (WGD) using the method from Bielski's paper (https://pubmed.ncbi.nlm.nih.gov/30013179/) or uses the value provided.
We do not currently have wgd; this is new functionality.
I modified two functions already present (add_tmb and add_msisensor_score), to accept provided values before attempting calculation. This is necessary for HMF, because we have the MSI but we don't have the inputs to run msisensor.
The two functions I am adding (add_treatment_metadata and add_sv_columns) pull metadata from the cohort object and add it to the metadata.json object, which will later be integrated into datafiles.json and datafiles.arrow. This allows these columns to be used both in the aggregation plots and in the frontend interface, by editing datasets.json.
The advantage over the current state is automating the acquisition instead of having to manipulate the individual .json files in R.
I modified create_metadata to add steps for the functions I described above. I also added the corresponding man files.
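For readers unfamiliar with the WGD call mentioned above, here is a rough sketch of one way such a criterion can be computed. It is not the add_wgd implementation from this PR. It assumes the criterion is that more than half of the autosomal genome has a major copy number of at least two, and that the input is a table of segments with chrom, start, end and major_cn columns; the function name estimate_wgd, the column names, the use of covered segment length as the denominator, and the example values are all assumptions made for illustration.

```r
# Assumed input: one row per copy-number segment with columns
# chrom, start, end, and major_cn (major allele copy number).
estimate_wgd <- function(segments, cn_threshold = 2, fraction_threshold = 0.5) {
  autosomes <- as.character(1:22)
  # Accept both "1" and "chr1" style chromosome names.
  chrom <- sub("^chr", "", as.character(segments$chrom))
  keep <- chrom %in% autosomes & !is.na(segments$major_cn)
  seg <- segments[keep, , drop = FALSE]
  width <- seg$end - seg$start + 1
  # Fraction of covered autosomal length at or above the copy-number threshold.
  frac <- sum(width[seg$major_cn >= cn_threshold]) / sum(width)
  list(fraction = frac, wgd = frac > fraction_threshold)
}

# Made-up segments: the X chromosome row is ignored by the autosome filter.
segments <- data.frame(
  chrom = c("1", "1", "2", "X"),
  start = c(1, 101, 1, 1),
  end = c(100, 200, 300, 500),
  major_cn = c(2, 1, 3, 4)
)
result <- estimate_wgd(segments)
print(result)
```

In this toy input, 400 of the 500 autosomal bases have a major copy number of two or more, so the fraction is 0.8 and the sample would be flagged as WGD under the assumed threshold.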