Update adding treatment metadata, event columns, wgd and pre-computed msi #36

Open
davidrequena wants to merge 2 commits into main from dr_dev

Conversation

@davidrequena
Contributor

My changes are the following:

I added a new function (add_wgd), which calculates Whole Genome Doubling (WGD) using the method from Bielski's paper (https://pubmed.ncbi.nlm.nih.gov/30013179/) or uses the value provided.
We do not currently have WGD, so this is new functionality.
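
For readers unfamiliar with the WGD call, here is a minimal R sketch. It assumes the criterion from the cited paper is that a sample is called WGD when more than half of the autosomal genome has a major copy number of 2 or more. The function name call_wgd, the column names (chrom, start, end, major_cn) and the threshold argument are hypothetical and chosen for illustration; this is not the add_wgd code in this PR.

```r
# Hypothetical helper, not the PR's add_wgd.
# segs: data.frame of copy-number segments with assumed columns
#       chrom, start, end, major_cn (major allele copy number).
call_wgd <- function(segs, threshold = 0.5) {
  # Keep autosomal segments with a known major copy number.
  chrom <- sub("^chr", "", as.character(segs$chrom))
  keep <- chrom %in% as.character(1:22) & !is.na(segs$major_cn)
  segs <- segs[keep, , drop = FALSE]
  if (nrow(segs) == 0) {
    return(list(fraction = NA_real_, wgd = NA))
  }
  # Fraction of autosomal bases with major copy number >= 2.
  width <- segs$end - segs$start + 1
  fraction <- sum(width[segs$major_cn >= 2]) / sum(width)
  list(fraction = fraction, wgd = fraction > threshold)
}

# Made-up toy segments; chrX is ignored because it is not an autosome.
segs <- data.frame(
  chrom = c("chr1", "chr2", "chrX"),
  start = c(1, 1, 1),
  end = c(100, 50, 1000),
  major_cn = c(2, 1, 3)
)
res <- call_wgd(segs)
```

In this toy input, 100 of the 150 autosomal bases have a major copy number of at least 2, so the sample would be called WGD under the assumed threshold.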

I modified two existing functions (add_tmb and add_msisensor_score) to accept provided values before attempting the calculation. This is necessary for HMF because we have the MSI values but not the inputs needed to run msisensor.

The two functions I am adding (add_treatment_metadata and add_sv_columns) pull metadata from the cohort object and add it to the metadata.json object, which will later be integrated into datafiles.json and datafiles.arrow. This allows these columns to be used both in the aggregation plots and in the frontend interface when editing datasets.json.
The advantage over the current state is that acquisition is automated, instead of having to manipulate the individual .json files in R.

I modified create_metadata to add steps for the functions I described above. I also added the corresponding man files.

@shihabdider
Contributor

Would you mind reopening the old PR? I'd prefer to keep these changes contained to a single PR thread as it makes it a bit easier for me to follow the thread of discussion. I'm going to close this PR for now. Please reopen the old one (which already has your changes) and add your comment.

@shihabdider shihabdider reopened this Mar 26, 2026
Comment thread R/metadata.R Outdated
return(metadata)
}

#' @name add_sv_columns
Contributor

We discussed the convention of how the nested events should be represented in metadata.json and datafiles.json based on the datasets schema (see the screenshot you sent me) earlier at my desk. This code won't create the nested entry in the json for complex events.

Have you tried loading in a json via jsonlite::fromJSON in R to see how nested columns would be represented in R before they're written to the json? There are examples for you to work with to be able to engineer this.

Image

Contributor

To clarify: when complex events are parsed from the R data.table object metadata and written to metadata.json, the output should conform to the nested structure specified in the datasets.json schema in your screenshot above (as we discussed).

Contributor

The datasets.json schema implies entries in datafiles.json that look like:

{
    "complex_events": {
        "pyrgo": 0,
        "del": 0, 
        ...
    }
}

That means nesting the column as a list element in the R data.table object metadata. Load other datafiles.json files into R via jsonlite::fromJSON() and look at how similar fields are constructed in a data.table/data.frame structure.
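
A minimal sketch of that round trip, assuming jsonlite is installed. The sample name "sample1", the pair column, and the counts are made up, and a plain data.frame stands in for the data.table metadata object:

```r
library(jsonlite)

# Made-up one-row example with a nested complex_events object.
json_in <- '[{"pair": "sample1", "complex_events": {"pyrgo": 0, "del": 0}}]'

# With default simplification, the nested object becomes a
# data.frame-valued column of the parsed data.frame.
parsed <- fromJSON(json_in)
str(parsed)

# Building the same shape by hand: a list column holding one named
# list per row.
metadata <- data.frame(pair = "sample1", stringsAsFactors = FALSE)
metadata$complex_events <- list(list(pyrgo = 0, del = 0))

# auto_unbox = TRUE writes length-one vectors as scalars, not arrays.
json_out <- toJSON(metadata, auto_unbox = TRUE)
cat(json_out, "\n")
```

Inspecting str(parsed) on a real datafiles.json is the quickest way to see which R structure a given nested field maps to before writing the code that produces it.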

Contributor

Please re-implement this such that this will conform to the convention in the schema in datasets.json.

Contributor Author

@davidrequena davidrequena Mar 27, 2026

In the new commit of the PR, I am removing this function. I think I can pull this info from sv_type_counts in the datasets.json directly. I deleted all references to the function.

Comment thread R/metadata.R
Comment on lines -1764 to -1771
lstix = seq_len(NROW(added_field_values))
for (ii in lstix) {
  field = added_field_values[ii]
  fnm = names(field)
  value = field[[1]]
  metadata[[fnm]] = value
}

Contributor

@kevinmhadi kevinmhadi Mar 26, 2026

What is the reason for moving this block of code to the beginning?

Contributor Author

Moving it up lets me add the provided fields to metadata first, so that the lines below can check whether an input was already provided. If it was, the function that calculates the corresponding column doesn't need to run (or try to run). This is easier to see in the new commit of the PR.
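
The ordering described here can be sketched roughly as follows. The helper needs_compute, the stand-in compute_msi, and the column names msisensor_score and wgd are hypothetical; a plain data.frame is used in place of the data.table, and the real conditionals in the PR may look different:

```r
# Hypothetical helper: TRUE when `column` is absent or holds only NA.
needs_compute <- function(metadata, column) {
  !(column %in% names(metadata)) || all(is.na(metadata[[column]]))
}

# Hypothetical stand-in for an expensive calculation.
compute_msi <- function(metadata) 0.5

# One-row metadata; the real object is a data.table.
metadata <- data.frame(pair = "sample1", stringsAsFactors = FALSE)

# Step 1: copy the user-supplied fields in first (mirrors the moved loop).
added_field_values <- list(msisensor_score = 12.3)
for (fnm in names(added_field_values)) {
  metadata[[fnm]] <- added_field_values[[fnm]]
}

# Step 2: run a calculation only when its column is still missing.
if (needs_compute(metadata, "msisensor_score")) {
  metadata$msisensor_score <- compute_msi(metadata)
}
if (needs_compute(metadata, "wgd")) {
  metadata$wgd <- FALSE  # placeholder for a real WGD calculation
}
```

Here the provided msisensor_score is kept and compute_msi never runs, while wgd, which was not provided, falls through to the calculation branch.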

Comment thread R/metadata.R Outdated
Comment on lines +1724 to +1813
#' @name add_treatment_metadata
#' @title Add treatment metadata
#' @description
#' Adds treatment metadata information such as age at biopsy, treatment lines, number of different treatment lines, and information about the treatment line with the best response (name, response, mechanism, PFS duration).
#'
#' @param metadata A data.table containing metadata.
#' @param input_age_at_biopsy Age at the biopsy.
#' @param input_treatment_lines Different treatment lines received.
#' @param input_n_treatment_lines Number of treatment lines received.
#' @param input_best_treatment Name of the treatment line with the best response.
#' @param input_best_treatment_response Best response obtained among the treatment lines.
#' @param input_best_treatment_mechanism Mechanism of the treatment line with the best response.
#' @param input_best_treatment_PFS_duration Progression-free survival of the treatment line with the best response.
#' @return Updated metadata with treatment information added.
add_treatment_metadata <- function(
  metadata,
  input_age_at_biopsy = NULL,
  input_treatment_lines = NULL,
  input_n_treatment_lines = NULL,
  input_best_treatment = NULL,
  input_best_treatment_response = NULL,
  input_best_treatment_mechanism = NULL,
  input_best_treatment_PFS_duration = NULL
) {

  # Validate and use age_at_biopsy if provided
  if (!is.null(input_age_at_biopsy)) {
    if (!is.numeric(input_age_at_biopsy)) {
      warning("age_at_biopsy must be a number, ignored")
    } else {
      metadata[, age_at_biopsy := input_age_at_biopsy]
    }
  }

  # Validate and use treatment_lines if provided
  if (!is.null(input_treatment_lines)) {
    if (!is.character(input_treatment_lines)) {
      warning("input_treatment_lines must be a character, ignored")
    } else {
      metadata[, treatment_lines := input_treatment_lines]
    }
  }

  # Validate and use input_n_treatment_lines if provided
  if (!is.null(input_n_treatment_lines)) {
    if (!is.numeric(input_n_treatment_lines)) {
      warning("input_n_treatment_lines must be a number, ignored")
    } else {
      metadata[, n_treatment_lines := input_n_treatment_lines]
    }
  }

  # Validate and use input_best_treatment if provided
  if (!is.null(input_best_treatment)) {
    if (!is.character(input_best_treatment)) {
      warning("input_best_treatment must be a character, ignored")
    } else {
      metadata[, best_treatment := input_best_treatment]
    }
  }

  # Validate and use input_best_treatment_response if provided
  if (!is.null(input_best_treatment_response)) {
    if (!is.character(input_best_treatment_response)) {
      warning("input_best_treatment_response must be a character, ignored")
    } else {
      metadata[, best_treatment_response := input_best_treatment_response]
    }
  }

  # Validate and use input_best_treatment_mechanism if provided
  if (!is.null(input_best_treatment_mechanism)) {
    if (!is.character(input_best_treatment_mechanism)) {
      warning("input_best_treatment_mechanism must be a character, ignored")
    } else {
      metadata[, best_treatment_mechanism := input_best_treatment_mechanism]
    }
  }

  # Validate and use input_best_treatment_PFS_duration if provided
  if (!is.null(input_best_treatment_PFS_duration)) {
    if (!is.numeric(input_best_treatment_PFS_duration)) {
      warning("input_best_treatment_PFS_duration must be a number, ignored")
    } else {
      metadata[, best_treatment_PFS_duration := input_best_treatment_PFS_duration]
    }
  }
  return(metadata)
}

Contributor

This treatment metadata is specific to HMF and does not apply to other datasets. Skilift shouldn't have to support every variable that is only useful for one dataset. The right place for this would be in code documenting a specific analysis (like jupyter, r markdown, emacs blog files, etc).

Contributor

Please remove this block from the PR.

Contributor Author

Done in the new update of the PR.

…onditional to msi, tmb, and wgd to detect if the value was already provided before running
@davidrequena
Contributor Author

@shihabdider in this second commit I am no longer passing the provided values as parameters. Instead, I added conditionals (lines 1844-1872) that prevent the corresponding function from executing if the value is already provided in the cohort.

3 participants