diff --git a/vignettes/ColocBoost_Wrapper_Pipeline.Rmd b/vignettes/ColocBoost_Wrapper_Pipeline.Rmd
index d54f2c8..714f293 100644
--- a/vignettes/ColocBoost_Wrapper_Pipeline.Rmd
+++ b/vignettes/ColocBoost_Wrapper_Pipeline.Rmd
@@ -21,11 +21,12 @@ This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost
 `colocboost_pipeline` with [link](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R).
 - See more details about input data preparation in `xqtl_protocol` with [link](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html).
+Acknowledgements: Thanks to Kate (Kathryn) Lawrence (GitHub:@kal26) for her contributions to this vignette.
-Step 1: Loading individual-level and summary statistics using `load_multitask_regional_data` function from multiple cohorts or datasets
+# 1. Loading Data using `colocboost_analysis_pipeline` function
+This function harmonizes the input data and prepares it for colocalization analysis.
-Step 2: Perform ColocBoost using `colocboost_analysis_pipeline` function
 In this section, we introduce how to load the regional data required for the ColocBoost analysis using the `load_multitask_regional_data` function. This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, covariate data), summary statistics
@@ -38,7 +39,8 @@ Below are the input parameters for this function for loading individual-level da
 ## 1.1. Loading individual-level data from multiple cohorts
-inputs:
+Inputs:
+
 - **`region`**: String; Genomic region of interest in the format of `chr:start-end` for the phenotype region you want to analyze.
 - **`genotype_list`**: Character vector; Paths for PLINK bed files containing genotype data (do NOT include .bed suffix).
 - **`phenotype_list`**: Character vector; Paths for phenotype file names.
@@ -55,7 +57,8 @@ inputs:
 - **`xvar_cutoff`**: Numeric; Minimum genotype variance cutoff. Default is 0.
 - **`imiss_cutoff`**: Numeric; Maximum individual missingness cutoff. Default is 0.
-outputs:
+Outputs:
+
 - **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only individual-level data is loaded, `sumstat_data` will be `NULL`.
@@ -84,7 +87,6 @@ xvar_cutoff = 0
 imiss_cutoff = 0.9
 # More advanced parameters see pecotmr::load_multitask_regional_data()
-
 region_data_individual <- load_multitask_regional_data(
   region = region,
   genotype_list = genotype_list,
@@ -109,7 +111,8 @@ region_data_individual <- load_multitask_regional_data(
 ## 1.2. Loading summary statistics from multiple cohorts or datasets
-inputs:
+Inputs:
+
 - **`sumstat_path_list`**: Character vector; Paths to the summary statistics.
 - **`column_file_path_list`**: Character vector; Paths to the column mapping files. See below for expected format.
 - **`LD_meta_file_path_list`**: Character vector; Paths to LD metadata files. See below for expected format.
@@ -120,7 +123,8 @@ inputs:
 - **`n_cases`**: Integer vector; Number of cases. Set to 0 if `n_samples` is passed explicitly. If unknown, set as 0 and include `n_cases` column in the column mapping file to retrieve from the sumstat file.
 - **`n_controls`**: Integer vector; Number of controls. Set to 0 if `n_samples` is passed explicitly. If unknown, set as 0 and include `n_controls` column in the column mapping file to retrieve from the sumstat file.
-outputs:
+Outputs:
+
 - **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only summary statistics data is loaded, `individual_data` will be `NULL`.
 **Summary statistics loading example**
@@ -143,7 +147,6 @@ n_controls = c(0, 40000)
 # More advanced parameters see pecotmr::load_multitask_regional_data()
-
 region_data_sumstat <- load_multitask_regional_data(
   sumstat_path_list = sumstat_path_list,
   column_file_path_list = column_file_path_list,
@@ -160,6 +163,7 @@ region_data_sumstat <- load_multitask_regional_data(
 **Expected format for column mapping file**
+
 The column mapping file is YAML (`.yml`) with key: value pairs mapping your input column names to the standardized names expected by the loader. Required columns are `chrom`, `pos`, `A1`, and `A2`, and either `z` or `beta` and `sebeta`. Either 'n_case' and 'n_control' or 'n_samples' can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameters passed explicitly.
@@ -204,7 +208,8 @@ The colocalization analysis can be run in any one of three modes, or in a combin
 - **`joint GWAS mode`**: Perform colocalization analysis in disease-agnostic mode on the individual-level and summary statistics data together.
 - **`separate GWAS mode`**: Perform colocalization analysis in disease-prioritized mode on the individual-level data and each summary statistics dataset separately, treating each summary statistics dataset as the focal trait.
-inputs:
+Inputs:
+
 - **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function.
 - **`focal_trait`**: String; For xQTL-only mode, the name of the trait to perform disease-prioritized ColocBoost, from `conditions_list_individual`. If not provided, xQTL-only mode will be run without disease-prioritized mode.
 - **`event_filters`**: List of character vectors; Patterns for filtering events based on context names.
@@ -219,11 +224,13 @@ Example: for sQTL, `list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern =
 - **`joint_gwas`**: Logical; if TRUE, performs joint GWAS mode, mapping all individual-level and sumstat data together. Default is `FALSE`.
 - **`separate_gwas`**: Logical; if TRUE, runs separate GWAS mode, where each sumstat dataset is analyzed separately with all individual-level data, treating each sumstat as the focal trait in disease-prioritized mode. Default is `FALSE`.
-outputs:
+Outputs:
+
 - **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`.
 ```{r, colocboost-analysis, eval = FALSE}
-# load in individual-level and sumstat data
+#### Please check the example code below ####
+# # load in individual-level and sumstat data
 region_data_combined <- load_multitask_regional_data(
   region = region,
   genotype_list = genotype_list,
@@ -277,4 +284,4 @@ colocboost_plot(colocboost_results$joint_gwas)
 for (i in 1:length(colocboost_results$separate_gwas)) {
   colocboost_plot(colocboost_results$separate_gwas[[i]])
 }
-```
+```
\ No newline at end of file
diff --git a/vignettes/announcements.Rmd b/vignettes/announcements.Rmd
index c7e9f58..08fb927 100644
--- a/vignettes/announcements.Rmd
+++ b/vignettes/announcements.Rmd
@@ -14,6 +14,11 @@ vignette: >
 - *May 2, 2025*: `colocboost` R package is available on [CRAN](https://CRAN.R-project.org/package=colocboost).
 ## Software updates
+- `v1.0.7` Improvements to ColocBoost (check out the full details in [PR](https://github.com/StatFunGen/colocboost/pull/116)).
+  - Enhanced `colocboost_plot` function with flexible highlighting options and new visualization styles.
+  - Optimized performance and computational efficiency
+  - Improved documentation and examples for the wrapper pipeline
+  - Minor bug fixes for increased stability
 - `v1.0.6` Memory optimization and visualization improvements with bug fixes [CRAN](https://CRAN.R-project.org/package=colocboost).
   - Optimized LD-free version to reduce memory usage by eliminating large identity LD matrix generation
   - Enhanced `colocboost_plot` function with improved horizontal and vertical spacing labels