BOCC

Biological Ontology Cluster Comparison (BOCC) - Leverages the knowledge contained with in resources like the STRING protein-ptotein-interaction networks and the Human Phenotype Ontology to hypothesis about latent gene-to-phenotype connections. The CLI and Web App below provides access to clusters of genes and phenotypes where currently unconnected genes are phenotype are much more likely to be connected in the future.

Visualization Web App

Use the BOCC web app to explore and visualize the clusters in a fun interactive way!

Command Line Interface (CLI)

This CLI is useful for integration into automated pipeline and or for searching for large numbers of genes and phenoype pairs quickly.

Set-up

Download the CLI script

wget https://raw.githubusercontent.com/MSBradshaw/BOCC/cli/bocc_cli.py

Download the edge list

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1BDrUa4Y55fFjx3j2tz7CzwD4qmgZS3fp' -O edgelist.2022.txt

Download genes file (HG37)

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=17EHfiA3aiu2LouwhpNgX5VhC6ZaMcEpk' -O hg37.exons.bed.gz
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1jxCuK7xKzz5r2cFlVkr2nFAfeW5c1EsG' -O hg37.exons.bed.gz.tbi

Download clusters

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZfjtKn3PxB8tTe0kg-OxH5BJlvcU30fz' -O all_clusters.2022.txt

Download example data

wget https://raw.githubusercontent.com/MSBradshaw/BOCC/cli/Example/test.hpo.txt
wget https://github.com/MSBradshaw/BOCC/raw/cli/Example/test.vcf.gz
wget https://github.com/MSBradshaw/BOCC/raw/cli/Example/test.vcf.gz.tbi

Configure environment

conda env create -f env.yml

Usage

conda activate bocc

python bocc_cli.py -v test.vcf.gz \
-g hg37.exons.bed.gz \
-p test.hpo.txt \
-c all_clusters.2022.txt \
-o TestDir \
-e edgelist.2022.txt \
-n 1

Output

BOCC will create the dirrectory specified with -o and place in it two files:

novel_matches.txt - which contains a list of affected genes and HPO terms that co-occur in the same cluster and are currently not connected - suggestive of a talent gene-to-phenotype connection.

preexisting_matches.txt - which contains any gene-to-phenotype pairs from the input data that already have a known connection.

Arguments

-v, --vcf: Path to the tabixed and BGZipped VCF file. Alternatively you an also provide a tabixed and BGZipped .bed file.
-g, --genes: Path to a .bed file that lists the genes, the 4th column should contain the gene name
-p, --hpos: Path to the file containing the HPO terms - one per line
-c, --clusters: Path to the BOCC cluster file
-o, --output: Path to the output dirrectory
-e, --edgelist: Path to the edgelist
-n, --num_processes: Number of cores to use (default 1). If searching for <= 50 genes and phenotypes the non-parallelized version is quicker.
--verbose: flag to enable verbose mode

Reproduce Results

We are committed to open and reproducible science. To best facilitate the reproducibiliy of our results we have bundled all of our methods into a snakemake pipeline and accompanying conda enviroment. Fair warning just because it is reproducible does not mean it is quick. The graph we repretendly cluster is large and some of the algorithms are slow and can take a week or more to complete even with 64 cores running in parallel. Generating the results depends on the Panther API for enrichment analysis which can also be an extremely slow processes and is prone to blocking IP addresses if you are try making more than 20 requests at the same time, FYI.

Conda Env

conda env create -f reproduction.env.yml

Reproduce Results

The figures and tables seen in the article can be reproduced with the jupyter notebooks in Experiments.

Reproduce data & clusters

conda activate reproduce_bocc
snakemake -s run_everything.smk --cores 64

This pipeline will produce the clusters them selves and the XGBoost prediction results.

Name		Name	Last commit message	Last commit date
Latest commit History 120 Commits
Algorithms		Algorithms
BOCC		BOCC
ClassificationResults		ClassificationResults
Example		Example
Experiments		Experiments
Resources		Resources
Scripts		Scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
bocc_cli.py		bocc_cli.py
env.yml		env.yml
generate_bocc_results.smk		generate_bocc_results.smk
prepare_subcoms.smk		prepare_subcoms.smk
reproduction.env.yml		reproduction.env.yml
run_everything.smk		run_everything.smk
train_19_20_genetic_algo_outputs.txt		train_19_20_genetic_algo_outputs.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BOCC

Visualization Web App

Command Line Interface (CLI)

Set-up

Usage

Output

Arguments

Reproduce Results

Conda Env

Reproduce Results

Reproduce data & clusters

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

BOCC

Visualization Web App

Command Line Interface (CLI)

Set-up

Usage

Output

Arguments

Reproduce Results

Conda Env

Reproduce Results

Reproduce data & clusters

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages