omega

dN/dS analysis

Installing

conda create -n omega python=3.12
conda activate omega
conda install conda-forge::libcurl
git clone https://github.com/bbglab/omega.git
cd omega
pip install .

Run preprocessing

Input

Modes

Mutabilities generation modes

The only difference between running omega in one mode or another lies in how the mutabilities file is generated (omega preprocessing). The downstream step (omega estimator) is identical in all modes.

OmegaLoc

--absent-synonymous is set to ignore

omega preprocessing --preprocessing-mode compute_mutabilities \
                    --depths-file P19_0044_BDO_01.tsv.gz \
                    --mutations-file P19_0044_BDO_01.mutations.tsv \
                    --input-vep-postprocessed-file consensus.exons_splice_sites.tsv \
                    --mutabilities-table mutability_per_sample_gene_context.P19_0044_BDO_01.tsv \
                    --table-observed-muts mutations_per_sample_gene_impact_context.P19_0044_BDO_01.tsv \
                    --mutational-profile-file P19_0044_BDO_01.all.profile.tsv \
                    --single-sample P19_0044_BDO_01 \
                    --absent-synonymous ignore

OmegaGlobalLoc

--absent-synonymous is set to infer_global_custom

omega preprocessing --preprocessing-mode compute_mutabilities \
                    --depths-file P19_0044_BDO_01.tsv.gz \
                    --mutations-file P19_0044_BDO_01.mutations.tsv \
                    --input-vep-postprocessed-file consensus.exons_splice_sites.tsv \
                    --mutabilities-table mutability_per_sample_gene_context.P19_0044_BDO_01.gLoc.tsv \
                    --table-observed-muts mutations_per_sample_gene_impact_context.P19_0044_BDO_01.gLoc.tsv \
                    --mutational-profile-global-file all_samples.all.profile.tsv \
                    --mutational-profile-file P19_0044_BDO_01.all.profile.tsv \
                    --single-sample P19_0044_BDO_01 \
                    --absent-synonymous infer_global_custom \
                    --synonymous-mutrates-file all_samples.syn_mutrate.tsv

where all_samples.syn_mutrate.tsv contains at least the two columns GENE and MUTDENSITY. MUTDENSITY should be the number of mutations per Mb sequenced across the entire cohort being analyzed.

GENE    MUTDENSITY
ARID1A  0.5
CDKN1A  0.45
CREBBP  0.65
EP300   0.6
FGFR3   0.52
...
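The MUTDENSITY values can be derived with a small script. The following is only a sketch, assuming MUTDENSITY is the cohort-wide count of synonymous mutations in a gene divided by the megabases sequenced there (the per-gene counts below are hypothetical, not real data):

```python
# Sketch: build a table like all_samples.syn_mutrate.tsv.
# Assumed definition: MUTDENSITY = mutations per Mb sequenced across the cohort.

def mutdensity(n_mutations, bp_sequenced):
    """Mutations per Mb sequenced across the whole cohort (assumed definition)."""
    return n_mutations / (bp_sequenced / 1e6)

# Hypothetical per-gene counts: (synonymous mutations in cohort, bp covered summed over samples)
cohort = {
    "ARID1A": (25, 50_000_000),
    "CDKN1A": (9, 20_000_000),
}

rows = ["GENE\tMUTDENSITY"]
for gene, (muts, bp) in sorted(cohort.items()):
    rows.append(f"{gene}\t{mutdensity(muts, bp):.2f}")
print("\n".join(rows))
```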

Issues

  • When generating the omega estimator files with OmegaLoc, any gene without synonymous mutations will produce no output.
  • When running omega estimator, if any of the 96 CONTEXT>MUT channels is 0 because no possible site of that kind can generate a mutation of the impact being analyzed, those channels are discarded (see #22). This is particularly frequent for the nonsense and essential_splice impacts.
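The 96 channels are the standard pyrimidine-centred trinucleotide substitution classes (6 substitution types × 4 upstream × 4 downstream bases). A sketch of how empty channels could be identified; the per-channel site counts here are hypothetical:

```python
# Enumerate the 96 CONTEXT>MUT channels (pyrimidine-centred trinucleotide classes).
BASES = "ACGT"
SUBS = [("C", a) for a in "AGT"] + [("T", a) for a in "ACG"]  # C>A,C>G,C>T,T>A,T>C,T>G

channels = [f"{u}{ref}{d}>{alt}" for ref, alt in SUBS for u in BASES for d in BASES]
assert len(channels) == 96

# Channels with zero possible sites for a given impact would then be dropped:
site_counts = {ch: 0 for ch in channels}   # hypothetical counts per channel
site_counts["ACA>A"] = 12
kept = [ch for ch in channels if site_counts[ch] > 0]
```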

TO BE UPDATED

Run preprocessing

$ python src/preprocessing/preprocessing.py test/input.json

Input

  • File with somatic mutations:

    • Format:
      • The file should contain at least these 5 columns: CHROM, POS, REF, ALT, SAMPLE_ID
  • Depths:

    • Format:
      • The first two columns of the file must indicate the chromosome and the position.
      • The subsequent columns must have the sample name in the header and, for each row, the corresponding depth value at that chromosome and position.
      • Positions without coverage must be filled with 0s.
  • File defining the regions/sites where you want to focus the analysis:

    • BED file defining the sequenced regions (always provide this in the first run). This makes the execution start from the beginning and generates a file with all possible sites, which will then need to be annotated by VEP.

      • Format: BED6-like (not BED12)
      • The start and the end of the intervals are included as part of the analysis.
    • VEP output file:

      • This must be the file downloaded from the Ensembl VEP run that used, as input, the file of all possible sites created by the omega preprocessing module; otherwise the downstream steps might not work.
    • VEP annotated sites postprocessed:

      • If you have already run the full preprocessing, this is the file you should keep; as long as you do not change the regions on which the analysis focuses, you can keep reusing it.
      • You could generate this file yourself (see test/output/preprocessing/*.annotation_summary.tsv) or a file with the same information, but the easiest way to proceed is to run the omega preprocessing module.

Make sure that the chromosome names are defined the same way across the three files. For safety reasons, the "chr" prefix is enforced in all three files.
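A quick consistency check before launching the preprocessing might look like this. The helper below is a sketch, not part of omega, and assumes tab-separated files with a header row and the chromosome in the first column:

```python
# Sketch: verify that chromosome names carry the "chr" prefix before running
# omega preprocessing. Assumes TSV files with a header row; works on .gz too.
import csv
import gzip

def check_chr_prefix(path, chrom_col=0, n_rows=1000, sep="\t"):
    """Return True if every checked row's chromosome starts with 'chr'."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        reader = csv.reader(fh, delimiter=sep)
        next(reader)  # skip header line
        for i, row in enumerate(reader):
            if i >= n_rows:
                break
            if not row[chrom_col].startswith("chr"):
                return False
    return True
```

Run it on the mutations file, the depths file, and the regions file (for a headerless BED, drop the `next(reader)` line) and fix the naming before proceeding.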

Running Ensembl VEP

  1. Take the file provided by the omega preprocessing module.

  2. Upload it to the Ensembl VEP server (see here).

Ensembl VEP upload file

  3. Select the Ensembl VEP parameters:
    • Leave everything as default except for the options choosing which variants to output (see picture below).

Ensembl VEP select output

  4. Download the results back to your computer/cluster.

Download Ensembl VEP results

NOTE: If you are using an HPC cluster, you can alternatively right-click on this button and copy the link, then use wget to download the file directly to your desired location on the cluster, like this:

wget "https://www.ensembl.org/Homo_sapiens/Download/Tools/VEP?format=txt;tl=....." -O regions_sites.VEP_annotated.tsv

  5. Ideally, you would provide the full path to this file back to the omega preprocessing execution, which will be waiting for it. BUT RIGHT NOW this is not possible: SEE BELOW

The user needs to do a bit more preprocessing in bash (TODO: document this properly or add subprocess commands to do it automatically). See here:

cd DIRECTORY_WITH_THE_FILE
zcat Exome.no_header.tab.gz | cut -f 1,7,18 | awk '$3!="-"' | gzip > Exome.no_header.min_columns.tab.gz
zcat Exome.no_header.min_columns.tab.gz | tail -n +2 | awk -F'\t' 'BEGIN {OFS = "\t"} {split($1, a, "[_/]"); print a[1], a[2], a[3], $2, $3}' | gzip > Exome.no_header.min_columns.processed.no_labels.tab.gz

This is just an example, but follow it so that you end up with a dataframe in the following format/order, without a header:

all_possible_sites[['CHROM', 'POS', 'REF', 'ALT', '#Uploaded_variation', 'Consequence', 'SYMBOL', 'MUT']]
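For reference, the split that the awk command performs on the first column can also be written in Python. It assumes the VEP identifier has the form chrom_pos_ref/alt (e.g. chr1_12345_C/T); the function name is illustrative, not part of omega:

```python
# Sketch: split a VEP '#Uploaded_variation' identifier into CHROM, POS, REF, ALT.
# Assumes the "chrom_pos_ref/alt" id format produced from region-style input.
def split_uploaded_variation(uploaded):
    """Return (CHROM, POS, REF, ALT) from an id like 'chr1_12345_C/T'."""
    chrom, pos, alleles = uploaded.split("_")
    ref, alt = alleles.split("/")
    return chrom, pos, ref, alt
```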

THIS WILL BE CORRECTED SOON

Estimator

How to run

$ python src/estimator/main.py --option bayes --cores 2 test/input_estimation.json test/output_estimation_bayes.tsv

Options

--option admits the values bayes or mle, depending on the type of fitting

--cores how many cores to use by the parallel processing

Input

The input is given as a single json dictionary with keys pointing to the required paths.
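As a sketch only: the key names below are placeholders illustrating the structure of such a dictionary, not the keys omega actually expects (check src/estimator for those); the paths are likewise hypothetical:

```python
# Sketch of an estimator input JSON. ALL key names and paths are assumptions
# made for illustration; consult src/estimator for the real schema.
import json

input_config = {
    "depths": "test/depths.tsv.gz",                 # assumed key name
    "vep_annotated_sites": "test/sites.vep.tsv",    # assumed key name
    "regions": "test/regions.bed",                  # assumed key name
    "observed_mutation_counts": "test/muts.tsv",    # assumed key name
    "mutabilities": "test/mutabilities.tsv",        # assumed key name
    "grouping_dictionaries": "test/groups/",        # assumed key name
}
print(json.dumps(input_config, indent=2))
```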

Depths

  • Format: the first two columns of the file must indicate the chromosome and the position; the subsequent columns must have the sample name in the header and, for each row, the corresponding depth value at that chromosome and position.
  • Positions without coverage must be filled with 0s.

VEP annotated sites

File downloaded from the Ensembl VEP run that used, as input, the file of all possible sites created by the omega preprocessing module; otherwise the downstream steps might not work.

Regions

File defining the regions/sites where you want to focus the analysis:

  • BED file defining the regions sequenced
  • Format: BED6-like (not BED12)
  • The start and the end of the intervals are included as part of the analysis.
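Since both ends are inclusive, interval lengths differ from the usual half-open BED convention; for instance:

```python
# Sketch: with inclusive start and end, an interval 100-102 covers 3 sites,
# not 2 as it would under the half-open convention used by standard BED.
def n_sites(start, end):
    """Number of analysed positions in an inclusive interval."""
    return end - start + 1
```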

Observed mutation counts

  • Mutation counts per sample, gene, impact and trinucleotide context
  • This file has been generated by the preprocessing step

Mutabilities

  • Absolute mutability per sample, gene and trinucleotide context
  • This file has been generated by the preprocessing step

Grouping dictionaries folder

  • The folder containing the grouping dictionaries, which specify the grid that the execution will go through.
  • The grouping dictionaries establish value groups. The analysis grid -- all the sample, gene, impact combinations that we want to consider -- takes into account the value groups described in these dictionaries.
  • Three dictionaries must be given: group_samples.json, group_genes.json, group_impacts.json
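The schema of these dictionaries is not documented here; as an assumption for illustration, a grouping dictionary could map each group name to its member values, e.g. for group_samples.json (the second sample ID is hypothetical):

```python
# Sketch of a hypothetical grouping dictionary. The exact schema omega expects
# is an assumption here: a mapping from group name to the values it contains.
import json

group_samples = {
    "all_samples": ["P19_0044_BDO_01", "P19_0045_BDO_01"],  # second ID hypothetical
    "P19_0044_BDO_01": ["P19_0044_BDO_01"],                 # singleton group
}
print(json.dumps(group_samples, indent=2))
```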

Output

The output will be a single table with the estimates for each item analyzed from the grid.

The columns specifying the grid are identical for both methods,

gene, sample, impact

but the columns specifying the estimates coming out of each method will differ:

bayes

mean_dnds, perc_25_dnds, perc_75_dnds

mle

dnds, lower_bound, upper_bound, pvalue, learning_curve
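For example, the mle output table could be post-processed like this. The rows and the 0.05 threshold below are hypothetical, and the learning_curve column is omitted for brevity:

```python
# Sketch: filter the mle output for significant genes. Column names follow the
# list above; the data rows and the p-value cutoff are hypothetical examples.
import csv
import io

table = (
    "gene\tsample\timpact\tdnds\tlower_bound\tupper_bound\tpvalue\n"
    "NOTCH1\tall_samples\tmissense\t2.4\t1.8\t3.1\t0.001\n"
    "TP53\tall_samples\tmissense\t1.1\t0.7\t1.6\t0.40\n"
)

significant = [
    row["gene"]
    for row in csv.DictReader(io.StringIO(table), delimiter="\t")
    if float(row["pvalue"]) < 0.05
]
print(significant)  # → ['NOTCH1']
```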
