Motif: Methylation Factor Identifiers

Motif is a modular pipeline for uncovering upstream regulators of aberrant DNA methylation. It prepares matched expression and methylation matrices, runs an adapted GRNBoost2 inference multiple times with different seeds, and aggregates the resulting gene–CpG networks into a single integrated network using user‑configurable aggregation methods. Genes with the highest aggregated edge weights are then prioritized and evaluated by their network‐proximity to known methylation enzymes (DNMT1, DNMT3A/B, TET1/2/3).

Repository Structure

Motif/
├── README.md
├── environment.yml                     # Conda environment specification
├── requirements.txt                    # pip dependencies
├── pyproject.toml                      # editable install
├── src/
│   ├── motif_core/
│   │   ├── grn_inference.py            # load data and run modified GRNBoost2
│   │   ├── aggregation.py              # aggregate per-seed networks
│   │   └── aggregation_grid_search.py  # parameterized aggregation grid search
│   ├── motif_eval/
│   │   ├── evaluation.py               # single-run network-proximity p-values
│   │   └── evaluation_grid_search.py   # evaluate aggregation grid search results
│   ├── motif_utils/                    # utility modules for aggregation and evaluation
│   │   ├── aggregation_utils.py        # helper functions for aggregation workflows
│   │   └── evaluation_utils.py         # helper functions for evaluation workflows
│   └── motif_grnboost/                 # adapted Arboreto code
│       ├── algo.py
│       ├── core.py
│       └── utils.py
├── notebooks/                          # Jupyter Notebooks for exploratory analyses and visualizations
│   ├── motif.ipynb                     # execute the full inference pipeline and inspect results
│   ├── evaluation.ipynb                # compute network‑proximity p‑values for aggregated results
│   ├── parameter_tuning.ipynb          # grid‑search analysis of aggregation parameters
│   └── gene_enrichment.ipynb           # functional enrichment of top candidate genes
├── data/                               # example input templates
├── results/                            # output directory
└── resources/                          # auxiliary files

Installation

Follow these steps to set up the MOTIF pipeline locally:

git clone https://github.com/your-org/MOTIF.git
cd MOTIF
conda env create -f environment.yml
conda activate motif
pip install -e .

Ensure you unzip the probe manifest before running:

unzip resources/hippie.txt.zip -d resources/

The repository includes a .pybiomart.sqlite file in resources/, allowing import of src modules directly.

Usage

1. Network Inference

Infer gene-CpG regulatory networks using a modified GRNBoost2 algorithm over multiple random seeds:

grn_inference() loads expression and methylation matrices and runs the adapted GRNBoost2 for the given seed.
Saves gene-CpGs weight table to results/grn_inference/grn_with_seed_<seed>.

Example usage:

python src/motif_core/grn_inference.py --cpgs_file data/sample_cpgs.tsv --genes_file data/sample_genes.tsv --seed 1

Required Arguments

Argument	Description	Type
`--cpgs_file`	Path to a TSV file with CpG methylation data.	`str`
`--genes_file`	Path to a TSV file with gene expression data.	`str`
`--seed`	Seed for GRNBoost2 run.	`int`

Optional Arguments

Argument	Description	Type	Default
`--should_filter_cpgs`	Filter CpGs to gene regions if set.	`flag`	False
`--filtering_genes_file`	File with Ensembl gene IDs (one per line) for filtering CpGs.	`str`	–
`--filtering_keep_promoters`	Keep promoter CpGs during filtering.	`flag`	False
`--filtering_keep_bodies`	Keep gene body CpGs during filtering.	`flag`	False
`--should_aggregate_cpgs`	Aggregate CpGs by region (e.g., mean per gene) if set.	`flag`	False
`--aggregation_genes_file`	File with Ensembl gene IDs (one per line) for CpG aggregation.	`str`	–
`--aggregation_keep_promoters`	Include promoter CpGs in aggregation.	`flag`	False
`--aggregation_keep_bodies`	Include gene body CpGs in aggregation.	`flag`	False
`--only_coding_genes`	Keep only protein-coding genes during preprocessing.	`flag`	False
`--promoter_length`	Number of base pairs upstream to define promoter regions.	`int`	`1000`
`--promoter_overlap`	Overlap in base pairs between promoters and gene bodies.	`int`	`200`

Result: results/grn_inference/grn_with_seed_<seed>.tsv.

2. Aggregation

Aggregate per-seed networks across runs.

Single Aggregation

Example usage:

python src/motif_core/aggregation.py --group_by --method freq --alpha 2 --beta 0.3

Optional Arguments

Argument	Type	Default	Description
`--nruns`	int or `'all'`	`'all'`	How many GRNBoost2 runs to aggregate. `'all'` loads all `.tsv` files from the input directory.
`--normalize`	flag	`False`	If set, normalize weights by total CpG site weight to avoid bias toward highly connected CpGs.
`--group_by`	flag	`False`	If set, group by gene and sum weights. Keeps only one entry per gene (top CpG).
`--method`	str	`'mean'`	Aggregation method to use: `mean`, `borda`, or `freq`.
`--alpha`	float	`1.0`	Used with `freq` method. Controls strength of frequency weighting; higher values emphasize consistency across runs.
`--beta`	float	`0.5`	Used with `freq` method. Balances between max and mean weight: `0` = mean only, `1` = max only, `0.5` = equal mix.
`--param_grid_file`	str	internal parameter grid	JSON file specifying combinations of aggregation parameters to run (overrides default internal grid).

Aggregation Methods:
- mean: average weight across runs.
- borda: rank-based aggregation.
- freq: frequency-weighted importance combining mean, max, and frequency; controlled by alpha and beta.

Result: results/aggregation/aggregated_weights.tsv.

Grid Search Aggregation

Run grid search over multiple aggregation parameter combinations.

You can supply a parameter grid as a JSON file.

python src/motif_core/aggregation_grid_search.py --param_grid_file resources/my_params.json

If --param_grid_file is not provided, a default internal parameter grid will be used.

Each parameter combination and the resulting aggregations are saved to:

results/aggregation_grid_search/params_<parameter_setting_number>.tsv
results/aggregation_grid_search/aggregated_weights_<parameter_setting_number>.tsv

3. Evaluation

Assess biological relevance using network-proximity p-values against known regulators (e.g., DNMT/TET genes).

Single Evaluation

Computes empirical p-value for top genes of results/aggregation/aggregated_weights.tsv:

Example usage:

python src/motif_eval/evaluation.py

Grid Search Evaluation

Run evaluation over all aggregated outputs in

results/aggregation_grid_search/params_<parameter_setting_number>.tsv
results/aggregation_grid_search/aggregated_weights_<parameter_setting_number>.tsv

Example usage:

python src/motif_eval/evaluation_grid_search.py

Results are saved to results/evaluation_grid_search/grid_search_results.tsv

Run Inference Pipeline

jupyter lab notebooks/motif.ipynb

Loads gene/CpG matrices
Runs inference over seeds
Saves GRNs to results/inference/

Run Evaluation

jupyter lab notebooks/evaluation.ipynb

Loads aggregated results
Computes p-values for top genes
Visualizes significance and gene overlaps

Parameter Tuning & Visualization

Explore which parameter settings yield the most significant p-values and robust gene regulators:

jupyter lab notebooks/parameter_tuning.ipynb

Analyze p-value distributions, consistency across seeds, and generate publication-quality plots.

Gene Enrichement Analysis

Analyze functional enrichment of top-ranked genes identified across multiple aggregation settings:

jupyter lab notebooks/gene_enrichement_analysis.ipynb

Loads aggregated top genes from results/grid_aggregation/.
Performs enrichment against GO, KEGG, and Reactome.
Produces interactive plots and tables of significant terms.

Citation

If you use MOTIF in your research, please cite:

Julia Wallnig et al. “MOTIF: Methylation Factor Identifiers—A Network-Based Approach to Identify Upstream Regulators of Aberrant DNA Methylation in Cancer.” preprint, 2025.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Motif: Methylation Factor Identifiers

Repository Structure

Installation

Usage

1. Network Inference

Required Arguments

Optional Arguments

2. Aggregation

Single Aggregation

Optional Arguments

Grid Search Aggregation

3. Evaluation

Single Evaluation

Grid Search Evaluation

Run Inference Pipeline

Run Evaluation

Parameter Tuning & Visualization

Gene Enrichement Analysis

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
data		data
notebooks		notebooks
resources		resources
results		results
src		src
.DS_Store		.DS_Store
.pybiomart.sqlite		.pybiomart.sqlite
README.md		README.md
environment.yaml		environment.yaml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Motif: Methylation Factor Identifiers

Repository Structure

Installation

Usage

1. Network Inference

Required Arguments

Optional Arguments

2. Aggregation

Single Aggregation

Optional Arguments

Grid Search Aggregation

3. Evaluation

Single Evaluation

Grid Search Evaluation

Run Inference Pipeline

Run Evaluation

Parameter Tuning & Visualization

Gene Enrichement Analysis

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages