RegEvol is a computational framework for detecting directional selection in regulatory sequences through phenotypic predictions and phenotype-to-fitness models. This ongoing project aims to identify signatures of positive selection acting on transcription factor binding sites across mammals and Drosophila, using both statistical and evolutionary approaches.
The pipeline includes the following key steps:
- TF ChIP-seq Peak Calling: Extraction and processing of transcription factor binding sites from publicly available ChIP-seq datasets in mammals and Drosophila.
- Homologous Sequence Retrieval: Identification of orthologous regulatory regions using whole-genome alignments from the Zoonomia Consortium.
- gkm-SVM Training: Training of gapped k-mer support vector machine (gkm-SVM) models on species-specific regulatory sequences to predict binding affinity changes.
- Permutation Test: Application of the method from Liu & Robinson-Rechavi (2020) to detect extreme changes in predicted binding affinity (a minimal sketch of the idea follows this list).
- RegSel Test: A novel maximum likelihood-based test (RegEvol) to infer evolutionary regimes of regulatory regions from predicted binding affinity shifts.
- Simulations: Sequence evolution simulations under controlled parameters to validate the performance and robustness of selection detection methods.
- Conservation Metrics: Comparison of selection signals with evolutionary conservation scores (phastCons and phyloP).
- Functional Enrichment: Gene Ontology enrichment analysis of genes associated with positively selected regulatory regions.
- Divergence vs. Polymorphism: Integration of interspecies divergence and intraspecies polymorphism data to interpret evolutionary dynamics.
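The permutation test above can be illustrated with a short Python sketch. This is a simplified, hypothetical version of the idea: `predict_affinity` stands in for a trained gkm-SVM scoring function (step 3 of the pipeline), `observed_substitutions` is an assumed list of (position, derived base) changes between an ancestral and a derived sequence, and the null model simply places the same number of substitutions at random positions; the published method of Liu & Robinson-Rechavi (2020) includes further refinements.

```python
import random

def permutation_pvalue(ancestral_seq, observed_substitutions, predict_affinity,
                       n_permutations=1000, seed=42):
    """Empirical p-value for an extreme shift in predicted binding affinity.

    ancestral_seq          : ancestral regulatory sequence (string over ACGT)
    observed_substitutions : list of (position, derived_base) tuples (hypothetical input)
    predict_affinity       : callable returning a predicted binding score for a sequence,
                             e.g. a wrapper around a trained gkm-SVM model
    """
    rng = random.Random(seed)
    bases = "ACGT"

    def apply_substitutions(seq, subs):
        seq = list(seq)
        for pos, base in subs:
            seq[pos] = base
        return "".join(seq)

    # Observed shift in predicted affinity between ancestral and derived sequences.
    ancestral_score = predict_affinity(ancestral_seq)
    derived_seq = apply_substitutions(ancestral_seq, observed_substitutions)
    observed_delta = predict_affinity(derived_seq) - ancestral_score

    # Null distribution: the same number of substitutions placed at random positions.
    n_subs = len(observed_substitutions)
    null_deltas = []
    for _ in range(n_permutations):
        positions = rng.sample(range(len(ancestral_seq)), n_subs)
        random_subs = [(p, rng.choice([b for b in bases if b != ancestral_seq[p]]))
                       for p in positions]
        permuted_seq = apply_substitutions(ancestral_seq, random_subs)
        null_deltas.append(predict_affinity(permuted_seq) - ancestral_score)

    # Fraction of random substitution sets at least as extreme as the observed one.
    extreme = sum(abs(d) >= abs(observed_delta) for d in null_deltas)
    return (extreme + 1) / (n_permutations + 1)
```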
├── 1_chipseq/ # Scripts to run nf-core ChIP-seq pipeline
│ ├── run.nextflow.sh # Main wrapper script
│ └── logs/ # Logs generated by pipeline runs
│
├── 2_selection_tests/ # Scripts and Snakemake workflow for RegEvol and Permutation tests
│ ├── Snakemake/ # Snakefile and workflow rules
│ ├── envs/ # Conda environment YAML files
│ └── scripts/ # Python/R scripts used by the workflow
│
├── 3_analysis/ # Scripts for downstream analyses and figures
│
├── config/ # Global configuration files
│
├── docs/ # ChIP-seq sample descriptions and download files
│
└── README.md # Project documentation
Before running the pipelines, ensure the following tools are installed on your system:
- Conda / Mamba – for managing environments and dependencies.
- Singularity – required to use Docker/Singularity containers in Snakemake rules.
This workflow also depends on:
- Snakemake – used by the selection-test workflow in 2_selection_tests/.
- Nextflow – used to run the nf-core/chipseq pipeline in 1_chipseq/.
If you already have Snakemake and Nextflow installed on your system, you can skip creating the bundled environment and run the workflow directly.
If you don’t, you can create the environment with:
mamba env create -f config/workflows.yaml
The workflow uses the nf-core/chipseq (2.1.0) Nextflow pipeline to process ChIP-seq data.
The pipeline requires the following three input files:
- Reference genome sequence (FASTA format)
  → Located in ../data/genome_sequences/<species>/.
- Sample sheet (CSV format)
  → Describes the FASTQ files, sample groups, and metadata (see the nf-core/chipseq sample sheet specification).
  → Located in ../docs/ChIP-seq/<species>/<sample>_samples_input.csv.
- Annotations (GTF or GFF file)
  → Located in ../data/genome_sequences/<species>/.
If FASTQ files are not already present in ../data/ChIP-seq/<species>/<sample>/, the pipeline will automatically download them using the list of URLs provided in: docs/ChIP-seq/<sp>/DL_<sample>.txt
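For illustration, such a download file is simply a plain-text list of FASTQ URLs, one per line. The entries below are placeholders, not real records:

```text
https://example.org/fastq/sample_IP_rep1_R1.fastq.gz
https://example.org/fastq/sample_input_rep1_R1.fastq.gz
```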
./1_chipseq/run.nextflow.sh --sp <species> --sample <sample> [options]
Required arguments:
- --sp → species name (must match those in config/params.sh)
- --sample → sample name (used for result subfolders)
Optional arguments (can be passed in any order using --option value):
- --threads → number of threads to allocate (default: 1)
- --peaksType → "Narrow" for transcription factors, "Broad" for histone marks (default: "Narrow")
- --system → Execution mode: local or SLURM (default: "local")
- --container → Container type: "conda" or "singularity" (default: "conda")
- --resume → "resume" to resume a previous run, "false" to start fresh (default: "false")
- --skip → "true" to skip SPP and MultiQC steps, "false" to run all steps (default: "false")
- --help → display usage and exit
Example:
./1_chipseq/run.nextflow.sh --sp human --sample Wilson --threads 16 --system SLURM --container singularity
The selection tests (permutation and RegEvol) are run with the Snakemake workflow in 2_selection_tests/:
./2_selection_tests/run.snakemake.sh --sp <species> --sample <sample> [options]
Required arguments:
- --sp → species name (must match those in config/params.sh)
- --sample → sample name (used for result subfolders)
Optional arguments (can be passed in any order using --option value):
- --threads → number of threads to allocate (default: 1)
- --dryRun → Run Snakemake in dry-run mode: true/false (default: false)
- --help → display usage and exit
Example:
./2_selection_tests/run.snakemake.sh --sp human --sample Wilson --threads 10 --dryRun true