A computational framework for detecting directional selection in regulatory sequences through phenotypic predictions and phenotype-to-fitness models.

mrrlab/RegEvol

 
 

RegEvol

RegEvol is a computational framework for detecting directional selection in regulatory sequences through phenotypic predictions and phenotype-to-fitness models. This ongoing project aims to identify signatures of positive selection acting on transcription factor binding sites in mammals and Drosophila, using both statistical and evolutionary approaches.


Project Overview

The pipeline includes the following key steps:

Regulatory Data Processing

  • TF ChIP-seq Peak Calling: Extraction and processing of transcription factor binding sites from publicly available ChIP-seq datasets in mammals and Drosophila.
  • Homologous Sequence Retrieval: Identification of orthologous regulatory regions using whole-genome alignments from the Zoonomia Consortium.

Predictive Modeling

  • gkm-SVM Training: Training of gapped k-mer support vector machine (gkm-SVM) models on species-specific regulatory sequences to predict binding affinity changes.
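To make "predicting binding affinity changes" concrete, here is a minimal sketch of scoring a sequence by summing per-k-mer weights and taking the difference between two sequences (a quantity often called deltaSVM). The weight table, the function names, and the toy value k=3 are assumptions for illustration and are not taken from the RegEvol scripts; in practice the weights would come from a trained gkm-SVM model, which uses gapped k-mers rather than the contiguous ones shown here.

```python
from typing import Dict


def score_sequence(seq: str, weights: Dict[str, float], k: int) -> float:
    """Sum the weights of all overlapping k-mers in seq.

    K-mers absent from the weight table contribute 0.
    """
    seq = seq.upper()
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def delta_score(ancestral: str, derived: str,
                weights: Dict[str, float], k: int) -> float:
    """Predicted affinity change: derived score minus ancestral score."""
    return score_sequence(derived, weights, k) - score_sequence(ancestral, weights, k)


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
toy_weights = {"ACG": 1.5, "CGT": 0.5, "AAG": -1.0}

# ACGT scores 1.5 + 0.5 = 2.0; AAGT scores -1.0 + 0.0 = -1.0.
change = delta_score("ACGT", "AAGT", toy_weights, k=3)
print(change)  # -3.0
```

A negative value means the derived sequence is predicted to bind less strongly than the ancestral one under this toy model.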

Positive Selection Detection

  • Permutation Test: Application of the method from Liu & Robinson-Rechavi (2020) to detect extreme changes in binding affinity.
  • RegSel Test: A novel maximum likelihood-based test (RegEvol) to infer evolutionary regimes of regulatory regions from predicted binding affinity shifts.
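The idea behind a permutation test on binding affinity can be sketched as follows: compare the observed affinity change between an ancestral and a focal sequence with the changes produced by the same number of random substitutions. This is a simplified illustration, not the published implementation. It assumes uniform random substitutions, a user-supplied `score` function, and an add-one correction on the p-value; all function names are hypothetical.

```python
import random
from typing import Callable, Tuple

BASES = "ACGT"


def random_substitutions(seq: str, n: int, rng: random.Random) -> str:
    """Return a copy of seq with n distinct positions changed to another base."""
    chars = list(seq)
    for pos in rng.sample(range(len(chars)), n):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def permutation_test(ancestral: str, focal: str,
                     score: Callable[[str], float],
                     n_perm: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Return (observed change, one-sided p-value for an affinity gain)."""
    if len(ancestral) != len(focal):
        raise ValueError("sequences must be aligned and of equal length")
    n_subs = sum(a != b for a, b in zip(ancestral, focal))
    base = score(ancestral)
    observed = score(focal) - base
    rng = random.Random(seed)
    # Null distribution: same number of substitutions, placed at random.
    null = [score(random_substitutions(ancestral, n_subs, rng)) - base
            for _ in range(n_perm)]
    # Add-one correction (an assumption of this sketch) keeps p above zero.
    p_value = (1 + sum(d >= observed for d in null)) / (n_perm + 1)
    return observed, p_value


# Toy scoring function (hypothetical): affinity equals the number of G bases.
observed, p_value = permutation_test("AAAAAAAA", "GGAAAAAA",
                                     score=lambda s: s.count("G"))
print(observed, p_value)
```

A small p-value indicates that the observed substitutions increase predicted affinity more than random substitutions typically do; testing for a loss of affinity would use `d <= observed` instead.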

Comparative and Functional Analyses

  • Simulations: Sequence evolution simulations under controlled parameters to validate the performance and robustness of selection detection methods.
  • Conservation Metrics: Comparison of selection signals with evolutionary conservation scores (phastCons and phyloP).
  • Functional Enrichment: Gene Ontology enrichment analysis of genes associated with positively selected regulatory regions.
  • Divergence vs. Polymorphism: Integration of interspecies divergence and intraspecies polymorphism data to interpret evolutionary dynamics.

Repository structure

├── 1_chipseq/                    # Scripts to run nf-core ChIP-seq pipeline
│   ├── run.nextflow.sh           # Main wrapper script
│   └── logs/                     # Logs generated by pipeline runs
│
├── 2_selection_tests/            # Scripts and Snakemake workflow for RegEvol and Permutation tests 
│   ├── Snakemake/                # Snakefile and workflow rules
│   ├── envs/                     # Conda environment YAML files
│   └── scripts/                  # Python/R scripts used by the workflow
│
├── 3_analysis/                   # Scripts for downstream analyses and figures
│
├── config/                       # Global configuration files
│
├── docs/                         # ChIP-seq sample descriptions and download files
│
└── README.md                     # Project documentation

Requirements

Before running the pipelines, ensure the following tools are installed on your system:

  • Conda / Mamba – for managing environments and dependencies.
  • Singularity – required to use Docker/Singularity containers in Snakemake rules.

This workflow also depends on:

  • Snakemake ≥ X.Y : to run the RegEvol pipeline.
  • Nextflow : to run the ChIP-seq calling pipeline.
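A quick way to check these requirements before launching anything is to look for each executable on the PATH. The sketch below assumes the usual command names (`singularity`, `snakemake`, `nextflow`, `mamba`, `conda`); the helper name `missing_tools` is hypothetical and not part of this repository.

```python
import shutil
from typing import Callable, List, Optional

# Executable names are assumptions based on each tool's usual command name.
REQUIRED_TOOLS = ["singularity", "snakemake", "nextflow"]
ENV_MANAGERS = ["mamba", "conda"]  # either one is enough


def missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return the required tools that cannot be found on the PATH."""
    missing = [tool for tool in REQUIRED_TOOLS if which(tool) is None]
    if all(which(manager) is None for manager in ENV_MANAGERS):
        missing.append("conda or mamba")
    return missing


problems = missing_tools()
print("All tools found." if not problems else "Missing: " + ", ".join(problems))
```

If Snakemake and Nextflow come from the bundled environment, run the check after activating that environment.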

Option 1: Use your existing installation

If you already have Snakemake and Nextflow installed on your system, you can skip creating the bundled environment and run the workflow directly.

Option 2: Create the provided environment

Otherwise, create the environment defined in config/workflows.yaml:

mamba env create -f config/workflows.yaml

Usage

1. Run the ChIP-seq pipeline (Nextflow)

The workflow uses the nf-core/chipseq (2.1.0) Nextflow pipeline to process ChIP-seq data.

Required inputs

The pipeline requires the following three input files:

  1. Reference genome sequence (FASTA format)
    → Located in ../data/genome_sequences/<species>/.

  2. Sample sheet (CSV format)
    → Describes the FASTQ files, sample groups, and metadata (see the nf-core/chipseq sample sheet specification).
    → Located in ../docs/ChIP-seq/<species>/<sample>_samples_input.csv.

  3. Annotations (GTF or GFF file)
    → Located in ../data/genome_sequences/<species>/.
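As an illustration of input 2, the sketch below writes a small sample sheet with Python's csv module. The column names are an assumption based on my reading of the nf-core/chipseq 2.1.0 documentation and should be checked against the sample sheet specification; the sample names and FASTQ paths are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Assumed column layout; verify against the nf-core/chipseq specification.
COLUMNS = ["sample", "fastq_1", "fastq_2", "replicate",
           "antibody", "control", "control_replicate"]


def write_sample_sheet(rows: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write rows as CSV; missing columns are left empty, unknown ones raise."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical single-end IP sample with its input control.
rows = [
    {"sample": "TF_IP", "fastq_1": "ip_rep1.fastq.gz", "replicate": "1",
     "antibody": "TF", "control": "INPUT", "control_replicate": "1"},
    {"sample": "INPUT", "fastq_1": "input_rep1.fastq.gz", "replicate": "1"},
]

buffer = io.StringIO()
write_sample_sheet(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open(path, "w", newline="")` instead of the in-memory buffer.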

Data download

If the FASTQ files are not already present in ../data/ChIP-seq/<species>/<sample>/, the pipeline downloads them automatically from the list of URLs in docs/ChIP-seq/<species>/DL_<sample>.txt
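The "download only what is missing" logic can be sketched as below. This is not the wrapper's actual code: it assumes the URL list contains one URL per line and that each local file is named after the last component of its URL. The function name and the example URLs are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List
from urllib.parse import urlparse


def missing_downloads(url_list: Path, fastq_dir: Path) -> List[str]:
    """Return the URLs whose target file is not yet present in fastq_dir."""
    urls = [line.strip() for line in url_list.read_text().splitlines()
            if line.strip()]
    return [url for url in urls
            if not (fastq_dir / Path(urlparse(url).path).name).exists()]


# Self-contained demonstration in a temporary directory (no network access).
with tempfile.TemporaryDirectory() as tmp:
    fastq_dir = Path(tmp)
    (fastq_dir / "rep1.fastq.gz").touch()  # pretend this file already exists
    url_file = fastq_dir / "DL_example.txt"
    url_file.write_text(
        "https://example.org/reads/rep1.fastq.gz\n"
        "https://example.org/reads/rep2.fastq.gz\n"
    )
    todo = missing_downloads(url_file, fastq_dir)

print(todo)  # ['https://example.org/reads/rep2.fastq.gz']
```

Each remaining URL could then be fetched with, for example, `urllib.request.urlretrieve(url, fastq_dir / name)`.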

Run

./1_chipseq/run.nextflow.sh --sp <species> --sample <sample> [options]

Required arguments:

  • --sp → species name (must match those in config/params.sh)
  • --sample → sample name (used for result subfolders)

Optional arguments (can be passed in any order using --option value):

  • --threads → number of threads to allocate (default: 1)
  • --peaksType → "Narrow" for transcription factors, "Broad" for histone marks (default: "Narrow")
  • --system → Execution mode: local or SLURM (default: "local")
  • --container → Container type: "conda" or "singularity" (default: "conda")
  • --resume → "resume" to resume a previous run, "false" to start fresh (default: "false")
  • --skip → "true" to skip SPP and MultiQC steps, "false" to run all steps (default: "false")
  • --help → display usage and exit

Example:

./1_chipseq/run.nextflow.sh --sp human --sample Wilson --threads 16 --system SLURM --container singularity
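When launching many runs, it can help to build and validate the command line programmatically. The sketch below only encodes the option names and values listed above. The helper name `build_chipseq_command` is hypothetical, and whether the wrapper itself accepts other spellings or capitalisations is not checked here.

```python
import shlex
from typing import List

# Allowed values, copied from the option list in this README.
ALLOWED = {
    "peaksType": {"Narrow", "Broad"},
    "system": {"local", "SLURM"},
    "container": {"conda", "singularity"},
    "resume": {"resume", "false"},
    "skip": {"true", "false"},
}


def build_chipseq_command(sp: str, sample: str, **options) -> List[str]:
    """Build the argument list for 1_chipseq/run.nextflow.sh."""
    cmd = ["./1_chipseq/run.nextflow.sh", "--sp", sp, "--sample", sample]
    for name, value in options.items():
        if name == "threads":
            if int(value) < 1:
                raise ValueError("--threads must be a positive integer")
        elif name not in ALLOWED:
            raise ValueError(f"unknown option: --{name}")
        elif str(value) not in ALLOWED[name]:
            raise ValueError(f"invalid value for --{name}: {value!r}")
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_chipseq_command("human", "Wilson", threads=16,
                            system="SLURM", container="singularity")
print(shlex.join(cmd))
```

The resulting list reproduces the example command above and can be passed to `subprocess.run(cmd, check=True)` from the repository root.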

2. Run the RegEvol pipeline (Snakemake)

./2_selection_tests/run.snakemake.sh --sp <species> --sample <sample> [options]

Required arguments:

  • --sp → species name (must match those in config/params.sh)
  • --sample → sample name (used for result subfolders)

Optional arguments (can be passed in any order using --option value):

  • --threads → number of threads to allocate (default: 1)
  • --dryRun → Run Snakemake in dry-run mode: true/false (default: false)
  • --help → display usage and exit

Example:

./2_selection_tests/run.snakemake.sh --sp human --sample Wilson --threads 10 --dryRun true
