RegEvol is a computational framework for detecting directional selection in regulatory sequences through phenotypic predictions and phenotype-to-fitness models. This ongoing project aims to identify signatures of positive selection acting on transcription factor binding sites across mammals and Drosophila, using both statistical and evolutionary approaches.
The pipeline includes the following key steps:
- TF ChIP-seq Peak Calling: Extraction and processing of transcription factor binding sites from publicly available ChIP-seq datasets in mammals and Drosophila.
- Homologous Sequence Retrieval: Identification of orthologous regulatory regions using whole-genome alignments from the Zoonomia Consortium.
- gkm-SVM Training: Training of gapped k-mer support vector machine (gkm-SVM) models on species-specific regulatory sequences to predict binding affinity changes.
- Permutation Test: Application of the method from Liu & Robinson-Rechavi (2020) to detect extreme changes in predicted binding affinity (a minimal sketch of the idea follows this list).
- RegSel Test: A novel maximum likelihood-based test (RegEvol) to infer evolutionary regimes of regulatory regions from predicted binding affinity shifts.
- Simulations: Sequence evolution simulations under controlled parameters to validate the performance and robustness of selection detection methods.
- Conservation Metrics: Comparison of selection signals with evolutionary conservation scores (phastCons and phyloP).
- Functional Enrichment: Gene Ontology enrichment analysis of genes associated with positively selected regulatory regions.
- Divergence vs. Polymorphism: Integration of interspecies divergence and intraspecies polymorphism data to interpret evolutionary dynamics.
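The permutation test above can be illustrated with a short Python sketch. This is a simplified, hypothetical version of the idea: `predict_affinity` stands in for a trained gkm-SVM scoring function (step 3 of the pipeline), `observed_substitutions` is an assumed list of (position, derived base) changes between an ancestral and a derived sequence, and the null model simply places the same number of substitutions at random positions; the published method of Liu & Robinson-Rechavi (2020) includes further refinements.

```python
import random

def permutation_pvalue(ancestral_seq, observed_substitutions, predict_affinity,
                       n_permutations=1000, seed=42):
    """Empirical p-value for an extreme shift in predicted binding affinity.

    ancestral_seq          : ancestral regulatory sequence (string over ACGT)
    observed_substitutions : list of (position, derived_base) tuples (hypothetical input)
    predict_affinity       : callable returning a predicted binding score for a sequence,
                             e.g. a wrapper around a trained gkm-SVM model
    """
    rng = random.Random(seed)
    bases = "ACGT"

    def apply_substitutions(seq, subs):
        seq = list(seq)
        for pos, base in subs:
            seq[pos] = base
        return "".join(seq)

    # Observed shift in predicted affinity between ancestral and derived sequences.
    ancestral_score = predict_affinity(ancestral_seq)
    derived_seq = apply_substitutions(ancestral_seq, observed_substitutions)
    observed_delta = predict_affinity(derived_seq) - ancestral_score

    # Null distribution: the same number of substitutions placed at random positions.
    n_subs = len(observed_substitutions)
    null_deltas = []
    for _ in range(n_permutations):
        positions = rng.sample(range(len(ancestral_seq)), n_subs)
        random_subs = [(p, rng.choice([b for b in bases if b != ancestral_seq[p]]))
                       for p in positions]
        permuted_seq = apply_substitutions(ancestral_seq, random_subs)
        null_deltas.append(predict_affinity(permuted_seq) - ancestral_score)

    # Fraction of random substitution sets at least as extreme as the observed one.
    extreme = sum(abs(d) >= abs(observed_delta) for d in null_deltas)
    return (extreme + 1) / (n_permutations + 1)
```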
├── 1_chipseq/ # Scripts to run nf-core ChIP-seq pipeline
│ ├── run.nextflow.sh # Main wrapper script
│ └── logs/ # Logs generated by pipeline runs
│
├── 2_selection_tests/ # Scripts and Snakemake workflow for RegEvol and Permutation tests
│ ├── Snakemake/ # Snakefile and workflow rules
│ ├── envs/ # Conda environment YAML files
│ └── scripts/ # Python/R scripts used by the workflow
│
├── 3_analysis/ # Scripts for downstream analyses and figures
│
├── config/ # Global configuration files
│
├── docs/ # ChIP-seq sample descriptions and download files
│
└── README.md # Project documentation
Before running the pipelines, ensure the following tools are installed on your system:
- Conda / Mamba – for managing environments and dependencies.
- Singularity – required to use Docker/Singularity containers in Snakemake rules.
This workflow also depends on:
- Snakemake – used by the selection-test workflow in 2_selection_tests/.
- Nextflow – used to run the nf-core/chipseq pipeline in 1_chipseq/.
If you already have Snakemake and Nextflow installed on your system, you can skip creating the bundled environment and run the workflow directly.
If you don’t, you can create the environment with:
mamba env create -f config/workflows.yaml
The workflow uses the nf-core/chipseq (2.1.0) Nextflow pipeline to process ChIP-seq data.
The pipeline requires the following three input files:
- Reference genome sequence (FASTA format)
  → Located in ../data/genome_sequences/<species>/.
- Sample sheet (CSV format)
  → Describes the FASTQ files, sample groups, and metadata (see the nf-core/chipseq sample sheet specification).
  → Located in ../docs/ChIP-seq/<species>/<sample>_samples_input.csv.
- Annotations (GTF or GFF file)
  → Located in ../data/genome_sequences/<species>/.
If FASTQ files are not already present in ../data/ChIP-seq/<species>/<sample>/, the pipeline will automatically download them using the list of URLs provided in: docs/ChIP-seq/<sp>/DL_<sample>.txt
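For illustration, such a download file is simply a plain-text list of FASTQ URLs, one per line. The entries below are placeholders, not real records:

```text
https://example.org/fastq/sample_IP_rep1_R1.fastq.gz
https://example.org/fastq/sample_input_rep1_R1.fastq.gz
```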
./1_chipseq/run.nextflow.sh --sp <species> --sample <sample> [options]
Required arguments:
- --sp → species name (must match those in config/params.sh)
- --sample → sample name (used for result subfolders)
Optional arguments (can be passed in any order using --option value):
- --threads → number of threads to allocate (default: 1)
- --peaksType → "Narrow" for transcription factors, "Broad" for histone marks (default: "Narrow")
- --system → Execution mode: local or SLURM (default: "local")
- --container → Container type: "conda" or "singularity" (default: "conda")
- --resume → "resume" to resume a previous run, "false" to start fresh (default: "false")
- --skip → "true" to skip SPP and MultiQC steps, "false" to run all steps (default: "false")
- --help → display usage and exit
Example:
./1_chipseq/run.nextflow.sh --sp human --sample Wilson --threads 16 --system SLURM --container singularity
The selection tests (permutation and RegEvol) are run with the Snakemake workflow in 2_selection_tests/:
./2_selection_tests/run.snakemake.sh --sp <species> --sample <sample> [options]
Required arguments:
- --sp → species name (must match those in config/params.sh)
- --sample → sample name (used for result subfolders)
Optional arguments (can be passed in any order using --option value):
- --threads → number of threads to allocate (default: 1)
- --dryRun → Run Snakemake in dry-run mode: true/false (default: false)
- --help → display usage and exit
Example:
./2_selection_tests/run.snakemake.sh --sp human --sample Wilson --threads 10 --dryRun true