Conversation

@BiancaStoecker
Collaborator

@BiancaStoecker BiancaStoecker commented Dec 19, 2025

Summary by CodeRabbit

  • New Features

    • Added VEP- and REVEL-based variant annotation and processing to produce annotated VCF outputs and summary reports.
  • Documentation

    • Added CI-friendly reference docs and scripts describing downsampled resources and indexing/subsampling procedures.
  • Chores

    • Updated workflow outputs and paths to reflect annotated VCF files and integrated annotation steps.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

📝 Walkthrough

Adds VEP/REVEL annotation to the variant benchmarking workflow: new annotation rules for fetching VEP caches/plugins and REVEL scores, processing/indexing REVEL, and annotating FP/FN VCFs. FP/FN output paths changed from results/fp-fn/vcf/...*.sorted.vcf.gz to results/fp-fn/annotated_vcf/...*.annotated.vcf.gz.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Snakefile configuration (workflow/Snakefile) | Includes rules/annotation.smk; updates FP/FN output paths from results/fp-fn/vcf/...*.sorted.vcf.gz to results/fp-fn/annotated_vcf/...*.annotated.vcf.gz |
| Annotation rules (workflow/rules/annotation.smk) | New file with 8 rules: get_downsampled_vep_cache, get_vep_cache, get_vep_plugins, download_revel, process_revel_scores, tabix_revel_scores, annotate_shared_fn, annotate_unique_fp_fn (uses VEP wrappers v8.1.1, REVEL plugin integration, grouping "annotation") |
| Helper functions (workflow/rules/common.smk) | Adds get_tabix_revel_params(), get_plugin_aux(plugin, index=False), and get_vep_cache_dir() to select tabix columns, plugin auxiliary paths (.tbi/downsampled), and VEP cache dir based on genome build and limit-reads |
| CI documentation / resources (workflow/resources/ci-test-references/README.md) | New README describing CI-friendly downsampled REVEL table and reduced VEP cache, plus two helper scripts (subsample_all_vars.sh, index_subsample.sh) and example commands |

Sequence Diagram(s)

sequenceDiagram
    participant Workflow as Workflow Engine
    participant Cache as VEP Cache/Plugins
    participant REVEL as REVEL Provider
    participant Index as Tabix Indexer
    participant VEP as VEP Annotator

    Workflow->>Cache: get_downsampled_vep_cache / get_vep_cache / get_vep_plugins
    Cache-->>Workflow: VEP cache & plugins available
    Workflow->>REVEL: download_revel (zip)
    REVEL->>REVEL: process_revel_scores (build-specific TSV)
    REVEL-->>Workflow: revel TSV
    Workflow->>Index: tabix_revel_scores (TSV -> .tbi)
    Index-->>Workflow: TSV + .tbi
    Workflow->>VEP: annotate_shared_fn / annotate_unique_fp_fn (VCF + plugins + cache)
    VEP-->>Workflow: annotated VCF (.annotated.vcf.gz) + stats HTML

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • johanneskoester

Poem

🐰 I hopped through caches, plugins bright and new,

VEP and REVEL stitched each variant true,
Logs and TSVs, indexed with care,
Annotated VCFs now dance in the air,
A rabbit's cheer for pipelines made anew.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding VEP annotation with REVEL, SIFT and PolyPhen scores to FP/FN VCFs, which is the core objective reflected throughout the changeset. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
workflow/rules/common.smk (1)

744-748: Type inconsistency: returns string vs empty list.

get_plugin_aux returns a string when plugin == "REVEL" but an empty list [] otherwise. In Snakemake, this mixed return type can work for inputs, but consider returning an empty string or consistently using lists for clarity.

🔎 Proposed fix for consistent return type
 def get_plugin_aux(plugin, index=False):
     if plugin == "REVEL":
         suffix = ".tbi" if index else ""
         return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix)
-    return []
+    return ""
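The other option mentioned in this nitpick, consistently returning lists, could look like the following minimal sketch. It reuses the function name and path from the diff above; wrapping the REVEL path in a list is an assumption about what callers would accept, not something the PR confirms.

```python
def get_plugin_aux(plugin, index=False):
    """Return auxiliary input files for a VEP plugin, always as a list."""
    if plugin == "REVEL":
        suffix = ".tbi" if index else ""
        # Assumption: callers accept a one-element list instead of a bare string
        return [f"resources/revel_scores.tsv.gz{suffix}"]
    # Plugins without auxiliary files contribute no extra inputs
    return []
```

With this shape, every call site can treat the result as a list of paths, whether or not the plugin needs auxiliary files.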
workflow/envs/htslib.yaml (1)

1-6: Consider updating htslib to version 1.22.1.

htslib 1.12 was released in March 2021. The latest available version in bioconda is 1.22.1, which includes bug fixes and performance improvements. Updating is straightforward and poses minimal compatibility risk for tabix/bgzip operations.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d820558 and 30894c7.

📒 Files selected for processing (5)
  • workflow/Snakefile (2 hunks)
  • workflow/envs/curl.yaml (1 hunks)
  • workflow/envs/htslib.yaml (1 hunks)
  • workflow/rules/annotation.smk (1 hunks)
  • workflow/rules/common.smk (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

  • workflow/Snakefile
  • workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Testing
🔇 Additional comments (4)
workflow/rules/common.smk (1)

738-741: LGTM!

The tabix parameter logic correctly maps the reference genome to the appropriate column for REVEL score indexing.

workflow/Snakefile (2)

18-18: LGTM!

The annotation rules are correctly included within the variant-calls config block, making them conditionally available only when variant calls are configured.


33-41: LGTM!

The output paths are correctly updated to reference the new annotated VCF outputs, which aligns with the annotation pipeline that takes sorted VCFs as input and produces annotated VCFs.

workflow/rules/annotation.smk (1)

76-99: Clarify the access.random() version requirement and confirm intentional parameter difference.

The access.random() function is available in Snakemake 7.17.1+ (the current min_version requirement), not exclusively an 8.x feature. The two annotation rules intentionally differ in their extra parameters: annotate_shared_fn includes --sift b --polyphen b while annotate_unique_fp_fn omits these flags. Verify this difference aligns with your analysis requirements.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)

26-34: Add -L and --fail flags to curl command.

The curl command should include -L to follow redirects and --fail to exit with error status on HTTP failures, ensuring robust downloads from Zenodo.

🔎 Proposed fix
     shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
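As an illustration of the two flags, the sketch below assembles the same curl invocation as an argument list in Python. The helper name `build_download_command` is hypothetical and not part of the workflow; only the curl options `-L`, `--fail` and `-o` and the URL come from the comment above.

```python
import shlex


def build_download_command(url, output):
    """Build a curl call that follows redirects and fails on HTTP errors."""
    # -L follows redirects; --fail makes curl exit non-zero on HTTP errors
    return ["curl", "-L", "--fail", url, "-o", output]


cmd = build_download_command(
    "https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip",
    "resources/revel_scores.zip",
)
# Print a shell-quoted version of the command for inspection
print(shlex.join(cmd))
```

Building the command as a list keeps the flags visible in one place and avoids quoting problems if the URL or output path ever contains special characters.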

37-60: Missing resources block causes runtime error.

The shell script references {resources.tmpdir} on line 50, but no resources block is declared in the rule. This will cause a Snakemake runtime error. Additionally, the temporary file is not cleaned up on exit.
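The cleanup concern can be illustrated in Python as well. The following is a minimal sketch, not the workflow's implementation: `with_scratch_file` is a hypothetical helper that creates a scratch file and removes it whether or not the work succeeds, which is the behaviour a shell `trap ... EXIT` aims for.

```python
import os
import tempfile


def with_scratch_file(work, directory=None):
    """Run work(path) on a scratch file that is always removed afterwards."""
    fd, path = tempfile.mkstemp(prefix="revel_scores.", dir=directory)
    os.close(fd)  # only the path is needed, so close the descriptor right away
    try:
        return work(path)
    finally:
        # Runs on success and on error alike
        os.remove(path)
```

Passing `directory=None` lets `tempfile` pick the system default, so the helper works without any extra configuration.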

🔎 Proposed fix
 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=config.get("tmpdir", "/tmp"),
     conda:
         "../envs/htslib.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp -p {resources.tmpdir} revel_scores.XXXXXX)
+        trap 'rm -f "$tmpfile"' EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
🧹 Nitpick comments (1)
workflow/rules/annotation.smk (1)

102-125: LGTM! SIFT and PolyPhen flags are correctly included.

The rule structure is correct and now includes the --sift b --polyphen b flags in the extra parameter (line 118) as intended per the PR objectives.

Note: There's a minor trailing space after "...polyphen b " at line 118 that can be trimmed (optional formatting nitpick).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30894c7 and ba99965.

📒 Files selected for processing (2)
  • workflow/envs/curl.yaml (1 hunks)
  • workflow/rules/annotation.smk (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • workflow/envs/curl.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T08:47:27.118Z
Learnt from: BiancaStoecker
Repo: snakemake-workflows/dna-seq-benchmark PR: 149
File: workflow/Snakefile:36-40
Timestamp: 2025-11-11T08:47:27.118Z
Learning: In the dna-seq-benchmark workflow, VCF files are sorted by a generic sort_vcf rule in workflow/rules/utils.smk that transforms {prefix}.vcf.gz to {prefix}.sorted.vcf.gz, so rules that produce VCF outputs don't include .sorted in their filenames - the sorting is handled as a separate downstream step.

Applied to files:

  • workflow/rules/annotation.smk
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Testing
🔇 Additional comments (5)
workflow/rules/annotation.smk (5)

15-23: LGTM!

Standard VEP plugins setup with appropriate wrapper usage.


12-12: No action needed. The snakemake-wrappers version v8.0.2 used throughout the workflow is valid and current.


76-99: No issues found. The helper function get_plugin_aux() is correctly implemented in workflow/rules/common.smk, the lambda function calls with arguments are proper, and access.random() is the correct Snakemake API for this resource access pattern. The trailing space in the extra parameter at line 92 can optionally be trimmed for consistency.


63-73: Rule structure and implementation are correct.

The get_tabix_revel_params() function in workflow/rules/common.smk (lines 738-741) correctly returns build-appropriate tabix parameters for REVEL score indexing. It selects the correct column (2 for GRCh37, 3 otherwise) and uses appropriate tabix flags (-f -s 1 -b {column} -e {column}) for indexing the TSV file across different reference genomes.


1-12: Helper function is properly implemented and returns correct build strings.

The get_reference_genome_build() function in workflow/rules/common.smk is correctly implemented. Wrapper version v8.0.2 exists and is available in the snakemake-wrappers repository. The function validates the configuration and returns the expected values:

  • "GRCh37" for grch37 configuration
  • "GRCh38" for grch38 configuration

The rule structure correctly passes this value to the VEP cache wrapper as the build parameter.

Collaborator

@famosab famosab left a comment

Some comments / questions :)

@famosab
Collaborator

famosab commented Jan 16, 2026

We still get this error:

[E::easy_errno] Libcurl reported error 78 (Remote file not found)
[E::easy_errno] Libcurl reported error 78 (Remote file not found)

BiancaStoecker and others added 8 commits January 20, 2026 10:17
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
Added a step to free disk space on Ubuntu before testing.
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/main.yml:
- Around line 45-59: The Free Disk Space step "Free Disk Space (Ubuntu)" has
inconsistent indentation under the with: block and may reference a non-existent
tag; fix by aligning the with: children (tool-cache, android, dotnet, haskell,
large-packages, swap-storage, docker-images) to use the same 8-space indentation
as other workflow steps, and verify that jlumbroso/free-disk-space@v1.3.1 is a
valid released tag — if not, change the action reference to a stable branch like
`@main` or a valid release tag.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@workflow/rules/annotation.smk`:
- Around line 113-136: The rule annotate_unique_fp_fn uses a hardcoded cache
access.random("resources/vep/cache") which can diverge from the cache path
resolved by get_vep_cache_dir() used in annotate_shared_fn; update
annotate_unique_fp_fn to use get_vep_cache_dir() (same symbol used by
annotate_shared_fn) for the cache input so both rules reference the same VEP
cache path (replace the cache=access.random(...) entry with
cache=get_vep_cache_dir()).

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@workflow/rules/common.smk`:
- Around line 751-754: The function get_vep_cache_dir has inconsistent return
types: the "limit-reads" branch returns a plain value while the other branch
returns a single-element tuple because of the trailing comma; update
get_vep_cache_dir so both branches return the same type (either both plain
values or both tuples) by removing the trailing comma in the second return or by
wrapping the first return in a tuple, and ensure callers expect that unified
type.
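The trailing-comma pitfall described above is easy to reproduce in plain Python. In this sketch the two functions are hypothetical stand-ins for get_vep_cache_dir, and plain strings replace whatever the real function wraps its paths in.

```python
def cache_dir_mixed(limit_reads):
    """Mimics the reported bug: one branch returns a str, the other a tuple."""
    if limit_reads:
        return "resources/vep/cache_downsampled"
    return "resources/vep/cache",  # trailing comma turns this into a 1-tuple


def cache_dir_consistent(limit_reads):
    """Both branches return a plain string."""
    if limit_reads:
        return "resources/vep/cache_downsampled"
    return "resources/vep/cache"
```

Calling `cache_dir_mixed(False)` yields `("resources/vep/cache",)`, while `cache_dir_consistent(False)` yields the bare string, so removing the comma is enough to unify the return type.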
♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)

37-47: Add -L and --fail flags to curl for redirect handling and error detection.

The Zenodo URL may redirect, and curl error 78 ("Remote file not found") reported in PR comments could stem from this. Adding -L follows redirects; --fail ensures curl exits with an error on HTTP failures.

     shell:
-        "curl {params.url} -o {output} &> {log}"
+        "curl -L --fail {params.url} -o {output} &> {log}"

50-73: Missing resources declaration for {resources.tmpdir} and temp file cleanup.

The shell script references {resources.tmpdir} (line 63) but no resources block is declared, causing a runtime error. Additionally, the temporary file $tmpfile is never cleaned up.

 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=temp("tmpdir"),
     conda:
         "../envs/tools.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+        trap "rm -f $tmpfile" EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """
🧹 Nitpick comments (1)
workflow/rules/annotation.smk (1)

1-23: Log file path conflict between rules.

Both get_downsampled_vep_cache (line 5) and get_vep_cache (line 20) write to the same log path logs/vep/cache.log. While only one rule should run based on the limit-reads config, using distinct log paths improves traceability and avoids potential conflicts.

 rule get_downsampled_vep_cache:
     output:
         directory("resources/vep/cache_downsampled"),
     log:
-        "logs/vep/cache.log",
+        "logs/vep/cache_downsampled.log",

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@workflow/rules/annotation.smk`:
- Around line 1-9: The two rules get_downsampled_vep_cache and get_vep_cache
both write to the same log file "logs/vep/cache.log", which can cause
overwrites; update the get_downsampled_vep_cache rule (the rule header named
get_downsampled_vep_cache and its log directive) to use a distinct log path (for
example "logs/vep/cache_downsampled.log" or include a unique suffix/wildcard) so
each rule writes to its own log file and preserves logs when both run.
♻️ Duplicate comments (2)
workflow/rules/annotation.smk (2)

37-47: Add -L and --fail flags to curl - likely cause of reported runtime error.

The reported runtime error "Libcurl reported error 78 (Remote file not found)" is likely caused by the Zenodo URL redirecting while curl is not configured to follow redirects. Add -L to follow redirects and --fail to exit on HTTP errors.

Proposed fix
     shell:
-        "curl {params.url} -o {output} &> {log}"
+        "curl -L --fail {params.url} -o {output} &> {log}"

50-73: Missing resources declaration for tmpdir - will cause runtime error.

Line 63 references {resources.tmpdir} but no resources: block is declared. This will cause Snakemake to fail. Additionally, the temporary file $tmpfile is never cleaned up.

Proposed fix
 rule process_revel_scores:
     input:
         "resources/revel_scores.zip",
     output:
         "resources/revel_scores.tsv.gz",
     params:
         build=get_reference_genome_build(),
     log:
         "logs/vep_plugins/process_revel_scores.log",
+    resources:
+        tmpdir=temp("tmpdir"),
     conda:
         "../envs/tools.yaml"
     shell:
         """
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+        trap "rm -f $tmpfile" EXIT
         unzip -p {input} | tr "," "\t" | sed '1s/.*/#&/' | bgzip -c > $tmpfile
         if [ "{params.build}" == "GRCh38" ] ; then
             zgrep -h -v ^#chr $tmpfile | awk '$3 != "." ' | sort -k1,1 -k3,3n - | cat <(zcat $tmpfile | head -n1) - | bgzip -c > {output}
         elif [ "{params.build}" == "GRCh37" ] ; then
             cat $tmpfile > {output}
         else
             echo "Annotation of REVEL scores only supported for GRCh37 or GRCh38" > {log}
             exit 125
         fi
         """

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@workflow/rules/common.smk`:
- Around line 757-763: The function get_revel_tsv returns None when
config.get("limit-reads") is true because the if branch calls
workflow.source_path("../resources/new_tabbed_revel_grch38.1pct.tsv.gz") but
does not return its value; update the if branch to return the result of
workflow.source_path(...) so get_revel_tsv always returns the expected file path
(keep the else branch returning "resources/revel_scores.tsv.gz"), referencing
get_revel_tsv, workflow.source_path, and config.get("limit-reads") to locate the
code.
- Around line 751-754: The function get_vep_cache_dir() calls access.random(...)
but access is not defined; fix by either importing/defining the access object or
replacing those calls with the proper Snakemake remote provider (e.g., use
snakemake.remote.S3.random("resources/vep/cache") or the appropriate remote
class for your storage). Update get_vep_cache_dir() and the corresponding
access.random usages in annotation.smk (the four occurrences) so they reference
a defined symbol (imported access) or use snakemake.remote.S3 (or the correct
snakemake.remote.* class) to obtain the random path.
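The first item above, the missing return in get_revel_tsv, can be sketched as follows. Here `source_path` is a hypothetical stand-in for Snakemake's workflow.source_path so that the example runs on its own, and the config is passed in as a plain dict; both are assumptions for illustration.

```python
def source_path(relative_path):
    # Hypothetical stand-in for workflow.source_path; just tags the path
    return "SOURCE:" + relative_path


def get_revel_tsv_buggy(config):
    if config.get("limit-reads"):
        # The result is computed but never returned, so the function yields None
        source_path("../resources/new_tabbed_revel_grch38.1pct.tsv.gz")
    else:
        return "resources/revel_scores.tsv.gz"


def get_revel_tsv_fixed(config):
    if config.get("limit-reads"):
        return source_path("../resources/new_tabbed_revel_grch38.1pct.tsv.gz")
    return "resources/revel_scores.tsv.gz"
```

With `{"limit-reads": True}` the buggy variant returns `None`, whereas the fixed variant returns a path in both branches.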

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@workflow/rules/common.smk`:
- Around line 744-754: get_plugin_aux currently returns a GRCh38-only downsample
file when config.get("limit-reads") is true, which will be wrong for GRCh37
runs; update get_plugin_aux to first detect the active reference (e.g., from
config.get("reference") or the project reference setting) and if plugin ==
"REVEL" and limit-reads is true choose the matching GRCh37 downsample resource
when available or else raise a clear, fast-fail error explaining that a GRCh37
downsample resource is missing and how to provide one; keep the existing GRCh38
path for GRCh38 runs and ensure the error mentions get_plugin_aux and the
config.get("limit-reads") condition so it is easy to locate.
- Around line 738-741: get_tabix_revel_params currently treats any non-"grch37"
reference as grch38 and silently falls back; change it to validate the
config["reference-genome"] explicitly using the project's validated helper
(e.g., the reference validation function used elsewhere) and raise or fail fast
on unknown values, then return "-f -s 1 -b 2 -e 2" for "grch37" and "-f -s 1 -b
3 -e 3" for "grch38"; reference get_tabix_revel_params and
config["reference-genome"] when locating where to replace the fallback logic.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@workflow/rules/common.smk`:
- Around line 744-753: The path construction in get_plugin_aux (function) uses
workflow.basedir + "../.test/resources/..." which can break resolution; replace
that concatenation with
workflow.source_path(".test/resources/new_tabbed_revel_grch38.1pct.tsv.gz{suffix}".format(suffix=suffix))
so the REVEL test resource path is resolved via Snakemake's API (use the same
approach for the indexed suffix when index is True) instead of direct basedir
manipulation.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@workflow/rules/annotation.smk`:
- Around line 1-11: The shell command in rule get_downsampled_vep_cache
incorrectly calls tar without -f so it treats {input} as a member name; update
the shell line in get_downsampled_vep_cache to pass the archive file to tar
using the -f option (i.e., include -f {input} before -C {output}), preserving
the mkdir, --strip-components 1, and stderr redirection to {log}.
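For comparison, the same unpacking step (extract into the output directory and drop the leading path component) can be sketched with Python's tarfile module, where the archive is always passed explicitly. The helper name `extract_stripped` is hypothetical and this is not how the workflow does it; it only illustrates what `--strip-components 1` achieves.

```python
import tarfile
from pathlib import Path


def extract_stripped(archive, dest):
    """Extract a .tar.gz into dest, dropping the first path component."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        members = []
        for member in tar.getmembers():
            parts = Path(member.name).parts[1:]
            if not parts:
                continue  # skip the top-level directory entry itself
            # Rewrite the member name without its leading component
            member.name = str(Path(*parts))
            members.append(member)
        tar.extractall(dest, members=members)
```

The sketch only handles regular files and directories; link targets inside the archive are not rewritten.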
♻️ Duplicate comments (4)
workflow/rules/common.smk (2)

738-741: Validate reference-genome instead of defaulting to GRCh38.

Any non-grch37 value currently falls through to column 3, which can mask config mistakes. Use the validated helper to fail fast on invalid values.

♻️ Proposed fix
 def get_tabix_revel_params():
     # Indexing of REVEL-score file where the column depends on the reference
-    column = 2 if config["reference-genome"] == "grch37" else 3
+    build = get_reference_genome_build()
+    column = 2 if build == "GRCh37" else 3
     return f"-f -s 1 -b {column} -e {column}"
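A self-contained variant of the same idea is sketched below. It takes the build as an argument and rejects anything other than the two supported builds; the function name `tabix_revel_params` and the lookup table are illustrative assumptions, while the column numbers and tabix flags are those quoted in this thread.

```python
# Position column of the REVEL TSV per genome build, as described above
TABIX_POSITION_COLUMN = {"GRCh37": 2, "GRCh38": 3}


def tabix_revel_params(build):
    """Return tabix flags for indexing the REVEL TSV of a given build."""
    try:
        column = TABIX_POSITION_COLUMN[build]
    except KeyError:
        raise ValueError(f"Unsupported reference genome build: {build!r}") from None
    # -f overwrites an existing index, -s is the sequence column,
    # -b and -e are the begin and end position columns
    return f"-f -s 1 -b {column} -e {column}"
```

An explicit lookup table means an unexpected build raises immediately instead of silently falling through to the GRCh38 column.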

744-754: Fail fast when limit-reads uses a GRCh38-only REVEL downsample.

With limit-reads, the helper always points to a GRCh38 downsample, which is wrong for GRCh37 runs. Guard this and raise a clear error (or add a GRCh37 downsample).

🐛 Proposed fix (fail fast if unsupported)
 def get_plugin_aux(plugin, index=False):
     if plugin == "REVEL":
         suffix = ".tbi" if index else ""
         if config.get("limit-reads"):
+            if get_reference_genome_build() != "GRCh38":
+                raise ValueError(
+                    "limit-reads REVEL downsample is only available for GRCh38"
+                )
             return workflow.source_path(
                 "../../.test/resources/downsampled_tabbed_revel_grch38.1pct.tsv.gz{suffix}".format(
                     suffix=suffix
                 )
             )
         return "resources/revel_scores.tsv.gz{suffix}".format(suffix=suffix)
     return []
workflow/rules/annotation.smk (2)

39-47: Follow redirects / fail on HTTP errors for the REVEL download.

This avoids silent 404s/redirects and aligns with the reported libcurl error 78.

🔧 Proposed fix
-    shell:
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+    shell:
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

50-73: Define a tmpdir source and clean up the temp file.

{resources.tmpdir} is undefined here, and the temp file is never removed.

🐛 Proposed fix
-        tmpfile=$(mktemp {resources.tmpdir}/revel_scores.XXXXXX)
+        tmpfile=$(mktemp "${{TMPDIR:-/tmp}}"/revel_scores.XXXXXX)
+        trap "rm -f $tmpfile" EXIT

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@workflow/rules/annotation.smk`:
- Around line 16-27: The get_vep_plugins rule currently hardcodes release="111"
while get_vep_cache may use cache_downsampled (release 113) when
limit-reads=true; update get_vep_plugins (the rule named get_vep_plugins and its
release param) so its release value is computed the same way as get_vep_cache
(i.e., use the same helper/variable that picks cache release or branch on
limit-reads/cache_downsampled) or pass through the cache-matched release value,
ensuring the VEP plugins release matches the cache release when downsampled
caches are used.

In `@workflow/rules/common.smk`:
- Around line 745-758: The REVEL branch in get_plugin_aux is returning a CI
resource path that doesn't match the actual file name; update the returned
workflow.source_path call in get_plugin_aux (REVEL branch) to reference the
correct file name "new_tabbed_revel_grch38.1pct.tsv.gz{suffix}" (or
alternatively rename the resource to the current string) so the returned path
matches the CI resource and avoids FileNotFoundError.
♻️ Duplicate comments (1)
workflow/rules/annotation.smk (1)

41-49: Harden curl downloads against redirects/HTTP errors.
The current curl command doesn’t follow redirects or fail on HTTP errors, which can surface as libcurl error 78. Add -L --fail for reliability.

🔧 Suggested tweak
-        "curl https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"
+        "curl -L --fail https://zenodo.org/records/7072866/files/revel-v1.3_all_chromosomes.zip -o {output} &> {log}"

    conda:
        "../envs/tools.yaml"
    shell:
        "(mkdir -p {output}; curl -L https://github.com/snakemake-workflows/dna-seq-benchmark/raw/0181ccf16c5483c0d7d1ad1b8f9dfa87376b5b1f/workflow/resources/ci-test-references/vep_cache_113_GRCh38_chr22.tar.gz | tar -xz -C {output} --strip-components 1) 2> {log}"
Collaborator

I somehow couldn't get it to work so that the file is resolved relative to the .smk file. That's why it is now pulled from our repo with curl. I also wanted to have the file available ourselves rather than depend on another repo, so I have stored it in ours.

@famosab famosab changed the title from "feat: Add VEP annotation with REVEL, Sift and PolyPhen Scores to fp…" to "feat: VEP annotation with REVEL, Sift and PolyPhen Scores to fp/fn vcfs" on Jan 23, 2026