Add variant_selection pipeline step #28

@jpflorido

Description

Introduce a new pipeline step named variant_selection responsible for selecting reportable variants from the variant_collection attribute stored in the Sample object.

This step will act as a logical filter layer between variant collection and downstream reporting, ensuring that only variants relevant to each analysis context are propagated further in the pipeline.

Motivation

Currently, variants are collected and stored in variant_collection according to category (e.g. PR, RR) and source, but there is no dedicated step that encapsulates the logic required to select subsets of variants depending on clinical and analytical context.

Adding a variant_selection step will:

  • Centralize and formalize variant selection logic.
  • Clearly separate variant collection from variant interpretation/selection.
  • Improve maintainability as selection rules evolve.
  • Facilitate future extensions (new categories, modes, or selection criteria).

Current behavior

  • Variants are aggregated and stored in variant_collection at the sample level.
  • Selection criteria are either applied implicitly downstream or are not explicitly structured as a pipeline step.
  • There is no single orchestration point responsible for variant selection across categories and analytical modes.

Proposed refactor

Add a new step named variant_selection. This step will operate per sample and orchestrate variant selection across categories (PR, RR) and, for RR, across modes (screening, advanced).

High-level design

  • The variant_selection step will:

    • Iterate over the categories present in the sample (e.g. PR, RR).
    • For each category, call an auxiliary variant selection function.
    • Store the resulting selected variants in a structured, category-aware output (exact storage model to be defined later).
  • The auxiliary selection function will:

    • Iterate over the set of variants stored in variant_collection.
    • Apply selection rules based on:
      • Variant category (PR vs RR).
      • Sex of the individual.
      • RR mode (e.g. screening vs advanced).
    • Return only the variants that satisfy the applicable criteria.
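
A minimal sketch of the orchestration described above, under loud assumptions: the `Variant` and `Sample` classes are hypothetical stand-ins (the real objects are not shown in this issue), the `selectors` registry is just one possible way to wire categories to auxiliary functions, and the two rules are placeholders, not the actual PR/RR criteria. The storage model is also still to be defined, so a plain dict keyed by category is used here only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Variant:
    """Hypothetical minimal variant record (the real model is not shown here)."""
    variant_id: str
    zygosity: str  # assumed values: "het" or "hom"


@dataclass
class Sample:
    """Hypothetical stand-in for the pipeline's Sample object."""
    sex: str
    variant_collection: Dict[str, List[Variant]]  # keyed by category, e.g. "PR", "RR"
    rr_mode: Optional[str] = None  # assumed values: "screening" or "advanced"
    variant_selection: Dict[str, List[Variant]] = field(default_factory=dict)


# A selector receives one category's variants plus the sample context.
Selector = Callable[[List[Variant], str, Optional[str]], List[Variant]]


def keep_all(variants, sex, rr_mode):
    """Placeholder rule: keeps every variant; not a real selection criterion."""
    return list(variants)


def keep_homozygous(variants, sex, rr_mode):
    """Placeholder rule, purely illustrative and not a clinical criterion."""
    return [v for v in variants if v.zygosity == "hom"]


def variant_selection(sample: Sample, selectors: Dict[str, Selector]) -> Sample:
    """Run the registered selector for every category present in the sample."""
    for category, variants in sample.variant_collection.items():
        selector = selectors.get(category)
        if selector is None:
            raise KeyError(f"No selector registered for category {category!r}")
        sample.variant_selection[category] = selector(
            variants, sample.sex, sample.rr_mode
        )
    return sample


sample = Sample(
    sex="female",
    rr_mode="screening",
    variant_collection={
        "PR": [Variant("v1", "het"), Variant("v2", "hom")],
        "RR": [Variant("v3", "het"), Variant("v4", "hom")],
    },
)
variant_selection(sample, {"PR": keep_all, "RR": keep_homozygous})
print({c: [v.variant_id for v in vs] for c, vs in sample.variant_selection.items()})
# → {'PR': ['v1', 'v2'], 'RR': ['v4']}
```

Keeping the per-category functions behind a registry means a new category could be added by registering one more selector, without touching the step itself. Raising on an unregistered category is a design choice of this sketch; silently skipping it would be the alternative.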

Selection logic (high-level, non-exhaustive)

  • Personal Risk (PR)

    • Variants will be selected according to disease/gene inheritance models.
    • Different scenarios (e.g. heterozygous, homozygous, compound heterozygous) will be considered depending on the inheritance pattern.
  • Reproductive Risk (RR)

    • Variant selection will depend on:
      • RR mode (screening or advanced).
      • Sex of the individual.
    • Different zygosity-based selection rules will apply depending on the context.
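
To make the PR side more concrete, here is a simplified sketch of inheritance-driven selection. Everything in it is an assumption for illustration: the `Variant` fields, the `"AD"`/`"AR"` labels, the gene-to-inheritance mapping, and the rule that two or more heterozygous variants in the same recessive gene count as a compound heterozygous candidate (phasing is ignored). The actual rules for this pipeline are not defined in this issue, and the RR rules (mode and sex dependent) are deliberately left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record used only for this sketch."""
    variant_id: str
    gene: str
    zygosity: str  # assumed values: "het" or "hom"


def select_pr_variants(
    variants: List[Variant], inheritance: Dict[str, str]
) -> List[Variant]:
    """Select variants per gene according to an assumed inheritance label."""
    # Group by gene so compound heterozygosity can be evaluated per gene.
    by_gene = defaultdict(list)
    for variant in variants:
        by_gene[variant.gene].append(variant)

    selected: List[Variant] = []
    for gene, gene_variants in by_gene.items():
        model = inheritance.get(gene)
        if model == "AD":
            # Dominant: a single heterozygous variant is already relevant.
            selected.extend(gene_variants)
        elif model == "AR":
            hom = [v for v in gene_variants if v.zygosity == "hom"]
            het = [v for v in gene_variants if v.zygosity == "het"]
            selected.extend(hom)
            # Recessive: lone heterozygous variants are dropped; two or more
            # in the same gene are kept as a compound heterozygous candidate.
            if len(het) >= 2:
                selected.extend(het)
        # Genes with no known inheritance label are skipped in this sketch.
    return selected


variants = [
    Variant("v1", "GENE_A", "het"),
    Variant("v2", "GENE_B", "het"),
    Variant("v3", "GENE_C", "het"),
    Variant("v4", "GENE_C", "het"),
    Variant("v5", "GENE_D", "hom"),
]
inheritance = {"GENE_A": "AD", "GENE_B": "AR", "GENE_C": "AR", "GENE_D": "AR"}
print([v.variant_id for v in select_pr_variants(variants, inheritance)])
# → ['v1', 'v3', 'v4', 'v5']
```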

Tasks

  • Add variant_selection attribute to Sample / SampleContext
  • Add variant_selection.py step to steps folder
  • Add pr_variant_selection.py to variant_selection folder (Personal Risk module) and auxiliary functions to utils.py in variant_selection folder
  • Add rr_variant_selection.py to variant_selection folder (Reproductive Risk module)

Additional context

  • This step is conceptually downstream of variant_collection.
  • It should be designed to be extensible to additional categories or analytical modes.
  • The implementation should remain compatible with previous refactors.

Metadata

Labels

refactor: Internal code restructuring
