Validation Guide

This guide explains the main public validation rules.

If you just want the fastest fix, start with Troubleshooting.

What validation is protecting

PhosPy validation is there to keep the public workflow honest:

  • builder inputs must be readable and aligned
  • dataset objects must be structurally valid
  • references must match the dataset
  • workflow requests must contain the right types and usable values

Builder input rules

DatasetBuildRequest supports two public input routes:

  • pandas DataFrame values
  • file paths for supported table formats

Required fields:

  • phospho
  • site_metadata

Optional fields:

  • sample_metadata
  • total
  • organism
  • preprocessing_config

Main checks:

  • phospho and site_metadata must be DataFrames or supported file paths
  • phospho must be numeric
  • site_metadata.index must align to phospho.index
  • required site metadata information must be available through columns or supported derivation

Site metadata conventions

Supported aliases are intentionally narrow:

  • gene_symbol: gene_symbol, gene_name
  • site: site
  • site_sequence: site_sequence, centralized_sequence
  • protein_id: protein_id

Unsupported legacy aliases are rejected, including:

  • gene
  • residue
  • phosphosite
  • site_position
  • sequence
  • protein
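The alias rules above can be sketched as a small lookup. This is an illustration only, not PhosPy's implementation: the function name `resolve_site_metadata_columns` and both constants are hypothetical, and it assumes that "rejected" means an error is raised whenever a legacy column name is present.

```python
# Hypothetical stand-in for PhosPy's alias handling; names are illustrative.
SUPPORTED_ALIASES = {
    "gene_symbol": ("gene_symbol", "gene_name"),
    "site": ("site",),
    "site_sequence": ("site_sequence", "centralized_sequence"),
    "protein_id": ("protein_id",),
}
REJECTED_LEGACY_ALIASES = frozenset(
    {"gene", "residue", "phosphosite", "site_position", "sequence", "protein"}
)


def resolve_site_metadata_columns(columns):
    """Map each canonical field to the column that supplies it.

    Raises ValueError if a rejected legacy alias is present (assumed behaviour).
    Canonical fields with no matching column are left out of the result.
    """
    present = list(columns)
    legacy = sorted(set(present) & REJECTED_LEGACY_ALIASES)
    if legacy:
        raise ValueError(f"unsupported legacy aliases: {legacy}")

    resolved = {}
    for canonical, aliases in SUPPORTED_ALIASES.items():
        for alias in aliases:
            if alias in present:
                resolved[canonical] = alias
                break  # first supported alias wins
    return resolved
```

Calling it with `["gene_name", "site", "centralized_sequence"]` maps three canonical fields and leaves `protein_id` unresolved, while a column named `gene` raises.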

If gene_symbol and/or site are missing, PhosPy can derive them only from index values that exactly match "<gene_symbol>;<site>;" (for example, TSC2;S939;).
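To check index values before building, a strict pattern match is enough. The sketch below is not PhosPy code: `derive_gene_and_site` is a hypothetical helper, and it assumes the format means two non-empty fields, each followed by a single semicolon, with nothing before or after.

```python
import re

# Assumed reading of "<gene_symbol>;<site>;": two non-empty fields that
# contain no semicolons, each terminated by one semicolon, and nothing else.
_SITE_ID = re.compile(r"([^;]+);([^;]+);")


def derive_gene_and_site(index_value):
    """Return (gene_symbol, site), or None if the value does not match exactly."""
    match = _SITE_ID.fullmatch(index_value)
    return match.groups() if match else None
```

Under these assumptions `"TSC2;S939;"` yields a gene symbol and a site, while `"TSC2;S939"` (no trailing semicolon) and `"TSC2;S939;extra"` are both rejected.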

Final dataset boundary

AnalysisReadyPhosphoDataset is the strict workflow-facing boundary. It is missing-value-free and intended for workflow execution rather than loose exploratory ingestion.

Main expectations:

  • phospho and site_metadata are DataFrames
  • site identity is coherent between row IDs and metadata
  • required metadata values are non-empty
  • transformation state is established and coherent
  • the supported builder lane hands workflows a missing-value-free dataset

Preprocessing rules

DatasetPreprocessingConfig groups four policy areas:

  • missing_data
  • total_protein_correction
  • site_matrix
  • comparisons

Key public rules:

  • missing_data.policy="forbid" is the strict default
  • missing_data.policy="impute_row_median" requires min_observed_values
  • total_protein_correction.policy="ratio_to_total" requires a total table aligned to the phospho samples
  • site_matrix.policy="build_from_metadata" may reduce row count when rows cannot be supported in that lane
  • comparisons.policy="sample_metadata_pairs" requires matching sample_metadata and a usable sample-group column
  • whatever policies are chosen, the public builder lane still ends in a missing-value-free AnalysisReadyPhosphoDataset
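The "requires" rules in this list are cross-field checks, which can be sketched as follows. This is a rough pre-flight check under stated assumptions, not PhosPy's validator: the config is modelled as a plain nested dict rather than a DatasetPreprocessingConfig, and `check_policy_requirements` with its two keyword flags is hypothetical. It does not check table alignment or the sample-group column.

```python
def check_policy_requirements(config, *, has_total, has_sample_metadata):
    """Return a list of problems; an empty list means no conflict was found.

    `config` is a plain dict keyed by policy area (an assumption made for
    this sketch; the real config object may be shaped differently).
    """
    problems = []

    missing = config.get("missing_data", {})
    if (missing.get("policy", "forbid") == "impute_row_median"
            and missing.get("min_observed_values") is None):
        problems.append(
            'missing_data.policy="impute_row_median" requires min_observed_values'
        )

    correction = config.get("total_protein_correction", {})
    if correction.get("policy") == "ratio_to_total" and not has_total:
        problems.append(
            'total_protein_correction.policy="ratio_to_total" requires a total table'
        )

    comparisons = config.get("comparisons", {})
    if (comparisons.get("policy") == "sample_metadata_pairs"
            and not has_sample_metadata):
        problems.append(
            'comparisons.policy="sample_metadata_pairs" requires sample_metadata'
        )

    return problems
```

Collecting every problem instead of stopping at the first one lets a user fix a request in a single pass.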

Reference validation

Reference rules are simple but strict:

  • ReferencePreset.AUTO requires dataset.organism
  • explicit preset and dataset organism must agree when both are set
  • explicit ReferenceBundle.organism and dataset organism must agree when both are set
  • bundled runtime references are rat-only in this release
  • ReferencePreset.HUMAN and ReferencePreset.MOUSE are valid enum values, but they are not bundled runtime lanes here
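The preset rules can be read as a short decision procedure. The sketch below uses local stand-in enums so that it runs on its own. The enum values, the `ReferencePreset.RAT` member, and `resolve_bundled_organism` are assumptions made for illustration and are not taken from PhosPy's API.

```python
from enum import Enum


class Organism(Enum):  # local stand-in, not PhosPy's Organism
    RAT = "rat"
    HUMAN = "human"
    MOUSE = "mouse"


class ReferencePreset(Enum):  # local stand-in; the RAT member is assumed
    AUTO = "auto"
    RAT = "rat"
    HUMAN = "human"
    MOUSE = "mouse"


BUNDLED_ORGANISMS = {Organism.RAT}  # rat-only in this release


def resolve_bundled_organism(preset, dataset_organism):
    """Return the organism whose bundled references would be used, or raise."""
    if preset is ReferencePreset.AUTO:
        if dataset_organism is None:
            raise ValueError("ReferencePreset.AUTO requires dataset.organism")
        organism = dataset_organism
    else:
        organism = Organism(preset.value)
        if dataset_organism is not None and dataset_organism is not organism:
            raise ValueError("preset and dataset organism disagree")

    if organism not in BUNDLED_ORGANISMS:
        raise ValueError(
            f"no bundled references for {organism.value}; "
            "pass an explicit ReferenceBundle"
        )
    return organism
```

Under these assumptions `ReferencePreset.HUMAN` is a valid enum value that still fails at the last step, which matches the distinction drawn between valid values and bundled runtime lanes.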

ReferenceBundle itself must contain non-empty, internally consistent tables.

Workflow validation

Kinase workflow

KinaseWorkflowRequest validation checks:

  • dataset is AnalysisReadyPhosphoDataset
  • references is ReferencePreset or ReferenceBundle
  • config values are in supported ranges
  • scoring support floor is respected (min_substrates >= 2)

Signalome workflow

SignalomeWorkflowRequest validation checks:

  • kinase_result is KinaseWorkflowResult
  • signalome config values are in supported ranges
  • upstream matrices are usable for signalome execution
  • kinase_result.dataset.site_metadata.protein_id exists and is non-empty for all interpreted sites

Signalome is intentionally strict about protein identity. A site ID such as TSC2;S939; is not a substitute for protein_id.
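A quick way to spot the problem before running signalome is to list the sites whose protein_id is empty. In PhosPy the values live in a site_metadata column; to stay dependency-free this sketch takes a plain mapping from site ID to protein_id instead. The helper name `sites_missing_protein_id` and the treatment of whitespace-only and non-string values as empty are assumptions.

```python
def sites_missing_protein_id(protein_ids):
    """Return site IDs whose protein_id is missing, empty, or not a string.

    `protein_ids` maps site ID -> protein_id value. Non-string values such as
    None or a float NaN are treated as missing (an assumption of this sketch).
    """
    return [
        site_id
        for site_id, value in protein_ids.items()
        if not isinstance(value, str) or not value.strip()
    ]
```

Any site ID returned here would need a real protein_id filled in. The site ID itself, such as TSC2;S939;, cannot stand in for it.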

Boundary errors during workflow execution

Some failures happen after request validation, during workflow interpretation or execution. These often raise WorkflowBoundaryError with:

  • a seam name
  • counts or other details
  • a next_action hint

Typical examples are overlap failures, low-support failures, or signalome network/module preconditions.
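When handling these errors, it helps to surface all three pieces together. The sketch below defines a local stand-in exception so that it runs without PhosPy. The attribute names `seam`, `details`, and `next_action`, the dict shape of `details`, and the example seam name are assumptions, so check the real class before relying on them.

```python
class WorkflowBoundaryError(Exception):
    """Local stand-in; the real class and its attribute names may differ."""

    def __init__(self, seam, details, next_action):
        super().__init__(f"{seam}: {details}")
        self.seam = seam
        self.details = details
        self.next_action = next_action


def describe_boundary_error(error):
    """Format seam, details, and next action as a short multi-line report."""
    lines = [f"seam: {error.seam}"]
    for key, value in sorted(error.details.items()):
        lines.append(f"  {key}: {value}")
    lines.append(f"next action: {error.next_action}")
    return "\n".join(lines)


try:
    # Hypothetical failure, raised here only to demonstrate the handler.
    raise WorkflowBoundaryError(
        seam="example_overlap_seam",
        details={"dataset_sites": 1200, "overlapping_sites": 0},
        next_action="check that site IDs match the reference organism",
    )
except WorkflowBoundaryError as exc:
    report = describe_boundary_error(exc)
    print(report)
```

Logging the next action next to the counts keeps the suggested fix attached to the evidence that triggered it.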

Quick fix table

| Problem | Common fix |
| --- | --- |
| input format rejected | pass a DataFrame or supported file path |
| dataset organism missing for AUTO | set organism=Organism.RAT for bundled first runs |
| bundled human/mouse preset fails | use an explicit ReferenceBundle |
| signalome fails on protein_id | add a non-empty protein_id column |
| rows dropped in site-matrix building | review sequence support and preprocessing policy |

Validation ownership summary

| Invariant | Owner |
| --- | --- |
| builder input source checks | DatasetBuildRequestValidator |
| preprocessing config policy | DatasetPreprocessingConfigValidator |
| analysis-ready dataset structure/content | AnalysisReadyDatasetValidator |
| transformation-state coherence | TransformationStateValidator |
| reference compatibility | ReferenceCompatibilityValidator |
| reference bundle structure/content | ReferenceBundleValidator |
| kinase workflow request/config validity | KinaseWorkflowValidator |
| signalome workflow request/config validity | SignalomeWorkflowValidator |
| runtime seam diagnostics | workflow interpreters/executors via WorkflowBoundaryError |

Where next