This guide explains the main public validation rules.
Use Troubleshooting first if you just want the fastest fix.
PhosPy validation is there to keep the public workflow honest:
- builder inputs must be readable and aligned
- dataset objects must be structurally valid
- references must match the dataset
- workflow requests must contain the right types and usable values
DatasetBuildRequest supports two public input routes:
- pandas DataFrame values
- file paths for supported table formats
Required fields:
- phospho
- site_metadata

Optional fields:
- sample_metadata
- total
- organism
- preprocessing_config
Main checks:
- phospho and site_metadata must be DataFrames or supported file paths
- phospho must be numeric
- site_metadata.index must align to phospho.index
- required site metadata information must be available through columns or supported derivation
Supported aliases are intentionally narrow:
- gene_symbol: gene_symbol, gene_name
- site: site
- site_sequence: site_sequence, centralized_sequence
- protein_id: protein_id
Unsupported legacy aliases are rejected, including:
- gene
- residue
- phosphosite
- site_position
- sequence
- protein
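
The alias rules above can be sketched in plain Python. This is an illustration only: `SUPPORTED_ALIASES`, `REJECTED_LEGACY_ALIASES`, and `resolve_site_metadata_columns` are hypothetical names, not part of PhosPy, and raising `ValueError` on a legacy alias is an assumption about what "rejected" means here.

```python
# Hypothetical lookup tables transcribed from the alias lists in this guide.
# They are NOT PhosPy's internal data structures.
SUPPORTED_ALIASES = {
    "gene_symbol": ("gene_symbol", "gene_name"),
    "site": ("site",),
    "site_sequence": ("site_sequence", "centralized_sequence"),
    "protein_id": ("protein_id",),
}
REJECTED_LEGACY_ALIASES = frozenset(
    {"gene", "residue", "phosphosite", "site_position", "sequence", "protein"}
)


def resolve_site_metadata_columns(columns):
    """Return {canonical_name: source_column} for the columns that are present.

    Raises ValueError if any rejected legacy alias is present (assumed behavior).
    """
    present = list(columns)
    legacy = sorted(set(present) & REJECTED_LEGACY_ALIASES)
    if legacy:
        raise ValueError(f"unsupported legacy aliases: {legacy}")

    resolved = {}
    for canonical, aliases in SUPPORTED_ALIASES.items():
        # The first alias found wins; the canonical name is listed first.
        source = next((name for name in aliases if name in present), None)
        if source is not None:
            resolved[canonical] = source
    return resolved
```

Canonical names missing from the returned mapping are the ones that would have to come from derivation or be supplied by the caller.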
If gene_symbol and/or site are missing, PhosPy can derive them only from
index values exactly matching "<gene_symbol>;<site>;".
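
A minimal sketch of the index-derivation rule, assuming that both parts are non-empty and contain no semicolons. The helper name `derive_gene_symbol_and_site` is hypothetical; PhosPy's own parsing may be stricter or differ in detail.

```python
import re

# Assumption: "<gene_symbol>;<site>;" means two non-empty, semicolon-free
# parts, each followed by a semicolon, with nothing before or after.
_SITE_ID_PATTERN = re.compile(r"([^;]+);([^;]+);")


def derive_gene_symbol_and_site(index_value):
    """Return (gene_symbol, site) for an exact match, otherwise None."""
    match = _SITE_ID_PATTERN.fullmatch(index_value)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Under this reading, `"TSC2;S939;"` yields `("TSC2", "S939")`, while `"TSC2;S939"` (no trailing semicolon) yields `None`.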
AnalysisReadyPhosphoDataset is the strict workflow-facing boundary.
AnalysisReadyPhosphoDataset itself is strict, missing-value-free, and intended
for workflow execution rather than loose exploratory ingestion.
Main expectations:
- phospho and site_metadata are DataFrames
- site identity is coherent between row IDs and metadata
- required metadata values are non-empty
- transformation state is established and coherent
- the supported builder lane hands workflows a missing-value-free dataset
DatasetPreprocessingConfig groups four policy areas:
- missing_data
- total_protein_correction
- site_matrix
- comparisons
Key public rules:
- missing_data.policy="forbid" is the strict default
- missing_data.policy="impute_row_median" requires min_observed_values
- total_protein_correction.policy="ratio_to_total" requires a total table aligned to the phospho samples
- site_matrix.policy="build_from_metadata" may reduce row count when rows cannot be supported in that lane
- the public builder lane still ends in a missing-value-free AnalysisReadyPhosphoDataset
- comparisons.policy="sample_metadata_pairs" requires matching sample_metadata and a usable sample-group column
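
The dependency rules between policies and their required inputs can be sketched as a plain function. Everything here is hypothetical scaffolding: `find_policy_problems` and its parameter names are not PhosPy's API, and only the three "requires" rules stated above are modeled (table alignment and the sample-group column are not checked).

```python
def find_policy_problems(
    missing_data_policy="forbid",  # strict default named in this guide
    min_observed_values=None,
    total_protein_correction_policy=None,
    has_total=False,
    comparisons_policy=None,
    has_sample_metadata=False,
):
    """Return human-readable problems; an empty list means no rule is violated."""
    problems = []
    if missing_data_policy == "impute_row_median" and min_observed_values is None:
        problems.append("impute_row_median requires min_observed_values")
    if total_protein_correction_policy == "ratio_to_total" and not has_total:
        problems.append("ratio_to_total requires a total table")
    if comparisons_policy == "sample_metadata_pairs" and not has_sample_metadata:
        problems.append("sample_metadata_pairs requires sample_metadata")
    return problems
```

Collecting every problem before reporting, rather than stopping at the first one, is a design choice of this sketch so that a caller can fix a config in one pass.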
Reference rules are simple but strict:
- ReferencePreset.AUTO requires dataset.organism
- explicit preset and dataset organism must agree when both are set
- explicit ReferenceBundle.organism and dataset organism must agree when both are set
- bundled runtime references are rat-only in this release
- ReferencePreset.HUMAN and ReferencePreset.MOUSE are valid enum values, but they are not bundled runtime lanes here
ReferenceBundle itself must contain non-empty, internally consistent tables.
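
The preset rules can be illustrated with stand-in enums. `OrganismSketch`, `PresetSketch`, and `resolve_bundled_organism` are hypothetical and only mirror the rules stated in this guide; the real `ReferencePreset` and `Organism` types may have different members or values (a `RAT` preset member is assumed here), and explicit `ReferenceBundle` handling is left out.

```python
from enum import Enum


class OrganismSketch(Enum):  # stand-in for the real Organism enum
    RAT = "rat"
    HUMAN = "human"
    MOUSE = "mouse"


class PresetSketch(Enum):  # stand-in for ReferencePreset; RAT member is assumed
    AUTO = "auto"
    RAT = "rat"
    HUMAN = "human"
    MOUSE = "mouse"


# Per this guide, only rat references are bundled in this release.
BUNDLED_ORGANISMS = frozenset({OrganismSketch.RAT})


def resolve_bundled_organism(preset, dataset_organism):
    """Return the organism whose bundled references would be used, or raise."""
    if preset is PresetSketch.AUTO:
        if dataset_organism is None:
            raise ValueError("AUTO requires dataset.organism")
        organism = dataset_organism
    else:
        organism = OrganismSketch[preset.name]
        if dataset_organism is not None and dataset_organism is not organism:
            raise ValueError("preset and dataset organism disagree")
    if organism not in BUNDLED_ORGANISMS:
        raise ValueError(
            f"no bundled references for {organism.name}; "
            "use an explicit ReferenceBundle"
        )
    return organism
```

Note how a human or mouse preset passes the agreement check but still fails at the last step, which matches the "valid enum value, not a bundled lane" distinction.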
KinaseWorkflowRequest validation checks:
- dataset is AnalysisReadyPhosphoDataset
- references is ReferencePreset or ReferenceBundle
- config values are in supported ranges
- scoring support floor is respected (min_substrates >= 2)
SignalomeWorkflowRequest validation checks:
- kinase_result is KinaseWorkflowResult
- signalome config values are in supported ranges
- upstream matrices are usable for signalome execution
- kinase_result.dataset.site_metadata.protein_id exists and is non-empty for all interpreted sites
Signalome is intentionally strict about protein identity. A site ID such as
TSC2;S939; is not a substitute for protein_id.
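
A pre-flight check for this rule might look like the sketch below. It works on a plain mapping from site ID to `protein_id` so that it stays dependency-free; `sites_missing_protein_id` is a hypothetical helper, and treating `None`, NaN, and whitespace-only strings as "empty" is an assumption. The protein IDs in the usage note are placeholders.

```python
def sites_missing_protein_id(protein_id_by_site):
    """Return the site IDs whose protein_id is None, NaN, or blank."""
    missing = []
    for site_id, protein_id in protein_id_by_site.items():
        is_nan = isinstance(protein_id, float) and protein_id != protein_id
        if protein_id is None or is_nan or not str(protein_id).strip():
            missing.append(site_id)
    return missing
```

For example, `sites_missing_protein_id({"TSC2;S939;": "PROT_A", "AKT1;S473;": ""})` returns `["AKT1;S473;"]`: the site ID is well formed, but it still does not supply a protein identity.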
Some failures happen after request validation, during workflow interpretation or
execution. These often raise WorkflowBoundaryError with:
- a seam name
- counts or other details
- a next_action hint
Typical examples are overlap failures, low-support failures, or signalome network/module preconditions.
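
To make the shape of such an error concrete, here is a stand-in exception carrying the three pieces of information listed above. `BoundaryErrorSketch` is not the real `WorkflowBoundaryError`: its constructor, the attribute names `seam` and `details`, the seam label, and the `require_overlap` helper are all assumptions made for illustration.

```python
class BoundaryErrorSketch(Exception):
    """Stand-in for WorkflowBoundaryError; the real signature may differ."""

    def __init__(self, seam, details, next_action):
        super().__init__(f"[{seam}] {details}; next action: {next_action}")
        self.seam = seam
        self.details = dict(details)
        self.next_action = next_action


def require_overlap(dataset_sites, reference_sites, minimum=1):
    """Return the shared site IDs, or raise if fewer than `minimum` overlap."""
    shared = set(dataset_sites) & set(reference_sites)
    if len(shared) < minimum:
        raise BoundaryErrorSketch(
            seam="reference_overlap",  # hypothetical seam name
            details={"shared": len(shared), "required": minimum},
            next_action="check that the references match the dataset organism",
        )
    return shared
```

A caller can then catch the error and read `seam`, `details`, and `next_action` to decide what to fix, instead of parsing the message string.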
| Problem | Common fix |
|---|---|
| input format rejected | pass a DataFrame or supported file path |
| dataset organism missing for AUTO | set organism=Organism.RAT for bundled first runs |
| bundled human/mouse preset fails | use an explicit ReferenceBundle |
| signalome fails on protein_id | add a non-empty protein_id column |
| rows dropped in site-matrix building | review sequence support and preprocessing policy |
| Invariant | Owner |
|---|---|
| builder input source checks | DatasetBuildRequestValidator |
| preprocessing config policy | DatasetPreprocessingConfigValidator |
| analysis-ready dataset structure/content | AnalysisReadyDatasetValidator |
| transformation-state coherence | TransformationStateValidator |
| reference compatibility | ReferenceCompatibilityValidator |
| reference bundle structure/content | ReferenceBundleValidator |
| kinase workflow request/config validity | KinaseWorkflowValidator |
| signalome workflow request/config validity | SignalomeWorkflowValidator |
| runtime seam diagnostics | workflow interpreters/executors via WorkflowBoundaryError |