19 commits
5b0b217
test: add pytest infrastructure and smoke tests
tsenoner Mar 9, 2026
b705b53
build: migrate from setup.py to pyproject.toml
tsenoner Mar 9, 2026
fad205e
refactor(potenci): extract data to CSV, remove eval() and dead code
tsenoner Mar 9, 2026
777eb3a
style: apply ruff formatting and fix all lint issues
tsenoner Mar 9, 2026
bb19cf0
refactor: modernize project setup, POTENCI naming, and test structure
tsenoner Mar 9, 2026
bbb52fd
docs: restructure documentation, trim README, add docs/ directory
tsenoner Mar 9, 2026
04f17ba
feat: add POTENCI prediction cache and full dataset regression test
tsenoner Mar 10, 2026
fc0dbd3
fix(bmrb): add descriptive messages to bare ValueError raises
tsenoner Mar 10, 2026
6727269
refactor(potenci): clean up code structure, fix bugs, rename variables
tsenoner Mar 10, 2026
9074210
docs: update authors, emails, repo URL, and POTENCI docs
tsenoner Mar 10, 2026
6982301
refactor: replace os.path with pathlib, rename cryptic variables
tsenoner Mar 10, 2026
9bbe33c
test: mark full-dataset regression tests as slow, skip by default
tsenoner Mar 10, 2026
8eb7ad3
fix(filters): remove non-denaturants from blacklist, fix exp-method r…
tsenoner Mar 25, 2026
bc0a610
docs: expand CLAUDE.md with pipeline flow, caching, and filtering det…
tsenoner Mar 31, 2026
e2078c7
fix(output): cast seq column to object dtype before list assignment
tsenoner Mar 31, 2026
c9daf26
fix(filters): relax min-backbone-shift-types from 5 to 4 in strict tier
tsenoner Mar 31, 2026
d6a5de0
fix(bmrb): extend Celsius detection heuristic from 15-50 to 1-50 K
tsenoner Mar 31, 2026
72c9aff
feat(filters): add paramagnetic sample filter
tsenoner Apr 1, 2026
8726e8a
fix(filters): cast paramagnetic column to bool before negation
tsenoner Apr 1, 2026
43 changes: 29 additions & 14 deletions .gitignore
@@ -1,16 +1,31 @@
build
dist
test*
tmp*
*.log
*.zip
*.txt
# Python
__pycache__/
*.pyc
*.egg-info/
build/
dist/

# Environments
.venv/

# Tool caches
.ruff_cache/
.pytest_cache/
.claude/

# Data (large, downloaded separately)
/data/

# Test subset (symlinks to data/)
tests/bmrb_subset/

# Pipeline output
tmp*/
*.pkl
*.txt
__pycache__
data
*.csv
notebooks
*.egg-info
2024-05-09
*.log

# Internal planning notes
docs/_planning/

# OS
.DS_Store
75 changes: 75 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,75 @@
# TriZOD — Project Instructions

## Quick Start
- Package manager: `uv`
- Install: `uv sync --group dev`
- Run CLI: `uv run trizod --help`
- Run tests: `uv run pytest tests/ -v`
- Lint: `uv run ruff check trizod/` and `uv run ruff format --check trizod/`

## Before Every Commit
1. `uv run ruff check trizod/ tests/`
2. `uv run ruff format --check trizod/ tests/`
3. `uv run pytest tests/ -v`
All three must pass.

## Architecture
- `trizod/trizod.py` — pipeline orchestration, CLI, filtering (`filter_defaults` DataFrame, `prefilter_dataframe()`, `print_filter_losses()`, `compute_scores_row()`, `main()`)
- `trizod/bmrb/bmrb.py` — BMRB NMR-STAR file parsing (Entity, Assembly, SampleConditions, ShiftTable, BmrbEntry)
- `trizod/potenci/potenci.py` — POTENCI random coil shift predictions (public API: `get_pred_shifts()`)
- `trizod/scoring/scoring.py` — Z-score computation, offset correction (AIC-based global + 9-residue rolling window)
- `trizod/constants.py` — shared constants (BBATNS, AA mappings, weights)

## Pipeline Flow
1. Parse args (two-phase: preset first, then detailed args)
2. Find & load BMRB files → `load_bmrb_entries()` (with pickle cache in `tmp/bmrb_entries/`)
3. Prefilter entries → `prefilter_dataframe()` (applies all filter criteria from `filter_defaults`)
4. Compute scores → `compute_scores_row()` per entry (POTENCI predictions cached in `tmp/potenci/`, wSCS cached in `tmp/wSCS/`)
5. Post-filter (offset rejection) → entries exceeding `max-offset` are excluded or masked
6. Output results (JSON + optional CSV) + `print_filter_losses()` report
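
The two-phase argument parsing in step 1 can be sketched with the standard `argparse` module. This is a minimal illustration, not the code in `trizod/trizod.py`: the `TIER_DEFAULTS` dict and the `parse_args` helper are hypothetical, only two filters are shown, and their values are copied from the tolerant and strict columns of the README filter table.

```python
import argparse

# Hypothetical per-tier defaults (assumption: the real pipeline keeps these
# in the `filter_defaults` DataFrame). Values come from the README table.
TIER_DEFAULTS = {
    "tolerant": {"max_offset": 3.0, "ph_range": [2.0, 12.0]},
    "strict": {"max_offset": 2.0, "ph_range": [6.0, 8.0]},
}


def parse_args(argv):
    # Phase 1: read only the preset and ignore every other argument.
    preset_parser = argparse.ArgumentParser(add_help=False)
    preset_parser.add_argument(
        "--filter-defaults", choices=sorted(TIER_DEFAULTS), default="tolerant"
    )
    preset_args, _ = preset_parser.parse_known_args(argv)

    # Phase 2: build the full parser and seed it with the preset's values,
    # so explicitly passed filter arguments override the tier defaults.
    parser = argparse.ArgumentParser(parents=[preset_parser])
    parser.add_argument("--max-offset", type=float)
    parser.add_argument("--pH-range", dest="ph_range", type=float, nargs=2)
    parser.set_defaults(**TIER_DEFAULTS[preset_args.filter_defaults])
    return parser.parse_args(argv)
```

Calling `parse_args(["--filter-defaults", "strict", "--pH-range", "5", "9"])` keeps the strict `max_offset` but takes the pH range from the command line, which is the precedence described in the Filtering section.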

## Caching
- `--cache-dir` defaults to `./tmp`
- `tmp/bmrb_entries/` — pickled BmrbEntry objects
- `tmp/potenci/` — POTENCI predictions as JSON (content-addressed by seq+T+pH+ion hash)
- `tmp/wSCS/` — scored weighted chemical shifts as `.npz` (keyed by entry_id + shift table IDs)
- POTENCI cache is filter-independent (depends only on sequence + conditions), so re-runs with different filter settings are mostly cache hits
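
A content-addressed cache key of this kind might look like the sketch below. The function name `potenci_cache_path`, the choice of SHA-256, and the JSON serialization of the inputs are assumptions made for illustration; the notes above only state that the key is a hash over sequence, temperature, pH, and ionic strength.

```python
import hashlib
import json
from pathlib import Path


def potenci_cache_path(cache_dir, seq, temperature, ph, ionic_strength):
    """Hypothetical helper: map prediction inputs to a cache file path."""
    # Coerce numbers to float so 298 and 298.0 produce the same key, and
    # sort keys so the serialized payload is deterministic.
    payload = json.dumps(
        {
            "seq": seq,
            "T": float(temperature),
            "pH": float(ph),
            "ion": float(ionic_strength),
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return Path(cache_dir) / "potenci" / f"{digest}.json"
```

Because no filter setting enters the payload, two runs with different stringency tiers resolve to the same file for the same sequence and conditions.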

## Filtering
- 4 stringency tiers: `unfiltered`, `tolerant`, `moderate`, `strict`
- `filter_defaults` DataFrame in `trizod/trizod.py` defines defaults per tier
- Individual filters overridable via CLI args
- `print_filter_losses()` reports per-filter counts (filtered + uniquely filtered)
- See `docs/filtering.md` for full reference
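
The two counts reported per filter can be illustrated as follows. This sketch assumes that "uniquely filtered" means entries rejected by that filter and by no other; the function `filter_losses` and its input layout are hypothetical and are not taken from `print_filter_losses()`.

```python
def filter_losses(passes):
    """Count rejected entries per filter.

    `passes` maps a filter name to one boolean per entry; True means the
    entry passes that filter.
    """
    names = list(passes)
    n_entries = len(passes[names[0]]) if names else 0
    report = {}
    for name in names:
        # Entries this filter rejects, regardless of the other filters.
        failed = [i for i in range(n_entries) if not passes[name][i]]
        # Entries that only this filter rejects.
        unique = [
            i
            for i in failed
            if all(passes[other][i] for other in names if other != name)
        ]
        report[name] = {
            "filtered": len(failed),
            "uniquely_filtered": len(unique),
        }
    return report
```

A filter with a high "filtered" count but a low "uniquely filtered" count mostly removes entries that other filters would have removed anyway.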

## Offset Correction (not re-referencing)
- `scoring.py` detects per-atom-type systematic biases between observed and POTENCI-predicted shifts
- Two strategies: global offset (AIC test) and 9-residue rolling window; picks whichever yields lower Z-scores
- This is NOT classical NMR re-referencing (DSS/TMS) — it adjusts Z-score calculation internally
- The pipeline does NOT output re-referenced .str files or corrected chemical shifts
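
As a rough illustration of the two strategies, the sketch below uses the textbook least-squares form of AIC to decide whether a constant offset is justified, and a centered 9-residue mean for the rolling variant. All function names are hypothetical, the handling of window edges is an assumption, and the final step of picking whichever strategy yields lower Z-scores is not shown; `scoring.py` may differ in every one of these details.

```python
import math


def aic(residuals, n_params):
    # Gaussian-likelihood AIC up to an additive constant: n*ln(RSS/n) + 2k.
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(max(rss, 1e-12) / n) + 2 * n_params


def global_offset(observed, predicted):
    """Return the mean observed-minus-predicted shift if AIC favors it."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    mean = sum(diffs) / len(diffs)
    aic_without = aic(diffs, 0)  # model with no offset
    aic_with = aic([d - mean for d in diffs], 1)  # one fitted offset
    return mean if aic_with < aic_without else 0.0


def rolling_offsets(observed, predicted, window=9):
    """Mean difference in a centered window, truncated at sequence ends."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    half = window // 2
    offsets = []
    for i in range(len(diffs)):
        chunk = diffs[max(0, i - half) : i + half + 1]
        offsets.append(sum(chunk) / len(chunk))
    return offsets
```

The offsets returned here would only be subtracted inside the Z-score calculation; nothing in this sketch rewrites the input shifts, in line with the note that the pipeline does not output corrected chemical shifts.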

## Conventions
- Python >=3.9, ruff for linting/formatting
- Scientific variable names allowed (T, pH, Ion, N, etc.) — see ruff ignore rules
- Always show staged files and proposed commit message, then wait for user approval before committing

## Testing
- `tests/test_potenci.py` — POTENCI prediction accuracy and edge cases
- `tests/test_smoke.py` — CLI entrypoints, single-entry pipeline integration
- `tests/test_pipeline_regression.py` — 300-entry subset regression (requires data/)
- Pipeline/regression tests require BMRB data in `data/bmrb_entries/`
- 9 tests total, ~60-80s runtime

## Documentation
- `docs/pipeline.md` — detailed pipeline walkthrough (6 stages)
- `docs/potenci.md` — POTENCI module: origin, API, performance, internals
- `docs/filtering.md` — filter descriptions and default values per stringency level
- `docs/_planning/` — internal planning notes (gitignored)
- `docs/_planning/status-2026-03-25.md` — implementation status and roadmap

## Data
- BMRB entries: `data/bmrb_entries/` (17,388 files, not committed)
- Baselines: `data/baseline/` (not committed)
- Filter impact analysis: `data/filter_impact/` (not committed)
- Test reference: `tests/reference/unfiltered.json` (committed)
- Test subset IDs: `tests/quick_subset_ids.txt` (committed)
66 changes: 13 additions & 53 deletions README.md
@@ -4,68 +4,28 @@ Novel, continuous, per-residue disorder scores from protein NMR experiments stor

## Description

Accurate quantification of intrinsic disorder, crucial for understanding functional protein dynamics, remains challenging. We introduce TriZOD, an innovative scoring system for protein disorder analysis, utilizing nuclear magnetic resonance (NMR) spectroscopy chemical shifts. Traditional methods provide binary, residue-specific annotations, missing the complex spectrum of protein disorder. TriZOD extends the CheZOD scoring framework with quantitative statistical descriptors, offering a nuanced analysis of intrinsically disordered regions. It calculates per-residue scores from chemical shift data of polypeptides in the Biological Magnetic Resonance Data Bank (BMRB). The CheZOD Z-score is a quantitative metric for how much a set of experimentally determined chemical shifts deviate from random coil chemical shifts. The TriZOD G-scores extend upon them to be independent of the number of available chemical shifts. They are normalized to range between 0 and 1, which is beneficial for interpretation and use in training disorder predictors. Additionally, TriZOD introduces a refined, automated selection of BMRB datasets, including filters for physicochemical properties, keywords, and chemical denaturants. We calculated G-scores for over 15,000 peptides in the BMRB, approximately 10-fold the size of previously published CheZOD datasets.
Validation against DisProt annotations demonstrates substantial agreement yet highlights discrepancies, suggesting the need to reevaluate some disorder annotations. TriZOD advances protein disorder prediction by leveraging the full potential of the BMRB database, refining our understanding of disorder, and challenging existing annotations.
TriZOD computes per-residue disorder scores from NMR chemical shift data in the Biological Magnetic Resonance Data Bank (BMRB). It extends the CheZOD scoring framework with quantitative statistical descriptors, offering nuanced analysis of intrinsically disordered regions. The CheZOD Z-score measures how much experimentally determined chemical shifts deviate from random coil predictions. The TriZOD G-score normalizes these Z-scores to [0, 1], independent of the number of available shift types.

## Installation

## Usage
## Architecture

## Datasets

The latest dataset is published under the DOI [10.6084/m9.figshare.25792035](https://www.doi.org/10.6084/m9.figshare.25792035).
TriZOD consists of three core modules:

This publication consists of four nested datasets of increasing filter stringency: Unfiltered, tolerant, moderate and strict. An overview of the applied filters is given below. The .json files contain all entries of the BMRB that are in accordance with the given filter levels. These are not redundancy reduced and also contain the test set entries and are therefore not intended for direct use as training sets in machine learning applications. Instead, for this purpose, please use only those entries with IDs found in the [filter_level]_rest_set.fasta files and extract the corresponding information such as TriZOD G-scores and/or physicochemical properties from the respective .json files. These fasta files contain the cluster representatives of the redundancy reduction procedure which was performed in an iterative fashion such that clusters with members found in all filter levels are shared among them and have the same cluster representatives. If necessary, all other cluster members can be retrieved from the given [filter_level]_rest_clu.tsv files. The file TriZOD_test_set.fasta contains the IDs and sequences of the TriZOD test set. It is intended that the corresponding data is taken from the strict dataset.
- **`trizod/bmrb/`** — Parses BMRB NMR-STAR files into structured data (entities, assemblies, sample conditions, shift tables)
- **`trizod/potenci/`** — Predicts random coil chemical shifts using the [POTENCI](https://github.com/protein-nmr/POTENCI) algorithm ([docs](docs/potenci.md))
- **`trizod/scoring/`** — Computes per-residue CheZOD Z-scores and TriZOD G-scores from experimental vs. predicted shifts

### Filter defaults
The main pipeline (`trizod/trizod.py`) orchestrates these modules. See [docs/pipeline.md](docs/pipeline.md) for a detailed walkthrough.

TriZOD filters the peptide shift data entries in the BMRB database given a set of filter criteria. Though these criteria can be set individually with corresponding command line arguments, it is most convenient to use one of four filter default options to adapt the overall stringency of the filters. The command line argument `--filter-defaults` sets default values for all data filtering criteria. The accepted options with increasing stringency are `unfiltered`, `tolerant`, `moderate` and `strict`. The affected filters are:
## Installation

| Filter | Description |
| :--- | --- |
| temperature-range | Minimum and maximum temperature in Kelvin. |
| ionic-strength-range | Minimum and maximum ionic strength in M (mol/L). |
| pH-range | Minimum and maximum pH. |
| unit-assumptions | Assume units for temperature, ionic strength, and pH if they are not given, instead of excluding the entries. |
| unit-corrections | Correct values for Temp., Ionic str. and pH if units are most likely wrong. |
| default-conditions | Assume standard conditions if pH (7), ionic strength (0.1 M) or temperature (298 K) are missing, instead of excluding the entries. |
| peptide-length-range | Minimum (and optionally maximum) peptide sequence length. |
| min-backbone-shift-types | Minimum number of different backbone shift types (max 7). |
| min-backbone-shift-positions | Minimum number of positions with at least one backbone shift. |
| min-backbone-shift-fraction | Minimum fraction of positions with at least one backbone shift. |
| max-noncanonical-fraction | Maximum fraction of non-canonical amino acids (X count as arbitrary canonical) in the amino acid sequence. |
| max-x-fraction | Maximum fraction of X letters (arbitrary canonical amino acid) in the amino acid sequence. |
| keywords-blacklist | Exclude entries with any of these keywords mentioned anywhere in the BMRB file, case ignored. |
| chemical-denaturants | Exclude entries with any of these chemicals as substrings of sample components, case ignored. |
| exp-method-whitelist | Include only entries with any of these keywords as substring of the experiment subtype, case ignored. |
| exp-method-blacklist | Exclude entries with any of these keywords as substring of the experiment subtype, case ignored. |
| max-offset | Maximum valid offset correction for any random coil chemical shift type. |
| reject-shift-type-only | Upon exceeding the maximal offset set by `--max-offset`, exclude only the backbone shifts exceeding the offset instead of the whole entry. |
## Usage

The following table lists the respective filtering criteria for each of the four filter default options:
## Datasets

| Filter | unfiltered | tolerant | moderate | strict |
| :--- | --- | --- | --- | --- |
| temperature-range | [-inf,+inf] | [263,333] | [273,313] | [273,313] |
| ionic-strength-range | [0,+inf] | [0,7] | [0,5] | [0,3] |
| pH-range | [-inf,+inf] | [2,12] | [4,10] | [6,8] |
| unit-assumptions | Yes | Yes | Yes | No |
| unit-corrections | Yes | Yes | No | No |
| default-conditions | Yes | Yes | Yes | No |
| peptide-length-range | [5,+inf] | [5,+inf] | [10,+inf] | [15,+inf] |
| min-backbone-shift-types | 1 | 2 | 3 | 5 |
| min-backbone-shift-positions | 3 | 3 | 8 | 12 |
| min-backbone-shift-fraction | 0.0 | 0.0 | 0.6 | 0.8 |
| max-noncanonical-fraction | 1.0 | 0.1 | 0.025 | 0.0 |
| max-x-fraction | 1.0 | 0.2 | 0.05 | 0.0 |
| keywords-blacklist | [] | ['denatur'] | ['denatur', 'unfold', 'misfold'] | ['denatur', 'unfold', 'misfold', 'interacti', 'bound'] |
| chemical-denaturants | [] | ['guanidin', 'GdmCl', 'Gdn-Hcl','urea'] | ['guanidin', 'GdmCl', 'Gdn-Hcl','urea'] | ['guanidin', 'GdmCl', 'Gdn-Hcl','urea','BME','2-ME','mercaptoethanol', 'TFA', 'trifluoroethanol', 'Potassium Pyrophosphate', 'acetic acid', 'CD3COOH', 'DTT', 'dithiothreitol', 'dss', 'deuterated sodium acetate'] |
| exp-method-whitelist | ['', '.'] | ['','solution', 'structures'] | ['','solution', 'structures'] | ['solution', 'structures'] |
| exp-method-blacklist | [] | ['solid', 'state'] | ['solid', 'state'] | ['solid', 'state'] |
| max-offset | +inf | 3 | 3 | 2 |
| reject-shift-type-only | Yes | Yes | No | No |
The latest dataset is published under the DOI [10.6084/m9.figshare.25792035](https://www.doi.org/10.6084/m9.figshare.25792035).

Please note that each of these filters can be set individually with respective command line options and that this will take precedence over the filter defaults set by the `--filter-defaults` option.
Four nested datasets of increasing filter stringency are provided: unfiltered, tolerant, moderate, and strict. See [docs/filtering.md](docs/filtering.md) for the complete filter reference and default values.

## Project status

Under active development