Modernize codebase: project setup, POTENCI refactor, docs, and tests by tsenoner · Pull Request #10 · MarkusHaak/trizod

tsenoner · 2026-03-10T13:43:07Z

Summary

Pure refactoring and modernization of the entire codebase — no pipeline behavior changes. All regression tests pass against existing baselines.

What's included (12 commits):

Build: Migrate from setup.py to pyproject.toml with uv
Style: Apply ruff formatting and fix all lint issues
Refactor (POTENCI): Extract data tables to CSV files, remove eval() and dead code, clean up code structure, fix bugs, rename variables to descriptive names
Refactor: Replace os.path with pathlib, rename cryptic variables across all modules
Tests: Add pytest infrastructure, smoke tests, POTENCI accuracy tests, 300-entry subset regression, full-dataset regression (slow-marked), prediction cache
Docs: Restructure documentation — add pipeline.md, potenci.md, filtering.md; trim README; update authors/emails/repo URL
Fix (bmrb): Add descriptive messages to bare ValueError raises

Stats: 41 files changed, +6892 / −3461 lines

Test plan

uv run pytest tests/ -v — all 9 default tests pass
uv run pytest tests/ -v --run-slow — all 13 tests pass (including full-dataset regression)
uv run ruff check trizod/ tests/ — no lint issues
uv run ruff format --check trizod/ tests/ — formatting clean
Regression baselines unchanged

🤖 Generated with Claude Code

Set up pytest with smoke tests covering CLI, POTENCI predictions, and single-entry pipeline computation. Fix .gitignore to use explicit patterns (the old `test*` glob would have excluded the new `tests/` directory). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
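The .gitignore problem described above can be reproduced with a small sketch. It uses Python's `fnmatch` as a stand-in for gitignore matching, which is only an approximation but agrees for a simple pattern like this one. The file name `test_output.csv` is a made-up example.

```python
from fnmatch import fnmatchcase

# The old, overly broad pattern from .gitignore.
OLD_PATTERN = "test*"


def is_ignored(name: str, pattern: str = OLD_PATTERN) -> bool:
    """Return True if a top-level name matches the glob pattern.

    fnmatchcase only approximates gitignore semantics; it is used here
    purely to illustrate why "test*" also catches "tests".
    """
    return fnmatchcase(name, pattern)


for name in ("test_output.csv", "tests", "trizod"):
    print(f"{name}: ignored={is_ignored(name)}")
```

Any name that merely starts with `test` is caught, which is why an explicit pattern is safer once a `tests/` directory exists.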

Replace deprecated setup.py with modern pyproject.toml using hatchling as the build backend (PEP 621). Move pytest config into pyproject.toml and remove standalone pytest.ini. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Extract hardcoded data tables (centshifts, neicorrs, termcorrs, tempcoeffs, combdevs) to CSV files in trizod/potenci/data/
- Replace all eval() calls with float() and safe parsing
- Remove dead code: getpredshifts_arr(), getphcorrs_arr(), main(), writeOutput()
- Add 300-entry regression test suite with strategic sampling across filter levels, sequence lengths, and edge cases
- Remove old ad-hoc test scripts (test/); add TODO for CheZOD equality test (needs reference data)
- Fix .gitignore: /data/ for top-level only, add tests/bmrb_subset/

Verified: all predictions are bit-for-bit identical to original code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
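As a rough illustration of replacing eval() with float() when reading an extracted table, here is a minimal sketch. The column layout, the function name `load_shift_table`, and the numbers are all placeholders; the real CSV files in trizod/potenci/data/ are not shown in this PR description.

```python
import csv
import io

# Placeholder table: the layout and the values are invented for illustration,
# they are not real POTENCI parameters.
EXAMPLE_CSV = """residue,HA,CA
A,1.0,2.5
G,3.0,4.5
"""


def load_shift_table(text: str) -> dict[str, dict[str, float]]:
    """Parse a CSV table into {residue: {atom: value}} using float()."""
    table: dict[str, dict[str, float]] = {}
    for row in csv.DictReader(io.StringIO(text)):
        residue = row.pop("residue")
        # float() only accepts numeric literals, so arbitrary expressions
        # that eval() would execute raise ValueError instead.
        table[residue] = {atom: float(value) for atom, value in row.items()}
    return table


table = load_shift_table(EXAMPLE_CSV)
print(table["A"]["CA"])
```

Unlike eval(), float() rejects anything that is not a number, so a corrupted or malicious cell fails loudly rather than running code.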

Add ruff config (E, W, F, I, N, UP, B, C4, SIM rules with scientific code exceptions). Format all modules and fix 32 lint issues: bare excepts, type() comparisons, collapsible ifs, %-formatting → f-strings, unused re-exports (__all__), lambda closure binding, raise-from. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
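Two of the lint categories listed above, lambda closure binding and raise-from, are easy to get wrong, so a small sketch may help. The function names are hypothetical and do not come from the trizod codebase.

```python
def make_scalers_buggy(factors):
    # Each lambda looks up `factor` when it is called, not when it is
    # created, so every lambda ends up using the last value.
    return [lambda x: x * factor for factor in factors]


def make_scalers_fixed(factors):
    # Binding `factor` as a default argument captures the value per iteration.
    return [lambda x, factor=factor: x * factor for factor in factors]


def parse_temperature(text):
    try:
        return float(text)
    except ValueError as err:
        # `raise ... from err` keeps the original exception as __cause__,
        # so the traceback shows both the low-level and the domain error.
        raise ValueError(f"invalid temperature: {text!r}") from err


print([scale(1) for scale in make_scalers_buggy([2, 3])])
print([scale(1) for scale in make_scalers_fixed([2, 3])])
```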

- Switch to uv as package manager (pyproject.toml deps, uv.lock)
- Remove legacy POTENCI code (potenci1_3.py)
- Rename all POTENCI functions/variables/constants to descriptive snake_case
- Extract phshifts data to CSV, cache module-level data, vectorize rolling RMS
- Add dedicated test_potenci.py, make all tests ruff-conformant
- Add CLAUDE.md, POTENCI README, and architecture section to project README
- Add docs/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move detailed filter tables and POTENCI docs out of README into docs/. Add pipeline walkthrough. Reorganize .gitignore and move internal planning notes to gitignored docs/_planning/. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Content-based cache keyed by hash(seq, T, pH, ion) avoids recomputing POTENCI predictions across pipeline runs. Includes precompute script for batch caching and regression test against baseline outputs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
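A content-based cache of this kind could look roughly like the sketch below. The commit only says the key is a hash of (seq, T, pH, ion); the use of SHA-256 over a JSON payload, the in-memory dictionary, and the names `cache_key` and `cached_predict` are assumptions made for illustration.

```python
import hashlib
import json

# In-memory stand-in for the cache; the real cache presumably persists to disk.
_cache: dict[str, list[float]] = {}


def cache_key(seq: str, temperature: float, ph: float, ion: float) -> str:
    """Derive a stable key from the prediction inputs.

    Python's built-in hash() of strings is randomized per process, so a
    cache that survives across runs needs a stable digest such as SHA-256.
    """
    payload = json.dumps([seq, temperature, ph, ion])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_predict(seq, temperature, ph, ion, predict):
    """Call `predict` only when these exact inputs have not been seen."""
    key = cache_key(seq, temperature, ph, ion)
    if key not in _cache:
        _cache[key] = predict(seq, temperature, ph, ion)
    return _cache[key]
```

Because the key depends only on the inputs, two entries with the same sequence and conditions share a single prediction.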

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove dead code (CSV I/O functions, pka_csv_path/identifier params)
- Fix 3-residue pentamer crash (IndexError on short sequences)
- Suppress harmless overflow/OptimizeWarning in pKa fitting
- Extract BB_ATOMS constant, _SKIP_ATOM_PAIRS, _build_pentamer helper
- Rename non-descriptive variables to descriptive names
- Reorder file: constants → data loading → helpers → pKa → pH → API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
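The short-sequence crash can be pictured with a sketch of a five-residue window that pads instead of indexing out of range. This is not the PR's `_build_pentamer` helper, whose implementation is not shown here; the function name `pentamer_window` and the `-` padding character are assumptions.

```python
def pentamer_window(seq: str, center: int, pad: str = "-") -> str:
    """Return the 5-residue window around `center`, padded at the termini.

    Positions outside the sequence are filled with `pad` (a hypothetical
    placeholder) rather than raising IndexError on short sequences.
    """
    if not 0 <= center < len(seq):
        raise IndexError(f"center {center} outside sequence of length {len(seq)}")
    return "".join(
        seq[i] if 0 <= i < len(seq) else pad
        for i in range(center - 2, center + 3)
    )


print(pentamer_window("AGS", 1))
```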

Add both authors (Markus Haak, Tobias Senoner) with @tum.de emails. Update GitHub URL to MarkusHaak/trizod. Update POTENCI docs to reflect current API (removed pka_csv_path, renamed private functions, caching). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Migrate all path handling from os.path to pathlib.Path across the codebase. Rename non-descriptive variables in scoring.py (dAIC → delta_AIC, ashwi_ → abs_weighted_diffs, ol_ → outlier_mask, etc.), trizod.py (m → match, kw → keyword, fp → cache_path, scores_ → score_array, method_whitelist_ → whitelist_lower, etc.), and bmrb.py (pplist(l) → pplist(items), comprehension vars a/s → descriptive names). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
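For readers unfamiliar with the migration, a before/after sketch of the same path operation is shown here. The function names and the `cache`/`bmr1` arguments are illustrative only and are not taken from trizod.

```python
import os.path
from pathlib import Path


def cache_file_os_path(cache_dir: str, entry_id: str) -> str:
    # Old style: string concatenation plus os.path.join.
    return os.path.join(cache_dir, entry_id + ".csv")


def cache_file_pathlib(cache_dir: str, entry_id: str) -> Path:
    # New style: Path objects compose with the / operator.
    return Path(cache_dir) / f"{entry_id}.csv"


path = cache_file_pathlib("cache", "bmr1")
print(path.stem, path.suffix)
```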

Add @pytest.mark.slow to full-dataset regression tests (4 tests that each run the entire pipeline on 17k entries). Default pytest run now takes ~47s instead of ~6min. Run all tests with: pytest -m "" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…egex Remove 9 chemicals from strict chemical-denaturants that are not denaturants: reducing agents (DTT, BME, mercaptoethanol), NMR reference standards (DSS), and common buffers (acetic acid, CD3COOH, deuterated sodium acetate). Fix exp-method-blacklist by removing "state" keyword which incorrectly matched "solution-state" experiment subtypes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
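The false positive on "solution-state" can be sketched as follows, assuming the blacklist is applied as a case-insensitive substring match. That matching rule, the function name, and the remaining keyword `solid` are assumptions, since the actual blacklist contents are not listed in this commit message.

```python
def matches_blacklist(experiment_subtype: str, keywords: list[str]) -> bool:
    """Return True if any blacklist keyword occurs in the subtype string."""
    lowered = experiment_subtype.lower()
    return any(keyword in lowered for keyword in keywords)


# Hypothetical keyword lists before and after removing "state".
old_keywords = ["solid", "state"]
new_keywords = ["solid"]

for subtype in ("solution-state", "solid-state"):
    print(subtype, matches_blacklist(subtype, old_keywords),
          matches_blacklist(subtype, new_keywords))
```

With "state" removed, solution-state experiments pass while solid-state ones are still caught by the more specific keyword.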

…ails Add detailed architecture notes (key functions per module), pipeline flow (6 stages), caching system documentation, offset correction clarification (not re-referencing), and updated data paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pandas str-typed columns reject list assignment. Cast to object first to fix CSV output crash on full dataset runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Modern NMR experiments commonly measure only 4 backbone shift types (HN, N, CO, CA) without HA. Requiring 5 types unnecessarily excluded many high-quality modern datasets. Code change in previous commit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Temperatures 1-14 K are physically impossible for liquid-state NMR (sample would be frozen). Values in this range are certainly Celsius. Only 3 conditions across 2 entries (bmr52889, bmr53214) are affected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
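The correction described here amounts to a range check followed by a unit conversion. In this sketch the function name and the inclusive bounds are assumptions; the commit states only that values of 1-14 are treated as Celsius.

```python
KELVIN_OFFSET = 273.15


def normalize_temperature(value: float) -> float:
    """Return a temperature in kelvin.

    Values in the range 1-14 (inclusive bounds are an assumption) cannot be
    kelvin for a liquid sample, so they are interpreted as Celsius.
    """
    if 1 <= value <= 14:
        return value + KELVIN_OFFSET
    return value


print(normalize_temperature(10.0))
print(normalize_temperature(298.0))
```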

Parse _Assembly.Paramagnetic and _Entity.Paramagnetic tags from BMRB NMR-STAR files. Exclude entries flagged as paramagnetic at tolerant+ tiers. Paramagnetic samples (iron proteins, lanthanide tags, etc.) cause 0.5-5+ ppm shift perturbations that make Z-score computation meaningless since POTENCI assumes diamagnetic conditions. Checks both assembly and entity level to catch 15 entries with contradictory Assembly=no/Entity=yes flags. ~183 entries affected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
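Checking both levels means an entry counts as paramagnetic if either the assembly tag or any entity tag says so. The sketch below assumes the tag values have already been extracted as strings such as "yes" or "no"; the function name and that representation are hypothetical.

```python
from typing import Optional


def is_paramagnetic(
    assembly_flag: Optional[str], entity_flags: list[Optional[str]]
) -> bool:
    """Return True if the assembly or any entity is flagged paramagnetic."""

    def is_yes(flag: Optional[str]) -> bool:
        # Missing tags are treated as "not flagged".
        return flag is not None and flag.strip().lower() == "yes"

    return is_yes(assembly_flag) or any(is_yes(flag) for flag in entity_flags)


# Contradictory flags (Assembly=no, Entity=yes) still count as paramagnetic.
print(is_paramagnetic("no", ["yes"]))
```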

The parallel_apply returns mixed types — cast to bool explicitly to avoid TypeError on pandas invert operation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tsenoner and others added 19 commits March 9, 2026 17:20

build: migrate from setup.py to pyproject.toml

b705b53

fix(bmrb): add descriptive messages to bare ValueError raises

fc0dbd3

fix(output): cast seq column to object dtype before list assignment

e2078c7

fix(filters): cast paramagnetic column to bool before negation

8726e8a

tsenoner wants to merge 19 commits into main from develop
