
Modernize codebase: project setup, POTENCI refactor, docs, and tests #10

Open
tsenoner wants to merge 19 commits into main from develop

Conversation

@tsenoner
Collaborator

Summary

Pure refactoring and modernization of the entire codebase — no pipeline behavior changes. All regression tests pass against existing baselines.

What's included (12 commits):

  • Build: Migrate from setup.py to pyproject.toml with uv
  • Style: Apply ruff formatting and fix all lint issues
  • Refactor (POTENCI): Extract data tables to CSV files, remove eval() and dead code, clean up code structure, fix bugs, rename variables to descriptive names
  • Refactor: Replace os.path with pathlib, rename cryptic variables across all modules
  • Tests: Add pytest infrastructure, smoke tests, POTENCI accuracy tests, 300-entry subset regression, full-dataset regression (slow-marked), prediction cache
  • Docs: Restructure documentation — add pipeline.md, potenci.md, filtering.md; trim README; update authors/emails/repo URL
  • Fix (bmrb): Add descriptive messages to bare ValueError raises

Stats: 41 files changed, +6892 / −3461 lines

Test plan

  • uv run pytest tests/ -v — all 9 default tests pass
  • uv run pytest tests/ -v --run-slow — all 13 tests pass (including full-dataset regression)
  • uv run ruff check trizod/ tests/ — no lint issues
  • uv run ruff format --check trizod/ tests/ — formatting clean
  • Regression baselines unchanged

🤖 Generated with Claude Code

tsenoner and others added 19 commits March 9, 2026 17:20
Set up pytest with smoke tests covering CLI, POTENCI predictions, and
single-entry pipeline computation. Fix .gitignore to use explicit patterns
(the old `test*` glob would have excluded the new `tests/` directory).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace deprecated setup.py with modern pyproject.toml using hatchling
as the build backend (PEP 621). Move pytest config into pyproject.toml
and remove standalone pytest.ini.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract hardcoded data tables (centshifts, neicorrs, termcorrs,
  tempcoeffs, combdevs) to CSV files in trizod/potenci/data/
- Replace all eval() calls with float() and safe parsing
- Remove dead code: getpredshifts_arr(), getphcorrs_arr(), main(),
  writeOutput()
- Add 300-entry regression test suite with strategic sampling across
  filter levels, sequence lengths, and edge cases
- Remove old ad-hoc test scripts (test/); add TODO for CheZOD equality
  test (needs reference data)
- Fix .gitignore: /data/ for top-level only, add tests/bmrb_subset/

Verified: all predictions are bit-for-bit identical to original code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
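
A minimal sketch of the eval() to float() change described in this commit. The helper name `parse_table_row` and the row layout (a label followed by numeric fields) are hypothetical and not taken from the POTENCI code; the point is only that float() rejects anything that is not a number, whereas eval() would execute it.

```python
def parse_table_row(line: str) -> tuple[str, list[float]]:
    """Parse a whitespace-separated row: a label followed by numeric fields.

    Hypothetical helper for illustration; the real POTENCI tables may be
    laid out differently.
    """
    label, *fields = line.split()
    values = []
    for field in fields:
        try:
            # float() only accepts numeric literals; it never runs code.
            values.append(float(field))
        except ValueError as exc:
            raise ValueError(
                f"non-numeric field {field!r} in row {label!r}"
            ) from exc
    return label, values


label, values = parse_table_row("ALA 52.5 -0.12 1e-3")
```

With this shape, a field such as `__import__('os')` raises a ValueError instead of being evaluated.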
Add ruff config (E, W, F, I, N, UP, B, C4, SIM rules with scientific
code exceptions). Format all modules and fix 32 lint issues: bare
excepts, type() comparisons, collapsible ifs, %-formatting → f-strings,
unused re-exports (__all__), lambda closure binding, raise-from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
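
Two of the lint categories named in this commit, lambda closure binding and raise-from, can be illustrated with a short sketch. The names below (`scale`, `to_temperature`) are made up for the example and do not come from the trizod code.

```python
# Late binding: each lambda looks up `scale` when called, so all three
# see the final loop value.
late = [lambda x: x * scale for scale in (1, 2, 3)]

# Binding the loop variable as a default argument freezes its value
# at the time each lambda is created.
bound = [lambda x, scale=scale: x * scale for scale in (1, 2, 3)]


def to_temperature(text: str) -> float:
    """Convert text to a float, chaining the original error with `from`."""
    try:
        return float(text)
    except ValueError as exc:
        # `from exc` keeps the original traceback attached as __cause__.
        raise ValueError(f"invalid temperature: {text!r}") from exc


late_results = [f(10) for f in late]
bound_results = [f(10) for f in bound]
```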
- Switch to uv as package manager (pyproject.toml deps, uv.lock)
- Remove legacy POTENCI code (potenci1_3.py)
- Rename all POTENCI functions/variables/constants to descriptive snake_case
- Extract phshifts data to CSV, cache module-level data, vectorize rolling RMS
- Add dedicated test_potenci.py, make all tests ruff-conformant
- Add CLAUDE.md, POTENCI README, and architecture section to project README
- Add docs/ to .gitignore

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move detailed filter tables and POTENCI docs out of README into docs/.
Add pipeline walkthrough. Reorganize .gitignore and move internal
planning notes to gitignored docs/_planning/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Content-based cache keyed by hash(seq, T, pH, ion) avoids recomputing
POTENCI predictions across pipeline runs. Includes precompute script
for batch caching and regression test against baseline outputs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
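
A content-based key of the kind described in this commit could look roughly like the sketch below. The PR does not show the actual hashing scheme, so the use of SHA-256, the field formatting, and the function name `prediction_cache_key` are all assumptions.

```python
import hashlib


def prediction_cache_key(seq: str, temperature: float, ph: float, ion: float) -> str:
    """Build a stable cache key from the inputs that determine a prediction.

    Assumed scheme: format the values canonically, then hash the string.
    """
    # Fixed-precision formatting so that 298 and 298.0 give the same key.
    payload = f"{seq}|{temperature:.3f}|{ph:.3f}|{ion:.4f}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = prediction_cache_key("MKTAYIAK", 298.0, 7.0, 0.1)
key_b = prediction_cache_key("MKTAYIAK", 298, 7.0, 0.1)
key_c = prediction_cache_key("MKTAYIAK", 298.0, 6.5, 0.1)
```

A cryptographic digest is used here rather than Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default and would not be stable across pipeline runs.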
- Remove dead code (CSV I/O functions, pka_csv_path/identifier params)
- Fix 3-residue pentamer crash (IndexError on short sequences)
- Suppress harmless overflow/OptimizeWarning in pKa fitting
- Extract BB_ATOMS constant, _SKIP_ATOM_PAIRS, _build_pentamer helper
- Rename non-descriptive variables to descriptive names
- Reorder file: constants → data loading → helpers → pKa → pH → API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add both authors (Markus Haak, Tobias Senoner) with @tum.de emails.
Update GitHub URL to MarkusHaak/trizod. Update POTENCI docs to reflect
current API (removed pka_csv_path, renamed private functions, caching).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Migrate all path handling from os.path to pathlib.Path across the
codebase. Rename non-descriptive variables in scoring.py (dAIC →
delta_AIC, ashwi_ → abs_weighted_diffs, ol_ → outlier_mask, etc.),
trizod.py (m → match, kw → keyword, fp → cache_path, scores_ →
score_array, method_whitelist_ → whitelist_lower, etc.), and bmrb.py
(pplist(l) → pplist(items), comprehension vars a/s → descriptive names).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
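
The os.path to pathlib migration follows a common pattern, sketched below. The function names and the `.csv` cache layout are hypothetical; only the before/after style is meant to reflect the commit.

```python
import os.path
from pathlib import Path


def cache_path_old(cache_dir: str, entry_id: str) -> str:
    """Old style: string paths joined with os.path."""
    return os.path.join(cache_dir, entry_id + ".csv")


def cache_path_new(cache_dir: Path, entry_id: str) -> Path:
    """New style: Path objects joined with the / operator."""
    return cache_dir / f"{entry_id}.csv"


old = cache_path_old("cache", "bmr52889")
new = cache_path_new(Path("cache"), "bmr52889")
```

Both functions produce the same location; the Path version additionally exposes attributes such as `new.stem` and `new.suffix` without string slicing.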
Add @pytest.mark.slow to full-dataset regression tests (4 tests that
each run the entire pipeline on 17k entries). Default pytest run now
takes ~47s instead of ~6min. Run all tests with: pytest -m ""

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…egex

Remove 9 chemicals from strict chemical-denaturants that are not
denaturants: reducing agents (DTT, BME, mercaptoethanol), NMR reference
standards (DSS), and common buffers (acetic acid, CD3COOH, deuterated
sodium acetate). Fix exp-method-blacklist by removing "state" keyword
which incorrectly matched "solution-state" experiment subtypes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
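
The false positive described above can be reproduced with a small sketch. The keyword lists here are hypothetical stand-ins (the actual exp-method-blacklist is not shown in this PR); they only demonstrate why a bare "state" keyword also matches "solution-state".

```python
import re


def matches_any(keywords: list[str], text: str) -> bool:
    """Return True if any keyword occurs in text (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return pattern.search(text) is not None


# Hypothetical lists for illustration only.
old_keywords = ["solid", "state"]
new_keywords = ["solid"]

solution_hit_old = matches_any(old_keywords, "solution-state NMR")
solution_hit_new = matches_any(new_keywords, "solution-state NMR")
solid_hit_new = matches_any(new_keywords, "solid-state NMR")
```

With the "state" keyword removed, the solution-state subtype is no longer caught, while a solid-state entry in this toy example still is.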
…ails

Add detailed architecture notes (key functions per module), pipeline
flow (6 stages), caching system documentation, offset correction
clarification (not re-referencing), and updated data paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pandas str-typed columns reject list assignment. Cast to object first
to fix CSV output crash on full dataset runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modern NMR experiments commonly measure only 4 backbone shift types
(HN, N, CO, CA) without HA. Requiring 5 types unnecessarily excluded
many high-quality modern datasets. Code change in previous commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Temperatures 1-14 K are physically impossible for liquid-state NMR
(sample would be frozen). Values in this range are certainly Celsius.
Only 3 conditions across 2 entries (bmr52889, bmr53214) are affected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
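
The rule stated in this commit can be sketched as follows. The function name and the inclusive bounds are assumptions; the pipeline may treat other temperature ranges separately, which this sketch does not attempt to model.

```python
def normalize_temperature_kelvin(value: float) -> float:
    """Reinterpret implausibly low Kelvin values as Celsius.

    Assumption: values in [1, 14] were entered in Celsius and are
    converted; everything else is returned unchanged.
    """
    if 1.0 <= value <= 14.0:
        return value + 273.15
    return value


converted = normalize_temperature_kelvin(10.0)
unchanged = normalize_temperature_kelvin(298.0)
```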
Parse _Assembly.Paramagnetic and _Entity.Paramagnetic tags from BMRB
NMR-STAR files. Exclude entries flagged as paramagnetic at tolerant+
tiers. Paramagnetic samples (iron proteins, lanthanide tags, etc.)
cause 0.5-5+ ppm shift perturbations that make Z-score computation
meaningless since POTENCI assumes diamagnetic conditions.

Checks both assembly and entity level to catch 15 entries with
contradictory Assembly=no/Entity=yes flags. ~183 entries affected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
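
Checking both levels amounts to treating an entry as paramagnetic if any of the flags says so. A minimal sketch, assuming the tag values have already been read from the NMR-STAR file as "yes"/"no" strings (or None when absent); the function name and input shape are hypothetical.

```python
from typing import Iterable, Optional


def is_paramagnetic(
    assembly_flag: Optional[str], entity_flags: Iterable[Optional[str]]
) -> bool:
    """Return True if the assembly or any entity is flagged paramagnetic.

    Missing flags (None) are treated as not paramagnetic, which is an
    assumption of this sketch.
    """
    flags = [assembly_flag, *entity_flags]
    return any(
        flag is not None and flag.strip().lower() == "yes" for flag in flags
    )


contradictory = is_paramagnetic("no", ["yes"])  # Assembly=no, Entity=yes
diamagnetic = is_paramagnetic("no", ["no", "no"])
```

Taking the logical OR over both levels is what lets the contradictory Assembly=no/Entity=yes entries be caught.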
The parallel_apply returns mixed types — cast to bool explicitly
to avoid TypeError on pandas invert operation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>