feat: add LACS-inspired chemical shift re-referencing by tsenoner · Pull Request #9 · MarkusHaak/trizod

tsenoner · 2026-03-06T09:55:24Z

Summary

- Adds a new trizod/referencing/ module implementing LACS-inspired chemical shift re-referencing
- Detects and corrects systematic spectrometer referencing errors (present in ~25% of BMRB entries)
- LACS method: CA and CB share the same 13C reference, so a consistent offset in both indicates a 13C referencing error; 15N and 1H are analysed per atom type
- All offsets are validated against the Akaike information criterion (AIC) before acceptance
- Runs as a pre-processing step before the existing per-atom offset correction in scoring.py
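
To illustrate the idea in the summary, here is a minimal sketch of a shared 13C offset estimate with an AIC check. It is not the code in trizod/referencing/: the function names `gaussian_aic` and `estimate_13c_offset`, the same-sign test, and the pooled-mean estimator are assumptions chosen for illustration, and the real module may fit the offset differently.

```python
import math
from statistics import fmean


def gaussian_aic(residuals, n_params):
    """AIC of a Gaussian error model, up to a constant shared by all models."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    # Guard against log(0) when the residuals are all exactly zero.
    return n * math.log(max(rss / n, 1e-12)) + 2 * n_params


def estimate_13c_offset(ca_deltas, cb_deltas):
    """Return a shared 13C offset (ppm), or 0.0 if the evidence is too weak.

    Inputs are delta values (observed - predicted) for CA and CB atoms.
    """
    ca_mean, cb_mean = fmean(ca_deltas), fmean(cb_deltas)
    # A referencing error moves CA and CB in the same direction.
    if ca_mean * cb_mean <= 0:
        return 0.0
    pooled = list(ca_deltas) + list(cb_deltas)
    offset = fmean(pooled)
    # Model 0: no offset (variance only). Model 1: one extra offset parameter.
    aic_zero = gaussian_aic(pooled, n_params=1)
    aic_shift = gaussian_aic([d - offset for d in pooled], n_params=2)
    return offset if aic_shift < aic_zero else 0.0
```

With CA and CB deltas that all sit near +2.5 ppm, the offset model wins clearly and the function returns about 2.5. With deltas scattered around zero, the extra parameter does not pay for itself and the function returns 0.0.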

Changes

- trizod/referencing/__init__.py + referencing.py: New module with estimate_reference_offsets(), validate_offsets(), apply_rereferencing()
- trizod/scoring/scoring.py: Integrate re-referencing into get_offset_corrected_wscs(), pass through new parameters
- trizod/trizod.py: Add CLI args (--rereferencing, --rereferencing-method, --max-reref-offset), add to filter_defaults, wire through compute_scores() → compute_scores_row() → main(), store reref_* offsets in output
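
A rough sketch of how the three CLI arguments could be declared with argparse. The flag names come from this PR. The `lacs`/`global` choices are inferred from the pipeline diagram, and the default values, help texts, and the `build_parser` helper are assumptions, not the actual contents of trizod/trizod.py.

```python
import argparse


def build_parser():
    """Build a parser exposing only the re-referencing options (illustrative)."""
    parser = argparse.ArgumentParser(prog="trizod-sketch")
    # BooleanOptionalAction (Python 3.9+) also generates --no-rereferencing.
    parser.add_argument(
        "--rereferencing",
        action=argparse.BooleanOptionalAction,
        default=True,  # assumed default
        help="Re-reference chemical shifts before offset correction.",
    )
    parser.add_argument(
        "--rereferencing-method",
        choices=("lacs", "global"),  # assumed choices
        default="lacs",              # assumed default
        help="Strategy used to estimate reference offsets.",
    )
    parser.add_argument(
        "--max-reref-offset",
        type=float,
        default=5.0,  # assumed default, in ppm
        help="Largest re-referencing offset that will be applied.",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Declaring the switch with `BooleanOptionalAction` is what makes the `--no-rereferencing` form used in the test plan available without a second argument definition.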

Design

Pipeline flow:
  observed shifts → POTENCI prediction → delta (obs - pred)
                                            ↓
                                    [NEW] re-referencing (LACS/global)
                                            ↓
                                    existing offset correction
                                            ↓
                                    Z-score computation
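
The ordering in the diagram can be sketched as a toy function. Everything here is a stand-in: `existing_offset_correction` is a placeholder that returns its input unchanged, and dividing by a single `expected_sd` is a simplification for illustration, not how TriZOD computes its Z-scores.

```python
def existing_offset_correction(deltas):
    """Placeholder for the per-atom offset correction that already exists."""
    return list(deltas)


def run_pipeline(observed, predicted, reref_offset, expected_sd):
    """Apply the steps in the order shown in the pipeline diagram."""
    # 1. delta = observed - predicted (predictions would come from POTENCI)
    deltas = [obs - pred for obs, pred in zip(observed, predicted)]
    # 2. [NEW] re-referencing: remove the estimated reference offset
    deltas = [d - reref_offset for d in deltas]
    # 3. existing offset correction (placeholder here)
    deltas = existing_offset_correction(deltas)
    # 4. toy normalisation standing in for the Z-score computation
    return [d / expected_sd for d in deltas]


if __name__ == "__main__":
    print(run_pipeline([58.5, 60.5], [56.0, 58.0], reref_offset=2.0, expected_sd=0.5))
```

The point of the ordering is that the new step works on the same delta values as the existing correction, so a large shared referencing error is removed before the per-atom correction sees the data.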

Test plan

- Run on known mis-referenced entries: should detect and correct large 13C/15N offsets
- Run on correctly-referenced entries: should produce near-zero corrections (AIC rejects small offsets)
- Compare Z-score distributions before/after re-referencing
- Verify --no-rereferencing produces identical output to before
- Verify cache compatibility (old caches load without error, new caches include reref data)

🤖 Generated with Claude Code

- Remove duplicate entries (*.txt was listed twice)
- Organize into logical sections with comments
- Add exception for trizod/potenci/data/ directory
- Use proper gitignore patterns (directories with trailing /)
- POTENCI data files are now included as required dependencies

- Add 6 CSV tables extracted from inline strings
- Add comprehensive README documenting data format
- Data sourced from Nielsen & Mulder (2018) POTENCI algorithm

Files added:
- tablecent.csv: Central residue chemical shifts
- tablenei.csv: Neighbor residue corrections
- tabletermcorrs.csv: Terminal corrections
- tabletempk.csv: Temperature coefficients
- tablecombdevs.csv: Combinatorial deviations
- tablephshifts.csv: pH-dependent shifts
- README.md: Comprehensive documentation

- Add comprehensive type hints (ShiftDict, CorrectionDict, etc.)
- Replace unsafe eval() calls with safe float conversion
- Implement CSV-based data loading with caching
- Add PhysicalConstants dataclass
- Remove all backward compatibility wrappers
- Update module docstring with academic references

Security: Eliminates eval() vulnerability
Performance: Cached data loading with lru_cache
Maintainability: Type-safe, well-documented API
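
For readers unfamiliar with the pattern this commit describes, the following is a small sketch of replacing eval() with float() and caching CSV loads via lru_cache. The helper names `parse_shift` and `load_table`, the treatment of empty cells, and the assumed layout (first column is a row key, remaining columns are numeric) are hypothetical and may not match the actual POTENCI tables or loader functions.

```python
import csv
from functools import lru_cache
from pathlib import Path


def parse_shift(text):
    """Convert a table cell to float; empty cells become None.

    Unlike eval(), float() never executes the text as code: anything that is
    not a number raises ValueError.
    """
    text = text.strip()
    return float(text) if text else None


@lru_cache(maxsize=None)
def load_table(path):
    """Read a CSV table once; later calls with the same path hit the cache.

    The path must be hashable (a str works). The returned dict is shared by
    all callers, so it should be treated as read-only.
    """
    with Path(path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        return {
            row[0]: dict(zip(header[1:], map(parse_shift, row[1:])))
            for row in reader
        }
```

One design consequence of caching is that callers receive the same dictionary object on every call, so mutating it would affect all later lookups.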

- Add comprehensive module docstring
- Fix typo: logging.waring() to logging.warning()
- Update outdated comments (python2.x to python3.10+)
- Replace ##-style comments with proper documentation
- Update to use new constants API (PHYSICAL_CONSTANTS, load_* functions)
- Improve CLI documentation in main()

- Export modern API: load_central_shifts, PHYSICAL_CONSTANTS, etc.
- Remove legacy exports: R, a, b, cutoff, e, ncycles
- Add module docstring
- Update __all__ for clean public API

- Remove setup.py (replaced by pyproject.toml)
- Add uv.lock for reproducible dependencies
- Configure hatchling to include potenci/data files
- Update build system to use modern Python packaging standards

Migration: setup.py → pyproject.toml + uv
Build backend: hatchling
Lock file: uv.lock for reproducibility

- Run ruff check --fix --unsafe-fixes on all modules
- Apply ruff format for consistent code style
- Fix import ordering, comparison operators, nested if statements
- Remove unused variables and imports
- Add explicit exception handling (no bare excepts)
- Rename functions to follow snake_case convention:
  - get_pH → get_ph
  - convChi2CDF → conv_chi2_cdf
  - get_offset_corrected_wSCS → get_offset_corrected_wscs
- Rename exceptions to follow Error suffix convention:
  - Found → FoundError
  - OffsetTooLargeException → OffsetTooLargeError
  - OffsetCausedFilterException → OffsetCausedFilterError
  - FilterException → FilterError
- Fix lambda loop variable binding issue
- Add exception chaining with "from e"
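
Two of the fixes listed in this commit are easy to get wrong, so here is a generic illustration of each. The functions are made up for the example and do not come from the TriZOD code base, and `FilterError` is defined locally as a stand-in for the renamed exception.

```python
def make_scalers_buggy(factors):
    """Every lambda looks up f when called, so all of them see the last value."""
    return [lambda x: x * f for f in factors]


def make_scalers_fixed(factors):
    """A default argument captures the value of f at definition time."""
    return [lambda x, f=f: x * f for f in factors]


class FilterError(Exception):
    """Local stand-in for the renamed exception class."""


def parse_ph(text):
    """Parse a pH value, chaining the original error for easier debugging."""
    try:
        return float(text)
    except ValueError as e:
        # "from e" stores the ValueError as __cause__ on the new exception.
        raise FilterError(f"invalid pH value: {text!r}") from e


if __name__ == "__main__":
    print([scale(1) for scale in make_scalers_buggy([1, 2, 3])])
    print([scale(1) for scale in make_scalers_fixed([1, 2, 3])])
```

With chaining, the traceback shows both the `FilterError` and the underlying `ValueError`, which makes it clear which input triggered the filter.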

…equires-python

- Replace deprecated np.float with np.float64 (removed in NumPy 1.24+)
- Replace eval() with float() in read_csv_pkaoutput (security fix)
- Fix argument count mismatch in getpredshifts_arr -> getphcorrs_arr call
- Remove dead code: unused log_fun(), no-op str(i+1) statements
- Bump requires-python from >=3.8 to >=3.9 (BooleanOptionalAction, dict |)
- Exclude test/ from ruff config (pre-existing issues, not part of package)
- Add implementation plan document for BMRB expert suggestions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace blanket *.csv/*.txt/*.zip patterns with specific directory rules for clarity and transparency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement a pre-scoring re-referencing step that detects and corrects systematic spectrometer referencing errors (~25% of BMRB entries). The LACS method exploits the fact that CA and CB share the same 13C reference: if both show a consistent mean offset, it indicates a 13C referencing error. Similar per-atom-type analysis is applied for 15N and 1H nuclei. All offsets are validated via AIC criterion.

- New module: trizod/referencing/ with estimate_reference_offsets(), validate_offsets(), and apply_rereferencing()
- Integrate into scoring.get_offset_corrected_wscs() before existing per-atom offset correction
- Add CLI args: --rereferencing, --rereferencing-method, --max-reref-offset
- Add to filter_defaults: enabled by default for tolerant/moderate/strict
- Store per-atom re-referencing offsets in output (reref_* columns)
- Cache re-referencing offsets alongside existing wSCS cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tsenoner and others added 11 commits November 28, 2025 14:48

refactor(potenci): clean up module exports

ee3d377

- Export modern API: load_central_shifts, PHYSICAL_CONSTANTS, etc.
- Remove legacy exports: R, a, b, cutoff, e, ncycles
- Add module docstring
- Update __all__ for clean public API

chore: add docs/ to .gitignore and remove from tracking

e18e1aa

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: restructure .gitignore with explicit patterns

714c4be

Replace blanket *.csv/*.txt/*.zip patterns with specific directory rules for clarity and transparency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tsenoner force-pushed the develop branch from 714c4be to 9074210 on March 10, 2026 09:55

tsenoner marked this pull request as draft on March 10, 2026 13:42

tsenoner wants to merge 11 commits into develop from feat/chemical-shift-rereferencing
