
feat: add LACS-inspired chemical shift re-referencing #9

Draft
tsenoner wants to merge 11 commits into develop from feat/chemical-shift-rereferencing


@tsenoner tsenoner commented Mar 6, 2026

Summary

  • Adds a new trizod/referencing/ module implementing LACS-inspired chemical shift re-referencing
  • Detects and corrects systematic spectrometer referencing errors (~25% of BMRB entries have these)
  • LACS method: exploits CA/CB sharing the same 13C reference to detect 13C offsets; applies per-atom-type analysis for 15N and 1H
  • All offsets validated via the Akaike information criterion (AIC) before acceptance
  • Runs as a pre-processing step before the existing per-atom offset correction in scoring.py
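
The idea summarized above can be illustrated with a minimal sketch. It is not the code in trizod/referencing/: the helper names `estimate_13c_offset` and `aic_prefers_offset`, the agreement tolerance, the use of a pooled mean, and the Gaussian-residual form of the AIC are all assumptions made for illustration.

```python
import math
from statistics import mean


def estimate_13c_offset(delta_ca, delta_cb, tolerance=0.5):
    """Return a shared 13C offset if CA and CB deltas agree, else 0.0.

    delta_ca / delta_cb are (observed - predicted) shifts in ppm.
    The tolerance value is an illustrative assumption.
    """
    if not delta_ca or not delta_cb:
        return 0.0
    # CA and CB share one 13C reference, so a real referencing error
    # should shift both atom types by roughly the same amount.
    if abs(mean(delta_ca) - mean(delta_cb)) > tolerance:
        return 0.0
    return mean(list(delta_ca) + list(delta_cb))


def aic_prefers_offset(deltas, offset):
    """Return True if a constant-offset model has lower AIC than no offset.

    Assumes Gaussian residuals, so AIC = n * ln(RSS / n) + 2k,
    where k is the number of fitted parameters (0 or 1 here).
    """
    n = len(deltas)
    if n == 0:
        return False
    eps = 1e-12  # avoids log(0) when residuals vanish
    rss_none = sum(d * d for d in deltas)
    rss_offset = sum((d - offset) ** 2 for d in deltas)
    aic_none = n * math.log(max(rss_none / n, eps))
    aic_offset = n * math.log(max(rss_offset / n, eps)) + 2
    return aic_offset < aic_none
```

With this formulation, the penalty term means a tiny offset on a well-referenced entry does not pay for its extra parameter, which matches the expectation in the test plan that small offsets are rejected.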

Changes

  • trizod/referencing/__init__.py + referencing.py: New module with estimate_reference_offsets(), validate_offsets(), apply_rereferencing()
  • trizod/scoring/scoring.py: Integrate re-referencing into get_offset_corrected_wscs(), pass through new parameters
  • trizod/trizod.py: Add CLI args (--rereferencing, --rereferencing-method, --max-reref-offset), add to filter_defaults, wire through compute_scores(), compute_scores_row(), and main(), store reref_* offsets in output
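
For orientation, the three CLI flags could be declared roughly as below. This is a sketch under assumptions, not the parser in trizod/trizod.py: the program name, help texts, the choices `lacs` and `global`, and the defaults shown here are hypothetical (the PR states that the actual defaults come from filter_defaults). `argparse.BooleanOptionalAction` is what provides the paired `--no-rereferencing` form and requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Build a stand-alone parser with the three re-referencing flags."""
    parser = argparse.ArgumentParser(prog="trizod-sketch")  # hypothetical name
    parser.add_argument(
        "--rereferencing",
        action=argparse.BooleanOptionalAction,  # also creates --no-rereferencing
        default=True,  # assumed default for this sketch
        help="Apply chemical shift re-referencing before offset correction.",
    )
    parser.add_argument(
        "--rereferencing-method",
        choices=["lacs", "global"],  # assumed choice names
        default="lacs",
        help="Method used to estimate reference offsets.",
    )
    parser.add_argument(
        "--max-reref-offset",
        type=float,
        default=None,  # assumed: no limit unless given
        help="Largest re-referencing offset (ppm) that will be applied.",
    )
    return parser


args = build_parser().parse_args(["--no-rereferencing", "--max-reref-offset", "5.0"])
```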

Design

Pipeline flow:
  observed shifts → POTENCI prediction → delta (obs - pred)
                                            ↓
                                    [NEW] re-referencing (LACS/global)
                                            ↓
                                    existing offset correction
                                            ↓
                                    Z-score computation
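
The ordering in the flow above can be sketched in a few lines. Only the stage order is taken from the PR; the data layout (dicts keyed by `(residue_index, atom_type)`), the signature of `apply_rereferencing` used here, and the `run_pipeline` helper are assumptions for illustration, and the later stages are left out.

```python
def compute_deltas(observed, predicted):
    """delta = observed - predicted, for keys present in both inputs."""
    return {key: observed[key] - predicted[key] for key in observed if key in predicted}


def apply_rereferencing(deltas, offsets):
    """Subtract a per-atom-type reference offset from every delta.

    Keys of `deltas` are (residue_index, atom_type); `offsets` maps
    atom_type to an offset in ppm. This signature is hypothetical.
    """
    return {key: d - offsets.get(key[1], 0.0) for key, d in deltas.items()}


def run_pipeline(observed, predicted, reref_offsets, rereferencing=True):
    """Run the first two stages; later stages would consume the result."""
    deltas = compute_deltas(observed, predicted)
    if rereferencing:
        deltas = apply_rereferencing(deltas, reref_offsets)
    # The existing offset correction and the Z-score computation would follow here.
    return deltas
```

Keeping re-referencing as a separate, switchable stage is what allows the disabled path to leave the deltas untouched before they reach the existing offset correction.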

Test plan

  • Run on known mis-referenced entries — should detect and correct large 13C/15N offsets
  • Run on correctly-referenced entries — should produce near-zero corrections (AIC rejects small offsets)
  • Compare Z-score distributions before/after re-referencing
  • Verify that --no-rereferencing reproduces the pre-change output exactly
  • Verify cache compatibility (old caches load without error, new caches include reref data)
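
The "identical output" check in the plan above could be automated with a small comparison helper. This is a sketch only: `scores_identical` is a hypothetical name, and it assumes the scores have already been loaded into plain dicts mapping an entry identifier to a float, with NaN marking missing values.

```python
import math


def scores_identical(before, after, abs_tol=0.0):
    """Return True if both score tables have the same keys and values.

    NaN is treated as equal to NaN so that missing scores do not
    register as differences. abs_tol=0.0 demands exact equality.
    """
    if before.keys() != after.keys():
        return False
    for key, old in before.items():
        new = after[key]
        if math.isnan(old) and math.isnan(new):
            continue
        if not math.isclose(old, new, rel_tol=0.0, abs_tol=abs_tol):
            return False
    return True
```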

🤖 Generated with Claude Code

tsenoner and others added 11 commits November 28, 2025 14:48
- Remove duplicate entries (*.txt was listed twice)
- Organize into logical sections with comments
- Add exception for trizod/potenci/data/ directory
- Use proper gitignore patterns (directories with trailing /)
- POTENCI data files are now included as required dependencies
- Add 6 CSV tables extracted from inline strings
- Add comprehensive README documenting data format
- Data sourced from Nielsen & Mulder (2018) POTENCI algorithm

Files added:
- tablecent.csv: Central residue chemical shifts
- tablenei.csv: Neighbor residue corrections
- tabletermcorrs.csv: Terminal corrections
- tabletempk.csv: Temperature coefficients
- tablecombdevs.csv: Combinatorial deviations
- tablephshifts.csv: pH-dependent shifts
- README.md: Comprehensive documentation
- Add comprehensive type hints (ShiftDict, CorrectionDict, etc.)
- Replace unsafe eval() calls with safe float conversion
- Implement CSV-based data loading with caching
- Add PhysicalConstants dataclass
- Remove all backward compatibility wrappers
- Update module docstring with academic references

Security: Eliminates eval() vulnerability
Performance: Cached data loading with lru_cache
Maintainability: Type-safe, well-documented API
- Add comprehensive module docstring
- Fix typo: logging.waring() to logging.warning()
- Update outdated comments (python2.x to python3.10+)
- Replace ##-style comments with proper documentation
- Update to use new constants API (PHYSICAL_CONSTANTS, load_* functions)
- Improve CLI documentation in main()
- Export modern API: load_central_shifts, PHYSICAL_CONSTANTS, etc.
- Remove legacy exports: R, a, b, cutoff, e, ncycles
- Add module docstring
- Update __all__ for clean public API
- Remove setup.py (replaced by pyproject.toml)
- Add uv.lock for reproducible dependencies
- Configure hatchling to include potenci/data files
- Update build system to use modern Python packaging standards

Migration: setup.py → pyproject.toml + uv
Build backend: hatchling
Lock file: uv.lock for reproducibility
- Run ruff check --fix --unsafe-fixes on all modules
- Apply ruff format for consistent code style
- Fix import ordering, comparison operators, nested if statements
- Remove unused variables and imports
- Add explicit exception handling (no bare excepts)
- Rename functions to follow snake_case convention:
  - get_pH → get_ph
  - convChi2CDF → conv_chi2_cdf
  - get_offset_corrected_wSCS → get_offset_corrected_wscs
- Rename exceptions to follow Error suffix convention:
  - Found → FoundError
  - OffsetTooLargeException → OffsetTooLargeError
  - OffsetCausedFilterException → OffsetCausedFilterError
  - FilterException → FilterError
- Fix lambda loop variable binding issue
- Add exception chaining with "from e"
…equires-python

- Replace deprecated np.float with np.float64 (removed in NumPy 1.24+)
- Replace eval() with float() in read_csv_pkaoutput (security fix)
- Fix argument count mismatch in getpredshifts_arr -> getphcorrs_arr call
- Remove dead code: unused log_fun(), no-op str(i+1) statements
- Bump requires-python from >=3.8 to >=3.9 (BooleanOptionalAction, dict |)
- Exclude test/ from ruff config (pre-existing issues, not part of package)
- Add implementation plan document for BMRB expert suggestions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace blanket *.csv/*.txt/*.zip patterns with specific directory
rules for clarity and transparency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement a pre-scoring re-referencing step that detects and corrects
systematic spectrometer referencing errors (~25% of BMRB entries).

The LACS method exploits the fact that CA and CB share the same 13C
reference: if both show a consistent mean offset, it indicates a 13C
referencing error. Similar per-atom-type analysis is applied for 15N
and 1H nuclei. All offsets are validated via AIC criterion.

- New module: trizod/referencing/ with estimate_reference_offsets(),
  validate_offsets(), and apply_rereferencing()
- Integrate into scoring.get_offset_corrected_wscs() before existing
  per-atom offset correction
- Add CLI args: --rereferencing, --rereferencing-method, --max-reref-offset
- Add to filter_defaults: enabled by default for tolerant/moderate/strict
- Store per-atom re-referencing offsets in output (reref_* columns)
- Cache re-referencing offsets alongside existing wSCS cache

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tsenoner tsenoner marked this pull request as draft March 10, 2026 13:42