Skip to content

Commit 9074210

Browse files
tsenonerclaude
andcommitted
docs: update authors, emails, repo URL, and POTENCI docs
Add both authors (Markus Haak, Tobias Senoner) with @tum.de emails. Update GitHub URL to MarkusHaak/trizod. Update POTENCI docs to reflect current API (removed pka_csv_path, renamed private functions, caching). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 6727269 commit 9074210

4 files changed

Lines changed: 195 additions & 28 deletions

File tree

docs/potenci.md

Lines changed: 188 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,25 @@
11
# POTENCI — Random Coil Chemical Shift Prediction
22

3-
This module predicts random coil NMR chemical shifts for proteins using the
4-
POTENCI algorithm. It is used by the TriZOD pipeline to compute expected shifts,
5-
which are then compared to experimental shifts to derive disorder Z-scores.
3+
POTENCI (**P**rediction **O**f **T**otal **E**ffects on **N**MR **C**hemical shifts
4+
from **I**ntrinsically disordered proteins) predicts the **random coil chemical shifts**
5+
of backbone atoms in a protein sequence — i.e., the shifts you'd expect if the protein
6+
had no structure, just a flexible disordered chain.
7+
8+
In the TriZOD pipeline, these predicted shifts are compared against experimentally
9+
measured shifts. Large deviations indicate structure; small deviations indicate disorder.
610

711
## Origin
812

13+
Based on the POTENCI method published in:
14+
15+
> Nielsen JT, Mulder FAA. *POTENCI: prediction of temperature, neighbor and pH-corrected
16+
> chemical shifts for intrinsically disordered proteins.* J Biomol NMR. 2018;70:141-165.
17+
> [doi:10.1007/s10858-018-0175-y](https://doi.org/10.1007/s10858-018-0175-y)
18+
919
Adapted from [protein-nmr/POTENCI](https://github.com/protein-nmr/POTENCI)
1020
(commit `17dd2e6`, file `pytenci1_3.py`), originally by Frans Mulder
11-
(fmulder@chem.au.dk).
21+
(fmulder@chem.au.dk). Adapted by Markus Haak (markus.haak@tum.de) and
22+
Tobias Senoner (tobias.senoner@tum.de).
1223

1324
## Differences from upstream
1425

@@ -26,13 +37,142 @@ algorithm) but improved for integration into the TriZOD pipeline:
2637
| Naming | camelCase / smooshed (`getpredshifts`) | snake_case (`get_pred_shifts`) |
2738
| Constants | Single-letter names (`e`, `a`, `b`) | Descriptive (`DIELECTRIC_WATER`, `MIN_CHARGE_DISTANCE`) |
2839

29-
## Public API
40+
## Predicted atoms
41+
42+
7 backbone atom types per residue:
43+
44+
| Atom | Description | Notes |
45+
|------|-------------|-------|
46+
| `C` | Carbonyl carbon (C') | |
47+
| `CA` | Alpha carbon | |
48+
| `CB` | Beta carbon | Absent for Glycine |
49+
| `N` | Amide nitrogen | |
50+
| `H` | Amide proton | Absent for Proline |
51+
| `HA` | Alpha proton | |
52+
| `HB` | Beta proton | |
53+
54+
Terminal residues (first and last) are excluded from predictions because they
55+
lack the full 5-residue context window.
56+
57+
## The prediction model (4 additive terms)
58+
59+
For each interior residue, the predicted shift is:
60+
61+
```
62+
shift = center_shift + neighbor_corrections + temperature_correction + pH_correction
63+
```
64+
65+
### 1. Center shifts (`centshifts.csv`, 20 amino acids)
66+
67+
Base random coil chemical shift for each amino acid x atom type. These are empirically
68+
determined reference values (e.g., Alanine CA = 52.53 ppm).
69+
70+
### 2. Neighbor corrections (`neicorrs.csv` + `termcorrs.csv`)
71+
72+
The identity of the 4 neighboring residues (i-2, i-1, i+1, i+2) perturbs the shift.
73+
Computed in `pred_pent_shift()` using a **5-residue sliding window (pentamer)**:
74+
75+
```
76+
pentamer = [i-2, i-1, center, i+1, i+2]
77+
```
78+
79+
For terminal residues, out-of-bounds positions are replaced by virtual `"n"` and `"c"`
80+
residues with their own correction coefficients (`termcorrs.csv`).
81+
82+
On top of simple neighbor corrections, there are **combination corrections**
83+
(`combdevs.csv`, 247 entries) that capture non-additive effects. Amino acids are
84+
grouped into 6 classes (G, P, aromatic, aliphatic, positive, negative) and specific
85+
center+neighbor group pairs get extra correction terms.
86+
87+
### 3. Temperature correction (`tempcoeffs.csv`)
88+
89+
Linear scaling from a reference temperature of 298K:
90+
91+
```
92+
correction = coeff / 1000 * (T - 298)
93+
```
94+
95+
Coefficients are per amino acid x atom type. Not applied to HB.
96+
97+
### 4. pH correction (the expensive part)
98+
99+
Only applied when `use_ph_corr=True` (i.e., pH != 7.0). This accounts for the
100+
protonation state of titratable residues (D, E, H, C, K, R, Y, and N/C-termini)
101+
at non-neutral pH. This is where ~92% of compute time goes.
102+
103+
#### Step 1: Predict pKa values (`calc_pkas_from_seq`)
104+
105+
The algorithm iteratively predicts effective pKa values for all titratable sites:
106+
107+
1. **Identify titratable residues** in the sequence (D, E, H, C, K, R, Y, n-term,
108+
c-term) with their reference pKa values from `constants.py`.
109+
110+
2. **Compute pairwise electrostatic interactions** using a Debye-Hückel model:
111+
- Distance between sites estimated from sequence separation:
112+
`d = 5.0 + sqrt(|i-j|) * 7.5` Angstrom
113+
- Ionic strength dampens interactions (higher salt = weaker coupling)
114+
- `_debye_huckel_W()` computes the screened Coulomb energy using the complementary
115+
error function (`erfc`)
116+
117+
3. **Iteratively fit pKa values** (5 cycles):
118+
- For each titratable site, enumerate all 2^W protonation microstates in a sliding
119+
window of W residues (W = min(5, n_sites))
120+
- Compute the Boltzmann weight of each microstate from a free energy matrix G that
121+
combines: intrinsic pKa, pH, pairwise interactions, and the current titration
122+
state of sites outside the window
123+
- Sum Boltzmann weights to get the titration fraction at each pH value (54 pH points
124+
from 2.0 to 10.0)
125+
- Fit the resulting titration curve to a Henderson-Hasselbalch equation using
126+
`scipy.curve_fit` to extract effective pKa and Hill coefficient (nH)
127+
- Use the new titration fractions as input for the next cycle
128+
129+
**Why the sliding window?** With N titratable sites, the full partition function has 2^N
130+
microstates — exponential and infeasible for large proteins. The sliding window (size 5)
131+
approximates this by enumerating microstates locally, treating distant sites as fixed at
132+
their current titration state. The iterative cycles let information propagate across the
133+
sequence.
134+
135+
#### Step 2: Compute shift corrections (`_get_ph_corrs`)
136+
137+
For each titratable residue with predicted pKa:
138+
139+
```
140+
frac = titration_fraction(pH, predicted_pKa, nH) # at actual pH
141+
frac_ref = titration_fraction(7.0, reference_pKa, nH) # at reference pH
142+
delta = (frac - frac_ref) * ph_shift_delta # from phshifts.csv
143+
```
144+
145+
This correction is applied to the titratable residue itself AND its immediate neighbors
146+
(preceding/succeeding), since protonation state changes propagate to nearby residues.
147+
148+
## Output format
149+
150+
`get_pred_shifts("AAGDTFKISELVK", 298.0, 7.0, 0.1)` returns:
30151

31152
```python
32-
get_pred_shifts(seq, temperature, pH, ion, use_ph_corr=True, pka_csv_path=None, identifier="")
153+
{
154+
(2, 'A'): {'C': 177.5, 'CA': 52.7, 'CB': 19.2, 'H': 8.21, 'HA': 4.26, 'N': 123.5, 'HB': 1.32},
155+
(3, 'G'): {'C': 173.7, 'CA': 45.3, 'H': 8.28, 'HA': 3.95, 'N': 109.7}, # no CB/HB
156+
(4, 'D'): { ... },
157+
...
158+
(12, 'V'): { ... },
159+
# (1, 'A') and (13, 'K') excluded — terminal residues
160+
}
33161
```
34162

35-
Returns `dict[(residue_num, aa_letter)] -> dict[atom_type -> shift_value]`.
163+
- **Keys**: `(residue_number, amino_acid)` — 1-based, terminals excluded
164+
- **Values**: dict of atom type to predicted shift in **ppm**
165+
- Glycine has no CB/HB; Proline has no H
166+
167+
In the TriZOD pipeline, these predicted shifts are subtracted from experimental shifts
168+
to get **secondary chemical shifts (SCS)**, which are then converted to Z-scores
169+
indicating disorder.
170+
171+
## Public API
172+
173+
```python
174+
get_pred_shifts(seq, temperature, pH, ion, use_ph_corr=True)
175+
```
36176

37177
**Parameters:**
38178

@@ -41,7 +181,34 @@ Returns `dict[(residue_num, aa_letter)] -> dict[atom_type -> shift_value]`.
41181
- `pH` — sample pH (e.g. 7.0)
42182
- `ion` — ionic strength in M (e.g. 0.1)
43183
- `use_ph_corr` — apply pH-dependent corrections (set `False` for pH=7.0)
44-
- `pka_csv_path` — path to pre-computed pKa CSV, or `False` to compute on the fly
184+
185+
## Caching
186+
187+
### Module-level caching (at import time)
188+
189+
All CSV data tables are parsed once when the module is first imported and stored as
190+
module-level constants:
191+
192+
| Constant | Source | Content |
193+
|----------|--------|---------|
194+
| `CENTER_SHIFTS` | `centshifts.csv` | Base shift per (amino acid, atom) |
195+
| `NEIGHBOR_CORRS` | `neicorrs.csv` + `termcorrs.csv` | Neighbor effect coefficients |
196+
| `TEMP_CORRS` | `tempcoeffs.csv` | Temperature coefficients |
197+
| `COMB_CORRS` | `combdevs.csv` | Combination correction terms |
198+
| `PH_SHIFTS` | `phshifts.csv` | pH shift deltas for titratable residues |
199+
| `_AA_GROUP` | computed | Amino acid to group mapping (6 groups) |
200+
| `_OUTER_MATRICES` / `_ALL_TUPLES` | computed | Pre-computed binary enumeration matrices for pKa fitting |
201+
202+
### Pipeline-level caching (disk)
203+
204+
Since POTENCI output depends only on `(seq, temperature, pH, ionic_strength)`, the
205+
pipeline caches results to disk via `--cache-dir`:
206+
207+
- **Key**: SHA-256 hash of `"{seq}|{T}|{pH}|{ion}"` (truncated to 16 chars)
208+
- **Storage**: `{cache_dir}/potenci/{hash}.json`
209+
- **Precompute**: `scripts/precompute_potenci_cache.py` can batch-compute the cache
210+
for the entire dataset upfront, with an index file (`_index.tsv`) to skip
211+
already-processed BMRB entries on re-runs
45212

46213
## Performance profile
47214

@@ -55,9 +222,7 @@ On the 300-entry test subset:
55222
- **Without pH correction** (pH=7.0): ~1ms/entry
56223

57224
Sequence length and number of titratable residues (D, E, H, C, K, R, Y) are
58-
the main cost drivers. For full-dataset runs, the pipeline's `--cache-dir`
59-
option caches the downstream weighted SCS results, but POTENCI predictions
60-
themselves could also be cached since they depend only on (seq, T, pH, ion).
225+
the main cost drivers.
61226

62227
## Module structure
63228

@@ -76,25 +241,26 @@ trizod/potenci/
76241
└── README.md
77242
```
78243

79-
## Testing
80-
81-
Dedicated tests in `tests/test_potenci.py`:
82-
83-
- **Reference value matching** — predictions compared against known values (commit `8905a85`)
84-
- **pH correction** — verifies pH-sensitive residues shift at non-neutral pH
85-
- **Terminal exclusion** — first/last residues are excluded from predictions
86-
- **Glycine constraints** — no CB/HB predictions for glycine
87-
88244
## Key internal functions
89245

90246
| Function | Purpose |
91247
| ------------------------ | -------------------------------------------------------------- |
92248
| `calc_pkas_from_seq()` | Iterative pKa prediction (the bottleneck) |
93-
| `get_ph_corrs()` | pH-dependent shift corrections using predicted pKas |
249+
| `_get_ph_corrs()` | pH-dependent shift corrections using predicted pKas |
94250
| `pred_pent_shift()` | Predict shift for a 5-residue window |
251+
| `_build_pentamer()` | Build 5-residue context window with terminal handling |
95252
| `_get_temp_corr()` | Temperature correction |
96253
| `_titration_fraction()` | Henderson-Hasselbalch titration fraction (used by `curve_fit`) |
97-
| `_debye_huckel_W()` | Electrostatic interaction energy (Debye-Hückel) |
254+
| `_debye_huckel_W()` | Electrostatic interaction energy (Debye-Huckel) |
98255
| `_w_to_logp()` | Convert interaction energy to log-probability |
99256
| `_small_matrix_limits()` | Sliding window bounds for pKa fitting |
100257
| `_small_matrix_pos()` | Position within sliding window |
258+
259+
## Testing
260+
261+
Dedicated tests in `tests/test_potenci.py`:
262+
263+
- **Reference value matching** — predictions compared against known values (commit `8905a85`)
264+
- **pH correction** — verifies pH-sensitive residues shift at non-neutral pH
265+
- **Terminal exclusion** — first/last residues are excluded from predictions
266+
- **Glycine constraints** — no CB/HB predictions for glycine

pyproject.toml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,8 @@ readme = "README.md"
1010
license = "AGPL-3.0-only"
1111
requires-python = ">=3.9"
1212
authors = [
13-
{ name = "Markus Haak", email = "m.haak@tum.de" },
13+
{ name = "Markus Haak", email = "markus.haak@tum.de" },
14+
{ name = "Tobias Senoner", email = "tobias.senoner@tum.de" },
1415
]
1516
classifiers = [
1617
"Programming Language :: Python :: 3",
@@ -30,7 +31,7 @@ dependencies = [
3031
]
3132

3233
[project.urls]
33-
Repository = "https://github.com/tsenoner/trizod"
34+
Repository = "https://github.com/MarkusHaak/trizod"
3435

3536
[project.scripts]
3637
trizod = "trizod.trizod:main"

trizod/__init__.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
11
__version__ = "0.0.1"
22

33
META_DATA = {
4-
"author": "Markus Haak",
5-
"author_email": "m.haak@tum.de",
4+
"authors": ["Markus Haak", "Tobias Senoner"],
5+
"author_emails": ["markus.haak@tum.de", "tobias.senoner@tum.de"],
66
"description": "Modern disorder scores for BMRB data",
77
"package_name": "TriZOD",
8-
"github_url": "https://git.rostlab.org/haak/TriZOD",
8+
"github_url": "https://github.com/MarkusHaak/trizod",
99
}

trizod/potenci/potenci.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
33
Adapted from https://github.com/protein-nmr/POTENCI (commit 17dd2e6).
44
Original author: fmulder@chem.au.dk
5-
Adapted by: haak@rostlab.org
5+
Adapted by: markus.haak@tum.de & tobias.senoner@tum.de
66
77
Public API:
88
get_pred_shifts(seq, temperature, pH, ion, ...) -> dict

0 commit comments

Comments
 (0)