11# POTENCI — Random Coil Chemical Shift Prediction
22
3- This module predicts random coil NMR chemical shifts for proteins using the
4- POTENCI algorithm. It is used by the TriZOD pipeline to compute expected shifts,
5- which are then compared to experimental shifts to derive disorder Z-scores.
3+ POTENCI (** P** rediction ** O** f ** T** otal ** E** ffects on ** N** MR ** C** hemical shifts
4+ from ** I** ntrinsically disordered proteins) predicts the ** random coil chemical shifts**
5+ of backbone atoms in a protein sequence — i.e., the shifts you'd expect if the protein
6+ had no structure, just a flexible disordered chain.
7+
8+ In the TriZOD pipeline, these predicted shifts are compared against experimentally
9+ measured shifts. Large deviations indicate structure; small deviations indicate disorder.
610
711## Origin
812
13+ Based on the POTENCI method published in:
14+
15+ > Nielsen JT, Mulder FAA. * POTENCI: prediction of temperature, neighbor and pH-corrected
16+ > chemical shifts for intrinsically disordered proteins.* J Biomol NMR. 2018;70:141-165.
17+ > [ doi:10.1007/s10858-018-0175-y] ( https://doi.org/10.1007/s10858-018-0175-y )
18+
919Adapted from [ protein-nmr/POTENCI] ( https://github.com/protein-nmr/POTENCI )
1020(commit ` 17dd2e6 ` , file ` pytenci1_3.py ` ), originally by Frans Mulder
11- (fmulder@chem.au.dk ).
21+ (fmulder@chem.au.dk ). Adapted by Markus Haak (markus.haak@tum.de ) and
22+ Tobias Senoner (tobias.senoner@tum.de ).
1223
1324## Differences from upstream
1425
@@ -26,13 +37,142 @@ algorithm) but improved for integration into the TriZOD pipeline:
2637| Naming | camelCase / smooshed (` getpredshifts ` ) | snake_case (` get_pred_shifts ` ) |
2738| Constants | Single-letter names (` e ` , ` a ` , ` b ` ) | Descriptive (` DIELECTRIC_WATER ` , ` MIN_CHARGE_DISTANCE ` ) |
2839
29- ## Public API
40+ ## Predicted atoms
41+
42+ 7 backbone atom types per residue:
43+
44+ | Atom | Description | Notes |
45+ | ------| -------------| -------|
46+ | ` C ` | Carbonyl carbon (C') | |
47+ | ` CA ` | Alpha carbon | |
48+ | ` CB ` | Beta carbon | Absent for Glycine |
49+ | ` N ` | Amide nitrogen | |
50+ | ` H ` | Amide proton | Absent for Proline |
51+ | ` HA ` | Alpha proton | |
52+ | ` HB ` | Beta proton | |
53+
54+ Terminal residues (first and last) are excluded from predictions because they
55+ lack the full 5-residue context window.
56+
57+ ## The prediction model (4 additive terms)
58+
59+ For each interior residue, the predicted shift is:
60+
61+ ```
62+ shift = center_shift + neighbor_corrections + temperature_correction + pH_correction
63+ ```
64+
65+ ### 1. Center shifts (` centshifts.csv ` , 20 amino acids)
66+
67+ Base random coil chemical shift for each amino acid x atom type. These are empirically
68+ determined reference values (e.g., Alanine CA = 52.53 ppm).
69+
70+ ### 2. Neighbor corrections (` neicorrs.csv ` + ` termcorrs.csv ` )
71+
72+ The identity of the 4 neighboring residues (i-2, i-1, i+1, i+2) perturbs the shift.
73+ Computed in ` pred_pent_shift() ` using a ** 5-residue sliding window (pentamer)** :
74+
75+ ```
76+ pentamer = [i-2, i-1, center, i+1, i+2]
77+ ```
78+
79+ For terminal residues, out-of-bounds positions are replaced by virtual ` "n" ` and ` "c" `
80+ residues with their own correction coefficients (` termcorrs.csv ` ).
81+
82+ On top of simple neighbor corrections, there are ** combination corrections**
83+ (` combdevs.csv ` , 247 entries) that capture non-additive effects. Amino acids are
84+ grouped into 6 classes (G, P, aromatic, aliphatic, positive, negative) and specific
85+ center+neighbor group pairs get extra correction terms.
86+
87+ ### 3. Temperature correction (` tempcoeffs.csv ` )
88+
89+ Linear scaling from a reference temperature of 298K:
90+
91+ ```
92+ correction = coeff / 1000 * (T - 298)
93+ ```
94+
95+ Coefficients are per amino acid x atom type. Not applied to HB.
96+
97+ ### 4. pH correction (the expensive part)
98+
99+ Only applied when ` use_ph_corr=True ` (i.e., pH != 7.0). This accounts for the
100+ protonation state of titratable residues (D, E, H, C, K, R, Y, and N/C-termini)
101+ at non-neutral pH. This is where ~ 92% of compute time goes.
102+
103+ #### Step 1: Predict pKa values (` calc_pkas_from_seq ` )
104+
105+ The algorithm iteratively predicts effective pKa values for all titratable sites:
106+
107+ 1 . ** Identify titratable residues** in the sequence (D, E, H, C, K, R, Y, n-term,
108+ c-term) with their reference pKa values from ` constants.py ` .
109+
110+ 2 . ** Compute pairwise electrostatic interactions** using a Debye-Hückel model:
111+ - Distance between sites estimated from sequence separation:
112+ ` d = 5.0 + sqrt(|i-j|) * 7.5 ` Angstrom
113+ - Ionic strength dampens interactions (higher salt = weaker coupling)
114+ - ` _debye_huckel_W() ` computes the screened Coulomb energy using the complementary
115+ error function (` erfc ` )
116+
117+ 3 . ** Iteratively fit pKa values** (5 cycles):
118+ - For each titratable site, enumerate all 2^W protonation microstates in a sliding
119+ window of W residues (W = min(5, n_sites))
120+ - Compute the Boltzmann weight of each microstate from a free energy matrix G that
121+ combines: intrinsic pKa, pH, pairwise interactions, and the current titration
122+ state of sites outside the window
123+ - Sum Boltzmann weights to get the titration fraction at each pH value (54 pH points
124+ from 2.0 to 10.0)
125+ - Fit the resulting titration curve to a Henderson-Hasselbalch equation using
126+ ` scipy.curve_fit ` to extract effective pKa and Hill coefficient (nH)
127+ - Use the new titration fractions as input for the next cycle
128+
129+ ** Why the sliding window?** With N titratable sites, the full partition function has 2^N
130+ microstates — exponential and infeasible for large proteins. The sliding window (size 5)
131+ approximates this by enumerating microstates locally, treating distant sites as fixed at
132+ their current titration state. The iterative cycles let information propagate across the
133+ sequence.
134+
135+ #### Step 2: Compute shift corrections (` _get_ph_corrs ` )
136+
137+ For each titratable residue with predicted pKa:
138+
139+ ```
140+ frac = titration_fraction(pH, predicted_pKa, nH) # at actual pH
141+ frac_ref = titration_fraction(7.0, reference_pKa, nH) # at reference pH
142+ delta = (frac - frac_ref) * ph_shift_delta # from phshifts.csv
143+ ```
144+
145+ This correction is applied to the titratable residue itself AND its immediate neighbors
146+ (preceding/succeeding), since protonation state changes propagate to nearby residues.
147+
148+ ## Output format
149+
150+ ` get_pred_shifts("AAGDTFKISELVK", 298.0, 7.0, 0.1) ` returns:
30151
31152``` python
32- get_pred_shifts(seq, temperature, pH, ion, use_ph_corr = True , pka_csv_path = None , identifier = " " )
153+ {
154+ (2 , ' A' ): {' C' : 177.5 , ' CA' : 52.7 , ' CB' : 19.2 , ' H' : 8.21 , ' HA' : 4.26 , ' N' : 123.5 , ' HB' : 1.32 },
155+ (3 , ' G' ): {' C' : 173.7 , ' CA' : 45.3 , ' H' : 8.28 , ' HA' : 3.95 , ' N' : 109.7 }, # no CB/HB
156+ (4 , ' D' ): { ... },
157+ ...
158+ (12 , ' V' ): { ... },
159+ # (1, 'A') and (13, 'K') excluded — terminal residues
160+ }
33161```
34162
35- Returns ` dict[(residue_num, aa_letter)] -> dict[atom_type -> shift_value] ` .
163+ - ** Keys** : ` (residue_number, amino_acid) ` — 1-based, terminals excluded
164+ - ** Values** : dict of atom type to predicted shift in ** ppm**
165+ - Glycine has no CB/HB; Proline has no H
166+
167+ In the TriZOD pipeline, these predicted shifts are subtracted from experimental shifts
168+ to get ** secondary chemical shifts (SCS)** , which are then converted to Z-scores
169+ indicating disorder.
170+
171+ ## Public API
172+
173+ ``` python
174+ get_pred_shifts(seq, temperature, pH, ion, use_ph_corr = True )
175+ ```
36176
37177** Parameters:**
38178
@@ -41,7 +181,34 @@ Returns `dict[(residue_num, aa_letter)] -> dict[atom_type -> shift_value]`.
41181- ` pH ` — sample pH (e.g. 7.0)
42182- ` ion ` — ionic strength in M (e.g. 0.1)
43183- ` use_ph_corr ` — apply pH-dependent corrections (set ` False ` for pH=7.0)
44- - ` pka_csv_path ` — path to pre-computed pKa CSV, or ` False ` to compute on the fly
184+
185+ ## Caching
186+
187+ ### Module-level caching (at import time)
188+
189+ All CSV data tables are parsed once when the module is first imported and stored as
190+ module-level constants:
191+
192+ | Constant | Source | Content |
193+ | ----------| --------| ---------|
194+ | ` CENTER_SHIFTS ` | ` centshifts.csv ` | Base shift per (amino acid, atom) |
195+ | ` NEIGHBOR_CORRS ` | ` neicorrs.csv ` + ` termcorrs.csv ` | Neighbor effect coefficients |
196+ | ` TEMP_CORRS ` | ` tempcoeffs.csv ` | Temperature coefficients |
197+ | ` COMB_CORRS ` | ` combdevs.csv ` | Combination correction terms |
198+ | ` PH_SHIFTS ` | ` phshifts.csv ` | pH shift deltas for titratable residues |
199+ | ` _AA_GROUP ` | computed | Amino acid to group mapping (6 groups) |
200+ | ` _OUTER_MATRICES ` / ` _ALL_TUPLES ` | computed | Pre-computed binary enumeration matrices for pKa fitting |
201+
202+ ### Pipeline-level caching (disk)
203+
204+ Since POTENCI output depends only on ` (seq, temperature, pH, ionic_strength) ` , the
205+ pipeline caches results to disk via ` --cache-dir ` :
206+
207+ - ** Key** : SHA-256 hash of ` "{seq}|{T}|{pH}|{ion}" ` (truncated to 16 chars)
208+ - ** Storage** : ` {cache_dir}/potenci/{hash}.json `
209+ - ** Precompute** : ` scripts/precompute_potenci_cache.py ` can batch-compute the cache
210+ for the entire dataset upfront, with an index file (` _index.tsv ` ) to skip
211+ already-processed BMRB entries on re-runs
45212
46213## Performance profile
47214
@@ -55,9 +222,7 @@ On the 300-entry test subset:
55222- ** Without pH correction** (pH=7.0): ~ 1ms/entry
56223
57224Sequence length and number of titratable residues (D, E, H, C, K, R, Y) are
58- the main cost drivers. For full-dataset runs, the pipeline's ` --cache-dir `
59- option caches the downstream weighted SCS results, but POTENCI predictions
60- themselves could also be cached since they depend only on (seq, T, pH, ion).
225+ the main cost drivers.
61226
62227## Module structure
63228
@@ -76,25 +241,26 @@ trizod/potenci/
76241└── README.md
77242```
78243
79- ## Testing
80-
81- Dedicated tests in ` tests/test_potenci.py ` :
82-
83- - ** Reference value matching** — predictions compared against known values (commit ` 8905a85 ` )
84- - ** pH correction** — verifies pH-sensitive residues shift at non-neutral pH
85- - ** Terminal exclusion** — first/last residues are excluded from predictions
86- - ** Glycine constraints** — no CB/HB predictions for glycine
87-
88244## Key internal functions
89245
90246| Function | Purpose |
91247| ------------------------ | -------------------------------------------------------------- |
92248| ` calc_pkas_from_seq() ` | Iterative pKa prediction (the bottleneck) |
93- | ` get_ph_corrs ()` | pH-dependent shift corrections using predicted pKas |
249+ | ` _get_ph_corrs ()` | pH-dependent shift corrections using predicted pKas |
94250| ` pred_pent_shift() ` | Predict shift for a 5-residue window |
251+ | ` _build_pentamer() ` | Build 5-residue context window with terminal handling |
95252| ` _get_temp_corr() ` | Temperature correction |
96253| ` _titration_fraction() ` | Henderson-Hasselbalch titration fraction (used by ` curve_fit ` ) |
97- | ` _debye_huckel_W() ` | Electrostatic interaction energy (Debye-Hückel ) |
254+ | ` _debye_huckel_W() ` | Electrostatic interaction energy (Debye-Huckel ) |
98255| ` _w_to_logp() ` | Convert interaction energy to log-probability |
99256| ` _small_matrix_limits() ` | Sliding window bounds for pKa fitting |
100257| ` _small_matrix_pos() ` | Position within sliding window |
258+
259+ ## Testing
260+
261+ Dedicated tests in ` tests/test_potenci.py ` :
262+
263+ - ** Reference value matching** — predictions compared against known values (commit ` 8905a85 ` )
264+ - ** pH correction** — verifies pH-sensitive residues shift at non-neutral pH
265+ - ** Terminal exclusion** — first/last residues are excluded from predictions
266+ - ** Glycine constraints** — no CB/HB predictions for glycine
0 commit comments