Dear mdCATH Authors,
First of all, thanks for this amazing dataset!
I'm opening this issue to report that 48 out of 5,398 HDF5 files (0.89%) contain corrupted coordinate frames at the tail end of specific trajectories. I noticed the problem while using mdCATH for an ML project: my training loop intermittently returned NaN values, and after some inspection (and thanks to Claude :) ) I traced the issue to corrupted data. I'm also tagging @g-turri to keep him posted on any updates.
Two types of corruption are present:
- Integer overflow in coordinates (47 trajectories, 289 frames): One coordinate dimension contains values ~2.1 x 10^7 Angstroms, possibly consistent with an int32 overflow artifact (`2^31 / 100 = 21,474,836.48`). The non-overflowed dimensions retain plausible values.
- All-zero coordinate frames (1 trajectory, 1 frame): All atomic coordinates are exactly `0.0`.
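The arithmetic behind the int32 hypothesis can be checked in a few lines. This is only a sketch: the factor of 100 (coordinates stored as hundredths of an Angstrom somewhere in the pipeline) is an assumption, and `looks_like_int32_overflow` with its tolerance is a hypothetical helper, not part of mdCATH.

```python
# Sketch: compare an observed |coordinate| with the value 2^31 / 100.
INT32_LIMIT = 2**31   # one past the largest signed 32-bit integer
SCALE = 100           # assumed fixed-point scale (hundredths of an Angstrom)

overflow_value = INT32_LIMIT / SCALE  # 21474836.48


def looks_like_int32_overflow(abs_coord, tolerance=10.0):
    """Return True if abs_coord is within `tolerance` Angstroms of 2^31 / 100."""
    return abs(abs_coord - overflow_value) <= tolerance


print(overflow_value)
print(looks_like_int32_overflow(21_474_838))   # maximum reported for most rows
print(looks_like_int32_overflow(17_833_816))   # maximum reported for 1ca1A01
```

With this tolerance, the maximum reported for most affected trajectories matches the constant, while the 1ca1A01 maximum does not, so that trajectory may deserve a separate look.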
In both cases the corruption always appears at the tail of a trajectory -- from some frame onward until the last frame. The frame immediately preceding the first bad frame is always physically valid.
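A minimal sketch of how such tail corruption could be detected for one trajectory. It assumes the coordinates are already loaded as an array of shape `(n_frames, n_atoms, 3)` in Angstroms; the function name `find_tail_corruption` and the `1e5` Angstrom threshold are hypothetical choices, not the script used for this report.

```python
import numpy as np


def find_tail_corruption(coords, max_abs=1.0e5):
    """Return (first_bad_frame, bad_until_end) for one trajectory.

    A frame counts as bad if it contains a non-finite value, a coordinate
    whose magnitude exceeds max_abs, or only zeros. first_bad_frame is None
    when every frame passes.
    """
    coords = np.asarray(coords, dtype=np.float64)
    if coords.shape[0] == 0:
        return None, True
    per_frame = coords.reshape(coords.shape[0], -1)
    finite = np.isfinite(per_frame).all(axis=1)
    too_large = (np.abs(per_frame) > max_abs).any(axis=1)
    all_zero = (per_frame == 0.0).all(axis=1)
    bad = ~finite | too_large | all_zero
    if not bad.any():
        return None, True
    first_bad = int(np.argmax(bad))  # index of the first True entry
    # True only if every frame from first_bad to the end is also bad.
    return first_bad, bool(bad[first_bad:].all())


# Synthetic example: 6 frames, 4 atoms, z overflows from frame 4 onward.
rng = np.random.default_rng(0)
demo = rng.normal(scale=10.0, size=(6, 4, 3))
demo[4:, :, 2] = 21_474_838.0
print(find_tail_corruption(demo))  # (4, True)
```

The second return value makes the "from some frame onward until the last frame" pattern explicit: it is `False` if a valid frame reappears after the first bad one.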
The corruption was verified against the upstream HuggingFace repository by downloading fresh copies of 5 affected files and confirming byte-identical bad values, ruling out any downstream processing artifact.
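For anyone who wants to repeat this kind of check, one simple option is to compare a digest of the local file with a digest of a freshly downloaded copy. This sketch compares whole files rather than individual coordinate values, and the helper names are hypothetical; downloading the fresh copy is left out.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(local_path, fresh_path):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(local_path) == sha256_of(fresh_path)
```

Matching digests mean the local copy has not been altered since download, so any bad values it contains are also present upstream.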
Summary
| Metric | Value |
|---|---|
| Files scanned | 5,398 |
| Trajectories scanned | 134,950 |
| Total frames scanned | 62,581,026 |
| Affected PDB domains | 48 (0.89%) |
| Affected trajectories | 48 (0.036%) |
| Corrupted frames | 290 (0.00046%) |
Each affected PDB has exactly one bad trajectory (one specific temperature/replica combination). The remaining trajectories for the same PDB are unaffected.
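The percentages in the summary table follow directly from the raw counts. The short check below reproduces them; the variable names are mine, the numbers are taken from the table.

```python
# Counts copied from the summary table.
files_scanned = 5_398
trajectories_scanned = 134_950
frames_scanned = 62_581_026
affected_domains = 48
affected_trajectories = 48
corrupted_frames = 289 + 1  # overflow frames + the single all-zero frame


def percent(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


print(f"{percent(affected_domains, files_scanned):.2f}%")             # 0.89%
print(f"{percent(affected_trajectories, trajectories_scanned):.3f}%") # 0.036%
print(f"{percent(corrupted_frames, frames_scanned):.5f}%")            # 0.00046%

# 134,950 trajectories over 5,398 files is exactly 25 trajectories per file.
print(trajectories_scanned / files_scanned)  # 25.0
```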
Affected trajectories
| PDB ID | Temp (K) | Replica | Total frames | First bad frame | Bad frames | Overflow dim | Max \|coord\| (Å) |
|---|---|---|---|---|---|---|---|
| `1ca1A01` | 379 | 4 | 60 | 58 | 2 | x | 17,833,816 |
| `1fhoA00` | 348 | 2 | 10 | 3 | 7 | y | 21,474,838 |
| `1gkgA02` | 320 | 4 | 150 | 140 | 10 | z | 21,474,838 |
| `1hm7B00` | 379 | 0 | 270 | 260 | 10 | z | 21,474,838 |
| `1mjwA00` | 413 | 1 | 250 | 246 | 4 | z | 21,474,838 |
| `1pqwA00` | 450 | 2 | 430 | 420 | 10 | z | 21,474,838 |
| `1u0mA01` | 348 | 4 | 180 | 175 | 5 | z | 21,474,838 |
| `1y66A00` | 450 | 0 | 140 | 135 | 5 | y | 21,474,838 |
| `1zhhB01` | 450 | 0 | 310 | 307 | 3 | y | 21,474,838 |
| `1zxqA02` | 348 | 2 | 161 | 160 | 1 | y | 21,474,838 |
| `2akcA00` | 320 | 1 | 40 | 31 | 9 | z | 21,474,838 |
| `2dnxA00` | 348 | 3 | 440 | 438 | 2 | y,z | 21,474,838 |
| `2exrA02` | 320 | 4 | 470 | 468 | 2 | z | 21,474,838 |
| `2jbrA03` | 413 | 3 | 460 | 454 | 6 | z | 21,474,838 |
| `2jysA00` | 320 | 4 | 470 | 466 | 4 | z | 21,474,838 |
| `2k88A00` | 450 | 4 | 410 | 409 | 1 | z | 21,474,838 |
| `2lmkA00` | 450 | 3 | 40 | 30 | 10 | z | 21,474,838 |
| `2m5hA00` | 320 | 4 | 430 | 420 | 10 | y | 21,474,838 |
| `2mutA00` | 379 | 0 | 418 | 411 | 7 | z | 21,474,838 |
| `2p6wA00` | 348 | 0 | 476 | 470 | 6 | z | 21,474,838 |
| `2qenA03` | 348 | 3 | 448 | 447 | 1 | (all-zero) | 0 |
| `2qfzA02` | 348 | 0 | 360 | 352 | 8 | z | 21,474,838 |
| `2qyzA01` | 348 | 4 | 480 | 471 | 9 | z | 21,474,838 |
| `2rnnA00` | 348 | 2 | 30 | 22 | 8 | z | 21,474,838 |
| `2wdcA03` | 348 | 3 | 475 | 470 | 5 | z | 21,474,838 |
| `2wn3A02` | 379 | 1 | 405 | 404 | 1 | z | 21,474,838 |
| `2y94A03` | 348 | 3 | 320 | 318 | 2 | y | 21,474,838 |
| `2z90A01` | 450 | 0 | 50 | 46 | 4 | z | 21,474,838 |
| `2zuvA03` | 379 | 1 | 260 | 250 | 10 | z | 21,474,838 |
| `3b0dC00` | 320 | 3 | 100 | 91 | 9 | z | 21,474,838 |
| `3d8lA00` | 348 | 0 | 480 | 478 | 2 | z | 21,474,838 |
| `3gk0A00` | 348 | 0 | 356 | 352 | 4 | z | 21,474,838 |
| `3hn2A02` | 348 | 4 | 50 | 42 | 8 | z | 21,474,838 |
| `3j7aN00` | 413 | 1 | 357 | 350 | 7 | z | 21,474,838 |
| `3ogdA01` | 450 | 4 | 205 | 202 | 3 | y | 21,474,838 |
| `3t7aA02` | 450 | 3 | 460 | 450 | 10 | z | 21,474,838 |
| `3uoaB01` | 413 | 1 | 70 | 65 | 5 | z | 21,474,838 |
| `3w0lB03` | 413 | 2 | 110 | 104 | 6 | z | 21,474,838 |
| `3wutG00` | 413 | 1 | 120 | 111 | 9 | z | 21,474,838 |
| `4ekuA01` | 320 | 2 | 376 | 371 | 5 | z | 21,474,838 |
| `4fchA02` | 413 | 2 | 440 | 430 | 10 | y | 21,474,838 |
| `4j11D00` | 413 | 1 | 410 | 402 | 8 | z | 21,474,838 |
| `4lsxC00` | 450 | 1 | 450 | 447 | 3 | y | 21,474,838 |
| `4o30B00` | 379 | 3 | 220 | 211 | 9 | y | 21,474,838 |
| `4qpiC00` | 348 | 1 | 357 | 350 | 7 | z | 21,474,838 |
| `4tmpA00` | 348 | 4 | 220 | 216 | 4 | y | 21,474,838 |
| `4xxhA02` | 413 | 4 | 90 | 80 | 10 | z | 21,474,838 |
| `5d5pA02` | 413 | 3 | 450 | 441 | 9 | z | 21,474,838 |
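Until the files are fixed upstream, a possible workaround for downstream users is to truncate the affected trajectories at the first bad frame. The sketch below shows the idea with three rows copied from the table above; the dictionary and the `usable_frames` helper are hypothetical, and the frame indices are treated as 0-based, which is consistent with "first bad frame + bad frames = total frames" in every row.

```python
# (domain, temperature in K, replica) -> index of the first bad frame.
# Only three rows of the table are listed here; extend as needed.
FIRST_BAD_FRAME = {
    ("1ca1A01", 379, 4): 58,
    ("1fhoA00", 348, 2): 3,
    ("2qenA03", 348, 3): 447,
}


def usable_frames(domain, temperature, replica, total_frames):
    """Return how many leading frames of a trajectory are safe to use."""
    first_bad = FIRST_BAD_FRAME.get((domain, temperature, replica))
    if first_bad is None:
        return total_frames  # trajectory is not in the affected list
    return min(first_bad, total_frames)


print(usable_frames("1ca1A01", 379, 4, 60))  # 58
print(usable_frames("1ca1A01", 320, 0, 60))  # 60
```

A data loader can then slice each trajectory as `coords[:usable_frames(...)]` before sampling frames.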