Corrupted frames in mdCATH coordinate data #7

@pietronvll

Dear mdCATH Authors,
First of all, thanks for this amazing dataset!

I'm opening this issue to report that 48 out of 5,398 HDF5 files (0.89%) contain corrupted coordinate frames at the tail end of specific trajectories. I noticed the problem while using mdCATH for an ML project: my training loop intermittently returned NaN values, and after some inspection (and thanks to Claude :) ) I traced them to corrupted data. I'm also tagging @g-turri to keep him posted on this.

Two types of corruption are present:

  1. Integer overflow in coordinates (47 trajectories, 289 frames): One coordinate dimension (two in the case of 2dnxA00) contains values of ~2.1 x 10^7 Angstroms (~1.8 x 10^7 for 1ca1A01), possibly consistent with an int32 overflow artifact (2^31 / 100 = 21,474,836.48). The non-overflowed dimensions retain plausible values.

  2. All-zero coordinate frames (1 trajectory, 1 frame): All atomic coordinates are exactly 0.0.
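For anyone who wants to screen their own copy, here is a minimal sketch of how the two corruption types could be flagged per frame. It is not the exact script used for this report. The array layout `(n_frames, n_atoms, 3)` and the cutoff `OVERFLOW_THRESHOLD` are assumptions chosen for illustration; any cutoff far above a physically plausible coordinate should behave the same.

```python
import numpy as np

# Hypothetical cutoff in Angstroms: far above any plausible protein coordinate,
# far below the ~2.1e7 values reported above (2**31 / 100 == 21474836.48).
OVERFLOW_THRESHOLD = 1.0e6


def classify_frames(coords, threshold=OVERFLOW_THRESHOLD):
    """Label each frame as 'ok', 'overflow', or 'all_zero'.

    coords: array of shape (n_frames, n_atoms, 3) -- assumed layout.
    """
    coords = np.asarray(coords, dtype=np.float64)
    labels = np.full(coords.shape[0], "ok", dtype=object)
    # A frame is all-zero if every coordinate of every atom is exactly 0.0.
    labels[np.all(coords == 0.0, axis=(1, 2))] = "all_zero"
    # A frame is an overflow frame if any coordinate exceeds the cutoff.
    labels[np.any(np.abs(coords) > threshold, axis=(1, 2))] = "overflow"
    return labels


def overflow_dims(frame, threshold=OVERFLOW_THRESHOLD):
    """Return which of 'x', 'y', 'z' exceed the cutoff in one (n_atoms, 3) frame."""
    frame = np.asarray(frame, dtype=np.float64)
    hit = np.any(np.abs(frame) > threshold, axis=0)
    return [name for name, h in zip("xyz", hit) if h]
```

`overflow_dims` corresponds to the "Overflow dim" column in the table of affected trajectories.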

In both cases the corruption always appears at the tail of a trajectory -- from some frame onward until the last frame. The frame immediately preceding the first bad frame is always physically valid.
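Because the bad frames always form a contiguous tail, a simple workaround is to truncate each affected trajectory at its first bad frame. The sketch below assumes a boolean per-frame mask `bad` has already been computed by some detection step; the function names are hypothetical and only illustrate the idea.

```python
import numpy as np


def first_bad_frame(bad):
    """Index of the first True entry in a per-frame boolean mask, or None."""
    idx = np.flatnonzero(np.asarray(bad, dtype=bool))
    return int(idx[0]) if idx.size else None


def is_tail_only(bad):
    """True if all frames from the first bad one to the end are bad (or none are)."""
    bad = np.asarray(bad, dtype=bool)
    start = first_bad_frame(bad)
    return start is None or bool(bad[start:].all())


def truncate_valid(coords, bad):
    """Keep only the frames that precede the first bad frame."""
    start = first_bad_frame(bad)
    return coords if start is None else coords[:start]
```

For 1ca1A01 at 379 K, replica 4 (60 frames, first bad frame 58), this would keep the first 58 frames.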

The corruption was verified against the upstream HuggingFace repository by downloading fresh copies of 5 affected files and confirming byte-identical bad values, ruling out any downstream processing artifact.
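A stricter variant of this check is to compare whole-file hashes of a local copy and a fresh download. The sketch below uses only the Python standard library and assumes both files are already on disk; the download step and the file paths are left out on purpose. Note that it compares entire files, whereas the check described above compared the bad coordinate values.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True if the two files have identical SHA-256 digests."""
    return sha256_of(path_a) == sha256_of(path_b)
```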

Summary

| Metric | Value |
| --- | --- |
| Files scanned | 5,398 |
| Trajectories scanned | 134,950 |
| Total frames scanned | 62,581,026 |
| Affected PDB domains | 48 (0.89%) |
| Affected trajectories | 48 (0.036%) |
| Corrupted frames | 290 (0.00046%) |

Each affected PDB has exactly one bad trajectory (one specific temperature/replica combination). The remaining trajectories for the same PDB are unaffected.
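A per-file scan over all temperature/replica combinations could look like the sketch below. The nesting `domain[temp][replica]["coords"]` is my assumption about the file layout, and the threshold is a hypothetical cutoff. The function accepts any mapping with that shape, so it can be tried on nested dicts as well as on an opened HDF5 group.

```python
import numpy as np

TEMPERATURES = ("320", "348", "379", "413", "450")  # K, the values seen in this report
REPLICAS = ("0", "1", "2", "3", "4")
THRESHOLD = 1.0e6  # hypothetical cutoff in Angstroms


def scan_domain(domain, threshold=THRESHOLD):
    """Return {(temp, replica): first_bad_frame} for trajectories with bad frames.

    `domain` is any mapping laid out as domain[temp][replica]["coords"], with
    coords of shape (n_frames, n_atoms, 3). Both are assumptions.
    """
    report = {}
    for temp in TEMPERATURES:
        if temp not in domain:
            continue
        for rep in REPLICAS:
            if rep not in domain[temp]:
                continue
            coords = np.asarray(domain[temp][rep]["coords"], dtype=np.float64)
            # Bad frame: any huge coordinate, or every coordinate exactly zero.
            bad = np.any(np.abs(coords) > threshold, axis=(1, 2))
            bad |= np.all(coords == 0.0, axis=(1, 2))
            idx = np.flatnonzero(bad)
            if idx.size:
                report[(temp, rep)] = int(idx[0])
    return report
```

For an affected file, the returned dictionary would be expected to contain a single entry.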

Affected trajectories

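| PDB ID | Temp (K) | Replica | Total frames | First bad frame | Bad frames | Overflow dim | Max \|coord\| (Å) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1ca1A01 | 379 | 4 | 60 | 58 | 2 | x | 17,833,816 |
| 1fhoA00 | 348 | 2 | 10 | 3 | 7 | y | 21,474,838 |
| 1gkgA02 | 320 | 4 | 150 | 140 | 10 | z | 21,474,838 |
| 1hm7B00 | 379 | 0 | 270 | 260 | 10 | z | 21,474,838 |
| 1mjwA00 | 413 | 1 | 250 | 246 | 4 | z | 21,474,838 |
| 1pqwA00 | 450 | 2 | 430 | 420 | 10 | z | 21,474,838 |
| 1u0mA01 | 348 | 4 | 180 | 175 | 5 | z | 21,474,838 |
| 1y66A00 | 450 | 0 | 140 | 135 | 5 | y | 21,474,838 |
| 1zhhB01 | 450 | 0 | 310 | 307 | 3 | y | 21,474,838 |
| 1zxqA02 | 348 | 2 | 161 | 160 | 1 | y | 21,474,838 |
| 2akcA00 | 320 | 1 | 40 | 31 | 9 | z | 21,474,838 |
| 2dnxA00 | 348 | 3 | 440 | 438 | 2 | y,z | 21,474,838 |
| 2exrA02 | 320 | 4 | 470 | 468 | 2 | z | 21,474,838 |
| 2jbrA03 | 413 | 3 | 460 | 454 | 6 | z | 21,474,838 |
| 2jysA00 | 320 | 4 | 470 | 466 | 4 | z | 21,474,838 |
| 2k88A00 | 450 | 4 | 410 | 409 | 1 | z | 21,474,838 |
| 2lmkA00 | 450 | 3 | 40 | 30 | 10 | z | 21,474,838 |
| 2m5hA00 | 320 | 4 | 430 | 420 | 10 | y | 21,474,838 |
| 2mutA00 | 379 | 0 | 418 | 411 | 7 | z | 21,474,838 |
| 2p6wA00 | 348 | 0 | 476 | 470 | 6 | z | 21,474,838 |
| 2qenA03 | 348 | 3 | 448 | 447 | 1 | (all-zero) | 0 |
| 2qfzA02 | 348 | 0 | 360 | 352 | 8 | z | 21,474,838 |
| 2qyzA01 | 348 | 4 | 480 | 471 | 9 | z | 21,474,838 |
| 2rnnA00 | 348 | 2 | 30 | 22 | 8 | z | 21,474,838 |
| 2wdcA03 | 348 | 3 | 475 | 470 | 5 | z | 21,474,838 |
| 2wn3A02 | 379 | 1 | 405 | 404 | 1 | z | 21,474,838 |
| 2y94A03 | 348 | 3 | 320 | 318 | 2 | y | 21,474,838 |
| 2z90A01 | 450 | 0 | 50 | 46 | 4 | z | 21,474,838 |
| 2zuvA03 | 379 | 1 | 260 | 250 | 10 | z | 21,474,838 |
| 3b0dC00 | 320 | 3 | 100 | 91 | 9 | z | 21,474,838 |
| 3d8lA00 | 348 | 0 | 480 | 478 | 2 | z | 21,474,838 |
| 3gk0A00 | 348 | 0 | 356 | 352 | 4 | z | 21,474,838 |
| 3hn2A02 | 348 | 4 | 50 | 42 | 8 | z | 21,474,838 |
| 3j7aN00 | 413 | 1 | 357 | 350 | 7 | z | 21,474,838 |
| 3ogdA01 | 450 | 4 | 205 | 202 | 3 | y | 21,474,838 |
| 3t7aA02 | 450 | 3 | 460 | 450 | 10 | z | 21,474,838 |
| 3uoaB01 | 413 | 1 | 70 | 65 | 5 | z | 21,474,838 |
| 3w0lB03 | 413 | 2 | 110 | 104 | 6 | z | 21,474,838 |
| 3wutG00 | 413 | 1 | 120 | 111 | 9 | z | 21,474,838 |
| 4ekuA01 | 320 | 2 | 376 | 371 | 5 | z | 21,474,838 |
| 4fchA02 | 413 | 2 | 440 | 430 | 10 | y | 21,474,838 |
| 4j11D00 | 413 | 1 | 410 | 402 | 8 | z | 21,474,838 |
| 4lsxC00 | 450 | 1 | 450 | 447 | 3 | y | 21,474,838 |
| 4o30B00 | 379 | 3 | 220 | 211 | 9 | y | 21,474,838 |
| 4qpiC00 | 348 | 1 | 357 | 350 | 7 | z | 21,474,838 |
| 4tmpA00 | 348 | 4 | 220 | 216 | 4 | y | 21,474,838 |
| 4xxhA02 | 413 | 4 | 90 | 80 | 10 | z | 21,474,838 |
| 5d5pA02 | 413 | 3 | 450 | 441 | 9 | z | 21,474,838 |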
