Description
Add support for read/write of a file format that (ideally)
- supports arbitrary mmCIF key:value pairs (so that we can convert it loss-free to or from mmCIF, including custom data as per "Allow adding extra user-defined categories", #30)
- is fast to parse
- is compact
- is seekable (so that we can easily add extra frames in a trajectory, or quickly access frames from the middle of an existing trajectory)
Such a format would allow IMP folks to ditch the old RMF format as the "working format", and easily convert the resulting models to mmCIF. This may also be a practical solution for those that need to deposit huge files, bigger than is really practical for current mmCIF.
Conventional mmCIF fails points 2-4. We can't easily add an extra model to an mmCIF file without rewriting the entire file (since various IDs would have to be updated, and the data is gathered per-table rather than per-model). We also can't read a single trajectory frame without scanning the whole file.
MMTF (#11) has its own data model, so fails point 1. Both MMTF and BinaryCIF support a number of encoding mechanisms to generate compact files (point 3), but these render the file non-seekable (e.g. run-length encoding of the ATOM/HETATM field in the atom_site table means the entire column has to be read and unpacked to determine whether a given atom is ATOM or HETATM).
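To make the seekability problem concrete, here is a toy sketch in Python of run-length encoding applied to an ATOM/HETATM column. The helper names (`rle_encode`, `rle_lookup`) and the sample data are made up for illustration, and real BinaryCIF/MMTF encoders work on integer arrays and chain several encodings, so this is only an approximation of the idea.

```python
def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_lookup(runs, index):
    """Return the value at `index` by walking the runs from the start.

    There is no way to jump straight to `index`: the position of each
    run depends on the lengths of all the runs before it.
    """
    seen = 0
    for value, length in runs:
        seen += length
        if index < seen:
            return value
    raise IndexError(index)


# Hypothetical column: 5 ATOM rows, 2 HETATM rows, 3 ATOM rows
group_pdb = ["ATOM"] * 5 + ["HETATM"] * 2 + ["ATOM"] * 3
runs = rle_encode(group_pdb)
```

Ten strings shrink to three runs, but answering "is atom 6 a HETATM?" means scanning the runs in order, which is the trade-off described above.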
Fast parsing probably necessitates a binary format.
Proposal: use HDF5 as a binary container format. Each mmCIF category would map to an HDF5 table, which should be trivially seekable and extendable. This won't be as compact as BinaryCIF or MMTF (although HDF5 does support compression). To address this we can
- replace a lot of string data with more compact forms - for example replace any floating-point number with a 32-bit float, or enumerated data (e.g. ATOM or HETATM) with a suitably small integer (just a 0 or 1 bool type in this case).
- split out changing and non-changing data per ID - for example several models may have the same composition, so we only need to store coordinates for each model, and a single table with the composition.
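The two ideas in this list can be sketched without HDF5 at all, using only Python's `struct` module on an in-memory buffer. Everything here (the record layouts, `append_frame`, `read_frame`, `make_frame`, the three-atom composition) is a hypothetical illustration rather than a proposed on-disk format; an HDF5 table with fixed-size rows would be expected to give the same append and random-access behaviour.

```python
import io
import struct

# Assumed layouts, for illustration only.
ATOM_RECORD = struct.Struct("<4s?")  # atom name (4 bytes) + is_hetatm flag (1 byte)
COORD_RECORD = struct.Struct("<3f")  # x, y, z as 32-bit floats (12 bytes)

# Composition is stored once and shared by every frame.
composition = [(b"N", False), (b"CA", False), (b"ZN", True)]
composition_blob = b"".join(ATOM_RECORD.pack(name, het) for name, het in composition)

n_atoms = len(composition)
frame_size = n_atoms * COORD_RECORD.size  # every frame has the same byte length


def append_frame(buf, coords):
    """Add one frame at the end of the buffer without touching earlier frames."""
    buf.seek(0, io.SEEK_END)
    for xyz in coords:
        buf.write(COORD_RECORD.pack(*xyz))


def read_frame(buf, frame_index):
    """Jump directly to one frame; no other frame is read."""
    buf.seek(frame_index * frame_size)
    data = buf.read(frame_size)
    if len(data) != frame_size:
        raise IndexError(frame_index)
    return [COORD_RECORD.unpack_from(data, i * COORD_RECORD.size)
            for i in range(n_atoms)]


def make_frame(k):
    """Fake coordinates for frame k (values chosen to be exact in float32)."""
    return [(float(k + i), 0.5, -1.0) for i in range(n_atoms)]


trajectory = io.BytesIO()
for k in range(3):
    append_frame(trajectory, make_frame(k))

middle = read_frame(trajectory, 1)  # random access to the middle frame
```

Because each frame occupies exactly `frame_size` bytes, the offset of frame `i` is a single multiplication, and adding a frame is a pure append. The composition costs 5 bytes per atom once, instead of being repeated as text in every model.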