Conversation
This commit adds a general CIF reader and read and write functionality for the PDBx/mmcif format. Wtih this, we can also remove our dependency on BioStructures.jl Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>
There was a problem hiding this comment.
Pull request overview
This PR adds native CIF parsing plus read/write support for the PDBx/mmCIF format, enabling removal of the external BioStructures.jl dependency.
Changes:
- Removed
BioStructures.jlusage and dependency; routed mmCIF I/O through internalMMCIFDetails. - Added a standalone CIF (v1.1/v2.0) parser and new mmCIF reader/writer implementations.
- Updated/expanded PDB/mmCIF tests (including disulfide bonds, secondary structure presence, IO-loading, and coordinate round-trip).
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| test/fileformats/test_pdb.jl | Updates mmCIF expectations and adds additional mmCIF assertions (bonds/coords/IO/nonexistent/round-trip). |
| src/fileformats/pdb/pdb_general.jl | Refactors PDB postprocessing helpers to accept either PDBInfo or record collections (enables reuse from mmCIF reader). |
| src/fileformats/pdb.jl | Removes BioStructures-based conversion; delegates mmCIF read/write to new internal implementation. |
| src/fileformats/mmcif/mmcif_reader.jl | New mmCIF reader built on the CIF parser; creates atoms/fragments, disulfides, and secondary structures. |
| src/fileformats/mmcif/mmcif_writer.jl | New mmCIF writer emitting _atom_site, _struct_conn, and secondary-structure loops. |
| src/fileformats/cif.jl | New general CIF parser and in-memory model (CIFFile/CIFDataBlock/CIFLoop). |
| src/BiochemicalAlgorithms.jl | Wires in CIF and mmCIF modules. |
| Project.toml | Drops BioStructures from [deps] and [compat]. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/fileformats/cif.jl
Outdated
| if nvals % ntags != 0 | ||
| @warn "CIF loop has $(nvals) values for $(ntags) tags — not evenly divisible" | ||
| end | ||
|
|
||
| # Reshape flat values into rows of ntags columns | ||
| rows = Vector{Vector{String}}() | ||
| for i in 1:ntags:nvals | ||
| last_idx = min(i + ntags - 1, nvals) |
| # Use auth fields for residue identification | ||
| c_p1_asym = get(cols, "_struct_conn.ptnr1_auth_asym_id", get(cols, "_struct_conn.ptnr1_label_asym_id", 0)) | ||
| c_p1_comp = get(cols, "_struct_conn.ptnr1_auth_comp_id", get(cols, "_struct_conn.ptnr1_label_comp_id", 0)) | ||
| c_p1_seq = get(cols, "_struct_conn.ptnr1_auth_seq_id", get(cols, "_struct_conn.ptnr1_label_seq_id", 0)) | ||
| c_p2_asym = get(cols, "_struct_conn.ptnr2_auth_asym_id", get(cols, "_struct_conn.ptnr2_label_asym_id", 0)) | ||
| c_p2_comp = get(cols, "_struct_conn.ptnr2_auth_comp_id", get(cols, "_struct_conn.ptnr2_label_comp_id", 0)) | ||
| c_p2_seq = get(cols, "_struct_conn.ptnr2_auth_seq_id", get(cols, "_struct_conn.ptnr2_label_seq_id", 0)) | ||
|
|
| charge_str = isnothing(c_charge) ? nothing : _get(row, c_charge) | ||
| formal_charge = isnothing(charge_str) ? Int(0) : (tryparse(Int, charge_str) === nothing ? 0 : parse(Int, charge_str)) |
| """Quote a string value for CIF output.""" | ||
| function _cif_quote(s::String) | ||
| isempty(s) && return "." | ||
| # No quoting needed for simple values | ||
| if !any(c -> isspace(c), s) && !startswith(s, '_') && !startswith(s, '#') && | ||
| !startswith(s, '\'') && !startswith(s, '"') && s != "." && s != "?" | ||
| return s | ||
| end | ||
| # Use single quotes if possible | ||
| if !occursin('\'', s) | ||
| return "'$s'" | ||
| end | ||
| # Use double quotes | ||
| if !occursin('"', s) | ||
| return "\"$s\"" | ||
| end | ||
| # Fall back to semicolon text block | ||
| return ";\n$s\n;" | ||
| end |
There was a problem hiding this comment.
Do we expect anything other than (abstract) strings here? Silently converting seems fishy.
| # No quoting needed for simple values | ||
| if !any(c -> isspace(c), s) && !startswith(s, '_') && !startswith(s, '#') && | ||
| !startswith(s, '\'') && !startswith(s, '"') && s != "." && s != "?" | ||
| return s | ||
| end | ||
| # Use single quotes if possible | ||
| if !occursin('\'', s) | ||
| return "'$s'" | ||
| end | ||
| # Use double quotes | ||
| if !occursin('"', s) | ||
| return "\"$s\"" | ||
| end | ||
| # Fall back to semicolon text block | ||
| return ";\n$s\n;" | ||
| end |
| occ = @sprintf("%.2f", get_property(a, :occupancy, 1.0)) | ||
| bfac = @sprintf("%.2f", get_property(a, :tempfactor, 0.0)) | ||
|
|
||
| charge = a.formal_charge == 0 ? "?" : string(a.formal_charge) | ||
|
|
||
| model_num = a.frame_id | ||
|
|
||
| println(io, "$group $( a.number) $type_sym $atom_name $alt_id $comp_id $chain_id $entity_id $seq_id $ins_code $x $y $z $occ $bfac $charge $seq_id $comp_id $chain_id $atom_name $model_num") |
| # Use auth fields when available | ||
| c_beg_comp = get(cols, "_struct_conf.beg_auth_comp_id", get(cols, "_struct_conf.beg_label_comp_id", 0)) | ||
| c_beg_asym = get(cols, "_struct_conf.beg_auth_asym_id", get(cols, "_struct_conf.beg_label_asym_id", 0)) | ||
| c_beg_seq = get(cols, "_struct_conf.beg_auth_seq_id", get(cols, "_struct_conf.beg_label_seq_id", 0)) | ||
| c_end_comp = get(cols, "_struct_conf.end_auth_comp_id", get(cols, "_struct_conf.end_label_comp_id", 0)) | ||
| c_end_asym = get(cols, "_struct_conf.end_auth_asym_id", get(cols, "_struct_conf.end_label_asym_id", 0)) | ||
| c_end_seq = get(cols, "_struct_conf.end_auth_seq_id", get(cols, "_struct_conf.end_label_seq_id", 0)) | ||
|
|
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>
Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>
tkemmer
left a comment
There was a problem hiding this comment.
Most notably (that is, ignoring documentation-related comments), the mmCIF reader ignores all non-disulphide bonds and aggregates all chains by name.
| """ | ||
| write_mmcif_impl(io::IO, ac::AbstractAtomContainer{T}) | ||
|
|
||
| Write an atom container as PDBx/mmCIF format to the given IO stream. | ||
| """ |
There was a problem hiding this comment.
This should be included in docs/src/private/fileformats.md, as it currently causes CI to fail.
|
|
||
| # ─── CIF value quoting ─────────────────────────────────────────────── | ||
|
|
||
| """Quote a string value for CIF output.""" |
There was a problem hiding this comment.
This (and all one-line docstrings here) should either be properly formatted as docstrings and included in docs/src/private/fileformats.md or reduced to simple comments. Otherwise, CI will fail.
| """ | ||
| read_mmcif(fname_io::Union{AbstractString, IO}, ::Type{T} = Float32; create_coils::Bool = true) -> System{T} | ||
|
|
||
| Read a PDBx/mmCIF file and return a System. | ||
|
|
||
| Models are stored as frames, using the model number as `frame_id`. | ||
| """ |
There was a problem hiding this comment.
This should be included in docs/src/private/fileformats.md, as it currently causes CI to fail.
|
|
||
| # ─── Helpers ────────────────────────────────────────────────────────── | ||
|
|
||
| """Find a loop in the data block whose tags start with the given prefix.""" |
There was a problem hiding this comment.
This (and all one-line docstrings here) should either be properly formatted as docstrings and included in docs/src/private/fileformats.md or reduced to simple comments. Otherwise, CI will fail.
| return nothing | ||
| end | ||
|
|
||
| """Build a tag→column-index map for a loop.""" |
| return ssbonds | ||
| end | ||
|
|
||
| """Parse CIF symmetry operator string like '1_555' into an integer.""" |
| end | ||
| end | ||
|
|
||
| """Parse _struct_sheet_order to get sense values for each sheet range.""" |
| # Set system name from data block name (or filename if available) | ||
| sys_name = if fname_io isa AbstractString | ||
| bn = basename(fname_io) | ||
| # strip extension | ||
| idx = findlast('.', bn) | ||
| isnothing(idx) ? bn : bn[1:idx-1] | ||
| else | ||
| block.name | ||
| end |
There was a problem hiding this comment.
This shouldn't really be read from the filename. We just removed this behavior from other readers/writers (cf. #277). The PDB ID is very well included in the file itself.
On a related note, our (legacy) PDB reader uses the name from the HEADER record as system and molecule name instead of the PDB ID. However, this name does not seem to (necessarily?) exist in the newer PDBx/mmCIF files. See for example 5PTI, which is named "HYDROLASE INHIBITOR" when read from PDB and "5PTI" when read from PDBx/mmCIF. Not sure if we should drop the HEADER record name from our PDB reader just to be in line with PDBx/mmCIF.
| alt_id = (alt_id_raw == "." || alt_id_raw == "?") ? nothing : alt_id_raw | ||
|
|
||
| # Use auth_* fields when available (matches PDB convention) | ||
| chain_id = isnothing(c_auth_asym) ? row[c_asym_id] : _get(row, c_auth_asym, row[c_asym_id]) |
There was a problem hiding this comment.
This reads atoms and hetero atoms into the same chain if they are named the same, although the atoms are read from different mmCIP loops.
See, for example, 5PTI:
julia> chains(sys_pdb)
ChainTable{Float32} with 2 rows:
┌───┬─────┬──────┐
│ # │ idx │ name │
├───┼─────┼──────┤
│ 1 │ 2 │ A │
│ 2 │ 953 │ A │
└───┴─────┴──────┘
julia> chains(sys_mmcif)
ChainTable{Float32} with 1 rows:
┌───┬─────┬──────┐
│ # │ idx │ name │
├───┼─────┼──────┤
│ 1 │ 2 │ A │
└───┴─────┴──────┘Our mmCIF reader should make the same distinction as our PDB reader and separate atoms and hetero atoms into different chains.
|
|
||
| # ─── SSBond Parsing ────────────────────────────────────────────────── | ||
|
|
||
| function _parse_ssbonds(block::CIFDataBlock) |
There was a problem hiding this comment.
What about non-disulphide bonds? These are completely ignored right now. Also, the TYPE__COVALENT flag is not set.
This commit adds a general CIF reader and read and write functionality for the PDBx/mmcif format. Wtih this, we can also remove our dependency on BioStructures.jl