Skip to content

Implement read/write support for PDBx/mmcif#280

Open
anhi wants to merge 3 commits intodevelopfrom
anhi/mmcif
Open

Implement read/write support for PDBx/mmcif#280
anhi wants to merge 3 commits intodevelopfrom
anhi/mmcif

Conversation

@anhi
Copy link
Copy Markdown
Member

@anhi anhi commented Mar 17, 2026

This commit adds a general CIF reader and read and write functionality for the PDBx/mmcif format. Wtih this, we can also remove our dependency on BioStructures.jl

This commit adds a general CIF reader and read and write
functionality for the PDBx/mmcif format.  Wtih this, we can also remove our dependency on BioStructures.jl

Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds native CIF parsing plus read/write support for the PDBx/mmCIF format, enabling removal of the external BioStructures.jl dependency.

Changes:

  • Removed BioStructures.jl usage and dependency; routed mmCIF I/O through internal MMCIFDetails.
  • Added a standalone CIF (v1.1/v2.0) parser and new mmCIF reader/writer implementations.
  • Updated/expanded PDB/mmCIF tests (including disulfide bonds, secondary structure presence, IO-loading, and coordinate round-trip).

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
test/fileformats/test_pdb.jl Updates mmCIF expectations and adds additional mmCIF assertions (bonds/coords/IO/nonexistent/round-trip).
src/fileformats/pdb/pdb_general.jl Refactors PDB postprocessing helpers to accept either PDBInfo or record collections (enables reuse from mmCIF reader).
src/fileformats/pdb.jl Removes BioStructures-based conversion; delegates mmCIF read/write to new internal implementation.
src/fileformats/mmcif/mmcif_reader.jl New mmCIF reader built on the CIF parser; creates atoms/fragments, disulfides, and secondary structures.
src/fileformats/mmcif/mmcif_writer.jl New mmCIF writer emitting _atom_site, _struct_conn, and secondary-structure loops.
src/fileformats/cif.jl New general CIF parser and in-memory model (CIFFile/CIFDataBlock/CIFLoop).
src/BiochemicalAlgorithms.jl Wires in CIF and mmCIF modules.
Project.toml Drops BioStructures from [deps] and [compat].

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +230 to +237
if nvals % ntags != 0
@warn "CIF loop has $(nvals) values for $(ntags) tags — not evenly divisible"
end

# Reshape flat values into rows of ntags columns
rows = Vector{Vector{String}}()
for i in 1:ntags:nvals
last_idx = min(i + ntags - 1, nvals)
Comment on lines +248 to +255
# Use auth fields for residue identification
c_p1_asym = get(cols, "_struct_conn.ptnr1_auth_asym_id", get(cols, "_struct_conn.ptnr1_label_asym_id", 0))
c_p1_comp = get(cols, "_struct_conn.ptnr1_auth_comp_id", get(cols, "_struct_conn.ptnr1_label_comp_id", 0))
c_p1_seq = get(cols, "_struct_conn.ptnr1_auth_seq_id", get(cols, "_struct_conn.ptnr1_label_seq_id", 0))
c_p2_asym = get(cols, "_struct_conn.ptnr2_auth_asym_id", get(cols, "_struct_conn.ptnr2_label_asym_id", 0))
c_p2_comp = get(cols, "_struct_conn.ptnr2_auth_comp_id", get(cols, "_struct_conn.ptnr2_label_comp_id", 0))
c_p2_seq = get(cols, "_struct_conn.ptnr2_auth_seq_id", get(cols, "_struct_conn.ptnr2_label_seq_id", 0))

Comment on lines +209 to +210
charge_str = isnothing(c_charge) ? nothing : _get(row, c_charge)
formal_charge = isnothing(charge_str) ? Int(0) : (tryparse(Int, charge_str) === nothing ? 0 : parse(Int, charge_str))
Comment on lines +25 to +43
"""Quote a string value for CIF output."""
function _cif_quote(s::String)
isempty(s) && return "."
# No quoting needed for simple values
if !any(c -> isspace(c), s) && !startswith(s, '_') && !startswith(s, '#') &&
!startswith(s, '\'') && !startswith(s, '"') && s != "." && s != "?"
return s
end
# Use single quotes if possible
if !occursin('\'', s)
return "'$s'"
end
# Use double quotes
if !occursin('"', s)
return "\"$s\""
end
# Fall back to semicolon text block
return ";\n$s\n;"
end
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect anything other than (abstract) strings here? Silently converting seems fishy.

Comment on lines +28 to +43
# No quoting needed for simple values
if !any(c -> isspace(c), s) && !startswith(s, '_') && !startswith(s, '#') &&
!startswith(s, '\'') && !startswith(s, '"') && s != "." && s != "?"
return s
end
# Use single quotes if possible
if !occursin('\'', s)
return "'$s'"
end
# Use double quotes
if !occursin('"', s)
return "\"$s\""
end
# Fall back to semicolon text block
return ";\n$s\n;"
end
Comment on lines +103 to +110
occ = @sprintf("%.2f", get_property(a, :occupancy, 1.0))
bfac = @sprintf("%.2f", get_property(a, :tempfactor, 0.0))

charge = a.formal_charge == 0 ? "?" : string(a.formal_charge)

model_num = a.frame_id

println(io, "$group $( a.number) $type_sym $atom_name $alt_id $comp_id $chain_id $entity_id $seq_id $ins_code $x $y $z $occ $bfac $charge $seq_id $comp_id $chain_id $atom_name $model_num")
Comment on lines +321 to +328
# Use auth fields when available
c_beg_comp = get(cols, "_struct_conf.beg_auth_comp_id", get(cols, "_struct_conf.beg_label_comp_id", 0))
c_beg_asym = get(cols, "_struct_conf.beg_auth_asym_id", get(cols, "_struct_conf.beg_label_asym_id", 0))
c_beg_seq = get(cols, "_struct_conf.beg_auth_seq_id", get(cols, "_struct_conf.beg_label_seq_id", 0))
c_end_comp = get(cols, "_struct_conf.end_auth_comp_id", get(cols, "_struct_conf.end_label_comp_id", 0))
c_end_asym = get(cols, "_struct_conf.end_auth_asym_id", get(cols, "_struct_conf.end_label_asym_id", 0))
c_end_seq = get(cols, "_struct_conf.end_auth_seq_id", get(cols, "_struct_conf.end_label_seq_id", 0))

anhi and others added 2 commits March 17, 2026 18:54
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>
Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>
@tkemmer tkemmer added the enhancement New feature or request label Mar 17, 2026
@tkemmer tkemmer added this to the v0.7 milestone Mar 17, 2026
Copy link
Copy Markdown
Collaborator

@tkemmer tkemmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most notably (that is, ignoring documentation-related comments), the mmCIF reader ignores all non-disulphide bonds and aggregates all chains by name.

Comment on lines +5 to +9
"""
write_mmcif_impl(io::IO, ac::AbstractAtomContainer{T})

Write an atom container as PDBx/mmCIF format to the given IO stream.
"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be included in docs/src/private/fileformats.md, as it currently causes CI to fail.


# ─── CIF value quoting ───────────────────────────────────────────────

"""Quote a string value for CIF output."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and all one-line docstrings here) should either be properly formatted as docstrings and included in docs/src/private/fileformats.md or reduced to simple comments. Otherwise, CI will fail.

Comment on lines +7 to +13
"""
read_mmcif(fname_io::Union{AbstractString, IO}, ::Type{T} = Float32; create_coils::Bool = true) -> System{T}

Read a PDBx/mmCIF file and return a System.

Models are stored as frames, using the model number as `frame_id`.
"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be included in docs/src/private/fileformats.md, as it currently causes CI to fail.


# ─── Helpers ──────────────────────────────────────────────────────────

"""Find a loop in the data block whose tags start with the given prefix."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and all one-line docstrings here) should either be properly formatted as docstrings and included in docs/src/private/fileformats.md or reduced to simple comments. Otherwise, CI will fail.

return nothing
end

"""Build a tag→column-index map for a loop."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

return ssbonds
end

"""Parse CIF symmetry operator string like '1_555' into an integer."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

end
end

"""Parse _struct_sheet_order to get sense values for each sheet range."""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Comment on lines +28 to +36
# Set system name from data block name (or filename if available)
sys_name = if fname_io isa AbstractString
bn = basename(fname_io)
# strip extension
idx = findlast('.', bn)
isnothing(idx) ? bn : bn[1:idx-1]
else
block.name
end
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't really be read from the filename. We just removed this behavior from other readers/writers (cf. #277). The PDB ID is very well included in the file itself.

On a related note, our (legacy) PDB reader uses the name from the HEADER record as system and molecule name instead of the PDB ID. However, this name does not seem to (necessarily?) exist in the newer PDBx/mmCIF files. See for example 5PTI, which is named "HYDROLASE INHIBITOR" when read from PDB and "5PTI" when read from PDBx/mmCIF. Not sure if we should drop the HEADER record name from our PDB reader just to be in line with PDBx/mmCIF.

alt_id = (alt_id_raw == "." || alt_id_raw == "?") ? nothing : alt_id_raw

# Use auth_* fields when available (matches PDB convention)
chain_id = isnothing(c_auth_asym) ? row[c_asym_id] : _get(row, c_auth_asym, row[c_asym_id])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads atoms and hetero atoms into the same chain if they are named the same, although the atoms are read from different mmCIP loops.

See, for example, 5PTI:

julia> chains(sys_pdb)
ChainTable{Float32} with 2 rows:
┌───┬─────┬──────┐
│ # │ idx │ name │
├───┼─────┼──────┤
│ 12   │ A    │
│ 2953 │ A    │
└───┴─────┴──────┘


julia> chains(sys_mmcif)
ChainTable{Float32} with 1 rows:
┌───┬─────┬──────┐
│ # │ idx │ name │
├───┼─────┼──────┤
│ 12   │ A    │
└───┴─────┴──────┘

Our mmCIF reader should make the same distinction as our PDB reader and separate atoms and hetero atoms into different chains.


# ─── SSBond Parsing ──────────────────────────────────────────────────

function _parse_ssbonds(block::CIFDataBlock)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about non-disulphide bonds? These are completely ignored right now. Also, the TYPE__COVALENT flag is not set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants