Implement read/write support for PDBx/mmcif by anhi · Pull Request #280 · hildebrandtlab/BiochemicalAlgorithms.jl

anhi · 2026-03-17T17:39:22Z

This commit adds a general CIF reader plus read and write functionality for the PDBx/mmCIF format. With this, we can also remove our dependency on BioStructures.jl.

Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>

Copilot

Pull request overview

This PR adds native CIF parsing plus read/write support for the PDBx/mmCIF format, enabling removal of the external BioStructures.jl dependency.

Changes:

Removed BioStructures.jl usage and dependency; routed mmCIF I/O through internal MMCIFDetails.
Added a standalone CIF (v1.1/v2.0) parser and new mmCIF reader/writer implementations.
Updated/expanded PDB/mmCIF tests (including disulfide bonds, secondary structure presence, IO-loading, and coordinate round-trip).

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 8 comments.

File	Description
test/fileformats/test_pdb.jl	Updates mmCIF expectations and adds additional mmCIF assertions (bonds/coords/IO/nonexistent/round-trip).
src/fileformats/pdb/pdb_general.jl	Refactors PDB postprocessing helpers to accept either `PDBInfo` or record collections (enables reuse from mmCIF reader).
src/fileformats/pdb.jl	Removes BioStructures-based conversion; delegates mmCIF read/write to new internal implementation.
src/fileformats/mmcif/mmcif_reader.jl	New mmCIF reader built on the CIF parser; creates atoms/fragments, disulfides, and secondary structures.
src/fileformats/mmcif/mmcif_writer.jl	New mmCIF writer emitting `_atom_site`, `_struct_conn`, and secondary-structure loops.
src/fileformats/cif.jl	New general CIF parser and in-memory model (`CIFFile`/`CIFDataBlock`/`CIFLoop`).
src/BiochemicalAlgorithms.jl	Wires in CIF and mmCIF modules.
Project.toml	Drops `BioStructures` from `[deps]` and `[compat]`.

src/fileformats/cif.jl

+        if nvals % ntags != 0
+            @warn "CIF loop has $(nvals) values for $(ntags) tags — not evenly divisible"
+        end
+
+        # Reshape flat values into rows of ntags columns
+        rows = Vector{Vector{String}}()
+        for i in 1:ntags:nvals
+            last_idx = min(i + ntags - 1, nvals)


src/fileformats/mmcif/mmcif_reader.jl

+    # Use auth fields for residue identification
+    c_p1_asym = get(cols, "_struct_conn.ptnr1_auth_asym_id", get(cols, "_struct_conn.ptnr1_label_asym_id", 0))
+    c_p1_comp = get(cols, "_struct_conn.ptnr1_auth_comp_id", get(cols, "_struct_conn.ptnr1_label_comp_id", 0))
+    c_p1_seq  = get(cols, "_struct_conn.ptnr1_auth_seq_id", get(cols, "_struct_conn.ptnr1_label_seq_id", 0))
+    c_p2_asym = get(cols, "_struct_conn.ptnr2_auth_asym_id", get(cols, "_struct_conn.ptnr2_label_asym_id", 0))
+    c_p2_comp = get(cols, "_struct_conn.ptnr2_auth_comp_id", get(cols, "_struct_conn.ptnr2_label_comp_id", 0))
+    c_p2_seq  = get(cols, "_struct_conn.ptnr2_auth_seq_id", get(cols, "_struct_conn.ptnr2_label_seq_id", 0))
+


src/fileformats/mmcif/mmcif_reader.jl

+        charge_str = isnothing(c_charge) ? nothing : _get(row, c_charge)
+        formal_charge = isnothing(charge_str) ? Int(0) : (tryparse(Int, charge_str) === nothing ? 0 : parse(Int, charge_str))
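
The nested `tryparse`/`parse` call in the hunk above parses the same string twice. A minimal sketch of an equivalent single-pass form, assuming the field is either `nothing` (column absent) or a string; `parse_formal_charge` is a hypothetical helper name, not part of the PR:

```julia
# Hypothetical helper: a missing column (nothing), the CIF placeholders
# "?" and ".", and any other unparsable text all fall back to charge 0.
parse_formal_charge(::Nothing) = 0
parse_formal_charge(s::AbstractString) = something(tryparse(Int, s), 0)

parse_formal_charge("-1")     # parses to the integer -1
parse_formal_charge("?")      # falls back to 0
parse_formal_charge(nothing)  # falls back to 0
```

`something` returns its first argument that is not `nothing`, so the string is parsed only once.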


tkemmer · 2026-03-19T10:33:31Z

src/fileformats/mmcif/mmcif_writer.jl

+"""Quote a string value for CIF output."""
+function _cif_quote(s::String)
+    isempty(s) && return "."
+    # No quoting needed for simple values
+    if !any(c -> isspace(c), s) && !startswith(s, '_') && !startswith(s, '#') &&
+       !startswith(s, '\'') && !startswith(s, '"') && s != "." && s != "?"
+        return s
+    end
+    # Use single quotes if possible
+    if !occursin('\'', s)
+        return "'$s'"
+    end
+    # Use double quotes
+    if !occursin('"', s)
+        return "\"$s\""
+    end
+    # Fall back to semicolon text block
+    return ";\n$s\n;"
+end


Do we expect anything other than (abstract) strings here? Silently converting seems fishy.
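
One way to address this, as a sketch rather than the PR's actual fix: restrict the method to `AbstractString`, so that non-string arguments fail at dispatch instead of being converted. The name `cif_quote_sketch` is hypothetical, and the quoting rules simply mirror the hunk above.

```julia
# Hypothetical variant of the helper above. Any AbstractString (String,
# SubString, ...) is accepted; other types raise a MethodError.
function cif_quote_sketch(s::AbstractString)
    isempty(s) && return "."
    # Simple values can be written verbatim.
    simple = !any(isspace, s) &&
             !(first(s) in ('_', '#', '\'', '"')) &&
             s != "." && s != "?"
    simple && return String(s)
    # Prefer single quotes, then double quotes, then a text block.
    occursin('\'', s) || return string('\'', s, '\'')
    occursin('"', s) || return string('"', s, '"')
    return string(";\n", s, "\n;")
end

cif_quote_sketch(SubString("ALA GLY", 1, 3))  # a SubString works unchanged
cif_quote_sketch("N 1")                       # whitespace forces quoting
```

Callers that really hold numbers would then have to call `string(x)` explicitly, which keeps the conversion visible at the call site.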

src/fileformats/mmcif/mmcif_writer.jl

+        occ = @sprintf("%.2f", get_property(a, :occupancy, 1.0))
+        bfac = @sprintf("%.2f", get_property(a, :tempfactor, 0.0))
+
+        charge = a.formal_charge == 0 ? "?" : string(a.formal_charge)
+
+        model_num = a.frame_id
+
+        println(io, "$group $( a.number) $type_sym $atom_name $alt_id $comp_id $chain_id $entity_id $seq_id $ins_code $x $y $z $occ $bfac $charge $seq_id $comp_id $chain_id $atom_name $model_num")


src/fileformats/mmcif/mmcif_reader.jl

+    # Use auth fields when available
+    c_beg_comp = get(cols, "_struct_conf.beg_auth_comp_id", get(cols, "_struct_conf.beg_label_comp_id", 0))
+    c_beg_asym = get(cols, "_struct_conf.beg_auth_asym_id", get(cols, "_struct_conf.beg_label_asym_id", 0))
+    c_beg_seq  = get(cols, "_struct_conf.beg_auth_seq_id", get(cols, "_struct_conf.beg_label_seq_id", 0))
+    c_end_comp = get(cols, "_struct_conf.end_auth_comp_id", get(cols, "_struct_conf.end_label_comp_id", 0))
+    c_end_asym = get(cols, "_struct_conf.end_auth_asym_id", get(cols, "_struct_conf.end_label_asym_id", 0))
+    c_end_seq  = get(cols, "_struct_conf.end_auth_seq_id", get(cols, "_struct_conf.end_label_seq_id", 0))
+


tkemmer

Most notably (that is, ignoring documentation-related comments), the mmCIF reader ignores all non-disulphide bonds and aggregates all chains by name.

tkemmer · 2026-03-19T09:51:42Z

src/fileformats/mmcif/mmcif_writer.jl

+"""
+    write_mmcif_impl(io::IO, ac::AbstractAtomContainer{T})
+
+Write an atom container as PDBx/mmCIF format to the given IO stream.
+"""


This should be included in docs/src/private/fileformats.md, as it currently causes CI to fail.

tkemmer · 2026-03-19T09:53:39Z

src/fileformats/mmcif/mmcif_writer.jl

+
+# ─── CIF value quoting ───────────────────────────────────────────────
+
+"""Quote a string value for CIF output."""


This (and all one-line docstrings here) should either be properly formatted as docstrings and included in docs/src/private/fileformats.md or reduced to simple comments. Otherwise, CI will fail.

tkemmer · 2026-03-19T09:55:52Z

src/fileformats/mmcif/mmcif_reader.jl

+"""
+    read_mmcif(fname_io::Union{AbstractString, IO}, ::Type{T} = Float32; create_coils::Bool = true) -> System{T}
+
+Read a PDBx/mmCIF file and return a System.
+
+Models are stored as frames, using the model number as `frame_id`.
+"""


This should be included in docs/src/private/fileformats.md, as it currently causes CI to fail.

tkemmer · 2026-03-19T09:57:04Z

src/fileformats/mmcif/mmcif_reader.jl

+
+# ─── Helpers ──────────────────────────────────────────────────────────
+
+"""Find a loop in the data block whose tags start with the given prefix."""


This (and all one-line docstrings here) should either be properly formatted as docstrings and included in docs/src/private/fileformats.md or reduced to simple comments. Otherwise, CI will fail.

tkemmer · 2026-03-19T09:57:12Z

src/fileformats/mmcif/mmcif_reader.jl

+    return nothing
+end
+
+"""Build a tag→column-index map for a loop."""


Same as above

tkemmer · 2026-03-19T09:58:01Z

src/fileformats/mmcif/mmcif_reader.jl

+    return ssbonds
+end
+
+"""Parse CIF symmetry operator string like '1_555' into an integer."""


Same as above

tkemmer · 2026-03-19T09:58:11Z

src/fileformats/mmcif/mmcif_reader.jl

+    end
+end
+
+"""Parse _struct_sheet_order to get sense values for each sheet range."""


Same as above

tkemmer · 2026-03-19T10:02:54Z

src/fileformats/mmcif/mmcif_reader.jl

+    # Set system name from data block name (or filename if available)
+    sys_name = if fname_io isa AbstractString
+        bn = basename(fname_io)
+        # strip extension
+        idx = findlast('.', bn)
+        isnothing(idx) ? bn : bn[1:idx-1]
+    else
+        block.name
+    end


This shouldn't really be read from the filename. We just removed this behavior from other readers/writers (cf. #277). The PDB ID is already included in the file itself.

On a related note, our (legacy) PDB reader uses the name from the HEADER record as system and molecule name instead of the PDB ID. However, this name does not seem to (necessarily?) exist in the newer PDBx/mmCIF files. See for example 5PTI, which is named "HYDROLASE INHIBITOR" when read from PDB and "5PTI" when read from PDBx/mmCIF. Not sure if we should drop the HEADER record name from our PDB reader just to be in line with PDBx/mmCIF.
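
A minimal sketch of taking the name from the file contents instead, assuming the entry ID is stored in the standard mmCIF item `_entry.id`. The `items` dictionary is only a stand-in for the single-value items of a data block (the actual `CIFDataBlock` layout in this PR may differ), and `system_name_sketch` is a hypothetical name.

```julia
# `items` stands in for the single-value items of a CIF data block.
function system_name_sketch(items::AbstractDict{<:AbstractString,<:AbstractString},
                            block_name::AbstractString)
    id = get(items, "_entry.id", "?")
    # "?" and "." are the CIF placeholders for unknown/inapplicable values;
    # in that case fall back to the data block name, never to the filename.
    return id in ("?", ".") ? String(block_name) : String(id)
end

system_name_sketch(Dict("_entry.id" => "5PTI"), "5PTI")
```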

tkemmer · 2026-03-19T10:09:36Z

src/fileformats/mmcif/mmcif_reader.jl

+        alt_id = (alt_id_raw == "." || alt_id_raw == "?") ? nothing : alt_id_raw
+
+        # Use auth_* fields when available (matches PDB convention)
+        chain_id = isnothing(c_auth_asym) ? row[c_asym_id] : _get(row, c_auth_asym, row[c_asym_id])


This reads atoms and hetero atoms into the same chain if they are named the same, although the atoms are read from different mmCIF loops.

See, for example, 5PTI:

julia> chains(sys_pdb)
ChainTable{Float32} with 2 rows:
┌───┬─────┬──────┐
│ # │ idx │ name │
├───┼─────┼──────┤
│ 1 │ 2   │ A    │
│ 2 │ 953 │ A    │
└───┴─────┴──────┘

julia> chains(sys_mmcif)
ChainTable{Float32} with 1 rows:
┌───┬─────┬──────┐
│ # │ idx │ name │
├───┼─────┼──────┤
│ 1 │ 2   │ A    │
└───┴─────┴──────┘

Our mmCIF reader should make the same distinction as our PDB reader and separate atoms and hetero atoms into different chains.
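
A rough sketch of such a split, assuming the chain key is extended by whether the record is a `HETATM`. The rows are stand-ins for `_atom_site` records (`group` mirrors `_atom_site.group_PDB`); `assign_chains_sketch` is a hypothetical name, and multi-model files are ignored here.

```julia
# Assign each row to a chain keyed by (chain name, is_hetero), so that
# ATOM and HETATM records with the same chain name end up in separate chains.
function assign_chains_sketch(rows)
    chain_keys = Tuple{String,Bool}[]
    chain_of_row = Int[]
    for r in rows
        key = (String(r.chain), r.group == "HETATM")
        idx = findfirst(==(key), chain_keys)
        if isnothing(idx)
            push!(chain_keys, key)
            idx = length(chain_keys)
        end
        push!(chain_of_row, idx)
    end
    return chain_keys, chain_of_row
end

rows = [
    (group = "ATOM",   chain = "A"),
    (group = "ATOM",   chain = "A"),
    (group = "HETATM", chain = "A"),
]
chain_keys, chain_of_row = assign_chains_sketch(rows)
```

With these three rows, two chains named `A` result, matching the shape of the PDB reader's output above.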

tkemmer · 2026-03-19T10:16:14Z

src/fileformats/mmcif/mmcif_reader.jl

+
+# ─── SSBond Parsing ──────────────────────────────────────────────────
+
+function _parse_ssbonds(block::CIFDataBlock)


What about non-disulphide bonds? These are completely ignored right now. Also, the TYPE__COVALENT flag is not set.
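
For illustration, a sketch of how `_struct_conn.conn_type_id` values could be mapped to bond flags. Only `TYPE__COVALENT` is taken from this review; `:disulfide` and `:hydrogen` are placeholder symbols, and `conn_flags_sketch` is a hypothetical name.

```julia
# Map a _struct_conn.conn_type_id value to a set of bond flags.
function conn_flags_sketch(conn_type::AbstractString)
    t = lowercase(conn_type)
    if t == "disulf"
        # Disulfide bridges are covalent bonds as well.
        return Set([:TYPE__COVALENT, :disulfide])
    elseif startswith(t, "covale")
        # Covers covale, covale_base, covale_phosphate and covale_sugar.
        return Set([:TYPE__COVALENT])
    elseif t == "hydrog"
        return Set([:hydrogen])
    else
        # Other types (e.g. metalc, saltbr) are left unclassified here.
        return Set{Symbol}()
    end
end

conn_flags_sketch("covale")
```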

Implement read/write support for PDBx/mmcif

9d82add

This commit adds a general CIF reader plus read and write functionality for the PDBx/mmCIF format. With this, we can also remove our dependency on BioStructures.jl. Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>

anhi requested review from Copilot, jeleclaire and tkemmer March 17, 2026 17:39

Copilot started reviewing on behalf of anhi March 17, 2026 17:39

Copilot AI reviewed Mar 17, 2026

anhi and others added 2 commits March 17, 2026 18:54

Handle corner cases in mmcif better.

1827ee7

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>

Fix bugs found by copilot pr review

9fd5e43

Signed-off-by: Andreas Hildebrandt <andreas.hildebrandt@uni-mainz.de>

tkemmer assigned anhi Mar 17, 2026

tkemmer added the enhancement New feature or request label Mar 17, 2026

tkemmer added this to the v0.7 milestone Mar 17, 2026

tkemmer requested changes Mar 19, 2026

3 participants