Skip to content

Comments

feat(mmcif/assembly): Initialize from_mmcif#17

Open
khs01654 wants to merge 3 commits intomainfrom
mmcif-assembly
Open

feat(mmcif/assembly): Initialize from_mmcif#17
khs01654 wants to merge 3 commits intomainfrom
mmcif-assembly

Conversation

@khs01654
Copy link

@khs01654 khs01654 commented Feb 24, 2026

Checklist

If the change is related to the source code, tests, or build environments, please also check the following:

  • Does pytest -vs pass without any errors and warnings (at the project root)?
  • Does mypy --pretty pass without any errors and warnings (at the project root)?

If you added new feature(s), then also check the following:

  • Did you also add corresponding tests?

Linked Issues

Link the tracking issue(s) of this PR, with the auto-close keywords here:

  • Closes #...


Note

Medium Risk
Touches core mmCIF export and introduces non-trivial filtering that must keep multiple interdependent tables consistent; mistakes could silently produce invalid or incomplete mmCIF output.

Overview
Assembly now stores the originating Mmcif (metadata) and propagates it through filter, transform, and join, so derived assemblies retain mmCIF context.

Assembly.to_mmcif is expanded to emit additional mmCIF header/experimental blocks and more complete atom_site/struct_conn fields, including placeholder pdbx_struct_assembly/pdbx_struct_oper_list identity data.

Mmcif gains filter/filter_chains to subset a structure while consistently pruning dependent tables (entities, schemes, connections, assembly gens, branch links), plus a new Mmcif.to_mmcif() writer to round-trip the parsed model back to mmCIF text.

Written by Cursor Bugbot for commit 1b19fc2. This will update automatically on new commits. Configure here.

@khs01654 khs01654 requested review from a team as code owners February 24, 2026 11:32
@khs01654 khs01654 added the feature New feature or request label Feb 24, 2026
@codecov
Copy link

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 5.88235% with 112 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.42%. Comparing base (dfc92a6) to head (1b19fc2).

Files with missing lines Patch % Lines
src/gmol/base/data/mmcif/parse.py 4.46% 107 Missing ⚠️
src/gmol/base/data/mmcif/assembly.py 28.57% 5 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
- Coverage   82.58%   77.42%   -5.17%     
==========================================
  Files          14       14              
  Lines        1654     1772     +118     
  Branches      229      258      +29     
==========================================
+ Hits         1366     1372       +6     
- Misses        227      339     +112     
  Partials       61       61              
Files with missing lines Coverage Δ
src/gmol/base/data/mmcif/assembly.py 77.29% <28.57%> (-0.74%) ⬇️
src/gmol/base/data/mmcif/parse.py 68.39% <4.46%> (-22.95%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@bbh-pharm bbh-pharm added the DO NOT MERGE Don't merge yet label Feb 24, 2026
@bbh-pharm
Copy link
Contributor

bug가 좀 있습니다..! 다시 쓰는 과정에서 없어지는 field가 있어서 읽을 때 에러가 나요....! @jnooree

@cursor
Copy link

cursor bot commented Feb 24, 2026

Bugbot Autofix prepared fixes for 3 of the 3 bugs found in the latest run.

  • ✅ Fixed: Polymer chains with branches misclassified as Ligand
    • Added conditional check to only set chain type to Ligand for branch_scheme if the chain is not already typed as a polymer from poly_seq_scheme.
  • ✅ Fixed: Duplicate seqres entries for polymer chains with branches
    • Added deduplication logic to filter out branch schemes with seq_ids already present in poly_schemes before concatenating.
  • ✅ Fixed: StopIteration crash on empty or invalid CIF file
    • Wrapped next(read_cif()) in try/except to catch StopIteration and raise a descriptive ValueError instead.

Create PR

Or push these changes by commenting:

@cursor push e941e73c8a
Preview (e941e73c8a)
diff --git a/src/gmol/base/data/mmcif/assembly.py b/src/gmol/base/data/mmcif/assembly.py
--- a/src/gmol/base/data/mmcif/assembly.py
+++ b/src/gmol/base/data/mmcif/assembly.py
@@ -549,7 +549,12 @@
         """
         from nuri.fmt import cif_ddl2_frame_as_dict, read_cif
 
-        data = next(read_cif(cif_file)).data
+        try:
+            data = next(read_cif(cif_file)).data
+        except StopIteration:
+            raise ValueError(
+                f"Empty or invalid CIF file: {cif_file}"
+            ) from None
         mmcif_dict = cif_ddl2_frame_as_dict(data)
 
         # Build CCD from embedded chem_comp data
@@ -644,9 +649,11 @@
         for chain_id, seq_infos in poly_seq_scheme.items():
             chain_types[chain_id] = polymer_mol_type(seq_infos, ccd)
         for chain_id in branch_scheme:
-            chain_types[chain_id] = MolType.Ligand
+            if chain_id not in chain_types:
+                chain_types[chain_id] = MolType.Ligand
         for chain_id in nonpoly_scheme:
-            chain_types[chain_id] = MolType.Ligand
+            if chain_id not in chain_types:
+                chain_types[chain_id] = MolType.Ligand
 
         # Fallback for chains with no scheme data
         for chain_id in struct_asym:
@@ -691,10 +698,15 @@
 
         # Set seqres and branches for each chain
         for cid, chain in chains.items():
+            poly_schemes = poly_seq_scheme.get(cid, [])
+            poly_seq_ids = {s.seq_id for s in poly_schemes}
+            branch_schemes = [
+                s
+                for s in branch_scheme.get(cid, [])
+                if s.seq_id not in poly_seq_ids
+            ]
             chain.seqres = _select_het_seq_scheme(
-                poly_seq_scheme.get(cid, [])
-                + branch_scheme.get(cid, [])
-                + nonpoly_scheme.get(cid, []),
+                poly_schemes + branch_schemes + nonpoly_scheme.get(cid, []),
                 residues,
             )
             chain.branches = _map_branches(
This Bugbot Autofix run was free. To enable Autofix for future PRs, go to the Cursor dashboard.

@bbh-pharm
Copy link
Contributor

원래 Assembly.to_mmcif로 써도 다시 load_mmcif_single로 다시 읽을 수 없었는데, 그 부분 수정했습니다. (쓰는 field들 확인 부탁드립니다).
Mmcif class의 역할을 제대로 이해했는지 모르겠는데 metadata 같은 역할이었던 것 같습니다. 근데 Assembly cif를 읽을 때 다시 assembly로 만드는게 번거로워서 filter랑 mmcif 쓰는 부분 추가했는데, 이것도 확인 부탁드립니다.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

"0.0",
)
],
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing pdbx_struct_assembly_gen block in Assembly.to_mmcif()

High Severity

Assembly.to_mmcif() writes pdbx_struct_assembly and pdbx_struct_oper_list blocks but never writes the required pdbx_struct_assembly_gen block that links them together. When the output is read back, the _join_bioassembly validator finds no pdbx_struct_assembly_gen entries, leaving each BioAssembly with an empty assembly_gens list. This causes mmcif_assemblies() to call _apply_transforms with an empty generators list, ultimately crashing in Assembly.join([]) via np.concatenate on an empty list. The analogous Mmcif.to_mmcif() method correctly writes pdbx_struct_assembly_gen.

Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DO NOT MERGE Don't merge yet feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants