Feature/mmjson parser #67

N283T · 2025-12-10T02:05:27Z

📋 PR Checklist

This PR is tagged as a draft if it is still under development and not ready for review.

This avoids auto-triggering the slower tests in the CI and needlessly wasting resources.
I have ensured that all my commits follow angular commit message conventions.
Format: <type>[optional scope]: <subject>
Example: fix(af3): add missing crop transform to the af3 pipeline

This affects semantic versioning as follows:
- fix: patch version increment (0.0.1 → 0.0.2)
- feat: minor version increment (0.0.1 → 0.1.0)
- BREAKING CHANGE: major version increment (0.0.1 → 1.0.0)
- All other types do not affect versioning
The format ensures readable changelogs through auto-generation from commit messages.
I have run make format on the codebase before submitting the PR (this autoformats the code and lints it).
I have named the PR in angular PR message format as well (c.f. above), with a sensible tag line that summarizes all the changes in the PR.

This is useful as the name of the PR is the default name of the commit that will be used if you merge with a squash & merge.
Format: <type>[optional scope]: <subject>
Example: fix(af3): add missing crop transform to the af3 pipeline

ℹ️ PR Description

What changes were made and why?

This PR introduces support for parsing macromolecular JSON (mmJSON) files (see documentation), a format that is becoming increasingly popular for web-based structure visualization and transmission.

Key changes include:

File Type Inference:
- Updated atomworks.io.utils.io_utils.infer_pdb_file_type to correctly identify files with .mmjson or .json extensions as mmjson type.
- Updated CIF_LIKE_EXTENSIONS to include .mmjson and .mmjson.gz.
Verification:
- Created a standalone verification script mmjson_verification/verify_mmjson.py to ensure correctness.
- Added example data (2hhb.json.gz and 2hhb.cif.gz) in mmjson_verification/ to compare mmJSON parsing results against the standard CIF parser.
- Verification script confirms that atom counts, coordinates, elements, residue names, and chain IDs match between the two formats.

Note: I am still new to contributing to this project, so any feedback or suggestions for improvement would be greatly appreciated. Thank you!

…parsing utilities.

…ntroduce a parsing verification script with example data.

nscorley

Thank you for this thoughtful PR this is a great addition! A few comments on tests/style; it would also be worth running make format on the repository to clean up the formatting so the linter tests pass

nscorley · 2025-12-10T17:07:39Z

mmjson_verification/2hhb.cif.gz

Could we please move these files to tests/data/io? You'll see other similar test files there

nscorley · 2025-12-10T17:10:53Z

mmjson_verification/verify_mmjson.py

In this project, we've been trying to use pytest - could we please convert this into a pytest-style test example, maybe in tests/io/tools?

nscorley · 2025-12-10T17:11:24Z

mmjson_verification/verify_mmjson.py

+    from atomworks.io.parser import parse
+    from atomworks.io.utils.io_utils import infer_pdb_file_type
+
+PROJECT_ROOT = Path(__file__).parent.parent


If we correctly set up via pytest, the paths should be configured correctly already so we will be able to remove this

nscorley · 2025-12-10T17:11:50Z

mmjson_verification/verify_mmjson.py

+    print("\n1. Testing File Type Inference...")
+    inferred_type = infer_pdb_file_type(json_path)
+    if inferred_type == "mmjson":
+        print(f"✅ Correctly inferred 'mmjson' from {json_path.name}")


When we convert to pytest, best-practice would be using assert rather than print statements

nscorley · 2025-12-10T17:12:59Z

mmjson_verification/verify_mmjson.py

+        print("❌ Residue names mismatch.")
+
+    # Compare Chain IDs
+    if np.array_equal(atoms_json.chain_id, atoms_cif.chain_id):


We have a utility for this - assert_same_atom_array - that should save quite a bit of code! We could just load into an atom_array with both methods and then use assert_same_atom_array which would be much more concise

nscorley · 2025-12-10T17:55:56Z

src/atomworks/io/utils/io_utils.py

        second_line = path_or_buffer.readline().strip()
        path_or_buffer.seek(0)
-        return "cif" if second_line.startswith("#") else "pdb"
+        path_or_buffer.seek(0)


The repeated seeks here are a little difficult to read -- maybe we just grab the first and second chars once then pattern match? perhaps something along the lines of:

def _infer_filetype(path_or_buffer: io.BytesIO | io.StringIO) -> str: """Infer file type from buffer contents.""" if isinstance(path_or_buffer, io.BytesIO): return "bcif" # StringIO - peek at contents to determine format path_or_buffer.seek(0) first_char = path_or_buffer.read(1) path_or_buffer.readline() # finish first line second_line = path_or_buffer.readline() path_or_buffer.seek(0) if first_char == "{": return "mmjson" if second_line.startswith("#"): return "cif" return "pdb" ```

nscorley · 2025-12-10T17:57:03Z

src/atomworks/io/utils/io_utils.py

+def _read_mmjson(file_obj: io.StringIO | io.BytesIO | io.TextIOWrapper) -> pdbx.CIFFile:
+    """Read an mmjson file into a CIFFile object."""
+    data = json.load(file_obj)
+    cif_file = pdbx.CIFFile()


… inference in `io_utils.py`.

N283T · 2025-12-11T00:49:31Z

Thanks for the thoughtful review! I've addressed your feedback with the following changes:

Test Data: Moved 2hhb.cif.gz and 2hhb.json.gz to tests/data/io to align with the project structure.
Tests: Converted the verification script into a proper pytest-style test in tests/io/tools/test_mmjson.py. I also switched to using assert_same_atom_array for more concise and robust assertions.
Refactoring: Refactored infer_pdb_file_type in io_utils.py to _infer_file_type_from_buffer to improve readability and avoid repeated seeks.
Formatting: Ran make format to ensure the code complies with the project's style guidelines.

Let me know if any additional changes are needed. Happy to adjust!

nscorley

Looks good! Two minor changes:

Use the existing fixture for TEST_DATA_DIR
I think we still need to commit the test files - they might be being ignored so you need to git add -f

nscorley · 2025-12-11T18:21:39Z

tests/io/tools/test_mmjson.py

+
+# Tests should look for data relative to their location or use a fixture
+# Based on project structure, tests/data is likely where we moved the files
+TEST_DATA_DIR = Path(__file__).parent.parent.parent / "data" / "io"


We have a fixture for this that we can import from conftest (as it looks like Claude suggested 😄, TEST_DATA_DIR )

nscorley · 2025-12-11T18:22:56Z

tests/io/tools/test_mmjson.py

+
+    # 4. Compare Results
+    # This utility checks atom count, coordinates, and other annotations
+    assert_same_atom_array(atoms_json, atoms_cif)


… data directory fixture

N283T · 2025-12-12T00:04:27Z

Thanks for the review!

I've addressed the feedback:

Updated tests/io/tools/test_mmjson.py to use the TEST_DATA_DIR fixture from tests.conftest.
Force-added the test data files (tests/data/io/2hhb.cif.gz and tests/data/io/2hhb.json.gz) to git so they are no longer ignored.

(And yes — this time the AI assistant was Gemini Pro 😄)

nscorley · 2025-12-12T07:07:43Z

Thanks! Looks great. I'm still seeing a linting error -- you can address with make format in the directory root; after that it'll be good to go!

N283T · 2025-12-12T07:31:33Z

Ran make format from the repository root to resolve the remaining linting issue. Thanks!

github-actions · 2025-12-19T23:54:04Z

🎉 This PR is included in version 2.2.0 🎉

The release is available on:

v2.2.0
GitHub release

Your semantic-release bot 📦🚀

…ltiple representative/center atoms per token (RosettaCommons#67) Co-authored-by: Rafi Brent <rafi.brent@gmail.com>

# [1.1.0](RosettaCommons/atomworks@v1.0.3...v1.1.0) (2025-11-29) ### Bug Fixes * a couple of Condition updates ([RosettaCommons#59](RosettaCommons#59)) ([a0f173b](RosettaCommons@a0f173b)) * add a flag to optionally tolerate the situation of missing or multiple representative/center atoms per token ([RosettaCommons#67](RosettaCommons#67)) ([896ca38](RosettaCommons@896ca38)) * add errors for cases where parsing an AtomArrayPlus is problematic ([80d00e3](RosettaCommons@80d00e3)) * add padding for short residue names in sharding ([94abca1](RosettaCommons@94abca1)) * add raise_if_not_set to get_msa_dirs_from_env ([0f50eb2](RosettaCommons@0f50eb2)) * add raise_if_not_set to get_msa_dirs_from_env ([c7bfee6](RosettaCommons@c7bfee6)) * add within poly res idx on-the-fly option ([c80104b](RosettaCommons@c80104b)) * address code review issues in performance PR ([9cda94a](RosettaCommons@9cda94a)) * allow for deleting 2d annotations from AtomArrayPlus ([84e0084](RosettaCommons@84e0084)) * allow numpy masks in addition to query syntax in SampleSeed ([3468645](RosettaCommons@3468645)) * allow override in add global token id transform ([381a743](RosettaCommons@381a743)) * apptainer ([37fe7a2](RosettaCommons@37fe7a2)) * apptainer ([e3bd135](RosettaCommons@e3bd135)) * apptainer ([0647b20](RosettaCommons@0647b20)) * apptainer for CI ([670a82a](RosettaCommons@670a82a)) * bcif tests ([378d03a](RosettaCommons@378d03a)) * Be more robust to nulls in a3m files. ([a8552a4](RosettaCommons@a8552a4)) * better messages and assertions for removing design tasks with 0 frequency ([9b6391d](RosettaCommons@9b6391d)) * broken tests ([1fcc397](RosettaCommons@1fcc397)) * bug in default seq cond mask ([00484d4](RosettaCommons@00484d4)) * change default sequence condition behavior ([981d924](RosettaCommons@981d924)) * ci for internal ([5096b6c](RosettaCommons@5096b6c)) * ci workers ([f7da3cd](RosettaCommons@f7da3cd)) * circular import ([7eafda9](RosettaCommons@7eafda9)) * claude code review ([885eb53](RosettaCommons@885eb53)) * condition set mask and terminus conditions changes ([56f661c](RosettaCommons@56f661c)) * correct cache dir structure and add padding for short IDs ([b8645fc](RosettaCommons@b8645fc)) * correct sharding path construction for cached residue data ([79d388b](RosettaCommons@79d388b)) * **databases:** correcting uniontype call bug ([8a3e59e](RosettaCommons@8a3e59e)) * **databases:** correcting uniontype call bug ([ebc26db](RosettaCommons@ebc26db)) * datasets documentation, DSSP path ([1565a94](RosettaCommons@1565a94)) * docstring formatting ([ff296d6](RosettaCommons@ff296d6)) * documentation ([1e5b0d4](RosettaCommons@1e5b0d4)) * documentation, formatting ([dcdde14](RosettaCommons@dcdde14)) * downgrade biotite ([cab5bcf](RosettaCommons@cab5bcf)) * enable deletion of 2d annotations ([1f88391](RosettaCommons@1f88391)) * enable spawn multiprocessing ([36ac421](RosettaCommons@36ac421)) * ensure that parse preserves AtomArrayPlus status, and add a test for this ([681fdeb](RosettaCommons@681fdeb)) * ensure that the Index condition's default annotation respects its mask ([RosettaCommons#50](RosettaCommons#50)) ([e57be2a](RosettaCommons@e57be2a)) * Formatting ([f6fe986](RosettaCommons@f6fe986)) * general masks in SampleSeed ([53df9a6](RosettaCommons@53df9a6)) * give more informative error messages for ConditionalRoute or RandomRoute failures ([3b72b18](RosettaCommons@3b72b18)) * Handle non-uniform shard sizes in AseDBDataset ([e34eb51](RosettaCommons@e34eb51)) * infer array type of TokenEncoding where possible ([RosettaCommons#68](RosettaCommons#68)) ([a6a8fb1](RosettaCommons@a6a8fb1)) * informative Route error messages ([87b1fbc](RosettaCommons@87b1fbc)) * minor fixes ([f45fafb](RosettaCommons@f45fafb)) * minor fixes for encodings ([e043104](RosettaCommons@e043104)) * more informative error messages ([7861023](RosettaCommons@7861023)) * parse preserves atom array plus ([d1eef92](RosettaCommons@d1eef92)) * parser defaults ([570f3ce](RosettaCommons@570f3ce)) * reduce logging level in `load_atom_array_with_conditions_from_cif` ([RosettaCommons#48](RosettaCommons#48)) ([52f316d](RosettaCommons@52f316d)) * remove _spoof ([995a260](RosettaCommons@995a260)) * remove ambiguous Greek characters and improve test assertions ([bad6dff](RosettaCommons@bad6dff)) * remove ASE import so we dont introduce a dependency ([7e12a8a](RosettaCommons@7e12a8a)) * remove design tasks with zero frequency during sampling instead of erroring ([4586fa2](RosettaCommons@4586fa2)) * remove hardcoded environment-specific default path ([24bf03f](RosettaCommons@24bf03f)) * remove lineprofiler stuff ([83fb3c5](RosettaCommons@83fb3c5)) * remove print statements ([ecd9e5b](RosettaCommons@ecd9e5b)) * residue starts bug with dependent functions. ([04da354](RosettaCommons@04da354)) * residue starts bug with dependent functions. ([fc252a8](RosettaCommons@fc252a8)) * **rf3:** json-level atom selection for bonds argument ([a569a7c](RosettaCommons@a569a7c)) * ruff formatting after merge with dev ([71ddb86](RosettaCommons@71ddb86)) * sasa for empty aa ([c2b9302](RosettaCommons@c2b9302)) * shard cache on structure ID (PDB ID) instead of args hash ([e5c29fd](RosettaCommons@e5c29fd)) * Support AtomArrayPlus and AtomArrayPlusStack in parse_atom_array, with some restrictions ([RosettaCommons#46](RosettaCommons#46)) ([c1e3b00](RosettaCommons@c1e3b00)) * test requirements ([4c4c775](RosettaCommons@4c4c775)) * tests ([eb10ba6](RosettaCommons@eb10ba6)) * typing ([8d6e39d](RosettaCommons@8d6e39d)) * updates for ame ([6530818](RosettaCommons@6530818)) * use warnings.warn instead of print for missing ase ([b6f9c7a](RosettaCommons@b6f9c7a)) ### Features * add as_atom_array_plus_stack ([804c854](RosettaCommons@804c854)) * add Claude Hode GitHub workflow ([0f4e603](RosettaCommons@0f4e603)) * add customizable annotation names for SSE validity flag ([fd6be37](RosettaCommons@fd6be37)) * add DSSP secondary structure annotation ([f1cc785](RosettaCommons@f1cc785)) * add molprobity executable? ([b87f76b](RosettaCommons@b87f76b)) * add utomatic msa depth and ext determination ([874a1dd](RosettaCommons@874a1dd)) * adding Conditions, annotator, and io utils ([b08c935](RosettaCommons@b08c935)) * atom37 with atomization encoding ([d35be40](RosettaCommons@d35be40)) * atomworks MSA preprocessing (generate, filter, organize) ([ef6ac85](RosettaCommons@ef6ac85)) * backbone constants and annotators ([de5854c](RosettaCommons@de5854c)) * **databases:** add experimental data database functionality from datahub ([f69a499](RosettaCommons@f69a499)) * **databases:** add experimental data database functionality from datahub ([8b71614](RosettaCommons@8b71614)) * generalize encodings ([c886d09](RosettaCommons@c886d09)) * generalize encodings ([83c8684](RosettaCommons@83c8684)) * Make SampleDesignTask handle empty design_tasks gracefully ([c9e5900](RosettaCommons@c9e5900)) * misc fixes for design ([71decdf](RosettaCommons@71decdf)) * MSA preprocessing (organizing, generating, filtering) ([89eb73d](RosettaCommons@89eb73d)) * MSA preprocessing; zst support ([aa3c837](RosettaCommons@aa3c837)) * paired MSA feature (from Rafi) ([805a27a](RosettaCommons@805a27a)) * paired MSA feature (from Rafi) ([2df1511](RosettaCommons@2df1511)) * skip for large structures ([6f2f33c](RosettaCommons@6f2f33c)) * update to work with transition metals ([6f2a80a](RosettaCommons@6f2a80a)) ### Performance Improvements * don't actually check symmetry every time on symmetric conditions since there's just too much overhead ([6ae656a](RosettaCommons@6ae656a)) * improve cache loading and template building performance ([c144ec7](RosettaCommons@c144ec7)) * performance optimizations for parsing ([e8faf3e](RosettaCommons@e8faf3e)) * precalculate bond graph for GrowByHoppingAlongBondGraph ([fd85178](RosettaCommons@fd85178)) * res idx ([22703bf](RosettaCommons@22703bf)) * res idx optimization ([11b5d0d](RosettaCommons@11b5d0d)) * speed up 2d symmetrization check and stop actually checking symmetry every time and make some other stuff faster ([5c3127b](RosettaCommons@5c3127b)) * speed up apply_and_spread ([ac612a4](RosettaCommons@ac612a4)) * speed up bond_graph creation ([94f0512](RosettaCommons@94f0512))

N283T added 2 commits December 10, 2025 10:32

feat: Add example CIF/JSON data, update dependencies, and modify I/O …

d29653c

…parsing utilities.

feat: Add mmjson file type support, update file type inference, and i…

88a7282

…ntroduce a parsing verification script with example data.

nscorley requested a review from Croydon-Brixton December 10, 2025 17:06

nscorley requested changes Dec 10, 2025

View reviewed changes

feat: add automated mmJSON parsing test and refactor buffer file type…

2f0ea72

… inference in `io_utils.py`.

nscorley requested changes Dec 11, 2025

View reviewed changes

test: add 2hhb mmJSON and CIF test data and update test to use shared…

b37aaf9

… data directory fixture

style: apply ruff formatting and lint fixes

16e6a6f

style: apply formatting fixes

4941565

nscorley merged commit 219d69e into RosettaCommons:production Dec 19, 2025
2 checks passed

github-actions bot added the released label Dec 19, 2025

N283T deleted the feature/mmjson-parser branch December 20, 2025 00:30

partrita pushed a commit to partrita/atomworks that referenced this pull request Jan 19, 2026

fix: add a flag to optionally tolerate the situation of missing or mu…

896ca38

…ltiple representative/center atoms per token (RosettaCommons#67) Co-authored-by: Rafi Brent <rafi.brent@gmail.com>

Feature/mmjson parser #67

Feature/mmjson parser #67

Uh oh!

Conversation

N283T commented Dec 10, 2025

📋 PR Checklist

ℹ️ PR Description

What changes were made and why?

Uh oh!

nscorley left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

N283T commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nscorley left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

N283T commented Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nscorley commented Dec 12, 2025

Uh oh!

N283T commented Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

N283T commented Dec 11, 2025 •

edited

Loading

N283T commented Dec 12, 2025 •

edited

Loading

N283T commented Dec 12, 2025 •

edited

Loading