Skip to content

Conversation

@N283T
Copy link
Contributor

@N283T N283T commented Dec 10, 2025

📋 PR Checklist

  • This PR is tagged as a draft if it is still under development and not ready for review.

    This avoids auto-triggering the slower tests in the CI and needlessly wasting resources.

  • I have ensured that all my commits follow angular commit message conventions.

    Format: <type>[optional scope]: <subject>
    Example: fix(af3): add missing crop transform to the af3 pipeline

    This affects semantic versioning as follows:

    • fix: patch version increment (0.0.1 → 0.0.2)
    • feat: minor version increment (0.0.1 → 0.1.0)
    • BREAKING CHANGE: major version increment (0.0.1 → 1.0.0)
    • All other types do not affect versioning

    The format ensures readable changelogs through auto-generation from commit messages.

  • I have run make format on the codebase before submitting the PR (this autoformats the code and lints it).

  • I have named the PR in angular PR message format as well (c.f. above), with a sensible tag line that summarizes all the changes in the PR.

    This is useful as the name of the PR is the default name of the commit that will be used if you merge with a squash & merge.
    Format: <type>[optional scope]: <subject>
    Example: fix(af3): add missing crop transform to the af3 pipeline


ℹ️ PR Description

What changes were made and why?

This PR introduces support for parsing macromolecular JSON (mmJSON) files (see documentation), a format that is becoming increasingly popular for web-based structure visualization and transmission.

Key changes include:

  1. File Type Inference:

    • Updated atomworks.io.utils.io_utils.infer_pdb_file_type to correctly identify files with .mmjson or .json extensions as mmjson type.
    • Updated CIF_LIKE_EXTENSIONS to include .mmjson and .mmjson.gz.
  2. Verification:

    • Created a standalone verification script mmjson_verification/verify_mmjson.py to ensure correctness.
    • Added example data (2hhb.json.gz and 2hhb.cif.gz) in mmjson_verification/ to compare mmJSON parsing results against the standard CIF parser.
    • Verification script confirms that atom counts, coordinates, elements, residue names, and chain IDs match between the two formats.

Note: I am still new to contributing to this project, so any feedback or suggestions for improvement would be greatly appreciated. Thank you!

Copy link
Collaborator

@nscorley nscorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this thoughtful PR this is a great addition! A few comments on tests/style; it would also be worth running make format on the repository to clean up the formatting so the linter tests pass

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please move these files to tests/data/io? You'll see other similar test files there

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this project, we've been trying to use pytest - could we please convert this into a pytest-style test example, maybe in tests/io/tools?

from atomworks.io.parser import parse
from atomworks.io.utils.io_utils import infer_pdb_file_type

PROJECT_ROOT = Path(__file__).parent.parent
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we correctly set up via pytest, the paths should be configured correctly already so we will be able to remove this

print("\n1. Testing File Type Inference...")
inferred_type = infer_pdb_file_type(json_path)
if inferred_type == "mmjson":
print(f"✅ Correctly inferred 'mmjson' from {json_path.name}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we convert to pytest, best-practice would be using assert rather than print statements

print("❌ Residue names mismatch.")

# Compare Chain IDs
if np.array_equal(atoms_json.chain_id, atoms_cif.chain_id):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a utility for this - assert_same_atom_array - that should save quite a bit of code! We could just load into an atom_array with both methods and then use assert_same_atom_array which would be much more concise

second_line = path_or_buffer.readline().strip()
path_or_buffer.seek(0)
return "cif" if second_line.startswith("#") else "pdb"
path_or_buffer.seek(0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repeated seeks here are a little difficult to read -- maybe we just grab the first and second chars once then pattern match? perhaps something along the lines of:

def _infer_filetype(path_or_buffer: io.BytesIO | io.StringIO) -> str:
    """Infer file type from buffer contents."""
    if isinstance(path_or_buffer, io.BytesIO):
        return "bcif"
    
    # StringIO - peek at contents to determine format
    path_or_buffer.seek(0)
    first_char = path_or_buffer.read(1)
    path_or_buffer.readline()  # finish first line
    second_line = path_or_buffer.readline()
    path_or_buffer.seek(0)
    
    if first_char == "{":
        return "mmjson"
    if second_line.startswith("#"):
        return "cif"
    return "pdb"
    ```

def _read_mmjson(file_obj: io.StringIO | io.BytesIO | io.TextIOWrapper) -> pdbx.CIFFile:
"""Read an mmjson file into a CIFFile object."""
data = json.load(file_obj)
cif_file = pdbx.CIFFile()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice job!

@N283T
Copy link
Contributor Author

N283T commented Dec 11, 2025

Thanks for the thoughtful review! I've addressed your feedback with the following changes:

  • Test Data: Moved 2hhb.cif.gz and 2hhb.json.gz to tests/data/io to align with the project structure.
  • Tests: Converted the verification script into a proper pytest-style test in tests/io/tools/test_mmjson.py. I also switched to using assert_same_atom_array for more concise and robust assertions.
  • Refactoring: Refactored infer_pdb_file_type in io_utils.py to _infer_file_type_from_buffer to improve readability and avoid repeated seeks.
  • Formatting: Ran make format to ensure the code complies with the project's style guidelines.

Let me know if any additional changes are needed. Happy to adjust!

Copy link
Collaborator

@nscorley nscorley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Two minor changes:

  1. Use the existing fixture for TEST_DATA_DIR
  2. I think we still need to commit the test files - they might be being ignored so you need to git add -f


# Tests should look for data relative to their location or use a fixture
# Based on project structure, tests/data is likely where we moved the files
TEST_DATA_DIR = Path(__file__).parent.parent.parent / "data" / "io"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a fixture for this that we can import from conftest (as it looks like Claude suggested 😄, TEST_DATA_DIR )


# 4. Compare Results
# This utility checks atom count, coordinates, and other annotations
assert_same_atom_array(atoms_json, atoms_cif)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

@N283T
Copy link
Contributor Author

N283T commented Dec 12, 2025

Thanks for the review!

I've addressed the feedback:

  • Updated tests/io/tools/test_mmjson.py to use the TEST_DATA_DIR fixture from tests.conftest.
  • Force-added the test data files (tests/data/io/2hhb.cif.gz and tests/data/io/2hhb.json.gz) to git so they are no longer ignored.

(And yes — this time the AI assistant was Gemini Pro 😄)

@nscorley
Copy link
Collaborator

Thanks! Looks great. I'm still seeing a linting error -- you can address with make format in the directory root; after that it'll be good to go!

@N283T
Copy link
Contributor Author

N283T commented Dec 12, 2025

Ran make format from the repository root to resolve the remaining linting issue. Thanks!

@nscorley nscorley merged commit 219d69e into RosettaCommons:production Dec 19, 2025
2 checks passed
@github-actions
Copy link
Contributor

🎉 This PR is included in version 2.2.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@N283T N283T deleted the feature/mmjson-parser branch December 20, 2025 00:30
partrita pushed a commit to partrita/atomworks that referenced this pull request Jan 19, 2026
…ltiple representative/center atoms per token (RosettaCommons#67)

Co-authored-by: Rafi Brent <rafi.brent@gmail.com>
partrita pushed a commit to partrita/atomworks that referenced this pull request Jan 19, 2026
# [1.1.0](RosettaCommons/atomworks@v1.0.3...v1.1.0) (2025-11-29)

### Bug Fixes

* a couple of Condition updates ([RosettaCommons#59](RosettaCommons#59)) ([a0f173b](RosettaCommons@a0f173b))
* add a flag to optionally tolerate the situation of missing or multiple representative/center atoms per token ([RosettaCommons#67](RosettaCommons#67)) ([896ca38](RosettaCommons@896ca38))
* add errors for cases where parsing an AtomArrayPlus is problematic ([80d00e3](RosettaCommons@80d00e3))
* add padding for short residue names in sharding ([94abca1](RosettaCommons@94abca1))
* add raise_if_not_set to get_msa_dirs_from_env ([0f50eb2](RosettaCommons@0f50eb2))
* add raise_if_not_set to get_msa_dirs_from_env ([c7bfee6](RosettaCommons@c7bfee6))
* add within poly res idx on-the-fly option ([c80104b](RosettaCommons@c80104b))
* address code review issues in performance PR ([9cda94a](RosettaCommons@9cda94a))
* allow for deleting 2d annotations from AtomArrayPlus ([84e0084](RosettaCommons@84e0084))
* allow numpy masks in addition to query syntax in SampleSeed ([3468645](RosettaCommons@3468645))
* allow override in add global token id transform ([381a743](RosettaCommons@381a743))
* apptainer ([37fe7a2](RosettaCommons@37fe7a2))
* apptainer ([e3bd135](RosettaCommons@e3bd135))
* apptainer ([0647b20](RosettaCommons@0647b20))
* apptainer for CI ([670a82a](RosettaCommons@670a82a))
* bcif tests ([378d03a](RosettaCommons@378d03a))
* Be more robust to nulls in a3m files. ([a8552a4](RosettaCommons@a8552a4))
* better messages and assertions for removing design tasks with 0 frequency ([9b6391d](RosettaCommons@9b6391d))
* broken tests ([1fcc397](RosettaCommons@1fcc397))
* bug in default seq cond mask ([00484d4](RosettaCommons@00484d4))
* change default sequence condition behavior ([981d924](RosettaCommons@981d924))
* ci for internal ([5096b6c](RosettaCommons@5096b6c))
* ci workers ([f7da3cd](RosettaCommons@f7da3cd))
* circular import ([7eafda9](RosettaCommons@7eafda9))
* claude code review ([885eb53](RosettaCommons@885eb53))
* condition set mask and terminus conditions changes ([56f661c](RosettaCommons@56f661c))
* correct cache dir structure and add padding for short IDs ([b8645fc](RosettaCommons@b8645fc))
* correct sharding path construction for cached residue data ([79d388b](RosettaCommons@79d388b))
* **databases:** correcting uniontype call bug ([8a3e59e](RosettaCommons@8a3e59e))
* **databases:** correcting uniontype call bug ([ebc26db](RosettaCommons@ebc26db))
* datasets documentation, DSSP path ([1565a94](RosettaCommons@1565a94))
* docstring formatting ([ff296d6](RosettaCommons@ff296d6))
* documentation ([1e5b0d4](RosettaCommons@1e5b0d4))
* documentation, formatting ([dcdde14](RosettaCommons@dcdde14))
* downgrade biotite ([cab5bcf](RosettaCommons@cab5bcf))
* enable deletion of 2d annotations ([1f88391](RosettaCommons@1f88391))
* enable spawn multiprocessing ([36ac421](RosettaCommons@36ac421))
* ensure that parse preserves AtomArrayPlus status, and add a test for this ([681fdeb](RosettaCommons@681fdeb))
* ensure that the Index condition's default annotation respects its mask ([RosettaCommons#50](RosettaCommons#50)) ([e57be2a](RosettaCommons@e57be2a))
* Formatting ([f6fe986](RosettaCommons@f6fe986))
* general masks in SampleSeed ([53df9a6](RosettaCommons@53df9a6))
* give more informative error messages for ConditionalRoute or RandomRoute failures ([3b72b18](RosettaCommons@3b72b18))
* Handle non-uniform shard sizes in AseDBDataset ([e34eb51](RosettaCommons@e34eb51))
* infer array type of TokenEncoding where possible ([RosettaCommons#68](RosettaCommons#68)) ([a6a8fb1](RosettaCommons@a6a8fb1))
* informative Route error messages ([87b1fbc](RosettaCommons@87b1fbc))
* minor fixes ([f45fafb](RosettaCommons@f45fafb))
* minor fixes for encodings ([e043104](RosettaCommons@e043104))
* more informative error messages ([7861023](RosettaCommons@7861023))
* parse preserves atom array plus ([d1eef92](RosettaCommons@d1eef92))
* parser defaults ([570f3ce](RosettaCommons@570f3ce))
* reduce logging level in `load_atom_array_with_conditions_from_cif` ([RosettaCommons#48](RosettaCommons#48)) ([52f316d](RosettaCommons@52f316d))
* remove _spoof ([995a260](RosettaCommons@995a260))
* remove ambiguous Greek characters and improve test assertions ([bad6dff](RosettaCommons@bad6dff))
* remove ASE import so we dont introduce a dependency ([7e12a8a](RosettaCommons@7e12a8a))
* remove design tasks with zero frequency during sampling instead of erroring ([4586fa2](RosettaCommons@4586fa2))
* remove hardcoded environment-specific default path ([24bf03f](RosettaCommons@24bf03f))
* remove lineprofiler stuff ([83fb3c5](RosettaCommons@83fb3c5))
* remove print statements ([ecd9e5b](RosettaCommons@ecd9e5b))
* residue starts bug with dependent functions. ([04da354](RosettaCommons@04da354))
* residue starts bug with dependent functions. ([fc252a8](RosettaCommons@fc252a8))
* **rf3:** json-level atom selection for bonds argument ([a569a7c](RosettaCommons@a569a7c))
* ruff formatting after merge with dev ([71ddb86](RosettaCommons@71ddb86))
* sasa for empty aa ([c2b9302](RosettaCommons@c2b9302))
* shard cache on structure ID (PDB ID) instead of args hash ([e5c29fd](RosettaCommons@e5c29fd))
* Support AtomArrayPlus and AtomArrayPlusStack in parse_atom_array, with some restrictions ([RosettaCommons#46](RosettaCommons#46)) ([c1e3b00](RosettaCommons@c1e3b00))
* test requirements ([4c4c775](RosettaCommons@4c4c775))
* tests ([eb10ba6](RosettaCommons@eb10ba6))
* typing ([8d6e39d](RosettaCommons@8d6e39d))
* updates for ame ([6530818](RosettaCommons@6530818))
* use warnings.warn instead of print for missing ase ([b6f9c7a](RosettaCommons@b6f9c7a))

### Features

* add as_atom_array_plus_stack ([804c854](RosettaCommons@804c854))
* add Claude Hode GitHub workflow ([0f4e603](RosettaCommons@0f4e603))
* add customizable annotation names for SSE validity flag ([fd6be37](RosettaCommons@fd6be37))
* add DSSP secondary structure annotation ([f1cc785](RosettaCommons@f1cc785))
* add molprobity executable? ([b87f76b](RosettaCommons@b87f76b))
* add utomatic msa depth and ext determination ([874a1dd](RosettaCommons@874a1dd))
* adding Conditions, annotator, and io utils ([b08c935](RosettaCommons@b08c935))
* atom37 with atomization encoding ([d35be40](RosettaCommons@d35be40))
* atomworks MSA preprocessing (generate, filter, organize) ([ef6ac85](RosettaCommons@ef6ac85))
* backbone constants and annotators ([de5854c](RosettaCommons@de5854c))
* **databases:** add experimental data database functionality from datahub ([f69a499](RosettaCommons@f69a499))
* **databases:** add experimental data database functionality from datahub ([8b71614](RosettaCommons@8b71614))
* generalize encodings ([c886d09](RosettaCommons@c886d09))
* generalize encodings ([83c8684](RosettaCommons@83c8684))
* Make SampleDesignTask handle empty design_tasks gracefully ([c9e5900](RosettaCommons@c9e5900))
* misc fixes for design ([71decdf](RosettaCommons@71decdf))
* MSA preprocessing (organizing, generating, filtering) ([89eb73d](RosettaCommons@89eb73d))
* MSA preprocessing; zst support ([aa3c837](RosettaCommons@aa3c837))
* paired MSA feature (from Rafi) ([805a27a](RosettaCommons@805a27a))
* paired MSA feature (from Rafi) ([2df1511](RosettaCommons@2df1511))
* skip for large structures ([6f2f33c](RosettaCommons@6f2f33c))
* update to work with transition metals ([6f2a80a](RosettaCommons@6f2a80a))

### Performance Improvements

* don't actually check symmetry every time on symmetric conditions since there's just too much overhead ([6ae656a](RosettaCommons@6ae656a))
* improve cache loading and template building performance ([c144ec7](RosettaCommons@c144ec7))
* performance optimizations for parsing ([e8faf3e](RosettaCommons@e8faf3e))
* precalculate bond graph for GrowByHoppingAlongBondGraph ([fd85178](RosettaCommons@fd85178))
* res idx ([22703bf](RosettaCommons@22703bf))
* res idx optimization ([11b5d0d](RosettaCommons@11b5d0d))
* speed up 2d symmetrization check and stop actually checking symmetry every time and make some other stuff faster ([5c3127b](RosettaCommons@5c3127b))
* speed up apply_and_spread ([ac612a4](RosettaCommons@ac612a4))
* speed up bond_graph creation ([94f0512](RosettaCommons@94f0512))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants