Skip to content

Inprocess: states and keys are serialized with json, not pickle#268

Merged
hexinw-nvidia merged 3 commits intoNVIDIA:mainfrom
szmigacz:inprocess_no_pickle
Mar 4, 2026
Merged

Inprocess: states and keys are serialized with json, not pickle#268
hexinw-nvidia merged 3 commits intoNVIDIA:mainfrom
szmigacz:inprocess_no_pickle

Conversation

@szmigacz
Copy link
Contributor

No description provided.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR replaces pickle with json for serializing State objects and rank keys in the inprocess store, eliminating the CWE-502 security warning that was previously suppressed with # nosec comments.

  • send_state now uses dataclasses.asdict + manual field overrides (mode → enum name string, fn_exceptionNone) before calling json.dumps; get_states reconstructs State by converting the mode string back to a Mode enum member.
  • send_key / get_keys now use json.dumps / json.loads directly.
  • The fn_exception field is unconditionally set to None before serialization because Exception objects are not JSON-serializable. This is safe in the current call-graph (rank_assignment.py always overwrites fn_exception on deserialized states with the local rank's value), but it is a silent behavioral change worth documenting.
  • Tuple-typed keys (or any non-JSON-serializable key) will either silently change type (tuplelist) or raise a TypeError at runtime — a potential breaking change compared to the previous pickle-based approach that accepted arbitrary Python objects.

Confidence Score: 4/5

  • Safe to merge for the common case of scalar keys; minor risks around tuple key type coercion and undocumented fn_exception data loss.
  • The security improvement (removing pickle) is clear and correct. The Mode enum round-trip is properly handled. The fn_exception data loss is safe given current consumers. The main risk is the undocumented breaking change for non-scalar or tuple keys, which could affect users who configure composite keys.
  • src/nvidia_resiliency_ext/inprocess/store.py — specifically send_key and send_state

Important Files Changed

Filename Overview
src/nvidia_resiliency_ext/inprocess/store.py Replaces pickle-based serialization of State and keys with JSON; removes the security-flagged pickle dependency. The Mode enum is correctly round-tripped via its name, and fn_exception is intentionally dropped (safe given current consumers). Potential behavioral change for non-scalar or tuple keys due to JSON's tuple→list coercion.

Sequence Diagram

sequenceDiagram
    participant R as Rank N
    participant S as TCPStore
    participant P as Peer Ranks

    Note over R: send_state (new JSON path)
    R->>R: dataclasses.asdict(state)
    R->>R: state_dict['mode'] = state.mode.name
    R->>R: state_dict['fn_exception'] = None
    R->>S: set(STATE_key, json.dumps(state_dict))

    Note over R: send_key (new JSON path)
    R->>S: set(KEY_key, json.dumps(key))

    Note over P: get_states
    P->>S: multi_get([STATE_key for each rank])
    S-->>P: [json bytes, ...]
    P->>P: json.loads(data) → state_dict
    P->>P: state_dict['mode'] = Mode[name]
    P->>P: State(**state_dict)

    Note over P: get_keys
    P->>S: multi_get([KEY_key for each rank])
    S-->>P: [json bytes, ...]
    P->>P: json.loads(data) → key (tuples become lists)
Loading

Last reviewed commit: c47d7b3

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 2, 2026

Additional Comments (2)

src/nvidia_resiliency_ext/inprocess/store.py, line 111
Silent fn_exception data loss on serialization

fn_exception is unconditionally set to None before serialization. Because Exception objects are not JSON-serializable this is necessary, but the silent discard has a subtle implication: any rank that calls get_states will always see fn_exception=None for every other rank, regardless of whether those ranks actually encountered an exception.

In the current code-path this is safe because rank_assignment.py immediately overwrites fn_exception on all leaf states with the calling rank's own local value (line 748: leaf.state.copy_from(state, fields=['fn_exception', 'iteration'])). However, the contract of get_states is now different from what it was with pickle, and any future consumer that expects a real exception from a peer rank will silently receive None.

Consider adding a comment to document this intentional trade-off:

        state_dict['fn_exception'] = None  # Exception objects are not JSON-serializable; dropped intentionally

src/nvidia_resiliency_ext/inprocess/store.py, line 114
Non-JSON-serializable or tuple-typed keys will silently change type or raise

pickle could round-trip any Python object as a key. With json.dumps, two categories of keys now break:

  1. Tuple keys become lists. json.dumps((1, 2))json.loads(...) returns [1, 2]. In build_tree the keys are used as dict keys (node.children[k]) and for set-membership checks. If the local key and the deserialized key are compared later (e.g. through a hash-based structure) the type mismatch will cause incorrect behaviour — lists are also unhashable, so set(...) over deserialized list-of-list keys will raise TypeError.

  2. Non-serializable keys raise TypeError at runtime. Any custom object previously usable as a key now raises immediately at json.dumps(key), with no graceful error message.

If keys are always scalars (strings / integers) in practice, this is fine, but the change is silently breaking for callers using tuple or composite keys. Consider validating the key type or documenting the restriction:

        self.set(self.KEY.format(rank=rank), json.dumps(key))  # key must be JSON-serializable; tuples become lists

@hexinw-nvidia hexinw-nvidia merged commit 768bd68 into NVIDIA:main Mar 4, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants