
Fix _set_wandb_writer serialization issues #1806

Draft
gakkiri wants to merge 2 commits into NVIDIA:main from gakkiri:fix-wandb-serialization

Conversation

gakkiri commented Sep 11, 2025

Description

Fix serialization issues in _set_wandb_writer function that can cause failures when passing complex argument configurations to wandb.init().

Problem

The current implementation directly passes the args namespace to wandb as configuration, which can fail when the args contain non-serializable objects such as:

  • bytes objects
  • torch.Tensor instances
  • Function/callable objects
  • Type objects
  • Other objects that cannot be JSON-serialized

This leads to serialization errors during wandb initialization, preventing proper logging functionality.

  File "/root/Megatron-LM/megatron/training/global_vars.py", line 211, in _set_wandb_writer
    wandb.init(**wandb_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1620, in init
    wandb._sentry.reraise(e)
  File "/usr/local/lib/python3.10/dist-packages/wandb/analytics/sentry.py", line 157, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1606, in init
    return wi.init(run_settings, run_config, run_printer)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 981, in init
    run_init_handle = backend.interface.deliver_run(run)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 909, in deliver_run
    run_record = self._make_run(run)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 182, in _make_run
    self._make_config(data=config_dict, obj=proto_run.config)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 125, in _make_config
    update.value_json = json_dumps_safer(json_friendly(v)[0])
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 806, in json_dumps_safer
    return dumps(obj, cls=WandBJSONEncoder, **kwargs)
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 755, in default
    return json.JSONEncoder.default(self, obj)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type dtype is not JSON serializable
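
The last frames of the traceback come from the standard `json` encoder, so the failure can probably be reproduced without wandb or torch. A minimal sketch, assuming a hypothetical stand-in class named `dtype` (it is not `torch.dtype`) stored on an `argparse.Namespace`:

```python
import json
from argparse import Namespace


class dtype:
    """Hypothetical stand-in for torch.dtype; not the real class."""


# Assumed shape of the training args: one plain value, one opaque object.
args = Namespace(lr=1e-4, params_dtype=dtype())

try:
    json.dumps(vars(args))
    error_message = None
except TypeError as exc:
    # The stock encoder has no rule for arbitrary objects and raises here.
    error_message = str(exc)

print(error_message)
```

The encoder names the offending class in the message, which is how the traceback above points at a `dtype` value in the config.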

Solution

Added a comprehensive sanitization function _clean() that:

  1. Filters out non-serializable types: Removes bytes, type, and callable objects from the configuration
  2. Converts tensors to serializable format: Automatically converts torch.Tensor and numpy arrays to lists using .tolist()
  3. Handles bytes gracefully: Converts bytes to UTF-8 strings with error handling
  4. Validates JSON compatibility: Performs a final JSON serialization check to ensure all values are safe
  5. Provides fallback handling: Uses repr() as a last resort for any remaining problematic objects
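
Steps 4 and 5 could look roughly like the following. This is a sketch under assumptions, not the code from this PR: the helper name `json_safe` is hypothetical, and the choice to catch both `TypeError` and `ValueError` follows the exceptions named under Related Issues.

```python
import json


def json_safe(value):
    """Return value unchanged if json.dumps accepts it, else its repr()."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        # Last resort: keep a readable string instead of failing the run.
        return repr(value)


# complex numbers are one built-in type the json module cannot encode.
raw = {"lr": 1e-4, "z": complex(1, 2)}
checked = {k: json_safe(v) for k, v in raw.items()}
print(checked)
```

Falling back to `repr()` keeps the key visible in the logged config, at the cost of storing a string rather than a structured value.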

Changes Made

  • Enhanced _set_wandb_writer() function in megatron/training/global_vars.py
  • Added _clean() helper function for recursive sanitization
  • Maintained full backward compatibility
  • Preserved all existing functionality while improving robustness

Testing

  • Preserves existing wandb logging behavior
  • Handles common non-serializable objects (bytes, tensors, callables)
  • Maintains backward compatibility with existing configurations
  • No breaking changes to the API

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

Impact

This fix ensures that wandb logging works reliably across different training configurations, especially when using complex argument setups that may include tensors, custom types, or other non-serializable objects. The change is minimal and focused, reducing the risk of introducing new issues while solving a real-world problem that can prevent proper experiment tracking.

Related Issues

Fixes potential TypeError and ValueError exceptions during wandb initialization when args contain non-serializable objects.

- Add sanitization for non-serializable objects (bytes, types, callables)
- Handle torch.Tensor and numpy arrays by converting to lists
- Add JSON safety check as final validation
- Preserve existing functionality while ensuring wandb config is serializable

copy-pr-bot bot commented Sep 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

sbhavani added the bug and module: debugging labels Sep 14, 2025

Phlip79 (Member) commented Mar 4, 2026

/claude review

    if isinstance(obj, Namespace):
        obj = vars(obj)
    if isinstance(obj, dict):
        return {k: _clean(v) for k, v in obj.items()

Bug: bytes values inside dicts are silently dropped, not converted.

The dict comprehension filters out bytes with not isinstance(v, (bytes, ...)), so they are excluded entirely. The if isinstance(obj, bytes) branch below is only reachable for a top-level bytes value, not one nested in a dict. This means bytes dict values are lost rather than converted to strings.

Suggested change
-        return {k: _clean(v) for k, v in obj.items()
+    if isinstance(obj, dict):
+        result = {}
+        for k, v in obj.items():
+            if isinstance(v, (type, type(lambda: None))):
+                continue
+            result[k] = _clean(v)
+        return result

Then the if isinstance(obj, bytes) branch will properly handle bytes values when they are recursively processed as dict values.
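
The difference can be checked in isolation. The sketch below is a reconstruction from this comment, not the PR's actual code: `clean_filtering` assumes the filter condition described above, and `clean_converting` follows the suggested change. Both function names are hypothetical.

```python
def clean_filtering(obj):
    """Reconstructed reviewed pattern: the filter runs before recursion."""
    if isinstance(obj, dict):
        return {k: clean_filtering(v) for k, v in obj.items()
                if not isinstance(v, (bytes, type, type(lambda: None)))}
    if isinstance(obj, bytes):
        return obj.decode('utf-8', errors='ignore')
    return obj


def clean_converting(obj):
    """Variant per the suggestion: bytes values reach the decode branch."""
    if isinstance(obj, dict):
        result = {}
        for k, v in obj.items():
            if isinstance(v, (type, type(lambda: None))):
                continue
            result[k] = clean_converting(v)
        return result
    if isinstance(obj, bytes):
        return obj.decode('utf-8', errors='ignore')
    return obj


config = {"name": "run-1", "tag": b"abc"}
dropped = clean_filtering(config)   # the "tag" key disappears
kept = clean_converting(config)     # the "tag" value becomes "abc"
print(dropped, kept)
```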

    if isinstance(obj, Namespace):
        obj = vars(obj)
    if isinstance(obj, dict):
        return {k: _clean(v) for k, v in obj.items()

type(lambda: None) only catches FunctionType, missing many callable types.

This check misses built-in functions (len, print), bound methods, functools.partial, and any class instance with __call__. Use callable() instead:

Suggested change
-        return {k: _clean(v) for k, v in obj.items()
+            if isinstance(v, (bytes, type)) or callable(v):
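
A quick way to see the gap is to run both checks over the callable kinds listed above. This is an illustrative sketch; the `Scheduler` class and the `candidates` dictionary are hypothetical examples, not values from Megatron-LM.

```python
import functools

FunctionType = type(lambda: None)


class Scheduler:
    """Hypothetical class whose instances are callable."""

    def __call__(self):
        return 0


candidates = {
    "lambda": lambda: None,
    "builtin": len,
    "bound_method": "abc".upper,
    "partial": functools.partial(int, base=2),
    "callable_instance": Scheduler(),
}

# Which candidates does each check recognise?
caught_by_isinstance = {k: isinstance(v, FunctionType) for k, v in candidates.items()}
caught_by_callable = {k: callable(v) for k, v in candidates.items()}
print(caught_by_isinstance)
print(caught_by_callable)
```

Note that `callable()` also returns `True` for classes, so a separate `type` check next to it is redundant but harmless.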

    if isinstance(obj, bytes):
        return obj.decode('utf-8', errors='ignore')
    if hasattr(obj, 'tolist'):  # torch.Tensor / numpy.ndarray
        return obj.tolist()

Lists are not recursively sanitized.

If args contains a list with tensors, numpy arrays, or other non-serializable objects, they pass through untouched and will still fail JSON serialization. A list branch is needed:

Suggested change
-        return obj.tolist()
+    if isinstance(obj, list):
+        return [_clean(v) for v in obj]
+    if hasattr(obj, 'tolist'):  # torch.Tensor / numpy.ndarray

claude bot commented Mar 4, 2026

Three bugs in the _clean() helper:

  1. bytes in dicts silently dropped (see inline): The dict comprehension filters out bytes values entirely rather than converting them. The isinstance(obj, bytes) branch is unreachable for bytes values nested inside a dict.

  2. callable check is too narrow (see inline): type(lambda: None) only catches FunctionType. Built-in functions, bound methods, functools.partial, and callable class instances all slip through.

  3. lists not recursively sanitized (see inline): A list containing tensors or other non-serializable objects is passed through as-is and will still fail JSON serialization.
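
Taken together, the three points suggest a helper along the following lines. This is only a sketch of one possible shape, not the code in this PR: treating tuples like lists, returning `repr()` for a callable passed at the top level, and the `FakeTensor` stand-in are all assumptions made for illustration.

```python
import json
from argparse import Namespace


def _clean(obj):
    """Sketch of a recursive sanitizer; not the implementation under review."""
    if isinstance(obj, Namespace):
        obj = vars(obj)
    if isinstance(obj, dict):
        # Drop classes and every kind of callable; bytes are kept for decoding.
        return {k: _clean(v) for k, v in obj.items() if not callable(v)}
    if isinstance(obj, (list, tuple)):
        return [_clean(v) for v in obj if not callable(v)]
    if callable(obj):
        return repr(obj)  # only reachable for a callable passed directly
    if isinstance(obj, bytes):
        return obj.decode('utf-8', errors='ignore')
    if hasattr(obj, 'tolist'):  # torch.Tensor / numpy.ndarray
        return obj.tolist()
    try:
        json.dumps(obj)
        return obj
    except (TypeError, ValueError):
        return repr(obj)


class FakeTensor:
    """Hypothetical stand-in exposing tolist(), like a tensor or ndarray."""

    def tolist(self):
        return [1, 2]


args = Namespace(
    lr=1e-4,
    tag=b"abc",
    hook=len,
    shapes=[FakeTensor(), b"x"],
    nested={"cls": int, "z": complex(1, 2)},
)
cleaned = _clean(args)
print(json.dumps(cleaned))
```

Filtering before recursion is kept only for callables here, so bytes values always reach the decode branch, whether they sit in a dict or a list.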

Phlip79 (Member) commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

Phlip79 marked this pull request as draft March 4, 2026 23:10