Fix _set_wandb_writer serialization issues by gakkiri · Pull Request #1806 · NVIDIA/Megatron-LM

gakkiri · 2025-09-11T15:41:15Z

Description

Fix serialization issues in the _set_wandb_writer function that can cause failures when complex argument configurations are passed to wandb.init().

Problem

The current implementation directly passes the args namespace to wandb as configuration, which can fail when the args contain non-serializable objects such as:

- bytes objects
- torch.Tensor instances
- Function/callable objects
- Type objects
- Other objects that cannot be JSON-serialized

This leads to serialization errors during wandb initialization, preventing proper logging functionality.

  File "/root/Megatron-LM/megatron/training/global_vars.py", line 211, in _set_wandb_writer
    wandb.init(**wandb_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1620, in init
    wandb._sentry.reraise(e)
  File "/usr/local/lib/python3.10/dist-packages/wandb/analytics/sentry.py", line 157, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1606, in init
    return wi.init(run_settings, run_config, run_printer)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 981, in init
    run_init_handle = backend.interface.deliver_run(run)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 909, in deliver_run
    run_record = self._make_run(run)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 182, in _make_run
    self._make_config(data=config_dict, obj=proto_run.config)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 125, in _make_config
    update.value_json = json_dumps_safer(json_friendly(v)[0])
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 806, in json_dumps_safer
    return dumps(obj, cls=WandBJSONEncoder, **kwargs)
  File "/usr/lib/python3.10/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.10/dist-packages/wandb/util.py", line 755, in default
    return json.JSONEncoder.default(self, obj)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type dtype is not JSON serializable
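
The traceback ends in the standard library's `json` encoder, so the same class of failure can be reproduced without wandb or torch. The following is a minimal sketch under stated assumptions: the `dtype` class and the `Namespace` field names are hypothetical stand-ins (not Megatron's real arguments), and `json.dumps` is called directly instead of going through `wandb.init()`.

```python
import json
from argparse import Namespace


class dtype:
    """Hypothetical stand-in for torch.dtype so the sketch runs without torch."""


# Illustrative namespace only; these are not Megatron's actual argument names.
args = Namespace(micro_batch_size=4, params_dtype=dtype())


def try_dump(config):
    """Return None if config is JSON-serializable, else the TypeError message."""
    try:
        json.dumps(config)
        return None
    except TypeError as exc:
        return str(exc)


error = try_dump(vars(args))
print(error)
```

Any value the encoder does not recognize reaches `JSONEncoder.default()`, which raises `TypeError`, so a single such attribute is enough to abort the whole dump.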

Solution

Added a comprehensive sanitization function _clean() that:

- Filters out non-serializable types: Removes bytes, type, and callable objects from the configuration
- Converts tensors to serializable format: Automatically converts torch.Tensor and numpy arrays to lists using .tolist()
- Handles bytes gracefully: Converts bytes to UTF-8 strings with error handling
- Validates JSON compatibility: Performs a final JSON serialization check to ensure all values are safe
- Provides fallback handling: Uses repr() as a last resort for any remaining problematic objects
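
The steps listed above can be sketched as a single recursive function. This is an illustrative reading of the description, not the code in this PR: the name `clean`, the choice to convert bytes rather than drop them, the tuple handling, and the `str(k)` key coercion are all assumptions made for the example.

```python
import json
from argparse import Namespace


def clean(obj):
    """Recursively turn obj into something json.dumps accepts (illustrative sketch)."""
    if isinstance(obj, Namespace):
        obj = vars(obj)
    if isinstance(obj, dict):
        # Drop callables; classes are callable too, so type objects are covered.
        # Keys are coerced to str (an assumption) so unusual key types cannot fail later.
        return {str(k): clean(v) for k, v in obj.items() if not callable(v)}
    if isinstance(obj, (list, tuple)):
        return [clean(v) for v in obj]
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="ignore")
    if hasattr(obj, "tolist"):
        # Duck-typed: torch.Tensor and numpy.ndarray both expose tolist().
        return clean(obj.tolist())
    try:
        json.dumps(obj)  # final safety check
        return obj
    except (TypeError, ValueError):
        return repr(obj)  # last-resort fallback


# Hypothetical arguments, for illustration only.
args = Namespace(
    lr=1e-4,
    blob=b"abc",
    hook=print,
    dtype=float,
    shape=(2, 3),
    nested={"k": [b"x"]},
)
cleaned = clean(args)
print(json.dumps(cleaned))
```

In this sketch `hook` and `dtype` are removed, `blob` becomes a string, and the tuple becomes a list, so the result passes `json.dumps` without a custom encoder.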

Changes Made

- Enhanced _set_wandb_writer() function in megatron/training/global_vars.py
- Added _clean() helper function for recursive sanitization
- Maintained full backward compatibility
- Preserved all existing functionality while improving robustness

Testing

- Preserves existing wandb logging behavior
- Handles common non-serializable objects (bytes, tensors, callables)
- Maintains backward compatibility with existing configurations
- No breaking changes to the API

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update

Impact

This fix ensures that wandb logging works reliably across different training configurations, especially when using complex argument setups that may include tensors, custom types, or other non-serializable objects. The change is minimal and focused, reducing the risk of introducing new issues while solving a real-world problem that can prevent proper experiment tracking.

Related Issues

Fixes potential TypeError and ValueError exceptions during wandb initialization when args contain non-serializable objects.

- Add sanitization for non-serializable objects (bytes, types, callables)
- Handle torch.Tensor and numpy arrays by converting to lists
- Add JSON safety check as final validation
- Preserve existing functionality while ensuring wandb config is serializable

copy-pr-bot · 2025-09-11T15:41:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Phlip79 · 2026-03-04T23:08:47Z

/claude review

claude · 2026-03-04T23:09:54Z

megatron/training/global_vars.py

+            if isinstance(obj, Namespace):
+                obj = vars(obj)
+            if isinstance(obj, dict):
+                return {k: _clean(v) for k, v in obj.items()


Bug: bytes values inside dicts are silently dropped, not converted.

The dict comprehension filters out bytes with not isinstance(v, (bytes, ...)), so they are excluded entirely. The if isinstance(obj, bytes) branch below is only reachable for a top-level bytes value, not one nested in a dict. This means bytes dict values are lost rather than converted to strings.

Suggested change

-                return {k: _clean(v) for k, v in obj.items()
+            if isinstance(obj, dict):
+                result = {}
+                for k, v in obj.items():
+                    if isinstance(v, (type, type(lambda: None))):
+                        continue
+                    result[k] = _clean(v)
+                return result

Then the if isinstance(obj, bytes) branch will properly handle bytes values when they are recursively processed as dict values.
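
The difference between the two behaviours can be seen in a small standalone comparison. This is a reconstruction from the review text, not the PR's actual code: the function names are hypothetical, and the filter tuple in the first function is inferred from the comment above.

```python
FUNCTION_TYPE = type(lambda: None)


def clean_filtering_bytes(obj):
    """Reconstructed behaviour: the dict filter excludes bytes before recursion."""
    if isinstance(obj, dict):
        return {
            k: clean_filtering_bytes(v)
            for k, v in obj.items()
            if not isinstance(v, (bytes, type, FUNCTION_TYPE))
        }
    if isinstance(obj, bytes):  # never reached for values nested in a dict
        return obj.decode("utf-8", errors="ignore")
    return obj


def clean_converting_bytes(obj):
    """Suggested behaviour: only types and functions are skipped, bytes recurse."""
    if isinstance(obj, dict):
        result = {}
        for k, v in obj.items():
            if isinstance(v, (type, FUNCTION_TYPE)):
                continue
            result[k] = clean_converting_bytes(v)
        return result
    if isinstance(obj, bytes):
        return obj.decode("utf-8", errors="ignore")
    return obj


config = {"run_tag": b"exp-01", "seed": 1234}  # hypothetical config
print(clean_filtering_bytes(config))   # {'seed': 1234}
print(clean_converting_bytes(config))  # {'run_tag': 'exp-01', 'seed': 1234}
```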

claude · 2026-03-04T23:09:55Z

megatron/training/global_vars.py

+            if isinstance(obj, Namespace):
+                obj = vars(obj)
+            if isinstance(obj, dict):
+                return {k: _clean(v) for k, v in obj.items()


type(lambda: None) only catches FunctionType, missing many callable types.

This check misses built-in functions (len, print), bound methods, functools.partial, and any class instance with __call__. Use callable() instead:

Suggested change

                return {k: _clean(v) for k, v in obj.items()
                if isinstance(v, (bytes, type)) or callable(v):
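
The gap between the two checks is easy to verify in isolation. The sketch below uses only the standard library; the `Scheduler` class and the dictionary keys are hypothetical examples, not objects taken from Megatron's arguments.

```python
import functools
from types import FunctionType


class Scheduler:
    """Hypothetical callable class instance."""

    def __call__(self, step):
        return step * 0.1


candidates = {
    "lambda": lambda: None,
    "builtin": len,
    "bound_method": Scheduler().__call__,
    "partial": functools.partial(int, base=2),
    "callable_instance": Scheduler(),
    "class": dict,
}

# type(lambda: None) is exactly types.FunctionType.
assert type(lambda: None) is FunctionType

caught_by_isinstance = sorted(
    name for name, value in candidates.items() if isinstance(value, FunctionType)
)
caught_by_callable = sorted(
    name for name, value in candidates.items() if callable(value)
)

print(caught_by_isinstance)  # ['lambda']
print(caught_by_callable)
```

Because classes are themselves callable, a `callable()` check also covers type objects. The trade-off is that `callable()` matches any object defining `__call__`, which may be broader than intended for some configurations.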

claude · 2026-03-04T23:10:00Z

megatron/training/global_vars.py

+            if isinstance(obj, bytes):
+                return obj.decode('utf-8', errors='ignore')
+            if hasattr(obj, 'tolist'):          # torch.Tensor / numpy.ndarray
+                return obj.tolist()


Lists are not recursively sanitized.

If args contains a list with tensors, numpy arrays, or other non-serializable objects, they pass through untouched and will still fail JSON serialization. A list branch is needed:

Suggested change

                return obj.tolist()
            if isinstance(obj, list):
                return [_clean(v) for v in obj]
            if hasattr(obj, 'tolist'):          # torch.Tensor / numpy.ndarray
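
A short comparison shows why the list branch matters. This is a sketch under assumptions: `FakeArray` is a hypothetical stand-in for torch.Tensor / numpy.ndarray that only provides `tolist()`, both function names are invented for the example, and handling tuples alongside lists is an addition beyond what the review suggests.

```python
import json


class FakeArray:
    """Hypothetical stand-in for torch.Tensor / numpy.ndarray."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


def clean_without_list_branch(obj):
    if isinstance(obj, dict):
        return {k: clean_without_list_branch(v) for k, v in obj.items()}
    if hasattr(obj, "tolist"):
        return obj.tolist()
    return obj  # lists fall through untouched


def clean_with_list_branch(obj):
    if isinstance(obj, dict):
        return {k: clean_with_list_branch(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):  # tuple handling is an assumption
        return [clean_with_list_branch(v) for v in obj]
    if hasattr(obj, "tolist"):
        return obj.tolist()
    return obj


def is_json_safe(value):
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


config = {"loss_weights": [FakeArray([0.5, 0.5]), 1.0]}  # hypothetical config
print(is_json_safe(clean_without_list_branch(config)))  # False
print(is_json_safe(clean_with_list_branch(config)))     # True
```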

claude · 2026-03-04T23:10:06Z

Three bugs in the _clean() helper:

- bytes in dicts silently dropped (see inline): The dict comprehension filters out bytes values entirely rather than converting them. The isinstance(obj, bytes) branch is unreachable for bytes values nested inside a dict.
- callable check is too narrow (see inline): type(lambda: None) only catches FunctionType. Built-in functions, bound methods, functools.partial, and callable class instances all slip through.
- lists not recursively sanitized (see inline): A list containing tensors or other non-serializable objects is passed through as-is and will still fail JSON serialization.

Phlip79 · 2026-03-04T23:10:42Z

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

sbhavani added bug Something isn't working module: debugging labels Sep 14, 2025

Merge branch 'main' into fix-wandb-serialization

01b9cb2

github-actions bot added the community-request label Mar 4, 2026

claude bot reviewed Mar 4, 2026

Phlip79 marked this pull request as draft March 4, 2026 23:10

gakkiri wants to merge 2 commits into NVIDIA:main from gakkiri:fix-wandb-serialization

3 participants
