Fix _set_wandb_writer serialization issues#1806
Conversation
- Add sanitization for non-serializable objects (bytes, types, callables) - Handle torch.Tensor and numpy arrays by converting to lists - Add JSON safety check as final validation - Preserve existing functionality while ensuring wandb config is serializable
|
/claude review |
| if isinstance(obj, Namespace): | ||
| obj = vars(obj) | ||
| if isinstance(obj, dict): | ||
| return {k: _clean(v) for k, v in obj.items() |
There was a problem hiding this comment.
Bug: bytes values inside dicts are silently dropped, not converted.
The dict comprehension filters out bytes with not isinstance(v, (bytes, ...)), so they are excluded entirely. The if isinstance(obj, bytes) branch below is only reachable for a top-level bytes value, not one nested in a dict. This means bytes dict values are lost rather than converted to strings.
| return {k: _clean(v) for k, v in obj.items() | |
| if isinstance(obj, dict): | |
| result = {} | |
| for k, v in obj.items(): | |
| if isinstance(v, (type, type(lambda: None))): | |
| continue | |
| result[k] = _clean(v) | |
| return result |
Then the if isinstance(obj, bytes) branch will properly handle bytes values when they are recursively processed as dict values.
| if isinstance(obj, Namespace): | ||
| obj = vars(obj) | ||
| if isinstance(obj, dict): | ||
| return {k: _clean(v) for k, v in obj.items() |
There was a problem hiding this comment.
type(lambda: None) only catches FunctionType, missing many callable types.
This check misses built-in functions (len, print), bound methods, functools.partial, and any class instance with __call__. Use callable() instead:
| return {k: _clean(v) for k, v in obj.items() | |
| if isinstance(v, (bytes, type)) or callable(v): |
| if isinstance(obj, bytes): | ||
| return obj.decode('utf-8', errors='ignore') | ||
| if hasattr(obj, 'tolist'): # torch.Tensor / numpy.ndarray | ||
| return obj.tolist() |
There was a problem hiding this comment.
Lists are not recursively sanitized.
If args contains a list with tensors, numpy arrays, or other non-serializable objects, they pass through untouched and will still fail JSON serialization. A list branch is needed:
| return obj.tolist() | |
| if isinstance(obj, list): | |
| return [_clean(v) for v in obj] | |
| if hasattr(obj, 'tolist'): # torch.Tensor / numpy.ndarray |
|
Three bugs in the
|
|
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go in effect starting once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md. |
Description
Fix serialization issues in
_set_wandb_writerfunction that can cause failures when passing complex argument configurations to wandb.init().Problem
The current implementation directly passes the
argsnamespace to wandb as configuration, which can fail when the args contain non-serializable objects such as:bytesobjectstorch.TensorinstancesThis leads to serialization errors during wandb initialization, preventing proper logging functionality.
Solution
Added a comprehensive sanitization function
_clean()that:bytes,type, and callable objects from the configurationtorch.Tensorand numpy arrays to lists using.tolist()repr()as a last resort for any remaining problematic objectsChanges Made
_set_wandb_writer()function inmegatron/training/global_vars.py_clean()helper function for recursive sanitizationTesting
Type of Change
Impact
This fix ensures that wandb logging works reliably across different training configurations, especially when using complex argument setups that may include tensors, custom types, or other non-serializable objects. The change is minimal and focused, reducing the risk of introducing new issues while solving a real-world problem that can prevent proper experiment tracking.
Related Issues
Fixes potential
TypeErrorandValueErrorexceptions during wandb initialization when args contain non-serializable objects.