v1.5.0

@amaslenn released this 10 Feb 10:27
· 128 commits to main since this release
88d6f5c

New Changes

AI Dynamo Improvements

The AI Dynamo workload now supports both Kubernetes and Slurm systems and has been upgraded to Dynamo v0.7. Key additions include disaggregated prefill/decode mode, multinode deployments, and pass-fail criteria for automated result validation. Deployment configuration has been simplified: it now uses TOML files instead of extra config files, and genai-perf is used directly from the Dynamo container.

Kubernetes Enhancements

Host networking is now enabled by default for all Kubernetes deployments. Job name sanitization has been improved across all workloads so that generated names no longer contain invalid characters. For NCCL workloads specifically, logs are fetched continuously during execution, and sshd is installed automatically on workers where it is not available. AI Dynamo deployments now properly clean up port-forwarding processes on deletion.
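To illustrate the kind of job name sanitization mentioned above, here is a minimal sketch. It is not the project's implementation: the function name `sanitize_k8s_name`, the `fallback` parameter, and the exact replacement rules are assumptions. It targets the RFC 1123 label form that Kubernetes requires for many resource names (lowercase alphanumerics and `-`, starting and ending with an alphanumeric, at most 63 characters).

```python
import re

# RFC 1123 label length limit that Kubernetes applies to many resource names.
MAX_LABEL_LEN = 63


def sanitize_k8s_name(raw: str, fallback: str = "job") -> str:
    """Hypothetical helper: coerce an arbitrary string into an RFC 1123 label."""
    name = raw.lower()
    # Replace every run of disallowed characters with a single dash.
    name = re.sub(r"[^a-z0-9-]+", "-", name)
    # Collapse repeated dashes left over from the original input.
    name = re.sub(r"-{2,}", "-", name)
    # Truncate first, then strip dashes so the result starts and ends
    # with an alphanumeric character.
    name = name[:MAX_LABEL_LEN].strip("-")
    # An input made only of invalid characters would otherwise be empty.
    return name or fallback


print(sanitize_k8s_name("NCCL_Test.All Reduce"))  # nccl-test-all-reduce
```

Truncating before stripping matters: cutting a long name at 63 characters can leave a trailing dash, which would make the name invalid again.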

Reporting Improvements

  • Comparison report automatically calculates the difference (value + percentage) when comparing two results
  • Status report includes scenario results for easier monitoring
  • Improved status table formatting
  • Report results directory is printed to users early in the process
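The "value + percentage" difference in the comparison report can be pictured with a small sketch. The function name `diff_with_percentage` and the handling of a zero baseline are assumptions made for illustration; the release notes do not describe how the report computes or formats these numbers.

```python
from typing import Optional, Tuple


def diff_with_percentage(baseline: float, candidate: float) -> Tuple[float, Optional[float]]:
    """Hypothetical helper: absolute and relative difference of two results.

    Returns (delta, percent), where percent is None when the baseline is zero
    and a relative change is therefore undefined.
    """
    delta = candidate - baseline
    if baseline == 0:
        return delta, None
    # Dividing by abs(baseline) keeps the sign of the percentage equal to
    # the sign of the delta, even for negative baselines.
    return delta, delta / abs(baseline) * 100.0


print(diff_with_percentage(200.0, 250.0))  # (50.0, 25.0)
```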

Documentation

The documentation has been reorganized, and the AI Dynamo page now covers both Kubernetes and Slurm examples in one place. New sections have been added for parameter sweeps and test-in-scenario configuration. The workloads support matrix has been updated to reflect current platform availability.

Architectural Changes

  • Removed Test concept - Simplifies the codebase by eliminating the intermediate Test object
  • Removed TestTemplate concept - Direct workload usage instead of TestTemplate objects
  • Converted TestScenario to dataclass
  • Converted BaseSystem to Pydantic model
  • Aligned Grader and JsonGenStrategy with CmdGenStrategy patterns
  • MegatronRun workload defaults to not enabling recompute-activations
  • --distribution=arbitrary is no longer hardcoded for Slurm deployments
  • srun commands always set the number of nodes (unless a nodelist is specified)
  • ETCD/NIXL processes are killed and waited for properly
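The srun rule in the list above (set the node count unless a nodelist is given) can be sketched as follows. This is an illustration only: the helper name `srun_node_args` is hypothetical, and the release notes do not say which flag spellings the tool emits. The sketch uses srun's standard `-N` and `--nodelist` options.

```python
from typing import List, Optional


def srun_node_args(num_nodes: int, nodelist: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: node-selection arguments for an srun command."""
    if nodelist:
        # An explicit nodelist already pins the nodes, so no count is added.
        return [f"--nodelist={','.join(nodelist)}"]
    # Otherwise always state the node count explicitly.
    return [f"-N{num_nodes}"]


print(["srun", *srun_node_args(4)])                      # ['srun', '-N4']
print(["srun", *srun_node_args(4, ["node1", "node2"])])  # ['srun', '--nodelist=node1,node2']
```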

Full Changelog: v1.4.0...v1.5.0