diff --git a/3rdparty/Automodel-workspace/Automodel b/3rdparty/Automodel-workspace/Automodel
index 1d42deb981..a2db048383 160000
--- a/3rdparty/Automodel-workspace/Automodel
+++ b/3rdparty/Automodel-workspace/Automodel
@@ -1 +1 @@
-Subproject commit 1d42deb98169fd94b54c714c0fe4bf308fe7115a
+Subproject commit a2db048383cd54b3fafc928df4c30bf7bbf7c430
diff --git a/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge b/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
index 15398e08fc..8aa287df3c 160000
--- a/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
+++ b/3rdparty/Megatron-Bridge-workspace/Megatron-Bridge
@@ -1 +1 @@
-Subproject commit 15398e08fc86be3de084c7382116527246ab1852
+Subproject commit 8aa287df3ca6833c78733460f0c0f0bcfb79f5de
diff --git a/3rdparty/Megatron-LM-workspace/Megatron-LM b/3rdparty/Megatron-LM-workspace/Megatron-LM
index 193463c4f8..76065f17e1 160000
--- a/3rdparty/Megatron-LM-workspace/Megatron-LM
+++ b/3rdparty/Megatron-LM-workspace/Megatron-LM
@@ -1 +1 @@
-Subproject commit 193463c4f8414e6906a40dd527a450bca50706b1
+Subproject commit 76065f17e1e1e2850d1e9009ae5f601007aeeeb3
diff --git a/fern/README.md b/fern/README.md
new file mode 100644
index 0000000000..01b1b5e4f9
--- /dev/null
+++ b/fern/README.md
@@ -0,0 +1,118 @@
+# NeMo RL Fern Documentation
+
+This folder contains the Fern Docs configuration for NeMo RL.
+
+## Installation
+
+```bash
+npm install -g fern-api
+# Or run it without a global install: npx fern-api --version
+```
+
+## Local Preview
+
+```bash
+cd fern/
+fern docs dev
+# Or from the project root: fern docs dev --project ./fern
+```
+
+The docs are then served at `http://localhost:3000`.
+
+## Folder Structure
+
+```
+fern/
+├── docs.yml            # Global config (title, colors, versions)
+├── fern.config.json    # Fern CLI config
+├── versions/
+│   └── v0.5.0.yml      # Navigation for v0.5.0
+├── v0.5.0/
+│   └── pages/          # MDX content for v0.5.0
+├── scripts/            # Migration and conversion scripts
+└── assets/             # Favicon, images
+```
+
+## Migration Workflow
+
+To migrate or update docs from `docs/` to Fern:
+
+**Assets:** The docs reference images (e.g. `../assets/*.png`). These must exist in `docs/assets/` and are copied to `fern/assets/` by the copy script. If `docs/assets/` is missing or the images are not committed, create the directory and add the image files first; otherwise the Fern build reports missing-path errors. Image paths in MDX are normalized to `/assets/` (relative to the Fern site root).
+
+```bash
+# 1. Copy docs to fern (run from repo root)
+python3 fern/scripts/copy_docs_to_fern.py v0.5.0
+
+# 2. Convert RL-specific syntax first (octicon, py:class, py:meth)
+python3 fern/scripts/convert_rl_specific.py fern/v0.5.0/pages
+
+# 3. Convert MyST to Fern MDX
+python3 fern/scripts/convert_myst_to_fern.py fern/v0.5.0/pages
+
+# 4. Add frontmatter
+python3 fern/scripts/add_frontmatter.py fern/v0.5.0/pages
+
+# 5. Update internal links
+python3 fern/scripts/update_links.py fern/v0.5.0/pages
+
+# 6. Remove duplicate H1s (when title matches frontmatter)
+python3 fern/scripts/remove_duplicate_h1.py fern/v0.5.0/pages
+
+# 7. Validate
+./fern/scripts/check_unconverted.sh fern/v0.5.0/pages
+uv run python fern/scripts/find_tag_mismatches.py fern/v0.5.0/pages
+```
+
+## Bumping the Version
+
+When releasing a new version (e.g., v0.6.0):
+
+1. Copy the previous version's content:
+   ```bash
+   cp -r fern/v0.5.0 fern/v0.6.0
+   ```
+
+2. Create the navigation file:
+   ```bash
+   cp fern/versions/v0.5.0.yml fern/versions/v0.6.0.yml
+   ```
+
+3. In `versions/v0.6.0.yml`: replace `../v0.5.0/pages/` → `../v0.6.0/pages/`
+
+4. In `docs.yml`: add the new version to the `versions:` list
+
+5. Make content changes in `fern/v0.6.0/pages/`
+
+## MDX Components
+
+```mdx
+<Note>Informational note</Note>
+<Tip>Helpful tip</Tip>
+<Warning>Warning message</Warning>
+<Info>Info callout</Info>
+
+<AccordionGroup>
+  <Accordion title="...">
+    Description
+  </Accordion>
+</AccordionGroup>
+
+<CodeBlocks>
+  ```python\ncode\n```
+</CodeBlocks>
+```
+
+## API Reference
+
+API docs are built by Sphinx (autodoc2) and hosted at docs.nvidia.com. The "API Reference" link in the navbar points to `https://docs.nvidia.com/nemo/rl/latest/apidocs/`.
+
+## Deploying
+
+```bash
+fern generate --docs
+fern docs deploy
+```
+
+## Useful Links
+
+- [Fern Docs](https://buildwithfern.com/learn/docs)
+- [MDX Components](https://buildwithfern.com/learn/docs/components)
+- [Versioning Guide](https://buildwithfern.com/learn/docs/configuration/versions)
diff --git a/fern/assets/.gitkeep b/fern/assets/.gitkeep
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/fern/assets/NVIDIA_dark.svg b/fern/assets/NVIDIA_dark.svg
new file mode 100644
index 0000000000..04850d9d6b
--- /dev/null
+++ b/fern/assets/NVIDIA_dark.svg
@@ -0,0 +1,35 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/fern/assets/NVIDIA_light.svg b/fern/assets/NVIDIA_light.svg
new file mode 100644
index 0000000000..9ee045c3ef
--- /dev/null
+++ b/fern/assets/NVIDIA_light.svg
@@ -0,0 +1,34 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/fern/assets/NVIDIA_symbol.svg b/fern/assets/NVIDIA_symbol.svg
new file mode 100644
index 0000000000..c0507afe00
--- /dev/null
+++ b/fern/assets/NVIDIA_symbol.svg
@@ -0,0 +1,22 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/fern/assets/README.md b/fern/assets/README.md
new file mode 100644
index 0000000000..dc0d33c50d
--- /dev/null
+++ b/fern/assets/README.md
@@ -0,0 +1,4 @@
+# Fern Assets
+
+Add `favicon.png` here for the docs site logo and favicon.
+See NeMo Curator or DataDesigner fern/assets for reference.
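The MyST-to-MDX conversion in step 3 of the README's migration workflow is the heart of the pipeline: MyST directives have no meaning in Fern MDX and must be rewritten as Fern components. As a rough sketch of the kind of rewrite involved (the function name, regex, and mapping below are illustrative assumptions, not the actual contents of `convert_myst_to_fern.py`), a MyST admonition fence maps onto the corresponding Fern callout:

```python
import re

# Hypothetical sketch of one MyST -> Fern MDX rewrite rule; the real
# convert_myst_to_fern.py handles many more directives and edge cases.
MYST_TO_FERN = {"note": "Note", "tip": "Tip", "warning": "Warning"}

# Match a fenced MyST admonition such as ```{note} ... ``` (body captured lazily).
ADMONITION = re.compile(r"```\{(note|tip|warning)\}\n(.*?)```", re.DOTALL)

def convert_admonitions(text: str) -> str:
    """Rewrite MyST admonition fences as Fern callout components."""
    def repl(match: re.Match) -> str:
        tag = MYST_TO_FERN[match.group(1)]
        body = match.group(2).strip()
        return f"<{tag}>\n{body}\n</{tag}>"
    return ADMONITION.sub(repl, text)

if __name__ == "__main__":
    src = "```{note}\nRun the scripts from the repo root.\n```"
    print(convert_admonitions(src))
```

A real converter also has to cope with nested fences, directive options (e.g. `:class:`), and indented bodies, which is why the workflow runs a dedicated script rather than a one-line `sed`.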
diff --git a/fern/assets/RL_diagram.png b/fern/assets/RL_diagram.png new file mode 100644 index 0000000000..7a47b5fa06 Binary files /dev/null and b/fern/assets/RL_diagram.png differ diff --git a/fern/assets/actor-wg-worker-vc.png b/fern/assets/actor-wg-worker-vc.png new file mode 100644 index 0000000000..fe360c9939 Binary files /dev/null and b/fern/assets/actor-wg-worker-vc.png differ diff --git a/fern/assets/aime_training_progress.png b/fern/assets/aime_training_progress.png new file mode 100644 index 0000000000..1c69e59504 Binary files /dev/null and b/fern/assets/aime_training_progress.png differ diff --git a/fern/assets/dapo_train_reward.png b/fern/assets/dapo_train_reward.png new file mode 100644 index 0000000000..efe8dda10b Binary files /dev/null and b/fern/assets/dapo_train_reward.png differ diff --git a/fern/assets/dapo_val_acc.png b/fern/assets/dapo_val_acc.png new file mode 100644 index 0000000000..8b1c5ddba9 Binary files /dev/null and b/fern/assets/dapo_val_acc.png differ diff --git a/fern/assets/deepscaler_training_progress.png b/fern/assets/deepscaler_training_progress.png new file mode 100644 index 0000000000..0a57482d68 Binary files /dev/null and b/fern/assets/deepscaler_training_progress.png differ diff --git a/fern/assets/dtensor-tp-accuracy/image-20260111142255534.png b/fern/assets/dtensor-tp-accuracy/image-20260111142255534.png new file mode 100644 index 0000000000..4754c8b7cc Binary files /dev/null and b/fern/assets/dtensor-tp-accuracy/image-20260111142255534.png differ diff --git a/fern/assets/dtensor-tp-accuracy/image-20260111160656891-1768118824549-2.png b/fern/assets/dtensor-tp-accuracy/image-20260111160656891-1768118824549-2.png new file mode 100644 index 0000000000..82ceed4045 Binary files /dev/null and b/fern/assets/dtensor-tp-accuracy/image-20260111160656891-1768118824549-2.png differ diff --git a/fern/assets/dtensor-tp-accuracy/kl_hf_prev.png b/fern/assets/dtensor-tp-accuracy/kl_hf_prev.png new file mode 100644 index 
0000000000..646c88faf9 Binary files /dev/null and b/fern/assets/dtensor-tp-accuracy/kl_hf_prev.png differ diff --git a/fern/assets/dtensor-tp-accuracy/logprobs_unequal_1.png b/fern/assets/dtensor-tp-accuracy/logprobs_unequal_1.png new file mode 100644 index 0000000000..c67819d92c Binary files /dev/null and b/fern/assets/dtensor-tp-accuracy/logprobs_unequal_1.png differ diff --git a/fern/assets/dtensor-tp-accuracy/token_mult_prob_error_qwen3_4B.png b/fern/assets/dtensor-tp-accuracy/token_mult_prob_error_qwen3_4B.png new file mode 100644 index 0000000000..05cdc399ee Binary files /dev/null and b/fern/assets/dtensor-tp-accuracy/token_mult_prob_error_qwen3_4B.png differ diff --git a/fern/assets/dtensor-tp-accuracy/validation_accuracy.png b/fern/assets/dtensor-tp-accuracy/validation_accuracy.png new file mode 100644 index 0000000000..83cf93b0a6 Binary files /dev/null and b/fern/assets/dtensor-tp-accuracy/validation_accuracy.png differ diff --git a/fern/assets/favicon.png b/fern/assets/favicon.png new file mode 100644 index 0000000000..c2d361cf47 Binary files /dev/null and b/fern/assets/favicon.png differ diff --git a/fern/assets/fp8_curves.png b/fern/assets/fp8_curves.png new file mode 100644 index 0000000000..1825877a9e Binary files /dev/null and b/fern/assets/fp8_curves.png differ diff --git a/fern/assets/fp8_e2e_curve.png b/fern/assets/fp8_e2e_curve.png new file mode 100644 index 0000000000..d479602102 Binary files /dev/null and b/fern/assets/fp8_e2e_curve.png differ diff --git a/fern/assets/nsys-multi-report-view.png b/fern/assets/nsys-multi-report-view.png new file mode 100644 index 0000000000..4eac23c40b Binary files /dev/null and b/fern/assets/nsys-multi-report-view.png differ diff --git a/fern/assets/ray-debug-step1.png b/fern/assets/ray-debug-step1.png new file mode 100644 index 0000000000..1dc77052fd Binary files /dev/null and b/fern/assets/ray-debug-step1.png differ diff --git a/fern/assets/ray-debug-step2.png b/fern/assets/ray-debug-step2.png new file mode 
100644 index 0000000000..bb3ebc509b Binary files /dev/null and b/fern/assets/ray-debug-step2.png differ diff --git a/fern/assets/ray-debug-step3.png b/fern/assets/ray-debug-step3.png new file mode 100644 index 0000000000..23abcaf749 Binary files /dev/null and b/fern/assets/ray-debug-step3.png differ diff --git a/fern/assets/ray-debug-step4.png b/fern/assets/ray-debug-step4.png new file mode 100644 index 0000000000..da22112404 Binary files /dev/null and b/fern/assets/ray-debug-step4.png differ diff --git a/fern/assets/sft-openmathinstruct2-train-loss.png b/fern/assets/sft-openmathinstruct2-train-loss.png new file mode 100644 index 0000000000..d9fe1b6481 Binary files /dev/null and b/fern/assets/sft-openmathinstruct2-train-loss.png differ diff --git a/fern/assets/sft-openmathinstruct2-train1M-loss.png b/fern/assets/sft-openmathinstruct2-train1M-loss.png new file mode 100644 index 0000000000..b4ad667e1e Binary files /dev/null and b/fern/assets/sft-openmathinstruct2-train1M-loss.png differ diff --git a/fern/assets/train-reward-sliding-puzzle.png b/fern/assets/train-reward-sliding-puzzle.png new file mode 100644 index 0000000000..82d319f4f2 Binary files /dev/null and b/fern/assets/train-reward-sliding-puzzle.png differ diff --git a/fern/assets/val-log.png b/fern/assets/val-log.png new file mode 100644 index 0000000000..bda6618b8c Binary files /dev/null and b/fern/assets/val-log.png differ diff --git a/fern/assets/valid_acc-sliding-puzzle.png b/fern/assets/valid_acc-sliding-puzzle.png new file mode 100644 index 0000000000..7b6d539916 Binary files /dev/null and b/fern/assets/valid_acc-sliding-puzzle.png differ diff --git a/fern/components/CustomFooter.tsx b/fern/components/CustomFooter.tsx new file mode 100644 index 0000000000..fab392c407 --- /dev/null +++ b/fern/components/CustomFooter.tsx @@ -0,0 +1,91 @@ +/** + * Custom footer for NVIDIA docs (Fern native header/footer). 
+ * Markup and class names match the original custom-app footer 1:1 so that + * fern/main.css (footer + Built with Fern styles) applies correctly: + * dark mode logo, responsive layout, and Built with Fern tooltip. + */ +export default function CustomFooter() { + const currentYear = new Date().getFullYear(); + const logoUrl = + "https://fern-image-hosting.s3.us-east-1.amazonaws.com/nvidia/NVIDIA_Logo_0.svg"; + + return ( + + ); +} diff --git a/fern/docs.yml b/fern/docs.yml new file mode 100644 index 0000000000..d2e3f2f65e --- /dev/null +++ b/fern/docs.yml @@ -0,0 +1,54 @@ +instances: + - url: https://nemo-rl.docs.buildwithfern.com + +title: NeMo RL + +versions: + - display-name: v0.5.0 + path: versions/v0.5.0.yml + slug: v0.5.0 + +footer: ./components/CustomFooter.tsx + +layout: + searchbar-placement: header + page-width: 1376px + sidebar-width: 248px + content-width: 812px + tabs-placement: header + hide-feedback: true + +colors: + accentPrimary: + dark: "#76B900" + light: "#76B900" + background: + light: "#FFFFFF" + dark: "#000000" + +theme: + page-actions: toolbar + footer-nav: minimal + +logo: + dark: ./assets/NVIDIA_dark.svg + light: ./assets/NVIDIA_light.svg + height: 20 + href: / + right-text: NeMo RL + +favicon: ./assets/NVIDIA_symbol.svg + +css: + - ./main.css + +navbar-links: + - type: github + value: https://github.com/NVIDIA-NeMo/RL + - type: secondary + text: API Reference + url: https://docs.nvidia.com/nemo/rl/latest/apidocs/ + +experimental: + mdx-components: + - ./components diff --git a/fern/fern.config.json b/fern/fern.config.json new file mode 100644 index 0000000000..86813ca36b --- /dev/null +++ b/fern/fern.config.json @@ -0,0 +1 @@ +{"organization": "nvidia", "version": "3.77.0"} diff --git a/fern/main.css b/fern/main.css new file mode 100644 index 0000000000..87f5dbf7f9 --- /dev/null +++ b/fern/main.css @@ -0,0 +1,867 @@ +/*! + * SPDX-FileCopyrightText: Copyright (c) 2023-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+ * SPDX-License-Identifier: LicenseRef-NvidiaProprietary
+ *
+ * NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
+ * property and proprietary rights in and to this material, related
+ * documentation and any modifications thereto. Any use, reproduction,
+ * disclosure or distribution of this material and related documentation
+ * without an express license agreement from NVIDIA CORPORATION or
+ * its affiliates is strictly prohibited.
+ */
+
+/* Color themes for light and dark modes */
+:root {
+  /* Brand Colors */
+  --nv-color-green: #76B900;
+  --nv-color-green-2: #004B31;
+  --nv-color-black: #000000;
+  --nv-color-white: #FFFFFF;
+
+  /* Grey Scale - Light */
+  --nv-light-grey-1: #f7f7f7;
+  --nv-light-grey-2: #EEEEEE;
+  --nv-light-grey-3: #DDDDDD;
+  --nv-light-grey-4: #CCCCCC;
+  --nv-light-grey-5: #999999;
+
+  /* Grey Scale - Dark */
+  --nv-dark-grey-1: #111111;
+  --nv-dark-grey-2: #1A1A1A;
+  --nv-dark-grey-3: #222222;
+  --nv-dark-grey-4: #333333;
+  --nv-dark-grey-5: #666666;
+
+  /* Colors by Usage */
+  --nv-color-text: #000000;
+  --nv-color-bg-default: #FFFFFF;
+  --nv-color-bg-alt: #f7f7f7;
+  --nv-color-success: #76B900;
+  --nv-color-error: #f44336;
+
+  /* Theme-independent settings */
+  --rounded: 999px;
+}
+main {
+  min-height: calc(100vh - 200px);
+}
+/* Typography - Headers */
+h1 {
+  font-size: 36px;
+  font-weight: 700;
+  line-height: 1.25em; /* 45px */
+}
+
+h2 {
+  font-size: 28px;
+  font-weight: 700;
+  line-height: 1.25em; /* 35px */
+}
+
+h3 {
+  font-size: 24px;
+  font-weight: 700;
+  line-height: 1.25em; /* 30px */
+}
+
+h4 {
+  font-size: 20px;
+  font-weight: 700;
+  line-height: 1.25em; /* 25px */
+}
+
+/* Typography - Paragraphs */
+.prose {
+  color: var(--nv-dark-grey-2) !important;
+}
+.dark .prose {
+  color: var(--nv-light-grey-2) !important;
+}
+p {
+  text-decoration-thickness: 3px;
+}
+.fern-mdx-link {
+  color: var(--tw-prose-body);
+  text-decoration-color: var(--accent);
+  font-weight: var(--font-weight-normal);
+}
+
+/* 
Light theme (default) */
+html:not([data-theme]),html[data-theme=light] {
+  --pst-color-background: #fff;
+  --pst-color-on-background: #fff;
+  --pst-color-shadow: #ccc;
+  --pst-color-heading: #000;
+  --pst-color-text-base: #1a1a1a;
+  --pst-color-text-muted: #666;
+  --pst-color-surface: #f7f7f7;
+  --pst-color-on-surface: #333;
+  --pst-color-primary: var(--nv-color-green-2);
+  --pst-color-table-row-hover-bg: var(--nv-color-green);
+  --pst-color-link: var(--pst-color-text-base);
+  --pst-color-link-hover: var(--pst-color-text-base);
+  --pst-color-inline-code: var(--pst-color-primary);
+  --pst-color-inline-code-links: var(--pst-color-primary);
+  --pst-color-secondary: var(--pst-color-primary);
+  --pst-color-secondary-bg: var(--nv-color-green);
+  --pst-color-accent: var(--nv-color-green);
+}
+
+/* Dark theme */
+html[data-theme=dark] {
+  --pst-color-background: #111;
+  --pst-color-on-background: #000;
+  --pst-color-shadow: #000;
+  --pst-color-heading: #fff;
+  --pst-color-text-base: #eee;
+  --pst-color-text-muted: #999;
+  --pst-color-surface: #1a1a1a;
+  --pst-color-on-surface: #ddd;
+  --pst-color-primary: var(--nv-color-green);
+  --pst-color-table-row-hover-bg: var(--nv-color-green-2);
+  --pst-color-link: var(--pst-color-text-base);
+  --pst-color-link-hover: var(--pst-color-text-base);
+  --pst-color-inline-code: var(--pst-color-primary);
+  --pst-color-inline-code-links: var(--pst-color-primary);
+  --pst-color-secondary: var(--pst-color-primary);
+  --pst-color-secondary-bg: var(--nv-color-green-2);
+  --pst-color-accent: var(--nv-color-green);
+}
+
+/* Product and version selector styling */
+
+.fern-product-selector {
+  border-radius: 8px;
+  pointer-events: none !important;
+  padding-right: 2px;
+}
+
+.product-dropdown-trigger svg {
+  display: none !important;
+}
+
+.fern-product-selector .product-dropdown-trigger p {
+  font-weight: bold !important;
+}
+.fern-product-selector-radio-group {
+  display: grid;
+  grid-template-columns: repeat(3, 1fr);
+  gap: 8px;
+  max-width: 
1000px; +} + +@media (max-width: 768px) { + .fern-product-selector-radio-group { + grid-template-columns: repeat(2, 1fr); + } +} +.fern-version-selector { + transform: translateY(-1px); +} + +.fern-version-selector .version-dropdown-trigger{ + outline: 1px solid var(--border, var(--grayscale-a5)) !important; + border-radius: 5px; + transition: box-shadow 0.3s ease, outline 0.3s ease; +} +.product-dropdown-trigger{ + padding-left: 0px !important; +} + +.product-dropdown-trigger, .version-dropdown-trigger{ + background-color: transparent !important; +} +.product-dropdown-trigger svg:hover{ + stroke: var(--nv-color-green) !important; +} +.version-dropdown-trigger:hover{ + box-shadow: 0 0 0 1px var(--nv-color-green) !important; +} +.version-dropdown-trigger svg:hover{ + stroke: var(--nv-color-green) !important; +} +/* Sidebar styling */ +#fern-sidebar { + border-right: 1px solid var(--border, var(--grayscale-a5)) !important; + height: 100vh !important; +} +.fern-sidebar-link:not(:hover){ + background-color: transparent !important; +} +.fern-sidebar-link { + padding-left: 1rem !important; + padding-right: 1rem !important; + padding-top: 0.5rem !important; + padding-bottom: 0.5rem !important; + border-radius: 0px !important; + &.nested { + padding-left: 1rem !important; + } +} +/* Section-level sidebar links (pages that have children) should match sidebar heading padding */ +.fern-sidebar-group > li > .fern-sidebar-link:has(+ .fern-sidebar-group) { + padding-left: 0.25rem !important; +} +.fern-sidebar-group{ + padding: 0 !important +} +#fern-sidebar-scroll-area{ + padding-right: 0 !important +} + +/* header styling */ +.fern-header-content{ + padding-left: 18.5px; + margin-top: -5px; + margin-bottom: -5px; +} +#fern-header { + border-color: var(--border, var(--grayscale-a5)) !important; +} +@keyframes header-background-fade { + 0% { + background-color: transparent; + } + 100% { + background-color: var(--header-background); + } + } + +[data-theme=default]#fern-header { 
+animation: header-background-fade linear; +animation-timeline: scroll(); +animation-range: 0 50px; +} +.fern-header-navbar-links .fern-button{ + background-color: transparent !important; +} +.fern-header-navbar-links > button{ + background-color: transparent !important; +} +.fern-header-logo-container > div > div > a > img{ + padding-right: 0.5rem; +} +.fern-header-logo-container .font-heading{ + font-size: 16px !important; + font-weight: bold !important; + color: var(--grayscale-a12) !important; + border-inline: 1px solid var(--border, var(--grayscale-a5)); + padding: 15px 1rem; + margin: -20px 0.5rem; +} +@media (max-width: 1024px) { + .fern-header-logo-container .font-heading{ + display: none !important; + } +} +/* Search bar styling */ +#fern-search-button{ + background-color: transparent !important; + border-radius: var(--rounded); + transition: box-shadow 0.3s ease, outline 0.3s ease; +} +#fern-search-button:hover{ + box-shadow: 0 0 0 1px var(--nv-color-green) !important; +} +#fern-search-button .fern-kbd{ + display: none; +} + +.fern-layout-footer-toolbar button{ + background-color: transparent !important; + border-color: transparent !important; + padding-inline: 0px !important; +} + +/* ========== Custom footer (native React component) – 1:1 with original ========== */ +.bd-footer { + border-top: 1px solid var(--border, var(--grayscale-a5)) !important; + font-family: NVIDIA, -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif !important; + font-size: 0.875rem; + padding: 2rem 0; + width: 100%; +} +.bd-footer * { + font-family: inherit; +} +.bd-footer__inner { + padding: 0 2rem; +} +.footer-items__start { + display: flex; + flex-direction: column; + gap: 1.5rem; +} +.footer-logos-container { + display: flex; + align-items: center; + justify-content: space-between; + width: 100%; + gap: 1rem; +} +.footer-brand { + display: inline-block; + text-decoration: none; +} +.footer-brand .logo__image { + height: 24px; + width: auto; + transition: 
opacity 0.2s ease; +} +.footer-brand:hover .logo__image { + opacity: 0.8; +} +.footer-brand-fern { + display: flex; + align-items: center; + margin-left: auto; +} +/* Logo theme visibility – .dark is on ancestor in Fern */ +.only-light { + display: block; + filter: invert(1); +} +.only-dark { + display: none; +} +.dark .only-light { + display: none; +} +.dark .only-dark { + display: block; + filter: none; +} +.footer-links { + display: flex; + flex-wrap: wrap; + gap: 0.25rem 0.5rem; + line-height: 1.65; + margin: 0; + padding: 0; +} +.footer-links a { + color: var(--grayscale-a11); + text-decoration: none; + transition: color 0.2s ease; + white-space: nowrap; +} +.pipe-separator { + color: var(--grayscale-a11); + white-space: nowrap; +} +.copyright { + color: var(--grayscale-a11); + font-size: 0.875rem; + line-height: 1.65; + margin: 0; +} +@media (max-width: 768px) { + .bd-footer { padding: 1.5rem 0; } + .bd-footer__inner { padding: 0 1.5rem; } + .footer-items__start { gap: 1rem; } + .footer-links { flex-direction: row; gap: 0.5rem 0.75rem; } + .footer-links a { white-space: normal; word-break: break-word; } +} +@media (max-width: 480px) { + .footer-links { gap: 0.5rem; } + .footer-links a { font-size: 0.8125rem; } + .copyright { font-size: 0.8125rem; } +} +/* Built with Fern link + tooltip */ +.built-with-fern-link { + display: flex; + align-items: baseline; + gap: 0.25rem; + text-decoration: none; + position: relative; +} +.built-with-fern-logo { + height: 1rem; + margin: 0; + transition: filter 150ms ease; +} +.built-with-fern-logo path { fill: var(--grayscale-a12); } +.built-with-fern-link:hover .built-with-fern-logo { filter: saturate(1) opacity(1); } +.built-with-fern-link:hover .built-with-fern-logo path:nth-child(2) { fill: #51C233; } +.built-with-fern-tooltip { + position: absolute; + top: 50%; + right: calc(100%); + bottom: auto; + left: auto; + transform: translateY(-50%); + margin: 0; + margin-right: 0.5rem; + padding: 0.5rem 0.75rem; + 
background-color: #FFFFFF; + color: #000000; + font-size: 0.85rem; + border-radius: 0.375rem; + border: 1px solid var(--grayscale-a5); + white-space: nowrap; + pointer-events: none; + opacity: 0; + transition: opacity 150ms ease; + transition-delay: 0s; + z-index: 50; + box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); + width: max-content; +} +.built-with-fern-link:hover .built-with-fern-tooltip { + opacity: 1; + transition-delay: 0.75s; +} +.dark .built-with-fern-tooltip { + background-color: #000000; + color: #FFFFFF; +} +.built-with-fern-logo-dark { display: none; } +.dark .built-with-fern-logo-light { display: none; } +.dark .built-with-fern-logo-dark { display: block; } +@media (prefers-color-scheme: dark) { + .built-with-fern-logo-light { display: none; } + .built-with-fern-logo-dark { display: block; } +} + +/* Footer styling */ +.fern-footer-nav{ + border-radius: var(--rounded); + background-color: transparent !important; + transition: box-shadow 0.3s ease, outline 0.3s ease; +} +/* Hide line numbers */ +.code-block-line-gutter { + display: none !important; +} +.fern-footer-prev h4, .fern-footer-next h4{ + font-size: inherit !important; +} +.fern-sidebar-link.nested[data-state="active"]:before { + left: -0px !important; + bottom: -0px !important; + top: -0px !important; + width: 2px !important; +} +.fern-sidebar-link[data-state="active"] { + color: unset !important; +} + +.fern-selection-item .fern-selection-item-icon{ + border-color: transparent !important; +} +/* Button styling */ +.fern-button{ + border-radius: var(--rounded); + font-weight: bold; +} +.fern-button.filled.primary{ + color: var(--nv-color-black); +} +.dark .fern-button.filled.primary{ + background-color: var(--nv-color-white); +} +.dark .fern-button.filled.primary:hover{ + background-color: var(--nv-light-grey-2); +} +.fern-button.outlined.normal{ + background-color: transparent; + --tw-ring-color: transparent; + color: var(--nv-color-black); +} +.fern-button.outlined.normal:hover{ + color: 
var(--nv-color-green) +} +.dark .fern-button.outlined.normal{ + color: var(--nv-color-white); +} +.dark .fern-button.outlined.normal:hover{ + color: var(--nv-color-green); +} +/* Card styling */ +.fern-card{ + transition: box-shadow 0.3s ease, outline 0.3s ease; +} +svg.card-icon{ + height: 24px !important; + width: 24px !important; +} +.card-icon{ + background-color: transparent !important; +} +.fern-card:hover{ + box-shadow: 0 0 0 1px var(--nv-color-green) !important; +} +.fern-docs-badge{ + border-radius: var(--rounded); +} +.fern-page-actions button:hover{ + background-color: transparent !important; +} +.fern-page-actions a:hover{ + background-color: transparent !important; +} +/* Moving logo to footer */ +#builtwithfern, #builtwithfern * { + display: none !important; +} + +/* Landing Page Gradients */ +/* Top: Simple radial gradient (no mask, responsive) */ +.landing-gradient-top { + position: absolute; + top: 0; + left: 0; + right: 0; + height: 800px; + background: radial-gradient(ellipse 100% 100% at 50% 10%, + rgba(191, 242, 48, 0.15) 0%, + rgba(158, 228, 179, 0.12) 30%, + rgba(124, 215, 254, 0.12) 50%, + rgba(124, 215, 254, 0.06) 75%, + transparent 100%); + pointer-events: none; + z-index: 0; +} + +/* Bottom: Masked gradient for organic transition */ +.landing-gradient-bottom { + position: absolute; + bottom: -282px; + left: 0; + right: 0; + height: 1232px; + background: linear-gradient(85deg, #BFF230 41.98%, #7CD7FE 99.52%); + opacity: 0.05; + pointer-events: none; + z-index: 5; + mask-image: url('https://www.figma.com/api/mcp/asset/27509afa-9c16-46bb-8415-4395e2e5a347'); + mask-repeat: no-repeat; + mask-position: 0% -17px; + mask-size: 100% auto; + -webkit-mask-image: url('https://www.figma.com/api/mcp/asset/27509afa-9c16-46bb-8415-4395e2e5a347'); + -webkit-mask-repeat: no-repeat; + -webkit-mask-position: 0% -17px; + -webkit-mask-size: 100% auto; +} + +/* Landing Page Gradients Wrapper */ +.landing-page-gradients { + position: relative; + width: 100%; + 
margin-top: -100px; + padding-top: 100px; + overflow: visible; + background: #181818; +} + +/* Hero Section (Landing page only) */ +.hero-section { + position: relative; + width: 100%; + padding: 3rem 6rem; + margin: 0 auto; + overflow: visible; + display: flex; + flex-direction: column; + align-items: center; + z-index: 10; +} + +/* Hero Section Content - constrain width */ +.hero-section > * { + position: relative; + z-index: 100; + max-width: 1440px; + width: 100%; +} + +/* Tablet and Mobile: fix spacing and layout */ +@media (max-width: 1024px) { + /* Extend dark background behind header */ + .landing-page body, .landing-page html, .landing-page main { + background: #181818 !important; + } + + .landing-page-gradients { + margin-top: -100px; + padding-top: 100px; + } + + .hero-section { + padding: 2rem 2rem; + } + + .hero-section > * { + max-width: none; + } + + .hero-content-grid { + grid-template-columns: 1fr; + gap: 2rem; + } + + .hero-heading { + font-size: 36px; + } + + .hero-subtitle { + font-size: 16px; + } + + .hero-title-section { + margin-bottom: 2rem; + } +} + +/* Small mobile only */ +@media (max-width: 600px) { + .hero-heading { + font-size: 28px; + } + + .hero-section { + padding: 1.5rem 1.5rem; + } +} + +.hero-section h1, +.hero-section h2, +.hero-section h3, +.hero-section h4, +.hero-section h5, +.hero-section h6 { + pointer-events: none !important; +} +/* Hero Title Section */ +.hero-title-section { + text-align: center; + margin-bottom: 4rem; + position: relative; + z-index: 100; +} + +.hero-heading { + font-size: 48px; + font-weight: 700; + line-height: 1.2; + margin: 0 0 1rem 0; + color: var(--nv-color-white); +} + +.hero-subtitle { + font-size: 18px; + line-height: 1.5; + margin: 0; + color: var(--nv-color-white); +} + +/* Hero Content Grid */ +.hero-content-grid { + display: grid; + grid-template-columns: repeat(2, 1fr); + gap: 3rem; + align-items: start; + position: relative; + z-index: 100; +} + +.hero-column { + display: flex; + 
flex-direction: column; + gap: 1rem; +} + +.hero-column-title { + font-size: 24px; + font-weight: 700; + margin: 0; + color: var(--nv-color-white); +} + +.hero-column-subtitle { + font-size: 16px; + margin: 0 0 1rem 0; + color: var(--nv-color-white); +} + +/* Hero Card Container (Left Column) */ +.hero-card-container { + display: flex; + flex-direction: column; + border-radius: 8px; + overflow: hidden; + border: 1px solid var(--border, var(--grayscale-a5)); + margin-top: 1.5rem !important; + background: rgba(26, 26, 26, 0.2); + backdrop-filter: blur(6px); +} + +.hero-card-image { + width: 100%; + height: auto; + display: block; +} + +.hero-card-content { + padding: 1.5rem; + display: flex; + flex-direction: row; + gap: 1rem; + align-items: center; + justify-content: space-between; + background: rgba(26, 26, 26, 0.2); + backdrop-filter: blur(6px); +} + +.hero-card-text-wrapper { + flex: 1; +} + +.hero-card-text { + margin: 0; + font-size: 14px; + line-height: 1.5; + color: var(--nv-color-white); +} + +.hero-card-button-wrapper { + flex-shrink: 0; +} +.hero-card-button-wrapper .fern-mdx-link{ + text-decoration: none !important; +} + +.hero-card-button { + white-space: nowrap; +} + +/* Hero Cards */ + +.hero-column .fern-card { + padding: 9px 17px; + background-color: rgba(26, 26, 26, 0.2) !important; + backdrop-filter: blur(6px); +} + +.hero-section .fern-card{ + color: white !important; +} + +.hero-column .card-icon { + font-size: 64px !important; + width: 64px !important; + height: 64px !important; +} + +.hero-column .card-icon svg, +.hero-column .card-icon i { + font-size: 64px !important; + width: 64px !important; + height: 64px !important; +} + +.hero-column .fern-card-title { + font-size: 16px; + font-weight: 500; + line-height: 24px; +} + +.hero-column .fern-card p { + font-size: 14px; + line-height: 20px; + color: white !important; +} + +/* Body Section */ +.body-section { + display: flex; + padding: 4rem 16rem; + flex-direction: column; + justify-content: 
center; + align-items: center; + gap: 4rem; + align-self: stretch; + position: relative; + z-index: 1; + background: #181818; +} + +/* Body Section Content - constrain width */ +.body-section > * { + max-width: 1440px; + width: 100%; + position: relative; + z-index: 10; +} + +.code-block .fern-code-link{ + text-decoration: underline !important; + text-decoration-color: var(--accent) !important; + text-underline-offset: 1px !important; + text-decoration-style: underline !important; +} + +/* Mobile Styles */ +@media (max-width: 768px) { + .hero-section { + padding: 2rem 1.5rem; + } + + .hero-title-section { + margin-bottom: 2rem; + } + + .hero-heading { + font-size: 32px; + } + + .hero-subtitle { + font-size: 16px; + } + + .hero-content-grid { + grid-template-columns: 1fr; + gap: 2rem; + } + + .hero-column-title { + font-size: 20px; + } + + .hero-column-subtitle { + font-size: 14px; + } + + .hero-card-content { + flex-direction: column; + align-items: flex-start; + } + + .hero-card-button-wrapper { + align-self: flex-start; + } + + .hero-column .card-icon, + .hero-column .card-icon svg, + .hero-column .card-icon i { + font-size: 40px !important; + width: 40px !important; + height: 40px !important; + } + + .hero-column .fern-card-title { + font-size: 14px; + } + + .hero-column .fern-card p { + font-size: 11px; + } + + .body-section { + padding: 2rem 1.5rem; + } + + .fern-selection-item-icon.use-icon { + display: none !important; + } +} \ No newline at end of file diff --git a/fern/scripts/add_frontmatter.py b/fern/scripts/add_frontmatter.py new file mode 100644 index 0000000000..75fada11b8 --- /dev/null +++ b/fern/scripts/add_frontmatter.py @@ -0,0 +1,67 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +"""Add frontmatter (title, description) to MDX files derived from first H1.""" + +import argparse +import re +from pathlib import Path + + +def derive_title(content: str) -> str: + """Extract title from first # Heading.""" + match = re.search(r"^#\s+(.+)$", content, re.MULTILINE) + if match: + title = match.group(1).strip() + title = re.sub(r"\{[^}]+\}`[^`]*`", "", title).strip() + return title or "Untitled" + return "Untitled" + + +def add_frontmatter(filepath: Path) -> bool: + """Add frontmatter if missing. Returns True if changes were made.""" + content = filepath.read_text() + + if content.strip().startswith("---"): + return False + + title = derive_title(content) + title_escaped = title.replace('"', '\\"') + frontmatter = f'---\ntitle: "{title_escaped}"\ndescription: ""\n---\n\n' + body = content.lstrip() + + # Remove duplicate H1 that matches title (Fern uses frontmatter title) + body = re.sub(r"^#\s+" + re.escape(title) + r"\s*\n+", "", body, count=1) + + new_content = frontmatter + body + filepath.write_text(new_content) + return True + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Add frontmatter to MDX files" + ) + parser.add_argument( + "pages_dir", + type=Path, + help="Path to pages directory (e.g. 
fern/v0.5.0/pages)", + ) + args = parser.parse_args() + + pages_dir = args.pages_dir.resolve() + if not pages_dir.exists(): + raise SystemExit(f"Error: pages directory not found at {pages_dir}") + + changed = [] + for mdx_file in sorted(pages_dir.rglob("*.mdx")): + if add_frontmatter(mdx_file): + changed.append(mdx_file.relative_to(pages_dir)) + print(f" Added frontmatter: {mdx_file.relative_to(pages_dir)}") + + print(f"\nAdded frontmatter to {len(changed)} files") + + +if __name__ == "__main__": + main() diff --git a/fern/scripts/check_unconverted.sh b/fern/scripts/check_unconverted.sh new file mode 100755 index 0000000000..13a5d79836 --- /dev/null +++ b/fern/scripts/check_unconverted.sh @@ -0,0 +1,74 @@ +#!/bin/bash +# Check for unconverted MyST syntax in Fern docs + +set -e + +PAGES_DIR="${1:-fern/v0.5.0/pages}" + +echo "=== Checking for unconverted MyST syntax in $PAGES_DIR ===" +echo "" + +ISSUES_FOUND=0 + +echo "Checking for MyST directives (:::)..." +if grep -r ':::' "$PAGES_DIR" 2>/dev/null; then + echo "⚠️ Found unconverted MyST directives (see above)" + ISSUES_FOUND=1 +else + echo "✓ No MyST directives found" +fi +echo "" + +echo "Checking for {ref} references (Sphinx cross-refs, not LaTeX \\text{ref})..." +if grep -rE '\{ref\}`' "$PAGES_DIR" 2>/dev/null || grep -rE '\{ref\} ' "$PAGES_DIR" 2>/dev/null; then + echo "⚠️ Found unconverted {ref} references" + ISSUES_FOUND=1 +else + echo "✓ No {ref} references found" +fi +echo "" + +echo "Checking for {octicon} icons..." +if grep -r '{octicon}' "$PAGES_DIR" 2>/dev/null; then + echo "⚠️ Found unconverted {octicon} icons" + ISSUES_FOUND=1 +else + echo "✓ No {octicon} icons found" +fi +echo "" + +echo "Checking for {py:class} / {py:meth}..." +if grep -rE '\{py:(class|meth)\}' "$PAGES_DIR" 2>/dev/null; then + echo "⚠️ Found unconverted py:class or py:meth" + ISSUES_FOUND=1 +else + echo "✓ No py:class/py:meth found" +fi +echo "" + +echo "Checking for sphinx-design badges..." 
+if grep -r '{bdg-' "$PAGES_DIR" 2>/dev/null; then + echo "⚠️ Found unconverted badges" + ISSUES_FOUND=1 +else + echo "✓ No badges found" +fi +echo "" + +echo "Checking for MyST mermaid syntax..." +if grep -r '```{mermaid}' "$PAGES_DIR" 2>/dev/null; then + echo "⚠️ Found unconverted mermaid blocks (should be \`\`\`mermaid)" + ISSUES_FOUND=1 +else + echo "✓ No MyST mermaid syntax found" +fi +echo "" + +echo "=== Summary ===" +if [ $ISSUES_FOUND -eq 0 ]; then + echo "✓ All checks passed" + exit 0 +else + echo "⚠️ Some issues found - review and fix above" + exit 1 +fi diff --git a/fern/scripts/convert_myst_to_fern.py b/fern/scripts/convert_myst_to_fern.py new file mode 100644 index 0000000000..518551fd7a --- /dev/null +++ b/fern/scripts/convert_myst_to_fern.py @@ -0,0 +1,234 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Convert MyST Markdown syntax to Fern MDX components. + +Handles: admonitions, dropdowns, tab sets, grid cards, toctree removal, +HTML comments. Run convert_rl_specific.py first to strip {octicon} and {py:*} roles. 
+""" + +import argparse +import re +from pathlib import Path + + +def convert_admonitions(content: str) -> str: + """Convert MyST admonitions to Fern components.""" + admonition_map = { + "note": "Note", + "warning": "Warning", + "tip": "Tip", + "important": "Info", + "seealso": "Note", + "caution": "Warning", + "danger": "Warning", + "attention": "Warning", + "hint": "Tip", + } + + for myst_type, fern_component in admonition_map.items(): + pattern = rf"```\{{{myst_type}\}}\s*\n(.*?)```" + replacement = rf"<{fern_component}>\n\1" + content = re.sub(pattern, replacement, content, flags=re.DOTALL | re.IGNORECASE) + + pattern = rf":::\{{{myst_type}\}}\s*\n(.*?):::" + content = re.sub(pattern, replacement, content, flags=re.DOTALL | re.IGNORECASE) + + return content + + +def convert_dropdowns(content: str) -> str: + """Convert MyST dropdowns to Fern Accordion components.""" + pattern = r"```\{dropdown\}\s+([^\n]+)\s*\n(.*?)```" + + def replace_dropdown(match: re.Match[str]) -> str: + title = match.group(1).strip() + body = match.group(2).strip() + if '"' in title: + title = title.replace('"', "'") + return f'\n{body}\n' + + return re.sub(pattern, replace_dropdown, content, flags=re.DOTALL) + + +def convert_tab_sets(content: str) -> str: + """Convert MyST tab sets to Fern Tabs components.""" + content = re.sub(r"::::+\s*\{tab-set\}\s*", "\n", content) + content = re.sub(r"```\{tab-set\}\s*", "\n", content) + + def replace_tab_item(match: re.Match[str]) -> str: + title = match.group(1).strip() + return f'' + + content = re.sub(r"::::*\s*\{tab-item\}\s+([^\n]+)", replace_tab_item, content) + content = re.sub(r":::*\s*\{tab-item\}\s+([^\n]+)", replace_tab_item, content) + + lines = content.split("\n") + result = [] + in_tab = False + + for line in lines: + if '\n") + in_tab = True + result.append(line) + elif line.strip() in [":::::", "::::", ":::", ""]: + if in_tab and line.strip() != "": + result.append("") + in_tab = False + if line.strip() in [":::::", "::::"]: + 
result.append("") + else: + result.append(line) + else: + result.append(line) + + content = "\n".join(result) + content = re.sub(r"\n::::+\n", "\n", content) + content = re.sub(r"\n:::+\n", "\n", content) + return content + + +def convert_grid_cards(content: str) -> str: + """Convert MyST grid cards to Fern Cards components.""" + content = re.sub(r"::::+\s*\{grid\}[^\n]*\n", "\n", content) + content = re.sub(r"```\{grid\}[^\n]*\n", "\n", content) + + def replace_card(match: re.Match[str]) -> str: + full_match = match.group(0) + title_match = re.search(r"\{grid-item-card\}\s+(.+?)(?:\n|$)", full_match) + title = title_match.group(1).strip() if title_match else "Card" + + link_match = re.search(r":link:\s*(\S+)", full_match) + href = link_match.group(1) if link_match else "" + + if href and href != "apidocs/index": + if not href.startswith("http"): + href = "/" + href.replace(".md", "").replace(".mdx", "") + return f'' + if href == "apidocs/index": + return f'' + return f'' + + content = re.sub( + r"::::*\s*\{grid-item-card\}[^\n]*(?:\n:link:[^\n]*)?(?:\n:link-type:[^\n]*)?", + replace_card, + content, + ) + content = re.sub( + r":::*\s*\{grid-item-card\}[^\n]*(?:\n:link:[^\n]*)?(?:\n:link-type:[^\n]*)?", + replace_card, + content, + ) + + lines = content.split("\n") + result = [] + in_card = False + + for line in lines: + if '\n") + in_card = True + result.append(line) + elif line.strip() in [":::::", "::::", ":::", ""]: + if in_card and line.strip() != "": + result.append("\n") + in_card = False + if line.strip() in [":::::", "::::"]: + result.append("\n") + else: + result.append(line) + + return "\n".join(result) + + +def remove_toctree(content: str) -> str: + """Remove toctree blocks entirely.""" + content = re.sub(r"```\{toctree\}.*?```", "", content, flags=re.DOTALL) + content = re.sub(r":::\{toctree\}.*?:::", "", content, flags=re.DOTALL) + return content + + +def convert_html_comments(content: str) -> str: + """Convert HTML comments to JSX comments.""" + 
return re.sub(r"", r"{/* \1 */}", content, flags=re.DOTALL) + + +def remove_directive_options(content: str) -> str: + """Remove MyST directive options.""" + for opt in [ + ":icon:", ":class:", ":columns:", ":gutter:", ":margin:", ":padding:", + ":link-type:", ":maxdepth:", ":titlesonly:", ":hidden:", ":link:", + ":caption:", + ]: + content = re.sub(rf"\n{re.escape(opt)}[^\n]*", "", content) + return content + + +def fix_malformed_tags(content: str) -> str: + """Fix common malformed tag issues.""" + content = re.sub(r'title=""', 'title="Details"', content) + content = re.sub( + r"<(Note|Warning|Tip|Info)([^>]*)/>\s*\n([^<]+)", + r"<\1\2>\n\3", + content, + ) + return content + + +def clean_multiple_newlines(content: str) -> str: + """Clean up excessive newlines.""" + content = re.sub(r"\n{3,}", "\n\n", content) + return content.strip() + "\n" + + +def convert_file(filepath: Path) -> bool: + """Convert a single file. Returns True if changes were made.""" + content = filepath.read_text() + original = content + + content = convert_admonitions(content) + content = convert_dropdowns(content) + content = convert_grid_cards(content) + content = convert_tab_sets(content) + content = remove_toctree(content) + content = convert_html_comments(content) + content = remove_directive_options(content) + content = fix_malformed_tags(content) + content = clean_multiple_newlines(content) + + if content != original: + filepath.write_text(content) + return True + return False + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Convert MyST syntax to Fern MDX in pages directory" + ) + parser.add_argument( + "pages_dir", + type=Path, + help="Path to pages directory (e.g. 
fern/v0.5.0/pages)", + ) + args = parser.parse_args() + + pages_dir = args.pages_dir.resolve() + if not pages_dir.exists(): + raise SystemExit(f"Error: pages directory not found at {pages_dir}") + + changed = [] + for mdx_file in sorted(pages_dir.rglob("*.mdx")): + if convert_file(mdx_file): + changed.append(mdx_file.relative_to(pages_dir)) + print(f" Converted: {mdx_file.relative_to(pages_dir)}") + + print(f"\nConverted {len(changed)} files") + + +if __name__ == "__main__": + main() diff --git a/fern/scripts/convert_rl_specific.py b/fern/scripts/convert_rl_specific.py new file mode 100644 index 0000000000..92e66b906c --- /dev/null +++ b/fern/scripts/convert_rl_specific.py @@ -0,0 +1,80 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Convert RL-specific MyST/Sphinx syntax: {octicon}, {py:class}, {py:meth}.""" + +import argparse +import re +from pathlib import Path + +API_DOCS_BASE = "https://docs.nvidia.com/nemo/rl/latest/apidocs" + + +def strip_octicon(content: str) -> str: + """Remove {octicon}`icon` from text, leaving the rest (e.g. 'Overview').""" + return re.sub(r"\{octicon\}`[^`]+`\s*", "", content) + + +def escape_mdx_curly_braces(content: str) -> str: + """Escape {variable} in code blocks so MDX doesn't parse as JSX (e.g. {overrides}).""" + return content.replace("{overrides}", "\\{overrides\\}") + + +def convert_py_roles(content: str) -> str: + """Convert {py:class}`text` and {py:meth}`text` to inline code `text`.""" + # {py:class}`text` or {py:class}`text ` - strip trailing space from capture + content = re.sub( + r"\{py:class\}`([^`<]+?)(?:\s*<[^>]+>)?`", + lambda m: f"`{m.group(1).strip()}`", + content, + ) + content = re.sub( + r"\{py:meth\}`([^`<]+?)(?:\s*<[^>]+>)?`", + lambda m: f"`{m.group(1).strip()}`", + content, + ) + return content + + +def convert_file(filepath: Path) -> bool: + """Convert a single file. 
Returns True if changes were made.""" + content = filepath.read_text() + original = content + + content = strip_octicon(content) + content = escape_mdx_curly_braces(content) + content = convert_py_roles(content) + + if content != original: + filepath.write_text(content) + return True + return False + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Convert RL-specific syntax (octicon, py:class, py:meth)" + ) + parser.add_argument( + "pages_dir", + type=Path, + help="Path to pages directory (e.g. fern/v0.5.0/pages)", + ) + args = parser.parse_args() + + pages_dir = args.pages_dir.resolve() + if not pages_dir.exists(): + raise SystemExit(f"Error: pages directory not found at {pages_dir}") + + changed = [] + for mdx_file in sorted(pages_dir.rglob("*.mdx")): + if convert_file(mdx_file): + changed.append(mdx_file.relative_to(pages_dir)) + print(f" Converted: {mdx_file.relative_to(pages_dir)}") + + print(f"\nConverted {len(changed)} files") + + +if __name__ == "__main__": + main() diff --git a/fern/scripts/copy_docs_to_fern.py b/fern/scripts/copy_docs_to_fern.py new file mode 100644 index 0000000000..e9977f92f8 --- /dev/null +++ b/fern/scripts/copy_docs_to_fern.py @@ -0,0 +1,82 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Copy docs/*.md to fern//pages/*.mdx preserving directory structure.""" + +import argparse +import shutil +from pathlib import Path + +SKIP_FILES = { + "conf.py", + "Makefile", + "helpers.py", + "versions1.json", + "project.json", +} +SKIP_DIRS = {"_templates", "_build", "apidocs", ".venv", ".git"} + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Copy docs/*.md to fern//pages/*.mdx" + ) + parser.add_argument( + "version", + help="Version folder name (e.g. 
v0.5.0)", + ) + parser.add_argument( + "--docs-dir", + default="docs", + help="Source docs directory (default: docs)", + ) + parser.add_argument( + "--fern-dir", + default="fern", + help="Fern root directory (default: fern)", + ) + args = parser.parse_args() + + repo_root = Path(__file__).resolve().parent.parent.parent + docs_dir = repo_root / args.docs_dir + fern_dir = repo_root / args.fern_dir + pages_dir = fern_dir / args.version / "pages" + + if not docs_dir.exists(): + raise SystemExit(f"Error: docs directory not found at {docs_dir}") + + pages_dir.mkdir(parents=True, exist_ok=True) + + # Copy docs/assets to fern/assets if they exist + docs_assets = docs_dir / "assets" + fern_assets = fern_dir / "assets" + if docs_assets.exists(): + for asset in docs_assets.rglob("*"): + if asset.is_file(): + rel = asset.relative_to(docs_assets) + dst = fern_assets / rel + dst.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(asset, dst) + print(f"Copied assets from {docs_assets} to {fern_assets}") + + copied = 0 + for md_file in docs_dir.rglob("*.md"): + rel = md_file.relative_to(docs_dir) + + if rel.name in SKIP_FILES: + continue + if any(part in SKIP_DIRS or part.startswith(".") for part in rel.parts): + continue + + mdx_path = pages_dir / rel.with_suffix(".mdx") + mdx_path.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(md_file, mdx_path) + copied += 1 + print(f" {rel} -> {args.version}/pages/{rel.with_suffix('.mdx')}") + + print(f"\nCopied {copied} files to {pages_dir}") + + +if __name__ == "__main__": + main() diff --git a/fern/scripts/find_tag_mismatches.py b/fern/scripts/find_tag_mismatches.py new file mode 100644 index 0000000000..f274b6012a --- /dev/null +++ b/fern/scripts/find_tag_mismatches.py @@ -0,0 +1,95 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0
+
+"""Find mismatched opening/closing tags in MDX files."""
+
+import argparse
+import re
+from pathlib import Path
+
+
+def check_file(filepath: Path) -> list[str]:
+    """Check a file for tag mismatches. Returns list of issues."""
+    content = filepath.read_text()
+    lines = content.split("\n")
+    issues = []
+
+    tag_stack: list[str] = []
+    tag_pattern = re.compile(r"<(/?)(\w+)(?:\s|>|$)")
+
+    for line_num, line in enumerate(lines, 1):
+        for match in tag_pattern.finditer(line):
+            is_closing = match.group(1) == "/"
+            tag_name = match.group(2)
+
+            known_tags = {
+                "Tabs", "Tab", "Cards", "Card", "Accordion",
+                "Note", "Warning", "Tip", "Info",
+            }
+            if tag_name not in known_tags:
+                continue
+
+            if is_closing:
+                if not tag_stack:
+                    issues.append(
+                        f"Line {line_num}: Closing </{tag_name}> without opening tag"
+                    )
+                else:
+                    expected = tag_stack.pop()
+                    if expected != tag_name:
+                        issues.append(
+                            f"Line {line_num}: Closing </{tag_name}> but expected "
+                            f"</{expected}>"
+                        )
+            else:
+                if "/>" not in line[match.start() :]:
+                    tag_stack.append(tag_name)
+
+    if tag_stack:
+        issues.append(f"Unclosed tags at end of file: {tag_stack}")
+
+    return issues
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(
+        description="Find mismatched tags in MDX files"
+    )
+    parser.add_argument(
+        "pages_dir",
+        type=Path,
+        nargs="?",
+        default=None,
+        help="Path to pages directory (default: fern/v0.5.0/pages)",
+    )
+    args = parser.parse_args()
+
+    if args.pages_dir is not None:
+        pages_dir = args.pages_dir.resolve()
+    else:
+        pages_dir = Path(__file__).resolve().parent.parent / "v0.5.0" / "pages"
+
+    if not pages_dir.exists():
+        raise SystemExit(f"Error: pages directory not found at {pages_dir}")
+
+    files_with_issues: list[tuple[Path, list[str]]] = []
+    for mdx_file in sorted(pages_dir.rglob("*.mdx")):
+        issues = check_file(mdx_file)
+        if issues:
+            rel_path = mdx_file.relative_to(pages_dir)
+            files_with_issues.append((rel_path, issues))
+
+    if files_with_issues:
+        print(f"Found issues
in {len(files_with_issues)} files:\n") + for rel_path, issues in files_with_issues: + print(f" {rel_path}") + for issue in issues: + print(f" - {issue}") + print() + else: + print("No tag mismatches found!") + + +if __name__ == "__main__": + main() diff --git a/fern/scripts/quote_frontmatter_titles.py b/fern/scripts/quote_frontmatter_titles.py new file mode 100644 index 0000000000..2401b33395 --- /dev/null +++ b/fern/scripts/quote_frontmatter_titles.py @@ -0,0 +1,64 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Quote frontmatter titles that contain colons (invalid unquoted YAML).""" + +import argparse +import re +from pathlib import Path + + +def quote_title(filepath: Path) -> bool: + """Quote title if it contains a colon. Returns True if changed.""" + content = filepath.read_text() + + if not content.strip().startswith("---"): + return False + + # Match unquoted title with colon + match = re.search(r"^title:\s+([^\"'\n]+)$", content, re.MULTILINE) + if not match: + return False + + title = match.group(1).strip() + if ":" not in title or title.startswith('"') or title.startswith("'"): + return False + + title_escaped = title.replace('\\', '\\\\').replace('"', '\\"') + new_content = re.sub( + rf"^title:\s+{re.escape(title)}\s*$", + f'title: "{title_escaped}"', + content, + count=1, + flags=re.MULTILINE, + ) + + if new_content != content: + filepath.write_text(new_content) + return True + return False + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Quote frontmatter titles with colons" + ) + parser.add_argument("pages_dir", type=Path, help="Path to pages directory") + args = parser.parse_args() + + pages_dir = args.pages_dir.resolve() + if not pages_dir.exists(): + raise SystemExit(f"Error: pages directory not found at {pages_dir}") + + changed = [] + for mdx_file in sorted(pages_dir.rglob("*.mdx")): + if 
quote_title(mdx_file): + changed.append(mdx_file.relative_to(pages_dir)) + print(f" Quoted: {mdx_file.relative_to(pages_dir)}") + + print(f"\nQuoted {len(changed)} titles") + + +if __name__ == "__main__": + main() diff --git a/fern/scripts/remove_duplicate_h1.py b/fern/scripts/remove_duplicate_h1.py new file mode 100644 index 0000000000..1488122feb --- /dev/null +++ b/fern/scripts/remove_duplicate_h1.py @@ -0,0 +1,59 @@ +#!/usr/bin/env python3 +# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +"""Remove duplicate H1 that matches frontmatter title.""" + +import argparse +import re +from pathlib import Path + + +def remove_duplicate_h1(filepath: Path) -> bool: + """Remove H1 after frontmatter if it duplicates the title. Returns True if changed.""" + content = filepath.read_text() + + if not content.strip().startswith("---"): + return False + + # Extract title from frontmatter + match = re.search(r"^---\s*\ntitle:\s*(.+?)\n", content, re.MULTILINE) + if not match: + return False + + title = match.group(1).strip().strip('"\'') + pattern = rf"(---\s*\n.*?---\s*\n\n)#\s+{re.escape(title)}\s*\n+" + new_content = re.sub(pattern, r"\1", content, count=1, flags=re.DOTALL) + + if new_content != content: + filepath.write_text(new_content) + return True + return False + + +def main() -> None: + parser = argparse.ArgumentParser( + description="Remove duplicate H1 that matches frontmatter title" + ) + parser.add_argument( + "pages_dir", + type=Path, + help="Path to pages directory", + ) + args = parser.parse_args() + + pages_dir = args.pages_dir.resolve() + if not pages_dir.exists(): + raise SystemExit(f"Error: pages directory not found at {pages_dir}") + + changed = [] + for mdx_file in sorted(pages_dir.rglob("*.mdx")): + if remove_duplicate_h1(mdx_file): + changed.append(mdx_file.relative_to(pages_dir)) + print(f" Removed H1: {mdx_file.relative_to(pages_dir)}") + + print(f"\nRemoved 
duplicate H1 from {len(changed)} files")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/fern/scripts/update_links.py b/fern/scripts/update_links.py
new file mode 100644
index 0000000000..559a095276
--- /dev/null
+++ b/fern/scripts/update_links.py
@@ -0,0 +1,68 @@
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Update internal links: .md -> Fern paths, relative paths -> absolute."""
+
+import argparse
+import re
+from pathlib import Path
+
+
+def update_links_in_content(content: str, file_dir: Path, pages_root: Path) -> str:
+    """Update markdown links and image paths: .md/.mdx -> Fern paths."""
+
+    def replace_link(match: re.Match[str]) -> str:
+        text, url = match.group(1), match.group(2)
+        if url.startswith(("http://", "https://", "#", "mailto:")):
+            return match.group(0)
+        # Strip .md/.mdx only at the end of the path (not anywhere in the string)
+        clean = re.sub(r"\.mdx?(?=#|$)", "", url)
+        # Normalize asset paths to /assets/
+        if "assets/" in clean or clean.startswith("./assets") or clean.startswith("../assets"):
+            clean = "/assets/" + clean.split("assets/")[-1]
+        elif not clean.startswith("/"):
+            # Drop relative prefixes so ../guides/foo.md becomes /guides/foo, not /../guides/foo
+            clean = "/" + re.sub(r"^(?:\.\./|\./)+", "", clean)
+        return f"[{text}]({clean})"
+
+    content = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", replace_link, content)
+    return content
+
+
+def update_file(filepath: Path, pages_root: Path) -> bool:
+    """Update links in a single file. Returns True if changes were made."""
+    content = filepath.read_text()
+    file_dir = filepath.parent
+    new_content = update_links_in_content(content, file_dir, pages_root)
+
+    if new_content != content:
+        filepath.write_text(new_content)
+        return True
+    return False
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(
+        description="Update internal links in MDX files"
+    )
+    parser.add_argument(
+        "pages_dir",
+        type=Path,
+        help="Path to pages directory (e.g.
fern/v0.5.0/pages)", + ) + args = parser.parse_args() + + pages_dir = args.pages_dir.resolve() + if not pages_dir.exists(): + raise SystemExit(f"Error: pages directory not found at {pages_dir}") + + changed = [] + for mdx_file in sorted(pages_dir.rglob("*.mdx")): + if update_file(mdx_file, pages_dir): + changed.append(mdx_file.relative_to(pages_dir)) + print(f" Updated: {mdx_file.relative_to(pages_dir)}") + + print(f"\nUpdated {len(changed)} files") + + +if __name__ == "__main__": + main() diff --git a/fern/v0.5.0/pages/about/algorithms/dapo.mdx b/fern/v0.5.0/pages/about/algorithms/dapo.mdx new file mode 100644 index 0000000000..fdfb53c5d1 --- /dev/null +++ b/fern/v0.5.0/pages/about/algorithms/dapo.mdx @@ -0,0 +1,90 @@ +--- +title: DAPO +description: "" +--- + +[Dual-Clip Asymmetric Policy Optimization (DAPO)](https://arxiv.org/pdf/2503.14476) extends GRPO by allowing asymmetric clipping with distinct minimum and maximum clip parameters. This provides more fine-grained control over policy updates. + +DAPO is implemented through the same `ClippedPGLossFn` as GRPO, but with the ability to set different values for `ratio_clip_min` and `ratio_clip_max`. For standard GRPO/PPO, these parameters are set to the same value. 
+ +## Key Differences from GRPO + +- **Asymmetric Clipping**: DAPO allows `ratio_clip_min` ≠ `ratio_clip_max`, providing asymmetric bounds on the probability ratio +- **Same Infrastructure**: Uses the same training infrastructure and configurations as GRPO + +## DAPO Single Node + +To run DAPO on a single GPU, use the GRPO script with asymmetric clip parameters: + +```sh +# Run DAPO with asymmetric clipping +uv run python examples/run_grpo.py \ + policy.model_name="Qwen/Qwen2.5-1.5B" \ + grpo.ratio_clip_min=0.15 \ + grpo.ratio_clip_max=0.25 \ + checkpointing.checkpoint_dir="results/dapo_math" \ + logger.wandb_enabled=True \ + logger.wandb.name="dapo-math" +``` + +For multi-GPU setups: + +```sh +uv run python examples/run_grpo.py \ + cluster.gpus_per_node=8 \ + grpo.ratio_clip_min=0.15 \ + grpo.ratio_clip_max=0.25 \ + checkpointing.checkpoint_dir="results/dapo_8gpu" \ + logger.wandb_enabled=True \ + logger.wandb.name="dapo-8gpu" +``` + +## DAPO Multi-node + +DAPO can be run on multiple nodes using the same approach as GRPO: + +```sh +# Run from the root of NeMo RL repo +NUM_ACTOR_NODES=2 + +COMMAND="uv run ./examples/run_grpo.py \ + --config examples/configs/grpo_math_8B.yaml \ + cluster.num_nodes=2 \ + grpo.ratio_clip_min=0.15 \ + grpo.ratio_clip_max=0.25 \ + checkpointing.checkpoint_dir='results/dapo_2nodes' \ + logger.wandb_enabled=True \ + logger.wandb.name='dapo-multinode'" \ +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD" \ +sbatch \ + --nodes=${NUM_ACTOR_NODES} \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +> [!NOTE] +> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead. + +## Configuration + +DAPO uses the same configuration structure as GRPO. The key parameters are: + +```yaml +grpo: + ratio_clip_min: 0.15 # Minimum clip value (can be different from max) + ratio_clip_max: 0.25 # Maximum clip value (can be different from min) + # ... 
other GRPO parameters ... +``` + +For more details on other configuration options, refer to the [GRPO documentation](/grpo). + +## Additional Resources + +- [DAPO Paper](https://arxiv.org/pdf/2503.14476) +- [GRPO Documentation](/grpo) +- [Training Backends](/../../design-docs/training-backends) diff --git a/fern/v0.5.0/pages/about/algorithms/dpo.mdx b/fern/v0.5.0/pages/about/algorithms/dpo.mdx new file mode 100644 index 0000000000..0f20907492 --- /dev/null +++ b/fern/v0.5.0/pages/about/algorithms/dpo.mdx @@ -0,0 +1,63 @@ +--- +title: DPO +description: "" +--- + +We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training. + +## DPO Single Node + +The default DPO experiment is configured to run on a single GPU. To launch the experiment: + +```sh +uv run python examples/run_dpo.py +``` + +This trains `Llama3.2-1B-Instruct` on 1 GPU. + +If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, we update the cluster configuration and switch to an 8B Llama3.1 Instruct model: + +```sh +uv run python examples/run_dpo.py \ + policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \ + policy.train_global_batch_size=256 \ + cluster.gpus_per_node=8 +``` + +Any of the DPO parameters can be customized from the command line. For example: + +```sh +uv run python examples/run_dpo.py \ + dpo.sft_loss_weight=0.1 \ + dpo.preference_average_log_probs=True \ + checkpointing.checkpoint_dir="results/llama_dpo_sft" \ + logger.wandb_enabled=True \ + logger.wandb.name="llama-dpo-sft" +``` + +Refer to `examples/configs/dpo.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](/../../guides/dpo). 
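For intuition about what `dpo.sft_loss_weight` and `dpo.preference_average_log_probs` control, here is a schematic single-pair version of the objective in plain Python (names, defaults, and structure are assumptions for illustration, not the NeMo RL implementation):

```python
import math

def dpo_pair_loss(chosen_logp: float, rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.1, sft_loss_weight: float = 0.0,
                  preference_average_log_probs: bool = False,
                  chosen_len: int = 1, rejected_len: int = 1) -> float:
    """Schematic DPO loss for one (chosen, rejected) preference pair."""
    if preference_average_log_probs:
        # Length-normalize sequence log-probs before computing the margin
        chosen_logp, ref_chosen_logp = chosen_logp / chosen_len, ref_chosen_logp / chosen_len
        rejected_logp, ref_rejected_logp = rejected_logp / rejected_len, ref_rejected_logp / rejected_len
    # Implicit reward margin between chosen and rejected, relative to the reference policy
    margin = beta * ((chosen_logp - ref_chosen_logp) - (rejected_logp - ref_rejected_logp))
    preference_loss = math.log(1.0 + math.exp(-margin))  # -log sigmoid(margin)
    sft_loss = -chosen_logp  # auxiliary NLL on the chosen response
    return preference_loss + sft_loss_weight * sft_loss
```

With `sft_loss_weight=0` this is plain DPO; a nonzero weight adds a supervised term that keeps the policy anchored to the chosen responses, as in the `dpo.sft_loss_weight=0.1` example above.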
+ +## DPO Multi-node + +For distributed DPO training across multiple nodes, modify the following script for your use case: + +```sh +# Run from the root of NeMo RL repo +## number of nodes to use for your job +NUM_ACTOR_NODES=2 + +COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \ +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD" \ +sbatch \ + --nodes=${NUM_ACTOR_NODES} \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +> [!NOTE] +> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead. diff --git a/fern/v0.5.0/pages/about/algorithms/grpo.mdx b/fern/v0.5.0/pages/about/algorithms/grpo.mdx new file mode 100644 index 0000000000..96f592d19a --- /dev/null +++ b/fern/v0.5.0/pages/about/algorithms/grpo.mdx @@ -0,0 +1,110 @@ +--- +title: GRPO +description: "" +--- + +We provide a reference GRPO configuration for math benchmarks using the [OpenInstructMath2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset. + +You can read about the details of the GRPO implementation [here](/../../guides/grpo). + +## GRPO Single Node + +To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`: + +```sh +# Run the GRPO math example using a 1B parameter model +uv run python examples/run_grpo.py +``` + +By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs: + +```sh +# Run the GRPO math example using a 1B parameter model using 8 GPUs +uv run python examples/run_grpo.py \ + cluster.gpus_per_node=8 +``` + +You can override any of the parameters listed in the YAML configuration file. 
For example: + +```sh +uv run python examples/run_grpo.py \ + policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \ + checkpointing.checkpoint_dir="results/llama1b_math" \ + logger.wandb_enabled=True \ + logger.wandb.name="grpo-llama1b_math" \ + logger.num_val_samples_to_print=10 +``` + +The default configuration uses the DTensor training backend. We also provide a config `examples/configs/grpo_math_1B_megatron.yaml` which is set up to use the Megatron backend out of the box. + +To train using this config on a single GPU: + +```sh +# Run a GRPO math example on 1 GPU using the Megatron backend +uv run python examples/run_grpo.py \ + --config examples/configs/grpo_math_1B_megatron.yaml +``` + +For additional details on supported backends and how to configure the training backend to suit your setup, refer to the [Training Backends documentation](/../../design-docs/training-backends). + +## GRPO Multi-node + +```sh +# Run from the root of NeMo RL repo +NUM_ACTOR_NODES=2 + +# grpo_math_8b uses Llama-3.1-8B-Instruct model +COMMAND="uv run ./examples/run_grpo.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \ +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD" \ +sbatch \ + --nodes=${NUM_ACTOR_NODES} \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +> [!NOTE] +> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead. + +The required `CONTAINER` can be built by following the instructions in the [Docker documentation](/../../docker). + +## GRPO Qwen2.5-32B + +This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length. 
+ +```sh +# Run from the root of NeMo RL repo +NUM_ACTOR_NODES=32 + +# Download Qwen before the job starts to avoid spending time downloading during the training loop +HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B + +# Ensure HF_HOME is included in your MOUNTS +HF_HOME=/path/to/hf_home \ +COMMAND="uv run ./examples/run_grpo.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \ +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD" \ +sbatch \ + --nodes=${NUM_ACTOR_NODES} \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +> [!NOTE] +> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead. + +## GRPO Multi-Turn + +We also support multi-turn generation and training (tool use, games, etc.). Reference example for training to play a Sliding Puzzle Game: + +```sh +uv run python examples/run_grpo_sliding_puzzle.py +``` diff --git a/fern/v0.5.0/pages/about/algorithms/index.mdx b/fern/v0.5.0/pages/about/algorithms/index.mdx new file mode 100644 index 0000000000..7b533295ea --- /dev/null +++ b/fern/v0.5.0/pages/about/algorithms/index.mdx @@ -0,0 +1,19 @@ +--- +title: Algorithms +description: "" +--- + +NeMo RL supports multiple training algorithms for post-training large language models. 
+
+## Support Matrix
+
+| Algorithms | Single Node | Multi-node |
+|------------|-------------|------------|
+| [GRPO](/grpo) | [GRPO Single Node](/grpo#grpo-single-node) | [GRPO Multi-node](/grpo#grpo-multi-node): [GRPO Qwen2.5-32B](/grpo#grpo-qwen25-32b), [GRPO Multi-Turn](/grpo#grpo-multi-turn) |
+| [DAPO](/dapo) | [DAPO Single Node](/dapo#dapo-single-node) | [DAPO Multi-node](/dapo#dapo-multi-node) |
+| [On-policy Distillation](/on-policy-distillation) | [Distillation Single Node](/on-policy-distillation#on-policy-distillation-single-node) | [Distillation Multi-node](/on-policy-distillation#on-policy-distillation-multi-node) |
+| [Supervised Fine-Tuning (SFT)](/sft) | [SFT Single Node](/sft#sft-single-node) | [SFT Multi-node](/sft#sft-multi-node) |
+| [DPO](/dpo) | [DPO Single Node](/dpo#dpo-single-node) | [DPO Multi-node](/dpo#dpo-multi-node) |
+| [RM](/rm) | [RM Single Node](/rm#rm-single-node) | [RM Multi-node](/rm#rm-multi-node) |
+
+On-policy distillation is also supported in the PyTorch DTensor path.
diff --git a/fern/v0.5.0/pages/about/algorithms/on-policy-distillation.mdx b/fern/v0.5.0/pages/about/algorithms/on-policy-distillation.mdx
new file mode 100644
index 0000000000..51757ace34
--- /dev/null
+++ b/fern/v0.5.0/pages/about/algorithms/on-policy-distillation.mdx
@@ -0,0 +1,48 @@
+---
+title: On-policy Distillation
+description: ""
+---
+
+We provide an example on-policy distillation experiment using the [DeepScaler dataset](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview).
+
+> [!NOTE]
+> Distillation currently supports the DTensor training backend and the vLLM generation backend. Megatron training and generation paths are not supported yet.
+ +## On-policy Distillation Single Node + +To run on-policy distillation on a single GPU using `Qwen/Qwen3-1.7B-Base` as the student and `Qwen/Qwen3-4B` as the teacher: + +```sh +uv run python examples/run_distillation.py +``` + +Customize parameters with command-line overrides. For example: + +```sh +uv run python examples/run_distillation.py \ + policy.model_name="Qwen/Qwen3-1.7B-Base" \ + teacher.model_name="Qwen/Qwen3-4B" \ + cluster.gpus_per_node=8 +``` + +## On-policy Distillation Multi-node + +```sh +# Run from the root of NeMo RL repo +NUM_ACTOR_NODES=2 + +COMMAND="uv run ./examples/run_distillation.py --config examples/configs/distillation_math.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/distill_2nodes' logger.wandb_enabled=True logger.wandb.name='distill-2nodes'" \ +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD" \ +sbatch \ + --nodes=${NUM_ACTOR_NODES} \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +> [!NOTE] +> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead. diff --git a/fern/v0.5.0/pages/about/algorithms/rm.mdx b/fern/v0.5.0/pages/about/algorithms/rm.mdx new file mode 100644 index 0000000000..e721d4449f --- /dev/null +++ b/fern/v0.5.0/pages/about/algorithms/rm.mdx @@ -0,0 +1,49 @@ +--- +title: RM +description: "" +--- + +We provide a sample RM experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training. + +## RM Single Node + +The default RM experiment is configured to run on a single GPU. To launch the experiment: + +```sh +uv run python examples/run_rm.py +``` + +This trains a RM based on `meta-llama/Llama-3.2-1B-Instruct` on 1 GPU. + +If you have access to more GPUs, you can update the experiment accordingly. 
To run on 8 GPUs, we update the cluster configuration:
+
+```sh
+uv run python examples/run_rm.py cluster.gpus_per_node=8
+```
+
+Refer to the [RM documentation](/../../guides/rm) for more information.
+
+## RM Multi-node
+
+For distributed RM training across multiple nodes, modify the following script for your use case:
+
+```sh
+# Run from the root of NeMo RL repo
+## number of nodes to use for your job
+NUM_ACTOR_NODES=2
+
+COMMAND="uv run ./examples/run_rm.py --config examples/configs/rm.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/rm_llama1b_2nodes' logger.wandb_enabled=True logger.wandb.name='rm-llama1b-2nodes'" \
+CONTAINER=YOUR_CONTAINER \
+MOUNTS="$PWD:$PWD" \
+sbatch \
+    --nodes=${NUM_ACTOR_NODES} \
+    --account=YOUR_ACCOUNT \
+    --job-name=YOUR_JOBNAME \
+    --partition=YOUR_PARTITION \
+    --time=4:0:0 \
+    --gres=gpu:8 \
+    ray.sub
+```
+
+> [!NOTE]
+> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead.
diff --git a/fern/v0.5.0/pages/about/algorithms/sft.mdx b/fern/v0.5.0/pages/about/algorithms/sft.mdx
new file mode 100644
index 0000000000..8822e25710
--- /dev/null
+++ b/fern/v0.5.0/pages/about/algorithms/sft.mdx
@@ -0,0 +1,50 @@
+---
+title: Supervised Fine-Tuning (SFT)
+description: ""
+---
+
+We provide example SFT experiments using various datasets, including [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/), OpenAI format datasets (with tool calling support), and custom JSONL datasets. For detailed documentation on supported datasets and configurations, see the [SFT documentation](/../../guides/sft).
+
+## SFT Single Node
+
+The default SFT configuration is set to run on a single GPU. To start the experiment:
+
+```sh
+uv run python examples/run_sft.py
+```
+
+This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using 1 GPU.
+
+To use multiple GPUs on a single node, you can modify the cluster configuration.
This adjustment will also let you potentially increase the model and batch size: + +```sh +uv run python examples/run_sft.py \ + policy.model_name="meta-llama/Meta-Llama-3-8B" \ + policy.train_global_batch_size=128 \ + sft.val_global_batch_size=128 \ + cluster.gpus_per_node=8 +``` + +Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden. + +## SFT Multi-node + +```sh +# Run from the root of NeMo RL repo +NUM_ACTOR_NODES=2 + +COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \ +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD" \ +sbatch \ + --nodes=${NUM_ACTOR_NODES} \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +> [!NOTE] +> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead. diff --git a/fern/v0.5.0/pages/about/backends.mdx b/fern/v0.5.0/pages/about/backends.mdx new file mode 100644 index 0000000000..0c866b94fd --- /dev/null +++ b/fern/v0.5.0/pages/about/backends.mdx @@ -0,0 +1,22 @@ +--- +title: Training and Generation Backends +description: "" +--- + +## Training Backends + +NeMo RL supports multiple training backends to accommodate different model sizes and hardware configurations: + +- **PyTorch** - This leverages [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel) to provide accelerated PyTorch training with improved memory efficiency (PyTorch-native TP, SP, PP, CP, and FSDP2) +- [**Megatron**](https://github.com/NVIDIA-NeMo/Megatron-Bridge) - NVIDIA's high-performance training framework for scaling to large models with 6D parallelisms + +The training backend is automatically determined based on your YAML configuration settings. 
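Concretely, the selection comes down to per-backend enable flags under `policy`. A minimal sketch of the relevant keys (assuming your config follows the layout of the shipped example configs, where `dtensor_cfg` and `megatron_cfg` each carry an `enabled` flag):

```yaml
policy:
  dtensor_cfg:
    enabled: true    # use the PyTorch DTensor backend
  megatron_cfg:
    enabled: false   # set to true (and disable dtensor_cfg) to select Megatron
```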
For detailed information on backend selection, configuration, and examples, see the [Training Backends documentation](/../design-docs/training-backends). + +## Generation Backends + +NeMo RL supports multiple generation/rollout backends to accommodate different model sizes and hardware configurations: + +- [**vLLM**](https://github.com/vllm-project/vllm) - A high-throughput and memory-efficient popular inference and serving engine +- [**Megatron**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core/inference) - A high-performance Megatron-native inference backend which eliminates weight conversion between training and inference + +For detailed information on backend selection, configuration, and examples, see the [Generation Backends documentation](/../design-docs/generation). diff --git a/fern/v0.5.0/pages/about/clusters.mdx b/fern/v0.5.0/pages/about/clusters.mdx new file mode 100644 index 0000000000..f6ad1311b2 --- /dev/null +++ b/fern/v0.5.0/pages/about/clusters.mdx @@ -0,0 +1,6 @@ +--- +title: "Installation: Set Up Clusters" +description: "" +--- + +For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated [Cluster Start](/../cluster) documentation. diff --git a/fern/v0.5.0/pages/about/evaluation.mdx b/fern/v0.5.0/pages/about/evaluation.mdx new file mode 100644 index 0000000000..1c56a47d64 --- /dev/null +++ b/fern/v0.5.0/pages/about/evaluation.mdx @@ -0,0 +1,62 @@ +--- +title: Evaluation +description: "" +--- + +We provide evaluation tools to assess model capabilities. 
+ +## Convert Model Format (Optional) + +If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation: + +```sh +# Example for a GRPO checkpoint at step 170 +uv run python examples/converters/convert_dcp_to_hf.py \ + --config results/grpo/step_170/config.yaml \ + --dcp-ckpt-path results/grpo/step_170/policy/weights/ \ + --hf-ckpt-path results/grpo/hf +``` + +If you have a model saved in Megatron format, you can use the following command to convert it to Hugging Face format prior to running evaluation. This script requires Megatron Core, so make sure you launch with the mcore extra: + +```sh +# Example for a GRPO checkpoint at step 170 +uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \ + --config results/grpo/step_170/config.yaml \ + --megatron-ckpt-path results/grpo/step_170/policy/weights/iter_0000000 \ + --hf-ckpt-path results/grpo/hf +``` + +> [!NOTE] +> Adjust the paths according to your training output directory structure. + +For an in-depth explanation of checkpointing, refer to the [Checkpointing documentation](/../design-docs/checkpointing). 
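If you need to convert several checkpoints (for example, to evaluate multiple training steps), a small shell loop helps. This is a dry-run sketch that only prints the commands; the step numbers and output paths are hypothetical, so adjust them to your run and execute `$cmd` instead of echoing it once they look right:

```sh
# Print (dry run) one conversion command per checkpoint step.
for STEP in 170 340 510; do
  cmd="uv run python examples/converters/convert_dcp_to_hf.py \
    --config results/grpo/step_${STEP}/config.yaml \
    --dcp-ckpt-path results/grpo/step_${STEP}/policy/weights/ \
    --hf-ckpt-path results/grpo/hf_step_${STEP}"
  echo "$cmd"
done
```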
+ +## Run Evaluation + +Run the evaluation script with the converted model: + +```sh +uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf +``` + +Run the evaluation script with custom settings: + +```sh +# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs +# Pass@1 accuracy averaged over 16 samples for each problem +uv run python examples/run_eval.py \ + --config examples/configs/evals/math_eval.yaml \ + generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \ + generation.temperature=0.6 \ + generation.top_p=0.95 \ + generation.vllm_cfg.max_model_len=32768 \ + data.dataset_name=math500 \ + eval.num_tests_per_prompt=16 \ + cluster.gpus_per_node=8 +``` + +> [!NOTE] +> Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings. + +Refer to `examples/configs/evals/eval.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the [Evaluation documentation](/../guides/eval). 
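The `eval.num_tests_per_prompt=16` override above reports pass@1 accuracy averaged over 16 samples per problem. Conceptually, this is just a mean of per-sample correctness; a stdlib-only sketch with made-up data (not code from the repo):

```python
def pass_at_1(correctness_per_problem):
    """Average pass@1: per problem, the fraction of correct samples; then the mean over problems."""
    per_problem = [sum(flags) / len(flags) for flags in correctness_per_problem]
    return sum(per_problem) / len(per_problem)

# Hypothetical 0/1 correctness flags for 3 problems, 4 samples each.
results = [[1, 1, 0, 1], [0, 0, 0, 0], [1, 1, 1, 1]]
print(pass_at_1(results))  # mean of 0.75, 0.0, 1.0
```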
diff --git a/fern/v0.5.0/pages/about/features.mdx b/fern/v0.5.0/pages/about/features.mdx new file mode 100644 index 0000000000..f80fb20031 --- /dev/null +++ b/fern/v0.5.0/pages/about/features.mdx @@ -0,0 +1,34 @@ +--- +title: Features and Roadmap +description: "" +--- + +_Available now_ | _Coming in v0.4_ + +## Coming in v0.4 + +- **Megatron Inference** - Megatron Inference for fast Day-0 support for new Megatron models (avoid weight conversion) +- **Async RL** - Support for asynchronous rollouts and replay buffers for off-policy training, and enable a fully asynchronous GRPO +- **Vision Language Models (VLM)** - Support SFT and GRPO on VLMs through the DTensor path +- **Improved Native Performance** - Improve training time for native PyTorch models +- **Improved Large MoE Performance** - Improve Megatron Core training performance and generation performance +- **End-to-End FP8 Low-Precision Training** - Support for Megatron Core FP8 training and FP8 vLLM generation +- **Megatron Bridge Integration** - Integrate Megatron Bridge to enable training features from Megatron Core +- **NeMo Automodel Integration** - Integrate NeMo Automodel to power the DTensor path +- **New Models** - `gpt-oss` +- **Expand Algorithms** - DAPO, GSPO, On-policy Distillation +- **GB200** - Add container support for GB200 + +## Available Now + +- **Distributed Training** - Ray-based infrastructure +- **Environment Support and Isolation** - Support for multi-environment training and dependency isolation between components +- **Worker Isolation** - Process isolation between RL Actors (no worries about global state) +- **Learning Algorithms** - GRPO/GSPO, SFT, and DPO +- **Multi-Turn RL** - Multi-turn generation and training for RL with tool use, games, etc +- **Advanced Parallelism with DTensor** - PyTorch FSDP2, TP, CP, and SP for efficient training +- **Larger Model Support with Longer Sequences** - Performant parallelisms with Megatron Core (TP/PP/CP/SP/EP/FSDP) +- **MoE Models** - Support 
for DeepSeekV3 and Qwen-3 MoE models (Megatron)
+- **Sequence Packing** - Sequence packing in both DTensor and Megatron Core for huge training performance gains
+- **Fast Generation** - vLLM backend for optimized inference
+- **Hugging Face Integration** - Works with 1B–70B models (Qwen, Llama)
diff --git a/fern/v0.5.0/pages/about/installation.mdx b/fern/v0.5.0/pages/about/installation.mdx
new file mode 100644
index 0000000000..333fbf145f
--- /dev/null
+++ b/fern/v0.5.0/pages/about/installation.mdx
@@ -0,0 +1,94 @@
+---
+title: Installation and Prerequisites
+description: ""
+---
+
+## Clone the Repository
+
+Clone **NeMo RL** with submodules:
+
+```sh
+git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl --recursive
+cd nemo-rl
+
+# If you already cloned without the recursive option, you can initialize the submodules with:
+git submodule update --init --recursive
+
+# Different branches of the repo can have different pinned versions of these third-party submodules. Ensure
+# submodules are automatically updated after switching branches or pulling updates by configuring git with:
+# git config submodule.recurse true
+
+# **NOTE**: this setting will not download **new** or remove **old** submodules with the branch's changes.
+# You will have to run the full `git submodule update --init --recursive` command in these situations.
+```
+
+## Install System Dependencies
+
+### cuDNN (For Megatron Backend)
+
+If you are using the Megatron backend on bare metal (outside of a container), you may need to install the cuDNN headers.
Here is how you check and install them: + +```sh +# Check if you have libcudnn installed +dpkg -l | grep cudnn.*cuda + +# Find the version you need here: https://developer.nvidia.com/cudnn-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_network +# As an example, these are the "Linux Ubuntu 20.04 x86_64" instructions +wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb +sudo dpkg -i cuda-keyring_1.1-1_all.deb +sudo apt update +sudo apt install cudnn # Will install cuDNN meta packages which points to the latest versions +# sudo apt install cudnn9-cuda-12 # Will install cuDNN version 9.x.x compiled for cuda 12.x +# sudo apt install cudnn9-cuda-12-8 # Will install cuDNN version 9.x.x compiled for cuda 12.8 +``` + +### libibverbs (For vLLM Dependencies) + +If you encounter problems when installing vllm's dependency `deepspeed` on bare-metal (outside of a container), you may need to install `libibverbs-dev`: + +```sh +sudo apt-get update +sudo apt-get install libibverbs-dev +``` + +## Install UV Package Manager + +For faster setup and environment isolation, we use [uv](https://docs.astral.sh/uv/). + +Follow [these instructions](https://docs.astral.sh/uv/getting-started/installation/) to install uv. + +Quick install: +```sh +curl -LsSf https://astral.sh/uv/install.sh | sh +``` + +## Create Virtual Environment + +Initialize the NeMo RL project virtual environment: + +```sh +uv venv +``` + +> [!NOTE] +> Please do not use `-p/--python` and instead allow `uv venv` to read it from `.python-version`. +> This ensures that the version of python used is always what we prescribe. + +## Using UV to Run Commands + +Use `uv run` to launch all commands. It handles pip installing implicitly and ensures your environment is up to date with our lock file. 
+
+```sh
+# Example: Run GRPO with DTensor backend
+uv run python examples/run_grpo.py

+# Example: Run GRPO with Megatron backend
+uv run python examples/run_grpo.py --config examples/configs/grpo_math_1B_megatron.yaml
+```
+
+> [!NOTE]
+> - It is not recommended to activate the `venv`; use `uv run` instead to execute scripts within the managed environment.
+> This ensures consistent environment usage across different shells and sessions.
+> - Ensure your system has the appropriate CUDA drivers installed, and that your PyTorch version is compatible with both your CUDA setup and hardware.
+> - If you update your environment in `pyproject.toml`, it is necessary to force a rebuild of the virtual environments by setting `NRL_FORCE_REBUILD_VENVS=true` next time you launch a run.
+> - **Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
diff --git a/fern/v0.5.0/pages/about/model-support.mdx b/fern/v0.5.0/pages/about/model-support.mdx
new file mode 100644
index 0000000000..fe5022a765
--- /dev/null
+++ b/fern/v0.5.0/pages/about/model-support.mdx
@@ -0,0 +1,33 @@
+---
+title: Model Support
+description: ""
+---
+
+## Broad coverage for 🤗Hugging Face models via [NeMo AutoModel](https://github.com/NVIDIA-NeMo/Automodel)
+
+NeMo RL supports 🤗Hugging Face models at sizes under 70B and sequence lengths up to 32k, from the following classes:
+- LLMs ([AutoModelForCausalLM](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForCausalLM))
+- VLMs ([AutoModelForImageTextToText](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForImageTextToText))
+
+## Optimal acceleration for top models via [NeMo Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)
+
+[NeMo Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) provides acceleration [recipes](https://github.com/NVIDIA-NeMo/RL/tree/main/examples/configs/recipes) for the models below. Users can also leverage online checkpoint conversion (i.e., the "bridge") by directly inputting a 🤗Hugging Face checkpoint.
+
+**LLMs**:
+
+- **Qwen**: Qwen2.5-1.5B/7B/32B, Qwen3-1.5B/8B/32B, Qwen3-30B-A3B, Qwen3-235B-A22B
+- **Llama**: Llama 3.1/3.3-8B, Llama 3.1/3.3-70B, Llama 3.2-1B
+- **Deepseek**: Deepseek-V3/R1-671B
+- **Mistral**: Mistral-NeMo-12B
+- **Moonlight-16B-A3B**
+- **Gemma**: Gemma-3-1B/27B
+- **GPT-OSS**: GPT-OSS-20B/120B
+- **NeMotron**: Llama-Nemotron-Super-49B, Nemotron-nano-v2-12B, Nemotron-Nano-v3-30A3B
+
+**VLMs**:
+
+- **Qwen**: Qwen2.5VL-3B
+
+In addition, please refer to our [performance page](https://docs.nvidia.com/nemo/rl/latest/about/performance-summary.html) for benchmarks and fully reproducible YAML recipe configs.
diff --git a/fern/v0.5.0/pages/about/overview.mdx b/fern/v0.5.0/pages/about/overview.mdx
new file mode 100644
index 0000000000..231bd23184
--- /dev/null
+++ b/fern/v0.5.0/pages/about/overview.mdx
@@ -0,0 +1,21 @@
+---
+title: Overview
+description: ""
+---
+
+**NeMo RL** is an open-source post-training library within the [NeMo Framework](https://github.com/NVIDIA-NeMo), designed to streamline and scale reinforcement learning methods for multimodal models (LLMs, VLMs, etc.). Built for flexibility, reproducibility, and scale, NeMo RL enables both small-scale experiments and massive multi-GPU, multi-node deployments for fast experimentation in research and production environments.
+
+## What You Can Expect
+
+- **Flexibility** with a modular design that allows easy integration and customization.
+- **Efficient resource management using Ray**, enabling scalable and flexible deployment across different hardware configurations.
+- **Hackable** with native PyTorch-only paths for quick research prototypes.
+- **High performance with Megatron Core**, supporting various parallelism techniques for large models and large context lengths.
+- **Seamless integration with Hugging Face** for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
+- **Comprehensive documentation** that is both detailed and user-friendly, with practical examples.
+
+For more details on the architecture and design philosophy, see the [design documents](/../design-docs/design-and-philosophy).
+
+## Releases
+
+For a complete list of releases and detailed changelogs, visit the [GitHub Releases page](https://github.com/NVIDIA-NeMo/RL/releases).
diff --git a/fern/v0.5.0/pages/about/performance-summary.mdx b/fern/v0.5.0/pages/about/performance-summary.mdx
new file mode 100644
index 0000000000..97528f367d
--- /dev/null
+++ b/fern/v0.5.0/pages/about/performance-summary.mdx
@@ -0,0 +1,102 @@
+---
+title: Performance
+description: ""
+---
+
+As part of the NVIDIA NeMo Framework, NeMo RL provides optimal performance for reinforcement learning on generative AI models by incorporating the latest optimizations, such as refit optimization, mixed-precision training, and off-policy training.
+
+This page provides performance benchmarks for LLMs and VLMs using NeMo RL across different GPU systems and configurations. The YAML recipes to reproduce these runs can be found under [this folder](https://github.com/NVIDIA-NeMo/RL/tree/r0.5.0/examples/configs/recipes/llm/performance).
+
+## Nomenclature
+
+- **GBS**: Global Batch Size
+- **MBS**: Micro Batch Size
+- **TP**: Tensor Parallel Size
+- **PP**: Pipeline Parallel Size
+- **CP**: Context Parallel Size
+- **VP**: Virtual Pipeline Parallel Size
+- **EP**: Expert Parallel Size
+- **T-**: Training related
+- **G-**: Generation related
+- **Training backend**: NeMo RL has two training backends: Megatron and PyTorch DTensor. This performance summary currently only shows numbers from the Megatron backend.
+
+## Performance Metrics
+
+Since reinforcement learning consists of training, generation, and transitions between the two, performance measurement also reflects this. Specifically, we track the following metrics:
+- **Step time**: Time for each step, which includes training, generation, policy logprobs, and refit time.
+- **Tokens/sec/GPU**: The rate at which tokens are processed by a stage (such as training, generation, or refitting) on a single GPU:
+
+  $$
+  \text{Tokens/sec/GPU} = \frac{\text{Total Tokens Processed}}{\text{Time for Stage} \times \text{Number of GPUs}}
+  $$
+
+- **Training MFU**: Model FLOPs utilization during training.
+
+## Performance Summary for Large Language Models
+
+Below are performance benchmarks for various large language models organized by release version. These results were obtained using performance recipes available [here](https://github.com/NVIDIA-NeMo/RL/tree/r0.4.0/examples/configs/recipes/llm/performance).
+
+The performance data includes:
+
+- **RL Performance**: Performance metrics for various model sizes and architectures on different RL algorithms (GRPO, and in the future DAPO and PPO, for both on-policy and asynchronous training).
+- **System Configurations**: Results across different GPU systems (DGX-H100 and in the future DGX-GB200, DGX-B200) +- **Precision Options**: Performance comparisons between different precision modes (BF16, FP8) + +--- + +## Nemo RL v0.5 + +### H100 BF16 Benchmarks +* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2); DAPO dataset: [DAPOMath17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k) +* System: DGX-H100 +* Precision: Training BF16, Generation BF16 +* Training Backend: Megatron-core. + +| Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)| +|--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---| +| GRPO |LLAMA3.1_8B|On policy |4,096 |1,019 |16 |2,048|512 |[1,1] |[1,1,1,1,1,2,n/a] |1,581 | 92.8| +| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,123 |16 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |2,478 | 64.8| +| GRPO |DeepSeek V3|On policy |1,536 |744 |256 |512 |512 |[32,1] |[1,1,16,16,n/a] |12.7 | 134| +| GRPO |DeepSeek V3|1-step Off |1,536 |738 |512 |512 |512 |[32,1] |[1,1,16,16,n/a] |13.1 | 64.9| +| DAPO |DeepSeek V3|On policy |1,536 |974 |512 |512 |512 |[64,1] |[8,4,32,8,n/a] |2.45 | 458| +| GRPO |Qwen3-235B |On policy |8,192 |5,700 |128 |512 |512 |[16,1] |[2,2,16,8,n/a] |54.1 | 431| +| GRPO |Qwen3-235B |1-step Off |8,192 |5,707 |256 |512 |512 |[8,1] |[4,1,16,8,n/a] |58.7 | 203| +| GRPO |Qwen3-30B3A|On policy |4,096 |3,196 |32 |2,048|512 |[2,1] |[1,1,8,1,n/a] |1066 | 198| +| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,201 |32 |2,048|512 |[2,1] |[1,1,8,2,n/a] |1391 | 154| +| GRPO |Qwen3-32B |On policy |4,096 |3,251 |32 |2,048|512 |[4,1] |[4,1,1,4,n/a] |571 | 376| +| GRPO |Qwen3-32B |1-step Off |4,096 |3,252 |64 |2,048|512 |[4,1] |[4,1,1,4,n/a] |538 | 200| + +### H100 FP8 Benchmarks +* GRPO Dataset: 
[OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) +* System: DGX-H100 +* Precision: Generation FP8, Training FP8 +* Training Backend: Megatron-core. + +| Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)| +|--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---| +| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,128 |16 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |3,052 | 53.0| +| GRPO |DeepSeek V3|1-step Off |1,536 |761 |512 |512 |512 |[16,1] |[1,1,16,16,n/a] |14.1 | 67.6| + +### GB200 BF16 Benchmarks +* GRPO Dataset: [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) +* System: GB200-NVL72 +* Precision: Training BF16, Generation BF16 +* Training Backend: Megatron-core. + +| Algorithm | Model |On/Off policy|T-Max Sequence Length|G-Average Seq len|#-GPUs|G-GBS|T-GBS|Generation [TP,PP]|Training [TP,CP,EP,PP,VPP]|Tokens / sec / GPU|Total Step time(s)| +|--------- |------- |-------- |----- |----- |------|---- |---- |---- |---- |--- |---| +| GRPO |LLAMA3.1_8B|On policy |4,096 |1,066 |8 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |3,359 | 91.0| +| GRPO |LLAMA3.1_8B|1-step Off |4,096 |1,107 |8 |2,048|512 |[1,1] |[1,1,1,1,1,1,n/a] |4,463 | 71.1| +| GRPO |DeepSeek V3|On policy |1,536 |996 |128 |512 |512 |[32,1] |[1,1,16,8,n/a] |34.3 | 128| +| GRPO |DeepSeek V3|1-step Off |1,536 |994 |256 |512 |512 |[16,1] |[1,1,16,8,n/a] |31.7 | 64.5| +| GRPO |Qwen3-235B |On policy |8,192 |5,711 |64 |512 |512 |[8,1] |[2,2,16,4,n/a] |140 | 332| +| GRPO |Qwen3-235B |1-step Off |8,192 |5,711 |128 |512 |512 |[8,1] |[4,1,16,4,n/a] |87.9 | 268| +| GRPO |Qwen3-30B3A|On policy |4,096 |3,198 |16 |2,048|512 |[1,1] |[1,1,16,1,n/a] |1,822 | 232| +| GRPO |Qwen3-30B3A|1-step Off |4,096 |3,204 |32 |2,048|512 |[1,1] |[1,1,16,1,n/a] |1,558 | 136| +| GRPO |Qwen3-32B |On policy |4,096 |3,253 |16 |2,048|512 |[1,1] 
|[2,1,1,1,n/a] |1,127 | 381|
+| GRPO |Qwen3-32B |1-step Off |4,096 |3,258 |32 |2,048|512 |[1,1] |[2,1,1,1,n/a] |1,025 | 210|
+
+Note:
+
+* All Mixture-of-Experts (MoE) model training uses token-dropless mode.
+* The following metrics are extracted from the average of 5 steps: G-Average Seq len, Tokens/sec/GPU, Total Step time(s). Because of the averaging, the numbers in the table do not exactly match the equation stated in Performance Metrics above, but the difference is small.
diff --git a/fern/v0.5.0/pages/about/quick-start.mdx b/fern/v0.5.0/pages/about/quick-start.mdx
new file mode 100644
index 0000000000..7571706daa
--- /dev/null
+++ b/fern/v0.5.0/pages/about/quick-start.mdx
@@ -0,0 +1,42 @@
+---
+title: Quick Start
+description: ""
+---
+
+Use this quick start to get going with either the native PyTorch DTensor or Megatron Core training backends.
+
+> [!NOTE]
+> Both training backends are independent — you can install and use either one on its own.
+
+For more examples and setup details, continue to the [Prerequisites](/installation) section.
+
+## Quick Start Options
+
+| Native PyTorch (DTensor) | Megatron Core |
+|--------------------------|---------------|
+| **Clone and create the environment** | |
+
+```sh
+git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl
+cd nemo-rl
+git submodule update --init --recursive
+uv venv
+```
+
+> [!NOTE]
+> If you previously ran without checking out the submodules, you may need to rebuild virtual environments by setting `NRL_FORCE_REBUILD_VENVS=true`. See [Tips and Tricks](/tips-and-tricks).
+ +| Native PyTorch (DTensor) | Megatron Core | +|--------------------------|---------------| +| **Run GRPO (DTensor)** | **Run GRPO (Megatron)** | + +```sh +# DTensor +uv run python examples/run_grpo.py +``` + +```sh +# Megatron +uv run examples/run_grpo.py \ + --config examples/configs/grpo_math_1B_megatron.yaml +``` diff --git a/fern/v0.5.0/pages/about/tips-and-tricks.mdx b/fern/v0.5.0/pages/about/tips-and-tricks.mdx new file mode 100644 index 0000000000..9aea60f1c7 --- /dev/null +++ b/fern/v0.5.0/pages/about/tips-and-tricks.mdx @@ -0,0 +1,45 @@ +--- +title: Tips and Tricks +description: "" +--- + +## Missing Submodules Error + +If you forget to initialize the NeMo and Megatron submodules when cloning the NeMo-RL repository, you may run into an error like this: + +```sh +ModuleNotFoundError: No module named 'megatron' +``` + +If you see this error, there is likely an issue with your virtual environments. To fix this, first initialize the submodules: + +```sh +git submodule update --init --recursive +``` + +and then force a rebuild of the virtual environments by setting `NRL_FORCE_REBUILD_VENVS=true` next time you launch a run: + +```sh +NRL_FORCE_REBUILD_VENVS=true uv run examples/run_grpo.py ... +``` + +## Memory Fragmentation + +Large amounts of memory fragmentation might occur when running models without support for FlashAttention2. If OOM occurs after a few iterations of training, it may help to tweak the allocator settings to reduce memory fragmentation. To do so, specify [`max_split_size_mb`](https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-alloc-conf) at **either** one of the following places: + +1. Launch training with: + +```sh +# This will globally apply to all Ray actors +PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 uv run python examples/run_dpo.py ... +``` + +2. Make the change more permanently by adding this flag in the training configuration: + +```yaml +policy: + # ... 
+  dtensor_cfg:
+    env_vars:
+      PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:64"
+```
diff --git a/fern/v0.5.0/pages/adding-new-models.mdx b/fern/v0.5.0/pages/adding-new-models.mdx
new file mode 100644
index 0000000000..555277017c
--- /dev/null
+++ b/fern/v0.5.0/pages/adding-new-models.mdx
@@ -0,0 +1,315 @@
+---
+title: Add New Models
+description: ""
+---
+
+This guide outlines how to integrate and validate a new model within NeMo RL. Each new model must pass a standard set of compatibility tests before being considered ready for use in RL pipelines. The guide also details diagnostic scripts to help identify and resolve common issues during model integration.
+
+## Importance of Log Probability Consistency in Training and Inference
+
+In on-policy RL, we sample tokens (actions) from the latest version of the policy. This means the sampling distribution of token probabilities produced by the inference framework must closely match those from the training framework. If the inference framework produces significantly different probabilities, we effectively sample from a different distribution, leading to errors in the loss estimation.
+
+As an example, we would see errors in naive KL estimation:
+
+$$\text{KL} = E_{x \sim \pi}[\log \pi(x) - \log \pi_{\text{ref}}(x)]$$
+
+When summed/integrated, replacing the $x \sim \pi$ with $x \sim \pi_{\text{wrong}}$ leads to an error of:
+
+$$\sum_{x} \left( \log \pi(x) - \log \pi_{\text{ref}}(x) \right) \left( \pi_{\text{wrong}}(x) - \pi(x) \right)$$
+
+So, to verify correctness, we calculate:
+
+$$
+\frac{1}{n}\sum_{i=1}^{n\text{(tokens)}}\exp\left(\left|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right|\right)
+$$
+
+as a measure of multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$.
+
+Note that this is not exhaustive (the inference framework could lack distribution support and we wouldn't catch it here, as $x \sim \pi_{\text{inference-framework}}$).
To get a much stricter guarantee on correctness, you should run this metric twice and average the results, where in the second run, you sample $x \sim \pi_{\text{training-framework}}$. In practice, we use just the former in our tests and find it sufficient.
+
+## Understand Discrepancies Between Backends
+
+When validating models across different backends, you may encounter discrepancies in log probabilities. These differences can stem from various sources, with effects ranging from negligible to significant:
+
+- **Numerical precision differences**: Training and inference backends may differ in precision formats (FP32, FP16, BF16, FP8).
+  - Training may use mixed precision, while the inference backend may not.
+  - High-precision training with FP8 inference may not be numerically stable for certain models.
+  - Differences can occur at the layer level, with some layers in FP32, while others use lower precision.
+
+- **Implementation variations**: Subtle differences in how layers such as softmax, layer normalization, or attention are implemented.
+  - Attention/Norm layers (which could be fused) in TransformerEngine may not be bit-wise identical to implementations in inference backends.
+  - Inference backends may re-implement kernels (e.g., for SSM layers), leading to differences.
+  - Softmax in training frameworks may be calculated differently than in inference backends for numerical stability.
+
+- **KV/Prefill cache handling**: Differences in how key-value/prefill caches are managed during autoregressive generation.
+  - In some cases, disabling the inference backend cache can resolve discrepancies.
+
+- **Parallelism effects**: Parallelism strategies such as tensor parallelism may introduce small variations.
+
+- **Inherent non-determinism**: Some neural network operations are inherently non-deterministic (e.g., `torch.cumsum`).
+
+- **Prefill/Decoding kernel mismatch**: Different kernels for prefill and decoding phases may produce different log probabilities.
+  - Training frameworks typically use prefill kernels, while inference backends may use both prefill kernels and specialized decoding kernels.
+
+- **Imperfect Refit**: Weight conversion from the training framework to the inference backend may be incomplete or data formats may be incorrect.
+  - If weights are reshaped or reordered incorrectly, generations tend to be very wrong.
+  - In some cases, if some weights in the inference backend are not refit after each training step, the error between training and inference log probabilities can diverge as training progresses.
+
+- **Batch size**: In some cases, `batch_size>1` may produce larger errors than `batch_size=1`.
+
+When investigating discrepancies beyond the acceptable threshold, focus on these areas and determine whether the differences appear systematically or only in specific contexts.
+
+---
+
+## 1. Hugging Face–Based Models
+
+### Validation Workflow
+
+When validating Hugging Face-based models, perform the following checks:
+
+- **Compare log probabilities**
+  Ensure the generation log probabilities from inference backends like **vLLM** match those computed by Hugging Face. This comparison helps diagnose potential mismatches.
+
+- **Test parallelism**
+  Verify consistency with other parallelism settings.
+
+- **Variance**
+  Repeat tests multiple times (e.g., 10 runs) to confirm that behavior is deterministic or within acceptable variance.
+
+- **Check sequence lengths**
+  Perform inference on sequence lengths of 100, 1,000, and 10,000 tokens.
+  Ensure the model behaves consistently at each length.
+
+- **Use real and dummy data**
+  - **Real data:** Tokenize and generate from actual text samples.
+  - **Dummy data:** Use simple numeric sequences to test basic generation.
+
+- **Vary sampling parameters**
+  Test both greedy and sampling generation modes.
+  Adjust temperature and top-p to confirm output consistency across backends.
+
+- **Test different batch sizes**
+  Try with batch sizes of 1, 8, and 32 to ensure consistent behavior across different batch configurations.
+
+---
+
+## 2. Megatron Models
+
+### Additional Validation
+
+- **Compare Megatron outputs**
+  Ensure the Megatron forward pass aligns with Hugging Face and the generation log probabilities from inference backends like **vLLM**.
+
+- **Parallel settings**
+  Match the same parallelism configurations used for the Hugging Face-based tests.
+  Confirm outputs remain consistent across repeated runs.
+
+---
+
+## 3. Expected Error Threshold
+
+When comparing log probabilities between training and inference backends, we use an error threshold of `1.05` to determine acceptable variance (for equal precision). An error of `1.0` indicates a perfect match, and values exceeding `1.05` require further investigation.
+
+When validating your model, you should analyze the results across different configurations. Your analysis should include:
+
+| Sequence Length | Data Type | Generation Method | Batch Size | HF vs vLLM | Megatron vs vLLM |
+|-----------------|------------|-------------------|------------|------------|------------------|
+| 100 | Real | Greedy | 1 | 1.02 | 1.01 |
+| 100 | Real | Sampling | 8 | 1.03 | 1.02 |
+| 100 | Synthetic | Greedy | 1 | 1.01 | 1.02 |
+| 1,000 | Real | Greedy | 32 | 1.04 | 1.03 |
+| ... | ... | ... | ... | ... | ... |
+
+---
+
+By following these validation steps and ensuring your model's outputs remain consistent across backends, you can confirm that your new model meets the requirements of NeMo RL.
+
+# Model Diagnostics
+
+We also maintain a set of standalone scripts that can be used to diagnose issues related to correctness that
+we have encountered before.
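The multiplicative error metric and the `1.05` threshold described above can be sketched in a few lines (a hedged example; the function name and arrays are illustrative, not code from the NeMo RL tools):

```python
import numpy as np

def multiplicative_logprob_error(train_logprobs, inference_logprobs):
    """Mean multiplicative probability error: mean(exp(|lp_train - lp_inf|)).

    A value of 1.0 means the two backends agree exactly; values above
    ~1.05 (at equal precision) warrant further investigation.
    """
    train = np.asarray(train_logprobs, dtype=np.float64)
    inference = np.asarray(inference_logprobs, dtype=np.float64)
    return float(np.mean(np.exp(np.abs(train - inference))))

# Identical per-token logprobs give a perfect score of 1.0
assert multiplicative_logprob_error([-1.2, -0.3], [-1.2, -0.3]) == 1.0

# A small per-token discrepancy stays under the 1.05 threshold
err = multiplicative_logprob_error([-1.20, -0.30], [-1.22, -0.31])
assert 1.0 < err < 1.05
```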
+
+## [1.max_model_len_respected.py](https://github.com/NVIDIA-NeMo/RL/blob/main/tools/model_diagnostics/1.max_model_len_respected.py)
+
+Test if a new model respects the `max_model_len` passed to vLLM:
+
+```sh
+# Run that is expected to pass
+uv run --extra vllm tools/model_diagnostics/1.max_model_len_respected.py Qwen/Qwen2.5-1.5B
+# ...
+# Prompt tokens: 8
+# Generated tokens: 12
+# Total tokens: 20
+# [Qwen/Qwen2.5-1.5B] ALL GOOD!
+```
+
+## [2.long_generation_decode_vs_prefill.py](https://github.com/NVIDIA-NeMo/RL/blob/main/tools/model_diagnostics/2.long_generation_decode_vs_prefill.py)
+
+Test that vLLM yields near-identical token log-probabilities when comparing decoding with a single prefill pass across multiple prompts.
+
+```sh
+# Run that is expected to pass
+uv run --extra vllm tools/model_diagnostics/2.long_generation_decode_vs_prefill.py Qwen/Qwen2.5-1.5B
+# ...
+# [Qwen/Qwen2.5-1.5B] ALL GOOD!
+```
+
+## [3.check_hf_model_embeddings_untrained.py](https://github.com/NVIDIA-NeMo/RL/blob/main/tools/model_diagnostics/3.check_hf_model_embeddings_untrained.py)
+
+Detects untrained or improperly initialized Hugging Face model embeddings by scanning for near-zero rows and rows with near-identical values in both input and output embeddings. The script also reports whether word embeddings are tied and summarizes basic statistics.
+
+```sh
+# Example run
+uv run --extra mcore tools/model_diagnostics/3.check_hf_model_embeddings_untrained.py --model nvidia/Nemotron-H-8B-Base-8K
+
+# ....
+#================================================================================ +#EMBEDDING SUMMARIES +#================================================================================ +# +#--- Input Embeddings Summary --- +#Shape: torch.Size([131072, 4096]), Dtype: torch.bfloat16 +#Near-zero embeddings (abs < 1.00e-10): 1039/131072 (0.8%) +# Indices: 0-1,3-999,1192-1193,1245-1255,55014,77579,81772,81819,82312,82500,82725,82737,82977,84020,84121,84521,84794,85015,86409,87411,89412,90320,91368,94485,96385,104097,108262,112147,112327,112497,114755 +#Identical embeddings (std < 1.00e-08): 1041/131072 (0.8%) +# Indices: 0-1,3-999,1192-1193,1245-1255,55014,77579,81772,81819,82312,82500,82725,82737,82977,83855,84020,84121,84521,84794,85015,86409,87411,89412,90320,91368,94485,96385,101707,104097,108262,112147,112327,112497,114755 +#Statistics: mean_abs=0.007874, max_abs=0.196289, std_range=[0.000000, 0.015442] +#⚠️ POTENTIAL ISSUES: 1039 near-zero embeddings, 1041 identical embeddings +# +#--- Output Embeddings Summary (Tied: False) --- +#Shape: torch.Size([131072, 4096]), Dtype: torch.bfloat16 +#Near-zero embeddings (abs < 1.00e-10): 0/131072 (0.0%) +#Identical embeddings (std < 1.00e-08): 0/131072 (0.0%) +#Statistics: mean_abs=0.006775, max_abs=0.200195, std_range=[0.004089, 0.021240] +#✅ No obvious untrained patterns detected +# +#=== Final Summary === +#Model: nvidia/Nemotron-H-8B-Base-8K +#Analysis complete. +``` + +- Thresholds can be adjusted via flags: + - `--near-zero-threshold` (default: `1e-10`) + - `--identical-threshold` (default: `1e-8`) +- If any near-zero or identical rows are reported, the model may have issues of numerical instability (e.g., inf grad norms) during post-training if any of these problematic tokens are encountered. We have observed this happening when special tokens are reserved in the tokenizer and embedding, but none are encountered during pre-training. 
It may help to initialize these embeddings similarly to how they were initialized during pre-training.
+
+## [4.vllm_precision_compilation_test.py](https://github.com/NVIDIA-NeMo/RL/blob/main/tools/model_diagnostics/4.vllm_precision_compilation_test.py)
+
+Tests vLLM precision compilation by comparing log probabilities across different compilation modes and configurations. This script helps diagnose numerical precision issues that commonly arise when using different vLLM compilation settings. **Note that this is not a strict pass/fail test** - it's designed to help you understand and investigate numerical discrepancies.
+
+```sh
+# Example run
+uv run --extra vllm tools/model_diagnostics/4.vllm_precision_compilation_test.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
+
+# Typical output shows mixed results:
+# Eager and cuda graph mode lps: FAILED - Arrays are different
...
+# Eager and cuda graph mode lps with torch inductor precision flag: FAILED - Arrays are different
...
+# Eager and cuda graph mode lps with use_inductor disabled: PASSED - Arrays are close within tolerance (atol=0.001, rtol=0.001)
+```
+
+See an example for model `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`:
+```
+====================================================================================================
+Eager and cuda graph mode lps (prompt lps): FAILED - Arrays are different
+  Detailed error:
+Not equal to tolerance rtol=0.001, atol=0.001
+
+Mismatched elements: 96 / 515 (18.6%)
+Max absolute difference among violations: 0.3885002
+Max relative difference among violations: 0.20179409
+ ACTUAL: array([[-1.424489e+01, -3.924684e-01, -3.135911e+00, -4.258007e-01,
+        -3.443364e-04,           nan,           nan,           nan,
+                  nan,           nan,           nan,           nan,...
+ DESIRED: array([[-1.420929e+01, -3.619126e-01, -3.241854e+00, -4.308376e-01,
+        -3.047717e-04,           nan,           nan,           nan,
+                  nan,           nan,           nan,           nan,...
+==================================================================================================== +==================================================================================================== +Eager and cuda graph mode lps (generation lps): FAILED - Arrays are different + Detailed error: +Not equal to tolerance rtol=0.001, atol=0.001 + +nan location mismatch: + ACTUAL: array([[-1.231834e+01, -1.411233e-01, -3.764260e-01, ..., nan, + nan, nan], + [-8.567932e+00, -1.066314e+01, -4.463661e-01, ..., nan,... + DESIRED: array([[-1.226752e+01, -1.508305e-01, -4.024158e-01, ..., nan, + nan, nan], + [-8.610202e+00, -1.067061e+01, -4.593382e-01, ..., -1.060957e-05,... +==================================================================================================== +... +==================================================================================================== +Eager and cuda graph mode lps with torch inductor precision flag (prompt lps): FAILED - Arrays are different + Detailed error: +Not equal to tolerance rtol=0.001, atol=0.001 + +Mismatched elements: 96 / 515 (18.6%) +Max absolute difference among violations: 0.3885002 +Max relative difference among violations: 0.20179409 + ACTUAL: array([[-1.424489e+01, -3.924684e-01, -3.135911e+00, -4.258007e-01, + -3.443364e-04, nan, nan, nan, + nan, nan, nan, nan,... + DESIRED: array([[-1.420929e+01, -3.619126e-01, -3.241854e+00, -4.308376e-01, + -3.047717e-04, nan, nan, nan, + nan, nan, nan, nan,... 
+====================================================================================================
+====================================================================================================
+Eager and cuda graph mode lps with torch inductor precision flag (generation lps): FAILED - Arrays are different
+  Detailed error:
+Not equal to tolerance rtol=0.001, atol=0.001
+
+nan location mismatch:
+ ACTUAL: array([[-1.231834e+01, -1.411233e-01, -3.764260e-01, ...,           nan,
+                  nan,           nan],
+        [-8.567932e+00, -1.066314e+01, -4.463661e-01, ...,           nan,...
+ DESIRED: array([[-1.226752e+01, -1.508305e-01, -4.024158e-01, ...,           nan,
+                  nan,           nan],
+        [-8.610202e+00, -1.067061e+01, -4.593382e-01, ..., -1.060957e-05,...
+====================================================================================================
+...
+Eager and cuda graph mode lps with use_inductor disabled (prompt lps): PASSED - Arrays are close within tolerance (atol=0.001, rtol=0.001)
+Eager and cuda graph mode lps with use_inductor disabled (generation lps): PASSED - Arrays are close within tolerance (atol=0.001, rtol=0.001)
+```
+
+**What this script tests:**
+
+The script compares both prompt and generation log probabilities under the following setups:
+
+1. **Eager vs CUDA Graph Mode**: Compares log probabilities between eager execution (ground truth) and CUDA graph compilation mode
+   - **⚠️ Commonly fails**: This comparison often shows discrepancies due to compilation optimizations
+2. **Torch Inductor Precision**: Tests with `TORCHINDUCTOR_EMULATE_PRECISION_CASTS=1` environment variable
+   - **⚠️ May help**: This flag may help but typically doesn't resolve all the numerical differences
+3.
**Inductor Disabled**: Verifies that disabling Torch Inductor compilation (`use_inductor=False`) maintains output consistency + - **✅ Usually works well**: This configuration often produces results very close to eager mode + - **Note**: `use_inductor=False` disables Inductor compilation but keeps CUDA graph capture active for compatible operations + +**Performance vs Accuracy Trade-offs:** + +The different compilation modes offer distinct trade-offs between accuracy and performance: + +- **Eager Mode** (`enforce_eager=True`): Highest accuracy (ground truth) but slowest execution +- **CUDA Graph Mode with Inductor Disabled** (`enforce_eager=False` and `compilation_config={"use_inductor": False}`): Near-eager accuracy with significant speedup from CUDA graph optimization +- **CUDA Graph Mode with Inductor Enabled** (`enforce_eager=False` and `compilation_config={"use_inductor": True}`): Potentially fastest execution with custom Triton kernels (since Triton is the current backend of Inductor), but may introduce numerical differences. For accuracy improvement, try the torch inductor precision flag: `export TORCHINDUCTOR_EMULATE_PRECISION_CASTS=1` + +**Note**: Performance characteristics vary by model. For example, `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` shows similar speed performance between `use_inductor=True` and `use_inductor=False`, making the accuracy-preserving option preferable. 
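The PASSED/FAILED verdicts in the output above come from elementwise comparisons of two log-probability arrays. A minimal sketch of that style of check (the values below are illustrative, not taken from a real run) might look like:

```python
import numpy as np

# Log probabilities from two vLLM configurations (illustrative values;
# nan marks padded positions past each sequence's end)
eager_lps = np.array([-12.31834, -0.1411233, -0.3764260, np.nan])
cudagraph_lps = np.array([-12.26752, -0.1508305, -0.4024158, np.nan])

def compare(actual, desired, atol=0.001, rtol=0.001):
    """Return a PASSED/FAILED verdict like the diagnostic script prints."""
    try:
        # equal_nan=True (the default) also requires nan locations to match
        np.testing.assert_allclose(actual, desired, atol=atol, rtol=rtol)
        return "PASSED - Arrays are close within tolerance"
    except AssertionError:
        return "FAILED - Arrays are different"

print(compare(eager_lps, cudagraph_lps))  # differences exceed the tolerance
print(compare(eager_lps, eager_lps))      # identical arrays pass
```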
+
+**Why this matters:**
+
+- **Debugging**: Helps identify which compilation settings cause numerical differences
+- **Configuration**: Shows which settings work best for your model
+- **Understanding**: Reveals how compilation affects model outputs
+
+**When to use:**
+
+- **Model integration** - understand numerical behavior across vLLM configurations
+- **Debugging** - investigate differences between development and production
+- **Research** - study compilation strategy impacts on precision
+
+**Interpreting results:**
+
+- **Eager vs CUDA Graph failures are normal** - don't panic if this fails
+- **Focus on patterns** - some models are more sensitive than others
+- **Use as guidance** - helps choose reliable compilation settings
+- **Balance precision vs performance** - choose what works for your use case
diff --git a/fern/v0.5.0/pages/cluster.mdx b/fern/v0.5.0/pages/cluster.mdx
new file mode 100644
index 0000000000..74669fd5bd
--- /dev/null
+++ b/fern/v0.5.0/pages/cluster.mdx
@@ -0,0 +1,577 @@
+---
+title: Set Up Clusters
+description: ""
+---
+
+This guide explains how to run NeMo RL with Ray on Slurm or Kubernetes.
+
+## Use Slurm for Batched and Interactive Jobs
+
+The following sections explain how to use Slurm to submit batched jobs and to run jobs interactively.
+
+### Batched Job Submission
+
+```sh
+# Run from the root of NeMo RL repo
+NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
+
+COMMAND="uv run ./examples/run_grpo.py" \
+CONTAINER=YOUR_CONTAINER \
+MOUNTS="$PWD:$PWD" \
+sbatch \
+    --nodes=${NUM_ACTOR_NODES} \
+    --account=YOUR_ACCOUNT \
+    --job-name=YOUR_JOBNAME \
+    --partition=YOUR_PARTITION \
+    --time=1:0:0 \
+    --gres=gpu:8 \
+    ray.sub
+```
+
+> [!TIP]
+> Depending on your Slurm cluster configuration, you may or may not need to include the `--gres=gpu:8` option in the `sbatch` command.
+
+> [!NOTE]
+> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead of `--gres=gpu:8`.
+
+Upon successful submission, Slurm will print the `SLURM_JOB_ID`:
+```text
+Submitted batch job 1980204
+```
+Make a note of the job submission number. Once the job begins, you can track its progress in the driver logs, which you can `tail`:
+```sh
+tail -f 1980204-logs/ray-driver.log
+```
+
+### Interactive Launching
+
+> [!TIP]
+> A key advantage of running interactively on the head node is the ability to execute multiple multi-node jobs without needing to requeue in the Slurm job queue. This means that during debugging sessions, you can avoid submitting a new `sbatch` command each time. Instead, you can debug and re-submit your NeMo RL job directly from the interactive session.
+
+To run interactively, launch the same command as [Batched Job Submission](#batched-job-submission), but omit the `COMMAND` line:
+```sh
+# Run from the root of NeMo RL repo
+NUM_ACTOR_NODES=1 # Total nodes requested (head is colocated on ray-worker-0)
+
+CONTAINER=YOUR_CONTAINER \
+MOUNTS="$PWD:$PWD" \
+sbatch \
+    --nodes=${NUM_ACTOR_NODES} \
+    --account=YOUR_ACCOUNT \
+    --job-name=YOUR_JOBNAME \
+    --partition=YOUR_PARTITION \
+    --time=1:0:0 \
+    --gres=gpu:8 \
+    ray.sub
+```
+
+> [!NOTE]
+> For GB200 systems with 4 GPUs per node, use `--gres=gpu:4` instead.
+
+Upon successful submission, Slurm will print the `SLURM_JOB_ID`:
+```text
+Submitted batch job 1980204
+```
+Once the Ray cluster is up, a script will be created to attach to the Ray head node. Run this script to launch experiments:
+```sh
+bash 1980204-attach.sh
+```
+Now that you are on the head node, you can launch the command as follows:
+```sh
+uv run ./examples/run_grpo.py
+```
+
+### Slurm Environment Variables
+
+All Slurm environment variables described below can be added to the `sbatch`
+invocation of `ray.sub`. For example, `GPUS_PER_NODE=8` can be specified as follows:
+
+```sh
+GPUS_PER_NODE=8 \
+... \
+sbatch ray.sub \
+  ...
+```
+#### Common Environment Configuration
+
+| Environment Variable | Explanation |
+|---|---|
+| `CONTAINER` | (Required) Specifies the container image to be used for the Ray cluster. Use either a docker image from a registry or a squashfs (if using enroot/pyxis). |
+| `MOUNTS` | (Required) Defines paths to mount into the container. Examples: `MOUNTS="$PWD:$PWD"` mounts the current working directory (CWD); `MOUNTS="$PWD:$PWD,/nfs:/nfs:ro"` mounts the current working directory and `/nfs`, with `/nfs` mounted as read-only. |
+| `COMMAND` | Command to execute after the Ray cluster starts. If empty, the cluster idles and enters interactive mode (see the [Slurm interactive instructions](#interactive-launching)). |
+| `HF_HOME` | Sets the cache directory for huggingface-hub assets (e.g., models/tokenizers). |
+| `WANDB_API_KEY` | Setting this allows you to use the wandb logger without having to run `wandb login`. |
+| `HF_TOKEN` | Sets the token used by huggingface-hub. Avoids having to run `huggingface-cli login`. |
+| `HF_DATASETS_CACHE` | Sets the cache directory for downloaded Hugging Face datasets. |
+
+> [!TIP]
+> When `HF_TOKEN`, `WANDB_API_KEY`, `HF_HOME`, and `HF_DATASETS_CACHE` are set in your shell environment using `export`, they are automatically passed to `ray.sub`. For instance, if you set:
+>
+> ```sh
+> export HF_TOKEN=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+> ```
+> this token will be available to your NeMo RL run. Consider adding these exports to your shell configuration file, such as `~/.bashrc`.
+
+#### Advanced Environment Configuration
+
+| Environment Variable (and default) | Explanation |
+|---|---|
+| `UV_CACHE_DIR_OVERRIDE` | By default, this variable does not need to be set. If unset, `ray.sub` uses the `UV_CACHE_DIR` defined within the container (defaulting to `/root/.cache/uv`). `ray.sub` intentionally avoids using the `UV_CACHE_DIR` from the user's host environment to prevent the host's cache from interfering with the container's cache. Set `UV_CACHE_DIR_OVERRIDE` if you have a customized `uv` environment (e.g., with pre-downloaded packages or specific configurations) that you want to persist and reuse across container runs. This variable should point to a path on a shared filesystem accessible by all nodes (head and workers). This path will be mounted into the container and will override the container's default `UV_CACHE_DIR`. |
+| `CPUS_PER_WORKER=128` | CPUs each Ray worker node claims. Default is `16 * GPUS_PER_NODE`. |
+| `GPUS_PER_NODE=8` | Number of GPUs each Ray worker node claims. To determine this, run `nvidia-smi` on a worker node. |
+| `BASE_LOG_DIR=$SLURM_SUBMIT_DIR` | Base directory for storing Ray logs. Defaults to the Slurm submission directory ([SLURM_SUBMIT_DIR](https://slurm.schedmd.com/sbatch.html#OPT_SLURM_SUBMIT_DIR)). |
+| `NODE_MANAGER_PORT=53001` | Port for the Ray node manager on worker nodes. |
+| `OBJECT_MANAGER_PORT=53003` | Port for the Ray object manager on worker nodes. |
+| `RUNTIME_ENV_AGENT_PORT=53005` | Port for the Ray runtime environment agent on worker nodes. |
+| `DASHBOARD_AGENT_GRPC_PORT=53007` | gRPC port for the Ray dashboard agent on worker nodes. |
+| `METRICS_EXPORT_PORT=53009` | Port for exporting metrics from worker nodes. |
+| `PORT=6379` | Main port for the Ray head node. |
+| `RAY_CLIENT_SERVER_PORT=10001` | Port for the Ray client server on the head node. |
+| `DASHBOARD_GRPC_PORT=52367` | gRPC port for the Ray dashboard on the head node. |
+| `DASHBOARD_PORT=8265` | Port for the Ray dashboard UI on the head node. This is also the port used by the Ray distributed debugger. |
+| `DASHBOARD_AGENT_LISTEN_PORT=52365` | Listening port for the dashboard agent on the head node. |
+| `MIN_WORKER_PORT=54001` | Minimum port in the range for Ray worker processes. |
+| `MAX_WORKER_PORT=54257` | Maximum port in the range for Ray worker processes. |
+
+> [!NOTE]
+> For the most part, you will not need to change ports unless these
+> are already taken by some other service backgrounded on your cluster.
+
+## Kubernetes
+
+This guide outlines the process of migrating NeMo RL training jobs from a Slurm environment to a Kubernetes cluster utilizing Ray orchestration and NVIDIA GPUs.
+
+---
+
+## Prerequisites
+
+Before beginning, ensure the following requirements are met:
+
+* **Cluster Access:** You must have access to the K8s cluster from a client machine via `kubectl`.
+
+> [!IMPORTANT]
+> **Authentication Required**:
+> Simply installing `kubectl` on your local machine is not sufficient. You must work with your **Infrastructure Administrator** to obtain a valid `KUBECONFIG` file (usually placed at `~/.kube/config`) or authentication token. This file contains the endpoint and credentials required to connect your local client to the specific remote GPU cluster.
+>
+* **Operators:** The cluster must have the [**NVIDIA GPU Operator**](https://github.com/NVIDIA/gpu-operator) (for GPU provisioning) and the [**KubeRay Operator**](https://github.com/ray-project/kuberay) (for Ray cluster lifecycle management) installed.
+* **Registry Access:** Ability to push/pull Docker images to a registry (e.g., nvcr.io or Docker Hub).
+
+### 1. Test Cluster Access
+Verify your connection and operator status:
+
+```bash
+kubectl get pods -o wide -w
+```
+
+### 2. Build and Push the Docker Container
+We will use the NVIDIA cloud registry (`nvcr.io`) for this guide. From your client machine:
+
+**Login to the Registry**
+```bash
+# Set up Docker and nvcr.io with your NGC_API_KEY
+docker login nvcr.io

+# Username: $oauthtoken
+# Password: <YOUR_NGC_API_KEY>
+```
+
+**Build and Push**
+Clone the NeMo RL repository and build the container.
+
+```bash
+# Clone recursively
+git clone https://github.com/NVIDIA-NeMo/RL --recursive
+cd RL
+
+# If you already cloned without --recursive, update submodules:
+git submodule update --init --recursive
+
+# Set your organization
+export NGC_ORG=
+
+# Self-contained build (default: builds from main)
+docker buildx build --target release -f docker/Dockerfile --tag nvcr.io/${NGC_ORG}/nemo-rl:latest --push .
+```
+
+---
+
+## Phase 1: Infrastructure Setup
+
+### 1. Configure Shared Storage (NFS)
+This tutorial uses an NFS-based `ReadWriteMany` volume to ensure the Head node and Worker nodes see the exact same files (code, data, checkpoints). This prevents "File Not Found" errors.
+
+> **Note:** This is a cluster-wide resource. If your admin has already provided an NFS storage class, you only need to create this PVC once.
+
+**File:** `shared-pvc.yaml`
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: nemo-shared-workspace
+spec:
+  accessModes:
+    - ReadWriteMany # Critical: Allows RW access from multiple nodes
+  storageClassName: nfs-client
+  resources:
+    requests:
+      storage: 2Ti # Adjust based on dataset and model size
+```
+
+**Apply the configuration:**
+```bash
+kubectl apply -f shared-pvc.yaml
+```
+
+### 2. Create Registry Secret
+This secret allows the cluster to pull the private image you built earlier.
+
+```bash
+kubectl create secret docker-registry nvcr-secret \
+  --docker-server=nvcr.io \
+  --docker-username='$oauthtoken' \
+  --docker-password=YOUR_NGC_API_KEY_HERE \
+  --docker-email=admin@example.com
+```
+
+---
+
+## Phase 2: Ray Cluster Configuration
+
+We will create a Ray cluster with **1x Head node** and **1x Worker node** (with 8x GPUs each).
+
+**Key Configuration Notes:**
+* **Networking:** Uses `bond0` to bypass virtual ethernet overhead (check with your admin regarding the correct interface for NCCL).
+* **Memory:** Disables Ray's OOM killer to prevent false positives.
+* **Caching:** Redirects HuggingFace cache to the shared PVC. +* **Version Match:** The `rayVersion` spec must match the version in `RL/pyproject.toml`. Check this example [version snapshot](https://github.com/NVIDIA-NeMo/RL/blob/b2e4265d4f2424c0467691f2f0f864cdebe1ab0f/pyproject.toml#L25). +* **Container image:** Replace the image name `nvcr.io/nvidian/nemo-rl:latest` with your actual image, e.g., `nvcr.io/YOUR_NGC_ORG/nemo-rl:latest`. + +> [!WARNING] +> **Check Your Node Capacity & Resource Limits** +> The resource requests in the manifest below (e.g., `cpu: "128"`, `memory: "1500Gi"`) are configured for high-end H100 nodes. If these numbers exceed your physical node's available capacity, your pods will remain in a **Pending** state indefinitely. +> +> Additionally, the shared memory volume is backed by actual node RAM: +> ```yaml +> volumes: +> - name: dshm +> emptyDir: +> medium: Memory +> sizeLimit: "1000Gi" # Counts against Node RAM +> ``` +> You must ensure your physical node has enough memory to cover the container `requests` **plus** the `sizeLimit` of this volume. Please adjust these values to match your specific hardware compute shape. 
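As a quick sanity check on the warning above, the RAM a node must actually supply for one of these pods is the container memory request plus the `dshm` volume's `sizeLimit` (a sketch using the example values from this guide; substitute your own):

```python
# Memory a node must have available for one Ray pod: the container's
# memory request plus the tmpfs-backed /dev/shm volume, since an
# emptyDir with medium=Memory counts against node RAM.
container_request_gi = 1500  # memory: "1500Gi" in the pod spec
dshm_size_limit_gi = 1000    # sizeLimit: "1000Gi" for the dshm volume

required_node_ram_gi = container_request_gi + dshm_size_limit_gi
print(f"Required node RAM: {required_node_ram_gi}Gi")  # Required node RAM: 2500Gi
```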
+ +**File:** `nemo-rl-h100.yaml` + +```yaml +apiVersion: ray.io/v1 +kind: RayCluster +metadata: + name: nemo-h100-cluster +spec: + rayVersion: '2.49.2' + + ###################### + # HEAD NODE (Uniform with Workers) + ###################### + headGroupSpec: + rayStartParams: + dashboard-host: '0.0.0.0' + block: 'true' + num-gpus: "8" + template: + spec: + imagePullSecrets: + - name: nvcr-secret + + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" + + containers: + - name: ray-head + image: nvcr.io/nvidian/nemo-rl:latest + imagePullPolicy: Always + resources: + limits: + nvidia.com/gpu: 8 + cpu: "128" + memory: "1500Gi" + requests: + nvidia.com/gpu: 8 + cpu: "128" + memory: "1500Gi" + env: + - name: NVIDIA_VISIBLE_DEVICES + value: "all" + # IMPORTANT: Verify the correct network interface with your cluster admin + # Common values: bond0, eth0, ib0 (for InfiniBand) + # Run 'ip addr' or 'ifconfig' on a node to identify available interfaces + - name: NCCL_SOCKET_IFNAME + value: bond0 + - name: NCCL_SHM_DISABLE + value: "0" + - name: RAY_memory_monitor_refresh_ms + value: "0" + - name: HF_HOME + value: "/shared/huggingface" + volumeMounts: + # All code and data now live here + - mountPath: /shared + name: shared-vol + - mountPath: /dev/shm + name: dshm + volumes: + - name: shared-vol + persistentVolumeClaim: + claimName: nemo-shared-workspace + - name: dshm + emptyDir: + medium: Memory + sizeLimit: "1000Gi" + + ########################## + # WORKER NODES (H100) + ########################## + workerGroupSpecs: + - replicas: 1 + minReplicas: 1 + maxReplicas: 1 + groupName: gpu-group-h100 + rayStartParams: + block: 'true' + num-gpus: "8" + template: + spec: + imagePullSecrets: + - name: nvcr-secret + + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchExpressions: + - 
key: ray.io/node-type
+                  operator: In
+                  values: ["worker", "head"]
+              topologyKey: "kubernetes.io/hostname"
+
+      containers:
+      - name: ray-worker
+        image: nvcr.io/nvidian/nemo-rl:latest
+        imagePullPolicy: Always
+        resources:
+          limits:
+            nvidia.com/gpu: 8
+            cpu: "128"
+            memory: "1500Gi"
+          requests:
+            nvidia.com/gpu: 8
+            cpu: "128"
+            memory: "1500Gi"
+        env:
+        # IMPORTANT: Verify the correct network interface with your cluster admin
+        # Common values: bond0, eth0, ib0 (for InfiniBand)
+        # Run 'ip addr' or 'ifconfig' on a node to identify available interfaces
+        - name: NCCL_SOCKET_IFNAME
+          value: bond0
+        - name: NCCL_SHM_DISABLE
+          value: "0"
+        - name: RAY_memory_monitor_refresh_ms
+          value: "0"
+        - name: HF_HOME
+          value: "/shared/huggingface"
+        volumeMounts:
+        - mountPath: /shared
+          name: shared-vol
+        - mountPath: /dev/shm
+          name: dshm
+
+      tolerations:
+      - key: "nvidia.com/gpu"
+        operator: "Exists"
+        effect: "NoSchedule"
+      volumes:
+      - name: shared-vol
+        persistentVolumeClaim:
+          claimName: nemo-shared-workspace
+      - name: dshm
+        emptyDir:
+          medium: Memory
+          sizeLimit: "1000Gi"
+
+```
+
+**Cluster Management Commands:**
+
+* **Startup:** `kubectl create -f nemo-rl-h100.yaml`
+* **Shutdown:** `kubectl delete -f nemo-rl-h100.yaml`
+
+---
+
+## Phase 3: Run Sample NeMo RL Workloads
+
+Once the cluster is running, you can interact with the Ray head node to submit jobs.
+
+### 1. Access the Head Node
+```bash
+kubectl exec -it $(kubectl get pod -l ray.io/node-type=head -o jsonpath='{.items[0].metadata.name}') -- /bin/bash
+```
+
+### 2. Setup Code on Shared Volume
+Inside the pod, clone the code to the shared PVC (`/shared`). This ensures workers can see the code.
+
+```bash
+cd /shared
+git clone https://github.com/NVIDIA-NeMo/RL --recursive
+cd RL
+git submodule update --init --recursive
+```
+
+### 3. Submit a Job
+Move to the code directory, edit your configuration, and run the job.
+
+```bash
+cd /shared/RL
+
+# Edit config (e.g., paths, model config)
+vim examples/configs/grpo_math_1B.yaml
+
+# Set environment variables
+export HF_TOKEN=...
+export WANDB_API_KEY=...
+
+# Run the job
+uv run examples/run_grpo.py \
+    --config examples/configs/grpo_math_1B.yaml
+```
+
+### 4. Configuration Adjustments
+To run across multiple nodes, or to ensure logs/checkpoints persist, update your YAML config file (`examples/configs/grpo_math_1B.yaml`):
+
+**Cluster Size:**
+```yaml
+cluster:
+  gpus_per_node: 8
+  num_nodes: 2
+```
+
+**Logging & Checkpointing:**
+Redirect these to `/shared` so they persist after the pod is deleted.
+
+```yaml
+checkpointing:
+  enabled: true
+  checkpoint_dir: "/shared/results/grpo"
+
+# ...
+
+logger:
+  log_dir: "/shared/logs"  # Base directory for all logs
+  wandb_enabled: true
+  wandb:
+    project: "grpo-dev"
+    name: "grpo-dev-logger"
+```
+
+### 5. Monitoring
+* **Console:** Watch job progress directly in the terminal where you ran `uv run`.
+* **WandB:** If enabled, check the Weights & Biases web interface.
+
+---
+
+## Utility: PVC Busybox Helper
+
+Use a lightweight "busybox" pod to inspect the PVC or copy data in/out without spinning up a heavy GPU node.
+
+**Create the Busybox Pod:**
+
+```bash
+# Variables
+PVC_NAME=nemo-shared-workspace
+MOUNT_PATH=/shared
+
+kubectl create -f - <<EOF
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nemo-workspace-busybox
+spec:
+  containers:
+  - name: busybox
+    image: busybox
+    command: ["sleep", "infinity"]
+    volumeMounts:
+    - name: workspace
+      mountPath: ${MOUNT_PATH}
+  volumes:
+  - name: workspace
+    persistentVolumeClaim:
+      claimName: ${PVC_NAME}
+EOF
+```
+
+* **Copy Data (Local -> PVC):**
+  ```bash
+  kubectl cp ./my-nemo-code nemo-workspace-busybox:/shared/
+  ```
diff --git a/fern/v0.5.0/pages/debugging.mdx b/fern/v0.5.0/pages/debugging.mdx
new file mode 100644
index 0000000000..ccb0c1b15d
--- /dev/null
+++ b/fern/v0.5.0/pages/debugging.mdx
@@ -0,0 +1,80 @@
+---
+title: Debug NeMo RL Applications
+description: ""
+---
+
+This guide explains how to debug NeMo RL applications, covering two scenarios. It first outlines the procedure for debugging distributed Ray worker/actor processes using the Ray Distributed Debugger within a SLURM environment, and then details debugging the main driver script.
+
+## Debug Worker/Actors on SLURM
+
+Since Ray programs can spawn multiple workers and actors, using the Ray Distributed Debugger is essential to accurately jump to breakpoints on each worker.
+
+### Prerequisites
+
+* Install the [Ray Debugger VS Code/Cursor extension](https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html).
+* Launch the [interactive cluster](/cluster#interactive-launching) with `ray.sub`.
+* Launch VS Code/Cursor on the SLURM login node (where `squeue`/`sbatch` are available).
+* Add `breakpoint()` in your code under actors and tasks (i.e., classes or functions decorated with `@ray.remote`).
+* **Ensure** `RAY_DEBUG=legacy` is not set, since this debugging requires the default distributed debugger.
+
+### Forward a Port from the Head Node
+
+From the SLURM login node, query the nodes used by the interactive `ray.sub` job as follows:
+
+```sh
+terryk@slurm-login:~$ squeue --me
+   JOBID PARTITION        NAME     USER ST   TIME  NODES NODELIST(REASON)
+ 2504248     batch ray-cluster   terryk  R  15:01      4 node-12,node-[22,30],node-49
+```
+
+The first node is always the head node, so we need to port-forward the dashboard port to the login node:
+
+```sh
+# Traffic from the login node's $LOCAL_PORT is forwarded to node-12:$DASHBOARD_PORT
+# - If you haven't changed the default DASHBOARD_PORT in ray.sub, it is likely 8265
+# - Choose a LOCAL_PORT that isn't taken. If the cluster is multi-tenant, 8265
+#   on the login node is likely taken by someone else.
+ssh -L $LOCAL_PORT:localhost:$DASHBOARD_PORT -N node-12
+
+# Example choosing a port other than 8265 for the LOCAL_PORT
+ssh -L 52640:localhost:8265 -N node-12
+```
+
+The port-forwarding `ssh` command may print logs like the following; the warning is expected.
+
+```text
+Warning: Permanently added 'node-12' (ED25519) to the list of known hosts.
+bind [::1]:52640: Cannot assign requested address
+```
+
+### Open the Ray Debugger Extension
+
+In VS Code or Cursor, open the Ray Debugger extension by clicking the Ray icon in the activity bar or searching for "View: Show Ray Debugger" in the Command Palette (Ctrl+Shift+P or Cmd+Shift+P).
+
+![Ray Debugger Extension Step 1](/assets/ray-debug-step1.png)
+
+### Add the Ray Cluster
+
+Click on the "Add Cluster" button in the Ray Debugger panel.
+
+![Ray Debugger Extension Step 2](/assets/ray-debug-step2.png)
+
+Enter the address and port you set up in the port-forwarding step. If you followed the example above using port 52640, you would enter:
+
+![Ray Debugger Extension Step 3](/assets/ray-debug-step3.png)
+
+### Add a Breakpoint and Run Your Program
+
+The Ray Debugger panel for cluster `127.0.0.1:52640` lists all active breakpoints. To begin debugging, select a breakpoint from the dropdown and click `Start Debugging` to jump to that worker.
+
+Note that you can jump between breakpoints across all workers with this process.
+
+![Ray Debugger Extension Step 4](/assets/ray-debug-step4.png)
+
+## Debug with the Legacy Ray Debugger
+
+There are two ways to enable the legacy Ray debugger:
+1. In general, set `RAY_DEBUG=legacy` and add `--ray-debugger-external` to your `ray start` command.
+2. If you are using `ray.sub` on a SLURM cluster, simply set `RAY_DEBUG=legacy` before running `sbatch ray.sub`. The script detects this environment variable and appends `--ray-debugger-external` automatically.
+
+After starting Ray with these changes, add `breakpoint()` calls to your code. When you run the program, execution stops wherever breakpoints are inserted. You can then use a separate terminal to attach to the head node via `bash -attach.sh` (this script is generated automatically by `ray.sub`) and run `ray debug` to list all the breakpoints. You can enter any breakpoint and debug interactively.
Please refer to [Ray documentation](https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html) for more info on this debugging approach. diff --git a/fern/v0.5.0/pages/design-docs/chat-datasets.mdx b/fern/v0.5.0/pages/design-docs/chat-datasets.mdx new file mode 100644 index 0000000000..fbc8735e95 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/chat-datasets.mdx @@ -0,0 +1,66 @@ +--- +title: Data Format +description: "" +--- + +This guide outlines the required data format for Hugging Face chat datasets and demonstrates how to use chat templates with Hugging Face tokenizers to add special tokens or task-specific information. + +## Hugging Face Chat Datasets + +Hugging Face chat datasets are expected to have the following structure: Each example in the dataset should be a dictionary with a `messages` key. The `messages` should be a list of dictionaries, each with a `role` and `content` key. The `role` typically has one of the following values: `system`, `user`, and `assistant`. For example: + +```json +{ + "messages": [ + { + "role": "system", + "content": "This is a helpful system message." + }, + { + "role": "user", + "content": "This is a user's question" + }, + { + "role": "assistant", + "content": "This is the assistant's response." + } + ] +} +``` + +## Chat Templates + +Formatting the data in this way allows us to take advantage of the Hugging Face tokenizers' `apply_chat_template` functionality to combine the messages. Chat templates can be used to add special tokens or task-specific information to each example in the dataset. Refer to the [HuggingFace apply_chat_template documentation](https://huggingface.co/docs/transformers/main/en/chat_templating#using-applychattemplate) for details. + +By default, `apply_chat_template` attempts to apply the `chat_template` associated with the tokenizer. However, in some cases, users might want to specify their own chat template. 
Also, note that many tokenizers do not have an associated `chat_template`, in which case an explicit chat template is required. Users can write a chat template string in Jinja format and pass it to `apply_chat_template`.
+The following is an example using a simple template which prepends a role header to each turn:
+
+```python
+from transformers import AutoTokenizer
+
+example_template = "{% for message in messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{{ content }}{% endfor %}"
+
+example_input = [
+    {
+        'role': 'user',
+        'content': 'Hello!'
+    },
+    {
+        'role': 'assistant',
+        'content': 'Hi there!'
+    }
+]
+tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
+output = tokenizer.apply_chat_template(example_input, chat_template=example_template, tokenize=False)
+
+## this is the output string we expect
+expected_output = '<|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there!<|eot_id|>'
+assert output == expected_output
+```
+
+For more details on creating chat templates, refer to the [Hugging Face documentation](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template).
diff --git a/fern/v0.5.0/pages/design-docs/checkpointing.mdx b/fern/v0.5.0/pages/design-docs/checkpointing.mdx
new file mode 100644
index 0000000000..edf0304cc6
--- /dev/null
+++ b/fern/v0.5.0/pages/design-docs/checkpointing.mdx
@@ -0,0 +1,34 @@
+---
+title: Exporting Checkpoints to Hugging Face Format
+description: ""
+---
+
+NeMo RL provides two checkpoint formats for Hugging Face models: Torch distributed and Hugging Face format.
Torch distributed is used by default for efficiency, and Hugging Face format is provided for compatibility with Hugging Face's `AutoModel.from_pretrained` API. Note that Hugging Face format checkpoints save only the model weights and discard the optimizer states. We recommend using the Torch distributed format for intermediate checkpoints and saving a Hugging Face checkpoint only at the end of training.
+
+## Converting Torch Distributed Checkpoints to Hugging Face Format
+
+A checkpoint converter is provided to convert a Torch distributed checkpoint to Hugging Face format after training:
+
+```sh
+uv run examples/converters/convert_dcp_to_hf.py --config=<path-to-config> --dcp-ckpt-path=<path-to-dcp-checkpoint> --hf-ckpt-path=<path-to-output-hf-checkpoint>
+```
+
+Hugging Face checkpoints usually keep the weights and tokenizer together (which we also recommend for provenance), so copy the tokenizer into the converted checkpoint afterwards. Here's an end-to-end example:
+
+```sh
+# Change to your appropriate checkpoint directory
+CKPT_DIR=results/sft/step_10
+
+uv run examples/converters/convert_dcp_to_hf.py --config=$CKPT_DIR/config.yaml --dcp-ckpt-path=$CKPT_DIR/policy/weights --hf-ckpt-path=${CKPT_DIR}-hf
+rsync -ahP $CKPT_DIR/policy/tokenizer ${CKPT_DIR}-hf/
+```
+
+## Converting Megatron Checkpoints to Hugging Face Format
+
+For models that were originally trained using the Megatron-LM backend, a separate converter is available to convert Megatron checkpoints to Hugging Face format. This script requires Megatron-Core, so make sure to launch the conversion with the `mcore` extra.
For example,
+
+```sh
+CKPT_DIR=results/sft/step_10
+
+uv run --extra mcore examples/converters/convert_megatron_to_hf.py --config=$CKPT_DIR/config.yaml --megatron-ckpt-path=$CKPT_DIR/policy/weights/iter_0000000/ --hf-ckpt-path=<path-to-output-hf-checkpoint>
+```
diff --git a/fern/v0.5.0/pages/design-docs/dependency-management.mdx b/fern/v0.5.0/pages/design-docs/dependency-management.mdx
new file mode 100644
index 0000000000..6f3d4e5490
--- /dev/null
+++ b/fern/v0.5.0/pages/design-docs/dependency-management.mdx
@@ -0,0 +1,345 @@
+---
+title: Dependency Management
+description: ""
+---
+
+NeMo RL's dependency management system supports both production and development workflows through a flexible virtual environment architecture. This document explains how NeMo RL manages Python dependencies and when to use each workflow.
+
+## Workflows Overview
+
+NeMo RL supports two distinct workflows based on your use case:
+
+### Production Workflow
+
+A **production workflow** is when you run NeMo RL out-of-the-box (OOTB) without modifying dependencies. This is the typical scenario for:
+- Running NeMo RL with pre-built Docker containers
+- Using released versions without local modifications
+- Executing examples with default dependencies
+
+In a production workflow, the container's dependencies are aligned with your NeMo RL code version, and you can run applications directly without rebuilding environments.
+
+> [!NOTE]
+> This workflow is similar to how other machine learning projects work: the Docker image is static, and there's an assumption that the code works with the container's pre-installed dependencies. However, NeMo RL goes further by providing mechanisms to align container dependencies dynamically, offering more flexibility than traditional static containers.
+
+### Development Workflow
+
+A **development workflow** is when you actively modify dependencies, submodules, or work with code that has different dependency requirements than the container.
Common scenarios include: + +- **Version mismatch**: Using a container built from commit A, but your local NeMo RL code is at commit B, where B has different submodule versions or Python dependencies than A +- **Dependency changes**: Actively developing new features that require updated Python packages +- **Submodule modifications**: Working with modified versions of Megatron-LM, NeMo-Automodel, or other submodules + +> [!WARNING] +> If your container was built from commit `abc123` which used `vllm==0.9.0`, but your local checkout is at commit `def456` which requires `vllm==0.10.0`, you are in a development workflow. The container's cached environments won't match your code's requirements. + +## How `uv run` Works + +When you execute a NeMo RL application, such as: + +```bash +uv run examples/run_grpo.py +``` + +This command actually performs several steps behind the scenes: + +```bash +uv lock + uv sync + source .venv/bin/activate + python examples/run_grpo.py +``` + +Let's break down each component: + +### 1. `uv lock` + +Resolves all dependencies specified in [`pyproject.toml`](https://github.com/NVIDIA-NeMo/RL/blob/main/pyproject.toml#L21-L54) and generates a lock file (`uv.lock`) that pins exact versions of all packages. This ensures reproducible builds across different environments. + +### 2. `uv sync` + +Synchronizes your local virtual environment with the locked dependencies. It installs or updates packages as needed to match the lock file. + +The virtual environment location depends on your runtime environment: +- **Bare metal**: The venv defaults to `.venv/` local to your NeMo RL clone +- **Container**: The container sets [`UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv`](https://github.com/NVIDIA-NeMo/RL/blob/main/docker/Dockerfile#L67), so the environment is synced to `/opt/nemo_rl_venv`. Note that this location is ephemeral to the container instance. + +### 3. 
`source .venv/bin/activate`
+
+Activates the virtual environment, setting up the Python path and environment variables so your script runs with the correct dependencies.
+
+### 4. `python examples/run_grpo.py`
+
+Executes your driver script within the activated environment.
+
+## Multi-Environment Architecture
+
+```mermaid
+graph TD
+    subgraph Container["uv run examples/run_grpo.py"]
+        A[Driver Script Environment<br/>Default dependencies from pyproject.toml]
+        A --> B[Starts Ray Worker Groups]
+        B --> C[Policy Workers<br/>Separate venv: MCORE]
+        B --> D[Generation Workers<br/>Separate venv: VLLM]
+        B --> E[Environment Workers<br/>Separate venv: SYSTEM]
+    end
+```
+
+The driver script (`examples/run_grpo.py`) runs with the [default dependencies specified in `pyproject.toml`](https://github.com/NVIDIA-NeMo/RL/blob/main/pyproject.toml#L21-L54) (without optional extras like `mcore` or `vllm`). However, the application creates multiple worker groups, each potentially requiring different Python environments.
+
+### Worker Groups and Virtual Environments
+
+Within the driver script, NeMo RL starts multiple [`RayWorkerGroup`](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/distributed/worker_groups.py#L303-L313) instances. Each worker group manages a set of Ray actors that execute tasks in parallel. These workers may have specialized dependency requirements:
+
+- **Policy workers** (e.g., using Megatron-Core): Require `mcore` dependencies
+- **Generation workers** (e.g., vLLM): Require `vllm` dependencies
+- **Environment workers** (e.g., math evaluation): Use system/base dependencies
+
+Each worker type is mapped to a specific Python executable configuration in the [`ACTOR_ENVIRONMENT_REGISTRY`](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/distributed/ray_actor_environment_registry.py#L27-L46). This registry defines which virtual environment should be used for each actor type:
+
+```python
+ACTOR_ENVIRONMENT_REGISTRY: dict[str, str] = {
+    "nemo_rl.models.generation.vllm.vllm_worker.VllmGenerationWorker": VLLM_EXECUTABLE,
+    "nemo_rl.models.policy.megatron_policy_worker.MegatronPolicyWorker": MCORE_EXECUTABLE,
+    "nemo_rl.environments.math_environment.MathEnvironment": PY_EXECUTABLES.SYSTEM,
+    # ... more mappings
+}
+```
+
+> [!NOTE]
+> For more details on how workers define and use their Python executables, see the [UV Documentation](/uv#worker-configuration).
+
+## Container Pre-caching
+
+When a [release container](/../docker#building-the-release-image) is built, it pre-caches:
+
+1. **Virtual environments**: All worker virtual environments are created and stored in the container
+2. **UV cache**: Python packages are pre-downloaded into the UV cache directory
+
+This pre-caching significantly speeds up application startup in production workflows, as workers can immediately use their required environments without downloading or compiling packages.
+
+### When Pre-cached Environments Are Sufficient
+
+If your local NeMo RL checkout has the **same** Python dependencies and submodules as the container was built with, the pre-cached environments work seamlessly. You can simply run:
+
+```bash
+uv run examples/run_grpo.py
+```
+
+The workers will use the pre-cached virtual environments, and your application starts quickly.
+
+## Handling Dependency Changes
+
+When your local code has **different** dependencies than the container (development workflow), you have several options:
+
+### Option 1: Force Rebuild Environments
+
+Set the `NRL_FORCE_REBUILD_VENVS` environment variable to rebuild all worker virtual environments on every node:
+
+```bash
+export NRL_FORCE_REBUILD_VENVS=true
+uv run examples/run_grpo.py
+```
+
+This approach works on both single-node and multi-node setups. On multi-node runs, each node will independently rebuild its virtual environments.
+
+> [!TIP]
+> This approach is convenient for local development and small-scale experiments. It automatically rebuilds environments to match your current dependency specifications without requiring a container rebuild.
+
+> [!WARNING]
+> On large-scale distributed runs (e.g., >=32 nodes), rebuilding environments on all ranks can add significant overhead. Consider rebuilding the container for these large runs.
+
+### Option 2: Rebuild the Container
+
+For production deployments or large-scale runs, rebuild the container to pre-cache the new dependencies:
+
+```bash
+docker buildx build --target release -f docker/Dockerfile --tag my-registry/nemo-rl:custom .
+``` + +> [!TIP] +> Rebuilding the container is recommended when: +> - Running a job with many nodes (>=32 nodes) +> - Dependencies have changed significantly +> - You need reproducible, fast startup times +> - Multiple team members need the same environment + +The rebuilt container will have all virtual environments pre-cached with your updated dependencies, eliminating runtime overhead. + +### Option 3: Classic Workflow - Mounting Modified Submodules + +For situations where you're **only changing submodules** (like nemo-automodel, NeMo Gym, Megatron-LM, or Megatron-Bridge) but **not changing Python package versions**, you can use a classic mounting approach. This workflow assumes that the non-submodule Python packages in your local checkout match what the container was built with. + +The container's NeMo RL code is located at `/opt/nemo-rl`. By mounting your local `3rdparty/` directory over the container's `/opt/nemo-rl/3rdparty/`, you can swap out submodules without rebuilding environments or containers. + +**Example - Mounting Modified Submodules on Slurm:** + +Assuming you're launching from the root of your local NeMo RL clone: + +```bash +# Run from the root of NeMo RL repo + +CONTAINER=YOUR_CONTAINER \ +MOUNTS="$PWD:$PWD,$PWD/3rdparty:/opt/nemo-rl/3rdparty" \ +sbatch \ + --nodes=1 \ + --account=YOUR_ACCOUNT \ + --job-name=YOUR_JOBNAME \ + --partition=YOUR_PARTITION \ + --time=1:0:0 \ + ray.sub +``` + +This mounts: +1. `$PWD:$PWD` - Your local NeMo RL directory to the same path in the container +2. `$PWD/3rdparty:/opt/nemo-rl/3rdparty` - Your local submodules override the container's submodules at `/opt/nemo-rl/3rdparty` + +> [!NOTE] +> This approach works because Python packages are already installed in the cached virtual environments. You're only swapping out the source code in the `3rdparty/` submodules, which doesn't require reinstalling packages or rebuilding environments. 
+
+> [!IMPORTANT]
+> This workflow is **only suitable when**:
+> - Python package versions in `pyproject.toml` and `uv.lock` haven't changed
+> - You're only modifying code within submodules (nemo-automodel, NeMo Gym, Megatron-LM, Megatron-Bridge)
+> - The submodule commits/branches are compatible with the installed package versions
+
+If you've changed Python package versions or dependencies outside of submodules, use Option 1 (`NRL_FORCE_REBUILD_VENVS=true`) or Option 2 (rebuild the container) instead.
+
+## Decision Guide
+
+Use this flowchart to determine which workflow applies to you:
+
+```mermaid
+flowchart TD
+    A[Start] --> B{Are you modifying<br/>dependencies or submodules?}
+
+    B -->|No| C{Container built from<br/>same commit as code?}
+    B -->|Yes| D{Small scale<br/>or testing?}
+
+    C ---->|Yes| E["✓ Run directly
+    uv run examples/..."]
+    C -->|No| D
+
+    D -->|Yes| F["✓ Run with
+    NRL_FORCE_REBUILD_VENVS=true uv run examples/..."]
+    D -->|No| G[✓ Rebuild container with new dependencies]
+
+    G --> E
+```
+
+## Frozen Environments
+
+For users who prefer not to, or do not need to, use `uv` at runtime, NeMo RL containers provide "frozen" environments. In these environments, Python executables—each corresponding to an actor's `PY_EXECUTABLE`—are prebuilt with all required dependencies and made available directly in your `PATH`.
+
+### What Are Frozen Environments?
+
+In a frozen environment setup:
+- `pip` is available in all virtual environments
+- Python executables like `python-MegatronPolicyWorker` are accessible directly
+- Users can manually install packages with `python-MegatronPolicyWorker -m pip install <package>`
+
+> [!WARNING]
+> While `pip` installing packages into a frozen environment is possible for experimentation or local debugging, **all dependencies must ultimately be added to `pyproject.toml` and locked in `uv.lock` before any change is upstreamed**. Direct `pip` installs are not reproducible or supported for collaborative or production workflows. **We cannot accept package additions that only exist via manual pip installs.**
+
+### When to Use Frozen Environments
+
+Frozen environments are useful when:
+- You prefer traditional Python virtual environment workflows
+- You want to manually manage package installations with `pip`
+- You do not need `uv run` at runtime to automatically check if your dependencies are in sync
+
+> [!NOTE]
+> For most users, `uv run` is still the recommended approach as it ensures reproducible builds and automatic dependency management. Frozen environments require manual intervention to keep dependencies in sync.
+ +### Available Python Executables + +Containers provide convenience symlinks for each worker type: + +```bash +# List all available python executables +ls /usr/local/bin/python-* + +# Examples: +python # Default executable for driver scripts (e.g., examples/run_grpo.py) +python-MegatronPolicyWorker # For Megatron policy workers +python-VllmGenerationWorker # For vLLM generation workers +python-MathEnvironment # For environment workers +``` + +> [!NOTE] +> The `python` executable (without any suffix) corresponds to the default frozen environment used to launch driver scripts like `examples/run_grpo.py`. This environment contains the base dependencies from `pyproject.toml` without optional extras. + +To see which packages can be mounted for each executable: + +```bash +python tools/list_editable_packages.py +``` + +### Container Version Checking + +NeMo RL containers enforce environment reproducibility by automatically checking that your code and dependencies match the state of the container at build time. The version checking mechanism works by comparing: + +- The **md5sum of `pyproject.toml`** +- The **md5sum of `uv.lock`** +- The **commit hashes of relevant submodules** + +If any of these values differ between your code and the container image, NeMo RL will alert you and show exactly what has changed: + +```text +-------------------------------------------------------------------------------- +WARNING: Container/Code Version Mismatch Detected! + +-------------------------------------------------------------------------------- +Your container's dependencies do not match your current code. + +Differences found: + - pyproject.toml: + Container: abc123def456 + Current: xyz789abc012 + - uv.lock: + Container: 0987f6543210 + Current: 1234abcd5678 + - submodules/3rdparty/ExampleSubmodule: + Container: a1b2c3d4e5f6 + Current: f6e5d4c3b2a1 + +This can lead to unexpected behavior or errors. + +Solutions: + 1. Rebuild the container to match your code + 2. 
Set NRL_FORCE_REBUILD_VENVS=true to rebuild virtual environments
+     (This forces Ray workers to recreate their venvs with updated dependencies)
+  3. Set NRL_IGNORE_VERSION_MISMATCH=1 to bypass this check (not recommended)
+
+Learn more about dependency management:
+  https://github.com/NVIDIA-NeMo/RL/blob/main/docs/design-docs/dependency-management.md
+
+--------------------------------------------------------------------------------
+```
+
+This check **only runs in containers** (when `NRL_CONTAINER=1` is set) and can be bypassed if absolutely needed:
+
+```bash
+export NRL_IGNORE_VERSION_MISMATCH=1
+```
+
+> [!WARNING]
+> Bypassing version checks can result in subtle, hard-to-debug errors due to dependency mismatches. Only do this if you fully understand the risks and have a specific need.
+
+> [!CAUTION]
+> **If you modify a frozen environment manually** (for example, by running `python-MegatronPolicyWorker -m pip install <package>`), this change will *not* be detected or tracked by the container version check described above. This is strongly discouraged, as it leads to a non-reproducible setup, increases the chance of hard-to-debug environment errors, and breaks the guarantee of consistency across developer machines and production deployments.
+>
+> Always make dependency changes in `pyproject.toml` and use the recommended workflows so that your environment stays consistent and traceable.
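The comparison itself is straightforward to reproduce. The following is a minimal sketch of how such a check can work (illustrative only; the function names and the build-time manifest format here are assumptions, not NeMo RL's actual implementation):

```python
import hashlib

def md5sum(path: str) -> str:
    """Compute the md5 hex digest of a file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(build_time: dict[str, str],
                    current_files: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Compare hashes recorded at container build time against the current checkout."""
    mismatches = {}
    for name, path in current_files.items():
        now = md5sum(path)
        if build_time.get(name) != now:
            mismatches[name] = (build_time.get(name, "<missing>"), now)
    return mismatches

# Usage sketch (hash values are placeholders captured at image build time):
# diffs = find_mismatches({"pyproject.toml": "abc123..."},
#                         {"pyproject.toml": "pyproject.toml"})
# if diffs:
#     print("WARNING: Container/Code Version Mismatch Detected!")
```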
+ +## Summary + +NeMo RL's dependency management balances flexibility and performance: + +- **Production workflows** leverage pre-cached environments for fast, reliable startup +- **Development workflows** can dynamically rebuild environments as needed (this works on multi-node setups as well) +- **Submodule-only changes** can use the classic mount workflow to swap submodules without rebuilding environments +- **Container rebuilds** provide the best performance for large-scale production runs +- **`NRL_FORCE_REBUILD_VENVS`** offers flexibility for development without container rebuilds +- **Frozen environments** provide an alternative to `uv run` for users who prefer traditional Python virtual environment workflows with direct access to specialized Python executables + +Choose the approach that best fits your scale and development velocity: +- For most users, the **production workflow** with pre-built containers provides the optimal experience +- When iterating on submodule code, the **classic mount workflow** offers a fast middle ground +- For significant dependency changes, use **`NRL_FORCE_REBUILD_VENVS`** for small runs or **rebuild containers** for large-scale deployments +- For manual dependency management, **frozen environments** are available, though `uv run` is recommended for reproducibility diff --git a/fern/v0.5.0/pages/design-docs/design-and-philosophy.mdx b/fern/v0.5.0/pages/design-docs/design-and-philosophy.mdx new file mode 100644 index 0000000000..fcc19ad50f --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/design-and-philosophy.mdx @@ -0,0 +1,137 @@ +--- +title: Design and Philosophy +description: "" +--- + +This section introduces the NeMo RL APIs, configuration patterns with TypedDicts, and addresses the challenges of online Reinforcement Learning (RL). Coordinating various software components, known as RL Actors, requires effective resource allocation, isolation, coordination, and communication. 
Our design philosophy focuses on creating modular abstractions for these tasks, ensuring scalability from one GPU to thousands, regardless of the RL Actor's implementation.
+
+## Motivation
+
+Online RL demands the coordination of a wide range of software components and models, for example:
+- Policy Model/Training Framework
+- Fast Inference Framework (vLLM, SGLang, TRT-LLM)
+- Reward Environments, Critics, etc.
+
+We refer to each of these pieces of software as an **RL Actor**.
+
+Fundamentally, managing these RL Actors requires four key capabilities:
+- Resource them (provide GPUs/CPUs).
+- Isolate them: RL Actors need isolated process environments with configurable dependencies to avoid global variable or dependency conflicts.
+- Coordinate them (control).
+- Communicate between them (data).
+
+## Design
+
+We create composable and hackable abstractions for each layer of the tasks above:
+- Resourcing: `RayVirtualCluster`
+- Isolation: `RayWorkerGroup`
+- Coordination: A single-process controller using Ray
+- Communication: Data flows through one of the following:
+    - the single controller
+    - a communication scheme set up by the controller, such as:
+        - NCCL Collectives
+        - Multiprocess Queues
+
+By creating a common interface for these four tasks, the RL algorithm code can scale seamlessly from 1 to 1000 GPUs and remain independent of the specific RL Actor (such as Megatron, Hugging Face, or abstract components like a grad student with pen and paper).
+
+![actor-wg-worker-vc](/assets/actor-wg-worker-vc.png)
+
+### `RayVirtualCluster`
+`RayVirtualCluster` provides a basic abstraction on top of Ray Placement Groups that allows you to section off part of your compute resources for WorkerGroups to run on as though they had their own cluster. Virtual clusters support running just one WorkerGroup each, or *colocation*, where multiple WorkerGroups share resources (i.e., running policy training (HF) and generation (vLLM) on the same GPUs in turn).
+

```python
class RayVirtualCluster:
    """
    Creates a virtual distributed cluster using Ray placement groups.

    This class simplifies distributed training setup by:
    - Creating placement groups that represent logical compute nodes.
    - Allocating GPU and CPU resources for distributed workers.
    - Managing communication between distributed processes.

    - Bundle: A resource allocation unit (ex: 4 GPUs on a single node).
    - Worker: A process that performs computation (model training/inference).
    - Node: A physical or virtual machine containing multiple bundles.
    """
    def __init__(self, bundle_ct_per_node_list: List[int], {other args}):
        """
        Initialize a virtual cluster using Ray placement groups.

        Args:
            bundle_ct_per_node_list: List specifying GPU bundles per node
                (e.g., [2,2] creates 2 nodes with 2 GPU bundles each)
        """
    def get_placement_groups(self):
        """
        Returns a list of placement groups that have at least one bundle, filtering out empty nodes.
        This represents the "virtual cluster" - only nodes that are actually being used.

        Returns:
            List of placement groups that have at least one bundle.
        """
```

### `RayWorkerGroup`
All work is done by "Worker Processes" (Ray Actors) that run on a small unit of resources (usually 1 CPU or 1 CPU+GPU). These workers are managed by the *RayWorkerGroup*.
```python
class RayWorkerGroup:
    """
    Manages a group of distributed Ray worker/actor processes that execute tasks in parallel.

    This class creates and manages Ray actor instances that run on resources
    allocated by a RayVirtualCluster. It handles:
    - Worker creation and placement on specific GPU resources.
    - Setting up distributed training environment variables (rank, world size, etc.).
    - Executing methods across all workers in parallel.
    - Collecting and aggregating results.
    - Support for tied worker groups where multiple workers process the same data.
+
    """
```
`RayWorkerGroup` provides functions like `run_all_workers_single_data` and `run_all_workers_multiple_data` to control and communicate with individual worker processes.

### Single-Controller and Execution Diagram

We control the RL Actors using a single-process head controller. The abstractions above let us write the main loop of Group Relative Policy Optimization (GRPO) as though we were working on a single GPU.

```python
# data processing/transformations between each step omitted
def grpo_train(
    policy: PolicyInterface,
    policy_generation: GenerationInterface,
    environment: EnvironmentInterface,
    dataloader: Iterable[BatchedDataDict[DatumSpec]],
):
    loss_fn = GRPOLossFn()
    for batch in dataloader:
        batch = batch.repeat_interleave(num_generations_per_prompt)  # repeat for GRPO
        generations = policy_generation.generate(batch)
        rewards = environment.step(generations)

        logprobs = policy.get_logprobs(generations)
        reference_logprobs = policy.get_reference_logprobs(generations)

        training_data = calculate_grpo_training_data(generations, logprobs, reference_logprobs, rewards)
        policy.train(training_data, loss_fn)
```
For a complete implementation of GRPO, including validation, checkpointing, memory movement, and the data processing steps not detailed here, see [grpo_train](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/algorithms/grpo.py).

### TypedDict and Configuration Defaults

In NeMo RL, we use YAML files for configuration and load them with `omegaconf` into a recursive `dict`. Within the codebase, the root `dict` and sub-`dict`s are typed with `TypedDict` subclasses to provide type hints when accessing attributes. This allows our type checker to flag accesses to attributes that are not declared in the `TypedDict` subclass, as well as accesses with an incompatible type.
+ +We chose this design because it's simple and gives users the flexibility to use older configuration files without encountering errors during config loading due to unexpected attributes, whether obsolete or user defined. While we considered using dataclasses or other structured configuration formats, those approaches introduce more boilerplate and would require config versioning to support loading across different versions of NeMo RL. + +We follow a few design principles regarding configuration: + +1. We forbid defaults in the code, except in limited cases (e.g., alpha features). Defaults should be defined in YAML configuration files. Setting defaults in code makes it difficult to trace where values originate during debugging. + * Forbidden examples include: + * `grpo_config.get("num_prompts_per_step", 32)` + * `policy_config.get("model_name", "meta-llama/Llama-3.1-8B-Instruct")` + * Acceptable examples: + * If an attribute is typed `typing.NotRequired[...]`, it is okay for the code to check for absence/`None`, e.g., `assert "milestones" in scheduler_cfg` or `if "milestones" in scheduler_cfg` +1. All configs under [examples/configs/*.yaml](https://github.com/NVIDIA-NeMo/RL/tree/main/examples/configs) are exemplars and should contain the defaults for `typing.Required` or `typing.NotRequired` attributes, along with accompanying documentation. + * All configs under [examples/configs/recipes/**/*.yaml](https://github.com/NVIDIA-NeMo/RL/tree/main/examples/configs/recipes) do not require documentation and are snapshots of functional configurations. +1. All configs under [examples/configs/**/*.yaml](https://github.com/NVIDIA-NeMo/RL/tree/main/examples/configs) should adhere to their `TypedDict` subclass configuration. Unit tests in [tests/unit/test_config_validation.py](https://github.com/NVIDIA-NeMo/RL/blob/main/tests/unit/test_config_validation.py) are run to validate compliance. 
diff --git a/fern/v0.5.0/pages/design-docs/env-vars.mdx b/fern/v0.5.0/pages/design-docs/env-vars.mdx new file mode 100644 index 0000000000..f3c1c76ed1 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/env-vars.mdx @@ -0,0 +1,34 @@ +--- +title: Environment Variable Precedence in NeMo RL +description: "" +--- + +There are a number of ways to pass environment variables to Ray workers in NeMo RL. This document explains each of the methods and why they are useful. + +## Precedence Order + +### 1. Ray Runtime Environment Variables (lowest) +- Set via `ray.remote(runtime_env={'env_vars': {...}})` decorators. +- Applied to all instances of specific worker classes. These define the default environment variables for the class if not overwritten by a method of higher precedence. +- Example: `@ray.remote(runtime_env=get_runtime_env_for_policy_worker("megatron_policy_worker"))`. See [here](https://github.com/NVIDIA-NeMo/RL/blob/def76820d7838c63c1ee4900e63f73a93d927ff2/nemo_rl/models/policy/megatron_policy_worker.py#L338) where `get_runtime_env_for_policy_worker` will be applied to all instances of `MegatronPolicyWorker`. + +### 2. System-level Environment Variables (medium) +- Set via `export` in shell or `os.environ` in Python. +- Useful for controlling environment variables from a high level. If not overwritten by higher priority methods, all workers will inherit these environment variables. +- Example: `export HF_TOKEN=` + +### 3. YAML Configuration `env_vars` (high) +- Set in YAML config files under `policy.megatron_cfg.env_vars` or `policy.dtensor_cfg.env_vars`. +- Useful for controlling environment variables on an experiment level. +- Example: + ```yaml + policy: + megatron_cfg: + env_vars: + PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:False" + ``` + +### 4. Worker-specific `configure_worker` Method (highest) +- Set via static `configure_worker` method in worker classes. +- Applied to specific worker instances based on configuration. 
+- See an example in `VllmGenerationWorker` [here](https://github.com/NVIDIA-NeMo/RL/blob/def76820d7838c63c1ee4900e63f73a93d927ff2/nemo_rl/models/generation/vllm.py#L88). diff --git a/fern/v0.5.0/pages/design-docs/fsdp2-parallel-plan.mdx b/fern/v0.5.0/pages/design-docs/fsdp2-parallel-plan.mdx new file mode 100644 index 0000000000..05ceca0783 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/fsdp2-parallel-plan.mdx @@ -0,0 +1,33 @@ +--- +title: FSDP2 Parallel Plan +description: "" +--- + +This guide outlines the parallelization strategy for Fully Sharded Data Parallel version 2 (FSDP2) training in NeMo RL.

## Fallback Priority

NeMo RL supports three parallelization strategies, applied in the following order of fallback priority:

### 1. Custom Parallel Plan

User-defined custom parallel plans always take precedence when provided. For detailed implementation and usage, refer to the [Custom Parallel Plan Example](#custom-parallel-plan-example).

### 2. Optimized Parallel Plan

Optimized parallel plans are available for specific model architectures. They may offer superior performance compared to Hugging Face's tensor parallel implementation. This approach is used if no custom parallel plan is specified and the model class supports optimized parallelization.

### 3. Hugging Face Tensor Parallel Plan

The Hugging Face tensor parallel plan is the default. It's available for most models via `._tp_plan` and is used when neither a custom nor an optimized parallel plan is available.

## Custom Parallel Plan Example

A custom parallel plan should be defined in a separate file, such as the example provided in `examples/custom_parallel/custom_parallel.py`.

To use a custom parallel plan, either set the value of `custom_parallel_plan` in the YAML file directly, or pass the override on the command line.
For example: + +```bash +uv run examples/run_grpo.py \ + policy.dtensor_cfg.custom_parallel_plan=examples.custom_parallel.custom_parallel.custom_parallel_plan +``` diff --git a/fern/v0.5.0/pages/design-docs/generation.mdx b/fern/v0.5.0/pages/design-docs/generation.mdx new file mode 100644 index 0000000000..e437cc6a31 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/generation.mdx @@ -0,0 +1,231 @@ +--- +title: Generation Interface +description: "" +--- + +This document explains the token generation interface and various backends for the NeMo RL framework. The generation system is designed with a unified interface that allows different backends (like VLLM, Megatron, Hugging Face, SGLang, and TRT-LLM) to provide token generation capabilities while adhering to the same API. + +## Generation Interface + +The core of the generation system is defined in `interfaces.py`, which establishes an abstract interface that all generation backends must implement. This ensures consistency across different implementations and makes it easy to swap backends without changing the calling code. + +### Key Components + +1. **GenerationConfig**: A TypedDict that defines the configuration for generation: + ```python + class GenerationConfig(TypedDict): + """Configuration for generation.""" + backend: str # The backend to use (e.g., "vllm", "megatron", "hf") + max_new_tokens: int # Maximum number of tokens to generate + temperature: float # Sampling temperature + top_p: float # Top-p sampling parameter + top_k: int | None # Top-k sampling parameter + model_name: str # Name or path of the model + ``` + +2. **GenerationDatumSpec**: A TypedDict that defines the input data format: + ```python + class GenerationDatumSpec(TypedDict): + input_ids: torch.Tensor # Input token IDs + attention_mask: torch.Tensor # Attention mask + __extra__: Any # Additional data specific to the backend + ``` + +3. 
**GenerationOutputSpec**: A TypedDict that defines output data format: + ```python + class GenerationOutputSpec(TypedDict): + output_ids: torch.Tensor + generation_lengths: torch.Tensor # Length of just the generated response part + unpadded_sequence_lengths: torch.Tensor # Length of full valid sequence (input + generated response) + logprobs: torch.Tensor + __extra__: Any # Additional output data specific to the backend + ``` + +4. **GenerationInterface**: An abstract base class that all generation backends must implement: + ```python + class GenerationInterface(ABC): + """Abstract base class defining the interface for RL policies.""" + + @abstractmethod + def generate( + self, data: BatchedDataDict["GenerationDatumSpec"], greedy: bool + ) -> BatchedDataDict["GenerationOutputSpec"]: + pass + + @abstractmethod + def prepare_for_generation(self, *args, **kwargs): + pass + + @abstractmethod + def finish_generation(self, *args, **kwargs): + pass + ``` + +A key design principle for generation backends is that they process tokens directly, without involving the tokenizer. By ensuring that only tokens are exchanged, we eliminate the risk of inconsistencies arising from different tokenizer versions or specifications between the training and generation frameworks. + +## Generation Backends + +NeMo RL supports multiple generation backends that implement the `GenerationInterface` to provide efficient text generation for different use cases. + +## VLLM Backend + +The VLLM backend (`models/generation/vllm/vllm_generation.py`) implements the `GenerationInterface` to provide efficient text generation using the VLLM library, which is optimized for large language models. + +### VllmGeneration Class + +The `VllmGeneration` class is the main implementation of the `GenerationInterface` for VLLM. It performs the following functions: + +1. Sets up VLLM workers in a distributed environment using Ray. +2. Manages the lifecycle of these workers (initialization, generation, shutdown). +3. 
Distributes inputs to workers and collects outputs.
4. Handles weight updates and synchronization.

### VllmGenerationWorker

The `VllmGenerationWorker` is a Ray actor that:

1. Initializes and manages a VLLM model instance.
2. Performs the actual generation on a GPU.
3. Supports dynamic weight updates through IPC handles.
4. Implements sleep/wake mechanisms for efficient resource utilization.

### Custom VLLM Extensions

The `UpdatableVllmInternalWorker` class in `vllm_backend.py` extends the VLLM worker with additional capabilities:

1. Reporting device IDs to allow mapping of workers to specific GPUs.
2. Updating weights from IPC handles for efficient weight sharing.
3. Checking if weights have been updated correctly.

## Megatron Backend

The Megatron backend provides native Megatron-Core inference capabilities, eliminating the need for weight conversion between training and generation. This backend is particularly beneficial when using Megatron for training, as it enables seamless integration and optimal performance.

### Key Features

1. **No Weight Conversion**: Uses the same Megatron model format for both training and generation, eliminating conversion overhead and potential inconsistencies.
2. **CUDA Graph Support**: Leverages CUDA graphs for optimized inference performance.
3. **Dynamic Inference Engine**: Utilizes Megatron Core's `DynamicInferenceEngine` for efficient batched generation.
4. **Integrated with Training**: The generation capability is built directly into the `MegatronPolicyWorker`, enabling efficient co-located training and generation.

### MegatronPolicyWorker Generation

The Megatron generation backend is implemented within the `MegatronPolicyWorker` class. The `generate` method performs the following:

1. Wraps the Megatron model with `GPTInferenceWrapper` for inference optimization.
2. Creates a `DynamicInferenceContext` to manage inference state and memory.
3.
Initializes a `DynamicInferenceEngine` with CUDA graph support enabled.
4. Processes batched requests with proper sampling parameters (temperature, top_k, top_p).
5. Returns outputs conforming to `GenerationOutputSpec`.

### Configuration

To use the Megatron generation backend, configure your YAML file as follows:

```yaml
policy:
  megatron_cfg:
    enabled: true
  generation:
    backend: megatron
    max_new_tokens: 512
    temperature: 1.0
    top_p: 1.0
    top_k: null
    mcore_generation_config:
      buffer_size_gb: 20  # Memory buffer size for inference context
      buffer_guaranteed_fraction: 0.1  # Fraction of buffer guaranteed to be available for active requests
      num_cuda_graphs: 16  # Number of CUDA graphs to pre-allocate
      max_tokens: 16384  # Maximum number of tokens for inference
```

### Configuration Parameters

The `mcore_generation_config` section controls Megatron Core inference engine behavior:

- **buffer_size_gb**: Total memory buffer size (in GB) allocated for the dynamic inference context. This determines how much GPU memory is reserved for KV caches and intermediate states. A larger buffer allows more requests to be pulled in at once.
- **buffer_guaranteed_fraction**: Fraction of the buffer that is guaranteed to be available (between 0.0 and 1.0). This ensures that some memory is always available for active requests to complete.
- **num_cuda_graphs**: Number of CUDA graphs to pre-allocate for different batch sizes. More graphs can improve performance by avoiding runtime graph capture, but consume more memory.
- **max_tokens**: Maximum total number of tokens (across all requests) that can be processed simultaneously. This limits the maximum batch size and sequence length combinations. Increasing this may cause out-of-memory errors, depending on the vocabulary size and the allocated buffer size.
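
As a rough illustration of how `buffer_size_gb` and `buffer_guaranteed_fraction` interact, the following is a hypothetical helper (not a NeMo RL API) that computes the guaranteed portion of the buffer:

```python
def guaranteed_buffer_gb(buffer_size_gb: float, buffer_guaranteed_fraction: float) -> float:
    """Hypothetical helper: GB of the inference buffer guaranteed for active requests."""
    if not 0.0 <= buffer_guaranteed_fraction <= 1.0:
        raise ValueError("buffer_guaranteed_fraction must be between 0.0 and 1.0")
    return buffer_size_gb * buffer_guaranteed_fraction

# With the example config above (buffer_size_gb: 20, buffer_guaranteed_fraction: 0.1),
# 2 GB of the 20 GB buffer is reserved so active requests can always complete.
print(guaranteed_buffer_gb(20, 0.1))
```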
+ +## Usage Examples + +### Using VLLM Backend + +To use the VLLM generation backend: + +```python +from nemo_rl.algorithms.utils import get_tokenizer +from nemo_rl.distributed.virtual_cluster import RayVirtualCluster +from nemo_rl.distributed.batched_data_dict import BatchedDataDict +from nemo_rl.models.generation.interfaces import configure_generation_config +from nemo_rl.models.generation.vllm import VllmGeneration, VllmConfig + +# Set up the configuration +config = VllmConfig( + model_name="Qwen/Qwen2.5-1.5B", + max_new_tokens=100, + temperature=0.7, + top_p=1, + top_k=None, + backend="vllm", + vllm_cfg={ + "tensor_parallel_size": 1, + "gpu_memory_utilization": 0.8, + "max_model_len": 2048, + } +) + +# Configure config with tokenizer +tokenizer = get_tokenizer(config["model_name"]) +config = configure_generation_config(config, tokenizer) + +# Initialize the cluster and generation backend +cluster = RayVirtualCluster(...) +generator = VllmGeneration(cluster, config) + +# Prepare input data +input_data = BatchedDataDict(...) + +# Generate text +generator.prepare_for_generation() +output = generator.generate(input_data, greedy=False) +generator.finish_generation() +``` + +### Using Megatron Backend + +To use the Megatron generation backend, configure your YAML file: + +```yaml +policy: + model_name: meta-llama/Llama-3.2-1B-Instruct + megatron_cfg: + enabled: true + generation: + backend: megatron + max_new_tokens: 512 + temperature: 1.0 + top_p: 1.0 + top_k: null + mcore_generation_config: + buffer_size_gb: 20 + buffer_guaranteed_fraction: 0.1 + num_cuda_graphs: 16 + max_tokens: 16384 +``` + +For a complete example, see: +- **Configuration**: `examples/configs/recipes/llm/grpo-llama3.2-1b-instruct-1n8g-megatron_generation.yaml` +- **Test Script**: `tests/functional/grpo_megatron_generation.sh` + +## Extend with New Backends + +To add a new generation backend: + +1. Create a new class that implements `GenerationInterface`. +2. 
Implement the required methods: `generate`, `prepare_for_generation`, and `finish_generation`.
3. Ensure your implementation works with the standard `GenerationConfig` and `GenerationDatumSpec` structures.
4. Register your backend with the system (if needed) to make it accessible.

This modular design allows for easy extension with new backends while maintaining a consistent interface for the rest of the system. diff --git a/fern/v0.5.0/pages/design-docs/logger.mdx b/fern/v0.5.0/pages/design-docs/logger.mdx new file mode 100644 index 0000000000..53e3487ab6 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/logger.mdx @@ -0,0 +1,184 @@ +--- +title: Logger +description: "" +--- + +The logger is designed to track key training metrics (including distributed metrics with reductions and timing) and to integrate with logging backends like WandB, Tensorboard, MLflow and Swanlab.

## Requirements

* Tracking distributed metrics with specified reductions (mean, max, etc.)
* Tracking distributed timing with (usually) 'max' reduction across ranks
* Logging:
  * WandB
  * Tensorboard
  * MLflow
  * Swanlab

## Overall Design

Since there is a single controller, the single process running the main training loop will gather the metrics and do the logging.

To handle multiple logger backends, we will have a `LoggerInterface` abstract base class that the `TensorboardLogger`, `WandbLogger`, `MLflowLogger` and `SwanlabLogger` will implement:

```python
class LoggerInterface(ABC):
    """Abstract base class for logger backends."""

    @abstractmethod
    def log_metrics(self, metrics: dict[str, Any], step: int, prefix: Optional[str] = "") -> None:
        """Log a dictionary of metrics."""
        pass

    @abstractmethod
    def log_hyperparams(self, params: dict[str, Any]) -> None:
        """Log a dictionary of hyperparameters."""
        pass
```

A `Logger` wrapper class will also implement `LoggerInterface` and maintain a list of loggers to which it delegates writing logs.
This will be the main class the user uses in the training loop. Usage example:

```python
# Initialize logger with WandB, Swanlab, and MLflow enabled
logging_config = {
    "wandb_enabled": True,
    "tensorboard_enabled": False,
    "mlflow_enabled": True,
    "swanlab_enabled": True,

    "wandb": {
        "project": "grpo-dev",
        "name": "grpo-dev-logging",
    },
    "swanlab": {
        "project": "nemo-rl",
        "name": "grpo-dev-logging",
    },
    "tensorboard": {
        "log_dir": "logs",
    },
    "mlflow": {
        "experiment_name": "nemo-rl-experiment",
        "run_name": "grpo-dev-run",
        "tracking_uri": None,  # Use local tracking
    },
}
logger = Logger(
    cfg=logging_config,
)

# Log metrics, will go to all enabled backends
logger.log_metrics({
    "loss": 0.123,
}, step=10)
```

## Supported Logging Backends

The logger supports four logging backends:

### WandB (Weights & Biases)
- Provides cloud-based experiment tracking
- Supports custom step metrics for better visualization
- Includes built-in hyperparameter logging
- Offers rich visualization and collaboration features

### Swanlab
- Training visualization (Android, iOS, Wechat public account and Web)
- Automatic logging
- Hyperparameter recording
- Experiment comparison
- Multi-user collaboration

### Tensorboard
- Local file-based logging
- Standard TensorBoard visualization
- Supports hyperparameter logging via HParams
- Lightweight and self-contained

### MLflow
- Comprehensive platform for experiment tracking and model management
- Supports both local and remote tracking servers
- Provides model versioning and artifact management
- Includes a web UI for experiment visualization
- Supports model deployment and serving

#### MLflow Configuration

MLflow can be configured with the following parameters:

```yaml
mlflow:
  experiment_name: "nemo-rl-experiment"  # Name of the MLflow experiment
  run_name: "my-training-run"  # Run name
  tracking_uri: "http://localhost:5000"  # Optional tracking server URI
+``` + +#### MLflow UI + +After starting training with MLflow enabled, you can view the MLflow UI to monitor your experiments: + +```bash +# Start MLflow UI (run in a separate terminal) +mlflow ui --host 0.0.0.0 --port 5000 +``` + +Then access the UI at `http://127.0.0.1:5000/` to view: +- Training runs and experiments +- Metrics (loss, validation metrics, etc.) +- Hyperparameters +- Model artifacts and checkpoints + +## Validation Pretty Logging + +The logger supports pretty-formatted logging of validation samples to help visualize model outputs during training. This feature is controlled by the `num_val_samples_to_print` configuration parameter. + +```python +logger: + wandb_enabled: false + swanlab_enabled: false + tensorboard_enabled: false + mlflow_enabled: false + num_val_samples_to_print: 10 +``` + +When `num_val_samples_to_print` is set to a value greater than 0, the logger will generate well-formatted text outputs for the specified number of validation samples. This is particularly useful for: + +1. Quickly inspecting model generation quality during training. +2. Comparing inputs and outputs side-by-side. +3. Tracking validation sample performance over time. + +### Example Output + +When enabled, the pretty logging will generate formatted text similar to: + +![Validation Pretty Logging Example](/assets/val-log.png) + +## GPU Metric Logging + +NeMo RL monitors GPU memory and utilization through [system metrics](https://docs.ray.io/en/latest/ray-observability/reference/system-metrics.html#system-metrics) exposed by Ray nodes. While Ray makes these metrics available for tools like Prometheus, NeMo RL directly polls GPU memory and utilization data and logs them to TensorBoard, WandB, MLflow and/or SwanLab. + +This approach allows us to offer the same GPU metric tracking on all loggers and simplifies the implementation greatly. + +This feature is enabled with the `monitor_gpus` configuration parameter. 
The frequency of data collection and flushing to the loggers is controlled by the `gpu_monitoring.collection_interval` and `gpu_monitoring.flush_interval` parameters, both specified in seconds.

```yaml
logger:
  wandb_enabled: false
  swanlab_enabled: false
  tensorboard_enabled: false
  mlflow_enabled: false
  monitor_gpus: true
  gpu_monitoring:
    collection_interval: 10
    flush_interval: 10
```

> [!NOTE]
> While it is feasible to monitor using remote workers, the implementation requires careful attention to detail to ensure:
> * Logs sent back to the driver do not introduce significant overhead.
> * Metrics remain clear and interpretable, avoiding issues like double counting caused by colocated workers.
> * Workers can gracefully flush their logs in case of failure.
> * Logging behaves consistently across TensorBoard, WandB, MLflow and Swanlab.
> * Workers that spawn other workers accurately report the total resource usage of any grandchild workers.
>
> Due to these complexities, we opted for a simpler approach: collecting metrics exposed by the Ray metrics server from the driver. diff --git a/fern/v0.5.0/pages/design-docs/loss-functions.mdx b/fern/v0.5.0/pages/design-docs/loss-functions.mdx new file mode 100644 index 0000000000..4fda304d99 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/loss-functions.mdx @@ -0,0 +1,101 @@ +--- +title: Loss functions in NeMo RL +description: "" +--- + +Loss functions in NeMo RL are specially designed to ensure that full-batch training is equivalent to training with gradient accumulation. To understand why special care needs to be taken here, consider the following example of a simple loss function that takes the average of some per-token loss over all tokens in a microbatch, and then averages loss over the microbatches.

Suppose we have a global batch with 16 unmasked tokens. The first 10 unmasked tokens come from the first half of the samples in the batch, and the last 6 come from the second half.
If training with one global batch,

$$
L = \frac{\sum_{t=1}^{16} L_t}{16}.
$$

But if we train with two microbatches,

$$
L = \frac{\frac{\sum_{t=1}^{10} L_t}{10} + \frac{\sum_{t=11}^{16} L_t}{6}}{2},
$$

which is, in general, not equivalent to the full-batch loss. To fix this, we need each microbatch to have information about how many tokens are in the other microbatches in the global batch.

In NeMo RL, this information is passed to the loss function directly. Each loss function is expected to fall into one of two categories, token-level or sequence-level, which is an attribute of the loss function itself (see [loss_functions.py](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/algorithms/loss_functions.py) for some examples). The policy then uses this information to compute the global normalization factor using the full batch (for token-level losses, this is the total number of tokens in the batch; for sequence-level losses, it is the number of valid sequences in the batch). The normalization factor is then passed to the loss function, which uses it to normalize the microbatch loss. To get the loss for the global batch, the policy simply sums across all microbatch losses.

For our simple example above, this would look like:

```python
import torch
from nemo_rl.algorithms.interfaces import LossFunction
from nemo_rl.algorithms.loss_functions import LossType
from nemo_rl.distributed.batched_data_dict import BatchedDataDict

class SimpleAverageLoss(LossFunction):
    """Simple average loss function that demonstrates proper microbatch handling.

    NOTE: We assume for simplicity that the losses per token are passed directly into this loss function.
    This is not the case in practice!
+
    """

    loss_type = LossType.TOKEN_LEVEL

    def __call__(
        self,
        next_token_losses: torch.Tensor,
        data: BatchedDataDict,
        total_valid_tokens_or_seqs: torch.Tensor,
    ) -> torch.Tensor:
        """Compute the simple average loss with proper microbatch handling."""
        token_mask = data["token_mask"]  ## token mask for this microbatch
        sample_mask = data["sample_mask"]  ## sample mask for this microbatch

        # mask.sum() will be 10 for microbatch 1, 6 for microbatch 2
        mask = token_mask * sample_mask.unsqueeze(-1)

        # total_valid_tokens_or_seqs will be 16 in our example since there are 16 tokens in the global batch
        # since we specified that this is a token-level loss, the policy
        # will give us the right normalization factor automatically.
        loss = (next_token_losses * mask).sum() / (total_valid_tokens_or_seqs + 1e-8)
        return loss

## test out the loss function

## in this example, we have a batch of size 2 with a sequence length of 16
batch_size = 2
seq_len = 16
next_token_losses = torch.randn((batch_size, seq_len))
sample_data = {
    "token_mask": torch.tensor(
        [
            [1] * 10 + [0] * 6,
            [1] * 6 + [0] * 10,
        ]
    ),
    "sample_mask": torch.ones(2)
}
total_valid_tokens_or_seqs = torch.sum(sample_data["token_mask"] * sample_data["sample_mask"].unsqueeze(-1))

loss_fn = SimpleAverageLoss()
loss_no_microbatching = loss_fn(next_token_losses, sample_data, total_valid_tokens_or_seqs)

microbatch_1_data = {
    "token_mask": sample_data["token_mask"][:1],
    "sample_mask": sample_data["sample_mask"][:1],
}
microbatch_2_data = {
    "token_mask": sample_data["token_mask"][1:],
    "sample_mask": sample_data["sample_mask"][1:],
}
loss_with_microbatching = (
    loss_fn(next_token_losses[:1], microbatch_1_data, total_valid_tokens_or_seqs)
    + loss_fn(next_token_losses[1:], microbatch_2_data, total_valid_tokens_or_seqs)
)

torch.testing.assert_close(loss_no_microbatching, loss_with_microbatching)
```

{/* This testoutput is
intentionally empty */}
 diff --git a/fern/v0.5.0/pages/design-docs/nemo-gym-integration.mdx b/fern/v0.5.0/pages/design-docs/nemo-gym-integration.mdx new file mode 100644 index 0000000000..22da86cadc --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/nemo-gym-integration.mdx @@ -0,0 +1,260 @@ +--- +title: NeMo Gym Integration +description: "" +--- + +This document describes how NeMo RL integrates with [NeMo Gym](https://docs.nvidia.com/nemo/gym/latest/index.html) for multi-step and multi-turn reinforcement learning training.

## Overview

NeMo Gym provides HTTP-based training environments for LLMs. **NeMo Gym is CPU-only**: it runs no inference engines and holds no GPU memory. NeMo RL exposes its vLLM generation engine as an OpenAI-compatible HTTP server, which NeMo Gym calls during rollouts, enabling:

- **Decoupled architecture**: Environments don't need direct access to model internals
- **Multi-step/multi-turn support**: Agents can orchestrate complex interactions with tools
- **Refit compatibility**: NeMo RL's weight synchronization works transparently

## Configuration

To enable NeMo Gym integration, add the following to your NeMo RL config:

```yaml
policy:
  generation:
    backend: vllm
    vllm_cfg:
      async_engine: true        # Required: enables the async vLLM worker
      expose_http_server: true  # Required: starts the OpenAI-compatible HTTP server

env:
  should_use_nemo_gym: true  # Enables NeMo Gym integration
  nemo_gym:
    # NeMo Gym config paths and settings
    config_paths:
      - resources_servers/math/configs/math.yaml
      - responses_api_agents/simple_agent/configs/simple_agent.yaml
```

For a complete example, see `examples/nemo_gym/` and its associated configs.

### Version Requirements

NeMo Gym runs as a Ray actor within NeMo RL's Ray cluster, so the same Ray and Python versions must be used in both environments.
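
One quick way to compare the two environments is a small, hypothetical snippet (not part of either project) that prints the Python and Ray versions; run it in each environment and confirm the outputs match:

```python
import importlib.metadata
import sys

def env_fingerprint() -> str:
    """Hypothetical helper: report the Python and Ray versions of the current environment."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    try:
        ray_version = importlib.metadata.version("ray")
    except importlib.metadata.PackageNotFoundError:
        ray_version = "not installed"
    return f"python={python_version} ray={ray_version}"

# Run in both the NeMo RL and NeMo Gym environments and compare the output.
print(env_fingerprint())
```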
+ +## Architecture Overview + +```mermaid +%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%% +flowchart LR + subgraph RL["NeMo RL"] + GRPO["GRPO Loop"] + vLLM["vLLM + HTTP"] + Bridge["NemoGym Actor"] + end + + subgraph Gym["NeMo Gym"] + Agent["Agent"] + Model["Model (Proxy)"] + Resources["Resources"] + end + + GRPO -->|refit| vLLM + GRPO -->|run_rollouts| Bridge + Bridge -->|spawns| Gym + Agent <--> Model + Agent <--> Resources + Model -->|HTTP| vLLM + + style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px + style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px +``` + +**Color coding**: +- Blue = NeMo RL code (`nemo_rl/`) +- Orange = NeMo Gym code (`3rdparty/Gym-workspace/Gym/nemo_gym/`) + +## The NemoGym Actor + +The integration is handled by the `NemoGym` Ray actor at `nemo_rl/environments/nemo_gym.py`: + +1. **Created by NeMo RL** during training setup via `NemoGym.remote(config)` +2. **Joins the existing Ray cluster** that NeMo RL already initialized +3. **Spawns NeMo Gym servers** as OS subprocesses (Head, Agent, Model, Resources) +4. **Injects vLLM base URLs** so NeMo Gym's Model Server knows where to proxy requests +5. **Exposes `run_rollouts()`** as the entry point for the training loop to call + +```mermaid +%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%% +flowchart LR + subgraph RL["NeMo RL"] + GRPO["GRPO Loop"] + Actor["NemoGym Actor"] + end + + subgraph Gym["NeMo Gym"] + RCH["RolloutCollectionHelper"] + Agent["Agent Server"] + end + + GRPO --> Actor + Actor --> Agent + Agent --> RCH + RCH --> Actor + Actor --> GRPO + + style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px + style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px +``` + +The flow is: +1. GRPO Loop calls `run_rollouts.remote(batch)` on the NemoGym Actor +2. Actor sends `POST /run` to the Agent Server +3. Agent Server orchestrates the rollout via RolloutCollectionHelper +4. 
Results return to the Actor +5. Actor returns results to the training loop + +## vLLM HTTP Server + +**NeMo Gym does not run its own vLLM engine.** The Model Server is purely an HTTP proxy: + +| Aspect | NeMo RL vLLM Worker | NeMo Gym Model Server | +|--------|---------------------|----------------------| +| **Engine** | Runs actual vLLM `AsyncLLM` | No engine - HTTP proxy only | +| **GPU** | Holds model weights | No GPU required | +| **Endpoints** | `/v1/chat/completions`, `/tokenize` | `/v1/responses` | +| **Role** | Inference | API translation, forwards requests | + +Data parallel vLLM workers each expose their own HTTP server. NeMo Gym's Model Server load-balances requests across them. + +## Initialization Sequence + +```mermaid +%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%% +sequenceDiagram + autonumber + box rgb(227, 242, 253) NeMo RL + participant RL as Training Script + participant Ray as Ray Cluster + participant vLLM as vLLM Workers + participant Bridge as NemoGym Actor + end + box rgb(255, 243, 224) NeMo Gym + participant Servers as NeMo Gym Servers + end + + RL->>Ray: Initialize Ray cluster + RL->>vLLM: Create vLLM workers with HTTP servers + vLLM-->>RL: Return base URLs (one per DP rank) + RL->>Bridge: NemoGym.remote(config, base_urls) + Note over Bridge: Reuses existing Ray cluster + Bridge->>Servers: Spawn subprocess servers + Servers-->>Bridge: Health check OK + Bridge-->>RL: Ready for rollouts +``` + +## Training Loop Control Flow + +```mermaid +%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%% +sequenceDiagram + autonumber + box rgb(227, 242, 253) NeMo RL + participant GRPO as GRPO Loop + participant Policy as Policy Workers + participant vLLM as vLLM HTTP + participant Bridge as NemoGym Actor + end + box rgb(255, 243, 224) NeMo Gym + participant Agent as Agent Server + participant Model as Model Server + participant Resource as 
Resource Server
+    end
+
+    GRPO->>Policy: Refit (trigger weight sync)
+    Policy->>vLLM: Sync weights to vLLM
+    GRPO->>Bridge: run_rollouts.remote(batch)
+    Bridge->>Agent: POST /run
+    Agent->>Model: POST /v1/responses
+    Model->>vLLM: POST /v1/chat/completions
+    vLLM-->>Model: Response
+    Model-->>Agent: Responses API format
+    Agent->>Resource: Execute tool / compute reward
+    Resource-->>Agent: Tool result / reward
+    Agent-->>Bridge: Results + rewards
+    Bridge-->>GRPO: Token IDs, logprobs, rewards
+    GRPO->>Policy: Compute loss and train
+```
+
+> **NeMo Gym server types** (see [Core Components](https://docs.nvidia.com/nemo/gym/latest/about/concepts/core-components.html)):
+> - **Agent Server**: Orchestrates the rollout loop
+> - **Model Server**: HTTP proxy to vLLM; translates Responses API ↔ Chat Completions
+> - **Resource Server**: Provides tools and rewards
+
+### Key Steps
+
+| Step | Location | Description |
+|------|----------|-------------|
+| **Refit** | NeMo RL | Synchronizes policy weights to vLLM workers. For async RL, refit timing may differ; see the [generation](generation) design doc for details. |
+| **run_rollouts.remote()** | NeMo RL | Ray remote call from GRPO loop to the NemoGym actor |
+| **POST /run** | NeMo RL → NeMo Gym | HTTP request from NemoGym actor to Agent Server subprocess |
+| **Rollout orchestration** | NeMo Gym | Agent calls Model Server and Resource Server via HTTP |
+| **POST /v1/chat/completions** | NeMo Gym → NeMo RL | Model Server proxies to NeMo RL's vLLM HTTP endpoint |
+| **Result processing** | NeMo RL | NemoGym actor extracts token IDs, logprobs, rewards |
+
+### Async Result Processing
+
+The NemoGym actor uses an **as-completed** pattern to overlap waiting with post-processing:
+
+1. **Results return out of order**: Single steps of the rollouts (the "assistant" + "tool" turns) complete at different times depending on conversation length and tool calls. Rather than waiting for all results, the actor processes each result as soon as it completes. 
Note: this is pipelining within NeMo Gym, not asynchronous processing of global batch steps by NeMo RL. + +2. **Immediate post-processing**: As each rollout completes, the actor immediately extracts token IDs and logprobs. This overlaps CPU work with network I/O from slower rollouts still in flight. + +3. **Reordering at the end**: Each example carries an index. After all results are collected, results are reordered to match the original batch order before returning to the training loop. + +This pattern maximizes throughput by keeping the CPU busy while waiting for network responses. + +## Data Format Translation + +```mermaid +%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%% +flowchart LR + subgraph RL1["NeMo RL Input"] + Datum["DatumSpec"] + end + + subgraph Gym["NeMo Gym"] + Example["Example Dict"] + ReqResp["Responses API"] + ReqChat["Chat Completions"] + end + + subgraph RL2["NeMo RL Output"] + Result["Result"] + end + + Datum --> Example + Example --> ReqResp + ReqResp --> ReqChat + ReqChat --> ReqResp + ReqResp --> Example + Example --> Result + + style RL1 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px + style RL2 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px + style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px +``` + +**Formats**: +- **DatumSpec** (NeMo RL): Training-focused format with `prompt`, `prompt_token_ids`, and task metadata +- **Example Dict** (NeMo Gym): Environment-focused format containing `responses_create_params` and `expected` answer +- **Responses API** (NeMo Gym): OpenAI Responses API format with `input`, `tools`, and multi-turn conversation +- **Chat Completions** (vLLM): OpenAI Chat Completions format for the actual inference call + +**Data flow**: DatumSpec is converted to Example Dict, which passes through to the Responses API with generation parameters (`temperature`, `top_p`) added for on-policy sampling. 
The Model Server translates Responses API ↔ Chat Completions (converting message formats, extracting reasoning content, attaching token IDs). Results flow back with token IDs and logprobs extracted into the final Result.
+
+## Tokenization and On-Policy Corrections
+
+Token IDs are extracted at the NeMo RL vLLM layer via the `/tokenize` endpoint. This ensures:
+- Tokenization matches the exact model and tokenizer used for generation
+- No re-tokenization drift between generation and training
+
+For details on on-policy token ID handling, see the [environments guide](../guides/environments) and the [NeMo Gym on-policy corrections documentation](https://docs.nvidia.com/nemo/gym/latest/contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction.html).
diff --git a/fern/v0.5.0/pages/design-docs/padding.mdx b/fern/v0.5.0/pages/design-docs/padding.mdx
new file mode 100644
index 0000000000..37faa1bbaa
--- /dev/null
+++ b/fern/v0.5.0/pages/design-docs/padding.mdx
@@ -0,0 +1,98 @@
+---
+title: Padding in NeMo RL
+description: ""
+---
+
+This document explains padding in NeMo RL and why consistent padding is critical for the framework.
+
+## Padding Approach
+
+NeMo RL uses **right padding** for all tensor operations, where padding tokens are added to the right/end of sequences:
+
+```
+[101, 2054, 2003, 0, 0]        # Length 3
+[101, 2054, 2003, 2001, 1996]  # Length 5 (no padding needed)
+[101, 2054, 0, 0, 0]           # Length 2
+```
+
+This approach:
+1. **Naturally aligns with LLM processing**: Tokens are processed from left to right.
+2. **Keeps meaningful tokens contiguous**: All valid tokens appear at the beginning of tensors.
+3. **Simplifies indexing and operations**: Valid token boundaries are easily defined with a single length value.
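Outside NeMo RL's own utilities, right padding is easy to produce with plain PyTorch. Here is a minimal sketch (using stock PyTorch, not a NeMo RL API) that builds the right-padded batch shown above along with its length tensor and validity mask:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three variable-length token-ID sequences, as in the example above
seqs = [
    torch.tensor([101, 2054, 2003]),              # length 3
    torch.tensor([101, 2054, 2003, 2001, 1996]),  # length 5
    torch.tensor([101, 2054]),                    # length 2
]

# pad_sequence with batch_first=True right-pads every sequence to the longest one
input_ids = pad_sequence(seqs, batch_first=True, padding_value=0)
input_lengths = torch.tensor([len(s) for s in seqs])

# A boolean mask over valid (non-padding) positions, derived from the lengths
mask = torch.arange(input_ids.shape[1]) < input_lengths.unsqueeze(-1)
```

The `mask` derived this way is what downstream components use to exclude padding tokens from loss and metric computations.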
+ +## Right-Padded Generation Example + +Input (right-padded) → Generation → Final (right-padded): +``` +[101, 2054, 2003, 0, 0] # Original input (length 3) + ↓ +[101, 2054, 2003, 2001, 1996, 4568, 7899, 0] # After generation +|-- input --| |----- generation -----| |pad| +``` + +Corresponding logprobs: +``` +[ 0, 0, 0, -1.2, -0.8, -1.5, -2.1, 0] +|-- zeros for input --| |- gen logprobs -| |pad| +``` + +## Verify Right Padding + +NeMo RL provides utilities to verify correct padding. For example: + +```{testcode} +import torch +from nemo_rl.distributed.batched_data_dict import BatchedDataDict +from nemo_rl.models.generation.interfaces import verify_right_padding + +# For input data (BatchedDataDict containing input_ids and input_lengths) +input_data = BatchedDataDict({ + "input_ids": torch.tensor([ + [101, 2054, 2003, 0, 0], # Example input sequence + [101, 2054, 0, 0, 0] # Another input sequence + ]), + "input_lengths": torch.tensor([3, 2]) # Length of each sequence +}) + +# Check if input data is properly right-padded +is_right_padded, error_msg = verify_right_padding(input_data, pad_value=0) + +# For generation output data (BatchedDataDict containing output_ids and generation_lengths) +output_data = BatchedDataDict({ + "output_ids": torch.tensor([ + [101, 2054, 2003, 2001, 1996, 0, 0], # Example output sequence + [101, 2054, 2001, 4568, 0, 0, 0] # Another output sequence + ]), + "generation_lengths": torch.tensor([2, 2]), # Length of generated response + "unpadded_sequence_lengths": torch.tensor([5, 4]) # Total valid tokens +}) + +# Check if output data is properly right-padded +is_right_padded, error_msg = verify_right_padding(output_data, pad_value=0) + +if not is_right_padded: + print(f"Padding error: {error_msg}") +``` + +{/* This testoutput is intentionally empty */} +```{testoutput} +:hide: +``` + +The `verify_right_padding()` function checks that: +1. All padding (zeros or padding token provided by the user) appears after valid tokens. +2. 
The padding starts at the position specified by the length tensor. + +The function automatically detects whether you're passing input or output data: +- For input data: Requires `input_ids` and `input_lengths` fields. +- For output data: Requires `output_ids` and either `generation_lengths` or `unpadded_sequence_lengths`. + +## Best Practices + +1. **Always Use Right Padding**: All components expect this format. + +2. **Track Length Tensors**: Include appropriate length tensors with your data. + +3. **Verify Padding**: Use `verify_right_padding()` when in doubt. + +4. **Mask Padding in Operations**: Use lengths to exclude padding tokens from loss calculations. diff --git a/fern/v0.5.0/pages/design-docs/sequence-packing-and-dynamic-batching.mdx b/fern/v0.5.0/pages/design-docs/sequence-packing-and-dynamic-batching.mdx new file mode 100644 index 0000000000..90995fcc8f --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/sequence-packing-and-dynamic-batching.mdx @@ -0,0 +1,415 @@ +--- +title: Sequence Packing and Dynamic Batching +description: "" +--- + +This document describes the sequence packing and dynamic batching features implemented in NeMo-RL to optimize training efficiency for variable-length sequences. + +## Table of Contents + +1. [Problem](#problem) +2. [Sequence Packing and Dynamic Batching](#sequence-packing-and-dynamic-batching) +3. [Sequence Packing](#sequence-packing) +4. [Dynamic Batching](#dynamic-batching) +5. [Configuration](#configuration) +6. [Integration with Training Pipeline](#integration-with-training-pipeline) +7. [Metrics and Monitoring](#metrics-and-monitoring) +8. 
[Usage](#usage)
+
+## Problem
+
+### Challenge: Variable Sequence Lengths in RL/SFT
+
+RL and SFT exhibit highly variable sequence lengths because sequence lengths in many datasets follow a Zipf-like distribution:
+
+- **Skewed Distribution**: Most sequences are short, with a few very long sequences
+- **Padding Inefficiency**: Traditional fixed-length batching requires padding all sequences to the maximum length, resulting in:
+  - Wasted computation on pad tokens
+  - Underutilized GPU memory
+  - Poor GPU compute efficiency
+- **Memory Constraints**: Batch size is often limited by the longest sequences in the batch
+
+Without optimization, 50-70% of computation can be wasted on padding tokens.
+
+## Sequence Packing and Dynamic Batching
+
+NeMo-RL implements two mutually exclusive approaches to address variable sequence lengths:
+
+1. **Sequence Packing**: Concatenates multiple sequences into a single "packed" sequence, eliminating most padding.
+2. **Dynamic Batching**: Groups sequences of similar lengths and adjusts microbatch sizes based on total token count, reducing the excess padding.
+
+### Important Notes
+
+- Dynamic batching and sequence packing cannot be enabled simultaneously; **they are mutually exclusive**.
+- Compatible with Context Parallelism (CP)
+- Requires FlashAttention-2 for packed sequences
+
+## Sequence Packing
+
+Sequence packing concatenates multiple variable-length sequences into a single sequence, eliminating the need for padding tokens. This approach maximizes GPU utilization by ensuring all computational resources are used for meaningful tokens.
+
+```
+Unpacked: (# == useful token, p == padding token)
+0 0 0 p p p
+1 1 1 1 1 1
+2 2 p p p p
+3 3 3 p p p
+~40% padding
+
+Packed:
+0 0 0 1 1 1 1 1 1 2 2 3 3 3 p # some padding may still be required as discussed later, but it is significantly reduced
+```
+
+### Implementation Details
+
+#### 1. 
Packing Process (`BatchedDataDict.shard_by_batch_size`) +```python +# Located in: nemo_rl/distributed/batched_data_dict.py +def shard_by_batch_size( + self, + shards: int, + sequence_packing_args: Optional[SequencePackingArgs] = None +): + # 1. Get bin packer for specified algorithm + bin_packer = get_packer( + algorithm=sequence_packing_args["algorithm"], + bin_capacity=sequence_packing_args["max_tokens_per_microbatch"] + ) + + # 2. Pack sequences into bins per chunk + for chunk_idx in range(num_chunks): + chunk_bin_assignments = bin_packer.pack( + sequence_lengths=chunk_padded_seqlens_list + ) + + # 3. Create sharded microbatches from packed bins +``` +This method **does not** actually concatenate the sequences and create the packed tensor. Rather, it reorders the elements in the batch and creates metadata such that after you call your workers with `RayWorkerGroup.run_all_workers_sharded_data`, each worker can call `BatchedDataDict.make_microbatch_iterator_for_packable_sequences` locally to return an iterator over batches, where each batch contains elements that should be packed together. For an example of this, you can take a look at the `MegatronPolicyWorker`'s train function. + +We have the policy backends perform the actual packing because implementations can vary widely on how exactly it should be done and what metadata needs to be collected. + +#### 2. Packing Algorithms (`nemo_rl/data/packing/algorithms.py`) + +Four packing algorithms are implemented, but we recommend you just use Modified First Fit Decreasing for the best packing efficiency: + +##### Concatenative Packer +- Sequential concatenation until bin capacity is reached +- O(n) +- Simple, deterministic packing for debugging + +##### Modified First Fit Decreasing (MFFD) +- Johnson & Garey (1985) heuristic with 5-phase packing strategy +- O(n log n + n*m) +- Best bin utilization +- Phases: + 1. Classify items (large: >C/2, medium: >C/3, small: >C/6, tiny: ≤C/6) + 2. 
Create one bin per large item + 3. Add medium items to large bins (forward pass) + 4. Add pairs of small items (backward pass) + 5. Greedy fit remaining items + 6. Apply FFD to leftovers + +##### First Fit Decreasing (FFD) +- Sort sequences by length (descending), place each in first fitting bin +- O(n log n + n*m) where m = number of bins +- Good general-purpose algorithm + +##### First Fit Shuffle +- Randomly shuffle sequences, then apply first-fit +- O(n*m) +- When sequence order doesn't matter + +### Usage with Context Parallelism + +For long sequences with context parallelism (CP > 1): +- Individual sequences must be padded to a multiple of `cp_size * 2 * tp_size`, where the factor of 2 ensures load balancing for causal attention + +#### Understanding CP Load balancing: +``` +Given a sequence of length 6, CP 2: + +0 1 2 3 4 5 + +The attention mask is: + | 0 1 2 3 4 5 +--+------------ +0 | 1 0 0 0 0 0 +1 | 1 1 0 0 0 0 +2 | 1 1 1 0 0 0 +3 | 1 1 1 1 0 0 +4 | 1 1 1 1 1 0 +5 | 1 1 1 1 1 1 + +If we were to naively chunk this sequence into CP chunks, we would have: + +CP0: + | 0 1 2 +--+------ +0 | 1 0 0 +1 | 1 1 0 + send KV 0 1 2 +2 | 1 1 1 + +CP1: + | 3 4 5 | 0 1 2 +--+------ --+------ +3 | 1 0 0 3 | 1 1 1 +4 | 1 1 0 + recv KV 0 1 2 + 4 | 1 1 1 +5 | 1 1 1 5 | 1 1 1 + +Here, CP1 ends up with more than double the work of CP0, stalling training on CP0. + +To fix this, we can chunk the sequence into 2*CP chunks (and pad to accommodate): + +| 0 1 | 2 3 | 4 5 | p p | +|--V--|--V--|--V--|--V--| +| CP0 | CP1 | CP1 | CP0 | + +Now, the work looks like this: + +CP0: + | 0 1 | 2 3 4 5 p p +--+---- --+------------ +0 | 1 0 + send KV 0 1, recv KV 2 3 4 5 + p | 1 1 1 1 1 0 +1 | 1 1 p | 1 1 1 1 1 1 + +CP1: + | 2 3 4 5 | 0 1 +--+-------- --+---- +2 | 1 0 0 0 2 | 1 1 +3 | 1 1 0 0 + send KV 2 3 4 5, recv KV 0 1 + 3 | 1 1 +4 | 1 1 1 0 4 | 1 1 +5 | 1 1 1 1 5 | 1 1 + +Much more even! 
+``` + +With Sequence packing + CP, we pack and CP-shard _per sequence_ to take full advantage of the load-balancing properties of CP-sharding. + +``` +Input batch: +0 0 0 0 0 p p p +1 1 1 1 1 1 1 1 +2 p p p p p p p +3 3 3 p p p p p + +CP = 2 + +First pack every sequence to 2 * CP * TP = 4: +[ +0 0 0 0 0 p p p, +1 1 1 1 1 1 1 1, +2 p p p, +3 3 3 p +] + +Now CP-shard each individual sequence and pack +CP0: +0 0 p p +1 1 1 1 +2 p +3 p +packed: +0 0 p p 1 1 1 1 2 p 3 p + +CP1: +0 0 0 p +1 1 1 1 +p p +3 3 +packed: +0 0 0 p 1 1 1 1 p p 3 3 +``` + +Internally, DTensor and Megatron-Core are made aware of sequence packing with either `FlashAttentionArgs` or `PackedSeqParams`, which contain `cu_seqlens_q` and `cu_seqlens_kv`, which are the cumulative sequence lengths of the sequence in the packed batch without CP. + +### Nuances +- With using Sequence Packing with Megatron + Pipeline Parallelism (PP), note that all packed sequences will be padded up to the maximum packed sequence length because PP requires maintaining a fixed-size batch x seqlen buffer for PP communications. In practice, however, we find that packing is _so efficient_ that this hardly makes a difference. + +All together, we see **speedups in the ~2-3x range** when enabling sequence packing. + +## Dynamic Batching + +Dynamic batching optimizes microbatch formation by: +1. Sorting sequences by length within batches (and respects chunk boundaries, so there are no training order diffs). +2. Grouping sequences to achieve target token count per microbatch. +3. Padding sequences to configurable multiples for hardware alignment. 
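The three steps above can be sketched with a simple token-budget grouper. This is an illustrative helper, not the NeMo RL API (the real logic lives in `BatchedDataDict.shard_by_batch_size`); it assumes each microbatch's cost is its padded max length times its sequence count:

```python
import math

def group_by_token_budget(sorted_lengths, max_tokens, round_to=8):
    """Group length-sorted sequences into microbatches under a token budget.

    Every sequence in a microbatch pads up to that microbatch's (rounded) max
    length, so the cost of a microbatch is padded_max * num_sequences.
    """
    microbatches, current, cur_max = [], [], 0
    for length in sorted_lengths:
        new_max = math.ceil(max(cur_max, length) / round_to) * round_to
        if current and new_max * (len(current) + 1) > max_tokens:
            microbatches.append(current)  # budget exceeded: start a new microbatch
            current, cur_max = [], 0
            new_max = math.ceil(length / round_to) * round_to
        current.append(length)
        cur_max = new_max
    if current:
        microbatches.append(current)
    return microbatches

# Lengths from the worked example in this section, sorted descending,
# with a 16-token budget and no rounding:
print(group_by_token_budget([7, 6, 4, 4, 3, 2], max_tokens=16, round_to=1))
# [[7, 6], [4, 4, 3, 2]]
```

This reproduces the grouping shown in the worked example: the two longest sequences fill one microbatch, and the four shorter ones share another.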
+ +**Cannot be used with sequence packing** + +### Architecture + +#### Processing Pipeline + +``` +┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ +│ Input Batch │ ── │ Sort by Length │ ── │ Group by Tokens │ +│ │ │ (within chunks) │ │ │ +└─────────────────┘ └──────────────────┘ └─────────────────┘ + │ +┌─────────────────┐ ┌──────────────────┐ ┌────────V────────┐ +│ Dynamic Micros │ <─ │ Pad to Multiple │ <─ │ Calculate Sizes │ +│ │ │ │ │ │ +└─────────────────┘ └──────────────────┘ └─────────────────┘ +``` + +``` +Input batch: +0 0 p p p p p +1 1 1 1 p p p +2 2 2 2 2 2 2 +3 3 3 3 3 3 p +4 4 4 p p p p +5 5 5 5 p p p + +MBS = 16 tokens + +Dynamic Batching will re-order this batch to minimize padding + +1. Sort: +2 2 2 2 2 2 2 +3 3 3 3 3 3 p +1 1 1 1 p p p +5 5 5 5 p p p +4 4 4 p p p p +0 0 p p p p p + +2. Chunk by MBS token count +MBS 0: +2 2 2 2 2 2 2 +3 3 3 3 3 3 p + +MBS 1: +1 1 1 1 +5 5 5 5 +4 4 4 p +0 0 p p + +Note how we're able to remove a huge chunk of padding this way and do the full batch with fewer microbatches than we would otherwise need. +``` + +#### Implementation Details + +**Sorting and Load Balancing** (`nemo_rl/distributed/batched_data_dict.py`) +```python +if dynamic_batching_args is not None: + # Sort sequences by length within each chunk + for chunk_idx in range(num_chunks): + chunk_seqlens = self.data[input_lengths_key][chunk_start:chunk_end] + chunk_idx_indices = sorted(range(batch_size), + key=lambda i: chunk_seqlens[i]) + # Stride sorted sequences across DP ranks for load balancing + chunk_idx_indices = [chunk_idx_indices[i::shards] for i in range(shards)] +``` + +**Dynamic Shape Processing** (`nemo_rl/distributed/batched_data_dict.py`) +```python +# In the batched datadict, everything is padded up to the max seqlen. This truncates +# everything in one dynamic batch to just pad up to the max within this batch. 
+def make_microbatch_iterator_with_dynamic_shapes(self): + for seqlen, (start_idx, end_idx) in zip(self.micro_batch_lengths[0], + self.micro_batch_indices[0]): + mb = self.slice(start_idx, end_idx) + mb.truncate_tensors(dim=sequence_dim, truncated_len=seqlen) + yield mb +``` + +### Interface +```python +class BatchedDataDict(UserDict, Generic[DictT]): + def shard_by_batch_size( + self, + shards: int, + dynamic_batching_args: Optional[DynamicBatchingArgs] = None, + sequence_packing_args: Optional[SequencePackingArgs] = None + ) -> list[SlicedDataDict]: + # Main entry point for both packing and dynamic batching +``` + +Similar to Sequence Packing, we do not actually create the dynamic batches upon the call to shard_by_batch_size, but just reorder sequences and create metadata internally. With a call to `RayWorkerGroup.run_all_workers_sharded_data`, the workers can run `make_microbatch_iterator_with_dynamic_shapes` to get the true dynamic batches. + +### Nuances +- Dynamic batching **cannot** be used with Megatron + PP because PP requires a fixed [batch x seqlen] buffer for PP communication. Please use Sequence Packing. +- Dynamic batching is almost always slower than Sequence Packing, but does not require that your model (and in particular, your attention variant) have Sequence-packing implemented (which can be complicated). We'd recommend always using Sequence Packing where possible, and falling back to Dynamic batching as a last resort. 
+
+## Configuration
+
+### Dynamic Batching Configuration
+```python
+class DynamicBatchingArgs(TypedDict):
+    max_tokens_per_microbatch: int  # Target tokens per microbatch
+    sequence_length_round: int  # Padding alignment multiple
+    input_key: str  # Input tensor key ("input_ids")
+    input_lengths_key: str  # Sequence lengths key ("input_lengths")
+```
+
+### Sequence Packing Configuration
+```python
+class SequencePackingArgs(TypedDict):
+    max_tokens_per_microbatch: int  # Bin capacity for packing
+    input_key: str  # Input tensor key
+    input_lengths_key: str  # Sequence lengths key
+    algorithm: str  # Packing algorithm name
+    sequence_length_pad_multiple: int  # CP/TP alignment factor
+```
+
+## Integration with Training Pipeline
+
+### Loss Function Integration
+A key design goal was to keep loss function authors from needing to know whether sequence packing is enabled. To do this, we created a `SequencePackingLossWrapper`, which takes the packed next_token_logits and the unpacked auxiliary loss-function data and runs the loss function on each sequence individually. Since the loss function's computation time is typically trivial, this approach does not introduce a noticeable slowdown. With this, the loss function can be written as though it deals with typical, unpacked batched data (as long as it is capable of processing one sequence at a time).
+
+If your loss function cannot assume batch independence, however, then neither Dynamic Batching nor Sequence Packing will work (e.g., DPO; see [issue #719](https://github.com/NVIDIA-NeMo/RL/issues/719)). 
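The idea behind `SequencePackingLossWrapper` can be sketched with a stripped-down, hypothetical helper (the real wrapper also handles logits, metadata, and metrics; only the unpack-then-apply pattern is shown here):

```python
import torch

def per_sequence_loss(packed_losses: torch.Tensor, cu_seqlens: torch.Tensor, loss_fn):
    """Apply a per-sequence loss_fn to a packed 1D tensor of token losses.

    cu_seqlens holds cumulative sequence lengths into the packed dimension,
    e.g. [0, 3, 8, 10] for three sequences of lengths 3, 5, and 2.
    """
    total = packed_losses.new_zeros(())
    for i in range(cu_seqlens.numel() - 1):
        start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        # Unpack one sequence and hand it to the loss as a batch of size 1,
        # so loss_fn only ever sees ordinary unpacked data.
        total = total + loss_fn(packed_losses[start:end].unsqueeze(0))
    return total

# Three packed sequences of lengths 3, 5, and 2, each token contributing 1.0
packed = torch.ones(10)
cu_seqlens = torch.tensor([0, 3, 8, 10])
loss = per_sequence_loss(packed, cu_seqlens, lambda seq: seq.sum())
```

Because the loop hands `loss_fn` one unpacked sequence at a time, the loss function itself never has to know that packing happened.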
+ +## Metrics and Monitoring + +### Packing Efficiency Metrics (`nemo_rl/data/packing/metrics.py`) + +- **Bin Utilization**: Percentage of bin capacity used +- **Waste Ratio**: Fraction of capacity unused due to packing constraints +- **Bin Balance**: Measure of load distribution evenness across bins +- **Packing Efficiency**: Ratio of theoretical minimum to actual bins used + +## Usage +### Sequence Packing Configuration +```yaml +# examples/configs/grpo_math_1B.yaml +policy: + sequence_packing: + enabled: True + train_mb_tokens: 2048 # Target tokens per microbatch + logprob_mb_tokens: 2048 + algorithm: "modified_first_fit_decreasing" # Best algorithm + sequence_length_round: 64 # Hardware alignment + + dynamic_batching: + enabled: False # Mutually exclusive +``` + +### Dynamic Batching Configuration +```yaml +# examples/configs/grpo_math_8B.yaml +policy: + dynamic_batching: + enabled: True + train_mb_tokens: 4096 + logprob_mb_tokens: 8192 + sequence_length_round: 64 + + sequence_packing: + enabled: False # Mutually exclusive +``` + +### Framework Compatibility + +**Sequence Packing Requirements:** +- Megatron or DTensor policy +- FlashAttention-2 for efficient packed attention +- If using CP with Megatron, you _must_ use sequence packing. If using CP with Dtensor, you _cannot_ yet use packing (WIP, [Issue #520](https://github.com/NVIDIA-NeMo/RL/issues/520)) + +**Dynamic Batching Requirements:** +- Any policy framework +- Pipeline parallelism size = 1 +- Cannot be used with torch.compile since shapes change. 
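As a closing illustration, the metrics listed under Metrics and Monitoring can be computed from per-bin token counts. This is a hedged sketch only; the real implementations live in `nemo_rl/data/packing/metrics.py`:

```python
import math

def packing_metrics(bin_loads, bin_capacity):
    """Illustrative packing metrics from per-bin valid-token counts.

    bin_loads holds the number of valid tokens packed into each bin;
    bin_capacity corresponds to max_tokens_per_microbatch.
    """
    used = sum(bin_loads)
    utilization = used / (bin_capacity * len(bin_loads))
    min_bins = math.ceil(used / bin_capacity)  # theoretical minimum bin count
    return {
        "bin_utilization": utilization,
        "waste_ratio": 1.0 - utilization,
        "packing_efficiency": min_bins / len(bin_loads),
    }

metrics = packing_metrics([2000, 1900, 1200], bin_capacity=2048)
```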
+ +--- + +## References +[Johnson & Garey (1985) - Modified First Fit Decreasing](https://doi.org/10.1016/0885-064X(85)90022-6) diff --git a/fern/v0.5.0/pages/design-docs/training-backends.mdx b/fern/v0.5.0/pages/design-docs/training-backends.mdx new file mode 100644 index 0000000000..8ca89d4b84 --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/training-backends.mdx @@ -0,0 +1,81 @@ +--- +title: Training Backends +description: "" +--- + +NeMo RL supports multiple training backends to accommodate different model sizes and hardware configurations. + +## Available Backends + +- **DTensor (FSDP2)** - PyTorch's next-generation distributed training with improved memory efficiency. +- **Megatron** - NVIDIA's high-performance training framework for scaling to large models (>100B parameters). + +## Supported Input Checkpoint Format + +At this time, NeMo RL only supports Hugging Face checkpoints as inputs to the training scripts. This applies to both +the `DTensor` backend and the `Megatron` backend. + +* `DTensor` uses the Hugging Face checkpoint both to initialize the training backend and to configure `vllm`, ensuring the model implementations match exactly. This is crucial for correctness. +* `Megatron` also uses the Hugging Face checkpoint to configure `vllm`, and performs a one-time conversion to a Megatron-format checkpoint to initialize the training backend. + +If you would like to see direct support for Megatron checkpoints, please share your use case on +https://github.com/NVIDIA-NeMo/RL/issues/671. + +## Backend Selection + +The training backend is automatically determined based on your YAML configuration settings. Here's how to configure each backend. + +### Megatron Backend +To enable Megatron-based training: + +1. Initialize the NeMo and Megatron submodules by running `git submodule update --init --recursive` +2. Add the `megatron_cfg` key to your policy configuration. +3. Set `policy.megatron_cfg.enabled=True`. +4. 
Refer to [examples/configs/grpo_math_1B_megatron.yaml](https://github.com/NVIDIA-NeMo/RL/blob/main/examples/configs/grpo_math_1B_megatron.yaml) for a complete configuration example.
+
+_Note_: When using Megatron, the optimizer and learning rate schedule are configured through `policy.megatron_cfg.optimizer` and `policy.megatron_cfg.scheduler`, respectively.
+
+### DTensor Backend
+To enable DTensor (FSDP2) training:
+
+1. Set `policy.dtensor_config.enabled=True`.
+2. Refer to [examples/configs/grpo_math_1B.yaml](https://github.com/NVIDIA-NeMo/RL/blob/main/examples/configs/grpo_math_1B.yaml) for a configuration example.
+
+## Backend Priority
+
+**Megatron takes precedence over DTensor.** If both backends are enabled simultaneously (`policy.megatron_cfg.enabled=True` and `policy.dtensor_config.enabled=True`), the Megatron backend will be used.
+
+## Configuration Examples
+
+For comprehensive examples of each algorithm and backend, see the [examples/configs/recipes/llm](https://github.com/NVIDIA-NeMo/RL/tree/main/examples/configs/recipes/llm) folder. This directory contains ready-to-use configurations for various supported combinations.
+
+## Megatron Configuration
+
+The Megatron backend requires a checkpoint directory for storing converted Hugging Face model weights in Megatron format. This directory must be accessible from all nodes in your distributed training setup.
+
+### Environment Variable Priority (Highest to Lowest) ###
+
+1. **`NRL_MEGATRON_CHECKPOINT_DIR`** - The custom checkpoint directory path.
+2. [RECOMMENDED] **`HF_HOME/nemo_rl`** - Uses the Hugging Face cache directory, if available.
+3. **`~/.cache/huggingface/nemo_rl`** - The default fallback location. 
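The priority order above can be sketched as a small resolver. This is illustrative only (not the actual NeMo RL resolution code); it takes the environment as a dict so the logic is easy to test:

```python
from pathlib import Path

def resolve_megatron_checkpoint_dir(env: dict) -> Path:
    """Resolve the Megatron checkpoint directory using the priority order above."""
    if "NRL_MEGATRON_CHECKPOINT_DIR" in env:            # 1. explicit override
        return Path(env["NRL_MEGATRON_CHECKPOINT_DIR"])
    if "HF_HOME" in env:                                # 2. Hugging Face cache
        return Path(env["HF_HOME"]) / "nemo_rl"
    return Path.home() / ".cache" / "huggingface" / "nemo_rl"  # 3. default fallback

resolved = resolve_megatron_checkpoint_dir({"HF_HOME": "/shared/nfs/huggingface"})
```

In a real setup the function would be called with `os.environ`, and the resulting path must be mounted and shared across all nodes, per the best practices below.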
+ +### Configuration Examples ### + +```bash +# Option 1: Set custom checkpoint directory +export NRL_MEGATRON_CHECKPOINT_DIR="/shared/nfs/checkpoints/megatron" + +# Option 2: Use HuggingFace home directory (recommended for shared setups) +export HF_HOME="/shared/nfs/huggingface" +# This will use /shared/nfs/huggingface/nemo_rl + +# Option 3: Use default (no environment variables needed) +# Uses ~/.cache/huggingface/nemo_rl +``` + +### Best Practices ### + +- **Mount in checkpoint directory**: If you are using Docker, make sure the Megatron checkpoint path is covered by `-v`/`--mount`. Similarly, if you are using SLURM+pyxis, ensure `--container-mounts` includes this path. +- **Use shared storage**: Ensure the checkpoint directory is accessible from all nodes (e.g., NFS, shared filesystem). +- **Prefer HF_HOME**: If you already have `HF_HOME` mounted across nodes, this reduces the number of environment variables to manage. +- **Sufficient space**: Ensure adequate disk space for the converted model checkpoints. diff --git a/fern/v0.5.0/pages/design-docs/uv.mdx b/fern/v0.5.0/pages/design-docs/uv.mdx new file mode 100644 index 0000000000..839835980d --- /dev/null +++ b/fern/v0.5.0/pages/design-docs/uv.mdx @@ -0,0 +1,82 @@ +--- +title: uv in NeMo RL +description: "" +--- + +We use the `uv` Python package installer for managing dependencies in NeMo RL. + +## Overview + +`uv` is an incredible tool that simplifies our workflow and is blazingly fast because it's written in Rust. This document explains why we've adopted `uv` for package management in our repository, particularly for NeMo RL, and how it helps us manage dependencies across Ray clusters. + +## Why `uv`? + +`uv` brings the following key advantages to our Python development workflow: + +### Speed and Efficiency + +- Written in Rust, making it significantly faster than traditional Python package managers. +- Optimized caching mechanisms that reduce redundant downloads and installations. 
+- Quick environment creation and switching, enabling rapid development cycles. + +### Isolated Environments + +- Creates fully isolated Python environments, preventing dependency conflicts between system packages and project-specific packages. +- Avoids nuanced dependency situations where a Python script might accidentally use both virtualenv dependencies and system dependencies. +- Ensures consistent behavior across different machines and deployment environments. + +### Dependency Management in Ray Clusters + +- Enables management of heterogeneous Python environments across a Ray cluster. +- Provides flexibility for each actor (worker) to use the specific Python dependencies it requires. +- Simplifies propagation of environments to worker nodes without manual setup on each node. + +### Container-Free Flexibility + +- Frees us from having to publish many containers for different dependency combinations. +- Allows us to define different [dependency groups](https://docs.astral.sh/uv/concepts/projects/dependencies/#dependency-groups) and [extras](https://docs.astral.sh/uv/concepts/projects/dependencies/#optional-dependencies) and select which ones we need dynamically. +- Reduces infrastructure complexity and maintenance overhead. + +## Implementation in NeMo RL + +This section outlines how workers define their required executables, details the available predefined configurations (like BASE or VLLM), and explains how to customize these setups for specific needs, ensuring consistency across actors. + +### Worker Configuration + +In our codebase, workers (classes decorated with `@ray.remote`, e.g., `PolicyWorker`) are associated with a `PY_EXECUTABLE` which specifies what dependencies the worker needs. These are set in a global registry in [`ACTOR_ENVIRONMENT_REGISTRY`](/../../nemo_rl/distributed/ray_actor_environment_registry.py). This allows different parts of our application to have their own tailored environments. 
+
+### Supported Python Executables
+
+We provide several predefined Python executable configurations in `PY_EXECUTABLES`:
+
+```python
+import sys
+
+class PY_EXECUTABLES:
+    SYSTEM = sys.executable
+
+    # Use NeMo RL direct dependencies.
+    BASE = "uv run --locked"
+
+    # Use NeMo RL direct dependencies and vllm.
+    VLLM = "uv run --locked --extra vllm"
+```
+
+To ensure consistent dependencies between actors, we run with `--locked` so that the dependencies match the contents of `uv.lock`.
+
+### Customization
+
+If you need a different Python executable configuration, you can override the default one by passing your own in `RayWorkerBuilder.__call__`. This provides flexibility for special use cases without modifying the core configurations.
+
+## How It Works
+
+When a NeMo RL job is started:
+
+1. The driver script creates several `RayWorkerGroup`s.
+2. Each worker group creates its workers, which are wrapped in a `RayWorkerBuilder` that receives the fully qualified name (FQN) of the worker class as a string.
+3. `RayWorkerBuilder` launches the worker through an isolated initializer, which allows us to initialize the class without importing packages that are not available in the base environment.
+4. Before the worker class is instantiated by the `RayWorkerBuilder`, the FQN is used to look up, in a [global registry](/../../nemo_rl/distributed/ray_actor_environment_registry.py), which member of `PY_EXECUTABLES` should be used to launch that set of workers. If the chosen `PY_EXECUTABLES.*` starts with `uv`, a `venv` is created with all the dependencies the worker needs, and `runtime_env["py_executable"]` is replaced with the `venv`'s Python interpreter.
+
+This approach allows a fast start-up and maintains dependency isolation. It also has the added benefit of keeping all the virtual environments local under `./venvs`.
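As a rough sketch of the lookup step described above (the registry entry and FQN below are hypothetical; the real mapping lives in `nemo_rl/distributed/ray_actor_environment_registry.py`):

```python
import sys

class PY_EXECUTABLES:
    SYSTEM = sys.executable
    BASE = "uv run --locked"
    VLLM = "uv run --locked --extra vllm"

# Hypothetical registry entry; the real FQN-to-executable mapping lives in
# nemo_rl/distributed/ray_actor_environment_registry.py.
ACTOR_ENVIRONMENT_REGISTRY = {
    "nemo_rl.models.policy.workers.PolicyWorker": PY_EXECUTABLES.VLLM,
}

def resolve_py_executable(fqn: str) -> str:
    """Map a worker class FQN to the command used to launch it."""
    py_executable = ACTOR_ENVIRONMENT_REGISTRY.get(fqn, PY_EXECUTABLES.BASE)
    # If the executable starts with `uv`, NeMo RL would create a dedicated
    # venv at this point and point runtime_env["py_executable"] at its
    # interpreter; we just return the command here.
    return py_executable

print(resolve_py_executable("nemo_rl.models.policy.workers.PolicyWorker"))
```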
+
+## Conclusion
+
+Using `uv` for dependency management in NeMo RL provides us with a fast, flexible, and reliable way to handle Python dependencies across distributed Ray clusters. It eliminates many of the traditional pain points of dependency management in distributed systems, while enabling heterogeneous environments that can be tailored to specific workloads.
diff --git a/fern/v0.5.0/pages/docker.mdx b/fern/v0.5.0/pages/docker.mdx
new file mode 100644
index 0000000000..3ce728855b
--- /dev/null
+++ b/fern/v0.5.0/pages/docker.mdx
@@ -0,0 +1,67 @@
+---
+title: Build Docker Images
+description: ""
+---
+
+This guide explains how to build the NeMo RL Docker image.
+
+The **release** image is our recommended option as it provides the most complete environment. It includes the base image with pre-fetched NeMo RL Python packages in the `uv` cache, plus the NeMo RL source code and pre-fetched virtual environments for isolated workers. This is the ideal choice for production deployments.
+
+## Building the Release Image
+
+```sh
+# Self-contained build (default: builds from main):
+docker buildx build -f docker/Dockerfile \
+    --tag <registry>/nemo-rl:latest \
+    --push .
+
+# Self-contained build (specific git ref):
+docker buildx build -f docker/Dockerfile \
+    --build-arg NRL_GIT_REF=r0.3.0 \
+    --tag <registry>/nemo-rl:r0.3.0 \
+    --push .
+
+# Self-contained build (remote NeMo RL source; no need for a local clone of NeMo RL):
+docker buildx build -f docker/Dockerfile \
+    --build-arg NRL_GIT_REF=r0.3.0 \
+    --tag <registry>/nemo-rl:r0.3.0 \
+    --push https://github.com/NVIDIA-NeMo/RL.git
+
+# Local NeMo RL source override:
+docker buildx build --build-context nemo-rl=. -f docker/Dockerfile \
+    --tag <registry>/nemo-rl:latest \
+    --push .
+```
+
+> [!NOTE]
+> The `--tag <registry>/nemo-rl:latest --push` flags are not necessary if you just want to build locally.
+
+## Skipping vLLM or SGLang Dependencies
+
+If you don't need vLLM or SGLang support, you can skip building those dependencies to reduce build time and image size. Use the `SKIP_VLLM_BUILD` and/or `SKIP_SGLANG_BUILD` build arguments:
+
+```sh
+# Skip vLLM dependencies:
+docker buildx build -f docker/Dockerfile \
+    --build-arg SKIP_VLLM_BUILD=1 \
+    --tag <registry>/nemo-rl:latest \
+    .
+
+# Skip SGLang dependencies:
+docker buildx build -f docker/Dockerfile \
+    --build-arg SKIP_SGLANG_BUILD=1 \
+    --tag <registry>/nemo-rl:latest \
+    .
+
+# Skip both vLLM and SGLang dependencies:
+docker buildx build -f docker/Dockerfile \
+    --build-arg SKIP_VLLM_BUILD=1 \
+    --build-arg SKIP_SGLANG_BUILD=1 \
+    --tag <registry>/nemo-rl:latest \
+    .
+```
+
+When these build arguments are set, the corresponding `uv sync --extra` commands are skipped, and the virtual environment prefetching will exclude actors that depend on those packages.
+
+> [!NOTE]
+> If you skip vLLM or SGLang during the build but later try to use those backends at runtime, the dependencies will be fetched and built on demand. This may add significant setup time on first use.
diff --git a/fern/v0.5.0/pages/documentation.mdx b/fern/v0.5.0/pages/documentation.mdx
new file mode 100644
index 0000000000..2a704a4589
--- /dev/null
+++ b/fern/v0.5.0/pages/documentation.mdx
@@ -0,0 +1,96 @@
+---
+title: Documentation Development
+description: ""
+---
+
+- [Documentation Development](#documentation-development)
+  - [Build the Documentation](#build-the-documentation)
+  - [Checking for Broken Links](#checking-for-broken-links)
+  - [Live Building](#live-building)
+  - [Run Tests in Python Docstrings](#run-tests-in-python-docstrings)
+  - [Write Tests in Python Docstrings](#write-tests-in-python-docstrings)
+  - [Documentation Version](#documentation-version)
+
+## Build the Documentation
+
+The following sections describe how to set up and build the NeMo RL documentation.
+
+Switch to the documentation source folder and generate HTML output.
+
+```sh
+cd docs/
+uv run --group docs sphinx-build . \
_build/html
+```
+
+* The resulting HTML files are generated in a `_build/html` folder that is created under the project `docs/` folder.
+* The generated Python API docs are placed in `apidocs` under the `docs/` folder.
+
+## Checking for Broken Links
+
+To check for broken HTTP links in the docs, run this command:
+
+```sh
+cd docs/
+uv run --group docs sphinx-build --builder linkcheck . _build/linkcheck
+```
+
+It will output a JSON file at `_build/linkcheck/output.json` with the links it found while building the
+docs. Records will have a status of `broken` if the link is not reachable. The `docs/conf.py` file is
+configured to ignore GitHub links because the CI test will often experience rate limit errors.
+Comment out the `linkcheck_ignore` variable there to check all the links.
+
+## Live Building
+
+When writing documentation, it can be helpful to serve the documentation and have it update live while you edit.
+
+To do so, run:
+
+```sh
+cd docs/
+uv run --group docs sphinx-autobuild . _build/html --port 12345 --host 0.0.0.0
+```
+
+Open a web browser and go to `http://${HOST_WHERE_SPHINX_COMMAND_RUN}:12345` to view the output.
+
+## Run Tests in Python Docstrings
+
+We also run tests in our Python docstrings. You can run them with:
+
+```sh
+cd docs/
+uv run --group docs sphinx-build -b doctest . _build/doctest
+```
+
+## Write Tests in Python Docstrings
+
+Any code in triple backtick blocks with the `{doctest}` directive will be tested. The format follows Python's doctest module syntax, where `>>>` indicates Python input and the following line shows the expected output. Here's an example:
+
+```python
+def add(x: int, y: int) -> int:
+    """
+    Adds two integers together.
+
+    Args:
+        x (int): The first integer to add.
+        y (int): The second integer to add.
+
+    Returns:
+        int: The sum of x and y.
+
+    Examples:
+    ```{doctest}
+    >>> from nemo_rl.made_up_package import add
+    >>> add(1, 2)
+    3
+    ```
+
+    """
+    return x + y
+```
+
+## Documentation Version
+
+The three files below control the version switcher. Before you attempt to publish a new version of the documentation, update these files to match the latest version numbers.
+
+* docs/versions1.json
+* docs/project.json
+* docs/conf.py
diff --git a/fern/v0.5.0/pages/fp8.mdx b/fern/v0.5.0/pages/fp8.mdx
new file mode 100644
index 0000000000..b116f4165b
--- /dev/null
+++ b/fern/v0.5.0/pages/fp8.mdx
@@ -0,0 +1,97 @@
+---
+title: FP8 Quantization in NeMo RL
+description: ""
+---
+
+This module provides a suite of tools to enable FP8 quantization for large language models. It is currently under active development.
+
+## Supported Features
+
+### FP8 Generation
+- Implements **Deepseek-style FP8** quantization using **sub-channel scaling**.
+
+### FP8 Training
+- Uses **TransformerEngine** for the linear layer implementation.
+- Supports both **Deepseek-style sub-channel scaling** and **per-tensor scaling**.
+
+### Recommended Recipe
+- For Hopper GPUs, we recommend using FP8 (Deepseek-style) precision for both generation and training for the best convergence and speedup.
+- For Blackwell GPUs, FP8 (Deepseek-style) with FP32 scaling factors is not supported in training. We currently recommend using FP8 precision for generation and BF16 for training. We are actively exploring other recipes for better performance.
+
+## Integration with NeMo RL
+
+NeMo RL applies monkey patches to several core `vLLM` components to enable FP8 generation for reinforcement learning.
+When the `init_fp8` function is called, it modifies the following:
+
+### RayDistributedExecutor
+- For multi-GPU inference, the executor is patched to ensure that every worker process applies the same FP8 patches **before model initialization**.
+
+### Quantization Utilities
+- Functions within `vllm.model_executor.layers.quantization` are replaced with custom implementations that support:
+  - **Power-of-2 scaling**
+  - Other custom features
+
+### Weight Loading
+- A custom `load_weights` function performs on-the-fly quantization of model weights from higher-precision formats to FP8.
+
+## Usage
+
+We recommend configuring FP8 generation with the following settings:
+
+```yaml
+loss_fn:
+  # Importance sampling helps improve stability.
+  use_importance_sampling_correction: true
+
+policy:
+  generation:
+    vllm_cfg:
+      precision: 'fp8'
+      # DeepGemm is much more performant than vLLM's default cutlass fp8 subchannel scaling kernels.
+      use_deep_gemm: true
+      # Users can specify the number of first/last layers to keep in BF16 precision
+      # in their experiments; by default both are set to 0.
+      num_last_layers_in_bf16: 0
+      num_first_layers_in_bf16: 0
+      # Use FP32 scaling factors. Rounding scaling factors to the nearest pow2 may improve
+      # quantization fidelity; however, this feature is still under research.
+      use_weight_pow2_scale: False
+      use_activation_pow2_scale: False
+```
+
+To train with FP8, you need to use the Megatron backend and configure it using the following settings:
+
+```yaml
+policy:
+  megatron_cfg:
+    fp8_cfg:
+      fp8: "hybrid" # choices: [hybrid, e4m3]
+      fp8_recipe: "tensorwise" # choices: [tensorwise, blockwise]
+      fp8_param: false # boolean value
+```
+
+## Compatibility Note for Deepseek-Style FP8 Training
+
+The TransformerEngine implementation for this recipe requires **CUDA version ≥ 12.9**. The latest NeMo RL depends on torch 2.8.0 + CUDA 12.9 (since this [commit](https://github.com/NVIDIA-NeMo/RL/commit/3f36d14b53e906b27c01c06e36dbbd2b8eb300cd)). Users should check out the latest code and build the container from `docker/Dockerfile` ([instructions](/docker)).
If you are using NeMo RL from before this [commit](https://github.com/NVIDIA-NeMo/RL/commit/3f36d14b53e906b27c01c06e36dbbd2b8eb300cd), you will see the following error when trying to use FP8 training:
+
+```
+File "/opt/ray_venvs/nemo_rl.models.policy.workers.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/transformer_engine/pytorch/fp8.py", line 646, in fp8_autocast
+FP8GlobalStateManager.fp8_autocast_enter(
+File "/opt/ray_venvs/nemo_rl.models.policy.workers.megatron_policy_worker.MegatronPolicyWorker/lib/python3.12/site-packages/transformer_engine/pytorch/fp8.py", line 465, in fp8_autocast_enter
+assert fp8_block_available, reason_for_no_fp8_block
+       ^^^^^^^^^^^^^^^^^^^
+AssertionError: FP8 block scaled GEMM requires Hopper and CUDA >= 12.9.
+```
+
+## Accuracy
+
+![Llama-3.1-8B-Instruct GRPO Curve BF16 vs FP8](/assets/fp8_e2e_curve.png)
+
+The above results are from Llama-3.1-8B-Instruct GRPO experiments. You can run them with the following example configs:
+* For BF16: `examples/configs/grpo_math_8B_megatron.yaml`
+* For FP8: `examples/configs/grpo_math_8B_megatron_fp8.yaml`
+
+In the experiment in this figure, enabling FP8 rollout and training gives a 15-25% decrease in step time, and the validation accuracy curves match up to 1000 steps.
+Efforts are ongoing to perform longer runs and further optimize performance.
diff --git a/fern/v0.5.0/pages/guides/async-grpo.mdx b/fern/v0.5.0/pages/guides/async-grpo.mdx
new file mode 100644
index 0000000000..86188fa739
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/async-grpo.mdx
@@ -0,0 +1,210 @@
+---
+title: Train with Async GRPO
+description: ""
+---
+
+Async GRPO is an asynchronous training mode that allows trajectory generation and policy training to run concurrently, improving GPU utilization and throughput compared to synchronous GRPO.
+
+## Configure Async GRPO
+
+This section covers how to configure async GRPO by modifying your settings and includes a complete example configuration.
+### Enable Async GRPO + +To use async GRPO, make these configuration changes: + +1. **Enable vLLM async engine**: +```yaml +policy: + generation: + backend: "vllm" + vllm_cfg: + async_engine: true +``` + +2. **Enable importance sampling correction** (required for convergence): +```yaml +loss_fn: + use_importance_sampling_correction: true +``` + +3. **Disable colocated inference** (required for async mode): +```yaml +policy: + generation: + colocated: + enabled: false + resources: + num_nodes: 1 # or more + gpus_per_node: 2 # adjust based on your setup +``` + +4. **Add async GRPO configuration**: +```yaml +grpo: + async_grpo: + enabled: true + max_trajectory_age_steps: 1 # Maximum age, in training steps, for trajectories + in_flight_weight_updates: false # Enable for faster weight synchronization + recompute_kv_cache_after_weight_updates: false # Invalidates kv cache after in-flight-weight-updates +``` + +### Complete Example Config +```yaml +policy: + generation: + backend: "vllm" + colocated: + enabled: false + resources: + num_nodes: 1 + gpus_per_node: 2 + vllm_cfg: + async_engine: true + +loss_fn: + use_importance_sampling_correction: true + +grpo: + num_prompts_per_step: 32 + num_generations_per_prompt: 4 + async_grpo: + enabled: true + max_trajectory_age_steps: 1 + in_flight_weight_updates: false # Enable for faster weight synchronization + recompute_kv_cache_after_weight_updates: false # Invalidates kv cache after in-flight-weight-updates + +cluster: + num_nodes: 2 + gpus_per_node: 4 +``` + +## Implementation Structure +This section covers the internal architecture of async GRPO and includes detailed explanations of how the core components interact. +### Core Components + +The async GRPO implementation consists of three main components: + +#### 1. 
Main Training Loop (`async_grpo_train` in `grpo.py`) +- Coordinates overall training process +- Samples trajectories from replay buffer +- Runs policy training steps +- Handles validation and checkpointing +- Manages weight synchronization between training and generation + +#### 2. Async Trajectory Collector (`AsyncTrajectoryCollector` in `async_utils.py`) +- Runs in background Ray actor +- Continuously generates trajectories using current policy weights +- Manages generation scheduling and weight version tracking +- Handles pause/resume for weight updates and validation +- Coordinates with replay buffer for trajectory storage + +#### 3. Replay Buffer (`ReplayBuffer` in `async_utils.py`) +- Stores generated trajectories with metadata +- Tracks weight versions for both generation and intended training use +- Implements age-based filtering to prevent stale trajectories +- Provides sampling interface for training steps + +### Weight Version Tracking + +Async GRPO uses a weight versioning system: +- **Generation Weight Version**: The policy weights used to generate a trajectory +- **Target Weight Version**: The training step where the trajectory will be used +- **Max Trajectory Age**: How many steps old a trajectory can be before being discarded + +Example with `max_trajectory_age_steps: 1`: +- Trajectory generated with weights v10 can be used for training steps v10 or v11 +- At training step v12, trajectories from v10 are too old and discarded + +### Coordination Flow + +1. **Startup**: Trajectory collector starts generating trajectories in background +2. **Buffer Fill**: Training waits until buffer has sufficient trajectories +3. **Training Step**: + - Sample trajectories from buffer + - Run policy training + - Update weights and notify collector +4. **Weight Sync**: Collector pauses, waits for weight refit, then resumes +5. 
**Repeat**: Process continues with updated weights
+
+### Architecture Diagram
+
+The following sequence diagram illustrates the interactions between the three main components:
+
+```mermaid
+sequenceDiagram
+    participant Training as Training Loop
+    participant Collector as Trajectory Collector
+    participant Buffer as Replay Buffer
+
+    Note over Training, Buffer: Startup
+    Training->>Collector: Start generation
+    Training->>Buffer: Initialize
+
+    Note over Training, Buffer: Main Loop
+    loop Async Training
+        par Background Generation
+            Collector->>Buffer: Store trajectories
+        and Training Steps
+            Training->>Buffer: Sample trajectories
+            Buffer-->>Training: Return valid data
+            Training->>Training: Update policy weights
+            Training->>Collector: Sync new weights
+        end
+    end
+```
+
+## Usage Tips
+
+1. **Buffer Sizing**: The replay buffer size is automatically calculated as:
+   ```
+   buffer_size = num_prompts_per_step × max_trajectory_age_steps × 2
+   ```
+
+2. **Age Limits**: Start with `max_trajectory_age_steps: 1` and increase it if needed for higher throughput.
+
+3. **Resource Allocation**: Ensure sufficient GPU memory for both the training and generation clusters.
+
+4. **In-Flight Weight Updates**: Enable `in_flight_weight_updates: true` when using `async_engine: true` to update the weights of the vLLM engine during generation. This prevents stalling the training pipeline until the longest generation finishes and provides significant performance benefits.
+
+5. **Recompute KV Cache After Weight Updates**: When using in-flight weight updates, you can choose whether to recompute
+KV caches after a weight update via the `recompute_kv_cache_after_weight_updates` setting.
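The age-based filtering and buffer sizing described above can be sketched as follows (an illustrative model, not the actual `ReplayBuffer` code):

```python
def is_fresh(generation_version: int, training_step: int, max_age: int) -> bool:
    """Age check: a trajectory generated with weights version v can be used
    for training steps v through v + max_age, and is discarded afterwards."""
    return training_step - generation_version <= max_age

# With max_trajectory_age_steps = 1, a trajectory generated with weights v10
# can serve training steps 10 and 11, but is too old at step 12:
assert is_fresh(10, 10, 1)
assert is_fresh(10, 11, 1)
assert not is_fresh(10, 12, 1)

# Buffer sizing rule from the Usage Tips above:
num_prompts_per_step, max_trajectory_age_steps = 32, 1
buffer_size = num_prompts_per_step * max_trajectory_age_steps * 2
print(buffer_size)  # 64
```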
+ +## Why Importance Sampling Correction Is Required for Async + +### The GRPO Objective + +The standard GRPO loss function (without KL penalty) is: + +$$ +L(\theta) = E_{x \sim \pi_{\theta_{\text{old}}}} \Big[ \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big) \Big] +$$ + +where: +- $\pi_\theta$ is the policy model we are currently optimizing +- $\pi_{\theta_{\text{old}}}$ is the previous policy model (from the beginning of this step) +- $A_t$ is the advantage estimate +- $\varepsilon$ is a clipping hyperparameter + +In standard GRPO, we assume trajectories are sampled from $\pi_{\theta_{\text{old}}}$. However, in async GRPO, trajectories are actually sampled from $\pi_{\theta_{\text{generator}}}$, which is the policy weights from N training steps ago (where N ≥ 1 depending on `max_trajectory_age_steps`). + +Without importance sampling correction, the GRPO objective becomes fundamentally incorrect: + +1. **Incorrect probability ratios**: The ratio $\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$ uses $\pi_{\theta_{\text{old}}}$ probabilities that were never actually used to generate the trajectories. + +2. **Biased gradient estimates**: Since we're computing gradients based on samples from the wrong distribution, the policy updates become biased and can lead to instability. + +When we enable importance sampling correction (`use_importance_sampling_correction: true`), we introduce the corrective term: + +$$ +\frac{\pi_{\text{training}}(x)}{\pi_{\text{generator}}(x)} +$$ + +This transforms our loss function to properly account for the distribution mismatch. 
The corrected objective becomes: + +$$ +L(\theta) = E_{x \sim \pi_{\theta_{\text{generator}}}} \Big[ \frac{\pi_{\text{training}}(x)}{\pi_{\text{generator}}(x)} \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big) \Big] +$$ + +The importance sampling ratio $\frac{\pi_{\text{training}}(x)}{\pi_{\text{generator}}(x)}$ is effectively $\frac{\pi_{\theta_{\text{old}}}(x)}{\pi_{\theta_{\text{generator}}}(x)}$, which corrects for the N-step gap between the generator policy and the policy we assume we're sampling from. + +This correction ensures that we have unbiased gradient estimates and stable convergence. diff --git a/fern/v0.5.0/pages/guides/dapo.mdx b/fern/v0.5.0/pages/guides/dapo.mdx new file mode 100644 index 0000000000..5687bbb86a --- /dev/null +++ b/fern/v0.5.0/pages/guides/dapo.mdx @@ -0,0 +1,102 @@ +--- +title: An in-depth Walkthrough of DAPO in NeMo RL +description: "" +--- + +This guide covers the [Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)](https://arxiv.org/pdf/2503.14476) implementation in NeMo RL. + +DAPO introduces four key improvements over Group Relative Policy Optimization (GRPO): +1. **Clip-Higher**, which promotes the diversity of the system and avoids entropy collapse +2. **Dynamic Sampling**, which improves training efficiency and stability +3. **Token-Level Policy Gradient Loss**, which is critical in long-CoT RL scenarios +4. **Overlong Reward Shaping**, which reduces reward noise and stabilizes training + +This document focuses on DAPO-specific features: Dynamic Sampling and Overlong Reward Shaping. For foundational concepts on GRPO including data handling, policy training, generation, and loss functions, see the [NeMo RL GRPO Guide](/grpo). 
+
+## Quickstart: Launch a DAPO Run
+
+To get started quickly, use the example configuration [examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml](/../../examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml). You can launch this using the same script as GRPO:
+
+```bash
+uv run examples/run_grpo.py --config examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml \{overrides\}
+```
+
+**Reminder**: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+
+## Dynamic Sampling
+
+Standard GRPO trains on all generated responses, even when all responses in a prompt group receive identical rewards and therefore provide zero gradient signal. Dynamic sampling instead keeps only groups with diverse rewards (`std > 0`) and accumulates them across batches until the target batch size is reached. Dynamic sampling can be enabled by setting `use_dynamic_sampling=True` in your configuration. For implementation details, see the [`dynamic_sampling`](/../../nemo_rl/algorithms/grpo.py) function.
+
+**Algorithm**: For each training step:
+
+1. Sample `batch_multiplier × num_prompts_per_step` prompts from the dataset. The default value of `batch_multiplier` is 1.
+2. Generate `num_generations_per_prompt` responses per prompt and compute rewards.
+3. Compute the baseline and standard deviation for each prompt group.
+4. Filter prompt groups where `std > 0`.
+5. Store these prompts in a cache until reaching the target training batch size of `num_prompts_per_step × num_generations_per_prompt` samples.
+6. Samples are accumulated until the maximum number of allowed batches (`dynamic_sampling_max_gen_batches`) is reached. If the cache still does not meet the target rollout batch size at that point, an error is raised. To resolve this, consider adjusting parameters such as `num_prompts_per_step` or `num_generations_per_prompt` to increase sample diversity, or revisit the complexity of your data.
+7. 
Perform training on the collected samples with nonzero standard deviation + +### About batch_multiplier + +`batch_multiplier` (a float ≥ 1.0) controls the initial prompt pool size by sampling `batch_multiplier × num_prompts_per_step` prompts before dynamic sampling. Higher values increase memory and compute requirements, while very low values (e.g., 1.0) may slow the cache accumulation of prompt groups with nonzero standard deviation. The optimal value depends on the dataset, model capacity, and overall training setup. When **dynamic sampling** is enabled, we also log two additional metrics: + + * `dynamic_sampling_num_gen_batches`: The number of generation rounds required to produce `num_prompts_per_step * num_generations_per_prompt` samples with a nonzero standard deviation. If this number remains consistently high across iterations, try increasing the `batch_multiplier`. The maximum allowed value for this parameter is determined by `dynamic_sampling_max_gen_batches`. + * `dynamic_sampling_num_discarded_valid_samples`: The number of samples with a nonzero standard deviation that are discarded because the total exceeds `num_prompts_per_step * num_generations_per_prompt`. If this value is frequently high (e.g., above `0.5 * num_prompts_per_step * num_generations_per_prompt`) and `dynamic_sampling_num_gen_batches` is consistently 1, it suggests that a large fraction of the dataset is being discarded unnecessarily. To improve data efficiency, consider decreasing the `batch_multiplier`. + +## Reward Shaping +DAPO introduces an overlong reward shaping mechanism to reduce reward noise and stabilize training. This approach penalizes responses that exceed a specified length threshold, helping to prevent the model from generating excessively long outputs while maintaining solution quality. + +For a detailed explanation of the overlong reward shaping mechanism, please refer to Section 3.4 of the [DAPO paper](https://arxiv.org/pdf/2503.14476). 
For implementation details, see the [`apply_reward_shaping`](/../../nemo_rl/algorithms/reward_functions.py) function. + +## Configuration + +```yaml +grpo: + use_dynamic_sampling: true # Enable DAPO dynamic sampling + num_prompts_per_step: 512 # Target number of prompts per training step + num_generations_per_prompt: 16 # Generations per prompt + batch_multiplier: 3 # Dataloader batch size = batch_multiplier × num_prompts_per_step + dynamic_sampling_max_gen_batches: 10 # Maximum number of batches to be used for accumulating non-zero std prompts + reward_scaling: + enabled: true + source_min: 0.0 + source_max: 1.0 + target_min: -1.0 + target_max: 1.0 + + reward_shaping: + enabled: true + overlong_buffer_length: 4096 # Threshold before penalties apply (paper uses 4096) + overlong_buffer_penalty: 1.0 # Penalty per excess token + max_response_length: 20480 # Hard maximum generation length +``` + +**Key Parameters:** +- **`use_dynamic_sampling`**: When enabled, activates DAPO's dynamic sampling algorithm to filter and accumulate prompt groups with nonzero standard deviation +- **`batch_multiplier`**: Factor that scales the initial prompt pool size for sampling. +- **`dynamic_sampling_max_gen_batches`**: Maximum number of batches to be used for accumulating nonzero standard deviation prompts. +- **`reward_scaling`**: When enabled, clamps each reward in the batch to [source_min, source_max] and linearly rescales it to [target_min, target_max]. Defaults: source_min=0.0, source_max=1.0, target_min=0.0, target_max=1.0. +- **`reward_shaping`**: When enabled, applies the overlong penalty mechanism described in the Reward Shaping section above. Responses exceeding `max_response_length - overlong_buffer_length` receive penalties proportional to their excess length, helping to reduce reward noise and stabilize training. + +> [!NOTE] +> When dynamic sampling is enabled, monitor the `filtered_reward` metric to track the average reward of the prompts with std > 0. 
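The overlong shaping rule described in the Reward Shaping section can be sketched as follows (an illustrative reading of Section 3.4 of the DAPO paper, not the exact `apply_reward_shaping` implementation):

```python
def overlong_penalty(
    response_length: int,
    max_response_length: int = 20480,
    overlong_buffer_length: int = 4096,
    overlong_buffer_penalty: float = 1.0,
) -> float:
    """Linear penalty once a response enters the overlong buffer."""
    threshold = max_response_length - overlong_buffer_length  # 16384 here
    excess = response_length - threshold
    if excess <= 0:
        return 0.0  # within budget: no penalty
    # The penalty grows linearly with the excess length, capped at the
    # full penalty once the hard maximum is exceeded.
    return -min(excess / overlong_buffer_length, 1.0) * overlong_buffer_penalty

print(overlong_penalty(16384))  # 0.0
print(overlong_penalty(18432))  # -0.5 (halfway into the buffer)
print(overlong_penalty(25000))  # -1.0 (beyond max length, full penalty)
```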
+
+> [!NOTE]
+> **Clip-Higher** and **Token-Level Policy Gradient Loss** are already supported in NeMo RL and can be configured through the `loss_fn` section of your experiment config:
+> - Set `ratio_clip_max` to enable Clip-Higher (e.g., `ratio_clip_max: 0.28`)
+> - Set `token_level_loss: true` to enable Token-Level Policy Gradient Loss
+>
+> See the full [DAPO example config](/../../examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml) for reference.
+
+## Example Training Results
+Using the [DAPO example config](/../../examples/configs/recipes/llm/dapo-qwen2.5-7b.yaml), you can expect to see intermediate plots such as the training reward curve and validation accuracy on AIME24 for Qwen/Qwen2.5-Math-7B. These plots serve as reference outputs to help verify reproducibility. They are not intended to reflect the best accuracy that can be achieved using DAPO for this model.
+
+![DAPO Qwen2.5-7B Training Reward](/assets/dapo_train_reward.png)
+![DAPO Qwen2.5-7B Validation Accuracy](/assets/dapo_val_acc.png)
+
+## References
+
+- **DAPO Paper**: [Decoupled Clip and Dynamic Sampling Policy Optimization](https://arxiv.org/pdf/2503.14476)
+- **GRPO Paper**: [Group Relative Policy Optimization](https://arxiv.org/abs/2402.03300)
+- **[NeMo RL GRPO Guide](/grpo)**
diff --git a/fern/v0.5.0/pages/guides/deepseek.mdx b/fern/v0.5.0/pages/guides/deepseek.mdx
new file mode 100644
index 0000000000..2e87fde1b2
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/deepseek.mdx
@@ -0,0 +1,31 @@
+---
+title: DeepSeek-V3
+description: ""
+---
+
+## Create BF16 Hugging Face checkpoint
+
+(adapted from https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/deepseek_v3.html)
+
+```bash
+# clone DeepSeek-V3 weights from HF (this can take hours)
+git lfs install
+git clone https://huggingface.co/deepseek-ai/DeepSeek-V3 DeepSeek-V3-FP8
+
+# clone DeepSeek-V3 code
+git clone https://github.com/deepseek-ai/DeepSeek-V3.git
+
+# transformers (since v4.23.0) checks for the tensor format in the metadata:
+# https://github.com/huggingface/transformers/blob/9ae22fe3c1b81f99a764d382054b6ebe2b025bd4/src/transformers/modeling_utils.py#L388
+cd DeepSeek-V3/inference
+sed -i '88{s/new_safetensor_file/new_safetensor_file, metadata={"format": "pt"}/}' fp8_cast_bf16.py
+
+# convert weights
+python fp8_cast_bf16.py --input-fp8-hf-path ../../DeepSeek-V3-FP8 --output-bf16-hf-path ../../DeepSeek-V3-BF16
+
+# copy other files
+cd ../..
+cp DeepSeek-V3-FP8/{tokenizer_config.json,tokenizer.json,modeling_deepseek.py,configuration_deepseek.py} DeepSeek-V3-BF16/
+
+# copy config.json, remove `quantization_config`, and set num_nextn_predict_layers to 0 (we currently do not support MTP):
+jq 'del(.quantization_config) | .num_nextn_predict_layers=0' DeepSeek-V3-FP8/config.json > DeepSeek-V3-BF16/config.json
+```
diff --git a/fern/v0.5.0/pages/guides/dpo.mdx b/fern/v0.5.0/pages/guides/dpo.mdx
new file mode 100644
index 0000000000..22d446e690
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/dpo.mdx
@@ -0,0 +1,203 @@
+---
+title: Direct Preference Optimization in NeMo RL
+description: ""
+---
+
+[Direct Preference Optimization (DPO)](https://arxiv.org/pdf/2305.18290) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims
+to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the
+[DPO paper](https://arxiv.org/pdf/2305.18290).
+
+## Launch a DPO Run
+
+The script [examples/run_dpo.py](/../../examples/run_dpo.py) can be used to launch a DPO experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](/../cluster).
+
+Be sure to launch the job using `uv`. 
The command to launch a DPO job is as follows:
```bash
uv run examples/run_dpo.py --config <path_to_config>
```
If not specified, `config` will default to [examples/configs/dpo.yaml](/../../examples/configs/dpo.yaml).

## Configuration

NeMo RL allows users to configure DPO experiments using `yaml` config files. An example DPO configuration file can be found [here](/../../examples/configs/dpo.yaml).

To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example:

```bash
uv run examples/run_dpo.py \
  cluster.gpus_per_node=8 \
  dpo.sft_loss_weight=0.1 \
  dpo.preference_average_log_probs=True \
  logger.wandb.name="dpo-dev-8-gpu"
```

**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll also need to run `huggingface-cli login` for Llama models.

## Datasets

DPO datasets in NeMo RL are encapsulated using classes. Each DPO data class is expected to have the following attributes:
 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below.
 2. `task_name`: A string identifier that uniquely identifies the dataset.

If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. An example implementation can be found in [preference_datasets/tulu3.py](/../../nemo_rl/data/datasets/preference_datasets/tulu3.py).

**Note:** The `task_name` field is required in each formatted example.
+ +```json +{ + "context": [], // list of dicts - The prompt message (including previous turns, if any) + "completions": [ // list of dicts — The list of completions + { + "rank": 0, // int — The rank of the completion (lower rank is preferred) + "completion": [] // list of dicts — The completion message(s) + }, + { + "rank": 1, // int — The rank of the completion (lower rank is preferred) + "completion": [] // list of dicts — The completion message(s) + } + ], + "task_name": "task_name" // identifier for the task +} +``` + +DPO training supports only two completions (where the lowest rank is preferred and the highest one is rejected), with each completion being a single response. For example: +```json +{ + "context": [ + { + "role": "user", + "content": "What's the capital of France?" + }, + { + "role": "assistant", + "content": "The capital of France is Paris." + }, + { + "role": "user", + "content": "Thanks! And what's the capital of Germany?" + } + ], + "completions": [ + { + "rank": 0, + "completion": [ + { + "role": "assistant", + "content": "The capital of Germany is Berlin." + } + ] + }, + { + "rank": 1, + "completion": [ + { + "role": "assistant", + "content": "The capital of Germany is Munich." + } + ] + } + ], + "task_name": "task_name" +} +``` + +By default, NeMo RL has support for [HelpSteer3](/../../nemo_rl/data/datasets/preference_datasets/helpsteer3.py) and [Tulu3Preference](/../../nemo_rl/data/datasets/preference_datasets/tulu3.py) datasets. Both of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk. + +We provide a [PreferenceDataset](/../../nemo_rl/data/datasets/preference_datasets/preference_dataset.py) class that is compatible with jsonl-formatted preference datasets for loading datasets from local path or HuggingFace. 
You can modify your config as follows to use such a custom preference dataset:
```yaml
data:
  # other data settings, see `examples/configs/dpo.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: PreferenceDataset
    prompt_file: null
    system_prompt_file: null
  # multiple validation sets are supported by using val_data_paths
  # this will be removed after refactor
  val_data_paths:
    <val_set_1_name>: /path/to/local/val_dataset_1.jsonl
    <val_set_2_name>: /path/to/local/val_dataset_2.jsonl
```

Your JSONL files should contain one JSON object per line with the following structure:

```json
{
  "context": [{"role": "user", "content": "What is 2+2?"}], // list of dicts - The prompt message (including previous turns, if any)
  "completions": [ // list of dicts — The list of completions
    {
      "rank": 0, // int — The rank of the completion (lower rank is preferred)
      "completion": [{"role": "assistant", "content": "The answer is 4."}] // list of dicts — The completion message(s)
    },
    {
      "rank": 1, // int — The rank of the completion (lower rank is preferred)
      "completion": [{"role": "assistant", "content": "I don't know."}] // list of dicts — The completion message(s)
    }
  ]
}
```

We also provide a [BinaryPreferenceDataset](/../../nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py) class, which is a simplified version of PreferenceDataset for pairwise ranked preferences with single-turn completions.
You can use `prompt_key`, `chosen_key`, and `rejected_key` to specify which fields in your data correspond to the question, the chosen answer, and the rejected answer, respectively. Here's an example configuration:
```yaml
data:
  # other data settings, see `examples/configs/dpo.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override prompt_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    prompt_key: context
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
```

Your JSONL files should contain one JSON object per line with the following structure:

```json
{
  "prompt": "What is 2+2?",     // <prompt_key>: the question
  "chosen": "The answer is 4.", // <chosen_key>: the preferred answer
  "rejected": "I don't know."   // <rejected_key>: the rejected answer
}
```

Please note:
- If you are using a logger, the prefix used for each validation set will be `validation-<val_set_name>`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<val_set_name>_loss`.
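If your raw data is stored as flat (prompt, chosen, rejected) records, a short preprocessing script can emit the `context`/`completions` JSONL format described above. A minimal sketch (the raw field names and the `to_preference_format` helper are illustrative, not part of NeMo RL):

```python
import json

def to_preference_format(prompt: str, chosen: str, rejected: str, task_name: str = "my_task") -> dict:
    """Convert one (prompt, chosen, rejected) triple into the preference format above."""
    return {
        "context": [{"role": "user", "content": prompt}],
        "completions": [
            {"rank": 0, "completion": [{"role": "assistant", "content": chosen}]},    # preferred
            {"rank": 1, "completion": [{"role": "assistant", "content": rejected}]},  # rejected
        ],
        "task_name": task_name,
    }

# Write one JSON object per line, as expected by the JSONL loaders.
raw = [{"prompt": "What is 2+2?", "chosen": "The answer is 4.", "rejected": "I don't know."}]
with open("train_dataset.jsonl", "w") as f:
    for row in raw:
        f.write(json.dumps(to_preference_format(row["prompt"], row["chosen"], row["rejected"])) + "\n")
```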
+ +## DPO-Specific Parameters + +The DPO implementation in NeMo RL supports several key parameters that can be adjusted: + +- `dpo.reference_policy_kl_penalty`: Controls the strength of the KL penalty term +- `dpo.preference_loss_weight`: Weight for the preference loss +- `dpo.sft_loss_weight`: Weight for the auxiliary SFT loss +- `dpo.preference_average_log_probs`: Whether to average log probabilities over tokens in the preference loss term +- `dpo.sft_average_log_probs`: Whether to average log probabilities over tokens in the SFT loss term + +These parameters can be adjusted in the config file or via command-line overrides to optimize training for your specific use case. + +## Evaluate the Trained Model + +Upon completion of the training process, you can refer to our [evaluation guide](/eval) to assess model capabilities. diff --git a/fern/v0.5.0/pages/guides/dtensor-tp-accuracy.mdx b/fern/v0.5.0/pages/guides/dtensor-tp-accuracy.mdx new file mode 100644 index 0000000000..e756ebe062 --- /dev/null +++ b/fern/v0.5.0/pages/guides/dtensor-tp-accuracy.mdx @@ -0,0 +1,239 @@ +--- +title: DTensor Tensor Parallel Accuracy Issue +description: "" +--- + +During reinforcement learning (RL) post-training, maintaining accuracy is both **critical and challenging**. Minor numerical deviations can propagate and amplify across policy updates, ultimately distorting reward signals and affecting convergence. Consequently, understanding and mitigating accuracy issues is central to ensuring consistent and reliable training behavior in large-scale distributed RL settings. + +## Observed Accuracy Issues Under Tensor Parallelism with DTensor Backend + +During our development, we identified that the **tensor parallel (TP)** strategy can be a significant factor contributing to accuracy problems. + +We have encountered several accuracy issues related to TP in **DTensor**, including: + +1. 
**For policy models**: We observed severe `token_mult_prob_error` spikes when TP was enabled during post-training of a Qwen3 dense model (e.g., [Qwen/Qwen3-4B-Instruct-2507 · Hugging Face](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)), indicating a significant difference between the training and inference engines.
2. **For reward models**: The reward model exhibited large discrepancies under different TP configurations.
3. **For overall model training performance**: Using a $TP > 1$ configuration often leads to degraded downstream performance when utilizing either **DTensorPolicyWorker** or **DTensorPolicyWorkerV2**.

### Misalignment between Training and Inference for Policy Models

Using [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) as an example, Figure 1 illustrates the `token_mult_prob_error` observed during training. We applied a *time-weighted exponential moving average (EMA)* smoothing method and used a logarithmic scale on the Y-axis for better visualization.

The `token_mult_prob_error` [metric](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/grpo.md#multiplicative-token-probability-error) measures the discrepancy between the inference engine and the training engine when processing the same sample. It is defined as follows:

$$
\begin{aligned}
g_i & : \text{the } i^{th} \text{ item in } \text{generation-logprobs}, \\
p_i & : \text{the } i^{th} \text{ item in } \text{policy-logprobs}, \\
m_i & : \text{the mask of the } i^{th} \text{ token (1 or 0)}, \\
&\text{global-valid-toks} = \sum_i m_i \, , \\
& \text{token-mult-prob-error}= \frac{1}{\text{global-valid-toks}}\sum_{i} m_i \exp\left(\left|g_i - p_i\right|\right)
\end{aligned}
$$

In general, **generation logprobs** and **policy logprobs** should align closely, resulting in a `token_mult_prob_error` value near **1.0**.
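The definition above can be computed directly from the two logprob vectors; here is a minimal, framework-free sketch with plain Python lists standing in for tensors:

```python
import math

def token_mult_prob_error(gen_logprobs, policy_logprobs, mask):
    """Mean of exp(|g_i - p_i|) over valid (mask == 1) tokens, per the definition above."""
    valid = sum(mask)  # global-valid-toks
    total = sum(m * math.exp(abs(g - p)) for g, p, m in zip(gen_logprobs, policy_logprobs, mask))
    return total / valid

# Perfectly aligned engines give exactly 1.0; small logprob gaps push the metric above 1.
aligned = token_mult_prob_error([-1.2, -0.3], [-1.2, -0.3], [1, 1])    # 1.0
drifted = token_mult_prob_error([-1.2, -0.3], [-1.25, -0.32], [1, 1])  # > 1.0
```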
In our development, when this metric exceeds **1.05**, we consider it indicative of a potential framework issue that warrants further investigation. + +As shown in Figure 1, numerous spikes can be observed during training. Occasional spikes are acceptable if the `token_mult_prob_error` quickly returns to around 1.0. However, in this case, even with EMA smoothing applied, the figure reveals an overall upward trend, which is unacceptable and indicates a persistent misalignment between the training and inference behaviors. + +![](/assets/dtensor-tp-accuracy/token_mult_prob_error_qwen3_4B.png) + +

Fig 1: The token_mult_prob_error of Qwen3-4B

+ +### Discrepancies Across TP Configurations in Reward Modeling + +For the reward model, different TP plans lead to slight but noticeable inconsistencies in the validation loss. As summarized in Table 1, the loss values vary across TP settings, with TP=4 showing a larger deviation from the TP=1 baseline than TP=2 or TP=8. This suggests that the choice of TP configuration can subtly affect the numerical behavior of the reward model, even when all other training conditions are held constant. + +To investigate whether mixed‑precision arithmetic was a major contributor, autocast was disabled in a separate set of experiments so that computations were performed in full precision. However, the validation losses with and without autocast are essentially identical for all TP settings, indicating that mixed‑precision itself is not the root cause of the discrepancy. Instead, these results imply that the primary source of inconsistency lies in how different TP plans partition and aggregate computations across devices, rather than in precision loss from autocast. + +| | TP=1 | TP=2 | TP=4 | TP=8 | +| ------------- | ------ | ------ | ------ | ------ | +| With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | +| W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 | +

Table 1: The validation loss of reward model training

+ +### Overall Performance Degradation Under Tensor Parallelism + +Figure 2 and Figure 3 present the reward curves and validation accuracy curves for multiple runs under different tensor parallel (TP) configurations. We also apply EMA smoothing for better visualization. The mismatch between the policy engine and the generation engine can lead to degraded downstream accuracy. This issue is most evident in the blue and purple curves, whose corresponding experiments are also the most abnormal cases observed in Figure 1. + +Combining the three images for observation, it is not necessarily true that abnormal `token_mult_prob_error` leads to abnormal reward and validation accuracy. This occurs for several reasons: + +1. **Spike pattern instead of continuous growth**: In many runs, `token_mult_prob_error` shows frequent spikes rather than a monotonically increasing trend, indicating that training is unstable but not fundamentally broken. +2. **Stochastic occurrence of spikes**: The abnormal `token_mult_prob_error` is itself unstable; even with the same batch of data, spikes may not appear in every run. +3. **Dilution effect with large datasets**: When the dataset is sufficiently large and no critical samples are repeatedly affected, these extreme but sporadic spikes may have limited impact on aggregate metrics, so the final reward and validation accuracy may not exhibit significant deviations. + +![](/assets/dtensor-tp-accuracy/image-20260111142255534.png) + +

Fig 2: The reward of Qwen3-4B

+ +![](/assets/dtensor-tp-accuracy/validation_accuracy.png) + +

Fig 3: The validation accuracy of Qwen3-4B

+ +However, such training instability is unacceptable for an RL training framework, so we aim to identify and eliminate the underlying issues. There are several challenges in resolving this problem: + +1. **Model dependence**: The issue is model-dependent rather than universal. For example, this phenomenon is observed on Qwen3-4B but not on Llama-3.1-8B-Instruct. +2. **Poor reproducibility**: Abnormal spikes in `token_mult_prob_error` cannot be reproduced reliably. Even with the same batch of data and identical configurations, repeated runs may yield different outcomes. + +Our in-depth analysis across multiple models and runs indicates that this behavior does not stem from a single root cause but rather from the interaction of several subtle factors. Taken together, these findings point to a small set of dominant contributors that consistently correlate with the observed instability. Our investigation revealed multiple contributing factors, with the most significant being: + +1. **Batch-variant kernels**, which can produce inconsistent results across microbatches. +2. A **row-wise TP plan**, as row-wise partitioning can introduce additional numerical inconsistencies during distributed computation. + +## Batch-Variant Kernels + +In RL training, log probabilities are typically computed for samples drawn from the old policy, denoted as `prev_logprobs`. The same samples are then evaluated under the current policy being optimized, yielding `current_logprobs`. Using these two quantities, we compute the ratio between the current and previous policies as follows: + +$$ +\begin{aligned} +\text{ratio} &= \exp\left(\text{current-logprobs} - \text{prev-logprobs}\right) \\ +&= \exp\left(\log\left(\frac{\text{current-probs}}{\text{prev-probs}}\right)\right) \\ +&= \frac{\text{current-probs}}{\text{prev-probs}} +\end{aligned} +$$ + +This ratio is the standard importance ratio used in off-policy RL to reweight returns when the data are collected under an older behavior policy. 
In on-policy training, this ratio should be exactly 1. However, in our experiments, we observed cases where the ratio deviates from 1, indicating a mismatch between the intended on-policy setting and the actual behavior of the system. Figure 4 and Figure 5 illustrate this phenomenon by showing the mismatch between `prev_logprobs` and `current_logprobs` under TP=4, as well as the reward curves under TP=4 and TP=1 for the `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` model. + +![](/assets/dtensor-tp-accuracy/logprobs_unequal_1.png) + +

Fig 4: The mismatch of prev_logprobs and current_logprobs under TP=4

+ +![](/assets/dtensor-tp-accuracy/image-20260111160656891-1768118824549-2.png) + +

Fig 5: The reward of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B under TP=4 and TP=1

+ +### Root Cause + +Upon further investigation, the discrepancy between `current_logprobs` and `prev_logprobs` was traced to a mismatch between `train_micro_batch_size` and `logprob_batch_size`, which caused the model to behave differently for the same logical samples under different effective batch sizes. This behavior is a typical manifestation of **batch-variant kernels**, where the numerical outputs of certain operators depend not only on the input tensors themselves but also on how those tensors are grouped into batches or microbatches. + +In batch-variant kernels, low-level implementation details—such as parallel reduction order, tiling strategy, fused-kernel heuristics, or algorithm selection conditioned on batch size or sequence layout—can change when the batch size changes, leading to small but systematic numerical differences in the computed logprobs. When `train_micro_batch_size` and `logprob_batch_size` are inconsistent, the same token sequence may traverse slightly different computational paths during training and logprob evaluation, resulting in `current_logprobs != prev_logprobs` and importance-sampling ratios that deviate from 1, even in nominally on-policy settings. + +After aligning `train_micro_batch_size` and `logprob_batch_size` so that the same samples are processed with identical effective batch configurations, the importance-sampling ratio (`probs_ratio`) becomes 1 as expected, and the observed accuracy issues disappear. This confirms that the mismatch was caused by batch-dependent numerical variation rather than a conceptual error in the RL objective or data pipeline. 
+ +### Recommended Solutions + +When using DTensor with TP > 1, or when `probs_ratio != 1` is observed in an on-policy setting, the following mitigation strategies are recommended to restore numerical consistency and stabilize training: + +- **Align micro-batch sizes**: + Configure `train_micro_batch_size` and `logprob_batch_size` to be exactly equal so that both the training forward pass and the logprob evaluation traverse identical kernel configurations and batching patterns. This alignment minimizes batch-variant behavior in underlying kernels and ensures that `current_logprobs` and `prev_logprobs` are computed under the same numerical conditions, which in turn drives `probs_ratio` back toward 1. +- **Force an on-policy ratio**: + In strictly on-policy scenarios, enable the `loss_fn.force_on_policy_ratio` flag to explicitly set `probs_ratio` to 1 during loss computation. This option is appropriate only when the data are guaranteed to be collected from the current policy and the theoretical importance-sampling ratio should be exactly 1; under these assumptions, clamping the ratio removes spurious numerical noise introduced by minor logprob mismatches while preserving the correctness of the training objective. + +## Row-Wise TP Plan + +Row-wise and column-wise parallelism are two common ways to split a large linear layer across multiple devices. They differ in **which dimension of the weight matrix is partitioned** and how the partial results are combined. + +Consider a linear layer $y=xW^T$ with $ W^T \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}},\quad x \in \mathbb{R}^{d_{\text{in}}},\quad y \in \mathbb{R}^{d_{\text{out}}}. $. + +1. 
Row-wise parallel (TP = 2)

   In **row-wise** parallelism, we split $W^T$ by rows (input dimension) into two blocks:

$$
   W^T =
   \begin{bmatrix}
   W_1^T \\
   W_2^T
   \end{bmatrix},
   \quad\text{where}\quad
   W_1^T \in \mathbb{R}^{d_{\text{in}}^{(1)} \times d_{\text{out}}},\quad
   W_2^T \in \mathbb{R}^{d_{\text{in}}^{(2)} \times d_{\text{out}}},\quad
   d_{\text{in}}^{(1)} + d_{\text{in}}^{(2)} = d_{\text{in}}.
$$

   We also split the input:

$$
   x =
   \begin{bmatrix}
   x_1 & x_2
   \end{bmatrix},
   \quad
   x_1 \in \mathbb{R}^{d_{\text{in}}^{(1)}},\quad
   x_2 \in \mathbb{R}^{d_{\text{in}}^{(2)}}.
$$

   Each GPU holds its own **input slice** and weight slice, and computes: $y_1 = x_1 W_1^T,\quad y_2 = x_2 W_2^T$, then we **sum** the partial outputs: $y = y_1 + y_2$.

2. Column-wise parallel (TP = 2)

   In **column-wise** parallelism, we split $W^T$ by columns (output dimension) into two blocks:

$$
   W^T =
   \begin{bmatrix}
   W_1^T & W_2^T
   \end{bmatrix},
   \quad \text{where} \quad
   W_1^T \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}^{(1)}},\quad
   W_2^T \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}^{(2)}},\quad
   d_{\text{out}}^{(1)} + d_{\text{out}}^{(2)} = d_{\text{out}}.
$$

   Each GPU gets the **full input** $x$ and computes: $y_1 = x W_1^T,\quad y_2 = x W_2^T$, then we **concatenate** along the output dimension: $y = \left[ y_1, y_2 \right]$.

### Root Cause

Our analysis shows that the **row-wise tensor parallel (TP) plan** is a primary driver of the observed spikes in metrics and the instability of the reward model when TP is enabled. Row-wise tensor parallelism inevitably introduces cross-device reductions on the output activations. In the row-wise case, each rank produces a partial output $y_i$, and these partial results must be summed across GPUs to form the final $y = \sum_i y_i$.
Although addition is mathematically associative over the reals, floating-point addition in finite precision is **non-associative**, so [changing the summation order can lead to different numerical results](https://arxiv.org/html/2408.05148v3), and the accumulated error can grow over long reduction chains. This makes large distributed reductions—such as the cross-GPU adds required by row-wise TP—particularly vulnerable to run-to-run variability and small but systematic drift.

By contrast, when the entire reduction is executed within a single device and on the same tensor core pipeline, the execution order and kernel implementation are typically fixed for a given problem size, which tends to yield deterministic and more numerically stable results for repeated runs with the same inputs. In other words, on a single GPU, the hardware and library stack generally ensure that the same matmul and accumulation schedule is reused, so the rounding pattern is at least consistent, even if it is not perfectly exact. However, once the computation is split across multiple GPUs, the final sum depends on the collective communication pattern (for example, ring or tree AllReduce), thread scheduling, and low-level communication libraries. These factors are not guaranteed to be deterministic and can change the effective addition order, leading to additional rounding error and small cross-rank discrepancies in the aggregated outputs.

### Recommended Solutions

To mitigate the numerical instability introduced by row-wise TP (especially the cross-GPU reductions on attention and MLP outputs), we recommend using a **numerically more stable TP plan** that avoids cross-rank summations. Instead of summing partial outputs across GPUs, the stable plan favors **column-wise sharding with local outputs**, so that each rank produces a complete, independent slice of the logits and no inter-GPU add is required on these critical paths.
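The floating-point non-associativity underlying this root cause is easy to reproduce: the same four partial values, added in two different orders, give two different answers even in double precision.

```python
# Same four partial values, two different addition orders.
vals = [1e16, 1.0, -1e16, 1.0]

sequential = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # left-to-right, as on one device
regrouped = (vals[0] + vals[2]) + (vals[1] + vals[3])   # pairwise, as a tree reduction might add

# 1.0 is absorbed when added to 1e16 (it is below one ulp at that magnitude),
# so sequential == 1.0 while regrouped == 2.0.
```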
Below is an example of how the default plan can be adjusted into a more numerically stable configuration. For more details, refer to [NeMo-RL PR #1235](https://github.com/NVIDIA-NeMo/RL/pull/1235).

```python
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

custom_parallel_plan = {
    "model.embed_tokens": RowwiseParallel(input_layouts=Replicate()),
    "model.layers.*.self_attn.q_proj": ColwiseParallel(),
    "model.layers.*.self_attn.k_proj": ColwiseParallel(),
    "model.layers.*.self_attn.v_proj": ColwiseParallel(),
    "model.layers.*.self_attn.o_proj": RowwiseParallel(),
    "model.layers.*.mlp.up_proj": ColwiseParallel(),
    "model.layers.*.mlp.gate_proj": ColwiseParallel(),
    "model.layers.*.mlp.down_proj": RowwiseParallel(),
    "lm_head": ColwiseParallel(output_layouts=Shard(-1), use_local_output=False),
}

numerical_stable_parallel_plan = {
    "model.embed_tokens": RowwiseParallel(input_layouts=Replicate()),
    "model.layers.*.self_attn.q_proj": ColwiseParallel(),
    "model.layers.*.self_attn.k_proj": ColwiseParallel(),
    "model.layers.*.self_attn.v_proj": ColwiseParallel(),
    "model.layers.*.self_attn.o_proj": ColwiseParallel(
        input_layouts=Shard(-1),
        output_layouts=Replicate(),
        use_local_output=True,
    ),
    "model.layers.*.mlp.up_proj": ColwiseParallel(),
    "model.layers.*.mlp.gate_proj": ColwiseParallel(),
    "model.layers.*.mlp.down_proj": ColwiseParallel(
        input_layouts=Shard(-1),
        output_layouts=Replicate(),
        use_local_output=True,
    ),
    "lm_head": ColwiseParallel(output_layouts=Shard(-1), use_local_output=False),
}
```

## Additional Observations and Insights

Beyond the TP-related issues discussed above, our experiments also highlight that **accuracy in RL training is influenced by a broad set of numerical factors**, including attention backends (such as SDPA and FlashAttention-2), GPU architectures (such as *Ampere* vs. *Hopper*), and arithmetic precision settings (such as BF16/FP16/FP8/FP32).
Different inference and training engines often implement the same kernels in different ways, which naturally introduces small discrepancies in floating-point results even when the high-level math is identical. As a result, two systems that are “functionally equivalent” may still produce slightly different logprobs, rewards, or validation metrics.

Figure 6 reports the KL divergence between the logits produced by the Hugging Face stack and those produced by NeMo-RL for the same input sequence. The plot shows that, even with identical data and model weights, the resulting logit distributions differ noticeably across the two execution engines. In our experiments, similar behavior appeared when varying attention implementations and hardware configurations, where we consistently observed measurable numerical discrepancies, although we did not attempt to systematically eliminate every such source of variation.

![](/assets/dtensor-tp-accuracy/kl_hf_prev.png)

Fig 6: The KL divergence between Hugging Face and NeMo RL

The broader research community has proposed multiple strategies to mitigate these issues. We have consulted the following publications:

* [Defeating the Training-Inference Mismatch via FP16](https://arxiv.org/pdf/2510.26788)
* [Accumulator accuracy](https://docs.pytorch.org/docs/stable/notes/cuda.html#reduced-precision-reduction-in-bf16-gemms)
* [Systematic Outliers in Large Language Models](https://arxiv.org/abs/2502.06415)
* [Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)

In our current work, we treat these effects primarily as **background noise** and focus on TP-induced misalignment that has a clear and actionable impact on RL training. A more exhaustive treatment—such as systematically unifying attention backends, enforcing TP-invariant kernels, or integrating compensated summation into critical paths—is left as future engineering work informed by the aforementioned research directions.

diff --git a/fern/v0.5.0/pages/guides/environments.mdx b/fern/v0.5.0/pages/guides/environments.mdx new file mode 100644 index 0000000000..553a36e0aa --- /dev/null +++ b/fern/v0.5.0/pages/guides/environments.mdx

---
title: Environments for GRPO Training
description: ""
---

NeMo RL includes multiple environments for GRPO training, each offering a standard interface for reward computation and evaluation.

## Math Environment

The Math Environment is designed for mathematical reasoning tasks. It evaluates responses to math problems using `math-verify` and provides rewards based on correctness.
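At its core, the reward contract is simple: 1.0 for a verified-correct final answer, 0.0 otherwise. The real environment delegates the checking to `math-verify`; purely as an illustration of that contract, here is a toy stand-in (the `Answer:` marker and the `math_reward` helper are hypothetical, not NeMo RL APIs):

```python
def math_reward(response: str, ground_truth: str) -> float:
    """Toy correctness reward: compare the text after the last 'Answer:' marker."""
    def normalize(s: str) -> str:
        return s.replace(" ", "").rstrip(".").lower()
    answer = response.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0

reward = math_reward("2 + 2 = 4. Answer: 4", "4")  # correct final answer
miss = math_reward("2 + 2 = 5. Answer: 5", "4")    # incorrect final answer
```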
+ +### Key Features +- Evaluates mathematical reasoning +- Supports multiple mathematical domains +- Provides detailed feedback on solution correctness + +### Usage +```python +from nemo_rl.environments.math_environment import MathEnvironment + +env_config = { + "num_workers": 2, +} + +math_env = MathEnvironment.remote(env_config) +``` + +## Code Environment + +The Code Environment is designed for code generation and execution tasks. It provides a sandboxed environment for executing Python code and evaluating the results. + +### Usage +```python +from nemo_rl.environments.code_environment import CodeEnvironment + +env_config = { + "num_workers": 2, + "terminate_on_evaluation": True, # Terminate after code execution +} + +code_env = CodeEnvironment.remote(env_config) +``` + +### Configuration +- `num_workers`: Number of parallel workers for code execution +- `terminate_on_evaluation`: Whether to terminate after code execution (True for single-turn, False for multi-turn). + +We are tracking an end-to-end example of this environment in [#858](https://github.com/NVIDIA-NeMo/RL/issues/858). Add a 👍 to show your interest. + +## Code Jaccard Environment + +The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable. + +### How It Works +- Extracts the assistant’s response text from each conversation. +- Computes a Jaccard similarity score between the response and ground truth: + - Tokenizes both texts by whitespace, computes intersection/union, then applies a length ratio penalty. + - Scores are in [0, 1]. Observations label responses as “aligned/misaligned” using a 0.5 threshold. +- Returns: + - observations: Environment feedback strings. + - rewards: Tensor of similarity scores. + - terminateds: All ones (single-step episodes). + - answers: The response text when requested (optional). 
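The scoring steps above can be sketched in a few lines. This is an illustrative reimplementation, not the environment's exact code; in particular, the form of the length-ratio penalty shown here is an assumption:

```python
def jaccard_reward(response: str, ground_truth: str) -> float:
    """Whitespace-token Jaccard similarity with a length-ratio penalty, in [0, 1]."""
    a, b = set(response.split()), set(ground_truth.split())
    if not a or not b:
        return 0.0
    jaccard = len(a & b) / len(a | b)
    # Penalize large length mismatches (illustrative form of the penalty).
    length_penalty = min(len(response), len(ground_truth)) / max(len(response), len(ground_truth))
    return jaccard * length_penalty

score = jaccard_reward("def add(a, b): return a + b", "def add(a, b): return a + b")
aligned = score >= 0.5  # the environment labels responses using a 0.5 threshold
```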
+ +### Usage +```python +from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment + +env_config = { + "num_workers": 2, + # Optional default stop strings (unused in scoring but available for consistency) + "stop_strings": None, +} + +code_jaccard_env = CodeJaccardEnvironment.remote(env_config) +``` + +### Configuration +- `num_workers` (int): Number of parallel verification workers. +- `stop_strings` (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring). + +### Sample GRPO Config +```yaml +env: + code_jaccard: + num_workers: 2 + stop_strings: null +data: + env_name: code_jaccard +``` + +## Reward Model Environment + +The Reward Model Environment uses pre-trained reward models to score conversation quality. + +### Usage +```python +from nemo_rl.environments.reward_model_environment import RewardModelEnvironment + +env_config = { + "enabled": True, + "model_name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B", + "tokenizer": {"name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B"}, + "precision": "bfloat16", + "batch_size": 32, + "resources": {"gpus_per_node": 1, "num_nodes": 1}, + "reward_model_cfg": { + "enabled": True, + "reward_model_type": "bradley_terry", + }, +} + +reward_env = RewardModelEnvironment.remote(env_config) +``` + +### Resource Allocation in GRPO Training + +In GRPO training, resources are allocated across three main components: + +- **Policy Actor**: The trained model. +- **Generation Actor**: Used for generating responses during rollouts (can be colocated with policy or on separate nodes/GPUs). +- **Reward Model Environment Actor**: Evaluates generated responses and computes rewards. + +The resource allocation logic works as follows: + +#### Single-Node Setup (`num_nodes: 1`) +- All components share the same node +- GPUs are divided between policy training, generation, and reward model +- Example: + 1. 
Policy and generation colocated: 8 GPUs total = 4 for colocated policy and generation + 4 for reward model + 2. Policy and generation non-colocated: 8 GPUs total = 2 for policy + 2 for generation + 4 for reward model + +#### Multi-Node Setup (`num_nodes > 1`) +- Policy training, generation, and reward model environment can be distributed across different nodes. +- Reward model gets dedicated resources as specified in `env.reward_model.resources`. +- Generation gets dedicated resources as specified in `policy.generation.colocated.resources`. +- Remaining nodes are allocated to policy training. + +In the future, the resource control part will be refactored to enable fine-grained resource configuration for each actor. For detailed resource management and optimization strategies, see [#1100](https://github.com/NVIDIA-NeMo/RL/issues/1100). + +### Complete GRPO Training with Reward Model Environments + +See [examples/run_grpo.py](/../../examples/run_grpo.py) with [examples/configs/grpo_rm_1B.yaml](/../../examples/configs/grpo_rm_1B.yaml) for a complete example of using the reward model environment with GRPO training. + +```bash +uv run examples/run_grpo.py --config examples/configs/grpo_rm_1B.yaml +``` + +## Registering Custom Environments + +NeMo RL provides a flexible environment registration mechanism that allows you to add custom environments without modifying the source code. + +### Using the `register_env` Interface + +You can use the `register_env` function to dynamically register new environments without modifying NeMo RL's internal code. 
+ +**Function Signature** + +```python +from nemo_rl.environments.utils import register_env + +register_env(env_name: str, actor_class_fqn: str) -> None +``` + +**Parameters:** + +- `env_name`: Unique identifier name for the environment (string) +- `actor_class_fqn`: Fully Qualified Name of the environment Actor class, in the format `'module.path.ClassName'` + +### Example: Registering a Custom Environment + +Suppose you've created a custom reinforcement learning environment for code generation tasks: + +**1. Create Your Custom Environment Actor Class** + +```python +# File: my_custom_envs/code_gen_env.py +import ray +from nemo_rl.environments.interfaces import EnvironmentInterface + +@ray.remote +class CodeGenEnvironmentActor(EnvironmentInterface): + """Custom code generation environment.""" + + def __init__(self, config): + self.config = config + # Initialize your environment + + async def reset(self): + # Reset environment logic + return initial_state + + async def step(self, action): + # Execute action, return reward, etc. + return observation, reward, done, info + + # Implement other required interface methods... +``` + +**2. Register the Environment in Your Training Script** + +```python +# File: train.py +from nemo_rl.environments.utils import register_env + +# Register your custom environment +register_env( + env_name="code_gen", + actor_class_fqn="my_custom_envs.code_gen_env.CodeGenEnvironmentActor" +) + +# Now you can use "code_gen" in your config +# Training code... +``` + +**3. 
Use the Registered Environment in Your Config**

+```yaml
+# config.yaml
+env:
+  code_gen:
+    num_workers: 2
+    max_code_length: 512
+    test_cases_per_problem: 5
+
+data:
+  env_name: code_gen  # Use your registered environment name
+```
diff --git a/fern/v0.5.0/pages/guides/eval.mdx b/fern/v0.5.0/pages/guides/eval.mdx
new file mode 100644
index 0000000000..56f4a6a6ad
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/eval.mdx
@@ -0,0 +1,115 @@
+---
+title: Evaluation
+description: ""
+---
+
+This document explains how to use an evaluation script for assessing model capabilities.
+
+## Prepare for Evaluation
+
+To prepare for evaluation, first ensure your model is in the correct format, which may involve an optional conversion of PyTorch DCP checkpoints to the HuggingFace format. Following this, you need to prepare the evaluation configuration, which includes defining prompt templates and any custom settings required to run the evaluation.
+
+### Convert DCP to HF (Optional)
+If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the HuggingFace format before running evaluation.
+
+Use the `examples/converters/convert_dcp_to_hf.py` script. You'll need the path to the training configuration file (`config.yaml`), the DCP checkpoint directory, and an output path for the HF-format model.
+
+```sh
+# Example for a GRPO checkpoint at step 170
+uv run python examples/converters/convert_dcp_to_hf.py \
+    --config results/grpo/step_170/config.yaml \
+    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
+    --hf-ckpt-path results/grpo/hf
+```
+> **Note:** Adjust the paths according to your training output directory structure.
+
+Once the conversion is complete, you can override the `generation.model_name` to point to the directory containing the converted HF model in [this section](#run-the-evaluation-script). 
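Before pointing evaluation at the converted directory, a quick sanity check can catch an incomplete conversion early. This is an optional convenience sketch; the file names reflect the typical HF checkpoint layout and are not guaranteed by the converter:

```python
from pathlib import Path

def looks_like_hf_checkpoint(ckpt_dir: str) -> bool:
    """Heuristically check that a directory resembles an HF model checkpoint."""
    d = Path(ckpt_dir)
    has_config = (d / "config.json").is_file()
    # Weights may be a single file or sharded; match common naming patterns
    has_weights = any(d.glob("*.safetensors")) or any(d.glob("pytorch_model*.bin"))
    return has_config and has_weights

# e.g. looks_like_hf_checkpoint("results/grpo/hf")
```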
+
+### Prepare the Evaluation Configuration
+**Override with Custom Settings**
+
+To run the evaluation, you can use the [default configuration file](/../../examples/configs/evals/eval.yaml). Alternatively, you can specify a custom one or override some settings via the command line.
+
+The default configuration employs greedy sampling to evaluate Qwen2.5-Math-1.5B-Instruct on AIME-2024.
+
+**Prompt Template Configuration**
+
+Always remember to use the same prompt and `chat_template` that were used during training.
+
+For open-source models, we recommend setting `tokenizer.chat_template=default`, `data.prompt_file=null` and `data.system_prompt_file=null` to allow them to use their native chat templates.
+
+## Run the Evaluation Script
+
+We will use the `run_eval.py` script to run an evaluation using a model directly from the HuggingFace Hub or from a local path that is already in HuggingFace format.
+
+Note that the evaluation script only supports models in the HuggingFace format. If you haven't converted your DCP-format model yet, go back to [Convert DCP to HF](#convert-dcp-to-hf-optional) and follow the guide to convert it. 
+ +```sh +# Run evaluation script with default config (examples/configs/evals/eval.yaml) +uv run python examples/run_eval.py + +# Run evaluation script with converted model +uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf + +# Run evaluation script with Qwen3 model under thinking mode +uv run python examples/run_eval.py \ + generation.model_name=Qwen/Qwen3-8B \ + generation.temperature=0.6 \ + generation.top_p=0.95 \ + generation.top_k=20 \ + generation.vllm_cfg.max_model_len=38912 \ + tokenizer.chat_template_kwargs.enable_thinking=true \ + data.prompt_file=examples/prompts/cot.txt + +# Run evaluation script with custom config file +uv run python examples/run_eval.py --config path/to/custom_config.yaml + +# Run evaluation script on one of the supported benchmarks (e.g., GPQA) +uv run python examples/run_eval.py --config examples/configs/evals/gpqa_eval.yaml + +# Run evaluation script with a local dataset where the problem and solution keys are "Question" and "Answer" respectively. +uv run python examples/run_eval.py \ + --config examples/configs/evals/local_eval.yaml \ + data.dataset_name=/path/to/local/dataset \ + data.problem_key=Question \ + data.solution_key=Answer + +# Override specific config values via command line +# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs +# Pass@1 accuracy averaged over 16 samples for each problem +uv run python examples/run_eval.py \ + --config examples/configs/evals/math_eval.yaml \ + generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \ + generation.temperature=0.6 \ + generation.top_p=0.95 \ + generation.vllm_cfg.max_model_len=32768 \ + data.dataset_name=math500 \ + eval.num_tests_per_prompt=16 \ + cluster.gpus_per_node=8 +``` +> **Note:** Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings. 
+
+## Example Evaluation Output
+
+When you complete the evaluation, you will receive a summary similar to the following.
+
+```
+============================================================
+model_name='Qwen2.5-Math-1.5B-Instruct' dataset_name='aime2024'
+max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1 seed=42
+
+metric=pass@1 num_tests_per_prompt=1
+
+score=0.1000 (3.0/30)
+============================================================
+```
+
+## List of Currently Supported Benchmarks
+
+- [AIME-2024 and AIME-2025](/../../nemo_rl/data/datasets/eval_datasets/aime.py): the corresponding `data.dataset_name` values are `"aime2024"` and `"aime2025"`.
+- [GPQA and GPQA-diamond](/../../nemo_rl/data/datasets/eval_datasets/gpqa.py): the corresponding `data.dataset_name` values are `"gpqa"` and `"gpqa_diamond"`.
+- [MATH and MATH-500](/../../nemo_rl/data/datasets/eval_datasets/math.py): the corresponding `data.dataset_name` values are `"math"` and `"math500"`.
+- [MMLU](/../../nemo_rl/data/datasets/eval_datasets/mmlu.py): this also includes MMMLU (Multilingual MMLU), which covers 14 languages. When `data.dataset_name` is set to `mmlu`, the English version is used. To run evaluation on another language, set `data.dataset_name` to `mmlu_{language}`, where `language` is one of the following 14 values: `["AR-XY", "BN-BD", "DE-DE", "ES-LA", "FR-FR", "HI-IN", "ID-ID", "IT-IT", "JA-JP", "KO-KR", "PT-BR", "ZH-CN", "SW-KE", "YO-NG"]`.
+- [MMLU-Pro](/../../nemo_rl/data/datasets/eval_datasets/mmlu_pro.py): the corresponding `data.dataset_name` is `"mmlu_pro"`.
+
+More details can be found in [load_eval_dataset](/../../nemo_rl/data/datasets/eval_datasets/__init__.py). 
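The reported score is pass@1 averaged over `num_tests_per_prompt` samples for each problem. As a rough sketch of how such a number comes about (illustrative only, not the evaluator's code):

```python
def mean_pass_at_1(results: list[list[bool]]) -> float:
    """Pass@1 averaged over multiple samples per prompt.

    `results[i]` holds per-sample correctness for prompt i
    (num_tests_per_prompt entries each).
    """
    per_prompt = [sum(samples) / len(samples) for samples in results]
    return sum(per_prompt) / len(per_prompt)

# 30 prompts, 1 sample each, 3 correct -> 3/30 = 0.1, as in the summary above
results = [[True]] * 3 + [[False]] * 27
print(f"{mean_pass_at_1(results):.4f}")  # 0.1000
```

Raising `eval.num_tests_per_prompt` grows the inner lists, which reduces the variance of the reported score.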
diff --git a/fern/v0.5.0/pages/guides/ft-launcher-guide.mdx b/fern/v0.5.0/pages/guides/ft-launcher-guide.mdx
new file mode 100644
index 0000000000..772258b7dd
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/ft-launcher-guide.mdx
@@ -0,0 +1,61 @@
+---
+title: Fault Tolerance Launcher Guide
+description: ""
+---
+
+The `ft_launcher` is provided by `nvidia-resiliency-ext` (included in NeMo RL dependencies) and enables automatic fault tolerance and recovery for distributed training runs.
+
+## Key Arguments
+
+| Argument | Description | Example |
+|----------|-------------|---------|
+| `--ft-cfg-path` | Path to FT YAML config file | `examples/ft_launcher/ft_config.yaml` |
+| `--ft-rank-heartbeat-timeout` | Heartbeat timeout in seconds | `450` |
+| `--ft-initial-rank-heartbeat-timeout` | Initial timeout (longer for setup) | `1200` |
+| `--max-restarts` | Maximum number of restart attempts | `5` |
+
+## Basic Usage
+
+```bash
+uv run ft_launcher \
+    --ft-cfg-path examples/ft_launcher/ft_config.yaml \
+    --ft-rank-heartbeat-timeout 450 \
+    --ft-initial-rank-heartbeat-timeout 1200 \
+    --max-restarts 5 \
+    examples/run_grpo.py \
+    --config <path_to_config>
+```
+
+## FT Config File (examples/ft_launcher/ft_config.yaml)
+
+```yaml
+fault_tolerance:
+  initial_rank_heartbeat_timeout: 360
+  restart_policy: any-failed
+```
+
+## Important Notes
+
+1. **Checkpointing**: Enable checkpointing for recovery to work:
+   ```bash
+   ++checkpointing.enabled=true
+   ++checkpointing.checkpoint_dir=/path/to/checkpoints
+   ++checkpointing.save_period=50
+   ```
+
+2. **Timeouts**: Set `--ft-initial-rank-heartbeat-timeout` higher than `--ft-rank-heartbeat-timeout` to allow for model loading/setup time.
+
+3. **Restart Policy**: The `any-failed` restart policy will restart the entire job if any rank fails. Look for these log messages to identify when a restart occurs:
+
+   ```
+   [ERROR] [ft_launcher...] failed (exitcode: 1) local_rank: 0 (pid: ...) of binary: ...
+   [INFO] [ft_launcher...] [default] Worker group FAILED. 
3/5 attempts left; will restart worker group + [INFO] [ft_launcher...] Stopping workers... Timeout = 30 sec. + [INFO] [ft_launcher...] The node '...' attempts to join the next round of the rendezvous '...'. + [INFO] [ft_launcher...] The node '...' has joined round N of the rendezvous '...' as rank 0 in a world of size 1. + ``` + + Key indicators: + - `Worker group FAILED. X/Y attempts left` - shows a restart is happening and remaining attempts + - `will restart worker group` - confirms restart is in progress + - `has joined round N` - the round number increases with each restart diff --git a/fern/v0.5.0/pages/guides/grpo-deepscaler.mdx b/fern/v0.5.0/pages/guides/grpo-deepscaler.mdx new file mode 100644 index 0000000000..e2d9bb7b0c --- /dev/null +++ b/fern/v0.5.0/pages/guides/grpo-deepscaler.mdx @@ -0,0 +1,56 @@ +--- +title: GRPO on DeepScaler +description: "" +--- + +This guide explains how to use NeMo RL to train long Chain of Thought (CoT) reasoning models with Group Relative Policy Optimization (GRPO). To do so, we train [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [AIME24](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) benchmark. + +## Train the Model +We follow the DeepScaleR recipe and train the model in three stages. In the first stage, we train with an 8K context window. In the second stage, we train with a 16K context window. In the third stage, we train with a 24K context window. +To train the model using NeMo RL, use the `examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml` config file. This file closely matches the experiment settings in the original DeepScaleR recipe. 
We then train with `examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml` and `examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml` for the second and third stages, respectively.
+
+```sh
+uv run examples/run_grpo.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml
+uv run examples/run_grpo.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
+uv run examples/run_grpo.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf
+```
+
+At the end of each stage, you need to specify the Hugging Face checkpoint to continue training with. To get this checkpoint, we convert a model checkpoint to a Hugging Face checkpoint with the following command:
+
+```sh
+uv run examples/converters/convert_dcp_to_hf.py --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf
+```
+
+When launching the next stage, we use this Hugging Face checkpoint as the initial checkpoint. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. We run all experiments on a single 8XH100 80GB node. If you're running on 8XA100 80GB, you will need at least 1 node for 8K training and 2 nodes for 16K and 24K training.
+
+## Training Curve
+When using the above commands, we get the following training curve:
+
+![Training Performance](/assets/deepscaler_training_progress.png)
+
+Notably, we are able to achieve an average training reward of 0.65 in just 400 training steps.
+
+## Evaluate the Model
+Throughout training, the checkpoints of the model will be saved to the `results` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, we first need to convert the PyTorch distributed checkpoint to Hugging Face format as before. 
Then, to evaluate on the [AIME24 benchmark](https://huggingface.co/datasets/HuggingFaceH4/aime_2024), use the following command:
+
+```sh
+uv run examples/run_eval.py \
+    generation.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf \
+    data.prompt_file=examples/prompts/cot.txt \
+    generation.vllm_cfg.max_model_len=32768 \
+    generation.vllm_cfg.enforce_eager=True \
+    generation.temperature=1.0
+```
+
+Use `generation.model_name` to specify the path to the Hugging Face checkpoint. In addition, we use AIME24 as the validation dataset and calculate pass@1 on it throughout training.
+
+> [!NOTE]
+> AIME24 only has 30 examples, so the accuracy can be very noisy.
+> To reduce the variance, consider running `run_eval.py` with `eval.num_tests_per_prompt=16`.
+
+## Evaluation Results
+Using the above instructions to train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset, we can track the model's performance on the AIME24 benchmark throughout training. The following plot shows the evaluation metrics as training progresses:
+
+![AIME24 Performance](/assets/aime_training_progress.png)
+
+We are able to surpass OpenAI O1's performance on the AIME24 benchmark with about 600 training steps.
diff --git a/fern/v0.5.0/pages/guides/grpo-sliding-puzzle.mdx b/fern/v0.5.0/pages/guides/grpo-sliding-puzzle.mdx
new file mode 100644
index 0000000000..45b60b5740
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/grpo-sliding-puzzle.mdx
@@ -0,0 +1,295 @@
+---
+title: Solve a Sliding Puzzle Using GRPO
+description: ""
+---
+
+This guide explains how to use NeMo RL to train a model to solve the classic **n×n sliding puzzle** game through multi-turn reinforcement learning. This environment implements a classic **n×n sliding puzzle** where numbered tiles must be arranged in sequential order by sliding them into an empty space.
+
+The sliding puzzle task serves as a simple yet effective example to illustrate how multi-turn RL and tool-calling are implemented within NeMo RL. 
This example provides a minimal setup for understanding the core components of Group Relative Policy Optimization (GRPO) and sequential decision-making. + +## Quick Start Guide + +### 1. Install and Set Up NeMo RL with Megatron Backend (Optional) + +To get started, clone and set up the NeMo RL repository by initializing submodules, installing CUDA dependencies, and configuring the environment with uv. Refer to [Prerequisites](https://github.com/NVIDIA-NeMo/RL/tree/main?tab=readme-ov-file#prerequisites) for detailed instructions on installation. + +### 2. Train a Model + +Train a model to solve the sliding puzzle using GRPO with the default 2×2 configuration. + +```bash +uv run python examples/run_grpo_sliding_puzzle.py +``` + +### 3. Customize Puzzle Configuration + +By default, this training script uses the configuration in [grpo_sliding_puzzle.yaml](/../../examples/configs/grpo_sliding_puzzle.yaml). You can customize parameters with command-line overrides to experiment with different puzzle sizes or levels of difficulty. +```bash +# Train on a 3×3 puzzle with 10 random moves to scramble the board +uv run python examples/run_grpo_sliding_puzzle.py \ + env.sliding_puzzle_game.cfg.game_config.size=3 \ + env.sliding_puzzle_game.cfg.game_config.shuffle_moves=10 +``` + +### 4. Monitor Progress + +You can enable logging via Weights & Biases and TensorBoard to monitor training metrics such as rewards, success rate, and loss curves. 
+
+```bash
+# Enable logging (optional)
+uv run examples/run_grpo_sliding_puzzle.py \
+    --config examples/configs/grpo_sliding_puzzle.yaml \
+    logger.wandb_enabled=true \
+    logger.tensorboard_enabled=true
+```
+
+## Game Mechanics
+
+### Puzzle Structure
+
+The sliding puzzle consists of:
+- **Grid**: An `n×n` grid with numbered tiles and one empty space
+- **Tiles**: Numbered from `1` to `n²-1`, placed in random order
+- **Empty Space**: Represented by `0`, typically starting at the bottom-right corner
+- **Goal State**: Sequential arrangement `1, 2, 3, ..., n²-1` with `0` at bottom-right
+
+### Example Data Sample
+```
+===== SLIDING PUZZLE =====
+Arrange the 3x3 grid by sliding tiles into the empty space.
+- The goal is to arrange numbers from 1 to 8 in order
+- Use 'up', 'down', 'left', 'right' to slide in that direction
+- Use 'view' to see the current state of the board
+
+Current Board State:
+
+   +---------+
+ 1 | 1     3 |
+ 2 | 4  2  5 |
+ 3 | 7  8  6 |
+   +---------+
+     1  2  3
+
+Reach the goal state where numbers are ordered 1 through 8 with the empty space (0) at the bottom right.
+Valid actions: 'up', 'down', 'left', 'right', or 'slide row col' (e.g., 'slide 1 2').
+After thinking, output your chosen action on a new line starting with '<action>' like this:
+<action>your_action</action>
+If you just want to see the board, output <action>view</action>
+Think carefully step-by-step before acting.
+
+```
+
+### Movement Rules
+
+1. **Valid Moves**: Only tiles adjacent to the empty space `0` can be moved.
+2. **Movement Direction**: Tiles slide into the empty space, not the other way around.
+3. **Grid Boundaries**: Moves that would go beyond the grid are invalid.
+4. **Single Tile Movement**: Each action affects only one tile at a time. 
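The goal state described above can be constructed and checked with a few lines. The helper names here are illustrative, not part of the NeMo RL API:

```python
def solved_grid(n: int) -> list[list[int]]:
    """Goal state: 1..n²-1 in order with 0 (the empty space) at the bottom-right."""
    flat = list(range(1, n * n)) + [0]
    return [flat[i * n:(i + 1) * n] for i in range(n)]

def is_solved(grid: list[list[int]]) -> bool:
    return grid == solved_grid(len(grid))

print(solved_grid(2))  # [[1, 2], [3, 0]]
```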
+
+All actions must be wrapped in XML-style tags and follow one of the formats below:
+```xml
+<action>up</action>        <!-- Slide a tile up into the empty space -->
+<action>slide 2 1</action> <!-- Slide tile at row 2, column 1 -->
+<action>view</action>      <!-- View the current board state -->
+```
+
+## Data Generation
+
+### Configuration Parameters
+
+Sliding puzzle instances are generated using the following parameters, which can be customized via the configuration file:
+
+```yaml
+env:
+  sliding_puzzle_game:
+    cfg:
+      game_config:
+        size: 5  # Size of the puzzle grid (e.g., 3x3, 4x4, 5x5)
+        shuffle_moves: 4  # Number of random moves to scramble the puzzle
+      max_moves: 40  # Maximum number of moves allowed per episode
+```
+#### Description
+
+- **`size`**: Determines the dimensions of the puzzle board (`n×n`).
+- **`shuffle_moves`**: Controls the initial difficulty by randomly moving tiles to scramble the puzzle.
+- **`max_moves`**: Sets an upper limit on the number of actions the agent can take in one episode.
+
+Grids are generated with sizes ranging from 2 to `game_config.size`. Each grid starts from a solved state and is shuffled by moving random tiles into the empty space n times, where n is a random number between 1 and `shuffle_moves`. The grid is shuffled using only valid moves.
+The `generate_puzzle_datum()` function in [run_grpo_sliding_puzzle.py](/../../examples/run_grpo_sliding_puzzle.py) is responsible for generating the dataset. [sliding_puzzle.py](/../../nemo_rl/environments/games/sliding_puzzle.py) contains the `SlidingPuzzleGameLogic` class, responsible for puzzle generation and initialization logic. The number of shuffle moves and the size of the grid control puzzle difficulty. 
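The shuffling procedure just described — start from the solved state and repeatedly slide a random valid neighbor into the empty space — can be sketched as follows. This is an illustration; the actual logic lives in `SlidingPuzzleGameLogic`:

```python
import random

def shuffle_by_valid_moves(grid: list[list[int]], shuffle_moves: int, seed=None) -> list[list[int]]:
    """Scramble a grid by sliding a random neighboring tile into the empty
    space `shuffle_moves` times, so the puzzle always stays solvable."""
    rng = random.Random(seed)
    n = len(grid)
    # Locate the empty cell (0)
    r, c = next((i, j) for i in range(n) for j in range(n) if grid[i][j] == 0)
    for _ in range(shuffle_moves):
        neighbors = [(r + dr, c + dc)
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= r + dr < n and 0 <= c + dc < n]
        nr, nc = rng.choice(neighbors)
        # Slide the chosen tile into the empty space
        grid[r][c], grid[nr][nc] = grid[nr][nc], grid[r][c]
        r, c = nr, nc
    return grid
```

Because only legal slides are used, the scrambled grid is always reachable from the goal state (and vice versa).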
+
+#### Generation Algorithm
+The puzzle configuration is randomly generated by sampling the grid size and number of shuffling moves within the defined maximums:
+
+```python
+def generate_random_config(max_config: dict[str, Any]) -> dict[str, Any]:
+    """Generate a random config for the sliding puzzle game."""
+    shuffle_moves = random.randint(1, max_config.get("shuffle_moves"))
+    if shuffle_moves % 2 == 0:
+        shuffle_moves += 1  # Ensure odd number for proper scrambling
+    return {
+        "size": random.randint(2, max_config.get("size", 3)),
+        "shuffle_moves": shuffle_moves,
+    }
+
+# Usage during dataset generation:
+game_config = generate_random_config(game_config)
+initial_game_state = SlidingPuzzleGameLogic.generate(game_config)
+initial_render = SlidingPuzzleGameLogic.render(initial_game_state)
+welcome_message = SlidingPuzzleGameLogic.init(initial_game_state)
+```
+
+### Dataset Size Calculation
+
+Dataset size is defined by parameters in `grpo_sliding_puzzle.yaml`:
+```
+Training Size = num_prompts_per_step × num_generations_per_prompt × max_num_steps
+Validation Size = max_val_samples
+```
+
+### Data Structure
+
+Each training sample is returned as a `DatumSpec` dictionary with the following structure:
+
+```python
+datum: DatumSpec = {
+    "message_log": message_log,  # Conversation history
+    "length": len(tokenized_prompt),  # Token count
+    "extra_env_info": metadata,  # Game state metadata
+    "loss_multiplier": 1.0,  # Training weight
+    "idx": idx,  # Sample index
+    "task_name": task_name,  # Task identifier
+    "stop_strings": ["</action>"],  # Termination tokens
+}
+```
+
+## Environment Interface
+
+{/* ### Architecture Flow
+
+```
+GRPO Training Pipeline:
+run_grpo_sliding_puzzle.grpo_train → nemo_rl.experience.rollouts.run_multi_turn_rollouts → generate_response
+    calculate_reward → environments.games.sliding_puzzle.SlidingPuzzleEnv.step
+``` */}
+
+### Core Classes
+
+The [sliding_puzzle.py](/../../nemo_rl/environments/games/sliding_puzzle.py) module defines the environment and the logic for interacting with the 
environment. The core classes used are outlined below: + +#### SlidingPuzzleEnv +The SlidingPuzzleEnv class serves as the main environment, implementing a Ray remote actor for distributed processing and using functions from both the SlidingPuzzleGameLogic and SlidingPuzzleRunner classes to interact with the environment. + +```python +@ray.remote +class SlidingPuzzleEnv(EnvironmentInterface): + def __init__(self, cfg: Optional[SlidingPuzzleConfig] = None): + """Initialize environment with configuration.""" + + def step( + self, + message_log_batch: list[LLMMessageLogType], + metadata_batch: list[SlidingPuzzleMetadata], + ) -> EnvironmentReturn: + """Process batch of interactions.""" +``` + +#### SlidingPuzzleGameLogic +The SlidingPuzzleGameLogic class defines the core game mechanics through static methods for puzzle operations and includes functionality for reward calculation. + +```python +class SlidingPuzzleGameLogic: + @staticmethod + def generate(config: dict[str, Any]) -> dict[str, Any]: + """Generate new puzzle with specified configuration.""" + + @staticmethod + def init(game_state: dict[str, Any]) -> str: + """Create welcome message with game rules.""" + + @staticmethod + def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]: + """Execute action and return (response, reward, terminated, new_state).""" + + @staticmethod + def render(game_state: dict[str, Any]) -> str: + """Render current puzzle state as visual grid.""" +``` + +#### SlidingPuzzleRunner + +The SlidingPuzzleRunner class handles turn processing and action management. 
+
+```python
+class SlidingPuzzleRunner:
+    def __init__(self):
+        """Initialize runner with no persistent state."""
+
+    def _parse_action(self, text: str) -> Optional[str]:
+        """Extract action from model response using XML tag parsing."""
+
+    def process_turn(
+        self,
+        message_log: LLMMessageLogType,
+        metadata: SlidingPuzzleMetadata,
+    ) -> tuple[dict[str, str], float, bool, Optional[list[str]], Optional[SlidingPuzzleMetadata]]:
+        """Process single turn and return (response_dict, reward, terminated, stop_strings, updated_metadata)."""
+```
+
+### Processing Pipeline
+
+The step function creates a processing pipeline where each class handles specific responsibilities:
+
+1. **Parse Action** (`SlidingPuzzleRunner`): Extracts the action from the model response using XML tag parsing via the `process_turn` method.
+2. **Validate Move** (`SlidingPuzzleGameLogic`): Checks whether the action is valid for the current game state.
+3. **Execute Action** (`SlidingPuzzleGameLogic`): Applies the move to the game state using the `SlidingPuzzleGameLogic.step` method.
+4. **Calculate Reward** (`SlidingPuzzleGameLogic`): Assigns a reward based on progress toward solving the puzzle (step function).
+5. **Return Results** (`SlidingPuzzleEnv`): Returns the updated interaction state as an `EnvironmentReturn` object.
+
+## Reward System
+
+### Reward Structure
+
+The environment uses a sparse reward scheme designed to encourage complete solution strategies, rather than incremental progress or reward hacking.
+
+| Condition | Reward | Termination |
+|-----------|--------|-------------|
+| Valid move (non-solving) | 0.0 | False |
+| Invalid move | 0.0 | False |
+| Puzzle solved | 1.0 | True |
+| Max moves reached | 0.0 | True |
+| Invalid action format | 0.0 | False |
+
+> **Goal:** The agent receives a reward only upon successfully solving the puzzle, promoting long-horizon planning. 
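The XML-tag parsing in step 1 of the pipeline can be sketched with a regular expression. The `<action>` tag name and the choice to take the last match are assumptions for illustration; the real `_parse_action` may differ:

```python
import re
from typing import Optional

_ACTION_RE = re.compile(r"<action>\s*(.*?)\s*</action>", re.DOTALL)

def parse_action(text: str) -> Optional[str]:
    """Extract the final <action>...</action> span from a model response."""
    matches = _ACTION_RE.findall(text)
    return matches[-1] if matches else None
```

Returning `None` for a missing or malformed tag corresponds to the "Invalid action format" row in the reward table above.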
+
+### Reward Calculation Logic
+
+```python
+def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
+    """Process action and calculate reward.
+
+    Simplified excerpt: `move_made`, `new_state`, and `response` are produced
+    by the surrounding move-validation and execution logic.
+    """
+    reward = 0.0
+    is_terminated = False
+
+    if move_made:
+        # Check if puzzle is solved
+        if new_state["grid"] == new_state["solution"]:
+            reward = 1.0
+            is_terminated = True
+        else:
+            reward = 0.0  # No reward for non-solving moves
+
+    return response, reward, is_terminated, new_state
+```
+
+## Results
+
+We fine-tuned [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on synthetic data for 120 steps using the following configuration settings:
+
+```yaml
+game_config:
+  size: 5  # Size of the puzzle (e.g., 2 for 2x2, 3 for 3x3)
+  shuffle_moves: 10  # Number of random moves to shuffle the solved state
+max_moves: 30
+```
+
+The figure below displays training rewards vs. steps, along with validation accuracy.
+
+![Training Curve](/assets/train-reward-sliding-puzzle.png)
+
+![Validation Accuracy](/assets/valid_acc-sliding-puzzle.png)
diff --git a/fern/v0.5.0/pages/guides/grpo.mdx b/fern/v0.5.0/pages/guides/grpo.mdx
new file mode 100755
index 0000000000..a26735669d
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/grpo.mdx
@@ -0,0 +1,464 @@
+---
+title: An In-depth Walkthrough of GRPO in NeMo RL
+description: ""
+---
+
+This guide details the Group Relative Policy Optimization (GRPO) implementation within NeMo RL. We walk through data handling, policy model training, fast generation, and the GRPO loss function.
+
+## Quickstart: Launch a GRPO Run
+
+To get started quickly, use the script [examples/run_grpo.py](/../../examples/run_grpo.py), which demonstrates how to train a model on math problems using GRPO. You can launch this script locally or through Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](/../cluster). 
+
+We recommend launching the job using `uv`:
+
+```bash
+uv run examples/run_grpo.py --config <path_to_yaml_config> {overrides}
+```
+
+If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](/../../examples/configs/grpo_math_1B.yaml).
+
+**Reminder**: Do not forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+
+In this guide, we'll walk through how we handle:
+
+* Data
+* Model training
+* Fast generation
+* Overall resource flow
+* Loss
+
+### Data
+
+We support training with multiple RL "Environments" at the same time.
+
+An [Environment](/../../nemo_rl/environments/interfaces.py) is an object that accepts a state/action history and returns an updated state and rewards for the step. Environments run as Ray remote actors; see [MathEnvironment](/../../nemo_rl/environments/math_environment.py) for an example.
+
+To support this, we need to know:
+
+* What environments you have
+* Which data should go to which environments
+* How to prepare the data from your dataset into a form we can use
+
+#### Dataset
+
+GRPO datasets in NeMo RL are encapsulated using classes. Each GRPO data class is expected to have the following attributes:
+ 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below.
+ 2. `task_name`: A string identifier that uniquely identifies the dataset.
+
+GRPO datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](/../design-docs/chat-datasets) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [response_datasets/deepscaler.py](/../../nemo_rl/data/datasets/response_datasets/deepscaler.py) has an example:
+
+**Note:** The `task_name` field is required in each formatted example. 
+
+```python
+def format_data(self, data: dict[str, Any]) -> dict[str, Any]:
+    return {
+        "messages": [
+            {"role": "user", "content": data["problem"]},
+            {"role": "assistant", "content": data["answer"]},
+        ],
+        "task_name": self.task_name,
+    }
+```
+
+By default, NeMo RL has some built-in supported datasets (e.g., [OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [Squad](/../../nemo_rl/data/datasets/response_datasets/squad.py), etc.). You can see the full list [here](/../../nemo_rl/data/datasets/response_datasets/__init__.py).
+All of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
+
+We provide a [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py) class for loading JSONL-formatted response datasets from a local path or the Hugging Face Hub. Use `input_key` and `output_key` to specify which fields in your data correspond to the question and answer, respectively. Here's an example configuration:
+```yaml
+data:
+  # other data settings, see `examples/configs/grpo_math_1B.yaml` for more details
+  ... 
+  # dataset settings
+  train:
+    # this dataset will override input_key and use the default values for other vars
+    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
+    input_key: question
+    subset: null # used for HuggingFace datasets
+    split: train # used for HuggingFace datasets
+    split_validation_size: 0.05 # use 5% of the training data as validation data
+    seed: 42 # seed for train/validation split when split_validation_size > 0
+  validation:
+    # this dataset will use the default values for other vars except data_path
+    data_path: /path/to/local/val_dataset.jsonl
+  default:
+    # will use below vars as default values if dataset doesn't specify it
+    dataset_name: ResponseDataset
+    input_key: input
+    output_key: output
+    prompt_file: null
+    system_prompt_file: null
+    processor: "math_hf_data_processor"
+    env_name: "math"
+```
+
+Your JSONL files should contain one JSON object per line with the following structure (comments here are illustrative only; JSON itself does not allow comments):
+
+```json
+{
+  "input": "Hello",     // the field named by input_key (the question)
+  "output": "Hi there!" // the field named by output_key (the answer)
+}
+```
+
+We support using multiple datasets for train and validation. You can refer to `examples/configs/grpo_multiple_datasets.yaml` for a full configuration example. Here's an example configuration:
+```yaml
+data:
+  _override_: true # override the data config instead of merging with it
+  # other data settings, see `examples/configs/grpo_math_1B.yaml` for more details
+  ...
+  # dataset settings
+  train:
+    # train dataset 1
+    - dataset_name: OpenMathInstruct-2
+      split_validation_size: 0.05 # use 5% of the training data as validation data
+      seed: 42 # seed for train/validation split when split_validation_size > 0
+    # train dataset 2
+    - dataset_name: DeepScaler
+  validation:
+    # validation dataset 1
+    - dataset_name: AIME2024
+      repeat: 16
+    # validation dataset 2
+    - dataset_name: DAPOMathAIME2024
+  # default settings for all datasets
+  default:
+    ...
+```
+
+We support using a single dataset for both train and validation by using `split_validation_size` to set the validation ratio.
+[OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py), and [Tulu3SftMixtureDataset](/../../nemo_rl/data/datasets/response_datasets/tulu3.py) support this feature.
+If you want to support it for your custom datasets or other built-in datasets, you can add the same logic to your dataset class, as in [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py):
+```python
+# `self.val_dataset` is used (not None) only when the current dataset is used for both training and validation
+self.val_dataset = None
+self.split_train_validation(split_validation_size, seed)
+```
+
+#### Common Data Format
+
+We define a [DatumSpec](/../../nemo_rl/data/interfaces.py) that holds all relevant information for each training example:
+
+```python
+class DatumSpec(TypedDict):
+    message_log: LLMMessageLogType
+    length: int  # total (concatenated) length of the message tensors
+    extra_env_info: dict[str, Any]  # anything your environment requires goes here, for example the 'answer' of a math problem
+    loss_multiplier: float  # multiplier for the loss for this datum. 0 to mask out (say the sample is invalid)
+    idx: int
+    task_name: Optional[str] = "default"
+    __extra__: Any  # This allows additional fields of any type
+```
+
+#### Data Processors
+
+We refer to each distinct environment your model aims to optimize against as a "task." For example, you might define tasks like "math" or "code."
+
+For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](/../../nemo_rl/data/interfaces.py).
+ +```python +def my_data_processor( + datum_dict: dict[str, Any], # loaded directly from your dataset (that is, a single line of JSONL data) + task_data_spec: TaskDataSpec, + tokenizer, + max_seq_length: int, + idx: int, +) -> DatumSpec: +``` + +We have an example of this as `math_data_processor` in [processors.py](/../../nemo_rl/data/processors.py). + +### Task–Dataset Mapping + +- task_name (unique task identifier): + - Determines which processor, env, prompts, and dataset to use for this task. + - Currently, we support a single dataset and a single environment. Therefore, task_name equals the dataset_name in the config (i.e., config.data.dataset_name). +- task_spec (TaskDataSpec): + - Specifies per-task system prompt and prompt. +- task_data_processors: + - Dict mapping: task_name -> (task_spec, processor_fn). +- task_to_env: + - Dict mapping: task_name -> task_env. + +Example (simplified): + +```python +task_data_processors = {data.task_name: (data.task_spec, data.processor)} +task_to_env = {data.task_name: env} +``` + +#### Putting It All Together + +GRPO expects datasets to have the following form: + +```json +{"task_name": "math", /* actual data */} +``` + +Then, you can set the data up as follows: + +```python + +# 1) Setup environments from data config +env_name_list = extract_necessary_env_names(data_config) +envs = { + env_name: create_env(env_name=env_name, env_config=env_configs[env_name]) + for env_name in env_name_list +} + +# 2) Load dataset using the helper (built-ins or local/HF datasets) +data = load_response_dataset(data_config["train"]) + +# 3) Build task mapping +task_data_processors = {data.task_name: (data.task_spec, data.processor)} +task_to_env = {data.task_name: envs[data_config["train"]["env_name"]]} + +# 4) Construct processed dataset +dataset = AllTaskProcessedDataset( + data.dataset, + tokenizer, + None, + task_data_processors, + max_seq_length=data_config["max_input_seq_length"], +) + +# 5) Do the same thing for validation dataset 
if it exists +if "validation" in data_config and data_config["validation"] is not None: + val_data = load_response_dataset(data_config["validation"]) + + val_task_data_processors = {val_data.task_name: (val_data.task_spec, val_data.processor)} + val_task_to_env = {val_data.task_name: envs[data_config["validation"]["env_name"]]} + + val_dataset = AllTaskProcessedDataset( + val_data.dataset, + tokenizer, + None, + val_task_data_processors, + max_seq_length=data_config["max_input_seq_length"], + ) +``` + +Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples. + +## Environments + +GRPO supports various types of environments for different tasks, including **[Math](/../../nemo_rl/environments/math_environment.py)**, **[Code](/../../nemo_rl/environments/code_environment.py)**, and **[Reward Model](/../../nemo_rl/environments/reward_model_environment.py)** environments. Each environment provides a standardized interface for reward computation and evaluation, enabling consistent training across diverse domains. + +For more information about environments, see the [Environments Guide](/environments). + +### Env–Task Mapping + +- env: + - The environment actor for reward/evaluation, constructed using `create_env(env_name=..., env_config=...)`. + - The environment to use is declared under the data section of the config (e.g., `data.env_name` states which env the dataset uses). +- task_to_env: + - Dict mapping: task_name -> env. In the current single-task setup this typically points all tasks to the same env, but this structure enables different envs per task in future multi-task scenarios. 
+
+Example (simplified):
+
+```python
+env_name_list = extract_necessary_env_names(data_config)
+envs = {
+    env_name: create_env(env_name=env_name, env_config=env_configs[env_name])
+    for env_name in env_name_list
+}
+
+task_to_env[task_name] = envs[data_config["train"]["env_name"]]
+val_task_to_env = task_to_env # validation usually mirrors training mapping
+```
+
+## Policy Model
+
+We define a `PolicyInterface` (in `nemo_rl.models.policy.interfaces`) that contains everything you need to train a Policy model.
+
+This Policy object holds a [RayWorkerGroup](/../../nemo_rl/distributed/worker_groups.py) of SPMD (1 proc/GPU) processes that run HF/MCore, all coordinated by this object so it appears to you like 1 GPU!
+
+## Fast Generation
+
+Currently, we support vLLM through the [VllmGeneration](/../../nemo_rl/models/generation/vllm/vllm_generation.py) class.
+
+The [grpo_train](/../../nemo_rl/algorithms/grpo.py) function contains the core GRPO training loop.
+
+## Performance Optimizations
+
+RL generations typically produce highly variable sequence lengths, which result in a significant amount of padding if approached naively. We address this with Sequence Packing and Dynamic Batching, which are techniques to reduce the amount of padding required. You can read more about these in the [design doc](/../design-docs/sequence-packing-and-dynamic-batching).
+
+## Loss
+We use the [ClippedPGLossFn](/../../nemo_rl/algorithms/loss_functions.py) to calculate the loss for GRPO.
Formally,
+
+$$
+L(\theta) = E_{x \sim \pi_{\theta_{\text{old}}}} \Big[ \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big) \Big] - \beta D_{\text{KL}} (\pi_\theta \| \pi_\text{ref})
+$$
+
+where:
+
+- $\pi_\theta$ is the policy model we are currently optimizing
+- $\pi_{\theta_{\text{old}}}$ is the previous policy model (from the beginning of this step)
+- $A_t$ is the advantage estimate
+- $\varepsilon$ is a clipping hyperparameter
+- $\beta$ is the KL penalty coefficient
+- $\pi_{\text{ref}}$ is the reference policy
+
+It also supports "Dual-Clipping" from [Ye et al. (2019)](https://arxiv.org/pdf/1912.09729), which
+imposes an additional upper bound on the probability ratio when advantages are negative.
+This prevents excessive policy updates: when $r_t(\theta) A_t \ll 0$, the term is bounded below by the clipped value $c A_t$.
+The loss function is modified to the following when $A_t < 0$:
+
+$$
+L(\theta) = E_t \Big[ \max \Big( \min \big(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t \big), c A_t \Big) \Big] - \beta D_{\text{KL}} (\pi_\theta \| \pi_\text{ref})
+$$
+
+where:
+- $c$ is the dual-clip parameter (`ratio_clip_c`), which must be greater than 1 and is usually set to 3 empirically.
+- $r_t(\theta)$ is the ratio $\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$ that measures how much the policy has changed.
+
+### Improvements to the GRPO Loss Formulation for Stability and Accuracy
+
+#### On-Policy KL Approximation
+
+This feature is controlled by the parameter `use_on_policy_kl_approximation`. It enables the use of an estimator for KL divergence based on [Schulman (2020)](http://joschu.net/blog/kl-approx.html), which is both unbiased and guaranteed to be positive.
+ +$$ +D_{\text{KL}} (\pi_\theta || \pi_\text{ref}) \approx E_{x \sim \pi_{\theta}} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] +$$ + +Note that the loss function above samples from $\pi_{\theta_{\text{old}}}$ instead of $\pi_\theta$, meaning that the KL approximation is off-policy if we use samples from $\pi_{\theta_{\text{old}}}$. This is the default formulation used in the [original GRPO paper](https://arxiv.org/abs/2402.03300). In order to use an _on-policy_ KL approximation while sampling from $\pi_{\theta_{\text{old}}}$, we can incorporate importance weights: + +$$ +\begin{align*} +D_{\text{KL}} (\pi_\theta || \pi_\text{ref}) &\approx E_{x \sim \pi_{\theta}} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\ +&= \sum_x \pi_{\theta}(x) \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\ +&= \sum_x \pi_{\theta_{\text{old}}}(x) \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\ +&= E_{x \sim \pi_{\theta_\text{old}}} \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\ +\end{align*} +$$ + +To enable the on-policy KL approximation, set the config `use_on_policy_kl_approximation=True` in the `ClippedPGLossConfig`. By default, we set this config to False to align with standard GRPO. + +#### Importance Sampling Correction +This feature is controlled by the parameter `use_importance_sampling_correction`. It applies importance sampling to adjust for discrepancies between the behavior policy and the target policy, improving the accuracy of off-policy estimates. The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. 
To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](/../adding-new-models#understand-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+
+Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of the loss function. Then,
+
+$$
+\begin{align*}
+E_{x \sim \pi_\text{training}} f_\theta(x) &= \sum_x \pi_\text{training}(x) f_\theta(x) \\
+&= \sum_x \pi_\text{inference}(x) \frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)} f_\theta(x) \\
+&= E_{x \sim \pi_\text{inference}} \frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)} f_\theta(x)
+\end{align*}
+$$
+
+By multiplying the first term of the loss function by the importance weights $\frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)}$, we can correct for the distribution mismatch between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ while still sampling from $\pi_{\text{inference}}$.
+
+To enable the importance sampling correction, set the config `use_importance_sampling_correction=True` in the `ClippedPGLossConfig`. By default, we set this config to False to align with standard GRPO.
+
+#### Overlong Filtering
+
+This feature is controlled by the parameter `overlong_filtering`. It filters out sequences that exceed a predefined maximum length, helping maintain computational efficiency and model stability.
When `overlong_filtering=True`, samples that reach `max_total_sequence_length` without producing an end-of-text token are excluded from loss computation. This reduces noise from penalizing generations that may be high-quality but exceed the sequence length limit.
+
+The implementation modifies the loss calculation as follows:
+
+For each sample $i$ in the batch:
+
+$$
+\text{truncated}_i = \begin{cases}
+1 & \text{if sample } i \text{ reached max length without EOS} \\
+0 & \text{otherwise}
+\end{cases}
+$$
+
+The sample mask becomes (let $m_i$ denote the sample mask and $\ell_i$ denote the loss multiplier):
+
+$$
+m_i = \ell_i \cdot (1 - \text{truncated}_i)
+$$
+
+This results in the effective loss:
+
+$$
+L_{\text{effective}} = \sum_{i} m_i \cdot L_i
+$$
+
+where $L_i$ is the per-sample loss. Truncated samples contribute 0 to the gradient update while remaining in the batch for reward baseline calculations.
+
+To configure:
+```yaml
+grpo:
+  overlong_filtering: false # default
+```
+
+Set `overlong_filtering` to true when training on tasks where truncation at the maximum sequence length is expected, such as long-form reasoning or mathematical proofs.
+
+## Metrics
+Metric logging is configured via the logger parameters (e.g., `wandb_name` and `tb_name`). We track a few metrics during training for scientific experimentation and to validate correctness as the run progresses.
+
+### Multiplicative Token Probability Error
+This metric is reported as `token_mult_prob_error`. It measures the average multiplicative error between token probabilities computed by the training framework and those computed by the inference framework.
This is equal to the 'Logprob consistency metric' defined in [Adding New Models](/../adding-new-models#importance-of-log-probability-consistency-in-training-and-inference):
+
+$$
+\text{token-mult-prob-error} = \frac{1}{n}\sum_{i=1}^{n}\exp\left(\left|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right|\right)
+$$
+
+where $n$ is the number of tokens. Intuitively, this measures the average multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$. The purpose of this is to highlight any obvious sampling errors or discrepancies between the inference backend and training framework. If it trends upward steeply over the course of training past $\sim 1-2\%$, there is usually a problem with how your weights are being updated. If these metrics are very spiky, they can indicate a bug in the inference framework or buggy weight refitting.
+
+### KL Divergence Error
+We report the following KL divergence metrics:
+* `gen_kl_error`: $D_{\text{KL}}(P_{gen} || P_{policy})$
+  - the generation distribution as ground truth
+* `policy_kl_error`: $D_{\text{KL}}(P_{policy} || P_{gen})$
+  - the policy (training) distribution as ground truth
+* `js_divergence_error` (Jensen–Shannon divergence): $(D_{\text{KL}}(P_{policy} || P_{m}) + D_{\text{KL}}(P_{gen} || P_{m})) / 2$, where $P_{m} = (P_{policy} + P_{gen}) / 2$
+  - uses the mean mixture distribution as reference
+
+According to the paper [When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda), `gen_kl_error` was introduced (referred to as `vllm-kl` in the paper) as the key metric to measure the mismatch between the policy and generation distributions.
Empirically, the mismatch is approximately 1e-3, and the divergence is larger for low-probability tokens as predicted by the generation inference engine (like vLLM).
+
+The three divergence metrics provide complementary perspectives on distribution mismatch. For example:
+
+We observed a case where vLLM assigned a disproportionately high probability to a single rare token, causing significant logprob error spikes (especially in MoE architectures):
+
+```text
+# extreme example
+1. Position 4559: 'au' (ID: 1786)
+   logp_gen (from vLLM): -5.xxx
+   logp_policy (from Mcore): -15.xxx
+```
+Assuming other tokens have near-zero divergence, this single token's metrics with `kl_type=k3` are:
+
+* `gen_kl_error`: exp(-15 + 5) - (-15 + 5) - 1 ≈ 9 (moderate mismatch)
+* `policy_kl_error`: exp(-5 + 15) - (-5 + 15) - 1 ≈ 22,015 (severe mismatch dominating the metric)
+* `js_divergence_error`: ≈ 9, close to `gen_kl_error` since the mixture distribution (~-5.69) is dominated by the higher-probability value (logp_gen in this example)
+
+Ideally, all KL divergence metrics should be close to 0, with values below 1e-3 considered acceptable. Investigate any metric that shows spikes above this threshold.
+
+### Sampling Importance Ratio
+This metric is reported as `sampling_importance_ratio`. It tracks the ratio between the target policy and the behavior policy, which can be used to correct for distributional shift in off-policy learning. Not to be confused with the clipped importance ratio in PPO/GRPO, this is the importance ratio between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$.
+
+This is simply $\frac{1}{|T|}\sum_{t \in \text{tokens}}\exp\left(\log \pi_{\text{training}}(t) - \log \pi_{\text{inference}}(t)\right)$.
+
+Similar to [Multiplicative Token Probability Error](#multiplicative-token-probability-error), this is a measure of how far off your inference backend is from your training framework.
However, this metric is meant to find the bias in that error, rather than the variance, as it does not take the absolute value of the error. With some noise, this should hover around 1.
+
+This metric is always calculated, and the per-token version (without the mean) is used in the loss function when [Importance Sampling Correction](#importance-sampling-correction) is enabled.
+
+### Entropy
+This metric is reported as `approx_entropy`. It estimates the entropy of the policy distribution, which can be used to encourage exploration and prevent premature convergence during training. We roughly approximate the entropy of the LLM's distribution throughout training by calculating:
+
+$$
+E_{x \sim \pi_{\text{inference}}}\Big[-\frac{\pi_{\text{training}}(x)}{\pi_{\text{inference}}(x)}\log \pi_{\text{training}}(x)\Big]
+$$
+
+This expectation is estimated using the rollouts in each global training batch as Monte Carlo samples. The ratio of $\pi$ values in the formula serves to apply importance correction for the mismatch between the training policy during a single GRPO step and the inference-time policy used to sample states.
+
+We use this to track whether our models are experiencing entropy collapse too quickly during training (as is quite common). This is a fairly rough Monte Carlo approximation, so we wouldn't recommend using this directly for an entropy bonus or otherwise backpropagating through this. You can take a look at NeMo Aligner's [implementation](https://github.com/NVIDIA/NeMo-Aligner/blob/main/nemo_aligner/utils/distributed.py#L351) of a full entropy calculation if you're interested (work-in-progress efficient calculation in NeMo RL).
+
+## LoRA Configuration
+
+### DTensor Backend
+
+GRPO supports LoRA on the NeMo RL DTensor backend. The LoRA settings live under `policy.dtensor_cfg.lora_cfg`, and the fields follow the SFT LoRA configuration.
For DTensor parameter details, see [SFT LoRA: DTensor Configuration Parameters](/./sft#dtensor-configuration-parameters). To enable LoRA, set `policy.dtensor_cfg.lora_cfg.enabled=true`, then configure target modules, rank, alpha, and dropout as needed. + +Our DTensor LoRA path uses a merge-weight approach: during generation, LoRA adapter weights are merged into the base linear weights. This improves performance, with a small training-inference mismatch that we consider acceptable. If you require strict training-inference parity, use the [split-weight variant branch](https://github.com/NVIDIA-NeMo/RL/tree/ruit/lora_grpo_async), which may trade off some performance. For a comparison between merge-weight and split-weight, see [PR 1797: Support lora in dtensor grpo workflow by merging weight](https://github.com/NVIDIA-NeMo/RL/pull/1797). + +We already provide a DTensor-based Nano v3 GRPO LoRA recipe. See [grpo-nanov3-30BA3B-2n8g-fsdp2-lora.yaml](/../../examples/configs/recipes/llm/grpo-nanov3-30BA3B-2n8g-fsdp2-lora.yaml) for an end-to-end example. + +## Evaluate the Trained Model + +Upon completion of the training process, you can refer to our [evaluation guide](/eval) to assess model capabilities. diff --git a/fern/v0.5.0/pages/guides/nemotron-3-nano.mdx b/fern/v0.5.0/pages/guides/nemotron-3-nano.mdx new file mode 100644 index 0000000000..49697325e3 --- /dev/null +++ b/fern/v0.5.0/pages/guides/nemotron-3-nano.mdx @@ -0,0 +1,70 @@ +--- +title: Nemotron 3 Nano +description: "" +--- + +This guide explains how to post-train the [Nemotron 3 Nano model](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) using NeMo RL. 
+ +## Download and prepare the data + +```bash +# Download RL data blend +uvx --from huggingface-hub hf download nvidia/Nemotron-3-Nano-RL-Training-Blend --repo-type dataset --local-dir=data + +# Fill in placeholders in dataset +chmod +x data/create_nanov3_jsonl.py +./data/create_nanov3_jsonl.py --input data/train.jsonl --output data/train-full.jsonl + +# Use the last 1000 rows for validation +head -n -1000 data/train-full.jsonl > data/train-split.jsonl +tail -n 1000 data/train-full.jsonl > data/val-split.jsonl +``` + +## Prepare the code +Note that we currently require using the `nano-v3` branch to train Nemotron 3 Nano. +```bash +# Checkout NeMo RL +git clone -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git +cd RL + +# Initialize the submodules +git submodule update --init --recursive +``` + +## Create a launch script + +Create a file named `launch.sh` with the following contents. Be sure to fill in the `DATA_DIR`, `MODEL_CHECKPOINT`, `WANDB_API_KEY`, `SLURM_ACCOUNT`, `SLURM_PARTITION`, `MOUNTS`. Note that the default recipe (`examples/nemo_gym/grpo_nanov3.yaml`) uses 32 nodes. + +```bash +CODE_DIR=$PWD +SLURM_JOB_NAME=nano-v3-rl-training + +# Fill these in +DATA_DIR=... +MODEL_CHECKPOINT=... +WANDB_API_KEY=... +SLURM_ACCOUNT=... +SLURM_PARTITION=... +MOUNTS=... # SRC:DST[,SRC:DST...] 
e.g., MOUNTS="/lustre:/lustre,/data:/data" + +CONTAINER="nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano" +COMMAND="uv run examples/nemo_gym/run_grpo_nemo_gym.py --config examples/nemo_gym/grpo_nanov3.yaml data.train_jsonl_fpath=$DATA_DIR/train-split.jsonl data.validation_jsonl_fpath=$DATA_DIR/val-split.jsonl policy.model_name=$MODEL_CHECKPOINT logger.wandb_enabled=True" + +COMMAND="${COMMAND}" \ +CONTAINER="${CONTAINER}" \ +MOUNTS="${MOUNTS}" \ +WANDB_API_KEY=${WANDB_API_KEY} \ +sbatch \ + --nodes=32 \ + --account="${SLURM_ACCOUNT}" \ + --job-name="${SLURM_JOB_NAME}" \ + --partition="${SLURM_PARTITION}" \ + --time=4:0:0 \ + --gres=gpu:8 \ + ray.sub +``` + +## Launch training +```bash +bash launch.sh +``` diff --git a/fern/v0.5.0/pages/guides/prorlv2.mdx b/fern/v0.5.0/pages/guides/prorlv2.mdx new file mode 100644 index 0000000000..3672e0fbe8 --- /dev/null +++ b/fern/v0.5.0/pages/guides/prorlv2.mdx @@ -0,0 +1,238 @@ +--- +title: An In-Depth Walkthrough of ProRLv2 in NeMo RL +description: "" +--- + +This guide covers the ProRLv2 configuration pattern in NeMo RL, based on the example config [`examples/configs/prorlv2.yaml`](/../../examples/configs/prorlv2.yaml). + +ProRLv2 (as used in this repo) is best thought of as **GRPO and a bundle of stability/efficiency techniques** commonly used for long-horizon RL fine-tuning + +- **DAPO dynamic sampling**: skip prompt-groups with zero reward variance +- **Decoupled (asymmetric) clipping**: `ratio_clip_max > ratio_clip_min` +- **Token-level policy gradient loss** +- **Importance sampling correction and TIS/ICE-POP** (especially helpful for MoE/backend-mismatch scenarios) +- **Reinforce++: Decoupled local/global advantage normalization** (`reinforce_plus_plus`) +- **“Stop properly” penalty** for truncated responses + +This document focuses on ProRLv2-specific knobs and gotchas. For foundational concepts on GRPO (data, environments, generation backends, loss/metrics), see the [NeMo RL GRPO Guide](/grpo). 
For the original DAPO motivation behind dynamic sampling/overlong shaping, see the [NeMo RL DAPO Guide](/dapo).
+
+## Quickstart: Launch a ProRLv2 Run
+
+Use the example configuration [`examples/configs/prorlv2.yaml`](/../../examples/configs/prorlv2.yaml):
+
+```bash
+uv run examples/run_grpo_math.py --config examples/configs/prorlv2.yaml {overrides}
+```
+
+`prorlv2.yaml` inherits from [`examples/configs/grpo_math_1B.yaml`](/../../examples/configs/grpo_math_1B.yaml) and only overrides a small set of fields under `grpo` and `loss_fn`, plus output directories.
+
+**Reminder**: Don’t forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You’ll need to do a `huggingface-cli login` as well for gated models.
+
+## DAPO: Dynamic Sampling
+
+Standard GRPO will train on all generated responses, even when a prompt’s `num_generations_per_prompt` responses all receive the same reward (no per-prompt learning signal). **Dynamic sampling** filters to keep only prompt-groups with diverse rewards (`std > 0`), and can accumulate across multiple generation batches until it reaches the target rollout batch size.
+
+- **Config**: enable with `grpo.use_dynamic_sampling: true` and tune:
+  - `grpo.batch_multiplier`: how many extra prompts to generate to compensate for filtering
+  - `grpo.dynamic_sampling_max_gen_batches`: upper bound before raising an error
+- **Implementation**: see `dynamic_sampling()` in [`nemo_rl/algorithms/grpo.py`](/../../nemo_rl/algorithms/grpo.py).
+
+## Advantage Estimator: Reinforce++
+
+The ProRLv2 recipe uses **Reinforce++** advantage estimation instead of the standard GRPO-style group baseline.
+
+Quick intuition:
+
+- Reinforce++ uses **decoupled local + global normalization**.
+- Compared to GRPO-style **local-only normalization**, this decoupling can be **more stable** in longer runs (less sensitivity to per-batch scale/variance shifts).
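As a rough, standalone sketch of the two-stage scheme (plain Python with a hypothetical helper name; not the repo's implementation, and simplified to normalize per response rather than per valid response token):

```python
import math

def reinforce_pp_advantages(rewards_per_prompt):
    """Decoupled local + global advantage normalization (sketch).

    rewards_per_prompt: list of lists; each inner list holds the rewards of
    the responses generated for one prompt. Returns a flat list of
    advantages, in the same order the rewards were given.
    """
    # Stage 1 (local): subtract the per-prompt mean reward from each response.
    advantages = []
    for group in rewards_per_prompt:
        mean_r = sum(group) / len(group)
        advantages.extend(r - mean_r for r in group)

    # Stage 2 (global): normalize across the whole batch, with a variance floor.
    mean_a = sum(advantages) / len(advantages)
    var_a = sum((a - mean_a) ** 2 for a in advantages) / len(advantages)
    std_a = math.sqrt(max(var_a, 1e-8))
    return [(a - mean_a) / std_a for a in advantages]
```

Note that the repo performs the global step over all valid response tokens in the batch (respecting token masks), not over whole responses as in this simplification.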
+
+Computation (as implemented in this repo, with the ProRLv2 example defaults):
+
+```text
+Defaults in examples/configs/prorlv2.yaml:
+  grpo.adv_estimator.minus_baseline = true
+  loss_fn.use_kl_in_reward = false
+
+Steps:
+  1) Per prompt-group, compute mean reward, then subtract it:
+     a_i = r_i - mean_{j in same prompt} r_j
+
+  2) Global normalize across *all valid response tokens* in the batch:
+     A <- (A - mean(A)) / sqrt(max(var(A), 1e-8))
+```
+
+```yaml
+grpo:
+  adv_estimator:
+    name: "reinforce_plus_plus"
+    normalize_rewards: true
+    use_leave_one_out_baseline: false
+    minus_baseline: true
+```
+
+- **Config**: `grpo.adv_estimator.name: "reinforce_plus_plus"`
+- **Implementation**: the training loop wires this via `ReinforcePlusPlusAdvantageEstimator` in [`nemo_rl/algorithms/grpo.py`](/../../nemo_rl/algorithms/grpo.py).
+- **Reference**: [REINFORCE++ paper](https://arxiv.org/abs/2501.03262)
+
+## Reward Shaping: “Stop properly” Penalty (Truncation Penalty)
+
+When a generation hits the max length without emitting EOS, many pipelines mark it as **truncated**. The “stop properly” penalty scales the reward for truncated samples:
+
+- `stop_properly_penalty_coef = 0.0`: truncated samples get **zero reward**
+- `stop_properly_penalty_coef = 1.0`: **no penalty** (keep original rewards)
+- Any value in $[0, 1]$ interpolates between the two.
+
+In the example config:
+
+```yaml
+grpo:
+  reward_shaping:
+    enabled: true
+    stop_properly_penalty_coef: 0.0
+```
+
+- **Implementation**: `apply_reward_shaping()` in [`nemo_rl/algorithms/reward_functions.py`](/../../nemo_rl/algorithms/reward_functions.py).
+
+
+In the current implementation, if `stop_properly_penalty_coef` is set (not `null`), `apply_reward_shaping()` **returns early** after applying truncation scaling.
That means you **cannot** apply DAPO "overlong reward shaping" in the same run unless you set `stop_properly_penalty_coef: null` and provide the DAPO overlong parameters (`overlong_buffer_length`, `overlong_buffer_penalty`, `max_response_length`). + + +## Loss: Decoupled (Asymmetric) Clipping + +ProRLv2 uses DAPO’s “decoupled clipping” idea by setting different lower/upper clip bounds: + +```yaml +loss_fn: + ratio_clip_min: 0.2 + ratio_clip_max: 0.27 +``` + +This keeps PPO/GRPO-style clipping behavior but allows a larger expansion region than the contraction region, which can help exploration and reduce early collapse. + +- **Implementation**: `ClippedPGLossFn` documents decoupled clipping in [`nemo_rl/algorithms/loss_functions.py`](/../../nemo_rl/algorithms/loss_functions.py). + +## Loss: Token-level Policy Gradient + +ProRLv2 enables token-level loss: + +```yaml +loss_fn: + token_level_loss: true +``` + +This computes the policy gradient loss per token (under masking) instead of aggregating per sequence, which is often helpful for long CoT/variable-length rollouts. + +## Truncated Importance Sampling + +When training and generation backends differ (e.g., numerics, precision, MoE routing, or vLLM vs training framework), you may see a mismatch between: + +- `generation_logprobs` (logprobs under the generation backend that produced samples) +- `prev_logprobs` (logprobs under the training framework policy) + +NeMo RL supports **importance sampling correction**, and ProRLv2’s example config turns it on together with **truncated importance sampling**. + +Quick intuition: + +- This is mainly useful for **MoE/backend mismatch** cases, where the generation backend and the training policy can disagree on logprobs. +- We compute an importance weight from `prev_logprobs` (training policy) vs `generation_logprobs` (generator). **ICE-POP** drops outliers by zeroing weights outside $[min, max]$. 
+- In the common setup of **one policy update per rollout batch** (i.e., minibatch equals the per-step rollout batch; no PPO multi-epoch reuse), the PPO/GRPO likelihood ratio term is effectively **1.0** at update time, so the main stability issue is the MoE/backend-mismatch importance weights. +- “Online ICE-POP” here just means applying that ICE-POP filtering **during loss computation** on the current training batch. + +- **Reference**: [The Online IcePop Solution for MoE models](https://hijkzzz.notion.site/online-ice-pop) + +```yaml +loss_fn: + use_importance_sampling_correction: true + truncated_importance_sampling_ratio: 5.0 + truncated_importance_sampling_ratio_min: 0.5 + truncated_importance_sampling_type: "icepop" +``` + +- **`use_importance_sampling_correction`**: enable token-level importance weights (must be `true` for truncated IS) +- **`truncated_importance_sampling_ratio`**: upper bound (or upper threshold) +- **`truncated_importance_sampling_ratio_min`**: lower bound used by ICE-POP filtering +- **`truncated_importance_sampling_type`**: + - `"tis"`: clamp weights to `<= truncated_importance_sampling_ratio` + - `"icepop"`: set weights outside $[min, max]$ to zero (filter outliers) + - `"seq-mask-tis"`: sequence-level geometric-mean mask + non-truncated token-level IS correction (see below) + +- **Implementation**: see `ClippedPGLossFn` init-time checks and logic in [`nemo_rl/algorithms/loss_functions.py`](/../../nemo_rl/algorithms/loss_functions.py). + +### Seq-mask-tis: Sequence-level Geometric-Mean Mask + +`seq-mask-tis` is an alternative to ICE-POP that operates at the **sequence level** instead of per-token: + +1. For each sequence, compute the **geometric mean** of per-token IS ratios: $\text{geo\_mean}_i = \exp\!\bigl(\frac{1}{T_i}\sum_t \log \frac{\pi_{\text{train}}(a_t)}{\pi_{\text{gen}}(a_t)}\bigr)$ +2. **Mask out** entire sequences whose geometric mean falls outside $[min, max]$. +3. 
For retained sequences, apply the **non-truncated** (raw) token-level IS ratios to correct per-token gradients — no clamping, no per-token filtering. + +Key differences from ICE-POP: + +| | ICE-POP | seq-mask-tis | +|---|---|---| +| Filtering granularity | per token | per sequence | +| IS correction weights | filtered (zeroed outside bounds) | raw / non-truncated | +| Reference bounds | min=0.5, max=5 | min=0.999, max=1.002 | + +```yaml +loss_fn: + use_importance_sampling_correction: true + truncated_importance_sampling_ratio: 1.002 + truncated_importance_sampling_ratio_min: 0.999 + truncated_importance_sampling_type: "seq-mask-tis" +``` + +Both ICE-POP and seq-mask-tis report a shared metric **`is_oob_ratio`** — the fraction of tokens (ICE-POP) or sequences (seq-mask-tis) that were filtered out. + +- **Reference**: [When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda) + +## Full Example Config (Annotated) + +The ProRLv2 example config is intentionally small and relies on defaults from `grpo_math_1B.yaml`. 
+ +- **Example config**: [`examples/configs/prorlv2.yaml`](/../../examples/configs/prorlv2.yaml) +- **Base defaults**: [`examples/configs/grpo_math_1B.yaml`](/../../examples/configs/grpo_math_1B.yaml) + +## Practical Overrides + +A few common overrides when launching: + +```bash +uv run examples/run_grpo_math.py \ + --config examples/configs/prorlv2.yaml \ + policy.model_name="Qwen/Qwen2.5-1.5B" \ + logger.wandb_enabled=true \ + logger.wandb.project="prorlv2-dev" \ + checkpointing.checkpoint_dir="results/prorlv2" \ + logger.log_dir="logs/prorlv2" +``` + +If you want to enable DAPO overlong reward shaping instead of stop-properly: + +```bash +uv run examples/run_grpo_math.py \ + --config examples/configs/prorlv2.yaml \ + grpo.reward_shaping.stop_properly_penalty_coef=null \ + grpo.reward_shaping.overlong_buffer_length=4096 \ + grpo.reward_shaping.overlong_buffer_penalty=1.0 \ + grpo.reward_shaping.max_response_length=20480 +``` + +## What to Monitor + +In addition to task rewards/accuracy, a few stability signals are particularly useful with ProRLv2-style runs: + +- **Dynamic sampling efficiency**: if enabled, watch how often batches need multiple generation rounds (see `dapo.md` for detailed guidance). +- **Training–generation mismatch**: `token_mult_prob_error`, `gen_kl_error`, `policy_kl_error`, `js_divergence_error` are computed in `ClippedPGLossFn` (see the [GRPO metrics section](/grpo#metrics)). +- **Truncation rate**: if high, either increase `policy.max_total_sequence_length`/`policy.generation.max_model_len` or relax truncation penalty (`stop_properly_penalty_coef`). 
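To make the weighting concrete, here is a minimal NumPy sketch of the `tis` and `icepop` variants and the shared `is_oob_ratio` statistic described above. This is an illustration only, not the `ClippedPGLossFn` implementation; the bounds follow the example config's reference values.

```python
import numpy as np

def importance_weights(prev_logprobs, gen_logprobs, kind="icepop",
                       ratio_max=5.0, ratio_min=0.5):
    # Token-level importance ratio: pi_train / pi_gen
    ratios = np.exp(np.asarray(prev_logprobs) - np.asarray(gen_logprobs))
    if kind == "tis":
        # "tis": clamp weights from above at ratio_max
        oob = ratios > ratio_max
        weights = np.minimum(ratios, ratio_max)
    elif kind == "icepop":
        # "icepop": zero out weights falling outside [ratio_min, ratio_max]
        oob = (ratios < ratio_min) | (ratios > ratio_max)
        weights = np.where(oob, 0.0, ratios)
    else:
        raise ValueError(f"unknown kind: {kind}")
    # Fraction of out-of-bounds tokens, analogous to the is_oob_ratio metric
    return weights, float(oob.mean())

# Three tokens with ratios 1.0, 0.2, and 6.0 under the reference bounds [0.5, 5.0]
prev = np.log([1.0, 0.2, 3.0])
gen = np.log([1.0, 1.0, 0.5])
weights, oob_ratio = importance_weights(prev, gen, kind="icepop")
# Ratios 0.2 and 6.0 are filtered, so weights ≈ [1.0, 0.0, 0.0] and oob_ratio = 2/3
```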
+ +## References + +- **ProRLv2 blog**: [Scaling LLM Reinforcement Learning with Prolonged Training using ProRL v2](https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/) +- **DAPO**: [Decoupled Clip and Dynamic Sampling Policy Optimization](https://arxiv.org/pdf/2503.14476) +- **GRPO**: [Group Relative Policy Optimization](https://arxiv.org/abs/2402.03300) +- **REINFORCE++**: [REINFORCE++](https://arxiv.org/abs/2501.03262) +- **DLER (stop properly penalty explanation)**: [DLER](https://arxiv.org/pdf/2510.15110) +- **seq-mask-tis blog**: [When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda) +- **[NeMo RL GRPO Guide](/grpo)** +- **[NeMo RL DAPO Guide](/dapo)** diff --git a/fern/v0.5.0/pages/guides/rm.mdx b/fern/v0.5.0/pages/guides/rm.mdx new file mode 100644 index 0000000000..cef70848ea --- /dev/null +++ b/fern/v0.5.0/pages/guides/rm.mdx @@ -0,0 +1,233 @@ +--- +title: Reward Model Training in NeMo RL +description: "" +--- + +This document explains how to train reward models (RM) within NeMo RL. Currently, only Bradley-Terry reward models are supported on the DTensor backend. Megatron backend support is tracked [here](https://github.com/NVIDIA-NeMo/RL/issues/720). + +## Launch a Training Job + +The script, [examples/run_rm.py](/../../examples/run_rm.py), is used to train a Bradley-Terry reward model. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](/../cluster). + +Be sure to launch the job using `uv`. 
The command to launch a training job is as follows: + +```bash +uv run examples/run_rm.py + +# Can also add overrides on CLI, like changing the config or changing the model +uv run examples/run_rm.py --config examples/configs/rm.yaml policy.model_name=Qwen/Qwen2.5-1.5B +``` + +The default YAML config shares the same base template as the SFT config but includes a new `reward_model_cfg` section with `enabled: true` to load the model as a Reward Model. You can find an example RM config file at [examples/configs/rm.yaml](/../../examples/configs/rm.yaml). + +**Reminder**: Set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). Make sure to log in using `huggingface-cli` if you're working with Llama models. + +## Datasets + +RM datasets in NeMo RL are encapsulated using classes. Each RM data class is expected to have the following attributes: + 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below. + 2. `task_name`: A string identifier that uniquely identifies the dataset. + +If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. An example implementation can be found in [preference_datasets/tulu3.py](/../../nemo_rl/data/datasets/preference_datasets/tulu3.py). + +**Note:** The `task_name` field is required in each formatted example. 
```json
{
  "context": [], // list of dicts - The prompt message (including previous turns, if any)
  "completions": [ // list of dicts — The list of completions
    {
      "rank": 0, // int — The rank of the completion (lower rank is preferred)
      "completion": [] // list of dicts — The completion message(s)
    },
    {
      "rank": 1, // int — The rank of the completion (lower rank is preferred)
      "completion": [] // list of dicts — The completion message(s)
    }
  ],
  "task_name": "task_name" // identifier for the task
}
```

Currently, RM training supports only two completions (the lowest rank is preferred and the highest is rejected), with each completion being a single response. For example:
```json
{
  "context": [
    {
      "role": "user",
      "content": "What's the capital of France?"
    },
    {
      "role": "assistant",
      "content": "The capital of France is Paris."
    },
    {
      "role": "user",
      "content": "Thanks! And what's the capital of Germany?"
    }
  ],
  "completions": [
    {
      "rank": 0,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Berlin."
        }
      ]
    },
    {
      "rank": 1,
      "completion": [
        {
          "role": "assistant",
          "content": "The capital of Germany is Munich."
        }
      ]
    }
  ],
  "task_name": "task_name"
}
```

By default, NeMo RL supports the [HelpSteer3](/../../nemo_rl/data/datasets/preference_datasets/helpsteer3.py) and [Tulu3Preference](/../../nemo_rl/data/datasets/preference_datasets/tulu3.py) datasets. Both are downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.

We provide a [PreferenceDataset](/../../nemo_rl/data/datasets/preference_datasets/preference_dataset.py) class for loading JSONL-formatted preference datasets from a local path or from Hugging Face.
You can modify your config as follows to use such a custom preference dataset:
```yaml
data:
  # other data settings, see `examples/configs/dpo.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: PreferenceDataset
    prompt_file: null
    system_prompt_file: null
  # multiple validation sets are supported by using val_data_paths
  # this will be removed after refactor
  val_data_paths:
    <val_set_1_name>: /path/to/local/val_dataset_1.jsonl
    <val_set_2_name>: /path/to/local/val_dataset_2.jsonl
```

Your JSONL files should contain one JSON object per line with the following structure:

```json
{
  "context": [{"role": "user", "content": "What is 2+2?"}], // list of dicts - The prompt message (including previous turns, if any)
  "completions": [ // list of dicts — The list of completions
    {
      "rank": 0, // int — The rank of the completion (lower rank is preferred)
      "completion": [{"role": "assistant", "content": "The answer is 4."}] // list of dicts — The completion message(s)
    },
    {
      "rank": 1, // int — The rank of the completion (lower rank is preferred)
      "completion": [{"role": "assistant", "content": "I don't know."}] // list of dicts — The completion message(s)
    }
  ]
}
```

We also provide a [BinaryPreferenceDataset](/../../nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py) class, a simplified version of PreferenceDataset for pairwise ranked preferences with single-turn completions.
You can use `prompt_key`, `chosen_key`, and `rejected_key` to specify which fields in your data correspond to the question, chosen answer, and rejected answer, respectively. Here's an example configuration:
```yaml
data:
  # other data settings, see `examples/configs/dpo.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override prompt_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    prompt_key: context
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
```

Your JSONL files should contain one JSON object per line with the following structure:

```json
{
  "prompt": "What is 2+2?", // str — the question (field name set by prompt_key)
  "chosen": "The answer is 4.", // str — the preferred answer (field name set by chosen_key)
  "rejected": "I don't know." // str — the rejected answer (field name set by rejected_key)
}
```

Please note:
- If you are using a logger, the prefix used for each validation set will be `validation-<val_set_name>`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<val_set_name>_loss`.

## Using Reward Models as Environments

Trained reward models can be used as environments in GRPO training for reinforcement learning from human feedback (RLHF). This allows your trained reward model to provide rewards during policy optimization.
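In this role, a Bradley-Terry reward model maps each context-plus-completion to a scalar score; training on the ranked pairs described above pushes the rank-0 completion's score above the rank-1 score through a pairwise log-sigmoid loss. A minimal sketch of that objective follows (illustrative only, not the repo's implementation):

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Equal scores give the maximum-uncertainty loss of log(2); the loss shrinks
# as the preferred (rank 0) completion is scored increasingly higher.
tied = bradley_terry_loss(0.0, 0.0)       # == log(2) ≈ 0.693
confident = bradley_terry_loss(3.0, 0.0)  # ≈ 0.049
```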
+ +### Reward Model Environment + +The Reward Model Environment provides a standardized interface for using trained reward models in RL training: + +```python +from nemo_rl.environments.reward_model_environment import RewardModelEnvironment + +env_config = { + "enabled": True, + "model_name": "path/to/your/trained/reward/model", + "tokenizer": {"name": "path/to/your/trained/reward/model"}, + "precision": "bfloat16", + "batch_size": 32, + "resources": {"gpus_per_node": 1, "num_nodes": 1}, + "reward_model_cfg": { + "enabled": True, + "reward_model_type": "bradley_terry", + }, +} + +reward_env = RewardModelEnvironment.remote(env_config) +``` + +### Integration with GRPO + +To use your trained reward model with GRPO, you can use the [examples/run_grpo.py](/../../examples/run_grpo.py) script with the [examples/configs/grpo_rm_1B.yaml](/../../examples/configs/grpo_rm_1B.yaml) config: + +```bash +# Run GRPO training with your trained reward model +uv run examples/run_grpo.py --config examples/configs/grpo_rm_1B.yaml +``` + +### Configuration + +In your GRPO configuration, specify the reward model environment: + +```yaml +env: + reward_model: + enabled: true + model_name: "path/to/your/trained/reward/model" + tokenizer: + name: "path/to/your/trained/reward/model" + precision: "bfloat16" + batch_size: 32 + resources: + gpus_per_node: 1 + num_nodes: 1 + reward_model_cfg: + enabled: true + reward_model_type: "bradley_terry" +``` diff --git a/fern/v0.5.0/pages/guides/sft-openmathinstruct2.mdx b/fern/v0.5.0/pages/guides/sft-openmathinstruct2.mdx new file mode 100644 index 0000000000..0fa5cae5d1 --- /dev/null +++ b/fern/v0.5.0/pages/guides/sft-openmathinstruct2.mdx @@ -0,0 +1,96 @@ +--- +title: SFT on OpenMathInstruct-2 +description: "" +--- + +This guide explains how to use NeMo RL to run SFT on the [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) math instruction tuning dataset. 
We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500).

## Train the Model
To train the model using NeMo RL, use the `examples/configs/recipes/tutorials/sft/sft_openmathinstruct2.yaml` config file. This file closely matches the experiment settings in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560).

```
uv run examples/run_sft.py --config=examples/configs/recipes/tutorials/sft/sft_openmathinstruct2.yaml
```

### Dataset Splits

The OpenMathInstruct-2 dataset comes in several sizes. Configure the version of the dataset via the `data.split` config:

* `train`: the full 14M problem–solution pairs
* `train_1M`, `train_2M`, `train_5M`: fair-downsampled subsets of 1M, 2M, or 5M examples

By default, the config uses the 1M subset (`data.split=train_1M`).

### Training Time
The default config uses 8 GPUs (`cluster.gpus_per_node`) on 1 node (`cluster.num_nodes`), which should complete 1 epoch of training on the `train_1M` dataset (1855 steps) in around 20 hours. Additional nodes can be used to speed up training: in our experiments, 8 nodes completed 1 epoch on `train_1M` in less than 4 hours.

## Evaluate the Model
Throughout training, checkpoints of the model are saved to the `results/sft_openmathinstruct2` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, we first need to convert the PyTorch distributed checkpoint to Hugging Face format:

```
uv run examples/converters/convert_dcp_to_hf.py \
    --config=results/sft_openmathinstruct2/step_1855/config.yaml \
    --dcp-ckpt-path=results/sft_openmathinstruct2/step_1855/policy/weights \
    --hf-ckpt-path=results/sft_openmathinstruct2/step_1855/hf
```

Replace `results/sft_openmathinstruct2/step_1855` with the path to the checkpoint you are evaluating.
The resulting Hugging Face checkpoint will be saved to `--hf-ckpt-path`. + +To evaluate on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), use the following command: + +``` +uv run examples/run_eval.py \ + --config=examples/configs/evals/eval.yaml \ + generation.model_name=results/sft_openmathinstruct2/step_1855/hf \ + tokenizer.name=meta-llama/Llama-3.1-8B-Instruct \ + data.dataset_name=HuggingFaceH4/MATH-500 \ + data.dataset_key=test +``` + +Use `generation.model_name` to specify the path to the Hugging Face checkpoint. + +## Results + +In this section we present the results of several reference experiments for the `train_1M` and `train` versions of the dataset. + +### train_1M +Using the above instructions to train a Llama-3.1-8B model for 1 epoch on the `train_1M` version of the OpenMathInstruct-2 dataset, we get the following loss curve: + +![image](/assets/sft-openmathinstruct2-train1M-loss.png) + +Evaluating the final checkpoint on MATH-500, we get the following result: + +``` +============================================================ +model_name='hf' dataset_name='MATH-500' +max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1 + +metric='pass@1' num_tests_per_prompt=1 + +score=0.5020 (251.0/500) +============================================================ +``` + +As a reference, using NeMo-Aligner and NeMo-Skills (as is done in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560)) to train and evaluate the same model on the same dataset achieves the same score of 0.5020 on MATH-500. + +### train +We also trained a Llama-3.1-8B model for 1 epoch on the full `train` version of the OpenMathInstruct-2 dataset. 
We obtain the following loss curve: + +![image](/assets/sft-openmathinstruct2-train-loss.png) + +Evaluating the final checkpoint on MATH-500, we get the following result: + +``` +============================================================ +model_name='hf' dataset_name='MATH-500' +max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1 + +metric='pass@1' num_tests_per_prompt=1 + +score=0.6220 (311.0/500) +============================================================ +``` + +Using NeMo-Aligner and NeMo-Skills to train the model in the same settings achieves a score of 0.6140 (307/500). + +As another point of reference, using a checkpoint after 10,000 steps of training using NeMo-RL achieves a score of 0.5800 (290.0/500). diff --git a/fern/v0.5.0/pages/guides/sft.mdx b/fern/v0.5.0/pages/guides/sft.mdx new file mode 100644 index 0000000000..44dfef47f3 --- /dev/null +++ b/fern/v0.5.0/pages/guides/sft.mdx @@ -0,0 +1,324 @@ +--- +title: Supervised Fine-Tuning in NeMo RL +description: "" +--- + +This document explains how to perform SFT within NeMo RL. It outlines key operations, including initiating SFT runs, managing experiment configurations using YAML, and integrating custom datasets that conform to the required structure and attributes. + +## Launch an SFT Run + +The script, [examples/run_sft.py](/../../examples/run_sft.py), can be used to launch an experiment. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](/../cluster). + +Be sure to launch the job using `uv`. The command to launch an SFT job is as follows: + +```bash +uv run examples/run_sft.py --config +``` + +If not specified, `config` will default to [examples/configs/sft.yaml](/../../examples/configs/sft.yaml). + +## Example Configuration File + +NeMo RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](/../../examples/configs/sft.yaml). 
+ +To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example: + +```bash +uv run examples/run_sft.py \ + cluster.gpus_per_node=1 \ + logger.wandb.name="sft-dev-1-gpu" +``` + +**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models. + +## Datasets + +SFT datasets in NeMo RL are encapsulated using classes. Each SFT data class is expected to have the following attributes: + 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below. + 2. `task_name`: A string identifier that uniquely identifies the dataset. + +SFT datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](/../design-docs/chat-datasets) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [response_datasets/squad.py](/../../nemo_rl/data/datasets/response_datasets/squad.py) has an example: + +**Note:** The `task_name` field is required in each formatted example. + +```python +def format_data(self, data: dict[str, Any]) -> dict[str, Any]: + return { + "messages": [ + { + "role": "system", + "content": data["context"], + }, + { + "role": "user", + "content": data["question"], + }, + { + "role": "assistant", + "content": data["answers"]["text"][0], + }, + ], + "task_name": self.task_name, + } +``` + +NeMo RL SFT uses Hugging Face chat templates to format the individual examples. Three types of chat templates are supported, which can be configured using the `tokenizer.chat_template` in your YAML config (see [sft.yaml](/../../examples/configs/sft.yaml) for an example): + +1. Apply the tokenizer's default chat template. 
To use the tokenizer's default, either omit `tokenizer.chat_template` from the config altogether, or set `tokenizer.chat_template="default"`.
2. Use a "passthrough" template which simply concatenates all messages. This is desirable if the chat template has been applied to your dataset as an offline preprocessing step. In this case, you should set `tokenizer.chat_template` to null as follows:
    ```yaml
    tokenizer:
      chat_template: NULL
    ```
3. Use a custom template: create a string template in [Jinja format](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template) and add that string to the config. For example,

    ```yaml
    tokenizer:
      custom_template: "{% for message in messages %}{%- if message['role'] == 'system' %}{{'Context: ' + message['content'].strip()}}{%- elif message['role'] == 'user' %}{{' Question: ' + message['content'].strip() + ' Answer: '}}{%- elif message['role'] == 'assistant' %}{{message['content'].strip()}}{%- endif %}{% endfor %}"
    ```

By default, NeMo RL has some built-in supported datasets (e.g., [OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [Squad](/../../nemo_rl/data/datasets/response_datasets/squad.py), etc.); you can see the full list [here](/../../nemo_rl/data/datasets/response_datasets/__init__.py).
All of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.

We provide a [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py) class for loading JSONL-formatted response datasets from a local path or Hugging Face. You can use `input_key` and `output_key` to specify which fields in your data correspond to the question and answer, respectively.
Here's an example configuration:
```yaml
data:
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override input_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
    input_key: question
    subset: null # used for HuggingFace datasets
    split: train # used for HuggingFace datasets
    split_validation_size: 0.05 # use 5% of the training data as validation data
    seed: 42 # seed for train/validation split when split_validation_size > 0
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: ResponseDataset
    input_key: input
    output_key: output
    prompt_file: null
    system_prompt_file: null
    processor: "sft_processor"
```

Your JSONL files should contain one JSON object per line with the following structure:

```json
{
  "input": "Hello", // str — the question (field name set by input_key)
  "output": "Hi there!" // str — the answer (field name set by output_key)
}
```

We support using multiple datasets for train and validation; refer to `examples/configs/grpo_multiple_datasets.yaml` for a full configuration example. Here's an example configuration:
```yaml
data:
  _override_: true # override the data config instead of merging with it
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # train dataset 1
    - dataset_name: OpenMathInstruct-2
      split_validation_size: 0.05 # use 5% of the training data as validation data
      seed: 42 # seed for train/validation split when split_validation_size > 0
    # train dataset 2
    - dataset_name: DeepScaler
  validation:
    # validation dataset 1
    - dataset_name: AIME2024
      repeat: 16
    # validation dataset 2
    - dataset_name: DAPOMathAIME2024
  # default settings for all datasets
  default:
    ...
```

We also support using a single dataset for both train and validation by setting `split_validation_size` to the fraction of training data to hold out for validation.
[OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py), and [Tulu3SftMixtureDataset](/../../nemo_rl/data/datasets/response_datasets/tulu3.py) support this feature.
If you want to support this feature for your custom datasets or other built-in datasets, add code similar to [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py):
```python
# `self.val_dataset` is used (not None) only when the current dataset serves both training and validation
self.val_dataset = None
self.split_train_validation(split_validation_size, seed)
```

### OpenAI Format Datasets (with Tool Calling Support)

NeMo RL also supports datasets in the OpenAI conversation format, which is commonly used for chat models and function calling. This format is particularly useful for training models with tool-use capabilities.
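A note before the configuration details: columnar dataset loaders tend to force every example onto one unified schema, which matters once tool-call arguments differ between tools (see the heterogeneous-schemas section below). The unification behavior can be illustrated in plain Python (a hypothetical loader sketch, not the actual HuggingFace code path):

```python
def unify_schemas(records):
    """Mimic a columnar loader that forces all dicts onto one shared key set."""
    all_keys = set()
    for record in records:
        all_keys.update(record)
    # Missing keys are padded with None, changing each record's structure
    return [{key: record.get(key) for key in sorted(all_keys)} for record in records]

tool_call_args = [
    {"query": "search term"},               # tool A's arguments
    {"expression": "2+2", "precision": 2},  # tool B's arguments
]
padded = unify_schemas(tool_call_args)
# Tool A's arguments now carry spurious None-valued keys:
# {"expression": None, "precision": None, "query": "search term"}
```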
#### Basic Usage

To use an OpenAI format dataset, configure your YAML as follows:

```yaml
data:
  train:
    dataset_name: openai_format
    data_path: <path_to_training_data> # Path to training data
    chat_key: "messages" # Key for messages in the data (default: "messages")
    system_key: null # Key for system message in the data (optional)
    system_prompt: null # Default system prompt if not in data (optional)
    tool_key: "tools" # Key for tools in the data (default: "tools")
    use_preserving_dataset: false # Set to true for heterogeneous tool schemas (see below)
  validation:
    ...
```

#### Data Format

Your JSONL files should contain one JSON object per line with the following structure:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": "I'll check the weather for you.", "tool_calls": [
      {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
    ]},
    {"role": "tool", "content": "22°C, sunny", "tool_call_id": "call_123"},
    {"role": "assistant", "content": "The weather in Paris is currently 22°C and sunny."}
  ],
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "city": {"type": "string", "description": "City name"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      }
    }
  ]
}
```

#### Tool Calling with Heterogeneous Schemas

When your dataset contains tools with different argument structures (heterogeneous schemas), you should enable `use_preserving_dataset: true` to avoid data corruption:

```yaml
data:
  dataset_name: openai_format
  ...
  use_preserving_dataset: true # IMPORTANT: Enable this for tool calling datasets
```

**Why this matters:** Standard HuggingFace dataset loading enforces uniform schemas by adding `None` values for missing keys.
For example:
- Tool A has arguments: `{"query": "search term"}`
- Tool B has arguments: `{"expression": "2+2", "precision": 2}`

Without `use_preserving_dataset: true`, the loader would incorrectly add:
- Tool A becomes: `{"query": "search term", "expression": None, "precision": None}`
- Tool B becomes: `{"query": None, "expression": "2+2", "precision": 2}`

This corrupts your training data and can lead to models generating invalid tool calls. The `PreservingDataset` mode maintains the exact structure of each tool call.

## Evaluate the Trained Model

Upon completion of the training process, you can refer to our [evaluation guide](/eval) to assess model capabilities.

## LoRA Configuration

NeMo RL supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, including Nano‑v3 models. LoRA reduces the number of trainable parameters by learning low-rank matrices for the weight updates while keeping the base model frozen.

Notes:
- LoRA is supported with the DTensor v2 and Megatron backends; the DTensor backend is used by default. DTensor v1 does not support LoRA (ensure `policy.dtensor_cfg._v2=true` when using DTensor).
- Triton kernels are only used in the DTensor v2 path. For `tensor_parallel_size > 1`, Automodel currently does not support Triton kernels (see note below).
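The configuration parameters below map directly onto the underlying math: a frozen weight `W` plus a trainable low-rank update `B @ A` scaled by `alpha/dim`. A small NumPy sketch of the idea (an illustration, not NeMo RL's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, dim, alpha = 16, 8, 4, 32  # dim is the LoRA rank (r)

W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(dim, d_in)) * 0.1  # trainable; e.g. xavier/uniform init
B = np.zeros((d_out, dim))              # trainable; initialized to zero

def lora_forward(x):
    # Base projection plus the low-rank update scaled by alpha/dim
    return x @ W.T + (alpha / dim) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
# With B initialized to zero, the adapter starts as an exact no-op
assert np.allclose(lora_forward(x), x @ W.T)

# Trainable parameters: dim * (d_in + d_out) = 96, vs d_in * d_out = 128 for a full update
```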
+ +### DTensor Configuration Parameters + +The LoRA configuration is specified under the `policy.dtensor_cfg.lora_cfg` section: + +```yaml +policy: + dtensor_cfg: + lora_cfg: + enabled: False # Set to True to enable LoRA fine-tuning + target_modules: [] # List of module names to apply LoRA + exclude_modules: [] # List of module names to exclude from LoRA + match_all_linear: true # Apply LoRA to all linear layers + dim: 8 # LoRA rank (r): controls adaptation capacity + alpha: 32 # LoRA scaling factor (effective lr = alpha/dim) + dropout: 0.0 # Dropout probability for LoRA layers + dropout_position: "post" # Dropout position: "pre" or "post" + lora_A_init: "xavier" # Initialization method: "xavier" or "uniform" + use_triton: true # Use Triton-optimized kernels (DTensor v2 path) +``` + +### DTensor (Automodel) Parameter Details +- **`enabled`** (bool): Whether to enable LoRA training +- **`target_modules`** (list): Specific module names to apply LoRA. Empty with `match_all_linear=true` applies to all linear layers +- **`exclude_modules`** (list): Module names to exclude from LoRA +- **`match_all_linear`** (bool): When `true`, applies LoRA to all linear layers (overrides `target_modules`) +- **`dim`** (int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64 +- **`alpha`** (int): LoRA scaling factor. Effective learning rate multiplier = `alpha/dim`. Typical: 16, 32, 64 +- **`dropout`** (float): Dropout probability for regularization +- **`dropout_position`** (str): Apply dropout before ("pre") or after ("post") LoRA +- **`lora_A_init`** (str): Initialization method for LoRA A matrix +- **`use_triton`** (bool): Use Triton-optimized kernels for better performance. Used for DTensor v2 only. **Note**: [Automodel does not support Triton for TP > 1](https://github.com/NVIDIA-NeMo/Automodel/blob/b2db55eee98dfe81a8bfe5e23ac4e57afd8ab261/nemo_automodel/recipes/llm/train_ft.py#L199). 
Set to `false` when `tensor_parallel_size > 1` to avoid compatibility issues + +### DTensor Example Usage + +```bash +uv run examples/run_sft.py policy.dtensor_cfg.lora_cfg.enabled=true +``` +For the Nano‑v3 SFT LoRA recipe, see [sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml](/../../examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml). + +### Megatron Configuration Parameters + +The LoRA configuration is specified under the `policy.megatron_cfg.peft` section: + +```yaml +policy: + megatron_cfg: + peft: + enabled: false # Set to true to enable LoRA fine-tuning + target_modules: [] # List of module names to apply LoRA, defaults to all linear layers + exclude_modules: [] # List of module names not to apply LoRA to + dim: 32 # LoRA rank (r): controls adaptation capacity + alpha: 32 # LoRA scaling factor (effective lr = alpha/dim) + dropout: 0.0 # Dropout probability for LoRA layers + dropout_position: "pre" # Dropout position: "pre" or "post" + lora_A_init_method: "xavier" # Initialization method for lora A: "xavier" or "uniform" + lora_B_init_method: "zero" # Initialization method for lora B: "zero" + a2a_experimental: false # Enables the experimental All-to-All (A2A) communication strategy + lora_dtype: None # Weight dtype; defaults to the original linear layer's dtype +``` + +### Megatron Parameter Details +- **`enabled`** (bool): Whether to enable LoRA training +- **`target_modules`** (list): Specific module names to apply LoRA. Defaults to all linear layers if the list is left empty. Example: ['linear_qkv', 'linear_proj', 'linear_fc1', 'linear_fc2']. + - 'linear_qkv': Apply LoRA to the fused linear layer used for query, key, and value projections in self-attention. + - 'linear_proj': Apply LoRA to the linear layer used for projecting the output of self-attention. + - 'linear_fc1': Apply LoRA to the first fully-connected layer in MLP. + - 'linear_fc2': Apply LoRA to the second fully-connected layer in MLP. + Target modules can also contain wildcards.
For example, you can specify `target_modules=['*.layers.0.*.linear_qkv', '*.layers.1.*.linear_qkv']` to add LoRA to only linear_qkv on the first two layers. +- **`exclude_modules`** (List[str], optional): A list of module names not to apply LoRA to. LoRA will be applied to all `nn.Linear` and `nn.Linear`-adjacent modules whose names do not match any string in `exclude_modules`. If used, `target_modules` must be an empty list or None. +- **`dim`** (int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64 +- **`alpha`** (int): LoRA scaling factor. Effective learning rate multiplier = `alpha/dim`. Typical: 16, 32, 64 +- **`dropout`** (float): Dropout probability for regularization, defaults to 0.0 +- **`dropout_position`** (str): Apply dropout before ("pre") or after ("post") LoRA +- **`lora_A_init_method`** (str): Initialization method for lora_A (choices: ['xavier', 'uniform']), defaults to "xavier". +- **`lora_B_init_method`** (str): Initialization method for the low-rank matrix B. Defaults to "zero". +- **`a2a_experimental`** (bool): Enables the experimental All-to-All (A2A) communication strategy. Defaults to False. +- **`lora_dtype`** (torch.dtype): Weight dtype; defaults to the original linear layer's dtype, but must be specified explicitly for quantized weights (e.g., 4-bit). + +### Megatron Example Usage +The config uses DTensor by default, so the Megatron backend needs to be explicitly enabled. +```sh +uv run examples/run_sft.py \ + --config examples/configs/sft.yaml \ + policy.dtensor_cfg.enabled=false \ + policy.megatron_cfg.enabled=true \ + policy.megatron_cfg.peft.enabled=true +``` + +For more details on LoRA, see [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685).
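To make the effect of `dim` concrete, here is a quick back-of-the-envelope sketch (the helper below is illustrative, not part of NeMo RL): for a single weight matrix, full fine-tuning trains `d_out * d_in` parameters, while LoRA trains only `r * (d_in + d_out)`.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA factorizes the weight update into A (r x d_in) and B (d_out x r);
    # only these two low-rank matrices are trained.
    return r * d_in + d_out * r

full = 4096 * 4096                      # full fine-tuning of one 4096x4096 layer
lora = lora_param_count(4096, 4096, 8)  # dim (r) = 8
print(lora, f"{lora / full:.2%}")       # 65536 trainable params, ~0.39% of full
```

With `alpha: 32` and `dim: 8` as above, the effective scaling factor `alpha/dim` applied to the LoRA update is 4.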
diff --git a/fern/v0.5.0/pages/guides/use-custom-vllm.mdx b/fern/v0.5.0/pages/guides/use-custom-vllm.mdx new file mode 100644 index 0000000000..f1db5b88cc --- /dev/null +++ b/fern/v0.5.0/pages/guides/use-custom-vllm.mdx @@ -0,0 +1,159 @@ +--- +title: Experiment with Custom vLLM +description: "" +--- + +This guide explains how to use your own version of vLLM while leveraging a pre-compiled vLLM wheel, so you don't have to recompile the C++ source code. + +## Clone and Build Your Custom vLLM + +Clone your vLLM fork and build it using the provided script. For example: + +```sh +# Usage: bash tools/build-custom-vllm.sh <git-repo-url> <branch> <precompiled-wheel-url> +bash tools/build-custom-vllm.sh https://github.com/terrykong/vllm.git terryk/demo-custom-vllm https://wheels.vllm.ai/862f2ef893d9751db0a92bd2d4ae0e3d9677872f/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl + +# [INFO] pyproject.toml updated. NeMo RL is now configured to use the local vLLM at 3rdparty/vllm. +# [INFO] Verify this new vllm version by running: +# +# VLLM_PRECOMPILED_WHEEL_LOCATION=http://.....whl \ +# uv run --extra vllm vllm serve Qwen/Qwen3-0.6B +# +# [INFO] For more information on this custom install, visit https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/use-custom-vllm.md +# [IMPORTANT] Remember to set the shell variable 'VLLM_PRECOMPILED_WHEEL_LOCATION' when running NeMo RL apps with this custom vLLM to avoid re-compiling. +``` + +This script does the following: +1. Clones the `vllm` fork you specify at the given branch. +2. Builds `vllm`. +3. Updates NeMo RL's `pyproject.toml` to work with this `vllm`. +4. Updates `uv.lock`. + +Make sure to add the updated `pyproject.toml` and `uv.lock` to version control so that your branch can be reproduced by others.
+ +## Verify Your Custom vLLM in Isolation +Test your setup to ensure your custom vLLM is being used: +```sh +uv run --extra vllm python -c 'import vllm; print(f"Successfully imported vLLM version: {vllm.__version__}")' +# Uninstalled 1 package in 1ms +# Installed 1 package in 2ms +# Hi! If you see this, you're using a custom version of vLLM for the purposes of this tutorial +# INFO 06-18 09:22:44 [__init__.py:244] Automatically detected platform cuda. +# Successfully imported vLLM version: 0.0.1.dev1+g69d5add74.d20250910 +``` + +If you don't see the log message `Hi! If you see this...`, it's because this message is unique to the tutorial's specific `vLLM` fork. It was added in [this commit](https://github.com/terrykong/vllm/commit/69d5add744e51b988e985736f35c162d3e87b683) and doesn't exist in the main `vLLM` project. + +## Running NeMo RL Apps with Custom vLLM + +To ensure the custom vLLM install is set up properly in NeMo RL applications, always run the following before anything else: + +```sh +# Ensures vLLM uses the precompiled wheel and avoids recompiling C++ sources +export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/862f2ef893d9751db0a92bd2d4ae0e3d9677872f/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl +# Ensures worker venvs are rebuilt to use the custom vLLM. Otherwise it may use the cached version in cached venvs +export NRL_FORCE_REBUILD_VENVS=true +# This isn't necessary if you only do `uv run foobar.py`, but may be needed if you are switching between optional extras (e.g., `uv run --extra vllm foobar.py`). If you are unsure whether you need this, it's safer to include it. +uv pip install setuptools_scm +``` + +Then run your application: +```sh +uv run examples/run_grpo.py +``` + +## Re-building the NeMo RL Docker Image + +Using a custom vLLM may require you to rebuild the Docker image. The two most common reasons are: + +1.
The `ray` version was changed, so you **must** rebuild the image to allow `ray.sub` to start the Ray cluster with the same version as the application. +2. Many dependencies changed and add a large overhead when `NRL_FORCE_REBUILD_VENVS=true` is set to rebuild venvs, so you may wish to cache the dependencies in the image to avoid rebuilding/re-pulling wheels. + +For convenience, you can have the image build your custom vLLM by running the same script inside the Docker build. +Pass `--build-arg BUILD_CUSTOM_VLLM=1` to enable this path; the build will create `3rdparty/vllm` and source `3rdparty/vllm/nemo-rl.env` automatically. + +```sh +docker buildx build \ + --build-arg BUILD_CUSTOM_VLLM=1 \ + --target release \ + --build-context nemo-rl=. \ + -f docker/Dockerfile \ + --tag /nemo-rl:latest \ + --push \ + . +``` + +### SSH Setup for Private Repositories + +If your custom vLLM is hosted in a **private repository** (e.g., internal GitLab), you need to set up SSH agent forwarding for Docker to clone it during the build. + +#### Prerequisites +1. Your SSH key must be registered on the Git server (GitLab/GitHub) +2. The key must **not be expired** - check your Git server's SSH key settings +3. The key must be loaded into your local ssh-agent + +#### Step 1: Verify your SSH key works + +```sh +# For GitLab (adjust host/port as needed) +ssh -T git@gitlab.example.com -p 12051 + +# You should see: "Welcome to GitLab, @username!"
+# If you see "Your SSH key has expired", renew it on the server +``` + +#### Step 2: Load your SSH key into the agent + +```sh +# Check if an ssh-agent is already running +echo $SSH_AUTH_SOCK + +# If empty, start one (this also sets SSH_AUTH_SOCK which `docker buildx` expects to be set when using `--ssh default`) +eval "$(ssh-agent -s)" + +# Clear any old/expired keys from the agent +ssh-add -D + +# Add your SSH key (use the key registered on your Git server) +ssh-add ~/.ssh/id_ed25519 + +# Verify it's loaded +ssh-add -l +``` + +#### Step 3: Run the Docker build with SSH forwarding + +```sh +docker buildx build \ + --build-arg BUILD_CUSTOM_VLLM=1 \ + --target release \ + --build-context nemo-rl=. \ + -f docker/Dockerfile \ + --ssh default \ + --tag /nemo-rl:latest \ + --push \ + . +``` + +## Running Applications with a Custom vLLM Container + +When using a container built with custom vLLM, **use the frozen environment workflow** (bare `python`) instead of `uv run` with `NRL_FORCE_REBUILD_VENVS=true`. + +```sh +# Recommended: use bare python (frozen environment) +python examples/run_grpo.py + +# NOT recommended with custom vLLM containers: +# uv run examples/run_grpo.py +# or +# NRL_FORCE_REBUILD_VENVS=true uv run examples/run_grpo.py +``` + +### Why Not Use `uv run` or Rebuild Venvs? + +Rebuilding worker virtual environments (via `uv run` or `NRL_FORCE_REBUILD_VENVS=true`) requires having the custom vLLM compiled locally. However, compiling vLLM requires a container environment with the correct CUDA toolchain—creating a chicken-and-egg problem. + +The container already has vLLM built and cached in the frozen environments. Using bare `python` leverages these pre-built environments directly, avoiding the need to recompile vLLM at runtime. + +> [!TIP] +> For more details on frozen environments and how they differ from `uv run`, see the [Dependency Management](/../design-docs/dependency-management#frozen-environments) documentation. 
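As a quick sanity check (an illustrative snippet, not part of the repo), you can confirm which interpreter and `vllm` install a bare `python` resolves to inside the container:

```python
import importlib.util
import sys

# In a frozen container environment, both paths should point inside the
# baked-in venv rather than a per-run venv rebuilt by `uv run`.
print("python:", sys.executable)
spec = importlib.util.find_spec("vllm")
print("vllm:", spec.origin if spec else "not installed in this environment")
```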
diff --git a/fern/v0.5.0/pages/index.mdx b/fern/v0.5.0/pages/index.mdx new file mode 100644 index 0000000000..33284cf497 --- /dev/null +++ b/fern/v0.5.0/pages/index.mdx @@ -0,0 +1,146 @@ +--- +title: NeMo RL Documentation +description: "" +--- + +Welcome to the NeMo RL documentation. NeMo RL is an open-source post-training library developed by NVIDIA, designed to streamline and scale reinforcement learning methods for multimodal models (LLMs, VLMs, etc.). + +This documentation provides comprehensive guides, examples, and references to help you get started with NeMo RL and build powerful post-training pipelines for your models. + +## Getting Started + + + + + +Learn about NeMo RL's architecture, design philosophy, and key features that make it ideal for scalable reinforcement learning. + + + + + +Get up and running quickly with examples for both DTensor and Megatron Core training backends. + + + + + +Step-by-step instructions for installing NeMo RL, including prerequisites, system dependencies, and environment setup. + + + + + +Explore the current features and upcoming enhancements in NeMo RL, including distributed training, advanced parallelism, and more. + + + + + +Troubleshooting common issues including missing submodules, Ray dashboard access, and debugging techniques. + + + + + +## Training and Generation + + + + + +Learn about DTensor and Megatron Core training backends, their capabilities, and how to choose the right one for your use case. + + + + + +Discover supported algorithms including GRPO, SFT, DPO, RM, and on-policy distillation with detailed guides and examples. + + + + + +Learn how to evaluate your models using built-in evaluation datasets and custom evaluation pipelines. + + + + + +Configure and deploy NeMo RL on multi-node Slurm or Kubernetes clusters for distributed computing. + + + + + +## Guides and Examples + + + + + +Reproduce DeepscaleR results with NeMo RL using GRPO on mathematical reasoning tasks. 
+ + + + + +Step-by-step guide for supervised fine-tuning on the OpenMathInstruct2 dataset. + + + + + +Create custom reward environments and integrate them with NeMo RL training pipelines. + + + + + +Learn how to add support for new model architectures in NeMo RL. + + + + + +## Advanced Topics + + + + + +Deep dive into NeMo RL's architecture, APIs, and design decisions for scalable RL. + + + + + +Tools and techniques for debugging distributed Ray applications and RL training runs. + + + + + +Optimize large language models with FP8 quantization for faster training and inference. + + + + + +Build and use Docker containers for reproducible NeMo RL environments. + + + + + +## API Reference + + + + + +Comprehensive reference for all NeMo RL modules, classes, functions, and methods. Browse the complete Python API with detailed docstrings and usage examples. + + + + diff --git a/fern/v0.5.0/pages/local-workstation.mdx b/fern/v0.5.0/pages/local-workstation.mdx new file mode 100644 index 0000000000..5384ba5fe5 --- /dev/null +++ b/fern/v0.5.0/pages/local-workstation.mdx @@ -0,0 +1,38 @@ +--- +title: Run on Your Local Workstation +description: "" +--- + +When launching examples locally with `uv`, `init_ray()` will first attempt to connect to an existing cluster. If none is found, it will start a local one and connect to it using all available GPU and CPU resources on your node. + +To launch a job outside of a container, simply run: + +```sh +uv run examples/run_grpo.py +``` + +In the logs, you will see that Ray has started a local cluster instance, along with details on the resources made available to it: +``` +2025-03-17 13:37:45,360 INFO worker.py:1841 -- Started a local Ray instance. +... 
+INFO:nemo_rl.distributed.virtual_cluster:Started local cluster with: {'node:__internal_head__': 1.0, 'CPU': 24.0, 'object_store_memory': 80448493977.0, 'accelerator_type:RTX': 1.0, 'memory': 177713152615.0, 'GPU': 1.0, 'node:10.0.0.1': 1.0} +``` + +To have more precise control over the GPUs Ray uses locally, use `CUDA_VISIBLE_DEVICES`: + +```sh +# Use the 0th and 3rd indexed GPU (for a total of 2 GPUs) +CUDA_VISIBLE_DEVICES=0,3 uv run examples/run_grpo.py +``` + +We also allow multiple colocated local clusters, which are uniquely identified by the values in +`CUDA_VISIBLE_DEVICES`. Concretely: + +```sh +# (1) Start a fresh cluster on GPU=0 +CUDA_VISIBLE_DEVICES=0 uv run examples/run_grpo.py + +# (2) While (1) is running, this will start a new cluster using GPUs 1 and 2 without interfering with (1) +# Ensure that CUDA_VISIBLE_DEVICES does not overlap with already-running jobs. +CUDA_VISIBLE_DEVICES=1,2 uv run examples/run_grpo.py +``` diff --git a/fern/v0.5.0/pages/model-quirks.mdx b/fern/v0.5.0/pages/model-quirks.mdx new file mode 100644 index 0000000000..07f7774548 --- /dev/null +++ b/fern/v0.5.0/pages/model-quirks.mdx @@ -0,0 +1,52 @@ +--- +title: Model Quirks +description: "" +--- + +This document outlines special cases and model-specific behaviors that require custom handling in NeMo RL. These special cases are controlled by the `ModelFlag` enum. + +## Gemma-3 + +### vLLM Initialization + +Gemma-3 models have a specific issue with vLLM dummy weight initialization due to a vLLM bug where [a `normalizer` buffer is created](https://github.com/vllm-project/vllm/blob/964472b9667508b1d4a7ed92068ff81740ae0036/vllm/model_executor/models/gemma3.py#L372) that is not present in the Hugging Face model. This causes the `normalizer` buffer to be set to dummy weights at initialization and then never updated with the correct values during model refit.
As a workaround for this issue, we do not use dummy weight initialization for vLLM with Gemma-3 models and instead use the `load_format="auto"` setting to load the full weights at initialization. + +**Special Handling:** +- We automatically use `load_format="auto"` for Gemma-3 models when initializing vLLM. +- This avoids issues with dummy weight initialization, where the dummy weights for this buffer would never get overwritten during refit. + +### vLLM V1 Runtime + +NeMo-RL uses the vLLM V1 runtime for both synchronous and asynchronous inference. The V1 runtime provides improved performance and stability for inference. + +**Special Handling:** +- Both sync and async inference modes use the V1 runtime by default. +- Users can override to the V0 runtime by setting the environment variable `NRL_VLLM_USE_V1=0`. +- **Important**: The async implementation always uses the V1 runtime. Users who need to use the V0 runtime must switch to synchronous inference by setting `policy.generation.vllm_cfg.async_engine=False`. + +### Context Parallel with FSDP2 + +- NeMo-RL implements this feature on top of the torch CP [implementation](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/experimental/_attention.py) and inherits its limitations. +Whether a model supports CP depends only on the arguments passed to `torch.nn.functional.scaled_dot_product_attention`. NeMo-RL currently passes an all-ones attention mask to `model.forward`. Gemma-3 does not ignore this attention mask, so `attn_bias` is not None, which torch CP does not support. See this [assertion](https://github.com/pytorch/pytorch/blob/134179474539648ba7dee1317959529fbd0e7f89/torch/distributed/tensor/experimental/_attention.py#L262). + +- Context parallel can't be used together with sequence packing. Sequence packing requires `attn_implementation="flash_attention_2"`, which conflicts with context parallel's requirement of the SDPA implementation.
See [this check in `transformers`](https://github.com/huggingface/transformers/blob/bda75b4011239d065de84aa3e744b67ebfa7b245/src/transformers/modeling_utils.py#L2317) for more details. + +- It's a known issue that context parallel can't be used together with sequence parallel. +See [this GitHub issue](https://github.com/NVIDIA-NeMo/RL/issues/659) for more details. + +## DeepScaleR Recipe Convergence Issues + +The DeepScaleR recipe (e.g., `examples/configs/grpo-deepscaler-1.5b-8K.yaml`) has been found to experience convergence issues when CUDA graphs are enabled in vLLM. + +**Special Handling:** +- CUDA graphs must be disabled by setting `enforce_eager: True` in the vLLM configuration (https://github.com/NVIDIA-NeMo/RL/pull/857 forces eager execution by default). + +## vLLM Async Rollout Timeout + +vLLM async generation has a configurable timeout for waiting for individual sample results. This is particularly important for longer sequences on large models. + +```bash +export NRL_VLLM_ASYNC_TIMEOUT_SECONDS=1800 # Default: 600 (10 minutes) +``` + +If you encounter timeout errors, the system will suggest doubling the current timeout value. diff --git a/fern/v0.5.0/pages/nsys-profiling.mdx b/fern/v0.5.0/pages/nsys-profiling.mdx new file mode 100644 index 0000000000..c6f88c7c9b --- /dev/null +++ b/fern/v0.5.0/pages/nsys-profiling.mdx @@ -0,0 +1,147 @@ +--- +title: Profile GPU with Nsys +description: "" +--- + +NeMo RL supports Nsight profiling for Ray workers through environment variable pattern matching. This allows you to selectively profile specific worker types without modifying code or affecting the performance of workers that don't need profiling. + +**Note**: To prevent profile files from becoming too large, consider limiting profiling to a smaller number of steps (e.g., 10 steps). + +## Prerequisites + +* Install NVIDIA Nsight Systems (`nsys`) on the compute nodes where workers will run.
For Ubuntu installation instructions, see the [NVIDIA Nsight Systems Installation Guide](https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html#package-manager-installation). + +**Note: If you're using NeMo RL containers, `nsys` is already installed.** + +* Ensure the workers you want to profile have GPU access + +## Configure the Environment Variables + +Set the `NRL_NSYS_WORKER_PATTERNS` environment variable with a comma-separated list of patterns to match worker names: + +```bash +export NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" +``` + +Set the `NRL_NSYS_PROFILE_STEP_RANGE` environment variable to control which training steps the profiler captures. Its +format is colon-separated integers representing `start:stop`, where `start` is inclusive and `stop` is exclusive +(same as slice syntax `arr[start:stop]`). Note that `start` is 1-indexed, so `NRL_NSYS_PROFILE_STEP_RANGE=0:10` would error. + +```bash +export NRL_NSYS_PROFILE_STEP_RANGE=3:5 +``` + +### Pattern Format + +- Use shell-style wildcards (`*`, `?`, `[seq]`, `[!seq]`) +- Patterns are matched against worker names using `fnmatch` +- Multiple patterns are separated by commas +- Whitespace around patterns is automatically stripped +- Empty patterns are ignored + +### Supported Workers + +The supported worker types are: +- **DTensorPolicyWorker**: Pattern matched against `"dtensor_policy_worker"` +- **VllmGenerationWorker**: Pattern matched against `"vllm_generation_worker"` + +## Example Usage + +### Profile Only Policy Workers +```bash +NRL_NSYS_PROFILE_STEP_RANGE=2:3 NRL_NSYS_WORKER_PATTERNS="*policy*" uv run examples/run_grpo.py grpo.max_num_steps=5 +``` + +### Profile Multiple Worker Types + +```bash +NRL_NSYS_PROFILE_STEP_RANGE=1:2 NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" uv run examples/run_grpo.py grpo.max_num_steps=5 +``` + +### Profile Workers with Exact Names + +```bash +NRL_NSYS_PROFILE_STEP_RANGE=3:10 NRL_NSYS_WORKER_PATTERNS="dtensor_policy_worker,vllm_generation_worker" uv run
examples/run_grpo.py grpo.max_num_steps=5 +``` + +### Profile Megatron Workers + +> [!IMPORTANT] +> To profile a Megatron worker, you should set `LD_LIBRARY_PATH` as follows, otherwise you will get errors when loading `libtransformer_engine.so`. + +```bash +LD_LIBRARY_PATH="/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu" \ +NRL_NSYS_PROFILE_STEP_RANGE=2:3 NRL_NSYS_WORKER_PATTERNS="megatron_policy_worker,vllm_generation_worker" uv run examples/run_grpo.py --config examples/configs/grpo_math_1B_megatron.yaml grpo.max_num_steps=5 +``` + +## Profile Output + +When profiling is enabled, it generates the following logs and files: + +1. **Logging**: You'll see log messages indicating which workers have profiling enabled: + ``` + Nsight profiling enabled for worker 'dtensor_policy_worker' (matched pattern '*policy*') + ``` + +2. **Profile Files**: Each profiled worker generates a `.nsys-rep` file with naming pattern: + ``` + dtensor_policy_worker__.nsys-rep + vllm_generation_worker__.nsys-rep + worker_process_.nsys-rep + ``` +If you are not using model parallelism in vLLM, refer directly to `vllm_generation_worker__.nsys-rep` for Nsight reports. If you are using model parallelism, `vllm_generation_worker__.nsys-rep` will be empty, and the `worker_process_.nsys-rep` files are Nsight profiles from vLLM's Ray distributed executors (see https://github.com/vllm-project/vllm/blob/7e3a8dc90670fd312ce1e0d4eba9bf11c571e3ad/vllm/executor/ray_distributed_executor.py#L136 for more information). + +3. **File Location**: Profile files are saved in `/tmp/ray/session*/logs/nsight/` directory on each worker node. Ensure you check both `ls /tmp/ray/session_[0-9]*/logs/nsight` and `ls /tmp/ray/session_latest/logs/nsight` for the profiles, since the "latest" pointer may be stale.
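The pattern and step-range semantics described above can be sketched in a few lines (an illustrative recap, not NeMo RL's actual implementation):

```python
from fnmatch import fnmatch

def should_profile(worker_name: str, patterns_env: str) -> bool:
    # NRL_NSYS_WORKER_PATTERNS: comma-separated shell-style patterns;
    # whitespace is stripped and empty entries are ignored.
    patterns = [p.strip() for p in patterns_env.split(",") if p.strip()]
    return any(fnmatch(worker_name, p) for p in patterns)

def parse_step_range(spec: str) -> range:
    # NRL_NSYS_PROFILE_STEP_RANGE: "start:stop", start inclusive (1-indexed),
    # stop exclusive -- the same convention as slice syntax arr[start:stop].
    start, stop = (int(x) for x in spec.split(":"))
    if start < 1:
        raise ValueError("start is 1-indexed, so e.g. 0:10 is invalid")
    return range(start, stop)

print(should_profile("dtensor_policy_worker", "*policy*, *vllm*"))  # True
print(list(parse_step_range("3:5")))  # [3, 4] -- steps 3 and 4 are profiled
```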
+ +**Note for SLURM users with `ray.sub`**: When using `ray.sub` on SLURM, set `RAY_LOG_SYNC_FREQUENCY=$NUM_SEC` (e.g., `RAY_LOG_SYNC_FREQUENCY=30`) to ensure that the nsight profile files get copied from the container's ephemeral filesystem (`/tmp/ray`) to the persistent directory. The head node's files will be synced to `$SLURM_JOB_ID-logs/ray`, and other nodes' files will be synced to `$SLURM_JOB_ID-logs/ray/$node_ip/` where `$node_ip` is the IP address of the node. + +## Analyze Profile Files + +To analyze the generated profile files, load the `.nsys-rep` files into the NVIDIA Nsight Systems desktop application, which you can download from the [NVIDIA Nsight Systems Get Started page](https://developer.nvidia.com/nsight-systems/get-started). + +### How to Analyze the End-to-End RL Loop All at Once + +Nsight Systems supports [multi-report view](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#viewing-multiple-reports-in-the-same-timeline) functionality. If you open the profiles from different workers (e.g., `*policy_worker*.nsys-rep` and `*generation_worker*.nsys-rep`) in a single multi-report view, you can analyze the behavior of the end-to-end RL loop on the same timeline. + +![Nsys multi report view](/assets/nsys-multi-report-view.png) + +## How We Patched Nsight Support in Ray + +Ray's Nsight profiling support had a bug where it hardcoded the Python executable path instead of using the actual Python executable from the runtime environment. This caused issues when using virtual environments or custom Python installations (`py_executables`). + +### The Problem + +In Ray's `nsight.py` file, the original code was: + +```python +context.py_executable = " ".join(self.nsight_cmd) + " python" +``` + +This hardcoded `" python"` instead of correctly preserving the intended Python executable path.
+ +### The Fix + +To fix this problem, we patched the following line to preserve the original `context.py_executable`: + +```python +context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}" +``` + +### Where We Applied the Patch + +We applied this patch in two locations to cover different deployment scenarios: + +1. **In `ray.sub` (SLURM clusters)**: The patch is applied before Ray's control plane starts up on both head and worker nodes: + ```bash + sed -i 's/context\.py_executable = " "\.join(self\.nsight_cmd) + " python"/context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"/g' /opt/nemo_rl_venv/lib64/python*/site-packages/ray/_private/runtime_env/nsight.py + ``` + +2. **In `nemo_rl/__init__.py` (Local clusters)**: The patch is applied automatically when NeMo RL is imported, making it work seamlessly for local development and testing environments. + +### Why We Needed Both Locations + +- **`ray.sub`**: Required for SLURM-managed clusters where Ray processes start in containers before Python imports happen. The patch must be applied at the filesystem level before Ray's control plane initializes. + +- **`__init__.py`**: Required for local clusters and development environments where users start Ray clusters directly. The patch is applied when `nemo_rl` is imported, ensuring the fix is in place before any Ray processes are spawned. + +This dual approach ensures that Nsight profiling works correctly regardless of how the Ray cluster is deployed. diff --git a/fern/v0.5.0/pages/testing.mdx b/fern/v0.5.0/pages/testing.mdx new file mode 100644 index 0000000000..60469e4a33 --- /dev/null +++ b/fern/v0.5.0/pages/testing.mdx @@ -0,0 +1,325 @@ +--- +title: Test NeMo RL +description: "" +--- + +This guide outlines how to test NeMo RL using unit and functional tests, detailing steps for local or Docker-based execution, dependency setup, and metric tracking to ensure effective and reliable testing. 
+ +## Unit Tests + +> [!IMPORTANT] +> Unit tests require 2 GPUs to test the full suite. + +> [!TIP] +> Some unit tests require setting up test assets which you can download with: +> ```sh +> uv run tests/unit/prepare_unit_test_assets.py +> ``` + +```sh +# Run the unit tests using local GPUs + +# Configuration 1: Default tests only - excludes both hf_gated and mcore tests +uv run --group test bash tests/run_unit.sh + +# Configuration 2: Default + HF gated tests, excluding mcore tests +uv run --group test bash tests/run_unit.sh --hf-gated + +# Configuration 3: ONLY mcore tests, excluding ones with hf_gated +uv run --extra mcore --group test bash tests/run_unit.sh --mcore-only + +# Configuration 4: ONLY mcore tests, including ones with hf_gated +uv run --extra mcore --group test bash tests/run_unit.sh --mcore-only --hf-gated +``` + +### Experimental: Faster Local Test Iteration with pytest-testmon + +We support `pytest-testmon` to speed up local unit test runs by re-running only impacted tests. This works for both regular in-process code and out-of-process `@ray.remote` workers via a lightweight, test-only selection helper. + +Usage: +```sh +# Re-run only impacted unit tests +uv run --group test pytest --testmon tests/unit + +# You can also combine with markers/paths +uv run --group test pytest --hf-gated --testmon tests/unit/models/policy/test_dtensor_worker.py +``` + +What to expect: +- On the first run in a fresh workspace, testmon may run a broader set (or deselect everything if nothing was executed yet) to build its dependency cache. +- On subsequent runs, editing non-remote code narrows selection to only the tests that import/use those modules. +- Editing code inside `@ray.remote` actors also retriggers impacted tests. We maintain a static mapping from test modules to transitive `nemo_rl` modules they import and intersect that with changed files when `--testmon` is present. 
+- After a successful impacted run, a second `--testmon` invocation (with no further edits) will deselect all tests. +- Running `pytest` with `-k some_substring_in_test_name` will always run tests that match even if `--testmon` is passed. + +Limitations and tips: +- Selection is based on Python imports and file mtimes; non-Python assets (YAML/JSON/shell) are not tracked. When editing those, re-run target tests explicitly. +- The remote-aware selection uses a conservative static import map (no dynamic import resolution). If a test loads code dynamically that isn’t visible via imports, you may need to run it explicitly once to seed the map. +- The helper is test-only and does not alter library behavior. It activates automatically when you pass `--testmon`. + +### Refreshing Remote-Selection Artifacts +If you change test layout or significantly refactor imports, the remote-selection artifacts may become stale. +To rebuild them, delete the following files at the repo root and re-run with `--testmon` to seed again: + +```sh +# At the root of nemo-rl +rm .nrl_remote_map.json .nrl_remote_state.json +``` + +### Run Unit Tests in a Hermetic Environment + +For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) +or where environmental configuration may be problematic, tests can be run +in Docker with this script: + +```sh +CONTAINER=... bash tests/run_unit_in_docker.sh +``` + +The required `CONTAINER` can be built by following the instructions in the [Docker documentation](/docker). + +### Track Metrics in Unit Tests + +Unit tests may also log metrics to a fixture. The fixture is called `tracker` and has the following API: + +```python +# Track an arbitrary metric (must be json serializable) +tracker.track(metric_name, metric_value) +# Log the maximum memory across the entire cluster. Okay for tests since they are run serially. +tracker.log_max_mem(metric_name) +# Returns the maximum memory.
Useful if you are measuring changes in memory. +tracker.get_max_mem() +``` + +Including the `tracker` fixture also tracks the elapsed time for the test implicitly. + +Here is an example test: + +```python +def test_exponentiate(tracker): + starting_mem = tracker.get_max_mem() + base = 2 + exponent = 4 + result = base ** exponent + tracker.track("result", result) + tracker.log_max_mem("memory_after_exponentiating") + change_in_mem = tracker.get_max_mem() - starting_mem + tracker.track("change_in_mem", change_in_mem) + assert result == 16 +``` + +Which would produce this file in `tests/unit/unit_results.json`: + +```json +{ + "exit_status": 0, + "git_commit": "f1062bd3fd95fc64443e2d9ee4a35fc654ba897e", + "start_time": "2025-03-24 23:34:12", + "metrics": { + "test_hf_ray_policy::test_lm_policy_generation": { + "avg_prob_mult_error": 1.0000039339065552, + "mean_lps": -1.5399343967437744, + "_elapsed": 17.323044061660767 + } + }, + "gpu_types": [ + "NVIDIA H100 80GB HBM3" + ], + "coverage": 24.55897613282601 +} +``` + +> [!TIP] +> Past unit test results are logged in `tests/unit/unit_results/`. These are helpful to view trends over time and commits. +> +> ```sh +> jq -r '[.start_time, .git_commit, .metrics["test_hf_ray_policy::test_lm_policy_generation"].avg_prob_mult_error] | @tsv' tests/unit/unit_results/* +> +> # Example output: +> #2025-03-24 23:35:39 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552 +> #2025-03-24 23:36:37 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552 +> #2025-03-24 23:37:37 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552 +> #2025-03-24 23:38:14 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552 +> #2025-03-24 23:38:50 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552 +> ``` + +## Functional Tests + +> [!IMPORTANT] +> Functional tests may require multiple GPUs to run. See each script to understand the requirements. + +Functional tests are located under `tests/functional/`. 
+
+```sh
+# Run the functional test for sft
+uv run bash tests/functional/sft.sh
+```
+
+At the end of each functional test, the metric checks will be printed as well as whether they pass or fail. Here is an example:
+
+```text
+                              Metric Checks
+┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ Status ┃ Check                          ┃ Value             ┃ Message ┃
+┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ PASS   │ data["train/loss"]["9"] < 1500 │ 817.4517822265625 │         │
+└────────┴────────────────────────────────┴───────────────────┴─────────┘
+```
+
+### Run Functional Tests in a Hermetic Environment
+
+For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run in Docker with this script:
+
+```sh
+CONTAINER=... bash tests/run_functional_in_docker.sh tests/functional/sft.sh
+```
+
+The required `CONTAINER` can be built by following the instructions in the [Docker documentation](/docker).
+
+## Bisecting Failing Tests
+
+> [!IMPORTANT]
+> Always rsync the `tools/` directory to `tools.bisect/` before starting a bisect:
+>
+> ```sh
+> rsync -ahP --delete tools/ tools.bisect/
+> ```
+>
+> This creates a stable copy of the bisect scripts that won't change as git checks out different commits during the bisect process. Without this, the scripts themselves may change mid-bisect, leading to inconsistent behavior or failures. All examples below reference `tools.bisect/` to ensure you use the stable copy.
+
+### Bisecting Unit/Functional Tests
+
+Use `tools.bisect/bisect-run.sh` to automatically run your test command across a commit range and find the first bad commit. It forces venv rebuilds so dependencies match each commit.
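The driver builds on `git bisect`, which binary-searches the commit range using each run's exit code to classify commits. A rough illustration of that search — a pure-Python stand-in for intuition, not the actual script:

```python
# Illustrative binary search for the first "bad" commit, mirroring what
# git bisect does with exit codes (0 = good, non-zero = bad).
commits = list(range(20))   # hypothetical linear history, oldest first
FIRST_BAD = 13              # the regression lands here (unknown to the search)

def run_test(commit):
    """Stand-in for running the test command at `commit`."""
    return 0 if commit < FIRST_BAD else 1  # exit code

lo, hi = 0, len(commits) - 1  # lo is known good (GOOD), hi is known bad (BAD)
while hi - lo > 1:
    mid = (lo + hi) // 2
    if run_test(commits[mid]) == 0:
        lo = mid  # mid is good: the first bad commit is later
    else:
        hi = mid  # mid is bad: the first bad commit is mid or earlier

print("first bad commit:", commits[hi])  # → first bad commit: 13
```

This is why a bisect over N commits needs only about log2(N) test runs.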
+
+Basic usage:
+
+```sh
+GOOD= BAD= \
+  tools.bisect/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py::test_case
+```
+
+Examples:
+
+```sh
+GOOD=56a6225 BAD=32faafa \
+  tools.bisect/bisect-run.sh uv run --group dev pre-commit run --all-files
+
+GOOD=464ed38 BAD=c843f1b \
+  tools.bisect/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
+```
+
+Notes:
+
+- Exit codes drive the classification: 0=good, non-zero=bad, 125=skip.
+- The script pre-verifies that `GOOD` is actually good by running your command on it.
+- On failure or interruption, it saves a timestamped `git bisect log` to `/bisect-logs/`. You can resume later with `BISECT_REPLAY_LOG` (see below).
+- Set `BISECT_NO_RESET=1` to keep the bisect state after the script exits.
+
+Resume from a saved bisect log:
+
+```sh
+BISECT_REPLAY_LOG=/abs/path/to/bisect-2025....log \
+  tools.bisect/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
+```
+
+### Bisecting Nightlies
+
+Nightly training scripts can be bisected using the same driver plus a helper that sets up hermetic runs on Slurm.
+
+Vanilla flow:
+
+```sh
+# Copy bisect utilities outside of VCS to ensure a stable runner
+rsync -ahP --delete tools/ tools.bisect/
+
+TEST_CASE=tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
+
+HF_HOME=... \
+HF_DATASETS_CACHE=... \
+CONTAINER=... \
+MOUNTS=... \
+ACCOUNT=... \
+PARTITION=... \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
+
+The command `GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE")` selects the commit that introduced the test script. Because the path is typically added only once, this yields the introduction commit to use as the known good baseline.
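You can sanity-check the `--diff-filter=A` behavior in a throwaway repo: only the commit that added the path is printed. The repo contents and commit messages below are made up for illustration:

```python
import os
import subprocess
import tempfile

# Throwaway repo: three commits, with test_case.sh added in the second.
repo = tempfile.mkdtemp()

def git(*args):
    return subprocess.run(
        ["git", "-c", "user.email=you@example.com", "-c", "user.name=you", *args],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout

git("init", "-q")
git("commit", "-q", "--allow-empty", "-m", "first")
with open(os.path.join(repo, "test_case.sh"), "w") as f:
    f.write("echo hi\n")
git("add", "test_case.sh")
git("commit", "-q", "-m", "add test_case.sh")
git("commit", "-q", "--allow-empty", "-m", "third")

# Only the commit that added (A) the path is printed.
print(git("log", "--format=%s", "--diff-filter=A", "--", "test_case.sh").strip())
# → add test_case.sh
```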
+
+- `tools.bisect/launch-bisect-helper.sh` ensures each commit runs in a fresh venv, creates an isolated code snapshot per commit, blocks until metrics are checked, and returns a suitable exit code for bisect.
+
+Progressively more advanced cases:
+
+1) Adjusting the test case on the fly with `SED_CLAUSES`
+
+- If a test script needs small textual edits during bisect (e.g., to relax a threshold or drop a noisy metric you don't care to bisect over when focusing on convergence vs. performance), provide a sed script via `SED_CLAUSES`. You can also use this to adjust runtime controls like `MAX_STEPS`, `STEPS_PER_RUN`, or `NUM_MINUTES` when a performance regression slows runs down, ensuring they still complete and emit metrics. The helper applies it and automatically restores the test script after the run.
+
+```sh
+SED_CLAUSES=$(cat <<'SED'
+s#mean(data\["timing/train/total_step_time"\], -6, -1) < 0\.6#mean(data["timing/train/total_step_time"], -6, -1) < 0.63#
+/ray\/node\.0\.gpu\.0\.mem_gb/d
+SED
+) \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+  tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
+
+2) Passing extra script arguments
+
+- If the nightly script supports Hydra/CLI overrides, pass them via `EXTRA_SCRIPT_ARGS` so each run adopts those overrides (e.g., fix a transient incompatibility):
+
+> [!IMPORTANT]
+> Changing script arguments can materially affect performance characteristics and/or convergence behavior. This may influence the validity of the bisect outcome relative to your baseline configuration. Prefer the smallest, clearly-justified overrides, keep them consistent across all commits, and document them alongside your results so conclusions are interpreted correctly.
+ + +```sh +EXTRA_SCRIPT_ARGS="++data.num_workers=1" \ +GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \ +BAD=HEAD \ + tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE" +``` + +3) Resuming from an earlier interrupted or misclassified session + +- Use `BISECT_REPLAY_LOG` with the bisect driver to replay prior markings and continue running. This is handy if a run failed for an unrelated reason or you manually edited a log to change `bad` → `skip` or to drop an incorrect line. + +```sh +BISECT_REPLAY_LOG=/abs/path/to/bisect-logs/bisect-YYYYmmdd-HHMMSS-.log \ +HF_HOME=... HF_DATASETS_CACHE=... CONTAINER=... MOUNTS=... ACCOUNT=... PARTITION=... \ + tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE" +``` + +Tips and conventions: + +- Exit code 125 means "skip this commit" in git bisect; our helper returns 125 if required env is missing or if it needs to abort safely. +- Submodules must be clean. The bisect script enforces `submodule.recurse=true` and `fetch.recurseSubmodules=on-demand` so submodules follow commit checkouts. +- The bisect script automatically unshallows all submodules at the start to ensure any submodule commit can be checked out during the bisect process. This is important because bisecting may need to jump to arbitrary commits in submodule history. +- Each commit uses a fresh code snapshot directory and a separate Megatron checkpoint dir to avoid cross-commit contamination. +- On failure/interrupt, a timestamped bisect log is saved under `/bisect-logs/`. Use it with `BISECT_REPLAY_LOG` to resume. +- In some unusual cases, the bisect may fail while updating a submodule because it references a commit that is orphaned or deleted. Git will typically print the commit hash it was unable to find (e.g., `fatal: remote error: upload-pack: not our ref `). 
If the commit is simply orphaned, you can try to manually fetch it: + + ```sh + # Assuming Automodel is the submodule with the missing commit + cd 3rdparty/Automodel-workspace/Automodel/ + git fetch origin $the_automodel_commit_that_it_could_not_find + ``` + + If the manual fetch fails, the commit has likely been deleted from the remote. In this case, skip the problematic commit: + + ```sh + git bisect skip $the_nemorl_commit_that_has_the_broken_automodel_commit + ``` + + After skipping, add the skip command to your `BISECT_REPLAY_LOG` file (located in `/bisect-logs/`) so the bisect will continue from where it left off and skip that commit when you relaunch `tools.bisect/bisect-run.sh`: + + ```sh + echo "git bisect skip $the_nemorl_commit_that_has_the_broken_automodel_commit" >> bisect-logs/bisect--.log + ``` diff --git a/fern/versions/v0.5.0.yml b/fern/versions/v0.5.0.yml new file mode 100644 index 0000000000..ed1577dd9a --- /dev/null +++ b/fern/versions/v0.5.0.yml @@ -0,0 +1,137 @@ +navigation: + - section: Home + contents: + - page: Welcome + path: ../v0.5.0/pages/index.mdx + - section: About + contents: + - page: Overview + path: ../v0.5.0/pages/about/overview.mdx + - page: Performance Summary + path: ../v0.5.0/pages/about/performance-summary.mdx + - page: Model Support + path: ../v0.5.0/pages/about/model-support.mdx + - page: Features + path: ../v0.5.0/pages/about/features.mdx + - page: Backends + path: ../v0.5.0/pages/about/backends.mdx + - page: Quick Start + path: ../v0.5.0/pages/about/quick-start.mdx + - page: Installation + path: ../v0.5.0/pages/about/installation.mdx + - section: Algorithms + contents: + - page: Index + path: ../v0.5.0/pages/about/algorithms/index.mdx + - page: SFT + path: ../v0.5.0/pages/about/algorithms/sft.mdx + - page: DPO + path: ../v0.5.0/pages/about/algorithms/dpo.mdx + - page: RM + path: ../v0.5.0/pages/about/algorithms/rm.mdx + - page: GRPO + path: ../v0.5.0/pages/about/algorithms/grpo.mdx + - page: DAPO + path: 
../v0.5.0/pages/about/algorithms/dapo.mdx + - page: On-Policy Distillation + path: ../v0.5.0/pages/about/algorithms/on-policy-distillation.mdx + - page: Evaluation + path: ../v0.5.0/pages/about/evaluation.mdx + - page: Clusters + path: ../v0.5.0/pages/about/clusters.mdx + - page: Tips and Tricks + path: ../v0.5.0/pages/about/tips-and-tricks.mdx + - section: Environment Start + contents: + - page: Local Workstation + path: ../v0.5.0/pages/local-workstation.mdx + - page: Cluster + path: ../v0.5.0/pages/cluster.mdx + - section: E2E Examples + contents: + - page: SFT on OpenMathInstruct2 + path: ../v0.5.0/pages/guides/sft-openmathinstruct2.mdx + - section: Guides + contents: + - page: Nemotron 3 Nano + path: ../v0.5.0/pages/guides/nemotron-3-nano.mdx + - page: Adding New Models + path: ../v0.5.0/pages/adding-new-models.mdx + - page: SFT + path: ../v0.5.0/pages/guides/sft.mdx + - page: DPO + path: ../v0.5.0/pages/guides/dpo.mdx + - page: DAPO + path: ../v0.5.0/pages/guides/dapo.mdx + - page: ProRLv2 + path: ../v0.5.0/pages/guides/prorlv2.mdx + - page: GRPO + path: ../v0.5.0/pages/guides/grpo.mdx + - page: GRPO DeepscaleR + path: ../v0.5.0/pages/guides/grpo-deepscaler.mdx + - page: GRPO Sliding Puzzle + path: ../v0.5.0/pages/guides/grpo-sliding-puzzle.mdx + - page: RM + path: ../v0.5.0/pages/guides/rm.mdx + - page: Environments + path: ../v0.5.0/pages/guides/environments.mdx + - page: Eval + path: ../v0.5.0/pages/guides/eval.mdx + - page: Deepseek + path: ../v0.5.0/pages/guides/deepseek.mdx + - page: Model Quirks + path: ../v0.5.0/pages/model-quirks.mdx + - page: Async GRPO + path: ../v0.5.0/pages/guides/async-grpo.mdx + - page: DTensor TP Accuracy + path: ../v0.5.0/pages/guides/dtensor-tp-accuracy.mdx + - page: FT Launcher Guide + path: ../v0.5.0/pages/guides/ft-launcher-guide.mdx + - section: Containers + contents: + - page: Docker + path: ../v0.5.0/pages/docker.mdx + - section: Development + contents: + - page: Testing + path: ../v0.5.0/pages/testing.mdx + - page: 
Documentation + path: ../v0.5.0/pages/documentation.mdx + - page: Debugging + path: ../v0.5.0/pages/debugging.mdx + - page: NSys Profiling + path: ../v0.5.0/pages/nsys-profiling.mdx + - page: FP8 + path: ../v0.5.0/pages/fp8.mdx + - page: Use Custom vLLM + path: ../v0.5.0/pages/guides/use-custom-vllm.mdx + - section: Design Docs + contents: + - page: Design and Philosophy + path: ../v0.5.0/pages/design-docs/design-and-philosophy.mdx + - page: Padding + path: ../v0.5.0/pages/design-docs/padding.mdx + - page: Logger + path: ../v0.5.0/pages/design-docs/logger.mdx + - page: UV + path: ../v0.5.0/pages/design-docs/uv.mdx + - page: Dependency Management + path: ../v0.5.0/pages/design-docs/dependency-management.mdx + - page: Chat Datasets + path: ../v0.5.0/pages/design-docs/chat-datasets.mdx + - page: Generation + path: ../v0.5.0/pages/design-docs/generation.mdx + - page: Checkpointing + path: ../v0.5.0/pages/design-docs/checkpointing.mdx + - page: Loss Functions + path: ../v0.5.0/pages/design-docs/loss-functions.mdx + - page: FSDP2 Parallel Plan + path: ../v0.5.0/pages/design-docs/fsdp2-parallel-plan.mdx + - page: Training Backends + path: ../v0.5.0/pages/design-docs/training-backends.mdx + - page: Sequence Packing and Dynamic Batching + path: ../v0.5.0/pages/design-docs/sequence-packing-and-dynamic-batching.mdx + - page: Env Vars + path: ../v0.5.0/pages/design-docs/env-vars.mdx + - page: NeMo Gym Integration + path: ../v0.5.0/pages/design-docs/nemo-gym-integration.mdx