_loss`.
+
+## DPO-Specific Parameters
+
+The DPO implementation in NeMo RL supports several key parameters that can be adjusted:
+
+- `dpo.reference_policy_kl_penalty`: Controls the strength of the KL penalty term
+- `dpo.preference_loss_weight`: Weight for the preference loss
+- `dpo.sft_loss_weight`: Weight for the auxiliary SFT loss
+- `dpo.preference_average_log_probs`: Whether to average log probabilities over tokens in the preference loss term
+- `dpo.sft_average_log_probs`: Whether to average log probabilities over tokens in the SFT loss term
+
+These parameters can be adjusted in the config file or via command-line overrides to tune training for your specific use case.
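As a hedged illustration of the parameters listed above (the values shown are arbitrary examples, not recommended defaults), a config fragment might look like:

```yaml
dpo:
  reference_policy_kl_penalty: 0.05
  preference_loss_weight: 1.0
  sft_loss_weight: 0.1
  preference_average_log_probs: false
  sft_average_log_probs: false
```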
+
+## Evaluate the Trained Model
+
+Upon completion of the training process, you can refer to our [evaluation guide](/eval) to assess model capabilities.
diff --git a/fern/v0.5.0/pages/guides/dtensor-tp-accuracy.mdx b/fern/v0.5.0/pages/guides/dtensor-tp-accuracy.mdx
new file mode 100644
index 0000000000..e756ebe062
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/dtensor-tp-accuracy.mdx
@@ -0,0 +1,239 @@
+---
+title: DTensor Tensor Parallel Accuracy Issue
+description: ""
+---
+
+During reinforcement learning (RL) post-training, maintaining accuracy is both **critical and challenging**. Minor numerical deviations can propagate and amplify across policy updates, ultimately distorting reward signals and affecting convergence. Consequently, understanding and mitigating accuracy issues is central to ensuring consistent and reliable training behavior in large-scale distributed RL settings.
+
+## Observed Accuracy Issues Under Tensor Parallelism with DTensor Backend
+
+During our development, we identified that the **tensor parallel (TP)** strategy can be a significant factor contributing to accuracy problems.
+
+We have encountered several accuracy issues related to TP in **DTensor**, including:
+
+1. **For policy models**: We observed severe `token_mult_prob_error` spikes when TP was enabled during post-training of a Qwen3 dense model (e.g., [Qwen/Qwen3-4B-Instruct-2507 · Hugging Face](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)), indicating a significant difference between the training and inference engines.
+2. **For reward models**: The reward model exhibited large discrepancies under different TP configurations.
+3. **For overall model training performance**: Using a $TP > 1$ configuration often leads to degraded downstream performance when utilizing either **DTensorPolicyWorker** or **DTensorPolicyWorkerV2**.
+
+### Misalignment between Training and Inference for Policy Models
+
+Using [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) as an example, Figure 1 illustrates the `token_mult_prob_error` observed during training. We applied a *time-weighted exponential moving average (EMA)* smoothing method and used a logarithmic scale on the Y-axis for better visualization.
+
+The `token_mult_prob_error` [metric](https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/grpo.md#multiplicative-token-probability-error) measures the discrepancy between the inference engine and the training engine when processing the same sample. It is defined as follows:
+
+$$
+\begin{aligned}
+g_i & : \text{the } i^{th} \text{ item in } \text{generation-logprobs}, \\
+p_i & : \text{the } i^{th} \text{ item in } \text{policy-logprobs}, \\
+m_i & : \text{the mask for the } i^{th} \text{ token (1 or 0)}, \\
+&\text{global-valid-toks} = \sum_i m_i \, . \\
+& \text{token-mult-prob-error}= \frac{1}{\text{global-valid-toks}}\sum_{i} m_i \exp\left(\left|g_i - p_i\right|\right)
+\end{aligned}
+$$
+
+In general, **generation logprobs** and **policy logprobs** should align closely, resulting in a `token_mult_prob_error` value near **1.0**. In our development, when this metric exceeds **1.05**, we consider it indicative of a potential framework issue that warrants further investigation.
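As a minimal stdlib sketch (not the framework's implementation), the formula above can be computed directly:

```python
import math

def token_mult_prob_error(gen_logprobs, policy_logprobs, mask):
    """Multiplicative token probability error between the generation and
    training engines, following the formula above (a sketch)."""
    global_valid_toks = sum(mask)
    total = sum(m * math.exp(abs(g - p))
                for g, p, m in zip(gen_logprobs, policy_logprobs, mask))
    return total / global_valid_toks

# Perfectly aligned logprobs give an error of exactly 1.0
print(token_mult_prob_error([-1.2, -0.5], [-1.2, -0.5], [1, 1]))  # 1.0
```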
+
+As shown in Figure 1, numerous spikes can be observed during training. Occasional spikes are acceptable if the `token_mult_prob_error` quickly returns to around 1.0. However, in this case, even with EMA smoothing applied, the figure reveals an overall upward trend, which is unacceptable and indicates a persistent misalignment between the training and inference behaviors.
+
+
+
+Fig 1: The token_mult_prob_error of Qwen3-4B
+
+### Discrepancies Across TP Configurations in Reward Modeling
+
+For the reward model, different TP plans lead to slight but noticeable inconsistencies in the validation loss. As summarized in Table 1, the loss values vary across TP settings, with TP=4 showing a larger deviation from the TP=1 baseline than TP=2 or TP=8. This suggests that the choice of TP configuration can subtly affect the numerical behavior of the reward model, even when all other training conditions are held constant.
+
+To investigate whether mixed‑precision arithmetic was a major contributor, autocast was disabled in a separate set of experiments so that computations were performed in full precision. However, the validation losses with and without autocast are essentially identical for all TP settings, indicating that mixed‑precision itself is not the root cause of the discrepancy. Instead, these results imply that the primary source of inconsistency lies in how different TP plans partition and aggregate computations across devices, rather than in precision loss from autocast.
+
+| | TP=1 | TP=2 | TP=4 | TP=8 |
+| ------------- | ------ | ------ | ------ | ------ |
+| With autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
+| W/O autocast | 0.6035 | 0.6010 | 0.5864 | 0.6021 |
+
+Table 1: The validation loss of reward model training
+
+### Overall Performance Degradation Under Tensor Parallelism
+
+Figure 2 and Figure 3 present the reward curves and validation accuracy curves for multiple runs under different tensor parallel (TP) configurations. We also apply EMA smoothing for better visualization. The mismatch between the policy engine and the generation engine can lead to degraded downstream accuracy. This issue is most evident in the blue and purple curves, whose corresponding experiments are also the most abnormal cases observed in Figure 1.
+
+Taken together, the three figures show that an abnormal `token_mult_prob_error` does not necessarily lead to abnormal reward or validation accuracy. This occurs for several reasons:
+
+1. **Spike pattern instead of continuous growth**: In many runs, `token_mult_prob_error` shows frequent spikes rather than a monotonically increasing trend, indicating that training is unstable but not fundamentally broken.
+2. **Stochastic occurrence of spikes**: The abnormal `token_mult_prob_error` is itself unstable; even with the same batch of data, spikes may not appear in every run.
+3. **Dilution effect with large datasets**: When the dataset is sufficiently large and no critical samples are repeatedly affected, these extreme but sporadic spikes may have limited impact on aggregate metrics, so the final reward and validation accuracy may not exhibit significant deviations.
+
+
+
+Fig 2: The reward of Qwen3-4B
+
+
+
+Fig 3: The validation accuracy of Qwen3-4B
+
+However, such training instability is unacceptable for an RL training framework, so we aim to identify and eliminate the underlying issues. There are several challenges in resolving this problem:
+
+1. **Model dependence**: The issue is model-dependent rather than universal. For example, this phenomenon is observed on Qwen3-4B but not on Llama-3.1-8B-Instruct.
+2. **Poor reproducibility**: Abnormal spikes in `token_mult_prob_error` cannot be reproduced reliably. Even with the same batch of data and identical configurations, repeated runs may yield different outcomes.
+
+Our in-depth analysis across multiple models and runs indicates that this behavior does not stem from a single root cause but from the interaction of several subtle factors. Of these, a small set of dominant contributors consistently correlates with the observed instability:
+
+1. **Batch-variant kernels**, which can produce inconsistent results across microbatches.
+2. A **row-wise TP plan**, as row-wise partitioning can introduce additional numerical inconsistencies during distributed computation.
+
+## Batch-Variant Kernels
+
+In RL training, log probabilities are typically computed for samples drawn from the old policy, denoted as `prev_logprobs`. The same samples are then evaluated under the current policy being optimized, yielding `current_logprobs`. Using these two quantities, we compute the ratio between the current and previous policies as follows:
+
+$$
+\begin{aligned}
+\text{ratio} &= \exp\left(\text{current-logprobs} - \text{prev-logprobs}\right) \\
+&= \exp\left(\log\left(\frac{\text{current-probs}}{\text{prev-probs}}\right)\right) \\
+&= \frac{\text{current-probs}}{\text{prev-probs}}
+\end{aligned}
+$$
+
+This ratio is the standard importance ratio used in off-policy RL to reweight returns when the data are collected under an older behavior policy. In on-policy training, this ratio should be exactly 1. However, in our experiments, we observed cases where the ratio deviates from 1, indicating a mismatch between the intended on-policy setting and the actual behavior of the system. Figure 4 and Figure 5 illustrate this phenomenon by showing the mismatch between `prev_logprobs` and `current_logprobs` under TP=4, as well as the reward curves under TP=4 and TP=1 for the `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` model.
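In code, the ratio is simply the exponentiated logprob difference (a minimal stdlib sketch):

```python
import math

def importance_ratio(current_logprob: float, prev_logprob: float) -> float:
    # ratio = exp(current - prev); exactly 1.0 when the two policies agree
    return math.exp(current_logprob - prev_logprob)

print(importance_ratio(-0.7, -0.7))        # 1.0 (on-policy)
print(importance_ratio(-0.7, -0.8) > 1.0)  # True: current policy assigns higher prob
```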
+
+
+
+Fig 4: The mismatch of prev_logprobs and current_logprobs under TP=4
+
+
+
+Fig 5: The reward of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B under TP=4 and TP=1
+
+### Root Cause
+
+Upon further investigation, the discrepancy between `current_logprobs` and `prev_logprobs` was traced to a mismatch between `train_micro_batch_size` and `logprob_batch_size`, which caused the model to behave differently for the same logical samples under different effective batch sizes. This behavior is a typical manifestation of **batch-variant kernels**, where the numerical outputs of certain operators depend not only on the input tensors themselves but also on how those tensors are grouped into batches or microbatches.
+
+In batch-variant kernels, low-level implementation details—such as parallel reduction order, tiling strategy, fused-kernel heuristics, or algorithm selection conditioned on batch size or sequence layout—can change when the batch size changes, leading to small but systematic numerical differences in the computed logprobs. When `train_micro_batch_size` and `logprob_batch_size` are inconsistent, the same token sequence may traverse slightly different computational paths during training and logprob evaluation, resulting in `current_logprobs != prev_logprobs` and importance-sampling ratios that deviate from 1, even in nominally on-policy settings.
+
+After aligning `train_micro_batch_size` and `logprob_batch_size` so that the same samples are processed with identical effective batch configurations, the importance-sampling ratio (`probs_ratio`) becomes 1 as expected, and the observed accuracy issues disappear. This confirms that the mismatch was caused by batch-dependent numerical variation rather than a conceptual error in the RL objective or data pipeline.
+
+### Recommended Solutions
+
+When using DTensor with TP > 1, or when `probs_ratio != 1` is observed in an on-policy setting, the following mitigation strategies are recommended to restore numerical consistency and stabilize training:
+
+- **Align micro-batch sizes**:
+ Configure `train_micro_batch_size` and `logprob_batch_size` to be exactly equal so that both the training forward pass and the logprob evaluation traverse identical kernel configurations and batching patterns. This alignment minimizes batch-variant behavior in underlying kernels and ensures that `current_logprobs` and `prev_logprobs` are computed under the same numerical conditions, which in turn drives `probs_ratio` back toward 1.
+- **Force an on-policy ratio**:
+ In strictly on-policy scenarios, enable the `loss_fn.force_on_policy_ratio` flag to explicitly set `probs_ratio` to 1 during loss computation. This option is appropriate only when the data are guaranteed to be collected from the current policy and the theoretical importance-sampling ratio should be exactly 1; under these assumptions, clamping the ratio removes spurious numerical noise introduced by minor logprob mismatches while preserving the correctness of the training objective.
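Combining both mitigations in config form (key names are taken from the text above; the exact nesting of the batch-size keys depends on your config schema, so treat this as a hedged sketch):

```yaml
policy:
  train_micro_batch_size: 2   # keep these two equal ...
  logprob_batch_size: 2       # ... so both passes share one batching pattern
loss_fn:
  force_on_policy_ratio: true # only valid for strictly on-policy data
```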
+
+## Row-Wise TP Plan
+
+Row-wise and column-wise parallelism are two common ways to split a large linear layer across multiple devices. They differ in **which dimension of the weight matrix is partitioned** and how the partial results are combined.
+
+Consider a linear layer $y = xW^T$ with $W^T \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$, $x \in \mathbb{R}^{d_{\text{in}}}$, and $y \in \mathbb{R}^{d_{\text{out}}}$.
+
+1. Row-wise parallel (TP = 2)
+
+ In **row-wise** parallelism, we split $W^T$ by rows (input dimension) into two blocks:
+
+$$
+ W^T =
+ \begin{bmatrix}
+ W_1^T \\
+ W_2^T
+ \end{bmatrix},
+ \quad\text{where}\quad
+ W_1^T \in \mathbb{R}^{d_{\text{in}}^{(1)} \times d_{\text{out}}},\quad
+ W_2^T \in \mathbb{R}^{d_{\text{in}}^{(2)} \times d_{\text{out}}},\quad
+ d_{\text{in}}^{(1)} + d_{\text{in}}^{(2)} = d_{\text{in}}.
+$$
+
+ We also split the input:
+
+$$
+ x =
+ \begin{bmatrix}
+ x_1 & x_2
+ \end{bmatrix},
+ \quad
+ x_1 \in \mathbb{R}^{d_{\text{in}}^{(1)}},\quad
+ x_2 \in \mathbb{R}^{d_{\text{in}}^{(2)}}.
+$$
+
+    Each GPU holds its own **input slice** and weight slice, and computes $y_1 = x_1 W_1^T,\quad y_2 = x_2 W_2^T$, then we **sum** the partial outputs: $y = y_1 + y_2$.
+
+
+
+2. Column-wise parallel (TP = 2)
+
+    In **column-wise** parallelism, we split $W^T$ by columns (output dimension) into two blocks:
+
+$$
+ W^T =
+ \begin{bmatrix}
+ W_1^T & W_2^T
+ \end{bmatrix},
+ \quad \text{where} \quad
+ W_1^T \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}^{(1)}},\quad
+ W_2^T \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}^{(2)}},\quad
+ d_{\text{out}}^{(1)} + d_{\text{out}}^{(2)} = d_{\text{out}}.
+$$
+
+ Each GPU gets the **full input** $x$ and computes: $y_1 = xW_1^T ,\quad y_2 = xW_2^T$, then we **concatenate** along the output dimension: $y = \left[ y_1, y_2 \right]$.
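Both schemes compute the same $y$ up to floating-point rounding. A small NumPy check (TP=2 simulated on one device; dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
x = rng.standard_normal(d_in)
W_T = rng.standard_normal((d_in, d_out))

# Row-wise: split the input dimension, then SUM the partial outputs
y_row = x[:3] @ W_T[:3] + x[3:] @ W_T[3:]

# Column-wise: split the output dimension, then CONCATENATE local outputs
y_col = np.concatenate([x @ W_T[:, :2], x @ W_T[:, 2:]])

y_ref = x @ W_T
print(np.allclose(y_row, y_ref), np.allclose(y_col, y_ref))  # True True
```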
+
+### Root Cause
+
+Our analysis shows that the **row-wise tensor parallel (TP) plan** is a primary driver of the observed spikes in metrics and the instability of the reward model when TP is enabled. Row-wise tensor parallelism inevitably introduces cross-device reductions on the output activations: each rank produces a partial output $y_i$, and these partial results must be summed across GPUs to form the final $y = \sum_i y_i$. Although real-number addition is mathematically associative, floating-point addition in finite precision is **non-associative**, so [changing the summation order can lead to different numerical results](https://arxiv.org/html/2408.05148v3), and the accumulated error can grow over long reduction chains. This makes large distributed reductions—such as the cross-GPU adds required by row-wise TP—particularly vulnerable to run-to-run variability and small but systematic drift.
+
+By contrast, when the entire reduction is executed within a single device and on the same tensor core pipeline, the execution order and kernel implementation are typically fixed for a given problem size, which tends to yield deterministic and more numerically stable results for repeated runs with the same inputs. In other words, on a single GPU, the hardware and library stack generally ensure that the same matmul and accumulation schedule is reused, so the rounding pattern is at least consistent, even if it is not perfectly exact. However, once the computation is split across multiple GPUs, the final sum depends on the collective communication pattern (for example, ring or tree AllReduce), thread scheduling, and low‑level communication libraries. These factors are not guaranteed to be deterministic and can change the effective addition order, leading to additional rounding error and small cross‑rank discrepancies in the aggregated outputs.
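The non-associativity is easy to demonstrate in pure Python (float64 arithmetic; the magnitudes are chosen so that adding 1.0 is absorbed by rounding):

```python
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # 1.0 -- cancel first, then add
print(a + (b + c))  # 0.0 -- the 1.0 is lost, since ulp(1e16) == 2.0

# The same effect means a cross-GPU reduction whose addition order varies
# (e.g. ring vs. tree AllReduce) can return different results for identical inputs.
```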
+
+### Recommended Solutions
+
+To mitigate the numerical instability introduced by row-wise TP (especially the cross‑GPU reductions on attention and MLP outputs), we recommend using a **numerically more stable TP plan** that avoids cross‑rank summations. Instead of summing partial outputs across GPUs, the stable plan favors **column-wise sharding with local outputs**, so that each rank produces a complete, independent slice of the logits and no inter‑GPU add is required on these critical paths.
+
+Below is an example of how the default plan can be adjusted into a more numerically stable configuration. For more details, refer to [NeMo-RL PR #1235](https://github.com/NVIDIA-NeMo/RL/pull/1235).
+
+```python
+from torch.distributed.tensor import Replicate, Shard
+from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
+
+custom_parallel_plan = {
+ "model.embed_tokens": RowwiseParallel(input_layouts=Replicate()),
+ "model.layers.*.self_attn.q_proj": ColwiseParallel(),
+ "model.layers.*.self_attn.k_proj": ColwiseParallel(),
+ "model.layers.*.self_attn.v_proj": ColwiseParallel(),
+ "model.layers.*.self_attn.o_proj": RowwiseParallel(),
+ "model.layers.*.mlp.up_proj": ColwiseParallel(),
+ "model.layers.*.mlp.gate_proj": ColwiseParallel(),
+ "model.layers.*.mlp.down_proj": RowwiseParallel(),
+ "lm_head": ColwiseParallel(output_layouts=Shard(-1), use_local_output=False),
+}
+
+numerical_stable_parallel_plan = {
+ "model.embed_tokens": RowwiseParallel(input_layouts=Replicate()),
+ "model.layers.*.self_attn.q_proj": ColwiseParallel(),
+ "model.layers.*.self_attn.k_proj": ColwiseParallel(),
+ "model.layers.*.self_attn.v_proj": ColwiseParallel(),
+ "model.layers.*.self_attn.o_proj": ColwiseParallel(
+ input_layouts=Shard(-1),
+ output_layouts=Replicate(),
+ use_local_output=True,
+ ),
+ "model.layers.*.mlp.up_proj": ColwiseParallel(),
+ "model.layers.*.mlp.gate_proj": ColwiseParallel(),
+ "model.layers.*.mlp.down_proj": ColwiseParallel(
+ input_layouts=Shard(-1),
+ output_layouts=Replicate(),
+ use_local_output=True,
+ ),
+ "lm_head": ColwiseParallel(output_layouts=Shard(-1), use_local_output=False),
+}
+```
+
+## Additional Observations and Insights
+
+Beyond the TP-related issues discussed above, our experiments also highlight that **accuracy in RL training is influenced by a broad set of numerical factors**, including attention backends (such as SDPA and FlashAttention-2), GPU architectures (such as *Ampere* vs. *Hopper*), and arithmetic precision settings (such as BF16/FP16/FP8/FP32). Different inference and training engines often implement kernels in distinct ways, which naturally introduces small discrepancies in floating-point results even when the high-level math is identical. As a result, two systems that are "functionally equivalent" may still produce slightly different logprobs, rewards, or validation metrics.
+
+Figure 6 reports the KL divergence between the logits produced by the Hugging Face stack and those produced by NeMo‑RL for the same input sequence. The plot shows that, even with identical data and model weights, the resulting logit distributions differ noticeably across the two execution engines. In our experiments, similar behavior appeared when varying attention implementations and hardware configurations, where we consistently observed measurable numerical discrepancies, although we did not attempt to systematically eliminate every such source of variation.
+
+
+
+Fig 6: The KL divergence between Hugging Face and NeMo-RL
+
+The broader research community has proposed multiple strategies to mitigate these issues. We refer the reader to the following publications and resources:
+
+* [Defeating the Training-Inference Mismatch via FP16](https://arxiv.org/pdf/2510.26788)
+* [Accumulator accuracy](https://docs.pytorch.org/docs/stable/notes/cuda.html#reduced-precision-reduction-in-bf16-gemms)
+* [Systematic Outliers in Large Language Models](https://arxiv.org/abs/2502.06415)
+* [Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)
+
+In our current work, we treat these effects primarily as **background noise** and focus on TP‑induced misalignment that has a clear and actionable impact on RL training. A more exhaustive treatment—such as systematically unifying attention backends, enforcing TP‑invariant kernels, or integrating compensated summation into critical paths—is left as future engineering work informed by the aforementioned research directions.
diff --git a/fern/v0.5.0/pages/guides/environments.mdx b/fern/v0.5.0/pages/guides/environments.mdx
new file mode 100644
index 0000000000..553a36e0aa
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/environments.mdx
@@ -0,0 +1,228 @@
+---
+title: Environments for GRPO Training
+description: ""
+---
+
+NeMo RL includes multiple environments for GRPO training, each offering a standard interface for reward computation and evaluation.
+
+## Math Environment
+
+The Math Environment is designed for mathematical reasoning tasks. It evaluates responses to math problems using `math-verify` and provides rewards based on correctness.
+
+### Key Features
+- Evaluates mathematical reasoning
+- Supports multiple mathematical domains
+- Provides detailed feedback on solution correctness
+
+### Usage
+```python
+from nemo_rl.environments.math_environment import MathEnvironment
+
+env_config = {
+ "num_workers": 2,
+}
+
+math_env = MathEnvironment.remote(env_config)
+```
+
+## Code Environment
+
+The Code Environment is designed for code generation and execution tasks. It provides a sandboxed environment for executing Python code and evaluating the results.
+
+### Usage
+```python
+from nemo_rl.environments.code_environment import CodeEnvironment
+
+env_config = {
+ "num_workers": 2,
+ "terminate_on_evaluation": True, # Terminate after code execution
+}
+
+code_env = CodeEnvironment.remote(env_config)
+```
+
+### Configuration
+- `num_workers`: Number of parallel workers for code execution
+- `terminate_on_evaluation`: Whether to terminate after code execution (True for single-turn, False for multi-turn).
+
+We are tracking an end-to-end example of this environment in [#858](https://github.com/NVIDIA-NeMo/RL/issues/858). Add a 👍 to show your interest.
+
+## Code Jaccard Environment
+
+The Code Jaccard Environment evaluates code (or text) responses by measuring Jaccard-based similarity against ground-truth answers. This is a lightweight, text-similarity reward useful when an execution sandbox is unnecessary or unavailable.
+
+### How It Works
+- Extracts the assistant’s response text from each conversation.
+- Computes a Jaccard similarity score between the response and ground truth:
+ - Tokenizes both texts by whitespace, computes intersection/union, then applies a length ratio penalty.
+ - Scores are in [0, 1]. Observations label responses as “aligned/misaligned” using a 0.5 threshold.
+- Returns:
+ - observations: Environment feedback strings.
+ - rewards: Tensor of similarity scores.
+ - terminateds: All ones (single-step episodes).
+ - answers: The response text when requested (optional).
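The scoring step can be sketched as follows (a hypothetical re-implementation for illustration only; the environment's actual tokenization and penalty weighting may differ):

```python
def jaccard_score(response: str, ground_truth: str) -> float:
    """Whitespace-token Jaccard similarity with a length-ratio penalty."""
    a, b = set(response.split()), set(ground_truth.split())
    if not a or not b:
        return 0.0
    jaccard = len(a & b) / len(a | b)
    penalty = min(len(response), len(ground_truth)) / max(len(response), len(ground_truth))
    return jaccard * penalty

score = jaccard_score("return a + b", "return a + b")
print(score, "aligned" if score >= 0.5 else "misaligned")  # 1.0 aligned
```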
+
+### Usage
+```python
+from nemo_rl.environments.code_jaccard_environment import CodeJaccardEnvironment
+
+env_config = {
+ "num_workers": 2,
+ # Optional default stop strings (unused in scoring but available for consistency)
+ "stop_strings": None,
+}
+
+code_jaccard_env = CodeJaccardEnvironment.remote(env_config)
+```
+
+### Configuration
+- `num_workers` (int): Number of parallel verification workers.
+- `stop_strings` (list[str] | None): Optional default stop strings (propagated downstream; not required for scoring).
+
+### Sample GRPO Config
+```yaml
+env:
+ code_jaccard:
+ num_workers: 2
+ stop_strings: null
+data:
+ env_name: code_jaccard
+```
+
+## Reward Model Environment
+
+The Reward Model Environment uses pre-trained reward models to score conversation quality.
+
+### Usage
+```python
+from nemo_rl.environments.reward_model_environment import RewardModelEnvironment
+
+env_config = {
+ "enabled": True,
+ "model_name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B",
+ "tokenizer": {"name": "Skywork/Skywork-Reward-V2-Qwen3-0.6B"},
+ "precision": "bfloat16",
+ "batch_size": 32,
+ "resources": {"gpus_per_node": 1, "num_nodes": 1},
+ "reward_model_cfg": {
+ "enabled": True,
+ "reward_model_type": "bradley_terry",
+ },
+}
+
+reward_env = RewardModelEnvironment.remote(env_config)
+```
+
+### Resource Allocation in GRPO Training
+
+In GRPO training, resources are allocated across three main components:
+
+- **Policy Actor**: The trained model.
+- **Generation Actor**: Used for generating responses during rollouts (can be colocated with policy or on separate nodes/GPUs).
+- **Reward Model Environment Actor**: Evaluates generated responses and computes rewards.
+
+The resource allocation logic works as follows:
+
+#### Single-Node Setup (`num_nodes: 1`)
+- All components share the same node
+- GPUs are divided between policy training, generation, and reward model
+- Example:
+ 1. Policy and generation colocated: 8 GPUs total = 4 for colocated policy and generation + 4 for reward model
+ 2. Policy and generation non-colocated: 8 GPUs total = 2 for policy + 2 for generation + 4 for reward model
+
+#### Multi-Node Setup (`num_nodes > 1`)
+- Policy training, generation, and reward model environment can be distributed across different nodes.
+- Reward model gets dedicated resources as specified in `env.reward_model.resources`.
+- Generation gets dedicated resources as specified in `policy.generation.colocated.resources`.
+- Remaining nodes are allocated to policy training.
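For example, a hypothetical 3-node layout (key paths taken from the text above; values illustrative only) could reserve one node each for generation and the reward model, leaving the remainder for policy training:

```yaml
cluster:
  num_nodes: 3
  gpus_per_node: 8
policy:
  generation:
    colocated:
      enabled: false
      resources:        # dedicated generation node
        num_nodes: 1
        gpus_per_node: 8
env:
  reward_model:
    resources:          # dedicated reward-model node
      num_nodes: 1
      gpus_per_node: 8
# the remaining node is allocated to policy training
```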
+
+In the future, the resource control part will be refactored to enable fine-grained resource configuration for each actor. For detailed resource management and optimization strategies, see [#1100](https://github.com/NVIDIA-NeMo/RL/issues/1100).
+
+### Complete GRPO Training with Reward Model Environments
+
+See [examples/run_grpo.py](/../../examples/run_grpo.py) with [examples/configs/grpo_rm_1B.yaml](/../../examples/configs/grpo_rm_1B.yaml) for a complete example of using the reward model environment with GRPO training.
+
+```bash
+uv run examples/run_grpo.py --config examples/configs/grpo_rm_1B.yaml
+```
+
+## Registering Custom Environments
+
+NeMo RL provides a flexible environment registration mechanism that allows you to add custom environments without modifying the source code.
+
+### Using the `register_env` Interface
+
+You can use the `register_env` function to dynamically register new environments without modifying NeMo RL's internal code.
+
+**Function Signature**
+
+```python
+from nemo_rl.environments.utils import register_env
+
+register_env(env_name: str, actor_class_fqn: str) -> None
+```
+
+**Parameters:**
+
+- `env_name`: Unique identifier name for the environment (string)
+- `actor_class_fqn`: Fully Qualified Name of the environment Actor class, in the format `'module.path.ClassName'`
+
+### Example: Registering a Custom Environment
+
+Suppose you've created a custom reinforcement learning environment for code generation tasks:
+
+**1. Create Your Custom Environment Actor Class**
+
+```python
+# File: my_custom_envs/code_gen_env.py
+import ray
+from nemo_rl.environments.interfaces import EnvironmentInterface
+
+@ray.remote
+class CodeGenEnvironmentActor(EnvironmentInterface):
+ """Custom code generation environment."""
+
+ def __init__(self, config):
+ self.config = config
+ # Initialize your environment
+
+    async def reset(self):
+        # Reset environment logic; return the initial observation/state
+        initial_state = {}
+        return initial_state
+
+    async def step(self, action):
+        # Execute the action; placeholder values shown for illustration
+        observation, reward, done, info = None, 0.0, True, {}
+        return observation, reward, done, info
+
+    # Implement other required interface methods...
+```
+
+**2. Register the Environment in Your Training Script**
+
+```python
+# File: train.py
+from nemo_rl.environments.utils import register_env
+
+# Register your custom environment
+register_env(
+ env_name="code_gen",
+ actor_class_fqn="my_custom_envs.code_gen_env.CodeGenEnvironmentActor"
+)
+
+# Now you can use "code_gen" in your config
+# Training code...
+```
+
+**3. Use the Registered Environment in Your Config**
+
+```yaml
+# config.yaml
+env:
+ code_gen:
+ num_workers: 2
+ max_code_length: 512
+ test_cases_per_problem: 5
+
+data:
+ env_name: code_gen # Use your registered environment name
+```
diff --git a/fern/v0.5.0/pages/guides/eval.mdx b/fern/v0.5.0/pages/guides/eval.mdx
new file mode 100644
index 0000000000..56f4a6a6ad
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/eval.mdx
@@ -0,0 +1,115 @@
+---
+title: Evaluation
+description: ""
+---
+
+This document explains how to use an evaluation script for assessing model capabilities.
+
+## Prepare for Evaluation
+
+To prepare for evaluation, first ensure your model is in the correct format, which may involve an optional conversion of PyTorch DCP checkpoints to the HuggingFace format. Following this, you need to prepare the evaluation configuration, which includes defining prompt templates and any custom settings required to run the evaluation.
+
+### Convert DCP to HF (Optional)
+If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the HuggingFace format before running evaluation.
+
+Use the `examples/converters/convert_dcp_to_hf.py` script. You'll need the path to the training configuration file (`config.yaml`), the DCP checkpoint directory, and specify an output path for the HF format model.
+
+```sh
+# Example for a GRPO checkpoint at step 170
+uv run python examples/converters/convert_dcp_to_hf.py \
+ --config results/grpo/step_170/config.yaml \
+ --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
+ --hf-ckpt-path results/grpo/hf
+```
+> **Note:** Adjust the paths according to your training output directory structure.
+
+Once the conversion is complete, you can override the `generation.model_name` to point to the directory containing the converted HF model in [this section](#run-the-evaluation-script).
+
+### Prepare the Evaluation Configuration
+**Override with Custom Settings**
+
+To run the evaluation, you can use the [default configuration file](/../../examples/configs/evals/eval.yaml). Alternatively, you can specify a custom one or override some settings via the command line.
+
+The default configuration employs greedy sampling to evaluate Qwen2.5-Math-1.5B-Instruct on AIME-2024.
+
+**Prompt Template Configuration**
+
+Always remember to use the same prompt and chat_template that were used during training.
+
+For open-source models, we recommend setting `tokenizer.chat_template=default`, `data.prompt_file=null` and `data.system_prompt_file=null` to allow them to use their native chat templates.
+
+## Run the Evaluation Script
+
+We will use the `run_eval.py` script to run an evaluation using a model directly from the HuggingFace Hub or from a local path that is already in HuggingFace format.
+
+Note that the evaluation script only supports models in the HuggingFace format. If you haven't converted your DCP-format model, go back to [Convert DCP to HF](#convert-dcp-to-hf-optional) and follow the guide to convert your model.
+
+```sh
+# Run evaluation script with default config (examples/configs/evals/eval.yaml)
+uv run python examples/run_eval.py
+
+# Run evaluation script with converted model
+uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
+
+# Run evaluation script with Qwen3 model under thinking mode
+uv run python examples/run_eval.py \
+ generation.model_name=Qwen/Qwen3-8B \
+ generation.temperature=0.6 \
+ generation.top_p=0.95 \
+ generation.top_k=20 \
+ generation.vllm_cfg.max_model_len=38912 \
+ tokenizer.chat_template_kwargs.enable_thinking=true \
+ data.prompt_file=examples/prompts/cot.txt
+
+# Run evaluation script with custom config file
+uv run python examples/run_eval.py --config path/to/custom_config.yaml
+
+# Run evaluation script on one of the supported benchmarks (e.g., GPQA)
+uv run python examples/run_eval.py --config examples/configs/evals/gpqa_eval.yaml
+
+# Run evaluation script with a local dataset where the problem and solution keys are "Question" and "Answer" respectively.
+uv run python examples/run_eval.py \
+ --config examples/configs/evals/local_eval.yaml \
+ data.dataset_name=/path/to/local/dataset \
+ data.problem_key=Question \
+ data.solution_key=Answer
+
+# Override specific config values via command line
+# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
+# Pass@1 accuracy averaged over 16 samples for each problem
+uv run python examples/run_eval.py \
+ --config examples/configs/evals/math_eval.yaml \
+ generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
+ generation.temperature=0.6 \
+ generation.top_p=0.95 \
+ generation.vllm_cfg.max_model_len=32768 \
+ data.dataset_name=math500 \
+ eval.num_tests_per_prompt=16 \
+ cluster.gpus_per_node=8
+```
+> **Note:** Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.
+
+## Example Evaluation Output
+
+When you complete the evaluation, you will receive a summary similar to the following.
+
+```
+============================================================
+model_name='Qwen2.5-Math-1.5B-Instruct' dataset_name='aime2024'
+max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1 seed=42
+
+metric=pass@1 num_tests_per_prompt=1
+
+score=0.1000 (3.0/30)
+============================================================
+```
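+
+If you run many evaluations, you may want to collect these scores programmatically. The helper below is a small illustrative sketch (not part of NeMo RL) that assumes the summary format shown above:
+
+```python
+import re
+
+def parse_eval_score(summary: str) -> float:
+    """Extract the aggregate score from an evaluation summary (illustrative helper)."""
+    match = re.search(r"score=([0-9.]+)", summary)
+    if match is None:
+        raise ValueError("no score line found in summary")
+    return float(match.group(1))
+
+summary = "metric=pass@1 num_tests_per_prompt=1\nscore=0.1000 (3.0/30)"
+print(parse_eval_score(summary))  # 0.1
+```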
+
+## List of Currently Supported Benchmarks
+
+- [AIME-2024 and AIME-2025](/../../nemo_rl/data/datasets/eval_datasets/aime.py): the corresponding `data.dataset_name` values are `"aime2024"` and `"aime2025"`.
+- [GPQA and GPQA-diamond](/../../nemo_rl/data/datasets/eval_datasets/gpqa.py): the corresponding `data.dataset_name` values are `"gpqa"` and `"gpqa_diamond"`.
+- [MATH and MATH-500](/../../nemo_rl/data/datasets/eval_datasets/math.py): the corresponding `data.dataset_name` values are `"math"` and `"math500"`.
+- [MMLU](/../../nemo_rl/data/datasets/eval_datasets/mmlu.py): this also includes MMMLU (Multilingual MMLU), which covers 14 languages. When `data.dataset_name` is set to `mmlu`, the English version is used. To run evaluation in another language, set `data.dataset_name` to `mmlu_{language}`, where `language` is one of the following 14 values: `["AR-XY", "BN-BD", "DE-DE", "ES-LA", "FR-FR", "HI-IN", "ID-ID", "IT-IT", "JA-JP", "KO-KR", "PT-BR", "ZH-CN", "SW-KE", "YO-NG"]`.
+- [MMLU-Pro](/../../nemo_rl/data/datasets/eval_datasets/mmlu_pro.py): the corresponding `data.dataset_name` is `"mmlu_pro"`.
+
+More details can be found in [load_eval_dataset](/../../nemo_rl/data/datasets/eval_datasets/__init__.py).
diff --git a/fern/v0.5.0/pages/guides/ft-launcher-guide.mdx b/fern/v0.5.0/pages/guides/ft-launcher-guide.mdx
new file mode 100644
index 0000000000..772258b7dd
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/ft-launcher-guide.mdx
@@ -0,0 +1,61 @@
+---
+title: Fault Tolerance Launcher Guide
+description: ""
+---
+
+The `ft_launcher` is provided by `nvidia-resiliency-ext` (included in NeMo RL dependencies) and enables automatic fault tolerance and recovery for distributed training runs.
+
+## Key Arguments
+
+| Argument | Description | Example |
+|----------|-------------|---------|
+| `--ft-cfg-path` | Path to FT YAML config file | `examples/ft_launcher/ft_config.yaml` |
+| `--ft-rank-heartbeat-timeout` | Heartbeat timeout in seconds | `450` |
+| `--ft-initial-rank-heartbeat-timeout` | Initial timeout (longer for setup) | `1200` |
+| `--max-restarts` | Maximum number of restart attempts | `5` |
+
+## Basic Usage
+
+```bash
+uv run ft_launcher \
+ --ft-cfg-path examples/ft_launcher/ft_config.yaml \
+ --ft-rank-heartbeat-timeout 450 \
+ --ft-initial-rank-heartbeat-timeout 1200 \
+ --max-restarts 5 \
+ examples/run_grpo.py \
+ --config
+```
+
+## FT Config File (examples/ft_launcher/ft_config.yaml)
+
+```yaml
+fault_tolerance:
+ initial_rank_heartbeat_timeout: 360
+ restart_policy: any-failed
+```
+
+## Important Notes
+
+1. **Checkpointing**: Enable checkpointing for recovery to work:
+ ```bash
+ ++checkpointing.enabled=true
+ ++checkpointing.checkpoint_dir=/path/to/checkpoints
+ ++checkpointing.save_period=50
+ ```
+
+2. **Timeouts**: Set `--ft-initial-rank-heartbeat-timeout` higher than `--ft-rank-heartbeat-timeout` to allow for model loading/setup time.
+
+3. **Restart Policy**: The `any-failed` restart policy will restart the entire job if any rank fails. Look for these log messages to identify when a restart occurs:
+
+ ```
+ [ERROR] [ft_launcher...] failed (exitcode: 1) local_rank: 0 (pid: ...) of binary: ...
+ [INFO] [ft_launcher...] [default] Worker group FAILED. 3/5 attempts left; will restart worker group
+ [INFO] [ft_launcher...] Stopping workers... Timeout = 30 sec.
+ [INFO] [ft_launcher...] The node '...' attempts to join the next round of the rendezvous '...'.
+ [INFO] [ft_launcher...] The node '...' has joined round N of the rendezvous '...' as rank 0 in a world of size 1.
+ ```
+
+ Key indicators:
+ - `Worker group FAILED. X/Y attempts left` - shows a restart is happening and remaining attempts
+ - `will restart worker group` - confirms restart is in progress
+ - `has joined round N` - the round number increases with each restart
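+
+If you post-process launcher logs, a quick way to spot restarts is to scan for the second indicator. The snippet below is a hypothetical helper (not part of `nvidia-resiliency-ext`), based on the log lines shown above:
+
+```python
+def count_restarts(log_text: str) -> int:
+    """Count worker-group restarts by scanning for the restart indicator line."""
+    return sum(
+        1 for line in log_text.splitlines() if "will restart worker group" in line
+    )
+
+log = (
+    "[ERROR] [ft_launcher] failed (exitcode: 1) local_rank: 0\n"
+    "[INFO] [ft_launcher] Worker group FAILED. 3/5 attempts left; will restart worker group\n"
+    "[INFO] [ft_launcher] The node has joined round 2 of the rendezvous as rank 0\n"
+)
+print(count_restarts(log))  # 1
+```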
diff --git a/fern/v0.5.0/pages/guides/grpo-deepscaler.mdx b/fern/v0.5.0/pages/guides/grpo-deepscaler.mdx
new file mode 100644
index 0000000000..e2d9bb7b0c
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/grpo-deepscaler.mdx
@@ -0,0 +1,56 @@
+---
+title: GRPO on DeepScaler
+description: ""
+---
+
+This guide explains how to use NeMo RL to train long Chain of Thought (CoT) reasoning models with Group Relative Policy Optimization (GRPO). To do so, we train [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) on the [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [AIME24](https://huggingface.co/datasets/HuggingFaceH4/aime_2024) benchmark.
+
+## Train the Model
+We follow the DeepScaleR recipe and train the model in three stages. In the first stage, we train with an 8K context window. In the second stage, we train with a 16K context window. In the third stage, we train with a 24K context window.
+To train the model using NeMo RL, use the `examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml` config file. This file closely matches the experiment settings in the original DeepScaleR recipe. We then train with `examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml` and `examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml` for the second and third stages, respectively.
+
+```sh
+uv run examples/run_grpo.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-8K.yaml
+uv run examples/run_grpo.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-16K.yaml policy.model_name=/path/to/8K/checkpoint/hf
+uv run examples/run_grpo.py --config=examples/configs/recipes/llm/grpo-deepscaler-1.5b-24K.yaml policy.model_name=/path/to/16K/checkpoint/hf
+```
+
+At the end of each stage, you need to specify the Hugging Face checkpoint to continue training with. To get this checkpoint, we convert a model checkpoint to a Hugging Face checkpoint with the following command:
+
+```sh
+uv run examples/converters/convert_dcp_to_hf.py --config=results/grpo-deepscaler-1.5b-8K/step_240/config.yaml --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/policy/weights --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_240/hf
+```
+
+When running the next stage, we use this Hugging Face checkpoint as the initial checkpoint. We train with an 8K context window for 240 steps, a 16K context window for 290 steps, and a 24K context window for 50 steps. We run all experiments on a single 8xH100 80GB node. If you're running on 8xA100 80GB nodes, you will need at least 1 node for 8K training and 2 nodes for 16K and 24K training.
+
+## Training Curve
+When using the above commands, we get the following training curve:
+
+
+
+Notably, we are able to achieve an average training reward of 0.65 in just 400 training steps.
+
+## Evaluate the Model
+Throughout training, the checkpoints of the model will be saved to the `results` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, we first need to convert the PyTorch distributed checkpoint to Hugging Face format as before. Then, to evaluate on the [AIME24 benchmark](https://huggingface.co/datasets/HuggingFaceH4/aime_2024), use the following command:
+
+```sh
+uv run examples/run_eval.py \
+ generation.model_name=results/grpo-deepscaler-1.5b-8K/step_240/hf \
+ data.prompt_file=examples/prompts/cot.txt \
+ generation.vllm_cfg.max_model_len=32768 \
+ generation.vllm_cfg.enforce_eager=True \
+ generation.temperature=1.0
+```
+
+Use `generation.model_name` to specify the path to the Hugging Face checkpoint. In addition, we use AIME24 as the validation dataset and calculate pass@1 on it throughout training.
+
+> [!NOTE]
+> AIME24 only has 30 examples so the accuracy can be very noisy.
+> To reduce the variance, consider running `run_eval.py` with `eval.num_tests_per_prompt=16`.
+
+## Evaluation Results
+Using the above instructions to train DeepSeek-R1-Distill-Qwen-1.5B on the DeepScaleR dataset, we can track the model's performance on the AIME24 benchmark throughout training. The following plot shows the evaluation metrics as training progresses:
+
+
+
+We are able to surpass OpenAI O1's performance on the AIME24 benchmark with about 600 training steps.
diff --git a/fern/v0.5.0/pages/guides/grpo-sliding-puzzle.mdx b/fern/v0.5.0/pages/guides/grpo-sliding-puzzle.mdx
new file mode 100644
index 0000000000..45b60b5740
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/grpo-sliding-puzzle.mdx
@@ -0,0 +1,295 @@
+---
+title: Solve a Sliding Puzzle Using GRPO
+description: ""
+---
+
+This guide explains how to use NeMo RL to train a model to solve the classic **n×n sliding puzzle** game through multi-turn reinforcement learning. In this environment, numbered tiles must be arranged in sequential order by sliding them into an empty space.
+
+The sliding puzzle task serves as a simple yet effective example to illustrate how multi-turn RL and tool-calling are implemented within NeMo RL. It provides a minimal setup for understanding the core components of Group Relative Policy Optimization (GRPO) and sequential decision-making.
+
+## Quick Start Guide
+
+### 1. Install and Set Up NeMo RL with Megatron Backend (Optional)
+
+To get started, clone and set up the NeMo RL repository by initializing submodules, installing CUDA dependencies, and configuring the environment with uv. Refer to [Prerequisites](https://github.com/NVIDIA-NeMo/RL/tree/main?tab=readme-ov-file#prerequisites) for detailed instructions on installation.
+
+### 2. Train a Model
+
+Train a model to solve the sliding puzzle using GRPO with the default 2×2 configuration.
+
+```bash
+uv run python examples/run_grpo_sliding_puzzle.py
+```
+
+### 3. Customize Puzzle Configuration
+
+By default, this training script uses the configuration in [grpo_sliding_puzzle.yaml](/../../examples/configs/grpo_sliding_puzzle.yaml). You can customize parameters with command-line overrides to experiment with different puzzle sizes or levels of difficulty.
+```bash
+# Train on a 3×3 puzzle with 10 random moves to scramble the board
+uv run python examples/run_grpo_sliding_puzzle.py \
+ env.sliding_puzzle_game.cfg.game_config.size=3 \
+ env.sliding_puzzle_game.cfg.game_config.shuffle_moves=10
+```
+
+### 4. Monitor Progress
+
+You can enable logging via Weights & Biases and TensorBoard to monitor training metrics such as rewards, success rate, and loss curves.
+
+```bash
+# Enable logging (optional)
+uv run examples/run_grpo_sliding_puzzle.py \
+ --config examples/configs/grpo_sliding_puzzle.yaml \
+ logger.wandb_enabled=true \
+ logger.tensorboard_enabled=true
+```
+
+## Game Mechanics
+
+### Puzzle Structure
+
+The sliding puzzle consists of:
+- **Grid**: An `n×n` grid with numbered tiles and one empty space
+- **Tiles**: Numbered from `1` to `n²-1`, placed in random order
+- **Empty Space**: Represented by `0`, typically starting at the bottom-right corner
+- **Goal State**: Sequential arrangement `1, 2, 3, ..., n²-1` with `0` at bottom-right
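+
+To make the goal state concrete, here is a short illustrative helper (independent of the actual NeMo RL implementation) that builds the solved grid described above:
+
+```python
+def solved_grid(n: int) -> list[list[int]]:
+    """Goal state: tiles 1..n^2-1 in row-major order, 0 (empty) at bottom-right."""
+    tiles = list(range(1, n * n)) + [0]
+    return [tiles[row * n : (row + 1) * n] for row in range(n)]
+
+print(solved_grid(3))  # [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
+```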
+
+### Example Data Sample
+```
+===== SLIDING PUZZLE =====
+Arrange the 3x3 grid by sliding tiles into the empty space.
+- The goal is to arrange numbers from 1 to 8 in order
+- Use 'up', 'down', 'left', 'right' to slide in that direction
+- Use 'view' to see the current state of the board
+
+Current Board State:
+
+ +---------+
+1 | 1 3 |
+2 | 4 2 5 |
+3 | 7 8 6 |
+ +---------+
+ 1 2 3
+
+Reach the goal state where numbers are ordered 1 through 8 with the empty space (0) at the bottom right.
+Valid actions: 'up', 'down', 'left', 'right', or 'slide row col' (e.g., 'slide 1 2').
+After thinking, output your chosen action on a new line starting with '<action>' like this:
+<action>your_action</action>
+If you just want to see the board, output <action>view</action>
+Think carefully step-by-step before acting.
+
+```
+
+### Movement Rules
+
+1. **Valid Moves**: Only tiles adjacent to the empty space `0` can be moved.
+2. **Movement Direction**: Tiles slide into the empty space, not the other way around.
+3. **Grid Boundaries**: Moves that would go beyond the grid are invalid.
+4. **Single Tile Movement**: Each action affects only one tile at a time.
+
+All actions must be wrapped in XML-style tags and follow one of the formats below:
+```xml
+<action>up</action>        <!-- Slide a tile up into the empty space -->
+<action>slide 2 1</action> <!-- Slide tile at row 2, column 1 -->
+<action>view</action>      <!-- View the current board state -->
+```
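+
+The movement rules can be sketched in a few lines of Python. This is an illustrative re-implementation, not the actual `SlidingPuzzleGameLogic` code; it encodes the rule that a named direction slides the adjacent tile *into* the empty space (e.g., `up` moves the tile below the empty cell up):
+
+```python
+def apply_move(grid: list[list[int]], direction: str) -> bool:
+    """Apply a directional move in place; return True if the move was valid."""
+    n = len(grid)
+    # Locate the empty cell (0).
+    er, ec = next((r, c) for r in range(n) for c in range(n) if grid[r][c] == 0)
+    # Offset from the empty cell to the tile that slides into it.
+    offsets = {"up": (1, 0), "down": (-1, 0), "left": (0, 1), "right": (0, -1)}
+    if direction not in offsets:
+        return False  # invalid action format
+    dr, dc = offsets[direction]
+    tr, tc = er + dr, ec + dc
+    if not (0 <= tr < n and 0 <= tc < n):
+        return False  # move would go beyond the grid
+    grid[er][ec], grid[tr][tc] = grid[tr][tc], 0
+    return True
+
+board = [[1, 0], [3, 2]]  # 2x2 board, empty space at row 0, col 1
+apply_move(board, "up")   # tile 2 (below the empty cell) slides up
+print(board)              # [[1, 2], [3, 0]] -> solved state
+```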
+
+## Data Generation
+
+### Configuration Parameters
+
+Sliding puzzle instances are generated using the following parameters, which can be customized via the configuration file:
+
+```yaml
+env:
+ sliding_puzzle_game:
+ cfg:
+ game_config:
+ size: 5 # Size of the puzzle grid (e.g., 3x3, 4x4, 5x5)
+ shuffle_moves: 4 # Number of random moves to scramble the puzzle
+ max_moves: 40 # Maximum number of moves allowed per episode
+```
+#### Description
+
+- **`size`**: Determines the dimensions of the puzzle board (`n×n`).
+- **`shuffle_moves`**: Controls the initial difficulty by randomly moving tiles to scramble the puzzle.
+- **`max_moves`**: Sets an upper limit on the number of actions the agent can take in one episode.
+
+Grids are generated with sizes ranging from 2 to `game_config.size`. Each grid starts from the solved state and is shuffled by moving random tiles into the empty space `n` times, where `n` is a random number between 1 and `shuffle_moves`. The grid is shuffled using only valid moves.
+The `generate_puzzle_datum()` function in [run_grpo_sliding_puzzle.py](/../../examples/run_grpo_sliding_puzzle.py) is responsible for generating the dataset. [sliding_puzzle.py](/../../nemo_rl/environments/games/sliding_puzzle.py) contains the `SlidingPuzzleGameLogic` class, which handles puzzle generation and initialization logic. The number of shuffle moves and the size of the grid control puzzle difficulty.
+
+#### Generation Algorithm
+The puzzle configuration is randomly generated by sampling the grid size and number of shuffling moves within the defined maximums:
+
+```python
+def generate_random_config(max_config: dict[str, Any]) -> dict[str, Any]:
+ """Generate a random config for the sliding puzzle game."""
+ shuffle_moves = random.randint(1, max_config.get("shuffle_moves"))
+ if shuffle_moves % 2 == 0:
+ shuffle_moves += 1 # Ensure odd number for proper scrambling
+ return {
+ "size": random.randint(2, max_config.get("size", 3)),
+ "shuffle_moves": shuffle_moves,
+ }
+
+game_config = generate_random_config(game_config)
+initial_game_state = SlidingPuzzleGameLogic.generate(game_config)
+initial_render = SlidingPuzzleGameLogic.render(initial_game_state)
+welcome_message = SlidingPuzzleGameLogic.init(initial_game_state)
+```
+
+### Dataset Size Calculation
+
+Dataset size is defined by parameters in `grpo_sliding_puzzle.yaml`:
+```
+Training Size = num_prompts_per_step × num_generations_per_prompt × max_num_steps
+Validation Size = max_val_samples
+```
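+
+As a concrete illustration (using hypothetical values, not the shipped defaults):
+
+```python
+# Hypothetical config values for illustration only;
+# see grpo_sliding_puzzle.yaml for the real settings.
+num_prompts_per_step = 32
+num_generations_per_prompt = 16
+max_num_steps = 100
+
+training_size = num_prompts_per_step * num_generations_per_prompt * max_num_steps
+print(training_size)  # 51200
+```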
+
+### Data Structure
+
+Each training sample is returned as a `DatumSpec` dictionary with the following structure:
+
+```python
+datum: DatumSpec = {
+ "message_log": message_log, # Conversation history
+ "length": len(tokenized_prompt), # Token count
+ "extra_env_info": metadata, # Game state metadata
+ "loss_multiplier": 1.0, # Training weight
+ "idx": idx, # Sample index
+ "task_name": task_name, # Task identifier
+ "stop_strings": [""], # Termination tokens
+}
+```
+
+## Environment Interface
+
+{/* ### Architecture Flow
+
+```
+GRPO Training Pipeline:
+run_grpo_sliding_puzzle.grpo_train → nemo_rl.experience.rollouts.run_multi_turn_rollouts → generate_response + calculate_reward → environments.games.sliding_puzzle.SlidingPuzzleEnv.step
+``` */}
+
+### Core Classes
+
+The [sliding_puzzle.py](/../../nemo_rl/environments/games/sliding_puzzle.py) defines the environment and the logic for interacting with the environment. The core classes used are outlined below:
+
+#### SlidingPuzzleEnv
+The SlidingPuzzleEnv class serves as the main environment, implementing a Ray remote actor for distributed processing and using functions from both the SlidingPuzzleGameLogic and SlidingPuzzleRunner classes to interact with the environment.
+
+```python
+@ray.remote
+class SlidingPuzzleEnv(EnvironmentInterface):
+ def __init__(self, cfg: Optional[SlidingPuzzleConfig] = None):
+ """Initialize environment with configuration."""
+
+ def step(
+ self,
+ message_log_batch: list[LLMMessageLogType],
+ metadata_batch: list[SlidingPuzzleMetadata],
+ ) -> EnvironmentReturn:
+ """Process batch of interactions."""
+```
+
+#### SlidingPuzzleGameLogic
+The SlidingPuzzleGameLogic class defines the core game mechanics through static methods for puzzle operations and includes functionality for reward calculation.
+
+```python
+class SlidingPuzzleGameLogic:
+ @staticmethod
+ def generate(config: dict[str, Any]) -> dict[str, Any]:
+ """Generate new puzzle with specified configuration."""
+
+ @staticmethod
+ def init(game_state: dict[str, Any]) -> str:
+ """Create welcome message with game rules."""
+
+ @staticmethod
+ def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
+ """Execute action and return (response, reward, terminated, new_state)."""
+
+ @staticmethod
+ def render(game_state: dict[str, Any]) -> str:
+ """Render current puzzle state as visual grid."""
+```
+
+#### SlidingPuzzleRunner
+
+The SlidingPuzzleRunner class handles turn processing and action management.
+
+```python
+class SlidingPuzzleRunner:
+ def __init__(self):
+ """Initialize runner with no persistent state."""
+
+ def _parse_action(self, text: str) -> Optional[str]:
+ """Extract action from model response using XML tag parsing."""
+
+ def process_turn(
+ self,
+ message_log: LLMMessageLogType,
+ metadata: SlidingPuzzleMetadata,
+ ) -> tuple[dict[str, str], float, bool, Optional[list[str]], Optional[SlidingPuzzleMetadata]]:
+ """Process single turn and return (response_dict, reward, terminated, stop_strings, updated_metadata)."""
+```
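+
+A minimal sketch of the XML-tag parsing that `_parse_action` performs might look like the following. This is illustrative only; the actual implementation may differ, and the `<action>` tag name is an assumption here:
+
+```python
+import re
+
+# Illustrative sketch; the real _parse_action may differ.
+ACTION_RE = re.compile(r"<action>\s*(.*?)\s*</action>", re.DOTALL)
+
+def parse_action(text: str) -> str | None:
+    """Extract the last <action>...</action> payload from a model response."""
+    matches = ACTION_RE.findall(text)
+    return matches[-1] if matches else None
+
+response = "I should move the tile below the gap.\n<action>up</action>"
+print(parse_action(response))  # up
+```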
+
+### Processing Pipeline
+
+The step function creates a processing pipeline where each class handles specific responsibilities:
+
+1. **Parse action** (`SlidingPuzzleRunner`): Extracts the action from the model response using XML tag parsing via the `process_turn` method.
+2. **Validate Move** (`SlidingPuzzleGameLogic`): Checks if the action is valid for the current game state and then executes the move.
+3. **Execute Action** (`SlidingPuzzleGameLogic`): Applies the move to the game state using the `SlidingPuzzleGameLogic.step` method.
+4. **Calculate Reward** (`SlidingPuzzleGameLogic`): Assigns a reward based on progress toward solving the puzzle (step function).
+5. **Return Results** (`SlidingPuzzleEnv`): Returns the updated interaction state as an `EnvironmentReturn` object.
+
+## Reward System
+
+### Reward Structure
+
+The environment uses a sparse reward scheme designed to encourage complete solution strategies, rather than incremental progress or reward hacking.
+
+| Condition | Reward | Termination |
+|-----------|--------|-------------|
+| Valid move (non-solving) | 0.0 | False |
+| Invalid move | 0.0 | False |
+| Puzzle solved | 1.0 | True |
+| Max moves reached | 0.0 | True |
+| Invalid action format | 0.0 | False |
+
+> **Goal:** The agent receives a reward only upon successfully solving the puzzle, promoting long-horizon planning.
+
+### Reward Calculation Logic
+
+```python
+def step(action: str, game_state: dict[str, Any]) -> tuple[str, float, bool, dict[str, Any]]:
+ """Process action and calculate reward."""
+ reward = 0.0
+ is_terminated = False
+
+ if move_made:
+ # Check if puzzle is solved
+ if new_state["grid"] == new_state["solution"]:
+ reward = 1.0
+ is_terminated = True
+ else:
+ reward = 0.0 # No reward for non-solving moves
+
+ return response, reward, is_terminated, new_state
+```
+## Results
+
+We fine-tuned [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) on synthetic data for 120 steps using the following configuration settings:
+
+```yaml
+game_config:
+  size: 5           # Size of the puzzle (e.g., 2 for 2x2, 3 for 3x3)
+  shuffle_moves: 10 # Number of random moves to shuffle the solved state
+  max_moves: 30     # Maximum number of moves allowed per episode
+```
+
+The figure below displays training rewards vs. steps, along with validation accuracy.
+
+
+
+
diff --git a/fern/v0.5.0/pages/guides/grpo.mdx b/fern/v0.5.0/pages/guides/grpo.mdx
new file mode 100755
index 0000000000..a26735669d
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/grpo.mdx
@@ -0,0 +1,464 @@
+---
+title: An In-depth Walkthrough of GRPO in NeMo RL
+description: ""
+---
+
+This guide details the Group Relative Policy Optimization (GRPO) implementation within NeMo RL. We walk through data handling, policy model training, fast generation, and the GRPO loss function.
+
+## Quickstart: Launch a GRPO Run
+
+To get started quickly, use the script [examples/run_grpo.py](/../../examples/run_grpo.py), which demonstrates how to train a model on math problems using GRPO. You can launch this script locally or through Slurm. For detailed instructions on setting up Ray and launching a job with Slurm, refer to the [cluster documentation](/../cluster).
+
+We recommend launching the job using `uv`:
+
+```bash
+uv run examples/run_grpo.py --config <path_to_config> {overrides}
+```
+
+If not specified, `config` will default to [examples/configs/grpo_math_1B.yaml](/../../examples/configs/grpo_math_1B.yaml).
+
+**Reminder**: Do not forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+
+In this guide, we'll walk through how we handle:
+
+* Data
+* Model training
+* Fast generation
+* Overall resource flow
+* Loss
+
+### Data
+
+We support training with multiple RL "Environments" at the same time.
+
+An [Environment](/../../nemo_rl/environments/interfaces.py) is an object that accepts a state/action history and returns an updated state and rewards for the step. Environments run as Ray remote actors. See [MathEnvironment](/../../nemo_rl/environments/math_environment.py) for an example.
+
+To support this, we need to know:
+
+* What environments you have
+* Which data should go to which environments
+* How to prepare the data from your dataset into a form we can use
+
+#### Dataset
+
+GRPO datasets in NeMo RL are encapsulated using classes. Each GRPO data class is expected to have the following attributes:
+ 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below.
+ 2. `task_name`: A string identifier that uniquely identifies the dataset.
+
+GRPO datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](/../design-docs/chat-datasets) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [response_datasets/deepscaler.py](/../../nemo_rl/data/datasets/response_datasets/deepscaler.py) has an example:
+
+**Note:** The `task_name` field is required in each formatted example.
+
+```python
+def format_data(self, data: dict[str, Any]) -> dict[str, Any]:
+ return {
+ "messages": [
+ {"role": "user", "content": data["problem"]},
+ {"role": "assistant", "content": data["answer"]},
+ ],
+ "task_name": self.task_name,
+ }
+```
+
+By default, NeMo RL has some built-in supported datasets (e.g., [OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [Squad](/../../nemo_rl/data/datasets/response_datasets/squad.py), etc.). You can see the full list [here](/../../nemo_rl/data/datasets/response_datasets/__init__.py).
+All of these datasets are downloaded from HuggingFace and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
+
+We provide a [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py) class for loading JSONL-formatted response datasets from a local path or from Hugging Face. You can use `input_key` and `output_key` to specify which fields in your data correspond to the question and answer, respectively. Here's an example configuration:
+```yaml
+data:
+ # other data settings, see `examples/configs/grpo_math_1B.yaml` for more details
+ ...
+ # dataset settings
+ train:
+ # this dataset will override input_key and use the default values for other vars
+ data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
+ input_key: question
+ subset: null # used for HuggingFace datasets
+ split: train # used for HuggingFace datasets
+ split_validation_size: 0.05 # use 5% of the training data as validation data
+ seed: 42 # seed for train/validation split when split_validation_size > 0
+ validation:
+ # this dataset will use the default values for other vars except data_path
+ data_path: /path/to/local/val_dataset.jsonl
+ default:
+ # will use below vars as default values if dataset doesn't specify it
+ dataset_name: ResponseDataset
+ input_key: input
+ output_key: output
+ prompt_file: null
+ system_prompt_file: null
+ processor: "math_hf_data_processor"
+ env_name: "math"
+```
+
+Your JSONL files should contain one JSON object per line with the following structure:
+
+```json
+{
+ "input": "Hello", // :
+ "output": "Hi there!" // :
+}
+```
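+
+As an illustration, such a file could be produced with a few lines of Python (the file path is hypothetical; the keys match the `input_key`/`output_key` defaults above):
+
+```python
+import json
+import tempfile
+from pathlib import Path
+
+examples = [
+    {"input": "What is 2 + 2?", "output": "4"},
+    {"input": "Hello", "output": "Hi there!"},
+]
+
+# Write one JSON object per line (hypothetical path).
+path = Path(tempfile.gettempdir()) / "train_dataset.jsonl"
+with open(path, "w") as f:
+    for example in examples:
+        f.write(json.dumps(example) + "\n")
+
+# Read it back to verify the structure.
+with open(path) as f:
+    rows = [json.loads(line) for line in f]
+print(rows == examples)  # True
+```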
+
+We support using multiple datasets for train and validation. You can refer to `examples/configs/grpo_multiple_datasets.yaml` for a full configuration example. Here's an example configuration:
+```yaml
+data:
+ _override_: true # override the data config instead of merging with it
+ # other data settings, see `examples/configs/grpo_math_1B.yaml` for more details
+ ...
+ # dataset settings
+ train:
+ # train dataset 1
+ - dataset_name: OpenMathInstruct-2
+ split_validation_size: 0.05 # use 5% of the training data as validation data
+ seed: 42 # seed for train/validation split when split_validation_size > 0
+ # train dataset 2
+ - dataset_name: DeepScaler
+ validation:
+ # validation dataset 1
+ - dataset_name: AIME2024
+ repeat: 16
+ # validation dataset 2
+ - dataset_name: DAPOMathAIME2024
+ # default settings for all datasets
+ default:
+ ...
+```
+
+We support using a single dataset for both train and validation by using `split_validation_size` to set the validation ratio.
+[OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py), [Tulu3SftMixtureDataset](/../../nemo_rl/data/datasets/response_datasets/tulu3.py) are supported for this feature.
+If you want to support this feature for your custom datasets or other built-in datasets, you can add the corresponding code to the dataset class, as done in [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py).
+```python
+# `self.val_dataset` is used (not None) only when current dataset is used for both training and validation
+self.val_dataset = None
+self.split_train_validation(split_validation_size, seed)
+```
+
+#### Common Data Format
+
+We define a [DatumSpec](/../../nemo_rl/data/interfaces.py) that holds all relevant information for each training example:
+
+```python
+class DatumSpec(TypedDict):
+ message_log: LLMMessageLogType
+ length: int # total (concatenated) length of the message tensors
+ extra_env_info: dict[str, Any] # anything your environment requires goes here, for example the 'answer' of a math problem
+ loss_multiplier: float # multiplier for the loss for this datum. 0 to mask out (say the sample is invalid)
+ idx: int
+ task_name: Optional[str] = "default"
+ __extra__: Any # This allows additional fields of any type
+```
+
+#### Data Processors
+
+We refer to each distinct environment your model aims to optimize against as a "task." For example, you might define tasks like "math" or "code."
+
+For each task, you should provide a data processor that reads from your dataset and returns a [DatumSpec](/../../nemo_rl/data/interfaces.py).
+
+```python
+def my_data_processor(
+ datum_dict: dict[str, Any], # loaded directly from your dataset (that is, a single line of JSONL data)
+ task_data_spec: TaskDataSpec,
+ tokenizer,
+ max_seq_length: int,
+ idx: int,
+) -> DatumSpec:
+```
+
+We have an example of this as `math_data_processor` in [processors.py](/../../nemo_rl/data/processors.py).
+
+#### Task–Dataset Mapping
+
+- `task_name` (unique task identifier):
+  - Determines which processor, environment, prompts, and dataset to use for this task.
+  - Currently, we support a single dataset and a single environment, so `task_name` equals the dataset name in the config (i.e., `config.data.dataset_name`).
+- `task_spec` (`TaskDataSpec`):
+  - Specifies the per-task system prompt and prompt.
+- `task_data_processors`:
+  - Dict mapping: `task_name` -> `(task_spec, processor_fn)`.
+- `task_to_env`:
+  - Dict mapping: `task_name` -> `task_env`.
+
+Example (simplified):
+
+```python
+task_data_processors = {data.task_name: (data.task_spec, data.processor)}
+task_to_env = {data.task_name: env}
+```
+
+#### Putting It All Together
+
+GRPO expects datasets to have the following form:
+
+```json
+{"task_name": "math", /* actual data */}
+```
+
+Then, you can set the data up as follows:
+
+```python
+
+# 1) Setup environments from data config
+env_name_list = extract_necessary_env_names(data_config)
+envs = {
+ env_name: create_env(env_name=env_name, env_config=env_configs[env_name])
+ for env_name in env_name_list
+}
+
+# 2) Load dataset using the helper (built-ins or local/HF datasets)
+data = load_response_dataset(data_config["train"])
+
+# 3) Build task mapping
+task_data_processors = {data.task_name: (data.task_spec, data.processor)}
+task_to_env = {data.task_name: envs[data_config["train"]["env_name"]]}
+
+# 4) Construct processed dataset
+dataset = AllTaskProcessedDataset(
+ data.dataset,
+ tokenizer,
+ None,
+ task_data_processors,
+ max_seq_length=data_config["max_input_seq_length"],
+)
+
+# 5) Do the same thing for validation dataset if it exists
+if "validation" in data_config and data_config["validation"] is not None:
+ val_data = load_response_dataset(data_config["validation"])
+
+ val_task_data_processors = {val_data.task_name: (val_data.task_spec, val_data.processor)}
+ val_task_to_env = {val_data.task_name: envs[data_config["validation"]["env_name"]]}
+
+ val_dataset = AllTaskProcessedDataset(
+ val_data.dataset,
+ tokenizer,
+ None,
+ val_task_data_processors,
+ max_seq_length=data_config["max_input_seq_length"],
+ )
+```
+
+Ensure you provide a mapping of tasks to their processors so the dataset knows which processor to use when handling samples.
+
+## Environments
+
+GRPO supports various types of environments for different tasks, including **[Math](/../../nemo_rl/environments/math_environment.py)**, **[Code](/../../nemo_rl/environments/code_environment.py)**, and **[Reward Model](/../../nemo_rl/environments/reward_model_environment.py)** environments. Each environment provides a standardized interface for reward computation and evaluation, enabling consistent training across diverse domains.
+
+For more information about environments, see the [Environments Guide](/environments).
+
+### Env–Task Mapping
+
+- `env`:
+  - The environment actor for reward/evaluation, constructed using `create_env(env_name=..., env_config=...)`.
+  - The environment to use is declared under the data section of the config (e.g., `data.env_name` states which env the dataset uses).
+- `task_to_env`:
+  - Dict mapping: `task_name -> env`. In the current single-task setup this typically points all tasks to the same env, but this structure enables different envs per task in future multi-task scenarios.
+
+Example (simplified):
+
+```python
+env_name_list = extract_necessary_env_names(data_config)
+envs = {
+ env_name: create_env(env_name=env_name, env_config=env_configs[env_name])
+ for env_name in env_name_list
+}
+
+task_to_env[task_name] = envs[data_config["train"]["env_name"]]
+val_task_to_env = task_to_env # validation usually mirrors training mapping
+```
+
+## Policy Model
+
+We define a `nemo_rl.models.policy.interfaces.PolicyInterface` that contains everything you need to train a policy model.
+
+This Policy object holds a [RayWorkerGroup](/../../nemo_rl/distributed/worker_groups.py) of SPMD processes (one process per GPU) that run HF/MCore, all coordinated by this object so that, from your perspective, it behaves like a single GPU.
+
+## Fast Generation
+
+We currently support vLLM through the [VllmGeneration](/../../nemo_rl/models/generation/vllm/vllm_generation.py) class.
+
+The function, [grpo_train](/../../nemo_rl/algorithms/grpo.py), contains the core GRPO training loop.
+
+## Performance Optimizations
+
+RL generations typically produce highly variable sequence lengths, which result in a significant amount of padding if approached naively. We address this with Sequence Packing and Dynamic Batching, which are techniques to reduce the amount of padding required. You can read more about these in the [design doc](/../design-docs/sequence-packing-and-dynamic-batching).
+
+## Loss
+We use the [ClippedPGLossFn](/../../nemo_rl/algorithms/loss_functions.py) to calculate the loss for GRPO. Formally,
+
+$$
+L(\theta) = E_{x \sim \pi_{\theta_{\text{old}}}} \Big[ \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big) \Big] - \beta D_{\text{KL}} (\pi_\theta \| \pi_\text{ref})
+$$
+
+where:
+
+- $\pi_\theta$ is the policy model we are currently optimizing
+- $\pi_{\theta_{\text{old}}}$ is the previous policy model (from the beginning of this step)
+- $A_t$ is the advantage estimate
+- $\varepsilon$ is a clipping hyperparameter
+- $\beta$ is the KL penalty coefficient
+- $\pi_{\text{ref}}$ is the reference policy
+
+It also supports "Dual-Clipping" from [Ye et al. (2019)](https://arxiv.org/pdf/1912.09729), which
+imposes an additional upper bound on the probability ratio when advantages are negative.
+This prevents excessive policy updates: when $r_t(\theta) A_t \ll 0$, the term is clipped to $c A_t$.
+The loss function is modified to the following when $A_t < 0$:
+
+$$
+L(\theta) = E_t \Big[ \max \Big( \min \big(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) A_t \big), c A_t \Big) \Big] - \beta D_{\text{KL}} (\pi_\theta \| \pi_\text{ref})
+$$
+
+where:
+- $c$ is the dual-clip parameter (`ratio_clip_c`), which must be greater than 1 and is typically set to 3 in practice.
+- $r_t(\theta)$ is the ratio $\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}$, which measures how much the policy has changed.
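
The clipping logic above can be sketched per token as follows; `dual_clipped_pg_term` is a hypothetical helper for illustration, not the `ClippedPGLossFn` implementation:

```python
def dual_clipped_pg_term(ratio, advantage, eps=0.2, ratio_clip_c=3.0):
    """Per-token clipped policy-gradient surrogate with dual clipping."""
    # Standard PPO/GRPO clipping of the probability ratio
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if advantage < 0:
        # Dual clip (Ye et al., 2019): bound how negative the term can get
        surrogate = max(surrogate, ratio_clip_c * advantage)
    return surrogate

# With a negative advantage and a very large ratio, the term is bounded by c*A
assert dual_clipped_pg_term(10.0, -1.0) == -3.0
# In the unclipped region, the surrogate is just ratio * advantage
assert dual_clipped_pg_term(1.0, 1.0) == 1.0
```

Note that without the dual clip, a large ratio with a negative advantage would contribute an arbitrarily negative surrogate and destabilize the update.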
+
+### Improvements to the GRPO Loss Formulation for Stability and Accuracy
+
+#### On-Policy KL Approximation
+
+This feature is controlled by the parameter `use_on_policy_kl_approximation`. It enables the use of an estimator for KL divergence based on [Schulman (2020)](http://joschu.net/blog/kl-approx.html), which is both unbiased and guaranteed to be positive.
+
+$$
+D_{\text{KL}} (\pi_\theta || \pi_\text{ref}) \approx E_{x \sim \pi_{\theta}} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big]
+$$
+
+Note that the loss function above samples from $\pi_{\theta_{\text{old}}}$ instead of $\pi_\theta$, meaning that the KL approximation is off-policy if we use samples from $\pi_{\theta_{\text{old}}}$. This is the default formulation used in the [original GRPO paper](https://arxiv.org/abs/2402.03300). In order to use an _on-policy_ KL approximation while sampling from $\pi_{\theta_{\text{old}}}$, we can incorporate importance weights:
+
+$$
+\begin{align*}
+D_{\text{KL}} (\pi_\theta || \pi_\text{ref}) &\approx E_{x \sim \pi_{\theta}} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\
+&= \sum_x \pi_{\theta}(x) \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\
+&= \sum_x \pi_{\theta_{\text{old}}}(x) \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\
+&= E_{x \sim \pi_{\theta_\text{old}}} \frac{\pi_{\theta}(x)}{\pi_{\theta_{\text{old}}}(x)} \Big[ \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - \log \frac{\pi_\text{ref}(x)}{\pi_\theta(x)} - 1 \Big] \\
+\end{align*}
+$$
+
+To enable the on-policy KL approximation, set the config `use_on_policy_kl_approximation=True` in the `ClippedPGLossConfig`. By default, we set this config to False to align with standard GRPO.
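
The importance-weighted estimator derived above can be sketched as follows; `on_policy_kl_estimate` is a hypothetical helper operating on per-token logprobs, not the actual NeMo RL code:

```python
import math

def on_policy_kl_estimate(logp_cur, logp_old, logp_ref):
    """Importance-weighted k3 KL estimator over sampled tokens.

    Samples come from pi_old; the weight pi_cur/pi_old makes the estimate
    on-policy with respect to pi_cur, matching the derivation above.
    """
    total = 0.0
    for lc, lo, lr in zip(logp_cur, logp_old, logp_ref):
        log_ratio = lr - lc                       # log(pi_ref / pi_cur)
        k3 = math.exp(log_ratio) - log_ratio - 1  # unbiased, non-negative
        weight = math.exp(lc - lo)                # pi_cur / pi_old
        total += weight * k3
    return total / len(logp_cur)

# When pi_cur == pi_ref, each k3 term (and hence the estimate) is zero
assert on_policy_kl_estimate([-1.0], [-1.0], [-1.0]) == 0.0
```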
+
+#### Importance Sampling Correction
+This feature is controlled by the parameter `use_importance_sampling_correction`. It applies importance sampling to adjust for discrepancies between the behavior policy and the target policy, improving the accuracy of off-policy estimates. The policy we use to draw samples, $\pi_{\theta_{\text{old}}}$, is used in both the inference framework and the training framework. To account for this distinction, we refer to the inference framework policy as $\pi_{\text{inference}}$ and the training framework policy as $\pi_{\text{training}}$. As noted in [Adding New Models](/../adding-new-models#understand-discrepancies-between-backends), it is possible for the token probabilities from $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to have discrepancies (from numerics, precision differences, bugs, etc.), leading to off-policy samples. We can correct for this by introducing importance weights between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ to the first term of the loss function.
+
+Let $f_\theta(x) = \min \Big(\frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}A_t, \text{clip} \big( \frac{\pi_\theta(x)}{\pi_{\theta_{\text{old}}}(x)}, 1 - \varepsilon, 1 + \varepsilon \big) A_t \Big)$ represent the first term of the loss function. Then,
+
+$$
+\begin{align*}
+E_{x \sim \pi_\text{training}} f_\theta(x) &= \sum_x \pi_\text{training}(x) f_\theta(x) \\
+&= \sum_x \pi_\text{inference}(x) \frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)} f_\theta(x) \\
+&= E_{x \sim \pi_\text{inference}} \frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)} f_\theta(x)
+\end{align*}
+$$
+
+By multiplying the first term of the loss function by the importance weights $\frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)}$, we can correct for the distribution mismatch between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$ while still sampling from $\pi_{\text{inference}}$.
+
+To enable the importance sampling correction, set the config `use_importance_sampling_correction=True` in the `ClippedPGLossConfig`. By default, we set this config to False to align with standard GRPO.
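
The correction above amounts to weighting each per-token surrogate term by $\frac{\pi_\text{training}(x)}{\pi_\text{inference}(x)}$. A minimal sketch with hypothetical names, not the `ClippedPGLossFn` code:

```python
import math

def importance_corrected_mean(f_values, logp_training, logp_inference):
    """Weight each per-token surrogate f_theta(x) by pi_training/pi_inference
    to correct for training/inference logprob mismatch."""
    weighted = [
        math.exp(lt - li) * f
        for f, lt, li in zip(f_values, logp_training, logp_inference)
    ]
    return sum(weighted) / len(weighted)

# When both frameworks agree on logprobs, the weights are 1 and nothing changes
assert importance_corrected_mean([2.0, 4.0], [-1.0, -2.0], [-1.0, -2.0]) == 3.0
```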
+
+#### Overlong Filtering
+
+This feature is controlled by the parameter `overlong_filtering`. It filters out sequences that exceed a predefined maximum length, helping maintain computational efficiency and model stability. When `overlong_filtering=True`, samples that reach `max_total_sequence_length` without producing an end-of-text token are excluded from loss computation. This reduces noise from penalizing generations that may be high-quality but exceed the sequence length limit.
+
+The implementation modifies the loss calculation as follows:
+
+For each sample $i$ in the batch:
+
+$$
+\text{truncated}_i = \begin{cases}
+1 & \text{if sample } i \text{ reached max length without EOS} \\
+0 & \text{otherwise}
+\end{cases}
+$$
+
+The sample mask becomes (let $m_i$ denote the sample mask and $\ell_i$ the loss multiplier):
+
+$$
+m_i = \ell_i \cdot (1 - \text{truncated}_i)
+$$
+
+This results in the effective loss:
+
+$$
+L_{\text{effective}} = \sum_{i} m_i \cdot L_i
+$$
+
+where $L_i$ is the per-sample loss. Truncated samples contribute 0 to the gradient update while remaining in the batch for reward baseline calculations.
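
The masking above can be sketched as follows (a hypothetical helper mirroring the formulas, not the actual implementation):

```python
def overlong_filter_masks(loss_multipliers, truncated_flags):
    """Zero the sample mask for sequences that hit max length without EOS,
    so they contribute nothing to the gradient: m_i = l_i * (1 - truncated_i)."""
    return [m * (1 - t) for m, t in zip(loss_multipliers, truncated_flags)]

masks = overlong_filter_masks([1.0, 1.0, 0.5], [0, 1, 0])
assert masks == [1.0, 0.0, 0.5]  # only the truncated sample is masked out
```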
+
+To configure:
+```yaml
+grpo:
+ overlong_filtering: false # default
+```
+
+Set `overlong_filtering` to true when training on tasks where truncation at the maximum sequence length is expected, such as long-form reasoning or mathematical proofs.
+
+## Metrics
+We track a few metrics during training for scientific experimentation and to validate correctness as the run progresses. These are logged through the configured loggers (see the `wandb_name` and `tb_name` settings).
+
+### Multiplicative Token Probability Error
+This metric is reported as `token_mult_prob_error`. It measures the average multiplicative error between the token probabilities computed by the training and inference frameworks, which can surface calibration and consistency problems. It is equal to the 'Logprob consistency metric' defined in [Adding New Models](/../adding-new-models#importance-of-log-probability-consistency-in-training-and-inference):
+
+$$
+\text{token-mult-prob-error} = \frac{1}{n_{\text{tokens}}}\sum_{i=1}^{n_{\text{tokens}}}\exp\left(\left|\text{logprobs-train-fwk}_i - \text{logprobs-inference-fwk}_i\right|\right)
+$$
+
+Intuitively, this measures the average multiplicative probability error for sampled tokens, where samples are drawn as $x \sim \pi_{\text{inference-framework}}$. The purpose of this is to highlight any obvious sampling errors or discrepancies between the inference backend and training framework. If it trends upward steeply over the course of training past $\sim 1-2\%$, there is usually a problem with how your weights are being updated. If these metrics are very spiky, they can indicate a bug in the inference framework or buggy weight refitting.
+
+### KL Divergence Error
+We report the following KL divergence metrics:
+* `gen_kl_error`: $D_{\text{KL}}(P_{gen} || P_{policy})$
+  - uses the generation distribution as ground truth
+* `policy_kl_error`: $D_{\text{KL}}(P_{policy} || P_{gen})$
+  - uses the policy (training) distribution as ground truth
+* `js_divergence_error` (Jensen–Shannon divergence): $(D_{\text{KL}}(P_{policy} || P_{m}) + D_{\text{KL}}(P_{gen} || P_{m})) / 2$, where $P_{m} = (P_{policy} + P_{gen}) / 2$
+  - uses the mean mixture distribution as reference
+
+According to the paper [When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda), `gen_kl_error` was introduced (referred to as `vllm-kl` in the paper) as the key metric to measure the mismatch between the policy and generation distributions. Empirically, the mismatch is approximately 1e-3, and the divergence is larger for low-probability tokens as predicted by the generation inference engine (like vLLM).
+
+The three divergence metrics provide complementary perspectives on distribution mismatch. For example:
+
+We observed a case where vLLM assigned a disproportionately high probability to a single rare token, causing significant logprob error spikes (especially in MoE architectures):
+
+```text
+# extreme example
+1. Position 4559: 'au' (ID: 1786)
+ logp_gen (from vLLM): -5.xxx
+ logp_policy (from Mcore): -15.xxx
+```
+Assuming other tokens have near-zero divergence, this single token's metrics with `kl_type=k3` are:
+
+* `gen_kl_error`: exp(-15 + 5) - (-15 + 5) - 1 ≈ 9 (moderate mismatch)
+* `policy_kl_error`: exp(-5 + 15) - (-5 + 15) - 1 ≈ 22,015 (severe mismatch dominating the metric)
+* `js_divergence_error`: ≈ 9, close to `gen_kl_error` since the mixture distribution (~-5.69) is dominated by the higher-probability value (logp_gen in this example)
+
+Ideally, all KL divergence metrics should be close to 0, with values below 1e-3 considered acceptable. Investigate any metric that shows spikes above this threshold.
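
The worked numbers above come from the k3 estimator applied at a single token. A minimal sketch (`k3` is a hypothetical helper for illustration):

```python
import math

def k3(logp_p, logp_q):
    """k3 estimate of D_KL(P || Q) at one sampled token:
    exp(logq - logp) - (logq - logp) - 1."""
    d = logp_q - logp_p
    return math.exp(d) - d - 1

logp_gen, logp_policy = -5.0, -15.0   # the extreme vLLM/Mcore example above
gen_kl = k3(logp_gen, logp_policy)     # moderate mismatch
policy_kl = k3(logp_policy, logp_gen)  # severe mismatch dominating the metric
assert 8 < gen_kl < 10            # ~9
assert 22000 < policy_kl < 22030  # ~22,015
```

Note the asymmetry: the same logprob gap produces wildly different values depending on which distribution is treated as ground truth, which is why all three metrics are worth watching.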
+
+### Sampling Importance Ratio
+This metric is reported as `sampling_importance_ratio`. It measures the average importance ratio between the training-framework policy and the inference-framework policy used to draw samples, quantifying the distributional shift between the two backends. Not to be confused with the clipped importance ratio in PPO/GRPO, this is the importance ratio between $\pi_{\text{training}}$ and $\pi_{\text{inference}}$.
+
+This is simply $\frac{1}{|T|}\sum_{t \in \text{tokens}}\exp\big(\log \pi_{\text{training}}(t) - \log \pi_{\text{inference}}(t)\big)$.
+
+Similar to [Multiplicative Token Probability Error](#multiplicative-token-probability-error), this is a measure of how far off your inference backend is from your training framework. However, this metric is meant to find the bias in that error, rather than the variance, as it does not take the absolute value of the error. With some noise, this should hover around 1.
+
+This metric is always calculated and the per-token version (without the mean) is used in the loss function when [Importance Sampling Correction](#importance-sampling-correction) is enabled.
+
+### Entropy
+This metric is reported as `approx_entropy`. It estimates the entropy of the policy distribution, which is useful for monitoring exploration and detecting premature convergence. We roughly approximate the entropy of the LLM's distribution throughout training by calculating:
+
+$$
+E_{x \sim \pi_{\text{inference}}}\Big[-\frac{\pi_{\text{training}}(x)}{\pi_{\text{inference}}(x)}\log \pi_{\text{training}}(x)\Big]
+$$
+
+This expectation is estimated using the rollouts in each global training batch as Monte Carlo samples. The ratio of $\pi$ values in the formula serves to apply importance correction for the mismatch between the training policy during a single GRPO step and the inference-time policy used to sample states.
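
The estimate can be sketched over a batch of sampled tokens as follows; `approx_entropy` here is a hypothetical helper for illustration:

```python
import math

def approx_entropy(logp_training, logp_inference):
    """Importance-corrected Monte Carlo entropy estimate over sampled tokens:
    mean of -(pi_training/pi_inference) * log pi_training."""
    terms = [
        -math.exp(lt - li) * lt
        for lt, li in zip(logp_training, logp_inference)
    ]
    return sum(terms) / len(terms)

# When the frameworks agree, this reduces to the plain -mean(log pi) estimate
assert approx_entropy([-2.0, -2.0], [-2.0, -2.0]) == 2.0
```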
+
+We use this to track if our models are experiencing entropy collapse too quickly during training (as is quite common). This is a fairly rough Monte Carlo approximation, so we wouldn't recommend using this directly for an entropy bonus or otherwise backpropagating through this. You can take a look at NeMo Aligner's [implementation](https://github.com/NVIDIA/NeMo-Aligner/blob/main/nemo_aligner/utils/distributed.py#L351) of a full entropy calculation if you're interested (work-in-progress efficient calculation in NeMo RL).
+
+## LoRA Configuration
+
+### DTensor Backend
+
+GRPO supports LoRA on the NeMo RL DTensor backend. The LoRA settings live under `policy.dtensor_cfg.lora_cfg`, and the fields follow the SFT LoRA configuration. For DTensor parameter details, see [SFT LoRA: DTensor Configuration Parameters](/./sft#dtensor-configuration-parameters). To enable LoRA, set `policy.dtensor_cfg.lora_cfg.enabled=true`, then configure target modules, rank, alpha, and dropout as needed.
+
+Our DTensor LoRA path uses a merge-weight approach: during generation, LoRA adapter weights are merged into the base linear weights. This improves performance, with a small training-inference mismatch that we consider acceptable. If you require strict training-inference parity, use the [split-weight variant branch](https://github.com/NVIDIA-NeMo/RL/tree/ruit/lora_grpo_async), which may trade off some performance. For a comparison between merge-weight and split-weight, see [PR 1797: Support lora in dtensor grpo workflow by merging weight](https://github.com/NVIDIA-NeMo/RL/pull/1797).
+
+We already provide a DTensor-based Nano v3 GRPO LoRA recipe. See [grpo-nanov3-30BA3B-2n8g-fsdp2-lora.yaml](/../../examples/configs/recipes/llm/grpo-nanov3-30BA3B-2n8g-fsdp2-lora.yaml) for an end-to-end example.
+
+## Evaluate the Trained Model
+
+Upon completion of the training process, you can refer to our [evaluation guide](/eval) to assess model capabilities.
diff --git a/fern/v0.5.0/pages/guides/nemotron-3-nano.mdx b/fern/v0.5.0/pages/guides/nemotron-3-nano.mdx
new file mode 100644
index 0000000000..49697325e3
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/nemotron-3-nano.mdx
@@ -0,0 +1,70 @@
+---
+title: Nemotron 3 Nano
+description: ""
+---
+
+This guide explains how to post-train the [Nemotron 3 Nano model](https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf) using NeMo RL.
+
+## Download and prepare the data
+
+```bash
+# Download RL data blend
+uvx --from huggingface-hub hf download nvidia/Nemotron-3-Nano-RL-Training-Blend --repo-type dataset --local-dir=data
+
+# Fill in placeholders in dataset
+chmod +x data/create_nanov3_jsonl.py
+./data/create_nanov3_jsonl.py --input data/train.jsonl --output data/train-full.jsonl
+
+# Use the last 1000 rows for validation
+head -n -1000 data/train-full.jsonl > data/train-split.jsonl
+tail -n 1000 data/train-full.jsonl > data/val-split.jsonl
+```
+
+## Prepare the code
+Note that we currently require using the `nano-v3` branch to train Nemotron 3 Nano.
+```bash
+# Checkout NeMo RL
+git clone -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git
+cd RL
+
+# Initialize the submodules
+git submodule update --init --recursive
+```
+
+## Create a launch script
+
+Create a file named `launch.sh` with the following contents. Be sure to fill in `DATA_DIR`, `MODEL_CHECKPOINT`, `WANDB_API_KEY`, `SLURM_ACCOUNT`, `SLURM_PARTITION`, and `MOUNTS`. Note that the default recipe (`examples/nemo_gym/grpo_nanov3.yaml`) uses 32 nodes.
+
+```bash
+CODE_DIR=$PWD
+SLURM_JOB_NAME=nano-v3-rl-training
+
+# Fill these in
+DATA_DIR=...
+MODEL_CHECKPOINT=...
+WANDB_API_KEY=...
+SLURM_ACCOUNT=...
+SLURM_PARTITION=...
+MOUNTS=... # SRC:DST[,SRC:DST...] e.g., MOUNTS="/lustre:/lustre,/data:/data"
+
+CONTAINER="nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano"
+COMMAND="uv run examples/nemo_gym/run_grpo_nemo_gym.py --config examples/nemo_gym/grpo_nanov3.yaml data.train_jsonl_fpath=$DATA_DIR/train-split.jsonl data.validation_jsonl_fpath=$DATA_DIR/val-split.jsonl policy.model_name=$MODEL_CHECKPOINT logger.wandb_enabled=True"
+
+COMMAND="${COMMAND}" \
+CONTAINER="${CONTAINER}" \
+MOUNTS="${MOUNTS}" \
+WANDB_API_KEY=${WANDB_API_KEY} \
+sbatch \
+ --nodes=32 \
+ --account="${SLURM_ACCOUNT}" \
+ --job-name="${SLURM_JOB_NAME}" \
+ --partition="${SLURM_PARTITION}" \
+ --time=4:0:0 \
+ --gres=gpu:8 \
+ ray.sub
+```
+
+## Launch training
+```bash
+bash launch.sh
+```
diff --git a/fern/v0.5.0/pages/guides/prorlv2.mdx b/fern/v0.5.0/pages/guides/prorlv2.mdx
new file mode 100644
index 0000000000..3672e0fbe8
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/prorlv2.mdx
@@ -0,0 +1,238 @@
+---
+title: An In-Depth Walkthrough of ProRLv2 in NeMo RL
+description: ""
+---
+
+This guide covers the ProRLv2 configuration pattern in NeMo RL, based on the example config [`examples/configs/prorlv2.yaml`](/../../examples/configs/prorlv2.yaml).
+
+ProRLv2 (as used in this repo) is best thought of as **GRPO plus a bundle of stability/efficiency techniques** commonly used for long-horizon RL fine-tuning:
+
+- **DAPO dynamic sampling**: skip prompt-groups with zero reward variance
+- **Decoupled (asymmetric) clipping**: `ratio_clip_max > ratio_clip_min`
+- **Token-level policy gradient loss**
+- **Importance sampling correction and TIS/ICE-POP** (especially helpful for MoE/backend-mismatch scenarios)
+- **Reinforce++: Decoupled local/global advantage normalization** (`reinforce_plus_plus`)
+- **“Stop properly” penalty** for truncated responses
+
+This document focuses on ProRLv2-specific knobs and gotchas. For foundational concepts on GRPO (data, environments, generation backends, loss/metrics), see the [NeMo RL GRPO Guide](/grpo). For the original DAPO motivation behind dynamic sampling/overlong shaping, see the [NeMo RL DAPO Guide](/dapo).
+
+## Quickstart: Launch a ProRLv2 Run
+
+Use the example configuration [`examples/configs/prorlv2.yaml`](/../../examples/configs/prorlv2.yaml):
+
+```bash
+uv run examples/run_grpo_math.py --config examples/configs/prorlv2.yaml {overrides}
+```
+
+`prorlv2.yaml` inherits from [`examples/configs/grpo_math_1B.yaml`](/../../examples/configs/grpo_math_1B.yaml) and only overrides a small set of fields under `grpo` and `loss_fn`, plus output directories.
+
+**Reminder**: Don’t forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You’ll need to do a `huggingface-cli login` as well for gated models.
+
+## DAPO: Dynamic Sampling
+
+Standard GRPO will train on all generated responses, even when a prompt’s `num_generations_per_prompt` responses all receive the same reward (no per-prompt learning signal). **Dynamic sampling** filters to keep only prompt-groups with diverse rewards (`std > 0`), and can accumulate across multiple generation batches until it reaches the target rollout batch size.
+
+- **Config**: enable with `grpo.use_dynamic_sampling: true` and tune:
+  - `grpo.batch_multiplier`: how many extra prompts to generate to compensate for filtering
+  - `grpo.dynamic_sampling_max_gen_batches`: upper bound on generation rounds before raising an error
+- **Implementation**: see `dynamic_sampling()` in [`nemo_rl/algorithms/grpo.py`](/../../nemo_rl/algorithms/grpo.py).
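
The filtering criterion can be sketched as follows (a simplified, hypothetical version of what `dynamic_sampling()` does; accumulation across generation batches is omitted):

```python
def keep_diverse_prompt_groups(rewards_by_prompt):
    """Keep only prompt-groups whose rewards have nonzero variance,
    i.e., groups that carry a learning signal."""
    kept = []
    for prompt_id, rewards in rewards_by_prompt.items():
        mean = sum(rewards) / len(rewards)
        variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        if variance > 0:
            kept.append(prompt_id)
    return kept

groups = {
    "p0": [1.0, 1.0, 1.0],  # all correct: no signal, dropped
    "p1": [0.0, 0.0, 0.0],  # all wrong: no signal, dropped
    "p2": [1.0, 0.0, 1.0],  # mixed rewards: kept
}
assert keep_diverse_prompt_groups(groups) == ["p2"]
```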
+
+## Advantage Estimator: Reinforce++
+
+The ProRLv2 recipe uses **Reinforce++** advantage estimation instead of the standard GRPO-style group baseline.
+
+Quick intuition:
+
+- Reinforce++ uses **decoupled local + global normalization**.
+- Compared to GRPO-style **local-only normalization**, this decoupling can be **more stable** in longer runs (less sensitivity to per-batch scale/variance shifts).
+
+Computation (as implemented in this repo, with the ProRLv2 example defaults):
+
+```text
+Defaults in examples/configs/prorlv2.yaml:
+ grpo.adv_estimator.minus_baseline = true
+ loss_fn.use_kl_in_reward = false
+
+Steps:
+ 1) Per prompt-group, compute mean reward, then subtract it:
+ a_i = r_i - mean_{j in same prompt} r_j
+
+ 2) Global normalize across *all valid response tokens* in the batch:
+ A <- (A - mean(A)) / sqrt(max(var(A), 1e-8))
+```
+
+```yaml
+grpo:
+ adv_estimator:
+ name: "reinforce_plus_plus"
+ normalize_rewards: true
+ use_leave_one_out_baseline: false
+ minus_baseline: true
+```
+
+- **Config**: `grpo.adv_estimator.name: "reinforce_plus_plus"`
+- **Implementation**: the training loop wires this via `ReinforcePlusPlusAdvantageEstimator` in [`nemo_rl/algorithms/grpo.py`](/../../nemo_rl/algorithms/grpo.py).
+- **Reference**: [REINFORCE++ paper](https://arxiv.org/abs/2501.03262)
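
The two normalization steps above can be sketched as follows (a hypothetical, response-level simplification of the estimator; per-token masking omitted):

```python
def reinforce_plus_plus_advantages(rewards_by_prompt):
    """Decoupled normalization: subtract each prompt-group's mean reward,
    then normalize globally across the batch."""
    # 1) Local: subtract the per-prompt-group mean (minus_baseline)
    advantages = []
    for rewards in rewards_by_prompt:
        mean = sum(rewards) / len(rewards)
        advantages.extend(r - mean for r in rewards)
    # 2) Global: normalize across all samples in the batch
    g_mean = sum(advantages) / len(advantages)
    variance = sum((a - g_mean) ** 2 for a in advantages) / len(advantages)
    std = max(variance, 1e-8) ** 0.5
    return [(a - g_mean) / std for a in advantages]

adv = reinforce_plus_plus_advantages([[1.0, 0.0], [1.0, 1.0]])
assert abs(sum(adv)) < 1e-6  # zero mean after global normalization
```

The decoupling means the scale of the final advantages is set by the whole batch rather than by each (possibly tiny or degenerate) prompt-group on its own.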
+
+## Reward Shaping: “Stop properly” Penalty (Truncation Penalty)
+
+When a generation hits the max length without emitting EOS, many pipelines mark it as **truncated**. The “stop properly” penalty scales the reward for truncated samples:
+
+- `stop_properly_penalty_coef = 0.0`: truncated samples get **zero reward**
+- `stop_properly_penalty_coef = 1.0`: **no penalty** (keep original rewards)
+- Any value in $[0, 1]$ interpolates between the two.
+
+In the example config:
+
+```yaml
+grpo:
+ reward_shaping:
+ enabled: true
+ stop_properly_penalty_coef: 0.0
+```
+
+- **Implementation**: `apply_reward_shaping()` in [`nemo_rl/algorithms/reward_functions.py`](/../../nemo_rl/algorithms/reward_functions.py).
+
+
+In the current implementation, if `stop_properly_penalty_coef` is set (not `null`), `apply_reward_shaping()` **returns early** after applying truncation scaling. That means you **cannot** apply DAPO "overlong reward shaping" in the same run unless you set `stop_properly_penalty_coef: null` and provide the DAPO overlong parameters (`overlong_buffer_length`, `overlong_buffer_penalty`, `max_response_length`).
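
The penalty rule itself can be sketched as follows (a hypothetical helper; see `apply_reward_shaping()` for the actual logic):

```python
def stop_properly_penalty(rewards, truncated, coef=0.0):
    """Scale rewards of truncated samples by coef: 0.0 zeroes them,
    1.0 keeps them, values in between interpolate."""
    return [r * coef if t else r for r, t in zip(rewards, truncated)]

# coef=0.0 (the example-config setting) zeroes truncated rewards
assert stop_properly_penalty([1.0, 1.0], [False, True]) == [1.0, 0.0]
# coef=1.0 applies no penalty
assert stop_properly_penalty([1.0, 1.0], [False, True], coef=1.0) == [1.0, 1.0]
```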
+
+
+## Loss: Decoupled (Asymmetric) Clipping
+
+ProRLv2 uses DAPO’s “decoupled clipping” idea by setting different lower/upper clip bounds:
+
+```yaml
+loss_fn:
+ ratio_clip_min: 0.2
+ ratio_clip_max: 0.27
+```
+
+This keeps PPO/GRPO-style clipping behavior but allows a larger expansion region than the contraction region, which can help exploration and reduce early collapse.
+
+- **Implementation**: `ClippedPGLossFn` documents decoupled clipping in [`nemo_rl/algorithms/loss_functions.py`](/../../nemo_rl/algorithms/loss_functions.py).
+
+## Loss: Token-level Policy Gradient
+
+ProRLv2 enables token-level loss:
+
+```yaml
+loss_fn:
+ token_level_loss: true
+```
+
+This computes the policy gradient loss per token (under masking) instead of aggregating per sequence, which is often helpful for long CoT/variable-length rollouts.
+
+## Truncated Importance Sampling
+
+When training and generation backends differ (e.g., numerics, precision, MoE routing, or vLLM vs training framework), you may see a mismatch between:
+
+- `generation_logprobs` (logprobs under the generation backend that produced samples)
+- `prev_logprobs` (logprobs under the training framework policy)
+
+NeMo RL supports **importance sampling correction**, and ProRLv2’s example config turns it on together with **truncated importance sampling**.
+
+Quick intuition:
+
+- This is mainly useful for **MoE/backend mismatch** cases, where the generation backend and the training policy can disagree on logprobs.
+- We compute an importance weight from `prev_logprobs` (training policy) vs `generation_logprobs` (generator). **ICE-POP** drops outliers by zeroing weights outside $[min, max]$.
+- In the common setup of **one policy update per rollout batch** (i.e., minibatch equals the per-step rollout batch; no PPO multi-epoch reuse), the PPO/GRPO likelihood ratio term is effectively **1.0** at update time, so the main stability issue is the MoE/backend-mismatch importance weights.
+- “Online ICE-POP” here just means applying that ICE-POP filtering **during loss computation** on the current training batch.
+
+- **Reference**: [The Online IcePop Solution for MoE models](https://hijkzzz.notion.site/online-ice-pop)
+
+```yaml
+loss_fn:
+ use_importance_sampling_correction: true
+ truncated_importance_sampling_ratio: 5.0
+ truncated_importance_sampling_ratio_min: 0.5
+ truncated_importance_sampling_type: "icepop"
+```
+
+- **`use_importance_sampling_correction`**: enable token-level importance weights (must be `true` for truncated IS)
+- **`truncated_importance_sampling_ratio`**: upper bound (or upper threshold)
+- **`truncated_importance_sampling_ratio_min`**: lower bound used by ICE-POP filtering
+- **`truncated_importance_sampling_type`**:
+  - `"tis"`: clamp weights to `<= truncated_importance_sampling_ratio`
+  - `"icepop"`: set weights outside $[min, max]$ to zero (filter outliers)
+  - `"seq-mask-tis"`: sequence-level geometric-mean mask + non-truncated token-level IS correction (see below)
+
+- **Implementation**: see `ClippedPGLossFn` init-time checks and logic in [`nemo_rl/algorithms/loss_functions.py`](/../../nemo_rl/algorithms/loss_functions.py).
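
The `tis` and `icepop` modes can be sketched per token as follows (a hypothetical helper using the example-config bounds; not the `ClippedPGLossFn` code):

```python
import math

def truncated_is_weights(logp_train, logp_gen, kind="icepop", lo=0.5, hi=5.0):
    """Per-token importance weights pi_train/pi_gen:
    'tis' clamps weights above hi; 'icepop' zeroes weights outside [lo, hi]."""
    weights = []
    for lt, lg in zip(logp_train, logp_gen):
        w = math.exp(lt - lg)
        if kind == "tis":
            w = min(w, hi)
        elif kind == "icepop" and not (lo <= w <= hi):
            w = 0.0  # drop outlier tokens entirely
        weights.append(w)
    return weights

lt, lg = [-1.0, -1.0, -4.0], [-1.0, -4.0, -1.0]  # ratios: 1, e^3, e^-3
assert truncated_is_weights(lt, lg, "tis") == [1.0, 5.0, math.exp(-3.0)]
assert truncated_is_weights(lt, lg, "icepop") == [1.0, 0.0, 0.0]
```

The difference matters for gradients: `tis` still trains on clamped outlier tokens, while `icepop` removes them from the loss entirely.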
+
+### Seq-mask-tis: Sequence-level Geometric-Mean Mask
+
+`seq-mask-tis` is an alternative to ICE-POP that operates at the **sequence level** instead of per-token:
+
+1. For each sequence, compute the **geometric mean** of per-token IS ratios: $\text{geo\_mean}_i = \exp\!\bigl(\frac{1}{T_i}\sum_t \log \frac{\pi_{\text{train}}(a_t)}{\pi_{\text{gen}}(a_t)}\bigr)$
+2. **Mask out** entire sequences whose geometric mean falls outside $[min, max]$.
+3. For retained sequences, apply the **non-truncated** (raw) token-level IS ratios to correct per-token gradients — no clamping, no per-token filtering.
+
+Key differences from ICE-POP:
+
+| | ICE-POP | seq-mask-tis |
+|---|---|---|
+| Filtering granularity | per token | per sequence |
+| IS correction weights | filtered (zeroed outside bounds) | raw / non-truncated |
+| Reference bounds | min=0.5, max=5 | min=0.999, max=1.002 |
+
+```yaml
+loss_fn:
+ use_importance_sampling_correction: true
+ truncated_importance_sampling_ratio: 1.002
+ truncated_importance_sampling_ratio_min: 0.999
+ truncated_importance_sampling_type: "seq-mask-tis"
+```
+
+Both ICE-POP and seq-mask-tis report a shared metric **`is_oob_ratio`** — the fraction of tokens (ICE-POP) or sequences (seq-mask-tis) that were filtered out.
+
+- **Reference**: [When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)
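
The three seq-mask-tis steps can be sketched as follows (a hypothetical helper using the reference bounds from the table above; not the actual loss code):

```python
import math

def seq_mask_tis_weights(logp_train_seqs, logp_gen_seqs, lo=0.999, hi=1.002):
    """Mask whole sequences whose geometric-mean IS ratio falls outside
    [lo, hi]; retained sequences keep their raw per-token ratios."""
    out = []
    for lts, lgs in zip(logp_train_seqs, logp_gen_seqs):
        log_ratios = [lt - lg for lt, lg in zip(lts, lgs)]
        geo_mean = math.exp(sum(log_ratios) / len(log_ratios))
        if lo <= geo_mean <= hi:
            out.append([math.exp(d) for d in log_ratios])  # raw token weights
        else:
            out.append([0.0] * len(log_ratios))            # sequence masked out
    return out

train = [[-1.0, -1.0], [-1.0, -2.0]]
gen = [[-1.0, -1.0], [-1.0, -1.0]]
w = seq_mask_tis_weights(train, gen)
assert w[0] == [1.0, 1.0]  # geometric mean 1.0: retained
assert w[1] == [0.0, 0.0]  # geometric mean e^-0.5 ~ 0.61: masked out
```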
+
+## Full Example Config (Annotated)
+
+The ProRLv2 example config is intentionally small and relies on defaults from `grpo_math_1B.yaml`.
+
+- **Example config**: [`examples/configs/prorlv2.yaml`](/../../examples/configs/prorlv2.yaml)
+- **Base defaults**: [`examples/configs/grpo_math_1B.yaml`](/../../examples/configs/grpo_math_1B.yaml)
+
+## Practical Overrides
+
+A few common overrides when launching:
+
+```bash
+uv run examples/run_grpo_math.py \
+ --config examples/configs/prorlv2.yaml \
+ policy.model_name="Qwen/Qwen2.5-1.5B" \
+ logger.wandb_enabled=true \
+ logger.wandb.project="prorlv2-dev" \
+ checkpointing.checkpoint_dir="results/prorlv2" \
+ logger.log_dir="logs/prorlv2"
+```
+
+If you want to enable DAPO overlong reward shaping instead of stop-properly:
+
+```bash
+uv run examples/run_grpo_math.py \
+ --config examples/configs/prorlv2.yaml \
+ grpo.reward_shaping.stop_properly_penalty_coef=null \
+ grpo.reward_shaping.overlong_buffer_length=4096 \
+ grpo.reward_shaping.overlong_buffer_penalty=1.0 \
+ grpo.reward_shaping.max_response_length=20480
+```
+
+## What to Monitor
+
+In addition to task rewards/accuracy, a few stability signals are particularly useful with ProRLv2-style runs:
+
+- **Dynamic sampling efficiency**: if enabled, watch how often batches need multiple generation rounds (see the [NeMo RL DAPO Guide](/dapo) for detailed guidance).
+- **Training–generation mismatch**: `token_mult_prob_error`, `gen_kl_error`, `policy_kl_error`, `js_divergence_error` are computed in `ClippedPGLossFn` (see the [GRPO metrics section](/grpo#metrics)).
+- **Truncation rate**: if high, either increase `policy.max_total_sequence_length`/`policy.generation.max_model_len` or relax truncation penalty (`stop_properly_penalty_coef`).
+
+## References
+
+- **ProRLv2 blog**: [Scaling LLM Reinforcement Learning with Prolonged Training using ProRL v2](https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/)
+- **DAPO**: [Decoupled Clip and Dynamic Sampling Policy Optimization](https://arxiv.org/pdf/2503.14476)
+- **GRPO**: [Group Relative Policy Optimization](https://arxiv.org/abs/2402.03300)
+- **REINFORCE++**: [REINFORCE++](https://arxiv.org/abs/2501.03262)
+- **DLER (stop properly penalty explanation)**: [DLER](https://arxiv.org/pdf/2510.15110)
+- **seq-mask-tis blog**: [When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch](https://yingru.notion.site/When-Speed-Kills-Stability-Demystifying-RL-Collapse-from-the-Training-Inference-Mismatch-271211a558b7808d8b12d403fd15edda)
+- **[NeMo RL GRPO Guide](/grpo)**
+- **[NeMo RL DAPO Guide](/dapo)**
diff --git a/fern/v0.5.0/pages/guides/rm.mdx b/fern/v0.5.0/pages/guides/rm.mdx
new file mode 100644
index 0000000000..cef70848ea
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/rm.mdx
@@ -0,0 +1,233 @@
+---
+title: Reward Model Training in NeMo RL
+description: ""
+---
+
+This document explains how to train reward models (RM) within NeMo RL. Currently, only Bradley-Terry reward models are supported on the DTensor backend. Megatron backend support is tracked [here](https://github.com/NVIDIA-NeMo/RL/issues/720).
+
+## Launch a Training Job
+
+The script, [examples/run_rm.py](/../../examples/run_rm.py), is used to train a Bradley-Terry reward model. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](/../cluster).
+
+Be sure to launch the job using `uv`. The command to launch a training job is as follows:
+
+```bash
+uv run examples/run_rm.py
+
+# Can also add overrides on CLI, like changing the config or changing the model
+uv run examples/run_rm.py --config examples/configs/rm.yaml policy.model_name=Qwen/Qwen2.5-1.5B
+```
+
+The default YAML config shares the same base template as the SFT config but includes a new `reward_model_cfg` section with `enabled: true` to load the model as a Reward Model. You can find an example RM config file at [examples/configs/rm.yaml](/../../examples/configs/rm.yaml).
+
+**Reminder**: Set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). Make sure to log in using `huggingface-cli` if you're working with Llama models.
+
+## Datasets
+
+RM datasets in NeMo RL are encapsulated using classes. Each RM data class is expected to have the following attributes:
+ 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below.
+ 2. `task_name`: A string identifier that uniquely identifies the dataset.
+
+If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. An example implementation can be found in [preference_datasets/tulu3.py](/../../nemo_rl/data/datasets/preference_datasets/tulu3.py).
+
+**Note:** The `task_name` field is required in each formatted example.
+
+```json
+{
+ "context": [], // list of dicts - The prompt message (including previous turns, if any)
+ "completions": [ // list of dicts — The list of completions
+ {
+ "rank": 0, // int — The rank of the completion (lower rank is preferred)
+ "completion": [] // list of dicts — The completion message(s)
+ },
+ {
+ "rank": 1, // int — The rank of the completion (lower rank is preferred)
+ "completion": [] // list of dicts — The completion message(s)
+ }
+ ],
+ "task_name": "task_name" // identifier for the task
+}
+```
+
+Currently, RM training supports only two completions (where the lowest rank is preferred and the highest one is rejected), with each completion being a single response. For example:
+```json
+{
+ "context": [
+ {
+ "role": "user",
+ "content": "What's the capital of France?"
+ },
+ {
+ "role": "assistant",
+ "content": "The capital of France is Paris."
+ },
+ {
+ "role": "user",
+ "content": "Thanks! And what's the capital of Germany?"
+ }
+ ],
+ "completions": [
+ {
+ "rank": 0,
+ "completion": [
+ {
+ "role": "assistant",
+ "content": "The capital of Germany is Berlin."
+ }
+ ]
+ },
+ {
+ "rank": 1,
+ "completion": [
+ {
+ "role": "assistant",
+ "content": "The capital of Germany is Munich."
+ }
+ ]
+ }
+ ],
+ "task_name": "task_name"
+}
+```
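+
+For intuition, the format constraints above can be checked with a small validation sketch (the helper is hypothetical, not a NeMo RL API):
+
+```python
+# Minimal, illustrative validator for the preference record format above.
+def validate_preference_example(example: dict) -> None:
+    assert isinstance(example.get("context"), list), "context must be a list of messages"
+    completions = example.get("completions")
+    # RM training currently supports exactly two ranked completions.
+    assert isinstance(completions, list) and len(completions) == 2
+    # Lower rank is preferred; the two ranks must be distinct.
+    assert sorted(c["rank"] for c in completions) == [0, 1]
+    for c in completions:
+        assert isinstance(c["completion"], list), "each completion is a list of messages"
+    assert "task_name" in example, "task_name is required in each formatted example"
+
+example = {
+    "context": [{"role": "user", "content": "What's the capital of Germany?"}],
+    "completions": [
+        {"rank": 0, "completion": [{"role": "assistant", "content": "Berlin."}]},
+        {"rank": 1, "completion": [{"role": "assistant", "content": "Munich."}]},
+    ],
+    "task_name": "geography",
+}
+validate_preference_example(example)  # passes silently
+```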
+
+By default, NeMo RL has support for the [HelpSteer3](/../../nemo_rl/data/datasets/preference_datasets/helpsteer3.py) and [Tulu3Preference](/../../nemo_rl/data/datasets/preference_datasets/tulu3.py) datasets. Both of these datasets are downloaded from Hugging Face and preprocessed on the fly, so there's no need to provide a path to any datasets on disk.
+
+We provide a [PreferenceDataset](/../../nemo_rl/data/datasets/preference_datasets/preference_dataset.py) class that is compatible with JSONL-formatted preference datasets, for loading datasets from a local path or from Hugging Face. You can modify your config as follows to use such a custom preference dataset:
+```yaml
+data:
+ # other data settings, see `examples/configs/dpo.yaml` for more details
+ ...
+ # dataset settings
+ train:
+ # this dataset will override prompt_key and use the default values for other vars
+ data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
+ subset: null # used for HuggingFace datasets
+ split: train # used for HuggingFace datasets
+ validation:
+ # this dataset will use the default values for other vars except data_path
+ data_path: /path/to/local/val_dataset.jsonl
+ default:
+ # will use below vars as default values if dataset doesn't specify it
+ dataset_name: PreferenceDataset
+ prompt_file: null
+ system_prompt_file: null
+  # multiple validation sets are supported by using val_data_paths
+  # this will be removed after a refactor
+  val_data_paths:
+    val_set_1: /path/to/local/val_dataset_1.jsonl
+    val_set_2: /path/to/local/val_dataset_2.jsonl
+```
+
+Your JSONL files should contain one JSON object per line with the following structure:
+
+```json
+{
+ "context": [{"role": "user", "content": "What is 2+2?"}], // list of dicts - The prompt message (including previous turns, if any)
+ "completions": [ // list of dicts — The list of completions
+ {
+ "rank": 0, // int — The rank of the completion (lower rank is preferred)
+ "completion": [{"role": "assistant", "content": "The answer is 4."}] // list of dicts — The completion message(s)
+ },
+ {
+ "rank": 1, // int — The rank of the completion (lower rank is preferred)
+ "completion": [{"role": "assistant", "content": "I don't know."}] // list of dicts — The completion message(s)
+ }
+ ]
+}
+```
+
+We also provide a [BinaryPreferenceDataset](/../../nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py) class, which is a simplified version of PreferenceDataset for pairwise ranked preference with single turn completions. You can use `prompt_key`, `chosen_key` and `rejected_key` to specify which fields in your data correspond to the question, chosen answer and rejected answer respectively. Here's an example configuration:
+```yaml
+data:
+ # other data settings, see `examples/configs/dpo.yaml` for more details
+ ...
+ # dataset settings
+ train:
+ # this dataset will override prompt_key and use the default values for other vars
+ data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
+ prompt_key: context
+ subset: null # used for HuggingFace datasets
+ split: train # used for HuggingFace datasets
+ validation:
+ # this dataset will use the default values for other vars except data_path
+ data_path: /path/to/local/val_dataset.jsonl
+ default:
+ # will use below vars as default values if dataset doesn't specify it
+ dataset_name: BinaryPreferenceDataset
+ prompt_key: prompt
+ chosen_key: chosen
+ rejected_key: rejected
+ prompt_file: null
+ system_prompt_file: null
+```
+
+Your JSONL files should contain one JSON object per line with the following structure:
+
+```json
+{
+  "prompt": "What is 2+2?", // prompt_key: the question
+  "chosen": "The answer is 4.", // chosen_key: the preferred answer
+  "rejected": "I don't know." // rejected_key: the rejected answer
+}
+```
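+
+For intuition, the mapping from this binary format to the full preference format can be sketched as follows (the helper is hypothetical; it only mirrors the `prompt_key`/`chosen_key`/`rejected_key` config fields):
+
+```python
+# Illustrative sketch: expand a pairwise (prompt/chosen/rejected) record into
+# the full preference format, with rank 0 for the chosen answer.
+def binary_to_preference(record: dict, prompt_key: str = "prompt",
+                         chosen_key: str = "chosen",
+                         rejected_key: str = "rejected") -> dict:
+    return {
+        "context": [{"role": "user", "content": record[prompt_key]}],
+        "completions": [
+            {"rank": 0, "completion": [{"role": "assistant", "content": record[chosen_key]}]},
+            {"rank": 1, "completion": [{"role": "assistant", "content": record[rejected_key]}]},
+        ],
+        "task_name": "binary_preference",  # hypothetical task identifier
+    }
+
+row = {"prompt": "What is 2+2?", "chosen": "The answer is 4.", "rejected": "I don't know."}
+out = binary_to_preference(row)
+print(out["completions"][0]["completion"][0]["content"])  # -> The answer is 4.
+```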
+
+Please note:
+- If you are using a logger, the prefix used for each validation set will be `validation-`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
+- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<validation_set_name>_loss`.
+
+## Using Reward Models as Environments
+
+Trained reward models can be used as environments in GRPO training for reinforcement learning from human feedback (RLHF). This allows you to use your trained reward model to provide rewards during policy optimization.
+
+### Reward Model Environment
+
+The Reward Model Environment provides a standardized interface for using trained reward models in RL training:
+
+```python
+from nemo_rl.environments.reward_model_environment import RewardModelEnvironment
+
+env_config = {
+ "enabled": True,
+ "model_name": "path/to/your/trained/reward/model",
+ "tokenizer": {"name": "path/to/your/trained/reward/model"},
+ "precision": "bfloat16",
+ "batch_size": 32,
+ "resources": {"gpus_per_node": 1, "num_nodes": 1},
+ "reward_model_cfg": {
+ "enabled": True,
+ "reward_model_type": "bradley_terry",
+ },
+}
+
+reward_env = RewardModelEnvironment.remote(env_config)
+```
+
+### Integration with GRPO
+
+To use your trained reward model with GRPO, you can use the [examples/run_grpo.py](/../../examples/run_grpo.py) script with the [examples/configs/grpo_rm_1B.yaml](/../../examples/configs/grpo_rm_1B.yaml) config:
+
+```bash
+# Run GRPO training with your trained reward model
+uv run examples/run_grpo.py --config examples/configs/grpo_rm_1B.yaml
+```
+
+### Configuration
+
+In your GRPO configuration, specify the reward model environment:
+
+```yaml
+env:
+ reward_model:
+ enabled: true
+ model_name: "path/to/your/trained/reward/model"
+ tokenizer:
+ name: "path/to/your/trained/reward/model"
+ precision: "bfloat16"
+ batch_size: 32
+ resources:
+ gpus_per_node: 1
+ num_nodes: 1
+ reward_model_cfg:
+ enabled: true
+ reward_model_type: "bradley_terry"
+```
diff --git a/fern/v0.5.0/pages/guides/sft-openmathinstruct2.mdx b/fern/v0.5.0/pages/guides/sft-openmathinstruct2.mdx
new file mode 100644
index 0000000000..0fa5cae5d1
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/sft-openmathinstruct2.mdx
@@ -0,0 +1,96 @@
+---
+title: SFT on OpenMathInstruct-2
+description: ""
+---
+
+This guide explains how to use NeMo RL to run SFT on the [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) math instruction tuning dataset. We then show how to use NeMo RL's evaluation scripts to evaluate the trained model on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500).
+
+## Train the Model
+To train the model using NeMo RL, use the `examples/configs/recipes/tutorials/sft/sft_openmathinstruct2.yaml` config file. This file closely matches the experiment settings in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560).
+
+```bash
+uv run examples/run_sft.py --config=examples/configs/recipes/tutorials/sft/sft_openmathinstruct2.yaml
+```
+
+### Dataset Splits
+
+The OpenMathInstruct-2 dataset comes in several versions of different sizes. Configure which version to use via the `data.split` config:
+
+* `train`: the full set of 14M problem–solution pairs
+* `train_1M`, `train_2M`, `train_5M`: fair-downsampled subsets of 1M, 2M, or 5M examples
+
+By default, the config uses the 1M subset (`data.split=train_1M`).
+
+### Training Time
+The default config uses 8 GPUs (`cluster.gpus_per_node`) on 1 node (`cluster.num_nodes`), which should complete 1 epoch of training on the `train_1M` dataset (1855 steps) in around 20 hours. Additional nodes can be used to speed up training. In our experiments, using 8 nodes completed 1 epoch of training on the `train_1M` dataset in less than 4 hours.
+
+## Evaluate the Model
+Throughout training, the checkpoints of the model will be saved to the `results/sft_openmathinstruct2` folder (specified by `checkpointing.checkpoint_dir`). To evaluate the model, we first need to convert the PyTorch distributed checkpoint to Hugging Face format:
+
+```bash
+uv run examples/converters/convert_dcp_to_hf.py \
+ --config=results/sft_openmathinstruct2/step_1855/config.yaml \
+ --dcp-ckpt-path=results/sft_openmathinstruct2/step_1855/policy/weights \
+ --hf-ckpt-path=results/sft_openmathinstruct2/step_1855/hf
+```
+
+Replace `results/sft_openmathinstruct2/step_1855` with the path to the checkpoint you are evaluating. The resulting Hugging Face checkpoint will be saved to `--hf-ckpt-path`.
+
+To evaluate on the [MATH-500 benchmark](https://huggingface.co/datasets/HuggingFaceH4/MATH-500), use the following command:
+
+```bash
+uv run examples/run_eval.py \
+ --config=examples/configs/evals/eval.yaml \
+ generation.model_name=results/sft_openmathinstruct2/step_1855/hf \
+ tokenizer.name=meta-llama/Llama-3.1-8B-Instruct \
+ data.dataset_name=HuggingFaceH4/MATH-500 \
+ data.dataset_key=test
+```
+
+Use `generation.model_name` to specify the path to the Hugging Face checkpoint.
+
+## Results
+
+In this section we present the results of several reference experiments for the `train_1M` and `train` versions of the dataset.
+
+### train_1M
+Using the above instructions to train a Llama-3.1-8B model for 1 epoch on the `train_1M` version of the OpenMathInstruct-2 dataset, we get the following loss curve:
+
+
+
+Evaluating the final checkpoint on MATH-500, we get the following result:
+
+```
+============================================================
+model_name='hf' dataset_name='MATH-500'
+max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1
+
+metric='pass@1' num_tests_per_prompt=1
+
+score=0.5020 (251.0/500)
+============================================================
+```
+
+As a reference, using NeMo-Aligner and NeMo-Skills (as is done in the [original OpenMathInstruct-2 paper](https://arxiv.org/abs/2410.01560)) to train and evaluate the same model on the same dataset achieves the same score of 0.5020 on MATH-500.
+
+### train
+We also trained a Llama-3.1-8B model for 1 epoch on the full `train` version of the OpenMathInstruct-2 dataset. We obtain the following loss curve:
+
+
+
+Evaluating the final checkpoint on MATH-500, we get the following result:
+
+```
+============================================================
+model_name='hf' dataset_name='MATH-500'
+max_new_tokens=2048 temperature=0.0 top_p=1.0 top_k=-1
+
+metric='pass@1' num_tests_per_prompt=1
+
+score=0.6220 (311.0/500)
+============================================================
+```
+
+Using NeMo-Aligner and NeMo-Skills to train the model in the same settings achieves a score of 0.6140 (307/500).
+
+As another point of reference, a checkpoint taken after 10,000 steps of training with NeMo RL achieves a score of 0.5800 (290.0/500).
diff --git a/fern/v0.5.0/pages/guides/sft.mdx b/fern/v0.5.0/pages/guides/sft.mdx
new file mode 100644
index 0000000000..44dfef47f3
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/sft.mdx
@@ -0,0 +1,324 @@
+---
+title: Supervised Fine-Tuning in NeMo RL
+description: ""
+---
+
+This document explains how to perform SFT within NeMo RL. It outlines key operations, including initiating SFT runs, managing experiment configurations using YAML, and integrating custom datasets that conform to the required structure and attributes.
+
+## Launch an SFT Run
+
+The script, [examples/run_sft.py](/../../examples/run_sft.py), can be used to launch an experiment. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](/../cluster).
+
+Be sure to launch the job using `uv`. The command to launch an SFT job is as follows:
+
+```bash
+uv run examples/run_sft.py --config <path_to_config_yaml>
+```
+
+If not specified, `config` will default to [examples/configs/sft.yaml](/../../examples/configs/sft.yaml).
+
+## Example Configuration File
+
+NeMo RL allows users to configure experiments using `yaml` config files. An example SFT configuration file can be found [here](/../../examples/configs/sft.yaml).
+
+To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example:
+
+```bash
+uv run examples/run_sft.py \
+ cluster.gpus_per_node=1 \
+ logger.wandb.name="sft-dev-1-gpu"
+```
+
+**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
+
+## Datasets
+
+SFT datasets in NeMo RL are encapsulated using classes. Each SFT data class is expected to have the following attributes:
+ 1. `dataset`: A dictionary containing the formatted datasets. Each example in the dataset must conform to the format described below.
+ 2. `task_name`: A string identifier that uniquely identifies the dataset.
+
+SFT datasets are expected to follow the HuggingFace chat format. Refer to the [chat dataset document](/../design-docs/chat-datasets) for details. If your data is not in the correct format, simply write a preprocessing script to convert the data into this format. [response_datasets/squad.py](/../../nemo_rl/data/datasets/response_datasets/squad.py) has an example:
+
+**Note:** The `task_name` field is required in each formatted example.
+
+```python
+def format_data(self, data: dict[str, Any]) -> dict[str, Any]:
+ return {
+ "messages": [
+ {
+ "role": "system",
+ "content": data["context"],
+ },
+ {
+ "role": "user",
+ "content": data["question"],
+ },
+ {
+ "role": "assistant",
+ "content": data["answers"]["text"][0],
+ },
+ ],
+ "task_name": self.task_name,
+ }
+```
+
+NeMo RL SFT uses Hugging Face chat templates to format the individual examples. Three types of chat templates are supported, which can be configured using the `tokenizer.chat_template` in your YAML config (see [sft.yaml](/../../examples/configs/sft.yaml) for an example):
+
+1. Apply the tokenizer's default chat template. To use the tokenizer's default, either omit `tokenizer.chat_template` from the config altogether, or set `tokenizer.chat_template="default"`.
+2. Use a "passthrough" template which simply concatenates all messages. This is desirable if the chat template has already been applied to your dataset as an offline preprocessing step. In this case, you should set `tokenizer.chat_template` to `null` as follows:
+ ```yaml
+ tokenizer:
+ chat_template: NULL
+ ```
+3. Use a custom template: If you would like to use a custom template, create a string template in [Jinja format](https://huggingface.co/docs/transformers/v4.34.0/en/chat_templating#how-do-i-create-a-chat-template), and add that string to the config. For example,
+
+ ```yaml
+ tokenizer:
+ custom_template: "{% for message in messages %}{%- if message['role'] == 'system' %}{{'Context: ' + message['content'].strip()}}{%- elif message['role'] == 'user' %}{{' Question: ' + message['content'].strip() + ' Answer: '}}{%- elif message['role'] == 'assistant' %}{{message['content'].strip()}}{%- endif %}{% endfor %}"
+ ```
+
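+For intuition, a plain-Python equivalent of the custom template above shows the exact string it renders for a SQuAD-style conversation (illustrative only; in NeMo RL the Jinja template is applied by the tokenizer):
+
+```python
+# Plain-Python equivalent of the custom Jinja template above.
+def render(messages: list) -> str:
+    out = []
+    for m in messages:
+        if m["role"] == "system":
+            out.append("Context: " + m["content"].strip())
+        elif m["role"] == "user":
+            out.append(" Question: " + m["content"].strip() + " Answer: ")
+        elif m["role"] == "assistant":
+            out.append(m["content"].strip())
+    return "".join(out)
+
+messages = [
+    {"role": "system", "content": "Paris is the capital of France."},
+    {"role": "user", "content": "What is the capital of France?"},
+    {"role": "assistant", "content": "Paris"},
+]
+print(render(messages))
+# -> Context: Paris is the capital of France. Question: What is the capital of France? Answer: Paris
+```
+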
+By default, NeMo RL has some built-in supported datasets (e.g., [OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [Squad](/../../nemo_rl/data/datasets/response_datasets/squad.py), etc.); you can see the full list [here](/../../nemo_rl/data/datasets/response_datasets/__init__.py).
+All of these datasets are downloaded from Hugging Face and preprocessed on the fly, so there's no need to provide a path to any datasets on disk.
+
+We provide a [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py) class that is compatible with JSONL-formatted response datasets for loading datasets from local path or Hugging Face. You can use `input_key`, `output_key` to specify which fields in your data correspond to the question and answer respectively. Here's an example configuration:
+```yaml
+data:
+ # other data settings, see `examples/configs/sft.yaml` for more details
+ ...
+ # dataset settings
+ train:
+ # this dataset will override input_key and use the default values for other vars
+ data_path: /path/to/local/train_dataset.jsonl # local file or hf_org/hf_dataset_name (HuggingFace)
+ input_key: question
+ subset: null # used for HuggingFace datasets
+ split: train # used for HuggingFace datasets
+ split_validation_size: 0.05 # use 5% of the training data as validation data
+ seed: 42 # seed for train/validation split when split_validation_size > 0
+ validation:
+ # this dataset will use the default values for other vars except data_path
+ data_path: /path/to/local/val_dataset.jsonl
+ default:
+ # will use below vars as default values if dataset doesn't specify it
+ dataset_name: ResponseDataset
+ input_key: input
+ output_key: output
+ prompt_file: null
+ system_prompt_file: null
+ processor: "sft_processor"
+```
+
+Your JSONL files should contain one JSON object per line with the following structure:
+
+```json
+{
+  "input": "Hello", // input_key: the question/prompt
+  "output": "Hi there!" // output_key: the target response
+}
+```
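+
+As a quick sketch, such a JSONL file can be produced with nothing more than the standard library (the keys mirror the `input_key`/`output_key` defaults above):
+
+```python
+# Illustrative sketch: write and re-read a ResponseDataset-compatible JSONL file.
+import json
+import pathlib
+import tempfile
+
+pairs = [("Hello", "Hi there!"), ("What is 2+2?", "4")]
+path = pathlib.Path(tempfile.mkdtemp()) / "train_dataset.jsonl"
+with path.open("w") as f:
+    for question, answer in pairs:
+        f.write(json.dumps({"input": question, "output": answer}) + "\n")
+
+# Each line is an independent JSON object, as expected by the loader.
+rows = [json.loads(line) for line in path.read_text().splitlines()]
+print(rows[0]["output"])  # -> Hi there!
+```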
+
+We support using multiple datasets for train and validation. You can refer to `examples/configs/grpo_multiple_datasets.yaml` for a full configuration example. Here's an example configuration:
+```yaml
+data:
+ _override_: true # override the data config instead of merging with it
+ # other data settings, see `examples/configs/sft.yaml` for more details
+ ...
+ # dataset settings
+ train:
+ # train dataset 1
+ - dataset_name: OpenMathInstruct-2
+ split_validation_size: 0.05 # use 5% of the training data as validation data
+ seed: 42 # seed for train/validation split when split_validation_size > 0
+ # train dataset 2
+ - dataset_name: DeepScaler
+ validation:
+ # validation dataset 1
+ - dataset_name: AIME2024
+ repeat: 16
+ # validation dataset 2
+ - dataset_name: DAPOMathAIME2024
+ # default settings for all datasets
+ default:
+ ...
+```
+
+We support using a single dataset for both training and validation by using `split_validation_size` to set the fraction of data held out for validation.
+[OpenAssistant](/../../nemo_rl/data/datasets/response_datasets/oasst.py), [OpenMathInstruct-2](/../../nemo_rl/data/datasets/response_datasets/openmathinstruct2.py), [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py), [Tulu3SftMixtureDataset](/../../nemo_rl/data/datasets/response_datasets/tulu3.py) are supported for this feature.
+If you want to support this feature for your custom datasets or other built-in datasets, you can simply add the code to the dataset like [ResponseDataset](/../../nemo_rl/data/datasets/response_datasets/response_dataset.py).
+```python
+# `self.val_dataset` is used (not None) only when current dataset is used for both training and validation
+self.val_dataset = None
+self.split_train_validation(split_validation_size, seed)
+```
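+
+For intuition, here is a hedged sketch of what such a train/validation split might do (the actual NeMo RL implementation may differ):
+
+```python
+# Illustrative sketch: hold out a seeded fraction of the training examples
+# as validation data, as `split_validation_size` and `seed` suggest.
+import random
+
+def split_train_validation(examples: list, split_validation_size: float, seed: int):
+    rng = random.Random(seed)
+    idx = list(range(len(examples)))
+    rng.shuffle(idx)
+    n_val = int(len(examples) * split_validation_size)
+    val = [examples[i] for i in idx[:n_val]]
+    train = [examples[i] for i in idx[n_val:]]
+    return train, val
+
+train, val = split_train_validation(list(range(100)), 0.05, seed=42)
+print(len(train), len(val))  # -> 95 5
+```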
+
+### OpenAI Format Datasets (with Tool Calling Support)
+
+NeMo RL also supports datasets in the OpenAI conversation format, which is commonly used for chat models and function calling. This format is particularly useful for training models with tool-use capabilities.
+
+#### Basic Usage
+
+To use an OpenAI format dataset, configure your YAML as follows:
+
+```yaml
+data:
+ train:
+ dataset_name: openai_format
+ data_path: # Path to training data
+ chat_key: "messages" # Key for messages in the data (default: "messages")
+ system_key: null # Key for system message in the data (optional)
+ system_prompt: null # Default system prompt if not in data (optional)
+ tool_key: "tools" # Key for tools in the data (default: "tools")
+ use_preserving_dataset: false # Set to true for heterogeneous tool schemas (see below)
+ validation:
+ ...
+```
+
+#### Data Format
+
+Your JSONL files should contain one JSON object per line with the following structure:
+
+```json
+{
+ "messages": [
+ {"role": "system", "content": "You are a helpful assistant."},
+ {"role": "user", "content": "What's the weather in Paris?"},
+ {"role": "assistant", "content": "I'll check the weather for you.", "tool_calls": [
+ {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
+ ]},
+ {"role": "tool", "content": "22°C, sunny", "tool_call_id": "call_123"},
+ {"role": "assistant", "content": "The weather in Paris is currently 22°C and sunny."}
+ ],
+ "tools": [
+ {
+ "name": "get_weather",
+ "description": "Get current weather for a city",
+ "parameters": {
+ "city": {"type": "string", "description": "City name"},
+ "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+ }
+ }
+ ]
+}
+```
+
+#### Tool Calling with Heterogeneous Schemas
+
+When your dataset contains tools with different argument structures (heterogeneous schemas), you should enable `use_preserving_dataset: true` to avoid data corruption:
+
+```yaml
+data:
+ dataset_name: openai_format
+ ...
+ use_preserving_dataset: true # IMPORTANT: Enable this for tool calling datasets
+```
+
+**Why this matters:** Standard HuggingFace dataset loading enforces uniform schemas by adding `None` values for missing keys. For example:
+- Tool A has arguments: `{"query": "search term"}`
+- Tool B has arguments: `{"expression": "2+2", "precision": 2}`
+
+Without `use_preserving_dataset: true`, the loader would incorrectly add:
+- Tool A becomes: `{"query": "search term", "expression": None, "precision": None}`
+- Tool B becomes: `{"query": None, "expression": "2+2", "precision": 2}`
+
+This corrupts your training data and can lead to models generating invalid tool calls. The `PreservingDataset` mode maintains the exact structure of each tool call.
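+
+The corruption described above is easy to reproduce with plain dictionaries (an illustrative sketch, independent of any loader):
+
+```python
+# Demonstrate the schema-unification problem: padding heterogeneous tool
+# arguments to a shared key set injects spurious None values.
+tool_a = {"query": "search term"}
+tool_b = {"expression": "2+2", "precision": 2}
+
+# Naive uniform-schema loading gives every record every key.
+all_keys = sorted(set(tool_a) | set(tool_b))
+padded_a = {k: tool_a.get(k) for k in all_keys}
+print(padded_a)  # -> {'expression': None, 'precision': None, 'query': 'search term'}
+
+# A schema-preserving loader keeps each record exactly as written.
+preserved = [dict(tool_a), dict(tool_b)]
+assert preserved[0] == {"query": "search term"}  # no spurious None keys
+```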
+
+## Evaluate the Trained Model
+
+Upon completion of the training process, you can refer to our [evaluation guide](/eval) to assess model capabilities.
+
+## LoRA Configuration
+
+NeMo RL supports LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning, including Nano‑v3 models. LoRA reduces trainable parameters by using low-rank matrices for weight updates while keeping the base model frozen.
+
+Notes:
+- LoRA is supported with the DTensor v2 and Megatron backends; the DTensor backend is used by default. DTensor v1 does not support LoRA (ensure `policy.dtensor_cfg._v2=true` when using DTensor).
+- Triton kernels are only used in the DTensor v2 path. For `tensor_parallel_size > 1`, Automodel currently does not support Triton kernels (see the note below).
+
+### DTensor Configuration Parameters
+
+The LoRA configuration is specified under the `policy.dtensor_cfg.lora_cfg` section:
+
+```yaml
+policy:
+ dtensor_cfg:
+ lora_cfg:
+ enabled: False # Set to True to enable LoRA fine-tuning
+ target_modules: [] # List of module names to apply LoRA
+ exclude_modules: [] # List of module names to exclude from LoRA
+ match_all_linear: true # Apply LoRA to all linear layers
+ dim: 8 # LoRA rank (r): controls adaptation capacity
+ alpha: 32 # LoRA scaling factor (effective lr = alpha/dim)
+ dropout: 0.0 # Dropout probability for LoRA layers
+ dropout_position: "post" # Dropout position: "pre" or "post"
+ lora_A_init: "xavier" # Initialization method: "xavier" or "uniform"
+ use_triton: true # Use Triton-optimized kernels (DTensor v2 path)
+```
+
+### DTensor (Automodel) Parameter Details
+- **`enabled`** (bool): Whether to enable LoRA training
+- **`target_modules`** (list): Specific module names to apply LoRA. Empty with `match_all_linear=true` applies to all linear layers
+- **`exclude_modules`** (list): Module names to exclude from LoRA
+- **`match_all_linear`** (bool): When `true`, applies LoRA to all linear layers (overrides `target_modules`)
+- **`dim`** (int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64
+- **`alpha`** (int): LoRA scaling factor. Effective learning rate multiplier = `alpha/dim`. Typical: 16, 32, 64
+- **`dropout`** (float): Dropout probability for regularization
+- **`dropout_position`** (str): Apply dropout before ("pre") or after ("post") LoRA
+- **`lora_A_init`** (str): Initialization method for LoRA A matrix
+- **`use_triton`** (bool): Use Triton-optimized kernels for better performance. Used for DTensor v2 only. **Note**: [Automodel does not support Triton for TP > 1](https://github.com/NVIDIA-NeMo/Automodel/blob/b2db55eee98dfe81a8bfe5e23ac4e57afd8ab261/nemo_automodel/recipes/llm/train_ft.py#L199). Set to `false` when `tensor_parallel_size > 1` to avoid compatibility issues
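+
+The `dim`/`alpha` trade-off above can be made concrete with a back-of-the-envelope calculation (illustrative only, not tied to any NeMo RL API):
+
+```python
+# Back-of-the-envelope LoRA accounting for a single linear layer with the
+# example settings above (dim=8, alpha=32).
+def lora_stats(d_in: int, d_out: int, dim: int = 8, alpha: int = 32):
+    full = d_in * d_out              # parameters of the frozen base weight
+    lora = dim * (d_in + d_out)      # A: (d_in x dim) plus B: (dim x d_out)
+    return lora, lora / full, alpha / dim
+
+trainable, ratio, scaling = lora_stats(4096, 4096)
+print(trainable, round(ratio, 4), scaling)  # -> 65536 0.0039 4.0
+```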
+
+### DTensor Example Usage
+
+```bash
+uv run examples/run_sft.py policy.dtensor_cfg.lora_cfg.enabled=true
+```
+For the Nano‑v3 SFT LoRA recipe, see [sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml](/../../examples/configs/recipes/llm/sft-nanov3-30BA3B-2n8g-fsdp2-lora.yaml).
+
+### Megatron Configuration Parameters
+
+The LoRA configuration is specified under the `policy.megatron_cfg.peft` section:
+
+```yaml
+policy:
+ megatron_cfg:
+ peft:
+ enabled: false # Set to True to enable LoRA fine-tuning
+ target_modules: [] # List of module names to apply LoRA, defaults to all linear layers
+      exclude_modules: [] # List of module names to exclude from LoRA
+ dim: 32 # LoRA rank (r): controls adaptation capacity
+ alpha: 32 # LoRA scaling factor (effective lr = alpha/dim)
+ dropout: 0.0 # Dropout probability for LoRA layers
+ dropout_position: "pre" # Dropout position: "pre" or "post"
+ lora_A_init_method: "xavier" # Initialization method for lora A: "xavier" or "uniform"
+ lora_B_init_method: "zero" # Initialization method for lora B: "zero"
+ a2a_experimental: false # Enables the experimental All-to-All (A2A) communication strategy.
+      lora_dtype: null # dtype for LoRA weights (defaults to the original linear layer's dtype)
+```
+
+### Megatron Parameter Details
+- **`enabled`** (bool): Whether to enable LoRA training
+- **`target_modules`** (list): Specific module names to apply LoRA. Defaults to all linear layers if the list is left empty. Example: ['linear_qkv', 'linear_proj', 'linear_fc1', 'linear_fc2'].
+ - 'linear_qkv': Apply LoRA to the fused linear layer used for query, key, and value projections in self-attention.
+ - 'linear_proj': Apply LoRA to the linear layer used for projecting the output of self-attention.
+ - 'linear_fc1': Apply LoRA to the first fully-connected layer in MLP.
+ - 'linear_fc2': Apply LoRA to the second fully-connected layer in MLP.
+ Target modules can also contain wildcards. For example, you can specify target_modules=['*.layers.0.*.linear_qkv', '*.layers.1.*.linear_qkv'] to add LoRA to only linear_qkv on the first two layers.
+- **`exclude_modules`** (List[str], optional): A list of module names to exclude from LoRA. LoRA is applied to all `nn.Linear` and `nn.Linear`-adjacent modules whose names do not match any string in `exclude_modules`. If used, `target_modules` must be an empty list or `None`.
+- **`dim`** (int): LoRA rank (r). Lower values = fewer parameters but less capacity. Typical: 4, 8, 16, 32, 64
+- **`alpha`** (int): LoRA scaling factor. Effective learning rate multiplier = `alpha/dim`. Typical: 16, 32, 64
+- **`dropout`** (float): Dropout probability for regularization, defaults to 0.0
+- **`dropout_position`** (str): Apply dropout before ("pre") or after ("post") LoRA
+- **`lora_A_init_method`** (str): Initialization method for the low-rank matrix A (choices: `xavier`, `uniform`). Defaults to `xavier`.
+- **`lora_B_init_method`** (str): Initialization method for the low-rank matrix B. Defaults to `zero`.
+- **`a2a_experimental`** (bool): Enables the experimental All-to-All (A2A) communication strategy. Defaults to False.
+- **`lora_dtype`** (torch.dtype): dtype for the LoRA weights. By default, the original linear layer's dtype is used, but it must be specified explicitly if the base weights are quantized (e.g., 4-bit).
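+
+Wildcard matching of the kind described under `target_modules` can be sketched with `fnmatch`-style patterns (an illustrative assumption; Megatron's matcher may differ in detail):
+
+```python
+# Illustrative sketch: select only linear_qkv modules on the first two layers.
+from fnmatch import fnmatch
+
+patterns = ["*.layers.0.*.linear_qkv", "*.layers.1.*.linear_qkv"]
+module_names = [
+    "decoder.layers.0.self_attention.linear_qkv",
+    "decoder.layers.1.self_attention.linear_qkv",
+    "decoder.layers.2.self_attention.linear_qkv",
+]
+matched = [n for n in module_names if any(fnmatch(n, p) for p in patterns)]
+print(len(matched))  # -> 2 (layer 2 is excluded)
+```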
+
+### Megatron Example Usage
+The config uses DTensor by default, so the Megatron backend needs to be explicitly enabled.
+```sh
+uv run examples/run_sft.py \
+ --config examples/configs/sft.yaml \
+ policy.dtensor_cfg.enabled=false \
+ policy.megatron_cfg.enabled=true \
+ policy.megatron_cfg.peft.enabled=true
+```
+
+For more details on LoRA, see [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685).
diff --git a/fern/v0.5.0/pages/guides/use-custom-vllm.mdx b/fern/v0.5.0/pages/guides/use-custom-vllm.mdx
new file mode 100644
index 0000000000..f1db5b88cc
--- /dev/null
+++ b/fern/v0.5.0/pages/guides/use-custom-vllm.mdx
@@ -0,0 +1,159 @@
+---
+title: Experiment with Custom vLLM
+description: ""
+---
+
+This guide explains how to use your own version of vLLM while leveraging a pre-compiled vLLM wheel, so you don't have to recompile the C++ source code.
+
+## Clone and Build Your Custom vLLM
+
+Clone your vLLM fork and build it using the provided script. For example:
+
+```sh
+# Usage: bash tools/build-custom-vllm.sh
+bash tools/build-custom-vllm.sh https://github.com/terrykong/vllm.git terryk/demo-custom-vllm https://wheels.vllm.ai/862f2ef893d9751db0a92bd2d4ae0e3d9677872f/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
+
+# [INFO] pyproject.toml updated. NeMo RL is now configured to use the local vLLM at 3rdparty/vllm.
+# [INFO] Verify this new vllm version by running:
+#
+# VLLM_PRECOMPILED_WHEEL_LOCATION=http://.....whl \
+# uv run --extra vllm vllm serve Qwen/Qwen3-0.6B
+#
+# [INFO] For more information on this custom install, visit https://github.com/NVIDIA-NeMo/RL/blob/main/docs/guides/use-custom-vllm.md
+# [IMPORTANT] Remember to set the shell variable 'VLLM_PRECOMPILED_WHEEL_LOCATION' when running NeMo RL apps with this custom vLLM to avoid re-compiling.
+```
+
+This script does the following:
+1. Clones the `vllm` you specify at a particular branch.
+2. Builds `vllm`.
+3. Updates NeMo RL's pyproject.toml to work with this `vllm`.
+4. Updates `uv.lock`.
+
+Make sure to add the updated `pyproject.toml` and `uv.lock` to version control so that your branch can be reproduced by others.
+
+## Verify Your Custom vLLM in Isolation
+Test your setup to ensure your custom vLLM is being used:
+```sh
+uv run --extra vllm python -c 'import vllm; print(f"Successfully imported vLLM version: {vllm.__version__}")'
+# Uninstalled 1 package in 1ms
+# Installed 1 package in 2ms
+# Hi! If you see this, you're using a custom version of vLLM for the purposes of this tutorial
+# INFO 06-18 09:22:44 [__init__.py:244] Automatically detected platform cuda.
+# Successfully imported vLLM version: 0.0.1.dev1+g69d5add74.d20250910
+```
+
+If you don't see the log message `Hi! If you see this...`, it's because this message is unique to the tutorial's specific `vLLM` fork. It was added in [this commit](https://github.com/terrykong/vllm/commit/69d5add744e51b988e985736f35c162d3e87b683) and doesn't exist in the main `vLLM` project.
+
+## Running NeMo RL Apps with Custom vLLM
+
+To ensure the custom vLLM install is set up properly in NeMo RL applications, always run the following first:
+
+```sh
+# Ensures vLLM uses the precompiled wheel and avoids recompiling C++ sources
+export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/862f2ef893d9751db0a92bd2d4ae0e3d9677872f/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
+# Ensures worker venvs are rebuilt to use the custom vLLM. Otherwise, a cached venv may still contain the old version
+export NRL_FORCE_REBUILD_VENVS=true
+# This isn't necessary if you only do `uv run foobar.py`, but may be needed if you are switching between optional extras, e.g., `uv run --extra vllm foobar.py`. If you are unsure whether you need this, it's safer to include it.
+uv pip install setuptools_scm
+```
+
+Then run your application:
+```sh
+uv run examples/run_grpo.py
+```
+
+## Re-building the NeMo RL Docker Image
+
+Using a custom vLLM may require you to rebuild the Docker image. The two most common reasons are:
+
+1. The `ray` version changed, so you **must** rebuild the image so that `ray.sub` starts the Ray cluster with the same version as the application.
+2. Many dependencies changed, which adds a large overhead when `NRL_FORCE_REBUILD_VENVS=true` rebuilds the venvs, so you want to cache the dependencies in the image to avoid re-building or re-pulling wheels.
+
+For convenience, you can have the image build your custom vLLM by running the same script inside the Docker build.
+Pass `--build-arg BUILD_CUSTOM_VLLM=1` to enable this path; the build will create `3rdparty/vllm` and source `3rdparty/vllm/nemo-rl.env` automatically.
+
+```sh
+docker buildx build \
+ --build-arg BUILD_CUSTOM_VLLM=1 \
+ --target release \
+ --build-context nemo-rl=. \
+ -f docker/Dockerfile \
+ --tag /nemo-rl:latest \
+ --push \
+ .
+```
+
+### SSH Setup for Private Repositories
+
+If your custom vLLM is hosted in a **private repository** (e.g., internal GitLab), you need to set up SSH agent forwarding for Docker to clone it during the build.
+
+#### Prerequisites
+1. Your SSH key must be registered on the Git server (GitLab/GitHub)
+2. The key must **not be expired** - check your Git server's SSH key settings
+3. The key must be loaded into your local ssh-agent
+
+#### Step 1: Verify your SSH key works
+
+```sh
+# For GitLab (adjust host/port as needed)
+ssh -T git@gitlab.example.com -p 12051
+
+# You should see: "Welcome to GitLab, @username!"
+# If you see "Your SSH key has expired", renew it on the server
+```
+
+#### Step 2: Load your SSH key into the agent
+
+```sh
+# Check if an ssh-agent is already running
+echo $SSH_AUTH_SOCK
+
+# If empty, start one (this also sets SSH_AUTH_SOCK which `docker buildx` expects to be set when using `--ssh default`)
+eval "$(ssh-agent -s)"
+
+# Clear any old/expired keys from the agent
+ssh-add -D
+
+# Add your SSH key (use the key registered on your Git server)
+ssh-add ~/.ssh/id_ed25519
+
+# Verify it's loaded
+ssh-add -l
+```
+
+#### Step 3: Run the Docker build with SSH forwarding
+
+```sh
+docker buildx build \
+ --build-arg BUILD_CUSTOM_VLLM=1 \
+ --target release \
+ --build-context nemo-rl=. \
+ -f docker/Dockerfile \
+ --ssh default \
+ --tag /nemo-rl:latest \
+ --push \
+ .
+```
+
+## Running Applications with a Custom vLLM Container
+
+When using a container built with custom vLLM, **use the frozen environment workflow** (bare `python`) instead of `uv run` with `NRL_FORCE_REBUILD_VENVS=true`.
+
+```sh
+# Recommended: use bare python (frozen environment)
+python examples/run_grpo.py
+
+# NOT recommended with custom vLLM containers:
+# uv run examples/run_grpo.py
+# or
+# NRL_FORCE_REBUILD_VENVS=true uv run examples/run_grpo.py
+```
+
+### Why Not Use `uv run` or Rebuild Venvs?
+
+Rebuilding worker virtual environments (via `uv run` or `NRL_FORCE_REBUILD_VENVS=true`) requires having the custom vLLM compiled locally. However, compiling vLLM requires a container environment with the correct CUDA toolchain—creating a chicken-and-egg problem.
+
+The container already has vLLM built and cached in the frozen environments. Using bare `python` leverages these pre-built environments directly, avoiding the need to recompile vLLM at runtime.
+
+> [!TIP]
+> For more details on frozen environments and how they differ from `uv run`, see the [Dependency Management](/../design-docs/dependency-management#frozen-environments) documentation.
diff --git a/fern/v0.5.0/pages/index.mdx b/fern/v0.5.0/pages/index.mdx
new file mode 100644
index 0000000000..33284cf497
--- /dev/null
+++ b/fern/v0.5.0/pages/index.mdx
@@ -0,0 +1,146 @@
+---
+title: NeMo RL Documentation
+description: ""
+---
+
+Welcome to the NeMo RL documentation. NeMo RL is an open-source post-training library developed by NVIDIA, designed to streamline and scale reinforcement learning methods for multimodal models (LLMs, VLMs, etc.).
+
+This documentation provides comprehensive guides, examples, and references to help you get started with NeMo RL and build powerful post-training pipelines for your models.
+
+## Getting Started
+
+
+
+
+
+Learn about NeMo RL's architecture, design philosophy, and key features that make it ideal for scalable reinforcement learning.
+
+
+
+
+
+Get up and running quickly with examples for both DTensor and Megatron Core training backends.
+
+
+
+
+
+Step-by-step instructions for installing NeMo RL, including prerequisites, system dependencies, and environment setup.
+
+
+
+
+
+Explore the current features and upcoming enhancements in NeMo RL, including distributed training, advanced parallelism, and more.
+
+
+
+
+
+Troubleshooting common issues including missing submodules, Ray dashboard access, and debugging techniques.
+
+
+
+
+
+## Training and Generation
+
+
+
+
+
+Learn about DTensor and Megatron Core training backends, their capabilities, and how to choose the right one for your use case.
+
+
+
+
+
+Discover supported algorithms including GRPO, SFT, DPO, RM, and on-policy distillation with detailed guides and examples.
+
+
+
+
+
+Learn how to evaluate your models using built-in evaluation datasets and custom evaluation pipelines.
+
+
+
+
+
+Configure and deploy NeMo RL on multi-node Slurm or Kubernetes clusters for distributed computing.
+
+
+
+
+
+## Guides and Examples
+
+
+
+
+
+Reproduce DeepscaleR results with NeMo RL using GRPO on mathematical reasoning tasks.
+
+
+
+
+
+Step-by-step guide for supervised fine-tuning on the OpenMathInstruct2 dataset.
+
+
+
+
+
+Create custom reward environments and integrate them with NeMo RL training pipelines.
+
+
+
+
+
+Learn how to add support for new model architectures in NeMo RL.
+
+
+
+
+
+## Advanced Topics
+
+
+
+
+
+Deep dive into NeMo RL's architecture, APIs, and design decisions for scalable RL.
+
+
+
+
+
+Tools and techniques for debugging distributed Ray applications and RL training runs.
+
+
+
+
+
+Optimize large language models with FP8 quantization for faster training and inference.
+
+
+
+
+
+Build and use Docker containers for reproducible NeMo RL environments.
+
+
+
+
+
+## API Reference
+
+
+
+
+
+Comprehensive reference for all NeMo RL modules, classes, functions, and methods. Browse the complete Python API with detailed docstrings and usage examples.
+
+
+
+
diff --git a/fern/v0.5.0/pages/local-workstation.mdx b/fern/v0.5.0/pages/local-workstation.mdx
new file mode 100644
index 0000000000..5384ba5fe5
--- /dev/null
+++ b/fern/v0.5.0/pages/local-workstation.mdx
@@ -0,0 +1,38 @@
+---
+title: Run on Your Local Workstation
+description: ""
+---
+
+When launching examples locally with `uv`, `init_ray()` will first attempt to connect to an existing cluster. If none is found, it will start a local one and connect to it using all available GPU and CPU resources on your node.
+
+To launch a job outside of a container, simply run:
+
+```sh
+uv run examples/run_grpo.py
+```
+
+In the logs, you will see that Ray has started a local cluster instance, along with details on the resources made available to it:
+```
+2025-03-17 13:37:45,360 INFO worker.py:1841 -- Started a local Ray instance.
+...
+INFO:nemo_rl.distributed.virtual_cluster:Started local cluster with: {'node:__internal_head__': 1.0, 'CPU': 24.0, 'object_store_memory': 80448493977.0, 'accelerator_type:RTX': 1.0, 'memory': 177713152615.0, 'GPU': 1.0, 'node:10.0.0.1': 1.0}
+```
+
+To have more precise control over the GPUs Ray uses locally, please use `CUDA_VISIBLE_DEVICES`:
+
+```sh
+# Use the 0th and 3rd indexed GPU (for a total of 2 GPUs)
+CUDA_VISIBLE_DEVICES=0,3 uv run examples/run_grpo.py
+```
+
+We also allow multiple colocated local clusters, which are uniquely identified by the values in
+`CUDA_VISIBLE_DEVICES`. Concretely:
+
+```sh
+# (1) Start a fresh cluster on GPU=0
+CUDA_VISIBLE_DEVICES=0 uv run examples/run_grpo.py
+
+# (2) While (1) is running, this will start a new cluster using GPUs 1 and 2 without interfering with (1)
+# Ensure that the CUDA_VISIBLE_DEVICES do not overlap already running jobs.
+CUDA_VISIBLE_DEVICES=1,2 uv run examples/run_grpo.py
+```
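+
+To see why distinct `CUDA_VISIBLE_DEVICES` values identify distinct local clusters, consider this illustrative sketch; `cluster_key` is a hypothetical helper, not NeMo RL's actual logic:
+
+```python
+import os
+
+def cluster_key(env=os.environ):
+    """Derive a stable cluster identifier from the visible-device list."""
+    devices = env.get("CUDA_VISIBLE_DEVICES", "")
+    ids = sorted(d.strip() for d in devices.split(",") if d.strip())
+    return "local-cluster-" + "-".join(ids)
+
+# Jobs launched with non-overlapping device lists get different keys,
+# so they can coexist without stepping on each other's GPUs.
+print(cluster_key({"CUDA_VISIBLE_DEVICES": "1,2"}))  # local-cluster-1-2
+```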
diff --git a/fern/v0.5.0/pages/model-quirks.mdx b/fern/v0.5.0/pages/model-quirks.mdx
new file mode 100644
index 0000000000..07f7774548
--- /dev/null
+++ b/fern/v0.5.0/pages/model-quirks.mdx
@@ -0,0 +1,52 @@
+---
+title: Model Quirks
+description: ""
+---
+
+This document outlines special cases and model-specific behaviors that require custom handling in NeMo RL. These special cases are controlled by the `ModelFlag` enum.
+
+## Gemma-3
+
+### vLLM Initialization
+
+Gemma-3 models have a specific issue with vLLM dummy weight initialization due to a vLLM bug where [a `normalizer` buffer is created](https://github.com/vllm-project/vllm/blob/964472b9667508b1d4a7ed92068ff81740ae0036/vllm/model_executor/models/gemma3.py#L372) that is not present in the Hugging Face model. This causes the `normalizer` buffer to be set to dummy weights at initialization and then never updated with the correct values during model refit. As a workaround for this issue, we do not use dummy weight initialization for vLLM with Gemma-3 models and instead use the `load_format="auto"` setting to load the full weights at initialization.
+
+**Special Handling:**
+- We automatically use `load_format="auto"` for Gemma-3 models when initializing vLLM.
+- This avoids issues with dummy weight initialization, where the dummy weights for this buffer would never get overwritten during refit.
+
+### vLLM V1 runtime
+
+NeMo-RL uses the vLLM V1 runtime for both synchronous and asynchronous inference. The V1 runtime provides improved performance and stability for inference.
+
+**Special Handling:**
+- Both sync and async inference modes use the V1 runtime by default.
+- Users can override to the V0 runtime by setting the environment variable `NRL_VLLM_USE_V1=0`.
+- **Important**: The async implementation always uses the V1 runtime. Users who need to use the V0 runtime must switch to synchronous inference by setting `policy.generation.vllm_cfg.async_engine=False`.
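+
+For example, to opt into the V0 runtime you would combine both settings. This is a sketch using the config keys named above:
+
+```sh
+# V0 requires synchronous inference, so disable the async engine as well
+export NRL_VLLM_USE_V1=0
+uv run examples/run_grpo.py policy.generation.vllm_cfg.async_engine=False
+```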
+
+### Context Parallel with FSDP2
+
+- NeMo-RL implemented this feature based on the torch CP [implementation](https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/experimental/_attention.py), so we inherit its limitations.
+Whether a model supports CP depends on the arguments passed to `torch.nn.functional.scaled_dot_product_attention`. NeMo-RL currently passes an all-ones attention mask to `model.forward`. Gemma-3 does not ignore this attention mask, so the resulting `attn_bias` is not `None`, which is not supported by torch CP. See this [assertion](https://github.com/pytorch/pytorch/blob/134179474539648ba7dee1317959529fbd0e7f89/torch/distributed/tensor/experimental/_attention.py#L262).
+
+- Context parallel can't be used together with sequence packing. Sequence packing requires `attn_implementation="flash_attention_2"`, which conflicts with context parallel's requirement for the SDPA implementation. Refer to [here](https://github.com/huggingface/transformers/blob/bda75b4011239d065de84aa3e744b67ebfa7b245/src/transformers/modeling_utils.py#L2317) for more details.
+
+- It's a known issue that context parallel can't be used together with sequence parallel.
+Refer to [here](https://github.com/NVIDIA-NeMo/RL/issues/659) for more details.
+
+## DeepScaleR Recipe Convergence Issues
+
+The DeepScaleR recipe (e.g., `examples/configs/grpo-deepscaler-1.5b-8K.yaml`) has been found to experience convergence issues when CUDA graphs are enabled in vLLM.
+
+**Special Handling:**
+- CUDA graphs must be disabled by setting `enforce_eager: True` in the vLLM configuration (https://github.com/NVIDIA-NeMo/RL/pull/857 forces eager execution by default).
+
+## vLLM Async Rollout Timeout
+
+vLLM async generation has a configurable timeout for waiting for individual sample results. This is particularly important for longer sequences on large models.
+
+```bash
+export NRL_VLLM_ASYNC_TIMEOUT_SECONDS=1800 # Default: 600 (10 minutes)
+```
+
+If you encounter timeout errors, the system will suggest doubling the current timeout value.
diff --git a/fern/v0.5.0/pages/nsys-profiling.mdx b/fern/v0.5.0/pages/nsys-profiling.mdx
new file mode 100644
index 0000000000..c6f88c7c9b
--- /dev/null
+++ b/fern/v0.5.0/pages/nsys-profiling.mdx
@@ -0,0 +1,147 @@
+---
+title: Profile GPU with Nsys
+description: ""
+---
+
+NeMo RL supports Nsight profiling for Ray workers through environment variable pattern matching. This allows you to selectively profile specific worker types without modifying code or affecting the performance of workers that don't need profiling.
+
+**Note**: To prevent profile files from becoming too large, consider limiting profiling to a smaller number of steps (e.g., 10 steps).
+
+## Prerequisites
+
+* Install NVIDIA Nsight Systems (`nsys`) on the compute nodes where workers will run. For Ubuntu installation instructions, see the [NVIDIA Nsight Systems Installation Guide](https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html#package-manager-installation).
+
+**Note: If you're using NeMo RL containers, `nsys` is already installed.**
+
+* Ensure the workers you want to profile have GPU access
+
+## Configure the Environment Variables
+
+Set the `NRL_NSYS_WORKER_PATTERNS` environment variable with a comma-separated list of patterns to match worker names:
+
+```bash
+export NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*"
+```
+
+Set the `NRL_NSYS_PROFILE_STEP_RANGE` environment variable to control which training steps the profiler captures. Its
+format is colon separated integers representing `start:stop`, where `start` is inclusive and `stop` is exclusive
+(same as slice syntax `arr[start:stop]`). Note that `start` is 1-indexed, so `NRL_NSYS_PROFILE_STEP_RANGE=0:10` would error.
+
+```bash
+export NRL_NSYS_PROFILE_STEP_RANGE=3:5
+```
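+
+A minimal sketch of these `start:stop` semantics (hypothetical parser, not the one NeMo RL ships):
+
+```python
+def parse_step_range(spec: str) -> range:
+    """Parse 'start:stop': start is inclusive and 1-indexed, stop is exclusive."""
+    start_str, stop_str = spec.split(":")
+    start, stop = int(start_str), int(stop_str)
+    if start < 1:
+        raise ValueError("start is 1-indexed; use 1:N, not 0:N")
+    if stop <= start:
+        raise ValueError("stop must be greater than start")
+    return range(start, stop)
+
+# "3:5" profiles steps 3 and 4, just like arr[3:5]
+print(list(parse_step_range("3:5")))  # [3, 4]
+```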
+
+### Pattern Format
+
+- Use shell-style wildcards (`*`, `?`, `[seq]`, `[!seq]`)
+- Patterns are matched against worker names using `fnmatch`
+- Multiple patterns are separated by commas
+- Whitespace around patterns is automatically stripped
+- Empty patterns are ignored
+
+### Supported Workers
+
+The supported worker types are:
+- **DTensorPolicyWorker**: Pattern matched against `"dtensor_policy_worker"`
+- **VllmGenerationWorker**: Pattern matched against `"vllm_generation_worker"`
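+
+Pattern matching follows Python's `fnmatch` semantics. A sketch of how the comma-separated pattern list could be applied to the worker names above (hypothetical helper, not NeMo RL's actual code):
+
+```python
+from fnmatch import fnmatch
+
+def should_profile(worker_name: str, patterns_env: str) -> bool:
+    """Return True if worker_name matches any comma-separated pattern."""
+    patterns = [p.strip() for p in patterns_env.split(",") if p.strip()]
+    return any(fnmatch(worker_name, p) for p in patterns)
+
+# "*policy*" matches the DTensor policy worker but not the vLLM worker
+print(should_profile("dtensor_policy_worker", "*policy*"))  # True
+```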
+
+## Example Usage
+
+### Profile Only Policy Workers
+```bash
+NRL_NSYS_PROFILE_STEP_RANGE=2:3 NRL_NSYS_WORKER_PATTERNS="*policy*" uv run examples/run_grpo.py grpo.max_num_steps=5
+```
+
+### Profile Multiple Worker Types
+
+```bash
+NRL_NSYS_PROFILE_STEP_RANGE=1:2 NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" uv run examples/run_grpo.py grpo.max_num_steps=5
+```
+
+### Profile Workers with Exact Names
+
+```bash
+NRL_NSYS_PROFILE_STEP_RANGE=3:10 NRL_NSYS_WORKER_PATTERNS="dtensor_policy_worker,vllm_generation_worker" uv run examples/run_grpo.py grpo.max_num_steps=5
+```
+
+### Profile Megatron Workers
+
+> [!IMPORTANT]
+> To profile a Megatron worker, you should set `LD_LIBRARY_PATH` as follows, otherwise you will get errors when loading `libtransformer_engine.so`.
+
+```bash
+LD_LIBRARY_PATH="/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu" \
+NRL_NSYS_PROFILE_STEP_RANGE=2:3 NRL_NSYS_WORKER_PATTERNS="megatron_policy_worker,vllm_generation_worker" uv run examples/run_grpo.py --config examples/configs/grpo_math_1B_megatron.yaml grpo.max_num_steps=5
+```
+
+## Profile Output
+
+When profiling is enabled, it generates the following logs and files:
+
+1. **Logging**: You'll see log messages indicating which workers have profiling enabled:
+ ```
+ Nsight profiling enabled for worker 'dtensor_policy_worker' (matched pattern '*policy*')
+ ```
+
+2. **Profile Files**: Each profiled worker generates a `.nsys-rep` file with naming pattern:
+ ```
+ dtensor_policy_worker__.nsys-rep
+ vllm_generation_worker__.nsys-rep
+ worker_process_.nsys-rep
+ ```
+If you are not using model parallelism in vLLM, refer directly to `vllm_generation_worker__.nsys-rep` for the Nsight reports. If you are using model parallelism, `vllm_generation_worker__.nsys-rep` will be empty, and the `worker_process_.nsys-rep` files are the Nsight profiles from vLLM's Ray distributed executors (refer to https://github.com/vllm-project/vllm/blob/7e3a8dc90670fd312ce1e0d4eba9bf11c571e3ad/vllm/executor/ray_distributed_executor.py#L136 for more information).
+
+3. **File Location**: Profile files are saved in `/tmp/ray/session*/logs/nsight/` directory on each worker node. Ensure you check both `ls /tmp/ray/session_[0-9]*/logs/nsight` and `ls /tmp/ray/session_latest/logs/nsight` for the profiles, since the "latest" pointer may be stale.
+
+**Note for SLURM users with `ray.sub`**: When using `ray.sub` on SLURM, set `RAY_LOG_SYNC_FREQUENCY=$NUM_SEC` (e.g., `RAY_LOG_SYNC_FREQUENCY=30`) to ensure that the Nsight profile files get copied from the container's ephemeral filesystem (`/tmp/ray`) to the persistent directory. The head node's files will be synced to `$SLURM_JOB_ID-logs/ray`, and other nodes' files will be synced to `$SLURM_JOB_ID-logs/ray/$node_ip/`, where `$node_ip` is the IP address of the node.
+
+## Analyze Profile Files
+
+To analyze the generated profile files, load the `.nsys-rep` files into the NVIDIA Nsight Systems desktop application, which you can download from the [NVIDIA Nsight Systems Get Started page](https://developer.nvidia.com/nsight-systems/get-started).
+
+### How to Analyze the End-to-End RL Loop All at Once
+
+Nsight Systems supports [multi-report view](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#viewing-multiple-reports-in-the-same-timeline) functionality. If you open the profiles from different workers (e.g., `*policy_worker*.nsys-rep` and `*generation_worker*.nsys-rep`) in a single multi-report view, you can analyze the behavior of the end-to-end RL loop on the same timeline.
+
+
+
+## How We Patched Nsight Support in Ray
+
+Ray's Nsight profiling support had a bug where it hardcoded the Python executable path instead of using the actual Python executable from the runtime environment. This caused issues when using virtual environments or custom Python installations (`py_executables`).
+
+### The Problem
+
+In Ray's `nsight.py` file, the original code was:
+
+```python
+context.py_executable = " ".join(self.nsight_cmd) + " python"
+```
+
+This hardcoded `" python"` instead of correctly preserving the intended Python executable path.
+
+### The Fix
+
+To fix this problem, we patched the following line to preserve the original `context.py_executable`:
+
+```python
+context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"
+```
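+
+With illustrative values for `nsight_cmd` and `py_executable` (both hypothetical here), the difference between the two behaviors is easy to see:
+
+```python
+# Illustrative values; real ones come from Ray's runtime environment
+nsight_cmd = ["nsys", "profile", "-o", "worker"]
+py_executable = "/opt/venv/bin/python3"
+
+broken = " ".join(nsight_cmd) + " python"            # original Ray behavior
+fixed = " ".join(nsight_cmd) + f" {py_executable}"   # patched behavior
+
+print(broken)  # nsys profile -o worker python
+print(fixed)   # nsys profile -o worker /opt/venv/bin/python3
+```
+
+The broken form invokes whatever `python` resolves to on `PATH`, silently bypassing the venv interpreter that the worker was configured to use.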
+
+### Where We Applied the Patch
+
+We applied this patch in two locations to cover different deployment scenarios:
+
+1. **In `ray.sub` (SLURM clusters)**: The patch is applied before Ray's control plane starts up on both head and worker nodes:
+ ```bash
+ sed -i 's/context\.py_executable = " "\.join(self\.nsight_cmd) + " python"/context.py_executable = " ".join(self.nsight_cmd) + f" {context.py_executable}"/g' /opt/nemo_rl_venv/lib64/python*/site-packages/ray/_private/runtime_env/nsight.py
+ ```
+
+2. **In `nemo_rl/__init__.py` (Local clusters)**: The patch is applied automatically when NeMo RL is imported, making it work seamlessly for local development and testing environments.
+
+### Why We Needed Both Locations
+
+- **`ray.sub`**: Required for SLURM-managed clusters where Ray processes start in containers before Python imports happen. The patch must be applied at the filesystem level before Ray's control plane initializes.
+
+- **`__init__.py`**: Required for local clusters and development environments where users start Ray clusters directly. The patch is applied when `nemo_rl` is imported, ensuring the fix is in place before any Ray processes are spawned.
+
+This dual approach ensures that Nsight profiling works correctly regardless of how the Ray cluster is deployed.
diff --git a/fern/v0.5.0/pages/testing.mdx b/fern/v0.5.0/pages/testing.mdx
new file mode 100644
index 0000000000..60469e4a33
--- /dev/null
+++ b/fern/v0.5.0/pages/testing.mdx
@@ -0,0 +1,325 @@
+---
+title: Test NeMo RL
+description: ""
+---
+
+This guide outlines how to test NeMo RL using unit and functional tests, detailing steps for local or Docker-based execution, dependency setup, and metric tracking to ensure effective and reliable testing.
+
+## Unit Tests
+
+> [!IMPORTANT]
+> Unit tests require 2 GPUs to test the full suite.
+
+> [!TIP]
+> Some unit tests require setting up test assets which you can download with:
+> ```sh
+> uv run tests/unit/prepare_unit_test_assets.py
+> ```
+
+```sh
+# Run the unit tests using local GPUs
+
+# Configuration 1: Default tests only - excludes both hf_gated and mcore tests
+uv run --group test bash tests/run_unit.sh
+
+# Configuration 2: Default + HF gated tests, excluding mcore tests
+uv run --group test bash tests/run_unit.sh --hf-gated
+
+# Configuration 3: ONLY mcore tests, excluding ones with hf_gated
+uv run --extra mcore --group test bash tests/run_unit.sh --mcore-only
+
+# Configuration 4: ONLY mcore tests, including ones with hf_gated
+uv run --extra mcore --group test bash tests/run_unit.sh --mcore-only --hf-gated
+```
+
+### Experimental: Faster Local Test Iteration with pytest-testmon
+
+We support `pytest-testmon` to speed up local unit test runs by re-running only impacted tests. This works for both regular in-process code and out-of-process `@ray.remote` workers via a lightweight, test-only selection helper.
+
+Usage:
+```sh
+# Re-run only impacted unit tests
+uv run --group test pytest --testmon tests/unit
+
+# You can also combine with markers/paths
+uv run --group test pytest --hf-gated --testmon tests/unit/models/policy/test_dtensor_worker.py
+```
+
+What to expect:
+- On the first run in a fresh workspace, testmon may run a broader set (or deselect everything if nothing was executed yet) to build its dependency cache.
+- On subsequent runs, editing non-remote code narrows selection to only the tests that import/use those modules.
+- Editing code inside `@ray.remote` actors also retriggers impacted tests. We maintain a static mapping from test modules to transitive `nemo_rl` modules they import and intersect that with changed files when `--testmon` is present.
+- After a successful impacted run, a second `--testmon` invocation (with no further edits) will deselect all tests.
+- Running `pytest` with `-k some_substring_in_test_name` will always run tests that match even if `--testmon` is passed.
+
+Limitations and tips:
+- Selection is based on Python imports and file mtimes; non-Python assets (YAML/JSON/shell) are not tracked. When editing those, re-run target tests explicitly.
+- The remote-aware selection uses a conservative static import map (no dynamic import resolution). If a test loads code dynamically that isn’t visible via imports, you may need to run it explicitly once to seed the map.
+- The helper is test-only and does not alter library behavior. It activates automatically when you pass `--testmon`.
+
+### Refreshing Remote-Selection Artifacts
+If you change test layout or significantly refactor imports, the remote-selection artifacts may become stale.
+To rebuild them, delete the following files at the repo root and re-run with `--testmon` to seed again:
+
+```sh
+# At the root of nemo-rl
+rm .nrl_remote_map.json .nrl_remote_state.json
+```
+
+### Run Unit Tests in a Hermetic Environment
+
+For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`)
+or where environmental configuration may be problematic, tests can be run
+in Docker with this script:
+
+```sh
+CONTAINER=... bash tests/run_unit_in_docker.sh
+```
+
+The required `CONTAINER` can be built by following the instructions in the [Docker documentation](/docker).
+
+### Track Metrics in Unit Tests
+
+Unit tests may also log metrics to a fixture. The fixture is called `tracker` and has the following API:
+
+```python
+# Track an arbitrary metric (must be json serializable)
+tracker.track(metric_name, metric_value)
+# Log the maximum memory across the entire cluster. Okay for tests since they are run serially.
+tracker.log_max_mem(metric_name)
+# Returns the maximum memory. Useful if you are measuring changes in memory.
+tracker.get_max_mem()
+```
+
+Including the `tracker` fixture also tracks the elapsed time for the test implicitly.
+
+Here is an example test:
+
+```python
+def test_exponentiate(tracker):
+ starting_mem = tracker.get_max_mem()
+ base = 2
+ exponent = 4
+ result = base ** exponent
+ tracker.track("result", result)
+ tracker.log_max_mem("memory_after_exponentiating")
+ change_in_mem = tracker.get_max_mem() - starting_mem
+ tracker.track("change_in_mem", change_in_mem)
+ assert result == 16
+```
+
+Which would produce this file in `tests/unit/unit_results.json`:
+
+```json
+{
+ "exit_status": 0,
+ "git_commit": "f1062bd3fd95fc64443e2d9ee4a35fc654ba897e",
+ "start_time": "2025-03-24 23:34:12",
+ "metrics": {
+ "test_hf_ray_policy::test_lm_policy_generation": {
+ "avg_prob_mult_error": 1.0000039339065552,
+ "mean_lps": -1.5399343967437744,
+ "_elapsed": 17.323044061660767
+ }
+ },
+ "gpu_types": [
+ "NVIDIA H100 80GB HBM3"
+ ],
+ "coverage": 24.55897613282601
+}
+```
+
+> [!TIP]
+> Past unit test results are logged in `tests/unit/unit_results/`. These are helpful to view trends over time and commits.
+>
+> ```sh
+> jq -r '[.start_time, .git_commit, .metrics["test_hf_ray_policy::test_lm_policy_generation"].avg_prob_mult_error] | @tsv' tests/unit/unit_results/*
+>
+> # Example output:
+> #2025-03-24 23:35:39 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552
+> #2025-03-24 23:36:37 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552
+> #2025-03-24 23:37:37 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552
+> #2025-03-24 23:38:14 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552
+> #2025-03-24 23:38:50 778d288bb5d2edfd3eec4d07bb7dffffad5ef21b 1.0000039339065552
+> ```
+
+## Functional Tests
+
+> [!IMPORTANT]
+> Functional tests may require multiple GPUs to run. See each script to understand the requirements.
+
+Functional tests are located under `tests/functional/`.
+
+```sh
+# Run the functional test for sft
+uv run bash tests/functional/sft.sh
+```
+
+At the end of each functional test, the metric checks will be printed as well as whether they pass or fail. Here is an example:
+
+```text
+ Metric Checks
+┏━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
+┃ Status ┃ Check ┃ Value ┃ Message ┃
+┡━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
+│ PASS │ data["train/loss"]["9"] < 1500 │ 817.4517822265625 │ │
+└────────┴────────────────────────────────┴───────────────────┴─────────┘
+```
+
+### Run Functional Tests in a Hermetic Environment
+
+For environments lacking necessary dependencies (e.g., `gcc`, `nvcc`) or where environmental configuration may be problematic, tests can be run in Docker with this script:
+
+```sh
+CONTAINER=... bash tests/run_functional_in_docker.sh tests/functional/sft.sh
+```
+
+The required `CONTAINER` can be built by following the instructions in the [Docker documentation](/docker).
+
+## Bisecting Failing Tests
+
+> [!IMPORTANT]
+> Always rsync the `tools/` directory to `tools.bisect/` before starting a bisect:
+>
+> ```sh
+> rsync -ahP --delete tools/ tools.bisect/
+> ```
+>
+> This creates a stable copy of the bisect scripts that won't change as git checks out different commits during the bisect process. Without this, the scripts themselves may change mid-bisect, leading to inconsistent behavior or failures. All examples below reference `tools.bisect/` to ensure you use the stable copy.
+
+### Bisecting Unit/Functional Tests
+
+Use `tools.bisect/bisect-run.sh` to automatically run your test command across a commit range and find the first bad commit. It forces venv rebuilds so dependencies match each commit.
+
+Basic usage:
+
+```sh
+GOOD= BAD= \
+ tools.bisect/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py::test_case
+```
+
+Examples:
+
+```sh
+GOOD=56a6225 BAD=32faafa \
+ tools.bisect/bisect-run.sh uv run --group dev pre-commit run --all-files
+
+GOOD=464ed38 BAD=c843f1b \
+ tools.bisect/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
+```
+
+Notes:
+
+- Exit codes drive the classification: 0=good, non-zero=bad, 125=skip.
+- The script pre-verifies that `GOOD` is actually good by running your command on it.
+- On failure or interruption, it saves a timestamped `git bisect log` to `/bisect-logs/`. You can resume later with `BISECT_REPLAY_LOG` (see below).
+- Set `BISECT_NO_RESET=1` to keep the bisect state after the script exits.
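+
+This convention matches `git bisect run`. As a standalone sketch, the classification can be expressed as follows (`classify` is a hypothetical helper, shown only to illustrate the exit-code mapping):
+
+```shell
+classify() {
+  # Map a test command's exit status onto bisect semantics
+  case "$1" in
+    0)   echo good ;;   # test passed: commit is good
+    125) echo skip ;;   # commit cannot be tested (e.g., build failure)
+    *)   echo bad ;;    # any other non-zero status marks the commit bad
+  esac
+}
+
+classify 125   # prints "skip"
+```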
+
+Resume from a saved bisect log:
+
+```sh
+BISECT_REPLAY_LOG=/abs/path/to/bisect-2025....log \
+ tools.bisect/bisect-run.sh uv run --group test pytest tests/unit/test_foobar.py
+```
+
+### Bisecting Nightlies
+
+Nightly training scripts can be bisected using the same driver plus a helper that sets up hermetic runs on Slurm.
+
+Vanilla flow:
+
+```sh
+# Copy bisect utilities outside of VCS to ensure a stable runner
+rsync -ahP --delete tools/ tools.bisect/
+
+TEST_CASE=tests/test_suites/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v3.sh
+
+HF_HOME=... \
+HF_DATASETS_CACHE=... \
+CONTAINER=... \
+MOUNTS=... \
+ACCOUNT=... \
+PARTITION=... \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+ tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
+
+> [!NOTE]
+> The command `GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE")` selects the commit that introduced the test script. Because the path is typically added only once, this yields the introduction commit to use as the known good baseline.
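+
+The `--diff-filter=A` trick can be seen in isolation in a throwaway repository. The file name and commit messages below are stand-ins for the example; the point is that only the commit that added the path matches, not later commits that modified it:
+
+```sh
+# Demonstration in a scratch repo: --diff-filter=A matches only the adding commit.
+tmp=$(mktemp -d)
+cd "$tmp"
+git init -q .
+git config user.email demo@example.com
+git config user.name demo
+echo step1 > suite.sh && git add suite.sh && git commit -qm "add suite"
+echo step2 >> suite.sh && git commit -qam "tweak suite"
+# Even though two commits touched the path, only "add suite" is printed.
+git log --format="%s" --diff-filter=A -- suite.sh
+```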
+
+- `tools.bisect/launch-bisect-helper.sh` ensures each commit runs in a fresh venv, creates an isolated code snapshot per commit, blocks until metrics are checked, and returns a suitable exit code for bisect.
+
+Progressively more advanced cases:
+
+1) Adjusting the test case on the fly with `SED_CLAUSES`
+
+- If a test script needs small textual edits during the bisect (e.g., relaxing a threshold, or dropping a noisy metric that is irrelevant when bisecting convergence rather than performance), provide a sed script via `SED_CLAUSES`. This also works for runtime controls such as `MAX_STEPS`, `STEPS_PER_RUN`, or `NUM_MINUTES` when a performance regression slows runs down, so they still complete and emit metrics. The helper applies the sed script and automatically restores the test script after each run.
+
+```sh
+SED_CLAUSES=$(cat <<'SED'
+s#mean(data\["timing/train/total_step_time"\], -6, -1) < 0\.6#mean(data["timing/train/total_step_time"], -6, -1) < 0.63#
+/ray\/node\.0\.gpu\.0\.mem_gb/d
+SED
+) \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+ tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
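+
+Before launching a full bisect, it can be worth dry-running the sed script against the test case so a typo is caught early. A minimal, self-contained illustration of that idea; the file contents, path, and clauses below are stand-ins, not the real test script:
+
+```sh
+# Stand-in test script with one threshold line and one noisy-metric line.
+printf 'threshold=0.6\nnoisy_metric=1\nkeep_me\n' > /tmp/testcase-demo.sh
+SED_CLAUSES='s#threshold=0\.6#threshold=0.63#
+/noisy_metric/d'
+# Preview the edited script; the diff shows exactly what the bisect run would see.
+sed "$SED_CLAUSES" /tmp/testcase-demo.sh
+sed "$SED_CLAUSES" /tmp/testcase-demo.sh | diff -u /tmp/testcase-demo.sh - || true
+```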
+
+2) Passing extra script arguments
+
+- If the nightly script supports Hydra/CLI overrides, pass them via `EXTRA_SCRIPT_ARGS` so each run adopts those overrides (e.g., fix a transient incompatibility):
+
+> [!WARNING]
+> Changing script arguments can materially affect performance characteristics and/or convergence behavior. This may influence the validity of the bisect outcome relative to your baseline configuration. Prefer the smallest, clearly justified overrides, keep them consistent across all commits, and document them alongside your results so conclusions are interpreted correctly.
+
+
+```sh
+EXTRA_SCRIPT_ARGS="++data.num_workers=1" \
+GOOD=$(git log --format="%h" --diff-filter=A -- "$TEST_CASE") \
+BAD=HEAD \
+ tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
+
+3) Resuming from an earlier interrupted or misclassified session
+
+- Use `BISECT_REPLAY_LOG` with the bisect driver to replay prior markings and continue running. This is handy if a run failed for an unrelated reason or you manually edited a log to change `bad` → `skip` or to drop an incorrect line.
+
+```sh
+BISECT_REPLAY_LOG=/abs/path/to/bisect-logs/bisect-YYYYmmdd-HHMMSS-.log \
+HF_HOME=... HF_DATASETS_CACHE=... CONTAINER=... MOUNTS=... ACCOUNT=... PARTITION=... \
+ tools.bisect/bisect-run.sh tools.bisect/launch-bisect.sh "$TEST_CASE"
+```
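+
+Manually editing a saved log before replay can be sketched as follows. The log contents and hashes here are made-up examples, but the line format matches what `git bisect log` emits (and what the `git bisect skip` append shown later in this page uses):
+
+```sh
+# Demo log in git-bisect-log format (hashes are illustrative).
+LOG=/tmp/bisect-demo.log
+cat > "$LOG" <<'EOF'
+git bisect start
+git bisect good 56a6225
+git bisect bad c843f1b
+EOF
+# Reclassify c843f1b from "bad" to "skip" before replaying with BISECT_REPLAY_LOG.
+sed -i.bak 's/^git bisect bad c843f1b$/git bisect skip c843f1b/' "$LOG"
+cat "$LOG"
+```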
+
+Tips and conventions:
+
+- Exit code 125 means "skip this commit" in git bisect; our helper returns 125 if required environment variables are missing or if it needs to abort safely.
+- Submodules must be clean. The bisect script enforces `submodule.recurse=true` and `fetch.recurseSubmodules=on-demand` so submodules follow commit checkouts.
+- The bisect script automatically unshallows all submodules at the start to ensure any submodule commit can be checked out during the bisect process. This is important because bisecting may need to jump to arbitrary commits in submodule history.
+- Each commit uses a fresh code snapshot directory and a separate Megatron checkpoint dir to avoid cross-commit contamination.
+- On failure/interrupt, a timestamped bisect log is saved under `/bisect-logs/`. Use it with `BISECT_REPLAY_LOG` to resume.
+- In some unusual cases, the bisect may fail while updating a submodule because it references a commit that is orphaned or deleted. Git will typically print the commit hash it was unable to find (e.g., `fatal: remote error: upload-pack: not our ref `). If the commit is simply orphaned, you can try to manually fetch it:
+
+ ```sh
+ # Assuming Automodel is the submodule with the missing commit
+ cd 3rdparty/Automodel-workspace/Automodel/
+ git fetch origin $the_automodel_commit_that_it_could_not_find
+ ```
+
+ If the manual fetch fails, the commit has likely been deleted from the remote. In this case, skip the problematic commit:
+
+ ```sh
+ git bisect skip $the_nemorl_commit_that_has_the_broken_automodel_commit
+ ```
+
+ After skipping, add the skip command to your `BISECT_REPLAY_LOG` file (located in `/bisect-logs/`) so the bisect will continue from where it left off and skip that commit when you relaunch `tools.bisect/bisect-run.sh`:
+
+ ```sh
+ echo "git bisect skip $the_nemorl_commit_that_has_the_broken_automodel_commit" >> bisect-logs/bisect--.log
+ ```
diff --git a/fern/versions/v0.5.0.yml b/fern/versions/v0.5.0.yml
new file mode 100644
index 0000000000..ed1577dd9a
--- /dev/null
+++ b/fern/versions/v0.5.0.yml
@@ -0,0 +1,137 @@
+navigation:
+ - section: Home
+ contents:
+ - page: Welcome
+ path: ../v0.5.0/pages/index.mdx
+ - section: About
+ contents:
+ - page: Overview
+ path: ../v0.5.0/pages/about/overview.mdx
+ - page: Performance Summary
+ path: ../v0.5.0/pages/about/performance-summary.mdx
+ - page: Model Support
+ path: ../v0.5.0/pages/about/model-support.mdx
+ - page: Features
+ path: ../v0.5.0/pages/about/features.mdx
+ - page: Backends
+ path: ../v0.5.0/pages/about/backends.mdx
+ - page: Quick Start
+ path: ../v0.5.0/pages/about/quick-start.mdx
+ - page: Installation
+ path: ../v0.5.0/pages/about/installation.mdx
+ - section: Algorithms
+ contents:
+ - page: Index
+ path: ../v0.5.0/pages/about/algorithms/index.mdx
+ - page: SFT
+ path: ../v0.5.0/pages/about/algorithms/sft.mdx
+ - page: DPO
+ path: ../v0.5.0/pages/about/algorithms/dpo.mdx
+ - page: RM
+ path: ../v0.5.0/pages/about/algorithms/rm.mdx
+ - page: GRPO
+ path: ../v0.5.0/pages/about/algorithms/grpo.mdx
+ - page: DAPO
+ path: ../v0.5.0/pages/about/algorithms/dapo.mdx
+ - page: On-Policy Distillation
+ path: ../v0.5.0/pages/about/algorithms/on-policy-distillation.mdx
+ - page: Evaluation
+ path: ../v0.5.0/pages/about/evaluation.mdx
+ - page: Clusters
+ path: ../v0.5.0/pages/about/clusters.mdx
+ - page: Tips and Tricks
+ path: ../v0.5.0/pages/about/tips-and-tricks.mdx
+ - section: Environment Start
+ contents:
+ - page: Local Workstation
+ path: ../v0.5.0/pages/local-workstation.mdx
+ - page: Cluster
+ path: ../v0.5.0/pages/cluster.mdx
+ - section: E2E Examples
+ contents:
+ - page: SFT on OpenMathInstruct2
+ path: ../v0.5.0/pages/guides/sft-openmathinstruct2.mdx
+ - section: Guides
+ contents:
+ - page: Nemotron 3 Nano
+ path: ../v0.5.0/pages/guides/nemotron-3-nano.mdx
+ - page: Adding New Models
+ path: ../v0.5.0/pages/adding-new-models.mdx
+ - page: SFT
+ path: ../v0.5.0/pages/guides/sft.mdx
+ - page: DPO
+ path: ../v0.5.0/pages/guides/dpo.mdx
+ - page: DAPO
+ path: ../v0.5.0/pages/guides/dapo.mdx
+ - page: ProRLv2
+ path: ../v0.5.0/pages/guides/prorlv2.mdx
+ - page: GRPO
+ path: ../v0.5.0/pages/guides/grpo.mdx
+ - page: GRPO DeepscaleR
+ path: ../v0.5.0/pages/guides/grpo-deepscaler.mdx
+ - page: GRPO Sliding Puzzle
+ path: ../v0.5.0/pages/guides/grpo-sliding-puzzle.mdx
+ - page: RM
+ path: ../v0.5.0/pages/guides/rm.mdx
+ - page: Environments
+ path: ../v0.5.0/pages/guides/environments.mdx
+ - page: Eval
+ path: ../v0.5.0/pages/guides/eval.mdx
+ - page: Deepseek
+ path: ../v0.5.0/pages/guides/deepseek.mdx
+ - page: Model Quirks
+ path: ../v0.5.0/pages/model-quirks.mdx
+ - page: Async GRPO
+ path: ../v0.5.0/pages/guides/async-grpo.mdx
+ - page: DTensor TP Accuracy
+ path: ../v0.5.0/pages/guides/dtensor-tp-accuracy.mdx
+ - page: FT Launcher Guide
+ path: ../v0.5.0/pages/guides/ft-launcher-guide.mdx
+ - section: Containers
+ contents:
+ - page: Docker
+ path: ../v0.5.0/pages/docker.mdx
+ - section: Development
+ contents:
+ - page: Testing
+ path: ../v0.5.0/pages/testing.mdx
+ - page: Documentation
+ path: ../v0.5.0/pages/documentation.mdx
+ - page: Debugging
+ path: ../v0.5.0/pages/debugging.mdx
+ - page: NSys Profiling
+ path: ../v0.5.0/pages/nsys-profiling.mdx
+ - page: FP8
+ path: ../v0.5.0/pages/fp8.mdx
+ - page: Use Custom vLLM
+ path: ../v0.5.0/pages/guides/use-custom-vllm.mdx
+ - section: Design Docs
+ contents:
+ - page: Design and Philosophy
+ path: ../v0.5.0/pages/design-docs/design-and-philosophy.mdx
+ - page: Padding
+ path: ../v0.5.0/pages/design-docs/padding.mdx
+ - page: Logger
+ path: ../v0.5.0/pages/design-docs/logger.mdx
+ - page: UV
+ path: ../v0.5.0/pages/design-docs/uv.mdx
+ - page: Dependency Management
+ path: ../v0.5.0/pages/design-docs/dependency-management.mdx
+ - page: Chat Datasets
+ path: ../v0.5.0/pages/design-docs/chat-datasets.mdx
+ - page: Generation
+ path: ../v0.5.0/pages/design-docs/generation.mdx
+ - page: Checkpointing
+ path: ../v0.5.0/pages/design-docs/checkpointing.mdx
+ - page: Loss Functions
+ path: ../v0.5.0/pages/design-docs/loss-functions.mdx
+ - page: FSDP2 Parallel Plan
+ path: ../v0.5.0/pages/design-docs/fsdp2-parallel-plan.mdx
+ - page: Training Backends
+ path: ../v0.5.0/pages/design-docs/training-backends.mdx
+ - page: Sequence Packing and Dynamic Batching
+ path: ../v0.5.0/pages/design-docs/sequence-packing-and-dynamic-batching.mdx
+ - page: Env Vars
+ path: ../v0.5.0/pages/design-docs/env-vars.mdx
+ - page: NeMo Gym Integration
+ path: ../v0.5.0/pages/design-docs/nemo-gym-integration.mdx