docs/composition.md (new file, 163 additions)
# Signal Composition in Ludic

This document describes how different training signals (rewards, advantages, losses) can be composed in Ludic's RL training pipeline.

## The Training Pipeline

```
Environment ──► Rewards ──► CreditAssigner ──► Advantages ──► Loss
                   │                               │
                   ▼                               ▼
                Scorers                     CreditModifiers
               (Level 1)                       (Level 2)
```

There are **three composition levels**:

| Level | Name | Where | Implementation |
|-------|------|-------|----------------|
| **1** | Reward | Before credit assignment | Agent scorers |
| **2** | Advantage | After credit assignment, before loss | CreditModifier |
| **3** | Loss | Separate loss terms | CompositeLoss |

## Level 1: Reward Composition

Add signals to rewards via Agent scorers, before credit assignment.

```python
# Scorers attached to an Agent add their outputs to per-step rewards
agent = Agent(
    client=client,
    scorers=[intrinsic_reward_scorer],  # adds to step rewards
)
```

**Properties:**
- All signals go through the same credit assignment
- Signals interact (e.g., group normalization affects combined rewards)
- Tightest coupling between signals

**Use when:**
- Intrinsic rewards should be treated identically to environment rewards
- You want signals to interact during advantage estimation
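
The scorer interface itself isn't specified in this document, so the sketch below models a scorer as a plain function from step texts to per-step bonuses; the name, signature, and novelty heuristic are all illustrative, not Ludic's actual API:

```python
# Hypothetical Level-1 scorer: a small novelty bonus per step.
# (Illustrative only — Ludic's real scorer signature may differ.)
def intrinsic_reward_scorer(step_texts: list[str]) -> list[float]:
    """Reward steps that introduce tokens not seen in earlier steps."""
    seen: set[str] = set()
    bonuses = []
    for text in step_texts:
        tokens = set(text.split())
        bonuses.append(0.01 * len(tokens - seen))  # bonus per novel token
        seen |= tokens
    return bonuses

# Combined with environment rewards *before* credit assignment:
env_rewards = [0.0, 0.0, 1.0]
steps = ["think about x", "try y", "answer is z"]
total_rewards = [r + b for r, b in zip(env_rewards, intrinsic_reward_scorer(steps))]
```

Because the bonus is folded in at the reward level, downstream group normalization operates on `total_rewards`, not on the task reward alone.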

## Level 2: Advantage Modification

Modify advantages after credit assignment, before loss.

```python
# KL penalty added to advantages
kl_penalty = -kl_coeff * (actor_logps - teacher_logps)
advantages = task_advantages + kl_penalty
# Then normal policy gradient with combined advantages
```

**Properties:**
- Each signal can have its own credit assignment strategy
- All signals go through the same importance ratio
- All signals go through the same loss function

**Use when:**
- Different signals need different credit assignment (e.g., sparse task rewards vs dense KL)
- You want all signals to go through importance sampling together
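
Concretely, for a single sequence the modification amounts to broadcasting a sequence-level task advantage to every token and adding the per-token KL penalty. A toy sketch in plain Python (Ludic's `CreditModifier` operates on batched tensors; the numbers are made up):

```python
kl_coeff = 1.0

# Per-token logprobs recorded at rollout time (old policy) and from the teacher.
actor_logps = [-1.2, -0.8, -2.0]
teacher_logps = [-1.0, -1.1, -1.5]

# Sequence-level task advantage, broadcast to every token of the sequence.
task_advantage = 0.7

# Per-token KL penalty, then the combined advantages fed to the loss.
kl_penalty = [-kl_coeff * (a - t) for a, t in zip(actor_logps, teacher_logps)]
advantages = [task_advantage + p for p in kl_penalty]
```

Tokens where the actor is less likely than the teacher receive a positive penalty term (a push toward the teacher), and vice versa.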

**Implementation in Ludic:**

Use `CreditModifier` to add per-token signals to advantages:

```python
algo = RLAlgorithm(
    credit_assigner=GroupNormalizedReturn(group_size=8),
    credit_modifiers=[KLCreditModifier(coeff=1.0)],
    loss=ClippedSurrogateLoss(...),
)
```

Or use the preset:

```python
algo = make_gspo_opd(group_size=8, kl_coeff=1.0)
```

## Level 3: Loss Composition

Combine independent loss terms additively.

```python
loss = CompositeLoss(terms=[
    LossTerm(name="rl", loss=ClippedSurrogateLoss(...), weight=1.0),
    LossTerm(name="auxiliary", loss=SomeAuxiliaryLoss(...), weight=0.1),
])
```

**Properties:**
- Each loss computed independently
- Different losses can use different data (current vs old policy logprobs)
- Loosest coupling
- Most flexible but signals don't interact

**Use when:**
- Truly independent objectives (e.g., RL + language modeling auxiliary)
- Different losses need fundamentally different handling
- You need maximum flexibility
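
At its core, loss composition is just a weighted sum of independently computed scalars. A minimal sketch (the function name is illustrative; the real `CompositeLoss` also handles per-term logging and gradients):

```python
# Weighted sum of independently computed loss terms, mirroring the
# LossTerm(name=..., weight=...) example above. Toy scalars, not real losses.
def composite_loss(term_losses: dict[str, float], weights: dict[str, float]) -> float:
    return sum(weights[name] * loss for name, loss in term_losses.items())

total = composite_loss(
    {"rl": 0.52, "auxiliary": 2.4},
    {"rl": 1.0, "auxiliary": 0.1},
)
```

Because each term is computed from its own inputs, nothing forces the terms to share data or interact before the final sum.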

## Key Differences: Advantage vs Loss Composition

| Aspect | Advantage Modification (Level 2) | Loss Composition (Level 3) |
|--------|----------------------------------|---------------------------|
| **KL source** | Old policy (rollout time) | Current policy (forward pass) |
| **Importance sampling** | Goes through ratio | Doesn't go through ratio |
| **Gradient** | `ratio * (task_adv + kl_penalty)` | `ratio * task_adv + kl_grad` |
| **Interaction** | Signals combined | Signals independent |

### Mathematical Difference

**Advantage Modification:**
```
A_t = task_advantage - kl_coeff * KL_old_t
L = E[ ratio_t * A_t ]
∇L ∝ ratio_t * A_t * ∇log π_t
```

**Loss Composition:**
```
L = E[ ratio_t * task_advantage ] + kl_coeff * E[ KL_current_t ]
∇L ∝ ratio_t * task_advantage * ∇log π_t + kl_coeff * ∇KL_current_t
```

In synchronous RL with a single gradient step per batch, `ratio ≈ 1` and `KL_old ≈ KL_current`, so these are similar. They diverge with:
- Multiple epochs per batch (PPO-style)
- Async RL with stale rollouts
- Large policy updates
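
Abstracting each scheme to the scalar weight it puts on `∇log π_t` for one token makes the divergence easy to check numerically (a toy sketch, not Ludic code — `kl_pen` stands for the signed KL contribution under either scheme):

```python
def level2(ratio: float, task_adv: float, kl_pen: float) -> float:
    # Advantage modification: the KL term goes through the importance ratio.
    return ratio * (task_adv + kl_pen)

def level3(ratio: float, task_adv: float, kl_pen: float) -> float:
    # Loss composition: the KL term bypasses the importance ratio.
    return ratio * task_adv + kl_pen

# Fresh on-policy step: ratio = 1, the two schemes coincide.
assert level2(1.0, 0.5, -0.2) == level3(1.0, 0.5, -0.2)

# Stale rollouts: ratio drifts from 1 and the weights diverge.
# level2(2.0, 0.5, -0.2) -> 0.6, level3(2.0, 0.5, -0.2) -> 0.8
```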

## Recommended Patterns

### Pattern 1: Pure RL (task rewards only)
```python
algo = make_gspo(group_size=8)
```

### Pattern 2: GSPO + OPD hybrid (recommended for distillation)
```python
algo = make_gspo_opd(group_size=8, kl_coeff=1.0)
```

### Pattern 3: Independent auxiliary loss
```python
algo = RLAlgorithm(
    credit_assigner=GroupNormalizedReturn(group_size=8),
    loss=CompositeLoss(terms=[
        LossTerm(name="rl", loss=ClippedSurrogateLoss(...), weight=1.0),
        LossTerm(name="lm", loss=LanguageModelingLoss(...), weight=0.1),
    ]),
)
```

## Summary

| Scenario | Level | Implementation |
|----------|-------|----------------|
| Pure RL | - | `make_gspo()` |
| RL + teacher distillation | 2 (Advantage) | `make_gspo_opd()` |
| RL + unrelated auxiliary | 3 (Loss) | `CompositeLoss` |
| Intrinsic rewards | 1 (Reward) | Agent scorers |
examples/opd/README.md (new file, 121 additions)
# GSPO + OPD Hybrid Training on GSM8K

Train a smaller student model using both task rewards and dense per-token supervision from a larger teacher model.

This hybrid approach combines:
- **GSPO (Group Sequence Policy Optimization)**: Task rewards from GSM8K correctness with group-normalized advantages
- **OPD (On-Policy Distillation)**: Dense per-token feedback via reverse KL divergence from teacher

The hybrid adds KL penalty directly to advantages (Level 2: Advantage Modification):
1. **Task-specific learning**: Sparse but grounded rewards from environment → group-normalized advantages
2. **Distribution matching**: Dense per-token KL penalty added to advantages

Reference: https://thinkingmachines.ai/blog/on-policy-distillation

## Prerequisites

- At least 2 GPUs (e.g., 2x A100).
- GPU 0: Both vLLM servers (student 0.5B + teacher 7B fit together)
- GPU 1: Training (gradient updates)
- Required extra packages: `datasets`, `math-verify`.

Install deps (once):
```bash
uv sync --extra examples
```

## 1) Start vLLM servers

You need **two** vLLM servers: one for the student (sampling) and one for the teacher (scoring). For these small models, both can share GPU 0.

**Important**: Student and teacher must use the **same tokenizer**. The Qwen2.5 family shares tokenizers across sizes, so this works.

### Terminal 1: Student server (port 8000)
```bash
CUDA_VISIBLE_DEVICES=0 uv run python -m ludic.inference.vllm_server \
--model Qwen/Qwen2.5-0.5B-Instruct \
--port 8000 \
--gpu-memory-utilization 0.4
```

### Terminal 2: Teacher server (port 8001)
```bash
CUDA_VISIBLE_DEVICES=0 uv run python -m ludic.inference.vllm_server \
--model Qwen/Qwen2.5-7B-Instruct \
--port 8001 \
--gpu-memory-utilization 0.5
```

Wait for both servers to report ready before proceeding.

## 2) Train with OPD

In a third terminal, run the OPD training script on GPU 1:
```bash
CUDA_VISIBLE_DEVICES=1 PYTHONPATH=. uv run python examples/opd/train_opd_gsm8k.py \
--student-model Qwen/Qwen2.5-0.5B-Instruct \
--teacher-model Qwen/Qwen2.5-7B-Instruct \
--student-port 8000 \
--teacher-port 8001 \
--rollouts-per-update 64 \
--train-steps 100 \
--micro-token-budget 16384 \
--max-seq-len 1024
```

### Key flags

| Flag | Default | Description |
|------|---------|-------------|
| `--student-model` | `Qwen/Qwen2.5-0.5B-Instruct` | Student model (must match vLLM server) |
| `--teacher-model` | `Qwen/Qwen2.5-7B-Instruct` | Teacher model (must share tokenizer with student) |
| `--student-port` | 8000 | Student vLLM server port |
| `--teacher-port` | 8001 | Teacher vLLM server port |
| `--kl-coeff` | 1.0 | Coefficient for reverse KL loss term |
| `--rollouts-per-update` | 256 | Total rollouts per training step |
| `--group-size` | 8 | Group size for GSPO advantages |
| `--concurrency` | 32 | Parallel rollout generation |
| `--limit` | None | Limit training samples (None = use all) |
| `--logger` | `rich` | Loggers: rich, print, wandb, none (comma-separated) |
| `--eval-every` | 10 | Eval every N train steps |
| `--eval-limit` | 1000 | Number of test samples for eval |
| `--eval-temperature` | 0.0 | Sampling temperature for eval (greedy) |

### Training logs

Output includes:
- `train/loss`: Policy gradient loss with KL-modified advantages
- `train/kl/kl_mean`: Mean per-token reverse KL (actor - teacher logprobs)
- `train/kl/kl_penalty_mean`: Mean KL penalty added to advantages
- `train/correct_rate`: GSM8K accuracy on training samples
- `train/avg_completion_length`: Average tokens per completion
- `eval/accuracy`: GSM8K accuracy on test set
- `eval/parse_error_rate`: Parse error rate on test set

Rollouts are written to `opd_rollouts.jsonl`.

## How GSPO + OPD works

This uses "Level 2: Advantage Modification" via CreditModifier (see `docs/composition.md`):

1. **Student samples**: The student model generates completions for GSM8K problems
2. **Environment rewards**: Each completion is graded for correctness (sparse reward)
3. **Teacher scores**: The teacher model computes per-token logprobs on the student's samples
4. **Credit assignment**: GroupNormalizedReturn computes task-based advantages
5. **Credit modification**: KLCreditModifier adds KL penalty to advantages:
```
A_t = task_advantage + (-kl_coeff * (actor_logp_t - teacher_logp_t))
```
6. **Policy gradient**: ClippedSurrogateLoss with modified advantages

Key benefits of this approach (vs CompositeLoss):
- KL goes through importance sampling (multiplied by ratio like task rewards)
- KL uses old policy logprobs from rollout time (not current policy)
- All signals interact through the same loss function
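
Steps 4-5 above can be sketched for one group of rollouts in plain Python. The normalization details (population std, the epsilon) are assumptions about what `GroupNormalizedReturn` does, and the numbers are toys; the real pipeline operates on batched tensors:

```python
import statistics

kl_coeff = 1.0

# Step 2: sparse correctness reward for each of 4 rollouts in one group.
rewards = [1.0, 0.0, 0.0, 1.0]

# Step 4: group-normalized task advantage per rollout
# (epsilon and population-std choice are illustrative assumptions).
mu = statistics.mean(rewards)
sigma = statistics.pstdev(rewards)
task_advs = [(r - mu) / (sigma + 1e-8) for r in rewards]

# Step 5: add the per-token KL penalty for the first rollout's tokens.
actor_logps, teacher_logps = [-1.5, -0.9], [-1.2, -1.0]
A = [task_advs[0] - kl_coeff * (a - t)
     for a, t in zip(actor_logps, teacher_logps)]
```

The resulting `A` is what `ClippedSurrogateLoss` consumes in step 6.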

## Tips

- **Same tokenizer is required**: OPD passes token IDs directly from student to teacher. If tokenizers differ, results will be meaningless.
- **Context window**: Ensure prompt + completion fits in teacher's context window. Truncation causes length mismatches.
- **GPU memory**: With larger models, you may need separate GPUs for student and teacher. Adjust `--gpu-memory-utilization` accordingly.
- **KL coefficient**: Start with `--kl-coeff 1.0`. Increase if student diverges too much from teacher.