49 commits
5a137ab
Sandbox: CodeExecEnv
joanvelja Dec 19, 2025
fd45541
Merge pull request #2 from hallerite/master
joanvelja Dec 19, 2025
3dedcb9
smoketests for sandbox
joanvelja Dec 19, 2025
88b6275
smoketests for sandbox
joanvelja Dec 19, 2025
63730f0
API sync
joanvelja Dec 21, 2025
8206a5d
podman quirks
joanvelja Dec 21, 2025
a0a9e8f
APPS trainer example + baseline (to be tested on HPC)
joanvelja Dec 21, 2025
3b89fb4
debugging isambard headaches
joanvelja Dec 21, 2025
d36c569
i hate CUDA
joanvelja Dec 21, 2025
b93cc8c
i hate CUDA
joanvelja Dec 21, 2025
11b9c51
revamp venv
joanvelja Dec 21, 2025
6de833b
some breaking changes, some other deprecation warnings torn down
Dec 21, 2025
5ee2e57
minimizing deps for podman-hpc
joanvelja Dec 21, 2025
b4c4478
Double async loop mistake
joanvelja Dec 21, 2025
db8c528
HF login
Dec 21, 2025
812fa90
Circuit breaker
joanvelja Dec 21, 2025
5cd2fe7
Merge pull request #3 from joanvelja/huggingface-login
joanvelja Dec 21, 2025
2e375f6
Drain pipe
joanvelja Dec 21, 2025
db9087e
drainer -fix
joanvelja Dec 21, 2025
7655ecf
wandb to ignore
Dec 21, 2025
3f82ee3
sandboxing issues
joanvelja Dec 22, 2025
8b90ecd
contention
joanvelja Dec 22, 2025
4e6870a
config
joanvelja Dec 22, 2025
64fc363
Big change: from serial to batched test exec
joanvelja Dec 23, 2025
6b5c245
ignore checkpoints data
Dec 23, 2025
d757aae
memory limit cgroup clash
joanvelja Dec 23, 2025
4871f2b
Merge branch 'apps_batched' of https://github.com/joanvelja/ludic int…
joanvelja Dec 23, 2025
3be35a4
update
joanvelja Dec 23, 2025
a231a69
cache bug
joanvelja Dec 23, 2025
904f47a
hangs.. inspecting
Dec 23, 2025
13c656e
update
joanvelja Dec 23, 2025
c9c9bdd
deprecated concurrency args
Dec 23, 2025
8249153
parallel baby
joanvelja Dec 23, 2025
1f0b1dd
subtle exec pool bug
joanvelja Dec 23, 2025
23749f6
new optim benching: volume
joanvelja Dec 23, 2025
7f805a9
Bind mount feature: 3 exec calls --> 1
joanvelja Dec 24, 2025
6fcaad1
visualization efforts
joanvelja Dec 27, 2025
8c9aa56
Merge remote-tracking branch 'upstream/master' into dual-model
joanvelja Dec 27, 2025
70cddbb
Update the API with Hallerite's changes (grad_accum, algos) — for dry…
joanvelja Dec 27, 2025
3bfdca3
lost in translation: podman workspace dir update
joanvelja Dec 27, 2025
92ad2cf
path problems with dir
joanvelja Dec 27, 2025
778731d
path problems with dir
joanvelja Dec 27, 2025
4c4225e
clean readme for sandbox execution
joanvelja Dec 27, 2025
e38448c
Memory efficient KL-div + ScaleRL recipe
joanvelja Dec 27, 2025
f53dabb
Flash attention
Dec 28, 2025
848a822
Merge pull request #5 from joanvelja/dual-model
joanvelja Dec 28, 2025
321e8d7
Cleanup personal files
Dec 28, 2025
084d49d
clean up
hallerite Jan 9, 2026
8c2de97
revert to master
hallerite Jan 9, 2026
28 changes: 28 additions & 0 deletions .gitignore
@@ -6,5 +6,33 @@ dist/
wheels/
*.egg-info

# mac specific crap
.DS_Store

# checkpoints
checkpoints_*/

# Virtual environments
.venv

# wandb files
wandb/

# Slurm logs
logs/
*.log

# Big jsonl files
data/
*.jsonl

# Environment files (secrets)
.env
.env.*
.DS_Store

# HPC specific files
examples/code_exec/hpc/

# personal research directory
research/
1 change: 0 additions & 1 deletion .python-version

This file was deleted.

47 changes: 38 additions & 9 deletions AGENTS.md
@@ -20,9 +20,10 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me

- **Separate Agent vs Environment**
- **Environment** = state transition function (+ optional scalar reward) with minimal assumptions; can be multi-agent by default.
- **Agent** = LLM *with state* (prompt harness + memory + parsing + optional auxiliary tools).
- auxiliary tools = tools that don't change the state of the environment
- Rationale: reuse environments across different “agent harnesses” (memory schemes, parsers, prompts, tools) and reuse harness pieces across environments.
- **Agent** = LLM *with state* (prompt harness + memory + parsing + optional tools).
- Internal tools = tools executed by the agent itself (calculator, code interpreter); they don't change env state
- External tools = tool calls returned to the protocol for handling (delegation, sub-agents)
- Rationale: reuse environments across different "agent harnesses" (memory schemes, parsers, prompts, tools) and reuse harness pieces across environments.

- **Make the interaction loop explicit**
- Neither env nor agent “owns” rollout generation. An **InteractionProtocol** owns the agent<-->env loop and produces rollouts.
@@ -50,6 +51,11 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me
## Core Abstractions (Where + What)

- **Shared types (rollouts, steps, truncation flags)**: `src/ludic/types.py`
- **Steps (agent vs env)**: See `CONSIDERATIONS.md` for the full rationale. The short version:
- **AgentStep**: Every model call, including internal tool loops. Contains `TokenTrace` for training.
- **EnvironmentStep**: State transitions (`env.step()` outcomes). References the triggering AgentSteps.
- Why separate? Training needs the full reasoning trace, not just final actions. A ReAct agent might call 3 tools before outputting an action—all those calls have token traces we want to train on.
- Rollouts keep a single timeline of both kinds; online batching concatenates all AgentSteps in a turn into one `SAWItem`.
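The AgentStep/EnvironmentStep split above can be sketched as plain dataclasses. This is illustrative only: the names come from the doc, but the field layout here is a guess, not the actual definitions in `src/ludic/types.py`:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TokenTrace:
    # Token IDs and logprobs sampled at rollout time; what training consumes.
    token_ids: list[int]
    logprobs: list[float]

@dataclass
class AgentStep:
    # One model call; a ReAct agent emits several of these per turn.
    prompt: str
    completion: str
    trace: TokenTrace

@dataclass
class EnvironmentStep:
    # One env.step() outcome; references the AgentSteps that produced it.
    observation: Any
    reward: float
    agent_step_ids: list[int] = field(default_factory=list)

# A turn with two internal tool calls before the final action yields
# three AgentSteps but only one EnvironmentStep.
turn = [
    AgentStep("obs", "call calculator", TokenTrace([1], [-0.1])),
    AgentStep("tool result", "call search", TokenTrace([2], [-0.2])),
    AgentStep("tool result", "final action", TokenTrace([3], [-0.3])),
]
env_step = EnvironmentStep(observation="next obs", reward=1.0,
                           agent_step_ids=[0, 1, 2])
```

Under this sketch, online batching would concatenate all three token traces from the turn into a single `SAWItem`.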

- **Environment kernel (multi-agent by default)**: `src/ludic/envs/env.py`
- `LudicEnv.reset() -> {agent_id: (obs, info)}`
@@ -62,8 +68,10 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me
- Wraps a `ChatClient` (inference backend), a `ContextStrategy` (memory/prompt building), and a `Parser` (action decoding + intrinsic format rewards/penalties).
- Handles incomplete completions (`finish_reason == "length"`) as parse failures (optional) to avoid training on truncated actions.
- Extended agent types:
- `ToolAgent` (`src/ludic/agents/tool_agent.py`): OpenAI/vLLM-compatible tool calling with automatic schema generation from callables.
- `ReActAgent` (`src/ludic/agents/react_agent.py`): Multi-step ReAct pattern with configurable `max_react_steps` for tool loops.
- `ToolAgent` (`src/ludic/agents/tool_agent.py`): Base for tool-calling agents. Supports two tool scopes:
- `tools`: Internal tools executed by agent (calculator, code interpreter). Results go to context, agent continues.
- `external_tools`: Tools returned to protocol for handling (delegation, sub-agents). Protocol feeds results back.
- `ReActAgent` (`src/ludic/agents/react_agent.py`): Multi-step ReAct pattern [Think → Tool]* → Act. Returns `action_target` indicating what happens next: `"internal"` (handled), `"external"` (protocol handles), or `"env"` (final action).
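The three `action_target` values above determine who acts next. A hypothetical dispatch sketch (the target strings are from the doc; the function itself is not part of the library):

```python
def route_action(action_target: str) -> str:
    """Describe who handles the agent's output for each action_target."""
    if action_target == "internal":
        # Agent already executed the tool; result went to its context,
        # and the agent continues reasoning.
        return "agent executes tool and continues"
    if action_target == "external":
        # Protocol runs the external_tool_handler (e.g. spawns a
        # sub-agent) and feeds the result back to the agent.
        return "protocol runs external_tool_handler and feeds result back"
    if action_target == "env":
        # Final action for this turn: protocol calls env.step().
        return "protocol calls env.step() with the parsed action"
    raise ValueError(f"unknown action_target: {action_target}")
```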

- **Context strategy (memory/prompt policy)**: `src/ludic/context/base.py`
- Hooks: `on_env_reset`, `on_before_act`, `on_after_act`, `on_after_step`.
@@ -77,16 +85,20 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me

- **Interaction protocols (own the loop)**: `src/ludic/interaction/base.py`
- Single-agent synchronous loop: `src/ludic/interaction/single_agent.py`
- Supports `external_tool_handler` callback for handling external tool calls
- Multi-agent loop (per-agent rollouts via `TraceCollector`): `src/ludic/interaction/multi_agent.py`, `src/ludic/interaction/step_collector.py`
- Key behavior: parser failures are handled *inside the protocol* (synthetic step, no `env.step()` call), so env stays parser-agnostic.
- Key behaviors:
- Parser failures are handled *inside the protocol* (synthetic step, no `env.step()` call), so env stays parser-agnostic.
- External tool calls (`action_target="external"`) are routed through `external_tool_handler`; results are fed back to agent context and the agent continues reasoning.
- **Delegation pattern**: External tools enable hierarchical agents where a parent can spawn sub-agents. The protocol handles the sub-agent's rollout and returns results to the parent. Both rollouts are collected for training. See `CONSIDERATIONS.md` for details.
- Utility: `src/ludic/interaction/info.py` provides `merge_step_info()` for safely merging step metadata with collision detection on reserved keys.

- **Rollout execution + collation**: `src/ludic/training/batching/rollout_engine.py`
- Stateless “factory floor”: instantiates env + protocol per request, runs episodes concurrently, returns rollouts.
- Converts rollouts → `SAWItem`s using either:
- exact token IDs returned by the inference backend (preferred), or
- `retokenize=True` with a caller-provided tokenizer.
- Practical note: if you want drift-free RL on the *actual sampled tokens*, have your inference client return token IDs/logprobs (vLLM: `SamplingArgs["extras"]["extra_body"]["return_token_ids"]=True`).
- Practical note: Token-in mode (see README) ensures drift-free RL by using rollout-time token IDs directly. Use `ReturnSpec.for_rl()` or set `return_token_ids=True` in `InferenceSpec` to get token IDs from the backend.
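For the older `SamplingArgs` route mentioned above, the request payload might look like this. A sketch only: the exact schema depends on your `ChatClient` and vLLM version, and `ReturnSpec.for_rl()` is the preferred path per the note:

```python
# Hypothetical sketch: ask the vLLM backend to echo the sampled token IDs
# so SAWItems can use them directly instead of retokenizing text
# (which can drift when detokenize/retokenize round-trips aren't exact).
sampling_args = {
    "extras": {
        "extra_body": {
            "return_token_ids": True,  # backend returns sampled token IDs
        }
    }
}
```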

- **Batch sources (trainer talks to these, not the engine)**: `src/ludic/training/types.py`
- Sync: `src/ludic/training/batching/synced_batching.py` (`RolloutBatchSource`)
@@ -96,9 +108,9 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me

- **Algorithm injection (credit + loss)**: `src/ludic/training/algorithm.py`
- `RLAlgorithm = (CreditAssigner, Loss)`
- Presets: `make_reinforce()`, `make_reinforce_baseline()`, `make_grpo()`, `make_sft()`
- Presets: `make_reinforce()`, `make_reinforce_baseline()`, `make_grpo()`, `make_dr_grpo()`, `make_gspo()`, `make_cispo()`, `make_gmpo()`, `make_sft()`
- Credit assigners: `src/ludic/training/credit_assignment.py` – `MonteCarloReturn`, `GroupNormalizedReturn`, `EpisodicReturn`, `PerStepReward`, `ConstantCredit`
- Losses: `src/ludic/training/loss.py`
- Losses: `src/ludic/training/loss.py` – `ReinforceLoss`, `TokenClippedSurrogateLoss`, `ClippedSurrogateLoss`, `CISPOLoss`, `GMPOLoss`, `MaskedCausalLMCrossEntropyLoss`

- **Trainer (optimization loop only)**: `src/ludic/training/trainer.py`
- Collates `SAWItem` → tensors and runs `RLAlgorithm.loss`.
@@ -135,6 +147,23 @@ GRPO mental model in this codebase:
- It avoids a learned **value function** by using a **Monte Carlo / group-relative baseline** (group mean reward for the same prompt) to form advantages.
- If you come from PPO-RLHF: think "PPO-shaped dataflow" without a critic/value model, where the "advantage" is estimated by group comparison rather than by GAE/value bootstrapping.
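The group-relative baseline can be illustrated numerically. A sketch, not the `GroupNormalizedReturn` implementation; dividing by the group std is a common GRPO variant and is an assumption here:

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage = (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Four rollouts of the same prompt: better-than-average rollouts get a
# positive advantage, worse ones negative; no value model is needed.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# mean = 0.5, std = 0.5 → advantages ≈ [1, -1, 1, -1]
```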

## GMPO (Geometric-Mean Policy Optimization)

**GMPO** (arXiv:2507.20673) is a variant of GRPO that uses the **geometric mean** of token-level importance ratios instead of the arithmetic mean.

**Core idea**:
- GRPO optimizes: (1/|o|) Σ_t ρ_t * A (arithmetic mean)
- GMPO optimizes: (∏_t ρ_t)^(1/|o|) * A (geometric mean)

The geometric mean is less sensitive to outlier importance ratios, which can help prevent extreme policy updates when individual tokens have unusually high or low ratios.

**Implementation** (`src/ludic/training/loss.py`, `src/ludic/training/algorithm.py`):
- **Loss**: `GMPOLoss` computes the geometric mean in log-space for numerical stability
- **Objective**: J_GMPO = E[ (∏_t min(ρ_t * A, clip(ρ_t, e^-ε_low, e^ε_high) * A))^(1/|o|) * sgn(A) ]
- **Clipping**: Token-level clipping in log-space, wider default range (e^-0.4, e^0.4) vs GRPO's (0.8, 1.2)
- **Normalization**: 1/|o| sequence length normalization
- **Preset**: `make_gmpo(group_size=4)` uses same credit assignment as GRPO (`GroupNormalizedReturn`)
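The log-space geometric mean can be sketched as follows (illustrative, not the `GMPOLoss` code; clipping is omitted for brevity):

```python
import math

def geometric_mean_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """(prod_t rho_t)^(1/|o|), computed stably as exp(mean(log rho_t))."""
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# One outlier token ratio (e^2 ~ 7.4) moves the arithmetic mean far more
# than the geometric mean:
new = [-1.0, -1.0, 1.0]   # token logprobs under the new policy
old = [-1.0, -1.0, -1.0]  # token logprobs under the behavior policy
ratios = [math.exp(n - o) for n, o in zip(new, old)]
arith = sum(ratios) / len(ratios)      # ~ 3.13
geo = geometric_mean_ratio(new, old)   # e^(2/3) ~ 1.95
```

This is the sense in which the geometric mean dampens outlier ratios: the outlier enters through its log rather than its raw value.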

## SFT / Offline RL

Ludic supports supervised fine-tuning (SFT) and offline RL through the same abstractions: