49 commits
5a137ab
Sandbox: CodeExecEnv
joanvelja Dec 19, 2025
fd45541
Merge pull request #2 from hallerite/master
joanvelja Dec 19, 2025
3dedcb9
smoketests for sandbox
joanvelja Dec 19, 2025
88b6275
smoketests for sandbox
joanvelja Dec 19, 2025
63730f0
API sync
joanvelja Dec 21, 2025
8206a5d
podman quirks
joanvelja Dec 21, 2025
a0a9e8f
APPS trainer example + baseline (to be tested on HPC)
joanvelja Dec 21, 2025
3b89fb4
debugging isambard headaches
joanvelja Dec 21, 2025
d36c569
i hate CUDA
joanvelja Dec 21, 2025
b93cc8c
i hate CUDA
joanvelja Dec 21, 2025
11b9c51
revamp venv
joanvelja Dec 21, 2025
6de833b
some breaking changes, some other deprecation warnings torn down
Dec 21, 2025
5ee2e57
minimizing deps for podman-hpc
joanvelja Dec 21, 2025
b4c4478
Double async loop mistake
joanvelja Dec 21, 2025
db8c528
HF login
Dec 21, 2025
812fa90
Circuit breaker
joanvelja Dec 21, 2025
5cd2fe7
Merge pull request #3 from joanvelja/huggingface-login
joanvelja Dec 21, 2025
2e375f6
Drain pipe
joanvelja Dec 21, 2025
db9087e
drainer -fix
joanvelja Dec 21, 2025
7655ecf
wandb to ignore
Dec 21, 2025
3f82ee3
sandboxing issues
joanvelja Dec 22, 2025
8b90ecd
contention
joanvelja Dec 22, 2025
4e6870a
config
joanvelja Dec 22, 2025
64fc363
Big change: from serial to batched test exec
joanvelja Dec 23, 2025
6b5c245
ignore checkpoints data
Dec 23, 2025
d757aae
memory limit cgroup clash
joanvelja Dec 23, 2025
4871f2b
Merge branch 'apps_batched' of https://github.com/joanvelja/ludic int…
joanvelja Dec 23, 2025
3be35a4
update
joanvelja Dec 23, 2025
a231a69
cache bug
joanvelja Dec 23, 2025
904f47a
hangs.. inspecting
Dec 23, 2025
13c656e
update
joanvelja Dec 23, 2025
c9c9bdd
deprecated concurrency args
Dec 23, 2025
8249153
parallel baby
joanvelja Dec 23, 2025
1f0b1dd
subtle exec pool bug
joanvelja Dec 23, 2025
23749f6
new optim benching: volume
joanvelja Dec 23, 2025
7f805a9
Bind mount feature: 3 exec calls --> 1
joanvelja Dec 24, 2025
6fcaad1
visualization efforts
joanvelja Dec 27, 2025
8c9aa56
Merge remote-tracking branch 'upstream/master' into dual-model
joanvelja Dec 27, 2025
70cddbb
Update the API with Hallerite's changes (grad_accum, algos) — for dry…
joanvelja Dec 27, 2025
3bfdca3
lost in translation: podman workspace dir update
joanvelja Dec 27, 2025
92ad2cf
path problems with dir
joanvelja Dec 27, 2025
778731d
path problems with dir
joanvelja Dec 27, 2025
4c4225e
clean readme for sandbox execution
joanvelja Dec 27, 2025
e38448c
Memory efficient KL-div + ScaleRL recipe
joanvelja Dec 27, 2025
f53dabb
Flash attention
Dec 28, 2025
848a822
Merge pull request #5 from joanvelja/dual-model
joanvelja Dec 28, 2025
321e8d7
Cleanup personal files
Dec 28, 2025
084d49d
clean up
hallerite Jan 9, 2026
8c2de97
revert to master
hallerite Jan 9, 2026
28 changes: 28 additions & 0 deletions .gitignore
@@ -6,5 +6,33 @@ dist/
wheels/
*.egg-info

# mac specific crap
.DS_Store

# checkpoints
checkpoints_*/

# Virtual environments
.venv

# wandb files
wandb/

# Slurm logs
logs/
*.log

# Big jsonl files
data/
*.jsonl

# Environment files (secrets)
.env
.env.*
.DS_Store

# HPC specific files
examples/code_exec/hpc/

# personal research directory
research/
1 change: 0 additions & 1 deletion .python-version

This file was deleted.

47 changes: 38 additions & 9 deletions AGENTS.md
@@ -20,9 +20,10 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me

- **Separate Agent vs Environment**
- **Environment** = state transition function (+ optional scalar reward) with minimal assumptions; can be multi-agent by default.
- **Agent** = LLM *with state* (prompt harness + memory + parsing + optional auxiliary tools).
- auxiliary tools = tools that don't change the state of the environment
- Rationale: reuse environments across different “agent harnesses” (memory schemes, parsers, prompts, tools) and reuse harness pieces across environments.
- **Agent** = LLM *with state* (prompt harness + memory + parsing + optional tools).
- Internal tools = tools executed by the agent itself (calculator, code interpreter); they don't change env state
- External tools = tool calls returned to the protocol for handling (delegation, sub-agents)
- Rationale: reuse environments across different "agent harnesses" (memory schemes, parsers, prompts, tools) and reuse harness pieces across environments.

- **Make the interaction loop explicit**
- Neither env nor agent “owns” rollout generation. An **InteractionProtocol** owns the agent<-->env loop and produces rollouts.
@@ -50,6 +51,11 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me
## Core Abstractions (Where + What)

- **Shared types (rollouts, steps, truncation flags)**: `src/ludic/types.py`
- **Steps (agent vs env)**: See `CONSIDERATIONS.md` for the full rationale. The short version:
- **AgentStep**: Every model call, including internal tool loops. Contains `TokenTrace` for training.
- **EnvironmentStep**: State transitions (`env.step()` outcomes). References the triggering AgentSteps.
- Why separate? Training needs the full reasoning trace, not just final actions. A ReAct agent might call 3 tools before outputting an action—all those calls have token traces we want to train on.
- Rollouts keep a single timeline of both kinds; online batching concatenates all AgentSteps in a turn into one `SAWItem`.
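The AgentStep/EnvironmentStep split above can be sketched as plain dataclasses. This is illustrative only: the names come from the doc, but the field layout here is a guess, not the actual definitions in `src/ludic/types.py`:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TokenTrace:
    # Token IDs and logprobs sampled at rollout time; what training consumes.
    token_ids: list[int]
    logprobs: list[float]

@dataclass
class AgentStep:
    # One model call; a ReAct agent emits several of these per turn.
    prompt: str
    completion: str
    trace: TokenTrace

@dataclass
class EnvironmentStep:
    # One env.step() outcome; references the AgentSteps that produced it.
    observation: Any
    reward: float
    agent_step_ids: list[int] = field(default_factory=list)

# A turn with two internal tool calls before the final action yields
# three AgentSteps but only one EnvironmentStep.
turn = [
    AgentStep("obs", "call calculator", TokenTrace([1], [-0.1])),
    AgentStep("tool result", "call search", TokenTrace([2], [-0.2])),
    AgentStep("tool result", "final action", TokenTrace([3], [-0.3])),
]
env_step = EnvironmentStep(observation="next obs", reward=1.0,
                           agent_step_ids=[0, 1, 2])
```

Under this sketch, online batching would concatenate all three token traces from the turn into a single `SAWItem`.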

- **Environment kernel (multi-agent by default)**: `src/ludic/envs/env.py`
- `LudicEnv.reset() -> {agent_id: (obs, info)}`
@@ -62,8 +68,10 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me
- Wraps a `ChatClient` (inference backend), a `ContextStrategy` (memory/prompt building), and a `Parser` (action decoding + intrinsic format rewards/penalties).
- Handles incomplete completions (`finish_reason == "length"`) as parse failures (optional) to avoid training on truncated actions.
- Extended agent types:
- `ToolAgent` (`src/ludic/agents/tool_agent.py`): OpenAI/vLLM-compatible tool calling with automatic schema generation from callables.
- `ReActAgent` (`src/ludic/agents/react_agent.py`): Multi-step ReAct pattern with configurable `max_react_steps` for tool loops.
- `ToolAgent` (`src/ludic/agents/tool_agent.py`): Base for tool-calling agents. Supports two tool scopes:
- `tools`: Internal tools executed by agent (calculator, code interpreter). Results go to context, agent continues.
- `external_tools`: Tools returned to protocol for handling (delegation, sub-agents). Protocol feeds results back.
- `ReActAgent` (`src/ludic/agents/react_agent.py`): Multi-step ReAct pattern [Think → Tool]* → Act. Returns `action_target` indicating what happens next: `"internal"` (handled), `"external"` (protocol handles), or `"env"` (final action).
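The three `action_target` values above determine who acts next. A hypothetical dispatch sketch (the target strings are from the doc; the function itself is not part of the library):

```python
def route_action(action_target: str) -> str:
    """Describe who handles the agent's output for each action_target."""
    if action_target == "internal":
        # Agent already executed the tool; result went to its context,
        # and the agent continues reasoning.
        return "agent executes tool and continues"
    if action_target == "external":
        # Protocol runs the external_tool_handler (e.g. spawns a
        # sub-agent) and feeds the result back to the agent.
        return "protocol runs external_tool_handler and feeds result back"
    if action_target == "env":
        # Final action for this turn: protocol calls env.step().
        return "protocol calls env.step() with the parsed action"
    raise ValueError(f"unknown action_target: {action_target}")
```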

- **Context strategy (memory/prompt policy)**: `src/ludic/context/base.py`
- Hooks: `on_env_reset`, `on_before_act`, `on_after_act`, `on_after_step`.
@@ -77,16 +85,20 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me

- **Interaction protocols (own the loop)**: `src/ludic/interaction/base.py`
- Single-agent synchronous loop: `src/ludic/interaction/single_agent.py`
- Supports `external_tool_handler` callback for handling external tool calls
- Multi-agent loop (per-agent rollouts via `TraceCollector`): `src/ludic/interaction/multi_agent.py`, `src/ludic/interaction/step_collector.py`
- Key behavior: parser failures are handled *inside the protocol* (synthetic step, no `env.step()` call), so env stays parser-agnostic.
- Key behaviors:
- Parser failures are handled *inside the protocol* (synthetic step, no `env.step()` call), so env stays parser-agnostic.
- External tool calls (`action_target="external"`) are routed through `external_tool_handler`; results are fed back to agent context and the agent continues reasoning.
- **Delegation pattern**: External tools enable hierarchical agents where a parent can spawn sub-agents. The protocol handles the sub-agent's rollout and returns results to the parent. Both rollouts are collected for training. See `CONSIDERATIONS.md` for details.
- Utility: `src/ludic/interaction/info.py` provides `merge_step_info()` for safely merging step metadata with collision detection on reserved keys.

- **Rollout execution + collation**: `src/ludic/training/batching/rollout_engine.py`
- Stateless “factory floor”: instantiates env + protocol per request, runs episodes concurrently, returns rollouts.
- Converts rollouts → `SAWItem`s using either:
- exact token IDs returned by the inference backend (preferred), or
- `retokenize=True` with a caller-provided tokenizer.
- Practical note: if you want drift-free RL on the *actual sampled tokens*, have your inference client return token IDs/logprobs (vLLM: `SamplingArgs["extras"]["extra_body"]["return_token_ids"]=True`).
- Practical note: Token-in mode (see README) ensures drift-free RL by using rollout-time token IDs directly. Use `ReturnSpec.for_rl()` or set `return_token_ids=True` in `InferenceSpec` to get token IDs from the backend.
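For the older `SamplingArgs` route mentioned above, the request payload might look like this. A sketch only: the exact schema depends on your `ChatClient` and vLLM version, and `ReturnSpec.for_rl()` is the preferred path per the note:

```python
# Hypothetical sketch: ask the vLLM backend to echo the sampled token IDs
# so SAWItems can use them directly instead of retokenizing text
# (which can drift when detokenize/retokenize round-trips aren't exact).
sampling_args = {
    "extras": {
        "extra_body": {
            "return_token_ids": True,  # backend returns sampled token IDs
        }
    }
}
```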

- **Batch sources (trainer talks to these, not the engine)**: `src/ludic/training/types.py`
- Sync: `src/ludic/training/batching/synced_batching.py` (`RolloutBatchSource`)
@@ -96,9 +108,9 @@ Instead, Ludic is closer to **classical RL** – specifically policy-gradient me

- **Algorithm injection (credit + loss)**: `src/ludic/training/algorithm.py`
- `RLAlgorithm = (CreditAssigner, Loss)`
- Presets: `make_reinforce()`, `make_reinforce_baseline()`, `make_grpo()`, `make_sft()`
- Presets: `make_reinforce()`, `make_reinforce_baseline()`, `make_grpo()`, `make_dr_grpo()`, `make_gspo()`, `make_cispo()`, `make_gmpo()`, `make_sft()`
- Credit assigners: `src/ludic/training/credit_assignment.py` – `MonteCarloReturn`, `GroupNormalizedReturn`, `EpisodicReturn`, `PerStepReward`, `ConstantCredit`
- Losses: `src/ludic/training/loss.py`
- Losses: `src/ludic/training/loss.py` – `ReinforceLoss`, `TokenClippedSurrogateLoss`, `ClippedSurrogateLoss`, `CISPOLoss`, `GMPOLoss`, `MaskedCausalLMCrossEntropyLoss`

- **Trainer (optimization loop only)**: `src/ludic/training/trainer.py`
- Collates `SAWItem` → tensors and runs `RLAlgorithm.loss`.
@@ -135,6 +147,23 @@ GRPO mental model in this codebase:
- It avoids a learned **value function** by using a **Monte Carlo / group-relative baseline** (group mean reward for the same prompt) to form advantages.
- If you come from PPO-RLHF: think "PPO-shaped dataflow" without a critic/value model, where the "advantage" is estimated by group comparison rather than by GAE/value bootstrapping.
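The group-relative baseline can be illustrated numerically. A sketch, not the `GroupNormalizedReturn` implementation; dividing by the group std is a common GRPO variant and is an assumption here:

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage = (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Four rollouts of the same prompt: better-than-average rollouts get a
# positive advantage, worse ones negative; no value model is needed.
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# mean = 0.5, std = 0.5 → advantages ≈ [1, -1, 1, -1]
```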

## GMPO (Geometric-Mean Policy Optimization)

**GMPO** (arXiv:2507.20673) is a variant of GRPO that uses the **geometric mean** of token-level importance ratios instead of the arithmetic mean.

**Core idea**:
- GRPO optimizes: (1/|o|) Σ_t ρ_t * A (arithmetic mean)
- GMPO optimizes: (∏_t ρ_t)^(1/|o|) * A (geometric mean)

The geometric mean is less sensitive to outlier importance ratios, which can help prevent extreme policy updates when individual tokens have unusually high or low ratios.

**Implementation** (`src/ludic/training/loss.py`, `src/ludic/training/algorithm.py`):
- **Loss**: `GMPOLoss` computes the geometric mean in log-space for numerical stability
- **Objective**: J_GMPO = E[ (∏_t min(ρ_t * A, clip(ρ_t, e^-ε_low, e^ε_high) * A))^(1/|o|) * sgn(A) ]
- **Clipping**: Token-level clipping in log-space, wider default range (e^-0.4, e^0.4) vs GRPO's (0.8, 1.2)
- **Normalization**: 1/|o| sequence length normalization
- **Preset**: `make_gmpo(group_size=4)` uses same credit assignment as GRPO (`GroupNormalizedReturn`)
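The log-space geometric mean can be sketched as follows (illustrative, not the `GMPOLoss` code; clipping is omitted for brevity):

```python
import math

def geometric_mean_ratio(logp_new: list[float], logp_old: list[float]) -> float:
    """(prod_t rho_t)^(1/|o|), computed stably as exp(mean(log rho_t))."""
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# One outlier token ratio (e^2 ~ 7.4) moves the arithmetic mean far more
# than the geometric mean:
new = [-1.0, -1.0, 1.0]   # token logprobs under the new policy
old = [-1.0, -1.0, -1.0]  # token logprobs under the behavior policy
ratios = [math.exp(n - o) for n, o in zip(new, old)]
arith = sum(ratios) / len(ratios)      # ~ 3.13
geo = geometric_mean_ratio(new, old)   # e^(2/3) ~ 1.95
```

This is the sense in which the geometric mean dampens outlier ratios: the outlier enters through its log rather than its raw value.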

## SFT / Offline RL

Ludic supports supervised fine-tuning (SFT) and offline RL through the same abstractions: