Empowering OpenClaw with RL – Train a personalized agent simply by talking to it.
Scalable RL in real-world settings – Agentic RL for terminal, GUI, SWE, and tool-call environments.
demo.mp4
- [2026/3/12] 🔥 We support LoRA training now!
- [2026/3/10] 🔥 We have released our Technical Report! 🎉 Ranked #1 on HuggingFace Daily Papers!
- [2026/3/10] 🔥 Huge updates today! We released a new combination method, along with an interesting evaluation of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across terminal, GUI, SWE, and tool-call scenarios. We focus exclusively on real-world settings!
- [2026/3/3] Working with the authors of SDFT and SDPO, we have integrated their methods into openclaw-opd. We welcome the integration of novel and effective methods!
- [2026/3/3] 📺 Check out these community tutorial videos on OpenClaw-RL: Video 1 | Video 2
- [2026/2/26] 🔥 We release OpenClaw-RL v1 – a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback.
OpenClaw-RL is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.
Most RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. OpenClaw-RL takes a fundamentally different approach: it wraps your self-hosted model in OpenClaw as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background β all without interrupting your usage.
Highlights: Fully async 4-component loop · Self-hosted & private · Zero manual labeling · Three learning paradigms (Binary RL / OPD / Combine) · Personal + General agent support
Features
OpenClaw-RL decouples agent serving, rollout collection, PRM/judge evaluation, and policy training into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.
The entire stack, including the policy model, judge/PRM, and trainer, runs on your own infrastructure. Conversation data stays within your system, and no third-party model API is required.
You do not need to manually label data. The system automatically:
- Organizes multi-turn interactions into session-aware training trajectories
- Classifies API messages into main-line (trainable) vs. side (non-trainable) turns
- Uses the next user, environment, or tool feedback as a natural "next-state" signal
- Runs PRM/judge evaluation asynchronously, with majority voting when needed for more robust scoring
- Submits ready samples to the trainer as they become available
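The judging step above tolerates noisy PRM outputs by aggregating several independent verdicts. As a minimal illustrative sketch (the function name, verdict labels, and tie-breaking rule here are our assumptions, not the repository's actual implementation), majority voting over binary judge verdicts looks like:

```python
from collections import Counter

def majority_vote(verdicts):
    """Aggregate independent binary judge verdicts ("good"/"bad")
    into a single, more robust label via majority voting."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    # Ties resolve pessimistically to "bad" so ambiguous turns
    # do not receive positive reward.
    return "good" if counts["good"] > counts["bad"] else "bad"

# Three judge samples for the same turn: 2-of-3 majority wins.
print(majority_vote(["good", "bad", "good"]))  # -> good
```

Sampling the judge multiple times and voting trades extra PRM compute for more stable per-turn rewards, which matters because these scores feed directly into training.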
Binary RL (GRPO): A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.
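As a rough sketch of the Binary RL update (names and shapes below are ours for illustration; see `openclaw-rl/` for the real implementation): GRPO normalizes the scalar PRM rewards within a group of rollouts for the same prompt to produce advantages, which then enter a PPO-style clipped surrogate objective.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize scalar rewards across
    a group of rollouts sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate objective for one sequence
    (to be maximized; a trainer would minimize its negation)."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# Binary PRM rewards for 4 rollouts of one prompt.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)  # positive for rewarded rollouts, negative otherwise
```

Normalizing within the group means the policy is pushed toward its better rollouts and away from its worse ones, even when the raw rewards are just 0/1.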
On-Policy Distillation (OPD): When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.
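Schematically (a simplified sketch; the hint injection and masking details live in `openclaw-opd/` and are not reproduced here), the OPD signal at each token is the gap between the hint-augmented teacher's log-probability and the student's log-probability on the student's own rollout:

```python
import numpy as np

def opd_token_advantages(teacher_logp, student_logp):
    """Per-token directional advantage: how much more likely the
    hint-augmented teacher finds each token of the student's own
    rollout than the student does. Positive -> reinforce the token;
    negative -> suppress it."""
    return np.asarray(teacher_logp) - np.asarray(student_logp)

# Teacher (prompt + hindsight hint) vs. student log-probs per token:
teacher = [-0.1, -2.0, -0.5]
student = [-0.3, -0.5, -0.5]
print(opd_token_advantages(teacher, student))
# token 0 is reinforced, token 1 is suppressed, token 2 is neutral
```

Because every token gets its own signed value, this signal carries far more information per sample than a single sequence-level scalar reward.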
Combination Method: OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.
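One plausible way to read the combination (our illustrative weighting, not necessarily the repo's exact recipe) is to broadcast the sequence-level GRPO advantage over all tokens and mix in the token-level OPD signal:

```python
import numpy as np

def combined_advantages(grpo_adv, opd_token_adv, beta=0.5):
    """Mix a sequence-level scalar advantage (broadcast to every
    token) with a token-level directional advantage.
    beta is an illustrative mixing weight, not a repo default."""
    opd = np.asarray(opd_token_adv, dtype=np.float64)
    return grpo_adv + beta * opd

# Sequence scored +1 by the PRM, with per-token OPD corrections:
print(combined_advantages(1.0, [0.2, -1.5, 0.0]))
```

The scalar term keeps every scored turn contributing to training, while the token-level term sharpens credit assignment on the turns where a hindsight hint was extracted.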
The same framework supports both personalized OpenClaw optimization and scalable RL for terminal, GUI, SWE, and tool-call agents in real-world settings.
Our long-term goal is to advance personalized, practically useful agents with reinforcement learning. The roadmap has two tracks:
Track 1 – Personal Agent Optimization (Small-Scale but Personal)

- ✅ Release Track 1: Fully async OpenClaw-RL framework with Binary RL + OPD
- ✅ Best recipe discovery via demonstration experiments
- ✅ Support LoRA training
- ⬜ Support low-precision training/inference
- ⬜ Deploy training on Tinker
- ⬜ Beyond the policy: extend learning to skills and memory

Track 2 – General Agents Optimization (Scalable Infra)

- ✅ Release Track 2: Scalable agentic RL infra for general agents
- ⬜ Support more cloud services
We welcome contributions that integrate new learning methods into the OpenClaw-RL framework! The integration of SDFT / SDPO into openclaw-opd and the addition of LoRA support are great examples of successful community contributions.
Highly wanted contributions:
- ☁️ Tinker cloud deployment – run OpenClaw-RL training on Tinker
- 🤖 Qwen3.5 model support – launch scripts and model configs for the Qwen3.5 family
- 🔧 Low-precision training examples – FP8/INT4 training scripts for existing methods
Full contribution guidelines & feature wishlist
We welcome community contributions to OpenClaw-RL! This document outlines our contribution principles and the features we'd love help with.
OpenClaw-RL is organized as a collection of self-contained method folders (e.g., openclaw-rl/, openclaw-opd/, openclaw-combine/), each sitting alongside the shared slime/ training framework and openclaw/ runtime.
Contributions generally fall into two categories:
1. New method folders. Create a new top-level folder (parallel to existing ones like `openclaw-opd/`). All method-specific code (launch scripts, custom loss functions, rollout logic, API server adapters, data processing, and the README) should live inside this folder.
2. Extensions to existing methods. For changes within an existing method folder (such as supporting a new model family, adding a LoRA variant, or a low-precision example), add new files (e.g., a new `.sh` script or a new data processing script) rather than modifying existing ones. This way the original working examples stay intact and your addition can be reviewed independently.
- Do not modify the core framework. Avoid changes to `slime/`, `Megatron-LM/`, or `openclaw/` unless absolutely necessary. The framework exposes extension points (`--custom-loss-function-path`, `--rollout-function-path`, `--custom-generate-function-path`, `--custom-rm-path`, etc.) specifically so that new methods can plug in without touching shared code. If a framework change is truly needed, please open a separate PR for it with a clear justification.
- Include documentation. For a new method folder, add a `README.md` explaining what the method does, how to run it, key environment variables, and file structure. For additions to existing folders, update the existing `README.md` with a new section. See `openclaw-combine/README.md` or `toolcall-rl/README.md` for good examples.
- Follow existing conventions. Use the same shell script structure (GPU partitioning, `CKPT_ARGS`, `ROLLOUT_ARGS`, `OPTIMIZER_ARGS`, etc.), environment variable naming, and `ray job submit` launch pattern used by the existing methods.
1. ☁️ Deploy Training on Tinker
Type: New method folder
Goal: Add a new top-level folder (e.g., tinker/) that provides a turnkey example for running OpenClaw-RL training on the Tinker cloud platform.
Requirements:
- A new self-contained folder at the repo root, following the same structure as other method folders.
- A launch script that adapts GPU allocation, Ray setup, and networking for the Tinker environment.
- The recommended training method is the combination loss (`openclaw-combine`), as it achieves the best results in our experiments. The example should either import or replicate the combination loss setup.
- A `README.md` covering: Tinker-specific prerequisites, step-by-step setup, how to configure checkpoints and data paths on Tinker, and how to connect OpenClaw to the running server.
2. 🤖 Qwen3.5 Model Support

Type: Extend existing method folders
Goal: Add launch scripts and model configurations for the Qwen3.5 family across existing methods.
Requirements:
- Add new `.sh` scripts for Qwen3.5 in relevant method folders (e.g., `openclaw-combine/run_qwen35_4b_openclaw_combine.sh`).
- Add the corresponding model config in `slime/scripts/models/` if Qwen3.5 requires different architecture parameters (hidden size, num layers, etc.) from Qwen3.
- Verify and document any changes needed for tokenizer, chat template, reasoning parser, or tool-call parser compatibility.
- Update READMEs to list Qwen3.5 as a supported model.
3. 🔧 Low-Precision Training Examples

Type: Extend existing method folders
Goal: Add low-precision (e.g., INT8/INT4 inference, BF16/FP8 training) example scripts to existing method folders, enabling users to run OpenClaw-RL on consumer-grade hardware with fewer GPUs.
Requirements:
- Add new `.sh` scripts within existing method folders; do not modify existing scripts.
- Low-precision inference: demonstrate launching the SGLang rollout engine with quantized weights (e.g., AWQ/GPTQ INT4) to reduce VRAM on the serving side.
- Low-precision training: if supported by the Megatron backend, demonstrate FP8 or mixed-precision configurations that reduce training memory.
- Update the corresponding `README.md` in each method folder with a new section documenting these scripts.
If you're interested in any of these, feel free to open an issue to discuss your approach before submitting a PR. We're happy to provide guidance and review!
- Hardware: 8× GPUs (default; configurable via `NUM_GPUS`, `ACTOR_GPUS`, `ROLLOUT_GPUS`, `PRM_GPUS`)
- Software: CUDA 12.9, Python 3.12
- Framework: Slime (our base RL framework)
For detailed environment setup, see Slime or ./instructions/README.md.
We provide three methods (RL servers):
| Dimension | Binary RL | OPD | Combined |
|---|---|---|---|
| Signal type | Evaluative (good / bad) | Directional | Evaluative + directional |
| Advantage | Sequence-level scalar | Token-level directional | Mixed sequence and token-level |
| Density | All scored turns | Hint-accepted turns only | All scored turns |
| Feedback type | User / environment | Explicit corrections | Both implicit and explicit feedback |
| Signal richness | 1 scalar per sample | 1 value per token | 1 value per token |
Choose your optimization method:
Option A: Combination Method – Recommended!
```bash
cd slime
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine.sh
```

This method combines Binary RL and OPD to achieve the best optimization. See ./openclaw-combine/README.md for algorithm details.
With LoRA (parameter-efficient, fewer GPUs):

```bash
bash ../openclaw-combine/run_qwen3_4b_openclaw_combine_lora.sh
```

All LoRA variants use PEFT LoRA adapters on the FSDP backend, training only ~0.8% of model parameters. After training, merge the adapter into a standard HF checkpoint:

```bash
python slime/tools/merge_lora_adapter.py \
    --base-model /path/to/base_model \
    --adapter /path/to/lora_checkpoint/model \
    --output /path/to/merged_model
```

Option B: Binary RL – Best for implicit feedback (likes/dislikes, env success/failure)
```bash
cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh
```

With LoRA (parameter-efficient, fewer GPUs):

```bash
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl_lora.sh
```

The PRM will automatically judge response quality from next-state feedback. We recommend providing frequent feedback (e.g., 👍/👎) to help the model optimize effectively.

See ./openclaw-rl/README.md for algorithm details.
Option C: On-Policy Distillation (OPD) – Best for rich textual feedback
```bash
cd slime
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd.sh
```

With LoRA (parameter-efficient, fewer GPUs):

```bash
bash ../openclaw-opd/run_qwen3_4b_openclaw_opd_topk_lora.sh
```

The system extracts hindsight hints from your feedback and distills them into the policy at the token level. We recommend providing concrete feedback (e.g., "you should have checked the file first" or "don't use that library").

See ./openclaw-opd/README.md for algorithm details.
Once running, the model is served as an OpenAI-compatible API at:
`http://<HOST_IP>:30000/v1`
where `<HOST_IP>` is the IP address of the machine running the RL server (e.g., 115.190.98.251). Port 30000 is the default and can be changed via the `PORT` environment variable.
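Any OpenAI-compatible client can talk to this endpoint. A minimal sketch of the request body for the chat-completions route (the model id matches the example openclaw.json later in this README; the actual send, shown commented out, assumes a running server and your own `SGLANG_API_KEY`):

```python
import json

# Request body for POST http://<HOST_IP>:30000/v1/chat/completions
payload = {
    "model": "qwen3-4b",  # model id you register in openclaw.json
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_tokens": 64,
}
headers = {
    "Authorization": "Bearer your-sglang-api-key",  # your SGLANG_API_KEY
    "Content-Type": "application/json",
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running RL server):
#   import urllib.request
#   req = urllib.request.Request(
#       "http://<HOST_IP>:30000/v1/chat/completions",
#       data=body.encode(), headers=headers)
#   print(urllib.request.urlopen(req).read().decode())
```

If this round-trip works, OpenClaw will be able to use the same endpoint.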
Take note of this endpoint – you will need it when configuring OpenClaw in the next step.
We also provide an interesting evaluation case. A student uses OpenClaw to do homework and does not want to be caught using AI; a teacher uses OpenClaw to grade the students' homework and wants the comments to be specific and friendly.
Evaluation Setting – Both student and teacher use AI!
We find that, under the combined optimization method, OpenClaw needs only 36 problem-solving interactions in the student setting and 24 grading interactions in the teacher setting to achieve a significant and clearly visible improvement.
See ./openclaw-test/README.md for setup and algorithm details.
Install OpenClaw from the version bundled in this repository (we will update it regularly):
Then configure OpenClaw to route requests to your RL server.
Open your openclaw.json (or the equivalent settings file) and add a provider entry under "models" → "providers":
```json
{
  "models": {
    "providers": {
      "qwen": {
        "baseUrl": "http://<HOST_IP>:30000/v1",
        "apiKey": "apiKey",
        "api": "openai-completions",
        "models": [
          {
            "id": "qwen3-4b",
            "name": "Qwen3 4B",
            "reasoning": true,
            "input": ["text"],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```

Replace `<HOST_IP>` with the IP address of your RL server machine. The `apiKey` should match the `SGLANG_API_KEY` you set when starting the server.
That's it – start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. Your agent gets better the more you use it.
Before launching, set these important environment variables as needed:
| Variable | Default | Description |
|---|---|---|
| `NUM_GPUS` | `8` | Total GPUs available on the machine |
| `ACTOR_GPUS` | `4` | GPUs allocated to the training actor |
| `ROLLOUT_GPUS` | `2` | GPUs allocated to rollout generation |
| `PRM_GPUS` | `2` | GPUs allocated to the Process Reward Model |
| `HF_CKPT` | (see script) | Path to the base HuggingFace checkpoint |
| `PRM_MODEL_PATH` | (see script) | Path to the reward model HuggingFace checkpoint |
| `SAVE_CKPT` | (see script) | Path to the saved HuggingFace checkpoint |
| `SGLANG_API_KEY` | – | API key for the SGLang serving endpoint |
More configuration details can be found in ./instructions.
The same asynchronous RL backbone that powers our personal-agent setting can also support large-scale optimization for these broader real-world environments.
| Setting | Environment | Next-state signal | Horizon |
|---|---|---|---|
| Terminal | Shell execution sandbox | stdout/stderr, exit code | Long |
| GUI | Screen state + accessibility tree | Visual state diff, task progress | Long |
| SWE | Code repository + test suite | Test verdicts, diff, lint output | Long |
| Tool-call | API/function execution | Return values, error traces | Medium |
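To make the "next-state signal" column concrete: in the terminal setting, for example, a binary reward can be derived directly from the sandbox's exit code and stderr. The mapping below is a hypothetical sketch for illustration, not the scoring rule the repo actually uses:

```python
def terminal_next_state_reward(exit_code, stderr):
    """Hypothetical binary reward from shell-execution feedback:
    success means a clean exit with no error output."""
    if exit_code == 0 and not stderr.strip():
        return 1.0
    return 0.0

print(terminal_next_state_reward(0, ""))           # 1.0 (success)
print(terminal_next_state_reward(127, "command not found"))  # 0.0
```

The GUI, SWE, and tool-call settings follow the same pattern, just with richer next-state observations (visual diffs, test verdicts, return values) feeding the PRM instead of an exit code.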
```bash
cd slime
bash ../terminal-rl/terminal_qwen3_8b_rl.sh
```

See ./terminal-rl/README.md for setup details.
```bash
cd slime
bash ../gui-rl/gui_qwen3vl_8b_rl.sh
```

See ./gui-rl/README.md for setup details.
```bash
cd slime
bash ../swe-rl/run_swe_rl_32b_remote_8nodes.sh
```

See ./swe-rl/README.md for setup details.
```bash
cd slime
bash ../toolcall-rl/retool_qwen3_4b_rl.sh
```

See ./toolcall-rl/README.md for setup details.
@article{wang2026openclawrl,
title={OpenClaw-RL: Train Any Agent Simply by Talking},
author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and Wang, Mengdi and Yang, Ling},
journal={arXiv preprint arXiv:2603.10165},
year={2026}
}
@article{wang2026rlanything,
title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},
author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},
journal={arXiv preprint arXiv:2602.02488},
year={2026}
}
This work aims to explore more effective paradigms for Agentic RL. Our implementation builds upon the excellent codebases of slime, OpenClaw, and Open-AgentRL.
We also build terminal RL using SETA's dataset and agent framework, GUI RL using OSWorld's evaluation scripts, SWE RL using mini-swe-agent's evaluation scripts, and tool-call RL based on the work of Retool.
We sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.
When using OpenClaw-RL, please do not provide sensitive personal information during conversations with the model. Also, make sure to keep your API keys secure and never expose them in prompts, logs, or shared files.

