Describe a behavior in a prompt. Get a trained policy.
LLM-powered reward engineering that writes, trains, judges, and iterates until your RL agent does what you asked.
| Feature | Description |
|---|---|
| Intent to Reward | Describe behavior in natural language; the LLM writes the reward function |
| Parallel Training | PPO with multiple seeds and configs via Stable-Baselines3 |
| Dual Judgment | Code-based judge + VLM video judge evaluate trained policies |
| Auto-Revision | LLM diagnoses failures, rewrites the reward, and tunes hyperparameters |
| Multi-LLM | Claude, Gemini, GPT: any model with tool-use support |
| MuJoCo + IsaacLab | 10 MuJoCo envs built-in, 90 IsaacLab envs optional |
| Dashboard | Real-time web UI for sessions, training curves, rollout videos |
```bash
git clone https://github.com/krafton-ai/Prompt2Policy.git
cd Prompt2Policy
uv sync --all-extras --python 3.11
```

Note: Python 3.11 is required when using IsaacLab (Isaac Sim only ships cp311 wheels). MuJoCo-only users can use Python 3.12+, but 3.11 is recommended for compatibility.
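The version constraint above can be expressed as a small runtime guard. This helper is illustrative only, not part of the repository:

```python
import sys

def python_ok(isaaclab: bool, version=None) -> bool:
    """Check the interpreter against the constraints noted above:
    IsaacLab needs exactly 3.11 (Isaac Sim ships cp311 wheels only),
    while MuJoCo-only runs work on 3.11 or newer."""
    major, minor = (version or sys.version_info)[:2]
    if isaaclab:
        return (major, minor) == (3, 11)
    return (major, minor) >= (3, 11)
```
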
Don't have uv?

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

See the uv installation guide for other platforms.
Headless Linux server? (AWS, cloud VMs)

Install system packages for rendering:

```bash
# Ubuntu 24.04+
sudo apt-get install -y xvfb libegl1 libgl1 libglu1-mesa

# Ubuntu 22.04
sudo apt-get install -y xvfb libegl1-mesa libgl1-mesa-glx libglu1-mesa
```

Set `MUJOCO_GL=egl` in your `.env` file. See the User Guide for details.
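As a sketch, the same setting can also be applied from Python before MuJoCo is imported; the helper name is illustrative, not a project API:

```python
import os

def configure_headless_rendering() -> None:
    """Route MuJoCo's off-screen rendering through EGL, as recommended
    above for headless Linux. Must run before mujoco/gymnasium import."""
    os.environ.setdefault("MUJOCO_GL", "egl")  # keep any explicit user setting
```
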
Running parallel training on a GPU?

Enable NVIDIA MPS so concurrent processes share the GPU efficiently instead of context-switching:

```bash
sudo nvidia-cuda-mps-control -d             # start MPS daemon
echo quit | sudo nvidia-cuda-mps-control    # stop when done
```

Recommended when running parallel training (`--max-parallel > 1`) or IsaacLab environments (GPU-vectorized). See the User Guide for details.
```bash
cp .env.example .env
# Edit .env: set GEMINI_API_KEY (required), plus ANTHROPIC_API_KEY or OPENAI_API_KEY (optional)

uv run uvicorn p2p.api.app:app --host 0.0.0.0 --port 8000 --reload --reload-dir src   # Terminal 1
cd frontend && npm install && npm run dev                                             # Terminal 2
```

Open http://localhost:3000, enter an intent like "do a backflip", and hit run. See the dashboard tutorial for a video walkthrough. For CLI usage, see the CLI Reference.

Remote server? Create `frontend/.env.local` with `NEXT_PUBLIC_API_URL=http://<your-server-ip>:8000` so the browser can reach the API. See Dashboard → Remote Access.
```bash
uv run pytest tests/ -v
```

```
User Intent → Intent Elicitor → Reward Author + Judge Author
                    ↓
               Code Review
                    ↓
     PPO Training (seeds × configs)
                    ↓
        Code Judge ∥ VLM Judge
                    ↓
               Synthesizer
               ↙        ↘
  [pass] → Done    [fail] → Revise Agent → next iteration
```
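To make the Reward Author step concrete, here is a minimal sketch of the kind of shaped reward an LLM might emit for a "do a backflip" intent. The signal names and weights are illustrative assumptions, not actual project output:

```python
def backflip_reward(pitch_velocity: float, torso_height: float,
                    control_cost: float) -> float:
    """Illustrative shaped reward: pay for backward rotation, add a small
    bonus for leaving the ground, penalize actuator effort."""
    spin = -pitch_velocity                   # backflips rotate pitch backwards
    airborne = max(torso_height - 0.5, 0.0)  # above nominal standing height
    return 1.0 * spin + 0.5 * airborne - 0.1 * control_cost
```

In a real session the revise loop would adjust these weights when the judges report failure, e.g. raising the spin term if the policy learns to jump without rotating.
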
MuJoCo (built-in) – 10 environments: all Gymnasium MuJoCo v5 locomotion
| Environment | DOF | Example Intents |
|---|---|---|
| HalfCheetah-v5 | 6 | "run forward fast", "do a backflip" |
| Ant-v5 | 8 | "walk in a circle", "stand on rear legs" |
| Hopper-v5 | 3 | "hop forward", "jump as high as possible" |
| Walker2d-v5 | 6 | "walk forward naturally", "high knee sprinting" |
| Humanoid-v5 | 17 | "walk with natural gait", "perform a deep squat" |
| HumanoidStandup-v5 | 17 | "stand up from the ground" |
| Swimmer-v5 | 2 | "swim forward", "swim in a zigzag" |
| Reacher-v5 | 2 | "reach the target" |
| InvertedPendulum-v5 | 1 | "keep the pole balanced" |
| InvertedDoublePendulum-v5 | 1 | "balance both poles" |
IsaacLab (optional) – 90 environments: locomotion, manipulation, dexterous
NVIDIA IsaacLab environments are supported when Isaac Sim is installed.
| Category | Count | Examples |
|---|---|---|
| Manipulation (Lift/Stack) | 21 | Franka lift/stack, Galbot, UR10 |
| Locomotion (Flat) | 12 | ANYmal B/C/D, Unitree Go1/Go2/A1, Cassie, Spot, H1, G1, Digit |
| Locomotion (Rough) | 11 | Same robots, rough terrain |
| Manipulation (Reach) | 8 | Franka, UR10, OpenArm |
| Humanoid | 8 | Humanoid locomotion variants |
| Assembly | 8 | AutoMate, Factory, Forge |
| Dexterous | 7 | Shadow hand, Allegro |
| Classic Control | 5 | Cartpole, Ant |
| Pick & Place | 4 | Franka, UR10 |
| Other | 6 | Quadcopter, Navigation |
Requirements: NVIDIA GPU with CUDA 12+, driver 525+, Ubuntu 22.04+.
| Variable | Required | Default | Description |
|---|---|---|---|
| `GEMINI_API_KEY` | Yes | – | Default LLM agent + VLM video judgment |
| `ANTHROPIC_API_KEY` | No | – | Required when using Claude models as the LLM |
| `OPENAI_API_KEY` | No | – | Required when using GPT models as the LLM |
| `MUJOCO_GL` | No | (unset) | Set to `egl` on headless Linux |
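Put together, a minimal `.env` based on the table above might look like this (key names come from the table; the values are placeholders, not working keys):

```ini
GEMINI_API_KEY=your-gemini-key        # required
ANTHROPIC_API_KEY=your-anthropic-key  # only if using Claude models
MUJOCO_GL=egl                         # headless Linux only
```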
Advanced settings

| Variable | Default | Description |
|---|---|---|
| `VLLM_HOST` | `localhost` | vLLM server host (local VLM inference) |
| `VLLM_PORT` | `8100` | vLLM server port |
| `VLLM_MODEL` | `Qwen/Qwen3.5-27B` | vLLM model name |
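As a sketch of how these variables combine, a client would assemble the server address like this; the helper is hypothetical, though the `/v1` path matches vLLM's OpenAI-compatible server:

```python
import os

def vllm_base_url(env=os.environ) -> str:
    """Build the base URL for a local vLLM server from the variables in
    the table above (defaults mirror the documented ones)."""
    host = env.get("VLLM_HOST", "localhost")
    port = env.get("VLLM_PORT", "8100")
    return f"http://{host}:{port}/v1"
```
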
```bash
uv run python -m p2p.session.run_session \
    --session-id my_session \
    --prompt "do a backflip" \
    --loop-config '{"train": {"env_id": "HalfCheetah-v5", "total_timesteps": 1000000}, "max_iterations": 5, "pass_threshold": 0.7, "hp_tuning": true}'
```

```bash
uv run python -m p2p.benchmark.benchmark_cli \
    --csv benchmark/test_cases_exotic_ant_halfcheetah_humanoid.csv \
    --max-iterations 5 \
    --total-timesteps 1000000 \
    --max-parallel 4 \
    --num-configs 3
```

See the User Guide for the full flag reference and API examples.
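The inline `--loop-config` JSON can be awkward to quote in a shell. A small helper like this hypothetical one (not part of the CLI) builds the same payload, with keys taken verbatim from the example above:

```python
import json

def make_loop_config(env_id: str, total_timesteps: int,
                     max_iterations: int = 5,
                     pass_threshold: float = 0.7,
                     hp_tuning: bool = True) -> str:
    """Serialize a loop config matching the schema in the example command."""
    return json.dumps({
        "train": {"env_id": env_id, "total_timesteps": total_timesteps},
        "max_iterations": max_iterations,
        "pass_threshold": pass_threshold,
        "hp_tuning": hp_tuning,
    })
```
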
| | MuJoCo (default) | IsaacLab |
|---|---|---|
| CPU | 8+ cores (16+ recommended for parallel seeds) | 8+ cores |
| RAM | 16 GB (32+ recommended) | 32+ GB |
| GPU | Optional β CUDA GPU for EGL rendering | Required β 24+ GB VRAM (varies by task) |
| Disk | 20 GB | 100+ GB |
MuJoCo training is CPU-bound (PPO with MLP policy). A GPU accelerates headless rendering (EGL) and local VLM inference but is not required. IsaacLab environments are GPU-vectorized and need at least 24 GB VRAM.
```bash
uv run ruff check src/ tests/           # lint
uv run ruff format --check src/ tests/  # format
uv run pytest tests/ -v                 # test
cd frontend && npm run lint             # frontend lint
```

- Training: Gymnasium, MuJoCo, Stable-Baselines3, IsaacLab (optional)
- LLM/VLM: Anthropic Claude, Google Gemini, OpenAI GPT, vLLM
- Backend: FastAPI, uvicorn
- Frontend: Next.js, React, Tailwind CSS, Recharts, KaTeX
- Dev: uv, ruff, pytest
- User Guide: detailed setup, usage, intent tips, LLM models, IsaacLab installation
- Architecture: code-level module map and execution flow
- v1.0 Release Notes: known limitations and roadmap
```bibtex
@misc{prompt2policy2026,
  title  = {Prompt-to-Policy: Agentic Engineering for Reinforcement Learning},
  author = {Wooseong Chung and Taegwan Ha and Yunhyeok Kwak and Taehwan Kwon and Jeong-Gwan Lee and Kangwook Lee and Suyoung Lee},
  year   = {2026},
  url    = {https://github.com/krafton-ai/Prompt2Policy}
}
```

This project is licensed under the MIT License.
Whether you're an RL researcher tired of hand-tuning rewards or a newcomer who just wants to describe a behavior and get a trained policy, this is for you.
