
Prompt-to-Policy

Describe a behavior in a prompt. Get a trained policy.
LLM-powered reward engineering that writes, trains, judges, and iterates until your RL agent does what you asked.

Project Page

*Prompt2Policy showcase: diverse learned behaviors from natural language intents*

What It Does

| Feature | Description |
|---|---|
| 🎯 Intent to Reward | Describe the behavior in natural language; the LLM writes the reward function |
| 🏋️ Parallel Training | PPO across multiple seeds and configs via Stable-Baselines3 |
| 👁️ Dual Judgment | A code-based judge and a VLM video judge evaluate trained policies |
| 🔄 Auto-Revision | The LLM diagnoses failures, rewrites the reward, and tunes hyperparameters |
| 🤖 Multi-LLM | Claude, Gemini, GPT: any model with tool-use support |
| 🦾 MuJoCo + IsaacLab | 10 MuJoCo envs built in, 90 IsaacLab envs optional |
| 📊 Dashboard | Real-time web UI for sessions, training curves, and rollout videos |

Quick Start

Install

git clone https://github.com/krafton-ai/Prompt2Policy.git
cd Prompt2Policy
uv sync --all-extras --python 3.11

Note: Python 3.11 is required when using IsaacLab (Isaac Sim only ships cp311 wheels). MuJoCo-only users can use Python 3.12+, but 3.11 is recommended for compatibility.

Don't have uv?
curl -LsSf https://astral.sh/uv/install.sh | sh

See uv installation guide for other platforms.

Headless Linux server? (AWS, cloud VMs)

Install system packages for rendering:

# Ubuntu 24.04+
sudo apt-get install -y xvfb libegl1 libgl1 libglu1-mesa

# Ubuntu 22.04
sudo apt-get install -y xvfb libegl1-mesa libgl1-mesa-glx libglu1-mesa

Set MUJOCO_GL=egl in your .env file. See the User Guide for details.

Running parallel training on a GPU?

Enable NVIDIA MPS so concurrent processes share the GPU efficiently instead of context-switching:

sudo nvidia-cuda-mps-control -d          # start MPS daemon
echo quit | sudo nvidia-cuda-mps-control  # stop when done

Recommended when running parallel training (--max-parallel > 1) or IsaacLab environments (GPU-vectorized). See the User Guide for details.
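With `--max-parallel > 1`, each (seed, config) pair becomes one training job. A minimal sketch of how such a grid might be chunked into concurrent batches (the function and field names here are hypothetical, not Prompt2Policy's internals):

```python
from itertools import product

def build_job_grid(seeds, configs, max_parallel):
    """Cross seeds with hyperparameter configs, then chunk into batches
    of at most `max_parallel` jobs that run concurrently (illustrative)."""
    jobs = [{"seed": s, **cfg} for s, cfg in product(seeds, configs)]
    return [jobs[i:i + max_parallel] for i in range(0, len(jobs), max_parallel)]

# 3 seeds x 2 configs = 6 jobs; with max_parallel=4 that is 2 batches.
batches = build_job_grid(
    seeds=[0, 1, 2],
    configs=[{"lr": 3e-4}, {"lr": 1e-3}],
    max_parallel=4,
)
print(len(batches))  # 2
```

MPS matters here because without it, those concurrent processes time-slice the GPU instead of sharing it.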

Configure

cp .env.example .env
# Edit .env: set GEMINI_API_KEY (required), plus ANTHROPIC_API_KEY or OPENAI_API_KEY (optional)

Run (Dashboard)

uv run uvicorn p2p.api.app:app --host 0.0.0.0 --port 8000 --reload --reload-dir src  # Terminal 1
cd frontend && npm install && npm run dev                                                    # Terminal 2

Open http://localhost:3000, enter an intent like "do a backflip", and hit run. See the dashboard tutorial for a video walkthrough. For CLI usage, see CLI Reference.

Remote server? Create frontend/.env.local with NEXT_PUBLIC_API_URL=http://<your-server-ip>:8000 so the browser can reach the API. See Dashboard - Remote Access.

Verify

uv run pytest tests/ -v

Pipeline

```
User Intent → Intent Elicitor → Reward Author + Judge Author
                                        ↓
                                   Code Review
                                        ↓
                              PPO Training (seeds × configs)
                                        ↓
                              Code Judge ∥ VLM Judge
                                        ↓
                                   Synthesizer
                                    ↓         ↓
                              [pass]  →  Done
                              [fail]  →  Revise Agent → next iteration
```
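The loop above can be sketched in plain Python. Every function below is a hypothetical stub standing in for an LLM call or an SB3 training run; only the control flow mirrors the diagram:

```python
# Trivial stand-ins so the sketch runs end to end; in the real system these
# call LLMs, launch PPO training, and score rollout videos.
def author_reward(intent): return f"# reward code for: {intent}"
def review_code(code): pass
def train_ppo(code): return ["policy-seed0", "policy-seed1"]
def code_judge(policies): return 0.6
def vlm_judge(policies): return 0.9
def synthesize(code_score, vlm_score): return (code_score + vlm_score) / 2
def revise(code, score): return code + "\n# revised"

def run_loop(intent, max_iterations=5, pass_threshold=0.7):
    """author, review, train, judge, synthesize; pass or revise and retry."""
    reward_code = author_reward(intent)
    policies, score = [], 0.0
    for _ in range(max_iterations):
        review_code(reward_code)
        policies = train_ppo(reward_code)
        score = synthesize(code_judge(policies), vlm_judge(policies))
        if score >= pass_threshold:
            return policies, score
        reward_code = revise(reward_code, score)
    return policies, score

policies, score = run_loop("do a backflip")
```

With the stub scores above, the synthesized score (0.75) clears the 0.7 threshold on the first iteration, so no revision pass is taken.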

Supported Environments

MuJoCo (built-in): 10 environments, all Gymnasium MuJoCo v5 locomotion

| Environment | DOF | Example Intents |
|---|---|---|
| HalfCheetah-v5 | 6 | "run forward fast", "do a backflip" |
| Ant-v5 | 8 | "walk in a circle", "stand on rear legs" |
| Hopper-v5 | 3 | "hop forward", "jump as high as possible" |
| Walker2d-v5 | 6 | "walk forward naturally", "high knee sprinting" |
| Humanoid-v5 | 17 | "walk with natural gait", "perform a deep squat" |
| HumanoidStandup-v5 | 17 | "stand up from the ground" |
| Swimmer-v5 | 2 | "swim forward", "swim in a zigzag" |
| Reacher-v5 | 2 | "reach the target" |
| InvertedPendulum-v5 | 1 | "keep the pole balanced" |
| InvertedDoublePendulum-v5 | 1 | "balance both poles" |
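For intuition about what the Reward Author produces: a reward for an intent like "run forward fast" on HalfCheetah-v5 typically trades forward velocity against a control penalty. A hand-written sketch, where the signature and coefficients are illustrative and not actual generated output:

```python
def reward_run_forward_fast(x_velocity, action, vel_weight=1.0, ctrl_weight=0.1):
    """Reward forward speed, lightly penalize large torques (illustrative)."""
    ctrl_cost = ctrl_weight * sum(a * a for a in action)
    return vel_weight * x_velocity - ctrl_cost

# 2.0 m/s forward with two small torques: 2.0 - 0.1 * 0.5 = 1.95
r = reward_run_forward_fast(2.0, [0.5, -0.5, 0.0, 0.0, 0.0, 0.0])
print(r)  # 1.95
```

Part of what the revise step does is adjust exactly these kinds of weights when the judges report, say, fast but erratic motion.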
IsaacLab (optional): 90 environments spanning locomotion, manipulation, and dexterous tasks

NVIDIA IsaacLab environments are supported when Isaac Sim is installed.

| Category | Count | Examples |
|---|---|---|
| Manipulation (Lift/Stack) | 21 | Franka lift/stack, Galbot, UR10 |
| Locomotion (Flat) | 12 | ANYmal B/C/D, Unitree Go1/Go2/A1, Cassie, Spot, H1, G1, Digit |
| Locomotion (Rough) | 11 | Same robots, rough terrain |
| Manipulation (Reach) | 8 | Franka, UR10, OpenArm |
| Humanoid | 8 | Humanoid locomotion variants |
| Assembly | 8 | AutoMate, Factory, Forge |
| Dexterous | 7 | Shadow Hand, Allegro |
| Classic Control | 5 | Cartpole, Ant |
| Pick & Place | 4 | Franka, UR10 |
| Other | 6 | Quadcopter, Navigation |

Requirements: NVIDIA GPU with CUDA 12+, driver 525+, Ubuntu 22.04+.


Configuration

| Variable | Required | Default | Description |
|---|---|---|---|
| GEMINI_API_KEY | Yes | (none) | Default LLM agent + VLM video judgment |
| ANTHROPIC_API_KEY | No | (none) | Required when using Claude models as the LLM |
| OPENAI_API_KEY | No | (none) | Required when using GPT models as the LLM |
| MUJOCO_GL | No | (unset) | Set to egl on headless Linux |

Advanced settings:

| Variable | Default | Description |
|---|---|---|
| VLLM_HOST | localhost | vLLM server host (local VLM inference) |
| VLLM_PORT | 8100 | vLLM server port |
| VLLM_MODEL | Qwen/Qwen3.5-27B | vLLM model name |
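These settings are ordinary environment variables, so the advanced defaults can be resolved with the stdlib. A sketch using the variable names and defaults from the table (the loader function itself is hypothetical):

```python
import os

def load_vllm_config(env=None):
    """Resolve vLLM connection settings, falling back to documented defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("VLLM_HOST", "localhost"),
        "port": int(env.get("VLLM_PORT", "8100")),  # ports arrive as strings
        "model": env.get("VLLM_MODEL", "Qwen/Qwen3.5-27B"),
    }

print(load_vllm_config({}))  # all defaults when nothing is set
```

Passing an explicit dict (as above) keeps the sketch testable without mutating the real process environment.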

CLI Reference

E2E Loop

uv run python -m p2p.session.run_session \
  --session-id my_session \
  --prompt "do a backflip" \
  --loop-config '{"train": {"env_id": "HalfCheetah-v5", "total_timesteps": 1000000}, "max_iterations": 5, "pass_threshold": 0.7, "hp_tuning": true}'
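The `--loop-config` value is a JSON string, so building it with `json.dumps` sidesteps shell-quoting mistakes (keys copied from the command above):

```python
import json

# Same configuration as the inline JSON in the CLI example.
loop_config = {
    "train": {"env_id": "HalfCheetah-v5", "total_timesteps": 1_000_000},
    "max_iterations": 5,
    "pass_threshold": 0.7,
    "hp_tuning": True,
}
print(json.dumps(loop_config))  # paste as the --loop-config argument
```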

Benchmark

uv run python -m p2p.benchmark.benchmark_cli \
  --csv benchmark/test_cases_exotic_ant_halfcheetah_humanoid.csv \
  --max-iterations 5 \
  --total-timesteps 1000000 \
  --max-parallel 4 \
  --num-configs 3

See the User Guide for full flag reference and API examples.


Hardware

| | MuJoCo (default) | IsaacLab |
|---|---|---|
| CPU | 8+ cores (16+ recommended for parallel seeds) | 8+ cores |
| RAM | 16 GB (32 GB+ recommended) | 32+ GB |
| GPU | Optional: CUDA GPU for EGL rendering | Required: 24+ GB VRAM (varies by task) |
| Disk | 20 GB | 100+ GB |

MuJoCo training is CPU-bound (PPO with MLP policy). A GPU accelerates headless rendering (EGL) and local VLM inference but is not required. IsaacLab environments are GPU-vectorized and need at least 24 GB VRAM.


Development

uv run ruff check src/ tests/          # lint
uv run ruff format --check src/ tests/  # format
uv run pytest tests/ -v                 # test
cd frontend && npm run lint             # frontend lint

Tech Stack

  • Training: Gymnasium, MuJoCo, Stable-Baselines3, IsaacLab (optional)
  • LLM/VLM: Anthropic Claude, Google Gemini, OpenAI GPT, vLLM
  • Backend: FastAPI, uvicorn
  • Frontend: Next.js, React, Tailwind CSS, Recharts, KaTeX
  • Dev: uv, ruff, pytest

Documentation

  • User Guide: detailed setup, usage, intent tips, LLM models, IsaacLab installation
  • Architecture: code-level module map and execution flow
  • v1.0 Release Notes: known limitations and roadmap

Citation

@misc{prompt2policy2026,
  title   = {Prompt-to-Policy: Agentic Engineering for Reinforcement Learning},
  author  = {Wooseong Chung and Taegwan Ha and Yunhyeok Kwak and Taehwan Kwon and Jeong-Gwan Lee and Kangwook Lee and Suyoung Lee},
  year    = {2026},
  url     = {https://github.com/krafton-ai/Prompt2Policy}
}

License

This project is licensed under the MIT License.

Whether you're an RL researcher tired of hand-tuning rewards or a newcomer who just wants to describe a behavior and get a trained policy, this is for you.
