This module motivates the paradigm shift from traditional LLM post-training (treating models as static sequence generators) to agentic reinforcement learning (transforming models into autonomous decision-making agents). You'll see how capabilities such as planning, tool use, memory, and self-improvement can emerge when RL is applied in dynamic, multi-step environments rather than only to single-turn alignment.
- Paradigm Evolution: LLMs evolve from passive text generators to autonomous agents through RL training
- Dynamic Environments: Agentic RL operates in multi-step, partially observable environments vs. single-shot responses
- Emergent Capabilities: Planning, tool use, memory, and reasoning emerge from RL optimization rather than hand-crafted heuristics
- Research Scope: 500+ papers surveyed, focusing on RL-empowered agentic behaviors, not traditional alignment
# Verify conceptual prerequisites
echo "Do you understand basic LLM concepts? (transformers, attention)"
echo "Are you familiar with reinforcement learning basics? (MDP, rewards)"
echo "Do you have Python + ML library experience? (torch, transformers)"
# Technical setup check
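python -c "import torch, transformers, gymnasium; print('✅ Core libraries ready')"

# Traditional LLM: single-turn response
class StubModel:
    """Hypothetical stand-in for a loaded LLM so the example runs."""
    def generate(self, prompt):
        return f"[generated text for: {prompt}]"

model = StubModel()

def traditional_llm(prompt):
    """Classic LLM behavior: prompt in, text out"""
    return model.generate(prompt)

# Example usage - no memory, no planning, no tools
response = traditional_llm("What's the weather like?")
print(response)  # Generic response, no actual weather data

class AgenticLLM:
    """Agentic RL agent: multi-step reasoning with tools and memory"""

    def __init__(self):
        self.memory = []
        self.tools = {'weather_api': self.get_weather}
        self.planning_depth = 3

    def act(self, observation):
        """Multi-step agentic behavior"""
        # 1. Planning: Decompose request into steps
        plan = self.create_plan(observation)

        # 2. Tool Use: Access external information
        weather_data = None  # stays None when no tool call is needed
        if 'weather' in observation.lower():
            location = self.extract_location(observation)
            weather_data = self.tools['weather_api'](location)

        # 3. Memory: Update contextual understanding
        self.memory.append({
            'observation': observation,
            'plan': plan,
            'tool_result': weather_data
        })

        # 4. Generate response based on actual data
        return self.synthesize_response(plan, weather_data)

    # The three helpers below are placeholder stubs added so the example runs;
    # a trained agent would produce these outputs from its learned policy.
    def create_plan(self, observation):
        return ['parse request', 'call tool', 'respond'][:self.planning_depth]

    def extract_location(self, observation):
        # Naive heuristic: take the text after the last " in "
        return observation.rsplit(' in ', 1)[-1].rstrip('?.! ')

    def synthesize_response(self, plan, weather_data):
        return weather_data or "No tool result available."

    def get_weather(self, location):
        # Simplified weather API call (returns a hard-coded string)
        return f"Current weather in {location}: 72°F, sunny"

# Example: Agent actually uses tools and plans
agent = AgenticLLM()
response = agent.act("What's the weather like in San Francisco?")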
print(response)  # Actual weather data through tool use

Traditional LLM-RL (PBRFT)          Agentic RL (This Course)
┌─────────────────┐                 ┌─────────────────────────┐
│   User Prompt   │                 │       Environment       │
│        ↓        │                 │            ↓            │
│   Single Step   │       VS        │    Multi-Step POMDP     │
│        ↓        │                 │            ↓            │
│  Text Response  │                 │  Action → Observation   │
│   (Terminal)    │                 │      → Reward Loop      │
└─────────────────┘                 └─────────────────────────┘
Static, One-shot                    Dynamic, Sequential
No Memory                           Persistent Memory
No Tools                            Tool Integration
No Planning                         Strategic Planning
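
The action → observation → reward loop on the right-hand side of the diagram can be sketched with a toy environment. This is a minimal illustration under stated assumptions, not the course's or any library's implementation: `GuessEnv`, `run_episode`, the hint strings, and the reward values are all hypothetical names chosen for this example, and a hand-written binary-search rule stands in for a learned policy. The point is only that the agent never sees the hidden state and must act over several steps using what it has observed so far.

```python
from dataclasses import dataclass, field


@dataclass
class GuessEnv:
    """Hypothetical toy environment whose hidden state is a secret number.

    The agent never sees the secret (partial observability); it only
    receives a textual hint after each guess.
    """
    secret: int
    max_steps: int = 10
    steps: int = field(default=0, init=False)

    def reset(self) -> str:
        """Start a new episode and return the initial observation."""
        self.steps = 0
        return "start"

    def step(self, action: int) -> tuple[str, float, bool]:
        """Apply a guess and return (observation, reward, done)."""
        self.steps += 1
        if action == self.secret:
            return "correct", 1.0, True
        hint = "higher" if action < self.secret else "lower"
        return hint, 0.0, self.steps >= self.max_steps


def run_episode(env: GuessEnv, low: int = 0, high: int = 100) -> tuple[float, int]:
    """Run one multi-step episode with a binary-search stand-in policy."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # The action depends on everything observed so far (via low/high).
        action = (low + high) // 2
        observation, reward, done = env.step(action)
        total_reward += reward
        if observation == "higher":
            low = action + 1
        elif observation == "lower":
            high = action - 1
    return total_reward, env.steps


total, steps = run_episode(GuessEnv(secret=37))
print(total, steps)
```

A single-shot setup would correspond to allowing exactly one guess with no hints, which is why the left-hand column lists no memory and no planning.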
- Limited Interaction: Single prompt-response cycles
- No Learning: Can't adapt from interactions
- Tool Blindness: No access to external information
- Memory Loss: Each interaction starts from scratch
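
One way to see why single prompt-response cycles are limiting is to look at what gets optimized. With one step there is a single reward to score; with several steps the agent is usually judged on the discounted sum of rewards over the whole episode. The sketch below uses the standard discounted-return definition from RL; the function name `discounted_return`, the discount values, and the reward lists are illustrative assumptions, not figures from the survey.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t of gamma**t * rewards[t] for one episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Single-turn case: one response, one reward, nothing to discount.
single_turn = discounted_return([1.0])

# Multi-step case: reward arrives only after two intermediate actions.
multi_step = discounted_return([0.0, 0.0, 1.0], gamma=0.5)

print(single_turn, multi_step)
```

In the multi-step case the early actions earn no immediate reward, so a policy has to learn that they matter only through their effect on later steps.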
# Core mathematical framework (simplified)
def agentic_rl_framework():
    """
    Traditional: P(text|prompt) - single step
    Agentic: π(a_t|s_t, h_t) in POMDP - sequential decisions

    Where:
    - π: Policy (learned through RL)
    - a_t: Action at time t (text + tool calls)
    - s_t: Environment state at time t
    - h_t: History/memory up to time t
    """
    pass

# This enables:
capabilities = [
    "planning",      # Sequence of actions toward goal
    "tool_use",      # External information access
    "memory",        # Context preservation across turns
    "reasoning",     # Multi-step logical inference
    "perception",    # Multimodal environment understanding
    "self_improve"   # Learning from experience
]

- ✅ In Scope: RL empowering agentic capabilities in dynamic environments
- ❌ Out of Scope: Traditional alignment (harmlessness/helpfulness only)
- ❌ Out of Scope: Pure LLM performance on static benchmarks
- ❌ Out of Scope: Non-LLM traditional RL algorithms
- University of Oxford, Shanghai AI Laboratory, National University of Singapore
- UCL, UIUC, Brown University, Imperial College London
- Chinese Academy of Sciences, CUHK, Fudan University
This introduction sets the foundation for:
| Module | Focus | Practical Outcome |
|---|---|---|
| 2. Preliminaries | MDP/POMDP formalism | Mathematical foundation |
| 3. Capabilities | Planning, tools, memory | Component implementation |
| 4. Applications | Code, math, GUI agents | Domain-specific systems |
| 5. Systems | Environments & frameworks | Infrastructure mastery |
| 6. Challenges | Scaling, trust, safety | Research frontiers |
- Primary Survey: Agentic RL Landscape
- Paper Collection: 500+ Curated Papers
- Background Reading: Understanding LLM agents and RL fundamentals
- Next Module: 2. Preliminaries - MDP Foundations
Ready to transform your understanding of LLMs? Continue to the mathematical foundations in Module 2.