1. Introduction

This module motivates the paradigm shift from traditional LLM post-training (treating models as static sequence generators) to agentic reinforcement learning (transforming models into autonomous decision-making agents). You'll understand how capabilities like planning, tool use, memory, and self-improvement emerge naturally when RL is applied beyond single-turn alignment in dynamic environments.

Key Takeaways

  • Paradigm Evolution: LLMs evolve from passive text generators to autonomous agents through RL training
  • Dynamic Environments: Agentic RL operates in multi-step, partially observable environments vs. single-shot responses
  • Emergent Capabilities: Planning, tool use, memory, and reasoning emerge from RL optimization rather than hand-crafted heuristics
  • Research Scope: 500+ papers surveyed, focusing on RL-empowered agentic behaviors, not traditional alignment

🛠️ Prerequisites

# Verify conceptual prerequisites
echo "Do you understand basic LLM concepts? (transformers, attention)"
echo "Are you familiar with reinforcement learning basics? (MDP, rewards)"
echo "Do you have Python + ML library experience? (torch, transformers)"

# Technical setup check
python -c "import torch, transformers, gymnasium; print('✅ Core libraries ready')"

Quick Start: Understanding the Shift

Traditional LLM (Passive Generation)

# Traditional LLM: single-turn, stateless response
# (a minimal runnable sketch assuming a Hugging Face text-generation pipeline;
#  any causal LM works in place of gpt2)
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def traditional_llm(prompt):
    """Classic LLM behavior: prompt in, text out"""
    return generator(prompt, max_new_tokens=50)[0]["generated_text"]

# Example usage: no memory, no planning, no tools
response = traditional_llm("What's the weather like?")
print(response)  # Generic response, no actual weather data

Agentic LLM (Active Decision-Making)

class AgenticLLM:
    """Agentic RL agent sketch: multi-step reasoning with tools and memory."""
    def __init__(self):
        self.memory = []
        self.tools = {'weather_api': self.get_weather}
        self.planning_depth = 3  # look-ahead budget (unused in this sketch)

    def act(self, observation):
        """Multi-step agentic behavior."""
        # 1. Planning: decompose the request into steps
        plan = self.create_plan(observation)

        # 2. Tool use: access external information when the request calls for it
        weather_data = None
        if 'weather' in observation.lower():
            location = self.extract_location(observation)
            weather_data = self.tools['weather_api'](location)

        # 3. Memory: update contextual understanding
        self.memory.append({
            'observation': observation,
            'plan': plan,
            'tool_result': weather_data,
        })

        # 4. Generate a response based on actual data
        return self.synthesize_response(plan, weather_data)

    def create_plan(self, observation):
        # Placeholder planner; a real agent would decompose the goal with the LLM
        return [f"answer: {observation}"]

    def extract_location(self, observation):
        # Placeholder parser; a real agent would extract entities with the LLM
        return observation.rstrip('?').split(' in ')[-1]

    def synthesize_response(self, plan, tool_result):
        # Placeholder synthesis; a real agent would condition generation on both
        return tool_result or "I need more information."

    def get_weather(self, location):
        # Simplified stand-in for a real weather API call
        return f"Current weather in {location}: 72°F, sunny"

# Example: the agent actually plans and uses a tool
agent = AgenticLLM()
response = agent.act("What's the weather like in San Francisco?")
print(response)  # Actual weather data through tool use

The Paradigm Shift Visualization

Traditional LLM-RL (PBRFT)         Agentic RL (This Course)
┌─────────────────┐               ┌─────────────────────────┐
│  User Prompt    │               │    Environment         │
│       ↓         │               │         ↓              │
│  Single Step    │      VS       │   Multi-Step POMDP     │
│       ↓         │               │         ↓              │
│  Text Response  │               │  Action → Observation  │
│  (Terminal)     │               │     → Reward Loop      │
└─────────────────┘               └─────────────────────────┘

Static, One-shot                   Dynamic, Sequential
No Memory                          Persistent Memory
No Tools                           Tool Integration
No Planning                        Strategic Planning
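
The right-hand loop is the standard RL interaction cycle: the agent emits an action, the environment returns an observation and a reward, and the cycle repeats until the episode ends. Here is a minimal sketch of that loop using gymnasium (one of the libraries checked in the prerequisites); the CartPole-v1 environment and the random policy are illustrative stand-ins, not part of the survey:

import gymnasium as gym

# Any gymnasium environment exposes the same Action -> Observation -> Reward loop;
# CartPole-v1 is just a stand-in for an LLM-facing POMDP environment
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

for t in range(10):
    action = env.action_space.sample()  # placeholder policy; agentic RL learns this
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()
env.close()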

Why This Shift Matters

Problems with the Traditional Approach

  1. Limited Interaction: Single prompt-response cycles
  2. No Learning: Can't adapt from interactions
  3. Tool Blindness: No access to external information
  4. Memory Loss: Each interaction starts from scratch (see the sketch below)
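
To make point 4 concrete, compare the two Quick Start sketches across repeated calls (assuming both were defined as shown above):

# Traditional: every call starts from scratch; nothing persists between turns
traditional_llm("My name is Ada.")
traditional_llm("What's my name?")  # no record of the previous call exists

# Agentic: the interaction history accumulates in memory
agent = AgenticLLM()
agent.act("What's the weather like in San Francisco?")
agent.act("And the weather in London?")
print(len(agent.memory))  # 2: both turns are retained for later decisions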

Solution: Agentic RL Framework

# Core mathematical framework (simplified)
def agentic_rl_framework():
    """
    Traditional: P(text|prompt) - single step
    Agentic: π(a_t|s_t, h_t) in POMDP - sequential decisions
    
    Where:
    - π: Policy (learned through RL)
    - a_t: Action at time t (text + tool calls)
    - s_t: Environment state at time t
    - h_t: History/memory up to time t
    """
    pass

# This enables:
capabilities = [
    "planning",      # Sequence of actions toward goal
    "tool_use",      # External information access
    "memory",        # Context preservation across turns
    "reasoning",     # Multi-step logical inference
    "perception",    # Multimodal environment understanding
    "self_improve"   # Learning from experience
]
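
A toy rollout makes the docstring above concrete: the policy is queried once per step, conditioned on the current state and the accumulated history, and every step extends that history. Everything below (the echo policy, the reward rule, the state list) is a hypothetical placeholder for illustration, not the survey's formulation:

def toy_policy(state, history):
    # pi(a_t | s_t, h_t): a real agentic LLM would generate text or tool calls
    # here, conditioned on both the current state and the full history
    return f"action_for({state})"

def run_episode(states):
    history = []                             # h_t: grows across the episode
    total_reward = 0.0
    for s_t in states:                       # sequential decisions, not one-shot
        a_t = toy_policy(s_t, history)       # action depends on s_t and h_t
        r_t = 1.0 if s_t == "goal" else 0.0  # toy reward signal
        history.append((s_t, a_t, r_t))      # memory persists across steps
        total_reward += r_t
    return history, total_reward

history, ret = run_episode(["start", "tool_result", "goal"])
print(ret)           # 1.0: reward accrues over the whole trajectory
print(len(history))  # 3: one (state, action, reward) tuple per step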

Research Landscape Overview

Survey Scope (500+ Papers)

  • ✅ In Scope: RL empowering agentic capabilities in dynamic environments
  • ❌ Out of Scope: Traditional alignment (harmlessness/helpfulness only)
  • ❌ Out of Scope: Pure LLM performance on static benchmarks
  • ❌ Out of Scope: Non-LLM traditional RL algorithms

Key Research Institutions

  • University of Oxford, Shanghai AI Laboratory, National University of Singapore
  • UCL, UIUC, Brown University, Imperial College London
  • Chinese Academy of Sciences, CUHK, Fudan University

Module Structure

This introduction sets the foundation for:

Module            Focus                      Practical Outcome
2. Preliminaries  MDP/POMDP formalism        Mathematical foundation
3. Capabilities   Planning, tools, memory    Component implementation
4. Applications   Code, math, GUI agents     Domain-specific systems
5. Systems        Environments & frameworks  Infrastructure mastery
6. Challenges     Scaling, trust, safety     Research frontiers

Ready to transform your understanding of LLMs? Continue to the mathematical foundations in Module 2.