On-Policy Reinforcement Learning: From Policy Gradients to PPO

A comprehensive educational series of Jupyter notebooks teaching on-policy reinforcement learning
algorithms from fundamentals to state-of-the-art. Each notebook demonstrates the progressive evolution
of policy gradient methods using LunarLander-v3 for consistent comparison across all algorithms.
Focus: On-Policy Methods Only
This series exclusively covers on-policy reinforcement learning, where the agent learns from data
generated by its current policy. We explore the evolution from pure policy gradients to modern
policy optimization with variance reduction techniques.
Modern Relevance (2025): On-policy RL has become the de facto standard for fine-tuning Large Language Models (LLMs). Methods like PPO are used in RLHF (Reinforcement Learning from Human Feedback) for training models like ChatGPT, Claude, and Gemini. The stability advantages of on-policy methods make them ideal for the delicate process of aligning pre-trained language models with human preferences.
Key Theme: How to solve the high variance problem in policy gradients through:
- Baseline subtraction (Actor-Critic MC)
- Bootstrapping (Actor-Critic TD)
- Trust regions (PPO)
Our Learning Environment: LunarLander-v3
For this entire learning series, we use LunarLander-v3 exclusively. This environment
is perfect for RL education because it offers both discrete and continuous action
spaces, rich vector observations, and clear success/failure conditions.
Reference: Gymnasium LunarLander documentation
LunarLander-v3
Classic rocket trajectory optimization: land the lunar module safely on the landing pad.
- Observation: Box((8,), float32) covering position, velocity, angle, angular velocity, and leg ground contact
- Action spaces:
  - Discrete: 4 actions [do_nothing, fire_left, fire_main, fire_right]
  - Continuous: Box(-1, +1, (2,), float32) for [main_engine_throttle, lateral_booster_throttle]
- Rewards: distance/speed penalties, angle penalties, +10 per leg ground contact, engine firing costs, ±100 for crash/landing
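
To make the two action-space variants concrete, here is a minimal Gymnasium sketch (assuming `gymnasium[box2d]` is installed, as in the installation section below) that creates both versions and prints their spaces:

```python
import gymnasium as gym

# Discrete variant: 4 actions [do_nothing, fire_left, fire_main, fire_right]
discrete_env = gym.make("LunarLander-v3")
print(discrete_env.observation_space)  # 8D Box observation
print(discrete_env.action_space)       # Discrete(4)

# Continuous variant: 2D throttle vector [main_engine_throttle, lateral_booster_throttle]
continuous_env = gym.make("LunarLander-v3", continuous=True)
print(continuous_env.action_space)     # Box(-1.0, 1.0, (2,), float32)

discrete_env.close()
continuous_env.close()
```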
Why LunarLander-v3 for RL Education?
- Dual Action Spaces: Perfect for testing both discrete and continuous control algorithms
- Vector Observations: Rich, interpretable 8D state representation with physics-based features
- Realistic Physics: Box2D physics engine provides consistent, realistic dynamics
- Clear Success Criteria: Landing successfully gives +100, crashing gives -100
- Fast Feedback: Episodes are relatively short, enabling rapid experimentation
- Educational Value: Classic trajectory optimization problem that teaches fundamental RL concepts
- No Visual Complexity: Vector observations allow us to use simple MLPs
On-Policy Algorithm Progression
This series focuses on four fundamental on-policy RL algorithms that showcase the progression from pure policy gradients to advanced policy optimization with variance reduction techniques.
01. REINFORCE (Pure Policy Gradients)
- Focus: Vanilla policy gradients with Monte Carlo returns
- Action Space: Both discrete AND continuous
- Key Concepts: Policy gradient theorem, high variance problem, episode-based learning
- Value Function: None - pure policy optimization
- Target: Raw episode returns $G_t$
- Variance: Extremely high (demonstrates the core challenge)
- Data Usage: Fresh experience only (on-policy)
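
To make the update above concrete, here is a minimal PyTorch sketch of the REINFORCE loss; it is illustrative only, and the notebook's actual implementation may differ in details such as return normalization:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=t} gamma^(k-t) * r_k for every timestep of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Policy gradient loss: maximize E[log pi(a|s) * G_t], so minimize the negative."""
    returns = discounted_returns(rewards, gamma)
    return -(torch.stack(log_probs) * returns).sum()
```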
02. Actor-Critic (Monte Carlo)
- Focus: Adding a value function baseline to reduce variance
- Action Space: Both discrete AND continuous
- Key Concepts: Actor-critic architecture, baseline subtraction, value function learning
- Value Function: $V(s)$ trained toward Monte Carlo returns $G_t$
- Target: $G_t - V(s_t)$ (advantage estimation)
- Bellman Equation: No - still uses full episode returns for value training
- Variance: Reduced compared to REINFORCE
- Data Usage: Fresh experience only (on-policy)
- Learning Progression: Start with simple running mean $\bar{G}$ baseline, then progress to learned state-dependent value function $V(s_t)$, demonstrating how $G_t - V(s_t)$ reduces variance while maintaining unbiased estimates
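
A minimal sketch of the Monte Carlo actor-critic losses, assuming per-episode tensors of log-probabilities, critic values, and returns (the notebook's implementation may differ):

```python
import torch
import torch.nn.functional as F

def actor_critic_mc_losses(log_probs, values, returns):
    """
    Monte Carlo actor-critic losses for one episode.
    log_probs: log pi(a_t|s_t)            (tensor, shape [T])
    values:    critic predictions V(s_t)  (tensor, shape [T])
    returns:   Monte Carlo returns G_t    (tensor, shape [T])
    """
    advantages = returns - values.detach()         # baseline subtraction: G_t - V(s_t)
    policy_loss = -(log_probs * advantages).sum()  # unbiased, lower-variance gradient
    value_loss = F.mse_loss(values, returns)       # critic regresses toward G_t
    return policy_loss, value_loss
```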
03. Actor-Critic (Temporal Difference)
- Focus: Bootstrapping and the Bellman equation - the bridge to modern RL
- Action Space: Both discrete AND continuous
- Key Concepts: TD learning, bootstrapping, bias-variance tradeoff
- Value Function: $V(s)$ trained toward TD target $r_t + \gamma V(s_{t+1})$
- Target: $r_t + \gamma V(s_{t+1}) - V(s_t)$ (TD error as advantage)
- Bellman Equation: Yes - introduces bootstrapping
- Variance: Lower than MC methods, but introduces bias
- Data Usage: Fresh experience only (on-policy)
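
A minimal per-transition sketch of the TD actor-critic losses (illustrative; argument names and the single-transition form are assumptions, not the notebook's exact code):

```python
import torch
import torch.nn.functional as F

def actor_critic_td_losses(log_prob, value, next_value, reward, gamma=0.99, done=False):
    """One-step TD actor-critic losses for a single transition (s, a, r, s')."""
    # Bootstrapped target r_t + gamma * V(s_{t+1}); drop the bootstrap term at episode end.
    td_target = reward + gamma * next_value.detach() * (0.0 if done else 1.0)
    td_error = td_target - value                 # delta_t, used as the advantage
    policy_loss = -log_prob * td_error.detach()  # actor update driven by the TD error
    value_loss = F.mse_loss(value, td_target)    # critic regresses toward the TD target
    return policy_loss, value_loss
```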
04. PPO (Proximal Policy Optimization)
- Focus: Trust regions, clipped objectives, and modern policy optimization
- Action Space: Both discrete AND continuous
- Key Concepts: Importance sampling, trust regions, clipped surrogate objective, GAE
- Value Function: $V(s)$ with Generalized Advantage Estimation (GAE)
- Key Innovation: Clipped policy updates prevent destructive policy changes
- Modern Standard: State-of-the-art on-policy algorithm used in practice
- Data Usage: Fresh experience only (on-policy)
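
The clipped surrogate objective at the heart of PPO can be sketched in a few lines (illustrative; the full algorithm also combines this with a value loss and usually an entropy bonus):

```python
import torch

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """
    PPO clipped surrogate loss (to minimize).
    The probability ratio r = pi_new(a|s) / pi_old(a|s) is computed via exp(log-prob difference).
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate to turn it into a loss.
    return -torch.min(unclipped, clipped).mean()
```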
Learning Progression: The Variance-Bias Journey
The Core Problem: Policy Gradient Variance
REINFORCE: $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot G_t]$
- High Variance: $G_t$ varies wildly between episodes
- Unbiased: Uses true returns, no approximations
- On-Policy: Uses data from current policy only
Solution 1: Baselines (Reduce Variance)
Actor-Critic (MC): $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot (G_t - V(s_t))]$
- Lower Variance: Subtracting baseline $V(s_t)$ reduces variance
- Still Unbiased: Baseline doesn't change expectation
- Episode-Based: Still waits for full episodes
- On-Policy: Still uses data from current policy only
Solution 2: Bootstrapping (Reduce Variance Further)
Actor-Critic (TD): $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot \delta_t]$
Where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (TD error)
- Much Lower Variance: Single-step updates vs episode-length
- Introduces Bias: Uses $V(s_{t+1})$ estimate instead of true future returns
- Step-by-Step Learning: Can learn from individual transitions
- On-Policy: Still uses data from current policy only
Solution 3: Stable Policy Updates (Control Update Size)
PPO: Adds trust region constraints to prevent destructive policy changes
- Clipped Objective: Limits how much policy can change per update
- GAE: Sophisticated advantage estimation balancing bias and variance
- Practical Stability: Works reliably across many environments
- On-Policy: Uses fresh data for each policy update
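
Since GAE appears in both the PPO notebook and the comparison below, here is a short sketch of how the λ-weighted advantages are computed from a rollout (illustrative; the argument conventions are assumptions):

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    Generalized Advantage Estimation over one rollout.
    rewards, dones: Python lists of length T; values: list of floats of length T+1
    (detached critic predictions, including the bootstrap value V(s_T)).
    A_t = sum_l (gamma*lam)^l * delta_{t+l}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    """
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages.append(gae)
    advantages.reverse()
    return torch.tensor(advantages, dtype=torch.float32)
```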
On-Policy Algorithm Comparison Matrix

| Algorithm | Value Function | Bellman Equation | Variance | Bias | Update Frequency | Data Usage |
|---|---|---|---|---|---|---|
| REINFORCE | None | No | Very High | None | Per Episode | On-Policy |
| Actor-Critic (MC) | $V(s) \rightarrow G_t$ | No | High | None | Per Episode | On-Policy |
| Actor-Critic (TD) | $V(s) \rightarrow r + \gamma V(s')$ | Yes | Medium | Low | Per Step | On-Policy |
| PPO | $V(s)$ + GAE | Yes | Low | Low | Per Batch | On-Policy |
Key On-Policy Learning Concepts
On-Policy vs Off-Policy
On-Policy Methods (all algorithms in this series):
- Learn from data generated by the current policy being optimized
- Must collect fresh experience after each policy update
- Sample efficiency: Lower (can't reuse old data)
- Stability: Generally more stable and easier to tune
- Modern Applications: Standard for LLM fine-tuning (RLHF), robotics, continuous control
- Examples: REINFORCE, Actor-Critic, PPO, A3C
Off-Policy Methods (not covered in this series):
- Can learn from data generated by any policy (including old versions)
- Can reuse experience through replay buffers
- Sample efficiency: Higher (reuses data)
- Stability: Often less stable, harder to tune, can suffer from distribution shift
- Trade-off: More sample efficient but less stable than on-policy methods
- Examples: DQN, DDPG, TD3, SAC
Why On-Policy Dominates LLM Fine-tuning: The stability of on-policy methods is crucial when fine-tuning billion-parameter language models where catastrophic forgetting or policy collapse can destroy months of pre-training work.
Monte Carlo vs Temporal Difference
Monte Carlo Methods (REINFORCE, Actor-Critic MC):
- Use complete episode returns $G_t = \sum_{k=t}^T \gamma^{k-t} r_{k+1}$
- Unbiased but high variance
- Must wait for episode completion
- No bootstrapping
Temporal Difference Methods (Actor-Critic TD, PPO):
- Use one-step lookahead $r_t + \gamma V(s_{t+1})$
- Lower variance but introduces bias
- Can learn from incomplete episodes
- Bootstrapping: Uses estimates to update estimates
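
A tiny sketch contrasting the two value targets under the notation above:

```python
def mc_target(rewards_from_t, gamma=0.99):
    """Monte Carlo target: the full discounted return actually observed after time t."""
    return sum(gamma**k * r for k, r in enumerate(rewards_from_t))

def td_target(reward_t, value_next, gamma=0.99):
    """TD target: one real reward plus a bootstrapped estimate of everything after it."""
    return reward_t + gamma * value_next
```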
Variance Reduction Techniques
- Baseline Subtraction: $G_t - b$ where $b$ doesn't depend on actions
- Value Function Approximation: Learn $V(s)$ to predict returns
- Advantage Estimation: Use $A(s,a) = Q(s,a) - V(s)$ instead of raw returns
- Trust Regions: Limit policy update magnitude (PPO clipping)
The Bias-Variance Tradeoff in On-Policy Methods
- High Variance → Slow Learning: REINFORCE takes many episodes to converge
- Bias Introduction → Faster Learning: TD methods converge faster but to potentially suboptimal policies
- Modern Methods: Balance bias and variance for practical performance
- On-Policy Constraint: All methods must use fresh data, affecting sample efficiency
Technical Implementation
Package Structure
rl/
├── README.md                 # This documentation
├── 01.reinforce.ipynb        # Pure policy gradients
├── 02.actor-critic-mc.ipynb  # Actor-critic with Monte Carlo
├── 03.actor-critic-td.ipynb  # Actor-critic with Temporal Difference
├── 04.ppo.ipynb              # Proximal Policy Optimization
├── rl_utils/                 # Shared utility package
│   ├── __init__.py           # Package initialization
│   ├── config.py             # Configuration management
│   ├── environment.py        # Environment wrappers and preprocessing
│   ├── networks.py           # Neural network architectures
│   └── visualization.py      # Plotting and analysis functions
└── videos/                   # Generated training/test videos
Shared Infrastructure
Environment utilities (rl_utils.environment):
- preprocess_state(): Standardized state preprocessing for LunarLander
- create_env_with_wrappers(): Environment creation with video recording
- Video recording and display utilities
Neural networks (rl_utils.networks):
- PolicyNetwork: Supports both discrete and continuous actions
- ValueNetwork: State value function approximation (for notebooks 02-04)
- Automatic parameter counting and network information
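
For orientation, a policy network supporting both action-space types might look like the sketch below; this is a hypothetical illustration of the idea, not the actual rl_utils.networks.PolicyNetwork implementation:

```python
import torch
import torch.nn as nn

class SimplePolicy(nn.Module):
    """Illustrative MLP policy for LunarLander's 8D state (not the rl_utils implementation)."""

    def __init__(self, state_dim=8, action_dim=4, continuous=False, hidden=128):
        super().__init__()
        self.continuous = continuous
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        if continuous:
            self.mean = nn.Linear(hidden, action_dim)           # action_dim=2 for continuous control
            self.log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.logits = nn.Linear(hidden, action_dim)         # action_dim=4 discrete actions

    def forward(self, state):
        h = self.body(state)
        if self.continuous:
            return torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return torch.distributions.Categorical(logits=self.logits(h))
```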
Visualization (rl_utils.visualization):
- plot_training_results(): Standardized training curves
- plot_variance_analysis(): Algorithm-specific variance analysis
- plot_comparison(): Multi-algorithm performance comparison
Getting Started
Installation
# System dependencies for Box2D environment
sudo apt install swig build-essential python3-dev
# Python packages
pip install 'gymnasium[box2d]>=1.0' torch torchvision matplotlib numpy jupyter tqdm
# Clone the repository
git clone <repository-url>
cd rl
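
Optionally, a quick smoke test (a hypothetical snippet, not part of the repository) confirms that Box2D, Gymnasium, and PyTorch are installed correctly:

```python
import torch
import gymnasium as gym

env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print("observation shape:", obs.shape, "| torch version:", torch.__version__)
env.close()
```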
Key References