On-Policy Reinforcement Learning: From Policy Gradients to PPO

A comprehensive educational series of Jupyter notebooks teaching on-policy reinforcement learning
algorithms from fundamentals to state-of-the-art. Each notebook demonstrates the progressive evolution
of policy gradient methods using LunarLander-v3 for consistent comparison across all algorithms.
Focus: On-Policy Methods Only
This series exclusively covers on-policy reinforcement learning, where the agent learns from data
generated by its current policy. We explore the evolution from pure policy gradients to modern
policy optimization with variance reduction techniques.
Modern Relevance (2025): On-policy RL has become the de facto standard for fine-tuning Large Language Models (LLMs). Methods like PPO are used in RLHF (Reinforcement Learning from Human Feedback) for training models like ChatGPT, Claude, and Gemini. The stability advantages of on-policy methods make them ideal for the delicate process of aligning pre-trained language models with human preferences.
Key Theme: How to solve the high variance problem in policy gradients through:
- Baseline subtraction (Actor-Critic MC)
- Bootstrapping (Actor-Critic TD)
- Trust regions (PPO)
Our Learning Environment: LunarLander-v3
For this entire learning series, we use LunarLander-v3 exclusively. This environment
is perfect for RL education because it offers both discrete and continuous action
spaces, rich vector observations, and clear success/failure conditions.
Reference: Gymnasium LunarLander documentation
LunarLander-v3
Classic rocket trajectory optimization: land the lunar module safely on the landing pad.
- Observation: Box((8,), float32) covering position, velocity, angle, angular velocity, and leg ground contact
- Action spaces:
  - Discrete: 4 actions [do_nothing, fire_left, fire_main, fire_right]
  - Continuous: Box(-1, +1, (2,), float32) for [main_engine_throttle, lateral_booster_throttle]
- Rewards: distance/speed penalties, angle penalties, +10 per leg ground contact, engine firing costs, ±100 for crash/landing
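
To make the two action-space variants concrete, here is a minimal Gymnasium sketch (assuming `gymnasium[box2d]` is installed, as in the installation section below) that creates both versions and prints their spaces:

```python
import gymnasium as gym

# Discrete variant: 4 actions [do_nothing, fire_left, fire_main, fire_right]
discrete_env = gym.make("LunarLander-v3")
print(discrete_env.observation_space)  # 8D Box observation
print(discrete_env.action_space)       # Discrete(4)

# Continuous variant: 2D throttle vector [main_engine_throttle, lateral_booster_throttle]
continuous_env = gym.make("LunarLander-v3", continuous=True)
print(continuous_env.action_space)     # Box(-1.0, 1.0, (2,), float32)

discrete_env.close()
continuous_env.close()
```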
Why LunarLander-v3 for RL Education?
- Dual Action Spaces: Perfect for testing both discrete and continuous control algorithms
- Vector Observations: Rich, interpretable 8D state representation with physics-based features
- Realistic Physics: Box2D physics engine provides consistent, realistic dynamics
- Clear Success Criteria: Landing successfully gives +100, crashing gives -100
- Fast Feedback: Episodes are relatively short, enabling rapid experimentation
- Educational Value: Classic trajectory optimization problem that teaches fundamental RL concepts
- No Visual Complexity: Vector observations allow us to use simple MLPs
On-Policy Algorithm Progression
This series focuses on four fundamental on-policy RL algorithms that showcase the progression from pure policy gradients to advanced policy optimization with variance reduction techniques.
01. REINFORCE (Pure Policy Gradients)
- Focus: Vanilla policy gradients with Monte Carlo returns
- Action Space: Both discrete AND continuous
- Key Concepts: Policy gradient theorem, high variance problem, episode-based learning
- Value Function: None - pure policy optimization
- Target: Raw episode returns $G_t$
- Variance: Extremely high (demonstrates the core challenge)
- Data Usage: Fresh experience only (on-policy)
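
To make the update above concrete, here is a minimal PyTorch sketch of the REINFORCE loss; it is illustrative only, and the notebook's actual implementation may differ in details such as return normalization:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=t} gamma^(k-t) * r_k for every timestep of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns)

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Policy gradient loss: maximize E[log pi(a|s) * G_t], so minimize the negative."""
    returns = discounted_returns(rewards, gamma)
    return -(torch.stack(log_probs) * returns).sum()
```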
02. Actor-Critic (Monte Carlo)
- Focus: Adding a value function baseline to reduce variance
- Action Space: Both discrete AND continuous
- Key Concepts: Actor-critic architecture, baseline subtraction, value function learning
- Value Function: $V(s)$ trained toward Monte Carlo returns $G_t$
- Target: $G_t - V(s_t)$ (advantage estimation)
- Bellman Equation: No - still uses full episode returns for value training
- Variance: Reduced compared to REINFORCE
- Data Usage: Fresh experience only (on-policy)
- Learning Progression: Start with simple running mean $\bar{G}$ baseline, then progress to learned state-dependent value function $V(s_t)$, demonstrating how $G_t - V(s_t)$ reduces variance while maintaining unbiased estimates
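
A minimal sketch of the Monte Carlo actor-critic losses, assuming per-episode tensors of log-probabilities, critic values, and returns (the notebook's implementation may differ):

```python
import torch
import torch.nn.functional as F

def actor_critic_mc_losses(log_probs, values, returns):
    """
    Monte Carlo actor-critic losses for one episode.
    log_probs: log pi(a_t|s_t)            (tensor, shape [T])
    values:    critic predictions V(s_t)  (tensor, shape [T])
    returns:   Monte Carlo returns G_t    (tensor, shape [T])
    """
    advantages = returns - values.detach()         # baseline subtraction: G_t - V(s_t)
    policy_loss = -(log_probs * advantages).sum()  # unbiased, lower-variance gradient
    value_loss = F.mse_loss(values, returns)       # critic regresses toward G_t
    return policy_loss, value_loss
```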
03. Actor-Critic (Temporal Difference)
- Focus: Bootstrapping and the Bellman equation - the bridge to modern RL
- Action Space: Both discrete AND continuous
- Key Concepts: TD learning, bootstrapping, bias-variance tradeoff
- Value Function: $V(s)$ trained toward TD target $r_t + \gamma V(s_{t+1})$
- Target: $r_t + \gamma V(s_{t+1}) - V(s_t)$ (TD error as advantage)
- Bellman Equation: Yes - introduces bootstrapping
- Variance: Lower than MC methods, but introduces bias
- Data Usage: Fresh experience only (on-policy)
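
A minimal per-transition sketch of the TD actor-critic losses (illustrative; argument names and the single-transition form are assumptions, not the notebook's exact code):

```python
import torch
import torch.nn.functional as F

def actor_critic_td_losses(log_prob, value, next_value, reward, gamma=0.99, done=False):
    """One-step TD actor-critic losses for a single transition (s, a, r, s')."""
    # Bootstrapped target r_t + gamma * V(s_{t+1}); drop the bootstrap term at episode end.
    td_target = reward + gamma * next_value.detach() * (0.0 if done else 1.0)
    td_error = td_target - value                 # delta_t, used as the advantage
    policy_loss = -log_prob * td_error.detach()  # actor update driven by the TD error
    value_loss = F.mse_loss(value, td_target)    # critic regresses toward the TD target
    return policy_loss, value_loss
```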
04. PPO (Proximal Policy Optimization)
- Focus: Trust regions, clipped objectives, and modern policy optimization
- Action Space: Both discrete AND continuous
- Key Concepts: Importance sampling, trust regions, clipped surrogate objective, GAE
- Value Function: $V(s)$ with Generalized Advantage Estimation (GAE)
- Key Innovation: Clipped policy updates prevent destructive policy changes
- Modern Standard: State-of-the-art on-policy algorithm used in practice
- Data Usage: Fresh experience only (on-policy)
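
The clipped surrogate objective at the heart of PPO can be sketched in a few lines (illustrative; the full algorithm also combines this with a value loss and usually an entropy bonus):

```python
import torch

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """
    PPO clipped surrogate loss (to minimize).
    The probability ratio r = pi_new(a|s) / pi_old(a|s) is computed via exp(log-prob difference).
    """
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective, then negate to turn it into a loss.
    return -torch.min(unclipped, clipped).mean()
```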
Learning Progression: The Variance-Bias Journey
The Core Problem: Policy Gradient Variance
REINFORCE: $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot G_t]$
- High Variance: $G_t$ varies wildly between episodes
- Unbiased: Uses true returns, no approximations
- On-Policy: Uses data from current policy only
Solution 1: Baselines (Reduce Variance)
Actor-Critic (MC): $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot (G_t - V(s_t))]$
- Lower Variance: Subtracting baseline $V(s_t)$ reduces variance
- Still Unbiased: Baseline doesn't change expectation
- Episode-Based: Still waits for full episodes
- On-Policy: Still uses data from current policy only
Solution 2: Bootstrapping (Reduce Variance Further)
Actor-Critic (TD): $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot \delta_t]$
Where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (TD error)
- Much Lower Variance: Single-step updates vs episode-length
- Introduces Bias: Uses $V(s_{t+1})$ estimate instead of true future returns
- Step-by-Step Learning: Can learn from individual transitions
- On-Policy: Still uses data from current policy only
Solution 3: Stable Policy Updates (Control Update Size)
PPO: Adds trust region constraints to prevent destructive policy changes
- Clipped Objective: Limits how much policy can change per update
- GAE: Sophisticated advantage estimation balancing bias and variance
- Practical Stability: Works reliably across many environments
- On-Policy: Uses fresh data for each policy update
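
Since GAE appears in both the PPO notebook and the comparison below, here is a short sketch of how the λ-weighted advantages are computed from a rollout (illustrative; the argument conventions are assumptions):

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    Generalized Advantage Estimation over one rollout.
    rewards, dones: Python lists of length T; values: list of floats of length T+1
    (detached critic predictions, including the bootstrap value V(s_T)).
    A_t = sum_l (gamma*lam)^l * delta_{t+l}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    """
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages.append(gae)
    advantages.reverse()
    return torch.tensor(advantages, dtype=torch.float32)
```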
On-Policy Algorithm Comparison Matrix

| Algorithm | Value Function | Bellman Equation | Variance | Bias | Update Frequency | Data Usage |
|---|---|---|---|---|---|---|
| REINFORCE | None | No | Very High | None | Per Episode | On-Policy |
| Actor-Critic (MC) | $V(s) \rightarrow G_t$ | No | High | None | Per Episode | On-Policy |
| Actor-Critic (TD) | $V(s) \rightarrow r + \gamma V(s')$ | Yes | Medium | Low | Per Step | On-Policy |
| PPO | $V(s)$ + GAE | Yes | Low | Low | Per Batch | On-Policy |
Key On-Policy Learning Concepts
On-Policy vs Off-Policy
On-Policy Methods (all algorithms in this series):
- Learn from data generated by the current policy being optimized
- Must collect fresh experience after each policy update
- Sample efficiency: Lower (can't reuse old data)
- Stability: Generally more stable and easier to tune
- Modern Applications: Standard for LLM fine-tuning (RLHF), robotics, continuous control
- Examples: REINFORCE, Actor-Critic, PPO, A3C
Off-Policy Methods (not covered in this series):
- Can learn from data generated by any policy (including old versions)
- Can reuse experience through replay buffers
- Sample efficiency: Higher (reuses data)
- Stability: Often less stable, harder to tune, can suffer from distribution shift
- Trade-off: More sample efficient but less stable than on-policy methods
- Examples: DQN, DDPG, TD3, SAC
Why On-Policy Dominates LLM Fine-tuning: The stability of on-policy methods is crucial when fine-tuning billion-parameter language models where catastrophic forgetting or policy collapse can destroy months of pre-training work.
Monte Carlo vs Temporal Difference
Monte Carlo Methods (REINFORCE, Actor-Critic MC):
- Use complete episode returns $G_t = \sum_{k=t}^T \gamma^{k-t} r_{k+1}$
- Unbiased but high variance
- Must wait for episode completion
- No bootstrapping
Temporal Difference Methods (Actor-Critic TD, PPO):
- Use one-step lookahead $r_t + \gamma V(s_{t+1})$
- Lower variance but introduces bias
- Can learn from incomplete episodes
- Bootstrapping: Uses estimates to update estimates
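
A tiny sketch contrasting the two value targets under the notation above:

```python
def mc_target(rewards_from_t, gamma=0.99):
    """Monte Carlo target: the full discounted return actually observed after time t."""
    return sum(gamma**k * r for k, r in enumerate(rewards_from_t))

def td_target(reward_t, value_next, gamma=0.99):
    """TD target: one real reward plus a bootstrapped estimate of everything after it."""
    return reward_t + gamma * value_next
```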
Variance Reduction Techniques
- Baseline Subtraction: $G_t - b$ where $b$ doesn't depend on actions
- Value Function Approximation: Learn $V(s)$ to predict returns
- Advantage Estimation: Use $A(s,a) = Q(s,a) - V(s)$ instead of raw returns
- Trust Regions: Limit policy update magnitude (PPO clipping)
The Bias-Variance Tradeoff in On-Policy Methods
- High Variance → Slow Learning: REINFORCE takes many episodes to converge
- Bias Introduction → Faster Learning: TD methods converge faster but to potentially suboptimal policies
- Modern Methods: Balance bias and variance for practical performance
- On-Policy Constraint: All methods must use fresh data, affecting sample efficiency
Technical Implementation
Package Structure
rl/
├── README.md                 # This documentation
├── 01.reinforce.ipynb        # Pure policy gradients
├── 02.actor-critic-mc.ipynb  # Actor-critic with Monte Carlo
├── 03.actor-critic-td.ipynb  # Actor-critic with Temporal Difference
├── 04.ppo.ipynb              # Proximal Policy Optimization
├── rl_utils/                 # Shared utility package
│   ├── __init__.py           # Package initialization
│   ├── config.py             # Configuration management
│   ├── environment.py        # Environment wrappers and preprocessing
│   ├── networks.py           # Neural network architectures
│   └── visualization.py      # Plotting and analysis functions
└── videos/                   # Generated training/test videos
Shared Infrastructure
Environment utilities (rl_utils.environment):
- preprocess_state(): Standardized state preprocessing for LunarLander
- create_env_with_wrappers(): Environment creation with video recording
- Video recording and display utilities
Neural networks (rl_utils.networks):
- PolicyNetwork: Supports both discrete and continuous actions
- ValueNetwork: State value function approximation (for notebooks 02-04)
- Automatic parameter counting and network information
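
For orientation, a policy network supporting both action-space types might look like the sketch below; this is a hypothetical illustration of the idea, not the actual rl_utils.networks.PolicyNetwork implementation:

```python
import torch
import torch.nn as nn

class SimplePolicy(nn.Module):
    """Illustrative MLP policy for LunarLander's 8D state (not the rl_utils implementation)."""

    def __init__(self, state_dim=8, action_dim=4, continuous=False, hidden=128):
        super().__init__()
        self.continuous = continuous
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        if continuous:
            self.mean = nn.Linear(hidden, action_dim)           # action_dim=2 for continuous control
            self.log_std = nn.Parameter(torch.zeros(action_dim))
        else:
            self.logits = nn.Linear(hidden, action_dim)         # action_dim=4 discrete actions

    def forward(self, state):
        h = self.body(state)
        if self.continuous:
            return torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return torch.distributions.Categorical(logits=self.logits(h))
```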
Visualization (rl_utils.visualization):
- plot_training_results(): Standardized training curves
- plot_variance_analysis(): Algorithm-specific variance analysis
- plot_comparison(): Multi-algorithm performance comparison
Getting Started
Installation
# System dependencies for Box2D environment
sudo apt install swig build-essential python3-dev
# Python packages
pip install 'gymnasium[box2d]>=1.0' torch torchvision matplotlib numpy jupyter tqdm
# Clone the repository
git clone <repository-url>
cd rl
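
Optionally, a quick smoke test (a hypothetical snippet, not part of the repository) confirms that Box2D, Gymnasium, and PyTorch are installed correctly:

```python
import torch
import gymnasium as gym

env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
print("observation shape:", obs.shape, "| torch version:", torch.__version__)
env.close()
```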
Key References