on-policy-rl

On-Policy Reinforcement Learning: From Policy Gradients to PPO


A comprehensive educational series of Jupyter notebooks teaching on-policy reinforcement learning algorithms, from the fundamentals to the state of the art. The notebooks trace the progressive evolution of policy gradient methods, using LunarLander-v3 throughout so that all algorithms can be compared on the same task.

🎯 Focus: On-Policy Methods Only

This series exclusively covers on-policy reinforcement learning, where the agent learns from data generated by its current policy. We explore the evolution from pure policy gradients to modern policy optimization with variance reduction techniques.

Modern Relevance (2025): On-policy RL has become the de facto standard for fine-tuning Large Language Models (LLMs). Methods like PPO are used in RLHF (Reinforcement Learning from Human Feedback) for training models like ChatGPT, Claude, and Gemini. The stability advantages of on-policy methods make them ideal for the delicate process of aligning pre-trained language models with human preferences.

Key Theme: How to solve the high variance problem in policy gradients through:

  1. Baseline subtraction (Actor-Critic MC)
  2. Bootstrapping (Actor-Critic TD)
  3. Trust regions (PPO)

🚀 Our Learning Environment: LunarLander-v3

For this entire learning series, we use LunarLander-v3 exclusively. This environment is perfect for RL education because it offers both discrete and continuous action spaces, rich vector observations, and clear success/failure conditions.

Reference: Gymnasium Lunar Lander

LunarLander-v3 🚀

Classic rocket trajectory optimization - land the lunar module safely on the landing pad.

Observation: Box((8,), float32) - position, velocity, angle, angular velocity, and leg ground-contact flags

Action Spaces: Discrete(4) (do nothing, fire left engine, fire main engine, fire right engine), or a continuous Box((2,), float32) thrust vector when the environment is created with continuous=True

Rewards: Distance/speed penalties, angle penalties, +10 per leg contact, engine costs, ±100 for crash/landing
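For orientation, the snippet below shows how the environment is typically created with Gymnasium; the continuous=True option is a standard Gymnasium flag for LunarLander, not something specific to this repository:

```python
import gymnasium as gym

# Discrete version: 4 actions (do nothing, left engine, main engine, right engine)
env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=0)
print(env.observation_space)  # Box(8,) -> position, velocity, angle, leg contacts
print(env.action_space)       # Discrete(4)

# Continuous version: 2-dimensional thrust vector in [-1, 1]
env_cont = gym.make("LunarLander-v3", continuous=True)
print(env_cont.action_space)  # Box(2,)

# One random step to illustrate the (obs, reward, terminated, truncated, info) API
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
env.close()
env_cont.close()
```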

Why LunarLander-v3 for RL Education?

📚 On-Policy Algorithm Progression

This series focuses on four fundamental on-policy RL algorithms that showcase the progression from pure policy gradients to advanced policy optimization with variance reduction techniques.

01. REINFORCE (Pure Policy Gradients)

02. Actor-Critic (Monte Carlo)

03. Actor-Critic (Temporal Difference)

04. PPO (Proximal Policy Optimization)

🎯 Learning Progression: The Variance-Bias Journey

The Core Problem: Policy Gradient Variance

REINFORCE: $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot G_t]$
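Because the full Monte Carlo return $G_t$ multiplies every log-probability term, a single unusual trajectory can swing the gradient estimate wildly. A minimal PyTorch sketch of this estimator (the network size and the dummy rollout data are illustrative assumptions, not the notebook's exact code):

```python
import torch
import torch.nn as nn

# Toy policy network for an 8-dim observation and 4 discrete actions (LunarLander sizes).
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))

def reinforce_loss(states, actions, returns):
    """REINFORCE surrogate loss: -E[log pi(a|s) * G_t]."""
    logits = policy(states)  # (T, 4)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)  # (T,)
    # The raw Monte Carlo return G_t scales every log-prob directly,
    # which is exactly why this gradient estimate has such high variance.
    return -(log_probs * returns).mean()

# Dummy rollout data just to show the shapes involved.
states = torch.randn(5, 8)
actions = torch.randint(0, 4, (5,))
returns = torch.tensor([120.0, 95.0, 60.0, 30.0, 10.0])  # discounted returns G_t
loss = reinforce_loss(states, actions, returns)
loss.backward()
```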

Solution 1: Baselines (Reduce Variance)

Actor-Critic (MC): $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot (G_t - V(s_t))]$
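A minimal sketch of the baseline-subtracted update, assuming a small PyTorch actor and critic (illustrative sizes, not the notebook's exact implementation):

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 4
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def actor_critic_mc_losses(states, actions, returns):
    """Monte Carlo actor-critic: baseline-subtracted policy loss + value regression."""
    values = critic(states).squeeze(-1)              # V(s_t)
    advantages = returns - values.detach()           # G_t - V(s_t), no gradient into the critic
    log_probs = torch.distributions.Categorical(
        logits=actor(states)).log_prob(actions)
    policy_loss = -(log_probs * advantages).mean()   # lower-variance policy gradient
    value_loss = nn.functional.mse_loss(values, returns)  # fit V(s_t) toward G_t
    return policy_loss, value_loss

states = torch.randn(5, obs_dim)
actions = torch.randint(0, n_actions, (5,))
returns = torch.tensor([120.0, 95.0, 60.0, 30.0, 10.0])
policy_loss, value_loss = actor_critic_mc_losses(states, actions, returns)
(policy_loss + 0.5 * value_loss).backward()
```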

Solution 2: Bootstrapping (Reduce Variance Further)

Actor-Critic (TD): $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot \delta_t]$

Where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (TD error)
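Because the TD error only needs the value estimate of the next state, the update can be computed from a single transition instead of a full episode. A minimal sketch, again with illustrative PyTorch networks rather than the notebook's exact code:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def td_actor_critic_losses(s, a, r, s_next, done):
    """One-step TD actor-critic losses computed from a single transition."""
    v = critic(s).squeeze(-1)
    with torch.no_grad():
        v_next = critic(s_next).squeeze(-1)
        td_target = r + gamma * (1.0 - done) * v_next   # r_t + gamma * V(s_{t+1})
    td_error = td_target - v                            # delta_t, used as the advantage
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    policy_loss = -(log_prob * td_error.detach()).mean()
    value_loss = td_error.pow(2).mean()                 # drive V(s_t) toward the TD target
    return policy_loss, value_loss

# Single-transition example (batch of 1), as produced by one env.step() call.
s, s_next = torch.randn(1, obs_dim), torch.randn(1, obs_dim)
a = torch.randint(0, n_actions, (1,))
r, done = torch.tensor([1.5]), torch.tensor([0.0])
policy_loss, value_loss = td_actor_critic_losses(s, a, r, s_next, done)
(policy_loss + 0.5 * value_loss).backward()
```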

Solution 3: Stable Policy Updates (Control Update Size)

PPO: Adds trust region constraints to prevent destructive policy changes
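The heart of PPO is the clipped surrogate objective, which caps how far the new policy's probability ratio can move away from 1 within a single update. A minimal sketch of that loss (the tensors shown are placeholders for a real rollout batch):

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (to be minimized)."""
    # Probability ratio between the current policy and the policy that collected the data.
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio outside [1-eps, 1+eps].
    return -torch.min(unclipped, clipped).mean()

# Illustrative numbers; in practice these come from a batch of rollout data.
log_probs_old = torch.log(torch.tensor([0.25, 0.10, 0.40]))
log_probs_new = torch.log(torch.tensor([0.30, 0.05, 0.45]))
advantages = torch.tensor([1.0, -0.5, 2.0])
loss = ppo_clipped_loss(log_probs_new, log_probs_old, advantages)
```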

📊 On-Policy Algorithm Comparison Matrix

| Algorithm | Value Function | Bellman Equation | Variance | Bias | Update Frequency | Data Usage |
|---|---|---|---|---|---|---|
| REINFORCE | None | ❌ No | Very High | None | Per Episode | On-Policy |
| Actor-Critic (MC) | $V(s) \rightarrow G_t$ | ❌ No | High | None | Per Episode | On-Policy |
| Actor-Critic (TD) | $V(s) \rightarrow r + \gamma V(s')$ | ✅ Yes | Medium | Low | Per Step | On-Policy |
| PPO | $V(s)$ + GAE | ✅ Yes | Low | Low | Per Batch | On-Policy |

🔄 Key On-Policy Learning Concepts

On-Policy vs Off-Policy

On-Policy Methods (all algorithms in this series): every update uses data collected by the current policy, so fresh rollouts are needed after each policy change; this keeps learning stable but is sample-hungry.

Off-Policy Methods (not covered in this series): learn from data generated by a different, often older, behavior policy (e.g., replay-buffer methods such as DQN or SAC), which reuses data more aggressively but is harder to keep stable.

Why On-Policy Dominates LLM Fine-tuning: The stability of on-policy methods is crucial when fine-tuning billion-parameter language models where catastrophic forgetting or policy collapse can destroy months of pre-training work.

Monte Carlo vs Temporal Difference

Monte Carlo Methods (REINFORCE, Actor-Critic MC): wait for the episode to end and use the complete return $G_t$ as the learning signal; unbiased, but high variance.

Temporal Difference Methods (Actor-Critic TD, PPO): bootstrap from the current value estimate $V(s_{t+1})$ after each step; lower variance at the cost of some bias, as illustrated in the sketch below.
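The practical difference is when the learning signal becomes available, shown here with made-up numbers: Monte Carlo must sweep backward over a finished episode, while a TD target exists after every single step.

```python
import torch

gamma = 0.99
rewards = torch.tensor([1.0, 0.0, -0.5, 2.0, 100.0])   # one finished episode

# Monte Carlo: discounted returns G_t computed backward over the whole episode.
returns = torch.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# Temporal Difference: one-step targets available immediately after each transition,
# using the current estimate of V(s_{t+1}) in place of the remaining rewards.
values_next = torch.tensor([50.0, 48.0, 47.0, 60.0, 0.0])  # pretend V(s_{t+1}) estimates
td_targets = rewards + gamma * values_next
```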

Variance Reduction Techniques

  1. Baseline Subtraction: $G_t - b$ where $b$ doesn't depend on actions
  2. Value Function Approximation: Learn $V(s)$ to predict returns
  3. Advantage Estimation: Use $A(s,a) = Q(s,a) - V(s)$ instead of raw returns (see the GAE sketch below)
  4. Trust Regions: Limit policy update magnitude (PPO clipping)
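The comparison matrix above mentions GAE; the sketch below shows the usual GAE($\lambda$) recursion over one-step TD errors as commonly paired with PPO (the $\lambda$ value, array handling, and bootstrap value are illustrative assumptions):

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: exponentially weighted sum of TD errors."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with V(s_{t+1}) masked at episode end
        delta = rewards[t] + gamma * values[t + 1] * (1.0 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    return advantages

# `values` has one extra entry: the bootstrap value V(s_T) for the state after the last step.
rewards = torch.tensor([1.0, 0.5, -0.2, 2.0])
values = torch.tensor([10.0, 9.5, 9.0, 8.0, 7.5])
dones = torch.tensor([0.0, 0.0, 0.0, 1.0])
advantages = compute_gae(rewards, values, dones)
returns = advantages + values[:-1]   # regression targets for the value function
```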

The Bias-Variance Tradeoff in On-Policy Methods

Moving down the comparison matrix, each algorithm trades a small amount of bias (introduced by bootstrapped value estimates) for a substantial reduction in variance; PPO accepts that small bias in exchange for updates stable enough for large-scale training.

πŸ› οΈ Technical Implementation

Package Structure

rl/
├── README.md                      # This documentation
├── 01.reinforce.ipynb            # Pure policy gradients
├── 02.actor-critic-mc.ipynb      # Actor-critic with Monte Carlo
├── 03.actor-critic-td.ipynb      # Actor-critic with Temporal Difference
├── 04.ppo.ipynb                  # Proximal Policy Optimization
├── rl_utils/                      # Shared utility package
│   ├── __init__.py               # Package initialization
│   ├── config.py                 # Configuration management
│   ├── environment.py            # Environment wrappers and preprocessing
│   ├── networks.py               # Neural network architectures
│   └── visualization.py          # Plotting and analysis functions
└── videos/                        # Generated training/test videos

Shared Infrastructure

Environment utilities (rl_utils.environment): environment creation, wrappers, and preprocessing shared by all notebooks.

Neural networks (rl_utils.networks): shared neural network architectures used across the notebooks (a representative architecture is sketched after this list).

Visualization (rl_utils.visualization): plotting and analysis helpers for training curves and evaluation results.
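As a rough illustration of the kind of architecture such a networks module typically provides for LunarLander's 8-dimensional observations, here is a hypothetical sketch in plain PyTorch; it is not the actual rl_utils code:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Illustrative shared-backbone actor-critic for LunarLander (hypothetical, not rl_utils code)."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, n_actions)   # action logits
        self.value_head = nn.Linear(hidden, 1)            # state value V(s)

    def forward(self, obs: torch.Tensor):
        features = self.backbone(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)

model = ActorCritic()
logits, value = model(torch.randn(1, 8))
action = torch.distributions.Categorical(logits=logits).sample()
```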

🚀 Getting Started

Installation

# System dependencies for Box2D environment
sudo apt install swig build-essential python3-dev

# Python packages
pip install 'gymnasium[box2d]>=1.0' torch torchvision matplotlib numpy jupyter tqdm

# Clone the repository
git clone <repository-url>
cd rl
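To check that Box2D and the environment import correctly, a quick smoke test along these lines can help (illustrative, not part of the repository):

```python
import gymnasium as gym

# If the box2d extra is missing, gym.make raises a dependency error here.
env = gym.make("LunarLander-v3")
obs, info = env.reset(seed=42)
for _ in range(10):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print("LunarLander-v3 is working, observation shape:", obs.shape)
```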

📚 Key References