VAE and Diffusion Models: A Step-by-Step Guide

This repository contains educational Jupyter notebooks that explore the theory and
implementation of generative models, specifically Variational Autoencoders (VAEs)
and Diffusion Models. These notebooks are designed to provide a clear, step-by-step
understanding of how these powerful generative AI techniques work.
đź“– Blog Post: For a gist and overview of the concepts covered in this repository,
check out the accompanying blog post: From VAEs to Diffusion Models: A Step-by-Step
Journey
🚀 Installation
To run the notebooks in this repository, you’ll need Python 3.8+ and the following dependencies:
```bash
pip install 'torch>=2.0.0' 'torchvision>=0.15.0' 'matplotlib>=3.5.0' 'tqdm>=4.64.0'
```
The main dependencies are:
- PyTorch (for deep learning models)
- torchvision (for computer vision utilities)
- matplotlib (for visualization)
- tqdm (for progress bars)
📚 What You’ll Learn
- The mathematical foundations behind VAEs and diffusion models
- How to implement these models from scratch using PyTorch
- The connection between VAEs and diffusion models
- How to generate MNIST digits using these techniques
🗂️ Notebook Overview
01.vae.ipynb – Variational Autoencoders
This notebook covers the fundamental concepts of Variational Autoencoders (VAEs):
- Motivation behind generative models and latent variable models
- The Evidence Lower Bound (ELBO) objective
- Reparameterization trick for training VAEs (see the sketch after this list)
- Implementation of VAEs using both Multi-Layer Perceptron (MLP) and Convolutional
Neural Network (CNN) architectures
- Exploration of both Bernoulli and Gaussian likelihoods for the decoder
- Comparison of different likelihood models and architectures, and their effects on
image generation quality for MNIST digits
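To make the ELBO and the reparameterization trick concrete, here is a minimal PyTorch sketch. `TinyVAE`, its layer sizes, and the 16-dimensional latent space are illustrative assumptions rather than the notebook's exact architecture; the decoder here uses a Bernoulli (binary cross-entropy) likelihood:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal MLP VAE for 28x28 MNIST digits (sizes are illustrative)."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 784))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, so gradients flow
        # through mu and sigma while the randomness lives in eps ~ N(0, I).
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(z), mu, logvar  # decoder outputs Bernoulli logits

def negative_elbo(logits, x, mu, logvar):
    """Negative ELBO = reconstruction loss + KL(q(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(
        logits, x.view(x.size(0), -1), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)
```

Minimizing the negative ELBO trains the encoder and decoder jointly; swapping the binary cross-entropy term for a Gaussian log-likelihood gives the Gaussian-decoder variant compared in the notebook.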
The second notebook explores a “VAE-like” model that omits the encoder:
- One-step fixed corruption process as “inference”
- Connection between this simplified model and diffusion models
- Implementation of a noise prediction objective (sketched after this list)
- Demonstration of how this approach performs on MNIST data
- Comparison of a classical CNN architecture with a U-Net architecture
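As a rough illustration of the noise prediction objective under a one-step fixed corruption, consider the sketch below. The convex mixing weights in `corrupt` and the `model(x_noisy)` signature are assumptions for illustration, not the notebook's exact parameterization:

```python
import torch
import torch.nn.functional as F

def corrupt(x0, noise_level=0.5):
    """One-step fixed corruption: blend clean images with Gaussian noise.
    The mixing weights here are an illustrative choice."""
    eps = torch.randn_like(x0)
    x_noisy = (1 - noise_level) * x0 + noise_level * eps
    return x_noisy, eps

def noise_prediction_loss(model, x0):
    # The network never sees the clean image at train time: it receives the
    # corrupted input and regresses the exact noise that was injected.
    x_noisy, eps = corrupt(x0)
    return F.mse_loss(model(x_noisy), eps)
```

Because the corruption is fixed rather than learned, there is nothing for an encoder to infer; this is exactly the structure that diffusion models extend from one noise level to many.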
The third and most comprehensive notebook dives into diffusion models:
- Forward and reverse processes in diffusion models
- Beta schedules (linear)
- Variational inference and the ELBO objective for diffusion
- Simplified noise prediction objective
- Implementation of a diffusion model (sketched after this list) with:
  - U-Net architecture
  - Time embedding and conditioning
  - Linear beta schedule
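The forward process and the simplified training objective fit in a few lines; here is a minimal sketch. The horizon `T = 1000`, the beta endpoints, the embedding size, and the `model(x_t, t)` signature (a U-Net conditioned on the time embedding) are standard DDPM-style assumptions rather than the notebook's exact values:

```python
import math
import torch
import torch.nn.functional as F

T = 1000  # number of diffusion steps (assumed)

# Linear beta schedule: beta_t rises linearly from 1e-4 to 0.02 over T steps.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # abar_t = prod of alpha_s for s <= t

def q_sample(x0, t, eps):
    """Forward process in closed form, for image batches shaped (B, C, H, W):
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    ab = alpha_bars.to(x0.device)[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * eps

def timestep_embedding(t, dim=128):
    """Sinusoidal time embedding fed to the U-Net (dim is an assumption)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    angles = t[:, None].float() * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def diffusion_loss(model, x0):
    # Simplified objective: sample a random step t, noise x0 to x_t, and
    # train the network to predict the injected noise with an MSE loss.
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = q_sample(x0, t, eps)
    return F.mse_loss(model(x_t, t), eps)
```

This MSE loss is the simplified form of the diffusion ELBO: dropping the per-step weights that variational inference prescribes leaves a plain noise-regression objective.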
🔜 Next Steps
- Conditional generation (e.g., on class labels or text) with VAEs and diffusion models.
- VAEs and diffusion models working together, e.g., a VAE as the encoder and a diffusion model as the decoder.
- Discrete modalities (e.g., tables, graphs, text) with non-Gaussian, non-continuous distributions, e.g., the discrete categorical distribution.