Variational Inference

An introduction to how generative models work

The blog post is based primarily on the excellent overview paper by Calvin Luo; for a detailed treatment of the proofs and implementation, check out his paper.

Intro, Surprise, and Cross-Entropy

Cross Entropy

KL Divergence

Variational Inference

Training loop in a nutshell for learning the model distribution from samples (given we know $P(z)$)

  1. Take a data point $x$ from the training dataset
  2. Randomly sample many candidate $z_k$ from the prior distribution $P(z)$ (the prior is assumed known here; what if we don't know it?)
  3. Map each $z_k$ through the NN and get the parameters of $P_\theta(x\mid z_k)$ for each $z_k$.
  4. For each $z_k$ compute $P_\theta(x\mid z_k)$
  5. Estimate the log-likelihood by the Monte Carlo average $\log\left[\frac{1}{K}\sum_{k=1}^{K} P_\theta (x\mid z_k)\right]$, a sample estimate of $\log \sum_z P_\theta(x \mid z)P(z)$
  6. Iteratively update $\theta$ (using gradient descent on the negative log-likelihood) to maximize the log-likelihood across the dataset.
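The loop above can be sketched numerically. Below is a toy 1-D version in which a hypothetical linear "decoder" with a single weight `theta` stands in for the neural network, assuming a standard normal prior and a unit-variance Gaussian likelihood. For this linear-Gaussian toy the marginal likelihood is also available in closed form, which lets us check the Monte Carlo estimate of step 5:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(z, theta):
    """Toy 1-D "decoder": maps a latent z to the mean of a unit-variance
    Gaussian over x. A single slope `theta` stands in for the NN weights."""
    return theta * z

def log_likelihood_mc(x, theta, z):
    """Steps 3-5: Monte Carlo estimate of log P_theta(x) = log E_{z~P(z)}[P_theta(x|z)]."""
    mu = decoder_mean(z, theta)                                # step 3: decoder parameters
    lik = np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)   # step 4: N(x; mu, 1)
    return np.log(lik.mean())                                  # step 5: log of the sample average

x, theta = 1.0, 2.0
z = rng.standard_normal(200_000)  # step 2: z_k ~ P(z) = N(0, 1)

# For this toy model the marginal is exact: P_theta(x) = N(x; 0, theta^2 + 1).
v = theta**2 + 1
exact = -0.5 * np.log(2 * np.pi * v) - 0.5 * x**2 / v
err = abs(log_likelihood_mc(x, theta, z) - exact)  # small Monte Carlo error
```

With enough samples the estimate tracks the exact marginal closely; the inefficiency of sampling $z_k$ blindly from the prior is exactly what motivates the recognition model below.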

Higher-Dimensional latent space and Evidence lower bound

Implementation Notes about the prior $P_\theta (z)$

Updated Training loop

  1. Take a data point $x$ from the training dataset
  2. Pass it through the NN of the recognition model to get the parameters of the importance-sampling distribution $Q_\theta(z\mid x)$.
  3. Use the $Q_\theta$ distribution to efficiently sample $z_k$ relative to the $P_\theta(z)$ prior distribution.
  4. Map each $z_k$ through the NN and get the parameters of $P_\theta(x\mid z_k)$ for each $z_k$.
  5. For each $z_k$ compute $P_\theta(x \mid z_k)$
  6. Compute the ELBO objective (which is to be maximized): \(E[\log P_\theta(x\mid z)] - D_{KL}[Q_\theta(z\mid x)\mid \mid P_\theta(z)]\)
  7. Iteratively update $\theta$ (using gradient descent) to maximize the ELBO objective across the dataset.

Note: neither the recognition model nor the generative model outputs a distribution directly; each outputs only the distribution's parameters, which must be plugged into the Gaussian density formula to evaluate probabilities or draw samples.

Under the isotropic Gaussian assumption, the accuracy (reconstruction) term in the ELBO is proportional to a squared error: $\log P_\theta (x \mid z) \propto - \mid \mid x-\mu \mid \mid ^2$.

The KL-divergence term in the ELBO (under the same Gaussian assumption) can also be calculated in closed form from the model parameters alone, which makes the ELBO an ideal training objective: no literal sampling is required at each step to evaluate that term.
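As a concrete sketch (assuming diagonal Gaussians throughout, and a hypothetical decoder mean function `decode`), the one-sample ELBO from the loop above combines a reparameterised reconstruction term with the closed-form KL against a standard normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gaussian_standard(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ):
    needs no sampling, only the recognition model's parameters."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def elbo(x, enc_mu, enc_log_var, decode):
    """One-sample ELBO estimate: E_q[log P(x|z)] - KL(Q(z|x) || P(z))."""
    eps = rng.standard_normal(enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_log_var) * eps  # reparameterised z ~ Q(z|x)
    mu_x = decode(z)                               # decoder outputs the Gaussian mean
    d = x.shape[-1]
    # isotropic unit-variance Gaussian: log P(x|z) is -||x - mu_x||^2 / 2 plus a constant
    recon = -0.5 * np.sum((x - mu_x) ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)
    return recon - kl_gaussian_standard(enc_mu, enc_log_var)

# Smoke test with a hypothetical identity decoder.
x = np.array([0.5, -0.2])
val = elbo(x, enc_mu=np.zeros(2), enc_log_var=np.zeros(2), decode=lambda z: z)
```

Note that the reconstruction term still needs one (reparameterised) sample of $z$, but the KL term is evaluated purely analytically.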

Variational Autoencoders (VAE)

Hierarchical Variational Encoders (HVAE)

Figure 1: VAE depicted by the arrows: the forward arrow is the encoder ($q(z \mid x)$), which defines a distribution over the latent $z$ for an observation $x$; the backward arrow is the decoder ($p(x \mid z)$), which defines a distribution over the observation $x$ for a sampled latent $z$.

Variational Diffusion Models (VDM)

Figure 2: VDM as a hierarchy of forward and reverse diffusion processes. The aim of the diffusion model is to minimize the difference between the distributions represented by the pink (Gaussian noise-corruption step) and green (denoising step) arrows.
\[\arg\min_{\theta} \; \mathbb{E}_{t \sim \mathcal{U}(2, T)} \left[ \mathbb{E}_{q(x_t | x_0)} \left[ D_{\mathrm{KL}} \big( q(x_{t-1} | x_t, x_0) \; \|\; p_\theta(x_{t-1} | x_t) \big) \right] \right]\]

Reconstructing the ground truth image

\[D_{\mathrm{KL}} \big( q(x_{t-1} | x_t, x_0) \; \|\; p_\theta(x_{t-1} | x_t) \big) = \frac{1}{2 \sigma_q^2(t)} \frac{\bar{\alpha}_{t-1} (1-\alpha_t)^2}{(1-\bar{\alpha}_t)^2} \left\| \hat{\mathbf{x}}_\theta(\mathbf{x}_t, t) - \mathbf{x}_0 \right\|_2^2\]

Predicting source noise

\[D_{\mathrm{KL}} \big( q(x_{t-1} | x_t, x_0) \; \|\; p_\theta(x_{t-1} | x_t) \big) = \frac{1}{2 \sigma_q^2(t)} \, \frac{(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t) \alpha_t} \left\| \epsilon_0 - \hat{\epsilon}_\theta(x_t, t) \right\|_2^2\]
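In practice this term is estimated stochastically: sample a timestep and a noise vector, corrupt $x_0$, and regress the model's noise prediction onto the true noise. A minimal NumPy sketch with an illustrative linear noise schedule, where `eps_model` is a stand-in for the trained network and the KL prefactor is dropped (the common "simple" loss):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule (typical values, not prescriptive).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def noise_prediction_loss(x0, eps_model):
    """One stochastic term of the (unweighted) epsilon-prediction objective."""
    t = rng.integers(1, T)                # t ~ U(1, T-1)
    eps = rng.standard_normal(x0.shape)   # source noise eps_0
    # forward corruption: q(x_t | x_0) = N(sqrt(ab_t) x0, (1 - ab_t) I)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

# A model that always predicts zero noise incurs an average loss of E[eps^2] = 1.
x0 = rng.standard_normal(8)
zero_model = lambda x_t, t: np.zeros_like(x_t)
avg = np.mean([noise_prediction_loss(x0, zero_model) for _ in range(2000)])
```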

Score-based optimization

\[D_{\mathrm{KL}} \big( q(x_{t-1} | x_t, x_0) \; \|\; p_\theta(x_{t-1} | x_t) \big) = \frac{1}{2 \sigma_q^2(t)} \, \frac{(1 - \alpha_t)^2}{(1 - \bar{\alpha}_t) \alpha_t} \left\| s_\theta(x_t,t) - \nabla \log p(x_t) \right\|_2^2\]

NOTE: Here, \(p(x_t) = q(x_t \mid x_0) = \mathcal{N}(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\; (1- \bar{\alpha}_t)I)\); matching the score of this noising distribution is called Denoising Score Matching (DSM).
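The score and noise parameterisations are equivalent up to scale: since $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_0$, the score of $q(x_t \mid x_0)$ works out to $-\epsilon_0/\sqrt{1-\bar{\alpha}_t}$. A quick numerical check, with an illustrative value of $\bar{\alpha}_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_bar_t = 0.7  # illustrative value of alpha-bar at some timestep t
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps

# Score of the Gaussian q(x_t | x_0) = N(sqrt(ab) x0, (1 - ab) I):
score = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1 - alpha_bar_t)

# ...identical to -eps / sqrt(1 - ab): predicting the score and predicting
# the source noise are the same regression task, up to a known scale factor.
match = np.allclose(score, -eps / np.sqrt(1 - alpha_bar_t))
```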

Figure 3: Score-based interpretation of the diffusion optimization objective. Optimizing for the score is like moving toward one of the modes of the distribution at each time step. The log-gradient defines a vector field along which a randomly sampled data point will evolve in time until it converges to one of the modes.

Conditional Guidance for Diffusion Models

\[p(x_{0:T} \mid y) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, y)\]

Classifier Guidance

\[\nabla \log p (x_t \mid y) = \nabla \log p(x_t) + \nabla \log p(y \mid x_t)\]

\[\nabla \log p (x_t \mid y) = \nabla \log p(x_t) + \gamma \nabla \log p(y \mid x_t)\]
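In sampling code this amounts to adding a scaled classifier gradient to the unconditional score at each denoising step. A minimal sketch with toy vectors (the names are illustrative, not from a specific library):

```python
import numpy as np

def guided_score(score_uncond, classifier_grad, gamma):
    """Classifier guidance: grad log p(x_t | y) = grad log p(x_t)
    + gamma * grad log p(y | x_t); gamma > 1 sharpens the conditioning."""
    return score_uncond + gamma * classifier_grad

# Toy vectors standing in for the model's score and the classifier gradient.
s_u = np.array([0.2, -0.1])
s_c = np.array([1.0, 0.5])
g1 = guided_score(s_u, s_c, 1.0)  # gamma = 1 recovers the exact conditional score
g4 = guided_score(s_u, s_c, 4.0)  # larger gamma pushes harder toward the class
```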

Classifier-Free Guidance (much more robust in sample diversity, and more flexible)