Normalizing flows construct complex probability distributions by composing invertible transformations. Starting from a simple base distribution like N(0, I), a sequence of bijective functions f_1, f_2, ..., f_k transforms samples into the target distribution. Unlike VAEs or GANs, normalizing flows provide exact log-likelihood computation through the change of variables formula: log p(x) = log p(z) - Σ_i log|det(∂f_i/∂z_{i-1})|.
The key constraint is invertibility: each transformation must be bijective with a tractable Jacobian determinant. Computing det(J) for an arbitrary neural network costs O(n³) via LU decomposition, which is prohibitive in high dimensions. Practical flows restrict their architectures to achieve O(n) or O(n log n) complexity while maintaining expressiveness.
Coupling Layers: The Foundation
RealNVP (Real-valued Non-Volume Preserving) introduced affine coupling layers. Split the input x into two parts, x_{1:d} and x_{d+1:D}. The transformation preserves x_{1:d} while transforming the remainder: y_{1:d} = x_{1:d} and y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}). The networks s and t can be arbitrary neural networks, since neither the inverse nor the Jacobian determinant requires inverting them (and their Jacobians never enter the determinant).
The Jacobian is lower triangular:
J = [ I_d                    0            ]
    [ ∂y_{d+1:D}/∂x_{1:d}    diag(exp(s)) ]
The determinant simplifies to det(J) = Π_i exp(s_i) = exp(Σ_i s_i). Computation is O(n) and numerically stable in log-space. The inverse is analytical: x_{d+1:D} = (y_{d+1:D} - t(y_{1:d})) ⊙ exp(-s(y_{1:d})).
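A minimal affine coupling layer sketch in PyTorch; the split point d, the small MLP producing s and t, and the tanh bounding of the scale (a common stabilization trick, not part of the equations above) are illustrative choices rather than RealNVP's exact architecture:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y_{1:d} = x_{1:d};  y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d})."""
    def __init__(self, dim, d, hidden=128):
        super().__init__()
        self.d = d
        # s and t can be arbitrary networks; only their outputs are ever needed.
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                 # log|det J| = sum_i s_i
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([y1, x2], dim=1)
```

Stacking several such layers with alternating splits (or masks, as in the variant described next) ensures every dimension eventually gets transformed.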
Masked coupling generalizes this pattern. A binary mask m in {0,1}^D determines which dimensions to preserve: y = m ⊙ x + (1-m) ⊙ (x ⊙ exp(s(m⊙x)) + t(m⊙x)). Alternating masks (checkerboard, channel-wise) ensure all dimensions get transformed across multiple layers.
Autoregressive Flows
Autoregressive flows model each dimension conditioned on the previous ones: p(x) = Π_i p(x_i | x_{1:i-1}). Inverse Autoregressive Flow (IAF) uses y_i = x_i ⊙ exp(s_i(x_{1:i-1})) + t_i(x_{1:i-1}): every output depends only on the inputs, so sampling is O(n), but density evaluation requires O(n²) work because inversion is sequential.
Masked Autoregressive Flow (MAF) reverses the conditioning: y_i = x_i ⊙ exp(s_i(y_{1:i-1})) + t_i(y_{1:i-1}). Now density evaluation is O(n) but sampling needs O(n²). The choice depends on the use case: IAF for VAE posteriors (sampling-heavy), MAF for density estimation (likelihood-heavy).
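A toy illustration of this asymmetry, with a hypothetical conditioner standing in for the masked network: the density-direction pass only reads dimensions that are already known, while sampling must generate them in order.

```python
import torch

def conditioner(prefix):
    # Hypothetical stand-in for a MADE-style conditioner: returns (s_i, t_i)
    # as a function of the preceding dimensions only.
    h = prefix.sum(dim=1, keepdim=True)
    return 0.1 * torch.tanh(h), 0.1 * h

def maf_density_pass(y):
    """Density direction: every (s_i, t_i) depends only on y_{1:i-1}, which is
    already given, so all dimensions can be handled in one (masked) pass."""
    x, log_det = [], 0.0
    for i in range(y.shape[1]):                # conceptually a single network pass
        s, t = conditioner(y[:, :i])
        x.append((y[:, i:i + 1] - t) * torch.exp(-s))
        log_det = log_det - s.squeeze(1)       # log|det| of the inverse map
    return torch.cat(x, dim=1), log_det

def maf_sample_pass(x):
    """Sampling direction: y_i needs y_{1:i-1} first, so generation is sequential."""
    y = torch.zeros_like(x)
    for i in range(x.shape[1]):
        s, t = conditioner(y[:, :i])
        y[:, i:i + 1] = x[:, i:i + 1] * torch.exp(s) + t
    return y

x = torch.randn(4, 6)
y = maf_sample_pass(x)
x_rec, _ = maf_density_pass(y)
print((x - x_rec).abs().max())                 # ~0: the two passes are exact inverses
```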
MADE (Masked Autoencoder for Distribution Estimation) implements autoregressive transforms via masked weight matrices. Assign each hidden unit k an integer degree m(k) in {1,...,D-1}. Masks zero out connections so that hidden unit k sees only inputs with index up to its degree, M_{k,d} = 1(m(k) >= d), while output dimension d sees only hidden units of strictly smaller degree, M_{d,k} = 1(d > m(k)). Sampling multiple masks over different orderings improves expressiveness.
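A sketch of MADE mask construction for a single hidden layer, with random degree assignments as in the paper; shapes and names are illustrative, and the masks are applied elementwise to the corresponding weight matrices. The strict inequality on the output side is what enforces the autoregressive property.

```python
import torch

def made_masks(D, H, seed=0):
    """Input->hidden and hidden->output masks for one hidden layer (MADE)."""
    g = torch.Generator().manual_seed(seed)
    m_in = torch.arange(1, D + 1)                        # input degrees 1..D
    m_hid = torch.randint(1, D, (H,), generator=g)       # hidden degrees in {1,...,D-1}
    # Hidden unit k may see input d only if m_hid[k] >= d.
    mask_in = (m_hid[:, None] >= m_in[None, :]).float()  # shape (H, D)
    # Output unit d may see hidden unit k only if d > m_hid[k] (strict).
    mask_out = (m_in[:, None] > m_hid[None, :]).float()  # shape (D, H)
    return mask_in, mask_out

mask_in, mask_out = made_masks(D=5, H=16)
# Composite connectivity: output d can depend on input d' only if d' < d.
reach = mask_out @ mask_in
assert reach.triu().sum() == 0        # nothing on or above the diagonal
```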
Continuous Normalizing Flows
Neural ODEs define flows through differential equations: dz/dt = f(z(t), t). The instantaneous change of variables gives d log p(z(t))/dt = -Tr(∂f/∂z). Hutchinson's trace estimator approximates the trace as Tr(A) = E_ε[ε^T A ε] ≈ ε^T A ε with ε ~ N(0, I). This reduces the O(n²) trace computation to O(n) cost via a single vector-Jacobian product.
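A sketch of the estimator with autograd, computing ε^T(∂f/∂z)ε through a vector-Jacobian product so the Jacobian is never materialized; the toy f and the function name are illustrative, and Rademacher noise is a common lower-variance alternative to the Gaussian used here.

```python
import torch

def hutchinson_divergence(f, z, n_samples=1):
    """Unbiased estimate of Tr(df/dz) for a batch of states z with shape (B, D)."""
    z = z.requires_grad_(True)
    fz = f(z)
    est = torch.zeros(z.shape[0], device=z.device)
    for _ in range(n_samples):
        eps = torch.randn_like(z)
        # eps^T (df/dz): one vector-Jacobian product via autograd.
        vjp = torch.autograd.grad(fz, z, grad_outputs=eps,
                                  create_graph=True, retain_graph=True)[0]
        est = est + (vjp * eps).sum(dim=1)
    return est / n_samples

# Toy example: f(z) = tanh(z W) with a fixed random W.
torch.manual_seed(0)
W = torch.randn(4, 4)
f = lambda z: torch.tanh(z @ W)
print(hutchinson_divergence(f, torch.randn(8, 4), n_samples=10))
```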
FFJORD (Free-Form Jacobian of Reversible Dynamics) implements this with adaptive ODE solvers. The forward pass integrates from t=0 to t=1: [z(1), log p(z(1))] = ODESolve([z(0), log p(z(0))], f, [0,1]). The adjoint method computes gradients without storing intermediate states, using O(1) memory.
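A compact sketch of the augmented dynamics being integrated, using a fixed-step Euler loop and an exact trace for clarity instead of the adaptive solver, adjoint method, and Hutchinson estimate used by FFJORD; the Dynamics MLP is an arbitrary stand-in.

```python
import torch
import torch.nn as nn

class Dynamics(nn.Module):
    """dz/dt = f(z, t); a small MLP stands in for whatever architecture is used."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))

    def forward(self, z, t):
        return self.net(torch.cat([z, t.expand(z.shape[0], 1)], dim=1))

def integrate_cnf(f, z0, n_steps=20):
    """Euler-integrate the augmented state [z, delta_logp] from t=0 to t=1.
    delta_logp accumulates -Tr(df/dz) dt along the trajectory (exact trace here,
    computed one output dimension at a time; fine for small D)."""
    z = z0.requires_grad_(True)
    delta_logp = torch.zeros(z0.shape[0])
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((1, 1), k * dt)
        dz = f(z, t)
        trace = sum(torch.autograd.grad(dz[:, i].sum(), z, create_graph=True)[0][:, i]
                    for i in range(z.shape[1]))
        z = z + dt * dz
        delta_logp = delta_logp - dt * trace
    return z, delta_logp

f = Dynamics(dim=2)
z1, dlogp = integrate_cnf(f, torch.randn(5, 2))
```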
Training dynamics differ from discrete flows: smooth transformations prevent mode collapse, but convergence is slower. Regularization penalizes kinetic energy, ∫_0^1 ||f(z(t), t)||² dt, which encourages straight trajectories, reducing solver steps and improving stability.
Residual Flows and Invertibility
Residual flows use transformations y = x + g(x) where g is contractive: ||g(x) - g(y)|| <= L||x - y|| with L < 1. The Banach fixed-point theorem then guarantees a unique inverse, computable by iterating x_{k+1} = y - g(x_k). The convergence rate depends on the Lipschitz constant L.
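A sketch of the fixed-point inversion, using a single linear layer rescaled to spectral norm 0.5 as a stand-in for a contractive g:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(8, 8)
with torch.no_grad():
    # Rescale the weight so its spectral norm is 0.5 (Lipschitz constant < 1).
    lin.weight *= 0.5 / torch.linalg.matrix_norm(lin.weight, ord=2)
g = lambda x: lin(x)

def invert_residual(y, g, n_iters=50):
    """Solve x = y - g(x) by fixed-point iteration; the error shrinks
    geometrically at rate L, the Lipschitz constant of g."""
    x = y.clone()
    for _ in range(n_iters):
        x = y - g(x)
    return x

x = torch.randn(4, 8)
y = x + g(x)
x_rec = invert_residual(y, g)
print((x - x_rec).abs().max())   # near float32 precision
```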
Spectral normalization enforces contractivity by rescaling each weight matrix with its largest singular value ||W||_2, e.g. W' = c·W/||W||_2 with c < 1, so that g's Lipschitz constant stays below 1. Power iteration estimates ||W||_2: v = W^T u/||W^T u||, u = Wv/||Wv||, σ ≈ u^T W v; the rescaling is reapplied after each gradient step. Typical architectures use 5-10 residual blocks with 128-512 hidden units.
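A power-iteration sketch (PyTorch's torch.nn.utils.spectral_norm packages the same idea, with one persistent iteration per forward pass); the rescaling constant c = 0.9 below is an arbitrary illustrative choice.

```python
import torch

def spectral_norm_estimate(W, n_iters=20):
    """Estimate the largest singular value of W by power iteration."""
    u = torch.randn(W.shape[0])
    u = u / u.norm()
    for _ in range(n_iters):
        v = W.t() @ u
        v = v / v.norm()
        u = W @ v
        u = u / u.norm()
    return u @ (W @ v)            # sigma_max estimate

W = torch.randn(64, 32)
print(spectral_norm_estimate(W), torch.linalg.matrix_norm(W, ord=2))  # should agree closely
# Enforce contractivity by rescaling: W_hat = c * W / sigma, with c < 1.
W_hat = 0.9 * W / spectral_norm_estimate(W)
```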
i-ResNet estimates the log-determinant via a power series: log|det(I + J_g)| = Tr(log(I + J_g)) = Σ_{k=1}^∞ (-1)^{k+1}/k Tr(J_g^k). Truncating the series at k=5-10 is fast but introduces bias; the Russian roulette estimator used in Residual Flows randomizes the truncation point to recover unbiased estimates.
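A sketch combining the truncated series with a single Hutchinson sample, so the k-th term costs one extra vector-Jacobian product; the function name is illustrative, and the fixed truncation shown is the biased variant (the Russian roulette version randomizes the truncation point).

```python
import torch

def logdet_series_estimate(g, x, n_terms=8):
    """Estimate log|det(I + J_g)| at x via sum_{k>=1} (-1)^{k+1}/k * Tr(J_g^k),
    with each trace estimated as eps^T J_g^k eps (one Hutchinson sample)."""
    x = x.requires_grad_(True)
    gx = g(x)
    eps = torch.randn_like(x)
    v = eps
    logdet = torch.zeros(x.shape[0], device=x.device)
    for k in range(1, n_terms + 1):
        # v <- v^T J_g (vector-Jacobian product), so after k steps v = eps^T J_g^k.
        v = torch.autograd.grad(gx, x, grad_outputs=v,
                                retain_graph=True, create_graph=True)[0]
        logdet = logdet + (-1) ** (k + 1) / k * (v * eps).sum(dim=1)
    return logdet
```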
Glow: Scaling to High Resolutions
Glow combines three components for image modeling:
- ActNorm: data-dependent initialization y = (x - μ)/σ, where μ and σ are computed from the first batch; a learnable per-channel scale and bias are then updated during training. Log-determinant contribution: -h×w×Σ_i log σ_i (a sketch follows after this list).
- Invertible 1×1 convolution: generalizes channel permutations with a learned matrix W. An LU decomposition W = PLU reduces the determinant computation to log|det(W)| = Σ_i log|U_ii|. Alternatively, orthogonal matrices built from Householder reflections ensure det(W) = ±1.
- Coupling layers: split channels c -> [c/2, c/2] and condition on one half using a 3-layer CNN: Conv(c/2->128)->ReLU->Conv(128->128)->ReLU->Conv(128->c). Zero-initializing the final convolution makes each coupling layer start as the identity.
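A minimal ActNorm sketch, assuming the y = s ⊙ (x + b) parameterization (equivalent to (x - μ)/σ with s = 1/σ, b = -μ) and initializing from the statistics of the first batch seen; class and attribute names are illustrative.

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Per-channel affine y = s * (x + b) for inputs of shape (B, C, H, W)."""
    def __init__(self, n_channels):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(1, n_channels, 1, 1))
        self.b = nn.Parameter(torch.zeros(1, n_channels, 1, 1))
        self.initialized = False

    def forward(self, x):
        if not self.initialized:
            with torch.no_grad():
                # Data-dependent init: make the first batch zero-mean, unit-variance.
                mean = x.mean(dim=(0, 2, 3), keepdim=True)
                std = x.std(dim=(0, 2, 3), keepdim=True)
                self.b.copy_(-mean)
                self.log_s.copy_(-torch.log(std + 1e-6))
            self.initialized = True
        y = torch.exp(self.log_s) * (x + self.b)
        # log|det| contribution: h * w * sum_c log s_c (same for every example).
        log_det = x.shape[2] * x.shape[3] * self.log_s.sum()
        return y, log_det

    def inverse(self, y):
        return y * torch.exp(-self.log_s) - self.b
```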
A multi-scale architecture processes different resolutions. After each scale, half of the channels are split off: h×w×c -> h/2×w/2×4c -> split -> h/2×w/2×2c + h/2×w/2×2c (to latent). The total latent dimensionality matches the input. Glow on 256×256 CelebA-HQ achieves 1.03 bits/dim using 600M parameters.
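The h×w×c -> h/2×w/2×4c squeeze step is a pure tensor permutation; a sketch (function names are illustrative, and this space-to-depth ordering is one common choice):

```python
import torch

def squeeze(x):
    """(B, C, H, W) -> (B, 4C, H/2, W/2) by folding each 2x2 spatial block into channels."""
    b, c, h, w = x.shape
    x = x.view(b, c, h // 2, 2, w // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(b, 4 * c, h // 2, w // 2)

def unsqueeze(x):
    """Inverse of squeeze: (B, 4C, H/2, W/2) -> (B, C, H, W)."""
    b, c4, h, w = x.shape
    x = x.view(b, c4 // 4, 2, 2, h, w)
    x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
    return x.view(b, c4 // 4, 2 * h, 2 * w)

x = torch.randn(2, 3, 8, 8)
assert torch.equal(unsqueeze(squeeze(x)), x)   # volume-preserving permutation of entries
```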
Training Considerations
Optimization typically uses Adam with a learning rate of 1e-4 to 5e-4. Gradient clipping at norm 1-5 prevents instabilities from high-variance log-determinant gradients. Learning rate warmup over 500-1000 steps improves convergence. Batch sizes range from 16 to 64 for high-resolution images due to memory constraints.
Data preprocessing significantly impacts performance. Uniform dequantization adds noise: x' = (x + u)/256 with u ~ U[0,1]. Variational dequantization learns a noise distribution q(u|x), improving bits/dim by 0.02-0.05. A logit transform prevents boundary effects: y = logit(λ + (1-2λ)x) with λ = 1e-6.
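A sketch of this preprocessing pipeline for 8-bit images; the function and its return convention are illustrative, and the -log 256 per dimension from the /256 rescaling is deferred to the bits/dim conversion shown later.

```python
import math
import torch

def preprocess(x_uint8, lam=1e-6):
    """Map 8-bit images (B, C, H, W) to logit space, returning the log-det of the
    transform so it can be added to the model log-likelihood."""
    x = x_uint8.float()
    x = (x + torch.rand_like(x)) / 256.0          # uniform dequantization, x' in [0, 1)
    x = lam + (1 - 2 * lam) * x                   # pull values away from {0, 1}
    y = torch.log(x) - torch.log1p(-x)            # logit transform
    # |dy/dx'| = (1 - 2*lam) / (x * (1 - x)), summed in log-space over all dimensions.
    log_det = (math.log(1 - 2 * lam) - torch.log(x) - torch.log1p(-x)).sum(dim=(1, 2, 3))
    return y, log_det
```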
Memory optimization techniques: gradient checkpointing reduces activation storage from O(L) to O(√L) for L layers. Mixed precision training with loss scaling maintains numerical precision. For Glow, store only z-values between scales, recomputing activations during backward pass.
Evaluation Metrics and Applications
Bits per dimension (bits/dim) measures compression: bits/dim = -log_2 p(x) / dim(x); lower is better. MNIST achieves 0.99, CIFAR-10 reaches 3.35, and ImageNet64 gets 3.81. Compare to PNG compression: CIFAR-10 ≈ 5.87 bits/dim.
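A small helper for the conversion, assuming the model's log-likelihood is reported in nats over inputs rescaled from 8-bit pixels to [0, 1] (hence the +8 bits correction):

```python
import math

def bits_per_dim(log_prob_nats, n_dims, rescaled_8bit=True):
    """Convert a per-example log-likelihood in nats into bits per dimension."""
    bpd = -log_prob_nats / (n_dims * math.log(2))
    if rescaled_8bit:
        bpd += 8.0   # log2(256): accounts for dividing 8-bit pixels by 256 in preprocessing
    return bpd

print(bits_per_dim(log_prob_nats=9900.0, n_dims=3 * 32 * 32))   # ≈ 3.35 on a CIFAR-10-sized input
```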
Latent space manipulation enables controllable generation. Linear interpolation between z-codes produces smooth transitions. Semantic directions are found via PCA on the encoded dataset or supervised attribute regression. Temperature scaling during sampling, z' = T × z, controls the diversity vs. quality tradeoff.
Applications span density estimation, anomaly detection (thresholding log p(x)), and hybrid models. Flow++ combines Glow-style coupling with logistic-mixture coupling transforms, variational dequantization, and self-attention in the coupling networks, reaching 3.08 bits/dim on CIFAR-10, the best among normalizing flows at the time of publication. Normalizing flows also enhance VAE posteriors (improving the ELBO) and enable exact inference in latent variable models.