StyleGAN: Style-Based Generator Architecture

June 30, 2024

StyleGAN is NVIDIA's style-based generator architecture for GANs, introduced by Karras et al. Instead of feeding the latent code z directly into the generator, a mapping network f: Z -> W transforms z in R^512 through 8 fully-connected layers into an intermediate latent space W. This learned mapping produces w in R^512, which controls synthesis through adaptive instance normalization (AdaIN) at each convolutional layer.

The mapping network consists of 8 FC layers of 512 units each with leaky ReLU activations (α=0.2). The input latent is pixel-normalized before the first layer: z' = z/√(mean_i(z_i²) + ε). This normalization keeps the input scale consistent and encourages use of the full latent space. The deeper network (compared to a typical 1-2 layer mapping) increases disentanglement by allowing a complex non-linear transformation from Z to W.
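A rough PyTorch sketch of such a mapping network (layer count, width, leaky ReLU slope, and the input pixel norm follow the description above; the class name and batch size in the usage line are illustrative):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """z -> w through 8 fully connected layers with leaky ReLU (alpha=0.2)."""
    def __init__(self, dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z, eps=1e-8):
        # Pixel norm on the input latent: rescale z to unit average magnitude.
        z = z * torch.rsqrt(z.pow(2).mean(dim=1, keepdim=True) + eps)
        return self.net(z)

w = MappingNetwork()(torch.randn(4, 512))  # -> shape (4, 512)
```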

The synthesis network starts from a learned constant 4×4×512 input instead of a latent code. The progressive growing architecture builds up resolution: 4×4 -> 8×8 -> 16×16 -> ... -> 1024×1024. Each resolution block contains two 3×3 convolutions with style modulation. The channel progression is 512->512->512->512->256->128->64->32->16 for 1024×1024 output.
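The channel counts follow a halve-as-resolution-doubles rule capped at 512; a small sketch that reproduces the listed progression (the base constant and the cap are inferred from the values above):

```python
def num_channels(res, base=8192, channel_max=512):
    """Feature maps for a resolution block: halve as resolution doubles, capped at 512."""
    return min(channel_max, base // (res // 2))

print([num_channels(2 ** i) for i in range(2, 11)])
# [512, 512, 512, 512, 256, 128, 64, 32, 16] for 4x4 ... 1024x1024
```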

Adaptive Instance Normalization (AdaIN)

Style modulation happens through AdaIN after each convolution: AdaIN(x_i, y) = y_{s,i} × (x_i - μ(x_i))/σ(x_i) + y_{b,i}. Here x_i is the i-th feature map, y = (y_s, y_b) are style parameters from learned affine transform A(w), μ(x_i) and σ(x_i) are spatial mean and standard deviation.
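A minimal PyTorch sketch of AdaIN together with its learned affine transform A (class and argument names are illustrative; initialization details are omitted):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance norm: normalize each feature map spatially, then apply
    a per-channel scale y_s and bias y_b produced from w by a learned affine A."""
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * channels)  # A(w) -> (y_s, y_b)

    def forward(self, x, w, eps=1e-8):
        y_s, y_b = self.affine(w).chunk(2, dim=1)        # (N, C) each
        mu = x.mean(dim=(2, 3), keepdim=True)            # spatial mean per feature map
        sigma = x.std(dim=(2, 3), keepdim=True) + eps    # spatial std per feature map
        x = (x - mu) / sigma
        return y_s[:, :, None, None] * x + y_b[:, :, None, None]

out = AdaIN(64)(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
```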

Each layer has a dedicated affine transform A_l: R^512 -> R^{2×channels}. For 512-channel layers, A outputs 1024 values (512 scales, 512 biases). Total style parameters: ~25M for the 1024×1024 generator. Replacing the traditional latent input with a learned constant means the styles enter only through AdaIN, which helps separate them from spatial structure.

Weight demodulation ensures consistent activation and gradient magnitudes. After modulation, the weights are rescaled per output channel: w''_{ijk} = s_i × w_{ijk}/√(ε + Σ_{i,k} s_i² × w²_{ijk}). This normalization prevents style magnitudes from causing training instabilities. StyleGAN2 moves the whole operation into the convolution: Conv2d_mod(x, w, s) = conv(x, demod(scale(w, s))).
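A hedged sketch of modulation plus demodulation folded into a grouped convolution, so each sample in the batch gets its own modulated weights (function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, s, eps=1e-8, demodulate=True):
    """StyleGAN2-style modulated convolution.
    x: (N, C_in, H, W), weight: (C_out, C_in, k, k), s: (N, C_in) per-sample styles."""
    N, C_in, H, W = x.shape
    C_out = weight.shape[0]
    # Modulate: scale the input-channel dimension of the weights by the style.
    w = weight[None] * s[:, None, :, None, None]                  # (N, C_out, C_in, k, k)
    if demodulate:
        d = torch.rsqrt(w.pow(2).sum(dim=(2, 3, 4), keepdim=True) + eps)
        w = w * d                                                 # unit norm per output channel
    # Grouped conv applies a different weight tensor to each sample.
    x = x.reshape(1, N * C_in, H, W)
    w = w.reshape(N * C_out, C_in, *weight.shape[2:])
    out = F.conv2d(x, w, padding=weight.shape[-1] // 2, groups=N)
    return out.reshape(N, C_out, H, W)

y = modulated_conv2d(torch.randn(2, 8, 16, 16), torch.randn(16, 8, 3, 3), torch.rand(2, 8) + 0.5)
```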

Stochastic Variation Through Noise

Per-pixel noise injections add stochastic detail. Learned scale factors B multiply Gaussian noise before addition: x' = x + B × noise, where noise ~ N(0, I). Each resolution level has two noise inputs (one after each convolution), giving 18 noise inputs at 1024×1024; each is a single-channel noise image at the layer's resolution (4², 8², ..., 1024² pixels) with its own learned per-channel scale.
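A small sketch of such a noise injection layer, assuming one learned scale per channel and a single-channel noise image broadcast across channels:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Add per-pixel Gaussian noise scaled by a learned per-channel factor B."""
    def __init__(self, channels):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))  # B, starts at 0

    def forward(self, x, noise=None):
        if noise is None:  # fresh single-channel noise, broadcast across channels
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise

x = NoiseInjection(64)(torch.randn(2, 64, 32, 32))
```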

Noise affects only local statistics, creating variations in hair placement, beard stubble, skin pores, and background details. Disabling noise produces overly smooth, deterministic outputs. Style controls global attributes: pose, identity, expression. This separation enables fine-grained control during generation.

Perceptual Path Length and Truncation

Perceptual path length (PPL) measures latent space smoothness: PPL = E[(1/ε²) d(g(lerp(w_1, w_2; t)), g(lerp(w_1, w_2; t+ε)))], where d is a VGG16-based perceptual distance and ε is a small interpolation step. Lower PPL indicates a smoother, more disentangled representation. StyleGAN achieves a PPL of 412.0 on FFHQ compared to Progressive GAN's 896.9.

The truncation trick improves quality by pulling latents toward the mean: w' = w̄ + ψ(w - w̄), where w̄ = E_z[f(z)] and ψ in [0, 1]. Lower ψ reduces variation but improves quality; ψ=0.7 is a common trade-off between fidelity and diversity. Compute w̄ from ~10,000 samples for stability.
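A minimal sketch of the truncation trick (the Linear module stands in for the real mapping network; ψ and the 10,000-sample estimate follow the text above):

```python
import torch

def truncate_w(w, w_bar, psi=0.7):
    """Truncation trick: pull each w toward the mean latent w_bar."""
    return w_bar + psi * (w - w_bar)

with torch.no_grad():
    mapping = torch.nn.Linear(512, 512)  # stand-in for the real mapping network
    # Estimate w_bar once from many mapped latents (here 10,000 samples).
    w_bar = mapping(torch.randn(10_000, 512)).mean(dim=0, keepdim=True)
    w = truncate_w(mapping(torch.randn(4, 512)), w_bar, psi=0.7)
```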

Style mixing regularization feeds different w vectors to different layers during training. With probability 0.9, w_1 is used for layers 0 through k and w_2 for the remaining layers, with the crossover point k sampled uniformly over the 18 layers. This prevents adjacent layers from assuming correlated styles and improves disentanglement. Mixing at test time enables fine-grained style transfer.
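A sketch of style mixing applied to a per-layer latent tensor (the 18-layer count and 0.9 probability follow the text; function and variable names are illustrative):

```python
import torch

def mix_styles(w1, w2, num_layers=18, prob=0.9):
    """Style mixing: with probability `prob`, use w1 for layers 0..k-1 and w2 for
    the remaining layers, with crossover point k chosen uniformly."""
    w1 = w1[:, None, :].repeat(1, num_layers, 1)     # (N, num_layers, 512)
    w2 = w2[:, None, :].repeat(1, num_layers, 1)
    if torch.rand(()) < prob:
        k = torch.randint(1, num_layers, ()).item()  # crossover point
        return torch.cat([w1[:, :k], w2[:, k:]], dim=1)
    return w1

w_per_layer = mix_styles(torch.randn(2, 512), torch.randn(2, 512))
```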

Progressive Growing and Resolution Training

Training follows a progressive growing schedule, but with modifications. Start at 8×8 resolution (not 4×4) for faster initial convergence. Resolution transitions use bilinear 0.5× downsampling and 2× bilinear upsampling. The fade-in weight α increases linearly over 600k images.

Training schedule for 1024×1024: 8² -> 16² (600k) -> 32² (600k) -> 64² (600k) -> 128² (600k) -> 256² (600k) -> 512² (600k) -> 1024² (2400k). Total of 6M real images shown across these transitions. Batch size decreases with resolution: 256->128->64->32->16 due to memory constraints; each GPU processes a minibatch of 4 at the highest resolution.

Loss Functions and Regularization

Non-saturating GAN loss with R_1 regularization: L_G = -E[log(D(G(z)))], L_D = -E[log(D(x))] - E[log(1-D(G(z)))] + (λ/2)E[||∇D(x)||²]. R_1 weight λ=10 provides gradient penalty on real data only. Lazy regularization applies R_1 every 16 iterations, reducing computational cost by 87.5%.
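These losses can be written compactly using logits and softplus for numerical stability; a hedged sketch (lazy-regularization scheduling omitted, names illustrative):

```python
import torch
import torch.nn.functional as F

def g_nonsat_loss(fake_logits):
    """Non-saturating generator loss: -E[log D(G(z))], via softplus(-logits)."""
    return F.softplus(-fake_logits).mean()

def d_loss_with_r1(discriminator, real_images, fake_logits, r1_gamma=10.0):
    """Discriminator loss plus the R1 gradient penalty on real data only."""
    real_images = real_images.detach().requires_grad_(True)
    real_logits = discriminator(real_images)
    loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # R1: squared gradient norm of D's output w.r.t. the real images.
    grad, = torch.autograd.grad(real_logits.sum(), real_images, create_graph=True)
    return loss + 0.5 * r1_gamma * grad.pow(2).sum(dim=(1, 2, 3)).mean()
```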

Path length regularization encourages a smooth mapping: L_path = E[(||J_w^T y||_2 - a)²], where y = randn_like(image) and a is updated as an exponential moving average of ||J_w^T y||_2. Weight 2.0 applied every 8 iterations. This explicit regularization supplements implicit regularization from progressive growing.
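A sketch of the path length penalty computed as a vector-Jacobian product (variable names and the EMA decay are illustrative; w is the batch of intermediate latents with requires_grad, and fake_images = g(w)):

```python
import torch

def path_length_penalty(fake_images, w, pl_mean, decay=0.01, weight=2.0):
    """Penalize deviation of ||J_w^T y|| from its running mean a (here pl_mean),
    with y a random image-shaped direction scaled by 1/sqrt(H*W)."""
    h, w_px = fake_images.shape[2], fake_images.shape[3]
    y = torch.randn_like(fake_images) / (h * w_px) ** 0.5
    # J_w^T y via a vector-Jacobian product through the generator.
    jty, = torch.autograd.grad((fake_images * y).sum(), w, create_graph=True)
    lengths = jty.pow(2).flatten(1).sum(dim=1).sqrt()
    pl_mean = pl_mean + decay * (lengths.mean().detach() - pl_mean)  # EMA of ||J_w^T y||
    penalty = (lengths - pl_mean).pow(2).mean()
    return weight * penalty, pl_mean
```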

Architecture Improvements in StyleGAN2

StyleGAN2 removes progressive growing artifacts through several changes:

  1. Skip connections: Output skip connections in the generator (each resolution's RGB output is summed into the final image) and residual connections in the discriminator maintain gradient flow across resolutions without explicit growing.
  2. Bilinear upsampling: Replace transposed convolutions with bilinear up + convolution, eliminating checkerboard artifacts.
  3. Weight demodulation: Moved into convolution operation for efficiency and stability.

Generator normalization is redesigned to remove the artifacts caused by AdaIN. The new design splits modulation from normalization and applies both to the convolution weights W rather than the activations: s = A(w), W' = s ⊙ W, W''_j = W'_j/||W'_j|| per output channel j, y = conv(x, W''). This "demodulation" maintains the benefits of style modulation while fixing the droplet artifacts.

Training modifications include exponential moving average G' with decay 0.999, style mixing probability 0.9, and no growing or stabilization tricks. Full training takes 1 week on 8 V100 GPUs for 25M images at 1024×1024.

StyleGAN3 and Alias-Free Generation

StyleGAN3 addresses texture sticking - the phenomenon where details appear fixed to screen coordinates during latent interpolation. Root cause: aliasing from non-ideal upsampling and downsampling operations. Solution requires carefully designed continuous signal processing.

Key modifications:

  1. Continuous formulation: Treat all operations in continuous domain, sample only at end
  2. Filtered nonlinearities: Replace ReLU with filtered versions: φ(x) = f_↓ * ReLU(f_↑ * x) (see the sketch after this list)
  3. Rotation equivariance: Learned affine transforms on Fourier features enable rotation synthesis
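A rough sketch of the filtered nonlinearity from item 2; bilinear resampling stands in for StyleGAN3's carefully designed low-pass filters, so this only approximates the idea:

```python
import torch
import torch.nn.functional as F

def filtered_lrelu(x, alpha=0.2, factor=2):
    """phi(x) = f_down * lrelu(f_up * x): upsample, apply the pointwise
    nonlinearity, then downsample back to the original resolution."""
    x = F.interpolate(x, scale_factor=factor, mode='bilinear', align_corners=False)      # f_up
    x = F.leaky_relu(x, alpha)                                                           # nonlinearity
    x = F.interpolate(x, scale_factor=1 / factor, mode='bilinear', align_corners=False)  # f_down
    return x

y = filtered_lrelu(torch.randn(1, 16, 32, 32))  # same spatial size as the input
```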

The alias-free generator achieves translation and rotation equivariance. Texture sticking reduces from 66.7% to 0.3% as measured by spatial correlation of features during interpolation. FID is comparable (2.79 vs 2.84), but perceptual quality improves significantly for video applications.

Implementation Details

Memory optimization is crucial at high resolutions. Mixed precision training uses FP16 convolutions with FP32 accumulation. Gradient checkpointing on the mapping network saves 30% memory. Group normalization instead of batch normalization in the discriminator enables smaller batches.

Custom CUDA kernels for fused bias + activation improve speed 20%. Efficient upsampling: grouped convolution implements learned 2× filters. StyleGAN2-ADA adds adaptive discriminator augmentation, enabling high-quality training with limited data (minimum ~5k images).

Typical hyperparameters: learning rate 0.002 for both G and D, Adam with β_1=0, β_2=0.99, ε=1e-8. Exponential moving average of generator weights with decay 0.999. Training 1024×1024 FFHQ model processes 25M images over 6 days on 8 Tesla V100 GPUs, achieving FID 2.84.
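A short sketch of the optimizer setup and generator EMA described above (the Linear modules are stand-ins for the real networks):

```python
import copy
import torch

G = torch.nn.Linear(512, 512)   # stand-in for the generator
D = torch.nn.Linear(512, 1)     # stand-in for the discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=0.002, betas=(0.0, 0.99), eps=1e-8)
opt_d = torch.optim.Adam(D.parameters(), lr=0.002, betas=(0.0, 0.99), eps=1e-8)

# Exponential moving average of generator weights, used for sampling/evaluation.
G_ema = copy.deepcopy(G)

@torch.no_grad()
def update_ema(G, G_ema, decay=0.999):
    for p, p_ema in zip(G.parameters(), G_ema.parameters()):
        p_ema.lerp_(p, 1.0 - decay)  # p_ema = decay * p_ema + (1 - decay) * p
```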