Transformer Architecture

April 9, 2023

Input text first undergoes tokenization using algorithms like Byte-Pair Encoding (BPE) or SentencePiece, breaking text into subword units. For example, "artificial" might become ["art", "ific", "ial"] while "unbreakable" becomes ["un", "break", "able"]. Each token maps to an integer ID within a vocabulary typically containing 50,257 tokens (GPT models) or 32,000 tokens (Llama models).
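
To make the ID mapping concrete, here is a minimal sketch of greedy longest-match subword tokenization over a toy vocabulary. It is purely illustrative: real BPE or SentencePiece tokenizers learn their vocabularies and merge rules from data, and the toy_vocab and toy_tokenize names are made up for this example.

    # Toy greedy longest-match tokenizer (not a real BPE implementation).
    toy_vocab = {"un": 0, "break": 1, "able": 2, "art": 3, "ific": 4, "ial": 5}

    def toy_tokenize(word, vocab):
        pieces = []
        i = 0
        while i < len(word):
            # Take the longest vocabulary entry that matches at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                raise ValueError(f"no subword covers {word[i:]!r}")
        return pieces, [vocab[p] for p in pieces]

    print(toy_tokenize("unbreakable", toy_vocab))  # (['un', 'break', 'able'], [0, 1, 2])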

These token IDs convert to dense vectors through an embedding matrix E with shape [vocab_size, d_model]. Token ID 100 becomes E[100], a vector where d_model might be 768 for BERT or 12,288 for GPT-3. Since transformers process all tokens simultaneously and have no built-in notion of order, sinusoidal position encodings are added: PE(pos,2i) = sin(pos/10000^(2i/d_model)) and PE(pos,2i+1) = cos(pos/10000^(2i/d_model)).
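
A small NumPy sketch of the sinusoidal table, assuming an even d_model; the embedding lookup E[token_ids] is elided, since E is simply a learned matrix indexed by token ID.

    import numpy as np

    def sinusoidal_positions(seq_len, d_model):
        """Build the [seq_len, d_model] sinusoidal position-encoding table."""
        pos = np.arange(seq_len)[:, None]                  # positions 0..seq_len-1
        i = np.arange(d_model // 2)[None, :]               # index of each (sin, cos) pair
        angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
        pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
        return pe

    # input_vectors = E[token_ids] + sinusoidal_positions(len(token_ids), d_model)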

The core mechanism is multi-head self-attention. Each head learns different types of relationships between tokens. For a sequence of length n, self-attention computes how much each token should "attend" to every other token, creating an n×n attention matrix. The input X is projected into Query, Key, and Value matrices through learned weights: Q = XW_q, K = XW_k, V = XW_v. Attention is then computed as: Attention(Q,K,V) = softmax(QK^T/√d_k)V. The division by √d_k keeps the dot products from growing with d_k and saturating the softmax, which would otherwise yield vanishingly small gradients. With 12 heads and d_model=768, each head operates on a 64-dimensional subspace. The outputs of all heads are concatenated and projected through another linear layer.
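
As a sketch, here is a single head of scaled dot-product attention in NumPy; multi-head attention simply runs h such heads on d_model/h-dimensional projections, concatenates their outputs, and applies a final output projection.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(X, W_q, W_k, W_v):
        """One head: X is [n, d_model]; W_q, W_k, W_v are [d_model, d_k]."""
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)          # [n, n] matrix of attention scores
        weights = softmax(scores, axis=-1)       # each row sums to 1
        return weights @ V                       # weighted sum of value vectors

    n, d_model, d_k = 16, 768, 64
    X = np.random.randn(n, d_model)
    W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.02 for _ in range(3))
    print(attention(X, W_q, W_k, W_v).shape)     # (16, 64)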

Following attention, representations pass through a feed-forward network: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. The hidden dimension is typically 4x the model dimension (e.g., 3072 for d_model=768). Crucially, each sub-layer employs a residual connection (skip connection) that adds the input directly to the output: x + Sublayer(x). The sum is then normalized, LayerNorm(x + Sublayer(x)), with dropout (typically 0.1) applied to the sub-layer output before the residual addition for regularization. Skip connections enable training very deep networks by providing gradient highways.
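
A sketch of the feed-forward sub-layer with its residual connection and post-layer normalization, written in PyTorch; the attention sub-layer is wrapped the same way, and the dimensions follow the d_model=768 example above.

    import torch
    import torch.nn as nn

    class FFNSublayer(nn.Module):
        """Post-LN feed-forward sub-layer: LayerNorm(x + Dropout(FFN(x)))."""
        def __init__(self, d_model=768, dropout=0.1):
            super().__init__()
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),  # expand to 4x the model dimension
                nn.ReLU(),                        # max(0, xW_1 + b_1)
                nn.Linear(4 * d_model, d_model),  # project back down: ...W_2 + b_2
            )
            self.norm = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            return self.norm(x + self.dropout(self.ffn(x)))  # residual add, then normalize

    x = torch.randn(2, 16, 768)       # [batch, seq_len, d_model]
    print(FFNSublayer()(x).shape)     # torch.Size([2, 16, 768])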

This transformer block repeats N times: GPT-2 uses 12-48 layers while GPT-3 uses 96. The final layer applies normalization and then a linear projection to vocabulary size. The resulting logits convert to probabilities via softmax, optionally after division by a temperature T. Higher temperature (T>1) flattens the distribution and increases randomness; lower temperature (T<1) sharpens it, making outputs more deterministic.
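
A small sketch of temperature sampling over the final logits; the four-entry logit vector is a toy stand-in for a full vocabulary.

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        """Divide logits by T, apply softmax, and sample one token ID."""
        if rng is None:
            rng = np.random.default_rng()
        scaled = logits / temperature        # T>1 flattens, T<1 sharpens the distribution
        scaled = scaled - scaled.max()       # numerical stability
        probs = np.exp(scaled) / np.exp(scaled).sum()
        return rng.choice(len(probs), p=probs)

    logits = np.array([2.0, 1.0, 0.1, -1.0])
    print(sample_next_token(logits, temperature=0.7))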

Self-Supervised Learning in the Training Process

Causal language modeling predicts the next token with loss -log P(x_t | x_1, ..., x_{t-1}). Training uses teacher forcing, where ground-truth tokens are fed as input regardless of the model's own predictions. Masked language modeling instead randomly selects 15% of tokens: of these, 80% become [MASK], 10% are replaced by random tokens, and 10% are left unchanged.
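
A sketch of the causal objective with teacher forcing in PyTorch; the random logits tensor stands in for an actual model forward pass, and the 50,257-entry vocabulary matches the GPT figure above.

    import torch
    import torch.nn.functional as F

    # Teacher forcing: the model is given ground-truth tokens 0..T-2 as input
    # and trained to predict tokens 1..T-1, regardless of its own outputs.
    token_ids = torch.randint(0, 50257, (4, 128))          # [batch, seq_len]
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # shift targets by one position

    logits = torch.randn(4, 127, 50257)                    # stand-in for model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, 50257),      # mean of -log P(x_t | x_1..x_{t-1})
                           targets.reshape(-1))
    print(loss.item())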

Datasets

Common Crawl provides 60TB of web text requiring extensive cleaning. Wikipedia dumps contain approximately 20GB of English text. BookCorpus includes 4.5GB of books while ArXiv offers 100GB of LaTeX source. The Stack provides 3TB of permissively licensed code from GitHub. Curated datasets like C4 offer 750GB of cleaned Common Crawl data.

Training Details

Preprocessing involves MinHash deduplication and perplexity filtering, which removes documents scored as low-quality by a smaller language model; high perplexity under such a model indicates confusing or incoherent text. The AdamW optimizer decouples weight decay from the gradient update, which improves generalization; typical hyperparameters are β1=0.9, β2=0.95, ε=1e-8. The learning rate follows a linear warmup to a peak value (e.g., 6e-4) and then a cosine decay.
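
The learning-rate schedule is easy to write down as a plain function; the warmup length, total step count, and floor learning rate below are illustrative assumptions, with only the 6e-4 peak taken from the text.

    import math

    def lr_at_step(step, peak_lr=6e-4, warmup_steps=2000, total_steps=100_000, min_lr=6e-5):
        """Linear warmup to peak_lr, then cosine decay toward min_lr."""
        if step < warmup_steps:
            return peak_lr * step / warmup_steps
        progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

    print(lr_at_step(1000), lr_at_step(2000), lr_at_step(100_000))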

Mixed precision uses FP16/BF16 for the forward and backward passes while maintaining FP32 master weights. Lower precision reduces memory use and increases throughput, which matters when memory is the training bottleneck. Loss scaling prevents FP16 gradients from underflowing.
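
A minimal mixed-precision loop using PyTorch's automatic mixed precision, assuming a CUDA device; the single linear layer, squared-error loss, and random batches are placeholders for a real model and data pipeline.

    import torch

    model = torch.nn.Linear(768, 768).cuda()             # stand-in for a transformer
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), eps=1e-8)
    scaler = torch.cuda.amp.GradScaler()                 # dynamic loss scaling for FP16
    batches = [torch.randn(32, 768) for _ in range(10)]  # toy data

    for batch in batches:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():                  # forward pass in lower precision
            loss = model(batch.cuda()).pow(2).mean()     # placeholder loss
        scaler.scale(loss).backward()                    # scale the loss so FP16 grads don't underflow
        scaler.step(optimizer)                           # unscale grads, then update FP32 master weights
        scaler.update()                                  # adapt the scale factor over time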

Distributed training employs data parallelism across nodes and tensor parallelism within nodes. Effective batch sizes range from a few hundred sequences for smaller models up to 3.2M tokens for GPT-3, reached via gradient accumulation. GPT-3 (175B parameters) was trained on V100 clusters provided by Microsoft, processing roughly 300B tokens in total. Rough estimates put an equivalent run at around 1,024 A100s for about a month.
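
Gradient accumulation itself is straightforward; here is a CPU-sized sketch, where the model, loss, and micro-batch sizes are placeholders.

    import torch

    # Reach a large effective batch by summing gradients over several
    # micro-batches before each optimizer step.
    model = torch.nn.Linear(768, 768)
    optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
    accum_steps = 8                                      # effective batch = 8 micro-batches
    micro_batches = [torch.randn(4, 768) for _ in range(32)]

    optimizer.zero_grad()
    for i, batch in enumerate(micro_batches):
        loss = model(batch).pow(2).mean() / accum_steps  # average loss over the accumulation window
        loss.backward()                                  # gradients add up across micro-batches
        if (i + 1) % accum_steps == 0:
            optimizer.step()                             # one update per effective batch
            optimizer.zero_grad()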