CLIP (Contrastive Language-Image Pre-training) learns visual concepts from natural language supervision by training on 400 million image-text pairs. Unlike traditional computer vision models, which require labeled datasets such as ImageNet's 1.2M images across 1,000 categories, CLIP learns directly from raw text describing images found on the internet.
The architecture consists of two encoders: an image encoder and a text encoder. The image encoder can be either a Vision Transformer (ViT) or a ResNet variant. For ViT-B/32, images are divided into 32×32 patches, each linearly projected to a 768-dimensional embedding. A [CLS] token is prepended to the sequence, and positional embeddings are added. The sequence passes through 12 transformer layers with 12 attention heads each. The ResNet variants are ResNet-50, ResNet-101, and EfficientNet-style scaled models (RN50x4, RN50x16, RN50x64) that use roughly 4x, 16x, and 64x the compute of a ResNet-50. The final pooled features are projected into a shared embedding space of dimension 512 or 768, depending on the model.
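To make the shapes concrete, here is a minimal PyTorch sketch of the ViT-B/32 patchification step described above. It is illustrative only, not the reference implementation; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Illustrative ViT-B/32 patch embedding: 224x224 image -> 49 patches of width 768."""
    def __init__(self, image_size=224, patch_size=32, width=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2          # 49 patches for 224x224
        # A strided convolution is equivalent to splitting into patches + a linear projection.
        self.proj = nn.Conv2d(3, width, kernel_size=patch_size, stride=patch_size, bias=False)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, width))     # [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, width))

    def forward(self, images):                                      # images: (B, 3, 224, 224)
        x = self.proj(images)                                       # (B, 768, 7, 7)
        x = x.flatten(2).transpose(1, 2)                            # (B, 49, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)             # prepend [CLS]
        x = torch.cat([cls, x], dim=1) + self.pos_embed             # (B, 50, 768)
        return x                                                    # fed to the 12 transformer layers

patches = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(patches.shape)  # torch.Size([2, 50, 768])
```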
The text encoder is a 12-layer transformer with 512-dimensional embeddings and 8 attention heads, processing sequences up to 77 tokens. Text tokenization uses a byte-pair encoding vocabulary of 49,152 tokens. Special tokens [SOS] and [EOS] mark sequence boundaries. The final [EOS] token representation serves as the text embedding after layer normalization and linear projection.
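For reference, the open-source openai/CLIP package exposes this tokenizer directly. A small usage sketch, assuming that package is installed:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git

# Tokenize a caption into the fixed 77-token context window.
tokens = clip.tokenize(["a dog playing fetch"])
print(tokens.shape)  # (1, 77)

# The BPE tokenizer wraps the caption in start/end-of-text markers;
# unused positions are zero-padded up to the 77-token context length.
```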
Contrastive Learning Objective
CLIP maximizes cosine similarity between the N correct image-text pairs in a batch while minimizing similarity for the N²-N incorrect pairings. Given batch size N, the model sees N images and N texts, producing an N×N similarity matrix. A symmetric cross-entropy loss is applied in both directions; for pair i,
L_i = -1/2 * [ log( exp(sim(I_i,T_i)/τ) / Σ_j exp(sim(I_i,T_j)/τ) ) + log( exp(sim(T_i,I_i)/τ) / Σ_j exp(sim(T_i,I_j)/τ) ) ],
and the batch loss averages L_i over i. The temperature parameter τ is initialized to 0.07 and learned during training, controlling the sharpness of the similarity distribution.
The key insight: treating each possible pairing as a classification problem. For image I_i, only text T_i is correct among N choices. This creates 2N classification tasks per batch (N from image→text and N from text→image). With batch size 32,768, each step performs 65,536 classifications.
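A minimal PyTorch sketch of this symmetric loss, assuming the image and text features are already computed and the temperature is passed in as an inverse scale (1/τ), as is common in open-source reimplementations:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    """Symmetric cross-entropy over an N x N similarity matrix (illustrative sketch)."""
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Scaled pairwise similarities: logits[i, j] = sim(I_i, T_j) / tau.
    logits_per_image = logit_scale * image_features @ text_features.t()   # (N, N)
    logits_per_text = logits_per_image.t()

    # The matching pair sits on the diagonal, so the target class for row i is index i.
    targets = torch.arange(image_features.shape[0], device=image_features.device)

    # N image->text classifications plus N text->image classifications, averaged.
    loss_i2t = F.cross_entropy(logits_per_image, targets)
    loss_t2i = F.cross_entropy(logits_per_text, targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Example: a batch of 8 pairs with 512-dimensional embeddings and tau = 0.07.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512), logit_scale=1 / 0.07)
```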
Large-Scale Training Infrastructure
Training uses mixed precision (FP16) with gradient checkpointing to fit large batches in memory. The largest ResNet (RN50x64) trains on 592 V100 GPUs for 18 days, and the largest Vision Transformer trains on 256 V100 GPUs for 12 days; over 32 epochs each model processes roughly 12.8 billion image-text pairs. Batch sizes scale from 8,192 to 32,768 using gradient accumulation across nodes. The AdamW optimizer uses β1=0.9, β2=0.98, ε=1e-6 with weight decay 0.2. The learning rate follows a cosine schedule with a 2,000-step linear warmup to a peak of 5e-4.
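A simple way to express that schedule is as a function of the step index. This is a sketch under the stated hyperparameters (2,000 warmup steps, 5e-4 peak), not the actual training code:

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-4, warmup_steps=2000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (illustrative)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. a run of ~390k steps (12.8B pairs / 32,768 batch size)
print(lr_at_step(0, 390_000), lr_at_step(2_000, 390_000), lr_at_step(390_000, 390_000))
```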
Data parallelism distributes batches across GPUs, with each GPU computing embeddings for its subset. All-gather operations collect the embeddings from every GPU before the similarity matrix is computed. Together with gradient accumulation, this yields large effective batches: with 592 GPUs each processing 55 samples per step, one optimization step sees 32,560 pairs.
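A sketch of how the embedding all-gather might look with torch.distributed. This is simplified: dist.all_gather does not propagate gradients across ranks, so real implementations substitute the local chunk back in (as below) or use an autograd-aware gather.

```python
import torch
import torch.distributed as dist

def gather_features(local_image_feats, local_text_feats):
    """All-gather per-GPU embeddings so every rank can build the full N x N similarity matrix."""
    world_size = dist.get_world_size()
    image_list = [torch.zeros_like(local_image_feats) for _ in range(world_size)]
    text_list = [torch.zeros_like(local_text_feats) for _ in range(world_size)]
    dist.all_gather(image_list, local_image_feats)
    dist.all_gather(text_list, local_text_feats)

    # Keep gradients flowing through this rank's own chunk.
    rank = dist.get_rank()
    image_list[rank] = local_image_feats
    text_list[rank] = local_text_feats
    return torch.cat(image_list, dim=0), torch.cat(text_list, dim=0)
```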
Dataset: WebImageText
The WebImageText dataset contains 400M image-text pairs filtered from 5 billion candidates. Filtering removes duplicates using perceptual hashing, excludes images smaller than 200×200 pixels, and requires English captions of 5 to 75 words. Text must pass safety filters and cannot consist predominantly of proper nouns. Unlike curated datasets, WebImageText preserves natural-language diversity: "a dog playing fetch" rather than just "dog".
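Expressed as a filter predicate over candidate pairs, the stated criteria look roughly like the following. This is illustrative only; the actual WebImageText pipeline is not public, and `is duplicate`, `passes_safety_filter`, `fraction_proper_nouns`, and the 0.5 threshold are stand-ins for unspecified components.

```python
def keep_pair(image_width, image_height, caption, image_hash, seen_hashes,
              passes_safety_filter, fraction_proper_nouns):
    """Illustrative filter matching the criteria described above (not the real pipeline)."""
    if image_hash in seen_hashes:                   # perceptual-hash deduplication
        return False
    if image_width < 200 or image_height < 200:     # minimum resolution
        return False
    n_words = len(caption.split())                  # caption length (language check omitted)
    if not (5 <= n_words <= 75):
        return False
    if not passes_safety_filter:                    # safety filtering
        return False
    if fraction_proper_nouns > 0.5:                 # not predominantly proper nouns (threshold assumed)
        return False
    return True
```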
Data augmentation during training uses random square crops from resized images at 224×224 (or 336×336 for the ViT-L/14@336px model). No other augmentations are applied, since aggressive augmentation can break image-text alignment; the text side is not augmented.
Zero-Shot Transfer Capabilities
CLIP enables zero-shot classification by converting labels to natural language. For ImageNet classes, templates like "a photo of a {class}" create text embeddings. The model classifies by finding the highest similarity between the image embedding and all class text embeddings. An ensemble of prompts improves performance further: averaging text embeddings from "a photo of a {}", "a photograph of a {}", "an image of {}", and so on.
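A condensed zero-shot sketch using the open-source openai/CLIP package; the class names, templates, and "example.jpg" path are placeholders, not part of the original recipe.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "car"]                                  # placeholder label set
templates = ["a photo of a {}", "a photograph of a {}", "an image of a {}"]

with torch.no_grad():
    # Build one embedding per class by averaging over prompt templates.
    zeroshot_weights = []
    for name in class_names:
        prompts = clip.tokenize([t.format(name) for t in templates]).to(device)
        emb = model.encode_text(prompts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        emb = emb.mean(dim=0)
        zeroshot_weights.append(emb / emb.norm())
    zeroshot_weights = torch.stack(zeroshot_weights, dim=1)          # (embed_dim, num_classes)

    # Classify an image by cosine similarity against every class embedding.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder path
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ zeroshot_weights).softmax(dim=-1)
    print(dict(zip(class_names, probs[0].tolist())))
```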
Performance scales predictably with compute. On ImageNet zero-shot, ViT-B/32 achieves 63.2%, ViT-B/16 reaches 68.6%, and ViT-L/14@336px achieves 76.2% accuracy. This rivals supervised ResNet-50 (76.5%) without seeing any ImageNet training examples. Robustness evaluations show CLIP maintains performance on ImageNet-Sketch (48.3%), ImageNet-R (77.7%), and ObjectNet (55.8%) where supervised models drop significantly.
Implementation Details
Image preprocessing normalizes pixel values channel-wise with mean=[0.48145466, 0.4578275, 0.40821073] and std=[0.26862954, 0.26130258, 0.27577711]; these are CLIP's own normalization constants rather than the standard ImageNet statistics, and the result is approximately zero-mean and unit-variance rather than strictly [-1, 1]. Text preprocessing lowercases all text and applies ftfy for Unicode fixes. Sequences are truncated at 77 tokens, with the [EOS] token placed at the final position (index 76) when truncation occurs.
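In torchvision terms, the image pipeline corresponds roughly to the following sketch (the reference implementation bundles an equivalent `preprocess` transform with `clip.load`; RGB conversion of PIL images is omitted here):

```python
from torchvision import transforms
from torchvision.transforms import InterpolationMode

# Approximation of CLIP's inference-time preprocessing for 224x224 models.
clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),   # scales pixels to [0, 1], CHW layout
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
```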
Memory optimization techniques include gradient checkpointing (recomputing activations during backward pass), mixed precision training, and efficient attention implementations. The largest ViT-L/14@336px model has 427M parameters: 303M in vision transformer, 124M in text transformer. FLOPs per forward pass: 88.2 GFLOPs for image encoding, 6.4 GFLOPs for text encoding.
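Gradient checkpointing trades compute for memory by recomputing activations during the backward pass. In PyTorch it can be applied per transformer block, roughly as follows; the `blocks` list is a hypothetical stack of residual blocks, not CLIP's actual module layout.

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x, use_checkpointing=True):
    """Run a stack of transformer blocks, recomputing activations in the backward pass.

    Only the inputs to each checkpointed block are kept in memory; intermediate
    activations inside the block are recomputed when gradients are needed.
    """
    for block in blocks:
        if use_checkpointing and x.requires_grad:
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x
```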
Training Dynamics and Convergence
Loss curves show rapid initial learning followed by slow, steady improvement. Image-text alignment emerges within the first 1M steps, and semantic clustering in the embedding space becomes apparent by 5M steps. Final models train for 32 epochs over the 400M-pair dataset, totaling 12.8B image-text pairs seen.
The temperature parameter τ adapts during training, starting at 0.07 and converging to around 0.01, which sharpens the similarity distribution. A lower temperature up-weights hard negatives in the softmax, improving fine-grained discrimination. Gradient clipping at norm 1.0 prevents instabilities from large batches.
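A common way to parameterize the learnable temperature is to store the log of the inverse temperature (a "logit scale") and exponentiate it when computing similarities. The sketch below is consistent with the 0.07 initialization described above; the clamp bound of 100 (i.e. τ ≥ 0.01) is an assumption used here to keep the temperature from collapsing further.

```python
import math
import torch
import torch.nn as nn

# Learnable temperature, stored as the log of the inverse temperature (logit scale).
logit_scale = nn.Parameter(torch.ones([]) * math.log(1 / 0.07))   # tau starts at 0.07

def current_scale(logit_scale):
    # Clamp so similarity logits are never scaled by more than 100 (i.e. tau >= 0.01, assumed bound).
    return logit_scale.exp().clamp(max=100.0)

print(1.0 / current_scale(logit_scale).item())   # ~0.07 at initialization
```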