One-Line Summary: Text-to-image generation synthesizes photorealistic or artistic images from natural language prompts using diffusion models guided by vision-language embeddings, with DALL-E, Stable Diffusion, and Midjourney as leading systems.

Prerequisites: Diffusion models, latent spaces, autoencoders, CLIP, U-Net architecture, classifier-free guidance, text encoders

What Is Text-to-Image Generation?

Picture a painter who takes verbal commissions: you describe "a cat wearing a top hat, sitting on a crescent moon, oil painting style," and the painter creates exactly that image. Text-to-image generation automates this creative process. A model takes a text prompt as input and produces a novel image that faithfully reflects the description, including objects, attributes, spatial relationships, and artistic style.

Technically, the task is conditional image generation: given a text string $c$, generate an image $x$ such that $x \sim p(x \mid c)$. Modern approaches use diffusion models that learn to iteratively denoise random Gaussian noise into a coherent image, conditioned on text embeddings from a language model. The breakthrough came from combining three ingredients: large-scale diffusion training, CLIP-derived text conditioning, and latent-space compression.

How It Works

The Diffusion Framework

Text-to-image diffusion models learn to reverse a noising process. During training:

  1. Take a clean image $x_0$
  2. Add Gaussian noise over $T$ steps: $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
  3. Train a neural network $\epsilon_\theta(x_t, t, c)$ to predict the noise $\epsilon$ given the noisy image $x_t$, timestep $t$, and text conditioning $c$

At inference, start from pure Gaussian noise $x_T \sim \mathcal{N}(0, I)$ and iteratively denoise with the learned model, typically over 20-50 steps using DDIM or other accelerated samplers (full ancestral DDPM sampling uses the original ~1000 steps).
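
A minimal sketch of this training loop in PyTorch; the model, text encoder, and noise schedule passed in are illustrative placeholders (any $\epsilon$-prediction network with this call signature would do), not a specific library API:

```python
import torch
import torch.nn.functional as F

def training_step(model, text_encoder, images, captions, alphas_cumprod):
    """One epsilon-prediction training step for a text-conditioned diffusion model.
    model(x_t, t, text_emb) predicts the added noise; text_encoder is frozen."""
    # 1. Clean images x_0 and their text conditioning c
    text_emb = text_encoder(captions)
    # 2. Sample a timestep and noise, then form x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    t = torch.randint(0, len(alphas_cumprod), (images.shape[0],), device=images.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(images)
    x_t = abar.sqrt() * images + (1.0 - abar).sqrt() * eps
    # 3. Train the network to predict the added noise (simple MSE objective)
    eps_pred = model(x_t, t, text_emb)
    return F.mse_loss(eps_pred, eps)
```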

Classifier-Free Guidance

The most important practical technique for text-to-image quality. During training, the text condition $c$ is randomly dropped (replaced with a null embedding $\varnothing$) some fraction of the time (typically 10%). At inference, the model computes both conditional and unconditional predictions and extrapolates:

$$\tilde{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \big( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \big)$$

where $w$ is the guidance scale. Higher $w$ produces images more aligned with the text prompt but with reduced diversity; Stable Diffusion typically uses $w \approx 7.5$.
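
In code, the guided prediction is just two forward passes combined linearly. A minimal sketch, assuming any noise-prediction network with the $\epsilon_\theta(x_t, t, \cdot)$ signature used above; `null_emb` stands for the embedding of the empty prompt:

```python
import torch

def guided_noise(model, x_t, t, text_emb, null_emb, w=7.5):
    # Two forward passes: conditioned on the prompt, and on the null/empty prompt.
    eps_cond = model(x_t, t, text_emb)
    eps_uncond = model(x_t, t, null_emb)
    # Classifier-free guidance: extrapolate away from the unconditional prediction.
    # w = 1 recovers the purely conditional model; larger w strengthens prompt adherence.
    return eps_uncond + w * (eps_cond - eps_uncond)
```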

Key Systems

DALL-E 2 (OpenAI, April 2022):

  • Two-stage: a prior maps CLIP text embeddings to CLIP image embeddings, then a diffusion decoder generates pixels from the image embedding
  • 3.5B parameter diffusion model, 64px base + two super-resolution stages to 1024px
  • Demonstrated strong text-image alignment but struggled with text rendering and complex compositions

Stable Diffusion (Stability AI / CompVis, August 2022):

  • Latent diffusion model (LDM): operates in the latent space of a pretrained variational autoencoder
  • Compresses 512x512 images to 64x64 latent representations (8x spatial compression), reducing compute by ~50x
  • Text conditioning via CLIP ViT-L/14 text encoder (Stable Diffusion v1.x) or OpenCLIP ViT-H (v2.x)
  • U-Net with cross-attention layers in which spatial features (queries) attend to text token embeddings (keys and values): $\text{Attention}(Q, K, V) = \text{softmax}(QK^\top / \sqrt{d})\, V$, where $Q$ is projected from U-Net feature maps and $K, V$ from the text encoder output (see the sketch after this list)
  • Open-source release of the weights, trained on subsets of the LAION-5B dataset, catalyzed an explosion of community development
  • SDXL (2023): 2.6B-parameter U-Net (about 6.6B parameters for the full base-plus-refiner pipeline), dual text encoders (CLIP ViT-L + OpenCLIP ViT-bigG), 1024px native resolution
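
A minimal sketch of the cross-attention block referenced above, assuming PyTorch; the dimensions are illustrative (SD v1.x conditions on 768-dim CLIP text embeddings), and this is a simplified single-head version of what the U-Net actually uses:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Spatial features attend to text tokens: Q from image features, K/V from text."""
    def __init__(self, feat_dim=320, text_dim=768, attn_dim=320):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, attn_dim, bias=False)
        self.to_k = nn.Linear(text_dim, attn_dim, bias=False)
        self.to_v = nn.Linear(text_dim, attn_dim, bias=False)
        self.to_out = nn.Linear(attn_dim, feat_dim)
        self.scale = attn_dim ** -0.5

    def forward(self, spatial_feats, text_emb):
        # spatial_feats: (batch, h*w, feat_dim), a flattened U-Net feature map
        # text_emb:      (batch, seq_len, text_dim), the text encoder output
        q = self.to_q(spatial_feats)
        k = self.to_k(text_emb)
        v = self.to_v(text_emb)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)
```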

Stable Diffusion 3 / SD3.5 (2024):

  • Replaces U-Net with a Multimodal Diffusion Transformer (MMDiT)
  • Uses three text encoders: CLIP ViT-L, OpenCLIP ViT-bigG, and T5-XXL (4.7B parameters)
  • The T5 encoder dramatically improves text rendering and complex prompt understanding
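
A rough, shape-level sketch of how the three encoder outputs can be combined into one conditioning sequence; the concatenation scheme and dimensions here are an assumption based on the SD3 report, not a verified implementation:

```python
import torch
import torch.nn.functional as F

def combine_text_embeddings(clip_l_hidden, clip_g_hidden, t5_hidden):
    # Assumed shapes: clip_l_hidden (B, 77, 768), clip_g_hidden (B, 77, 1280),
    # t5_hidden (B, T, 4096) where T is the T5 token count.
    # Concatenate the two CLIP hidden states along the channel axis ...
    clip_cat = torch.cat([clip_l_hidden, clip_g_hidden], dim=-1)               # (B, 77, 2048)
    # ... zero-pad them up to the T5 channel width ...
    clip_cat = F.pad(clip_cat, (0, t5_hidden.shape[-1] - clip_cat.shape[-1]))  # (B, 77, 4096)
    # ... and append the T5 tokens along the sequence axis.
    return torch.cat([clip_cat, t5_hidden], dim=1)                             # (B, 77 + T, 4096)
```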

DALL-E 3 (OpenAI, October 2023):

  • Trained on highly descriptive captions (generated by a dedicated captioning model) rather than raw alt-text
  • Native integration with ChatGPT for prompt refinement
  • Significantly improved text rendering, spatial composition, and prompt adherence compared to DALL-E 2

Midjourney (v5-v6):

  • Proprietary architecture, no published details
  • Known for strong aesthetic quality and artistic coherence
  • Operates as a commercial service via Discord and web interface

FLUX (Black Forest Labs, 2024):

  • Built by former Stability AI researchers
  • Transformer-based architecture with flow matching instead of traditional diffusion
  • FLUX.1 [dev] is open-weight; demonstrates strong prompt adherence and image quality
  • Represents the trend toward flow-matching objectives replacing DDPM/DDIM
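
To illustrate the flow-matching objective mentioned above, here is a simplified rectified-flow training loss under common conventions; the exact path, loss weighting, and time sampling FLUX uses are not public, so treat this as a generic sketch:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x_0, text_emb):
    """Conditional flow matching with a straight-line (rectified flow) path.
    model(x_t, t, text_emb) predicts a velocity rather than noise."""
    noise = torch.randn_like(x_0)
    t = torch.rand(x_0.shape[0], device=x_0.device).view(-1, 1, 1, 1)
    # Linear interpolation between data and noise defines the probability path.
    x_t = (1 - t) * x_0 + t * noise
    # Along a straight path the target velocity is constant: noise - data.
    v_target = noise - x_0
    v_pred = model(x_t, t.flatten(), text_emb)
    return F.mse_loss(v_pred, v_target)
```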

Why It Matters

  1. Creative industries: Text-to-image generation is transforming graphic design, advertising, concept art, and game development by enabling rapid prototyping from text descriptions.
  2. Data augmentation: Synthetic images can augment training datasets for downstream vision tasks, especially for rare categories.
  3. Accessible content creation: Non-artists can generate visual content, democratizing image creation.
  4. Scientific visualization: Generating visual hypotheses for molecular structures, architectural designs, and medical scenarios.
  5. Driving fundamental research: The engineering challenges of text-to-image (scaling, conditioning, evaluation) have advanced diffusion models, representation learning, and multimodal understanding broadly.

Key Technical Details

  • Latent space: Stable Diffusion's VAE compresses 512x512x3 images to 64x64x4 latents; SDXL uses 128x128x4 for 1024px images
  • Inference cost: ~4 seconds for 50 sampling steps on an A100 (Stable Diffusion 1.5); ~2 seconds with 20-step DDIM; ~8 seconds for SDXL
  • Training data: Stable Diffusion v1.5 trained on ~2B image-text pairs from LAION-5B; DALL-E 3 on proprietary data with synthetic captions
  • FID scores: Stable Diffusion achieves ~8-12 FID on COCO-30K (lower is better); human preference evaluations have largely supplanted FID as the primary quality metric
  • Guidance scale trade-off: Higher guidance (w > 10) improves text alignment but causes saturation artifacts and reduced diversity; w = 7-8 is typical
  • Negative prompts: Specifying what to avoid ("blurry, low quality, deformed") substantially improves output quality in practice (see the usage sketch after this list)
  • ControlNet (2023): Adds spatial conditioning (edges, depth, pose) to Stable Diffusion with zero-initialized convolutions, enabling precise layout control without retraining the base model
  • LoRA fine-tuning: Low-rank adaptation enables style or subject customization with ~4MB adapter weights and 20-30 minutes of training on a single GPU
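
A usage sketch tying together the practical knobs above (negative prompt, guidance scale, optional LoRA), assuming the Hugging Face diffusers library and a Stable Diffusion 1.5 checkpoint; the model ID and LoRA path are illustrative:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Optional: apply a LoRA adapter for style or subject customization (path is illustrative).
# pipe.load_lora_weights("path/to/my_style_lora.safetensors")

image = pipe(
    prompt="a cat wearing a top hat, sitting on a crescent moon, oil painting style",
    negative_prompt="blurry, low quality, deformed",  # steer away from common failure modes
    guidance_scale=7.5,        # classifier-free guidance weight w
    num_inference_steps=30,    # accelerated sampler steps
).images[0]
image.save("cat_moon.png")
```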

Common Misconceptions

  • "Text-to-image models understand language." They learn correlations between text patterns and visual patterns. Models still struggle with negation ("a room without chairs"), precise counting ("exactly five apples"), and complex spatial relationships ("A is to the left of B, which is behind C").

  • "Higher resolution means better quality." Upscaling adds detail but cannot fix compositional errors. A 1024px image with wrong spatial relationships is not better than a correct 512px image.

  • "These models copy training images." Studies show that direct memorization of training images occurs but is rare (<1% of generations for Stable Diffusion). Most outputs are novel compositions that recombine learned visual concepts.

  • "Prompt engineering is unnecessary." The same concept described in different ways produces dramatically different results. Descriptive, detailed prompts consistently outperform short ones. The introduction of synthetic captions in DALL-E 3 explicitly addresses this gap.

Connections to Other Concepts

  • clip.md: Provides text encoders for conditioning and is used in CLIP-guided generation and evaluation (CLIPScore).
  • image-captioning.md: The inverse task. Captioning models (BLIP) are used to generate training captions for text-to-image models (DALL-E 3).
  • diffusion-models.md: Text-to-image is the highest-profile application of diffusion models, driving most of the architectural innovation in this space.
  • vision-foundation-models.md: Text-to-image models learn rich visual representations; their internal features can be repurposed for discriminative tasks.
  • generative-adversarial-networks.md: GANs (StyleGAN, BigGAN) were the prior paradigm for image generation; diffusion models surpassed them in diversity and training stability.

Further Reading

  • Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (2022) -- Latent diffusion / Stable Diffusion architecture.
  • Ramesh et al., "Hierarchical Text-Conditional Image Generation with CLIP Latents" (2022) -- DALL-E 2.
  • Betker et al., "Improving Image Generation with Better Captions" (2023) -- DALL-E 3 technical report.
  • Podell et al., "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis" (2023) -- Stable Diffusion XL.
  • Ho and Salimans, "Classifier-Free Diffusion Guidance" (2022) -- The paper that made text-to-image diffusion practical.
  • Zhang et al., "Adding Conditional Control to Text-to-Image Diffusion Models" (2023) -- ControlNet for spatial conditioning.