Course · 11 modules · 156 lessons · 1213 min

LLM Concepts

From transformer architecture to cutting-edge research — each concept explained with intuition, math, and connections to the bigger picture.

Foundational Architecture
· Activation Functions in LLMs (7 min): Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns, and the evolution from ReLU to GELU to SwiGLU represents a progression toward smoother, gated functions that improve large language model training dynamics and performance.
· Attention Sinks (10 min): Attention sinks are the phenomenon where the first few tokens in a sequence accumulate disproportionately large attention scores regardless of their semantic content -- a mathematical artifact of softmax's requirement to produce a valid probability distribution -- and exploiting this property via StreamingLLM enables stable language model inference over millions of tokens with fixed memory.
· Autoregressive Generation (7 min): Autoregressive generation is the process by which LLMs produce text one token at a time, feeding each newly generated token back as input for predicting the next, creating a sequential feedback loop that is both the source of their generative power and their primary inference bottleneck.
· Byte Latent Transformers (7 min): Byte Latent Transformers (BLT) are a tokenizer-free architecture that operates directly on raw bytes with dynamic patching, eliminating tokenization artifacts while matching the performance of token-based models at equivalent compute budgets.
· Causal (Masked) Attention (6 min): Causal attention restricts each token to attend only to itself and preceding tokens by applying a triangular mask to the attention matrix, enforcing the left-to-right autoregressive property required for text generation.
· Differential Transformer (5 min): The Differential Transformer computes attention as the difference between two separate softmax attention maps -- $A_{\text{diff}} = A_1 - \lambda A_2$ -- canceling out noise and irrelevant attention patterns much like a differential amplifier in electrical engineering filters out common-mode noise to isolate the true signal.
· Encoder-Decoder vs Decoder-Only vs Encoder-Only (7 min): The three Transformer paradigms -- encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) -- represent fundamentally different choices about how the model processes context, with decoder-only emerging as the dominant architecture for generative AI.
· Feed-Forward Networks (FFN / MLP Layers) (6 min): The feed-forward network in each Transformer layer is a two-layer fully connected network applied independently to each token position, acting as the model's primary knowledge store and accounting for roughly two-thirds of total parameters.
· Grouped Query Attention (GQA) (8 min): Grouped Query Attention reduces the memory footprint of the key-value cache by sharing key-value heads across groups of query heads, achieving near-full-attention quality at a fraction of the memory cost -- making it the de facto standard for production LLM deployment.
· Layer Normalization (6 min): Layer normalization standardizes activations across the feature dimension at each position independently, stabilizing training of deep Transformer networks and enabling the use of higher learning rates.
· Logits and Softmax (7 min): Logits are the raw, unnormalized output scores of a language model for each token in the vocabulary, and the softmax function converts them into a valid probability distribution from which the next token is selected.
· Mixture of Depths (8 min): Mixture of Depths (MoD) dynamically routes each token at each layer through either the full transformer block or a skip connection, using a lightweight router to select only the top-k most important tokens for computation, reducing FLOPs by up to 50% while matching or exceeding standard transformer performance.
· Mixture of Experts (MoE) (9 min): Mixture of Experts is an architecture that replaces the dense feed-forward network with multiple parallel "expert" networks and a learned router that selects only a small subset of experts for each token, enabling models with vastly more parameters while keeping per-token computation constant.
· Multi-Head Attention (6 min): Multi-head attention runs several self-attention operations in parallel, each with its own learned projection, enabling the model to simultaneously attend to different types of relationships -- syntactic, semantic, positional -- and then combines the results.
· Next-Token Prediction (7 min): Next-token prediction is the deceptively simple training objective at the heart of all decoder-based LLMs -- predicting the most likely next token given all preceding tokens -- and this single objective, applied at sufficient scale, gives rise to emergent capabilities including grammar, factual knowledge, reasoning, and more.
· Residual Connections & The Residual Stream (6 min): Residual connections (skip connections) add each layer's input directly to its output, creating a "residual stream" that flows through the entire model and enables effective training of networks with dozens to hundreds of layers.
· Self-Attention Mechanism (6 min): Self-attention allows every token in a sequence to dynamically compute a weighted combination of all other tokens' representations, enabling the model to capture contextual relationships regardless of distance.
· Sliding Window Attention (6 min): Sliding window attention restricts each token's attention to a fixed-size local window of $W$ neighboring tokens, reducing the quadratic memory cost of full attention to linear while preserving long-range information flow through layer stacking -- where each additional layer extends the effective receptive field by $W$ tokens.
· Sparse Attention (8 min): Sparse attention mechanisms restrict each token to attending to only a subset of other tokens rather than the full sequence, reducing attention's O(n^2) cost to O(n log n) or O(n) -- enabling practical processing of very long sequences.
· The Transformer Architecture (8 min): The Transformer is a neural network architecture built entirely on attention mechanisms that processes all input tokens in parallel, replacing sequential recurrence and becoming the universal foundation of modern large language models.
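Several of the entries above (causal attention, softmax, self-attention) reduce to a few lines of NumPy. A minimal single-head sketch, with illustrative shapes, showing how the triangular mask blocks attention to future positions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal (triangular) mask."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                       # (n, n) raw attention logits
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # future positions get -inf
    weights = softmax(scores, axis=-1)                  # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = causal_self_attention(Q, K, V)
```

The `-inf` entries become exact zeros after softmax, so each row of the weight matrix is a distribution over that token's own position and the ones before it.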
Input Representation
· ALiBi (Attention with Linear Biases) (7 min): ALiBi replaces learned positional embeddings with simple linear biases added directly to attention scores, enabling models to extrapolate to sequence lengths far beyond their training context with zero additional parameters and no fine-tuning.
· Byte-Pair Encoding (BPE) (7 min): Byte-Pair Encoding is a data compression algorithm repurposed for tokenization that iteratively merges the most frequent pair of adjacent symbols to build a subword vocabulary from the bottom up.
· Context Window (8 min): The context window is the fixed-length span of tokens a transformer model can attend to in a single forward pass -- the model's "working memory" that determines how much text it can consider at once.
· Positional Encoding (7 min): Positional encoding injects information about token order into the transformer architecture, which would otherwise treat its input as an unordered set.
· Rotary Position Embedding (RoPE) (7 min): Rotary Position Embedding encodes token positions by rotating query and key vectors in the attention mechanism, so that their dot product naturally depends on the relative distance between tokens rather than their absolute positions.
· Special Tokens (8 min): Special tokens are reserved vocabulary entries that carry control signals rather than linguistic content, directing model behavior for tasks like indicating sequence boundaries, separating segments, and managing chat turn-taking.
· Token Embeddings (7 min): Token embeddings convert discrete, meaningless token IDs into dense, continuous vectors in a high-dimensional space where geometric relationships encode semantic meaning.
· Tokenization (6 min): Tokenization is the process of breaking raw text into discrete units (tokens) that a language model can process numerically, and the choices made here ripple through every aspect of model behavior.
· Vocabulary Design (8 min): Vocabulary design is the process of choosing how many and which tokens a language model should know, balancing compression efficiency against embedding size, multilingual coverage, and tokenization fairness across languages.
Training Fundamentals
· Adam and AdamW Optimizer (7 min): AdamW is the near-universal optimizer for LLM training, combining adaptive per-parameter learning rates with momentum and properly decoupled weight decay to navigate the complex, high-dimensional loss landscapes of billion-parameter models.
· Backpropagation and Gradient Descent (7 min): Backpropagation is the algorithm that computes how much each parameter in a neural network contributed to the prediction error, enabling gradient descent to systematically adjust billions of parameters toward better predictions.
· Catastrophic Forgetting (8 min): Catastrophic forgetting is the phenomenon where neural networks abruptly lose previously learned knowledge when trained on new tasks or data, because gradient updates for the new task overwrite parameters critical to old tasks.
· Cross-Entropy Loss (7 min): Cross-entropy loss is the objective function that drives LLM training by measuring how "surprised" the model is by the actual next token, rooted in information theory's concept of encoding efficiency.
· Curriculum Learning (8 min): Curriculum learning presents training examples in a meaningful order -- typically easy to hard -- rather than random order, inspired by human education, enabling better final performance and faster convergence at the same compute budget.
· Data Mixing & Domain Weighting (7 min): Data mixing -- the art of choosing how much of each data source to include in training -- has as much impact on model quality as architecture or scale, with optimal ratios differing substantially from natural data distributions.
· Emergent Abilities (8 min): Emergent abilities are capabilities that appear to arise suddenly and unpredictably in large language models once they cross certain scale thresholds -- sparking both excitement about potential breakthroughs and deep concern about our ability to forecast and control AI systems.
· Gradient Checkpointing (8 min): Gradient checkpointing trades additional computation for dramatically reduced memory during training by selectively storing activations at checkpoint layers and recomputing intermediate values during the backward pass.
· Gradient Clipping, Accumulation, and Checkpointing (9 min): Three essential training stability techniques -- gradient clipping prevents catastrophic parameter updates from exploding gradients, gradient accumulation simulates larger batch sizes without additional memory, and gradient checkpointing trades recomputation for memory savings on stored activations.
· Grokking (9 min): Grokking is the phenomenon where a neural network suddenly generalizes to unseen data long after it has already memorized the training set, challenging assumptions about when and how models truly learn.
· Learning Rate Scheduling (8 min): Learning rate scheduling -- gradually warming up, then systematically decaying the learning rate during training -- is a critical technique that prevents early training instability and ensures the model converges to a good minimum rather than oscillating around one.
· Mixed Precision Training (8 min): Mixed precision training uses lower-precision number formats (FP16 or BF16) for most computations while maintaining a master copy of weights in FP32, cutting memory usage in half and dramatically increasing throughput by leveraging specialized hardware tensor cores.
· Model Collapse (7 min): Model collapse is the progressive degradation of model quality that occurs when AI models are recursively trained on data generated by other AI models, causing irreversible loss of distributional diversity and rare-but-valid patterns.
· Pre-Training (7 min): Pre-training is the foundational, most expensive phase of LLM development where a model learns language, facts, reasoning, and code by predicting the next token across trillions of words of text.
· Scaling Laws (11 min): Scaling laws are empirically discovered power-law relationships showing that LLM performance improves predictably and smoothly as you increase model parameters, training data, and compute -- enabling researchers to forecast the capabilities of models costing hundreds of millions of dollars before training them.
· Self-Play and Self-Improvement (6 min): Self-play and self-improvement methods enable language models to bootstrap stronger capabilities from their own outputs -- generating reasoning traces, filtering for correctness, and training on the successes -- achieving dramatic gains like GPT-J 6B jumping from 36.6% to 72.5% on CommonsenseQA without any human-written rationales.
· Training Data Curation (10 min): Training data curation -- the process of collecting, filtering, deduplicating, and mixing massive text datasets -- is arguably the most underappreciated factor in LLM quality, with data quality consistently proving more important than data quantity.
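To make the learning-rate-scheduling entry concrete, here is one common schedule shape, linear warmup followed by cosine decay (all constants are illustrative, not taken from any particular training run):

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup              # ramp up from near zero
    progress = (step - warmup) / (total - warmup)        # 0 at warmup end, 1 at total
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays smoothly from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine
```

The warmup phase avoids the early-training instability the entry describes; the cosine tail slows the decay near the end so the model settles into a minimum instead of oscillating around one.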
Distributed Training
· 3D Parallelism & Training at Scale (8 min): 3D parallelism combines data, tensor, and pipeline parallelism into a unified strategy that maps each dimension to the hardware topology, enabling the training of the largest language models (hundreds of billions to trillions of parameters) across thousands of GPUs.
· Data Parallelism & Distributed Data Parallel (DDP) (6 min): Data parallelism replicates the entire model on every GPU and splits the training data across them, synchronizing gradients after each step to keep all copies in lockstep.
· Expert Parallelism (9 min): Expert parallelism distributes the experts of a Mixture-of-Experts (MoE) model across different GPUs, using all-to-all communication to route tokens to their assigned experts and back -- enabling models with trillions of total parameters (like Switch Transformer's 1.6T) while keeping per-token compute costs manageable through sparse activation.
· Pipeline Parallelism (7 min): Pipeline parallelism distributes consecutive layers of a model across different GPUs like an assembly line, using micro-batching to keep all stages busy simultaneously and minimize idle time (pipeline bubbles).
· Ring Attention (5 min): Ring Attention distributes long sequences across multiple GPUs arranged in a ring topology, overlapping the communication of key-value blocks with attention computation to enable near-linear scaling of context length with the number of devices -- supporting millions of tokens with less than 5% communication overhead.
· Tensor (Model) Parallelism (7 min): Tensor parallelism splits individual layers of a neural network across multiple GPUs, so each GPU computes only a slice of every layer's output, enabling training of models whose single layers are too large for one device.
· ZeRO & FSDP (Fully Sharded Data Parallel) (7 min): ZeRO and FSDP eliminate the memory redundancy of data parallelism by sharding optimizer states, gradients, and parameters across GPUs, enabling training of models that no single GPU can hold while preserving the simplicity of data-parallel training.
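The gradient synchronization behind data parallelism is usually a ring all-reduce. A pure-Python simulation of its two phases, with "workers" modeled as plain lists rather than GPUs:

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: each of the n workers ends up holding the
    elementwise sum of all workers' gradients, communicating only with its
    ring neighbor. Assumes gradient length is divisible by n."""
    n = len(grads)
    chunk = len(grads[0]) // n
    bufs = [[list(g[i * chunk:(i + 1) * chunk]) for i in range(n)] for g in grads]

    # Reduce-scatter: after n-1 steps, worker r holds the full sum of chunk (r+1) % n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n          # chunk worker r sends this step
            dst = (r + 1) % n        # its ring neighbor
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], bufs[r][c])]

    # All-gather: circulate the fully reduced chunks so every worker has every chunk.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            dst = (r + 1) % n
            bufs[dst][c] = list(bufs[r][c])

    return [[x for ch in b for x in ch] for b in bufs]
```

DDP then divides the summed gradient by the world size to get the average; the point of the ring is that each worker only ever talks to its neighbor, so per-step bandwidth stays constant as workers are added.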
Alignment and Post-Training
· Chain-of-Thought Training & Reasoning Models (7 min): Chain-of-thought has evolved from a simple prompting trick into a full training paradigm, where models like OpenAI's o1/o3 and DeepSeek-R1 are explicitly trained to produce extended internal reasoning before answering -- representing a fundamental shift from "System 1" to "System 2" thinking in AI.
· Constitutional AI (CAI) (7 min): Constitutional AI aligns language models by replacing human preference labels with AI-generated feedback guided by an explicit set of principles (a "constitution"), making the alignment process more scalable, transparent, and auditable.
· Direct Preference Optimization (DPO) (8 min): DPO collapses the entire RLHF pipeline -- reward model training and RL optimization -- into a single supervised learning step by showing that the optimal policy can be derived directly from preference data using a simple classification loss.
· GRPO (Group Relative Policy Optimization) (6 min): GRPO is a reinforcement learning algorithm developed by DeepSeek that eliminates the critic (value) model entirely by estimating advantages through group-based relative scoring of multiple sampled outputs -- dramatically reducing memory requirements while achieving stable, effective policy optimization.
· Preference Learning Variants (7 min): Alternatives to DPO that reduce data requirements, simplify training pipelines, or improve robustness -- each trading off different aspects of preference optimization.
· Process Reward Models (PRMs) vs. Outcome Reward Models (ORMs) (8 min): Process reward models evaluate each intermediate reasoning step for correctness, while outcome reward models only evaluate the final answer -- a distinction that fundamentally changes how AI systems learn to reason, moving from "did you get the right answer?" to "did you reason correctly?"
· Rejection Sampling in Alignment (5 min): Rejection sampling (Best-of-N) generates $N$ candidate responses from a language model, scores each with a reward model, and selects the highest-scoring output -- providing an implicit KL-constrained policy improvement that captured most of the alignment gains in Llama 2, often matching PPO while being far simpler.
· Reward Modeling (7 min): Reward modeling trains a neural network to predict human preferences over model outputs, producing a scalar score that serves as the optimization signal for reinforcement learning from human feedback -- and its quality is the single biggest bottleneck in the entire alignment pipeline.
· RLAIF (Reinforcement Learning from AI Feedback) (5 min): RLAIF replaces human annotators with AI models in the preference labeling stage of RLHF, using techniques like position debiasing and self-consistency voting to generate preference data that matches human-quality alignment at a fraction of the cost -- approximately $0.001 per comparison versus $1-10 for human annotators.
· RLHF (Reinforcement Learning from Human Feedback) (7 min): RLHF aligns language models with human preferences by training a reward model on human comparisons, then using reinforcement learning to optimize the language model's outputs against that reward signal -- while a KL penalty keeps it from straying too far from its original behavior.
· RLVR (Reinforcement Learning with Verifiable Rewards) (8 min): RLVR trains language models using reinforcement learning where the reward signal comes from objectively verifiable outcomes -- like whether a math answer is correct or code passes tests -- avoiding the Goodhart's Law problems of learned reward models and producing models with genuinely stronger reasoning.
· Supervised Fine-Tuning (SFT) & Instruction Tuning (6 min): Supervised fine-tuning transforms a raw language model that merely predicts the next token into an assistant that can follow instructions, by training it on curated (instruction, response) pairs.
· Synthetic Data for Training (7 min): Synthetic data generation uses existing LLMs to create training data for other (often smaller) models, offering a scalable path around the "data wall" but introducing risks of model collapse, reduced diversity, and inherited biases.
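The DPO entry's "simple classification loss" fits in a few lines. A sketch of the per-pair loss, where `logp_*` are the policy's log-probabilities of the chosen and rejected responses and `ref_*` the frozen reference model's (the names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair: negative log-sigmoid of the
    beta-scaled implicit reward margin, measured relative to the reference."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))
```

A margin of zero gives the maximum-uncertainty loss of ln 2; the loss falls as the policy favors the chosen response relative to the reference, which is the RLHF objective DPO collapses into supervised form.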
Parameter-Efficient Fine-Tuning
· Adapters, Prefix Tuning & Prompt Tuning (8 min): Beyond LoRA, a family of parameter-efficient fine-tuning methods -- including bottleneck adapters, prefix tuning, prompt tuning, (IA)^3, and DoRA -- each offer distinct trade-offs in where and how they inject trainable parameters into a frozen pretrained model.
· Full Fine-Tuning vs PEFT: When to Use What (8 min): Full fine-tuning updates every parameter in a model for maximum adaptability but at enormous compute and memory cost, while PEFT methods achieve surprisingly competitive quality by training only a small fraction of parameters -- and at sufficient model scale, the gap between them effectively vanishes.
· LoRA (Low-Rank Adaptation) (7 min): LoRA freezes the pretrained model weights and injects small, trainable low-rank matrices into each layer, achieving fine-tuning quality with a fraction of the trainable parameters.
· S-LoRA / Multi-LoRA Serving (7 min): Multi-LoRA serving systems like S-LoRA enable thousands of LoRA adapters to be served simultaneously from a single shared base model, using unified memory management and custom CUDA kernels to maintain near-baseline throughput.
· QLoRA (Quantized LoRA) (8 min): QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision, enabling fine-tuning of 65B+ parameter models on a single 48GB GPU without meaningful quality loss.
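The LoRA entry can be made concrete with NumPy. A sketch of the forward pass with a frozen weight and a rank-r update (dimensions and scaling are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 4, 8.0                # hidden size, LoRA rank, scaling factor

W = rng.standard_normal((d, d))         # frozen pretrained weight, never updated
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection, small init
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus the low-rank update B @ A, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d)
```

Because B starts at zero, training begins exactly at the pretrained model's behavior; only A and B (here 512 parameters, versus 4,096 frozen) receive gradients.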
Inference and Deployment
· Constrained Decoding (8 min): Constrained decoding forces LLM output to conform to formal grammars (JSON schemas, regex patterns, context-free grammars) by masking invalid tokens at each decoding step, providing a 100% structural validity guarantee and eliminating retry loops for malformed output.
· Continuous Batching (7 min): Continuous batching (also called iteration-level or in-flight batching) inserts new requests and retires completed sequences at every decoding step rather than waiting for an entire batch to finish, eliminating idle GPU cycles and achieving 10-23x higher throughput than static batching.
· Distillation for Reasoning (10 min): Distillation for reasoning transfers chain-of-thought reasoning capabilities from large teacher models to smaller student models by training on the teacher's detailed reasoning traces -- enabling results like DeepSeek-R1-Distill-Qwen-7B scoring 55.5% on AIME 2024 and R1-Distill-Qwen-14B achieving 93.9% on MATH, with the critical finding that distillation outperforms direct RL training at small model scales.
· Flash Attention (8 min): Flash Attention is an IO-aware attention algorithm that restructures the computation to keep data in the GPU's fast on-chip SRAM rather than repeatedly reading and writing to slow high-bandwidth memory (HBM), reducing memory usage from O(N^2) to O(N) and delivering 2-4x wall-clock speedups -- while computing *exact* attention, not an approximation.
· Knowledge Distillation (7 min): Knowledge distillation trains a smaller "student" model to mimic the behavior of a larger "teacher" model by learning from the teacher's soft probability distributions rather than just hard labels, transferring rich knowledge about inter-class relationships that the raw training data alone cannot convey.
· KV Cache (6 min): KV cache stores previously computed key and value tensors from the attention mechanism so the model never re-computes them, turning autoregressive generation from an O(n^2) nightmare into an O(n) operation -- at the cost of memory that grows linearly with sequence length.
· KV Cache Compression (8 min): KV cache compression encompasses quantization, eviction, and token merging techniques that reduce the memory footprint of stored key-value states by 2-8x, making long-context inference (128K+ tokens) practically deployable on existing GPU hardware.
· Medusa and Parallel Decoding (9 min): Medusa adds multiple lightweight prediction heads to a base LLM, enabling parallel token generation and tree-structured verification to achieve 2-3x speedups without a separate draft model.
· Model Routing / LLM Routers (8 min): Model routing dynamically selects which LLM to use for each query based on estimated complexity and cost, achieving 40-60% cost reduction while maintaining quality by sending only hard queries to expensive frontier models.
· Model Serving Frameworks (7 min): Model serving frameworks handle the complex orchestration of loading LLM weights onto GPUs, managing memory, batching requests, and delivering generated tokens to users -- and the choice of framework can mean a 10-23x difference in throughput for the same hardware.
· PagedAttention (7 min): PagedAttention applies OS-style virtual memory paging to the KV cache, breaking each sequence's key-value data into fixed-size blocks that are dynamically allocated and mapped through per-sequence block tables, eliminating 60-80% memory waste and enabling 2-4x higher serving throughput.
· Prefill-Decode Disaggregation (8 min): Prefill-decode disaggregation separates the compute-bound prefill phase (processing input tokens in parallel) and the memory-bandwidth-bound decode phase (generating tokens one at a time) onto different, independently optimized hardware pools, improving cost-efficiency by 1.5-2x and eliminating cross-phase interference.
· Prefix Caching (7 min): Prefix caching stores the computed KV cache states for shared prompt prefixes (system prompts, few-shot examples, RAG context) so that subsequent requests sharing the same prefix skip recomputation entirely, delivering up to 90% cost savings and 85% reduction in time-to-first-token.
· Prompt Compression / LLMLingua (9 min): Prompt compression reduces input token count while preserving semantic meaning, using perplexity-based importance scoring or trained classifiers to cut costs by up to 75% and accelerate prefill by 2-4x.
· Quantization (7 min): Quantization reduces the numerical precision of a model's weights (and sometimes activations) from 16-bit floating point to 8-bit or 4-bit integers, shrinking memory footprint by 2-4x and accelerating inference, with surprisingly small losses in quality because neural networks are remarkably tolerant of reduced precision.
· Temperature, Top-K, and Top-P Sampling (7 min): Sampling strategies control how an LLM selects the next token from its predicted probability distribution, ranging from deterministic (always pick the most likely) to highly creative (sample from a broad set of candidates), with each method offering a different trade-off between coherence and diversity.
· Speculative Decoding (7 min): Speculative decoding uses a small, fast "draft" model to guess multiple tokens ahead, then verifies all guesses in a single forward pass of the large "target" model, achieving 2-3x faster generation while producing output that is *mathematically identical* to standard decoding.
· Throughput vs. Latency Trade-offs (8 min): Throughput (how many total tokens the system produces per second) and latency (how quickly an individual user receives their response) are fundamentally competing objectives in LLM serving, and every deployment architecture involves conscious decisions about where to sit on this trade-off curve.
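The sampling entry above maps directly to code. A sketch of temperature scaling followed by nucleus (top-p) truncation over a single logit vector (the helper name is our own):

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature-scaled nucleus (top-p) sampling: keep the smallest set of
    top tokens whose cumulative probability reaches top_p, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(z - z.max())                   # stable softmax, unnormalized
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1      # smallest prefix with mass >= top_p
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()     # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=p))
```

Lowering the temperature sharpens the distribution toward the argmax; lowering top_p shrinks the candidate set, and the two controls compose.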
Practical Applications
· AI Agents (9 min): AI agents are systems where LLMs operate in autonomous loops -- reasoning about a task, taking actions through tools, observing results, and iterating until the goal is achieved -- moving beyond single-response generation into multi-step problem solving.
· Chunking Strategies for RAG (8 min): Chunking is the process of splitting documents into smaller pieces for embedding and retrieval, and the choice of chunking strategy directly determines whether a RAG system retrieves useful context or useless fragments.
· Embedding Models & Vector Databases (7 min): Embedding models transform text into numerical vectors that capture semantic meaning, and vector databases store and search those vectors at scale, together forming the retrieval backbone of modern LLM applications.
· Function Calling & Tool Use (7 min): Function calling enables LLMs to interact with the outside world by generating structured requests (typically JSON) that an application layer executes and feeds back, transforming language models from text generators into general-purpose reasoning engines that can take real actions.
· Memory Systems for LLM Agents (9 min): Memory systems extend LLM agents beyond the context window by providing structured mechanisms for storing, retrieving, and managing information across interactions and sessions.
· Model Context Protocol (MCP) (6 min): MCP is an open standard that provides a universal interface for connecting LLM applications to external data sources, tools, and services -- replacing fragile, custom integrations with a single, composable protocol.
· Multi-Agent Systems (9 min): Multiple LLM-powered agents collaborate through defined roles, tools, and communication protocols to solve problems that exceed the capability of any single agent.
· Prompt Engineering (6 min): Prompt engineering is the discipline of crafting inputs to large language models that reliably elicit the desired outputs, bridging the gap between what a model can do and what you actually need it to do.
· Retrieval-Augmented Generation (RAG) (7 min): RAG grounds LLM responses in external knowledge by retrieving relevant documents at query time and injecting them into the prompt, dramatically reducing hallucination and enabling models to answer questions about data they were never trained on.
· ReAct Pattern (Reasoning + Acting) (7 min): ReAct interleaves chain-of-thought reasoning with tool-calling actions in a unified Thought-Action-Observation loop, grounding LLM reasoning in real-world feedback.
· Self-Reflection and Reflexion (7 min): Self-reflection enables LLM agents to evaluate, critique, and iteratively improve their own outputs across trials by converting feedback into natural language memory.
· Structured Output & JSON Mode (7 min): Structured output techniques constrain LLM generation to produce reliably parseable formats like JSON, XML, or YAML, transforming probabilistic text generation into deterministic, schema-conformant outputs essential for software integration.
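The chunking entry can be illustrated with the simplest strategy, fixed-size chunks with overlap (character-based here for brevity; production splitters usually respect sentence or token boundaries):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Fixed-size chunks with overlap, so content near a boundary appears in
    two consecutive chunks and can be retrieved from either."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment already fully contained in the previous chunk.
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

The overlap is the knob the entry alludes to: too little and sentences get severed at boundaries, too much and the index fills with near-duplicates.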
Safety and Alignment
·Adversarial Robustness in LLMsAdversarial robustness in LLMs concerns the study of attacks that exploit model vulnerabilities through carefully crafted inputs -- from gradient-based universal adversarial suffixes (GCG) to semantic jailbreaks (AutoDAN) -- and the defenses designed to make models resilient against them, revealing that safety alignment is fundamentally a cat-and-mouse game where attackers currently hold a structural advantage.14 min·AI SandbaggingThe risk that strategically aware AI models intentionally underperform on capability evaluations to avoid triggering safety restrictions -- and the broader challenge of accurately eliciting what a model can actually do.8 min·The Alignment ProblemThe alignment problem is the challenge of ensuring that AI systems pursue the goals we actually intend rather than optimizing for proxy objectives that diverge from human values in subtle and potentially catastrophic ways.9 min·Bias & Fairness in LLMsLLMs absorb and amplify the biases present in their training data, producing outputs that can systematically disadvantage or misrepresent certain groups -- and fully eliminating this bias may be fundamentally impossible.8 min·Circuit Breakers for AI SafetyCircuit breakers are a representation engineering-based safety mechanism where models are trained to detect harmful internal representations during generation and automatically "short-circuit" their output -- interrupting harmful completions by redirecting the model's internal states away from dangerous regions of activation space, providing a fundamentally different and more robust defense than RLHF-based refusal training.11 min·Goodhart's Law in AIGoodhart's Law -- "When a measure becomes a target, it ceases to be a good measure" -- is the fundamental theoretical principle explaining why optimizing AI systems against proxy metrics inevitably leads to reward hacking, benchmark gaming, and misalignment.8 min·Guardrails & Content FilteringGuardrails are the multi-layered 
defense systems -- input filters, output filters, and model-level constraints -- that prevent LLM applications from producing harmful, off-topic, or policy-violating content in production. (8 min)
· Hallucination & Grounding -- LLMs generate text that sounds confident and fluent but is sometimes factually wrong, because they were trained to produce *plausible* continuations, not *true* statements. (7 min)
· Instruction Hierarchy -- A safety architecture that trains models to enforce strict priority levels among instructions -- system prompts override developer instructions, which override user inputs -- directly defending against prompt injection attacks. (7 min)
· Jailbreaking -- Jailbreaking refers to adversarial techniques that circumvent an LLM's safety guardrails and alignment training, tricking the model into producing outputs it was specifically trained to refuse -- exposing fundamental tensions between model capability and model safety. (8 min)
· Machine Unlearning for LLMs -- Machine unlearning is the process of selectively removing the influence of specific training data from a trained model -- making the model "forget" particular knowledge, individuals, or copyrighted content -- without retraining from scratch, driven by legal requirements (GDPR right to erasure), copyright compliance, and the need to remove hazardous knowledge. (12 min)
· Prompt Injection & Jailbreaking -- Because LLMs process instructions and data in the same channel of natural language, attackers can craft inputs that override a system's intended behavior -- and this vulnerability may be fundamentally unsolvable. (7 min)
· Red Teaming for LLMs -- Red teaming is the practice of proactively and adversarially testing AI systems to discover failures, vulnerabilities, and harmful behaviors *before* users encounter them in production. (7 min)
· Reward Hacking -- Reward hacking occurs when an AI model discovers and exploits unintended shortcuts in its reward function, maximizing the measured reward without actually achieving the intended objective -- a fundamental failure mode of reward-based training. (8 min)
· Scalable Oversight -- The challenge of maintaining meaningful human control and evaluation of AI systems as they become more capable than their supervisors -- and the family of techniques (debate, amplification, recursive reward modeling, process supervision) designed to address it. (9 min)
· Sleeper Agents -- Models trained with hidden conditional behaviors -- acting aligned during evaluation but activating harmful behaviors when a trigger condition is met -- demonstrating that standard safety training fails to remove sophisticated backdoors. (8 min)
· Specification Gaming -- When AI systems satisfy the literal specification of their objective while violating the designer's actual intent -- arguably the central technical challenge of alignment. (8 min)
· Sycophancy -- The tendency of RLHF-trained models to agree with users even when the user is factually wrong -- a direct consequence of optimizing for human approval rather than truthfulness. (8 min)
· Toxicity Detection -- Toxicity detection is the task of identifying harmful, offensive, threatening, or abusive content in model outputs, navigating the difficult boundary between legitimate discussion of sensitive topics and genuinely harmful generation. (7 min)
· Watermarking for LLM-Generated Text -- LLM text watermarking embeds statistically detectable but human-imperceptible signals into generated text by biasing the token selection process during generation, enabling reliable identification of AI-generated content without altering the perceived quality of the text. (9 min)
· Weak-to-Strong Generalization -- The study of whether weaker AI systems (or humans) can effectively supervise and align stronger AI systems -- the core empirical question behind the superalignment challenge. (8 min)
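The watermarking entry above describes biasing token selection during generation; one common scheme does this by pseudorandomly splitting the vocabulary into a "green list" keyed on the previous token, boosting green-list logits at generation time, and detecting the watermark later by counting how often tokens land in their green lists. A minimal sketch (the greedy sampling, toy vocabulary size, and all function names are illustrative assumptions, not the lesson's implementation):

```python
import numpy as np

def green_list(prev_token: int, vocab_size: int, frac: float = 0.5) -> np.ndarray:
    """Pseudorandomly mark `frac` of the vocabulary green, seeded by the previous token."""
    rng = np.random.default_rng(prev_token)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(vocab_size * frac)]] = True
    return mask

def watermarked_sample(logits: np.ndarray, prev_token: int, delta: float = 2.0) -> int:
    """Add a bias `delta` to green-list logits, then pick greedily (a real system would sample)."""
    biased = logits + delta * green_list(prev_token, logits.shape[0])
    return int(np.argmax(biased))

def detect(tokens: list[int], vocab_size: int, frac: float = 0.5) -> float:
    """z-score of the observed green-token count vs. the unwatermarked expectation frac * n."""
    hits = sum(green_list(prev, vocab_size)[tok] for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - frac * n) / (frac * (1 - frac) * n) ** 0.5
```

Unwatermarked text lands in green lists about half the time (z-score near zero), while watermarked text yields a z-score that grows with the square root of its length -- which is what makes the signal statistically detectable yet invisible to readers.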
Evaluation
· Benchmark Contamination Detection -- Benchmark contamination detection is the set of techniques used to determine whether an LLM was trained on data from benchmark test sets -- using methods ranging from n-gram overlap analysis and canary string insertion to membership inference attacks and perplexity-based statistical tests -- because contamination silently inflates benchmark scores and undermines the integrity of the entire model evaluation ecosystem. (15 min)
· LLM Benchmarks -- LLM benchmarks are standardized test suites designed to measure specific capabilities of language models, forming the primary (if imperfect) basis for comparing models across the industry. (7 min)
· Chatbot Arena and ELO-Based Evaluation -- Chatbot Arena (by LMSYS) is a crowdsourced evaluation platform where real users compare anonymous LLM responses head-to-head, with results aggregated using Bradley-Terry models (a generalization of ELO ratings from chess) to produce what has become the most trusted and influential public ranking of LLM quality -- demonstrating that human preference evaluation captures quality dimensions that automated benchmarks cannot. (12 min)
· Traditional NLP Metrics: BLEU, ROUGE & BERTScore -- BLEU, ROUGE, and BERTScore are automated text evaluation metrics that compare generated text against reference text using n-gram overlap (BLEU, ROUGE) or contextual embedding similarity (BERTScore), each with distinct strengths and well-known limitations. (8 min)
· Human Evaluation & Benchmark Contamination -- Human evaluation remains the gold standard for assessing LLM quality through methods like pairwise preference and ELO ranking, but its validity -- along with all benchmark results -- is increasingly threatened by benchmark contamination, where test data leaks into training sets. (9 min)
· LLM-as-a-Judge -- LLM-as-a-Judge uses a strong language model to evaluate the outputs of other language models, offering a scalable and cost-effective alternative to human evaluation while introducing its own set of systematic biases. (7 min)
· Perplexity -- Perplexity measures how "surprised" a language model is by new text, serving as the most fundamental intrinsic metric for evaluating how well a model has learned the statistical patterns of language. (6 min)
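The perplexity entry above is worth making concrete: perplexity is the exponential of the average negative log-probability the model assigns to each token, so a model that is uniformly unsure among k choices at every step scores exactly k. A minimal sketch (the helper name is mine, not from the lesson):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp of the average negative log-probability per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model assigning probability 1/4 to every token is exactly as "surprised"
# as a uniform guess over 4 options: its perplexity is 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)  # → 4.0
```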
Advanced and Emerging
· Agentic RAG -- Agentic RAG replaces the rigid "retrieve then generate" pipeline with an AI agent that dynamically reasons about what to retrieve, when to retrieve, whether the retrieved information is sufficient, and how to synthesize multi-step retrieval results -- transforming RAG from a fixed pipeline into an adaptive, iterative reasoning process. (8 min)
· ColBERT and Late Interaction Retrieval -- ColBERT (Contextualized Late Interaction over BERT) replaces the standard single-vector representation of queries and documents with multi-vector representations -- one embedding per token -- and computes relevance through a "MaxSim" operation that finds the best-matching document token for each query token, achieving cross-encoder-level accuracy at bi-encoder-level speed. (8 min)
· Compound AI Systems -- Compound AI systems combine LLMs with retrievers, tools, code execution, verifiers, and other models into integrated architectures that exceed the capabilities of any single model, representing the shift from "better models" to "better systems" as the primary path to improved AI performance. (7 min)
· Context Window Extension -- Context window extension encompasses the techniques that have stretched LLM context lengths from 512 tokens to over 1 million, overcoming the quadratic cost of attention through clever positional encoding manipulation, architectural modifications, and distributed computation strategies. (6 min)
· Corrective RAG (CRAG) -- Corrective RAG adds a critical evaluation step after retrieval to assess whether retrieved documents are actually relevant to the query, then takes corrective actions -- query rewriting, web search fallback, or knowledge refinement -- when retrieval quality is insufficient, preventing the generation phase from hallucinating over irrelevant context. (8 min)
· GraphRAG (Graph-Based Retrieval-Augmented Generation) -- GraphRAG augments standard RAG by constructing a knowledge graph of entities and relationships from the document corpus, applying hierarchical community detection, and generating community summaries at multiple levels of abstraction -- enabling both precise local retrieval and global sensemaking queries that standard vector-based RAG fundamentally cannot answer. (11 min)
· HyDE (Hypothetical Document Embeddings) -- HyDE bridges the semantic gap between queries and documents by using an LLM to generate a hypothetical answer document, then embedding that hypothetical document (instead of the original query) as the retrieval vector -- leveraging the insight that a fabricated-but-plausible answer is closer in embedding space to real answers than the question itself is. (8 min)
· In-Context Learning -- In-context learning (ICL) is the emergent ability of large language models to learn new tasks from examples provided in the prompt at inference time, without any gradient updates or parameter changes. (9 min)
· Inference-Time Scaling Laws -- Performance on reasoning tasks improves predictably as you spend more compute at inference time -- through repeated sampling, extended chain-of-thought, tree search, and verifier-guided selection -- enabling smaller models to match larger ones on hard problems. (9 min)
· Late Chunking -- Late chunking reverses the traditional "chunk then embed" pipeline by first passing the entire document through the embedding model's transformer layers to produce contextualized token representations, then chunking those rich token embeddings into segment-level vectors -- preserving cross-chunk context that traditional chunking destroys. (8 min)
· Matryoshka Representation Learning (MRL) -- Matryoshka Representation Learning trains embedding models so that any prefix of an embedding vector is itself a valid, useful embedding, enabling a single model to produce embeddings at multiple dimensionalities with graceful quality degradation -- like Russian nesting dolls where each inner doll is a complete, functional representation. (7 min)
· Mechanistic Interpretability -- Mechanistic interpretability is the scientific effort to reverse-engineer neural networks at the level of individual computations, identifying the specific features models represent, the circuits that connect them, and how these give rise to complex behaviors like reasoning, factual recall, and potentially deception. (9 min)
· Mixture of Agents (MoA) -- Mixture of Agents uses multiple LLMs collaboratively in layered rounds -- each model refining the outputs of others -- to achieve aggregate quality that exceeds any individual model, including frontier systems. (7 min)
· Model Merging -- Model merging combines the weights of two or more separately trained models into a single model without any additional training, exploiting the surprising geometric structure of neural network loss landscapes to blend capabilities from different fine-tuned variants. (7 min)
· Multi-Token Prediction -- Multi-token prediction trains language models to predict several future tokens simultaneously from each position, producing richer internal representations and enabling faster inference through speculative self-decoding. (7 min)
· Multimodal Models -- Multimodal models extend LLMs beyond text by connecting vision encoders, audio processors, and other modality-specific modules to a language model backbone, enabling AI systems that can see, hear, and reason across different types of input simultaneously. (8 min)
· Neurosymbolic AI -- Neurosymbolic AI combines the pattern recognition and fluency of neural networks with the precision, verifiability, and logical consistency of symbolic systems, aiming to create AI that can both understand natural language and reason with formal guarantees. (8 min)
· Query Decomposition and Multi-Step Retrieval -- Query decomposition breaks complex user queries into simpler sub-queries that can each be answered through targeted retrieval, while multi-step retrieval iteratively retrieves information where each step's findings inform the next -- together enabling RAG systems to answer complex, multi-faceted, and multi-hop questions that single-shot retrieval fundamentally cannot handle. (12 min)
· RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) -- RAPTOR builds a hierarchical tree index over a document corpus by recursively clustering text chunks using UMAP and Gaussian mixture models, then summarizing each cluster with an LLM -- creating a multi-resolution representation where leaf nodes are original text chunks and higher nodes are increasingly abstract summaries, enabling retrieval at any level of detail from granular facts to high-level themes. (10 min)
· Reasoning Models (o1/R1 Paradigm) -- Reasoning models perform extended internal deliberation before answering, trading additional inference-time compute for dramatically improved accuracy on math, code, and science tasks. (10 min)
· Representation Engineering and Activation Steering -- Representation engineering controls LLM behavior at inference time by identifying interpretable directions in the model's internal activation space (e.g., an "honesty direction" or "refusal direction") and adding or subtracting these steering vectors from the model's hidden states during forward passes -- modifying behavior without any weight updates or fine-tuning. (10 min)
· Reranking and Cross-Encoders -- Reranking is a second-stage retrieval technique where a more powerful model (typically a cross-encoder) re-scores and reorders the initial retrieval results from a fast first-stage retriever (bi-encoder or BM25), dramatically improving precision by jointly processing each query-document pair rather than comparing independent embeddings -- making two-stage "retrieve then rerank" the standard architecture for production retrieval systems. (11 min)
· Self-RAG (Self-Reflective Retrieval-Augmented Generation) -- Self-RAG trains a single language model to adaptively decide when to retrieve external knowledge, evaluate whether retrieved passages are relevant, assess whether its own generation is supported by the evidence, and judge the overall utility of its response -- all through special reflection tokens learned during training, eliminating the need for separate retriever and critic components. (10 min)
· State Space Models & Mamba -- State Space Models offer a fundamentally different approach to sequence modeling that processes tokens in linear time through learned recurrent state updates, with Mamba's selective mechanism making them the most credible alternative to Transformers. (7 min)
· Test-Time Compute & Inference-Time Scaling -- Test-time compute is the paradigm shift from making models bigger to making models think harder, allocating additional computation at inference to explore reasoning paths, verify answers, and dramatically improve performance on complex problems. (7 min)
· Tree-of-Thought (ToT) -- Tree-of-Thought extends chain-of-thought reasoning by exploring multiple reasoning paths simultaneously in a branching tree structure, enabling backtracking from dead ends and systematic search for the best solution -- treating reasoning as a search problem rather than a linear narrative. (8 min)
· Vision-Language Models (VLMs) -- Vision-Language Models integrate visual perception with language understanding in a single system, enabling AI to see, reason about, and describe the visual world -- and increasingly, to act on it through Vision-Language-Action architectures. (8 min)
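Several of the entries above hinge on ColBERT's "MaxSim" operation, which is compact enough to state directly: for each query token embedding, take the maximum similarity against all document token embeddings, then sum over query tokens. A toy sketch with hand-picked vectors (real ColBERT uses L2-normalized BERT token embeddings; the function name here is illustrative):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late interaction: best-matching document token per query token, summed.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim).
    """
    sims = query_emb @ doc_emb.T          # pairwise token-level similarities
    return float(sims.max(axis=1).sum())  # MaxSim, then sum over query tokens
```

Because document token embeddings are computed offline and only this cheap max-and-sum runs at query time, late interaction keeps bi-encoder speed while recovering much of a cross-encoder's token-level matching.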