Course · 8 modules · 57 lessons · 334 min

Reinforcement Learning

Foundations through deep RL, policy gradients, model-based methods, RL for language models, and landmark applications.

Foundations
· Bellman Equations (6 min): The recursive decomposition of value into immediate reward plus discounted future value -- the fundamental identity of RL.
· Exploration vs Exploitation (6 min): The core dilemma: exploit what you know for guaranteed reward, or explore the unknown for potentially better outcomes.
· Markov Decision Processes (6 min): The mathematical framework formalizing sequential decision-making with states, actions, transition probabilities, and rewards.
· Policies (5 min): The agent's decision rule mapping states to actions -- the central object that RL algorithms learn.
· Return and Discount Factor (6 min): Cumulative future reward geometrically discounted by gamma -- the objective every RL agent optimizes (sketched in code below this list).
· States, Actions, and Rewards (6 min): The three primitives of every RL problem: where you are, what you can do, and what you get for doing it.
· Value Functions (6 min): Expected future return from a state (V) or state-action pair (Q) -- the backbone of most RL algorithms.
· What Is Reinforcement Learning? (5 min): An agent learns to make sequential decisions by interacting with an environment and maximizing cumulative reward -- the third paradigm of machine learning.
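To make the return objective concrete, here is a minimal sketch in plain Python of the discounted return defined in the Return and Discount Factor lesson; the reward sequence and gamma value are illustrative, not taken from the course:

```python
# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Accumulate backward from the last reward, using G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```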
Tabular Methods
· Dynamic Programming (5 min): Computing optimal policies via iterative Bellman updates when the full environment model is known -- the theoretical foundation of reinforcement learning.
· Eligibility Traces (5 min): Credit assignment mechanism that blends TD and Monte Carlo through exponentially decaying memory of visited states.
· Monte Carlo Methods (5 min): Learning value estimates from complete episode returns -- model-free RL through averaging sampled outcomes.
· N-Step Methods (6 min): Bridging Monte Carlo and TD by bootstrapping after n steps -- tunable bias-variance trade-off.
· Q-Learning (6 min): Off-policy TD control that learns the optimal action-value function regardless of the behavior policy -- the most influential tabular RL algorithm (sketched in code below this list).
· SARSA (6 min): On-policy TD control that updates Q-values using the action actually taken -- safer than Q-learning in stochastic environments.
· Temporal Difference Learning (6 min): Bootstrapping value estimates from incomplete episodes by updating toward one-step lookahead targets.
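A minimal sketch of the Q-learning update covered in this module, assuming a toy problem with integer states and two actions; the transition in the usage lines is invented for illustration:

```python
from collections import defaultdict

# Tabular Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
# One illustrative transition: in state 3, action 1 yields reward 1.0, next state 4
q_learning_step(Q, s=3, a=1, r=1.0, s_next=4, actions=[0, 1])
print(Q[(3, 1)])  # 0.1 * (1.0 + 0.99 * 0.0 - 0.0) = 0.1
```

Note the max over next-state actions: the update tracks the greedy policy regardless of how the action was actually chosen, which is what makes Q-learning off-policy.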
Function Approximation and Deep RL
· Deep Q-Networks (5 min): Neural network Q-function with experience replay and target networks -- the breakthrough that launched deep RL.
· Double DQN (6 min): Decoupling action selection from evaluation to correct DQN's systematic overestimation of Q-values.
· Dueling DQN (6 min): Separate network streams for state value and action advantage -- learning "how good is this state" independently from "how good is this action."
· Experience Replay (5 min): Storing and randomly sampling past transitions to break temporal correlations and improve sample efficiency (sketched in code below this list).
· Function Approximation (5 min): Replacing lookup tables with parameterized functions to generalize across the vast state spaces of real-world problems.
· Rainbow DQN (6 min): Combining six orthogonal DQN improvements into one agent -- the definitive value-based deep RL algorithm.
· Target Networks (6 min): A frozen copy of the Q-network providing stable regression targets -- preventing the "moving target" instability.
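As a taste of the Experience Replay lesson, a minimal buffer might look like the following sketch; the capacity and the dummy transitions are arbitrary choices, not values from the course:

```python
import random
from collections import deque

# Experience replay: store transitions, sample decorrelated minibatches
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of a trajectory
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):                 # fill with dummy transitions
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
print(len(batch))  # 32
```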
Policy Gradient Methods
· A2C and A3C (6 min): Parallel actor-critic training through multiple environment workers -- A3C uses asynchronous gradient updates for decorrelation, while A2C's synchronous batching often matches performance and better utilizes GPUs.
· Actor-Critic Methods (5 min): A two-network architecture that combines a policy (the actor) with a learned value function (the critic) to reduce the high variance of pure policy gradient methods while maintaining low bias.
· Advantage Estimation (5 min): Methods for estimating how much better a specific action is compared to the average action in a given state -- the key signal that drives stable, efficient policy gradient updates.
· Entropy Regularization (7 min): Adding a policy entropy bonus to the optimization objective to encourage exploration, prevent premature convergence to deterministic policies, and improve robustness -- a simple technique with deep connections to maximum entropy RL.
· Policy Gradient Theorem (5 min): The mathematical foundation that enables direct optimization of parameterized policies via gradient ascent on expected return, bypassing the need to differentiate through unknown environment dynamics.
· Proximal Policy Optimization (PPO) (7 min): A clipped surrogate objective that approximates trust region constraints using only first-order optimization -- the dominant algorithm in modern reinforcement learning and the engine behind RLHF for large language models (sketched in code below this list).
· REINFORCE (5 min): The simplest policy gradient algorithm -- sample a complete trajectory, weight each action's log-probability by the return that followed it, and update the policy in the direction that reinforces successful behavior.
· Trust Region Methods (6 min): Constraining each policy update to a "trust region" where the local approximation is reliable, preventing the catastrophic performance collapses that plague unconstrained policy gradients -- realized through TRPO and natural policy gradients.
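To ground the PPO lesson, here is a minimal per-sample sketch of the clipped surrogate objective; eps=0.2 is a commonly used default, and the numbers in the final line are illustrative:

```python
import math

# PPO's clipped surrogate, per sample:
# L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    clipped = max(1 - eps, min(1 + eps, ratio))  # clamp ratio to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the objective stops growing once ratio > 1 + eps:
print(ppo_clip_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))  # 1.2, not e^0.5 ~ 1.65
```

The point of the clip: once the new policy moves more than eps away from the old one, the incentive to move further vanishes, which is how PPO approximates a trust region with only first-order optimization.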
Model-Based RL
· The Dyna Architecture (6 min): A foundational framework that interleaves real environment experience with simulated experience generated by a learned model, unifying learning, planning, and acting in a single loop (sketched in code below this list).
· Model-Based vs. Model-Free RL (5 min): The fundamental architectural choice in reinforcement learning -- learn a model of how the world works and plan with it, or learn what to do directly from raw experience.
· Monte Carlo Tree Search (6 min): A tree-based planning algorithm that combines random simulation with upper confidence bounds to efficiently search large decision spaces -- the planning engine that powered AlphaGo's victory over the world Go champion.
· MuZero (7 min): A planning algorithm that learns its own model of the environment -- predicting rewards, values, and policies in a latent space -- achieving superhuman performance across board games, Atari, and beyond, without ever being told the rules.
· Planning with Learned Models (7 min): Using neural network dynamics models for lookahead search, trajectory optimization, and data augmentation.
· World Models (6 min): Learning compressed latent representations of environment dynamics so an agent can "dream" -- planning and even training entirely within an imagined version of the world.
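A heavily simplified sketch of the Dyna idea from this module: a tabular, deterministic "model" that memorizes observed transitions and replays them to update values without further environment interaction. All names and numbers here are illustrative, not the full architecture:

```python
import random
from collections import defaultdict

# Dyna-style planning: replay transitions drawn from a learned model
def planning_updates(Q, model, actions, n_updates=10, alpha=0.1, gamma=0.99):
    for _ in range(n_updates):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)
model = {(0, 1): (1.0, 2)}  # memorized transition: state 0, action 1 -> reward 1.0, state 2
planning_updates(Q, model, actions=[0, 1])
print(round(Q[(0, 1)], 3))  # 0.651: value propagated from the model alone
```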
Advanced Methods
· Curiosity-Driven Exploration (7 min): Replacing external reward with intrinsic motivation from prediction error or information gain, so agents explore systematically by seeking novelty rather than stumbling upon it by accident.
· Hierarchical Reinforcement Learning (5 min): Decomposing complex, long-horizon tasks into layered subtask hierarchies, enabling agents to reason at multiple timescales through temporal abstraction.
· Imitation Learning (6 min): Training policies directly from expert demonstrations, bypassing reward function design entirely -- but the seemingly simple approach of copying an expert hides a subtle and dangerous distribution shift problem.
· Inverse Reinforcement Learning (5 min): Recovering the reward function that an expert is implicitly optimizing -- answering "what are they trying to do?" rather than "how are they doing it?"
· Meta-Reinforcement Learning (7 min): Training agents across a distribution of tasks so they can adapt to new, unseen tasks in just a few episodes -- learning to learn rather than learning to solve one problem.
· Multi-Agent Reinforcement Learning (5 min): Multiple agents learning simultaneously in a shared environment create a non-stationary world where each agent's optimal strategy depends on what every other agent is doing.
· Offline Reinforcement Learning (6 min): Learning policies entirely from a fixed dataset of previously collected interactions, without any further environment access -- bringing RL into the data-driven regime where healthcare, robotics, and dialogue systems actually operate.
· Reward Shaping (6 min): Augmenting sparse environment rewards with intermediate signals to accelerate learning -- without mathematical guarantees, shaping risks teaching the agent to optimize the wrong objective entirely (the guaranteed, potential-based form is sketched below this list).
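The mathematical guarantee the Reward Shaping lesson alludes to is usually potential-based shaping, whose additive form provably leaves the optimal policy unchanged (Ng, Harada, and Russell, 1999). A minimal sketch, with a hypothetical potential function:

```python
# Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s).
# Rewards shaped by this additive form preserve the optimal policy.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    return r + gamma * phi(s_next) - phi(s)

def phi(s):
    return -abs(10 - s)  # hypothetical potential: negative distance to a goal at 10

print(shaped_reward(0.0, s=3, s_next=4, phi=phi))  # 0.99 * (-6) - (-7) = approx 1.06
```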
RL for Language Models
· DPO as Implicit RL (6 min): Direct Preference Optimization reframes RLHF as a supervised learning problem by deriving the optimal policy in closed form -- eliminating the reward model, PPO loop, and value function while producing equivalent results from the same preference data.
· GRPO (6 min): DeepSeek's Group Relative Policy Optimization eliminates the value function entirely by estimating advantages from groups of sampled outputs -- a critic-free RL algorithm that is simpler, cheaper, and powered the reasoning breakthroughs in DeepSeek-R1 (sketched in code below this list).
· PPO for Language Models (6 min): Adapting Proximal Policy Optimization from game environments to text generation -- where actions are tokens, episodes are sequences, rewards arrive only at the end, and four full-sized neural networks must coexist in GPU memory.
· Reward Modeling for LLMs (6 min): Training a neural network to predict human preferences from pairwise comparisons -- the critical bottleneck in LLM alignment, where Goodhart's Law meets the impossibility of specifying what "good" means mathematically.
· RLAIF and Constitutional AI (6 min): Replacing human annotators with AI-generated feedback guided by explicit principles for scalable alignment -- reducing the cost per preference comparison from $1 to $10 down to approximately $0.001 while achieving comparable quality.
· RLHF Pipeline (5 min): The three-stage process (SFT, reward model, PPO) that transformed language models from text predictors into aligned assistants -- the alignment breakthrough where a 1.3B-parameter RLHF model outperformed a 175B-parameter supervised-only model.
· RLVR (7 min): Reinforcement Learning with Verifiable Rewards uses objectively checkable outcomes -- correct math answers, passing code tests, provable logical validity -- as reward signals, completely bypassing learned reward models and their susceptibility to Goodhart's Law.
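To make GRPO's critic-free advantage concrete, here is a minimal sketch of the group-relative normalization; the group size, reward values, and epsilon are illustrative choices, not details from the lesson:

```python
# GRPO-style advantage: score each of the outputs sampled for the same
# prompt relative to the group mean, normalized by the group's std.
def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]  # eps guards identical rewards

# Four sampled answers to one prompt, scored 1.0 if correct, 0.0 otherwise
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx [1.0, -1.0, -1.0, 1.0]
```

The group itself plays the role the learned value function plays in PPO: the baseline is simply "how well did the other samples for this prompt do," which removes an entire network from the training loop.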
Landmark Applications
· AlphaGo and Board Games (7 min): From AlphaGo to AlphaZero: defeating world champions in Go, Chess, and Shogi through self-play and Monte Carlo Tree Search.
· Atari and Arcade Games (6 min): DQN achieving human-level performance on 49 Atari games from raw pixels -- the experiment that ignited the deep RL revolution.
· Recommendation Systems (5 min): Modeling user interaction as a sequential decision problem -- optimizing long-term engagement over immediate clicks.
· Resource Optimization (6 min): Data center cooling, chip design, network routing -- RL finding superhuman solutions to combinatorial optimization problems.
· RL in Production (7 min): The engineering challenges of deploying RL systems: safety constraints, evaluation, monitoring, and the sim-to-real gap.
· Robotics and Control (7 min): Sim-to-real transfer, dexterous manipulation, and locomotion -- bridging the gap between simulation and physical robots.