Course · 8 modules · 57 lessons · 334 min

Reinforcement Learning

Foundations through deep RL, policy gradients, model-based methods, RL for language models, and landmark applications.

Foundations
· Bellman Equations (6 min): The recursive decomposition of value into immediate reward plus discounted future value -- the fundamental identity of RL.
· Exploration vs Exploitation (6 min): The core dilemma: exploit what you know for guaranteed reward, or explore the unknown for potentially better outcomes.
· Markov Decision Processes (6 min): The mathematical framework formalizing sequential decision-making with states, actions, transition probabilities, and rewards.
· Policies (5 min): The agent's decision rule mapping states to actions -- the central object that RL algorithms learn.
· Return and Discount Factor (6 min): Cumulative future reward geometrically discounted by gamma -- the objective every RL agent optimizes (sketched in code below this list).
· States, Actions, and Rewards (6 min): The three primitives of every RL problem: where you are, what you can do, and what you get for doing it.
· Value Functions (6 min): Expected future return from a state (V) or state-action pair (Q) -- the backbone of most RL algorithms.
· What Is Reinforcement Learning? (5 min): An agent learns to make sequential decisions by interacting with an environment and maximizing cumulative reward -- the third paradigm of machine learning.
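To make the return objective concrete, here is a minimal sketch in plain Python of the discounted return defined in the Return and Discount Factor lesson; the reward sequence and gamma value are illustrative, not taken from the course:

```python
# Discounted return: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Accumulate backward from the last reward, using G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```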
Tabular Methods
· Dynamic Programming (5 min): Computing optimal policies via iterative Bellman updates when the full environment model is known -- the theoretical foundation of reinforcement learning.
· Eligibility Traces (5 min): Credit assignment mechanism that blends TD and Monte Carlo through exponentially decaying memory of visited states.
· Monte Carlo Methods (5 min): Learning value estimates from complete episode returns -- model-free RL through averaging sampled outcomes.
· N-Step Methods (6 min): Bridging Monte Carlo and TD by bootstrapping after n steps -- tunable bias-variance trade-off.
· Q-Learning (6 min): Off-policy TD control that learns the optimal action-value function regardless of the behavior policy -- the most influential tabular RL algorithm (sketched in code below this list).
· SARSA (6 min): On-policy TD control that updates Q-values using the action actually taken -- safer than Q-learning in stochastic environments.
· Temporal Difference Learning (6 min): Bootstrapping value estimates from incomplete episodes by updating toward one-step lookahead targets.
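A minimal sketch of the Q-learning update covered in this module, assuming a toy problem with integer states and two actions; the transition in the usage lines is invented for illustration:

```python
from collections import defaultdict

# Tabular Q-learning update:
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
# One illustrative transition: in state 3, action 1 yields reward 1.0, next state 4
q_learning_step(Q, s=3, a=1, r=1.0, s_next=4, actions=[0, 1])
print(Q[(3, 1)])  # 0.1 * (1.0 + 0.99 * 0.0 - 0.0) = 0.1
```

Note the max over next-state actions: the update tracks the greedy policy regardless of how the action was actually chosen, which is what makes Q-learning off-policy.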
Function Approximation and Deep RL
· Deep Q-Networks (5 min): Neural network Q-function with experience replay and target networks -- the breakthrough that launched deep RL.
· Double DQN (6 min): Decoupling action selection from evaluation to correct DQN's systematic overestimation of Q-values.
· Dueling DQN (6 min): Separate network streams for state value and action advantage -- learning "how good is this state" independently from "how good is this action."
· Experience Replay (5 min): Storing and randomly sampling past transitions to break temporal correlations and improve sample efficiency (sketched in code below this list).
· Function Approximation (5 min): Replacing lookup tables with parameterized functions to generalize across the vast state spaces of real-world problems.
· Rainbow DQN (6 min): Combining six orthogonal DQN improvements into one agent -- the definitive value-based deep RL algorithm.
· Target Networks (6 min): A frozen copy of the Q-network providing stable regression targets -- preventing the "moving target" instability.
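As a taste of the Experience Replay lesson, a minimal buffer might look like the following sketch; the capacity and the dummy transitions are arbitrary choices, not values from the course:

```python
import random
from collections import deque

# Experience replay: store transitions, sample decorrelated minibatches
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of a trajectory
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):                 # fill with dummy transitions
    buf.push(t, 0, 0.0, t + 1, False)
batch = buf.sample(32)
print(len(batch))  # 32
```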
Policy Gradient Methods
· A2C and A3C (6 min): Parallel actor-critic training through multiple environment workers -- A3C uses asynchronous gradient updates for decorrelation, while A2C's synchronous batching often matches performance and better utilizes GPUs.
· Actor-Critic Methods (5 min): A two-network architecture that combines a policy (the actor) with a learned value function (the critic) to reduce the high variance of pure policy gradient methods while maintaining low bias.
· Advantage Estimation (5 min): Methods for estimating how much better a specific action is compared to the average action in a given state -- the key signal that drives stable, efficient policy gradient updates.
· Entropy Regularization (7 min): Adding a policy entropy bonus to the optimization objective to encourage exploration, prevent premature convergence to deterministic policies, and improve robustness -- a simple technique with deep connections to maximum entropy RL.
· Policy Gradient Theorem (5 min): The mathematical foundation that enables direct optimization of parameterized policies via gradient ascent on expected return, bypassing the need to differentiate through unknown environment dynamics.
· Proximal Policy Optimization (PPO) (7 min): A clipped surrogate objective that approximates trust region constraints using only first-order optimization -- the dominant algorithm in modern reinforcement learning and the engine behind RLHF for large language models (sketched in code below this list).
· REINFORCE (5 min): The simplest policy gradient algorithm -- sample a complete trajectory, weight each action's log-probability by the return that followed it, and update the policy in the direction that reinforces successful behavior.
· Trust Region Methods (6 min): Constraining each policy update to a "trust region" where the local approximation is reliable, preventing the catastrophic performance collapses that plague unconstrained policy gradients -- realized through TRPO and natural policy gradients.
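To ground the PPO lesson, here is a minimal per-sample sketch of the clipped surrogate objective; eps=0.2 is a commonly used default, and the numbers in the final line are illustrative:

```python
import math

# PPO's clipped surrogate, per sample:
# L = min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)
def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    clipped = max(1 - eps, min(1 + eps, ratio))  # clamp ratio to [1 - eps, 1 + eps]
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, the objective stops growing once ratio > 1 + eps:
print(ppo_clip_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))  # 1.2, not e^0.5 ~ 1.65
```

The point of the clip: once the new policy moves more than eps away from the old one, the incentive to move further vanishes, which is how PPO approximates a trust region with only first-order optimization.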
Model-Based RL
· The Dyna Architecture (6 min): A foundational framework that interleaves real environment experience with simulated experience generated by a learned model, unifying learning, planning, and acting in a single loop (sketched in code below this list).
· Model-Based vs. Model-Free RL (5 min): The fundamental architectural choice in reinforcement learning -- learn a model of how the world works and plan with it, or learn what to do directly from raw experience.
· Monte Carlo Tree Search (6 min): A tree-based planning algorithm that combines random simulation with upper confidence bounds to efficiently search large decision spaces -- the planning engine that powered AlphaGo's victory over the world Go champion.
· MuZero (7 min): A planning algorithm that learns its own model of the environment -- predicting rewards, values, and policies in a latent space -- achieving superhuman performance across board games, Atari, and beyond, without ever being told the rules.
· Planning with Learned Models (7 min): Using neural network dynamics models for lookahead search, trajectory optimization, and data augmentation.
· World Models (6 min): Learning compressed latent representations of environment dynamics so an agent can "dream" -- planning and even training entirely within an imagined version of the world.
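A heavily simplified sketch of the Dyna idea from this module: a tabular, deterministic "model" that memorizes observed transitions and replays them to update values without further environment interaction. All names and numbers here are illustrative, not the full architecture:

```python
import random
from collections import defaultdict

# Dyna-style planning: replay transitions drawn from a learned model
def planning_updates(Q, model, actions, n_updates=10, alpha=0.1, gamma=0.99):
    for _ in range(n_updates):
        (s, a), (r, s_next) = random.choice(list(model.items()))
        td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)
model = {(0, 1): (1.0, 2)}  # memorized transition: state 0, action 1 -> reward 1.0, state 2
planning_updates(Q, model, actions=[0, 1])
print(round(Q[(0, 1)], 3))  # 0.651: value propagated from the model alone
```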
Advanced Methods
· Curiosity-Driven Exploration (7 min): Replacing external reward with intrinsic motivation from prediction error or information gain, so agents explore systematically by seeking novelty rather than stumbling upon it by accident.
· Hierarchical Reinforcement Learning (5 min): Decomposing complex, long-horizon tasks into layered subtask hierarchies, enabling agents to reason at multiple timescales through temporal abstraction.
· Imitation Learning (6 min): Training policies directly from expert demonstrations, bypassing reward function design entirely -- but the seemingly simple approach of copying an expert hides a subtle and dangerous distribution shift problem.
· Inverse Reinforcement Learning (5 min): Recovering the reward function that an expert is implicitly optimizing -- answering "what are they trying to do?" rather than "how are they doing it?"
· Meta-Reinforcement Learning (7 min): Training agents across a distribution of tasks so they can adapt to new, unseen tasks in just a few episodes -- learning to learn rather than learning to solve one problem.
· Multi-Agent Reinforcement Learning (5 min): Multiple agents learning simultaneously in a shared environment create a non-stationary world where each agent's optimal strategy depends on what every other agent is doing.
· Offline Reinforcement Learning (6 min): Learning policies entirely from a fixed dataset of previously collected interactions, without any further environment access -- bringing RL into the data-driven regime where healthcare, robotics, and dialogue systems actually operate.
· Reward Shaping (6 min): Augmenting sparse environment rewards with intermediate signals to accelerate learning -- without mathematical guarantees, shaping risks teaching the agent to optimize the wrong objective entirely (the guaranteed, potential-based form is sketched below this list).
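The mathematical guarantee the Reward Shaping lesson alludes to is usually potential-based shaping, whose additive form provably leaves the optimal policy unchanged (Ng, Harada, and Russell, 1999). A minimal sketch, with a hypothetical potential function:

```python
# Potential-based shaping: F(s, s') = gamma * Phi(s') - Phi(s).
# Rewards shaped by this additive form preserve the optimal policy.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    return r + gamma * phi(s_next) - phi(s)

def phi(s):
    return -abs(10 - s)  # hypothetical potential: negative distance to a goal at 10

print(shaped_reward(0.0, s=3, s_next=4, phi=phi))  # 0.99 * (-6) - (-7) = approx 1.06
```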
RL for Language Models
· DPO as Implicit RL (6 min): Direct Preference Optimization reframes RLHF as a supervised learning problem by deriving the optimal policy in closed form -- eliminating the reward model, PPO loop, and value function while producing equivalent results from the same preference data.
· GRPO (6 min): DeepSeek's Group Relative Policy Optimization eliminates the value function entirely by estimating advantages from groups of sampled outputs -- a critic-free RL algorithm that is simpler, cheaper, and powered the reasoning breakthroughs in DeepSeek-R1 (sketched in code below this list).
· PPO for Language Models (6 min): Adapting Proximal Policy Optimization from game environments to text generation -- where actions are tokens, episodes are sequences, rewards arrive only at the end, and four full-sized neural networks must coexist in GPU memory.
· Reward Modeling for LLMs (6 min): Training a neural network to predict human preferences from pairwise comparisons -- the critical bottleneck in LLM alignment, where Goodhart's Law meets the impossibility of specifying what "good" means mathematically.
· RLAIF and Constitutional AI (6 min): Replacing human annotators with AI-generated feedback guided by explicit principles for scalable alignment -- reducing the cost per preference comparison from $1 to $10 down to approximately $0.001 while achieving comparable quality.
· RLHF Pipeline (5 min): The three-stage process (SFT, reward model, PPO) that transformed language models from text predictors into aligned assistants -- the alignment breakthrough where a 1.3B-parameter RLHF model outperformed a 175B-parameter supervised-only model.
· RLVR (7 min): Reinforcement Learning with Verifiable Rewards uses objectively checkable outcomes -- correct math answers, passing code tests, provable logical validity -- as reward signals, completely bypassing learned reward models and their susceptibility to Goodhart's Law.
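To make GRPO's critic-free advantage concrete, here is a minimal sketch of the group-relative normalization; the group size, reward values, and epsilon are illustrative choices, not details from the lesson:

```python
# GRPO-style advantage: score each of the outputs sampled for the same
# prompt relative to the group mean, normalized by the group's std.
def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]  # eps guards identical rewards

# Four sampled answers to one prompt, scored 1.0 if correct, 0.0 otherwise
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx [1.0, -1.0, -1.0, 1.0]
```

The group itself plays the role the learned value function plays in PPO: the baseline is simply "how well did the other samples for this prompt do," which removes an entire network from the training loop.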
Landmark Applications
· AlphaGo and Board Games (7 min): From AlphaGo to AlphaZero: defeating world champions in Go, Chess, and Shogi through self-play and Monte Carlo Tree Search.
· Atari and Arcade Games (6 min): DQN achieving human-level performance on 49 Atari games from raw pixels -- the experiment that ignited the deep RL revolution.
· Recommendation Systems (5 min): Modeling user interaction as a sequential decision problem -- optimizing long-term engagement over immediate clicks.
· Resource Optimization (6 min): Data center cooling, chip design, network routing -- RL finding superhuman solutions to combinatorial optimization problems.
· RL in Production (7 min): The engineering challenges of deploying RL systems: safety constraints, evaluation, monitoring, and the sim-to-real gap.
· Robotics and Control (7 min): Sim-to-real transfer, dexterous manipulation, and locomotion -- bridging the gap between simulation and physical robots.