10-403 Deep RL | Schedule


Date	Lecture	Readings	Logistics
M 01/13	Lecture #1 : Introduction to Reinforcement and Representation Learning [ slides ]	S & B Textbook, Ch1 (optional) Smith & Gasser. The Development of Embodied Cognition - Six Lessons from Babies (optional) Dan Wolpert's talk The real reason for brains
W 01/15	Lecture #2 : Multi-armed Bandits [ slides ]	Russo et al. A Tutorial on Thompson Sampling, Ch1-Ch4. Optional after Ch4 (optional) Aleksandrs Slivkins Introduction to Multi-Armed Bandits
F 01/17	Recitation #1: Neural Nets, PyTorch, OpenAI Gym, Bandits [ slides ]	G, B & C Textbook, Ch9, Ch10 TensorFlow tutorial notebook OpenAI Gym tutorial notebook PyTorch tutorial notebook The TensorFlow High Level (Keras) API
M 01/20	No Class, MLK Jr Day
W 01/22	Lecture #3 : Value-based Methods [ slides ]	S & B Textbook, Ch3, Ch4
F 01/24	Recitation #2: Bandits, MDPs [ slides ]
M 01/27	Lecture #4 : Value-based Methods (cont.) [ slides ]	S & B Textbook, Ch5, Ch6 (optional) The Paths Perspective on Value Learning (blog post)
W 01/29	Lecture #5 : Value-based Methods (cont.): DQN, MCTS [ slides | slides 2 ]	DQN Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning S & B Textbook, Ch8.11 (optional) Neural Fitted Q-Iteration (optional) Policy Evaluation with Temporal Differences: A Survey and Comparison
F 01/31	No Recitation
M 02/03	Lecture #6 : Actor-Critic Methods [ slides ]	Mnih et al. Asynchronous Methods for Deep Reinforcement Learning	HW1 out (tentative)
W 02/05	Recitation #3: HW1 [ slides ]
F 02/07	Lecture #7 : Actor-Critic Methods (cont.) [ slides ]	Mnih et al. Asynchronous Methods for Deep Reinforcement Learning
M 02/10	Lecture #8 : Trust Region Methods [ slides ]	(optional) Schulman et al. Trust Region Policy Optimization (optional) Peng et al. Advantage-weighted regression (optional) Kakade and Langford Conservative policy iteration (optional) Off-Policy Actor-Critic
W 02/12	Lecture #9 : Trust Region Methods [ slides ]	(optional) Schulman et al. Trust Region Policy Optimization (optional) Peng et al. Advantage-weighted regression (optional) Kakade and Langford Conservative policy iteration
F 02/14	Recitation #4: Quiz 1 Review [ slides ]
M 02/17	Lecture #10 : Trust Region Methods [ slides ]	(optional) Schulman et al. Trust Region Policy Optimization (optional) Peng et al. Advantage-weighted regression (optional) Kakade and Langford Conservative policy iteration	HW1 due 11:59PM
W 02/19	Lecture #11 : Behavior Cloning, Generative Adversarial Imitation Learning [ slides ]	(optional) Chen et al. Learning by Cheating Bagnell. An Invitation to Imitation, up to page 10 Bojarski et al. End to End Learning for Self-Driving Cars
F 02/21	Quiz 1
M 02/24	Lecture #12 : Multimodal Policies, Diffusion Policies [ slides ]	Luo. Understanding Diffusion Models: A Unified Perspective Pearce et al. Imitating Human Behaviour with Diffusion Models
W 02/26	Lecture #13 : Diffusion Policies (cont.), Evolutionary Methods for Policy Search [ slides ]	Salimans et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning (optional) Nikolaus Hansen. The CMA Evolution Strategy - A Tutorial	HW2 out (tentative)
F 02/28	Recitation #5: Solutions to Quiz 1 [ slides ]
M 03/03	Spring Break - No Classes
W 03/05	Spring Break - No Classes
F 03/07	Spring Break - No Classes
M 03/10	Lecture #14 : Maximum Entropy RL, SAC, DDPG [ slides ]	Lillicrap et al. Continuous control with deep reinforcement learning Haarnoja et al. Soft Actor-Critic: Algorithms and Applications Haarnoja et al. Reinforcement Learning with Deep Energy Based Policies
W 03/12	Lecture #15 : Maximum Entropy RL, SAC, DDPG [ slides ]	Lillicrap et al. Continuous control with deep reinforcement learning Haarnoja et al. Soft Actor-Critic: Algorithms and Applications Haarnoja et al. Reinforcement Learning with Deep Energy Based Policies	HW2 due Thursday 3/13 11:59PM
F 03/14	Recitation #6: Diffusion policies (cont.) [ slides ]
M 03/17	Lecture #16 : Introduction to Model-Based Reinforcement Learning [ slides ]	Chua et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
W 03/19	Lecture #17 : AlphaGo, AlphaGoZero, AlphaZero [ slides ]	(optional) David Silver et al. Mastering the game of Go with deep neural networks and tree search David Silver et al. Mastering the game of Go without human knowledge David Silver et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm	HW3 out
F 03/21	Recitation #7: HW3 [ slides ]
M 03/24	Lecture #18 : MBRL from sensory input [ slides ]	Oh et al. Action-Conditional Video Prediction using Deep Networks in Atari Games Hansen et al. Temporal Difference Learning for Model Predictive Control DeepMind Blog Post MuZero: Mastering Go, chess, shogi and Atari without rules Julian Schrittwieser et al. Mastering Atari, Go, chess and shogi by planning with a learned model
W 03/26	Lecture #19 : MBRL (cont.) [ slides ]	Janner et al. Model-based policy optimization Hafner et al. Dreamer
F 03/28	Lecture #20 : Visual Imitation / Quiz 2 Review [ slides ]	Peng et al. SFV: Reinforcement Learning of Physical Skills from Videos Baker et al. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
M 03/31	Lecture #21 : Multigoal Reinforcement Learning, MBRL with multimodal dynamics [ slides | slides 2 ]	Andrychowicz et al. Hindsight Experience Replay Ding et al. Goal-conditioned Imitation Learning Janner et al. Planning with Diffusion for Flexible Behavior Synthesis (optional) Yang et al. Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous Driving and Zero-Shot Instruction Following
W 04/02	Quiz 2
F 04/04	No Class, Spring Carnival
M 04/07	Lecture #22 : Offline RL 1: going beyond imitation, problem statement, challenges in doing offline RL, policy gradient methods / policy constraints [ slides ]	Kumar et al. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction Kumar et al. Reinforcement Learning from Static Datasets (Chapters 1 to 5) Fujimoto et al. Off-Policy Deep Reinforcement Learning without Exploration Kumar and Levine Offline RL Tutorial, NeurIPS 2020
W 04/09	Lecture #23 : Offline RL 2: conservative methods, model-based approaches, modern model-free algorithms [ slides ]	Kumar et al. Conservative Q-Learning for Offline Reinforcement Learning Kumar and Levine Offline RL Tutorial, NeurIPS 2020 Sobol Mark et al. Policy Agnostic RL Kostrikov et al. Offline RL with Implicit Q-Learning Yu et al. Conservative offline Model-Based Policy Optimization Kidambi et al. Model-Based Offline Reinforcement Learning	HW4 out; HW3 due 11:59pm
F 04/11	Recitation #8: HW4 [ slides ]
M 04/14	Lecture #24 : Intelligent Exploration [ slides ]
W 04/16	Lecture #25 : Intelligent Exploration (cont.), Sim2Real Policy Learning [ slides | slides 2 ]	Pathak et al. Curiosity-driven Exploration by Self-supervised Prediction Burda et al. Exploration by Random Network Distillation Kumar et al. Rapid Motor Adaptation for Legged Robots Tobin et al. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
F 04/18	Recitation #9: Sim2Real Policy Learning (cont.), Quiz 2 Solutions [ slides ]	Akkaya et al. Solving Rubik’s Cube with A Robot Hand
M 04/21	Lecture #26 : Foundation Models for RL [ slides ]	Ouyang et al. Training language models to follow instructions with human feedback Slides Direct Preference Optimization: A New RLHF Approach Xu et al. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
W 04/23	Lecture #27 : Foundation Models for RL [ slides ]		HW4 due 11:59pm
F 04/25	Recitation #10: Quiz 3 Review [ slides ]
F 05/02	Quiz 3, 8:30am - 11:30am, Posner 152