Date | Summary | Links |
---|---|---|
1/12 | Welcome, Review of Transformers, Reproduce GPT-2 | |
1/19 | Reproduce GPT-2 | GPT-2 Walkthrough + starter code; learn about einops! |
1/26 | Introduction to Interpretability | Zoom In: An Introduction to Circuits + my notes; A Mathematical Framework for Transformer Circuits + Glossary of Terms + self-test + Neel Nanda's walkthrough; (Skipped) In-context Learning and Induction Heads. Tip: read these in order. These papers are not easy, so spend most of your time on the first two, then just skim the last. |
2/2 | Transformer Circuits, Continued | Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases |
2/9 | Interpretability in the Wild | Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small |
2/16 | Intro to Reinforcement Learning | OpenAI Spinning Up Parts 1-3: RL Intro, Types of RL, Policy Optimization (just read "Deriving the Simplest Policy Gradient"); HuggingFace RL Course: Q Learning, Deep Q Learning (try to get the main ideas); Goal Misgeneralization in Deep Reinforcement Learning (sections 1-2) |
2/23 | Aligned Language Models | Training language models to follow instructions with human feedback (Sections 3.1, 3.5, and 5); skim Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback; skim Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Consider: what are the primary limitations of these alignment techniques? |
3/1 | Adversarial ML, Data Poisoning | Skim Adversarial ML Tutorial ch. 1-3; skim Scalable Extraction of Training Data from (Production) Language Models; figure out how Glaze and Nightshade work. We'll play with adversarial ML and Glaze/Nightshade during the meeting! |
3/8 | AI Ethics | Watch/listen to one of Iason Gabriel's talks or podcasts, and prepare some discussion notes. |
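If you want a concrete feel for the "Deriving the Simplest Policy Gradient" reading before the RL week, here is a minimal sketch of that idea (REINFORCE with a softmax policy) on a toy two-armed bandit. The bandit task, reward values, and all variable names are illustrative choices, not taken from the course materials:

```python
# Minimal policy-gradient (REINFORCE) sketch on a 2-armed bandit.
# The toy task and hyperparameters here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.2, 0.8])  # arm 1 pays more on average
logits = np.zeros(2)                 # policy parameters (softmax logits)
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(2, p=probs)                     # sample an action
    r = true_rewards[a] + 0.1 * rng.standard_normal()  # noisy reward
    # grad of log pi(a) w.r.t. logits for a softmax policy: one_hot(a) - probs
    grad_logp = -probs
    grad_logp[a] += 1.0
    logits += lr * r * grad_logp  # gradient ascent on E[r * grad log pi]

probs = softmax(logits)
print(probs)  # the policy should now favor the higher-reward arm
```

The key line is the update `r * grad_logp`: actions that led to higher reward have their log-probability pushed up, which is the whole content of the simplest policy gradient before baselines and advantage estimates are introduced.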
- Meta
- Deep Learning Refresher
- Transformers
- Mechanistic Interpretability
- Getting Started in MI
- Neel Nanda's List of Papers - includes how to approach each one
- 200 Concrete Open Problems in MI
- Other AI Safety/Alignment Programs + Courses