Gaucho AI Alignment

Date	Topic	Readings and links
1/12	Welcome, Review of Transformers, Reproduce GPT-2	What is AI alignment?; Attention is All You Need; GPT-2 Walkthrough + starter code
1/19	Reproduce GPT-2	Attention is All You Need; The Illustrated Transformer; GPT-2 Walkthrough + starter code; Learn about einops!
1/26	Introduction to Interpretability	Zoom In: An Introduction to Circuits + my notes; A Mathematical Framework for Transformer Circuits + Glossary of Terms + self-test + Neel Nanda's walkthrough; (Skipped) In-context Learning and Induction Heads. Tip: Read these in order. These papers are not easy, so spend most of your time on the first two, then just skim the last.
2/2	Transformer Circuits, Continued	Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases; Toy Models of Superposition
2/9	Interpretability in the Wild	Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small; Exploratory Data Analysis (to be done during our meeting)
2/16	Intro to Reinforcement Learning	OpenAI Spinning Up Parts 1-3: RL Intro, Types of RL, Policy Optimization (just read "Deriving the Simplest Policy Gradient"); HuggingFace RL Course: Q Learning, Deep Q Learning (try to get the main ideas); Goal Misgeneralization in Deep Reinforcement Learning (sections 1-2); Specification gaming: the flip side of AI ingenuity; (Optional) VPG, TRPO, and PPO
2/23	Aligned Language Models	Training language models to follow instructions with human feedback (Sections 3.1, 3.5, and 5); Skim Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback; Skim Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Consider: What are the primary limitations of these alignment techniques?
3/1	Adversarial ML, Data Poisoning	Skim Adversarial ML Tutorial ch. 1-3; Skim Scalable Extraction of Training Data from (Production) Language Models; Figure out how Glaze and Nightshade work. We'll play with adversarial ML and Glaze/Nightshade during the meeting!
3/8	AI Ethics	Watch/listen to one of Iason Gabriel's talks or podcasts, and prepare some discussion notes.

Some other links and tools that might help you as we progress:

Meta
- Talk2Arxiv (GPT for understanding papers)
Deep Learning Refresher
- Edison Zhang's CS190I Notes
Transformers
- Transformers from Scratch
Mechanistic Interpretability
- Getting Started in MI
- Neel Nanda's List of Papers - includes how to approach each one
- 200 Concrete Open Problems in MI
Other AI Safety/Alignment Programs + Courses