Date Summary Links
1/12 Welcome, Review of Transformers, Reproduce GPT-2

What is AI alignment?

Attention is All You Need

GPT-2 Walkthrough + starter code

1/19 Reproduce GPT-2

Attention is All You Need

The Illustrated Transformer

GPT-2 Walkthrough + starter code

Learn about einops!

1/26 Introduction to Interpretability

Zoom In: An Introduction to Circuits + my notes

A Mathematical Framework for Transformer Circuits + Glossary of Terms + self-test + Neel Nanda's walkthrough

(Skipped) In-context Learning and Induction Heads

Tip: Read these in order. These papers are not easy, so spend most of your time on the first two, then just skim the last.

2/2 Transformer Circuits, Continued

Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases

Toy Models of Superposition

2/9 Interpretability in the Wild

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

Exploratory Data Analysis (to be done during our meeting)

2/16 Intro to Reinforcement Learning

OpenAI Spinning Up Parts 1-3: RL Intro, Types of RL, Policy Optimization (just read "Deriving the Simplest Policy Gradient")

HuggingFace RL Course: Q Learning, Deep Q Learning (try to get the main ideas)

Goal Misgeneralization in Deep Reinforcement Learning (sections 1-2)

Specification gaming: the flip side of AI ingenuity

(Optional) VPG, TRPO, and PPO

2/23 Aligned Language Models

Training language models to follow instructions with human feedback (Sections 3.1, 3.5, and 5)

Skim Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Skim Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Consider: What are the primary limitations of these alignment techniques?

3/1 Adversarial ML, Data Poisoning

Skim Adversarial ML Tutorial ch. 1-3

Skim Scalable Extraction of Training Data from (Production) Language Models

Figure out how Glaze and Nightshade work

We'll play with adversarial ML, Glaze/Nightshade during the meeting!

3/8 AI Ethics

Watch/listen to one of Iason Gabriel's talks or podcasts, and prepare some discussion notes.

Some other links and tools that might help you as we progress: