So far, AI research has mostly been about making things work. Now that AI capabilities are progressing rapidly, we need to ask ourselves: what if we succeed?
Most of today’s risks and harms from AI flow directly from the things AI was designed to do. Analyzing data from millions of consumers creates privacy risks. Using AI to make decisions about people is riddled with issues of fairness and social bias. Generative AI that produces convincing images and text can be used to spread misinformation. Automation brings societal and economic risks such as centralization and widespread unemployment. Military AI leads to arms race dynamics. These are all pressing issues worthy of serious research and activism. But what they have in common is that the risks flow directly from the intended application.
The advent of Artificial General Intelligence (AGI), which I’ll define as the ability of a single system to act competently in a broad range of domains, adds important new elements. First, AGI systems (of which LLM-based chatbots are arguably a primitive example) bring risks in domains beyond those they were designed for, including ones that one might not naively think to check, such as biotech, cyberwarfare, and autonomous replication. Second, AGI systems developed in the next few years could plausibly have some capability to operate autonomously, pursuing goals without human oversight. This combination of general capabilities and autonomy dramatically increases the “attack surface” we need to worry about and forces us to think ahead to understand the capabilities and motivational structures of AI systems that will be developed over the coming decade.
What is “alignment”? As AI systems start to behave less like tools and more like agents that pursue goals in surprising ways, the important question becomes whether their motivational structures are aligned with human values. It’s a little ambiguous how good a match this concept is for LLMs — this is one of the things we need to figure out! — but we need to be prepared for a future in which AGI systems are much more autonomous and agentic than we’re used to. (Or alternatively, coordinate to prevent such a future until we’re ready!)
How do we study the risks from something that doesn’t yet exist? Roughly the first half of the course will focus on idealized models of powerful AI systems, including optimal planners and universal induction. Such models have the advantage that we can often make precise statements about how they would behave. While they will certainly differ from future AI systems in crucial ways, they give us a handle for reasoning about future systems because we can poke at the assumptions and see if the conclusions are likely to change.
Roughly the second half of the course will focus on practical safety and alignment techniques in the context of large language models. This will include RL from human feedback, mechanistic interpretability, robust harmlessness, and scalable oversight. We’ll also look at combining LLMs with social choice theory in order to aggregate the preferences of diverse humans.
AI alignment is a very new area. To the best of my knowledge, this is the fifth time the subject has been taught as a regular course at a major university. (All but one of the others were created in the last 6 months.) Furthermore, the field has mostly developed outside the academic mainstream, with many of the foundational ideas scattered across the Internet in forms that are hard to absorb or evaluate, which makes things difficult for newcomers. We’ll use NeurIPS-like research papers where possible, and otherwise try to give tips for interpreting the broader literature.
The course will use a hybrid format. For topics that are well covered in the published literature (e.g. robustness and mechanistic interpretability), we’ll use a seminar format, with paper presentations and discussions. For topics that haven’t yet been organized in a way that makes them accessible to AI researchers, I’ll use a lecture format. Most of these topics have never been taught as part of a course like this, so the organization will be very experimental. Thank you for being the guinea pigs.
Part of the goal of a graduate course is to teach the philosophies, norms, and strategies for doing research in a domain. Since alignment is a nascent field, all of these are just being figured out. Furthermore, the problem of evaluating and mitigating risks from systems that don’t yet exist (and whose very design is uncertain!) is very different from the ones we ordinarily face as computer scientists, and requires a different set of skills from the ones we’ve spent years developing. I’ll do my best to convey some of the research principles, but please be aware that we’re all trying to figure them out as we go along.