What is interpretability?
Description
A surprising fact about modern large language models is that nobody really knows how they work internally. At Anthropic, the Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.
Find out more: https://www.anthropic.com/research
Transcript
0:03
I work at Anthropic on the Interpretability team.
0:06
Interpretability is the science
0:08
of understanding these AI models from the inside out.
0:12
Researchers like me are trying to figure out
0:15
what the networks learn and how they do what they do.
0:18
It's almost like doing biology of a new kind of organism.
0:22
We're focused on an approach
0:23
called mechanistic interpretability.
0:25
We're trying to build from understanding very small units
0:29
into understanding larger and larger mechanisms.
0:32
It's often surprising to people that we need to go
0:37
and do interpretability at all,
0:39
that we don't understand these systems
0:41
that we've created.
0:42
In some important way, we don't build neural networks.
0:45
We grow them, we learn them.
0:46
It's a lot like evolution.
0:48
It's a lot like the way that we started
0:49
with little molecules bouncing against each other
0:53
and then you got very basic proteins
0:55
and then maybe you got cells
0:57
and in the end you have, well, you have us, right?
1:00
But no one designed us to make sense.
1:03
With every generation, there's this grand progression
1:06
of refinement and change over time.
1:08
The models are the same way.
1:12
We start with a kind of blank neural network.
1:14
It's like an empty scaffold that things can grow on.
1:18
And then as we train the neural network,
1:20
circuits grow through it.
1:22
They implement the model's behavior.
1:23
And so we're in this situation where we understand
1:26
that initial scaffolding we gave it,
1:30
and we understand the process that incentivizes
1:33
those circuits to form,
1:35
but we don't know what those circuits are
1:38
or what they do or how they work.
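To make that concrete, here is a minimal PyTorch sketch of the situation described here. Everything in it (the dimensions, the toy task) is illustrative, not Anthropic's actual setup: we write down the scaffold and the training process explicitly, and the learned weights are the part nobody designed.

```python
import torch
import torch.nn as nn

# The "scaffold": an architecture we design and fully understand.
model = nn.Sequential(
    nn.Linear(64, 256),
    nn.ReLU(),
    nn.Linear(256, 64),
)

# The "growth process": an objective and an optimizer we also understand.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):
    x = torch.randn(32, 64)   # toy data, a stand-in for real training data
    target = x.flip(-1)       # toy task: learn to reverse each vector
    loss = loss_fn(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# What nobody wrote down: the "circuits" now encoded in these tensors.
print(model[0].weight.shape)  # torch.Size([256, 64]) -- thousands of opaque numbers
```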
1:40
It turns out that's challenging
1:41
because the circuits get packed very densely
1:44
and if you want to understand them, you sort of need
1:46
to pull apart those overlapping pieces.
1:47
And so if we want to understand neural networks,
1:49
we're then left with this challenge
1:51
of going and studying this thing that we grew
1:54
rather than something that we designed from scratch.
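One published approach to pulling apart those densely packed, overlapping pieces is training a sparse autoencoder on a model's internal activations, re-expressing them in a wider basis where only a few features are active at a time. The transcript doesn't name this technique, so the sketch below is a hedged illustration: the dimensions and the L1 penalty weight are made up for the example.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Re-expresses dense activations in a wider, sparsely active basis."""
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstruction of the input
        return recon, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_weight = 1e-3  # illustrative: trades reconstruction quality for sparsity

for step in range(1000):
    acts = torch.randn(64, 512)  # stand-in for activations from a real model
    recon, features = sae(acts)
    # Reconstruction keeps the information; the L1 term pushes most features
    # toward zero, so each feature can come to stand for one direction.
    loss = (recon - acts).pow(2).mean() + l1_weight * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```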
2:00
A child can pass a test at school
2:02
because they actually learned the material,
2:05
or they can pass the test because they cheated.
2:07
As the model developers,
2:08
both of those look like the same outcome.
2:11
And without interpretability
2:12
letting us see inside the model,
2:15
we can't actually tell those two apart.
2:17
We want these models to be safe and reliable.
2:20
By studying how they work inside,
2:22
by doing this kind of model biology,
2:25
we can do some kind of model medicine
2:28
that can diagnose and cure what ails it
2:31
and help it do what it's trying to do.
2:33
The power of interpretability
2:34
is that it gives us a different lens
2:36
to go in and ask that question,
2:38
to go and see potential problems.
2:40
You could imagine developing techniques
2:41
to steer models towards the correct behaviors.
2:45
But if we actually understood all the nuts and bolts,
2:48
then it seems like we ought to be able to intervene
2:51
in ways that change what they do.
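One concrete form such an intervention can take is activation steering: adding a chosen direction to a layer's activations at inference time. A minimal sketch follows, with a toy layer standing in for a real model; the steering vector here is random purely for illustration (in practice it would be derived from the model, for example from activations on contrasting prompts).

```python
import torch
import torch.nn as nn

# Toy stand-in for one layer of a real model (not an actual LLM).
layer = nn.Linear(512, 512)

# Hypothetical steering vector and strength, chosen only for illustration.
steering_vector = torch.randn(512)
strength = 4.0

def steering_hook(module, inputs, output):
    # Intervene on the layer's output: nudge it along the chosen direction.
    return output + strength * steering_vector

handle = layer.register_forward_hook(steering_hook)

x = torch.randn(1, 512)
steered = layer(x)    # activations are shifted by the hook
handle.remove()
unsteered = layer(x)  # same input, no intervention
print((steered - unsteered).norm())  # nonzero: the intervention changed the output
```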
2:52
AI is at a really interesting moment in its development
2:56
where we've figured out some things that work,
3:00
but we don't know the limits of that.
3:03
And we're just beginning to even find the right words
3:05
to talk about what's happening.
3:07
The early 1900s were this golden age of physics
3:11
where quantum mechanics was discovered
3:13
and special relativity and general relativity,
3:15
and we could finally understand things
3:17
about solid-state physics, and things,
3:19
all of a sudden, were starting to make sense,
3:21
and it feels like we're sort of
3:22
speedrunning that right now in interpretability.
3:24
The exciting part is,
3:27
it feels like we're in a position
3:29
to really understand the core of
3:31
what is thinking, how does thinking work?
3:33
Having these hard problems
3:34
and these deep, really difficult questions
3:37
and also having just a little bit of traction on them,
3:39
that's about the most, I feel,
3:41
that a scientist can ask for
3:42
if you want to really discover deep things
3:44
and really exciting things.
3:45
And so I think there's a way in which we're very fortunate
3:48
to have such interesting and difficult questions
3:50
to go and grapple with.