What is interpretability?

Description

A surprising fact about modern large language models is that nobody really knows how they work internally. At Anthropic, the Interpretability team strives to change that — to understand these models to better plan for a future of safe AI.

Find out more: https://www.anthropic.com/research

Transcript

0:03

I work at Anthropic on the Interpretability team. Interpretability is the science of understanding these AI models from the inside out. Researchers like me are trying to figure out what the networks learn and how they do what they do. It's almost like doing biology of a new kind of organism.

0:22

We're focused on an approach called mechanistic interpretability. We're trying to build from understanding very small units into understanding larger and larger mechanisms.

0:32

It's often surprising to people that we need to go and do interpretability at all, that we don't understand these systems that we've created.

0:42

In some important way, we don't build neural networks. We grow them; we learn them. It's a lot like evolution. It's a lot like the way that we started with little molecules bouncing against each other, and then you got very basic proteins, and then maybe you got cells, and in the end you have, well, you have us, right? But no one designed us to make sense. It's just that every generation there's this grand progression of refinement and change over time. The models are the same way.

1:12

We start with a kind of blank neural network. It's like an empty scaffold that things can grow on. And then as we train the neural network, circuits grow through it. They implement the model's behavior. And so we're in this situation where we understand that initial scaffolding we gave it, and we understand the process that incentivizes those circuits to form, but we don't know what those circuits are or what they do or how they work.

1:40

It turns out that's challenging, because the circuits get packed very densely, and if you want to understand them, you need to pull apart those overlapping pieces. And so if we want to understand neural networks, we're then left with this challenge of going and studying this thing that we grew rather than something that we designed from scratch.

2:00

A child can pass a test at school because they actually learned the material, or they can pass the test because they cheated. To us as the model developers, both of those look like the same outcome. And without interpretability letting us see inside the model, we can't actually tell those two apart.

2:17

We want these models to be safe and reliable. By studying how they work inside, by doing this kind of model biology, we can do a kind of model medicine that can diagnose and cure what ails them and help them do what they're trying to do.

2:33

The power of interpretability is that it gives us a different lens to go in and ask that question, to go and see potential problems. You could imagine developing techniques to steer models towards the correct behaviors. And if we actually understood all the nuts and bolts, then it seems like we ought to be able to intervene in ways that change what they do.

2:52

AI is at a really interesting moment in its development

2:56

where we've figured out some things that work,

3:00

but we don't know the limits of that.

3:03

And we're just beginning to even find the right words

3:05

to talk about what's happening.

3:07

The early 1900s were this golden age of physics, when quantum mechanics was discovered, and special relativity and general relativity, and we finally could understand things about solid-state physics. All of a sudden things were starting to make sense, and it feels like we're speed-running that right now in interpretability.

3:24

The exciting part is that it feels like we're in a position to really understand the core of what thinking is and how thinking works. Having these hard problems and these deep, really difficult questions, and also having just a little bit of traction on them, is about the most that a scientist can ask for if you want to really discover deep things and really exciting things. And so I think there's a way in which we're very fortunate to have such interesting and difficult questions to go and grapple with.