DeepSeek, China, OpenAI, NVIDIA, xAI, TSMC, Stargate,


Lex Fridman Podcast

Dylan Patel is the founder of SemiAnalysis, a research & analysis company specializing in semiconductors, GPUs, CPUs, and AI hardware. Nathan Lambert is a research scientist at the Allen Institute for AI (Ai2) and the author of a blog on AI called Interconnects.

Transcript

Snips


[01:40:00] AI Cold War?

🎧 Play snip - 2min (01:37:59 - 01:40:00)

📚 Transcript
Lex Fridman

So are there any concerns that the export controls push China to take military action in Taiwan?

Dylan Patel

This is the big risk, right? The further you push China away from having access to cutting-edge American and global technologies, the more likely they are to say, well, because I can't access it, I might as well... like, no one should access it, right? And there's a few interesting aspects of that, right? Like, you know, China has an urban-rural divide like no other. They have a male-female birth ratio like no other, to the point where if you look in most of China, the ratio is not that bad. But when you look at single dudes in rural China, it's like a 30-to-1 ratio. And those are disenfranchised dudes, right? Like, quote unquote, the US has an incel problem; like, China does too. It's just they're placated in some way or crushed down. What do you do with these people? And at the same time, you're not allowed to access the most important technology. At least the US thinks so. China is maybe starting to think this is the most important technology by starting to dump subsidies in it, right? They thought EVs and renewables were the most important technology. They dominate that now, right? Now they're starting to, they started thinking about semiconductors in the late 2010s and early 2020s, and now they've been dumping money in and they're catching up rapidly. And they're going to do the same with AI, right? Because they're very talented, right? So the question is, when does this hit a breaking point, right? And if China sees this as, hey, they can't continue, and if not having access and starting a true hot war, right, taking over Taiwan or trying to subvert its democracy in some way or blockading it, hurts the rest of the world far more than it hurts them, this is something they could potentially do, right? And so is this pushing them towards that? Potentially, right? I'm not quite a geopolitical person, but it's obvious that the world regime of peace and, like, trade is, like, super awesome for economics, uh, but, but at some

[01:44:13] TSMC's Foundry Model

🎧 Play snip - 3min (01:41:00 - 01:44:13)

📚 Transcript
Lex Fridman

So can you explain the role of TSMC in the story of semiconductors and maybe also how the United States can break the reliance on TSMC?

Dylan Patel

I don't think it's necessarily breaking the reliance. I think it's getting TSMC to, you know, build in the US. But so, taking a step back, right, TSMC produces most of the world's chips, right, especially on the foundry side. You know, there's a lot of companies that build their own chips: Samsung, Intel, you know, ST Micro, Texas Instruments, you know, Analog Devices, all these kinds of companies build their own chips, and NXP. But more and more of these companies are outsourcing to TSMC and have been for multiple decades.

Lex Fridman

Can you explain the supply chain there and where most of TSMC is in terms of manufacturing?

Dylan Patel

Sure. So historically, the supply chain was companies would build their own chips. A company would be started, they'd design the chip and build the chip and sell it. Over time, this became really difficult because the cost of building a fab continues to compound every single generation. Of course, figuring out the technology for it is incredibly difficult regardless, but just the dollars and cents that are required, ignoring, you know, saying, hey, yes, I have all the technical capability, which is really hard to get, by the way, right? Intel's failing, Samsung's failing, et cetera. But if you look at just the dollars to spend to build that next-generation fab, it keeps growing, right? Sort of like, you know, Moore's law is halving the cost of chips every two years. There's a separate law that's sort of like doubling the cost of fabs every handful of years. And so you look at a leading-edge fab that is going to be profitable today, that's building, you know, three-nanometer chips or two-nanometer chips in the future, that's going to cost north of $30, $40 billion, right? And that's just for, like, a token amount. That's, like, the base building block. You probably need to build multiple, right? And so when you look at the industry over the last, you know, if I go back 20, 30 years ago, there were 20, 30 companies that could build the most advanced chips, and then they would design them themselves and sell them, right? So companies like AMD would build their own chips. Intel, of course, still builds their own chips. They're very famous for it. IBM would build their own chips. And, you know, you could keep going down the list. All these companies built their own chips. Slowly, they kept falling like flies. And that's because of what TSMC did, right? They created the foundry business model, which is: I'm not going to design any chips. I'm just going to contract manufacture chips for other people. And one of their early customers is NVIDIA, right? NVIDIA is the only semiconductor company that's worth, you know, that's doing more than a billion dollars of revenue, that was started in the era of foundry, right? Every other company started before then and at some point had fabs, which is actually incredible, right? You know, like AMD and Intel and Broadcom. Such a great fact. It's like everyone had fabs at some point. Or, you know, some companies like Broadcom, it was like a merger, amalgamation of various companies that rolled up. But even today, Broadcom has fabs, right? They build iPhone RF radio chips sort of in Colorado, right? All these companies had fabs, and for most of the fabs, they threw them away or sold them off or they got rolled into something else. And now everyone relies on TSMC, right? Including Intel. Their latest PC chip uses TSMC chips, right? It also uses some Intel chips, but it uses TSMC process.

[01:54:20] TSMC's Cultural Advantage

🎧 Play snip - 7min (01:47:46 - 01:54:20)

📚 Transcript
Dylan Patel

So there's aspects of it that I would say yes, and aspects that I'd say no, right? TSMC is way ahead because former Texas Instruments executive Morris Chang wasn't promoted to CEO. And he's like, screw this, I'm going to go make my own chip company, right? And he went to Taiwan and made TSMC, right? And there's a whole lot more story there. So Texas Instruments could have been the TSMC, right? Texas semiconductor manufacturing instead of Texas Instruments, right? But, you know, so there is that whole story there. But sitting here in Texas, I mean, that sounds like a human story, like he didn't get promoted. Just the brilliance of Morris Chang, you know, which I wouldn't underplay. But there's also, like, a different level of how this works, right? So in Taiwan, the top percent of graduates, of students that go to the best school, which is NTU, the top percent of those all go work to TSMC, right? And guess what their pay is? Their starting pay is like $80,000, $70,000, right? Which is like, that's like starting pay for a good graduate in the US, right? Not the top. The top graduates are making hundreds of thousands of dollars at the Googles and the Amazons, and now, I guess, the OpenAIs of the world, right? So there is a large dichotomy of, like, what is the top 1% of the society doing? And where are they headed because of economic reasons, right? Intel never paid that crazy good, right? And it didn't make sense to them, right? That's one aspect, right? Where's the best going? Second is the work ethic, right? Like, you know, we like to work. You know, you work a lot, we work a lot. But at the end of the day, what is the time and amount of work that you're doing? And what does a fab require, right? Fabs are not work-from-home jobs. You go into the fab, and it's grueling work, right? There's, hey, if there is any amount of vibration, right? An earthquake happens, vibrates the machines. They're either broken, you've scrapped some of your production, and then in many cases, they're, like, not calibrated properly. So when TSMC, when there's an earthquake, right, recently there's been an earthquake, TSMC doesn't call their employees. They just go to the fab and, like, they just show up. The parking lot gets slammed and people just go into the fab and fix it. It's like ants, right? Like, you know, a hive of ants doesn't get told by the queen what to do. The ants just know. It's like one person just specializes on these

Nathan Lambert

One task. And it's like, you're gonna take this one tool, and you're the best person in the world, and this is what you're gonna do for your whole life, is this one task in the fab, which is like

Dylan Patel

Some special chemistry plus nano-manufacturing on one line of tools that continues to get iterated. And yeah, it's just like, it's like a specific plasma etch for removing silicon dioxide, right? That's all you focus on your whole career. And it's like such a specialized thing. And so it's not like the tasks are transferable. AI today is awesome because, like, people can pick it up like that. Semiconductor manufacturing is very antiquated and difficult. None of the materials are online for people to read easily and learn, right? The papers are very dense, and it takes a lot of experience to learn. And so it makes the barrier to entry much higher too. So when you talk about, hey, you have all these people that are super specialized, they will work 80 hours a week in a factory, in a fab. And if anything goes wrong, they'll go show up in the middle of the night because of some earthquake. Their wife is like, there was an earthquake. He's like, great, I'm gonna go to the fab. It's like, would you as an American do that? It's like, these sorts of things are, I guess, what exemplify why TSMC is so amazing. Now, can you replicate it in the US? Let's not ignore, Intel was the leader in manufacturing for over 20 years. They brought every technology to market first besides EUV. Strained silicon, high-k metal gate, FinFET, you know, and the list goes on and on and on of technologies that Intel brought to market first, made the most money from, and manufactured at scale first, best, highest profit margins, right? So we shouldn't say that Intel can't do this, right? It's that the culture has broken, right? You've invested in the wrong things. They said no to the iPhone. They had all these different things regarding, like, you know, mismanagement of the fabs, mismanagement of designs, lockup, right? And at the same time, all these brilliant people, right, these, like, 50,000 PhDs, you know, or masters, that have been working on specific chemical or physical processes or nano-manufacturing processes for decades in Oregon, they're still there. They're still producing amazing work. It's just, like, getting it to the last mile of production at high yield, where you can design, where you can manufacture dozens and hundreds of different kinds of chips, and where it's a good customer experience, has broken, right? You know, it's that customer experience. Part of it is, like, people will say Intel was too pompous in the 2000s, 2010s, right? They just thought they were better than everyone. The tool guys were like, oh, I don't think that this is mature enough. And they're like, ah, you just don't know. We know, right? This sort of stuff would happen. And so, can the US bring leading-edge semiconductor manufacturing to the US? Emphatically, yes. And we are. It's happening. Arizona is getting better and better as time goes on. TSMC has built roughly 20% of their capacity for 5 nanometer in the US. Now, this is nowhere near enough. 20% of capacity in the US is like nothing, right? Um, and furthermore, this is still dependent on Taiwan existing, right? There's a sort of important way to separate it out. There's R&D and there's high-volume manufacturing. There are effectively three places in the world that are doing leading-edge R&D: there's Hsinchu, Taiwan; there's Hillsboro, Oregon; and there is Pyeongtaek, South Korea, right? These three places are doing the leading-edge R&D for the rest of the world's leading-edge semiconductors, right?
Now, manufacturing can be distributed more globally, right? And this is sort of where this dichotomy exists, of, like, who's actually modifying the process, who's actually developing the next generation one, who's improving them: it's Hsinchu, it's Hillsboro, it's Pyeongtaek, right? It is not the rest of these fabs, like Arizona, right? Arizona is a paperweight. If Hsinchu disappeared off the face of the planet, you know, within a year, a couple years, Arizona would stop producing, too, right? It's actually, like, pretty critical. One of the things I like to say is, if I had, like, a few missiles, I know exactly where I could cause the most economic damage, right? It's not targeting the White House, right? It's the R&D centers for TSMC, Intel, Samsung, and then some of the memory guys, Micron and Hynix.

Lex Fridman

Because they define the future evolution of these semiconductors and everything's moving so rapidly that it really is fundamentally about R&D. And it is all about TSMC.

[02:09:01] H200 for Reasoning

🎧 Play snip - 4min (02:04:38 - 02:09:01)

📚 Transcript
Lex Fridman

Can we go back to the specific detail of the different hardware? There's this nice graphic in the export controls of which GPUs are allowed to be exported and which are not. Can you kind of explain the difference? Is there, from a technical perspective, are the H20s promising?

Dylan Patel

Yeah, so this goes, and I think we'd have to, like, we need to dive really deep into the reasoning aspect and what's going on there. But the H20, you know, the US has gone through multiple iterations of the export controls, right? The H800 was at one point allowed back in '23, but then it got canceled. And by then, you know, DeepSeek had already built their cluster of, they claim, 2K. I think they actually have, like, many more, like something like 10K of those. And now this H20 is the legally allowed chip, right? NVIDIA shipped a million of these last year to China, right? For context, it was like four or five million GPUs, right? So the percentage of GPUs that were this China-specific H20 is quite high, right? You know, roughly 20%, 25%, right? 20% or so. And so this H20 has been neutered in one way, but it's actually upgraded in other ways, right? And, you know, you could think of chips along three axes for AI, right? You know, ignoring software stack and, like, exact architecture, just raw specifications: there's floating point operations, right? Flops. There is memory bandwidth and memory capacity, right? Memory IO. And then there is interconnect, right? Chip-to-chip interconnections. All three of these are incredibly important for making AI systems, right? Because AI systems involve a lot of compute. They involve a lot of moving memory around, whether it be to memory or to other chips, right? And so of these three vectors, the US initially had two of these vectors controlled and one of them not controlled: flops and interconnect bandwidth were initially controlled. And then they said, no, no, no, no, we're going to remove the interconnect bandwidth and just make it a very simple, only flops. But now NVIDIA can make a chip that has, okay, it's cut down on flops, you know, it's like one third that of the H100, right, on spec-sheet paper performance for flops. In the real world, it's closer to, like, half, or maybe even like 60% of it, right? But then on the other two vectors, it's just as good for interconnect bandwidth. And then for memory bandwidth and memory capacity, the H20 has more memory bandwidth and more memory capacity than the H100. Now, recently, in our research, we cut NVIDIA's production for H20 for this year down drastically. They were going to make another 2 million of those this year, but they just canceled all the orders a couple of weeks ago. In our view, that's because we think that they think they're going to get restricted, right? Because why would they cancel all these orders for H20? Because they shipped a million of them last year. They had orders in for a couple million this year, and it's just gone, right? For H20, B20, right? A successor to H20. And now they're all gone. Now, why would they do this, right? I think it's very clear, right? The H20 is actually better for certain tasks. And that certain task is reasoning, right? Reasoning is incredibly different than, you know, when you look at the different regimes of models, right? Pre-training is all about flops, right? It's all about flops. There's things you do, like mixture of experts that we talked about, to trade off interconnect or to trade off, you know, other aspects and lower the flops and rely more on interconnect and memory. But at the end of the day, flops is everything, right? We talk about models in terms of how many flops they are, right? So, like, you know, we talk about, oh, GPT-4 is 2e25, right? Two times 10 to the 25th, you know, 25 zeros, right?
Flops, right? Floating point operations. For training. For training, right? And we're talking about the restrictions for the 2e24, right? Or 25, whatever. The US has an executive order that Trump recently unsigned, but which was, hey, 1e26: once you hit that number of floating point operations, you must notify the government, and you must share your results with us. There's a level of model where the US government must be told, and that's 1e26. And so as we move forward, this is incredibly important. Flops is the vector that the government has cared about historically, but the other two vectors are arguably just as important, right? And especially when we come to this new paradigm, which the world is only just learning about over the last six months, right?
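As a rough aside on where those flop counts come from: a common approximation (an assumption here, not something stated in the episode) is that training a dense transformer costs about 6 FLOPs per parameter per training token. A minimal sketch, with illustrative parameter and token counts:

```python
# Rough training-compute estimate using the common ~6 * N * D rule of thumb
# (6 FLOPs per parameter per training token). The parameter and token
# counts below are illustrative assumptions, not figures from the episode.

def training_flops(active_params: float, tokens: float) -> float:
    """Approximate total training FLOPs for a transformer."""
    return 6 * active_params * tokens

# e.g. ~280B active parameters (an assumed MoE configuration) trained on
# ~13T tokens lands near the 2e25 scale quoted for GPT-4.
flops = training_flops(2.8e11, 1.3e13)
print(f"{flops:.1e} FLOPs")                            # ~2.2e25
print("over 1e26 reporting threshold:", flops > 1e26)  # False
```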


[02:11:34] KVCache in Transformers

🎧 Play snip - 2min (02:09:42 - 02:11:33)

📚 Transcript
Dylan Patel

Why is memory so important? It's because so far we've talked about parameter counts, right? And with mixture of experts, you can change how many active parameters versus total parameters to embed more data but have less flops. But more important, you know, another aspect of what's part of this humongous revolution in the last handful of years is the transformer, right? And the attention mechanism. The attention mechanism is how the model understands the relationships between all the words in its context, right? And that is separate from the parameters themselves, right? And that is something that you must calculate, right? How each token, right, each word in the context length, is relatively connected to each other, right? And I think, Nathan, you should explain the KV cache better.

Lex Fridman

The KV cache is one of the optimizations that enable this.

Nathan Lambert

Yeah. So the attention operator has three core things. It's queries, keys, and values. QKV is the thing that goes into this. You'll look at the equation. You see that these matrices are multiplied together. These words, query, key, and value, come from information retrieval backgrounds, where the query is the thing you're trying to get the values for, and you access the keys and values as a reweighting. My background's not in information retrieval and things like this. It's just fun to have backlinks. And what effectively happens is that when you're doing these matrix multiplications, you're having matrices that are of the size of the context length, so the number of tokens that you put into the model. And the KV cache is effectively some form of compressed representation of all the previous tokens in the model. So when you're doing this, we talk about autoregressive models, you predict one token at a time. You start with whatever your prompt was, you ask a question, like who was the president in 1825, and the model then is going to generate its first token. For each of these tokens, you're doing the same attention operator, where you're multiplying these query, key, value
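To make the mechanics concrete, here is a minimal single-head decode loop with a KV cache, purely an illustrative sketch: the weights are random, and real models use many heads and layers. The point is that each new token computes one query against all cached keys and values, and the cache grows by one entry per generated token.

```python
import numpy as np

d = 64                                    # head dimension (assumed)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []                 # grows by one entry per token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache.append(k)                     # cache K/V instead of recomputing
    V_cache.append(v)                     # them for every previous token
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)           # one new query vs all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the context so far
    return weights @ V                    # attention output for this token

for _ in range(5):                        # generate 5 tokens autoregressively
    decode_step(rng.normal(size=d))
print("cached tokens:", len(K_cache))     # 5 -- cache grows linearly
```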

[02:13:17] Transformer Memory Cost

🎧 Play snip - 1min (02:12:03 - 02:13:17)

📚 Transcript
Nathan Lambert

Essentially, one of the key, quote unquote, drawbacks of the attention operator and the transformer is that there is a form of quadratic memory cost in proportion to the context length. So as you put in longer questions, the memory used in order to make that computation is going up in the form of a quadratic. You'll hear about a lot of other language model architectures that are like subquadratic or linear attention forms, which is like state space models. We don't need to go down all these now. And then there's innovations on attention to make this memory usage and the ability to attend over long contexts much more accurate and high performance.

Lex Fridman

And those innovations are going to help you with, I mean, you're highly memory constrained. They help with memory constraints and performance.

Nathan Lambert

So if you put in a book into, I think Gemini is the model that has the longest context length that people are using. Gemini is known for 1 million and now 2 million context length. You put a whole book into Gemini and sometimes it'll draw facts out of it. It's not perfect. They're getting better. So there's two things. It's like one, to be able to serve this on the memory level. Google has magic with their TPU stack where they can serve really long contexts. And then there's also many decisions along the way to actually make long context
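A back-of-envelope sketch of the scaling Nathan describes, with assumed shapes (heads, head dimension, fp16): naively materializing the L x L attention scores grows quadratically with context length, while the KV cache itself grows linearly, which is why long contexts stress memory so hard.

```python
# Illustrative memory for one layer, batch 1, fp16; shapes are assumptions.
n_heads, d_head, bytes_el = 32, 128, 2

for L in (1_000, 100_000, 1_000_000):
    scores = n_heads * L * L * bytes_el           # naive L x L score matrix
    kv     = 2 * n_heads * d_head * L * bytes_el  # K and V, one per token
    print(f"L={L:>9,}: scores {scores/1e9:12.1f} GB, KV {kv/1e9:7.2f} GB")
# At a 1M-token context the naive score matrix is tens of terabytes per
# layer, which is why attention kernels avoid materializing it at all.
```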

[02:14:55] LLM Token Pricing

🎧 Play snip - 1min (02:13:39 - 02:14:54)

📚 Transcript
Nathan Lambert

From the model.

Dylan Patel

I can explain that. So today, if you use a model, like you look at an API, OpenAI charges a certain price per million tokens, right? And that price for input and output tokens is different, right? And the reason is that when you're inputting a query into the model, right, let's say you have a book, right? That book, you must now calculate the entire KV cache for, right? This key-value cache. And so when you do that, that is a parallel operation. All of the tokens can be processed at one time. And therefore you can dramatically reduce how much you're spending, right? The flop requirements for generating a token and an input token are identical, right? If I input one token or if I generate one token, it's completely identical. I have to go through the model, right? But the difference is that I can do that input, i.e., the prefill, i.e., the prompt, simultaneously in a batch nature, right? And therefore, it is all flops.

Lex Fridman

I think in the pricing model, mostly, input tokens are about one-fourth the price of the output tokens.

Dylan Patel

Correct. But then output tokens, the reason why they're so expensive, is because I can't do it in parallel, right? It's autoregressive. Every time I generate a token, I must not only read the whole entire model into memory, right, and activate it, calculate it, to generate the next token. I also have to
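Picking up the thread, a sketch of the asymmetry being described, using commonly cited H100-class numbers as assumptions: at batch size 1, every decode step has to stream all the weights from HBM, so memory bandwidth caps tokens per second, while prefill amortizes that same weight read across every prompt token in the batch.

```python
# Decode is memory-bandwidth bound: each generated token re-reads the
# weights. Numbers are illustrative assumptions (H100-class HBM bandwidth,
# a 70B-parameter model held in fp16).
hbm_bandwidth = 3.35e12          # bytes/s, ~3.35 TB/s HBM3
model_bytes = 70e9 * 2           # 70B params * 2 bytes each

ceiling = hbm_bandwidth / model_bytes   # tokens/s upper bound at batch 1
print(f"decode ceiling: ~{ceiling:.0f} tokens/s per GPU at batch size 1")
# Prefill processes the whole prompt in one parallel pass over the same
# weights, which is the physical reason input tokens are priced cheaper.
```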

[02:16:42] Reasoning Model Costs

🎧 Play snip - 1min (02:15:40 - 02:16:49)

📚 Transcript
Nathan Lambert

And what happens is that the output context length is so much higher. And I mean, I learned a lot about this from Dylan's work, which is essentially, as the output length gets higher, you're using this, you're writing this quadratic in terms of memory used. And then on the GPUs that we have, effectively, you're going to run out of memory, and they're all trying to serve multiple requests at once. So doing this batch processing, where not all the prompts are exactly the same, is really complex handling. And then as context lengths get longer, there's this, I think you call it critical batch size, where your ability to serve more users, so how much you can parallelize your inference, plummets because of this long context. So your memory usage is going way up with these reasoning models, and you still have a lot of users, so effectively the cost to serve

Lex Fridman

Multiplies by a ton. And we're looking at a plot where the x-axis is, uh, sequence length. I.e.

Dylan Patel

How many tokens are being generated slash prompt, right? So if I put in a book, that's a million tokens, right? But if I put in the sky is blue, then that's like six tokens or whatever.

Lex Fridman

We should say that what we're calling reasoning and chain of thought is extending this sequence
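A toy version of the effect Nathan is pointing at, with every number an assumption: the KV cache per request grows with sequence length, so the number of requests that fit alongside the weights on one GPU collapses as reasoning chains get long.

```python
# How long contexts crush serving batch size. All sizes are illustrative.
hbm = 80e9                      # 80 GB of GPU memory
weights = 40e9                  # assumed resident model weights, 40 GB
kv_per_token = 160e3            # assumed KV-cache bytes per token

for seq_len in (1_000, 10_000, 100_000):
    kv_per_request = kv_per_token * seq_len
    max_batch = int((hbm - weights) // kv_per_request)
    print(f"seq {seq_len:>7,}: ~{max_batch:>3} concurrent requests")
# 1k tokens -> hundreds of users per GPU; 100k tokens -> a couple.
```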

[02:22:26] DeepSeek R1 Cost

🎧 Play snip - 2min (02:20:50 - 02:22:25)

📚 Transcript
Dylan Patel

The other aspect is they did it so cheap, right? And the so cheap, we kind of talked about on the training side, why it was so cheap.

Lex Fridman

Yeah, let's talk about why it's so cheap on the inference. It works well and it's cheap. Why is R1 so damn cheap?

Dylan Patel

So I think there's a couple factors here, right? One is that they do have model architecture innovations, right? This MLA, this new attention that they've done, is different than the attention from "Attention Is All You Need," the transformer attention, right? Now, others have already innovated. There's a lot of work like MQA, GQA, local-global, all these different innovations that, like, try to bend the curve, right? It's still quadratic, but the constant is now smaller, right?

Nathan Lambert

Related to our previous discussion, this multi-head latent attention can save about 80 to 90 percent in memory from the attention mechanism, which helps especially at long contexts.

Dylan Patel

It's 80 to 90 percent versus the original, but then, versus what people are actually doing, it's still an innovation.

Nathan Lambert

This 80 to 90 percent doesn't say that the whole model is 80 to 90 percent cheaper, just this one part of it.

Dylan Patel

Well, and not just that, right? Like, other people have implemented techniques like local-global, sliding window, and GQA/MQA. But anyways, like, DeepSeek's attention mechanism is a true architectural innovation. They did tons of experimentation. And this dramatically reduces the memory pressure. It's still there, right? It's still attention. It's still quadratic. It's just dramatically reduced it relative to prior forms.
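For a sense of magnitude, here is a rough per-token KV-cache comparison between standard multi-head attention and a compressed latent in the spirit of MLA. The dimensions are illustrative assumptions loosely shaped like a DeepSeek-class model, not the exact configuration, and the exact savings depend on the baseline you compare against, which is why the quoted figure is 80 to 90 percent rather than this toy number.

```python
# Per-token KV-cache bytes: full K/V per head vs a shared compressed latent.
# Dimensions are assumptions for illustration, not DeepSeek's real config.
n_heads, d_head, n_layers, bytes_el = 128, 128, 60, 2

mha = 2 * n_heads * d_head * n_layers * bytes_el   # full K and V, all heads
latent_dim = 576                                   # assumed compressed KV latent
mla = latent_dim * n_layers * bytes_el

print(f"MHA: {mha/1e6:.2f} MB/token, latent: {mla/1e6:.3f} MB/token, "
      f"reduction: {1 - mla/mha:.0%}")
```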


[02:25:22] DeepSeek's Inference Cost Advantage

🎧 Play snip - 3min (02:22:10 - 02:25:22)

📚 Transcript
Lex Fridman

That's the memory pressure. I should say, in case people don't know, R1 is 27 times cheaper than O1.

Nathan Lambert

We think that OpenAI had a large margin built in. There's multiple factors. We should break down the factors. It's $2 per million token output for R1 and $60 per million token output for O1.

Dylan Patel

Yeah, let's look at this. So I think this is very important, right? OpenAI has that drastic gap with DeepSeek in pricing. But DeepSeek is offering the same model, because they open-weighted it, to everyone else for a very similar, much lower price than what others are able to serve it for, right? So there's two factors here, right? Their model is cheaper, right? Um, it is 27 times cheaper. Well, I don't remember the number exactly off the top of my head.

Lex Fridman

So we're looking at a graphic that's showing different places serving V3, DeepSeek V3, which is similar to DeepSeek R1. And there's a vast difference in serving cost. And what explains that difference?

Dylan Patel

And so part of it is OpenAI has a fantastic margin. When they're doing inference, their gross margins are north of 75%. So that's a four to five X factor right there of the cost difference is that OpenAI is just making crazy amounts of money because they're the only one with the capability.

Lex Fridman

Do they need that money? Are they using it for R&D?

Dylan Patel

They're losing money, obviously, as a company, because they spend so much on training, right? So the inference itself is a very high margin, but it doesn't recoup the cost of everything else they're doing. So yes, they need that money, because the revenue and margins pay for continuing to build the next thing, right? Along with raising more money.

Lex Fridman

So the suggestion is that DeepSeek is like really bleeding out money.

Dylan Patel

Well, so here's one thing, right? We'll get to this in a second, but, like, DeepSeek doesn't have any capacity to actually serve the model. They stopped signups. The ability to use it is, like, non-existent now, right, for most people, because so many people are trying to use it. They just don't have the GPUs to serve it. OpenAI has hundreds of thousands of GPUs between them and Microsoft to serve their models. DeepSeek has a factor much lower. Even if you believe our research, which is 50,000 GPUs, and a portion of those are for research, a portion of those are for the hedge fund, they still have nowhere close to the GPU volumes and capacity to serve the model at scale. So it is cheaper. A part of that is OpenAI making a ton of money. Is DeepSeek making money on their API? Unknown. I don't actually think so. And part of that is this chart, right? Look at all the other providers, right? Together AI, Fireworks AI are very high-end companies, right? Ex-Meta. Together AI is Tri Dao, the inventor of, like, Flash Attention, right? Which is a huge efficiency technique, right? They're very efficient, good companies. And I do know those companies make money, right? Not tons of money on inference, but they make money. And so they're serving at, like, a five to seven X difference in cost, right? And so, you know, now when you equate, okay, OpenAI is making tons of money, that's like a five X difference. And the companies that are trying to make money on this model, that's like a five X difference. There is still a gap, right? There's still a gap, and that is just DeepSeek being really freaking good, right? The model architecture, MLA, the way they did the MOE, all these things. There are, like, legitimate just efficiency differences.
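The decomposition Dylan sketches actually reconciles with the quoted prices. A quick check, using only figures mentioned in the conversation:

```python
# Decomposing the O1 vs R1 output-token price gap using quoted figures.
o1_price, r1_price = 60.0, 2.0      # $ per million output tokens (quoted)
total_gap = o1_price / r1_price     # ~30x overall

margin_factor = 4.5                 # OpenAI margin, the quoted "4 to 5x"
efficiency_gap = total_gap / margin_factor
print(f"{total_gap:.0f}x total = ~{margin_factor}x margin "
      f"* ~{efficiency_gap:.1f}x remaining efficiency gap")
# ~6.7x remaining, consistent with the "five to seven X" serving gap above.
```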


[02:30:53] Race to the Top vs. Bottom in AI Safety

🎧 Play snip - 19sec (02:30:35 - 02:30:55)

📚 Transcript
Nathan Lambert

Something that Dario talks about. That's the situation that Dario wants to avoid. Dario talks about the difference between the race to the bottom and the race to the top. And the race to the top is where there's a very high standard on safety, there's a very high standard on how your model performs on certain crucial evaluations. And when certain companies are really good at it, they will converge.


[02:39:09] Superhuman Persuasion

🎧 Play snip - 1min (02:37:52 - 02:39:09)

📚 Transcript
Dylan Patel

There's this very good quote from Sam Altman, who, you know, he can be a hype beast sometimes, but one of the things he said, and I think I agree, is that superhuman persuasion will happen before superhuman intelligence, right? And if that's the case, then these things, before we get this AGI, ASI stuff, we can embed superhuman persuasion towards our ideal, or whatever the ideal of the model maker is. And again, today, I truly don't believe DeepSeek has done this, but it is a sign of what could happen.

Lex Fridman

So one of the dystopian worlds is described by Brave New World. So we could just be stuck scrolling Instagram, looking at cute puppies or worse, and then talking to bots that are giving us a narrative, and we completely get lost in that world that's controlled by somebody else versus thinking independently. And that's a major concern as we rely more and more on these kinds of systems.

Nathan Lambert

I mean, we've already seen this with recommendation systems.

Dylan Patel

Yeah, recommendation systems hack the dopamine-induced reward circuit, but the brain is a lot more complicated. And what other sort of circuits, quote-unquote feedback loops, in your brain can you hack slash subvert, in ways beyond what recommendation systems are purely just trying to do, you know, increase time spent and ads, et cetera? There's so many more goals that can be achieved through these complicated models.

[02:43:51] Censorship in LLMs

🎧 Play snip - 1min (02:42:33 - 02:43:53)

📚 Transcript
Nathan Lambert

Give multiple examples. There's probably a few things to keep in mind here. One is the kind of Tiananmen Square factual knowledge. How does that get embedded into the models? Two is the Gemini, what you call the Black Nazi incident, which is when Gemini as a system had this extra thing put into it that dramatically changed the behavior. And then three is what most people would call general alignment, RLHF post-training. Each of these have very different scopes in how they are applied. If you're just looking at the model weights, in order to audit specific facts is extremely hard, because you have to comb through the pre-training data, and that's terabytes of files, and look for very specific words or hints of the words.

Lex Fridman

So I guess one way to say it is that you can insert censorship or alignment at various stages in the pipeline. And what you refer to now is at the very beginning of the data. So if you want to get rid of facts in a model, you have to do it at every stage.

Nathan Lambert

You have to do it at the pre-training. So most people think that pre-training is where most of the knowledge is put into the model. And then you can elicit and move that in different ways, whether through post-training or whether through systems afterwards.

[03:03:43] AI Sandboxes and Verifiable Tasks

🎧 Play snip - 2min (03:02:09 - 03:03:44)

📚 Transcript
Dylan Patel

But at some point, I truly think that, like, you know, we'll spawn models, and initially all the training will be in sandboxes. But then at some point, you know, the language model pre-training is going to be dwarfed by what is this reinforcement learning. You know, you'll pre-train a multimodal model that can see, that can read, that can write, you know, blah, blah, blah, whatever, vision, audio, et cetera. But then you'll have it play in a sandbox infinitely, and figure out math, figure out code, figure out navigating the web, figure out operating a robot arm, right? And then it'll learn so much. And the aha moment, I think, will be when this is available to then create something that's not good, right? Like, oh, cool, part of it was, like, figuring out how to use the web. Now, all of a sudden, it's figured out really well how to just get hundreds of thousands of followers that are real, and real engagement, on Twitter, because all of a sudden this is one of the things that are verifiable.

Lex Fridman

And maybe not just engagement, but make money. Yes. I mean, that could be the thing where, almost fully automated, it makes, you know, $10 million by being an influencer, selling a product, creating the product. And I'm not referring to a hype product, but an actual product. Like, holy shit, this thing created a business. It's running it. It's the face of the business, that kind of thing. Or maybe a number one song. It creates the whole infrastructure required to create the song, to be the influencer that represents that song, that kind of thing. If it makes a lot of money, that could be the move. I mean, our culture respects money in that kind of way. And it's verifiable, right? It's verifiable, right? The bank account can't lie.

Dylan Patel

Exactly. There's

[03:06:50] O3 Mini's Performance

🎧 Play snip - 1min (03:05:32 - 03:06:50)

📚 Transcript
Lex Fridman

What are we expecting from the different flavors? Can you just lay out the different flavors of the o models, and from Gemini, the reasoning model?

Nathan Lambert

Something I would say about these reasoning models is, we talked a lot about reasoning training on math and code. And what is done is that you have the base model, which we've talked about a lot, trained on the internet. You do this large-scale reasoning training with reinforcement learning. And then what DeepSeek detailed in this R1 paper, which for me is one of the big open questions on how you do this, is that they did reasoning-heavy but very standard post-training techniques after the large-scale reasoning RL. So they did the same things with a form of instruction tuning through rejection sampling, which is essentially heavily filtered instruction tuning with some reward models. And then they did this RLHF, but they made it math-heavy. So some of this transfers. We looked at this philosophical example early on. One of the big open questions is, how much does this transfer? If we bring in domains after the reasoning training, are all the models going to become eloquent writers by reasoning? Is this philosophy stuff going to be opened? We don't know, in the research, how much this will transfer. There's other things about how we can make soft verifiers and things like this. But there is more training after reasoning, which makes it easier to use these reasoning models.

[03:28:30] NVIDIA and the DeepSeek Moment

🎧 Play snip - 4min (03:24:25 - 03:28:30)

📚 Transcript
Lex Fridman

It'll get cheaper and cheaper and cheaper. The big DeepSeek R1 release freaked everybody out because of the cheapness. One of the manifestations of that is NVIDIA stock plummeted. Can you explain what happened? I mean, and also just explain this moment and whether, you know, NVIDIA is going to keep winning.

Nathan Lambert

We're both NVIDIA bulls here, I would say. And in some ways, the market response is reasonable. Most of NVIDIA's biggest customers in the US are major tech companies, and they're spending a ton on AI. And a simple interpretation of DeepSeek is, you can get really good models without spending as much on AI. So in that capacity, it's like, oh, maybe these big tech companies won't need to spend as much on AI, and it goes down. The actual thing that happened is much more complex, where there's social factors, where there's the rise in the App Store, the social contagion that is happening. And then I think some of it is just, like, I don't trade, I don't know anything about financial markets, but it builds up over the weekend, with the social pressure. Where, if it was during the week, there were multiple days of trading when this was really becoming a thing. But it comes on the weekend, and then everybody wants to sell. And that is a social contagion.

Dylan Patel

I think there were a lot of false narratives, which is like, hey, these guys are spending billions on models, right? And they're not spending billions on models. No one spent more than a billion dollars on a model that's released publicly, right? GPT-4 was a couple hundred million, and then, you know, they've reduced the cost with 4 Turbo, 4o, right? But billion-dollar model runs are coming, right? And this includes pre-training and post-training, right? And then the other number is like, hey, DeepSeek didn't include everything, right? They didn't include, you know, a lot of the cost goes to research and all this sort of stuff. A lot of the cost goes to inference. A lot of the cost goes to post-training. None of these things were factored in. Research salaries, right? All these things are counted in the billions of dollars that OpenAI is spending, but they weren't counted in the, hey, $6 million, $5 million that DeepSeek spent, right? So there's a bit of misunderstanding of what these numbers are. And then there's also an element of... NVIDIA has just been a straight line up, right? And there's been so many different narratives that have been trying to push down NVIDIA. I don't say push down NVIDIA stock. Everyone is looking for a reason to sell or to be worried, right? It was Blackwell delays, right? Their GPU, there's a lot of reports. Every two weeks, there's a new report about their GPUs being delayed. There's the whole thing about scaling laws ending, right? It's so ironic, right? It lasted a month. It was just like, literally just, hey, models aren't getting better, right? They're just not getting better. There's no reason to spend more. Pre-training scaling is dead. And then it's like, o1, o3, right? R1, right? And now it's like, wait, models are progressing too fast. Slow down the progress. Stop spending on GPUs, right? But, you know, the thing I think that comes out of this is, Jevons paradox is true, right? AWS pricing for H100s has gone up over the last couple of weeks, right? Since a little bit after Christmas, since V3 was launched, AWS H100 pricing has gone up. H200s are almost out of stock everywhere, because, you know, the H200 has more memory, and therefore R1, like, you know, wants that chip over the H100, right? We were trying to get GPUs on short notice

Nathan Lambert

This week, for a demo, and it wasn't that easy. We were trying to get just, like, 16 or 32 H100s for a demo, and it was not very easy. So for people who don't know, Jevons paradox is, uh, when, uh, you know,

Lex Fridman

The efficiency goes up somehow magically, counterintuitively, the total resource consumption goes up as well.

Dylan Patel

Right. And semiconductors is, we're at 50 years of Moore's law. Every two years, half the cost, double the transistors, just like clockwork. And it's slowed down, obviously, but, like, the semiconductor industry has gone up the whole time, right? It's been wavy, right? There's obviously cycles and stuff, and I don't expect AI to be any different, right? There's going to be ebbs and flows. But this is, in AI, it's just playing out at an insane timescale, right? It was 2x every two years; this is 1,200x in, like, three years, right? So it's, like, a scale of improvement that is hard to wrap your head around. Yeah, I was confused because
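The two growth rates being contrasted, annualized, as a quick check on the "hard to wrap your head around" point:

```python
# Moore's law (2x every 2 years) vs the AI trend Dylan cites
# (~1200x in ~3 years), expressed as per-year multipliers.
moore_per_year = 2 ** (1 / 2)        # ~1.41x per year
ai_per_year = 1200 ** (1 / 3)        # ~10.6x per year
print(f"Moore's law: ~{moore_per_year:.2f}x/yr, AI trend: ~{ai_per_year:.1f}x/yr")
```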

[03:36:18] DeepSeek's Serving Capacity Issues

🎧 Play snip - 2min (03:34:44 - 03:36:18)

📚 Transcript
Dylan Patel

And the serving part is really critical, right? DeepSeek cannot serve their model today, right? It's completely out of inventory. It's already started falling in the App Store, actually, in downloads, because you download it, you try and sign up, they say, we're not taking registrations, because they have no capacity, right? You open it up, you get, like, less than five tokens per second, if you even get your request approved, right? Because there's just no capacity, because they just don't have enough GPUs to serve the model, even though it's incredibly efficient.

Lex Fridman

It'd be fascinating to watch the smuggling, because I mean, there's drug smuggling, right? That's a market. There's weapons smuggling. And GPUs will surpass that at some point.


[03:40:00] Espionage vs. Idea Flow

🎧 Play snip - 5min (03:35:20 - 03:40:00)

📚 Transcript
Nathan Lambert

Chips are probably the highest value per kilogram, by far. I have another question for you, Dylan. Do you track model API access internationally? How easy is it for Chinese companies to use hosted model APIs from the US?

Dylan Patel

Yeah, I mean, that's incredibly easy, right? Like, OpenAI publicly stated DeepSeek uses their API, and, as they say, they have evidence, right? And this is another element of the training regime: people at OpenAI have claimed that it's a distilled model, i.e., you're taking OpenAI's model, you're generating a lot of output, and then you're training on that output in their model. And even if that's the case, what they did is still amazing, by the way, what DeepSeek did, efficiency-wise.

Nathan Lambert

Distillation is standard practice in industry. If you're at a closed lab where you care about terms of service and IP closely, you distill from your own models. If you are a researcher and you're not building any products, you distill from the OpenAI models.

Lex Fridman

Is a good opportunity. Can you explain big picture distillation as a process? What is distillation? What's the process of distillation?

Nathan Lambert

We've talked a lot about training language models. They are trained on text. In post-training, you're trying to train on very high-quality text that you want the model to match the features of, or, if you're using RL, you're letting the model find its own thing. But for supervised fine-tuning, for preference data, you need to have some completions, what the model is trying to learn to imitate. And what you do there is, instead of human data, or instead of the model you're currently training, you take completions from a different, normally more powerful, model. I think there's rumors that these big models that people are waiting for, these GPT-5s of the world, the Claude 3 Opuses of the world, are used internally to do this distillation process. There's also public examples, right?

Dylan Patel

Like Meta explicitly stated, not necessarily distilling, but they used 405B as a reward model for 70B in their Llama 3.2 and 3.3.

Nathan Lambert

This is all the same topic.
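A minimal sketch of the distillation-as-SFT recipe Nathan describes: sample completions from a stronger teacher model, then fine-tune a student on those prompt-completion pairs with ordinary supervised cross-entropy. Model names here are placeholders, and this is an illustration of the general idea, not anyone's actual pipeline.

```python
# Sketch of distillation via sampled completions. "teacher-model" is a
# placeholder name, not a real checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("teacher-model")            # placeholder
teacher = AutoModelForCausalLM.from_pretrained("teacher-model")  # placeholder

pairs = []
for prompt in ["Explain KV caching.", "Why do fabs cost $30B?"]:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = teacher.generate(ids, max_new_tokens=256)
    completion = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    pairs.append({"prompt": prompt, "completion": completion})

# The student is then fine-tuned on `pairs` with a standard supervised
# (next-token cross-entropy) objective -- plain SFT, nothing exotic.
```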

Lex Fridman

So is this ethical? Is this legal? Like, why does that Financial Times article headline say, OpenAI says there's evidence that China's DeepSeek used its model to train competitor?

Nathan Lambert

This has a long history, at least on the academic side and research side, because you're trying to interpret OpenAI's rule. OpenAI's terms of service say that you cannot build a competitor with outputs from their model. Terms of service are different than a license, which is essentially a contract between organizations. So if you have a terms of service on OpenAI's account, if I violate it, OpenAI can cancel my account. This is very different than, like, a license that says how you could use a downstream artifact. So a lot of it hinges on a word that is very unclear in the AI space, which is: what is a competitor?

Dylan Patel

And then the ethical aspect of it is like, why is it unethical for me to train on your model when you can train on the internet's text? Yeah.

Lex Fridman

Right. So there's a bit of a hypocrisy, because OpenAI, and potentially most of the companies, trained on the internet's text without permission.

Nathan Lambert

There's also a clear loophole, which is that I generate data from OpenAI, and then I upload it somewhere, and then somebody else trains on it, and the link has been broken. Like, they're not under the same terms of service contract. This is why...

Dylan Patel

There's a lot of, like, to-be-discovered details that don't make a lot of sense. This is why a lot of models today, even if they train on zero OpenAI data, you ask the model who trained you, it'll say, I am ChatGPT, trained by OpenAI. Because there's so much copy-paste of, like, OpenAI outputs on the internet that you just weren't able to filter it out. And there was nothing in the RL, or post-training, or SFT, whatever, that says, hey, I'm actually a model made by the Allen Institute instead. We have to do this if we serve a demo.

Nathan Lambert

We do research, and we use OpenAI APIs, because it's useful and we want to understand post-training. And, like, our research models, they will say they're written by OpenAI unless we put in the system prompt that we talked about, that, like, I am Tulu, I am a language model trained by the Allen Institute for AI. And if you ask more people around industry, especially with post-training, it's a very doable task to make the model say who it is, or to suppress the OpenAI thing. So on some level, it might be that DeepSeek didn't care that it was saying that it was by OpenAI. Like, if you're going to upload model weights, it doesn't really matter, because anyone that's serving it in an application, and cares a lot about serving, is going to, when serving it, if they're using it for a specific task, tailor it to that. And it doesn't matter that it's saying it's ChatGPT.

Lex Fridman

Oh, I guess one of the ways to do that is like a system prompt or something like that. If you're serving it to say that you're... That's what we do.

Nathan Lambert

If we host the demo, you say: you are Tulu 3, a language model trained by the Allen Institute for AI. We also benefit from OpenAI data, because it's a great research tool.

[03:42:44] Training AI Models Legally

🎧 Play snip - 1min (03:41:57 - 03:42:42)

📚 Transcript
Dylan Patel

I agree. I have a schizo take on how you can solve this, because it already works. I have a reasonable take on it. All right. All right. So, you know, A, Japan has a law where you're allowed to train on any training data, and copyrights don't apply if you want to train a model. B, Japan has nine gigawatts of curtailed nuclear power. C, Japan is allowed under the AI diffusion rule to import as many GPUs as they'd like. So all we have to do, we have a market here to make: we build massive data centers, we rent them to the labs, and then we train models in a legally permissible way, and there's no ifs, ands, or buts. And now the models have no, like, potential copyright lawsuit from the New York Times or anything like that. No, no, it's just, like, completely legal. So, so genius. The early

[03:56:11] XAI's Mega Cluster

🎧 Play snip - 10min (03:46:07 - 03:56:11)

📚 Transcript
Lex Fridman

Can you talk about the build outs for each one that stand out?

Dylan Patel

Yeah. So I think the thing that's really important about these mega-cluster build-outs is they're completely unprecedented in scale, right? US data center power consumption has been slowly on the rise, and it's gone up to 2%, 3%, even through the cloud computing revolution, right? Data center consumption as a percentage of total US power. And that's been over decades, right, of data centers, et cetera. It's been climbing, climbing slowly. But now, 2% to 3%. Now, by the end of this decade, it's like, even when I say, like, 10%, a lot of people that are traditional data center people are like, that's nuts, by like 2028, 2030. But then, like, people who are in, like, AI, who have, like, really looked at this, at, like, the Anthropics and OpenAIs, are like, that's not enough. And I'm like, okay. But, like, you know, this is both globally distributed, or distributed throughout the US, as well as, like, centralized clusters, right? The distributed throughout the US is exciting, and it's the bulk of it, right? Like, hey, you know, OpenAI or, you know, say, Meta is adding a gigawatt, right? But most of it is distributed through the US for inference and all these other things, right?

Lex Fridman

So maybe we should lay out what a cluster is. So, you know, does this include AWS? Maybe it's good to talk about the different kinds of clusters and what you mean by mega clusters and what's a GPU and what's a computer and what is... That far back, but yeah. So like, what do we mean by the clusters?

Dylan Patel

I thought I was about to do the Apple ad, right? What's a computer? So, traditionally, data centers and data center tasks have been a distributed systems problem that is capable of being spread very far and widely, right? I.e., I send a request to Google, it gets routed to a data center somewhat close to me, it does whatever search, ranking, recommendation, sends a result back, right? The nature of the task is changing rapidly, in that there's two tasks that people are really focused on now, right? It's not database access. It's not serve me the right page, serve me the right ad. It's now inference, and inference is dramatically different from traditional distributed systems, but it looks a lot more similar. And then there's training, right? The inference side is still like, hey, I'm going to put, you know, thousands of GPUs in blocks all around these data centers, I'm going to run models on them. A user submits a request, it gets kicked off. Or, hey, they submit a request to my service. They're on Word, and they're like, oh yeah, help me, Copilot, and it kicks it off. Or I'm on my Windows, Copilot, whatever, Apple Intelligence, whatever it is, it gets kicked off to a data center. And that data center does some work and sends it back. That's inference. That is going to be the bulk of compute. And, you know, there's thousands of data centers that we're tracking with, like, satellites and, like, all these other things. And those are the bulk of what's being built. But the scale of the largest cluster is also really important, right? When we look back at history, right, like, you know, or through the age of AI, right? It was a really big deal when they did AlexNet on, I think, two GPUs or four GPUs? I don't remember. It was a really big deal. It's a big deal because you used GPUs, and they used multiple, right? But then over time, the scale has just been compounding, right? And so when you skip forward to GPT-3, then GPT-4: GPT-4, 20,000 A100 GPUs, an unprecedented run, right, in terms of the size and the cost, right? A couple hundred million dollars on a YOLO run for GPT-4. And it yielded this magical improvement that was perfectly in line with what was experimented, just on, like, a log scale, right? Oh, they have that plot from the technical report. The scaling laws were perfect, right? But that's not a crazy number, right? 20,000 A100s. Roughly, each GPU is consuming 400 watts. And then when you add in the whole server, everything, it's like 15 to 20 megawatts of power, right? You know, maybe you could look up what the power consumption of a human person is, because the numbers are going to get silly. But, like, 15 to 20 megawatts was standard data center size. It was just unprecedented that that was all GPUs running one task. How many watts is a toaster? A toaster is a good example, a similar power consumption to an A100, right? The H100 comes around; they increase the power from, like, 400 to 700 watts, and that's just per GPU. And then there's all the associated stuff around it. So once you count all that, it's roughly like 1,200 to 1,400 watts for everything: networking, CPUs, memory, blah, blah, blah.

Lex Fridman

So we should also say, what's required? You said power, so a lot of power is required. A lot of heat is generated, so cooling is required. And because there's a lot of GPUs, or CPUs or whatever, they have to be connected, so there's a lot of networking.

Dylan Patel

Yeah. Yeah. So I think, yeah, sorry for skipping past that. And then the data center itself is complicated, right? But these are still standardized data centers for GPT-4 scale, right? Now we step forward to what is the scale of clusters that people built last year, right? And it ranges widely, right? It ranges from, hey, these are standard data centers and we're just using multiple of them and connecting them together with a ton of fiber between them, a lot of networking, et cetera. That's what OpenAI and Microsoft did in Arizona, right? And so they have 100,000 GPUs, right? Meta, similar thing. They took their standard existing data center design, and it looks like an H, and they connected multiple of them together. And, you know, they first did 16,000 GPUs, 24,000 GPUs total, only 16,000 of them were running on the training run, because GPUs are very unreliable, so they need to have spares to swap in and out, all the way to now 100,000 GPUs that they're training Llama 4 on currently, right? Like 128,000 or so, right? Think about 100,000 GPUs, um, with roughly 1,400 watts apiece. That's 140 megawatts, 150 megawatts, right, for 128,000, right? So you've jumped from 15 to 20 megawatts to almost 10x that number, 9x that number, to 150 megawatts, in two years, right? From 2022 to 2024, right? And some people, like Elon, he admittedly, right, and he says so himself, got into the game a little bit late for pre-training large language models, right? xAI was started later, right? But then he bent heaven and hell to get his data center up and get the largest cluster in the world, right? Which is 200,000 GPUs. And he did that. He bought a factory in Memphis. He's upgrading the substation, but at the same time, he's got a bunch of mobile power generation, a bunch of single-cycle gas turbines. He tapped the natural gas line that's right next to the factory, and he's just pulling a ton of gas, burning gas. He's generating all this power. He's in an old appliance factory that shut down and moved to China long ago, right? And he's got 200,000 GPUs in it. And now what's the next scale, right? All the hyperscalers have done this. Now the next scale is something that's even bigger, right? And so, you know, Elon, just to
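The power arithmetic in this passage, written out, using the all-in per-GPU watts quoted in the conversation (the A100 all-in figure is an assumption consistent with the quoted 15 to 20 MW; facility cooling overhead is ignored):

```python
# Cluster power from GPU count and all-in watts per GPU (GPU + CPU +
# networking + memory), as discussed above; cooling/PUE overhead ignored.
def cluster_megawatts(n_gpus: int, watts_all_in: float) -> float:
    return n_gpus * watts_all_in / 1e6

# GPT-4 era: 20k A100s at an assumed ~900 W all-in -> the quoted 15-20 MW.
print(f"{cluster_megawatts(20_000, 900):.0f} MW")
# H100 era: 100k GPUs at the quoted ~1,400 W all-in -> ~140 MW.
print(f"{cluster_megawatts(100_000, 1_400):.0f} MW")
```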