Learning Library

← Back to Library

DeepSeek R1 Challenges OpenAI's o1

Key Points

  • DeepSeek, a Chinese AI startup, surged to the top of the U.S. App Store’s free‑download rankings by releasing an open‑source model that claims to match or surpass leading competitors at a fraction of the cost.
  • Their flagship reasoning model, DeepSeek R1, is designed to perform “chain‑of‑thought” reasoning, visibly breaking problems into steps, back‑tracking, and showing its thought process before delivering an answer.
  • R1 rivals OpenAI’s o1 on math and coding benchmarks, yet DeepSeek reports it is trained on far fewer chips and runs about 96% cheaper than o1.
  • The R1 model builds on a rapid series of releases: DeepSeek V1 (67 billion parameters, January 2024), DeepSeek V2 (236 billion parameters, which introduced multi‑head latent attention and a mixture‑of‑experts design for speed and performance), and DeepSeek V3 (671 billion parameters, released December 2024), culminating in the reasoning‑focused R1.

Full Transcript

# DeepSeek R1 Challenges OpenAI's o1

**Source:** [https://www.youtube.com/watch?v=KTonvXhsxpc](https://www.youtube.com/watch?v=KTonvXhsxpc)
**Duration:** 00:10:09

## Sections

- [00:00:00](https://www.youtube.com/watch?v=KTonvXhsxpc&t=0s) **DeepSeek R1 Challenges OpenAI** - DeepSeek’s open‑source reasoning model R1, now the most‑downloaded free AI app in the US, claims to match or surpass OpenAI’s o1 on math and coding benchmarks while costing about 96% less to run.
- [00:03:05](https://www.youtube.com/watch?v=KTonvXhsxpc&t=185s) **DeepSeek Model Evolution 2024‑2025** - The speaker outlines the rapid scaling of DeepSeek’s models, from the 236‑billion‑parameter V2 with multi‑head latent attention and mixture‑of‑experts, to the 671‑billion‑parameter V3 and R1‑Zero, highlighting reinforcement‑learning fine‑tuning, GPU load balancing, and the resulting performance improvements.
- [00:06:07](https://www.youtube.com/watch?v=KTonvXhsxpc&t=367s) **DeepSeek's Low‑Cost GPU Strategy** - The speaker explains how DeepSeek trains its V3 model using only 2,000 GPUs, far fewer than competitors like Meta's 100,000‑GPU cluster for Llama 4, by leveraging chain‑of‑thought reasoning and reinforcement learning to achieve efficient performance.
- [00:09:17](https://www.youtube.com/watch?v=KTonvXhsxpc&t=557s) **MoE Architecture Cuts AI Costs** - The speaker explains that mixture‑of‑experts models like DeepSeek R1, Mistral, and IBM Granite lower training and inference expenses while delivering top‑tier reasoning performance.

## Full Transcript
0:00Chances are you've heard about the newest entrants to the very crowded and very competitive realm of AI models, 0:07DeepSeek. 0:08It's a startup based in China, 0:11and it caught everyone's attention by taking over OpenAI's 0:15coveted spot for most downloaded free app in the US on Apple's App Store. 0:20So how? 0:21Well, by releasing an open source model that it claims 0:24can match or surpass the performance of other industry leading models 0:29and at a fraction of the cost. 0:32Now, the specific model that's really making a splash from DeepSeek is called DeepSeek R1, 0:42and the R here implies reasoning, 0:47because this is a reasoning model. 0:54DeepSeek R1 is their reasoning model. 0:57Now DeepSeek R1 performs as well as some of the other models, including OpenAI's own reasoning model. 1:05That's called o1, 1:08and it can match or even outperform it across a number of AI benchmarks for math and coding tasks, 1:14which is all the more remarkable because according to DeepSeek, 1:18DeepSeek R1 is trained with far fewer chips and is approximately 96% cheaper to run than o1. 1:29Now, unlike previous AI models which produce an answer without explaining the why, 1:33a reasoning model solves complex problems by breaking them down into steps. 1:40So before answering a user query, the model spends time thinking, 1:45thinking in air quotes here, 1:47and that thinking time could be a few seconds or even minutes. 1:50Now, during this time, the model is performing step by step analysis through a process that is known as chain of thought. 2:03And unlike other reasoning models, R1 shows the user that chain of thought process as it breaks the problem down, 2:11as it generates insights, as it backtracks when it needs to, 2:15and as it ultimately arrives at an answer. 2:19Now I'm going to get into how this model works. 2:21But before that, let's talk about how it came to be. DeepSeek R1 seems to have come out of nowhere.
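The "thinking before answering" behavior described above can be sketched as a toy example. This is purely illustrative, not DeepSeek's implementation: a hypothetical solver that, instead of returning an answer directly, records each intermediate step in a visible trace, the way a chain-of-thought model surfaces its reasoning as text before the final answer.

```python
def solve_with_trace(a: int, b: int, c: int) -> tuple[int, list[str]]:
    """Compute a * b + c, recording the work step by step.

    A real reasoning model generates steps like these as natural-language
    text; this toy version just logs them to a list.
    """
    trace = []
    trace.append(f"Step 1: multiply {a} by {b}")
    product = a * b
    trace.append(f"Step 2: intermediate result is {product}")
    trace.append(f"Step 3: add {c}")
    answer = product + c
    trace.append(f"Step 4: final answer is {answer}")
    return answer, trace

answer, trace = solve_with_trace(7, 8, 5)
for step in trace:
    print(step)       # the visible "thinking" shown before the answer
print("Answer:", answer)  # Answer: 61
```

The point of the sketch is only the interface: the consumer sees the intermediate steps, not just the result, which is what distinguishes a reasoning model's output from a direct-answer model's.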
2:29But there are in fact many DeepSeek models that brought us to this point, a model avalanche, if you like. 2:36And my colleague Aaron can help dig us out. 2:40Well, thanks, Martin. 2:41There is certainly a lot to dig out here. 2:43There's a lot of these models. 2:44But let's start from the very top in the beginning of all this. 2:47So we begin with, let's say, DeepSeek version one, 2:51which is a 67 billion parameter model that was released in January of 2024. 2:56Now, this is a traditional transformer with a focus on the feedforward neural networks. 3:01This gets us down into DeepSeek version two, which really put this on the map. 3:05This is a very large 236 billion parameter model that was released not that far from the original, in June 2024. 3:13But to put this into perspective, there are really two novel aspects around this model. 3:17The first one 3:18was multi-head latent attention. 3:21And the second aspect was the DeepSeek mixture of experts. 3:24It just made the model really fast and performant. 3:27And it set us up for success for DeepSeek version three, which was released December of 2024. 3:33Now, this one is even bigger. 3:34It's 671 billion parameters. 3:37But this is where we began to see the introduction of using reinforcement learning with that model, 3:43and some other contributions that this model had is it was able to balance load across many GPUs, 3:48because they used a lot of H800s within their infrastructure, and it was also built on top of DeepSeek v2. 3:56So all these models accumulate and build on top of each other, which gets us down into DeepSeek R1-Zero, 4:02which was released in January of 2025. 4:05So this is the first of the reasoning models now, right? 4:08It is, yeah. 4:09And it's really neat how they began to train, you know, these types of models, right? 4:13So it's a type of fine tuning.
4:15But on this one, they exclusively use reinforcement learning, 4:19which is a way where you have policies and you want to reward or you want to penalize 4:23the model for some action it has taken or output it has produced, and it learns over time, 4:29and it was very performant. 4:30It did well, but it got even better with DeepSeek R1, right, which was again built on top of R1-Zero, 4:38and this one used a combination of reinforcement learning 4:41and supervised fine tuning, the best of both worlds, so that it could be even better, 4:46and its performance is very close on many standard benchmarks to some of these OpenAI models we have now. 4:52And this gets us down into now distilled models, which is like a whole other paradigm. 4:56Distilled models. 4:57Okay, so tell me what that is all about. 5:00Yeah, great question and comment. 5:02So first of all, a distilled model is where you have a student model, which is a very small model, 5:09and you have the teacher model, which is very big, and you want to distill 5:13or extract knowledge from the teacher model down into the student model, 5:17and in some aspects you could think of it as model compression. 5:21But one interesting aspect around this is this is not just compression or transferring knowledge, 5:25but it's model translation because we're going from the R1-Zero, right? Which is 5:30one of those mixture of expert models, down into, for example, a Llama series model, 5:36which is not a mixture of experts, but a traditional transformer, right? 5:40So, so you're going from one architecture type to another. 5:43And we do the same with Qwen, right? 5:45So there's different series of models that are the foundation that we then distill into from the R1-Zero. 5:52Well, thanks. 5:53It's really interesting to get the history behind all this. 5:55It didn't come from nowhere, 5:58but with all of these distilled models coming, I think you might need your shovel back to dig your way out of those.
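The teacher-student distillation Aaron describes is commonly trained by matching the student's output distribution to the teacher's softened outputs. Below is a minimal sketch of that kind of loss, assuming the standard soft-target formulation (softmax with a temperature, then KL divergence); it is illustrative only, not DeepSeek's training code, and the logit values are made up.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL divergence from the student's softened distribution to the
    teacher's. Zero when the student exactly matches the teacher."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
student_close = [3.8, 1.1, 0.4]  # student that mimics the teacher well
student_far = [0.1, 3.0, 0.2]    # student that disagrees with the teacher
# Training pushes the student toward the lower-loss (teacher-like) outputs.
assert distillation_loss(teacher, student_close) < distillation_loss(teacher, student_far)
```

In practice this soft-target loss is usually combined with the ordinary hard-label loss, and the "model translation" Aaron mentions (MoE teacher into a dense Llama- or Qwen-style student) works precisely because the loss only compares output distributions, not internal architectures.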
6:04Thank you very much. 6:04There's going to be a lot of distilled models. 6:06So you're exactly right. 6:07I think you're going to go dig. 6:09Thanks. 6:09So R1 didn't come from nowhere; it's an evolution of other models, but 6:15how does DeepSeek operate at such comparatively low cost? 6:19Well, by using a fraction of the highly specialized Nvidia chips used by their American competitors to train their systems. 6:29If I can illustrate this in a graph. 6:31So if we consider different types of model and then the number of GPUs that they use. 6:39Well, DeepSeek engineers, for example, they said that they only need 2,000 6:45GPUs, that's graphics processing units, to train the DeepSeek V3 model, DeepSeek V3. 6:57Now in isolation, 6:59what does that mean? 6:59Is that good? Is that bad? 7:01Well, by contrast, Meta said that the company was training its latest open source model. 7:07That's Llama 4, and they are using a computer cluster with over 100,000 Nvidia GPUs. 7:19So that brings up the question of how is it so efficient? 7:23Well, DeepSeek R1 combines chain of thought reasoning with a process called reinforcement learning. 7:33This is a capability that Aaron mentioned just now, which arrived with the V3 model of DeepSeek. 7:39And here an autonomous agent learns to perform a task through 7:44trial and error without any instructions from a human user. 7:48Now, traditionally, models will improve their ability to reason 7:53by being trained on labeled examples of correct or incorrect behavior. 7:57That's known as supervised learning, or by 8:00extracting information from hidden patterns, that's known as unsupervised learning. 8:04But the key hypothesis here with reinforcement learning is to reward the model for correctness, 8:12no matter how it arrived at the right answer, and let the model discover the best way to think all on its own.
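The reward-for-correctness idea can be made concrete with a toy policy-gradient learner. In this sketch, the "model" chooses between two candidate strategies for a task, receives a reward only when its final answer is correct, and shifts probability toward whichever strategy earns reward. The task, strategies, and update rule are all invented for illustration; real RL fine-tuning of a language model is far more involved.

```python
import math
import random

def correct_answer(x: int) -> int:
    """Ground truth for the toy task: squaring a number."""
    return x * x

# Two candidate "ways of thinking"; only the reward signal tells the
# learner which one actually produces correct answers.
STRATEGIES = [lambda x: x * x,   # strategy 0: correct
              lambda x: 2 * x]   # strategy 1: incorrect

def train(steps: int = 500, lr: float = 0.5, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # learned preference score for each strategy
    for _ in range(steps):
        # softmax over preferences -> probability of picking each strategy
        exps = [math.exp(p) for p in prefs]
        total = sum(exps)
        probs = [e / total for e in exps]
        i = 0 if rng.random() < probs[0] else 1  # sample a strategy
        x = rng.randint(2, 9)
        # reward depends ONLY on whether the final answer is correct,
        # not on how the strategy arrived at it
        reward = 1.0 if STRATEGIES[i](x) == correct_answer(x) else 0.0
        # policy-gradient-style update: raise the chosen strategy's
        # preference when it was rewarded, lower the alternatives
        for j in range(2):
            grad = (1.0 if j == i else 0.0) - probs[j]
            prefs[j] += lr * reward * grad
    return prefs

prefs = train()
assert prefs[0] > prefs[1]  # the correct strategy wins out over training
```

Since only correct answers are rewarded, the learner discovers the right strategy without ever being shown labeled examples of it, which is the contrast with supervised learning drawn in the transcript.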
8:21Now DeepSeek R1 also uses a mixture of experts architecture, or MoE, 8:28and a mixture of experts architecture is considerably less resource intensive to train. 8:35Now the MoE architecture divides an AI model up into separate entities or 8:41sub networks, which we can think of as being individual experts. 8:46So in my little neural network here, I'm going to create three 8:52experts, and a real MoE architecture would probably have quite a bit more than that. 8:57But each one of these is specialized in a subset of the input data, 9:02and the model only activates the specific experts needed for a given task. So a request comes in. 9:09We activate the experts that we need and we only use those rather than activating the entire neural network. 9:17So consequently, the MoE architecture reduces computational costs 9:21during pre-training and achieves faster performance during inference time, 9:25and look, the MoE architecture isn't unique to models from DeepSeek. 9:30There are models from the French AI company Mistral that also use this, 9:36and in fact the IBM Granite model is also built on a mixture of experts architecture. 9:45So it's a commonly used architecture. 9:48So that is DeepSeek R1. 9:51It's an AI reasoning model that is matching other industry leading models on reasoning benchmarks, 9:57while being delivered at a fraction of the cost in both training and inference. 10:03All of which makes me think this is an exciting time for AI reasoning models.
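The routing behavior described above, a gate scores every expert but only the top-scoring few actually run, can be sketched in a few lines. This is a toy with three made-up "experts" and hand-picked gate scores, not a real MoE layer: in an actual transformer the experts are sub-networks, routing happens per token, and the gate is trained jointly with the experts.

```python
import math

# Three toy "experts", each a stand-in for a specialized sub-network.
EXPERTS = [
    lambda x: x + 1.0,   # expert 0
    lambda x: x * 2.0,   # expert 1
    lambda x: x ** 2.0,  # expert 2
]

def moe_forward(x: float, gate_scores: list[float], k: int = 2) -> float:
    """Run only the k highest-scoring experts and mix their outputs.

    Experts outside the top-k are never evaluated, which is where the
    compute savings during training and inference come from.
    """
    # indices of the top-k gate scores
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # softmax over ONLY the selected experts' scores
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    # weighted mix of just the activated experts' outputs
    return sum((e / total) * EXPERTS[i](x) for e, i in zip(exps, top))

# With these scores, experts 1 and 2 are activated; expert 0 never runs.
out = moe_forward(3.0, gate_scores=[0.1, 2.0, 1.0], k=2)
```

With k=1 the layer reduces to running the single highest-scoring expert, and the output lies between the activated experts' individual outputs for any larger k, which is the sense in which only "the experts that we need" contribute.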