DeepSeek V3: Affordable Open-Source AI
Key Points
- A new GPT‑4‑class language model called DeepSeek V3 can be built, maintained, and run for roughly $5 million, about an order of magnitude cheaper than the $70‑$100 million cost of models like ChatGPT or Claude.
- The model’s creators open‑sourced the architecture and training pipeline, enabling startups and individual researchers to replicate or improve upon it.
- Instead of ingesting the entire internet, DeepSeek V3 was trained on a carefully curated, high‑quality corpus covering English, Chinese, math, and code, with extensive human‑in‑the‑loop reinforcement for accuracy.
- Although the full network contains 671 billion parameters, the system only activates about 37 billion of them per token, picking an efficient “sliver” of the model to generate responses.
- Leveraging confidence from its curated data, the model predicts multiple tokens ahead (e.g., two tokens at a time), further improving inference speed and computational efficiency.
Source: [https://www.youtube.com/watch?v=QMuwRymNMuw](https://www.youtube.com/watch?v=QMuwRymNMuw)
Duration: 00:05:26
Sections
- [00:00:00](https://www.youtube.com/watch?v=QMuwRymNMuw&t=0s) Low-Cost Open-Source LLM Breakthrough: The speaker highlights DeepSeek V3, an open-source language model trained on a curated high-quality dataset for roughly $5 million (far cheaper than ChatGPT-scale models) and argues it opens the door for startups to build their own competitive AI systems.
Full Transcript
What if I told you that there was a GPT-4-class model out there that was 10 times cheaper to build, maintain, and execute on? ChatGPT-4 set the bar for models in 2024. It's since been surpassed by inference-time compute models like o1, o1 Pro, and o3, but it's still really good at a lot of different things: it's good at English, it's good at coding, it's good at math, etc. Well, there's now a new model. Instead of costing the $70 or $100 million that ChatGPT cost to train (similar cost for Claude), this model only cost $5 million, maybe five and a half. That is not that much. It's a lot for an individual, but a lot of startups have $5 million.
And it's really amazing to see a world where we could actually envision individual startups being able to build their own models. This is something that the makers of this model have chosen to open source, so it's something anybody can look at and ask: how could I make it even better, or how could I do it myself? They've done a number of really interesting things throughout the model build process that they are revealing to the world, and I just want to highlight a couple; I'll share the paper in the description here. This is DeepSeek V3, and it's very cool.
The training data was something they took a lot of care with. They did not do the suck-up-the-whole-internet vibe that ChatGPT did; they had a very specific training corpus of very high-quality tokens that they trained against, and they really reviewed it to make sure it was good at English, good at Chinese, good at math, and good at coding. Then they reinforced that carefully with human responses to ensure it was really accurate. That gave them a lot of confidence at query time: when you type in a query, it gave them confidence to actually predict more tokens ahead and be more efficient in their use of space.
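As a concrete illustration of what curating for high-quality tokens can mean in practice, here is a minimal Python sketch of heuristic document filtering. The heuristics, thresholds, and function names are illustrative assumptions, not DeepSeek's actual data pipeline.

```python
# Toy quality filter over a document corpus. The heuristics and thresholds
# below are illustrative assumptions, not DeepSeek's actual curation pipeline.

def looks_high_quality(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                                   # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:                # highly repetitive text
        return False
    alnum_ratio = sum(c.isalnum() for c in doc) / len(doc)
    return alnum_ratio > 0.6                              # mostly prose, not markup debris

def curate(corpus: list[str]) -> list[str]:
    """Keep only documents that pass every heuristic."""
    return [doc for doc in corpus if looks_high_quality(doc)]

sample = [" ".join(f"token{i}" for i in range(100)), "!!! ### <<<", "short text"]
print(len(curate(sample)))  # 1 of 3 documents survives
```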
So even though this is a very large model (it's 671 billion parameters, large for a GPT-4-class model), it's not something you would expect to be this efficient, is the way I'll put it. But they have figured out that you can use just a sliver of that total model space in the response, and it's about picking the right sliver. Where other models like Meta's Llama, or Claude, or ChatGPT use the whole model space, this model only uses 37 billion parameters out of the 671-billion-parameter model for any given response. So it's about picking the right 37 billion, which sounds like a lot of parameters until you realize it's such a tiny percentage of the total in the model, and they're actually making very efficient use of compute.
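The "sliver" mechanism being described is a mixture-of-experts layer: a small router scores many expert sub-networks for each token, and only the top-scoring few actually run, so most of the parameters stay idle per token. Here is a minimal NumPy sketch of top-k expert routing; the dimensions and expert count are tiny stand-ins, and the gating details differ from DeepSeek V3's actual design.

```python
import numpy as np

# Minimal mixture-of-experts routing sketch (toy sizes, not DeepSeek's).
# A router scores every expert for a token, but only the top_k experts run,
# so most of the model's parameters stay inactive for any given token.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))             # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router_w                                    # one score per expert
    top = np.argsort(scores)[-top_k:]                        # indices of best experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over chosen experts
    # Only top_k of n_experts weight matrices are used: the active "sliver".
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (16,)
```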
They're also able to predict more than one token ahead, because they're so confident in their training data. So instead of predicting only one token ahead, they're predicting two, and that's a really cool innovation; I expect to see other folks go after that as well.
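A minimal sketch of the multi-token prediction idea: a shared trunk feeds two output heads, so each forward pass proposes the next two tokens instead of one. This is a simplification (DeepSeek V3's multi-token prediction modules are more elaborate), and all sizes here are toy assumptions.

```python
import numpy as np

# Multi-token prediction sketch: one shared trunk, two output heads, so a
# single step proposes the next TWO tokens instead of one. Toy sizes only.

rng = np.random.default_rng(0)
d_model, vocab = 16, 100
trunk = rng.normal(size=(d_model, d_model))
head_next = rng.normal(size=(d_model, vocab))    # predicts token t+1
head_after = rng.normal(size=(d_model, vocab))   # predicts token t+2

def predict_two(hidden: np.ndarray) -> tuple[int, int]:
    h = np.tanh(hidden @ trunk)                  # shared trunk computation
    return int((h @ head_next).argmax()), int((h @ head_after).argmax())

tok1, tok2 = predict_two(rng.normal(size=d_model))
print(tok1, tok2)  # two tokens proposed from a single forward pass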
Now, they did some other cool things during the training phase. They had something called DualPipe, which I've tried to explain a couple of times on video. It's rather complicated, but it basically amounts to being able to regurgitate and learn at the same time, and they had a special network setup to do that. They outlined it in the paper; I definitely recommend diving in for the details there.
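DualPipe itself is too involved to reproduce here, but the core idea it exploits, doing two kinds of work at once rather than strictly one after the other, can be shown with a toy Python overlap of "communication" and "computation". This is only the overlap concept under stated assumptions, not DualPipe's actual pipeline schedule.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Not DualPipe itself, just the idea it exploits: overlap the time spent
# communicating (shipping activations/gradients between GPUs) with the time
# spent computing, instead of paying for them one after the other.

def compute(batch: int) -> None:
    time.sleep(0.1)        # stand-in for a forward/backward pass

def communicate(batch: int) -> None:
    time.sleep(0.1)        # stand-in for a cross-GPU transfer

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    for b in range(4):
        pending = pool.submit(communicate, b)   # ship previous results...
        compute(b)                              # ...while computing the next batch
        pending.result()
print(f"overlapped: {time.time() - start:.2f}s vs ~0.80s strictly sequential")
```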
But from a strategic perspective, if we step back, what this really means is that models have gone from being something in the hundred-million-dollar class, which only a few startups could ever afford, to something anybody can build if they have startup-level seed investment. That is a massive, massive shift. It is going to make more and more GPT-4-class models available, and it's going to be yet another driver in this overall strategic theme of GPT-4-class intelligence becoming essentially free. They've open-sourced this model; anybody can use it right now, and anybody can replicate it right now.
So if you think about it, we now have a world where GPT-4-class models are becoming free, and the cutting edge is in inference-time compute. These models don't really use the multi-threaded, multi-token prediction that inference-time compute has, where you type in a query and it runs lots and lots of different next-token prediction threads and finds the best one.
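One common form of that "many prediction threads, pick the best one" pattern is best-of-N sampling. Here is a hedged Python sketch; `generate` and `score` are hypothetical placeholders rather than any vendor's API, and real systems typically score candidates with a verifier or reward model.

```python
import random

# Best-of-N sketch of inference-time compute: sample several candidate
# completions, then keep the highest-scoring one. `generate` and `score`
# are hypothetical stand-ins, not any real model's API.

def generate(prompt: str, seed: int) -> str:
    random.seed(seed)
    return f"{prompt} -> candidate answer #{random.randint(0, 999)}"

def score(candidate: str) -> float:
    return random.random()   # real systems use a verifier/reward model here

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n("What is 17 * 24?"))
```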
Now, that may get open-sourced next year; at the pace we're going, we may well see a model like that get open-sourced next year. But for now, that is the model, and that is the edge that ChatGPT has in the space. Nobody else really has that kind of inference-time compute yet, though lots of people are working on it. And the GPT-4-class models, like Claude Sonnet 3.5 or 3.6, like ChatGPT-4o, those are rapidly getting replicated. Cost is driving to zero.
It's a massive achievement. And I will grant you, it is easier to replicate than it is to innovate: getting to the first ChatGPT-4 may well have cost $100 million no matter how you did it, because it was the first time. But replicating it turns out to be very efficient and very affordable, relatively speaking, and that has huge implications, because it means that intelligence is going to be more and more free for a lot of different applications that matter in business. So we will see, but right now a $5 million model is beating 4o and Sonnet 3.5 at a lot of the things that people really use these models for, like English, like math, like coding, etc. So there you have it: DeepSeek V3, the new GPT-4-class model champion. Cheers.