DeepSeek V3: Affordable Open-Source AI
Key Points
- A new GPT‑4‑class language model called DeepSeek V3 can be built, maintained, and run for roughly $5 million, about an order of magnitude cheaper than the $70‑$100 million cost of models like ChatGPT or Claude.
- The model’s creators open‑sourced the architecture and training pipeline, enabling startups and individual researchers to replicate or improve upon it.
- Instead of ingesting the entire internet, DeepSeek V3 was trained on a carefully curated, high‑quality corpus covering English, Chinese, math, and code, with extensive human‑in‑the‑loop reinforcement for accuracy.
- Although the full network contains 671 billion parameters, the system only activates about 37 billion of them per token, picking an efficient “sliver” of the model to generate responses.
- Leveraging confidence from its curated data, the model predicts multiple tokens ahead (e.g., two tokens at a time), further improving inference speed and computational efficiency.
Source: [https://www.youtube.com/watch?v=QMuwRymNMuw](https://www.youtube.com/watch?v=QMuwRymNMuw)
Duration: 00:05:26
Sections
- [00:00:00](https://www.youtube.com/watch?v=QMuwRymNMuw&t=0s) Low-Cost Open-Source LLM Breakthrough: The speaker highlights DeepSeek V3, an open-source language model trained on a curated high-quality dataset for roughly $5 million (far cheaper than ChatGPT-scale models) and argues it opens the door for startups to build their own competitive AI systems.
Full Transcript
What if I told you that there was a GPT-4-class model out there that was 10 times cheaper to build, maintain, and execute on? ChatGPT-4 set the bar for models in 2024. It's since been surpassed by inference-time compute models like o1, o1 Pro, and o3, but it's still really good at a lot of different things: it's good at English, it's good at coding, it's good at math, etc. Well, there's now a new model. Instead of costing the $70 or $100 million that ChatGPT cost to train (similar cost for Claude), this model only cost $5 million, maybe five and a half. That is not that much. It's a lot for an individual, but a lot of startups have $5 million.
And it's really amazing to see a world where we could actually envision individual startups being able to build their own models. This is something that the makers of this model have chosen to open source, so it's something anybody can look at and ask: how could I make it even better, or how could I do it myself? They've done a number of really interesting things throughout the model build process that they are revealing to the world, and I just want to highlight a couple; I'll share the paper in the description here. This is DeepSeek V3, and it's very cool.
The training data was something they took a lot of care with. They did not do the suck-up-the-whole-internet vibe that ChatGPT did; they had a very specific training corpus of very high-quality tokens that they trained against, and they really reviewed it to make sure it was good at English, good at Chinese, good at math, and good at coding. Then they reinforced that carefully with human responses to ensure it was really accurate. That gave them a lot of confidence at query time: when you type in a query, it gave them confidence to actually predict more tokens ahead and be more efficient in their use of space.
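As a concrete illustration of what curating for high-quality tokens can mean in practice, here is a minimal Python sketch of heuristic document filtering. The heuristics, thresholds, and function names are illustrative assumptions, not DeepSeek's actual data pipeline.

```python
# Toy quality filter over a document corpus. The heuristics and thresholds
# below are illustrative assumptions, not DeepSeek's actual curation pipeline.

def looks_high_quality(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                                   # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:                # highly repetitive text
        return False
    alnum_ratio = sum(c.isalnum() for c in doc) / len(doc)
    return alnum_ratio > 0.6                              # mostly prose, not markup debris

def curate(corpus: list[str]) -> list[str]:
    """Keep only documents that pass every heuristic."""
    return [doc for doc in corpus if looks_high_quality(doc)]

sample = [" ".join(f"token{i}" for i in range(100)), "!!! ### <<<", "short text"]
print(len(curate(sample)))  # 1 of 3 documents survives
```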
So even though this is a very large model (it's 671 billion parameters, large for a GPT-4-class model), it's not something you would expect to be this efficient, is the way I'll put it. But they have figured out that you can use just a sliver of that total model space in the response, and it's about picking the right sliver. Where other models like Meta's Llama, or Claude, or ChatGPT use the whole model space, this model only uses 37 billion parameters out of the 671-billion-parameter model for any given response. So it's about picking the right 37 billion, which sounds like a lot of parameters until you realize it's such a tiny percentage of the total in the model, and they're actually making very efficient use of compute.
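The "sliver" mechanism being described is a mixture-of-experts layer: a small router scores many expert sub-networks for each token, and only the top-scoring few actually run, so most of the parameters stay idle per token. Here is a minimal NumPy sketch of top-k expert routing; the dimensions and expert count are tiny stand-ins, and the gating details differ from DeepSeek V3's actual design.

```python
import numpy as np

# Minimal mixture-of-experts routing sketch (toy sizes, not DeepSeek's).
# A router scores every expert for a token, but only the top_k experts run,
# so most of the model's parameters stay inactive for any given token.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))             # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    scores = x @ router_w                                    # one score per expert
    top = np.argsort(scores)[-top_k:]                        # indices of best experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over chosen experts
    # Only top_k of n_experts weight matrices are used: the active "sliver".
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (16,)
```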
They're also able to predict more than one token ahead, because they're so confident in their training data. So instead of predicting only one token ahead, they're predicting two, and that's a really cool innovation; I expect to see other folks go after that as well.
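A minimal sketch of the multi-token prediction idea: a shared trunk feeds two output heads, so each forward pass proposes the next two tokens instead of one. This is a simplification (DeepSeek V3's multi-token prediction modules are more elaborate), and all sizes here are toy assumptions.

```python
import numpy as np

# Multi-token prediction sketch: one shared trunk, two output heads, so a
# single step proposes the next TWO tokens instead of one. Toy sizes only.

rng = np.random.default_rng(0)
d_model, vocab = 16, 100
trunk = rng.normal(size=(d_model, d_model))
head_next = rng.normal(size=(d_model, vocab))    # predicts token t+1
head_after = rng.normal(size=(d_model, vocab))   # predicts token t+2

def predict_two(hidden: np.ndarray) -> tuple[int, int]:
    h = np.tanh(hidden @ trunk)                  # shared trunk computation
    return int((h @ head_next).argmax()), int((h @ head_after).argmax())

tok1, tok2 = predict_two(rng.normal(size=d_model))
print(tok1, tok2)  # two tokens proposed from a single forward pass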
Now, they did some other cool things during the training phase. They had something called DualPipe, which I've tried to explain a couple of times on video. It's rather complicated, but it basically amounts to being able to regurgitate and learn at the same time, and they had a special network setup to do that. They outlined it in the paper; I definitely recommend diving in for the details there.
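DualPipe itself is too involved to reproduce here, but the core idea it exploits, doing two kinds of work at once rather than strictly one after the other, can be shown with a toy Python overlap of "communication" and "computation". This is only the overlap concept under stated assumptions, not DualPipe's actual pipeline schedule.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Not DualPipe itself, just the idea it exploits: overlap the time spent
# communicating (shipping activations/gradients between GPUs) with the time
# spent computing, instead of paying for them one after the other.

def compute(batch: int) -> None:
    time.sleep(0.1)        # stand-in for a forward/backward pass

def communicate(batch: int) -> None:
    time.sleep(0.1)        # stand-in for a cross-GPU transfer

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    for b in range(4):
        pending = pool.submit(communicate, b)   # ship previous results...
        compute(b)                              # ...while computing the next batch
        pending.result()
print(f"overlapped: {time.time() - start:.2f}s vs ~0.80s strictly sequential")
```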
But from a strategic perspective, if we step back, what this really means is that models have gone from being something in the hundred-million-dollar class, which only a few startups could ever afford, to something anybody can build if they have startup-level seed investment. That is a massive, massive shift. It is going to make more and more GPT-4-class models available, and it's going to be yet another driver in this overall strategic theme of GPT-4-class intelligence becoming essentially free. They've open-sourced this model; anybody can use it right now, and anybody can replicate it right now.
So if you think about it, we now have a world where GPT-4-class models are becoming free, and the cutting edge is in inference-time compute. These models don't really use the multi-threaded, multi-token prediction that inference-time compute has, where you type in a query and it runs lots and lots of different next-token prediction threads and finds the best one.
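One common form of that "many prediction threads, pick the best one" pattern is best-of-N sampling. Here is a hedged Python sketch; `generate` and `score` are hypothetical placeholders rather than any vendor's API, and real systems typically score candidates with a verifier or reward model.

```python
import random

# Best-of-N sketch of inference-time compute: sample several candidate
# completions, then keep the highest-scoring one. `generate` and `score`
# are hypothetical stand-ins, not any real model's API.

def generate(prompt: str, seed: int) -> str:
    random.seed(seed)
    return f"{prompt} -> candidate answer #{random.randint(0, 999)}"

def score(candidate: str) -> float:
    return random.random()   # real systems use a verifier/reward model here

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=score)

print(best_of_n("What is 17 * 24?"))
```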
Now, that may get open-sourced next year; at the pace we're going, we may well see a model like that get open-sourced next year. But for now, that is the model, and that is the edge that ChatGPT has in the space. Nobody else really has that kind of inference-time compute yet, though lots of people are working on it. And the GPT-4-class models, like Claude Sonnet 3.5 or 3.6, like ChatGPT-4o, those are rapidly getting replicated. Cost is driving to zero.
It's a massive achievement. And I will grant you, it is easier to replicate than it is to innovate: getting to the first ChatGPT-4 may well have cost $100 million no matter how you did it, because it was the first time. But replicating it turns out to be very efficient and very affordable, relatively speaking, and that has huge implications, because it means that intelligence is going to be more and more free for a lot of different applications that matter in business. So we will see, but right now a $5 million model is beating 4o and Sonnet 3.5 at a lot of the things that people really use these models for, like English, like math, like coding, etc. So there you have it: DeepSeek V3, the new GPT-4-class model champion. Cheers.