Scaling Language Models: Size vs Performance
Key Points
- LLM size is measured by the number of parameters, ranging from lightweight 300 M‑parameter models that run on smartphones to massive systems with hundreds of billions—or even approaching a trillion—parameters that require data‑center‑scale GPU clusters.
- Model examples illustrate this spectrum: Mistral 7B has roughly 7 billion parameters (a small model), whereas Meta’s LLaMA 3 reaches about 400 billion parameters, placing it in the “large” category, and frontier research is pushing well beyond half a trillion.
- More parameters generally boost capabilities—enabling better factual recall, multilingual support, and longer reasoning chains—but they also incur steeply higher compute, energy, and memory costs, so “bigger is not always better.”
- Progress is tracked with benchmarks like the Massive Multitask Language Understanding (MMLU) test; GPT‑3 (175 B parameters) scored ~44%, above the average human baseline of ~35%, while today's frontier models reach the high 80s, and ever‑smaller models keep clearing a 60% “competent generalist” threshold—so smaller models are closing the gap even though the most capable systems still benefit from sheer scale.
Sections
- How Large Are LLMs? - The segment explains that LLM size is measured by parameter count, ranging from 300 million to hundreds of billions (or over a trillion), with examples like Mistral 7B and LLaMA 3 400B, and discusses the trade‑off between increased capability and higher compute, energy, and memory costs.
- AI MMLU Performance Over Time - The speaker contrasts human baseline scores on the MMLU benchmark with GPT‑3’s 44% and today's frontier models reaching the high 80s, emphasizing how the 60% competence threshold fell from a 65‑billion‑parameter model in early 2023 to much smaller models within months.
- Scale Advantages in AI Applications - The speaker outlines tasks where large language models outperform smaller ones—such as multi‑language code generation, processing lengthy documents, and high‑fidelity multilingual translation—while also noting cases like on‑device AI where compact models are preferable.
- Decision Driven by Use Case - The final choice should be based on the specific requirements and context of your application.
Full Transcript

**Source:** [https://www.youtube.com/watch?v=0Wwn5IEqFcg](https://www.youtube.com/watch?v=0Wwn5IEqFcg)
**Duration:** 00:09:18

- [00:00:00](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=0s) **How Large Are LLMs?**
- [00:03:07](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=187s) **AI MMLU Performance Over Time**
- [00:06:11](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=371s) **Scale Advantages in AI Applications**
- [00:09:14](https://www.youtube.com/watch?v=0Wwn5IEqFcg&t=554s) **Decision Driven by Use Case**
The first L in LLM stands for large.
But how large is large?
Well, today's language models cover a huge range
of sizes, from lightweight networks that have maybe
300 million parameters that can run entirely on a smartphone
to titanic systems with hundreds of billions, or perhaps even approaching
a trillion parameters that require
racks of GPUs in a hyperscale data center.
And yeah, size in this context, it is measured in parameters.
That's how we measure the size of an LLM,
and parameters are the individual floating-point weights that a
neural network tweaks while it trains.
And collectively these parameters
encode everything the model can recall or reason about.
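To make the parameter-to-hardware link concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from the talk): each weight stored in 16-bit precision takes 2 bytes, so the memory needed just to hold the weights scales directly with parameter count. Activations, KV cache, and optimizer state add more on top.

```python
def weights_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory (GB) to hold just the weights; 2 bytes/param assumes fp16."""
    return n_params * bytes_per_param / 1e9

print(weights_memory_gb(300e6))  # 0.6 GB: fits on a smartphone
print(weights_memory_gb(7e9))    # 14.0 GB: a single high-end GPU
print(weights_memory_gb(400e9))  # 800.0 GB: a multi-GPU cluster
```

This is why the same spectrum runs from phones to data centers: the weights alone for a 400B model outgrow any single accelerator.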
Well let's talk about some specific models.
So, for example, Mistral 7B:
that is an example of a small model;
the 7B there says it contains
roughly 7 billion of those weights, or parameters.
By comparison,
we could take a look at LLaMA 3, for example, from Meta.
Now this one is a much bigger
model: 400B.
So we would put this in the large
LLM category.
And in fact some frontier models are much bigger than that,
with room to push well beyond half a trillion parameters.
And in broad strokes, extra parameters buy extra capability.
Larger models have room to memorize more facts, support
more languages, and carry out more intricate chains of reasoning.
But the trade-off, of course, with these guys is
cost.
They demand far more compute, energy, and memory,
both to train them in the first place and then to run them in production.
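To put a rough number on training cost, a widely cited approximation (my own sketch, not from the talk) puts training compute at about 6 × parameters × training tokens FLOPs; the 2-trillion-token corpus size below is an illustrative assumption.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 * params * tokens FLOPs."""
    return 6 * n_params * n_tokens

# Same 2T-token corpus, two model sizes: cost scales with parameter count.
print(f"7B:   {train_flops(7e9, 2e12):.1e} FLOPs")    # ~8.4e+22
print(f"400B: {train_flops(400e9, 2e12):.1e} FLOPs")  # ~4.8e+24
```

At fixed data, a 400B model costs roughly 57× more compute to train than a 7B one, before counting the larger corpora big models are usually trained on.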
So the story isn't simply bigger is better.
Smaller models are catching up and are punching far above their weight class.
And let me give you an example.
Well, we measure progress in language model capability with benchmarks.
And one of the most enduring benchmarks
is the MMLU.
That's Massive Multitask Language Understanding.
Now, the MMLU contains more than 15,000
multiple-choice questions across all sorts of subjects,
like math and history and law and medicine, and anybody taking the test
needs to combine factual recall with problem solving across many fields.
So the test is a convenient, if somewhat imperfect, snapshot
of broad, general-purpose ability.
Now, if you took the MMLU and you were just guessing at random,
you would score around 25% on the test.
But if you weren't guessing at random,
if you're just kind of a regular Joe, just a regular human,
and you took the test, you might score somewhere around 35%.
It's a pretty hard test,
but what about a domain expert?
Well, a domain expert would score far higher,
something like around 90%
on questions that are within their specialty.
So that's humans.
What about AI models?
Well, when GPT-3 came out in 2020
(this is a 175-billion-parameter model),
it posted a score on the MMLU of 44%.
I mean, that's pretty respectable.
It's better than the average Joe,
but it's far from mastery.
What about today's models?
Well, if we take a look at today's frontier models, kind of the best models
we have, they can score in the high 80s,
maybe 88% on the test.
But let's use a different yardstick.
Let's use a threshold of 60%.
And we can say that is a practical cutoff
because above that line, a model begins to look
like a competent generalist that can answer everyday questions.
And what is striking is how quickly that 60% barrier
has fallen to ever smaller models.
So in February of 2023,
the smallest model that could score above 60%
was Llama 1 65B,
the 65B meaning 65 billion parameters.
But just a few months later,
by July of the same year, Llama 2 34B
could do it with barely half the parameters.
Then if we fast forward to September
of the same year, that saw
Mistral 7B join the crowd, which we know is a 7-billion-parameter model,
and then in March of 2024,
Qwen 1.5-MoE became the first model with fewer
than 3 billion active parameters to clear 60%.
In other words, month by month, we are learning to squeeze
competent generalist behavior into smaller and smaller footprints.
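That shrinking footprint can be put in numbers; here is a small sketch using the figures from this talk (the 2.7 B active-parameter value for Qwen 1.5-MoE is my approximation of the "fewer than 3 billion" mentioned above):

```python
# Smallest model clearing 60% on MMLU, per the talk's timeline.
milestones = [
    ("2023-02", "Llama 1 65B",  65e9),
    ("2023-07", "Llama 2 34B",  34e9),
    ("2023-09", "Mistral 7B",    7e9),
    ("2024-03", "Qwen 1.5-MoE", 2.7e9),  # active params; approximate
]
shrink = milestones[0][2] / milestones[-1][2]
print(f"'Competent generalist' footprint shrank ~{shrink:.0f}x in 13 months")
```

That is roughly a 24× reduction in about a year, which is the pace the speaker is pointing at.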
So smaller models are getting smarter.
And I think the next natural question becomes which model should I put
into production, large or small?
And the answer,
of course, depends on your workload, your latency, your privacy constraints.
And let's be honest, the size of your GPU budget.
Now I'm generalizing here.
Your case may be different, but certain tasks
do still reward sheer scale.
So let's talk about some large model use cases.
And one of the first really comes down to
broad spectrum code generation.
So a small model can master a handful of programming languages.
But a frontier model has room for dozens of ecosystems
and can reason across multi-file projects, unfamiliar APIs, and weird edge cases.
Another good example is when you have document-heavy
work that you need to process.
So we might need to ingest a very large contract
and a medical guideline and a technical standard.
And a large model's longer context window means it can keep more of the source text
in mind, reducing hallucinations and improving citation quality.
And the same scale advantage appears in high-fidelity
multilingual translation as well, where
we're going from one language to another, and the extra parameters
let the network carve out richer subspaces for each language,
capturing idioms and nuance that smaller models might kind of gloss over.
But look, there are some cases where small models
are not only good enough, but they are outright preferable.
So let's talk about some of those use cases.
And one of those comes down to
on-device AI.
So keyboard prediction, voice commands, or offline search: that stuff
lives or dies by sub-100-millisecond latency and strict
data privacy, and small models that run on device,
well, they're great for that.
Also, when it just comes down
to everyday summarization, that's another sweet spot.
In a news summarization study, Mistral 7B Instruct achieved ROUGE and BERTScore metrics
that were statistically indistinguishable from a much larger model, GPT-3.5 Turbo.
And that's despite the model running about 30 times cheaper and faster.
And another good use case comes down
to enterprise chat bots.
So with these, a business can fine tune a seven or a 13 billion
parameter model on its own manuals, and it can reach near expert accuracy.
And IBM found that the Granite 13B family matched the performance of models
that were five times larger on typical enterprise Q&A tasks.
So the rule of thumb is: for expansive, open-ended reasoning,
bigger does still buy more headroom; for
focused skills like summarizing and classifying,
a carefully trained small model delivers perhaps
90% of the quality at a fraction of the cost.
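That rule of thumb could be condensed into a toy selection helper; the categories and size bands below are illustrative assumptions on my part, not a prescription from the talk.

```python
def pick_model_scale(on_device: bool, open_ended: bool, long_docs: bool) -> str:
    """Toy heuristic mirroring the talk's rules of thumb (illustrative only)."""
    if on_device:
        # Latency and privacy dominate: stay small.
        return "small: runs locally, sub-100 ms responses"
    if open_ended or long_docs:
        # Broad reasoning and long context still reward sheer scale.
        return "large: frontier-scale model"
    # Focused skills (summarize, classify, enterprise Q&A): fine-tune small.
    return "mid: 7B-13B model fine-tuned on your own data"
```

For an enterprise chatbot, for instance, `pick_model_scale(False, False, False)` lands on the fine-tuned mid-size option.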
So go big.
Stay small.
In the end, it's your use case that will drive the decision.