Granite 4.0: Efficient Small Models
Key Points
- The speaker feels personally “seen” by IBM’s Granite.13B.V2 model because its transparent training data includes many of his own US patents and the Redbooks he authored.
- IBM’s newly released Granite 4.0 family offers higher performance, faster inference, and lower operational costs than both earlier Granite models and larger competing LLMs.
- Granite 4.0 combines Transformer layers with the Mamba 2 architecture and comes in several sizes: Small (32 B total, 9 B active Mixture‑of‑Experts), Tiny (7 B total, 1 B active Mixture‑of‑Experts), and Micro (3 B dense, with both hybrid and pure Transformer variants).
- The models are designed for memory‑efficient, low‑compute deployments, making them suitable for enterprise GPU workloads, low‑latency edge use cases, and lightweight local applications.
Sections
- Untitled Section
- Mamba vs Transformer Scaling - The speaker explains how the Mamba state‑space model processes context with linear computational cost, unlike the quadratic scaling of traditional Transformers, and why this matters for Granite 4.0’s architecture.
- Model Scaling, Experts, and NoPE - The speaker explains how Tiny and Small models use sparse expert routing with only a fraction of parameters active per token, contrasts traditional rotary positional encoding with Granite’s NoPE approach that enables theoretically unlimited context length, and notes the diverging trends of ever‑larger models versus efficient, long‑context architectures.
Source: https://www.youtube.com/watch?v=AaCBiGWTuyA
Duration: 00:11:04
Section timestamps: 00:00:00 Untitled Section · 00:05:18 Mamba vs Transformer Scaling · 00:08:48 Model Scaling, Experts, and NoPE
Full Transcript
I have a special affinity with the Granite series of large language models from IBM, and it's
probably not what you think. Now, yes, this is the family of LLMs from my employer, so I should
probably be saying nice things. But what endeared me most of all to a particular Granite
model called Granite.13B.V2, which is a
13 billion parameter model released in 2024, was that it was completely transparent
about its training data, and that training data contained, well, a lot of my stuff. Like, for
example, it included all US patents from the USPTO dating back to
1975, and a few hundred of those patents were mine. And it was trained also on
IBM Docs, including Redbooks, and writing and leading Redbooks had
been my day job for over a decade. And actually, do you know who led the project for
the longest ever Redbook? That was me. Look, here it is, here. This
guy is nearly 2,000 pages long. We had to find a specialized printing company just
to bind it. Anyways, the point is, I felt seen by Granite. Well,
now, IBM has released the next generation of Granite models. So, the new set of
models are Granite 4.0. And these models, they deliver higher performance,
faster speeds and significantly lower operational costs compared to similar models, including
previous Granite models, but also compared to much larger models as well. And this being a tech
channel, I wanted to get into some pretty interesting details about the architecture of
these models, specifically the combination of Transformers and Mamba 2. But first, let's
briefly talk about what this Granite 4.0 family of small models looks like. Now, there are currently
several models in the family. The first one we'll talk about is called Small. It's kind of the
workhorse. It's designed for enterprise tasks like running multi-tool agents or handling complex
workflows on a single enterprise GPU. Now, this is a Mixture-of-Experts model,
and it comes with 32 billion total parameters, of which there are
9 billion active parameters. And I'll get into what an active parameter is in just a sec. So
that's Small. Then there is the Tiny model. Now Tiny is built for low latency,
for local and edge use cases. And like Small, it's another Mixture-of-Experts model.
It has 7 billion total parameters and 1 billion
active parameters. And then there are a couple of Micro models as
well. And like Tiny, they're intended for lightweight local deployments, but they use a
dense architecture, with 3 billion parameters. Now, one uses the
same hybrid architecture as Tiny and Small, and the other Micro model, that uses a traditional
Transformer architecture. But the theme here is models that are small in size, with fast inference,
that don't need much in the way of compute to run. And I want to focus on the memory
efficiency for a moment, because this is where the Granite 4.0 models really stand out. So,
in a production workload, if you think about long context and multi-batch tasks, the Micro model,
that only requires about 10 GB of
GPU memory to run, while comparable models typically need four
to six times that amount. And then Tiny and Small, they share the same kind
of advantage as well. So, the result is that Granite 4.0's hybrid design can reduce memory
requirements up to 80%, all while delivering higher performance
across many tasks and running at faster speeds. And speaking of speed, that is
another advantage. Most models tend to slow down as you increase batch size or
context length, but Granite 4.0's design actually keeps throughput high while other models hit
their limits. And if we think about performance, there are advantages
there as well. Granite 4.0 models are competitive within their model weight
classes, but they're also competitive with much larger models, especially on benchmarks
that measure performance on key agentic tasks. The Small model, for example, outperforms nearly
every open model on instruction-following benchmarks, and keeps pace even with frontier
models on function calling. So, kind of this balance of speed and efficiency and accuracy.
It's exactly why the architecture behind Granite 4.0 is so interesting. So, let's get
into the details of that architecture. And, first of all, let's talk about
Mamba. Now the dominant architecture of AI isn't really Mamba, of course; it is the
Transformer architecture. And Transformer-based models, well, they've been around for a
good while. But in 2023, researchers from Carnegie Mellon and Princeton introduced a new
architecture called Mamba. And it is a type of state space model or an SSM.
And SSMs, they're a bit like recurrent neural networks that dominated natural language
processing tasks before Transformers came along. And Mamba solves the limitations that
made us abandon RNNs in the first place. And now we have Mamba 2, which is
an optimized implementation of this Mamba architecture. But so what? Well,
Transformers use self-attention to process text, which is incredibly powerful but also
computationally expensive. Mamba maintains what's essentially a summary of previous contexts.
As it processes each new token, it selectively decides what's important enough to update that
summary with. Now, this means that Mamba's computational needs scale in
a linear fashion with the context length while Transformers, they
scale in a quadratic relationship with the context length. So, let me put
that in practical terms. If you double your context window with a Transformer model, your
computational requirements quadruple. With Mamba, they
merely double. And that's a huge efficiency gain, especially when we're talking about handling ever
larger LLM context windows. But here's the thing: Transformers, they still have
some advantages. They're better at certain tasks, like in-context learning and complex reasoning. So,
how does this all relate to the Granite 4.0 family?
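Holding that question for a moment: the quadratic-versus-linear contrast above can be made concrete with a toy cost count. The cost units here are arbitrary and purely illustrative, not real FLOP figures:

```python
# Toy cost model: self-attention does work for every pair of tokens
# (quadratic), while an SSM does a constant amount of work per token
# (linear). Units are arbitrary and purely illustrative.

def attention_cost(context_len):
    return context_len ** 2  # every token attends to every other token

def ssm_cost(context_len):
    return context_len  # one state update per token

for n in (4_000, 8_000, 16_000):
    print(f"context={n}: attention={attention_cost(n)}, ssm={ssm_cost(n)}")

# Doubling the context quadruples the attention cost but only
# doubles the SSM cost.
assert attention_cost(8_000) == 4 * attention_cost(4_000)
assert ssm_cost(8_000) == 2 * ssm_cost(4_000)
```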
Well, the Granite 4.0 architecture is a hybrid architecture. It uses
9 Mamba blocks for every one Transformer block. Mamba
kind of does the heavy lifting of capturing global context, then it lets the Transformer
blocks work their magic on parsing the nuanced local details. The efficiency of Mamba with the
precision of Transformers. Now, the second part of this hybrid approach is Mixture of
Experts or MoE. And MoE is used for the Tiny and
the Small models. And this is where the active parameters I mentioned earlier come in. So, MoE
divides the model into different experts. These are specialized neural subnetworks. And it uses
a routing mechanism to activate only the specific experts that
it needs for a particular task. And the Granite 4.0 models are what we call fine-grained MoE models.
So, if you remember, I said that the Tiny model has
7 billion total parameters and 1
billion active parameters, and that means active at inference
time. Well, Tiny has 62 different experts, but for any given token it
only activates the specific expert that it needs. Plus, there's a shared expert that's always active.
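A minimal sketch of that routing idea, with a made-up expert count, gate scores, and top-k setting rather than Granite's actual configuration:

```python
import math

# Sketch of Mixture-of-Experts routing: a gate scores every expert for the
# current token, only the top-k highest-scoring experts run, and one shared
# expert is always active. All numbers here are illustrative.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_scores, k=1, shared_expert=0):
    """Return the indices of the experts that will run for this token."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    active = set(ranked[:k])
    active.add(shared_expert)  # the shared expert is always on
    return sorted(active)

# 8 experts, one routed + one shared: most parameters stay idle per token.
print(route([0.1, 2.5, -1.0, 0.3, 0.0, 1.7, -0.2, 0.4], k=1))
```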
It's pretty efficient. And similarly with the Small model, that has 32 billion total parameters,
9 billion active parameters. It uses a similar routing strategy. And one last
architectural note. Most models use some form of positional encoding, like
RoPE, rotary positional encoding, to help the model understand word order. But
these encoding schemes, they often struggle with sequences longer than what they saw during
training. Well, Granite says NoPE to RoPE, quite
literally, because NoPE is no positional encoding, which so far has had no adverse effects on
long context performance. So, without the computational overhead of positional encoding and
with Mamba's linear scaling properties, the Granite 4.0 model architecture is designed to
theoretically have an unconstrained context length, meaning you can send in as many tokens as
your hardware and memory supports. So look, we are really seeing two emerging paths with AI model
development. We've got bigger and bigger models with more parameters and longer reinforcement
learning cycles from model providers that are chasing AGI. And then, on the other hand, we've
kind of got a spectrum of smaller models that really push the limits of what small models can
do, and that they can run on a GPU that you can buy online for, let's say, a couple of hundred
bucks, which is pretty cool. Now Granite 4.0 models are open source, so check them out on Hugging Face
or watsonx.ai if you'd like to see exactly what small language models are capable of doing.