
Granite 4.0: Efficient Small Models

Key Points

  • The speaker feels personally “seen” by IBM’s Granite.13B.V2 model because its transparent training data includes many of his own US patents and the Redbooks he authored.
  • IBM’s newly released Granite 4.0 family offers higher performance, faster inference, and lower operational costs than both earlier Granite models and larger competing LLMs.
  • Granite 4.0 combines Transformer layers with the Mamba 2 architecture and comes in several sizes: Small (32 B total, 9 B active Mixture‑of‑Experts), Tiny (7 B total, 1 B active Mixture‑of‑Experts), and Micro (3 B dense, with both hybrid and pure Transformer variants).
  • The models are designed for memory‑efficient, low‑compute deployments, making them suitable for enterprise GPU workloads, low‑latency edge use cases, and lightweight local applications.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=AaCBiGWTuyA](https://www.youtube.com/watch?v=AaCBiGWTuyA)
**Duration:** 00:11:04

## Sections

- [00:00:00](https://www.youtube.com/watch?v=AaCBiGWTuyA&t=0s) **Untitled Section**
- [00:05:18](https://www.youtube.com/watch?v=AaCBiGWTuyA&t=318s) **Mamba vs Transformer Scaling** - The speaker explains how the Mamba state-space model processes context with linear computational cost, unlike the quadratic scaling of traditional Transformers, and why this matters for Granite 4.0's architecture.
- [00:08:48](https://www.youtube.com/watch?v=AaCBiGWTuyA&t=528s) **Model Scaling, Experts, and NoPE** - The speaker explains how Tiny and Small models use sparse expert routing with only a fraction of parameters active per token, contrasts traditional rotary positional encoding with Granite's NoPE approach that enables theoretically unlimited context length, and notes the diverging trends of ever-larger models versus efficient, long-context architectures.

## Full Transcript
I have a special affinity with the Granite series of large language models from IBM, and it's probably not what you think. Now, yes, this is the family of LLMs from my employer, so I should probably be saying nice things. But what endeared me most of all to a particular Granite model called Granite.13B.V2, which is a 13 billion parameter model released in 2024, was that it was completely transparent about its training data, and that training data contained, well, a lot of my stuff. Like, for example, it included all US patents from the USPTO dating back to 1975, and a few hundred of those patents were mine. And it was trained also on IBM Docs, including Redbooks, and writing and leading Redbooks had been my day job for over a decade. And actually, do you know who led the project for the longest ever Redbook? That was me. Look, here it is, here. This guy is nearly 2,000 pages long. We had to find a specialized printing company just to bind it. Anyways, the point is, I felt seen by Granite.

Well, now, IBM has released the next generation of Granite models. The new set of models are Granite 4.0. And these models deliver higher performance, faster speeds, and significantly lower operational costs compared to similar models, including previous Granite models, but also compared to much larger models as well. And this being a tech channel, I wanted to get into some pretty interesting details about the architecture of these models, specifically the combination of Transformers and Mamba 2. But first, let's briefly talk about what this Granite 4.0 family of small models looks like.

Now, there are currently several models in the family. The first one we'll talk about is called Small. It's kind of the workhorse. It's designed for enterprise tasks like running multi-tool agents or handling complex workflows on a single enterprise GPU.
Now, this is a Mixture-of-Experts model, and it comes with 32 billion total parameters, of which there are 9 billion active parameters. And I'll get into what an active parameter is in just a sec. So that's Small. Then there is the Tiny model. Now, Tiny is built for low latency, for local and edge use cases. And like Small, it's another Mixture-of-Experts model. It has 7 billion total parameters and 1 billion active parameters. And then there are a couple of Micro models as well. Like Tiny, they're intended for lightweight local deployments, but they use a dense architecture, with 3 billion parameters. Now, one uses the same hybrid architecture as Tiny and Small, and the other Micro model uses a traditional Transformer architecture. But the theme here is models that are small in size, with fast inference, that don't need much in the way of compute to run.

And I really want to focus in on the memory efficiency for a moment, because this is where the Granite 4.0 models really stand out. In a production workload, if you think about long-context and multi-batch tasks, the Micro model only requires about 10 GB of GPU memory to run, while comparable models will typically need four to six times that amount. And Tiny and Small share the same kind of advantage. So the result is that Granite 4.0's hybrid design can reduce memory requirements by up to 80%, all while delivering higher performance across many tasks and running at faster speeds.

And speaking of speeds, I would say that is another advantage. Most models tend to slow down as you increase batch size or context length, but Granite 4.0's design actually keeps throughput high where other models hit their limits. And then, if we think about performance, there are some advantages there as well.
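The memory figures quoted above can be sanity-checked with some back-of-the-envelope arithmetic. This is a rough sketch, assuming weights are served in a 16-bit format (2 bytes per parameter); the actual serving precision and the KV-cache overhead aren't stated in the video.

```python
# Rough memory arithmetic for the model sizes mentioned in the video.
# Assumption: 2 bytes per parameter (e.g. bf16); real deployments also
# need memory for the KV cache and activations on top of this.

def weight_memory_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GPU memory needed just to hold the model weights."""
    return total_params_billion * 1e9 * bytes_per_param / 1e9

# Micro: 3 B dense parameters -> ~6 GB of weights at 16-bit precision,
# which is consistent with the ~10 GB total GPU memory quoted once the
# KV cache and activations are added.
print(weight_memory_gb(3))    # 6.0

# Small: all 32 B parameters must sit in memory, even though only ~9 B
# are active per token -- active parameters reduce compute, not weight storage.
print(weight_memory_gb(32))   # 64.0
```

Note the distinction the sketch highlights: in a Mixture-of-Experts model, the total parameter count drives memory footprint, while the active parameter count drives per-token compute.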
Granite 4.0 models are competitive within their model weight classes, but they're also competitive with much larger models, especially on benchmarks that measure performance on key agentic tasks. The Small model, for example, outperforms nearly every open model on instruction-following benchmarks, and keeps pace even with frontier models on function calling. And this balance of speed, efficiency, and accuracy is exactly why the architecture behind Granite 4.0 is so interesting. So, let's get into the details of that architecture.

First of all, let's talk about Mamba. Now, the dominant architecture of AI isn't really Mamba, of course; it is the Transformer architecture. And Transformer-based models, well, they've been around for a good while. But in 2023, researchers from Carnegie Mellon and Princeton introduced a new architecture called Mamba. It is a type of state space model, or SSM. And SSMs are a bit like the recurrent neural networks that dominated natural language processing tasks before Transformers came along. Mamba solves the limitations that made us abandon RNNs in the first place. And now we have Mamba 2, an optimized implementation of this Mamba architecture.

But so what? Well, Transformers use self-attention to process text, which is incredibly powerful but also computationally expensive. Mamba maintains what's essentially a summary of previous context. As it processes each new token, it selectively decides what's important enough to update that summary with. Now, this means that Mamba's computational needs scale in a linear fashion with the context length, while Transformers scale in a quadratic relationship with the context length. So, let me put that in practical terms.
If you double your context window with a Transformer model, your computational requirements quadruple, they 4x. With Mamba, they merely double. And that's a huge efficiency gain, especially when we're talking about handling ever larger LLM context windows. But here's the thing: Transformers still have some advantages. They're better at certain tasks, like in-context learning and complex reasoning. So, how does this all relate to the Granite 4.0 family?

Well, the Granite 4.0 architecture is a hybrid architecture. It uses 9 Mamba blocks for every one Transformer block. Mamba does the heavy lifting of capturing global context, then lets the Transformer blocks work their magic on parsing the nuanced local details. The efficiency of Mamba with the precision of Transformers.

Now, the second part of this hybrid approach is Mixture of Experts, or MoE. MoE is used for the Tiny and the Small models. And this is where the active parameters I mentioned earlier come in. MoE divides the model into different experts. These are specialized neural subnetworks. And it uses a routing mechanism to activate only the specific experts that it needs for a particular task. The Granite 4.0 models are what we call fine-grained MoE models. So, if you remember, I said that the Tiny model has 7 billion total parameters and 1 billion active parameters. And that means active at inference time. Well, that model, Tiny, has 62 different experts, but for any given token it only activates the specific experts that it needs. Plus, there's a shared expert that's always active. It's pretty efficient. And similarly with the Small model: that has 32 billion total parameters and 9 billion active parameters, and it uses a similar routing strategy. And one last architectural note.
Most models use some form of positional encoding, like RoPE (that's rotary positional encoding), to help the model understand word order. But these encoding schemes often struggle with sequences longer than what they saw during training. Well, Granite says NoPE to RoPE, quite literally, because NoPE is no positional encoding, which so far has had no adverse effects on long-context performance. Which means that without the computational overhead of positional encoding, and with Mamba's linear scaling properties, the Granite 4.0 model architecture is designed to have a theoretically unconstrained context length, meaning you can send in as many tokens as your hardware and memory support.

So look, we are really seeing two emerging paths in AI model development. We've got bigger and bigger models with more parameters and longer reinforcement learning cycles from model providers that are chasing AGI. And on the other hand, we've got the spectrum of smaller models that really push the limits of what small models can do, models that can run on a GPU you can buy online for, let's say, a couple of hundred bucks, which is pretty cool. Now, Granite 4.0 models are open source, so check them out on Hugging Face or watsonx.ai if you'd like to see exactly what small language models are capable of doing.
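The scaling contrast at the heart of the transcript, quadratic self-attention cost versus linear state-space cost, can be sketched with a toy cost model. The constant factors below are arbitrary placeholders; only the growth rates reflect the argument in the video.

```python
# Toy cost model for the scaling argument: self-attention work grows with
# the square of context length, while a Mamba-style state-space layer
# grows linearly (it carries a fixed-size summary of prior context).

def attention_cost(n_tokens: int) -> int:
    """Cost proportional to n^2: every token attends to every other token."""
    return n_tokens ** 2

def ssm_cost(n_tokens: int) -> int:
    """Cost proportional to n: each token does a fixed-size state update."""
    return n_tokens

# Doubling the context window from 4096 to 8192 tokens:
print(attention_cost(8192) / attention_cost(4096))  # 4.0 -> compute quadruples
print(ssm_cost(8192) / ssm_cost(4096))              # 2.0 -> compute merely doubles
```

This is why the gap widens as context windows grow: at 16x the context, the attention term costs 256x while the state-space term costs only 16x, which is the efficiency the hybrid 9-to-1 Mamba/Transformer layout is trading on.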