Agentic AI vs Mixture of Experts
Key Points
- An agentic AI workflow uses a planner agent to assign tasks to specialized agents (A, B, C), whose results are collected by an aggregator to produce the final output.
- The “mixture of experts” architecture replaces the planner with a router that dispatches input to parallel expert models, then merges their token streams into a single result.
- Agentic workflows rely on LLM‑driven agents equipped with perception, memory (working and long‑term), and domain‑specific tools (e.g., data querying, analysis, visualization) to act autonomously toward a goal.
- Mixture‑of‑experts systems focus on parallel processing of the same input by multiple expert models, emphasizing speed and scale rather than explicit decision‑making or memory handling.
- Although architecturally distinct, both paradigms represent cutting‑edge AI designs and can be combined—for example, using a router to select agentic modules or integrating expert outputs into an agent’s reasoning loop.
Sections
- Agentic Workflow vs Mixture of Experts - The speaker contrasts the hierarchical planner‑agent‑aggregator structure of agentic AI workflows with the parallel router‑expert‑merge design of mixture‑of‑experts models, highlighting their similarities and differences.
- Specialized Agents and MoE Loop - The speaker explains how domain‑specific agents (data, analysis, visualization) operate in a perception‑memory‑reason‑action‑observation cycle and interact at the application level, while contrasting this with a Mixture‑of‑Experts neural architecture that routes inputs via a gating network to multiple specialized model experts.
- Multi‑Agent Incident Response with Experts - An enterprise security workflow uses a planner agent to dispatch alerts to specialized agents—including an LLM powered by a mixture‑of‑experts router that dynamically selects specific expert submodels for each token batch—to diagnose lateral movement and recommend actions.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=4-FH09AMsp0](https://www.youtube.com/watch?v=4-FH09AMsp0) **Duration:** 00:09:15
- [00:00:00](https://www.youtube.com/watch?v=4-FH09AMsp0&t=0s) Agentic Workflow vs Mixture of Experts
- [00:03:06](https://www.youtube.com/watch?v=4-FH09AMsp0&t=186s) Specialized Agents and MoE Loop
- [00:06:16](https://www.youtube.com/watch?v=4-FH09AMsp0&t=376s) Multi-Agent Incident Response with Experts
If you're familiar with AI multi-agent workflows,
you might have seen some form of this architecture before.
So at the top here, we provide
some input to an agentic workflow.
And that will ultimately kind of flow down here
to produce some output in the end.
Now if we look into the boxes,
typically you would have at the top here,
a planner agent that's responsible for distributing
work to the agents within this workflow.
And then, each of these agents.
So, let's say, we've got an agent here.
We'll just call this agent A
and agent B and C.
Each one of these
is a specialist that does dedicated work on a particular task.
And then once it's done its work,
then the results flow down here to the aggregator.
And that aggregator agent prepares a response.
And that's how we get our output.
So that is an agentic AI workflow.
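The planner-agents-aggregator flow just described can be sketched in a few lines. This is a minimal illustration, not any particular framework's API; in a real system each function would wrap an LLM call with its own role, tools, and context, and the names here are invented.

```python
# Hypothetical sketch of the planner -> specialized agents -> aggregator pattern.

def planner(task: str) -> dict:
    # Break the incoming request into sub-tasks, one per specialized agent.
    return {
        "agent_a": f"research: {task}",
        "agent_b": f"analyze: {task}",
        "agent_c": f"summarize: {task}",
    }

def run_agent(name: str, subtask: str) -> str:
    # Placeholder for a specialized agent doing its domain-specific work.
    return f"{name} completed '{subtask}'"

def aggregator(results: list[str]) -> str:
    # Collect every agent's result and prepare the final response.
    return " | ".join(results)

def agentic_workflow(task: str) -> str:
    assignments = planner(task)
    results = [run_agent(name, sub) for name, sub in assignments.items()]
    return aggregator(results)

print(agentic_workflow("quarterly sales report"))
```

The point of the shape is the control flow: one component fans work out, independent specialists do it, and one component fans the results back in.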
But in AI, there's another architecture
that we've had for quite a while that's gaining popularity.
And that is called mixture of experts.
And at a high level,
it has a very similar-looking workflow.
So instead of a planner, we start with a router
which receives the input and that dispatches the work.
And then we have here a series of experts.
Let's call this expert A, expert B and expert C.
And these guys, they work in parallel.
And then at the bottom here, we have a merge component
to reassemble the processed tokens into a single stream.
So it kind of begs the question:
what's the difference between these two things, AI agents and mixture of experts?
And the answer is, well,
quite a lot of difference actually.
But something they both have in common is
they are very much part of frontier AI models today.
So let's discuss what they do
and how they can be used together.
So in AI multi-agent workflows, the agents perceive their environment,
they make decisions, and they execute actions toward achieving a goal.
And all of this happens with minimal human intervention.
The agents, they typically use LLMs that have been given
specific roles and tools and contexts.
Now agentic AI workflows, they're
usually composed of modular components,
like, for example, one module
that might be the perception module.
That's kind of how the agent senses or ingests
information from its environment or its user input.
Then there's also a component
typically for memory.
This is the knowledge store.
That memory can be working memory for remembering the current context. Or,
it could be long-term memory for knowledge accumulation
over time, like domain facts or remembering user preferences.
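The two memory kinds just mentioned can be sketched as a tiny class. This is a hedged illustration with invented names, not any framework's memory API: a list standing in for working memory, a dict standing in for a long-term store.

```python
# Hypothetical sketch: working memory (current context) vs. long-term
# memory (accumulated facts and user preferences).

class AgentMemory:
    def __init__(self):
        self.working = []      # current conversation/task context
        self.long_term = {}    # durable domain facts and preferences

    def remember(self, item: str):
        # Working memory: short-lived, per-task context.
        self.working.append(item)

    def learn(self, key: str, fact: str):
        # Long-term memory: knowledge that persists across tasks.
        self.long_term[key] = fact

    def recall(self, key: str):
        return self.long_term.get(key)

mem = AgentMemory()
mem.remember("user asked about Q3 revenue")
mem.learn("preferred_chart", "bar")
print(mem.recall("preferred_chart"))  # -> bar
```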
And then there's an assortment of specialized agents.
And these are agents that excel at specific domains. So,
for example, we might have one specialized agent
that is a data agent that knows how to query databases and clean data.
We might also have a specialized agent called an analysis agent
that's trained on business intelligence.
And then maybe we also have a visualization agent as well
that creates charts and graphs.
Now architecturally, these components,
they form a loop.
So there are really different stages to this.
So first of all, they perceive,
then they're going to consult some form of memory. Remember,
the memory component.
From there, they're going to reason,
and they're going to act based upon that reason.
And then finally, they're going to observe what happens
based on that action.
And then kind of round and round
we go in this loop.
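That perceive, consult memory, reason, act, observe cycle can be sketched as a plain loop. Every function body here is a hypothetical placeholder for what would be an LLM or tool call in practice; only the loop structure reflects the description above.

```python
# Hedged sketch of the agent loop: perceive -> memory -> reason -> act -> observe.

def perceive(environment: list[str]) -> str:
    return environment[-1]                 # latest observation or user input

def reason(observation: str, memory: list[str]) -> str:
    # Placeholder reasoning step, conditioned on memory.
    return f"plan for '{observation}' given {len(memory)} memories"

def act(plan: str) -> str:
    return f"executed {plan}"              # placeholder for a tool call

def agent_loop(environment: list[str], steps: int = 3) -> list[str]:
    memory: list[str] = []
    for _ in range(steps):
        obs = perceive(environment)        # 1. perceive
        plan = reason(obs, memory)         # 2-3. consult memory, reason
        result = act(plan)                 # 4. act
        environment.append(result)         # 5. observe the action's outcome
        memory.append(obs)                 # update memory, then loop again
    return environment

trace = agent_loop(["user: clean this dataset"])
print(len(trace))  # 1 original input + 3 action results = 4
```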
And the key here is that each of these agents
operates at the application level.
They're making decisions, they're using tools,
and they can communicate with each other.
The mixture of experts, on the other hand,
that operates at the architecture level, and MoE
is a neural network design
that splits a model into multiple experts.
Let me draw three experts here,
although in reality there'll be a lot more.
And each of these experts specializes
in a part of the input space.
And then there's also a gating network at the top
that routes the input
to the different experts in this mixture of experts architecture.
And the input goes through these experts
before it gets to the next layer, coming into the merge component here.
And that receives all of the responses
from the experts that were invoked
and then performs mathematical operations to basically combine the output tensors
from these different experts into a single representation
that continues through the rest of the model.
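The gating-and-merge math just described can be illustrated with toy numbers: a gating network scores each expert for the input, and the merge step combines the experts' output vectors as a weighted sum. Real MoE layers do this per token inside a transformer with learned weights; the experts and gate weights below are invented stand-ins.

```python
import math

def softmax(scores):
    # Normalize gate scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate(x):
    # Toy gating network: one score per expert (weights invented).
    gate_weights = [0.9, 0.1, -0.5]
    return softmax([w * sum(x) for w in gate_weights])

experts = [
    lambda x: [2 * v for v in x],      # expert A
    lambda x: [v + 1 for v in x],      # expert B
    lambda x: [-v for v in x],         # expert C
]

def moe_layer(x):
    weights = gate(x)
    outputs = [f(x) for f in experts]  # experts process the input in parallel
    # Merge: weighted sum of expert outputs into a single representation.
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))]

print(moe_layer([1.0, 2.0]))
```

In a sparse MoE, the gate would additionally zero out all but the top-scoring experts so the skipped ones never run at all.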
And one of the big advantages of MoE
is sparsity, because only the active expert parameters
contribute to that input's computation.
So if we take an LLM example,
let's say the IBM Granite 4.0 Tiny Preview model. Well,
that uses 64 different experts
in its architecture,
and it has around 7 billion
total parameters in the model.
But of those, only about 1 billion
are active at inference time.
So that makes it a pretty memory-efficient language model
capable of running on a single, pretty modest GPU.
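The back-of-the-envelope arithmetic behind that claim, using the rough numbers quoted (about 7 billion total parameters, about 1 billion active):

```python
# Sparsity arithmetic for the quoted Granite 4.0 Tiny Preview figures.
total_params = 7e9    # all experts must fit in memory
active_params = 1e9   # but only these run per forward pass

active_fraction = active_params / total_params
print(f"{active_fraction:.0%} of parameters active per token")   # ~14%

# At 16-bit precision, the weights still need this much memory,
# but per-token compute scales with the active parameters only:
bytes_per_param = 2
print(f"{total_params * bytes_per_param / 1e9:.0f} GB to hold the weights")
```

So compute per token behaves like a ~1B model even though the full ~7B set of weights must be loaded, which is what makes a single modest GPU plausible.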
So in an MoE model,
these experts, they're not separate AI agents.
They're specialized neural network components
within the same model.
So let's consider a use case where a multi-agent workflow
and a mixture-of-experts model show up in the same system.
So let's just imagine an enterprise incident response workflow.
So we've got a security analyst who's going to start this off.
And they're going to kick things off by providing
an alert bundle as input to our workflow and
maybe a short natural language question, like
is this lateral movement?
And if it is, what should I do about it?
Now that goes into our agentic workflow, and we have a number of components there.
So we have, for example, a planner
agent that's going to be the first component that breaks up the request.
And then it kind of spins up the agentic workflow
and then passes it along to these specialized agents.
So we might have a log triage
agent here that parses the raw telemetry,
and we might have a threat intel agent here
that processes indicators and so forth and so forth...
... as we go down this workflow.
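The dispatch just described can be sketched end to end. The agent internals here are hypothetical placeholders (simple string matching standing in for real telemetry parsing and threat-intel lookups); only the planner-to-specialists shape follows the description.

```python
# Hedged sketch of the incident-response dispatch: a planner agent splits
# the analyst's request across a log-triage agent and a threat-intel agent.

def log_triage_agent(alert_bundle):
    # Would parse raw telemetry; here we just count suspicious logins.
    return {"suspicious_logins": sum(1 for a in alert_bundle if "login" in a)}

def threat_intel_agent(alert_bundle):
    # Would check indicators of compromise against intel feeds.
    return {"known_iocs": [a for a in alert_bundle if "malware" in a]}

def planner(question, alert_bundle):
    findings = {}
    findings.update(log_triage_agent(alert_bundle))   # dispatch to agent 1
    findings.update(threat_intel_agent(alert_bundle)) # dispatch to agent 2
    verdict = ("possible lateral movement"
               if findings["suspicious_logins"] > 1
               else "no lateral movement")
    return {"question": question, "findings": findings, "verdict": verdict}

report = planner(
    "is this lateral movement?",
    ["failed login host-a", "failed login host-b", "malware hash seen"],
)
print(report["verdict"])  # -> possible lateral movement
```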
And this log triage agent
could actually be implemented as an LLM that uses
mixture of experts as its architecture.
So as the tokens stream in
to the mixture of experts gating network
or the router, if you like, it looks at each micro
batch of text and decides on the fly
which handful of experts inside the model should handle it.
Now, only those selected experts run.
Perhaps the tokens go through to expert
one and maybe expert two,
but none of the other experts.
We just kind of leave those alone.
So perhaps it would just use two experts
out of a total of 64,
and it would just activate those for that particular micro batch.
So the forward pass, it's just going to touch a fraction
of the overall parameters. And the selected experts,
they process their slice of the representation in parallel,
and then they come back down
to the merge function
down here that mathematically stitches
those outputs back together before the next transformer layer.
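That top-2-of-64 routing decision can be sketched as follows. The scores here come from a seeded toy random generator rather than a learned router, and the function names are invented; what matters is that only `TOP_K` experts are selected and their merge weights are renormalized to sum to 1.

```python
import random

# Hedged sketch of top-k routing: score all 64 experts for a micro batch,
# activate only the top 2, skip the rest entirely.
NUM_EXPERTS, TOP_K = 64, 2

def route(token_batch: str, seed: int = 0):
    rng = random.Random(seed)  # toy deterministic scores, not a learned gate
    scores = [rng.random() for _ in range(NUM_EXPERTS)]
    # Keep only the TOP_K highest-scoring experts for this micro batch.
    top = sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]
    # Renormalize the selected scores so the merge weights sum to 1.
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]

selected = route("suspicious login burst from host-a")
print(selected)  # e.g. two (expert_index, merge_weight) pairs
```

Everything outside those two experts contributes zero compute for this batch, which is exactly the sparsity discussed earlier.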
So perhaps this log triage agent
is 7 billion parameters as an LLM,
but only about 1 billion active
parameters are used during inference.
So agents,
they route tasks across the workflow.
They decide the next step, maybe call a tool,
update shared memory and stuff like that.
A mixture of experts routes tokens inside
a single model, deciding which internal parameter slices
light up for the next few milliseconds of compute. Stack them together well,
and you get workflows that reason broadly
and specialize deeply.