Mixture of Experts Explained

Key Points

  • Neural networks, especially large language models with hundreds of billions of parameters, require massive compute at inference, prompting the use of Mixture of Experts (MoE) to improve efficiency.
  • MoE splits a model into many specialized subnetworks (“experts”) and employs a gating network that selects only the most relevant experts for each input, reducing the amount of computation needed per task.
  • The MoE concept dates back to a 1991 paper that showed faster convergence and comparable accuracy by training separate expert networks, and it has recently resurged in modern LLMs.
  • Open‑source models like Mistral’s Mixtral 8x7B illustrate MoE in practice: each layer contains eight 7‑billion‑parameter experts, and a router picks the top two experts per token, mixing their outputs before passing them onward.
  • This architecture leverages sparsity—activating only a small subset of the total parameters at any time—to achieve high performance with lower computational cost.
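As a rough back-of-envelope illustration of that sparsity (using the nominal "8x7B" numbers rather than Mixtral's exact parameter counts, and ignoring shared non-expert parameters such as attention layers):

```python
# Illustrative sparsity arithmetic for a Mixtral-style MoE layer.
# Nominal numbers only; real models also have shared, non-expert parameters.
num_experts = 8          # experts per layer
active_experts = 2       # top-2 routing: experts actually run per token
params_per_expert = 7e9  # the "7B" in "8x7B"

# Fraction of expert parameters that are active for any one token.
fraction_active = active_experts / num_experts
print(fraction_active)  # 0.25 -> only a quarter of the expert parameters run
```

So per token, roughly a quarter of the expert parameters do any work, which is where the inference-cost savings come from.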

Full Transcript

# Mixture of Experts Explained

**Source:** [https://www.youtube.com/watch?v=sYDlVVyJYn4](https://www.youtube.com/watch?v=sYDlVVyJYn4)
**Duration:** 00:07:45

## Sections

- [00:00:00](https://www.youtube.com/watch?v=sYDlVVyJYn4&t=0s) **Understanding Mixture of Experts** - How Mixture of Experts splits a massive neural model into specialized subnetworks activated by a gating network, saving computation, and how the idea originated in 1991 and has been revived in today's large language models.
- [00:03:15](https://www.youtube.com/watch?v=sYDlVVyJYn4&t=195s) **Sparse Mixture-of-Experts Model Overview** - How a model built from 7-billion-parameter experts uses sparsity and a router network to activate only the most suitable experts per token, reducing computation while handling the complexity of language.
- [00:06:23](https://www.youtube.com/watch?v=sYDlVVyJYn4&t=383s) **Noisy Top-K Gating for Expert Balance** - How adding Gaussian noise via noisy top-k gating improves load balancing among experts in mixture-of-experts models, along with the efficiency gains and training complexity such architectures entail.

## Full Transcript
[0:00] In deep learning, neural networks, including large language models, can be big. Very big. Like hundreds of billions of parameters big. And running them at inference time is usually a very compute-intensive operation.

[0:14] So enter Mixture of Experts, a machine learning approach that divides an AI model into separate subnetworks, or "experts". Each expert focuses on a subset of the input data, and only the relevant experts are activated for a given task, rather than using the entire network for every operation.

[0:35] Now, mixture of experts isn't new. Not at all. It goes back to a paper published in 1991, when researchers proposed an AI system with separate networks, each specializing in different training cases. And their experiment was a hit: the model reached target accuracy in half the training cycles of a conventional model.

[0:57] Now, fast forward to today, and mixture of experts is making a bit of a comeback. It's kind of trendy again, and leading large language models, like ones from Mistral, are using it.

[1:08] So, let's break down the mixture of experts architecture and see what it's made of. Well, we have in our model an input and an output. We also have a bunch of expert networks in between, and there are probably many of them. I'll just draw a few, so we'll have Expert Network number 1, Expert Network number 2, all the way through to Expert Network N. And these sit between the input and the output.

[1:50] Now there is a thing called a gating network, and this sits between the input and the experts. Think of the gating network a bit like a traffic cop, I guess, deciding which experts should handle each subtask. So we get a request in, and the gating network will pick which experts it's going to invoke for that given input.
[2:21] Now the gating network assigns weights as it goes, and those weights are used to combine the experts' results into the final output. So we'll get the results back from those experts and combine them into our output here. We can think of the experts as specialized subnetworks within the bigger neural network, and the gating network is acting as the coordinator, activating only the best experts for each input.

[2:54] So, let's take a look at a real-world example using that Mistral model I mentioned earlier. That's actually called Mixtral, and the specific name is Mixtral 8x7B. It's a large language model, open source, and in this model each layer has a total of eight experts, and each expert consists of 7 billion parameters. That's what the 7B is. Which, on its own, is actually quite small for a large language model. Now, as the model processes each token, like a word or a part of a word, a router network in each layer picks the two most suitable experts out of the eight. These two experts do their thing, their outputs are mixed together, and the combined result moves on to the next layer.

[3:42] So let's take a look at some of the concepts that make up this architecture. The first one I want to mention is called sparsity. In a sparse layer, only a few experts, and their parameters, are activated out of the full set. So we just select a few. This approach cuts down on compute needs, as opposed to sending the requests through the whole network. And sparse layers really shine when dealing with complex, high-dimensional data, like, for example, human language. So think about it: different parts of a sentence might need different types of analysis. You might need one expert that can understand idioms like "it's raining cats and dogs", and then you might need another expert to untangle complex grammar structures.
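The gate-and-combine step described above can be sketched in plain Python. This is a hypothetical toy, not Mixtral's implementation: each "expert" is just a scalar function, the gating scores are supplied by hand, and every expert runs (a dense mixture, before sparsity is applied):

```python
import math

def softmax(scores):
    """Turn raw gating scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy "experts": each one is just a different transformation of the input.
# In a real MoE, each expert is a full feed-forward subnetwork.
experts = [
    lambda x: 2 * x,    # expert 1
    lambda x: x + 10,   # expert 2
    lambda x: x * x,    # expert 3
]

def moe_forward(x, gate_scores):
    """Run every expert, then combine their outputs using the
    gating network's weights (the 'traffic cop' step)."""
    weights = softmax(gate_scores)
    outputs = [expert(x) for expert in experts]
    return sum(w * out for w, out in zip(weights, outputs))

y = moe_forward(3.0, gate_scores=[1.0, 0.5, -1.0])
```

The output is a weighted average of the expert outputs, so it always lands between the smallest and largest individual expert result.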
[4:30] So sparse mixture of experts models are great at this, because they can call in just the right experts for each part of the input, allowing for specialized processing.

[4:40] Now another important concept is the concept of routing. This refers to how the gating network decides which expert to use. There are various ways to do this, but getting it right is key. If the routing strategy is off, some experts might end up undertrained, or they might end up too specialized, which can make the whole network less effective.

[5:06] So here's how routing typically works. The router predicts how likely each expert is to give the best output for a given input. This prediction is based on the strength of connections between the expert and the current data. Mixtral, for example, uses what is called a "top-k" routing strategy, where k is the number of experts selected. Specifically, it uses top-2 routing, meaning it picks the best two out of its eight experts for each task.

[5:42] Now, while this approach has its advantages, it can also lead to some challenges. And that leads us to our next concept, and that is load balancing.

[5:55] In mixture of experts models there's a potential issue where the gating network may converge to consistently activate only a few experts. This creates a bit of a self-reinforcing cycle, because if certain experts are disproportionately selected early on, they receive more training, leading to more reliable outputs, and consequently, these experts are chosen more frequently while others remain underutilized. That's an imbalance that can result in a significant portion of the network becoming ineffective, essentially turning into computational overhead.

[6:33] Now, to solve this, researchers developed a technique specifically for top-k, and it's called "noisy top-k" gating.
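The top-k selection step can be sketched like this (a toy illustration that assumes raw router scores are already computed; real routers typically softmax-renormalize over the selected k, as done here):

```python
import math

def top_k_routing(scores, k=2):
    """Keep only the k highest-scoring experts; all others get
    weight 0 (this is the sparsity). The kept scores are
    softmax-normalized so the k surviving weights sum to 1."""
    # Indices of the k largest scores.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i])[-k:]
    exps = {i: math.exp(scores[i]) for i in chosen}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(len(scores))]

# Eight experts, as in Mixtral 8x7B; top-2 routing keeps two of them.
scores = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.7]
weights = top_k_routing(scores, k=2)
# Only experts 1 and 3 (scores 2.0 and 1.5) get nonzero weight.
```

Only the chosen experts are actually evaluated in a real model, which is what makes the layer cheap despite its large total parameter count.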
[6:45] Noisy top-k gating introduces Gaussian noise to the probability values predicted for each expert during the selection process. The controlled randomness promotes a more evenly distributed activation of experts.

[7:01] So mixture of experts offers a bunch of advantages in efficiency and performance, but it's not without its challenges. It introduces model complexity, which can make training more difficult and time-consuming. The routing mechanism, while powerful, adds another layer of intricacy to the model architecture, and issues like load balancing and potential underutilization of experts require careful tuning and monitoring. But still, for many applications, particularly large-scale language models where computational resources are at a premium, the improved efficiency and specialized processing capabilities of the mixture of experts architecture make it a compelling option.
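The noisy top-k idea described in the transcript can be sketched as follows. This is a simplified toy: the noise scale here is a fixed constant, whereas the original noisy top-k formulation learns a per-expert noise scale from the input:

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def noisy_top_k(scores, noise_scale=0.5, k=2):
    """Add Gaussian noise to the router's scores before the top-k
    cut, so borderline experts occasionally get selected (and
    therefore trained), which helps balance the load."""
    noisy = [s + random.gauss(0.0, noise_scale) for s in scores]
    chosen = sorted(range(len(noisy)), key=lambda i: noisy[i])[-k:]
    exps = {i: math.exp(noisy[i]) for i in chosen}
    total = sum(exps.values())
    return [exps.get(i, 0.0) / total for i in range(len(noisy))]

scores = [0.1, 2.0, -1.0, 1.5, 0.0, 0.3, -0.5, 0.7]
weights = noisy_top_k(scores)
# Still exactly k=2 experts active, but which two can vary run to run,
# spreading gradient updates across more of the expert pool.
```

Note that the noise only perturbs *selection*; each forward pass still activates exactly k experts, so the inference cost is unchanged.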