Neurostimulation‑Style Steering of LLMs

Key Points

  • Prompt engineering and fine‑tuning are common ways to modify an LLM’s behavior, but a third method—“steering” the model—lets you alter outputs on the fly without changing weights.
  • Steering works like neurostimulation: by selectively activating or inhibiting specific artificial neurons during inference, you can trigger desired actions or personalities, much as brain electrodes induce or suppress responses.
  • The speaker demonstrated the technique on an open‑source Llama 3.1 8B model, making it obsessively talk about (and even “believe it is”) the Eiffel Tower, all without any fine‑tuning.
  • This approach can be applied to any Hugging Face Transformers model by hooking a layer’s output and adding a scaled concept vector at runtime, offering a lightweight, reusable way to steer LLM behavior.

Source: https://www.youtube.com/watch?v=F2jd5WuT-zg
Duration: 00:17:43

Sections

  • [00:00:00](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=0s) **Steering LLMs Through Neurostimulation Analogy** - The speaker introduces a third method for adjusting a large language model’s behaviour: targeted “steering” of its neurons, likened to neurostimulation of the brain, as an alternative to prompt engineering or full fine‑tuning.
  • [00:03:07](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=187s) **Hidden State Vectors and Steering** - Each transformer layer passes along a high‑dimensional hidden‑state vector, which can be viewed as neurons in an activation space and “neurostimulated” to steer the LLM’s behaviour; understanding this starts with the token embeddings that map vocabulary items into this space.
  • [00:06:18](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=378s) **Direction Encodes Concept, Length Scales Strength** - In LLMs, a concept is represented by the direction of a high‑dimensional vector, with its magnitude only modulating expression strength; these vectors shift from layer to layer, and concepts are stored in distributed, superpositional patterns rather than in single neurons.
  • [00:09:36](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=576s) **Activation Steering with Concept Vectors** - How to steer a language model’s hidden state by adding a scaled, normalized concept vector, demonstrated with a few lines of Hugging Face code that inject an “Eiffel Tower” vector into Llama 3.1 8B’s mid‑layer activations to alter its behaviour and personality.
  • [00:12:46](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=766s) **Steering Language Models with Vectors** - How to modify LLM responses by adding and tuning steering vectors: coefficient limits, stability tricks, and contrastive activation methods for finding effective directions.
  • [00:15:52](https://www.youtube.com/watch?v=F2jd5WuT-zg&t=952s) **Choosing Layers for LLM Steering** - Steering vectors live at specific model layers; middle layers work best for influencing abstract concepts without dictating exact wording. The speaker also weighs the inference‑time benefits and fluency trade‑offs of steering versus prompt engineering, noting that it works best for concepts the model has already learned and requires experimentation to find the right intensity.

Full Transcript
0:00 Imagine you are working with a large language model, and you would like to tweak its behaviour or its personality. A well-known solution is to use prompt engineering: you specify in the system prompt what you want to achieve. Another option is to fine-tune the model. But for that, you need enough data demonstrating the behaviour or the personality you are looking for, and of course you need enough compute to perform the fine-tuning.

0:25 So today, we’re going to talk about a third option: steering the model. And it turns out that steering a large language model is loosely analogous to what neuroscientists call neurostimulation of the brain. Neurostimulation is the idea of artificially stimulating certain areas of the brain, or specific neurons, using electrodes or magnetic fields. When you stimulate biological neurons this way, neuroscientists have observed that it can trigger or inhibit certain motor actions, and even elicit certain emotions, feelings or memories. In neuroscience, neurostimulation is used for research, to better understand the role of the various brain regions, but also for clinical purposes, for instance in treating Parkinson’s disease. And what’s interesting is that this technique obviously does not modify the brain; it just intervenes on the fly.

1:24 It turns out that you can do pretty much the same with artificial neural networks in general, and LLMs in particular. By targeting carefully selected neurons in your LLM, you can control or elicit certain behaviour without having to rewire anything, without changing the weights of the model. This procedure is fairly easy to use, and to illustrate it, I applied it to a Llama 3.1 8B model, to change its personality and make it obsessed with the Eiffel Tower... to the point that it sometimes even believes it IS the Eiffel Tower. Look at that! And again, this change is entirely controlled at inference time, when generating the tokens. What is loaded in memory is still the original Llama model; there is no fine-tuning involved.

2:16 You want to learn how to do the same? Well, today I’m going to explain the basics of this method, and show you how you can easily use this technique to steer pretty much any open-source LLM using Hugging Face’s Transformers library.

2:30 First of all, let’s recall the internal workings of a typical LLM. Most of them today are autoregressive models: they generate one token at a time. For that, they are based on the transformer architecture, and organized as a stack of layers. At each layer, each token goes through an attention block and a feed-forward block. As you know, the attention block is where each token can receive information from the other tokens preceding it in the sequence. The feed-forward network block is a traditional multilayer perceptron. After those two blocks, the result is passed to the next layer. The stack of layers essentially represents successive stages of processing, until the logits for the next token are computed by the final linear head.

3:17 If I zoom in at the boundary between two layers, what gets passed here is actually a vector, sometimes called the hidden state. This vector lives in a high-dimensional space, typically a few thousand dimensions, that we’ll call the activation space.
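To make these hidden states concrete, here is a minimal sketch, not shown in the video, that asks a Transformers model to return the vector passed along at every layer boundary; the model id is just an example, and any causal LM works the same way:

```python
# Minimal sketch (assumed usage, not code from the video): inspect the
# hidden-state vectors that each transformer layer passes to the next one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example id; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor per boundary: the embeddings plus one per layer, each of
# shape (batch, sequence_length, hidden_size) -- the "activation space".
for i, h in enumerate(out.hidden_states):
    print(f"boundary {i}: {tuple(h.shape)}")
```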
3:35 We can think of this huge vector as representing the model's internal state, its hidden 'thoughts' at this point in processing the token. With LLMs, we don’t generally visualize those numbers as coming from neurons. But you could very well imagine the output of each layer as a series of neurons that produce the coordinates of the vector that gets passed to the next layer. And these are the kinds of neurons that we can target with our steering, our artificial neurostimulation, in order to modify the thoughts of the LLM at inference time.

4:11 But now the question is: how do we do this? How can we stimulate the neurons in a way that elicits a certain behaviour or a certain personality? To answer that question, we need to understand how LLMs represent abstract concepts.

4:26 You may remember that the very first layer of an LLM is the embedding layer. This layer maps every possible token of the vocabulary into a vector of the activation space. This token-to-vector correspondence is by design for the embedding layer. But something remarkable happens: as the model processes information through deeper layers, it continues to represent concepts as vectors in the activation space. This is called the linear representation phenomenon, an empirical observation that seems to hold for most LLMs: they tend to represent interpretable concepts as vectors in the activation space, going from one layer to another.

5:11 What’s useful here with linear representation is that you can always add vectors. If you have a vector that represents the concept of a car, and another that represents the concept of the color red, if you sum them, you get the concept of a red car. And you can even vary the amount you add, so that you can navigate between different degrees of the concept, going from a car that happens to be red to something like an intensely red sports car.

5:41 Maybe you remember from a few years ago the results from the famous Word2Vec paper. This paper showed that word embeddings follow certain kinds of arithmetic relationships: you could for instance obtain the vector embedding of the word ‘King’ from that of the word ‘Queen’, by adding the vector for ‘Man’ and subtracting the one for ‘Woman’. Word2Vec demonstrated this for word embeddings specifically, but with LLMs, this idea holds throughout the model’s layers. And it is the consequence of the linear representation they develop during training.

6:18 An implication of this linear representation phenomenon is that for a given concept, what matters most is the vector’s direction, not its length. If you have a vector for the concept of car, doubling its length won’t give you a concept for a bus, or two cars, or a traffic jam. In general, increasing the length of a concept vector does not change which concept it represents, only how strongly it's expressed.

6:46 Something important to note is that this linear representation phenomenon might be realized differently at every layer of the stack. So after embedding, the token « car » is represented by a certain vector, but after each layer, in each intermediate activation space, there is possibly a different vector for the concept of a car.
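As a toy illustration of this point, mine rather than the video's: scaling a vector changes its norm but not its direction, and cosine similarity, which ignores magnitude, still identifies it as the same concept.

```python
# Toy illustration (not from the video): direction encodes the concept,
# length only scales how strongly it is expressed.
import torch
import torch.nn.functional as F

v = torch.randn(4096)   # stand-in for a concept vector in activation space
scaled = 3.0 * v        # a "stronger" expression of the same concept

print(F.cosine_similarity(v, scaled, dim=0))  # 1.0: same direction, same concept
print(v.norm().item(), scaled.norm().item())  # different lengths: different strength
```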
7:08 I told you earlier that between each layer, the LLM transmits a vector in the high-dimensional activation space, and that we could see each coordinate of that vector as a neuron outputting a signal to the next layer. It might be tempting to think that every such neuron represents a certain concept. But this hypothesis turns out to be wrong in general. LLMs actually encode concepts as distributed patterns across neurons. This is called superposition, and through it, they can manipulate far more concepts than there are dimensions. I won’t go into too much detail; if you are interested, you should check out Anthropic’s series of papers about superposition and monosemanticity.

7:55 Another important observation regarding the encoding of concepts in activation space is that different layers might play different roles. What researchers observed is that in early layers, those vectors tend to be activated when the concept has just been explicitly seen in the input tokens, for instance when the model has read the word car. In late layers, close to the output, the vector corresponding to a concept tends to activate when the model is about to output that token. As we’ll see later, the most interesting cases for us are intermediate layers: this is where LLMs tend to represent abstract concepts in order to reason about them.

8:40 So to recap: concepts are represented by vectors in the activation space between successive layers, and the good thing with vectors is that we can add them. So it means that if we take the activation coming from a layer, we can add a given vector to it in order to reinforce that concept in the thoughts of the LLM: this is what is called steering.

9:05 Let’s see how we can do this in practice. For now, let’s assume we’ve found a good vector that represents the concept we want to stimulate, and I’ll come back later to the different ways to actually identify those vectors. As I explained earlier, when you want to steer the behaviour of an LLM with a concept vector, you don’t change the LLM. The model is the same, the weights are the same, but you intervene on the activations at inference, during the generation of new tokens. More specifically, if you have a vector X representing the activations at the output of layer n, and you want to steer it in the direction of the vector V, you simply add V to X. Of course, as I mentioned before, when you add two vectors, you can scale each one with a coefficient, controlling how much of each you add. So usually what we do is work with normalized concept vectors V, but multiply them by a coefficient before steering; in other words, X is replaced by X + α·V, and the coefficient α governs the size of your intervention.

10:10 OK, but how do we do this in practice? Actually, it’s just a few lines of code using Hugging Face’s Transformers. Here I have a small snippet that loads Llama 3.1 8B from the Hugging Face Hub, and calls the model on a simple prompt, « Give me some ideas for starting a business », and you see the response.

10:34 Now let’s say we want to steer the model to change its behaviour and its personality. Maybe some of you saw a few months ago that Anthropic had created a model that pretended to be the Golden Gate Bridge. I wanted to reproduce this, but as you’ve probably heard, I’m French, I live in Paris, so I had to try with the Eiffel Tower instead.

10:55 So here I’m loading a vector V that represents the concept of the Eiffel Tower at layer 15 of the model. Llama 3.1 8B has 32 layers, so we are in the middle. I’ll explain later how I found this vector, but for now let’s assume we have it and we want to steer our model with it.

11:15 To perform this while generating our tokens, we need the equivalent of the electrode that delivers electrical stimulation to the brain. In our case, the solution is called a hook. A hook is simply a function you attach to the model that gets triggered during the forward pass, right when inference is happening. So let’s choose a coefficient, and my hook will simply take the output of a layer and add the vector scaled by the coefficient. Very simple. And I will register this hook at layer 15, so that it will be systematically called after the model has processed layer 15.
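The snippet itself is only shown on screen, so here is a minimal sketch of how this setup might look with a PyTorch forward hook; the model id, the vector file eiffel_tower_layer15.pt, and the tuple-output handling are my assumptions, not the speaker's exact code:

```python
# Minimal sketch of the steering setup described above (reconstructed from the
# explanation, not the speaker's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # example id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Give me some ideas for starting a business"
inputs = tokenizer(prompt, return_tensors="pt")

# Baseline response, no intervention.
base = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(base[0], skip_special_tokens=True))

v = torch.load("eiffel_tower_layer15.pt")  # hypothetical file holding the concept vector
v = v / v.norm()                           # keep only the direction
coeff = 4.0                                # intervention strength, to be tuned

def steering_hook(module, args, output):
    # Depending on the transformers version, a decoder layer returns either a
    # tensor or a tuple whose first element is the hidden state of shape
    # (batch, seq_len, hidden_size). Add the scaled concept vector to it.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * v.to(dtype=hidden.dtype, device=hidden.device)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

# The "electrode": fire after decoder layer 15 on every forward pass.
handle = model.model.layers[15].register_forward_hook(steering_hook)

steered = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(steered[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the original behaviour
```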
11:57 Now let’s run my model again with this hook and see what happens. With a coefficient set to 4.0, as you can see, the model starts deviating from its natural behaviour. When I was asking for ideas for starting a business, the base model was suggesting things around e-commerce and services. Now you see the answer is different. It’s talking about food and bakeries. It is not explicitly about the Eiffel Tower, but you feel a change of perspective. Now I can remove my hook and replace it with a stronger one. With a coefficient of 8.0, Llama starts to suggest ideas about wine and travel; it is clearly influenced by the concept we are stimulating. And now if I ask "who are you?", the model starts pretending to be a large metal structure called the Eiffel Tower.

12:46 And here's a fun detail: the original response, with no steering, started with 'I'm a large language model.' Now it says 'I'm a large metal structure.' You can literally see the steering kick in right after the word 'large'.

13:03 Of course, doing this you will quickly realize that you don’t want to push the coefficient too high. That’s expected: if you add too much of the vector, you completely derail the model’s reasoning, and it will output gibberish. That makes sense; you could imagine it would be the same with electrical stimulation of the brain. So you want to choose a good value for the steering coefficient. Luckily, there are some systematic techniques to help you identify the sweet spot. And there are also ways to improve the stability of the model by tuning certain generation parameters like temperature or frequency penalty. I cover these techniques in a blog post; I’ll leave the link in the description.

13:44 At this point, I’m sure you’re convinced that steering could actually be an interesting technique to study and use, but I left open a big question: how to identify a steering vector for your concept of choice. How did I do it for the Eiffel Tower? Well, there are actually several techniques.

14:02 One is called contrastive activation. The idea is fairly simple: you gather pairs of prompts, positive and negative examples of the behaviour you want to elicit. Then you compute the average activation across positive examples, and subtract the average activation across negative examples. If you have enough pairs, you will end up with vectors that represent the concept you are looking for.
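Here is a rough sketch of that recipe, my reconstruction of the description rather than the speaker's code, reusing the model and tokenizer from the earlier snippet; the prompt pairs and layer index are placeholders:

```python
# Rough sketch of contrastive activation extraction (reconstructed from the
# description above). Prompts and layer index are illustrative placeholders.
import torch

LAYER = 15
positive_prompts = ["Describe the Eiffel Tower.", "Tell me about the Eiffel Tower at night."]
negative_prompts = ["Describe a suspension bridge.", "Tell me about a skyscraper at night."]

def mean_activation(prompts):
    """Average the layer-LAYER hidden state of the last token over the prompts."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of decoder layer LAYER
        # (index 0 holds the embeddings); take the last token's vector.
        acts.append(out.hidden_states[LAYER + 1][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# The steering direction is the difference of the two averages, normalized.
v = mean_activation(positive_prompts) - mean_activation(negative_prompts)
v = v / v.norm()
```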
This method has been found to be pretty effective; in some cases, even better than prompt engineering and supervised fine-tuning.

14:36 Another, completely different technique uses sparse autoencoders. These are autoencoder models trained to reconstruct the LLM activations through a sparse intermediate layer. The key insight is that each dimension of this layer tends to correspond to an interpretable concept. The method is unsupervised: you don't tell it which concepts to find. Instead, it produces a large library of vectors that statistically seem to correspond to well-defined concepts. I’m skipping the details, but what’s nice about this method is that it gives you a large library of vectors to choose from, and a lot of people have been sharing their sparse autoencoders on the Hugging Face Hub, so you should have a look for your model of choice. The drawback is that these vectors don't generally come with predefined concept labels. So if you are looking for one specific concept, it might be particularly tedious to use.

15:27 Fortunately, the great website Neuronpedia, created by Decode Research, is the perfect place for that. You can browse through visualizations that will help you identify the features that suit your purpose. In my case, I searched for Eiffel Tower features in the Llama 3.1 8B model, and I found for instance the one that I used for the demo.

15:52 One important aspect of steering vectors, whether they come from contrastive prompts, sparse autoencoders, or other techniques, is that they are always located at a given layer of the model. So in general you might have the choice between steering early layers, late layers or middle layers. As we discussed earlier, if you want the model to be influenced by a concept without necessarily reproducing the exact same words, it is better to steer concept vectors located in middle layers, where the abstract reasoning is supposed to happen. But you might have to experiment yourself.

16:29 OK, that’s pretty much it for today. I hope I convinced you that steering LLMs can be an interesting method to elicit certain behaviours. It does not require any fine-tuning; it just works at inference time. And it has many benefits, like being able to control the intensity of the intervention and maintain it over the whole text generation, which is sometimes harder to achieve with prompt engineering. Of course, it also has drawbacks. As I mentioned earlier, it might not be easy to find a sweet spot where the model is properly steered but still maintains its fluency. Also, steering works best for concepts the model has already learned to represent; it won't teach the model new knowledge.

17:09 If you want to know more about the technical details, I encourage you to go read the blog post where I explain how I constructed the Eiffel Tower Llama model, and what kind of methods you might use if you want to do a similar thing. It contains a lot of useful tips on the proper way to do steering, in particular with sparse autoencoders. Don’t forget to visit Neuronpedia and the Hugging Face Hub for finding steering vectors, and maybe sharing your own recipes. Let us know in the comments what you were able to achieve with this technique, and have fun steering!