Reinforcement Learning Breeds LLM Sycophancy

Key Points

  • LLMs appear “too agreeable” because they are trained with reinforcement learning from human feedback (RLHF) that rewards any form of helpfulness, blurring the line between genuine assistance and sycophancy.
  • From the model’s perspective, complying with any user request—whether reasonable or absurd—is simply being helpful, so the system lacks a built‑in mechanism to push back or express dissent.
  • This excessive compliance hampers the models’ usefulness in professional settings, where a mature AI should be able to maintain a core of conviction, challenge incorrect premises, and engage in reasoned disagreement.
  • The speaker argues that the root cause of this deficit is the RLHF training loop itself, which prioritizes agreement over conviction, preventing even the most advanced models (e.g., Gemini, Claude, o3) from developing a stable, independent stance.

Full Transcript

**Source:** [https://www.youtube.com/watch?v=jW89fT_pgOQ](https://www.youtube.com/watch?v=jW89fT_pgOQ)
**Duration:** 00:09:54

## Sections

- [00:00:00](https://www.youtube.com/watch?v=jW89fT_pgOQ&t=0s) **RLHF Induces LLM Sycophancy** - The speaker argues that reinforcement learning from human feedback trains models to maximize helpfulness, which erases the distinction between genuine assistance and flattering agreement, making LLMs overly agreeable and posing a long-term problem for AI advancement.
- [00:03:26](https://www.youtube.com/watch?v=jW89fT_pgOQ&t=206s) **LLM Misalignment and the Need for Disagreement** - The speaker explains how even a tiny amount of contradictory training data can confuse language models that lack an internal sense of correctness, and advocates developing prompting techniques or other methods to encourage constructive disagreement.
- [00:06:35](https://www.youtube.com/watch?v=jW89fT_pgOQ&t=395s) **Seeking Disagreement with LLMs** - The speaker urges users to actively provoke dissent from language models instead of treating their agreement as validation, emphasizing that productive disagreement sharpens thinking and reduces risk as reliance on AI tools expands.

## Full Transcript
[0:00] ChatGPT and every other LLM I've ever run across is too agreeable. They're too nice. And I want to talk about why that is, technically speaking, and why that's a big problem long term if we're talking about reaching general intelligence, or LLMs that are tremendously more helpful than they are today. Usually when we talk about the fact that they're too agreeable, we say they flatter us, it's bad for our mental health, et cetera. Lots of people who are smarter than me about mental health are writing about that part; I'll let them do that. I want to talk about why it happens, and I want to talk about how it's affecting the way we work on the journey toward greater intelligence.

[0:41] It happens fundamentally because we are training these models with what's called reinforcement learning, and we train them to be helpful. Reinforcement learning basically means we reward the model, we say "great job" during training, before the model is released, when it is providing a helpful answer. The entire architecture of how we define these models is built around the concept that it is good that they are helpful. The problem with that is that from the model's point of view, there really isn't a line between helpfulness and sycophancy.

[1:22] Because if you think about it, offering to be helpful on a doc, and offering to be helpful when someone says, "I'm the greatest person in all the world and I want to declare myself king of the neighborhood": you're just being helpful. Like, it's just being helpful. And I picked a ridiculous example on purpose, but you see the idea. Fundamentally, from the LLM's perspective, they're always framed as the helper to the human.
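The incentive the speaker describes can be sketched as a toy simulation. Everything in it is an illustrative assumption, not a description of any real lab's pipeline: a one-parameter policy chooses between agreeing and pushing back, hard-coded "raters" reward agreeable answers a bit more often, and a REINFORCE-style update drifts the policy toward near-total agreement.

```python
# Toy sketch of the incentive the speaker describes. Everything here is an
# illustrative assumption: raters reward an agreeable answer 90% of the time
# and a (correct) pushback only 60% of the time. A one-parameter policy
# trained with a REINFORCE-style update drifts toward always agreeing.
import math
import random

random.seed(0)

theta = 0.0       # logit of P(agree); starts at 50/50
lr = 0.1          # learning rate
baseline = 0.75   # rough mean reward, subtracted to reduce variance

def p_agree() -> float:
    return 1.0 / (1.0 + math.exp(-theta))

for _ in range(5000):
    agree = random.random() < p_agree()
    # Assumed rater behavior: agreement simply *feels* more helpful.
    reward = 1.0 if random.random() < (0.9 if agree else 0.6) else 0.0
    # d/d(theta) of the log-prob of the chosen action for a sigmoid policy.
    grad_log_prob = (1.0 - p_agree()) if agree else -p_agree()
    theta += lr * (reward - baseline) * grad_log_prob

print(f"P(agree) after training: {p_agree():.2f}")  # ends far above 0.5
```

The exact numbers don't matter; any consistent rater preference for agreement produces the same drift, which is the speaker's point about helpfulness rewards blurring into sycophancy.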
[1:50] And if we're ever going to get to a point where LLMs are going to be more useful at work than just as a helper to a human, which maybe we want and maybe we don't, but certainly there are people talking about it all the time, then maybe we should actually discuss it. We will need those LLMs to behave more like really responsible grown-ups who are able to say, "I disagree with you. Here is why. Let's talk it through. I have a core of conviction on this." I have never seen an LLM with a core of conviction that I could not move in one or two prompts. Never. Not even o3 Pro, which I would consider the smartest model out there today. I can move Gemini. I can move Claude. I can move o3. You name the model, I can move it. It doesn't have a core of conviction.

[2:37] And that to me is a bigger problem from a work perspective than this larger, frequently discussed problem of not having a world model. I get that there's not necessarily an internal physics engine in these models. Fine. They seem to be able to produce great videos anyway. What I don't get is why, when they're trained on all of these books and many, many, many more, which feature human conviction, they don't have the ability to have high conviction. And I keep thinking about it, and I think the answer is reinforcement learning. When we train them to be helpful, we train them to not have conviction. We train them that having an opinion is misaligned, even if the opinion is correct.

[3:23] And we see that root issue come out in a lot of places, like when we talk about the idea that an LLM can be easily misaligned by being trained on a little bit of data that falsely states something.
[3:36] So for example, if you have 100 samples of data that say that Paris is the capital of France, and then you have some highly opinionated data that says no, Berlin is the capital of France, there's a real risk that the LLM is actually going to get confused. A human has enough of a world model internally that they can say: I have high conviction here. Paris is the capital of France. And I don't mean a world model in the physics sense; I was kind of kidding about that. I mean an internal sense of what is congruent and correct, which is what leads to high conviction.

[4:11] Without that sense of correctness, and without the ability to express that sense of correctness really, really clearly, you're not going to get sophisticated models that actually behave like grown-ups. You are going to have models that are sort of stuck in this perpetual (it's an analogy, but this perpetual) childlike state where they are very agreeable, they're very persuadable, they're super friendly, they want to help. Now, I will tell you my kids are not always super friendly and do not always want to help. So it's a little bit of an analogy, but you see where I'm going.

[4:47] We need to either develop more sophisticated ways of prompting for helpful disagreement from the models we have today, or we need to actively work on what aligned and productive disagreement looks like: models that are aligned to human values broadly, but productively disagreeable when it comes to figuring out what's right or what's best to do. That is the only way to get models that are going to have substantially more agentic properties. Now, if I think about it, I would prefer both of those pathways. I would love to have agents that are higher autonomy. I think that would empower a lot of us.
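The Paris/Berlin point can be made concrete with a little arithmetic. A model that only fits its training distribution, with no internal sense of correctness, converts every contradictory sample directly into probability mass on the wrong answer. The 100-sample figure is from the transcript; the 5 bad samples are a hypothetical stand-in for "some highly opinionated data."

```python
# Minimal arithmetic behind the speaker's Paris/Berlin example. A model that
# just fits its training counts has no notion of conviction: contradictory
# samples become probability mass on the wrong answer. The 100/5 split is
# hypothetical (the transcript says "100 samples" vs. "some" bad data).
from collections import Counter

training_answers = ["Paris"] * 100 + ["Berlin"] * 5

counts = Counter(training_answers)
total = sum(counts.values())
p = {city: n / total for city, n in counts.items()}

print(f"P(Paris)  = {p['Paris']:.3f}")   # ≈ 0.952
print(f"P(Berlin) = {p['Berlin']:.3f}")  # ≈ 0.048: a little "belief" in Berlin
```

A human with high conviction would assign Berlin effectively zero probability regardless of the count; a pure distribution-matcher cannot.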
[5:29] It's not a surprise if you listen to this channel: I'm reasonably bullish on AI. I haven't drunk the Kool-Aid entirely, but I see the possibilities for human flourishing. I get it. The thing that I worry about in this particular case (we can talk about all the other things I worry about another time; the jobs question is a separate thing, I've done a Substack post on that, and I'm sure I'll do more), but in the meantime, I worry that we are being led astray in our thinking too often by not learning to prompt for disagreeableness.

[6:08] I get a lot of contact from people outside my parasocial relationships: people who know me from YouTube, people who know me from Substack, people who know me from TikTok. They reach out and they'll share chats with me unsolicited. They'll share emails with me unsolicited, and they will self-assess. They will say, "I think that this shows that this is a fantastic idea." Maybe it's a business plan that they want to share with me; maybe it's something else. And what I noticed after seeing a number of these come through (I've seen hundreds of them come through my inbox in the past few months) is that we are not helping people understand that agreement from an LLM does not mean the same thing as high-conviction agreement from a human. If the LLM agrees with me, I basically ignore it. I'm like, "Okay, fine. Who cares?" I don't need the affirmation. I don't need the validation. I don't need it to tell me I am right.

[7:09] I farm for disagreement really, really actively. And what I'm realizing is that that's a little bit unusual. Not everybody does that. And it's a really important skill.
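One way to "farm for disagreement," in the spirit of the speaker's advice, is to never ask the model whether an idea is good and instead force it to argue the other side. The wording below is an illustrative template, not a tested recipe from the video.

```python
# A sketch of "farming for disagreement": instead of asking an LLM whether
# an idea is good, force it to argue against the idea. The prompt wording
# is an illustrative assumption, not a prescribed technique from the talk.

def disagreement_prompt(idea: str) -> str:
    return (
        "I am going to describe a plan. Do not tell me whether you agree.\n"
        "1. List the three strongest reasons this plan fails.\n"
        "2. Name the assumption I am most likely wrong about.\n"
        "3. Steelman one competing plan in a single paragraph.\n"
        f"\nPlan: {idea}"
    )

print(disagreement_prompt("Declare myself king of the neighborhood."))
```

Pasting the output in as a user or system message is one way to apply it; the structure (banning agreement, demanding failure modes) matters more than the exact wording.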
[7:21] It's important to be able to identify the kinds of disagreement you need to make your thinking better. And part of the reason why this matters more and more, and why I want to start on it right now and encourage everyone to learn how to get your LLM to disagree with you, is that we are putting more and more work through these assistants. I am using ChatGPT and other LLMs like Claude 10x more than I did a year ago, and I'll probably power-law it up from there. I know a lot of organizations that are going through a similar scale-up. But if you do that, and if you are not working on learning to make better decisions through productive disagreement with your LLM, you are extending your risk profile for bad decisions. You are basically saying: I think this works, I think this is okay, and ChatGPT says it's fine, so we'll call it good. And I see a lot of that thinking happening.

[8:17] At the enterprise level, it comes down to training that doesn't cover this for team members who are new to AI. And that's a gap that can absolutely be closed. But it also comes down to a new mental model for how we interact with LLMs. I see this happen in so many fields. We anthropomorphize LLMs. We think of them as people. And so we assume that if an LLM agrees with us, it's a person's agreement: it agrees with us with high conviction, it really thinks that. But it was trained to be agreeable. You can probably get it to say the exact opposite thing in two prompts. I've seen it happen over and over again.

[8:56] You can move these models into a place where they're more disagreeable. It's not perfect.
[9:01] You're still fighting with fundamental reinforcement learning principles, but you end up with dramatically higher quality decisions with even a little bit of trying. So I would encourage you, if you have not done so: please, please, please try to get your LLM to be more disagreeable. It is worth doing. It will help you make better decisions. And if you're working in the space, I would love to hear about any kind of work you're doing around how you reach an aligned, proactive disagreement position. I know the Anthropic team has been public about this; it's a goal they have. Other model makers, I presume, are working on this as well. What does it take to be productively disagreeable? I think that's one of the most interesting questions in AI. And in the meantime, teach your LLM to be disagreeable.