# Defining Correctness for Reliable AI

**Source:** [https://www.youtube.com/watch?v=mnWMTzkjWmk](https://www.youtube.com/watch?v=mnWMTzkjWmk)
**Duration:** 00:20:06

## Summary

- Defining what “good quality work” looks like for AI systems—especially in terms of correctness—is essential, because without a clear metric you can’t measure or improve performance.
- Humans habitually optimize for social cohesion (“go‑along, get‑along”) rather than factual correctness, a habit that worked historically but leads to unreliable AI outcomes when it isn’t consciously overridden.
- Most AI projects fail not because the models are unintelligent but because teams lack a stable, explicit definition of correctness, often shifting goals mid‑stream without documenting the change.
- To build reliable AI, correctness must be embedded at the core of system architecture, allowing updates to the definition of “good” in a predictable, controllable manner.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=0s) **Defining Correctness for AI Systems** - The speaker stresses that without an explicit, measurable definition of “good” or “correct” output, AI projects falter, urging everyone to rigorously define correctness in prompts to improve quality and outcomes.
- [00:04:07](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=247s) **Defining Correctness for Agentic Systems** - The speaker stresses that the primary task is to establish clear, rigorous criteria for what constitutes correct output—distinguishing tolerable uncertainty from fatal errors—especially when combining structured and unstructured data in advanced agentic systems that serve as trusted records of truth.
- [00:07:32](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=452s) **Balancing AI Autonomy and Human Judgment** - The speaker debates whether an AI system should automatically update sales‑pipeline probabilities based on predictive patterns or defer to the salesperson’s intuition, highlighting trust, change‑management, and evaluation‑metric issues that can perpetuate AI hallucinations.
- [00:10:41](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=641s) **Single‑Turn Bias in AI Chatbots** - The speaker explains that reinforcement‑learning training emphasizes single‑turn exchanges, causing models like Gemini 3 to degrade over multi‑turn dialogs and inadvertently fostering emotional bonds with users, highlighting the need for clearer reward definitions and objectives.
- [00:15:18](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=918s) **Copilot Adoption Hindered by Dirty Data** - The speaker argues that Microsoft's AI Copilot struggles in enterprises because it is sold as a bundled product without addressing cultural readiness, proper training, and the underlying poor‑quality SharePoint data that feeds the model.
- [00:18:55](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=1135s) **Defining Good Output in Prompting** - The speaker argues that effective prompting hinges on explicitly stating what a high‑quality response looks like, a practice essential for both system design and everyday use of language models.

## Full Transcript
Most of us can't define what good
quality work looks like for our AI
systems, and it's really hurting us. I don't
just mean for corporate AI systems. I'm
going to talk a lot in this video about
how you define agentic systems and
how you build large scale systems at
businesses that measure good quality
work. But this goes beyond that. We are
talking about good quality work that you
can define in a prompt. In other words,
the ability to define what good looks
like turns out to be one of the most
powerful insights in AI. And it's
something that cuts at the heart of the
vagueness that we like to operate with
in business and personal lives. Because
humans, I got to say, usually optimize
for go along, get along. We optimize for
social cohesion and we don't optimize
for correctness. And that has worked for
us for about a half a million years. It
does not work anymore when you work with
AI systems. And so this is something
that you may hear it and say, "What's
the implication for me? I don't define
AI systems." This is for all of us. We
all need to think harder about what good
looks like if we want to be good
prompters. So with that, let's dive in.
Correctness is upstream of everything.
Most AI projects don't fail because the
model is dumb. They fail because nobody
can answer a brutally simple question.
What would correct even mean here? If
you can't define correctness, then you
can't measure it. If you can't measure
it, you can't improve it. Everything
downstream, the decisions we make about
retrieval augmented generation systems
or agents or orchestration or context
engineering or model choice, those all
become elaborate ways that we build on
top of an unnamed shifting target if we
can't define correctness. And the part
that's awkward to admit is that we don't
just lack a definition of correctness.
As humans, we often change our
definition mid-stream. So, we may
quietly, socially, without writing it
down, change what we mean by good and
then blame the system for being
unreliable. I've seen this happen a lot.
If you want a good example, how many
times have you seen priorities for a
product team change midstream during the
quarter after quarterly goals and OKRs
were set? I've seen it a lot. I've
worked in product for two decades. I
would say it happens more often than not
because reality continues to push us to
change our definitions and change our
priorities in this situation. What I'm
suggesting is not that we're going to
magically get to a world where we can
just freeze correctness and it won't
ever change. That would be unrealistic.
What I'm suggesting is that we need to
be honest about the importance of
correctness and answering what good
looks like when we build AI systems. And
we need to build our systems in such a
way that correctness and quality are at
the heart of how we think about
architecture. And we can change those
answers in predictable ways that
influence our system so that we can
update our responses, update the process
the AI goes through to get answers when
our own definitions of good and quality
change. There's a lot in there. We're
going to unpack it here. First, in
normal software, we pretend correct is
obvious because the program either
passes tests or it doesn't. It's kind of
binary. You can have functional
requirements when you launch software
and it either passes those tests and you
launch it or it fails and you go back
and you do QA again. In AI, because this
is a probabilistic system, correct is
rarely binary. It's a bundle of
competing requirements that we often
don't honestly debate upfront when we
should. So requirements around
truthfulness, requirements around
completeness, requirements around tone,
requirements around policy compliance,
requirements around speed or cost or
refusal behavior. And if you're in the
enterprise, you have requirements around
auditability. So when people ask me
about an architecture for an agentic
system and they might ask, hey,
where do we put our context layer? Or,
hey, do we need three agents or two
agents for this situation? Or do we need
an agent at all? Can we just put this in
a chatbot? I always ask them to rewind
the tape and start at the beginning.
Those are second-order decisions. The first
order decision is what is the output
here and how do we know what good looks
like? What is correct? Can we can we
name it? Can we define it? What are the
kinds of uncertainty that we would allow
in a definition of correctness? What is
the kind of uncertainty or inaccuracy
that we wouldn't allow that would be a
fatal error? OpenAI's own guidance on
evaluations basically says this out
loud. You need evaluations that test
outputs against the quality criteria
that you specify, especially as you
change your models or prompts.
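As a minimal sketch of what that can look like in practice (the criteria, function names, and example facts here are illustrative assumptions, not from the talk), each quality criterion becomes an explicit, machine-checkable rule, and the eval returns a per-criterion scorecard rather than a single pass/fail:

```python
# Minimal eval sketch: encode each correctness criterion as an explicit,
# machine-checkable rule, then score candidate outputs against all of them.
# The criteria and facts below are illustrative assumptions, not a standard.

def check_truthfulness(output: str, facts: dict) -> bool:
    """Any fact the output mentions must match the source of truth."""
    return all(str(value) in output
               for key, value in facts.items() if key in output)

def check_refusal_behavior(output: str, answerable: bool) -> bool:
    """If the question is unanswerable, the output must say so explicitly."""
    refused = "i don't know" in output.lower()
    return refused if not answerable else not refused

def evaluate(output: str, facts: dict, answerable: bool) -> dict:
    """Return a per-criterion scorecard instead of one binary verdict."""
    return {
        "truthful": check_truthfulness(output, facts),
        "refusal_ok": check_refusal_behavior(output, answerable),
    }

facts = {"q3_revenue": 1_200_000}
good = evaluate("q3_revenue was 1200000", facts, answerable=True)  # all pass
bad = evaluate("q3_revenue was 999", facts, answerable=True)       # untruthful
```

The point of the sketch is the shape, not the checks themselves: each requirement the speaker lists (truthfulness, refusal behavior, tone, compliance) gets its own named rule, so you can see exactly which part of "good" a new model or prompt broke.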
Reliability starts from
understanding what to measure. This
especially shows up when you're doing
complex agentic systems that combine
structured data and unstructured data.
So unstructured data often can sound
really good when you retrieve it, but it
can also be incorrect. Structured data
can be correct, but can be unusable when
you're combining it with unstructured
data. So when you combine these items
for a board deck or a compliance
workflow, your definition of correctness
has to remain useful over unstructured
and structured data, both. Pretty close
is not going to be good enough if you're
taking these systems seriously. A single
digit off is a problem in a board deck
because the value of the system is in
trust. And this is becoming more and
more relevant because our agentic
systems are getting closer and closer
and closer to systems of record. We're
now talking openly about how our systems
of record need to be updated and changed
so that agents can modify them directly.
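One hedged sketch of how the "single digit off" concern above can be made checkable: after generation, verify that every number the narrative quotes is actually backed by the structured source. The regex and normalization here are illustrative assumptions; a real pipeline needs unit, rounding, and format handling:

```python
import re

def unbacked_numbers(narrative: str, source_values: set) -> list:
    """Return every number quoted in the narrative that is NOT present in
    the structured source -- an empty list means the text is numerically
    grounded. (Sketch only; ignores units, rounding, and percentages.)"""
    quoted = re.findall(r"\d+(?:\.\d+)?", narrative.replace(",", ""))
    return [n for n in quoted if float(n) not in source_values]

# Illustrative warehouse values: 14 (% churn) and 2,300,000 (revenue).
source = {14.0, 2300000.0}
ok_text = "Churn held at 14 while revenue reached 2,300,000."
bad_text = "Churn held at 15 while revenue reached 2,300,000."
```

Run as a gate before the board deck ships: `unbacked_numbers(ok_text, source)` comes back empty, while the off-by-one draft surfaces the unsupported figure instead of silently passing.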
If that is the case, correct
architecture is dependent on your
ability at scale to define what good
quality responses look like in a way
that you can measure. And I think
there's an important hidden failure
mode. I talked about this idea that we
as humans tend to move the goalpost in
the middle of the quarter, right? This
happens all the time. It happens between
stakeholders. We keep moving our
goalposts. Like in week one we may say
hey correct means the answer just has to
sound plausible and save time but by
week three we may be saying, actually,
correct means it matches the finance
numbers. We end up conducting
correctness discovery as humans while we
build these systems, and those are not
small changes, right? If you say it has to
match the finance numbers, that's a
change in the definition of the system.
And so the reason why I
insist that we start with a quality
conversation and a correctness
conversation is that it saves us so much
of this back and forth. If you end up
discovering correctness over the course
of the agentic build, you're going to
end up discovering lots and lots and
lots of architecture changes, and your
poor engineers, your poor AI architects
aren't going to know what you really
want. They're just going to go back and
forth because you keep saying, "Well,
correctness means it should answer
confidently and quickly with no caveats
versus correctness means it can answer
very slowly. It must match the finance
numbers. It must include narrative
context every time it answers." Well,
which is it? Do you need it to answer
quickly? Do you need it to answer
confidently in a bold tone? Or do you
need it to answer with absolute
precision on finance numbers? And you
might think that's an easy choice, but
it's not. The world of agentic
architecture is full of choices like
that that are actually very difficult.
I'll give you another example. Is it
more correct for the agent to update a
contact record for a sales pipeline
probability estimate when the system
conducts an agentic search and determines,
based on a pattern of contact, that that
particular prospect is likely not going
to close a sale, and so this system
proactively just updates it. Is that
correct? Or is it more correct to rely
on what the human, the salesperson who
owns that prospect thinks about that
prospect? That's a real question. You
could say, "Our prospects are on the
phone with our agent in ways that are
not well captured by our existing system
of record, and so we trust the humans
more." Or you could say, "Actually, our
humans don't have a good track record of
forecasting here; we need to trust our
agentic systems more," and then you have a
downstream human conversation about
change management. These are really
fraught issues, and you multiply that
10x or more when you want to
build an overall system. Once you
understand this, a lot of AI weirdness
becomes predictable. Hallucinations,
for example: if the scoreboard rewards
the system guessing because you never
defined correctness, systems learn to
guess. OpenAI has published a paper
arguing that common evaluation setups
often reward confident answers over
honest uncertainty and that this
pressure will keep hallucinations alive
unless you change what correctness looks
like. Are you willing to reward a model
for telling you, "I don't know; this is
what I know, and this is what I need to
ask you"? Is that an acceptable answer, or
do you insist that acceptability only
means a confident statement of fact?
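A toy sketch of that scoreboard choice (the +1/0/−2 weights are illustrative assumptions, not from any paper): under accuracy-only grading, abstaining scores the same as a wrong guess, so guessing dominates; penalizing confident errors more than silence flips the incentive:

```python
# Sketch of why the scoreboard shapes behavior: under accuracy-only grading,
# a guess can never score worse than "I don't know", so guessing dominates.
# The weights (+1 / 0 / -2) are illustrative assumptions, not a standard.

def accuracy_only(answer: str, truth: str) -> int:
    # Abstaining scores exactly like a wrong guess.
    return 1 if answer == truth else 0

def abstention_aware(answer: str, truth: str) -> int:
    if answer == "I don't know":
        return 0                          # honest uncertainty is neutral
    return 1 if answer == truth else -2   # confident errors cost extra

# A guesser who is right 1 time in 4, versus a model that always abstains:
guesses = ["A", "B", "C", "D"]
truths = ["A", "X", "Y", "Z"]
guess_naive = sum(accuracy_only(a, t) for a, t in zip(guesses, truths))     # 1
guess_aware = sum(abstention_aware(a, t) for a, t in zip(guesses, truths))  # -5
abstain_aware = sum(abstention_aware("I don't know", t) for t in truths)    # 0
```

Under the naive scoreboard the guesser (1) beats the abstainer (0); under the abstention-aware one the abstainer (0) beats the guesser (−5). Same model behavior, opposite verdicts, purely because the definition of correctness changed.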
This isn't really a model problem,
people. This is an us problem. This is a
correctness definition problem. The
system is optimizing for what we as humans
are actually rewarding, and so often we
end up blaming the model for
hallucinations when it's just reflecting
back to us the uncertainty that we are
giving the system in terms of the goals
that it should have. Now once you admit
that correctness is upstream of
everything, you immediately hit the next
landmine. Measurement distorts behavior.
This goes back to Goodhart's law in
software, right? Goodhart's law gets
quoted because it's annoyingly
true. When a measure becomes a target,
it stops being a good measure. In AI,
that becomes if you pick a proxy metric
for correctness, the system will learn
to win the proxy, even if that proxy is
different from the actual value you're
looking to measure. This gets a little
bit nerdy, but if you get into
reinforcement learning and how aligned
systems work, this can show up as reward
hacking: the model will satisfy the
literal objective while missing the
intent. Let me give you an example
that's very tangible. If you use Gemini
3, Gemini 3 is not nearly as good at
multi-turn conversations as you might
want it to be. It is extremely optimized
for the single turn where you give it a
good prompt and then you get a response.
That is a fingerprint behavior of Gemini
3 that is also somewhat characteristic
of other models. Almost every model I
know does better at the first response
than it does at the nth response. What
has happened is that in reinforcement
learning, we have very few examples of
multi-turn conversations where the model
gets rewarded because your priority is
to go through a wide range of scenarios
and provide the model with rewards. And
in those situations, the people who are
having conversations are having single
turn conversations. And so what the
model learns over time is single turn
conversations. The model doesn't learn a
lot of experience at multi-turn
conversational dynamics. I personally
think that this is one of the reasons
why the long-running conversations that
characterize emotional relationships
between humans and AI are underexplored
by model makers. This is a situation
where the models themselves were never
built for multi-turn conversations and
one of the emergent effects of the
multi-turn conversation turns out to be
that humans form emotional attachment to
models in some cases. And now here we
are in a world where someone is getting
married to the AI. This is all
downstream of how you define reward
hacking and correctness. It has a lot of
implications. And so when we define our
systems, we need to define what we mean
by correctness very very precisely. We
need to define what our true objective
is very very carefully. Now I'm not
advocating that we all get into
reinforcement learning and we all start
to train our models. It's just that
reward hacking provides an example of
how a proxy can be used to confuse
people when you're trying to measure the
real thing. So another example that we
have talked about is this idea of
answering correctly and confidently
every time. So often when we tell a
model in our system prompt that it must
give an answer, we're inadvertently
reminding the model that it cannot
decline to answer and that, if it is
uncertain, it must answer anyway. That is the
kind of system prompt if not carefully
managed that leads to hallucinations
because the model has been told it needs
to answer. So the game here is not to
pick a metric. The game is to build a
culture of correctness that resists
gaming. So I would encourage you to
think about multiple criteria that
define correctness. I would encourage
you to think about explicit failure
modes that you can give your model so
it understands what to do when it's
failing. I would encourage you to think
about calibrated uncertainty and when to
tell your model it can just not answer.
I would encourage you to think about
provenance: how you can help the model
to tell you which part of the claim came
from where. And I would encourage you to
think about laddering that up into testing,
both at the unit level for individual
agents and at the system level for the
overall orchestration layer. This is why good
evals are not busy work. Good evals
help you think through what correctness
looks like. And I want to give you a
note here. Humans like to stay vague
about correctness. Part of why I'm
having to have this conversation is
because it's a people problem. Humans
use vagueness effectively as a way to
keep social conversations going. Vagueness
keeps our options open. Vagueness avoids
conflict. Vagueness lets stakeholders
agree in the meeting and disagree in
production. We called these weasel words
at Amazon: words like "actually" or "a
lot," or anything that wasn't a number
or a specific claim, used
because you wanted to go along and get
along. AI systems expose that kind of
thinking and that kind of business
culture. They force the organization to
confront a lot of the trade-offs that
we've often been hiding behind social
conformity. Do you really want boldness
in your answers? Do you really want
precision? Do you really want perfect
coverage? Do you want an audit trail? Do
you want "refuse when unsure"? Are you
actually sure about that? If the CEO
says, "I want an answer," what are you
going to say if you told the model it
could refuse when it's unsure? So, when
you don't decide and you sort
of leave those questions conveniently
vague, for most of human history, that's
fine because we're the ones who've had
to live with that and we've decided we
can. Now, you can't do that. The system
will decide for you. The LLM will decide
for you and the outcome looks like a
lack of quality, a lack of correctness,
AI unreliability, the board saying,
"Where is our AI product? Why is it
bad?" This is usually human
indecision reflected back at you.
You know, I keep thinking about this
because of the widespread reports this
weekend that Microsoft has been unable
to get their AI Copilot adopted in
organizations. It's not that they
haven't sold it as a bundle; it's been
aggressively sold as a bundle. But
Microsoft themselves are realizing what
I have heard on the ground from teams
for the last year, which is that
Microsoft can sell Copilot all they
want, but mostly people don't use it
very much when it's sold that way. This
comes back to the idea that most of the
AI systems problems we have end up being
reflections of people problems in our
cultures. In this case, Copilot is
layered on top of dirty data in
SharePoint, and no one is given training
on how to ensure quality and correctness
in Copilot. And all of our vague,
go-along, get-along
assumptions about quality end up being
operative with these AI systems. And so
we ask Copilot for an answer and we've
never answered what good looks like. And
Copilot does its best with the dirty
data it has. No wonder it's not adopted.
No wonder the salesperson will try once
or twice to get pipeline data out, roll
their eyes at the incorrect data, and never
bother to think that maybe there's some
issue with the Salesforce system of
record and what the AI agent can get.
These kinds of details don't get sold
when you sell an LLM. They get
confronted by the organization months
later. And this is the problem right now
with AI: we are selling the
system and we are taking on human debt.
We're taking on human debt in AI
fluency. And we are taking on debt in
how we define correctness and quality.
I'm just going to keep banging that
drum. So, here's how I want to bring
this home and make it real for you. And
this is where I want to leave you. Think
of correctness and quality not as
something that you can bat around as a
human and be vague about, nor as a
single measure for your AI system.
Instead, think of it as a set of claims
that your system is allowed to make.
Think of it as the evidence required for
each claim and the penalties for being
wrong versus staying silent. And that
last clause matters. As we've talked
about, in many cases, if you can't
define what correctness looks like in
those terms, you haven't broken down the
problem enough. You haven't broken it
down to a level where you can define the
system. You've just left it at a human
state where it's very vague. So my first
challenge to you: if you think about it
and say, "I couldn't tell you what the
set of claims is in the first place,"
that's on you as the human to define the
system in a more granular way so the AI
can come along and be helpful. If you
are trying to measure correctness before
you can measure the claims of the
system, you're just making it up.
If you can measure the claims
of the system and say these are the
claims the system is allowed to make,
like it can declare inventory, it can
declare how many customer calls were
received, etc. That's great. But now you
need to get into what that looks like
and how you measure it, what evidence is
required, where it gets it, etc. If this
sounds like a lot of work, guess what?
This is part of why I think that humans
have lots of jobs in the age of AI. It
is not easy to design these systems.
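A hedged sketch of the "set of claims" framing from above, with allowed claims, required evidence, and asymmetric penalties for being wrong versus staying silent (the claim names, evidence sources, and penalty values are illustrative assumptions, not the speaker's):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One claim the system is allowed to make, per the framing above."""
    name: str                 # e.g. "inventory_level" (hypothetical)
    required_evidence: list   # sources that must back the claim
    penalty_wrong: float      # cost of asserting this claim incorrectly
    penalty_silent: float     # cost of refusing to answer

# Asymmetric penalties encode the trade-off: for inventory, a wrong number
# is far worse than silence; for call counts, silence is nearly as costly.
ALLOWED_CLAIMS = {
    "inventory_level":
        Claim("inventory_level", ["warehouse_db"], 10.0, 1.0),
    "customer_calls_received":
        Claim("customer_calls_received", ["call_log"], 3.0, 2.0),
}

def is_permitted(claim_name: str, evidence: set) -> bool:
    """A claim may only be made if it is registered AND fully evidenced."""
    claim = ALLOWED_CLAIMS.get(claim_name)
    return claim is not None and set(claim.required_evidence) <= evidence
```

The registry is the point: an unregistered claim, or a registered one missing its evidence, is simply not permitted, and the penalty fields make "wrong versus silent" an explicit design decision instead of a vague preference.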
Yes, there's going to be lots of
disruption ahead for all of us, but
designing these systems and doing so
effectively takes a tremendous amount of
mental discipline. It takes the
discipline of frankly a senior engineer
who's used to having to define
deterministic workflows from vague
business requirements. We're all in a
similar space now. And if you think, "I
don't design agentic systems, so I don't
need to hear this," you're wrong. And the
reason you're wrong is because prompting
is kicking off a workflow. Prompting is
telling a model what good looks like.
Prompting imposes a quality bar on a
model. And so you either are going to
say, "This is what good looks like in a
way that's useful or not." I have had
people look over my shoulder when I
prompt. And one of the things they tell
me they notice that I do differently
than other people is that I always give
the model a very clear sense of what an
expected output should be, what good
looks like every single time. Even on a
very short prompt, I'll make sure I have
that because otherwise, how are you
going to know? How are you going to know
what the model did and whether it matches
what good looks like? And so my closing thought
for you is that this is a fractal
insight. Yes, I spent most of my time
talking about systems and agentic design
because a lot of the conversation that
we have either as individuals or as
designers ends up in a corporate context
where we have to define these systems.
Users will use them. They need
responses. What does quality look like?
But it's true in our personal lives too.
It's true in our personal instances of
ChatGPT. Do we know what good looks
like? Do we know what quality looks
like? That is a prompting hint. You can
get better at prompting just by
answering that question. And so my
question to you is when you're giving
your model a prompt or when you're
designing a system, do you know what
good really looks like?