# Defining Correctness for Reliable AI

**Source:** [https://www.youtube.com/watch?v=mnWMTzkjWmk](https://www.youtube.com/watch?v=mnWMTzkjWmk)
**Duration:** 00:20:06

## Summary

- Defining what “good quality work” looks like for AI systems—especially in terms of correctness—is essential, because without a clear metric you can’t measure or improve performance.
- Humans habitually optimize for social cohesion (“go‑along, get‑along”) rather than factual correctness, a habit that worked historically but leads to unreliable AI outcomes when it isn’t consciously overridden.
- Most AI projects fail not because the models are unintelligent but because teams lack a stable, explicit definition of correctness, often shifting goals mid‑stream without documenting the change.
- To build reliable AI, correctness must be embedded at the core of system architecture, allowing updates to the definition of “good” in a predictable, controllable manner.

## Sections

- [00:00:00](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=0s) **Defining Correctness for AI Systems** - The speaker stresses that without an explicit, measurable definition of “good” or “correct” output, AI projects falter, urging everyone to rigorously define correctness in prompts to improve quality and outcomes.
- [00:04:07](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=247s) **Defining Correctness for Agentic Systems** - The speaker stresses that the primary task is to establish clear, rigorous criteria for what constitutes correct output—distinguishing tolerable uncertainty from fatal errors—especially when combining structured and unstructured data in advanced agentic systems that serve as trusted records of truth.
- [00:07:32](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=452s) **Balancing AI Autonomy and Human Judgment** - The speaker debates whether an AI system should automatically update sales‑pipeline probabilities based on predictive patterns or defer to the salesperson’s intuition, highlighting trust, change‑management, and evaluation‑metric issues that can perpetuate AI hallucinations.
- [00:10:41](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=641s) **Single‑Turn Bias in AI Chatbots** - The speaker explains that reinforcement‑learning training emphasizes single‑turn exchanges, causing models like Gemini 3 to degrade over multi‑turn dialogs and inadvertently fostering emotional bonds with users, highlighting the need for clearer reward definitions and objectives.
- [00:15:18](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=918s) **Copilot Adoption Hindered by Dirty Data** - The speaker argues that Microsoft's AI Copilot struggles in enterprises because it is sold as a bundled product without addressing cultural readiness, proper training, and the underlying poor‑quality SharePoint data that feeds the model.
- [00:18:55](https://www.youtube.com/watch?v=mnWMTzkjWmk&t=1135s) **Defining Good Output in Prompting** - The speaker argues that effective prompting hinges on explicitly stating what a high‑quality response looks like, a practice essential for both system design and everyday use of language models.

## Full Transcript
Most of us can't define what good
quality work looks like for our AI
systems, and it's really hurting us. I don't
just mean for corporate AI systems. I'm
going to talk a lot in this video about
how you define agentic systems and
how you build large scale systems at
businesses that measure good quality
work. But this goes beyond that. We are
talking about good quality work that you
can define in a prompt. In other words,
the ability to define what good looks
like turns out to be one of the most
powerful insights in AI. And it's
something that cuts at the heart of the
vagueness that we like to operate with
in business and personal lives. Because
humans, I got to say, usually optimize
for go along, get along. We optimize for
social cohesion and we don't optimize
for correctness. And that has worked for
us for about a half a million years. It
does not work anymore when you work with
AI systems. And so this is something
that you may hear it and say, "What's
the implication for me? I don't define
AI systems." This is for all of us. We
all need to think harder about what good
looks like if we want to be good
prompters. So with that, let's dive in.
Correctness is upstream of everything.
Most AI projects don't fail because the
model is dumb. They fail because nobody
can answer a brutally simple question.
What would correct even mean here? If
you can't define correctness, then you
can't measure it. If you can't measure
it, you can't improve it. Everything
downstream, the decisions we make about
retrieval augmented generation systems
or agents or orchestration or context
engineering or model choice, those all
become elaborate ways that we build on
top of an unnamed shifting target if we
can't define correctness. And the part
that's awkward to admit is that we don't
just lack a definition of correctness.
As humans, we often change our
definition mid-stream. So, we may
quietly, socially, without writing it
down, change what we mean by good and
then blame the system for being
unreliable. I've seen this happen a lot.
If you want a good example, how many
times have you seen priorities for a
product team change midstream during the
quarter after quarterly goals and OKRs
were set? I've seen it a lot. I've
worked in product for two decades. I
would say it happens more often than not
because reality continues to push us to
change our definitions and change our
priorities in this situation. What I'm
suggesting is not that we're going to
magically get to a world where we can
just freeze correctness and it won't
ever change. That would be unrealistic.
What I'm suggesting is that we need to
be honest about the importance of
correctness and answering what good
looks like when we build AI systems. And
we need to build our systems in such a
way that correctness and quality are at
the heart of how we think about
architecture. And we can change those
answers in predictable ways that
influence our system so that we can
update our responses, update the process
the AI goes through to get answers when
our own definitions of good and quality
change. There's a lot in there. We're
going to unpack it here. First, in
normal software, we pretend correct is
obvious because the program either
passes tests or it doesn't. It's kind of
binary. You can have functional
requirements when you launch software
and it either passes those tests and you
launch it or it fails and you go back
and you do QA again. In AI, because this
is a probabilistic system, correct is
rarely binary. It's a bundle of
competing requirements that we often
don't honestly debate upfront when we
should. So requirements around
truthfulness, requirements around
completeness, requirements around tone,
requirements around policy compliance,
requirements around speed or cost or
refusal behavior. And if you're in the
enterprise, you have requirements around
auditability. So when people ask me
about an architecture for an agentic
system and they might ask, hey,
where do we put our context layer? Or,
hey, do we need three agents or two
agents for this situation? Or do we need
an agent at all? Can we just put this in
a chatbot? I always ask them to rewind
the tape and start at the beginning.
Those are second-order decisions. The first
order decision is what is the output
here and how do we know what good looks
like? What is correct? Can we can we
name it? Can we define it? What are the
kinds of uncertainty that we would allow
in a definition of correctness? What is
the kind of uncertainty or inaccuracy
that we wouldn't allow that would be a
fatal error? OpenAI's own guidance on
evaluations basically says this out
loud. You need evaluations that test
outputs against the quality criteria
that you specify, especially as you
change your models or prompts.
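As a minimal sketch of what that can look like in practice (the criteria, function names, and example facts here are illustrative assumptions, not from the talk), each quality criterion becomes an explicit, machine-checkable rule, and the eval returns a per-criterion scorecard rather than a single pass/fail:

```python
# Minimal eval sketch: encode each correctness criterion as an explicit,
# machine-checkable rule, then score candidate outputs against all of them.
# The criteria and facts below are illustrative assumptions, not a standard.

def check_truthfulness(output: str, facts: dict) -> bool:
    """Any fact the output mentions must match the source of truth."""
    return all(str(value) in output
               for key, value in facts.items() if key in output)

def check_refusal_behavior(output: str, answerable: bool) -> bool:
    """If the question is unanswerable, the output must say so explicitly."""
    refused = "i don't know" in output.lower()
    return refused if not answerable else not refused

def evaluate(output: str, facts: dict, answerable: bool) -> dict:
    """Return a per-criterion scorecard instead of one binary verdict."""
    return {
        "truthful": check_truthfulness(output, facts),
        "refusal_ok": check_refusal_behavior(output, answerable),
    }

facts = {"q3_revenue": 1_200_000}
good = evaluate("q3_revenue was 1200000", facts, answerable=True)  # all pass
bad = evaluate("q3_revenue was 999", facts, answerable=True)       # untruthful
```

The point of the sketch is the shape, not the checks themselves: each requirement the speaker lists (truthfulness, refusal behavior, tone, compliance) gets its own named rule, so you can see exactly which part of "good" a new model or prompt broke.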
Reliability starts from
understanding what to measure. This
especially shows up when you're doing
complex agentic systems that combine
structured data and unstructured data.
So unstructured data often can sound
really good when you retrieve it, but it
can also be incorrect. Structured data
can be correct, but can be unusable when
you're combining it with unstructured
data. So when you combine these items
for a board deck or a compliance
workflow, your definition of correctness
has to remain useful over unstructured
and structured data, both. Pretty close
is not going to be good enough if you're
taking these systems seriously. A single
digit off is a problem in a board deck
because the value of the system is in
trust. And this is becoming more and
more relevant because our agentic
systems are getting closer and closer
and closer to systems of record. We're
now talking openly about how our systems
of record need to be updated and changed
so that agents can modify them directly.
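One hedged sketch of how the "single digit off" concern above can be made checkable: after generation, verify that every number the narrative quotes is actually backed by the structured source. The regex and normalization here are illustrative assumptions; a real pipeline needs unit, rounding, and format handling:

```python
import re

def unbacked_numbers(narrative: str, source_values: set) -> list:
    """Return every number quoted in the narrative that is NOT present in
    the structured source -- an empty list means the text is numerically
    grounded. (Sketch only; ignores units, rounding, and percentages.)"""
    quoted = re.findall(r"\d+(?:\.\d+)?", narrative.replace(",", ""))
    return [n for n in quoted if float(n) not in source_values]

# Illustrative warehouse values: 14 (% churn) and 2,300,000 (revenue).
source = {14.0, 2300000.0}
ok_text = "Churn held at 14 while revenue reached 2,300,000."
bad_text = "Churn held at 15 while revenue reached 2,300,000."
```

Run as a gate before the board deck ships: `unbacked_numbers(ok_text, source)` comes back empty, while the off-by-one draft surfaces the unsupported figure instead of silently passing.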
If that is the case, correct
architecture is dependent on your
ability at scale to define what good
quality responses look like in a way
that you can measure. And I think
there's an important hidden failure
mode. I talked about this idea that we
as humans tend to move the goalpost in
the middle of the quarter, right? This
happens all the time. It happens between
stakeholders. We keep moving our
goalposts. Like in week one we may say
hey correct means the answer just has to
sound plausible and save time but by
week three we may be saying, actually,
correct means it matches the finance
numbers. We end up conducting
correctness discovery as humans while we
build these systems, and those are not
small changes, right? If you say it has to
match the finance numbers, that's a
change in the definition of the system.
And so the reason why I
insist that we start with a quality
conversation and a correctness
conversation is that it saves us so much
of this back and forth. If you end up
discovering correctness over the course
of the agentic build, you're going to
end up discovering lots and lots and
lots of architecture changes, and your
poor engineers, your poor AI architects
aren't going to know what you really
want. They're just going to go back and
forth because you keep saying, "Well,
correctness means it should answer
confidently and quickly with no caveats
versus correctness means it can answer
very slowly. It must match the finance
numbers. It must include narrative
context every time it answers." Well,
which is it? Do you need it to answer
quickly? Do you need it to answer
confidently in a bold tone? Or do you
need it to answer with absolute
precision on finance numbers? And you
might think that's an easy choice, but
it's not. The world of agentic
architecture is full of choices like
that that are actually very difficult.
I'll give you another example. Is it
more correct for the agent to update a
contact record for a sales pipeline
probability estimate when the system
conducts an agentic search and determines,
based on a pattern of contact, that that
particular prospect is likely not going
to close a sale, and so this system
proactively just updates it. Is that
correct? Or is it more correct to rely
on what the human, the salesperson who
owns that prospect thinks about that
prospect? That's a real question. You
could say, "Our prospects are on the
phone with our agent in ways that are
not well captured by our existing system
of record, and so we trust the humans
more." Or you could say, "Actually, our
humans don't have a good track record of
forecasting here; we need to trust our
agentic systems more," and then you have a
downstream human conversation about
change management. These are really
fraught issues, and you multiply that
10x or more when you want to
build an overall system. Once you
understand this, a lot of AI weirdness
becomes predictable. Hallucinations,
for example: if the scoreboard rewards
the system guessing because you never
defined correctness, systems learn to
guess. OpenAI has published a paper
arguing that common evaluation setups
often reward confident answers over
honest uncertainty and that this
pressure will keep hallucinations alive
unless you change what correctness looks
like. Are you willing to reward a model
for telling you, "I don't know; this is
what I know, and this is what I need to
ask you"? Is that an acceptable answer, or
do you insist that acceptability only
means a confident statement of fact?
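A toy sketch of that scoreboard choice (the +1/0/−2 weights are illustrative assumptions, not from any paper): under accuracy-only grading, abstaining scores the same as a wrong guess, so guessing dominates; penalizing confident errors more than silence flips the incentive:

```python
# Sketch of why the scoreboard shapes behavior: under accuracy-only grading,
# a guess can never score worse than "I don't know", so guessing dominates.
# The weights (+1 / 0 / -2) are illustrative assumptions, not a standard.

def accuracy_only(answer: str, truth: str) -> int:
    # Abstaining scores exactly like a wrong guess.
    return 1 if answer == truth else 0

def abstention_aware(answer: str, truth: str) -> int:
    if answer == "I don't know":
        return 0                          # honest uncertainty is neutral
    return 1 if answer == truth else -2   # confident errors cost extra

# A guesser who is right 1 time in 4, versus a model that always abstains:
guesses = ["A", "B", "C", "D"]
truths = ["A", "X", "Y", "Z"]
guess_naive = sum(accuracy_only(a, t) for a, t in zip(guesses, truths))     # 1
guess_aware = sum(abstention_aware(a, t) for a, t in zip(guesses, truths))  # -5
abstain_aware = sum(abstention_aware("I don't know", t) for t in truths)    # 0
```

Under the naive scoreboard the guesser (1) beats the abstainer (0); under the abstention-aware one the abstainer (0) beats the guesser (−5). Same model behavior, opposite verdicts, purely because the definition of correctness changed.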
This isn't really a model problem,
people. This is an us problem. This is a
correctness definition problem. The
system is optimizing for what we as humans
are actually rewarding, and so often we
end up blaming the model for
hallucinations when it's just reflecting
back to us the uncertainty that we are
giving the system in terms of the goals
that it should have. Now once you admit
that correctness is upstream of
everything, you immediately hit the next
landmine. Measurement distorts behavior.
This goes back to Goodhart's law in
software, right? Goodhart's law gets
quoted because it's annoyingly
true. When a measure becomes a target,
it stops being a good measure. In AI,
that becomes if you pick a proxy metric
for correctness, the system will learn
to win the proxy, even if that proxy is
different from the actual value you're
looking to measure. This gets a little
bit nerdy, but if you get into
reinforcement learning and how aligned
systems work, this can show up as reward
hacking: the model will satisfy the
literal objective while missing the
intent. Let me give you an example
that's very tangible. If you use Gemini
3, Gemini 3 is not nearly as good at
multi-turn conversations as you might
want it to be. It is extremely optimized
for the single turn where you give it a
good prompt and then you get a response.
That is a fingerprint behavior of Gemini
3 that is also somewhat characteristic
of other models. Almost every model I
know does better at the first response
than it does at the nth response. What
has happened is that in reinforcement
learning, we have very few examples of
multi-turn conversations where the model
gets rewarded because your priority is
to go through a wide range of scenarios
and provide the model with rewards. And
in those situations, the people who are
having conversations are having single
turn conversations. And so what the
model learns over time is single turn
conversations. The model doesn't learn a
lot of experience at multi-turn
conversational dynamics. I personally
think that this is one of the reasons
why the long-running conversations that
characterize emotional relationships
between humans and AI are underexplored
by model makers. This is a situation
where the models themselves were never
built for multi-turn conversations and
one of the emergent effects of the
multi-turn conversation turns out to be
that humans form emotional attachment to
models in some cases. And now here we
are in a world where someone is getting
married to the AI. This is all
downstream of how you define reward
hacking and correctness. It has a lot of
implications. And so when we define our
systems, we need to define what we mean
by correctness very very precisely. We
need to define what our true objective
is very very carefully. Now I'm not
advocating that we all get into
reinforcement learning and we all start
to train our models. It's just that
reward hacking provides an example of
how a proxy can be used to confuse
people when you're trying to measure the
real thing. So another example that we
have talked about is this idea of
answering correctly and confidently
every time. So often when we tell a
model in our system prompt that it must
give an answer, we're inadvertently
reminding the model that it cannot
decline to answer and that, if it is
uncertain, it must answer anyway. That is the
kind of system prompt if not carefully
managed that leads to hallucinations
because the model has been told it needs
to answer. So the game here is not to
pick a metric. The game is to build a
culture of correctness that resists
gaming. So I would encourage you to
think about multiple criteria that
define correctness. I would encourage
you to think about explicit failure
modes that you can give your model so
it understands what to do when it's
failing. I would encourage you to think
about calibrated uncertainty and when to
tell your model it can just not answer.
I would encourage you to think about
provenance: how you can help the model
to tell you which part of the claim came
from where. And I would encourage you to
think about laddering that up into testing,
both at the unit level for individual
agents and at the system level for the
overall orchestration layer. This is why good
evals are not busy work. Good evals
help you think through what correctness
looks like. And I want to give you a
note here. Humans like to stay vague
about correctness. Part of why I'm
having to have this conversation is
because it's a people problem. Humans
use vagueness effectively as a way to
keep social conversations going. Vagueness
keeps our options open. Vagueness avoids
conflict. Vagueness lets stakeholders
agree in the meeting and disagree in
production. We called these weasel words
at Amazon: words like "actually" or "a
lot," or anything that wasn't a number
or a specific claim, used
because you wanted to go along and get
along. AI systems expose that kind of
thinking and that kind of business
culture. They force the organization to
confront a lot of the trade-offs that
we've often been hiding behind social
conformity. Do you really want boldness
in your answers? Do you really want
precision? Do you really want perfect
coverage? Do you want an audit trail? Do
you want "refuse when unsure"? Are you
actually sure about that? If the CEO
says, "I want an answer," what are you
going to say if you told the model it
could refuse when it's unsure? So, when
you don't decide and you sort
of leave those questions conveniently
vague, for most of human history, that's
fine because we're the ones who've had
to live with that and we've decided we
can. Now, you can't do that. The system
will decide for you. The LLM will decide
for you and the outcome looks like a
lack of quality, a lack of correctness,
AI unreliability, the board saying,
"Where is our AI product? Why is it
bad?" This is usually human
indecision reflected back at you.
You know, I keep thinking about this
because of the widespread reports this
weekend that Microsoft has been unable
to get their AI Copilot adopted in
organizations. It's not that they
haven't sold it as a bundle; it's been
aggressively sold as a bundle. But
Microsoft themselves are realizing what
I have heard on the ground from teams
for the last year, which is that
Microsoft can sell Copilot all they
want, but mostly people don't use it
very much when it's sold that way. This
comes back to the idea that most of the
AI systems problems we have end up being
reflections of people problems in our
cultures. In this case, Copilot is
layered on top of dirty data in
SharePoint, and no one is given training
on how to ensure quality and correctness
in Copilot. And all of our vague,
go-along, get-along
assumptions about quality end up being
operative with these AI systems. And so
we ask Copilot for an answer and we've
never answered what good looks like. And
Copilot does its best with the dirty
data it has. No wonder it's not adopted.
No wonder the salesperson will try once
or twice to get pipeline data out, roll
their eyes at the incorrect data, and never
bother to think that maybe there's some
issue with the Salesforce system of
record and what the AI agent can get.
These kinds of details don't get sold
when you sell an LLM. They get
confronted by the organization months
later. And this is the problem right now
with AI: we are selling the
system and we are taking on human debt.
We're taking on human debt in AI
fluency. And we are taking on debt in
how we define correctness and quality.
I'm just going to keep banging that
drum. So, here's how I want to bring
this home and make it real for you. And
this is where I want to leave you. Think
of correctness and quality not as
something that you can bat around as a
human and be vague about, nor as a
single measure for your AI system.
Instead, think of it as a set of claims
that your system is allowed to make.
Think of it as the evidence required for
each claim and the penalties for being
wrong versus staying silent. And that
last clause matters. As we've talked
about, in many cases, if you can't
define what correctness looks like in
those terms, you haven't broken down the
problem enough. You haven't broken it
down to a level where you can define the
system. You've just left it at a human
state where it's very vague. So my first
challenge to you: if you think about it
and say, "I couldn't tell you what the
set of claims is in the first place,"
that's on you as the human to define the
system in a more granular way so the AI
can come along and be helpful. If you
are trying to measure correctness before
you can measure the claims of the
system, you're just making it up.
If you can measure the claims
of the system and say these are the
claims the system is allowed to make,
like it can declare inventory, it can
declare how many customer calls were
received, etc. That's great. But now you
need to get into what that looks like
and how you measure it, what evidence is
required, where it gets it, etc. If this
sounds like a lot of work, guess what?
This is part of why I think that humans
have lots of jobs in the age of AI. It
is not easy to design these systems.
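A hedged sketch of the "set of claims" framing from above, with allowed claims, required evidence, and asymmetric penalties for being wrong versus staying silent (the claim names, evidence sources, and penalty values are illustrative assumptions, not the speaker's):

```python
from dataclasses import dataclass

@dataclass
class Claim:
    """One claim the system is allowed to make, per the framing above."""
    name: str                 # e.g. "inventory_level" (hypothetical)
    required_evidence: list   # sources that must back the claim
    penalty_wrong: float      # cost of asserting this claim incorrectly
    penalty_silent: float     # cost of refusing to answer

# Asymmetric penalties encode the trade-off: for inventory, a wrong number
# is far worse than silence; for call counts, silence is nearly as costly.
ALLOWED_CLAIMS = {
    "inventory_level":
        Claim("inventory_level", ["warehouse_db"], 10.0, 1.0),
    "customer_calls_received":
        Claim("customer_calls_received", ["call_log"], 3.0, 2.0),
}

def is_permitted(claim_name: str, evidence: set) -> bool:
    """A claim may only be made if it is registered AND fully evidenced."""
    claim = ALLOWED_CLAIMS.get(claim_name)
    return claim is not None and set(claim.required_evidence) <= evidence
```

The registry is the point: an unregistered claim, or a registered one missing its evidence, is simply not permitted, and the penalty fields make "wrong versus silent" an explicit design decision instead of a vague preference.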
Yes, there's going to be lots of
disruption ahead for all of us, but
designing these systems and doing so
effectively takes a tremendous amount of
mental discipline. It takes the
discipline of frankly a senior engineer
who's used to having to define
deterministic workflows from vague
business requirements. We're all in a
similar space now. And if you think, "I
don't design agentic systems, so I don't
need to hear this," you're wrong. And the
reason you're wrong is because prompting
is kicking off a workflow. Prompting is
telling a model what good looks like.
Prompting imposes a quality bar on a
model. And so you either are going to
say, "This is what good looks like in a
way that's useful or not." I have had
people look over my shoulder when I
prompt. And one of the things they tell
me they notice that I do differently
than other people is that I always give
the model a very clear sense of what an
expected output should be, what good
looks like every single time. Even on a
very short prompt, I'll make sure I have
that because otherwise, how are you
going to know? How are you going to know
what the model did and whether it matches
what good looks like? And so my closing thought
for you is that this is a fractal
insight. Yes, I spent most of my time
talking about systems and agentic design
because a lot of the conversation that
we have either as individuals or as
designers ends up in a corporate context
where we have to define these systems.
Users will use them. They need
responses. What does quality look like?
But it's true in our personal lives too.
It's true in our personal instances of
ChatGPT. Do we know what good looks
like? Do we know what quality looks
like? That is a prompting hint. You can
get better at prompting just by
answering that question. And so my
question to you is when you're giving
your model a prompt or when you're
designing a system, do you know what
good really looks like?