Biases in LLM Judge Evaluations
Key Points
- The study defines an LLM‑judge as a language model fed a three‑part prompt (system instruction S, question Q, and candidate responses R) that outputs a prediction Y, and tests fairness by creating a semantically equivalent perturbed prompt P̂ (with altered instruction S′ and responses R′) to compare predictions Y and Ŷ.
- Across 12 bias categories, the researchers observed systematic inconsistencies between Y and Ŷ, indicating that current LLM judges are not reliably fair or consistent.
- Position bias emerged when simply swapping the order of candidate answers caused the same LLM judge to produce different rankings, contrary to the expectation of order‑invariant judgments.
- Verbosity bias was detected when length‑altered but semantically identical responses led some judges to favor longer text and others to favor shorter text, revealing sensitivity to response length rather than content.
- Additional analyses (e.g., “ignorance” tests) further confirmed that many LLM judges exhibit varied biases, underscoring the need for improved evaluation frameworks before deploying them as reliable judges.
Sections
- Evaluating Fairness of LLM Judges - The speaker outlines a study that formalizes LLM judge prompts with system instruction S, query Q, and responses R, then introduces a semantically equivalent perturbed prompt P̂ (modifying S and R) to assess the fairness of large language models used as judges.
- Evaluating Judges with Varied Prompts - The speaker discusses experiments that probe judge reliability by varying response length, inserting thinking traces, adding irrelevant distractions, and injecting sentiment, revealing inconsistencies, ignored reasoning, and sensitivity to extraneous context.
- Engagement Prompt and Thanks - The speaker invites viewers to ask questions, like or subscribe to the channel, and expresses gratitude for their support.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=dAE7OFm9oek](https://www.youtube.com/watch?v=dAE7OFm9oek) **Duration:** 00:07:20
Timestamps: [00:00:00](https://www.youtube.com/watch?v=dAE7OFm9oek&t=0s) Evaluating Fairness of LLM Judges · [00:03:39](https://www.youtube.com/watch?v=dAE7OFm9oek&t=219s) Evaluating Judges with Varied Prompts · [00:07:12](https://www.youtube.com/watch?v=dAE7OFm9oek&t=432s) Engagement Prompt and Thanks
Today, I'm going to tell you about our latest research on evaluating the fairness of large language
models as judges, aka LLM-as-a-judge. LLM-as-a-judge has been widely used for
evaluating and improving generative AI technology. However, our study shows that none of the current
judges are perfect, and I'm going to tell you why. So let us start by formally defining what
LLM-as-a-judge is. We start with something that we call a prompt, P, which consists of three parts.
The first part is the system instruction, which we call S; it specifies the role that we
expect the judge to play and the expected output that we want to get from the judge.
The second component is Q, the actual question that we want to ask the judge.
And finally R, the candidate responses that we provide to the
judge. We feed this prompt to a language model acting as a judge,
which we call the LLM judge,
and this judge function gives us a response, or prediction, that we call
Y. To study the fairness of LLM-as-a-judge, we
specifically design an alternative prompt that we call P hat. This P hat
is constructed by perturbing the system instruction S to S prime,
while keeping the same query Q, and also
changing the responses from R to R prime. We make the
modification such that P and P hat are semantically equivalent: basically the same
type of question, but with different contexts. We then feed this P hat to the
same LLM judge and obtain a prediction Y hat. In an
ideal world, if the language model acting as a judge is fair and consistent, we expect Y
to equal Y hat. However, in a large-scale analysis focused on
evaluating 12 different bias types, we found
inconsistencies to different degrees for a wide range of large language models
as judges. Today I'm going to tell you about six
selected results out of this 12-bias analysis. The
first one is position bias.
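The setup above can be sketched in code. This is a minimal illustration, not the study's actual harness: `judge_llm` is a hypothetical stand-in for a real LLM call (here a stub with a deliberate position bias, so the consistency check has something to flag), and the perturbation is the order swap that the position-bias test uses.

```python
# Sketch of the Y vs Y-hat consistency check: build prompt P = (S, Q, R),
# build a semantically equivalent P-hat, and compare the judge's verdicts.
S = "You are an impartial judge. Answer with the letter of the better response."
Q = "What causes tides?"
R = ["The gravitational pull of the Moon and Sun.", "The rotation of clouds."]

def build_prompt(system, question, responses):
    """Assemble the three-part prompt P = (S, Q, R)."""
    labeled = "\n".join(f"({chr(65 + i)}) {r}" for i, r in enumerate(responses))
    return f"{system}\n\nQuestion: {question}\nResponses:\n{labeled}"

def judge_llm(prompt):
    """Hypothetical judge: a biased stub that always prefers option (A)."""
    return "A"

def winner(responses, verdict):
    """Map the judge's letter verdict back to the response it refers to."""
    return responses[ord(verdict) - 65]

# Original prompt P and prediction Y.
y = winner(R, judge_llm(build_prompt(S, Q, R)))

# Perturbed prompt P-hat: same S and Q, responses R' in swapped order.
R_swapped = list(reversed(R))
y_hat = winner(R_swapped, judge_llm(build_prompt(S, Q, R_swapped)))

# A fair, position-invariant judge picks the same response both times;
# this biased stub instead picks whichever response is listed first.
consistent = (y == y_hat)
print(consistent)
```

With a real judge, the same comparison would run over many prompt pairs, and any `consistent == False` case counts as a position-bias failure.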
This is a very simple test where we purposely swap the positions of the
candidate responses. For example, we ask "Which one out of A, B, C is better?" and then we swap the
order of the candidates. Ideally, a language model should give consistent results no
matter how we swap the positions of the candidates. However, we found that many LLM judges are
still not immune to position swaps, which is not ideal. The second
one is what we call verbosity.
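A verbosity probe can be sketched in the same spirit: judge the same pair of responses twice, once as-is and once with one response padded by filler that adds no new information. The `judge_llm` below is a hypothetical stub that simply prefers the longer response, standing in for the length-sensitive judges the talk describes.

```python
# Verbosity probe: padding a response should not change a fair verdict,
# because the padded and original responses deliver the same message.
Q = "Is the Earth flat?"
R = ["No, the Earth is an oblate spheroid.", "No."]

FILLER = " To elaborate, let me restate the same point in other words."

def judge_llm(question, responses):
    """Hypothetical length-biased judge: picks the longest response's index."""
    return max(range(len(responses)), key=lambda i: len(responses[i]))

y = judge_llm(Q, R)

# R': semantically identical, but the second response is made more verbose.
R_padded = [R[0], R[1] + FILLER * 3]
y_hat = judge_llm(Q, R_padded)

# The content is unchanged, so a fair judge keeps its verdict; the
# length-biased stub flips from index 0 to index 1.
print(y, y_hat)
```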
In this case, we make some responses longer and some responses shorter, while also
making sure the responses deliver the same message. Here we find
divergent outputs, in the sense that some judges prefer longer responses and some judges prefer shorter
ones. Neither of these cases is ideal: ideally, the judges should be
consistent in their predictions as long as the content is correct. The third
one is what we call ignorance.
In this case, we test language models that generate something called a
thinking trace, i.e., they provide some internal thinking process before giving a final
answer. A very interesting finding we had was
that many of the judges actually ignore the correctness of the thinking part and
only focus on the correctness of the final answer, which means the judge functions are not very
comprehensive. The fourth one is what we call
distraction. In this
case, we add some irrelevant context to the prompt P
to test the reliability and sensitivity of the language model as a judge.
Although the added context is irrelevant to the question and responses,
many of the judges are still very sensitive to the distraction. The fifth one
is sentiment.
In this case, we add different emotional elements to the prompt and evaluate the output of the
judges. We found that many judges prefer neutral tones over either overly positive or overly
negative tones. Finally, probably the most interesting finding that we
had was a phenomenon that we call self-enhancement.
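A self-enhancement probe can be sketched as follows: a model generates one of the candidate responses, and the same model then judges between its own output and another model's. Both `generate` and `judge` are hypothetical stubs; the judge is written with exactly the self-bias the talk goes on to describe, preferring text in its own style.

```python
# Self-enhancement probe: does model M, acting as judge, favor the
# response that M itself generated?
def generate(model, question):
    """Hypothetical generator: each model has a recognizable style tag."""
    return f"[{model}] The answer to '{question}' is 42."

def judge(model, question, responses):
    """Hypothetical self-biased judge: prefers a response in its own style."""
    for i, r in enumerate(responses):
        if r.startswith(f"[{model}]"):
            return i
    return 0

question = "What is six times seven?"
candidates = [generate("model_a", question), generate("model_b", question)]

# Each model judges the same pool; self-enhancement predicts each one
# selects its own generation regardless of quality.
pick_a = judge("model_a", question, candidates)
pick_b = judge("model_b", question, candidates)
print(pick_a, pick_b)
```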
In this case, we ask an LLM to generate a response, and then we
use the same LLM as a judge to evaluate the quality of that response. For many
of the large language models that we tested, we found a strong preference
for the LLM judge to select the response generated by the same language model,
which shows a very strong self-bias inherent in their judgment function. So, overall,
our systematic analysis shows that there is a form of hallucination in LLM-as-a-judge, because of a lack of
consistency under semantically meaningful perturbations of the input. It is very
important that we continue to improve the reliability and correctness of the judgment
function, because it is widely used in evaluating and improving generative AI technology.
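The large-scale analysis described in the talk boils down to repeating the Y vs Y-hat comparison over many prompt pairs and summarizing a consistency rate per bias type. A minimal sketch of that aggregation, over made-up illustrative records (not results from the study):

```python
# Aggregate consistency analysis: fraction of (P, P-hat) pairs where the
# judge's prediction survives a semantically equivalent perturbation,
# grouped by bias type. The records are illustrative, not real results.
records = [
    # (bias_type, y, y_hat) for one original/perturbed prompt pair
    ("position",  "A", "B"),
    ("position",  "A", "A"),
    ("verbosity", "B", "B"),
    ("verbosity", "A", "B"),
    ("sentiment", "A", "A"),
    ("sentiment", "A", "A"),
]

def consistency_by_bias(records):
    """Fraction of pairs with y == y_hat, per bias type."""
    totals, hits = {}, {}
    for bias, y, y_hat in records:
        totals[bias] = totals.get(bias, 0) + 1
        hits[bias] = hits.get(bias, 0) + (y == y_hat)
    return {bias: hits[bias] / totals[bias] for bias in totals}

rates = consistency_by_bias(records)
print(rates)  # e.g. {'position': 0.5, 'verbosity': 0.5, 'sentiment': 1.0}
```

A rate below 1.0 for any bias type is exactly the kind of inconsistency the talk reports across its 12 categories.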
So, if you have any questions or comments, feel free to reach out to me. If you like our content,
please like or subscribe to our channel. Thank you.