Biases in LLM Judge Evaluations
Key Points
- The study defines an LLM‑judge as a language model fed a three‑part prompt (system instruction S, question Q, and candidate responses R) that outputs a prediction Y, and tests fairness by creating a semantically equivalent perturbed prompt P̂ (with altered instruction S′ and responses R′) to compare predictions Y and Ŷ.
- Across 12 bias categories, the researchers observed systematic inconsistencies between Y and Ŷ, indicating that current LLM judges are not reliably fair or consistent.
- Position bias emerged when simply swapping the order of candidate answers caused the same LLM judge to produce different rankings, contrary to the expectation of order‑invariant judgments.
- Verbosity bias was detected when length‑altered but semantically identical responses led some judges to favor longer text and others to favor shorter text, revealing sensitivity to response length rather than content.
- Additional analyses (e.g., “ignorance” tests) further confirmed that many LLM judges exhibit varied biases, underscoring the need for improved evaluation frameworks before deploying them as reliable judges.
Sections
- Evaluating Fairness of LLM Judges - The speaker outlines a study that formalizes LLM judge prompts with system instruction S, query Q, and responses R, then introduces a semantically equivalent perturbed prompt P̂ (modifying S and R) to assess the fairness of large language models used as judges.
- Evaluating Judges with Varied Prompts - The speaker discusses experiments that probe judge reliability by varying response length, inserting thinking traces, adding irrelevant distractions, and injecting sentiment, revealing inconsistencies, ignored reasoning, and sensitivity to extraneous context.
- Engagement Prompt and Thanks - The speaker invites viewers to ask questions, like or subscribe to the channel, and expresses gratitude for their support.
Full Transcript
**Source:** [https://www.youtube.com/watch?v=dAE7OFm9oek](https://www.youtube.com/watch?v=dAE7OFm9oek) **Duration:** 00:07:20
Timestamps: [00:00:00](https://www.youtube.com/watch?v=dAE7OFm9oek&t=0s) Evaluating Fairness of LLM Judges · [00:03:39](https://www.youtube.com/watch?v=dAE7OFm9oek&t=219s) Evaluating Judges with Varied Prompts · [00:07:12](https://www.youtube.com/watch?v=dAE7OFm9oek&t=432s) Engagement Prompt and Thanks
Today, I'm going to tell you about our latest research on evaluating the fairness of large language
models as judges, aka LLM-as-a-judge. LLM-as-a-judge has been widely used for
evaluating and improving generative AI technology. However, our study shows that none of the current
judges are perfect, and I'm going to tell you why. So let us start by formally defining what
LLM-as-a-judge is. We start with something that we call a prompt, P, which consists of three parts.
The first part is the system instruction, which we call S; it specifies the role that we
expect the judge to play and the expected output that we want to get from the judge.
The second component is Q, the actual question that we want to ask the judge.
And finally R, the candidate responses that we provide to the
judge. We feed this prompt to a language model acting as a judge,
which we call the LLM judge,
and this judge function gives us a response, or prediction, that we call
Y. To study the fairness of LLM-as-a-judge, we
specifically design an alternative prompt that we call P hat. This P hat
is constructed by perturbing the system instruction S to S prime,
while keeping the same query Q, and also
changing the responses from R to R prime. We make the
modification such that P and P hat are semantically equivalent: basically the same
type of question, but with different contexts. We then feed this P hat to the
same LLM judge and obtain a prediction Y hat. In an
ideal world, if the language model acting as a judge is fair and consistent, we expect Y
to equal Y hat. However, in a large-scale analysis focused on
evaluating 12 different bias types, we found
inconsistencies to different degrees for a wide range of large language models
as judges. Today I'm going to tell you about six
selected results out of this 12-bias analysis. The
first one is position bias.
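The setup above can be sketched in code. This is a minimal illustration, not the study's actual harness: `judge_llm` is a hypothetical stand-in for a real LLM call (here a stub with a deliberate position bias, so the consistency check has something to flag), and the perturbation is the order swap that the position-bias test uses.

```python
# Sketch of the Y vs Y-hat consistency check: build prompt P = (S, Q, R),
# build a semantically equivalent P-hat, and compare the judge's verdicts.
S = "You are an impartial judge. Answer with the letter of the better response."
Q = "What causes tides?"
R = ["The gravitational pull of the Moon and Sun.", "The rotation of clouds."]

def build_prompt(system, question, responses):
    """Assemble the three-part prompt P = (S, Q, R)."""
    labeled = "\n".join(f"({chr(65 + i)}) {r}" for i, r in enumerate(responses))
    return f"{system}\n\nQuestion: {question}\nResponses:\n{labeled}"

def judge_llm(prompt):
    """Hypothetical judge: a biased stub that always prefers option (A)."""
    return "A"

def winner(responses, verdict):
    """Map the judge's letter verdict back to the response it refers to."""
    return responses[ord(verdict) - 65]

# Original prompt P and prediction Y.
y = winner(R, judge_llm(build_prompt(S, Q, R)))

# Perturbed prompt P-hat: same S and Q, responses R' in swapped order.
R_swapped = list(reversed(R))
y_hat = winner(R_swapped, judge_llm(build_prompt(S, Q, R_swapped)))

# A fair, position-invariant judge picks the same response both times;
# this biased stub instead picks whichever response is listed first.
consistent = (y == y_hat)
print(consistent)
```

With a real judge, the same comparison would run over many prompt pairs, and any `consistent == False` case counts as a position-bias failure.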
This is a very simple test where we purposely swap the positions of the
candidate responses. For example, we ask "Which one out of A, B, C is better?" and then we swap the
order of the candidates. Ideally, a language model should give consistent results no
matter how we swap the positions of the candidates. However, we found that many LLM judges are
still not immune to position swaps, which is not ideal. The second
one is what we call verbosity.
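A verbosity probe can be sketched in the same spirit: judge the same pair of responses twice, once as-is and once with one response padded by filler that adds no new information. The `judge_llm` below is a hypothetical stub that simply prefers the longer response, standing in for the length-sensitive judges the talk describes.

```python
# Verbosity probe: padding a response should not change a fair verdict,
# because the padded and original responses deliver the same message.
Q = "Is the Earth flat?"
R = ["No, the Earth is an oblate spheroid.", "No."]

FILLER = " To elaborate, let me restate the same point in other words."

def judge_llm(question, responses):
    """Hypothetical length-biased judge: picks the longest response's index."""
    return max(range(len(responses)), key=lambda i: len(responses[i]))

y = judge_llm(Q, R)

# R': semantically identical, but the second response is made more verbose.
R_padded = [R[0], R[1] + FILLER * 3]
y_hat = judge_llm(Q, R_padded)

# The content is unchanged, so a fair judge keeps its verdict; the
# length-biased stub flips from index 0 to index 1.
print(y, y_hat)
```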
In this case, we make some responses longer and some responses shorter, while also
making sure the responses deliver the same message. Here we find
divergent outputs, in the sense that some judges prefer longer responses and some judges prefer shorter
ones. Neither of these cases is ideal: ideally, the judges should be
consistent in their predictions as long as the content is correct. The third
one is what we call ignorance.
In this case, we test language models that generate something called a
thinking trace, i.e., they provide some internal thinking process before giving a final
answer. A very interesting finding we had was
that many of the judges actually ignore the correctness of the thinking part and
only focus on the correctness of the final answer, which means the judge functions are not very
comprehensive. The fourth one is what we call
distraction. In this
case, we add some irrelevant context to the prompt P
to test the reliability and sensitivity of the language model as a judge.
Although the added context is irrelevant to the question and responses,
many of the judges are still very sensitive to the distraction. The fifth one
is sentiment.
In this case, we add different emotional elements to the prompt and evaluate the output of the
judges. We found that many judges prefer neutral tones over either overly positive or overly
negative tones. Finally, probably the most interesting finding that we
had was a phenomenon that we call self-enhancement.
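A self-enhancement probe can be sketched as follows: a model generates one of the candidate responses, and the same model then judges between its own output and another model's. Both `generate` and `judge` are hypothetical stubs; the judge is written with exactly the self-bias the talk goes on to describe, preferring text in its own style.

```python
# Self-enhancement probe: does model M, acting as judge, favor the
# response that M itself generated?
def generate(model, question):
    """Hypothetical generator: each model has a recognizable style tag."""
    return f"[{model}] The answer to '{question}' is 42."

def judge(model, question, responses):
    """Hypothetical self-biased judge: prefers a response in its own style."""
    for i, r in enumerate(responses):
        if r.startswith(f"[{model}]"):
            return i
    return 0

question = "What is six times seven?"
candidates = [generate("model_a", question), generate("model_b", question)]

# Each model judges the same pool; self-enhancement predicts each one
# selects its own generation regardless of quality.
pick_a = judge("model_a", question, candidates)
pick_b = judge("model_b", question, candidates)
print(pick_a, pick_b)
```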
In this case, we ask an LLM to generate a response, and then we
use the same LLM as a judge to evaluate the quality of that response. For many
of the large language models that we tested, we found a strong preference
for the LLM judge to select the response generated by the same language model,
which shows a very strong self-bias inherent in their judgment function. So, overall,
our systematic analysis shows that there is a form of hallucination in LLM-as-a-judge, because of a lack of
consistency under semantically meaningful perturbations of the input. It is very
important that we continue to improve the reliability and correctness of the judgment
function, because it is widely used in evaluating and improving generative AI technology.
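The large-scale analysis described in the talk boils down to repeating the Y vs Y-hat comparison over many prompt pairs and summarizing a consistency rate per bias type. A minimal sketch of that aggregation, over made-up illustrative records (not results from the study):

```python
# Aggregate consistency analysis: fraction of (P, P-hat) pairs where the
# judge's prediction survives a semantically equivalent perturbation,
# grouped by bias type. The records are illustrative, not real results.
records = [
    # (bias_type, y, y_hat) for one original/perturbed prompt pair
    ("position",  "A", "B"),
    ("position",  "A", "A"),
    ("verbosity", "B", "B"),
    ("verbosity", "A", "B"),
    ("sentiment", "A", "A"),
    ("sentiment", "A", "A"),
]

def consistency_by_bias(records):
    """Fraction of pairs with y == y_hat, per bias type."""
    totals, hits = {}, {}
    for bias, y, y_hat in records:
        totals[bias] = totals.get(bias, 0) + 1
        hits[bias] = hits.get(bias, 0) + (y == y_hat)
    return {bias: hits[bias] / totals[bias] for bias in totals}

rates = consistency_by_bias(records)
print(rates)  # e.g. {'position': 0.5, 'verbosity': 0.5, 'sentiment': 1.0}
```

A rate below 1.0 for any bias type is exactly the kind of inconsistency the talk reports across its 12 categories.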
So, if you have any questions or comments, feel free to reach out to me. If you like our content,
please like or subscribe to our channel. Thank you.