Research Paper

Decoupled Reward Normalization for Stable Multi‑Reward RL

Author: Shih-Yang Liu
Organization: Hugging Face
Published: 2026-01-09 • Added: 2026-01-09

Key Insights

  • Directly applying GRPO’s group‑wise normalization to a mixture of rewards collapses distinct advantage signals into near‑identical values, hurting learning dynamics.
  • GDPO separates (decouples) the normalization step for each reward component, preserving their relative magnitudes before a final batch‑wise advantage scaling.
  • This two‑stage normalization yields higher and more consistent reward scores (correctness and format) across diverse tasks such as tool‑calling, math, and code reasoning.
  • GDPO improves training stability, reducing occurrences of early divergence or stagnation that are observed with GRPO in multi‑reward settings.
  • The method is agnostic to the underlying policy model (e.g., Qwen2.5‑Instruct‑1.5B) and can be plugged into existing RL pipelines that employ multiple reward signals.
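The two-stage idea can be illustrated numerically. The sketch below (function names hypothetical, and not the paper's exact formulation) contrasts GRPO-style normalization of a pooled reward sum with a decoupled scheme that normalizes each reward component before combining and applying a final batch-wise scaling:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO-style: sum the reward components first, then normalize the
    # combined reward across the group of rollouts. A large-scale reward
    # can dominate, washing out signal from smaller rewards (e.g. format).
    total = rewards.sum(axis=1)                          # (G,)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # Decoupled sketch: normalize each reward component separately so
    # each retains its relative signal, then combine and apply a final
    # batch-wise advantage scaling.
    per = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)  # (G, K)
    adv = per.sum(axis=1)                                # combine decoupled signals
    return (adv - adv.mean()) / (adv.std() + 1e-8)       # final batch-wise scaling

# Toy group of 4 rollouts with a correctness reward on a 0-10 scale
# and a binary format reward; columns are reward components.
R = np.array([[10.0, 0.0],
              [ 0.0, 1.0],
              [10.0, 1.0],
              [ 0.0, 0.0]])
print(grpo_advantages(R))  # format reward barely shifts the advantage
print(gdpo_advantages(R))  # both rewards contribute comparable signal
```

In the pooled version the correctness reward's larger scale dominates the advantage, while the decoupled version weights both components' normalized signals equally before the final scaling.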

Abstract

Multi-reward reinforcement learning suffers from reward normalization collapse under GRPO; GDPO addresses this by decoupling the normalization of each reward component, improving training stability and performance across reasoning tasks.

Full Analysis

**Source:** [HuggingFace](https://huggingface.co/papers/2601.05242) | [arXiv](https://arxiv.org/abs/2601.05242)

*Topics: reinforcement-learning, efficiency*
*Difficulty: advanced*
*Upvotes: 74*