# Decoupled Reward Normalization for Stable Multi‑Reward RL
**Authors:** Shih-Yang Liu,
**Source:** [HuggingFace](https://huggingface.co/papers/2601.05242) | [arXiv](https://arxiv.org/abs/2601.05242)
**Published:** 2026-01-09
**Organization:** Hugging Face
## Summary
- Directly applying GRPO’s group‑wise normalization to a mixture of rewards collapses distinct advantage signals into near‑identical values, hurting learning dynamics.
- GDPO decouples the normalization step for each reward component, preserving their relative magnitudes before a final batch‑wise advantage scaling.
- This two‑stage normalization yields higher and more consistent reward scores (correctness and format) across diverse tasks such as tool‑calling, math, and code reasoning.
- GDPO improves training stability, reducing the early divergence or stagnation observed with GRPO in multi‑reward settings.
- The method is agnostic to the underlying policy model (e.g., Qwen2.5‑Instruct‑1.5B) and can be plugged into existing RL pipelines that employ multiple reward signals.
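The paper's exact formulation is not reproduced here, but the contrast between the two normalization schemes can be sketched as follows. In this minimal NumPy illustration (function names and the `1e-8` stabilizer are assumptions, not the authors' code), the GRPO-style baseline sums reward components first and then normalizes the total within a group, whereas the decoupled variant normalizes each component separately before combining and applying a final scaling:

```python
import numpy as np

EPS = 1e-8  # assumed numerical stabilizer, not specified by the paper


def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style: sum reward components per sample, then normalize
    the total within the group. A large-scale component can dominate
    the sum, washing out the signal from smaller-scale components.

    rewards: array of shape (group_size, n_components)
    """
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + EPS)


def gdpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Decoupled sketch: normalize each reward component within the
    group separately (preserving each component's relative signal),
    combine, then apply a final batch-wise advantage scaling.
    """
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + EPS)
    combined = normed.sum(axis=1)
    return (combined - combined.mean()) / (combined.std() + EPS)
```

For example, with a correctness reward on a 0–10 scale and a format reward on a 0–1 scale, the summed-then-normalized GRPO advantage is driven almost entirely by correctness, while the decoupled version keeps both signals at comparable magnitude before combining them.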
## Abstract
Multi-reward reinforcement learning with GRPO suffers from reward normalization collapse; GDPO addresses this by decoupling the normalization of each reward component, improving training stability and performance across reasoning tasks.
---
*Topics: reinforcement-learning, efficiency*
*Difficulty: advanced*
*Upvotes: 74*