Research Paper

Decoupled Reward Normalization for Stable Multi‑Reward RL

Author: Shih-Yang Liu
Organization: Hugging Face
Published: 2026-01-09 • Added: 2026-01-09

Key Insights

  • Directly applying GRPO’s group‑wise normalization to a mixture of rewards collapses distinct advantage signals into near‑identical values, hurting learning dynamics.
  • GDPO separates (decouples) the normalization step for each reward component, preserving their relative magnitudes before a final batch‑wise advantage scaling.
  • This two‑stage normalization yields higher and more consistent reward scores (correctness and format) across diverse tasks such as tool‑calling, math, and code reasoning.
  • GDPO improves training stability, reducing occurrences of early divergence or stagnation that are observed with GRPO in multi‑reward settings.
  • The method is agnostic to the underlying policy model (e.g., Qwen2.5‑Instruct‑1.5B) and can be plugged into existing RL pipelines that employ multiple reward signals.
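The two-stage idea can be illustrated numerically. The sketch below (function names hypothetical, and not the paper's exact formulation) contrasts GRPO-style normalization of a pooled reward sum with a decoupled scheme that normalizes each reward component before combining and applying a final batch-wise scaling:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO-style: sum the reward components first, then normalize the
    # combined reward across the group of rollouts. A large-scale reward
    # can dominate, washing out signal from smaller rewards (e.g. format).
    total = rewards.sum(axis=1)                          # (G,)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # Decoupled sketch: normalize each reward component separately so
    # each retains its relative signal, then combine and apply a final
    # batch-wise advantage scaling.
    per = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)  # (G, K)
    adv = per.sum(axis=1)                                # combine decoupled signals
    return (adv - adv.mean()) / (adv.std() + 1e-8)       # final batch-wise scaling

# Toy group of 4 rollouts with a correctness reward on a 0-10 scale
# and a binary format reward; columns are reward components.
R = np.array([[10.0, 0.0],
              [ 0.0, 1.0],
              [10.0, 1.0],
              [ 0.0, 0.0]])
print(grpo_advantages(R))  # format reward barely shifts the advantage
print(gdpo_advantages(R))  # both rewards contribute comparable signal
```

In the pooled version the correctness reward's larger scale dominates the advantage, while the decoupled version weights both components' normalized signals equally before the final scaling.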

Abstract

Multi-reward reinforcement learning suffers from reward normalization collapse under GRPO; GDPO addresses this by decoupling the normalization of each reward component, improving training stability and performance across reasoning tasks.

Full Analysis

**Source:** [HuggingFace](https://huggingface.co/papers/2601.05242) | [arXiv](https://arxiv.org/abs/2601.05242)

*Topics: reinforcement-learning, efficiency*
*Difficulty: advanced*
*Upvotes: 74*