Tree‑Search Guided Multi‑Turn Policy Optimization

Authors: Zefang Zong,
Organization: Hugging Face
Published: 2026-01-09 • Added: 2026-01-09

Key Insights

  • Turn‑level tree search generates diverse, forward‑looking trajectories, improving exploration in multi‑turn environments.
  • By formulating learning objectives at the turn level, AT²PO gives clearer credit assignment across long horizons.
  • The framework combines model‑free policy gradients with lookahead search in a single agentic RL algorithm.
  • Reported empirical results show consistent gains over prior multi‑turn baselines on tasks such as strategic games and conversational agents.
  • Shallow, parallelizable trees keep computational overhead low, which helps keep the method practical for deployment.
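
The ideas in the list above (turn‑level branching, shallow trees, per‑turn credit) can be illustrated with a toy sketch. This is not the AT²PO algorithm, whose details are not given here. It is a minimal illustration under stated assumptions: `TurnNode`, `expand`, `backup`, `assign_turn_advantages`, `toy_propose`, and `toy_reward` are all hypothetical names, the "policy" is a fixed two‑action proposer, and the per‑turn credit signal is assumed to be each turn's value minus the mean value of its siblings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TurnNode:
    """One agent turn in the search tree (hypothetical structure)."""
    action: str
    parent: Optional["TurnNode"] = None
    children: List["TurnNode"] = field(default_factory=list)
    value: float = 0.0      # mean leaf reward over this node's subtree
    advantage: float = 0.0  # value minus the mean value of its siblings


def history(node: Optional[TurnNode]) -> List[str]:
    """Return the actions on the path from the root (excluded) to `node`."""
    actions = []
    while node is not None and node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return actions[::-1]


def expand(node: TurnNode,
           propose: Callable[[List[str], int], List[str]],
           depth: int, branching: int) -> None:
    """Branch at every turn, down to a fixed (shallow) depth."""
    if depth == 0:
        return
    for action in propose(history(node), branching):
        child = TurnNode(action=action, parent=node)
        node.children.append(child)
        expand(child, propose, depth - 1, branching)


def backup(node: TurnNode, reward: Callable[[List[str]], float]) -> float:
    """Score leaves with `reward`; inner nodes take the mean of their children."""
    if not node.children:
        node.value = reward(history(node))
    else:
        node.value = sum(backup(c, reward) for c in node.children) / len(node.children)
    return node.value


def assign_turn_advantages(node: TurnNode) -> None:
    """Assumed credit signal: each turn's value relative to its sibling mean."""
    for child in node.children:
        # node.value is already the mean of the children's values.
        child.advantage = child.value - node.value
        assign_turn_advantages(child)


# Toy stand-ins for a policy and an environment reward (assumptions).
def toy_propose(hist: List[str], k: int) -> List[str]:
    return ["a", "b"][:k]


def toy_reward(hist: List[str]) -> float:
    return float(hist.count("a"))


root = TurnNode(action="<root>")
expand(root, toy_propose, depth=2, branching=2)
backup(root, toy_reward)
assign_turn_advantages(root)

for first in root.children:
    print(first.action, first.value, first.advantage)
```

In this toy setup, sibling subtrees are independent of one another, so they could be expanded in parallel, and the depth cap bounds the number of rollouts at `branching ** depth`. A policy‑gradient update could then weight each turn's log‑probability by its `advantage`, though how AT²PO actually defines its turn‑level objectives is not specified in this summary.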

Abstract

AT²PO is a unified framework for multi-turn agentic reinforcement learning that improves exploration diversity, credit assignment, and policy optimization through tree search and turn-level learning objectives.

Full Analysis

**Source:** [HuggingFace](https://huggingface.co/papers/2601.04767) | [arXiv](https://arxiv.org/abs/2601.04767)

*Topics: reinforcement-learning* *Difficulty: advanced* *Upvotes: 15*