Tree‑Search Guided Multi‑Turn Policy Optimization

Authors: Zefang Zong,

Organization: Hugging Face

Published: 2026-01-09 • Added: 2026-01-09

Key Insights

Turn‑level tree search adds diverse, forward‑looking trajectories, which improves exploration in multi‑turn environments.
By formulating a separate learning objective for each turn, AT²PO assigns credit more clearly across long horizons.
The framework combines model‑free policy gradients with lookahead search in a single agentic RL algorithm.
Empirical results show consistent gains over prior multi‑turn baselines on complex tasks such as strategic games and conversational agents.
Shallow, parallelizable trees keep computational overhead low, making the method practical for real‑world deployments.
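
To make the first two insights concrete, here is a minimal sketch of a shallow turn-level rollout tree with value backup. It is an illustration under stated assumptions, not the paper's algorithm: the `TurnNode` class, the helper names, the two-action toy environment, and the choice of "mean of descendant leaf rewards" as a node value are all hypothetical. The per-turn advantage is taken as a child's value minus its parent's value, which is one simple way to attribute credit to individual turns.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False avoids recursive comparisons through parent/children links
class TurnNode:
    """One agent turn in a rollout tree (hypothetical structure)."""
    action: str
    parent: Optional["TurnNode"] = None
    children: List["TurnNode"] = field(default_factory=list)
    value: float = 0.0  # mean terminal reward over the leaves below this node


def expand(node, depth, branching, sample_action, rng):
    """Grow a shallow tree: `branching` sampled turns per node, `depth` turns deep."""
    if depth == 0:
        return
    for _ in range(branching):
        child = TurnNode(action=sample_action(rng), parent=node)
        node.children.append(child)
        expand(child, depth - 1, branching, sample_action, rng)


def path_actions(node):
    """Actions from the root (exclusive) down to `node` (inclusive)."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return actions[::-1]


def backup(node, reward_fn):
    """Score leaves with `reward_fn`, then average child values upward."""
    if not node.children:
        node.value = reward_fn(path_actions(node))
    else:
        child_values = [backup(c, reward_fn) for c in node.children]
        node.value = sum(child_values) / len(child_values)
    return node.value


def turn_advantages(node):
    """Per-turn advantage: how much a turn changes the expected outcome."""
    out = []
    for child in node.children:
        out.append((child, child.value - node.value))
        out.extend(turn_advantages(child))
    return out


# Toy setting (assumption): two possible actions, reward = fraction of "a" turns.
rng = random.Random(0)
root = TurnNode(action="<root>")
expand(root, depth=3, branching=2, sample_action=lambda r: r.choice("ab"), rng=rng)
backup(root, reward_fn=lambda acts: acts.count("a") / len(acts))
advantages = turn_advantages(root)
```

Because each parent's value is the mean of its children's values, sibling advantages sum to zero, so every turn is scored relative to the alternatives explored from the same state. The subtrees below different children are independent of one another, which is why shallow trees of this kind can be rolled out in parallel.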

Abstract

AT²PO is a unified framework for multi-turn agentic reinforcement learning that improves exploration diversity, credit assignment, and policy optimization through tree search and turn-level learning objectives.
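
The abstract mentions turn-level learning objectives without spelling them out here. As a hedged illustration of what such an objective could look like, the sketch below applies a PPO-style clipped surrogate once per turn rather than once per token or once per trajectory. The function name, the input layout, the length-normalised turn ratio, and the `clip_eps` default are assumptions made for this example and are not taken from the paper.

```python
import math


def turn_level_clipped_loss(turns, clip_eps=0.2):
    """PPO-style clipped surrogate computed per turn (illustrative only).

    `turns` is a list of dicts, each with:
      new_logps: per-token log-probs of the turn under the current policy
      old_logps: per-token log-probs under the policy that sampled the turn
      advantage: a single scalar advantage for the whole turn
    Token lists are assumed to be non-empty and of equal length.
    """
    total = 0.0
    for turn in turns:
        diffs = [n - o for n, o in zip(turn["new_logps"], turn["old_logps"])]
        # Length-normalised log-ratio, i.e. the geometric mean of token ratios.
        ratio = math.exp(sum(diffs) / len(diffs))
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        adv = turn["advantage"]
        # Pessimistic (lower) bound between clipped and unclipped surrogates.
        total += min(ratio * adv, clipped * adv)
    # Negate because optimizers minimise, while the surrogate is maximised.
    return -total / len(turns)
```

Using one ratio and one advantage per turn keeps the unit of credit assignment aligned with the unit of decision making in a multi-turn episode. In practice this would be written against a tensor library with gradients; plain floats are used here only to keep the arithmetic visible.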

Full Analysis

# Tree‑Search Guided Multi‑Turn Policy Optimization

**Authors:** Zefang Zong,

**Source:** [HuggingFace](https://huggingface.co/papers/2601.04767) | [arXiv](https://arxiv.org/abs/2601.04767)

**Published:** 2026-01-09

**Organization:** Hugging Face

## Summary

- Turn‑level tree search adds diverse, forward‑looking trajectories, which improves exploration in multi‑turn environments.
- By formulating a separate learning objective for each turn, AT²PO assigns credit more clearly across long horizons.
- The framework combines model‑free policy gradients with lookahead search in a single agentic RL algorithm.
- Empirical results show consistent gains over prior multi‑turn baselines on complex tasks such as strategic games and conversational agents.
- Shallow, parallelizable trees keep computational overhead low, making the method practical for real‑world deployments.

## Abstract

AT²PO is a unified framework for multi-turn agentic reinforcement learning that improves exploration diversity, credit assignment, and policy optimization through tree search and turn-level learning objectives.

---

*Topics: reinforcement-learning*
*Difficulty: advanced*
*Upvotes: 15*