arxiv:2508.04349

GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy

Published on Aug 6, 2025

Abstract

AI-generated summary

Dynamic Entropy Weighting with Group Token Policy Optimization (GTPO) and Sequence-Level GRPO (GRPO-S) improves reinforcement learning for Large Language Models by enabling fine-grained reward assignment based on policy entropy.

Reinforcement learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning. Conventional algorithms, however, typically follow a coarse-grained credit assignment paradigm, applying a uniform reward to every token in a sequence, which is a critical flaw in long-chain reasoning tasks. In this paper, we address this challenge by proposing Dynamic Entropy Weighting, a novel mechanism that enables fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token, and its sequence-level analogue, Sequence-Level GRPO (GRPO-S). Our approach rests on the hypothesis that high policy entropy within a reasoning path is a strong heuristic for cognitive effort at pivotal junctures, and that this signal can be repurposed for learning. By repurposing policy entropy for reward shaping, we achieve true per-token credit assignment. Experimental results on challenging reasoning benchmarks validate the superiority of our approach: our methods significantly outperform a strong DAPO baseline, and the entropy-weighting mechanism is confirmed as the key driver of this performance boost.
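The abstract describes entropy-weighted per-token rewards only at a high level; the paper's exact GTPO objective is not reproduced here. The following is a minimal PyTorch sketch of one plausible reading of the idea: per-token policy entropy is computed from the model's logits and used to redistribute a shared sequence-level reward across tokens. The function name, normalization scheme, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def entropy_weighted_token_rewards(logits, sequence_reward, eps=1e-8):
    """Hypothetical sketch of entropy-based reward shaping (not the
    paper's exact GTPO formulation).

    logits: (seq_len, vocab_size) policy logits for one sampled sequence
    sequence_reward: scalar reward assigned to the whole sequence
    Returns: (seq_len,) per-token rewards that sum to sequence_reward.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Per-token policy entropy H_t = -sum_v p(v) * log p(v)
    entropy = -(probs * log_probs).sum(dim=-1)       # (seq_len,)
    # Normalize entropies into weights summing to 1 over the sequence,
    # so high-entropy ("high cognitive effort") tokens get more credit.
    weights = entropy / (entropy.sum() + eps)
    return sequence_reward * weights

# Toy usage: 5 tokens, vocabulary of 10, sequence reward of 1.0
logits = torch.randn(5, 10)
token_rewards = entropy_weighted_token_rewards(logits, 1.0)
print(token_rewards, token_rewards.sum())  # per-token rewards sum to ~1.0
```

In this toy reading, tokens where the policy is most uncertain (highest entropy) receive the largest share of the sequence reward, matching the abstract's intuition that high-entropy steps mark pivotal junctures in a reasoning path.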
