GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy
Abstract
Dynamic Entropy Weighting with Group Token Policy Optimization and Sequence-Level GRPO improves reinforcement learning for Large Language Models by enabling fine-grained reward assignment based on policy entropy.
Reinforcement learning (RL) is a pivotal technique for enhancing Large Language Model (LLM) reasoning. Conventional algorithms, however, typically adhere to a coarse-grained credit-assignment paradigm, applying a uniform reward to every token in a sequence, a critical limitation in long-chain reasoning tasks. In this paper, we address this challenge and propose Dynamic Entropy Weighting, a mechanism that enables fine-grained rewards through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token, and its sequence-level counterpart, Sequence-Level GRPO (GRPO-S), which applies the same entropy weighting at the sequence level. Our approach rests on the hypothesis that high policy entropy within a reasoning path is a strong heuristic signal of cognitive effort at pivotal junctures, and that this signal can be repurposed for learning. By using policy entropy for reward shaping, we achieve true per-token credit assignment. Experiments on challenging reasoning benchmarks validate our approach: both methods significantly outperform a strong DAPO baseline, and the entropy-weighting mechanism is confirmed as the key driver of the performance gains.
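The abstract does not spell out the exact weighting formula, but the core idea of entropy-weighted per-token reward shaping can be illustrated with a minimal sketch. The function below is a hypothetical construction (the name `entropy_weighted_token_rewards`, the coefficient `alpha`, and the mean-normalization scheme are all assumptions, not the paper's definition): it computes each token's policy entropy and redistributes a sequence-level outcome reward so that high-entropy tokens receive a larger share.

```python
import torch

def entropy_weighted_token_rewards(logits, token_mask, seq_reward, alpha=1.0):
    """Sketch of entropy-based per-token reward shaping (illustrative only).

    logits:      (T, V) policy logits at each generated position
    token_mask:  (T,) 1.0 for generated tokens, 0.0 for padding
    seq_reward:  scalar outcome reward for the whole response
    alpha:       assumed hyperparameter controlling weighting strength
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Per-token policy entropy: H_t = -sum_v p(v|s_t) log p(v|s_t)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)          # (T,)

    # Normalize entropy over valid tokens so the weights average to ~1,
    # keeping the total reward mass comparable to the unshaped case.
    valid = token_mask.bool()
    mean_h = entropy[valid].mean().clamp_min(1e-8)
    weights = 1.0 + alpha * (entropy / mean_h - 1.0)              # (T,)

    # Redistribute the sequence reward token-by-token by entropy weight:
    # high-entropy ("pivotal") tokens get a larger share of the credit.
    return seq_reward * weights * token_mask
```

A sequence-level variant in the spirit of GRPO-S would instead aggregate the per-token entropies into a single weight (e.g., their mean) and scale each response's reward by it before the usual group-relative advantage computation; that aggregation choice is likewise an assumption here.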